stop words missing for en_core_web_md and en_core_web_lg for spaCy v2.0 #1574

Closed
Gauravtolani opened this issue Nov 14, 2017 · 13 comments
Labels

  • bug: Bugs and behaviour differing from documentation
  • lang / en: English language data and models
  • models: Issues related to the statistical models

Comments

@Gauravtolani

The en_core_web_md and en_core_web_lg models return False for the is_stop attribute on every word in a sentence.

PS: en_core_web_sm works fine.

System Information :

  • Python version: 2.7.12
  • Platform: Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
  • spaCy version: 2.0.2
  • Models: en
ines added the lang / en and models labels on Nov 14, 2017
@ines
Member

ines commented Nov 21, 2017

Thanks for the report and sorry about that – this should be fixed in the next update to the models.

In the meantime, here's a workaround:

import spacy

nlp = spacy.load('en_core_web_lg')

# Flag every default English stop word on its vocabulary entry.
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

ines added the bug label on Nov 21, 2017
@kbulygin
Contributor

kbulygin commented Dec 6, 2017

With en_core_web_sm (spaCy 2.0.4), is_stop depends on casing:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> [nlp(s)[0].is_stop for s in 'this This THIS tHIS the The THE tHE'.split()]
[True, False, False, False, True, False, False, False]
# Expected [True, True, False/True, False/True, True, True, False/True, False/True].

Info about spaCy

  • spaCy version: 2.0.4
  • Platform: Linux-3.16.0-4-amd64-x86_64-with-debian-8.8
  • Python version: 3.6.0
  • Models: en_core_web_lg, en_vectors_web_lg, en_core_web_sm

en_core_web_sm for spaCy 2.0.0a10 correctly returned t.is_stop == True for both this and This.

@brickpattern

Bump. I'm seeing the same behaviour with en_core_web_sm.
Is this the expected output for is_stop, or should we be using a different approach?

@honnibal
Member

I know this has been an issue for a long time. The delay comes down to some infrastructure problems, which have made retraining all the models a hassle.

The following fix to the spacy train CLI command should make sure the issue doesn't reoccur: 262d0a3

The models for v2.1.0 are currently training, so fingers crossed updated models should be deployed soon.

@adrianog

adrianog commented May 8, 2018

Hi

Is this still the way to go:

nlp = spacy.load('en_core_web_lg')

for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

@randomsven

This still leaves the is_stop property case-sensitive (i.e. "What" vs "what"). It sounds like the fix needs to be applied upstream. As it is, this is simple enough to handle outside of the token attribute system.
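Handling this outside the token attribute system could look like the sketch below. It is a minimal illustration, not spaCy's own fix: is_stop_word is a hypothetical helper, and the small STOP_WORDS set is a stand-in for nlp.Defaults.stop_words so that the example runs without a model installed.

```python
# Stand-in for nlp.Defaults.stop_words (assumption: a set of lowercase strings).
STOP_WORDS = {"this", "the", "what", "is"}


def is_stop_word(text, stop_words=STOP_WORDS):
    """Return True if `text` is a stop word, ignoring case."""
    return text.lower() in stop_words


words = "What is THIS tHIS model".split()
flags = [is_stop_word(w) for w in words]
print(flags)
```

With a spaCy Doc, the same check would be is_stop_word(token.text) for each token, which sidesteps the is_stop flag entirely.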

@rajhans

rajhans commented May 23, 2018

Any updates or ETA on this?

@nathanwdavis

nathanwdavis commented May 24, 2018

Slightly better stopwords workaround (but still not a good solution):

for word in nlp.Defaults.stop_words:
    for w in (word, word[0].upper() + word[1:], word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True

This sets is_stop on the lowercase, all-uppercase, and first-character-uppercase forms of each stop word.
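To make it explicit which spellings this workaround covers, here is a small sketch that isolates the inner tuple as a function. case_variants is a hypothetical name introduced for illustration, and the sketch assumes the stop word is given in lowercase, as the words in the thread's examples are.

```python
def case_variants(word):
    """Return the lowercase, first-char-uppercase and all-uppercase forms."""
    return {word, word[0].upper() + word[1:], word.upper()}


variants = case_variants("this")
print(sorted(variants))

# Mixed-case spellings are not generated, so they would stay unflagged.
print("tHIS" in variants)
```

Spellings such as "tHIS" fall outside these three forms, which is one reason the workaround is described as still not a good solution.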

@KingNoosh

Any progress?

I'm using the above snippet for now.

@adrianog

adrianog commented May 28, 2018

Why does a stop word need to be in the vocabulary? (general question)

@ines
Member

ines commented May 28, 2018

We're currently training new models for the upcoming nightly release of the develop branch (spaCy v2.1.0). You can follow the spacy-models repo for updates and progress, but it's all currently pre-alpha. Sorry this is taking so long. It really did come down to getting the infrastructure right so that we can train our current model family reliably (and add more languages in the future).

@adrianog The is_stop attribute is an attribute on the lexeme, i.e. the context-independent entry in the vocabulary.
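As a rough illustration of what "an attribute on the lexeme" implies, here is a toy model in plain Python. ToyLexeme and ToyVocab are hypothetical classes, not spaCy's implementation; the sketch only assumes that each exact string maps to one shared entry, which matches the case-sensitive behaviour reported earlier in this thread.

```python
class ToyLexeme:
    """Toy stand-in for a vocabulary entry; not spaCy's Lexeme class."""

    def __init__(self, text):
        self.text = text
        self.is_stop = False


class ToyVocab:
    """Maps each exact string to a single shared ToyLexeme."""

    def __init__(self):
        self._entries = {}

    def __getitem__(self, text):
        # Create the entry on first lookup, then always return the same object.
        if text not in self._entries:
            self._entries[text] = ToyLexeme(text)
        return self._entries[text]


vocab = ToyVocab()
vocab["what"].is_stop = True  # set the flag once, on the entry

# Every occurrence of "what" resolves to the same entry, whatever the sentence.
sentence_a = [vocab[w] for w in "what is this".split()]
sentence_b = [vocab[w] for w in "so what".split()]
print(sentence_a[0].is_stop, sentence_b[1].is_stop)

# "What" is a different string, hence a different entry with its own flag.
print(vocab["What"].is_stop)
```

In this toy model the flag follows the string rather than the sentence, so a stop word has to have an entry with the flag set before any token can report it.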

@adrianog

@ines I see. So "vocabulary" here means the spaCy vocabulary, i.e. lexeme.is_oov could still return False?

@lock

lock bot commented Jun 28, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked this conversation as resolved and limited it to collaborators on Jun 28, 2018
10 participants