stop words missing for en_core_web_md and en_core_web_lg for spaCy v2.0 #1574
Comments
Thanks for the report and sorry about that. This should be fixed in the next update to the models. In the meantime, here's a workaround:
import spacy
nlp = spacy.load('en_core_web_lg')
# Re-apply the stop-word flag to every lexeme listed in the language defaults.
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
|
With:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> [nlp(s)[0].is_stop for s in 'this This THIS tHIS the The THE tHE'.split()]
[True, False, False, False, True, False, False, False]
# Expected [True, True, False/True, False/True, True, True, False/True, False/True].
Info about spaCy
|
Bump. I'm facing the same using en_core_web_sm. |
I know this has been an issue for a long time. The delay comes back to some infrastructure problems, which have made getting all the models retrained a hassle.
The following fix to the …
The models for v2.1.0 are currently training, so fingers crossed updated models should be deployed soon. |
Hi, is this still the way to go:
|
This still leaves the is_stop property sensitive to case (i.e. "What" vs "what"). It sounds like the fix needs to be applied upstream; as it is, this is simple enough to handle outside of the token attribute system. |
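Handling case outside the token attribute system could look something like the minimal sketch below. It assumes a plain set of lowercase stop words (in spaCy v2, nlp.Defaults.stop_words is one such set); the helper name is_stop_word is hypothetical and not part of spaCy.

```python
def is_stop_word(text, stop_words):
    """Case-insensitive stop-word check that bypasses the is_stop attribute."""
    return text.lower() in stop_words


# Small stand-in set for illustration only; with spaCy loaded, pass
# nlp.Defaults.stop_words instead.
stop_words = {"this", "the", "what"}
flags = [is_stop_word(w, stop_words) for w in "What what tHE spaCy".split()]
```

With a parsed Doc, the same check could be applied per token, e.g. [is_stop_word(t.text, nlp.Defaults.stop_words) for t in doc]. This leaves the model's lexeme flags untouched, so it does not change what token.is_stop returns.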
Any updates or ETA on this? |
Slightly better stopwords workaround (but still not a good solution):
This sets is_stop on the lowercase, uppercase, and first-char uppercase form for each. |
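The snippet from this comment did not survive in the thread text, so the following is only a sketch of what the description suggests, not the original code. It assumes a vocab that can be indexed by string and returns an object with a settable is_stop attribute, as nlp.vocab does in the workaround earlier in the thread; the helper names case_variants and mark_stop_words are hypothetical.

```python
def case_variants(word):
    """Return the lowercase, uppercase, and first-char-uppercase forms of a word."""
    lower = word.lower()
    return {lower, lower.upper(), lower[:1].upper() + lower[1:]}


def mark_stop_words(vocab, stop_words):
    """Set is_stop on each case variant of each stop word.

    `vocab` is assumed to behave like spaCy's nlp.vocab: indexing it with a
    string returns a lexeme-like object whose is_stop attribute can be set.
    """
    for word in stop_words:
        for form in case_variants(word):
            vocab[form].is_stop = True


# Assumed usage with spaCy v2 (not executed here):
#   import spacy
#   nlp = spacy.load('en_core_web_lg')
#   mark_stop_words(nlp.vocab, nlp.Defaults.stop_words)
```

Mixed-case forms such as "tHE" are still not covered, which is one reason this remains a workaround and not a fix.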
Any progress? I'm using the above snippet atm. |
Why does a stop word need to be in the vocabulary? (general question) |
We're currently training new models for the upcoming nightly release of the …
@adrianog The … |
@ines I see. So vocabulary here is meant as the spaCy vocabulary, i.e. lexeme.is_oov could still return False? |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
The en_core_web_md and en_core_web_lg models return False for the is_stop attribute on every word in a sentence.
PS: en_core_web_sm is working fine.
System Information :