stop words missing for en_core_web_md and en_core_web_lg for spaCy v2.0 #1574

Gauravtolani · 2017-11-14T10:18:18Z

The en_core_web_md and en_core_web_lg models return False for every word in a sentence when checked with the is_stop attribute.

PS: en_core_web_sm works fine.

System Information :

Python version: 2.7.12
Platform: Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
spaCy version: 2.0.2
Models: en

ines · 2017-11-21T22:40:18Z

Thanks for the report and sorry about that – this should be fixed in the next update to the models.

In the meantime, here's a workaround:

import spacy

nlp = spacy.load('en_core_web_lg')

# Flag the vocabulary entry of every default stop word as a stop word.
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

kbulygin · 2017-12-06T15:02:27Z

With en_core_web_sm (spaCy 2.0.4), is_stop depends on casing:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> [nlp(s)[0].is_stop for s in 'this This THIS tHIS the The THE tHE'.split()]
[True, False, False, False, True, False, False, False]
# Expected [True, True, False/True, False/True, True, True, False/True, False/True].

Info about spaCy

spaCy version: 2.0.4
Platform: Linux-3.16.0-4-amd64-x86_64-with-debian-8.8
Python version: 3.6.0
Models: en_core_web_lg, en_vectors_web_lg, en_core_web_sm

en_core_web_sm for spaCy 2.0.0a10 correctly returned t.is_stop == True for both this and This.

brickpattern · 2018-02-13T19:31:42Z

Bump. I'm facing the same issue using en_core_web_sm.
Is this the expected output for is_stop? Or should we be using a different approach?

honnibal · 2018-02-17T21:16:44Z

I know this has been an issue for a long time. The delay comes down to some infrastructure problems, which have made getting all the models retrained a hassle.

The following fix to the spacy train CLI command should make sure the issue doesn't reoccur: 262d0a3

The models for v2.1.0 are currently training, so fingers crossed updated models should be deployed soon.

adrianog · 2018-05-08T11:45:57Z

Hi

Is this still the way to go:

nlp = spacy.load('en_core_web_lg')

for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

randomsven · 2018-05-18T03:22:03Z

This still leaves the is_stop property sensitive to case (i.e. "What" vs. "what"). It sounds like the fix needs to be applied upstream. As it is, this is simple enough to handle outside of the token attribute system.
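
A minimal sketch of handling this outside the token attribute system, assuming it is acceptable to compare lowercased text against the stop list. The helper name is_stop_word and the small STOP_WORDS set are hypothetical and only for illustration; in practice the set would be nlp.Defaults.stop_words, as used in the workaround earlier in this thread.

```python
# Hypothetical stand-in for nlp.Defaults.stop_words (a set of lowercase strings).
STOP_WORDS = {"this", "the", "what"}


def is_stop_word(text, stop_words=STOP_WORDS):
    """Return True if `text` is a stop word, ignoring case."""
    return text.lower() in stop_words


words = "this This THIS tHIS the The THE tHE".split()
# Every casing is lowercased before the lookup, so all eight forms match.
print([is_stop_word(w) for w in words])
```

With a spaCy doc, the same check could be applied per token as is_stop_word(token.text, nlp.Defaults.stop_words), which leaves the model's vocabulary untouched.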

rajhans · 2018-05-23T00:04:44Z

Any updates or ETA on this?

nathanwdavis · 2018-05-24T14:19:40Z

Slightly better stopwords workaround (but still not a good solution):

for word in nlp.Defaults.stop_words:
    for w in (word, word[0].upper() + word[1:], word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True

This sets is_stop on the lowercase, all-uppercase, and first-character-uppercase forms of each stop word.
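
To make the coverage of that loop concrete, here is a small sketch in plain Python that only builds the three forms, without touching a model. The function name case_variants and the two sample words are hypothetical; the expression inside is the one from the snippet above.

```python
def case_variants(word):
    """Return the word as given, with its first character uppercased, and fully uppercased."""
    return (word, word[0].upper() + word[1:], word.upper())


# Hypothetical sample; the real loop iterates over nlp.Defaults.stop_words.
for word in ("this", "the"):
    print(case_variants(word))

# Mixed-case forms such as "tHIS" are not among the variants,
# so their vocabulary entries keep whatever is_stop value they already had.
print("tHIS" in case_variants("this"))
```

This is why the workaround is described as an improvement rather than a full fix: it covers the three common casings but not arbitrary mixed case.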

KingNoosh · 2018-05-24T18:45:24Z

Any progress?

I'm using the above snippet atm.

adrianog · 2018-05-28T14:07:53Z

Why does a stop word need to be in the vocabulary? (general question)

ines · 2018-05-28T14:36:11Z

We're currently training new models for the upcoming nightly release of the develop branch (spaCy v2.1.0). You can watch the spacy-models repo for updates and progress, but it's all currently pre-alpha. Sorry this is taking so long – it really did come down to getting the infrastructure right to be able to train our current model family reliably (and be able to add more languages in the future).

@adrianog The is_stop attribute is an attribute on the lexeme, i.e. the context-independent entry in the vocabulary.
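
To illustrate what "context-independent" means here, the following is a toy model in plain Python, not spaCy's actual implementation. ToyVocab and ToyLexeme are hypothetical names. The sketch assumes entries are keyed by the exact string, which would be consistent with the case sensitivity reported earlier in this thread.

```python
class ToyLexeme:
    """Toy stand-in for a vocabulary entry; not spaCy's Lexeme class."""

    def __init__(self, text):
        self.text = text
        self.is_stop = False  # the flag lives on the entry, not on a token


class ToyVocab:
    """Toy vocabulary keyed by the exact string, so case matters."""

    def __init__(self):
        self._entries = {}

    def __getitem__(self, text):
        # Create the entry on first lookup, then always return the same object.
        if text not in self._entries:
            self._entries[text] = ToyLexeme(text)
        return self._entries[text]


vocab = ToyVocab()
vocab["this"].is_stop = True

# The flag is shared by every occurrence of "this", whatever the sentence...
print(vocab["this"].is_stop)
# ...but "This" is a separate entry and is unaffected.
print(vocab["This"].is_stop)
```

In this model a stop word has to be looked up in the vocabulary because that is where the flag is stored; a token would only point at its entry.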

adrianog · 2018-05-29T00:58:34Z

@ines I see. So "vocabulary" here means the spaCy vocabulary, i.e. lexeme.is_oov could still return False?

lock · 2018-06-28T01:42:02Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the lang / en and models labels Nov 14, 2017

ines mentioned this issue Nov 21, 2017

stopwords missing from en_core_web_lg #1625

Closed

ines added the bug label Nov 21, 2017

This was referenced Jan 16, 2018

prob = -20.0 for all lemmas in en_core_web_sm #1590

Closed

is_oov returns True for common values of DEP, TAG, etc. #1822

Closed

Gauravtolani closed this as completed May 28, 2018

lock bot locked as resolved and limited conversation to collaborators Jun 28, 2018
