Best practices for text pre-processing using Spacy #7228

thatGreekGuy96 · 2021-02-28T18:54:26Z

Hey everyone!
I'm new to NLP and I've been playing around with spaCy for sentiment analysis.

Suppose I have a sentence that I want to classify as positive or negative. I want to remove stop words from that sentence before feeding it to the classifier. I know that spaCy's en_core_web_sm model will create tokens with the is_stop attribute, which is super helpful.

My question is, if I chuck a textcat component at the end of the en_core_web_sm model (using add_pipe), will stop words automatically be filtered out before being fed to the classifier? Or do I need to use the is_stop attribute to remove them myself, and then feed the remaining tokens to another pipeline?

I apologise if this is a silly question but I couldn't find an obvious answer! Many thanks in advance for any help!

Andy

polm · 2021-03-01T07:37:08Z

Howdy, and welcome to the forums!

> My question is, if I chuck a textcat component at the end of the en_core_web_sm model (using add_pipe), will stop words automatically be filtered out before being fed to the classifier? Or do I need to use the is_stop attribute to remove them myself, and then feed them to another pipeline?

Not a silly question; it's always good to pay attention to the details. In this case, textcat doesn't have any special treatment for stop words, and neither do any of the standard spaCy pipeline components. In general, modern NLP methods work without special treatment for stop words, and in some cases removing them could make things significantly worse: for example, it would break dependency parsing, and could confuse Transformer models, which expect complete sentences.

For sentiment analysis in particular you don't want to remove stop words: "not" is a stop word in English, but it is obviously very important in determining whether a sentence is positive or negative (consider "I like cheese" vs "I do not like cheese").
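To make the cheese example concrete, here is a minimal sketch of what stop-word removal does to those two sentences. It assumes a blank English pipeline (`spacy.blank("en")`), which provides the tokenizer and the `is_stop` flag without downloading a trained model; `without_stop_words` is a hypothetical helper name used only for illustration.

```python
import spacy

# Blank English pipeline: tokenizer plus lexical attributes such as is_stop.
# No trained model is needed for this check.
nlp = spacy.blank("en")


def without_stop_words(text):
    """Return the token texts of `text` with stop words dropped (hypothetical helper)."""
    return [token.text for token in nlp(text) if not token.is_stop]


positive = without_stop_words("I like cheese")
negative = without_stop_words("I do not like cheese")

# "not" is flagged as a stop word, so the negation disappears after filtering.
print(nlp.vocab["not"].is_stop)
print(positive)
print(negative)
```

With the negation filtered out, both sentences reduce to the same list of tokens, so a classifier that only sees the filtered text has nothing left to tell them apart.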

2 replies

thatGreekGuy96 Mar 1, 2021
Author

Hey, thanks a lot! I did read in places that removing stop words can mess up the results of sentiment analysis, so thanks for reminding me of that!
For completeness though, say I did want to filter out tokens based on some attribute (stop word or other). What is the neatest way of doing this in spaCy? It sounds like I would need to use one pipeline to tag tokens with said attribute, and then filter them using something like a list comprehension?

Again, many thanks!

polm Mar 2, 2021

Ines gave a good answer to the question of how to filter tokens on Stack Overflow a few years ago. Basically, you can add a Token extension like this to set attributes on tokens:

from spacy.tokens import Token

def get_is_excluded(token):
    # Getter function to determine the value of token._.is_excluded
    return token.text in ['some', 'excluded', 'words']

Token.set_extension('is_excluded', getter=get_is_excluded)

And then you can use a list comprehension or whatever to filter that.
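For example, the filtering step could look roughly like this. This is a sketch rather than the exact code from the Stack Overflow answer: the blank English pipeline, the sample sentence, and the `excluded_words` and `kept` names are assumptions made for illustration, and `force=True` is only there so the snippet can be re-run without an "extension already exists" error.

```python
import spacy
from spacy.tokens import Token

# Illustrative word list; a set gives fast membership checks.
excluded_words = {"some", "excluded", "words"}


def get_is_excluded(token):
    # Getter function to determine the value of token._.is_excluded
    return token.text in excluded_words


# force=True overwrites the extension if it was already registered.
Token.set_extension("is_excluded", getter=get_is_excluded, force=True)

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
doc = nlp("Here are some excluded words and others")

# Keep only the tokens that the custom attribute does not flag.
kept = [token.text for token in doc if not token._.is_excluded]
print(kept)
```

Note that the getter compares `token.text` exactly, so matching is case-sensitive; comparing `token.lower_` instead would be one way to ignore case. The same pattern works with built-in attributes, e.g. `[t for t in doc if not t.is_stop]`.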
