# Removing stop words with spaCy library in Python
#### Introduction
When working with text data in NLP, we usually have to preprocess our data before carrying out the main task.

One common preprocessing step is removing stop words, which is what I will show you in this exercise.

#### What are Stop Words
English has many words, such as “I”, “the” and “you”, that appear very frequently in text but add little useful information for NLP operations and modeling. These words are called stopwords, and removing them is a commonly recommended text preprocessing step.

Removing stopwords reduces the size of the text corpus, which can speed up processing and help the model focus on the more informative words. Sometimes, however, removing stopwords has an adverse effect because it changes the meaning of the sentence. Consider “This is not a good way to talk”, which is a negative sentence. After stopword removal it becomes “good way talk”, which reads as positive.

While there is no universal list of stop words in NLP, many NLP libraries in Python provide their own. We can also create our own list of stop words.
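
To make the negation problem concrete, here is a minimal sketch in plain Python. The stop list is a tiny hand-written set chosen only for illustration (it is not spaCy's list), and `filter_words` is a hypothetical helper name.

```python
# Tiny illustrative stop list (an assumption for this sketch, not spaCy's list)
custom_stopwords = {"this", "is", "not", "a", "to"}

def filter_words(text, stopwords):
    """Return text without the whitespace-separated words found in stopwords."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

sentence = "This is not a good way to talk"

# Removing every stop word drops the negation
print(filter_words(sentence, custom_stopwords))            # good way talk

# Taking "not" out of the stop list preserves the negative meaning
print(filter_words(sentence, custom_stopwords - {"not"}))  # not good way talk
```

For tasks such as sentiment analysis, keeping negation words out of the stop list is usually the safer choice.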

#### spaCy Stop Words
Here we will be using the list of stop words provided by the spaCy library, so we don’t have to write our own.

However, before we can use the stopwords from spaCy, the English model en_core_web_sm has to be downloaded, for example by running python -m spacy download en_core_web_sm on the command line.

In [31]:
import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

print(len(stopwords))
print(stopwords)

328
{'within', 'either', 'call', 'nor', 'above', 'latterly', 'hence', 'becomes', 'almost', 'before', 'again', 'put', 'get', 'seems', 'anywhere', 'being', 'none', 'became', 'will', 'whoever', 'one', 'everyone', 'was', 'next', 'whatever', 'into', 'that', 'amongst', 'not', 'whom', 'ca', 'wherein', 'forty', 'and', 'but', 'below', 'see', 'hereafter', 'least', 'become', 'else', 'n’t', 'many', 'why', 'make', 'Farden', 'may', '’re', '’m', 'whereas', 'somewhere', 're', "'m", 'in', "'re", 'or', 'sixty', 'n‘t', "'ve", 'where', 'toward', 'last', 'nowhere', '‘m', 'however', 'whenever', 'mine', 'anyhow', 'further', 'himself', 'these', 'ten', 'up', 'herein', 'whereby', 'he', 'hereby', 'back', 'her', 'no', 'other', 'while', 'much', 'all', 'otherwise', '‘ll', 'about', 'sometimes', '‘ve', 'beyond', 'third', 'name', 'do', 'such', 'until', 'hereupon', '‘s', 'am', 'done', 'rather', 'others', 'same', 'i', 'regarding', 'our', 'your', 'amount', 'they', 'everything', 'have', 'yet', 'on', 'perhaps', 'always', '

#### Checking if a Word is a Stopword
We can check whether a word is a stopword by reading the is_stop attribute of a spaCy token.

In [32]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("The wedding started arrived too late, We could not use the allocated venue")

for token in doc:
    print(token.text,token.is_stop)

The True
wedding False
started False
arrived False
too True
late False
, False
We True
could True
not True
use False
the True
allocated False
venue False


#### Remove Stopwords

1. We import the spaCy library.
2. We load spaCy's English language model.
3. We store the set of stopwords in a variable.
4. We create an empty list to hold the words that are not stopwords.

Using a for loop that iterates over the text (split on whitespace), we check whether each lowercased word is present in the stopword list. If it is not, we append the word to the list.

Finally, we combine the remaining words with the “join()” function, which gives us the final output: the original string with all stopwords removed.

In [33]:
import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

text = " Let me show how to remove stopwords using spacy library"

lst = []
for token in text.split():
    # keep the word only if it is not present in the stopword list
    if token.lower() not in stopwords:
        lst.append(token)

# Join items in the list
print("Original text  : ",text)
print("Text after removing stopwords  :   ",' '.join(lst))

Original text  :   Let me show how to remove stopwords using spacy library
Text after removing stopwords  :    Let remove stopwords spacy library
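
One limitation of splitting on whitespace is that punctuation stays attached to words, so a token like “to,” would not match the stopword “to”. The sketch below shows one possible way around this, by stripping leading and trailing punctuation before the lookup. The name `remove_stopwords` and the small `demo_stopwords` set are assumptions made for illustration. In the notebook you would pass `en.Defaults.stop_words` instead.

```python
import string

def remove_stopwords(text, stopwords):
    """Drop words whose lowercase, punctuation-stripped form is in stopwords."""
    kept = []
    for word in text.split():
        # "to," and "To" should both match the entry "to"
        core = word.strip(string.punctuation).lower()
        if core not in stopwords:
            kept.append(word)
    return " ".join(kept)

# Stand-in stop list for the demo (hypothetical, not spaCy's list)
demo_stopwords = {"me", "show", "how", "to", "using"}

print(remove_stopwords("Let me show how to, remove stopwords using spacy.", demo_stopwords))
# Let remove stopwords spacy.
```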


#### Adding Stopwords to Default Spacy List
By default, spaCy has 326 English stopwords, but at times you may want to add your own custom stopwords to the default list. The count of 328 printed earlier most likely reflects custom words that had already been added in the same session.

To add a custom stopword in spaCy, we first load its English language model and then call the add() method on the stopword set.

This code shows how to add a single stopword:

In [34]:
import spacy    

nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.add("my_new_stopword")

In [35]:
#To add several stopwords at once:
import spacy

nlp = spacy.load("en_core_web_sm")
# Entries are lowercase because the filtering code compares token.lower()
nlp.Defaults.stop_words |= {"wycliffe", "mwebi"}
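
Because the filtering loop shown earlier compares `token.lower()` against the set, custom entries only match when they are stored in lowercase. A minimal sketch of a helper that normalizes the case and returns an extended copy, leaving the original set untouched, could look like this. The name `extend_stopwords` is hypothetical, and `base_stopwords` is a stand-in for `nlp.Defaults.stop_words` so the example runs without spaCy.

```python
def extend_stopwords(base, extra):
    """Return a new set: base plus the lowercased extra words (base is not modified)."""
    return set(base) | {word.lower() for word in extra}

# Stand-in for nlp.Defaults.stop_words (an assumption for this sketch)
base_stopwords = {"the", "a", "is"}

custom = extend_stopwords(base_stopwords, ["Wycliffe", "Mwebi"])

print(sorted(custom))        # ['a', 'is', 'mwebi', 'the', 'wycliffe']
print(len(base_stopwords))   # 3, the original set is unchanged
```

Working on a copy also avoids changing the shared default set for the rest of the session.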

#### Remove Stopwords from Default Spacy List
In some scenarios you will want to preserve certain stopwords in your text. In that case, you can remove those words from spaCy's default list with the remove() method, as shown in the examples below.

To remove a single stopword:

In [39]:
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.remove("if")

In [37]:
# To remove several stopwords at once:
import spacy    

nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words -= {"who", "when"}
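
One thing to watch for in a notebook: the stopword collection is a regular Python set, so calling remove() for a word that is no longer there (for example when the cell is run a second time) raises a KeyError. The sketch below illustrates the difference between remove(), discard() and set difference, using a small stand-in set rather than spaCy's real list.

```python
# Stand-in for nlp.Defaults.stop_words (an assumption for this sketch)
stop_words = {"if", "who", "when", "the"}

stop_words.remove("if")        # works the first time

try:
    stop_words.remove("if")    # second call: "if" is already gone
    second_remove_failed = False
except KeyError:
    second_remove_failed = True

stop_words.discard("if")           # no error even though "if" is absent
stop_words -= {"who", "nothere"}   # set difference also ignores missing words

print(second_remove_failed)    # True
print(sorted(stop_words))      # ['the', 'when']
```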

#### Filtering Stopwords from Text File
In the code below I remove the stopwords from an entire text file using spaCy, as explained above. The only difference is that I read the text with the Python file operation “with open()”.

In [24]:
import spacy

en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

with open(r"C:\Users\user\NLP\biden.txt", 'r', encoding= 'UTF-8') as f:
    text=f.read()
    
lst=[]
for token in text.split():
    if token.lower() not in stopwords:
        lst.append(token)

print('Original Text')        
print(text,'\n\n')

print('Text after removing stop words')
print(' '.join(lst))

Original Text
Joseph Robinette Biden Jr. (born November 20, 1942) is an American politician who is the 46th and current president of the United States. A member of the Democratic Party, he previously served as the 47th vice president from 2009 to 2017 under President Barack Obama and represented Delaware in the United States Senate from 1973 to 2009. 


Text after removing stop words
Joseph Robinette Biden Jr. (born November 20, 1942) American politician who 46th current president United States. member Democratic Party, previously served 47th vice president 2009 2017 President Barack Obama represented Delaware United States Senate 1973 2009.
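
For larger files it can be convenient to filter line by line instead of reading everything into memory first. The following is a sketch under stated assumptions: `filter_file` is a hypothetical helper, `demo_stopwords` is a small stand-in stop list, and a temporary file takes the place of the original text file so the example is self-contained.

```python
import os
import tempfile

def filter_file(path, stopwords, encoding="utf-8"):
    """Yield each line of the file with stop words removed."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            kept = [w for w in line.split() if w.lower() not in stopwords]
            yield " ".join(kept)

# Stand-in stop list and a temporary file (assumptions for this sketch)
demo_stopwords = {"is", "an", "the", "of"}
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("He is an American politician\nThe president of the United States\n")

filtered_lines = list(filter_file(tmp.name, demo_stopwords))
os.remove(tmp.name)  # clean up the temporary file

print(filtered_lines)
# ['He American politician', 'president United States']
```

In the notebook, the same function could be called with the real file path and `en.Defaults.stop_words`.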
