# Removing stop words with the NLTK library in Python

## Stop Words
#### What are Stopwords?
In English, many words such as “I”, “the” and “you” appear very frequently in text, but for NLP operations and modeling they add little useful information. These words are called stopwords, and removing them is a common text preprocessing step.

#### Why Remove stop words

Removing stopwords reduces the size of the text corpus, which can improve the performance and robustness of an NLP model. Sometimes, however, removing stopwords has an adverse effect because it changes the meaning of a sentence. Consider “This is not a good way to talk”, which is a negative sentence. After removing stopwords it becomes “good way talk”, which reads as positive.

While there is no universal list of stop words in NLP, many NLP libraries in Python provide their own lists. We can also create our own list of stop words.
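As a rough sketch of building a custom list, the snippet below starts from a small hand-written subset of English stop words (an assumption made so the example runs without any downloads; a real pipeline would start from a library list) and takes the negation words out of it, so that the negative sentence from the example above keeps its meaning. The names `base_stopwords`, `negations`, `custom_stopwords` and `remove_stopwords` are hypothetical.

```python
# Small illustrative subset (assumption); a real pipeline would start from a library list.
base_stopwords = {"this", "is", "not", "a", "to", "no", "nor", "the"}

# Keep negation words in the text by removing them from the stop word set.
negations = {"not", "no", "nor"}
custom_stopwords = base_stopwords - negations

def remove_stopwords(text, stop_words):
    """Return text without the whitespace-separated tokens found in stop_words."""
    return " ".join(t for t in text.split() if t.lower() not in stop_words)

sentence = "This is not a good way to talk"
print(remove_stopwords(sentence, base_stopwords))    # good way talk
print(remove_stopwords(sentence, custom_stopwords))  # not good way talk
```

Whether negations should stay depends on the task: for sentiment analysis they usually carry signal, while for topic modeling they may not.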


#### Stopwords in NLTK
NLTK ships with a built-in list of English stopwords (179 words in the version used here). The default list can be loaded by calling stopwords.words('english') from nltk.corpus, and it can be modified as needed.

stopwords.words() is most commonly used in the text preprocessing phase of a pipeline, before NLP techniques such as text classification are applied.

In [2]:
#download stopwords -NLTK
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [4]:
print(stopwords.words('french'))

['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aur

#### Remove Stopwords from Text
Below, we will remove stopwords from a string with NLTK. 

- First, we call “stopwords.words()” with 'english' and store the returned list of stopwords in a variable.
- Then we create an empty list to hold the words that are not stopwords.

Using a for loop over the text (split on whitespace), we check whether each lowercased word is in the stopword list and, if it is not, append it to the new list.

Finally, we join the remaining words with the “join()” method, which gives the text with all NLTK stopwords removed.


In [6]:
# Example -1

from nltk.corpus import stopwords

text = "Spread love everywhere you go. Let no one ever come to you without leaving happier"
en_stopwords = stopwords.words('english')

lst=[]
for token in text.split():
    # Keep the token only if it is not present in the stopword list.
    if token.lower() not in en_stopwords:
        lst.append(token)
        
#Join items in the list
print(' '.join(lst))

Spread love everywhere go. Let one ever come without leaving happier


In [7]:
# Example -2
from nltk.corpus import stopwords

text = "Life is what happens when you're busy making other plans"
en_stopwords = stopwords.words('english')

lst=[]
for token in text.split():
    if token.lower() not in en_stopwords:
        lst.append(token)
        
print(' '.join(lst))

Life happens busy making plans
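Note that splitting on whitespace leaves punctuation attached to tokens, which is why “go.” keeps its period in Example 1. It also means a stop word followed by punctuation, such as “you.”, does not match the list and is kept. The following is a minimal sketch of one way to handle this, assuming a simple regular-expression tokenizer is acceptable. The `stop_words` set is a small hand-written stand-in for the NLTK English list so the snippet runs on its own, and `remove_stopwords` is a hypothetical helper name.

```python
import re

# Illustrative subset standing in for the NLTK English list (assumption).
stop_words = {"you", "to", "no", "without", "the"}

def remove_stopwords(text, stop_words):
    # A token is a word (optionally with an apostrophe part, e.g. "you're")
    # or a single punctuation character.
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    return [t for t in tokens if t.lower() not in stop_words]

text = "Let no one come to you. Spread love."
print(text.split())
# ['Let', 'no', 'one', 'come', 'to', 'you.', 'Spread', 'love.']
print(remove_stopwords(text, stop_words))
# ['Let', 'one', 'come', '.', 'Spread', 'love', '.']
```

The punctuation tokens can be filtered out in the same pass if they are not needed downstream.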


### Adding Stop Words to Default NLTK Stopwords List
The default English list contains 179 stopwords, but we can add our own. To add words to the NLTK stop word list, we first obtain the list from stopwords.words('english'). Next, we use the list's extend method to append our words to it.

#### Example
The following script adds a list of words to the NLTK stop word list. Initially, the list returned by stopwords.words('english') has 179 entries; after adding 3 more words, its length becomes 182.

In [8]:
en_stopwords = stopwords.words('english')
print(len(en_stopwords))
new_stopwords = ["you're","i'll","we'll"]
en_stopwords.extend(new_stopwords)
len(en_stopwords)

179


182
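One caveat: `extend` appends without checking for duplicates, so a word that is already present (such as "you're", which appears in the English list printed earlier) is stored twice. A minimal sketch of an alternative, assuming order does not matter, is to keep the stop words in a set. The short `en_stopwords` list below is a hand-written stand-in for the NLTK list so the snippet runs on its own.

```python
# Illustrative subset standing in for stopwords.words('english') (assumption).
en_stopwords = ["i", "me", "you", "you're", "the"]
new_stopwords = ["you're", "i'll", "we'll"]

extended = en_stopwords + new_stopwords          # list: the duplicate "you're" is kept
merged = set(en_stopwords) | set(new_stopwords)  # set: the duplicate is dropped

print(len(extended))  # 8
print(len(merged))    # 7

# Removing words is just as simple with a set.
merged -= {"the"}
print("the" in merged)  # False
```

A set also makes the membership test inside the filtering loop faster than scanning a list.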

### NLTK Stopwords for other Languages
Besides English, NLTK provides stopword lists for several other languages. The supported languages can be listed as shown below.

In [9]:
from nltk.corpus import stopwords
print(stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


### Removing stopwords from Text File
The code below removes stopwords using the same process as above; the only difference is that the text is read from a file using Python's “with open()” construct.

In [13]:
from nltk.corpus import stopwords 

en_stopwords = stopwords.words('english') 
with open(r"C:\Users\user\NLP\biden.txt") as f:
    text=f.read()
    
lst=[]
for token in text.split():
    if token.lower() not in en_stopwords:
        lst.append(token)

print('Original Text')        
print(text,'\n\n')

print('Text after removing stop words')
print(' '.join(lst)) 

Original Text
Joseph Robinette Biden Jr. (born November 20, 1942) is an American politician who is the 46th and current president of the United States. A member of the Democratic Party, he previously served as the 47th vice president from 2009 to 2017 under President Barack Obama and represented Delaware in the United States Senate from 1973 to 2009. 


Text after removing stop words
Joseph Robinette Biden Jr. (born November 20, 1942) American politician 46th current president United States. member Democratic Party, previously served 47th vice president 2009 2017 President Barack Obama represented Delaware United States Senate 1973 2009.
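
The same steps can be wrapped in a reusable function. The sketch below is one possible arrangement, not part of NLTK: `remove_stopwords_from_file` is a hypothetical helper, the explicit `encoding` argument is an assumption about the input file, and `sample_stopwords` is a hand-written stand-in for the NLTK list. A temporary file is used so the snippet runs anywhere.

```python
import os
import tempfile

def remove_stopwords_from_file(path, stop_words, encoding="utf-8"):
    """Read a text file and return its content without the given stop words."""
    stop_words = set(stop_words)  # set membership tests are fast
    with open(path, encoding=encoding) as f:
        text = f.read()
    return " ".join(t for t in text.split() if t.lower() not in stop_words)

# Illustrative subset standing in for stopwords.words('english') (assumption).
sample_stopwords = ["is", "an", "who", "the", "and", "of"]

# Write a throwaway file so the sketch does not depend on a local path.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Biden is an American politician who is the 46th president "
              "of the United States.")
    sample_path = tmp.name

result = remove_stopwords_from_file(sample_path, sample_stopwords)
os.remove(sample_path)
print(result)  # Biden American politician 46th president United States.
```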
