Stop words are words that are commonly filtered out before natural language text is processed. They are typically the most common words in a language (articles, prepositions, pronouns, conjunctions, and so on) and add little information to the text. Examples of English stop words are “the”, “a”, “an”, “so”, and “what”.

Stop words are not always removed. Whether to remove them depends on the task and on the goal we want to achieve. For example, when training a model for sentiment analysis we might keep them, because negations such as “not” usually appear on stop word lists and dropping them can reverse the meaning of a sentence:


Movie review: “The movie was not good at all.” 

Text after removal of stop words: “movie good”
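The effect can be reproduced with a minimal sketch. It uses a small hand-written stop list rather than a library list; `TOY_STOP_WORDS` and `remove_stop_words` are illustrative names introduced here, not part of any library.

```python
import re

# Hypothetical toy stop list for illustration only; real lists are much longer.
TOY_STOP_WORDS = {"the", "was", "not", "at", "all"}

def remove_stop_words(text, stop_words):
    """Lowercase the text, keep word-like tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

review = "The movie was not good at all."

# With "not" on the stop list, the negative review reads as positive.
print(remove_stop_words(review, TOY_STOP_WORDS))

# Taking "not" off the list preserves the negation.
print(remove_stop_words(review, TOY_STOP_WORDS - {"not"}))
```

The first call prints `movie good` and the second prints `movie not good`, which is why a sentiment model may be better served by keeping negation words.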

# Natural Language Toolkit (NLTK)

In [1]:
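import nltk
from nltk.corpus import stopwords

# Download the stop word corpus if it is not already available locally
nltk.download('stopwords', quiet=True)
sw_nltk = stopwords.words('english')
print(sw_nltk)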

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [2]:
print(len(sw_nltk))

179


In this NLTK version, the default English list contains 179 stop words.

**Method 1:**

In [3]:
text = "When I first met her she was very quiet. She remained quiet during the entire two hour long journey from Stony Brook to New York."
words = [word for word in text.split() if word.lower() not in sw_nltk]   
new_text = " ".join(words)

print(new_text)   
print("Old length: ", len(text))   
print("New length: ", len(new_text))

first met quiet. remained quiet entire two hour long journey Stony Brook New York.
Old length:  129
New length:  82
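Two details of Method 1 are worth noting. `sw_nltk` is a list, so every `not in` check scans it from the start, and `split()` leaves punctuation attached to words (for example `quiet.`), so a stop word followed by a comma or period would slip through. The sketch below shows one possible way to handle both; `sample_stop_words` is a hypothetical subset used so the code runs without NLTK, and `filter_words` is an illustrative helper name.

```python
import string

# Hypothetical subset of a stop word list, used so the sketch runs without NLTK.
sample_stop_words = ["when", "i", "her", "she", "was", "very",
                     "during", "the", "from", "to"]

# A set gives constant-time membership checks; a list is scanned on every lookup.
stop_set = set(sample_stop_words)

def filter_words(text, stop_set):
    """Split on whitespace and drop words whose punctuation-stripped,
    lowercased form is a stop word. Kept words are returned unchanged."""
    kept = []
    for raw in text.split():
        core = raw.strip(string.punctuation).lower()
        if core and core not in stop_set:
            kept.append(raw)
    return " ".join(kept)

text = ("When I first met her she was very quiet. She remained quiet during "
        "the entire two hour long journey from Stony Brook to New York.")
print(filter_words(text, stop_set))

# A stop word with trailing punctuation is still removed.
print(filter_words("Well, the, end", {"the"}))
```

With a real library list, the same idea applies by building the set once, for example `set(sw_nltk)`, before filtering.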


# Ignoring punctuation in the text

In [1]:
import nltk
import string
from nltk.corpus import stopwords

stop_words=stopwords.words('english')
punct=string.punctuation

**Method 2:**


In [3]:
text = "When I first met her she was very quiet. She remained quiet during the entire two hour long journey from Stony Brook to New York."

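cleaned_text = []

for word in nltk.word_tokenize(text):
    # Skip punctuation tokens, then compare in lowercase so that
    # capitalized stop words such as "When", "I" and "She" are removed too.
    if word not in punct and word.lower() not in stop_words:
        cleaned_text.append(word)

print("original length (characters):", len(text))
print("tokens left after removing punctuation and stop words:", len(cleaned_text))

original length (characters): 129
tokens left after removing punctuation and stop words: 14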


# spaCy

In [5]:
import sys
!{sys.executable} -m pip install spacy
# The 'en' shortcut is deprecated as of spaCy v3.0; use the full pipeline package name.
!{sys.executable} -m spacy download en_core_web_sm

Collecting spacy
  Downloading spacy-3.1.4-cp38-cp38-win_amd64.whl (12.0 MB)
Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

In [6]:
import spacy

# Load spaCy's small English pipeline
en = spacy.load('en_core_web_sm')
sw_spacy = en.Defaults.stop_words
print(sw_spacy)

{'the', 'back', 'thereby', 'around', 'show', 'their', 'this', 'less', 'all', 'if', 'who', 'can', 'would', 'during', 'have', 'same', 'keep', 'nothing', 'must', 'thereupon', 'while', 'does', 'seems', 'ten', 'be', 'meanwhile', 'from', 'neither', 'over', 'twelve', 'unless', 'both', 'there', 'next', 'to', '‘re', "'re", '’ll', 'latter', 'except', 'themselves', 'here', "n't", 'latterly', 'seem', 'get', 'still', 'whenever', 'me', 'hundred', 'whereas', 'whereafter', 'name', 'give', 'into', 'each', 'those', 're', '’d', 'behind', 'n’t', 'nevertheless', 'sometimes', 'am', 'its', 'nobody', 'but', 'too', 'when', 'above', '’re', 'an', 'anyone', 'whence', 'his', 'becoming', 'were', 'any', 'yourself', 'anyway', 'been', 'some', 'and', 'as', 'with', 'anything', 'anyhow', 'towards', 'ca', 'your', 'mine', 'often', 'call', 'had', 'in', 'two', 'was', 'among', 'beside', 'together', 'amongst', 'though', 'others', 'own', 'elsewhere', 'may', 'thence', 'yours', 'should', 'you', 'first', 'various', 'us', 'every', 

# Gensim

In [9]:
pip install gensim

Note: you may need to restart the kernel to use updated packages.



In [10]:
import gensim   
from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS   
print(STOPWORDS)

frozenset({'the', 'thick', 'back', 'eg', 'thereby', 'around', 'show', 'their', 'this', 'less', 'all', 'if', 'who', 'would', 'can', 'during', 'have', 'same', 'keep', 'nothing', 'must', 'con', 'fire', 'thereupon', 'while', 'does', 'seems', 'ten', 'be', 'meanwhile', 'from', 'neither', 'over', 'twelve', 'unless', 'both', 'there', 'next', 'to', 'latter', 'except', 'themselves', 'here', 'latterly', 'seem', 'get', 'still', 'whenever', 'me', 'hundred', 'whereas', 'whereafter', 'name', 'give', 'into', 'each', 'those', 're', 'sincere', 'behind', 'nevertheless', 'sometimes', 'am', 'its', 'nobody', 'but', 'too', 'find', 'when', 'above', 'an', 'anyone', 'whence', 'his', 'becoming', 'were', 'any', 'ie', 'ltd', 'yourself', 'anyway', 'been', 'some', 'and', 'as', 'with', 'anything', 'anyhow', 'towards', 'kg', 'your', 'cry', 'mine', 'often', 'call', 'had', 'in', 'two', 'was', 'among', 'fill', 'beside', 'together', 'amongst', 'though', 'others', 'own', 'elsewhere', 'co', 'may', 'thence', 'yours', 'should

# Scikit-Learn

In [11]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS   
print(ENGLISH_STOP_WORDS)

frozenset({'the', 'thick', 'back', 'eg', 'thereby', 'around', 'show', 'their', 'this', 'less', 'all', 'if', 'who', 'can', 'would', 'during', 'have', 'same', 'keep', 'nothing', 'must', 'con', 'fire', 'thereupon', 'while', 'seems', 'ten', 'be', 'meanwhile', 'from', 'neither', 'over', 'twelve', 'both', 'there', 'next', 'to', 'latter', 'except', 'themselves', 'here', 'latterly', 'seem', 'get', 'still', 'whenever', 'me', 'hundred', 'whereas', 'whereafter', 'name', 'give', 'into', 'each', 'those', 're', 'sincere', 'behind', 'nevertheless', 'sometimes', 'am', 'its', 'nobody', 'but', 'too', 'find', 'when', 'above', 'an', 'anyone', 'whence', 'his', 'becoming', 'were', 'any', 'ie', 'ltd', 'yourself', 'anyway', 'been', 'some', 'and', 'as', 'with', 'anything', 'anyhow', 'towards', 'your', 'cry', 'mine', 'often', 'call', 'had', 'in', 'two', 'was', 'among', 'fill', 'beside', 'together', 'amongst', 'though', 'others', 'own', 'elsewhere', 'co', 'may', 'thence', 'yours', 'should', 'you', 'first', 'de',
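The printed outputs show that the libraries ship different lists, so the same text can be filtered differently depending on which one is used. A quick way to inspect the differences is with set operations. The sketch below uses two hypothetical toy lists (they are not the real library lists), and `compare_stop_lists` is an illustrative helper name.

```python
def compare_stop_lists(a, b):
    """Return (shared, only_in_a, only_in_b) as sorted lists."""
    a, b = set(a), set(b)
    return sorted(a & b), sorted(a - b), sorted(b - a)

# Hypothetical toy lists for illustration; they are NOT the real library lists.
toy_list_a = {"the", "a", "not", "kg"}
toy_list_b = {"the", "a", "first"}

shared, only_a, only_b = compare_stop_lists(toy_list_a, toy_list_b)
print("shared:", shared)
print("only in A:", only_a)
print("only in B:", only_b)
```

With the libraries installed, the same function should accept the objects printed above, for example `compare_stop_lists(STOPWORDS, ENGLISH_STOP_WORDS)`, since both are frozensets.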

# Removing stop words from text

In [12]:
# Remove stop words using scikit-learn's list (text is the sentence defined earlier)
words = [word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS]   
new_text = " ".join(words)   

print(new_text)   
print("Old length: ", len(text))   
print("New length: ", len(new_text))

met quiet. remained quiet entire hour long journey Stony Brook New York.
Old length:  129
New length:  72


# Adding custom stop words to the list

In [13]:
# Note: 'me' is already in the default list, so extend() adds it a second time
sw_nltk.extend(['first', 'second', 'third', 'me'])
print(len(sw_nltk))

183
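The count rises from 179 to 183 even though `'me'` was already on the list, because `extend()` appends every item without checking for duplicates. If duplicates are unwanted, one option is an order-preserving de-duplication, sketched here with a hypothetical short list standing in for `sw_nltk`.

```python
# Hypothetical short stand-in for a library stop list.
stop_words = ["i", "me", "my", "the"]
custom_words = ["first", "second", "third", "me"]

# Plain concatenation (like extend) keeps the duplicate "me".
extended = stop_words + custom_words
print(len(extended))

# dict.fromkeys keeps the first occurrence of each word and preserves order.
merged = list(dict.fromkeys(stop_words + custom_words))
print(len(merged))
print(merged)
```

Duplicates do not change the filtering result, but they make the length of the list a misleading measure of how many distinct stop words it holds.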


# Removing stop words from the list

In [14]:
# Remove 'not' from the predefined list so that negations are kept in the text
sw_nltk.remove('not')
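`list.remove()` deletes only the first matching item and raises `ValueError` if the word is not on the list. When several words should be dropped and some may be missing, a filtering approach is more forgiving. This is a minimal sketch with a hypothetical short list standing in for `sw_nltk`; `drop_words` is an illustrative helper name.

```python
# Hypothetical short stand-in for a library stop list.
stop_words = ["the", "was", "not", "at", "all"]

# "never" is deliberately absent from the list above.
negations = ["not", "never"]

def drop_words(stop_words, to_drop):
    """Return a new list without the given words; missing words are ignored."""
    to_drop = set(to_drop)
    return [w for w in stop_words if w not in to_drop]

trimmed = drop_words(stop_words, negations)
print(trimmed)

# The original list is left unchanged.
print(stop_words)
```

Returning a new list instead of mutating the original also keeps the default list available for tasks where negations do not matter.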