# STOP WORDS

- Stop words are the most common words in a language
- They are typically filtered out during text preprocessing
- They are words that add little meaning to a sentence on their own

#### What are stop words?

- In computing, stop words are words that are filtered out before or after natural language data (text) is processed. While “stop words” typically refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools.

EXAMPLE

- Typical stop words are short, common function words such as "the", "is", "at", "which", and "on"

#### Why remove stop words?

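- For tasks such as text classification or sentiment analysis, stop words are usually removed because they carry little information for the model. Check the list first, though: spaCy's default list includes "not", and dropping negations can hurt sentiment analysis
- The text becomes smaller
- There are fewer tokens to analyze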
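The size reduction can be sketched without any library. The snippet below uses a tiny hand-picked stop list (`SAMPLE_STOP_WORDS`, an assumption for illustration, not spaCy's list) and a hypothetical helper `remove_stop_words`. It also shows how a negation can be lost along the way.

```python
# Tiny illustrative stop list (an assumption for this sketch, not spaCy's list).
SAMPLE_STOP_WORDS = {"the", "is", "at", "which", "on", "and", "a", "not"}


def remove_stop_words(text, stop_words):
    """Return the whitespace-separated tokens of `text` that are not stop words."""
    # Lowercase each token before the lookup so "The" matches "the".
    return [tok for tok in text.split() if tok.lower() not in stop_words]


review = "The food at the cafe is not good"
kept = remove_stop_words(review, SAMPLE_STOP_WORDS)

print(len(review.split()), "->", len(kept))  # token count before and after
print(kept)
```

The eight original tokens shrink to three, but what remains ("food cafe good") no longer carries the negation. This is the kind of side effect to look for before applying a stop list to a sentiment task.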

In [5]:
import spacy

In [6]:
nlp = spacy.load('en_core_web_sm')

#### Method 1: import STOP_WORDS

In [7]:
from spacy.lang.en.stop_words import STOP_WORDS

In [8]:
STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 ...

In [9]:
len(STOP_WORDS)

326

#### Method 2: nlp.Defaults.stop_words

In [10]:
print(nlp.Defaults.stop_words)

{"'d", 'anyhow', 'yet', 'itself', 'along', 'all', 'beforehand', 'take', 'had', 'about', 'on', 'latterly', 'unless', 'what', 'elsewhere', 'once', 'beyond', 'rather', 'thus', 'anyone', 'done', 'really', 'although', 'each', 'from', 'some', 'keep', 'whereupon', 'with', 'another', 'call', 'several', 'first', 'towards', '‘re', 'ca', 'meanwhile', 'our', 'alone', 'an', 'anything', 'side', 'onto', 'whereas', 'more', 'herein', 'always', 'whenever', 'were', 'both', 'amount', 'n‘t', 'see', 'still', 'whence', 'amongst', 'being', 'two', 'herself', 'quite', 'indeed', 'him', 'ever', 'among', 'nor', 'my', 'hereby', 'there', 'many', 'moreover', 'becomes', 'empty', 'even', 'into', 'much', 'noone', 'been', 'became', 'was', 're', 'we', 'something', 'will', 'is', 'during', '’d', 'are', 'but', 'using', 'below', 'those', 'already', 'within', 'whose', '’ve', 'thru', 'others', 'his', 'of', 'whole', 'not', 'did', 'here', 'ours', 'or', 'hence', 'if', 'never', 'be', "'re", '‘m', 'former', 'could', 'her', 'sometime

In [11]:
len(nlp.Defaults.stop_words)

326

In [12]:
nlp.vocab["in"].is_stop

True

In [13]:
nlp.vocab["name"].is_stop

True

In [14]:
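# The whole string is looked up as a single lexeme, so a multi-word phrase is not found in the stop list
nlp.vocab["Data Coaster"].is_stop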

False

#### HOW TO FILTER STOP WORDS

In [18]:
sentence = nlp("My name is Ajay Goswami and i am from Allahabad and my channel's name is Data Coaster")

In [19]:
sentence

My name is Ajay Goswami and i am from Allahabad and my channel's name is Data Coaster

In [20]:
for word in sentence:
    print(word)

My
name
is
Ajay
Goswami
and
i
am
from
Allahabad
and
my
channel
's
name
is
Data
Coaster


In [21]:
# Print only the tokens that spaCy flags as stop words
for word in sentence:
    if word.is_stop:
        print(word)

My
name
is
and
i
am
from
and
my
's
name
is


In [22]:
# Print only the tokens that are not stop words
for word in sentence:
    if not word.is_stop:
        print(word)

Ajay
Goswami
Allahabad
channel
Data
Coaster


In [23]:
# using a list comprehension: keep the stop words
[word for word in sentence if word.is_stop]

[My, name, is, and, i, am, from, and, my, 's, name, is]

In [24]:
# using a list comprehension: keep everything except the stop words
[word for word in sentence if not word.is_stop]

[Ajay, Goswami, Allahabad, channel, Data, Coaster]
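Once the stop words are filtered out, the remaining tokens are often joined back into a string for the next step. The sketch below works on plain strings copied from the token output above, so it runs without a spaCy model. `join_non_stop` is a hypothetical helper, and `SEEN_STOP_WORDS` holds only the stop words that appeared in this one sentence. With a real `Doc`, the same idea would use `token.text` and `token.is_stop`.

```python
# Tokens exactly as spaCy printed them for the sample sentence.
tokens = ["My", "name", "is", "Ajay", "Goswami", "and", "i", "am", "from",
          "Allahabad", "and", "my", "channel", "'s", "name", "is",
          "Data", "Coaster"]

# Stop words observed in the output for this sentence (lowercased);
# this is not the full spaCy list.
SEEN_STOP_WORDS = {"my", "name", "is", "and", "i", "am", "from", "'s"}


def join_non_stop(tokens, stop_words):
    """Join the tokens that are not stop words into one space-separated string."""
    return " ".join(tok for tok in tokens if tok.lower() not in stop_words)


cleaned = join_non_stop(tokens, SEEN_STOP_WORDS)
print(cleaned)
```

Joining with a single space is a simplification: it does not restore the original spacing around punctuation or clitics such as `'s`.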

# ADD OR REMOVE STOP WORDS

#### ADD

In [25]:
nlp.Defaults.stop_words

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 ...

In [26]:
len(nlp.Defaults.stop_words)

326

In [28]:
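# Add the word to the default stop-word set
nlp.Defaults.stop_words.add('Ajay')

In [29]:
# Also flag the lexeme itself so that tokens with this exact text report is_stop as True
nlp.vocab['Ajay'].is_stop = True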

In [30]:
len(nlp.Defaults.stop_words)

327

In [31]:
nlp.vocab['Ajay'].is_stop

True

#### REMOVE

In [32]:
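# Remove the word from the default stop-word set (raises KeyError if it is not present)
nlp.Defaults.stop_words.remove('Ajay')

In [33]:
# Clear the flag on the lexeme as well
nlp.vocab['Ajay'].is_stop = False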

In [34]:
len(nlp.Defaults.stop_words)

326

In [35]:
nlp.vocab['Ajay'].is_stop

False
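The `add` and `remove` calls above change the default set in place. When a task needs its own list, one alternative is to build a separate set and leave the defaults alone. The helper below, `customise_stop_words`, is a hypothetical name for this sketch. It takes any base set of strings (spaCy's `STOP_WORDS` could be passed in; a small stand-in is used here so the snippet runs on its own). Lowercasing the added and removed words is a design choice of this sketch, made so that matching is case-insensitive.

```python
def customise_stop_words(base, add=(), remove=()):
    """Return a new stop-word set; `base` itself is left unchanged."""
    added = {word.lower() for word in add}
    removed = {word.lower() for word in remove}
    # Union first, then difference, so `remove` wins if a word is in both.
    return (set(base) | added) - removed


# Stand-in for a real stop list (an assumption for illustration).
base_stop_words = {"the", "is", "not"}

custom = customise_stop_words(base_stop_words, add=["Ajay"], remove=["not"])

print(sorted(custom))
print(sorted(base_stop_words))  # the base set is untouched
```

The custom set can then be used for filtering with a membership test such as `token.lower() not in custom`.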