# Stop Words

Stop words are words that are considered insignificant to the meaning of a text when processing it. Such words may include:

[Determiners (words placed in front of nouns)](https://www.grammar-monster.com/glossary/determiner.htm#:~:text=A%20determiner%20is%20a%20word,%2C%20that%2C%20these%2C%20those):

    An Article (a/an, the)
    A Demonstrative (this, that, these, those)
    A Possessive (my, your, his, her, its, our, their)
    A Quantifier (common examples include many, much, more, most, some)

[Prepositions (indicate direction, time, location, and spatial relationships)](https://www.grammarly.com/blog/prepositions/?gclid=Cj0KCQiApY6BBhCsARIsAOI_GjacuOmcuc2Whs3mZEn16FeI_P0cK3_0Hq-g2wGZD3Je7t5iE6fPwo0aAtr8EALw_wcB&gclsrc=aw.ds):

    Direction: to, across, through
    Time: since, about, after
    Location: at, elsewhere, in front of
    Space: under, above, below, inside
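
As a minimal sketch of the idea, the snippet below filters a sentence against a tiny hand-picked stop list. The set `TOY_STOP_WORDS` and the function `remove_stop_words` are hypothetical names used only for illustration; the packages compared in this notebook ship much longer lists.

```python
# Hypothetical toy stop list: a few determiners and prepositions from the examples above
TOY_STOP_WORDS = {"a", "an", "the", "this", "that", "my", "your", "to", "at", "under", "in"}


def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Return the whitespace-separated tokens of `text` that are not stop words."""
    # Lowercase each token before the membership test so "The" matches "the"
    return [tok for tok in text.split() if tok.lower() not in stop_words]


print(remove_stop_words("The cat sat under my chair"))
```

The membership test uses `not in` against a set, which is the core operation every package below performs in some form.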

## Methods

1. Sklearn
2. NLTK
3. SPACY

This Jupyter Notebook compares three different approaches to removing stop words, in order to determine when to use one package over another during text processing.

### SKLEARN Stop Words

In [1]:
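# The old `sklearn.feature_extraction.stop_words` module was removed in newer
# scikit-learn releases; the list is available from `feature_extraction.text`
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Copy the frozenset into a plain set; `s` is reused in later cells
s = set(ENGLISH_STOP_WORDS)

# Print the full list in alphabetical order
print(sorted(s))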

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give



### NLTK Stop Words

In [44]:
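import nltk
from nltk.corpus import stopwords

# The corpus must be available locally; run nltk.download('stopwords') once if it is not
nl = stopwords.words('english')

# Print the full list in alphabetical order
print(sorted(nl))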

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

### Compare SKLEARN and NLTK Stop Words

What is the difference between the SKLEARN and NLTK lists? Which words are included in SKLEARN but not in NLTK, and vice versa?

#### 1. Stop words in SKLEARN that are not in NLTK


In [3]:
list_difference = [item for item in s if item not in nl]

print(list_difference)

['many', 'nowhere', 'enough', 'eg', 'wherein', 'eight', 'around', 'cannot', 'either', 'afterwards', 'would', 'thus', 'see', 'beside', 'inc', 'cant', 'amongst', 'done', 'next', 'moreover', 'co', 'someone', 'please', 'anyhow', 'put', 'thick', 'almost', 'must', 'whose', 'describe', 'yet', 'thereby', 'anything', 'still', 'always', 'third', 'rather', 'well', 'whereas', 'another', 'ten', 'get', 'none', 'mill', 'also', 'among', 'sometime', 'fifteen', 'thereupon', 'fifty', 'latter', 'whether', 'meanwhile', 'serious', 'several', 'move', 'become', 'least', 'nothing', 'something', 'without', 'besides', 'fill', 'side', 'onto', 'whole', 'along', 'hereafter', 'call', 'first', 'alone', 'sometimes', 'sixty', 'four', 'whereafter', 'etc', 'much', 'empty', 'per', 'already', 'upon', 'therefore', 'throughout', 'whereupon', 'name', 'anywhere', 'somehow', 'beforehand', 'forty', 'hundred', 'found', 'others', 'became', 'due', 'bill', 'take', 'full', 'couldnt', 'thin', 'former', 'mine', 'sincere', 'formerly', '

Many prepositions that are in SKLEARN are not in NLTK.

Other examples (quantifiers, adverbs, and prepositions): many, anywhere, thru, thereafter, therefore, upon, hereafter, along, alone, sometimes


#### 2. Stop words in NLTK that are not included in sklearn

In [4]:

list_difference = [item for item in nl if item not in s]

print(list_difference)

["you're", "you've", "you'll", "you'd", "she's", "it's", 'theirs', "that'll", 'having', 'does', 'did', 'doing', 's', 't', 'just', 'don', "don't", "should've", 'd', 'll', 'm', 'o', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]


There are more stop words in SKLEARN than in NLTK. The words found only in NLTK are mostly contractions and their fragments (you're, she's, it's, don't, didn, ll, ve), plus the possessive pronoun theirs.
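
The two list comprehensions above can also be expressed with set operations, which report both directions and the overlap in one call. This is only a sketch: `compare_stop_lists` is a hypothetical helper, and the two sample sets are small hand-picked excerpts, not the full library lists.

```python
def compare_stop_lists(first, second):
    """Summarise how two stop word collections overlap."""
    first, second = set(first), set(second)
    return {
        "only_first": sorted(first - second),   # in first, missing from second
        "only_second": sorted(second - first),  # in second, missing from first
        "shared": sorted(first & second),       # present in both
    }


# Small illustrative samples (assumed excerpts, not the complete lists)
sample_sklearn = {"the", "many", "upon", "is"}
sample_nltk = {"the", "is", "don't", "you're"}

result = compare_stop_lists(sample_sklearn, sample_nltk)
print(result["only_first"])
print(result["only_second"])
print(len(result["shared"]))
```

With the real lists, the same call would take `s` and `nl` from the cells above as arguments.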

### Spacy Stop Words

In [None]:
#remove stop words
import spacy

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

# Stop words from spacy
spc = nlp.Defaults.stop_words

In [19]:
print(spc)

{'many', 'own', 'nowhere', 'wherein', 'eight', 'cannot', '’s', 'either', 'afterwards', 'n’t', '’re', 'see', 'was', 'beside', 'each', 'into', 'if', 'someone', 'should', "'re", 'who', 'please', 'other', 'anyhow', 'whom', 'or', 'almost', 'whose', 're', 'yet', 'thereby', 'anything', 'still', 'were', 'in', 'rather', "'ve", 'well', 'hers', 'whereas', 'say', "'s", 'why', 'few', 'sometime', 'fifteen', 'and', 'fifty', 'only', 'doing', 'all', 'serious', 'you', 'several', 'here', 'them', 'something', 'with', 'without', 'besides', 'where', 'whole', 'hereafter', 'call', 'first', 'sometimes', 'the', 'four', 'between', 'much', 'per', 'these', 'further', 'upon', 'too', 'used', 'itself', 'therefore', 'throughout', 'name', 'beforehand', 'being', 'hundred', 'up', 'others', 'down', 'by', 'regarding', 'n‘t', 'take', 'mine', 'than', 'yourself', 'formerly', 'ever', 'seems', 'becomes', 'hereupon', '‘m', 'no', 'otherwise', 'until', 'it', 'yourselves', 'most', 'even', 'so', 'toward', 'hereby', 'often', 'an', 'b

### Compare SKLEARN and Spacy Stop Words

#### 1. Stop words in Spacy that are not in SKLEARN

In [8]:
# Stop words in spacy that are not in scikit-learn
list_difference = [item for item in spc if item not in s]

print('stop words in spacy that are not in scikit-learn:')
print('')
print(list_difference)

stop words in spacy that are not in scikit-learn:

['’s', 'n’t', '’re', "'re", "'ve", 'say', "'s", 'doing', 'used', 'regarding', 'n‘t', '‘m', '’d', "'ll", '‘re', 'various', 'does', "'d", "'m", "n't", 'using', '’m', '‘ll', 'unless', '’ve', 'really', 'just', 'make', 'ca', '‘ve', '‘s', 'did', 'quite', '’ll', '‘d']


#### 2. Stop words in SKLEARN that are not in SPACY

In [20]:
# Stop words in scikit learn that are not in spacy
list_difference = [item for item in s if item not in spc]

print('Stop words in scikit learn that are not in spacy:')
print('')
print(list_difference)

Stop words in scikit learn that are not in spacy:

['eg', 'inc', 'cant', 'co', 'thick', 'describe', 'mill', 'fill', 'etc', 'found', 'bill', 'couldnt', 'thin', 'sincere', 'cry', 'hasnt', 'de', 'find', 'amoungst', 'system', 'con', 'interest', 'ie', 'ltd', 'detail', 'un', 'fire']


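SPACY contains slightly more stop words than SKLEARN: 35 words appear only in SPACY and 27 only in SKLEARN. Most of the SPACY-only entries are contraction fragments repeated in several apostrophe styles ('s, ’s, ‘s).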
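
The SPACY-only list above repeats the same contraction fragment with different apostrophe characters. A rough sketch of collapsing those variants before comparing lists follows; `normalize_apostrophes` is a hypothetical helper, and the sample words are copied from the printed output rather than loaded from spaCy.

```python
# Map curly apostrophes (U+2018, U+2019) to the straight ASCII apostrophe
APOSTROPHES = str.maketrans({"\u2018": "'", "\u2019": "'"})


def normalize_apostrophes(words):
    """Return the set of words with curly apostrophes replaced by straight ones."""
    return {word.translate(APOSTROPHES) for word in words}


# A few entries taken from the printed spaCy-only list
sample = ["\u2019s", "'s", "\u2018s", "n\u2019t", "n't", "n\u2018t", "say"]

print(len(sample))
print(sorted(normalize_apostrophes(sample)))
```

The same normalisation matters for the tweets themselves, since text such as here’s uses a curly apostrophe.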

### Test out stop words on twitter docs

In [23]:
import pandas as pd
import numpy as np

# Import tweet data with sentiment analysis
df = pd.read_csv('data/2020-08_sentiment.csv')

In [29]:
df['full_text'][0]

'This is our last stand folks And here’s your last defender If they take him down America is gone forever Vote for realDonaldTrump like your life depends on it'

In [66]:
# NLTK

doc = df['full_text'][0]
tokens = nltk.tokenize.word_tokenize(doc)

print(df['full_text'][0])
print('')
print('Removing STOP WORDS with NLTK')
# Use `not in` for membership (`is not` compares identity and never filters);
# lowercase each token because the stop list is lowercase
print([word for word in tokens if word.lower() not in nl])

This is our last stand folks And here’s your last defender If they take him down America is gone forever Vote for realDonaldTrump like your life depends on it

Removing STOP WORDS with NLTK
['last', 'stand', 'folks', '’', 'last', 'defender', 'take', 'America', 'gone', 'forever', 'Vote', 'realDonaldTrump', 'like', 'life', 'depends']


In [67]:
# SKLEARN

listofwords = df['full_text'][0].split()

print(df['full_text'][0])
print('')
print('Removing STOP WORDS with SKLEARN')
# Use `not in` for membership (`is not` compares identity and never filters);
# lowercase each token because the stop list is lowercase
print([word for word in listofwords if word.lower() not in s])

This is our last stand folks And here’s your last defender If they take him down America is gone forever Vote for realDonaldTrump like your life depends on it

Removing STOP WORDS with SKLEARN
['stand', 'folks', 'here’s', 'defender', 'America', 'gone', 'forever', 'Vote', 'realDonaldTrump', 'like', 'life', 'depends']


In [68]:
# SPACY

# `nlp` is the pipeline loaded earlier with spacy.load("en_core_web_sm")
doc = nlp(df['full_text'][0])
tokens = [token.text for token in doc if not token.is_stop]

print(df['full_text'][0])
print('')
print('Removing STOP WORDS with SPACY')
print(tokens)

This is our last stand folks And here’s your last defender If they take him down America is gone forever Vote for realDonaldTrump like your life depends on it

Removing STOP WORDS with SPACY
['stand', 'folks', 'defender', 'America', 'gone', 'forever', 'Vote', 'realDonaldTrump', 'like', 'life', 'depends']
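
To compare packages on more than one tweet, it can help to count how many tokens each stop list would remove rather than reading the filtered output by eye. The sketch below assumes whitespace tokenisation and case-insensitive matching; `count_removed` and the two toy lists are hypothetical stand-ins for the real stop word sets.

```python
def count_removed(tokens, stop_lists):
    """Map each list name to the number of tokens it would remove (case-insensitive)."""
    return {
        name: sum(1 for tok in tokens if tok.lower() in stops)
        for name, stops in stop_lists.items()
    }


tokens = "This is our last stand folks".split()

# Toy lists standing in for the real package lists
toy_lists = {
    "short_list": {"this", "is"},
    "long_list": {"this", "is", "our", "last"},
}

print(count_removed(tokens, toy_lists))
```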


### Summary

SPACY removes the most stop words of the three packages. One thing to consider is whether you actually want these words removed. My focus is tweets from Twitter and identifying the topics within each tweet, so for my objective, the more stop words removed the better. Therefore, I will use SPACY when applying NLP to tweets or other social media content.
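
If some of the removed words turn out to matter for a task (negations in sentiment work, for example), a stop list can be adjusted rather than taken as is. The following is a sketch under that assumption; `customize_stop_words`, the `base` set, and the added token "rt" are illustrative choices, not part of any of the three packages.

```python
def customize_stop_words(base, keep=(), extra=()):
    """Return a new stop set: `base` without the words in `keep`, plus the words in `extra`."""
    return (set(base) - set(keep)) | set(extra)


# Toy base list standing in for a package's stop words
base = {"the", "is", "not", "no", "for"}

# Keep negations as content words and treat the retweet marker as a stop word
custom = customize_stop_words(base, keep={"not", "no"}, extra={"rt"})

tokens = "RT this is not the end".split()
print([tok for tok in tokens if tok.lower() not in custom])
```

Because the function returns a new set, the original list stays unchanged and can still be used for comparison.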