# StopWords

###### Definition:
Stop words are the most common words in a natural language. When analyzing text data or building NLP models,
they usually add little to the meaning of a document. Typical English examples are
“the”, “is”, “in”, “for”, “where”, “when”, “to”, and “at”.
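
As a rough illustration of why these words dominate, the sketch below counts word frequencies in a short sample text. The text and the helper name `word_counts` are assumptions made up for this example, and the regular expression is a deliberately simple stand-in for a real tokenizer.

```python
from collections import Counter
import re

# Sample text, assumed here for illustration only.
text = (
    "The cat sat on the mat. The dog sat in the yard. "
    "The cat and the dog met at the gate."
)

def word_counts(text):
    """Lowercase the text, split it into alphabetic words, and count them."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = word_counts(text)
top_words = counts.most_common(3)
print(top_words)  # "the" comes first, with a count of 7
```

Even in three short sentences, “the” occurs far more often than any content word, which is the pattern stop word lists try to capture.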

##### Why are they called stop words?


Words like the, in, at, that, which, and on are called stop words. The term is attributed to Hans Peter Luhn, an early pioneer of information
retrieval techniques. Stop words are so common that they can be excluded from searches:
parsing and indexing them adds work for the software while providing minimal benefit.
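
The cost argument can be sketched with a toy inverted index. This is a minimal illustration under assumed inputs: the three documents, the `build_index` helper, and the two-word stop list are all hypothetical and not taken from any real search engine.

```python
from collections import defaultdict

# Three tiny sample documents, assumed for illustration.
docs = {
    1: "the history of the printing press",
    2: "the rules of chess",
    3: "the anatomy of the honey bee",
}

def build_index(docs, stop_words=frozenset()):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            if word not in stop_words:
                index[word].add(doc_id)
    return index

full_index = build_index(docs)
# "the" matches every document, so it cannot narrow a search; "chess" can.
print(full_index["the"], full_index["chess"])

# Hypothetical two-word stop list, chosen only for this example.
small_index = build_index(docs, stop_words={"the", "of"})
print(len(full_index), len(small_index))  # 10 8
```

The entries for “the” and “of” point at every document, so storing them costs space without helping to tell the documents apart.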

###### What English words are stop words for Google?

Stop words are words that are filtered out because they carry little meaning on their own.
Google stop words are usually articles, prepositions, conjunctions, pronouns, etc.

### spaCy

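In [None]:
%pip install spacy
# One-time download of the small English model used below
!python -m spacy download en_core_web_sm
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

print('spaCy Version: %s' % spacy.__version__)
spacy_nlp = spacy.load('en_core_web_sm')

# Sample text (assumed; the original article text is not shown in this notebook)
article = ('In computing, stop words are words that are filtered out '
           'before or after processing of natural language text.')

# Check pre-defined stop words (a set, so the "first ten" order is arbitrary)
spacy_stopwords = STOP_WORDS
print('Number of stop words: %d' % len(spacy_stopwords))
print('First ten stop words: %s' % list(spacy_stopwords)[:10])

# Remove stop words
doc = spacy_nlp(article)
tokens = [token.text for token in doc if not token.is_stop]
print('Original Article: %s' % article)
print()
print(tokens)

# Add custom stop words by flagging their vocabulary entries
customize_stop_words = [
    'computing', 'filtered'
]
for w in customize_stop_words:
    spacy_nlp.vocab[w].is_stop = True
doc = spacy_nlp(article)
tokens = [token.text for token in doc if not token.is_stop]
print('Original Article: %s' % article)
print()
print(tokens)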

### NLTK

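In [None]:
%pip install nltk
# Import library
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

print('NLTK Version: %s' % nltk.__version__)
nltk.download('stopwords')
# word_tokenize needs the Punkt tokenizer data ('punkt_tab' on recent NLTK releases)
nltk.download('punkt')
nltk.download('punkt_tab')

# Sample text (assumed; the original article text is not shown in this notebook)
article = ('In computing, stop words are words that are filtered out '
           'before or after processing of natural language text.')

# Check pre-defined stop words
nltk_stopwords = stopwords.words('english')
print('Number of stop words: %d' % len(nltk_stopwords))
print('First ten stop words: %s' % list(nltk_stopwords)[:10])

# Remove stop words (the NLTK list is lowercase, so compare lowercased tokens)
stopword_set = set(nltk_stopwords)
tokens = word_tokenize(article)
tokens = [token for token in tokens if token.lower() not in stopword_set]
print('Original Article: %s' % article)
print()
print(tokens)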
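
NLTK's English stop list stores its words in lowercase, so a direct membership test misses capitalized tokens such as “The”. The library-free sketch below shows the difference. The mini stop list and the `remove_stop_words` helper are hypothetical and stand in for a real list.

```python
# Hypothetical mini stop list; real lists such as NLTK's contain many more entries.
STOP_WORDS = frozenset({"the", "is", "in", "a", "of"})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "cat", "is", "in", "the", "garden"]

# A case-sensitive comparison keeps the capitalized "The".
naive = [t for t in tokens if t not in STOP_WORDS]
filtered = remove_stop_words(tokens)

print(naive)     # ['The', 'cat', 'garden']
print(filtered)  # ['cat', 'garden']
```

Holding the stop list in a set (here a `frozenset`) also makes each lookup constant time on average, instead of scanning a list for every token.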