## Stop Words

A stop word is a commonly used word, such as "a", "an", or "the" in English.

Although stop words contribute to the meaning of a sentence in writing and speech, they generally add little information for the semantic analysis performed by the model we are building. Removing them can substantially reduce the vocabulary size, which helps the model train faster and require less storage space.

NLTK provides a list of stop words that we can use to remove them from the text corpus during preprocessing. We can also create a custom stop word list and use it in the same way.
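
As a minimal sketch of the custom-list option, the snippet below filters a sentence against a small hand-written stop word set. The set `custom_stop_words`, the helper `remove_custom_stop_words`, and the use of plain whitespace splitting are illustrative assumptions, not part of NLTK.

```python
# Hypothetical hand-written stop word list (not NLTK's list).
custom_stop_words = {"a", "an", "the", "is", "on"}

def remove_custom_stop_words(text, stop_word_set):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_word_set]

filtered = remove_custom_stop_words("The cat is on a mat", custom_stop_words)
print(filtered)  # ['cat', 'mat']
```

Using a set rather than a list keeps each membership check fast, which matters once the corpus grows.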


import nltk
nltk.download('stopwords')

In [4]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [5]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Note:  We can add or delete words from the stop word list based on our requirements.  
For example, for sentiment analysis we need to preserve negation words such as "not" (as in "is not" or "need not"), because negation can reverse the meaning of a sentence.
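
One way to adjust a stop word list for sentiment analysis is sketched below. To keep the snippet self-contained, `base_stop_words` is a short illustrative stand-in for the NLTK list, and the names `negations`, `domain_words`, and `sentiment_stop_words` are hypothetical.

```python
# Short illustrative stand-in for stopwords.words('english').
base_stop_words = ["i", "this", "is", "the", "a", "no", "nor", "not"]

# Words to keep because they can flip the sentiment of a sentence.
negations = {"no", "nor", "not"}
# Hypothetical domain-specific words to treat as extra stop words.
domain_words = {"movie", "film"}

# Remove the negations from the list, then add the domain words.
sentiment_stop_words = (set(base_stop_words) - negations) | domain_words

review = "this movie is not good"
kept = [word for word in review.split() if word not in sentiment_stop_words]
print(kept)  # ['not', 'good']
```

The same set operations should apply unchanged to the full NLTK list by passing `stop_words` in place of `base_stop_words`.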

In [8]:
from nltk.tokenize import TreebankWordTokenizer

In [9]:
corpus = ["I love my dog", "You love your cat", "They love their birds"]

In [15]:
def preprocess(doc, tokenizer):
    """Lowercase doc, tokenize it, and drop tokens found in stop_words."""
    doc = doc.lower()
    tokens = tokenizer.tokenize(doc)
    # Keep only the tokens that are not in the stop word list
    p_tokens = [token for token in tokens if token not in stop_words]
    # Join the remaining tokens back into a single string
    return " ".join(p_tokens)

In [16]:
tokenizer = TreebankWordTokenizer()
corpus_processed = [preprocess(doc, tokenizer) for doc in corpus] 

In [17]:
corpus_processed

['love dog', 'love cat', 'love birds']

### As the output shows, words such as "I", "my", "You", "your", "They", and "their" are removed. In a large corpus, removing these stop words can considerably reduce the vocabulary size, which lowers storage requirements and may speed up model training.
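
The effect on vocabulary size can be checked directly by counting unique tokens before and after removal. The sketch below reuses the example corpus and its processed output from above; the helper `vocabulary` is a hypothetical name, and whitespace splitting stands in for the tokenizer.

```python
def vocabulary(docs):
    """Return the set of unique lowercase whitespace-separated tokens."""
    return {token for doc in docs for token in doc.lower().split()}

original_docs = ["I love my dog", "You love your cat", "They love their birds"]
processed_docs = ["love dog", "love cat", "love birds"]

vocab_before = vocabulary(original_docs)
vocab_after = vocabulary(processed_docs)
print(len(vocab_before), len(vocab_after))  # 10 4
```

On this toy corpus the vocabulary shrinks from 10 words to 4; the reduction on a real corpus depends on its content.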

### Removing Punctuation and Other Special Characters

In some cases, punctuation and other special characters add no value to the text corpus, and it is wise to remove them. We will use regular expressions to do so.

In [18]:
import re

In [94]:
# Raw strings keep the backslashes intact for the regex engine
PUNCTUATIONS_RE = r'[.,?!:;]'
SPECIAL_CHARS_RE = r'[(){},/&%$#@*\[\]]'

In [95]:
corpus = ["I (love) my dog, really?", "You [love] your {cat}!", "#They *love /their %birds$;"]

In [102]:
def remove_punct(doc):   
    doc = doc.lower()
    doc = re.sub(PUNCTUATIONS_RE, '', doc)
    doc = re.sub(SPECIAL_CHARS_RE, '', doc)
    return doc   

In [103]:
corpus_processed = [remove_punct(doc) for doc in corpus]

In [104]:
corpus_processed

['i love my dog really', 'you love your cat', 'they love their birds']

### As seen in the output, all the punctuation and special characters are removed.
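
As an alternative to regular expressions, the standard library's `str.translate` can delete a fixed set of characters in one pass. This is a different technique from the one used above, shown only as a sketch; `remove_punct_translate` is a hypothetical name, and `string.punctuation` covers all ASCII punctuation, which is a broader set than the two patterns above.

```python
import string

# Translation table that maps every ASCII punctuation character to None.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def remove_punct_translate(doc):
    """Lowercase the text and delete all ASCII punctuation characters."""
    return doc.lower().translate(_DELETE_PUNCT)

docs = ["I (love) my dog, really?", "You [love] your {cat}!", "#They *love /their %birds$;"]
print([remove_punct_translate(doc) for doc in docs])
# ['i love my dog really', 'you love your cat', 'they love their birds']
```

Because this version also strips apostrophes and hyphens, a word like "don't" becomes "dont"; the regular expression approach gives finer control over which characters are removed.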

In [105]:
# Let's combine both techniques and preprocess the text

In [110]:
def preprocess_text(doc, tokenizer):
    """Lowercase doc, strip punctuation and special characters, drop stop words."""
    doc = doc.lower()
    doc = re.sub(PUNCTUATIONS_RE, '', doc)
    doc = re.sub(SPECIAL_CHARS_RE, '', doc)
    tokens = tokenizer.tokenize(doc)
    # Keep only the tokens that are not in the stop word list
    result = [token for token in tokens if token not in stop_words]
    # Join the remaining tokens back into a single string
    return " ".join(result)

In [111]:
corpus = ["I (love) my dog, really?", "You [love] your {cat}!", "#They *love /their %birds$;"]

In [114]:
tokenizer = TreebankWordTokenizer()
corpus_processed = [preprocess_text(doc, tokenizer) for doc in corpus] 

In [115]:
print(corpus_processed)

['love dog really', 'love cat', 'love birds']


#### As the output shows, all punctuation, special characters, and stop words are removed. This preprocessed text can then be used for semantic analysis. Note also how much smaller the vocabulary is after preprocessing.