In [2]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/alok-kumar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/alok-
[nltk_data]     kumar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Stop Words

Some extremely common words appear to be of little value in helping select documents that match a user need, so they are sometimes excluded from the vocabulary entirely. These words are called stop words.

The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection) and then take the most frequent terms as the stop list. The candidates are often hand-filtered for their semantic content relative to the domain of the documents being indexed. The members of the stop list are then discarded during indexing.
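The strategy described above can be sketched with `collections.Counter`. This is a minimal illustration, not the procedure used to build the Reuters-RCV1 list: the helper name `build_stop_list`, the three toy documents, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter

def build_stop_list(documents, top_n=5):
    """Return the top_n terms ranked by collection frequency (hypothetical helper)."""
    counts = Counter()
    for doc in documents:
        # Collection frequency: count every occurrence across all documents.
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(top_n)]

# Toy collection, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat",
]

candidate_stop_list = build_stop_list(docs, top_n=2)
print(candidate_stop_list)
```

In this toy collection the two most frequent terms are `the` and `cat`. Only `the` is semantically non-selective, which is why the frequency-ranked candidates are usually hand-filtered before being used as a stop list.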

<img src="https://nlp.stanford.edu/IR-book/html/htmledition/img95.png">

> A stop list of 25 semantically non-selective words which are common in Reuters-RCV1.

> **Source**: https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

In [3]:
# import libraries
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [4]:
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


### Step 1: Normalize the text

In [5]:
# Convert to lower case and remove punctuation.
# \w already matches digits, so a separate \d in the character class is redundant.
normalized_text = re.sub(r'[^\w\s]', '', text.lower())
normalized_text

'the first time you see the second renaissance it may look boring look at it at least twice and definitely watch part 2 it will change your view of the matrix are the human people the ones who started the war  is ai a bad thing '
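As an alternative to the regular expression, the same normalization can be sketched with `str.translate` and `string.punctuation` from the standard library. The function name `normalize` is a hypothetical helper introduced here. The two approaches are not identical: `string.punctuation` covers only ASCII punctuation and includes the underscore, whereas `\w` in the regex keeps underscores.

```python
import string

def normalize(text):
    """Lowercase the text and delete every ASCII punctuation character."""
    # str.maketrans with a third argument maps those characters to None.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)

sample = "Is AI a bad thing ?"
print(repr(normalize(sample)))
```

Either way, removing a punctuation mark that was surrounded by spaces leaves extra whitespace behind (as in the output above). This is harmless here because the tokenizer in the next step splits on whitespace runs.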

### Step 2: Tokenize the text into words

In [7]:
# Use word_tokenize for tokenization
tokenized_text = word_tokenize(normalized_text)
print(tokenized_text)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


### Step 3: Remove stop words

In [8]:
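# Construct a word list that contains no stop words.
# Build the stop-word set once instead of calling stopwords.words() for every token.
stop_words = set(stopwords.words('english'))
words = [w for w in tokenized_text if w not in stop_words]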
print(words)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']


In [9]:
# Compare both word lists to see how many stop words were removed
org_word_len = len(tokenized_text)
filtered_word_len = len(words)

print("Total {} stop words were removed after applying the stopwords corpus.".format(org_word_len - filtered_word_len))

Total 20 stop words were removed after applying the stopwords corpus.
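The count above says how many tokens were dropped but not which ones. A minimal sketch of tallying the removed stop words follows. To stay self-contained it uses a small hand-written set, `sample_stop_words`, as a stand-in for `stopwords.words('english')`, and a shortened token list; both are assumptions for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the NLTK English stop-word list.
sample_stop_words = {"the", "it", "at", "is", "a"}

# Shortened token list, invented for illustration.
tokens = ["the", "war", "is", "a", "bad", "thing", "the", "matrix"]

# Tally each removed stop word and keep everything else in order.
removed = Counter(t for t in tokens if t in sample_stop_words)
kept = [t for t in tokens if t not in sample_stop_words]

print("Removed:", dict(removed))
print("Kept:", kept)
print("Total removed:", sum(removed.values()))
```

In the notebook, the same tally could be obtained by replacing `sample_stop_words` with the NLTK stop-word set and `tokens` with `tokenized_text`.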
