### Stop Words
- Some words in a corpus do not play a major role in determining the output of a model.
- For example, words like 'I', 'them', 'he', and 'she' carry little information for most models.
- These words are known as stop words.
- They are usually removed from the corpus before further processing.
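
As a minimal sketch of the idea above, the snippet below filters a sentence against a tiny hand-picked stop list. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names used for illustration only; NLTK's English list is much longer.

```python
# Hypothetical, hand-picked stop list for illustration only.
STOP_WORDS = {"i", "them", "he", "she", "the", "a", "to", "is"}

def remove_stop_words(text):
    """Return the lowercase words of `text` that are not in STOP_WORDS."""
    words = text.lower().split()
    return [w for w in words if w not in STOP_WORDS]

kept = remove_stop_words("She gave them the report")
print(kept)  # ['gave', 'report']
```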

### Process for removal of Stop Words
- 1) Remove stop words from the corpus.
- 2) Apply lemmatization (preferred over stemming) to the remaining words.
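
The two steps can be sketched as a small pipeline. This is an illustration under stated assumptions, not the notebook's implementation: `STOP_WORDS` is a hypothetical mini list, and `toy_normalize` is a toy stand-in that only strips a plural "s", so it is not a real lemmatizer.

```python
import re

STOP_WORDS = {"the", "are", "of", "a", "in"}  # hypothetical mini stop list

def toy_normalize(word):
    """Toy stand-in for a lemmatizer: strips a trailing plural 's' only."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(sentence, normalize=toy_normalize):
    """Remove stop words first, then normalize the remaining words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]  # step 1
    return [normalize(t) for t in kept]                # step 2

result = preprocess("The dreams of the students are thoughts in action")
print(result)  # ['dream', 'student', 'thought', 'action']
```

Filtering before normalizing means the stop list is matched against the words as they were written, and fewer words reach the normalization step.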

### In this document
- A) Stop Words Removal and PorterStemmer Stemming
- B) Stop Words Removal and SnowballStemmer Stemming
- C) Stop Words Removal and WordNetLemmatizer Lemmatization

In [26]:
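import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

nltk.download('stopwords')  # Download the stop word lists
nltk.download('wordnet', quiet=True)  # Data used by WordNetLemmatizer
# sent_tokenize() and word_tokenize() below also need the Punkt tokenizer data
# ('punkt', or 'punkt_tab' on newer NLTK releases) if it is not already installed.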

[nltk_data] Downloading package stopwords to /Users/ankur/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [27]:
corpus = """
I have three visions for India. In three thousand years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds, and taken away our wealth. Yet, we have not done this to any other nation. Why? Because we respected the freedom of others. That is why my first vision is that India must stand up to the world. No one should underestimate us. We should be strong not only as a military power but also as an economic power. Both must go hand in hand.
My second vision for India is development. For fifty years we have been a developing nation. It is time we see ourselves as a developed nation. We are among the top five nations of the world in terms of GDP, but we are still a developing nation. Why? We are still not free from poverty, illiteracy, and unemployment. A developed India means a nation where people are empowered, where education and health are accessible to all, where science and technology drive growth, and where self-reliance is a way of life.
My third vision for India is that India must stand up to the world. Because only strength respects strength. We must be strong enough to protect our sovereignty and confident enough to contribute to global peace and prosperity. Strength does not mean aggression; it means confidence, preparedness, and self-belief. A strong nation is one that respects others and is respected in return.
To achieve these visions, you—the young people of India—are the most important resource. If you want to shine like the sun, first burn like the sun. Dreams are not what you see in sleep; dreams are things which do not let you sleep. Dream, dream, dream. Dreams transform into thoughts, and thoughts result in action. Excellence is a continuous process, not an accident. You must have the courage to think differently, to invent, to travel the unexplored path, and to succeed.
If a country is to be corruption-free and become a nation of beautiful minds, I strongly feel there are three key societal members who can make a difference: the father, the mother, and the teacher. Parents must nurture values, teachers must ignite curiosity, and students must carry the nation forward with integrity and hard work. We should not give up, and we should not allow problems to defeat us. Confidence and hard work are the best weapons.
Let us together build an India that is strong, developed, and respected—a nation that future generations will be proud of.
"""

In [28]:
# Stop word lists are available for several languages
# They are read from the NLTK data directory (here /Users/ankur/nltk_data)
stopwords_english = stopwords.words('english')
print('stopwords - english = ',stopwords_english)

stopwords_hinglish = stopwords.words('hinglish')
print('stopwords - hinglish = ',stopwords_hinglish)

stopwords_arabic = stopwords.words('arabic')
print('stopwords - arabic = ',stopwords_arabic)

stopwords - english =  ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same',

In [29]:
# A) Stop word removal and PorterStemmer stemming

# 1) Break the corpus into documents (sentences)
# 2) Remove stop words from each sentence
# 3) Apply stemming to the remaining words
# 4) Rejoin the stemmed words with spaces to rebuild each sentence (keeps context)
porter_stemmer = PorterStemmer()
english_stopwords = set(stopwords.words('english'))  # build the set once, not once per word
documents = nltk.sent_tokenize(corpus)  # split the corpus into sentences

for i in range(len(documents)):
    words = nltk.word_tokenize(documents[i])
    # Steps 2 and 3. The stop list is lowercase and this check is case-sensitive,
    # so capitalized stop words such as 'I' or 'In' are kept.
    stemmed_words = [porter_stemmer.stem(word) for word in words if word not in english_stopwords]
    documents[i] = ' '.join(stemmed_words)  # Step 4

# After the loop, stemmed_words holds only the tokens of the last sentence
print('Stop words removed, Stemming applied: Result = ',stemmed_words)
print('Documents / Sentences with Stop words removed, Stemming applied: Result = ',documents)



Stop words removed, Stemming applied: Result =  ['let', 'us', 'togeth', 'build', 'india', 'strong', ',', 'develop', ',', 'respected—a', 'nation', 'futur', 'gener', 'proud', '.']
Documents / Sentences with Stop words removed, Stemming applied: Result =  ['i three vision india .', 'in three thousand year histori , peopl world come invad us , captur land , conquer mind , taken away wealth .', 'yet , done nation .', 'whi ?', 'becaus respect freedom other .', 'that first vision india must stand world .', 'no one underestim us .', 'we strong militari power also econom power .', 'both must go hand hand .', 'my second vision india develop .', 'for fifti year develop nation .', 'it time see develop nation .', 'we among top five nation world term gdp , still develop nation .', 'whi ?', 'we still free poverti , illiteraci , unemploy .', 'a develop india mean nation peopl empow , educ health access , scienc technolog drive growth , self-reli way life .', 'my third vision india india must stand wor
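
The output above still contains 'i', 'in', 'we', and similar words. The stop list is lowercase, so a capitalized token such as 'I' does not match it. The sketch below contrasts a case-sensitive check with a case-insensitive one; `STOP_WORDS` is a hypothetical subset of an English stop list and the token list is typed by hand.

```python
# Hypothetical subset of an English stop list (all lowercase).
STOP_WORDS = {"i", "in", "we", "have", "for", "of", "our"}

tokens = ["I", "have", "three", "visions", "for", "India", "."]

# Case-sensitive check: 'I' is not equal to 'i', so it survives.
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Case-insensitive check: compare the lowercased token instead.
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)    # ['I', 'three', 'visions', 'India', '.']
print(case_insensitive)  # ['three', 'visions', 'India', '.']
```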

In [30]:
# B) Stop word removal and SnowballStemmer stemming

# 1) Break the corpus into documents (sentences)
# 2) Remove stop words from each sentence
# 3) Apply stemming to the remaining words
# 4) Rejoin the stemmed words with spaces to rebuild each sentence (keeps context)
snowball_stemmer = SnowballStemmer('english')
english_stopwords = set(stopwords.words('english'))  # build the set once, not once per word
documents = nltk.sent_tokenize(corpus)  # split the corpus into sentences

for i in range(len(documents)):
    words = nltk.word_tokenize(documents[i])
    # Steps 2 and 3. The stop list is lowercase and this check is case-sensitive,
    # so capitalized stop words such as 'I' or 'In' are kept.
    stemmed_words = [snowball_stemmer.stem(word) for word in words if word not in english_stopwords]
    documents[i] = ' '.join(stemmed_words)  # Step 4

# After the loop, stemmed_words holds only the tokens of the last sentence
print('Stop words removed, Stemming applied: Result = ',stemmed_words)
print('Documents / Sentences with Stop words removed, Stemming applied: Result = ',documents)


Stop words removed, Stemming applied: Result =  ['let', 'us', 'togeth', 'build', 'india', 'strong', ',', 'develop', ',', 'respected—a', 'nation', 'futur', 'generat', 'proud', '.']
Documents / Sentences with Stop words removed, Stemming applied: Result =  ['i three vision india .', 'in three thousand year histori , peopl world come invad us , captur land , conquer mind , taken away wealth .', 'yet , done nation .', 'whi ?', 'becaus respect freedom other .', 'that first vision india must stand world .', 'no one underestim us .', 'we strong militari power also econom power .', 'both must go hand hand .', 'my second vision india develop .', 'for fifti year develop nation .', 'it time see develop nation .', 'we among top five nation world term gdp , still develop nation .', 'whi ?', 'we still free poverti , illiteraci , unemploy .', 'a develop india mean nation peopl empow , educ health access , scienc technolog drive growth , self-reli way life .', 'my third vision india india must stand w
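
The two stemmers mostly agree on this corpus, but the last sentence shows a difference: Porter produces 'gener' where Snowball produces 'generat'. A short side-by-side check, using only the two stemmer classes already imported in this notebook (no extra NLTK data is needed for them), might look like this. The word list is chosen by hand from the corpus.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Words taken from the corpus; print both stems next to each other.
for word in ["generations", "history", "together"]:
    print(word, "->", porter.stem(word), "|", snowball.stem(word))
```

Neither 'gener' nor 'generat' is a dictionary word, which is the usual trade-off with stemming.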

In [None]:
# C) Stop word removal and WordNetLemmatizer lemmatization (most preferred approach)

# 1) Break the corpus into documents (sentences)
# 2) Remove stop words from each sentence
# 3) Apply lemmatization to the remaining words
# 4) Rejoin the lemmatized words with spaces to rebuild each sentence (keeps context)
wordnet_lemmatizer = WordNetLemmatizer()
english_stopwords = set(stopwords.words('english'))  # build the set once, not once per word
documents = nltk.sent_tokenize(corpus)  # split the corpus into sentences

# wordnet_lemmatizer.lemmatize(word, pos='n'): pos is the part-of-speech tag, 'n' = noun
for i in range(len(documents)):
    words = nltk.word_tokenize(documents[i])
    # Steps 2 and 3. Every token is lemmatized as a noun, so verb forms such as
    # 'invaded' stay unchanged and 'us' becomes 'u'.
    lemmatized_words = [wordnet_lemmatizer.lemmatize(word, pos='n') for word in words if word not in english_stopwords]
    documents[i] = ' '.join(lemmatized_words)  # Step 4

# After the loop, lemmatized_words holds only the tokens of the last sentence
print('Stop words removed, Lemmatization applied: Result = ',lemmatized_words)
print('Documents / Sentences with Stop words removed, Lemmatization applied: Result = ',documents)

Stop words removed, Lemmatization applied: Result =  ['Let', 'u', 'together', 'build', 'India', 'strong', ',', 'developed', ',', 'respected—a', 'nation', 'future', 'generation', 'proud', '.']
Documents / Sentences with Stop words removed, Lemmatization applied: Result =  ['I three vision India .', 'In three thousand year history , people world come invaded u , captured land , conquered mind , taken away wealth .', 'Yet , done nation .', 'Why ?', 'Because respected freedom others .', 'That first vision India must stand world .', 'No one underestimate u .', 'We strong military power also economic power .', 'Both must go hand hand .', 'My second vision India development .', 'For fifty year developing nation .', 'It time see developed nation .', 'We among top five nation world term GDP , still developing nation .', 'Why ?', 'We still free poverty , illiteracy , unemployment .', 'A developed India mean nation people empowered , education health accessible , science technology drive growth , self-reli
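
In the output above, 'us' passes the stop word filter and is then lemmatized to 'u', and punctuation tokens such as ',' and '.' are kept. One possible follow-up, sketched here under assumptions, is to extend the stop list with extra words and drop tokens that contain no letters or digits. `BASE_STOP_WORDS` is a hypothetical stand-in for `stopwords.words('english')`, and `EXTRA_STOP_WORDS` and `keep` are illustrative names.

```python
# Hypothetical stand-in for the NLTK English stop list.
BASE_STOP_WORDS = {"an", "that", "is", "and"}
# Extra words that this sketch adds to the list.
EXTRA_STOP_WORDS = {"us"}
stop_words = BASE_STOP_WORDS | EXTRA_STOP_WORDS

tokens = ["Let", "us", "together", "build", "an", "India", ",", "strong", "."]

def keep(token):
    """Keep tokens that are not stop words and contain a letter or digit."""
    has_alnum = any(ch.isalnum() for ch in token)
    return has_alnum and token.lower() not in stop_words

filtered = [t for t in tokens if keep(t)]
print(filtered)  # ['Let', 'together', 'build', 'India', 'strong']
```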