# 1. Introduction
## 1.1 Definition
## Stemming
* **Stemming** is a natural language processing technique *used to reduce words to their base or root form, often by removing suffixes.*
* The main goal of **stemming** is to collapse the variants of a word (for example, *connect*, *connected*, *connecting*) into a single term, which is useful in tasks like
  * information retrieval,
  * search engines, and
  * text analysis.



### How Stemming Works
**Stemming** algorithms, such as the **Porter Stemmer,** apply a fixed sequence of suffix-stripping rules to each word, without consulting a dictionary. For example:
* **"running"** becomes **"run"**
* **"jumps"** becomes **"jump"**
* **"easily"** becomes **"easili"** (the stem need not be a dictionary word)
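
To make the idea of rule-based suffix stripping concrete, here is a minimal sketch in plain Python. It is a toy, not the Porter algorithm: the rule list, the minimum stem length of 3, and the name `toy_stem` are all assumptions chosen for illustration.

```python
# Toy suffix stripper (NOT the Porter algorithm): the first matching rule wins.
# Each rule is (suffix to remove, text to put in its place).
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),   # protects words like "glass" from the plain "s" rule
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

MIN_STEM_LENGTH = 3  # assumption: never strip a word down to fewer than 3 letters


def toy_stem(word: str) -> str:
    """Lowercase the word and strip at most one suffix from SUFFIX_RULES."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            stem = word[: len(word) - len(suffix)] + replacement
            # After removing "ing"/"ed", undo a doubled final consonant: "runn" -> "run".
            if (
                suffix in ("ing", "ed")
                and len(stem) >= 2
                and stem[-1] == stem[-2]
                and stem[-1] not in "aeioulsz"
            ):
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "jumps", "caresses", "sing"]:
    print(w, "->", toy_stem(w))
# running -> run
# jumps -> jump
# caresses -> caress
# sing -> sing
```

A real stemmer such as the Porter Stemmer applies several such rule groups in sequence and checks conditions on the remaining stem, which is why its results differ from this sketch on many words.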

### Applications of Stemming
* **Search Engines:** Improves search results by matching different forms of a word.
* **Text Mining and Sentiment Analysis:** Helps in clustering similar terms.
* **Document Classification:** Reduces the dimensionality of feature space by grouping similar words.
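
As a rough illustration of the search-engine use case, the sketch below builds a small inverted index keyed by stemmed terms, so a query for one form of a word finds documents containing its other forms. The helper names (`crude_stem`, `build_index`, `search`) and the sample documents are hypothetical; `crude_stem` is only a stand-in, and a real stemmer such as `PorterStemmer().stem` could be passed in its place.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


def crude_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: lowercase, then strip one common suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: Dict[str, str], stem: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Map each stemmed term to the ids of the documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str, stem: Callable[[str], str]) -> Set[str]:
    """Stem the query the same way as the documents, then look it up."""
    return index.get(stem(query), set())


docs = {
    "d1": "she jumps over the fence",
    "d2": "they jumped yesterday",
    "d3": "jumping is fun",
    "d4": "he walks home",
}
index = build_index(docs, crude_stem)

print(sorted(search(index, "jump", crude_stem)))     # ['d1', 'd2', 'd3']
print(sorted(search(index, "walking", crude_stem)))  # ['d4']
```

The important design point is that documents and queries go through the same stemming function; otherwise the index keys and the query terms would not line up.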

However, **stemming** often produces non-dictionary stems (such as "histori" or "peopl") and can conflate unrelated words, which makes it less precise than techniques like **lemmatization**, which maps each word to its dictionary form.

In [None]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [None]:
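# 'all' downloads every NLTK corpus and model, which is slow and large.
# This notebook only needs the sentence tokenizer data ('punkt', or 'punkt_tab'
# on recent NLTK releases) and the 'stopwords' corpus.
nltk.download('all')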

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

In [None]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture, their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my first vision is that of freedom. I believe that India got its first vision of this in 1857, when we started the War of Independence. It is this freedom that we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career."""

### Tokenizing the **paragraph** into sentences

In [None]:
ps = PorterStemmer()
sent_list = nltk.sent_tokenize(paragraph)
sent_list

['I have three visions for India.',
 'In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my first vision is that of freedom.',
 'I believe that India got its first vision of this in 1857, when we started the War of Independence.',
 'It is this freedom that we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s development.',
 'For fifty years we have been a developing nation.',
 'It is time we see ourselves as a devel
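
In the sentence list above, `others.That` stays inside a single sentence because the source paragraph has no space after that period. One possible pre-processing step, sketched here under the assumption that a period squeezed between a lowercase and an uppercase letter always marks a sentence boundary, is to insert the missing space before tokenizing. The function name `add_missing_sentence_spaces` is hypothetical.

```python
import re


def add_missing_sentence_spaces(text: str) -> str:
    """Insert a space after a period that sits directly between a lowercase and an uppercase letter."""
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)


sample = "Because we respect the freedom of others.That is why my first vision is that of freedom."
print(add_missing_sentence_spaces(sample))
# Because we respect the freedom of others. That is why my first vision is that of freedom.
```

This heuristic leaves text such as `Dr. Vikram` or `3.5` untouched, but it would also split names like `Node.JS`, so it suits plain prose better than technical text.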

### **Stopwords** for different languages

In [None]:
# Uncomment one line to list the stop words for that language:
# stopwords.words('german')
# stopwords.words('french')
# stopwords.words('english')
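
NLTK's English stop-word list is all lowercase, so a plain membership test is case-sensitive. That is why capitalised words such as "I", "In" and "We" survive the filtering in this notebook's output. The sketch below shows the effect with a tiny hand-picked stand-in for `stopwords.words('english')`; the set `STOP_WORDS` and the function `remove_stop_words` are assumptions for illustration, and the real list is much longer.

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {"i", "we", "have", "for", "the", "of", "not"}


def remove_stop_words(tokens, case_sensitive=True):
    """Return the tokens that are not stop words, optionally ignoring case."""
    kept = []
    for token in tokens:
        key = token if case_sensitive else token.lower()
        if key not in STOP_WORDS:
            kept.append(token)
    return kept


tokens = ["I", "have", "three", "visions", "for", "India", "."]

print(remove_stop_words(tokens))
# ['I', 'three', 'visions', 'India', '.']

print(remove_stop_words(tokens, case_sensitive=False))
# ['three', 'visions', 'India', '.']
```

Lowercasing each token before the check is usually what is wanted for stop-word removal, although it changes the results shown in this notebook.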

### **Stemming** the entire paragraph using a **for loop**

In [None]:
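stop_words = set(stopwords.words('english'))  # build the set once, not once per word

for sentence in sent_list:
  for word in nltk.word_tokenize(sentence):
    # The check is case-sensitive, so capitalised stop words such as "I" or "We" are kept
    if word not in stop_words:
      print(ps.stem(word))  # one stem per line; sent_list itself is left unchanged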

i
three
vision
india
.
in
3000
year
histori
,
peopl
world
come
invad
us
,
captur
land
,
conquer
mind
.
from
alexand
onward
,
greek
,
turk
,
mogul
,
portugues
,
british
,
french
,
dutch
,
came
loot
us
,
took
.
yet
done
nation
.
we
conquer
anyon
.
we
grab
land
,
cultur
,
histori
tri
enforc
way
life
.
whi
?
becaus
respect
freedom
others.that
first
vision
freedom
.
i
believ
india
got
first
vision
1857
,
start
war
independ
.
it
freedom
must
protect
nurtur
build
.
if
free
,
one
respect
us
.
my
second
vision
india
’
develop
.
for
fifti
year
develop
nation
.
it
time
see
develop
nation
.
we
among
top
5
nation
world
term
gdp
.
we
10
percent
growth
rate
area
.
our
poverti
level
fall
.
our
achiev
global
recognis
today
.
yet
lack
self-confid
see
develop
nation
,
self-reli
self-assur
.
isn
’
incorrect
?
i
third
vision
.
india
must
stand
world
.
becaus
i
believ
unless
india
stand
world
,
one
respect
us
.
onli
strength
respect
strength
.
we
must
strong
militari
power
also
econom
power
.
both
must
go
h

### **Stemming** the entire paragraph using a list comprehension

In [None]:
sent_list = nltk.sent_tokenize(paragraph)
stop_words = set(stopwords.words('english'))  # build the set once, outside the loop

for i in range(len(sent_list)):
  word_list = nltk.word_tokenize(sent_list[i])  # split the current sentence into word tokens
  words = [ps.stem(word) for word in word_list if word not in stop_words]  # filter with "word not in"
  sent_list[i] = ' '.join(words)  # replace the sentence with its stemmed, filtered version
  print(sent_list[i])

i three vision india .
in 3000 year histori , peopl world come invad us , captur land , conquer mind .
from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .
yet done nation .
we conquer anyon .
we grab land , cultur , histori tri enforc way life .
whi ?
becaus respect freedom others.that first vision freedom .
i believ india got first vision 1857 , start war independ .
it freedom must protect nurtur build .
if free , one respect us .
my second vision india ’ develop .
for fifti year develop nation .
it time see develop nation .
we among top 5 nation world term gdp .
we 10 percent growth rate area .
our poverti level fall .
our achiev global recognis today .
yet lack self-confid see develop nation , self-reli self-assur .
isn ’ incorrect ?
i third vision .
india must stand world .
becaus i believ unless india stand world , one respect us .
onli strength respect strength .
we must strong militari power also econom power .
both must go h

In [None]:
sent_list = nltk.sent_tokenize(paragraph)
stop_words = set(stopwords.words('english'))  # build the set once, outside the loop

for i in range(len(sent_list)):
  word_list = nltk.word_tokenize(sent_list[i])  # split the current sentence into word tokens
  # "not word in" is equivalent to "word not in"; the output matches the previous cell
  words = [ps.stem(word) for word in word_list if not word in stop_words]
  sent_list[i] = ' '.join(words)  # replace the sentence with its stemmed, filtered version
  print(sent_list[i])

i three vision india .
in 3000 year histori , peopl world come invad us , captur land , conquer mind .
from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .
yet done nation .
we conquer anyon .
we grab land , cultur , histori tri enforc way life .
whi ?
becaus respect freedom others.that first vision freedom .
i believ india got first vision 1857 , start war independ .
it freedom must protect nurtur build .
if free , one respect us .
my second vision india ’ develop .
for fifti year develop nation .
it time see develop nation .
we among top 5 nation world term gdp .
we 10 percent growth rate area .
our poverti level fall .
our achiev global recognis today .
yet lack self-confid see develop nation , self-reli self-assur .
isn ’ incorrect ?
i third vision .
india must stand world .
becaus i believ unless india stand world , one respect us .
onli strength respect strength .
we must strong militari power also econom power .
both must go h