### <mark> Stemming using nltk
    
    Stemming reduces words to their base or root form.
    There are several stemming algorithms, including the Porter, Snowball, and Lancaster stemmers.
    To build a robust model, it is important to normalize text by removing repetition and transforming words to their base form through stemming.
    Every word can be represented as an alternating sequence of consonant runs (C) and vowel runs (V), in one of four forms:
        CVCV…C
        CVCV…V
        VCVC…C
        VCVC…V
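
The C/V forms listed above can be computed directly. The sketch below collapses a word into its consonant/vowel pattern and counts the `VC` pairs, which the Porter algorithm calls the measure of a word. The helper names (`is_consonant`, `cv_pattern`, `measure`) are illustrative and not part of NLTK; the treatment of `y` (a vowel only when it follows a consonant) follows Porter's published definition.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if word[i] counts as a consonant (word must be lowercase)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Collapse a word into alternating C/V runs, e.g. 'trouble' -> 'CVCV'."""
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != symbol:
            pattern.append(symbol)
    return "".join(pattern)

def measure(word):
    """Number of VC pairs in the collapsed pattern."""
    return cv_pattern(word).count("VC")

for w in ["tree", "trouble", "troubles"]:
    print(w, cv_pattern(w), measure(w))
```

Many Porter rules fire only when the measure of the remaining stem is above a threshold, which is why short words are often left unchanged.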

##### <mark> Types of stemmers

**PORTER**
    
    A suffix-stripping algorithm.
    It uses predefined rules to strip suffixes from words.
    It applies more than 50 rules, grouped into 5 steps.
    It is the most commonly used stemmer.
    It is limited to English words.
    The resulting stem need not be a meaningful word.
    It is based on the idea that suffixes in the English language are made up of combinations of smaller and simpler suffixes.
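
To make the rule-based stripping concrete, here is a minimal sketch of a single step in the style of Porter's step 1a (plural endings), using the rules from the original published algorithm. The names `STEP_1A_RULES` and `step_1a` are illustrative; this is not NLTK's implementation, which by default applies some extra variations and may give different results for a few words.

```python
# Rules are tried in order; only the first matching suffix is applied.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects 'ss' from the next rule)
    ("s", ""),       # cats     -> cat
]

def step_1a(word):
    """Apply the first matching plural-stripping rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats", "run"]:
    print(w, "-->", step_1a(w))
```

The `ponies --> poni` result shows why a stem need not be a meaningful word: the rule only rewrites the ending and never checks a dictionary.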
    
**LANCASTER**
    
    One of the most aggressive stemmers; it tends to over-stem many words.
    It has stricter rules than the Porter stemmer.
    It has more than 100 rules, around double that of the Porter stemmer.
        Each rule specifies either the deletion or the replacement of an ending.
        Some rules are restricted to intact words, and some rules are applied iteratively as the word goes through them.
    The stemmer is fast, but its output can be confusing for short words.
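
The rule format described above (delete or replace an ending, some rules for intact words only, some applied iteratively) can be illustrated with a toy stemmer. The rule table below is hypothetical and much smaller than Lancaster's real one, and `toy_iterative_stem` and `min_len` are names invented for this sketch; it only shows the control flow, not the actual algorithm.

```python
# (ending, replacement, intact_only, continue_after) -- hypothetical rules,
# NOT the actual Lancaster rule table.
TOY_RULES = [
    ("ness", "", False, True),
    ("ful", "", False, True),
    ("ing", "", False, True),
    ("ly", "", False, True),
    ("s", "", True, False),  # only applies to words no earlier rule has changed
]

def toy_iterative_stem(word, rules=TOY_RULES, min_len=3):
    stem = word
    intact = True  # becomes False once any rule has modified the word
    while True:
        for ending, replacement, intact_only, continue_after in rules:
            if not stem.endswith(ending):
                continue
            if intact_only and not intact:
                continue
            candidate = stem[: len(stem) - len(ending)] + replacement
            if len(candidate) < min_len:
                continue  # refuse to shrink the stem below min_len
            stem, intact = candidate, False
            if continue_after:
                break  # rescan the rule table with the new stem
            return stem
        else:
            return stem  # no rule applied in this pass

for w in ["hopefulness", "carelessness", "cats", "sing"]:
    print(w, "-->", toy_iterative_stem(w))
```

Repeated passes are what make this style of stemmer aggressive: `hopefulness` loses two endings in a row, while the intact-only rule leaves the final `s` of `careless` alone.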

**SNOWBALL**
    
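    A multi-lingual stemmer: it provides stemmers for several languages, not only English.
    Its English stemmer is also referred to as Porter2; it is a revised version of the Porter stemmer and is often described as slightly more aggressive.
    It is generally reported to be faster than the Porter stemmer.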

##### <mark> Two main types of stemming errors

**over-stemming**
    
    Occurs when the stemmer is too aggressive in removing suffixes or when it does not consider the context of the word.
    The stemmer produces a root that is not a valid word or is not the correct root of the word, so words with different meanings can collapse to the same stem.

**under-stemming**
    
    Occurs when the stemmer is not aggressive enough in removing suffixes, so related forms of the same word end up with different stems.
    
In some cases, using a lemmatizer instead of a stemmer may be a better solution as it takes into account the context of the word, making it less prone to errors.
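
Both error types can be counted once words are grouped by meaning. The sketch below takes a word-to-stem mapping and flags pairs that were merged although they belong to different groups (over-stemming) or kept apart although they belong to the same group (under-stemming). The function name `stemming_errors` and the grouping are assumptions made for illustration; the stems are copied from the Porter outputs in this notebook.

```python
from itertools import combinations

def stemming_errors(stems, concept_groups):
    """Return (over_stemmed_pairs, under_stemmed_pairs) for a word -> stem mapping."""
    concept_of = {word: i for i, group in enumerate(concept_groups) for word in group}
    over, under = [], []
    for a, b in combinations(sorted(concept_of), 2):
        same_concept = concept_of[a] == concept_of[b]
        same_stem = stems[a] == stems[b]
        if same_stem and not same_concept:
            over.append((a, b))   # unrelated words merged into one stem
        elif same_concept and not same_stem:
            under.append((a, b))  # related words left with different stems
    return over, under

# Porter stems copied from the outputs in this notebook.
porter_stems = {
    "generous": "gener", "generously": "gener",
    "generate": "gener", "generation": "gener",
    "run": "run", "running": "run", "runs": "run", "ran": "ran",
}
# Assumed grouping of words by meaning (not part of the original notes).
groups = [
    {"generous", "generously"},
    {"generate", "generation"},
    {"run", "running", "runs", "ran"},
]

over, under = stemming_errors(porter_stems, groups)
print("over-stemmed pairs: ", over)
print("under-stemmed pairs:", under)
```

With this grouping, the four `generous`/`generate` cross pairs are flagged as over-stemmed, and the three pairs involving `ran` are flagged as under-stemmed.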

In [6]:
import nltk

from nltk.stem.porter import PorterStemmer  # explicit import instead of a wildcard
p_stemmer = PorterStemmer()

In [2]:
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

for word in words:
    print(word+' --> '+ p_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli


In [3]:
from nltk.stem.snowball import SnowballStemmer
s_stemmer = SnowballStemmer(language='english')

words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

for word in words:
    print(word+' --> ' + s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair


In [5]:
words = ['generous', 'generation', 'generously', 'generate']

for word in words:
    print(word+' --> ' + s_stemmer.stem(word))        # first line of each pair: Snowball
    print(word+' --> ' + p_stemmer.stem(word), '\n')  # second line of each pair: Porter

generous --> generous
generous --> gener 

generation --> generat
generation --> gener 

generously --> generous
generously --> gener 

generate --> generat
generate --> gener 



In [10]:
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
ls = LancasterStemmer()

In [15]:
words = ["is", "was", "be", "been", "are", "were"]

for w in words:
    print(f'{w} ---> {ps.stem(w)} ---> {ls.stem(w)}')

is ---> is ---> is
was ---> wa ---> was
be ---> be ---> be
been ---> been ---> been
are ---> are ---> ar
were ---> were ---> wer


In [16]:
words = ["book","booking","booked","books","booker","bookstore"]

for w in words:
    print(f'{w} ---> {ps.stem(w)} ---> {ls.stem(w)}')

book ---> book ---> book
booking ---> book ---> book
booked ---> book ---> book
books ---> book ---> book
booker ---> booker ---> book
bookstore ---> bookstor ---> bookst


In [19]:
sentence = 'Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of human beings or animals.'
word_list = word_tokenize(sentence)

print("{0:20}{1:20}{2:20}".format("Original Word", "Porter Stemmer", "lancaster Stemmer"))

for word in word_list:
    print("{0:20}{1:20}{2:20}".format(word, ps.stem(word), ls.stem(word)))

Original Word       Porter Stemmer      lancaster Stemmer   
Artificial          artifici            art                 
intelligence        intellig            intellig            
(                   (                   (                   
AI                  ai                  ai                  
)                   )                   )                   
is                  is                  is                  
the                 the                 the                 
intelligence        intellig            intellig            
of                  of                  of                  
machines            machin              machin              
or                  or                  or                  
software            softwar             softw               
,                   ,                   ,                   
as                  as                  as                  
opposed             oppos               oppos               
to                  to  

In [20]:
word_list = ["friend", "friendship", "friends", "friendships", "stabil", "destabilize", "misunderstanding",
             "railroad", "moonlight", "football"]

print("{0:20}{1:20}{2:20}".format("Original Word", "Porter Stemmer", "lancaster Stemmer"))

for word in word_list:
    print("{0:20}{1:20}{2:20}".format(word, ps.stem(word), ls.stem(word)))

Original Word       Porter Stemmer      lancaster Stemmer   
friend              friend              friend              
friendship          friendship          friend              
friends             friend              friend              
friendships         friendship          friend              
stabil              stabil              stabl               
destabilize         destabil            dest                
misunderstanding    misunderstand       misunderstand       
railroad            railroad            railroad            
moonlight           moonlight           moonlight           
football            footbal             footbal             
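
The comparison tables above repeat the same formatting code in each cell. A small helper can build the table for any set of stemming functions. This is a sketch: `comparison_table` is a hypothetical name, and the two stand-in functions below only demonstrate the layout without requiring NLTK.

```python
def comparison_table(words, stemmers, width=20):
    """Build a fixed-width table with one column per stemming function.

    stemmers maps a column title to a callable that takes a word and returns a stem.
    """
    header = "".join(f"{name:{width}}" for name in ["Original Word", *stemmers])
    rows = [
        "".join(
            f"{cell:{width}}"
            for cell in [word, *(stem(word) for stem in stemmers.values())]
        )
        for word in words
    ]
    return "\n".join([header, *rows])

# Stand-in "stemmers" so the sketch runs on its own.
demo = {"Lowercase": str.lower, "First 4 letters": lambda w: w[:4]}
print(comparison_table(["Booking", "Friendship"], demo))
```

In this notebook the call would look like `comparison_table(word_list, {"Porter Stemmer": ps.stem, "Lancaster Stemmer": ls.stem})`.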


### <mark> Lemmatization using spaCy

In [22]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [27]:
doc1 = nlp(u"Saturn is the sixth planet from the Sun and the second-largest in the Solar System, after Jupiter.")
print(type(doc1))

for token in doc1:
    print(token.text, '\t', token.pos_, '\t', token.lemma_)

<class 'spacy.tokens.doc.Doc'>
Saturn 	 NOUN 	 saturn
is 	 AUX 	 be
the 	 DET 	 the
sixth 	 ADJ 	 sixth
planet 	 NOUN 	 planet
from 	 ADP 	 from
the 	 DET 	 the
Sun 	 PROPN 	 Sun
and 	 CCONJ 	 and
the 	 DET 	 the
second 	 ADV 	 second
- 	 PUNCT 	 -
largest 	 ADJ 	 large
in 	 ADP 	 in
the 	 DET 	 the
Solar 	 PROPN 	 Solar
System 	 PROPN 	 System
, 	 PUNCT 	 ,
after 	 ADP 	 after
Jupiter 	 PROPN 	 Jupiter
. 	 PUNCT 	 .


In [29]:
def show_lemmas(doc):
    # token.lemma is the integer hash ID of the lemma; token.lemma_ is its string form
    for token in doc:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [30]:
doc2 = nlp(u"Jupiter is the fifth planet from the Sun.")

show_lemmas(doc2)

Jupiter      PROPN  2889603257431922515    Jupiter
is           AUX    10382539506755952630   be
the          DET    7425985699627899538    the
fifth        ADJ    4490412142941298567    fifth
planet       NOUN   2468667252130234137    planet
from         ADP    7831658034963690409    from
the          DET    7425985699627899538    the
Sun          PROPN  2663045040185303238    Sun
.            PUNCT  12646065887601541794   .


#### <mark> Lemmatization using nltk

In [32]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [35]:
words = ["cats", "cacti", "radii", "feet", "speechless", 'runner']

for word in words:
    print(lemmatizer.lemmatize(word))

cat
cactus
radius
foot
speechless
runner


In [37]:
print(lemmatizer.lemmatize("enjoying", "n"))
print(lemmatizer.lemmatize("enjoying", 'v'))

enjoying
enjoy


In [44]:
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

sentence = "We are a non-profit organisation focused on dialogue and advocacy, and memory and legacy work, founded by enjoying people"
punctuations = set("?:!.,;")
# Build a new list instead of removing items from the list while iterating over it
sentence_words = [word for word in nltk.word_tokenize(sentence) if word not in punctuations]

# sentence_words
print("{0:20}{1:20}{2:20}".format("Word", "Lemma(pos='noun')", "Lemma(pos='verb')"))
for word in sentence_words:
    print("{0:20}{1:20}{2:20}".format(word, wordnet_lemmatizer.lemmatize(word), wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma(pos='noun')   Lemma(pos='verb')   
We                  We                  We                  
are                 are                 be                  
a                   a                   a                   
non-profit          non-profit          non-profit          
organisation        organisation        organisation        
focused             focused             focus               
on                  on                  on                  
dialogue            dialogue            dialogue            
and                 and                 and                 
advocacy            advocacy            advocacy            
and                 and                 and                 
memory              memory              memory              
and                 and                 and                 
legacy              legacy              legacy              
work                work                work                
founded             foun

In [45]:
words = ["is", "was", "be", "been", "are", "were"]

for word in words:
    print(lemmatizer.lemmatize(word))  # default pos='n': every word is treated as a noun

is
wa
be
been
are
were
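
The output above (`was` becomes `wa`, `are` stays `are`) comes from the lemmatizer treating every word as a noun by default. One common remedy is to pass a part-of-speech tag per word. The helper below is a sketch that maps Penn Treebank tags to the single-letter POS codes WordNet uses (`'n'`, `'v'`, `'a'`, `'r'`); the name `penn_to_wordnet` is an assumption, not an NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else

for tag in ["VBD", "NNS", "JJ", "RB", "DT"]:
    print(tag, "-->", penn_to_wordnet(tag))
```

With tags from `nltk.pos_tag(tokens)`, each word could then be lemmatized as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, assuming the tagger resources are installed.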


In [47]:
words = ["feet", "radii", "men", "children", "carpenter", "fighter"]
for word in words:
    print(lemmatizer.lemmatize(word,'n'))

foot
radius
men
child
carpenter
fighter
