<a href="https://colab.research.google.com/github/fmaina1/nlp-fellowship/blob/main/Stemmer%26Lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stemming
*A stemming algorithm, a procedure to reduce all words with the same
stem to a common form, is useful in many areas of computational linguistics and information-retrieval work.* ~ Lovins, 1968

Examples of stemmers available in NLTK include:

1. PorterStemmer
2. SnowballStemmer
3. LancasterStemmer
4. RegexpStemmer



In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

## PorterStemmer
Designed by Martin Porter and published in 1980.

It strips suffixes in five sequential steps, each with its own set of mapping rules. It is simple and fast.
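
To make the idea of a step with its own mapping rules concrete, here is a minimal sketch of only the first rule group (step 1a) of Porter's algorithm. The function name `porter_step_1a` and the `STEP_1A_RULES` table are hypothetical names chosen for illustration; this is not NLTK's implementation, and the remaining steps (which also check conditions on the stem) are left out.

```python
# Step 1a of Porter (1980): suffix -> replacement, longest suffix first.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (left unchanged)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            # Only the first (longest) matching rule is applied.
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "--->", porter_step_1a(w))
```

The full stemmer chains further steps after this one, so the output of step 1a is an intermediate form rather than the final stem.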


In [None]:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer, WordNetLemmatizer
words = ["friend", "friendship", "friends", "friendships", "generate", "generates", "generating", "general", "generally", "generic", "generically", "generous", "generously", "went", "ate"]
Porter = PorterStemmer()

for word in words:
    print(word,"--->",Porter.stem(word))

In [None]:
new_words = ["walk","walking","walked","walks"]
for word in new_words:
    print(word,"--->",Porter.stem(word))

## SnowballStemmer/Porter2Stemmer
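Also designed by Martin Porter, as a revision of the original Porter stemmer. The English version is often called Porter2.

It is generally considered more precise than the original Porter stemmer, and is often reported to be slightly faster.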

In [None]:
snowball = SnowballStemmer(language='english')

for word in words:
    print(word,"--->",snowball.stem(word))

## LancasterStemmer
A simpler, more aggressive stemmer.

It tends to over-stem, cutting words down to stems that are no longer meaningful.

In [None]:
lancaster = LancasterStemmer()

for word in words:
    print(word,"--->",lancaster.stem(word))

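## RegexpStemmer
Uses a regular expression supplied by the user.

Any substring that matches the regex is removed from the word.

It usually gives the poorest results of the four, because the pattern is applied without any linguistic rules.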
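
The behaviour described above can be approximated with the standard `re` module alone. This is a sketch under the assumption that the stemmer simply deletes every match and leaves words shorter than a minimum length untouched; `regex_stem` and `min_length` are hypothetical names, and the pattern is the one used in this notebook.

```python
import re


def regex_stem(word, pattern=r"ing$|s$|e$|able$|lly$|ate$", min_length=3):
    """Remove every substring of `word` that matches `pattern`."""
    # Words shorter than min_length are returned unchanged.
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


for w in ["friends", "generating", "generally", "went", "ate"]:
    # repr() makes an empty result visible.
    print(w, "--->", repr(regex_stem(w)))
```

Note that "ate" is reduced to an empty string and "generally" to "genera", which illustrates how crude pure pattern matching can be.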

In [None]:
regex = RegexpStemmer('ing$|s$|e$|able$|lly$|ate$', min=3)

for word in words:
    print(word,"--->",regex.stem(word))

# Lemmatizing

In [None]:
wordnet = WordNetLemmatizer()
lemm_word = ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
for word in lemm_word:
    print(word,"--->",wordnet.lemmatize(word))

In [None]:
from gensim.utils import tokenize

In [None]:
from nltk import pos_tag
text = '''President Paul Kagame has said that deliberate efforts are needed to forge private-public partnerships to bridge internet usage gaps. He was speaking during the inauguration of the Mobile World Congress 2022 which convened more than 2000 people representing 99 countries, on October 25.
Global mobile operators, device manufacturers, technology providers, vendors, content owners, and policymakers are in Kigali to identify gaps and discuss effective measures needed to drive digital transformation in Africa. To address the usage gap –the number of people who can’t use mobile internet services while living in an area covered by broadband networks –Kagame said that neither the private nor the public sector has all that is required to cover the gap, hence, the need for partnerships. '''

tokens = list(tokenize(text))
tokens

In [None]:
pos_list = pos_tag(tokens)

In [None]:
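# Map the first letter of each Penn Treebank tag to a WordNet POS code.
matched_tags = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}
processed_tag = []
for token, tag in pos_list:
  # Tags outside the mapping fall back to noun, the lemmatizer's default,
  # instead of raising a KeyError.
  token = wordnet.lemmatize(token, matched_tags.get(tag[0], 'n'))
  processed_tag.append(token)
  #print(token,'-------------',tag)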

In [None]:
#print(wordnet.lemmatize('countries'))
#pos_tag(['best'])
print(wordnet.lemmatize('better','a'))

# Stopwords
Common, simple words that add little value.

The goal is to keep the feature matrix as small as possible, so it makes sense to remove common words that carry little meaning. Examples include I, a, and an.
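
As a small illustration of how removing stop words shrinks the vocabulary, the sketch below uses a tiny hand-picked stop word set. `STOP_WORDS`, `remove_stop_words`, and the sample sentence are hypothetical and exist only for this example; a real list such as NLTK's is much longer.

```python
# Hypothetical, hand-picked stop words for illustration only.
STOP_WORDS = {"i", "a", "an", "the", "is", "on"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens_demo = ["I", "saw", "a", "bat", "on", "the", "wall"]
filtered = remove_stop_words(tokens_demo)

print(filtered)
# Number of distinct tokens before and after filtering.
print(len(set(tokens_demo)), "->", len(set(filtered)))
```

Comparing in lowercase matters because stop word lists are usually lowercase, while tokens at the start of a sentence are capitalized.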


In [None]:
from nltk.corpus import stopwords
sw = stopwords.words('english')
print(sw)

In [None]:
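print(len(tokens))  # token count before filtering

tokens_no_sw = []
for token in tokens:
  # NLTK's stop word list is lowercase, so compare in lowercase.
  if token.lower() not in sw:
    tokens_no_sw.append(token)

print(len(tokens_no_sw))  # token count after filtering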

# In-class practicals
1. How many stop words are in NLTK, spaCy, and Gensim? Compare them and select one.
2. Lemmatize the above text using a for loop.
3. Compare the stemmers, pick the best one, and compare it with the lemmatizer.
4. Remove stop words from the text.

# Assignment
Create a function that takes the tokens, normalizes them, and removes the stop words.

In [None]:

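def stemming_lem_sw(tokens):
  """Drop stop words, then stem the remaining tokens with the Snowball stemmer."""
  new_tokens = []
  for token in tokens:
    # Filter before stemming: a stemmed stop word (e.g. "very" -> "veri")
    # would no longer match the stop word list.
    if token.lower() not in sw:
      new_tokens.append(snowball.stem(token))

  return new_tokens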

