# **Stemming and Lemmatization**

**Aim** : To understand the concept of stemming and lemmatization

**Tools** : Jupyter or any Python editor

**Library** : NLTK

**Method** : Using classes already available in NLTK:

**Stemming**
1. PorterStemmer
2. LancasterStemmer
3. RegexpStemmer
4. SnowballStemmer

**Lemmatization**
1. WordNetLemmatizer

**Don't write this down**

https://www.tutorialspoint.com/natural_language_toolkit/natural_language_toolkit_stemming_lemmatization.htm

**Stemming**

Stemming is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words eating, eats, eaten is eat.

Search engines use stemming when indexing words. Rather than storing every form of a word, a search engine can store only the stems. This reduces the size of the index and lets a query match documents that use a different form of the same word, which typically improves recall.
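
To make the indexing idea concrete, here is a minimal sketch of an inverted index keyed by stems. The names `toy_stem`, `build_index`, and `search`, as well as the sample documents, are hypothetical and invented for this illustration. `toy_stem` is only a crude stand-in for a real stemmer; `PorterStemmer().stem` could be passed in its place.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Crude stand-in for a real stemmer (an assumption, not Porter)."""
    word = word.lower()
    if len(word) < 4:  # leave very short words alone
        return word
    return re.sub(r"(ing|en|s)$", "", word)


def build_index(docs, stem=toy_stem):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[stem(token)].add(doc_id)
    return index


def search(index, query, stem=toy_stem):
    """Look up a single-word query by its stem."""
    return index.get(stem(query), set())


docs = {
    1: "She eats apples",
    2: "He is eating lunch",
    3: "They have eaten",
}
index = build_index(docs)

# "eats", "eating", and "eaten" all share one index entry, "eat"
print(sorted(search(index, "eating")))
```

Because the query is stemmed the same way as the documents, one index entry serves all three word forms.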

In [None]:
import nltk.corpus
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import RegexpStemmer
from nltk.stem import SnowballStemmer

1. **Porter stemming algorithm**

It is one of the most common stemming algorithms, designed to remove and replace well-known suffixes of English words. The class applies rules for regular word forms and suffixes to transform the input word into a final stem. The resulting stem is usually shorter than the input and keeps the same root meaning, though it is not always a dictionary word.

In [None]:
word_stemmer = PorterStemmer()
word_stemmer.stem('writing')

'write'

In [None]:
word_stemmer.stem('eating')

'eat'

2. **Lancaster stemming algorithm**

It was developed at Lancaster University and is another very common stemming algorithm.

The **LancasterStemmer** class lets us apply the Lancaster algorithm to any word we want to stem.

In [None]:
Lanc_stemmer = LancasterStemmer()
Lanc_stemmer.stem('eats')

'eat'
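
Lancaster is generally regarded as a more aggressive stemmer than Porter, so the two can disagree on the same word. The sketch below prints both stems side by side so the difference can be inspected. The helper name `compare_stems` and the sample word list are assumptions made for this illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer


def compare_stems(words):
    """Return {word: (porter_stem, lancaster_stem)} for each input word."""
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    return {w: (porter.stem(w), lancaster.stem(w)) for w in words}


sample = ["writing", "eating", "eats", "happiness", "maximum"]
for word, (p_stem, l_stem) in compare_stems(sample).items():
    print(f"{word:<10} porter={p_stem:<10} lancaster={l_stem}")
```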

3. **Regular Expression stemming algorithm**

With the help of this stemming algorithm, we can construct our own stemmer.

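It takes a single regular expression and removes whatever part of the word matches it, typically a prefix or suffix. The optional min argument sets a minimum word length: shorter words are returned unchanged.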

In [None]:
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
words = ['connecting','connect','factionally','faction',"consult","consultation"]
for word in words:
  print(word,"--->",regexp.stem(word))

connecting ---> connect
connect ---> connect
factionally ---> factionally
faction ---> faction
consult ---> consult
consultation ---> consultation
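
What the regular expression stemmer does can be approximated in a few lines of plain Python. This is a rough sketch of the idea and not NLTK's own code; the function name `regexp_stem` and the parameter `min_length` are hypothetical.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Rough pure-Python approximation of a regular expression stemmer."""
    if len(word) < min_length:
        return word  # too short: returned unchanged
    return re.sub(pattern, "", word)  # delete every match of the pattern


pattern = r"ing$|s$|e$|able$"
for word in ["connecting", "connect", "factionally", "tables", "its"]:
    print(word, "--->", regexp_stem(word, pattern, min_length=4))
```

Only words that end in one of the listed suffixes change, and `its` is left alone because it is shorter than the minimum length.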


4. **Snowball stemming algorithm**

**SnowballStemmer** supports 15 languages, including English, and additionally offers the original Porter algorithm under the name 'porter'. To use this stemming class, we create an instance with the name of the language we are using and then call the stem() method.

In [None]:
from nltk.stem import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

In [None]:
French_stemmer = SnowballStemmer('french')
French_stemmer.stem('Bonjoura')

'bonjour'
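
The same class also covers English. The short sketch below compares the 'english' stemmer with the 'porter' variant from the language list, and shows that an unsupported language name is rejected. The sample words and the name 'klingon' are chosen only for illustration.

```python
from nltk.stem import SnowballStemmer

english = SnowballStemmer("english")
original_porter = SnowballStemmer("porter")

# Print the stem from each variant side by side
for word in ["running", "generously"]:
    print(word, "--->", english.stem(word), "/", original_porter.stem(word))

# A name that is not in SnowballStemmer.languages raises ValueError
try:
    SnowballStemmer("klingon")
except ValueError as err:
    print("unsupported:", err)
```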

**Lemmatization**

Lemmatization is similar to stemming, but its output, called the ‘lemma’, is a root word rather than a root stem. A stem may be a truncated string that is not a word, whereas a lemma is a valid word with the same meaning.

The **WordNetLemmatizer** class is a thin wrapper around the WordNet corpus. It uses the morphy() function of the WordNet CorpusReader class to find a lemma. If no lemma is found, the word is returned unchanged.

In [None]:
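# Requires the WordNet data, e.g. nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a pos argument, the word is treated as a noun
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# pos="a" marks the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))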

rocks : rock
corpora : corpus
better : good
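
The lookup behind a lemmatizer can be pictured roughly as follows: check a list of irregular forms first, then try suffix rules and accept a candidate only if it is a known word. The sketch below imitates that idea with a tiny hand-made vocabulary. `VOCAB`, `EXCEPTIONS`, `RULES`, and `toy_lemmatize` are hypothetical and far smaller than WordNet's data, so this illustrates the principle and is not the actual morphy() implementation.

```python
# Tiny stand-ins for a real dictionary (assumptions for illustration only)
VOCAB = {"rock", "corpus", "story", "bus", "news"}
EXCEPTIONS = {"corpora": "corpus"}  # irregular forms
RULES = [("ies", "y"), ("es", ""), ("s", "")]  # (suffix, replacement)


def toy_lemmatize(word):
    """Return a dictionary word for `word`, or `word` itself if none fits."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in VOCAB:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCAB:  # accept only valid words
                return candidate
    return word


for word in ["rocks", "corpora", "stories", "buses", "news", "blorps"]:
    print(word, ":", toy_lemmatize(word))
```

The dictionary check is what separates this from stemming: `news` stays `news` instead of losing its final s, and the made-up word `blorps` is returned unchanged because no candidate is a known word.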


****
**POS tagging or Chunking**

https://www.geeksforgeeks.org/python-lemmatization-approaches-with-examples/