# <center>NLP💬🔉 By 🎯Udaya ( Data Engineer 📚) </center>

# Stemming
Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). Unlike a lemma, the stem does not have to be a valid dictionary word. Stemming is a common normalization step in natural language processing (NLP) and natural language understanding (NLU).
* "running" -> "run"
* "happiness" -> "happi"
* "caresses" -> "caress"

### Overstemming
Definition: Overstemming occurs when a stemming algorithm removes more characters than necessary, leading to stems that are too general or incorrect.

#### Example:

* "university" -> "univers" (the same stem as "universe" and "universal", although the meanings differ)
* "generalization" -> "gener" (the same stem as "generous"; "general" would be the more useful stem)

### Understemming
Definition: Understemming occurs when a stemming algorithm does not remove enough characters, leading to stems that are too specific and fail to capture the common root of related words.

#### Example:

* "running" -> "running" (correct stem: "run")
* "happiness" -> "happiness" (correct stem: "happi")
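
Both failure modes can be checked directly against a real stemmer. The snippet below is a small sketch that assumes NLTK is installed; the word lists are illustrative choices and not part of the original notebook.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Overstemming: words with different meanings end up sharing one stem.
overstemmed = {w: porter.stem(w) for w in ["university", "universe", "universal"]}
print(overstemmed)

# Understemming: closely related words keep different stems.
understemmed = {w: porter.stem(w) for w in ["datum", "data"]}
print(understemmed)
```

If all values in the first dictionary are equal, the three words have been conflated; if the values in the second dictionary differ, the stemmer failed to connect a singular with its plural.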

### Types

## Porter Stemmer
Definition: The Porter Stemmer is one of the most widely used stemming algorithms. It uses a series of predefined rules to iteratively strip suffixes from words.

#### Example:

* "running" -> "run"
* "happiness" -> "happi"
* "caresses" -> "caress"
* Code 👇

In [1]:
from nltk.stem import PorterStemmer

# One stemmer instance can be reused for every word
stemming = PorterStemmer()

words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "finally", "finalized"]

# Print each word next to its Porter stem
for word in words:
    print(word+" 👉--> "+stemming.stem(word))

eating 👉--> eat
eats 👉--> eat
eaten 👉--> eaten
writing 👉--> write
writes 👉--> write
programming 👉--> program
programs 👉--> program
history 👉--> histori
finally 👉--> final
finalized 👉--> final


In [2]:
stemming.stem('Congratulation')

'congratul'

In [3]:
stemming.stem("sitting")

'sit'
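
To make "a series of predefined rules" more concrete, here is a minimal sketch of step 1a of the algorithm as described in Porter's original paper. The names `STEP_1A_RULES` and `porter_step_1a` are made up for this illustration, and NLTK's own implementation differs in some details, so treat this as a teaching aid rather than a replacement.

```python
# Porter step 1a as an ordered rule list: (suffix, replacement).
# The first suffix that matches wins, so longer suffixes come first.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def porter_step_1a(word: str) -> str:
    """Apply the first matching plural rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The full algorithm chains several such steps, and most later steps also check how many vowel-consonant groups remain in the stem before a suffix may be removed.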

## Lancaster Stemmer
Definition: The Lancaster Stemmer is another stemming algorithm that is more aggressive than the Porter Stemmer. It uses a more extensive set of rules and applies them repeatedly, which usually results in shorter stems and a higher risk of overstemming.

#### Example:

* "running" -> "run"
* "happiness" -> "happy"
* "caresses" -> "caress"
* Code 👇

In [4]:
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()

words = ["running", "happiness", "caresses"]
for word in words:
    print(word+" 👉--> "+lancaster.stem(word))

running 👉--> run
happiness 👉--> happy
caresses 👉--> caress


In [5]:
lancaster.stem('Congratulation')

'congrat'
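
A side-by-side run makes the difference in aggressiveness easier to see. This is a rough sketch that assumes NLTK is installed; comparing total stem length is just one simple, informal way to measure how much each stemmer cuts.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["running", "happiness", "caresses", "congratulation"]

# Collect (word, Porter stem, Lancaster stem) for each word
rows = [(w, porter.stem(w), lancaster.stem(w)) for w in words]
for word, p_stem, l_stem in rows:
    print(f"{word:15} porter={p_stem:12} lancaster={l_stem}")

# Total number of characters each stemmer keeps across the list
porter_total = sum(len(p_stem) for _, p_stem, _ in rows)
lancaster_total = sum(len(l_stem) for _, _, l_stem in rows)
print("characters kept:", porter_total, "(Porter) vs", lancaster_total, "(Lancaster)")
```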

## RegexpStemmer class
NLTK's `RegexpStemmer` class builds a stemmer from a single regular expression: every part of the word that matches the expression is removed. Words shorter than `min` characters are returned unchanged. For example:

In [6]:
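from nltk.stem import RegexpStemmer

# Strip a trailing "ing", "s", "e" or "able"; words shorter than 4 characters are left alone
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

print(reg_stemmer.stem('eating'))
# The "$" anchor means only the trailing "ing" is removed, not the leading one
print(reg_stemmer.stem('ingeating'))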

eat
ingeat
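
The effect of the `min` argument, and how easily a single pattern overstems, shows up with a few short words. A small sketch, assuming NLTK is installed; the test words are chosen only for illustration.

```python
from nltk.stem import RegexpStemmer

stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

# "was" has fewer than 4 characters, so it is returned untouched.
# "sing" is exactly 4 characters long, so the pattern applies and overstems it.
results = {word: stemmer.stem(word) for word in ["was", "cars", "sing", "readable"]}
print(results)
# {'was': 'was', 'cars': 'car', 'sing': 's', 'readable': 'read'}
```

Raising `min` protects more short words, but it cannot tell "sing" apart from a genuine "-ing" form of the same length.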


In [7]:
import re

def regex_stem(word):
    patterns = [
        (r'ing$', ''),  # remove 'ing'
        (r'ness$', ''), # remove 'ness'
        (r'es$', 'e'),  # replace 'es' with 'e'
        (r's$', '')     # remove 's'
    ]
    for pattern, repl in patterns:
        word = re.sub(pattern, repl, word)
    return word

words = ["running", "happiness", "caresses"]
stems = [regex_stem(word) for word in words]
print(stems)

['runn', 'happi', 'caresse']
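
The stems 'runn' and 'caresse' come from two weaknesses of `regex_stem`: it has no rule for doubled consonants, and its "es" rule is too broad. One possible refinement, sketched below with hypothetical names (`RULES`, `DOUBLED_FINAL`, `regex_stem_first_match`), stops at the first matching rule and undoubles the final consonant after removing "ing". It is still a toy and will overstem words such as "falling".

```python
import re

# Hypothetical rule table: (pattern, replacement, undouble final consonant afterwards?)
# The first matching rule wins, so the most specific suffix comes first.
RULES = [
    (re.compile(r"sses$"), "ss", False),
    (re.compile(r"ness$"), "", False),
    (re.compile(r"ing$"), "", True),
    (re.compile(r"s$"), "", False),
]

# A doubled consonant at the end of a word, e.g. "nn" in "runn"
DOUBLED_FINAL = re.compile(r"([b-df-hj-np-tv-z])\1$")


def regex_stem_first_match(word):
    """Apply only the first rule whose pattern matches the word."""
    for pattern, repl, undouble in RULES:
        if pattern.search(word):
            stem = pattern.sub(repl, word)
            if undouble:
                stem = DOUBLED_FINAL.sub(r"\1", stem)
            return stem
    return word


words = ["running", "happiness", "caresses"]
stems = [regex_stem_first_match(word) for word in words]
print(stems)
# ['run', 'happi', 'caress']
```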


## Snowball Stemmer
The Snowball Stemmer for English is also known as the Porter2 stemming algorithm. It is a revised version of the Porter Stemmer that fixes several of its known issues. NLTK's `SnowballStemmer` also supports a number of languages other than English.

In [8]:
from nltk.stem import SnowballStemmer
snowballsstemmer=SnowballStemmer('english')

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]
for word in words:
    print(word+"👉->"+snowballsstemmer.stem(word))

eating👉->eat
eats👉->eat
eaten👉->eaten
writing👉->write
writes👉->write
programming👉->program
programs👉->program
history👉->histori
finally👉->final
finalized👉->final
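
`SnowballStemmer` takes the language name as its first argument, and the supported names are listed in the class attribute `SnowballStemmer.languages`. A short sketch, assuming NLTK is installed; the German word is only an illustrative input, and its stem is printed rather than asserted here.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the constructor
supported = SnowballStemmer.languages
print(supported)

# Build a stemmer for another language the same way as for English
german_stemmer = SnowballStemmer("german")
german_stem = german_stemmer.stem("laufen")
print(german_stem)
```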


### `Porter Stemmer` v/s `Snowball Stemmer`
#### `Example 👇`

In [9]:
porter_stems = [stemming.stem(word) for word in ["fairly", "sportingly", "goes"]]
snowball_stems = [snowballsstemmer.stem(word) for word in ["fairly", "sportingly", "goes"]]

print("Porter:", porter_stems)  
print("Snowball:", snowball_stems)

Porter: ['fairli', 'sportingli', 'goe']
Snowball: ['fair', 'sport', 'goe']


In [10]:
stemming.stem("fairly"),stemming.stem("sportingly"),stemming.stem('goes')

('fairli', 'sportingli', 'goe')

In [11]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly"),snowballsstemmer.stem('goes')

('fair', 'sport', 'goe')

# Lemmatization
* Lemmatization is similar to stemming: both reduce a word to a base form.
* The output of lemmatization is called a "lemma". It is a valid root word, whereas stemming produces a root stem that may not be a real word.
* The lemma keeps the same core meaning as the original word.
* More formally, lemmatization is the NLP process of reducing words to their base or dictionary form.
* Unlike stemming, which often removes suffixes in a crude way, lemmatization uses linguistic knowledge, including vocabulary and morphological analysis, to produce more accurate and meaningful base forms.

### Example
#### Consider the following words and how they are lemmatized:

#### Running:
* Lemmatized: "run" (verb)
* Explanation: The verb "running" is reduced to its base form "run".
#### Better:
* Lemmatized: "good" (adjective)
* Explanation: The adjective "better" is reduced to "good" based on its comparative form.
#### Geese:
* Lemmatized: "goose" (noun)
* Explanation: The plural noun "geese" is reduced to its singular form "goose".
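
The three examples share one idea: the lemma is looked up using the word and its part of speech, instead of being cut out of the word. The toy sketch below illustrates that idea with a tiny hand-written table; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and a real lemmatizer relies on a full dictionary plus morphological rules.

```python
# Hypothetical, hand-written lookup table keyed by (word, part of speech)
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("geese", "n"): "goose",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("Geese", "n"))   # goose
print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("better", "n"))  # better (no noun entry in the table)
```

The last call shows why the part of speech matters: the same surface form can map to different lemmas, or to none, depending on how it is used.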

### Wordnet Lemmatizer
NLTK provides the `WordNetLemmatizer` class, a thin wrapper around the WordNet corpus. It uses the `morphy()` function of the WordNet `CorpusReader` class to find a lemma. The WordNet data has to be available locally, for example via `nltk.download('wordnet')`. Let us understand it with an example:



In [12]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()

lemmatizer.lemmatize("going")

'going'

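#### POS tags
The `pos` argument of `lemmatize()` defaults to `'n'` (noun), which is why "going" came back unchanged above. The accepted values are:
* Noun: `n`
* Verb: `v`
* Adjective: `a`
* Adverb: `r`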

In [13]:
lemmatizer.lemmatize("going",pos='v')

'go'

In [14]:
lemmatizer.lemmatize("going",pos='n')

'going'
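
In practice the part of speech usually comes from a tagger that emits Penn Treebank tags such as `VBG` or `NNS`, so those tags need to be mapped to the four letters above. The helper below is a common pattern, sketched under the assumption that the tags are already available; `penn_to_wordnet` is a hypothetical name and the tagged pairs are written by hand for illustration.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):    # JJ, JJR, JJS
        return "a"
    if tag.startswith("V"):    # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS
        return "r"
    return "n"                 # same default as lemmatize()


# Hand-tagged examples (assumed tags, not produced by a tagger here)
tagged = [("going", "VBG"), ("better", "JJR"), ("geese", "NNS"), ("quickly", "RB")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
# [('going', 'v'), ('better', 'a'), ('geese', 'n'), ('quickly', 'r')]
```

Each resulting letter can then be passed as the `pos` argument of `lemmatizer.lemmatize(word, pos=...)`.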

In [15]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

for word in words:
    print(word+"👉-> "+lemmatizer.lemmatize(word,pos='v'))

eating👉-> eat
eats👉-> eat
eaten👉-> eat
writing👉-> write
writes👉-> write
programming👉-> program
programs👉-> program
history👉-> history
finally👉-> finally
finalized👉-> finalize


In [16]:
lemma = [lemmatizer.lemmatize(word) for word in words]
print(lemma,end=" ")

['eating', 'eats', 'eaten', 'writing', 'writes', 'programming', 'program', 'history', 'finally', 'finalized'] 

In [17]:
lemmatizer.lemmatize("fairly",pos='v'),lemmatizer.lemmatize("sportingly")

('fairly', 'sportingly')

### `Lemmatization` v/s `Stemming`
* Stemming: "running" -> "run", "better" -> "better", "geese" -> "gees"
* Lemmatization: "running" -> "run", "better" -> "good", "geese" -> "goose"