## Text Preprocessing for Machine Learning - Lemmatization


Lemmatization is the automated process of reducing a word to its base, dictionary form (known as a "lemma") by using vocabulary and morphological analysis.

The goal is to remove inflections and return a valid word that you would find in a dictionary.

Simple Analogy:
    Think of a linguist carefully looking up a word in a dictionary to find its canonical entry. They use grammar rules and meaning to find the standard form.

Key Point:
    Lemmatization is generally more accurate than stemming, but it needs to know the word's part of speech (verb, noun, and so on). Unlike a stemmer, it returns a real word.

Examples:

- running → run (verb)

- happier → happy (adjective)

- better → good (adjective)

- was → be (verb)

- mice → mouse (noun)
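
The examples above can be illustrated with a toy lemmatizer. This is only a minimal sketch of the "vocabulary plus morphological analysis" idea, not how NLTK or WordNet is implemented: the tables `IRREGULAR`, `SUFFIXES`, and `VOCABULARY` and the function `toy_lemmatize` are hypothetical, hand-written names with a handful of entries chosen for illustration.

```python
# Toy lemmatizer: irregular-form lookup, then suffix stripping, then a
# vocabulary check. All tables are tiny, hand-written assumptions.

# Irregular forms are looked up directly, keyed by (word, part of speech).
IRREGULAR = {
    ("better", "a"): "good",
    ("was", "v"): "be",
    ("mice", "n"): "mouse",
}

# Inflectional suffixes to try for each part of speech, in order.
SUFFIXES = {
    "v": ["ing", "es", "s"],
    "n": ["s"],
}

# The "dictionary": a candidate is accepted only if it is a known word.
VOCABULARY = {"run", "go", "eat", "write", "mouse", "good", "be"}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma of `word` for the given POS, or `word` unchanged."""
    word = word.lower()
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    for suffix in SUFFIXES.get(pos, []):
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        candidates = [stem, stem + "e"]  # "writ" -> "write"
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1])  # undo doubling: "runn" -> "run"
        for candidate in candidates:
            if candidate in VOCABULARY:
                return candidate
    return word  # nothing matched: leave the word as it is


print(toy_lemmatize("running", "v"))
print(toy_lemmatize("running", "n"))
print(toy_lemmatize("mice", "n"))
```

Note that the same word gives different results for different parts of speech: as a verb, "running" reduces to "run", while as a noun the toy rules leave it untouched. The vocabulary check is what keeps the output a real word instead of a truncated stem.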

Why it's used:

It serves the same purpose as stemming (grouping the variations of a word), but it is the better choice when accuracy is critical.

Because it returns real words rather than truncated stems, it suits language understanding tasks and any pipeline that produces human-readable output.

## Lemmatization Code Implementation

In [1]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/aljebra/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [3]:
lemmatizer.lemmatize('going', pos='v')

'go'

In [4]:
lemmatizer.lemmatize('going', pos='n')

'going'

In [5]:
lemmatizer.lemmatize('going')

'going'

In [6]:
# 'writen' is a misspelling of 'written'; the lemmatizer leaves it unchanged in the outputs below.
words = ["eats", 'eaten', 'eating', 'writes', 'wrote', 'writen', 'go', 'goes', 'going', 'programming', 'program', 'history', 'finally', 'final', 'congratulate']

In [7]:
for word in words:
    print(word + " ----> " + lemmatizer.lemmatize(word))

eats ----> eats
eaten ----> eaten
eating ----> eating
writes ----> writes
wrote ----> wrote
writen ----> writen
go ----> go
goes ----> go
going ----> going
programming ----> programming
program ----> program
history ----> history
finally ----> finally
final ----> final
congratulate ----> congratulate


In [8]:
'''
POS - part-of-speech tag passed to lemmatize()
Noun - n (default)
Verb - v
Adjective - a
Adverb - r
'''

for word in words:
    print(word + " ----> " + lemmatizer.lemmatize(word, pos='v'))

eats ----> eat
eaten ----> eat
eating ----> eat
writes ----> write
wrote ----> write
writen ----> writen
go ----> go
goes ----> go
going ----> go
programming ----> program
program ----> program
history ----> history
finally ----> finally
final ----> final
congratulate ----> congratulate
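
When the part of speech of each word is not known in advance, one rough workaround is to try every tag and keep the first lemma that differs from the input. The sketch below is a heuristic, not an NLTK feature: `lemmatize_any_pos` and `POS_ORDER` are hypothetical names, and the function only assumes it is given an object with a `lemmatize(word, pos=...)` method, such as the `lemmatizer` created in cell `In [2]`.

```python
# Heuristic fallback: try each WordNet POS tag until the word changes.
# `lemmatize_any_pos` and `POS_ORDER` are illustrative names, not NLTK API.

POS_ORDER = ("n", "v", "a", "r")  # noun, verb, adjective, adverb


def lemmatize_any_pos(word, lemmatizer, pos_order=POS_ORDER):
    """Return the first lemma that differs from `word`, else `word` itself.

    `lemmatizer` is any object exposing lemmatize(word, pos=...).
    """
    for pos in pos_order:
        lemma = lemmatizer.lemmatize(word, pos=pos)
        if lemma != word:
            return lemma
    return word
```

With the NLTK lemmatizer from above, `lemmatize_any_pos('going', lemmatizer)` should give `'go'`, since the noun lookup returns `'going'` unchanged and the verb lookup returns `'go'`. The trade-off is that an ambiguous word gets whichever lemma comes first in `pos_order`, which may not match its role in the sentence.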


In [9]:
lemmatizer.lemmatize('fairly', pos='v')

'fairly'

In [10]:
lemmatizer.lemmatize('sportingly', 'v')

'sportingly'
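
The last two cells show that a word is returned unchanged when the tag does not fit it, so in practice the tag is usually taken from a part-of-speech tagger instead of being hard-coded. The following is a minimal sketch of that idea, assuming the tagger emits Penn Treebank tags (such as `VBG` or `NNS`). The helpers `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names, and the mapping is deliberately coarse.

```python
# Map Penn Treebank tags to the single-letter POS tags used by WordNet,
# then lemmatize (word, tag) pairs. Helper names are illustrative only.


def penn_to_wordnet(tag: str) -> str:
    """Convert a Penn Treebank tag to 'a', 'v', 'r' or 'n'."""
    if tag.startswith("J"):  # JJ, JJR, JJS
        return "a"
    if tag.startswith("V"):  # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("R"):  # RB, RBR, RBS (coarse: also catches RP)
        return "r"
    return "n"  # everything else is treated as a noun


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize an iterable of (word, penn_tag) pairs.

    `lemmatizer` is any object exposing lemmatize(word, pos=...).
    """
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

With NLTK, the pairs could come from `nltk.pos_tag(tokens)`, which needs its tagger resource downloaded first, and the `lemmatizer` object from cell `In [2]` can be passed as the second argument. Falling back to noun for unknown tags mirrors the default `pos` of `WordNetLemmatizer.lemmatize`.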