<a href="https://colab.research.google.com/github/YashNigam65/gitfolder/blob/master/notebook/concept_example/stemming_and_lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook explains the concepts of stemming and lemmatization.

Stemming and lemmatization are both techniques used in natural language processing (NLP) to reduce words to their base or root form, but they do so in slightly different ways.



Stemming: This is a cruder process that chops off the ends of words to get to a common "stem." The stem is not always a real word. For example, the Porter stemmer reduces "programmer" to "programm," which isn't an English word. Stemming is faster than lemmatization but can be less accurate.

Example from the notebook:
programming becomes program

programmer becomes programm

Intelligence becomes intellig
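
To make "chopping off the ends of words" concrete, here is a deliberately naive suffix stripper. It is a toy sketch, not the Porter algorithm: the suffix list and the names `SUFFIXES` and `naive_stem` are assumptions made for illustration, and its results differ from NLTK's.

```python
# Toy suffix-stripping stemmer (illustration only, not the Porter algorithm).
# SUFFIXES and naive_stem are hypothetical names chosen for this sketch.
SUFFIXES = ("ing", "ers", "er", "s")  # checked in this order


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Only strip when enough of the word is left to be a usable stem.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


for w in ["program", "programs", "programming", "programmer"]:
    print(w, "->", naive_stem(w))
```

This toy version turns both "programming" and "programmer" into the non-word "programm". NLTK's Porter stemmer applies further rules, which is why the notebook reports "program" for "programming".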

Lemmatization: This is a more sophisticated process that uses a vocabulary and the word's part of speech to return the base or dictionary form of a word, known as a "lemma." The lemma is a valid dictionary word. Lemmatization is generally slower than stemming but more accurate.

Example from the notebook:
rocks becomes rock

corpora becomes corpus

better (as an adjective) becomes good

programming becomes programming (the lemmatizer treats words as nouns by default, and "programming" is itself a noun)

programmer becomes programmer (also a base form)
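
The dictionary-lookup idea can be sketched with a tiny hand-written table keyed by word and part of speech. The `TOY_LEMMAS` table and the `toy_lemmatize` function are hypothetical stand-ins; a real lemmatizer consults a far larger resource such as WordNet and also applies morphological rules.

```python
# Toy dictionary-based lemmatizer (illustration only).
# TOY_LEMMAS and toy_lemmatize are hypothetical; a real lemmatizer such as
# NLTK's WordNetLemmatizer consults the WordNet database instead.
TOY_LEMMAS = {
    ("rocks", "n"): "rock",
    ("corpora", "n"): "corpus",
    ("better", "a"): "good",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("corpora"))          # looked up as a noun
print(toy_lemmatize("better"))           # no noun entry, so unchanged
print(toy_lemmatize("better", pos="a"))  # adjective entry exists
```

The same word can map to different lemmas depending on its part of speech, which is why "better" only becomes "good" when it is looked up as an adjective.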

In [4]:
!pip install nltk



In [6]:
import nltk

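This cell downloads the punkt tokenizer models from nltk. Punkt is a sentence tokenizer, and nltk's word_tokenize uses it to split text into sentences before splitting those into words. Tokenization is a common prerequisite for many NLP tasks, although the cells below work on a hand-written word list and do not call the tokenizer.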

In [7]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True
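
For a rough idea of what tokenization produces, the sketch below splits text into sentences and words with regular expressions from the standard library. It is a naive stand-in, not the Punkt algorithm, and `split_sentences` and `split_words` are hypothetical helper names.

```python
import re

# Naive regex tokenization (illustration only, not the Punkt algorithm).
# split_sentences and split_words are hypothetical helper names.


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence: str) -> list[str]:
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Stemming is fast. Lemmatization is slower!"
for sentence in split_sentences(text):
    print(split_words(sentence))
```

A rule this simple breaks on abbreviations: "Dr. Smith arrived." is split after "Dr.", which is the kind of case a trained sentence tokenizer is meant to handle.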

This cell downloads the wordnet corpus from nltk. WordNet is a lexical database of semantic relations between words, often used for lemmatization.

In [8]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

This cell demonstrates stemming using the PorterStemmer from nltk. It initializes a stemmer and applies it to a list of words, showing how different forms of a word are reduced to a common stem: "program", "programs", and "programming" all become "program", while "programmer" and "programmers" become "programm". The stemmer also lowercases its input, so "Intelligence" becomes "intellig".

In [11]:
# import the Porter stemmer
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers", "Intelligence", "better"]

# print each word next to its stem
for w in words:
    print(w, " : ", ps.stem(w))


program  :  program
programs  :  program
programmer  :  programm
programming  :  program
programmers  :  programm
Intelligence  :  intellig
better  :  better
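
One practical use of stems is as grouping keys. The sketch below takes the stems from the output above, hard-coded so that it runs without nltk, and groups the original words by stem. The variable names are illustrative.

```python
from collections import defaultdict

# Stems copied from the PorterStemmer output printed above.
porter_stems = {
    "program": "program",
    "programs": "program",
    "programmer": "programm",
    "programming": "program",
    "programmers": "programm",
    "Intelligence": "intellig",
    "better": "better",
}

# Group the original words by their stem.
groups = defaultdict(list)
for word, stem in porter_stems.items():
    groups[stem].append(word)

for stem, members in groups.items():
    print(stem, "<-", members)
```

The "program" family ends up in two groups, "program" and "programm". A stem only has to be a consistent key, not a real word, so related words are sometimes split in ways a reader might not expect.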


This cell demonstrates lemmatization using the WordNetLemmatizer from nltk. It initializes a lemmatizer and applies it to various words. Unlike stemming, lemmatization looks words up in WordNet and returns their dictionary form (the "lemma"), so "rocks" becomes "rock" and "corpora" becomes "corpus". The result depends on the part of speech: lemmatize treats every word as a noun unless told otherwise, so "better" becomes "good" only when pos="a" (adjective) is passed.



In [12]:
# import the WordNet lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# with no pos argument, words are treated as nouns
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
print("Intelligence :", lemmatizer.lemmatize("Intelligence"))


# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))

words = ["program", "programs", "programmer", "programming", "programmers", "better"]

# print each word next to its (noun) lemma
for w in words:
    print(w, " : ", lemmatizer.lemmatize(w))


rocks : rock
corpora : corpus
Intelligence : Intelligence
better : good
program  :  program
programs  :  program
programmer  :  programmer
programming  :  programming
programmers  :  programmer
better  :  better
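
The last line of output shows "better" left unchanged because no part of speech was given. In a pipeline the tag usually comes from a part-of-speech tagger rather than being typed by hand. Here is a minimal sketch, assuming the tagger emits Penn Treebank tags (as nltk.pos_tag does by default), that maps such tags to the single-letter codes the pos argument accepts. The helper name `penn_to_wordnet_pos` is hypothetical.

```python
# Map Penn Treebank tags to the single-letter part-of-speech codes
# accepted by WordNetLemmatizer.lemmatize(pos=...).
# penn_to_wordnet_pos is a hypothetical helper name for this sketch.


def penn_to_wordnet_pos(tag: str) -> str:
    """Return 'a', 'v', 'r' or 'n' for a Penn Treebank tag."""
    if tag.startswith("J"):    # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):    # VB, VBD, VBG, ...: verbs
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS: adverbs
        return "r"
    return "n"                 # nouns, and the fallback for everything else


for tag in ["JJR", "VBG", "RB", "NNS"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

Falling back to "n" mirrors the lemmatizer's own default, so untagged or unusual tokens behave as they do in the cell above.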
