# Introduction

In Natural Language Processing (NLP), lemmatization and stemming play important roles in text normalization.

*<ins>Text Normalization:</ins> The process of cleaning and preprocessing text data to make it consistent and usable for different NLP tasks.*
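
As a rough illustration of what "cleaning" can mean in practice, the sketch below lowercases a string, strips ASCII punctuation, and collapses repeated whitespace. The function name `normalize` and the choice of steps are assumptions made for this example; real pipelines pick their own steps depending on the task.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    # Remove every character listed in string.punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Replace runs of whitespace with a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Hello,   World!  NLP is FUN.  "))
# -> hello world nlp is fun
```

Stemming and lemmatization are further normalization steps that act on individual words rather than on characters.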

These fundamental methods prepare words, text, and documents for subsequent processing. As we implement and compare the two approaches, it is essential to understand how each one simplifies and standardizes language, which improves the efficiency of various NLP applications.

Languages such as Hindi and English contain many words that are derived from one another. Languages with such derived words are called inflected languages.

All inflected forms of a word share a common root, although the degree of inflection depends heavily on the language.

**To sum up, stemming and lemmatization are used to obtain the root form of derived or inflected words.**

With the overall idea of normalization as a preprocessing step in place, let us look at what stemming is.

*<ins>Stemming:</ins>* The process of reducing inflected words to their stem. For instance, a stemmer might replace both <ins>history</ins> and <ins>historical</ins> with <ins>histori</ins>, although the exact stem depends on the algorithm.

Stemming removes the last few characters of a word to obtain a shorter form, even if that form has no meaning of its own.
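
To make the idea of "removing the last few characters" concrete, here is a toy suffix stripper. It is only an illustration and not the Porter algorithm or any NLTK stemmer; the suffix list, the `min_stem` parameter, and the name `naive_stem` are all hypothetical.

```python
# Toy suffix stripper: an illustration only, not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical list, checked in this order

def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "plays", "studies", "sing"]:
    print(w, "->", naive_stem(w))
```

Here "studies" becomes "studi", which is not a dictionary word, and "sing" is left alone because stripping "ing" would leave too short a stem. Real stemmers apply many more rules, but the output has the same character.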

## Why Do We Need Stemming?
In NLP use cases such as sentiment analysis, spam classification, res


In [14]:
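# Stemming

import nltk
from nltk.corpus import stopwords
from nltk.stem import (PorterStemmer,
                       SnowballStemmer,
                       RegexpStemmer,
                       LancasterStemmer)

# The imports above succeed even when the data files are missing, because
# NLTK loads corpora lazily. Check for the data explicitly and download it
# only when it is not found.
# Note: newer NLTK releases may look for 'punkt_tab' instead of 'punkt'.
for package, path in [('stopwords', 'corpora/stopwords'),
                      ('punkt', 'tokenizers/punkt')]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(package)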

In [3]:
paragraph = """
Generative artificial intelligence (generative AI, GenAI,or GAI) is artificial intelligence capable of generating text, images, videos, or other data using generative models, often in response to prompts. 
Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.
Improvements in transformer-based deep neural networks, particularly large language models (LLMs), enabled an AI boom of generative AI systems in the early 2020s. 
These include chatbots such as ChatGPT, Copilot, Gemini and LLaMA, text-to-image artificial intelligence image generation systems such as Stable Diffusion, Midjourney and DALL-E, and text-to-video AI generators such as Sora.
Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.
"""

In [4]:
scentences = nltk.sent_tokenize(paragraph)
print(scentences)

['\nGenerative artificial intelligence (generative AI, GenAI,or GAI) is artificial intelligence capable of generating text, images, videos, or other data using generative models, often in response to prompts.', 'Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.', 'Improvements in transformer-based deep neural networks, particularly large language models (LLMs), enabled an AI boom of generative AI systems in the early 2020s.', 'These include chatbots such as ChatGPT, Copilot, Gemini and LLaMA, text-to-image artificial intelligence image generation systems such as Stable Diffusion, Midjourney and DALL-E, and text-to-video AI generators such as Sora.', 'Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.']


In [17]:
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer(language='english')
regex_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)
lancaster_stemmer = LancasterStemmer()
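
Before running the four stemmers on whole sentences, it can help to compare them on a handful of words. The sketch below rebuilds the same stemmers so that it runs on its own; the word list is an arbitrary choice for illustration. Note that `RegexpStemmer` only deletes the suffixes named in its pattern, skips words shorter than `min` characters, and does not lowercase its input.

```python
from nltk.stem import (PorterStemmer, SnowballStemmer,
                       RegexpStemmer, LancasterStemmer)

# Same configuration as the cell above, rebuilt so this block is self-contained.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer(language="english"),
    "regexp": RegexpStemmer("ing$|s$|e$|able$", min=4),
    "lancaster": LancasterStemmer(),
}

sample_words = ["connections", "running", "capable", "was"]  # illustrative only

for name, stemmer in stemmers.items():
    print(f"{name:>9}:", [stemmer.stem(w) for w in sample_words])
```

Printing the stems side by side makes it easier to see where the algorithms disagree before committing to one of them.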

In [13]:
results: list = []
stop_words = set(stopwords.words('english'))  # build the set once, not per word
for scent in scentences:
    words = nltk.word_tokenize(scent)
    words = [porter_stemmer.stem(word) for word in words if word not in stop_words]
    results.append(' '.join(words))

print(results)  # print the stemmed sentences, not the original ones


['gener artifici intellig ( gener ai , genai , gai ) artifici intellig capabl gener text , imag , video , data use gener model , often respons prompt .', 'gener ai model learn pattern structur input train data gener new data similar characterist .', 'improv transformer-bas deep neural network , particularli larg languag model ( llm ) , enabl ai boom gener ai system earli 2020 .', 'these includ chatbot chatgpt , copilot , gemini llama , text-to-imag artifici intellig imag gener system stabl diffus , midjourney dall- , text-to-video ai gener sora .', 'compani openai , anthrop , microsoft , googl , baidu well numer smaller firm develop gener ai model .']
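
In the output above, "these" survives even though it is an English stopword. The filter compares the original token "These" against NLTK's lowercase stopword list, so the capitalized form is not matched. One possible fix is to lowercase each token before the membership test. The sketch below shows the idea with a tiny hand-written stopword set and `str.split` standing in for the NLTK stopword corpus and tokenizer; both are assumptions made so the example runs without any downloads.

```python
from nltk.stem import PorterStemmer

# Hypothetical stand-in for stopwords.words('english').
STOP_WORDS = {"these", "such", "as", "and"}

def stem_sentence(sentence: str) -> str:
    """Drop stopwords case-insensitively, then stem the remaining tokens."""
    stemmer = PorterStemmer()
    tokens = sentence.split()  # stand-in for nltk.word_tokenize
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in kept)

print(stem_sentence("These include chatbots such as ChatGPT and Copilot"))
```

Whether to lowercase before filtering is a design choice: it removes sentence-initial stopwords, but it also treats capitalized names that happen to match a stopword as noise.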


In [18]:
results: list = []
stop_words = set(stopwords.words('english'))  # build the set once, not per word
for scent in scentences:
    words = nltk.word_tokenize(scent)
    words = [snowball_stemmer.stem(word) for word in words if word not in stop_words]
    results.append(' '.join(words))

print(results)  # print the stemmed sentences, not the original ones




In [19]:
results: list = []
stop_words = set(stopwords.words('english'))  # build the set once, not per word
for scent in scentences:
    words = nltk.word_tokenize(scent)
    words = [regex_stemmer.stem(word) for word in words if word not in stop_words]
    results.append(' '.join(words))

print(results)  # print the stemmed sentences, not the original ones




In [20]:
results: list = []
stop_words = set(stopwords.words('english'))  # build the set once, not per word
for scent in scentences:
    words = nltk.word_tokenize(scent)
    words = [lancaster_stemmer.stem(word) for word in words if word not in stop_words]
    results.append(' '.join(words))

print(results)  # print the stemmed sentences, not the original ones



In [7]:
#Lemmatization

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords



In [8]:
sentences = nltk.sent_tokenize(paragraph)
print(sentences)

['\nGenerative artificial intelligence (generative AI, GenAI,or GAI) is artificial intelligence capable of generating text, images, videos, or other data using generative models, often in response to prompts.', 'Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics.', 'Improvements in transformer-based deep neural networks, particularly large language models (LLMs), enabled an AI boom of generative AI systems in the early 2020s.', 'These include chatbots such as ChatGPT, Copilot, Gemini and LLaMA, text-to-image artificial intelligence image generation systems such as Stable Diffusion, Midjourney and DALL-E, and text-to-video AI generators such as Sora.', 'Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.']


In [9]:
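lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not per word

# Lemmatize each sentence in place. lemmatize() treats every word as a noun
# unless a pos argument is given.
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)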

In [10]:
print(sentences)

['Generative artificial intelligence ( generative AI , GenAI , GAI ) artificial intelligence capable generating text , image , video , data using generative model , often response prompt .', 'Generative AI model learn pattern structure input training data generate new data similar characteristic .', 'Improvements transformer-based deep neural network , particularly large language model ( LLMs ) , enabled AI boom generative AI system early 2020s .', 'These include chatbots ChatGPT , Copilot , Gemini LLaMA , text-to-image artificial intelligence image generation system Stable Diffusion , Midjourney DALL-E , text-to-video AI generator Sora .', 'Companies OpenAI , Anthropic , Microsoft , Google , Baidu well numerous smaller firm developed generative AI model .']
