# What is Stemming and Lemmatization in Python NLTK?
Stemming and lemmatization are text **normalization techniques** in Natural Language Processing (NLP), and both are available in Python's NLTK library. They reduce inflected word forms to a common base form and are widely used in text preprocessing.

### The difference between stemming and lemmatization

Stemming is faster because it strips word endings with fixed rules, without considering the context.

Lemmatization gives better results: it analyzes the word according to its part of speech and returns a real dictionary word. As a result, lemmatization is harder to implement and slower than stemming.


**For example**, take the word 'studying':

Lemmatize -- 'study' (when the word is treated as a verb).
Stem -- 'studi', which is not a dictionary word.
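
To make the contrast concrete, here is a minimal sketch in plain Python. It is not how NLTK's `PorterStemmer` or `WordNetLemmatizer` work internally: `toy_stem`, `toy_lemmatize`, the `SUFFIXES` list, and the tiny `LEMMAS` table are all hypothetical names made up for illustration.

```python
# Toy illustration only: these helpers are hypothetical and far simpler
# than NLTK's PorterStemmer and WordNetLemmatizer.

SUFFIXES = ("ing", "ies", "es", "s")  # assumed, deliberately tiny rule list

# Assumed mini-dictionary mapping (word, part of speech) -> lemma
LEMMAS = {
    ("studying", "v"): "study",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
}


def toy_stem(word):
    """Strip the first matching suffix; no dictionary, no context."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos="n"):
    """Look the word up for a given part of speech; fall back to the word."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


print(toy_stem("studies"))                 # stud  (not a dictionary word)
print(toy_stem("caring"))                  # car   (a different word entirely)
print(toy_lemmatize("studying", pos="v"))  # study
print(toy_lemmatize("studying"))           # studying (no noun entry in the toy table)
```

The rule-based function is cheap but can produce non-words or the wrong word, while the lookup returns real words only when it has an entry for the right part of speech.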

In [36]:
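import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# One-time data download (recent NLTK releases may need "punkt_tab" instead of "punkt")
for resource in ("punkt", "wordnet", "stopwords"):
    nltk.download(resource)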

In [32]:
word_data = "studies studying caring very history historical"

In [33]:
# First, split the text into word tokens
nltk_tokens = nltk.word_tokenize(word_data)

In [34]:
porter_stemmer = PorterStemmer()
# Next, reduce each token to its stem
for w in nltk_tokens:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))

Stemming for studies is studi
Stemming for studying is studi
Stemming for caring is care
Stemming for very is veri
Stemming for history is histori
Stemming for historical is histor
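
The stems above are not dictionary words, but they can still serve as grouping keys for related word forms. A small sketch, using only the stems printed in the output above (no stemmer is run here, and `stem_of` and `groups` are illustrative names):

```python
from collections import defaultdict

# Word -> stem pairs copied from the Porter stemmer output above
stem_of = {
    "studies": "studi",
    "studying": "studi",
    "caring": "care",
    "very": "veri",
    "history": "histori",
    "historical": "histor",
}

# Group the original words by their stem
groups = defaultdict(list)
for word, stem in stem_of.items():
    groups[stem].append(word)

for stem, words in groups.items():
    print(stem, "->", words)
```

'studies' and 'studying' share the key 'studi', whereas 'history' and 'historical' end up under different stems even though they are related.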


In [35]:
wordnet_lemmatizer = WordNetLemmatizer()
# Lemmatize each token (no pos argument, so each word is treated as a noun)
for w in nltk_tokens:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))

Lemma for studies is study
Lemma for studying is studying
Lemma for caring is caring
Lemma for very is very
Lemma for history is history
Lemma for historical is historical


The cells below list NLTK's English stop words and then remove them from a sample sentence.

Stop words are the most common words in a natural language. When analyzing text data and building NLP models, they usually add little to the meaning of a document. Typical examples are “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at”, etc.

In [59]:
print(set(stopwords.words('english')))

{'he', 'am', 've', "shouldn't", 'doing', 'o', 'mustn', 'mightn', "you'll", 'these', 'the', 'more', 'an', 'couldn', 'doesn', 'then', 'you', 'during', 'myself', "don't", 'as', 'do', "wouldn't", 'yourselves', 'from', 'weren', 'yourself', 'herself', 'itself', 'have', 'they', 'why', 'her', 'ain', 'ours', 'just', 'ma', 'she', 'be', 'through', 'was', "weren't", 'won', "should've", 'off', "mightn't", "that'll", 'once', 'can', 'but', 'to', 'our', 'them', "doesn't", 'above', 'when', 'how', 'other', 'his', 'who', 'in', "didn't", 'again', 'll', "you've", "you're", 'its', "shan't", 'by', 'or', 'don', 'had', 'of', 'shan', 's', 'm', 'those', 'their', 'haven', 'does', 'being', 'while', 'up', 'under', "won't", 'some', 'are', "isn't", 'until', 'wasn', 'where', 'which', 'nor', 't', 'not', 'y', 'been', 'few', 'isn', 'out', 'here', 'hasn', "it's", 'about', 'your', 'if', 'no', 'because', 'before', 'so', 'there', 'i', 'too', "mustn't", 'same', 'should', 'needn', 'themselves', 'after', 'himself', 'd', 'aren',

In [82]:
example_sent = """Data science is an interdisciplinary field that uses scientific methods, processes, 
                  algorithms and systems to extract knowledge and insights from data in various forms, both 
                  structured and unstructured,[1][2] similar to data mining."""
 
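# Build the stop-word set once for fast membership tests
stop_words = set(stopwords.words('english'))

# word_tokenize is accessed via the nltk package, as in the earlier cell
word_tokens = nltk.word_tokenize(example_sent)
print(word_tokens)

# Keep only tokens that are not stop words (the comparison is case-sensitive)
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(filtered_sentence)

# Lemmatize the remaining tokens (default part of speech: noun)
wordnet_lemmatizer = WordNetLemmatizer()
for w in filtered_sentence:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))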

['Data', 'science', 'is', 'an', 'interdisciplinary', 'field', 'that', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'and', 'systems', 'to', 'extract', 'knowledge', 'and', 'insights', 'from', 'data', 'in', 'various', 'forms', ',', 'both', 'structured', 'and', 'unstructured', ',', '[', '1', ']', '[', '2', ']', 'similar', 'to', 'data', 'mining', '.']
['Data', 'science', 'interdisciplinary', 'field', 'uses', 'scientific', 'methods', ',', 'processes', ',', 'algorithms', 'systems', 'extract', 'knowledge', 'insights', 'data', 'various', 'forms', ',', 'structured', 'unstructured', ',', '[', '1', ']', '[', '2', ']', 'similar', 'data', 'mining', '.']
Lemma for Data is Data
Lemma for science is science
Lemma for interdisciplinary is interdisciplinary
Lemma for field is field
Lemma for uses is us
Lemma for scientific is scientific
Lemma for methods is method
Lemma for , is ,
Lemma for processes is process
Lemma for , is ,
Lemma for algorithms is algorithm
Lemma for systems 
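
The filtered list above still contains punctuation and bracket tokens, and because the stop-word list is lowercase, a capitalized stop word such as "The" would also pass the `w not in stop_words` test. One possible refinement is sketched below in plain Python. The `STOP_WORDS` subset, the sample tokens, and the `remove_stop_words` helper are assumptions for illustration; in the notebook the full set would come from `stopwords.words('english')`.

```python
# Assumed, hand-picked subset of stop words for this sketch only
STOP_WORDS = {"the", "are", "to", "is", "an", "and", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep alphabetic tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]


# Made-up token list that includes a capitalized stop word and punctuation
tokens = ["The", "methods", "are", "similar", "to", "data", "mining", ".", "[", "1", "]"]
print(remove_stop_words(tokens))
# ['methods', 'similar', 'data', 'mining']
```

Whether to drop digits and punctuation depends on the task, so `isalpha()` is a choice made here rather than a general rule.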

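One major difference from stemming is that `lemmatize` takes a part-of-speech parameter, `pos`. If it is not supplied, the default is noun (`"n"`), which is why 'studying' was left unchanged earlier: it was looked up as a noun.
The cell below lemmatizes a few words with NLTK and passes `pos` for the last one: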

In [62]:
lemmatizer = WordNetLemmatizer()
 
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
 
# "a" denotes adjective in the pos argument
print("better :", lemmatizer.lemmatize("better", pos="a"))

rocks : rock
corpora : corpus
better : good
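
In practice the part of speech is rarely typed by hand. A common pattern is to map Penn Treebank tags (the tag set returned by `nltk.pos_tag` by default) to the single-letter values that `lemmatize` accepts. The sketch below shows only the mapping step; `penn_to_wordnet_pos` is a hypothetical helper name, and the prefix rules are a simplification.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective, e.g. JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verb, e.g. VB, VBG, VBZ
    if tag.startswith("R"):
        return "r"  # adverb, e.g. RB, RBR
    return "n"      # everything else falls back to noun, like lemmatize() itself


for tag in ("VBG", "JJR", "RB", "NNS"):
    print(tag, "->", penn_to_wordnet_pos(tag))
# VBG -> v
# JJR -> a
# RB -> r
# NNS -> n
```

The returned letter can then be passed as `pos`, for example `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))`, so that verbs and adjectives are not looked up as nouns.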
