<a href="https://colab.research.google.com/github/alokssingh/CS-332-NLP-Tutorials-Notes/blob/main/Tutorial%202/normalisation_lemmatization_stemming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



#  **Tutorial 2**




*  **Normalization**
*   **Parsing**
*   **Morpheme**
*   **Stemming**
*   **Lemmatization**


# **Text normalisation**

---





1.   **Text normalisation is the process of transforming text into a single canonical form**
2.   **Before normalising text, we should know**
    
       **i)** what type of text is to be normalised, and
      
      **ii)** how it will be processed afterwards





# **Why text normalisation ?**

---



> 1) **In string searching**   *e.g. ‘john’ and ‘John’ (Case based matching)*

> 2) **American or British English spelling**

> 3) **Multiple forms of a single word**, e.g. USA or US

> 4) **Frequently used when converting text to speech**

> > Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context
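
The first three motivations above can be illustrated with a minimal sketch. It lower-cases each token and then maps known spelling or abbreviation variants to one canonical form. The `CANONICAL` table and the `normalize_token` helper are hypothetical names introduced only for this illustration; a real system would need a much larger, task-specific table.

```python
# Hypothetical variant table: maps known variants to one canonical form.
CANONICAL = {
    "colour": "color",   # British -> American spelling
    "usa": "us",         # multiple forms of one name
    "u.s.a.": "us",
}

def normalize_token(token):
    """Lower-case a token, then replace it with its canonical form if one is known."""
    lowered = token.lower()
    return CANONICAL.get(lowered, lowered)

tokens = ["John", "john", "Colour", "USA", "U.S.A."]
normalized = [normalize_token(t) for t in tokens]
print(normalized)
```

After normalisation, `John` and `john` compare equal, and the three spellings of the country name collapse into one token.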


# **Sample example of text normalisation**

---



In [1]:
import numpy as np
import string
import re

In [4]:
data_list = ['The Patient! is s wai@ting, for5 you in room Number', 'johN', 'John', 'JOHN']  # noisy sample inputs
print(data_list[0])

The Patient! is s wai@ting, for5 you in room Number


In [3]:
preprocessed_text = []

In [5]:
# Translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)

for data in data_list:
    # Tokenize, i.e. split on whitespace
    tokens = data.split()
    # Convert to lowercase
    tokens = [word.lower() for word in tokens]
    # Remove punctuation from each token
    tokens = [w.translate(table) for w in tokens]
    # Optionally remove hanging single letters such as 's' and 'a'
    # tokens = [word for word in tokens if len(word) > 1]
    # Remove any remaining special characters and digits
    tokens = [re.sub(r"[@?()^+ 0-9]", "", word) for word in tokens]  # Note: re.sub(pattern, repl, string)
    # Store as a single string
    preprocessed_text.append(' '.join(tokens))

In [6]:
print("******Text before preprocessing******")
for text in data_list:
  print(text)

print("******Text after preprocessing******")
for text in preprocessed_text:
  print(text)

******Text before preprocessing******
The Patient! is s wai@ting, for5 you in room Number
johN
John
JOHN
******Text after preprocessing******
the patient is s waiting for you in room number
john
john
john


Example: the word *cats* consists of the stem *cat* and the affix *s*.

# **1. Parsing**

---

# **2. Morpheme** 

---
# **3. Stemming** 

  *   Chopping affixes off a word
  *   The resulting stem may or may not have a dictionary meaning

* Used in **information retrieval** for searching, **sentiment analysis**, and **document clustering**, e.g. a search for "party" will also return results for "parties"

* Most widely used stemmer: **Porter stemmer**
* Other stemmers: 

        1. Lovins Stemmer 

        2. Dawson Stemmer

        3. Krovetz Stemmer

        4. Xerox Stemmer 

        5. N-Gram Stemmer 

        6. Snowball Stemmer 

        7. Lancaster Stemmer

---
[https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
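
To make the idea of affix chopping concrete, here is a toy sketch of a rule-based stemmer and of how it lets a search for "party" match "parties". It is not the Porter algorithm; the rule list, `toy_stem`, and `search` are hypothetical names and simplifications chosen for illustration only.

```python
# Toy suffix-stripping rules, checked in order: (suffix, replacement).
# This is NOT the Porter algorithm, only an illustration of affix chopping.
SUFFIX_RULES = [
    ("sses", "ss"), ("ies", "i"), ("ss", "ss"),
    ("ing", ""), ("ed", ""), ("s", ""),
]

def toy_stem(word):
    """Apply the first matching suffix rule, then turn a final 'y' into 'i'."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least two letters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[: len(word) - len(suffix)] + replacement
            break
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word

documents = ["Two parties were held", "The party is over", "Studying grammar"]

def search(query, docs):
    """Return the documents containing a word with the same stem as the query."""
    target = toy_stem(query)
    return [d for d in docs if target in {toy_stem(w) for w in d.split()}]

print(toy_stem("parties"), toy_stem("party"))
print(search("party", documents))
```

Both "parties" and "party" reduce to the stem `parti`, which is not a dictionary word but is enough for matching.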

# Steps to use Porter stemmer

**Step 1** - Import the NLTK library and import `PorterStemmer` from it

**Step 2** - Create a variable and store a `PorterStemmer` instance in it

**Step 3** - Call its `stem()` method on each word


# Stemming of words

In [7]:
import nltk
from nltk.stem import PorterStemmer

In [8]:
ps = PorterStemmer()

In [18]:
print(ps.stem('studies'))
print(ps.stem('Studying'))

studi
studi


# Stemming of a sentence

In [10]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [11]:
text = "This was not the map we found in Billy Bones’s chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes."

In [12]:
words = word_tokenize(text)  # alternatively, split on whitespace and clean the text as in the normalisation example above
print(words)

['This', 'was', 'not', 'the', 'map', 'we', 'found', 'in', 'Billy', 'Bones', '’', 's', 'chest', ',', 'but', 'an', 'accurate', 'copy', ',', 'complete', 'in', 'all', 'things-names', 'and', 'heights', 'and', 'soundings-with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '.']


In [13]:
stemed_words = []
for w in words:
  stemed_words.append(ps.stem(w)) 

In [14]:
print(stemed_words)

['thi', 'wa', 'not', 'the', 'map', 'we', 'found', 'in', 'billi', 'bone', '’', 's', 'chest', ',', 'but', 'an', 'accur', 'copi', ',', 'complet', 'in', 'all', 'things-nam', 'and', 'height', 'and', 'soundings-with', 'the', 'singl', 'except', 'of', 'the', 'red', 'cross', 'and', 'the', 'written', 'note', '.']


In [15]:
# converting list into text stream
stemmed_text = ' '.join(stemed_words)
print(stemmed_text)

thi wa not the map we found in billi bone ’ s chest , but an accur copi , complet in all things-nam and height and soundings-with the singl except of the red cross and the written note .


# **4. Lemmatization**

---

1. The task of determining that two words have the same root, despite their surface differences
2. **Studies and Studying** ---> *Study* 
3. The lemma has a dictionary meaning
4. e.g. the lemmatized form of a sentence:
     

       He is reading detective stories
       ⏬ ⏬  ⏬         ⏬        ⏬
       He be read   detective   story     NOTE:  am, are, and is have the shared lemma be

5. Different lemmatizers:

        1. WordnetLemmatizer

        2. spaCy

        3. TextBlob Lemmatizer

        4. Pattern Lemmatizer

        5. Stanford CoreNLP Lemmatizer

        6. Gensim lemmatizer            
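
The example sentence in item 4 can be reproduced with a toy dictionary lookup, which shows what it means for a lemma to be a dictionary form. The `LEMMA_TABLE` entries and the `lookup_lemma` helper are hand-written and hypothetical; real lemmatizers rely on a full lexicon and morphological rules rather than a five-entry table.

```python
# Hypothetical hand-written lookup table: inflected form -> lemma.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "reading": "read",
    "stories": "story",
}

def lookup_lemma(word):
    """Return the lemma from the table, or the word unchanged if it is not listed."""
    return LEMMA_TABLE.get(word.lower(), word)

sentence = "He is reading detective stories"
lemmas = [lookup_lemma(w) for w in sentence.split()]
print(" ".join(lemmas))
```

Unlike a stem, every output token here (`be`, `read`, `story`) is a word you could look up in a dictionary.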

# Lemmatization of words

In [19]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet, introduced in the last tutorial
nltk.download('averaged_perceptron_tagger')  # model used by nltk.pos_tag

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [20]:
lemmatizer = WordNetLemmatizer()

In [21]:
print(lemmatizer.lemmatize('sang'))
print(lemmatizer.lemmatize('sung'))
print(lemmatizer.lemmatize('sings'))

sang
sung
sings


*No root form was returned for any of these words because they were given without context: `lemmatize()` treats every word as a noun unless told otherwise. You need to provide the context, i.e. the part of speech (POS), through the `pos` argument.*

**Sometimes the same word can have multiple lemmas depending on its meaning or context.**

In [22]:
# pos_tag expects a list of tokens; a bare string is tagged character by character
print(nltk.pos_tag('sang'))
print(nltk.pos_tag(['sing']))

[('s', 'VB'), ('a', 'DT'), ('n', 'JJ'), ('g', 'NN')]
[('sing', 'VBG')]


In [23]:
print(lemmatizer.lemmatize('sang',pos='v'))
print(lemmatizer.lemmatize('sung',pos='v'))
print(lemmatizer.lemmatize('sings',pos='v'))

sing
sing
sing
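
Passing `pos='v'` by hand does not scale to whole sentences. One common approach, sketched here, converts the Penn Treebank tags returned by `nltk.pos_tag` into the single-letter POS codes that `lemmatize` accepts. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and falling back to noun for unrecognised tags is an assumption of this sketch.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # assumption: treat everything else as a noun

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs with a callable such as WordNetLemmatizer().lemmatize."""
    return [lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# The mapping itself needs no NLTK data:
for tag in ["VBD", "NNS", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))

# Usage in this notebook (requires the wordnet and tagger downloads above):
# tagged = nltk.pos_tag(word_tokenize("He is reading detective stories"))
# print(lemmatize_tagged(tagged, lemmatizer.lemmatize))
```

The lemmatizer is passed in as a callable so the same helper works with any lemmatizer that takes a `pos` keyword.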


# Lemmatization of text

In [32]:
text = "This was not the map we found in Billy Bones’s chest, @ but an accurate copy, complete in all things names and heights and soundings with the single exception of the red crosses and the written notes."

In [None]:
text = text.replace('@ ', '')  # remove the stray '@' marker so text stays a single string
print(text)

In [25]:
words = word_tokenize(text)  # alternatively, split on whitespace and clean the text as in the normalisation example above
print(words)

['This', 'was', 'not', 'the', 'map', 'we', 'found', 'in', 'Billy', 'Bones', '’', 's', 'chest', ',', 'but', 'an', 'accurate', 'copy', ',', 'complete', 'in', 'all', 'things', 'names', 'and', 'heights', 'and', 'soundings', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '.']


In [26]:
lemma_words = []
for w in words:
  lemma_words.append(lemmatizer.lemmatize(w, pos='v')) 

In [27]:
print(lemma_words)

['This', 'be', 'not', 'the', 'map', 'we', 'find', 'in', 'Billy', 'Bones', '’', 's', 'chest', ',', 'but', 'an', 'accurate', 'copy', ',', 'complete', 'in', 'all', 'things', 'name', 'and', 'heights', 'and', 'sound', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'cross', 'and', 'the', 'write', 'note', '.']


In [28]:
# converting list into text stream
print("***** Text before lemmatization*****\n")
print(text)
lemmatized_text = ' '.join(lemma_words)
print("\n***** Text after lemmatization*****\n")
print(lemmatized_text)

***** Text before lemmatization*****

This was not the map we found in Billy Bones’s chest, but an accurate copy, complete in all things names and heights and soundings with the single exception of the red crosses and the written notes.

***** Text after lemmatization*****

This be not the map we find in Billy Bones ’ s chest , but an accurate copy , complete in all things name and heights and sound with the single exception of the red cross and the write note .


# Assignment 1

text = "The study of grammar has an ancie5nt pe6digree; Panini’s grammar of Sanskrit was written over two thousand ye@ars ago and is st\ill referenced today in teaching San]skrit. Despite this histor$y, knowledge of grammar remains spotty at best. In this chapter, we make a pr%eliminary st-ab at addressing some of these gaps in our knowledge of grammar and syntax, (as well as introducing some of the formal mechanisms that are available for capturing this knowledge in a computationally useful manner."

1. Normalise the text above, then do the following:

       Apply each of the stemmers to the normalised text and note the pros and cons of each stemmer
  
       Apply each of the lemmatizers to the normalised text and note the pros and cons of each lemmatizer

# Assignment 2
text2 = "व्याकरण के अध्ययन में एक पुरानी 5 वीं डिग्री है; पाणिनी का संस्कृत व्याकरण दो हजार साल पहले लिखा गया था और आज भी संस्कृत के संस्कृत शिक्षण में इसका उल्लेख नहीं किया गया है। इस इतिहास के बावजूद, व्याकरण का ज्ञान सबसे अच्छा रहता है। इस अध्याय में, हम व्याकरण और वाक्य रचना के अपने ज्ञान में इनमें से कुछ अंतरालों को संबोधित करने के लिए एक pr% एलिमिनरी st-ab बनाते हैं, (साथ ही कुछ औपचारिक तंत्रों को पेश करते हैं जो इस ज्ञान को कम्प्यूटेशनल रूप से उपयोगी तरीके से कैप्चर करने के लिए उपलब्ध हैं।"


1. Normalise the text above, then do the following:

       Stem the text after tokenizing it with the IndicNLP tokenizer
  
       Lemmatize the text after tokenizing it with the IndicNLP tokenizer

For IndicNLP ⏬⏬⏬

[IndicNLP](https://anoopkunchukuttan.github.io/indic_nlp_library/)

[IndicNLP site](https://indicnlp.org/)

# Hints

**Libraries required for other lemmatizers**

        import spacy

        from textblob import TextBlob, Word

        import pattern (in Colab: !pip install pattern)

        from stanfordcorenlp import StanfordCoreNLP

        from gensim.utils import lemmatize (removed in gensim 4.0; requires an older gensim release)

**Libraries required for other stemmers**

        from nltk.stem.snowball import SnowballStemmer
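
As a starting point for comparing stemmers, the sketch below runs three stemmers that ship with NLTK over a few sample words. The word list and the `stemmers` and `results` names are illustrative choices; the other stemmers listed earlier are not part of this sketch and may need separate packages.

```python
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

# Three stemmers available directly in NLTK (no corpus download needed).
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

sample_words = ["studies", "studying", "crosses", "written"]

# Stem every sample word with every stemmer and print the results side by side.
results = {name: [s.stem(w) for w in sample_words] for name, s in stemmers.items()}
for name, stems in results.items():
    print(f"{name:10s} {stems}")
```

Comparing the rows shows where the stemmers agree and where one is more aggressive than another.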



1. [http://www.nltk.org/install.html](http://www.nltk.org/install.html)

2. [https://text-processing.com/demo/stem/](https://text-processing.com/demo/stem/)

3. [http://people.scs.carleton.ca/~armyunis/projects/KAPI/porter.pdf](http://people.scs.carleton.ca/~armyunis/projects/KAPI/porter.pdf)
    



