
# **Text normalisation**

---





1.   **Text normalisation is the process of transforming text into a single canonical form.**
2.   **Before normalising text, we should know:**

       **i)** what type of text is to be normalised, and

       **ii)** how it will be processed afterwards.
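
The idea of a "single canonical form" can be illustrated with a minimal sketch that uses only the Python standard library. The helper name `canonical` and the choice of steps (Unicode NFKC normalisation, case folding, whitespace collapsing) are assumptions made for this illustration; which steps are appropriate depends on the type of text and on the later processing.

```python
import unicodedata

def canonical(text: str) -> str:
    """Map a string to one canonical form (illustrative choice of steps)."""
    # Unify compatibility characters, e.g. the single-character ligature 'ﬁ' becomes 'fi'
    text = unicodedata.normalize("NFKC", text)
    # Case folding: a more aggressive form of lowercasing meant for caseless matching
    text = text.casefold()
    # Collapse runs of whitespace and strip leading/trailing spaces
    return " ".join(text.split())

variants = ["John  Smith", "JOHN SMITH", " john smith "]
print({canonical(v) for v in variants})  # {'john smith'}
```

All three surface variants collapse to the same string, so they can be compared or counted as one item.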





# **Why text normalisation?**

---



> 1) **String searching**, *e.g. matching ‘john’ against ‘John’ (case-insensitive matching)*

> 2) **American vs. British English spelling**, *e.g. ‘color’ and ‘colour’*

> 3) **Multiple forms of a single word**, *e.g. USA or US*

> 4) **Text-to-speech conversion**

> > Numbers, dates, acronyms, and abbreviations are non-standard "words" that must be pronounced differently depending on context.
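
The first three motivations can be sketched with simple lookups. The tables `SPELLING` and `ABBREVIATIONS` and the helper names below are hypothetical and cover only a handful of words; a real system would need much larger resources.

```python
# Hypothetical lookup tables, for illustration only
SPELLING = {"colour": "color", "normalisation": "normalization"}
ABBREVIATIONS = {"US": "usa", "U.S.": "usa", "USA": "usa"}

def normalise_token(token: str) -> str:
    # Abbreviations are matched case-sensitively, so the pronoun "us" is left alone
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    token = token.lower()
    # Map British spellings to one (here: American) variant
    return SPELLING.get(token, token)

def same_word(a: str, b: str) -> bool:
    """True if two tokens share the same normalised form."""
    return normalise_token(a) == normalise_token(b)

print(same_word("john", "John"))     # True  (case)
print(same_word("Colour", "color"))  # True  (spelling variant)
print(same_word("US", "USA"))        # True  (multiple forms of one word)
print(same_word("us", "USA"))        # False (the pronoun is not the country)
```

Checking abbreviations before lowercasing is a design choice: lowercasing first would make "US" indistinguishable from "us".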


# **Example of text normalisation**

---



In [None]:
import numpy as np
import string
import re

In [None]:
data_list = ['The Patient! is s wai@ting, for5 you in room Number','johN', 'John', 'JOHN']
print(data_list)

In [None]:
preprocessed_text = []

In [None]:

# Build the translation table once; it deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)

for data in data_list:
    # Tokenize, i.e. split on whitespace
    tokens = data.split()
    # Convert to lowercase
    tokens = [word.lower() for word in tokens]
    # Remove punctuation from each token
    tokens = [w.translate(table) for w in tokens]
    # Optionally drop stray one-letter tokens such as 's' and 'a'
    # tokens = [word for word in tokens if len(word) > 1]
    # Remove digits and any leftover special characters from each token
    tokens = [re.sub(r"[@?()^+ 0-9]", "", word) for word in tokens]  # Note: re.sub(pattern, repl, string)
    # Store as a single string
    preprocessed_text.append(' '.join(tokens))

In [None]:
print("******Text before preprocessing******")
for text in data_list:
  print(text)

print("******Text after preprocessing******")
for text in preprocessed_text:
  print(text)

# **1. Parsing**

---

# **2. Morpheme** 

---
# **3. Stemming** 

  *   Chopping affixes off a word
  *   The resulting stem may or may not be a dictionary word

* Used in **information retrieval** (search), sentiment analysis, and document clustering, e.g. a search for "party" will also return results for "parties"

* Most widely used stemmer: **Porter stemmer**
* Other stemmers:

        1. Lovins Stemmer

        2. Dawson Stemmer

        3. Krovetz Stemmer

        4. Xerox Stemmer

        5. N-Gram Stemmer

        6. Snowball Stemmer

        7. Lancaster Stemmer

---
[https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
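
To make "chopping affixes" concrete, here is a toy suffix-stripping stemmer. It is not the Porter algorithm: the rule list `RULES`, the minimum stem length, and the name `toy_stem` are assumptions chosen for this sketch. It does show the two properties listed above: related forms are conflated, and the stem need not be a dictionary word.

```python
# (suffix, replacement) pairs, tried in order; a deliberately tiny, hypothetical rule set
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", ""), ("y", "i")]
MIN_STEM = 3  # never leave fewer than three characters before the suffix

def toy_stem(word: str) -> str:
    word = word.lower()
    for suffix, replacement in RULES:
        stem = word[: len(word) - len(suffix)]
        # Apply the first rule whose suffix matches and leaves a long enough stem
        if word.endswith(suffix) and len(stem) >= MIN_STEM:
            return stem + replacement
    return word  # no rule applied

for w in ["party", "parties", "crosses", "batting", "is"]:
    print(w, "->", toy_stem(w))
# party -> parti
# parties -> parti
# crosses -> cross
# batting -> batt
# is -> is
```

"party" and "parties" end up with the same stem, which is what a search engine needs, even though "parti" and "batt" are not English words.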

# Steps to use Porter stemmer

**Step 1** - Import the NLTK library, then import `PorterStemmer` from `nltk.stem`

**Step 2** - Create a variable and store a `PorterStemmer` instance in it

**Step 3** - Call its `stem()` method on each word


# Stemming of words

In [None]:
import nltk
from nltk.stem import PorterStemmer

In [None]:
ps = PorterStemmer()

In [None]:
print(ps.stem('bat'))
print(ps.stem('batting'))

# Stemming of a sentence

In [None]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models used by word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead

In [None]:
text = "This was not the map we found in Billy Bones’s chest, but an accurate copy, complete in all things-names and heights and soundings-with the single exception of the red crosses and the written notes."

In [None]:
words = word_tokenize(text)  # alternatively, split on whitespace and clean the tokens as in the normalisation example above
print(words)

In [None]:
stemed_words = []
for w in words:
  stemed_words.append(ps.stem(w)) 

In [None]:
print(stemed_words)

In [None]:
# Join the list of stems back into a single string
stemmed_text = ' '.join(stemed_words)
print(stemmed_text)

# **4. Lemmatization**

---

1. The task of determining that two words have the same root (lemma), despite their surface differences
2. **Studies and Studying** ---> *Study*
3. The lemma is always a valid dictionary word
4. e.g. lemmatized form of a sentence:


       He   is   reading   detective   stories
       ↓    ↓    ↓         ↓           ↓
       He   be   read      detective   story

   NOTE: *am*, *are*, and *is* share the lemma *be*

5. Different lemmatizers:

        1. WordNetLemmatizer (NLTK)

        2. spaCy

        3. TextBlob Lemmatizer

        4. Pattern Lemmatizer

        5. Stanford CoreNLP Lemmatizer

        6. Gensim Lemmatizer
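
At its simplest, lemmatization is a dictionary lookup. The sketch below reproduces the example sentence with a hand-written table; `LEMMAS` and `toy_lemmatize` are hypothetical names, and the table covers only the words needed here, whereas the lemmatizers listed above rely on full lexical resources.

```python
# Hypothetical lemma table, just large enough for the example sentence
LEMMAS = {
    "am": "be", "are": "be", "is": "be",  # all three share the lemma 'be'
    "reading": "read",
    "stories": "story",
}

def toy_lemmatize(sentence: str) -> str:
    # Look each token up in lowercase; unknown tokens are returned unchanged
    return " ".join(LEMMAS.get(word.lower(), word) for word in sentence.split())

print(toy_lemmatize("He is reading detective stories"))
# He be read detective story
```

Unlike a stemmer, every output token here is a dictionary word, because the result comes from a lookup rather than from chopping characters.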

# Lemmatization of words

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet corpus (covered in the last tutorial)
nltk.download('averaged_perceptron_tagger')  # POS tagger model; recent NLTK releases may ask for 'averaged_perceptron_tagger_eng'

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
print(lemmatizer.lemmatize('sang'))
print(lemmatizer.lemmatize('sung'))
print(lemmatizer.lemmatize('sings'))

*No root form is returned for any of these words because they are given without context: `lemmatize()` treats every word as a noun unless told otherwise. You need to supply the context, that is, the part of speech (POS), with which the word should be lemmatized.*

**Sometimes the same word can have multiple lemmas depending on its meaning / context.**

In [None]:
print(nltk.pos_tag(['sang']))  # pos_tag expects a list of tokens; a bare string would be tagged character by character
print(nltk.pos_tag(['sing']))

In [None]:
print(lemmatizer.lemmatize('sang',pos='v'))
print(lemmatizer.lemmatize('sung',pos='v'))
print(lemmatizer.lemmatize('sings',pos='v'))

# Lemmatization of text

In [None]:
text = "This was not the map we found in Billy Bones’s chest, but an accurate copy, complete in all things names and heights and soundings with the single exception of the red crosses and the written notes."

In [None]:
words = word_tokenize(text)  # alternatively, split on whitespace and clean the tokens as in the normalisation example above
print(words)

In [None]:
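lemma_words = []
for w in words:
  # pos='v' treats every token as a verb (e.g. 'was' -> 'be'); a simplification, since many tokens are not verbs
  lemma_words.append(lemmatizer.lemmatize(w, pos='v'))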

In [None]:
print(lemma_words)

In [None]:
# Join the list of lemmas back into a single string
print("***** Text before lemmatization*****\n")
print(text)
lemmatized_text = ' '.join(lemma_words)
print("\n***** Text after lemmatization*****\n")
print(lemmatized_text)
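
The loop above passes `pos='v'` for every token. One possible refinement is to derive the POS of each token from `nltk.pos_tag` and translate its Penn Treebank tag into the single-letter codes that `lemmatize()` accepts (`'n'`, `'v'`, `'a'`, `'r'`). The helper below is a sketch of that translation; the name `penn_to_wordnet` and the fallback to noun are assumptions of this example.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):   # JJ, JJR, JJS
        return "a"            # adjective
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"            # verb
    if tag.startswith("RB"):  # RB, RBR, RBS
        return "r"            # adverb
    return "n"                # everything else is treated as a noun

for tag in ["VBD", "NNS", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
# VBD -> v
# NNS -> n
# JJ -> a
# RB -> r
# DT -> n
```

With the `lemmatizer` and `words` defined earlier, it could be used as `[lemmatizer.lemmatize(w, pos=penn_to_wordnet(t)) for w, t in nltk.pos_tag(words)]`, since `nltk.pos_tag` returns a list of `(token, tag)` pairs.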