# Text Normalization in NLP
* Notebook by Adam Lang
* Date: 3/18/2024

# NLP Normalization - Overview
* Morpheme = the smallest meaningful unit of a word; the root morpheme is the word's base.
* Structure of a token in NLP
    * prefix-morpheme-suffix
    * Example: `Antinationalist: Anti + national + ist`
    * 'national' is the base morpheme here.
* Normalization is the NLP process of converting a token into its base form (its base morpheme).

## Why do we need Text Normalization in NLP?
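* Reduces data dimensionality: inflected variants of a word collapse into one vocabulary entry.
* Text cleaning: tokens are brought into a consistent form before downstream processing.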
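
As a rough illustration of the dimensionality point, the sketch below counts distinct tokens before and after mapping each token to a base form. The token list and the `LEMMA_LOOKUP` table are hand-written assumptions for this example, not output from any library.

```python
# Hypothetical token list and lemma table, written by hand for illustration only.
tokens = ["the", "star", "is", "bright", "and", "the", "stars", "are", "bright"]
LEMMA_LOOKUP = {"is": "be", "are": "be", "stars": "star"}

# Vocabulary of the raw tokens: every distinct surface form is its own entry.
raw_vocab = set(tokens)

# Vocabulary after normalization: tokens missing from the table are kept as-is.
normalized_vocab = {LEMMA_LOOKUP.get(tok, tok) for tok in tokens}

print(len(raw_vocab), sorted(raw_vocab))
print(len(normalized_vocab), sorted(normalized_vocab))
```

Here "is" and "are" share the entry "be", and "stars" merges with "star", so the vocabulary shrinks from 7 entries to 5.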

## What are the common methods for Text Normalization in NLP?
1. Stemming
    * Elementary rule-based process that strips inflectional endings from a token.
    * Example: "laughing", "laughed", "laughs", "laugh" ====> "laugh"
    * Not ideal because it can reduce a token to a string that is not a real word.

2. Lemmatization
    * Systematic process for reducing a token to its lemma.
    * Produces a root word or lemma.
    * Examples:
        a) am, are, is ==> be
        b) running, ran, run, runs ==> run
    * Running, 'verb' ==> run
    * Running, 'noun' ==> running
    * More meaningful technique than `Stemming` because its output is always a valid word.
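
To make the contrast between the two methods concrete, here is a minimal sketch in plain Python. Both `toy_stem` and `toy_lemma` are hypothetical helpers written for this notebook: the stemmer strips a few suffixes and is not the Porter algorithm, and the lemmatizer is only a hand-written lookup keyed on word and part of speech.

```python
# Toy suffix-stripping stemmer: an illustrative sketch, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemmatizer: a hand-written (word, part-of-speech) lookup table.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("ran", "VERB"): "run",
}


def toy_lemma(token: str, pos: str) -> str:
    """Return the table entry for (token, pos), or the lowercased token if absent."""
    word = token.lower()
    return LEMMA_TABLE.get((word, pos), word)


stems = [toy_stem(w) for w in ["laughing", "laughed", "laughs", "laugh"]]
print(stems)                         # ['laugh', 'laugh', 'laugh', 'laugh']
print(toy_stem("running"))           # runn (not a real word)
print(toy_lemma("Running", "VERB"))  # run
print(toy_lemma("Running", "NOUN"))  # running
```

The stemmer handles the "laugh" family but turns "running" into "runn", while the lookup returns a valid word and can give different answers for the verb and the noun.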


# Implementing Text Normalization in Python - spaCy

## Lemmatization

### Example 1

In [1]:
# define a string to test
text = "The sky is clear and the stars are twinkling."

In [2]:
import spacy

In [3]:
# load spacy model
nlp = spacy.load('en_core_web_sm')

In [4]:
# create nlp doc object
doc = nlp(text)

In [5]:
# lemmatize text
[(token.text, token.lemma_) for token in doc]

[('The', 'the'),
 ('sky', 'sky'),
 ('is', 'be'),
 ('clear', 'clear'),
 ('and', 'and'),
 ('the', 'the'),
 ('stars', 'star'),
 ('are', 'be'),
 ('twinkling', 'twinkle'),
 ('.', '.')]

summary:
* Above, each tuple shows the original token on the left and its lemma on the right.
* Lemmatized text like this is useful for question-answering problems, where different inflections of the same word should match.
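
If a normalized string is needed rather than a list of tuples, the pairs can be joined back together. The sketch below works on the pairs copied from the Example 1 output so that it runs without a spaCy model. The variable names `pairs` and `normalized` are introduced here for illustration.

```python
# (text, lemma) pairs copied from the Example 1 output above.
pairs = [('The', 'the'), ('sky', 'sky'), ('is', 'be'), ('clear', 'clear'),
         ('and', 'and'), ('the', 'the'), ('stars', 'star'), ('are', 'be'),
         ('twinkling', 'twinkle'), ('.', '.')]

# Keep alphabetic lemmas only, which drops the final period.
# With a live spaCy doc, token.is_punct would likely be the more robust filter.
normalized = " ".join(lemma for _, lemma in pairs if lemma.isalpha())
print(normalized)  # the sky be clear and the star be twinkle
```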

### Example 2

In [6]:
text = "The moon looks beautiful at night. It's hard to resist its beauty."

# Creating doc object
doc = nlp(text)

# lemmatizing the text with list comprehension
[(token.text, token.lemma_) for token in doc]

[('The', 'the'),
 ('moon', 'moon'),
 ('looks', 'look'),
 ('beautiful', 'beautiful'),
 ('at', 'at'),
 ('night', 'night'),
 ('.', '.'),
 ('It', 'it'),
 ("'s", 'be'),
 ('hard', 'hard'),
 ('to', 'to'),
 ('resist', 'resist'),
 ('its', 'its'),
 ('beauty', 'beauty'),
 ('.', '.')]

summary: we can again see the lemmatization technique applied.

# Lemmatization Exercise
* We will lemmatize the contents of the text file `switzerland.txt`.

In [7]:
# read the file into the text variable; the with block closes the file automatically
with open('/content/drive/MyDrive/Colab Notebooks/Classical NLP/switzerland.txt', mode='r', encoding='utf-8') as file:
    text = file.read()



In [11]:
# print the text to make sure the file has loaded
print(text)

Switzerland, officially the Swiss Confederation, is a country situated in the confluence of Western, Central, and Southern Europe. It is a federal republic composed of 26 cantons, with federal authorities based in Bern. Switzerland is a landlocked country bordered by Italy to the south, France to the west, Germany to the north, and Austria and Liechtenstein to the east. It is geographically divided among the Swiss Plateau, the Alps, and the Jura, spanning a total area of 41,285 km2 (15,940 sq mi), and land area of 39,997 km2 (15,443 sq mi). While the Alps occupy the greater part of the territory, the Swiss population of approximately 8.5 million is concentrated mostly on the plateau, where the largest cities and economic centres are located, among them Zürich, Geneva and Basel, where multiple international organisations are domiciled (such as FIFA, the UN's second-largest Office, and the Bank for International Settlements) and where the main international airports of Switzerland are.



In [12]:
# create a spacy doc object
doc = nlp(text)

In [13]:
# lemmatize text using list comprehension
[(token.text, token.lemma_) for token in doc]

[('Switzerland', 'Switzerland'),
 (',', ','),
 ('officially', 'officially'),
 ('the', 'the'),
 ('Swiss', 'Swiss'),
 ('Confederation', 'Confederation'),
 (',', ','),
 ('is', 'be'),
 ('a', 'a'),
 ('country', 'country'),
 ('situated', 'situate'),
 ('in', 'in'),
 ('the', 'the'),
 ('confluence', 'confluence'),
 ('of', 'of'),
 ('Western', 'western'),
 (',', ','),
 ('Central', 'central'),
 (',', ','),
 ('and', 'and'),
 ('Southern', 'Southern'),
 ('Europe', 'Europe'),
 ('.', '.'),
 ('It', 'it'),
 ('is', 'be'),
 ('a', 'a'),
 ('federal', 'federal'),
 ('republic', 'republic'),
 ('composed', 'compose'),
 ('of', 'of'),
 ('26', '26'),
 ('cantons', 'canton'),
 (',', ','),
 ('with', 'with'),
 ('federal', 'federal'),
 ('authorities', 'authority'),
 ('based', 'base'),
 ('in', 'in'),
 ('Bern', 'Bern'),
 ('.', '.'),
 ('Switzerland', 'Switzerland'),
 ('is', 'be'),
 ('a', 'a'),
 ('landlocked', 'landlocked'),
 ('country', 'country'),
 ('bordered', 'border'),
 ('by', 'by'),
 ('Italy', 'Italy'),
 ('to', 'to'),

summary:
* We can see the full lemmatization of the text file above.
* There are other libraries for doing this, such as NLTK and Gensim, but spaCy is arguably the simplest option.
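
With a long output like this, most tokens come back unchanged, so it can help to list only the tokens whose lemma differs from the surface form. The sketch below does that for the first sentence, using pairs copied from the output above so it runs without a model. On a live `doc`, the same filter could be written as `[(t.text, t.lemma_) for t in doc if t.text != t.lemma_]`.

```python
# (text, lemma) pairs for the first sentence, copied from the output above.
pairs = [('Switzerland', 'Switzerland'), (',', ','), ('officially', 'officially'),
         ('the', 'the'), ('Swiss', 'Swiss'), ('Confederation', 'Confederation'),
         (',', ','), ('is', 'be'), ('a', 'a'), ('country', 'country'),
         ('situated', 'situate'), ('in', 'in'), ('the', 'the'),
         ('confluence', 'confluence'), ('of', 'of'), ('Western', 'western'),
         (',', ','), ('Central', 'central'), (',', ','), ('and', 'and'),
         ('Southern', 'Southern'), ('Europe', 'Europe'), ('.', '.')]

# Keep only the tokens that lemmatization actually changed.
changed = [(text, lemma) for text, lemma in pairs if text != lemma]
print(changed)
# [('is', 'be'), ('situated', 'situate'), ('Western', 'western'), ('Central', 'central')]
```

Note that "Western" and "Central" were lowercased while "Southern" was not, which suggests the model tagged them differently in this sentence.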