# Introduction

This notebook shows common NLP data preprocessing techniques and how to use them.

In [12]:
# Import libraries (NLTK is third-party, unicodedata is part of the standard library)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

import unicodedata

# Download the required NLTK data if not already done
# import nltk
# nltk.download('stopwords') # Stopwords Removal
# nltk.download('wordnet') # Lemmatization

# Stopwords Removal

This step removes stopwords from an input text, in order to reduce the noise fed to an NLP model.

Stopwords are very common words that usually carry little information for an NLP model. The `input_text` below contains several of them: *in*, *not*, *a*, etc.

In [1]:
# Define input text
input_text = """I’m amazed how often in practice, not only does a @huggingface NLP model solve your problem, but one of their public finetuned checkpoints, is good enough for the job.

Both impressed, and a little disappointed how rarely I get to actually train a model that matters :("""

In [8]:
# Retrieve stopwords list from NLTK
stop_words_english = stopwords.words('english')

print(stop_words_english[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [9]:
# Convert the stopwords list to a set to speed up membership checks
stop_words_english = set(stop_words_english)

In [12]:
# Convert the input text to lowercase and split it into a list of words
input_text = input_text.lower().split()

In [13]:
# Remove stopwords
input_text_no_stopwords = [word for word in input_text if word not in stop_words_english]

In [15]:
print(' '.join(input_text_no_stopwords))

i’m amazed often practice, @huggingface nlp model solve problem, one public finetuned checkpoints, good enough job. impressed, little disappointed rarely get actually train model matters :(
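
Note that `split()` leaves punctuation attached to tokens (e.g., `practice,`), so a stopword followed by a comma or a question mark would slip through the filter above. The following is a minimal sketch of one way to handle this, using only the standard library. The helper name `remove_stopwords` and the tiny `DEMO_STOP_WORDS` set are illustrative assumptions, not part of NLTK.

```python
import string

# Tiny illustrative stopword set (an assumption for this sketch);
# in the notebook you would pass set(stopwords.words('english')) instead.
DEMO_STOP_WORDS = {"a", "the", "in", "not", "is", "of", "for"}


def remove_stopwords(text, stop_words):
    """Lowercase, split on whitespace, and drop tokens that are stopwords.

    Punctuation is stripped only for the comparison, so the surviving
    tokens keep their original punctuation.
    """
    kept = []
    for token in text.lower().split():
        bare = token.strip(string.punctuation)
        if bare not in stop_words:
            kept.append(token)
    return kept


print(remove_stopwords("Good enough for the job, is it not?", DEMO_STOP_WORDS))
# ['good', 'enough', 'job,', 'it']
```

Here `not?` is removed even though it carries a question mark, while `job,` is kept as-is.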


# Tokens

It is quite common to not just tokenize the text input, but also to add some special tokens like:
- [PAD] - Used for maintaining the same length across input sequences
- [UNK] - Used when a token is unknown
- [CLS] - Used at the start of the input sequence
- [SEP] - Used at the end of the input sequence
- [MASK] - Used to mask a token (e.g., in masked language modeling, MLM, training)

**Example**

From: *"@joebloggs thinks that the NLP models that @huggingface made are super cool"*

To: "[CLS] [UNK] thinks that the NLP models that [UNK] made are super cool [SEP] [PAD] [PAD]"
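
The transformation in the example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it splits on whitespace, while real tokenizers usually work on subwords. The function name `add_special_tokens`, the toy `vocab`, and the `max_len` value are hypothetical.

```python
PAD, UNK, CLS, SEP = "[PAD]", "[UNK]", "[CLS]", "[SEP]"


def add_special_tokens(text, vocab, max_len):
    """Whitespace-tokenize `text` and wrap it with special tokens.

    Words missing from `vocab` become [UNK]; the sequence is padded
    with [PAD] up to `max_len` (no truncation in this sketch).
    """
    tokens = [word if word in vocab else UNK for word in text.split()]
    tokens = [CLS] + tokens + [SEP]
    tokens += [PAD] * (max_len - len(tokens))
    return tokens


# Toy vocabulary (assumption): every word of the example except the @handles
vocab = {"thinks", "that", "the", "NLP", "models", "made", "are", "super", "cool"}
sentence = "@joebloggs thinks that the NLP models that @huggingface made are super cool"

print(" ".join(add_special_tokens(sentence, vocab, max_len=16)))
```

With `max_len=16`, the 12 words plus `[CLS]` and `[SEP]` leave room for two `[PAD]` tokens, which reproduces the "To" string above.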

# Stemming

Stemming is a text normalization technique that simplifies the text before it is processed.

It removes the final characters of a word in order to reduce the word to its root form. In linguistics, the different forms of a word are called *inflections*.

Keep in mind that sophisticated models can handle inflections on their own, so stemming is not required by every model. It can also change the meaning of a word.

In [2]:
# Words to stem
words_to_stem = ['happy', 'happiest', 'happier', 'cactus', 'cactii', 'elephant', 'elephants', 'amazed', 'amazing', 'amazingly', 'cement', 'owed', 'maximum']

In [4]:
# Define two different Stemmers
porter = PorterStemmer()
lancaster = LancasterStemmer()

# Stem every word in `words_to_stem` with both stemmers
stemmed = [(porter.stem(word), lancaster.stem(word)) for word in words_to_stem]

print("Porter | Lancaster")
for stem in stemmed:
    print(f"{stem[0]} | {stem[1]}")

Porter | Lancaster
happi | happy
happiest | happiest
happier | happy
cactu | cact
cactii | cacti
eleph | eleph
eleph | eleph
amaz | amaz
amaz | amaz
amazingli | amaz
cement | cem
owe | ow
maximum | maxim
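
To make the idea of "removing the final characters" concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter or Lancaster algorithm, which apply many ordered rules. The suffix list, the `naive_stem` name, and the `min_stem_len` parameter are assumptions chosen for illustration.

```python
# Illustrative suffix list (assumption); real stemmers use many ordered rules
SUFFIXES = ("ingly", "ing", "ed", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    for suffix in SUFFIXES:  # longer suffixes are listed first
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["amazed", "amazing", "amazingly", "elephants", "owed"]:
    print(word, "->", naive_stem(word))
```

The three forms of *amaze* collapse to `amaz`, while `owed` is left untouched because stripping `ed` would leave fewer than three characters.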


# Lemmatization

Lemmatization is very similar to Stemming, but it reduces each word to a real dictionary word, called the "*Lemma*".

In [3]:
# Initialise the Lemmatizer
lemmatizer = WordNetLemmatizer()

In [4]:
# Input words
words = ['amaze', 'amazed', 'amazing']

In [5]:
# Nothing changes, because no part-of-speech tag is given and the lemmatizer treats the words as nouns by default
[lemmatizer.lemmatize(word) for word in words]

['amaze', 'amazed', 'amazing']

In [7]:
# Lemmatize again with the part-of-speech tag "VERB"
[lemmatizer.lemmatize(word, wordnet.VERB) for word in words]

['amaze', 'amaze', 'amaze']
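
The role of the part-of-speech tag can be illustrated with a toy lookup table. This is only a sketch of the idea, not how WordNet is implemented: the `LEMMA_TABLE` entries and the `toy_lemmatize` helper are hypothetical, and the default `pos="n"` simply mirrors the noun default of `WordNetLemmatizer.lemmatize`.

```python
# Toy lemma table keyed by (word, part of speech); the entries are
# illustrative assumptions, not WordNet data.
LEMMA_TABLE = {
    ("amazed", "v"): "amaze",
    ("amazing", "v"): "amaze",
    ("elephants", "n"): "elephant",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if not found."""
    return LEMMA_TABLE.get((word, pos), word)


words = ["amaze", "amazed", "amazing"]
print([toy_lemmatize(w) for w in words])       # looked up as nouns: unchanged
print([toy_lemmatize(w, "v") for w in words])  # looked up as verbs: all 'amaze'
```

The same word maps to different lemmas depending on the tag, which is why the untagged call leaves the verb forms unchanged.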

# Unicode Normalization

Some Unicode characters can cause problems when training NLP models. For example, two characters may look the same and yet be encoded as different sequences of Unicode code points. This is called *Canonical Equivalence*.

In [10]:
# Characters look the same
print("\u00C7", "\u0043\u0327")

Ç Ç


In [11]:
# But they are not equal as strings, because their code points differ
"\u00C7" == "\u0043\u0327"

False

Another important topic is *Compatibility Equivalence*, where two characters do not look the same, for example because they use a different formatting style, but represent the same underlying letter.


The solution for *Canonical Equivalence* is to decompose the characters into their single units, whereas for *Compatibility Equivalence* it is necessary to normalize the characters to a common format. Both rely on the so-called *Normal Forms*.
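
For the compatibility case, a minimal sketch with the standard-library `unicodedata` module is shown here. The two sample characters (the "fi" ligature and a circled digit) are chosen purely for illustration.

```python
import unicodedata

ligature = "\uFB01"     # LATIN SMALL LIGATURE FI (a single code point)
circled_one = "\u2460"  # CIRCLED DIGIT ONE

# Canonical forms (NFC/NFD) leave compatibility characters untouched
print(unicodedata.normalize("NFC", ligature) == ligature)

# Compatibility forms (NFKC/NFKD) map them to their plain equivalents
print(unicodedata.normalize("NFKC", ligature))
print(unicodedata.normalize("NFKC", circled_one))
```

After `NFKC`, the ligature becomes the two letters `fi` and the circled digit becomes a plain `1`.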

In [13]:
# Define the precomposed character 'Ç' (a single code point)
c_with_cedilla = '\u00C7'
c_with_cedilla

'Ç'

In [15]:
# Define the same character as 'C' followed by a combining cedilla (two code points)
c_plus_cidilla = '\u0043\u0327'
c_plus_cidilla

'Ç'

In [16]:
c_with_cedilla == c_plus_cidilla

False

In [17]:
# Apply Normal Form Decomposition (NFD) to the precomposed cedilla character
unicodedata.normalize('NFD', c_with_cedilla) == c_plus_cidilla

True

In [18]:
# It is also possible to apply a Normal Form Composition, in order to construct the character from its pieces
c_with_cedilla == unicodedata.normalize('NFC', c_plus_cidilla)

True
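
In practice, it can be convenient to wrap the comparison in a small helper that normalizes both sides first. The sketch below assumes a hypothetical helper named `equal_after_normalization`; only `unicodedata.normalize` comes from the standard library.

```python
import unicodedata


def equal_after_normalization(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonically equivalent: precomposed 'Ç' vs 'C' + combining cedilla
print(equal_after_normalization("\u00C7", "\u0043\u0327"))     # True

# Compatibility equivalent only: the 'fi' ligature vs the letters "fi"
print(equal_after_normalization("\uFB01", "fi"))               # False
print(equal_after_normalization("\uFB01", "fi", form="NFKC"))  # True
```

Choosing `NFC` keeps compatibility characters distinct, while `NFKC` also folds them together, so the right form depends on whether that distinction matters for the model.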