# Text Preprocessing

## Introduction

Text pre-processing is a crucial step in the Natural Language Processing (NLP) pipeline. Before algorithms can be applied to text data, the text must be cleaned and transformed into a format suitable for analysis. Raw text is often noisy and unstructured: it contains inconsistencies, irrelevant information, and linguistic peculiarities. Reducing this noise and standardizing the text makes the data easier for NLP models to use and generally helps downstream machine learning models perform better.

We will explore the primary techniques used in text pre-processing. These techniques help in preparing raw text data for further analysis and modeling. The key techniques we will cover include:

- Tokenization: breaking down text into individual tokens.
- Removing Stop Words: eliminating common words that do not contribute much to the meaning (e.g., "and", "the", "is").
- Lemmatization and Stemming: reducing words to their base or root form.

NLTK (Natural Language Toolkit) will be our main library (https://www.nltk.org/).

By the end of this notebook, you will have a comprehensive understanding of these pre-processing techniques and be able to apply them to your own text data, preparing it for successful NLP applications.

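The following code snippet downloads the data packages NLTK needs for this notebook: the tokenizer models, the stopword lists, WordNet, and the part-of-speech tagger. Run it before any other cell. Depending on the installed NLTK version, different resource names may be required (recent releases ask for `punkt_tab`, for instance); if a later cell raises a `LookupError`, download the resource named in the error message.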

In [1]:
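from nltk import download
download('punkt')                        # sentence and word tokenizer models
download('stopwords')                    # stopword lists
download('wordnet')                      # lexical database used by the lemmatizer
download('averaged_perceptron_tagger')   # part-of-speech tagger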

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Tokenization

Let's import the functions `sent_tokenize` and `word_tokenize` from the `nltk` module. These two functions allow us to split text into sentences and words, respectively.


In [2]:
from nltk import sent_tokenize, word_tokenize

As sample text, we use the opening lines of "1984" by George Orwell.

In [3]:
sample_text = """It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.

The hallway smelt of boiled cabbage and old rag mats. At one end of it a
coloured poster, too large for indoor display, had been tacked to the wall.
It depicted simply an enormous face, more than a metre wide: the face of a
man of about forty-five, with a heavy black moustache and ruggedly handsome
features. Winston made for the stairs. It was no use trying the lift. Even
at the best of times it was seldom working, and at present the electric
current was cut off during daylight hours. It was part of the economy drive
in preparation for Hate Week. The flat was seven flights up, and Winston,
who was thirty-nine and had a varicose ulcer above his right ankle, went
slowly, resting several times on the way. On each landing, opposite the
lift-shaft, the poster with the enormous face gazed from the wall. It was
one of those pictures which are so contrived that the eyes follow you about
when you move. BIG BROTHER IS WATCHING YOU, the caption beneath it ran."""

The result of the tokenization into sentences is a list where each element is a sentence.

In [4]:
sentence_tokens = sent_tokenize(sample_text)
print("Result type: ", type(sentence_tokens))
print("Result: ", sentence_tokens)

Result type:  <class 'list'>
Result:  ['It was a bright cold day in April, and the clocks were striking thirteen.', 'Winston Smith, his chin nuzzled into his breast in an effort to escape the\nvile wind, slipped quickly through the glass doors of Victory Mansions,\nthough not quickly enough to prevent a swirl of gritty dust from entering\nalong with him.', 'The hallway smelt of boiled cabbage and old rag mats.', 'At one end of it a\ncoloured poster, too large for indoor display, had been tacked to the wall.', 'It depicted simply an enormous face, more than a metre wide: the face of a\nman of about forty-five, with a heavy black moustache and ruggedly handsome\nfeatures.', 'Winston made for the stairs.', 'It was no use trying the lift.', 'Even\nat the best of times it was seldom working, and at present the electric\ncurrent was cut off during daylight hours.', 'It was part of the economy drive\nin preparation for Hate Week.', 'The flat was seven flights up, and Winston,\nwho was thirty-
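
The sentences above still carry the line breaks (`\n`) of the original text: the tokenizer splits the text into sentences but leaves the whitespace inside each sentence as it was. If single-line sentences are preferred, a minimal sketch using only built-in string methods could look like the following. The variable names and the shortened sentence are illustrative assumptions.

```python
# One sentence in the style of the output above, shortened for illustration.
# Note the embedded "\n".
raw_sentence = (
    "Winston Smith, his chin nuzzled into his breast in an effort to escape the\n"
    "vile wind, slipped quickly through the glass doors of Victory Mansions."
)

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines), so joining the pieces with a single space
# collapses the line breaks.
clean_sentence = " ".join(raw_sentence.split())
print(clean_sentence)
```

The same expression can be applied to every element of `sentence_tokens` with a list comprehension.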

The result of the tokenization into words is a list where each element is a word or a punctuation mark.

In [5]:
word_tokens = word_tokenize(sample_text)
print("Result type: ", type(word_tokens))
print("Result: ", word_tokens)

Result type:  <class 'list'>
Result:  ['It', 'was', 'a', 'bright', 'cold', 'day', 'in', 'April', ',', 'and', 'the', 'clocks', 'were', 'striking', 'thirteen', '.', 'Winston', 'Smith', ',', 'his', 'chin', 'nuzzled', 'into', 'his', 'breast', 'in', 'an', 'effort', 'to', 'escape', 'the', 'vile', 'wind', ',', 'slipped', 'quickly', 'through', 'the', 'glass', 'doors', 'of', 'Victory', 'Mansions', ',', 'though', 'not', 'quickly', 'enough', 'to', 'prevent', 'a', 'swirl', 'of', 'gritty', 'dust', 'from', 'entering', 'along', 'with', 'him', '.', 'The', 'hallway', 'smelt', 'of', 'boiled', 'cabbage', 'and', 'old', 'rag', 'mats', '.', 'At', 'one', 'end', 'of', 'it', 'a', 'coloured', 'poster', ',', 'too', 'large', 'for', 'indoor', 'display', ',', 'had', 'been', 'tacked', 'to', 'the', 'wall', '.', 'It', 'depicted', 'simply', 'an', 'enormous', 'face', ',', 'more', 'than', 'a', 'metre', 'wide', ':', 'the', 'face', 'of', 'a', 'man', 'of', 'about', 'forty-five', ',', 'with', 'a', 'heavy', 'black', 'moustach
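
To see why a dedicated tokenizer is useful, it helps to compare it with plain whitespace splitting, which leaves punctuation glued to the neighbouring word. The regular expression below is only a rough stand-in written for this comparison, not the algorithm behind `word_tokenize`; the pattern and the name `simple_tokenize` are assumptions.

```python
import re

sentence = "It was a bright cold day in April, and the clocks were striking thirteen."

# Whitespace splitting keeps punctuation attached to words.
whitespace_tokens = sentence.split()
print(whitespace_tokens[7], whitespace_tokens[-1])

def simple_tokenize(text):
    """Split text into words (hyphenated words stay whole) and single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

regex_tokens = simple_tokenize(sentence)
print(regex_tokens)
```

On this sentence the regular expression separates `April` from the comma and `thirteen` from the full stop, which is what the tokenized output above shows as well. Real text contains many cases (abbreviations, contractions, quotes) that such a simple pattern does not handle.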

**Bonus**: What's the average number of words per sentence?

To answer the question, we first have to remove the punctuation marks from the `word_tokens` list. For this purpose, we can use the standard-library `string` module: its `punctuation` constant is a string containing the ASCII punctuation characters. With this constant, we can filter the punctuation marks out of our list of word tokens.

In [6]:
from string import punctuation
print("Punctuation constant: ", punctuation)

word_tokens_no_punctuation = [word for word in word_tokens if word not in punctuation]
print("Result: ", word_tokens_no_punctuation)

Punctuation constant:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Result:  ['It', 'was', 'a', 'bright', 'cold', 'day', 'in', 'April', 'and', 'the', 'clocks', 'were', 'striking', 'thirteen', 'Winston', 'Smith', 'his', 'chin', 'nuzzled', 'into', 'his', 'breast', 'in', 'an', 'effort', 'to', 'escape', 'the', 'vile', 'wind', 'slipped', 'quickly', 'through', 'the', 'glass', 'doors', 'of', 'Victory', 'Mansions', 'though', 'not', 'quickly', 'enough', 'to', 'prevent', 'a', 'swirl', 'of', 'gritty', 'dust', 'from', 'entering', 'along', 'with', 'him', 'The', 'hallway', 'smelt', 'of', 'boiled', 'cabbage', 'and', 'old', 'rag', 'mats', 'At', 'one', 'end', 'of', 'it', 'a', 'coloured', 'poster', 'too', 'large', 'for', 'indoor', 'display', 'had', 'been', 'tacked', 'to', 'the', 'wall', 'It', 'depicted', 'simply', 'an', 'enormous', 'face', 'more', 'than', 'a', 'metre', 'wide', 'the', 'face', 'of', 'a', 'man', 'of', 'about', 'forty-five', 'with', 'a', 'heavy', 'black', 'moustache', 'and', 'ruggedly', 'handsome', 'fe
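
One caveat about the filter above: because `punctuation` is a string, `word not in punctuation` is a substring test. It removes single-character tokens such as `,` or `.`, but a multi-character token such as `...` is kept, since that sequence does not occur in the constant. A slightly stricter check, sketched below with a hypothetical helper `is_punctuation` and a made-up token list, tests every character of the token instead.

```python
from string import punctuation

def is_punctuation(token):
    """Return True when the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(char in punctuation for char in token)

# Made-up tokens for illustration, including multi-character punctuation.
demo_tokens = ["Hello", ",", "world", "...", "``", "forty-five", "!"]

words_only = [token for token in demo_tokens if not is_punctuation(token)]
print(words_only)
```

Hyphenated words such as `forty-five` are kept, because they contain characters that are not punctuation.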

Now, we can calculate the average number of words per sentence by dividing the length of the `word_tokens_no_punctuation` list by the length of the `sentence_tokens` list.

In [7]:
sentence_tokens_length = len(sentence_tokens)
word_tokens_no_punctuation_length = len(word_tokens_no_punctuation)
average_words_per_sentence = word_tokens_no_punctuation_length/sentence_tokens_length
print("Avg. word per sentence: ", average_words_per_sentence)

Avg. word per sentence:  17.76923076923077


On average, there are about 18 words per sentence (17.77, to be precise) in the opening lines of "1984" by George Orwell!

## Removing Stopwords

To remove stopwords, we tokenize the text at the word level and drop every token that appears in a predefined stopword list shipped with `nltk`. Let's import the `word_tokenize` function we have already used, together with the stopword corpus.

In [8]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

We can see the first ten predefined stopwords for the English language as an example.

In [9]:
stopwords_list = stopwords.words('english')
print("First ten stopwords: ", stopwords_list[:10])

First ten stopwords:  ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


We reuse the opening lines of "1984" by George Orwell as sample text.

In [10]:
sample_text = """It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.

The hallway smelt of boiled cabbage and old rag mats. At one end of it a
coloured poster, too large for indoor display, had been tacked to the wall.
It depicted simply an enormous face, more than a metre wide: the face of a
man of about forty-five, with a heavy black moustache and ruggedly handsome
features. Winston made for the stairs. It was no use trying the lift. Even
at the best of times it was seldom working, and at present the electric
current was cut off during daylight hours. It was part of the economy drive
in preparation for Hate Week. The flat was seven flights up, and Winston,
who was thirty-nine and had a varicose ulcer above his right ankle, went
slowly, resting several times on the way. On each landing, opposite the
lift-shaft, the poster with the enormous face gazed from the wall. It was
one of those pictures which are so contrived that the eyes follow you about
when you move. BIG BROTHER IS WATCHING YOU, the caption beneath it ran."""

Now we remove the stopwords from the text. To check if a word is contained in the stopwords list, we should lowercase that word, because the `stopwords_list` contains only lowercased words.

In [11]:
word_tokens = word_tokenize(sample_text)
word_tokens_no_stopwords = [word for word in word_tokens if word.lower() not in stopwords_list]
sample_text_no_stopwords = ' '.join(word_tokens_no_stopwords)
print("Result: ", sample_text_no_stopwords)

Result:  bright cold day April , clocks striking thirteen . Winston Smith , chin nuzzled breast effort escape vile wind , slipped quickly glass doors Victory Mansions , though quickly enough prevent swirl gritty dust entering along . hallway smelt boiled cabbage old rag mats . one end coloured poster , large indoor display , tacked wall . depicted simply enormous face , metre wide : face man forty-five , heavy black moustache ruggedly handsome features . Winston made stairs . use trying lift . Even best times seldom working , present electric current cut daylight hours . part economy drive preparation Hate Week . flat seven flights , Winston , thirty-nine varicose ulcer right ankle , went slowly , resting several times way . landing , opposite lift-shaft , poster enormous face gazed wall . one pictures contrived eyes follow move . BIG BROTHER WATCHING , caption beneath ran .
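
Membership tests on a list scan it element by element, so for longer texts it can help to convert the stopword list to a `set` first. The sketch below shows the idea on a few tokens. The short `demo_stopwords` list is a hand-written stand-in for `stopwords.words('english')`, used only so that the snippet runs without NLTK data.

```python
# Hand-written stand-in for stopwords.words('english'); an assumption for this sketch.
demo_stopwords = ["it", "was", "a", "in", "and", "the", "were"]

# A set gives constant-time membership tests on average.
demo_stopwords_set = set(demo_stopwords)

demo_tokens = ["It", "was", "a", "bright", "cold", "day", "in", "April"]

# Lowercase each token before the lookup, because the stopwords are lowercase.
filtered_tokens = [token for token in demo_tokens if token.lower() not in demo_stopwords_set]
print(filtered_tokens)
```

The filtering logic is the same as in the cell above; only the container used for the lookup changes.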


## Lemmatization and Stemming

Stemming and lemmatization are techniques used to normalize text, but they have different advantages and disadvantages. Stemming is faster and more straightforward, as it simply chops off word endings to reach the root form, but it can result in non-dictionary words and less accurate base forms. Lemmatization, on the other hand, is more accurate as it reduces words to their dictionary forms using morphological analysis, but it is slower and requires more computational resources. While stemming is suitable for applications where speed is crucial and exact base forms are less important, lemmatization is preferred in contexts where grammatical correctness and precision are essential.
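
To make the idea of "chopping off word endings" concrete, here is a deliberately naive suffix stripper. It is a toy written for this explanation, not the Porter algorithm and not part of NLTK; the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions.

```python
def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of the word."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["clocks", "striking", "studies", "ran"]:
    print(word, "->", naive_stem(word))
```

The toy shows both weaknesses mentioned above: `striking` and `studies` are cut to strings that are not dictionary words, and the irregular form `ran` is left untouched because no suffix rule applies to it.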

Let's start with stemming: we import the `PorterStemmer` class from `nltk` and create a stemmer instance.

In [12]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

For this demonstration, we will start with a simple list of five words.

In [13]:
sample_words = ["run", "running", "runner", "ran", "runs"]
print("Sample list of words: ", sample_words)

Sample list of words:  ['run', 'running', 'runner', 'ran', 'runs']


We will now see how stemming modifies that list.

In [14]:
stemmed_words = [stemmer.stem(word) for word in sample_words]
print("Result: ", stemmed_words)

Result:  ['run', 'run', 'runner', 'ran', 'run']


Stemming has reduced 'running' and 'runs' to the root 'run', but 'runner' and 'ran' are left unchanged. Stemming typically does not handle irregular verbs well, because it applies simple rules that strip suffixes: 'ran' stays 'ran' instead of being mapped to its base form 'run'. What about lemmatization? Let's try the `WordNetLemmatizer` class from the `nltk` library. With the parameter `pos='v'`, each word is treated as a verb during lemmatization.

In [15]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
sample_words = ["run", "running", "runner", "ran", "runs"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in sample_words]  # 'v' treats each word as a verb

print("Result: ", lemmatized_words)

Result:  ['run', 'run', 'runner', 'run', 'run']


With lemmatization, all forms of the verb 'run', including the irregular 'ran', have been converted to the base form 'run'. The only exception is 'runner', which is a noun rather than a verb form and is therefore left unchanged. What if we treat each word as a noun?

In [16]:
lemmatizer = WordNetLemmatizer()
sample_words = ["run", "running", "runner", "ran", "runs"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='n') for word in sample_words]  # 'n' treats each word as a noun

print("Result: ", lemmatized_words)

Result:  ['run', 'running', 'runner', 'ran', 'run']


When lemmatizing the list `['run', 'running', 'runner', 'ran', 'runs']` with `pos='n'` (as nouns), the result is `['run', 'running', 'runner', 'ran', 'run']`. "Running", "runner", and "ran" are left unchanged, and only "runs" is reduced to "run", as if it were a plural noun. The verb forms are no longer mapped to their base form, which highlights the importance of specifying the correct part of speech for accurate lemmatization.

**Bonus**: Now, we will lemmatize the opening lines of "1984" by George Orwell.

In [17]:
sample_text = """It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
though not quickly enough to prevent a swirl of gritty dust from entering
along with him.

The hallway smelt of boiled cabbage and old rag mats. At one end of it a
coloured poster, too large for indoor display, had been tacked to the wall.
It depicted simply an enormous face, more than a metre wide: the face of a
man of about forty-five, with a heavy black moustache and ruggedly handsome
features. Winston made for the stairs. It was no use trying the lift. Even
at the best of times it was seldom working, and at present the electric
current was cut off during daylight hours. It was part of the economy drive
in preparation for Hate Week. The flat was seven flights up, and Winston,
who was thirty-nine and had a varicose ulcer above his right ankle, went
slowly, resting several times on the way. On each landing, opposite the
lift-shaft, the poster with the enormous face gazed from the wall. It was
one of those pictures which are so contrived that the eyes follow you about
when you move. BIG BROTHER IS WATCHING YOU, the caption beneath it ran."""

To tag each token with its part of speech, we use NLTK's default tagger (the averaged perceptron tagger downloaded at the start of the notebook), a supervised model trained on an annotated text corpus.

In [18]:
from nltk import pos_tag

word_tokens = word_tokenize(sample_text)
tagged_word_tokens = pos_tag(word_tokens)
print("Result: ", tagged_word_tokens[:10])

Result:  [('It', 'PRP'), ('was', 'VBD'), ('a', 'DT'), ('bright', 'JJ'), ('cold', 'JJ'), ('day', 'NN'), ('in', 'IN'), ('April', 'NNP'), (',', ','), ('and', 'CC')]
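
The tags in the output follow the Penn Treebank tag set. As a reading aid, the sketch below attaches a short description to each tag that appears in the ten tokens above. The `TAG_DESCRIPTIONS` dictionary is written by hand for this illustration and covers only those tags.

```python
# First ten tagged tokens, copied from the output above.
first_tagged_tokens = [
    ('It', 'PRP'), ('was', 'VBD'), ('a', 'DT'), ('bright', 'JJ'), ('cold', 'JJ'),
    ('day', 'NN'), ('in', 'IN'), ('April', 'NNP'), (',', ','), ('and', 'CC'),
]

# Hand-written glossary; it covers only the tags seen in this sample.
TAG_DESCRIPTIONS = {
    'PRP': 'personal pronoun',
    'VBD': 'verb, past tense',
    'DT': 'determiner',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'IN': 'preposition or subordinating conjunction',
    'NNP': 'proper noun, singular',
    ',': 'comma',
    'CC': 'coordinating conjunction',
}

for word, tag in first_tagged_tokens:
    print(f"{word:>6}  {tag:<4} {TAG_DESCRIPTIONS.get(tag, 'not listed here')}")
```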


We define a function that maps tags starting with 'J' to adjectives, 'V' to verbs, 'N' to nouns, and 'R' to adverbs. If the tag doesn't match any of these, the function returns `None`.

In [19]:
from nltk.corpus import wordnet

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
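
The same mapping can also be written as a dictionary lookup. The sketch below is an alternative formulation, not a replacement for the function above. It relies on the WordNet part-of-speech constants being the single letters `'a'`, `'v'`, `'n'`, and `'r'`, so it runs without loading WordNet; the names `TAG_PREFIX_TO_WORDNET` and `penn_to_wordnet` are assumptions introduced here.

```python
# First letter of the tag -> WordNet part-of-speech letter
# (adjective, verb, noun, adverb).
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default=None):
    """Map a part-of-speech tag to a WordNet POS letter, or return `default` if unmapped."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], default)

# A few (word, tag) pairs taken from the tagging output above.
for word, tag in [('It', 'PRP'), ('was', 'VBD'), ('bright', 'JJ'), ('day', 'NN')]:
    print(word, tag, penn_to_wordnet(tag))
```

Passing `default="n"` would make unmapped tags fall back to nouns directly in the mapping step.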

Now we can lemmatize the text, passing the mapped part-of-speech tag for each word:

In [20]:
lemmatized_words = []

for word, tag in tagged_word_tokens:
    wn_tag = get_wordnet_pos(tag)
    if wn_tag is not None:
        # Lemmatize with the mapped WordNet part of speech
        lemmatized_words.append(lemmatizer.lemmatize(word, pos=wn_tag))
    else:
        # No mapping: fall back to the lemmatizer's default (noun)
        lemmatized_words.append(lemmatizer.lemmatize(word))

lemmatized_text = ' '.join(lemmatized_words)
print("Result: ", lemmatized_text)

Result:  It be a bright cold day in April , and the clock be strike thirteen . Winston Smith , his chin nuzzle into his breast in an effort to escape the vile wind , slip quickly through the glass door of Victory Mansions , though not quickly enough to prevent a swirl of gritty dust from enter along with him . The hallway smelt of boiled cabbage and old rag mat . At one end of it a coloured poster , too large for indoor display , have be tack to the wall . It depict simply an enormous face , more than a metre wide : the face of a man of about forty-five , with a heavy black moustache and ruggedly handsome feature . Winston make for the stair . It be no use try the lift . Even at the best of time it be seldom work , and at present the electric current be cut off during daylight hour . It be part of the economy drive in preparation for Hate Week . The flat be seven flight up , and Winston , who be thirty-nine and have a varicose ulcer above his right ankle , go slowly , rest several time

## Conclusion

In this notebook, we explored essential text pre-processing techniques for Natural Language Processing (NLP). We covered tokenization, stop word removal, and stemming and lemmatization, and showed how to clean and standardize raw text with NLTK. Proper pre-processing reduces noise and generally helps machine learning models perform better. With these techniques, you can prepare your own text data for NLP applications.