# **Working with Text Data - Text Preprocessing**
  
## **Text Preprocessing Steps**

Text preprocessing comprises a set of essential tasks that clean the available data and remove noise from it.

1. **Removing Special Characters and Punctuation**
2. **Converting to Lower Case**
3. **Tokenization (Sentence Tokenization and Word Tokenization)**
4. **Removing Stop Words**
5. **Stemming or Lemmatization**
6. **HTML Parsing and Cleanup**
7. **Spell Correction**

Note that text preprocessing is one of the most important steps, because it has implications for every later stage of the NLP pipeline. It can also be the most time-consuming part of a project.

## **1. Removing Special Characters and Punctuation**

Special characters include `^`, `~`, `@`, `$`, and so on; punctuation marks include `.`, `?`, `,`, and so on.

In Python, the **string** module contains a constant, `punctuation`: a string of all the punctuation characters defined by the ASCII standard, such as `!`, `?`, `@`, and `#`.

**A note about ASCII Standard**
- The ASCII (American Standard Code for Information Interchange) standard is a character encoding standard that assigns numerical values to characters, enabling text representation in computers and other devices.
- ASCII is limited to 128 characters, which is insufficient for representing the characters of many languages, currency symbols, emojis, and various other special symbols.
- To address these limitations, **Unicode** was developed as a more comprehensive standard that covers a vast array of characters from many languages and symbol sets; **UTF-8, UTF-16, etc.** are encodings of Unicode.
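
The ASCII limit described above can be checked directly with built-in string methods. This is a minimal sketch; the sample characters are illustrative choices, not taken from any particular dataset.

```python
# Characters chosen for illustration: one ASCII letter and three non-ASCII symbols
samples = ["A", "£", "₹", "😀"]

utf8_lengths = {}
for ch in samples:
    # Number of bytes UTF-8 needs to store this character
    utf8_lengths[ch] = len(ch.encode("utf-8"))
    print(f"{ch!r}: code point {ord(ch)}, ASCII? {ch.isascii()}, UTF-8 bytes: {utf8_lengths[ch]}")

# ASCII covers code points 0-127 only, so encoding anything beyond that fails
try:
    "₹".encode("ascii")
except UnicodeEncodeError as err:
    print("Cannot encode as ASCII:", err.reason)
```

Only the first character fits in ASCII; the others need two to four bytes in UTF-8.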


In [1]:
raw_text = """
We're LaRninG1 Natural-LAnguage-Processing!😀 🚀 ❤️ 
In this\\ example wE are goIng to Learn variouS text9 preprocessing steps.
I'm GoIng TO-bE Mr. Rich. ₹ 
"""

print(raw_text)


We're LaRninG1 Natural-LAnguage-Processing!😀 🚀 ❤️ 
In this\ example wE are goIng to Learn variouS text9 preprocessing steps.
I'm GoIng TO-bE Mr. Rich. ₹ 



In [2]:
import string

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [3]:
# Let's now use the "string" module to remove all punctuation (and digits).

text = "".join([char for char in raw_text if char not in string.punctuation and not char.isdigit()])

print(text)


Were LaRninG NaturalLAnguageProcessing😀 🚀 ❤️ 
In this example wE are goIng to Learn variouS text preprocessing steps
Im GoIng TObE Mr Rich ₹ 
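
Because `string.punctuation` lists only ASCII characters, symbols such as `₹` and the emojis survive the filter above. The sketch below shows two possible alternatives: `str.translate` for ASCII punctuation, and Unicode categories from the standard `unicodedata` module for everything else. The sample sentence and both function names are hypothetical.

```python
import string
import unicodedata

def strip_ascii_punctuation(text):
    # A deletion table lets str.translate drop every ASCII punctuation character in one pass
    return text.translate(str.maketrans("", "", string.punctuation))

def strip_unicode_punct_and_symbols(text):
    # Unicode categories starting with "P" are punctuation and "S" are symbols
    # (currency signs, many emoji), so this also catches non-ASCII characters
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in ("P", "S"))

sample = "Mr. Rich earns ₹500, right? 🚀"  # hypothetical sample text
print(strip_ascii_punctuation(sample))
print(strip_unicode_punct_and_symbols(sample))
```

Note that the second function keeps digits and leaves whitespace untouched, so it may need to be combined with other steps.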



### **A more powerful weapon to remove special characters and punctuation**

In [4]:
import re

In [5]:
# Define a regex that matches every character except letters, '.' and '!'
regex = r"[^a-zA-Z.!]"

text = re.sub(regex, " ", raw_text)

print(text)

 We re LaRninG  Natural LAnguage Processing!        In this  example wE are goIng to Learn variouS text  preprocessing steps. I m GoIng TO bE Mr. Rich.    
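
Replacing each unwanted character with a space leaves runs of consecutive spaces, as the output above shows. One possible follow-up, sketched here on a shortened copy of that output (the variable names are illustrative), collapses each run into a single space.

```python
import re

# Shortened copy of the regex-cleaned string, with its runs of spaces
spaced_text = " We re LaRninG  Natural LAnguage Processing!        In this  example "

# \s+ matches one or more whitespace characters; replace each run with one space
normalized = re.sub(r"\s+", " ", spaced_text).strip()
print(normalized)
```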


## **2. Converting to Lower Case**

We convert the whole text corpus to lower case to reduce the size of the vocabulary of our text data.

In [6]:
# change sentence to lower case

text = text.lower()

print(text)

 we re larning  natural language processing!        in this  example we are going to learn various text  preprocessing steps. i m going to be mr. rich.    
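
To see why lowercasing shrinks the vocabulary, the sketch below counts distinct tokens before and after lowercasing a small made-up sentence.

```python
# Made-up sentence with inconsistent casing
tokens = "We are going to learn and we Are GoIng to Learn".split()

vocab_before = set(tokens)                         # case-sensitive vocabulary
vocab_after = {token.lower() for token in tokens}  # case-insensitive vocabulary

print(len(vocab_before), sorted(vocab_before))
print(len(vocab_after), sorted(vocab_after))
```

Here "We"/"we", "are"/"Are", "going"/"GoIng", and "learn"/"Learn" each collapse into one entry.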


## **3. Tokenization (Sentence Tokenization and Word Tokenization)**

Tokenization breaks the text into sentences or words.

Note that tokenization is one of the most important steps in any NLP task. Hence, any NLP pipeline has to start with a reliable system to split the text into sentences and further split each sentence into words.

In [7]:
words = text.split(" ")

print(words)

['', 'we', 're', 'larning', '', 'natural', 'language', 'processing!', '', '', '', '', '', '', '', 'in', 'this', '', 'example', 'we', 'are', 'going', 'to', 'learn', 'various', 'text', '', 'preprocessing', 'steps.', 'i', 'm', 'going', 'to', 'be', 'mr.', 'rich.', '', '', '', '']
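
The empty strings in the list above come from splitting on every single space. A small sketch of the difference, using a shortened sample string: calling `split()` with no argument splits on runs of whitespace and drops the empty entries.

```python
spaced = " we re larning  natural language processing!   "

with_empties = spaced.split(" ")  # one split per space: runs of spaces yield '' entries
without_empties = spaced.split()  # no argument: splits on whitespace runs, no '' entries

print(with_empties)
print(without_empties)
```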


In [8]:
sentences = text.split(".")

print(sentences)

[' we re larning  natural language processing!        in this  example we are going to learn various text  preprocessing steps', ' i m going to be mr', ' rich', '    ']


### **Introducing NLTK**

In [9]:
# !pip install nltk

In [10]:
import nltk

### **Sentence Tokenization**

On the surface, this might look like a very simple task: we could perform sentence segmentation with a simple rule that breaks the text at full stops and question marks. However, abbreviations and forms of address **(Dr., Mr., etc.)** break such a rule.

Thankfully, we don't have to worry about how to solve these issues, as most NLP libraries come with some form of sentence and word splitting implemented.
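
To make the problem concrete, here is a minimal sketch of the naive rule and one hypothetical refinement that re-joins pieces ending in a known abbreviation. The function names and the abbreviation list are assumptions for illustration; library tokenizers handle far more cases than this.

```python
import re

def naive_sent_split(text):
    # Split after '.', '?' or '!' when followed by whitespace
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

def abbrev_aware_split(text, abbreviations=("mr.", "mrs.", "dr.")):
    # Re-join a piece with the next one when it ends in a known abbreviation
    sentences, buffer = [], ""
    for piece in naive_sent_split(text):
        buffer = f"{buffer} {piece}".strip()
        if not buffer.lower().endswith(abbreviations):
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

sample = "I am going to be Mr. Rich. Are you sure? Yes!"
print(naive_sent_split(sample))    # wrongly breaks after "Mr."
print(abbrev_aware_split(sample))  # keeps "Mr. Rich." together
```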

In [11]:
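# Download the Punkt tokenizer models from the Natural Language Toolkit (NLTK) library.
# Punkt is a sentence tokenizer; word_tokenize also relies on it to split sentences first.
# Note: recent NLTK releases may ask for the 'punkt_tab' resource instead.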

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
from nltk.tokenize import sent_tokenize

# tokenize text into sentences
my_sentences = sent_tokenize(text)

print(my_sentences)

[' we re larning  natural language processing!', 'in this  example we are going to learn various text  preprocessing steps.', 'i m going to be mr. rich.']


### **Word Tokenization**

In [13]:
from nltk.tokenize import word_tokenize

# Tokenize text to words
words = word_tokenize(text)

print(words)

['we', 're', 'larning', 'natural', 'language', 'processing', '!', 'in', 'this', 'example', 'we', 'are', 'going', 'to', 'learn', 'various', 'text', 'preprocessing', 'steps', '.', 'i', 'm', 'going', 'to', 'be', 'mr.', 'rich', '.']


In [14]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Tokenize sentences to words
for sentence in sent_tokenize(text):
    print(word_tokenize(sentence))

['we', 're', 'larning', 'natural', 'language', 'processing', '!']
['in', 'this', 'example', 'we', 'are', 'going', 'to', 'learn', 'various', 'text', 'preprocessing', 'steps', '.']
['i', 'm', 'going', 'to', 'be', 'mr.', 'rich', '.']


## **4. Removing Stop Words**

Some words, such as a, an, the, was, is, and by, occur very frequently but contribute little to the meaning of a sentence, so they can usually be removed without changing what the sentence is about.

Such words are called **stop words** and are typically (though not always) removed from further analysis. There is no standard list of stop words for English, but there are some popular lists; NLTK, for example, ships one.
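
The "though not always" matters in practice: negations such as "not" often appear in stop-word lists, and dropping them can flip the meaning of a sentence. A minimal sketch with a tiny hand-picked stop list (an assumption for illustration; real lists are much longer):

```python
# A tiny hand-picked stop list for illustration only
stop_words = {"this", "is", "a", "the", "not"}

def remove_stop_words(sentence, stop_words):
    # Lowercase, split on whitespace, and keep only the non-stop words
    return [word for word in sentence.lower().split() if word not in stop_words]

without_negation = remove_stop_words("This movie is not good", stop_words)
with_negation = remove_stop_words("This movie is not good", stop_words - {"not"})

print(without_negation)  # the negation is lost
print(with_negation)     # the negation is kept
```

For tasks such as sentiment analysis, it may be worth customizing the stop list in this way.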

In [15]:
from nltk.corpus import stopwords

In [16]:
# Download the stop words corpus

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
print("List of stopwords:")
print(stopwords.words("english"))

List of stopwords:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'sam

In [18]:
# Removing stop words
# Build the stop-word set once; set membership tests are fast
stop_words = set(stopwords.words("english"))

words = [word for word in words if word not in stop_words]

print(words)

['larning', 'natural', 'language', 'processing', '!', 'example', 'going', 'learn', 'various', 'text', 'preprocessing', 'steps', '.', 'going', 'mr.', 'rich', '.']


## **5. Stemming**

Stemming is the process of removing suffixes and reducing a word to a root form, so that all the variants of that word are represented by the same form. For example, warm, warmer, and warming can all be reduced to warm.

This is accomplished by applying a fixed set of rules (e.g. if the word ends with "-es", remove "-es"). Because the rules are purely mechanical, the resulting stem is not always a dictionary word.

Stemming is commonly used in text classification to reduce the feature space to train machine learning models.

Popular stemming techniques are:
1. Porter Stemmer
2. Snowball Stemmer (AKA Porter2)
3. Lancaster Stemmer (fast but very aggressive, so generally not advised)
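
To illustrate the idea of rule-based suffix stripping, here is a toy stemmer. Its rules and the `min_stem_len` parameter are made up for illustration and are far cruder than those of the Porter, Snowball, or Lancaster stemmers.

```python
# Toy suffix rules, checked in order; illustrative only
SUFFIX_RULES = ["ing", "er", "es", "s"]

def toy_stem(word, min_stem_len=3):
    for suffix in SUFFIX_RULES:
        # Strip the first matching suffix, but only if enough of the word is left
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["warm", "warmer", "warming", "languages", "steps", "is"]:
    print(word, "->", toy_stem(word))
```

The stem of "languages" comes out as "languag", which shows how mechanical rules can produce non-words.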

In [19]:
# Stemming
from nltk.stem.porter import PorterStemmer

## initialise the inbuilt Stemmer
stemmer = PorterStemmer()

clean_tokens_stem = [stemmer.stem(word) for word in words]

print(clean_tokens_stem)

['larn', 'natur', 'languag', 'process', '!', 'exampl', 'go', 'learn', 'variou', 'text', 'preprocess', 'step', '.', 'go', 'mr.', 'rich', '.']


## **6. Lemmatization**

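Lemmatization is the process of mapping all the different forms of a word to its base form, or lemma. While this seems close to the definition of stemming, the two are different: lemmatization uses more linguistic knowledge (a dictionary and, ideally, the word's part of speech), so its output stays human-readable. Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is given, which is why verb forms such as 'going' can pass through unchanged.

To use NLTK's lemmatizer, we must first download 'wordnet' and 'omw-1.4'.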
1. **wordnet** - WordNet is a large lexical database of English. By downloading WordNet, you gain access to a comprehensive database that can be used to understand word meanings, relationships, and hierarchies.
2. **omw-1.4** - The Open Multilingual WordNet (OMW) is a collection of wordnets in various languages. It provides translations and multilingual synsets, which are useful for cross-linguistic NLP tasks.
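
Conceptually, lemmatization is a dictionary lookup rather than suffix stripping. The toy sketch below uses a tiny hand-written table (an assumption for illustration; a real lemmatizer consults a full lexicon such as WordNet) to show cases that suffix rules cannot handle.

```python
# A tiny hand-written lemma table, for illustration only
LEMMA_TABLE = {"mice": "mouse", "better": "good", "went": "go", "steps": "step"}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the table
    return LEMMA_TABLE.get(word, word)

for word in ["mice", "better", "went", "steps", "language"]:
    print(word, "->", toy_lemmatize(word))
```

Irregular forms such as "mice" and "went" map to real words, and unknown words are left as they are.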

In [20]:
# Downloading wordnet before applying Lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [21]:
# Lemmatizing
from nltk.stem import WordNetLemmatizer

## We can also use Lemmatizer instead of Stemmer
lemmatizer = WordNetLemmatizer()

clean_tokens_lem = [lemmatizer.lemmatize(word) for word in words]

print(clean_tokens_lem)

['larning', 'natural', 'language', 'processing', '!', 'example', 'going', 'learn', 'various', 'text', 'preprocessing', 'step', '.', 'going', 'mr.', 'rich', '.']


## **Putting all the steps together**

In [22]:
import pandas as pd

lst_text = ["We are Learning Machine Learning $", 
            "Processing natural - language data.", 
            "10 machine - learning algorithms.", 
            "we Are Mimicing natural intelligence"]

df = pd.DataFrame({'text': lst_text})

df.head()

| | text |
|---|---|
| 0 | We are Learning Machine Learning $ |
| 1 | Processing natural - language data. |
| 2 | 10 machine - learning algorithms. |
| 3 | we Are Mimicing natural intelligence |


### **Explicitly Cleaning the Text and Vectorizing**

In [23]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def clean(doc): 
    # doc is a string of text
    
    # Replace every character except letters and '.' with a space
    regex = r"[^a-zA-Z.]"
    doc = re.sub(regex, " ", doc)

    # Convert to lowercase
    doc = doc.lower()
        
    # Tokenization
    tokens = nltk.word_tokenize(doc)

    # Stop word removal
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    
    # Join and return
    return " ".join(lemmatized_tokens)

In [24]:
df['clean_text'] = df['text'].apply(clean)

df.head()

| | text | clean_text |
|---|---|---|
| 0 | We are Learning Machine Learning $ | learning machine learning |
| 1 | Processing natural - language data. | processing natural language data . |
| 2 | 10 machine - learning algorithms. | machine learning algorithm . |
| 3 | we Are Mimicing natural intelligence | mimicing natural intelligence |


In [25]:
# import feature extraction methods from sklearn
from sklearn.feature_extraction.text import CountVectorizer

# instantiate a vectorizer
bow_vect = CountVectorizer()

# use it to extract features from training data
text_dtm = bow_vect.fit_transform(df['clean_text'])

print()
print(f"Shape of text_dtm (# of docs, # of unique vocabulary): {text_dtm.shape}")
print(f"Vocab: {bow_vect.get_feature_names_out()}")


Shape of text_dtm (# of docs, # of unique vocabulary): (4, 9)
Vocab: ['algorithm' 'data' 'intelligence' 'language' 'learning' 'machine'
 'mimicing' 'natural' 'processing']


In [26]:
pd.DataFrame(text_dtm.toarray(), 
            columns=bow_vect.get_feature_names_out())

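| | algorithm | data | intelligence | language | learning | machine | mimicing | natural | processing |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |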
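
A bag-of-words matrix like this can be reproduced with the standard library alone. The sketch below is not how `CountVectorizer` is implemented; it only shows what is being counted. It assumes the cleaned documents with the full stops already dropped.

```python
from collections import Counter

# Cleaned documents, with the standalone full stops left out
docs = [
    "learning machine learning",
    "processing natural language data",
    "machine learning algorithm",
    "mimicing natural intelligence",
]

# Vocabulary: every distinct token across all documents, sorted alphabetically
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary term, each cell a term count
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```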


##### **Observe that the full stop is removed by CountVectorizer.**
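
The reason is the default `token_pattern` of `CountVectorizer`, which keeps only runs of two or more word characters. The sketch below applies that same pattern with plain `re` to an illustrative string, so punctuation and single-character tokens disappear.

```python
import re

# Default value of CountVectorizer's token_pattern parameter
default_pattern = r"(?u)\b\w\w+\b"

doc = "i m processing natural language data ."  # illustrative input
tokens = re.findall(default_pattern, doc)
print(tokens)
```

Supplying a custom `tokenizer`, as in the next cells, bypasses this pattern.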

In [27]:
def tokenizer(doc):
    return nltk.word_tokenize(doc)

In [28]:
# import feature extraction methods from sklearn
from sklearn.feature_extraction.text import CountVectorizer

# instantiate a vectorizer
bow_vect = CountVectorizer(token_pattern=None,
                           tokenizer=tokenizer, 
                           ngram_range=(1, 1), 
                           lowercase=False)

# use it to extract features from training data
text_dtm = bow_vect.fit_transform(df['clean_text'])

print()
print(f"Shape of text_dtm (# of docs, # of unique vocabulary): {text_dtm.shape}")
print(f"Vocab: {bow_vect.get_feature_names_out()[:100]}")


Shape of text_dtm (# of docs, # of unique vocabulary): (4, 10)
Vocab: ['.' 'algorithm' 'data' 'intelligence' 'language' 'learning' 'machine'
 'mimicing' 'natural' 'processing']


In [29]:
pd.DataFrame(text_dtm.toarray(), 
            columns=bow_vect.get_feature_names_out())

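| | . | algorithm | data | intelligence | language | learning | machine | mimicing | natural | processing |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |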


### **Implicitly Cleaning the Text during Vectorization**

In [30]:
def tokenizer(doc):
    return nltk.word_tokenize(doc)

In [31]:
# import feature extraction methods from sklearn
from sklearn.feature_extraction.text import CountVectorizer

# instantiate a vectorizer
bow_vect = CountVectorizer(token_pattern=None,
                           tokenizer=tokenizer, 
                           ngram_range=(1, 1), 
                           lowercase=False, 
                           preprocessor=clean, 
                           stop_words=None)

# use it to extract features from training data
text_dtm = bow_vect.fit_transform(df['text'])

print()
print(f"Shape of text_dtm (# of docs, # of unique vocabulary): {text_dtm.shape}")
print(f"Vocab: {bow_vect.get_feature_names_out()[:100]}")


Shape of text_dtm (# of docs, # of unique vocabulary): (4, 10)
Vocab: ['.' 'algorithm' 'data' 'intelligence' 'language' 'learning' 'machine'
 'mimicing' 'natural' 'processing']


In [32]:
# Converting the sparse matrix to a dataframe

pd.DataFrame(text_dtm.toarray(), 
             columns=bow_vect.get_feature_names_out())

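| | . | algorithm | data | intelligence | language | learning | machine | mimicing | natural | processing |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |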


## **More Custom Text Cleaning Steps**

### **Spell Corrector**

In [33]:
# !pip install textblob

In [34]:
from textblob import TextBlob

# Reuse the cleaned, lower-cased `text` from the earlier steps;
# it still contains the misspelling "larning"

# Create a TextBlob object
blob = TextBlob(text)

# Correct the spelling
corrected_text = blob.correct()

print("Original text:", text)
print()
print("Corrected text:", corrected_text)

Original text:  we re larning  natural language processing!        in this  example we are going to learn various text  preprocessing steps. i m going to be mr. rich.    

Corrected text:  we re learning  natural language processing!        in this  example we are going to learn various text  preprocessing steps. i m going to be mr. rich.    
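
For intuition about vocabulary-based spell correction, here is a minimal sketch that swaps in a simpler technique than TextBlob's: similarity lookup against a list of known words using the standard `difflib` module. The vocabulary, the `suggest` helper, and the `cutoff` value are assumptions for illustration.

```python
import difflib

# Hypothetical vocabulary of known-correct words
vocabulary = ["learning", "language", "natural", "processing", "example"]

def suggest(word, vocabulary, cutoff=0.8):
    # get_close_matches ranks candidates by a similarity ratio between 0 and 1
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    # Return the word unchanged when nothing is similar enough
    return matches[0] if matches else word

for word in ["larning", "procesing", "rich"]:
    print(word, "->", suggest(word, vocabulary))
```

A higher `cutoff` makes the corrector more conservative, which helps avoid "correcting" names and rare words.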


### **HTML Parsing and Cleanup**

In [35]:
# Machine Learning Wiki Page Source

data = """
<figure class="mw-default-size" typeof="mw:File/Thumb">
    <a href="/wiki/File:AI_hierarchy.svg" class="mw-file-description"><img src="dummy_src" /></a>
    <figcaption>Machine learning as subfield of AI<sup id="cite_ref-journalimcms.org_22-0" class="reference">
    <a href="#cite_note-journalimcms.org-22">&#91;22&#93;</a></sup></figcaption>
</figure>
<p>As a scientific endeavor, machine learning grew out of the quest for <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a> (AI). In the early days of AI as an <a href="/wiki/Discipline_(academia)" class="mw-redirect" title="Discipline (academia)">academic discipline</a>, some researchers were interested in having machines learn from data. They attempted to approach the problem with various symbolic methods, as well as what were then termed "<a href="/wiki/Artificial_neural_network" class="mw-redirect" title="Artificial neural network">neural networks</a>"; these were mostly <a href="/wiki/Perceptron" title="Perceptron">perceptrons</a> and <a href="/wiki/ADALINE" title="ADALINE">other models</a> that were later found to be reinventions of the <a href="/wiki/Generalized_linear_model" title="Generalized linear model">generalized linear models</a> of statistics.<sup id="cite_ref-23" class="reference"><a href="#cite_note-23">&#91;23&#93;</a></sup> <a href="/wiki/Probabilistic_reasoning" class="mw-redirect" title="Probabilistic reasoning">Probabilistic reasoning</a> was also employed, especially in <a href="/wiki/Automated_medical_diagnosis" class="mw-redirect" title="Automated medical diagnosis">automated medical diagnosis</a>.<sup id="cite_ref-aima_24-0" class="reference"><a href="#cite_note-aima-24">&#91;24&#93;</a></sup><sup class="reference nowrap"><span title="Page / location: 488">&#58;&#8202;488&#8202;</span></sup></p>
<p>However, an increasing emphasis on the <a href="/wiki/Symbolic_AI" class="mw-redirect" title="Symbolic AI">logical, knowledge-based approach</a> caused a rift between AI and machine learning. Probabilistic systems were plagued by theoretical and practical problems of data acquisition and representation.<sup id="cite_ref-aima_24-1" class="reference"><a href="#cite_note-aima-24">&#91;24&#93;</a></sup><sup class="reference nowrap"><span title="Page / location: 488">&#58;&#8202;488&#8202;</span></sup> By 1980, <a href="/wiki/Expert_system" title="Expert system">expert systems</a> had come to dominate AI, and statistics was out of favor.<sup id="cite_ref-changing_25-0" class="reference"><a href="#cite_note-changing-25">&#91;25&#93;</a></sup> Work on symbolic/knowledge-based learning did continue within AI, leading to <a href="/wiki/Inductive_logic_programming" title="Inductive logic programming">inductive logic programming</a>(ILP), but the more statistical line of research was now outside the field of AI proper, in <a href="/wiki/Pattern_recognition" title="Pattern recognition">pattern recognition</a> and <a href="/wiki/Information_retrieval" title="Information retrieval">information retrieval</a>.<sup id="cite_ref-aima_24-2" class="reference"><a href="#cite_note-aima-24">&#91;24&#93;</a></sup><sup class="reference nowrap"><span title="Page / location: 708–710, 755">&#58;&#8202;708–710,&#8202;755&#8202;</span></sup> Neural networks research had been abandoned by AI and <a href="/wiki/Computer_science" title="Computer science">computer science</a> around the same time. This line, too, was continued outside the AI/CS field, as "<a href="/wiki/Connectionism" title="Connectionism">connectionism</a>", by researchers from other disciplines including <a href="/wiki/John_Hopfield" title="John Hopfield">Hopfield</a>, <a href="/wiki/David_Rumelhart" title="David Rumelhart">Rumelhart</a>, and <a href="/wiki/Geoff_Hinton" class="mw-redirect" title="Geoff Hinton">Hinton</a>. 
Their main success came in the mid-1980s with the reinvention of <a href="/wiki/Backpropagation" title="Backpropagation">backpropagation</a>.<sup id="cite_ref-aima_24-3" class="reference"><a href="#cite_note-aima-24">&#91;24&#93;</a></sup><sup class="reference nowrap"><span title="Page / location: 25">&#58;&#8202;25&#8202;</span></sup></p>
<p>Machine learning (ML), reorganized and recognized as its own field, started to flourish in the 1990s. The field changed its goal from achieving artificial intelligence to tackling solvable problems of a practical nature. It shifted focus away from the <a href="/wiki/Symbolic_artificial_intelligence" title="Symbolic artificial intelligence">symbolic approaches</a> it had inherited from AI, and toward methods and models borrowed from statistics, <a href="/wiki/Fuzzy_logic" title="Fuzzy logic">fuzzy logic</a>, and <a href="/wiki/Probability_theory" title="Probability theory">probability theory</a>.<sup id="cite_ref-changing_25-1" class="reference"><a href="#cite_note-changing-25">&#91;25&#93;</a></sup></p>
"""

In [36]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data)

soup.get_text()

'\n\nMachine learning as subfield of AI\n[22]\n\nAs a scientific endeavor, machine learning grew out of the quest for artificial intelligence (AI). In the early days of AI as an academic discipline, some researchers were interested in having machines learn from data. They attempted to approach the problem with various symbolic methods, as well as what were then termed "neural networks"; these were mostly perceptrons and other models that were later found to be reinventions of the generalized linear models of statistics.[23] Probabilistic reasoning was also employed, especially in automated medical diagnosis.[24]:\u200a488\u200a\nHowever, an increasing emphasis on the logical, knowledge-based approach caused a rift between AI and machine learning. Probabilistic systems were plagued by theoretical and practical problems of data acquisition and representation.[24]:\u200a488\u200a By 1980, expert systems had come to dominate AI, and statistics was out of favor.[25] Work on symbolic/knowled