# Text pre-processing

Text comes in many forms: a list of individual words, sentences, or multiple paragraphs with special characters (tweets, for example). Preprocessing is an important step in the pipeline because it determines whether the right data is fed to the model.
**Text preprocessing is a severely overlooked topic.** A few people I spoke to mentioned inconsistent results from their NLP applications, only to realize that they were not preprocessing their text or were using the wrong kind of text preprocessing for their project.
<img src="Text preprocessing.png" alt="Text pre-processing pipeline" style="width: 250px;"/>

## Table of contents:
-  Introduction
-  Dataset
-  Types of text preprocessing techniques
   - Lowercasing
   - Tokenization
     - Sentence tokenization
     - Word tokenization
   - Normalization
      - Lemmatization
      - Stemming
   - Cleaning
     - Noise removal
       - Removal of HTML characters, URLs, expressions like [laughing] and [crying], and alphanumeric characters
     - Punctuation
     - Stopwords
   - How is lemmatizing different from stemming?

## Introduction

In this tutorial, you will discover how to clean and prepare your text for modeling with a machine learning model. <br>
There is a whole suite of text preparation methods that you may need to use, and the choice of methods really depends on your natural language processing task.

## Dataset

Let's consider the following tweet, an opinion about the iPhone taken from Twitter.

In [1]:
text = "RT @Joydeep_911: I ♥ my &lt;3 iphone &amp; you’rè awsm apple. DisplayIsAwesome ★★★, sooo happppppy 🙂 http://www.apple.com. I have plans to buy 2 more of same varient"

## Types of text preprocessing techniques

### Lowercasing

This is a basic, must-have preprocessing step in the pipeline. Text often has a variety of capitalization reflecting the beginning of sentences, proper nouns and emphasis. Lowercasing reduces sparsity, because the same word written with different cases maps to a single lowercase form: 'Canada' and 'canada' both become 'canada'.
> But it is important to remember that some words can change meaning when reduced to lower case, like “US” becoming “us”.

Python has a rich set of __[string functions](https://www.w3schools.com/python/python_ref_string.asp)__. We will use the lower() function to convert the string to lower case.

In [2]:
## Lowercasing

lowercased_text = text.lower()
lowercased_text

'rt @joydeep_911: i ♥ my &lt;3 iphone &amp; you’rè awsm apple. displayisawesome ★★★, sooo happppppy 🙂 http://www.apple.com. i have plans to buy 2 more of same varient'
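One way to respect the "US" vs. "us" caveat is to lowercase everything except a short allow-list of case-sensitive tokens. The sketch below is only an illustration: the `CASE_SENSITIVE` set and the `lowercase_except` helper are hypothetical names, and the list would have to be chosen per project.

```python
# Hypothetical allow-list of tokens whose meaning depends on their case.
CASE_SENSITIVE = {"US", "IT", "WHO"}

def lowercase_except(text, keep=CASE_SENSITIVE):
    """Lowercase every whitespace-separated token unless it is in `keep`."""
    return " ".join(tok if tok in keep else tok.lower() for tok in text.split())

print(lowercase_except("The US team told us about Canada"))
# the US team told us about canada
```

Note that this simple version splits on whitespace only, so a token with attached punctuation such as "US," would still be lowercased, and runs of spaces are collapsed to one.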

### Normalization

Text normalization is the process of transforming text into a canonical (standard) form. For example, the word “loooove” and “luv” can be transformed to “love”, its canonical form. Another example is mapping of near identical words such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.

Unicode normalization is one such technique, in which special characters (ä, ö, ü, ß) are normalized. <br>
It is necessary to keep all the data in a standard encoding format. UTF-8 encoding is widely accepted and recommended. Read __[this](https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16)__ Stack Overflow answer for details about Unicode.

In [3]:
## In this step the è in you’rè is normalized to e, and other non-ASCII characters such as ♥, ★ and the curly apostrophe ’ are dropped.

from unicodedata import normalize
unicode_norm_text = normalize('NFD', lowercased_text).encode('ascii', 'ignore')
unicode_norm_text = unicode_norm_text.decode('UTF-8')
unicode_norm_text

'rt @joydeep_911: i  my &lt;3 iphone &amp; youre awsm apple. displayisawesome , sooo happppppy  http://www.apple.com. i have plans to buy 2 more of same varient'
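The cell above drops every non-ASCII character, which also removes symbols you may want to keep. A gentler variant, sketched here with only the standard library, decomposes each character with NFD and removes just the combining accent marks. The helper name `strip_accents` is an assumption for illustration.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks but keep all other characters."""
    # NFD splits 'è' into the base letter 'e' plus a combining grave accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("rè ä ö ü"))  # re a o u
print(strip_accents("i ♥ apple"))  # i ♥ apple (the symbol is kept)
print(strip_accents("ß"))          # ß (has no decomposition, so it is unchanged)
```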

Contractions are words written with an apostrophe, such as "ain’t" or "aren’t". Since we want to normalize the text, it makes sense to expand these contractions.

In [4]:
## contractions.json in the utils folder holds the full list of contractions.

"""
{
  "ain't": "am not",
  "aren't": "are not",
  "can't": "cannot"
  ...
  }
"""

import json
import re

## The contraction dictionary is stored in a JSON file and loaded as a Python dictionary
with open("../../utils/contractions.json", encoding="utf8") as f:
    contractions = json.load(f)

## Expanding contractions: start from the normalized text and apply every
## replacement cumulatively, so that earlier expansions are not overwritten
contraction_norm_text = unicode_norm_text
for key, value in contractions.items():
    contraction_norm_text = re.sub(r'\b' + re.escape(key) + r'\b', value, contraction_norm_text)
contraction_norm_text

'rt @joydeep_911: i  my &lt;3 iphone &amp; you are awsm apple. displayisawesome , sooo happppppy  http://www.apple.com. i have plans to buy 2 more of same varient'
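For a long contraction list, calling `re.sub` once per key becomes slow. A possible alternative is to compile all keys into one pattern and expand them in a single pass. The sketch below uses a tiny inline mapping taken from the sample entries shown above instead of the JSON file; `expand_contractions` is a hypothetical helper name.

```python
import re

# Tiny illustrative mapping; a real project would load a much fuller list.
CONTRACTIONS = {"ain't": "am not", "aren't": "are not", "can't": "cannot"}

# Longest keys first so that a longer contraction wins over a shorter one.
_keys = sorted(CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b")

def expand_contractions(text):
    """Replace every known contraction in one pass over the text."""
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text)

print(expand_contractions("i can't say they aren't here"))
# i cannot say they are not here
```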

Text normalization is important for noisy texts such as social media comments, text messages and blog comments, where abbreviations, misspellings and out-of-vocabulary (OOV) words are prevalent.<br>
Misspellings can often be corrected with paid spell-checking API services such as those from Google and __[Azure spell checker](https://azure.microsoft.com/en-in/services/cognitive-services/spell-check/)__.
<img src="Azure_spell_check_output.PNG" alt="Drawing" style="width: 1050px;"/>

In [5]:
## Here the spelling is corrected manually.

spell_corrected_text = re.sub("displayisawesome", "display is awesome", contraction_norm_text)
spell_corrected_text

'rt @joydeep_911: i  my &lt;3 iphone &amp; you are awsm apple. display is awesome , sooo happppppy  http://www.apple.com. i have plans to buy 2 more of same varient'

There are cases that even a neural-network-based spell checker cannot correct, such as **awsm** and **sooo happppppy**. For a large dataset, fixing these by hand is painful and does not scale. Here the problem can be addressed with a pattern-based rule once the pattern is identified (a letter repeated more than twice), but the rule below also crudely damages **www**. A very naive rule can mess up the original text, and it is difficult to come up with a single global regex that fixes every misspelling.

In [6]:
import itertools
norm_text = ''.join(''.join(s)[:2] for _, s in itertools.groupby(spell_corrected_text))
norm_text

'rt @joydeep_911: i  my &lt;3 iphone &amp; you are awsm apple. display is awesome , soo happy  http://ww.apple.com. i have plans to buy 2 more of same varient'
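The damage to **www** can be limited by skipping tokens that look like URLs. The following is a rough sketch under that assumption: `squeeze_repeats` is a hypothetical helper, and the URL check is a simple prefix test, not a full URL parser.

```python
import re

# A letter repeated three or more times in a row.
_REPEATED_LETTER = re.compile(r"([a-zA-Z])\1{2,}")
_URL_PREFIXES = ("http://", "https://", "www.")

def squeeze_repeats(text):
    """Cut runs of 3+ identical letters down to 2, leaving URL-like tokens alone."""
    # Splitting with a capturing group keeps the whitespace separators.
    parts = re.split(r"(\s+)", text)
    squeezed = []
    for part in parts:
        if part.startswith(_URL_PREFIXES):
            squeezed.append(part)
        else:
            squeezed.append(_REPEATED_LETTER.sub(r"\1\1", part))
    return "".join(squeezed)

print(squeeze_repeats("sooo happppppy  http://www.apple.com."))
# soo happy  http://www.apple.com.
```

This still leaves "soo" as it is, so a dictionary lookup or spell checker would be needed to recover "so".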

## Cleaning

### Noise removal
Noise removal is about removing characters, digits and pieces of text that can interfere with your text analysis. It is one of the most essential text preprocessing steps and one of the first things to look into in text mining and NLP. <br>
There are various ways to remove noise, including punctuation removal, special character removal, number removal, HTML formatting removal, domain-specific keyword removal (e.g. ‘RT’ for retweet), source code removal, header removal and more. It all depends on the domain you are working in and what counts as noise for your task.

Data obtained from the web usually contains a lot of HTML entities like `&lt;`, `&gt;` and `&amp;`, which get embedded in the original data. 
It is thus necessary to get rid of these entities. One approach is to remove them directly with specific regular expressions. 
Another approach is to use appropriate packages and modules (for example Python's **html.parser** or **BeautifulSoup**), which can convert these entities back to the characters they stand for. 
For example, `&lt;` is converted to “<” and `&amp;` is converted to “&”.

In [7]:
## Cleaning

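## HTML entities

from bs4 import BeautifulSoup

## Name the parser explicitly so bs4 does not have to guess one (and warn about it)
soup = BeautifulSoup(norm_text, "html.parser")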

clean_text = soup.get_text()

clean_text

'rt @joydeep_911: i  my <3 iphone & you are awsm apple. display is awesome , soo happy  http://ww.apple.com. i have plans to buy 2 more of same varient'
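If the text contains only entities and no real HTML tags, the standard library may be enough. As a small sketch, `html.unescape` converts entities to their characters without any third-party dependency:

```python
import html

raw = "i &lt;3 iphone &amp; you are awsm"
decoded = html.unescape(raw)
print(decoded)
# i <3 iphone & you are awsm
```

Unlike BeautifulSoup's `get_text()`, this leaves any actual tags in place, so it suits text that only needs its entities decoded.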

In [8]:
## Cleaning

## URL etc

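## Raw strings keep the regex escapes intact; the patterns are kept narrow so that
## they stop at the end of the URL and at the end of the retweet prefix
clean_text = re.sub(r"http\S*?\.com", "", clean_text)
clean_text = re.sub(r"rt\s+@\w+:", "", clean_text)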
clean_text

' i  my <3 iphone & you are awsm apple. display is awesome , soo happy  . i have plans to buy 2 more of same varient'
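The patterns above are tied to this one tweet (a `.com` address and a single retweet prefix). A somewhat more general sketch is shown below; the pattern names and the `remove_tweet_noise` helper are assumptions for illustration, and the URL pattern is deliberately simple.

```python
import re

_URL = re.compile(r"https?://\S+|www\.\S+")
_RETWEET_PREFIX = re.compile(r"^\s*rt\s+@\w+:?\s*", flags=re.IGNORECASE)
_MENTION = re.compile(r"@\w+")
_WHITESPACE = re.compile(r"\s+")

def remove_tweet_noise(text):
    """Strip a leading retweet marker, URLs and @mentions, then tidy whitespace."""
    text = _RETWEET_PREFIX.sub("", text)
    text = _URL.sub("", text)
    text = _MENTION.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()

print(remove_tweet_noise("RT @Joydeep_911: great phone http://www.apple.com thanks @apple"))
# great phone thanks
```

Because the URL pattern runs to the next whitespace, a sentence-ending period glued to a URL is removed together with it.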

## Tokenization

### Sentence tokenization
Sentence tokenization divides a string of written language into its component sentences. In English and some other languages, we can split the sentences apart wherever we see sentence-ending punctuation such as a period, question mark or exclamation mark.

In [9]:
from nltk.tokenize import sent_tokenize, word_tokenize

tokenized_sentences = sent_tokenize(clean_text)

tokenized_sentences

[' i  my <3 iphone & you are awsm apple.',
 'display is awesome , soo happy  .',
 'i have plans to buy 2 more of same varient']
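To see why a dedicated tokenizer is useful, here is a naive splitter that implements only the "split after sentence-ending punctuation" rule. It is a sketch for comparison, not a replacement for `sent_tokenize`; the name `naive_sent_tokenize` is hypothetical.

```python
import re

# Split on whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def naive_sent_tokenize(text):
    """Split text into sentences using punctuation alone."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]

print(naive_sent_tokenize("i love it. display is awesome! really?"))
# ['i love it.', 'display is awesome!', 'really?']

# The rule breaks on abbreviations:
print(naive_sent_tokenize("dr. smith has an iphone."))
# ['dr.', 'smith has an iphone.']
```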

### Word tokenization
Word tokenization divides a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, a space is a good approximation of a word divider.

Most Python NLP libraries provide functions for sentence and word tokenization.

In [10]:
## Tokenization

tokenized_words = []
# Word tokenization
for sent in tokenized_sentences :
    tokenized_words.append(word_tokenize(sent))

tokenized_words

[['i', 'my', '<', '3', 'iphone', '&', 'you', 'are', 'awsm', 'apple', '.'],
 ['display', 'is', 'awesome', ',', 'soo', 'happy', '.'],
 ['i', 'have', 'plans', 'to', 'buy', '2', 'more', 'of', 'same', 'varient']]
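Splitting on spaces is only an approximation because punctuation stays attached to the neighbouring word. The sketch below contrasts `str.split` with a simple regular expression that separates punctuation into its own tokens; the pattern is an illustrative assumption and is much simpler than what `word_tokenize` does.

```python
import re

sentence = "you are awsm apple."

# Whitespace split: the period stays glued to the last word.
print(sentence.split())
# ['you', 'are', 'awsm', 'apple.']

# Runs of word characters, or any single non-space, non-word character.
_TOKEN = re.compile(r"\w+|[^\w\s]")

def simple_word_tokenize(text):
    return _TOKEN.findall(text)

print(simple_word_tokenize(sentence))
# ['you', 'are', 'awsm', 'apple', '.']
```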

## Cleaning

### Punctuation
Punctuation marks should be handled according to their importance for the task. For example, “.”, “,” and “?” are often important and may be worth retaining, while others can be removed. The code below takes the simplest route and strips every punctuation character.

In [11]:
punctuation_cleaned_text = []
import string

## str.maketrans('', '', chars) builds a translation table that deletes every character in chars
## https://www.programiz.com/python-programming/methods/string/maketrans
table = str.maketrans('', '', string.punctuation)
for sent in tokenized_words :
    punctuation_cleaned_text.append([word.translate(table) for word in sent])
punctuation_cleaned_text

[['i', 'my', '', '3', 'iphone', '', 'you', 'are', 'awsm', 'apple', ''],
 ['display', 'is', 'awesome', '', 'soo', 'happy', ''],
 ['i', 'have', 'plans', 'to', 'buy', '2', 'more', 'of', 'same', 'varient']]
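The output above still contains empty strings where a token consisted only of punctuation, and it removes the marks the text suggested keeping. A possible variant, sketched here, keeps a chosen set of marks and drops the empty tokens. `KEEP` and `clean_punctuation` are hypothetical names, and which marks to keep is a per-task decision.

```python
import string

# Punctuation to retain; everything else in string.punctuation is deleted.
KEEP = ".,?"
_to_delete = "".join(ch for ch in string.punctuation if ch not in KEEP)
_table = str.maketrans("", "", _to_delete)

def clean_punctuation(tokens):
    """Delete unwanted punctuation and drop tokens that become empty."""
    cleaned = (tok.translate(_table) for tok in tokens)
    return [tok for tok in cleaned if tok]

print(clean_punctuation(["iphone", "&", "you", "are", "awsm", "apple", "."]))
# ['iphone', 'you', 'are', 'awsm', 'apple', '.']
```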

### Stopwords
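Stopwords are very common words (such as "i", "is" and "the") that carry little meaning on their own. When applying machine learning to text, these words can add a lot of noise, so they are often removed.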

In [12]:
stopwords_cleaned_text = []
import nltk
stop = nltk.corpus.stopwords.words('english')
for sent in punctuation_cleaned_text : 
    stopwords_cleaned_text.append([i for i in sent if i not in stop])

stopwords_cleaned_text

[['', '3', 'iphone', '', 'awsm', 'apple', ''],
 ['display', 'awesome', '', 'soo', 'happy', ''],
 ['plans', 'buy', '2', 'varient']]
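NLTK returns its stopwords as a list, so each `not in` check scans the whole list. Converting the list to a set makes the lookup much cheaper on large corpora, and the same filter can drop the leftover empty strings. The sketch below uses a small hand-picked set in place of the NLTK list so that it runs without downloads; `STOPWORDS` and `remove_stopwords` are illustrative names.

```python
# Small illustrative stopword set; in practice this could be set(stop) from NLTK.
STOPWORDS = frozenset({"i", "my", "you", "are", "is", "have", "to", "more", "of", "same"})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep tokens that are non-empty and not in the stopword set."""
    return [tok for tok in tokens if tok and tok.lower() not in stopwords]

print(remove_stopwords(["i", "have", "plans", "to", "buy", "2", "more", "of", "same", "varient"]))
# ['plans', 'buy', '2', 'varient']
```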

## Normalization

### Lemmatizing
It's the process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma. The lemma is the canonical form of a set of words. More simply put, lemmatizing is using vocabulary analysis of words to remove inflectional endings and return to the dictionary form of a word.

In [13]:
from nltk.stem.wordnet import WordNetLemmatizer
lemma = WordNetLemmatizer()

lemmatized_text = []
for stopwords_cleaned_sent in stopwords_cleaned_text :
    lemmatized_text.append(" ".join([lemma.lemmatize(word) for word in stopwords_cleaned_sent]))

lemmatized_text

[' 3 iphone  awsm apple ', 'display awesome  soo happy ', 'plan buy 2 varient']
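To make the idea of "returning the dictionary form" concrete, here is a toy lemmatizer built on a hand-written lookup table. It is only an illustration of the principle: the `LEMMAS` table and `toy_lemmatize` are hypothetical, whereas a real lemmatizer such as the WordNet one relies on a full vocabulary and morphological rules.

```python
# Hypothetical, hand-written lookup table used purely for illustration.
LEMMAS = {"plans": "plan", "berries": "berry", "mice": "mouse", "was": "be"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMAS.get(word.lower(), word)

print([toy_lemmatize(w) for w in ["plans", "Berries", "mice", "iphone"]])
# ['plan', 'berry', 'mouse', 'iphone']
```

Every result is a real word, and unknown words are left alone, which is the key contrast with rule-based suffix chopping.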

### Stemming
It is the process of reducing inflected or derived words to their word stem or root.
More simply put, the process of stemming means often crudely chopping off the end of a word, to leave only the base. So this means taking words with various suffixes and condensing them under the same root word.

NLTK includes several stemmers, among them the Porter Stemmer, the Snowball Stemmer, the Lancaster Stemmer, and a Regex-Based Stemmer.

In [14]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("plans")

stemmed_text = []
for stopwords_cleaned_sent in stopwords_cleaned_text :
    stemmed_text.append(" ".join([ps.stem(word) for word in stopwords_cleaned_sent]))

stemmed_text

[' 3 iphon  awsm appl ', 'display awesom  soo happi ', 'plan buy 2 varient']
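The "crude chopping" can be illustrated with a toy suffix stripper. This is not the Porter algorithm, which applies many more rules in several phases; `SUFFIXES` and `toy_stem` are hypothetical names for a minimal sketch.

```python
# Suffixes are tried in this order; the first one that fits is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Lowercase the word and chop one known suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["plans", "stemming", "stemmed", "berries", "is"]])
# ['plan', 'stemm', 'stemm', 'berri', 'is']
```

Related forms do collapse to one stem, but stems such as "stemm" and "berri" are not dictionary words.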

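#### How is lemmatizing different from stemming?
The goal of both is to condense derived words into their base forms. Stemming does this by crudely chopping off word endings, so the result may not be a real word, while lemmatizing uses vocabulary analysis to return the dictionary form.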
<img src="../../Images/Stemming_Lemmatization.PNG" style="width: 550px;"/>

In [15]:
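## In this exercise the stemmer crudely chops Electricity, Electrical, Berries and Berry into non-words (electr, berri).
## The lemmatizer only returns real words, but it leaves these capitalized inputs unchanged: WordNetLemmatizer is
## case-sensitive and treats every word as a noun unless a part-of-speech tag is passed.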

import pandas as pd
words = ["Stemmed", "Stemming", "Electricity", "Electrical", "Berries", "Berry"]

## Build all three columns up front instead of filling empty columns row by row
lemma_stemming_df = pd.DataFrame({
    "words": words,
    "stemming": [ps.stem(word) for word in words],
    "lemmatize": [lemma.lemmatize(word) for word in words],
})
lemma_stemming_df

|   | words | stemming | lemmatize |
|---|-------|----------|-----------|
| 0 | Stemmed | stem | Stemmed |
| 1 | Stemming | stem | Stemming |
| 2 | Electricity | electr | Electricity |
| 3 | Electrical | electr | Electrical |
| 4 | Berries | berri | Berries |
| 5 | Berry | berri | Berry |


### Summary

This is an extensive list of steps for cleaning messy text. Not all steps apply to every text-related problem, so choose judiciously.
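One way to keep that choice explicit is to write each step as a small function and pass the selected steps to a pipeline as a list. The sketch below uses only the standard library; the helper names (`strip_non_ascii`, `remove_urls`, `collapse_whitespace`, `preprocess`) are assumptions for illustration and cover only a few of the steps discussed.

```python
import html
import re
import unicodedata

def strip_non_ascii(text):
    # Decompose accented letters, then drop whatever is still non-ASCII.
    return unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("ascii")

def remove_urls(text):
    return re.sub(r"https?://\S+", "", text)

def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

def preprocess(text, steps):
    """Apply each step in order; adding or removing a step is a one-line change."""
    for step in steps:
        text = step(text)
    return text

tweet_steps = [str.lower, html.unescape, strip_non_ascii, remove_urls, collapse_whitespace]
print(preprocess("I ♥ my iPhone &amp; Apple http://www.apple.com", tweet_steps))
# i my iphone & apple
```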

> **This tutorial is intended to be a public resource. As such, if you see any glaring inaccuracies or if a critical topic is missing, please feel free to point it out or (preferably) submit a pull request to improve the tutorial. Also, we are always looking to improve the scope of this article. For anything feel free to mail us @ colearninglounge@gmail.com**

> **Author of this article is Yogesh Kothiya. You can follow him on __[LinkedIn](https://www.linkedin.com/in/yogeshkothiya/)__, __[Medium](https://medium.com/@kothiya.yogesh)__, __[GitHub](https://github.com/kothiyayogesh)__, __[Twitter](https://twitter.com/Yogesh_Kothiya)__.**