Text Preprocessing:
===
In this notebook, you will learn what text preprocessing is and how to preprocess text before extracting features from it or feeding it into a machine learning model.


*You can read the full blog post [here](https://gauravchopracg.github.io/Text-Preprocessing/).*


Text preprocessing is the process of cleaning text data and preparing it for use in machine learning. The text preprocessing steps considered in this notebook are:

1. Tokenization
2. Text Normalization

Tokenization:
---
Tokenization is the process of splitting an input text into meaningful chunks. We can think of a text as a sequence of words, and of a word as a meaningful sequence of characters. (A text can also be viewed as a sequence of characters or phrases, but for simplicity we treat words as the basic unit here.) The first step is therefore to split the text into small chunks, which we call tokens. A token is a useful unit for further semantic processing.

We can split a text
* by whitespace,
* by punctuation, or
* by any set of rules specific to the task. In this notebook we will consider splitting text by the rules of English grammar.

In [0]:
import nltk

In [0]:
text = "This is Andrew's text, isn't it?"

### Split by white spaces:

To split the text into tokens on whitespace, we can use the Python library NLTK. It offers several tokenizer classes that split text into meaningful chunks. For example, nltk.tokenize.WhitespaceTokenizer() splits the input sequence on whitespace, which can be a space or any other invisible character such as a tab or a newline.

In [0]:
tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenizer.tokenize(text)

['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

However, the problem is that 'text,' and 'text' are two different tokens for the tokenizer, and so are 'it?' and 'it'. We might want to merge each pair into a single token because they have essentially the same meaning.
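One simple way to merge such tokens is to strip leading and trailing punctuation from each whitespace-separated token. This is a minimal sketch using only Python's standard library, not something the notebook itself does, and the helper name `strip_punct` is hypothetical:

```python
import string

text = "This is Andrew's text, isn't it?"

def strip_punct(tokens):
    """Remove leading/trailing punctuation from each token and drop empty results."""
    stripped = (token.strip(string.punctuation) for token in tokens)
    return [token for token in stripped if token]

# str.split() with no arguments splits on any run of whitespace
tokens = strip_punct(text.split())
print(tokens)
```

Because only the ends of each token are stripped, the apostrophes inside "Andrew's" and "isn't" are left in place.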

### Split by punctuation

As before, we can split the text by punctuation, this time using nltk.tokenize.WordPunctTokenizer(). The result is:

In [0]:
tokenizer = nltk.tokenize.WordPunctTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']

The problem is that 's', 'isn' and 't' are not very meaningful on their own, and every punctuation mark becomes a separate token, so it does not make much sense to analyze them.
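To see where these fragments come from, note that NLTK documents WordPunctTokenizer as splitting text with the regular expression `\w+|[^\w\s]+`. The sketch below reproduces that split with the standard `re` module and then drops the pure-punctuation tokens. The filtering step and the variable names are illustrative assumptions, not part of NLTK:

```python
import re

text = "This is Andrew's text, isn't it?"

# \w+ matches a run of letters, digits or underscores;
# [^\w\s]+ matches a run of characters that are neither word characters nor whitespace
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_OR_PUNCT.findall(text)
print(tokens)

# Keep only tokens that contain at least one alphanumeric character
words_only = [token for token in tokens if any(ch.isalnum() for ch in token)]
print(words_only)
```

Filtering removes the punctuation tokens, but fragments such as 's' and 't' remain, which is why a rule-based tokenizer is often preferable.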

### Split by set of heuristics

We can also come up with a set of rules or heuristics. One such set is implemented in TreebankWordTokenizer, which uses rules of English grammar to produce a tokenization that makes sense for further analysis. In practice, this is very close to the tokenization we want for English.

In [0]:
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']
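To give a feel for what such heuristics look like, here is a toy sketch with three regular-expression rules: separate common punctuation, split off "n't", and split off clitics such as "'s". It is only an illustration under simplifying assumptions (for example, it ignores periods and quotation marks). It is not the actual rule set of TreebankWordTokenizer, and the function name `toy_rule_tokenize` is hypothetical:

```python
import re

def toy_rule_tokenize(text):
    """Tokenize with three hand-written rules, then split on whitespace."""
    # Rule 1: put spaces around common punctuation marks
    text = re.sub(r"([,?!;:])", r" \1 ", text)
    # Rule 2: split the negation "n't" from the word it is attached to
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    # Rule 3: split possessives and other clitics from the preceding word
    text = re.sub(r"(?i)(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text)
    return text.split()

tokens = toy_rule_tokenize("This is Andrew's text, isn't it?")
print(tokens)
```

On this sentence the toy rules give the same tokens as the output above, but a real tokenizer needs many more rules to handle general English text.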

Text Normalization:
---
The next thing we might want to do is token normalization: we may want the same token for different forms of a word, for example wolf, wolves -> wolf or talk, talks -> talk. The process of reducing words to a common form is called text normalization. Two common approaches are:
* Stemming
* Lemmatization

We will first define a sample text, tokenize it, and then experiment with the different text normalization techniques.

In [0]:
text = "feet wolves cats talked"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)

### Stemming:

Stemming is the process of removing and replacing suffixes to get to the root form of the word, which is called the stem. It usually refers to heuristics that chop off suffixes. To apply stemming, we can use the NLTK class nltk.stem.PorterStemmer(). The Porter stemmer is one of the oldest and most widely used stemmers for English. It applies five heuristic phases of word reduction sequentially.

In [0]:
stemmer = nltk.stem.PorterStemmer()
" ".join(stemmer.stem(token) for token in tokens)

'feet wolv cat talk'
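As an illustration of what a single heuristic phase looks like, the sketch below implements only the plural-suffix rules of step 1a as described in Porter's original algorithm (SSES -> SS, IES -> I, SS -> SS, S -> nothing). It is a simplified sketch, not NLTK's implementation, and the function name `porter_step_1a` is hypothetical:

```python
def porter_step_1a(word):
    """Apply the step 1a suffix rules in order; the first matching rule wins."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for word in ["caresses", "ponies", "caress", "cats", "feet"]:
    print(word, "->", porter_step_1a(word))
```

Rules like these handle regular plurals such as "cats", but they cannot map an irregular form such as "feet" to "foot", and they can produce stems such as "poni" that are not dictionary words.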

### Lemmatization

Lemmatization usually refers to doing things properly, with the use of a vocabulary and morphological analysis. It returns the base or dictionary form of a word, which is known as the lemma. To apply lemmatization, we can use the NLTK class nltk.stem.WordNetLemmatizer(), which uses the WordNet database to look up lemmas.

In [0]:
nltk.download('wordnet')
lemmatizer = nltk.stem.WordNetLemmatizer()
" ".join(lemmatizer.lemmatize(token) for token in tokens)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


'foot wolf cat talked'
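Note that 'talked' is unchanged: lemmatize() treats each word as a noun unless a part of speech is passed through its pos argument (for example pos='v' for verbs). The toy sketch below mimics that lookup behaviour with a tiny hand-written vocabulary. The dictionary `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins for illustration, not WordNet or NLTK code:

```python
# Hypothetical toy vocabulary: (word, part of speech) -> lemma
TOY_LEMMAS = {
    ("feet", "n"): "foot",
    ("wolves", "n"): "wolf",
    ("cats", "n"): "cat",
    ("talked", "v"): "talk",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if it is not listed."""
    return TOY_LEMMAS.get((word, pos), word)

tokens = ["feet", "wolves", "cats", "talked"]

# Treating every token as a noun leaves the verb form untouched
as_nouns = " ".join(toy_lemmatize(token) for token in tokens)
print(as_nouns)

# Supplying the verb tag lets the lookup find the lemma
print(toy_lemmatize("talked", pos="v"))
```

In a real pipeline, the part of speech would typically come from a tagger rather than being supplied by hand.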

In practice, we need to try both stemming and lemmatization to decide which works best for our task.