# NLP Basics: Implementing a pipeline to clean text

### Pre-processing text data

Cleaning the text data highlights the attributes that you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of the following steps:
1. **Remove extra spaces**
2. **Remove punctuation**
3. **Case normalization**
4. **Tokenization**
5. **Remove stopwords**
6. **Lemmatize/Stem**


In [1]:
import pandas as pd
# Show up to 100 characters per column when printing a pandas DataFrame (widens the columns)
pd.set_option('display.max_colwidth', 100)
# The file is in TSV (tab-separated values) format and has no header row
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)
data.columns = ['label', 'text']  # assign column names

data.head()

Unnamed: 0,label,text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


## Removing extra white spaces

In [6]:
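# Extra whitespace can be collapsed with a regular expression
# Example
import re
doc = "NLP  is an interesting    field."
# Raw string so that \s reaches the regex engine instead of being read as a string escape
new_doc = re.sub(r"\s+", " ", doc)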
new_doc

'NLP is an interesting field.'
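
The pattern `\s+` matches any run of whitespace, not only spaces, so tabs and newlines are collapsed as well. The sketch below illustrates this and also trims the ends of the string; the `normalize_spaces` helper and the sample string are made up for illustration and are not part of the notebook.

```python
import re

def normalize_spaces(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

# Sample containing a tab, a newline, repeated spaces and padding at both ends
sample = "  NLP\tis an \n interesting    field.  "
print(normalize_spaces(sample))  # NLP is an interesting field.
```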

In [7]:
# Collapse runs of whitespace in every record of the dataset
def remove_extraspaces(text):
    text_nospaces = re.sub(r"\s+", " ", text)
    return text_nospaces
data['text_nospaces'] = data['text'].apply(remove_extraspaces)
data.head()
# In the rows shown, text and text_nospaces look identical

Unnamed: 0,label,text,text_nospaces
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though","Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL!!


## Removing punctuation

In [8]:
# Python's string module lists all ASCII punctuation characters
import string
string.punctuation
# A regular expression could also be used to remove every character except word characters, digits and spaces

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
"I like NLP." == 'I like NLP'
# The full stop makes the two strings unequal even though they are otherwise identical

False

In [12]:
# Remove punctuation from each record
def remove_punct(text):
    # Keep the non-punctuation characters and join them back into one string
    # (without join, the result would be a list of individual characters)
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct
data['text_nopunct'] = data['text_nospaces'].apply(remove_punct)
data.head()


Unnamed: 0,label,text,text_nospaces,text_nopunct
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2,ham,"Nah I don't think he goes to usf, he lives around here though","Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL
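
As noted in the comments above, punctuation can also be removed with a regular expression. The sketch below shows that variant next to a `str.translate` variant; the function names are hypothetical. One difference worth knowing: `\w` counts the underscore as a word character, so the regex version keeps `_`, while `string.punctuation` includes it.

```python
import re
import string

def remove_punct_regex(text):
    # Drop every character that is neither a word character nor whitespace.
    # \w includes the underscore, so "_" survives this version.
    return re.sub(r"[^\w\s]", "", text)

# Translation table that deletes each ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text):
    # Removes exactly the characters listed in string.punctuation
    return text.translate(PUNCT_TABLE)

sample = "I HAVE A DATE ON SUNDAY WITH WILL!!"
print(remove_punct_regex(sample))      # I HAVE A DATE ON SUNDAY WITH WILL
print(remove_punct_translate(sample))  # I HAVE A DATE ON SUNDAY WITH WILL

# The two variants differ on underscores
print(remove_punct_regex("snake_case!"))      # snake_case
print(remove_punct_translate("snake_case!"))  # snakecase
```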


## Tokenization

In [13]:
# Split each text into a list of words on runs of non-word characters
# (for ASCII text, r'\W+' matches anything outside [a-zA-Z0-9_])
# Splitting on \W+ rather than on a space also handles any other separator left in the text
# Tokenization is done after removing punctuation
import re

def tokens(text):
    tok = re.split(r'\W+', text)
    return tok
# Lower-casing the text before tokenizing performs the case normalization step
data['text_tokens'] = data['text_nopunct'].apply(lambda x: tokens(x.lower()))
data.head()

Unnamed: 0,label,text,text_nospaces,text_nopunct,text_tokens
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]"
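
One detail of `re.split` to keep in mind: when the text starts or ends with a non-word character, the result contains empty strings. The sketch below shows this on a made-up string, together with a hypothetical `tokenize` variant based on `re.findall` that returns only the word tokens.

```python
import re

text = " free entry, 2 win! "

# re.split yields an empty string wherever the text starts or ends with a separator
print(re.split(r"\W+", text))  # ['', 'free', 'entry', '2', 'win', '']

def tokenize(text):
    # Collect runs of word characters directly, so no empty tokens appear
    return re.findall(r"\w+", text.lower())

print(tokenize(text))  # ['free', 'entry', '2', 'win']
```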


## Removing stopwords

In [15]:
import nltk
stopwords = nltk.corpus.stopwords.words('english')  # the argument selects the language of the stopword list
stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [16]:
def remove_stopwords(token_list):
    # Keep only the tokens that are not in the stopword list
    nostopwords = [word for word in token_list if word not in stopwords]
    return nostopwords
data['text_nostopwords'] = data['text_tokens'].apply(remove_stopwords)
data.head()

Unnamed: 0,label,text,text_nospaces,text_nopunct,text_tokens,text_nostopwords
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]"
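
Because punctuation is removed before stopwords are filtered, contractions such as "you've" reach this step as "youve" and no longer match the list entry. The sketch below illustrates the effect and one possible workaround: adding punctuation-free copies of the stopwords. It uses a small sample of the list shown above rather than the full NLTK list, and the variable names are hypothetical. Converting the list to a `set` also makes each membership test faster.

```python
import string

# Small sample taken from the stopword list shown above; the real list is longer
stopwords_sample = ["i", "you", "you're", "you've", "have", "a", "on", "with"]
stopword_set = set(stopwords_sample)  # set lookups are faster than list lookups

# Variant that also contains each stopword with its punctuation stripped
table = str.maketrans("", "", string.punctuation)
stopword_set_nopunct = stopword_set | {word.translate(table) for word in stopword_set}

tokens = ["youve", "got", "a", "date", "on", "sunday"]
print([t for t in tokens if t not in stopword_set])          # ['youve', 'got', 'date', 'sunday']
print([t for t in tokens if t not in stopword_set_nopunct])  # ['got', 'date', 'sunday']
```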


## Stemming text

In [18]:
import nltk
ps = nltk.PorterStemmer()

In [34]:
def stem_text(nostopwords):
    text = [ps.stem(word) for word in nostopwords]
    text = " ".join(text)  # join the stems into one string instead of a list of words
    return text
data['text_stem'] = data['text_nostopwords'].apply(stem_text)
data.head()
# Each word is reduced to its stem, which may not be a dictionary word (e.g. "promise" becomes "promis")

Unnamed: 0,label,text,text_nospaces,text_nopunct,text_tokens,text_nostopwords,text_stem,text_lemm,final_cleaned_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...",ive search right word thank breather promis wont take help grant fulfil promis wonder bless time,ive searching right word thank breather promise wont take help granted fulfil promise wonderful ...,ive search right word thank breather promis wont take help grant fulfil promis wonder bless time
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...",free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd...,free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questions...,free entri wkli comp win cup final tkt 21st may 2005 text 87121 receiv entri questionstd txt rat...
2,ham,"Nah I don't think he goes to usf, he lives around here though","Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]",nah dont think goe usf live around though,nah dont think go usf life around though,nah dont think goe usf live around though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]",even brother like speak treat like aid patent,even brother like speak treat like aid patent,even brother like speak treat like aid patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]",date sunday,date sunday,date sunday
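
To see the stemmer's behaviour in isolation, the sketch below runs the Porter stemmer on a few words taken from the rows above. It assumes `nltk` is installed; the Porter stemmer itself needs no corpus download. The word list is chosen for illustration only.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words taken from the sample rows; several stems are not dictionary words
words = ["searching", "promise", "entry", "goes", "lives"]
for word in words:
    print(word, "->", ps.stem(word))
# searching -> search
# promise -> promis
# entry -> entri
# goes -> goe
# lives -> live
```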


## Lemmatizing the text
- Use either stemming or lemmatizing, not both.
- Lemmatizing reduces each word to its base form, called the lemma, which is itself a meaningful dictionary word.
- Lemmatizing is used when the meaning of the words matters, as in chatbots or question-answering applications. It takes more processing time than stemming.
- Stemming tends to reduce the number of features more than lemmatizing does and is commonly used for sentiment analysis.

In [20]:
wn = nltk.WordNetLemmatizer()

In [23]:
def lemmatizing_text(nostopwords):
    text = [wn.lemmatize(word) for word in nostopwords]
    text = " ".join(text)  # join the lemmas into one string instead of a list of words
    return text
data['text_lemm'] = data['text_nostopwords'].apply(lemmatizing_text)
data.head()
# Each word is looked up in the WordNet dictionary and replaced by its lemma.
# lemmatize() treats every word as a noun unless a part of speech is passed,
# which is why "lives" becomes "life" here.
# The dictionary lookup makes lemmatizing slower and more complex than stemming

Unnamed: 0,label,text,text_nospaces,text_nopunct,text_tokens,text_nostopwords,text_stem,text_lemm
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won...",ive searching right word thank breather promise wont take help granted fulfil promise wonderful ...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...",free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questions...
2,ham,"Nah I don't think he goes to usf, he lives around here though","Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]",nah dont think go usf life around though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]",even brother like speak treat like aid patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]",date sunday


### So far we have done:
- Removing extra spaces
- Removing punctuation
- Case normalization
- Tokenization
- Removing stopwords
- Stemming/Lemmatizing

### Each cleaning step was applied individually above. Now we create one user-defined function that performs all of them

In [32]:
# User-defined function that performs all of the text cleaning above in a single function
def clean_text(text):
    # Strip spaces before and after the text
    text = text.strip(" ")
    # Replace runs of whitespace with a single space
    text = re.sub(r"\s+", " ", text)
    # Remove punctuation
    text = "".join([char for char in text if char not in string.punctuation])
    # Create tokens
    tokens = re.split(r'\W+', text)
    # Remove stopwords and tokens of one or two characters, then stem with the Porter stemmer (ps)
    text_final = [ps.stem(word) for word in tokens if word not in stopwords and len(word)>2]
    # Join the stemmed tokens back into a single string
    text_final = " ".join(text_final)
    return text_final

In [33]:
data['final_cleaned_text'] = data['text'].apply(lambda x: clean_text(x.lower()))
data.head()
# clean_text stems the tokens, so final_cleaned_text matches text_stem
# except that tokens of one or two characters (e.g. "2", "fa") are also dropped.
# This way all of the text cleaning is done by one user-defined function

Unnamed: 0,label,text,text_nospaces,text_nopunct,text_tokens,text_nostopwords,text_stem,text_lemm,final_cleaned_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won...",ive searching right word thank breather promise wont take help granted fulfil promise wonderful ...,ive search right word thank breather promis wont take help grant fulfil promis wonder bless time
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...",free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questions...,free entri wkli comp win cup final tkt 21st may 2005 text 87121 receiv entri questionstd txt rat...
2,ham,"Nah I don't think he goes to usf, he lives around here though","Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]",nah dont think go usf life around though,nah dont think goe usf live around though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]",even brother like speak treat like aid patent,even brother like speak treat like aid patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]",date sunday,date sunday
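
Comparing `text_stem` and `final_cleaned_text` for row 1 in the table above shows that the two columns are not always identical: `clean_text` also drops tokens of one or two characters through its `len(word)>2` condition. The sketch below isolates that filter on the first few stems of row 1; the `drop_short_tokens` helper is hypothetical.

```python
def drop_short_tokens(tokens, min_len=3):
    # Keep only tokens with at least min_len characters (mirrors len(word) > 2)
    return [tok for tok in tokens if len(tok) >= min_len]

# First few stems of row 1, as shown in the text_stem column
stems = ["free", "entri", "2", "wkli", "comp", "win", "fa", "cup"]
print(" ".join(stems))                     # free entri 2 wkli comp win fa cup
print(" ".join(drop_short_tokens(stems)))  # free entri wkli comp win cup
```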
