# Step By Step NLP Tutorial

**Machine Learning Pipeline for NLP**
1. **Raw Text:** Model cannot distinguish the words.
2. **Tokenization:** Tell model what to look at.
3. **Clean Text:** Remove Stop Words/Punctuations, Stemming, etc.
4. **Vectorize:** Convert Words into numeric form.
5. **Machine Learining Algorithms:** Fit/Train the model.

## Data Loading


In [1]:
import pandas as pd

**Data Content**

- Our data is tab separated file.
- There is also no header. 
- We need to tab separate the content and put the proper header.
![File Content](data_content.PNG)

In [2]:
data = pd.read_csv('SMSSpamCollection.tsv', sep= '\t', header = None)

In [3]:
data.head()

Unnamed: 0,0,1
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [4]:
data.columns=['Label','Text']

In [5]:
data.head()

Unnamed: 0,Label,Text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


## Exploring the data

In [6]:
print(f'Number of rows {data.shape[0]} \nNumber of columns {data.shape[1]}')

Number of rows 5568 
Number of columns 2


In [7]:
data.isna().sum()

Label    0
Text     0
dtype: int64

In [8]:
data['Label'].value_counts()

Label
ham     4822
spam     746
Name: count, dtype: int64

## Cleaning the data

In [9]:
import re
import string
import nltk

### Removing the punctuaction

In [10]:
def remove_punct(text):
    # ''.join helps to put together all the words until it encounters whitespace
    text_nopunct = ''.join([char for char in text if char not in string.punctuation])     
    return text_nopunct

In [11]:
data['Text_No_Punc'] = data['Text'].apply(lambda x: remove_punct(x))
data.head()

Unnamed: 0,Label,Text,Text_No_Punc
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL


### Tokenization

In [12]:
# Tokenizing the text without using explicit libraries
def tokenize_text(text):
    tokens = re.split('\W',text)# '\W' Checks for every other words or digits or any special characters.
    return tokens

In [13]:
# Example Text
text = 'Hello How are you 50'
print(tokenize_text(text))

['Hello', 'How', 'are', 'you', '50']


In [14]:
data['Text_Tokens'] = data['Text_No_Punc'].apply(lambda x: tokenize_text(x.lower()))
data.head()

Unnamed: 0,Label,Text,Text_No_Punc,Text_Tokens
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...,"[ive, been, searching, for, the, right, words,..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l..."
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...,"[even, my, brother, is, not, like, to, speak, ..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]"


### Stop Word Removal

In [15]:
stopword = nltk.corpus.stopwords.words('english')

In [16]:
def remove_stopword(text):
    clean_text = [char for char in text if char not in stopword]
    return clean_text

In [17]:
data['Text_No_Stopwords'] = data['Text_Tokens'].apply(lambda x: remove_stopword(x))
data.head()

Unnamed: 0,Label,Text,Text_No_Punc,Text_Tokens,Text_No_Stopwords
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...,"[ive, been, searching, for, the, right, words,...","[ive, searching, right, words, thank, breather..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...,"[even, my, brother, is, not, like, to, speak, ...","[even, brother, like, speak, treat, like, aids..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]"


### Stemming

In [18]:
ps = nltk.PorterStemmer()

In [19]:
# Example 
print(ps.stem('grow'))
print(ps.stem('growing'))
print(ps.stem('grows'))
print(ps.stem('running'),ps.stem('ran'),ps.stem('run'))

grow
grow
grow
run ran run


In [20]:
def stem_text(text):
    stem_word = [ps.stem(word) for word in text]
    return stem_word
data['Text_Stemmed'] = data['Text_No_Stopwords'].apply(lambda x:stem_text(x))

In [21]:
data.head()

Unnamed: 0,Label,Text,Text_No_Punc,Text_Tokens,Text_No_Stopwords,Text_Stemmed
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...,"[ive, been, searching, for, the, right, words,...","[ive, searching, right, words, thank, breather...","[ive, search, right, word, thank, breather, pr..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entri, 2, wkli, comp, win, fa, cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, goe, usf, live, around, tho..."
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...,"[even, my, brother, is, not, like, to, speak, ...","[even, brother, like, speak, treat, like, aids...","[even, brother, like, speak, treat, like, aid,..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]"


### Lemmatization

In [22]:
wn = nltk.WordNetLemmatizer()

In [23]:
# Example 
print(wn.lemmatize('goose'))
print(wn.lemmatize('geese'))
print(wn.lemmatize('lives'))

goose
goose
life


In [24]:
def lemmatize_text(text):
    lemma_word = [wn.lemmatize(word) for word in text]
    return lemma_word
data['Text_Lemmatized'] = data['Text_No_Stopwords'].apply(lambda x:stem_text(x))

In [25]:
data.head()

Unnamed: 0,Label,Text,Text_No_Punc,Text_Tokens,Text_No_Stopwords,Text_Stemmed,Text_Lemmatized
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...,"[ive, been, searching, for, the, right, words,...","[ive, searching, right, words, thank, breather...","[ive, search, right, word, thank, breather, pr...","[ive, search, right, word, thank, breather, pr..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entri, 2, wkli, comp, win, fa, cup, fin...","[free, entri, 2, wkli, comp, win, fa, cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, goe, usf, live, around, tho...","[nah, dont, think, goe, usf, live, around, tho..."
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...,"[even, my, brother, is, not, like, to, speak, ...","[even, brother, like, speak, treat, like, aids...","[even, brother, like, speak, treat, like, aid,...","[even, brother, like, speak, treat, like, aid,..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]","[date, sunday]","[date, sunday]"


## Vectorization of data

### Count Vectorization

### N-Grams

### TF-IDF

## Feature Engineering