In this notebook, a simple gradient boosting algorithm is implemented for the well-known basic NLP problem of "Spam-Ham" email classification.

The steps are carried out separately to show how the data is prepared before it is fed into the ML algorithm. New features will also be created, and their usefulness will be evaluated.

In [6]:
#importing necessary libraries
import nltk
import re
import pandas as pd
import string as st

In [7]:
data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header = None)
data.columns = ['labels', 'email_text']
data.head()

Unnamed: 0,labels,email_text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [8]:
#Let's remove the punctuation from the email text and store the result in a new column
#The join function puts the remaining characters back together into a single string

def no_punc(text):
    text_no_punc = ''.join(i for i in text if i not in st.punctuation)
    return text_no_punc

data['no_punc'] = data['email_text'].apply(no_punc)
data.head()

Unnamed: 0,labels,email_text,no_punc
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL


In [9]:
#Now we need to split the email text into individual word tokens
def tokenizing_text(text):
    text_tokenize = re.split(r'\W+', text)
    return text_tokenize

data['tokenized'] = data['no_punc'].apply(tokenizing_text)
data.head()

Unnamed: 0,labels,email_text,no_punc,tokenized
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...,"[Ive, been, searching, for, the, right, words,..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F..."
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[Nah, I, dont, think, he, goes, to, usf, he, l..."
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...,"[Even, my, brother, is, not, like, to, speak, ..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL]"
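
One detail of `re.split` is worth keeping in mind: when the text starts or ends with a non-word character, the result contains an empty string. The minimal sketch below uses a made-up message (an assumption, not taken from the dataset) to show the effect and one possible way to filter it out.

```python
import re

# Hypothetical message with a trailing space (not from the dataset)
text = "win cash now "

# Splitting on runs of non-word characters leaves an empty string at the end
tokens = re.split(r"\W+", text)
print(tokens)

# Dropping empty strings keeps only the real word tokens
clean_tokens = [t for t in tokens if t]
print(clean_tokens)
```

If empty tokens are not filtered out, they can end up as a feature of their own after vectorization.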


Many common English words carry little information for the NLP model. These are called stopwords, and we need to remove them. Two points are worth mentioning:

    - The English stopwords are in lower case. So we need to convert our email_text to lower case before comparing it with the stopwords.
    
    - Some stopwords contain punctuation (e.g. "don't"). As we have already removed the punctuation, these no longer match exactly ("don't" has become "dont"), so removing stopwords after "stemming" or "lemmatizing" may give better results.
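
The second point can be illustrated with a small sketch. The stopword set here is a tiny hand-picked stand-in for the NLTK list (an assumption made to keep the example self-contained), and `strip_punctuation` is a hypothetical helper that mirrors the punctuation step used earlier.

```python
import string

# Hypothetical stand-in for a few entries of the English stopword list
STOPWORDS = {"i", "he", "to", "don't"}

def strip_punctuation(text):
    """Remove every punctuation character from the text."""
    return "".join(ch for ch in text if ch not in string.punctuation)

message = "I don't think he goes to usf"

# Order A: strip punctuation first, then filter stopwords
tokens_a = strip_punctuation(message).lower().split()
kept_a = [t for t in tokens_a if t not in STOPWORDS]

# Order B: filter stopwords first, then strip punctuation
tokens_b = message.lower().split()
kept_b = [strip_punctuation(t) for t in tokens_b if t not in STOPWORDS]

print(kept_a)  # ['dont', 'think', 'goes', 'usf']
print(kept_b)  # ['think', 'goes', 'usf']
```

In order A, "dont" survives because it no longer matches the stopword "don't".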

In [10]:
#Now let's load the English stopwords
stopword = nltk.corpus.stopwords.words('english')

def removing_stopwords(text):
    text_nostopword = [i.lower() for i in text if i.lower() not in stopword]
    return text_nostopword

data['no_stopword'] = data['tokenized'].apply(removing_stopwords)
data.head()

Unnamed: 0,labels,email_text,no_punc,tokenized,no_stopword
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...,"[Ive, been, searching, for, the, right, words,...","[ive, searching, right, words, thank, breather..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[Nah, I, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...,"[Even, my, brother, is, not, like, to, speak, ...","[even, brother, like, speak, treat, like, aids..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL]","[date, sunday]"


Stemming means chopping the extra characters off a word with fixed rules (e.g. "played" to "play"). It does not consider any neighbouring words before removing the extra letters, and the result is not always a real word (e.g. "entry" becomes "entri").
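
To make the idea of rule-based chopping concrete, here is a toy suffix stripper. It is only an illustration: the three rules and the `toy_stem` name are assumptions, and this is not the Porter algorithm, which applies many more rules and conditions.

```python
# Suffixes tried in order; purely illustrative, not the Porter rule set
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["played", "searching", "words", "goes", "sunday"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['play', 'search', 'word', 'goe', 'sunday']
```

Each word is handled on its own, without any knowledge of its meaning, which is why "goes" ends up as "goe" rather than "go".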

In [11]:
#To stem the words, let's use the PorterStemmer from nltk.
#(According to the nltk.org website, the Snowball stemmer is the better choice for English)

ps_stemmer = nltk.PorterStemmer()

def stemming(text):
    text_stemmed = [ps_stemmer.stem(i) for i in text]
    return text_stemmed

data['stemmed'] = data['no_stopword'].apply(stemming)
data.head()

Unnamed: 0,labels,email_text,no_punc,tokenized,no_stopword,stemmed
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...,"[Ive, been, searching, for, the, right, words,...","[ive, searching, right, words, thank, breather...","[ive, search, right, word, thank, breather, pr..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entri, 2, wkli, comp, win, fa, cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[Nah, I, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, goe, usf, live, around, tho..."
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...,"[Even, my, brother, is, not, like, to, speak, ...","[even, brother, like, speak, treat, like, aids...","[even, brother, like, speak, treat, like, aid,..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL]","[date, sunday]","[date, sunday]"


Lemmatizing works like stemming, but it considers the meaning of the word, the surrounding words, the part of speech and some other factors. It is more accurate than stemming, but also slower.

In [12]:
#We will use the WordNet lemmatizer.
#There are some other lemmatizers, e.g. TextBlob, Stanford CoreNLP (requires Java installed)

lem_wordnet = nltk.WordNetLemmatizer()

def lemmitization(text):
    text_lemmitized = [lem_wordnet.lemmatize(i) for i in text]
    return text_lemmitized

data['lemmatiezed'] = data['no_stopword'].apply(lemmitization)
data.head()

Unnamed: 0,labels,email_text,no_punc,tokenized,no_stopword,stemmed,lemmatiezed
0,ham,I've been searching for the right words to tha...,Ive been searching for the right words to than...,"[Ive, been, searching, for, the, right, words,...","[ive, searching, right, words, thank, breather...","[ive, search, right, word, thank, breather, pr...","[ive, searching, right, word, thank, breather,..."
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[Free, entry, in, 2, a, wkly, comp, to, win, F...","[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entri, 2, wkli, comp, win, fa, cup, fin...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
2,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[Nah, I, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, goe, usf, live, around, tho...","[nah, dont, think, go, usf, life, around, though]"
3,ham,Even my brother is not like to speak with me. ...,Even my brother is not like to speak with me T...,"[Even, my, brother, is, not, like, to, speak, ...","[even, brother, like, speak, treat, like, aids...","[even, brother, like, speak, treat, like, aid,...","[even, brother, like, speak, treat, like, aid,..."
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[I, HAVE, A, DATE, ON, SUNDAY, WITH, WILL]","[date, sunday]","[date, sunday]","[date, sunday]"


Vectorization is the process of converting the processed text data into numeric features.
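
As a rough sketch of what this means, the snippet below builds a count matrix by hand for three made-up, already-tokenized messages (the data and the `count_vector` helper are assumptions for illustration). Each column is one vocabulary word and each row is one message.

```python
from collections import Counter

# Hypothetical, already-tokenized messages
docs = [
    ["free", "entry", "win", "free"],
    ["date", "sunday"],
    ["win", "date"],
]

# The vocabulary is the sorted set of all tokens; each token becomes a column
vocabulary = sorted({token for doc in docs for token in doc})

def count_vector(tokens):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

matrix = [count_vector(doc) for doc in docs]

print(vocabulary)  # ['date', 'entry', 'free', 'sunday', 'win']
for row in matrix:
    print(row)
```

Most entries are zero because each message uses only a few words of the vocabulary.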

In [13]:
#Let's use a simple method, which is Count vectorizing
#It works by counting the occurrences of each word

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(analyzer=lemmitization)
X_count = count_vect.fit_transform(data['no_stopword'])
X_count_df = pd.DataFrame(X_count.toarray())
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_count_df.columns = count_vect.get_feature_names_out()
print(X_count.shape)
X_count_df.head()

(5568, 8914)


Unnamed: 0,Unnamed: 1,0,008704050406,0089my,0121,01223585236,01223585334,0125698789,02,020603,...,zindgi,zoe,zogtorius,zoom,zouk,zyada,é,ü,üll,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


N-gram vectorization gives the flexibility to group one or more consecutive words into a single token and then count its occurrences to create the feature vector.
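
A minimal sketch of how such word groups are formed, using a made-up token list and a hypothetical `ngrams` helper:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical cleaned message
tokens = "free entry win cash".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['free', 'entry', 'win', 'cash']
print(bigrams)   # ['free entry', 'entry win', 'win cash']
```

Every distinct pair becomes its own feature, so the same word contributes to several different columns depending on its neighbour.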

To use N-gram vectorization we need to join the words back into a single string. So we have to write the cleaning process again. A compact version of the data cleaning has been implemented here.

In [14]:
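#Note: header=None is not passed here, so pandas uses the first message as the header row
#and this frame has one row fewer than the one loaded earlier
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'email_text']

def text_preprocessing(text):
    #lower-case the text and drop the punctuation characters
    text_no_punc = "".join([i.lower() for i in text if i.lower() not in st.punctuation])
    text_tokenized = re.split(r'\W+', text_no_punc)
    #lemmatize the non-stopword tokens and join them back into one string
    text_lemmetized = " ".join([lem_wordnet.lemmatize(i) for i in text_tokenized if i not in stopword])
    return text_lemmetized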

data['preprocessed_text'] = data['email_text'].apply(text_preprocessing)
data.head()

Unnamed: 0,label,email_text,preprocessed_text
0,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...
1,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think go usf life around though
2,ham,Even my brother is not like to speak with me. ...,even brother like speak treat like aid patent
3,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,date sunday
4,ham,As per your request 'Melle Melle (Oru Minnamin...,per request melle melle oru minnaminunginte nu...


In [15]:
#We will use an ngram_range of (2,2), which means the vectorizer will only create features
#for combinations of exactly two words

ngram_vect = CountVectorizer(ngram_range = (2,2))
X_count_ngram = ngram_vect.fit_transform(data['preprocessed_text'])
print(X_count_ngram.shape)

(5567, 31621)


In [None]:
#So the number of columns (features) is much higher than with the Count vectorizer.
#This time, each combination of a word with a different neighbouring word is counted separately.

The TF-IDF vectorizer uses a weight for each word instead of its number of occurrences. The rarer a word is across the emails, the higher its weight.
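
The weighting can be sketched by hand on three made-up messages. The smoothed IDF formula and the L2 normalisation below follow what the scikit-learn documentation describes as its defaults; treat that, and the helper names `idf` and `tfidf`, as assumptions of this sketch.

```python
import math

# Hypothetical, already-tokenized messages; "free" appears in all of them
docs = [
    ["free", "win", "cash"],
    ["free", "date"],
    ["free", "sunday", "date"],
]

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf(doc):
    """Term count times IDF, scaled so the vector has unit L2 length."""
    weights = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(docs[0])
for term in docs[0]:
    print(term, round(weights[term], 3))
```

In the first message, "win" and "cash" receive a higher weight than "free", because "free" occurs in every message.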

This vectorizer works much like the Count Vectorizer and, with a custom analyzer, does not need the words joined back into a sentence. So we go back to our old data cleaning method, which returns a list of tokens.

In [16]:
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'email_text']

def text_preprocessing(text):
    text_no_punc = "".join([i.lower() for i in text if i.lower() not in st.punctuation])
    text_tokenized = re.split(r'\W+', text_no_punc)
    text_lemmetized = [lem_wordnet.lemmatize(i) for i in text_tokenized if i not in stopword]
    return text_lemmetized


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
stopword = nltk.corpus.stopwords.words('english')
lem_wordnet = nltk.WordNetLemmatizer()
tfidf_vectorized = TfidfVectorizer(analyzer=text_preprocessing)
X_tfidf_vectorized = tfidf_vectorized.fit_transform(data['email_text'])
X_tfidf_vectorized.shape

(5567, 8911)

Creating features 

We can also create new features to try to improve the model. This is not required, and the new features will not necessarily turn out to be useful. Two features will be tried here.

In [18]:
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'email_text']

#Feature 01: Number of non-whitespace characters in the email
data['body_length'] = data['email_text'].apply(lambda x : len(x) - x.count(' '))

#Feature 02: Percentage of punctuation characters in each email
def punc_percentage(text):
    count = sum([1 for i in text if i in st.punctuation]) 
    return round((count/(len(text) - text.count(' '))*100),2)
data['punc_perc'] = data['email_text'].apply(punc_percentage)
data.head()

Unnamed: 0,label,email_text,body_length,punc_perc
0,spam,Free entry in 2 a wkly comp to win FA Cup fina...,128,4.69
1,ham,"Nah I don't think he goes to usf, he lives aro...",49,4.08
2,ham,Even my brother is not like to speak with me. ...,62,3.23
3,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,28,7.14
4,ham,As per your request 'Melle Melle (Oru Minnamin...,135,4.44
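
As a quick check of the two features, the values for row 3 of the table above can be recomputed by hand. The sketch restates the two calculations as standalone functions (the `body_length` function name is an assumption) so that it runs without the dataset.

```python
import string

def body_length(text):
    """Number of characters that are not spaces."""
    return len(text) - text.count(" ")

def punc_percentage(text):
    """Share of punctuation among the non-space characters, in percent."""
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(count / body_length(text) * 100, 2)

message = "I HAVE A DATE ON SUNDAY WITH WILL!!"

length = body_length(message)
percentage = punc_percentage(message)
print(length, percentage)  # 28 7.14
```

The message has 28 non-space characters, 2 of which are punctuation, which gives 2 / 28 = 7.14%.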


In [19]:
#Now let's combine the new features with the feature vectors calculated from TFIDF vectorization

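X_features = pd.concat([data['body_length'], data['punc_perc'], 
                        pd.DataFrame(X_tfidf_vectorized.toarray())], axis=1)
#Recent scikit-learn versions require all column names to be strings
X_features.columns = X_features.columns.astype(str)
X_features.head()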

Unnamed: 0,body_length,punc_perc,0,1,2,3,4,5,6,7,...,8901,8902,8903,8904,8905,8906,8907,8908,8909,8910
0,128,4.69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,49,4.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,62,3.23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,28,7.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,135,4.44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we will implement a simple Random Forest classifier. It builds many decision trees and combines their predictions by voting, which makes it less prone to overfitting. We do not know the best hyperparameters in advance, so we will use grid search to find the best parameter combination.

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [18]:
rf = RandomForestClassifier()
parameter_combos = {'n_estimators': [10, 100, 200, 300],
                   'max_depth': [30, 60, None]}
gs = GridSearchCV(rf, parameter_combos, cv = 5, n_jobs=-1)
gs_fitted = gs.fit(X_features, data['label'])

pd.DataFrame(gs_fitted.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
10,55.778557,0.084044,0.645684,0.060309,,200,"{'max_depth': None, 'n_estimators': 200}",0.978456,0.975763,0.973944,0.96496,0.975741,0.973773,0.004636,1
11,69.591669,7.549986,0.513887,0.136619,,300,"{'max_depth': None, 'n_estimators': 300}",0.978456,0.978456,0.973944,0.966757,0.969452,0.973413,0.004715,2
6,40.26867,1.582302,0.635856,0.059975,60.0,200,"{'max_depth': 60, 'n_estimators': 200}",0.980251,0.97307,0.973944,0.964061,0.97035,0.972335,0.005257,3
4,3.985775,0.345032,0.340631,0.048647,60.0,10,"{'max_depth': 60, 'n_estimators': 10}",0.970377,0.972172,0.973046,0.97035,0.974843,0.972158,0.001699,4
7,58.420125,1.564029,0.680125,0.042289,60.0,300,"{'max_depth': 60, 'n_estimators': 300}",0.976661,0.975763,0.972147,0.96496,0.971249,0.972156,0.004145,5
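
The fit times above add up quickly because grid search trains one model per parameter combination and per fold. A minimal sketch of the enumeration, using the same grid and fold count as the cell above (the variable names `candidates` and `cv_folds` are assumptions):

```python
from itertools import product

# Same grid and number of folds as in the grid search above
parameter_combos = {"n_estimators": [10, 100, 200, 300],
                    "max_depth": [30, 60, None]}
cv_folds = 5

# Every combination of one value per parameter is a candidate model
names = list(parameter_combos)
candidates = [dict(zip(names, values))
              for values in product(*parameter_combos.values())]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds
print(n_candidates, n_cv_fits)  # 12 60
```

So this grid needs 60 cross-validation fits. By default, GridSearchCV also refits the best candidate once more on the full data.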


In [21]:
#From the grid search we can see that n_estimators = 200 with no limit on max_depth performs best
#Now first split the dataset into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)

In [25]:
#Let's create a RF model and evaluate the performance
from sklearn.metrics import precision_recall_fscore_support as scores

rf = RandomForestClassifier(n_estimators=200, max_depth=None, n_jobs=-1)
rf_model = rf.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
precision, recall, fscore, support = scores(y_test, y_pred, pos_label='spam', average='binary')
print('Est: {} / Depth: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    200, 'None', round(precision, 3), round(recall, 3),
    round((y_pred==y_test).sum() / len(y_pred), 3)))

Est: 200 / Depth: None ---- Precision: 1.0 / Recall: 0.851 / Accuracy: 0.98


Precision = 1 means that every email the classifier labelled as "spam" was actually spam.

Recall = 0.85 means that the classifier caught about 85% of the emails that were actually spam.

Accuracy = 0.98 means that the classifier was correct for 98% of all the emails in the test set.
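
The three metrics can be computed directly from the counts of correct and wrong predictions. The counts below are invented round numbers for illustration only; they are not the confusion matrix of the model above.

```python
# Hypothetical counts, with "spam" as the positive class
tp = 85   # spam correctly labelled as spam
fp = 0    # ham wrongly labelled as spam
fn = 15   # spam wrongly labelled as ham
tn = 900  # ham correctly labelled as ham

precision = tp / (tp + fp)                   # how many predicted spam are really spam
recall = tp / (tp + fn)                      # how many real spam were found
accuracy = (tp + tn) / (tp + fp + fn + tn)   # how many predictions are correct overall

print(precision, recall, accuracy)  # 1.0 0.85 0.985
```

Because ham is far more common than spam, accuracy can stay high even when a noticeable share of the spam is missed, which is why recall is reported separately.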

Now we will try Gradient Boosting. Like Random Forest, it is an ensemble of decision trees. But each tree learns from the errors of the previous ones, so the trees cannot be built in parallel, which makes training slower. It is also more prone to overfitting.
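
The sequential nature of boosting can be seen in a toy sketch. This is a simplified regression version with squared error, one-split "stumps" and a single made-up feature; it is not how GradientBoostingClassifier is implemented, and the names `fit_stump`, `boost` and `mse` are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_stages=20, learning_rate=0.1):
    """Each stage fits a stump to the errors left by the previous stages."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)  # needs the current predictions
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return base, predictions

def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

# Hypothetical 1-D data: the target jumps from 0 to 1 after x = 3
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

base, predictions = boost(xs, ys)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, predictions)
print(baseline_mse, boosted_mse)
```

Each stage depends on the predictions of all earlier stages, which is why the loop cannot be run in parallel the way the independent trees of a Random Forest can.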

In [22]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score


gb = GradientBoostingClassifier()
param = {
    'n_estimators': [150], 
    'max_depth': [15],
    'learning_rate': [0.1]
}

clf = GridSearchCV(gb, param, cv=5, n_jobs=-1)
cv_fit = clf.fit(X_features, data['label'])
pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_learning_rate,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,347.168293,3.505811,0.213228,0.024696,0.1,15,150,"{'learning_rate': 0.1, 'max_depth': 15, 'n_est...",0.968582,0.976661,0.97035,0.966757,0.972147,0.970899,0.003394,1
