# 04-Spam-Classifier

It's time to make our first real Machine Learning application of NLP: a spam classifier!

A spam classifier is a Machine Learning model that classifies texts (emails or SMS messages) into two categories: spam (1) or legitimate (0).

To do that, we will reuse what we already know: we will apply preprocessing and BOW (Bag Of Words) to a dataset of texts.
Then we will use a classifier to predict which class a new email/SMS belongs to, based on its BOW.
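As a reminder of what a BOW is, here is a minimal hand-rolled sketch using only the standard library. The two messages are made-up examples, and `bag_of_words` is a hypothetical helper for illustration, not the `CountVectorizer` used later in this notebook.

```python
from collections import Counter

# Two toy messages standing in for the real dataset (made-up examples)
messages = ["free entry win free prize", "ok see you at home"]

# Vocabulary: every distinct word, sorted so the column order is reproducible
vocabulary = sorted({word for msg in messages for word in msg.split()})

def bag_of_words(message, vocabulary):
    """Return the count of each vocabulary word in the message."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]

# One row of counts per message, one column per vocabulary word
bow = [bag_of_words(msg, vocabulary) for msg in messages]
print(vocabulary)
print(bow)
```

Each message becomes a fixed-length vector of word counts, which is the kind of input a classifier can work with.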

First things first: import the needed libraries.

In [102]:
# Import NLTK and all the needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Now load the dataset in *spam.csv* using pandas. Use the 'latin-1' encoding as a loading option.

In [103]:
# TODO: Load the dataset 
df = pd.read_csv('spam.csv', encoding='latin-1')
df.head(5)

Unnamed: 0,Class,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


As usual, I suggest you explore this dataset a bit.

In [104]:
# TODO: explore the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5572 non-null   object
 1   Message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


As you can see, we have one column containing the labels and one column containing the text to classify.
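Beyond `df.info()`, two quick checks are often worth a look: how many messages each class has, and how long the messages are. The sketch below runs on a tiny made-up DataFrame (`sample`, with the same `Class` and `Message` columns) so it works without *spam.csv*; the same calls should apply to `df`.

```python
import pandas as pd

# Tiny stand-in for spam.csv with the same two columns (made-up rows)
sample = pd.DataFrame({
    "Class": ["ham", "ham", "spam", "ham"],
    "Message": ["Ok lar", "See you at home",
                "Free entry to win a prize now", "Call me later"],
})

# Number of messages per class
class_counts = sample["Class"].value_counts()
print(class_counts)

# Average message length (in characters) per class
sample["Length"] = sample["Message"].str.len()
mean_length = sample.groupby("Class")["Length"].mean()
print(mean_length)
```

Class balance matters when reading accuracy later on: in the test split used further down, 1430 messages are ham and only 242 are spam.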

We will begin by doing the usual preprocessing: tokenization, punctuation removal and lemmatization.

In [105]:
# TODO: Perform preprocessing over all the text
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def preprocess(message):
    # Tokenize the raw message
    tokens = word_tokenize(message)
    # Remove punctuation: keep only purely alphanumeric tokens
    tokens = [t for t in tokens if t.isalnum()]
    # Lemmatize each token (WordNet's default part of speech is noun)
    tokens = [wnl.lemmatize(t) for t in tokens]
    # Join the tokens back into one string for CountVectorizer
    return ' '.join(tokens)

# Build a new list instead of overwriting df.Message in place
prepro_message = [preprocess(message) for message in df.Message]
prepro_message

['Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat',
 'Ok lar Joking wif u oni',
 'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry question std txt rate T C apply 08452810075over18',
 'U dun say so early hor U c already then say',
 'Nah I do think he go to usf he life around here though',
 'FreeMsg Hey there darling it been 3 week now and no word back I like some fun you up for it still Tb ok XxX std chgs to send to rcv',
 'Even my brother is not like to speak with me They treat me like aid patent',
 'As per your request Melle Oru Minnaminunginte Nurungu Vettam ha been set a your callertune for all Callers Press 9 to copy your friend Callertune',
 'WINNER As a valued network customer you have been selected to receivea prize reward To claim call 09061701461 Claim code KL341 Valid 12 hour only',
 'Had your mobile 11 month or more U R entitled to Update to the latest colour mobile wi
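One side effect is visible in the output above: "Nah I don't think" became "Nah I do think". The sketch below shows why, assuming `word_tokenize` splits "don't" into "do" and "n't" (the token list is hard-coded here so the example runs without NLTK data).

```python
# Tokens as word_tokenize is assumed to produce them for this message
tokens = ["Nah", "I", "do", "n't", "think", "he", "goes", "to", "usf", ","]

# Same filter as in the preprocessing cell: keep purely alphanumeric tokens
kept = [t for t in tokens if t.isalnum()]
dropped = [t for t in tokens if not t.isalnum()]

print(kept)
print(dropped)
```

The filter removes the comma as intended, but it also removes "n't", so the negation is lost. Whether this matters for spam detection is worth checking. Similarly, "lives" became "life" in the output because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed.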

OK, now we have our preprocessed data. The next step is to compute a BOW.

In [106]:
# TODO: compute the BOW
BOW = []
vectorizer = CountVectorizer(max_features=10000, stop_words='english')
BOW.append(vectorizer.fit_transform(prepro_message).toarray())
print(BOW)

[array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])]
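The printed array is almost entirely zeros. `fit_transform` actually returns a SciPy sparse matrix that stores only the non-zero counts, and `.toarray()` expands it into a full dense array. A small sketch on two made-up documents (`docs` and `vectorizer_demo` are illustrative names) shows the difference:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents standing in for the preprocessed messages
docs = ["free entry win free prize", "ok see you at home"]

vectorizer_demo = CountVectorizer()
sparse_bow = vectorizer_demo.fit_transform(docs)  # SciPy sparse matrix

print(sparse_bow.shape)  # (number of documents, vocabulary size)
print(sparse_bow.nnz)    # number of stored non-zero entries

# Dense version: every cell is stored, zeros included
dense_bow = sparse_bow.toarray()
print(dense_bow.size)
```

With 5572 messages and up to 10000 features, the dense array holds up to 55.72 million integers. That is fine for this exercise, but scikit-learn classifiers such as `LogisticRegression` and `MultinomialNB` also accept the sparse matrix directly.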


Then, as usual, make a new dataframe to get a visual idea of the words used and their frequencies.

In [153]:
# TODO: Make a new dataframe with the BOW
# Vocabulary learned by the vectorizer (get_feature_names() was removed in scikit-learn 1.2)
tokens = vectorizer.get_feature_names_out()
BOW_matrix = pd.DataFrame(data=BOW[0], columns=tokens)
BOW_matrix

Unnamed: 0,008704050406,0089,0121,01223585236,01223585334,0125698789,02,0207,02072069400,02073162414,...,zebra,zed,zero,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's check which word is used most in the spam category and in the non-spam category.

There are two steps. First, add the class to the BOW dataframe. Second, filter on a class, sum all the values and print the most frequent one.

In [154]:
BOW_matrix.insert(0, "Class", df.Class)
BOW_matrix

Unnamed: 0,Class,008704050406,0089,0121,01223585236,01223585334,0125698789,02,0207,02072069400,...,zebra,zed,zero,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada
0,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,spam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,spam,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5569,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,ham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [155]:
BOW_matrix_group = BOW_matrix.groupby(['Class']).sum()
BOW_matrix_group

Class,008704050406,0089,0121,01223585236,01223585334,0125698789,02,0207,02072069400,02073162414,...,zebra,zed,zero,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada
ham,0,0,0,0,0,1,0,0,0,0,...,0,0,1,1,2,1,1,1,0,1
spam,2,1,1,1,2,0,1,2,1,2,...,1,6,0,0,0,1,0,0,1,0


In [156]:
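# TODO: print the most used word in the spam and non spam category
# idxmax(axis=1) returns, for each class, the column (word) with the highest count
most_used = BOW_matrix_group.idxmax(axis=1)
print(most_used)
print("The most used spam word is:", most_used["spam"])
print("The most used non_spam word is:", most_used["ham"])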

Class
ham       gt
spam    free
dtype: object
The most used spam word is: free
The most used non_spam word is: gt


You should find that the most frequent spam word is 'free'. Not so surprising, right?
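A single top word says little about a class. As a possible extension, the sketch below lists the top few words per class with `Series.nlargest`. The `counts` table is made up (invented numbers, same layout as `BOW_matrix_group`) so the example runs on its own.

```python
import pandas as pd

# Made-up per-class word counts with the same layout as BOW_matrix_group
counts = pd.DataFrame(
    {"free": [2, 40], "gt": [50, 0], "txt": [1, 30], "ok": [35, 2]},
    index=pd.Index(["ham", "spam"], name="Class"),
)

# For each class, keep the 2 most frequent words instead of only the top one
top_words = {
    cls: counts.loc[cls].nlargest(2).index.tolist()
    for cls in counts.index
}
print(top_words)
```

On the real data, replacing `counts` with `BOW_matrix_group` and raising the number of words should give a better picture of each class.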

Now we can make a classifier based on our BOW. We will use a simple logistic regression here for the example.

You're an expert, you know what to do, right? Split the data, train your model, predict and see the performance.

In [157]:
#Make spam target value binary instead of string
BOW_matrix['Class'] = BOW_matrix.Class.apply(lambda x: 1 if x == "spam" else 0)
BOW_matrix

        

Unnamed: 0,Class,008704050406,0089,0121,01223585236,01223585334,0125698789,02,0207,02072069400,...,zebra,zed,zero,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [167]:
# TODO: Perform a classification to predict whether a message is a spam or not
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Split features (every column except 'Class') and target into train and test sets
X_train, X_test, y_train, y_test = train_test_split(BOW_matrix.drop(columns='Class'), BOW_matrix.Class, test_size=0.3)
# Create model (using the default parameters)
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train,y_train)
y_pred=logreg.predict(X_test)
#Create confusion matrix with test scores
reg_test_matrix = metrics.confusion_matrix(y_test, y_pred)
reg_test_accuracy = metrics.accuracy_score(y_test, y_pred)
print("Matrix:\n", reg_test_matrix,"\n Test Accuracy for logistic regression:\n", reg_test_accuracy )

Matrix:
 [[1430    0]
 [  44  198]] 
 Test Accuracy for logistic regression:
 0.9736842105263158


What precision do you get? Check by hand some samples where it did not predict well, to see what could go wrong...
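Precision and recall can be read directly from the confusion matrix printed above. scikit-learn puts true classes on rows and predicted classes on columns, so with spam encoded as 1 the layout is `[[TN, FP], [FN, TP]]`. A minimal worked computation with those four numbers:

```python
# Values copied from the logistic regression confusion matrix above
tn, fp = 1430, 0
fn, tp = 44, 198

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)     # of the real spam messages, how many were caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

In this run no legitimate message was flagged as spam (precision 1.0), but 44 of the 242 spam messages slipped through, which accuracy alone hides because ham is much more frequent.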

Try to use other models and try to improve your results.

In [170]:
from sklearn.naive_bayes import MultinomialNB
#Naïve Bayes
#Create NB classifier
nb = MultinomialNB()
#Fit Data
nb.fit(X_train, y_train)
#Predict
y_pred = nb.predict(X_test)
#Create confusion matrix with test scores
nb_test_matrix = metrics.confusion_matrix(y_test, y_pred)
nb_test_accuracy = metrics.accuracy_score(y_test, y_pred)
print("Matrix:\n", nb_test_matrix,"\n Test Accuracy for Naive Bayes:\n", nb_test_accuracy )

Matrix:
 [[1407   23]
 [  23  219]] 
 Test Accuracy for Naive Bayes:
 0.972488038277512
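One thing to keep in mind when comparing models: in the cells above, `CountVectorizer` is fitted on all messages before `train_test_split`, so the vocabulary is built with the test messages too. A possible way to avoid this, sketched here on made-up training texts, is to put the vectorizer and the classifier in one scikit-learn pipeline so that both are fitted on the training data only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages: 1 = spam, 0 = legitimate
train_texts = [
    "free prize win now", "free entry claim prize", "win cash now",
    "see you at home", "ok call me later", "are you home yet",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer on the training texts only,
# then trains the classifier on the resulting counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# New messages are vectorized with the training vocabulary before prediction
predictions = model.predict(["claim your free prize", "see you later"])
print(predictions)
```

With the real data, the pipeline would be fitted on the raw (or preprocessed) training messages after the split, and swapping `MultinomialNB()` for another classifier is a one-line change.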
