# Spam Classifier using NLP and Multinomial Naive Bayes

Background:

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

This is a Kaggle problem which can be found [here](https://www.kaggle.com/uciml/sms-spam-collection-dataset).

In this exercise we first clean the data so that the model can work with it, and then fit a Multinomial Naive Bayes model on top of it.

This notebook walks through importing nltk and applying stemming and lemmatization to the data before fitting the model.

For quicker access, I have placed the dataset in my [Github repository](https://github.com/asheshds/datascience)



First, we import pandas, which is required to read the dataset.

In [6]:
import pandas as pd

In [7]:
df = pd.read_csv('spam.csv', encoding='latin-1')

Next, we check the head of the dataset to see its columns. It contains three redundant columns (Unnamed: 2, Unnamed: 3, Unnamed: 4) which will not add any value to our modelling exercise.

In [8]:
df.head()

|   | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


We drop the three irrelevant columns so that only the label (v1) and the message text (v2) remain for modelling.

In [9]:
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis = 1)
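An alternative to dropping the columns afterwards is to read only the two useful columns in the first place. The sketch below assumes a file laid out like spam.csv and uses a small in-memory sample instead of the real file; the names `sample_df`, `label` and `message` are illustrative only, and the rest of this notebook keeps the original `v1`/`v2` names.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the column layout of spam.csv
sample_csv = io.StringIO(
    "v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    "ham,Ok lar... Joking wif u oni...,,,\n"
    "spam,Free entry in 2 a wkly comp,,,\n"
)

# Read only the label and message columns, then give them readable names
sample_df = pd.read_csv(sample_csv, usecols=["v1", "v2"])
sample_df = sample_df.rename(columns={"v1": "label", "v2": "message"})

print(sample_df.columns.tolist())
```

With the real file, `sample_csv` would be replaced by `'spam.csv'` together with `encoding='latin-1'`, as in the cell that loads the data.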

In [10]:
df.head()

|   | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


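After removing the irrelevant columns, we import the nltk library and download its data, which we will use in the next steps of data processing. The ssl lines switch off certificate verification for the download, a common workaround when the download fails with a certificate error.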

In [1]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
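Calling `nltk.download()` without arguments opens the interactive downloader. A non-interactive variant could look like the sketch below. It assumes that this notebook only needs the `stopwords` and `wordnet` packages, which is inferred from the imports used later and not stated by the author; some NLTK versions may also ask for `omw-1.4` when lemmatizing. `REQUIRED_RESOURCES` and `download_nltk_resources` are hypothetical names.

```python
import nltk

# Packages this notebook appears to rely on (an assumption, see the note above)
REQUIRED_RESOURCES = ["stopwords", "wordnet"]

def download_nltk_resources(resources=REQUIRED_RESOURCES):
    """Download each named NLTK package and return the names that failed."""
    failed = []
    for name in resources:
        # nltk.download returns True on success and False otherwise
        if not nltk.download(name, quiet=True):
            failed.append(name)
    return failed
```

Running `download_nltk_resources()` needs network access; an empty list as the return value means every package was fetched or was already present.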

Once the nltk data is downloaded, we import a few more libraries required for the data manipulation and create a PorterStemmer instance.

In [11]:
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [12]:
entire_corpus = []

The step below is very important for building the spam classifier. For each message, we replace every non-letter character with a space, convert the text to lower case so that variants of the same word look identical to the model, split it into words, drop English stop words, and stem the remaining words. The cleaned message is then appended to entire_corpus. A for loop runs through every message in the dataset.

In [45]:
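# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

for i in range(len(df)):
    # Keep letters only, then lower-case and split into words
    msg = re.sub('[^a-zA-Z]', ' ', df['v2'][i])
    msg = msg.lower().split()

    # Drop stop words and stem the remaining words
    msg = [ps.stem(word) for word in msg if word not in stop_words]
    entire_corpus.append(' '.join(msg))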
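The same cleaning steps can be wrapped in a function and tried on a single message, which makes them easier to inspect. This is a minimal sketch: `clean_message` and `SAMPLE_STOP_WORDS` are hypothetical names, and the tiny stop-word set stands in for the full `stopwords.words('english')` list so that the example runs without downloading any nltk data.

```python
import re
from nltk.stem.porter import PorterStemmer

# Tiny illustrative stop-word set (an assumption for this sketch);
# the notebook uses the full list from stopwords.words('english')
SAMPLE_STOP_WORDS = {"i", "he", "to", "so", "then"}

stemmer = PorterStemmer()

def clean_message(text):
    """Keep letters only, lower-case, drop stop words and stem the rest."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    tokens = letters_only.lower().split()
    kept = [stemmer.stem(tok) for tok in tokens if tok not in SAMPLE_STOP_WORDS]
    return ' '.join(kept)

print(clean_message("Ok lar... Joking wif u oni..."))
```

For the second message of the dataset this gives `ok lar joke wif u oni`, which matches the second entry of the stemmed corpus.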

Once the above step is done, entire_corpus holds every message in lower case, with stop words removed and the remaining words stemmed.

After running the model on this data, we will run it again on the same data, this time using lemmatization instead of stemming. That way we can compare the accuracies and see which process (stemming or lemmatization) suits our use case better.

In [14]:
entire_corpus

['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
 'u dun say earli hor u c alreadi say',
 'nah think goe usf live around though',
 'freemsg hey darl week word back like fun still tb ok xxx std chg send rcv',
 'even brother like speak treat like aid patent',
 'per request mell mell oru minnaminungint nurungu vettam set callertun caller press copi friend callertun',
 'winner valu network custom select receivea prize reward claim call claim code kl valid hour',
 'mobil month u r entitl updat latest colour mobil camera free call mobil updat co free',
 'gonna home soon want talk stuff anymor tonight k cri enough today',
 'six chanc win cash pound txt csh send cost p day day tsandc appli repli hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw',
 'search right word thank breather

With the corpus ready, we run CountVectorizer on it to turn the messages into a matrix of word counts: one row per message and one column per word, limited to the 2,500 most frequent words (max_features=2500). This representation is known as the 'Bag of Words' model in Natural Language Processing.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_features=2500)
X = count_vectorizer.fit_transform(entire_corpus).toarray()
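A two-message toy corpus shows what the resulting matrix looks like. The messages and the `toy_` names are made up for illustration. By default the entries are word counts, so a word that occurs twice in a message is recorded as 2.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-cleaned messages
toy_corpus = ["free prize free call", "call home soon"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index (columns are sorted alphabetically)
toy_columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_columns)
print(toy_matrix)
```

The columns are `call, free, home, prize, soon`, and the first row is `[1, 2, 0, 1, 0]` because "free" appears twice in the first message.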

With the Bag of Words in X, we generate dummy variables for the ham/spam label. get_dummies produces one column per class, and we keep only the second column (spam) as our 'y', the dependent variable, so that 1 means spam and 0 means ham.

In [16]:
y=pd.get_dummies(df['v1'])
y=y.iloc[:,1].values

In [18]:
y

array([0, 0, 1, ..., 0, 0, 0], dtype=uint8)
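The encoding can be checked on a few made-up labels. The sketch below compares the get_dummies route with a more explicit comparison against the string 'spam'; `toy_labels` and the other names are illustrative. In recent pandas versions get_dummies returns booleans by default, so the cast to int is included to obtain 0/1 values either way.

```python
import pandas as pd

toy_labels = pd.Series(["ham", "ham", "spam"])  # hypothetical labels

# get_dummies orders its columns alphabetically: 'ham' first, then 'spam'
dummies = pd.get_dummies(toy_labels)
y_from_dummies = dummies.iloc[:, 1].astype(int).values

# Equivalent and more explicit: 1 for spam, 0 for ham
y_direct = (toy_labels == "spam").astype(int).values

print(dummies.columns.tolist())
print(y_from_dummies, y_direct)
```

Both routes give the same array here; the explicit comparison does not depend on the column order.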

Once the y variable is ready, we perform a train_test_split on the data with a test size of 20%.

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
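Because spam is the minority class, it can be useful to keep the ham/spam ratio the same in both parts of the split. The sketch below shows the `stratify` argument of train_test_split on made-up data with 80 ham and 20 spam rows. The notebook itself does not stratify, so adding this option there would change the reported numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 ham (0) and 20 spam (1)
toy_X = np.arange(100).reshape(100, 1)
toy_y = np.array([0] * 80 + [1] * 20)

# stratify=toy_y keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=0, stratify=toy_y
)

print(len(y_te), int(y_te.sum()))
```

The 20-row test set contains 4 spam rows, the same 20% share as the full toy data.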

Next, we create an instance of Multinomial Naive Bayes model for the ham/spam classification and fit the model on our training dataset

In [20]:
from sklearn.naive_bayes import MultinomialNB
spam_classifier = MultinomialNB().fit(X_train, y_train)
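To see what the model learns, here is a minimal sketch on a made-up two-word vocabulary. Multinomial Naive Bayes estimates, for each class, how likely each word is, with add-one (Laplace) smoothing by default, and then picks the class under which the counts of a message are most probable. The data and the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Columns: counts of the hypothetical words "free" and "home"
toy_counts = np.array([[2, 0], [3, 0], [0, 2], [0, 3]])
toy_targets = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

toy_nb = MultinomialNB()  # alpha=1.0 by default (Laplace smoothing)
toy_nb.fit(toy_counts, toy_targets)

# exp(feature_log_prob_) is P(word | class); rows follow toy_nb.classes_
word_probs = np.exp(toy_nb.feature_log_prob_)
toy_pred = toy_nb.predict(np.array([[1, 0], [0, 1]]))

print(word_probs)
print(toy_pred)
```

With smoothing, "free" gets probability 6/7 under spam and 1/7 under ham, so a message containing only "free" is labelled spam and one containing only "home" is labelled ham.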

Once the model is fit on the training dataset, we predict the labels for our test dataset.

In [21]:
y_pred=spam_classifier.predict(X_test)
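The same pattern applies to messages that were not part of the dataset: they have to go through the already fitted vectorizer with `transform`, not `fit_transform`, so that the columns match those seen during training. A minimal sketch with made-up, already-cleaned messages (all names here are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, already-cleaned training messages (1 = spam, 0 = ham)
train_msgs = ["free prize call now", "win free cash prize",
              "are you home", "see you at home soon"]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_msgs), train_labels)

# transform reuses the vocabulary learned from the training messages
new_msgs = ["free cash prize", "you home soon"]
new_pred = clf.predict(vec.transform(new_msgs))
print(new_pred)
```

In the notebook, the equivalent would be to clean a new message the same way as the corpus and pass it through `count_vectorizer.transform` before calling `spam_classifier.predict`.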

Now that we have our predictions, we print the confusion matrix and the accuracy score to understand how well the model performed.

In [22]:
from sklearn.metrics import confusion_matrix

In [23]:
print(confusion_matrix(y_test, y_pred))

[[943   6]
 [  9 157]]


In [24]:
from sklearn.metrics import accuracy_score

In [25]:
print(accuracy_score(y_test, y_pred))

0.9865470852017937
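Accuracy alone can hide how the errors are distributed, so it helps to read precision and recall for the spam class off the confusion matrix. The sketch below uses the matrix printed above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, giving `[[TN, FP], [FN, TP]]` for labels 0 and 1.

```python
import numpy as np

# Confusion matrix from the stemming run: rows = true, columns = predicted
cm = np.array([[943, 6],
               [9, 157]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all messages classified correctly
precision = tp / (tp + fp)        # share of predicted spam that really is spam
recall = tp / (tp + fn)           # share of real spam that was caught

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

This reproduces the accuracy score (1100 correct out of 1115) and gives a spam precision of 157/163 and a spam recall of 157/166, roughly 0.963 and 0.946.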


The model performed really well, with an accuracy of about 98.65%!

Our next step is to use lemmatization instead of stemming in the data preprocessing, and then compare the accuracy of the two methods.

In [26]:
from nltk.stem import WordNetLemmatizer

In [27]:
lemmatizer = WordNetLemmatizer()

In [31]:
entire_corpus = []

In [32]:
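# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

for i in range(len(df)):
    # Keep letters only, then lower-case and split into words
    msg = re.sub('[^a-zA-Z]', ' ', df['v2'][i])
    msg = msg.lower().split()

    # Drop stop words and lemmatize the remaining words
    msg = [lemmatizer.lemmatize(word) for word in msg if word not in stop_words]
    entire_corpus.append(' '.join(msg))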

In [33]:
entire_corpus

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry wkly comp win fa cup final tkts st may text fa receive entry question std txt rate c apply',
 'u dun say early hor u c already say',
 'nah think go usf life around though',
 'freemsg hey darling week word back like fun still tb ok xxx std chgs send rcv',
 'even brother like speak treat like aid patent',
 'per request melle melle oru minnaminunginte nurungu vettam set callertune caller press copy friend callertune',
 'winner valued network customer selected receivea prize reward claim call claim code kl valid hour',
 'mobile month u r entitled update latest colour mobile camera free call mobile update co free',
 'gonna home soon want talk stuff anymore tonight k cried enough today',
 'six chance win cash pound txt csh send cost p day day tsandcs apply reply hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw'

In [34]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_features=2500)
X = count_vectorizer.fit_transform(entire_corpus).toarray()

In [35]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [36]:
y=pd.get_dummies(df['v1'])
y=y.iloc[:,1].values

In [37]:
y

array([0, 0, 1, ..., 0, 0, 0], dtype=uint8)

In [38]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [39]:
from sklearn.naive_bayes import MultinomialNB
spam_classifier = MultinomialNB().fit(X_train, y_train)

In [40]:
y_pred=spam_classifier.predict(X_test)

In [41]:
from sklearn.metrics import confusion_matrix

In [42]:
print(confusion_matrix(y_test, y_pred))

[[941   8]
 [  9 157]]


In [43]:
from sklearn.metrics import accuracy_score

In [44]:
print(accuracy_score(y_test, y_pred))

0.9847533632286996


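The model also performed well with lemmatization instead of stemming, reaching an accuracy of about 98.48%.

However, the accuracy with stemming was still marginally better at about 98.65%. The gap comes down to two more ham messages being flagged as spam (8 versus 6 in the confusion matrices), while the number of missed spam messages is the same (9) in both runs.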
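A difference of two messages on a single train/test split is small, so a more stable comparison could average the accuracy over several folds. The sketch below is one way to do that with a pipeline and cross_val_score. `mean_cv_accuracy` is a hypothetical helper and the toy messages are made up only to show the call; in the notebook it would be called once with the stemmed corpus and once with the lemmatized corpus, each time together with y.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def mean_cv_accuracy(corpus, labels, folds=5):
    """Average accuracy of CountVectorizer + MultinomialNB over `folds` folds."""
    pipeline = make_pipeline(CountVectorizer(max_features=2500), MultinomialNB())
    scores = cross_val_score(pipeline, corpus, labels, cv=folds, scoring="accuracy")
    return scores.mean()

# Hypothetical toy data (1 = spam, 0 = ham), only to demonstrate the call
toy_corpus = [
    "free prize call now", "win free cash", "free entry win prize", "claim cash prize now",
    "see you at home", "are you home soon", "call me when home", "see you soon",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

score = mean_cv_accuracy(toy_corpus, toy_labels, folds=2)
print(score)
```

Putting the vectorizer inside the pipeline also means the vocabulary is learned from the training folds only, which keeps the held-out messages out of the feature-building step.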


Thus, we have built a spam classifier model using Multinomial Naive Bayes and the basic concepts of Natural Language Processing!