# Spam SMS Detector

The **goal** of this project is to build a model that predicts whether an SMS is spam or not. To do this, it uses NLP techniques together with the *Naive Bayes* classification algorithm.

## Import the Libraries

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import string
from nltk.tokenize import word_tokenize

## Get the data

I'll be using a dataset from the [UCI datasets](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

The dataset contains more than 5,000 SMS messages, each labelled as either spam or ham (a normal, non-spam SMS).

In [2]:
messages = pd.read_csv('data/SMSSpamCollection.csv',sep='\t',names=['label','message'])
messages.head()

|  | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## EDA

In [3]:
# Display a single message
messages['message'][1085]

"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."

In [4]:
messages.describe()

|  | label | message |
|---|---|---|
| count | 5572 | 5572 |
| unique | 2 | 5169 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


In [5]:
messages.groupby('label').describe()

Statistics of the `message` column, grouped by `label`:

| label | count | unique | top | freq |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam | 747 | 653 | Please call our customer service representativ... | 4 |
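The counts above show that the two classes are imbalanced. As a quick check using only the numbers from that output, the share of ham messages is also the accuracy that a trivial classifier, one that always predicts ham, would reach on the full dataset. The variable names here are illustrative and not part of the original notebook.

```python
# Class counts copied from the groupby output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())  # number of messages in the dataset
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = max(shares.values())
print(f"majority-class baseline: {baseline_accuracy:.3f}")
```

Any model trained on this data should be judged against that baseline rather than against 50%.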


## Text Preprocessing

In [9]:
# remove punctuation, tokenize and remove stopwords
# Build the stopword set once; calling stopwords.words('english') for every token is slow.
# Needs the NLTK 'stopwords' corpus and tokenizer data (see nltk.download).
STOP_WORDS = set(stopwords.words('english'))

def text(message):
    """
    Takes in a string of text, then performs the following:
    1. Convert the string to lowercase
    2. Remove all punctuation
    3. Tokenize and remove all stopwords
    4. Return the cleaned tokens as a list
    """
    message = message.lower()
    msg_w_o_punc = ''.join(c for c in message if c not in string.punctuation)

    return [word for word in word_tokenize(msg_w_o_punc) if word not in STOP_WORDS]
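To see what this kind of cleaning produces without downloading any NLTK data, here is a simplified stand-in. It is a sketch under two loudly stated assumptions: `DEMO_STOP_WORDS` is a tiny hand-picked set replacing NLTK's English stopword list, and `str.split()` replaces `word_tokenize`. The name `clean_demo` is hypothetical and the output of the real function may differ.

```python
import string

# Assumption: a tiny hand-picked stopword set standing in for
# nltk's stopwords.words('english'), so the sketch runs without NLTK data
DEMO_STOP_WORDS = {"i", "a", "the", "to", "is", "in", "you", "have"}

def clean_demo(message):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    message = message.lower()
    no_punc = "".join(c for c in message if c not in string.punctuation)
    # Assumption: str.split() approximates word_tokenize once punctuation is gone
    return [word for word in no_punc.split() if word not in DEMO_STOP_WORDS]

# Two messages taken from the dataset outputs shown earlier
print(clean_demo("Ok lar... Joking wif u oni..."))
print(clean_demo("Sorry, I'll call later"))
```

Note that removing punctuation before filtering turns "I'll" into "ill", which is no longer matched by the stopword "i". The same ordering applies in the function above, so contractions survive as tokens.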

## Train-Test Split

In [10]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(messages['message'],messages['label'],test_size=0.2)
print(x_train.shape,y_train.shape,x_test.shape,y_test.shape)

(4457,) (4457,) (1115,) (1115,)
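The split above is random and unseeded, so the spam share in the test set can vary from run to run. With scikit-learn, the usual remedy is to pass `stratify=messages['label']` and a fixed `random_state` to `train_test_split`. The idea behind stratification can be sketched in plain Python as below; this is an illustration of the technique, not what the notebook ran, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly equal class shares in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same class counts as the dataset
labels = ["ham"] * 4825 + ["spam"] * 747
train_idx, test_idx = stratified_split(labels)
spam_share_test = sum(labels[i] == "spam" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(spam_share_test, 3))
```

Because each class is split separately, the rounding differs slightly from scikit-learn's, so the part sizes need not match the shapes printed above exactly.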


## Train the model

### Create a pipeline for the training process

In [11]:
from sklearn.pipeline import Pipeline
# import the classification model
from sklearn.naive_bayes import MultinomialNB

Let's create the pipeline, which will:
* Perform TF-IDF vectorization on the preprocessed text
* Train a Naive Bayes classifier on the resulting vectors
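To make the first step more concrete, the sketch below computes TF-IDF weights by hand for three made-up token lists. It assumes the weighting that scikit-learn's `TfidfVectorizer` uses with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document. The function name `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF for tokenized docs: raw counts times smoothed IDF, L2-normalised per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for doc in docs:
        weights = {term: tf * idf[term] for term, tf in Counter(doc).items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Made-up token lists, shaped like the output of the preprocessing function
docs = [["free", "entry", "win"], ["call", "later"], ["free", "call"]]
for vec in tfidf(docs):
    print({term: round(w, 3) for term, w in vec.items()})
```

Terms that occur in fewer documents, such as "entry" here, end up with a higher weight than terms shared across documents, such as "free".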

In [12]:
pipeline = Pipeline([
                    ('tfidf', TfidfVectorizer(analyzer=text)), # 'text' is the custom-made function
                    ('Naive Bayes Classifier',MultinomialNB())
])

In [13]:
pipeline.fit(x_train,y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(analyzer=<function text at 0x000002AFB45C6310>)),
                ('Naive Bayes Classifier', MultinomialNB())])
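What the classifier step learns can be sketched with a small multinomial Naive Bayes written from scratch. This is a simplified illustration, not scikit-learn's implementation: it works on raw token counts rather than TF-IDF weights, uses add-one (Laplace) smoothing via an assumed `alpha=1.0`, and skips terms that were never seen in training. The names `train_nb` and `predict_nb` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods of each term."""
    vocab = {term for doc in docs for term in doc}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: all term occurrences in the class plus alpha per vocab term
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Pick the class with the highest log score; unseen terms are skipped."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Made-up toy training data
docs = [["free", "win", "prize"], ["call", "later"],
        ["free", "entry", "win"], ["ok", "call", "home"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, ["win", "free", "cash"]))    # spam
print(predict_nb(model, ["call", "home", "later"]))  # ham
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.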

In [14]:
predictions = pipeline.predict(x_test)

## Evaluate the model

In [15]:
from sklearn.metrics import confusion_matrix, classification_report

In [16]:
score = pipeline.score(x_test,y_test)
print(score)

0.95695067264574


In [17]:
print(confusion_matrix(y_test,predictions))

[[954   0]
 [ 48 113]]


In [18]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.95      1.00      0.98       954
        spam       1.00      0.70      0.82       161

    accuracy                           0.96      1115
   macro avg       0.98      0.85      0.90      1115
weighted avg       0.96      0.96      0.95      1115
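The per-class figures in the report can be reproduced by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, both in sorted order (ham, spam); the variable names are illustrative.

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predictions, both ordered (ham, spam)
matrix = [[954, 0],
          [48, 113]]
labels = ["ham", "spam"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
print(f"accuracy: {accuracy:.4f}")

metrics = {}
for i, label in enumerate(labels):
    true_pos = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum: everything predicted as this label
    actual = sum(matrix[i])                    # row sum: everything that truly has this label
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The 48 in the bottom-left cell is the number of spam messages predicted as ham, which is what pulls spam recall down to 113 / 161.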



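As we can see, the model has an overall accuracy of 96%. It never flags a ham message as spam (spam precision of 1.00), but it misses 48 of the 161 spam messages in the test set, so spam recall is only 0.70.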