# SPAM MESSAGE DETECTION
# Natural Language Processing with Multinomial Naive Bayes Classifier 

<b>Steps:</b>
1. Imports
2. Get the Data and Create the Dataframe
3. Explore the Data
4. Text Pre-processing
5. Train Test Split
6. Creating a Data Pipeline
7. Testing a sample

## 1 - Imports

In [67]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

## 2 - Get the Data and Create the Dataframe

 - The data is not a CSV file, so before opening it with Pandas we need to check how the fields are separated in the file.

In [68]:
# Read the raw lines; the with-block closes the file when done
with open('SMSSpamCollection') as f:
    messages = [row.rstrip() for row in f]
print(len(messages))

5574


In [69]:
# Display the first message to see how it is structured
messages[0]

'ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

- The '\t' shows that this is a tab-separated file.
- Each line has two elements: the <b>label</b> and the <b>message</b>.
- We can now open the file with Pandas by passing '\t' as the separator and 'label' and 'message' as the column names.

In [70]:
df_messages = pd.read_csv('SMSSpamCollection', sep='\t', names=["label", "message"])

In [71]:
df_messages.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## 3 - Explore the Data

In [72]:
# Create the length column
df_messages['length'] = df_messages['message'].apply(len)

In [73]:
df_messages.head()

| | label | message | length |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 |

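With the `length` column in place, a natural next step is to compare message lengths per label. The sketch below is a minimal example on a small hand-made frame (the rows are illustrative, not taken from the dataset); the same `groupby` call should work unchanged on `df_messages`.

```python
import pandas as pd

# Toy stand-in for df_messages; these rows are made up for illustration.
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "message": [
        "Ok lar",
        "See you at 5",
        "WINNER!! Claim your free prize now, call 0800 000 000",
        "Free entry in a weekly competition, text WIN to 80000",
    ],
})
toy["length"] = toy["message"].apply(len)

# Summary statistics of message length for each label
length_summary = toy.groupby("label")["length"].describe()
print(length_summary[["count", "mean", "min", "max"]])
```

If the two labels show clearly different length distributions, `length` may be worth considering as an extra feature.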

## 4 - Text Pre-processing

### Let's create a function to apply to the Dataframe

In [76]:
# Build the stopword set once; a set gives fast membership tests
STOP_WORDS = set(stopwords.words('english'))

def text_handler(message):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation characters
    2. Join the remaining characters back into a string
    3. Remove all stopwords
    4. Return the cleaned text as a list of words
    """
    # Keep every character that is not punctuation
    nopunc_letters = [char for char in message if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc_mess = ''.join(nopunc_letters)

    # Split into words and drop any stopwords (case-insensitive)
    return [word for word in nopunc_mess.split() if word.lower() not in STOP_WORDS]

In [77]:
# Check to make sure it's working
# This does not modify the dataframe; it only previews the result
df_messages['message'].head().apply(text_handler)

0    [Go, jurong, point, crazy, Available, bugis, n...
1                       [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
3        [U, dun, say, early, hor, U, c, already, say]
4    [Nah, dont, think, goes, usf, lives, around, t...
Name: message, dtype: object

## 5 - Train Test Split

In [78]:
from sklearn.model_selection import train_test_split

In [79]:
X = df_messages['message']
y = df_messages['label']

In [80]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [81]:
print(len(X_train), len(X_test), len(y_train) + len(y_test))

4457 1115 5572

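The split above is reshuffled on every run, and spam is the minority class, so the share of spam in the test set can drift between runs. Below is a minimal sketch of one way to make the split repeatable and class-balanced, using the `random_state` and `stratify` parameters of `train_test_split`. The toy labels and the seed value 42 are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 80 ham and 20 spam messages (made up for illustration)
X_toy = [f"message {i}" for i in range(100)]
y_toy = ["ham"] * 80 + ["spam"] * 20

# stratify keeps the ham/spam ratio the same in both splits;
# random_state fixes the shuffle so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(Counter(y_tr))  # class counts in the training split
print(Counter(y_te))  # class counts in the test split
```

On the real data this would correspond to passing `stratify=y` and a fixed `random_state` to the call above.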

## 6 - Creating a Data Pipeline

In [82]:
from sklearn.pipeline import Pipeline

In [83]:
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_handler)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])

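To make the first two pipeline stages concrete, here is a small sketch on a made-up three-message corpus. It uses `CountVectorizer` with its default tokenizer rather than `text_handler`, purely to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three made-up messages, for illustration only
corpus = ["free prize now", "call me now", "free free entry"]

# Stage 1: bag of words -> sparse matrix of integer token counts
bow = CountVectorizer()
counts = bow.fit_transform(corpus)
print(sorted(bow.vocabulary_))   # learned vocabulary
print(counts.toarray())          # one row per message, one column per word

# Stage 2: re-weight the counts with TF-IDF
# (by default each row is scaled to unit L2 norm)
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)
print(weights.toarray().round(2))
```

Words that appear in many messages (here `now` and `free`) receive a lower inverse-document-frequency weight than words that appear in only one.
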
In [84]:
pipeline.fit(X_train,y_train)

Pipeline(steps=[('bow',
                 CountVectorizer(analyzer=<function text_handler at 0x00000289979ED040>)),
                ('tfidf', TfidfTransformer()),
                ('classifier', MultinomialNB())])

In [85]:
predictions = pipeline.predict(X_test)

In [86]:
from sklearn.metrics import classification_report, confusion_matrix
# scikit-learn expects the true labels first, then the predictions
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       944
        spam       1.00      0.75      0.86       171

    accuracy                           0.96      1115
   macro avg       0.98      0.87      0.92      1115
weighted avg       0.96      0.96      0.96      1115



In [87]:
# Rows are true labels, columns are predicted labels
print(confusion_matrix(y_test, predictions))

[[944   0]
 [ 43 128]]

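The counts read off the matrix are: 944 ham messages classified as ham, 128 spam messages classified as spam, 43 spam messages classified as ham, and no ham messages classified as spam. A minimal sketch of how precision, recall, and F1 for the spam class follow from these counts (the variable names are illustrative):

```python
# Counts read off the confusion matrix, treating spam as the positive class
true_positives = 128   # spam predicted as spam
false_negatives = 43   # spam predicted as ham (missed spam)
false_positives = 0    # ham predicted as spam
true_negatives = 944   # ham predicted as ham

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_positives + true_negatives) / (
    true_positives + true_negatives + false_positives + false_negatives
)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=1.00 recall=0.75 f1=0.86 accuracy=0.96
```

In other words, on this split the classifier never flags a legitimate message as spam, but it lets about a quarter of the spam through.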

## 7 - Testing a sample

TEST THE FOLLOWING MESSAGE:
<br/>
<br/>
CONGRATULATIONS!
We chose your number to get a chance to win QR100.000!
Reply with "YES" now for FREE and participate NOW! JUST QR5/day

In [88]:
data = input('PLEASE INPUT THE TEXT: ')
test_to_predict = pd.Series(np.array([data]))
test_to_predict
predict_new = pipeline.predict(test_to_predict)
predict_new

PLEASE INPUT THE TEXT: CONGRATULATIONS! We chose your number to get a chance to win QR100.000! Reply with "YES" now for FREE and participate NOW! JUST QR5/day


array(['spam'], dtype='<U4')
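
`input()` makes the cell interactive; for scripted checks it can be handy to wrap the prediction in a function. The sketch below shows one possible shape. The helper name `classify_message` is hypothetical, and a tiny made-up training set stands in for the fitted `pipeline` so that the example runs on its own. It also uses the default `CountVectorizer` tokenizer instead of `text_handler`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set so the example is self-contained
texts = [
    "win a free prize now", "free entry reply yes to win",
    "are you coming home for dinner", "ok see you later",
]
labels = ["spam", "spam", "ham", "ham"]

toy_pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
]).fit(texts, labels)


def classify_message(model, text):
    """Return the predicted label and the per-class probabilities."""
    label = model.predict([text])[0]
    probabilities = dict(zip(model.classes_, model.predict_proba([text])[0]))
    return label, probabilities


label, probabilities = classify_message(toy_pipeline, "reply YES now for a FREE prize")
print(label, probabilities)
```

The same helper should accept the notebook's fitted `pipeline` as `model`; the probabilities give a rough sense of how confident the classifier is, beyond the bare label.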