# Spam Filter using Naive Bayes Classifier

In [None]:
import os
print(os.listdir("../input"))

**Import libraries**

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**Read csv file**

In [None]:
df = pd.read_csv('../input/spam.csv', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'message']
df.head()

**Describe dataset and visualize ham/spam count**

In [None]:
df.groupby('label').describe()

In [None]:
sns.countplot(data=df, x='label')

**Let's move directly to creating the spam filter.<br>
Our approach:**

1. Clean and normalize the text
2. Convert the text into vectors (using the bag-of-words model) that machine learning models can understand
3. Train and test the classifier

**Clean and normalize text**<br>
This is done in the following steps:<br>
1. Remove punctuation
2. Remove all stopwords
3. Apply [stemming](https://en.wikipedia.org/wiki/Stemming) (reducing each word to its stem). <br>
   For example, 'driving car' and 'drives car' both become 'drive car'<br>

**Write a function that returns the normalized text as a list of tokens (stems)**

In [None]:
import string
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer

# Build these once instead of reloading the stopword list for every token
STOPWORDS = set(stopwords.words('english'))
stemmer = Stemmer()

def process(text):
    # lowercase it
    text = text.lower()
    # remove punctuation
    text = ''.join([t for t in text if t not in string.punctuation])
    # remove stopwords
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # stem each remaining token and return the token list
    return [stemmer.stem(t) for t in tokens]

In [None]:
# Testing
process('It\'s holiday and we are playing cricket. Jeff is playing very well!!!')

In [None]:
# Test with our dataset
df['message'][:20].apply(process)

**Convert each message into a vector that machine learning models can understand.<br>We will do that using the bag-of-words model.**
<br>We will use TfidfVectorizer. It converts a collection of text documents (the SMS corpus) into a 2D matrix.
<br>One dimension represents the documents and the other represents each unique term in the SMS corpus.
<br>If the **n<sup>th</sup> term t occurs p times in the m<sup>th</sup> document**, the (m, n) entry of this matrix is TF-IDF(t), <br><center>where [TF-IDF(t)](https://en.wikipedia.org/wiki/Tf–idf) = Term Frequency (TF) * Inverse Document Frequency (IDF)</center>
<br>Term Frequency (TF) measures how frequently a term occurs in a document.<br>
<br><center>TF(t) = Number of times term t appears in the document (p) / Total number of terms in that document</center>
<br>Inverse Document Frequency (IDF) measures how informative a term is. TF treats all terms equally, whereas IDF assigns less weight to words that occur in many documents, such as 'is', 'the' and 'of', and more weight to rare terms, which are more useful for telling the classes apart.<br>
<br><center>Inverse Document Frequency, IDF(t) = log<sub><i>e</i></sub>(Total number of documents / Number of documents with term t in it)</center>
<br>In the end, every message is represented by a vector, normalized to unit length, whose dimension equals the size of the vocabulary (the number of unique terms in the entire SMS corpus).
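
The formulas above can be checked by hand on a tiny example. The sketch below is a minimal pure-Python illustration, assuming a hypothetical three-message corpus that is already tokenized (it is not taken from the SMS dataset). It follows the textbook definitions given here. scikit-learn's TfidfVectorizer, with its default settings, uses a slightly different variant (raw counts for TF, a smoothed IDF of log<sub><i>e</i></sub>((1 + N) / (1 + df)) + 1, then L2 normalization), so its numbers will not match these exactly.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized (illustration only)
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]

def tf(term, doc):
    # occurrences of the term / total number of terms in the document
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # natural log of (number of documents / documents containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Scores for two terms of the third message
for term in ("free", "entry"):
    print(term, round(tfidf(term, docs[2], docs), 4))
```

In the third message, 'free' occurs twice but also appears in another message, while 'entry' occurs once and nowhere else, so 'entry' ends up with the higher score even though its term frequency is lower.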

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

**Fit and transform SMS corpus**

In [None]:
tfidfv = TfidfVectorizer(analyzer=process)
data = tfidfv.fit_transform(df['message'])

**Let's check what values it gives for a message**

In [None]:
mess = df.iloc[2]['message']
print(mess)

In [None]:
print(tfidfv.transform([mess]))

**A better view**

In [None]:
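j = tfidfv.transform([mess]).toarray()[0]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
terms = tfidfv.get_feature_names_out()
print('index\tidf\ttfidf\tterm')
for i in range(len(j)):
    if j[i] != 0:
        print(i, format(tfidfv.idf_[i], '.4f'), format(j[i], '.4f'), terms[i], sep='\t')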

**With the messages in vector form, we are ready to train our classifier. <br>We will use Naive Bayes, a well-known classifier for text data.
<br>Before that, we will use scikit-learn's Pipeline to chain TfidfVectorizer with the classifier.**
<br>A raw message is passed to the first stage, TfidfVectorizer, which transforms it and passes the result to the Naive Bayes classifier, which outputs the label.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
spam_filter = Pipeline([
    ('vectorizer', TfidfVectorizer(analyzer=process)), # messages to weighted TFIDF score
    ('classifier', MultinomialNB())                    # train on TFIDF vectors with Naive Bayes
])

**Perform train test split**

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.20, random_state = 21)

**Train spam_filter**

In [None]:
spam_filter.fit(x_train, y_train)

**Predict for test cases**

In [None]:
predictions = spam_filter.predict(x_test)

In [None]:
# Count test messages whose predicted label differs from the true label
count = (y_test != predictions).sum()
print('Total number of test cases', len(y_test))
print('Number of wrong predictions', count)

**Check the messages that were classified incorrectly**

In [None]:
x_test[y_test != predictions]

**Use classification report to get more details**

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))  # signature is (y_true, y_pred)

Looking at the recall column (for ham, it is 1.00), we can say that every wrong prediction listed above is spam predicted as ham. That is acceptable, because the cost of predicting spam as ham is much lower than the cost of predicting ham as spam.
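
How precision and recall relate to the two kinds of error can be checked on a small example. The sketch below is a minimal pure-Python illustration, assuming hypothetical labels (they are not this notebook's results) in which the only mistake is one spam message predicted as ham. Keep in mind that `classification_report` expects the true labels first and the predictions second, and swapping the two arguments swaps the precision and recall columns.

```python
from collections import Counter

# Hypothetical labels for illustration only (not the notebook's results)
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "ham", "ham", "spam", "ham", "spam"]

# Count each (true label, predicted label) pair
pairs = Counter(zip(y_true, y_pred))

def recall(label):
    # of all messages that truly have this label, the fraction predicted correctly
    actual = sum(n for (t, _), n in pairs.items() if t == label)
    return pairs[(label, label)] / actual

def precision(label):
    # of all messages predicted with this label, the fraction that truly have it
    predicted = sum(n for (_, p), n in pairs.items() if p == label)
    return pairs[(label, label)] / predicted

for label in ("ham", "spam"):
    print(label, "precision", round(precision(label), 2), "recall", round(recall(label), 2))
```

Here ham recall is 1.0 because no ham message was flagged as spam, while ham precision drops below 1.0 because one spam message slipped through as ham.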

Function to predict whether a given message is ham or spam

In [None]:
def detect_spam(s):
    return spam_filter.predict([s])[0]
detect_spam('Your cash-balance is currently 500 pounds - to maximize your cash-in now, send COLLECT to 83600.')