# Creating an SMS Spam Filter Using Naive Bayes Algorithm
[Data Set Description](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection)

In [1]:
# Disable warnings in Anaconda
import warnings
warnings.filterwarnings('ignore')

import pandas as pd # Data processing
import numpy as np # Linear algebra
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('white')

import timeit # measure runtimes

In [2]:
data = pd.read_csv("SMSSpamCollection", sep='\t',names=["Label","SMS"]);
data.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
print("Number of rows: {}\n".format(data.shape[0]))
print("Percentage of spam vs ham (non-spam) messages")
print("-"*45)
print((data["Label"].value_counts(normalize=True)*100).to_string())

Number of rows: 5572

Percentage of spam vs ham (non-spam) messages
---------------------------------------------
ham     86.593683
spam    13.406317


To-Do:
- Data cleaning: Remove punctuation and conver to lowercase
- Create a vocabulary set
- Create a dictionary with word counts per sms
- Define random train-test sets
- Naive Bayes

## Data Cleaning
- Remove punctuation using regex
- Convert strings to lowercase

In [4]:
data["SMS"]=data["SMS"].str.replace("\W"," ")
data["SMS"]=data["SMS"].str.lower()

## Feature Transformation
- Split messages at space characters
- Create a vocabulary list iterating on each message

In [5]:
words_per_sms = data["SMS"].str.split()
vocabulary_set = {word for sms in words_per_sms for word in sms}

In [6]:
data_clean = data.copy()

for word in vocabulary_set:
    pattern = r"\b" + word + r"\b"
    data_clean[word] = data_clean["SMS"].str.count(word)

## Splitting Data Into Train/Test Samples
Notes:
- Not adding column recording the count of spacing (and punctuation) characters
- The column `data_clean["SMS"]` will not be used for the model

In [7]:
from sklearn.model_selection import train_test_split

X=data_clean.iloc[:,2:] # features
y=data_clean.iloc[:,:1] # target values, spam or ham
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

## Multinomial Naive Bayes Algorithm
Multinomial variant chosen, because we have a discrete distribution and the features are represented by a whole number.

In [8]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train, 
        y_train.values #.values transforms the values into an array of shape (n,1)
        .ravel()) #.ravel converts the array to shape (n, )
predictions = clf.predict(X_test)

In [9]:
clf.score(X_test,y_test)

0.9874439461883409

In [10]:
results=pd.DataFrame(predictions)

## To-Do:
- Improve the algorithm looking at the false positives and false negatives: Is there anything special?
- Create a Naive Bayes algorithm from scratch