# Creating an SMS Spam Filter Using the Naive Bayes Algorithm
[Data Set Description](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection)

In [1]:
# Disable warnings in Anaconda
import warnings
warnings.filterwarnings('ignore')

import pandas as pd # Data processing
import numpy as np # Linear algebra
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('white')

import timeit # measure runtimes

In [2]:
data = pd.read_csv("SMSSpamCollection", sep="\t", names=["Label", "SMS"])
data.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [6]:
print("Number of rows: {}\n".format(data.shape[0]))
print("Percentage of spam vs ham (non-spam) messages")
print("-"*45)
print((data["Label"].value_counts(normalize=True)*100).to_string())

Number of rows: 5572

Percentage of spam vs ham (non-spam) messages
---------------------------------------------
ham     86.593683
spam    13.406317


To-Do:
- Data cleaning: Remove punctuation and convert to lowercase
- Create a vocabulary set
- Create a dictionary with word counts per SMS
- Define random train-test sets
- Naive Bayes
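
As a rough illustration of where this plan leads, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It runs on four made-up, already-cleaned messages, not on the SMS collection. The names `train`, `alpha`, `log_score`, and `classify` are hypothetical, and Laplace smoothing with `alpha = 1` is an assumed choice rather than something this notebook fixes.

```python
import math
from collections import Counter

# Toy, already-cleaned messages (hypothetical data, not from the SMS collection)
train = [
    ("ham", "are you coming home soon"),
    ("ham", "ok see you at home"),
    ("spam", "win free cash now"),
    ("spam", "free entry win prize"),
]

alpha = 1  # Laplace smoothing constant (an assumed choice)

# Build the vocabulary and the per-class word counts
vocabulary = {word for _, text in train for word in text.split()}
word_counts = {"ham": Counter(), "spam": Counter()}
label_counts = Counter()
for label, text in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())


def log_score(text, label):
    """Return log P(label) plus the sum of log P(word | label) over known words."""
    n_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / len(train))
    for word in text.split():
        if word in vocabulary:  # words never seen in training are skipped
            p = (word_counts[label][word] + alpha) / (n_words + alpha * len(vocabulary))
            score += math.log(p)
    return score


def classify(text):
    """Pick the label with the higher log score."""
    return max(("ham", "spam"), key=lambda label: log_score(text, label))


print(classify("free cash"))     # spam
print(classify("see you soon"))  # ham
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.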

## Data Cleaning
- Remove punctuation using regex
- Convert strings to lowercase

In [14]:
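# Replace every non-word character with a space; recent pandas versions need regex=True
data["SMS"] = data["SMS"].str.replace(r"\W", " ", regex=True)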
data["SMS"]=data["SMS"].str.lower()
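
To see what these two steps do to a single message, the sketch below applies the same `\W` pattern with Python's standard `re` module instead of pandas. The helper name `clean_message` is hypothetical and only serves the illustration.

```python
import re


def clean_message(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r"\W", " ", text).lower()


sample = "Ok lar... Joking wif u oni..."  # row 1 of the preview above
cleaned = clean_message(sample)
print(cleaned.split())  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```

Each punctuation character becomes its own space, so the cleaned string contains runs of several spaces. That matters for the tokenization step: splitting on a single space yields empty strings, while `split()` with no argument discards them.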

## Feature Transformation
- Split messages at space characters
- Create a vocabulary list iterating on each message
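
The two steps above can be sketched on a couple of stand-in strings before applying them to the full column. The list `messages` is hypothetical sample data, and using a `set` to collect unique words is one reasonable choice, not necessarily the one the rest of the notebook makes.

```python
# Hypothetical cleaned messages standing in for data["SMS"]
messages = ["ok lar    joking wif u oni   ", "u dun say so early hor"]

vocabulary = set()
for message in messages:
    # split() with no argument also discards the empty strings
    # that split(" ") leaves behind for runs of spaces
    vocabulary.update(message.split())

vocabulary = sorted(vocabulary)  # a sorted list gives a stable word order
print(vocabulary)
# ['dun', 'early', 'hor', 'joking', 'lar', 'ok', 'oni', 'say', 'so', 'u', 'wif']
```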

In [15]:
data["SMS"].str.split(" ")

0       [go, until, jurong, point, , crazy, , , availa...
1              [ok, lar, , , , joking, wif, u, oni, , , ]
2       [free, entry, in, 2, a, wkly, comp, to, win, f...
3       [u, dun, say, so, early, hor, , , , u, c, alre...
4       [nah, i, don, t, think, he, goes, to, usf, , h...
5       [freemsg, hey, there, darling, it, s, been, 3,...
6       [even, my, brother, is, not, like, to, speak, ...
7       [as, per, your, request, , melle, melle, , oru...
8       [winner, , , as, a, valued, network, customer,...
9       [had, your, mobile, 11, months, or, more, , u,...
10      [i, m, gonna, be, home, soon, and, i, don, t, ...
11      [six, chances, to, win, cash, , from, 100, to,...
12      [urgent, , you, have, won, a, 1, week, free, m...
13      [i, ve, been, searching, for, the, right, word...
14         [i, have, a, date, on, sunday, with, will, , ]
15      [xxxmobilemovieclub, , to, use, your, credit, ...
16                  [oh, k, , , i, m, watching, here, , ]
17      [eh, u