## Building a Spam Filter with Naive Bayes
The goal of this project is to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm, which can classify new messages with an accuracy greater than 80%.

To train the algorithm, a dataset of 5,572 SMS messages is used, which are already classified by humans and can be downloaded from the The UCI Machine Learning Repository

### Exploring the dataset

In [1]:
import pandas as pd
messages = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label','SMS'])

In [2]:
messages.shape

(5572, 2)

In [3]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [4]:
messages.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Below, we see that about 87% of messages are ham(not spam) and remaining 13% are spam. This sample looks representative, since in real world as well most messages that people receive would be ham.

In [5]:
messages['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

### Training and Testing
Will now split the dataset into training and testing sets.  
Training Set - 80% of messages (4458)  
Testing Set - 20% of messages (1114)

In [6]:
# Randomize the dataset
data_randomized = messages.sample(frac=1, random_state=1)

# Calculate index for split
train_test_index = round(len(data_randomized) * 0.8)

# Training/Testing Split
training_set = messages[:train_test_index].reset_index(drop=True)
testing_set = messages[train_test_index:].reset_index(drop=True)

print(training_set.shape)
print(testing_set.shape)

(4458, 2)
(1114, 2)


Checking the spam and ham percentages in both training and testing data, we see that both have similar division as in complete dataset.

In [7]:
training_set['Label'].value_counts(normalize=True)* 100

ham     86.496187
spam    13.503813
Name: Label, dtype: float64

In [8]:
testing_set['Label'].value_counts(normalize=True)* 100

ham     86.983842
spam    13.016158
Name: Label, dtype: float64

### Data Cleaning

In [9]:
training_set['SMS'] = training_set['SMS'].str.replace('\W',' ')
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head()

Unnamed: 0,Label,SMS
0,ham,go until jurong point crazy available only ...
1,ham,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor u c already then say
4,ham,nah i don t think he goes to usf he lives aro...


In [10]:
training_set['SMS'] = training_set['SMS'].str.split()

In [11]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"[go, until, jurong, point, crazy, available, o..."
1,ham,"[ok, lar, joking, wif, u, oni]"
2,spam,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,"[u, dun, say, so, early, hor, u, c, already, t..."
4,ham,"[nah, i, don, t, think, he, goes, to, usf, he,..."


In [12]:
# Create a list of all words in all the SMS messages
vocabulary = []
for msg in training_set['SMS']:
    for word in msg:
       vocabulary.append(word)

print(vocabulary[:10])
print(len(vocabulary))

['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n']
72566


In [13]:
# Remove duplicates from the vocabulary list
vocabulary = list(set(vocabulary))
print(len(vocabulary))

7813


In [14]:
training_set.head(2)

Unnamed: 0,Label,SMS
0,ham,"[go, until, jurong, point, crazy, available, o..."
1,ham,"[ok, lar, joking, wif, u, oni]"


In [15]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [16]:
word_counts = pd.DataFrame(word_counts_per_sms)

In [17]:
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585236,01223585334,0125698789,02,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,ü
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
training_set_clean = pd.concat([training_set,word_counts], axis=1)

In [19]:
training_set_clean.head(1)

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585236,01223585334,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,ü
0,ham,"[go, until, jurong, point, crazy, available, o...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Calculating Probabilities
For classification, we would need to find probability of new set of words(new message) being spam or ham.

p_spam_given_word = p_spam x p_word_given_spam  
p_ham_given_word = p_ham x p_word_given_ham  
  
where  
p_spam = probability of spam messages out of all messages  
p_ham = probability of ham messages out of all messages  
n_spam = no of words in spam messages  
n_ham = no of words in ham messages   

p_word_given_spam = probability of each word in vocabulary within spam messages
p_word_given_ham = probability of each word in vocabulary within ham messages

In [20]:
# Separating spam and ham messages from dataset
training_set_spam = training_set_clean[training_set_clean['Label'] == 'spam']
training_set_ham = training_set_clean[training_set_clean['Label'] == 'ham']

# Calculating proportion of spam and ham messages
p_spam = len(training_set_spam) / len(training_set_clean)
p_ham = len(training_set_ham) / len(training_set_clean)

# Calculating no of words in spam and ham messages
n_spam = training_set_spam['SMS'].apply(len).sum()
n_ham = training_set_ham['SMS'].apply(len).sum()

# Calculating no of unique words in dataset
n_vocab = len(vocabulary)

# Laplace Smoothing
alpha = 1


In [21]:
#Initiate Parameters
parameters_spam = {word:0 for word in vocabulary} 
parameters_ham = {word:0 for word in vocabulary} 

#Calculate Parameters - P(wi|Spam) and P(wi|Ham)
for word in vocabulary:
    n_word_given_spam = training_set_spam[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocab)
    parameters_spam[word] = p_word_given_spam  
    
    n_word_given_ham = training_set_ham[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocab)
    parameters_ham[word] = p_word_given_ham 

In [24]:
def classify(message):
    message = message.replace('\W',' ')
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|Message): ',p_spam_given_message)
    print('P(Ham|Message): ',p_ham_given_message)
    
    if p_spam_given_message > p_ham_given_message:
        print('Label: Spam')
    elif p_spam_given_message < p_ham_given_message: 
        print('Label: Ham')
    else:
        print('Equal Probability. Human intervention required !!')    

In [25]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|Message):  1.0674508110611438e-18
P(Ham|Message):  2.284519422063104e-19
Label: Spam


In [26]:
classify('Sounds good, Tom, then see u there')

P(Spam|Message):  1.3005185582923354e-17
P(Ham|Message):  3.2778699633045125e-14
Label: Ham


In [30]:
def classify_test(message):
    message = message.replace('\W',' ')
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    elif p_spam_given_message < p_ham_given_message: 
        return 'ham'
    else:
        return 'needs manual classification'    

In [31]:
testing_set['Predicted'] = testing_set['SMS'].apply(classify_test)

In [32]:
testing_set.head()

Unnamed: 0,Label,SMS,Predicted
0,ham,Aight should I just plan to come up later toni...,ham
1,ham,Die... I accidentally deleted e msg i suppose ...,ham
2,spam,Welcome to UK-mobile-date this msg is FREE giv...,spam
3,ham,This is wishing you a great day. Moji told me ...,ham
4,ham,Thanks again for your reply today. When is ur ...,ham


In [37]:
#Calculating Accuracy
compare = testing_set['Label'] == testing_set['Predicted']
correct = compare.sum()
total = testing_set.shape[0]
accuracy = (correct / total) * 100
print(accuracy)

97.93536804308796
