# Naive Bayes Algorithm: Filtering SMS Spam Messages

This guided project applies the multinomial Naive Bayes algorithm to classify SMS messages as either spam or ham (not spam). The data set, from the UCI Machine Learning Repository, combines previously classified SMS messages with newly collected ones, for a total of 5,572 messages.

### Supported by Dataquest

## Exploring the Data

In [2]:
#Open the tab-separated file and examine the dataset
import pandas as pd

data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label','SMS'])
print(data.shape)
data.describe()

(5572, 2)


| | Label | SMS |
|---|---|---|
| count | 5572 | 5572 |
| unique | 2 | 5169 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


In [3]:
data['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

In [4]:
#Shuffle the dataset and do an 80/20 train-test split
rand_data = data.sample(frac=1, random_state=1)

train_length = round(len(rand_data)*0.8)
train = rand_data[:train_length].reset_index(drop=True)
test = rand_data[train_length:].reset_index(drop=True)

print('train:',train.shape)
print('test:',test.shape)
pct_train = train['Label'].value_counts(normalize=True)
pct_test = test['Label'].value_counts(normalize=True)
print('train pct:',pct_train)
print('test pct:',pct_test)

train: (4458, 2)
test: (1114, 2)
train pct: ham     0.86541
spam    0.13459
Name: Label, dtype: float64
test pct: ham     0.868043
spam    0.131957
Name: Label, dtype: float64


The shuffled split preserves the class balance of the original data set: both the training set and the test set are roughly 87% ham and 13% spam.

In [5]:
#Cleaning the data: replace punctuation with spaces and make every letter lowercase
#regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default

train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)
train['SMS'] = train['SMS'].str.lower()

test['SMS'] = test['SMS'].str.replace(r'\W', ' ', regex=True)
test['SMS'] = test['SMS'].str.lower()

train.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


In [6]:
test.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | later i guess i needa do mcat study too |
| 1 | ham | but i haf enuff space got like 4 mb |
| 2 | spam | had your mobile 10 mths update to latest oran... |
| 3 | ham | all sounds good fingers makes it difficult ... |
| 4 | ham | all done all handed in don t know if mega sh... |


Next, we build the vocabulary: the list of unique words that appear across all training messages. With it, we can count how often each unique word occurs in each SMS.

In [7]:
train['SMS'] = train['SMS'].str.split()

vocabulary = []
for col in train['SMS']:
    for word in col:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

# Note: this cell cannot be rerun as-is, because train['SMS'] now holds lists of words instead of strings

In [8]:
len(vocabulary)

7783

There are 7,783 unique words in the training set. Converting the vocabulary list to a set removes duplicates, leaving only the unique words, and the set is then converted back into a list.
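
The deduplication step can be sketched on a tiny example. The `toy_messages` list and its contents are made up for illustration and are not taken from the dataset; a set comprehension is used here as a compact alternative to the loop above.

```python
# Hypothetical tokenized messages, standing in for train['SMS'] after .str.split()
toy_messages = [
    ['free', 'prize', 'call', 'now'],
    ['call', 'me', 'later'],
    ['free', 'call'],
]

# A set keeps each word once, no matter how many times it appears
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

print(toy_vocabulary)
print(len(toy_vocabulary))
```

Sorting is optional; it only makes the result easier to read, since sets have no guaranteed order.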

In [9]:
#Count how many times each unique word appears in each SMS
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_counts_df = pd.DataFrame(word_counts_per_sms)


In [11]:
new_train = pd.concat([train,word_counts_df], axis=1)

In [12]:
new_train.head()

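| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |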


### Setting up Constant Parameters
Now that the data is clean, we need to establish the constants: the probability of spam, the probability of ham, the total number of words in spam messages, the total number of words in ham messages, and the vocabulary size. These values plug directly into the Naive Bayes formulas.
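
For reference, the standard multinomial Naive Bayes formulas with additive (Laplace) smoothing are written out here. The symbol names are chosen to mirror the variables in the code cells that follow (`p_spam`, `n_spam`, `nvocab`, `alpha`); the ham case is identical with Ham in place of Spam.

```latex
\begin{align}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(w_i \mid \text{Spam}) &= \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\end{align}
```

Here $N_{w_i \mid \text{Spam}}$ is the number of times word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, and $N_{\text{Vocabulary}}$ is the number of unique words in the training set.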

In [13]:
#Setting up constants
#Isolating spam and ham messages first
spam_messages = new_train[new_train['Label'] == 'spam']
ham_messages = new_train[new_train['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(new_train)
p_ham = len(ham_messages) / len(new_train)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()
nvocab = len(vocabulary)

#Laplace Smoothing so nothing has zero probability
alpha = 1

print(p_spam, p_ham, n_spam, n_ham)

0.13458950201884254 0.8654104979811574 15190 57237


In [14]:
#Calculating the parameters for both spam and ham
spam_dict = {uword:0 for uword in vocabulary}
ham_dict = {uword:0 for uword in vocabulary}

for word in vocabulary:
    nword_given_spam = spam_messages[word].sum()
    p_word_given_spam = (nword_given_spam + alpha) / (n_spam + alpha*nvocab)
    spam_dict[word] = p_word_given_spam
    
    nword_given_ham = ham_messages[word].sum()
    p_word_given_ham = (nword_given_ham + alpha) / (n_ham + alpha*nvocab)
    ham_dict[word] = p_word_given_ham
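
To see what the smoothing does, here is a minimal sketch that plugs the constants printed earlier into the same formula. The `smoothed_probability` helper and the count of 100 are hypothetical, introduced only for illustration.

```python
# Values printed earlier in the notebook
n_spam = 15190   # total words across spam training messages
nvocab = 7783    # unique words in the training vocabulary
alpha = 1        # Laplace smoothing constant

def smoothed_probability(word_count, n_class_words):
    """P(word|class) with additive smoothing (hypothetical helper)."""
    return (word_count + alpha) / (n_class_words + alpha * nvocab)

# A word that never appears in spam still gets a small non-zero probability
p_unseen = smoothed_probability(0, n_spam)
# A hypothetical word seen 100 times in spam gets a much larger one
p_frequent = smoothed_probability(100, n_spam)

print(p_unseen, p_frequent)
```

Without smoothing, a single word with a zero count would force the whole product for that class to zero.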
    

## Classifying the Messages
Next, we create a function that labels a message as spam or ham using the probabilities and word dictionaries computed above.

In [15]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors P(Spam) and P(Ham), then multiply in
    # P(word|Spam) and P(word|Ham) for every word found in the vocabulary
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
            
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
        

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [16]:
#Testing classify()
classify('WINNER!! This is the secret code to unlock the money: C3421.')
classify('Sounds great, Tom, see u there later')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.0310313054906764e-26
P(Ham|message): 9.115913222856268e-22
Label: Ham
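
The values printed above are already on the order of 1e-25 for short messages, and multiplying many more small probabilities can underflow to zero. One common variation, which is not what this notebook does, is to sum log-probabilities instead of multiplying probabilities. The sketch below uses made-up parameters (`toy_p_spam`, `toy_spam_dict`, and so on) in place of the trained ones.

```python
import math

# Hypothetical parameters, standing in for p_spam, p_ham, spam_dict and ham_dict
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_dict = {'free': 0.02, 'prize': 0.01, 'call': 0.015, 'later': 0.001}
toy_ham_dict = {'free': 0.002, 'prize': 0.0005, 'call': 0.01, 'later': 0.008}

def classify_log(words):
    """Return 'spam' or 'ham' by comparing summed log-probabilities."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_spam_dict:
            log_spam += math.log(toy_spam_dict[word])
        if word in toy_ham_dict:
            log_ham += math.log(toy_ham_dict[word])
    # Ties fall back to 'ham' in this sketch
    return 'spam' if log_spam > log_ham else 'ham'

print(classify_log(['free', 'prize', 'call']))
print(classify_log(['call', 'later']))

# A plain product of 500 small factors underflows to 0.0,
# while the log-space comparison still separates the classes
print(0.02 ** 500)
print(classify_log(['free'] * 500))
```

Because the logarithm is monotonic, comparing the sums gives the same label as comparing the products whenever the products are representable.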


## Measuring the Accuracy of the Spam Filter
To validate the model, we now run the classifier on the test data, which was not used during training.

In [17]:
#modify classify to return label instead of other information
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
            
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [20]:
test['predicted'] = test["SMS"].apply(classify_test_set)
total = test.shape[0]

# Count the rows where the predicted label matches the true label
correct = int((test['Label'] == test['predicted']).sum())
accuracy = correct / total

In [22]:
print("accuracy:", accuracy)
print("correct:", correct)
print("incorrect:", total-correct)

accuracy: 0.9874326750448833
correct: 1100
incorrect: 14


## Conclusions

The spam filter reaches 98.7% accuracy on the test set (1,100 of 1,114 messages classified correctly), which is much better than I expected. For context, labeling every message as ham would score about 86.8% on this test set. Naive Bayes works well here, and its accuracy might improve further with a larger training vocabulary.

Another way to improve it is to examine the 14 misclassified messages and work out why the algorithm labeled them incorrectly.
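
A minimal sketch of that error analysis follows, assuming a DataFrame shaped like `test` after the `predicted` column has been added. The `toy_test` frame and its rows are invented for illustration and are not results from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the `test` DataFrame with its 'predicted' column
toy_test = pd.DataFrame({
    'Label':     ['ham', 'spam', 'ham', 'spam'],
    'SMS':       ['see you later', 'free prize call now', 'call me', 'hi its me'],
    'predicted': ['ham', 'spam', 'spam', 'ham'],
})

# Keep only the rows where the prediction disagrees with the true label
misclassified = toy_test[toy_test['Label'] != toy_test['predicted']]
print(misclassified)

# Cross-tabulate true labels against predictions to see which error type dominates
print(pd.crosstab(toy_test['Label'], toy_test['predicted']))
```

Applied to the real `test` frame, the same filter would list the 14 misclassified messages for manual inspection.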