# Intro

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).
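
Assuming the multinomial Naive Bayes formulation that this project builds on, the two values compared in step 3 can be written as follows. Here $w_1, \dots, w_n$ stand for the words of the new message; the notation is introduced for illustration only.

```latex
\begin{aligned}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \dots, w_n) &\propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{aligned}
```

The message is labelled with whichever class gives the larger value.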

In [1]:
import pandas as pd

data = pd.read_csv("SMSSpamCollection", sep = '\t', 
                   header = None, names =["Label", "SMS"])

data.shape

(5572, 2)

In [2]:
data.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
#Percentage ham

num_ham = data[data["Label"]== "ham"]
num_ham.shape

(4825, 2)

In [4]:
per_ham = len(num_ham) / len(data) * 100
print("Ham:", round(per_ham,2), "%")

Ham: 86.59 %


In [5]:
#Percentage spam

num_spam = data[data["Label"]== "spam"]
num_spam.shape

(747, 2)

In [6]:
per_spam = len(num_spam) / len(data) * 100
print("Spam:", round(per_spam,2), "%")

Spam: 13.41 %
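
As an alternative to counting each class by hand, the same percentages can probably be read off in one step with `value_counts(normalize=True)`. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in, not the SMS dataset).

```python
import pandas as pd

# Toy stand-in for the SMS data; the notebook itself loads SMSSpamCollection.
toy = pd.DataFrame({
    "Label": ["ham", "ham", "spam", "ham"],
    "SMS": ["see you soon", "ok lar", "win cash now", "call me later"],
})

# normalize=True returns fractions; multiply by 100 for percentages.
label_pct = toy["Label"].value_counts(normalize=True) * 100
print(label_pct.round(2))
```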


# Randomising and Splitting the Data

In [7]:
random = data.sample(frac = 1, random_state = 1)

split = round(5572 * 0.8)

# 80% of the data will be used for the training set
train = random[: split].reset_index(drop = True)

# 20% of the data will be used for the test set
test = random[split :].reset_index(drop = True)


In [8]:
train.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [9]:
# Percentage ham and spam in the training set

train["Label"].value_counts(normalize = True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [10]:
# Percentage ham and spam in the test set

test["Label"].value_counts(normalize = True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

The percentages in the two sets are similar to the full dataset.
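
The shuffle-and-split step can be wrapped in a small helper so that the cut point follows the size of the data instead of a hard-coded row count. This is a minimal sketch; `shuffle_split` and the toy frame are hypothetical names used only for illustration.

```python
import pandas as pd

def shuffle_split(df, train_frac=0.8, seed=1):
    """Shuffle df reproducibly and split it into train and test parts."""
    shuffled = df.sample(frac=1, random_state=seed)
    cut = round(len(shuffled) * train_frac)
    train_part = shuffled[:cut].reset_index(drop=True)
    test_part = shuffled[cut:].reset_index(drop=True)
    return train_part, test_part

# Toy data: 8 ham and 2 spam messages.
toy = pd.DataFrame({"Label": ["ham"] * 8 + ["spam"] * 2,
                    "SMS": [f"message {i}" for i in range(10)]})
toy_train, toy_test = shuffle_split(toy)
print(len(toy_train), len(toy_test))
```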

# Data Cleaning

In [11]:
train["SMS"] = train["SMS"].str.replace(r"\W", " ", regex=True)
train["SMS"] = train["SMS"].str.lower()
train.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
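
The cleaning step relies on `\W` being interpreted as a regular expression. Recent pandas versions treat the pattern as a literal string unless `regex=True` is passed, so the sketch below sets it explicitly. `toy_sms` is a made-up series used only to show the effect.

```python
import pandas as pd

toy_sms = pd.Series(["WINNER!! Call now.", "Ok lar... Joking wif u oni..."])

# regex=True makes "\W" match any non-word character
# instead of the literal two-character string.
cleaned = toy_sms.str.replace(r"\W", " ", regex=True).str.lower()
tokens = cleaned.str.split()
print(tokens.tolist())
```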


# Vocabulary List

In [12]:
vocab = []

train["SMS"] = train["SMS"].str.split()


In [13]:
for sentence in train["SMS"]:
    for word in sentence:
        vocab.append(word)

In [14]:
len(vocab)

72427

In [15]:
# Transform vocab list to remove duplicates

vocab = list(set(vocab))

len(vocab)

7783
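
The append-then-deduplicate steps above can also be expressed as a single set comprehension. The following is a sketch on two made-up tokenised messages, not the notebook's own code; sorting is optional and only makes the order stable.

```python
tokenised = [["yep", "by", "the", "pretty", "sculpture"],
             ["the", "pretty", "card"]]

# Collect unique words across all messages; sorted() gives a stable order.
toy_vocab = sorted({word for message in tokenised for word in message})
print(toy_vocab)
```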

# Create Dictionary

In [16]:
word_counts_sms = {word: [0] * len(train["SMS"]) for word in vocab}

for index, sms in enumerate(train["SMS"]):
    for word in sms:
        word_counts_sms[word][index] += 1


In [17]:
dic_table = pd.DataFrame.from_dict(word_counts_sms)
dic_table.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
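
To make the shape of this table easier to see, here is the same construction on two made-up messages: one column per vocabulary word, one row per message, and each cell holding how often the word occurs in that message. All names in the sketch are hypothetical.

```python
import pandas as pd

toy_messages = [["free", "entry", "free"], ["see", "you"]]
toy_vocab = sorted({w for msg in toy_messages for w in msg})

# One column per vocabulary word, one row per message.
counts = {word: [0] * len(toy_messages) for word in toy_vocab}
for index, msg in enumerate(toy_messages):
    for word in msg:
        counts[word][index] += 1

count_table = pd.DataFrame(counts)
print(count_table)
```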


In [18]:
# concatenate with training data set

clean = pd.concat([train, dic_table], axis = 1)

clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


# Calculating Probabilities

In [19]:
num_ham = len(clean[clean["Label"] == "ham"])
num_spam = len(clean[clean["Label"] == "spam"])

p_ham = num_ham / len(clean)
p_spam = num_spam / len(clean)

num_vocab = len(clean.columns) - 2

# Laplace smoothing
alpha = 1

In [20]:
# Initialise two dictionaries, one for P(wi|Spam) and one for P(wi|ham)

dic_ham = {unique_word:0 for unique_word in vocab}
dic_spam = {unique_word:0 for unique_word in vocab}


In [21]:
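# Total number of words in all ham and all spam messages (N_Ham and N_Spam)
ham_rows = clean[clean["Label"] == "ham"]
spam_rows = clean[clean["Label"] == "spam"]
n_words_ham = ham_rows["SMS"].apply(len).sum()
n_words_spam = spam_rows["SMS"].apply(len).sum()

for word in vocab:
    n_word_ham = ham_rows[word].sum()
    dic_ham[word] = (n_word_ham + alpha) / (n_words_ham + alpha * num_vocab)

    n_word_spam = spam_rows[word].sum()
    dic_spam[word] = (n_word_spam + alpha) / (n_words_spam + alpha * num_vocab)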
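
For reference, the Laplace-smoothed estimate in multinomial Naive Bayes is P(wi|Spam) = (N_wi|Spam + alpha) / (N_Spam + alpha * N_Vocabulary), where N_Spam is the total number of words across all spam messages. The sketch below applies it to made-up counts (none of the numbers come from the SMS dataset) and shows that a word never seen in spam still gets a non-zero probability.

```python
# Hypothetical toy counts, not taken from the SMS dataset.
alpha = 1
n_vocabulary = 4      # distinct words in the toy vocabulary
n_words_spam = 6      # total words across all toy spam messages
word_counts_spam = {"free": 3, "win": 2, "cash": 1, "hello": 0}

# Note the parentheses around the whole denominator.
p_word_given_spam = {
    word: (count + alpha) / (n_words_spam + alpha * n_vocabulary)
    for word, count in word_counts_spam.items()
}
print(p_word_given_spam)
```

Because every word's count is raised by `alpha` and the denominator grows by `alpha * n_vocabulary`, the smoothed probabilities still sum to 1 over the vocabulary.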


# Classifying a New Message

In [22]:
import re

def classify(sms):
    # re.sub is needed here: str.replace would treat "\W" as a literal string
    sms = re.sub(r"\W", " ", sms).lower().split()

    p_ham_given_sms = p_ham
    p_spam_given_sms = p_spam

    for word in sms:
        # Skip words that are not in the training vocabulary
        if word in dic_ham:
            p_ham_given_sms *= dic_ham[word]
        if word in dic_spam:
            p_spam_given_sms *= dic_spam[word]

    print("P(Ham|message):", p_ham_given_sms)
    print("P(Spam|message):", p_spam_given_sms)

    if p_ham_given_sms > p_spam_given_sms:
        print("Label: Ham")
    elif p_ham_given_sms < p_spam_given_sms:
        print("Label: Spam")
    else:
        print("Equal probabilities, have a human classify this.")

In [23]:
classify("WINNER!! This is the secret code to unlock the money: C3421.")

In [None]:
classify("Sounds good, Tom, then see u there")
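
To illustrate how the decision behaves on known and unknown words, here is a self-contained sketch with made-up parameters. `toy_classify`, the priors, and the word probabilities are all hypothetical; in the notebook these values come from the training data.

```python
import re

# Hypothetical toy parameters, chosen only for illustration.
toy_p_spam, toy_p_ham = 0.5, 0.5
toy_dic_spam = {"free": 0.4, "win": 0.3, "see": 0.1, "you": 0.2}
toy_dic_ham = {"free": 0.1, "win": 0.1, "see": 0.4, "you": 0.4}

def toy_classify(sms):
    # Strip punctuation with a regex, lowercase, and tokenise.
    words = re.sub(r"\W", " ", sms).lower().split()
    p_spam_given_sms, p_ham_given_sms = toy_p_spam, toy_p_ham
    for word in words:
        # Words missing from the vocabulary are skipped.
        if word in toy_dic_spam:
            p_spam_given_sms *= toy_dic_spam[word]
        if word in toy_dic_ham:
            p_ham_given_sms *= toy_dic_ham[word]
    if p_ham_given_sms > p_spam_given_sms:
        return "ham"
    if p_spam_given_sms > p_ham_given_sms:
        return "spam"
    return "needs human classification"

print(toy_classify("WIN free tickets!!"))
print(toy_classify("See you, Tom"))
print(toy_classify("Hello there"))
```

A message made up entirely of unknown words leaves both values at their priors, which is why the last call falls through to the human-classification case when the priors are equal.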

In [None]:
# Write a function that returns classification labels instead

def classifier(sms):
    # re.sub is needed here: str.replace would treat "\W" as a literal string
    sms = re.sub(r"\W", " ", sms).lower().split()

    p_ham_given_sms = p_ham
    p_spam_given_sms = p_spam

    for word in sms:
        # Skip words that are not in the training vocabulary
        if word in dic_ham:
            p_ham_given_sms *= dic_ham[word]
        if word in dic_spam:
            p_spam_given_sms *= dic_spam[word]

    # Return lowercase labels so they can be compared with the "Label" column
    if p_ham_given_sms > p_spam_given_sms:
        return "ham"
    elif p_ham_given_sms < p_spam_given_sms:
        return "spam"
    else:
        return "needs human classification"

In [None]:
test["Filter"] = test["SMS"].apply(classifier)
test.head()

In [None]:
# Write a function that calculates accuracy

def same(x,y):
    if x == y:
        return 1
    else:
        return 0

In [None]:
test["New"] = test.apply(lambda x: same(x.Label, x.Filter),axis = 1)
correct = test["New"].sum()

In [None]:
print("Number of correct labels:", correct)

total = len(test)
print("Number of incorrect labels:", total - correct)

accuracy = correct/ total * 100
print("Accuracy of spam filter:", accuracy, "%")

The accuracy is about 98.74%. Our spam filter looked at 1,114 messages that it had not seen in training and classified 1,100 of them correctly.
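
The accuracy can also be computed without a helper function by comparing the two columns directly. The sketch below uses a made-up four-row frame (`toy_test` is hypothetical), so its numbers are unrelated to the result reported above.

```python
import pandas as pd

toy_test = pd.DataFrame({
    "Label":  ["ham", "spam", "ham", "spam"],
    "Filter": ["ham", "spam", "spam", "spam"],
})

# Comparing the columns gives a boolean Series; its mean is the accuracy.
is_correct = toy_test["Label"] == toy_test["Filter"]
toy_accuracy = is_correct.mean() * 100
print("Correct:", is_correct.sum(), "Accuracy:", toy_accuracy, "%")
```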

# Next Steps

In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter reached an accuracy of 98.74% on the test set we used, well above our initial goal of over 80%.

Next steps include:

1. Analyze the 14 messages that were classified incorrectly and try to figure out why the algorithm classified them incorrectly.
2. Make the filtering process more complex by making the algorithm sensitive to letter case.
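
As a starting point for the first step, the misclassified rows can be pulled out with a boolean filter. This is a sketch on a made-up frame; `toy_test` and its contents are hypothetical and only mirror the column names used in this notebook.

```python
import pandas as pd

toy_test = pd.DataFrame({
    "Label":  ["ham", "spam", "ham"],
    "SMS":    ["see you there", "win a free prize", "free for lunch?"],
    "Filter": ["ham", "spam", "spam"],
})

# Keep only rows where the predicted label disagrees with the human label.
misclassified = toy_test[toy_test["Label"] != toy_test["Filter"]]
print(misclassified)
```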