# Spam Filter Exercise

An exercise in building an SMS spam filter with the Naive Bayes algorithm.

## 1) Explore the data

In [3]:
import pandas as pd

data = pd.read_csv('SMSSPamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print(data.shape)
data.head()

(5572, 2)


,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
data['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

In [7]:
# Randomize the dataset
random = data.sample(frac=1, random_state=1)

# Calculate the index to split the data into a training and test set
training_index = round(len(random) * 0.8)

# Split random data into training and test set
training_set = random[:training_index].reset_index(drop=True)
test_set = random[training_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [8]:
training_set['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [9]:
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

After randomizing the data, both sets closely match the proportions of spam and ham in the full data set (about 87% ham and 13% spam).

## 2) Clean the Data

Clean the data by removing punctuation and converting the text to lowercase, so that each message can be characterized by its word counts.

In [10]:
# Showing data before cleaning

training_set.head()

,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [11]:
# Showing data after cleaning

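# regex=True is required in recent pandas versions, where str.replace treats the pattern as a literal by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)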
training_set['SMS'] = training_set['SMS'].str.lower()

training_set.head()

,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
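
To see what the two cleaning steps do to a single message, here is a minimal sketch using the standard `re` module on one of the strings shown above. The helper name `clean` is hypothetical and is not part of the notebook.

```python
import re

def clean(text):
    # Replace every non-word character with a space, then lowercase the result
    return re.sub(r'\W', ' ', text).lower()

cleaned = clean("Yep, by the pretty sculpture")
words = cleaned.split()
print(words)
```

In Python 3, `\W` is Unicode-aware for string patterns, which is why a character such as `ü` survives the cleaning step in the table above.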


## 3) Create a Vocabulary

Create a vocabulary of words used in the messages in our data.

In [12]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)

len(vocabulary)

7783

Note: There are 7,783 unique words in the vocabulary.

## 4) Finish Creating the Training Set

Transform the data using the vocabulary created above.

In [17]:
# Create a dictionary for word counts
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

# Create a new data set with the dictionary showing the word count in each message
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

,airport,plans,eating,jaykwon,increments,flung,themob,pest,mmmmm,advisors,...,fifteen,teeth,3100,ron,grooved,daywith,openin,spoken,yor,witot
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
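
Because the vocabulary is large, the preview above shows only zeros. The following sketch repeats the same transformation on two made-up messages so the non-zero counts are visible. The names `toy_sms`, `toy_vocabulary`, and `toy_counts` are assumptions for illustration only.

```python
# Two made-up, already tokenized messages (not taken from the data set)
toy_sms = [['secret', 'prize', 'claim', 'now'],
           ['coming', 'to', 'my', 'secret', 'party']]

# Unique words across all messages, sorted only to make the output stable
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# One list of per-message counts for every word, initialized to zero
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}

for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

for word in toy_vocabulary:
    print(word, toy_counts[word])
```

Each key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`, as in the cell above.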


In [18]:
# Add back the original Label and SMS

training_set_final = pd.concat([training_set, word_counts], axis=1)
training_set_final.head()

,Label,SMS,airport,plans,eating,jaykwon,increments,flung,themob,pest,...,fifteen,teeth,3100,ron,grooved,daywith,openin,spoken,yor,witot
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 5) Calculate the Constants

Find the probability that a message is Spam or Ham. Then find the number of words in all Spam messages, in all Ham messages, and in the full vocabulary.

In [19]:
# Isolate Ham and Spam messages from the data
spam = training_set_final[training_set_final['Label'] == 'spam']
ham = training_set_final[training_set_final['Label'] == 'ham']

# Calculate the probabilities: P(Spam) and P(Ham)
p_spam = len(spam) / len(training_set_final)
p_ham = len(ham) / len(training_set_final)

# Calculate Nspam
n_words_spam = spam['SMS'].apply(len)
n_spam = n_words_spam.sum()

# Calculate Nham
n_words_ham = ham['SMS'].apply(len)
n_ham = n_words_ham.sum()

# Calculate Nvocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

## 6) Calculate the Parameters

Next, find the probability of each word in the vocabulary given that a message is Spam, and given that it is Ham.

In [20]:
# Initiate a dictionary of all words in the vocabulary for spam and ham probabilities
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Loop through the vocabulary to compute P(word|Spam) and P(word|Ham) with Laplace smoothing
for word in vocabulary:
    n_word_given_spam = spam[word].sum() 
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham[word].sum() 
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)
    parameters_ham[word] = p_word_given_ham
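
The effect of the smoothing term is easier to see with small numbers. This sketch applies the same formula as the loop above to toy counts. The function name `smoothed_probability` and the numbers are assumptions made for illustration.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    # (count of the word in the class + alpha) /
    # (total words in the class + alpha * vocabulary size)
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Toy numbers: a class containing 10 words in total and a 5-word vocabulary
p_unseen = smoothed_probability(0, 10, 5)               # word never seen in this class
p_seen = smoothed_probability(3, 10, 5)                 # word seen 3 times in this class
p_unsmoothed = smoothed_probability(0, 10, 5, alpha=0)  # same unseen word, no smoothing

print(p_unseen, p_seen, p_unsmoothed)
```

Without smoothing, a word that never appears in one class gets probability 0, and a single such word would zero out the whole product for that class. With `alpha = 1`, the unseen word keeps a small non-zero probability.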

## 7) Create a function to classify a message as Spam or Ham

Create a function that takes a message (text string) as input and classifies it as Spam, Ham, or even odds (requires a human to classify). 

In [21]:
import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [22]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [23]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
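
The products above are already on the order of 1e-25 for short messages. For much longer messages, multiplying many small probabilities can underflow to `0.0` for both classes. A common workaround, not used in this notebook, is to compare sums of logarithms instead. The sketch below uses made-up priors and parameters, and the helper name `log_score` is hypothetical.

```python
import math

# Toy values for illustration only; in the notebook these come from the training set
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'secret': 0.02, 'money': 0.03, 'see': 0.001}
toy_parameters_ham = {'secret': 0.002, 'money': 0.001, 'see': 0.02}

def log_score(words, prior, parameters):
    # The sum of logs equals the log of the product, but stays in a safe numeric range
    score = math.log(prior)
    for word in words:
        if word in parameters:
            score += math.log(parameters[word])
    return score

words = ['secret', 'money'] * 200  # a deliberately long message

# Plain multiplication, as in classify(), underflows to 0.0 here
plain_spam = toy_p_spam
for word in words:
    plain_spam *= toy_parameters_spam[word]

log_spam = log_score(words, toy_p_spam, toy_parameters_spam)
log_ham = log_score(words, toy_p_ham, toy_parameters_ham)
label = 'spam' if log_spam > log_ham else 'ham'

print(plain_spam, log_spam, log_ham, label)
```

Because the logarithm is monotonic, comparing the two log scores gives the same label as comparing the products whenever the products are representable.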


## 8) Measure the accuracy of the function as a spam filter

First, create a version of the classify function that returns its label instead of printing it, so the result can be stored in the dataframe. Then measure how well the function performs by comparing each predicted label with the actual label in the test set.

In [25]:
# Modified version of the function to return a label we can add to our dataframe

def classify_test_set(message):    
    '''
    message: a string
    '''
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [26]:
# Apply the function to our test set

test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head(10)

,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham
5,ham,But my family not responding for anything. Now...,ham
6,ham,U too...,ham
7,ham,Boo what time u get out? U were supposed to ta...,ham
8,ham,Genius what's up. How your brother. Pls send h...,ham
9,ham,I liked the new mobile,ham


In [29]:
# Extract the number of correct predictions and the accuracy

correct = 0
total = test_set.shape[0]

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct / total)
    

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
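
Since ham makes up about 87% of the messages, a single accuracy figure does not show which class the errors fall in. One way to break the result down is to count each (actual, predicted) pair. The sketch below uses made-up labels, and the names `toy_labels` and `toy_predicted` are hypothetical. In the notebook, the two lists would be the `Label` and `predicted` columns of `test_set`.

```python
from collections import Counter

# Made-up labels for illustration only
toy_labels = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
toy_predicted = ['ham', 'ham', 'spam', 'ham', 'ham', 'needs human classification']

# Count how often each (actual, predicted) combination occurs
pair_counts = Counter(zip(toy_labels, toy_predicted))

correct = sum(count for (actual, predicted), count in pair_counts.items()
              if actual == predicted)
accuracy = correct / len(toy_labels)

for (actual, predicted), count in sorted(pair_counts.items()):
    print(f'actual={actual}, predicted={predicted}: {count}')
print('Accuracy:', accuracy)
```

Pairs where the two labels differ show whether the filter tends to let spam through or to flag legitimate messages.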


## 9) Conclusion

The spam filter is 98.7% accurate on the test data (1,100 of 1,114 messages classified correctly), which is well above our goal of 80%.

In [33]:
unpredictable = []
predictable = []

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        predictable.append(row['SMS'])
    else:
        unpredictable.append(row['SMS'])
        

In [34]:
print(unpredictable)

['Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net', "More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB", 'Unlimited texts. Limited minutes.', '26th OF JULY', 'Nokia phone is lovly..', 'A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked \'hw cn u run so fast?\' D boy replied "Boost is d secret of my energy" n instantly d girl shouted "our energy" n Thy lived happily 2gthr drinking boost evrydy Moral of d story:- I hv free msgs:D;): gud ni8', 'No calls..messages..missed calls', 'We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us', "Oh my god!