# Spam filter

### Goal: Develop a program to determine whether an SMS message is spam or ham (not spam) using the Naive Bayes algorithm

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("SMSSpamCollection", sep= "\t", header= None, names= ["Label", "SMS"])
data

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [4]:
data.describe()

Unnamed: 0,Label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [5]:
data["Label"].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

In [6]:
# percentage of messages labelled "spam" in dataset

(747/5572)*100

13.406317300789663
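
The percentage above is typed in by hand from the `value_counts()` output. As a small sketch, the same share can be derived from the counts themselves. `label_counts` is a hypothetical plain-dict stand-in for `data["Label"].value_counts()`, filled with the two counts shown above.

```python
# Hypothetical stand-in for data["Label"].value_counts(); counts copied from the output above.
label_counts = {"ham": 4825, "spam": 747}

total = sum(label_counts.values())

# Share of each label as a percentage of all messages.
label_share = {label: count / total * 100 for label, count in label_counts.items()}

print(round(label_share["spam"], 1))
print(round(label_share["ham"], 1))
```

The ham share (about 86.6%) is also the accuracy a filter would reach by labelling every message as ham, which makes it a useful reference point for any accuracy target.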

In [7]:
data["SMS"].value_counts()

Sorry, I'll call later                                                                                                                                                 30
I cant pick the phone right now. Pls send a message                                                                                                                    12
Ok...                                                                                                                                                                  10
Say this slowly.? GOD,I LOVE YOU &amp; I NEED YOU,CLEAN MY HEART WITH YOUR BLOOD.Send this to Ten special people &amp; u c miracle tomorrow, do it,pls,pls do it...     4
Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash or £5000 prize!                     4
                                                                                                                                                      

### Initial comments on dataset

- 5572 rows, 2 columns
- No missing values
- 13.4% of messages are spam (747 of 5572)
- Most common SMS is "Sorry, I'll call later", with 30 occurrences

### Revised goal: Create a spam filter with > 80% accuracy

In [8]:
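# shuffles data once, then splits into train and test sets (roughly 80:20)
# note: 5522 looks like a typo for 5572; it is kept because round(5522*0.2) = 1104 is the test size behind the outputs below

shuffled = data.sample(frac= 1, random_state= 1)
n_test = round(5522*0.2)
train = shuffled[n_test:].reset_index()
test = shuffled[:n_test].reset_index()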

In [9]:
# percentage of spam messages in train split

train["Label"].value_counts()["spam"]/ len(train) *100

13.473589973142344

In [10]:
# percentage of spam messages in test split

test["Label"].value_counts()["spam"]/ len(test) *100

13.134057971014492

In [11]:
train

Unnamed: 0,index,Label,SMS
0,1540,ham,You're not sure that I'm not trying to make xa...
1,3017,ham,"&lt;#&gt; is fast approaching. So, Wish u a v..."
2,2677,ham,* Am on a train back from northampton so i'm a...
3,4834,spam,"New Mobiles from 2004, MUST GO! Txt: NOKIA to ..."
4,5283,ham,"Yeah, probably here for a while"
...,...,...,...
4463,905,ham,"We're all getting worried over here, derek and..."
4464,5192,ham,Oh oh... Den muz change plan liao... Go back h...
4465,3980,ham,CERI U REBEL! SWEET DREAMZ ME LITTLE BUDDY!! C...
4466,235,spam,Text & meet someone sexy today. U can find a d...


In [12]:
# removes punctuation marks & symbols and sets all text to lower case
# regex=True is passed explicitly so that "\W" is treated as a pattern, not as literal text

train["SMS"] = train["SMS"].str.replace("'", "")
train["SMS"] = train["SMS"].str.replace(r"\W", " ", regex=True)
train["SMS"] = train["SMS"].str.lower()
train["SMS"]

0       youre not sure that im not trying to make xavi...
1        lt   gt   is fast approaching  so  wish u a v...
2         am on a train back from northampton so im af...
3       new mobiles from 2004  must go  txt  nokia to ...
4                         yeah  probably here for a while
                              ...                        
4463    were all getting worried over here  derek and ...
4464    oh oh    den muz change plan liao    go back h...
4465    ceri u rebel  sweet dreamz me little buddy   c...
4466    text   meet someone sexy today  u can find a d...
4467                              k k   sms chat with me 
Name: SMS, Length: 4468, dtype: object

In [13]:
# turns each "SMS" string into a list of strings (i.e. a list of words)

train["SMS"] = train["SMS"].str.split()
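
The cleaning and splitting steps above could also be collected into a single helper, so that exactly the same rules can later be applied to unseen messages. This is a minimal sketch using only the standard library. The name `tokenize` is an assumption and does not appear elsewhere in this notebook.

```python
import re

def tokenize(text):
    """Clean a raw SMS string and return its words as a list."""
    text = text.replace("'", "")      # drop apostrophes: "you're" -> "youre"
    text = re.sub(r"\W", " ", text)   # replace every other non-word character with a space
    return text.lower().split()       # lower-case, then split on whitespace

print(tokenize("You're not sure, OK?"))
# ['youre', 'not', 'sure', 'ok']
```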

In [14]:
# creates list of all words from all SMS messages
vocabulary = []

for list_ in train["SMS"]:
    for word in list_:
        vocabulary.append(word)

In [15]:
# drop duplicates from "vocabulary" list

vocabulary = set(vocabulary)
vocabulary = list(vocabulary)

vocabulary

['way2sms',
 'crave',
 'hwd',
 'opportunity',
 'psychologist',
 'day',
 'pobox365o4w45wq',
 'wrnog',
 'geeee',
 'pack',
 'saying',
 'path',
 'echo',
 'taught',
 '1hr',
 'sexiest',
 'problematic',
 'nbme',
 'based',
 'fights',
 '08719899229',
 'things',
 'drizzling',
 'supply',
 'said',
 'tree',
 'prompts',
 'gettin',
 'vewy',
 'description',
 'lemme',
 'rcd',
 'white',
 'reservations',
 'cheaper',
 'lac',
 'voice',
 'buffy',
 'nit',
 'nus',
 'weasels',
 'supposed',
 'spreadsheet',
 'oath',
 'losers',
 'meaningful',
 'individual',
 'adp',
 'ad',
 'ing',
 'soul',
 'erything',
 'jenxxx',
 '2watershd',
 'jeetey',
 'reasonable',
 'nooooooo',
 'buzzzz',
 'mileage',
 'broth',
 'replace',
 'dlf',
 'touch',
 '1843',
 'practical',
 '2000',
 'ahhh',
 'close',
 'bambling',
 '09061743806',
 'movietrivia',
 'doubt',
 'tarpon',
 'phone750',
 'careless',
 'search',
 'shared',
 'vry',
 'needa',
 'hook',
 'systems',
 '99',
 'specialise',
 'outta',
 'system',
 'meg',
 'meets',
 'realise',
 'arrival',
 'e

In [16]:
# template dictionary: every vocabulary word mapped to a count of 0
base_lib = {}

for word in vocabulary:
    base_lib[word] = 0

In [17]:
# counts how often each vocabulary word occurs in each training message
messages = []

for message in train["SMS"]:
    new_lib = base_lib.copy()

    for word in message:
        new_lib[word] += 1

    messages.append(new_lib)
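
Copying a full vocabulary-sized dictionary for every message works, but it stores thousands of zeros per row. As an alternative sketch, `collections.Counter` records only the words that actually occur. `toy_messages` is made-up example data standing in for `train["SMS"]`.

```python
from collections import Counter

# Made-up stand-in for train["SMS"] after cleaning and splitting.
toy_messages = [
    ["free", "entry", "win", "free", "prize"],
    ["ok", "see", "you", "later"],
]

# One Counter per message; words that do not occur are simply absent.
word_counts = [Counter(message) for message in toy_messages]

print(word_counts[0]["free"])   # 2
print(word_counts[1]["free"])   # 0 (missing keys count as zero)
```

If a table like `binary_columns` is still needed, the absent entries would have to be filled with 0 afterwards, for example with `fillna(0)`.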

In [18]:
binary_columns = pd.DataFrame(messages)
binary_columns

Unnamed: 0,way2sms,crave,hwd,opportunity,psychologist,day,pobox365o4w45wq,wrnog,geeee,pack,...,medical,bite,throw,fed,accordingly,shakespeare,hurry,lucy,lionp,08718727870
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,6,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4464,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4465,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4466,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
train = pd.concat([train, binary_columns], axis= 1)
train

Unnamed: 0,index,Label,SMS,way2sms,crave,hwd,opportunity,psychologist,day,pobox365o4w45wq,...,medical,bite,throw,fed,accordingly,shakespeare,hurry,lucy,lionp,08718727870
0,1540,ham,"[youre, not, sure, that, im, not, trying, to, ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3017,ham,"[lt, gt, is, fast, approaching, so, wish, u, a...",0,0,0,0,0,6,0,...,0,0,0,0,0,0,0,0,0,0
2,2677,ham,"[am, on, a, train, back, from, northampton, so...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4834,spam,"[new, mobiles, from, 2004, must, go, txt, noki...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5283,ham,"[yeah, probably, here, for, a, while]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4463,905,ham,"[were, all, getting, worried, over, here, dere...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4464,5192,ham,"[oh, oh, den, muz, change, plan, liao, go, bac...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4465,3980,ham,"[ceri, u, rebel, sweet, dreamz, me, little, bu...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4466,235,spam,"[text, meet, someone, sexy, today, u, can, fin...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# prior probabilities P(Spam) and P(Ham), estimated from the training labels
p_spam = train["Label"].value_counts()["spam"] / len(train)
p_ham = train["Label"].value_counts()["ham"] / len(train)

In [21]:
# total number of words across all spam messages
n_spam = 0

for message in train[train["Label"] == "spam"]["SMS"]:
    n_spam += len(message)

n_spam

15242

In [22]:
# total number of words across all ham messages
n_ham = 0

for message in train[train["Label"] == "ham"]["SMS"]:
    n_ham += len(message)

n_ham

55569

In [24]:
n_vocabulary = len(vocabulary)
n_vocabulary

7833

In [25]:
# Laplace smoothing parameter
alpha = 1

In [27]:
spam_dict = base_lib.copy()
ham_dict = base_lib.copy()

In [29]:
spam_df = train[train["Label"] == "spam"]
ham_df = train[train["Label"] == "ham"]

In [32]:
# P(word|Spam) and P(word|Ham) for every vocabulary word, with Laplace smoothing
for word in vocabulary:
    n_w_spam = spam_df[word].sum()
    n_w_ham = ham_df[word].sum()

    p_w_spam = (n_w_spam + alpha)/(n_spam + (alpha * n_vocabulary))
    p_w_ham = (n_w_ham + alpha)/(n_ham + (alpha * n_vocabulary))

    spam_dict[word] = p_w_spam
    ham_dict[word] = p_w_ham
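
Written out, the loop above applies Laplace (additive) smoothing to each word. The symbols map onto the variables `n_w_spam`, `n_w_ham`, `n_spam`, `n_ham`, `n_vocabulary` and `alpha`:

$$
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}},
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

With the values computed above ($N_{\text{Spam}} = 15242$, $N_{\text{Vocabulary}} = 7833$, $\alpha = 1$), a word that never occurs in a spam message still gets $P(w \mid \text{Spam}) = 1 / 23075 \approx 4.3 \times 10^{-5}$ rather than 0.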

In [42]:
# sanity check: smoothing should leave no word with a zero probability
for word, probability in ham_dict.items():
    if probability == 0:
        print("Yes")

In [55]:
import re

def classify(message):
    # note: apostrophes are not stripped here as they were for the training data,
    # so "don't" is split into "don" and "t" instead of becoming "dont"
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # starts from the priors; the results are proportional to, not equal to, the posteriors
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # words outside the training vocabulary are skipped
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]

        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return "ham"
    elif p_ham_given_message < p_spam_given_message:
        return "spam"
    else:
        return "needs human classification"
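
Multiplying many probabilities below 1 quickly produces extremely small numbers, and for long enough messages the product can underflow to 0.0. A common workaround is to add logarithms instead of multiplying. The sketch below is one possible variant, not the function used in this notebook. `classify_log` and the toy probability tables are assumptions made for illustration.

```python
import math

def classify_log(words, p_spam, p_ham, spam_probs, ham_probs):
    """Compare log P(Spam) + sum(log P(w|Spam)) with the ham equivalent."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    for word in words:
        # As in classify(), words missing from the tables are skipped.
        if word in spam_probs:
            log_spam += math.log(spam_probs[word])
        if word in ham_probs:
            log_ham += math.log(ham_probs[word])

    if log_ham > log_spam:
        return "ham"
    elif log_ham < log_spam:
        return "spam"
    return "needs human classification"

# Toy values, made up for this example only.
toy_spam_probs = {"free": 0.05, "prize": 0.04, "later": 0.001}
toy_ham_probs = {"free": 0.002, "prize": 0.001, "later": 0.03}

print(classify_log(["free", "prize"], 0.13, 0.87, toy_spam_probs, toy_ham_probs))
print(classify_log(["later"], 0.13, 0.87, toy_spam_probs, toy_ham_probs))
```

Because the logarithm is monotonic, comparing the sums gives the same decision as comparing the products whenever the products are still representable as floats.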

In [57]:
test['predicted'] = test['SMS'].apply(classify)
test.head()

P(Spam|message): 2.2464775886040146e-15
P(Ham|message): 7.50508068181119e-13
P(Spam|message): 2.724365505054532e-29
P(Ham|message): 1.9103860704926276e-24
P(Spam|message): 1.0966251397723214e-14
P(Ham|message): 4.155480317777686e-12
P(Spam|message): 5.839042241881839e-06
P(Ham|message): 0.000504948924480889
P(Spam|message): 2.4106460030422984e-90
P(Ham|message): 1.9169018464997302e-75
...

Unnamed: 0,index,Label,SMS,predicted
0,1078,ham,"Yep, by the pretty sculpture",ham
1,4028,ham,"Yes, princess. Are you going to make me moan?",ham
2,958,ham,Welp apparently he retired,ham
3,4642,ham,Havent.,ham
4,4674,ham,I forgot 2 ask ü all smth.. There's a card on ...,ham


In [66]:
test.shape[0]

1104

In [64]:
test[test["Label"] == test["predicted"]].shape[0]

1091

In [62]:
1091/1104

0.9882246376811594
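
Since only about 13% of messages are spam, accuracy alone does not show how many spam messages slip through. The sketch below counts the individual outcomes and derives spam recall and precision. The two short lists are made-up stand-ins for `test["Label"]` and `test["predicted"]`, so the numbers are illustrative and are not results from this notebook.

```python
from collections import Counter

# Made-up labels for illustration; in the notebook these would come from
# test["Label"] and test["predicted"].
actual    = ["ham", "ham", "spam", "spam", "ham", "spam", "ham", "ham"]
predicted = ["ham", "ham", "spam", "ham",  "ham", "spam", "spam", "ham"]

# Count every (actual, predicted) combination.
outcomes = Counter(zip(actual, predicted))

true_spam = outcomes[("spam", "spam")]    # spam correctly flagged
missed_spam = outcomes[("spam", "ham")]   # spam that reached the inbox
false_alarm = outcomes[("ham", "spam")]   # ham wrongly flagged as spam

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
recall = true_spam / (true_spam + missed_spam)
precision = true_spam / (true_spam + false_alarm)

print(accuracy, recall, precision)
```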

### Spam filter accuracy: 98.8% (1091 of 1104 test messages classified correctly)