Notebook borrowed and modified from https://www.kdnuggets.com/2020/07/spam-filter-python-naive-bayes-scratch.html

## Implementing Naive Bayes Classifier for Spam Classification
=> The dataset of spam/ham SMS messages can be found [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) in the UCI Machine Learning Repository, or in the GitHub directory.

### Start by Studying the Dataset
We will be using pandas, a popular data analysis module. Feel free to consult the pandas documentation or just google the methods for more detail.

In [1]:
import pandas as pd

# load the dataset: give it the file, the separator the file uses (this can differ: , ; \t ...) and the names of the columns as a list
sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

# the shape attribute gives the dimensions of the table as (rows, columns) => always in that order
sms_spam.shape

(5572, 2)

In [2]:
# head() lets you see the first 5 rows of your dataframe (ie. table)
sms_spam.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
# can select a column's elements just like selecting a key in a dictionary: sms_spam['Label']
# value_counts counts how many times each unique value occurs (so n° of spam and ham), and the normalize
# argument means you want proportions (fractions that sum to 1) returned instead of raw counts

sms_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64
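
To make the effect of `normalize` concrete, here is a minimal sketch on a made-up four-row column. The toy `labels` Series is an assumption for illustration and is not part of the dataset.

```python
import pandas as pd

# Toy column, invented for illustration: 3 ham and 1 spam
labels = pd.Series(['ham', 'ham', 'ham', 'spam'], name='Label')

# Raw counts of each unique value
counts = labels.value_counts()

# The same counts divided by the total number of rows
proportions = labels.value_counts(normalize=True)

print(counts['ham'], counts['spam'])
print(proportions['ham'], proportions['spam'])
```

The proportions always sum to 1, which makes them easier to compare across tables of different sizes.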

### Splitting in train and test
When doing machine learning you want to estimate parameters on one part of the data and see how 
well your model performs on another part, so that the test is fair: the model will
not have seen the test examples before.
We'll use 80% of the data for training and the remaining 20% for testing (a common way of splitting).

In [4]:
# Randomize the dataset
# sample selects rows of the dataset at random, without replacement
# passing 1 as frac means sampling the whole dataset (ie. shuffling every row) vs. only a portion,
# so the overall distribution of spam and ham stays the same
# random_state=1 makes the shuffle reproducible
data_randomized = sms_spam.sample(frac=1, random_state=1)
data_randomized.head()

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [5]:
# Calculate index for split
# select up to which row the examples will be for training
# round function rounds to the nearest integer (exact halves such as 2.5 go to the nearest even integer)
training_test_index = round(len(data_randomized) * 0.8)

# Split into training and test sets
# reset_index renumbers the rows from 0 to n-1 (n = number of rows)
# drop=True means there is no need to create a column with the original indexes
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)
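
As a quick check of how `round()` behaves here, the sketch below uses only built-in Python. The row count 5572 is the one reported by `shape` earlier; nothing else is assumed.

```python
# Python's built-in round() sends exact halves to the nearest even integer
halves = [round(0.5), round(1.5), round(2.5)]
print(halves)  # [0, 2, 2]

# Split index for a table with 5572 rows
n_rows = 5572
split_index = round(n_rows * 0.8)

# Number of training rows, then number of test rows
print(split_index, n_rows - split_index)  # 4458 1114
```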

In [6]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


We can see in the next 2 cells that the proportion of spam vs. ham is similar in both sets, which is what we want so both datasets are representative.  

In [7]:
training_set['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [8]:
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

### Data Preprocessing/Cleaning

In [9]:
# before preprocess
training_set.head(3)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired


In [10]:
# can carry out actions on all the data in a column at the same time
# replace => with the regex argument, replaces every match of a regex pattern inside text values
# if the data is text, can use .str to access methods usually applied to python strings, such as lower()
# in this case.

training_set['SMS'] = training_set['SMS'].replace(regex=r'\W', value=' ') # Replaces every non-word character (punctuation, symbols...) with a space
training_set['SMS'] = training_set['SMS'].str.lower()

# after preprocess
training_set.head(3)

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired


In [11]:
# can split each message into a list of words using the same .str logic as previously
training_set['SMS'] = training_set['SMS'].str.split()
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."
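
The same three steps (replace non-word characters, lowercase, split) can be tried on a single string with plain Python. This is a minimal sketch; the helper name `tokenize` is hypothetical and does not appear in the notebook.

```python
import re

def tokenize(message: str) -> list:
    """Replace non-word characters with spaces, lowercase, split on whitespace."""
    return re.sub(r'\W', ' ', message).lower().split()

print(tokenize("Yep, by the pretty sculpture"))
# ['yep', 'by', 'the', 'pretty', 'sculpture']

# The apostrophe is a non-word character, so "There's" becomes two tokens
print(tokenize("There's a card"))
# ['there', 's', 'a', 'card']
```

This explains the lone `s` token visible in row 4 of the table above.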


### Creating the Vocabulary

In [12]:
vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)

# get list of unique words
vocabulary = list(set(vocabulary))

In [13]:
# alternative using list comprehension
vocabulary = list(set([word for sms in training_set['SMS'] for word in sms]))  

In [14]:
len(vocabulary)

7783
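
One thing to be aware of: the order of `list(set(...))` is not guaranteed and can change between runs, because Python randomizes string hashing by default. If a stable word order matters, one option is to sort the vocabulary. The sketch below uses invented toy messages and a hypothetical `toy_vocabulary` name.

```python
# Toy tokenized messages, invented for illustration
messages = [['yep', 'by', 'the', 'pretty', 'sculpture'],
            ['welp', 'apparently', 'he', 'retired'],
            ['by', 'the', 'way']]

# A set comprehension removes duplicates; sorted() fixes the order
toy_vocabulary = sorted({word for sms in messages for word in sms})

print(len(toy_vocabulary))  # 10
print(toy_vocabulary[:3])   # ['apparently', 'by', 'he']
```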

### The Final Training Set

We're now going to use the vocabulary we just created to transform the data.  
Each SMS will be represented by the counts of the words it contains. This is a popular way of representing sentences in order to then feed them to any sort of algorithm.  
For now, however, we will only use this table to count how often each word appears in spam and in ham.
As an example: 

![Vectorize](vectorize.png)



To create a table such as the one above, we can first build a dictionary and then use pandas to transform it into a dataframe :

In [15]:
word_counts_per_sms = {'secret': [2,1,1],
                       'prize': [2,0,1],
                       'claim': [1,0,1],
                       'now': [1,0,1],
                       'coming': [0,1,0],
                       'to': [0,1,0],
                       'my': [0,1,0],
                       'party': [0,1,0],
                       'winner': [0,0,1]
                      }

word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,secret,prize,claim,now,coming,to,my,party,winner
0,2,2,1,1,0,0,0,0,0
1,1,0,0,0,1,1,1,1,0
2,1,1,1,1,0,0,0,0,1


To create the dictionary we need for our training set, we can use the code below:

- We start by initializing a dictionary named word_counts_per_sms, where each key is a unique word (a string) from the vocabulary, and each value is a list of the length of the training set, where each element in that list is a 0.
    - The code [0] * 5 outputs [0, 0, 0, 0, 0]. So the code [0] * len(training_set['SMS']) outputs a list of the length of training_set['SMS'].
- We loop over training_set['SMS'] using the enumerate() function to get both the index and the SMS message (index and sms).
    - Using a nested loop, we loop over sms (where sms is a list of strings, where each string represents a word in a message).
        - We increment word_counts_per_sms[word][index] by 1.


In [16]:
# {'car' : [0,0,0,0...], 'money':[0,0,0,0...] ...}
# each position corresponds to the index of a training example (our rows in the training_set table)
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        # [word] gets the key we want and [index] then locates the index of the example sms we're looking at.
        word_counts_per_sms[word][index] += 1
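
To see the same loop on something small, here is a sketch on three toy messages. The messages are invented so that the counts match the example table shown earlier; the names `toy_messages`, `toy_vocabulary` and `toy_counts` are hypothetical.

```python
# Toy tokenized messages, invented so that the counts match the example table
toy_messages = [
    ['secret', 'prize', 'claim', 'now', 'secret', 'prize'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['winner', 'claim', 'secret', 'prize', 'now'],
]
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per word, with one slot per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

print(toy_counts['secret'])  # [2, 1, 1]
print(toy_counts['prize'])   # [2, 0, 1]
print(toy_counts['winner'])  # [0, 0, 1]
```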

In [17]:
# can then create a dataframe where each key corresponds to a column name and the values are the rows in the column
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,simply,annoying,commercial,tarot,cardiff,sum,tap,velly,wins,accounts,...,feeling,hun,cherthala,charged,stressful,mi,b4280703,outfit,arsenal,your
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# now we can concatenate the previous tables to have labels, sms and the counts for each vocab word in the same table
training_set_clean = pd.concat([training_set, word_counts], axis=1) # axis=1 add columns | axis=0 add rows
training_set_clean.head()

Unnamed: 0,Label,SMS,simply,annoying,commercial,tarot,cardiff,sum,tap,velly,...,feeling,hun,cherthala,charged,stressful,mi,b4280703,outfit,arsenal,your
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Calculating Constants First

In [19]:
# Isolating spam and ham messages first
# make mini tables (dataframes) with only spam or ham sms
# syntax -> dataframe[condition]
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

In [20]:
# P(Spam) and P(Ham)
# count(spams) / count(total n° of sms)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

In [21]:
# N_Spam
# returns a column (Series) with the length of each sms, ie. its number of words
# apply allows us to apply any function to each element of the column
# here we just use the len function to get the sms lengths (each sms is a list of words)
n_words_per_spam_message = spam_messages['SMS'].apply(len)

# can then sum over the entire column to get the total number of words in spam
n_spam = n_words_per_spam_message.sum()

In [22]:
# N_Ham
# same as N_Spam
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

In [23]:
# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

### Calculating Parameters

In [24]:
# Initiate parameters
# create dictionaries where each key is a word with value 0 for each word in vocab
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

In [25]:
# Calculate parameters
for word in vocabulary:
    
    # look up the column for the word in the vocab in spam messages (our mini dataframe with only spams)
    # sum up all of the values in the column to get the number of times this particular word appeared in
    # spam
    n_word_given_spam = spam_messages[word].sum() 
    
    # even if word appears 0 times (so n_word_given_spam=0) in spam and only appears in ham sms, 
    # it will have a count of 1 (the value of alpha)
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    
    # Same for ham
    n_word_given_ham = ham_messages[word].sum() # ham_messages already defined
    # Same
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
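
The smoothing formula used in the loop can be checked by hand on a tiny case. The sketch below uses invented counts and exact fractions; the `toy_` names are hypothetical, and `party` stands in for a word that was only ever seen in ham.

```python
from fractions import Fraction

# Toy numbers, invented for illustration
alpha = 1
toy_word_counts_in_spam = {'secret': 4, 'prize': 3, 'party': 0}
n_spam = sum(toy_word_counts_in_spam.values())   # 7 words across all spam messages
n_vocabulary = len(toy_word_counts_in_spam)      # 3 distinct words

# P(word|Spam) = (count + alpha) / (n_spam + alpha * n_vocabulary)
toy_parameters_spam = {
    word: Fraction(count + alpha, n_spam + alpha * n_vocabulary)
    for word, count in toy_word_counts_in_spam.items()
}

print(toy_parameters_spam['party'])        # 1/10, not 0
print(sum(toy_parameters_spam.values()))   # 1
```

Adding `alpha * n_vocabulary` to the denominator is what keeps the smoothed probabilities summing to 1 over the vocabulary.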

### Classifying A New Message


In [26]:
import re

def classify(message: str) -> str:
    """From a message with type string, predict what class it belongs to."""

    # apply the same preprocessing as before on our columns
    # (raw string r'...' so that \W reaches the regex engine without an invalid-escape warning)
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    
    # remember P(spam|sms) is proportional to P(spam) * P(word1|spam) * P(word2|spam)...
    # so start by putting in place P(spam) and P(ham)
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    
    # Then, for each word, look up its proba in parameters dictionaries and multiply by it.
    # If the word was never seen during training (ie. it isn't in the dicts), skip over it.
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham: 
            p_ham_given_message *= parameters_ham[word]

            
    # print each proba        
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    # ham higher
    if p_ham_given_message > p_spam_given_message:
        return 'Label: Ham'
    
    # spam higher
    elif p_spam_given_message > p_ham_given_message:
        return 'Label: Spam'
    
    # too close to call...
    else:
        return 'Equal probabilities, have a human classify this!'

In [27]:
# test it out !
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21


'Label: Ham'

In [28]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27


'Label: Spam'
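
The scores printed above are already very small (around 1e-25). For much longer messages, a product of many such factors can underflow to 0.0 for both classes. A common workaround, which is not what this notebook does, is to add logarithms instead of multiplying probabilities. The sketch below is a hedged illustration: `log_scores` and the toy parameter dictionaries are hypothetical names with invented values.

```python
import math

def log_scores(words, p_spam, p_ham, parameters_spam, parameters_ham):
    """Return (log score for spam, log score for ham) for a list of words."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in parameters_spam:
            log_spam += math.log(parameters_spam[word])
        if word in parameters_ham:
            log_ham += math.log(parameters_ham[word])
    return log_spam, log_ham

# Toy parameters, invented for illustration
toy_spam = {'prize': 0.5, 'hello': 0.1}
toy_ham = {'prize': 0.1, 'hello': 0.5}

log_spam, log_ham = log_scores(['prize', 'prize', 'unseen'], 0.5, 0.5, toy_spam, toy_ham)
print('spam' if log_spam > log_ham else 'ham')  # spam

# A plain product of 400 factors of 0.1 underflows; the sum of logs stays finite
product = 1.0
for _ in range(400):
    product *= 0.1
print(product)               # 0.0
print(400 * math.log(0.1))
```

Because the logarithm is increasing, comparing the two log scores gives the same decision as comparing the two products whenever the products do not underflow.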

### Testing our Model

In [29]:
# same function as classify(message) but just doesn't print the probas out
def classify_test_set(message: str) -> str:

    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Create a column called predicted which holds the model's prediction for each test SMS.

In [30]:
# can just apply our test function to the column of sms
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)

In [31]:
# correct predictions : rows where predicted and Label are the same
correct = (test_set['predicted'] == test_set['Label'])
correct

0       True
1       True
2       True
3       True
4       True
        ... 
1109    True
1110    True
1111    True
1112    True
1113    True
Length: 1114, dtype: bool

In [32]:
# can sum over all the rows where the condition is True to get count(correct predictions)
correct = correct.sum()

In [33]:
# shape[0] => number of rows, ie. number of examples
# remember shape gives the dimensions of the table (n°rows, n°columns)
accuracy = correct / test_set.shape[0]

In [34]:
print(f'{correct=}')
print(f'incorrect={test_set.shape[0] - correct}')
print(f'{accuracy=}')

correct=1100
incorrect=14
accuracy=0.9874326750448833
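
Since roughly 87% of the messages are ham, a model that always answered ham would already score about 87% accuracy, so it can help to look at which kinds of mistakes are made. The sketch below counts every (actual, predicted) pair on invented toy labels; it does not use the notebook's test set, and all names are hypothetical.

```python
from collections import Counter

# Toy labels, invented for illustration (not the notebook's test set)
actual    = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
predicted = ['ham', 'ham', 'spam', 'ham',  'ham', 'needs human classification']

# Count every (actual, predicted) combination
pair_counts = Counter(zip(actual, predicted))
for (true_label, predicted_label), count in sorted(pair_counts.items()):
    print(f'{true_label:>4} -> {predicted_label}: {count}')

# Accuracy is the share of pairs where both labels agree
n_correct = sum(a == p for a, p in zip(actual, predicted))
toy_accuracy = n_correct / len(actual)
print(n_correct, len(actual))  # 4 6
```

On the real data, the same counting could be applied to `test_set['Label']` and `test_set['predicted']`.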
