## SMS Spam Detection with Naive Bayes Classifier

**Dataset**

* SMS Spam Collection Dataset: https://archive.ics.uci.edu/dataset/228/sms+spam+collection

<b>Model design: </b><br>

Initial steps: 

1. Load the dataset into a dataframe.
2. Split the data into an 80% training set and a 20% test set.
<br>
<br>

Training: 

1. Load the training data into a dataframe.
2. Preprocess and tokenize the training data.
3. Split the ham and spam classes into two separate dataframes.
4. Count the number of documents in the ham class $N_{h}$ and the spam class $N_{s}$.
5. Estimate the prior probability of the ham class $\widehat{P}(H)$ and the spam class $\widehat{P}(S)$.

$$ \widehat{P}(H) =  \frac{N_{h}}{N_{h} + N_{s}} $$
$$ \widehat{P}(S) =  \frac{N_{s}}{N_{h} + N_{s}} $$

6. Build a vocabulary for each class and create the bag-of-words matrix of each class from the training data.
7. For each word in each class vocabulary, count the documents of that class that contain it and estimate the probability of the word given the class.

$$ n_{h}(w) = \text{number of ham documents that contain word } w $$ 
$$ n_{s}(w) = \text{number of spam documents that contain word } w $$
$$ \widehat{P}(w|H) =  \frac{n_{h}(w)}{N_{h}} $$
$$ \widehat{P}(w|S) =  \frac{n_{s}(w)}{N_{s}} $$

8. Merge the ham and spam dataframes, keeping each word's count and probability for both classes.
9. Fill the missing values (words seen in only one class) with a count of 1, a simplified form of Laplace add-one smoothing.
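
The prior and per-word estimates above can be checked by hand on a very small corpus. The sketch below is only an illustration: the messages and the helper `doc_frequency` are made up, and no smoothing is applied.

```python
# Toy corpus (hypothetical messages, already lower-cased and tokenized).
ham_docs = [["see", "you", "tonight"], ["call", "me", "tonight"], ["see", "you", "soon"]]
spam_docs = [["free", "prize", "call"], ["win", "free", "prize"]]

n_h, n_s = len(ham_docs), len(spam_docs)

# Class priors: P(H) = N_h / (N_h + N_s), P(S) = N_s / (N_h + N_s)
p_ham = n_h / (n_h + n_s)
p_spam = n_s / (n_h + n_s)


def doc_frequency(docs, word):
    """Number of documents in `docs` that contain `word` at least once."""
    return sum(word in doc for doc in docs)


# P(w|class) = (documents of the class containing w) / (documents in the class)
p_call_ham = doc_frequency(ham_docs, "call") / n_h
p_call_spam = doc_frequency(spam_docs, "call") / n_s
p_free_spam = doc_frequency(spam_docs, "free") / n_s

print(p_ham, p_spam)            # 0.6 0.4
print(p_call_ham, p_call_spam)  # 0.3333333333333333 0.5
print(p_free_spam)              # 1.0
```

A word such as "free" never occurs in the toy ham class, so its unsmoothed ham probability would be zero, which is the situation the smoothing step is meant to handle.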
<br>
<br>

Testing: 

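For the Bag of Words method:

1. Load the test data into a dataframe.
2. Preprocess and tokenize the test data.
3. Create the bag-of-words matrix for the test data.
4. Calculate the probability of each class $C_{k}$ given the BoW vector $b$ of a test message using Bayes' theorem.

$$ P(C_{k}|b) \propto P(b|C_{k})P(C_{k}) $$
$$            = P(C_{k})\prod_{i=1}^{V} \left[ b_{i} P(w_{i}|C_{k}) + (1-b_{i})(1-P(w_{i}|C_{k})) \right] $$

Here $b_{i} \in \{0, 1\}$ indicates whether vocabulary word $w_{i}$ appears in the message, and $V$ is the vocabulary size.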

5. The class with the larger probability will be the predicted class.
6. Evaluate model's performance with accuracy score.
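
As a rough illustration of the scoring step, the sketch below evaluates the product formula in log space, which avoids numerical underflow when the vocabulary is large. The vocabulary, probabilities and priors are invented for the example, and `log_score` is a hypothetical helper rather than a function from this notebook. The natural logarithm is used here; any base gives the same predicted class.

```python
import numpy as np

# Hypothetical 3-word vocabulary and per-class word probabilities P(w_i | C_k).
vocab = ["free", "prize", "tonight"]
p_w_ham = np.array([0.1, 0.05, 0.4])
p_w_spam = np.array([0.6, 0.5, 0.05])
p_ham, p_spam = 0.8, 0.2  # assumed class priors

# Binary bag-of-words vectors b for two test messages.
test_bow = np.array([
    [1, 1, 0],  # contains "free" and "prize"
    [0, 0, 1],  # contains "tonight"
])


def log_score(bow, p_w, prior):
    """log P(C_k) + sum_i log[b_i * p_i + (1 - b_i) * (1 - p_i)] for each row."""
    per_word = bow * p_w + (1 - bow) * (1 - p_w)
    return np.log(prior) + np.log(per_word).sum(axis=1)


ham_score = log_score(test_bow, p_w_ham, p_ham)
spam_score = log_score(test_bow, p_w_spam, p_spam)
predicted = np.where(ham_score > spam_score, "ham", "spam").tolist()
print(predicted)  # ['spam', 'ham']
```

For the first message the unnormalized scores are 0.8 × 0.1 × 0.05 × 0.6 = 0.0024 for ham and 0.2 × 0.6 × 0.5 × 0.95 = 0.057 for spam, so spam is predicted.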

For the Bag of Words and TF-IDF method:

1. Load the test data into a dataframe.
2. Preprocess and tokenize the test data.
3. Calculate the TF-IDF matrix for the test data.

$$ TF = \frac{\text{Number of times the word appears in the document}}{\text{Total number of words in the document}} $$

$$ DF = \text{Number of documents in the corpus that contain the word} $$

$$ IDF = \log\left(\frac{\text{Total number of documents}}{DF + 1}\right) $$

$$ TFIDF = TF \times IDF $$

One is added to $DF$ in the denominator to avoid division by zero.

4. Normalize the TF-IDF matrix by dividing it by its matrix norm.
5. Calculate the probability of each class $C_{k}$ given the TF-IDF matrix of the test data using Bayes' theorem.
6. The class with the larger probability will be the predicted class.
7. Evaluate model's performance with accuracy score.
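
The TF, DF and IDF definitions can be traced on a toy corpus. This is a minimal sketch with made-up documents and hypothetical helper names (`tf`, `df`, `idf`, `tfidf`). The `add_one` flag is an assumption that mirrors the `df + 1` denominator used by `calc_tfidf` in the implementation section; with the flag off, the plain ratio is used.

```python
import math

# Hypothetical tokenized corpus of three short messages.
docs = [
    ["free", "prize", "free"],
    ["see", "you", "tonight"],
    ["free", "tonight"],
]


def tf(word, doc):
    """Term frequency: occurrences of `word` divided by the document length."""
    return doc.count(word) / len(doc)


def df(word, docs):
    """Document frequency: number of documents that contain `word`."""
    return sum(word in doc for doc in docs)


def idf(word, docs, add_one=False):
    """Inverse document frequency; `add_one` adds 1 to the denominator."""
    denom = df(word, docs) + (1 if add_one else 0)
    return math.log(len(docs) / denom)


def tfidf(word, doc, docs, add_one=False):
    return tf(word, doc) * idf(word, docs, add_one)


# "free": TF in the first document is 2/3, DF is 2, IDF is log(3/2).
print(round(tfidf("free", docs[0], docs), 4))   # 0.2703
# "prize" appears in one document only, so its IDF is larger: log(3/1).
print(round(tfidf("prize", docs[0], docs), 4))  # 0.3662
```

With `add_one=True`, "free" gets an IDF of log(3/3) = 0, so its TF-IDF weight becomes zero even though it appears twice in the first document.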
<br>
<br>

Evaluation metric:
1. Accuracy score 

$$ acc = \frac{\text{Total number of correct predictions}}{\text{Total predictions made}} $$
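
A small worked instance of the accuracy formula, using invented labels and a hypothetical `accuracy` helper:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = ["ham", "ham", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "ham", "ham"]
print(accuracy(y_true, y_pred))  # 0.6
```

Three of the five predictions match, giving 3 / 5 = 0.6.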

## Implementation

In [1]:
import pandas as pd
import numpy as np
import string
import math
from sklearn.model_selection import train_test_split

In [2]:
'''
load a tab-separated file into a dataframe with 'label' and 'content' columns
'''
def load_file(filename: str):
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read().split("\n")
        new_df = []
        for line in text:
            temp = line.split("\t")
            # ignore empty lines
            if len(temp) > 1:
                new_df.append(temp)

    dataframe = pd.DataFrame(new_df, columns=['label', 'content'])
    return dataframe


''' 
tokenize text 
'''
def split_token(line: list[str]):
    cleaned_text = []
    for i in line:
        # ignore any empty sentences
        if i == "":
            continue
        else:
            # convert words to lower case to normalize
            # split sentence into tokens
            sentence = i.lower().split(" ")
            sentence = [word for word in sentence if word]
            if sentence:
                cleaned_text.append(sentence)
                
    return cleaned_text


''' 
pre-process data 
'''
def process_data(line: list[list[str]]):
    for s in range(len(line)):
        clean = []
        for word in line[s]:
            # filter each word to ignore punctuations
            new = ''.join(char for char in word if char not in string.punctuation)
            clean.append(new)
        # ignore empty strings
        clean = [i for i in clean if i]
        line[s] = clean
        
    return line


''' 
create vocab dictionary 
'''
def make_vocab(data: list[list[str]]):
    # dict.fromkeys drops duplicates while keeping first-seen order
    vocab = list(dict.fromkeys(word for sentence in data for word in sentence))
                
    return vocab


'''
create bag of words vector 
'''
def make_bag_of_words(data: list[list[str]], vocab: list):  
    bow = []
    for line in data:
        # a set makes each membership test constant time
        words = set(line)
        curr_bow = [1 if i in words else 0 for i in vocab]
        bow.append(curr_bow)
    
    return np.array(bow) 


'''
create BoW matrix for each class
'''
def get_class_matrix(class1_df, class2_df):
    # ham
    # process data in ham
    ham_content = class1_df['X_train'].to_list()
    ham_content = split_token(ham_content)
    ham_content = process_data(ham_content)
    
    # get ham vocab
    ham_vocab = make_vocab(ham_content)
    
    # create bag of words for ham class
    ham_bow = make_bag_of_words(ham_content, ham_vocab)
    
    # spam
    # process data in spam
    spam_content = class2_df['X_train'].to_list()
    spam_content = split_token(spam_content)
    spam_content = process_data(spam_content)
    
    # get spam vocab
    spam_vocab = make_vocab(spam_content)
    
    # create bag of words for spam class
    spam_bow = make_bag_of_words(spam_content, spam_vocab)
    
    return ham_vocab, spam_vocab, ham_bow, spam_bow


'''
calculate DF
'''
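def calc_df(df, vocab):
    df_dict = {}
    for word in vocab:
        # case-sensitive substring match against the raw text;
        # regex=False treats the word as a literal string
        count = df[df['X_test'].str.contains(word, regex=False)].shape[0]
        df_dict[word] = count
        
    return df_dict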


'''
calculate TF-IDF
'''
def calc_tfidf(word, doc, total_doc, df_dict):
    # term frequency; an empty document contributes 0 instead of dividing by zero
    tf = doc.count(word) / len(doc) if doc else 0.0
    df = df_dict[word]
    # add one to denominator to avoid division by 0
    idf = np.log(total_doc / (df + 1))
    return tf * idf


'''
create TF-IDF matrix
'''
def get_tfidf_matrix(vocab, data, df_dict):
    tfidf_matrix = []
    total_doc = len(data)
    for doc in data:
        vector = [calc_tfidf(word, doc, total_doc, df_dict) for word in vocab]
        tfidf_matrix.append(vector)
        
    return np.array(tfidf_matrix)
    
    
'''
Train the model
'''
def train(n1, n2, vocab1, vocab2, bow1, bow2):
    # number of documents in each class that contain each vocab word
    n_ham_w = np.sum(bow1, axis=0)
    n_spam_w = np.sum(bow2, axis=0)
    
    # probability of each word given its class: P(w|class) = n_class(w) / N_class
    p_w_ham = np.array(n_ham_w) / n1
    p_w_spam = np.array(n_spam_w) / n2
    
    # arrange into dataframe
    ham_prob_df = pd.DataFrame(np.vstack([vocab1, n_ham_w, p_w_ham]).T, columns=['word', 'n_ham(word)', 'P(word|ham)'])
    spam_prob_df = pd.DataFrame(np.vstack([vocab2, n_spam_w, p_w_spam]).T, columns=['word', 'n_spam(word)', 'P(word|spam)'])
    
    return ham_prob_df, spam_prob_df


'''
Test the model 
'''
def test(data, vocab, merged, p_ham, p_spam, tfidf_method=False, df_dict=None):
    # pre-process test data
    data = data['X_test'].to_list()
    data = split_token(data)
    data = process_data(data)
    
    p_w_ham = merged['P(word|ham)'].astype(float).to_numpy()
    p_w_spam = merged['P(word|spam)'].astype(float).to_numpy()
    
    # Bag of Words method
    if not tfidf_method:
        # create bag of words
        test_bow = make_bag_of_words(data, vocab)
        
        # log-likelihood of each message under the ham and spam classes
        curr_ham_prob = np.sum(np.log2(test_bow * p_w_ham + (1 - test_bow) * (1 - p_w_ham)), axis=1)
        curr_spam_prob = np.sum(np.log2(test_bow * p_w_spam + (1 - test_bow) * (1 - p_w_spam)), axis=1)
        
    # TF-IDF method
    else:
        # create tf-idf matrix
        tfidf_matrix = get_tfidf_matrix(vocab, data, df_dict)
        
        # normalize by the Frobenius norm of the whole matrix (a single scalar)
        tfidf_norm = np.linalg.norm(tfidf_matrix)
        tfidf_matrix = tfidf_matrix / tfidf_norm
        
        # log-likelihood of each message, with tf-idf weights in place of the 0/1 indicators
        curr_ham_prob = np.sum(np.log2(tfidf_matrix * p_w_ham + (1 - tfidf_matrix) * (1 - p_w_ham)), axis=1)
        curr_spam_prob = np.sum(np.log2(tfidf_matrix * p_w_spam + (1 - tfidf_matrix) * (1 - p_w_spam)), axis=1)
    
    ham_prob = np.log2(p_ham) + curr_ham_prob
    spam_prob = np.log2(p_spam) + curr_spam_prob
    
    # predict if ham or spam
    predicted = np.where(ham_prob > spam_prob, "ham", "spam").tolist()

    return predicted


'''
Evaluate Accuracy Score 
'''
def evaluate(data):
    # fraction of rows where the predicted label equals the true label
    accuracy = float((data['y_test'] == data['predicted']).mean())
    return accuracy
   

### Load data

In [3]:
# load original dataframe

df = load_file("SMSSpamCollection")
df

Unnamed: 0,label,content
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...
5570,ham,Will ü b going to esplanade fr home?
5571,ham,"Pity, * was in mood for that. So...any other s..."
5572,ham,The guy did some bitching but I acted like i'd...


### Train Test split

In [4]:
# split to 80% train and 20% test data

X = df.iloc[:, 1]
y = df.iloc[:, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
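
The split above does not pass `stratify`, so the ham-to-spam ratio in the two sets can differ slightly by chance. As a hedged alternative, `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both sets. The sketch below uses a made-up toy dataframe, not the SMS data, and is not what this notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 8 ham and 2 spam messages.
toy = pd.DataFrame({
    "label": ["ham"] * 8 + ["spam"] * 2,
    "content": [f"message {i}" for i in range(10)],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["content"], toy["label"],
    test_size=0.5, random_state=42, stratify=toy["label"],
)

# With stratify, each half keeps the 4:1 ham-to-spam ratio.
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Switching the notebook to a stratified split would change which messages land in the test set, so the accuracies reported below would no longer apply.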

### Training

In [5]:
# load train dataframe

train_df = pd.DataFrame({"y_train": y_train, "X_train": X_train})
train_df

Unnamed: 0,y_train,X_train
1350,spam,FREE2DAY sexy St George's Day pic of Jordan!Tx...
5544,ham,Armand says get your ass over to epsilon
1168,ham,Lol now I'm after that hot air balloon!
5551,ham,"You know, wot people wear. T shirts, jumpers, ..."
5320,ham,"Good morning, my Love ... I go to sleep now an..."
...,...,...
3772,ham,"Hi, wlcome back, did wonder if you got eaten b..."
5191,spam,ree entry in 2 a weekly comp for a chance to w...
5226,ham,"""OH FUCK. JUSWOKE UP IN A BED ON A BOATIN THE ..."
5390,ham,NOT MUCH NO FIGHTS. IT WAS A GOOD NITE!!


In [6]:
# split to ham and spam dataframe

ham_df = train_df[train_df["y_train"]=="ham"]
n_ham = ham_df.shape[0]
p_ham = float(n_ham / train_df.shape[0])

spam_df = train_df[train_df["y_train"]=="spam"]
n_spam = spam_df.shape[0]
p_spam = float(n_spam / train_df.shape[0])

In [7]:
ham_vocab, spam_vocab, ham_bow, spam_bow = get_class_matrix(ham_df, spam_df)


In [8]:
ham_prob_df, spam_prob_df = train(n_ham, n_spam, ham_vocab, spam_vocab, ham_bow, spam_bow)
ham_prob_df

Unnamed: 0,word,n_ham(word),P(word|ham)
0,armand,3,0.000774593338497289
1,says,23,0.0059385489284792155
2,get,225,0.05809450038729667
3,your,289,0.07461915827523884
4,ass,12,0.003098373353989156
...,...,...,...
6727,docks,1,0.0002581977794990963
6728,25,1,0.0002581977794990963
6729,spinout,1,0.0002581977794990963
6730,gossip,1,0.0002581977794990963


In [9]:
spam_prob_df

Unnamed: 0,word,n_spam(word),P(word|spam)
0,free2day,2,0.0034129692832764505
1,sexy,9,0.015358361774744027
2,st,2,0.0034129692832764505
3,georges,2,0.0034129692832764505
4,day,10,0.017064846416382253
...,...,...,...
2705,lotr,1,0.0017064846416382253
2706,june,1,0.0017064846416382253
2707,soundtrack,1,0.0017064846416382253
2708,stdtxtrate,1,0.0017064846416382253


In [10]:
# merge train ham and spam dataframe

merged_df = ham_prob_df.merge(spam_prob_df, on='word', how='outer')

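# fill in unseen data with Laplace add one smoothing
# (assigning the result back avoids chained in-place fillna, which newer pandas versions warn about)
merged_df['n_ham(word)'] = merged_df['n_ham(word)'].fillna(1)
merged_df['P(word|ham)'] = merged_df['P(word|ham)'].fillna(1 / n_ham)
merged_df['n_spam(word)'] = merged_df['n_spam(word)'].fillna(1)
merged_df['P(word|spam)'] = merged_df['P(word|spam)'].fillna(1 / n_spam)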

# rearrange columns
merged_df = merged_df[['word', 'n_ham(word)', 'P(word|ham)', 'n_spam(word)', 'P(word|spam)']]

# get list of vocab in training set
train_vocab = merged_df['word'].to_list()

merged_df

Unnamed: 0,word,n_ham(word),P(word|ham),n_spam(word),P(word|spam)
0,armand,3,0.000774593338497289,1,0.001706
1,says,23,0.0059385489284792155,1,0.001706
2,get,225,0.05809450038729667,57,0.09726962457337884
3,your,289,0.07461915827523884,174,0.29692832764505117
4,ass,12,0.003098373353989156,1,0.001706
...,...,...,...,...,...
8515,nowreply,1,0.000258,1,0.0017064846416382253
8516,lotr,1,0.000258,1,0.0017064846416382253
8517,soundtrack,1,0.000258,1,0.0017064846416382253
8518,stdtxtrate,1,0.000258,1,0.0017064846416382253
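
The fill above gives unseen words a count of 1 and leaves the counts of seen words unchanged. For comparison, the textbook add-one estimate for a binary (Bernoulli) model adds one to every count and two to the denominator. The sketch below shows that variant on invented counts; `laplace_bernoulli` is a hypothetical helper, and using it here would change the probabilities and the accuracies reported in this notebook.

```python
import numpy as np


def laplace_bernoulli(doc_counts, n_docs):
    """Add-one estimate of P(word | class) for a binary presence/absence model.

    doc_counts: number of class documents that contain each word
    n_docs: total number of documents in the class
    The 2 in the denominator covers the two outcomes, present or absent.
    """
    doc_counts = np.asarray(doc_counts, dtype=float)
    return (doc_counts + 1) / (n_docs + 2)


# Hypothetical counts for three words in a class of 8 documents;
# the first word never occurs in this class.
probs = laplace_bernoulli([0, 3, 8], n_docs=8)
print(probs)  # [0.1 0.4 0.9]
```

Under this estimate no probability is exactly 0 or 1, so both `log(p)` and `log(1 - p)` stay finite.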


### Testing 

In [11]:
# load test dataframe

test_df = pd.DataFrame({"y_test": y_test, "X_test": X_test})
test_df

Unnamed: 0,y_test,X_test
3690,ham,You still coming tonight?
3527,ham,"""HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE..."
724,ham,Ya even those cookies have jelly on them
3370,ham,Sorry i've not gone to that place. I.ll do so ...
468,ham,When are you going to ride your bike?
...,...,...
2942,ham,My supervisor find 4 me one lor i thk his stud...
4864,spam,Bored housewives! Chat n date now! 0871750.77....
3227,ham,"Rose for red,red for blood,blood for heart,hea..."
3796,ham,Also remember the beads don't come off. Ever.


#### Test using only Bag of Words

In [12]:
# predict test data

predicted_res = test(test_df, train_vocab, merged_df, p_ham, p_spam, tfidf_method=False)

res_bow_df = test_df.assign(predicted=predicted_res)
res_bow_df

Unnamed: 0,y_test,X_test,predicted
3690,ham,You still coming tonight?,ham
3527,ham,"""HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE...",ham
724,ham,Ya even those cookies have jelly on them,ham
3370,ham,Sorry i've not gone to that place. I.ll do so ...,ham
468,ham,When are you going to ride your bike?,ham
...,...,...,...
2942,ham,My supervisor find 4 me one lor i thk his stud...,ham
4864,spam,Bored housewives! Chat n date now! 0871750.77....,spam
3227,ham,"Rose for red,red for blood,blood for heart,hea...",spam
3796,ham,Also remember the beads don't come off. Ever.,ham


In [13]:
# calculate predicted test accuracy

test_acc_bow = evaluate(res_bow_df)
print(f'Test Accuracy using Bag of Words method: {test_acc_bow}')

Test Accuracy using Bag of Words method: 0.9748878923766816


#### Test with Bag of Words and TF-IDF

In [14]:
df_records = calc_df(test_df, train_vocab)

In [15]:
# predict test data

predicted_res1 = test(test_df, train_vocab, merged_df, p_ham, p_spam, tfidf_method=True, df_dict=df_records)

res_tfidf_df = test_df.assign(predicted=predicted_res1)
res_tfidf_df

Unnamed: 0,y_test,X_test,predicted
3690,ham,You still coming tonight?,ham
3527,ham,"""HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE...",ham
724,ham,Ya even those cookies have jelly on them,ham
3370,ham,Sorry i've not gone to that place. I.ll do so ...,ham
468,ham,When are you going to ride your bike?,ham
...,...,...,...
2942,ham,My supervisor find 4 me one lor i thk his stud...,ham
4864,spam,Bored housewives! Chat n date now! 0871750.77....,ham
3227,ham,"Rose for red,red for blood,blood for heart,hea...",ham
3796,ham,Also remember the beads don't come off. Ever.,ham


In [16]:
# calculate predicted test accuracy

test_acc_tfidf = evaluate(res_tfidf_df)
print(f'Test Accuracy using Bag of Words and TF-IDF method: {test_acc_tfidf}')

Test Accuracy using Bag of Words and TF-IDF method: 0.8556053811659193
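
One way to sanity-check a hand-written classifier is to compare it with scikit-learn's `BernoulliNB` on binary bag-of-words features. The sketch below is not part of this notebook's method and runs on made-up messages. `CountVectorizer` tokenizes differently from `split_token` and `process_data` (for example, it drops single-character tokens by default), so results on the real data need not match the accuracies above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy messages; the real data would be X_train / y_train.
train_texts = ["free prize call now", "win a free prize", "see you tonight",
               "call me tonight", "are you coming tonight"]
train_labels = ["spam", "spam", "ham", "ham", "ham"]

# binary=True records presence/absence, like make_bag_of_words.
vectorizer = CountVectorizer(binary=True)
X_train_bow = vectorizer.fit_transform(train_texts)

# alpha=1.0 is add-one (Laplace) smoothing.
model = BernoulliNB(alpha=1.0)
model.fit(X_train_bow, train_labels)

test_texts = ["free prize tonight", "see you"]
predictions = model.predict(vectorizer.transform(test_texts)).tolist()
print(predictions)  # ['spam', 'ham']
```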
