BBM409: Introduction to Machine Learning Lab.<br>
Instructor: Ahmet Burak Can<br>
TA: Burçak Asal

# Assignment 3 : Naive Bayes Algorithm

This assignment has done by **Desmin Alpaslan** (Student ID: 21945795) and **Mert Doğramacı** (Student ID: 21946055).

### Contents
**[Problem Definition and Data](#problem_definition)**<br>
**[Part 1: Understanding the Data](#part1)**<br>
**[Part 2: Implementing Naive Bayes](#part2)**<br>
&emsp;2.1. [Implementation of Nested Subfunctions](#subfunctions)<br>
&emsp;2.2. [Final Implementation of Naive Bayes](#naive_bayes)<br>
**[Part 3: Analyzes](#part3)**<br>
&emsp;3.a. [Analyzing effect of the words on prediction](#3a)<br>
&emsp;&emsp;3.a.1. *[List the 10 words whose presence most strongly predicts that the mail is ham](#3a1)*<br>
&emsp;&emsp;3.a.2. *[List the 10 words whose absence most strongly predicts that the mail is ham](#3a2)*<br>
&emsp;&emsp;3.a.3. *[List the 10 words whose presence most strongly predicts that the mail is spam](#3a3)*<br>
&emsp;&emsp;3.a.4. *[List the 10 words whose absence most strongly predicts that the mail is spam](#3a4)*<br>
&emsp;3.b. [Stopwords](#3b)<br>
&emsp;3.c. [Analyzing effect of the stopwords](#3c)<br>
**[Part 4: Calculation of Performance Metrics](#part4)**

## 1. Problem Definition and Data<a class="anchor" id="problem_definition"></a>

In this assignment, we will try to `determine whether a mail is ham or spam` from a given mail dataset. We will do it with the help of a Naive Bayes classifier `that we will implement` and verify its performance on again given E-Mail Spam Dataset. We will use the `Naive Bayes classifier algorithm`, that we learned in the class, during this assignment.

As I stated before, a dataset is provided us for both training and validation phases named `emails.csv` which is avaible in the post of assignment 3 at Piazza page. We included dataset with the path as `"emails.csv"`. If you will change the dataset or its location, you should change the argument of the read_csv() function. 

You can see the whole dataset and its shape below.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("emails.csv")
df

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


E-Mail Spam Dataset is a dataset that consists of 5728 samples with features as the text of mail and spam label. Spam label has two values 0 and 1.<br>
0: Ham<br>1: Spam<br>

## Part 1: Understanding the Data<a class="anchor" id="part1"></a>

In [3]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
data = pd.read_csv("emails.csv").to_numpy()
corpus = data[:,0]
y = data[:,1]

In [5]:
vectorizer = CountVectorizer(analyzer = "word", ngram_range = (1,1)) # only consider unigrams
X = vectorizer.fit_transform(corpus).toarray()
X = np.squeeze(np.asarray(X))
N, M = X.shape

In [6]:
freq_words = np.zeros((M, 2))
spam_idx, ham_idx = y == 1, y == 0

print("There are total of %d spam and %d ham mails" % (X[spam_idx].shape[0], X[ham_idx].shape[0]))
print(X[spam_idx].shape, X[ham_idx].shape)

There are total of 1368 spam and 4360 ham mails
(1368, 37303) (4360, 37303)


In [7]:
spam_num, ham_num = X[spam_idx].sum(axis=0), X[ham_idx].sum(axis=0)
freq_words[:,0] += spam_num
freq_words[:,1] += ham_num

print("Number of words only seen in spam mails:", np.sum(freq_words[:,1] == 0))
print("Number of words only seen in ham mails:", np.sum(freq_words[:,0] == 0))

Number of words only seen in spam mails: 10229
Number of words only seen in ham mails: 18529


Some statistics about the data:
1. There are total of **37303** distinct words in the dataset and **5728** lines of mails.
2. In these words, **10229** of them is seen only in spam mails and **18529** of them is seen only in ham mails.

In [8]:
ratios_s = freq_words[:,0] / X.sum(axis=0) # total times in spam / total usage of the word
ratios_h = freq_words[:,1] / X.sum(axis=0) # total times in ham / total usage of the word
idx_s_rats = np.argsort(ratios_s)[::-1]
idx_h_rats = np.argsort(ratios_h)[::-1]
words = vectorizer.get_feature_names()


print("#### Words with highest R_s ####")
count = 0
print("Word","R_s","N_s","N", sep="\t")
for i in idx_s_rats:
    if X[:,i].sum() > 100:       
        print(words[i], "%.2f" % ratios_s[i], freq_words[i,0], X[:,i].sum(), sep='\t')
        count += 1
        if count == 10:
            break

print("#### Words with highest R_h ####")
count = 0
print("Word","R_h","N_h","N", sep="\t")
for i in idx_h_rats:
    if ratios_h[i] < 0.99 and X[:,i].sum() > 500:       
        print(words[i], "%.2f" % ratios_h[i], freq_words[i,1], X[:,i].sum(), sep='\t')
        count += 1
        if count == 10:
            break

#### Words with highest R_s ####
Word	R_s	N_s	N
projecthoneypot	1.00	110.0	110
viagra	1.00	174.0	174
stationery	1.00	120.0	120
2005	0.99	374.0	379
engines	0.97	112.0	115
advertisement	0.97	102.0	105
adobe	0.97	462.0	476
jul	0.96	162.0	168
2004	0.95	169.0	177
grants	0.95	110.0	116
#### Words with highest R_h ####
Word	R_h	N_h	N
na	0.99	616.0	623
model	0.99	1287.0	1306
attached	0.98	898.0	912
schedule	0.98	637.0	647
option	0.98	561.0	570
london	0.98	828.0	843
09	0.98	1085.0	1105
john	0.98	1016.0	1035
summer	0.98	617.0	629
08	0.98	1192.0	1216


We compare the words according to their spam ratio which is defined as follows:<br>
<br>
$$\large R_s = \dfrac{N_s}{N},\ R_h = \dfrac{N_h}{N} $$<br>
where:
- $N_s$ number of occurances in a spam mail of the word.
- $N_h$ number of occurances in a ham mail of the word
- $N$ is the total occurances.<br>

In the upper cell, we print the 10 words with highest $R_s$ and $N > 100$, highest $R_h$ and $N > 500$. We selected the 3 words among them and inspect their statistics:
1. **viagra**: We see that in this dataset all the mails that includes "viagra" are **spam**, since $R_s = 1.0$. Even though the $N$ is quite small (174), from prior experience we know that these type of mails are usually spam.
2. **adobe**: We see that in this dataset most of the mails that includes "adobe" are spam, with $R_s = 0.97$. Furthermore because $N = 476$ and $N_s = 462$ (which are quite high occurances), we can conclude that this word provides a useful distinction between two type of mails.
3. **schedule**: We see that in this dataset mos of the mails that includes "schedule" are ham, with $R_h = 0.98$. We know that from prior experience that mails that mentions schedule are usually not spam.

We can conclude that it is feasible to label mails as spam or ham by looking at the words. However there are few drawbacks in this dataset:
1. Even though some of the words has high $R_s$ their $N$ is quite low (< 100). This will result in a **biased prediction**.
2. There are some words that are only numbers (09, 08, 2005, 2004) which shouldn't be telling a much about the type of the mail. However because of the dataset, some of these words has high $R_s$ and $R_h$.

## Part 2: Implementing Naive Bayes<a class="anchor" id="part2"></a>

For being able to calculate the probability of being a spam or ham mail of an unknown mail sample, we should solve the below equation:

$$\large P(y = spam|word) = \dfrac{P(word|y = spam) * P(y = spam)}{P(word)}$$<br>
$$\large P(y = ham|word) = \dfrac{P(word|y = ham) * P(y = ham)}{P(word)}$$<br>

As you can see, the denominators of the both equation are same. Therefore, we don't need to $P(word)$ probabilities. 

In conclusion, we have to determine below equation for finding the label of the unknown mail example.

$$\large \hat{y} = \underset{y \in spam, ham}{\mathrm{argmax}} P(y|word) = P(word|y) * P(y)$$<br>

For determining this value, we need to calculate $P(word|y = spam)$, $P(word|y = ham)$, $P(y = spam)$ and $P(y = ham)$ values.

### 2.1. Implementation of Nested Subfunctions<a class="anchor" id="subfunctions"></a>

**__vectorizer** is a helper function that we use for detecting every word in every mail sample and their counts for each mail. It means it creates a matrix which stores the words that emails includes and their counts in each email. The result matrix includes N (number of mails) number of rows and column number equals to number of unique words in mails.

```python  
    def __vectorizer(self, arr: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        """
        Creates a matrix which stores the words that emails includes and their counts in each email

        :param arr: an numpy array with shape (N,1) for
        :return: a CountVectorizer matrix (N, number of different words in emails) and a vector named columns
        with shape (number of different words in emails, 1) which stores all words appears in mails
        """

        # initializes CountVectorizer item with ngram_mode
        vectorizer = CountVectorizer(ngram_range=**Selected ngram mode will come here**, stop_words=self.stop_words)

        # vector that holds all words in emails and their counts for each item
        vector = vectorizer.fit_transform(arr)

        # convert vector variable to array for usability
        count_vector = vector.toarray()

        # names of columns
        columns = vectorizer.get_feature_names_out()

        return count_vector, columns
```

**__calculate_class_prior** function calculates following probability values (prior probabilities) as follows:<br><br>
$$\large P(y = spam) = \dfrac{N_s}{N}$$<br>
$$\large P(y = ham) = \dfrac{N_h}{N}$$<br>
where:
- $N_s$ number of spam mails
- $N_h$ number of ham mails
- $N$ the total number of mails<br>

and stores them in a variable named probability_dict which is a class variable.

We have used logarithm to prevent numerical underflow when calculating multiplicative probabilities.

```python
    def __calculate_class_prior(self, y: np.ndarray) -> None:
        """
        Calculates class probabilities [P(spam) and P(ham)] for training examples, and adds the result into
        self.probability dictionary as "spam" and "ham" labels

        :param y: data with shape (N,), each row consists of the label of the mail (0 for ham, 1 for spam)
        :return: None
        """

        # labels are only 0 and 1 therefore if we sum all items we get number of 1s
        # instead of a for loop we can use this method
        number_of_spam = np.sum(y)
        number_of_ham = len(y) - number_of_spam

        self.probability_dict["spam"] = np.log(number_of_spam / y.shape[0])  # P(spam) = number of spams / N
        self.probability_dict["ham"] = np.log(number_of_ham / y.shape[0])     # P(ham) = number of hams / N
```

**__calculate_likelihoods** function calculates the likelihoods probabilities as follows:<br><br>
$$\large P(word|y = spam) = \dfrac{N_{ws}}{N_s}$$<br>
$$\large P(word|y = ham) = \dfrac{N_{wh}}{N_h}$$<br>
where:
- $N_{ws}$ number of occurances of the word in all spam mails 
- $N_{wh}$ number of occurances of the word in all ham mails 
- $N_s$ total number of all word occurrences in all spam mails
- $N_h$ total number of all word occurrences in all ham mails<br>

and stores them in a variable named probability_dict which is a class variable.

**Please not that! :** We take the logarithm of the probabilities to prevent numerical underflow when calculating multiplicative probabilities.

```python
    def __calculate_likelihoods(self, y: np.ndarray, columns: np.ndarray, alpha: int) -> None:
        """
        Calculates likelihoods of each word that contains in emails as P(word|spam) and P(word|ham), and adds the 
        results into self.probability dictionary as "word|spam" and "word|ham" labels

        :param y: data with shape (N,), each row consists of the label of the mail (0 for ham, 1 for spam)
        :param columns: a vector with shape (number of different words in emails,1) which stores all words appears 
        in mails
        :param alpha: int value for smoothing
        :return: None
        """

        N, D = self.count_vector.shape
        spam_vector, ham_vector = np.sum(self.count_vector[y == 1], axis=0), np.sum(self.count_vector[y == 0], axis=0)

        n_s = np.sum(spam_vector) # total number of all word occurrences in all spam mails
        n_h = np.sum(ham_vector) # total number of all word occurrences in all ham mails

        for word_i in range(D):
            n_w_s = spam_vector[word_i] # number of occurances in a spam mail of the word
            n_h_s = ham_vector[word_i] # number of occurances in a ham mail of the word

            self.probability_dict["%s|spam" % columns[word_i]] = np.log((n_w_s + alpha) / (n_s + D))
            self.probability_dict["%s|ham" % columns[word_i]] = np.log((n_h_s + alpha) / (n_h + D))
```

### 2.2. Final Implementation of Naive Bayes<a class="anchor" id="naive_bayes"></a>

If we combine these subfunctions with main fit and predict functions, we get below algorithm and code:

In [9]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from typing import Tuple

MODES = {
    "unigram": (1, 1),
    "bigram": (2, 2)
}


class NaiveBayes:
    def __init__(self, mode: str = "unigram", stop_words=None) -> None:
        """
        Initialize NaiveBayes model

        Args:
            mode (str, optional): Mode for BagOfWords, should be either Unigram or Bigram. Defaults to "unigram".
            stop_words (list, optional): Stop words to eliminate from BagOfWords
        """
        self.count_vector = None
        self.probability_dict = dict()

        assert mode in MODES.keys(), "Mode should be either bigram or unigram"
        self.ngram_mode = MODES[mode]
        self.stop_words = stop_words

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """
        Fit the data, calculate probabilities according to it

        Args:
            x (np.ndarray): data with shape (N,1), each row consists of a mail
            y (np.ndarray): data with shape (N,), each row consists of the label of the mail (0 for ham, 1 for spam)
        """

        # columns is list of every words without their counts, counts stores in count_vector
        self.count_vector, columns = self.__vectorizer(x)

        self.__calculate_class_prior(y)     # calculates P(spam) and P (ham), and adds them into probability_dict
        # calculates P(x(i)|spam) and P(x(i)|ham) values, and adds them into probability_dict
        self.__calculate_likelihoods(y, columns, 1)

    def predict(self, x_predict: np.ndarray) -> np.ndarray:
        """
        Predict the labels of the mails in x_predict

        Args:
            x_predict (np.ndarray): Data to be predicted, shape (N,1)

        Returns:
            y_predict (np.ndarray): Predictions for the mails, shape (N,)
        """

        n = x_predict.shape[0]
        y_predict = np.zeros(n)

        vector, columns = self.__vectorizer(x_predict)

        for i in range(n):
            probability_of_spam = 0
            probability_of_ham = 0

            word_idx = np.arange(len(columns))[vector[i] > 0] # only work on words that the text has
            # calculate P(vj | text)
            for j in word_idx:
                if "%s|spam" % columns[j] in self.probability_dict.keys():
                    probability_of_spam += vector[i][j] * self.probability_dict[columns[j] + "|spam"]
                    probability_of_ham += vector[i][j] * self.probability_dict[columns[j] + "|ham"]

            probability_of_spam += self.probability_dict["spam"]
            probability_of_ham += self.probability_dict["ham"]

            y_predict[i] = 1 if probability_of_spam > probability_of_ham else 0

        return y_predict

    def __vectorizer(self, arr: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Creates a matrix which stores the words that emails includes and their counts in each email

        :param arr: an numpy array with shape (N,1) for
        :return: a CountVectorizer matrix (N, number of different words in emails) and a vector named columns
        with shape (number of different words in emails, 1) which stores all words appears in mails
        """

        # initializes CountVectorizer item with ngram_mode
        vectorizer = CountVectorizer(ngram_range=self.ngram_mode, stop_words=self.stop_words)

        # vector that holds all words in emails and their counts for each item
        vector = vectorizer.fit_transform(arr)

        # convert vector variable to array for usability
        count_vector = vector.toarray()

        # names of columns
        columns = vectorizer.get_feature_names()

        return count_vector, columns

    def __calculate_class_prior(self, y: np.ndarray) -> None:
        """
        Calculates class probabilities [P(spam) and P(ham)] for training examples, and adds the result into
        self.probability dictionary as "spam" and "ham" labels

        :param y: data with shape (N,), each row consists of the label of the mail (0 for ham, 1 for spam)
        :return: None
        """

        # labels are only 0 and 1 therefore if we sum all items we get number of 1s
        # instead of a for loop we can use this method
        number_of_spam = np.sum(y)
        number_of_ham = len(y) - number_of_spam

        self.probability_dict["spam"] = np.log(number_of_spam / y.shape[0])  # P(spam) = number of spams / N
        self.probability_dict["ham"] = np.log(number_of_ham / y.shape[0])     # P(ham) = number of hams / N

    def __calculate_likelihoods(self, y: np.ndarray, columns: np.ndarray, alpha: int) -> None:
        """
        Calculates likelihoods of each word that contains in emails as P(word|spam) and P(word|ham), and adds the results
        into self.probability dictionary as "word|spam" and "word|ham" labels

        :param y: data with shape (N,), each row consists of the label of the mail (0 for ham, 1 for spam)
        :param columns: a vector with shape (number of different words in emails, 1) which stores all words appears in mails
        :param alpha: int value for smoothing
        :return: None
        """

        N, D = self.count_vector.shape
        spam_vector, ham_vector = np.sum(self.count_vector[y == 1], axis=0), np.sum(self.count_vector[y == 0], axis=0)

        n_s = np.sum(spam_vector) # | Text_spam |
        n_h = np.sum(ham_vector) # | Text_ham  |

        for word_i in range(D):
            n_w_s = spam_vector[word_i]
            n_h_s = ham_vector[word_i]

            self.probability_dict["%s|spam" % columns[word_i]] = np.log((n_w_s + alpha) / (n_s + D))
            self.probability_dict["%s|ham" % columns[word_i]] = np.log((n_h_s + alpha) / (n_h + D))

In [10]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import os
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


file_path = os.path.join(os.getcwd(), "emails.csv")

df = pd.read_csv(file_path)
data = df.to_numpy()
x, y = data[:, 0], data[:, -1]

shuffled_x, shuffled_y = shuffle(x, y, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0, stratify=y)

args = [("unigram", ENGLISH_STOP_WORDS), ("bigram", ENGLISH_STOP_WORDS), ("unigram", None), ("bigram", None)]
for arg in args:
    model = NaiveBayes(*arg)        # initializes the model
    model.fit(X_train, y_train)      # training
    y_predict = model.predict(X_test)
    acc = accuracy_score(y_test.astype(bool), y_predict.astype(bool))
    print("Accuracy with ngram=%s, stop_words=%s:\t%f" % (arg[0], arg[1] != None, acc))

Accuracy with ngram=unigram, stop_words=True:	0.996510
Accuracy with ngram=bigram, stop_words=True:	0.988656
Accuracy with ngram=unigram, stop_words=False:	0.991274
Accuracy with ngram=bigram, stop_words=False:	0.989529


## Part 3: Analyzes<a class="anchor" id="part3"></a>

### 3.a. Analyzing effect of the words on prediction<a class="anchor" id="3a"></a>

To analyze the effects of the words on prediction we created a pipeline that works as follows:
1. We first eliminate the words in the word set according to their tfidf values for each document class (spam, ham). This step is composed of few steps.
    1. Compute tfidf values by using `TfidfTransformer`, we got a matrix of shape (N, D) (where N is equal to sample size and D equal to distinct word count). However we want an output that acts as a scoring metric among words so that we can eliminate among them.
    2. Take sum among `axis = 0`, sum by column, this way our matrix is reduced to shape (D,).
    3. Take the maximum 100 and minimum 100 tfidf valued words. These 200 words will the candidates for finding the words that suggests certain class by their presence/absence. To be accurate, we used minimum 100 tfidf values to be absence candidates and maximum 100 words to be presence candidates.
2. At this step, we see that our saved matrices **contains some stop words**.  To analyze the effects of presence and absence of a word, we compute their posterior probability as follows:
$$P(v_d|w) = \frac{P(w|v_d) * P(v_d)}{P(w)} \text{ where $P(v_d|w)$ is the probability that the class of the document is $v_d$ given that it contains $w$}$$
<center>and
$$P(v_d|\neg w) = \frac{P(w|v_d) * P(v_d)}{P(\neg w)} \text{ where $P(v_d|\neg w)$ is the probability that the class of the document is $v_d$ given that it doesn't contains $w$}$$

From our $\text{N$\dot{a}$ive Bayes}$ model, we know the conditional probilities $P(w|v_d)$, $P(\neg w|v_d) = 1 - P(w|v_d)$ and prior probilities $P(\text{spam})$, $P(\text{ham})$. Hence, we only need to calculate $P(w)$ and $P(\neg w)$. To calculate those we count the total occurance and divide it by total word count.

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def find_presence_absence_probilities (X : np.ndarray, y : np.ndarray, stop_words = None):
    ## Create pipelines for two different document class
    pipe_spam = Pipeline([('count', CountVectorizer(stop_words = stop_words)),
                               ('tfidf', TfidfTransformer(use_idf = True, smooth_idf = True))]).fit(X[y==1])
    pipe_ham = Pipeline([('count', CountVectorizer(stop_words = stop_words)),
                               ('tfidf', TfidfTransformer(use_idf = True, smooth_idf = True))]).fit(X[y==0])
    ## Step 1. Eliminate words that has low tfidf values

    y_vals = [(1, pipe_spam), (0, pipe_ham)]
    y_words = dict()
    for y_val, pipe in y_vals:
        tfidf_vals = pipe.transform(X[y==y_val])
        words = pipe["count"].get_feature_names_out()
        sums = np.sum(tfidf_vals, axis=0).A1
        idx = np.argsort(sums)

        print("##### Highest TFIDF values for class %s #####" % y_val)
        print(words[idx[-100:]])
        y_words["y = %d" % y_val] = (idx[:100], idx[-100:])
    
    ## Step 2. Eliminate words that has low tfidf values
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(X)
    presence_counts = counts.sum(axis=0).A1
    total = counts.sum()

    doc_classes = [("spam", pipe_spam),
                   ("ham", pipe_ham)]
    word_p_a_dict = dict()
    N = X.shape[0]
    for doc_class, pipe in tqdm(doc_classes, desc="Calculating probilities for P(class | word in doc) and P(class | word not in doc)"):
        word_p_a_dict["%s absence" % doc_class] = []
        word_p_a_dict["%s presence" % doc_class] = []
        label = 1 if doc_class == "spam" else 0

        low_tfidf, high_tfidf = y_words["y = %d" % label]
        class_prior = model.probability_dict[doc_class]
        words = pipe["count"].get_feature_names_out()

        for word_idx in tqdm(high_tfidf, leave = False, desc="Doing presence %s" % doc_class):
            word = words[word_idx]
            word_idx_count = np.where(vectorizer.get_feature_names_out() == word)[0][0]
            cond_prob = model.probability_dict["%s|%s" % (word, doc_class)]
            presence_count = presence_counts[word_idx_count]
            p1 = (cond_prob + class_prior) - np.log(presence_count / total)
            word_p_a_dict["%s presence" % doc_class].append((words[word_idx], np.exp(p1)))

        for word_idx in tqdm(low_tfidf, leave = False, desc="Doing absence %s" % doc_class):
            word = words[word_idx]
            word_idx_count = np.where(vectorizer.get_feature_names_out() == word)[0][0]

            cond_prob = np.log(1 - np.exp(model.probability_dict["%s|%s" % (word, doc_class)]))
            presence_count = presence_counts[word_idx_count]
            p1 = cond_prob + class_prior - np.log((total-presence_count) / total)
            word_p_a_dict["%s absence" % doc_class].append((words[word_idx], np.exp(p1)))
    
    return word_p_a_dict
    

In [12]:
word_p_a_dict = find_presence_absence_probilities(X, y)

##### Highest TFIDF values for class 1 #####
['visit' 'interested' 'how' 'marketing' 'way' 'may' 'was' 'home' 'net'
 'about' 'start' '2005' 'receive' 'offer' '10' 'offers' '000' 'time' 'has'
 'search' 'only' 'account' 'want' 'but' 'list' 'life' 'what' 'into'
 'viagra' 'see' 'new' 'us' 'any' 'like' 'over' 'www' 'site' 'message'
 'out' 'make' 'logo' '95' 'information' 'mail' 'one' 'online' 'best'
 'please' 'need' 'more' 'save' 'free' 'my' 'no' 'do' 'money' 'now' 'an'
 'just' 'company' 'http' 'get' 'email' 'can' 'if' 'all' 'adobe' 'at' 'as'
 'click' 'by' 'website' 'business' 'software' 'com' 'here' 'on' 'will'
 'or' 'not' 'have' 'are' 'that' 'with' 'from' 'subject' 'be' 'our' 'it'
 'we' 'this' 'for' 'is' 'in' 'of' 'your' 'and' 'you' 'the' 'to']
##### Highest TFIDF values for class 0 #####
['regards' 'him' 'his' 'email' 'forward' '713' 'but' 'up' 'call' 'resume'
 'corp' 'some' 'information' 'need' 'get' 'crenshaw' '04' 'so' 'power'
 '12' 'very' 'conference' 'request' 'stinson' 'th' '11' 'm

Calculating probilities for P(class | word in doc) and P(class | word not in doc):   0%|          | 0/2 [00:00…

Doing presence spam:   0%|          | 0/100 [00:00<?, ?it/s]

Doing absence spam:   0%|          | 0/100 [00:00<?, ?it/s]

Doing presence ham:   0%|          | 0/100 [00:00<?, ?it/s]

Doing absence ham:   0%|          | 0/100 [00:00<?, ?it/s]

From the outputs of above cell, we note that highest tfidf values contains some of the stop words such as `is`, `on`, `be`, `of` etc. In below cells, we calculate the posterior probilities similar to how we did in our $\text{N$\dot{a}$ive Bayes}$ model. However, since we saved our prior probilities and conditional probilities in log form we continue to do our calculations in log form. Below we show the formulation:
$$ln{P(v_d|w)} = ln{\frac{P(w|v_d) * P(v_d)}{P(w)}} \tag{1}$$
$$ln{P(v_d|w)} = ln{P(w|v_d)} + ln{P(v_d)} - ln{P(w)} \tag{2}$$
<br>
<center>and
<br>
$$ln{P(v_d|\neg w)} = ln{\frac{P(\neg w|v_d) * P(v_d)}{P(\neg w)}} \tag{3}$$
$$ln{P(v_d|\neg w)} = ln{P(\neg w|v_d)} + ln{P(v_d)} - ln{P(\neg w)} \tag{4}$$

In [13]:
take_n = 10 # take n words with highest posterior probility
for key, value in word_p_a_dict.items():
    word_p_a_dict[key].sort(key= lambda x : x[1])
    word_p_a_dict[key] = value[-take_n:]

#### 3.a.1. List the 10 words whose presence most strongly predicts that the mail is ham<a class="anchor" id="3a1"></a>

In [14]:
query = "ham presence"
print("Word\tProbability\n---------------------")
for word, prob in word_p_a_dict[query]:
    print("%s\t%f" % (word, prob))

Word	Probability
---------------------
cc	0.920928
hou	0.921190
shirley	0.921799
ect	0.922348
vince	0.922348
enron	0.922417
kaminski	0.922542
crenshaw	0.923175
713	0.923189
stinson	0.923226


#### 3.a.2. List the 10 words whose absence most strongly predicts that the mail is ham<a class="anchor" id="3a2"></a>

In [15]:
query = "ham absence"
print("Word\tProbability\n---------------------")
for word, prob in word_p_a_dict[query]:
    print("%s\t%f" % (word, prob))

Word	Probability
---------------------
501	0.761173
unstable	0.761173
discharge	0.761173
interference	0.761174
renovation	0.761174
persistence	0.761174
cloak	0.761174
suffering	0.761174
secrecy	0.761176
php	0.761206


#### 3.a.3. List the 10 words whose presence most strongly predicts that the mail is spam<a class="anchor" id="3a3"></a>

In [16]:
query = "spam presence"
print("Word\tProbability\n---------------------")
for word, prob in word_p_a_dict[query]:
    print("%s\t%f" % (word, prob))

Word	Probability
---------------------
click	0.750847
search	0.783951
life	0.827946
95	0.843815
save	0.860793
money	0.882610
logo	0.945333
adobe	1.003532
2005	1.020820
viagra	1.037638


#### 3.a.4. List the 10 words whose absence most strongly predicts that the mail is spam<a class="anchor" id="3a4"></a>

In [17]:
query = "spam absence"
print("Word\tProbability\n---------------------")
for word, prob in word_p_a_dict[query]:
    print("%s\t%f" % (word, prob))

Word	Probability
---------------------
respected	0.238828
trash	0.238829
paige	0.238829
determining	0.238829
recovery	0.238831
sets	0.238833
sheet	0.238835
participating	0.238837
sometime	0.238841
steven	0.238868


We see that usually presence of certain words nearly gurantees that the document belongs to certain class. Such as `viagra`, where it's presence strongly suggests that the mail is spam. Similarly, absence of some words suggests to some level that it belongs to certain class. `secrecy` can be a great example for this, where it's absence strongly suggests that the mail is ham.<br>
<br>
Below we reimplemented our $\text{N$\dot{a}$ive Bayes}$ model to use tfidf values to discard not important words. Our implementation discards not important words based on their normalized tfidf values. Where we calculated the normalized tfidf values as follows:
1. Sum by axis 0 to get a matrix of shape (C,), where C is the distinct word count
2. Divide each column sum by document frequency (df), where df is defined as follows:
    $$ df(D, w) = |\{d \in D, w \in d\}| \text{, number of documents where word w appears} $$

In [18]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from tqdm.auto import tqdm
import pandas as pd

MODES = {
    "unigram" : (1, 1),
    "bigram" : (2, 2),
    "uni-bigram" : (1, 2)
}


class NaiveBayesV2:
    def __init__(self, mode: str = "unigram", stop_words = None, use_tfidf : bool = False, tfidf_th : float = None) -> None:
        """
        Initialize NaiveBayes model

        Args:
            mode (str, optional): Mode for BagOfWords, should be either Unigram or Bigram. Defaults to "unigram".
            stop_words (list, optional): Stop words to eliminate from BagOfWords
            use_idf (bool, optional)
            tfidf_th (float, optional): Threshold for tfidf values, necessary only if use_tfidf is True
        """
        self.count_vector = None
        self.probability_dict = dict()

        assert mode in MODES.keys(), "Mode should be either bigram or unigram"
        self.ngram_mode = MODES[mode]
        self.stop_words = stop_words
        self.use_tfidf = use_tfidf
        if use_tfidf:
            self.tfidf_th = tfidf_th

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        """
        Fit the data, calculate probabilities according to it

        Args:
            x (np.ndarray): data with shape (N,1), each row consists of a mail
            y (np.ndarray): data with shape (N,), each row consists of the label of the mail (0 for ham, 1 for spam)
        """

        # columns is list of every words without their counts, counts stores in count_vector
        self.count_vector, columns = self.__vectorizer(x)

        self.__calculate_class_prior(y)     # calculates P(spam) and P (ham), and adds them into probability_dict
        # calculates P(x(i)|spam) and P(x(i)|ham) values, and adds them into probability_dict
        self.__calculate_likelihoods(y, columns, 1)
        
        if self.use_tfidf:
            self.__eliminate_less_important(y, columns)

    def predict(self, x_predict: np.ndarray) -> np.ndarray:
        """
        Predict the labels of the mails in x_predict

        Args:
            x_predict (np.ndarray): Data to be predicted, shape (N,1)

        Returns:
            y_predict (np.ndarray): Predictions for the mails, shape (N,)
        """

        n = x_predict.shape[0]
        y_predict = np.zeros(n)

        vector, columns = self.__vectorizer(x_predict)

        for i in range(n):
            probability_of_spam = 0
            probability_of_ham = 0

            word_idx = np.arange(columns.shape[0])[vector[i] > 0] # only work on words that the text has
            # calculate P(vj | text)
            for j in word_idx:
                if "%s|spam" % columns[j] in self.probability_dict.keys():
                    probability_of_spam += vector[i][j] * self.probability_dict[columns[j] + "|spam"]
                    probability_of_ham += vector[i][j] * self.probability_dict[columns[j] + "|ham"]

            probability_of_spam += self.probability_dict["spam"]
            probability_of_ham += self.probability_dict["ham"]

            y_predict[i] = 1 if probability_of_spam > probability_of_ham else 0

        return y_predict

    def __vectorizer(self, arr: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
        """
        Creates a matrix which stores the words that emails includes and their counts in each email

        :param arr: an numpy array with shape (N,1) for
        :return: a CountVectorizer matrix (N, number of different words in emails) and a vector named columns
        with shape (number of different words in emails,1) which stores all words appears in mails
        """

        # initializes CountVectorizer item with ngram_mode
        vectorizer = CountVectorizer(ngram_range=self.ngram_mode, stop_words=self.stop_words)

        # vector that holds all words in emails and their counts for each item
        vector = vectorizer.fit_transform(arr)

        # convert vector variable to array for usability
        count_vector = vector.toarray()

        # names of columns
        columns = vectorizer.get_feature_names_out()

        return count_vector, columns
    
    def __eliminate_less_important(self, y : np.ndarray, columns : np.ndarray):
        tfidf = TfidfTransformer(use_idf = True, smooth_idf = True)
        
        tfidf_ham = tfidf.fit_transform(self.count_vector[y == 0]).sum(axis = 0).A1
        counts = np.sum(self.count_vector[y == 0] > 0, axis=0)
        tfidf_ham /= (1 + counts)
        
        
        tfidf_spam = tfidf.fit_transform(self.count_vector[y == 1]).sum(axis = 0).A1
        counts = np.sum(self.count_vector[y == 1] > 0, axis=0)
        
        tfidf_spam /= (1 + counts)
        
        idx_ham = np.arange(tfidf_ham.shape[0])[tfidf_ham > self.tfidf_th]
        idx_spam = np.arange(tfidf_spam.shape[0])[tfidf_spam > self.tfidf_th]
        
        old_total = tfidf_ham.shape[0] + tfidf_spam.shape[0]
        current_total = idx_ham.shape[0] + idx_spam.shape[0]
        print("%d words discarded because they fall below the tfidf threshold, predicting with %d word"\
              % ((old_total - current_total), current_total))
        
        for doc_class, idx in [("ham", idx_ham), ("spam", idx_spam)]:
            for i in idx:
                self.probability_dict["%s|%s" % (columns[i], doc_class)] = 0
        
        

    def __calculate_class_prior(self, y: np.ndarray) -> None:
        """
        Calculates class probabilities [P(spam) and P(ham)] for training examples, and adds the result into
        self.probability dictionary as "spam" and "ham" labels

        :param y: data with shape (N,), each row consists of the label of the mail (0 for ham, 1 for spam)
        :return: None
        """

        # labels are only 0 and 1 therefore if we sum all items we get number of 1s
        # instead of a for loop we can use this method
        number_of_spam = np.sum(y)
        number_of_ham = len(y) - number_of_spam

        self.probability_dict["spam"] = np.log(number_of_spam / y.shape[0])  # P(spam) = number of spams / N
        self.probability_dict["ham"] = np.log(number_of_ham / y.shape[0])     # P(ham) = number of hams / N

    def __calculate_likelihoods(self, y: np.ndarray, columns: np.ndarray, alpha: int) -> None:
        """
        Calculates likelihoods of each word that contains in emails as P(word|spam) and P(word|ham), and adds the results
        into self.probability dictionary as "word|spam" and "word|ham" labels

        :param y: data with shape (N,), each row consists of the label of the mail (0 for ham, 1 for spam)
        :param columns: a vector with shape (number of different words in emails,1) which stores all words appears in mails
        :param alpha: int value for smoothing
        :return: None
        """

        N, D = self.count_vector.shape
        spam_vector, ham_vector = np.sum(self.count_vector[y == 1], axis=0), np.sum(self.count_vector[y == 0], axis=0)

        n_s = np.sum(spam_vector) # | Text_spam |
        n_h = np.sum(ham_vector) # | Text_ham  |

        for word_i in range(D):
            n_w_s = spam_vector[word_i]
            n_h_s = ham_vector[word_i]

            self.probability_dict["%s|spam" % columns[word_i]] = np.log((n_w_s + alpha) / (n_s + D))
            self.probability_dict["%s|ham" % columns[word_i]] = np.log((n_h_s + alpha) / (n_h + D))

In [19]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

file_path = os.path.join(os.getcwd(), "emails.csv")
df = pd.read_csv(file_path)
X, y = df["text"].to_numpy(), df["spam"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = NaiveBayesV2(use_tfidf=True, tfidf_th = 0.3)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
print("Accuracy with tfidf_th = 0.3:", accuracy_score(y_test.astype(bool), y_predict.astype(bool)))

67489 words discarded because they fall below the tfidf threshold, predicting with 107 word
Accuracy with tfidf_th = 0.3: 0.9912739965095986


We tried our model with tfidf threshold value 0.3 and noted important things:
1. Total number of used words **decreased by 67489 words**
2. Even though only 107 words were used, our accuracy is **0.9912**, which we found quite high.

We conclude that using tfidf values helps to find words that are important for certain class, hence discarding unnecessary operations.

### 3.b. Stopwords<a class="anchor" id="3b"></a>

In [20]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

word_p_a_dict_stop = find_presence_absence_probilities(X, y, ENGLISH_STOP_WORDS)

take_n = 10 # take n words with highest posterior probility
for key, value in word_p_a_dict.items():
    word_p_a_dict_stop[key].sort(key= lambda x : x[1])
    word_p_a_dict_stop[key] = value[-take_n:]

##### Highest TFIDF values for class 1 #####
['real' 'info' 'buy' 'identity' 'rate' 'remove' 'security' 'contact'
 'engines' 'sites' 'internet' 'available' 'look' 'design' 'removed' '2004'
 'today' 'creative' 'love' 'mailing' 'oem' 'getting' '19' 'hot' 'wish'
 'success' 'received' 'send' 'ready' 'day' 'credit' 'great' 'price' 'help'
 'fast' 'work' 'right' 'don' 'good' 'regards' 'address' 'thing' 'prices'
 'use' 'know' 'submit' 'hello' 'future' 'stationery' 'web' 'man' 'visit'
 'order' '2005' 'interested' 'way' 'net' 'people' 'marketing' 'home'
 'offer' 'receive' '10' 'start' 'offers' '000' 'time' 'search' 'account'
 'list' 'viagra' 'want' 'life' 'message' 'new' 'www' 'site' 'like' '95'
 'mail' 'logo' 'make' 'information' 'best' 'online' 'need' 'save' 'free'
 'http' 'just' 'adobe' 'money' 'company' 'email' 'click' 'website'
 'software' 'com' 'business' 'subject']
##### Highest TFIDF values for class 0 #####
['doc' 'office' 'data' 'market' 'options' 'london' 'make' 'send' 'just'
 'sent' 

Calculating probilities for P(class | word in doc) and P(class | word not in doc):   0%|          | 0/2 [00:00…

Doing presence spam:   0%|          | 0/100 [00:00<?, ?it/s]

Doing absence spam:   0%|          | 0/100 [00:00<?, ?it/s]

Doing presence ham:   0%|          | 0/100 [00:00<?, ?it/s]

Doing absence ham:   0%|          | 0/100 [00:00<?, ?it/s]

In [21]:
query = "spam presence"
print("Word\tProbability\n---------------------")
for word, prob in word_p_a_dict_stop[query]:
    print("%s\t%f" % (word, prob))

Word	Probability
---------------------
click	0.750847
search	0.783951
life	0.827946
95	0.843815
save	0.860793
money	0.882610
logo	0.945333
adobe	1.003532
2005	1.020820
viagra	1.037638


In [22]:
query = "ham presence"
print("Word\tProbability\n---------------------")
for word, prob in word_p_a_dict_stop[query]:
    print("%s\t%f" % (word, prob))

Word	Probability
---------------------
cc	0.920928
hou	0.921190
shirley	0.921799
ect	0.922348
vince	0.922348
enron	0.922417
kaminski	0.922542
crenshaw	0.923175
713	0.923189
stinson	0.923226


### 3.c. Analyzing effect of the stopwords<a class="anchor" id="3c"></a>

Even though we removed stopwords from the corpus we got the same 10 words that suggests a certain class in case of presence. We think that, this happened because of how our pipeline works. Earlier, at the elimination step by using tfidf values, we saw some stopwords in the matrix, but there were none in the output. This happened because when we are calculating their posterior probabilities, we are diving by $P(w)$ which gets higher as word seen frequently in the documents, thus resulting in low $P(w)$. At the end, even though we removed stop words, we got the same words for our results.

1. Why might it make sense to remove stop words when interpreting the model?
    1. Removing stop words helps to lower the corpus by a certain amount, thus making the model faster. Furthermore since stop words are used too often, it results in high conditional probility even though they are near to meaningless when left alone.
2. Why might it make sense to keep stop words?
    1. In our opinion, keeping stop words in lower ngram settings doesn't introduce a meaningful increase in prediction. On the other hand, if we were to keep it in higher ngram settings we think it will introduce meaningful combinations may referring to certain idioms, hence it may introduce certain word combinations that has very high conditional probility and it's presence suggesting that the mail belongs to certain class.

In [23]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import pandas as pd
import numpy as np
import os

file_path = os.path.join(os.getcwd(), "emails.csv")
df = pd.read_csv(file_path)
X, y = df["text"].to_numpy(), df["spam"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=29, stratify=y
)

args = [("unigram", None), ("unigram", ENGLISH_STOP_WORDS)]
for arg in args:
    model = NaiveBayes(*arg)        # initializes the model
    model.fit(X_train, y_train)      # training
    y_predict = model.predict(X_test)
    acc = accuracy_score(y_test.astype(bool), y_predict.astype(bool))
    print("Accuracy with ngram=%s, stop_words=%s:\t%f" % (arg[0], arg[1] != None, acc))

Accuracy with ngram=unigram, stop_words=False:	0.989529
Accuracy with ngram=unigram, stop_words=True:	0.990401


As expected, we see an increase in accuracy when we discard stop words from our model. Our accuracy increased from 0.989529 to 0.990401. We think that, the extra noise coming from calculating conditional probilities of stop words caused this difference.

## Part 4 Calculation of Performance Metrics<a class="anchor" id="part4"></a>

Below we calculate the wanted performance metrics for different settings of the model.

$$\textbf{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\textbf{Precision} = \frac{TP}{TP + FP}$$
$$\textbf{Recall} = \frac{TP}{TP + FN}$$
$$\textbf{F1 Score} = \frac{2 * (Precision * Recall)}{Precision + Recall}$$

In [24]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import pandas as pd
import numpy as np
import os

file_path = os.path.join(os.getcwd(), "emails.csv")
df = pd.read_csv(file_path)
X, y = df["text"].to_numpy(), df["spam"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=29, stratify=y
)

args = [("unigram", ENGLISH_STOP_WORDS), ("bigram", ENGLISH_STOP_WORDS), ("unigram", None), ("bigram", None)]
for arg in args:
    model = NaiveBayes(*arg)        # initializes the model
    model.fit(X_train, y_train)      # training
    y_predict = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_predict).ravel()
    
    acc = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    print("Performance metrics with settings ngram=%s, stop_words=%s:" % (arg[0], arg[1] != None))
    print("---------------------------")
    print("Accuracy:\t%f" % acc)
    print("Precision:\t%f" % precision)
    print("Recall:\t%f" % recall)
    print("F1 Score:\t%f" % f1)
    print("\n")

Performance metrics with settings ngram=unigram, stop_words=True:
---------------------------
Accuracy:	0.990401
Precision:	0.978182
Recall:	0.981752
F1 Score:	0.979964


Performance metrics with settings ngram=bigram, stop_words=True:
---------------------------
Accuracy:	0.990401
Precision:	0.992509
Recall:	0.967153
F1 Score:	0.979667


Performance metrics with settings ngram=unigram, stop_words=False:
---------------------------
Accuracy:	0.989529
Precision:	0.974638
Recall:	0.981752
F1 Score:	0.978182


Performance metrics with settings ngram=bigram, stop_words=False:
---------------------------
Accuracy:	0.990401
Precision:	0.996226
Recall:	0.963504
F1 Score:	0.979592


