In [1]:
import pandas as pd
import numpy as np
from collections import Counter

# Näive Bayes Classifier

Given a feature vector $\vec{x}$, we want to know which of all the classes is most likely (*main problem*). Essentially, we want to answer the following questions: 
<br><br>
\begin{equation}
    \text{argmax}_{k \in K} P(C=k|\vec{x})
\end{equation}
<br><br>
where $C$ is a random variable representing the class of the data. Using Bayes' theorem, we have an expression for $P(C=k|\vec{x})$:
<br><br>
\begin{equation}
    P(C=k|\vec{x}) = P(C=k) \frac{P(\vec{x}|C=k)}{P(\vec{x})}
\end{equation}
<br><br>
We note that

* $P(C=k|\vec{x})$ - posterior probability
* $P(C=k)$ - prior
* $P(\vec{x}|C=k)$ - likelihood
* $P(\vec{x})$ - evidence

We note that $\vec{x}$ is also a stochastic or random variable. Suppose that $\vec{x} \in \mathbb{R}^n$, then we can expand the numerator using the Chain Rule of probability assuming that the **features are independent** (this is called the Näive assumption)
<br><br>
\begin{align}
    P(x_1, x_2, \ldots, x_n|C=k) = P(C=k) P(\vec{x}|C=k)
\end{align}
<br><br>
Applying the chain rule multiple times:
<br><br>
\begin{align}
    P(C=k) P(\vec{x}|C=k) &= P(x_1|x_2, \ldots, x_n, C=k) P(x_2, \ldots, x_n, C=k)\\
                                &= P(x_1|x_2, \ldots, x_n, C=k) P(x_2|x_3, \ldots, x_n,C=k) \cdots P(x_n|C=k) P(C=k)\\
    P(C=k) P(\vec{x}|C=k) &= P(C=k) \prod_{i=1}^n P(x_i|C=k)
\end{align}
<br><br>
Therefore, we have:
<br><br>
\begin{align}
    P(C=k|\vec{x}) = \frac{P(C=k)}{P(\vec{x})} \prod_{i=1}^n P(x_i|C=k)
\end{align}
<br><br>
Hence, the problem reduces to
<br><br>
\begin{align}
    \text{argmax}_{k \in K} P(C=k) \prod_{i=1}^n P(x_i|C=k)
\end{align}
<br><br>
We can drop the term $P(\vec{x})$ in the denominator because it is not dependent on $k$. 

# Spam Filters

A spam filter is a classification problem with two classes: spam and ham (not spam). Let's go into detail as to how to solve:
<br><br>
\begin{align}
    \text{argmax}_{k \in K} P(C=k) \prod_{i=1}^n P(x_i|C=k)
\end{align}
<br><br>
We note that we have a labeled training set to determine $P(C=\text{spam})$, the probability of spam, and $P(C=\text{ham})$, the probability of ham. To do this, **we assume that the training set is a representative sample** and define
<br><br>
\begin{align}
    P(C=\text{spam}) = \frac{N_{\text{spam}}}{m}
\end{align}
<br><br>
and 
<br><br>
\begin{align}
    P(C=\text{ham}) = \frac{N_{\text{ham}}}{m}
\end{align}
<br><br>
where $m$ is the number of samples in the dataset. Using a bag of words model, we can create a simple representation of $P(x_i|C=k)$ where $x_i$ is the $i$th word in a message (one-hot vector), and therefore $\vec{x}$ is the entire message. This results in the simple definition:
<br><br>
\begin{align}
    P(x_i|C=k) = \frac{N_{\text{occurences of $x_i$ in class $k$}}}{N_{\text{words in class $k$}}}
\end{align}
<br><br>
We note that the denominator is the total number of occurences of *any* word in class $k$. 

In [2]:
class NaiveBayesFilter:
    def __init__(self):
        pass
    
    # Class Methods
    def fit(self, X, y):
        '''
        X: pd.Series
        y: pd.Series
        '''
        # Probability of spam
        self.P_spam = len(y[y=='spam']) / len(y)
        
        # Probability of ham
        self.P_ham = len(y[y=='ham']) / len(y)
        
        # Filter the spam and ham data from X
        X_spam = X[y=='spam'].reset_index()
        X_spam.drop('index', axis=1, inplace=True)
        X_spam['Message'] = X_spam['Message'].str.lower()
        
        X_ham = X[y=='ham'].reset_index()
        X_ham.drop('index', axis=1, inplace=True)
        X_ham['Message'] = X_ham['Message'].str.lower()
        
        # Get the counts for each class
        count_spam = self.__counts(X_spam)
        count_ham = self.__counts(X_ham)
        
        # Get the unique words
        self.unique_words = sorted(list( set(count_ham.keys()) | set(count_spam.keys()) ))
        
        # Get the words not in spam but in ham, vice versa
        keys_not_in_spam = list(set(count_ham.keys()).difference(list(count_spam.keys()))) # keys in ham that are not in spam
        keys_not_in_ham = list(set(count_spam.keys()).difference(list(count_ham.keys())))
        
        # For the keys not in spam, automatically set to 0
        for not_key in keys_not_in_spam:
            count_spam[not_key] = 0

        for not_key in keys_not_in_ham:
            count_ham[not_key] = 0
            
        # Sort the keys of the dictionaries
        count_spam = dict(sorted(count_spam.items()))
        count_ham = dict(sorted(count_ham.items()))
        
        # Construct the final dataframe
        data = pd.DataFrame(columns=count_spam.keys(), index=['spam','ham'])
        data.loc['spam'] = list(count_spam.values())
        data.loc['ham'] = list(count_ham.values())
        
        self.data = data
    
    
    def predict_proba(self, X):
        '''
        In this method, we want to calculate proba = [P(Spam|x), P(Ham|x)].
        
        Input:
            X: pd.Series
                    - Represents new data that needs to be classified. The data is a series of messages.
        '''
        
        # Computes [P(Spam|x), P(Ham|x)] for each message x in dataset X
        proba = np.zeros((len(X), 2))
        for i, x in enumerate(X):
            prod_spam_x = 1
            prod_ham_x = 1
            for word in x:
                if word in self.unique_words:
                    prod_spam_x *= self.data.loc['spam', word] / self.data.loc['spam'].sum()
                    prod_ham_x *= self.data.loc['ham', word] / self.data.loc['ham'].sum()
                    
                else: # Word is not in the dictionary (unknown)
                    prod_spam_x *= 1
                    prod_ham_x *= 1
            
            proba[i] = np.array([self.P_spam * prod_spam_x, self.P_ham * prod_ham_x])
                
        # Returns [P(Spam|x), P(Ham|x)] for each message in x
        return proba
    
    
    def predict(self, X):
        # Gets the probabilities
        probas = self.predict_proba(X)
        
        # Predicts the class of each example
        pred_classes = []
        for i in probas:
            result = np.argmax(i) 
            if result == 0:
                pred_classes.append('spam')
            else:
                pred_classes.append('ham')
            
        return np.array(pred_classes)
    
    
    def predict_log_proba(self, X):
        log_proba = np.zeros((len(X), 2))
        for i, x in enumerate(X):
            sum_spam_x = 0
            sum_ham_x = 0
            for word in x:
                if word in self.unique_words:
                    sum_spam_x += np.log((self.data.loc['spam', word] + 1) / (self.data.loc['spam'].sum() + 2))
                    sum_ham_x += np.log((self.data.loc['ham', word] + 1) / (self.data.loc['ham'].sum() + 2))
                    
                else: # Word is not in the dictionary (unknown)
                    sum_spam_x += np.log(1)
                    sum_ham_x += np.log(1)
                
        return log_proba
    
    
    def predict_log(self, X):
        # Gets the probabilities
        log_probas = self.predict_log_proba(X)
        
        # Predicts the class of each example
        pred_classes = []
        for i in log_probas:
            result = np.argmax(i) 
            if result == 0:
                pred_classes.append('spam')
            else:
                pred_classes.append('ham')
            
        return np.array(pred_classes)
    
    
    # Secret methods
    def __counts(self, df):
        results = Counter()
        df['Message'].str.lower().str.split().apply(results.update)
        
        return dict(sorted(results.items()))

# Load the data

In [3]:
df = pd.read_csv('data/spam.csv', encoding = "ISO-8859-1", usecols=['Class', 'Message'])
df.head()

Unnamed: 0,Class,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


# Select the features and the labels

In [4]:
X = df.Message
y = df.Class

# Instantiate the model

In [5]:
NB = NaiveBayesFilter()
NB.fit(X[:300], y[:300])

# Make Predictions

In [6]:
y_pred = NB.predict(X[5000:5500])
y_true = y[5000:5500].values

# Accuracy Score

In [7]:
from sklearn.metrics import accuracy_score

In [8]:
print(f"Accuracy: {accuracy_score(y_true, y_pred) * 100}%")

Accuracy: 84.0%


# Underflow

In [9]:
NB.predict_proba(X[[1085, 2010]])

array([[0.00000000e+000, 2.46881184e-176],
       [0.00000000e+000, 1.18335326e-075]])

##### Use predict_log insead

In [10]:
NB.predict_log(X[[1085, 2010]])

array(['spam', 'spam'], dtype='<U4')