## Discrete Multivariate Distributions

Multivariate distributions describe the probabilistic behavior of *multiple* quantities and their relationships. Every Statistical and Machine Learning method involves multiple random variables (RVs) and their distributions: joint, conditional, and marginal. In fact, the foundation of each method lies in how it models the dependence structure of the RVs, which is typically done in a way that facilitates analysis and/or computation. In other words, different models make different assumptions about how the variables are related, in order to allow for efficient processing. Experienced Data Scientists understand the modelling assumptions and their limitations, allowing them to choose and improve appropriate methods for the problem at hand.

In this notebook we build a [Naive Bayes Classifier (NBC)](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) for spam detection, one of the earliest and most widespread applications of [Statistical classification](https://en.wikipedia.org/wiki/Statistical_classification). Although the probability model for NBC is very simple, the method works extremely well for email filtering and other text classification. We will first look at the model details, and then we will implement the NBC in Python and compare its results with the built-in function in [scikit-learn](https://scikit-learn.org/stable/), Python's principal Machine Learning library.


### Naive Bayes Classification

Classification is the problem of *predicting the class* of an object based on its other observed characteristics, which are probabilistically related to its class. We denote the class by the *categorical* (i.e., discrete finite) RV $Y$, called the *response* or *label*, and we denote the related variables by $X_1,\ldots,X_p$, called the *predictors* or *features*. For our purposes we will assume the predictors are also discrete finite RVs. So, to summarize, the goal of NBC is to predict the (unknown) value of $Y$ based on the (observed) values of $X_1,\ldots, X_p$.

##### Maximum A Posteriori Prediction

Assume, for starters, that we know the *joint* distribution of all $(p+1)$ RVs $Y,X_1,\ldots, X_p$. If we observed the values $x_1,\ldots, x_p$ for $X_1,\ldots, X_p$, our best description of $Y$ would be the *conditional* distribution:
$$ P( Y = y | X_1 =x_1, \ldots, X_p = x_p ),\quad \text{ for }  y \in \text{ range }(Y) $$
which gives you the conditional probabilities of the possible values of $Y$. If we then had to *guess* which value the RV $Y$ takes, we would choose the *most likely* one, i.e. that which maximizes the conditional probability:
$$ \hat{y} =  \arg \max_y P( Y = y | X_1 =x_1, \ldots, X_p = x_p ) $$
This approach is called [Maximum a Posteriori (MAP)](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation) estimation, from the name of the conditional distribution, which is called the [*posterior*](https://en.wikipedia.org/wiki/Posterior_probability) in this context. 
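
As a small illustration of MAP prediction, the sketch below uses a made-up joint distribution for a binary label $Y$ and a single binary predictor $X_1$. The probabilities are hypothetical and chosen only so that they sum to 1. It computes the posterior by normalizing the relevant slice of the joint table, then picks the most likely label.

```python
# Hypothetical joint distribution P(Y = y, X1 = x), keyed by (y, x).
# The values are illustrative only and sum to 1.
joint = {
    (0, 0): 0.50, (0, 1): 0.10,
    (1, 0): 0.05, (1, 1): 0.35,
}

def posterior(joint, x):
    """Return {y: P(Y = y | X1 = x)} by normalizing the slice of the joint at X1 = x."""
    slice_x = {y: p for (y, xv), p in joint.items() if xv == x}
    evidence = sum(slice_x.values())          # P(X1 = x)
    return {y: p / evidence for y, p in slice_x.items()}

def map_predict(joint, x):
    """Return the label with the largest posterior probability."""
    post = posterior(joint, x)
    return max(post, key=post.get)

print(posterior(joint, 1))    # posterior over y after observing X1 = 1
print(map_predict(joint, 1))  # most likely label given X1 = 1
print(map_predict(joint, 0))  # most likely label given X1 = 0
```

With these numbers, observing $X_1 = 1$ makes $Y = 1$ the more likely label, while observing $X_1 = 0$ favors $Y = 0$.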


Furthermore, from Bayes' theorem we get:
$$ P(Y=y|X_1=x_1,\ldots,X_p=x_p) = \frac{  P(Y=y, X_1 = x_1, \ldots, X_p = x_p ) } { P( X_1 = x_1, \ldots, X_p = x_p ) } \\ = \frac{  P( X_1 = x_1, \ldots, X_p = x_p | Y=y ) P(Y=y) } { P( X_1 = x_1, \ldots, X_p = x_p ) } $$
Note that $y$ appears only in the numerator, so maximizing the posterior is equivalent to maximizing the numerator, which is essentially a re-expression of the *joint* distribution. This is the form of the maximization we will perform, in order to avoid unnecessary calculations. 

Also note that *if* the predictor RVs $X_1,\ldots, X_p$ were *independent* of the response RV $Y$, then the conditional/posterior distribution of $Y$ would be the same as its *marginal* distribution:
$$P( Y = y | X_1 =x_1, \ldots, X_p = x_p ) = P( Y = y)$$
Intuitively, independence implies that the information obtained from $X_1, \ldots, X_p$ is irrelevant for predicting $Y$. When building a classifier, we would therefore want to use predictors that are as strongly *dependent* on the response as possible. As an extreme example, if you had a predictor $X$ that was *perfectly* related to $Y$, i.e. there is a 1-to-1 relation between the values of $X$ and $Y$, then knowing the value of $X$ would tell you the value of $Y$.
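
The independence statement can be checked numerically. The sketch below builds a joint table as the product of two hypothetical marginals (illustrative values only) and confirms that the posterior of $Y$ equals its marginal whatever value of $X$ is observed.

```python
# Hypothetical marginals for a binary label Y and a binary predictor X.
p_y = {0: 0.7, 1: 0.3}
p_x = {0: 0.4, 1: 0.6}

# Under independence the joint is the product of the marginals.
joint = {(y, x): p_y[y] * p_x[x] for y in p_y for x in p_x}

def posterior(joint, x):
    """Return {y: P(Y = y | X = x)} computed from the joint table."""
    evidence = sum(p for (_, xv), p in joint.items() if xv == x)
    return {y: p / evidence for (y, xv), p in joint.items() if xv == x}

for x in p_x:
    post = posterior(joint, x)
    # Observing X leaves the distribution of Y unchanged (up to rounding).
    print(x, {y: round(p, 6) for y, p in post.items()})
```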




##### Naive Bayes Assumption

We have seen how Bayesian classification works, at least conceptually, but to actually *apply* it we need a workable form of the *joint* distribution of $Y,X_1,\ldots, X_p$. Since all variables are finite, their joint distribution can be represented as a $(1+p)$-dimensional array. In Statistical/Machine Learning applications, the probability model is not provided, but has to be *estimated/trained* based on data. E.g., assuming all variables are *binary*, there are around $2^{(p+1)}$ probabilities to estimate in the joint distribution array. Because the number of parameters increases *exponentially* in the number of dimensions/predictors, good estimation quickly becomes infeasible (it requires astronomical amounts of data); this is known as the [curse of dimensionality](link here).
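
To get a feel for this growth, the short sketch below counts the free probabilities in the full joint table of $p+1$ binary variables. The helper name `joint_free_parameters` is made up for this illustration.

```python
def joint_free_parameters(p):
    """Free probabilities in the full joint table of p + 1 binary variables."""
    return 2 ** (p + 1) - 1   # all cells minus one, since they must sum to 1

for p in (5, 10, 20, 30):
    print(f"p = {p:>2}: {joint_free_parameters(p):,} free probabilities")
```

With just 30 binary predictors, the table already has over two billion entries.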

To solve this problem, NBC makes a *naive* (not realistic) simplifying assumption: it assumes features are *conditionally independent* given the class $Y$. In practice, this implies that:
$$ P( X_1 = x_1, \ldots, X_p = x_p | Y=y ) P(Y=y) \\
= P(X_1 = x_1 | Y = y )  \times \cdots \times P(X_p = x_p | Y = y ) \times P( Y = y )  \\
= \left( \prod_{i=1}^{p} P(X_i = x_i | Y = y ) \right)  \times P( Y = y ) $$
I.e., the joint distribution can be expressed as a product of the $X$'s 1D conditionals times $Y$'s 1D marginal. By imposing this special structure we effectively reduce the dimensionality of the problem: instead of one $(p+1)$-dimensional distribution, we work with 1-dimensional distributions only (the marginal of $Y$, and one conditional of each $X_i$ per class). E.g., if we assume all RVs are binary, there are only $2p+1$ free probabilities to estimate (one for $P(Y=1)$, plus one $P(X_i = 1 | Y = y)$ for each of the $p$ features and each of the 2 classes), rather than $2^{(p+1)}-1$ for the general model. Now that we know the main idea behind NBC, we look at a specific application for spam detection.
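
The factorization can be sketched in a few lines. The example below assumes $p = 3$ binary features with hypothetical parameter values (`prior` and `theta` are illustrative, not estimated from any data). It evaluates the numerator of the posterior in log-space, where the product becomes a sum that does not underflow when there are thousands of features.

```python
import math

# Hypothetical parameters for p = 3 binary features (illustrative values only).
prior = {0: 0.75, 1: 0.25}                 # P(Y = y)
theta = {                                  # theta[y][i] = P(X_i = 1 | Y = y)
    0: [0.10, 0.40, 0.05],
    1: [0.60, 0.30, 0.50],
}

def log_joint(x, y):
    """log of P(Y = y) * prod_i P(X_i = x_i | Y = y) under the naive Bayes assumption."""
    total = math.log(prior[y])
    for x_i, t in zip(x, theta[y]):
        total += math.log(t if x_i == 1 else 1.0 - t)
    return total

def predict(x):
    """MAP prediction: the class with the largest (log) numerator."""
    return max(prior, key=lambda y: log_joint(x, y))

def naive_bayes_free_parameters(p):
    """One probability for P(Y = 1), plus one P(X_i = 1 | Y = y) per feature and class."""
    return 1 + 2 * p

print(predict([1, 0, 1]))
print(predict([0, 1, 0]))
print(naive_bayes_free_parameters(3))
```

Counting free parameters this way gives 7 for $p = 3$, compared with $2^{4} - 1 = 15$ for the unrestricted joint table. The gap widens exponentially as $p$ grows.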


##### Spam Data  

We will be working with an open data set of 5726 emails, of which 1368 are *spam* and the remaining 4358 are *ham* (legitimate). These are a subset of the [Enron-Spam](http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/index.html) data set.
We use the [pandas](...) library in Python for loading tabular/spreadsheet-like data, where each row represents an observation and each column represents a field/variable. Column ```text``` contains the email text, and column ```spam``` indicates whether the email is spam (1) or not (0); below is a preview:

In [1]:
import pandas as pd
emails = pd.read_csv("./data/emails.csv",  usecols=[0, 1])
emails.head()

|   | text | spam |
|---|------|------|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


The spam column is our response, and the text column provides our predictors. For the latter, we construct predictor/feature variables that record the presence or absence of selected words across *all* emails. This is a common pre-processing step for text analysis called [tokenization]. It is a crude way to represent text, since it discards all structure and meaning (it is part of [bag-of-words](link) models).

For our tokenization, we use the ```feature_extraction.text.CountVectorizer``` class in ```scikit-learn```. We remove common English stop words, as well as infrequent words (appearing in <10 emails) and very frequent words (appearing in >30% of emails), since neither provides much information (e.g., if a word is in every email, it doesn't say much about the email being spam). There are 6233 remaining words, each one representing a *binary* feature, where 1 means the word is present in the email and 0 means it is absent. Below is a sample of the retained words (every 500th one):


In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# binary=True records presence/absence of each word rather than its count
vectorizer = CountVectorizer(stop_words='english', binary=True, min_df=10, max_df=.30)
X = vectorizer.fit_transform( emails.text )   # apply vectorizer to the email text column; output is a sparse binary matrix
X_words = vectorizer.get_feature_names_out()  # array of the retained words, one per column of X

X_words[::500]   # preview every 500th word


array(['00', 'aggregate', 'bright', 'contractors', 'easter', 'forever',
       'indicating', 'loqo', 'official', 'projected', 'rush',
       'subscribed', 'vieira'], dtype=object)
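
To make the document-frequency filtering concrete, here is a pure-Python sketch of a binary bag-of-words representation on a toy corpus. It is only a rough stand-in for `CountVectorizer`: the emails are made up, the helper `binary_bag_of_words` is hypothetical, it splits on whitespace, and it does not remove stop words.

```python
from collections import Counter

# Toy corpus (hypothetical emails) standing in for the `text` column.
docs = [
    "subject win money now",
    "subject meeting schedule for monday",
    "subject win a free prize now",
    "subject monday meeting moved",
    "subject free money",
]

def binary_bag_of_words(docs, min_df=2, max_df=0.5):
    """Return (vocabulary, rows); rows[i][j] is 1 if word j occurs in document i.

    min_df is a minimum document count and max_df a maximum document fraction,
    loosely mirroring the two CountVectorizer arguments used above.
    """
    tokenized = [set(doc.lower().split()) for doc in docs]   # crude whitespace tokenizer
    doc_freq = Counter(word for words in tokenized for word in words)
    n_docs = len(docs)
    vocabulary = sorted(
        w for w, df in doc_freq.items() if df >= min_df and df / n_docs <= max_df
    )
    rows = [[int(w in words) for w in vocabulary] for words in tokenized]
    return vocabulary, rows

vocabulary, rows = binary_bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Here "subject" is dropped because it appears in every email, and words such as "prize" are dropped because they appear in only one. Each remaining word becomes one 0/1 column.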

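In [3]:
y = emails.spam.to_numpy()   # response vector: 1 = spam, 0 = ham
X.shape[1]                   # number of binary word features

6233

In [4]:
5726 - sum(y)   # number of ham emails


4358

In [5]:
from sklearn.model_selection import train_test_split
# hold out 25% of the emails for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.25, random_state=1234)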

In [320]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()          # Bernoulli Naive Bayes with default smoothing (alpha=1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
sum( y_pred == y_test ) / len(y_test)   # test-set accuracy

0.9629888268156425

In [315]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred) )

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))


[[1031   51]
 [   2  348]]
              precision    recall  f1-score   support

           0       1.00      0.95      0.97      1082
           1       0.87      0.99      0.93       350

    accuracy                           0.96      1432
   macro avg       0.94      0.97      0.95      1432
weighted avg       0.97      0.96      0.96      1432
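
The figures in the report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. As a worked check, the sketch below recomputes accuracy and the spam-class (class 1) precision, recall and F1 from the four counts printed above.

```python
# Confusion matrix reported above: rows are true classes, columns are predicted classes.
cm = [[1031, 51],
      [2, 348]]

tn, fp = cm[0]   # true ham: correctly kept, wrongly flagged as spam
fn, tp = cm[1]   # true spam: missed, correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_spam = tp / (tp + fp)   # share of predicted spam that really is spam
recall_spam = tp / (tp + fn)      # share of true spam that was caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(round(accuracy, 2), round(precision_spam, 2), round(recall_spam, 2), round(f1_spam, 2))
```

So the classifier misses only 2 of the 350 spam emails in the test set, at the cost of flagging 51 legitimate emails as spam.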



In [325]:
import numpy as np

# Smoothed estimates of P(X_j = 1 | Y = y) for every word j, as flat arrays.
# Note: the denominator adds the number of features; standard Laplace smoothing
# for a binary feature adds 2, so these estimates differ from BernoulliNB's.
P_X_y0 = ( np.asarray( X_train[ y_train == 0 ].sum(axis=0) ).ravel() + 1 ) / ( sum( y_train == 0 ) + X_train.shape[1] )
P_X_y1 = ( np.asarray( X_train[ y_train == 1 ].sum(axis=0) ).ravel() + 1 ) / ( sum( y_train == 1 ) + X_train.shape[1] )

# Class priors P(Y = y)
P_y0 = sum( y_train == 0 ) / len(y_train)
P_y1 = sum( y_train == 1 ) / len(y_train)

y_my_pred = np.zeros(len(y_test))

for i in range(len(y_test)):
    x = X_test[i].toarray().ravel()   # dense 0/1 vector for email i
    # log-likelihood of the email under each class; the log-prior terms are left out
    lP_y0X = np.sum( np.log( P_X_y0[ x == 1 ] ) ) + np.sum( np.log( 1 - P_X_y0[ x == 0 ] ) ) #+ np.log( P_y0 )
    lP_y1X = np.sum( np.log( P_X_y1[ x == 1 ] ) ) + np.sum( np.log( 1 - P_X_y1[ x == 0 ] ) ) #+ np.log( P_y1 )
    if lP_y1X > lP_y0X:
        y_my_pred[i] = 1

sum( y_my_pred == y_test ) / len(y_test)   # accuracy of the hand-rolled classifier

0.9120111731843575

In [326]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_my_pred) )

from sklearn.metrics import classification_report
print(classification_report(y_test, y_my_pred))


[[1042   40]
 [  86  264]]
              precision    recall  f1-score   support

           0       0.92      0.96      0.94      1082
           1       0.87      0.75      0.81       350

    accuracy                           0.91      1432
   macro avg       0.90      0.86      0.88      1432
weighted avg       0.91      0.91      0.91      1432
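
The hand-rolled cell above adds the number of features to the smoothing denominator and leaves out the log-prior, which may account for at least part of the gap to `BernoulliNB`. For comparison, the pure-Python sketch below implements a Bernoulli Naive Bayes with the textbook Laplace smoothing, $(\text{count} + \alpha) / (n_y + 2\alpha)$, and with the log-prior included. The toy data and the helper names `fit_bernoulli_nb` and `predict_bernoulli_nb` are hypothetical, and it is not meant as a drop-in replacement for the library.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class log-priors and smoothed P(X_j = 1 | Y = c) from binary rows X."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, theta = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Laplace smoothing: add alpha to the count and 2 * alpha to the total,
        # because each binary feature has two possible values.
        theta[c] = [
            (sum(x[j] for x in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return log_prior, theta

def predict_bernoulli_nb(x, log_prior, theta):
    """Return the class maximizing log P(Y = c) + sum_j log P(X_j = x_j | Y = c)."""
    def score(c):
        return log_prior[c] + sum(
            math.log(t if x_j == 1 else 1.0 - t) for x_j, t in zip(x, theta[c])
        )
    return max(log_prior, key=score)

# Toy training data (hypothetical): columns could stand for "free", "meeting", "money".
X_toy = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 1, 0], [1, 0, 1]]
y_toy = [1, 1, 0, 0, 0, 1]

log_prior, theta = fit_bernoulli_nb(X_toy, y_toy)
print(theta[1])                                           # smoothed P(X_j = 1 | spam)
print(predict_bernoulli_nb([1, 0, 1], log_prior, theta))  # contains "free" and "money"
print(predict_bernoulli_nb([0, 1, 0], log_prior, theta))  # contains only "meeting"
```

Thanks to the smoothing, a word never seen in a class still gets a non-zero probability, so no single unseen word can force the log-score to minus infinity.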



In [233]:
X_train[ y_train > 0 ]

<1018x6233 sparse matrix of type '<class 'numpy.int64'>'
	with 66870 stored elements in Compressed Sparse Row format>

In [288]:
# class priors estimated from the training labels; they sum to 1
print(P_y0, P_y1, P_y0 + P_y1)

0.7629250116441546 0.23707498835584537 1.0
