## **Multinomial Naive Bayes : Implementation**
We use the **Multinomial NB classifier** for problems such as **document classification**.

* We represent the $i^{th}$ document with a feature vector $  { \mathbf x^{(i)}}$ containing the counts of the vocabulary words in that document: $ { \{ x_1^{(i)}, x_2^{(i)},\ldots, x_m^{(i)}\}}$

* The sum of all feature counts is equal to the total number of words $l$ in the document : $ {\sum \limits_{j=1}^m x_j^{(i)}=l}$ 
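To make the count-vector representation concrete, here is a minimal sketch. The vocabulary and the document are made up for illustration; they are not part of the original data.

```python
from collections import Counter

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["ball", "goal", "vote", "party", "match"]
document = "goal goal match ball match goal".split()

counts = Counter(document)
# x[j] is the number of times the j-th vocabulary word occurs in the document.
x = [counts[word] for word in vocabulary]

print(x)       # [1, 3, 0, 0, 2]
print(sum(x))  # 6, the document length l
```

Every word of this toy document is in the vocabulary, so the counts add up to the document length.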

In mathematical terms: 
$$ \mathbf x|y_r \sim \text{Multinomial}(w_{1y_r},w_{2y_r},\ldots,w_{my_r}) $$
or, in vector form,
$$ \mathbf x|y_r \sim \text {Multinomial}(\mathbf w_{\mathbf y_r})$$

The **total number of parameters** $=m \times k + k$ 
where : 
* $m \times k$ is the total number of parameters of the $k$ multinomial distributions ($m$ per class) and 

* $k$ is the total number of priors.

### **Parameter estimation**

The $j^{th}$ component of the parameter vector $\mathbf w_{\mathbf y_r}$ is calculated as follows:
\begin{equation} 
w_{jy_r}=\frac{\sum \limits_{i=1}^n1(y^{(i)}=y_r)x_j^{(i)}}{\sum \limits_{i=1}^n1(y^{(i)}=y_r) \sum \limits_{j=1}^m x_j^{(i)}}
\end{equation}

Here, 

* The numerator is the **sum of feature** $x_j$ over all examples from class $y_r$.

* The denominator is the total count of all features over all examples from class $y_r$.
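The estimate above can be traced by hand on a small case. The following is a minimal sketch with a made-up count matrix and made-up labels, chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Toy count matrix (made-up values): 4 documents, 3 vocabulary words.
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])

X_c = X[y == 0]               # documents that belong to class 0
numerator = X_c.sum(axis=0)   # per-word counts in class 0
denominator = X_c.sum()       # total word count in class 0
w_0 = numerator / denominator

print(numerator.tolist())     # [3, 1, 0]
print(int(denominator))       # 4
print(w_0.tolist())           # [0.75, 0.25, 0.0]
```

The third word never occurs in class 0, so its estimated probability is exactly zero. Any document containing that word would then get zero likelihood under class 0.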

With **Laplace correction**: 
\begin{equation} 
w_{jy_r}=\frac{\sum \limits_{i=1}^n1(y^{(i)}=y_r)x_j^{(i)}+\alpha}{\sum \limits_{i=1}^n1(y^{(i)}=y_r) \sum \limits_{j=1}^m x_j^{(i)}+m\alpha}
\end{equation}

**NOTE :** We add $\alpha$ to the numerator and $m\alpha$ to the denominator, so that the estimates still sum to 1 over the $m$ features. Laplace correction corresponds to $\alpha =1$.
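
As a quick illustration of the correction, the sketch below compares the uncorrected and corrected estimates for one class. The count matrix and labels are made up for the example.

```python
import numpy as np

# Toy count matrix (made-up values): 4 documents, 3 vocabulary words.
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]])
y = np.array([0, 0, 1, 1])
alpha = 1          # Laplace correction
m = X.shape[1]     # vocabulary size

X_c = X[y == 0]
w_unsmoothed = X_c.sum(axis=0) / X_c.sum()
w_smoothed = (X_c.sum(axis=0) + alpha) / (X_c.sum() + m * alpha)

print(w_unsmoothed.tolist())   # [0.75, 0.25, 0.0]
print(w_smoothed)              # 4/7, 2/7, 1/7
print(w_smoothed.sum())        # still sums to 1, up to rounding
```

With $\alpha=1$ the unseen third word gets probability $1/7$ instead of $0$, and the estimates remain a valid distribution.
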
### **Inference**
In log space, the calculation is performed as follows:
* In the numerator, we first **multiply** the **count matrix** with the **transpose of the log of the weight vector** of class $y_c$, **add** the **log of the prior probability** of $y_c$, and then **exponentiate** the result.

* In the denominator, we perform the same calculation for **each class label** and **sum** the results.

* The denominator normalizes the numerator to a value between 0 and 1, thus giving us the posterior probability of label $y_c$ for the given count vector $\mathbf x$.

\begin{equation} 
p(y_c|\mathbf x; l, \mathbf w_{y_c})=\frac{\exp\left(\mathbf X(\log \mathbf w_{\mathbf y_c})^T+\log p(y_c)\right)}{{\sum}_r\exp\left(\mathbf X(\log \mathbf w_{\mathbf y_r})^T+\log p(y_r)\right)}
\end{equation}
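
For long documents the exponentials in this formula can underflow to zero. The sketch below shows one common remedy: subtract the row-wise maximum of the log scores before exponentiating, which leaves the ratio unchanged. The parameter values and the document are assumptions made for illustration.

```python
import numpy as np

# Hypothetical parameters for 2 classes and 3 vocabulary words.
w = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.3, 0.5]])
prior = np.array([0.5, 0.5])

# One long document: large counts push the log scores far below zero.
X = np.array([[400, 300, 300]])

# Log of likelihood times prior, one column per class; shape (1, 2).
log_joint = X @ np.log(w).T + np.log(prior)

# Naive normalisation: both exponentials underflow to 0, giving 0/0 = nan.
with np.errstate(invalid="ignore", divide="ignore"):
    naive = np.exp(log_joint) / np.exp(log_joint).sum(axis=1, keepdims=True)

# Stable normalisation: shift each row by its maximum before exponentiating.
shifted = log_joint - log_joint.max(axis=1, keepdims=True)
stable = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(naive)    # nan entries
print(stable)   # valid posterior, almost all mass on class 0
```

The shift cancels between numerator and denominator, so the stable version computes the same posterior as the formula whenever the naive version is representable.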


### **Implementation**

In [1]:
import numpy as np

class MultinomialNB(object):
    def fit(self, X, y, alpha=1):
        '''implements parameter estimation for multinomial NB.'''
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)

        #calculate parameters of k multinomial distributions and priors.
        self.w = np.zeros((n_classes, n_features), dtype=np.float64)
        self.w_prior = np.zeros(n_classes, dtype=np.float64)

        for idx, c in enumerate(self._classes):
            X_c = X[y == c]

            #get the total count of features for class c.
            total_count = np.sum(np.sum(X_c, axis=1))

            #estimate parameters of multinomial distribution for class c

            self.w[idx, :] = (np.sum(X_c, axis=0)+alpha) / \
                (total_count+alpha*n_features)

            ##estimate class prior for class c.
            self.w_prior[idx] = (X_c.shape[0]+alpha)/float(n_samples+alpha*n_classes)

    def log_likelihood_prior_prod(self, X):
        '''calculates log of product of likelihood and prior.'''
        return X@(np.log(self.w).T) + np.log(self.w_prior)

    def predict(self, X):
        ''' predicts class for input examples.'''
        # map argmax indices back to the original class labels.
        return self._classes[np.argmax(self.log_likelihood_prior_prod(X), axis=1)]

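    def predict_proba(self, X):
        ''' calculates probability of examples belonging to diff. classes.'''
        q = self.log_likelihood_prior_prod(X)
        # subtract the row-wise maximum before exponentiating to avoid underflow.
        q = q - np.max(q, axis=1, keepdims=True)
        return np.exp(q)/np.sum(np.exp(q), axis=1, keepdims=True)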

### **Demonstration**

We will demonstrate the working of the multinomial NB classifier in the following setups:
1. Binary classification setup.

2. Multi-class classification setup.

#### DEMO 1 : *Binary classification* 

Generate synthetic data for two classes, with 5 features per example.

In [2]:
rng = np.random.RandomState(1)

# feature counts: integers from 0 to 4
X = rng.randint(5, size=(1000, 5))

# labels: 0 or 1, drawn independently of the features
y = rng.randint(2, size=(1000,))

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test , y_train, y_test = train_test_split(X,y)

print('Shape of feature matrix : ', X_train.shape)
print('Shape of label vector : ', y_train.shape)

Shape of feature matrix :  (750, 5)
Shape of label vector :  (750,)


Estimate the parameters of Multinomial NB.

In [5]:
multinomial_nb = MultinomialNB()
multinomial_nb.fit(X_train, y_train)

# Examine the parameters of multinomial NB.
print('Prior : ',multinomial_nb.w_prior)
print()
print('Parameters of multinomial distribution : \n',multinomial_nb.w)

Prior :  [0.48138298 0.51861702]

Parameters of multinomial distribution : 
 [[0.21458508 0.17993853 0.19027661 0.20648226 0.20871752]
 [0.21238047 0.19778561 0.19954706 0.19401107 0.19627579]]


Observe that : 

* The two classes are approximately equally likely: each prior is close to 0.5.

* For each class, the probabilities of the different features sum to 1.

Let's evaluate the classifier:

In [6]:
from sklearn.metrics import classification_report
print(classification_report(y_test, multinomial_nb.predict(X_test)))

              precision    recall  f1-score   support

           0       0.49      0.28      0.36       124
           1       0.50      0.71      0.59       126

    accuracy                           0.50       250
   macro avg       0.49      0.49      0.47       250
weighted avg       0.49      0.50      0.47       250



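The low precision and recall values are expected: the labels in the synthetic data were assigned at random, independently of the features, so the classifier cannot do better than chance (about 0.5 for two balanced classes).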
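
For contrast, here is a minimal sketch of the same estimation and prediction steps on data whose labels do depend on the features. The two class-conditional word distributions, the document length of 20 and the split sizes are assumptions chosen for illustration, and the estimator is written inline so that the snippet runs on its own.

```python
import numpy as np

rng = np.random.RandomState(0)

# Assumed class-conditional word distributions (illustrative values only).
p0 = [0.6, 0.1, 0.1, 0.1, 0.1]
p1 = [0.1, 0.1, 0.1, 0.1, 0.6]

# 500 documents per class, 20 words each, drawn from the class distribution.
X = np.vstack([rng.multinomial(20, p0, size=500),
               rng.multinomial(20, p1, size=500)])
y = np.repeat([0, 1], 500)

# Shuffle, then hold out the last 250 examples for testing.
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, y_train, X_test, y_test = X[:750], y[:750], X[750:], y[750:]

alpha = 1
classes = np.unique(y_train)
# Smoothed multinomial parameters (one row per class) and smoothed priors.
w = np.array([(X_train[y_train == c].sum(axis=0) + alpha)
              / (X_train[y_train == c].sum() + alpha * X.shape[1])
              for c in classes])
prior = np.array([(np.sum(y_train == c) + alpha)
                  / (len(y_train) + alpha * len(classes))
                  for c in classes])

# Predict the class with the largest log of likelihood times prior.
log_joint = X_test @ np.log(w).T + np.log(prior)
y_pred = classes[np.argmax(log_joint, axis=1)]
accuracy = np.mean(y_pred == y_test)
print(accuracy)
```

Because the two classes use clearly different word distributions, the accuracy here should be far above the chance level seen with random labels.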

Let's calculate the probability of each example belonging to each of the two classes :

In [7]:
multinomial_nb.predict_proba(X_test[:5])

array([[0.47182646, 0.52817354],
       [0.48945212, 0.51054788],
       [0.4673714 , 0.5326286 ],
       [0.47109459, 0.52890541],
       [0.48856185, 0.51143815]])

#### DEMO 2 : *Multi-class classification*

Let's generate data for 3 classes.

In [11]:
rng = np.random.RandomState(1)
X = rng.randint(5,size=(1000,5))
y = rng.randint(3,size=(1000,))

In [12]:
X_train, X_test,y_train, y_test = train_test_split(X,y)
X_train.shape ,X_test.shape, y_train.shape, y_test.shape

((750, 5), (250, 5), (750,), (250,))

Let's estimate parameters of Multinomial NB classifier.

In [14]:
multinomial_nb = MultinomialNB() 
multinomial_nb.fit(X_train,y_train)

# Examine the parameters of multinomial NB.

print('Prior : ',multinomial_nb.w_prior)
print()
print('Parameters of multinomial distribution : \n',multinomial_nb.w)

Prior :  [0.34130146 0.35192563 0.30677291]

Parameters of multinomial distribution : 
 [[0.21068939 0.197134   0.1859024  0.20565453 0.20061967]
 [0.21341463 0.18407012 0.20121951 0.19893293 0.2023628 ]
 [0.21391304 0.2        0.19434783 0.19695652 0.19478261]]


Let's evaluate the classifier that we have learnt:

In [15]:
print(classification_report(y_test, multinomial_nb.predict(X_test)))

              precision    recall  f1-score   support

           0       0.35      0.37      0.36        91
           1       0.34      0.62      0.44        78
           2       0.08      0.01      0.02        81

    accuracy                           0.33       250
   macro avg       0.26      0.33      0.27       250
weighted avg       0.26      0.33      0.28       250



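The low precision and recall values are again expected: the labels were assigned at random, independently of the features, so the classifier cannot do better than chance (about 0.33 for three roughly balanced classes).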

Finally, predict the probability of each test example belonging to each class.

In [16]:
multinomial_nb.predict_proba(X_test[:5])

array([[0.31618901, 0.35900142, 0.32480957],
       [0.35406345, 0.31794449, 0.32799207],
       [0.32437015, 0.39520632, 0.28042353],
       [0.30468286, 0.35292255, 0.34239459],
       [0.31898534, 0.36681071, 0.31420395]])