## **Bernoulli Naive Bayes: Implementation**

### **Parameter estimation** : Class conditional density and prior

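Remember that the **class prior** for Bernoulli NB is estimated as follows:

\begin{equation} 
w_{y_c} = \frac{\sum_{i=1}^{n} \mathbb {1} (y^{(i)}=y_c)}{n} 
\end{equation} 

Here, the numerator gives us the **total number of examples with label $y_c$**, and it is divided by the **total number of examples in the training set**.

The **class conditional density** parameter for feature $x_j$ is the fraction of examples with label $y_c$ in which $x_j=1$:

\begin{equation} 
w_{jy_c} = \frac{\sum_{i=1}^{n} \mathbb {1} (y^{(i)}=y_c)\, x_j^{(i)}}{\sum_{i=1}^{n} \mathbb {1} (y^{(i)}=y_c)} 
\end{equation} 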

While estimating parameters of the model, we process examples from each label separately and estimate the parameters.

In [1]:
import numpy as np

def fit(X, y):
    n_samples, n_features = X.shape
    classes = np.unique(y)  # distinct class labels, sorted
    n_classes = len(classes)

    # initialize the parameter matrix and the prior vector
    w = np.zeros((n_classes, n_features), dtype=np.float64)
    w_priors = np.zeros(n_classes, dtype=np.float64)

    for idx, c in enumerate(classes):

        # process examples from each class separately:
        # select the examples with label c
        X_c = X[y == c]

        # estimate w_{jy_c}, the Bernoulli parameter of feature x_j for class y_c,
        # i.e. P(x_j | y_c) ~ Ber(w_{jy_c}).
        # The operation is vectorized: row idx of w holds w_{jy_c} for every x_j.
        w[idx, :] = np.sum(X_c, axis=0) / X_c.shape[0]
        w_priors[idx] = X_c.shape[0] / n_samples

    print('Weight vector : \n', w)
    print()
    print('Prior : \n', w_priors)


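Demonstration of the above code. Note that this first input is malformed: the features are not binary, and `y` is a 2-D array rather than a label vector of shape `(n_samples,)`. The boolean mask `y == c` therefore selects individual elements of `X` instead of rows, and the resulting "priors" do not form a valid probability distribution.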

In [2]:
X = np.array([[1,1,40],[1,100,2],[2,3,4]])
y = np.array([[0,1,1],[2,1,3],[2,2,1]])

X.shape ,y.shape

((3, 3), (3, 3))

In [3]:
X_0 = X[y == 0]
X_1 = X[y == 1]
X_2 = X[y == 2]
X_3 = X[y == 3]

print(X_0, X_1, X_2, X_3)
print()
print(X_0.shape ,X_1.shape ,X_2.shape ,X_3.shape)

[1] [  1  40 100   4] [1 2 3] [2]

(1,) (4,) (3,) (1,)


In [4]:
fit(X, y)

Weight vector : 
 [[ 1.    1.    1.  ]
 [36.25 36.25 36.25]
 [ 2.    2.    2.  ]
 [ 2.    2.    2.  ]]

Prior : 
 [0.33333333 1.33333333 1.         0.33333333]
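
The priors printed above sum to 3 rather than 1, because `y` was passed as a 2-D array and `X` is not binary. A minimal sketch of an input check that would catch this is given here; `check_bernoulli_inputs` is a hypothetical helper and is not part of the original code.

```python
import numpy as np

def check_bernoulli_inputs(X, y):
    """Hypothetical helper: validate inputs before fitting Bernoulli NB."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.ndim != 2:
        raise ValueError("X must be a 2-D array of shape (n_samples, n_features)")
    if y.ndim != 1 or y.shape[0] != X.shape[0]:
        raise ValueError("y must be a 1-D array with one label per row of X")
    if not np.isin(X, [0, 1]).all():
        raise ValueError("Bernoulli NB expects binary (0/1) features")
    return X, y

# The malformed input from the demonstration is rejected ...
X_bad = np.array([[1, 1, 40], [1, 100, 2], [2, 3, 4]])
y_bad = np.array([[0, 1, 1], [2, 1, 3], [2, 2, 1]])
try:
    check_bernoulli_inputs(X_bad, y_bad)
    rejected = False
except ValueError:
    rejected = True

# ... while a binary feature matrix with a 1-D label vector passes.
X_ok, y_ok = check_bernoulli_inputs([[1, 0], [0, 1]], [1, 0])
print(rejected, X_ok.shape, y_ok.shape)
```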


Let's look at the parameter estimation step by step, this time with binary features and a 1-D label vector:

In [5]:
# feature matrix with shape (4,2). x_1 ~ Ber(w_1), x_2~ Ber(w_2)
X = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

# label vector with shape(4,)
y = np.array([1, 0, 0, 1])

X.shape , y.shape

((4, 2), (4,))

In [6]:
X_0 = X[y == 0]
X_1 = X[y == 1]


print(X_0)
print(X_1)
print()
print(X_0.shape ,X_1.shape)

[[0 1]
 [0 1]]
[[1 0]
 [1 0]]

(2, 2) (2, 2)


In [7]:
# call fit with feature matrix and label vector as arguments.
fit(X, y)

Weight vector : 
 [[0. 1.]
 [1. 0.]]

Prior : 
 [0.5 0.5]


A few observations:

* Since class 0 and class 1 each account for 50% of the examples, the prior probability vector has 0.5 for each class.

* Note that (using the $w_{jy_c}$ convention, feature index first and class second):

  * For class 0, $x_1=0 \ \text {and} \ x_2=1$ in every example, and hence the parameters of the Bernoulli distributions are 0 and 1 respectively.

    * $w_{10}=0,w_{20}=1$

  * For class 1, $x_1=1 \ \text {and} \ x_2=0$ in every example, and hence the parameters of the Bernoulli distributions are 1 and 0 respectively.

    * $w_{11}=1,w_{21}=0$

Let's understand the class conditional density calculation step by step:


##### **STEP 1** : Filter examples for a class, say c=1

In [8]:
X_c = X[y==1]
X_c

array([[1, 0],
       [1, 0]])

##### **STEP 2** : Feature wise sum

In [9]:
np.sum(X_c, axis=0)

array([2, 0])

##### **STEP 3**: Dividing by class count

In [10]:
w = np.sum(X_c, axis=0)/X_c.shape[0]
w

array([1., 0.])
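
The three steps above can also be carried out for all classes at once. The following is a minimal sketch, assuming the labels are the integers 0, ..., k-1; the one-hot matrix `Y` and the names `feature_sums` and `class_counts` are illustrative and do not appear in the original code.

```python
import numpy as np

X = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
y = np.array([1, 0, 0, 1])

n_classes = int(y.max()) + 1
# One-hot encode the labels: Y[i, c] = 1 if example i has label c.
Y = np.eye(n_classes)[y]             # shape (n_samples, n_classes)

# Steps 1 and 2 for every class: Y.T @ X sums each feature within each class.
feature_sums = Y.T @ X               # shape (n_classes, n_features)

# Step 3: divide each row by the number of examples in that class.
class_counts = Y.sum(axis=0)         # shape (n_classes,)
w = feature_sums / class_counts[:, np.newaxis]
w_priors = class_counts / X.shape[0]

print(w)
print(w_priors)
```

Row `c` of `w` matches what the loop-based `fit` computes for class `c`.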

### **Incorporating Laplace correction**

A zero (0) value for a parameter is a problem, as it leads to a posterior probability of 0 for any example in which that feature is 1. 

We can fix this problem with **Laplace correction**, that is, by adding a small dummy count to each class for each feature.

* The **class priors** with laplace correction can be calculated as follows:

\begin{equation} 
p(y=y_c)= \frac{\sum \limits_{i=1}^n 1(y^{(i)}=y_c) + \alpha}{n+k\alpha}
\end{equation}

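* The **class conditional density** with Laplace correction is computed as follows:

\begin{equation} 
w_{jy_c} = \frac{\sum \limits_{i=1}^n 1(y^{(i)}=y_c)x_j^{(i)}+\alpha}{\sum \limits_{i=1}^n 1(y^{(i)}=y_c)+2\alpha}
\end{equation} 

In both cases, we use $\alpha=1$ (**Laplace correction or smoothing**). Here $k$ is the number of classes, and the $2\alpha$ in the second denominator accounts for the two values (0 and 1) that a binary feature can take.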
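
As a worked example, the sketch below applies both formulas with $\alpha=1$ to the four-example dataset used earlier. It assumes the labels are 0, ..., k-1; the variable names are illustrative.

```python
import numpy as np

X = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
y = np.array([1, 0, 0, 1])
alpha = 1.0
n = X.shape[0]
k = len(np.unique(y))

w = np.zeros((k, X.shape[1]))
priors = np.zeros(k)
for c in range(k):
    X_c = X[y == c]
    n_c = X_c.shape[0]
    # smoothed prior: (count + alpha) / (n + k * alpha)
    priors[c] = (n_c + alpha) / (n + k * alpha)
    # smoothed Bernoulli parameters: (feature sum + alpha) / (count + 2 * alpha)
    w[c, :] = (X_c.sum(axis=0) + alpha) / (n_c + 2 * alpha)

print(w)
print(priors)
```

With smoothing, the unsmoothed estimates of 0 and 1 become 0.25 and 0.75, so no parameter is exactly 0 or 1 any more.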


### **Inference** 
#### Determine class label

Remember that we assign class label $y_c$ that results in the largest product of likelihood and prior.

\begin{align} 
y &= \text{argmax}_{y_c}\left(\sum \limits_{j=1}^m \log \ p(x_j|y_c;\mathbf w)+ \log \ p(y_c;\mathbf w)\right) \\
&= \text{argmax}_{y_c} \left(\sum \limits_{j=1}^m \log \left(w_{jy_c}^{x_j}(1-w_{jy_c})^{1-x_j}\right)+\log \ p(y_c;\mathbf w)\right) \\
&= \text{argmax}_{y_c}\left(\sum \limits_{j=1}^m \left[x_j \log \ w_{jy_c}+(1-x_j) \log \ (1-w_{jy_c})\right]+\log \ p(y_c; \mathbf w)\right)
\end{align} 

**NOTE :** We performed these computations in log space to avoid problems with underflow.
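
To illustrate the underflow problem, the sketch below multiplies many small likelihoods directly and compares the result with the sum of their logs. The setting (2000 features, each with likelihood 0.1) is hypothetical and chosen only to trigger underflow.

```python
import math

# Hypothetical setting: 2000 binary features, each with likelihood 0.1.
probs = [0.1] * 2000

# The direct product underflows to exactly 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p

# The sum of logs stays finite, so classes can still be compared.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```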

Further with vectorization, this is implemented as follows :

\begin{align}
y=\text{argmax}_y \mathbf X \log \mathbf w^T + (1-\mathbf X) \log (1-\mathbf w)^T + \log \mathbf w_{\text {prior}}
\end{align}
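
To check that the matrix expression matches the per-feature sum, the sketch below evaluates both on a small input. The parameter values in `w` and `w_prior` are illustrative and are not estimated from data.

```python
import numpy as np

X = np.array([[1, 0], [0, 1], [1, 1]])
w = np.array([[0.2, 0.7], [0.6, 0.3]])   # w[c, j]: P(x_j = 1 | class c)
w_prior = np.array([0.4, 0.6])

# Loop form: one score per (example, class), summing over features.
n_samples, n_features = X.shape
n_classes = w.shape[0]
scores_loop = np.zeros((n_samples, n_classes))
for i in range(n_samples):
    for c in range(n_classes):
        total = np.log(w_prior[c])
        for j in range(n_features):
            total += X[i, j] * np.log(w[c, j]) + (1 - X[i, j]) * np.log(1 - w[c, j])
        scores_loop[i, c] = total

# Vectorized form: two matrix products plus a broadcast of the log priors.
scores_vec = X @ np.log(w).T + (1 - X) @ np.log(1 - w).T + np.log(w_prior)

print(np.allclose(scores_loop, scores_vec))
y_pred = np.argmax(scores_vec, axis=1)
print(y_pred)
```

The result has one row per example and one column per class, and the argmax is taken along each row.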

### **Implementation**

In [11]:
class BernoulliNB:
    def __init__(self, alpha=1.0):
        # alpha is the Laplace smoothing parameter
        self.alpha = alpha

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.classes = np.unique(y)  # distinct class labels, sorted
        n_classes = len(self.classes)

        self.w = np.zeros((n_classes, n_features), dtype=np.float64)
        self.w_priors = np.zeros(n_classes, dtype=np.float64)

        for idx, c in enumerate(self.classes):
            # examples with label c
            X_c = X[y == c]

            # smoothed Bernoulli parameters w_{jy_c} for every feature x_j
            self.w[idx, :] = (np.sum(X_c, axis=0) + self.alpha) / (X_c.shape[0] + 2 * self.alpha)

            # smoothed class prior
            self.w_priors[idx] = (X_c.shape[0] + self.alpha) / (n_samples + n_classes * self.alpha)

        print('Class conditional density :', self.w)
        print() 
        print('Prior :', self.w_priors)

    def log_likelihood_prior_prod(self, X):
        # log of (likelihood * prior) for every example (row) and class (column)
        return X @ np.log(self.w).T + (1 - X) @ np.log(1 - self.w).T + np.log(self.w_priors)

    def predict_proba(self, X):
        # normalize the exponentiated log scores across classes
        q = self.log_likelihood_prior_prod(X)
        return np.exp(q) / np.expand_dims(np.sum(np.exp(q), axis=1), axis=1)

    def predict(self, X):
        # index of the highest-scoring class; equals the label when labels are 0, ..., k-1
        return np.argmax(self.log_likelihood_prior_prod(X), axis=1)

### **Demonstration**

We will demonstrate the working of Bernoulli naive Bayes classification in the following setups:
1. Binary classification set up.

2. Multi-class classification set up 

#### DEMO 1 : *Binary classification*

In [12]:
ber_nb = BernoulliNB() 
ber_nb.fit(X ,y)

Class conditional density : [[0.25 0.75]
 [0.75 0.25]]

Prior : [0.5 0.5]


In [13]:
#Let's predict classes for input example.
ber_nb.predict(X)

array([1, 0, 0, 1], dtype=int64)

In [14]:
# The class labels are inferred by selecting the label with the highest log of the product of likelihood and prior:

ber_nb.log_likelihood_prior_prod(X)

array([[-3.4657359 , -1.26851133],
       [-1.26851133, -3.4657359 ],
       [-1.26851133, -3.4657359 ],
       [-3.4657359 , -1.26851133]])

Observe that based on this calculation, the first example gets class 1, second one gets class 0, third also gets class 0 and the last one gets class 1.


In [15]:
# let's predict probabilities for each example.
ber_nb.predict_proba(X)

array([[0.1, 0.9],
       [0.9, 0.1],
       [0.9, 0.1],
       [0.1, 0.9]])
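
`predict_proba` exponentiates the log scores directly, which can underflow when there are many features. A common remedy is to subtract the row-wise maximum before exponentiating, which leaves the normalized result mathematically unchanged. A minimal sketch follows, where `stable_posterior` is a hypothetical helper and `q` holds the two distinct rows of log scores printed above.

```python
import numpy as np

def stable_posterior(q):
    """Hypothetical helper: row-wise normalization of log scores q."""
    q = np.asarray(q, dtype=np.float64)
    shifted = q - q.max(axis=1, keepdims=True)   # largest entry per row becomes 0
    expq = np.exp(shifted)
    return expq / expq.sum(axis=1, keepdims=True)

q = np.array([[-3.4657359, -1.26851133],
              [-1.26851133, -3.4657359]])
proba = stable_posterior(q)
print(proba)

# Very negative scores underflow with a plain exp(); the shifted version still works.
q_extreme = np.array([[-5000.0, -5001.0]])
proba_extreme = stable_posterior(q_extreme)
print(proba_extreme)
```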

#### DEMO 2 : *Multi-class classification*

The NB implementation also works in the multi-class setting. Here is an example with three classes.

In [16]:
X = np.array([[1,0],[0,1],[0,1],[1,0],[1,1],[1,1]])
y = np.array([1, 0, 0, 1, 2, 2])

Estimation of parameters of Bernoulli distribution and class priors.

In [17]:
ber_nb = BernoulliNB() 
ber_nb.fit(X,y)

Class conditional density : [[0.25 0.75]
 [0.75 0.25]
 [0.75 0.75]]

Prior : [0.33333333 0.33333333 0.33333333]


In [18]:
ber_nb.log_likelihood_prior_prod(X)

array([[-3.87120101, -1.67397643, -2.77258872],
       [-1.67397643, -3.87120101, -2.77258872],
       [-1.67397643, -3.87120101, -2.77258872],
       [-3.87120101, -1.67397643, -2.77258872],
       [-2.77258872, -2.77258872, -1.67397643],
       [-2.77258872, -2.77258872, -1.67397643]])

In [19]:
# let's predict probabilities for each example.
ber_nb.predict_proba(X)

array([[0.07692308, 0.69230769, 0.23076923],
       [0.69230769, 0.07692308, 0.23076923],
       [0.69230769, 0.07692308, 0.23076923],
       [0.07692308, 0.69230769, 0.23076923],
       [0.2       , 0.2       , 0.6       ],
       [0.2       , 0.2       , 0.6       ]])
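
As a sanity check, the first row of these probabilities can be reproduced by hand from the estimated parameters. The sketch below does so for the example $x = (1, 0)$, working with plain probabilities rather than logs; the variable names are illustrative.

```python
import numpy as np

# Parameters printed by fit() for the three-class example.
w = np.array([[0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])
prior = np.full(3, 1 / 3)
x = np.array([1, 0])

# Likelihood of x under each class: product over features of w^x * (1 - w)^(1 - x).
likelihood = np.prod(w ** x * (1 - w) ** (1 - x), axis=1)

# Multiply by the prior and normalize across classes to get the posterior.
joint = likelihood * prior
posterior = joint / joint.sum()
print(posterior)
```

The three values are 1/13, 9/13 and 3/13, which agree with the first row of the `predict_proba` output.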