# Naive Bayes Classifier 

In [1]:
import pandas as pd
import numpy as np
import scipy
from scipy import stats

## Input preparation

In [17]:
def load_data(file_name):
    """ loads the data, drops rows with missing values ('?') and encodes income as a binary label
    Inputs:
        file_name (str): path to csv file containing data
    Outputs:
        pd.DataFrame: processed dataframe
    """
    d_f = pd.read_csv(file_name, sep = ',', header = 0, na_values = '?')
    d_f = d_f.dropna(axis = 0, how= 'any')
    d_f['label'] = d_f['income'].replace({'>50K': 1, '<=50K':0})
    del d_f['income']
    d_f.reset_index(inplace=True)
    
    return d_f

df = load_data('census.csv')


## Overview of Naive Bayes classifier
Let $X_1, X_2, \ldots, X_k$ be the $k$ features of a dataset, with class label given by the variable $y$. A probabilistic classifier assigns the most probable class to each instance $(x_1,\ldots,x_k)$, as expressed by
$$ \hat{y} = \arg\max_y P(y\ |\ x_1,\ldots,x_k) $$

Using Bayes' theorem, the above *posterior probability* can be rewritten as
$$ P(y\ |\ x_1,\ldots,x_k) = \frac{P(y) P(x_1,\ldots,x_k\ |\ y)}{P(x_1,\ldots,x_k)} $$
where
- $P(y)$ is the prior probability of the class
- $P(x_1,\ldots,x_k\ |\ y)$ is the likelihood of data under a class
- $P(x_1,\ldots,x_k)$ is the evidence for data

Naive Bayes classifiers assume that the feature values are conditionally independent given the class label, that is,
$ P(x_1,\ldots,x_k\ |\ y) = \prod_{i=1}^{k}P(x_i\ |\ y) $. This strong assumption simplifies the expression for the posterior probability to
$$ P(y\ |\ x_1,\ldots,x_k) = \frac{P(y) \prod_{i=1}^{k}P(x_i\ |\ y)}{P(x_1,\ldots,x_k)} $$

For a given input $(x_1,\ldots,x_k)$, $P(x_1,\ldots,x_k)$ is constant. Hence, we can omit the denominator and replace the equality sign with proportionality:
$$ P(y\ |\ x_1,\ldots,x_k) \propto P(y) \prod_{i=1}^{k}P(x_i\ |\ y) $$

Thus, the class of a new instance can be predicted as $\hat{y} = \arg\max_y P(y) \prod_{i=1}^{k}P(x_i\ |\ y)$. Here, $P(y)$ is commonly known as the **class prior** and $P(x_i\ |\ y)$ is termed the **feature predictor**. The rest of the assignment deals with how each of these $k+1$ probability distributions ($P(y), P(x_1\ |\ y), \ldots, P(x_k\ |\ y)$) is estimated from data.


**Note**: Observe that computing the final expression above involves multiplying $k+1$ probability values, each of which can be very small. The product can underflow the floating-point range. It is therefore good practice to work with the logarithms of the probabilities and add them instead.
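
The underflow described in the note can be reproduced with a few lines of NumPy. This is a minimal sketch with made-up numbers (400 probabilities of 0.01 each are an assumption chosen only to trigger the effect), not values taken from the census data.

```python
import numpy as np

# Hypothetical example: 400 per-feature probabilities, each equal to 0.01
probs = np.full(400, 0.01)

# The true product is 1e-800, far below the smallest positive float64,
# so the direct product collapses to exactly 0.0
direct_product = np.prod(probs)

# The sum of logs is about -1842, which float64 represents without trouble
log_sum = np.sum(np.log(probs))

print(direct_product)
print(log_sum)
```

Once the product has collapsed to zero for every class, the classes can no longer be compared, whereas the log-sums remain distinct and comparable.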

**TL;DR** The final takeaway from this cell is the following expression:
$$\hat{y} = \arg\max_y \underbrace{\log P(y)}_{\text{log-prior}} + \underbrace{\sum_{i=1}^{k} \log P(x_i\ |\ y)}_{\text{log-likelihood}}$$

Each term in the log-likelihood sum can be regarded as a partial log-likelihood based on one particular feature alone.
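
As an illustration of the log-space decision rule, the sketch below scores three instances against two classes. All numbers are hypothetical, and the two matrices stand in for the per-feature partial log-likelihoods (one row per instance, one column per class).

```python
import numpy as np

# Hypothetical class prior for classes 0 and 1
log_prior = np.log(np.array([0.75, 0.25]))

# Hypothetical partial log-likelihoods of two features,
# shape (number of instances, number of classes)
partial_ll_a = np.log(np.array([[0.20, 0.10],
                                [0.05, 0.30],
                                [0.10, 0.10]]))
partial_ll_b = np.log(np.array([[0.50, 0.40],
                                [0.10, 0.60],
                                [0.20, 0.30]]))

# log P(y) + sum_i log P(x_i | y); the prior broadcasts across rows
scores = log_prior + partial_ll_a + partial_ll_b

# Predicted class = column with the largest score in each row
y_hat = np.argmax(scores, axis=1)
print(y_hat)
```

The scores are unnormalized log-posteriors, so only their ordering within a row matters.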

## Feature Predictor
The beauty of a Naive Bayes classifier lies in the fact that we can mix and match different likelihood models for each feature predictor according to the prior knowledge we have about it, and these models can be varied independently of each other. For example, we might know that $P(X_i|y)$ for some continuous feature $X_i$ is normally distributed, or that $P(X_i|y)$ for some categorical feature follows a categorical distribution. In such cases, we can directly plug in the pdf/pmf of these distributions in place of $P(x_i\ |\ y)$.

Two likelihood models are used here:

- Gaussian model, for continuous real-valued features (parameterized by mean $\mu$ and standard deviation $\sigma$)
- Categorical model, for discrete features (parameterized by $\mathbf{p} = (p_0,\ldots,p_{l-1})$, where $l$ is the number of values taken by this categorical feature)

Each feature predictor implements two methods:

- **Parameter estimation `__init__()`**: Learn the parameters of the likelihood model using MLE (Maximum Likelihood Estimation). Keep one set of parameters per class, *stored in increasing order of class id, i.e., mu[i] is the mean for class $i$ in the Gaussian predictor*.
- **Partial log-likelihood computation for *this* feature `partial_log_likelihood()`**: Use the learnt parameters to compute the probability (density for continuous features, mass for categorical features) of a given feature value. Report np.log() of this value.

The parameter estimation is for the conditional distributions $P(X|Y)$. Thus, while estimating parameters for a specific class (say class 0), we will use only those data points in the training set (or rows in the input data frame) which have class label 0.
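
Selecting the rows of one class is a one-line boolean mask in NumPy. The sketch below uses a small hypothetical feature vector and label vector rather than the census columns.

```python
import numpy as np

# Hypothetical feature values and their class labels
x = np.array([25.0, 38.0, 52.0, 47.0, 31.0])
y = np.array([0, 0, 1, 1, 0])

# Parameters of P(X | Y=0) are estimated from these values only
x_class0 = x[y == 0]

# Parameters of P(X | Y=1) are estimated from these values only
x_class1 = x[y == 1]

print(x_class0)
print(x_class1)
```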

## Gaussian Feature Predictor
The Gaussian distribution is characterized by two parameters, the mean $\mu$ and the standard deviation $\sigma$:
$$ f_Z(z) = \frac{1}{\sqrt{2\pi}\sigma} \exp{(-\frac{(z-\mu)^2}{2\sigma^2})} $$

Given $n$ samples $z_1, \ldots, z_n$ from the above distribution, the maximum likelihood estimates of the mean and standard deviation are:
$$ \hat{\mu} = \frac{1}{n} \sum_{j=1}^{n} z_j $$

$$ \hat{\sigma} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (z_j-\hat{\mu})^2} $$
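
The two estimators can be checked against NumPy on a small hypothetical sample. Note that `np.std` divides by $n$ by default (`ddof=0`), which matches the MLE above; passing `ddof=1` would give the unbiased sample estimate instead.

```python
import numpy as np

# Hypothetical sample
z = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(z)

# MLE of the mean: the arithmetic average
mu_hat = np.sum(z) / n

# MLE of the standard deviation: divide by n, not n - 1
sigma_hat = np.sqrt(np.sum((z - mu_hat) ** 2) / n)

print(mu_hat, sigma_hat)
print(np.mean(z), np.std(z))   # same values via NumPy's defaults
print(np.std(z, ddof=1))       # unbiased variant, slightly larger
```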



In [3]:
class GaussianPredictor:
    """ Feature predictor for a normally distributed real-valued, continuous feature.
        Attributes: 
            mu (array_like) : vector containing per class mean of the feature
            sigma (array_like): vector containing per class std. deviation of the feature
    """
    # feel free to define and use any more attributes, e.g., number of classes, etc
    def __init__(self, x, y) :
        """ initializes the predictor statistics (mu, sigma) for Gaussian distribution
        Inputs:
            x (array_like): feature values (continuous)
            y (array_like): class labels (0,...,k-1)
        """
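        # work on plain arrays so the result does not depend on a pandas index
        x = np.asarray(x, dtype=float)
        y = np.asarray(y)
        self.k = len(np.unique(y))
        self.mu = np.zeros(self.k)
        self.sigma = np.zeros(self.k)
        for i in range(self.k):
            # MLE for class i uses only the samples labelled i
            x_i = x[y == i]
            self.mu[i] = np.mean(x_i)
            self.sigma[i] = np.std(x_i)  # ddof=0, i.e. the MLE of sigma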
        
    
    def partial_log_likelihood(self, x):
        """ log likelihood of feature values x according to each class
        Inputs:
            x (array_like): vector of feature values
        Outputs:
            (array_like): matrix of log likelihood for this feature alone
        """
        x = np.asarray(x, dtype=float)
        # broadcast (n, 1) against (k,) to get an (n, k) matrix;
        # logpdf avoids taking the log of a density that underflowed to 0
        return stats.norm.logpdf(x[:, None], loc=self.mu, scale=self.sigma)

f = GaussianPredictor(df['age'], df['label'])
f.mu
f.sigma
f.partial_log_likelihood([43,40,100,10])


array([[ -3.63166766,  -3.2524249 ],
       [ -3.55071473,  -3.32238449],
       [-14.60226337, -18.13920716],
       [ -5.47164304,  -8.71608989]])

## Categorical Feature Predictor
The categorical distribution with $l$ categories $\{0,\ldots,l-1\}$ is characterized by parameters $\mathbf{p} = (p_0,\dots,p_{l-1})$:
$$ P(z; \mathbf{p}) = p_0^{[z=0]}p_1^{[z=1]}\ldots p_{l-1}^{[z=l-1]} $$

where $[z=t]$ is 1 if $z$ is $t$ and 0 otherwise.

Given $n$ samples $z_1, \ldots, z_n$ from the above distribution, the smoothed-MLE for each $p_t$ is:
$$ \hat{p_t} = \frac{n_t + \alpha}{n + l\alpha} $$

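where $n_t = \sum_{j=1}^{n} [z_j=t]$, i.e., the number of times the value $t$ occurred in the sample. Smoothing avoids the zero-count problem (similar in spirit to smoothing in $n$-gram models in NLP): without it, a value never seen with some class would get probability 0 and a log-likelihood of $-\infty$.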
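
A small sketch of the smoothed estimate follows. The helper name `smoothed_mle` and the toy sample are hypothetical and serve only to show the effect of $\alpha$ on a category that never occurs.

```python
import numpy as np

def smoothed_mle(samples, categories, alpha=1.0):
    """Hypothetical helper: (n_t + alpha) / (n + l * alpha) for each category t."""
    samples = np.asarray(samples)
    counts = np.array([np.sum(samples == t) for t in categories])
    return (counts + alpha) / (len(samples) + len(categories) * alpha)

categories = ['a', 'b', 'c']
samples = ['a', 'a', 'b', 'a']   # 'c' never occurs in the sample

# alpha = 0 is the plain MLE: 'c' gets probability 0
p_unsmoothed = smoothed_mle(samples, categories, alpha=0.0)

# alpha = 1 gives every category a non-zero probability
p_smoothed = smoothed_mle(samples, categories, alpha=1.0)

print(p_unsmoothed)
print(p_smoothed)
```

With $\alpha = 1$ the estimates are $4/7$, $2/7$ and $1/7$, and they still sum to 1.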


In [4]:
class CategoricalPredictor:
    """ Feature predictor for a categorical feature.
        Attributes: 
            p (dict) : dictionary of vector containing per class probability of a feature value;
                    the keys of dictionary should exactly match the values taken by this feature
    """
    # feel free to define and use any more attributes, e.g., number of classes, etc
    def __init__(self, x, y, alpha=1) :
        """ initializes the predictor statistics (p) for Categorical distribution
        Inputs:
            x (array_like): feature values (categorical)
            y (array_like): class labels (0,...,k-1)
        """
        # work on plain arrays so the result does not depend on a pandas index
        x = np.asarray(x)
        y = np.asarray(y)
        x_features = np.unique(x)
        y_labels = np.unique(y)
        
        self.p = {thisX: np.zeros(len(y_labels)) for thisX in x_features}
        
        for i, label in enumerate(y_labels):
            # values of this feature among the samples of the current class
            sample = x[y == label]
            for thisX in x_features:
                # smoothed MLE: (n_t + alpha) / (n + l * alpha)
                self.p[thisX][i] = (np.sum(sample == thisX) + alpha) / (len(sample) + len(x_features) * alpha)
        
        
    def partial_log_likelihood(self, x):
        """ log likelihood of feature values x according to each class
        Inputs:
            x (array_like): vector of feature values
        Outputs:
            (array_like): matrix of log likelihood for this feature
        """
        # self.p[thisX] already holds one probability per class
        return np.array([np.log(self.p[thisX]) for thisX in x])

f = CategoricalPredictor(df['sex'], df['label'])
f.p
f.partial_log_likelihood(['Male','Female','Male'])


array([[-0.48243939, -0.16040634],
       [-0.96044059, -1.90917639],
       [-0.48243939, -0.16040634]])

## Putting things together
It's time to put all the feature predictors together and do something useful! 

1. **`__init__()`**: Compute the log prior for each class and initialize the feature predictors (based on feature type). The smoothed prior for class $t$ is given by
$$ \text{prior}(t) = \frac{n_t + \alpha}{n + k\alpha} $$
where $k$ is the number of classes and $n_t = \sum_{j=1}^{n} [y_j=t]$, i.e., the number of times the label $t$ occurred in the sample. 

2. **`predict()`**: For each instance and for each class, compute the sum of the log prior and the partial log-likelihoods of all features. Use it to predict the final class label. Break ties by predicting the class with the lower id.
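
Both steps can be sketched on toy data. The labels and scores below are hypothetical. The tie-breaking rule comes for free with `np.argmax`, which returns the index of the first maximum.

```python
import numpy as np

# Hypothetical labels: three instances of class 0, one of class 1
y = np.array([0, 0, 0, 1])
alpha = 1.0

labels = np.unique(y)
counts = np.array([np.sum(y == t) for t in labels])

# Smoothed prior (n_t + alpha) / (n + k * alpha), stored in log space
log_prior = np.log((counts + alpha) / (len(y) + len(labels) * alpha))

# Hypothetical per-class scores; the first row is an exact tie
tied_scores = np.array([[-1.5, -1.5],
                        [-2.0, -0.5]])

# argmax picks the first maximum, so the tie goes to class 0
pred = np.argmax(tied_scores, axis=1)

print(np.exp(log_prior))
print(pred)
```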


In [12]:
class NaiveBayesClassifier:
    """ Naive Bayes classifier for a mixture of continuous and categorical attributes.
        We use GaussianPredictor for continuous attributes and CategoricalPredictor for categorical ones.
        Attributes:
            predictor (dict): model for P(X_i|Y) for each i
            log_prior (array_like): log P(Y)
    """
    # feel free to define and use any more attributes, e.g., number of classes, etc
    def __init__(self, df, alpha=1):
        """initializes predictors for each feature and computes class prior
        Inputs:
            df (pd.DataFrame): processed dataframe, without any missing values.
        """
        y = np.asarray(df['label'])
        self.labels = np.unique(y)
        # smoothed prior (n_t + alpha) / (n + k * alpha), stored in log space
        counts = np.array([np.sum(y == label) for label in self.labels])
        self.log_prior = np.log((counts + alpha) / (len(y) + len(self.labels) * alpha))

        self.predictor = {}
        for thisCol in df:
            if thisCol not in ['index', 'label']:
                self.predictor[thisCol] = GaussianPredictor(df[thisCol], df['label']) if df[thisCol].dtype == 'int64' else CategoricalPredictor(df[thisCol], df['label'], alpha)
        

    def predict(self, x):
        """ predicts label for input instances from log_prior and partial_log_likelihood of feature predictors
        Inputs:
            x (pd.DataFrame): processed dataframe, without any missing values and without class label.
        Outputs:
            (array_like): array of predicted class labels (0,..,k-1)
        """        
        # start every row with the log prior of each class
        # (indexed by position, not by label value)
        thisPred = np.tile(self.log_prior, (len(x), 1))

        for c in x:
            if c not in ['index', 'label']:
                # pass a plain array so the predictors do not depend on the DataFrame index
                thisPred += self.predictor[c].partial_log_likelihood(np.asarray(x[c]))

        # np.argmax returns the first maximum, so ties go to the lower class id;
        # the result is kept as a float array, as in the output shown below
        return np.argmax(thisPred, axis=1).astype(float)

c = NaiveBayesClassifier(df, 0)
y_pred = c.predict(df)
print(y_pred)




[ 0.  0.  0. ...,  0.  0.  1.]


## Evaluation - Error rate

In [19]:
def evaluate(y_hat, y):
    """ Evaluates classifier predictions
        Inputs:
            y_hat (array_like): output from classifier
            y (array_like): true class label
        Output:
            (double): error rate, i.e., the fraction of instances with y_hat != y
    """
    y = np.array(y)
    return np.mean(y_hat != y)


evaluate(y_pred, df['label'])


0.17240236058616804