# Hands-On Naive Bayes
***

In this notebook we'll implement two flavors of continuous Naive Bayes and test them on the iris dataset.  

**Note**: There are some helper functions at the bottom of this notebook.  Scroll down and execute those cells before continuing. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### Part 1: Classifying Iris Species with Gaussian Naive Bayes 
***

In this problem we'll use Gaussian Naive Bayes to classify species of irises based on certain physical characteristics.  The so-called _iris dataset_ is a popular dataset for prototyping classification algorithms. We can load the iris dataset from Scikit-Learn directly. The dataset contains four features (sepal length, sepal width, petal length, and petal width) and three classes defined by the species of iris: setosa, versicolor, and virginica. We'll only use the sepal dimensions so that we can easily visualize the data. 

Execute the following code cell to load training and validation sets for the iris data set and then plot the data.    

In [None]:
X_train, y_train, X_valid, y_valid, target_names = load_iris()
print("classes = ", target_names)
plot_iris(X_train, y_train)

Our plan in this part is to implement a Gaussian Naive Bayes classifier for this two-feature data. Recall that our goal is to make predictions by computing a class score for a query point of the form 

$$
p(\textrm{Class} \mid {\bf x}) \propto p({\bf x} \mid \textrm{Class}) \cdot p(\textrm{Class}) = p(x_1 \mid \textrm{Class} ) \cdot p(x_2 \mid \textrm{Class}) \cdot p(\textrm{Class})
$$

Note that we assume that feature $x_k$ for data in Class $c$ follows a normal distribution with mean $\mu_{kc}$ and variance $\sigma_{kc}^2$. If we know these parameters then we can compute the class-conditional likelihood by evaluating the Gaussian probability density function 

$$
p(x_k \mid c) = \frac{1}{\sqrt{2\pi\sigma_{kc}^2}} \exp\left[ -\frac{(x_k - \mu_{kc})^2}{2\sigma_{kc}^2} \right]
$$
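
As a quick sanity check on the density formula, here is a minimal sketch that evaluates it with NumPy. The function name `gaussian_pdf` and the values of `mu` and `var` are illustrative assumptions, not estimates from the iris data.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Evaluate the normal density with mean mu and variance var at x."""
    return np.exp(-((x - mu) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Illustrative parameters only (assumed, not learned from data)
mu, var = 5.0, 0.25

peak = gaussian_pdf(mu, mu, var)          # density at the mean: 1 / sqrt(2*pi*var)
one_sd = gaussian_pdf(mu + 0.5, mu, var)  # density one standard deviation from the mean
print(peak, one_sd)
```

At the mean the exponent vanishes, so the density reduces to the normalizing constant. One standard deviation away it is smaller by a factor of $e^{-1/2}$.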

Finally, we can compute the class score in **log-space** for each class for a given query point, and then predict its class membership by choosing the class with the highest score. 
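
To see why log-space is the safer choice, the sketch below multiplies many small likelihoods directly and compares that with summing their logs. The array `likelihoods` is a made-up example with far more features than our two-feature iris data, chosen only to make the underflow visible.

```python
import numpy as np

# Hypothetical per-feature likelihoods for one class: 500 features, each tiny
likelihoods = np.full(500, 1e-5)

# Multiplying directly underflows to exactly 0.0 in float64
direct_product = np.prod(likelihoods)

# Summing logs stays finite: 500 * log(1e-5)
log_score = np.sum(np.log(likelihoods))

print(direct_product, log_score)
```

Because the logarithm is monotonically increasing, the class with the largest log-score is also the class with the largest unnormalized posterior, so the prediction is unchanged.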

We'll implement this using the class `GaussNB` shown below.  Take a look at the current state of the skeleton, and then scroll down and look at the tasks that we want to accomplish. 

In [None]:
class GaussNB:
    def __init__(self, X_train, y_train):
        
        # store training data 
        self.X_train = X_train
        self.y_train = y_train 
        
        # get number of classes 
        self.num_classes = len(set(y_train))
        
        # initialize means (num_classes x num_features) 
        self.means = np.zeros((self.num_classes, self.X_train.shape[1]))
        
        # initialize variance (num_classes x num_features) 
        self.vars = np.zeros((self.num_classes, self.X_train.shape[1]))
        
        # initialize class counts (num_classes)
        self.counts = np.zeros(self.num_classes)
        
        # initialize class priors (num_classes)
        self.priors = np.zeros(self.num_classes)
        
    def train(self):
        """
        Learn the parameters of the Gaussian for each feature-class combination 
        """
        
        # populate the counts array 
        self.counts = np.array([np.sum(self.y_train == ii) for ii in range(self.num_classes)])
        
        # populate priors 
        self.priors = np.array([self.counts[ii]/np.sum(self.counts) for ii in range(self.num_classes)])
        
        # populate means and variances 
        for feat in range(self.X_train.shape[1]):
            for c in range(self.num_classes):
                self.means[c, feat] = np.mean(self.X_train[self.y_train==c, feat])
                self.vars[c, feat] = np.var(self.X_train[self.y_train==c, feat])
                
    def pdf(self, xk, mu, ssq):
        """
        evaluates the Gaussian probability density function 
        """
        return np.exp(-((xk-mu)**2)/(2*ssq)) / np.sqrt(2*np.pi*ssq)
        
    def predict_log_score(self, x):
        """
        Get the log-probability score for each class
        for a query point x 
        """
        
        class_scores = np.zeros(self.num_classes) 
        
        for c in range(self.num_classes):
            class_scores[c] = np.log(self.priors[c])
            for kk in range(len(x)):
                class_scores[c] += np.log(self.pdf(x[kk], self.means[c, kk], self.vars[c,kk]))
        
        return class_scores
    
    def predict(self, X):
        """
        Predict the class of each example in X 
        """
        yhat = np.zeros(X.shape[0], dtype=int)
        for ii, x in enumerate(X):
            class_scores = self.predict_log_score(x)
            yhat[ii] = np.argmax(class_scores)
        
        return yhat 
    
    def accuracy(self, X, y):
        yhat = self.predict(X)
        return np.sum(yhat == y)/len(y)
        
        

**Part A**: The first thing we'll do is fill in the `train` function.  We start by filling in the array corresponding to the prior for each class.  To facilitate this we'll first fill in the counts of the training examples for each class. When you're done, execute the following code cell.  Do the results mesh with what we know about the iris dataset? 

In [None]:
gnb = GaussNB(X_train, y_train)
gnb.train()
print("class counts: ", gnb.counts)
print("class priors: ", gnb.priors)
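
For comparison, here is a small sketch of the same counts-and-priors computation on toy labels. It assumes the labels are integers coded `0` to `num_classes - 1`, and `y_toy` is invented for illustration. It uses `np.bincount` in place of an explicit loop over classes.

```python
import numpy as np

# Toy integer labels, assumed to be coded 0..num_classes-1
y_toy = np.array([0, 2, 1, 1, 0, 2, 2, 1, 0, 2])

counts = np.bincount(y_toy, minlength=3)  # number of examples in each class
priors = counts / counts.sum()            # relative class frequencies

print("class counts: ", counts)
print("class priors: ", priors)
```

The priors are simply the class counts divided by the total number of training examples, so they always sum to one.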

**Part B**: Next we need to learn the parameters of the Gaussian distributions for each feature and each class from the training data.  Note that we'll store the means and the variances in separate matrices of dimensions `num_classes` $\times$ `num_features`.  Add this functionality to the `train` function.  When you think you're done, execute the following cell.  Do the values seem correct given the plot of the Iris data set we made above? 

In [None]:
gnb = GaussNB(X_train, y_train)
gnb.train()
print("means:\n ", gnb.means)
print("\nvars:\n ", gnb.vars)
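
One way to check the parameter estimates is to run the same computation on a toy dataset small enough to verify by hand. The arrays `X_toy` and `y_toy` below are invented for illustration. Note that `np.var` uses `ddof=0` by default, which gives the maximum-likelihood variance estimate (dividing by the class size rather than the class size minus one).

```python
import numpy as np

# Four toy examples with two features each, two per class (made-up values)
X_toy = np.array([[5.0, 3.4], [5.2, 3.6], [6.0, 2.8], [6.4, 3.0]])
y_toy = np.array([0, 0, 1, 1])
num_classes = 2

# Row c holds the per-feature statistics of the examples labeled c
means = np.array([X_toy[y_toy == c].mean(axis=0) for c in range(num_classes)])
variances = np.array([X_toy[y_toy == c].var(axis=0) for c in range(num_classes)])

print("means:\n", means)
print("vars:\n", variances)
```

Both results have shape `num_classes` $\times$ `num_features`, matching the layout of `means` and `vars` in `GaussNB`.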

**Part C**: OK, now let's implement functionality to make predictions.  Note that there are two related functions called `predict` and `predict_log_score`.  We'll be working on the latter of those two. Our goal is to compute the log-score for each class for a particular query point: 

$$
\textrm{class_score}[c] = \log~p(c) + \log~p(x_1 \mid c) + \log~p(x_2 \mid c)
$$

For our data this should return a NumPy array of length $3$. The `predict` function is implemented for you.  It loops over a data matrix of features, calls `predict_log_score` on each example, and then predicts the class corresponding to the highest log-score.  

Fill out the missing code in `predict_log_score`, and then run the following code cell. If your code is working then you should predict that the first query point belongs to class 0, the second to class 1, and the third to class 2.   

In [None]:
gnb = GaussNB(X_train, y_train)
gnb.train()
X_test = np.array([[4.5, 4.0], [5.5, 2.5], [8,3]])
gnb.predict(X_test)
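
As an alternative to taking `np.log` of the density, the log-density can be written out analytically as $-\tfrac{1}{2}\log(2\pi\sigma_{kc}^2) - (x_k - \mu_{kc})^2 / (2\sigma_{kc}^2)$. The sketch below takes that approach. The function `log_scores` and all parameter values are hypothetical and are not part of `GaussNB`.

```python
import numpy as np

def log_scores(x, means, variances, priors):
    """Return log p(c) + sum_k log p(x_k | c) for every class c."""
    # Analytic log of the Gaussian density; x broadcasts across the class rows
    log_pdf = -0.5 * np.log(2.0 * np.pi * variances) - (x - means) ** 2 / (2.0 * variances)
    return np.log(priors) + log_pdf.sum(axis=1)

# Hypothetical parameters for 2 classes and 2 features
means = np.array([[5.0, 3.4], [6.5, 2.9]])
variances = np.array([[0.1, 0.1], [0.3, 0.1]])
priors = np.array([0.5, 0.5])

scores = log_scores(np.array([5.1, 3.5]), means, variances, priors)
predicted = int(np.argmax(scores))

# A query point far from every class mean still yields finite scores
far_scores = log_scores(np.array([500.0, 3.5]), means, variances, priors)
print(scores, predicted, far_scores)
```

The design choice here is numerical: for a point very far from a class mean the density itself can underflow to zero, whose log is negative infinity, while the analytic form remains finite and still ranks the classes.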

**Part D**: When it seems like your code is working, you can run the following code cell to report the accuracy on the training set and plot the class decision boundaries induced by this classifier.  

In [None]:
gnb = GaussNB(X_train, y_train)
gnb.train()
print("accuracy: {:.2f}".format(gnb.accuracy(X_train, y_train)))
nb_plot(X_train, y_train, gnb, db=True)

**Part E**: OK, let's take a step back and look at the kinds of decision boundaries Gaussian Naive Bayes can learn for data with binary labels. The following code lets you generate binary-labeled data with two features in different configurations.  Experiment with different configurations and discuss the types of decision boundaries that you can and cannot learn.      