<h2>Bayesian Classifiers</h2>
<ul>
<li><h4>Author: Blake Conrad</h4></li>
<li><h4>Purpose: P3 for CSCI 48100 </h4></li>
<li><h4>Goal: Implement Bayes and Naive Bayes Classifiers </h4></li>
</ul>

In [57]:
from __future__ import division

# Import Libraries
import sys
import os
import numpy as np
import pandas as pd
import sys
import os
from scipy.stats import multivariate_normal
from scipy.stats import norm
from sklearn.metrics import confusion_matrix
import sklearn.metrics


In [64]:
df_train = pd.read_csv("1stfold_train.txt",
                      sep=",",
                      names=["Septal Length",
                            "Septal Width",
                            "Pedal Length",
                            "Pedal Width",
                            "Flower Type"],
                      dtype={'Septal Length':  np.float64,
                             'Septal Width' :  np.float64,
                             'Pedal Length' :  np.float64,
                             'Pedal Width'  :  np.float64})
df_test = pd.read_csv("1stfold_heldout.txt",
                      sep=",",
                      names=["Septal Length",
                            "Septal Width",
                            "Pedal Length",
                            "Pedal Width",
                            "Flower Type"],
                      dtype={'Septal Length':  np.float64,
                             'Septal Width' :  np.float64,
                             'Pedal Length' :  np.float64,
                             'Pedal Width'  :  np.float64})

<center><h2>Bayes Classification Objective Function</h2></center>
<p>$$y = argmax(c_i){P(c_i | x)} == argmax(c_i)(f_i(x)P(c_i))$$</p>

In [75]:
#
# Bayes Classifier Class Object
#
# Accepts: A pandas-like data frame object
#     - For Details: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
# Returns: 3 lists, P(c_i), mu_i, and sigma_i
#     - Ex) P_c_i[1], mu_i[1], sigma_i[1] ,
#           P_c_i[2], mu_i[2], sigma_i[2] , and
#           P_c_i[3], mu_i[3], sigma_i[3] are enough information to satisfy our objective Function.
#
# Objective Function: y = argmax of c_i = { P(c_i|x) * P(c_i) }
#

class Bayes_Classifier:
    
    # Class Attributes
    constructed = False
    built = False
    predicted = False
    
    # Constructor
    def __init__(self, DF_TRAIN):
        
        # Flag Appropriately
        self.constructed = True
        
        # Save training set
        self.DF_TRAIN = DF_TRAIN
        
        # Class labels
        self.classes = pd.unique(self.DF_TRAIN["Flower Type"])
        self.actual_labels = self.DF_TRAIN.ix[:,"Flower Type"]
        # Constants
        self.k = len(self.classes)
        self.n = len(self.DF_TRAIN)

        # Containers
        self.D_i = list()
        self.n_i = list()
        self.P_c_i = list()
        self.mean_i = list()
        self.sigma_i = list()
        
        
    # Methods    
    def build(self):
        
        # Flag Appropriately
        self.built = True
        
        # Algorthm 18.1
        for i in range(self.k):
            self.D_i.append(self.DF_TRAIN.loc[self.DF_TRAIN['Flower Type'] == self.classes[i]])
            self.n_i.append(len(self.D_i[i]))
            self.P_c_i.append(self.n_i[i] / self.n)
            self.mean_i.append(self.D_i[i].mean())
            self.sigma_i.append(np.cov(self.D_i[i].ix[:,:4].as_matrix().transpose(), bias=True))
            """
            print "D_i\n", D[i]
            print "n_i\n", n_i[i]
            print "P(c_i)\n", P_c_i[i]
            print "mean_i\n", mean_i[i]
            print "sigma_i\n", sigma_i[i]
            """
            
        # Return the model
        return self.D_i, self.n_i, self.P_c_i, self.mean_i, self.sigma_i
    
    def writeModel(self):
        
        # Round to 2 decimals as requested
        self.P_c_i = np.around(self.P_c_i, decimals=2)
        self.mean_i = np.around(self.mean_i, decimals=2)
        self.sigma_i = np.around(self.sigma_i, decimals=2)
        
        target = open("bayes_model.txt", 'w')
        target.write("--- skip this line --- P(c_i) is each line per class~\n")
        target.write(str(self.P_c_i[0]))
        target.write("\n")
        target.write(str(self.P_c_i[1]))
        target.write("\n")
        target.write(str(self.P_c_i[2]))
        target.write("\n")
        target.write("--- skip this line --- mean_i is each line per class~\n")
        for mu in self.mean_i:
            target.write(str(mu.tolist()[0]))
            target.write(",")
            target.write(str(mu.tolist()[1]))
            target.write(",")
            target.write(str(mu.tolist()[2]))
            target.write(",")
            target.write(str(mu.tolist()[3]))
            target.write("\n")
        target.write("--- skip this line --- sigma_i is each 4 lines per class~\n")
        for coV in self.sigma_i: #1,2,and3
            for v in coV:        #row1,2,3,and4
                for rowVal in v: #col1,2,3,and4
                    target.write(str(rowVal))
                    target.write(",")
                target.write("\n") 

        target.close()
    @classmethod
    def readModel(self):
        
        
        target = open("bayes_model.txt","r")
        lines = target.readlines()
        
        # P(c_i)
        label1 = lines[0]
        P_c_i_str = lines[1:4]
        P_c_i_flt = map(float, P_c_i_str)
        P_c_i_ls = P_c_i_flt
        P_c_i_ls = np.array(P_c_i_ls)
        
        # Mean_i
        label2 = lines[4]
        mu1_str = lines[5].split(",")
        mu2_str = lines[6].split(",")
        mu3_str = lines[7].split(",")
        mu1_flt = map(float, mu1_str)
        mu2_flt = map(float, mu2_str)
        mu3_flt = map(float, mu3_str)
        mean_i_ls = [mu1_flt, mu2_flt, mu3_flt]
        mean_i_ls = map(np.array, mean_i_ls)
        
        label3 = lines[8]
        cov1_str = lines[9:13] #4 lines
        cov2_str = lines[13:17] #4 lines
        cov3_str = lines[17:] #4 lines

        
        # Cov1
        s1 = cov1_str[0][:-2].split(",")
        s2 = cov1_str[1][:-2].split(",")
        s3 = cov1_str[2][:-2].split(",")
        s4 = cov1_str[3][:-2].split(",")
        cov1_str = [s1, s2, s3, s4]
        cov1_mat = np.matrix(cov1_str, dtype=np.float)
        
        # Cov2
        s1 = cov2_str[0][:-2].split(",")
        s2 = cov2_str[1][:-2].split(",")
        s3 = cov2_str[2][:-2].split(",")
        s4 = cov2_str[3][:-2].split(",")
        cov2_str = [s1, s2, s3, s4]
        cov2_mat = np.matrix(cov2_str, dtype=np.float)
        
        # Cov3
        s1 = cov3_str[0][:-2].split(",")
        s2 = cov3_str[1][:-2].split(",")
        s3 = cov3_str[2][:-2].split(",")
        s4 = cov3_str[3][:-2].split(",")
        cov3_str = [s1, s2, s3, s4]
        cov3_mat = np.matrix(cov3_str, dtype=np.float)
        
        sigma_i_ls = [cov1_mat, cov2_mat, cov3_mat]
        
        self.P_c_i = P_c_i_ls
        self.mean_i = mean_i_ls
        self.sigma_i = sigma_i_ls
        return self.P_c_i, self.mean_i , self.sigma_i
    
    @classmethod
    def predict(self, DF_TEST):
         
        # Flag Appropriately
        self.predicted = True
        
        # Save the testing set
        self.DF_TEST = DF_TEST
        
        # Containers
        self.predicted_labels = list()
        
        self.classes = pd.unique(self.DF_TEST["Flower Type"])
        self.k = len(self.classes)
        
        print len(self.DF_TEST)
        # For each point in DF_TEST
        for j in range(len(self.DF_TEST)):
            
            # Get the maxmimum probability classification
            max_probability_class_label = ""
            max_probability = 0
            for i in range(self.k):
                tmp = multivariate_normal.pdf(self.DF_TEST.ix[j,:4].as_matrix(),
                                              mean=self.mean_i[i], 
                                              cov=self.sigma_i[i])
                tmp = tmp * self.P_c_i[i]
                
                if(tmp > max_probability):
                    max_probability = tmp
                    max_probability_class_label = self.classes[i]
                    
            # Store our prediction for each point
            self.predicted_labels.append(max_probability_class_label)
            
        self.actual_labels = self.DF_TEST.ix[:,4].tolist()
        # Return the predictions
        return self.predicted_labels
    
    @classmethod
    def get_confusion(self, act, pred):
        if([self.constructed, self.built, self.predicted]):
            print "Safe to calculate."
            return confusion_matrix(act, pred)
            
        else:
            print "Not safe to calculate. Consider building and predicting with your model first."
            
    def perf_measure(self, y_true, y_pred):
        recall = metrics.recall_score(y_true, y_pred)
        precision = metrics.precision_score(y_true, y_pred)
        fscore = metrics.f1_score(y_true, y_pred)  

        
        return (recall, precision, fscore)

In [76]:
# BAYES CLASSIFIER
bayes_classifier = Bayes_Classifier(df_train)
D_i, n_i, P_c_i, mean_i, sigma_i = bayes_classifier.build()

#print "D_i:\n", D_i
#print "n_i:\n", n_i
#print "P_c_i:\n", P_c_i
#print "mean_i:\n", mean_i
#print "sigma_i:\n", sigma_i
bayes_classifier.writeModel()
bayes_classifier.readModel()

# Predict with the built model object
prediction_labels = bayes_classifier.predict(df_test)
print bayes_classifier.predicted_labels[0:5]
print bayes_classifier.actual_labels[0:5]
bayes_classifier.get_confusion(df_test["Flower Type"].tolist(), prediction_labels)
# Get how well the model did
#accuracy = bayes_classifier.get_accuracy()

50
['Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa']
0     Iris-virginica
1    Iris-versicolor
2    Iris-versicolor
3     Iris-virginica
4        Iris-setosa
Name: Flower Type, dtype: object
Safe to calculate.


array([[ 0,  0, 16],
       [17,  1,  0],
       [ 0, 16,  0]])

In [72]:
df_test["Flower Type"].head()

0    Iris-versicolor
1        Iris-setosa
2    Iris-versicolor
3    Iris-versicolor
4    Iris-versicolor
Name: Flower Type, dtype: object

In [73]:
prediction_labels[0:5]

['Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa']

In [11]:
#
# Naive Bayes Classifier Class Object
#
# Accepts: A pandas-like data frame object
#     - For Details: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
#
#

class Naive_Bayes_Classifier:
    
    # Class Attributes
    constructed = False
    built = False
    predicted = False
    
    # Constructor
    def __init__(self, DF_TRAIN):
        
        # Flag Appropriately
        self.constructed = True
        
        # Save training set
        self.DF_TRAIN = DF_TRAIN
        
        # Class labels
        self.classes = pd.unique(self.DF_TRAIN["Flower Type"])
        self.actual_labels = self.DF_TRAIN.ix[:,"Flower Type"]
        
        # Constants
        self.k = len(self.classes)
        self.n = len(self.DF_TRAIN)

        # Containers
        self.D_i = list()
        self.n_i = list()
        self.P_c_i = list()
        self.mean_i = list()
        self.sigma_i = list()
        
        
    # Methods    
    def build(self):
        
        # Flag Appropriately
        self.built = True
        
        # Algorthm 18.2
        for i in range(self.k):
            self.D_i.append(self.DF_TRAIN.loc[self.DF_TRAIN['Flower Type'] == self.classes[i]])
            self.n_i.append(len(self.D_i[i]))
            self.P_c_i.append(self.n_i[i] / self.n)
            self.mean_i.append(self.D_i[i].mean())
            
            variance_i = list()
            for j in range(len(self.DF_TRAIN.columns) - 1):
                variance_i.append(np.var(self.D_i[i].ix[:,j].as_matrix(), out=None))
                #print variance_i[j]
                #raw_input(",,,")
            self.sigma_i.append(variance_i)
            
            """
            print "D_i\n", self.D[i]
            print "n_i\n", self.n_i[i]
            print "P(c_i)\n", self.P_c_i[i]
            print "mean_i\n", self.mean_i[i][:self.DF_TRAIN.columns]
            print "sigma_i\n", self.sigma_i[i][:self.DF_TRAIN.columns]
            """
            
        # Return the model
        return self.D_i, self.n_i, self.P_c_i, self.mean_i, self.sigma_i

    
    def writeModel(self):
        
        # Round to 2 decimals as requested
        self.P_c_i = np.around(self.P_c_i, decimals=2)
        self.mean_i = np.around(self.mean_i, decimals=2)
        self.sigma_i = np.around(self.sigma_i, decimals=2)
        
        target = open("naive_bayes_model.txt", 'w')
        target.write("--- skip this line --- P(c_i) is each line per class~\n")
        target.write(str(self.P_c_i[0]))
        target.write("\n")
        target.write(str(self.P_c_i[1]))
        target.write("\n")
        target.write(str(self.P_c_i[2]))
        target.write("\n")
        target.write("--- skip this line --- mean_i is each line per class~\n")
        for mu in self.mean_i:
            target.write(str(mu.tolist()[0]))
            target.write(",")
            target.write(str(mu.tolist()[1]))
            target.write(",")
            target.write(str(mu.tolist()[2]))
            target.write(",")
            target.write(str(mu.tolist()[3]))
            target.write("\n")
        target.write("--- skip this line --- sigma_i is each lines per class~\n")
        for coV in self.sigma_i: #1,2,and3
            target.write(str(coV.tolist()[0]))
            target.write(",")
            target.write(str(coV.tolist()[1]))
            target.write(",")
            target.write(str(coV.tolist()[2]))
            target.write(",")
            target.write(str(coV.tolist()[3]))
            target.write("\n")
            
        target.close()
        
    def readModel(self):
        
        
        target = open("naive_bayes_model.txt","r")
        lines = target.readlines()
        
        # P(c_i)
        label1 = lines[0]
        P_c_i_str = lines[1:4]
        P_c_i_flt = map(float, P_c_i_str)
        P_c_i_ls = P_c_i_flt
        P_c_i_ls = np.array(P_c_i_ls)
        
        # Mean_i
        label2 = lines[4]
        mu1_str = lines[5].split(",")
        mu2_str = lines[6].split(",")
        mu3_str = lines[7].split(",")
        mu1_flt = map(float, mu1_str)
        mu2_flt = map(float, mu2_str)
        mu3_flt = map(float, mu3_str)
        mean_i_ls = [mu1_flt, mu2_flt, mu3_flt]
        mean_i_ls = map(np.array, mean_i_ls)
        
        label3 = lines[8]
        cov1_str = lines[9]
        cov2_str = lines[10]
        cov3_str = lines[11]
        
        # Var1
        cov1_str = cov1_str[:-1].split(",")
        cov1_arr = np.array(cov1_str, dtype=np.float)
        
        # Var1
        cov2_str = cov2_str[:-1].split(",")
        cov2_arr = np.array(cov2_str, dtype=np.float)
        
        # Var3
        cov3_str = cov3_str[:-1].split(",")
        cov3_arr = np.array(cov3_str, dtype=np.float)
        
        sigma_i_ls = [cov1_arr, cov2_arr, cov3_arr]
        
        self.P_c_i = P_c_i_ls
        self._mean_i = mean_i_ls
        self.sigma_i = sigma_i_ls
        
        return self.P_c_i, self._mean_i, self.sigma_i
        
    def predict(self, DF_TEST):
         
        # Flag Appropriately
        self.predicted = True
        
        # Save the testing set
        self.DF_TEST = DF_TEST
        
        # Containers
        self.predicted_labels = list()
        
        # For each point in DF_TEST
        for j in range(len(DF_TEST)):
            
            # Get the maxmimum probability classification
            max_probability_class_label = ""
            max_probability = 0
            for i in range(self.k):
                
                product_of_columns_probability = 1
                for r in range(len(self.DF_TEST.columns)-1):
                    product_of_columns_probability *= norm.pdf(self.DF_TEST.ix[j,r],
                                                                             self.mean_i[i][r], 
                                                                             self.sigma_i[i][r])
                
                tmp = product_of_columns_probability * self.P_c_i[i]
                if(tmp > max_probability):
                    max_probability = tmp
                    max_probability_class_label = self.classes[i]
                    
            # Store our prediction for each point
            self.predicted_labels.append(max_probability_class_label)
            
        #self.actual_labels = self.DF_TEST.ix[:,5]
        # Return the predictions
        return self.predicted_labels
    
    def get_confusion(self):
        if([self.constructed, self.built, self.predicted]):
            print "Safe to calculate."
        else:
            print "Not safe to calculate. Consider building and predicting with your model first."

In [12]:
# NAIVE BAYES CLASSIFIER
naive_bayes_classifier = Naive_Bayes_Classifier(df_train)
D_i, n_i, P_c_i, mean_i, sigma_i = naive_bayes_classifier.build()

#naive_bayes_classifier.writeModel()
#naive_bayes_classifier.readModel()

# Predict on the unseen data
predicted_labels = naive_bayes_classifier.predict(df_test)
predicted_labels

# Determine the accuracy of the model
#accuracy = naive_bayes_classifier.get_accuracy()


['Iris-virginica',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-versicolor',
 'Iris-virginica',
 'Iris-virginica',
 'Iris-setosa',
 'Iris-versicolor',
 'Iris-setosa']

<h2> Problem 1 <h3>Build the model file</h3></h2>
<ul> 
<li>Implement a python program that accepts a dataset as a command line parameter and generates
a model file in the current directory. </li>
<li>The model file contains: (i) the prior probabilities of each of
the classes;</li>
<li>(ii) the mean and the covariance matrix of each of the classes. Our objective is to use
this model file to perform classification using full Bayes classification method.</li>
<li>To ensure readabilityof the model file, please write all the numeric values using 2 digits after the decimal point. Youcan use build-in functions in the NumPy package for computing the mean and the covariance.</li>
</ul>


<h2>Problem 2</h2>
<h3>Testing the model</h3>
<ul>
<li>Implement a python program that accepts a model file (output of Q1) and a test file as command
line parameter.</li>
<li>The test file has identical format of the train file.</li>
<li>For each instance of the test file,the program outputs the predicted label.</li>
<li>The program also prints a confusion matrix by comparingthe true labels and predicted labels of all the instances.</li>
</ul>
  

<h2>Problem 3</h2>
<h3>3-fold Cross Validation</h3>
<ul>
<li>For this, make 3-folds of the file iris.txt.shuffled by considering 50 consecutive instances as one fold
1(do not reorder the instances in the files).</li>
<li>Use the program from Q1 for training purpose using
instances from two of the folds and use the program from Q2 for testing on the instances of the
remaining fold.</li>
<li>Print the confusion matrix for each of the three folds (when they were used as
test). Also, for each class, print the accuracy, precision, recall, and F-score, averaged over 3-folds.</li>
</ul>


<h2> Problem 4</h2>
<h3> Repeat 1-3 for Naive Bayes</h3>