# Homework 2: Bayes Optimal Classifiers

- Name: Congxin (David) Xu
- Computing ID: cx2rx

### Honor Pledge: 
I have neither given nor received aid on this assignment.

### Question 1
The authors in [1] describe a marketing campaign by a bank in Portugal. Modify the `Exercise3.2LDAQDA_IrisSoution` Python code to use the data (`bank-full.csv`) from this marketing campaign, with only the predictor variables `age`, `balance`, and `duration` and the response variable `y`, to create the Bayes optimal classifiers for each of the conditions listed below. Provide your Python code for each of these cases.

(a) Assume Gaussian class-conditional likelihoods with unequal variance-covariance matrices, with each of the following additional assumptions applied separately to each decision rule in this class:

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read in the data
data = pd.read_csv("bank-full.csv")
data['Class'] = data['y']
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,Class
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,no


In [3]:
data_1a = data[['age', 'balance', 'duration', 'Class']]
data_1a.head()

Unnamed: 0,age,balance,duration,Class
0,58,2143,261,no
1,44,29,151,no
2,33,2,76,no
3,47,1506,92,no
4,33,1,198,no


In [4]:
def multivariate_gaussian_pdf(X, MU, SIGMA):
    """Code from Data Blog https://xavierbourretsicotte.github.io/MLE_Multivariate_Gaussian.html
    Maximum Likelihood Estimator: Multivariate Gaussian Distribution
        by Xavier Bourret Sicotte, Fri 22 June 2018
    Returns the pdf of a multivariate Gaussian distribution
     - X, MU are p x 1 vectors
     - SIGMA is a p x p matrix"""
    # Initialize and reshape
    X = X.reshape(-1, 1)
    MU = MU.reshape(-1, 1)
    p, _ = SIGMA.shape

    # Compute values
    SIGMA_inv = np.linalg.inv(SIGMA)
    denominator = np.sqrt((2 * np.pi) ** p * np.linalg.det(SIGMA))
    exponent = -(1 / 2) * ((X - MU).T @ SIGMA_inv @ (X - MU))

    # Return result
    # the product is a 1 x 1 array; .item() extracts it as a Python float
    return ((1. / denominator) * np.exp(exponent)).item()

class QDA:
    """Creates a class for Quadratic Discriminant Analysis
    Input:
        data = pandas data frame for a csv file, must have one column labeled "Class" and the rest numeric data
    Methods:
        compute_likelihoods: given an input observation, computes the likelihood for each class and the GML class
        compute_probabilities: given an input observation and prior probabilities,
            computes the posterior probabilities for each class and most probable class"""

    def __init__(self, data):
        # reads the data and computes the statistics needed for classification

        # the input is already a pandas data frame (here, the bank marketing data)
        df = data

        # separate the class labels from the rest of the data
        # we are assuming the column name with class labels is 'Class'
        # and all other columns are numeric
        self.data_labels = df.loc[:]['Class']
        self.data = np.asarray(df.drop('Class', axis=1, inplace=False))

        # get information about the dimensions of the data
        self.num_rows, self.num_cols = self.data.shape

        # get the class names as an array of strings
        self.class_names = np.unique(self.data_labels)

        # determine number of observations in each class
        self.num_obs = dict()
        for name in self.class_names:
            self.num_obs[name] = sum(self.data_labels == name)

        # compute the mean of each class
        self.means = dict()
        for name in self.class_names:
            self.means[name] = np.mean(self.data[self.data_labels == name, :], 0)

        # compute the covariance matrix of each class
        self.covs = dict()
        for name in self.class_names:
            self.covs[name] = np.cov(np.transpose(self.data[self.data_labels == name, :]))

    def compute_likelihoods(self, x):
        # compute and output the likelihood of each class and the maximum likelihood class

        # check that the input data x has the correct number of values
        if not (len(x) == self.num_cols):
            print('Data vector has wrong number of values.')
            return -1

        # reformat x as a numpy array, in case the user input a list
        x = np.asarray(x)

        # compute the likelihood of each class
        likelihoods = np.zeros(len(self.class_names))
        idx = 0
        for name in self.class_names:
            likelihoods[idx] = multivariate_gaussian_pdf(x, self.means[name], self.covs[name])
            idx = idx + 1
        # get the indices for sorting the likelihoods (in descending order)
        indices_sorted = np.argsort(likelihoods)[::-1]

        # print the predicted class and all class likelihoods
        print('QDA Predicted Class: ' + self.class_names[indices_sorted[0]])
        print('QDA Class Likelihoods:')
        for idx in range(len(indices_sorted)):
            print(self.class_names[indices_sorted[idx]] + ': ' + str(likelihoods[indices_sorted[idx]]))

        # return the likelihoods
        return likelihoods

    def compute_probabilities(self, x, priors):
        # compute and output the probability of each class and the maximum probability class
        
        #likelihoods = self.compute_likelihoods(x)
        # check that the input data x has the correct number of values
        if not (len(x) == self.num_cols):
            print('Data vector has wrong number of values.')
            return -1

        # reformat x as a numpy array, in case the user input a list
        x = np.asarray(x)

        # compute the likelihood of each class
        likelihoods = np.zeros(len(self.class_names))
        idx = 0
        for name in self.class_names:
            likelihoods[idx] = multivariate_gaussian_pdf(x, self.means[name], self.covs[name])
            idx = idx + 1
        
        #Number of classes
        Nclass = len(self.class_names)
        
        
        # posterior probabilities: likelihood * prior / normalize
        postprob = np.zeros(Nclass)
        idx = 0
        normalize = 0
        for name in self.class_names:
            proportion = likelihoods[idx] * priors[name]
            postprob[idx] = proportion
            normalize += proportion
            idx = idx + 1
        postprob = np.round(postprob / normalize,4)
        # get the indices for sorting the posterior probabilities (in descending order)
        indices_sorted = np.argsort(postprob)[::-1]
        
        
        print('QDA Predicted Class: ' + self.class_names[indices_sorted[0]])
        print('QDA Posterior Probabilities:')
        for idx in range(len(indices_sorted)):
            print(self.class_names[indices_sorted[idx]] + ': ' + str(postprob[indices_sorted[idx]]))

        # return the posterior probabilities
        return postprob

In [5]:
# Running the model
model_qda = QDA(data_1a)
observation = [47, 1506, 92]
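
The `multivariate_gaussian_pdf` helper above forms the inverse and determinant of the covariance matrix and exponentiates directly, which can underflow when an observation lies far from a class mean. The following is a minimal sketch of the same Gaussian likelihood and posterior computed in the log domain. The names `log_gaussian_pdf` and `log_posteriors` are hypothetical helpers, not part of the original code, and the dictionary layout is assumed to mirror `self.means` and `self.covs` in the `QDA` class.

```python
import numpy as np

def log_gaussian_pdf(x, mu, sigma):
    """Log-density of a multivariate Gaussian (hypothetical helper)."""
    x = np.asarray(x, dtype=float).ravel()
    mu = np.asarray(mu, dtype=float).ravel()
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    diff = x - mu
    # slogdet returns (sign, log|det|); the sign must be +1 for a valid covariance
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Mahalanobis term via a linear solve instead of an explicit inverse
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

def log_posteriors(x, means, covs, priors):
    """Posterior probabilities from log-likelihood + log-prior scores.

    means, covs, priors: dicts keyed by class name (assumed layout).
    """
    names = sorted(means)
    scores = np.array([
        log_gaussian_pdf(x, means[k], covs[k]) + np.log(priors[k])
        for k in names
    ])
    scores -= scores.max()          # shift before exponentiating (log-sum-exp)
    post = np.exp(scores)
    return dict(zip(names, post / post.sum()))
```

With a fitted model, a call such as `log_posteriors(observation, model_qda.means, model_qda.covs, uninformative_priors)` should agree with `compute_probabilities` up to rounding, under the assumption that the covariance matrices are well conditioned.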

#### i: Equal class priors and equal costs for misclassification

In [6]:
uninformative_priors = {
    "no": 1 / 2,
    "yes": 1 / 2
}
model_qda.compute_probabilities(observation, uninformative_priors)

QDA Predicted Class: no
QDA Posterior Probabilities:
no: 0.8071
yes: 0.1929


array([0.8071, 0.1929])

#### ii: The prior for not selecting the new bank service is 0.9 and misclassification costs are equal

In [7]:
informative_priors = {
    "no": 0.9,
    "yes": 0.1
}
model_qda.compute_probabilities(observation, informative_priors)

QDA Predicted Class: no
QDA Posterior Probabilities:
no: 0.9741
yes: 0.0259


array([0.9741, 0.0259])
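
Because the class likelihoods do not change between cases i and ii, the case ii result can be cross-checked from the case i output alone: with equal priors the posterior is proportional to the likelihood, so multiplying it by the new priors and renormalizing should reproduce the new posterior. A small sketch of that check follows. `reweight_posterior` is a hypothetical helper, and the inputs are the rounded values printed above, so agreement is only approximate.

```python
import numpy as np

def reweight_posterior(posterior_equal, weights):
    """Rescale posteriors obtained under equal priors by new class weights."""
    names = list(posterior_equal)
    unnorm = np.array([posterior_equal[k] * weights[k] for k in names])
    return dict(zip(names, unnorm / unnorm.sum()))

# rounded QDA posteriors reported for case i (equal priors)
equal = {"no": 0.8071, "yes": 0.1929}

# apply the case ii priors and renormalize
updated = reweight_posterior(equal, {"no": 0.9, "yes": 0.1})
print({k: round(float(v), 4) for k, v in updated.items()})
```

The result matches the reported 0.9741 and 0.0259 to within rounding of the inputs.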

#### iii: The prior for not selecting the new bank service is 0.9, and the cost of classifying a customer as not a new service candidate when they are one is 15 times the cost of classifying a customer as a new service candidate when they are not

In [8]:
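# Fold the 15:1 cost ratio into the priors: each class weight is its prior
# times the (normalized) cost of misclassifying a customer of that class.
informative_priors = {
    "no": (0.9 * 1 / 16),
    "yes": (0.1 * 15 / 16)
}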
model_qda.compute_probabilities(observation, informative_priors)

QDA Predicted Class: no
QDA Posterior Probabilities:
no: 0.7151
yes: 0.2849


array([0.7151, 0.2849])
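
Case iii folds the 15:1 cost ratio into the class weights. An equivalent way to read the same rule is to keep the case ii posterior and choose the class with the smallest expected misclassification cost. The sketch below illustrates this under an assumed cost-matrix layout; `min_risk_decision` and the `cost` array are hypothetical names, and the posterior values are the rounded case ii output.

```python
import numpy as np

classes = ["no", "yes"]

# cost[i, j] = cost of predicting classes[j] when the truth is classes[i]
# (assumed layout; correct decisions cost nothing)
cost = np.array([
    [0.0, 1.0],    # true "no":  predicting "yes" costs 1
    [15.0, 0.0],   # true "yes": predicting "no" costs 15
])

def min_risk_decision(posterior, cost, classes):
    """Return the class with minimum expected cost and the per-class risks."""
    post = np.array([posterior[c] for c in classes])
    risk = post @ cost   # risk[j] = sum_i P(class i | x) * cost[i, j]
    return classes[int(np.argmin(risk))], dict(zip(classes, risk))

# rounded QDA posteriors reported for case ii (priors 0.9 / 0.1)
decision, risk = min_risk_decision({"no": 0.9741, "yes": 0.0259}, cost, classes)
print(decision, risk)
```

Note that the values printed for case iii are normalized cost-weighted scores rather than posterior probabilities in the strict sense. The predicted class is the same under either formulation.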

(b) Assume Gaussian class-conditional likelihoods with equal variance-covariance matrices, with each of the following additional assumptions applied separately to each decision rule in this class:

In [9]:
###########################
# LDA Code

def multivariate_gaussian_pdf(X, MU, SIGMA):
    """Code from Data Blog https://xavierbourretsicotte.github.io/MLE_Multivariate_Gaussian.html
    Maximum Likelihood Estimator: Multivariate Gaussian Distribution
        by Xavier Bourret Sicotte, Fri 22 June 2018
    Returns the pdf of a multivariate Gaussian distribution
     - X, MU are p x 1 vectors
     - SIGMA is a p x p matrix"""
    # Initialize and reshape
    X = X.reshape(-1, 1)
    MU = MU.reshape(-1, 1)
    p, _ = SIGMA.shape

    # Compute values
    SIGMA_inv = np.linalg.inv(SIGMA)
    denominator = np.sqrt((2 * np.pi) ** p * np.linalg.det(SIGMA))
    exponent = -(1 / 2) * ((X - MU).T @ SIGMA_inv @ (X - MU))

    # Return result
    # the product is a 1 x 1 array; .item() extracts it as a Python float
    return ((1. / denominator) * np.exp(exponent)).item()


class LDA:
    """Creates a class for Linear Discriminant Analysis
    Input:
        data = pandas data frame for a csv file, must have one column labeled "Class" and the rest numeric data
    Methods:
        compute_likelihoods: given an input observation, computes the likelihood for each class and the GML class
        compute_probabilities: given an input observation and prior probabilities,
            computes the posterior probabilities for each class and most probable class"""

    def __init__(self, data):
        # reads the data and computes the statistics needed for classification

        # the input is already a pandas data frame (here, the bank marketing data)
        df = data

        # separate the class labels from the rest of the data
        # we are assuming the column name with class labels is 'Class'
        # and all other columns are numeric
        self.data_labels = df.loc[:]['Class']
        self.data = np.asarray(df.drop('Class', axis=1, inplace=False))

        # get information about the dimensions of the data
        self.num_rows, self.num_cols = self.data.shape

        # get the class names as an array of strings
        self.class_names = np.unique(self.data_labels)

        # determine number of observations in each class
        self.num_obs = dict()
        for name in self.class_names:
            self.num_obs[name] = sum(self.data_labels == name)

        # compute the mean of each class
        self.means = dict()
        for name in self.class_names:
            self.means[name] = np.mean(self.data[self.data_labels == name, :], 0)

        # compute the shared covariance matrix as the class-size-weighted
        # average of the per-class covariance matrices
        self.cov = np.zeros([self.num_cols, self.num_cols])
        for name in self.class_names:
            self.cov = self.cov + self.num_obs[name] * np.cov(np.transpose(self.data[self.data_labels == name, :]))
        self.cov = self.cov / self.num_rows

    def compute_likelihoods(self, x):
        # compute and output the likelihood of each class and the maximum likelihood class

        # check that the input data x has the correct number of values
        if not (len(x) == self.num_cols):
            print('Data vector has wrong number of values.')
            return -1

        # reformat x as a numpy array, in case the user input a list
        x = np.asarray(x)

        # compute the likelihood of each class
        likelihoods = np.zeros(len(self.class_names))
        idx = 0
        for name in self.class_names:
            likelihoods[idx] = multivariate_gaussian_pdf(x, self.means[name], self.cov)
            idx = idx + 1

        # get the indices for sorting the likelihoods (in descending order)
        indices_sorted = np.argsort(likelihoods)[::-1]

        # print the predicted class and all class likelihoods
        print('LDA Predicted Class: ' + self.class_names[indices_sorted[0]])
        print('LDA Class Likelihoods:')
        for idx in range(len(indices_sorted)):
            print(self.class_names[indices_sorted[idx]] + ': ' + str(likelihoods[indices_sorted[idx]]))

        # return the likelihoods
        return likelihoods

    def compute_probabilities(self, x, priors):
        # compute and output the probability of each class and the maximum probability class
        
        #likelihoods = self.compute_likelihoods(x)
        # check that the input data x has the correct number of values
        if not (len(x) == self.num_cols):
            print('Data vector has wrong number of values.')
            return -1

        # reformat x as a numpy array, in case the user input a list
        x = np.asarray(x)

        # compute the likelihood of each class
        likelihoods = np.zeros(len(self.class_names))
        idx = 0
        for name in self.class_names:
            likelihoods[idx] = multivariate_gaussian_pdf(x, self.means[name], self.cov)
            idx = idx + 1
        
        #Number of classes
        Nclass = len(self.class_names)
        
        
        # posterior probabilities: likelihood * prior / normalize
        postprob = np.zeros(Nclass)
        idx = 0
        normalize = 0
        for name in self.class_names:
            proportion = likelihoods[idx] * priors[name]
            postprob[idx] = proportion
            normalize += proportion
            idx = idx + 1
        postprob = np.round(postprob / normalize, 4)
        # get the indices for sorting the posterior probabilities (in descending order)
        indices_sorted = np.argsort(postprob)[::-1]
        
        
        print('LDA Predicted Class: ' + self.class_names[indices_sorted[0]])
        print('LDA Posterior Probabilities:')
        for idx in range(len(indices_sorted)):
            print(self.class_names[indices_sorted[idx]] + ': ' + str(postprob[indices_sorted[idx]]))

        # return the posterior probabilities
        return postprob

In [10]:
# Running the model
model_lda = LDA(data_1a)
observation = [47, 1506, 92]
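
The `LDA` class above averages the per-class covariance matrices with weights `num_obs / num_rows`. A common alternative is the unbiased pooled estimate, which weights each class by `n_k - 1` and divides by `N - K`, where `K` is the number of classes. The sketch below shows that alternative; `pooled_covariance` is a hypothetical helper, not part of the original code, and the synthetic data are for illustration only.

```python
import numpy as np

def pooled_covariance(X, labels):
    """Unbiased pooled covariance: within-class scatter divided by N - K."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    class_names = np.unique(labels)
    n_rows, n_cols = X.shape
    scatter = np.zeros((n_cols, n_cols))
    for name in class_names:
        Xc = X[labels == name]
        centered = Xc - Xc.mean(axis=0)   # remove the class mean
        scatter += centered.T @ centered  # accumulate within-class scatter
    return scatter / (n_rows - len(class_names))

# illustration on synthetic data (not the bank data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 3))
labels_demo = np.array(["no"] * 20 + ["yes"] * 10)
sigma_pooled = pooled_covariance(X_demo, labels_demo)
```

For large samples the two weightings give nearly identical matrices, so the choice is unlikely to change the classifications reported here.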

#### i: Equal class priors and equal costs for misclassification

In [11]:
uninformative_priors = {
    "no": 1 / 2,
    "yes": 1 / 2
}
model_lda.compute_probabilities(observation, uninformative_priors)

LDA Predicted Class: no
LDA Posterior Probabilities:
no: 0.8291
yes: 0.1709


array([0.8291, 0.1709])

#### ii: The prior for not selecting the new bank service is 0.9 and misclassification costs are equal

In [12]:
informative_priors = {
    "no": 0.9,
    "yes": 0.1
}
model_lda.compute_probabilities(observation, informative_priors)

LDA Predicted Class: no
LDA Posterior Probabilities:
no: 0.9776
yes: 0.0224


array([0.9776, 0.0224])

#### iii: The prior for not selecting the new bank service is 0.9, and the cost of classifying a customer as not a new service candidate when they are one is 15 times the cost of classifying a customer as a new service candidate when they are not

In [13]:
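# Fold the 15:1 cost ratio into the priors: each class weight is its prior
# times the (normalized) cost of misclassifying a customer of that class.
informative_priors = {
    "no": (0.9 * 1 / 16),
    "yes": (0.1 * 15 / 16)
}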
model_lda.compute_probabilities(observation, informative_priors)

LDA Predicted Class: no
LDA Posterior Probabilities:
no: 0.7444
yes: 0.2556


array([0.7444, 0.2556])

### Question 2
Use numpy and pandas to develop a Naive Bayes classifier for edible mushrooms with the data in `MushroomData.csv` and `MushroomVariables.txt`. Use $\frac{2}{3}$ of the observations to train the classifier and test on $\frac{1}{3}$. Submit your code and your testing results. Hint: Treat the observations as documents and the features with their values as words, in a similar manner to the approach used to classify restaurant reviews.

In [30]:
# Read in Column Names
header = pd.read_csv("MushroomVariables.txt", sep=",", header = None)
header = list(header.values)

# Read in Data
data_2 = pd.read_csv("MushroomData.csv", header = None)
data_2.columns = header
data_2.head()

Unnamed: 0,edible_class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
1,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,BROWN,SEVERAL,WOODS
2,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,PINK,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
3,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,PINK,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,BROWN,SEVERAL,WOODS
4,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,BROWN,...,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
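
The hint in the question treats each mushroom as a document whose "words" are feature-value pairs. One way to sketch that idea with only numpy and pandas is a categorical Naive Bayes model with add-one (Laplace) smoothing, trained on a shuffled two-thirds split. Everything below is an assumption for illustration: `CategoricalNB` is a hypothetical from-scratch class (not scikit-learn's), the `label` column name is a placeholder for `edible_class`, and the toy rows are made up rather than taken from `MushroomData.csv`.

```python
import numpy as np
import pandas as pd

class CategoricalNB:
    """Naive Bayes for categorical features with Laplace smoothing (sketch)."""

    def fit(self, X, y, alpha=1.0):
        self.classes_ = np.unique(y)
        self.log_prior_ = {c: np.log(np.mean(y == c)) for c in self.classes_}
        self.log_cond_ = {}    # (column, class) -> log P(value | class)
        self.log_unseen_ = {}  # (column, class) -> log-prob of an unseen value
        for col in X.columns:
            levels = X[col].unique()
            for c in self.classes_:
                counts = X.loc[y == c, col].value_counts()
                counts = counts.reindex(levels, fill_value=0)
                # reserve one extra smoothing slot for values unseen in training
                denom = counts.sum() + alpha * (len(levels) + 1)
                self.log_cond_[(col, c)] = np.log((counts + alpha) / denom)
                self.log_unseen_[(col, c)] = np.log(alpha / denom)
        return self

    def predict(self, X):
        scores = pd.DataFrame(index=X.index, columns=self.classes_, dtype=float)
        for c in self.classes_:
            total = pd.Series(self.log_prior_[c], index=X.index)
            for col in X.columns:
                mapped = X[col].map(self.log_cond_[(col, c)])
                total = total + mapped.fillna(self.log_unseen_[(col, c)])
            scores[c] = total
        # most probable class per row
        return scores.idxmax(axis=1)

# toy data for illustration only
toy = pd.DataFrame({
    "odor": ["ALMOND", "ALMOND", "FOUL", "FOUL", "NONE", "FOUL"],
    "bruises": ["BRUISES", "BRUISES", "NO", "NO", "BRUISES", "NO"],
    "label": ["EDIBLE", "EDIBLE", "POISONOUS", "POISONOUS", "EDIBLE", "POISONOUS"],
})

# shuffle, then use 2/3 of the rows for training and 1/3 for testing
shuffled = toy.sample(frac=1.0, random_state=0)
n_train = int(len(shuffled) * 2 / 3)
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

model = CategoricalNB().fit(train.drop(columns="label"), train["label"])
pred = model.predict(test.drop(columns="label"))
accuracy = (pred == test["label"]).mean()
print("test accuracy:", accuracy)
```

Working in log-probabilities avoids underflow when many feature likelihoods are multiplied, and the extra smoothing slot keeps a feature value that never appeared in training from zeroing out a class.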
