# Homework 2 Part 2 Solutions

In [1]:
import numpy as np
from scipy.stats import multivariate_normal
from numpy.matlib import repmat

from sklearn.metrics import confusion_matrix

import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

# Problem 1

**Consider the i.i.d. data samples $\{x_i\}_{i=1}^N$. Suppose that the data samples are drawn from a Geometric distribution with parameter $\mu$. The Geometric distribution takes the form**

$$p(x|\mu) = (1-\mu)^{x-1}\mu$$

**Answer the following questions:**

1. **Find the MLE estimate for the parameter $\mu$ assuming a Geometric data likelihood.**

2. **Assuming a Beta distribution as the prior distribution on the parameter $\mu$, find the MAP estimate for the parameter $\mu$. The Beta distribution takes the form** 

$$\text{Beta}(x|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}$$

**where $\Gamma(x) = (x-1)!$ and $\alpha,\beta>0$.**

**Show your work.** 

***Note: there is no need to type your answer using LaTeX. Answer this question on paper and then upload a picture/scan of your work.***

## MLE

The data likelihood is given by:

\begin{align*}
\mathcal{L} &= \prod_{i=1}^N p(x_i|\mu)\\ 
&= \prod_{i=1}^N (1-\mu)^{x_i-1}\mu
\end{align*}

The log-data likelihood is then given by:

\begin{align*}
\ln \mathcal{L} &= \ln \left( \prod_{i=1}^N (1-\mu)^{x_i-1}\mu \right) \\
&= \sum_{i=1}^N \left((x_i-1)\ln(1-\mu) + \ln(\mu)\right) \\
&= \ln(1-\mu)\sum_{i=1}^N (x_i-1) + N\ln(\mu)
\end{align*}

Let's find the parameter $\mu$ that maximizes the log-data likelihood by setting its derivative with respect to $\mu$ to zero:

\begin{align*}
\frac{\partial \ln \mathcal{L}}{\partial \mu} &= 0 \\
-\sum_{i=1}^N \frac{x_i-1}{1-\mu} + \frac{N}{\mu} &= 0 \\
-\mu\sum_{i=1}^N (x_i-1) + (1-\mu)N &=0 \\
-\mu \sum_{i=1}^N x_i + N\mu + N - \mu N &=0\\
\mu &= \frac{N}{\sum_{i=1}^N x_i}
\end{align*}
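
As a numerical sanity check of the closed form above, here is a minimal sketch that compares $N/\sum_i x_i$ against a brute-force grid search over the log-likelihood. It assumes NumPy's geometric sampler, which uses the same support $\{1, 2, \dots\}$ as $p(x|\mu)$. The value `true_mu`, the sample size, and the grid are arbitrary choices for illustration, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = 0.3                               # hypothetical value, for illustration only
x = rng.geometric(true_mu, size=5000)       # samples with support {1, 2, ...}

def log_likelihood(mu, x):
    """Geometric log-likelihood: ln(1 - mu) * sum_i (x_i - 1) + N * ln(mu)."""
    return np.sum(x - 1) * np.log(1 - mu) + x.size * np.log(mu)

# Closed-form MLE derived above: N / sum_i x_i
mu_mle = x.size / np.sum(x)

# Brute-force maximizer on a grid of candidate values in (0, 1), spacing 0.001
grid = np.linspace(0.001, 0.999, 999)
mu_grid = grid[np.argmax([log_likelihood(m, x) for m in grid])]

print(mu_mle, mu_grid)
```

The two values should agree up to the grid spacing, and both should be close to `true_mu` for a sample of this size.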

## MAP

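For the MAP estimate, the objective $\mathcal{L}$ is the data likelihood multiplied by the prior, which is proportional to the posterior on $\mu$: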

\begin{align*}
\mathcal{L} &= \left(\prod_{i=1}^N p(x_i|\mu)\right)p(\mu|\alpha,\beta)\\ 
&= \prod_{i=1}^N \left( (1-\mu)^{x_i-1}\mu \right) \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1} (1-\mu)^{\beta-1}
\end{align*}

The logarithm of this objective (the log-posterior, up to an additive constant) is then given by:

\begin{align*}
\ln \mathcal{L} &= \ln \left( \prod_{i=1}^N \left( (1-\mu)^{x_i-1}\mu \right) \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1} (1-\mu)^{\beta-1} \right) \\
&= \sum_{i=1}^N \left((x_i-1) \ln(1-\mu) +\ln(\mu) \right) + \ln\left(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right) + (\alpha-1)\ln(\mu) + (\beta-1)\ln(1-\mu)\\
&= \left(\sum_{i=1}^N x_i -N +\beta-1\right) \ln(1-\mu) + \left(N+\alpha-1\right)\ln(\mu) + \ln\left(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)
\end{align*}

Let's find the parameter $\mu$ that maximizes this objective. The $\Gamma$ term does not depend on $\mu$, so it vanishes when differentiating:

\begin{align*}
\frac{\partial \ln \mathcal{L}}{\partial \mu} &= 0 \\
-\frac{\sum_{i=1}^N x_i -N +\beta-1}{1-\mu} + \frac{N+\alpha-1}{\mu} &= 0 \\
-\left(\sum_{i=1}^N x_i -N +\beta-1\right)\mu + \left(N+\alpha-1\right)(1-\mu) &= 0 \\
\mu &= \frac{N+\alpha-1}{\sum_{i=1}^N x_i +\beta+\alpha-2}
\end{align*}

And so, the parameter $\mu$ that maximizes the posterior is a function of both the data $\mathbf{x}$ and the prior parameters, $\alpha$ and $\beta$. With a uniform prior ($\alpha=\beta=1$), it reduces to the MLE estimate $N/\sum_{i=1}^N x_i$.
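
The MAP formula above can be sketched in a few lines and compared against the MLE. The function names `geometric_mle` and `geometric_map` and the five-point sample are made up for illustration. The sketch assumes $N+\alpha-1>0$ and $\sum_i x_i - N + \beta - 1>0$, so that the stationary point is an interior maximum.

```python
import numpy as np

def geometric_mle(x):
    """MLE for the Geometric parameter: N / sum_i x_i."""
    x = np.asarray(x)
    return x.size / x.sum()

def geometric_map(x, alpha, beta):
    """MAP estimate under a Beta(alpha, beta) prior:
    (N + alpha - 1) / (sum_i x_i + alpha + beta - 2)."""
    x = np.asarray(x)
    return (x.size + alpha - 1) / (x.sum() + alpha + beta - 2)

x = np.array([3, 1, 4, 2, 5])      # made-up sample: N = 5, sum = 15

print(geometric_mle(x))            # 5/15
print(geometric_map(x, 1, 1))      # uniform prior: same value as the MLE
print(geometric_map(x, 2, 8))      # (5 + 1) / (15 + 8) = 6/23
```

With only five samples, the Beta(2, 8) prior pulls the estimate from $1/3$ toward smaller values of $\mu$. As $N$ grows, the data terms dominate and the MAP estimate approaches the MLE.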

# Problem 2

## Crab Dataset Description

**The Crab Data Set has 200 samples and 7 features (Frontal Lip, Rear Width, Length, Width, Depth, Male and Female), describing 5 morphological measurements on 50 crabs each of two color forms and both sexes, of the species *Leptograpsus variegatus* collected at Fremantle, W. Australia.**

* **Dataset Source: Campbell, N.A. and Mahon, R.J. (1974) A multivariate study of variation in two species of rock crab of genus *Leptograpsus*. *Australian Journal of Zoology* 22, 417–425.**

**The data set is saved in the file "crab.txt": the first column corresponds to the class label (crab species) and the other 7 columns correspond to the features.**

**Use the first 140 samples as your training set and the last 60 samples as your test set.**

**Answer the following questions:**

1. **Implement the Naive Bayes classifier, under the assumption that your data likelihood model $p(x|C_j)$ is a multivariate Gaussian and the prior probabilities $p(C_j)$ are dictated by the number of samples $n_j\in\mathbb{R}$ that you have for each class.**

2. **Did you encounter any problems when implementing the probabilistic generative model? What is your solution for the problem? Explain why your solution works. (Note: There is more than one solution.)**

3. **Report your classification results in terms of a confusion matrix in both training and test set. (You can use the function ```confusion_matrix``` from the module ```sklearn.metrics```.)**

In [2]:
data = pd.read_csv("crab.txt", delimiter="\t")

data.head()

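| | Species | FrontalLip | RearWidth | Length | Width | Depth | Male | Female |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 20.6 | 14.4 | 42.8 | 46.5 | 19.6 | 1 | 0 |
| 1 | 1 | 13.3 | 11.1 | 27.8 | 32.3 | 11.3 | 1 | 0 |
| 2 | 0 | 16.7 | 14.3 | 32.3 | 37.0 | 14.7 | 0 | 1 |
| 3 | 1 | 9.8 | 8.9 | 20.4 | 23.9 | 8.8 | 0 | 1 |
| 4 | 0 | 15.6 | 14.1 | 31.0 | 34.5 | 13.8 | 0 | 1 |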


In [3]:
X_train = data.iloc[:140,1:].to_numpy()
y_train = data.iloc[:140,0].to_numpy()

X_test = data.iloc[140:,1:].to_numpy()
y_test = data.iloc[140:,0].to_numpy()

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((140, 7), (60, 7), (140,), (60,))

# Problem 2 Solutions

## Probabilistic Generative Classifier

In [4]:
# Prior probabilities

pC0 = np.sum(y_train==0)/y_train.size
pC1 = np.sum(y_train==1)/y_train.size

pC0, pC1

(0.5142857142857142, 0.4857142857142857)

In [5]:
# Means and covariances of the data likelihood

mu0 = np.mean(X_train[y_train==0,:],axis=0)
mu1 = np.mean(X_train[y_train==1,:],axis=0)

cov0 = np.cov(X_train[y_train==0,:].T)
cov1 = np.cov(X_train[y_train==1,:].T)

In [6]:
# Training Data Likelihood
y0_train = multivariate_normal.pdf(X_train, mean=mu0, cov=cov0) #P(x|C0)
y1_train = multivariate_normal.pdf(X_train, mean=mu1, cov=cov1) #P(x|C1)

# Test Data Likelihood
y0_test = multivariate_normal.pdf(X_test, mean=mu0, cov=cov0) #P(x|C0)
y1_test = multivariate_normal.pdf(X_test, mean=mu1, cov=cov1) #P(x|C1)

LinAlgError: singular matrix

Note that with all 7 features, the covariance matrix $\Sigma_X$ is singular, which is why the call above fails. The cause is that two of the features are collinear: Male and Female are complementary indicators (Female $= 1 -$ Male), so each is an affine function of the other and the covariance matrix loses rank.

There are at least 2 ways to address this issue:

1. Eliminate one of the redundant features (method used here)
2. Diagonally load the covariance matrix: $\Sigma_X + \lambda I$, with a small $\lambda > 0$
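
The second option is not used in this solution, but it can be sketched as follows. This is a minimal illustration on synthetic data, not the crab data set: the random features, the indicator column `male`, and the loading value `lam` are all assumptions chosen for the example. In practice, `lam` would need to be tuned.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
male = rng.integers(0, 2, size=n)
male[:2] = [0, 1]                      # make sure both values occur
# Three synthetic continuous features plus two complementary indicators
X = np.column_stack([rng.normal(size=(n, 3)), male, 1 - male])

cov = np.cov(X.T)
# The smallest eigenvalue is zero up to rounding: the matrix is singular
print(np.linalg.eigvalsh(cov).min())

lam = 1e-3                             # hypothetical loading value
cov_loaded = cov + lam * np.eye(cov.shape[0])
# Loading shifts every eigenvalue up by lam, so the matrix becomes invertible
print(np.linalg.eigvalsh(cov_loaded).min())

# The Cholesky factorization succeeds only for positive definite matrices
chol = np.linalg.cholesky(cov_loaded)
```

Diagonal loading works because adding $\lambda I$ adds $\lambda$ to every eigenvalue of $\Sigma_X$, so the zero eigenvalue caused by the collinear pair becomes strictly positive. Eliminating a feature removes the redundancy itself, whereas loading keeps all features and slightly perturbs the model.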

In [7]:
# Eliminating the last feature ("female")

X_train = data.iloc[:140,1:7].to_numpy()
X_test = data.iloc[140:,1:7].to_numpy()
y_train = data.iloc[:140,0].to_numpy()
y_test = data.iloc[140:,0].to_numpy()

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((140, 6), (60, 6), (140,), (60,))

In [8]:
# Recomputing the mean and covariance estimates
# (np.cov defaults to the unbiased N-1 normalization; bias=True gives the MLE)

mu0 = np.mean(X_train[y_train==0,:],axis=0)
mu1 = np.mean(X_train[y_train==1,:],axis=0)

cov0 = np.cov(X_train[y_train==0,:].T)
cov1 = np.cov(X_train[y_train==1,:].T)

In [9]:
# Training Data Likelihood
y0_train = multivariate_normal.pdf(X_train, mean=mu0, cov=cov0) #P(x|C0)
y1_train = multivariate_normal.pdf(X_train, mean=mu1, cov=cov1) #P(x|C1)

# Test Data Likelihood
y0_test = multivariate_normal.pdf(X_test, mean=mu0, cov=cov0) #P(x|C0)
y1_test = multivariate_normal.pdf(X_test, mean=mu1, cov=cov1) #P(x|C1)

y0_train.shape, y1_train.shape

((140,), (140,))

In [10]:
# Posterior for Training Data
pos0_train = (y0_train*pC0)/(y0_train*pC0 + y1_train*pC1) # Class 0
pos1_train = (y1_train*pC1)/(y0_train*pC0 + y1_train*pC1) # Class 1
pos_train = np.array([pos0_train, pos1_train]).T # Creating a matrix with posterior probabilities
likelihood_train = np.array([y0_train, y1_train]).T # Creating a matrix with likelihoods

# Posterior for Test Data
pos0_test = (y0_test*pC0)/(y0_test*pC0 + y1_test*pC1) # Class 0
pos1_test = (y1_test*pC1)/(y0_test*pC0 + y1_test*pC1) # Class 1
pos_test = np.array([pos0_test, pos1_test]).T # Creating a matrix with posterior probabilities
likelihood_test = np.array([y0_test, y1_test]).T # Creating a matrix with likelihoods

# Prediction for Training Data
predict_train = np.argmax(pos_train, axis=1) # Label prediction: class with the largest posterior
# Likelihood value for the assigned class (one entry per sample)
predict_likelihood_train = likelihood_train[np.arange(predict_train.size), predict_train]

# Prediction for Test Data
predict_test = np.argmax(pos_test, axis=1) # Label prediction for test set
# Likelihood value for the assigned class, taken from the test likelihoods
predict_likelihood_test = likelihood_test[np.arange(predict_test.size), predict_test]

In [11]:
print('Confusion matrix in Training')
print(confusion_matrix(y_train, predict_train))

print('Confusion matrix in Test')
print(confusion_matrix(y_test, predict_test))

Confusion matrix in Training
[[72  0]
 [ 0 68]]
Confusion matrix in Test
[[28  0]
 [ 0 32]]


The confusion matrix shows that all points were correctly classified for both training and test sets.
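
To summarize each confusion matrix as a single number, accuracy can be computed as the sum of the diagonal divided by the total count. A minimal sketch follows, using the two matrices printed above. The helper name `accuracy` is made up for illustration.

```python
import numpy as np

# Confusion matrices copied from the output above (rows: true, columns: predicted)
cm_train = np.array([[72, 0],
                     [0, 68]])
cm_test = np.array([[28, 0],
                    [0, 32]])

def accuracy(cm):
    """Fraction of correctly classified samples: trace / total."""
    return np.trace(cm) / cm.sum()

print(accuracy(cm_train), accuracy(cm_test))
```

Both matrices are diagonal, so both accuracies equal 1.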