# Question 4
Classify images of digits using Bayes' theorem. The parameters of the Gaussian are unknown here, so first estimate the parameters of the Gaussian distribution with a parameter estimation method such as maximum likelihood estimation (MLE), then use Bayes' theorem to classify the images. Divide the image dataset into half training set and half test set.

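In the MLE (maximum likelihood estimation) method we work with the likelihood $p(D|\theta)$ of the dataset $D = \{x_1, x_2, ..., x_n\}$. The estimate $\hat{\theta}_{ML}$ is the value of $\theta$ that maximizes $p(D|\theta)$. Assuming the samples are independent and identically distributed, the likelihood factorizes into a product over the samples, as in the equation below: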
$$
\begin{equation}
\hat{\theta}_{ML} = \arg\max_{\theta} \; p(D|\theta), \qquad p(D|\theta) = p(x_1, x_2, ..., x_n | \theta) = \prod_{i=1}^{n} p(x_i | \theta)
\end{equation} 
$$
Continuing from the equation above, the maximizer is found by setting the derivative of $p(D|\theta)$ with respect to $\theta$ to zero:
$$
\begin{equation}
\frac{\partial}{\partial \theta} \prod_{i=1}^{n} p(x_i | \theta) = 0
\end{equation} 
$$
To simplify this, we take the logarithm, which turns the product into a summation. Because the logarithm is monotonically increasing, the maximizer does not change:
$$
\begin{equation}
\hat{\theta}_{ML} = \arg\max_{\theta} \; \ln \prod_{i=1}^{n} p(x_i | \theta) \quad \Rightarrow \quad \frac{\partial}{\partial \theta} \sum_{i=1}^{n} \ln p(x_i | \theta) = 0
\end{equation} 
$$
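
The logarithm also helps numerically. The following is a minimal sketch, assuming a univariate Gaussian with unit variance and made-up sample values (not taken from the dataset used in this notebook). It evaluates the log-likelihood on a grid of candidate means, which should peak at the sample mean, and shows that the raw product of many densities underflows while the sum of logarithms stays finite. The helper names `gaussian_pdf` and `log_likelihood` are illustrative only.

```python
import math

def gaussian_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(samples, mu, sigma=1.0):
    """Sum of log densities, i.e. the logarithm of the product in equation 1."""
    return sum(math.log(gaussian_pdf(x, mu, sigma)) for x in samples)

# Hypothetical samples, chosen only for illustration
samples = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_mean = sum(samples) / len(samples)

# Evaluate the log-likelihood on a grid of candidate means: 0.0, 0.1, ..., 6.0
candidates = [i / 10 for i in range(0, 61)]
best_mu = max(candidates, key=lambda mu: log_likelihood(samples, mu))

# With 2000 samples the raw product of densities underflows in double
# precision, while the sum of logarithms remains a finite number
many = samples * 400
raw_product = math.prod(gaussian_pdf(x, sample_mean) for x in many)
log_sum = log_likelihood(many, sample_mean)

print(best_mu, sample_mean)
print(raw_product, log_sum)
```

The grid search is only there to make the maximum visible; for the Gaussian the maximizer has the closed form given in the next equations.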

We choose the normal distribution for $p(x|\theta)$. Solving equation 3 for its parameters gives the mean and covariance estimates below:
$$
\begin{equation}
\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k
\end{equation}
$$
$$
\begin{equation}
\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu}) (x_k - \hat{\mu})^t 
\end{equation}
$$
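
As a quick sanity check of equations 4 and 5, here is a minimal NumPy sketch on synthetic data (the array `X` is randomly generated and hypothetical, not the image dataset). It computes both estimates directly and compares the covariance against `np.cov` with `bias=True`, which uses the same divisor $n$ as the maximum likelihood estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 samples of a 3-dimensional feature vector
X = rng.normal(size=(500, 3))

n = X.shape[0]
mu_hat = X.sum(axis=0) / n                 # equation 4
centered = X - mu_hat                      # x_k - mu_hat for every row k
sigma_hat = centered.T @ centered / n      # equation 5: sum of outer products over n

# np.cov divides by n - 1 by default; bias=True switches to the divisor n
reference = np.cov(X, rowvar=False, bias=True)
print(np.allclose(sigma_hat, reference))
```

The matrix product `centered.T @ centered` is the sum over $k$ of the outer products in equation 5, so no explicit loop is needed.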
So let's get to work and first calculate $\hat{\mu}$ and $\hat{\Sigma}$.

In [2]:
import numpy as np
import pandas as pd

In [4]:
## read dataset
dataset = pd.read_csv('../numbers dataset/usps_images.csv')
dataset.head()

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_247,feature_248,feature_249,feature_250,feature_251,feature_252,feature_253,feature_254,feature_255,label
0,0,8,0,0,6,2,0,4,0,11,...,4,5,0,0,1,0,0,0,6,0
1,0,0,9,0,4,0,6,6,0,10,...,4,8,0,0,9,0,4,8,0,0
2,1,0,0,0,0,2,4,4,0,5,...,4,9,0,0,2,0,0,0,1,0
3,0,0,0,0,3,4,2,0,0,4,...,9,6,0,5,1,0,0,0,5,0
4,0,3,0,0,9,0,5,2,1,0,...,0,0,2,4,3,0,0,0,0,0


In [6]:
def divide_dataframe(dataset, validation_split, label_string):
    """
    Divide a pandas dataframe into two dataframes,
    splitting the rows of each label separately

    INPUTS:
    ---------
    dataset:  the pandas dataframe that is going to be split
    validation_split:  the fraction of each label's rows assigned to the validation set, example: 0.1
    label_string:  name of the label column (must be a string); for a multiclass classification, samples are taken from each label

    OUTPUTS:
    ---------
    train_set:  pandas dataframe, the partition of training set
    valid_set:  pandas dataframe, the partition of validation set  
    """

    assert 0 < validation_split < 1, "[ERROR] validation_split must be between 0 and 1!"
    columns = dataset.columns
    train_parts = []
    validation_parts = []

    ## split the rows of each label into a training and a validation part
    for label in dataset[label_string].unique():
        ## get the rows that belong to this label
        dataset_label = dataset[dataset[label_string] == label]
        ## the first (1 - validation_split) fraction of the rows goes to training
        length = int(len(dataset_label) * (1 - validation_split))

        train_parts.append(dataset_label.iloc[: length])
        validation_parts.append(dataset_label.iloc[length: ])

    ## DataFrame.append was removed in pandas 2.0, so concatenate the parts instead
    train_images = pd.concat(train_parts, ignore_index=True)
    validation_images = pd.concat(validation_parts, ignore_index=True)

    ## every column except the last one (the label) is a feature
    features = columns[:-1]

    ## convert the integer feature values to float for the later computations
    for col in features:
        train_images[col] = pd.to_numeric(train_images[col], downcast='float')
        validation_images[col] = pd.to_numeric(validation_images[col], downcast='float')

    return train_images, validation_images

df_train, df_test = divide_dataframe(dataset, 0.5, 'label')
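
The per-label split can also be written without an explicit loop. The sketch below is one possible alternative, not the function used in this notebook: `split_per_class` and the toy frame are hypothetical names and data introduced only for illustration. It keeps the first `train_fraction` of each label's rows for training and, unlike the loop version, leaves the rows in their original order.

```python
import pandas as pd

# Hypothetical toy frame: 3 labels with 4, 6 and 5 rows
toy = pd.DataFrame({
    "feature_0": range(15),
    "label": [0] * 4 + [1] * 6 + [2] * 5,
})

def split_per_class(df, train_fraction, label_col):
    """Put the first `train_fraction` of each label's rows in the training set."""
    # Position of each row within its own label group: 0, 1, 2, ...
    rank = df.groupby(label_col).cumcount()
    # Size of the label group that each row belongs to
    class_size = df.groupby(label_col)[label_col].transform("count")
    in_train = rank < (class_size * train_fraction).astype(int)
    return df[in_train], df[~in_train]

toy_train, toy_test = split_per_class(toy, 0.5, "label")
train_counts = toy_train["label"].value_counts().sort_index().to_dict()
test_counts = toy_test["label"].value_counts().sort_index().to_dict()
print(train_counts)
print(test_counts)
```

Counting rows per label in both parts is a cheap way to confirm that every class appears in the training and the test set.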

In [16]:
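def mu_hat(x_vectors):
    """
    find the mu hat from Maximum likelihood, using the equation 4

    INPUTS:
    -------
    x_vectors:  a pandas dataframe of features

    OUTPUT:
    ---------
    mean_vector:  pandas series, the mean of each feature over all rows
    """

    ## accumulate the row vectors (named total to avoid shadowing the built-in sum)
    total = 0
    for i in range(0, len(x_vectors)):
        total += x_vectors.iloc[i]
    
    ## divide the accumulated vector by the number of samples n
    mean_vector = (1 / len(x_vectors)) * total

    return mean_vector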


features = df_train.columns[df_train.columns != 'label']
train_mean = mu_hat(df_train[features])
test_mean = mu_hat(df_test[features])

print("TRAIN MEAN")
print(train_mean.head())

print("TEST MEAN")
print(test_mean.head())

TRAIN MEAN
feature_0     3.616399
feature_1     8.693048
feature_2    12.365063
feature_3    17.414261
feature_4    23.185026
Name: 0, dtype: float32
TEST MEAN
feature_0     3.530481
feature_1     8.618182
feature_2    13.042424
feature_3    17.378609
feature_4    20.939394
Name: 0, dtype: float32
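
The explicit loop in `mu_hat` mirrors equation 4, but pandas can compute the same quantity directly. The following sketch uses a small hypothetical frame in place of `df_train` to check that the loop and `DataFrame.mean` agree. It also shows one way to get a mean vector per label with `groupby`, on the assumption that the classifier will use a separate Gaussian for each class.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_train
toy = pd.DataFrame({
    "feature_0": [0.0, 2.0, 4.0, 10.0],
    "feature_1": [1.0, 3.0, 5.0, 7.0],
    "label": [0, 0, 1, 1],
})
feature_cols = list(toy.columns[toy.columns != "label"])

# Explicit form of equation 4: sum the row vectors, then divide by n
total = 0
for i in range(len(toy)):
    total = total + toy[feature_cols].iloc[i]
loop_mean = total / len(toy)

# Vectorized equivalent over all rows
vector_mean = toy[feature_cols].mean(axis=0)

# One mean vector per label (one row per class)
class_means = toy.groupby("label")[feature_cols].mean()

print(np.allclose(loop_mean.values, vector_mean.values))
print(class_means)
```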
