###### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2021 Semester 1

## Assignment 1: Pose classification with naive Bayes


**Student ID(s):**     910519


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied to the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [2]:
# Import dependencies
import numpy as np
from collections import defaultdict
import pandas as pd

In [18]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing
# Our preprocessing will return a package of all the useful subsets / formats of the data
def preprocess():
    
    # Open the training data
    with open("train.csv", "r") as f:
        data = f.readlines()
    
    # Create a 2 dimensional array to store the data.
    # We have: [class, x1 (head), x2 (shoulders), x3 (elbowR), x4 (wristR), x5 (elbowL)
    #          ,x6 (wristL), x7 (hips), x8 (kneeR), x9 (footR), x10 (kneeL),x11 (footL)
    #          ,y1 (head), y2 (shoulders), y3 (elbowR), y4 (wristR), y5 (elbowL), y6 (wristL)
    #          ,y7 (hips), y8 (kneeR), y9 (footR), y10 (kneeL), y11 (footL)]
    
    data_array = [line.strip().split(',') for line in data]
    
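    # examine_features returns a tuple (n_instances, class_dict); only the class frequencies are needed here
    _, class_dict = examine_features(data)
    # For now we will not clean the data, and assume it will work fine:)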
    
    # Create a dictionary that maps the class names to the entries that are in that class
    data_class_split = partition_data_classes(data_array, class_dict)
    
    # Within each class, find the mean and standard deviation for each attribute 
    class_summaries = find_gaussian_params(data_class_split)
    
    return {"Data": data_array, "Class frequencies": class_dict, "Partitioned data": data_class_split, "Class Summaries": class_summaries}


# Returns a dictionary that maps class labels to a list of (mean, standard deviation) corresponding to each attribute
def find_gaussian_params(partitioned_data):
    gaussian_dict = {}
    # for each class
    for _class in list(partitioned_data):
        # Take the entries for that class and group them by columns (attributes)
        attributes = group_by_column(partitioned_data[_class])
        gaussian_dict[_class] = []
        # For each attribute excluding the class type
        for attribute in attributes[1:]:
            # Calculate and store the attribute mean and std deviation
            np_array = np.asarray(attribute).astype(float)
            gaussian_dict[_class].append((np.mean(np_array), np.std(np_array)))
    return gaussian_dict
            
# Takes a list of instances and returns a 2d array where the ith entry is a list of all the values for that attribute
def group_by_column(data_array):
    new_array = []
    # For each of the attribute positions
    for i in range(len(data_array[0])):
        column_i = []
        # For each of the instances
        for j in data_array:
            # Append the ith element to the column space
            column_i.append(j[i])
        # Attach the column to our result array
        new_array.append(column_i)
    # Return the array now grouped by columns
    return new_array
            
# Returns a dictionary that maps class labels to entries with that label
def partition_data_classes(data, class_dict):
    result = {}
    for _class in list(class_dict):
        result[_class] = []
        for entry in data:
            if (entry[0] == _class):
                result[_class].append(entry)
    return result
    


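# Counts the total number of instances and the number of instances per class, prints both,
# and returns them as a tuple (n_instances, class_dict)
def examine_features(data):
    # Examine the data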
    n_instances = 0
    class_dict = defaultdict(int)
    for line in data:
        n_instances = n_instances + 1
        class_dict[line.strip().split(',')[0]] += 1
    
    print('Our total number of instances is:',n_instances, "\n")

    for lbl in class_dict.keys():
        print('For class', lbl , 'we have', class_dict[lbl], 'instances.')
    
    return (n_instances, class_dict)

data_processed = preprocess()

Our total number of instances is: 747 

For class bridge we have 81 instances.
For class childs we have 69 instances.
For class downwarddog we have 103 instances.
For class mountain we have 160 instances.
For class plank we have 57 instances.
For class seatedforwardbend we have 43 instances.
For class tree we have 67 instances.
For class trianglepose we have 59 instances.
For class warrior1 we have 54 instances.
For class warrior2 we have 54 instances.


In [None]:
# In order to implement a naive bayes we will need to summarise the data for each class
# This will include the mean and standard deviation of each attribute for each class

In [29]:
# This function should calculate prior probabilities and likelihoods from the training data and use
# them to build a naive Bayes model

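def train():
    # First we will fetch our set of useful data
    data = preprocess()
    # The priors for each class are P(class) = (number of instances in that class) / n
    num_instances = len(data["Data"])
    priors = {}
    for _class, entries in data["Partitioned data"].items():
        priors[_class] = len(entries) / num_instances
    # A naive Bayes learner will predict the class that maximises the posterior given the attributes seen
    # P(class|data) = (P(data|class) * P(class))/P(data)
    # The Gaussian likelihood parameters (mean, std) per class and attribute are in "Class Summaries"
    return priors, data["Class Summaries"]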

In [None]:
# This function should predict classes for new items in a test dataset (for the purposes of this assignment, you
# can re-use the training data as a test set)

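def predict(priors, class_summaries, test):
    # P(class|data) is proportional to P(X|class) * P(class); P(X) is the same for every class.
    # We work in log space so that multiplying many small densities does not underflow.
    # `test` is a list of instances laid out like the training data (e.g. preprocess()["Data"])
    predictions = []
    for instance in test:
        x = np.asarray(instance[1:]).astype(float)
        scores = {}
        for _class, summary in class_summaries.items():
            # Unpack the (mean, std) pairs into two arrays; assumes every std is non-zero
            mean, std = np.asarray(summary).T
            log_likelihood = -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)
            scores[_class] = np.log(priors[_class]) + np.sum(log_likelihood)
        predictions.append(max(scores, key=scores.get))
    return predictions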
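
For reference, the rule that a Gaussian naive Bayes prediction step evaluates can be written in log space, which avoids numerical underflow from multiplying many small densities. Here $d$ is the number of attributes, $\mu_{c,i}$ and $\sigma_{c,i}$ are the mean and standard deviation of attribute $i$ within class $c$, and $P(X)$ is dropped because it is the same for every class:

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{d} \left( -\tfrac{1}{2}\log\left(2\pi\sigma_{c,i}^{2}\right) - \frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}} \right) \right]
$$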

In [None]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels

def evaluate():
    return
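
A minimal sketch of one way the evaluation step could be written, assuming the predictions and the ground-truth labels arrive as two equal-length lists. The name `accuracy` and the toy labels are illustrative and not part of the template.

```python
def accuracy(predicted, actual):
    """Fraction of instances whose predicted class matches the ground truth."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot evaluate an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy labels for illustration only
toy_predicted = ["tree", "plank", "tree", "bridge"]
toy_actual = ["tree", "plank", "bridge", "bridge"]
toy_accuracy = accuracy(toy_predicted, toy_actual)  # 3 of the 4 labels match
```

Accuracy alone can hide poor performance on small classes, which is one reason per-class precision and recall are also worth reporting.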

## Questions 


If you are in a group of 1, you will respond to **two** questions of your choosing.

If you are in a group of 2, you will respond to **four** questions of your choosing.

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer should be submitted separately as a PDF.

### Q1
Since this is a multiclass classification problem, there are multiple ways to compute precision, recall, and F-score for this classifier. Implement at least two of the methods from the "Model Evaluation" lecture and discuss any differences between them. (The implementation should be your own and should not just call a pre-existing function.)
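
As a rough illustration of two averaging schemes, the sketch below computes macro-averaged and micro-averaged precision and recall from scratch. The function names and the toy labels are assumptions made for the example; it is one possible layout, not a prescribed solution.

```python
from collections import Counter


def per_class_counts(predicted, actual):
    """Count true positives, false positives and false negatives for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, a in zip(predicted, actual):
        if p == a:
            tp[a] += 1
        else:
            fp[p] += 1  # class p was predicted but is wrong
            fn[a] += 1  # an instance of class a was missed
    return tp, fp, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def macro_precision_recall(predicted, actual):
    """Average per-class precision and recall, weighting every class equally."""
    tp, fp, fn = per_class_counts(predicted, actual)
    classes = sorted(set(actual) | set(predicted))
    precisions = [safe_div(tp[c], tp[c] + fp[c]) for c in classes]
    recalls = [safe_div(tp[c], tp[c] + fn[c]) for c in classes]
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


def micro_precision_recall(predicted, actual):
    """Pool the counts over all classes before dividing."""
    tp, fp, fn = per_class_counts(predicted, actual)
    total_tp = sum(tp.values())
    precision = safe_div(total_tp, total_tp + sum(fp.values()))
    recall = safe_div(total_tp, total_tp + sum(fn.values()))
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return safe_div(2 * precision * recall, precision + recall)


# Toy labels: an imbalanced two-class example
toy_actual = ["mountain"] * 4 + ["tree"] * 2
toy_predicted = ["mountain"] * 4 + ["tree", "mountain"]
macro_p, macro_r = macro_precision_recall(toy_predicted, toy_actual)
micro_p, micro_r = micro_precision_recall(toy_predicted, toy_actual)
```

When every instance receives exactly one label, micro-averaged precision and recall both equal accuracy, whereas macro-averaging gives small classes the same weight as large ones.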

### Q2
The Gaussian naive Bayes classifier assumes that numeric attributes come from a Gaussian distribution. Is this assumption always true for the numeric attributes in this dataset? Identify some cases where the Gaussian assumption is violated and describe any evidence (or lack thereof) that this has some effect on the classifier’s predictions.
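
One simple way to look for departures from a Gaussian shape is to compute the skewness and excess kurtosis of an attribute within a class, since both are 0 for an exact Gaussian. The sketch below is a hedged starting point: the function name and the toy values are assumptions, and small samples will deviate from 0 by chance, so a histogram is a useful companion check.

```python
from statistics import fmean, pstdev


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        raise ValueError("all values are identical; the moments are undefined")
    n = len(values)
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0
    return skew, kurt


# Toy values for illustration only
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.0, 0.0, 0.0, 10.0]

sym_skew, sym_kurt = skewness_and_excess_kurtosis(symmetric)
skewed_skew, skewed_kurt = skewness_and_excess_kurtosis(right_skewed)
```

A large positive or negative skewness for a given class and attribute suggests that a single Gaussian is a poor fit for that attribute.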

### Q3
Implement a kernel density estimate (KDE) naive Bayes classifier and compare its performance to the Gaussian naive Bayes classifier. Recall that KDE has kernel bandwidth as a free parameter -- you can choose an arbitrary value for this, but a value in the range 5-25 is recommended. Discuss any differences you observe between the Gaussian and KDE naive Bayes classifiers. (As with the Gaussian naive Bayes, this KDE naive Bayes implementation should be your own and should not just call a pre-existing function.)
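
The sketch below shows one way a KDE likelihood with a Gaussian kernel could replace the single Gaussian per attribute. The function names, the default bandwidth of 10 and the toy data are assumptions for illustration; missing values are not handled.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density at x of a Gaussian with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))


def kde_likelihood(x, training_values, bandwidth):
    """Average of Gaussian kernels centred on each training value."""
    return sum(gaussian_pdf(x, v, bandwidth) for v in training_values) / len(training_values)


def kde_predict(instance, class_to_rows, bandwidth=10.0):
    """Return the class with the highest log prior plus summed log KDE likelihoods.

    class_to_rows maps each label to a list of attribute-value lists.
    """
    total = sum(len(rows) for rows in class_to_rows.values())
    best_label, best_score = None, -math.inf
    for label, rows in class_to_rows.items():
        score = math.log(len(rows) / total)
        for i, x in enumerate(instance):
            column = [row[i] for row in rows]
            # Floor the density so that log() never receives an underflowed zero
            score += math.log(max(kde_likelihood(x, column, bandwidth), 1e-300))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data: two classes with two attributes each
toy_train = {
    "tree": [[0.0, 100.0], [5.0, 110.0]],
    "plank": [[200.0, 0.0], [210.0, 10.0]],
}
pred_a = kde_predict([3.0, 105.0], toy_train)
pred_b = kde_predict([205.0, 4.0], toy_train)
```

Unlike the Gaussian model, the KDE likelihood keeps every training value, so it can follow multimodal attributes at the cost of more computation per prediction.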

### Q4
Instead of using an arbitrary kernel bandwidth for the KDE naive Bayes classifier, use random hold-out or cross-validation to choose the kernel bandwidth. Discuss how this changes the model performance compared to using an arbitrary kernel bandwidth.
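
A minimal sketch of bandwidth selection by random hold-out, assuming instances are stored as `(label, [attribute values])` pairs. The helper names, the 20% hold-out fraction, the fixed seed and the toy data are all assumptions for the example; cross-validation would average the same score over several splits.

```python
import math
import random


def kde_log_likelihood(x, values, bandwidth):
    """Log of the mean Gaussian kernel density at x, floored to avoid log(0)."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    kernel_sum = sum(math.exp(-((x - v) ** 2) / (2 * bandwidth ** 2)) for v in values)
    return math.log(max(kernel_sum / (len(values) * norm), 1e-300))


def kde_predict(instance, train_rows, bandwidth):
    """Predict a label from a list of (label, [attribute values]) training pairs."""
    by_class = {}
    for label, values in train_rows:
        by_class.setdefault(label, []).append(values)
    best_label, best_score = None, -math.inf
    for label, rows in by_class.items():
        score = math.log(len(rows) / len(train_rows))
        for i, x in enumerate(instance):
            score += kde_log_likelihood(x, [row[i] for row in rows], bandwidth)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def holdout_accuracy(rows, bandwidth, holdout_fraction=0.2, seed=0):
    """Shuffle with a fixed seed, train on one part and score on the held-out part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    correct = sum(kde_predict(values, train, bandwidth) == label for label, values in holdout)
    return correct / len(holdout)


def choose_bandwidth(rows, candidates):
    """Return the candidate with the best hold-out accuracy (the first one wins ties)."""
    return max(candidates, key=lambda b: holdout_accuracy(rows, b))


# Toy data: two well-separated classes, ten instances each
toy_rows = [("tree", [float(i), 100.0 + i]) for i in range(10)]
toy_rows += [("plank", [200.0 + i, float(i)]) for i in range(10)]
candidates = [5.0, 10.0, 15.0, 20.0, 25.0]
best_bandwidth = choose_bandwidth(toy_rows, candidates)
```

Using the same seed for every candidate means each bandwidth is scored on the same split, so differences in accuracy come from the bandwidth rather than from the partition.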

### Q5
Naive Bayes ignores missing values, but in pose recognition tasks the missing values can be informative. Missing values indicate that some part of the body was obscured and sometimes this is relevant to the pose (e.g., holding one hand behind the back). Are missing values useful for this task? Implement a method that incorporates information about missing values and demonstrate whether it changes the classification results.
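
One possible way to use missingness is to model, for each class, the probability that each attribute is missing, and to add the resulting Bernoulli log-likelihood to the class score. The sketch below assumes that missing keypoints are marked with a sentinel value; the `MISSING` constant, the function names and the toy rows are assumptions, so check the dataset description for the actual encoding.

```python
import math

# ASSUMPTION: sentinel that marks a missing keypoint; confirm against the dataset description
MISSING = 9999.0


def missing_rates(rows, n_attributes):
    """Laplace-smoothed P(attribute i is missing | class) from the rows of one class."""
    rates = []
    for i in range(n_attributes):
        n_missing = sum(1 for row in rows if row[i] == MISSING)
        rates.append((n_missing + 1) / (len(rows) + 2))
    return rates


def missingness_log_likelihood(instance, rates):
    """Sum of Bernoulli log-probabilities for the instance's missing/present pattern."""
    total = 0.0
    for value, rate in zip(instance, rates):
        total += math.log(rate if value == MISSING else 1.0 - rate)
    return total


# Toy rows for a single class: the second attribute is missing in two of three instances
toy_class_rows = [[1.0, MISSING], [2.0, MISSING], [3.0, 5.0]]
toy_rates = missing_rates(toy_class_rows, 2)
toy_log_lik = missingness_log_likelihood([1.5, MISSING], toy_rates)
```

This term would be added to each class's log prior and to the log-likelihoods of the observed attributes; comparing predictions with and without it shows whether missingness carries information.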

### Q6
Engineer your own pose features from the provided keypoints. Instead of using the (x,y) positions of keypoints, you might consider the angles of the limbs or body, or the distances between pairs of keypoints. How does a naive Bayes classifier based on your engineered features compare to the classifier using (x,y) values? Please note that we are interested in explainable features for pose recognition, so simply putting the (x,y) values in a neural network or similar to get an arbitrary embedding will not receive full credit for this question. You should be able to explain the rationale behind your proposed features. Also, don't forget the conditional independence assumption of naive Bayes when proposing new features -- a large set of highly-correlated features may not work well.
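
The helpers below sketch three explainable features that could be derived from (x, y) keypoints: the distance between two keypoints, the orientation of a limb, and the interior angle at a joint. The function names and the toy coordinates are assumptions for illustration, and missing keypoints are not handled.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y) keypoints."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def limb_angle(start, end):
    """Angle in degrees of the segment start->end, measured from the positive x-axis."""
    return math.degrees(math.atan2(end[1] - start[1], end[0] - start[0]))


def joint_angle(a, b, c):
    """Interior angle in degrees (0 to 180) at keypoint b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return abs(math.degrees(math.atan2(cross, dot)))


# Toy keypoints for illustration only
shoulder, elbow, wrist = (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)
elbow_bend = joint_angle(shoulder, elbow, wrist)            # right angle at the elbow
straight_arm = joint_angle((-1.0, 0.0), (0.0, 0.0), (1.0, 0.0))
forearm_length = distance((0.0, 0.0), (3.0, 4.0))
forearm_direction = limb_angle((0.0, 0.0), (1.0, 1.0))
```

Joint angles do not change when the figure is shifted or uniformly scaled in the image, which is one rationale for preferring them over raw positions; distances would need normalising (for example by torso length) to get the same property.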