###### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2023 Semester 1

## Assignment 1: Music genre classification with naive Bayes


**Student ID(s):**     1268256


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (Submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [6]:
# Import relevant packages
import numpy as np
import pandas as pd

In [117]:
# This function should prepare the data by reading it from a file and converting it into a useful format for training and testing

def preprocess(pathname):
    """
    We want to extract the dataframe containing all attribute value pairs, as well as a Series of corresponding labels
    """
    df = pd.read_csv(pathname)
    df = df.drop('filename',axis=1) # Don't need filename
    labels = df['label']
    df = df.drop('label', axis=1) # Don't want label in our attribute-value table
    return (df, labels)

In [118]:


def train(df, labels):
    """
    This function should calculate prior probabilities and conditional likelihoods from the training data
    Such that we have the necessary data for a Naive Bayes model
    """
    return (prior_prob(labels), conditional_likelihoods(df, labels))
    
def prior_prob(labels):
    """
    Auxiliary to calculate all label prior probabilities
    """
    priors = {}

    unique_labels, counts = np.unique(labels, return_counts=True)
    n = sum(counts)

    for i in range(len(unique_labels)):
        # Get the proportion this label occurs in the entire dataset
        # (kept unrounded: rounding could turn a rare class's prior into 0, making its log prior -inf)
        priors[unique_labels[i]] = counts[i] / n

    return priors

def conditional_likelihoods(df, labels):
    """
    Auxiliary to calculate the likelihood of each feature given a label
    """
    
    # preprocess() has already removed filename and label, so every column is a feature
    features = df.columns
    # Work on plain arrays so the lookup is positional and does not depend on the dataframe index
    label_values = np.asarray(labels)
    unique_labels = np.unique(label_values)

    # Get the approximated Normal distribution for the feature given the class
    distributions = {}
    for feature in features:
        feature_values = df[feature].to_numpy()
        distributions[feature] = {}
        for label in unique_labels:
            # Select the values of this feature that belong to items with this label
            class_values = feature_values[label_values == label]
            # Store a tuple of the mu and sigma for Pr(Feature Value|Class Label)
            distributions[feature][label] = (np.mean(class_values), np.std(class_values))

    return distributions

(train_df, train_labels) = preprocess("pop_vs_classical_train.csv")
(priors, distributions) = train(train_df, train_labels)

In [121]:
def predict(df, priors, distributions):
    """
    Predict the classes for new items in a test dataset
    """
    predictions = []
    for index, row in df.iterrows():
        # For each row, predict the log likelihood of each possible label given the data
        log_likelihoods = []
        for label in priors.keys():
            # Get the Bayes formula relative likelihood for this combo of label and row
            log_likelihood = np.log(priors[label])
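            for (attribute, value) in row.items():
                # Skip any column that is not a trained feature
                # (e.g. label, filename, or a prediction column left by an earlier call)
                if attribute not in distributions:
                    continue
                log_likelihood = log_likelihood + log_gaussian(value, distributions[attribute][label])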

            # Append a tuple containing the label, and the log probability calculation
            log_likelihoods.append((label, log_likelihood))

        # Extract the maximum likelihood, which dictates our prediction
        argmax = max(log_likelihoods, key=lambda x:x[1])[0]
        predictions.append(argmax)

    # Add the predictions as a new column (note: this modifies the dataframe passed in)
    df['prediction'] = predictions
    # Return the dataframe, which now holds one prediction per row
    return df

def log_gaussian(x, distribution):
    """
    Return the natural log of the Gaussian density with mean mu and sd sigma, for a realisation x
    Since we only compare relative probabilities across classes, the constant factor 1/sqrt(2pi) is dropped
    """
    mu = distribution[0]
    sigma = distribution[1]
    return np.log(1/sigma) - (((x-mu)**2) / (2*(sigma**2)))

(train_df, train_labels) = preprocess("pop_vs_classical_train.csv")
(test_df, test_labels) = preprocess("pop_vs_classical_test.csv")
(priors, distributions) = train(train_df, train_labels)
predict(test_df, priors, distributions)

Unnamed: 0,chroma_stft_mean,chroma_stft_var,rms_mean,rms_var,spectral_centroid_mean,spectral_centroid_var,spectral_bandwidth_mean,spectral_bandwidth_var,rolloff_mean,rolloff_var,...,mfcc16_var,mfcc17_mean,mfcc17_var,mfcc18_mean,mfcc18_var,mfcc19_mean,mfcc19_var,mfcc20_mean,mfcc20_var,prediction
0,0.254753,0.084223,0.034045,0.00046,1516.831118,80406.83,1629.756432,36217.914238,2974.121717,308251.1,...,143.199127,0.959206,142.132935,-0.760935,94.016602,-2.110459,122.134209,0.754724,104.192406,classical
1,0.21665,0.084129,0.011433,8.8e-05,1371.280858,111557.8,1562.114726,63117.191939,2619.689856,613284.4,...,163.772018,7.16144,138.801865,3.840835,224.231369,2.599433,291.526215,4.933314,268.553925,classical
2,0.256378,0.086092,0.037363,0.001031,1358.547002,81305.17,1417.623243,81420.079572,2416.898043,488686.4,...,89.044304,-0.097337,100.365051,1.501247,77.920609,0.380252,116.987503,-3.007311,138.785706,classical
3,0.239004,0.084633,0.018697,0.000321,1157.916744,187689.6,1320.686233,148728.574175,2181.923001,928671.0,...,76.453407,1.527259,69.52813,-2.264838,86.554382,-1.436002,105.99485,2.624205,163.5858,classical
4,0.262914,0.084129,0.062621,0.000654,1314.282125,142564.5,1371.398438,91861.781251,2398.554019,581328.3,...,65.114998,-1.946552,59.13364,-2.513381,61.42952,1.562973,76.378067,-1.472089,83.74543,classical
5,0.252292,0.085997,0.017926,4.5e-05,1181.340097,114510.1,1411.881657,96394.160163,2032.722717,568182.1,...,61.707802,5.031433,110.964256,-0.75756,108.318176,-0.139755,137.757034,-2.850241,226.437332,classical
6,0.346803,0.078617,0.115099,0.000357,1878.382005,85699.96,1964.688409,53244.699258,3633.324176,469249.9,...,29.995607,-3.394369,35.613697,1.692993,24.515646,-3.680548,66.970215,-3.433261,59.929924,classical
7,0.296374,0.081929,0.160857,0.003166,1513.870637,548932.6,1640.492674,267135.469308,2766.775682,2499264.0,...,57.92823,-4.695436,46.060249,-2.247865,53.625267,-5.165849,55.770363,-3.224921,84.38813,classical
8,0.286093,0.084303,0.007976,1.7e-05,1170.028076,48596.25,1595.365294,52758.970717,2360.084027,237637.4,...,38.962044,-2.287446,47.514862,-4.10984,43.306095,-1.811351,30.788736,1.229756,42.02021,classical
9,0.265606,0.087774,0.030006,0.000123,1137.99553,33773.01,1480.566388,63469.687882,1821.312627,220290.5,...,84.324097,0.114769,94.082024,4.251423,142.758514,5.07931,81.052422,2.604298,137.285767,classical


In [73]:
# This function should evaluate the prediction performance by comparing your model’s class outputs to ground
# truth labels

def evaluate(df, labels):
    """
    Return the accuracy: the proportion of rows whose prediction matches the ground truth label.
    preprocess() removes the label column from the dataframe, so the labels are passed in separately.
    """
    return np.mean(np.asarray(df['prediction']) == np.asarray(labels))

evaluate(test_df, test_labels)

0.975609756097561

## Task 1. Pop vs. classical music classification

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer must be submitted separately as a PDF.

### Q1
Compute and report the accuracy, precision, and recall of your model (treat "classical" as the "positive" class).
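
One way to compute these three metrics is sketched below. The helper name `binary_metrics` and the toy label lists are illustrative assumptions, not part of the template. Precision and recall are reported as 0.0 when their denominator is zero, which is a convention chosen here rather than something the assignment specifies.

```python
def binary_metrics(true_labels, predicted_labels, positive="classical"):
    """Return (accuracy, precision, recall), treating `positive` as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    # Count the cells of the confusion matrix
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall


# Toy example with made-up labels (not results from the assignment data)
truth = ["classical", "classical", "pop", "pop"]
preds = ["classical", "pop", "pop", "classical"]
accuracy, precision, recall = binary_metrics(truth, preds)
```

With the notebook's variables, this could be called as `binary_metrics(test_labels, test_df['prediction'])` once `predict` has added the prediction column.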

### Q2
For each of the features X below, plot the probability density functions P(X|Class = pop) and P(X|Class = classical). If you had to classify pop vs. classical music using just one of these three features, which feature would you use and why? Refer to your plots to support your answer.
- spectral centroid mean
- harmony mean
- tempo
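
The class-conditional densities can be drawn from the fitted (mu, sigma) pairs. The sketch below assumes the `distributions[feature][label] = (mu, sigma)` layout produced by `conditional_likelihoods`. The helper names `gaussian_pdf`, `density_curve` and `plot_class_densities` are hypothetical, and matplotlib is an assumed dependency that is imported only when a plot is requested.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def density_curve(mu, sigma, num_points=200, width=4.0):
    """Evaluate the pdf on an evenly spaced grid covering mu +/- width * sigma."""
    lo, hi = mu - width * sigma, mu + width * sigma
    step = (hi - lo) / (num_points - 1)
    xs = [lo + i * step for i in range(num_points)]
    ys = [gaussian_pdf(x, mu, sigma) for x in xs]
    return xs, ys


def plot_class_densities(feature, distributions):
    """Plot P(feature | class) for every class on one set of axes."""
    import matplotlib.pyplot as plt  # assumed to be installed; imported lazily

    for label, (mu, sigma) in distributions[feature].items():
        xs, ys = density_curve(mu, sigma)
        plt.plot(xs, ys, label=f"Class = {label}")
    plt.xlabel(feature)
    plt.ylabel("probability density")
    plt.legend()
    plt.show()


# Standard normal curve as a quick sanity check of the helpers
xs, ys = density_curve(mu=0.0, sigma=1.0)
```

For example, `plot_class_densities("spectral_centroid_mean", distributions)` would overlay the pop and classical curves for that feature. The less the two curves overlap, the more useful the feature is on its own.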

## Task 2. 10-way music genre classification

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer must be submitted separately as a PDF.

### Q3
Compare the performance of the full model to a 0R baseline and a one-attribute baseline. The one-attribute baseline should be the best possible naive Bayes model which uses only a prior and a single attribute. In your write-up, explain how you implemented the 0R and one-attribute baselines.
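
A 0R baseline ignores all attributes and always predicts the most frequent training class. The sketch below shows one possible implementation. The function names `train_zero_r` and `predict_zero_r` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def train_zero_r(train_labels):
    """0R training: remember only the most frequent class label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_zero_r(majority_label, num_items):
    """0R prediction: assign the majority label to every test item."""
    return [majority_label] * num_items


# Toy example with made-up labels (not the assignment data)
toy_train = ["pop", "rock", "pop", "jazz"]
toy_test = ["rock", "pop", "pop"]
majority = train_zero_r(toy_train)
preds = predict_zero_r(majority, len(toy_test))
accuracy = sum(p == t for p, t in zip(preds, toy_test)) / len(toy_test)
```

A one-attribute baseline could plausibly reuse the notebook's `train` and `predict` on a single-column dataframe such as `train_df[[feature]]`, repeated for each feature to find the best one.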

### Q4
Train and test your model with a range of training set sizes by setting up your own train/test splits. With each split, use cross-fold validation so you can report the performance on the entire dataset (1000 items). You may use built-in functions to set up cross-validation splits. In your write-up, evaluate how model performance changes with training set size.
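
The question allows built-in splitters, but the logic is simple enough to sketch by hand. The code below shuffles the item indices once and deals them into k folds, so that every item is tested exactly once. The function names and the fixed seed are illustrative assumptions.

```python
import random


def k_fold_indices(num_items, k, seed=0):
    """Shuffle 0..num_items-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(num_items, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = k_fold_indices(num_items, k, seed)
    for i, test_fold in enumerate(folds):
        # Everything outside the held-out fold is used for training
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


# 5 folds over 1000 items: each split trains on 800 items and tests on 200
splits = list(train_test_splits(num_items=1000, k=5))
```

Changing k changes the training set size: a fraction (k-1)/k of the data is used for training in each split. The index lists can be applied to a dataframe with `df.iloc[train]`. Calling `reset_index(drop=True)` on the result keeps positional and label-based indexing in agreement.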

### Q5
Implement a kernel density estimate (KDE) naive Bayes model and compare its performance to your Gaussian naive Bayes model. You may use built-in functions and automatic ("rule of thumb") bandwidth selectors to compute the KDE probabilities, but you should implement the naive Bayes logic yourself. You should give the parameters of the KDE implementation (namely, what bandwidth(s) you used and how they were chosen) in your write-up.
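
The sketch below shows the naive Bayes logic with a Gaussian-kernel density estimate in place of the fitted Normal. It computes the KDE by hand to stay self-contained, and it uses the normal reference rule of thumb, h = 1.06 * sd * n^(-1/5), as the bandwidth. That choice, the function names and the toy data are all assumptions, and a built-in KDE could be swapped in for `kde_log_density`.

```python
import math
import statistics


def rule_of_thumb_bandwidth(values):
    """Normal reference rule: h = 1.06 * sample sd * n^(-1/5)."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)


def kde_log_density(x, values, bandwidth):
    """Log of the Gaussian-kernel density estimate at x."""
    total = sum(math.exp(-((x - v) ** 2) / (2 * bandwidth ** 2)) for v in values)
    density = total / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    # Guard against underflow to zero far away from all training points
    return math.log(density) if density > 0 else float("-inf")


def kde_nb_predict(row, priors, train_values, bandwidths):
    """row: {feature: value}; train_values[feature][label]: list of training values."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for feature, x in row.items():
            score += kde_log_density(x, train_values[feature][label], bandwidths[feature][label])
        scores[label] = score
    return max(scores, key=scores.get)


# Toy example with one made-up feature "f" and two made-up classes
train_values = {"f": {"a": [0.0, 0.2, 0.4], "b": [5.0, 5.1, 5.3]}}
priors = {"a": 0.5, "b": 0.5}
bandwidths = {
    feature: {label: rule_of_thumb_bandwidth(vals) for label, vals in per_class.items()}
    for feature, per_class in train_values.items()
}
prediction = kde_nb_predict({"f": 0.1}, priors, train_values, bandwidths)
```

Here a separate bandwidth is chosen for every feature and class pair. Using one bandwidth per feature is an equally defensible design and should be stated in the write-up either way.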

### Q6
Modify your naive Bayes model to handle missing attributes in the test data. Recall from lecture that you can handle missing attributes at test by skipping the missing attributes and computing the posterior probability from the non-missing attributes. Randomly delete some attributes from the provided test set to test how robust your model is to missing data. In your write-up, evaluate how your model's performance changes as the amount of missing data increases.
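
Skipping a missing attribute means leaving its term out of the log-likelihood sum. The sketch below assumes missing values are encoded as NaN (or `None`) and that `distributions[feature][label]` holds a (mu, sigma) pair, as in the notebook. The function names `predict_with_missing` and `delete_at_random` and the toy parameters are hypothetical.

```python
import math
import random


def log_gaussian(x, mu, sigma):
    """Log Gaussian density without the constant 1/sqrt(2*pi) factor."""
    return -math.log(sigma) - ((x - mu) ** 2) / (2 * sigma ** 2)


def is_missing(x):
    return x is None or (isinstance(x, float) and math.isnan(x))


def predict_with_missing(row, priors, distributions):
    """Predict a label for row ({feature: value}), skipping missing attributes."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for feature, x in row.items():
            if is_missing(x):
                continue  # a missing attribute contributes nothing to the score
            mu, sigma = distributions[feature][label]
            score += log_gaussian(x, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)


def delete_at_random(row, fraction, rng):
    """Return a copy of row where each value is replaced by NaN with probability `fraction`."""
    return {f: (float("nan") if rng.random() < fraction else x) for f, x in row.items()}


# Toy parameters (made up, not fitted to the assignment data)
priors = {"pop": 0.5, "classical": 0.5}
distributions = {
    "f1": {"pop": (10.0, 1.0), "classical": (0.0, 1.0)},
    "f2": {"pop": (0.0, 1.0), "classical": (5.0, 1.0)},
}
prediction = predict_with_missing({"f1": 9.5, "f2": float("nan")}, priors, distributions)
```

If every attribute in a row is missing, the score reduces to the log prior, so the prediction falls back to the most probable class.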