###### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2024 Semester 1

## Assignment 1: Wine quality classification with K-NN


**Student ID(s):**     `1356034`


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied to the four functions defined in this notebook and to your responses to the questions at the end of this notebook (submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [13]:
# MAKE SURE TO RUN FIRST: IMPORT ALL LIBRARIES REQUIRED FOR ALL FUNCTIONS IN THIS NOTEBOOK
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import random
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

train_orig = "winequality-train.csv"
test_orig = "winequality-test.csv"

## 1. K-NN classifier

In [14]:
# THIS FUNCTION RUNS K-NN CLASSIFICATION - CALL classifier TO CLASSIFY A GIVEN TRAIN-TEST DATASET
# Takes a training filepath, a test filepath, a value for k as well as a filepath for classification results output
# Outputs its predictions in the "predictedQuality" column in the output file
def classifier(train_path: str, test_path: str, k: int, out_file: str):
    # read each csv file into a pandas dataframe
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    
    # prohibit k values which are outside of the specified bounds
    if k < 1 or k > train.shape[0]:
        raise ValueError(f"k must be between 1 and {train.shape[0]} inclusive")

    test['predictedQuality'] = test.apply(lambda row: predict_label(row, train, k), axis = 1)
    test.to_csv(out_file, index = False)
    print(f"Classification finished - check output file {out_file} for results")
    

def predict_label(test_row, train, k):
    # calculate the euclidean distance between the current test row and all of the training rows
    train = train.drop(['euclideanDistance'], axis = 1, errors = 'ignore')
    train['euclideanDistance'] = train.apply(lambda train_row: calculate_distance(train_row, test_row), axis = 1)
    
    return break_ties(train, k)
    
    
def break_ties(df, k):
    # find the k smallest distances including tied values
    k_items = df.nsmallest(k, 'euclideanDistance', keep = 'all')
    # get the counts for how many times each label appears
    counts_series = k_items['quality'].value_counts()
    # create a set of these counts for easy comparison later
    value_set = set(counts_series)
        
    # if we have a tie between counts for the two labels
    if (len(value_set) == 1 and max(value_set) != k_items.shape[0]):
        
        if (k_items.shape[0] != k or k == 1):
            # tie due to two points being at the same distance
            # choose a random label and return
            return random.choice(list(counts_series.index))
        else:
            # tie not due to two points being at the same distance
            # re-run the function again with 1-nn
            return break_ties(k_items, 1)
        
    else:
        # no ties are occurring
        # return the value of the label that appears most frequently
        return counts_series.idxmax()
    

def calculate_distance(train, test):
    distance = 0
    
    # find the euclidean distance between all of the features (excluding quality)
    for i in range(0, len(test) - 1):
        distance += ((train.iloc[i] - test.iloc[i]) ** 2)

    # no need to square root the distance as we are only worried about the relative magnitude to other distances
    # saves computation time
    return distance
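
The comment above notes that the square root can be skipped because only the relative ordering of distances matters. A minimal sketch of that idea, assuming the features are held in a NumPy array rather than iterated row by row (the function name `squared_distances` and the toy data are hypothetical, not part of the assignment template):

```python
import numpy as np

def squared_distances(train_X, test_x):
    """Squared Euclidean distance from one test vector to every training row."""
    diff = train_X - test_x          # broadcasting subtracts test_x from each row
    return np.sum(diff ** 2, axis=1)

# Toy data for illustration only
train_X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
test_x = np.array([0.0, 0.0])

sq = squared_distances(train_X, test_x)

# The square root is monotonic, so the neighbour ranking is unchanged
same_order = np.array_equal(np.argsort(sq), np.argsort(np.sqrt(sq)))
print(sq, same_order)
```

Because the ranking is identical, the nearest neighbours selected from squared distances are the same as those selected from true Euclidean distances.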

In [28]:
# THE FUNCTION CALL BELOW RUNS THE ABOVE CLASSIFICATION FUNCTION
# RUNS ON TRAIN_ORIG AND TEST_ORIG WITH K=1 AND DIRECTS OUTPUT TO output_task1.csv

classifier(train_orig, test_orig, 1, "output_task1.csv")

Classification finished - check output file output_task1.csv for results


## 2. 1-NN classification

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer must be submitted separately as a PDF.

In [15]:
# THIS FUNCTION CALCULATES ACCURACY OF A CLASSIFIED TEST FILE
# Takes a results filepath with columns 'quality' and 'predictedQuality' as input
def calculate_accuracy(in_file: str):
    # calculate the accuracy of the test file using SKLearn
    results = pd.read_csv(in_file)
    actual = results['quality'].values
    
    predicted = results['predictedQuality'].values
    return accuracy_score(actual, predicted)

# THIS FUNCTION RETURNS THE CLASS DISTRIBUTION IN A GIVEN DATA FILE
# Takes a filepath as input
def calculate_class_distribution(in_file: str):
    # return the number of instances which carry each class label
    data = pd.read_csv(in_file)
    counts_series = data['quality'].value_counts()
    return counts_series

# THIS FUNCTION PLOTS SCATTER PLOTS FOR TWO GIVEN ATTRIBUTES AND A GIVEN TRAINING FILE
# ALSO HAS THE OPTION TO PLOT A RANDOM SUBSET AND TO ADD A TRENDLINE
# Takes a training file, two attribute names and boolean values to add a trendline or to plot a subset of the data as input
def plot_graphs(data_file: str, att1: str, att2: str, trendline: bool, subset: bool):
    df = pd.read_csv(data_file)

    # create a scatter plot between the two chosen attributes
    # colour the plots according to their quality label
    if subset:
        # plot a subset of the dataframe - randomly selects and plots 1/4 of the data
        sample_df = df.sample(len(df)//4)
        sns.scatterplot(x = sample_df[att1], y = sample_df[att2], alpha=0.4, hue = sample_df['quality'])
    else:
        sns.scatterplot(x = df[att1], y = df[att2], alpha=0.4, hue = df['quality'])
        
    plt.xlabel(att1)
    plt.ylabel(att2)
    plt.title(f'{att1} vs {att2} {data_file}')

    # add a trendline to the data to show dependence between two attributes
    if trendline:
        z = np.polyfit(df[att1], df[att2], 1)
        p = np.poly1d(z)
        plt.plot(df[att1], p(df[att1]), "r-")
    
    plt.savefig(f'Scatter {att1} vs {att2} {data_file}.png')
    plt.close("all")

In [21]:
# THE BELOW FUNCTION CALLS RUN 1-NN ON THE ORIGINAL DATA AND THEN CALCULATE ACCURACY
results_1nn = "results_1nn.csv"
classifier(train_orig, test_orig, 1, results_1nn)
accuracy = calculate_accuracy(results_1nn)
print(f"Accuracy score of {accuracy} without scaling\n")

Accuracy score of 0.7644444444444445 without scaling



In [17]:
# THE BELOW FUNCTIONS PRINT THE CLASS DISTRIBUTION IN THE TRAINING FILE AS WELL AS CREATE SCATTER PLOTS FOR TWO GIVEN ATTRIBUTES
class_distribution = calculate_class_distribution(train_orig)
print(f"Class distributions for TRAIN are: {class_distribution}\n")

plot_graphs(train_orig, 'totalSulfurDioxide', 'citricAcid', False, True)
plot_graphs(train_orig, 'chlorides', 'alcohol', False, True)

Class distributions for TRAIN are: quality
0    820
1    530
Name: count, dtype: int64



## 3. Normalization

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer must be submitted separately as a PDF.

In [18]:
# THIS FUNCTION NORMALISES A TRAINING AND TEST FILE ACCORDING TO THE SPECIFIED NORMALISATION METHOD
# Takes a training and test file, an output training and test path, as well as a boolean (true: minmax, false: std) as input
def normalise_values(train_path: str, test_path: str, train_out: str, test_out: str, minmax: bool):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    
    if minmax:
        # scale all columns except for quality
        for column in train.columns[:-1]:
            # find the min and max value for each attribute using the training data
            # (named col_min/col_max so the built-in min() and max() are not shadowed)
            col_min = train[column].min()
            col_max = train[column].max()

            # use the min max scale formula to scale each column in both training and test datasets
            train[column] = calculate_minmax(col_min, col_max, train[column])
            test[column] = calculate_minmax(col_min, col_max, test[column])

    else:
        # scale all columns except for quality
        for column in train.columns[:-1]:
            # find the mean and (sample) std dev for each attribute using the training data
            mean = train[column].mean()
            stddev = train[column].std()

            # use the standardisation formula to scale each column in both training and test datasets
            train[column] = calculate_std(mean, stddev, train[column])
            test[column] = calculate_std(mean, stddev, test[column])
            
    train.to_csv(train_out, index = False)
    test.to_csv(test_out, index = False)


# the two functions below implement the min max and standardisation formulas respectively when called
def calculate_minmax(min_val, max_val, val):
    return (val - min_val) / (max_val - min_val)


def calculate_std(mean, stddev, val):
    return (val-mean)/stddev

In [19]:
# output file names for below functions
train_minmax = "train_minmax.csv"
test_minmax = "test_minmax.csv"
train_std = "train_std.csv"
test_std = "test_std.csv"
results_minmax = "results_minmax.csv"
results_std = "results_std.csv"

# normalise the dataset using both methods
normalise_values(train_orig, test_orig, train_minmax, test_minmax, True)
normalise_values(train_orig, test_orig, train_std, test_std, False)

# classify the minmax data and find accuracy
classifier(train_minmax, test_minmax, 1, results_minmax)
minmax_accuracy = calculate_accuracy(results_minmax)
print(f"Accuracy score of {minmax_accuracy} for min max scaling\n")

# classify the standardised data and find accuracy
classifier(train_std, test_std, 1, results_std)
std_accuracy = calculate_accuracy(results_std)
print(f"Accuracy score of {std_accuracy} for standardisation scaling\n")

Classification finished - check output file results_minmax.csv for results
Accuracy score of 0.8503703703703703 for min max scaling

Classification finished - check output file results_std.csv for results
Accuracy score of 0.8674074074074074 for standardisation scaling



In [20]:
# plot graphs for totalSulfurDioxide vs freeSulfurDioxide for the three different train files
plot_graphs(train_orig, 'totalSulfurDioxide', 'freeSulfurDioxide', False, False)
plot_graphs(train_minmax, 'totalSulfurDioxide', 'freeSulfurDioxide', False, False)
plot_graphs(train_std, 'totalSulfurDioxide', 'freeSulfurDioxide', False, False)

# plot graphs for citricAcid vs pH for the three different train files
plot_graphs(train_orig, 'citricAcid', 'pH', False, False)
plot_graphs(train_minmax, 'citricAcid', 'pH', False, False)
plot_graphs(train_std, 'citricAcid', 'pH', False, False)

## 4. Model extensions

#### NOTE: you may develop code or functions to help respond to the question here, but your formal answer must be submitted separately as a PDF.

### 4.1
Compare the performance of your best 1-NN model from Question 3 to a Gaussian naive Bayes model on this dataset (you may use library functions to implement the Gaussian naive Bayes model). In your write-up, state the accuracy of the naive Bayes model and identify instances where the two models disagree. Why do the two models classify these instances differently?

In [22]:
# THIS FUNCTION CREATES A GNB CLASSIFIER FOR A GIVEN TRAINING DATASET AND CALCULATES ITS ACCURACY ON A TEST DATASET
# Takes a training and test dataset file as well as a results output filepath as input
def gnb_classifier(train_in, test_in, file_out):
    train = pd.read_csv(train_in)
    test = pd.read_csv(test_in)

    # create a GNB classifier and fit it on the training data
    gnb = GaussianNB()
    gnb.fit(train.iloc[:,:-1], train.iloc[:,-1])

    # test the model's accuracy by predicting results on the test data
    accuracy = gnb.score(test.iloc[:,:-1], test.iloc[:,-1])
    print(f"Accuracy score of {accuracy} for GNB model")

    # output the posterior probabilities that GNB computed for each instance
    prob_table = gnb.predict_proba(test.iloc[:,:-1])
    test['predictedQuality'] = gnb.predict(test.iloc[:,:-1])
    test['0prob'] = prob_table[:,0]
    test['1prob'] = prob_table[:,1] 
    
    test.to_csv(file_out, index = False)

# THIS FUNCTION PLOTS HISTOGRAMS FOR A GIVEN ATTRIBUTE IN A TRAINING DATASET SEPARATED BY CLASS LABEL
# Takes a training filepath and the name of an attribute to make histograms for
def make_histograms(train_in, att):
    # create two separate datatables for instances with 0 or 1 quality
    train = pd.read_csv(train_in)
    train_0 = train[train['quality'] == 0]
    train_1 = train[train['quality'] == 1]

    # create a histogram for the instances with 0 quality
    train_0.hist(att, bins = 20)
    plt.xlabel(f'{att} Value')
    plt.ylabel('Number of Occurrences')
    plt.title(f'Histogram for {att} Label 0')
    plt.savefig(f'{att} Label0 Histogram.png')
    plt.close("all")

    # create a histogram for the instances with 1 quality
    train_1.hist(att, bins = 20)
    plt.xlabel(f'{att} Value')
    plt.ylabel('Number of Occurrences')
    plt.title(f'Histogram for {att} Label 1')
    plt.savefig(f'{att} Label1 Histogram.png')
    plt.close("all")

In [23]:
# run GNB classification on the standardised data
gnb_file = "results_gnb.csv"
gnb_classifier(train_std, test_std, gnb_file)

# plot histograms and graphs to demonstrate reasons for lower performance
plot_graphs(train_std, 'totalSulfurDioxide', 'freeSulfurDioxide', True, False)
make_histograms(train_std, 'pH')

Accuracy score of 0.774074074074074 for GNB model


### 4.2
Implement two additional distance measures for your K-NN model: cosine similarity and Mahalanobis distance (you may use library functions for these distance measures). Do 1-NN classification using each of these new distance measures and the three normalization options from Question 3. Discuss how the new distance metrics compare to Euclidean distance and how each metric is affected by normalization.
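
This notebook does not implement the two distance measures named above. The following is only a sketch of how they might be computed for 1-NN with NumPy. The function names, the toy data and the use of a pseudo-inverse of the training covariance matrix are all assumptions made for illustration, and cosine similarity is converted to a distance as `1 - similarity` so that smaller still means closer.

```python
import numpy as np

def cosine_distances(train_X, test_x):
    """Cosine distance (1 - cosine similarity) from test_x to every training row."""
    train_norms = np.linalg.norm(train_X, axis=1)
    test_norm = np.linalg.norm(test_x)
    similarity = (train_X @ test_x) / (train_norms * test_norm)
    return 1.0 - similarity

def mahalanobis_distances(train_X, test_x, inv_cov):
    """Mahalanobis distance from test_x to every training row."""
    diff = train_X - test_x
    # for each row: diff^T * inv_cov * diff
    sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    # clip tiny negative values caused by floating point rounding
    return np.sqrt(np.maximum(sq, 0.0))

def nearest_label(distances, labels):
    """1-NN prediction: label of the training row with the smallest distance."""
    return labels[np.argmin(distances)]

# Toy data for illustration only (not the wine dataset)
train_X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
train_y = np.array([0, 1, 1, 0])
test_x = np.array([2.0, 0.1])

# Covariance is estimated from the training features only
inv_cov = np.linalg.pinv(np.cov(train_X, rowvar=False))

cos_pred = nearest_label(cosine_distances(train_X, test_x), train_y)
maha_pred = nearest_label(mahalanobis_distances(train_X, test_x, inv_cov), train_y)
print(cos_pred, maha_pred)
```

With an identity matrix in place of `inv_cov`, the Mahalanobis distance reduces to ordinary Euclidean distance, which is a convenient sanity check. Cosine distance ignores vector length, so rescaling a whole row leaves it unchanged, whereas shifting features (as standardisation does) can change it.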

### 4.3
Implement either of the two K-NN weighting strategies discussed in lecture (inverse linear distance or inverse distance). Compare the performance of the weighted and majority vote models for a few different values of K. In your write-up, discuss how weighting strategy and the value of K affect the model's decisions.
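
The notebook contains no weighted K-NN code, so the sketch below is only one possible reading of inverse distance weighting, where each neighbour votes with weight `1 / (d + eps)`. The function names, the `eps` constant and the toy distances are hypothetical. Note that `calculate_distance` above returns squared distances, so a square root would be needed before weighting if true Euclidean distances are intended.

```python
from collections import Counter, defaultdict

def k_nearest(distances, labels, k):
    """Return (distance, label) pairs for the k smallest distances."""
    return sorted(zip(distances, labels))[:k]

def majority_vote(distances, labels, k):
    """Unweighted vote: the most frequent label among the k nearest neighbours."""
    counts = Counter(label for _, label in k_nearest(distances, labels, k))
    return counts.most_common(1)[0][0]

def inverse_distance_vote(distances, labels, k, eps=1e-9):
    """Weighted vote: each neighbour contributes 1 / (distance + eps) to its label."""
    totals = defaultdict(float)
    for dist, label in k_nearest(distances, labels, k):
        totals[label] += 1.0 / (dist + eps)
    return max(totals, key=totals.get)

# Toy distances and labels for illustration only
distances = [0.1, 2.0, 2.5, 3.0]
labels = [1, 0, 0, 1]

majority_pred = majority_vote(distances, labels, 3)
weighted_pred = inverse_distance_vote(distances, labels, 3)
print(majority_pred, weighted_pred)
```

In this toy case the two strategies disagree at K=3: two distant neighbours outvote one very close neighbour under majority voting, while the close neighbour dominates once votes are weighted. This sketch does not reproduce the tie-breaking rules of `break_ties`.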

### 4.4
Measure the empirical distribution of class labels in the training dataset (what percentage of the training data comes from each class). Then evaluate the distribution of labels predicted by your K-NN model for the test data, for a range of values for K. Does the class distribution of the predicted labels match the class distribution of the training data? Explain why or why not.
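
As a rough sketch of the comparison described above, the helper below turns a list of labels into class proportions. The training counts (820 of class 0, 530 of class 1) are those printed earlier in this notebook. The predicted labels are hypothetical placeholders, not real model output, and the name `label_distribution` is an assumption.

```python
from collections import Counter

def label_distribution(labels):
    """Return the fraction of instances carrying each label, keyed by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# Training counts as printed earlier in this notebook
train_dist = label_distribution([0] * 820 + [1] * 530)

# Hypothetical predictions, for illustration only
predicted_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
pred_dist = label_distribution(predicted_labels)

# Report the per-class gap between predicted and training proportions
for label in train_dist:
    predicted = pred_dist.get(label, 0.0)
    gap = predicted - train_dist[label]
    print(f"class {label}: train {train_dist[label]:.3f}, predicted {predicted:.3f}, gap {gap:+.3f}")
```

In practice `predicted_labels` would come from the `predictedQuality` column of a results file, with the comparison repeated for each value of K.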