David Giacobbi, CPSC 322, Fall 2023, Notebook for HW-5

# 1. Load libraries and datasets

 Import the data table and utility functions.

In [1]:
from data_table import *
from data_learn import *
from data_eval import *
from data_util import *

Load and clean auto data.

In [2]:
auto = DataTable(['mpg','cyls','disp','hp','weight','accl','year','origin','name'])
auto.load('auto-mpg.txt')

auto = remove_duplicates(auto)
auto = remove_missing(auto, auto.columns())

# 2. Exploring k-NN

1. Discretize the mpg values in the auto table using three equal-width bins
2. Normalize all of the columns except for model and origin
3. Create a train and test set using holdout with approximately half of the rows in the test set

In [3]:
# Discretize mpg values with three equal-width bins
discretize(auto, 'mpg', [20,30])

# Normalize all the columns
norm_cols = ['cyls','disp','hp','weight','accl']
for column in norm_cols:
    normalize(auto, column)

# Create test and train set with holdout
test_set_size = int(auto.row_count() / 2)
train_set, test_set = holdout(auto, test_set_size)

4. Run knn over the train and test set to predict mpg class labels, using majority voting, k=5, the numerical columns cylinders, weight, and acceleration, and no nominal attributes. 
5. Print the resulting confusion matrix
6. Calculate and print the (average) accuracy across mpg labels, and the macro average f-measure. 

In [4]:
confusion_matrix = knn_eval(train_set, test_set, majority_vote, 5, 'mpg', ['cyls','weight','accl'])
print(confusion_matrix)


# Create a list of accuracies for each mpg label
acc = []
for label in distinct_values(auto, ['mpg']):
    acc.append(accuracy(confusion_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))


# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(auto, ['mpg']):
    cur_recall = recall(confusion_matrix, label)
    cur_precision = recall(confusion_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))

print("Macro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    3
--------  ---  ---  ---
       1   72    1    0
       2   16   37    6
       3    0   12    9

Average Accuracy: 0.8474945533769063
Macro Average f-measure: 0.680663814167413


7. Run steps 1-6 3 times (i.e., put 1-6 in a for loop that iterates three times).

In [5]:
# Run 3 iterations of knn
for i in range(3):

    # Create test and train set with holdout
    test_set_size = int(auto.row_count() / 2)
    train_set, test_set = holdout(auto, test_set_size)

    # knn evaluation
    confusion_matrix = knn_eval(train_set, test_set, majority_vote, 5, 'mpg', ['cyls','weight','accl'])
    print(confusion_matrix)

    # Create a list of accuracies for each mpg label
    acc = []
    for label in distinct_values(auto, ['mpg']):
        acc.append(accuracy(confusion_matrix, label))

    print("\nAverage Accuracy:", sum(acc)/len(acc))

    # Calculate the macro average f-measure
    f_measures = []
    for label in distinct_values(auto, ['mpg']):
        cur_recall = recall(confusion_matrix, label)
        cur_precision = recall(confusion_matrix, label)
        f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))

    print("Macro Average f-measure:", sum(f_measures)/len(f_measures), "\n\n")

  actual    1    2    3
--------  ---  ---  ---
       1   69   11    0
       2   12   41    4
       3    0    6   10

Average Accuracy: 0.8562091503267973
Macro Average f-measure: 0.7355994152046783 


  actual    1    2    3
--------  ---  ---  ---
       1   68    6    0
       2   10   41    8
       3    0    5   15

Average Accuracy: 0.8736383442265795
Macro Average f-measure: 0.7879447243854024 


  actual    1    2    3
--------  ---  ---  ---
       1   65    8    0
       2    9   49    5
       3    0    6   11

Average Accuracy: 0.8779956427015251
Macro Average f-measure: 0.7717491867370997 




8. Redo 7 (as another for loop) that uses weighted voting instead.

In [6]:
# Run 3 iterations of knn
for i in range(3):

    # Create test and train set with holdout
    test_set_size = int(auto.row_count() / 2)
    train_set, test_set = holdout(auto, test_set_size)

    # knn evaluation
    confusion_matrix = knn_eval(train_set, test_set, weighted_vote, 5, 'mpg', ['cyls','weight','accl'])
    print(confusion_matrix)

    # Create a list of accuracies for each mpg label
    acc = []
    for label in distinct_values(auto, ['mpg']):
        acc.append(accuracy(confusion_matrix, label))

    print("\nAverage Accuracy:", sum(acc)/len(acc))

    # Calculate the macro average f-measure
    f_measures = []
    for label in distinct_values(auto, ['mpg']):
        cur_recall = recall(confusion_matrix, label)
        cur_precision = recall(confusion_matrix, label)
        f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))

    print("Macro Average f-measure:", sum(f_measures)/len(f_measures), "\n\n")

  actual    1    2    3
--------  ---  ---  ---
       1   58    8    1
       2   10   51    1
       3    0   18    6

Average Accuracy: 0.8344226579520697
Macro Average f-measure: 0.6460840956507784 


  actual    1    2    3
--------  ---  ---  ---
       1   62   16    0
       2    8   44    8
       3    0    5   10

Average Accuracy: 0.8387799564270152
Macro Average f-measure: 0.7316239316239316 


  actual    1    2    3
--------  ---  ---  ---
       1   65    7    0
       2   13   46    5
       3    0    6   11

Average Accuracy: 0.8649237472766885
Macro Average f-measure: 0.7561955337690631 




9. Compare the performance differences, if any, between the results from 6 and 7.

Overall, the results from the majority_vote knn evaluation had a slightly higher average accuracy. Moreover, the f-measures for weighted vote also saw a small drop off. Initial analysis on these two methods would argue that the majority vote was the stronger method, and scoring weights caused some predictions to alter.

10. Pick two other bin sizes (either equal width or hand-crafted cut-points) for mpg values and redo 6 (as another for loop) for each. Compare the results.

Bin Size 1: [22]

In [7]:
# BIN SIZE 1
auto = DataTable(['mpg','cyls','disp','hp','weight','accl','year','origin','name'])
auto.load('auto-mpg.txt')
auto = remove_duplicates(auto)
auto = remove_missing(auto, auto.columns())

# Discretize mpg values with three equal-width bins
discretize(auto, 'mpg', [22])

# Normalize all the columns
norm_cols = ['cyls','disp','hp','weight','accl']
for column in norm_cols:
    normalize(auto, column)
    
# Run 3 iterations of knn
for i in range(3):

    # Create test and train set with holdout
    test_set_size = int(auto.row_count() / 2)
    train_set, test_set = holdout(auto, test_set_size)

    # knn evaluation
    confusion_matrix = knn_eval(train_set, test_set, majority_vote, 5, 'mpg', ['cyls','weight','accl'])
    print(confusion_matrix)

    # Create a list of accuracies for each mpg label
    acc = []
    for label in distinct_values(auto, ['mpg']):
        acc.append(accuracy(confusion_matrix, label))

    print("\nAverage Accuracy:", sum(acc)/len(acc))

    # Calculate the macro average f-measure
    f_measures = []
    for label in distinct_values(auto, ['mpg']):
        cur_recall = recall(confusion_matrix, label)
        cur_precision = recall(confusion_matrix, label)
        f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))

    print("Macro Average f-measure:", sum(f_measures)/len(f_measures), "\n\n")

  actual    1    2
--------  ---  ---
       1   84    6
       2    6   57

Average Accuracy: 0.9215686274509803
Macro Average f-measure: 0.9190476190476191 


  actual    1    2
--------  ---  ---
       1   76    9
       2    9   59

Average Accuracy: 0.8823529411764706
Macro Average f-measure: 0.8808823529411764 


  actual    1    2
--------  ---  ---
       1   80    5
       2    7   61

Average Accuracy: 0.9215686274509803
Macro Average f-measure: 0.9191176470588236 




Bin Size 2: [15, 20, 25, 30]

In [8]:
# BIN SIZE 2
auto = DataTable(['mpg','cyls','disp','hp','weight','accl','year','origin','name'])
auto.load('auto-mpg.txt')
auto = remove_duplicates(auto)
auto = remove_missing(auto, auto.columns())

# Discretize mpg values with three equal-width bins
discretize(auto, 'mpg', [15, 20, 25, 30])

# Normalize all the columns
norm_cols = ['cyls','disp','hp','weight','accl']
for column in norm_cols:
    normalize(auto, column)
    
# Run 3 iterations of knn
for i in range(3):

    # Create test and train set with holdout
    test_set_size = int(auto.row_count() / 2)
    train_set, test_set = holdout(auto, test_set_size)

    # knn evaluation
    confusion_matrix = knn_eval(train_set, test_set, majority_vote, 5, 'mpg', ['cyls','weight','accl'])
    print(confusion_matrix)

    # Create a list of accuracies for each mpg label
    acc = []
    for label in distinct_values(auto, ['mpg']):
        acc.append(accuracy(confusion_matrix, label))

    print("\nAverage Accuracy:", sum(acc)/len(acc))

    # Calculate the macro average f-measure
    f_measures = []
    for label in distinct_values(auto, ['mpg']):
        cur_recall = recall(confusion_matrix, label)
        cur_precision = recall(confusion_matrix, label)
        f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))

    print("Macro Average f-measure:", sum(f_measures)/len(f_measures), "\n\n")

  actual    2    1    3    4    5
--------  ---  ---  ---  ---  ---
       2   26    8   19    1    0
       1    9   16    0    0    0
       3    5    0   18    7    0
       4    1    0    8   12    7
       5    0    0    1    3   12

Average Accuracy: 0.819607843137255
Macro Average f-measure: 0.580010582010582 


  actual    2    1    3    4    5
--------  ---  ---  ---  ---  ---
       2   29   14    2    0    0
       1    6   14    0    0    0
       3   17    1   10    3    1
       4    3    0   13   14    6
       5    0    0    0    9   11

Average Accuracy: 0.803921568627451
Macro Average f-measure: 0.5191666666666667 


  actual    2    1    3    4    5
--------  ---  ---  ---  ---  ---
       2   31    9    5    1    0
       1   11   17    0    0    0
       3    8    1   12    7    0
       4    1    0    9   11   13
       5    0    0    0    3   14

Average Accuracy: 0.8222222222222223
Macro Average f-measure: 0.5713372305443917 




11. Pick a bin size and voting method from above, and redo 6 but with two different k values (i.e., add two more for loops).

k = 3

In [9]:
# k = 3
auto = DataTable(['mpg','cyls','disp','hp','weight','accl','year','origin','name'])
auto.load('auto-mpg.txt')
auto = remove_duplicates(auto)
auto = remove_missing(auto, auto.columns())

# Discretize mpg values with three equal-width bins
discretize(auto, 'mpg', [20,30])

# Normalize all the columns
norm_cols = ['cyls','disp','hp','weight','accl']
for column in norm_cols:
    normalize(auto, column)
    
# Run 3 iterations of knn
for i in range(3):

    # Create test and train set with holdout
    test_set_size = int(auto.row_count() / 2)
    train_set, test_set = holdout(auto, test_set_size)

    # knn evaluation
    confusion_matrix = knn_eval(train_set, test_set, majority_vote, 3, 'mpg', ['cyls','weight','accl'])
    print(confusion_matrix)

    # Create a list of accuracies for each mpg label
    acc = []
    for label in distinct_values(auto, ['mpg']):
        acc.append(accuracy(confusion_matrix, label))

    print("\nAverage Accuracy:", sum(acc)/len(acc))

    # Calculate the macro average f-measure
    f_measures = []
    for label in distinct_values(auto, ['mpg']):
        cur_recall = recall(confusion_matrix, label)
        cur_precision = recall(confusion_matrix, label)
        f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))

    print("Macro Average f-measure:", sum(f_measures)/len(f_measures), "\n\n")

  actual    1    2    3
--------  ---  ---  ---
       1   69    9    0
       2   10   42    3
       3    0   13    7

Average Accuracy: 0.8474945533769063
Macro Average f-measure: 0.6660839160839161 


  actual    1    2    3
--------  ---  ---  ---
       1   65   10    0
       2   11   42    8
       3    0    4   13

Average Accuracy: 0.8562091503267973
Macro Average f-measure: 0.7732990463945141 


  actual    1    2    3
--------  ---  ---  ---
       1   67    5    0
       2   18   38    8
       3    0    5   12

Average Accuracy: 0.8431372549019608
Macro Average f-measure: 0.7433959694989106 




k = 6

In [10]:
# k = 6
auto = DataTable(['mpg','cyls','disp','hp','weight','accl','year','origin','name'])
auto.load('auto-mpg.txt')
auto = remove_duplicates(auto)
auto = remove_missing(auto, auto.columns())

# Discretize mpg values with three equal-width bins
discretize(auto, 'mpg', [20,30])

# Normalize all the columns
norm_cols = ['cyls','disp','hp','weight','accl']
for column in norm_cols:
    normalize(auto, column)
    
# Run 3 iterations of knn
for i in range(3):

    # Create test and train set with holdout
    test_set_size = int(auto.row_count() / 2)
    train_set, test_set = holdout(auto, test_set_size)

    # knn evaluation
    confusion_matrix = knn_eval(train_set, test_set, majority_vote, 3, 'mpg', ['cyls','weight','accl'])
    print(confusion_matrix)

    # Create a list of accuracies for each mpg label
    acc = []
    for label in distinct_values(auto, ['mpg']):
        acc.append(accuracy(confusion_matrix, label))

    print("\nAverage Accuracy:", sum(acc)/len(acc))

    # Calculate the macro average f-measure
    f_measures = []
    for label in distinct_values(auto, ['mpg']):
        cur_recall = recall(confusion_matrix, label)
        cur_precision = recall(confusion_matrix, label)
        f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))

    print("Macro Average f-measure:", sum(f_measures)/len(f_measures), "\n\n")

  actual    1    2    3
--------  ---  ---  ---
       1   61    9    0
       2    7   53    2
       3    0   12    9

Average Accuracy: 0.869281045751634
Macro Average f-measure: 0.7182795698924731 


  actual    1    2    3
--------  ---  ---  ---
       1   65   12    0
       2   10   43    1
       3    0   14    8

Average Accuracy: 0.8387799564270152
Macro Average f-measure: 0.6680295013628347 


  actual    1    2    3
--------  ---  ---  ---
       1   72    4    0
       2   15   41    6
       3    0    7    8

Average Accuracy: 0.8605664488017428
Macro Average f-measure: 0.7139973589888701 




# 3. Issues, Challenges, and Observations

One of the most difficult challenges that I faced was trying to keep all of the indexes for the confusion matrix in order. Given that there were so many rows and column indexes, it was difficult to design the evaluation algorithm for that reason. Similarly, specific DataTable aspects created errors that caused my program to fail even though the concept was correct. Other than that, this homework went pretty well.

I thought that it was interesting that the weighted voting did slightly worse than majority voting as I thought it would provide more insight to the Euclidean distances of each row in knn. Moreover, it was apparent that bins that created less labels would have a higher accuracy as they do not need to choose from as many different options. Lastly, the k values chosen to change the evaluation were not significant enough to note. I think a substantial amount of k would be needed to really show a difference, especially given the size of the dataset.