David Giacobbi, CPSC 322, Fall 2023, Notebook for HW-6

# 1. Load libraries and datasets

Import the data table and utility functions.

In [1]:
from data_table import *
from data_learn import *
from data_eval import *
from data_util import *

# 2. Auto MPG Data Analysis

Load and clean auto data.

In [2]:
auto = DataTable(['mpg','cyls','disp','hp','weight','accl','year','origin','name'])
auto.load('auto-mpg.txt')

# Remove duplicate rows
auto = remove_duplicates(auto)

# Remove rows with missing values in any columns
auto = remove_missing(auto, auto.columns())

## Step 1: k-NN versus Naive Bayes 

Discretize the mpg value in the auto table using three equal-width bins. Normalize the weight (weight) and displacement (disp) attributes.

In [3]:
# Discretize the mpg value into three equal-width bins
discretize(auto, 'mpg', [20, 30])

# Normalize weight and displacement attributes
normalize(auto, 'weight')
normalize(auto, 'disp')
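As a rough illustration of what the two preprocessing calls above are meant to do, here is a minimal sketch on plain Python lists. The helper names `min_max_normalize` and `discretize_value` are hypothetical, and the sketch assumes min-max scaling and half-open bins (a value equal to a cut point falls into the higher bin); the actual `normalize` and `discretize` functions in `data_util` may behave differently.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1] (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize_value(value, cut_points):
    """Return a 1-based bin label for value given ascending cut points."""
    for i, cut in enumerate(cut_points):
        if value < cut:
            return i + 1
    return len(cut_points) + 1


# Made-up sample values, only for illustration.
weights = [2000, 3000, 4000]
print(min_max_normalize(weights))

# The cut points [20, 30] are the ones used in the cell above.
print([discretize_value(m, [20, 30]) for m in (15.0, 24.5, 36.0)])
```

With the cut points 20 and 30, every mpg value ends up with one of the labels 1, 2, or 3, which are the labels that appear in the confusion matrices.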

Evaluate knn using stratified k-fold cross validation (i.e., your knn_stratified() function) to predict mpg labels using 10 folds, a knn k-value of 7, majority voting, and only the weight and displacement attributes (as numeric columns). Display the resulting confusion matrix. Compute accuracy, precision, recall, and the f-measure over the resulting confusion matrix and display each. 

In [4]:
# Evaluate kNN using stratified 10-fold cross validation
k_fold_matrix = knn_stratified(auto, 10, 'mpg', majority_vote, 7, ['weight', 'disp'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(auto, ['mpg']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(auto, ['mpg']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    3
--------  ---  ---  ---
       1  130   18    1
       2   16   99    7
       3    0   11   25

Average Accuracy: 0.8849077090119435

Average Precision: 0.8071414054932889

Average Recall: 0.79280102525234

Macro Average f-measure: 0.7993312044542701
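The averages above can be checked by hand from the confusion matrix. The sketch below recomputes them from a plain list of lists, assuming rows are actual labels and columns are predicted labels; it does not use the `accuracy`, `precision`, or `recall` functions from `data_eval`, and the helper `per_label_counts` is a hypothetical name.

```python
# Confusion matrix printed above: rows = actual, columns = predicted (labels 1, 2, 3).
matrix = [
    [130, 18, 1],
    [16, 99, 7],
    [0, 11, 25],
]


def per_label_counts(m, i):
    """Return (tp, fp, fn, tn) for the label at index i, treated one-vs-rest."""
    total = sum(sum(row) for row in m)
    tp = m[i][i]
    fn = sum(m[i]) - tp                  # actual i, predicted something else
    fp = sum(row[i] for row in m) - tp   # predicted i, actually something else
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


accs, precs, recs, f1s = [], [], [], []
for i in range(len(matrix)):
    tp, fp, fn, tn = per_label_counts(matrix, i)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    accs.append((tp + tn) / (tp + fp + fn + tn))
    precs.append(p)
    recs.append(r)
    f1s.append(2 * p * r / (p + r))


def macro(xs):
    """Unweighted mean over labels."""
    return sum(xs) / len(xs)


print("Average Accuracy:", macro(accs))
print("Average Precision:", macro(precs))
print("Average Recall:", macro(recs))
print("Macro Average f-measure:", macro(f1s))
```

Under the row/column assumption stated above, these values agree with the printed averages up to floating-point rounding, which is a useful cross-check on the evaluation functions.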


Repeat the previous evaluation using the same parameters but with naive Bayes instead of knn (i.e., your naive_bayes_stratified() function). Be sure to use weight and displacement as continuous attributes. Display the resulting confusion matrix. Compute accuracy, precision, recall, and the f-measure over the resulting confusion matrix and display each.

In [5]:
# Evaluate naive Bayes using stratified 10-fold cross validation
k_fold_matrix = naive_bayes_stratified(auto, 10, 'mpg', ['weight', 'disp'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(auto, ['mpg']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(auto, ['mpg']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    3
--------  ---  ---  ---
       1  133   15    1
       2   16   73   33
       3    0    2   34

Average Accuracy: 0.8545059717698154

Average Precision: 0.7345761869251802

Average Recall: 0.8118075166155263

Macro Average f-measure: 0.7450476162645341


Compare the kNN and naive Bayes results above. Write down your thoughts and observations concerning the results.

Although the naive Bayes and kNN approaches have similar evaluation metrics, I treat the f-measure as the best single summary of a classifier's effectiveness because it balances precision and recall. In this case, kNN's macro-average f-measure (about 0.80) was roughly 5 percentage points higher than that of naive Bayes (about 0.75). I think this is because kNN classifies using Euclidean distances between the normalized continuous values directly, whereas naive Bayes assumes that each continuous attribute follows a Gaussian density within each class, which may not fit this data as well.
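For reference, the distance-based side of this comparison can be sketched in a few lines. This is a minimal illustration of Euclidean distance with majority voting, not the `knn_stratified` implementation used above; the helper names and the training rows (made-up normalized weight and displacement pairs) are assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Return the majority label among the k training rows nearest to query.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up (weight, disp) pairs, already scaled to [0, 1], with mpg bin labels.
train = [
    ((0.10, 0.15), 3),
    ((0.20, 0.10), 3),
    ((0.45, 0.40), 2),
    ((0.50, 0.55), 2),
    ((0.85, 0.90), 1),
    ((0.90, 0.80), 1),
]

print(knn_predict(train, (0.15, 0.12), k=3))
```

Because the prediction depends only on which rows are closest, kNN makes no assumption about the shape of each attribute's distribution, which is the contrast with the Gaussian density drawn in the paragraph above.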

## Step 2: Experimentation with Auto MPG Data Classification

Experiment with different numbers of folds, different attributes, different knn k-values, and so on, to find the parameters that work best for each approach (knn versus naive bayes) on the mpg data set.

### Experimentation with stratified kNN

1. k-folds: 6, majority_vote, k: 3, ['weight', 'disp', 'cyls', 'accl']
1. k-folds: 12, weighted_vote, k: 5, ['weight', 'disp', 'cyls', 'accl', 'hp']

In [6]:
# k-folds: 6, majority_vote, k: 3, ['weight', 'disp', 'cyls', 'accl']
normalize(auto, 'cyls')
normalize(auto, 'accl')
k_fold_matrix = knn_stratified(auto, 6, 'mpg', majority_vote, 3, ['weight', 'disp', 'cyls', 'accl'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(auto, ['mpg']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(auto, ['mpg']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    3
--------  ---  ---  ---
       1  134   15    0
       2   17   95   10
       3    0   13   23

Average Accuracy: 0.8805646036916395

Average Precision: 0.7855815463633263

Average Recall: 0.7723020908464852

Macro Average f-measure: 0.7785034013605442


In [7]:
# k-folds: 12, weighted_vote, k: 5, ['weight', 'disp', 'hp', 'cyls', 'accl']
normalize(auto, 'hp')
k_fold_matrix = knn_stratified(auto, 12, 'mpg', weighted_vote, 5, ['weight', 'disp', 'hp', 'cyls', 'accl'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(auto, ['mpg']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(auto, ['mpg']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    3
--------  ---  ---  ---
       1  133   15    1
       2   16   91   15
       3    0    6   30

Average Accuracy: 0.8849077090119435

Average Precision: 0.7857637875693028

Average Recall: 0.8239508074473418

Macro Average f-measure: 0.8007008481717928


### Experimentation with Stratified Naive Bayes

1. k-folds: 12, ['weight', 'disp', 'hp', 'cyls', 'accl']

In [8]:
k_fold_matrix = naive_bayes_stratified(auto, 12, 'mpg', ['weight', 'disp', 'hp', 'cyls', 'accl'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(auto, ['mpg']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(auto, ['mpg']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    3
--------  ---  ---  ---
       1  128   17    4
       2   12   21   89
       3    0    0   36

Average Accuracy: 0.7350705754614549

Average Precision: 0.5819956868916477

Average Recall: 0.6770638500751825

Macro Average f-measure: 0.5282255950508545


To get a better idea of what worked, I started by analyzing the stratified kNN. I noticed that increasing the number of folds, decreasing the k value, and using weighted voting slightly increased the classifier's effectiveness. I wanted to add more attributes pertaining to the engine, since larger engines typically require more power to work. However, when I applied these changes to the Naive Bayes classifier, the additional attributes greatly reduced its accuracy: the macro-average f-measure fell from about 0.75 to about 0.53. Giving Naive Bayes more attributes and more folds appeared to have a very negative effect. My inference is that adding more continuous attributes to Naive Bayes decreases the overall f-measure because the classifier assumes that each continuous attribute is normally distributed, which is not typically the case. This is important to note when choosing continuous attributes for a Naive Bayes classifier.
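One way to see why extra continuous attributes can hurt is to look at how a Gaussian naive Bayes score is built. The sketch below is a minimal illustration with hypothetical helper names, not the `naive_bayes_stratified` code; it assumes the usual form of a class prior multiplied by one Gaussian density per attribute.

```python
import math


def gaussian_density(x, mean, stdev):
    """Normal probability density at x for the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * stdev ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * stdev)


def class_score(prior, row, stats):
    """Prior times the product of per-attribute Gaussian densities.

    stats holds one (mean, stdev) pair per attribute for a single class.
    """
    score = prior
    for x, (mean, stdev) in zip(row, stats):
        score *= gaussian_density(x, mean, stdev)
    return score


# One attribute: class A is centered at 0, class B at 1, and the observed value is 0.
one = class_score(0.5, [0.0], [(0, 1)]) / class_score(0.5, [0.0], [(1, 1)])

# Three strongly related attributes carrying the same evidence.
three = class_score(0.5, [0.0] * 3, [(0, 1)] * 3) / class_score(0.5, [0.0] * 3, [(1, 1)] * 3)

print("A-to-B score ratio with one attribute:", one)
print("A-to-B score ratio with three copies:", three)
```

Every added attribute contributes another multiplicative factor, so attributes that are closely related (for example several engine-size measurements) or that fit a Gaussian poorly can dominate the product. This is only one plausible explanation for the drop seen above, not something the experiments here confirm.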

# 2. Titanic Data Analysis

Load the titanic data set below. The attributes are *class*, *age*, *gender*, and *survival*.

In [9]:
titanic = DataTable(['class', 'age', 'gender', 'survival'])
titanic.load('titanic.txt')

Using the titanic data set, predict survival using both knn and naive bayes via stratified k-fold cross validation using each of the other attributes as categorical features. For both, use 4 folds. For knn use a k-value of 7 and majority voting. As above, show the resulting confusion matrix for both along with accuracy, precision, recall, and f measure results.

### k-NN Stratified Classifier

In [10]:
k_fold_matrix = knn_stratified(titanic, 4, 'survival', majority_vote, 7, [], nom_cols=['class', 'age', 'gender'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(titanic, ['survival']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(titanic, ['survival']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

actual      yes    no
--------  -----  ----
yes        1490     0
no          711     0

Average Accuracy: 0.6769650159018628

Average Precision: 0.8384825079509314

Average Recall: 0.5

Macro Average f-measure: 0.4036846383094012
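In this run every instance was predicted as "yes", so the precision for "no" is 0/0 and has no defined value. The sketch below recomputes the macro averages from the matrix under two conventions for that undefined case. It assumes rows are actual labels and columns are predicted labels; `safe_div` and `macro_metrics` are hypothetical helpers, not functions from `data_eval`.

```python
def safe_div(num, den, default=0.0):
    """Divide, returning default when the denominator is zero."""
    return num / den if den else default


# Confusion matrix from the kNN run above, indexed as matrix[actual][predicted].
matrix = {"yes": {"yes": 1490, "no": 0},
          "no": {"yes": 711, "no": 0}}
labels = ["yes", "no"]


def macro_metrics(undefined_precision):
    """Macro precision, recall, and f-measure; undefined precision uses the given value."""
    precs, recs, f1s = [], [], []
    for label in labels:
        tp = matrix[label][label]
        fp = sum(matrix[a][label] for a in labels if a != label)
        fn = sum(matrix[label][p] for p in labels if p != label)
        p = safe_div(tp, tp + fp, default=undefined_precision)
        r = safe_div(tp, tp + fn)
        precs.append(p)
        recs.append(r)
        f1s.append(safe_div(2 * p * r, p + r))
    n = len(labels)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n


print("Undefined precision counted as 1:", macro_metrics(1.0))
print("Undefined precision counted as 0:", macro_metrics(0.0))
```

The printed average precision of about 0.838 matches the first convention, which suggests (as an inference from the numbers only) that the undefined precision for "no" is being counted as 1. Counting it as 0 gives an average precision of about 0.338, which arguably describes a classifier that never predicts "no" more faithfully. The macro f-measure is the same under both conventions because the recall for "no" is 0.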


### Naive Bayes Stratified Classifier

In [11]:
k_fold_matrix = naive_bayes_stratified(titanic, 4, 'survival', [], cat_cols=['class', 'age', 'gender'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(titanic, ['survival']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(titanic, ['survival']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

actual      yes    no
--------  -----  ----
yes        1364   126
no          360   351

Average Accuracy: 0.7791912766924125

Average Precision: 0.7635161756336732

Average Recall: 0.7045535638433438

Macro Average f-measure: 0.7198478248571589


# 3. Student Stress Data Analysis

Load the student stress data set below. The attributes are given below in column order, where the short name to use is given in parentheses:
1. sleep_quality (sleep)
2. living_conditions (living)
3. basic_needs (basics)
4. academic_performance (academic)
5. study_load (study)
6. future_career_concerns (career)
7. social_support (social)
8. extracurricular_activities (extra)
9. stress_level (stress)

In [12]:
student_stress = DataTable(['sleep', 'living', 'basics', 'academic', 'study', 'career', 'social', 'extra', 'stress'])
student_stress.load('student-stress.txt')

## Step 1: Initial kNN and Naive Bayes Evaluation

Use stratified k-fold cross validation to evaluate knn and naive bayes for predicting student stress level (the stress attribute) using the other table attributes as categorical values. For both evaluations use 10 folds, and for knn use a k-value of 7 and majority voting. Give your resulting confusion matrices as well as accuracy, precision, recall, and f-measure values.

### kNN Stratified Classifier

In [13]:
k_fold_matrix = knn_stratified(student_stress, 10, 'stress', majority_vote, 7, [], nom_cols=['sleep', 'living', 'basics', 'academic', 
                                                                                             'study', 'career', 'social', 'extra'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(student_stress, ['stress']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(student_stress, ['stress']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    0
--------  ---  ---  ---
       1  231    1  126
       2    0  300   69
       0    0    0  373

Average Accuracy: 0.8812121212121212

Average Precision: 0.8844559605696193

Average Recall: 0.8194198422431151

Macro Average f-measure: 0.8242254462402888


### Naive Bayes Stratified Classifier

In [14]:
k_fold_matrix = naive_bayes_stratified(student_stress, 10, 'stress', [], cat_cols=['sleep', 'living', 'basics', 'academic', 
                                                                                   'study', 'career', 'social', 'extra'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(student_stress, ['stress']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(student_stress, ['stress']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    0
--------  ---  ---  ---
       1  314   24   20
       2   15  332   22
       0   16   35  322

Average Accuracy: 0.9199999999999999

Average Precision: 0.8812883904955516

Average Recall: 0.8800315822789683

Macro Average f-measure: 0.8802704439782209


## Step 2: Experimentation with Student Stress Data

### Naive Bayes Classifier Changes

1. k-fold: 14, ['sleep', 'basics', 'career', 'academic']
1. k-fold: 14, ['study', 'social', 'extra', 'living']

In [15]:
k_fold_matrix = naive_bayes_stratified(student_stress, 14, 'stress', [], cat_cols=['sleep', 'basics', 'academic', 'career'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(student_stress, ['stress']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(student_stress, ['stress']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    0
--------  ---  ---  ---
       1  314   18   26
       2   23  323   23
       0   23   23  327

Average Accuracy: 0.9175757575757576

Average Precision: 0.8764219035495632

Average Recall: 0.8763697762239104

Macro Average f-measure: 0.8763752385881931


In [16]:
k_fold_matrix = naive_bayes_stratified(student_stress, 14, 'stress', [], cat_cols=['living', 'study', 'social', 'extra'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(student_stress, ['stress']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(student_stress, ['stress']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    0
--------  ---  ---  ---
       1  289   25   44
       2   10  334   25
       0   12   41  320

Average Accuracy: 0.9048484848484848

Average Precision: 0.8622941860433078

Average Recall: 0.8567734895026345

Macro Average f-measure: 0.8575105650032796


### kNN Classifier Changes

1. k-folds: 12, majority_vote, k: 4, ['sleep', 'living', 'basics', 'academic', 'study', 'career', 'social', 'extra']

In [17]:
k_fold_matrix = knn_stratified(student_stress, 12, 'stress', majority_vote, 4, [], nom_cols=['sleep', 'living', 'basics', 'academic', 
                                                                                             'study', 'career', 'social', 'extra'])
print(k_fold_matrix)

# Compute accuracy, precision, recall
acc = []
prec = []
rec = []
for label in distinct_values(student_stress, ['stress']):
    acc.append(accuracy(k_fold_matrix, label))
    prec.append(precision(k_fold_matrix, label))
    rec.append(recall(k_fold_matrix, label))

print("\nAverage Accuracy:", sum(acc)/len(acc))
print("\nAverage Precision:", sum(prec)/len(prec))
print("\nAverage Recall:", sum(rec)/len(rec))

# Calculate the macro average f-measure
f_measures = []
for label in distinct_values(student_stress, ['stress']):
    cur_recall = recall(k_fold_matrix, label)
    cur_precision = precision(k_fold_matrix, label)
    f_measures.append((2 * cur_precision * cur_recall) / (cur_recall + cur_precision))
print("\nMacro Average f-measure:", sum(f_measures)/len(f_measures))

  actual    1    2    0
--------  ---  ---  ---
       1  314   28   16
       2    9  345   15
       0    8   49  316

Average Accuracy: 0.9242424242424243

Average Precision: 0.8922796175382918

Average Recall: 0.8864131027519031

Macro Average f-measure: 0.8871857325188431


With these changes, I wanted to split up the Naive Bayes attributes to see whether some of them carried significantly more weight than the others. In the first test, I selected the attributes that appeared to be the best indicators of stress in people. I then placed the remaining ones in a separate Naive Bayes classifier. Neither subset differed much from the full model: the macro-average f-measures were about 0.88 and 0.86, compared with 0.88 when using all eight attributes. I think that all of the attributes have similar influence, so providing more attributes ends up painting a better picture for the model when classifying among the three stress levels.

Using this information, I altered the k value of the kNN classifier. I noticed that the voting scheme was thwarting the classifier's effectiveness. When I increased k too far, I ended up with a classifier that only wanted to predict a stress level of 0. So, I decreased the k value to four and noticed a large jump in the kNN's evaluation metrics (the macro-average f-measure rose from about 0.82 to about 0.89), ultimately making it the most accurate of my altered models.

# 4. Issues, Challenges, and Observations

Overall, I did not struggle much with the implementation of Naive Bayes. I think the concept was easier to grasp after already having written a model for kNN. The greatest difficulty I encountered was trusting that my evaluation functions were performing stratification correctly and depicting the results and metrics accurately in the confusion matrix. Since there were no Python tests to check this, it was difficult to trust that my algorithm was doing what I believed it should, simply because it did not throw any errors when I ran my program. The Jupyter notebook helped me learn more about the strengths and weaknesses of kNN and Naive Bayes classifiers, which I hope to apply in my own project soon.
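One way to build that trust without provided tests is a small sanity check on the folds themselves. The sketch below is an illustration under assumptions: `stratified_folds` is a hypothetical round-robin splitter on a plain list of labels, not the stratification code in `data_eval`, and `check_folds` asserts two properties any stratified split should satisfy. The class counts 149, 122, and 36 are the row sums of the auto mpg confusion matrices above.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Deal row indexes into k folds round-robin, one class at a time."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    pos = 0
    for label in sorted(by_label):
        for idx in by_label[label]:
            folds[pos % k].append(idx)
            pos += 1
    return folds


def check_folds(labels, folds):
    """Assert that folds partition the rows and keep class counts balanced."""
    seen = sorted(i for fold in folds for i in fold)
    assert seen == list(range(len(labels))), "each row must be in exactly one fold"
    for label in set(labels):
        counts = [sum(1 for i in fold if labels[i] == label) for fold in folds]
        assert max(counts) - min(counts) <= 1, "class counts should differ by at most 1"


# Class sizes taken from the auto mpg confusion matrix rows.
labels = [1] * 149 + [2] * 122 + [3] * 36
folds = stratified_folds(labels, 10)
check_folds(labels, folds)
print([len(fold) for fold in folds])
```

A similar pair of assertions could be pointed at the real fold lists produced during cross validation, so that a silent stratification bug fails loudly instead of only showing up as odd metrics.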