# Skill lab: Comparing classifiers


In this assignment you will apply the statistical tools we learned to the machine learning task of comparing the performance of two classifiers.

By the end of this lab you will know
- How to implement a k-nearest neighbor classifier.
- How to perform a k-fold cross validation.
- How to find confidence intervals for a classifier's performance based on a sample.
- How to statistically compare the performance of two classifiers.



You need to perform the following seven tasks:
1. Compute the accuracy of the Naive Bayes classifier based on the holdout estimation. Next, compute the confidence interval for accuracy at the confidence level 0.90.
2. Break the original dataset into 10 folds for cross-validation of Naive Bayes classifier. Obtain classification results from 10 cross-validation experiments.
3. Implement the Nearest Neighbors classifier. 
4. Use it to find the accuracy based on the holdout estimation. Compute the confidence interval at the confidence level 0.90. 
5. Generate the same 10 folds from a dataset with all numeric attributes and obtain classification results using the k-NN classifier. 
6. Test the hypothesis that the two classifiers have a different performance at significance level 0.05.
7. Use the best classifier to predict the evaluation score of several instructors that you know.

Feel free to use any programming tools available: pandas, plain python, numpy or anything else. 

**You are not allowed to use sklearn or any other python library that already includes the implementation of all these tasks**.


## Dataset

The dataset for this lab contains about 460 anonymized student evaluations collected at the University of Texas at Austin, and used in the following publication: "Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity". You can learn how the data was collected and the meaning of various data attributes following [THIS LINK](https://chance.amstat.org/2013/04/looking-good/).

We use a subset of attributes. This smaller subset of the original data is included in the repository. We want to build a classifier that, based on these attributes, will predict the evaluation result for each instructor: good (&ge; 4) or bad (&lt; 4). 

In [104]:
data_file = "SStudentEvaluations.csv"

In [105]:
import pandas as pd

data = pd.read_csv(data_file)
print(data.columns)
print(data.dtypes)

Index(['rank', 'ethnicity', 'gender', 'language', 'age', 'bty_avg',
       'eval_categorical'],
      dtype='object')
rank                 object
ethnicity            object
gender               object
language             object
age                   int64
bty_avg             float64
eval_categorical     object
dtype: object


In [106]:
display(data)

Unnamed: 0,rank,ethnicity,gender,language,age,bty_avg,eval_categorical
0,tenure track,minority,female,english,36,5.000,good
1,tenure track,minority,female,english,36,5.000,bad
2,tenure track,minority,female,english,36,5.000,bad
3,tenure track,minority,female,english,36,5.000,good
4,tenured,not minority,male,english,59,3.000,good
...,...,...,...,...,...,...,...
458,tenure track,not minority,male,english,32,6.833,good
459,tenure track,minority,female,non-english,42,5.333,bad
460,tenure track,minority,female,non-english,42,5.333,bad
461,tenure track,minority,female,non-english,42,5.333,bad


First of all, we will shuffle the data. We use a seeded randomization &mdash; so we can obtain reproducible results (needed for testing of your work).

In [107]:
data = data.sample(frac = 1, random_state=1)    # shuffling the data before performing any validation
data.head()

Unnamed: 0,rank,ethnicity,gender,language,age,bty_avg,eval_categorical
331,tenured,not minority,male,english,64,2.333,bad
101,tenured,not minority,female,english,46,4.333,good
192,tenured,not minority,male,english,54,2.333,good
66,teaching,not minority,male,english,37,4.333,bad
327,tenured,not minority,male,english,64,2.333,bad


### Holdout estimation
This is how we can divide the dataset into training and testing sets in a proportion of ~ 2:1:

In [108]:
# Select ratio
ratio = 0.66
 
total_rows = data.shape[0]
train_size = int(total_rows*ratio)
 
# Split data into test and train
data_train = data[0:train_size]
data_test = data[train_size:]

In [109]:
data_train.shape[0]

305

In [110]:
data_test.shape[0]

158

## Naive Bayes classifier

Below we provide our implementation of the first classifier: Naive Bayes.

We have a mix of categorical and numeric attributes. We will produce counts and probabilities for the categorical attributes. We will also precompute the mean and standard deviation of the numeric attributes, which we will later use with the normal distribution probability density function (PDF) to compute the contribution of the numeric attributes. 

Here is an implementation of the PDF:

In [111]:
from math import e, log, pi    # log is used by the classifier further down

def normal_pdf(x, stat):
    """
    Gaussian (normal) probability density function.

    :param x: the value at which the density is evaluated
    :param stat: tuple (mean, stdev), where mean is µ (the average of the samples)
                 and stdev is σ (the standard deviation)
    :return: N(x; µ, σ) = 1 / (σ * sqrt(2π)) * e^(-(x - µ)^2 / (2σ^2))
    """
    mean, stdev = stat
    variance = stdev ** 2
    exp_squared_diff = (x - mean) ** 2
    exp_power = -exp_squared_diff / (2 * variance)
    exponent = e ** exp_power
    denominator = ((2 * pi) ** .5) * stdev
    normal_prob = exponent / denominator
    return normal_prob
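
A quick way to gain confidence in a hand-written density function is to compare it against the standard library. The sketch below restates the same formula in compact form and checks it against `statistics.NormalDist`; the mean, standard deviation and input value are illustrative and are not taken from the dataset.

```python
from math import e, pi
from statistics import NormalDist

def normal_pdf(x, stat):
    # same formula as above: stat is a (mean, stdev) tuple
    mean, stdev = stat
    return e ** (-((x - mean) ** 2) / (2 * stdev ** 2)) / (((2 * pi) ** .5) * stdev)

# illustrative values only (not computed from the lab's data)
mean, stdev, x = 48.0, 9.5, 36
ours = normal_pdf(x, (mean, stdev))
reference = NormalDist(mu=mean, sigma=stdev).pdf(x)
print(ours, reference)   # the two values should agree up to rounding error
```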

Here is our counting function:

In [112]:
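def produce_counts (train_set, column, results):
    # count how often each value of `column` occurs within each class;
    # the counts are accumulated in results[class_label][column][value]
    col_idx = col_name_to_col_idx [column]
    for tup in train_set.itertuples():
        val = tup[col_idx]
        class_label = tup[7]    # position 7 of the row tuple is eval_categorical
        prev = results [class_label][column]

        if val not in prev.keys():
            prev[val] = 0
        prev[val] += 1    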

Based on these counts, we can pre-compute conditional probabilities for all combinations of categorical attribute values and class labels:

In [113]:
def produce_probabilities(counts, results, class_label, total):
    for col in counts[class_label].keys():
        results[class_label][col] = {} 
        cardinality = len(counts[class_label][col].keys())
        
        for val in counts[class_label][col].keys():
            results[class_label][col][val] = (counts[class_label][col][val] + 1)/(total + cardinality)      
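
The expression `(count + 1)/(total + cardinality)` above is add-one (Laplace) smoothing. As a small worked example, the sketch below applies the same formula to made-up counts for a single attribute within a single class; the helper name `smoothed_probability` and the counts are hypothetical.

```python
def smoothed_probability(count, class_total, cardinality):
    # add-one (Laplace) smoothing, same formula as in produce_probabilities
    return (count + 1) / (class_total + cardinality)

# hypothetical counts of one attribute's values within one class
counts = {"teaching": 30, "tenure track": 50, "tenured": 120}
class_total = sum(counts.values())   # 200 records in this class
cardinality = len(counts)            # 3 distinct values

probs = {value: smoothed_probability(c, class_total, cardinality)
         for value, c in counts.items()}
print(probs)
print(sum(probs.values()))   # the smoothed probabilities still sum to (approximately) 1
```

Because each of the three numerators grows by 1 and the denominator grows by 3, the smoothed values remain a valid probability distribution over the values that were counted.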
                

The classification algorithm below classifies all the records in the *test_set*, based on the data in the *train_set*. 

The output is the list of classification results in the form of tuples (*classified*, *actual*), where *classified* is the class label obtained by our classification, and *actual* is the actual label of this record in the test set.

In [117]:
col_name_to_col_idx = {"rank":1, "ethnicity":2, "gender":3, "language":4 }
idx_to_col_name = {1:"rank", 2: "ethnicity", 3: "gender", 4: "language"}

def naive_bayes_classify (train_set, test_set):  
    counts = {"good": {"rank":{}, "ethnicity":{}, "gender":{}, "language":{}}, "bad":{"rank":{}, "ethnicity":{}, "gender":{}, "language":{}} }   

    total_good  = train_set.groupby("eval_categorical").size()["good"]
    total_bad = train_set.groupby("eval_categorical").size()["bad"]
    priors = {"good":total_good/(total_good+total_bad), "bad":total_bad/(total_good+total_bad) }

    for col in col_name_to_col_idx.keys():
        produce_counts(train_set, col, counts)   
    # print(counts)
    
    probs = {"good":{}, "bad":{}}
    produce_probabilities (counts,  probs, "good", total_good)
    produce_probabilities (counts,  probs, "bad", total_bad)
    # print(probs)

    # means and std for normal distribution of numeric parameters
    data_good = train_set[train_set["eval_categorical"]== "good"]
    data_bad = train_set[train_set["eval_categorical"]== "bad"]

    stats = {"good":{"age":(data_good["age"].mean(), data_good["age"].std(ddof=1)), 
                 "bty_avg":(data_good["bty_avg"].mean(), data_good["bty_avg"].std(ddof=1)) },
        "bad":{"age":(data_bad["age"].mean(), data_bad["age"].std(ddof=1)), 
                 "bty_avg":(data_bad["bty_avg"].mean(), data_bad["bty_avg"].std(ddof=1)) }}
    #print(stats)
    
    results = []
    for tup in test_set.itertuples():
        class_label = tup[7]
        prob_good = log (priors["good"]) 
        for k in col_name_to_col_idx.keys():
            prob_good += log (probs["good"][k][tup[col_name_to_col_idx[k]]]) 
        # numeric attributes: the density value itself is added here
        # (a strict log-space sum would add the logarithm of the density)
        prob_good += normal_pdf(tup[5], stats["good"]["age"])
        prob_good += normal_pdf(tup[6], stats["good"]["bty_avg"])
        # print ("good:", prob_good)

        prob_bad = log (priors["bad"]) 
        for k in col_name_to_col_idx.keys():
            prob_bad += log (probs["bad"][k][tup[col_name_to_col_idx[k]]]) 
        # numeric attributes, handled the same way as for the "good" class
        prob_bad += normal_pdf(tup[5], stats["bad"]["age"])
        prob_bad += normal_pdf(tup[6], stats["bad"]["bty_avg"])
        # print ("bad:", prob_bad)

        classified_as = "good"
        if prob_bad > prob_good:
            classified_as = "bad"
        
        results += [(classified_as, class_label )]    
    return results
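
The classifier accumulates its score as a sum of logarithms. One possible variant, shown here only as a sketch and not as the code that produced the results in this lab, keeps every term in log space by adding the logarithm of each normal density as well. The names `gaussian_log_pdf` and `log_score` and all numbers below are hypothetical.

```python
from math import exp, log, pi

def gaussian_log_pdf(x, mean, stdev):
    # logarithm of the normal density, computed directly
    return -0.5 * log(2 * pi) - log(stdev) - (x - mean) ** 2 / (2 * stdev ** 2)

def log_score(log_prior, categorical_probs, numeric_terms):
    # categorical_probs: conditional probabilities P(value | class)
    # numeric_terms: (x, mean, stdev) triples for the numeric attributes
    score = log_prior
    for p in categorical_probs:
        score += log(p)
    for x, mean, stdev in numeric_terms:
        score += gaussian_log_pdf(x, mean, stdev)
    return score

# illustrative numbers only
score = log_score(log(0.55), [0.4, 0.8], [(36, 48.0, 9.5)])
print(score)
```

Summing logarithms this way is equivalent to taking the logarithm of the product of the prior, the conditional probabilities and the densities. Switching the lab's classifier to this form would change its scores, so the accuracies reported in this notebook would not necessarily carry over.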

Let's run the classifier using the training and testing parts we obtained in the holdout section.

In [118]:
class_results = naive_bayes_classify(data_train, data_test)
print(class_results[:5])

correct_count = 0
for r in range(len(class_results)):
    if class_results[r][0] == class_results[r][1]:
        correct_count+= 1
print ("Accuracy:", correct_count/len(class_results))

[('good', 'good'), ('good', 'good'), ('good', 'bad'), ('good', 'good'), ('bad', 'good')]
Accuracy: 0.569620253164557
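
The accuracy computation is repeated several times in this lab, so it can be convenient to wrap it in a helper. The function name `accuracy` is an assumption; the sample input is the five result tuples printed above.

```python
def accuracy(results):
    # results: list of (classified, actual) tuples
    correct = sum(1 for classified, actual in results if classified == actual)
    return correct / len(results)

sample = [('good', 'good'), ('good', 'good'), ('good', 'bad'), ('good', 'good'), ('bad', 'good')]
print(accuracy(sample))   # 3 of the 5 tuples match -> 0.6
```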


<div style="background-color:blue;">
    <h3>Task 1. Generate confidence interval for accuracy of the Naive Bayes</h3>    
</div>
You can write the code, or use the tables manually.

**Work**: 

We are using a 90% confidence interval of the form

$$ CI = sample\ statistic \pm critical\ value * SE $$

Our sample statistic is **0.5696**, the classifier's accuracy on the test set, which had 158 data points. Given that the test set's size is greater than 30, we will be using a normal (as opposed to t) distribution. For a 90% confidence interval on the normal distribution, the critical value is &plusmn;1.65. Lastly, since each classification outcome is binary (correct or incorrect), we will **approximate** the number of correct classifications by **assuming** it follows a binomial distribution. ***TA Joey said this is acceptable.*** The standard error of the resulting sample proportion is 

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = .0394 $$


which all evaluates to 

$$ .5696 \pm (1.65 * .0394) $$


**Answer**: 

The confidence interval for the performance of the Naive Bayes classifier is:

$$[.5046, .6346]$$
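
The same interval can be computed in code. This is a minimal sketch assuming the normal approximation described above; the function name is hypothetical, and the input of 90 correct classifications out of 158 is inferred from the reported accuracy of 0.5696 rather than printed by the notebook.

```python
from math import sqrt
from statistics import NormalDist

def accuracy_confidence_interval(correct, n, confidence=0.90):
    p_hat = correct / n
    # two-sided critical value of the standard normal distribution
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

low, high = accuracy_confidence_interval(90, 158)
print(round(low, 4), round(high, 4))
```

The exact critical value is about 1.645 rather than the table value of 1.65, so the computed interval is very slightly narrower than the one obtained by hand.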

### Cross-validation

We want to test the performance of two classifiers on different datasets to get the mean of the paired difference. To create several training/testing subsets we will use 10-fold cross-validation: we will divide our original dataset into 10 approximately equal parts (folds) and use 9 out of 10 folds for training and 1 fold for testing. Hence, the total number of performance experiments will be 10.

<div style="background-color:blue;">
    <h3>Task 2. Perform the 10-fold cross-validation with Naive Bayes</h3>    
</div>

Generate 10 equal non-overlapping subsets of data and store them in the list of pandas data frames called *folds*:

We have 462 points to distribute as equally as possible into 10 folds. 462 % 10 = 2, so an even split would give eight folds with 46 points and two with 47. Each fold will be its own dataframe. 

In [119]:
import numpy as np

In [120]:
k = 10
folds = []

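# TODO - your code
# generating a np arr of indices to separate data by
split_bounds = np.linspace(0, 462, k + 1, dtype=int)
# that yields [0,46,92,138,184,231,277,323,369,415,462]
# list[10] of tuples indicating start row (inclusive) and end row (exclusive) for each slice.
# NOTE: as written, (47,93) and (92,138) share row 92, (323,369) and (368,414) share row 368,
# and rows 46, 322, 460 and 461 belong to no fold, so these folds are not strictly disjoint
row_bounds = [(0,46), (47,93), (92,138), (138,184), (184,230), (230, 276), (276,322), (323,369), (368,414), (414,460)]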



for i in row_bounds:
    start_idx = i[0]
    end_idx = i[1]
    fold = data.iloc[start_idx:end_idx]
    folds.append(fold)

for i in range(k):
    print("Fold", i, "size:", folds[i].shape[0])

Fold 0 size: 46
Fold 1 size: 46
Fold 2 size: 46
Fold 3 size: 46
Fold 4 size: 46
Fold 5 size: 46
Fold 6 size: 46
Fold 7 size: 46
Fold 8 size: 46
Fold 9 size: 46
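
Fold boundaries can also be generated rather than typed by hand, which guarantees that the folds are contiguous and non-overlapping. The sketch below is one way to do it in plain Python; the function name `fold_bounds` is hypothetical, and it hands the leftover rows to the first folds.

```python
def fold_bounds(n_rows, k):
    # returns k (start, end) pairs, end exclusive, covering range(n_rows) exactly once
    base, extra = divmod(n_rows, k)
    bounds = []
    start = 0
    for i in range(k):
        size = base + 1 if i < extra else base   # the first `extra` folds get one more row
        bounds.append((start, start + size))
        start += size
    return bounds

bounds = fold_bounds(462, 10)
print(bounds)
print([end - start for start, end in bounds])   # two folds of 47 rows, eight of 46
```

Each pair can be used directly with `data.iloc[start:end]`. Folds produced this way differ from the hand-written bounds above, so accuracies computed on them would not necessarily match the numbers reported in this notebook.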


Implement the loop where you obtain classification results for each of the folds. Store these results in the list *nb_accuracies* for future use.

In [121]:
nb_accuracies = []
# TODO: your code here
# row_bounds holds the start and end row of each fold. Instead of concatenating the other nine folds,
# build the training set by dropping the test fold's rows from the full dataset,
# then train on that remainder and test on the held-out fold

for tup in row_bounds:
    test_low_bound = tup[0]
    test_high_bound = tup[1]

    test_set = data.iloc[test_low_bound:test_high_bound]
    training_set = data.drop(data.index[test_low_bound:test_high_bound])

    trial_results = naive_bayes_classify(training_set, test_set)

    trial_correct_count = 0
    for r in trial_results:
        if r[0] == r[1]:
            trial_correct_count += 1
    
    trial_accuracy = trial_correct_count / len(trial_results)
    nb_accuracies.append(trial_accuracy)
        

print(nb_accuracies)

[0.5652173913043478, 0.4782608695652174, 0.5217391304347826, 0.5652173913043478, 0.5217391304347826, 0.6956521739130435, 0.5434782608695652, 0.6086956521739131, 0.5434782608695652, 0.5]


For comparison &mdash; here are our results: 0.5652173913043478, 0.4782608695652174, 0.5217391304347826, 0.5652173913043478, 
    0.5217391304347826, 0.6956521739130435, 0.5434782608695652, 0.6086956521739131, 0.5434782608695652, 0.5

## Nearest Neighbors classifier (k-NN)

This classifier assigns a class to a given record based on the class labels of *k* labeled records that are closest to it. The closest samples are selected based on a distance metric, then the neighbors vote and the majority class is assigned to a record in question.
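
The voting step can be illustrated on its own. The sketch below assumes the neighbors are already available as (distance, label) tuples; the function name `majority_label` and the distances are made up for the example.

```python
from collections import Counter

def majority_label(neighbors, k):
    # neighbors: list of (distance, label) tuples; only the k closest ones vote
    nearest = sorted(neighbors)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

neighbors = [(0.9, "bad"), (0.1, "good"), (0.4, "bad"), (0.2, "good"), (1.5, "bad")]
print(majority_label(neighbors, 3))   # the three closest are good, good, bad -> good
```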

The value of *k* indicates the number of closest neighbors used to classify the test record. k-NN is a non-parametric method, and *k* has to be chosen by the user. A general rule of thumb for choosing the initial value of k is: k = sqrt(N)/2, where N stands for the number of samples in the training dataset. Another hint is to keep the value of k odd, so that there is no tie when choosing between two classes.

For our dataset the size of the training set will be about 9 * 46 = 414, and sqrt(414)/2 is ~ 10.2, so the nearest odd value above it is 11. We will use k=11 nearest neighbors for our classification.
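
The rule of thumb can be written as a small function. Rounding up and then moving to the next odd number is an assumption made for this sketch; the text only asks for an odd value near sqrt(N)/2.

```python
from math import ceil, sqrt

def rule_of_thumb_k(n_train):
    k = ceil(sqrt(n_train) / 2)
    if k % 2 == 0:      # keep k odd to avoid ties between two classes
        k += 1
    return k

print(rule_of_thumb_k(414))   # sqrt(414)/2 is about 10.17 -> 11
```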

### Categorical to numeric (binary)
To use distance metrics we must convert the categorical attributes to numeric. The most common method is to convert a categorical attribute into a set of binary attributes, such that for each categorical value there is a separate column, and the value in this column is either 0 or 1. This is called a "one hot encoding".
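
To make the idea concrete, here is a minimal plain-Python sketch of one hot encoding. The helper `one_hot` is hypothetical and only illustrates the transformation; it is not how pandas implements it.

```python
def one_hot(values):
    # one column per distinct category, in sorted order
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, rows = one_hot(["tenured", "teaching", "tenure track", "tenured"])
print(categories)   # ['teaching', 'tenure track', 'tenured']
print(rows)         # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Every row contains exactly one 1, in the column of its category.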

One hot encoding for categorical columns:

In [122]:
ohe_rank = pd.get_dummies(data["rank"], dtype=int)
pd.concat([ohe_rank, data["rank"]], axis=1, sort=False).head()

Unnamed: 0,teaching,tenure track,tenured,rank
331,0,0,1,tenured
101,0,0,1,tenured
192,0,0,1,tenured
66,1,0,0,teaching
327,0,0,1,tenured


In [123]:
ohe_ethnicity = pd.get_dummies(data["ethnicity"], dtype=int)
pd.concat([ohe_ethnicity, data["ethnicity"]], axis=1, sort=False).head()

Unnamed: 0,minority,not minority,ethnicity
331,0,1,not minority
101,0,1,not minority
192,0,1,not minority
66,0,1,not minority
327,0,1,not minority


In [124]:
ohe_gender = pd.get_dummies(data["gender"], dtype=int)
pd.concat([ohe_gender, data["gender"]], axis=1, sort=False).head()

Unnamed: 0,female,male,gender
331,0,1,male
101,1,0,female
192,0,1,male
66,0,1,male
327,0,1,male


In [125]:
ohe_language = pd.get_dummies(data["language"], dtype=int)
pd.concat([ohe_language, data["language"]], axis=1, sort=False).head()

Unnamed: 0,english,non-english,language
331,1,0,english
101,1,0,english
192,1,0,english
66,1,0,english
327,1,0,english


Now we create a dataset where all the categorical attributes are replaced by the binary columns. This dataset is called *num_data* and it will be used in the k-NN classification.

In [126]:
num_data = pd.concat([ohe_rank, ohe_ethnicity, ohe_gender, ohe_language, data[["age","bty_avg","eval_categorical"]]], axis=1, sort=False)
num_data.head()

Unnamed: 0,teaching,tenure track,tenured,minority,not minority,female,male,english,non-english,age,bty_avg,eval_categorical
331,0,0,1,0,1,0,1,1,0,64,2.333,bad
101,0,0,1,0,1,1,0,1,0,46,4.333,good
192,0,0,1,0,1,0,1,1,0,54,2.333,good
66,1,0,0,0,1,0,1,1,0,37,4.333,bad
327,0,0,1,0,1,0,1,1,0,64,2.333,bad


Now all the data in num_data is numeric, and we can use the Euclidean distance to compute the distance between the records.
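
As a small worked example, the sketch below computes the Euclidean distance between the first two rows of *num_data* shown above, using their unscaled age and bty_avg values. The function name `euclidean` is an assumption.

```python
from math import sqrt

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# rows 331 and 101: nine binary columns, then age and bty_avg
row_331 = [0, 0, 1, 0, 1, 0, 1, 1, 0, 64, 2.333]
row_101 = [0, 0, 1, 0, 1, 1, 0, 1, 0, 46, 4.333]

print(euclidean(row_331, row_101))   # about 18.17
```

The squared differences are 2 for the two gender columns, 324 for age and 4 for bty_avg, so the age difference accounts for almost all of the distance.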

### Common scale
You can see that the absolute values of different attributes are on different scales, and we had better bring them all to the same interval between 0 and 1, since otherwise the difference in age will dominate the overall distance between two records.

We transform numeric columns to a standard scale 0-1 using the following formula: x<sub>scaled</sub>=(x-min)/(max-min)
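
The formula can be tried on a short list first. The helper `min_max_scale` and the ages below are illustrative only.

```python
def min_max_scale(values):
    # x_scaled = (x - min) / (max - min); assumes max > min
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [29, 46, 64, 73]   # illustrative ages
scaled = min_max_scale(ages)
print(scaled)   # the smallest value maps to 0.0 and the largest to 1.0
```

A column whose values are all equal would lead to a division by zero, so such a column needs to be handled separately.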

In [127]:
# apply normalization techniques to column age 
column = 'age'
num_data[column] = (num_data[column] - num_data[column].min()) / (num_data[column].max() - num_data[column].min())     

In [128]:
# apply normalization techniques to column bty_avg 
column = 'bty_avg'
num_data[column] = (num_data[column] - num_data[column].min()) / (num_data[column].max() - num_data[column].min())     
  
# view normalized data 
display(num_data) 

Unnamed: 0,teaching,tenure track,tenured,minority,not minority,female,male,english,non-english,age,bty_avg,eval_categorical
331,0,0,1,0,1,0,1,1,0,0.795455,0.102462,bad
101,0,0,1,0,1,1,0,1,0,0.386364,0.410154,good
192,0,0,1,0,1,0,1,1,0,0.568182,0.102462,good
66,1,0,0,0,1,0,1,1,0,0.181818,0.410154,bad
327,0,0,1,0,1,0,1,1,0,0.795455,0.102462,bad
...,...,...,...,...,...,...,...,...,...,...,...,...
255,0,0,1,0,1,0,1,1,0,0.522727,0.230769,good
72,0,0,1,0,1,0,1,1,0,0.295455,0.487077,good
396,1,0,0,0,1,0,1,1,0,0.363636,0.256308,good
235,0,0,1,0,1,0,1,1,0,0.727273,0.487077,good


### Holdout for the numeric dataset
Divide the dataset into training and testing sets in proportion of 2:1.

In [129]:
# Select ratio
ratio = 0.66
 
total_rows = num_data.shape[0]
train_size = int(total_rows*ratio)
 
# Split data into test and train
num_data_train = num_data[0:train_size]
num_data_test = num_data[train_size:]

Now you have the input dataset for the k-NN classification.

<div style="background-color:blue;">
    <h3>Task 3. Implement the k-NN classifier</h3>    
</div>

Note that this is a "lazy" classifier and nothing can be precomputed. Both the training and the test sets are used only during classification.

The output of the classifier should be the list of classification results in the form of tuples (*classified*, *actual*), where *classified* is the class label obtained by our classification, and *actual* is the actual label of this record in the test set.

In [130]:
import math

In [131]:
def knn_classify(train_set, test_set, knn):
    results  = []
    # TODO: your code here

    # for each item of the test set:
    for input_datapoint in test_set.iterrows():
        neighbors_by_dist = []

        # compute the distance to every record of the training set
        for curr_train_datapoint in train_set.iterrows():
            distances = 0

            # positions 0..10 hold the 11 numeric features; position 11 holds the class label
            for i in range(0, 11):
                distances += (input_datapoint[1].iloc[i] - curr_train_datapoint[1].iloc[i])**2

            # Euclidean distance between input_datapoint and curr_train_datapoint
            e_dist = math.sqrt(distances)

            # form a tuple (distance, label of the training record)
            curr_tup = (e_dist, curr_train_datapoint[1].iloc[11])

            neighbors_by_dist.append(curr_tup)

        # sort neighbors_by_dist; the first knn entries are the nearest neighbors
        neighbors_by_dist.sort()

        eval_good_count = 0
        eval_bad_count = 0
        for i in range(knn):
#             print(neighbors_by_dist[i][1])
            
            if (neighbors_by_dist[i][1] == 'bad'):
                eval_bad_count += 1
            if (neighbors_by_dist[i][1] == 'good'): 
                eval_good_count += 1
                
#         print(f"Eval Good Count: {eval_good_count}; Eval Bad Count: {eval_bad_count}")

        if eval_good_count < eval_bad_count:
            classification = 'bad'
        else:
            classification = 'good'
        
        actual = input_datapoint[1]['eval_categorical']
        # Form a tuple[2] where the first item is the classification and the second item is the actual label given
        curr_output = (classification, actual)
        # Add that tuple to the list
        results.append(curr_output)
        
    return results

<div style="background-color:blue;">
    <h3>Task 4. Generate the confidence interval for the k-NN accuracy </h3>    
</div>
This is based on the holdout estimation. 
Run your classifier, obtain the accuracy of the sample, and then produce a confidence interval. You can write the code, or use the tables manually.

In [135]:
class_results = knn_classify(num_data_train, num_data_test, 11)
# print(class_results[:5])

# TODO: classify and compute accuracy
knn_correct_count = 0
for i in class_results:
    if i[0] == i[1]:
        knn_correct_count += 1
# print(f"correct count: {knn_correct_count}; total: {len(class_results)}")
accuracy = knn_correct_count / len(class_results)
print(accuracy)

0.5569620253164557


Our accuracy was: 0.5569620253164557

**Your answer**: The confidence interval for the performance of the k-NN classifier is:

**Work**: 

We are again using a 90% confidence interval of the form

$$ CI = sample\ statistic \pm critical\ value * SE $$

Our sample statistic is **0.5570**, the classifier's accuracy on the test set, which had 158 data points like before. Given that the test set's size is greater than 30, we will be using a normal (as opposed to t) distribution. For a 90% confidence interval on the normal distribution, the critical value is &plusmn;1.65. Lastly, since each classification outcome is binary (correct or incorrect), we will **approximate** the number of correct classifications by **assuming** it follows a binomial distribution. ***TA Joey said this is acceptable.*** The standard error of the resulting sample proportion is 

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = .0395 $$


which all evaluates to 

$$ .5570 \pm (1.65 * .0395) $$


**Answer**: 

The confidence interval for the performance of the KNN classifier is:

$$[.4918, .6222]$$

<div style="background-color:blue;">
    <h3>Task 5. Perform the 10-fold cross-validation with k-NN</h3>    
</div>


Generate 10 equal non-overlapping subsets of numeric data from the numeric dataset and store them in the list of pandas data frames called *num_folds*:

In [150]:
k = 10  # k here is the number of folds
num_folds = []

#TODO numeric folds
for i in row_bounds:
    start_idx = i[0]
    end_idx = i[1]
    fold = num_data.iloc[start_idx:end_idx]
    num_folds.append(fold)

for i in range(k):
    print("Fold", i, "size:", num_folds[i].shape[0])

Fold 0 size: 46
Fold 1 size: 46
Fold 2 size: 46
Fold 3 size: 46
Fold 4 size: 46
Fold 5 size: 46
Fold 6 size: 46
Fold 7 size: 46
Fold 8 size: 46
Fold 9 size: 46


Implement the loop to perform 10-fold cross-validation. Store the classification results in the list *knn_accuracies* for future use.

In [156]:
knn = 11 # knn here is the number of nearest neighbors
knn_accuracies = []
# TODO: your code here

# row_bounds holds the start and end row of each fold. Instead of concatenating the other nine folds,
# build the training set by dropping the test fold's rows from the full dataset,
# then train on that remainder and test on the held-out fold

for tup in row_bounds:
    knn_test_low_bound = tup[0]
    knn_test_high_bound = tup[1]

    knn_test_set = num_data.iloc[knn_test_low_bound:knn_test_high_bound]
    knn_training_set = num_data.drop(num_data.index[knn_test_low_bound:knn_test_high_bound])

    knn_trial_results = knn_classify(knn_training_set, knn_test_set, 11)
    
    knn_trial_correct_count = 0
    for r in knn_trial_results:
        if r[0] == r[1]:
            knn_trial_correct_count += 1
    
    knn_trial_accuracy = knn_trial_correct_count / len(knn_trial_results)
    knn_accuracies.append(knn_trial_accuracy)
        

print (knn_accuracies)

[0.6304347826086957, 0.6956521739130435, 0.5, 0.6304347826086957, 0.6086956521739131, 0.6304347826086957, 0.717391304347826, 0.6521739130434783, 0.6956521739130435, 0.6739130434782609]


Our results were:
0.6304347826086957, 0.717391304347826, 0.5, 0.6304347826086957, 0.5869565217391305, 0.6304347826086957, 
0.717391304347826, 0.6521739130434783, 0.6956521739130435, 0.6739130434782609

<div style="background-color:blue;">
    <h3>Task 6. Compare performance of two classifiers</h3>    
</div>

Based on the paired results stored in lists *nb_accuracies* and *knn_accuracies*, test the hypothesis that the two classifiers do not have the same performance at a significance level 0.05. Recall that we need to use the t-distribution for the mean of differences. Again, you can either implement the computation or use the tables manually.

**If you are not writing the code, please clearly explain all the steps of your computation**.

Let $\mu_d$ be the mean difference in performance between our KNN and our NB classifiers implemented above. We state our null hypothesis to be $$H_0: \mu_d = 0$$ and our alternative hypothesis to be $$H_A: \mu_d \neq 0$$

The raw accuracies for the 10-fold cross-validations for the Naive Bayes classifier and the KNN classifier are, respectively, as follows (for convenience):

In [158]:
print(f"Naive Bayes Accuracies:\n{nb_accuracies}")
print(f"KNN accuracies:\n{knn_accuracies}")

Naive Bayes Accuracies:
[0.5652173913043478, 0.4782608695652174, 0.5217391304347826, 0.5652173913043478, 0.5217391304347826, 0.6956521739130435, 0.5434782608695652, 0.6086956521739131, 0.5434782608695652, 0.5]
KNN accuracies:
[0.6304347826086957, 0.6956521739130435, 0.5, 0.6304347826086957, 0.6086956521739131, 0.6304347826086957, 0.717391304347826, 0.6521739130434783, 0.6956521739130435, 0.6739130434782609]


The differences for each trial are as follows:

In [162]:
cv_differences = []
for i in range(10):
    out = nb_accuracies[i] - knn_accuracies[i]
    cv_differences.append(out)
print(cv_differences)

[-0.0652173913043479, -0.21739130434782605, 0.021739130434782594, -0.0652173913043479, -0.08695652173913049, 0.06521739130434778, -0.17391304347826086, -0.04347826086956519, -0.15217391304347827, -0.17391304347826086]


The (arithmetic) mean and variance (calculated with the Python statistics package, using arithmetic mean and sample variance) of differences is:

In [170]:
from statistics import mean, variance

print(f"Mean of differences: {mean(cv_differences)}")
print(f"Variance of differences: {variance(cv_differences)}")

Mean of differences: -0.08913043478260871
Variance of differences: 0.008238815374921231


Now, we conduct a paired t-test on our null hypothesis. The sample standard deviation of the differences is $s_d = \sqrt{0.008239} = 0.09077$. With significance level $\alpha = 0.05$ (two-tailed) and $n - 1 = 9$ degrees of freedom, our critical t-value is 

$$t = 2.262$$

Now, we construct an interval around the null-hypothesis value:

$$\mu_d = 0\ \pm 2.262 \frac{0.09077}{\sqrt{10}}$$

and find our rejection regions to be outside of the interval

$$[-0.0649, 0.0649]$$

Now, given that our mean of differences was -0.0891, which falls outside this interval, we have sufficient evidence to reject the null hypothesis.

Given that we calculated the differences with $$NB\ accuracies - KNN\ accuracies$$ 

we have found the KNN classifier to have a higher accuracy with statistical significance in this dataset.  
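
The test can also be carried out in code. This is a minimal sketch assuming the standard paired t-test with n - 1 degrees of freedom; the differences are the ten values printed above, and the critical value 2.262 (two-tailed, alpha = 0.05, 9 degrees of freedom) is taken from a t-table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

differences = [-0.0652173913043479, -0.21739130434782605, 0.021739130434782594,
               -0.0652173913043479, -0.08695652173913049, 0.06521739130434778,
               -0.17391304347826086, -0.04347826086956519, -0.15217391304347827,
               -0.17391304347826086]

n = len(differences)
d_bar = mean(differences)                  # mean of the paired differences
se = stdev(differences) / sqrt(n)          # standard error of the mean difference
t_stat = d_bar / se
t_critical = 2.262                         # from a t-table: two-tailed, alpha = 0.05, df = 9

print(t_stat)
print(abs(t_stat) > t_critical)            # True means the null hypothesis is rejected
```

The standard error uses the sample standard deviation of the differences (the square root of the sample variance), divided by the square root of the number of folds.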

<div style="background-color:blue;">
    <h3>Task 7. Use the best classifier</h3>    
</div>
Which classifier is significantly better? 

Use it to predict the evaluation results for instructors that you know.
Now you can use the entire dataset as a training set.

Did the predicted class labels correspond to your own evaluations? 

Discuss all these questions and add any notes about this lab in a separate cell below.

We found the KNN classifier to be significantly better. We demonstrate it a few times as follows:

In [190]:
# try KNN for entry 22
trial_1_train = num_data.drop(num_data.index[22:23])
trial_1_test = num_data.iloc[22:23]
trial_1_result = knn_classify(trial_1_train, trial_1_test, 11)
print(f"Trial 1 results: Classification: {trial_1_result[0][0]}; Actual: {trial_1_result[0][1]}")

# try KNN for entry 40
trial_2_train = num_data.drop(num_data.index[40:41])
trial_2_test = num_data.iloc[40:41]
trial_2_result = knn_classify(trial_2_train, trial_2_test, 11)
print(f"Trial 2 results: Classification: {trial_2_result[0][0]}; Actual: {trial_2_result[0][1]}")

# try KNN for entry 295
trial_3_train = num_data.drop(num_data.index[295:296])
trial_3_test = num_data.iloc[295:296]
trial_3_result = knn_classify(trial_3_train, trial_3_test, 11)
print(f"Trial 3 results: Classification: {trial_3_result[0][0]}; Actual: {trial_3_result[0][1]}")

Trial 1 results: Classification: bad; Actual: bad
Trial 2 results: Classification: bad; Actual: good
Trial 3 results: Classification: bad; Actual: bad


I found this lab to be both fun and rewarding. I found the statistics behind these algorithms to be intuitive but the implementation in Python to be troublesome, as most of the Pitt SCI undergraduate curriculum is in Java; Python has simpler syntax but I am not as used to it.

#### This is the end of Skill lab 3. 

Copyright &copy; 2024 Marina Barsky.