## Workshop - ML Classification

In this workshop we will:

* obtain the null model accuracy
* obtain a Gaussian naive Bayes accuracy
* cross-validate a KNN classifier and obtain the accuracy

Run this code. **Notice the alternative standardization technique.**

In [1]:
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error
from sklearn import linear_model as lm

from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.preprocessing import LabelEncoder,StandardScaler

from tqdm import tqdm

import matplotlib.pyplot as plt

In [2]:
df = pd.read_pickle('class_data.pkl')

In [3]:
df_prepped = df.drop(columns = ['year']).join([
    pd.get_dummies(df.year, drop_first = False)    
])

In [4]:
le = LabelEncoder().fit(df_prepped['urate_bin'].unique())
y = le.transform(df_prepped['urate_bin'])

In [5]:
y = df_prepped['urate_bin'].astype('category').cat.set_categories(['lower', 'similar', 'higher'])
x = df_prepped.drop(columns = 'urate_bin')

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)

x_train_std = pd.DataFrame(StandardScaler().fit(x_train).transform(x_train),
                           columns = x_train.columns,
                           index = x_train.index)

x_test_std = pd.DataFrame(StandardScaler().fit(x_test).transform(x_test),
                          columns = x_test.columns, 
                          index = x_test.index)
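
The cell above fits one `StandardScaler` on the training split and a second one on the test split. A common alternative is to learn the mean and standard deviation from the training split only and reuse them for the test split, so both splits end up on the same scale. A minimal sketch with toy numbers (the values and the helper name `standardize` are illustrative, not part of the workshop data):

```python
from statistics import mean, pstdev

# Toy single-feature splits; illustrative values only.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 11.0]

# Learn the scaling parameters from the training split only.
# pstdev is the population standard deviation, which StandardScaler also uses.
mu = mean(train)
sigma = pstdev(train)

def standardize(values, mu, sigma):
    """Center and scale values with externally supplied parameters."""
    return [(v - mu) / sigma for v in values]

train_std = standardize(train, mu, sigma)
test_std = standardize(test, mu, sigma)  # reuses the training mean and std
```

With scikit-learn the same idea would be `scaler = StandardScaler().fit(x_train)` followed by `scaler.transform(x_test)`.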

************
# Null Model
Obtain and print the accuracy for the null model.

In [6]:
null_acc = np.mean(y_test == 'higher')
null_acc

0.43416937149601653
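
The cell above hardcodes `'higher'` as the predicted class. A minimal sketch of the same calculation that instead looks up the most frequent class in the training labels, using toy labels (hypothetical values, not the workshop data):

```python
from collections import Counter

# Toy labels; illustrative only.
y_train_toy = ['higher', 'lower', 'higher', 'similar', 'higher', 'lower']
y_test_toy = ['higher', 'similar', 'higher', 'lower']

# The null model always predicts the most frequent training class.
majority_class, _ = Counter(y_train_toy).most_common(1)[0]

# Its test accuracy is the share of test labels equal to that class.
null_acc_toy = sum(label == majority_class for label in y_test_toy) / len(y_test_toy)
```

Taking the majority class from the training labels keeps the baseline from peeking at the test split.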

***
# Gaussian Naive Bayes
Obtain and print the GNB test accuracy.

In [7]:
gnb = GaussianNB()
gnb.fit(x_train, y_train)

gnb_acc = gnb.score(x_test, y_test)
gnb_acc

0.49642962525818823

Obtain and print the percent improvement in test accuracy from the null model.

In [8]:
round((gnb_acc - null_acc)/null_acc*100, 2)

14.34
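
The percent improvement is the relative change $(\text{acc}_{model} - \text{acc}_{null}) / \text{acc}_{null} \times 100$. A small helper (the name `pct_improvement` is illustrative) reproduces the figure above from the two accuracies reported in this workshop:

```python
def pct_improvement(model_acc, baseline_acc):
    """Relative improvement of model_acc over baseline_acc, in percent."""
    return round((model_acc - baseline_acc) / baseline_acc * 100, 2)

# Accuracies copied from the outputs above.
null_acc = 0.43416937149601653
gnb_acc = 0.49642962525818823

gnb_gain = pct_improvement(gnb_acc, null_acc)
```

This is a relative figure: the absolute gain behind it is roughly 6.2 percentage points of accuracy.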

***
# KNN
Complete the following for loop.

*Hint: Lecture 11 Regression-Based Classification - Alternative Thresholds*.

In [9]:
kf = KFold(n_splits = 5, random_state = 490, shuffle = True)
# I am helping you out by identifying approximately where the optimal solution is.
# In general, I would start with
# [3, 5, 7, 10, 15, 20, 25]
# and adjust accordingly.
# There is no reason to suspect a priori that a smaller or larger value is best.
k_nbrs = [20, 30, 40]
accuracy = {}

for k in tqdm(k_nbrs):
    acc = []
    for trn, tst in kf.split(x_train_std):
        knn = KNeighborsClassifier(n_neighbors = k)
        knn.fit(x_train_std.iloc[trn], y_train.iloc[trn])
        y_hat = knn.predict(x_train_std.iloc[tst])
        acc.append(np.mean(y_hat == y_train.iloc[tst]))
    accuracy[k] = np.mean(acc)
    
accuracy

100%|██████████| 3/3 [01:06<00:00, 22.27s/it]


{20: 0.6635484835735472, 30: 0.6686829112737166, 40: 0.6667353383532633}
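
The loop above can also be written with `cross_val_score`. The sketch below runs on synthetic data from `make_classification` (an assumption made so the example is self-contained; it is not the workshop dataset) and wraps the scaler and the classifier in a pipeline, so the scaler is refit on the training part of every fold. The candidate values of `k` are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the workshop data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=490)

kf = KFold(n_splits=5, shuffle=True, random_state=490)
cv_accuracy = {}

for k in [3, 5, 7, 10, 15]:
    # Scaling happens inside the pipeline, once per training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # For a classifier the default score is accuracy; one value per fold.
    fold_scores = cross_val_score(model, X, y, cv=kf)
    cv_accuracy[k] = np.mean(fold_scores)

best_k = max(cv_accuracy, key=cv_accuracy.get)
```

Putting the scaler inside the pipeline means the held-out fold never influences the scaling parameters, which the manual loop on pre-standardized data does not guarantee.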

Find the optimal value of $k$, using either `max()` or a scatterplot.

In [10]:
print('max accuracy at k = %s' % max(accuracy, key = accuracy.get))

max accuracy at k = 30
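
As an alternative to `max()`, the cross-validated accuracies can be plotted against $k$. A minimal sketch using the three accuracies reported above (the axis labels and the dashed marker line are presentation choices, not part of the workshop):

```python
import matplotlib.pyplot as plt

# Cross-validated accuracies copied from the output above.
accuracy = {20: 0.6635484835735472, 30: 0.6686829112737166, 40: 0.6667353383532633}

ks = list(accuracy.keys())
accs = list(accuracy.values())
best_k = max(accuracy, key=accuracy.get)

fig, ax = plt.subplots()
ax.scatter(ks, accs)
ax.axvline(best_k, linestyle='--')  # mark the k with the highest accuracy
ax.set_xlabel('k (number of neighbors)')
ax.set_ylabel('mean cross-validated accuracy')
```

In a notebook the figure renders when the cell finishes; in a script, call `plt.show()`. With only three candidates the plot is coarse, so the peak at 30 identifies a neighborhood rather than an exact optimum.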


Refit the optimal KNN model on the training data.

In [12]:
knnc = KNeighborsClassifier(n_neighbors = 30)
knnc.fit(x_train_std, y_train)

KNeighborsClassifier(n_neighbors=30)

Obtain and print the test accuracy.

In [13]:
knn_acc = knnc.score(x_test_std, y_test)
knn_acc

Obtain and print the percent improvement in test accuracy from the null model.

In [14]:
round((knn_acc - null_acc)/null_acc*100, 2)

54.94