## Workshop - ML Classification

In this workshop we will:

* obtain the null model accuracy
* obtain a Gaussian naive Bayes accuracy
* cross-validate a KNN classifier and obtain the accuracy

Run this code. **Notice the alternative standardization technique.**

In [2]:
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error
from sklearn import linear_model as lm

from sklearn.model_selection import KFold, train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import LabelEncoder,StandardScaler

from tqdm import tqdm

import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('class_data.csv')

In [4]:
df.set_index('GeoName', append = True, inplace = True)

In [5]:
df_prepped = df.drop(columns = ['year']).join([
    pd.get_dummies(df.year, drop_first = False)    
])

In [6]:
y = df_prepped['urate_bin'].astype('category')
x = df_prepped.drop(columns = 'urate_bin')

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)

x_train_std = pd.DataFrame(StandardScaler().fit(x_train).transform(x_train),
                           columns = x_train.columns,
                           index = x_train.index)

x_test_std = pd.DataFrame(StandardScaler().fit(x_test).transform(x_test),
                          columns = x_test.columns, 
                          index = x_test.index)
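
The cell above fits one scaler per split and wraps each result back into a DataFrame so that column names and the index survive. A commonly used variant, shown here only as a sketch on synthetic data, fits the scaler on the training rows and reuses it for the test rows. The names `demo_train`, `demo_test`, `f1`, and `f2` are hypothetical and not part of the workshop data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the workshop features (hypothetical column names).
rng = np.random.default_rng(490)
demo_train = pd.DataFrame(rng.normal(10, 3, size=(60, 2)), columns=["f1", "f2"])
demo_test = pd.DataFrame(rng.normal(10, 3, size=(30, 2)), columns=["f1", "f2"],
                         index=range(60, 90))

# Fit the scaler on the training rows only...
scaler = StandardScaler().fit(demo_train)

# ...then apply the same means and standard deviations to both splits,
# wrapping the arrays back into DataFrames to keep columns and index.
demo_train_std = pd.DataFrame(scaler.transform(demo_train),
                              columns=demo_train.columns, index=demo_train.index)
demo_test_std = pd.DataFrame(scaler.transform(demo_test),
                             columns=demo_test.columns, index=demo_test.index)
```

With this variant the test features are expressed on the training scale, so their means are close to, but not exactly, zero.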

************
# Null Model
Obtain and print the accuracy for the null model.

In [7]:
null_acc = np.mean(y_test == 'higher')
null_acc

0.43416937149601653
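
The cell above hard-codes `'higher'` as the class the null model predicts. As a minimal sketch, assuming the null model is meant to predict the most frequent training class, that class can be looked up instead of typed in. The label values and the `demo_` names below are made up for illustration.

```python
import pandas as pd

# Hypothetical labels standing in for y_train and y_test.
demo_y_train = pd.Series(["lower", "lower", "higher", "lower", "higher", "lower"])
demo_y_test = pd.Series(["lower", "higher", "lower", "higher"])

# The null model ignores the features and always predicts the most
# frequent class seen in the training labels.
majority_class = demo_y_train.mode()[0]

# Its test accuracy is the share of test labels equal to that class.
demo_null_acc = (demo_y_test == majority_class).mean()
print(majority_class, demo_null_acc)
```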

***
# Gaussian Naive Bayes
Obtain and print the GNB test accuracy.

In [8]:
gnb = GaussianNB()
gnb.fit(x_train, y_train)

gnb_acc = (gnb.score(x_test, y_test))
gnb_acc

0.5061079964591325

Obtain and print the percent improvement in test accuracy from the null model.

In [9]:
round((gnb_acc - null_acc)/null_acc*100, 2)

16.57
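
The same calculation is needed again for KNN, so it may be convenient to wrap it in a small helper. The function name `pct_improvement` is an assumption introduced here; the two accuracies are the values printed earlier in this workshop.

```python
def pct_improvement(model_acc, baseline_acc):
    """Relative gain of model_acc over baseline_acc, in percent."""
    return (model_acc - baseline_acc) / baseline_acc * 100

# GNB test accuracy versus the null model accuracy from the cells above.
print(round(pct_improvement(0.5061079964591325, 0.43416937149601653), 2))
```

The denominator is the null accuracy, so the result is a relative improvement, not a difference in percentage points.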

***
# KNN
Complete the following for loop.

*Hint: Lecture 11 Regression-Based Classification - Alternative Thresholds*.

In [13]:
kf = KFold(n_splits = 5, random_state = 490, shuffle = True)
# I am helping you out by identifying approximately where the optimal solution is.
# In general, I would start with
# [3, 5, 7, 10, 15, 20, 25]
# and adjust accordingly.
# There is no reason to suspect a smaller or higher value is best a priori.
k_nbrs = [20, 30, 40]
accuracy = {}

for k in tqdm(k_nbrs):
    acc = []
    for trn, tst in kf.split(x_train_std):
        # Fit on the training folds: all standardized features, with y_train as the label
        knn_k = KNeighborsClassifier(n_neighbors = k).fit(x_train_std.iloc[trn], y_train.iloc[trn])
        # Predict class labels for the held-out fold
        yhat = knn_k.predict(x_train_std.iloc[tst])
        
        acc.append(np.mean(yhat == y_train.iloc[tst].to_numpy()))
    accuracy[k] = np.mean(acc)
    
accuracy
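
The cross-validation loop above can be tried end to end on synthetic data. This is a sketch, not the workshop solution: `make_classification` stands in for `class_data.csv`, and `X_demo`, `y_demo`, `kf_demo`, and `cv_accuracy` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for x_train_std and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=490)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=490)
cv_accuracy = {}

for k in [3, 5, 7, 10, 15, 20, 25]:
    fold_acc = []
    for trn, tst in kf_demo.split(X_demo):
        # Fit on the training folds: features and labels use the same rows.
        model = KNeighborsClassifier(n_neighbors=k).fit(X_demo[trn], y_demo[trn])
        # score() returns the accuracy on the held-out fold.
        fold_acc.append(model.score(X_demo[tst], y_demo[tst]))
    # Average the five fold accuracies for this k.
    cv_accuracy[k] = np.mean(fold_acc)

best_k = max(cv_accuracy, key=cv_accuracy.get)
print(best_k, cv_accuracy[best_k])
```

The labels passed to `fit` must be the class labels. Passing a standardized feature column instead is what triggers scikit-learn's `Unknown label type: 'continuous'` error.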

Find the optimal value of $k$, either with `max()` or by producing a scatterplot.

In [14]:
# Key of the dictionary entry with the highest mean cross-validated accuracy
print(max(accuracy, key = accuracy.get))
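
For the scatterplot route, a minimal sketch is given below. The dictionary `demo_accuracy` holds placeholder numbers chosen only to make the plot render; they are not results from this workshop.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line in a notebook
import matplotlib.pyplot as plt

# Placeholder values for illustration only; these are NOT workshop results.
demo_accuracy = {20: 0.60, 30: 0.62, 40: 0.61}

ks = list(demo_accuracy.keys())
accs = list(demo_accuracy.values())

fig, ax = plt.subplots()
ax.scatter(ks, accs)
ax.set_xlabel("k (number of neighbors)")
ax.set_ylabel("Mean cross-validated accuracy")

# Mark the k with the highest accuracy, matching what max() reports.
demo_best_k = max(demo_accuracy, key=demo_accuracy.get)
ax.axvline(demo_best_k, linestyle="--")
```

If the highest point sits at either end of the grid, that suggests extending `k_nbrs` in that direction before settling on a value.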

Refit the optimal KNN model on the training data.

In [None]:
knnc = KNeighborsClassifier(n_neighbors = max(accuracy, key = accuracy.get))
knnc.fit(x_train_std, y_train)

Obtain and print the test accuracy.

In [14]:
# Score the refitted model (named knnc above) on the standardized test set
knn_acc = knnc.score(x_test_std, y_test)
knn_acc

Wall time: 25.9 s


Obtain and print the percent improvement in test accuracy from the null model.

In [None]:
round((knn_acc - null_acc)/null_acc*100, 2)