# Adjustments for Classification

The problems in this notebook correspond to the concepts covered in `Lectures/Supervised Learning/Classification/Adjustments for Classification`.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from seaborn import set_style

set_style("whitegrid")

### Looking for errors

Below you will find code written by various students who are learning supervised learning. Try to identify the errors each of them made while coding up an iris classifier using $k$-nearest neighbors.

In [2]:
## to get the iris data
from sklearn.datasets import load_iris

## Load the data
iris = load_iris()
iris_df = pd.DataFrame(iris['data'],columns = ['sepal_length','sepal_width','petal_length','petal_width'])
iris_df['iris_class'] = iris['target']

##### 1. Matt's code

In [3]:
## first I import train_test_split
from sklearn.model_selection import train_test_split

In [4]:
## Now I make the train test split
iris_train, iris_test = train_test_split(iris_df.copy(),
                                            shuffle=True,
                                            random_state=213)

In [5]:
## import KNN
from sklearn.neighbors import KNeighborsClassifier as KNN

In [6]:
knn = KNN(5)

knn.fit(iris_train[['sepal_length',
                    'sepal_width',
                    'petal_length',
                    'petal_width']],
           iris_train['iris_class'])

KNeighborsClassifier()

##### Write down Matt's misstep(s) here

##### Sample Solution

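Matt did not stratify his train test split. He should have passed `stratify=iris_df['iris_class']` to `train_test_split` so that the training and test sets keep the same class proportions as the full data set.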
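A minimal sketch of the corrected split, reusing Matt's data loading, `random_state`, and default test size. The names `train_counts` and `test_counts` are illustrative additions, not part of the original notebook; they only check that the three iris classes end up nearly balanced in both sets.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

## Load the data exactly as in the notebook
iris = load_iris()
iris_df = pd.DataFrame(iris['data'],
                       columns=['sepal_length', 'sepal_width',
                                'petal_length', 'petal_width'])
iris_df['iris_class'] = iris['target']

## Same split as Matt's, but stratified on the class labels
iris_train, iris_test = train_test_split(iris_df.copy(),
                                         shuffle=True,
                                         random_state=213,
                                         stratify=iris_df['iris_class'])

## Illustrative check (not in the original): class counts per set
train_counts = iris_train['iris_class'].value_counts().sort_index()
test_counts = iris_test['iris_class'].value_counts().sort_index()

print(train_counts)
print(test_counts)
```

Because the iris data has 50 observations per class, the printed counts for the three classes should differ by at most one within each set.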

##### 2. Kevin's Code

In [7]:
## first I import train_test_split
from sklearn.model_selection import train_test_split

In [8]:
## Now I make the train test split
iris_train, iris_test = train_test_split(iris_df.copy(),
                                            shuffle=True,
                                            random_state=213,
                                            stratify=iris_df['iris_class'])

In [9]:
## import KNN
from sklearn.neighbors import KNeighborsClassifier as KNN

## Now I import KFold
from sklearn.model_selection import KFold

## and accuracy
from sklearn.metrics import accuracy_score

In [10]:
## now I get the accuracy for each cv split for k = 5

kfold = KFold(5, shuffle=True, random_state=31344)

cv_accs = np.zeros(5)

i = 0
for train_index, test_index in kfold.split(iris_train):
    iris_tt = iris_train.iloc[train_index]
    iris_ho = iris_train.iloc[test_index]
    
    knn = KNN(5)
    
    knn.fit(iris_tt[['sepal_length',
                    'sepal_width',
                    'petal_length',
                    'petal_width']],
               iris_tt['iris_class'])
    
    pred = knn.predict(iris_ho[['sepal_length',
                    'sepal_width',
                    'petal_length',
                    'petal_width']])
    
    cv_accs[i] = accuracy_score(iris_ho['iris_class'].values, pred)
    
    i = i + 1
    
print(np.mean(cv_accs))

0.990909090909091


##### Write down Kevin's misstep(s) here

##### Sample Solution

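Kevin should have used `StratifiedKFold` instead of regular `KFold`, so that every cross-validation split preserves the class proportions of the training set. Note that, unlike `KFold.split`, `StratifiedKFold.split` also needs the class labels as its second argument.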
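A minimal sketch of Kevin's cross-validation loop with `StratifiedKFold` swapped in, keeping his seeds and $k = 5$. The `features` list and the `holdout_class_counts` check are illustrative additions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN

## Illustrative helper (not in the original): the feature columns
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

## Load the data and make the stratified train test split as before
iris = load_iris()
iris_df = pd.DataFrame(iris['data'], columns=features)
iris_df['iris_class'] = iris['target']

iris_train, iris_test = train_test_split(iris_df.copy(),
                                         shuffle=True,
                                         random_state=213,
                                         stratify=iris_df['iris_class'])

## StratifiedKFold replaces KFold; split() now also takes the labels
kfold = StratifiedKFold(5, shuffle=True, random_state=31344)

cv_accs = np.zeros(5)
holdout_class_counts = []

for i, (train_index, test_index) in enumerate(
        kfold.split(iris_train[features], iris_train['iris_class'])):
    iris_tt = iris_train.iloc[train_index]
    iris_ho = iris_train.iloc[test_index]

    knn = KNN(5)
    knn.fit(iris_tt[features], iris_tt['iris_class'])

    pred = knn.predict(iris_ho[features])
    cv_accs[i] = accuracy_score(iris_ho['iris_class'].values, pred)

    ## Illustrative check: how many of each class are in this holdout set
    holdout_class_counts.append(
        np.bincount(iris_ho['iris_class'].values, minlength=3))

print(np.mean(cv_accs))
print(holdout_class_counts)
```

Each holdout set should now contain a similar number of observations from every class, which plain `KFold` does not guarantee.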

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute, subject to the license (see License.md).