# Class 22: Classification

Plan for today:
- Review/continuation of cross-validation
- Other classifiers
- Building a kNN classifier
- Feature normalization


In [63]:
import YData

# YData.download.download_class_code(22)   # get class code    
# YData.download.download_class_code(22, True) # get the code with the answers 

# YData.download.download_homework(9)  # downloads the homework 

# project review template
# YData.download.download_class_file('reviewer_template.ipynb', 'homework')


If you are using Google Colab, you should install the YData package by uncommenting and running the code below.

In [65]:
# !pip install https://github.com/emeyers/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')

In [67]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

## 1. Review: features and labels

In supervised machine learning, we use a computer algorithm called a "pattern classifier" to learn relationships between a set of features X, and a label y. When the classifier is given new examples X, it can then make new predictions y. 


In [68]:
penguins = sns.load_dataset("penguins")

penguins = penguins.dropna()

penguins = penguins.sample(frac = 1)

penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
58,Adelie,Biscoe,36.5,16.6,181.0,2850.0,Female
155,Chinstrap,Dream,45.4,18.7,188.0,3525.0,Female
229,Gentoo,Biscoe,46.8,15.4,215.0,5150.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
193,Chinstrap,Dream,46.2,17.5,187.0,3650.0,Female


To begin the classification process, let's store the features (X) and the labels (y) in separate names called `X_penguin_features` and `y_penguin_labels` respectively. 

In [69]:
# get the features and the labels

X_penguin_features = penguins[['bill_length_mm', 'bill_depth_mm','flipper_length_mm', 'body_mass_g']]

y_penguin_labels = penguins['species']


## 2. Review and continuation of cross-validation

To detect over-fitting and get an honest estimate of how well a classifier works on new data, we need to split our data into a training set and a test set. 

The classifier "learns" the relationship between features (X) and labels (y) on the **training set**.

The classifier makes predictions on the features (X) of the **test set**. 

We compare the classifier's predictions on the test features (X) to the actual labels y, to get a more accurate assessment of the **classification accuracy**.


Let's try this now...



We can use the scikit-learn `train_test_split()` function to generate training and test splits of our data.
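
As a rough sketch of the workflow described above, the following example splits data, fits a kNN classifier, and compares test accuracy with training accuracy. To keep it runnable offline it uses a small synthetic dataset rather than the penguins; the names `X_demo`, `y_demo`, and `knn_demo` are placeholders made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the penguin data (an assumption for this sketch):
# 3 classes, 4 features, 50 points per class
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [4, 4, 4, 4], [8, 0, 8, 0]])
X_demo = np.vstack([rng.normal(c, 1.0, size=(50, 4)) for c in centers])
y_demo = np.repeat(["A", "B", "C"], 50)

# By default train_test_split shuffles the rows and holds out 25% as a test set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# "Train" a 1-nearest-neighbor classifier (it memorizes the training data)
knn_demo = KNeighborsClassifier(n_neighbors=1)
knn_demo.fit(X_tr, y_tr)

# Test accuracy computed by hand and with the .score() method
test_predictions = knn_demo.predict(X_te)
manual_accuracy = np.mean(test_predictions == y_te)
test_accuracy = knn_demo.score(X_te, y_te)

# Training accuracy is overly optimistic: with k = 1 every training point
# is its own nearest neighbor, so it is always classified correctly
train_accuracy = knn_demo.score(X_tr, y_tr)
print(manual_accuracy, test_accuracy, train_accuracy)
```

The perfect training accuracy says nothing about how the classifier handles new data, which is why the test-set accuracy is the number to report.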

In [72]:
from sklearn.model_selection import train_test_split


# split data into a training and test set
X_train, X_test, y_train, y_test = train_test_split(X_penguin_features,  y_penguin_labels, random_state = 0)





# print the shape




In [73]:
from sklearn.neighbors import KNeighborsClassifier


# construct a classifier



# “train” the classifier 
# (which for a KNN classifier just involves memorizing the training data)






In [74]:
# get the predictions




In [75]:
# Get the prediction accuracy 




In [76]:
# Test the classifier on the test set using the .score() method



In [77]:
# What happens if we test the classifier on the training set? 


# prediction accuracy on the training set





### K-fold cross-validation

In k-fold cross-validation we split our data into k parts. (Note that the k here has no relation to the k in k-Nearest Neighbors; k is simply a letter commonly used in math to denote an integer.)  

To run a k-fold cross-validation analysis, we train the classifier on k-1 parts of the data and test it on the remaining part. We repeat this process k times to get k classification accuracies. We then take the average of these results as our estimate of our overall classification accuracy. 

We can use the scikit-learn `cross_val_score()` function to do this easily...
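
A minimal sketch of 5-fold cross-validation with `cross_val_score()` might look like the following. It again uses synthetic data so that it runs offline; `X_cv`, `y_cv`, and `knn_cv` are illustrative names, not part of the class code.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data (an assumption for this sketch)
rng = np.random.default_rng(1)
centers = np.array([[0, 0, 0, 0], [3, 3, 3, 3], [6, 0, 6, 0]])
X_cv = np.vstack([rng.normal(c, 1.0, size=(50, 4)) for c in centers])
y_cv = np.repeat(["A", "B", "C"], 50)

# construct a kNN classifier
knn_cv = KNeighborsClassifier(n_neighbors=5)

# cv=5 asks for 5 folds; for classifiers, scikit-learn uses stratified folds,
# so each fold keeps roughly the same class proportions
cv_scores = cross_val_score(knn_cv, X_cv, y_cv, cv=5)

print(cv_scores)         # one accuracy per fold
print(cv_scores.mean())  # overall estimate of classification accuracy
```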


In [78]:
from sklearn.model_selection import cross_val_score


# construct knn classifier


# do 5-fold cross-validation







## 3. Other classifiers

Many other types of classifiers have been created. Scikit-learn makes it very easy to try out a range of classifiers. 

Let's explore the Support Vector Machine and Random Forest classifiers on our penguin data...
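
Because scikit-learn classifiers share the same interface, swapping one for another is mostly a matter of changing the constructor. The sketch below illustrates this on synthetic data (an assumption, so the example runs offline); the dictionary `classifiers` and the hyperparameter values are illustrative choices only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic 3-class data (an assumption for this sketch)
rng = np.random.default_rng(2)
centers = np.array([[0, 0, 0, 0], [3, 3, 3, 3], [6, 0, 6, 0]])
X_other = np.vstack([rng.normal(c, 1.0, size=(50, 4)) for c in centers])
y_other = np.repeat(["A", "B", "C"], 50)

# Each classifier is constructed differently but evaluated the same way
classifiers = {
    "linear SVM": LinearSVC(max_iter=10000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

mean_scores = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_other, y_other, cv=5)
    mean_scores[name] = scores.mean()
    print(name, round(mean_scores[name], 3))
```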


In [79]:
# Suppress ConvergenceWarning - please ignore this code 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FutureWarning)


# Try a support vector machine (SVM)

from sklearn.svm import LinearSVC

# construct an SVM  # max_iter=10000



# get prediction accuracies





In [80]:
# Try a random forest

from sklearn.ensemble import RandomForestClassifier

# construct a random forest classifier






## 4. Building the KNN classifier

So far we have used the KNN classifier (and a few other classifiers). Let's now see if we can write code that will implement the KNN classifier.

We will do this by writing several helper functions that build on each other. These functions are: 

1. `euclid_dist(x1, x2)`: finds the Euclidean distance between two points `x1` and `x2`

2. `get_labels_and_distances(test_point, X_train_features, y_train_labels)`: This function finds the distance between a test point and all the training points. It returns a DataFrame with the training label of each training point and its distance to the test point.

3. `classify_point(test_point, k, X_train_features, y_train_labels)`: Classifies which class a test point belongs to

4. `classify_all_test_data(X_test_data, k, X_train_features, y_train_labels)`: Classifies which class each test point belongs to.
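
One way the four helpers listed above could fit together is sketched below. This is only one possible implementation, not necessarily the one developed in class: the column names `label` and `distance`, the tie-breaking rule, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def euclid_dist(x1, x2):
    """Euclidean distance between two equal-length numeric vectors."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.sqrt(np.sum(diff ** 2))


def get_labels_and_distances(test_point, X_train_features, y_train_labels):
    """One row per training point: its label and its distance to test_point."""
    the_distances = [
        euclid_dist(test_point, X_train_features.iloc[i])
        for i in range(X_train_features.shape[0])
    ]
    return pd.DataFrame({"label": y_train_labels.to_numpy(),
                         "distance": the_distances})


def classify_point(test_point, k, X_train_features, y_train_labels):
    """Majority label among the k training points closest to test_point."""
    labels_and_distances = get_labels_and_distances(
        test_point, X_train_features, y_train_labels)
    k_closest = labels_and_distances.sort_values("distance").head(k)
    # .mode() returns tied labels in sorted order, so ties go to the first
    # label alphabetically (an arbitrary choice for this sketch)
    return k_closest["label"].mode().iloc[0]


def classify_all_test_data(X_test_data, k, X_train_features, y_train_labels):
    """Predicted label for every row of X_test_data."""
    predictions = [
        classify_point(X_test_data.iloc[i], k, X_train_features, y_train_labels)
        for i in range(X_test_data.shape[0])
    ]
    return np.array(predictions)


# Toy data (hypothetical): two well-separated classes
toy_train_X = pd.DataFrame({"f1": [0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
                            "f2": [0.0, 0.2, 0.1, 5.0, 5.2, 5.1]})
toy_train_y = pd.Series(["A", "A", "A", "B", "B", "B"])
toy_test_X = pd.DataFrame({"f1": [0.05, 4.9], "f2": [0.05, 5.3]})

toy_predictions = classify_all_test_data(toy_test_X, 3, toy_train_X, toy_train_y)
print(toy_predictions)
```

Looping over rows with `.iloc` keeps the code close to the step-by-step description; a vectorized version would be faster on large datasets but harder to read.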


Let's start by writing a function that can get the Euclidean distance between two points `x` and `z`: 

$$dist(x, z) = \sqrt{\sum_{i = 1}^{d} (x_i - z_i)^2}$$
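
To make the formula concrete, here is a small worked example that applies it term by term to two 4-dimensional vectors and cross-checks the result against NumPy's built-in norm. The vectors are arbitrary values chosen for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
z = np.array([2, 3, 4, 5])

# (x_i - z_i)^2 for each of the d = 4 dimensions
squared_diffs = (x - z) ** 2

# sum over the dimensions, then take the square root
dist = np.sqrt(squared_diffs.sum())
print(dist)

# np.linalg.norm of the difference vector computes the same quantity
print(np.linalg.norm(x - z))
```

Each coordinate differs by 1, so the sum of squared differences is 4 and the distance is 2.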


In [81]:
def euclid_dist(x1, x2):
    ...


    
# test our function 
my_vec1 = np.array([1, 2, 3, 4])
my_vec2 = np.array([2, 3, 4, 5])



In [82]:
# Let's now write a function that returns the labels and distances 
# between a test point and all the training points


def get_labels_and_distances(test_point, X_train_features, y_train_labels):
    
    the_distances = []
    
    # get the distance between the test point and all training points
    
    

    
    # Create a DataFrame with the training labels and distances 

    



# test our code 

test_data_point = X_test.iloc[0]
test_label = y_test.iloc[0]

labels_and_distances = get_labels_and_distances(test_data_point, X_train, y_train)

labels_and_distances

In [83]:
# get the k closest neighbors








In [84]:
# get the majority label







In [85]:
# write a function to do the classification on a test point 
# by putting together all the pieces

def classify_point(test_point, k, X_train_features, y_train_labels):
    
    # get the labels and distances DataFrame

    
    
    # Sort the data frame and get k closest rows

    
    
    
    # get the majority class

    ...

    



# test our classifier on one test point
prediction = classify_point(test_data_point, 5, X_train, y_train)

print(prediction)

print(test_label)

None
Gentoo


In [86]:
# classify a full test set

def classify_all_test_data(X_test_data, k, X_train_features, y_train_labels):
    
    predictions = []
    
    # loop through all test points and get all predictions 

    
    
    
    
    
    
# test the classifier on the whole test set    
all_predictions = classify_all_test_data(X_test, 5, X_train, y_train)

all_predictions


In [87]:
# get the classification accuracy



## 5. Feature normalization

If you look at the features we have been using in our analyses, you will notice that they are on very different scales. This is a problem for a KNN classifier because it computes distances between data points, so features with large numeric ranges (such as body mass in grams) will dominate the distance. 

Let's explore the scales that different features have by looking at some descriptive statistics. In particular, let's go back to the manually created `X_train`, `X_test`, `y_train`, `y_test` to examine the scale that different features are measured on.
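
To see how strongly one feature can dominate, the sketch below takes the feature values of two penguins shown in the `penguins.head()` output earlier (an Adelie and a Gentoo) and computes how much each feature contributes to their squared Euclidean distance. The variable names are made up for this illustration.

```python
import numpy as np

feature_names = ["bill_length_mm", "bill_depth_mm",
                 "flipper_length_mm", "body_mass_g"]

# Feature values copied from two rows of the penguins.head() output above
adelie = np.array([36.5, 16.6, 181.0, 2850.0])
gentoo = np.array([46.8, 15.4, 215.0, 5150.0])

# Contribution of each feature to the squared Euclidean distance
squared_diffs = (adelie - gentoo) ** 2
share_of_distance = squared_diffs / squared_diffs.sum()

for name, share in zip(feature_names, share_of_distance):
    print(f"{name}: {share:.6f}")
```

Body mass accounts for more than 99.9% of the squared distance here, so on the raw features the classifier is effectively comparing penguins by weight alone.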


In [88]:
# Create the training and test splits of the data using train_test_split



# Get summary statistics of the training data using the .describe() method



Let's do a z-score transformation of our features, which sets the mean of each feature to 0 and the standard deviation to 1. We can do this using the `StandardScaler()` object as follows: 

1. Create a new `StandardScaler()` object using `scalar = StandardScaler()` 

2. Have the `scalar` object learn the means and standard deviations of our training data by calling the `scalar.fit(X)` method on the training data.

3. Use the fitted `scalar` object to transform both the training and test features so that all features are on a similar scale by calling the `.transform(X)` method. 
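
The three steps above can be sketched as follows. The data here are randomly generated stand-ins with two features on very different scales (an assumption for this example); `X_train_demo` and `X_test_demo` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature in the tens, one in the thousands
rng = np.random.default_rng(3)
X_train_demo = np.column_stack([rng.normal(45, 5, 100),
                                rng.normal(4000, 800, 100)])
X_test_demo = np.column_stack([rng.normal(45, 5, 30),
                               rng.normal(4000, 800, 30)])

# 1. create the scaler
scalar = StandardScaler()

# 2. learn the means and standard deviations from the training data only
scalar.fit(X_train_demo)

# 3. apply the same transformation to both the training and the test data
X_train_scaled = scalar.transform(X_train_demo)
X_test_scaled = scalar.transform(X_test_demo)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3), X_test_scaled.std(axis=0).round(3))
```

The transformed training features have mean 0 and standard deviation 1 (up to rounding), while the test features are only approximately standardized because they are scaled with the training set's statistics. Fitting on the training data alone keeps information about the test set from leaking into the model.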


In [89]:
from sklearn.preprocessing import StandardScaler


# learning the mean and standard deviations to scale the features






In [90]:
# z-score transform the features 






Let's now look at our transformed training data...

In [91]:
# view descriptive statistics on the transformed features





Let's see how our classification accuracy changes using the z-score transformed data.

In [92]:
# apply KNN classification on the normalized features





In order to transform our features inside a cross-validation loop, we can set up a pipeline. This pipeline will do the following:

1. It will split the data into a training and test set
2. It will fit the transformation of the features on the training set (i.e., learn the means and standard deviations on the training set). 
3. It will apply a z-score transformation to the training and test sets based on the means and standard deviations learned in step 2
4. It will train the classifier on the transformed data
5. It will measure the classification accuracy on the test data
6. It will repeat this process k times, where k here refers to how many cross-validation splits we are using

In order to do this in scikit-learn we can use a `Pipeline` object which sets up the stages of transformation and classification, along with a `KFold` object which will run the cross-validation.  
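
A minimal sketch of such a pipeline is shown below, assuming synthetic two-feature data in place of the penguins so that it runs offline. The step names `"scalar"` and `"knn"` and the choice of `shuffle=True` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with features on very different scales (an assumption)
rng = np.random.default_rng(4)
centers = np.array([[40.0, 3500.0], [47.0, 3700.0], [47.0, 5000.0]])
X_pipe = np.vstack([rng.normal(c, [3.0, 400.0], size=(50, 2)) for c in centers])
y_pipe = np.repeat(["A", "B", "C"], 50)

# components that go into the pipeline
scalar = StandardScaler()
knn = KNeighborsClassifier(n_neighbors=5)

# The rows above are ordered by class, so shuffle before splitting into folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# build the pipeline: normalize the features, then classify
pipeline = Pipeline([("scalar", scalar), ("knn", knn)])

# For each split, the scaler is fit on the training folds only and then
# applied to both the training folds and the held-out fold
pipeline_scores = cross_val_score(pipeline, X_pipe, y_pipe, cv=cv)

print(pipeline_scores)
print(pipeline_scores.mean())
```

Passing the whole pipeline to `cross_val_score()` is what keeps the normalization inside the cross-validation loop; scaling the full dataset once beforehand would let the held-out fold influence the means and standard deviations.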

In [93]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold


# create a pipeline for running cross-validation with feature normalization

# components that go into the pipeline



# build the pipeline



# get the cross-validation scores



# print out the mean score over the 5 cross-validation splits
