# Class 24: Classification

Plan for today:
- Classification features
- Introduction to Machine Learning
- KNN classifier


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(24)   # get class code    
# YData.download.download_class_code(24, True) # get the code with the answers 


There are also similar functions to download the homework:

In [None]:
# YData.download.download_homework(9)  # downloads the homework 

If you are using Google Colab, you should install the YData package by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# To start, let's use a function that generates a statistic p-hat that is consistent with a particular population parameter value pi

def generate_prop_bechdel(n, null_prop):
    
    random_sample = np.random.rand(n) <= null_prop
    return np.mean(random_sample)

generate_prop_bechdel(1794, .5)


## 1. Intro to Machine Learning:  Features (X) and labels (y)

In supervised machine learning, we use a computer algorithm called a "pattern classifier" to learn relationships between a set of features X, and a label y. When the classifier is given new examples X, it can then make new predictions y. 


In [None]:
penguins = sns.load_dataset("penguins")

penguins = penguins.dropna()

penguins = penguins.sample(frac = 1)

penguins.head()

In [None]:
# Let's explore how many members of each species there are in our data set

species_counts = penguins.groupby("species").agg(count = ('island', 'count'))

species_counts


#### Questions: 

1. If we had to guess the species of a penguin without knowing any of the penguin's features, which species of penguin should we guess? 
A: Always guess Adelie


2. If we were to follow the optimal guessing strategy, what percent of our guesses would be correct (i.e., what would our classification accuracy be)? 
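To make the guessing strategy concrete, here is a minimal sketch on a small invented set of labels (the name `toy_labels` and the counts are made up for illustration and are not the real penguin data). Always guessing the most common class gives an accuracy equal to that class's proportion.

```python
import pandas as pd

# Invented labels for illustration only (not the real penguin counts)
toy_labels = pd.Series(["Adelie"] * 5 + ["Gentoo"] * 3 + ["Chinstrap"] * 2)

# Proportion of each class, sorted from most to least common
class_proportions = toy_labels.value_counts(normalize=True)

# The best feature-free strategy is to always guess the most common class
majority_class = class_proportions.idxmax()
baseline_accuracy = class_proportions.max()

print(majority_class, baseline_accuracy)
```

Any classifier we build should beat this baseline accuracy to be considered useful.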


In [None]:
species_counts['count']/sum(species_counts['count'])

To begin the classification process, let's store the features (X) and the labels (y) in separate names called `X_penguin_features` and `y_penguin_labels` respectively. 
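As a rough sketch of what separating features from labels looks like, the code below uses a tiny invented DataFrame. The names `toy_penguins`, `X_toy_features`, and `y_toy_labels` and all the values are assumptions for illustration; only two of the feature columns are shown.

```python
import pandas as pd

# A tiny invented table with the same kind of columns as the penguins data
toy_penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, 46.5, 49.0, 38.2],
    "body_mass_g": [3750.0, 5000.0, 3800.0, 3600.0],
})

# Features (X): the numeric measurements we will predict from
feature_columns = ["bill_length_mm", "body_mass_g"]
X_toy_features = toy_penguins[feature_columns]

# Labels (y): the category we want to predict
y_toy_labels = toy_penguins["species"]

print(X_toy_features.shape)
print(y_toy_labels.shape)
```

Each row of X lines up with the entry of y in the same position, which is what lets the classifier pair features with labels.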

In [None]:
# get the features and the labels

X_penguin_features = ...    # 4 features

y_penguin_labels = ...


## 2. k-Nearest Neighbors classifier


To explore classification, let's use a k-Nearest Neighbors classifier to predict the species of a penguin based on particular features the penguin has such as the penguin's bill length and body mass. 

Let's construct a K-Nearest Neighbor classifier (KNN) using 5 neighbors for predictions (i.e., k = 5 so we are using a 5-Nearest Neighbor classifier). 

We can do this using scikit-learn's `KNeighborsClassifier(n_neighbors = )` constructor.  
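The full construct, train, predict, and score cycle can be sketched on invented data as follows. The two well-separated clusters and the names `X_toy`, `y_toy`, and `knn_toy` are assumptions for illustration; the penguin version would use the features and labels defined above instead.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented data: two well-separated clusters of 2-D points
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(20, 2)),
                   rng.normal(10, 1, size=(20, 2))])
y_toy = np.array(["A"] * 20 + ["B"] * 20)

# Construct a 5-nearest-neighbor classifier
knn_toy = KNeighborsClassifier(n_neighbors=5)

# "Train" the classifier (KNN just stores the training data)
knn_toy.fit(X_toy, y_toy)

# Make predictions (here on the same data the classifier was trained on)
toy_predictions = knn_toy.predict(X_toy)

# Classification accuracy: the proportion of predictions that are correct
toy_accuracy = np.mean(toy_predictions == y_toy)
print(toy_accuracy)
```

Note that this sketch predicts on the same points it was trained on, so the accuracy it reports is optimistic.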


In [None]:
from sklearn.neighbors import KNeighborsClassifier

# construct a 5-nearest neighbor classifier



Let's now train the classifier (the KNN classifier just stores the data during training)


In [None]:
# “train” the classifier (which for a KNN classifier just involves memorizing the training data)




Let's now use the classifier to make predictions

In [None]:
# make predictions





Let's get the prediction accuracy (classification accuracy), which is the proportion of predictions that are correct.

In [None]:
# get the classification accuracy



Let's repeat our analysis with k = 1 to see what happens...

In [None]:
# What happens if k = 1?

# construct a classifier


# “train” the classifier (which for a KNN classifier just involves memorizing the training data)


# make predictions


# get classification accuracy




Do we believe we have a perfect classifier???
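One way to probe this question is to give a 1-nearest-neighbor classifier labels that are pure noise. The sketch below uses invented random data (`X_noise`, `y_noise`, and the other names are made up for illustration). Because each training point's nearest neighbor is itself, the accuracy on the training data is perfect even though there is nothing to learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Invented features and labels that have no relationship to each other
X_noise = rng.normal(size=(100, 4))
y_noise = rng.choice(["A", "B", "C"], size=100)

knn_1 = KNeighborsClassifier(n_neighbors=1)
knn_1.fit(X_noise, y_noise)

# Each training point is its own nearest neighbor (distance 0),
# so the classifier recovers every training label
train_accuracy = knn_1.score(X_noise, y_noise)

# New random points with new random labels tell a different story
X_new = rng.normal(size=(100, 4))
y_new = rng.choice(["A", "B", "C"], size=100)
new_accuracy = knn_1.score(X_new, y_new)

print(train_accuracy, new_accuracy)
```

A perfect score on data the classifier has already seen therefore says little about how it will do on new data.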


## 3. Cross-validation

To check for over-fitting, we need to split our data into a training set and a test set. 

The classifier "learns" the relationship between features (X) and labels (y) on the **training set**.

The classifier makes predictions on the features (X) of the **test set**. 

We compare the classifier's predictions on the test features (X) to the actual labels y to get a more accurate assessment of the **classification accuracy**.
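Here is a minimal sketch of the train/test procedure on invented data, using scikit-learn's `train_test_split()`. The toy clusters, the 25% test size, and the names `X_tr`, `X_te`, `y_tr`, and `y_te` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Invented data: two overlapping clusters of 2-D points
rng = np.random.default_rng(2)
X_toy = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(3, 1, size=(50, 2))])
y_toy = np.array(["A"] * 50 + ["B"] * 50)

# Hold out 25% of the examples as a test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)

# Learn on the training set only
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)

# Compare accuracy on seen data versus held-out data
train_acc = knn.score(X_tr, y_tr)
test_acc = knn.score(X_te, y_te)
print(train_acc, test_acc)
```

The test accuracy is the number to report, since the classifier never saw those examples during training.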


Let's try this now...



In [None]:
# manually create a training set with 250 examples, and a test set that has the rest of the data

X_train_manual = ...
y_train_manual = ...

X_test_manual = ...
y_test_manual = ...


# print the shape of training and test sets 




In [None]:
from sklearn.model_selection import train_test_split

# split data into a training and test set





# print the shape




In [None]:
from sklearn.neighbors import KNeighborsClassifier


# construct a classifier



# “train” the classifier 
# (which for a KNN classifier just involves memorizing the training data)






In [None]:
# get the predictions




In [None]:
# Get the prediction accuracy 




In [None]:
# Test the classifier on the test set using the .score() method



In [None]:
# What happens if we test the classifier on the training set? 


# prediction accuracy on the training set





### K-fold cross-validation

In k-fold cross-validation we split our data into k parts (note: the k here has no relation to the k in k-Nearest Neighbors; k is simply a letter frequently used in math to denote an integer).  

To run a k-fold cross-validation analysis, we train the classifier on k-1 parts of the data and test it on the remaining part. We repeat this process k times to get k classification accuracies. We then take the average of these results as our estimate of our overall classification accuracy. 

We can use the scikit-learn `cross_val_score()` function to easily do this...
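To see what the k-fold procedure does, the sketch below first runs it by hand on invented data and then calls `cross_val_score()`. The toy data and the names `folds`, `manual_accuracies`, and `sk_accuracies` are assumptions for illustration. The two results need not match exactly because the folds are chosen differently (for classifiers, `cross_val_score()` with an integer `cv` uses stratified folds).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Invented data: two overlapping clusters of 2-D points
rng = np.random.default_rng(3)
X_toy = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(3, 1, size=(50, 2))])
y_toy = np.array(["A"] * 50 + ["B"] * 50)

# Manual 5-fold cross-validation: shuffle the row indices, cut them into 5 parts
folds = np.array_split(rng.permutation(len(X_toy)), 5)

manual_accuracies = []
for i, test_idx in enumerate(folds):
    # Train on the other 4 parts, test on the held-out part
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_toy[train_idx], y_toy[train_idx])
    manual_accuracies.append(knn.score(X_toy[test_idx], y_toy[test_idx]))

print(np.mean(manual_accuracies))

# The same idea in one call
sk_accuracies = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                                X_toy, y_toy, cv=5)
print(sk_accuracies.mean())
```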


In [None]:
from sklearn.model_selection import cross_val_score


# construct knn classifier


# do 5-fold cross-validation







## 4. Other classifiers

Many other types of classifiers have been created. Scikit-learn makes it very easy to try out a range of classifiers. 

Let's explore the Support Vector Machine and Random Forest classifiers on our penguin data...
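Because scikit-learn classifiers share the same interface, swapping one for another is mostly a one-line change. The sketch below compares three classifiers on invented data using 5-fold cross-validation. The toy data, the `random_state` values, and the dictionary `toy_classifiers` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Invented data: two overlapping clusters of 2-D points
rng = np.random.default_rng(4)
X_toy = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(3, 1, size=(50, 2))])
y_toy = np.array(["A"] * 50 + ["B"] * 50)

# Every classifier is constructed differently but used the same way
toy_classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": LinearSVC(max_iter=10000),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

mean_accuracies = {}
for name, classifier in toy_classifiers.items():
    scores = cross_val_score(classifier, X_toy, y_toy, cv=5)
    mean_accuracies[name] = scores.mean()
    print(name, mean_accuracies[name])
```

Which classifier does best depends on the data, so the ranking here should not be expected to carry over to the penguins.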


In [None]:
# Try an SVM

from sklearn.svm import LinearSVC

# construct an SVM  # max_iter=10000



# get prediction accuracies





In [None]:
# Try a random forest

from sklearn.ensemble import RandomForestClassifier

# construct a random forest classifier






## 5. Building the KNN classifier

So far we have used the KNN classifier (and a few other classifiers). Let's now see if we can write code that will implement the KNN classifier.

We will do this by writing several helper functions that build on each other. These functions are: 

1. `euclid_dist(x1, x2)`: finds the Euclidean distance between two points `x1` and `x2`

2. `get_labels_and_distances(test_point, X_train_features, y_train_labels)`: Finds the distance between a test point and all the training points. It returns a DataFrame with the distance to each training point and the training label of each point.

3. `classify_point(test_point, k, X_train_features, y_train_labels)`: Classifies which class a test point belongs to

4. `classify_all_test_data(X_test_data, k, X_train_features, y_train_labels)`: Classifies which class each test point belongs to.
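Before filling in the cells below, here is one possible sketch of how the four pieces could fit together, run on a tiny invented data set. The `_sketch` function names, the column names `label` and `distance`, and the toy arrays are all assumptions for illustration; this is not the only way to write these helpers.

```python
import numpy as np
import pandas as pd


def euclid_dist_sketch(x1, x2):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))


def labels_and_distances_sketch(test_point, X_train, y_train):
    # Distance from the test point to every training point, next to each label
    distances = [euclid_dist_sketch(test_point, row) for row in np.asarray(X_train)]
    return pd.DataFrame({"label": np.asarray(y_train), "distance": distances})


def classify_point_sketch(test_point, k, X_train, y_train):
    table = labels_and_distances_sketch(test_point, X_train, y_train)
    k_closest = table.sort_values("distance").head(k)
    # Majority label; on a tie, mode() lists the tied labels in sorted order
    return k_closest["label"].mode().iloc[0]


def classify_all_sketch(X_test, k, X_train, y_train):
    # Classify each test point in turn
    return np.array([classify_point_sketch(point, k, X_train, y_train)
                     for point in np.asarray(X_test)])


# Tiny invented data set: class A near the origin, class B near (5, 5)
X_train_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                        [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train_toy = np.array(["A", "A", "A", "B", "B", "B"])
X_test_toy = np.array([[0.5, 0.5], [5.5, 5.5]])
y_test_toy = np.array(["A", "B"])

toy_predictions = classify_all_sketch(X_test_toy, 3, X_train_toy, y_train_toy)
toy_accuracy = np.mean(toy_predictions == y_test_toy)
print(toy_predictions, toy_accuracy)
```

Looping over every training point is slow for large data sets, but it keeps each step easy to follow.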


Let's start by writing a function that can get the Euclidean distance between two points `x` and `z`: 

$$dist(x, z) = \sqrt{\sum_{i = 1}^{d} (x_i - z_i)^2}$$
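As a worked example, take the two vectors used to test the function in the next cell, $x = (1, 2, 3, 4)$ and $z = (2, 3, 4, 5)$, so $d = 4$. Each coordinate differs by 1, which gives:

$$dist(x, z) = \sqrt{(1 - 2)^2 + (2 - 3)^2 + (3 - 4)^2 + (4 - 5)^2} = \sqrt{4} = 2$$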


In [None]:
def euclid_dist(x1, x2):
    ...


    
# test our function 
my_vec1 = np.array([1, 2, 3, 4])
my_vec2 = np.array([2, 3, 4, 5])



In [None]:
# Let's now write a function that returns the labels and distances 
# between a test point and all the training points


def get_labels_and_distances(test_point, X_train_features, y_train_labels):
    
    the_distances = []
    
    # get the distance between the test point and all training points
    
    

    
    # add the training labels and distances on to a DataFrame 

    



# test our code 

test_data_point = X_test.iloc[0]
test_label = y_test.iloc[0]

labels_and_distances = get_labels_and_distances(test_data_point, X_train, y_train)

labels_and_distances.head(5)

In [None]:
# get the k closest neighbors








In [None]:
# get the majority label







In [None]:
# write a function to do the classification on a test point 
# by putting together all the pieces

def classify_point(test_point, k, X_train_features, y_train_labels):
    
    # get the labels and distances DataFrame

    
    
    # Sort the data frame and get k closest rows

    
    
    
    # get the majority class

    



# test our classifier on one test point
prediction = classify_point(test_data_point, 5, X_train, y_train)

print(prediction)

print(test_label)

In [None]:
# classify a full test set

def classify_all_test_data(X_test_data, k, X_train_features, y_train_labels):
    
    predictions = []
    
    # loop through all test points and get all predictions 

    
    
    
    
    
    
# test the classifier on the whole test set    
all_predictions = classify_all_test_data(X_test, 5, X_train, y_train)

all_predictions


In [None]:
# get the classification accuracy

