# Class 25: Building a KNN classifier and unsupervised learning

Plan for today:
- Building a KNN classifier
- Clustering


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(25)   # get class code    
# YData.download.download_class_code(25, True) # get the code with the answers 


If you are using Google Colab, you should install the YData package by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

## 1. Review: Training and test sets, and KNN

In supervised machine learning, we use a computer algorithm called a "pattern classifier" to learn relationships between a set of features X, and a label y. When the classifier is given new examples X, it can then make new predictions y. 


In [None]:
penguins = sns.load_dataset("penguins")

penguins = penguins.dropna()

penguins = penguins.sample(frac = 1)

penguins.head()

In [None]:
# get the features and the labels

X_penguin_features = penguins[['bill_length_mm', 'bill_depth_mm','flipper_length_mm', 'body_mass_g']]

y_penguin_labels = penguins['species']


In [None]:
from sklearn.model_selection import train_test_split

# split data into a training and test set

X_train, X_test, y_train, y_test = train_test_split(X_penguin_features,  y_penguin_labels, random_state = 0)

print(X_train.shape)
print(X_test.shape)

X_train.head(5)


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score


knn = KNeighborsClassifier(n_neighbors = 1) # construct knn classifier

scores = cross_val_score(knn, X_penguin_features,  y_penguin_labels, cv = 5)

scores.mean()


## 2. Feature normalization

If you look at the features we have been using in our analyses so far, you will notice that they are on very different scales. This is a problem for a KNN classifier because it classifies a point using the distances between that point and the training points, so features with large values (here `body_mass_g`, measured in grams) will dominate these distances. 
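
To make the scale problem concrete, here is a small sketch using two made-up penguins (the numbers are invented for illustration and are not taken from the dataset). It computes how much of the squared Euclidean distance comes from the body mass feature alone.

```python
import numpy as np

# Two hypothetical penguins:
# [bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g]
penguin_a = np.array([39.0, 18.0, 181.0, 3750.0])
penguin_b = np.array([46.0, 14.0, 210.0, 4250.0])

# squared difference for each feature
squared_diffs = (penguin_a - penguin_b) ** 2

# Euclidean distance between the two penguins
distance = np.sqrt(squared_diffs.sum())

# fraction of the squared distance that comes from body mass
share_from_mass = squared_diffs[3] / squared_diffs.sum()

print(distance)
print(share_from_mass)
```

Even though the bills and flippers of these two penguins differ noticeably, almost all of the distance comes from body mass, simply because grams produce much larger numbers than millimeters.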

Let's explore the scales that different features have by looking at some descriptive statistics. In particular, let's go back to the manually created `X_train`, `X_test`, `y_train`, `y_test` to examine the scale that different features are measured on.


In [None]:
X_train.describe()

Let's do a z-score transformation of our features, which sets the mean of each feature to 0 and its standard deviation to 1. We can do this using the `StandardScaler()` object as follows: 

1. Create a new `StandardScaler()` object using `scalar = StandardScaler()` 

2. Have the `scalar` object learn the means and standard deviations of our training data by calling the `scalar.fit(X)` method on the training data.

3. Use the fitted `scalar` object to transform both the training and test features, so that all features are on a similar scale, by calling the `.transform(X)` method. 
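
The fit and transform steps above can be sketched by hand with NumPy. This is only an illustration on made-up numbers, not scikit-learn's code, but it follows the same idea: the means and standard deviations are learned from the training data only and are then applied to both the training and the test data.

```python
import numpy as np

# Made-up training and test features (rows = penguins, columns = two features)
train = np.array([[40.0, 3500.0],
                  [45.0, 4000.0],
                  [50.0, 4500.0]])
test = np.array([[45.0, 5000.0]])

# "fit": learn the column means and standard deviations from the training data only
train_means = train.mean(axis=0)
train_sds = train.std(axis=0)   # population standard deviation (ddof=0)

# "transform": apply the training statistics to both the training and test data
train_z = (train - train_means) / train_sds
test_z = (test - train_means) / train_sds

print(train_z.mean(axis=0))   # each column has mean 0
print(train_z.std(axis=0))    # each column has standard deviation 1
print(test_z)
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), while the `std` row shown by pandas' `.describe()` uses the sample standard deviation (`ddof=1`), so the standard deviations reported by `.describe()` on transformed data will be slightly above 1.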


In [None]:
from sklearn.preprocessing import StandardScaler


# learning the mean and standard deviations to scale the features

scalar = StandardScaler()

scalar.fit(X_train)


In [None]:
# z-score transform the features 

X_train_transformed = scalar.transform(X_train)
X_test_transformed = scalar.transform(X_test)

type(X_test_transformed)

Let's now look at our transformed training data...

In [None]:
# view descriptive statistics on the transformed features

X_train_transformed_df = pd.DataFrame(X_train_transformed, columns = X_train.columns)

X_train_transformed_df.describe()

Let's see how our classification accuracy changes when we use the z-score transformed data.

In [None]:
# apply KNN classification on the normalized features

knn = KNeighborsClassifier(n_neighbors = 1) 
knn.fit(X_train_transformed, y_train)
knn.score(X_test_transformed, y_test)

In order to transform our features inside a cross-validation loop, we can set up a pipeline. This pipeline will do the following:

1. It will split the data into a training and test set
2. It will fit the transformation of the features on the training set (i.e., learn the means and standard deviations on the training set). 
3. It will apply a z-score transformation to the training and test sets based on the means and standard deviations learned in step 2
4. It will train the classifier on the transformed data
5. It will measure the classification accuracy on the test data
6. It will repeat this process k times, where k here refers to how many cross-validation splits we are using

In order to do this in scikit-learn we can use a `Pipeline` object which sets up the stages of transformation and classification, along with a `KFold` object which will run the cross-validation.  

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold


# create a pipeline for running cross-validation with feature normalization

# components that go into the pipeline
scalar = StandardScaler()
knn = KNeighborsClassifier(n_neighbors = 1) 
cv = KFold(n_splits=5)

# build the pipeline
pipeline = Pipeline([('transformer', scalar), ('estimator', knn)])

# get the cross-validation scores
scores = cross_val_score(pipeline, X_penguin_features, y_penguin_labels, cv = cv)


# print out the mean score over the 5 cross-validation splits
scores.mean()
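
To see what the pipeline is doing inside `cross_val_score`, the six steps listed above can also be written as an explicit loop. The sketch below uses synthetic data as a stand-in for the penguin features (the data and variable names are made up for illustration) and checks that the loop gives the same scores as the pipeline.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 60 points, 2 features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 60), rng.normal(0, 1000, 60)])
y = (X[:, 0] > 0).astype(int)

cv = KFold(n_splits=5)
manual_scores = []

for train_idx, test_idx in cv.split(X):               # steps 1 and 6: split, repeat
    scaler = StandardScaler().fit(X[train_idx])        # step 2: fit on training data only
    X_tr = scaler.transform(X[train_idx])              # step 3: transform both sets
    X_te = scaler.transform(X[test_idx])
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_tr, y[train_idx])                        # step 4: train the classifier
    manual_scores.append(knn.score(X_te, y[test_idx])) # step 5: accuracy on test data

# the same analysis using a Pipeline
pipeline = Pipeline([('transformer', StandardScaler()),
                     ('estimator', KNeighborsClassifier(n_neighbors=1))])
pipeline_scores = cross_val_score(pipeline, X, y, cv=cv)

print(np.array(manual_scores))
print(pipeline_scores)
```

The key point is that the scaler is refit on each training split, so no information from the test split leaks into the normalization.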

## 3. Building a KNN classifier

So far we have used the KNN classifier (and a few other classifiers). Let's now see if we can write code that will implement the KNN classifier.

We will do this by writing several helper functions that build on each other. These functions are: 

1. `euclid_dist(x1, x2)`: finds the Euclidean distance between two points `x1` and `x2`

2. `get_labels_and_distances(test_point, X_train_features, y_train_labels)`: This function finds the distance between a test point and all the training points. It returns a DataFrame with the distance from all training points and the training labels for each point.

3. `classify_point(test_point, k, X_train_features, y_train_labels)`: Classifies which class a test point belongs to

4. `classify_all_test_data(X_test_data, k, X_train_features, y_train_labels)`: Classifies which class each of the test points belongs to.


Let's start by writing a function that can get the Euclidean distance between two points `x` and `z`: 

$$dist(x, z) = \sqrt{\sum_{i = 1}^d (x_i - z_i)^2}$$


In [None]:
def euclid_dist(x1, x2):
    return np.sqrt(np.sum((x1 - x2)**2))


# test our function 
my_vec1 = np.array([1, 2, 3, 4])
my_vec2 = np.array([2, 3, 4, 5])

euclid_dist(my_vec1, my_vec2)
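
As a sanity check on the formula, the two test vectors differ by 1 in each of their 4 coordinates, so the distance should be $\sqrt{1 + 1 + 1 + 1} = 2$. The sketch below confirms this and compares the result with NumPy's built-in `np.linalg.norm()`, which computes the same quantity when applied to the difference of the two vectors.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
z = np.array([2, 3, 4, 5])

# distance computed directly from the formula
by_formula = np.sqrt(np.sum((x - z) ** 2))

# the same distance using NumPy's vector norm
by_norm = np.linalg.norm(x - z)

print(by_formula)
print(by_norm)
```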

In [None]:
# Let's now write a function that returns the labels and distances 
# between a test point and all the training points


def get_labels_and_distances(test_point, X_train_features, y_train_labels):
    
    the_distances = []
    
    # get the distance between the test point and all training points
    for i in range(X_train_features.shape[0]):
        the_distances.append(euclid_dist(test_point, X_train_features.iloc[i]))

    
    # add the training labels and distances on to a DataFrame 
    labels_and_distances = pd.DataFrame({'label': y_train_labels})
    labels_and_distances['distance'] = the_distances 

    return labels_and_distances


test_data_point = X_test.iloc[0]
test_label = y_test.iloc[0]

labels_and_distances = get_labels_and_distances(test_data_point, X_train, y_train)

labels_and_distances.head(5)

In [None]:
# get the k closest neighbors

k = 5

sorted_labels_dist = labels_and_distances.sort_values("distance")

sorted_labels_dist = sorted_labels_dist.iloc[0:k]

sorted_labels_dist

In [None]:
# get the majority label

count_table = sorted_labels_dist.groupby("label").count().reset_index()

sorted_count_table = count_table.sort_values("distance", ascending = False)

sorted_count_table.iloc[0]["label"]

In [None]:
# write a function to do the classification on a test point 
# by putting together all the pieces

def classify_point(test_point, k, X_train_features, y_train_labels):
    
    labels_and_distances =  get_labels_and_distances(test_point, 
                                                     X_train_features, 
                                                     y_train_labels)

    sorted_labels_dist = labels_and_distances.sort_values("distance")
    sorted_labels_dist = sorted_labels_dist.iloc[0:k]
    
    
    count_table = sorted_labels_dist.groupby("label").count().reset_index()
    sorted_count_table = count_table.sort_values("distance", ascending = False)
    majority_class = sorted_count_table.iloc[0]["label"]
    
    return majority_class



prediction = classify_point(test_data_point, 5, X_train, y_train)

print(prediction)

print(test_label)

In [None]:
# classify a full test set

def classify_all_test_data(X_test_data, k, X_train_features, y_train_labels):
    
    predictions = []
    
    for i in range(X_test_data.shape[0]):
        
        curr_test_point = X_test_data.iloc[i]
        
        curr_prediction = classify_point(curr_test_point, 
                                         k, 
                                         X_train_features, 
                                         y_train_labels)
        
        predictions.append(curr_prediction)

    return np.array(predictions)
    
    
all_predictions = classify_all_test_data(X_test, 5, X_train, y_train)

all_predictions


In [None]:
# get the classification accuracy

np.mean(all_predictions == y_test)
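
The functions above loop over every test point and every training point, which is easy to read but slow on larger datasets. As a rough sketch of an alternative, the same KNN idea can be written with NumPy broadcasting so that all the distances are computed at once. The function name `knn_predict` and the toy data are made up for illustration, and the function expects NumPy arrays rather than DataFrames.

```python
import numpy as np
from collections import Counter

def knn_predict(X_test, k, X_train, y_train):
    # pairwise differences, shape (n_test, n_train, n_features)
    diffs = X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distances, shape (n_test, n_train)
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # indices of the k closest training points for each test point
    nearest = np.argsort(distances, axis=1)[:, :k]
    # majority label among the k nearest neighbors of each test point
    return np.array([Counter(y_train[row]).most_common(1)[0][0]
                     for row in nearest])

# toy data: two well-separated groups
X_train_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                        [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train_toy = np.array(["A", "A", "A", "B", "B", "B"])
X_test_toy = np.array([[0.1, 0.1], [5.0, 5.1]])

toy_predictions = knn_predict(X_test_toy, 3, X_train_toy, y_train_toy)
print(toy_predictions)
```

With DataFrames such as `X_train` and `X_test`, one would pass `.to_numpy()` versions of the data. When there is a tie in the vote, this sketch may pick a different label than the functions above, since the two approaches break ties differently.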

## 4. Unsupervised learning: clustering

We can do k-means clustering in scikit-learn using the `KMeans()` object.
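
Before using `KMeans()`, it can help to see the basic idea behind k-means: repeatedly assign each point to its closest cluster center, and then move each center to the mean of the points assigned to it. The code below is a minimal sketch of this standard assign/update iteration on made-up data. It is not scikit-learn's implementation (which, among other things, chooses the initial centers more carefully), and the function name `simple_kmeans` is hypothetical.

```python
import numpy as np

def simple_kmeans(X, initial_centers, n_iterations=10):
    centers = initial_centers.astype(float).copy()
    for _ in range(n_iterations):
        # assignment step: find the closest center for each point
        diffs = X[:, np.newaxis, :] - centers[np.newaxis, :, :]
        distances = np.sqrt((diffs ** 2).sum(axis=2))
        labels = distances.argmin(axis=1)
        # update step: move each center to the mean of its assigned points
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# toy data: two obvious groups of three points each
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# use one point from each group as the starting centers
toy_labels, toy_centers = simple_kmeans(X_toy, initial_centers=X_toy[[0, 3]])

print(toy_labels)
print(toy_centers)
```

Because the assignment step is based on Euclidean distance, k-means is sensitive to the scale of the features in the same way that KNN is.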


In [None]:
from sklearn.cluster import KMeans

# fit k-means with 3 clusters 

kmeans = KMeans(n_clusters=3)
kmeans.fit(X_penguin_features)

In [None]:
# see which cluster each point belongs to 

predicted_labels = kmeans.predict(X_penguin_features)
predicted_labels

In [None]:
# look at a matrix of which penguin types end up in which cluster 

matrix = pd.DataFrame({'labels': predicted_labels, 'species': y_penguin_labels})
ct = pd.crosstab(matrix['labels'], matrix['species'])
print(ct)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# do clustering with feature normalization 
scaler = StandardScaler()
pipeline = make_pipeline(scaler, kmeans)

pipeline.fit(X_penguin_features)

In [None]:
# see which cluster each (normalized) point belongs to

predicted_labels2 = pipeline.predict(X_penguin_features)

predicted_labels2


In [None]:
# look at a matrix of which penguin types end up in which cluster 

matrix_new = pd.DataFrame({'labels': predicted_labels2, 'species': y_penguin_labels})
ct_new = pd.crosstab(matrix_new['labels'], matrix_new['species'])
print(ct_new)

### 4b. Unsupervised learning: Hierarchical clustering


In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy

# Ward's method repeatedly merges the two clusters whose merger gives the smallest increase in the total within-cluster sum of squares
clusters = hierarchy.linkage(X_penguin_features, method="ward")   


In [None]:
# display a dendrogram
dendrogram = hierarchy.dendrogram(clusters)
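
The linkage matrix that is used to draw a dendrogram can also be cut into a fixed number of flat clusters using SciPy's `hierarchy.fcluster()` function. The sketch below shows this on synthetic data made of three well-separated blobs (the data is made up for illustration).

```python
import numpy as np
from scipy.cluster import hierarchy

# synthetic data: three blobs of 10 points each
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_blobs = np.vstack([center + rng.normal(0, 0.5, size=(10, 2))
                     for center in blob_centers])

# build the hierarchy using Ward's method
Z = hierarchy.linkage(X_blobs, method="ward")

# cut the tree so that there are at most 3 flat clusters
flat_labels = hierarchy.fcluster(Z, t=3, criterion="maxclust")

print(Z.shape)       # one row per merge: (n_points - 1, 4)
print(flat_labels)   # cluster labels start at 1
```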

In [None]:
# cluster points into 3 clusters 
clustering_model = AgglomerativeClustering(n_clusters=3, linkage="ward")
clustering_model.fit(X_penguin_features)

# get the predicted cluster for each point
labels = clustering_model.labels_

labels

In [None]:
# visualize how well the clustering matches the penguin species

sns.relplot(X_penguin_features, 
            x='bill_length_mm', 
            y='flipper_length_mm', 
            hue=labels, 
            style = y_penguin_labels,
            palette="Set2");
