# Class 26: Unsupervised learning

Plan for today:
- K-means clustering
- Hierarchical clustering


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can download the class code with the cell below (which you have presumably already done, since you are viewing this notebook).

In [None]:
import YData

# YData.download.download_class_code(26)   # get class code    
# YData.download.download_class_code(26, True) # get the code with the answers


If you are using Google Colab, you should install the YData package by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

## 1. Unsupervised learning: clustering

In unsupervised machine learning, we try to find patterns in the data using only a set of features X (without any labels y). 

Let's explore clustering, which is a form of unsupervised learning that groups similar data points together.


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


# load the penguins data and drop rows with missing values
penguins = sns.load_dataset("penguins")
penguins = penguins.dropna()

# shuffle the rows
penguins = penguins.sample(frac=1)


# get the features and the labels
X_penguin_features = penguins[['bill_length_mm', 'bill_depth_mm','flipper_length_mm', 'body_mass_g']]
y_penguin_labels = penguins['species']




We can do k-means clustering in scikit-learn using the `KMeans()` object.
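
To give a rough sense of what k-means does under the hood, here is a minimal NumPy sketch of the standard alternating procedure (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. This is an illustration only and not scikit-learn's implementation; the function name `simple_kmeans` and the toy data are made up for this example.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Toy k-means (Lloyd's algorithm); illustrative only, not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its assigned points
        # (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignments are stable
        centers = new_centers
    return labels, centers

# made-up data: two well-separated groups of three points each
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
toy_labels, toy_centers = simple_kmeans(toy_X, k=2)
print(toy_labels)
```

The cluster numbers themselves are arbitrary; only which points share a number is meaningful.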

In [None]:
from sklearn.cluster import KMeans

# fit k-means with 3 clusters 

kmeans = KMeans(n_clusters=3)
kmeans.fit(X_penguin_features)

In [None]:
# see which cluster each point belongs to 

predicted_labels = kmeans.predict(X_penguin_features)
predicted_labels

In [None]:
# look at a matrix of which penguin types end up in which cluster 

matrix = pd.DataFrame({'labels': predicted_labels, 'species': y_penguin_labels})
ct = pd.crosstab(matrix['labels'], matrix['species'])
print(ct)
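

One way to summarize a crosstab like this with a single number is the adjusted Rand index, which is 1 when the clusters match the species exactly (up to renaming the clusters) and close to 0 for random assignments. The sketch below uses made-up labels so that it runs on its own; in this notebook one would presumably pass `y_penguin_labels` and `predicted_labels` instead.

```python
from sklearn.metrics import adjusted_rand_score

# made-up example labels (not the real penguin results)
toy_species = ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"]

# clusters that match the species perfectly, just with arbitrary cluster numbers
toy_clusters = [2, 2, 0, 0, 1, 1]
ari_perfect = adjusted_rand_score(toy_species, toy_clusters)

# clusters that split every species across two groups
toy_clusters_mixed = [0, 1, 0, 1, 0, 1]
ari_mixed = adjusted_rand_score(toy_species, toy_clusters_mixed)

print(ari_perfect, ari_mixed)
```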

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# do clustering with feature normalization
# (use a fresh KMeans so the unscaled model fit above is not refit in place)
scaler = StandardScaler()
pipeline = make_pipeline(scaler, KMeans(n_clusters=3))

pipeline.fit(X_penguin_features)
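

Normalization matters here because k-means relies on Euclidean distance, and `body_mass_g` is measured in thousands while the bill measurements are in tens of millimeters. The sketch below uses made-up numbers (not actual penguin rows) to show how one large-scale feature can dominate the distances, and how z-scoring each column (subtracting the mean and dividing by the standard deviation, which is what `StandardScaler` does by default) puts the features on a comparable footing.

```python
import numpy as np

# hypothetical measurements: column 0 ~ bill length (mm), column 1 ~ body mass (g)
toy = np.array([[39.0, 3700.0],
                [46.0, 3750.0],
                [40.0, 5000.0]])

def pairwise_dist(A):
    """Euclidean distance between every pair of rows."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)

# raw distances are driven almost entirely by the body-mass column
raw_d = pairwise_dist(toy)

# z-score each column: subtract its mean, divide by its standard deviation
z = (toy - toy.mean(axis=0)) / toy.std(axis=0)
z_d = pairwise_dist(z)

print(np.round(raw_d, 2))
print(np.round(z_d, 2))
```

In the raw distances, row 0 looks far closer to row 1 than to row 2 only because of body mass; after z-scoring, the large bill-length difference between rows 0 and 1 carries comparable weight.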

In [None]:
# see which cluster each (normalized) point belongs to

predicted_labels2 = pipeline.predict(X_penguin_features)

predicted_labels2


In [None]:
# look at a matrix of which penguin species end up in which cluster after normalization

matrix_new = pd.DataFrame({'labels': predicted_labels2, 'species': y_penguin_labels})
ct_new = pd.crosstab(matrix_new['labels'], matrix_new['species'])
print(ct_new)

## 2. Unsupervised learning: Hierarchical clustering


In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy

# Ward's method merges, at each step, the pair of clusters whose merger
# gives the smallest increase in the total within-cluster sum of squares
clusters = hierarchy.linkage(X_penguin_features, method="ward")   
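

As a small numerical check of the Ward criterion, the sketch below computes how much the within-cluster sum of squares grows when two made-up clusters `A` and `B` are merged, and compares it with the closed-form merge cost `|A||B| / (|A| + |B|) * ||mean(A) - mean(B)||^2`. The helper `sse` and the data are invented for this example.

```python
import numpy as np

def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

# two made-up clusters
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [4.0, 2.0], [5.0, 1.0]])

# increase in total within-cluster sum of squares caused by merging A and B
increase = sse(np.vstack([A, B])) - sse(A) - sse(B)

# closed-form Ward merge cost based only on cluster sizes and means
mean_gap_sq = np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
ward_cost = len(A) * len(B) / (len(A) + len(B)) * mean_gap_sq

print(increase, ward_cost)
```

The two numbers agree, which is why Ward's method only needs cluster sizes and means to decide which pair to merge next.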


In [None]:
# display a dendrogram
dendrogram = hierarchy.dendrogram(clusters)
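

A linkage matrix like the one plotted above can also be cut into a fixed number of flat clusters directly with SciPy's `hierarchy.fcluster`. The sketch below does this on made-up, well-separated points so that it runs on its own; with the notebook's data one would presumably pass `clusters` instead of `toy_linkage`.

```python
import numpy as np
from scipy.cluster import hierarchy

# made-up data: three tight groups of five points each
rng = np.random.default_rng(0)
toy_points = np.vstack([
    rng.normal(loc=center, scale=0.1, size=(5, 2))
    for center in (0.0, 5.0, 10.0)
])

# build the merge tree with Ward's method
toy_linkage = hierarchy.linkage(toy_points, method="ward")

# cut the tree so that at most 3 flat clusters remain (labels start at 1)
flat_labels = hierarchy.fcluster(toy_linkage, t=3, criterion="maxclust")
print(flat_labels)
```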

In [None]:
# cluster points into 3 clusters 
clustering_model = AgglomerativeClustering(n_clusters=3, linkage="ward")
clustering_model.fit(X_penguin_features)

# get the predicted cluster for each point
labels = clustering_model.labels_

labels

In [None]:
# visualize how well the clustering matches the penguin species

# color shows the predicted cluster; marker style shows the true species
sns.relplot(data=X_penguin_features,
            x='bill_length_mm',
            y='flipper_length_mm',
            hue=labels,
            style=y_penguin_labels,
            palette="Set2");
