<img src="https://gist.githubusercontent.com/jakubczakon/10e5eb3d5024cc30cdb056d5acd3d92f/raw/5c464c16ccbc7150b4025e0a2a05b84ab99a7bc3/logo_DS_AI.png" alt="Drawing" width="600"/>

# deepsense.ai's workshop

## 1.3. Clustering

* discovering patterns
* simplifying analysis

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

In [None]:
df = pd.read_csv("data/Bike-Sharing-Dataset/day.csv")

## KMeans

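KMeans is a simple and popular clustering algorithm. It splits the data into a fixed number of clusters, each represented by its centroid (the mean of its points).

We will cluster days by similar characteristics, such as rental counts or weather conditions.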
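
To make the idea concrete, here is a minimal NumPy sketch of the classic Lloyd iteration behind k-means: assign each point to its nearest centroid, recompute the centroids, and repeat until they stop moving. The function names `nearest_centroid` and `lloyd_kmeans` and the toy data are made up for illustration; scikit-learn's `KMeans` uses a more careful initialization and implementation.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid (squared Euclidean distance) for every point."""
    distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def lloyd_kmeans(points, n_clusters, n_iter=100, seed=0):
    """Illustrative Lloyd iterations, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # start from n_clusters distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(points, centroids)
        # mean of each cluster; keep the old centroid if a cluster is empty
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return nearest_centroid(points, centroids), centroids

# toy data: two blobs of 50 points each (hypothetical, not the bike dataset)
rng = np.random.default_rng(42)
toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
toy_labels, toy_centroids = lloyd_kmeans(toy, n_clusters=2)
```

Because the starting centroids are random, different seeds can lead to different results, which is why the notebook fixes `random_state` when it creates `KMeans`.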

In [None]:
from sklearn.cluster import KMeans

In [None]:
# creating input for clustering
feature_1 = 'registered'
feature_2 = 'casual'

X = df[[feature_1, feature_2]].copy()

# scale of variables is crucial for KMeans
X = (X - X.mean()) / X.std()
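
Why does scale matter? KMeans relies on Euclidean distances, so a column with large numbers can dominate the result. The sketch below uses a small made-up table (the column names `large_scale` and `small_scale` are hypothetical) to compare how much each column contributes to the squared distance between two rows, before and after the same standardization as above.

```python
import numpy as np
import pandas as pd

# hypothetical toy columns on very different scales
toy = pd.DataFrame({
    "large_scale": [1000.0, 3000.0, 5000.0, 7000.0],
    "small_scale": [0.1, 0.4, 0.2, 0.3],
})

# same standardization as in the notebook: subtract the mean, divide by the std
# (pandas computes the sample standard deviation, ddof=1, by default)
toy_scaled = (toy - toy.mean()) / toy.std()

# per-column share of the squared distance between the first two rows
raw_gap = (toy.iloc[0] - toy.iloc[1]) ** 2
scaled_gap = (toy_scaled.iloc[0] - toy_scaled.iloc[1]) ** 2
raw_share = raw_gap / raw_gap.sum()
scaled_share = scaled_gap / scaled_gap.sum()
```

Comparing `raw_share` with `scaled_share` shows that, without scaling, the distance is driven almost entirely by `large_scale`; after scaling, both columns can contribute.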

In [None]:
# creating a KMeans estimator (a clustering model, not a classifier)
kmeans = KMeans(n_clusters=3, random_state=12)

In [None]:
# training on data
# notice that there is only input
# (it's unsupervised learning)
kmeans.fit(X)

In [None]:
# we attach the cluster labels to the data;
# standardizing the columns makes it easier to tell low values from high ones
df_clustered = df.copy()
dteday = df_clustered['dteday']
df_clustered.drop('dteday', axis=1, inplace=True)
df_clustered = (df_clustered - df_clustered.mean())/df_clustered.std()
df_clustered['dteday'] = dteday
df_clustered['cluster'] = kmeans.predict(X)

In [None]:
df_clustered.head()

In [None]:
df_clustered_numerical = df_clustered.loc[:, df_clustered.columns != 'dteday']
df_clustered_numerical.groupby('cluster').mean()

In [None]:
# a plot may be more readable
sns.heatmap(df_clustered_numerical.groupby('cluster').mean())

In [None]:
df_clustered.plot(kind='scatter', x=feature_1, y=feature_2)

In [None]:
sns.pairplot(df_clustered, vars=[feature_1, feature_2], hue="cluster", height=4)

## Exercises

* Try to interpret these 3 clusters (from clustering on `registered` and `casual`) in plain words.
* Try a different number of clusters (e.g. `2`, `4`, ...).
* ★ Perform clustering using other features (e.g. only meteorological ones).
* ★★ Plot the decision boundaries.
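
As a hint for the second exercise, one common heuristic is to compare the `inertia_` attribute (the sum of squared distances of samples to their closest centroid) across several values of `n_clusters`. The sketch below runs on a synthetic stand-in called `X_toy`, which is an assumption for illustration; in the notebook you would pass the standardized `X` instead.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in for the standardized feature matrix (three blobs)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(40, 2))
    for center in (-2.0, 0.0, 2.0)
])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=12).fit(X_toy)
    # lower inertia means points lie closer to their assigned centroid
    inertias[k] = model.inertia_
```

Inertia tends to keep falling as `k` grows, so look for the point where the improvement levels off rather than for the minimum.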

## HDBSCAN

- HDBSCAN is a powerful density-based clustering algorithm: it looks for regions where observations are densely packed.
- Unlike the *k*-means algorithm, HDBSCAN determines the number of clusters automatically. Moreover, it can label some observations as not belonging to any cluster, i.e. as outliers (noise).

In [None]:
import hdbscan

If you get the error `ModuleNotFoundError: No module named 'hdbscan'`, the `hdbscan` library is not installed yet. You can install it on the fly by typing `!pip install hdbscan` in a new cell and running it.

In [None]:
clusterer = hdbscan.HDBSCAN()

In [None]:
clusterer.fit(X)

In [None]:
df_clustered['cluster_hdbscan'] = clusterer.labels_

In [None]:
sns.pairplot(df_clustered, vars=[feature_1, feature_2], hue="cluster_hdbscan", height=4)

## Exercises

* How many observations are in each cluster?
* Play with the hyperparameters of HDBSCAN to get a satisfactory number of clusters.
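
As a hint for the first question, cluster sizes can be tallied with pandas. The label vector `toy_labels` below is a hypothetical stand-in for `clusterer.labels_`; remember that `-1` marks observations treated as noise.

```python
import numpy as np
import pandas as pd

# hypothetical label vector standing in for clusterer.labels_
toy_labels = np.array([0, 0, 1, -1, 1, 1, 0, -1, 2, 1])

# number of observations per label, ordered by label
cluster_sizes = pd.Series(toy_labels).value_counts().sort_index()

# observations not assigned to any cluster
n_noise = int(cluster_sizes.get(-1, 0))
```

In the notebook, the same tally works on the `cluster_hdbscan` column of `df_clustered`.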