# Unsupervised learning - Clustering

The goal of clustering:
> ... clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different. Similarly to classification algorithms, clustering algorithms assign (or predict) a number to each data point, indicating which cluster a particular point belongs to.

A large part of the work is interpreting the clusters. To do that, you need to know what the features represent and where your data comes from.

## KMeans Clustering

Follow:
- _Introduction to Machine Learning_ [Chapter 3](https://github.com/amueller/introduction_to_ml_with_python/blob/master/03-unsupervised-learning.ipynb) **Section 3.5.1 k-Means Clustering** 
- _Practical Statistics for Data Scientists_ [Chapter 7](https://github.com/gedeck/practical-statistics-for-data-scientists/blob/master/python/notebooks/Chapter%207%20-%20Unsupervised%20Learning.ipynb) K-Means algorithm


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
import mglearn

## How does KMeans Clustering work?

**The number of clusters has to be chosen by the user first**

>The algorithm alternates between two steps: assigning each data point to the closest cluster center, and then setting each cluster center as the mean of the data points that are assigned to it

In [None]:
mglearn.plots.plot_kmeans_algorithm()
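The two alternating steps can also be written out by hand. The following is a minimal NumPy sketch, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy data and the random initialization from data points are all assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Minimal k-means: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with randomly chosen, distinct data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest center for every point
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the assignments are stable
        centers = new_centers
    return labels, centers

# Three toy groups of 2-D points (hypothetical data, not the notebook's X)
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])
labels_toy, centers_toy = kmeans_sketch(X_toy, n_clusters=3)
print(centers_toy)
```

Because the starting centers are random, a single run like this can stop in a poor local solution, which is why restarts are useful.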

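The initial cluster centers are chosen at random, so set `random_state` to get reproducible results.

scikit-learn can repeat the clustering from several random initializations and return the *best* run, meaning the one with the lowest inertia (within-cluster sum of squared distances). The examples below pass `n_init=10` explicitly; recent scikit-learn releases default to `n_init='auto'`, which performs a single run when the default `k-means++` initialization is used.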
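As a rough illustration of what "best" means here, the sketch below fits single runs with different seeds, then a model with ten restarts, and recomputes the inertia by hand. The variable names (`X_demo`, `single`, `best`) are assumptions chosen so that the notebook's own variables stay untouched.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(random_state=1)

# One run per seed: different initialisations can end in different solutions
for seed in range(3):
    single = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X_demo)
    print(f"seed={seed}: inertia={single.inertia_:.2f}")

# Ten restarts in one call; the fitted model is the run with the lowest inertia
best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Inertia recomputed by hand: squared distance of each point to its own center
manual_inertia = ((X_demo - best.cluster_centers_[best.labels_]) ** 2).sum()
print(best.inertia_, manual_inertia)
```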

### Scikit-learn code

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# generate synthetic two-dimensional data
X, y = make_blobs(random_state=1)

# build the clustering model
kmeans = KMeans(n_clusters=3, random_state=4, n_init=10)
kmeans.fit(X)

In [None]:
print("Cluster memberships:\n{}".format(kmeans.labels_))

In [None]:
# predict() assigns each point to its nearest cluster center
print("Predicted cluster memberships:\n{}".format(kmeans.predict(X)))
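The two cells above print the same kind of information. A small check, sketched here with assumed names (`X_demo`, `km_demo`, `new_points`), shows that `predict` on the training data reproduces `labels_`, and that it can also assign unseen points to the nearest learned center.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(random_state=1)
km_demo = KMeans(n_clusters=3, random_state=4, n_init=10).fit(X_demo)

# On the training data, predict() reproduces the stored labels_
same = np.array_equal(km_demo.labels_, km_demo.predict(X_demo))
print("labels_ equals predict(X):", same)

# predict() also assigns unseen points to the nearest learned center
new_points = np.array([[0.0, 5.0], [-10.0, -3.0]])  # hypothetical new samples
new_labels = km_demo.predict(new_points)

# The same assignment computed by hand from the cluster centers
distances = np.linalg.norm(
    new_points[:, None, :] - km_demo.cluster_centers_[None, :, :], axis=2
)
print(new_labels, distances.argmin(axis=1))
```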

### Visualizing clusters and cluster centres

In [None]:
mglearn.discrete_scatter(X[:, 0], X[:, 1], kmeans.labels_, markers='o')
mglearn.discrete_scatter(
    kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], [0, 1, 2],
    markers='^', markeredgewidth=2);

## Another way to interpret Clustering

Code from Practical Statistics for Data Scientists [Unsupervised Learning](https://github.com/gedeck/practical-statistics-for-data-scientists/blob/master/python/notebooks/Chapter%207%20-%20Unsupervised%20Learning.ipynb)

### 1. Create a DataFrame with features and predicted labels

In [None]:
df = pd.DataFrame(X, columns=['f1', 'f2'])
df['label'] = kmeans.labels_
df['label'] = df['label'].astype('category')  # categorical labels give seaborn a qualitative palette
df

### 2. Create a DataFrame with cluster centres

In [None]:
centers = pd.DataFrame(kmeans.cluster_centers_, columns=['f1', 'f2'])
centers

### 3. Plot features and cluster centres

In [None]:
fig, ax = plt.subplots(figsize=(4,4))
ax = sns.scatterplot(x='f1', y='f2', hue='label', ax=ax, data=df)

centers.plot.scatter(x='f1', y='f2', ax=ax, marker='x', s=80, color='black');

### 4. Check cluster balances
Very small clusters may consist of noise or outliers and deserve a closer look.

In [None]:
from collections import Counter
counts = Counter(kmeans.labels_)
print(counts)
fig1, ax1 = plt.subplots()
ax1.pie(counts.values(), labels=[f'Cluster {i}' for i in counts.keys()], autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal');  # Equal aspect ratio ensures that pie is drawn as a circle.
ax1.set_title('Cluster size distribution');
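If a table is easier to read than a pie chart, the same counts can be summarized with pandas. This is only a sketch; the names `km_demo`, `sizes`, `shares` and `summary` are assumptions, and the model is refit so the block runs on its own.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(random_state=1)
km_demo = KMeans(n_clusters=3, random_state=4, n_init=10).fit(X_demo)

# Absolute and relative cluster sizes, ordered by cluster index
sizes = pd.Series(km_demo.labels_).value_counts().sort_index()
shares = sizes / sizes.sum()
summary = pd.DataFrame({"n_samples": sizes, "share": shares.round(3)})
print(summary)
```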

### 5. Plot means of the features in each cluster

In [None]:
f, axes = plt.subplots(kmeans.n_clusters, 1, figsize=(6, 6), sharex=True)

for i, ax in enumerate(axes):
    center = centers.loc[i, :]
    maxPC = 1.01 * np.max(np.abs(center))  # symmetric y-limits around zero
    colors = ['C0' if value > 0 else 'C1' for value in center]
    ax.axhline(color='#888888')
    center.plot.bar(ax=ax, color=colors)
    ax.set_ylabel(f'Cluster {i}')
    ax.set_ylim(-maxPC, maxPC)
    if i == 0:
        ax.set_title('Cluster centers per feature on original scale')

### 6. Interpret clusters

From the bar plots we can now describe the clusters:

- Cluster 2 contains samples with large negative values in both features.
- Cluster 0 contains samples with large positive f2 values.
- Cluster 1 contains samples with large negative f1 values.

This would be more meaningful if we knew what the features f1 and f2 represent.

## Wine dataset

Because k-means clustering is based on distances, features with large numeric ranges would dominate the result. We therefore standardize the features with a `StandardScaler()` first.
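To see why scaling matters for this dataset, the following sketch compares the spread of two raw features and then checks the result of standardization. The names `wine` and `wine_scaled` are assumptions used to keep this block independent of the cells that follow.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine, _ = load_wine(return_X_y=True, as_frame=True)

# Raw spreads differ by orders of magnitude (compare proline with hue)
raw_std = wine[["proline", "hue"]].std()
print(raw_std)

# After standardization every feature has mean 0 and unit variance
wine_scaled = StandardScaler().fit_transform(wine)
print(wine_scaled.mean(axis=0).round(6))
print(wine_scaled.std(axis=0).round(6))
```

Without scaling, distances between wines would be driven almost entirely by `proline`.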

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True, as_frame=True)
X_scaled = X.copy()

scaler = StandardScaler()
scaler.fit(X_scaled)
X_scaled = pd.DataFrame(scaler.transform(X_scaled), columns=X_scaled.columns)
X_scaled

### K-means with 3 clusters (arbitrary choice)

In [None]:
kmeans = KMeans(n_clusters=3, random_state=54, n_init=10)
kmeans.fit(X_scaled)
# labels_ holds the cluster index of every training sample
# (the same values that kmeans.predict(X_scaled) would return)
X['clusters'] = kmeans.labels_
X['clusters'] = X['clusters'].astype('category')  # makes seaborn use a qualitative color palette
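Since three clusters is an arbitrary choice, it can help to compare several values of k. The sketch below is not part of the original analysis; it assumes two common heuristics, the inertia (elbow) curve and the silhouette score, and uses illustrative names such as `wine_scaled` and `results`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

wine, _ = load_wine(return_X_y=True, as_frame=True)
wine_scaled = StandardScaler().fit_transform(wine)

rows = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=54, n_init=10).fit(wine_scaled)
    rows.append({
        "k": k,
        "inertia": km.inertia_,  # lower is tighter, but it shrinks as k grows
        "silhouette": silhouette_score(wine_scaled, km.labels_),  # higher is better
    })

results = pd.DataFrame(rows).set_index("k")
print(results)
```

Neither number decides the question on its own; a bend in the inertia curve or a peak in the silhouette score is a hint that should be weighed against domain knowledge.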

### Cluster centers in original and scaled spaces

In [None]:
centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=X_scaled.columns)
centers_scaled = pd.DataFrame(kmeans.cluster_centers_, columns=X_scaled.columns)
centers
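For intuition, `inverse_transform` simply undoes the standardization: each scaled value is multiplied by the feature's standard deviation and shifted by its mean. A sketch with assumed names (`scaler_demo`, `km_demo`) that checks this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine, _ = load_wine(return_X_y=True, as_frame=True)
scaler_demo = StandardScaler().fit(wine)
km_demo = KMeans(n_clusters=3, random_state=54, n_init=10).fit(
    scaler_demo.transform(wine)
)

# Undo the scaling by hand: x_original = x_scaled * std + mean
manual = km_demo.cluster_centers_ * scaler_demo.scale_ + scaler_demo.mean_
via_api = scaler_demo.inverse_transform(km_demo.cluster_centers_)
print(np.allclose(manual, via_api))
```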

### Visualize clusters and centers

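The data has more than two dimensions, so it cannot be plotted directly. There are two options:
1. Choose a single pair of features and plot only those.
2. Use PCA to reduce the data to two dimensions.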

In [None]:
fig, ax = plt.subplots(figsize=(4,4))
ax = sns.scatterplot(x='alcohol', y='color_intensity', hue='clusters', ax=ax, data=X)

centers.plot.scatter(x='alcohol', y='color_intensity', ax=ax, marker='x', s=80, color='black');

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X_scaled)
X_2D = pca.transform(X_scaled)
sns.scatterplot(x=X_2D[:,0], y=X_2D[:,1], hue=X['clusters'])
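The cluster centers can be projected into the same two-dimensional space, and the explained variance indicates how much of the data's structure the plot can show. This is a sketch with assumed names (`pca_demo`, `km_demo`, `centers_2d`); plotting is left out so the block runs without a display.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine, _ = load_wine(return_X_y=True, as_frame=True)
wine_scaled = StandardScaler().fit_transform(wine)
km_demo = KMeans(n_clusters=3, random_state=54, n_init=10).fit(wine_scaled)

pca_demo = PCA(n_components=2).fit(wine_scaled)

# Project the 13-dimensional cluster centers onto the two principal components
centers_2d = pca_demo.transform(km_demo.cluster_centers_)
print(centers_2d)

# Share of the total variance that the 2-D view retains
retained = pca_demo.explained_variance_ratio_.sum()
print(f"Variance retained by two components: {retained:.1%}")
```

The rows of `centers_2d` can be overlaid on the scatter plot, for example with `plt.scatter(centers_2d[:, 0], centers_2d[:, 1], marker='x', color='black')`.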

### Cluster proportions

In [None]:
from collections import Counter
counts = Counter(kmeans.labels_)
print(counts)
fig1, ax1 = plt.subplots()
ax1.pie(counts.values(), labels=[f'Cluster {i}' for i in counts.keys()], autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal');  # Equal aspect ratio ensures that pie is drawn as a circle.
ax1.set_title('Cluster size distribution')

### Values of cluster centres in scaled space

In [None]:
f, axes = plt.subplots(kmeans.n_clusters, 1, figsize=(6, 6), sharex=True)

for i, ax in enumerate(axes):
    center = centers_scaled.loc[i, :]
    maxPC = 1.01 * np.max(np.abs(center))  # symmetric y-limits around zero
    colors = ['C0' if value > 0 else 'C1' for value in center]
    ax.axhline(color='#888888')
    center.plot.bar(ax=ax, color=colors)
    ax.set_ylabel(f'Cluster {i}')
    ax.set_ylim(-maxPC, maxPC)
    if i == 0:
        ax.set_title('Cluster centers per feature on standard scale')

### Interpretation

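Because the features were standardized, each bar shows how far a cluster center lies from the overall mean of that feature, measured in standard deviations. A positive bar means above average, a negative bar below average.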

- Cluster 0: High alcohol, medium acidity and medium color intensity wines
- Cluster 1: Medium alcohol, high acidity and high color intensity wines
- Cluster 2: Low alcohol, low acidity and low color intensity wines
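Reading thirteen bars per cluster by eye is error-prone, so one possible aid is to list the features that deviate most from the average for each cluster. This is a sketch under the same settings as above; the names `centers_demo` and `top_features`, and the choice of three features per cluster, are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine, _ = load_wine(return_X_y=True, as_frame=True)
wine_scaled = StandardScaler().fit_transform(wine)
km_demo = KMeans(n_clusters=3, random_state=54, n_init=10).fit(wine_scaled)

centers_demo = pd.DataFrame(km_demo.cluster_centers_, columns=wine.columns)

top_features = {}
for i, center in centers_demo.iterrows():
    # Features whose scaled center is furthest from the overall mean (zero)
    top = center.abs().sort_values(ascending=False).head(3).index
    top_features[i] = list(top)
    described = ", ".join(f"{name} ({center[name]:+.2f})" for name in top)
    print(f"Cluster {i}: {described}")
```

The signs tell whether a cluster is above or below average for that feature; turning them into a description of the wines still requires domain knowledge.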

Another approach is to group related features and comment on each feature group. This requires domain knowledge.