# Lecture 3.1: Clustering

[**Lecture Slides**](https://docs.google.com/presentation/d/19huVbcPfj-okCOjXIMG34cuQdZZP1ktswuMFjbT66ZA/edit?usp=sharing)

In this lecture, we are going to cluster a geospatial dataset using k-Means.

**Learning goals:**

- Explore data using clustering
- Implement k-means
- Visualize geospatial data
- Create a ridgeline graph


We are tasked with gathering insights into a volcano dataset 🌋, but we know nothing about geology 🙈. No need to panic! We know machine learning algorithms that can help us explore the patterns in this data.


## 1. Geospatial Data

Let's start our volcanic exploration by loading the dataset into pandas:


In [None]:
import pandas as pd

volcanos = pd.read_csv('volcanos.csv')
volcanos.head()

There are a lot of columns, but the most interesting are `Latitude` and `Longitude`. These are geospatial coordinates, which means it's our first opportunity to visualize some awesome maps! There are many geospatial data visualization libraries in Python, but for this notebook we'll use [folium](https://python-visualization.github.io/folium/).

Let's focus on the coordinates by selecting the two geospatial columns from our `DataFrame`:

In [None]:
coords = volcanos[['Latitude', 'Longitude']]

We now have to initialize a [`folium.Map`](https://python-visualization.github.io/folium/modules.html#folium.folium.Map). This is done with one central location. Since our coordinates span the entire globe, the exact choice doesn't matter much, so we can pick our first volcano as the center:

In [None]:
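import folium
# Pass the first row as a plain [latitude, longitude] list.
# Note: recent folium releases may no longer bundle the 'Stamen Toner' tiles;
# a built-in set such as 'CartoDB positron' can be swapped in if this fails.
m = folium.Map(location=coords.iloc[0].tolist(), tiles='Stamen Toner', zoom_start=1)
type(m)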

In [None]:
m

We have a map! 🗺 Notice how you can move and zoom interactively. Let's populate this map with our volcanos! We'll iterate through the `DataFrame` rows using [`.iterrows()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html). Then, we can create a [`Circle`](https://python-visualization.github.io/folium/modules.html#folium.vector_layers.Circle) for each volcano location. These have to be explicitly added to the map with `.add_to(m)`:

In [None]:
for index, row in coords.iterrows():
    folium.Circle(
        radius=10,
        location=row.tolist(),  # [latitude, longitude]
        color='crimson',
        fill=True,
    ).add_to(m)

m

This is already much easier to understand than our tabular format! Notice how volcanos aren't spread evenly around the globe, but instead form "clumps" and "lines". We'd like to investigate this further, but we don't have a column which categorizes the data into these groups...

## 2. K-Means

So we're going to have to make them ourselves! This is a clustering task, for which we will use the k-Means implementation from the [sklearn](https://scikit-learn.org/) library. Remember that k-Means is a _learning_ algorithm, so there will be two steps: fitting the model to the data, then applying the model to the data.
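
To make these two steps concrete, here is a minimal NumPy sketch of the classic k-Means loop (often called Lloyd's algorithm). It is an illustration under simplifying assumptions, not sklearn's actual implementation: the function name `lloyd_kmeans`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=20, seed=0):
    """Toy k-Means (Lloyd's algorithm); illustrative only, not sklearn's code."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster is empty
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Hypothetical toy data: two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(42)
toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(10, 10), scale=0.5, size=(50, 2)),
])

centroids, labels = lloyd_kmeans(toy, k=2)
print(centroids)
```

The loop alternates between assigning points and moving centroids, which is the "fitting" step; assigning new points to the nearest learned centroid is the "applying" step.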

First let's train our k-Means model. It looks like there are about a dozen "clumps" on the map, so we'll pick a somewhat arbitrary $k=10$. Then, we'll convert our `DataFrame` to a NumPy `ndarray` using the [`.to_numpy()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html) method, and fit the model to the data:

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10)
kmeans.fit(coords.to_numpy())
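
Since $k=10$ was chosen by eye, one common sanity check is to compare the model's `inertia_` (the sum of squared distances from each point to its nearest centroid) for several values of $k$, as in the elbow method mentioned in the resources below. The following is a hedged sketch on made-up data, not on the volcano dataset; the three-blob `toy` array is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for the coordinates: three well-separated 2-D blobs
toy = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(40, 2))
    for center in [(0, 0), (10, 0), (0, 10)]
])

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    inertias[k] = model.inertia_  # lower means tighter clusters

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this, the inertia drops sharply until $k$ reaches the number of real groups and then flattens out. Real data such as volcano coordinates rarely shows such a clean "elbow", so the result is a hint rather than a rule.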

Our model is trained, and ready to be used! Next, we "predict" the cluster allocation of points by feeding our `DataFrame` back into the `kmeans` model. 

ℹ️ This step might sound redundant, but it is important to differentiate between _training data_ and _prediction data_. In our case, they are the same, but they don't have to be! For example, if we were aliens and our dataset contained _billions_ of volcanoes, it could be more efficient to choose a random subset of the data to train the k-Means model.
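
As a minimal sketch of that idea, assuming a made-up dataset of random points rather than real volcanoes, we can fit on a small random sample and then predict on everything. The names `points`, `sample_idx`, and `all_labels` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical "large" dataset: 10,000 random 2-D points
points = rng.uniform(low=-90, high=90, size=(10_000, 2))

# Train on a random 5% sample only
sample_idx = rng.choice(len(points), size=500, replace=False)
model = KMeans(n_clusters=10, n_init=10, random_state=0).fit(points[sample_idx])

# predict() assigns every point, seen or unseen, to its nearest learned centroid
all_labels = model.predict(points)
print(all_labels.shape)
```

The centroids are learned from 500 points, yet all 10,000 points receive a cluster label.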

In [None]:
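# Use the same NumPy representation as in .fit() to avoid a feature-name mismatch warning
y_kmeans = kmeans.predict(coords.to_numpy())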
y_kmeans

The [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.predict) method returned a vector of integers. These are the _cluster allocations_ of our volcanos. Each cluster is labeled by an integer, and each volcano is assigned to a cluster. This vector stores these assignments.
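
To get a feel for such a vector, here is a small sketch on a made-up allocation (the `labels` array below is hypothetical, not the real `y_kmeans`). It counts how many points fall in each cluster and selects the members of one cluster.

```python
import numpy as np

# Hypothetical allocation vector for eight points and three clusters
labels = np.array([0, 2, 1, 0, 0, 2, 1, 0])

# counts[i] is the number of points assigned to cluster i
counts = np.bincount(labels)
for cluster_id, count in enumerate(counts):
    print(f"cluster {cluster_id}: {count} points")

# A boolean comparison selects the positions of one cluster's members
members_of_0 = np.flatnonzero(labels == 0)
print(members_of_0)
```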

This allows us to visualize the clusters on our map! We use a trick to iterate through `coords.iterrows()` and `y_kmeans` at the same time: Python's built-in function [`zip()`](https://docs.python.org/3.3/library/functions.html#zip):
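
As a quick warm-up before the map code, this sketch shows the pairing on a made-up mini-dataset. The names `toy_coords` and `toy_labels` are hypothetical.

```python
import pandas as pd

# Hypothetical mini-dataset: three locations and their cluster labels
toy_coords = pd.DataFrame({
    'Latitude': [10.0, -5.0, 40.0],
    'Longitude': [20.0, 30.0, -70.0],
})
toy_labels = [1, 0, 1]

paired = []
# zip() yields one element from each iterable at a time and stops at the shortest
for (index, row), label in zip(toy_coords.iterrows(), toy_labels):
    paired.append((index, row['Latitude'], row['Longitude'], label))

print(paired)
```

Each step of the loop sees one row together with the label at the same position, which is exactly what we need to color each volcano by its cluster.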

In [None]:
import seaborn as sns

colors = sns.color_palette('husl', n_colors=10).as_hex()

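# See the note above: 'Stamen Toner' may need replacing on recent folium releases
m = folium.Map(location=coords.iloc[0].tolist(), tiles='Stamen Toner', zoom_start=1)

for (index, row), y in zip(coords.iterrows(), y_kmeans):
    folium.Circle(
        radius=10,
        location=row.tolist(),  # [latitude, longitude]
        color=colors[y],  # one color per cluster label
        fill=True,
    ).add_to(m)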

m

That's a lava-hot visualization 🔥 Notice how k-Means identified real underlying patterns in the data, and formed geospatially coherent groups.

## 3. Cluster Analysis

The clusters _look_ good, but let's see if they can be useful in our data exploration. First, let's append our cluster allocation data to our `DataFrame` as a new column:

In [None]:
volcanos.loc[:, 'Cluster'] = y_kmeans.copy()
volcanos.head() 

Manipulating one object will be easier than two! Let's investigate distributional differences between the clusters (see lecture 2.3):

In [None]:
# numeric_only=True skips text columns, which raise an error in pandas >= 2.0
volcanos.groupby('Cluster').mean(numeric_only=True)

Interestingly, the clusters have different `Elevation (Meters)` averages. This suggests that by identifying volcanos that were close to each other, k-Means also grouped the data by other criteria of similarity. In lecture 2.6, we learned how averages can be misleading, and that it's preferable to visualise entire _distributions_ of datasets. Let's do this with a [ridgeline plot](https://www.data-to-viz.com/graph/ridgeline.html):
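
As a brief numeric reminder of why averages alone can mislead, here is a sketch on made-up elevations (the `toy` values are hypothetical, not taken from the volcano dataset). Both clusters share the same mean, but their spreads differ widely. The ridgeline plot itself follows in the next cell.

```python
import pandas as pd

# Hypothetical elevations: both clusters average 1000 m, but spread very differently
toy = pd.DataFrame({
    'Cluster': [0, 0, 0, 1, 1, 1],
    'Elevation (Meters)': [990, 1000, 1010, 0, 1000, 2000],
})

summary = toy.groupby('Cluster')['Elevation (Meters)'].agg(['mean', 'std', 'min', 'max'])
print(summary)
```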

In [None]:
import matplotlib.pyplot as plt

sns.set(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})

df = volcanos[["Elevation (Meters)", 'Cluster']]

# Initialize the FacetGrid object
pal = sns.cubehelix_palette(15, rot=-.25, light=.7)
g = sns.FacetGrid(df, row="Cluster", hue="Cluster", aspect=10, height=.5, palette=pal)

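# Draw the densities in a few steps
# (`fill` and `bw_method` replace the deprecated `shade` and `bw` arguments; assumes seaborn >= 0.11)
g.map(sns.kdeplot, "Elevation (Meters)", clip_on=False, fill=True, alpha=1, lw=1.5, bw_method=.2)
g.map(sns.kdeplot, "Elevation (Meters)", clip_on=False, color="w", lw=2, bw_method=.2)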
g.map(plt.axhline, y=0, lw=2, clip_on=False)


# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, .2, label, fontweight="bold", color=color,
            ha="left", va="center", transform=ax.transAxes)

g.map(label, "Elevation (Meters)")

# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-.25)

# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[])
g.despine(bottom=True, left=True)

This is an example of a more "advanced" [seaborn](https://seaborn.pydata.org/) plot that can be both aesthetic and informative!


## Summary

Today was our introduction to **data analysis**. We learned how **machine learning algorithms** differ from rule-based algorithms, and how they fall into either **supervised learning** or **unsupervised learning**. Then, we explained what **clustering** is, and identified some of its major **applications**. We defined the most popular clustering method, **k-Means**, and visualized the **Expectation-Maximization** optimisation algorithm before giving it a try ourselves. It allowed us to infer structures in a **geospatial** dataset, which gave insights into the distribution of volcanos across the globe.



# Resources


### Core Resources

- [**Slides**](https://docs.google.com/presentation/d/19huVbcPfj-okCOjXIMG34cuQdZZP1ktswuMFjbT66ZA/edit?usp=sharing)
- [Python Data Science Handbook - k-Means](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html)
- [k-Means applications and drawbacks](https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a)  
Excellent blog to further your k-Means skills with silhouette analysis and the elbow method

### Additional Resources

- [k-Means clustering from the mathematicalmonk](https://youtu.be/0MQEt10e4NM)  
Detailed but intuitive theoretical explanation of k-Means