# Data Science Society: Hierarchical clustering (solutions)

## Debugging problems

### 1) Using the data that we have provided, implement the clustering using a distance threshold of 6

Using the crime dataset that we exported from the previous notebook, implement the clustering algorithm with a distance threshold of 6 and see how this changes the result, both in terms of the clusters produced and how they are distributed spatially. Think about why you get this outcome, based on the dendrogram shown:

In [None]:
#import all necessary libraries
import pandas as pd
import geopandas as gpd
import os
import numpy as np
import matplotlib.pyplot as plt
import contextily as cx
from sklearn.cluster import AgglomerativeClustering

In [None]:
#read in the crime data that we extracted before
London_crime = gpd.read_file("Data/crime_2019.gpkg")

In [None]:
#extract the columns that we don't want to plot
not_plot = ["LSOA11CD", "geometry", "LSOA Code", "Total_crime", "Aggl_clus"]
#use this to extract the columns that we do want to plot
to_plot = [col for col in London_crime.columns if col not in not_plot]
#extract the values that we want to plot
crime_clus = London_crime[to_plot]

In [None]:
#create the model that we want, setting the linkage to ward and the distance threshold to 6, and
#set the number of clusters to None so that we can plot the dendrogram afterwards
model6 = AgglomerativeClustering(linkage=??, 
                                 distance_threshold = ??, 
                                 n_clusters=??)
#fit the model to the data
model6.fit(crime_clus)

In [None]:
#assign the labels back to the dataset
London_crime["Aggl_clus_6"] = model6.??

In [None]:
#set the base axis
fig, ax = plt.subplots(figsize = (10,10))

#plot the results
London_crime.plot(column = ??, 
                  categorical = True, 
                  legend=True, 
                  ax=ax,
                  alpha = 0.7,
                 cmap = "tab10")

#add a basemap
cx.add_basemap(ax = ax,
               crs = "EPSG:27700")

#set the axis off
ax.set_axis_off()

In [None]:
#extract the table showing the summary results
agglom_means =London_crime.groupby(??)[to_plot].mean()
agglom_means_T = agglom_means.T.round(3)

#turn this into a dataframe
agglom_means_T = pd.DataFrame(agglom_means_T)
#show the results
agglom_means_T

In [None]:
#reset the index
agglom_means_T.reset_index(inplace=True)

#get the colours
colors = ["#1f77b4", "#2ca02c", "#8c564b", "#7f7f7f", "#17becf"]

#create subplots for each cluster
fig, ax = plt.subplots(1,5, figsize = (15,8), sharey = True, sharex = True)
#flatten the axis
axis = ax.flatten()

#going over each column
for i, col  in enumerate(agglom_means_T.columns):
    #ignore the index column
    if col != "index":
        ax = axis[i-1]
        #plot the bar chart
        ax.bar(height = agglom_means_T[col], x=agglom_means_T["index"], color = colors[i-1] )
        #rotate the x-tick labels (tick_params avoids the fixed-locator warning from set_xticklabels)
        ax.tick_params(axis = "x", rotation = 90)
        #set the title
        ax.set_title(f"Cluster {col}", fontsize = 20)

- How is this different from the previous results?
- Does the extra cluster add anything to our analysis?
- What happens if you increase the distance threshold?
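
To explore the last question without refitting the crime model by hand each time, the sketch below sweeps several distance thresholds and records how many clusters each one produces. It uses synthetic data from `make_blobs` as a stand-in for the crime table, and the helper `count_clusters` and the threshold values are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for the crime table (assumption): 300 rows, 4 numeric features
X, _ = make_blobs(n_samples=300, centers=4, n_features=4, random_state=0)

def count_clusters(data, threshold):
    """Fit a Ward model cut at `threshold` and return the number of clusters found."""
    model = AgglomerativeClustering(linkage="ward",
                                    distance_threshold=threshold,
                                    n_clusters=None)
    model.fit(data)
    return model.n_clusters_

# Illustrative thresholds; sensible values depend on the scale of your own data
thresholds = [5, 10, 20, 40, 80]
cluster_counts = [count_clusters(X, t) for t in thresholds]

for t, n in zip(thresholds, cluster_counts):
    print(f"distance_threshold={t}: {n} clusters")
```

With Ward linkage the merge distances never decrease as you move up the tree, so raising the threshold can only keep the number of clusters the same or reduce it.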

### 2) Using the same dataset, try fitting the model with average linkage

We have already implemented the model with Ward linkage. Try changing the linkage method (more information [here](https://towardsdatascience.com/introduction-to-hierarchical-clustering-part-1-theory-linkage-and-affinity-e3b6a4817702)) to see how this changes the results:

In [None]:
#implement the model
model_avg = AgglomerativeClustering(linkage=??, 
                                    distance_threshold = 0.63, 
                                    n_clusters=??)
#fit the model to the data
model_avg.fit(??)

In [None]:
#assign the labels back to the dataset
London_crime["Aggl_clus_avg"] = model_avg.??

In [None]:
#create the base axis
fig, ax = plt.subplots(figsize = (10,10))

#plot the data to the axis
London_crime.plot(column = ??, 
                  categorical = ??, 
                  legend=??, 
                  ax=??,
                  alpha = ??,
                 cmap = ??)

#add the basemap
cx.add_basemap(ax = ax,
               crs = ??)

#set the axis off
ax.set_axis_off()

In [None]:
#create the results table
agglom_means =London_crime.groupby(??)[to_plot].mean()
agglom_means_T = ??
agglom_means_T = ??
agglom_means_T

In [None]:
#plot the results
agglom_means_T.reset_index(inplace=True)

#get the colours
colors = ["#1f77b4", "#8c564b", "#17becf"]

#create subplots for each cluster
fig, ax = plt.subplots(1,??, figsize = (15,8), sharey = True, sharex = True)
#flatten the axis
axis = ax.flatten()

#going over each column
for i, col  in enumerate(agglom_means_T.??):
    #ignore the index column
    if col != "index":
        ax = axis[i-1]
        #plot the bar chart
        ax.bar(height = agglom_means_T[col], x=agglom_means_T["index"], color = colors[i-1] )
        #rotate the x-tick labels (tick_params avoids the fixed-locator warning from set_xticklabels)
        ax.tick_params(axis = "x", rotation = 90)
        #set the title
        ax.set_title(f"Cluster {col}", fontsize = 20)

In [None]:
import numpy as np
from scipy.cluster.hierarchy import dendrogram

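def plot_dendrogram(model, **kwargs):
    #count the number of original samples under each merged node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                #indices below n_samples are leaf nodes (single observations)
                current_count += 1
            else:
                #otherwise the child is an earlier merge, so reuse its count
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    #build the linkage matrix in the format scipy expects:
    #[child 1, child 2, merge distance, number of samples]
    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    #plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)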

fig, ax = plt.subplots(figsize = (10,10))
ax.set_title("Hierarchical clustering dendrogram")
#plot the top three levels of the dendrogram
plot_dendrogram(model_avg, truncate_mode='level', p=3)
plt.axhline(y = 7, color = "r", linestyle = "--")
ax.set_xlabel("Number of points in node")
plt.show()

- What has gone wrong here? 
- Why do you think this has gone wrong?
- How would you change this?
- What about other linkage metrics?
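
One way to approach the last question is to compare how evenly each linkage method splits the same data. The sketch below is a minimal illustration on synthetic data with a handful of injected outliers; the data, the helper `cluster_sizes` and the choice of three clusters are all assumptions made for the example, not values taken from the crime dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data (assumption): three overlapping groups plus five far-away outliers
X, _ = make_blobs(n_samples=300, centers=3, n_features=4,
                  cluster_std=2.5, random_state=1)
rng = np.random.default_rng(1)
outliers = rng.uniform(low=-40, high=40, size=(5, 4))
X = np.vstack([X, outliers])

def cluster_sizes(data, linkage, n_clusters=3):
    """Return the cluster sizes (largest first) for a given linkage method."""
    model = AgglomerativeClustering(linkage=linkage, n_clusters=n_clusters)
    labels = model.fit_predict(data)
    return sorted(np.bincount(labels).tolist(), reverse=True)

# Compare the four linkage options that AgglomerativeClustering accepts
sizes = {name: cluster_sizes(X, name)
         for name in ["ward", "complete", "average", "single"]}

for name, s in sizes.items():
    print(f"{name:>8}: {s}")
```

Very uneven sizes, for example one large cluster and a few tiny ones, are a hint that the linkage method is isolating outliers rather than separating the main groups, which is worth checking against the map and dendrogram above.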

### 3) Try implementing a spatially constrained model using Ward linkage

Following this [link](https://towardsdatascience.com/introduction-to-hierarchical-clustering-part-3-spatial-clustering-1f8cbd451173), try implementing a spatially constrained model. The purpose of this is to control for geography in the model, so that only neighbouring areas can be merged into the same cluster (in the future you can also combine the constraint with other linkage methods).
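
Before working with the real geometries, it can help to see the effect of a connectivity constraint on a toy example. The sketch below is an illustration under stated assumptions: it replaces the London LSOAs with a 10 x 10 grid of cells, builds a rook-style (shared-edge) adjacency matrix with scikit-learn's `grid_to_graph` instead of `libpysal`, and uses random features. The helper `is_contiguous` is hypothetical and only checks that each cluster forms one connected patch.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

# Toy "map" (assumption): a 10 x 10 grid of cells with 3 random features each
n_rows, n_cols = 10, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_rows * n_cols, 3))

# Sparse adjacency matrix linking each cell to the cells it shares an edge with
connectivity = grid_to_graph(n_rows, n_cols).tocsr()

# Ward linkage where only neighbouring clusters are allowed to merge
model = AgglomerativeClustering(linkage="ward",
                                connectivity=connectivity,
                                n_clusters=4)
labels = model.fit_predict(X)

def is_contiguous(label):
    """Check that all cells with this label form a single connected patch."""
    members = np.flatnonzero(labels == label)
    sub_graph = connectivity[members][:, members]
    n_parts, _ = connected_components(sub_graph, directed=False)
    return n_parts == 1

contiguous = [is_contiguous(k) for k in range(4)]
print(contiguous)

# Show the cluster labels laid out on the grid
print(labels.reshape(n_rows, n_cols))
```

In the exercise below, `wr.sparse` plays the same role as `connectivity` here, with adjacency derived from the LSOA polygons rather than a regular grid.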

In [None]:
#!pip install libpysal

In [None]:
#import the necessary packages
from libpysal import weights

#calculate the weights matrix
wr = weights.contiguity.Rook.from_dataframe(??)

In [None]:
#create the model with Ward linkage
model = AgglomerativeClustering(linkage=??, 
                                #define the connectivity
                                connectivity = wr.sparse,
                                #set the distance threshold
                                distance_threshold = 3.2, 
                                n_clusters=None)

#fit the model
model.fit(??)

In [None]:
#extract labels
London_crime["Aggl_clus_spa"] = model.??

#creating axis
fig, ax =plt.subplots(figsize = (10,10))

#plot the results
London_crime.plot(column = ??, 
                  categorical = True, 
                  legend=True, 
                  ax = ax,
                 cmap = "tab10")

#add the basemap
cx.add_basemap(ax = ax,
               crs = "EPSG:27700")

#remove the axis
ax.set_axis_off()

In [None]:
#extract the results
agglom_means =London_crime.groupby(??)[to_plot].mean()
#extract the transformed data
agglom_means_T = pd.DataFrame(agglom_means.T.round(3))
#reset the index
agglom_means_T.reset_index(inplace=True)

#extract the colours
colors = ["tab:blue", "tab:green", "tab:purple", "tab:pink", "tab:olive", "tab:cyan"]

#plot the results
fig, ax = plt.subplots(2,3, figsize = (15,15), sharey = True, sharex = True)
axis = ax.flatten()
for i, col  in enumerate(agglom_means_T.columns):
    if col != "index":
        ax = axis[i-1]
        ax.bar(height = agglom_means_T[col], x=agglom_means_T["index"],
               color = colors[i-1])
        #rotate the x-tick labels (tick_params avoids the fixed-locator warning from set_xticklabels)
        ax.tick_params(axis = "x", rotation = 90)
        ax.set_title(f"Cluster {col}", fontsize = 20)
        ax.grid(axis = "y", zorder = 0, linestyle = "--")

- How can these be interpreted?
- What is the main difference from the unconstrained results?
- Can you look at the dendrogram to see how it may be different?
- What about other linkages, distance metrics or spatial weights?

# Find your own dataset to perform this methodology on

Think about:

1. What makes this a good dataset for hierarchical clustering?
2. What distance metric or affinity metric do you want to use?
3. How do your clusters vary with different distances?
4. How can your results be interpreted?
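
As a starting point for questions 2 and 3, the sketch below shows one possible workflow using SciPy directly: standardise the features, build the tree with a chosen linkage method and distance metric, then cut it at several heights. The synthetic data, the choice of average linkage with the `cityblock` (Manhattan) metric and the quantile-based cut heights are all assumptions for illustration; your own dataset may call for different choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) with one feature on a much larger scale
X, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=2)
X[:, 0] *= 100

# Standardise so that no single feature dominates the distance calculation
X_scaled = StandardScaler().fit_transform(X)

# Build the hierarchy with average linkage and Manhattan distance
Z = linkage(X_scaled, method="average", metric="cityblock")

# Cut the tree at a few heights taken from the upper end of the merge distances
merge_heights = Z[:, 2]
cuts = np.quantile(merge_heights, [0.90, 0.97, 0.99])

n_clusters_per_cut = []
for cut in cuts:
    labels = fcluster(Z, t=cut, criterion="distance")
    n_clusters_per_cut.append(len(np.unique(labels)))
    print(f"cut at {cut:.2f}: {n_clusters_per_cut[-1]} clusters")
```

Note that Ward linkage is defined in terms of Euclidean distance, so switching the metric only makes sense together with another linkage method such as average, complete or single.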