# CAS-DML Course Project 4 - Clustering


This project is designed to give you hands-on experience with clustering techniques. You will investigate the structure of a real-world dataset and apply the different clustering methods from the lecture as well as principal component analysis.
Furthermore, you will learn how to assess the robustness of your models and how a suitable visualization of the results can help you to interpret the data.

# Preparation - importing libraries and loading the data

As in the previous projects, we will use numpy for numerical operations, pandas for data manipulation, and matplotlib for plotting. We will also use the scikit-learn library for clustering and PCA. To visualize the data, we will use another library, namely `geopandas`, which is built on top of matplotlib and pandas and allows us to work with geospatial data. You are not expected to understand the details of the `geopandas` library in this course and we will hide all the details from you. However, if you are interested in this topic, you can find more information [here](https://geopandas.org/).

***Note:*** Geopandas is already installed on our Jupyterhub. If you are working on your local installation of Jupyterlab, you will need to install it using the following command:

`conda install -c conda-forge geopandas`

With this out of the way, let's import the libraries. 

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import geopandas

In this notebook, we will use a **cleaned subset of the 2019 World Development Indicators dataset**. This preprocessing has been done by the instructors of the course beforehand. Some countries have been removed due to insufficient data. A corresponding notebook will be made available after the project's deadline for those who are interested.

Let's load the data and take a look at the first few rows. The country names are unique identifiers for the data points, hence we can set them as the index of the DataFrame. This allows us to access the data for a specific country more easily. 

In [None]:
data = pd.read_csv("./cleaned_data.csv")
data = data.set_index("Country Name")
data.head(n = 5)
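
To illustrate why a name-based index is convenient, here is a minimal sketch on a made-up table. The country names and values are purely illustrative and are not taken from `cleaned_data.csv`. With the country name as the index, `.loc` selects a row (or a single value) by name.

```python
import pandas as pd

# Toy table with made-up values, for illustration only
toy = pd.DataFrame(
    {
        "Country Name": ["Atlantis", "Borduria", "Syldavia"],
        "gdp_pc": [1000.0, 2500.0, 4000.0],
        "life_expectancy": [60.0, 70.0, 80.0],
    }
).set_index("Country Name")

# Select all features of one country by its name
row = toy.loc["Borduria"]

# Select a single value by country name and feature name
value = toy.loc["Syldavia", "life_expectancy"]

print(row)
print(value)
```

The same pattern should work on the project data, e.g. `data.loc[<country name>]`, provided the name is spelled exactly as in the index.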

As we can see, the rows (samples) of this dataset represent different countries. The columns (features) are development indicators measured for each country. Specifically, we have:


*   ```child_mortality```: Mortality rate, infant (per 1,000 live births)
*   ```exports```: Exports of goods and services (% of GDP)
*   ```health```: Current health expenditure (% of GDP)
*   ```imports```: Imports of goods and services (% of GDP)
*   ```income```: Adjusted net national income per capita (current US\$)
*   ```inflation```: Inflation, GDP deflator (annual %)
*   ```life_expectancy```: Life expectancy at birth, total (years)
*   ```fertility_rate```: Fertility rate, total (births per woman)
*   ```gdp_pc```: GDP per capita (current US\$)
*   ```corruption```: Control of corruption. Captures perceptions of the extent to which public power is exercised for private gain. The higher the value, the more perceived corruption (the original indicator has been inverted)
*   ```acc_clean_cooking```: Access to clean fuels and technologies for cooking (% of population)

Most of these features should be self-explanatory. A little background on the inflation metric:
Inflation, measured as the annual change in the GDP deflator, tells us how prices are rising across the entire economy. The GDP deflator compares the current value of all goods and services produced to their value at the prices of a base year. The choice of metrics was inspired by the paper

- [Shahriar Sohan et al. <br/> Optimizing development aid allocation: A data-driven approach using unsupervised
machine learning and multidimensional indices ](https://wjarr.com/sites/default/files/WJARR-2023-1904.pdf).

As you can easily see, all the features are numeric. 

In [None]:
data.info()

We will remove `child_mortality` from the dataset. We will use this feature later to evaluate the clustering results.

In [6]:
child_mortality = data["child_mortality"]
data.drop("child_mortality", axis=1, inplace=True)

Later on, we want to color the countries on a map according to their cluster assignment. For this purpose, we will load a world map in GeoJSON format with geopandas. The following function will take care of all the details of the visualization. You don't need to understand it, just run the cell to make it available for later use.

In [7]:
from matplotlib.colors import to_hex
from matplotlib.patches import Patch
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

def plot_clusters_on_map(data, labels):
    unique_labels = np.unique(labels)
    n_labels = len(unique_labels)
    # Collect the country names belonging to each cluster label
    country_categories = []
    for label in unique_labels:
        country_categories.append(data.loc[labels == label].index.values.tolist())

    url = "https://nibr1609.github.io/files/world2.geojson"
    map = geopandas.read_file(url)

    cmap = plt.get_cmap('tab20', n_labels)  # 'tab20' has 20 distinct colors
    category_colors = [to_hex(cmap(i)) for i in range(n_labels)]

    # Create a color array to hold the colors for all countries
    color_array = ['white'] * len(map)  # Default color for countries not in any category
    for i, category in enumerate(country_categories):
        # Identify countries in the current category by checking if any country name is in the longer name
        highlight = map['NAME'].apply(lambda x: any(x in country for country in category))

        # Assign the corresponding color to these countries
        color_array = np.where(highlight, category_colors[i], color_array)

    # Plot the world map
    fig, ax = plt.subplots(figsize=(15, 10))
    map.plot(ax=ax, color=color_array, edgecolor='black')
    legend_patches = [Patch(color=category_colors[i], label=f"Category {unique_labels[i]}") for i in range(n_labels)] + [Patch(color='white', label="No Data")]
    plt.title("Highlighted Countries on World Map")
    plt.legend(handles=legend_patches, loc='upper left', title="Labels")
    plt.show()

## Exploratory Data Analysis

Before we start with the clustering, it is always a good idea to take a look at the data. 

#### Exercise:
Please provide some visualization for the features in the dataset. You can use histograms or boxplots to show the distribution of the values for the individual features. You can also use scatter plots to show the relationship between two features. 
Note down your observations.

Think about the individual features: which would you use if the goal of clustering is to find clusters related to health, and which if the goal is to find clusters related to governance?

In [1]:
# Your code here

YOUR NOTES

Did you notice that the features have different scales? As we will illustrate later, this can have a significant impact on the clustering results. Therefore, we will standardize the data before applying the clustering algorithms.

In [21]:
from sklearn.preprocessing import StandardScaler
data_scaled = StandardScaler().fit_transform(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns, index=data.index)
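
To make the effect of different scales concrete, here is a small sketch with made-up numbers (not taken from the dataset). The first column is on the scale of an income in US\$, the second on the scale of a rate in percent. Before standardization, the Euclidean distance between two samples is dominated almost entirely by the first column; afterwards, both columns have unit standard deviation and contribute on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is "income-like", column 1 is "rate-like"
X = np.array([
    [1000.0, 2.0],
    [1100.0, 9.0],
    [5000.0, 2.5],
])

def dist(a, b):
    """Euclidean distance between two samples."""
    return np.linalg.norm(a - b)

# Raw data: the large-scale column decides which samples are "close"
d01_raw = dist(X[0], X[1])
d02_raw = dist(X[0], X[2])

# Standardized data: every column has mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
d01_std = dist(X_std[0], X_std[1])
d02_std = dist(X_std[0], X_std[2])

print("raw:         ", d01_raw, d02_raw)
print("standardized:", d01_std, d02_std)
```

In the raw data, the large difference in the second column between samples 0 and 1 is practically invisible in the distance. After standardization it becomes a substantial part of it.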


# Hierarchical Clustering - An Overview

As mentioned above, the features were selected to investigate which regions of the world are most in need of development aid.
We start by taking all these features and simply exploring what groupings we can find.

We start with hierarchical clustering. 

Let's plot a dendrogram to visualize the hierarchical clustering. As the linkage method we use `ward`, which at each step merges the two clusters whose union leads to the smallest increase in within-cluster variance.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(data_scaled, method='ward')

fig, ax = plt.subplots(figsize=(20,45))
dendrogram(Z, orientation='left', leaf_font_size=8, labels=list(data.index), color_threshold=3)
ax.tick_params(axis='y', which='major', labelsize=12)
ax.tick_params(axis='x', which='major', labelsize=20)

### Exercise

- Which countries are close to each other in the dendrogram? Do you see any surprising results?
- Play with the `color_threshold` parameter. How do you need to choose it to get 3 clusters? And for 5 clusters?
  - Do the clusters make sense to you? Why or why not?
- In how many clusters would you divide the countries based on the dendrogram? Why? Note: There is no right or wrong answer and several solutions are possible. But some might be more reasonable than others. Try to justify your answer.

YOUR NOTES

The `dendrogram` function visualizes which countries are "close" to each other, but it does not directly give us the cluster of each country as a list of labels. To obtain such labels, we perform hierarchical clustering with ```AgglomerativeClustering``` from scikit-learn.

In [61]:
from sklearn.cluster import AgglomerativeClustering

agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg_clustering.fit_predict(data_scaled)
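
As an aside, SciPy can also cut an existing linkage matrix into flat clusters with `fcluster`. The following is a minimal sketch on synthetic data (three artificial, well-separated blobs; the names `X`, `Z_toy` and `flat_labels` are only used for this illustration). Note that `fcluster` numbers the clusters starting at 1, whereas scikit-learn starts at 0.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs with 10 points each
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(10, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(10, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(10, 2)),
])

# Build the hierarchy once ...
Z_toy = linkage(X, method="ward")

# ... and cut it afterwards so that at most 3 flat clusters remain
flat_labels = fcluster(Z_toy, t=3, criterion="maxclust")

print(np.unique(flat_labels, return_counts=True))
```

Because the hierarchy is built only once, the same linkage matrix can be cut at several values of `t` without recomputing it.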

Now that we have labels, we can print the countries in each cluster.


In [None]:

# print the countries in each cluster
for i in range(0, len(np.unique(labels))):
        print(data.loc[labels == i].index.values.tolist())
        print("\n")

Much better than printing the countries in each cluster is to visualize the clusters on a map. We will use the `plot_clusters_on_map` function that we defined earlier.

In [None]:
plot_clusters_on_map(data, labels)

### Exercise

- Use `AgglomerativeClustering`  to cluster the countries into 5 and 10 clusters. Is there still a clear interpretation of the clusters? Is there anything you learned from the clustering that you didn't know before?

YOUR ANSWER HERE


## K-Means

After applying hierarchical clustering, we want to have a look at another clustering algorithm: **k-Means**. In contrast to hierarchical clustering, we have to specify the number of clusters **before** running it.
While we can specify how many clusters we want in hierarchical clustering, the number does not have to be known in advance: the linkage matrix can be computed without knowledge of k, and the tree can be cut afterwards.

Having to fix the number of clusters beforehand is a disadvantage of K-Means compared to hierarchical clustering. However, K-Means is computationally more efficient and can be used for larger datasets.

### Exercise

- Use `KMeans` to cluster the countries. Do you get the same results as with hierarchical clustering?

In [None]:
from sklearn.cluster import KMeans

model = ...
labels = ...

# print the countries in each cluster
for i in range(0, len(np.unique(labels))):
        print(data.loc[labels == i].index.values.tolist())
        print("\n")


YOUR ANSWER HERE

## Determining the number of clusters

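Now we want to determine a suitable number of clusters for the K-Means algorithm. We will use the elbow method for this purpose. The elbow method looks at the within-cluster sum of squares (WCSS), i.e. the sum of squared distances of the samples to their closest cluster center. This value is plotted against the number of clusters. Since the WCSS generally decreases as clusters are added, we look for the "elbow": the point where the curve starts to flatten out is a reasonable candidate for the number of clusters.

In addition to the elbow method, we will also use the silhouette score. It complements the elbow method by measuring how well each sample fits its own cluster compared to the nearest other cluster. It ranges from -1 to 1, and higher values indicate better separated clusters.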
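
To make the first quantity concrete, the following sketch recomputes the sum of squared distances by hand and compares it with the `inertia_` attribute of a fitted `KMeans` model. It uses synthetic data (two artificial blobs), so the numbers say nothing about the country dataset; the variable names are only used for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs with 30 points each
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(6, 6), scale=0.5, size=(30, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# WCSS by hand: squared distance of every sample to its assigned center
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = np.sum((X - assigned_centers) ** 2)

# Mean silhouette coefficient over all samples
sil = silhouette_score(X, km.labels_)

print(wcss_manual, km.inertia_, sil)
```

The hand-computed value should agree with `km.inertia_` up to floating-point error, and for clearly separated blobs like these the silhouette score should be close to 1.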

In [None]:
from sklearn.metrics import silhouette_score

wcss = []
ss = []

for i in range(2,15):
    # Clustering
    model = KMeans(n_clusters=i, n_init=20, random_state=0)
    clusters = model.fit_predict(data_scaled)

    wcss.append(model.inertia_)
    ss.append(silhouette_score(data_scaled, clusters))

# Collect the scores in a DataFrame for plotting
WCSS = pd.DataFrame({'k': range(2, 15), 'WCSS': wcss, 'SS': ss})

# Create two subplots: elbow curve on top, silhouette score below
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 10))
sns.lineplot(WCSS, x='k', y='WCSS', ax = ax1)
ax1.set_title("Elbow Method")
sns.lineplot(WCSS, x='k', y='SS', ax = ax2)
ax2.set_title("Silhouette Score")

### Exercise

- How many clusters would you choose based on the above plots? Why?
- When you choose the optimal number of clusters and look again at the map, is there an interpretation of the clusters that makes sense to you?

YOUR ANSWER

Let's now visualize the clusters on a map.

In [None]:
plot_clusters_on_map(data, labels)

Remember that we removed the `child_mortality` feature from the dataset? Now we want to use this feature to see if our clustering could be an indicator of child mortality. We will plot the child mortality rate for each country and color the countries according to their cluster assignment.

In [None]:
# plot child mortality rate for each cluster
child_mortality_df = child_mortality.to_frame()
child_mortality_df["Cluster"] = labels


sns.boxplot(data=child_mortality_df, x="Cluster", y="child_mortality")
plt.show()

### Exercise

- What can you say about the child mortality rate in the different clusters? What does this tell you about your clustering results?
- If you wanted to know more about how child mortality is related to the features in the dataset, which method could you use?

YOUR NOTES HERE


#### Assessing the robustness of the clustering

We want to assess the robustness of the clustering results as we did in the clustering notebook that we covered in the class. We create a pair of bootstrapped datasets, use them for training a k-means model and compare the cluster assignments that result on the original dataset. If the cluster assignments are similar, the clustering is considered robust.
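
Cluster labels are arbitrary: two k-means runs may find the same groups but number them differently. The adjusted Rand index (ARI) is suited for this comparison because it only looks at which samples are grouped together. A minimal sketch with hand-made label lists (not related to the country data):

```python
from sklearn.metrics import adjusted_rand_score

a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]  # same grouping as a, only the label names differ
c = [0, 1, 0, 1, 0, 1]  # splits every pair that a puts together

ari_same = adjusted_rand_score(a, b)
ari_diff = adjusted_rand_score(a, c)

print(ari_same)  # identical groupings give an ARI of 1.0
print(ari_diff)  # unrelated groupings give a value around or below 0
```

An ARI close to 1 therefore means that two clusterings agree, regardless of how the clusters are numbered.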

In [None]:
# Assess robustness: fit k-means on pairs of bootstrap samples and compare
# the resulting cluster assignments on the original data via the ARI
from sklearn.metrics import adjusted_rand_score

ari_scores = []
for i in range(100):

    # Use a different seed for each sample so that the two bootstrap samples differ
    bootstrap_sample1 = data_scaled.sample(replace=True, n=len(data_scaled), random_state=2 * i)
    km1 = KMeans(n_clusters=5, n_init=40)
    km1.fit(bootstrap_sample1)

    bootstrap_sample2 = data_scaled.sample(replace=True, n=len(data_scaled), random_state=2 * i + 1)
    km2 = KMeans(n_clusters=5, n_init=40)
    km2.fit(bootstrap_sample2)

    # Assign every country of the original dataset to a cluster of each model
    s1 = km1.predict(data_scaled)
    s2 = km2.predict(data_scaled)
    ari = adjusted_rand_score(s1, s2)

    ari_scores.append(ari)

# plot the distribution of ari scores
sns.histplot(ari_scores, bins=30)


#### Exercise

- How do you interpret the distribution of the ARI scores? Would you consider the clustering results to be robust?

YOUR ANSWER HERE

### Getting more insights

One way to understand whether a feature is important for the clustering is to draw a box plot of the feature for each cluster.
For simplicity, let's choose only 3 clusters.

In [None]:
model = KMeans(n_clusters=3, n_init=40, random_state=0)
labels = model.fit_predict(data_scaled)


# do a boxplot for the cluster and each feature
data_with_clusters = data_scaled.copy() # copy dataframe in order not to add label to original data
data_with_clusters["Cluster"] = labels

fig, axs = plt.subplots(len(data_scaled.columns), 1, figsize=(10, 30))
for i, col in enumerate(data_scaled.columns):
    
    sns.boxplot(data=data_with_clusters, x="Cluster", y=col, ax=axs[i])
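
A numeric complement to the box plots is to compare the per-cluster means of each feature. The sketch below uses a tiny made-up table (the feature names `feature_a` and `feature_b` are hypothetical) and is only meant to show the `groupby` pattern: a feature whose cluster means are far apart separates the clusters, while a feature with nearly identical means does not.

```python
import pandas as pd

# Made-up standardized values with a cluster label per row
toy_clusters = pd.DataFrame({
    "feature_a": [-1.2, -0.8, 1.1, 0.9],
    "feature_b": [0.1, -0.1, 0.2, -0.2],
    "Cluster": [0, 0, 1, 1],
})

# Mean of every feature within each cluster
cluster_means = toy_clusters.groupby("Cluster").mean()

# Spread of the cluster means per feature: large spread = feature separates clusters
spread = cluster_means.max() - cluster_means.min()

print(cluster_means)
print(spread.sort_values(ascending=False))
```

Under the assumption that `data_with_clusters` is defined as in the cell above, the same `groupby("Cluster").mean()` call could be applied to it.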


### Exercise

- Can you use the plots to say what influenced each cluster? What are the defining features of each individual cluster?
- We see in the graphs that there are some outliers. Why could this be interesting? You may want to look at the countries that are outliers. In the following code cell you see an example of how to get the countries that are outliers for a specific cluster.

In [None]:
print(data_with_clusters[(data_with_clusters["Cluster"] == 1) & (data_with_clusters["imports"] > 4)])

YOUR ANSWER

## Using different features


Let us now investigate how the choice of features can influence the clustering. 
If we look at all the features included in our data, we see that there are different kinds of features (e.g. financial: `imports`, `exports` or `gdp_pc`, governance: `corruption`, or health: `acc_clean_cooking`, `health`, `life_expectancy`).

We now want to take a closer look at two different pairs of features.

#### Feature Sets

As a first set of features, let's take `acc_clean_cooking` and `life_expectancy`.
As a second set of features, we will use `inflation` and `corruption`.

In [148]:
features_1 = ['acc_clean_cooking', 'life_expectancy']
features_2 = ['inflation', 'corruption']

### Exercise

- Produce a scatterplot for each pair of features. Can you see any clusters?
- Use the K-Means algorithm to cluster the countries based on these two features. How many clusters would you choose? Why?
- Plot the countries with their corresponding clusters on the map. Do the clusters differ from the ones when using all features?

*Hint* To restrict a dataframe to only two columns you can use the following code:

```python
df_subset = df[['feature1', 'feature2']]
```


In [95]:
# Your code

# Dimensionality Reduction

As we have seen in the previous section, being able to visualize data in two dimensions can be very helpful. However, we can't always choose the two features that are most informative for clustering. In such cases, we can use dimensionality reduction techniques to project the data into a lower-dimensional space. We can do this either to visualize the data or to reduce the number of features before applying clustering algorithms.

Let's start with the former and use PCA to project the data into a two-dimensional space. 

In [96]:
from sklearn.decomposition import PCA
# Number of principal components to keep (data_scaled is already standardized)
num_dim = 2

# redo the K-Means clustering
model = KMeans(n_clusters=5, n_init=20, random_state=0)
labels = model.fit_predict(data_scaled)

# Apply PCA
pca = PCA(n_components=num_dim)
transformed_data = pca.fit_transform(data_scaled)

# Create a DataFrame with the transformed data, indexed by country name
pca_data = pd.DataFrame(transformed_data, columns=[f"PC{i}" for i in range(1, num_dim + 1)], index=data.index)
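
When projecting to two dimensions, it is worth checking how much of the total variance the two components retain. A fitted scikit-learn `PCA` object exposes this as `explained_variance_ratio_`. The sketch below uses synthetic data in which two of three features are strongly correlated; the names `X`, `pca_toy` and `ratio` are only used for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: features 0 and 1 are noisy copies of the same signal,
# feature 2 is independent noise
z = rng.normal(size=200)
X = np.column_stack([
    z + 0.1 * rng.normal(size=200),
    z + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

pca_toy = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each principal component
ratio = pca_toy.explained_variance_ratio_

print(ratio, ratio.sum())
```

If the two ratios sum to a value well below 1, a 2D scatter plot hides a considerable part of the structure, and clusters that overlap in the plot may still be separated in the full feature space.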

Now that we have a lower-dimensional representation of the data, we can do a scatterplot and show the labels in 2 dimensions.

In [None]:

# Plot the data
scatter_plot = sns.scatterplot(data=pca_data, x="PC1", y="PC2", hue=labels)

# You can uncomment the following code to add the country names to the plot for extreme values
# for i, row in pca_data.iterrows():
#     if row['PC1'] > 3 or row['PC2'] > 3 or row['PC1'] < -3 or row['PC2'] < -3:
#         scatter_plot.text(row['PC1'] + 0.02, row['PC2'], str(row.name),
#                           fontsize=10)



### Exercise

- Uncomment the code to add some labels to the scatter plot. Are you surprised about the extreme values?

In [None]:
# YOUR CODE GOES HERE
pca = PCA(n_components=num_dim)
transformed_data = pca.fit_transform(data_scaled)

model = KMeans(n_clusters=5, n_init=20, random_state=0)
labels = model.fit_predict(transformed_data)

pca_data = pd.DataFrame(transformed_data, columns=[f"PC{i}" for i in range(1, num_dim + 1)], index=data.index)

scatter_plot = sns.scatterplot(data=pca_data, x="PC1", y="PC2", hue=labels)


YOUR NOTES HERE

### Optional reading

Finally, we want to analyze which original features PCA deemed important and what information is represented by the principal components.
Using `pca.components_` and a heatmap, we can visualize and explain which features have the biggest influence on the principal components that we obtain by applying PCA.

In [None]:
# Create a DataFrame for PCA loadings
loadings = pd.DataFrame(pca.components_.T, columns=[f'PC{i+1}' for i in range(num_dim)], index=data.columns)

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(loadings, annot=True, cmap='coolwarm')
plt.show()

### Exercise

- Which features have the biggest influence on the first two principal components? 
- Are all the features that you expected to be important for clustering (see analysis above) represented here?

YOUR_NOTES