# Clustering

This notebook applies clustering to the correlation dataframe in order to group countries by similarity. The analysis is performed separately for each dimension of the indicators: equality, socio-demographic and economic. It closes with a global analysis that uses all the indicators together and serves as the final conclusion of the notebook.

### Import

Import all the libraries and define the output path and column names used throughout the notebook. The correlation dataframe generated in Notebook-Golden is loaded in the next section.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from numpy import sort
from sklearn.manifold import TSNE
import ipywidgets as widgets
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from matplotlib import gridspec
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.cluster import AffinityPropagation
from ipywidgets import interact, interact_manual
from matplotlib_venn import venn3

output_path = os.getcwd() + '/Output/'
col_country = 'Country'
col_region = 'Region'
col_year = 'Year'
col_cluster = 'Cluster'
col_1comp = '1st_component'
col_2comp = '2nd_component'

## LOADING THE DATAFRAME
We use the Pearson correlation dataframe. To use Spearman's correlation instead, change the name of the file being read.

In [None]:
corr_df = pd.read_csv(output_path + 'Corr_DF_pearson.csv', index_col = col_country)
corr_df
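
As a hedged illustration of the remark above, the file name can be built from a single variable so that switching the correlation method is a one-word change. The helper `corr_file` and the file name `Corr_DF_spearman.csv` are assumptions made for this sketch; only `Corr_DF_pearson.csv` appears in this notebook.

```python
import os

def corr_file(method, folder):
    """Return the path of the correlation CSV for a method (hypothetical naming scheme)."""
    if method not in ('pearson', 'spearman'):
        raise ValueError('Unknown correlation method: ' + method)
    return os.path.join(folder, 'Corr_DF_' + method + '.csv')

# Assumed folder layout, mirroring output_path in the Import cell.
example_folder = os.path.join(os.getcwd(), 'Output')
print(corr_file('pearson', example_folder))
```

The returned path could then be passed to `pd.read_csv` with `index_col = col_country`, as in the cell above.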

## DEFINING THE GROUPS

To make the study more fine-grained, we analyze the indicators not only as a whole but also split into the following groups, so that each dimension can be studied in more depth.

In [None]:
econ_ind = 'Economic Indicators'
socdem_ind = 'Social-demographic Indicators'
eq_ind = 'Equality Indicators'
all_ind = 'All indicators'

ind_dict = {
    econ_ind: ['CreditToAgriFishForest', 'AgriShareGDP', 'EmploymentRural', '%EmploymentAgriFishForest', 'TotalAgri', '% Soldiers', '% Healthcare Investment', '% Employment Industry', '% Education Expenditure','R&D expenditure %GDP','Researchers in R&D','Employment in agriculture','Employment in industry','Employment in services','Cost business start-up'],
    socdem_ind: ['Marriage Rate', 'Birth Rate', 'Death Rate', 'Homicides', 'Life Expectancy', 'Maternal Death Risk', 'Literacy Rate', 'Infant Mortality', '% Population Growth', '% Rural Population', 'Suicide Rate', 'Population'],
    eq_ind: ['Gender Equality', 'Gender Inequality','% Men Employment', '% Women Employment', 'Women Schooling Years', 'Men Schooling Years', 'Freedom of Expression', '% Undernourishment', 'Civil Liberties', 'Gini','Tertiary School Gender Parity','% Female Employment','% Male Employment','% Vulnerable male employment','% Vulnerable female employment'],
    all_ind: corr_df.columns.tolist()
}

# Keep only the ones that have been carried on to this point.

indicators = set(corr_df.columns)
for ind in ind_dict:
    ind_dict[ind] = list(set(ind_dict[ind]) & indicators)

# Slice corr_df into one DataFrame per group of indicators and store the slices in a dict.
# Countries with no data at all for a group are dropped; the remaining gaps are filled with 0.
df_dict = {}
for ind in ind_dict:
    df_norm = corr_df.copy()
    df_norm.drop(df_norm.columns.difference(ind_dict[ind]), axis = 'columns', inplace=True)
    df_norm = df_norm.dropna(how = 'all').fillna(value = 0)
    df_dict[ind] = df_norm

for ind in ind_dict:
    print(ind)
    display(df_dict[ind])
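
The slicing step can be checked on a toy example. The sketch below uses a made-up three-country frame as a stand-in for `corr_df`; the values and the names `toy`, `group` and `toy_slice` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for corr_df: countries as rows, indicators as columns (made-up values).
toy = pd.DataFrame(
    {'Gini': [0.3, np.nan, np.nan],
     'Birth Rate': [0.1, 0.2, np.nan],
     'Population': [np.nan, 0.5, 0.4]},
    index=pd.Index(['A', 'B', 'C'], name='Country'))

# A group may list indicators that are missing from the data ('Homicides' here).
group = ['Gini', 'Birth Rate', 'Homicides']

# Keep only the indicators that are present, in the column order of the data.
present = [c for c in toy.columns if c in group]

# Drop countries with no value at all for this group, then fill the remaining gaps with 0.
toy_slice = toy[present].dropna(how='all').fillna(0)
print(toy_slice)
```

Country C is dropped because it has no value for any indicator of the group, while the single gap of country B is replaced by 0. Filling with 0 treats a missing correlation as "no correlation", which is a modelling choice rather than a neutral default.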

## t-SNE

The t-SNE algorithm (t-distributed Stochastic Neighbor Embedding) is used to reduce the dimensionality of the indicators to only 2 components. It accepts several parameters that tune the procedure:
- n_components: the number of dimensions of the embedded space (2 here, so that the result can be plotted).
- perplexity: related to the number of nearest neighbors considered for each point. It must be smaller than the number of samples.
- n_iter: the maximum number of iterations for the optimization (named max_iter in recent scikit-learn releases).
- learning_rate: the step size of the optimization. If it is too high, the data may look like a 'ball' with every point roughly equidistant from its nearest neighbors.
- init: initialization of the embedding. 'pca' is the default in recent versions of scikit-learn.

In this cell, we define the different values that will be used for each group of indicators.

In [None]:
# One TSNE instance per group of indicators, so that each group can be tuned separately
tsne_dict = {
    econ_ind: TSNE(n_components = 2, perplexity = 5, n_iter = 20000, learning_rate = 100.0, init = 'pca'),
    socdem_ind: TSNE(n_components = 2, perplexity = 5, n_iter = 20000, learning_rate = 100.0, init = 'pca'),
    eq_ind: TSNE(n_components = 2, perplexity = 5, n_iter = 20000, learning_rate = 100.0, init = 'pca'),
    all_ind: TSNE(n_components = 2, perplexity = 5, n_iter = 20000, learning_rate = 100.0, init = 'pca')
}
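
To get a feel for what t-SNE returns, here is a minimal sketch on synthetic data. The random matrix `X` is a made-up stand-in for one of the indicator slices, `random_state` is added only to make the sketch reproducible, and the iteration limit is left at its default because the parameter name differs between scikit-learn versions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 30 synthetic "countries" described by 6 made-up indicators.
X = rng.normal(size=(30, 6))

# perplexity must be smaller than the number of samples (30 here).
tsne = TSNE(n_components=2, perplexity=5, learning_rate=100.0, init='pca', random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # one 2-D point per sample
```

Without a fixed `random_state`, repeated runs can produce different embeddings, so cluster assignments built on top of them may change from run to run.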

## CLUSTERING

### Affinity Propagation

The clustering method used in this notebook is Affinity Propagation. It suits our data, which form many clusters of uneven size, and it does not need the number of clusters to be fixed in advance. It can also be tuned through a single parameter, "damping".
- Damping: the extent to which the current value is maintained relative to incoming values (weighted 1 - damping). It takes values in the range [0.5, 1).
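
For reference, the damping factor enters the message updates of Affinity Propagation as a weighted average. Writing $\lambda$ for the damping, $m^{(t)}$ for the current value of a message (responsibility or availability) and $\tilde{m}$ for the newly computed value, the update can be written as below. The symbols are introduced here for illustration and do not appear in the code.

$$
m^{(t+1)} = \lambda \, m^{(t)} + (1 - \lambda) \, \tilde{m}
$$

With damping = 0.5 the old and the new value weigh the same; values closer to 1 change the messages more slowly, which can help when the algorithm oscillates.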

Again, in this cell we will calibrate the clustering algorithm for each indicator group.

In [None]:
afprop_dict = {
    econ_ind: AffinityPropagation(damping=0.5),
    socdem_ind: AffinityPropagation(damping=0.5),
    eq_ind: AffinityPropagation(damping=0.5),
    all_ind: AffinityPropagation(damping=0.5)
}
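
The effect of the damping parameter can be explored on synthetic data before tuning it for a real indicator group. The sketch below is only an illustration: the three groups of points are made up, and `random_state` is set for reproducibility.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
# Three well-separated synthetic groups of uneven size in 2-D.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
sizes = [5, 10, 20]
X = np.vstack([c + rng.normal(scale=0.5, size=(n, 2)) for c, n in zip(centers, sizes)])

for damping in (0.5, 0.9):
    model = AffinityPropagation(damping=damping, random_state=0).fit(X)
    # cluster_centers_indices_ holds the indices of the exemplars, one per cluster.
    n_clusters = len(model.cluster_centers_indices_)
    print('damping =', damping, '-> clusters found:', n_clusters, '| iterations:', model.n_iter_)
```

The number of clusters is not requested anywhere: it follows from the data and the parameters, which is why damping is the knob calibrated per indicator group.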

In [None]:


for ind in ind_dict:
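    # Drop cluster labels left over from a previous run so that they are not fed to t-SNE
    df_norm = df_dict[ind].drop(columns = col_cluster, errors = 'ignore')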

    # Apply the TSNE chosen for that indicator
    df_tsne = pd.DataFrame(tsne_dict[ind].fit_transform(df_norm))

    # Scale the result
    scaled_df = pd.DataFrame(StandardScaler().fit_transform(df_tsne), index = df_norm.index, columns = [col_1comp, col_2comp])
    
    # Apply Affinity Propagation
    affinity = afprop_dict[ind].fit(scaled_df)

    # Update the DataFrame with the resulting Cluster labels
    labels_affinity = affinity.labels_
    df_dict[ind].loc[:, col_cluster] = labels_affinity
    scaled_df[col_cluster] = labels_affinity
    scaled_df[col_cluster] = scaled_df[col_cluster].astype(str)

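    # Show the resulting chart. The labels are strings, so the legend is ordered with a numeric key (a plain sort would place '10' before '2').
    fig = px.scatter(scaled_df, x = col_1comp, y = col_2comp, text = scaled_df.index, size_max=100, color=col_cluster, category_orders={col_cluster: sorted(scaled_df[col_cluster].unique(), key = int)})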
    fig.update_layout(title_text=ind, title_x=0.5)
    fig.update_traces(textposition='top center')
    fig.show()
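
Two small steps of the loop above are easy to overlook: the scaling of the embedding and the cast of the cluster labels to strings. The sketch below reproduces both on made-up numbers; the array `embedding` and the labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up 2-D embedding whose two components live on very different scales.
embedding = np.array([[100.0, 0.1], [200.0, 0.3], [300.0, 0.2], [400.0, 0.4]])

scaled = pd.DataFrame(StandardScaler().fit_transform(embedding),
                      columns=['1st_component', '2nd_component'])
print(scaled.mean().round(6).tolist())       # each component is centred on 0
print(scaled.std(ddof=0).round(6).tolist())  # and has unit (population) standard deviation

# Cluster labels are cast to str for plotting; sorting them as text misorders them.
labels = pd.Series([0, 2, 10, 2]).astype(str)
print(sorted(labels.unique()))           # text order
print(sorted(labels.unique(), key=int))  # numeric order
```

Scaling puts both components on a comparable footing before the distance-based clustering, and the numeric sort key keeps the legend readable once there are ten or more clusters.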

## Widget Clustering

To explore the clustering results for a given country, select the country and the group of indicators in the widget below. The table lists the countries that fall in the same cluster for the selected group of indicators.

In [None]:
def tableCountry(Ind, Country):
    try:
        # Find the cluster Country belongs to.
        cluster_number = df_dict[Ind].loc[df_dict[Ind].index == Country, col_cluster].item()

        # Retrieve the Dataframe with the selected indicators, and filter to only show the rows (countries) belonging to the cluster. Drop the cluster number column as it is redundant.
        df_ind = df_dict[Ind]
        df = df_ind.loc[df_ind[col_cluster] == cluster_number].drop(col_cluster, axis = 'columns')

        # Format the Dataframe representation.
        df_s = df.style
        df_s.apply_index(lambda i: ['background-color: #aadfff; font-weight: 500' if c == Country else '' for c in i], axis = 0)
        df_s.apply(lambda row: ['background-color: #ccebff;' if row.name == Country else '' for cell in row], axis = 1)
        df_s.set_table_styles([{'selector': 'td:hover', 'props': [('background-color', '#ddfdff')]}])
        tt = {}
        for col in df.columns:
            tt[col] = 'Column median: ' + str(df.loc[:, col].median())
        df_s.set_tooltips(pd.DataFrame(tt, index = df.index))

        # Display a short descriptive title and the Dataframe.
        display(Country + ' belongs to Cluster ' + str(cluster_number) + '. This Cluster contains a total of ' + str(df.shape[0]) + ' countries.')
        display(df_s)
    except Exception:
        return print('No indicators available for this country.')

@interact(
    Indicators = list(df_dict.keys()),
    Country = sorted(corr_df.index.tolist()))
def g(Indicators = eq_ind, Country = 'Afghanistan'):
    return tableCountry(Indicators, Country)
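
Stripped of the styling and the widget, the lookup behind the table is a two-step filter. The sketch below shows it on a toy frame; `cluster_members` is a hypothetical helper written for this illustration, not a function of this notebook.

```python
import pandas as pd

# Toy clustering result: one row per country with its (made-up) cluster label.
toy = pd.DataFrame(
    {'Gini': [0.3, 0.1, 0.4, 0.2], 'Cluster': [0, 1, 0, 1]},
    index=pd.Index(['A', 'B', 'C', 'D'], name='Country'))

def cluster_members(df, country, cluster_col='Cluster'):
    """Return the rows that share the cluster of `country`, without the label column."""
    if country not in df.index:
        raise KeyError('No indicators available for ' + country)
    # Step 1: find the cluster the country belongs to.
    cluster_number = df.loc[country, cluster_col]
    # Step 2: keep the rows of that cluster and drop the redundant label column.
    return df.loc[df[cluster_col] == cluster_number].drop(columns=cluster_col)

members = cluster_members(toy, 'A')
print(members.index.tolist())
```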

### SAVING CLUSTERING RESULTS

We now write a .csv file for each group of indicators, containing the cluster each country belongs to. By default, the rows are sorted by cluster.

In [None]:
cluster_folder = output_path + 'Cluster/'

# Create the output folder if it does not exist yet
os.makedirs(cluster_folder, exist_ok = True)

for ind in ind_dict:
    print(ind)
    df = df_dict[ind]
    # Use (cluster, country) as the index so that the rows are sorted by cluster first
    df = df.set_index([col_cluster, df.index]).sort_index()
    df.to_csv(cluster_folder + ind + '.csv')
    display(df)
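
The layout of the saved files can be previewed with a small round trip. The sketch below writes a made-up frame to a temporary folder (a stand-in for `Output/Cluster/`) and reads it back; the data and file name are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame(
    {'Gini': [0.3, 0.1, 0.4], 'Cluster': [1, 0, 1]},
    index=pd.Index(['B', 'C', 'A'], name='Country'))

# (Cluster, Country) index: sort_index orders by cluster first, then by country.
by_cluster = toy.set_index(['Cluster', toy.index]).sort_index()

# Write to a temporary folder and read the file back with the same two-level index.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    by_cluster.to_csv(path)
    restored = pd.read_csv(path, index_col=['Cluster', 'Country'])

print(restored.index.tolist())
```

Reading the file with `index_col=['Cluster', 'Country']` restores the two-level index, so the clusters can be consumed by later notebooks without re-running the clustering.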


## VENN DIAGRAM

Finally, as a more visual way to represent the clustering, we show a Venn diagram for any given country, so that we can see all the countries related to it according to each group of indicators.

NOTE: the central intersection does not correspond to the clustering on all indicators, as the two are obtained by different methods. The intersection is more restrictive and contains fewer countries than the all-indicators cluster.

In [None]:
def VennOut(Country):

    # For a group of indicators, collect the countries that share Country's cluster.
    def same_cluster(ind):
        df = df_dict[ind]
        return set(df.index[df[col_cluster] == df.loc[Country, col_cluster]])

    set_econ = same_cluster(econ_ind)
    set_socdem = same_cluster(socdem_ind)
    set_eq = same_cluster(eq_ind)

    # Create the figure before drawing so that the size applies to this plot.
    plt.figure(figsize = (12, 12))
    venn = venn3([set_econ, set_socdem, set_eq], (econ_ind, socdem_ind, eq_ind))

    # Region ids follow the order (econ, socdem, eq): '100' is econ only, '111' is the center.
    regions = {
        '100': set_econ - set_socdem - set_eq,
        '010': set_socdem - set_econ - set_eq,
        '001': set_eq - set_econ - set_socdem,
        '110': (set_econ & set_socdem) - set_eq,
        '101': (set_econ & set_eq) - set_socdem,
        '011': (set_socdem & set_eq) - set_econ,
        '111': set_econ & set_socdem & set_eq,
    }

    # An empty region may have no label, in which case get_label_by_id returns None.
    for region_id, countries in regions.items():
        label = venn.get_label_by_id(region_id)
        if label is not None:
            label.set_text('\n'.join(sorted(countries)))

    plt.show()


@interact(
    Country = sorted(corr_df.index.tolist()))
def g(Country = 'Spain'):
    return VennOut(Country)
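
The set arithmetic behind the diagram can be checked without any plotting. The sketch below splits three made-up clusters into the seven regions, keyed by the same ids used with `get_label_by_id`; `venn_regions` and the country sets are assumptions for illustration.

```python
def venn_regions(a, b, c):
    """Split three sets into the seven regions of a Venn diagram, keyed like matplotlib_venn ids."""
    return {
        '100': a - b - c,
        '010': b - a - c,
        '001': c - a - b,
        '110': (a & b) - c,
        '101': (a & c) - b,
        '011': (b & c) - a,
        '111': a & b & c,
    }

# Made-up clusters of the same country ('Spain') for three groups of indicators.
econ = {'Spain', 'Italy', 'Portugal'}
socdem = {'Spain', 'Italy', 'France'}
eq = {'Spain', 'France', 'Sweden'}

regions = venn_regions(econ, socdem, eq)
for region_id, countries in regions.items():
    print(region_id, sorted(countries))

# In Python, `-` binds tighter than `&`, so `a & b - c` is read as `a & (b - c)`;
# for sets this selects the same elements as `(a & b) - c`.
assert (econ & socdem - eq) == ((econ & socdem) - eq)
```

The selected country always belongs to the central region, since it is in its own cluster for every group, while some of the other regions may be empty.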

In [None]:

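# NOTE: new_df_tsne is not defined in this notebook; it is assumed to hold a 2-component t-SNE embedding indexed by country.
X_optics = pd.DataFrame(StandardScaler().fit_transform(new_df_tsne))

# Apply OPTICS; fit_predict returns the cluster labels directly
optics = OPTICS(xi=.35, min_cluster_size=3, min_samples=5).fit_predict(X_optics)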

plt.figure(figsize=(8,8))
plt.title('OPTICS',fontsize= 20)
plt.xlabel('Feature 1',fontsize= 18)
plt.ylabel('Feature 2',fontsize= 18)
fig = plt.scatter(X_optics[0], X_optics[1], c= optics)


In [None]:

X_kmeans = pd.DataFrame(StandardScaler().fit_transform(new_df_tsne))
X_kmeans.index = new_df_tsne.index

# Apply K-Means (mini-batch variant)
from sklearn.cluster import MiniBatchKMeans

kmean_clusters = MiniBatchKMeans(n_clusters=6).fit_predict(X_kmeans)

plt.figure(figsize = (8,8))
plt.title('K-Means Clustering',fontsize= 20)
plt.xlabel('Feature 1', fontsize=18)
plt.ylabel('Feature 2', fontsize=18)
f = plt.scatter(X_kmeans[0],X_kmeans[1],c=kmean_clusters)

In [None]:
sns.set(rc={'figure.figsize':(20, 20)})


z = X_kmeans[0]
y = X_kmeans[1]
n = new_df_tsne.index.get_level_values(0)
fig, ax = plt.subplots()

ax.scatter(z, y, c=kmean_clusters)



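# Label each point with its country; positional access is used because X_kmeans shares new_df_tsne's index
for i, txt in enumerate(n):
    ax.annotate(txt, (z.iloc[i], y.iloc[i]))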

In [None]:
# new_df_tsne['Cluster'] = kmean_clusters.tolist()
# new_df_tsne

In [None]:

# Assumption: save next to the other outputs, using output_path from the Import cell
new_df_tsne.to_csv(output_path + 'Cluster.csv')