# Modelling (1/2 - Clustering)

Now that we have gone through the process of collating all the data into a single place, we can start to do some modelling!

There are two broad aims here:

**1. Clustering Analysis:** Which seats would an algorithm suggest are most alike, based on their characteristics? This might allow a party to see which seats really ought to be target seats, but aren't.


**2. Categorisation Machine Learning:** Creating a model to predict what kind of seat each constituency is, based on its characteristics, and seeing which issues are the most influential in deciding this.

# 1) Import libraries

We have several libraries to import in order to carry out these analyses.

In [1]:
#For data manipulation
import numpy as np
import pandas as pd
import json
import math

#For data visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.lines import Line2D
import matplotlib.font_manager
from matplotlib.colors import LinearSegmentedColormap
import seaborn as sns
from matplotlib.pylab import rcParams

#For data pre-processing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

#For clustering analysis
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering

#For evaluating clusters
from sklearn.metrics import calinski_harabasz_score
from scipy.cluster.hierarchy import dendrogram, ward

#Suppress warnings from showing
import warnings
warnings.filterwarnings('ignore')

#Allow ourselves to save things
import pickle

In [2]:
#Define colours for the visuals
CB91_Blue = '#2CBDFE'
CB91_Green = '#47DBCD'
CB91_Pink = '#F3A0F2'
CB91_Purple = '#9D2EC5'
CB91_Violet = '#661D98'
CB91_Amber = '#F5B14C'

CB91_BlueD = '#016794'
CB91_GreenD = '#187970'
CB91_PinkD = '#B317B1'
CB91_PurpleD = '#4E1762'
CB91_VioletD = '#330E4C'
CB91_AmberD = '#985E09'

CB91_BlueL = '#ABE5FF'
CB91_GreenL = '#B5F1EB'
CB91_PinkL = '#FAD9FA'
CB91_PurpleL = '#D9A8EB'
CB91_VioletL = '#ECD4F5'
CB91_AmberL = '#F9D094'


#The following gradients will be used for heatmaps, etc
CB91_Grad_BP = ['#2CBDFE', '#2fb9fc', '#33b4fa', '#36b0f8',
                '#3aacf6', '#3da8f4', '#41a3f2', '#449ff0',
                '#489bee', '#4b97ec', '#4f92ea', '#528ee8',
                '#568ae6', '#5986e4', '#5c81e2', '#607de0',
                '#6379de', '#6775dc', '#6a70da', '#6e6cd8',
                '#7168d7', '#7564d5', '#785fd3', '#7c5bd1',
                '#7f57cf', '#8353cd', '#864ecb', '#894ac9',
                '#8d46c7', '#9042c5', '#943dc3', '#9739c1',
                '#9b35bf', '#9e31bd', '#a22cbb', '#a528b9',
                '#a924b7', '#ac20b5', '#b01bb3', '#b317b1']

CB91_Grad_BA = ['#2cbdfe', '#31bdf9', '#36bcf5', '#3bbcf0',
                '#41bcec', '#46bbe7', '#4bbbe3', '#50bbde',
                '#55bbd9', '#5abad5', '#60bad0', '#65bacc',
                '#6ab9c7', '#6fb9c3', '#74b9be', '#79b8ba',
                '#7eb8b5', '#84b8b0', '#89b7ac', '#8eb7a7',
                '#93b7a3', '#98b79e', '#9db69a', '#a3b695',
                '#a8b690', '#adb58c', '#b2b587', '#b7b583',
                '#bcb47e', '#c1b47a', '#c7b475', '#ccb371',
                '#d1b36c', '#d6b367', '#dbb363', '#e0b25e',
                '#e6b25a', '#ebb255', '#f0b151', '#f5b14c']

CB91_Grad_AP = ['#f5b14c', '#f3ae4f', '#f0aa52', '#eea755',
                '#eca458', '#eaa05c', '#e79d5f', '#e59962',
                '#e39665', '#e19368', '#de8f6b', '#dc8c6e',
                '#da8971', '#d88574', '#d58277', '#d37f7b',
                '#d17b7e', '#cf7881', '#cc7584', '#ca7187',
                '#c86e8a', '#c66a8d', '#c36790', '#c16493',
                '#bf6096', '#bd5d9a', '#ba5a9d', '#b856a0',
                '#b653a3', '#b450a6', '#b14ca9', '#af49ac',
                '#ad46af', '#ab42b2', '#a83fb5', '#a63bb9',
                '#a438bc', '#a235bf', '#9f31c2', '#9d2ec5']

CB91_Grad_GP = ['#47dbcd', '#4bd9ce', '#50d8cf', '#54d6d0',
                '#59d5d1', '#5dd3d2', '#61d2d3', '#66d0d4',
                '#6acfd5', '#6fcdd6', '#73ccd6', '#78cad7',
                '#7cc9d8', '#80c7d9', '#85c6da', '#89c4db',
                '#8ec3dc', '#92c1dd', '#96c0de', '#9bbedf',
                '#9fbde0', '#a4bbe1', '#a8bae2', '#acb8e3',
                '#b1b7e4', '#b5b5e5', '#bab4e6', '#beb2e7',
                '#c2b1e8', '#c7afe9', '#cbaee9', '#d0acea',
                '#d4abeb', '#d9a9ec', '#dda8ed', '#e1a6ee',
                '#e6a5ef', '#eaa3f0', '#efa2f1', '#f3a0f2']

CB91_Grad_GWP= ['#47dbcd','#4fdcce','#56ddd0','#5dded1',
                '#64dfd2','#6ae0d3','#70e1d5','#75e2d6',
                '#7be3d7','#80e4d8','#85e5da','#8ae6db',
                '#8fe7dc','#94e8dd','#98e9df','#9deae0',
                '#a1ebe1','#a6ece2','#aaede4','#afede5',
                '#b3eee6','#b7efe8','#bbf0e9','#c0f1ea',
                '#c4f2eb','#c8f3ed','#ccf4ee','#d0f5ef',
                '#d4f6f1','#d8f6f2','#dcf7f3','#e0f8f5',
                '#e4f9f6','#e8faf7','#ecfbf8','#f0fcfa',
                '#f3fcfb','#f7fdfc','#fbfefe','#ffffff',
                '#fdfafe','#fbf5fc','#f9f0fb','#f7eaf9',
                '#f4e5f8','#f2e0f7','#f0dbf5','#eed6f4',
                '#ecd1f2','#e9ccf1','#e7c7f0','#e5c1ee',
                '#e2bced','#e0b7eb','#deb2ea','#dbade8',
                '#d9a8e7','#d6a3e5','#d49ee4','#d199e2',
                '#cf94e1','#cc8fdf','#ca89de','#c784dc',
                '#c57fdb','#c27ad9','#bf75d8','#bd6fd6',
                '#ba6ad5','#b765d3','#b45fd2','#b25ad0',
                '#af54cf','#ac4ecd','#a949cb','#a642ca',
                '#a33cc8','#a035c7','#9d2ec5']

CB91_Grad_BWP= ['#2cbdfe','#31bffe','#37c0fe','#3cc2fe',
                '#42c4fe','#47c5fe','#4cc7fe','#52c9fe',
                '#57cbfe','#5dccfe','#62cefe','#68d0fe',
                '#6dd1fe','#72d3fe','#78d5fe','#7dd6fe',
                '#83d8fe','#88dafe','#8ddbfe','#93ddfe',
                '#98dfff','#9ee1ff','#a3e2ff','#a8e4ff',
                '#aee6ff','#b3e7ff','#b9e9ff','#beebff',
                '#c3ecff','#c9eeff','#cef0ff','#d4f1ff',
                '#d9f3ff','#dff5ff','#e4f7ff','#e9f8ff',
                '#effaff','#f4fcff','#fafdff','#ffffff',
                '#fdfafe','#fbf5fc','#f9f0fb','#f7eaf9',
                '#f4e5f8','#f2e0f7','#f0dbf5','#eed6f4',
                '#ecd1f2','#e9ccf1','#e7c7f0','#e5c1ee',
                '#e2bced','#e0b7eb','#deb2ea','#dbade8',
                '#d9a8e7','#d6a3e5','#d49ee4','#d199e2',
                '#cf94e1','#cc8fdf','#ca89de','#c784dc',
                '#c57fdb','#c27ad9','#bf75d8','#bd6fd6',
                '#ba6ad5','#b765d3','#b45fd2','#b25ad0',
                '#af54cf','#ac4ecd','#a949cb','#a642ca',
                '#a33cc8','#a035c7','#9d2ec5']

#Add party colors
con_blue = '#0A3B7C'
lab_red = '#E4003B'
lib_yel = '#FAA61A'
snp_yel = '#FFF481'
green_green = '#52DF00'
brex_blue = '#00E2ED'
ukip_pur = '#470A65'
plaid_green = '#006A56'

con_lab = '#992281'
con_lib = '#837859'
con_snp = '#85987f'
lab_lib = '#ef532b'
lab_snp = '#f27a5e'
lib_snp = '#fccf4d'


#A list that we'll use to cycle through colors in charts
color_list = [CB91_Blue, CB91_Green, CB91_Amber, CB91_Pink,
              CB91_Violet, CB91_BlueD, CB91_GreenD, CB91_Purple,
              CB91_AmberL, CB91_BlueL, CB91_GreenL, CB91_AmberD, 
              CB91_VioletD, CB91_PinkL, CB91_VioletL, CB91_PinkD]


#Use seaborn to set all the default chart visual settings
sns.set(font='Franklin Gothic Book',
        rc={
 'axes.axisbelow': False,
 'axes.edgecolor': 'lightgrey',
 'axes.facecolor': 'white',
 'axes.grid': False,
 'axes.labelcolor': 'dimgrey',
 'axes.spines.right': False,
 'axes.spines.top': False,
 'figure.facecolor': 'white',
 'lines.solid_capstyle': 'round',
 'patch.edgecolor': 'w',
 'patch.force_edgecolor': True,
 'text.color': 'dimgrey',
 'xtick.bottom': False,
 'xtick.color': 'dimgrey',
 'xtick.direction': 'out',
 'xtick.top': False,
 'ytick.color': 'dimgrey',
 'ytick.direction': 'out',
 'ytick.left': False,
 'ytick.right': False})

sns.set_context("notebook", rc={"font.size":16,
                                "axes.titlesize":20,
                                "axes.labelsize":16})

plt.rcParams['axes.prop_cycle'] = plt.cycler(color=color_list)

And the mapping function from the previous workbooks...
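
At its core, the gradient mapper defined below min-max scales each value onto an index of the gradient list. Here is a minimal sketch of just that step, without the outlier stretching; `value_to_color` and `toy_grad` are hypothetical names used for illustration only.

```python
def value_to_color(values, grad):
    """Map each number to a colour by min-max scaling it onto an index of `grad`."""
    lo, hi = min(values), max(values)
    last = len(grad) - 1
    # Scale each value to the range [0, last] and round to the nearest index
    return [grad[int(round(last * (v - lo) / (hi - lo)))] for v in values]


# A made-up four-step gradient, purely for illustration
toy_grad = ['#000000', '#555555', '#aaaaaa', '#ffffff']
toy_colors = value_to_color([10, 12, 30, 40], toy_grad)
print(toy_colors)
```

Note that this sketch (like the full function) divides by `hi - lo`, so it assumes the values are not all identical.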

In [3]:
def gradient_mapper(kpi, grad, outliers=None, stretch=1, factor=1):
    
    '''
    Takes a list/series of numbers, outputs a list of hex colours,
    appropriate for heatmapping the initial data.
    
    Parameters:
    - kpi (list/series, etc.): The data to be transformed
    - grad (list hex codes): A list of colors that the data will be transformed to
    - outliers (top,bottom,both): Stretches the outliers, resulting in more gradient
                                  change amongst clustered values
    - stretch (int): The number of colors to duplicate if outliers variable used
    - factor (int): The scale of color duplication if outliers variable used
    
    '''
    #Work out how many colours we have in the given gradient
    colors = len(grad)
    half = colors // 2
    
    #Ensure that stretch is possible
    stretch = min(half//3, stretch)
    
    factors = [4*factor, 3*factor, 2*factor]
    
    if outliers is not None:
        #Stretch gradient if required. Declare three lists:
        #Start is the stretch map for the bottom end
        if (outliers.lower() == 'bottom') or (outliers.lower() == 'both'):
            start = [factors[0]]*stretch + [factors[1]]*stretch + [factors[2]]*stretch
        else:
            start = []

        #End is the stretch map for the top end
        if (outliers.lower() == 'top') or (outliers.lower() == 'both'):
            end = [factors[2]]*stretch + [factors[1]]*stretch + [factors[0]]*stretch
        else:
            end = []

        #Middle is a list of 1s which will be non-transformed
        middle = [1 for i in range(colors - len(start) - len(end))]

        stretch_map = start + middle + end
        
    else:
        stretch_map = [1 for i in range(colors)]
        
    #Create tuples of the gradients, and the number of
    #times they should be repeated in the list
    zip_list = list(zip(grad,stretch_map))
    
    #Use this to create a list of lists
    #Each element will be a list of the same gradient
    #repeated the required number of times
    list_of_lists = [[i[0]]*i[1] for i in zip_list]
    
    #Melt this list of lists into a single list
    grad = sum(list_of_lists, [])
    
    #Re-define colors variable
    colors = len(grad)-1
    
    #Define the lowest and the highest points in the dataset
    kpi_min = kpi.min()
    kpi_max = kpi.max()
    
    #Transform the data to integers between zero and the length of the gradient list
    first_map = list(map(lambda x: int(round(colors*(x-kpi_min) /
                                             (kpi_max-kpi_min),0)), list(kpi)))
    
    #Map the integers onto the gradient list
    second_map = list(map(lambda x: grad[x], first_map))
    
    #Return this, as well as the new gradient
    return second_map, grad




#Load the hex map coordinates (the file is closed automatically)
with open('Datasets/constituencies.hexjson') as f:
    datamap = json.load(f)
datamap = pd.DataFrame(datamap['hexes']).T
datamap = datamap[['n','q','r']]
datamap.columns=['Name','X','Y']

def kpi_map(kpi, width=6, colorbar=True,
            outliers=None, stretch=1, factor=1,
            exclude=[], title=None,
            colors=None, exc_color='#999999',
            grad=CB91_Grad_AP, data=None):
    
    '''
    Outputs a choropleth map, showing each constituency in the same size.
    
    Parameters:
    - kpi (list/series, etc.): The data to be transformed
    - width (float): The desired width of the figure
    - data (dataframe): The dataframe to get the data from (must be passed explicitly)
    - grad (list hex codes): A list of colors that the data will be transformed to
    - outliers (top,bottom,both): Stretches the outliers, resulting in more gradient
                                  change amongst clustered values
    - stretch (int): The number of colors to duplicate if outliers variable used
    - factor (int): The scale of color duplication if outliers variable used
    - exclude (list): A list of regions to exclude from the chart
    - colors (dataframe): A dataframe of hex-codes (index should be constituency codes)
    - title (string): The desired title of the chart
    
    '''    
    
    #Filter out different regions, depending on parameters
    df_filtered = pd.concat([datamap, data[['Region',kpi]]], axis=1)
    
    #If we have colors to add, concatenate these in
    if isinstance(colors, pd.DataFrame):
        df_filtered = pd.concat([df_filtered, colors], axis=1)
        df_filtered.columns = ['Name', 'X', 'Y', 'Region', kpi, 'Colors']
    
    df_filtered = df_filtered.loc[~df_filtered['Region'].isin(exclude)]
    kpi_filtered = df_filtered[kpi]
    
    if isinstance(colors, pd.DataFrame) is False:
        #Use the gradient mapper function to return the colors for the plot
        gradient_map = gradient_mapper(kpi=kpi_filtered,
                                    grad=grad,
                                    outliers=outliers,
                                    stretch=stretch,
                                    factor=factor)
        colors_map = gradient_map[0]
    
    else:
        #Fill in missing colors with the exclusion color (exc_color)
        df_filtered['Colors'] = df_filtered['Colors'].fillna(exc_color)
        
        #Return the column to be used as the colours list in the plot
        colors_map = list(df_filtered['Colors'])
    
    #Work out the aspect ratio of the filtered constituencies
    X_diff = np.max(df_filtered['X'])-np.min(df_filtered['X'])
    Y_diff = np.max(df_filtered['Y'])-np.min(df_filtered['Y'])
    
    #Declare the width and height of the plot
    height = width * (Y_diff/X_diff)
    size = 500*math.pi*((width/X_diff)**2)
    
    #Create the figure
    fig, ax = plt.subplots(figsize=(width,height))
    
    plt.xticks([])
    plt.yticks([])

    #Plot the scatter
    ax1 = fig.add_subplot(1,1,1)
    ax1.scatter(df_filtered['X'],
                df_filtered['Y'],
                s=size,
                marker='s',
                c=colors_map)
    
    #Remove axes
    sns.despine(left=True, bottom=True)
    ax1.set_title(title);
    
    #plot the colorbar
    if (colorbar == True) and isinstance(colors, pd.DataFrame) is False:        
        cmap = LinearSegmentedColormap.from_list(name= '',
                                                 colors=gradient_map[1])
        ax2 = fig.add_subplot(2,30,28)
        norm = mpl.colors.Normalize(vmin=df_filtered[kpi].min(),
                                    vmax=df_filtered[kpi].max())
        cb = mpl.colorbar.ColorbarBase(ax2, cmap=cmap,
                                       norm=norm, orientation='vertical')
    
        # remove the x and y ticks
        for ax in [ax1,ax2]:
            ax.set_xticks([])
            ax.set_yticks([])
    
    else:
        ax1.set_xticks([])
        ax1.set_yticks([])


# 2) Final Data Clean

We have prepared a dataset in the previous workbooks. We will see that there are a few different iterations of the data that we could create.

In [None]:
df = pd.read_csv('Datasets/data_with_targets.csv')
df.rename(columns={'Unnamed: 0':'ID'}, inplace=True)
df.set_index('ID', inplace=True, drop=True)
df.head(3)

There are a few things to clean up here.

Firstly, we have some NAs. These are driven by regional data issues: features with 18 NAs are those where we had no data for Northern Ireland, and features with 117 NAs are those where we only had data for England.

Given that the target values for Northern Ireland feature parties that are exclusive to Northern Ireland, data about these constituencies is actually not that helpful. We can probably afford to drop these in all cases.

### The Speaker and Their Constituency

As another complication, we should also consider the constituency held by the 'Speaker' of the House of Commons. The Speaker acts as the 'chairperson' of the House of Commons, and is chosen from the current crop of MPs *by* other MPs.

<img src="images/bercow.gif" alt="The Speaker" style="width: 400px;"/>

<i><center>"Division! Clear the lobby!"</center></i>

It is tradition that the seat held by the Speaker is not contested by the other main parties (the Speaker is still elected as an MP, but stands as an 'independent'). Thus, results in this constituency will be outliers in the dataset. We should probably exclude Chorley, the Speaker's seat in 2019.

In [None]:
#Finding the features with NAs
df.loc[:,df.isna().sum() > 0].isna().sum()

In [None]:
#Removing Northern Ireland, Chorley
df = df.loc[(df['Region'] != 'Northern Ireland')
           &(df['Name'] != 'Chorley')]

#We can also remove the constituency name column, which we don't really care about
df_name = df.iloc[:,0]
df_name.to_csv('Datasets/names.csv')

df = df.iloc[:,1:]

We need to one-hot encode the 'Region' and 'Type' columns, because they are categorical columns.
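
As a quick illustration of what one-hot encoding produces, here is a small sketch on made-up data. It uses `pandas.get_dummies` as a lightweight stand-in rather than the scikit-learn `OneHotEncoder` used in this notebook; the toy rows are assumptions, not real constituencies.

```python
import pandas as pd

# Made-up rows, purely for illustration
toy = pd.DataFrame({'Region': ['London', 'Wales', 'London'],
                    'Type': ['Large City', 'Rural', 'Small Town']})

# Each category becomes its own 0/1 column, named <column>_<category>
encoded = pd.get_dummies(toy, columns=['Region', 'Type'], dtype=int)
print(encoded.columns.tolist())
```

Each row has exactly one 1 per original column, so no artificial ordering is imposed on the categories.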

In [None]:
#Create the ohe object
ohe = OneHotEncoder()

#Fit it to the categorical columns
ohe.fit(df[['Region','Type']])
df_ohe = ohe.transform(df[['Region','Type']]).toarray()

#Create the dataframe
#(get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
df_ohe = pd.DataFrame(df_ohe, index=df.index,
                      columns=ohe.get_feature_names_out(['Region','Type']))

#Join the dataframe back
df = pd.concat([df_ohe,df.iloc[:,2:]], axis=1)

We can then spin off the datasets - one for 'Great Britain' (GB), which covers England, Wales, and Scotland, and a separate one for England alone.
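
The split relies on the two directions of `dropna`: dropping columns keeps every constituency but loses the England-only features, while dropping rows keeps every feature but loses the non-English constituencies. A minimal sketch on made-up data (the column `EnglandOnlyKPI` is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up data: the last feature is only available for England
toy = pd.DataFrame({'Region': ['England', 'England', 'Scotland'],
                    'Wage': [500.0, 450.0, 480.0],
                    'EnglandOnlyKPI': [0.7, 0.6, np.nan]})

toy_gb = toy.dropna(axis=1)   # drop any column containing an NA
toy_eng = toy.dropna(axis=0)  # drop any row containing an NA

print(toy_gb.shape, toy_eng.shape)
```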

In [None]:
#Great Britain dataset will remove columns where we have NAs
df_gb = df.dropna(axis=1)

#England dataset will remove rows where we have NAs
df_eng = df.dropna(axis=0)

Let's check that this has worked as expected.

In [None]:
print('* GB dataframe Scotland/Wales Const: ',
      df_gb['Region_Scotland'].sum()+df_gb['Region_Wales'].sum())
print('* GB dataframe NAs: ', df_gb.isna().sum().sum())


print('\n* England dataframe Scotland/Wales Const: ',
      df_eng['Region_Scotland'].sum()+df_eng['Region_Wales'].sum())
print('* England dataframe NAs: ', df_eng.isna().sum().sum())

So we see that Scotland and Wales are included in the GB dataframe, but not in the England dataframe, and we have no NAs in either dataframe. This is what we want.

Let's save these datasets down for use later.

In [None]:
df_gb.to_csv('Datasets/data_gb.csv')
df_eng.to_csv('Datasets/data_eng.csv')

We need to do a bit more cleaning before we can let scikit-learn loose on the data.

In [None]:
#Let's separate the independent and dependent variables for both dataframes
X_gb = df_gb.iloc[:,:-3]
y17_gb = df_gb['Winner_17']
y17_st_gb = df_gb['seat_types_17']
y19_st_gb = df_gb['seat_types_yg']

X_eng = df_eng.iloc[:,:-3]
y17_eng = df_eng['Winner_17']
y17_st_eng = df_eng['seat_types_17']
y19_st_eng = df_eng['seat_types_yg']

# 3) Cluster Analysis

### K-Means Clustering
Remember - with the clustering analysis we're simply trying to see which constituencies are most similar based on their characteristics. We would then want to compare these clusters, to see if they also tend to vote along similar lines.

Given that the clustering algorithms are predicated on distance metrics, the first thing we need to do when we cluster is to perform feature scaling.
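
To see why scaling matters, here is a small sketch with made-up numbers: one feature measured in the hundreds and one measured as a share between 0 and 1. The helper names `standardise` and `dist` are hypothetical; the z-score computed here uses the population standard deviation (NumPy's default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Made-up rows: [wage-like feature, share-like feature]
X = np.array([[200.0, 0.10],
              [210.0, 0.90],
              [800.0, 0.12]])


def standardise(X):
    """Rescale each column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))


Z = standardise(X)

# Before scaling, the large-valued first column dominates the distances
print(dist(X[0], X[1]), dist(X[0], X[2]))
# After scaling, both columns contribute on a comparable footing
print(dist(Z[0], Z[1]), dist(Z[0], Z[2]))
```

On the raw data, rows 0 and 1 look close even though their second feature differs hugely, simply because that feature lives on a much smaller scale.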

In [None]:
scaler = StandardScaler()
X_gb_scaled = scaler.fit_transform(X_gb)
X_eng_scaled = scaler.fit_transform(X_eng)

Let's perform the clustering analysis with the overall Great Britain dataset and England datasets separately.

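We'll run the KMeans algorithm over a range of values of k (the number of clusters).

We use the Calinski Harabasz score (the ratio of between-cluster to within-cluster dispersion, where higher is better) as a metric because it needs no 'true' labels: we're not actually interested in the algorithm *correctly* assigning constituencies to the 'right' cluster - rather, we are looking for anomalous labels.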
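
For intuition, the score can be written out by hand. The sketch below follows the standard definition: between-cluster dispersion divided by (k - 1), over within-cluster dispersion divided by (n - k). The function name `calinski_harabasz` and the toy points are assumptions for illustration; in the notebook itself we use scikit-learn's `calinski_harabasz_score`.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    unique = np.unique(labels)
    n, k = len(X), len(unique)
    overall_mean = X.mean(axis=0)

    between, within = 0.0, 0.0
    for c in unique:
        members = X[labels == c]
        centre = members.mean(axis=0)
        # Spread of the cluster centres around the overall mean
        between += len(members) * np.sum((centre - overall_mean) ** 2)
        # Spread of the points around their own cluster centre
        within += np.sum((members - centre) ** 2)

    return (between / (k - 1)) / (within / (n - k))


# Two obvious groups of made-up points
pts = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
good = calinski_harabasz(pts, [0, 0, 1, 1])  # sensible grouping
bad = calinski_harabasz(pts, [0, 1, 0, 1])   # groups split across clusters
print(good, bad)
```

Tight, well-separated clusters give a large score, while a labelling that mixes the two groups scores close to zero.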

In [None]:
#Declare k_values
k_values = range(3,50)

#Initiate some empty lists for the GB dataset
km_preds_gb = []
km_cs_scores_gb = []

#Iterate through these
for k in k_values:
    #Instantiate and run a KMeans algorithm
    k_means = KMeans(n_clusters=k)
    k_means.fit(X_gb_scaled)
    
    #Store the predicted labels
    km_preds_gb.append(k_means.predict(X_gb_scaled))
    
    #Evaluate and store the clusters' Calinski Harabasz score
    cs_score = calinski_harabasz_score(X_gb_scaled, k_means.labels_)
    km_cs_scores_gb.append(cs_score)
    

#Now initiate some empty lists for the English dataset
km_preds_eng = []
km_cs_scores_eng = []

#Iterate through these values of k
for k in k_values:
    #Instantiate and run a KMeans algorithm
    k_means = KMeans(n_clusters=k)
    k_means.fit(X_eng_scaled)
    
    #Store the predicted labels
    km_preds_eng.append(k_means.predict(X_eng_scaled))
    
    #Evaluate and store the clusters' Calinski Harabasz score
    cs_score = calinski_harabasz_score(X_eng_scaled, k_means.labels_)
    km_cs_scores_eng.append(cs_score)

### Hierarchical Agglomerative Clustering

Let's now create clusters using HAC.

Once we do this, we'll be able to compare the different clusters created by each algorithm, and see which ones came out better. Let's use the same approach that we used for the k-means algorithm above - deploying the algorithms for both the GB dataset, and the England dataset separately.

In [None]:
#Initiate some empty lists for the GB dataset
hac_preds_gb = []
hac_cs_scores_gb = []

#Iterate through the k values we'd previously defined
for k in k_values:
    #Instantiate and run a HAC algorithm
    agg_clust = AgglomerativeClustering(n_clusters=k)
    preds = agg_clust.fit_predict(X_gb_scaled)
    
    #Store the predicted labels
    hac_preds_gb.append(preds)
    
    #Evaluate and store the clusters' Calinski Harabasz score
    cs_score = calinski_harabasz_score(X_gb_scaled, preds)
    hac_cs_scores_gb.append(cs_score)
    

#Now initiate some empty lists for the English dataset
hac_preds_eng = []
hac_cs_scores_eng = []

for k in k_values:
    #Instantiate and run a HAC algorithm
    agg_clust = AgglomerativeClustering(n_clusters=k)
    preds = agg_clust.fit_predict(X_eng_scaled)
    
    #Store the predicted labels
    hac_preds_eng.append(preds)
    
    #Evaluate and store the clusters' Calinski Harabasz score
    cs_score = calinski_harabasz_score(X_eng_scaled, preds)
    hac_cs_scores_eng.append(cs_score)

In [None]:
plt.figure(figsize=(10,6))

alpha = 1

plt.plot(k_values, km_cs_scores_gb, alpha=alpha,
            label='GB Dataset - K-Means Clustering')

plt.plot(k_values, km_cs_scores_eng, alpha=alpha,
            label='England Dataset - K-Means Clustering')

plt.plot(k_values, hac_cs_scores_gb, alpha=alpha,
            label='GB Dataset - HAC')

plt.plot(k_values, hac_cs_scores_eng, alpha=alpha,
            label='England Dataset - HAC')

plt.vlines(x=12, ymin=0, ymax=150, color='red')
plt.ylim(20,150)

plt.legend(frameon=False)

plt.title('Calinski Harabasz Scores for K-Means and HAC Clustering')
plt.xlabel('Number of Clusters')
plt.ylabel('Calinski Harabasz Score');

So in both cases, it seems like we hit the 'elbow' point at around 10-12 clusters, depending on the dataset. It does seem that the algorithm has an easier time clustering the English dataset. We would expect this - it has more features to work with.

Moreover, HAC seems to produce better clusters at lower values of k (though the K-Means and HAC CS scores converge for their respective datasets from about k=30 onwards).

We can also visualise our HAC clustering with dendrograms.

In [None]:
# use the ward() function
linkage_array = ward(X_gb_scaled)

# Now we plot the dendrogram for the linkage_array containing the distances between clusters
plt.figure(figsize=(12,5))
dendrogram(linkage_array, truncate_mode='lastp', p=12)

plt.title('Dendrogram of HAC Clustering on GB Dataset')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.show()

In [None]:
# use the ward() function
linkage_array = ward(X_eng_scaled)

# Now we plot the dendrogram for the linkage_array containing the distances between clusters
plt.figure(figsize=(12,5))
dendrogram(linkage_array, truncate_mode='lastp', p=9)

plt.title('Dendrogram of HAC Clustering on England Dataset')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.show()

### Investigating Our Clusters

Now we have our clusters, let's investigate them further by seeing how the different seat types break down amongst them. We will use the clusters from the HAC algorithm, seeing as this algorithm seemed to perform better.

First, let's get our clusters at a GB level.
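
The table we are about to build is a cross-tabulation: seat types down the side, cluster labels across the top, and counts in the cells. As a minimal sketch on made-up labels, `pd.crosstab` gives the same kind of table (the notebook itself uses `pd.pivot_table` with `aggfunc=len`):

```python
import pandas as pd

# Made-up seat types and cluster labels, purely for illustration
toy = pd.DataFrame({'seat_type': ['con safe', 'con safe', 'lab safe',
                                  'con lab marginal', 'lab safe'],
                    'Cluster': [0, 0, 1, 0, 1]})

# Count how many seats of each type fall into each cluster
table = pd.crosstab(toy['seat_type'], toy['Cluster'])
print(table)
```

Zeros in such a table mean that a seat type never appears in a cluster; small non-zero counts are the anomalies worth a closer look.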

In [None]:
#k_values starts at 3, so index 7 holds the labels for k=10 clusters
gb_cluster_yg = pd.concat([y19_st_gb,
                           pd.DataFrame(hac_preds_gb[7],
                                        index=y17_st_gb.index,
                                        columns=['Cluster'])],
                           axis=1)

gb_cluster_yg_pivot = pd.pivot_table(data = gb_cluster_yg,
                                     index='seat_types_yg',
                                     columns='Cluster',
                                     aggfunc=len,
                                     fill_value=0)

gb_cluster_yg_pivot

Let's now see how these clusters are arranged on the map.

In [None]:
gb_cluster_yg['Colors'] = gb_cluster_yg['Cluster'].map(
    lambda x: color_list[x])

In [None]:
kpi_map('Population',
        colors=gb_cluster_yg[['Colors']],
        data=pd.read_csv('Datasets/data_with_targets.csv').set_index('Unnamed: 0'),
        exc_color='#ffffff',
        exclude=['Northern Ireland'],
        width=11)

And now the England clusters.

In [None]:
#k_values starts at 3, so index 6 holds the labels for k=9 clusters
eng_cluster_yg = pd.concat([y19_st_eng,
                           pd.DataFrame(hac_preds_eng[6],
                                        index=y17_st_eng.index,
                                        columns=['Cluster'])],
                           axis=1)

eng_cluster_yg_pivot = pd.pivot_table(data = eng_cluster_yg,
                                     index='seat_types_yg',
                                     columns='Cluster',
                                     aggfunc=len,
                                     fill_value=0)

eng_cluster_yg_pivot

In [None]:
party_list = ['con safe','con lab marginal',
              'lab safe','ld safe',
              'con ld marginal','lab ld marginal',
              'green safe']

party_colors = [con_blue, con_lab, lab_red, lib_yel,
               con_lib, lab_lib, green_green]

eng_cluster_yg_pivot=eng_cluster_yg_pivot.reindex(party_list)

The number of zeros in these matrices suggests that the clustering has been somewhat successful. However, the political value in this exercise is spotting the low numbers.

For example, if we look at cluster 3, we could say that these have the make-up of Conservative safe seats. So there's no reason why the Conservatives shouldn't push in the 30 constituencies that are not yet 'safe'.

Let's see this on a bar chart for clarity.
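
Before plotting, the counts are converted into shares so that each cluster's bar sums to 100%. Dividing a dataframe by its column sums does this in one step, as this sketch on made-up counts shows:

```python
import pandas as pd

# Made-up counts: rows are seat types, columns are cluster labels
counts = pd.DataFrame({0: [30, 10], 1: [5, 45]},
                      index=['con safe', 'lab safe'])

# counts.sum() gives one total per cluster; the division aligns on columns
shares = counts / counts.sum()
print(shares)
```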

In [None]:
eng_cluster_bar = (eng_cluster_yg_pivot / eng_cluster_yg_pivot.sum()).T

eng_cluster_bar.plot.barh(stacked=True,
                          figsize=(12,5),
                          width=0.8,
                          color=party_colors)

plt.xticks(ticks = np.arange(0,1.2,0.2), labels = np.arange(0,120,20))
plt.xlim(0,1);

plt.title('English Constituency Clusters');
plt.ylabel('Cluster Number')

plt.legend(bbox_to_anchor=(0.85, -0.1),
           ncol=4,
           frameon=False);

Let's now see how these clusters are arranged on the map.

In [None]:
#Create a new column that assigns a color to each cluster
eng_cluster_yg['Colors'] = eng_cluster_yg['Cluster'].map(
    lambda x: color_list[x])

In [None]:
kpi_map('Population',
        colors=eng_cluster_yg[['Colors']],
        data=pd.read_csv('Datasets/data_with_targets.csv').set_index('Unnamed: 0'),
        exc_color='#ffffff',
        exclude=['Northern Ireland','Scotland','Wales'],
        title='English Constituency Clusters',
        width=10)

We can identify the seats in cluster 3 that are not Conservative safe seats.

In [None]:
df_name[list(eng_cluster_yg.loc[(eng_cluster_yg['Cluster']==3)
                  &(eng_cluster_yg['seat_types_yg']!='con safe')].index)]

And the sole cluster 3 constituency that is classified as a safe Labour seat...

In [None]:
df_name[list(eng_cluster_yg.loc[(eng_cluster_yg['Cluster']==3)
                  &(eng_cluster_yg['seat_types_yg']=='lab safe')].index)]

So we know which seats the Conservatives might want to chase. But what messages should they be pushing to get them over the line in these places?

To help answer this last question, let's create a function that will print out a dashboard of KPIs for each cluster that we've created. Thus we will be able to see the key unifying characteristics of each cluster, which allows local messaging to be targeted more effectively.
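
The heart of such a dashboard is just a mean per KPI per cluster. As a compact sketch of that idea on made-up numbers, a pandas `groupby` does the same job as looping over the cluster labels (the variable names and values here are hypothetical):

```python
import pandas as pd

# Made-up KPI values for four constituencies
data = pd.DataFrame({'2019Wage': [500.0, 520.0, 610.0, 590.0],
                     '%LeaveVote': [0.60, 0.64, 0.40, 0.44]},
                    index=['E1', 'E2', 'E3', 'E4'])

# Made-up cluster labels, indexed by the same constituency IDs
clusters = pd.Series([0, 0, 1, 1], index=data.index, name='Cluster')

# Mean of each KPI within each cluster; transpose so KPIs are rows
cluster_means = data.groupby(clusters).mean().T
print(cluster_means)
```

The functions below build the same table step by step, which makes it easier to pass in different cluster columns.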

In [None]:
def cluster_IDs(cluster, cluster_data, cluster_col='Cluster'):
    
    '''
    Takes a cluster label (1, 2, 3, etc.) and a cluster dataframe,
    and returns a list of constituency IDs that belong to the given cluster
    '''
    return list(cluster_data.loc[cluster_data[cluster_col]==cluster].index)



def cluster_kpi(kpi, data, cluster_data, cluster_col='Cluster'):
    
    '''
    For a given set of cluster labels, and a given KPI, returns a
    dataframe with one row, for that given kpi
    '''
    
    #Find the unique cluster labels
    cluster_labels = sorted(list(cluster_data[cluster_col].unique()))
    
    #Declare an empty list to store the KPI data in
    kpi_values = []
    
    #Iterate through the labels
    for i in cluster_labels:
        #Work out which constituencies are in the cluster
        cluster_index = cluster_IDs(cluster = i,
                                    cluster_data = cluster_data,
                                    cluster_col = cluster_col)
        
        #Go to the data table, and find the mean for that kpi
        #for those constituencies. Append to the list
        mean = data.loc[cluster_index, kpi].mean()
        kpi_values.append(mean)
        
    #Create and return a dataframe as required
    return pd.DataFrame([kpi_values],
                        columns = cluster_labels,
                        index = [kpi])



def cluster_kpis(kpis, data, cluster_data, cluster_col='Cluster'):
    
    '''
    For a list of KPIs, return a dataframe showing mean
    values on a cluster by cluster basis
    
    '''
    
    #Declare an empty dataframe
    df_temp = pd.DataFrame()
    
    #For each kpi, find the average cluster means, and append to the dataframe
    for i in kpis:
        cluster_values = cluster_kpi(kpi=i,
                                     data=data,
                                     cluster_data=cluster_data,
                                     cluster_col=cluster_col)
        
        df_temp = pd.concat([df_temp, cluster_values])
        
    return df_temp
        
        

def heatmap_cluster_kpis(kpis, data,
                       cluster_data,
                       size=0.4,
                       cmap=CB91_Grad_BP,
                       cluster_col='Cluster'):
    
    '''
    Heatmap the table produced by the cluster_kpis function
    '''
    
    #Calculate the required table and transpose
    df_temp = cluster_kpis(kpis=kpis,
                           data=data,
                           cluster_data=cluster_data,
                           cluster_col=cluster_col).T
    
    #For each kpi, scale as required
    scaler = MinMaxScaler()
    scaler.fit(df_temp)
    df_scale = scaler.transform(df_temp)
    
    df_scale = pd.DataFrame(df_scale,
                            columns=df_temp.columns,
                            index=df_temp.index)
    
    height = size*len(df_scale.index)
    width = 1.1*size*len(df_scale.columns)
    
    plt.figure(figsize=(width,height))
    sns.heatmap(df_scale,cbar=True,cmap=cmap,
               yticklabels=['A','B','C','D','E','F','G','H','I'])
    plt.yticks(rotation='horizontal')
    

    
#Define a standard list of KPIs to look at
kpis_gb=['PopDensity', 'Type_Large City', 'Type_Large Town',
      'Type_Rural', 'Type_Small City', 'Type_Small Town',
      '2019Wage', 'HousePricePerWage', '%HousePriceGrowth',
      '%OwnOutright','%OwnWithMort', '%PrivateRent',
      '%SocialHousing','%Unemployment', 'UnemploymentChange',
      '%Heavy Industry & Manufacturing', '%Wholesale & Retail',
      '%FS & ICT', '%White', '%Muslim', '%BornUK', '%BornOtherEU',
      '%Level4+', '%LeaveVote']

kpis_eng=['PopDensity', 'Type_Large City', 'Type_Large Town',
          'Type_Rural', 'Type_Small City', 'Type_Small Town',
          '2019Wage', 'HousePricePerWage', '%HousePriceGrowth',
          '%OwnOutright','%OwnWithMort', '%PrivateRent',
          '%SocialHousing','%Unemployment', 'UnemploymentChange',
          '%Heavy Industry & Manufacturing', '%Wholesale & Retail',
          '%FS & ICT', '%White', '%BornUK', '%BornOtherEU', '%Muslim',
          '%Level4+', '%ChildcareGood', 'LASpendGrowth15to19',
          'DiseasesPerPop', 'Depression','%17Turnout','%LeaveVote']

Let's build a heatmap of the different clusters, which ranks each cluster along a range of KPIs.
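
The ranking comes from min-max scaling each KPI across the clusters, so that the lowest cluster mean maps to 0 and the highest to 1. A minimal sketch on made-up cluster means (this is the same arithmetic that `MinMaxScaler` applies with its default range):

```python
import pandas as pd

# Made-up cluster means: rows are clusters, columns are KPIs
means = pd.DataFrame({'2019Wage': [510.0, 600.0, 555.0],
                      '%LeaveVote': [0.62, 0.42, 0.52]},
                     index=[0, 1, 2])

# Scale each column separately to the range [0, 1]
scaled = (means - means.min()) / (means.max() - means.min())
print(scaled)
```

Because each KPI is scaled on its own, the colours compare clusters within a KPI, not KPIs against each other.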

In [None]:
heatmap_cluster_kpis(kpis=kpis_eng,
        data=df_eng,cmap=CB91_Grad_BWP[::-1],size=0.5,
        cluster_data=eng_cluster_yg)

plt.title('Heatmap of English Constituency Clusters by KPI\n\
(0.0=Lowest Cluster For That KPI, 1.0=Highest Cluster For That KPI)\n')
plt.ylabel('Cluster Number');

So what can we say about Cluster 3?

* The inhabitants live in small towns, and have relatively high rates of home ownership, with low levels of social housing. However, house prices haven't grown strongly.
* A high share of people work in Heavy Industry, and wages are below average.
* Most people are white and UK-born.
* Early-years childcare is decent, but education levels are generally low.
* The trend in local authority funding is poor (likely as the result of government cuts).
* There is a high incidence of chronic illness.
* Many people voted to leave the EU, and turnout at the previous general election (2017) was relatively high.

So a Conservative candidate campaigning in these specific seats can use this information to tailor their campaign messaging.