<a href="https://colab.research.google.com/github/alextseng69/KMeans1/blob/main/unsupervised_learning_with_k_means_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using ML to Allocate Funding for Development Aid
## Unsupervised Learning with K-Means Clustering

- <a href='#intro'>1. Introduction</a>

- <a href='#2'>2. Libraries and datasets</a>
     - <a href='#21'>2.1 Import libraries and packages</a>
     - <a href='#22'>2.2 Import data</a>
     
- <a href='#3'>3. Data description and distribution</a>
    - <a href='#31'>3.1. Data description</a> 
    - <a href='#32'>3.2. Data distribution</a>
    
- <a href='#4'>4. Data evaluation and reduction</a>
    - <a href='#41'>4.1. Correlation</a>
    - <a href='#42'>4.2. Scaling</a> 
    - <a href='#43'>4.3. PCA: Principal Component Analysis</a> 

- <a href='#5'>5. Model: K-Means Clustering</a>
    - <a href='#51'>5.1. Model set up</a>
    - <a href='#52'>5.2. Optimal number of clusters: Elbow Method</a>
    - <a href='#53'>5.3. Optimal number of clusters: Silhouette Method</a>

- <a href='#6'>6. Cluster analysis</a>
    - <a href='#61'>6.1. Cluster plotting and visualisation</a>
    - <a href='#62'>6.2. Cluster characteristics</a>
    - <a href='#63'>6.3. Cluster descriptions</a>
    - <a href='#64'>6.4. Clusters and their location in the world</a>
    
- <a href='#7'>7. Further analysis to complement clustering </a>
    - <a href='#71'>7.1. Dropping features with high correlation</a>   
    - <a href='#72'>7.2. Further analysis of clusters</a>  
    - <a href='#73'>7.3. Linear regression</a>   
    - <a href='#74'>7.4. Further clustering of clusters</a>   
- <a href='#8'>8. Answer to the question and learnings</a>

- <a href='#9'>9. References</a>

## <a id='intro'>1. Introduction</a>

**Background**

According to the International Monetary Fund (IMF), *development aid* is aid given by governments and other agencies to support the economic, environmental, social, and political development of developing countries.


**Problem Statement (taken from Dataset)**

HELP International have been able to raise around 10 million dollars. Now the CEO of the NGO needs to decide how to use this money strategically and effectively. 

So the CEO has to decide which countries are in the direst need of aid. 

Hence, the goal is to categorise the countries using some socio-economic and health factors that determine the overall development of the country. Then you need to suggest the countries which the CEO needs to focus on the most.

### Which countries should receive funding and why?

## <a id='2'>2. Libraries and datasets</a>

### <a id='21'>2.1. Import libraries and packages</a>

In [1]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# scaling 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# kmeans clustering 
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer

# geo data
import geopandas as gpd
from geopandas import GeoDataFrame as gdf
import plotly.express as px


### <a id='22'>2.2. Import data</a>

In [2]:
data_path = '../input/unsupervised-learning-on-country-data'

In [3]:
data = pd.read_csv(
    f'{data_path}/Country-data.csv')


## <a id='3'>3. Data description and distribution</a>

### <a id='31'>3.1. Data description</a>

**Feature Description** 

* country:      Name of the country

* child_mort:   Death of children under 5 years of age per 1000 live births

* exports:      Exports of goods and services per capita. Given as %age of the GDP per capita

* health:       Total health spending per capita. Given as %age of GDP per capita

* imports:      Imports of goods and services per capita. Given as %age of the GDP per capita

* income:       Net income per person

* inflation:    The measurement of the annual growth rate of the Total GDP

* life_expec:   The average number of years a newborn child would live if the current mortality patterns are to remain the same

* total_fer:    The number of children that would be born to each woman if the current age-fertility rates remain the same

* gdpp:         The GDP per capita. Calculated as the Total GDP divided by the total population

In [None]:
# quick view of columns and values
data.head()

In [None]:
# how many columns and rows in dataframe
data.shape

In [None]:
# are there any missing values?
data.isnull().sum()

In [None]:
# are there duplicate rows? (counts rows that repeat an earlier row)
data.duplicated().sum()

In [None]:
# standard statistical measures
data.describe(percentiles = [.25, .5, .75, .90 ,.95, .99])

**Findings**

* small dataset
* no missing values
* no duplicate values
* some outliers and skewed distribution

### <a id='32'>3.2. Data distribution</a>

In [None]:
plt.figure(figsize=(12,5))
plt.title("Child Mortality: Death of children under 5 years of age per 1000 live births")
ax = sns.histplot(data["child_mort"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Exports: Exports of goods and services per capita. Given as %age of the GDP per capita")
ax = sns.histplot(data["exports"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Imports: Imports of goods and services per capita. Given as %age of the GDP per capita")
ax = sns.histplot(data["imports"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Health: Total health spending per capita. Given as %age of GDP per capita")
ax = sns.histplot(data["health"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Income: Net income per person")
ax = sns.histplot(data["income"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Inflation: The measurement of the annual growth rate of the Total GDP")
ax = sns.histplot(data["inflation"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Life expectancy: The average number of years a new born child would live if the current mortality patterns are to remain the same")
ax = sns.histplot(data["life_expec"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("Total fertility: The number of children that would be born to each woman if the current age-fertility rates remain the same.")
ax = sns.histplot(data["total_fer"])

In [None]:
plt.figure(figsize=(12,5))
plt.title("GDP: The GDP per capita. Calculated as the Total GDP divided by the total population.")
ax = sns.histplot(data["gdpp"])

**Findings**

Looking at the data distribution we can see that there are some features that do indeed have outliers.

For the purpose of this analysis, outliers will not be removed since they could be considered very informative in that they could point out countries that are in critical condition and in need of help.

For example, Child Mortality is a strong indicator of poverty and necessity, so the outliers in this feature show that there are countries with a higher than normal/critical number in child mortality.
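
One common way to make "higher than normal" concrete is Tukey's 1.5 × IQR rule. The sketch below is only an illustration: the child-mortality values are made up rather than taken from the dataset, and `iqr_upper_outliers` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def iqr_upper_outliers(values, k=1.5):
    """Return a boolean mask marking values above Q3 + k * IQR (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    upper_fence = q3 + k * (q3 - q1)
    return values > upper_fence

# Illustrative values only (deaths per 1000 live births), not real data
child_mort = np.array([4.0, 6.0, 10.0, 15.0, 20.0, 30.0, 45.0, 60.0, 150.0, 208.0])
mask = iqr_upper_outliers(child_mort)

# The flagged rows are the ones worth a closer look, not rows to delete
print(child_mort[mask])
```

Flagging instead of dropping keeps the extreme countries in the data, in line with the decision above not to remove outliers.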
 

## <a id='4'>4. Data evaluation and reduction</a>

### <a id='41'>4.1. Correlation</a>

In [None]:
# pearson (the text column 'country' is dropped because correlation needs numeric data)
plt.figure(figsize=(15,10))
sns.heatmap(data.drop(columns='country').corr(method='pearson'),annot=True)

In [None]:
# kendall
plt.figure(figsize=(15,10))
sns.heatmap(data.drop(columns='country').corr(method='kendall'),annot=True)

In [None]:
# spearman
plt.figure(figsize=(15,10))
sns.heatmap(data.drop(columns='country').corr(method='spearman'),annot=True)

**Findings** 

Are there features that we could do without because they are highly correlated with another feature?

The Pearson, Kendall and Spearman correlations all point to a few features that might be considered for elimination:

- life_expec, due to high correlation with child_mort
- total_fer, due to high correlation with child_mort
- income, due to high correlation with gdpp
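
Reading pairs off a heatmap by eye can be complemented with a small check that lists them. The following is a sketch on synthetic data: the column names only mimic the dataset, the 0.8 threshold is an assumption, and `correlated_pairs` is a hypothetical helper.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation reaches the threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return pairs

# Synthetic stand-ins for three columns, not the real dataset
rng = np.random.default_rng(0)
income = rng.normal(size=200)
gdpp = income + 0.1 * rng.normal(size=200)  # nearly a copy of income
health = rng.normal(size=200)               # unrelated to the other two
X = np.column_stack([income, gdpp, health])

pairs = correlated_pairs(X, ["income", "gdpp", "health"])
print(pairs)
```

Only one feature of each listed pair would be a candidate for removal, since the other carries almost the same information.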


### <a id='42'>4.2. Scaling</a>

Why scale the data in this case? 

* the features have incomparable units (percentages, dollar values, counts)
* the ranges differ widely (one feature runs from about 0 to 200, another from about 0 to 100,000), so a change of 50 is significant in one feature and almost unnoticeable in another
* K-Means is based on distances, so without scaling the features with the largest magnitudes dominate the distance calculation and effectively receive more weight
* scaling removes this bias towards features with larger magnitudes
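
The points above can be illustrated with a minimal sketch. The three rows are made-up values, and `min_max`, `z_score` and `dist` are hypothetical helpers written out by hand to show the formulas; they are assumed to match what `MinMaxScaler` and `StandardScaler` compute with their default settings.

```python
import numpy as np

# Made-up rows: [child_mort (per 1000 births), income (dollars)]
X = np.array([[10.0, 40000.0],
              [90.0, 41000.0],
              [12.0, 1000.0]])

def min_max(X):
    """Normalisation: (x - min) / (max - min), column by column."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def z_score(X):
    """Standardisation: (x - mean) / std, column by column (population std)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Unscaled: the income column dominates the distance
raw_01 = dist(X[0], X[1])  # rows differ mainly in child_mort
raw_02 = dist(X[0], X[2])  # rows differ mainly in income

# Scaled: both features contribute on a comparable footing
N = min_max(X)
scaled_01 = dist(N[0], N[1])
scaled_02 = dist(N[0], N[2])

Z = z_score(X)
print(raw_01, raw_02, scaled_01, scaled_02)
```

Before scaling, the large child-mortality gap between the first two rows barely registers next to the income gap; after scaling, the two distances are of similar size.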


In [None]:
# eliminate the column that contains the country information, as only numeric values should be used in this case for unsupervised learning
dataset = data.drop(['country'], axis =1)
dataset.head()

#### Scale the data: MinMaxScaler (normalised)

In [None]:
# columns argument ==> we'll use this later to create a new dataframe with the rescaled data 
columns = dataset.columns

# the scaler to use will be 
scaler = MinMaxScaler()

# 'scaler' is for the rescaling technique, 'fit' function is to find the x_min and the x_max, 'transform' function applies formula to all elements of data
rescaled_dataset_minmax = scaler.fit_transform(dataset)
rescaled_dataset_minmax

#### Scale the data: StandardScaler (standardised)

In [None]:
# standardisation rescales each feature to mean=0 and standard deviation=1
# (it does not change the shape of the distribution)
#
# columns argument ==> we'll use this later to create a new dataframe with the rescaled data 
columns = dataset.columns

# the scaler to use will be 
scaler = StandardScaler()

# 'fit' computes the mean and standard deviation of each feature, 'transform' applies (x - mean) / std to all elements of data
rescaled_dataset_standard = scaler.fit_transform(dataset)
rescaled_dataset_standard

#### Scaled dataframes

In [None]:
# minmax
# we need to create a new dataframe with the column labels and the rescaled values 
df_minmax = pd.DataFrame(data= rescaled_dataset_minmax , columns = columns )
df_minmax

In [None]:
# standardisation
# we need to create a new dataframe with the column labels and the rescaled values 
df_standard = pd.DataFrame(data= rescaled_dataset_standard , columns = columns)
df_standard

#### Comparing scaling methods

In [None]:
plt.scatter(df_standard['gdpp'], df_standard['child_mort'],color = 'black')

plt.title('Standardised data (StandardScaler)')
plt.xlabel('GDP per Person')
plt.ylabel('Child Mortality')

In [None]:
plt.scatter(df_minmax['gdpp'], df_minmax['child_mort'],color = 'black')

plt.title('Normalised data (MinMaxScaler)')
plt.xlabel('GDP per Person')
plt.ylabel('Child Mortality')

### <a id='43'>4.3. PCA: Principal Component Analysis</a>

#### PCA with data scaled with StandardScaler

In [None]:
# import PCA 
from sklearn.decomposition import PCA

# fit and transform
pca = PCA()
pca.fit(df_standard)
pca_data_standard = pca.transform(df_standard)

# percentage variation 
per_var = np.round(pca.explained_variance_ratio_*100, decimals =1)
labels = ['PC' + str(x) for x in range (1, len(per_var)+1)]

# plot the percentage of explained variance by principal component
plt.bar(x=range(1,len(per_var)+1), height=per_var, tick_label = labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()

# plot pca
pca_df_standard = pd.DataFrame(pca_data_standard, columns = labels)
plt.scatter(pca_df_standard.PC1, pca_df_standard.PC2)
plt.title('PCA')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))

#### PCA with data scaled with MinMaxScaler

In [None]:
# import PCA 
from sklearn.decomposition import PCA

# fit and transform
pca = PCA()
pca.fit(df_minmax)
pca_data_minmax = pca.transform(df_minmax)

# percentage variation 
per_var = np.round(pca.explained_variance_ratio_*100, decimals =1)
labels = ['PC' + str(x) for x in range (1, len(per_var)+1)]

# plot the percentage of explained variance by principal component
plt.bar(x=range(1,len(per_var)+1), height=per_var, tick_label = labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()

# plot pca

pca_df_minmax = pd.DataFrame(pca_data_minmax, columns = labels)
plt.scatter(pca_df_minmax.PC1, pca_df_minmax.PC2)
plt.title('PCA')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))

In [None]:
# dataframe with PC1, PC2, PC3, PC4 (from the standardised data)
data2 = pca_df_standard.drop(['PC5','PC6','PC7','PC8','PC9'], axis = 1)
data2

**Findings**

After doing PCA with both the standardised and the normalised versions of the original dataset, we can see that 4 principal components explain about 90% of the variance of the original data.
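
The "how many components reach about 90%" reading can also be computed from the cumulative explained variance instead of being read off the scree plot. The sketch below uses synthetic data and plain NumPy; `explained_variance_ratio` and `components_needed` are hypothetical helpers, and the 0.90 target is an assumption taken from the finding above.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)                       # PCA works on centred data
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2              # proportional to component variances
    return variances / variances.sum()

def components_needed(ratios, target=0.90):
    """Smallest number of leading components whose cumulative share reaches target."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, target) + 1)

# Synthetic data: 6 observed features driven by 2 hidden factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

ratios = explained_variance_ratio(X)
k = components_needed(ratios, target=0.90)
print(np.round(ratios * 100, 1), k)
```

With a fitted scikit-learn `PCA` object, the same cumulative sum can be taken over `pca.explained_variance_ratio_`.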


## <a id='5'>5. Model: K-Means Clustering</a>

### <a id='51'>5.1. Model set up</a>

In [None]:
km = KMeans(
    n_clusters = 3, # number of clusters/centroids to create
    init = 'random', # 'random': choose n_clusters observations (rows) at random from data for the initial centroids (the default is 'k-means++')
    n_init = 10, # number of times the k-means algorithm is run with different centroid seeds; the run with the lowest inertia is kept. 10 was the default in older scikit-learn releases, newer releases default to 'auto'
    max_iter = 300, # this is the default value. This is the maximum number of iterations of the k-means algorithm for a single run.
    tol = 1e-4, # this is the default value. This is the relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
    random_state = 0 # the default is None. Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
)
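
To show what `n_init`, `max_iter` and `tol` control, here is a simplified NumPy sketch of the k-means loop (Lloyd's algorithm) with random initialisation. It is not scikit-learn's implementation, which differs in details such as how the tolerance is scaled; `kmeans_once` and `kmeans` are hypothetical names and the data is synthetic.

```python
import numpy as np

def kmeans_once(X, k, max_iter, tol, rng):
    """One k-means run from a random start; returns labels, centroids, inertia."""
    # 'random' init: pick k distinct rows as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift <= tol:  # centroids stopped moving: converged
            break
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = float(d2[np.arange(len(X)), labels].sum())  # sum of squared distances
    return labels, centroids, inertia

def kmeans(X, k, n_init=10, max_iter=300, tol=1e-4, seed=0):
    """Run k-means n_init times and keep the run with the lowest inertia."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, max_iter, tol, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])

# Synthetic data: three well-separated blobs of 30 points each
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

labels, centroids, inertia = kmeans(X, k=3)
print(centroids.round(1), round(inertia, 1))
```

Several restarts matter because a single random start can converge to a poor local optimum; keeping the lowest-inertia run reduces that risk.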

#### Run model with different versions on the dataset

In [None]:
# normalised dataset
# method to compute the clusters and assign the labels
y_predicted_minmax = km.fit_predict(df_minmax) # fit_predict --> Compute cluster centers and predict cluster index for each sample.
y_predicted_minmax

In [None]:
# standardised dataset
# method to compute the clusters and assign the labels
y_predicted_standard = km.fit_predict(df_standard) # fit_predict --> Compute cluster centers and predict cluster index for each sample.
y_predicted_standard

In [None]:
# data2 is the original dataset with standard scaling and 4 principal components found with PCA
# method to compute the clusters and assign the labels
y_predicted_data2 = km.fit_predict(data2) # fit_predict --> Compute cluster centers and predict cluster index for each sample.
y_predicted_data2

In [None]:
# add the cluster column to the dataframe 
df_minmax['cluster'] = y_predicted_minmax
df_minmax.head()

In [None]:
# add the cluster column to the dataframe 
df_standard['cluster'] = y_predicted_standard
df_standard.head()

In [None]:
# add the cluster column to the dataframe
# note: 'dataset' holds the unscaled features (without 'country'); the labels come from the model fitted on the 4 principal components (data2)
dataset['cluster'] = y_predicted_data2
dataset.head()

### <a id='52'>5.2. Optimal number of clusters: Elbow Method</a>

In [None]:
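# calculate the Sum of Squared Errors (SSE), also called distortion or inertia, for a range of cluster counts - with df scaled with StandardScaler
# the 'cluster' column added earlier is dropped so that the labels are not used as a feature
X_standard = df_standard.drop(columns='cluster', errors='ignore')

sse = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(X_standard)
    sse.append(km.inertia_)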

# plot
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

In [None]:
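# calculate the Sum of Squared Errors (SSE), also called distortion or inertia, for a range of cluster counts - with df scaled with MinMax
# the 'cluster' column added earlier is dropped so that the labels are not used as a feature
X_minmax = df_minmax.drop(columns='cluster', errors='ignore')

sse = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(X_minmax)
    sse.append(km.inertia_)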

# plot
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

In [None]:
# calculate the Sum of Squared Errors (SSE), also called distortion or inertia, for a range of cluster counts - with df scaled with StandardScaler + PCA
# data2 holds the 4 principal components; 'dataset' holds the unscaled features and is not the PCA output
sse = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(data2)
    sse.append(km.inertia_)

# plot
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

**Findings**

After running the K-Means model with a normalised dataset, a standardised dataset, and a PCA-reduced dataset with 4 components (with standardised scaling), the elbow plots suggest that the optimal number of clusters is 3 in each case, at different levels of inertia. Two clusters could also be considered, based on the results for the PCA-reduced dataset.
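
The elbow is read off the plots by eye in this notebook. As a rough cross-check, one heuristic is to pick the cluster count where the SSE curve bends most sharply, measured by the largest second difference. The sketch below uses made-up SSE values and a hypothetical helper `elbow_by_second_difference`; it is one heuristic among several, not the method used above.

```python
import numpy as np

def elbow_by_second_difference(ks, sse):
    """Return the k at which the SSE curve has its largest second difference."""
    sse = np.asarray(sse, dtype=float)
    # second difference at interior points: sse[k-1] - 2*sse[k] + sse[k+1]
    second_diff = sse[:-2] - 2 * sse[1:-1] + sse[2:]
    # index 0 of second_diff corresponds to the second entry of ks
    return ks[int(np.argmax(second_diff)) + 1]

ks = list(range(1, 11))
# Made-up SSE values shaped like a typical elbow curve, not results from the dataset
sse_example = [1500, 1000, 600, 520, 460, 410, 370, 340, 315, 295]

best_k = elbow_by_second_difference(ks, sse_example)
print(best_k)
```

On real curves the bend is often less pronounced, which is why the silhouette method in the next section is a useful second opinion.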

### <a id='53'>5.3. Optimal number of clusters: Silhouette Method</a>

#### With standardised data



In [None]:
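# calculate Silhouette Score - standardised
# refit with 3 clusters on the features only, so that the labels match the data being scored
X_standard = df_standard.drop(columns='cluster', errors='ignore')
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=0)
score = silhouette_score(X_standard, km.fit_predict(X_standard), metric='euclidean')
print('Silhouette Score: %.3f' % score)

# A value near 0 represents overlapping clusters with samples very close to the decision boundary of the neighboring clusters. 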

In [None]:
fig,ax = plt.subplots(2,2, figsize = (15,8))
for i in [2,3,4,5]:

    # create kmeans instance for different numbers of clusters
    km = KMeans(n_clusters=i, init= 'random', n_init =10, max_iter = 300, random_state = 0)
    q, mod = divmod(i,2)
    
    #create visualiser
    visualizer = SilhouetteVisualizer(km, colors = 'yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(df_standard.drop(columns='cluster', errors='ignore'))

#### With normalised data

In [None]:
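# calculate Silhouette Score - normalised
# refit with 3 clusters on the features only, so that the labels match the data being scored
X_minmax = df_minmax.drop(columns='cluster', errors='ignore')
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=0)
score = silhouette_score(X_minmax, km.fit_predict(X_minmax), metric='euclidean')
print('Silhouette Score: %.3f' % score)

# A value near 0 represents overlapping clusters with samples very close to the decision boundary of the neighboring clusters. 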

In [None]:
fig,ax = plt.subplots(2,2, figsize = (15,8))
for i in [2,3,4,5]:

    # create kmeans instance for different numbers of clusters
    km = KMeans(n_clusters=i, init= 'random', n_init =10, max_iter = 300, random_state = 0)
    q, mod = divmod(i,2)
    
    #create visualiser
    visualizer = SilhouetteVisualizer(km, colors = 'yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(df_minmax.drop(columns='cluster', errors='ignore'))

#### With standardised data + PCA

In [None]:
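# calculate Silhouette Score - standardised + PCA
# refit with 3 clusters on the 4 principal components (data2), so that the labels match the data being scored
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=0)
score = silhouette_score(data2, km.fit_predict(data2), metric='euclidean')
print('Silhouette Score: %.3f' % score)

# A value near 0 represents overlapping clusters with samples very close to the decision boundary of the neighboring clusters. 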

In [None]:
fig,ax = plt.subplots(2,2, figsize = (15,8))
for i in [2,3,4,5]:

    # create kmeans instance for different numbers of clusters
    km = KMeans(n_clusters=i, init= 'random', n_init =10, max_iter = 300, random_state = 0)
    q, mod = divmod(i,2)
    
    #create visualiser
    visualizer = SilhouetteVisualizer(km, colors = 'yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(data2)

**Findings**

The silhouette scores indicate that the clusters overlap. With more clusters (5, for example) some samples have negative silhouette coefficients, which means they lie closer to a neighbouring cluster than to their own and may have been assigned to the wrong cluster.
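
To see why a negative value signals a possibly misassigned sample, the per-sample coefficient s = (b - a) / max(a, b) can be computed by hand, where a is the mean distance to the other members of the sample's own cluster and b is the mean distance to the nearest other cluster. The sketch below is a plain NumPy version on toy one-dimensional data; `silhouette_samples_np` is a hypothetical helper and is not tuned for large datasets.

```python
import numpy as np

def silhouette_samples_np(X, labels):
    """Per-sample silhouette coefficients using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to own cluster (self-distance is 0, so divide by n - 1)
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data: the last point (x = 3) is labelled 1 but sits next to cluster 0
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [3.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

scores = silhouette_samples_np(X, labels)
print(scores.round(3))
```

The last sample gets a negative coefficient because its own cluster is on average further away than the neighbouring one.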

## <a id='6'>6. Cluster analysis</a>

### <a id='61'>6.1. Cluster plotting and visualisation</a>

#### Visualise clusters by feature, scaled data with StandardScaler (standardisation)

In [None]:
# plot
sns.pairplot(df_standard, hue="cluster")

# title
plt.suptitle('Pair Plot of Clusters by Feature', 
             size = 20);

#### Visualise clusters by feature, scaled data with MinMaxScaler (normalisation)

In [None]:
# plot
sns.pairplot(df_minmax, hue="cluster")

# title
plt.suptitle('Pair Plot of Clusters by Feature', 
             size = 20);

#### Visualise clusters by feature, scaled data with StandardScaler and with reduction of features with PCA

In [None]:
# plot the 4 principal components, coloured by the clusters found on them
sns.pairplot(data2.assign(cluster=y_predicted_data2), hue="cluster")

# title
plt.suptitle('Pair Plot of Clusters by Principal Component', 
             size = 20);

**Findings**

After running the model with 2 types of scaling and using PCA, we can see there tends to be overlapping between clusters.
Cluster 2 is more spread out and clusters 0 and 1 tend to overlap.

### <a id='62'>6.2. Cluster characteristics</a>

In [None]:
# add cluster column to original dataset with countries and non-scaled values
data['cluster'] = y_predicted_standard.tolist()
data

#### Visualise clusters by feature, original data with no scaling

In [None]:
# plot
sns.pairplot(data, hue="cluster")

# title
plt.suptitle('Pair Plot of Clusters by Feature', 
             size = 20);

### <a id='63'>6.3. Cluster descriptions</a>

In [None]:
# table of clusters showing mean values per cluster and per feature
# ('country' is dropped because a mean cannot be computed for text)
clusters_table = data.drop(columns='country').groupby('cluster').mean()
clusters_table

In [None]:
# cluster 0 
cluster_0 = data.loc[data['cluster'] == 0]

# list of countries in this cluster
cluster_0.country.unique()

**Cluster 0: This cluster is characterised by showing average values for all features when comparing with other clusters**

- child mortality,    avg
- exports,            avg
- gdpp,               avg
- health,             same as cluster 1
- imports,            avg
- income,             avg
- inflation,          avg
- life_expec,         +70 years
- total_fer,          avg, 2 children per woman (number of children that would be born to each woman if the current age-fertility rates remain the same)

In [None]:
# cluster 1 
cluster_1 = data.loc[data['cluster'] == 1]

# list of countries in this cluster
cluster_1.country.unique()

**Cluster 1: This cluster is characterised by having the most negative values: highest child mortality, lowest economic development (low gdpp, exports and imports) and lowest life expectancy**

- child mortality, highest
- exports, lowest
- gdpp, lowest
- health, same as cluster 0
- imports, lowest
- income, significantly lower than other clusters
- inflation, highest
- life_expec, +50 years
- total_fer, highest, 5 children per woman (number of children that would be born to each woman if the current age-fertility rates remain the same)

In [None]:
# cluster 2 
cluster_2 = data.loc[data['cluster'] == 2]

# list of countries in this cluster
cluster_2.country.unique()

**Cluster 2: This cluster is characterised by showing really strong or positive values such as good economic development, high life expectancy, low child mortality**


- child mortality, lowest
- exports,  highest
- gdpp, highest by a lot
- health, higher than both other clusters
- imports, highest
- income, significantly higher than other clusters
- inflation, lowest
- life_expec, +80 years
- total_fer, lowest age-fertility rate, 1 child per woman (number of children that would be born to each woman if the current age-fertility rates remain the same)


### <a id='64'>6.4. Clusters and their location in the world</a>

In [None]:
# import latitude and longitude data
geo_data = pd.read_csv("/kaggle/input/latitude-and-longitude-for-every-country-and-state/world_country_and_usa_states_latitude_and_longitude_values.csv")

# drop columns
geo_data_trimmed = geo_data.drop(['country_code','usa_state_code','usa_state_latitude','usa_state_longitude','usa_state'], axis=1)

# add geo_data_trimmed df to df with clusters 
data_combined = pd.merge(
    geo_data_trimmed,
    data,
    on='country',
    how= 'inner'
)

# output 
data_combined.head(2)
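
An inner merge on country names silently drops every country whose name is spelled differently in the two tables. Before relying on the merged result, it can help to list the names that fail to match. The sketch below uses two tiny made-up tables (the coordinates are approximate and only illustrative) and the `indicator` option of `merge`.

```python
import pandas as pd

# Made-up stand-ins for the clustered data and the geo table
clusters = pd.DataFrame({
    "country": ["France", "United States", "Chad"],
    "cluster": [2, 2, 1],
})
geo = pd.DataFrame({
    "country": ["France", "United States of America", "Chad"],
    "latitude": [46.2, 37.1, 15.5],
    "longitude": [2.2, -95.7, 18.7],
})

# A left merge keeps every clustered country; '_merge' records whether a match was found
merged = clusters.merge(geo, on="country", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```

Countries listed as unmatched would need their names harmonised first, otherwise they are missing from the map without any warning.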

In [None]:
# https://geopandas.org/docs/user_guide/mapping.html
# load the example world map bundled with geopandas
# note: gpd.datasets was removed in GeoPandas 1.0, so this call needs an older release
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head()

# remove Antarctica
world = world[(world.pop_est>0) & (world.name!="Antarctica")]

# change column name
world_copy = world.copy()
world_copy.rename(columns = {'name' : 'country'}, inplace = True)
world_copy.head()

# append geodataframe data with data_combined data
world_data = pd.merge(
        data_combined,
        world_copy,
        on='country',
        how= 'inner'
)

# world_data
world_data.head(5)


# convert df into geodf

world_data = gdf(world_data)

# plot clusters
fig, ax = plt.subplots(1, 1)
world_data.plot(column = 'cluster',  cmap='Set2',  ax=ax, legend=True);

## <a id='7'>7. Further analysis to complement clustering </a>




We've evaluated the results of the clustering by: 

  a) plotting the relationship of features by cluster in Cluster plotting

  b) comparing average values of each feature in Cluster characteristics


Based on an initial assessment of the average values of each cluster, *Cluster 1* could be the focus for further analysis. However, when we plot the clusters, we see that some clusters overlap and others are spread out.

Utilising PCA as an alternative did not result in a significant difference.

We've been able to identify some patterns in the data and group countries into 3 clusters. However, we should not rely solely on this result to recommend which countries should receive funding. Here are some alternatives to explore first:


### <a id='71'>7.1. Dropping features with high correlation</a>



In [None]:
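# df without these features (the 'cluster' column added earlier is dropped too, so the labels are not used as a feature)
dataset_reduced = data.drop(['country','life_expec','total_fer','income','cluster'], axis =1, errors='ignore')
dataset_reduced.head()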

In [None]:
# scale with standard scaling
columns = dataset_reduced.columns

# the scaler to use will be 
scaler = StandardScaler()

rescaled_dataset_reduced = scaler.fit_transform(dataset_reduced)
rescaled_dataset_reduced

In [None]:
# standardisation
# we need to create a new dataframe with the column labels and the rescaled values 
df_reduced = pd.DataFrame(data= rescaled_dataset_reduced , columns = columns)
df_reduced


In [None]:
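# run the model with the standardised reduced dataset
# a fresh 3-cluster model is created because 'km' was last refitted with a different number of clusters
km = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, tol=1e-4, random_state=0)
y_predicted_reduced = km.fit_predict(df_reduced) 
y_predicted_reduced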

In [None]:
# add the cluster column to the dataframe 
df_reduced['cluster'] = y_predicted_reduced
df_reduced.head()

In [None]:
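# calculate the Sum of Squared Errors (SSE), also called distortion or inertia, for a range of cluster counts - with the reduced df scaled with StandardScaler
# the 'cluster' column added earlier is dropped so that the labels are not used as a feature
X_reduced = df_reduced.drop(columns='cluster', errors='ignore')

sse = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(X_reduced)
    sse.append(km.inertia_)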

# plot
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()

**Findings** 

With the reduced feature set the elbow plot suggests 2 clusters, but inertia at that point is still high.

### <a id='72'>7.2. Further analysis of clusters</a>

Closely investigate each group of countries and argue whether the assignment of countries to these three clusters is reasonable. If not, explain what went wrong.

### <a id='73'>7.3. Linear regression (Work-in-Progress)</a>

Evaluating the performance of an algorithm requires a **label**, so that the *expected* and the *predicted* values can be compared. We might therefore want to consider adding a feature for labeling.

We've managed to use clustering to find meaningful relationships in the data, so we can consider this part as a preprocessing step.

In [None]:
mpi_data = pd.read_csv("/kaggle/input/mpi/MPI_national.csv")
mpi_data

### <a id='74'>7.4. Further clustering of clusters (Work-in-Progress)</a>

## <a id='8'>8. Answer to the question and learnings</a>

## <a id='9'>9. References</a>

**Libraries and Code**

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://seaborn.pydata.org/generated/seaborn.load_dataset.html

https://seaborn.pydata.org/generated/seaborn.pairplot.html

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

PCA in Python:  https://www.youtube.com/watch?v=Lsue2gEM9D0 , https://www.youtube.com/watch?v=SBYdqlLgbGk



**PCA**

https://builtin.com/data-science/step-step-explanation-principal-component-analysis

https://online.stat.psu.edu/stat505/lesson/11/11.4


**Similar Cases**

https://upzoning.berkeley.edu/download/Classifying_Neighborhoods_Methodology.pdf


**K-Means Model**

https://www.youtube.com/watch?v=EItlUEPCIzM&list=LL&index=1

https://towardsdatascience.com/k-means-clustering-with-scikit-learn-6b47a369a83c

https://medium.com/analytics-vidhya/why-is-scaling-required-in-knn-and-k-means-8129e4d88ed7

https://developer.squareup.com/blog/so-you-have-some-clusters-now-what/


**Visualisations**

https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166


**Silhouette Score**

https://dzone.com/articles/kmeans-silhouette-score-explained-with-python-exam

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html














