# Data Preprocessing

## Data Acquisition

This section documents how I acquired the data for the country selector project. I do not intend to publish this work for any material benefit; nevertheless, I list here all the data sources I used.

First, I would like to take you through my thought process. Initially, I wanted a Kaggle dataset on which I could try my hand at clustering, namely the [world happiness report dataset](https://www.kaggle.com/datasets/mathurinache/world-happiness-report). However, I also wanted the project to be useful to people in general, so I came up with the idea of a country selector. Clustering gives me groups of countries that have a lot in common; I check what those common traits are and use them as "group features". These group features are then presented to the user as choices, and the user sees which group of countries scores best in the chosen areas, which gives them an idea of which countries to look into further.

With that in mind, I did some quick research (I still wanted to keep this a small project) into the most important criteria for choosing another place to live. I have documented this small endeavour in [this file](research.txt). Keeping these criteria in mind, I browsed the web for data about countries' healthcare systems, finances, climate, laws, etc. and downloaded it. To keep things short, and because the data was not available on a single site, I downloaded the files manually instead of learning how to use several different APIs.

I have used the following websites to download my data and I give them full credit for this data:

https://climatedata.worldbank.org

https://worldpopulationreview.com

https://databank.worldbank.org

https://www.kaggle.com/datasets/mathurinache/world-happiness-report


## Data Preprocessing

This notebook deals with combining all the downloaded datasets into one clean dataset. I will drop all countries that have less than 70% of the necessary data or, where possible, impute the missing values. The datasets also do not cover the same time span, so wherever data is available I will select the most recent values. All the data I have downloaded is at most 4 years old (2018) at the time of this notebook's creation.
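
As a minimal sketch of the 70% rule, assuming a wide table with one row per country (the frame and its column names are made up for illustration), `DataFrame.dropna` with `thresh` keeps only the rows that have enough non-null values:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data: one row per country, one column per feature.
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C"],
        "f1": [1.0, np.nan, 3.0],
        "f2": [2.0, np.nan, np.nan],
        "f3": [0.5, 0.7, 0.9],
        "f4": [4.0, np.nan, 1.0],
    }
)

feature_columns = list(toy.columns.drop("country"))

# A country must have at least 70% of its features filled in to be kept.
min_non_null = math.ceil(0.7 * len(feature_columns))

# Keep rows with at least `min_non_null` non-null feature values.
kept = toy.dropna(subset=feature_columns, thresh=min_non_null)
print(kept["country"].tolist())
```

Rounding the threshold up with `math.ceil` is one possible reading of "less than 70%"; rounding down would be slightly more lenient.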

I will also try to transform the values so that they are on similar scales, since that should (in theory) be better for clustering. I will apply other aggregations and replacements where I see fit.
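
To make "similar scales" concrete: min-max scaling maps each value `x` of a feature to `(x - min) / (max - min)`, so every feature ends up between 0 and 1. The sketch below does this by hand in pandas on made-up numbers; it is meant only to illustrate the formula, not to replace the scaler used later in this notebook.

```python
import pandas as pd

# Hypothetical toy values on very different scales.
toy = pd.DataFrame(
    {"gdp": [500.0, 5000.0, 50000.0], "score": [4.0, 6.0, 7.0]}
)

# Column-wise min-max scaling: (x - min) / (max - min).
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```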

There will therefore be 4 steps, one for each type of data. The 5th and final step will be combining all the cleaned datasets into one dataset to be used in the main part.
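
The final combining step could look roughly like the sketch below. It assumes every cleaned table shares a key column called `country`, which is a simplification: the real tables use `country` and `country_name` and would need renaming first, and country spellings may differ between sources. All table and column names here are hypothetical.

```python
from functools import reduce

import pandas as pd

# Hypothetical cleaned tables, each keyed by country name.
happiness = pd.DataFrame({"country": ["A", "B"], "happiness_score": [0.9, 0.4]})
economy = pd.DataFrame({"country": ["A", "B", "C"], "gdp_per_capita": [0.8, 0.2, 0.5]})
employment = pd.DataFrame({"country": ["A", "C"], "employment_score": [0.7, 0.6]})

# Outer-join all tables on the shared key so no country is lost at this stage.
combined = reduce(
    lambda left, right: left.merge(right, on="country", how="outer"),
    [happiness, economy, employment],
)
print(combined)
```

An outer join keeps countries that are missing from some tables, so the 70% rule can be applied afterwards on the combined result.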

Without further ado, let's wrangle some data.

In [70]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import re
import sys
# Insert at 1, 0 is the script path
sys.path.insert(1, './code/development')
from preprocessing_functions import categorize_gdp_per_capita_value

## Part 1 - world happiness report

I will now see all the useful data I can extract from the world happiness report.

In [51]:
world_happiness = pd.read_csv("../data/world-happiness-2022.csv", index_col="RANK", decimal=",")
world_happiness.head()

RANK,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.83) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
1,Finland,7.821,7.886,7.756,2.518,1.892,1.258,0.775,0.736,0.109,0.534
2,Denmark,7.636,7.71,7.563,2.226,1.953,1.243,0.777,0.719,0.188,0.532
3,Iceland,7.557,7.651,7.464,2.32,1.936,1.32,0.803,0.718,0.27,0.191
4,Switzerland,7.512,7.586,7.437,2.153,2.026,1.226,0.822,0.677,0.147,0.461
5,Netherlands,7.415,7.471,7.359,2.137,1.945,1.206,0.787,0.651,0.271,0.419


In [52]:
# See if there are any null values etc.
world_happiness.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 1 to 147
Data columns (total 11 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Country                                     147 non-null    object 
 1   Happiness score                             146 non-null    float64
 2   Whisker-high                                146 non-null    float64
 3   Whisker-low                                 146 non-null    float64
 4   Dystopia (1.83) + residual                  146 non-null    float64
 5   Explained by: GDP per capita                146 non-null    float64
 6   Explained by: Social support                146 non-null    float64
 7   Explained by: Healthy life expectancy       146 non-null    float64
 8   Explained by: Freedom to make life choices  146 non-null    float64
 9   Explained by: Generosity                    146 non-null    float64
 10  Explained by: 

In [53]:
# Check entries with null values
world_happiness[pd.isnull(world_happiness).any(axis=1)]

RANK,Country,Happiness score,Whisker-high,Whisker-low,Dystopia (1.83) + residual,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
147,xx,,,,,,,,,,


In [54]:
# It looks like the 147th entry is not an actual country, so I can drop it.
world_happiness = world_happiness.drop(index=147)
world_happiness.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 146 entries, 1 to 146
Data columns (total 11 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Country                                     146 non-null    object 
 1   Happiness score                             146 non-null    float64
 2   Whisker-high                                146 non-null    float64
 3   Whisker-low                                 146 non-null    float64
 4   Dystopia (1.83) + residual                  146 non-null    float64
 5   Explained by: GDP per capita                146 non-null    float64
 6   Explained by: Social support                146 non-null    float64
 7   Explained by: Healthy life expectancy       146 non-null    float64
 8   Explained by: Freedom to make life choices  146 non-null    float64
 9   Explained by: Generosity                    146 non-null    float64
 10  Explained by: 

In [55]:
# I will also drop the whisker-high, whisker-low, and dystopia features, as they are irrelevant to this project
world_happiness = world_happiness.drop(columns=["Whisker-high", "Whisker-low", "Dystopia (1.83) + residual"])
world_happiness.head()

RANK,Country,Happiness score,Explained by: GDP per capita,Explained by: Social support,Explained by: Healthy life expectancy,Explained by: Freedom to make life choices,Explained by: Generosity,Explained by: Perceptions of corruption
1,Finland,7.821,1.892,1.258,0.775,0.736,0.109,0.534
2,Denmark,7.636,1.953,1.243,0.777,0.719,0.188,0.532
3,Iceland,7.557,1.936,1.32,0.803,0.718,0.27,0.191
4,Switzerland,7.512,2.026,1.226,0.822,0.677,0.147,0.461
5,Netherlands,7.415,1.945,1.206,0.787,0.651,0.271,0.419


In [56]:
# Furthermore, we will rename some of the features so we can use them more easily
world_happiness.rename(columns=lambda c: c.replace("Explained by: ", "").replace(" ", "_").lower(), inplace=True)
world_happiness.head()

RANK,country,happiness_score,gdp_per_capita,social_support,healthy_life_expectancy,freedom_to_make_life_choices,generosity,perceptions_of_corruption
1,Finland,7.821,1.892,1.258,0.775,0.736,0.109,0.534
2,Denmark,7.636,1.953,1.243,0.777,0.719,0.188,0.532
3,Iceland,7.557,1.936,1.32,0.803,0.718,0.27,0.191
4,Switzerland,7.512,2.026,1.226,0.822,0.677,0.147,0.461
5,Netherlands,7.415,1.945,1.206,0.787,0.651,0.271,0.419


In [57]:
# Another important step is scaling.
# I will use a simple min max scaling so that all features are between 0 and 1 and they keep their impact 
# respective to their groups

# Get numerical columns for the world happiness dataset
numerical_columns_w_h = world_happiness.columns.copy().drop("country")

min_max_scaler = MinMaxScaler()
world_happiness[numerical_columns_w_h] = min_max_scaler.fit_transform(world_happiness[numerical_columns_w_h])
world_happiness.head()

RANK,country,happiness_score,gdp_per_capita,social_support,healthy_life_expectancy,freedom_to_make_life_choices,generosity,perceptions_of_corruption
1,Finland,1.0,0.856496,0.95303,0.822718,0.994595,0.232906,0.90971
2,Denmark,0.965848,0.88411,0.941667,0.824841,0.971622,0.401709,0.906303
3,Iceland,0.951265,0.876415,1.0,0.852442,0.97027,0.576923,0.325383
4,Switzerland,0.942957,0.917157,0.928788,0.872611,0.914865,0.314103,0.785349
5,Netherlands,0.925051,0.880489,0.913636,0.835456,0.87973,0.57906,0.713799


## Part 2 - databank datasets

This is the biggest chunk of the work. There are 10 datasets here, each of which has data for years 2020 and 2021 for every country. I will choose data from 2020 where it is not available for 2021. If no data is available, then data for that country will be null.
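
The "2021 first, otherwise 2020" rule can be sketched on made-up data as follows; the column names mirror the year columns of the databank exports after renaming, while the `latest` column is only an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one value column per year.
toy = pd.DataFrame(
    {
        "country_name": ["A", "B", "C"],
        "2020": [1.0, 2.0, np.nan],
        "2021": [1.5, np.nan, np.nan],
    }
)

# Take 2021 where present, otherwise fall back to 2020.
toy["latest"] = toy["2021"].fillna(toy["2020"])
print(toy)
```

Country C has no value in either year, so its result stays null.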

I will take all the datasets and compile them into one. I will drop countries that are missing most of their values, and likewise drop columns that are missing most of their values.

In [58]:
economy_stats = pd.read_excel("../data/Economy.xlsx", na_values="..")
economy_stats.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,Adjusted net national income (current US$),NY.ADJ.NNTY.CD,18458790000.0,
1,Afghanistan,AFG,Current account balance (% of GDP),BN.CAB.XOKA.GD.ZS,-15.59312,
2,Afghanistan,AFG,GDP per capita (current US$),NY.GDP.PCAP.CD,516.7479,
3,Afghanistan,AFG,GNI per capita (constant 2015 US$),NY.GNP.PCAP.KD,,
4,Albania,ALB,Adjusted net national income (current US$),NY.ADJ.NNTY.CD,11939380000.0,


In [59]:
# Let's first rename columns
economy_stats.rename(columns=lambda c: c.replace(" ", "_").lower(), inplace=True)
economy_stats.rename(columns=lambda c: re.sub(r'_\[.*\]', '', c), inplace=True)
economy_stats.head()

Unnamed: 0,country_name,country_code,series_name,series_code,2020,2021
0,Afghanistan,AFG,Adjusted net national income (current US$),NY.ADJ.NNTY.CD,18458790000.0,
1,Afghanistan,AFG,Current account balance (% of GDP),BN.CAB.XOKA.GD.ZS,-15.59312,
2,Afghanistan,AFG,GDP per capita (current US$),NY.GDP.PCAP.CD,516.7479,
3,Afghanistan,AFG,GNI per capita (constant 2015 US$),NY.GNP.PCAP.KD,,
4,Albania,ALB,Adjusted net national income (current US$),NY.ADJ.NNTY.CD,11939380000.0,


In [60]:
# Drop the columns that are not useful
economy_stats = economy_stats.drop(columns=["country_code", "series_code"])
economy_stats.head()

Unnamed: 0,country_name,series_name,2020,2021
0,Afghanistan,Adjusted net national income (current US$),18458790000.0,
1,Afghanistan,Current account balance (% of GDP),-15.59312,
2,Afghanistan,GDP per capita (current US$),516.7479,
3,Afghanistan,GNI per capita (constant 2015 US$),,
4,Albania,Adjusted net national income (current US$),11939380000.0,


In [61]:
economy_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 873 entries, 0 to 872
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country_name  870 non-null    object 
 1   series_name   868 non-null    object 
 2   2020          659 non-null    float64
 3   2021          0 non-null      float64
dtypes: float64(2), object(2)
memory usage: 27.4+ KB


It looks like we have no economic data from 2021, so I can drop this column. I also need to rearrange the dataframe so that there is one country per row and each series becomes a column instead of a value in a separate row.
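
The reshaping described here goes from a "long" layout (one row per country and series) to a "wide" one (one row per country). A small sketch on invented data, using the same `pivot_table` call pattern as the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (country, series) pair.
long_df = pd.DataFrame(
    {
        "country_name": ["A", "A", "B", "B", "C", "C"],
        "series_name": ["gdp", "income"] * 3,
        "2020": [10.0, 100.0, 20.0, np.nan, np.nan, np.nan],
    }
)

# One row per country, one column per series.
wide_df = long_df.pivot_table(
    values="2020", index="country_name", columns="series_name"
).reset_index()
print(wide_df)
```

In this toy example, country C has no values at all and, with the default `dropna=True`, should not appear in the pivoted result. That is worth keeping in mind whenever the number of countries shrinks after pivoting.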

In [62]:
economy_stats = economy_stats.pivot_table(values='2020', index="country_name", columns=['series_name']).reset_index()
economy_stats.head()

series_name,country_name,Adjusted net national income (current US$),Current account balance (% of GDP),GDP per capita (current US$),GNI per capita (constant 2015 US$)
0,Afghanistan,18458790000.0,-15.59312,516.747871,
1,Albania,11939380000.0,-8.830036,5246.096346,
2,Algeria,119996100000.0,-12.565709,3306.858208,3751.770614
3,American Samoa,,,12844.900991,
4,Angola,35814230000.0,1.493624,1776.166868,2890.70289


In [63]:
economy_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 5 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   country_name                                195 non-null    object 
 1   Adjusted net national income (current US$)  170 non-null    float64
 2   Current account balance (% of GDP)          167 non-null    float64
 3   GDP per capita (current US$)                194 non-null    float64
 4   GNI per capita (constant 2015 US$)          128 non-null    float64
dtypes: float64(4), object(1)
memory usage: 7.7+ KB


1. GDP per capita is available for almost all countries. I will check which country is missing it and see if a quick Google search turns up a value.
2. Since all of these indicators are economy-related, it makes sense to fill the missing adjusted net national income and current account balance values by grouping countries by GDP per capita and using the mean of each group.
3. GNI per capita has fewer than 70% of its values (about 65%), so I will drop it.
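
The group-mean imputation from point 2 can be sketched as below. The `gdp_group` function and its single 5000 threshold are purely hypothetical stand-ins, since the actual `categorize_gdp_per_capita_value` helper is not shown in this notebook; the data is made up as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real notebook uses World Bank indicators.
toy = pd.DataFrame(
    {
        "country_name": ["A", "B", "C", "D"],
        "gdp_per_capita": [900.0, 1100.0, 40000.0, 60000.0],
        "net_income": [10.0, np.nan, 500.0, np.nan],
    }
)


def gdp_group(value: float) -> str:
    """Hypothetical stand-in for categorize_gdp_per_capita_value."""
    return "low" if value < 5000 else "high"


# Label every country with its GDP group.
groups = toy["gdp_per_capita"].apply(gdp_group)

# Fill each gap with the mean of the country's own group.
toy["net_income"] = toy.groupby(groups)["net_income"].transform(
    lambda s: s.fillna(s.mean())
)
print(toy)
```

If a whole group has no value for a feature, its mean is also null and the gap remains, so a check for leftover nulls afterwards is still useful.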

In [64]:
# Check what kind of values we have so we can see what kind of groups we can create
economy_stats["GDP per capita (current US$)"].describe()

count       194.000000
mean      15348.067094
std       23405.739594
min         238.990726
25%        2169.762588
50%        5467.472829
75%       17714.185985
max      173688.189360
Name: GDP per capita (current US$), dtype: float64

In [65]:
# Drop GNI per capita column
economy_stats = economy_stats.drop(columns="GNI per capita (constant 2015 US$)")

# Check what country does not have the GDP per capita value assigned
economy_stats[economy_stats["GDP per capita (current US$)"].isnull()]

In [67]:
# According to https://statisticstimes.com/economy/country/south-sudan-gdp-per-capita.php
# and https://knoema.com/atlas/South-Sudan/GDP-per-capita
# I can conclude that the GDP per capita for South Sudan can be set at 296
economy_stats.loc[162, "GDP per capita (current US$)"] = 296.0
economy_stats[economy_stats["GDP per capita (current US$)"].isnull()]

series_name,country_name,Adjusted net national income (current US$),Current account balance (% of GDP),GDP per capita (current US$)


In [68]:
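# Fill NA values with the mean of each GDP per capita group
e_s_numerical_columns = economy_stats.columns.drop("country_name")
# Call the categorization function on every GDP per capita value to get the group labels
gdp_groups = economy_stats["GDP per capita (current US$)"].apply(categorize_gdp_per_capita_value)
economy_stats[e_s_numerical_columns] = economy_stats.groupby(gdp_groups)[e_s_numerical_columns].transform(
    lambda x: x.fillna(x.mean()))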
economy_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 4 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   country_name                                195 non-null    object 
 1   Adjusted net national income (current US$)  195 non-null    float64
 2   Current account balance (% of GDP)          195 non-null    float64
 3   GDP per capita (current US$)                195 non-null    float64
dtypes: float64(3), object(1)
memory usage: 6.2+ KB



In [72]:
# Again, I will use a simple min max scaling so that all features are between 0 and 1 and they keep their impact 
# respective to their groups
min_max_scaler = MinMaxScaler()

# Get numerical columns for the economy dataset
numerical_columns_e_s = economy_stats.columns.copy().drop("country_name")

economy_stats[numerical_columns_e_s] = min_max_scaler.fit_transform(economy_stats[numerical_columns_e_s])
economy_stats.head()

series_name,country_name,Adjusted net national income (current US$),Current account balance (% of GDP),GDP per capita (current US$)
0,Afghanistan,0.001026,0.277733,0.001601
1,Albania,0.000657,0.372249,0.028868
2,Algeria,0.006776,0.320042,0.017687
3,American Samoa,0.021364,0.457369,0.072678
4,Angola,0.002009,0.516524,0.008862


In [74]:
education_stats = pd.read_excel("../data/education.xlsx", na_values="..")
education_stats.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,"Adjusted net enrollment rate, primary (% of pr...",SE.PRM.TENR,,
1,Afghanistan,AFG,"Government expenditure on education, total (% ...",SE.XPD.TOTL.GD.ZS,,
2,Afghanistan,AFG,"Literacy rate, adult total (% of people ages 1...",SE.ADT.LITR.ZS,,37.266041
3,Afghanistan,AFG,"Educational attainment, at least Bachelor's or...",SE.TER.CUAT.BA.ZS,,3.06798
4,Afghanistan,AFG,"Educational attainment, at least completed low...",SE.SEC.CUAT.LO.ZS,,11.63192


In [150]:
# Let's first rename columns
education_stats.rename(columns=lambda c: c.replace(" ", "_").lower(), inplace=True)
education_stats.rename(columns=lambda c: re.sub(r'_\[.*\]', '', c), inplace=True)
education_stats.head()

Unnamed: 0,country_name,country_code,series_name,series_code,2020,2021
0,Afghanistan,AFG,"Adjusted net enrollment rate, primary (% of pr...",SE.PRM.TENR,,
1,Afghanistan,AFG,"Government expenditure on education, total (% ...",SE.XPD.TOTL.GD.ZS,,
2,Afghanistan,AFG,"Literacy rate, adult total (% of people ages 1...",SE.ADT.LITR.ZS,,37.266041
3,Afghanistan,AFG,"Educational attainment, at least Bachelor's or...",SE.TER.CUAT.BA.ZS,,3.06798
4,Afghanistan,AFG,"Educational attainment, at least completed low...",SE.SEC.CUAT.LO.ZS,,11.63192


In [152]:
# Drop the columns that are not useful
education_stats = education_stats.drop(columns=["country_code", "series_code"])
education_stats.head()

Unnamed: 0,country_name,series_name,2020,2021
0,Afghanistan,"Adjusted net enrollment rate, primary (% of pr...",,
1,Afghanistan,"Government expenditure on education, total (% ...",,
2,Afghanistan,"Literacy rate, adult total (% of people ages 1...",,37.266041
3,Afghanistan,"Educational attainment, at least Bachelor's or...",,3.06798
4,Afghanistan,"Educational attainment, at least completed low...",,11.63192


In [153]:
# Fill null values of the 2021 column with values from 2020, where they exist
education_stats['2021'] = education_stats['2021'].fillna(education_stats['2020'])

# Drop 2020 column
education_stats = education_stats.drop(columns='2020')

education_stats.head()

Unnamed: 0,country_name,series_name,2021
0,Afghanistan,"Adjusted net enrollment rate, primary (% of pr...",
1,Afghanistan,"Government expenditure on education, total (% ...",
2,Afghanistan,"Literacy rate, adult total (% of people ages 1...",37.266041
3,Afghanistan,"Educational attainment, at least Bachelor's or...",3.06798
4,Afghanistan,"Educational attainment, at least completed low...",11.63192


In [154]:
# Rearrange dataframe so there is one country per row
education_stats = education_stats.pivot_table(values='2021', index="country_name", columns=['series_name']).reset_index()
education_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 5 columns):
 #   Column                                                                                              Non-Null Count  Dtype  
---  ------                                                                                              --------------  -----  
 0   country_name                                                                                        69 non-null     object 
 1   Educational attainment, at least Bachelor's or equivalent, population 25+, total (%) (cumulative)   38 non-null     float64
 2   Educational attainment, at least completed lower secondary, population 25+, total (%) (cumulative)  38 non-null     float64
 3   Government expenditure on education, total (% of GDP)                                               35 non-null     float64
 4   Literacy rate, adult total (% of people ages 15 and above)                                          16 non-null     flo

It looks like this data is mostly null: only 69 countries remain after pivoting, and no series has values for more than 38 of them. It is therefore not useful, which means that I will need to find another source for education data. I will set this aside for now; after processing all the available data I will see what is still missing and whether I can get it.
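
A quick way to judge whether a pivoted dataset is usable is to compute the share of non-null values per series and compare it to the 70% threshold. The sketch below uses invented data and column names; note that it measures coverage only among the countries present in the table, not among all countries in the world.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per country, one column per series.
toy = pd.DataFrame(
    {
        "country_name": ["A", "B", "C", "D"],
        "literacy": [90.0, np.nan, np.nan, np.nan],
        "spending": [4.0, 5.0, 3.5, np.nan],
    }
)

# Share of countries with a value, per series.
coverage = toy.drop(columns="country_name").notna().mean()

# Series below the 70% threshold used in this notebook.
too_sparse = coverage[coverage < 0.7].index.tolist()
print(coverage)
print(too_sparse)
```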

In [76]:
employment_stats = pd.read_excel("../data/employment.xlsx", na_values="..")
employment_stats.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,Adequacy of social protection and labor progra...,per_allsp.adq_pop_tot,,
1,Afghanistan,AFG,"Employers, total (% of total employment) (mode...",SL.EMP.MPYR.ZS,,
2,Afghanistan,AFG,Employment in industry (% of total employment)...,SL.IND.EMPL.ZS,,
3,Afghanistan,AFG,Employment in agriculture (% of total employme...,SL.AGR.EMPL.ZS,,
4,Afghanistan,AFG,"Employment to population ratio, 15+, total (%)...",SL.EMP.TOTL.SP.NE.ZS,36.709999,


In [155]:
# Let's first rename columns
employment_stats.rename(columns=lambda c: c.replace(" ", "_").lower(), inplace=True)
employment_stats.rename(columns=lambda c: re.sub(r'_\[.*\]', '', c), inplace=True)
employment_stats.head()

Unnamed: 0,country_name,country_code,series_name,series_code,2020,2021
0,Afghanistan,AFG,Adequacy of social protection and labor progra...,per_allsp.adq_pop_tot,,
1,Afghanistan,AFG,"Employers, total (% of total employment) (mode...",SL.EMP.MPYR.ZS,,
2,Afghanistan,AFG,Employment in industry (% of total employment)...,SL.IND.EMPL.ZS,,
3,Afghanistan,AFG,Employment in agriculture (% of total employme...,SL.AGR.EMPL.ZS,,
4,Afghanistan,AFG,"Employment to population ratio, 15+, total (%)...",SL.EMP.TOTL.SP.NE.ZS,36.709999,


In [156]:
# Drop the columns that are not useful
employment_stats = employment_stats.drop(columns=["country_code", "series_code"])
employment_stats.head()

Unnamed: 0,country_name,series_name,2020,2021
0,Afghanistan,Adequacy of social protection and labor progra...,,
1,Afghanistan,"Employers, total (% of total employment) (mode...",,
2,Afghanistan,Employment in industry (% of total employment)...,,
3,Afghanistan,Employment in agriculture (% of total employme...,,
4,Afghanistan,"Employment to population ratio, 15+, total (%)...",36.709999,


In [158]:
# Fill null values of the 2021 column with values from 2020, where they exist
employment_stats['2021'] = employment_stats['2021'].fillna(employment_stats['2020'])

# Drop 2020 column
employment_stats = employment_stats.drop(columns='2020')

employment_stats.head()

Unnamed: 0,country_name,series_name,2021
0,Afghanistan,Adequacy of social protection and labor progra...,
1,Afghanistan,"Employers, total (% of total employment) (mode...",
2,Afghanistan,Employment in industry (% of total employment)...,
3,Afghanistan,Employment in agriculture (% of total employme...,
4,Afghanistan,"Employment to population ratio, 15+, total (%)...",36.709999


In [159]:
# Rearrange dataframe so there is one country per row
employment_stats = employment_stats.pivot_table(values='2021', index="country_name", columns=['series_name']).reset_index()
employment_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 3 columns):
 #   Column                                                              Non-Null Count  Dtype  
---  ------                                                              --------------  -----  
 0   country_name                                                        99 non-null     object 
 1   Employment to population ratio, 15+, total (%) (national estimate)  95 non-null     float64
 2   Unemployment, total (% of total labor force) (national estimate)    99 non-null     float64
dtypes: float64(2), object(1)
memory usage: 2.4+ KB


It looks like only 2 series survived the pivot, and both are almost complete, so I can use this data. The two measure more or less the same thing, so I will keep only the second one (the unemployment rate).

I will scale it and then reverse it (take 1 - scaled_value) so that countries with low unemployment score high. I do this because all the statistics I have so far point in the same direction: the bigger the value, the better the country does in that particular area.

It is just my assumption that giving every feature the same direction will lead to better clustering.
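
The scale-then-reverse idea can be sketched on made-up unemployment rates as follows (the values and labels are hypothetical):

```python
import pandas as pd

# Hypothetical unemployment rates in percent.
unemployment = pd.Series([2.0, 6.0, 10.0], index=["A", "B", "C"])

# Min-max scale to [0, 1], then flip so that higher means better.
scaled = (unemployment - unemployment.min()) / (
    unemployment.max() - unemployment.min()
)
reversed_score = 1.0 - scaled
print(reversed_score)
```

Country A, with the lowest unemployment, ends up with the highest score, and country C with the lowest.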

In [160]:
employment_stats = employment_stats.drop(columns="Employment to population ratio, 15+, total (%) (national estimate)")
# Check if there are any more null values
employment_stats[employment_stats.isna().any(axis=1)]

series_name,country_name,"Unemployment, total (% of total labor force) (national estimate)"


In [161]:
# Again, I will use a simple min max scaling so that all features are between 0 and 1 and they keep their impact 
# respective to their groups
min_max_scaler = MinMaxScaler()

# Get numerical columns for the employment dataset
numerical_columns_em_s = employment_stats.columns.copy().drop("country_name")

employment_stats[numerical_columns_em_s] = min_max_scaler.fit_transform(employment_stats[numerical_columns_em_s])
employment_stats.head()

series_name,country_name,"Unemployment, total (% of total labor force) (national estimate)"
0,Afghanistan,0.397868
1,Argentina,0.389271
2,Armenia,0.620358
3,Australia,0.170908
4,Austria,0.179505


In [162]:
# Reverse statistic so high-value = good, low-value = bad; also rename column
employment_stats[numerical_columns_em_s] = 1.0 - employment_stats[numerical_columns_em_s]
# rename returns a new dataframe, so the result has to be assigned back
employment_stats = employment_stats.rename(columns={'Unemployment, total (% of total labor force) (national estimate)': 'employment_rate_labor_force'})
employment_stats.head()

series_name,country_name,employment_rate_labor_force
0,Afghanistan,0.602132
1,Argentina,0.610729
2,Armenia,0.379642
3,Australia,0.829092
4,Austria,0.820495


In [78]:
financial_indicators = pd.read_excel("../data/financial-indicators.xlsx", na_values="..")
financial_indicators.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,Account ownership at a financial institution o...,FX.OWN.TOTL.ZS,,
1,Afghanistan,AFG,Consumer price index (2010 = 100),FP.CPI.TOTL,,
2,Afghanistan,AFG,"Inflation, consumer prices (annual %)",FP.CPI.TOTL.ZG,,
3,Afghanistan,AFG,"Listed domestic companies, total",CM.MKT.LDOM.NO,,
4,Albania,ALB,Account ownership at a financial institution o...,FX.OWN.TOTL.ZS,,


In [79]:
financial_indicators.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 873 entries, 0 to 872
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   870 non-null    object 
 1   Country Code   868 non-null    object 
 2   Series Name    868 non-null    object 
 3   Series Code    868 non-null    object 
 4   2020 [YR2020]  389 non-null    float64
 5   2021 [YR2021]  238 non-null    float64
dtypes: float64(2), object(4)
memory usage: 41.0+ KB


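Again, the data is insufficient: fewer than half of the 873 rows have a value for 2020 (389) and even fewer for 2021 (238). As with the education data, I will set this dataset aside and look for another source later.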

In [80]:
health_stats = pd.read_excel("../data/health.xlsx", na_values="..")
health_stats.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,"Current health expenditure per capita, PPP (cu...",SH.XPD.CHEX.PP.CD,,
1,Afghanistan,AFG,"Hospital beds (per 1,000 people)",SH.MED.BEDS.ZS,,
2,Afghanistan,AFG,Mortality caused by road traffic injury (per 1...,SH.STA.TRAF.P5,,
3,Afghanistan,AFG,Out-of-pocket expenditure per capita (current ...,SH.XPD.OOPC.PC.CD,,
4,Afghanistan,AFG,People with basic handwashing facilities inclu...,SH.STA.HYGN.ZS,38.11505,


In [81]:
health_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1524 entries, 0 to 1523
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   1521 non-null   object 
 1   Country Code   1519 non-null   object 
 2   Series Name    1519 non-null   object 
 3   Series Code    1519 non-null   object 
 4   2020 [YR2020]  295 non-null    float64
 5   2021 [YR2021]  0 non-null      float64
dtypes: float64(2), object(4)
memory usage: 71.6+ KB


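Again, the data is insufficient: only 295 of the 1524 rows have a value for 2020 and none have one for 2021. As with the education data, I will set this dataset aside and look for another source later.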

In [82]:
infrastructure_stats = pd.read_excel("../data/infrastructure.xlsx", na_values="..")
infrastructure_stats.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,Fixed broadband subscriptions (per 100 people),IT.NET.BBND.P2,0.068254,
1,Afghanistan,AFG,Rail lines (total route-km),IS.RRS.TOTL.KM,,
2,Afghanistan,AFG,Secure Internet servers (per 1 million people),IT.NET.SECR.P6,34.987363,
3,Afghanistan,AFG,Research and development expenditure (% of GDP),GB.XPD.RSDV.GD.ZS,,
4,Albania,ALB,Fixed broadband subscriptions (per 100 people),IT.NET.BBND.P2,17.684951,


In [83]:
infrastructure_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 873 entries, 0 to 872
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   870 non-null    object 
 1   Country Code   868 non-null    object 
 2   Series Name    868 non-null    object 
 3   Series Code    868 non-null    object 
 4   2020 [YR2020]  415 non-null    float64
 5   2021 [YR2021]  0 non-null      float64
dtypes: float64(2), object(4)
memory usage: 41.0+ KB


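Again, the data is insufficient: only 415 of the 873 rows have a value for 2020 and none have one for 2021. As with the education data, I will set this dataset aside and look for another source later.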

In [85]:
population_and_environment_stats = pd.read_excel("../data/population-and-environment.xlsx", na_values="..")
population_and_environment_stats.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,Access to electricity (% of population),EG.ELC.ACCS.ZS,97.699997,
1,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,
2,Afghanistan,AFG,Electricity production from coal sources (% of...,EG.ELC.COAL.ZS,,
3,Afghanistan,AFG,Land area (sq. km),AG.LND.TOTL.K2,652860.0,652860.0
4,Afghanistan,AFG,Forest area (% of land area),AG.LND.FRST.ZS,1.850994,


In [86]:
population_and_environment_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1524 entries, 0 to 1523
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   1521 non-null   object 
 1   Country Code   1519 non-null   object 
 2   Series Name    1519 non-null   object 
 3   Series Code    1519 non-null   object 
 4   2020 [YR2020]  1071 non-null   float64
 5   2021 [YR2021]  216 non-null    float64
dtypes: float64(2), object(4)
memory usage: 71.6+ KB


In [87]:
# Let's first rename columns
population_and_environment_stats.rename(columns=lambda c: c.replace(" ", "_").lower(), inplace=True)
population_and_environment_stats.rename(columns=lambda c: re.sub(r'_\[.*\]', '', c), inplace=True)
population_and_environment_stats.head()

Unnamed: 0,country_name,country_code,series_name,series_code,2020,2021
0,Afghanistan,AFG,Access to electricity (% of population),EG.ELC.ACCS.ZS,97.699997,
1,Afghanistan,AFG,CO2 emissions (metric tons per capita),EN.ATM.CO2E.PC,,
2,Afghanistan,AFG,Electricity production from coal sources (% of...,EG.ELC.COAL.ZS,,
3,Afghanistan,AFG,Land area (sq. km),AG.LND.TOTL.K2,652860.0,652860.0
4,Afghanistan,AFG,Forest area (% of land area),AG.LND.FRST.ZS,1.850994,
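
The effect of the two renaming steps can be checked on plain strings. This is a minimal sketch; `clean_column` is a hypothetical helper that simply chains the two lambdas used above.

```python
import re


def clean_column(name: str) -> str:
    """Apply the same two renaming steps to a single column name."""
    name = name.replace(" ", "_").lower()   # "2020 [YR2020]" -> "2020_[yr2020]"
    return re.sub(r"_\[.*\]", "", name)     # drop the "_[yr2020]" suffix


print(clean_column("2020 [YR2020]"))   # 2020
print(clean_column("Country Name"))    # country_name
```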


In [88]:
# Drop the columns that are not useful
population_and_environment_stats = population_and_environment_stats.drop(columns=["country_code", "series_code"])
population_and_environment_stats.head()

Unnamed: 0,country_name,series_name,2020,2021
0,Afghanistan,Access to electricity (% of population),97.699997,
1,Afghanistan,CO2 emissions (metric tons per capita),,
2,Afghanistan,Electricity production from coal sources (% of...,,
3,Afghanistan,Land area (sq. km),652860.0,652860.0
4,Afghanistan,Forest area (% of land area),1.850994,


In [92]:
# Fill null values of the 2021 column with the 2020 values where they exist
population_and_environment_stats['2021'] = population_and_environment_stats['2021'].fillna(population_and_environment_stats['2020'])

# Drop 2020 column
population_and_environment_stats = population_and_environment_stats.drop(columns='2020')

population_and_environment_stats.head()

Unnamed: 0,country_name,series_name,2021
0,Afghanistan,Access to electricity (% of population),97.699997
1,Afghanistan,CO2 emissions (metric tons per capita),
2,Afghanistan,Electricity production from coal sources (% of...,
3,Afghanistan,Land area (sq. km),652860.0
4,Afghanistan,Forest area (% of land area),1.850994
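
The "most recent value, with a fallback to the previous year" rule can be checked on a small made-up frame. The toy data below is hypothetical; only the column names follow the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "country_name": ["A", "B", "C"],
    "2020": [1.0, 2.0, None],
    "2021": [None, 2.5, None],
})

# Prefer the 2021 value and fall back to 2020 where 2021 is missing.
# Assigning the result back avoids relying on an in-place update of a selection.
toy["2021"] = toy["2021"].fillna(toy["2020"])
toy = toy.drop(columns="2020")
print(toy)
```

Country A picks up its 2020 value, B keeps its 2021 value, and C stays null because neither year has data.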


In [93]:
# Rearrange dataframe so there is one country per row
population_and_environment_stats = population_and_environment_stats.pivot_table(values='2021', index="country_name", columns=['series_name']).reset_index()
population_and_environment_stats.head()

series_name,country_name,Access to electricity (% of population),Forest area (% of land area),Land area (sq. km),Population density (people per sq. km of land area),Urban population (% of total population)
0,Afghanistan,97.699997,1.850994,652860.0,59.627395,26.026
1,Albania,100.0,28.791971,27400.0,103.571131,62.112
2,Algeria,99.804131,0.818309,2381741.0,18.41134,73.733
3,American Samoa,,85.65,200.0,275.985,87.153
4,Andorra,100.0,34.042553,470.0,164.393617,87.916
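
Series such as the CO2 emissions one, which had no values in the rows shown earlier, no longer appear as columns after the pivot. This is consistent with the default `dropna=True` of `pivot_table`, which leaves out columns whose entries are all NaN. A small sketch on made-up data (the toy frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "country_name": ["A", "A", "B", "B"],
    "series_name": ["x", "y", "x", "y"],
    "2021": [1.0, None, 2.0, None],   # series "y" has no data at all
})

# Default behaviour: the all-NaN series "y" is dropped from the result
wide = toy.pivot_table(values="2021", index="country_name",
                       columns="series_name").reset_index()

# Keeping empty series visible, e.g. to count them before discarding them
wide_keep = toy.pivot_table(values="2021", index="country_name",
                            columns="series_name", dropna=False).reset_index()

print(list(wide.columns))
print(list(wide_keep.columns))
```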


In [94]:
population_and_environment_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 6 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   country_name                                         216 non-null    object 
 1   Access to electricity (% of population)              215 non-null    float64
 2   Forest area (% of land area)                         210 non-null    float64
 3   Land area (sq. km)                                   216 non-null    float64
 4   Population density (people per sq. km of land area)  216 non-null    float64
 5   Urban population (% of total population)             214 non-null    float64
dtypes: float64(5), object(1)
memory usage: 10.2+ KB


Because there are so few null values, we can impute them one by one.

In [96]:
population_and_environment_stats[population_and_environment_stats['Access to electricity (% of population)'].isnull()].index

Int64Index([3], dtype='int64')

In [97]:
# Impute with a value found on the web
population_and_environment_stats.loc[3, 'Access to electricity (% of population)'] = 99.2
population_and_environment_stats[population_and_environment_stats['Access to electricity (% of population)'].isnull()]

series_name,country_name,Access to electricity (% of population),Forest area (% of land area),Land area (sq. km),Population density (people per sq. km of land area),Urban population (% of total population)


In [98]:
population_and_environment_stats[population_and_environment_stats['Forest area (% of land area)'].isnull()]

series_name,country_name,Access to electricity (% of population),Forest area (% of land area),Land area (sq. km),Population density (people per sq. km of land area),Urban population (% of total population)
39,Channel Islands,100.0,,198.0,878.075758,30.963
75,Gibraltar,100.0,,10.0,3369.1,100.0
86,"Hong Kong SAR, China",100.0,,1050.0,7125.52381,100.0
116,"Macao SAR, China",100.0,,32.9,19736.838906,100.0
129,Monaco,100.0,,2.027,19360.631475,100.0
136,Nauru,100.0,,20.0,541.7,100.0


In [99]:
# Impute with values found on the web
population_and_environment_stats.loc[39, 'Forest area (% of land area)'] = 5.15
population_and_environment_stats.loc[75, 'Forest area (% of land area)'] = 0
population_and_environment_stats.loc[86, 'Forest area (% of land area)'] = 56
population_and_environment_stats.loc[116, 'Forest area (% of land area)'] = 1.5
population_and_environment_stats.loc[129, 'Forest area (% of land area)'] = 0
population_and_environment_stats.loc[136, 'Forest area (% of land area)'] = 0

population_and_environment_stats[population_and_environment_stats['Forest area (% of land area)'].isnull()]

series_name,country_name,Access to electricity (% of population),Forest area (% of land area),Land area (sq. km),Population density (people per sq. km of land area),Urban population (% of total population)


In [100]:
population_and_environment_stats[population_and_environment_stats['Urban population (% of total population)'].isnull()]

series_name,country_name,Access to electricity (% of population),Forest area (% of land area),Land area (sq. km),Population density (people per sq. km of land area),Urban population (% of total population)
61,Eritrea,52.171097,10.448119,101000.0,35.113139,
182,St. Martin (French part),100.0,24.8,50.0,773.18,


In [101]:
# Impute with values found on the web
population_and_environment_stats.loc[61, 'Urban population (% of total population)'] = 40.71
population_and_environment_stats.loc[182, 'Urban population (% of total population)'] = 0

population_and_environment_stats[population_and_environment_stats['Urban population (% of total population)'].isnull()]

series_name,country_name,Access to electricity (% of population),Forest area (% of land area),Land area (sq. km),Population density (people per sq. km of land area),Urban population (% of total population)
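
The imputations above address rows by their integer index label, which silently points at a different country if the row order ever changes. A sketch of the same idea keyed by country name follows; `impute_by_country` is a hypothetical helper and the frame is a toy one, with the two imputed values taken from the cell above.

```python
import pandas as pd


def impute_by_country(df: pd.DataFrame, column: str, values: dict) -> pd.DataFrame:
    """Fill `column` for the named countries, but only where it is still null."""
    out = df.copy()
    for country, value in values.items():
        mask = (out["country_name"] == country) & out[column].isna()
        out.loc[mask, column] = value
    return out


forest = "Forest area (% of land area)"
toy = pd.DataFrame({
    "country_name": ["Albania", "Gibraltar", "Monaco"],
    forest: [28.791971, None, None],
})

# Values as imputed in the notebook for these two territories
toy = impute_by_country(toy, forest, {"Gibraltar": 0.0, "Monaco": 0.0})
print(toy)
```

Restricting the assignment to null cells means an existing measurement is never overwritten by accident.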


In [106]:
# Check if there are any more null values
population_and_environment_stats[population_and_environment_stats.isna().any(axis=1)]

series_name,country_name,Access to electricity (% of population),Forest area (% of land area),Land area (sq. km),Population density (people per sq. km of land area),Urban population (% of total population)


In [107]:
# Again, I use simple min-max scaling so that all features lie between 0 and 1 and keep their
# impact relative to their groups
min_max_scaler = MinMaxScaler()

# Get the numerical columns of the population and environment dataset
numerical_columns_p_e_s = population_and_environment_stats.columns.copy().drop("country_name")

population_and_environment_stats[numerical_columns_p_e_s] = min_max_scaler.fit_transform(
    population_and_environment_stats[numerical_columns_p_e_s])
population_and_environment_stats.head()

series_name,country_name,Access to electricity (% of population),Forest area (% of land area),Land area (sq. km),Population density (people per sq. km of land area),Urban population (% of total population)
0,Afghanistan,0.975204,0.019002,0.039865,0.003014,0.26026
1,Albania,1.0,0.295569,0.001673,0.005241,0.62112
2,Algeria,0.997888,0.0084,0.145433,0.000926,0.73733
3,American Samoa,0.991375,0.879254,1.2e-05,0.013976,0.87153
4,Andorra,1.0,0.349469,2.9e-05,0.008322,0.87916
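
Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min), which is what `MinMaxScaler` computes with its default `feature_range=(0, 1)`. The sketch below applies the formula with plain pandas on a hypothetical three-row frame. The 26.026 entry is Afghanistan's urban population share from the unscaled table, and the 0 and 100 endpoints are assumed to be the column's extremes.

```python
import pandas as pd

toy = pd.DataFrame({
    "country_name": ["P", "Afghanistan", "Q"],
    "urban_population": [0.0, 26.026, 100.0],
})

numeric = toy.columns.drop("country_name")
col_min = toy[numeric].min()
col_max = toy[numeric].max()

# (x - min) / (max - min), column by column
toy[numeric] = (toy[numeric] - col_min) / (col_max - col_min)
print(toy)
```

With these endpoints the scaled value is 0.26026, matching the scaled table above. Note that a column with a single distinct value would lead to a division by zero in this hand-written version.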


In [108]:
poverty_stats = pd.read_excel("../data/poverty.xlsx", na_values="..")
poverty_stats.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,Multidimensional poverty headcount ratio (% of...,SI.POV.MDIM,49.4,
1,Albania,ALB,Multidimensional poverty headcount ratio (% of...,SI.POV.MDIM,43.4,
2,Algeria,DZA,Multidimensional poverty headcount ratio (% of...,SI.POV.MDIM,,
3,American Samoa,ASM,Multidimensional poverty headcount ratio (% of...,SI.POV.MDIM,,
4,Andorra,AND,Multidimensional poverty headcount ratio (% of...,SI.POV.MDIM,,


In [109]:
poverty_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222 entries, 0 to 221
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   219 non-null    object 
 1   Country Code   217 non-null    object 
 2   Series Name    217 non-null    object 
 3   Series Code    217 non-null    object 
 4   2020 [YR2020]  35 non-null     float64
 5   2021 [YR2021]  0 non-null      float64
dtypes: float64(2), object(4)
memory usage: 10.5+ KB


Again, there is insufficient data: only 35 of the 222 rows have a 2020 value and none have a 2021 value. I will do the same as stated above.

In [110]:
private_sector_stats = pd.read_excel("../data/private-sector.xlsx", na_values="..")
private_sector_stats.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,Cost of business start-up procedures (% of GNI...,IC.REG.COST.PC.ZS,,
1,Afghanistan,AFG,Ease of doing business score (0 = lowest perfo...,IC.BUS.DFRN.XQ,,
2,Afghanistan,AFG,Labor tax and contributions (% of commercial p...,IC.TAX.LABR.CP.ZS,,
3,Albania,ALB,Cost of business start-up procedures (% of GNI...,IC.REG.COST.PC.ZS,,
4,Albania,ALB,Ease of doing business score (0 = lowest perfo...,IC.BUS.DFRN.XQ,,


In [140]:
private_sector_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 656 entries, 0 to 655
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   653 non-null    object 
 1   Country Code   651 non-null    object 
 2   Series Name    651 non-null    object 
 3   Series Code    651 non-null    object 
 4   2020 [YR2020]  0 non-null      float64
 5   2021 [YR2021]  0 non-null      float64
dtypes: float64(2), object(4)
memory usage: 30.9+ KB


Again, there is insufficient data: neither year column contains a single value. I will do the same as stated above.

In [141]:
public_sector_stats = pd.read_excel("../data/public-sector.xlsx", na_values="..")
public_sector_stats.head()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,CPIA financial sector rating (1=low to 6=high),IQ.CPA.FINS.XQ,1.5,
1,Afghanistan,AFG,CPIA gender equality rating (1=low to 6=high),IQ.CPA.GNDR.XQ,1.5,
2,Afghanistan,AFG,CPIA policies for social inclusion/equity clus...,IQ.CPA.SOCI.XQ,2.7,
3,Afghanistan,AFG,CPIA property rights and rule-based governance...,IQ.CPA.PROP.XQ,2.0,
4,Afghanistan,AFG,CPIA quality of public administration rating (...,IQ.CPA.PADM.XQ,2.5,


In [142]:
public_sector_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1090 entries, 0 to 1089
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Country Name   1087 non-null   object 
 1   Country Code   1085 non-null   object 
 2   Series Name    1085 non-null   object 
 3   Series Code    1085 non-null   object 
 4   2020 [YR2020]  365 non-null    float64
 5   2021 [YR2021]  0 non-null      float64
dtypes: float64(2), object(4)
memory usage: 51.2+ KB


In [143]:
# Let's first rename columns
public_sector_stats.rename(columns=lambda c: c.replace(" ", "_").lower(), inplace=True)
public_sector_stats.rename(columns=lambda c: re.sub(r'_\[.*\]', '', c), inplace=True)
public_sector_stats.head()

Unnamed: 0,country_name,country_code,series_name,series_code,2020,2021
0,Afghanistan,AFG,CPIA financial sector rating (1=low to 6=high),IQ.CPA.FINS.XQ,1.5,
1,Afghanistan,AFG,CPIA gender equality rating (1=low to 6=high),IQ.CPA.GNDR.XQ,1.5,
2,Afghanistan,AFG,CPIA policies for social inclusion/equity clus...,IQ.CPA.SOCI.XQ,2.7,
3,Afghanistan,AFG,CPIA property rights and rule-based governance...,IQ.CPA.PROP.XQ,2.0,
4,Afghanistan,AFG,CPIA quality of public administration rating (...,IQ.CPA.PADM.XQ,2.5,


In [144]:
# Drop the columns that are not useful
public_sector_stats = public_sector_stats.drop(columns=["country_code", "series_code"])
public_sector_stats.head()

Unnamed: 0,country_name,series_name,2020,2021
0,Afghanistan,CPIA financial sector rating (1=low to 6=high),1.5,
1,Afghanistan,CPIA gender equality rating (1=low to 6=high),1.5,
2,Afghanistan,CPIA policies for social inclusion/equity clus...,2.7,
3,Afghanistan,CPIA property rights and rule-based governance...,2.0,
4,Afghanistan,CPIA quality of public administration rating (...,2.5,


In [145]:
# Fill null values of the 2021 column with the 2020 values where they exist
public_sector_stats['2021'] = public_sector_stats['2021'].fillna(public_sector_stats['2020'])

# Drop 2020 column
public_sector_stats = public_sector_stats.drop(columns='2020')

public_sector_stats.head()

Unnamed: 0,country_name,series_name,2021
0,Afghanistan,CPIA financial sector rating (1=low to 6=high),1.5
1,Afghanistan,CPIA gender equality rating (1=low to 6=high),1.5
2,Afghanistan,CPIA policies for social inclusion/equity clus...,2.7
3,Afghanistan,CPIA property rights and rule-based governance...,2.0
4,Afghanistan,CPIA quality of public administration rating (...,2.5


In [146]:
# Rearrange dataframe so there is one country per row
public_sector_stats = public_sector_stats.pivot_table(values='2021', index="country_name", columns=['series_name']).reset_index()
public_sector_stats.head()

series_name,country_name,CPIA financial sector rating (1=low to 6=high),CPIA gender equality rating (1=low to 6=high),CPIA policies for social inclusion/equity cluster average (1=low to 6=high),CPIA property rights and rule-based governance rating (1=low to 6=high),CPIA quality of public administration rating (1=low to 6=high)
0,Afghanistan,1.5,1.5,2.7,2.0,2.5
1,Bangladesh,2.5,3.0,3.3,2.5,2.0
2,Benin,2.5,3.5,3.5,3.5,3.0
3,Bhutan,3.0,4.0,4.0,4.0,4.0
4,Burkina Faso,3.0,3.5,3.6,3.0,3.0


In [147]:
public_sector_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73 entries, 0 to 72
Data columns (total 6 columns):
 #   Column                                                                       Non-Null Count  Dtype  
---  ------                                                                       --------------  -----  
 0   country_name                                                                 73 non-null     object 
 1   CPIA financial sector rating (1=low to 6=high)                               73 non-null     float64
 2   CPIA gender equality rating (1=low to 6=high)                                73 non-null     float64
 3   CPIA policies for social inclusion/equity cluster average (1=low to 6=high)  73 non-null     float64
 4   CPIA property rights and rule-based governance rating (1=low to 6=high)      73 non-null     float64
 5   CPIA quality of public administration rating (1=low to 6=high)               73 non-null     float64
dtypes: float64(5), object(1)
memory usage: 3.5+ KB

In [148]:
# Rename new columns
public_sector_stats.rename(columns=lambda c: re.sub(r' \(.*\)', '', c), inplace=True)
public_sector_stats.rename(columns=lambda c: re.sub(r'CPIA ', '', c), inplace=True)
public_sector_stats.rename(columns=lambda c: c.replace(" ", "_").lower(), inplace=True)
public_sector_stats.head()

series_name,country_name,financial_sector_rating,gender_equality_rating,policies_for_social_inclusion/equity_cluster_average,property_rights_and_rule-based_governance_rating,quality_of_public_administration_rating
0,Afghanistan,1.5,1.5,2.7,2.0,2.5
1,Bangladesh,2.5,3.0,3.3,2.5,2.0
2,Benin,2.5,3.5,3.5,3.5,3.0
3,Bhutan,3.0,4.0,4.0,4.0,4.0
4,Burkina Faso,3.0,3.5,3.6,3.0,3.0
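
The three renaming steps can be traced on a single column name. Note that the results still contain `/` and `-`, which is fine for bracket access but not for attribute access. The second helper below is an optional extra step and an assumption of this sketch, not part of the notebook; both function names are hypothetical.

```python
import re


def clean_cpia_name(name: str) -> str:
    """Reproduce the three renaming steps on a single column name."""
    name = re.sub(r" \(.*\)", "", name)   # drop the "(1=low to 6=high)" suffix
    name = re.sub(r"CPIA ", "", name)     # drop the "CPIA " prefix
    return name.replace(" ", "_").lower()


def to_identifier(name: str) -> str:
    """Optional extra step (not in the notebook): collapse remaining punctuation."""
    return re.sub(r"[^0-9a-z]+", "_", name).strip("_")


raw = "CPIA policies for social inclusion/equity cluster average (1=low to 6=high)"
cleaned = clean_cpia_name(raw)
print(cleaned)                  # policies_for_social_inclusion/equity_cluster_average
print(to_identifier(cleaned))   # policies_for_social_inclusion_equity_cluster_average
```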


In [149]:
# Again, I use simple min-max scaling so that all features lie between 0 and 1 and keep their
# impact relative to their groups
min_max_scaler = MinMaxScaler()

# Get the numerical columns of the public sector dataset
numerical_columns_pu_s_s = public_sector_stats.columns.copy().drop("country_name")

public_sector_stats[numerical_columns_pu_s_s] = min_max_scaler.fit_transform(public_sector_stats[numerical_columns_pu_s_s])
public_sector_stats.head()

series_name,country_name,financial_sector_rating,gender_equality_rating,policies_for_social_inclusion/equity_cluster_average,property_rights_and_rule-based_governance_rating,quality_of_public_administration_rating
0,Afghanistan,0.166667,0.0,0.428571,0.333333,0.5
1,Bangladesh,0.5,0.5,0.642857,0.5,0.333333
2,Benin,0.5,0.666667,0.714286,0.833333,0.666667
3,Bhutan,0.666667,0.833333,0.892857,1.0,1.0
4,Burkina Faso,0.666667,0.666667,0.75,0.666667,0.666667


Up to this point, 4 data sets have yielded results: economy_stats, employment_stats, population_and_environment_stats and public_sector_stats.

The other 6 data sets did not yield any useful result: education_stats, 