<h1 style="text-align:center" >Phrontistery - Data Driven Decisions  </h1>
<h3 style="text-align:center" >The Big Three</h3>

# Introduction

The starting point of this project is to check if this hypothesis is validated: there exists a relationship between GDP (Gross Domestic Product) growth of a country and social and economical indicators such as population growth, fertility rates, investment in specific sectors or prices. To demostrate the validity of this assumption has been used the correlation as a decision making tool. So, through Python language and the availabe libraries such as Pandas, it has been developed an algorithim that allows us to extract reliable conclusions, being able to take the data from the database files, perform statistical computations on them and get useful numerical and visual results to make conclusions.

This tool has been developed for all available datasets, but in our case we have extracted data from World Bank, updated on June 22, 2022.

The steps of the project have been as follows:

* [Extraction](#Extraction)
* [Integration](#Integration)
* [Normalization](#Normalization)
* [Categorization of variables](#Categorization-of-variables)
* [Correlation study](#Correlation-study)
* [Visualization of results](#Visualization-of-results)
* [Spurious correlations](#Spurious-correlations)


Along the project all steps are explained, as well as, assumptions and special terms that's important to take attention.



Import libraries and functions.

In [1]:
import pandas as pd
import numpy as np
import glob
import os
from zipfile import ZipFile
import warnings
warnings.filterwarnings("ignore")
import functools as ft
import ipywidgets as widgets
from ipywidgets import Layout
from ipywidgets import interact, interact_manual
import plotly.express as px
from scipy import stats
from scipy.stats import shapiro
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from pandas.api.types import is_numeric_dtype

# Extraction

This step has been depreciated due to security features added to the World Bank website. However, the data is still obtainable by openinng the `url`.

In the following cells we are going to import the data from the [World Bank website](https://www.worldbank.org/en/home), Data sections, and decompress it. So later on, we can process it in the integration, and subsequent parts.

To download the data, we are going to use the `request` library as just to be sure that we do not run into any inconvinient we are going to use the `stream` and `verify` to have a simple and cleaner download. Therefore, at the end we will have the data downloaded in our working directory data by design. (It can be obtained with the following function `os.getcwd()`)

In [None]:
url = 'https://databank.worldbank.org/data/download/WDI_csv.zip'
r = requests.get(url, allow_redirects=True, stream = True,verify=False)
open('WDI_csv.zip', 'wb').write(r.content)

Then we extract all the files that are contained in WDI_csv.zip , into the default directory in a new folder called Data.

In [None]:
zip_name = "WDI_csv.zip"
with ZipFile(zip_name, 'r') as zip:
    zip.printdir()
    zip.extractall('Data') 

Finally we are going to delete the WDI_csv.zip.

In [None]:
os.remove(os.getcwd()+'/Data/'+"/**.zip")

# Integration

Firstly we load the database from World Data Bank that has been downloaded and extracted in the *Data extraction* notebook. We acquire it from the predetermined path that is on our computer.

In [2]:
df= pd.read_csv (os.getcwd()+'/Data/'+'WDIData.csv')
df

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,16.936004,17.337896,17.687093,18.140971,18.491344,18.825520,19.272212,19.628009,,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.499471,6.680066,6.859110,7.016238,7.180364,7.322294,7.517191,7.651598,,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,37.855399,38.046781,38.326255,38.468426,38.670044,38.722783,38.927016,39.042839,,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.794160,32.001027,33.871910,38.880173,40.261358,43.061877,44.270860,45.803485,,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,18.663502,17.633986,16.464681,24.531436,25.345111,27.449908,29.641760,30.404935,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384365,Zimbabwe,ZWE,Women who believe a husband is justified in be...,SG.VAW.REFU.ZS,,,,,,,...,,,14.500000,,,,,,,
384366,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,,,,,,,...,,,3.700000,,,,5.418352,,,
384367,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,,,...,,33.500000,32.400000,,,,33.658057,,,
384368,Zimbabwe,ZWE,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,,,,,,,...,59.200000,59.400000,59.500000,59.700000,59.900000,60.000000,60.200000,60.400000,,


Moreover, to work more comfortably, we remove those columns not useful for us, as *Country Name* and *Indicator Code*, since with the *Country Code*, *Value* and the *Indicator Name* we have the relevant information.

In [3]:
df.drop(columns=["Country Name","Indicator Code"], axis=1, inplace=True)

FILTER 1: BY COUNTRY

From the almost two hundred countries we have information about in the worldwide database, we have decided to study 50 of them, making an initial grouping by geographical and economical similiarities. With this, we can keep in our dataframe the selected countries.

Criteria for grouping:
- Europe: Germany, France, Sweden, United Kingdom, Spain, Croatia, Poland, Greece, Austria and Netherlands.

*Interesting countries of the European continent that can reflect events such as the Brexit process, the 2008 crisis or their historical strength.*
- Persian Gulf: Iraq, Qatar, United Arab Emirates, Arabia Saudita, Azerbayan, Yemen, Yemen Democratic and Oman.

*Countries located in the Persian Gulf, which have a similar economy based mainly on petrol and social structures.*
- North Africa: Algeria, Egiypt, Lybia, Israel, Turkey and Morroco.

*Countries of the african continent that are middle developed and with high mobility of people and goods.*
- South Africa: Senegal, South Africa, Liberia, Mozambique, Cameroon, Nigeria and Ghana.

*Countries of the south and central africa that are mainly subdeveloped and considered some of the poorest countries worldwide; but, on the contrary, one of them is highly developed.*
- Asia: Bangladesh, India, Vietnam, Thailand, Indonesia, Philipines and Korea (South).

*Converted in the last decades in the manufacturing of the world, they are subdeveloped countries with high population and childhood.*
- Latin America: Mexico, Brasil, Argentina, Peru, Venezuela, Colombia, Chile, Panama and Costa Rica.

*Countries located in same continet and some with singular political structures.* 
- Pair: USA and China.

*Although these countries seem to be confronted between them, they have been the top two most growing worlwide, despite the fact that culturally and economically they are completely distant.*


In [4]:
europe_list=['DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD']
persian_list=['IRQ','QAT','ARE','SAU','AZE','YEM','YDR','OMN']
naf_list=['DZA','EGY','LBY','ISR','TUR','MAR']
saf_list=['SEN','ZAF','LBR','MOZ','CMR','NGA','GHA']
asia_list=['BGD','IND','VNM','THA','IDN','PHL','KOR']
latam_list=['MEX','BRA','ARG','PER','VEN','COL','CHL','PAN','CRI']
two_list=['USA','CHN']
country_list=europe_list+persian_list+naf_list+saf_list+asia_list+latam_list+two_list 

In [5]:
df1=df.loc[df['Country Code'].isin(country_list)]

Now we transpose the rows of years into the columns.

In [6]:
df2=(df1.set_index(["Country Code", "Indicator Name"]).stack().reset_index(name='Value').rename(columns={'level_2':'Date'}))
df2

Unnamed: 0,Country Code,Indicator Name,Date,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1769874,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1769875,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1769876,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1769877,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


FILTER 2: BY YEAR

Our time range covers from 1960 to 2021. However, the record is not uniform and complete for all areas and indicators. We can appreaciate that specially in the first years of the last century, so many data is missing, then it makes no sense to study it. Besides, for the year 2021 many data is also lacking. Therefore, we would delimit our study between 1990 and 2020.

In [7]:
df2[['Date']] = df2[['Date']].astype(int)

In [8]:
df2.dtypes

Country Code       object
Indicator Name     object
Date                int32
Value             float64
dtype: object

In [9]:
df3 = df2[df2['Date'] > 1989]
df3

Unnamed: 0,Country Code,Indicator Name,Date,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1769874,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1769875,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1769876,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1769877,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


In [10]:
BronzeDataFrame=df3

-----

# Normalization

Taking as reference both works of https://www.pluralsight.com/guides/cleaning-up-data-from-outliers and https://careerfoundry.com/en/blog/data-analytics/how-to-find-outliers/, for normalizing our data we need to start computing the outliers and removing them from our dataframe. As there is not a direct function of pandas that performs this step, it´s been step-by-step code, where we begin with the computation of the quartiles, then the IQR (Inter Quartile Range) and finally the upper and lower limit.

##### IQR explanation

The interquartile range (IQR) measures the spread of the middle half of your data. It is the range for the middle 50% of your sample. Use the IQR to assess the variability where most of your values lie. Larger values indicate that the central portion of your data spread out further. Conversely, smaller values show that the middle values cluster more tightly.

To visualize the interquartile range, imagine dividing your data into quarters. Statisticians refer to these quarters as quartiles and label them from low to high as Q1, Q2, Q3, and Q4. The lowest quartile (Q1) covers the smallest quarter of values in your dataset. The upper quartile (Q4) comprises the highest quarter of values. The interquartile range is the middle half of the data that lies between the upper and lower quartiles. In other words, the interquartile range includes the 50% of data points that are above Q1 and below Q4. The IQR is the red area in the graph below, containing Q2 and Q3 (not labeled).

https://camo.githubusercontent.com/a5f6cf8164048f8c28f9b00b94e1264480c8c3b20a1b3d0bdca47083f3a86a19/68747470733a2f2f69302e77702e636f6d2f7374617469737469637362796a696d2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f30332f696e7465727175617274696c655f72616e67652e706e673f773d3537362673736c3d31

When measuring variability, statisticians prefer using the interquartile range instead of the full data range because extreme values and outliers affect it less. Typically, use the IQR with a measure of central tendency, such as the median, to understand your data’s center and spread. This combination creates a fuller picture of your data’s distribution.

Therefore it is being utilized to get rid of all the outliers that may come from errors when creating the data or from unexpected years.

Firstly, we compute the first quartile (Q1=25%) and the third quartile (Q3=75%). For that, we have grouped the data by country code and indicator name, so we get the Q1 and Q3 values for each indicator in each geographical area. 

In [11]:
grouped=BronzeDataFrame.groupby(['Country Code','Indicator Name'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B1CA01DBB0>

In [12]:
Q1=BronzeDataFrame.groupby(['Country Code','Indicator Name']).quantile(0.25)
Q3=BronzeDataFrame.groupby(['Country Code','Indicator Name']).quantile(0.75)
IQR=Q3-Q1
IQR

Unnamed: 0_level_0,Unnamed: 1_level_0,Date,Value
Country Code,Indicator Name,Unnamed: 2_level_1,Unnamed: 3_level_1
ARE,Access to clean fuels and technologies for cooking (% of population),10.0,0.00
ARE,"Access to clean fuels and technologies for cooking, rural (% of rural population)",10.0,0.00
ARE,"Access to clean fuels and technologies for cooking, urban (% of urban population)",10.0,0.00
ARE,Access to electricity (% of population),15.0,0.00
ARE,"Access to electricity, rural (% of rural population)",15.0,0.00
...,...,...,...
ZAF,Women who believe a husband is justified in beating his wife when she refuses sex with him (%),0.0,0.00
ZAF,Women who were first married by age 15 (% of women ages 20-24),9.0,0.15
ZAF,Women who were first married by age 18 (% of women ages 20-24),9.0,2.15
ZAF,Women's share of population ages 15+ living with HIV (%),15.0,4.90


Once we got the quartiles, we compute the upper and lower limit, with a basic mathematical expression.

In [13]:
lower_limit=Q1 - 1.5 * IQR
lower=lower_limit.drop(['Date'],axis=1)
lower.rename(columns={"Value":"Lower limit"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Lower limit
Country Code,Indicator Name,Unnamed: 2_level_1
ARE,Access to clean fuels and technologies for cooking (% of population),100.000
ARE,"Access to clean fuels and technologies for cooking, rural (% of rural population)",100.000
ARE,"Access to clean fuels and technologies for cooking, urban (% of urban population)",100.000
ARE,Access to electricity (% of population),100.000
ARE,"Access to electricity, rural (% of rural population)",100.000
...,...,...
ZAF,Women who believe a husband is justified in beating his wife when she refuses sex with him (%),1.000
ZAF,Women who were first married by age 15 (% of women ages 20-24),0.625
ZAF,Women who were first married by age 18 (% of women ages 20-24),1.375
ZAF,Women's share of population ages 15+ living with HIV (%),49.850


In [14]:
upper_limit=Q3 + 1.5 * IQR
upper=upper_limit.drop(['Date'],axis=1)
upper.rename(columns={"Value":"Upper limit"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Upper limit
Country Code,Indicator Name,Unnamed: 2_level_1
ARE,Access to clean fuels and technologies for cooking (% of population),100.000
ARE,"Access to clean fuels and technologies for cooking, rural (% of rural population)",100.000
ARE,"Access to clean fuels and technologies for cooking, urban (% of urban population)",100.000
ARE,Access to electricity (% of population),100.000
ARE,"Access to electricity, rural (% of rural population)",100.000
...,...,...
ZAF,Women who believe a husband is justified in beating his wife when she refuses sex with him (%),1.000
ZAF,Women who were first married by age 15 (% of women ages 20-24),1.225
ZAF,Women who were first married by age 18 (% of women ages 20-24),9.975
ZAF,Women's share of population ages 15+ living with HIV (%),69.450


Thirdly, we join the three tables we have (main dataframe, upper limit and lower limit) by matching country code and indicator name..

In [15]:
dfs = [BronzeDataFrame,lower,upper]
df_joined = ft.reduce(lambda left, right: pd.merge(left, right, on=['Country Code','Indicator Name']), dfs)
df_joined

Unnamed: 0,Country Code,Indicator Name,Date,Value_x,Value_y,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1,97.0,101.0
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3,97.0,101.0
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8,97.0,101.0
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0,97.0,101.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2,97.0,101.0
...,...,...,...,...,...,...
1225413,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0,-50.0,350.0
1225414,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0,-50.0,350.0
1225415,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0,-50.0,350.0
1225416,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0,-50.0,350.0


In [16]:
list(df_joined)

['Country Code', 'Indicator Name', 'Date', 'Value_x', 'Value_y', 'Value']

We rename the columns of the new table, as the columns headers are not saved after the joining. 

In [17]:
renamed=df_joined.set_axis(['Country','Indicator','Year', 'Real value', 'Lower value', 'Upper value'], axis=1, inplace=False)
renamed

Unnamed: 0,Country,Indicator,Year,Real value,Lower value,Upper value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1,97.0,101.0
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3,97.0,101.0
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8,97.0,101.0
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0,97.0,101.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2,97.0,101.0
...,...,...,...,...,...,...
1225413,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0,-50.0,350.0
1225414,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0,-50.0,350.0
1225415,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0,-50.0,350.0
1225416,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0,-50.0,350.0


Now that we have the table correctly defined, we remove from our dataframe the values that are outside our range, as it means that they are outliers.

In [18]:
sin_outliers=renamed.loc[~((renamed['Real value']<renamed['Lower value']) | (renamed['Real value']>renamed['Upper value']))]
sin_outliers

Unnamed: 0,Country,Indicator,Year,Real value,Lower value,Upper value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1,97.0,101.0
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3,97.0,101.0
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8,97.0,101.0
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0,97.0,101.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2,97.0,101.0
...,...,...,...,...,...,...
1225413,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0,-50.0,350.0
1225414,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0,-50.0,350.0
1225415,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0,-50.0,350.0
1225416,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0,-50.0,350.0


From the data above, we can perceive that our data comes down from 1225418  rows to 1189068, so 36.350  were outliers. The next steps are to order and display data better, removing those columns that we just do not need and pivoting the rows and columns. 

In [19]:
df_limpio=sin_outliers.drop(['Lower value','Upper value'],axis=1)
df_limpio

Unnamed: 0,Country,Indicator,Year,Real value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1225413,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1225414,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1225415,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1225416,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


In [20]:
cols=df_limpio['Indicator'].unique().tolist()

In [21]:
SilverDataFrame=df_limpio.set_index(["Country", "Year"]).pivot(columns="Indicator", values="Real value").reset_index()
SilverDataFrame

Indicator,Country,Year,ARI treatment (% of children under 5 taken to a health provider),Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),...,Women who believe a husband is justified in beating his wife (any of five reasons) (%),Women who believe a husband is justified in beating his wife when she argues with him (%),Women who believe a husband is justified in beating his wife when she burns the food (%),Women who believe a husband is justified in beating his wife when she goes out without telling him (%),Women who believe a husband is justified in beating his wife when she neglects the children (%),Women who believe a husband is justified in beating his wife when she refuses sex with him (%),Women who were first married by age 15 (% of women ages 20-24),Women who were first married by age 18 (% of women ages 20-24),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
0,ARE,1990,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,18.8,100.0
1,ARE,1991,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,18.2,100.0
2,ARE,1992,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,19.4,100.0
3,ARE,1993,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,20.0,100.0
4,ARE,1994,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,20.0,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1531,ZAF,2017,,85.2,64.6,94.20,84.400002,76.738983,88.373024,69.218491,...,,,,,,,,,63.3,100000.0
1532,ZAF,2018,,85.7,65.5,94.65,84.699997,77.168495,88.518814,,...,,,,,,,,,63.7,92000.0
1533,ZAF,2019,,86.3,65.5,94.90,85.000000,77.611824,88.662704,,...,,,,,,,,,64.1,85000.0
1534,ZAF,2020,,86.8,65.9,95.20,84.385536,75.264854,88.806267,,...,,,,,,,,,64.4,79000.0


On the other hand, another big stone of normalizations is to nan/null values, which we have in all variables.

In [22]:
SilverDataFrame.isna().sum().sum()

1016628

As we can observe, we have lots of missing data, and as there is no optimal way to fullfill these values, thus, we will test some to arrive to the optimal method for our data set.

First, we need to create some lists so our loops work.

In [23]:
df=SilverDataFrame
europe_list=['DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD']
persian_list=['IRQ','QAT','ARE','SAU','AZE','YEM','OMN']
naf_list=['DZA','EGY','LBY','ISR','TUR','MAR']
saf_list=['SEN','ZAF','LBR','MOZ','CMR','NGA','GHA']
asia_list=['BGD','IND','VNM','THA','IDN','PHL','KOR']
latam_list=['MEX','BRA','ARG','PER','VEN','COL','CHL','PAN','CRI']
two_list=['USA','CHN']
country_list=europe_list+persian_list+naf_list+saf_list+asia_list+latam_list+two_list


We are attempting the linear interpolation, which is achieved by geometrically rendering a straight line between two adjacent points on a graph or plane.

In [24]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

685787

Here we attempt the backward filling, filling the previous cell with future values.

In [25]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.fillna(method='bfill')
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.fillna(method='bfill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

498648

Here we will attempt the forward filling, which concists of filling the next cell with previous values.

In [26]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.fillna(method='ffill')
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.fillna(method='ffill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

685787

The linear interpolation a form of interpolation, which involves the generation of new values based on an existing set of values. Linear interpolation is achieved by geometrically rendering a straight line between two adjacent points on a graph or plane. Whereas the backwards filling, will help us to arrive to those values which have not been fullfilled with the linear interpolation.

And as none of the methods have worked out correctly, independently, we are going to mix them, to achieve a better result.

In [27]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
datc=datc.fillna(method='ffill')
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    datc=datc.fillna(method='ffill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

685787

In [28]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
datc=datc.fillna(method='bfill')
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    datc=datc.fillna(method='bfill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

310048

And finally, mixing the three methods all together.

In [29]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
datf=datc.fillna(method='bfill')
datr=datf.fillna(method='ffill')
data=datr

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    datc=datc.fillna(method='bfill')
    datc=datc.fillna(method='ffill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

310048

##### Conclusion

Therefore, the preferred method for the Nan values´ treatment that we are going to develop is a mix, between the linear interpolation and backwards filling.

In [30]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
datf=datc.fillna(method='bfill')
datr=datf.fillna(method='ffill')
data=datr

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    datc=datc.fillna(method='bfill')
    datc=datc.fillna(method='ffill')
    data=pd.concat((data, datc), axis = 0)
data

Indicator,Country,Year,ARI treatment (% of children under 5 taken to a health provider),Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),...,Women who believe a husband is justified in beating his wife (any of five reasons) (%),Women who believe a husband is justified in beating his wife when she argues with him (%),Women who believe a husband is justified in beating his wife when she burns the food (%),Women who believe a husband is justified in beating his wife when she goes out without telling him (%),Women who believe a husband is justified in beating his wife when she neglects the children (%),Women who believe a husband is justified in beating his wife when she refuses sex with him (%),Women who were first married by age 15 (% of women ages 20-24),Women who were first married by age 18 (% of women ages 20-24),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
352,DEU,1990,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
353,DEU,1991,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
354,DEU,1992,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
355,DEU,1993,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
356,DEU,1994,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,,73.2,55.2,86.2,100.0,100.0,100.0,80.229118,...,,,,,,,,,,
252,CHN,2018,,75.6,59.0,87.4,100.0,100.0,100.0,80.229118,...,,,,,,,,,,
253,CHN,2019,,77.6,61.9,88.4,100.0,100.0,100.0,80.229118,...,,,,,,,,,,
254,CHN,2020,,79.4,65.2,89.4,100.0,100.0,100.0,80.229118,...,,,,,,,,,,


Now, we will drop the columns which have over X% missing values because the absence of data creates an unreliable source. This % can be adjusted in the following slider. We have predetermined that 20% is a great starting point.

In [31]:
Slider1=widgets.FloatSlider(
    value=0.2,
    min=0,
    max=1.0,
    step=0.05,
    description='% that creates unreliable source:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='.1f',
)
Slider1

FloatSlider(value=0.2, continuous_update=False, description='% that creates unreliable source:', max=1.0, read…

In [32]:
number_1=len(data.index)*Slider1.value
for i in range(0, len(cols)):
    if data[cols[i]].isna().sum()>number_1:
        del(data[cols[i]])
        print(cols[i])
data

Adults (ages 15+) and children (ages 0-14) newly infected with HIV
Adults (ages 15-49) newly infected with HIV
Antiretroviral therapy coverage (% of people living with HIV)
Antiretroviral therapy coverage for PMTCT (% of pregnant women living with HIV)
ARI treatment (% of children under 5 taken to a health provider)
Average transaction cost of sending remittances to a specific country (%)
Average working hours of children, study and work, ages 7-14 (hours per week)
Average working hours of children, study and work, female, ages 7-14 (hours per week)
Average working hours of children, study and work, male, ages 7-14 (hours per week)
Average working hours of children, working only, ages 7-14 (hours per week)
Average working hours of children, working only, female, ages 7-14 (hours per week)
Average working hours of children, working only, male, ages 7-14 (hours per week)
Bank capital to assets ratio (%)
Bank liquid reserves to bank assets ratio (%)
Bank nonperforming loans to total gross

Indicator,Country,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,Urban population growth (annual %),Urban population living in areas where elevation is below 5 meters (% of total population),"Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100)
352,DEU,1990,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,1.056365,3.031776,5.700000,4.800000,5.170000,92.050003,89.110001,90.330002,54.519497,71.250
353,DEU,1991,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,0.934908,3.029378,5.700000,4.800000,5.170000,92.050003,89.110001,90.330002,54.519497,71.250
354,DEU,1992,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,0.884470,3.026980,5.740000,4.930000,5.270000,91.910004,88.589996,89.970001,54.519497,71.250
355,DEU,1993,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,0.843967,3.024581,5.850000,5.040000,5.380000,91.669998,88.250000,89.669998,56.039631,71.250
356,DEU,1994,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,0.636245,3.022183,5.610000,5.200000,5.370000,91.629997,87.739998,89.370003,57.559764,71.250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,73.2,55.2,86.2,100.0,100.0,100.0,80.229118,76.364731,...,2.739664,4.203002,46.530001,42.949999,44.530002,52.430000,54.169998,53.400002,21.358958,75.625
252,CHN,2018,75.6,59.0,87.4,100.0,100.0,100.0,80.229118,76.364731,...,2.503401,4.203002,45.720001,41.940001,43.609999,53.209999,55.139999,54.290001,21.358958,75.625
253,CHN,2019,77.6,61.9,88.4,100.0,100.0,100.0,80.229118,76.364731,...,2.290177,4.203002,44.760000,40.819999,42.540000,54.150002,56.279999,55.340000,21.358958,75.625
254,CHN,2020,79.4,65.2,89.4,100.0,100.0,100.0,80.229118,76.364731,...,2.066047,4.203002,44.760000,40.819999,42.540000,54.150002,56.279999,55.340000,21.358958,75.625


Afterwards, we have scaled the values. The escalation process has been done dividing each value by the initial one of an indicator (value in 1990). Considering the start point as 1 (initial value divided by itself), each result will show the growth respect to the initial data.

In [33]:
columns=data.columns.values.tolist()

In [34]:
datae=data.loc[data.loc[:, 'Country'] == country_list[0]]
for i in range(2,len(columns)):
    a=columns[i]
    datae[a]=datae[a]/datae.iloc[0,i]
datau=datae

In [35]:
for u in range(1,len(country_list)):
    datae=data.loc[data.loc[:, 'Country'] == country_list[u]]   
    for i in range(2,len(columns)):
        a=columns[i]
        datae[a]=datae[a]/datae.iloc[0,i]
    datau=pd.concat((datau, datae), axis = 0)
datau

Indicator,Country,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,Urban population growth (annual %),Urban population living in areas where elevation is below 5 meters (% of total population),"Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100)
352,DEU,1990,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
353,DEU,1991,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.885023,0.999209,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
354,DEU,1992,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.837277,0.998418,1.007018,1.027083,1.019342,0.998479,0.994164,0.996015,1.000000,1.000000
355,DEU,1993,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.798935,0.997627,1.026316,1.050000,1.040619,0.995872,0.990349,0.992693,1.027882,1.000000
356,DEU,1994,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.602296,0.996836,0.984210,1.083333,1.038685,0.995437,0.984626,0.989372,1.055765,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,1.742857,2.348936,1.261156,1.030696,1.048707,1.0,1.257169,1.272562,...,0.635700,1.362819,0.707250,0.617275,0.656204,1.555786,1.984976,1.768798,8.610987,1.273684
252,CHN,2018,1.800000,2.510638,1.278713,1.030696,1.048707,1.0,1.257169,1.272562,...,0.580879,1.362819,0.694938,0.602759,0.642647,1.578932,2.020520,1.798278,8.610987,1.273684
253,CHN,2019,1.847619,2.634043,1.293343,1.030696,1.048707,1.0,1.257169,1.272562,...,0.531403,1.362819,0.680347,0.586663,0.626879,1.606825,2.062294,1.833057,8.610987,1.273684
254,CHN,2020,1.890476,2.774468,1.307974,1.030696,1.048707,1.0,1.257169,1.272562,...,0.479397,1.362819,0.680347,0.586663,0.626879,1.606825,2.062294,1.833057,8.610987,1.273684


As later on we want to study the correlations with time moved, we need to create new columns for it. The reason why is because, maybe the effect of a variable does not happen until a couple of years later. The time movements that have been considered are those of the Fibonacci serie within our time period. 

The following pictures helps to realize the behaviour that we were explaining. 

![](https://raw.githubusercontent.com/devonfw-forge/python-data-driven-decisions/main-the-big-three/Logos/Time%20moved.JPG)

In [36]:
shifted=pd.DataFrame()
for i in range(0,len(country_list)):
    dat=datau.loc[datau.loc[:, 'Country'] == country_list[i]]
    dat['GDP (current US$)+1']=dat['GDP (current US$)'].shift(periods=1)
    dat['GDP (current US$)+2']=dat['GDP (current US$)'].shift(periods=2)
    dat['GDP (current US$)+3']=dat['GDP (current US$)'].shift(periods=3)
    dat['GDP (current US$)+5']=dat['GDP (current US$)'].shift(periods=5)
    dat['GDP (current US$)+8']=dat['GDP (current US$)'].shift(periods=8)
    dat['GDP (current US$)+13']=dat['GDP (current US$)'].shift(periods=13)
    dat['GDP (current US$)+21']=dat['GDP (current US$)'].shift(periods=21)
    shifted=pd.concat((shifted, dat), axis = 0)
shifted

Indicator,Country,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,"Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100),GDP (current US$)+1,GDP (current US$)+2,GDP (current US$)+3,GDP (current US$)+5,GDP (current US$)+8,GDP (current US$)+13,GDP (current US$)+21
352,DEU,1990,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.000000,,,,,,,
353,DEU,1991,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.000000,1.000000,,,,,,
354,DEU,1992,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.996015,1.000000,1.000000,1.054905,1.000000,,,,,
355,DEU,1993,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.992693,1.027882,1.000000,1.203142,1.054905,1.000000,,,,
356,DEU,1994,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.989372,1.055765,1.000000,1.169136,1.203142,1.054905,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,1.742857,2.348936,1.261156,1.030696,1.048707,1.0,1.257169,1.272562,...,1.768798,8.610987,1.273684,31.129362,30.653486,29.029938,23.644292,14.137706,5.418606,2.393592
252,CHN,2018,1.800000,2.510638,1.278713,1.030696,1.048707,1.0,1.257169,1.272562,...,1.798278,8.610987,1.273684,34.114284,31.129362,30.653486,26.521259,16.868589,6.334809,2.664772
253,CHN,2019,1.847619,2.634043,1.293343,1.030696,1.048707,1.0,1.257169,1.272562,...,1.833057,8.610987,1.273684,38.504955,34.114284,31.129362,29.029938,20.926519,7.626636,2.851657
254,CHN,2020,1.890476,2.774468,1.307974,1.030696,1.048707,1.0,1.257169,1.272562,...,1.833057,8.610987,1.273684,39.572189,38.504955,34.114284,30.653486,23.644292,9.838617,3.031657


In [37]:
data=shifted

For the next part of analyzing this data, we think it is gonna be interesting to have it classify by the categories of the Country groups defined before, to which we call "Continent". This category is useful as it groups the nations with similar economies or geographical proximity, so we can extract common conclusions from them.

We create a dictionary with the regions and the countries included in each one. Where we will relate the countries and regions so then we can apply the .map function and arrive to the final dataframe.

In [38]:
countries_by_region = {
    "Europe": ('DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD'),
    'Persian Gulf': ('IRQ','QAT','ARE','SAU','AZE','YEM','YDR','OMN'),
    'North Africa':('DZA','EGY','LBY','ISR','TUR','MAR'),
    'South Africa':('SEN','ZAF','LBR','MOZ','CMR','NGA','GHA'),
    'Asia':('BGD','IND','VNM','THA','IDN','PHL','KOR'),
    'Latam':('MEX','BRA','ARG','PER','VEN','COL','CHL','PAN','CRI'),
    'Pair':('USA','CHN')
    }

all_countries = {}
for region in countries_by_region.keys():
  for country in countries_by_region[region]:
    all_countries[country] = region

print(all_countries)

{'DEU': 'Europe', 'FRA': 'Europe', 'SWE': 'Europe', 'GBR': 'Europe', 'ESP': 'Europe', 'HRV': 'Europe', 'POL': 'Europe', 'GRC': 'Europe', 'AUT': 'Europe', 'NLD': 'Europe', 'IRQ': 'Persian Gulf', 'QAT': 'Persian Gulf', 'ARE': 'Persian Gulf', 'SAU': 'Persian Gulf', 'AZE': 'Persian Gulf', 'YEM': 'Persian Gulf', 'YDR': 'Persian Gulf', 'OMN': 'Persian Gulf', 'DZA': 'North Africa', 'EGY': 'North Africa', 'LBY': 'North Africa', 'ISR': 'North Africa', 'TUR': 'North Africa', 'MAR': 'North Africa', 'SEN': 'South Africa', 'ZAF': 'South Africa', 'LBR': 'South Africa', 'MOZ': 'South Africa', 'CMR': 'South Africa', 'NGA': 'South Africa', 'GHA': 'South Africa', 'BGD': 'Asia', 'IND': 'Asia', 'VNM': 'Asia', 'THA': 'Asia', 'IDN': 'Asia', 'PHL': 'Asia', 'KOR': 'Asia', 'MEX': 'Latam', 'BRA': 'Latam', 'ARG': 'Latam', 'PER': 'Latam', 'VEN': 'Latam', 'COL': 'Latam', 'CHL': 'Latam', 'PAN': 'Latam', 'CRI': 'Latam', 'USA': 'Pair', 'CHN': 'Pair'}


In [39]:
data['Continent']=data['Country'].map(all_countries)
Goldendataframe=data
Goldendataframe

Indicator,Country,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,"Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100),GDP (current US$)+1,GDP (current US$)+2,GDP (current US$)+3,GDP (current US$)+5,GDP (current US$)+8,GDP (current US$)+13,GDP (current US$)+21,Continent
352,DEU,1990,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,,,,,,,,Europe
353,DEU,1991,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.000000,,,,,,,Europe
354,DEU,1992,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.054905,1.000000,,,,,,Europe
355,DEU,1993,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.027882,1.000000,1.203142,1.054905,1.000000,,,,,Europe
356,DEU,1994,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.055765,1.000000,1.169136,1.203142,1.054905,,,,,Europe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,1.742857,2.348936,1.261156,1.030696,1.048707,1.0,1.257169,1.272562,...,8.610987,1.273684,31.129362,30.653486,29.029938,23.644292,14.137706,5.418606,2.393592,Pair
252,CHN,2018,1.800000,2.510638,1.278713,1.030696,1.048707,1.0,1.257169,1.272562,...,8.610987,1.273684,34.114284,31.129362,30.653486,26.521259,16.868589,6.334809,2.664772,Pair
253,CHN,2019,1.847619,2.634043,1.293343,1.030696,1.048707,1.0,1.257169,1.272562,...,8.610987,1.273684,38.504955,34.114284,31.129362,29.029938,20.926519,7.626636,2.851657,Pair
254,CHN,2020,1.890476,2.774468,1.307974,1.030696,1.048707,1.0,1.257169,1.272562,...,8.610987,1.273684,39.572189,38.504955,34.114284,30.653486,23.644292,9.838617,3.031657,Pair


With that all, we export our dataframe all-in-one and by the continent category.

In [40]:
Goldendataframe.to_csv(os.getcwd()+'/Data/GoldenDataFrame.csv')

In [41]:
for region, data in Goldendataframe.groupby('Continent'):
   data.to_csv(os.getcwd()+'/Data/{}.csv'.format(region))

# Categorization of variables

In this section, we are going to attempt a categorization of the whole of the variables, which most of them come the same sources and just differ in the units that are measured, or the total that they are refering, between others. For a simpler treatment of the data, the variables have been pivoted into the same column.

In [42]:
columns_golden=list(Goldendataframe.columns)
del columns_golden[0:2]

In [43]:
Categorization=Goldendataframe.set_index(['Country','Year', 'Continent']).stack().reset_index()
Categorization['Short indicator']=Categorization['Indicator']
Categorization

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity (% of population)
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural (% of rural popul..."
...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP (current US$)+3
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP (current US$)+5
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP (current US$)+8
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP (current US$)+13


 There are some indicators which represent exactly the same through different units, so, we are going to select only one type. For example, in monetary cases, indicators which are expressed with current US $ has been selected. Then, which are showed with the percentage and the total value, we have programmed to selct which ones which show a greater value.

The links used to learn about these functions have been:

https://www.geeksforgeeks.org/how-to-drop-rows-that-contain-a-specific-string-in-pandas/ 

https://www.statology.org/pandas-drop-rows-that-contain-string/ 

In [44]:
import re
discard=["annual % growth","constant 2015 US[$]","% of GNI","constant LCU","current LCU"]
Categorization2=Categorization[~Categorization['Short indicator'].str.contains('|'.join(discard))]
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity (% of population)
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural (% of rural popul..."
...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP (current US$)+3
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP (current US$)+5
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP (current US$)+8
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP (current US$)+13


To check previous step.

In [45]:
#Categorization2.apply(lambda row: row.astype(str).str.contains('US').any(), axis=1)

Now we are going to structure the indicators in a same way to work better. The first step consist of making a new column that shows the units of each variable. Units are showed inside the parenthesis of the indicator name.

In [46]:
Categorization2['Units']=Categorization2['Short indicator'].str.extract(' (\(.*\))')
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population)
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population)
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population)
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity (% of population),(% of population)
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural (% of rural popul...",(% of rural population)
...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP (current US$)+3,(current US$)
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP (current US$)+5,(current US$)
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP (current US$)+8,(current US$)
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP (current US$)+13,(current US$)


Now, short indicator refers to the original indicator name without the units. The extracted information from the origin column has been deleted.

In [47]:
Categorization2['Short indicator']=Categorization2['Short indicator'].str.replace(r" (\(.*\))","")
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population)
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population)
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population)
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population)
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural",(% of rural population)
...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$)
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$)
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$)
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$)


In some cases there are extra information in indicators name. The information of the second parenthesis is extracted as a new column too.

In [48]:
two_parent=Categorization2[Categorization2['Short indicator'].str.contains('Contributing family workers')]
two_parent

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units
151,DEU,1990,Europe,"Contributing family workers, female (% of fema...",1.000000,"Contributing family workers, female",(% of female employment) (modeled ILO estimate)
152,DEU,1990,Europe,"Contributing family workers, male (% of male e...",1.000000,"Contributing family workers, male",(% of male employment) (modeled ILO estimate)
153,DEU,1990,Europe,"Contributing family workers, total (% of total...",1.000000,"Contributing family workers, total",(% of total employment) (modeled ILO estimate)
1151,DEU,1991,Europe,"Contributing family workers, female (% of fema...",1.000000,"Contributing family workers, female",(% of female employment) (modeled ILO estimate)
1152,DEU,1991,Europe,"Contributing family workers, male (% of male e...",1.000000,"Contributing family workers, male",(% of male employment) (modeled ILO estimate)
...,...,...,...,...,...,...,...
1538535,CHN,2020,Pair,"Contributing family workers, male (% of male e...",0.252894,"Contributing family workers, male",(% of male employment) (modeled ILO estimate)
1538536,CHN,2020,Pair,"Contributing family workers, total (% of total...",0.329385,"Contributing family workers, total",(% of total employment) (modeled ILO estimate)
1539524,CHN,2021,Pair,"Contributing family workers, female (% of fema...",0.386565,"Contributing family workers, female",(% of female employment) (modeled ILO estimate)
1539525,CHN,2021,Pair,"Contributing family workers, male (% of male e...",0.252894,"Contributing family workers, male",(% of male employment) (modeled ILO estimate)


Moreover, there are some inidcators with an extra parenthesis adding some  more information. As this information isn't related with units, another column named as 'other specification' has been created.

In [49]:
Categorization2[['Units','Other specification']]=Categorization2['Units'].str.split("\) ", n=1,expand=True)
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units,Other specification
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population),
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population),
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population),
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population),
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural",(% of rural population),
...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$),
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$),
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$),
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$),


At the end of the variable name, separated by the last "," it is informing us about to which subgroup makes reference the variable. Thus, there are some indicators that have information divided for small groups. This information is shown as a new column named 'Subgroup'.

In [50]:
Categorization2[['Subgroup']]=Categorization2['Short indicator'].str.extract(',(?P<field>[^,]*?)$')
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units,Other specification,Subgroup
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population),,
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population),,rural
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population),,urban
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population),,
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural",(% of rural population),,rural
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$),,
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$),,
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$),,
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$),,


As before, information which is shown as a new column is deleted from the origin one.

In [51]:
Categorization2['Short indicator']=Categorization2['Short indicator'].str.replace(',(?P<field>[^,]*?)$',"")
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units,Other specification,Subgroup
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population),,
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population),,rural
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population),,urban
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population),,
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,Access to electricity,(% of rural population),,rural
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$),,
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$),,
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$),,
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$),,


All the indicators don't have these elements. So, a checking point is needed.

In [52]:
Categorization2['Subgroup']=Categorization2['Subgroup'].replace(['None'],['total'])
Categorization2['Subgroup']=Categorization2['Subgroup'].fillna('total')

There are some duplicate variables which should be removed too.

In [53]:
Categorization2.drop_duplicates(subset=['Country','Year','Short indicator','Continent','Subgroup'], keep='first')

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units,Other specification,Subgroup
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population),,total
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population),,rural
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population),,urban
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population),,total
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,Access to electricity,(% of rural population),,rural
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$),,total
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$),,total
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$),,total
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$),,total


Reordering columns, categorization3 is our df after all these division in categories.

In [54]:
Categorization2.rename(columns={Categorization2.columns[4]:'Value'},inplace=True)
Categorization3=Categorization2[['Country','Year','Continent','Indicator','Short indicator','Value','Subgroup','Units','Other specification']]
Categorization3

Unnamed: 0,Country,Year,Continent,Indicator,Short indicator,Value,Subgroup,Units,Other specification
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,Access to clean fuels and technologies for coo...,1.000000,total,(% of population),
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,Access to clean fuels and technologies for coo...,1.000000,rural,(% of rural population),
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,Access to clean fuels and technologies for coo...,1.000000,urban,(% of urban population),
3,DEU,1990,Europe,Access to electricity (% of population),Access to electricity,1.000000,total,(% of population),
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",Access to electricity,1.000000,rural,(% of rural population),
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,GDP+3,38.504955,total,(current US$),
1540359,CHN,2021,Pair,GDP (current US$)+5,GDP+5,31.129362,total,(current US$),
1540360,CHN,2021,Pair,GDP (current US$)+8,GDP+8,26.521259,total,(current US$),
1540361,CHN,2021,Pair,GDP (current US$)+13,GDP+13,12.731623,total,(current US$),


------------------------------

In [55]:
Categorization4=Categorization3.loc[Categorization3['Subgroup']=='total']
Categorization4

Unnamed: 0,Country,Year,Continent,Indicator,Short indicator,Value,Subgroup,Units,Other specification
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,Access to clean fuels and technologies for coo...,1.000000,total,(% of population),
3,DEU,1990,Europe,Access to electricity (% of population),Access to electricity,1.000000,total,(% of population),
6,DEU,1990,Europe,Account ownership at a financial institution o...,Account ownership at a financial institution o...,1.000000,total,(% of population ages 15+),
20,DEU,1990,Europe,Adjusted net national income (current US$),Adjusted net national income,1.000000,total,(current US$),
23,DEU,1990,Europe,Adjusted net national income per capita (curre...,Adjusted net national income per capita,1.000000,total,(current US$),
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,GDP+3,38.504955,total,(current US$),
1540359,CHN,2021,Pair,GDP (current US$)+5,GDP+5,31.129362,total,(current US$),
1540360,CHN,2021,Pair,GDP (current US$)+8,GDP+8,26.521259,total,(current US$),
1540361,CHN,2021,Pair,GDP (current US$)+13,GDP+13,12.731623,total,(current US$),


In [56]:
Categorization4.to_csv(os.getcwd()+'/Data/Categorization.csv')

# Correlation study

For the correlation study, we are going to check if the indicators are related in a relevantly way to GDP(current US$), through both Pearson and Spearman correlations and their respective p-values. Thus, our hypotheses are as follow:

- H_0: the indicator and the GDP are uncorrelated.​
- H_1: the indicator and the GDP are correlated.​

**If p-value < α then reject H_0 and accept H_1.​**

P-value: is the probability of obtaining test results at least as extreme as the result actually observed. (Marging of the error)

​Confidence level (α): probability that a population parameter will fall between a set of values for a certain proportion of times. α = 1 - p-value.

​Significance level: probability of the study rejecting the null hypothesis when it is actually true.​ 

At the end we want to follow a process that checks if with Pearson and Spearman the correlation is relevant, as the image below ilustrates. 

![](https://raw.githubusercontent.com/devonfw-forge/python-data-driven-decisions/main-the-big-three/Logos/Spearman%20and%20pearson%20process.JPG)


In [57]:
Categorization4= pd.read_csv (os.getcwd()+'/Data/'+'Categorization.csv')

In [58]:
indicators_list=Categorization4['Indicator'].unique().tolist()

In [59]:
columns=indicators_list+['Country','Year','Continent']
clist=Categorization4['Country'].unique()
common=['Unnamed: 0','Country','Year']

The correlation between two variables does not always need to be a linear one, since data of these variables can follow a different distribution that better adjusts. Therefore, we have computed four types (linear, quadratic, cubic and logarithmic) for each indicator in each country and we chose the one with the highest correlation. In the following cell, we have defined a function that will allow us to calculate the different posibilities of relations: cuadratic, cubic and logaritmic.

In [60]:
def multcolumn(frame):
    for u in range(0, len(columns)-3):
        name=columns[u]+'¨l'
        name2=columns[u]+'¨^2'
        name3=columns[u]+'¨^3'
        namelog=columns[u]+'¨log'
        frame.loc[:,name2] = frame[columns[u]]**2
        frame.loc[:,name3] = frame[columns[u]]**3
        frame.loc[:,namelog] = np.log(frame[columns[u]])
        frame.rename(columns={columns[u]:name}, inplace=True)

Moreover, we want to know the correlation between all the variables, so to acomplish this, we have created the following loop, which will help us create a new dataframe where we will have: the *Indicator*, the *Type* of relation, the value of the *R^2*, its *Behaviour*, the *Country* and the *Continent*.

In [61]:
df= pd.read_csv (os.getcwd()+'/Data/'+'GoldenDataFrame.csv')
df_study=df[[c for c in df.columns if c in columns]]


In [62]:
multcolumn(df_study)

Firstly we are going to create two lists for the variables, which their level of confidence for each correlation, so later on, we can calculate only the correlations of those variables. Which will be filtered by the values that we want from the following sliders. The predetermined values for each case are 0.05 for the values and 0.75 for them to be a relevant correlations.

In [63]:
apv=widgets.FloatSlider(value=0.05,min=0,max=1.0,step=0.025,description='Level of confidence (Pearson):',disabled=False,continuous_update=False,orientation='horizontal',readout=True)
cpv=widgets.FloatSlider( value=0.05, min=0, max=1.0, step=0.05, description='Level of confidence (Spearman):', disabled=False, continuous_update=False, orientation='horizontal', readout=True)
bcor=widgets.FloatSlider( value=0.75, min=0, max=1.0, step=0.05, description='Relevant correlation (Pearson):', disabled=False, continuous_update=False, orientation='horizontal', readout=True)
dcor=widgets.FloatSlider( value=0.75, min=0, max=1.0, step=0.05, description='Relevant correlation (Spearman):', disabled=False, continuous_update=False, orientation='horizontal', readout=True)
display(apv,bcor,cpv,dcor)

FloatSlider(value=0.05, continuous_update=False, description='Level of confidence (Pearson):', max=1.0, step=0…

FloatSlider(value=0.75, continuous_update=False, description='Relevant correlation (Pearson):', max=1.0, step=…

FloatSlider(value=0.05, continuous_update=False, description='Level of confidence (Spearman):', max=1.0, step=…

FloatSlider(value=0.75, continuous_update=False, description='Relevant correlation (Spearman):', max=1.0, step…

In the following block, we check the hypotheses set before and extract only the variables that deny H_0.

In [64]:
dat=df_study.loc[df_study.loc[:, 'Country'] == clist[0]]
listacorpe=[]
listacorsp=[]
clmns=dat.columns.values.tolist()
dat.replace([np.inf, -np.inf], np.nan, inplace=True)
for c in range(0, len(clmns)):
    if dat[clmns[c]].isna().sum()>=1:
        del(dat[clmns[c]])
pilares=dat.columns.values.tolist()
for u in range(0,len(pilares)):
    if is_numeric_dtype(dat[pilares[u]]):
        correlation, pvalue=pearsonr(dat[pilares[u]], dat['GDP (current US$)¨l'])
        if pvalue<=apv.value:
            listacorpe.append(pilares[u])
        else:
            pass
        correlation, pvalue=spearmanr(dat[pilares[u]], dat['GDP (current US$)¨l'])
        if pvalue<=cpv.value:
            listacorsp.append(pilares[u])
        else:
            pass
    else:
        pass

Secondly, we need to calculate the correlation table for each country, therefore we use the basic function `corr()` which provides either the Pearson correlation table or the Spearman correlation table, as well as a filter for the countries.

In [65]:
dat=df_study.loc[df_study.loc[:, 'Country'] == clist[0]]

datp=dat[dat.columns[dat.columns.isin(listacorpe)]]
corp=datp.corr('pearson')

datsp=dat[dat.columns[dat.columns.isin(listacorsp)]]
cors=datsp.corr('spearman')

Then we calculate the coefficient of determination which is the correlation squared.

In [66]:
corp.loc[:,'R^2 Pearson'] = corp['GDP (current US$)¨l']**2

cors.loc[:,'R^2 Spearman'] = cors['GDP (current US$)¨l']**2

Moreover, we are going to create new columns to know which *Indicator* are we talking about, and the *Type* of correlation that is being analyzed (linear, cuadratic, cubic or logarithmic)

In [67]:
corp.loc[:,'Indicator']=corp.index
corp[['Indicator','Type']]=corp.Indicator.str.split('¨',1, expand=True)

cors.loc[:,'Indicator']=cors.index
cors[['Indicator','Type']]=cors.Indicator.str.split('¨',1, expand=True)

Now, we can apply the filter we have consider that is enough, selected on the slider. If is not varied it is R^2>=0.75 to filter the correlations.

In [68]:
corpcolumn=corp[['Indicator','R^2 Pearson','Type','GDP (current US$)¨l']]
corpcolumn=corpcolumn.loc[corpcolumn.loc[:, 'R^2 Pearson'] >=bcor.value]

corscolumn=cors[['Indicator','R^2 Spearman','Type','GDP (current US$)¨l']]
corscolumn=corscolumn.loc[corscolumn.loc[:, 'R^2 Spearman'] >= dcor.value]

Furthermore, we add all the columns that we have created into a data frame, thanks to the following cell.

In [69]:
idp=corpcolumn.groupby('Indicator')['R^2 Pearson'].transform(max)==corpcolumn['R^2 Pearson']
corpcolumn[idp]
maxp_df=pd.DataFrame(corpcolumn[idp])

ids=corscolumn.groupby('Indicator')['R^2 Spearman'].transform(max)==corscolumn['R^2 Spearman']
corscolumn[ids]
maxs_df=pd.DataFrame(corscolumn[ids])

Here, we conmute the values, by expressions. For example if the correlation is positive, we want in the new column called *Behaviour* the word Positive. Or for the *Type* column if the greatest correlation is cuadratic we want to put, Cuadratic. We also add the country.

In [70]:
maxp_df['Behaviour']=np.where(maxp_df['GDP (current US$)¨l']>0, 'Positive', 'Negative')
maxp_df['Type']=maxp_df['Type'].replace(['l','^2','^3','log'],['Linear','Cuadratic','Cubic','Logarithmic'])
maxp_df['Country']= clist[0]

maxs_df['Behaviour']=np.where(maxs_df['GDP (current US$)¨l']>0, 'Positive', 'Negative')
maxs_df['Type']=maxs_df['Type'].replace(['l','^2','^3','log'],['Linear','Cuadratic','Cubic','Logarithmic'])
maxs_df['Country']= clist[0]

In addition, we also drop the columns which do not add any value, as *GDP*, *Year*, and *Unnamed:0*.

In [71]:
maxp_df.drop("GDP (current US$)¨l",axis=1,inplace=True)
maxp_df=maxp_df.reset_index(drop=True)
maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='Year'].index)
maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='GDP (current US$)'].index)
maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='Unnamed: 0'].index)

maxs_df.drop("GDP (current US$)¨l",axis=1,inplace=True)
maxs_df=maxs_df.reset_index(drop=True)
maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Year'].index)
maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='GDP (current US$)'].index)
maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Unnamed: 0'].index)
maxs_df=maxs_df.sort_values(by = 'R^2 Spearman',ascending = False)

And finally we sort the values in descending order by the column *R^2 Pearson*.

In [72]:
maxp_df_deu=maxp_df.sort_values(by = 'R^2 Pearson',ascending = False)
pearsondf= maxp_df_deu
spearmandf=maxs_df

So, we can do it with all the countries and create just one dataframe, where we match both data frames, the Pearson and the Spearman, only where there is a case in both sides. Meaning if there is a relevant correlation for Country x in Variable y for Pearson, it will only appear if there is also a relevant correlation for Country x in Variable y for Spearman.

In [73]:
pearsondf
spearmandf
for i in range(1,len(clist)):
    dat=df_study.loc[df_study.loc[:, 'Country'] == clist[i]]
    listacorpe=[]
    listacorsp=[]
    clmns=dat.columns.values.tolist()
    dat.replace([np.inf, -np.inf], np.nan, inplace=True)
    for c in range(0, len(clmns)):
        if dat[clmns[c]].isna().sum()>=1:
            del(dat[clmns[c]])
    pilares=dat.columns.values.tolist()
    for u in range(0,len(pilares)):
        if is_numeric_dtype(dat[pilares[u]]):
            correlation, pvalue=pearsonr(dat[pilares[u]], dat['GDP (current US$)¨l'])
            if pvalue<=apv.value:
                listacorpe.append(pilares[u])
            else:
                pass
            correlation, pvalue=spearmanr(dat[pilares[u]], dat['GDP (current US$)¨l'])
            if pvalue<=cpv.value:
                listacorsp.append(pilares[u])
            else:
                pass
        else:
            pass
    
    dat=df_study.loc[df_study.loc[:, 'Country'] == clist[i]]

    datp=dat[dat.columns[dat.columns.isin(listacorpe)]]
    corp=datp.corr('pearson')

    datsp=dat[dat.columns[dat.columns.isin(listacorsp)]]
    cors=datsp.corr('spearman')


    corp.loc[:,'R^2 Pearson'] = corp['GDP (current US$)¨l']**2

    cors.loc[:,'R^2 Spearman'] = cors['GDP (current US$)¨l']**2


    corp.loc[:,'Indicator']=corp.index
    corp[['Indicator','Type']]=corp.Indicator.str.split('¨',1, expand=True)

    cors.loc[:,'Indicator']=cors.index
    cors[['Indicator','Type']]=cors.Indicator.str.split('¨',1, expand=True)


    corpcolumn=corp[['Indicator','R^2 Pearson','Type','GDP (current US$)¨l']]
    corpcolumn=corpcolumn.loc[corpcolumn.loc[:, 'R^2 Pearson'] >= bcor.value]
    
    corscolumn=cors[['Indicator','R^2 Spearman','Type','GDP (current US$)¨l']]
    corscolumn=corscolumn.loc[corscolumn.loc[:, 'R^2 Spearman'] >= dcor.value]


    idp=corpcolumn.groupby('Indicator')['R^2 Pearson'].transform(max)==corpcolumn['R^2 Pearson']
    corpcolumn[idp]
    maxp_df=pd.DataFrame(corpcolumn[idp])

    ids=corscolumn.groupby('Indicator')['R^2 Spearman'].transform(max)==corscolumn['R^2 Spearman']
    corscolumn[ids]
    maxs_df=pd.DataFrame(corscolumn[ids])


    maxp_df['Behaviour']=np.where(maxp_df['GDP (current US$)¨l']>0, 'Positive', 'Negative')
    maxp_df['Type']=maxp_df['Type'].replace(['l','^2','^3','log'],['Linear','Cuadratic','Cubic','Logarithmic'])
    maxp_df['Country']= clist[i]

    maxs_df['Behaviour']=np.where(maxs_df['GDP (current US$)¨l']>0, 'Positive', 'Negative')
    maxs_df['Type']=maxs_df['Type'].replace(['l','^2','^3','log'],['Linear','Cuadratic','Cubic','Logarithmic'])
    maxs_df['Country']= clist[i]


    maxp_df.drop("GDP (current US$)¨l",axis=1,inplace=True)
    maxp_df=maxp_df.reset_index(drop=True)
    maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='Year'].index)
    maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='GDP (current US$)'].index)
    maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='Unnamed: 0'].index)

    maxs_df.drop("GDP (current US$)¨l",axis=1,inplace=True)
    maxs_df=maxs_df.reset_index(drop=True)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Year'].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='GDP (current US$)'].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Unnamed: 0'].index)
    maxs_df=maxs_df.sort_values(by = 'R^2 Spearman',ascending = False)


    maxp_df=maxp_df.sort_values(by = 'R^2 Pearson',ascending = False)
    pearsondf=pd.concat((pearsondf, maxp_df), axis = 0)
    spearmandf=pd.concat((spearmandf, maxs_df), axis = 0)

corrtable=spearmandf.merge(pearsondf, left_on=('Indicator', 'Country','Type','Behaviour'), right_on=('Indicator', 'Country','Type','Behaviour'))
display(corrtable)

Unnamed: 0,Indicator,R^2 Spearman,Type,Behaviour,Country,R^2 Pearson
0,Adjusted net national income (current US$),0.996519,Linear,Positive,DEU,0.999141
1,Gross value added at basic prices (GVA) (curre...,0.996519,Linear,Positive,DEU,0.999851
2,GNI (current US$),0.996337,Linear,Positive,DEU,0.999362
3,Gross national expenditure (current US$),0.990490,Linear,Positive,DEU,0.997181
4,Final consumption expenditure (current US$),0.988302,Linear,Positive,DEU,0.996540
...,...,...,...,...,...,...
5965,Prevalence of anemia among women of reproducti...,0.793298,Logarithmic,Negative,CHN,0.752969
5966,Logistics performance index: Ability to track ...,0.788049,Logarithmic,Positive,CHN,0.926947
5967,Logistics performance index: Competence and qu...,0.788049,Logarithmic,Positive,CHN,0.890297
5968,Out-of-pocket expenditure (% of current health...,0.782573,Logarithmic,Negative,CHN,0.929883


Finally, a table has been created showing the number of times a variable has a high relationship in our 48 countries. These that appear many times will be interesting for us to draw conclusions. Then, we will checck if they are primary or seconday variable type.

In [74]:
columnssf=corrtable.Indicator.to_list()
columnsf=np.unique(columnssf)

In [75]:
powerind=[]
for i in range(0, len(columnsf)):
    powerind.append(columnssf.count(columnsf[i]))

df_indicators = pd.DataFrame(list(zip(columnsf,powerind)), columns = ['Indicator','Number of times repeated'])
df_indicators=df_indicators.sort_values(by = 'Number of times repeated',ascending = False)
df_indicators
display(df_indicators)

Unnamed: 0,Indicator,Number of times repeated
127,GDP per capita (current US$),46
178,"Industry (including construction), value added...",46
132,GNI (current US$),46
152,Households and NPISHs Final consumption expend...,45
108,Final consumption expenditure (current US$),45
...,...,...
250,Net official development assistance and offici...,1
248,"Net capital account (BoP, current US$)",1
246,Natural gas rents (% of GDP),1
29,Arms exports (SIPRI trend indicator values),1


In [76]:
#To get list of all number of times repeated.
#from IPython.display import HTML

#HTML(df_indicators.to_html(index=False))

While for the temporal diferences, we want to know which works better in each case, the process to be followed is described in the following picture.

![](https://raw.githubusercontent.com/devonfw-forge/python-data-driven-decisions/main-the-big-three/Logos/Temporal%20process.JPG)


In [77]:
df= pd.read_csv (os.getcwd()+'/Data/'+'GoldenDataFrame.csv')
df_study=df[[c for c in df.columns if c in columns]]

In [78]:
moveddf=pd.DataFrame()
dat=df_study.loc[df_study.loc[:, 'Country'] == clist[0]]
clmns=dat.columns.values.tolist()
dat.replace([np.inf, -np.inf], np.nan, inplace=True)
tempdiffs=['GDP (current US$)+1','GDP (current US$)+2','GDP (current US$)+3','GDP (current US$)+5','GDP (current US$)+8','GDP (current US$)+13','GDP (current US$)+21']
cors=dat.corr('spearman')
for f in range(0, len(tempdiffs)):
    cors.loc[:,'R^2 Spearman'] = cors[tempdiffs[f]]**2
    cors.loc[:,'Indicator']=cors.index
    corscolumn=cors[['Indicator','R^2 Spearman',tempdiffs[f]]]
    corscolumn=corscolumn.loc[corscolumn.loc[:, 'R^2 Spearman'] >= dcor.value]
    ids=corscolumn.groupby('Indicator')['R^2 Spearman'].transform(max)==corscolumn['R^2 Spearman']
    corscolumn[ids]
    maxs_df=pd.DataFrame(corscolumn[ids])
    maxs_df['Behaviour']=np.where(maxs_df[tempdiffs[f]]>0, 'Positive', 'Negative')
    maxs_df['Country']= clist[0]
    maxs_df[['Variable','Moved']]=tempdiffs[f].split('+')
    maxs_df.drop(tempdiffs[f],axis=1,inplace=True)
    maxs_df.drop(columns='Variable',inplace=True)
    maxs_df=maxs_df.reset_index(drop=True)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Year'].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='GDP (current US$)'].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator'].isin(tempdiffs)].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Unnamed: 0'].index)
    maxs_df=maxs_df.sort_values(by = 'R^2 Spearman',ascending = False)
    moveddf=pd.concat((moveddf,maxs_df),axis=0)
ids=moveddf.groupby('Indicator')['R^2 Spearman'].transform(max)==moveddf['R^2 Spearman']
moveddf[ids]
moveddf=pd.DataFrame(moveddf[ids])
temporaldf=moveddf

In [79]:
for i in range(1,len(clist)):
    dat=df_study.loc[df_study.loc[:, 'Country'] == clist[i]]
    clmns=dat.columns.values.tolist()
    dat.replace([np.inf, -np.inf], np.nan, inplace=True)
    tempdiffs=['GDP (current US$)+1','GDP (current US$)+2','GDP (current US$)+3','GDP (current US$)+5','GDP (current US$)+8','GDP (current US$)+13','GDP (current US$)+21']
    cors=dat.corr('spearman')
    moveddf=pd.DataFrame()
    for f in range(0, len(tempdiffs)):
        cors.loc[:,'R^2 Spearman'] = cors[tempdiffs[f]]**2
        cors.loc[:,'Indicator']=cors.index
        corscolumn=cors[['Indicator','R^2 Spearman',tempdiffs[f]]]
        corscolumn=corscolumn.loc[corscolumn.loc[:, 'R^2 Spearman'] >= dcor.value]
        ids=corscolumn.groupby('Indicator')['R^2 Spearman'].transform(max)==corscolumn['R^2 Spearman']
        corscolumn[ids]
        maxs_df=pd.DataFrame(corscolumn[ids])
        maxs_df['Behaviour']=np.where(maxs_df[tempdiffs[f]]>0, 'Positive', 'Negative')
        maxs_df['Country']= clist[i]
        maxs_df[['Variable','Moved']]=tempdiffs[f].split('+')
        maxs_df.drop(tempdiffs[f],axis=1,inplace=True)
        maxs_df.drop(columns='Variable',inplace=True)
        maxs_df=maxs_df.reset_index(drop=True)
        maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Year'].index)
        maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator'].isin(tempdiffs)].index)
        maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Unnamed: 0'].index)
        maxs_df=maxs_df.sort_values(by = 'R^2 Spearman',ascending = False)
        moveddf=pd.concat((moveddf,maxs_df),axis=0)
    ids=moveddf.groupby('Indicator')['R^2 Spearman'].transform(max)==moveddf['R^2 Spearman']
    moveddf[ids]
    moveddf=pd.DataFrame(moveddf[ids])
    temporaldf=pd.concat((temporaldf,moveddf),axis=0)
temporaldf

Unnamed: 0,Indicator,R^2 Spearman,Behaviour,Country,Moved
4,Adjusted savings: education expenditure (curre...,0.824514,Positive,DEU,1
35,General government final consumption expenditu...,0.822917,Positive,DEU,1
45,Individuals using the Internet (% of population),0.822917,Positive,DEU,1
78,Surface area (sq. km),0.821642,Positive,DEU,1
5,"Adolescent fertility rate (births per 1,000 wo...",0.817804,Negative,DEU,1
...,...,...,...,...,...
167,Pump price for diesel fuel (US$ per liter),0.787248,Negative,CHN,21
168,Pump price for gasoline (US$ per liter),0.787248,Negative,CHN,21
30,Chemicals (% of value added in manufacturing),0.781385,Negative,CHN,21
162,Proportion of population pushed below the $1.9...,0.769754,Positive,CHN,21


In [80]:
alist=['Indicator','R^2 Spearman','Behaviour','Country','Type']
forcomparassion=corrtable[alist]

quarterfinal=pd.concat((temporaldf,forcomparassion),axis=0)
quarterfinal.fillna('Does not apply',inplace=True)
quarterfinal

Unnamed: 0,Indicator,R^2 Spearman,Behaviour,Country,Moved,Type
4,Adjusted savings: education expenditure (curre...,0.824514,Positive,DEU,1,Does not apply
35,General government final consumption expenditu...,0.822917,Positive,DEU,1,Does not apply
45,Individuals using the Internet (% of population),0.822917,Positive,DEU,1,Does not apply
78,Surface area (sq. km),0.821642,Positive,DEU,1,Does not apply
5,"Adolescent fertility rate (births per 1,000 wo...",0.817804,Negative,DEU,1,Does not apply
...,...,...,...,...,...,...
5965,Prevalence of anemia among women of reproducti...,0.793298,Negative,CHN,Does not apply,Logarithmic
5966,Logistics performance index: Ability to track ...,0.788049,Positive,CHN,Does not apply,Logarithmic
5967,Logistics performance index: Competence and qu...,0.788049,Positive,CHN,Does not apply,Logarithmic
5968,Out-of-pocket expenditure (% of current health...,0.782573,Negative,CHN,Does not apply,Logarithmic


In [81]:
quarterfinal.to_csv(os.getcwd()+'/Data/Quarterfinal.csv')

In [82]:
quarterfinal= pd.read_csv (os.getcwd()+'/Data/'+'Quarterfinal.csv')

In [83]:
#To understand better the data, we categorize it (Area label and Primary/Secondary).
categories= pd.read_excel (os.getcwd()+'/Data/'+'dfindicators - Copy.xlsx')

In [84]:
categories.rename(columns={'Type':'Group'}, inplace=True)
quarterfinal.drop(columns=('Unnamed: 0'), inplace=True)
clist=quarterfinal['Country'].unique()

To sum up, we compare both of the highest correlations and decide which is best, by comparing the R^2. So the process of searching for correlations is as follows.

![](https://raw.githubusercontent.com/devonfw-forge/python-data-driven-decisions/main-the-big-three/Logos/Final%20table%20process.JPG)

In [85]:
final=pd.DataFrame()
for i in range(0,len(clist)):
    dat=quarterfinal.loc[quarterfinal.loc[:, 'Country'] == clist[i]]
    ids=dat.groupby('Indicator')['R^2 Spearman'].transform(max)==dat['R^2 Spearman']
    dat[ids]
    semifinal=pd.DataFrame(dat[ids])
    final=pd.concat((final,semifinal), axis=0)
final_indicators_list=categories.Indicator.unique()
final['Continent']=final['Country'].map(all_countries)
final=final.loc[final.loc[:, 'Indicator'].isin(np.array(final_indicators_list))]
final=pd.merge(final,categories, left_on='Indicator',right_on='Indicator')

columnssf=final.Indicator.to_list()
columnsf=np.unique(columnssf)
powerind=[]
for i in range(0, len(columnsf)):
    powerind.append(columnssf.count(columnsf[i]))
final_indicators = pd.DataFrame(list(zip(columnsf,powerind)), columns = ['Indicator','Number of times repeated'])

final=pd.merge(final,final_indicators, left_on='Indicator',right_on='Indicator')
final

Unnamed: 0,Indicator,R^2 Spearman,Behaviour,Country,Moved,Type,Continent,Number of times repeated_x,Group,Level,Number of times repeated_y
0,"Adolescent fertility rate (births per 1,000 wo...",0.817804,Negative,DEU,1,Does not apply,Europe,22,Demography,secondary,40
1,"Adolescent fertility rate (births per 1,000 wo...",0.759333,Negative,SWE,3,Does not apply,Europe,22,Demography,secondary,40
2,"Adolescent fertility rate (births per 1,000 wo...",0.936963,Negative,GBR,13,Does not apply,Europe,22,Demography,secondary,40
3,"Adolescent fertility rate (births per 1,000 wo...",0.786708,Negative,HRV,8,Does not apply,Europe,22,Demography,secondary,40
4,"Adolescent fertility rate (births per 1,000 wo...",0.924056,Negative,POL,21,Does not apply,Europe,22,Demography,secondary,40
...,...,...,...,...,...,...,...,...,...,...,...
6640,Completeness of birth registration (%),0.839228,Positive,PER,5,Does not apply,Latam,11,Demoraphy,secondary,21
6641,Completeness of birth registration (%),0.792184,Negative,VEN,1,Does not apply,Latam,11,Demoraphy,secondary,21
6642,Completeness of birth registration (%),0.840648,Positive,COL,2,Does not apply,Latam,11,Demoraphy,secondary,21
6643,Completeness of birth registration (%),0.963636,Positive,PAN,21,Does not apply,Latam,11,Demoraphy,secondary,21


The former table final collects the indicators that have shown a high correlation with the GDP (according to the criteria established, see above). It provides detail of each case specifying the numerical value of the correlation (R^2 Spearman); the alignment with the GDP (Behaviour); the country and continent where it applies; if there is any temporal displacement; the group of interest it belongs to and the number of times repeated.
Moreover, if we focus over the column of "Number of times repeated". It reflects the number of times a high correlation of each indicator is repeated over the whole sample (aggregating all countries).

In addition we have used the `itables` library to if wanted any one can search through the table as they wish.

In [86]:
from itables import init_notebook_mode,show

init_notebook_mode(all_interactive=False)

import itables.options as opt

show(pd.DataFrame(final),dom="ftpr")

opt.lengthMenu=[5,10,20,50,100,200]

opt.classes=["display","nowrap"]

show(final,columnDefs=[{"className": "dt-left", "targets": "_all"}],column_filters="header")

<IPython.core.display.Javascript object>



Unnamed: 0,Indicator,R^2 Spearman,Behaviour,Country,Moved,Type,Continent,Number of times repeated_x,Group,Level,Number of times repeated_y
Loading... (need help?),,,,,,,,,,,




None,Indicator,R^2 Spearman,Behaviour,Country,Moved,Type,Continent,Number of times repeated_x,Group,Level,Number of times repeated_y
Loading... (need help?),,,,,,,,,,,


Finally, there is a recount by the different columns, *Behaviour*, *Relationship* and *Time moved*.

In [87]:
columnssf=final.Behaviour.to_list()
columnsf=np.unique(columnssf)
powerind=[]
for i in range(0, len(columnsf)):
    powerind.append(columnssf.count(columnsf[i]))

final_Behaviour = pd.DataFrame(list(zip(columnsf,powerind)), columns = ['Type of behaviour','Number of times repeated'])
final_Behaviour=final_Behaviour.sort_values(by = 'Number of times repeated',ascending = False)
final_Behaviour

Unnamed: 0,Type of behaviour,Number of times repeated
1,Positive,4723
0,Negative,1922


In [88]:
columnssf=final.Type.to_list()
columnsf=np.unique(columnssf)
powerind=[]
for i in range(0, len(columnsf)):
    powerind.append(columnssf.count(columnsf[i]))

final_type = pd.DataFrame(list(zip(columnsf,powerind)), columns = ['Type of relationship','Number of times repeated'])
final_type=final_type.sort_values(by = 'Number of times repeated',ascending = False)
final_type.drop([2], inplace=True)
final_type

Unnamed: 0,Type of relationship,Number of times repeated
3,Linear,1294
4,Logarithmic,687
1,Cubic,419
0,Cuadratic,253


In the table above, we can observe which is the most common type of relationship.

In [89]:
columnssf=final.Moved.to_list()
columnsf=np.unique(columnssf)
powerind=[]
for i in range(0, len(columnsf)):
    powerind.append(columnssf.count(columnsf[i]))

final_moved = pd.DataFrame(list(zip(columnsf,powerind)), columns = ['Time moved','Number of times repeated'])
final_moved=final_moved.sort_values(by = 'Number of times repeated',ascending = False)
final_moved

Unnamed: 0,Time moved,Number of times repeated
7,Does not apply,2653
6,8,902
5,5,741
1,13,607
0,1,516
3,21,474
2,2,414
4,3,338


As said, the temporary displacement shows us correlation of an indicator with the GDP comparing different periods of time. In this table, we have made a recount to see which range of displacement is most repeated. As we can see, no displacement is the most common case, what means correlation year-by-year is majority in our sample.

# Visualization of results

In [90]:
#Needed imports
import matplotlib.pyplot as plt

import seaborn as sns

import plotly.express as px

import plotly.graph_objects as go


Now that we’ve loaded the data, we can start right away to create widgets. These widgets are essentials to add interactivity to our visualizations.

In [92]:
#COUNT HISTOGRAM: Graph for seeing the frequency of the primary indicators for each region.
selected_primary=final.loc[final['Level']=='primary']

fig=px.histogram(selected_primary,x='Indicator',histfunc="count",height=700,color='Group',text_auto=True,title="Indicators frequency by continents").update_xaxes(categoryorder="total descending")
continents=list(selected_primary['Continent'].unique())
buttons = []

for continent in continents:
    selected_primary_c = selected_primary.loc[(selected_primary['Continent'] == continent)]
    fig_continent = px.histogram(selected_primary_c, x='Indicator', color='Group').update_xaxes(categoryorder="total descending")
    buttons.append(
        dict(
            label=continent,
            method="update",
            args=[
                {
                    "x": [trace['x'] for trace in fig_continent._data],
                }
            ]
        )
    )

fig.update_layout(
    updatemenus=[
        dict(
            type="dropdown",
            direction="down",
            showactive=True,
            buttons=buttons
        )
    ]
)



fig.show()

In [93]:
#TREEMAP: Graph for seeing the correlation of each indicator in each country.
#To resolve args[], but it works.

fig2=px.treemap(selected_primary,path=['Indicator','Continent','Country'],values='R^2 Spearman',color='R^2 Spearman',color_continuous_scale='RdBu')

indics=list(selected_primary['Indicator'].unique())
buttons = []
for indic in indics:
    selected_primary_i = selected_primary.loc[(selected_primary['Indicator'] == indic)]
    fig_indicator = px.treemap(selected_primary_i,path=['Indicator','Continent','Country'],values='R^2 Spearman',color='R^2 Spearman')
    buttons.append(
        dict(
            label=indic,
            method="update",
            args=[]
        )
    )

fig2.update_layout(
    updatemenus=[
        dict(
            type="dropdown",
            direction="down",
            showactive=True,
            buttons=buttons
        )
    ]
)

In [94]:
#TREEMAP: Graph for seeing the correlation of each indicator in each country.

#By indicator.

fig2=px.treemap(selected_primary,path=['Indicator','Continent','Country'],values='R^2 Spearman',color='R^2 Spearman',color_continuous_scale='RdBu_r')

fig2.show()

#By continent

fig2d=px.treemap(selected_primary,path=['Continent','Country','Group'],values='R^2 Spearman',color='R^2 Spearman',color_continuous_scale='RdBu_r')
fig2d.show()

#By group of interest

fig2dd=px.treemap(selected_primary,path=['Group','Continent','Country'],values='R^2 Spearman',color='R^2 Spearman',color_continuous_scale='RdBu_r')

fig2dd.show()

In [95]:

lastdf=final.dropna()

lastdf.isna().sum()

from ipywidgets import widgets, HBox

out = widgets.Output()

def output_treemap(path):

    figA = px.treemap(lastdf, path=path, values='R^2 Spearman',

                  color='R^2 Spearman',

                  color_continuous_scale='RdBu_r')

    figA.update_layout(margin = dict(t=50, l=25, r=25, b=25))

    out.clear_output(wait=True)

    with out:

        figA.show()

In [96]:
path_1_dropdown = widgets.Dropdown(

    options=['Continent', 'Country','Group','Indicator'],

    value='Continent',

    description='Path 1',

    disabled=False,

)

path_2_dropdown = widgets.Dropdown(

    options=['Continent', 'Country','Group','Indicator'],

    value='Country',

    description='Path 2',

    disabled=False,

)

path_3_dropdown=widgets.Dropdown(

    options=['Continent', 'Country','Group','Indicator'],

    value='Group',

    description='Path 3',

    disabled=False,

)

path_4_dropdown=widgets.Dropdown(

    options=['Continent', 'Country','Group','Indicator'],

    value='Indicator',

    description='Path 4',

    disabled=False,

)

ok_button = widgets.Button(

    description='Ready to go',

    disabled=False,

    button_style='info', # 'success', 'info', 'warning', 'danger' or ''

    icon='check' # (FontAwesome names without the `fa-` prefix)

)

ok_button.on_click(lambda _: output_treemap([px.Constant("World"), path_1_dropdown.value, path_2_dropdown.value,path_3_dropdown.value,path_4_dropdown.value]))

In [97]:
display(HBox([path_1_dropdown, path_2_dropdown, path_3_dropdown,path_4_dropdown,ok_button]), out)

HBox(children=(Dropdown(description='Path 1', options=('Continent', 'Country', 'Group', 'Indicator'), value='C…

Output()

# Spurious correlations

However, there is an interesting phenomenon, in some cases there are correlations that have a high coefficient and also an adequate graphics, but they do not make sense in the analysis, these are called **spurious correlations**. Here are some examples:


<img src="https://raw.githubusercontent.com/devonfw-forge/python-data-driven-decisions/main-the-big-three/Logos/chart%20(2).jpeg" width="800" height="400" />

<img src="https://raw.githubusercontent.com/devonfw-forge/python-data-driven-decisions/main-the-big-three/Logos/chart%20(1).jpeg" width="800" height="400" />


Therefore, we have to be carefull with our results because correlation does not imply causation, it may have happened by chance that both variables are really similar.
So, after some thought and experimenting, we have developed a method that will allow us to find out if the correlation has happened by chance or if there is really a correlation.

This method, consists of the following:


Firstly we have classified the indicators by a group, which can be one of the following: *A&D*, *Agriculture*, *Demography*, *Economy*, *Employment*, *Environment*, *Equality*, *Exports*, *Health*, *Mortality* or *Principal*. Moreover inside each group we have also assigned each varible a level, *primary* or *secondary*, depending on their level of relevance. For example we have consider more relevant the *Population in the largest city* over the *Rural population*, thus the first will be *primary* and the latter *secondary*, while both are part of the *Demography* group. 

With this set, we can expose our hypothesis:

"It is assumed that the correlation in the primary indicators can be caused by randomness, however if this correlation also appears in the secondary indicators for at least X% of the countries that appears in the primaries (Pareto's rule), we can suppose that there is no randomness affecting each group. Furthermore, the first assumption has to happen in Y% of the secondary indicators to avoid any fortuity." 

This hypothesis can be used in a global level, all the countries, or in the different regions. 

For example if, X and Y =80% a primary indicator is repeated 20 times the secondary indicators must have repeated 18 times. And if there are 10 secondary indicators, it has to happen for, at least, 8 indicators.

Finally, we will finish with two possible errors of 20%, which combined (20%*20%), leaves us with a 4% of margin of error, which is lower than the wildly spread of 5%.


In [None]:
limita=widgets.FloatSlider( value=0.8, min=0, max=1.0, step=0.05, description='% of primary indicators:', disabled=False, continuous_update=False, orientation='horizontal', readout=True)
limitb=widgets.FloatSlider( value=0.8, min=0, max=1.0, step=0.05, description='% of secondary indicators:', disabled=False, continuous_update=False, orientation='horizontal', readout=True)
display(limita,limitb)

FloatSlider(value=0.8, continuous_update=False, description='% of primary indicators:', max=1.0, step=0.05)

FloatSlider(value=0.8, continuous_update=False, description='% of secondary indicators:', max=1.0, step=0.05)

In [None]:
a=(1-limita.value)*(1-limitb.value)*100
print('The margin of error in this combination is:',a)

The margin of error in this combination is: 3.9999999999999982


Once we have selected both percentages and we agree with the margin of error, we can proceed to put into action our method. Firstly, we filter the primary indicators and get the minimun of times thatthe secondary have to be repeated.

In [None]:
selected_p=categories.loc[categories['Level']=='primary']
minprimary=selected_p.groupby('Group').min()
minprimary['Min']=round(minprimary['Number of times repeated']*limita.value)
minprimary.drop(columns=['Indicator','Number of times repeated','Level'], inplace=True)
minprimary


Unnamed: 0_level_0,Min
Group,Unnamed: 1_level_1
A&D,13.0
Agriculture,10.0
Demography,24.0
Economy,26.0
Employment,11.0
Environment,13.0
Equality,14.0
Exports,28.0
Health,9.0
Internet,27.0


In [None]:
grouplist=minprimary.index.to_list()

Then, we test if the repetition are accomplished. 

- H_0 data has correlation buy has not happened by randomness.
- H_1 data has correlation due to randomness 

**If Number of times repeated the secondary indicator < Minimun per group, then H_0 denied and H_1 accepted**

In [None]:
secondary=final.loc[final['Level']=='secondary']
secondary=pd.merge(secondary,minprimary, left_on='Group',right_on='Group')
secondaryp=secondary.loc[:,['Group','Min']]
Global_Count=secondaryp.groupby('Group').count()
Global_Count.rename(columns={'Min':'Global Count'},inplace=True)
secondary['H_0']=np.where(secondary['Number of times repeated']-secondary['Min']>0,'Not Discarded', 'Denied')
seco=secondary.groupby(['H_0','Group']).count()
sec=seco.loc['Not Discarded']
secondarycount=sec.drop(columns=['Indicator','R^2 Spearman','Behaviour','Country','Moved','Type','Continent','Number of times repeated','Level'])
secondarycount.rename(columns={'Min':'Secondary Count'},inplace=True)
continentlist=final['Continent'].unique()
namescontinents=['European', 'North African', 'Asian', 'Pair', 'Persian', 'South African', 'Latino-American']
finalcount=pd.merge(Global_Count,secondarycount, left_on='Group',right_on='Group')
finalcount['Does it have some global casuallity implied?']=np.where(finalcount['Secondary Count']/finalcount['Global Count']>limitb.value,'No', 'Yes')
finalcount['% of count (Global)']=finalcount['Secondary Count']/finalcount['Global Count']*100
finalcount.drop(columns=['Global Count','Secondary Count'],inplace=True)
finalcount

Unnamed: 0_level_0,Does it have some global casuallity implied?,% of count (Global)
Group,Unnamed: 1_level_1,Unnamed: 2_level_1
Agriculture,No,100.0
Demography,No,100.0
Economy,Yes,71.039604
Employment,No,100.0
Environment,No,100.0
Equality,No,100.0
Exports,No,82.379863
Health,No,100.0
Internet,Yes,37.190083
Mortality,No,100.0


As we can see, in a Global situation for the groups that have a **NO**, we do not need to worry about casualities, however for the rest of the groups correlation can still be a great indicator as a basis for decision making, if we carefully analyze the variables and found some sort of real relationship between them.   

In [None]:
for i in range(0,len(continentlist)):
    apfinal=final.loc[final['Continent']==continentlist[i]]
    
    selected_p=categories.loc[categories['Level']=='primary']
    minprimary=selected_p.groupby('Group').min()
    minprimary['Min']=round(minprimary['Number of times repeated']*limita.value)
    minprimary.drop(columns=['Indicator','Number of times repeated','Level'], inplace=True)

    grouplist=minprimary.index.to_list()

    secondary=apfinal.loc[apfinal['Level']=='secondary']
    secondary=pd.merge(secondary,minprimary, left_on='Group',right_on='Group')

    secondaryp=secondary.loc[:,['Group','Min']]
    Global_Count=secondaryp.groupby('Group').count()
    Global_Count.rename(columns={'Min':'Global Count'},inplace=True)

    secondary['H_0']=np.where(secondary['Number of times repeated']-secondary['Min']>0,'Not Discarded', 'Denied')
    seco=secondary.groupby(['H_0','Group']).count()
    sec=seco.loc['Not Discarded']
    secondarycount=sec.drop(columns=['Indicator','R^2 Spearman','Behaviour','Country','Moved','Type','Continent','Number of times repeated','Level'])
    secondarycount.rename(columns={'Min':'Secondary Count'},inplace=True)

    apfinalcount=pd.merge(Global_Count,secondarycount, left_on='Group',right_on='Group')
    apfinalcount['Does it have some '+namescontinents[i]+' casuallity implied?']=np.where(apfinalcount['Secondary Count']/apfinalcount['Global Count']>limitb.value,'No', 'Yes')
    apfinalcount['% of count ('+namescontinents[i]+')']=apfinalcount['Secondary Count']/apfinalcount['Global Count']*100
    apfinalcount.drop(columns=['Global Count','Secondary Count'],inplace=True)
    finalcount=pd.merge(finalcount,apfinalcount, left_on='Group',right_on='Group')

finalcount

Unnamed: 0_level_0,Does it have some global casuallity implied?,% of count (Global),Does it have some European casuallity implied?,% of count (European),Does it have some North African casuallity implied?,% of count (North African),Does it have some Asian casuallity implied?,% of count (Asian),Does it have some Pair casuallity implied?,% of count (Pair),Does it have some Persian casuallity implied?,% of count (Persian),Does it have some South African casuallity implied?,% of count (South African),Does it have some Latino-American casuallity implied?,% of count (Latino-American)
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Agriculture,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0
Demography,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0
Economy,Yes,71.039604,Yes,75.799087,Yes,71.812081,Yes,72.794118,Yes,70.348837,Yes,66.968326,Yes,70.542636,Yes,66.666667
Employment,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0
Environment,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0
Equality,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0
Exports,No,82.379863,Yes,77.142857,No,80.701754,No,86.842105,No,84.0,No,86.075949,No,83.333333,No,83.333333
Health,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0
Internet,Yes,37.190083,Yes,47.619048,Yes,35.294118,Yes,26.315789,Yes,42.857143,Yes,36.842105,Yes,37.5,Yes,28.571429
Mortality,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0


Finally, we can observe into much more detail the differnt regions that we defined before, and for groups that have a **NO**, we do not need to worry about casualities. Meanwhile, for the rest of the groups correlation can still be a great indicator as a basis for decision making, if we carefully analyze the variables and found some sort of real relationship between them.   