# EXTRACTION

Import libraries and functions.

In [1]:
import pandas as pd
import numpy as np
import glob
import os
import warnings
warnings.filterwarnings("ignore")
import functools as ft
import ipywidgets as widgets
from ipywidgets import Layout
from ipywidgets import interact, interact_manual
import plotly.express as px
from scipy import stats
from scipy.stats import shapiro
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from pandas.api.types import is_numeric_dtype

Firstly we load the database from World Data Bank that has been downloaded and extracted in the *Data extraction* notebook. We acquire it from the predetermined path that is on our computer.

In [2]:
df= pd.read_csv (os.getcwd()+'/Data/'+'WDIData.csv')
df

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,16.936004,17.337896,17.687093,18.140971,18.491344,18.825520,19.272212,19.628009,,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.499471,6.680066,6.859110,7.016238,7.180364,7.322294,7.517191,7.651598,,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,37.855399,38.046781,38.326255,38.468426,38.670044,38.722783,38.927016,39.042839,,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.794160,32.001027,33.871910,38.880173,40.261358,43.061877,44.270860,45.803485,,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,18.663502,17.633986,16.464681,24.531436,25.345111,27.449908,29.641760,30.404935,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384365,Zimbabwe,ZWE,Women who believe a husband is justified in be...,SG.VAW.REFU.ZS,,,,,,,...,,,14.500000,,,,,,,
384366,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,,,,,,,...,,,3.700000,,,,5.418352,,,
384367,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,,,...,,33.500000,32.400000,,,,33.658057,,,
384368,Zimbabwe,ZWE,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,,,,,,,...,59.200000,59.400000,59.500000,59.700000,59.900000,60.000000,60.200000,60.400000,,


# INTEGRATION

Moreover, to work more comfortably, we remove those columns not useful for us, as *Country Name* and *Indicator Code*, since with the *Country Code*, *Value* and the *Indicator Name* we have the relevant information.

In [3]:
df.drop(columns=["Country Name","Indicator Code"], axis=1, inplace=True)

FILTER 1: BY COUNTRY

From the almost two hundred countries we have information about in the worldwide database, we have decided to study 50 of them, making an initial grouping by geographical and economical similiarities. With this, we can keep in our dataframe the selected countries.

Criteria for grouping:
- Europe: Germany, France, Sweden, United Kingdom, Spain, Croatia, Poland, Greece, Austria and Netherlands.

*Interesting countries of the European continent that can reflect events such as the Brexit process, the 2008 crisis or their historical strength.*
- Persian Gulf: Iraq, Qatar, United Arab Emirates, Arabia Saudita, Azerbayan, Yemen, Yemen Democratic and Oman.

*Countries located in the Persian Gulf, which have a similar economy based mainly on petrol and social structures.*
- North Africa: Algeria, Egiypt, Lybia, Israel, Turkey and Morroco.

*Countries of the african continent that are middle developed and with high mobility of people and goods.*
- South Africa: Senegal, South Africa, Liberia, Mozambique, Cameroon, Nigeria and Ghana.

*Countries of the south and central africa that are mainly subdeveloped and considered some of the poorest countries worldwide; but, on the contrary, one of them is highly developed.*
- Asia: Bangladesh, India, Vietnam, Thailand, Indonesia, Philipines and Korea (South).

*Converted in the last decades in the manufacturing of the world, they are subdeveloped countries with high population and childhood.*
- Latin America: Mexico, Brasil, Argentina, Peru, Venezuela, Colombia, Chile, Panama and Costa Rica.

*Countries located in same continet and some with singular political structures.* 
- Pair: USA and China.

*Although these countries seem to be confronted between them, they have been the top two most growing worlwide, despite the fact that culturally and economically they are completely distant.*


In [4]:
europe_list=['DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD']
persian_list=['IRQ','QAT','ARE','SAU','AZE','YEM','YDR','OMN']
naf_list=['DZA','EGY','LBY','ISR','TUR','MAR']
saf_list=['SEN','ZAF','LBR','MOZ','CMR','NGA','GHA']
asia_list=['BGD','IND','VNM','THA','IDN','PHL','KOR']
latam_list=['MEX','BRA','ARG','PER','VEN','COL','CHL','PAN','CRI']
two_list=['USA','CHN']
country_list=europe_list+persian_list+naf_list+saf_list+asia_list+latam_list+two_list 

In [5]:
df1=df.loc[df['Country Code'].isin(country_list)]

Now we transpose the rows of years into the columns.

In [6]:
df2=(df1.set_index(["Country Code", "Indicator Name"]).stack().reset_index(name='Value').rename(columns={'level_2':'Date'}))
df2

Unnamed: 0,Country Code,Indicator Name,Date,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1769874,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1769875,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1769876,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1769877,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


FILTER 2: BY YEAR

Our time range covers from 1960 to 2021. However, the record is not uniform and complete for all areas and indicators. We can appreaciate that specially in the first years of the last century, so many data is missing, then it makes no sense to study it. Besides, for the year 2021 many data is also lacking. Therefore, we would delimit our study between 1990 and 2020.

In [7]:
df2[['Date']] = df2[['Date']].astype(int)

In [8]:
df2.dtypes

Country Code       object
Indicator Name     object
Date                int32
Value             float64
dtype: object

In [9]:
df3 = df2[df2['Date'] > 1989]
df3

Unnamed: 0,Country Code,Indicator Name,Date,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1769874,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1769875,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1769876,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1769877,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


In [10]:
BronzeDataFrame=df3

-----

# NORMALIZATION

Taking as reference both works of https://www.pluralsight.com/guides/cleaning-up-data-from-outliers and https://careerfoundry.com/en/blog/data-analytics/how-to-find-outliers/, for normalizing our data we need to start computing the outliers and removing them from our dataframe. As there is not a direct function of pandas that performs this step, it´s been step-by-step code, where we begin with the computation of the quartiles, then the IQR (Inter Quartile Range) and finally the upper and lower limit.

##### IQR explanation

The interquartile range (IQR) measures the spread of the middle half of your data. It is the range for the middle 50% of your sample. Use the IQR to assess the variability where most of your values lie. Larger values indicate that the central portion of your data spread out further. Conversely, smaller values show that the middle values cluster more tightly.

To visualize the interquartile range, imagine dividing your data into quarters. Statisticians refer to these quarters as quartiles and label them from low to high as Q1, Q2, Q3, and Q4. The lowest quartile (Q1) covers the smallest quarter of values in your dataset. The upper quartile (Q4) comprises the highest quarter of values. The interquartile range is the middle half of the data that lies between the upper and lower quartiles. In other words, the interquartile range includes the 50% of data points that are above Q1 and below Q4.

When measuring variability, statisticians prefer using the interquartile range instead of the full data range because extreme values and outliers affect it less. Typically, use the IQR with a measure of central tendency, such as the median, to understand your data’s center and spread. This combination creates a fuller picture of your data’s distribution.

Therefore it is being utilized to get rid of all the outliers that may come from errors when creating the data or from unexpected years.

Firstly, we compute the first quartile (Q1=25%) and the third quartile (Q3=75%). For that, we have grouped the data by country code and indicator name, so we get the Q1 and Q3 values for each indicator in each geographical area. 

In [11]:
grouped=BronzeDataFrame.groupby(['Country Code','Indicator Name'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000226920436D0>

In [12]:
Q1=BronzeDataFrame.groupby(['Country Code','Indicator Name']).quantile(0.25)
Q3=BronzeDataFrame.groupby(['Country Code','Indicator Name']).quantile(0.75)
IQR=Q3-Q1
IQR

Unnamed: 0_level_0,Unnamed: 1_level_0,Date,Value
Country Code,Indicator Name,Unnamed: 2_level_1,Unnamed: 3_level_1
ARE,Access to clean fuels and technologies for cooking (% of population),10.0,0.00
ARE,"Access to clean fuels and technologies for cooking, rural (% of rural population)",10.0,0.00
ARE,"Access to clean fuels and technologies for cooking, urban (% of urban population)",10.0,0.00
ARE,Access to electricity (% of population),15.0,0.00
ARE,"Access to electricity, rural (% of rural population)",15.0,0.00
...,...,...,...
ZAF,Women who believe a husband is justified in beating his wife when she refuses sex with him (%),0.0,0.00
ZAF,Women who were first married by age 15 (% of women ages 20-24),9.0,0.15
ZAF,Women who were first married by age 18 (% of women ages 20-24),9.0,2.15
ZAF,Women's share of population ages 15+ living with HIV (%),15.0,4.90


Once we got the quartiles, we compute the upper and lower limit, with a basic mathematical expression.

In [13]:
lower_limit=Q1 - 1.5 * IQR
lower=lower_limit.drop(['Date'],axis=1)
lower.rename(columns={"Value":"Lower limit"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Lower limit
Country Code,Indicator Name,Unnamed: 2_level_1
ARE,Access to clean fuels and technologies for cooking (% of population),100.000
ARE,"Access to clean fuels and technologies for cooking, rural (% of rural population)",100.000
ARE,"Access to clean fuels and technologies for cooking, urban (% of urban population)",100.000
ARE,Access to electricity (% of population),100.000
ARE,"Access to electricity, rural (% of rural population)",100.000
...,...,...
ZAF,Women who believe a husband is justified in beating his wife when she refuses sex with him (%),1.000
ZAF,Women who were first married by age 15 (% of women ages 20-24),0.625
ZAF,Women who were first married by age 18 (% of women ages 20-24),1.375
ZAF,Women's share of population ages 15+ living with HIV (%),49.850


In [14]:
upper_limit=Q3 + 1.5 * IQR
upper=upper_limit.drop(['Date'],axis=1)
upper.rename(columns={"Value":"Upper limit"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Upper limit
Country Code,Indicator Name,Unnamed: 2_level_1
ARE,Access to clean fuels and technologies for cooking (% of population),100.000
ARE,"Access to clean fuels and technologies for cooking, rural (% of rural population)",100.000
ARE,"Access to clean fuels and technologies for cooking, urban (% of urban population)",100.000
ARE,Access to electricity (% of population),100.000
ARE,"Access to electricity, rural (% of rural population)",100.000
...,...,...
ZAF,Women who believe a husband is justified in beating his wife when she refuses sex with him (%),1.000
ZAF,Women who were first married by age 15 (% of women ages 20-24),1.225
ZAF,Women who were first married by age 18 (% of women ages 20-24),9.975
ZAF,Women's share of population ages 15+ living with HIV (%),69.450


Thirdly, we join the three tables we have (main dataframe, upper limit and lower limit) by matching country code and indicator name..

In [15]:
dfs = [BronzeDataFrame,lower,upper]
df_joined = ft.reduce(lambda left, right: pd.merge(left, right, on=['Country Code','Indicator Name']), dfs)
df_joined

Unnamed: 0,Country Code,Indicator Name,Date,Value_x,Value_y,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1,97.0,101.0
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3,97.0,101.0
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8,97.0,101.0
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0,97.0,101.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2,97.0,101.0
...,...,...,...,...,...,...
1225413,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0,-50.0,350.0
1225414,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0,-50.0,350.0
1225415,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0,-50.0,350.0
1225416,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0,-50.0,350.0


In [16]:
list(df_joined)

['Country Code', 'Indicator Name', 'Date', 'Value_x', 'Value_y', 'Value']

We rename the columns of the new table, as the columns headers are not saved after the joining. 

In [17]:
renamed=df_joined.set_axis(['Country','Indicator','Year', 'Real value', 'Lower value', 'Upper value'], axis=1, inplace=False)
renamed

Unnamed: 0,Country,Indicator,Year,Real value,Lower value,Upper value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1,97.0,101.0
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3,97.0,101.0
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8,97.0,101.0
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0,97.0,101.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2,97.0,101.0
...,...,...,...,...,...,...
1225413,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0,-50.0,350.0
1225414,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0,-50.0,350.0
1225415,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0,-50.0,350.0
1225416,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0,-50.0,350.0


Now that we have the table correctly defined, we remove from our dataframe the values that are outside our range, as it means that they are outliers.

In [18]:
sin_outliers=renamed.loc[~((renamed['Real value']<renamed['Lower value']) | (renamed['Real value']>renamed['Upper value']))]
sin_outliers

Unnamed: 0,Country,Indicator,Year,Real value,Lower value,Upper value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1,97.0,101.0
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3,97.0,101.0
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8,97.0,101.0
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0,97.0,101.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2,97.0,101.0
...,...,...,...,...,...,...
1225413,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0,-50.0,350.0
1225414,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0,-50.0,350.0
1225415,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0,-50.0,350.0
1225416,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0,-50.0,350.0


From the data above, we can perceive that our data comes down from 1225418  rows to 1189068, so 36.350  were outliers. The next steps are to order and display data better, removing those columns that we just do not need and pivoting the rows and columns. 

In [19]:
df_limpio=sin_outliers.drop(['Lower value','Upper value'],axis=1)
df_limpio

Unnamed: 0,Country,Indicator,Year,Real value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1225413,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1225414,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1225415,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1225416,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


In [20]:
cols=df_limpio['Indicator'].unique().tolist()

In [21]:
SilverDataFrame=df_limpio.set_index(["Country", "Year"]).pivot(columns="Indicator", values="Real value").reset_index()
SilverDataFrame

Indicator,Country,Year,ARI treatment (% of children under 5 taken to a health provider),Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),...,Women who believe a husband is justified in beating his wife (any of five reasons) (%),Women who believe a husband is justified in beating his wife when she argues with him (%),Women who believe a husband is justified in beating his wife when she burns the food (%),Women who believe a husband is justified in beating his wife when she goes out without telling him (%),Women who believe a husband is justified in beating his wife when she neglects the children (%),Women who believe a husband is justified in beating his wife when she refuses sex with him (%),Women who were first married by age 15 (% of women ages 20-24),Women who were first married by age 18 (% of women ages 20-24),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
0,ARE,1990,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,18.8,100.0
1,ARE,1991,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,18.2,100.0
2,ARE,1992,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,19.4,100.0
3,ARE,1993,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,20.0,100.0
4,ARE,1994,,,,,100.000000,100.000000,100.000000,,...,,,,,,,,,20.0,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1531,ZAF,2017,,85.2,64.6,94.20,84.400002,76.738983,88.373024,69.218491,...,,,,,,,,,63.3,100000.0
1532,ZAF,2018,,85.7,65.5,94.65,84.699997,77.168495,88.518814,,...,,,,,,,,,63.7,92000.0
1533,ZAF,2019,,86.3,65.5,94.90,85.000000,77.611824,88.662704,,...,,,,,,,,,64.1,85000.0
1534,ZAF,2020,,86.8,65.9,95.20,84.385536,75.264854,88.806267,,...,,,,,,,,,64.4,79000.0


On the other hand, another big stone of normalizations is to nan/null values, which we have in all variables.

In [22]:
SilverDataFrame.isna().sum().sum()

1016628

As we can observe, we have lots of missing data, and as there is no optimal way to fullfill these values, thus, we will test some to arrive to the optimal method for our data set.

First, we need to create some lists so our loops work.

In [23]:
df=SilverDataFrame
europe_list=['DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD']
persian_list=['IRQ','QAT','ARE','SAU','AZE','YEM','OMN']
naf_list=['DZA','EGY','LBY','ISR','TUR','MAR']
saf_list=['SEN','ZAF','LBR','MOZ','CMR','NGA','GHA']
asia_list=['BGD','IND','VNM','THA','IDN','PHL','KOR']
latam_list=['MEX','BRA','ARG','PER','VEN','COL','CHL','PAN','CRI']
two_list=['USA','CHN']
country_list=europe_list+persian_list+naf_list+saf_list+asia_list+latam_list+two_list


We are attempting the linear interpolation, which is achieved by geometrically rendering a straight line between two adjacent points on a graph or plane.

In [24]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

685787

Here we attempt the backward filling, filling the previous cell with future values.

In [25]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.fillna(method='bfill')
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.fillna(method='bfill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

498648

Here we will attempt the forward filling, which concists of filling the next cell with previous values.

In [26]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.fillna(method='ffill')
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.fillna(method='ffill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

685787

The linear interpolation a form of interpolation, which involves the generation of new values based on an existing set of values. Linear interpolation is achieved by geometrically rendering a straight line between two adjacent points on a graph or plane. Whereas the backwards filling, will help us to arrive to those values which have not been fullfilled with the linear interpolation.

And as none of the methods have worked out correctly, independently, we are going to mix them, to achieve a better result.

In [27]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
datc=datc.fillna(method='ffill')
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    datc=datc.fillna(method='ffill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

KeyboardInterrupt: 

In [None]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
datc=datc.fillna(method='bfill')
data=datc

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    datc=datc.fillna(method='bfill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

310048

And finally, mixing the three methods all together.

In [None]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
datf=datc.fillna(method='bfill')
datr=datf.fillna(method='ffill')
data=datr

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    datc=datc.fillna(method='bfill')
    datc=datc.fillna(method='ffill')
    data=pd.concat((data, datc), axis = 0)
data.isna().sum().sum()

310048

##### Conclusion

Therefore, the preferred method for the Nan values´ treatment that we are going to develop is a mix, between the linear interpolation and backwards filling.

In [None]:
dat=df.loc[df.loc[:, 'Country'] == country_list[0]]
datc=dat.interpolate(method="linear")
datf=datc.fillna(method='bfill')
datr=datf.fillna(method='ffill')
data=datr

for i in range(1,len(country_list)):
    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]
    datc=dat.interpolate(method="linear")
    datc=datc.fillna(method='bfill')
    datc=datc.fillna(method='ffill')
    data=pd.concat((data, datc), axis = 0)
data

Indicator,Country,Year,ARI treatment (% of children under 5 taken to a health provider),Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),...,Women who believe a husband is justified in beating his wife (any of five reasons) (%),Women who believe a husband is justified in beating his wife when she argues with him (%),Women who believe a husband is justified in beating his wife when she burns the food (%),Women who believe a husband is justified in beating his wife when she goes out without telling him (%),Women who believe a husband is justified in beating his wife when she neglects the children (%),Women who believe a husband is justified in beating his wife when she refuses sex with him (%),Women who were first married by age 15 (% of women ages 20-24),Women who were first married by age 18 (% of women ages 20-24),Women's share of population ages 15+ living with HIV (%),Young people (ages 15-24) newly infected with HIV
352,DEU,1990,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
353,DEU,1991,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
354,DEU,1992,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
355,DEU,1993,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
356,DEU,1994,,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,...,,,,,,,,,19.1,500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,,73.2,55.2,86.2,100.0,100.0,100.0,80.229118,...,,,,,,,,,,
252,CHN,2018,,75.6,59.0,87.4,100.0,100.0,100.0,80.229118,...,,,,,,,,,,
253,CHN,2019,,77.6,61.9,88.4,100.0,100.0,100.0,80.229118,...,,,,,,,,,,
254,CHN,2020,,79.4,65.2,89.4,100.0,100.0,100.0,80.229118,...,,,,,,,,,,


Now, we will drop the columns which have over 308 missing values (20%), because the absence of data creates an unreliable source.

In [None]:
for i in range(0, len(cols)):
    if data[cols[i]].isna().sum()>308:
        del(data[cols[i]])
        print(cols[i])
data

Adults (ages 15+) and children (ages 0-14) newly infected with HIV
Adults (ages 15-49) newly infected with HIV
Antiretroviral therapy coverage (% of people living with HIV)
Antiretroviral therapy coverage for PMTCT (% of pregnant women living with HIV)
ARI treatment (% of children under 5 taken to a health provider)
Average transaction cost of sending remittances to a specific country (%)
Average working hours of children, study and work, ages 7-14 (hours per week)
Average working hours of children, study and work, female, ages 7-14 (hours per week)
Average working hours of children, study and work, male, ages 7-14 (hours per week)
Average working hours of children, working only, ages 7-14 (hours per week)
Average working hours of children, working only, female, ages 7-14 (hours per week)
Average working hours of children, working only, male, ages 7-14 (hours per week)
Bank capital to assets ratio (%)
Bank liquid reserves to bank assets ratio (%)
Bank nonperforming loans to total gross

Indicator,Country,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,Urban population growth (annual %),Urban population living in areas where elevation is below 5 meters (% of total population),"Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100)
352,DEU,1990,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,1.056365,3.031776,5.700000,4.800000,5.170000,92.050003,89.110001,90.330002,54.519497,71.250
353,DEU,1991,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,0.934908,3.029378,5.700000,4.800000,5.170000,92.050003,89.110001,90.330002,54.519497,71.250
354,DEU,1992,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,0.884470,3.026980,5.740000,4.930000,5.270000,91.910004,88.589996,89.970001,54.519497,71.250
355,DEU,1993,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,0.843967,3.024581,5.850000,5.040000,5.380000,91.669998,88.250000,89.669998,56.039631,71.250
356,DEU,1994,100.0,100.0,100.0,100.0,100.0,100.0,98.133621,98.704536,...,0.636245,3.022183,5.610000,5.200000,5.370000,91.629997,87.739998,89.370003,57.559764,71.250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,73.2,55.2,86.2,100.0,100.0,100.0,80.229118,76.364731,...,2.739664,4.203002,46.530001,42.949999,44.530002,52.430000,54.169998,53.400002,21.358958,75.625
252,CHN,2018,75.6,59.0,87.4,100.0,100.0,100.0,80.229118,76.364731,...,2.503401,4.203002,45.720001,41.940001,43.609999,53.209999,55.139999,54.290001,21.358958,75.625
253,CHN,2019,77.6,61.9,88.4,100.0,100.0,100.0,80.229118,76.364731,...,2.290177,4.203002,44.760000,40.819999,42.540000,54.150002,56.279999,55.340000,21.358958,75.625
254,CHN,2020,79.4,65.2,89.4,100.0,100.0,100.0,80.229118,76.364731,...,2.066047,4.203002,44.760000,40.819999,42.540000,54.150002,56.279999,55.340000,21.358958,75.625


Afterwards, we have scaled the values. The process we have followed consists of dividing each value by the initial value (1990) of each variable. Each result is expressed as the growth with respect to the initial data.

In [None]:
columns=data.columns.values.tolist()

In [None]:
datae=data.loc[data.loc[:, 'Country'] == country_list[0]]
for i in range(2,len(columns)):
    a=columns[i]
    datae[a]=datae[a]/datae.iloc[0,i]
datau=datae

In [None]:
for u in range(1,len(country_list)):
    datae=data.loc[data.loc[:, 'Country'] == country_list[u]]   
    for i in range(2,len(columns)):
        a=columns[i]
        datae[a]=datae[a]/datae.iloc[0,i]
    datau=pd.concat((datau, datae), axis = 0)
datau

Indicator,Country,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,Urban population growth (annual %),Urban population living in areas where elevation is below 5 meters (% of total population),"Vulnerable employment, female (% of female employment) (modeled ILO estimate)","Vulnerable employment, male (% of male employment) (modeled ILO estimate)","Vulnerable employment, total (% of total employment) (modeled ILO estimate)","Wage and salaried workers, female (% of female employment) (modeled ILO estimate)","Wage and salaried workers, male (% of male employment) (modeled ILO estimate)","Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100)
352,DEU,1990,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
353,DEU,1991,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.885023,0.999209,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000
354,DEU,1992,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.837277,0.998418,1.007018,1.027083,1.019342,0.998479,0.994164,0.996015,1.000000,1.000000
355,DEU,1993,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.798935,0.997627,1.026316,1.050000,1.040619,0.995872,0.990349,0.992693,1.027882,1.000000
356,DEU,1994,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.602296,0.996836,0.984210,1.083333,1.038685,0.995437,0.984626,0.989372,1.055765,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,1.742857,2.348936,1.261156,1.030696,1.048707,1.0,1.257169,1.272562,...,0.635700,1.362819,0.707250,0.617275,0.656204,1.555786,1.984976,1.768798,8.610987,1.273684
252,CHN,2018,1.800000,2.510638,1.278713,1.030696,1.048707,1.0,1.257169,1.272562,...,0.580879,1.362819,0.694938,0.602759,0.642647,1.578932,2.020520,1.798278,8.610987,1.273684
253,CHN,2019,1.847619,2.634043,1.293343,1.030696,1.048707,1.0,1.257169,1.272562,...,0.531403,1.362819,0.680347,0.586663,0.626879,1.606825,2.062294,1.833057,8.610987,1.273684
254,CHN,2020,1.890476,2.774468,1.307974,1.030696,1.048707,1.0,1.257169,1.272562,...,0.479397,1.362819,0.680347,0.586663,0.626879,1.606825,2.062294,1.833057,8.610987,1.273684


In [None]:
shifted=pd.DataFrame()
for i in range(0,len(country_list)):
    dat=datau.loc[datau.loc[:, 'Country'] == country_list[i]]
    dat['GDP (current US$)+1']=dat['GDP (current US$)'].shift(periods=1)
    dat['GDP (current US$)+2']=dat['GDP (current US$)'].shift(periods=2)
    dat['GDP (current US$)+3']=dat['GDP (current US$)'].shift(periods=3)
    dat['GDP (current US$)+5']=dat['GDP (current US$)'].shift(periods=5)
    dat['GDP (current US$)+8']=dat['GDP (current US$)'].shift(periods=8)
    dat['GDP (current US$)+13']=dat['GDP (current US$)'].shift(periods=13)
    dat['GDP (current US$)+21']=dat['GDP (current US$)'].shift(periods=21)
    shifted=pd.concat((shifted, dat), axis = 0)
shifted

Indicator,Country,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,"Wage and salaried workers, total (% of total employment) (modeled ILO estimate)","Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100),GDP (current US$)+1,GDP (current US$)+2,GDP (current US$)+3,GDP (current US$)+5,GDP (current US$)+8,GDP (current US$)+13,GDP (current US$)+21
352,DEU,1990,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.000000,,,,,,,
353,DEU,1991,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.000000,1.000000,,,,,,
354,DEU,1992,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.996015,1.000000,1.000000,1.054905,1.000000,,,,,
355,DEU,1993,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.992693,1.027882,1.000000,1.203142,1.054905,1.000000,,,,
356,DEU,1994,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,0.989372,1.055765,1.000000,1.169136,1.203142,1.054905,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,1.742857,2.348936,1.261156,1.030696,1.048707,1.0,1.257169,1.272562,...,1.768798,8.610987,1.273684,31.129362,30.653486,29.029938,23.644292,14.137706,5.418606,2.393592
252,CHN,2018,1.800000,2.510638,1.278713,1.030696,1.048707,1.0,1.257169,1.272562,...,1.798278,8.610987,1.273684,34.114284,31.129362,30.653486,26.521259,16.868589,6.334809,2.664772
253,CHN,2019,1.847619,2.634043,1.293343,1.030696,1.048707,1.0,1.257169,1.272562,...,1.833057,8.610987,1.273684,38.504955,34.114284,31.129362,29.029938,20.926519,7.626636,2.851657
254,CHN,2020,1.890476,2.774468,1.307974,1.030696,1.048707,1.0,1.257169,1.272562,...,1.833057,8.610987,1.273684,39.572189,38.504955,34.114284,30.653486,23.644292,9.838617,3.031657


In [None]:
data=shifted

For the next part of analyzing this data, we think it is gonna be interesting to have it classify by the categories of the Country groups defined before, to which we call "Continent". This category is useful as it groups the nations with similar economies or geographical proximity, so we can extract common conclusions from them.

We create a dictionary with the regions and the countries included in each one. Where we will relate the countries and regions so then we can apply the .map function and arrive to the final dataframe.

In [6]:
countries_by_region = {
    "Europe": ('DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD'),
    'Persian Gulf': ('IRQ','QAT','ARE','SAU','AZE','YEM','YDR','OMN'),
    'North Africa':('DZA','EGY','LBY','ISR','TUR','MAR'),
    'South Africa':('SEN','ZAF','LBR','MOZ','CMR','NGA','GHA'),
    'Asia':('BGD','IND','VNM','THA','IDN','PHL','KOR'),
    'Latam':('MEX','BRA','ARG','PER','VEN','COL','CHL','PAN','CRI'),
    'Pair':('USA','CHN')
    }

all_countries = {}
for region in countries_by_region.keys():
  for country in countries_by_region[region]:
    all_countries[country] = region

print(all_countries)

{'DEU': 'Europe', 'FRA': 'Europe', 'SWE': 'Europe', 'GBR': 'Europe', 'ESP': 'Europe', 'HRV': 'Europe', 'POL': 'Europe', 'GRC': 'Europe', 'AUT': 'Europe', 'NLD': 'Europe', 'IRQ': 'Persian Gulf', 'QAT': 'Persian Gulf', 'ARE': 'Persian Gulf', 'SAU': 'Persian Gulf', 'AZE': 'Persian Gulf', 'YEM': 'Persian Gulf', 'YDR': 'Persian Gulf', 'OMN': 'Persian Gulf', 'DZA': 'North Africa', 'EGY': 'North Africa', 'LBY': 'North Africa', 'ISR': 'North Africa', 'TUR': 'North Africa', 'MAR': 'North Africa', 'SEN': 'South Africa', 'ZAF': 'South Africa', 'LBR': 'South Africa', 'MOZ': 'South Africa', 'CMR': 'South Africa', 'NGA': 'South Africa', 'GHA': 'South Africa', 'BGD': 'Asia', 'IND': 'Asia', 'VNM': 'Asia', 'THA': 'Asia', 'IDN': 'Asia', 'PHL': 'Asia', 'KOR': 'Asia', 'MEX': 'Latam', 'BRA': 'Latam', 'ARG': 'Latam', 'PER': 'Latam', 'VEN': 'Latam', 'COL': 'Latam', 'CHL': 'Latam', 'PAN': 'Latam', 'CRI': 'Latam', 'USA': 'Pair', 'CHN': 'Pair'}


In [None]:
data['Continent']=data['Country'].map(all_countries)
Goldendataframe=data
Goldendataframe

Indicator,Country,Year,Access to clean fuels and technologies for cooking (% of population),"Access to clean fuels and technologies for cooking, rural (% of rural population)","Access to clean fuels and technologies for cooking, urban (% of urban population)",Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,"Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",Women Business and the Law Index Score (scale 1-100),GDP (current US$)+1,GDP (current US$)+2,GDP (current US$)+3,GDP (current US$)+5,GDP (current US$)+8,GDP (current US$)+13,GDP (current US$)+21,Continent
352,DEU,1990,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,,,,,,,,Europe
353,DEU,1991,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.000000,,,,,,,Europe
354,DEU,1992,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.000000,1.000000,1.054905,1.000000,,,,,,Europe
355,DEU,1993,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.027882,1.000000,1.203142,1.054905,1.000000,,,,,Europe
356,DEU,1994,1.000000,1.000000,1.000000,1.000000,1.000000,1.0,1.000000,1.000000,...,1.055765,1.000000,1.169136,1.203142,1.054905,,,,,Europe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,CHN,2017,1.742857,2.348936,1.261156,1.030696,1.048707,1.0,1.257169,1.272562,...,8.610987,1.273684,31.129362,30.653486,29.029938,23.644292,14.137706,5.418606,2.393592,Pair
252,CHN,2018,1.800000,2.510638,1.278713,1.030696,1.048707,1.0,1.257169,1.272562,...,8.610987,1.273684,34.114284,31.129362,30.653486,26.521259,16.868589,6.334809,2.664772,Pair
253,CHN,2019,1.847619,2.634043,1.293343,1.030696,1.048707,1.0,1.257169,1.272562,...,8.610987,1.273684,38.504955,34.114284,31.129362,29.029938,20.926519,7.626636,2.851657,Pair
254,CHN,2020,1.890476,2.774468,1.307974,1.030696,1.048707,1.0,1.257169,1.272562,...,8.610987,1.273684,39.572189,38.504955,34.114284,30.653486,23.644292,9.838617,3.031657,Pair


With that all, we export our dataframe all-in-one and by the continent category.

In [None]:
Goldendataframe.to_csv(os.getcwd()+'/Data/GoldenDataFrame.csv')

In [None]:
for region, data in Goldendataframe.groupby('Continent'):
   data.to_csv(os.getcwd()+'/Data/{}.csv'.format(region))

#### CATEGORISATION OF VARIABLES FOR A DEEPER STUDY

First of all, for having better treatment, al variables of a dataframe have been pivoting in a same column.

In [None]:
columns_golden=list(Goldendataframe.columns)
del columns_golden[0:2]

In [None]:
Categorization=Goldendataframe.set_index(['Country','Year', 'Continent']).stack().reset_index()
Categorization['Short indicator']=Categorization['Indicator']
Categorization

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity (% of population)
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural (% of rural popul..."
...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP (current US$)+3
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP (current US$)+5
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP (current US$)+8
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP (current US$)+13


There are some indicators which show same results through different units. In our study we are going to work only with those expressed in US$. 

The links used to learn about these functions have been:

https://www.geeksforgeeks.org/how-to-drop-rows-that-contain-a-specific-string-in-pandas/ 

https://www.statology.org/pandas-drop-rows-that-contain-string/ 

In [None]:
import re
discard=["annual % growth","constant 2015 US[$]","% of GNI","constant LCU","current LCU"]
Categorization2=Categorization[~Categorization['Short indicator'].str.contains('|'.join(discard))]
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity (% of population)
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural (% of rural popul..."
...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP (current US$)+3
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP (current US$)+5
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP (current US$)+8
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP (current US$)+13


To check previous step.

In [None]:
#Categorization2.apply(lambda row: row.astype(str).str.contains('US').any(), axis=1)

Now we are going to structure the indicators in a same way to work better. The first step consist of making a new column that shows the units of each variable. Units are showed inside the parenthesis of the indicator name.

In [None]:
Categorization2['Units']=Categorization2['Short indicator'].str.extract(' (\(.*\))')
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population)
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population)
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population)
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity (% of population),(% of population)
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural (% of rural popul...",(% of rural population)
...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP (current US$)+3,(current US$)
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP (current US$)+5,(current US$)
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP (current US$)+8,(current US$)
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP (current US$)+13,(current US$)


Now, short indicator refers to the original indicator name without the units. The extracted information from the origin column has been deleted.

In [None]:
#Delete the extracted information from origin column. 
Categorization2['Short indicator']=Categorization2['Short indicator'].str.replace(r" (\(.*\))","")
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population)
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population)
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population)
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population)
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural",(% of rural population)
...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$)
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$)
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$)
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$)


In some cases there are extra information in indicators name. The information of the second parenthesis is extracted as a new column too.

In [None]:
two_parent=Categorization2[Categorization2['Short indicator'].str.contains('Contributing family workers')]
two_parent

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units
151,DEU,1990,Europe,"Contributing family workers, female (% of fema...",1.000000,"Contributing family workers, female",(% of female employment) (modeled ILO estimate)
152,DEU,1990,Europe,"Contributing family workers, male (% of male e...",1.000000,"Contributing family workers, male",(% of male employment) (modeled ILO estimate)
153,DEU,1990,Europe,"Contributing family workers, total (% of total...",1.000000,"Contributing family workers, total",(% of total employment) (modeled ILO estimate)
1151,DEU,1991,Europe,"Contributing family workers, female (% of fema...",1.000000,"Contributing family workers, female",(% of female employment) (modeled ILO estimate)
1152,DEU,1991,Europe,"Contributing family workers, male (% of male e...",1.000000,"Contributing family workers, male",(% of male employment) (modeled ILO estimate)
...,...,...,...,...,...,...,...
1538535,CHN,2020,Pair,"Contributing family workers, male (% of male e...",0.252894,"Contributing family workers, male",(% of male employment) (modeled ILO estimate)
1538536,CHN,2020,Pair,"Contributing family workers, total (% of total...",0.329385,"Contributing family workers, total",(% of total employment) (modeled ILO estimate)
1539524,CHN,2021,Pair,"Contributing family workers, female (% of fema...",0.386565,"Contributing family workers, female",(% of female employment) (modeled ILO estimate)
1539525,CHN,2021,Pair,"Contributing family workers, male (% of male e...",0.252894,"Contributing family workers, male",(% of male employment) (modeled ILO estimate)


Moreover, there are some inidcators with an extra parenthesis adding some  more information. As this information isn't related with units, another column named as 'other specification' has been created.

In [None]:
Categorization2[['Units','Other specification']]=Categorization2['Units'].str.split("\) ", n=1,expand=True)
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units,Other specification
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population),
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population),
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population),
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population),
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural",(% of rural population),
...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$),
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$),
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$),
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$),


At the end of the variable name, separated by the last "," it is informing us about to which subgroup makes reference the variable. Thus, there are some indicators that have information divided for small groups. This information is shown as a new column named 'Subgroup'.

In [None]:
Categorization2[['Subgroup']]=Categorization2['Short indicator'].str.extract(',(?P<field>[^,]*?)$')
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units,Other specification,Subgroup
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population),,
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population),,rural
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population),,urban
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population),,
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,"Access to electricity, rural",(% of rural population),,rural
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$),,
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$),,
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$),,
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$),,


As before, information which is shown as a new column is deleted from the origin one.

In [None]:
Categorization2['Short indicator']=Categorization2['Short indicator'].str.replace(',(?P<field>[^,]*?)$',"")
Categorization2

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units,Other specification,Subgroup
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population),,
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population),,rural
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population),,urban
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population),,
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,Access to electricity,(% of rural population),,rural
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$),,
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$),,
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$),,
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$),,


All the indicators don't have these elements. So, a checking point is needed.

In [None]:
Categorization2['Subgroup']=Categorization2['Subgroup'].replace(['None'],['total'])
Categorization2['Subgroup']=Categorization2['Subgroup'].fillna('total')

There are some duplicate variables which should be removed too.

In [None]:
Categorization2.drop_duplicates(subset=['Country','Year','Short indicator','Continent','Subgroup'], keep='first')

Unnamed: 0,Country,Year,Continent,Indicator,0,Short indicator,Units,Other specification,Subgroup
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of population),,total
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of rural population),,rural
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,1.000000,Access to clean fuels and technologies for coo...,(% of urban population),,urban
3,DEU,1990,Europe,Access to electricity (% of population),1.000000,Access to electricity,(% of population),,total
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",1.000000,Access to electricity,(% of rural population),,rural
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,38.504955,GDP+3,(current US$),,total
1540359,CHN,2021,Pair,GDP (current US$)+5,31.129362,GDP+5,(current US$),,total
1540360,CHN,2021,Pair,GDP (current US$)+8,26.521259,GDP+8,(current US$),,total
1540361,CHN,2021,Pair,GDP (current US$)+13,12.731623,GDP+13,(current US$),,total


Reordering columns, categorization3 is our df after all these division in categories.

In [None]:
Categorization2.rename(columns={Categorization2.columns[4]:'Value'},inplace=True)
Categorization3=Categorization2[['Country','Year','Continent','Indicator','Short indicator','Value','Subgroup','Units','Other specification']]
Categorization3

Unnamed: 0,Country,Year,Continent,Indicator,Short indicator,Value,Subgroup,Units,Other specification
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,Access to clean fuels and technologies for coo...,1.000000,total,(% of population),
1,DEU,1990,Europe,Access to clean fuels and technologies for coo...,Access to clean fuels and technologies for coo...,1.000000,rural,(% of rural population),
2,DEU,1990,Europe,Access to clean fuels and technologies for coo...,Access to clean fuels and technologies for coo...,1.000000,urban,(% of urban population),
3,DEU,1990,Europe,Access to electricity (% of population),Access to electricity,1.000000,total,(% of population),
4,DEU,1990,Europe,"Access to electricity, rural (% of rural popul...",Access to electricity,1.000000,rural,(% of rural population),
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,GDP+3,38.504955,total,(current US$),
1540359,CHN,2021,Pair,GDP (current US$)+5,GDP+5,31.129362,total,(current US$),
1540360,CHN,2021,Pair,GDP (current US$)+8,GDP+8,26.521259,total,(current US$),
1540361,CHN,2021,Pair,GDP (current US$)+13,GDP+13,12.731623,total,(current US$),


------------------------------

#### CORRELATION STUDY

In [None]:
Categorization4=Categorization3.loc[Categorization3['Subgroup']=='total']
Categorization4

Unnamed: 0,Country,Year,Continent,Indicator,Short indicator,Value,Subgroup,Units,Other specification
0,DEU,1990,Europe,Access to clean fuels and technologies for coo...,Access to clean fuels and technologies for coo...,1.000000,total,(% of population),
3,DEU,1990,Europe,Access to electricity (% of population),Access to electricity,1.000000,total,(% of population),
6,DEU,1990,Europe,Account ownership at a financial institution o...,Account ownership at a financial institution o...,1.000000,total,(% of population ages 15+),
20,DEU,1990,Europe,Adjusted net national income (current US$),Adjusted net national income,1.000000,total,(current US$),
23,DEU,1990,Europe,Adjusted net national income per capita (curre...,Adjusted net national income per capita,1.000000,total,(current US$),
...,...,...,...,...,...,...,...,...,...
1540358,CHN,2021,Pair,GDP (current US$)+3,GDP+3,38.504955,total,(current US$),
1540359,CHN,2021,Pair,GDP (current US$)+5,GDP+5,31.129362,total,(current US$),
1540360,CHN,2021,Pair,GDP (current US$)+8,GDP+8,26.521259,total,(current US$),
1540361,CHN,2021,Pair,GDP (current US$)+13,GDP+13,12.731623,total,(current US$),


In [None]:
Categorization4.to_csv(os.getcwd()+'/Data/Categorization.csv')

In [None]:
Categorization4= pd.read_csv (os.getcwd()+'/Data/'+'Categorization.csv')

In [None]:
indicators_list=Categorization4['Indicator'].unique().tolist()

In [None]:
columns=indicators_list+['Country','Year','Continent']
clist=Categorization4['Country'].unique()
common=['Unnamed: 0','Country','Year']

In the following cell, we have defined a function that will allow us to calculate the different posibilities of relations: cuadratic, cubic and logaritmic.

In [None]:
def multcolumn(frame):
    for u in range(0, len(columns)-3):
        name=columns[u]+'.l'
        name2=columns[u]+'.^2'
        name3=columns[u]+'.^3'
        namelog=columns[u]+'.log'
        frame.loc[:,name2] = frame[columns[u]]**2
        frame.loc[:,name3] = frame[columns[u]]**3
        frame.loc[:,namelog] = np.log(frame[columns[u]])
        frame.rename(columns={columns[u]:name}, inplace=True)

Moreover, we want to know the correlation between all the variables, so to acomplish this, we have created the following loop, which will help us create a new dataframe where we will have: the *Indicator*, the *Type* of relation, the value of the *R^2*, its *Behaviour*, the *Country* and the *Continent*.

In [None]:
df= pd.read_csv (os.getcwd()+'/Data/'+'GoldenDataFrame.csv')
df_study=df[[c for c in df.columns if c in columns]]
df_study['GDP (current US$)']


0        1.000000
1        1.054905
2        1.203142
3        1.169136
4        1.244629
          ...    
1531    34.114284
1532    38.504955
1533    39.572189
1534    40.799246
1535    40.799246
Name: GDP (current US$), Length: 1536, dtype: float64

Firstly we are going to create two lists for the variables, which their p-value is under 0.05 for each correlation, so later on, we can calculate only the correlations of those variables.

In [None]:
multcolumn(df_study)

In [None]:
dat=df_study.loc[df_study.loc[:, 'Country'] == clist[0]]
listacorpe=[]
listacorsp=[]
clmns=dat.columns.values.tolist()
dat.replace([np.inf, -np.inf], np.nan, inplace=True)
for c in range(0, len(clmns)):
    if dat[clmns[c]].isna().sum()>=1:
        del(dat[clmns[c]])
pilares=dat.columns.values.tolist()
for u in range(0,len(pilares)):
    if is_numeric_dtype(dat[pilares[u]]):
        correlation, pvalue=pearsonr(dat[pilares[u]], dat['GDP (current US$).l'])
        if pvalue<=0.05:
            listacorpe.append(pilares[u])
        else:
            pass
        correlation, pvalue=spearmanr(dat[pilares[u]], dat['GDP (current US$).l'])
        if pvalue<=0.05:
            listacorsp.append(pilares[u])
        else:
            pass
    else:
        pass

Secondly, we need to calculate the correlation table for each country, therefore we use the basic function `corr()` which provides either the Pearson correlation table or the Spearman correlation table, as well as a filter for the countries.

In [None]:
dat=df_study.loc[df_study.loc[:, 'Country'] == clist[0]]

datp=dat[dat.columns[dat.columns.isin(listacorpe)]]
corp=datp.corr('pearson')

datsp=dat[dat.columns[dat.columns.isin(listacorsp)]]
cors=datsp.corr('spearman')

Then we calculate the coefficient of determination which is the correlation squared.

In [None]:
corp.loc[:,'R^2 Pearson'] = corp['GDP (current US$).l']**2

cors.loc[:,'R^2 Spearman'] = cors['GDP (current US$).l']**2

Moreover, we are going to create new columns to know which *Indicator* are we talking about, and the *Type* of correlation that is being analyzed (linear, cuadratic, cubic or logarithmic)

In [None]:
corp.loc[:,'Indicator']=corp.index
corp[['Indicator','Type']]=corp.Indicator.str.split('.',1, expand=True)

cors.loc[:,'Indicator']=cors.index
cors[['Indicator','Type']]=cors.Indicator.str.split('.',1, expand=True)

Now, we can apply the filter we have consider that is enough, R^2>=0.75 to filter the correlations.

In [None]:
corpcolumn=corp[['Indicator','R^2 Pearson','Type','GDP (current US$).l']]
corpcolumn=corpcolumn.loc[corpcolumn.loc[:, 'R^2 Pearson'] >= 0.75]

corscolumn=cors[['Indicator','R^2 Spearman','Type','GDP (current US$).l']]
corscolumn=corscolumn.loc[corscolumn.loc[:, 'R^2 Spearman'] >= 0.75]

Furthermore, we add all the columns that we have created into a data frame, thanks to the following cell.

In [None]:
idp=corpcolumn.groupby('Indicator')['R^2 Pearson'].transform(max)==corpcolumn['R^2 Pearson']
corpcolumn[idp]
maxp_df=pd.DataFrame(corpcolumn[idp])

ids=corscolumn.groupby('Indicator')['R^2 Spearman'].transform(max)==corscolumn['R^2 Spearman']
corscolumn[ids]
maxs_df=pd.DataFrame(corscolumn[ids])

Here, we conmute the values, by expressions. For example if the correlation is positive, we want in the new column called *Behaviour* the word Positive. Or for the *Type* column if the greatest correlation is cuadratic we want to put, Cuadratic. We also add the country.

In [None]:
maxp_df['Behaviour']=np.where(maxp_df['GDP (current US$).l']>0, 'Positive', 'Negative')
maxp_df['Type']=maxp_df['Type'].replace(['l','^2','^3','log'],['Linear','Cuadratic','Cubic','Logarithmic'])
maxp_df['Country']= clist[0]

maxs_df['Behaviour']=np.where(maxs_df['GDP (current US$).l']>0, 'Positive', 'Negative')
maxs_df['Type']=maxs_df['Type'].replace(['l','^2','^3','log'],['Linear','Cuadratic','Cubic','Logarithmic'])
maxs_df['Country']= clist[0]

In addition, we also drop the columns which do not add any value, as *GDP*, *Year*, and *Unnamed:0*.

In [None]:
maxp_df.drop("GDP (current US$).l",axis=1,inplace=True)
maxp_df=maxp_df.reset_index(drop=True)
maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='Year'].index)
maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='GDP (current US$)'].index)
maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='Unnamed: 0'].index)

maxs_df.drop("GDP (current US$).l",axis=1,inplace=True)
maxs_df=maxs_df.reset_index(drop=True)
maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Year'].index)
maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='GDP (current US$)'].index)
maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Unnamed: 0'].index)
maxs_df=maxs_df.sort_values(by = 'R^2 Spearman',ascending = False)

And finally we sort the values in descending order by the column *R^2 Pearson*.

In [None]:
maxp_df_deu=maxp_df.sort_values(by = 'R^2 Pearson',ascending = False)
pearsondf= maxp_df_deu
spearmandf=maxs_df

So, we can do it with all the countries and create just one dataframe.

In [None]:
pearsondf
spearmandf
for i in range(1,len(clist)):
    dat=df_study.loc[df_study.loc[:, 'Country'] == clist[i]]
    listacorpe=[]
    listacorsp=[]
    clmns=dat.columns.values.tolist()
    dat.replace([np.inf, -np.inf], np.nan, inplace=True)
    for c in range(0, len(clmns)):
        if dat[clmns[c]].isna().sum()>=1:
            del(dat[clmns[c]])
    pilares=dat.columns.values.tolist()
    for u in range(0,len(pilares)):
        if is_numeric_dtype(dat[pilares[u]]):
            correlation, pvalue=pearsonr(dat[pilares[u]], dat['GDP (current US$).l'])
            if pvalue<=0.05:
                listacorpe.append(pilares[u])
            else:
                pass
            correlation, pvalue=spearmanr(dat[pilares[u]], dat['GDP (current US$).l'])
            if pvalue<=0.05:
                listacorsp.append(pilares[u])
            else:
                pass
        else:
            pass
    
    dat=df_study.loc[df_study.loc[:, 'Country'] == clist[i]]

    datp=dat[dat.columns[dat.columns.isin(listacorpe)]]
    corp=datp.corr('pearson')

    datsp=dat[dat.columns[dat.columns.isin(listacorsp)]]
    cors=datsp.corr('spearman')


    corp.loc[:,'R^2 Pearson'] = corp['GDP (current US$).l']**2

    cors.loc[:,'R^2 Spearman'] = cors['GDP (current US$).l']**2


    corp.loc[:,'Indicator']=corp.index
    corp[['Indicator','Type']]=corp.Indicator.str.split('.',1, expand=True)

    cors.loc[:,'Indicator']=cors.index
    cors[['Indicator','Type']]=cors.Indicator.str.split('.',1, expand=True)


    corpcolumn=corp[['Indicator','R^2 Pearson','Type','GDP (current US$).l']]
    corpcolumn=corpcolumn.loc[corpcolumn.loc[:, 'R^2 Pearson'] >= 0.75]
    
    corscolumn=cors[['Indicator','R^2 Spearman','Type','GDP (current US$).l']]
    corscolumn=corscolumn.loc[corscolumn.loc[:, 'R^2 Spearman'] >= 0.75]


    idp=corpcolumn.groupby('Indicator')['R^2 Pearson'].transform(max)==corpcolumn['R^2 Pearson']
    corpcolumn[idp]
    maxp_df=pd.DataFrame(corpcolumn[idp])

    ids=corscolumn.groupby('Indicator')['R^2 Spearman'].transform(max)==corscolumn['R^2 Spearman']
    corscolumn[ids]
    maxs_df=pd.DataFrame(corscolumn[ids])


    maxp_df['Behaviour']=np.where(maxp_df['GDP (current US$).l']>0, 'Positive', 'Negative')
    maxp_df['Type']=maxp_df['Type'].replace(['l','^2','^3','log'],['Linear','Cuadratic','Cubic','Logarithmic'])
    maxp_df['Country']= clist[i]

    maxs_df['Behaviour']=np.where(maxs_df['GDP (current US$).l']>0, 'Positive', 'Negative')
    maxs_df['Type']=maxs_df['Type'].replace(['l','^2','^3','log'],['Linear','Cuadratic','Cubic','Logarithmic'])
    maxs_df['Country']= clist[i]


    maxp_df.drop("GDP (current US$).l",axis=1,inplace=True)
    maxp_df=maxp_df.reset_index(drop=True)
    maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='Year'].index)
    maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='GDP (current US$)'].index)
    maxp_df = maxp_df.drop(maxp_df[maxp_df['Indicator']=='Unnamed: 0'].index)

    maxs_df.drop("GDP (current US$).l",axis=1,inplace=True)
    maxs_df=maxs_df.reset_index(drop=True)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Year'].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='GDP (current US$)'].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Unnamed: 0'].index)
    maxs_df=maxs_df.sort_values(by = 'R^2 Spearman',ascending = False)


    maxp_df=maxp_df.sort_values(by = 'R^2 Pearson',ascending = False)
    pearsondf=pd.concat((pearsondf, maxp_df), axis = 0)
    spearmandf=pd.concat((spearmandf, maxs_df), axis = 0)

corrtable=spearmandf.merge(pearsondf, left_on=('Indicator', 'Country','Type','Behaviour'), right_on=('Indicator', 'Country','Type','Behaviour'))
display(corrtable)

Unnamed: 0,Indicator,R^2 Spearman,Type,Behaviour,Country,R^2 Pearson
0,Adjusted net national income (current US$),0.996519,Linear,Positive,DEU,0.999141
1,Gross value added at basic prices (GVA) (curre...,0.996519,Linear,Positive,DEU,0.999851
2,GNI (current US$),0.996337,Linear,Positive,DEU,0.999362
3,Gross national expenditure (current US$),0.990490,Linear,Positive,DEU,0.997181
4,Final consumption expenditure (current US$),0.988302,Linear,Positive,DEU,0.996540
...,...,...,...,...,...,...
5948,Prevalence of anemia among women of reproducti...,0.793298,Logarithmic,Negative,CHN,0.752969
5949,Logistics performance index: Ability to track ...,0.788049,Logarithmic,Positive,CHN,0.926947
5950,Logistics performance index: Competence and qu...,0.788049,Logarithmic,Positive,CHN,0.890297
5951,Out-of-pocket expenditure (% of current health...,0.782573,Logarithmic,Negative,CHN,0.929883


Finally, a table has been created showing the number of times a variable has a high relationship in our 48 countries. These that appear many times will be interesting for us to draw conclusions. Then, we will checck if they are primary or seconday variable type.

In [None]:
columnssf=corrtable.Indicator.to_list()
columnsf=np.unique(columnssf)

In [None]:
powerind=[]
for i in range(0, len(columnsf)):
    powerind.append(columnssf.count(columnsf[i]))

df_indicators = pd.DataFrame(list(zip(columnsf,powerind)), columns = ['Indicator','Number of times repeated'])
df_indicators=df_indicators.sort_values(by = 'Number of times repeated',ascending = False)
df_indicators
display(df_indicators)

Unnamed: 0,Indicator,Number of times repeated
127,GDP per capita (current US$),46
176,"Industry (including construction), value added...",46
132,GNI (current US$),46
152,Households and NPISHs Final consumption expend...,45
108,Final consumption expenditure (current US$),45
...,...,...
368,Time required to enforce a contract (days),1
246,"Net capital account (BoP, current US$)",1
244,Natural gas rents (% of GDP),1
95,Expenditure on tertiary education (% of govern...,1


In [None]:
#To get list of all number of times repeated.
#from IPython.display import HTML

#HTML(df_indicators.to_html(index=False))

In [None]:
df= pd.read_csv (os.getcwd()+'/Data/'+'GoldenDataFrame.csv')
df_study=df[[c for c in df.columns if c in columns]]

In [None]:
moveddf=pd.DataFrame()
dat=df_study.loc[df_study.loc[:, 'Country'] == clist[0]]
clmns=dat.columns.values.tolist()
dat.replace([np.inf, -np.inf], np.nan, inplace=True)
tempdiffs=['GDP (current US$)+1','GDP (current US$)+2','GDP (current US$)+3','GDP (current US$)+5','GDP (current US$)+8','GDP (current US$)+13','GDP (current US$)+21']
cors=dat.corr('spearman')
for f in range(0, len(tempdiffs)):
    cors.loc[:,'R^2 Spearman'] = cors[tempdiffs[f]]**2
    cors.loc[:,'Indicator']=cors.index
    corscolumn=cors[['Indicator','R^2 Spearman',tempdiffs[f]]]
    corscolumn=corscolumn.loc[corscolumn.loc[:, 'R^2 Spearman'] >= 0.75]
    ids=corscolumn.groupby('Indicator')['R^2 Spearman'].transform(max)==corscolumn['R^2 Spearman']
    corscolumn[ids]
    maxs_df=pd.DataFrame(corscolumn[ids])
    maxs_df['Behaviour']=np.where(maxs_df[tempdiffs[f]]>0, 'Positive', 'Negative')
    maxs_df['Country']= clist[0]
    maxs_df[['Variable','Moved']]=tempdiffs[f].split('+')
    maxs_df.drop(tempdiffs[f],axis=1,inplace=True)
    maxs_df.drop(columns='Variable',inplace=True)
    maxs_df=maxs_df.reset_index(drop=True)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Year'].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='GDP (current US$)'].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator'].isin(tempdiffs)].index)
    maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Unnamed: 0'].index)
    maxs_df=maxs_df.sort_values(by = 'R^2 Spearman',ascending = False)
    moveddf=pd.concat((moveddf,maxs_df),axis=0)
ids=moveddf.groupby('Indicator')['R^2 Spearman'].transform(max)==moveddf['R^2 Spearman']
moveddf[ids]
moveddf=pd.DataFrame(moveddf[ids])
temporaldf=moveddf

In [None]:
for i in range(1,len(clist)):
    dat=df_study.loc[df_study.loc[:, 'Country'] == clist[i]]
    clmns=dat.columns.values.tolist()
    dat.replace([np.inf, -np.inf], np.nan, inplace=True)
    tempdiffs=['GDP (current US$)+1','GDP (current US$)+2','GDP (current US$)+3','GDP (current US$)+5','GDP (current US$)+8','GDP (current US$)+13','GDP (current US$)+21']
    cors=dat.corr('spearman')
    moveddf=pd.DataFrame()
    for f in range(0, len(tempdiffs)):
        cors.loc[:,'R^2 Spearman'] = cors[tempdiffs[f]]**2
        cors.loc[:,'Indicator']=cors.index
        corscolumn=cors[['Indicator','R^2 Spearman',tempdiffs[f]]]
        corscolumn=corscolumn.loc[corscolumn.loc[:, 'R^2 Spearman'] >= 0.75]
        ids=corscolumn.groupby('Indicator')['R^2 Spearman'].transform(max)==corscolumn['R^2 Spearman']
        corscolumn[ids]
        maxs_df=pd.DataFrame(corscolumn[ids])
        maxs_df['Behaviour']=np.where(maxs_df[tempdiffs[f]]>0, 'Positive', 'Negative')
        maxs_df['Country']= clist[i]
        maxs_df[['Variable','Moved']]=tempdiffs[f].split('+')
        maxs_df.drop(tempdiffs[f],axis=1,inplace=True)
        maxs_df.drop(columns='Variable',inplace=True)
        maxs_df=maxs_df.reset_index(drop=True)
        maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Year'].index)
        maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator'].isin(tempdiffs)].index)
        maxs_df = maxs_df.drop(maxs_df[maxs_df['Indicator']=='Unnamed: 0'].index)
        maxs_df=maxs_df.sort_values(by = 'R^2 Spearman',ascending = False)
        moveddf=pd.concat((moveddf,maxs_df),axis=0)
    ids=moveddf.groupby('Indicator')['R^2 Spearman'].transform(max)==moveddf['R^2 Spearman']
    moveddf[ids]
    moveddf=pd.DataFrame(moveddf[ids])
    temporaldf=pd.concat((temporaldf,moveddf),axis=0)
temporaldf

Unnamed: 0,Indicator,R^2 Spearman,Behaviour,Country,Moved
4,Adjusted savings: education expenditure (curre...,0.824514,Positive,DEU,1
35,General government final consumption expenditu...,0.822917,Positive,DEU,1
45,Individuals using the Internet (% of population),0.822917,Positive,DEU,1
78,Surface area (sq. km),0.821642,Positive,DEU,1
5,"Adolescent fertility rate (births per 1,000 wo...",0.817804,Negative,DEU,1
...,...,...,...,...,...
167,Pump price for diesel fuel (US$ per liter),0.787248,Negative,CHN,21
168,Pump price for gasoline (US$ per liter),0.787248,Negative,CHN,21
30,Chemicals (% of value added in manufacturing),0.781385,Negative,CHN,21
162,Proportion of population pushed below the $1.9...,0.769754,Positive,CHN,21


In [None]:
alist=['Indicator','R^2 Spearman','Behaviour','Country','Type']
forcomparassion=corrtable[alist]

quarterfinal=pd.concat((temporaldf,forcomparassion),axis=0)
quarterfinal.fillna('Does not apply',inplace=True)
quarterfinal

Unnamed: 0,Indicator,R^2 Spearman,Behaviour,Country,Moved,Type
4,Adjusted savings: education expenditure (curre...,0.824514,Positive,DEU,1,Does not apply
35,General government final consumption expenditu...,0.822917,Positive,DEU,1,Does not apply
45,Individuals using the Internet (% of population),0.822917,Positive,DEU,1,Does not apply
78,Surface area (sq. km),0.821642,Positive,DEU,1,Does not apply
5,"Adolescent fertility rate (births per 1,000 wo...",0.817804,Negative,DEU,1,Does not apply
...,...,...,...,...,...,...
5948,Prevalence of anemia among women of reproducti...,0.793298,Negative,CHN,Does not apply,Logarithmic
5949,Logistics performance index: Ability to track ...,0.788049,Positive,CHN,Does not apply,Logarithmic
5950,Logistics performance index: Competence and qu...,0.788049,Positive,CHN,Does not apply,Logarithmic
5951,Out-of-pocket expenditure (% of current health...,0.782573,Negative,CHN,Does not apply,Logarithmic


In [None]:
quarterfinal.to_csv(os.getcwd()+'/Data/Quarterfinal.csv')

In [2]:
quarterfinal= pd.read_csv (os.getcwd()+'/Data/'+'Quarterfinal.csv')

In [3]:
#To understand better the data, we categorize it (Area label and Primary/Secondary).
categories= pd.read_excel (os.getcwd()+'/Data/'+'dfindicators - Copy.xlsx')

In [4]:
categories.rename(columns={'Type':'Group'}, inplace=True)
quarterfinal.drop(columns=('Unnamed: 0'), inplace=True)
clist=quarterfinal['Country'].unique()

In [7]:
final=pd.DataFrame()
for i in range(0,len(clist)):
    dat=quarterfinal.loc[quarterfinal.loc[:, 'Country'] == clist[i]]
    ids=dat.groupby('Indicator')['R^2 Spearman'].transform(max)==dat['R^2 Spearman']
    dat[ids]
    semifinal=pd.DataFrame(dat[ids])
    final=pd.concat((final,semifinal), axis=0)
final_indicators_list=categories.Indicator.unique()
final['Continent']=final['Country'].map(all_countries)
final=final.loc[final.loc[:, 'Indicator'].isin(np.array(final_indicators_list))]
final=pd.merge(final,categories, left_on='Indicator',right_on='Indicator')
final

Unnamed: 0,Indicator,R^2 Spearman,Behaviour,Country,Moved,Type,Continent,Number of times repeated,Group,Level
0,Individuals using the Internet (% of population),0.822917,Positive,DEU,1,Does not apply,Europe,8,,
1,Individuals using the Internet (% of population),0.760791,Positive,FRA,5,Does not apply,Europe,8,,
2,Individuals using the Internet (% of population),0.768616,Positive,SWE,Does not apply,Cubic,Europe,8,,
3,Individuals using the Internet (% of population),0.934571,Positive,GBR,8,Does not apply,Europe,8,,
4,Individuals using the Internet (% of population),0.801763,Positive,ESP,8,Does not apply,Europe,8,,
...,...,...,...,...,...,...,...,...,...,...
9484,"Tuberculosis case detection rate (%, all forms)",0.858062,Positive,PHL,Does not apply,Logarithmic,Asia,8,,
9485,Natural gas rents (% of GDP),0.888673,Negative,GHA,Does not apply,Logarithmic,South Africa,1,,
9486,Natural gas rents (% of GDP),0.893013,Negative,BGD,21,Does not apply,Asia,1,,
9487,Natural gas rents (% of GDP),0.792003,Negative,IND,21,Does not apply,Asia,1,,


In [8]:
columnssf=final.Indicator.to_list()
columnsf=np.unique(columnssf)
powerind=[]
for i in range(0, len(columnsf)):
    powerind.append(columnssf.count(columnsf[i]))

final_indicators = pd.DataFrame(list(zip(columnsf,powerind)), columns = ['Indicator','Number of times repeated'])
final_indicators=final_indicators.sort_values(by = 'Number of times repeated',ascending = False)
final_indicators

Unnamed: 0,Indicator,Number of times repeated
293,Population in urban agglomerations of more tha...,61
291,Population in largest city,60
294,Population in urban agglomerations of more tha...,49
132,GNI (current US$),46
127,GDP per capita (current US$),46
...,...,...
30,Arms imports (SIPRI trend indicator values),3
244,Natural gas rents (% of GDP),3
86,Electricity production from nuclear sources (%...,2
359,Surface area (sq,2


In [9]:
columnssf=final.Moved.to_list()
columnsf=np.unique(columnssf)
powerind=[]
for i in range(0, len(columnsf)):
    powerind.append(columnssf.count(columnsf[i]))

final_moved = pd.DataFrame(list(zip(columnsf,powerind)), columns = ['Time moved','Number of times repeated'])
final_moved=final_moved.sort_values(by = 'Number of times repeated',ascending = False)
final_moved

Unnamed: 0,Time moved,Number of times repeated
7,Does not apply,3229
6,8,1278
5,5,1120
1,13,982
3,21,957
0,1,793
2,2,589
4,3,541


There is an interesting phenomenon, in some cases there are correlations that have a high coefficient and also an adequate graphics, but they do not make sense in the analysis, these are called spurious correlations. Here are some examples:
![Cheese consumption VS. People killed by becoming tangled in their bedsheets](os.getcwd()+'/Logos/'+'chart%(1).png')
![People drowned VS. Nicolas Cage appearences](os.getcwd()+'/Logos/'+'chart.png')
Therefore, we have to be carefull with our results because correlation does not imply causation, it may have happened by chance that both variables are really similar.
So, after some thought and experimenting, we have developed a method that will allow us to find out if the correlation has happened by chance or if there is really a correlation.

This method, consists of the following:


Firstly we have classified the indicators by a group, which can be one of the following: *A&D*, *Agriculture*, *Demography*, *Economy*, *Employment*, *Environment*, *Equality*, *Exports*, *Health*, *Mortality* or *Principal*. Moreover inside each group we have also assigned each varible a level, *primary* or *secondary*, depending on their level of relevance. For example we have consider more relevant the *Population in the largest city* over the *Rural population*, thus the first will be *primary* and the latter *secondary*, while both are part of the *Demography* group. 

With this set, we can expose our hypothesis:

"It is assumed that the correlation in the primary indicators can be caused by randomness, however if this correlation also appears in the secondary indicators for at least 80% of the countries that appears in the primaries (Pareto's rule), we can suppose that there is no randomness affecting each group. Furthermore, the first assumption has to happen in 80% of the secondary indicators to avoid any fortuity." 

This hypothesis can be used in a global level, all the countries, or in the different regions. 

For example if a primary indicator is repeated 20 times the secondary indicators must have repeated 18 times. And if there are 10 secondary indicators, it has to happen for, at least, 8 indicators.


In [10]:
selected_p=categories.loc[categories['Level']=='primary']
minprimary=selected_p.groupby('Group').min()
minprimary['Min']=round(minprimary['Number of times repeated']*0.8)
minprimary.drop(columns=['Indicator','Number of times repeated','Level'], inplace=True)
minprimary


Unnamed: 0_level_0,Min
Group,Unnamed: 1_level_1
A&D,13.0
Agriculture,10.0
Demography,24.0
Economy,26.0
Employment,11.0
Environment,13.0
Equality,14.0
Exports,28.0
Health,9.0
Mortality,14.0


In [11]:
grouplist=minprimary.index.to_list()
grouplist

['A&D',
 'Agriculture',
 'Demography',
 'Economy',
 'Employment',
 'Environment',
 'Equality',
 'Exports',
 'Health',
 'Mortality',
 'Principal']

In [12]:
secondary=final.loc[final['Level']=='secondary']
secondary=pd.merge(secondary,minprimary, left_on='Group',right_on='Group')
secondary

Unnamed: 0,Indicator,R^2 Spearman,Behaviour,Country,Moved,Type,Continent,Number of times repeated,Group,Level,Min
0,"Adolescent fertility rate (births per 1,000 wo...",0.817804,Negative,DEU,1,Does not apply,Europe,22,Demography,secondary,24.0
1,"Adolescent fertility rate (births per 1,000 wo...",0.759333,Negative,SWE,3,Does not apply,Europe,22,Demography,secondary,24.0
2,"Adolescent fertility rate (births per 1,000 wo...",0.936963,Negative,GBR,13,Does not apply,Europe,22,Demography,secondary,24.0
3,"Adolescent fertility rate (births per 1,000 wo...",0.786708,Negative,HRV,8,Does not apply,Europe,22,Demography,secondary,24.0
4,"Adolescent fertility rate (births per 1,000 wo...",0.924056,Negative,POL,21,Does not apply,Europe,22,Demography,secondary,24.0
...,...,...,...,...,...,...,...,...,...,...,...
4561,Crop production index (2014-2016 = 100),0.851083,Positive,COL,8,Does not apply,Latam,21,Agriculture,secondary,10.0
4562,Crop production index (2014-2016 = 100),0.882284,Positive,CHL,Does not apply,Cuadratic,Latam,21,Agriculture,secondary,10.0
4563,Crop production index (2014-2016 = 100),0.967446,Positive,CRI,Does not apply,Cubic,Latam,21,Agriculture,secondary,10.0
4564,Crop production index (2014-2016 = 100),0.845314,Positive,USA,Does not apply,Linear,Pair,21,Agriculture,secondary,10.0


In [13]:
secondaryp=secondary.loc[:,['Group','Min']]
Global_Count=secondaryp.groupby('Group').count()
Global_Count.rename(columns={'Min':'Global Count'},inplace=True)
Global_Count

Unnamed: 0_level_0,Global Count
Group,Unnamed: 1_level_1
Agriculture,244
Demography,517
Economy,1313
Employment,83
Environment,612
Equality,56
Exports,409
Health,731
Mortality,522
Principal,79


In [14]:
secondary['H_0']=np.where(secondary['Number of times repeated']-secondary['Min']>0,'Not Discarded', 'Denied')
seco=secondary.groupby(['H_0','Group']).count()
sec=seco.loc['Not Discarded']
secondarycount=sec.drop(columns=['Indicator','R^2 Spearman','Behaviour','Country','Moved','Type','Continent','Number of times repeated','Level'])
secondarycount.rename(columns={'Min':'Secondary Count'},inplace=True)
secondarycount

Unnamed: 0_level_0,Secondary Count
Group,Unnamed: 1_level_1
Agriculture,244
Demography,186
Economy,659
Employment,83
Environment,450
Equality,33
Exports,263
Health,731
Mortality,482
Principal,79


In [15]:
continentlist=final['Continent'].unique()
namescontinents=['European', 'North African', 'Asian', 'Pair', 'Persian', 'South African', 'Latino-American']

In [16]:
finalcount=pd.merge(Global_Count,secondarycount, left_on='Group',right_on='Group')
finalcount['Does it have some global casuallity implied?']=np.where(finalcount['Secondary Count']/finalcount['Global Count']>0.8,'No', 'Yes')
finalcount['% of count (Global)']=finalcount['Secondary Count']/finalcount['Global Count']*100
finalcount.drop(columns=['Global Count','Secondary Count'],inplace=True)

In [17]:
for i in range(0,len(continentlist)):
    apfinal=final.loc[final['Continent']==continentlist[i]]
    
    selected_p=categories.loc[categories['Level']=='primary']
    minprimary=selected_p.groupby('Group').min()
    minprimary['Min']=round(minprimary['Number of times repeated']*0.8)
    minprimary.drop(columns=['Indicator','Number of times repeated','Level'], inplace=True)

    grouplist=minprimary.index.to_list()

    secondary=apfinal.loc[apfinal['Level']=='secondary']
    secondary=pd.merge(secondary,minprimary, left_on='Group',right_on='Group')

    secondaryp=secondary.loc[:,['Group','Min']]
    Global_Count=secondaryp.groupby('Group').count()
    Global_Count.rename(columns={'Min':'Global Count'},inplace=True)

    secondary['H_0']=np.where(secondary['Number of times repeated']-secondary['Min']>0,'Not Discarded', 'Denied')
    seco=secondary.groupby(['H_0','Group']).count()
    sec=seco.loc['Not Discarded']
    secondarycount=sec.drop(columns=['Indicator','R^2 Spearman','Behaviour','Country','Moved','Type','Continent','Number of times repeated','Level'])
    secondarycount.rename(columns={'Min':'Secondary Count'},inplace=True)

    apfinalcount=pd.merge(Global_Count,secondarycount, left_on='Group',right_on='Group')
    apfinalcount['Does it have some '+namescontinents[i]+' casuallity implied?']=np.where(apfinalcount['Secondary Count']/apfinalcount['Global Count']>0.8,'No', 'Yes')
    apfinalcount['% of count ('+namescontinents[i]+')']=apfinalcount['Secondary Count']/apfinalcount['Global Count']*100
    apfinalcount.drop(columns=['Global Count','Secondary Count'],inplace=True)
    finalcount=pd.merge(finalcount,apfinalcount, left_on='Group',right_on='Group')

finalcount

Unnamed: 0_level_0,Does it have some global casuallity implied?,% of count (Global),Does it have some European casuallity implied?,% of count (European),Does it have some North African casuallity implied?,% of count (North African),Does it have some Asian casuallity implied?,% of count (Asian),Does it have some Pair casuallity implied?,% of count (Pair),Does it have some Persian casuallity implied?,% of count (Persian),Does it have some South African casuallity implied?,% of count (South African),Does it have some Latino-American casuallity implied?,% of count (Latino-American)
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Agriculture,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0
Demography,Yes,35.976789,Yes,34.117647,Yes,33.962264,Yes,35.294118,Yes,38.461538,Yes,39.68254,Yes,37.5,Yes,35.0
Economy,Yes,50.190404,Yes,60.538117,Yes,48.993289,Yes,43.6,Yes,49.206349,Yes,52.941176,Yes,47.42268,Yes,49.11032
Employment,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0
Environment,Yes,73.529412,Yes,62.184874,No,81.333333,Yes,72.727273,Yes,65.789474,Yes,76.54321,No,81.081081,Yes,76.923077
Equality,Yes,58.928571,Yes,77.777778,Yes,55.555556,Yes,50.0,Yes,33.333333,Yes,75.0,Yes,50.0,Yes,63.636364
Exports,Yes,64.303178,Yes,64.583333,Yes,74.285714,Yes,65.333333,Yes,63.636364,Yes,59.259259,Yes,63.829787,Yes,62.5
Health,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0
Mortality,No,92.337165,No,96.296296,No,90.47619,No,90.217391,No,88.461538,No,92.753623,No,92.753623,No,91.578947
Principal,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0,No,100.0


In [None]:
#Needed imports
import matplotlib.pyplot as plt

import seaborn as sns

import plotly.express as px

import plotly.graph_objects as go


Now that we’ve loaded the data, we can start right away to create widgets. These widgets are essentials to add interactivity to our visualizations.

In [None]:
#COUNT HISTOGRAM: Graph for seeing the frequency of the relevant primary indicators for each region.
selected_primary=final.loc[final['Level']=='primary']
selected_primary=selected_primary.loc[selected_primary['Number of times repeated']>=11]

continents=list(selected_primary['Continent'].unique())

fig=px.histogram(selected_primary,x='Indicator',histfunc="count",color='Group',text_auto=True,title="Indicators frequency by continents").update_xaxes(categoryorder="total descending")

buttons = []
for continent in continents:
    selected_primary_c = selected_primary.loc[(selected_primary['Continent'] == continent)]
    fig_continent = px.histogram(selected_primary_c, x='Indicator', color='Group').update_xaxes(categoryorder="total descending")
    buttons.append(
        dict(
            label=continent,
            method="update",
            args=[
                {
                    "x": [trace['x'] for trace in fig_continent._data],
                }
            ]
        )
    )

fig.update_layout(
    updatemenus=[
        dict(
            type="dropdown",
            direction="down",
            showactive=True,
            buttons=buttons
        )
    ]
)

fig.show()

In [None]:
#TREEMAP: Graph for seeing the correlation of each indicator in each country.
#To resolve args[], but it works.

fig2=px.treemap(selected_primary,path=['Indicator','Continent','Country'],values='R^2 Spearman',color='R^2 Spearman',color_continuous_scale='RdBu')

indics=list(selected_primary['Indicator'].unique())
buttons = []
for indic in indics:
    selected_primary_i = selected_primary.loc[(selected_primary['Indicator'] == indic)]
    fig_indicator = px.treemap(selected_primary_i,path=['Indicator','Continent','Country'],values='R^2 Spearman',color='R^2 Spearman')
    buttons.append(
        dict(
            label=indic,
            method="update",
            args=[]
        )
    )

fig2.update_layout(
    updatemenus=[
        dict(
            type="dropdown",
            direction="down",
            showactive=True,
            buttons=buttons
        )
    ]
)

DRAFT GRAPHS

In [None]:
unique_tri = selected_primary['Indicator'].unique()
tri = widgets.SelectMultiple(
    options = unique_tri.tolist(),
    value = ['Exports of goods and services (current US$)'],
    description='Indicator',
    disabled=False,
    layout = Layout(width='50%', height='80px')
)

def graf1(tri):
    dat=selected_primary.loc[selected_primary.loc[:, 'Indicator'].isin(np.array(tri))]
    a=px.choropleth(dat, locations="Country", locationmode='ISO-3', 
                     color="R^2 Spearman", hover_name="Country",hover_data = [dat.Type, dat.Behaviour,dat.Moved,dat.Group],projection="natural earth",
                     color_continuous_scale='Reds', width=700, height=500, title= dat.Indicator.unique().tolist()[0])
    print(tri)
    a.show()
widgets.interactive(graf1, tri=tri)


interactive(children=(SelectMultiple(description='Indicator', index=(3,), layout=Layout(height='80px', width='…

To wrap up, we can create the second widget that is exactly the same as the previous multiple selection widget. The purpose of this widget is to enable us to choose which Continent we want to visualize. Below is the code implementation of this widget.

In [None]:
unique_tric1=final_join['Continent'].unique()
unique_tric2=final_join['Group'].unique()

tric1=widgets.SelectMultiple(options=unique_tric1.tolist(),value=['Europe'],description='Continent',disabled=False,layout=Layout(width='50%',height='80px'))
tric2=widgets.SelectMultiple(options=unique_tric2.tolist(),value=['Exports'],description='Group',disabled=False,layout=Layout(width='50%',height='80px'))

def graph1(tric1,tric2):
    dat=final_join.loc[final_join.loc[:,'Continent'].isin(np.array(tric1))]
    dat=dat.loc[dat.loc[:,'Group'].isin(np.array(tric2))]
    
    fig=sns.countplot(data=dat, y="Indicator", hue="Level")
    fig
widgets.interactive(graph1,tric1=tric1,tric2=tric2)


NameError: name 'final_join' is not defined

----

Now, if we execute the following loop, it will provide with the variables that follow a normal distribution.

In [None]:
for i in range(0,len(clist)):
    dat=df.loc[df.loc[:, 'Country'] == clist[i]]
    for e in range(2,len(columns)):
        data=dat.iloc[:, e]
        stat, p = shapiro(data)
        print(clist[i] +"-"+ columns[e])
        print('Statistical=%.3f, p=%.3f' % (stat, p))
        alpha = 0.05
        if p > alpha:
            print('Data is NORMAL ( H0 not denied )')
        else:
            pass

DEU-Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+)
Statistical=0.957, p=0.229
Data is NORMAL ( H0 not denied )
DEU-Adjusted net national income (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
DEU-Adjusted net national income per capita (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
DEU-Adjusted savings: carbon dioxide damage (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
DEU-Adjusted savings: consumption of fixed capital (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
DEU-Adjusted savings: education expenditure (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
DEU-Adjusted savings: energy depletion (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
DEU-Adjusted savings: mineral depletion (current US$)
Statistical=0.620, p=0.000
DEU-Adjusted savings: net national savings (

FRA-Gini index
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
FRA-Goods exports (BoP, current US$)
Statistical=0.714, p=0.000
FRA-Goods imports (BoP, current US$)
Statistical=0.893, p=0.004
FRA-Grants and other revenue (% of revenue)
Statistical=0.753, p=0.000
FRA-Gross capital formation (% of GDP)
Statistical=0.876, p=0.002
FRA-Gross capital formation (current US$)
Statistical=0.879, p=0.002
FRA-Gross domestic savings (% of GDP)
Statistical=0.977, p=0.720
Data is NORMAL ( H0 not denied )
FRA-Gross domestic savings (current US$)
Statistical=0.955, p=0.204
Data is NORMAL ( H0 not denied )
FRA-Gross fixed capital formation (% of GDP)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
FRA-Gross fixed capital formation (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
FRA-Gross national expenditure (% of GDP)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
FRA-Gross national expenditure (current US$)
Statistical=1.000, p=1.000
Data is

Statistical=0.945, p=0.103
Data is NORMAL ( H0 not denied )
SWE-Merchandise imports from low- and middle-income economies in Europe & Central Asia (% of total merchandise imports)
Statistical=0.833, p=0.000
SWE-Merchandise imports from low- and middle-income economies in Latin America & the Caribbean (% of total merchandise imports)
Statistical=0.851, p=0.000
SWE-Merchandise imports from low- and middle-income economies in Middle East & North Africa (% of total merchandise imports)
Statistical=0.867, p=0.001
SWE-Merchandise imports from low- and middle-income economies in South Asia (% of total merchandise imports)
Statistical=0.945, p=0.104
Data is NORMAL ( H0 not denied )
SWE-Merchandise imports from low- and middle-income economies in Sub-Saharan Africa (% of total merchandise imports)
Statistical=0.905, p=0.008
SWE-Merchandise imports from low- and middle-income economies outside region (% of total merchandise imports)
Statistical=0.858, p=0.001
SWE-Merchandise trade (% of GDP)
Sta

GBR-Logistics performance index: Competence and quality of logistics services (1=low to 5=high)
Statistical=0.755, p=0.000
GBR-Logistics performance index: Ease of arranging competitively priced shipments (1=low to 5=high)
Statistical=0.907, p=0.010
GBR-Logistics performance index: Efficiency of customs clearance process (1=low to 5=high)
Statistical=0.797, p=0.000
GBR-Logistics performance index: Frequency with which shipments reach consignee within scheduled or expected time (1=low to 5=high)
Statistical=0.807, p=0.000
GBR-Logistics performance index: Overall (1=low to 5=high)
Statistical=0.947, p=0.121
Data is NORMAL ( H0 not denied )
GBR-Logistics performance index: Quality of trade and transport-related infrastructure (1=low to 5=high)
Statistical=0.970, p=0.495
Data is NORMAL ( H0 not denied )
GBR-Low-birthweight babies (% of births)
Statistical=0.968, p=0.440
Data is NORMAL ( H0 not denied )
GBR-Lower secondary school starting age (years)
Statistical=0.841, p=0.000
GBR-Machinery

ESP-Logistics performance index: Ease of arranging competitively priced shipments (1=low to 5=high)
Statistical=0.940, p=0.075
Data is NORMAL ( H0 not denied )
ESP-Logistics performance index: Efficiency of customs clearance process (1=low to 5=high)
Statistical=0.825, p=0.000
ESP-Logistics performance index: Frequency with which shipments reach consignee within scheduled or expected time (1=low to 5=high)
Statistical=0.817, p=0.000
ESP-Logistics performance index: Overall (1=low to 5=high)
Statistical=0.841, p=0.000
ESP-Logistics performance index: Quality of trade and transport-related infrastructure (1=low to 5=high)
Statistical=0.971, p=0.529
Data is NORMAL ( H0 not denied )
ESP-Low-birthweight babies (% of births)
Statistical=0.965, p=0.373
Data is NORMAL ( H0 not denied )
ESP-Lower secondary school starting age (years)
Statistical=0.815, p=0.000
ESP-Machinery and transport equipment (% of value added in manufacturing)
Statistical=0.834, p=0.000
ESP-Manufactures exports (% of merc

HRV-Forest area (sq. km)
Statistical=0.945, p=0.102
Data is NORMAL ( H0 not denied )
HRV-Forest rents (% of GDP)
Statistical=0.845, p=0.000
HRV-Fossil fuel energy consumption (% of total)
Statistical=0.827, p=0.000
HRV-Fuel exports (% of merchandise exports)
Statistical=0.824, p=0.000
HRV-Fuel imports (% of merchandise imports)
Statistical=0.848, p=0.000
HRV-GDP (current US$)
Statistical=0.972, p=0.561
Data is NORMAL ( H0 not denied )
HRV-GDP deflator (base year varies by country)
Statistical=0.935, p=0.055
Data is NORMAL ( H0 not denied )
HRV-GDP deflator: linked series (base year varies by country)
Statistical=0.969, p=0.473
Data is NORMAL ( H0 not denied )
HRV-GDP growth (annual %)
Statistical=0.957, p=0.226
Data is NORMAL ( H0 not denied )
HRV-GDP per capita (current US$)
Statistical=0.866, p=0.001
HRV-GDP per capita growth (annual %)
Statistical=0.950, p=0.143
Data is NORMAL ( H0 not denied )
HRV-GDP per person employed (constant 2017 PPP $)
Statistical=0.816, p=0.000
HRV-GDP per 

POL-Agricultural nitrous oxide emissions (% of total)
Statistical=0.779, p=0.000
POL-Agricultural nitrous oxide emissions (thousand metric tons of CO2 equivalent)
Statistical=0.864, p=0.001
POL-Agricultural raw materials exports (% of merchandise exports)
Statistical=0.904, p=0.008
POL-Agricultural raw materials imports (% of merchandise imports)
Statistical=0.886, p=0.003
POL-Alternative and nuclear energy (% of total energy use)
Statistical=0.868, p=0.001
POL-Aquaculture production (metric tons)
Statistical=0.904, p=0.008
POL-Arable land (% of land area)
Statistical=0.885, p=0.003
POL-Arable land (hectares per person)
Statistical=0.952, p=0.167
Data is NORMAL ( H0 not denied )
POL-Arable land (hectares)
Statistical=0.853, p=0.000
POL-Armed forces personnel (% of total labor force)
Statistical=0.958, p=0.235
Data is NORMAL ( H0 not denied )
POL-Arms exports (SIPRI trend indicator values)
Statistical=0.852, p=0.000
POL-Arms imports (SIPRI trend indicator values)
Statistical=0.834, p=0.

POL-Technical cooperation grants (BoP, current US$)
Statistical=0.935, p=0.054
Data is NORMAL ( H0 not denied )
POL-People practicing open defecation (% of population)
Statistical=0.933, p=0.047
POL-Taxes on exports (% of tax revenue)
Statistical=0.938, p=0.066
Data is NORMAL ( H0 not denied )
POL-Adjusted savings: net forest depletion (current US$)
Statistical=0.882, p=0.002
POL-Country
Statistical=0.814, p=0.000
POL-Year
Statistical=0.936, p=0.057
Data is NORMAL ( H0 not denied )
POL-Continent
Statistical=0.904, p=0.008
GRC-Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+)
Statistical=0.957, p=0.229
Data is NORMAL ( H0 not denied )
GRC-Adjusted net national income (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
GRC-Adjusted net national income per capita (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
GRC-Adjusted savings: carbon dioxide damage (current US$)
Statistical=1

AUT-Diabetes prevalence (% of population ages 20 to 79)
Statistical=0.920, p=0.021
AUT-Domestic credit to private sector (% of GDP)
Statistical=0.933, p=0.047
AUT-Domestic credit to private sector by banks (% of GDP)
Statistical=0.910, p=0.011
AUT-Domestic general government health expenditure (% of GDP)
Statistical=0.937, p=0.060
Data is NORMAL ( H0 not denied )
AUT-Domestic general government health expenditure (% of current health expenditure)
Statistical=0.899, p=0.006
AUT-Domestic general government health expenditure (% of general government expenditure)
Statistical=0.783, p=0.000
AUT-Domestic general government health expenditure per capita (current US$)
Statistical=0.815, p=0.000
AUT-Domestic private health expenditure (% of current health expenditure)
Statistical=0.863, p=0.001
AUT-Domestic private health expenditure per capita (current US$)
Statistical=0.826, p=0.000
AUT-Ease of doing business rank (1=most business-friendly regulations)
Statistical=0.826, p=0.000
AUT-Ease of 

Data is NORMAL ( H0 not denied )
AUT-Adjusted savings: net forest depletion (current US$)
Statistical=0.880, p=0.002
AUT-Country
Statistical=0.432, p=0.000
AUT-Year
Statistical=0.937, p=0.061
Data is NORMAL ( H0 not denied )
AUT-Continent
Statistical=0.880, p=0.002
NLD-Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+)
Statistical=0.957, p=0.229
Data is NORMAL ( H0 not denied )
NLD-Adjusted net national income (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
NLD-Adjusted net national income per capita (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
NLD-Adjusted savings: carbon dioxide damage (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
NLD-Adjusted savings: consumption of fixed capital (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
NLD-Adjusted savings: education expenditure (current US$)
Statistical=1.000, p=1.000
Data 

NLD-Researchers in R&D (per million people)
Statistical=0.919, p=0.020
NLD-Reserves and related items (BoP, current US$)
Statistical=0.898, p=0.006
NLD-Risk of catastrophic expenditure for surgical care (% of people at risk)
Statistical=0.947, p=0.120
Data is NORMAL ( H0 not denied )
NLD-Rural land area (sq. km)
Statistical=0.876, p=0.002
NLD-Rural land area where elevation is below 5 meters (% of total land area)
Statistical=0.980, p=0.797
Data is NORMAL ( H0 not denied )
NLD-Rural land area where elevation is below 5 meters (sq. km)
Statistical=0.976, p=0.682
Data is NORMAL ( H0 not denied )
NLD-Rural population
Statistical=0.934, p=0.050
NLD-Rural population (% of total population)
Statistical=0.934, p=0.050
NLD-Rural population growth (annual %)
Statistical=0.950, p=0.141
Data is NORMAL ( H0 not denied )
NLD-Rural population living in areas where elevation is below 5 meters (% of total population)
Statistical=0.905, p=0.008
NLD-SF6 gas emissions (thousand metric tons of CO2 equival

Statistical=0.949, p=0.136
Data is NORMAL ( H0 not denied )
IRQ-Risk of catastrophic expenditure for surgical care (% of people at risk)
Statistical=0.875, p=0.002
IRQ-Rural land area (sq. km)
Statistical=0.887, p=0.003
IRQ-Rural land area where elevation is below 5 meters (% of total land area)
Statistical=0.900, p=0.006
IRQ-Rural land area where elevation is below 5 meters (sq. km)
Statistical=0.725, p=0.000
IRQ-Rural population
Statistical=0.669, p=0.000
IRQ-Rural population (% of total population)
Statistical=0.669, p=0.000
IRQ-Rural population growth (annual %)
Statistical=0.781, p=0.000
IRQ-Rural population living in areas where elevation is below 5 meters (% of total population)
Statistical=0.795, p=0.000
IRQ-SF6 gas emissions (thousand metric tons of CO2 equivalent)
Statistical=0.701, p=0.000
IRQ-Scientific and technical journal articles
Statistical=0.647, p=0.000
IRQ-Secondary income receipts (BoP, current US$)
Statistical=0.641, p=0.000
IRQ-Secure Internet servers
Statistical

Statistical=0.800, p=0.000
QAT-Transport services (% of service imports, BoP)
Statistical=0.967, p=0.424
Data is NORMAL ( H0 not denied )
QAT-Travel services (% of commercial service exports)
Statistical=0.619, p=0.000
QAT-Travel services (% of commercial service imports)
Statistical=0.496, p=0.000
QAT-Travel services (% of service exports, BoP)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
QAT-Travel services (% of service imports, BoP)
Statistical=0.476, p=0.000
QAT-Tuberculosis case detection rate (%, all forms)
Statistical=0.797, p=0.000
QAT-Tuberculosis treatment success rate (% of new cases)
Statistical=0.797, p=0.000
QAT-UHC service coverage index
Statistical=0.662, p=0.000
QAT-Unemployment with advanced education (% of total labor force with advanced education)
Statistical=0.869, p=0.001
QAT-Unemployment with basic education (% of total labor force with basic education)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
QAT-Unemployment with intermediate ed

SAU-Average time to clear exports through customs (days)
Statistical=0.823, p=0.000
SAU-Births attended by skilled health staff (% of total)
Statistical=0.592, p=0.000
SAU-Business extent of disclosure index (0=less disclosure to 10=more disclosure)
Statistical=0.845, p=0.000
SAU-CO2 emissions (kg per 2015 US$ of GDP)
Statistical=0.902, p=0.007
SAU-CO2 emissions (kg per 2017 PPP $ of GDP)
Statistical=0.882, p=0.002
SAU-CO2 emissions (kg per PPP $ of GDP)
Statistical=0.959, p=0.258
Data is NORMAL ( H0 not denied )
SAU-CO2 emissions (kt)
Statistical=0.923, p=0.026
SAU-CO2 emissions (metric tons per capita)
Statistical=0.882, p=0.002
SAU-CO2 emissions from gaseous fuel consumption (% of total)
Statistical=0.902, p=0.007
SAU-CO2 emissions from gaseous fuel consumption (kt)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
SAU-CO2 emissions from liquid fuel consumption (% of total)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
SAU-CO2 emissions from liquid fuel consumpti

AZE-Alternative and nuclear energy (% of total energy use)
Statistical=0.862, p=0.001
AZE-Aquaculture production (metric tons)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
AZE-Arable land (% of land area)
Statistical=0.827, p=0.000
AZE-Arable land (hectares per person)
Statistical=0.860, p=0.001
AZE-Arable land (hectares)
Statistical=0.798, p=0.000
AZE-Armed forces personnel (% of total labor force)
Statistical=0.858, p=0.001
AZE-Arms exports (SIPRI trend indicator values)
Statistical=0.798, p=0.000
AZE-Arms imports (SIPRI trend indicator values)
Statistical=0.854, p=0.001
AZE-Automated teller machines (ATMs) (per 100,000 adults)
Statistical=0.894, p=0.004
AZE-Average precipitation in depth (mm per year)
Statistical=0.878, p=0.002
AZE-Average time to clear exports through customs (days)
Statistical=0.882, p=0.002
AZE-Births attended by skilled health staff (% of total)
Statistical=0.908, p=0.010
AZE-Business extent of disclosure index (0=less disclosure to 10=more disclo

YEM-Forest area (sq. km)
Statistical=0.931, p=0.041
YEM-Forest rents (% of GDP)
Statistical=0.929, p=0.037
YEM-Fossil fuel energy consumption (% of total)
Statistical=0.835, p=0.000
YEM-Fuel exports (% of merchandise exports)
Statistical=0.780, p=0.000
YEM-Fuel imports (% of merchandise imports)
Statistical=0.666, p=0.000
YEM-GDP (current US$)
Statistical=0.955, p=0.196
Data is NORMAL ( H0 not denied )
YEM-GDP deflator (base year varies by country)
Statistical=0.883, p=0.002
YEM-GDP deflator: linked series (base year varies by country)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
YEM-GDP growth (annual %)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
YEM-GDP per capita (current US$)
Statistical=0.678, p=0.000
YEM-GDP per capita growth (annual %)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
YEM-GDP per person employed (constant 2017 PPP $)
Statistical=0.876, p=0.002
YEM-GDP per unit of energy use (PPP $ per kg of oil equivalent)
Statistical=0.803, p

OMN-Machinery and transport equipment (% of value added in manufacturing)
Statistical=0.812, p=0.000
OMN-Manufactures exports (% of merchandise exports)
Statistical=0.931, p=0.042
OMN-Manufactures imports (% of merchandise imports)
Statistical=0.838, p=0.000
OMN-Marine protected areas (% of territorial waters)
Statistical=0.882, p=0.002
OMN-Market capitalization of listed domestic companies (% of GDP)
Statistical=0.834, p=0.000
OMN-Market capitalization of listed domestic companies (current US$)
Statistical=0.846, p=0.000
OMN-Maternal mortality ratio (modeled estimate, per 100,000 live births)
Statistical=0.894, p=0.004
OMN-Maternal mortality ratio (national estimate, per 100,000 live births)
Statistical=0.844, p=0.000
OMN-Medium and high-tech exports (% manufactured exports)
Statistical=0.918, p=0.018
OMN-Medium and high-tech manufacturing value added (% manufacturing value added)
Statistical=0.590, p=0.000
OMN-Merchandise exports (current US$)
Statistical=0.902, p=0.007
OMN-Merchandi

Statistical=0.894, p=0.004
DZA-Prevalence of anemia among children (% of children ages 6-59 months)
Statistical=0.894, p=0.004
DZA-Prevalence of anemia among non-pregnant women (% of women ages 15-49)
Statistical=0.882, p=0.002
DZA-Prevalence of anemia among pregnant women (%)
Statistical=0.860, p=0.001
DZA-Prevalence of anemia among women of reproductive age (% of women ages 15-49)
Statistical=0.985, p=0.919
Data is NORMAL ( H0 not denied )
DZA-Prevalence of current tobacco use (% of adults)
Statistical=0.884, p=0.003
DZA-Prevalence of overweight (modeled estimate, % of children under 5)
Statistical=0.884, p=0.003
DZA-Prevalence of undernourishment (% of population)
Statistical=0.883, p=0.002
DZA-Price level ratio of PPP conversion factor (GDP) to market exchange rate
Statistical=0.988, p=0.968
Data is NORMAL ( H0 not denied )
DZA-Primary income payments (BoP, current US$)
Statistical=0.880, p=0.002
DZA-Primary income receipts (BoP, current US$)
Statistical=0.884, p=0.003
DZA-Primary 

EGY-Poverty gap at $3.20 a day (2011 PPP) (%)
Statistical=0.971, p=0.534
Data is NORMAL ( H0 not denied )
EGY-Taxes on international trade (% of revenue)
Statistical=0.971, p=0.534
Data is NORMAL ( H0 not denied )
EGY-Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population)
Statistical=0.961, p=0.302
Data is NORMAL ( H0 not denied )
EGY-Broad money (% of GDP)
Statistical=0.914, p=0.014
EGY-Broad money growth (annual %)
Statistical=0.880, p=0.002
EGY-Broad money to total reserves ratio
Statistical=0.915, p=0.016
EGY-Claims on central government (annual growth as % of broad money)
Statistical=0.889, p=0.003
EGY-Claims on private sector (annual growth as % of broad money)
Statistical=0.797, p=0.000
EGY-Poverty gap at $1.90 a day (2011 PPP) (%)
Statistical=0.900, p=0.006
EGY-External health expenditure (% of current health expenditure)
Statistical=0.799, p=0.000
EGY-External health expenditure per capita (current US$)
Statistical=0.362, p=0.000
EGY-Risk of impoverishing expendit

ISR-Current account balance (% of GDP)
Statistical=0.911, p=0.012
ISR-Current account balance (BoP, current US$)
Statistical=0.938, p=0.066
Data is NORMAL ( H0 not denied )
ISR-Current health expenditure (% of GDP)
Statistical=0.898, p=0.005
ISR-Current health expenditure per capita (current US$)
Statistical=0.898, p=0.005
ISR-DEC alternative conversion factor (LCU per US$)
Statistical=0.881, p=0.002
ISR-Depth of credit information index (0=low to 8=high)
Statistical=0.822, p=0.000
ISR-Diabetes prevalence (% of population ages 20 to 79)
Statistical=0.865, p=0.001
ISR-Domestic credit to private sector (% of GDP)
Statistical=0.923, p=0.025
ISR-Domestic credit to private sector by banks (% of GDP)
Statistical=0.952, p=0.167
Data is NORMAL ( H0 not denied )
ISR-Domestic general government health expenditure (% of GDP)
Statistical=0.898, p=0.005
ISR-Domestic general government health expenditure (% of current health expenditure)
Statistical=0.818, p=0.000
ISR-Domestic general government hea

Statistical=0.978, p=0.749
Data is NORMAL ( H0 not denied )
TUR-Gini index
Statistical=0.881, p=0.002
TUR-Goods exports (BoP, current US$)
Statistical=0.878, p=0.002
TUR-Goods imports (BoP, current US$)
Statistical=0.892, p=0.004
TUR-Grants and other revenue (% of revenue)
Statistical=0.793, p=0.000
TUR-Gross capital formation (% of GDP)
Statistical=0.908, p=0.010
TUR-Gross capital formation (current US$)
Statistical=0.900, p=0.006
TUR-Gross domestic savings (% of GDP)
Statistical=0.764, p=0.000
TUR-Gross domestic savings (current US$)
Statistical=0.863, p=0.001
TUR-Gross fixed capital formation (% of GDP)
Statistical=0.757, p=0.000
TUR-Gross fixed capital formation (current US$)
Statistical=0.556, p=0.000
TUR-Gross national expenditure (% of GDP)
Statistical=0.556, p=0.000
TUR-Gross national expenditure (current US$)
Statistical=0.585, p=0.000
TUR-Gross national expenditure deflator (base year varies by country)
Statistical=0.773, p=0.000
TUR-Gross savings (% of GDP)
Statistical=0.862

Data is NORMAL ( H0 not denied )
MAR-Human capital index (HCI), male (scale 0-1)
Statistical=0.948, p=0.125
Data is NORMAL ( H0 not denied )
MAR-Human capital index (HCI), male, lower bound (scale 0-1)
Statistical=0.981, p=0.816
Data is NORMAL ( H0 not denied )
MAR-Human capital index (HCI), male, upper bound (scale 0-1)
Statistical=0.894, p=0.004
MAR-Human capital index (HCI), upper bound (scale 0-1)
Statistical=0.787, p=0.000
MAR-ICT goods exports (% of total goods exports)
Statistical=0.688, p=0.000
MAR-ICT goods imports (% total goods imports)
Statistical=0.565, p=0.000
MAR-ICT service exports (% of service exports, BoP)
Statistical=0.849, p=0.000
MAR-ICT service exports (BoP, current US$)
Statistical=0.781, p=0.000
MAR-Import unit value index (2000 = 100)
Statistical=0.814, p=0.000
MAR-Import value index (2000 = 100)
Statistical=0.829, p=0.000
MAR-Import volume index (2000 = 100)
Statistical=0.962, p=0.303
Data is NORMAL ( H0 not denied )
MAR-Imports of goods and services (% of GD

SEN-Current health expenditure per capita (current US$)
Statistical=0.864, p=0.001
SEN-DEC alternative conversion factor (LCU per US$)
Statistical=0.902, p=0.007
SEN-Depth of credit information index (0=low to 8=high)
Statistical=0.884, p=0.003
SEN-Diabetes prevalence (% of population ages 20 to 79)
Statistical=0.846, p=0.000
SEN-Domestic credit to private sector (% of GDP)
Statistical=0.790, p=0.000
SEN-Domestic credit to private sector by banks (% of GDP)
Statistical=0.880, p=0.002
SEN-Domestic general government health expenditure (% of GDP)
Statistical=0.932, p=0.044
SEN-Domestic general government health expenditure (% of current health expenditure)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
SEN-Domestic general government health expenditure (% of general government expenditure)
Statistical=0.693, p=0.000
SEN-Domestic general government health expenditure per capita (current US$)
Statistical=0.694, p=0.000
SEN-Domestic private health expenditure (% of current health

ZAF-Imports of goods and services (BoP, current US$)
Statistical=0.916, p=0.016
ZAF-Imports of goods and services (current US$)
Statistical=0.930, p=0.040
ZAF-Incidence of tuberculosis (per 100,000 people)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
ZAF-Income share held by fourth 20%
Statistical=0.628, p=0.000
ZAF-Income share held by highest 10%
Statistical=0.499, p=0.000
ZAF-Income share held by highest 20%
Statistical=0.873, p=0.001
ZAF-Income share held by lowest 10%
Statistical=0.972, p=0.562
Data is NORMAL ( H0 not denied )
ZAF-Income share held by lowest 20%
Statistical=0.963, p=0.331
Data is NORMAL ( H0 not denied )
ZAF-Income share held by second 20%
Statistical=0.838, p=0.000
ZAF-Income share held by third 20%
Statistical=0.839, p=0.000
ZAF-Increase in poverty gap at $1.90 ($ 2011 PPP) poverty line due to out-of-pocket health care expenditure (% of poverty line)
Statistical=0.869, p=0.001
ZAF-Increase in poverty gap at $1.90 ($ 2011 PPP) poverty line due to o

LBR-Increase in poverty gap at $1.90 ($ 2011 PPP) poverty line due to out-of-pocket health care expenditure (USD)
Statistical=0.797, p=0.000
LBR-Increase in poverty gap at $3.20 ($ 2011 PPP) poverty line due to out-of-pocket health care expenditure (% of poverty line)
Statistical=0.815, p=0.000
LBR-Increase in poverty gap at $3.20 ($ 2011 PPP) poverty line due to out-of-pocket health care expenditure (USD)
Statistical=0.866, p=0.001
LBR-Individuals using the Internet (% of population)
Statistical=0.871, p=0.001
LBR-Industry (including construction), value added (% of GDP)
Statistical=0.890, p=0.004
LBR-Industry (including construction), value added (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
LBR-Insurance and financial services (% of commercial service exports)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
LBR-Insurance and financial services (% of commercial service imports)
Statistical=0.448, p=0.000
LBR-Insurance and financial services (% 

LBR-Continent
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
MOZ-Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+)
Statistical=0.957, p=0.229
Data is NORMAL ( H0 not denied )
MOZ-Adjusted net national income (current US$)
Statistical=0.776, p=0.000
MOZ-Adjusted net national income per capita (current US$)
Statistical=0.812, p=0.000
MOZ-Adjusted savings: carbon dioxide damage (current US$)
Statistical=0.844, p=0.000
MOZ-Adjusted savings: consumption of fixed capital (current US$)
Statistical=0.866, p=0.001
MOZ-Adjusted savings: education expenditure (current US$)
Statistical=0.740, p=0.000
MOZ-Adjusted savings: energy depletion (current US$)
Statistical=0.858, p=0.001
MOZ-Adjusted savings: mineral depletion (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
MOZ-Adjusted savings: net national savings (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
MOZ-Adjusted savings

CMR-Labor force with advanced education (% of total working-age population with advanced education)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
CMR-Labor force with basic education (% of total working-age population with basic education)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
CMR-Labor force with intermediate education (% of total working-age population with intermediate education)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
CMR-Labor tax and contributions (% of commercial profits)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
CMR-Land area (sq. km)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
CMR-Land area where elevation is below 5 meters (% of total land area)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
CMR-Land under cereal production (hectares)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
CMR-Level of water stress: freshwater withdrawal as a proportion of available fres

NGA-Labor force with intermediate education (% of total working-age population with intermediate education)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
NGA-Labor tax and contributions (% of commercial profits)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
NGA-Land area (sq. km)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
NGA-Land area where elevation is below 5 meters (% of total land area)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
NGA-Land under cereal production (hectares)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
NGA-Level of water stress: freshwater withdrawal as a proportion of available freshwater resources
Statistical=0.880, p=0.002
NGA-Lifetime risk of maternal death (%)
Statistical=0.882, p=0.002
NGA-Lifetime risk of maternal death (1 in: rate varies by country)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
NGA-Liner shipping connectivity index (maximum value in 2004 = 100)
Statistica

GHA-Taxes less subsidies on products (current US$)
Statistical=0.747, p=0.000
GHA-Taxes on goods and services (% of revenue)
Statistical=0.747, p=0.000
GHA-Taxes on goods and services (% value added of industry and services)
Statistical=0.732, p=0.000
GHA-Terrestrial and marine protected areas (% of total territorial area)
Statistical=0.817, p=0.000
GHA-Terrestrial protected areas (% of total land area)
Statistical=0.721, p=0.000
GHA-Textiles and clothing (% of value added in manufacturing)
Statistical=0.707, p=0.000
GHA-Time required to build a warehouse (days)
Statistical=0.747, p=0.000
GHA-Time required to enforce a contract (days)
Statistical=0.827, p=0.000
GHA-Time required to get electricity (days)
Statistical=0.954, p=0.187
Data is NORMAL ( H0 not denied )
GHA-Time required to register property (days)
Statistical=0.738, p=0.000
GHA-Time required to start a business (days)
Statistical=0.397, p=0.000
GHA-Time spent dealing with the requirements of government regulations (% of seni

BGD-Pregnant women receiving prenatal care (%)
Statistical=0.897, p=0.005
BGD-Prevalence of anemia among children (% of children ages 6-59 months)
Statistical=0.897, p=0.005
BGD-Prevalence of anemia among non-pregnant women (% of women ages 15-49)
Statistical=0.814, p=0.000
BGD-Prevalence of anemia among pregnant women (%)
Statistical=0.820, p=0.000
BGD-Prevalence of anemia among women of reproductive age (% of women ages 15-49)
Statistical=0.949, p=0.136
Data is NORMAL ( H0 not denied )
BGD-Prevalence of current tobacco use (% of adults)
Statistical=0.888, p=0.003
BGD-Prevalence of overweight (modeled estimate, % of children under 5)
Statistical=0.888, p=0.003
BGD-Prevalence of undernourishment (% of population)
Statistical=0.823, p=0.000
BGD-Price level ratio of PPP conversion factor (GDP) to market exchange rate
Statistical=0.943, p=0.088
Data is NORMAL ( H0 not denied )
BGD-Primary income payments (BoP, current US$)
Statistical=0.819, p=0.000
BGD-Primary income receipts (BoP, curre

Statistical=0.908, p=0.010
IND-Primary income payments (BoP, current US$)
Statistical=0.856, p=0.001
IND-Primary income receipts (BoP, current US$)
Statistical=0.907, p=0.009
IND-Primary school starting age (years)
Statistical=0.907, p=0.009
IND-Private credit bureau coverage (% of adults)
Statistical=0.855, p=0.001
IND-Probability of dying among adolescents ages 10-14 years (per 1,000)
Statistical=0.900, p=0.006
IND-Probability of dying among adolescents ages 15-19 years (per 1,000)
Statistical=0.897, p=0.005
IND-Probability of dying among children ages 5-9 years (per 1,000)
Statistical=0.831, p=0.000
IND-Probability of dying among youth ages 20-24 years (per 1,000)
Statistical=0.971, p=0.518
Data is NORMAL ( H0 not denied )
IND-Procedures to build a warehouse (number)
Statistical=0.955, p=0.202
Data is NORMAL ( H0 not denied )
IND-Procedures to register property (number)
Statistical=0.896, p=0.005
IND-Profit tax (% of commercial profits)
Statistical=0.896, p=0.005
IND-Progression to 

Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
VNM-Broad money to total reserves ratio
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
VNM-Claims on central government (annual growth as % of broad money)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
VNM-Claims on private sector (annual growth as % of broad money)
Statistical=0.712, p=0.000
VNM-Poverty gap at $1.90 a day (2011 PPP) (%)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
VNM-External health expenditure (% of current health expenditure)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
VNM-External health expenditure per capita (current US$)
Statistical=0.530, p=0.000
VNM-Risk of impoverishing expenditure for surgical care (% of people at risk)
Statistical=0.902, p=0.007
VNM-Net official development assistance and official aid received (constant 2018 US$)
Statistical=0.810, p=0.000
VNM-Net official development assistance and official aid received (current US$)
Statistical=0.7

IDN-Domestic credit to private sector (% of GDP)
Statistical=0.930, p=0.040
IDN-Domestic credit to private sector by banks (% of GDP)
Statistical=0.812, p=0.000
IDN-Domestic general government health expenditure (% of GDP)
Statistical=0.880, p=0.002
IDN-Domestic general government health expenditure (% of current health expenditure)
Statistical=0.948, p=0.124
Data is NORMAL ( H0 not denied )
IDN-Domestic general government health expenditure (% of general government expenditure)
Statistical=0.935, p=0.053
Data is NORMAL ( H0 not denied )
IDN-Domestic general government health expenditure per capita (current US$)
Statistical=0.898, p=0.005
IDN-Domestic private health expenditure (% of current health expenditure)
Statistical=0.917, p=0.017
IDN-Domestic private health expenditure per capita (current US$)
Statistical=0.900, p=0.006
IDN-Ease of doing business rank (1=most business-friendly regulations)
Statistical=0.900, p=0.006
IDN-Ease of doing business score (0 = lowest performance to 10

PHL-Insurance and financial services (% of service imports, BoP)
Statistical=0.871, p=0.001
PHL-Intentional homicides (per 100,000 people)
Statistical=0.876, p=0.002
PHL-Interest payments (% of revenue)
Statistical=0.855, p=0.001
PHL-International migrant stock (% of population)
Statistical=0.787, p=0.000
PHL-Labor force with advanced education (% of total working-age population with advanced education)
Statistical=0.864, p=0.001
PHL-Labor force with basic education (% of total working-age population with basic education)
Statistical=0.726, p=0.000
PHL-Labor force with intermediate education (% of total working-age population with intermediate education)
Statistical=0.688, p=0.000
PHL-Labor tax and contributions (% of commercial profits)
Statistical=0.721, p=0.000
PHL-Land area (sq. km)
Statistical=0.838, p=0.000
PHL-Land area where elevation is below 5 meters (% of total land area)
Statistical=0.890, p=0.004
PHL-Land under cereal production (hectares)
Statistical=0.864, p=0.001
PHL-Le

KOR-Merchandise imports from low- and middle-income economies outside region (% of total merchandise imports)
Statistical=0.876, p=0.002
KOR-Merchandise trade (% of GDP)
Statistical=0.954, p=0.187
Data is NORMAL ( H0 not denied )
KOR-Methane emissions (% change from 1990)
Statistical=0.882, p=0.002
KOR-Methane emissions (kt of CO2 equivalent)
Statistical=0.882, p=0.002
KOR-Methane emissions in energy sector (thousand metric tons of CO2 equivalent)
Statistical=0.865, p=0.001
KOR-Military expenditure (% of GDP)
Statistical=0.870, p=0.001
KOR-Military expenditure (% of general government expenditure)
Statistical=0.876, p=0.002
KOR-Military expenditure (current USD)
Statistical=0.974, p=0.628
Data is NORMAL ( H0 not denied )
KOR-Mineral rents (% of GDP)
Statistical=0.453, p=0.000
KOR-Mobile cellular subscriptions
Statistical=0.901, p=0.007
KOR-Mobile cellular subscriptions (per 100 people)
Statistical=0.913, p=0.013
KOR-Monetary Sector credit to private sector (% GDP)
Statistical=nan, p=1.

MEX-Ratio of female to male labor force participation rate (%) (national estimate)
Statistical=0.854, p=0.001
MEX-Refugee population by country or territory of asylum
Statistical=0.952, p=0.168
Data is NORMAL ( H0 not denied )
MEX-Refugee population by country or territory of origin
Statistical=0.976, p=0.664
Data is NORMAL ( H0 not denied )
MEX-Renewable electricity output (% of total electricity output)
Statistical=0.946, p=0.109
Data is NORMAL ( H0 not denied )
MEX-Renewable energy consumption (% of total final energy consumption)
Statistical=0.946, p=0.109
Data is NORMAL ( H0 not denied )
MEX-Renewable internal freshwater resources per capita (cubic meters)
Statistical=0.933, p=0.048
MEX-Research and development expenditure (% of GDP)
Statistical=0.922, p=0.023
MEX-Researchers in R&D (per million people)
Statistical=0.949, p=0.132
Data is NORMAL ( H0 not denied )
MEX-Reserves and related items (BoP, current US$)
Statistical=0.978, p=0.747
Data is NORMAL ( H0 not denied )
MEX-Risk o

Data is NORMAL ( H0 not denied )
BRA-Claims on central government (annual growth as % of broad money)
Statistical=0.918, p=0.019
BRA-Claims on private sector (annual growth as % of broad money)
Statistical=0.979, p=0.783
Data is NORMAL ( H0 not denied )
BRA-Poverty gap at $1.90 a day (2011 PPP) (%)
Statistical=0.499, p=0.000
BRA-External health expenditure (% of current health expenditure)
Statistical=0.658, p=0.000
BRA-External health expenditure per capita (current US$)
Statistical=0.775, p=0.000
BRA-Risk of impoverishing expenditure for surgical care (% of people at risk)
Statistical=0.868, p=0.001
BRA-Net official development assistance and official aid received (constant 2018 US$)
Statistical=0.969, p=0.471
Data is NORMAL ( H0 not denied )
BRA-Net official development assistance and official aid received (current US$)
Statistical=0.957, p=0.230
Data is NORMAL ( H0 not denied )
BRA-Technical cooperation grants (BoP, current US$)
Statistical=0.813, p=0.000
BRA-People practicing open

PER-Consumer price index (2010 = 100)
Statistical=0.873, p=0.001
PER-Container port traffic (TEU: 20 foot equivalent units)
Statistical=0.887, p=0.003
PER-Crop production index (2014-2016 = 100)
Statistical=0.939, p=0.070
Data is NORMAL ( H0 not denied )
PER-Current account balance (% of GDP)
Statistical=0.908, p=0.010
PER-Current account balance (BoP, current US$)
Statistical=0.974, p=0.623
Data is NORMAL ( H0 not denied )
PER-Current health expenditure (% of GDP)
Statistical=0.945, p=0.104
Data is NORMAL ( H0 not denied )
PER-Current health expenditure per capita (current US$)
Statistical=0.945, p=0.104
Data is NORMAL ( H0 not denied )
PER-DEC alternative conversion factor (LCU per US$)
Statistical=0.914, p=0.014
PER-Depth of credit information index (0=low to 8=high)
Statistical=0.834, p=0.000
PER-Diabetes prevalence (% of population ages 20 to 79)
Statistical=0.872, p=0.001
PER-Domestic credit to private sector (% of GDP)
Statistical=0.876, p=0.002
PER-Domestic credit to private se

VEN-Arms imports (SIPRI trend indicator values)
Statistical=0.877, p=0.002
VEN-Automated teller machines (ATMs) (per 100,000 adults)
Statistical=0.918, p=0.018
VEN-Average precipitation in depth (mm per year)
Statistical=0.933, p=0.047
VEN-Average time to clear exports through customs (days)
Statistical=0.810, p=0.000
VEN-Births attended by skilled health staff (% of total)
Statistical=0.824, p=0.000
VEN-Business extent of disclosure index (0=less disclosure to 10=more disclosure)
Statistical=0.773, p=0.000
VEN-CO2 emissions (kg per 2015 US$ of GDP)
Statistical=0.882, p=0.002
VEN-CO2 emissions (kg per 2017 PPP $ of GDP)
Statistical=0.924, p=0.027
VEN-CO2 emissions (kg per PPP $ of GDP)
Statistical=0.902, p=0.007
VEN-CO2 emissions (kt)
Statistical=0.864, p=0.001
VEN-CO2 emissions (metric tons per capita)
Statistical=0.790, p=0.000
VEN-CO2 emissions from gaseous fuel consumption (% of total)
Statistical=0.884, p=0.002
VEN-CO2 emissions from gaseous fuel consumption (kt)
Statistical=nan, 

COL-Domestic general government health expenditure (% of current health expenditure)
Statistical=0.961, p=0.294
Data is NORMAL ( H0 not denied )
COL-Domestic general government health expenditure (% of general government expenditure)
Statistical=0.804, p=0.000
COL-Domestic general government health expenditure per capita (current US$)
Statistical=0.775, p=0.000
COL-Domestic private health expenditure (% of current health expenditure)
Statistical=0.803, p=0.000
COL-Domestic private health expenditure per capita (current US$)
Statistical=0.800, p=0.000
COL-Ease of doing business rank (1=most business-friendly regulations)
Statistical=0.800, p=0.000
COL-Ease of doing business score (0 = lowest performance to 100 = best performance)
Statistical=0.953, p=0.179
Data is NORMAL ( H0 not denied )
COL-Electric power consumption (kWh per capita)
Statistical=0.889, p=0.003
COL-Electric power transmission and distribution losses (% of output)
Statistical=0.862, p=0.001
COL-Electricity production fr

CHL-Goods imports (BoP, current US$)
Statistical=0.898, p=0.006
CHL-Grants and other revenue (% of revenue)
Statistical=0.868, p=0.001
CHL-Gross capital formation (% of GDP)
Statistical=0.931, p=0.043
CHL-Gross capital formation (current US$)
Statistical=0.893, p=0.004
CHL-Gross domestic savings (% of GDP)
Statistical=0.888, p=0.003
CHL-Gross domestic savings (current US$)
Statistical=0.909, p=0.011
CHL-Gross fixed capital formation (% of GDP)
Statistical=0.708, p=0.000
CHL-Gross fixed capital formation (current US$)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
CHL-Gross national expenditure (% of GDP)
Statistical=nan, p=1.000
Data is NORMAL ( H0 not denied )
CHL-Gross national expenditure (current US$)
Statistical=1.000, p=1.000
Data is NORMAL ( H0 not denied )
CHL-Gross national expenditure deflator (base year varies by country)
Statistical=0.901, p=0.007
CHL-Gross savings (% of GDP)
Statistical=0.907, p=0.009
CHL-Gross savings (current US$)
Statistical=0.979, p=0.761
Da

PAN-Methane emissions (% change from 1990)
Statistical=0.826, p=0.000
PAN-Methane emissions (kt of CO2 equivalent)
Statistical=0.826, p=0.000
PAN-Methane emissions in energy sector (thousand metric tons of CO2 equivalent)
Statistical=0.858, p=0.001
PAN-Military expenditure (% of GDP)
Statistical=0.858, p=0.001
PAN-Military expenditure (% of general government expenditure)
Statistical=0.848, p=0.000
PAN-Military expenditure (current USD)
Statistical=0.978, p=0.741
Data is NORMAL ( H0 not denied )
PAN-Mineral rents (% of GDP)
Statistical=0.797, p=0.000
PAN-Mobile cellular subscriptions
Statistical=0.965, p=0.368
Data is NORMAL ( H0 not denied )
PAN-Mobile cellular subscriptions (per 100 people)
Statistical=0.965, p=0.368
Data is NORMAL ( H0 not denied )
PAN-Monetary Sector credit to private sector (% GDP)
Statistical=0.905, p=0.008
PAN-Mortality caused by road traffic injury (per 100,000 population)
Statistical=0.916, p=0.016
PAN-Mortality rate attributed to unintentional poisoning (per 

CRI-People using at least basic drinking water services (% of population)
Statistical=0.859, p=0.001
CRI-People using at least basic sanitation services (% of population)
Statistical=0.896, p=0.005
CRI-People using safely managed sanitation services (% of population)
Statistical=0.896, p=0.005
CRI-Permanent cropland (% of land area)
Statistical=0.938, p=0.064
Data is NORMAL ( H0 not denied )
CRI-Physicians (per 1,000 people)
Statistical=0.923, p=0.024
CRI-Population ages 0-14 (% of total population)
Statistical=0.923, p=0.024
CRI-Population ages 15-64 (% of total population)
Statistical=0.878, p=0.002
CRI-Population ages 65 and above (% of total population)
Statistical=0.856, p=0.001
CRI-Population density (people per sq. km of land area)
Statistical=0.984, p=0.901
Data is NORMAL ( H0 not denied )
CRI-Population growth (annual %)
Statistical=0.923, p=0.024
CRI-Population in largest city
Statistical=0.895, p=0.005
CRI-Population in the largest city (% of urban population)
Statistical=0.

Data is NORMAL ( H0 not denied )
USA-Time required to build a warehouse (days)
Statistical=0.939, p=0.068
Data is NORMAL ( H0 not denied )
USA-Time required to enforce a contract (days)
Statistical=0.941, p=0.078
Data is NORMAL ( H0 not denied )
USA-Time required to get electricity (days)
Statistical=0.889, p=0.003
USA-Time required to register property (days)
Statistical=0.941, p=0.078
Data is NORMAL ( H0 not denied )
USA-Time required to start a business (days)
Statistical=0.630, p=0.000
USA-Time spent dealing with the requirements of government regulations (% of senior management time)
Statistical=0.674, p=0.000
USA-Time to obtain an electrical connection (days)
Statistical=0.672, p=0.000
USA-Time to prepare and pay taxes (hours)
Statistical=0.676, p=0.000
USA-Time to resolve insolvency (years)
Statistical=0.633, p=0.000
USA-Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)
Statistical=0.631, p=0.000
USA-Total fisheries production (