# Basics

Import libraries and functions.

In [229]:
import pandas as pd
import numpy as np
import glob
import os
from pyspark.sql.functions import concat, col, lit, split

Firstly we load the database from World Data Bank that has been downloaded and extracted in the *Data extraction* notebook. We acquire it from the predetermined path that is on our computer.

In [230]:
df= pd.read_csv (os.getcwd()+'\WDIData.csv')
df

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,16.936004,17.337896,17.687093,18.140971,18.491344,18.825520,19.272212,19.628009,,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.499471,6.680066,6.859110,7.016238,7.180364,7.322294,7.517191,7.651598,,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,37.855399,38.046781,38.326255,38.468426,38.670044,38.722783,38.927016,39.042839,,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.794160,32.001027,33.871910,38.880173,40.261358,43.061877,44.270860,45.803485,,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,18.663502,17.633986,16.464681,24.531436,25.345111,27.449908,29.641760,30.404935,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384365,Zimbabwe,ZWE,Women who believe a husband is justified in be...,SG.VAW.REFU.ZS,,,,,,,...,,,14.500000,,,,,,,
384366,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,,,,,,,...,,,3.700000,,,,5.418352,,,
384367,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,,,...,,33.500000,32.400000,,,,33.658057,,,
384368,Zimbabwe,ZWE,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,,,,,,,...,59.200000,59.400000,59.500000,59.700000,59.900000,60.000000,60.200000,60.400000,,


Moreover, to work more comfortably, we remove those columns not useful for us, as *Country Name* and *Indicator Code*, since with the *Country Code*, *Value* and the *Indicator Name* we have the information relevant.

In [231]:
df.drop(columns=["Country Name","Indicator Code"], axis=1, inplace=True)

FILTER 1: BY COUNTRY

From the, almost, two hundred countries we have information about in the worldwide database, we have decided to study 50 of them, grouping them by geographical and economical similiarities. With this, we can keep in our dataframe the selected countries.

In [232]:
europe_list=['DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD']
persian_list=['IRQ','QAT','ARE','SAU','AZE','YEM','YDR','OMN']
naf_list=['DZA','EGY','LBY','ISR','TUR','MAR']
saf_list=['SEN','ZAF','LBR','MOZ','CMR','NGA','GHA']
asia_list=['BGD','IND','VNM','THA','IDN','PHL','KOR']
latam_list=['MEX','BRA','ARG','PER','VEN','COL','CHL','PCZ','CRI']
two_list=['USA','CHN']
country_list=europe_list+persian_list+naf_list+saf_list+asia_list+latam_list+two_list 

In [233]:
df1=df.loc[df['Country Code'].isin(country_list)]

Now we transpose the rows of years into the columns.

In [234]:
df2=(df1.set_index(["Country Code", "Indicator Name"]).stack().reset_index(name='Value').rename(columns={'level_2':'Date'}))
df2

Unnamed: 0,Country Code,Indicator Name,Date,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1729300,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1729301,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1729302,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1729303,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


FILTER 2: BY YEAR

Our time range covers from 1960 to 2021. However, the record is not uniform and complete for all areas and indicators. We can appreaciate that specially in the first years of the last century, so many data is missing, so it makes no sense to study it. Besides, for the year 2021 many data is also lacking. Therefore, we would delimit our study between 1990 and 2020.

In [235]:
df2[['Date']] = df2[['Date']].astype(int)

In [236]:
df2.dtypes

Country Code       object
Indicator Name     object
Date                int32
Value             float64
dtype: object

In [237]:
df3 = df2[df2['Date'] > 1989]
df3

Unnamed: 0,Country Code,Indicator Name,Date,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1729300,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1729301,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1729302,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1729303,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


FILTER 3: BY INDICATOR

As there are lots of indicators that have very similar meaning we have decided to select some indicators to perform the study (**Indicator group** = *Name of the selected indicator*):
- **GDP** = *GDP (current US$)*
- **Literacy** = *Literacy rate, adult total (% of people ages 15 and above)', 'Government expenditure on education, total (% of government expenditure)*
- **Migration** = *Net migration*
- **Exports** = *Commercial service exports (current US$)* & *Exports of goods and services (current US$)*
- **International trading** = *Taxes on international trade (current LCU)*
- **Fertility** = *Fertility rate, total (births per woman)*
- **Healthcare** = *People using at least basic sanitation services (% of population)*
- **Employment** = *Employment in agriculture (% of total employment) (modeled ILO estimate)*, *Employment in services (% of total employment) (modeled ILO estimate)* & *Employment in industry (% of total employment) (modeled ILO estimate)*
- **Renewable energy** = *Electricity production from renewable sources, excluding hydroelectric (kWh)*
- **Mortality** = *Number of infant deaths*
- **Outside investment** = *Foreign direct investment, net (BoP, current US$)*
- **Pollution** = *Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)*
- **Alcoholism** = *Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)*
- **Tech adoption** = *Research and development expenditure (% of GDP)*
- **Workers high education** = *Labor force with advanced education (% of total working-age population with advanced education)*
- **Optimisim and pessimisim** = *Suicide mortality rate (per 100,000 population)*
- **Gender inequality** = *CPIA gender equality rating (**1=low to **6=high)*
- **Education** = *Share of youth not in education, employment or training, total (% of youth population)* & *Government expenditure on education, total (% of government expenditure)'*

To acomplish this, we use the function `isin` that will allow us to only select the the indicators afromentioned, that have been compilied in the list called *indicators_list*

In [238]:
indicators_list=['GDP (current US$)','Literacy rate, adult total (% of people ages 15 and above)', 'Government expenditure on education, total (% of government expenditure)','Net migration','Commercial service exports (current US$)','Exports of goods and services (current US$)','Taxes on international trade (current LCU)','Fertility rate, total (births per woman)','People using at least basic sanitation services (% of population)','Employment in agriculture (% of total employment) (modeled ILO estimate)','Employment in services (% of total employment) (modeled ILO estimate)','Employment in industry (% of total employment) (modeled ILO estimate)','Electricity production from renewable sources, excluding hydroelectric (kWh)','Number of infant deaths','Number of infant deaths','Foreign direct investment, net (BoP, current US$)','Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)','Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)','Research and development expenditure (% of GDP)','Labor force with advanced education (% of total working-age population with advanced education)','Suicide mortality rate (per 100,000 population)','CPIA gender equality rating (1=low to 6=high)','Share of youth not in education, employment or training, total (% of youth population)','Government expenditure on education, total (% of government expenditure)']

In [239]:
df4=df3.loc[df3['Indicator Name'].isin(indicators_list)]
pd.set_option('display.max_rows', 10)
df4

Unnamed: 0,Country Code,Indicator Name,Date,Value
5334,DZA,Commercial service exports (current US$),1990,4.795977e+08
5335,DZA,Commercial service exports (current US$),1991,3.747657e+08
5336,DZA,Commercial service exports (current US$),2005,2.466000e+09
5337,DZA,Commercial service exports (current US$),2006,2.512000e+09
5338,DZA,Commercial service exports (current US$),2007,2.786733e+09
...,...,...,...,...
1727918,YEM,Total alcohol consumption per capita (liters o...,2000,7.900000e-01
1727919,YEM,Total alcohol consumption per capita (liters o...,2005,3.400000e-01
1727920,YEM,Total alcohol consumption per capita (liters o...,2010,1.800000e-01
1727921,YEM,Total alcohol consumption per capita (liters o...,2015,5.500000e-02


PMHOURS PRUEBA PARA SACAR QUARTILES

Primero: cambiar los nombres de los indicadores (filas)

In [240]:
df4['Indicator Name']=df4['Indicator Name'].replace(['CPIA gender equality rating (1=low to 6=high)','Commercial service exports (current US$)','Electricity production from renewable sources, excluding hydroelectric (kWh)','Employment in agriculture (% of total employment) (modeled ILO estimate)','Employment in industry (% of total employment) (modeled ILO estimate)','Employment in services (% of total employment) (modeled ILO estimate)','Exports of goods and services (current US$)','Fertility rate, total (births per woman)','Foreign direct investment, net (BoP, current US$)','GDP (current US$)','Government expenditure on education, total (% of government expenditure)','Labor force with advanced education (% of total working-age population with advanced education)','Literacy rate, adult total (% of people ages 15 and above)','Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population','Net migration','Number of infant deaths','People using at least basic sanitation services (% of population)','Research and development expenditure (% of GDP)','Share of youth not in education, employment or training, total (% of youth population)','Suicide mortality rate (per 100,000 population)','Taxes on international trade (current LCU)','Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)'],['Gender equality','Exports-Commercial services','Renewable electricity','Employment-agriculture','Employment-industry','Employment-services','Exports-G&S','Fertility rate','Foreign investment','GDP','Education GExp','Workers high education','Literacy rate','Mortality-pollution','Net migration','Mortality-infants','Health services use','R&D GExp','Ninis','Suicide','International taxes','Alcohol per capita'])
df4

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df4['Indicator Name']=df4['Indicator Name'].replace(['CPIA gender equality rating (1=low to 6=high)','Commercial service exports (current US$)','Electricity production from renewable sources, excluding hydroelectric (kWh)','Employment in agriculture (% of total employment) (modeled ILO estimate)','Employment in industry (% of total employment) (modeled ILO estimate)','Employment in services (% of total employment) (modeled ILO estimate)','Exports of goods and services (current US$)','Fertility rate, total (births per woman)','Foreign direct investment, net (BoP, current US$)','GDP (current US$)','Government expenditure on education, total (% of government expenditure)','Labor force with advanced education (% of total working-

Unnamed: 0,Country Code,Indicator Name,Date,Value
5334,DZA,Exports-Commercial services,1990,4.795977e+08
5335,DZA,Exports-Commercial services,1991,3.747657e+08
5336,DZA,Exports-Commercial services,2005,2.466000e+09
5337,DZA,Exports-Commercial services,2006,2.512000e+09
5338,DZA,Exports-Commercial services,2007,2.786733e+09
...,...,...,...,...
1727918,YEM,Alcohol per capita,2000,7.900000e-01
1727919,YEM,Alcohol per capita,2005,3.400000e-01
1727920,YEM,Alcohol per capita,2010,1.800000e-01
1727921,YEM,Alcohol per capita,2015,5.500000e-02


Segundo: calcular cuartiles e IQR.

In [241]:
grouped=df4.groupby(['Country Code','Indicator Name'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001969F5DE890>

In [242]:
Q1=df4.groupby(['Country Code','Indicator Name']).quantile(0.25)
Q3=df4.groupby(['Country Code','Indicator Name']).quantile(0.75)
IQR=Q3-Q1
IQR

Unnamed: 0_level_0,Unnamed: 1_level_0,Date,Value
Country Code,Indicator Name,Unnamed: 2_level_1,Unnamed: 3_level_1
ARE,Alcohol per capita,10.0,6.400000e-01
ARE,Education GExp,0.0,0.000000e+00
ARE,Employment-agriculture,14.0,5.130000e+00
ARE,Employment-industry,14.0,1.449997e+00
ARE,Employment-services,14.0,3.830002e+00
...,...,...,...
ZAF,Ninis,12.5,2.997499e+00
ZAF,R&D GExp,8.0,1.011100e-01
ZAF,Renewable electricity,12.5,2.292500e+08
ZAF,Suicide,9.5,1.025000e+00


In [243]:
lower_limit=Q1 - 1.5 * IQR
lower=lower_limit.drop(['Date'],axis=1)
lower.rename(columns={"Value":"Lower limit"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Lower limit
Country Code,Indicator Name,Unnamed: 2_level_1
ARE,Alcohol per capita,2.190000e+00
ARE,Education GExp,1.026766e+01
ARE,Employment-agriculture,-4.905000e+00
ARE,Employment-industry,3.131501e+01
ARE,Employment-services,5.269500e+01
...,...,...
ZAF,Ninis,2.684375e+01
ZAF,R&D GExp,5.828450e-01
ZAF,Renewable electricity,-2.623750e+08
ZAF,Suicide,2.203750e+01


In [244]:
upper_limit=Q3 + 1.5 * IQR
upper=upper_limit.drop(['Date'],axis=1)
upper.rename(columns={"Value":"Upper limit"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Upper limit
Country Code,Indicator Name,Unnamed: 2_level_1
ARE,Alcohol per capita,4.750000e+00
ARE,Education GExp,1.026766e+01
ARE,Employment-agriculture,1.561500e+01
ARE,Employment-industry,3.711499e+01
ARE,Employment-services,6.801500e+01
...,...,...
ZAF,Ninis,3.883375e+01
ZAF,R&D GExp,9.872850e-01
ZAF,Renewable electricity,6.546250e+08
ZAF,Suicide,2.613750e+01


Tercero: unir las tres tablas con country e indicador.

In [248]:
dfs = [df4,lower,upper]
import functools as ft
df_joined = ft.reduce(lambda left, right: pd.merge(left, right, on=['Country Code','Indicator Name']), dfs)
df_joined

Unnamed: 0,Country Code,Indicator Name,Date,Value_x,Value_y,Value
0,DZA,Exports-Commercial services,1990,4.795977e+08,1.736231e+09,4.453536e+09
1,DZA,Exports-Commercial services,1991,3.747657e+08,1.736231e+09,4.453536e+09
2,DZA,Exports-Commercial services,2005,2.466000e+09,1.736231e+09,4.453536e+09
3,DZA,Exports-Commercial services,2006,2.512000e+09,1.736231e+09,4.453536e+09
4,DZA,Exports-Commercial services,2007,2.786733e+09,1.736231e+09,4.453536e+09
...,...,...,...,...,...,...
19939,YEM,Alcohol per capita,2000,7.900000e-01,-3.725000e-01,7.675000e-01
19940,YEM,Alcohol per capita,2005,3.400000e-01,-3.725000e-01,7.675000e-01
19941,YEM,Alcohol per capita,2010,1.800000e-01,-3.725000e-01,7.675000e-01
19942,YEM,Alcohol per capita,2015,5.500000e-02,-3.725000e-01,7.675000e-01


In [249]:
list(df_joined)

['Country Code', 'Indicator Name', 'Date', 'Value_x', 'Value_y', 'Value']

In [250]:
renamed=df_joined.set_axis(['Country','Indicator','Year', 'Real value', 'Lower value', 'Upper value'], axis=1, inplace=False)
renamed

Unnamed: 0,Country,Indicator,Year,Real value,Lower value,Upper value
0,DZA,Exports-Commercial services,1990,4.795977e+08,1.736231e+09,4.453536e+09
1,DZA,Exports-Commercial services,1991,3.747657e+08,1.736231e+09,4.453536e+09
2,DZA,Exports-Commercial services,2005,2.466000e+09,1.736231e+09,4.453536e+09
3,DZA,Exports-Commercial services,2006,2.512000e+09,1.736231e+09,4.453536e+09
4,DZA,Exports-Commercial services,2007,2.786733e+09,1.736231e+09,4.453536e+09
...,...,...,...,...,...,...
19939,YEM,Alcohol per capita,2000,7.900000e-01,-3.725000e-01,7.675000e-01
19940,YEM,Alcohol per capita,2005,3.400000e-01,-3.725000e-01,7.675000e-01
19941,YEM,Alcohol per capita,2010,1.800000e-01,-3.725000e-01,7.675000e-01
19942,YEM,Alcohol per capita,2015,5.500000e-02,-3.725000e-01,7.675000e-01


In [251]:
sin_outliers=renamed.loc[~((renamed['Real value']<renamed['Lower value']) | (renamed['Real value']>renamed['Upper value']))]
sin_outliers

Unnamed: 0,Country,Indicator,Year,Real value,Lower value,Upper value
2,DZA,Exports-Commercial services,2005,2.466000e+09,1.736231e+09,4.453536e+09
3,DZA,Exports-Commercial services,2006,2.512000e+09,1.736231e+09,4.453536e+09
4,DZA,Exports-Commercial services,2007,2.786733e+09,1.736231e+09,4.453536e+09
5,DZA,Exports-Commercial services,2008,3.412421e+09,1.736231e+09,4.453536e+09
6,DZA,Exports-Commercial services,2009,2.744716e+09,1.736231e+09,4.453536e+09
...,...,...,...,...,...,...
19938,YEM,Suicide,2019,5.800000e+00,5.400000e+00,6.200000e+00
19940,YEM,Alcohol per capita,2005,3.400000e-01,-3.725000e-01,7.675000e-01
19941,YEM,Alcohol per capita,2010,1.800000e-01,-3.725000e-01,7.675000e-01
19942,YEM,Alcohol per capita,2015,5.500000e-02,-3.725000e-01,7.675000e-01


In [252]:
df_limpio=sin_outliers.drop(['Lower value','Upper value'],axis=1)
df_limpio

Unnamed: 0,Country,Indicator,Year,Real value
2,DZA,Exports-Commercial services,2005,2.466000e+09
3,DZA,Exports-Commercial services,2006,2.512000e+09
4,DZA,Exports-Commercial services,2007,2.786733e+09
5,DZA,Exports-Commercial services,2008,3.412421e+09
6,DZA,Exports-Commercial services,2009,2.744716e+09
...,...,...,...,...
19938,YEM,Suicide,2019,5.800000e+00
19940,YEM,Alcohol per capita,2005,3.400000e-01
19941,YEM,Alcohol per capita,2010,1.800000e-01
19942,YEM,Alcohol per capita,2015,5.500000e-02


In [253]:
transpuesto=df_limpio.set_index(["Country", "Year"]).pivot(columns="Indicator", values="Real value").reset_index()
transpuesto

Indicator,Country,Year,Alcohol per capita,Education GExp,Employment-agriculture,Employment-industry,Employment-services,Exports-Commercial services,Exports-G&S,Fertility rate,...,International taxes,Literacy rate,"Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)",Mortality-infants,Net migration,Ninis,R&D GExp,Renewable electricity,Suicide,Workers high education
0,ARE,1990,,,,,,,,4.454,...,,,,672.0,,,,0.0,,
1,ARE,1991,,,8.46,33.330002,58.200001,,,4.253,...,,,,645.0,,,,0.0,,
2,ARE,1992,,,8.37,33.360001,58.279999,,,4.041,...,,,,618.0,368126.0,,,0.0,,
3,ARE,1993,,,8.24,33.470001,58.290001,,,3.827,...,,,,592.0,,,,0.0,,
4,ARE,1994,,,8.13,33.490002,58.380001,,,3.618,...,,,,568.0,,,,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1472,ZAF,2017,,18.719290,5.28,23.340000,71.379997,1.614806e+10,1.042884e+11,2.430,...,4.993941e+10,87.046669,,32777.0,727026.0,31.010000,0.83215,,25.2,83.809998
1473,ZAF,2018,9.52,18.901590,5.16,23.129999,71.709999,1.670823e+10,1.112854e+11,2.405,...,5.572291e+10,,,31810.0,,31.559999,,,24.1,82.879997
1474,ZAF,2019,,19.596230,5.28,22.309999,72.410004,1.554886e+10,1.060698e+11,2.381,...,5.522342e+10,95.022972,,30937.0,,32.459999,,,23.5,82.019997
1475,ZAF,2020,,19.527281,,,,8.404204e+09,9.317915e+10,2.358,...,,,,30153.0,,32.400002,,,,


In [255]:
transpuesto.isna().sum()

Indicator
Country                      0
Year                         0
Alcohol per capita        1251
Education GExp             749
Employment-agriculture     121
                          ... 
Ninis                      978
R&D GExp                   833
Renewable electricity      369
Suicide                    552
Workers high education     893
Length: 24, dtype: int64

PENDIENTE POR HACER

Separar el dataset por países.

Una vez separados, buscar los nan. Si hay un valor Nan, sustituir por la media de toda la columna (ya que cada grupo es un único país) y si toda es Nan, eliminar dicha columna.

In [259]:
country_grouped = transpuesto.groupby('Country')
country_grouped.to_csv('country partitions')

AttributeError: 'DataFrameGroupBy' object has no attribute 'to_csv'

DE AQUÍ EN ADELANTE EN SUCIO

--------

INTENTO DE SACAR MEDIA POR PAÍS Y SUSTITUIR EN AÑOS CON NULL. FALLO PORQUE SON DE DISTINTOS TAMAÑOS.

In [254]:
mean=transpuesto.groupby('Country').mean()
mean

Indicator,Year,Alcohol per capita,Education GExp,Employment-agriculture,Employment-industry,Employment-services,Exports-Commercial services,Exports-G&S,Fertility rate,Foreign investment,...,International taxes,Literacy rate,"Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)",Mortality-infants,Net migration,Ninis,R&D GExp,Renewable electricity,Suicide,Workers high education
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ARE,2005.0,3.4120,10.267660,5.204483,34.332759,60.463448,,2.529880e+11,2.420065,,...,5.687960e+07,93.795387,54.7,597.096774,5.052272e+05,,0.863376,0.000000e+00,7.570000,80.820000
ARG,2005.0,9.1400,14.879010,0.720345,23.424400,75.806400,8.086058e+09,4.985330e+10,2.501677,-5.369811e+09,...,3.377443e+10,99.018743,26.6,11383.838710,-7.066667e+04,19.279167,0.493970,1.175615e+09,8.738889,82.304285
AUT,2005.5,12.4280,10.720981,5.635517,28.200690,66.165518,6.063244e+10,1.505921e+11,1.433226,6.588530e+09,...,-4.100000e+04,,15.3,337.607143,2.028288e+05,7.400000,2.473488,3.884692e+09,16.361111,76.605238
AZE,2005.5,3.4440,11.809866,39.469655,12.764138,47.767587,2.007968e+09,1.427552e+10,1.937058,-6.294689e+08,...,4.716260e+08,99.782907,63.9,7774.548387,-2.039367e+04,,0.254667,0.000000e+00,4.247368,
BGD,2005.5,0.1180,14.903409,54.253103,14.862414,30.887586,1.042460e+09,1.640451e+10,2.877548,-8.497250e+08,...,2.017016e+11,61.730904,149.0,183494.903226,-1.739225e+06,29.915999,,0.000000e+00,4.390000,79.083331
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
USA,2005.5,9.5460,,1.544138,22.442414,76.014828,4.412045e+11,1.478503e+12,1.955984,1.926563e+10,...,2.627814e+10,,13.3,27441.483871,4.992630e+06,13.172857,2.648437,1.019440e+11,13.220000,76.839642
VEN,2005.0,7.6400,18.579773,9.685172,22.213704,68.531035,1.402185e+09,4.751239e+10,2.698032,-1.188000e+09,...,,95.358164,34.6,10561.800000,-1.321620e+05,16.512000,0.243271,0.000000e+00,3.520000,71.364999
VNM,2005.5,5.2520,16.347795,55.376896,17.747931,26.876897,7.081860e+09,8.023430e+10,2.045769,-6.370960e+09,...,,92.427185,64.5,31836.518519,-4.768672e+05,9.447273,0.344926,4.561538e+07,6.940000,87.338181
YEM,2005.0,0.1565,24.433267,36.354138,14.085172,49.562069,9.239832e+08,5.947737e+09,5.702548,-1.036230e+08,...,,45.594999,194.2,43631.225806,-1.005856e+05,44.770000,,0.000000e+00,5.815000,87.525002


In [257]:
transpuesto.fillna(mean,inplace=True)
transpuesto

Indicator,Country,Year,Alcohol per capita,Education GExp,Employment-agriculture,Employment-industry,Employment-services,Exports-Commercial services,Exports-G&S,Fertility rate,...,International taxes,Literacy rate,"Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)",Mortality-infants,Net migration,Ninis,R&D GExp,Renewable electricity,Suicide,Workers high education
0,ARE,1990,,,,,,,,4.454,...,,,,672.0,,,,0.0,,
1,ARE,1991,,,8.46,33.330002,58.200001,,,4.253,...,,,,645.0,,,,0.0,,
2,ARE,1992,,,8.37,33.360001,58.279999,,,4.041,...,,,,618.0,368126.0,,,0.0,,
3,ARE,1993,,,8.24,33.470001,58.290001,,,3.827,...,,,,592.0,,,,0.0,,
4,ARE,1994,,,8.13,33.490002,58.380001,,,3.618,...,,,,568.0,,,,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1472,ZAF,2017,,18.719290,5.28,23.340000,71.379997,1.614806e+10,1.042884e+11,2.430,...,4.993941e+10,87.046669,,32777.0,727026.0,31.010000,0.83215,,25.2,83.809998
1473,ZAF,2018,9.52,18.901590,5.16,23.129999,71.709999,1.670823e+10,1.112854e+11,2.405,...,5.572291e+10,,,31810.0,,31.559999,,,24.1,82.879997
1474,ZAF,2019,,19.596230,5.28,22.309999,72.410004,1.554886e+10,1.060698e+11,2.381,...,5.522342e+10,95.022972,,30937.0,,32.459999,,,23.5,82.019997
1475,ZAF,2020,,19.527281,,,,8.404204e+09,9.317915e+10,2.358,...,,,,30153.0,,32.400002,,,,


CÓDIGO ORIGINAL

In [200]:
means=df4.groupby(['Country Code','Indicator Name']).mean()
Mean_value=means.drop(['Date'],axis=1)
Mean_value

Unnamed: 0_level_0,Unnamed: 1_level_0,Value
Country Code,Indicator Name,Unnamed: 2_level_1
ARE,Alcohol per capita,3.412000e+00
ARE,Education GExp,1.026766e+01
ARE,Employment-agriculture,5.204483e+00
ARE,Employment-industry,3.433276e+01
ARE,Employment-services,6.046345e+01
...,...,...
ZAF,Ninis,3.288125e+01
ZAF,R&D GExp,7.877506e-01
ZAF,Renewable electricity,4.568846e+08
ZAF,Suicide,2.424000e+01


In [44]:
df5=df4.set_index(["Country Code", "Date"]).pivot(columns="Indicator Name", values="Value").reset_index()
df5

Indicator Name,Country Code,Date,CPIA gender equality rating (1=low to 6=high),Commercial service exports (current US$),"Electricity production from renewable sources, excluding hydroelectric (kWh)",Employment in agriculture (% of total employment) (modeled ILO estimate),Employment in industry (% of total employment) (modeled ILO estimate),Employment in services (% of total employment) (modeled ILO estimate),Exports of goods and services (current US$),"Fertility rate, total (births per woman)",...,"Literacy rate, adult total (% of people ages 15 and above)","Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)",Net migration,Number of infant deaths,People using at least basic sanitation services (% of population),Research and development expenditure (% of GDP),"Share of youth not in education, employment or training, total (% of youth population)","Suicide mortality rate (per 100,000 population)",Taxes on international trade (current LCU),"Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)"
0,ARE,1990,,,0.0,,,,,4.454,...,,,,672.0,,,,,,
1,ARE,1991,,,0.0,8.46,33.330002,58.200001,,4.253,...,,,,645.0,,,,,,
2,ARE,1992,,,0.0,8.37,33.360001,58.279999,,4.041,...,,,368126.0,618.0,,,,,,
3,ARE,1993,,,0.0,8.24,33.470001,58.290001,,3.827,...,,,,592.0,,,,,,
4,ARE,1994,,,0.0,8.13,33.490002,58.380001,,3.618,...,,,,568.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1472,ZAF,2017,,1.614806e+10,,5.28,23.340000,71.379997,1.042884e+11,2.430,...,87.046669,,727026.0,32777.0,75.770868,0.83215,31.010000,25.2,4.993941e+10,
1473,ZAF,2018,,1.670823e+10,,5.16,23.129999,71.709999,1.112854e+11,2.405,...,,,,31810.0,76.683188,,31.559999,24.1,5.572291e+10,9.52
1474,ZAF,2019,,1.554886e+10,,5.28,22.309999,72.410004,1.060698e+11,2.381,...,95.022972,,,30937.0,77.584480,,32.459999,23.5,5.522342e+10,
1475,ZAF,2020,,8.404204e+09,,,,,9.317915e+10,2.358,...,,,,30153.0,78.474611,,32.400002,,,


In [46]:
df5.columns=['Country','Year','Gender equality','Exports-Commercial services','Renewable electricity','Employment-agriculture','Employment-industry','Employment-services','Exports-G&S','Fertility rate','Foreign investment','GDP','Education GExp','Workers high education','Literacy rate','Mortality-pollution','Net migration','Mortality-infants','Health services use','R&D GExp','Ninis','Suicide','International taxes','Alcohol per capita']

In [47]:
list(df5)
df5

Unnamed: 0,Country,Year,Gender equality,Exports-Commercial services,Renewable electricity,Employment-agriculture,Employment-industry,Employment-services,Exports-G&S,Fertility rate,...,Literacy rate,Mortality-pollution,Net migration,Mortality-infants,Health services use,R&D GExp,Ninis,Suicide,International taxes,Alcohol per capita
0,ARE,1990,,,0.0,,,,,4.454,...,,,,672.0,,,,,,
1,ARE,1991,,,0.0,8.46,33.330002,58.200001,,4.253,...,,,,645.0,,,,,,
2,ARE,1992,,,0.0,8.37,33.360001,58.279999,,4.041,...,,,368126.0,618.0,,,,,,
3,ARE,1993,,,0.0,8.24,33.470001,58.290001,,3.827,...,,,,592.0,,,,,,
4,ARE,1994,,,0.0,8.13,33.490002,58.380001,,3.618,...,,,,568.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1472,ZAF,2017,,1.614806e+10,,5.28,23.340000,71.379997,1.042884e+11,2.430,...,87.046669,,727026.0,32777.0,75.770868,0.83215,31.010000,25.2,4.993941e+10,
1473,ZAF,2018,,1.670823e+10,,5.16,23.129999,71.709999,1.112854e+11,2.405,...,,,,31810.0,76.683188,,31.559999,24.1,5.572291e+10,9.52
1474,ZAF,2019,,1.554886e+10,,5.28,22.309999,72.410004,1.060698e+11,2.381,...,95.022972,,,30937.0,77.584480,,32.459999,23.5,5.522342e+10,
1475,ZAF,2020,,8.404204e+09,,,,,9.317915e+10,2.358,...,,,,30153.0,78.474611,,32.400002,,,


TIL HERE-INTEGRATION

In [100]:
dfp=df5.set_index(["Country", "Year"]).pivot(columns=, values="Value").reset_index()

KeyError: 'Indicator Name'

Get the mean of each column by country.

In [60]:
mean_value=df5.groupby('Country').mean()
mean_value

Unnamed: 0_level_0,Year,Gender equality,Exports-Commercial services,Renewable electricity,Employment-agriculture,Employment-industry,Employment-services,Exports-G&S,Fertility rate,Foreign investment,...,Literacy rate,Mortality-pollution,Net migration,Mortality-infants,Health services use,R&D GExp,Ninis,Suicide,International taxes,Alcohol per capita
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ARE,2005.0,,,2.680769e+07,5.204483,34.332759,60.463448,2.529880e+11,2.420065,,...,93.795387,54.7,1.019762e+06,597.096774,98.121723,0.863376,,7.570,5.687960e+07,3.4120
ARG,2005.0,,8.086058e+09,1.175615e+09,0.720345,24.454483,74.825172,4.985330e+10,2.501677,-5.914572e+09,...,98.618455,26.6,-7.066667e+04,11383.838710,91.407733,0.493970,19.618571,8.890,5.575804e+10,9.1400
AUT,2005.5,,6.063244e+10,3.884692e+09,5.635517,28.200690,66.165518,1.505921e+11,1.433226,6.588530e+09,...,,15.3,2.028288e+05,370.741935,99.985922,2.473488,7.603529,16.685,1.175665e+08,12.4280
AZE,2005.5,4.0000,2.007968e+09,1.019231e+07,39.469655,12.764138,47.767587,1.427552e+10,2.062371,-4.075839e+08,...,99.684505,63.9,-2.039367e+04,7774.548387,82.292682,0.254667,,4.190,4.716260e+08,3.4440
BGD,2005.5,3.5625,1.162060e+09,1.719231e+07,54.253103,14.862414,30.887586,1.640451e+10,2.877548,-8.497250e+08,...,61.730904,149.0,-1.739225e+06,183494.903226,38.960588,,28.298333,4.390,2.017016e+11,0.1180
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
USA,2005.5,,4.412045e+11,1.242732e+11,1.544138,22.442414,76.014828,1.478503e+12,1.955984,7.536710e+09,...,,13.3,5.637184e+06,27441.483871,99.821926,2.648437,13.172857,13.220,2.930337e+10,9.5460
VEN,2005.0,,1.402185e+09,0.000000e+00,9.685172,21.785172,68.531035,4.751239e+10,2.698032,-1.187593e+09,...,94.567748,34.6,-6.545088e+05,10829.709677,94.333506,0.243271,16.512000,3.520,,7.6400
VNM,2005.5,4.5000,7.081860e+09,4.561538e+07,55.376896,17.747931,26.876897,8.023430e+10,2.239065,-6.370960e+09,...,92.427185,64.5,-4.768672e+05,36290.838710,71.100662,0.344926,10.304615,6.940,,5.2520
YEM,2005.0,1.8750,9.239832e+08,0.000000e+00,36.354138,14.085172,49.562069,5.947737e+09,5.702548,-2.245402e+08,...,45.594999,194.2,1.747550e+04,43631.225806,49.644188,,44.770000,5.815,,0.2832


Fill null values of each indicator of each country with the mean computed previously.

In [61]:
df6=df5.fillna(value=mean_value)

In [50]:
df6

Unnamed: 0,Country,Year,Gender equality,Exports-Commercial services,Renewable electricity,Employment-agriculture,Employment-industry,Employment-services,Exports-G&S,Fertility rate,...,Literacy rate,Mortality-pollution,Net migration,Mortality-infants,Health services use,R&D GExp,Ninis,Suicide,International taxes,Alcohol per capita
0,ARE,1990,,,0.0,,,,,4.454,...,,,,672.0,,,,,,
1,ARE,1991,,,0.0,8.46,33.330002,58.200001,,4.253,...,,,,645.0,,,,,,
2,ARE,1992,,,0.0,8.37,33.360001,58.279999,,4.041,...,,,368126.0,618.0,,,,,,
3,ARE,1993,,,0.0,8.24,33.470001,58.290001,,3.827,...,,,,592.0,,,,,,
4,ARE,1994,,,0.0,8.13,33.490002,58.380001,,3.618,...,,,,568.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1472,ZAF,2017,,1.614806e+10,,5.28,23.340000,71.379997,1.042884e+11,2.430,...,87.046669,,727026.0,32777.0,75.770868,0.83215,31.010000,25.2,4.993941e+10,
1473,ZAF,2018,,1.670823e+10,,5.16,23.129999,71.709999,1.112854e+11,2.405,...,,,,31810.0,76.683188,,31.559999,24.1,5.572291e+10,9.52
1474,ZAF,2019,,1.554886e+10,,5.28,22.309999,72.410004,1.060698e+11,2.381,...,95.022972,,,30937.0,77.584480,,32.459999,23.5,5.522342e+10,
1475,ZAF,2020,,8.404204e+09,,,,,9.317915e+10,2.358,...,,,,30153.0,78.474611,,32.400002,,,


In [78]:
grouped=df6.groupby('Country')

Cálculo de los percentiles para posterior eliminación de los outliers. 
- Q1: first quartile (25%)
- Q3: third quartile (75%)
- IQR: interquartile range

Referencias: https://www.pluralsight.com/guides/cleaning-up-data-from-outliers, https://careerfoundry.com/en/blog/data-analytics/how-to-find-outliers/

In [84]:
Q1=grouped.quantile(0.25)
Q3=grouped.quantile(0.75)
IQR=Q3-Q1
dflower_limit=Q1 - 1.5 * IQR
dfupper_limit=Q3 + 1.5 * IQR

if df6 value (indicator x, country y, year z) is <lower_limit or >upper_limit (indicator x, country y, independent z) 


In [87]:
dfupper_limit

Unnamed: 0_level_0,Year,Gender equality,Exports-Commercial services,Renewable electricity,Employment-agriculture,Employment-industry,Employment-services,Exports-G&S,Fertility rate,Foreign investment,...,Literacy rate,Mortality-pollution,Net migration,Mortality-infants,Health services use,R&D GExp,Ninis,Suicide,International taxes,Alcohol per capita
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ARE,2035.0,,,0.000000e+00,15.615000,37.114994,68.015003,7.435685e+11,4.94200,,...,101.318459,54.7,2100229.000,830.50,100.516611,1.368750,,10.5000,3.117321e+08,4.7500
ARG,2035.0,,2.633594e+10,3.713750e+09,1.950000,28.079998,79.865013,1.436012e+11,3.11600,6.836118e+09,...,99.475943,26.6,165000.000,21646.25,99.871008,0.778515,20.885002,10.1000,1.837167e+11,10.7500
AUT,2036.5,,8.558712e+10,1.340338e+10,8.964999,35.760000,77.904991,4.310484e+11,1.57750,2.331400e+10,...,,15.3,403570.875,640.50,100.021697,4.374741,8.605000,19.2625,9.768750e+06,13.8700
AZE,2036.5,4.0000,9.543696e+09,0.000000e+00,46.824997,17.695000,52.595001,5.715226e+10,2.31250,1.249584e+09,...,99.821364,63.9,197371.875,20079.50,129.468695,0.471093,,5.3125,1.550540e+09,6.5700
BGD,2036.5,4.9375,3.359632e+09,0.000000e+00,93.510004,29.584999,53.360003,6.202523e+10,5.24925,1.903986e+09,...,95.441296,149.0,829308.125,470174.00,69.987950,,35.250000,6.9500,5.249663e+11,0.3685
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
USA,2036.5,,1.326866e+12,2.627162e+11,2.230000,32.435000,86.785007,3.996301e+12,2.31650,2.798625e+11,...,,13.3,6281799.500,37114.00,100.056181,3.040595,16.421250,17.4000,5.648375e+10,10.5100
VEN,2035.0,,2.244500e+09,0.000000e+00,15.135001,26.849998,78.535000,1.488264e+11,3.74725,2.830750e+09,...,99.333979,34.6,537277.500,16539.50,97.122320,0.463359,29.272498,7.6500,,12.6450
VNM,2036.5,4.5000,2.235550e+10,1.931250e+08,92.070002,35.250002,45.435000,3.170287e+11,2.81550,9.741000e+09,...,99.825373,64.5,-106818.375,59454.25,109.433566,0.813730,11.515002,9.3625,,15.7050
YEM,2035.0,2.7500,2.925226e+09,0.000000e+00,71.894999,23.010000,80.559999,2.121337e+10,10.58125,9.523454e+08,...,62.604998,194.2,-6527.500,69184.00,57.481794,,44.770000,6.2000,,0.7675
