# EXTRACTION

Import libraries and functions.

In [2]:
import pandas as pd
import numpy as np
import glob
import os
from pyspark.sql.functions import concat, col, lit, split

Firstly we load the database from World Data Bank that has been downloaded and extracted in the *Data extraction* notebook. We acquire it from the predetermined path that is on our computer.

In [5]:
df= pd.read_csv (os.getcwd()+'/Data/''WDIData.csv')
df

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
0,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,16.936004,17.337896,17.687093,18.140971,18.491344,18.825520,19.272212,19.628009,,
1,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.RU.ZS,,,,,,,...,6.499471,6.680066,6.859110,7.016238,7.180364,7.322294,7.517191,7.651598,,
2,Africa Eastern and Southern,AFE,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.UR.ZS,,,,,,,...,37.855399,38.046781,38.326255,38.468426,38.670044,38.722783,38.927016,39.042839,,
3,Africa Eastern and Southern,AFE,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,31.794160,32.001027,33.871910,38.880173,40.261358,43.061877,44.270860,45.803485,,
4,Africa Eastern and Southern,AFE,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,,,,,,,...,18.663502,17.633986,16.464681,24.531436,25.345111,27.449908,29.641760,30.404935,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384365,Zimbabwe,ZWE,Women who believe a husband is justified in be...,SG.VAW.REFU.ZS,,,,,,,...,,,14.500000,,,,,,,
384366,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,,,,,,,...,,,3.700000,,,,5.418352,,,
384367,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,,,...,,33.500000,32.400000,,,,33.658057,,,
384368,Zimbabwe,ZWE,Women's share of population ages 15+ living wi...,SH.DYN.AIDS.FE.ZS,,,,,,,...,59.200000,59.400000,59.500000,59.700000,59.900000,60.000000,60.200000,60.400000,,


# INTEGRATION

Moreover, to work more comfortably, we remove those columns not useful for us, as *Country Name* and *Indicator Code*, since with the *Country Code*, *Value* and the *Indicator Name* we have the relevant information.

In [6]:
df.drop(columns=["Country Name","Indicator Code"], axis=1, inplace=True)

FILTER 1: BY COUNTRY

From the almost two hundred countries we have information about in the worldwide database, we have decided to study 50 of them, grouping them by geographical and economical similiarities. With this, we can keep in our dataframe the selected countries.

Criteria for grouping:
- Europe: Germany, France, Sweden, United Kingdom, Spain, Croatia, Poland, Greece, Austria and Netherlands.

*Interesting countries of the European continent that can reflect events such as the Brexit process, the 2008 crisis or their historical strength.*
- Persian Gulf: Iraq, Qatar, United Arab Emirates, Arabia Saudita, Azerbayan, Yemen, Yemen Democratic and Oman.

*Countries located in the Persian Gulf, which have a similar economy based mainly on petrol and social structures.*
- North Africa: Algeria, Egiypt, Lybia, Israel, Turkey and Morroco.

*Countries of the african continent that are middle developed and with high mobility of people and goods.*
- South Africa: Senegal, South Africa, Liberia, Mozambique, Cameroon, Nigeria and Ghana.

*Countries of the south and central africa that are mainly subdeveloped and considered some of the poorest countries worldwide; but, on the contrary, one of them is highly developed.*
- Asia: Bangladesh, India, Vietnam, Thailand, Indonesia, Philipines and Korea (South).

*Converted in the last decades in the manufacturing of the world, they are subdeveloped countries with high population and childhood.*
- Latin America: Mexico, Brasil, Argentina, Peru, Venezuela, Colombia, Chile, Panama and Costa Rica.

*Countries located in same continet and some with singular political structures.* 
- Pair: USA and China.

*Although these countries seem to be confronted between them, they have been the top two most growing worlwide, despite the fact that culturally and economically they are completely distant.*


In [7]:
europe_list=['DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD']
persian_list=['IRQ','QAT','ARE','SAU','AZE','YEM','YDR','OMN']
naf_list=['DZA','EGY','LBY','ISR','TUR','MAR']
saf_list=['SEN','ZAF','LBR','MOZ','CMR','NGA','GHA']
asia_list=['BGD','IND','VNM','THA','IDN','PHL','KOR']
latam_list=['MEX','BRA','ARG','PER','VEN','COL','CHL','PCZ','CRI']
two_list=['USA','CHN']
country_list=europe_list+persian_list+naf_list+saf_list+asia_list+latam_list+two_list 

In [8]:
df1=df.loc[df['Country Code'].isin(country_list)]

Now we transpose the rows of years into the columns.

In [9]:
df2=(df1.set_index(["Country Code", "Indicator Name"]).stack().reset_index(name='Value').rename(columns={'level_2':'Date'}))
df2

Unnamed: 0,Country Code,Indicator Name,Date,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1729300,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1729301,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1729302,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1729303,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


FILTER 2: BY YEAR

Our time range covers from 1960 to 2021. However, the record is not uniform and complete for all areas and indicators. We can appreaciate that specially in the first years of the last century, so many data is missing, then it makes no sense to study it. Besides, for the year 2021 many data is also lacking. Therefore, we would delimit our study between 1990 and 2020.

In [10]:
df2[['Date']] = df2[['Date']].astype(int)

In [11]:
df2.dtypes

Country Code       object
Indicator Name     object
Date                int32
Value             float64
dtype: object

In [12]:
df3 = df2[df2['Date'] > 1989]
df3

Unnamed: 0,Country Code,Indicator Name,Date,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
1729300,YEM,Young people (ages 15-24) newly infected with HIV,2016,200.0
1729301,YEM,Young people (ages 15-24) newly infected with HIV,2017,200.0
1729302,YEM,Young people (ages 15-24) newly infected with HIV,2018,200.0
1729303,YEM,Young people (ages 15-24) newly infected with HIV,2019,200.0


FILTER 3: BY INDICATOR

As there are lots of indicators that have very similar meaning we have decided to select some indicators to perform the study (**Indicator group** = *Name of the selected indicator*):
- **GDP** = *GDP (current US$), measures the monetary value of final goods and services produced in a country at a given period of time.*
- **Literacy** = *Literacy rate, % of people ages 15 and above which are able to expand one's knowledge of reading and writing in order to develop one's thinking and learning for the purpose of understanding oneself and the world. Government expenditure on education, total % of government expenditure incurred on education service.*
- **Migration** = *Net migration, difference between the number of immigrants (people coming into an area) and the number of emigrants (people leaving an area) throughout the year.*
- **Exports** = *Commercial service exports (current US$)* and *Exports of goods and services (current US$). Exports term is referred to the goods and services which are produced in a country and sold to buyers in another one.*
- **International trading** = *Taxes on international trade.*
- **Fertility** = *Fertility rate, mean of total births per woman. How many childs have born during a year per women.*
- **Healthcare** = *% of people using at least basic sanitation services. Amount of children covert by sanitation.*
- **Employment** = *Employment in agriculture (% of total employment), *Employment in services (% of total employment), and *Employment in industry (% of total employment). Amount of people employed in these three relevant sectors.*
- **Renewable energy** = *Electricity production from renewable sources, excluding hydroelectric. The units are KWh.*
- **Mortality** = *Number of infant deaths.*
- **Outside investment** = *Foreign direct investment, which is the net inflow of investment to acquire a lasting management interest  (BoP, current US$).*
- **Pollution** = *Mortality rate over 100,000 population attributed to household and ambient air pollution and age-standardized.*
- **Alcoholism** = *Total alcohol consumed per capita measure in liters of pure alcohol, taking into account people who are 15 or more years of age.*
- **Tech adoption** = *% of GDP which goes to the research and development expenditure.*
- **Workers high education** = *Labor force with advanced education. % of total working-age population with high level education. It measures the probability of having a good job according to the studies.*
- **Optimisim and pessimisim** = *Suicide mortality rate per 100,000 population.*
- **Gender equality** = *Rate of gender equality in a country between  (**1=low to **6=high). It assesses the extent to which the country has installed institutions and programs to enforce laws and policies that promote equal access for men and women in education, health, the economy, and protection under law.*
- **Education** = *Share of youth not in education, employment or training, total, Total number of young people.* and *Government expenditure on education of total. *

To acomplish this, we use the function `isin` that will allow us to only select the the indicators afromentioned, that have been compilied in the list called *indicators_list*

In [13]:
indicators_list=['GDP (current US$)','Literacy rate, adult total (% of people ages 15 and above)', 'Government expenditure on education, total (% of government expenditure)','Net migration','Commercial service exports (current US$)','Exports of goods and services (current US$)','Taxes on international trade (current LCU)','Fertility rate, total (births per woman)','People using at least basic sanitation services (% of population)','Employment in agriculture (% of total employment) (modeled ILO estimate)','Employment in services (% of total employment) (modeled ILO estimate)','Employment in industry (% of total employment) (modeled ILO estimate)','Electricity production from renewable sources, excluding hydroelectric (kWh)','Number of infant deaths','Number of infant deaths','Foreign direct investment, net (BoP, current US$)','Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)','Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)','Research and development expenditure (% of GDP)','Labor force with advanced education (% of total working-age population with advanced education)','Suicide mortality rate (per 100,000 population)','CPIA gender equality rating (1=low to 6=high)','Share of youth not in education, employment or training, total (% of youth population)','Government expenditure on education, total (% of government expenditure)']

In [14]:
df4=df3.loc[df3['Indicator Name'].isin(indicators_list)]
pd.set_option('display.max_rows', 10)
df4

Unnamed: 0,Country Code,Indicator Name,Date,Value
5334,DZA,Commercial service exports (current US$),1990,4.795977e+08
5335,DZA,Commercial service exports (current US$),1991,3.747657e+08
5336,DZA,Commercial service exports (current US$),2005,2.466000e+09
5337,DZA,Commercial service exports (current US$),2006,2.512000e+09
5338,DZA,Commercial service exports (current US$),2007,2.786733e+09
...,...,...,...,...
1727918,YEM,Total alcohol consumption per capita (liters o...,2000,7.900000e-01
1727919,YEM,Total alcohol consumption per capita (liters o...,2005,3.400000e-01
1727920,YEM,Total alcohol consumption per capita (liters o...,2010,1.800000e-01
1727921,YEM,Total alcohol consumption per capita (liters o...,2015,5.500000e-02


# NORMALIZATION

Taking as reference both works of https://www.pluralsight.com/guides/cleaning-up-data-from-outliers and https://careerfoundry.com/en/blog/data-analytics/how-to-find-outliers/, for normalizing our data we need to start computing the outliers and removing them from our dataframe. As there is not a direct function of pandas that performs this step, it´s been step-by-step code, where we begin with the computation of the quartiles, then the IQR (Inter Quartile Range) and finally the upper and lower limit.

Firstly, what we have done is to change the name of our indicators, as their original denomination is not easy to handle.

In [15]:
df4['Indicator Name']=df4['Indicator Name'].replace(['CPIA gender equality rating (1=low to 6=high)','Commercial service exports (current US$)','Electricity production from renewable sources, excluding hydroelectric (kWh)','Employment in agriculture (% of total employment) (modeled ILO estimate)','Employment in industry (% of total employment) (modeled ILO estimate)','Employment in services (% of total employment) (modeled ILO estimate)','Exports of goods and services (current US$)','Fertility rate, total (births per woman)','Foreign direct investment, net (BoP, current US$)','GDP (current US$)','Government expenditure on education, total (% of government expenditure)','Labor force with advanced education (% of total working-age population with advanced education)','Literacy rate, adult total (% of people ages 15 and above)','Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)','Net migration','Number of infant deaths','People using at least basic sanitation services (% of population)','Research and development expenditure (% of GDP)','Share of youth not in education, employment or training, total (% of youth population)','Suicide mortality rate (per 100,000 population)','Taxes on international trade (current LCU)','Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)'],['Gender equality','Exports-Commercial services','Renewable electricity','Employment-agriculture','Employment-industry','Employment-services','Exports-G&S','Fertility rate','Foreign investment','GDP','Education GExp','Workers high education','Literacy rate','Mortality-pollution','Net migration','Mortality-infants','Health services use','R&D GExp','Ninis','Suicide','International taxes','Alcohol per capita'])
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df4['Indicator Name']=df4['Indicator Name'].replace(['CPIA gender equality rating (1=low to 6=high)','Commercial service exports (current US$)','Electricity production from renewable sources, excluding hydroelectric (kWh)','Employment in agriculture (% of total employment) (modeled ILO estimate)','Employment in industry (% of total employment) (modeled ILO estimate)','Employment in services (% of total employment) (modeled ILO estimate)','Exports of goods and services (current US$)','Fertility rate, total (births per woman)','Foreign direct investment, net (BoP, current US$)','GDP (current US$)','Government expenditure on education, total (% of government expenditure)','Labor force with advanced education (% of total working-

Unnamed: 0,Country Code,Indicator Name,1960,1961,1962,1963,1964,1965,1966,1967,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,Unnamed: 66
0,AFE,Access to clean fuels and technologies for coo...,,,,,,,,,...,16.936004,17.337896,17.687093,18.140971,18.491344,18.825520,19.272212,19.628009,,
1,AFE,Access to clean fuels and technologies for coo...,,,,,,,,,...,6.499471,6.680066,6.859110,7.016238,7.180364,7.322294,7.517191,7.651598,,
2,AFE,Access to clean fuels and technologies for coo...,,,,,,,,,...,37.855399,38.046781,38.326255,38.468426,38.670044,38.722783,38.927016,39.042839,,
3,AFE,Access to electricity (% of population),,,,,,,,,...,31.794160,32.001027,33.871910,38.880173,40.261358,43.061877,44.270860,45.803485,,
4,AFE,"Access to electricity, rural (% of rural popul...",,,,,,,,,...,18.663502,17.633986,16.464681,24.531436,25.345111,27.449908,29.641760,30.404935,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
384365,ZWE,Women who believe a husband is justified in be...,,,,,,,,,...,,,14.500000,,,,,,,
384366,ZWE,Women who were first married by age 15 (% of w...,,,,,,,,,...,,,3.700000,,,,5.418352,,,
384367,ZWE,Women who were first married by age 18 (% of w...,,,,,,,,,...,,33.500000,32.400000,,,,33.658057,,,
384368,ZWE,Women's share of population ages 15+ living wi...,,,,,,,,,...,59.200000,59.400000,59.500000,59.700000,59.900000,60.000000,60.200000,60.400000,,


Secondly, we compute the first quartile (Q1=25%) and the third quartile (Q3=75%). For that, we have grouped the data by country code and indicator name, so we get the Q1 and Q3 values for each indicator in each geographical area. 

In [16]:
grouped=df4.groupby(['Country Code','Indicator Name'])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000025B5AEE2D40>

In [17]:
Q1=df4.groupby(['Country Code','Indicator Name']).quantile(0.25)
Q3=df4.groupby(['Country Code','Indicator Name']).quantile(0.75)
IQR=Q3-Q1
IQR

Unnamed: 0_level_0,Unnamed: 1_level_0,Date,Value
Country Code,Indicator Name,Unnamed: 2_level_1,Unnamed: 3_level_1
ARE,Alcohol per capita,10.0,6.400000e-01
ARE,Education GExp,0.0,0.000000e+00
ARE,Employment-agriculture,14.0,5.130000e+00
ARE,Employment-industry,14.0,1.449997e+00
ARE,Employment-services,14.0,3.830002e+00
...,...,...,...
ZAF,Ninis,12.5,2.997499e+00
ZAF,R&D GExp,8.0,1.011100e-01
ZAF,Renewable electricity,12.5,2.292500e+08
ZAF,Suicide,9.5,1.025000e+00


Once we got the quartiles, we compute the upper and lower limit, with a basic mathematical expression.

In [18]:
lower_limit=Q1 - 1.5 * IQR
lower=lower_limit.drop(['Date'],axis=1)
lower.rename(columns={"Value":"Lower limit"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Lower limit
Country Code,Indicator Name,Unnamed: 2_level_1
ARE,Alcohol per capita,2.190000e+00
ARE,Education GExp,1.026766e+01
ARE,Employment-agriculture,-4.905000e+00
ARE,Employment-industry,3.131501e+01
ARE,Employment-services,5.269500e+01
...,...,...
ZAF,Ninis,2.684375e+01
ZAF,R&D GExp,5.828450e-01
ZAF,Renewable electricity,-2.623750e+08
ZAF,Suicide,2.203750e+01


In [19]:
upper_limit=Q3 + 1.5 * IQR
upper=upper_limit.drop(['Date'],axis=1)
upper.rename(columns={"Value":"Upper limit"})

Unnamed: 0_level_0,Unnamed: 1_level_0,Upper limit
Country Code,Indicator Name,Unnamed: 2_level_1
ARE,Alcohol per capita,4.750000e+00
ARE,Education GExp,1.026766e+01
ARE,Employment-agriculture,1.561500e+01
ARE,Employment-industry,3.711499e+01
ARE,Employment-services,6.801500e+01
...,...,...
ZAF,Ninis,3.883375e+01
ZAF,R&D GExp,9.872850e-01
ZAF,Renewable electricity,6.546250e+08
ZAF,Suicide,2.613750e+01


Thirdly, we join the three tables we have (main dataframe, upper limit and lower limit) by matching country code and indicator name..

In [20]:
dfs = [df4,lower,upper]
import functools as ft
df_joined = ft.reduce(lambda left, right: pd.merge(left, right, on=['Country Code','Indicator Name']), dfs)
df_joined

Unnamed: 0,Country Code,Indicator Name,Date,Value_x,Value_y,Value
0,DZA,Exports-Commercial services,1990,4.795977e+08,1.736231e+09,4.453536e+09
1,DZA,Exports-Commercial services,1991,3.747657e+08,1.736231e+09,4.453536e+09
2,DZA,Exports-Commercial services,2005,2.466000e+09,1.736231e+09,4.453536e+09
3,DZA,Exports-Commercial services,2006,2.512000e+09,1.736231e+09,4.453536e+09
4,DZA,Exports-Commercial services,2007,2.786733e+09,1.736231e+09,4.453536e+09
...,...,...,...,...,...,...
19939,YEM,Alcohol per capita,2000,7.900000e-01,-3.725000e-01,7.675000e-01
19940,YEM,Alcohol per capita,2005,3.400000e-01,-3.725000e-01,7.675000e-01
19941,YEM,Alcohol per capita,2010,1.800000e-01,-3.725000e-01,7.675000e-01
19942,YEM,Alcohol per capita,2015,5.500000e-02,-3.725000e-01,7.675000e-01


In [21]:
list(df_joined)

['Country Code', 'Indicator Name', 'Date', 'Value_x', 'Value_y', 'Value']

We rename the columns of the new table, as the columns headers are not saved after the joining. 

In [22]:
renamed=df_joined.set_axis(['Country','Indicator','Year', 'Real value', 'Lower value', 'Upper value'], axis=1, inplace=False)
renamed

Unnamed: 0,Country,Indicator,Year,Real value,Lower value,Upper value
0,DZA,Exports-Commercial services,1990,4.795977e+08,1.736231e+09,4.453536e+09
1,DZA,Exports-Commercial services,1991,3.747657e+08,1.736231e+09,4.453536e+09
2,DZA,Exports-Commercial services,2005,2.466000e+09,1.736231e+09,4.453536e+09
3,DZA,Exports-Commercial services,2006,2.512000e+09,1.736231e+09,4.453536e+09
4,DZA,Exports-Commercial services,2007,2.786733e+09,1.736231e+09,4.453536e+09
...,...,...,...,...,...,...
19939,YEM,Alcohol per capita,2000,7.900000e-01,-3.725000e-01,7.675000e-01
19940,YEM,Alcohol per capita,2005,3.400000e-01,-3.725000e-01,7.675000e-01
19941,YEM,Alcohol per capita,2010,1.800000e-01,-3.725000e-01,7.675000e-01
19942,YEM,Alcohol per capita,2015,5.500000e-02,-3.725000e-01,7.675000e-01


Now that we have the table correctly defined, we remove from our dataframe the values that are outside our range, as it means that they are outliers.

In [23]:
sin_outliers=renamed.loc[~((renamed['Real value']<renamed['Lower value']) | (renamed['Real value']>renamed['Upper value']))]
sin_outliers

Unnamed: 0,Country,Indicator,Year,Real value,Lower value,Upper value
2,DZA,Exports-Commercial services,2005,2.466000e+09,1.736231e+09,4.453536e+09
3,DZA,Exports-Commercial services,2006,2.512000e+09,1.736231e+09,4.453536e+09
4,DZA,Exports-Commercial services,2007,2.786733e+09,1.736231e+09,4.453536e+09
5,DZA,Exports-Commercial services,2008,3.412421e+09,1.736231e+09,4.453536e+09
6,DZA,Exports-Commercial services,2009,2.744716e+09,1.736231e+09,4.453536e+09
...,...,...,...,...,...,...
19938,YEM,Suicide,2019,5.800000e+00,5.400000e+00,6.200000e+00
19940,YEM,Alcohol per capita,2005,3.400000e-01,-3.725000e-01,7.675000e-01
19941,YEM,Alcohol per capita,2010,1.800000e-01,-3.725000e-01,7.675000e-01
19942,YEM,Alcohol per capita,2015,5.500000e-02,-3.725000e-01,7.675000e-01


From the data above, we can perceive that our data comes down from 19944 rows to 19424, so 500 were outliers. The next steps are to order and display data better, removing those columns that we just do not need and pivoting the rows and columns. 

In [24]:
df_limpio=sin_outliers.drop(['Lower value','Upper value'],axis=1)
df_limpio

Unnamed: 0,Country,Indicator,Year,Real value
2,DZA,Exports-Commercial services,2005,2.466000e+09
3,DZA,Exports-Commercial services,2006,2.512000e+09
4,DZA,Exports-Commercial services,2007,2.786733e+09
5,DZA,Exports-Commercial services,2008,3.412421e+09
6,DZA,Exports-Commercial services,2009,2.744716e+09
...,...,...,...,...
19938,YEM,Suicide,2019,5.800000e+00
19940,YEM,Alcohol per capita,2005,3.400000e-01
19941,YEM,Alcohol per capita,2010,1.800000e-01
19942,YEM,Alcohol per capita,2015,5.500000e-02


In [25]:
transpuesto=df_limpio.set_index(["Country", "Year"]).pivot(columns="Indicator", values="Real value").reset_index()
transpuesto

Indicator,Country,Year,Alcohol per capita,Education GExp,Employment-agriculture,Employment-industry,Employment-services,Exports-Commercial services,Exports-G&S,Fertility rate,...,International taxes,Literacy rate,Mortality-infants,Mortality-pollution,Net migration,Ninis,R&D GExp,Renewable electricity,Suicide,Workers high education
0,ARE,1990,,,,,,,,4.454,...,,,672.0,,,,,0.0,,
1,ARE,1991,,,8.46,33.330002,58.200001,,,4.253,...,,,645.0,,,,,0.0,,
2,ARE,1992,,,8.37,33.360001,58.279999,,,4.041,...,,,618.0,,368126.0,,,0.0,,
3,ARE,1993,,,8.24,33.470001,58.290001,,,3.827,...,,,592.0,,,,,0.0,,
4,ARE,1994,,,8.13,33.490002,58.380001,,,3.618,...,,,568.0,,,,,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1472,ZAF,2017,,18.719290,5.28,23.340000,71.379997,1.614806e+10,1.042884e+11,2.430,...,4.993941e+10,87.046669,32777.0,,727026.0,31.010000,0.83215,,25.2,83.809998
1473,ZAF,2018,9.52,18.901590,5.16,23.129999,71.709999,1.670823e+10,1.112854e+11,2.405,...,5.572291e+10,,31810.0,,,31.559999,,,24.1,82.879997
1474,ZAF,2019,,19.596230,5.28,22.309999,72.410004,1.554886e+10,1.060698e+11,2.381,...,5.522342e+10,95.022972,30937.0,,,32.459999,,,23.5,82.019997
1475,ZAF,2020,,19.527281,,,,8.404204e+09,9.317915e+10,2.358,...,,,30153.0,,,32.400002,,,,


On the other hand, another big stone of normalizations is to nan/null values, which we have in all variables.

In [26]:
transpuesto.isna().sum()

Indicator
Country                      0
Year                         0
Alcohol per capita        1251
Education GExp             749
Employment-agriculture     121
                          ... 
Ninis                      978
R&D GExp                   833
Renewable electricity      369
Suicide                    552
Workers high education     893
Length: 24, dtype: int64

The Nan values´ treatment is to replace them by the corresponding mean of the indicator in the country.

In [27]:
df=transpuesto
europe_list=['DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD']
persian_list=['IRQ','QAT','ARE','SAU','AZE','YEM','YDR','OMN']
naf_list=['DZA','EGY','LBY','ISR','TUR','MAR']
saf_list=['SEN','ZAF','LBR','MOZ','CMR','NGA','GHA']
asia_list=['BGD','IND','VNM','THA','IDN','PHL','KOR']
latam_list=['MEX','BRA','ARG','PER','VEN','COL','CHL','PCZ','CRI']
two_list=['USA','CHN']
country_list=europe_list+persian_list+naf_list+saf_list+asia_list+latam_list+two_list

dat=df.loc[df.loc[:, 'Country'] == country_list[0]]

mean_dat=dat.mean()

data=dat.fillna(mean_dat,inplace=False)
for i in range(1,len(country_list)):

    dat=df.loc[df.loc[:, 'Country'] == country_list[i]]

    mean_dat=dat.mean()

    datc=dat.fillna(mean_dat,inplace=False)

    data=data.append(datc)

data

  mean_dat=dat.mean()
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=dat.mean()
  data=data.append(datc)
  mean_dat=da

Indicator,Country,Year,Alcohol per capita,Education GExp,Employment-agriculture,Employment-industry,Employment-services,Exports-Commercial services,Exports-G&S,Fertility rate,...,International taxes,Literacy rate,Mortality-infants,Mortality-pollution,Net migration,Ninis,R&D GExp,Renewable electricity,Suicide,Workers high education
347,DEU,1990,12.9725,10.224579,2.251034,31.338276,66.412414,4.922992e+10,4.045759e+11,1.450,...,0.000000e+00,,3090.866667,16.0,1.478270e+06,7.866111,2.608368,1.667000e+09,13.30,74.552174
348,DEU,1991,12.9725,10.224579,3.480000,37.720001,58.790001,4.842279e+10,4.422840e+11,1.330,...,0.000000e+00,,5404.000000,16.0,1.478270e+06,7.866111,2.608368,2.088000e+09,13.30,74.552174
349,DEU,1992,12.9725,10.224579,3.400000,37.369999,59.230000,5.465244e+10,4.730958e+11,1.290,...,0.000000e+00,,5019.000000,16.0,2.628459e+06,7.866111,2.608368,2.338000e+09,13.30,74.552174
350,DEU,1993,12.9725,9.582970,3.350000,36.740002,59.919998,5.254376e+10,4.207571e+11,1.280,...,0.000000e+00,,4679.000000,16.0,1.478270e+06,7.866111,2.608368,2.642000e+09,13.30,74.552174
351,DEU,1994,12.9725,9.395570,3.260000,36.419998,60.320000,5.468545e+10,4.655068e+11,1.240,...,0.000000e+00,,4388.000000,16.0,1.478270e+06,7.866111,2.608368,3.503000e+09,13.30,74.552174
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247,CHN,2016,5.8200,12.269560,27.700001,28.799999,43.500000,2.071921e+11,2.199968e+12,1.675,...,2.651774e+11,90.167660,131852.000000,112.7,-1.433525e+06,,2.100330,1.431141e+10,8.20,
248,CHN,2017,5.8200,12.155300,26.980000,28.110001,44.910000,2.113642e+11,2.424200e+12,1.683,...,3.048248e+11,90.167660,120075.000000,112.7,-1.741996e+06,,2.116030,1.431141e+10,8.10,
249,CHN,2018,7.0500,11.450690,26.070000,28.320000,45.610001,2.318095e+11,2.655592e+12,1.690,...,2.897563e+11,96.840889,109028.000000,112.7,-1.433525e+06,,2.140580,1.431141e+10,8.10,
250,CHN,2019,5.8200,13.037553,25.330000,27.420000,47.250000,2.427723e+11,2.628935e+12,1.696,...,2.318647e+11,90.167660,98805.000000,112.7,-1.433525e+06,,1.427239,1.431141e+10,8.10,


For the next part of analyzing this data, we think it is gonna be interesting to have it classify by the categories of the Country groups defined before, to which we call "Continent". This category is useful as it groups the nations with similar economies or geographical proximity, so we can extract common conclusions from them.

We create a dictionary with the regions and the countries included in each one. 

In [28]:
countries_by_region = {
    "Europe": ('DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD'),
    'Persian Gulf': ('IRQ','QAT','ARE','SAU','AZE','YEM','YDR','OMN'),
    'North Africa':('DZA','EGY','LBY','ISR','TUR','MAR'),
    'South Africa':('SEN','ZAF','LBR','MOZ','CMR','NGA','GHA'),
    'Asia':('BGD','IND','VNM','THA','IDN','PHL','KOR'),
    'Latam':('MEX','BRA','ARG','PER','VEN','COL','CHL','PCZ','CRI'),
    'Pair':('USA','CHN')
    }

Now, we have two alternatives through loops:
- Get a list country-region to convert into table and then merge it with our main dataframe basing on Country. 
- Get a dictionary country-region so then we can apply the .map function.

In [29]:
#alternative 1
all_countries = []
for region in countries_by_region.keys():
  all_countries += [(region, country) for country in countries_by_region[region]]

# Table(Region, Country)
# Merge "Join" by Country

In [30]:
#alternative 2: dictionary
all_countries = {}
for region in countries_by_region.keys():
  for country in countries_by_region[region]:
    all_countries[country] = region

print(all_countries)



{'DEU': 'Europe', 'FRA': 'Europe', 'SWE': 'Europe', 'GBR': 'Europe', 'ESP': 'Europe', 'HRV': 'Europe', 'POL': 'Europe', 'GRC': 'Europe', 'AUT': 'Europe', 'NLD': 'Europe', 'IRQ': 'Persian Gulf', 'QAT': 'Persian Gulf', 'ARE': 'Persian Gulf', 'SAU': 'Persian Gulf', 'AZE': 'Persian Gulf', 'YEM': 'Persian Gulf', 'YDR': 'Persian Gulf', 'OMN': 'Persian Gulf', 'DZA': 'North Africa', 'EGY': 'North Africa', 'LBY': 'North Africa', 'ISR': 'North Africa', 'TUR': 'North Africa', 'MAR': 'North Africa', 'SEN': 'South Africa', 'ZAF': 'South Africa', 'LBR': 'South Africa', 'MOZ': 'South Africa', 'CMR': 'South Africa', 'NGA': 'South Africa', 'GHA': 'South Africa', 'BGD': 'Asia', 'IND': 'Asia', 'VNM': 'Asia', 'THA': 'Asia', 'IDN': 'Asia', 'PHL': 'Asia', 'KOR': 'Asia', 'MEX': 'Latam', 'BRA': 'Latam', 'ARG': 'Latam', 'PER': 'Latam', 'VEN': 'Latam', 'COL': 'Latam', 'CHL': 'Latam', 'PCZ': 'Latam', 'CRI': 'Latam', 'USA': 'Pair', 'CHN': 'Pair'}


In [31]:
#Para comprobar que todos los países de nuestros datos están incluidos en nuestro diccionario:
# Lista de todos los countries
# Sobre todos los datos .distintict() country
# Anterior - Lista de countries -> Country en ninguna Region

for region, countries in countries_by_region.values():
    [(region, c) for c in countries]


ValueError: too many values to unpack (expected 2)

In [32]:
data['Continent']=data['Country'].map(all_countries)
GoldenDataFrame=data

Indicator,Country,Year,Alcohol per capita,Education GExp,Employment-agriculture,Employment-industry,Employment-services,Exports-Commercial services,Exports-G&S,Fertility rate,...,Literacy rate,Mortality-infants,Mortality-pollution,Net migration,Ninis,R&D GExp,Renewable electricity,Suicide,Workers high education,Continent
347,DEU,1990,12.9725,10.224579,2.251034,31.338276,66.412414,4.922992e+10,4.045759e+11,1.450,...,,3090.866667,16.0,1.478270e+06,7.866111,2.608368,1.667000e+09,13.30,74.552174,Europe
348,DEU,1991,12.9725,10.224579,3.480000,37.720001,58.790001,4.842279e+10,4.422840e+11,1.330,...,,5404.000000,16.0,1.478270e+06,7.866111,2.608368,2.088000e+09,13.30,74.552174,Europe
349,DEU,1992,12.9725,10.224579,3.400000,37.369999,59.230000,5.465244e+10,4.730958e+11,1.290,...,,5019.000000,16.0,2.628459e+06,7.866111,2.608368,2.338000e+09,13.30,74.552174,Europe
350,DEU,1993,12.9725,9.582970,3.350000,36.740002,59.919998,5.254376e+10,4.207571e+11,1.280,...,,4679.000000,16.0,1.478270e+06,7.866111,2.608368,2.642000e+09,13.30,74.552174,Europe
351,DEU,1994,12.9725,9.395570,3.260000,36.419998,60.320000,5.468545e+10,4.655068e+11,1.240,...,,4388.000000,16.0,1.478270e+06,7.866111,2.608368,3.503000e+09,13.30,74.552174,Europe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247,CHN,2016,5.8200,12.269560,27.700001,28.799999,43.500000,2.071921e+11,2.199968e+12,1.675,...,90.167660,131852.000000,112.7,-1.433525e+06,,2.100330,1.431141e+10,8.20,,Pair
248,CHN,2017,5.8200,12.155300,26.980000,28.110001,44.910000,2.113642e+11,2.424200e+12,1.683,...,90.167660,120075.000000,112.7,-1.741996e+06,,2.116030,1.431141e+10,8.10,,Pair
249,CHN,2018,7.0500,11.450690,26.070000,28.320000,45.610001,2.318095e+11,2.655592e+12,1.690,...,96.840889,109028.000000,112.7,-1.433525e+06,,2.140580,1.431141e+10,8.10,,Pair
250,CHN,2019,5.8200,13.037553,25.330000,27.420000,47.250000,2.427723e+11,2.628935e+12,1.696,...,90.167660,98805.000000,112.7,-1.433525e+06,,1.427239,1.431141e+10,8.10,,Pair


With that all, we export our dataframe all-in-one and by the continent category.

In [33]:
GoldenDataFrame.to_csv('GoldenDataFrame.csv')

In [34]:
for continent, data in data.groupby('Continent'):
    GoldenDataFrame.to_csv("{}.csv".format(continent))