# Basics

Import libraries and functions.

In [3]:
import pandas as pd
import numpy as np
import glob
import os
from pyspark.sql.functions import concat, col, lit, split
import seaborn as sns  # used for the GDP distribution plot at the end of this section

First, we load the World Bank World Development Indicators (WDI) dataset, which was downloaded and extracted in the *Data extraction* notebook. We read it from the `Data` folder inside the current working directory.

In [7]:
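# Build the path with os.path.join instead of concatenating backslash strings,
# so it works on any operating system and avoids invalid escape sequences
df = pd.read_csv(os.path.join(os.getcwd(), "Data", "WDIData.csv"))
print(df)  # pandas prints the first and last rows of a long frame

                      Country Name Country Code  \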
0      Africa Eastern and Southern          AFE   
1      Africa Eastern and Southern          AFE   
2      Africa Eastern and Southern          AFE   
3      Africa Eastern and Southern          AFE   
4      Africa Eastern and Southern          AFE   
...                            ...          ...   
93277                   Bangladesh          BGD   
93278                   Bangladesh          BGD   
93279                   Bangladesh          BGD   
93280                   Bangladesh          BGD   
93281                   Bangladesh          BGD   

                                          Indicator Name  \
0      Access to clean fuels and technologies for coo...   
1      Access to clean fuels and technologies for coo...   
2      Access to clean fuels and technologies for coo...   
3                Access to electricity (% of population)   
4      Access to electricity, rural (% of rural popul... 

To keep the dataframe manageable, we drop the columns we do not need, *Country Name* and *Indicator Code*. The *Country Code*, the *Indicator Name* and the yearly value columns already carry all the relevant information.

In [8]:
df = df.drop(columns=["Country Name", "Indicator Code"])

From the nearly two hundred countries covered by the worldwide database, we have decided to study about 50, grouped by geographical and economic similarities. We then keep only the selected countries in our dataframe.

In [9]:
europe_list=['DEU','FRA','SWE','GBR','ESP','HRV','POL','GRC','AUT','NLD']
persian_list=['IRQ','QAT','ARE','SAU','AZE','YEM','YDR','OMN']
naf_list=['DZA','EGY','LBY','ISR','TUR','MAR']
saf_list=['SEN','ZAF','LBR','MOZ','CMR','NGA','GHA']
asia_list=['BGD','IND','VNM','THA','IDN','PHL','KOR']
latam_list=['MEX','BRA','ARG','PER','VEN','COL','CHL','PCZ','CRI']
two_list=['USA','CHN']
country_list=europe_list+persian_list+naf_list+saf_list+asia_list+latam_list+two_list 

In [10]:
df=df.loc[df['Country Code'].isin(country_list)]
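Because `isin` silently ignores codes that never occur in the data, it may be worth checking that every code in the shortlist was actually matched. The following is a minimal sketch on a toy dataframe; the names `toy` and `wanted` and the code `XXX` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the WDI table (the real one has many more rows and columns)
toy = pd.DataFrame({
    "Country Code": ["DEU", "DEU", "FRA", "USA"],
    "Indicator Name": ["GDP (current US$)"] * 4,
})

# Hypothetical shortlist; "XXX" is a made-up code that is absent from the data
wanted = ["DEU", "FRA", "XXX"]

# Codes that were requested but do not appear in the data
missing = sorted(set(wanted) - set(toy["Country Code"]))
print("Codes not found:", missing)

# Same filter as in the notebook: unmatched codes are dropped without warning
filtered = toy.loc[toy["Country Code"].isin(wanted)]
print("Countries kept:", filtered["Country Code"].nunique())
```

Running the same set difference with `country_list` and `df['Country Code']` would show whether any of the selected codes is missing from the dataset.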

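Now we reshape the table from wide to long format: the year columns are stacked into rows, so that each row holds one country, one indicator, one year (*Date*) and its *Value*.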

In [11]:
dftras=(df.set_index(["Country Code", "Indicator Name"]).stack().reset_index(name='Value').rename(columns={'level_2':'Date'}))
dftras

Unnamed: 0,Country Code,Indicator Name,Date,Value
0,DZA,Access to clean fuels and technologies for coo...,2000,97.1
1,DZA,Access to clean fuels and technologies for coo...,2001,97.3
2,DZA,Access to clean fuels and technologies for coo...,2002,97.8
3,DZA,Access to clean fuels and technologies for coo...,2003,98.0
4,DZA,Access to clean fuels and technologies for coo...,2004,98.2
...,...,...,...,...
166150,BGD,"Mortality rate, under-5, male (per 1,000 live ...",2016,38.1
166151,BGD,"Mortality rate, under-5, male (per 1,000 live ...",2017,36.3
166152,BGD,"Mortality rate, under-5, male (per 1,000 live ...",2018,34.4
166153,BGD,"Mortality rate, under-5, male (per 1,000 live ...",2019,32.6
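The wide-to-long reshape can also be written with `melt`, which some readers find easier to follow than `set_index` plus `stack`. The sketch below compares both on a toy dataframe with made-up values; the names `wide`, `long_stack` and `long_melt` are assumptions for illustration. Missing values are dropped explicitly in both versions, since whether `stack` drops them by default depends on the pandas version.

```python
import pandas as pd

# Toy wide table: one column per year, values are made up
wide = pd.DataFrame({
    "Country Code": ["DZA", "BGD"],
    "Indicator Name": ["Net migration", "Net migration"],
    "2000": [1.0, None],
    "2001": [2.0, 3.0],
})

# Approach used in the notebook: stack the year columns into rows
long_stack = (
    wide.set_index(["Country Code", "Indicator Name"])
        .stack()
        .reset_index(name="Value")
        .rename(columns={"level_2": "Date"})
        .dropna(subset=["Value"])
)

# Alternative: melt names the new columns directly
long_melt = (
    wide.melt(
        id_vars=["Country Code", "Indicator Name"],
        var_name="Date",
        value_name="Value",
    )
    .dropna(subset=["Value"])
)

print(long_melt.sort_values(["Country Code", "Date"]))
```

Both results contain the same rows; only the row order differs.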


Because many indicators have very similar meanings, we have selected a subset of them for the study (**Indicator group** = *Name of the selected indicator*):
- **GDP** = *GDP (current US$)*
- **Literacy** = *Literacy rate, adult total (% of people ages 15 and above)* & *Government expenditure on education, total (% of government expenditure)*
- **Migration** = *Net migration*
- **Exports** = *Commercial service exports (current US$)* & *Exports of goods and services (current US$)*
- **International trading** = *Taxes on international trade (current LCU)*
- **Fertility** = *Fertility rate, total (births per woman)*
- **Healthcare** = *People using at least basic sanitation services (% of population)*
- **Employment** = *Employment in agriculture (% of total employment) (modeled ILO estimate)*, *Employment in services (% of total employment) (modeled ILO estimate)* & *Employment in industry (% of total employment) (modeled ILO estimate)*
- **Renewable energy** = *Electricity production from renewable sources, excluding hydroelectric (kWh)*
- **Mortality** = *Number of infant deaths*
- **Outside investment** = *Foreign direct investment, net (BoP, current US$)*
- **Pollution** = *Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)*
- **Alcoholism** = *Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)*
- **Tech adoption** = *Research and development expenditure (% of GDP)*
- ** ** = *Labor force with advanced education (% of total working-age population with advanced education)*
- **Optimism and pessimism** = *Suicide mortality rate (per 100,000 population)*
- **Gender inequality** = *CPIA gender equality rating (1=low to 6=high)*
- **Education** = *Share of youth not in education, employment or training, total (% of youth population)* & *Government expenditure on education, total (% of government expenditure)*

To accomplish this, we use the function `isin`, which lets us keep only the aforementioned indicators, compiled in the list called *indicators_list*.

In [12]:
indicators_list = [
    'GDP (current US$)',
    'Literacy rate, adult total (% of people ages 15 and above)',
    # Listed once here; it belongs to both the Literacy and Education groups
    'Government expenditure on education, total (% of government expenditure)',
    'Net migration',
    'Commercial service exports (current US$)',
    'Exports of goods and services (current US$)',
    'Taxes on international trade (current LCU)',
    'Fertility rate, total (births per woman)',
    'People using at least basic sanitation services (% of population)',
    'Employment in agriculture (% of total employment) (modeled ILO estimate)',
    'Employment in services (% of total employment) (modeled ILO estimate)',
    'Employment in industry (% of total employment) (modeled ILO estimate)',
    'Electricity production from renewable sources, excluding hydroelectric (kWh)',
    'Number of infant deaths',
    'Foreign direct investment, net (BoP, current US$)',
    'Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)',
    'Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)',
    'Research and development expenditure (% of GDP)',
    'Labor force with advanced education (% of total working-age population with advanced education)',
    'Suicide mortality rate (per 100,000 population)',
    'CPIA gender equality rating (1=low to 6=high)',
    'Share of youth not in education, employment or training, total (% of youth population)',
]

In [18]:
dfl=dftras.loc[dftras['Indicator Name'].isin(indicators_list)]
pd.set_option('display.max_rows', 10)
dfl

Unnamed: 0,Country Code,Indicator Name,Date,Value
5321,DZA,Commercial service exports (current US$),1977,2.592386e+08
5322,DZA,Commercial service exports (current US$),1978,2.919892e+08
5323,DZA,Commercial service exports (current US$),1979,4.391079e+08
5324,DZA,Commercial service exports (current US$),1980,4.463902e+08
5325,DZA,Commercial service exports (current US$),1981,4.425590e+08
...,...,...,...,...
162805,BGD,"Literacy rate, adult total (% of people ages 1...",2017,7.289297e+01
162806,BGD,"Literacy rate, adult total (% of people ages 1...",2018,7.391220e+01
162807,BGD,"Literacy rate, adult total (% of people ages 1...",2019,7.468446e+01
162808,BGD,"Literacy rate, adult total (% of people ages 1...",2020,7.490890e+01


In [19]:
dfl=dfl.set_index(["Country Code", "Date"]).pivot(columns="Indicator Name", values="Value").reset_index()
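The cell above turns the long table back into a wide one, this time with one column per indicator and one row per country and year. A minimal sketch of the same idea on a toy dataframe follows; the values are made up and the names `long_df` and `wide_df` are assumptions for illustration. It passes the index columns directly to `pivot`, which should be equivalent to calling `set_index` first in recent pandas versions.

```python
import pandas as pd

# Toy long table: one row per (country, year, indicator), values are made up
long_df = pd.DataFrame({
    "Country Code": ["ARG", "ARG", "ARG", "DZA"],
    "Date": ["1990", "1990", "1991", "1990"],
    "Indicator Name": [
        "Net migration",
        "Number of infant deaths",
        "Number of infant deaths",
        "Number of infant deaths",
    ],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

# One row per (country, year), one column per indicator
wide_df = (
    long_df.pivot(
        index=["Country Code", "Date"],
        columns="Indicator Name",
        values="Value",
    )
    .reset_index()
)

print(wide_df)
```

Combinations that have no observation, such as *Net migration* for ARG in 1991 in this toy table, become `NaN` in the wide result. Note that `pivot` raises an error if a (country, year, indicator) combination appears more than once; `pivot_table` with an aggregation function is the usual alternative in that case.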

Although the data spans 1960 to 2021, coverage is neither uniform nor complete across countries and indicators. A large share of the data is missing in the earliest decades of the range, so it makes little sense to study them. Many values for 2021 are missing as well. Therefore, we restrict our study to the years 1990 through 2020, which gives a more complete data set.

In [20]:
dfl['Date'] = dfl['Date'].astype(int)
# Keep 1990 through 2020 inclusive, matching the range stated above
df3 = dfl[dfl['Date'].between(1990, 2020)]
df3

Indicator Name,Country Code,Date,CPIA gender equality rating (1=low to 6=high),Commercial service exports (current US$),"Electricity production from renewable sources, excluding hydroelectric (kWh)",Employment in agriculture (% of total employment) (modeled ILO estimate),Employment in industry (% of total employment) (modeled ILO estimate),Employment in services (% of total employment) (modeled ILO estimate),Exports of goods and services (current US$),"Fertility rate, total (births per woman)",...,"Literacy rate, adult total (% of people ages 15 and above)","Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)",Net migration,Number of infant deaths,People using at least basic sanitation services (% of population),Research and development expenditure (% of GDP),"Share of youth not in education, employment or training, total (% of youth population)","Suicide mortality rate (per 100,000 population)",Taxes on international trade (current LCU),"Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age)"
30,ARG,1990,,2.264000e+09,107000000.0,,,,1.464345e+10,2.997,...,,,,18201.0,,,,,1.018100e+09,
31,ARG,1991,,2.174000e+09,100000000.0,0.34,32.900002,66.750000,1.456109e+10,2.965,...,96.040718,,,18003.0,,,,,1.590800e+09,
32,ARG,1992,,2.842100e+09,102000000.0,0.40,32.349998,67.239998,1.509590e+10,2.925,...,,,-105000.0,17571.0,,,,,2.177200e+09,
33,ARG,1993,,2.885460e+09,107000000.0,0.45,29.719999,69.830002,1.635732e+10,2.879,...,,,,16981.0,,,,,2.720100e+09,
34,ARG,1994,,3.180600e+09,120000000.0,0.45,28.600000,70.949997,1.938510e+10,2.828,...,,,,16286.0,,,,,2.791800e+09,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304,DZA,2017,,3.023719e+09,,10.16,30.990000,58.849998,3.849675e+10,3.045,...,,,-50002.0,21337.0,86.458826,0.54297,20.950001,2.5,,
305,DZA,2018,,3.224527e+09,,9.88,30.910000,59.220001,4.523397e+10,3.023,...,81.407837,,,20873.0,86.303092,,,2.5,,0.95
306,DZA,2019,,3.067329e+09,,9.60,30.420000,59.990002,3.901432e+10,2.988,...,,,,20239.0,86.138504,,,2.5,,
307,DZA,2020,,2.937941e+09,,,,,2.610336e+10,2.942,...,,,,19465.0,85.965470,,,,,
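The choice of year range can be backed by measuring how much data is present in each year. The sketch below computes, on a toy wide table with made-up values, the share of non-missing cells per year and then applies the 1990 to 2020 window; the names `wide_df`, `coverage` and `window` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide table: values are made up, NaN marks a missing observation
wide_df = pd.DataFrame({
    "Country Code": ["ARG", "ARG", "ARG", "DZA", "DZA", "DZA"],
    "Date": [1960, 1990, 2020, 1960, 1990, 2020],
    "GDP (current US$)": [np.nan, 1.0, 2.0, np.nan, 3.0, 4.0],
    "Net migration": [np.nan, 5.0, np.nan, np.nan, np.nan, 6.0],
})

indicator_cols = wide_df.columns.difference(["Country Code", "Date"])

# Share of non-missing values per year, averaged over all indicators
coverage = (
    wide_df[indicator_cols]
    .notna()
    .groupby(wide_df["Date"])
    .mean()        # per-indicator share of countries with data
    .mean(axis=1)  # average across indicators
)
print(coverage)

# Keep only the years inside the study window (both ends inclusive)
window = wide_df[wide_df["Date"].between(1990, 2020)]
print(len(window))
```

Applied to the real dataframe, a table like `coverage` would show directly which years are too sparse to study.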


NEXT STEP: NORMALIZATION

- Group countries by group list name.
- NaN values: replace with 0, replace with the mean, or drop the rows.
- Remove outliers.
- For the main variable to compare (GDP): analyse its distribution; if it is not normal, apply a logarithmic transform.
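The steps listed above could look roughly like the sketch below. It is only one possible reading under stated assumptions: the group dictionary is a hypothetical two-group subset, the values are made up, missing values are filled with the group mean (one of the three options listed), and outliers are flagged with the 1.5 × IQR rule, which the notebook does not prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the country groups, for illustration only
groups = {"europe": ["DEU", "FRA"], "two": ["USA", "CHN"]}
code_to_group = {code: name for name, codes in groups.items() for code in codes}

gdp = "GDP (current US$)"
# Made-up values, not real GDP figures
data = pd.DataFrame({
    "Country Code": ["DEU", "FRA", "USA", "CHN"],
    gdp: [4.0e12, np.nan, 2.0e13, 1.5e13],
})

# 1. Group countries by the name of the list they belong to
data["Group"] = data["Country Code"].map(code_to_group)

# 2. Fill missing values with the mean of the country's group (assumed choice)
data[gdp] = data[gdp].fillna(data.groupby("Group")[gdp].transform("mean"))

# 3. Flag outliers with the 1.5 * IQR rule (assumed choice)
q1, q3 = data[gdp].quantile([0.25, 0.75])
iqr = q3 - q1
data["Outlier"] = ~data[gdp].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 4. Log-transform GDP, which is typically right-skewed
data["log GDP"] = np.log10(data[gdp])

print(data)
```

Filling with 0 would distort indicators such as GDP, which is why the sketch uses a mean instead; dropping rows is the safer option when a country has almost no data for an indicator.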

In [None]:
selected['GDP (current US$)'].describe()

KeyError: 'GDP (current US$)'

In [None]:
sns.displot(selected['GDP (current US$)'])

In [None]:
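# Column means of the numeric indicators only (the country code is a string)
col_means = selected.mean(numeric_only=True)
# fillna returns a new dataframe, so assign the result back
selected = selected.fillna(col_means)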