# <u>Step 1: Load</u>

First, we open the merged table containing the 6 indicators, which we created in the 'Load part'.

We import the libraries that we are going to use:

- pandas: a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

- functools: a standard-library module for higher-order functions, that is, functions that act on or return other functions. In general, any callable object can be treated as a function for the purposes of this module.
    

In [14]:
# import the libraries we need
import pandas as pd
import functools

Then we open and read the merged table of all indicators

In [17]:
# we open and read the merged table of all indicators
bronze_dataset = pd.read_csv('./data/all_indicators_table.csv')
# drop the first column (presumably the row index written when the CSV was saved)
bronze_dataset = bronze_dataset.drop(bronze_dataset.columns[0], axis=1)

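Then we put all the indicators into a single column, so that each row holds one (Code, Year, Indicator, Value) observation. Note that the Entity column (the country name) is stacked as well, so it appears as an 'Indicator' row in the output.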

In [19]:
# reshape from wide to long: one row per (Code, Year, Indicator)
bronze_dataset = (bronze_dataset.set_index(["Code", "Year"]).stack()
                  .reset_index(name='Value').rename(columns={'level_2': 'Indicator'}))
bronze_dataset

Unnamed: 0,Code,Year,Indicator,Value
0,AFG,1966,Entity,Afghanistan
1,AFG,1966,Deaths,161659.0
2,AFG,1966,LifeExpectancy,35.5
3,AFG,1966,GDP,500000224.0
4,AFG,1966,Fertility,7.3203
...,...,...,...,...
88076,ZWE,1947,GDP,59000000.0
88077,ZWE,1948,Entity,Zimbabwe
88078,ZWE,1948,GDP,72000000.0
88079,ZWE,1949,Entity,Zimbabwe
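
To make the reshaping step easier to follow, here is a minimal sketch of the same `set_index` / `stack` / `reset_index` chain on a tiny table. The table `wide` and its values are made up for illustration and are not part of our dataset.

```python
import pandas as pd

# Hypothetical wide table: one column per indicator (values are made up)
wide = pd.DataFrame({
    "Code": ["AAA", "AAA", "BBB"],
    "Year": [2000, 2001, 2000],
    "GDP": [1.0, 2.0, 3.0],
    "Fertility": [4.0, 5.0, 6.0],
})

# Keep Code and Year as identifiers, then stack every remaining column
# into (Indicator, Value) pairs
long = (
    wide.set_index(["Code", "Year"])
    .stack()
    .reset_index(name="Value")
    .rename(columns={"level_2": "Indicator"})
)

print(long)
```

The 3 rows and 2 indicator columns of `wide` become 6 rows in `long`, one per (Code, Year, Indicator) combination.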


# <u>Step 2: Beginning of normalization</u>
    
### IQR method 
    
To normalize our data, we first need to identify the outliers so that they can be removed from our dataframe.

We do that with the IQR method, which identifies outliers by setting up a “fence” outside of the interquartile range. Any value that falls outside of this fence is considered an outlier.

To build the fence, we first compute the quartiles, then the IQR (interquartile range), and finally the upper and lower limits.

### 1) Computing the quartiles

Definition:

Quartiles divide the sorted data into four quarters.   
The three cut points between these quarters are called quartiles and are named Q1, Q2 (the median), and Q3.

The lower quartile (Q1) is the value below which 25% of the data lies.   
The upper quartile (Q3) is the value below which 75% of the data lies.

The interquartile range is the difference between the upper quartile and the lower quartile: IQR = Q3 - Q1
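
As a small worked example of these definitions, the sketch below computes Q1, Q3 and the IQR of a made-up series with `Series.quantile`. The numbers are illustrative only, and pandas uses linear interpolation between data points by default.

```python
import pandas as pd

# Made-up sample: the integers 1 to 9
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)  # lower quartile
q3 = values.quantile(0.75)  # upper quartile
iqr = q3 - q1               # interquartile range

print(q1, q3, iqr)  # 3.0 7.0 4.0
```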
    

First we compute the first quartile, the third quartile, and the interquartile range for each (Code, Indicator) pair.

In [8]:
# 1st quartile 
Q1=bronze_dataset.groupby(['Code','Indicator']).quantile(0.25)

# 3rd quartile 
Q3=bronze_dataset.groupby(['Code','Indicator']).quantile(0.75)

#interquartile range 
IQR=Q3-Q1

IQR

Code,Indicator,Year,Value
ABW,Fertility,35.5,1.317100e+00
ABW,GDP,8.5,3.090634e+08
ABW,LifeExpectancy,35.5,7.025000e+00
AFG,Deaths,27.0,4.707350e+04
AFG,Fertility,35.5,3.561750e-01
...,...,...,...
ZWE,Fertility,35.5,3.106400e+00
ZWE,GDP,45.5,4.186134e+09
ZWE,GenderInequality,15.5,5.250000e-02
ZWE,LifeExpectancy,35.5,8.150000e+00


### 2) Computing the limits

The next step in building the fence is to take 1.5 times the IQR, subtract this value from Q1, and add it to Q3. This gives us the lower and upper fence posts against which we compare each observation.

Any observation that is more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3 is considered an outlier.
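
The fence rule can be sketched on a single made-up series as follows. The data and variable names are illustrative assumptions; the notebook itself applies the same formulas per (Code, Indicator) group.

```python
import pandas as pd

# Made-up sample with one obviously extreme value (95)
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fence posts: 1.5 * IQR below Q1 and above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# An observation is an outlier when it falls outside the fence
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

print(lower, upper)       # 8.75 14.75
print(outliers.tolist())  # [95]
```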

So we compute the lower and upper limits. For each of them, we drop the Year column (the quantiles of the years are not meaningful here) and rename the Value column.

In [9]:
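lower_limit = Q1 - 1.5 * IQR
lower_table = lower_limit.drop(['Year'], axis=1)
# rename() returns a new dataframe: only the displayed output is renamed,
# lower_table itself keeps its "Value" column
lower_table.rename(columns={"Value": "Lower limit"})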

Code,Indicator,Lower limit
ABW,Fertility,-1.925000e-02
ABW,GDP,5.567916e+08
ABW,LifeExpectancy,5.616250e+01
AFG,Deaths,3.980225e+04
AFG,Fertility,6.647813e+00
...,...,...
ZWE,Fertility,-6.594250e-01
ZWE,GDP,-6.228451e+09
ZWE,GenderInequality,4.677500e-01
ZWE,LifeExpectancy,3.835000e+01


In [10]:
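upper_limit = Q3 + 1.5 * IQR
upper_table = upper_limit.drop(['Year'], axis=1)
# rename() returns a new dataframe: only the displayed output is renamed,
# upper_table itself keeps its "Value" column
upper_table.rename(columns={"Value": "Upper limit"})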

Code,Indicator,Upper limit
ABW,Fertility,5.249150e+00
ABW,GDP,1.793045e+09
ABW,LifeExpectancy,8.426250e+01
AFG,Deaths,2.280962e+05
AFG,Fertility,8.072512e+00
...,...,...
ZWE,Fertility,1.176618e+01
ZWE,GDP,1.051608e+10
ZWE,GenderInequality,6.777500e-01
ZWE,LifeExpectancy,7.095000e+01



Then we merge the three tables: bronze_dataset, lower_table and upper_table.
To do so, we use the reduce function from functools,
which lets us merge the three tables in a single command.

In [11]:
three_tables = [bronze_dataset,lower_table,upper_table]
tables_joined = functools.reduce(lambda left, right: pd.merge(left, right, on=['Code','Indicator']), three_tables)
tables_joined

Unnamed: 0,Code,Year,Indicator,Value_x,Value_y,Value
0,AFG,1966,Deaths,1.616590e+05,3.980225e+04,2.280962e+05
1,AFG,1967,Deaths,1.625790e+05,3.980225e+04,2.280962e+05
2,AFG,1968,Deaths,1.635730e+05,3.980225e+04,2.280962e+05
3,AFG,1969,Deaths,1.646380e+05,3.980225e+04,2.280962e+05
4,AFG,1970,Deaths,1.654300e+05,3.980225e+04,2.280962e+05
...,...,...,...,...,...,...
66489,OWID_GFR,1986,GDP,7.110545e+11,-5.467304e+11,9.798115e+11
66490,OWID_GFR,1987,GDP,7.913833e+11,-5.467304e+11,9.798115e+11
66491,OWID_GFR,1988,GDP,7.847509e+11,-5.467304e+11,9.798115e+11
66492,OWID_GFR,1989,GDP,8.517760e+11,-5.467304e+11,9.798115e+11
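
The column names `Value_x`, `Value_y` and `Value` in this output come from how `pd.merge` handles name clashes. The sketch below reproduces the effect on three tiny made-up tables that all call their data column "Value"; the tables and their contents are hypothetical.

```python
import functools
import pandas as pd

keys = ["Code", "Indicator"]

# Hypothetical miniature tables; all three use the column name "Value"
values = pd.DataFrame({"Code": ["AAA", "BBB"], "Indicator": ["GDP", "GDP"], "Value": [10.0, 20.0]})
lower = pd.DataFrame({"Code": ["AAA"], "Indicator": ["GDP"], "Value": [5.0]})
upper = pd.DataFrame({"Code": ["AAA"], "Indicator": ["GDP"], "Value": [15.0]})

# reduce applies the merge pairwise: merge(merge(values, lower), upper)
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on=keys),
    [values, lower, upper],
)

print(merged.columns.tolist())  # ['Code', 'Indicator', 'Value_x', 'Value_y', 'Value']
print(len(merged))              # 1
```

The first merge finds "Value" on both sides and adds the default `_x` / `_y` suffixes. The second merge has no clash left, so the last column stays "Value". Since `pd.merge` performs an inner join by default, rows without a match in every table (here "BBB") are dropped, which is one plausible reason why the joined table has fewer rows than bronze_dataset.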


The merge leaves us with the columns Value_x, Value_y and Value, so we rename the columns to make the table easier to read.

In [12]:
renamed=tables_joined.set_axis(['Code','Year','Indicator', 'Real value', 'Lower value', 'Upper value'], axis=1)
renamed

Unnamed: 0,Code,Year,Indicator,Real value,Lower value,Upper value
0,AFG,1966,Deaths,1.616590e+05,3.980225e+04,2.280962e+05
1,AFG,1967,Deaths,1.625790e+05,3.980225e+04,2.280962e+05
2,AFG,1968,Deaths,1.635730e+05,3.980225e+04,2.280962e+05
3,AFG,1969,Deaths,1.646380e+05,3.980225e+04,2.280962e+05
4,AFG,1970,Deaths,1.654300e+05,3.980225e+04,2.280962e+05
...,...,...,...,...,...,...
66489,OWID_GFR,1986,GDP,7.110545e+11,-5.467304e+11,9.798115e+11
66490,OWID_GFR,1987,GDP,7.913833e+11,-5.467304e+11,9.798115e+11
66491,OWID_GFR,1988,GDP,7.847509e+11,-5.467304e+11,9.798115e+11
66492,OWID_GFR,1989,GDP,8.517760e+11,-5.467304e+11,9.798115e+11


In [13]:
renamed.to_csv('./data/bronze_dataset_with_outliers.csv')
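
The saved table still contains the outliers. As a possible next step, and only as a sketch under assumptions, the rows inside the fence could be selected with `Series.between`. The dataframe `joined` below is a made-up stand-in that reuses the column names from the table above; the name `without_outliers` is hypothetical.

```python
import pandas as pd

# Made-up stand-in for the joined table, with the same column names
joined = pd.DataFrame({
    "Code": ["AAA", "AAA", "AAA"],
    "Year": [2000, 2001, 2002],
    "Indicator": ["GDP", "GDP", "GDP"],
    "Real value": [10.0, 50.0, 12.0],
    "Lower value": [5.0, 5.0, 5.0],
    "Upper value": [20.0, 20.0, 20.0],
})

# True where the real value lies within [Lower value, Upper value]
inside_fence = joined["Real value"].between(joined["Lower value"], joined["Upper value"])

# Keep only the rows that are not outliers
without_outliers = joined[inside_fence]

print(without_outliers["Year"].tolist())  # [2000, 2002]
```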