#  <u> 1st step : Load </u>


We import the libraries that we are going to use:

    - pandas: a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

In [1]:
# we import the useful libraries
import pandas as pd

We open the final merged table, from which the outliers were removed in the 'outliers part'.

In [2]:
# we open and read the merged table of all indicators
silver_dataset = pd.read_csv('./data/SilverDataset.csv')
# we drop the first column of the file
silver_dataset = silver_dataset.drop(silver_dataset.columns[0], axis=1)
silver_dataset

Unnamed: 0,Code,Year,Deaths,Fertility,GDP,GenderInequality,LifeExpectancy,tertiary_education
0,ABW,1950,,,,,57.2,
1,ABW,1951,,,,,57.7,
2,ABW,1952,,,,,58.7,
3,ABW,1953,,,,,59.5,
4,ABW,1954,,,,,60.4,
...,...,...,...,...,...,...,...,...
21097,ZWE,2017,26069.0,3.7064,,0.532,60.7,
21098,ZWE,2018,24648.0,3.6591,,0.535,61.4,
21099,ZWE,2019,24006.0,3.5994,,0.533,61.3,
21100,ZWE,2020,23533.0,3.5451,,0.533,61.1,
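
As a hedged alternative to reading the file and then dropping its first column, `read_csv` can treat that column as the index while reading. The sketch below uses a small in-memory CSV with made-up rows (not the real `SilverDataset.csv`) and assumes the first column is a saved row index.

```python
import io
import pandas as pd

# A tiny stand-in for the real file (illustrative values only)
csv_text = ",Code,Year,LifeExpectancy\n0,ABW,1950,57.2\n1,ABW,1951,57.7\n"

# Approach used in the notebook: read everything, then drop the first column
dropped = pd.read_csv(io.StringIO(csv_text))
dropped = dropped.drop(dropped.columns[0], axis=1)

# Alternative: declare the first column as the index while reading
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(dropped.columns))
print(list(indexed.columns))
```

Both calls end up with the same data columns, so the choice is mostly a matter of taste.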


#  <u> 2nd step : Filter by year </u>


We notice that, apart from one country, we have no data before 1830. We therefore filter the dataset and keep only the years after 1830.

In [3]:
silver_dataset = silver_dataset[silver_dataset['Year'] > 1830]
silver_dataset

Unnamed: 0,Code,Year,Deaths,Fertility,GDP,GenderInequality,LifeExpectancy,tertiary_education
0,ABW,1950,,,,,57.2,
1,ABW,1951,,,,,57.7,
2,ABW,1952,,,,,58.7,
3,ABW,1953,,,,,59.5,
4,ABW,1954,,,,,60.4,
...,...,...,...,...,...,...,...,...
21097,ZWE,2017,26069.0,3.7064,,0.532,60.7,
21098,ZWE,2018,24648.0,3.6591,,0.535,61.4,
21099,ZWE,2019,24006.0,3.5994,,0.533,61.3,
21100,ZWE,2020,23533.0,3.5451,,0.533,61.1,
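
One way to check how many countries have rows up to a given cutoff year before filtering is sketched below. The country codes `AAA` and `BBB`, the values, and the variable names are made up for illustration.

```python
import pandas as pd

# Made-up data: only "AAA" has a row before the cutoff
toy = pd.DataFrame({
    "Code": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Year": [1800, 1831, 1832, 1831, 1832],
    "LifeExpectancy": [32.0, 40.1, 41.0, 48.0, 48.3],
})

cutoff = 1830

# Number of distinct countries with at least one row up to the cutoff
early = toy[toy["Year"] <= cutoff]
n_early_countries = early["Code"].nunique()

# Same filter as in the notebook: keep the years strictly after the cutoff
filtered = toy[toy["Year"] > cutoff]

print(n_early_countries, filtered["Year"].min())
```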


#  <u> 3rd step : Count the number of NaN values </u>


We can count how many NaN values we have. This count includes the outliers that we removed in the previous step.

In [4]:
silver_dataset.isna().sum().sum()

61621
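
The total hides which indicators are the most incomplete. As a small sketch on made-up data (hypothetical codes and values), a single `.sum()` gives the count per column, and `.mean()` gives the share of missing values.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in two indicators
toy = pd.DataFrame({
    "Code": ["AAA", "AAA", "BBB"],
    "Deaths": [np.nan, 10.0, np.nan],
    "GDP": [np.nan, np.nan, np.nan],
})

per_column = toy.isna().sum()   # NaN count for each column
total = per_column.sum()        # same number as toy.isna().sum().sum()
share = toy.isna().mean()       # fraction of missing values per column

print(per_column)
print(total)
```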

Then we count the rows available for each unique country code, in order to put the codes in a list.

In [5]:
nb_countries_serie = silver_dataset.Code.value_counts()
nb_countries_serie 

Code
NOR         191
GBR         191
CHL         191
NLD         191
DNK         191
           ... 
OWID_SRM     56
OWID_CZS     44
OWID_AUH     40
OWID_GFR     40
OWID_CIS      9
Name: count, Length: 243, dtype: int64

We can do the same thing with another method

In [6]:
nb_countries_df = silver_dataset.groupby(by='Code', as_index=False).nunique()

Then we convert the different country codes to a list

In [7]:
list_diff_countries = nb_countries_serie.index.tolist()
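
For comparison, here is a short sketch of two other ways to get the distinct codes, on made-up data. `value_counts()` orders the codes by frequency, whereas `unique()` keeps the order of first appearance; which order is preferable depends on the later steps.

```python
import pandas as pd

# Made-up codes: "BBB" is the most frequent one
toy = pd.DataFrame({"Code": ["AAA", "BBB", "BBB", "CCC", "BBB"]})

by_count = toy["Code"].value_counts().index.tolist()   # most frequent code first
by_appearance = toy["Code"].unique().tolist()          # order of first appearance
n_codes = toy["Code"].nunique()                        # number of distinct codes

print(by_count[0], by_appearance, n_codes)
```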

#  <u> 4th step : Reducing the number of NaN values </u>



## First, to reduce the missing data, we do a linear interpolation

Linear interpolation is a curve-fitting method that uses linear polynomials.  
It builds new data points within the range of a discrete set of already known data points.   
It is therefore the simplest way to estimate a missing value: the value is placed on the straight line joining the nearest known values before and after it.
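
A minimal worked example on a made-up series shows what pandas does with its default settings. The gap between 10 and 16 is filled with evenly spaced values, the leading NaN is left alone, and the trailing NaN receives the last known value.

```python
import numpy as np
import pandas as pd

# Made-up series: one leading gap, one interior gap, one trailing gap
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 16.0, np.nan])

filled = s.interpolate(method="linear")

# Interior gap becomes 12.0 and 14.0; the trailing NaN becomes 16.0;
# the leading NaN stays NaN because nothing precedes it.
print(filled.tolist())
```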

In [8]:
df = silver_dataset

# we take the first country of the list
dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[0]]

# we call the interpolate function
datc = dat.interpolate(method="linear")
data = datc

# then we do the same for the rest of the countries
for i in range(1, len(list_diff_countries)):
    dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[i]]
    datc = dat.interpolate(method="linear")

    # and we concatenate the newly interpolated data with the previous data
    data = pd.concat((data, datc), axis=0)

# finally we calculate the number of NaN values
data.isna().sum().sum()

51200
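
The country-by-country loop can also be written with `groupby`. The following is a minimal sketch on made-up data (the codes `AAA` and `BBB`, the values, and the name `value_cols` are hypothetical), assuming only the numeric indicator columns need interpolating. Unlike the loop, it keeps the original row order.

```python
import numpy as np
import pandas as pd

# Made-up data for two countries
toy = pd.DataFrame({
    "Code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "GDP": [1.0, np.nan, 3.0, np.nan, 20.0, np.nan],
})
value_cols = ["GDP"]  # columns to interpolate (assumption for this sketch)

result = toy.copy()
# Interpolate each column separately within each country
result[value_cols] = (
    toy.groupby("Code")[value_cols]
       .transform(lambda col: col.interpolate(method="linear"))
)

print(result)
```

The first `BBB` value stays NaN: values from `AAA` are never used to fill gaps in `BBB`.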

We can observe that it reduces the number of missing values, from 61621 to 51200.

## Then we do the backward filling

We can use backward filling for gaps that have no beginning, or that begin before our date range.

It means filling each missing cell with the next known (future) value.
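
As a small illustration on a made-up series, backward filling copies each known value into the gap that precedes it. A gap at the very end has no later value to copy and stays empty.

```python
import numpy as np
import pandas as pd

# Made-up series with a leading, an interior and a trailing gap
s = pd.Series([np.nan, np.nan, 5.0, np.nan, 8.0, np.nan])

back_filled = s.bfill()

# Leading gap takes 5.0, interior gap takes 8.0, trailing gap stays NaN
print(back_filled.tolist())
```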

In [9]:
dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[0]]

# we call the backward filling method
datc = dat.bfill()
data = datc

# then we do the same for the rest of the countries
for i in range(1, len(list_diff_countries)):
    dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[i]]
    datc = dat.bfill()

    data = pd.concat((data, datc), axis=0)

data.isna().sum().sum()

23875

Backward filling leaves 23875 missing values, fewer than the 51200 left by linear interpolation. We also try another method.

## Then we try the forward filling

We can use forward filling for gaps that have a definite start date.

It consists of padding missing values "forward": each existing value fills the missing values that follow it.  
In other words, forward filling imputes missing values using the last observed value.
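
As a small illustration on a made-up series, forward filling copies each known value into the gap that follows it. A gap at the very start has no earlier value to copy and stays empty.

```python
import numpy as np
import pandas as pd

# Made-up series with a leading, an interior and a trailing gap
s = pd.Series([np.nan, np.nan, 5.0, np.nan, 8.0, np.nan])

forward_filled = s.ffill()

# Leading gap stays NaN, interior gap takes 5.0, trailing gap takes 8.0
print(forward_filled.tolist())
```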


In [10]:
dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[0]]

# we call the forward filling method
datc = dat.ffill()
data = datc

# then we do the same for the rest of the countries
for i in range(1, len(list_diff_countries)):
    dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[i]]
    datc = dat.ffill()
    data = pd.concat((data, datc), axis=0)
data.isna().sum().sum()

51200

To obtain a better result, we are going to combine the methods.

### First we mix the forward filling and the linear interpolation

In [11]:
dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[0]]

# linear interpolation method
datc = dat.interpolate(method="linear")

# then forward filling method
datc = datc.ffill()
data = datc

for i in range(1, len(list_diff_countries)):
    dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[i]]

    datc = dat.interpolate(method="linear")
    datc = datc.ffill()

    data = pd.concat((data, datc), axis=0)
data.isna().sum().sum()

51200

This is no better than linear interpolation alone: 51200 missing values remain.

### Then we mix the backward filling and the linear interpolation

In [12]:
dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[0]]

# linear interpolation method
datc = dat.interpolate(method="linear")

# then backward filling method
datc = datc.bfill()
data = datc

for i in range(1, len(list_diff_countries)):
    dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[i]]

    datc = dat.interpolate(method="linear")
    datc = datc.bfill()
    data = pd.concat((data, datc), axis=0)
data.isna().sum().sum()

18622

This is the best result so far: 18622 missing values remain. We can also combine all three methods.

### Finally we mix the three methods

In [13]:
dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[0]]

# linear interpolation method
datc = dat.interpolate(method="linear")

# backward filling method
datf = datc.bfill()

# forward filling method
datr = datf.ffill()
data = datr

for i in range(1, len(list_diff_countries)):
    dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[i]]
    datc = dat.interpolate(method="linear")
    datc = datc.bfill()
    datc = datc.ffill()
    data = pd.concat((data, datc), axis=0)
data.isna().sum().sum()

18622

In [14]:
data

Unnamed: 0,Code,Year,Deaths,Fertility,GDP,GenderInequality,LifeExpectancy,tertiary_education
14098,NOR,1831,1427.0,2.5142,7.780508e+06,0.132,48.03,2.91
14099,NOR,1832,1427.0,2.5142,7.023810e+06,0.132,48.03,2.91
14100,NOR,1833,1427.0,2.5142,7.471980e+06,0.132,48.03,2.91
14101,NOR,1834,1427.0,2.5142,8.048104e+06,0.132,48.03,2.91
14102,NOR,1835,1427.0,2.5142,8.737864e+06,0.132,48.03,2.91
...,...,...,...,...,...,...,...,...
14701,OWID_CIS,2002,,,4.446153e+09,,,
14702,OWID_CIS,2003,,,4.490794e+09,,,
14703,OWID_CIS,2004,,,4.671829e+09,,,
14704,OWID_CIS,2005,,,4.854983e+09,,,
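
Some NaN values survive every combination of methods. A likely reason, consistent with the `OWID_CIS` rows in this output, is that a country with no observation at all for an indicator gives the fill methods nothing to copy. The sketch below illustrates this on made-up data (hypothetical codes `AAA` and `BBB`).

```python
import numpy as np
import pandas as pd

# Made-up data: "BBB" has no Deaths value at all
toy = pd.DataFrame({
    "Code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Deaths": [np.nan, 4.0, np.nan, np.nan, np.nan, np.nan],
})

# Interpolation, then backward fill, then forward fill, within each country
filled = toy.groupby("Code")["Deaths"].transform(
    lambda col: col.interpolate(method="linear").bfill().ffill()
)

# Remaining NaN values per country
remaining = filled.isna().groupby(toy["Code"]).sum()
print(remaining)
```

Here `AAA` ends up fully filled, while the three `BBB` values remain missing.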


# <u> 5th step : Normalize the values </u>

Now, we are going to normalize the data with the formula:
### $z_i = \dfrac{x_i - \min(x)}{\max(x) - \min(x)}$

This brings all the values of each variable between 0 and 1.  
The goal is to use a common scale without distorting the differences in the ranges of values and without losing information.
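
A quick worked example of the formula on three made-up values: the minimum maps to 0, the maximum maps to 1, and 4 maps to (4 − 2) / (10 − 2) = 0.25.

```python
import pandas as pd

# Made-up values for one indicator
x = pd.Series([2.0, 4.0, 10.0])

# Min-max normalization: z_i = (x_i - min(x)) / (max(x) - min(x))
z = (x - x.min()) / (x.max() - x.min())

print(z.tolist())  # [0.0, 0.25, 1.0]
```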

In [15]:
dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[0]]

# columns to normalize; each name must match the dataset's column name exactly
all_colums = ['Deaths', 'Fertility', 'GDP ', 'GenderInequality', 'LifeExpectancy', 'tertiary_education']

# linear interpolation method and backward filling method
datc = dat.interpolate(method="linear")
datf = datc.bfill()

# use of the formula: zi = (xi - min(x)) / (max(x) - min(x))
datf[all_colums] = (datf[all_colums] - datf[all_colums].min()) / (datf[all_colums].max() - datf[all_colums].min())
data = datf

# we do the same for all the other countries
for i in range(1, len(list_diff_countries)):
    dat = df.loc[df.loc[:, 'Code'] == list_diff_countries[i]]

    datc = dat.interpolate(method="linear")
    datc = datc.bfill()

    datc[all_colums] = (datc[all_colums] - datc[all_colums].min()) / (datc[all_colums].max() - datc[all_colums].min())
    data = pd.concat((data, datc), axis=0)

data

Unnamed: 0,Code,Year,Deaths,Fertility,GDP,GenderInequality,LifeExpectancy,tertiary_education
14098,NOR,1831,1.0,0.687195,0.000109,1.0,0.086731,0.0
14099,NOR,1832,1.0,0.687195,0.000000,1.0,0.086731,0.0
14100,NOR,1833,1.0,0.687195,0.000065,1.0,0.086731,0.0
14101,NOR,1834,1.0,0.687195,0.000148,1.0,0.086731,0.0
14102,NOR,1835,1.0,0.687195,0.000247,1.0,0.086731,0.0
...,...,...,...,...,...,...,...,...
14701,OWID_CIS,2002,,,0.512971,,,
14702,OWID_CIS,2003,,,0.539722,,,
14703,OWID_CIS,2004,,,0.648209,,,
14704,OWID_CIS,2005,,,0.757965,,,
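
Because the normalization is done country by country, an indicator that is constant within a country has max(x) = min(x), and the formula becomes 0/0, which pandas returns as NaN. The sketch below shows this on made-up data, together with one possible guard. The function `min_max_guarded` and its choice of mapping a constant series to 0 are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Made-up data: GDP varies for "AAA" but is constant for "BBB"
toy = pd.DataFrame({
    "Code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "GDP": [1.0, 3.0, 5.0, 7.0, 7.0, 7.0],
})

def min_max(col):
    """Plain min-max scaling, as in the formula."""
    return (col - col.min()) / (col.max() - col.min())

def min_max_guarded(col):
    """Hypothetical variant: a constant series is mapped to 0.0."""
    span = col.max() - col.min()
    if span == 0:
        return col - col.min()
    return (col - col.min()) / span

plain = toy.groupby("Code")["GDP"].transform(min_max)
guarded = toy.groupby("Code")["GDP"].transform(min_max_guarded)

print(plain.tolist())
print(guarded.tolist())
```

Whether a constant indicator should become 0, stay NaN, or be handled differently depends on how the normalized data is used afterwards.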


We reset the index and rename a column

In [16]:
new_data = data.reset_index()
new_data.columns.names = ['']
new_data.rename(columns={'index': 'Unnamed'}, inplace=True)
new_data

Unnamed: 0,Unnamed,Code,Year,Deaths,Fertility,GDP,GenderInequality,LifeExpectancy,tertiary_education
0,14098,NOR,1831,1.0,0.687195,0.000109,1.0,0.086731,0.0
1,14099,NOR,1832,1.0,0.687195,0.000000,1.0,0.086731,0.0
2,14100,NOR,1833,1.0,0.687195,0.000065,1.0,0.086731,0.0
3,14101,NOR,1834,1.0,0.687195,0.000148,1.0,0.086731,0.0
4,14102,NOR,1835,1.0,0.687195,0.000247,1.0,0.086731,0.0
...,...,...,...,...,...,...,...,...,...
20900,14701,OWID_CIS,2002,,,0.512971,,,
20901,14702,OWID_CIS,2003,,,0.539722,,,
20902,14703,OWID_CIS,2004,,,0.648209,,,
20903,14704,OWID_CIS,2005,,,0.757965,,,


We create the golden dataframe without clustering

In [17]:
new_data.to_csv('./data/GoldenDataFrameWithoutCluster.csv')
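
By default, `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. If that column is not wanted, `index=False` avoids it. The sketch below shows the difference with an in-memory buffer and made-up data instead of a real file.

```python
import io
import pandas as pd

# Made-up normalized data
toy = pd.DataFrame({"Code": ["AAA", "BBB"], "GDP": [0.1, 0.9]})

# Default behaviour: the index is written as a first, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# With index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Code', 'GDP']
print(cols_without)  # ['Code', 'GDP']
```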