# Bronze to Silver
This notebook transforms the bronze dataframe into the silver dataframe.
The following tasks are executed:
- Filter the years
- Drop countries with no GDP data or not in the region csv
- Identify outliers and treat them
- Drop indicators with low data
- Drop countries with low data





In [1]:
import pandas as pd
import numpy as np
import os

read_path = os.path.join(os.getcwd(), 'Databases') #Path to your databases folder to be read
write_path = os.path.join(os.getcwd(), 'Output') #Path to the folder you want to store the dataframes

from Project.Utils.data_treat import iqr_treatment, nan_treatment
from Project.Utils.standardize import standardize

## Establish the variables and dataframe
Define the variables that will be used throughout this notebook and import the Bronze dataframe generated in the previous notebook, which has been saved in the Output folder.

### Variables that can be changed
- nan_threshold: maximum number of missing values allowed for an indicator within a country. Above it, all the values of that indicator for that country are erased due to lack of information.
- indicator_threshold: maximum number of missing indicators allowed for a country. Above it, the country is not considered in the analysis.
- country_threshold: minimum number of countries that must have data for an indicator for it to be considered in the analysis.
- year_min, year_max: first and last year of the range on which the analysis is executed.

In [2]:
nan_threshold = 5
indicator_threshold = 15
country_threshold = 20
year_min = 2000
year_max = 2020

## Other variables
These variables should not be modified.

In [3]:

columns_index = ('Country', 'Year')
column_year  = 'Year'
column_country = 'Country'
df = pd.read_csv(write_path + '/BronzeDataframe.csv')
df.dtypes


Country                       object
Year                           int64
Gender Equality              float64
% Undernourishment           float64
AgriShareGDP                 float64
CreditToAgriFishForest       float64
EmploymentRural              float64
GDP                          float64
%EmploymentAgriFishForest    float64
TotalAgri                    float64
Gender Inequality            float64
% Soldiers                   float64
Marriage Rate                float64
Birth Rate                   float64
Death Rate                   float64
Homicides                    float64
Life Expectancy              float64
Maternal Death Risk          float64
Literacy Rate                float64
Infant Mortality             float64
% Population Growth          float64
% Rural Population           float64
Suicide Rate                 float64
Gini                         float64
Civil Liberties              float64
Freedom of Expression        float64
% Healthcare Investment      float64
%

## Narrow the range
Narrow the data to the selected years using the variables defined above. With the values set in this notebook, the range is 2000 to 2020 (both included).

In [10]:
df[column_year] = df[column_year].astype(int)
# Keep only the rows whose year is within [year_min, year_max], both included
df = df[df[column_year].between(year_min, year_max)].copy()
display(df)


Unnamed: 0,Country,Year,Gender Equality,% Undernourishment,AgriShareGDP,CreditToAgriFishForest,EmploymentRural,GDP,%EmploymentAgriFishForest,TotalAgri,...,Civil Liberties,Freedom of Expression,% Healthcare Investment,% Employment Industry,Women Schooling Years,Men Schooling Years,% Education Expenditure,% Men Employment,% Women Employment,Population
17962,Yemen,2000,,,,0.174673,,10864.562835,,1249657.0,...,0.25,0.432,7.49,12.675,,,30.489281,,,17409071.0
17963,Palestine,2000,,,,,179.528,4313.6,14.1,920575.0,...,,,,34.331,,,,,,3224009.0
18026,Bahamas,2000,,,,,,8076.5,,,...,,,12.66,17.846,11.1,10.7,18.92281,,,298045.0
18177,Congo,2000,,,,,,3358.958017,,834232.0,...,0.236,0.377,2.27,25.224,,,,,,3127420.0
18731,Slovakia,2000,,,,,,20719.405913,6.9,997869.0,...,0.944,0.942,8.79,37.325,,,7.44282,54.72,42.56,5399207.0


## Intersection of countries between the .csv files
Countries that are not present in both the region .csv and the GDP .csv will be dropped.

The countries that appear in both files are collected in the list 'country_list'. Any country of the main dataframe that is not in this list is dropped, and its name is printed.
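
As an illustration of the intersection described above, here is a minimal sketch on toy data. The dataframes `gdp`, `regions` and `main` and their contents are hypothetical stand-ins for the real files; only the column names 'Area', 'Entity' and 'Country' come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the GDP csv, the regions csv and the main dataframe (hypothetical data)
gdp = pd.DataFrame({'Area': ['Spain', 'Spain', 'France', 'Atlantis']})
regions = pd.DataFrame({'Entity': ['Spain', 'France', 'Italy']})
main = pd.DataFrame({'Country': ['Spain', 'Italy', 'France', 'Atlantis'],
                     'GDP': [1.0, 2.0, 3.0, 4.0]})

# Countries present in both files, without duplicates, in order of first appearance
in_both = gdp['Area'].isin(regions['Entity'])
country_list = gdp.loc[in_both, 'Area'].unique().tolist()

# Keep only the rows of the main dataframe whose country is in the list
kept = main[main['Country'].isin(country_list)]

print(country_list)
print(kept['Country'].tolist())
```

'Italy' is missing from the GDP data and 'Atlantis' from the regions data, so both are excluded.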

In [5]:
df_countries = pd.read_csv(read_path + '/FAOSTAT_GDP.csv')
df_countries = standardize(df_countries, ['Area', 'Year'])
df_regions = pd.read_csv(read_path + '/AuxiliarData/world-regions.csv')

#Countries that are in both .csv of gdp countries and regions (unique, in order of first appearance)
in_both = df_countries['Area'].isin(df_regions['Entity'])
country_list = df_countries.loc[in_both, 'Area'].unique().tolist()

#If a country is not in the list, print it once and drop it from the main dataframe.
not_listed = ~df['Country'].isin(country_list)
for country in df.loc[not_listed, 'Country'].unique():
    print(country)
df = df[~not_listed].copy()



American Samoa
Bahamas, The
Bolivia
Brunei Darussalam
Cabo Verde
Channel Islands
Congo, Dem. Rep.
Congo, Rep.
Cote d'Ivoire
Curacao
Czech Republic
Egypt, Arab Rep.
Faroe Islands
Gambia, The
Gibraltar
Guam
Hong Kong SAR, China
Iran, Islamic Rep.
Isle of Man
Korea, Dem. People's Rep.
Korea, Rep.
Kosovo
Kyrgyz Republic
Lao PDR
Macao SAR, China
Micronesia, Fed. Sts.
Moldova
Northern Mariana Islands
Slovak Republic
St. Kitts and Nevis
St. Lucia
St. Martin (French part)
St. Vincent and the Grenadines
Syrian Arab Republic
Tanzania
Timor-Leste
Turkey
Venezuela, RB
Vietnam
Virgin Islands (U.S.)
West Bank and Gaza
Yemen, Rep.

## Identify outliers and treat them
For each country and indicator, we first check whether there is enough data. If the indicator has more than 'nan_threshold' missing values for the country, the IQR method is not applied and the values of the indicator for that country are erased.

If there are enough entries, the IQR method is called to treat the outliers. After the IQR step, a NaN treatment is run, which consists of interpolating the data.

The code block below performs these operations by extracting all the entries of a country into an auxiliary dataframe. The auxiliary dataframes are then concatenated into a new dataframe, 'final_df', on which the following steps are executed.
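
The helpers `iqr_treatment` and `nan_treatment` live in `Project.Utils.data_treat` and their code is not shown here. The sketch below is therefore only an assumption of what such a treatment could look like: values outside the usual 1.5 × IQR fences are replaced by NaN, and the gaps are then filled by linear interpolation. The function name `iqr_to_nan` and the factor `k` are hypothetical.

```python
import numpy as np
import pandas as pd

def iqr_to_nan(series, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with NaN (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    inside = series.between(q1 - k * iqr, q3 + k * iqr)
    return series.where(inside)

# Toy indicator for one country: one obvious outlier and one missing value
values = pd.Series([10.0, 11.0, 12.0, 500.0, 14.0, np.nan, 16.0])

cleaned = iqr_to_nan(values)                                           # 500.0 becomes NaN
filled = cleaned.interpolate(method='linear', limit_direction='both')  # fill the gaps

print(filled.tolist())
```

In this toy series the outlier 500.0 is replaced by the interpolated value 13.0 and the original gap by 15.0.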

In [6]:
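frames = []  # Treated per-country dataframes, concatenated once at the end

for country in country_list:
    aux = df[df['Country'] == country].copy()
    # Erase the indicators with more than nan_threshold missing values for this country
    for column in aux.columns[2:]:
        if aux[column].isna().sum() > nan_threshold:
            aux[column] = np.nan
    aux = iqr_treatment(aux)  # Treat the outliers with the IQR method
    aux = nan_treatment(aux)  # Fill the remaining gaps by interpolation
    frames.append(aux)

final_df = pd.concat(frames, axis = 0)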


## Drop indicators
This step drops the indicators that don't have sufficient data.
The output lists every indicator that is dropped.

The country_threshold variable is multiplied by 20 because each country has about 20 rows, one per year (the 2000 to 2020 range spans 21 years, both included). The check counts the distinct non-null values of each indicator: an indicator with no more than country_threshold * 20 of them, roughly the data of 20 countries, is erased.

If the list is empty, all the indicators have enough data to be included in the analysis.
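
The check in the code block below counts distinct values, which is not the same as counting non-missing values. The following sketch on toy data shows the difference; the dataframe `toy` and the threshold `min_values` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2000, 2001, 2000, 2001],
    'Dense': [1.0, 2.0, 3.0, 4.0],
    'Sparse': [np.nan, np.nan, 5.0, np.nan],
    'Constant': [7.0, 7.0, 7.0, 7.0],
})
min_values = 2  # hypothetical threshold for this toy example

indicators = toy.columns[2:]
distinct = toy[indicators].nunique()  # distinct non-null values per indicator
non_null = toy[indicators].count()    # non-null values per indicator

# Same criterion as the notebook: drop when the distinct count is at or below the threshold
to_drop = distinct[distinct <= min_values].index.tolist()
silver = toy.drop(columns=to_drop)

print(distinct.to_dict())
print(non_null.to_dict())
print(to_drop)
```

Note that 'Constant' has no missing values but a single distinct value, so a criterion based on distinct values drops it together with 'Sparse'.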

In [7]:
print("These indicators aren't useful. A drop action will be performed: \n")
for column in final_df.columns[2:]:
    # Number of distinct non-null values, used as a proxy for the available data
    if final_df[column].nunique() <= country_threshold * 20:
        print(column)
        final_df = final_df.drop(columns = column)


final_df.shape      


These indicators aren't useful. A drop action will be performed: 

Gender Equality
% Undernourishment
%EmploymentAgriFishForest
Gender Inequality
Marriage Rate
Literacy Rate
Suicide Rate
Gini
Women Schooling Years
Men Schooling Years


(3927, 24)

#### What would have happened if the threshold were different?
Consider setting 'nan_threshold' to 3 and to 15.
- If nan_threshold = 3, the filter is so demanding that most of the data isn't considered in the final analysis.
- If nan_threshold = 15, the statistical approximation is unreliable because too many values are interpolated.

Change 'nan_threshold' to the desired value at the beginning of the notebook and the previous code block will display the irrelevant indicators.
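
To get a feeling for this trade-off without rerunning the whole notebook, one option is to count how many country/indicator series would be erased for several values of the missing-value threshold (assumed here to be `nan_threshold`). The function `erased_series` and the toy data are hypothetical, a sketch rather than part of the project.

```python
import numpy as np
import pandas as pd

def erased_series(frame, threshold, country_col='Country', first_indicator=2):
    """Count the country/indicator series with more than `threshold` missing values."""
    indicators = frame.columns[first_indicator:]
    # Missing values per country (rows) and indicator (columns)
    missing = frame[indicators].isna().groupby(frame[country_col]).sum()
    return int((missing > threshold).sum().sum())

toy = pd.DataFrame({
    'Country': ['A'] * 4 + ['B'] * 4,
    'Year': [2000, 2001, 2002, 2003] * 2,
    'X': [1.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan, 8.0],
    'Y': [1.0, 2.0, 3.0, 4.0, np.nan, 6.0, np.nan, 8.0],
})

for threshold in (0, 1, 2, 3):
    print(threshold, erased_series(toy, threshold))
```

A lower threshold erases more series; a higher one keeps them, at the cost of interpolating more values.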

#### What would have happened if the analysis started in 1990 instead of 2000?
Many of the indicators would be dropped because of the lack of data. Many of the studies behind the source datasets started around 2000, so few indicators have values between 1990 and 2000.

Change 'year_min' to the desired value at the beginning of the notebook and the previous code block will display the irrelevant indicators.

## Drop countries
We will scan the dataframe to find which countries are missing most of the indicators, and drop them.

For this code block a new column 'MISSING' is created with the number of missing indicators in each row. If the value is greater than 'indicator_threshold', the entry is dropped and its country is saved in a set whose names are displayed below. The 'MISSING' column is dropped at the end.
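
The row-wise count of missing values can be sketched on toy data as follows. The dataframe `toy` and the threshold `max_missing` are hypothetical; in the notebook the threshold is 'indicator_threshold'.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Year': [2000, 2001, 2000, 2001],
    'GDP': [1.0, 2.0, np.nan, np.nan],
    'Gini': [0.3, np.nan, np.nan, np.nan],
})
max_missing = 1  # hypothetical threshold for this toy example

missing = toy.isnull().sum(axis='columns')  # missing indicators in each row
dropped = set(toy.loc[missing > max_missing, 'Country'])
kept = toy[missing <= max_missing]

print(missing.tolist())
print(dropped)
```

Country 'B' has two missing indicators in every row, so it is the only one dropped.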




In [8]:
# Number of missing indicators in each row
final_df['MISSING'] = final_df.isnull().sum(axis='columns')

# Countries with entries above the threshold (the name avoids shadowing the built-in `list`)
too_sparse = final_df['MISSING'] > indicator_threshold
dropped_countries = set(final_df.loc[too_sparse, column_country])

final_df = final_df[~too_sparse].drop(columns = 'MISSING')

print('The following countries have been deleted: ')
print(dropped_countries)

The following countries have been deleted: 
{'Saint Vincent and the Grenadines', 'Turks and Caicos Islands', 'British Virgin Islands', 'Saint Kitts and Nevis', 'Bahamas', 'Sint Maarten (Dutch part)', 'Palestine', 'Monaco'}


## Save
The silver dataframe is complete and can now be saved. The index = False parameter prevents the index from being written as an extra column, which could cause problems when the file is later read with read_csv.

In [9]:
final_df.to_csv(write_path + '/SilverDataframe.csv', index = False)     
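
The effect of the index parameter can be checked with a small round trip. This sketch uses in-memory buffers and a toy dataframe (both hypothetical) instead of the real output file.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Country': ['A', 'B'], 'Year': [2000, 2001]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)

cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

When the index is written, read_csv brings it back as an extra 'Unnamed: 0' column, which would shift any code that selects indicators by position, such as `columns[2:]`.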