## IMPORTS

We will need a series of libraries to aid us with the data analysis.

The most important ones are pandas, for working with dataframes; numpy, which offers excellent tools for analyzing the data; and matplotlib and seaborn, which provide many options for visualizing it.

We also use a few other libraries: os, to read the databases from our folder; warnings, to raise warnings when needed; and requests.

In [1]:
#IMPORTS

import os
import pandas as pd
import numpy as np

from Project.Utils.read_data import read_data
from Project.Utils.preprocess import preprocess
#from Project.Utils.preprocess import set_indicators
from Project.Utils.standardize import standardize
from Project.Utils.merge_data import merge_data

## CONSTANTS AND STRUCTURES

To keep our code readable, clean, and easy to maintain and modify, we keep most of our constants and data structures here.

Some of them are strings that specify paths or column names, numbers that delimit ranges, or boolean variables that determine the behaviour of some parts of the code. All of these can be changed at will to modify the scope of the analysis and customize the output.

Some mutable structures are kept here as well, since they are intended to behave like constants: for example, the dictionary that maps the various country column names to a single common title, or the list of sources that need a predefined, special preprocessing procedure.

We also include some other mutable structures that are intended to have a global scope, such as the lists of urls and dataframes.

In [2]:
#CONSTANTS AND STRUCTURES TO BE USED

read_path = os.path.join(os.getcwd(), 'Databases') #Path to the databases folder to be read
write_path = os.path.join(os.getcwd(), 'Output') #Path to the folder where the dataframes will be stored

column_country = 'Country' #Name for the column containing the name of the countries
column_year = 'Year' #Name for the column containing the years
columns_index = [column_country, column_year] #List with all the values for the columns that will be used as indexes
columns_rename = dict.fromkeys(['Area', 'Entity', 'Country or Area', 'Name', 'Country Name'], column_country) #Dictionary to rename all the index columns so they have a common name

#special_source = ['databank', 'faostat', 'kaggle', 'un_data', 'worldbank', 'WID'] #List with all the sources that need special treatment
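As a minimal sketch of how these constants might be applied: `dict.fromkeys` maps every source-specific header to the same target name, and the resulting dictionary can be passed to `DataFrame.rename`. The `sample` dataframe and its values are made up for illustration, and the project's own `preprocess` function may apply the renaming differently.

```python
import pandas as pd

column_country = 'Country'
column_year = 'Year'
columns_index = [column_country, column_year]

# Every key maps to the same value: 'Country'
columns_rename = dict.fromkeys(
    ['Area', 'Entity', 'Country or Area', 'Name', 'Country Name'],
    column_country,
)

# Hypothetical sample mimicking a source that labels countries as 'Entity'
sample = pd.DataFrame({
    'Entity': ['Albania', 'Algeria'],
    'Year': [1960, 1960],
    'Value': [1.0, 2.0],
})

# Keys that are absent from the dataframe are silently ignored by rename
renamed = sample.rename(columns=columns_rename)
print(list(renamed.columns))  # ['Country', 'Year', 'Value']
```

Because the dictionary is only read, never modified, it behaves like a constant even though it is a mutable object.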


## INDICATORS

Since our data comes from many different sources, it is convenient to keep all our indicators in a structure that lets us quickly locate and identify them when reading the files. Most of them have long, overly descriptive names, so we use a dictionary that maps each original name to a shorter alias; later we use it to rename our columns and make them easier to understand.

In [3]:
#INDICATORS
#Dictionary containing all the indicators to be used, and the aliases for the columns where they will be represented

indicators = {
    
    #databank
    'CPIA gender equality rating (1=low to 6=high)': 'Gender Equality',
    'Prevalence of undernourishment (% of population)': '% Undernourishment',
    
    #faostat
    'Credit to Agriculture, Forestry and Fishing': 'CreditToAgriFishForest',
    '2.a.1 Agriculture value added share of GDP (%)': 'AgriShareGDP',
    'Employment by status of employment, total, rural areas': 'EmploymentRural',
    'Gross Domestic Product': 'GDP',
    'Share of employment in agriculture, forestry and fishing in total employment': '%EmploymentAgriFishForest',
    'Agriculture': 'TotalAgri',
    
    #kaggle
    'Gender Inequality': 'Gender Inequality',
    
    #ourworldindata
    'Armed forces personnel (% of total labor force)': '% Soldiers', 
    'Crude marriage rate (per 1,000 inhabitants)': 'Marriage Rate',
    
    #theworldbank  
    'Birth rate, crude (per 1,000 people)': 'Birth Rate',
    'Death rate, crude (per 1,000 people)': 'Death Rate',
    'Intentional homicides (per 100,000 people)': 'Homicides',
    'Life expectancy at birth, total (years)': 'Life Expectancy',
    'Lifetime risk of maternal death (%)': 'Maternal Death Risk',
    'Literacy rate, adult total (% of people ages 15 and above)': 'Literacy Rate',
    'Mortality rate, infant (per 1,000 live births)': 'Infant Mortality',
    'Population growth (annual %)': '% Population Growth',
    'Rural population (% of total population)': '% Rural Population',
    'Suicide mortality rate (per 100,000 population)': 'Suicide Rate',
    
    #UN_Data
    'Value': 'Gini',
    
    #WID
    #(To Do)
    
    #WorldInData
    'civlib_vdem_owid': 'Civil Liberties',
    'Employment-to-population ratio, men (%)': '% Men Employment',
    'Employment-to-population ratio, women (%)': '% Women Employment',
    'Population (historical estimates)': 'Population',
    'freeexpr_vdem_owid': 'Freedom of Expression',
    'Indicator:Domestic general government health expenditure (GGHE-D) as percentage of general government expenditure (GGE) (%)': '% Healthcare Investment',
    'Industry as % of total employment -- ILO modelled estimates, May 2017': '% Employment Industry',
    'UIS: Mean years of schooling of the population age 25+. Female': 'Women Schooling Years',
    'UIS: Mean years of schooling of the population age 25+. Male': 'Men Schooling Years',
    'Government expenditure on education, total (% of government expenditure)': '% Education Expenditure'

}
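One way such a dictionary could be used, shown here as a sketch rather than the project's actual logic: keep only the columns whose names are known indicators, then rename them to their aliases. The names `indicators_sample`, `raw` and `tidy`, and all the values, are hypothetical.

```python
import pandas as pd

# Reduced copy of the indicators dictionary, for illustration only
indicators_sample = {
    'Birth rate, crude (per 1,000 people)': 'Birth Rate',
    'Death rate, crude (per 1,000 people)': 'Death Rate',
}

# Hypothetical dataframe whose columns carry the original long names
raw = pd.DataFrame({
    'Country': ['Albania'],
    'Year': [1960],
    'Birth rate, crude (per 1,000 people)': [40.0],
    'Unrelated column': [0],
})

# Keep the index columns plus any column that is a known indicator
keep = ['Country', 'Year'] + [c for c in raw.columns if c in indicators_sample]

# Replace the long names with their short aliases
tidy = raw[keep].rename(columns=indicators_sample)
print(list(tidy.columns))  # ['Country', 'Year', 'Birth Rate']
```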

## READING

The first step explores the directory given by the string read_path, checks which files are .csv, and tries to read and preprocess them. Those successfully read and processed are stored in a dictionary keyed by their url. The dataframes and their urls are also stored in two separate lists.

For any .csv that cannot be read or processed, we store its url in a discarded list, and raise a warning if that list is not empty.
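The reading step described here could look roughly like the sketch below. `read_csv_folder` is a hypothetical stand-in, not the project's `read_data`; in particular, the returned pair (dictionary, discarded list) and the set of caught exceptions are assumptions.

```python
import os
import tempfile
import warnings

import pandas as pd


def read_csv_folder(path):
    """Hypothetical stand-in for read_data: returns (data_dict, discarded)."""
    data_dict = {}
    discarded = []
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            if not name.lower().endswith('.csv'):
                continue  # Skip anything that is not a .csv file
            url = os.path.join(root, name)
            try:
                data_dict[url] = pd.read_csv(url)
            except (pd.errors.ParserError, pd.errors.EmptyDataError, UnicodeDecodeError):
                discarded.append(url)
    if discarded:
        warnings.warn(f'{len(discarded)} file(s) could not be read: {discarded}')
    return data_dict, discarded


# Demo on a temporary folder with one valid csv, one empty csv and one non-csv file
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, 'good.csv'), 'w') as f:
        f.write('Country,Year,Value\nAlbania,1960,1.0\n')
    open(os.path.join(folder, 'empty.csv'), 'w').close()  # No columns to parse
    with open(os.path.join(folder, 'notes.txt'), 'w') as f:
        f.write('ignored')

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        data_dict_demo, discarded_demo = read_csv_folder(folder)

print(len(data_dict_demo), len(discarded_demo))
```

Collecting the failures and warning once at the end keeps a single bad file from stopping the whole run.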

In [4]:
data_dict, _ = read_data(read_path)

index_to_show = 0

#Show before standardization
key = list(data_dict.keys())[index_to_show]
print(data_dict[key].iloc[:, :5].head())

                                     Series Name     Series Code  \
0  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   
1  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   
2  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   
3  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   
4  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   

     Country Name Country Code 1960 [YR1960]  
0     Afghanistan          AFG            ..  
1         Albania          ALB            ..  
2         Algeria          DZA            ..  
3  American Samoa          ASM            ..  
4         Andorra          AND            ..  


## PREPROCESSING

To merge all the dataframes into one big table, we first need to standardize them. This process takes two steps: first, we rearrange them so they all have a similar shape; then, we format their content so that matching columns have compatible types and can be merged.
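The two steps can be illustrated on the wide layout printed in the reading step, where each year is a column such as `1960 [YR1960]` and `..` marks a missing value. This is only a sketch under those assumptions: the rows below are made up, and the project's `preprocess` and `standardize` functions may proceed differently.

```python
import pandas as pd

# Made-up rows in the wide layout; '..' marks a missing value
wide = pd.DataFrame({
    'Series Name': ['CPIA gender equality rating (1=low to 6=high)'] * 2,
    'Country Name': ['Afghanistan', 'Albania'],
    '1960 [YR1960]': ['..', '..'],
    '2005 [YR2005]': ['2.0', '..'],
})

# Step 1: reshape so every row is one (country, year) observation
long_df = wide.melt(
    id_vars=['Series Name', 'Country Name'],
    var_name='Year',
    value_name='Gender Equality',
)
long_df = long_df.rename(columns={'Country Name': 'Country'}).drop(columns='Series Name')

# Step 2: coerce types so the columns can be merged with other sources
long_df['Year'] = long_df['Year'].str.slice(0, 4).astype(int)  # '1960 [YR1960]' -> 1960
long_df['Gender Equality'] = pd.to_numeric(long_df['Gender Equality'], errors='coerce')  # '..' -> NaN

print(long_df)
```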

In [5]:
index_to_show = 0

#Show before standardization
key = list(data_dict.keys())[index_to_show]
print(data_dict[key].iloc[:, :5].head())
print('----------------------------------------------')
#set_indicators(indicators)
df_list = []
for url, df in data_dict.items():
    df = preprocess(url = url, df = df, columns_index = columns_index, columns_rename = columns_rename, inplace = True)
    df = standardize(df = df, columns_index = columns_index, inplace = True)
    #The index columns stay as regular columns; they are passed to merge_data by name later
    df_list.append(df)

#Show after standardization
df_list[index_to_show].head()

                                     Series Name     Series Code  \
0  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   
1  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   
2  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   
3  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   
4  CPIA gender equality rating (1=low to 6=high)  IQ.CPA.GNDR.XQ   

     Country Name Country Code 1960 [YR1960]  
0     Afghanistan          AFG            ..  
1         Albania          ALB            ..  
2         Algeria          DZA            ..  
3  American Samoa          ASM            ..  
4         Andorra          AND            ..  
----------------------------------------------


          Country  Year  Gender Equality
0     Afghanistan  1960              NaN
1         Albania  1960              NaN
2         Algeria  1960              NaN
3  American Samoa  1960              NaN
4         Andorra  1960              NaN


## MERGE

After standardizing all the dataframes so they have the same shape and compatible column types, we merge them into a single dataframe and save it as a .csv file in the output folder.
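A merge of this kind can be sketched as a chain of joins on the index columns. `merge_all` is a hypothetical stand-in for `merge_data`, and the outer join is an assumption, chosen because it keeps every (country, year) pair and fills the gaps with NaN; the sample values are made up.

```python
from functools import reduce

import pandas as pd

columns_index = ['Country', 'Year']

# Made-up standardized dataframes, one indicator each
births = pd.DataFrame({
    'Country': ['Albania', 'Algeria'],
    'Year': [1960, 1960],
    'Birth Rate': [40.0, 50.0],
})
deaths = pd.DataFrame({
    'Country': ['Albania', 'Andorra'],
    'Year': [1960, 1960],
    'Death Rate': [10.0, 8.0],
})


def merge_all(frames, on):
    """Hypothetical stand-in for merge_data: successive outer joins on the index columns."""
    return reduce(lambda left, right: pd.merge(left, right, on=on, how='outer'), frames)


merged_demo = merge_all([births, deaths], on=columns_index)
print(merged_demo)
```

With an outer join, a country that appears in only one source still gets a row, which is one plausible reason a merged table would contain many NaN values.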

In [6]:
merged_df = merge_data(df_list, columns_index = columns_index)

#Show head and tail of the merged dataframe
print(merged_df.iloc[:, :5].head())
print(merged_df.iloc[:, :5].tail())

#Create the output folder if it does not exist yet
os.makedirs(write_path, exist_ok = True)

merged_df.to_csv(os.path.join(write_path, 'BronzeDataframe.csv'), index = False)

          Country  Year  Gender Equality  % Undernourishment  AgriShareGDP
0     Afghanistan  1960              NaN                 NaN           NaN
1         Albania  1960              NaN                 NaN           NaN
2         Algeria  1960              NaN                 NaN           NaN
3  American Samoa  1960              NaN                 NaN           NaN
4         Andorra  1960              NaN                 NaN           NaN
             Country  Year  Gender Equality  % Undernourishment  AgriShareGDP
67519       Zimbabwe  1896              NaN                 NaN           NaN
67520       Zimbabwe  1897              NaN                 NaN           NaN
67521       Zimbabwe  1898              NaN                 NaN           NaN
67522       Zimbabwe  1899              NaN                 NaN           NaN
67523  Åland Islands  2015              NaN                 NaN           NaN


In [7]:
merged_df.dtypes

Country                       object
Year                           int32
Gender Equality              float64
% Undernourishment           float64
AgriShareGDP                 float64
CreditToAgriFishForest       float64
EmploymentRural              float64
GDP                          float64
%EmploymentAgriFishForest    float64
TotalAgri                    float64
Gender Inequality            float64
% Soldiers                   float64
Marriage Rate                float64
Birth Rate                   float64
Death Rate                   float64
Homicides                    float64
Life Expectancy              float64
Maternal Death Risk          float64
Literacy Rate                float64
Infant Mortality             float64
% Population Growth          float64
% Rural Population           float64
Suicide Rate                 float64
Gini                         float64
Civil Liberties              float64
Freedom of Expression        float64
% Healthcare Investment      float64
%