## IMPORTS

We will need a series of libraries to help us with the data analysis.

The most important ones are pandas, to work with dataframes; numpy, which offers excellent tools to analyze the information; and matplotlib and seaborn, which provide many options to plot the data.

We also use some other libraries: os to read the databases from our folder, warnings to issue warnings when needed, and requests.

In [2]:
#IMPORTS

import pandas as pd
import numpy as np
import requests
import seaborn as sns
import matplotlib.pyplot as plt
import os
import warnings

from os import listdir
from os.path import isfile, join


from Project.Utils.preprocess import preprocess
from Project.Utils.read_databases import read_databases
from Project.Utils.merge_data import merge_data

## CONSTANTS AND STRUCTURES

To keep our code readable, clean and easy to maintain and modify, we keep most of our constants and data structures here.

Some of them are strings that specify paths or column names, numbers that delimit ranges, or boolean variables that determine the behaviour of some parts of the code. All of these can be changed at will to modify the scope of the analysis and customize the output.

Some mutable structures are kept here as well, since they are intended to behave like constants: the dictionary that translates the different country column names into a single common title, and the list of sources that have a predefined special preprocessing procedure.

We also include other mutable structures that are intended to have a global scope, such as the lists of urls and dataframes.

In [21]:
#CONSTANTS AND STRUCTURES TO BE USED

read_path = os.path.join(os.getcwd(), 'Databases') #Path to your databases folder to be read
write_path = os.path.join(os.getcwd(), 'Output') #Path to the folder where you want to store the dataframes

verbose = False #Determines whether full information will be printed or only the most relevant

column_country = 'Country' #Name for the column containing the name of the countries
column_year = 'Year' #Name for the column containing the years
columns_index = [column_country, column_year] #List with all the values for the columns that will be used as indexes
columns_rename = dict.fromkeys(['Area', 'Entity', 'Country or Area', 'Name', 'Country Name'], column_country) #Dictionary to rename all the index columns so they have a common name

year_min = 1990
year_max = 2050
year_range = (year_min, year_max)

special_source = ['databank', 'faostat', 'kaggle', 'un_data', 'worldbank', 'WID'] #List with all the sources that need special treatment
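
As a minimal sketch of how `columns_rename` is meant to be used, the example below applies it to two toy dataframes whose country column has a different header in each. The dataframes `df_a` and `df_b` are hypothetical and only serve as illustration.

```python
import pandas as pd

column_country = 'Country'
# Every alternative header maps to the same target name
columns_rename = dict.fromkeys(['Area', 'Entity', 'Country or Area', 'Name', 'Country Name'], column_country)

# Hypothetical toy dataframes, each using a different header for the country column
df_a = pd.DataFrame({'Entity': ['Spain', 'France'], 'Year': [2000, 2000]})
df_b = pd.DataFrame({'Country Name': ['Spain'], 'Year': [2001]})

# rename ignores keys that are not present, so the same dictionary works for every source
renamed = [df.rename(columns = columns_rename) for df in (df_a, df_b)]
print([list(df.columns) for df in renamed])
```

Since `rename` skips the keys it does not find, a single dictionary covers all the sources without checking first which header each file uses.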


## INDICATORS

Since our data comes from many different sources, it is convenient to keep all our indicators in a structure that allows us to quickly locate and identify them when reading the files. Since most of them have long, overly descriptive names, we use a dictionary that maps each original name to a short alias, so we can later rename our columns and make them easier to understand.

In [22]:
#INDICATORS
#Dictionary containing all the indicators to be used, and the aliases for the columns where they will be represented

indicators = {
    
    #databank
    'CPIA gender equality rating (1=low to 6=high)': 'Gender Equality',
    'Prevalence of undernourishment (% of population)': '% Undernourishment',
    
    #faostat
    'Credit to Agriculture, Forestry and Fishing': 'CreditToAgriFishForest',
    '2.a.1 Agriculture value added share of GDP (%)': 'AgriShareGDP',
    'Employment by status of employment, total, rural areas': 'EmploymentRural',
    'Gross Domestic Product': 'GDP',
    'Share of employment in agriculture, forestry and fishing in total employment': '%EmploymentAgriFishForest',
    'Agriculture': 'TotalAgri',
    
    #kaggle
    'Gender Inequality': 'Gender Inequality',
    
    #ourworldindata
    'Armed forces personnel (% of total labor force)': '% Soldiers', 
    'Crude marriage rate (per 1,000 inhabitants)': 'Marriage Rate',
    
    #theworldbank  
    'Birth rate, crude (per 1,000 people)': 'Birth Rate',
    'Death rate, crude (per 1,000 people)': 'Death Rate',
    'Intentional homicides (per 100,000 people)': 'Homicides',
    'Life expectancy at birth, total (years)': 'Life Expectancy',
    'Lifetime risk of maternal death (%)': 'Maternal Death Risk',
    'Literacy rate, adult total (% of people ages 15 and above)': 'Literacy Rate',
    'Mortality rate, infant (per 1,000 live births)': 'Infant Mortality',
    'Population growth (annual %)': '% Population Growth',
    'Rural population (% of total population)': '% Rural Population',
    'Suicide mortality rate (per 100,000 population)': 'Suicide Rate',
    
    #UN_Data
    'Value': 'Gini',
    
    #WID
    #(To Do)
    
    #WorldInData
    'civlib_vdem_owid': 'Civil Liberties',
    'Employment-to-population ratio, men (%)': '% Men Employment',
    'Employment-to-population ratio, women (%)': '% Women Employment',
    'Population (historical estimates)': 'Population',
    'freeexpr_vdem_owid': 'Freedom of Expression',
    'Indicator:Domestic general government health expenditure (GGHE-D) as percentage of general government expenditure (GGE) (%)': '% Healthcare Investment',
    'Industry as % of total employment -- ILO modelled estimates, May 2017': '% Employment Industry',
    'UIS: Mean years of schooling of the population age 25+. Female': 'Women Schooling Years',
    'UIS: Mean years of schooling of the population age 25+. Male': 'Men Schooling Years',
    'Government expenditure on education, total (% of government expenditure)': '% Education Expenditure'

}
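
A minimal sketch of how such a dictionary can be applied: the indicator columns are renamed to their aliases, and every column that is neither an index column nor an alias is dropped. The toy dataframe and the `Series Code` column are hypothetical, and only two entries of the dictionary are reproduced here.

```python
import pandas as pd

columns_index = ['Country', 'Year']
# Two entries taken from the dictionary above
indicators = {
    'Birth rate, crude (per 1,000 people)': 'Birth Rate',
    'Death rate, crude (per 1,000 people)': 'Death Rate',
}

# Hypothetical toy dataframe with one column that is not an indicator
df = pd.DataFrame({
    'Country': ['Spain', 'Spain'],
    'Year': [2000, 2001],
    'Birth rate, crude (per 1,000 people)': [9.8, 9.9],
    'Series Code': ['X', 'X'],
})

# Shorten the indicator names, then keep only index columns and known aliases
df = df.rename(columns = indicators)
keep = columns_index + list(indicators.values())
df = df.drop(columns = df.columns.difference(keep))
print(list(df.columns))
```

Assigning the result of `rename` and `drop` back to `df`, instead of using `inplace = True`, avoids modifying a dataframe that may be a slice of another one.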

## AUXILIARY METHODS

Again, as a means to keep the code tidy, readable and easy to modify if needed we have created a series of auxiliary methods separated from the main chunk of code.

## RENAME_VALUE_COLUMN
The first method renames the column that contains the values of a given indicator, giving it the name of that indicator. The value column and the indicator column can either be passed explicitly or, if omitted, be searched for among a list of common names.

The method returns the modified dataframe or, if inplace is set, modifies the dataframe passed as a parameter. It raises an exception if it is unable to find the column names.

In [23]:
def rename_value_column(dataframe, column_value = None, column_indicator = None, row_index = 1, inplace = False):
    """
        Method that takes a dataframe and renames the value column with the name of the indicator
        
        PARAMETERS:
            dataframe: dataframe
                the dataframe to be modified
            column_value: str, default None
                the name of the column that contains the values and whose name will be changed. If not specified, it will try to search for it
            column_indicator: str, default None
                the name of the column that contains the name of the indicator of the dataframe. If not specified, it will try to search for it
            row_index: int, default 1
                the position of the row from which the name of the indicator is read
            inplace: bool, default False
                determines if the changes will be made in the same dataframe or returned as a result
        
        RETURNS:
            DataFrame or None
                If inplace = False, it will return the modified dataframe. Else, the return will be None and the given dataframe will be modified
        
        RAISES:
            Exception
                If either column_value or column_indicator was not specified, and it was unable to find them itself 
    """
    
    #Possible names to search for when a column name is not specified
    col_values = ['Value']
    col_indicators = ['Item', 'Indicator']
    
    #Try to find the indicator column
    if not column_indicator:
        for indicator in col_indicators:
            if indicator in dataframe:
                column_indicator = indicator
    
    #Try to find the value column
    if not column_value:
        for value in col_values:
            if value in dataframe:
                column_value = value
    
    if not column_indicator:
        raise Exception('Unable to determine indicator column')
    if not column_value:
        raise Exception('Unable to determine value column')
    
    #The new name of the value column is the indicator name found in the given row
    new_name = dataframe[column_indicator].iloc[row_index]
    
    if inplace:
        dataframe.rename(columns = {column_value: new_name}, inplace = True)
        return None
    
    #rename returns a new dataframe, so its result must be the one returned
    return dataframe.rename(columns = {column_value: new_name})

## PREPROCESS

The second method takes care of the bulk of the preprocessing. Given a dataframe, it normalizes the names of the index columns so that the dataframes can later be merged, removes unnecessary columns, shortens the names of the indicators, and melts the dataframe or applies other changes depending on the specified protocol.

It returns the preprocessed dataframe, ready to be merged and treated. The code of this function has been migrated to a Visual Studio Python Module, and is imported above from Project.Utils.preprocess.
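
Since the code of the module is not shown in this notebook, the following is only a rough sketch of the kind of steps described above, not the actual implementation of `preprocess`. It assumes a hypothetical wide table with one column per year, melts it into one row per country and year, and then normalizes the column names.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({
    'Country Name': ['Spain', 'France'],
    'Indicator': ['Population growth (annual %)'] * 2,
    '2000': [0.4, 0.7],
    '2001': [0.5, 0.7],
})

# Melt the year columns into rows, so each row is a (country, year) pair
long = wide.melt(id_vars = ['Country Name', 'Indicator'], var_name = 'Year', value_name = 'Value')

# Normalize the index column name and give the value column the short indicator alias
long = long.rename(columns = {'Country Name': 'Country', 'Value': '% Population Growth'})
long = long.drop(columns = 'Indicator')
print(long)
```

After these steps every source shares the same layout, index columns plus one column per indicator, which is what the merge stage relies on.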

## MAIN CODE

The following cells of code will contain the main part of the code, and will be divided according to the different stages of the data processing they take care of.

## READING

The first step explores the directory specified in the string read_path, checks which files are .csv, and tries to read and preprocess them. Those successfully read and processed are stored in a dictionary with their url (the file path) as a key. The dataframes and their urls are also stored in two separate lists.

For those .csv files that cannot be read or processed, we store their url in a discarded list, and issue a warning if there is any faulty url.

In [27]:
#||||||||||START OF MAIN CODE|||||||||||||

#Explore all the files from the specified directory in read_path
#If the file is a .csv, try to read it, preprocess it and append it to the list of dataframes
#If any error is raised during the process, it appends the url to the list of discarded files and shows a warning at the end

url_list = [] #List with all the urls for the dataframes successfully read
data_list = [] #List containing all the dataframes successfully read
data_dict = {} #Dictionary that correlates the urls with their read databases
discarded_urls = [] #List with all the urls whose dataframes could not be read

for element in listdir(read_path):
    url = join(read_path, element)
    if isfile(url) and url.endswith('.csv'):
        #WID files are not processed yet
        if 'WID' in url:
            continue
        try:
            dataframe = pd.read_csv(url)
        except Exception as e:
            print('Unable to read dataframe: ' + url)
            print(e)
            discarded_urls.append(url)
        else:
            #Check whether the file comes from a source that needs special treatment
            special = None
            for source in special_source:
                if source in url.lower():
                    special = source
                    break
            
            try:
                dataframe = preprocess(dataframe, columns_index, columns_rename, treatment = special)
            except Exception as e:
                print('Unexpected error when preprocessing the dataframe: ' + url)
                print(e)
                discarded_urls.append(url)
            else:
                #Only the files read and preprocessed without errors are stored
                url_list.append(url)
                data_list.append(dataframe)
                data_dict[url] = dataframe

if discarded_urls:
    warnings.warn('Unable to read the following files:\n' + '\n'.join(discarded_urls))
                    
if (verbose):
    for data in data_list:
        print(data)
        print('\n' + '----------------------------------------------------------' + '\n')
else:
    print('Done')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe.rename(columns = columns_rename, inplace = True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe.rename(columns = indicators, inplace = True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe.drop(dataframe.columns.difference(list(columns_index) + list(indicators.values())), axis = 1, inplace=True)

Done


## MERGE

This step integrates all the databases into one. First, the Country and Year columns of every dataframe must be cast to strings, so that all of them share the same key type when merging. After this, the merge is performed with how = "outer" so that no data is lost: rows that appear in only one of the dataframes are kept.
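
The effect of the cast and of the outer join can be seen in a small sketch with two toy dataframes. The dataframes `left` and `right` and their values are hypothetical; one stores the year as an integer and the other as a string.

```python
import pandas as pd

columns_index = ['Country', 'Year']

# Hypothetical toy dataframes: one source stores the year as int, the other as str
left = pd.DataFrame({'Country': ['Spain', 'France'], 'Year': [2000, 2000], 'GDP': [1.0, 2.0]})
right = pd.DataFrame({'Country': ['Spain', 'Italy'], 'Year': ['2000', '2000'], 'Gini': [34.0, 35.0]})

# Cast the key columns to str so both sides share the same key type
left = left.astype({'Country': str, 'Year': str})
right = right.astype({'Country': str, 'Year': str})

# The outer join keeps the rows found in only one side, filling the gaps with NaN
merged = pd.merge(left, right, on = columns_index, how = 'outer')
print(merged)
```

Spain appears in both dataframes and ends up in a single row with both indicators, while France and Italy are kept with a missing value in the column they lack.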

In [28]:
final_df = data_list[0]
print(len(data_list))

final_df[column_year]= final_df[column_year].astype(str)
final_df[column_country]= final_df[column_country].astype(str)

#Merge all the different databases into one single dataframe with the format: columns_index + indicators
for data in data_list[1:]:
    data[column_year]= data[column_year].astype(str)
    data[column_country]= data[column_country].astype(str)
    final_df = pd.merge(final_df, data, on = columns_index, how = "outer")

final_df.to_csv(join(write_path, 'Merged_File.csv'))

#final_df contains the whole merged dataframe

30


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[column_year]= data[column_year].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[column_country]= data[column_country].astype(str)


In [30]:
dict_country = {} #Ad-hoc dictionary to count the number of entries for each country
dict_df_countries = {} #Dictionary that relates each country to its dataframe
VALUE = 1
THRESHOLD = 15 #Minimum number of entries a country needs to have to be included into our research


#Counts all the entries for each country and stores into the dictionary with countries as keys and the count as its associated value    
for country in final_df[column_country]:
    if country not in dict_country:
        dict_country[country] = VALUE
    else:
        dict_country[country] += VALUE

#Removes all the countries that do not meet the minimum entries requirement
dict_country = {country:num for country, num in dict_country.items() if num >= THRESHOLD}
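
As an alternative sketch, the same count and filter can be written with `value_counts`. The toy dataframe `toy_df` is a hypothetical stand-in for `final_df`, and the threshold is lowered so the effect is visible on a few rows.

```python
import pandas as pd

THRESHOLD = 2 # Lowered for the toy data; the notebook uses 15

# Hypothetical toy stand-in for final_df
toy_df = pd.DataFrame({'Country': ['Spain', 'Spain', 'France', 'Spain', 'Italy', 'Italy']})

# Count the entries per country and keep only those that reach the threshold
counts = toy_df['Country'].value_counts()
dict_country = counts[counts >= THRESHOLD].to_dict()
print(dict_country)
```

The resulting dictionary has the same shape as the one built by the loop above: countries as keys and their number of entries as values.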




In [31]:
for country in dict_country.keys():
    df = final_df.loc[final_df[column_country] == country]
    dict_df_countries[country] = df
    


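dat = dict_df_countries['Spain'] #Assumption: dat refers to the dataframe of Spain, as in the original call to corr()
cor = dat.corr()

#Correlation of every indicator with the GDP, sorted from highest to lowest
corcolumn = cor[['GDP']].sort_values(by = 'GDP', ascending = False).style.background_gradient()
display(corcolumn)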


Unnamed: 0,AgriShareGDP,CreditToAgriFishForest,EmploymentRural,GDP,%EmploymentAgriFishForest,TotalAgri,% Soldiers,Marriage Rate,Gini,Civil Liberties,Freedom of Expression,% Healthcare Investment,% Employment Industry,Women Schooling Years,Men Schooling Years,% Education Expenditure,% Men Employment,% Women Employment,Population
AgriShareGDP,1.0,0.236294,-0.152431,-0.904849,0.79896,-0.603719,0.714219,0.736451,-0.516946,-0.157728,0.020233,-0.632145,0.51879,-0.673603,-0.718923,0.238292,0.337664,-0.888028,-0.783885
CreditToAgriFishForest,0.236294,1.0,-0.640325,-0.123719,0.123307,-0.019841,0.191381,0.68414,-0.553991,-0.188009,-0.252563,0.22549,0.544474,-0.799031,-0.807203,0.436329,0.753591,-0.298103,-0.173631
EmploymentRural,-0.152431,-0.640325,1.0,0.122702,-0.132824,-0.07892,-0.221481,-0.184924,-0.262532,0.639827,0.481346,-0.300183,0.026694,-0.693691,-0.602671,0.298925,0.562543,0.923712,0.032014
GDP,-0.904849,-0.123719,0.122702,1.0,-0.885823,0.932739,-0.863263,-0.867742,0.308085,0.022275,-0.441217,0.81104,-0.719672,0.567841,0.628994,-0.031818,-0.023638,0.928638,0.96038
%EmploymentAgriFishForest,0.79896,0.123307,-0.132824,-0.885823,1.0,-0.821268,0.859803,0.833497,-0.369667,-0.024893,0.483784,-0.82195,0.732355,-0.850764,-0.869289,-0.068535,-0.002513,-0.951783,-0.929871
TotalAgri,-0.603719,-0.019841,-0.07892,0.932739,-0.821268,1.0,-0.784887,-0.837648,0.127551,-0.177498,-0.523944,0.784139,-0.725804,0.522305,0.572789,-0.137206,-0.091318,0.835455,0.902258
% Soldiers,0.714219,0.191381,-0.221481,-0.863263,0.859803,-0.784887,1.0,0.813335,-0.280656,0.114053,0.546821,-0.838217,0.728046,-0.993816,-0.976561,0.207282,0.052458,-0.881826,-0.910589
Marriage Rate,0.736451,0.68414,-0.184924,-0.867742,0.833497,-0.837648,0.813335,1.0,-0.675958,0.153672,0.582182,-0.623379,0.952159,-0.960551,-0.948803,0.32312,0.471184,-0.773397,-0.9426
Gini,-0.516946,-0.553991,-0.262532,0.308085,-0.369667,0.127551,-0.280656,-0.675958,1.0,-0.370507,-0.435657,0.153825,-0.673945,0.936287,0.892534,-0.520646,-0.80615,0.129348,0.473858
Civil Liberties,-0.157728,-0.188009,0.639827,0.022275,-0.024893,-0.177498,0.114053,0.153672,-0.370507,1.0,0.744908,-0.122613,0.410987,-0.432933,-0.335825,0.799864,0.629884,0.296217,-0.188267
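
To make the ranking step easier to follow, here is a minimal sketch on toy data. The dataframe `toy` and its values are hypothetical and chosen so that one column grows with the GDP and the other decreases.

```python
import pandas as pd

# Hypothetical toy data for a single country
toy = pd.DataFrame({
    'GDP': [1.0, 2.0, 3.0, 4.0],
    'Population': [10.0, 20.0, 30.0, 40.0],
    '% Soldiers': [4.0, 3.0, 2.0, 1.0],
})

# Pearson correlation of every column with GDP, strongest positive first
ranking = toy.corr()['GDP'].sort_values(ascending = False)
print(ranking)
```

Values close to 1 or -1 indicate a strong linear relationship with the GDP, while values close to 0 indicate a weak one.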


In [None]:
for country in range(-10000, 2050):
    df = final_df.loc[final_df[column_country] == country]
    dict_df_countries[country] = df