## IMPORTS

We will need a series of libraries to aid us with the data analysis.

The most important ones will be pandas to be able to work with dataframes, numpy as it offers excellent tools to analyze the information, and matplotlib and seaborn which will provide with many options to print the data.

But also, we use some other libraries such as os to read the databases from our folder, warnings to be able to raise warnings when needed, and requests.

In [34]:
#IMPORTS

import pandas as pd
import numpy as np
import requests
import seaborn as sns
import matplotlib.pyplot as plt
import os
import warnings

from os import listdir
from os.path import isfile, join

from Project.Utils.read_data import read_data
from Project.Utils.preprocess import preprocess
from Project.Utils.normalize import normalize

## CONSTANTS AND STRUCTURES

In order to keep our code not only readable and clean, but also easy to mantain and modify, we will keep most of our constants and data structures here.

Some of them will be strings to specify paths or names of columns, numbers to delimit ranges, or boolean variables to determine the behaviour of some parts of the code. All of these can be changed at will to modify the extension of the analysis and customize the output.

Some mutable structures will be kept here as well, since they are intended to behave like constants, like the dictionary to translate some diverse column names to a same given title, or a list with all the special treatment sources that have a predefined preprocessing procedure.

Also, we will include some other mutable structures that are intended to have a global scope, like the list of urls, dataframes and so on.

In [48]:
#CONSTANTS AND STRUCTURES TO BE USED

read_path = os.getcwd() + '\Databases' #Path to your databases folder to be read
write_path = os.getcwd() + '\Output' #Path to the folder you want to store the dataframes

url_list = [] #List with all the urls for the dataframes successfully read
data_list = [] #List containing all the dataframes succesfully read
data_dict = {} #Dictionary that correlates the urls with their read databases
discarded_urls = [] #List with all the urls whose dataframes could not be read

verbose = False #Determines wether full information will be printed or only the most relevant

column_country = 'Country' #Name for the column containing the name of the countries
column_year = 'Year' #Name for the column containing the years
columns_index = [column_country, column_year] #List with all the values for the columns that will be used as indexes
columns_rename = dict.fromkeys(['Area', 'Entity', 'Country or Area', 'Name', 'Country Name'], column_country) #Dictionary to rename all the index columns so they have a common name

year_min = 1990
year_max = 2050
year_range = (year_min, year_max)

special_source = ['databank', 'faostat', 'kaggle', 'un_data', 'worldbank', 'WID'] #List with all the sources that need special treatment


## INDICATORS

Since our data comes from many different sources, it is convenient we keep all our indicators in a structure that will allow us to quickly locate them and identify them when reading the files. Since most of them have long, over-describing names, we will use a dictionary so we can later rewrite the title of our columns and make them easier to understand.

In [49]:
#INDICATORS
#Dictionary containing all the indicators to be used, and the aliases for the columns where they will be represented

indicators = {
    
    #databank
    'CPIA gender equality rating (1=low to 6=high)': 'Gender Equality',
    'Prevalence of undernourishment (% of population)': '% Undernourishment',
    
    #faostat
    'Credit to Agriculture, Forestry and Fishing': 'CreditToAgriFishForest',
    '2.a.1 Agriculture value added share of GDP (%)': 'AgriShareGDP',
    'Employment by status of employment, total, rural areas': 'EmploymentRural',
    'Gross Domestic Product': 'GDP',
    'Share of employment in agriculture, forestry and fishing in total employment': '%EmploymentAgriFishForest',
    'Agriculture': 'TotalAgri',
    
    #kaggle
    'Gender Inequality': 'Gender Inequality',
    
    #ourworldindata
    'Armed forces personnel (% of total labor force)': '% Soldiers', 
    'Crude marriage rate (per 1,000 inhabitants)': 'Marriage Rate',
    
    #theworldbank  
    'Birth rate, crude (per 1,000 people)': 'Birth Rate',
    'Death rate, crude (per 1,000 people)': 'Death Rate',
    'Intentional homicides (per 100,000 people)': 'Homicides',
    'Life expectancy at birth, total (years)': 'Life Expectancy',
    'Lifetime risk of maternal death (%)': 'Maternal Death Risk',
    'Literacy rate, adult total (% of people ages 15 and above)': 'Literacy Rate',
    'Mortality rate, infant (per 1,000 live births)': 'Infant Mortality',
    'Population growth (annual %)': '% Population Growth',
    'Rural population (% of total population)': '% Rural Population',
    'Suicide mortality rate (per 100,000 population)': 'Suicide Rate',
    
    #UN_Data
    'Value': 'Gini',
    
    #WID
    #(To Do)
    
    #WorldInData
    'civlib_vdem_owid': 'Civil Liberties',
    'Employment-to-population ratio, men (%)': '% Men Employment',
    'Employment-to-population ratio, women (%)': '% Women Employment',
    'Population (historical estimates)': 'Population',
    'freeexpr_vdem_owid': 'Freedom of Expression',
    'Indicator:Domestic general government health expenditure (GGHE-D) as percentage of general government expenditure (GGE) (%)': '% Healthcare Investment',
    'Industry as % of total employment -- ILO modelled estimates, May 2017': '% Employment Industry',
    'UIS: Mean years of schooling of the population age 25+. Female': 'Women Schooling Years',
    'UIS: Mean years of schooling of the population age 25+. Male': 'Men Schooling Years',
    'Government expenditure on education, total (% of government expenditure)': '% Education Expenditure'

}

## AUXILIARY METHODS

Again, as a means to keep the code tidy, readable and easy to modify if needed we have created a series of auxiliary methods separated from the main chunk of code.

## RENAME_VALUE_COLUMN
The first method will allow us to rename the column that contains the different values of a given indicator. We rename the title of that column to such indicator, either based on the previous name of said column and the one that contains the indicator, either by searching for those.

The method will return the modified dataframe or simply modify the given one as a parameter. It will raise an exception if unable to find the column names.

In [4]:
def rename_value_column(dataframe, column_value = None, column_indicator = None, row_index = 1, inplace = False):
    """
        Method that takes a dataframe and renames the value column with the name of the indicator
        
        PARAMETERS:
            dataframe: dataframe
                the dataframe to be modified
            column_value: str, default None
                the name of the column that contains the values and whose name will be changed. If not specified, it will try to search for it
            column_indicator: str, default None
                the name of the column that contains the name of the indicator of the dataframe. If not specified, it will try to search for it
            inplace: bool, default False
                determines if the changes will be made in the same dataframe or returned as a result
        
        RETURNS:
            DataFrame or None
                If inplace = False, it will return the modified dataframe. Else, the return will be None and the dataframe
        
        RAISES:
            Exception
                If either column_value or column_indicator was not specified, and it was unable to find them itself 
    """
    
    #If no column name specified, iterate over the list of possible names.
    col_values = ['Value']
    col_indicators = ['Item', 'Indicator']
    
    #Try to find
    if not column_indicator:
        for indicator in col_indicators:
            if indicator in dataframe:
                column_indicator = indicator
    
    if not column_value:
        for value in col_values:
            if value in dataframe:
                column_value = value             
       
    if not column_indicator:
        raise Exception('Unable to determine indicator column')
    if not column_value:
        raise Exception('Unable to determine value column')
        
    dataframe.rename(columns = {column_value: dataframe.loc[:, column_indicator][1]}, inplace = inplace)
    
    return dataframe if not inplace else None

## PREPROCESS

The second method will take care of the gross preprocessing procedure. Given a dataframe, it will normalize the name of the index columns so they can be later merged together, remove unnecessary columns, shorten the names of the indicators, and melt or apply other changes depending on the specified protocol.

It will return the modified and preprocessed dataframe, ready to be merged and treated.

In [5]:
def preprocess (dataframe, treatment = '', melt_on_value = None, rename_value_columns = False, inplace = False):
    
    """
        Take a dataframe, rearrange its columns or rows, rename them if needed, and return the resulting dataframe
        
        PARAMETERS:
            dataframe: dataframe
                the dataframe to be modified
            treatment: str, default ''
                a flag-like string to indicate that the dataframe must be treated according to a predefined protocol
            melt_on_value: str, default None
                determines wether the dataframe must be melted on the value specified as the string. If not specified, it will ignore it unless the treatment determines otherwise
            rename_value_columns: bool, default False
                determines wether the dataframe has a column with values whose header needs to be renamed. If not specified, it will ignore it unless the treatment determines otherwise
            inplace: bool, default False
                determines if the changes will be made in the same dataframe or returned as a result
        
        RETURNS:
            DataFrame
                Return the modified dataframe  
    """
    
    match treatment:
        
        case 'databank':
            melt_on_value = dataframe.loc[:, 'Series Name'][1]
            dataframe.drop(['Series Name', 'Series Code', 'Country Code'], axis=1, inplace = True)
        
        case 'faostat':
            rename_value_columns = True
        
        case 'kaggle':            
            dataframe.drop(['HDI Rank'], axis=1, inplace = True)
            melt_on_value = 'Gender Inequality'
        
        case 'un_data':
            dataframe = dataframe[pd.to_numeric(dataframe[column_year], errors='coerce').notnull()]

        case 'worldbank':
            melt_on_value = dataframe.loc[:, 'Series Name'][1]
            dataframe.drop(['Series Name', 'Series Code', 'Country Code'], axis=1, inplace = True)
            
    dataframe.rename(columns = columns_rename, inplace = True)
    
    if rename_value_columns:
        rename_value_column(dataframe, inplace = True)
    
    if melt_on_value:
        dataframe = pd.melt(dataframe, id_vars=column_country, var_name = column_year, value_name = melt_on_value)
    
    for value in dataframe[column_year]: #Normalize year format
                if type(value) is not int and len(value) > 4:
                    dataframe[column_year].replace({value: str(value[:4])}, inplace = True)
    
    for column in dataframe.columns: #Drop completely empty columns
                if (len(dataframe.loc[:, column].value_counts()) == 1):
                    dataframe.drop(column, axis=1, inplace = True) 
    
    #Shorten indicators column name and remove all the other columns except for the index columns
    dataframe.rename(columns = indicators, inplace = True)
    dataframe.drop(dataframe.columns.difference(columns_index + list(indicators.values())), axis = 1, inplace=True)
    
    #Remove rows with no country
    dataframe.dropna(subset=column_country, inplace=True)
    
    #Normalize all countries name, removing blank spaces before and after the string
    dataframe[column_country] = dataframe[column_country].str.strip()
    dataframe.replace(['..'], '', inplace=True)
    
    #Narrow the range of the data to the years selected
    dataframe[column_year]= dataframe[column_year].astype(int)
    dataframe.drop(dataframe[dataframe[column_year] < year_min].index, inplace=True)
    
    return dataframe


## MAIN CODE

The following cells of code will contain the main part of the code, and will be divided according to the different stages of the data processing they take care of.

## READING

The first step will explore the directory specified in the str read_path, check what files are .csv, and try to read and preprocess them. Those succesfully read and processed will be stored in a dictionary with their url as a key. Both of them, dataframes and url, will also be stored in two lists.

For those .csv that cannot be read or processed, we will store their url in a discarded list, and raise a warning should there be any faulty url.

In [56]:
data_dict, _ = read_data(read_path)

for url, df in data_dict.items():
    try:
        preprocess(url = url, df = df, columns_index = columns_index, columns_rename = columns_rename, inplace = True)
        #print(df.columns)
        normalize(df = df, columns_index = columns_index, inplace = True)
    except:
        print('ERROR')
    else:
        print(url)
        print(df)


ERROR
ERROR
c:\Users\vperezlo\Documents\GitHub\python-data-driven-decisions\Databases\FAOSTAT_agriculture-share-gdp.csv
          Country  Year      Gini
0     Afghanistan  2001  54.06300
1     Afghanistan  2002  45.13440
2     Afghanistan  2003  41.90340
3     Afghanistan  2004  35.61280
4     Afghanistan  2005  35.14760
...           ...   ...       ...
4935     Zimbabwe  2016   7.87399
4936     Zimbabwe  2017   8.34095
4937     Zimbabwe  2018   8.30469
4938     Zimbabwe  2019   8.17322
4939     Zimbabwe  2020  10.93630

[4940 rows x 3 columns]
c:\Users\vperezlo\Documents\GitHub\python-data-driven-decisions\Databases\FAOSTAT_AgricultureOrientationIndex-CreditToAgri-Forestry-Fishing.csv
          Country  Year      Gini
0     Afghanistan  2008  0.004207
1     Afghanistan  2009  0.017894
2     Afghanistan  2010  0.028466
3     Afghanistan  2011  0.025143
4     Afghanistan  2012  0.072139
...           ...   ...       ...
2591       Zambia  2018  6.076848
2592       Zambia  2019  5.1679

# Merge

This is the process of integration of all the databases. In first place, all the column of Country and Year must be casted to a string in order to avoid errors. After this the merge function is performed with the 'outer' parameter to avoid losing data.

In [55]:
final_df = data_list[0]

final_df[column_year]= final_df[column_year].astype(str)
final_df[column_country]= final_df[column_country].astype(str)

#Merge all the different databases into one single dataframe with the format: columns_index + indicators
for data in data_list[1:]:
    data[column_year]= data[column_year].astype(str)
    data[column_country]= data[column_country].astype(str)
    final_df = pd.merge(final_df, data, on = columns_index, how = "outer")

#final_df contains the whole merged dataframe

IndexError: list index out of range

In [9]:
dict_country = {} #Ad-hoc dictionary to count the number of entries for each country
dict_df_countries = {} #Dictionary that relates each country to its dataframe
VALUE = 1
THRESHOLD = 15 #Minimum number of entries a country needs to have to be included into our research


#Counts all the entries for each country and stores into the dictionary with countries as keys and the count as its associated value    
for country in final_df[column_country]:
    if country not in dict_country:
        dict_country[country] = VALUE
    else:
        dict_country[country] += VALUE

#Removes all the countries that do no meet the minimun entries requirement        
dict_country = {country:num for country, num in dict_country.items() if num >= THRESHOLD}




In [17]:
for country in dict_country.keys():
    df = final_df.loc[final_df[column_country] == country]
    dict_df_countries[country] = df
    

#dict_df_countries['Spain'].columns
dict_df_countries['Spain'][dict_df_countries['Spain'].columns[2:]] = dict_df_countries['Spain'][dict_df_countries['Spain'].columns[2:]].apply(pd.to_numeric, errors='ignore')
dict_df_countries['Spain'].corr()
for element in dict_df_countries['Spain']['% Undernourishment']:
    print(type(element))
#dict_df_countries['Spain']['% Undernourishment']


<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dict_df_countries['Spain'][dict_df_countries['Spain'].columns[2:]] = dict_df_countries['Spain'][dict_df_countries['Spain'].columns[2:]].apply(pd.to_numeric, errors='ignore')


In [19]:
dict_years = {}

for year in range(2010, 2050):
    for country in dict_df_countries.keys():
        print(dict_df_countries[country].loc[final_df[column_country] == country])
        break
        


          Country  Year Gender Equality % Undernourishment  AgriShareGDP  \
0     Afghanistan  1990                                              NaN   
217   Afghanistan  1991                                              NaN   
434   Afghanistan  1992                                              NaN   
651   Afghanistan  1993                                              NaN   
868   Afghanistan  1994                                              NaN   
1085  Afghanistan  1995                                              NaN   
1302  Afghanistan  1996                                              NaN   
1519  Afghanistan  1997                                              NaN   
1736  Afghanistan  1998                                              NaN   
1953  Afghanistan  1999                                              NaN   
2170  Afghanistan  2000                                              NaN   
2387  Afghanistan  2001                               47.8       54.0630   
2604  Afghan

          Country  Year Gender Equality % Undernourishment  AgriShareGDP  \
0     Afghanistan  1990                                              NaN   
217   Afghanistan  1991                                              NaN   
434   Afghanistan  1992                                              NaN   
651   Afghanistan  1993                                              NaN   
868   Afghanistan  1994                                              NaN   
1085  Afghanistan  1995                                              NaN   
1302  Afghanistan  1996                                              NaN   
1519  Afghanistan  1997                                              NaN   
1736  Afghanistan  1998                                              NaN   
1953  Afghanistan  1999                                              NaN   
2170  Afghanistan  2000                                              NaN   
2387  Afghanistan  2001                               47.8       54.0630   
2604  Afghan