# Guided Project: Web Data Pipeline

## Overview

In this project I analyze several basic indicators for every country in the world, in order to compare the countries on the main dimensions of freedom and to see how those dimensions have evolved over time.
The full official report is available at the following link: https://object.cato.org/sites/cato.org/files/human-freedom-index-files/human-freedom-index-2018-revised.pdf

## Acquisition

I downloaded the file "The Human Freedom Index" from Kaggle. It provides a global measure of personal, civil, and economic freedom.

Once the file was downloaded, I opened it in Jupyter Notebook to start working with it.

In [50]:
import pandas as pd
import numpy as np

data = pd.read_csv('hfi_cc_2018.csv')
data.head()

Unnamed: 0,year,ISO_code,countries,region,pf_rol_procedural,pf_rol_civil,pf_rol_criminal,pf_rol,pf_ss_homicide,pf_ss_disappearances_disap,...,ef_regulation_business_bribes,ef_regulation_business_licensing,ef_regulation_business_compliance,ef_regulation_business,ef_regulation,ef_score,ef_rank,hf_score,hf_rank,hf_quartile
0,2016,ALB,Albania,Eastern Europe,6.661503,4.547244,4.666508,5.291752,8.920429,10.0,...,4.050196,7.324582,7.074366,6.705863,6.906901,7.54,34.0,7.56814,48.0,2.0
1,2016,DZA,Algeria,Middle East & North Africa,,,,3.819566,9.456254,10.0,...,3.765515,8.523503,7.029528,5.676956,5.268992,4.99,159.0,5.135886,155.0,4.0
2,2016,AGO,Angola,Sub-Saharan Africa,,,,3.451814,8.06026,5.0,...,1.94554,8.096776,6.782923,4.930271,5.5185,5.17,155.0,5.640662,142.0,4.0
3,2016,ARG,Argentina,Latin America & the Caribbean,7.098483,5.79196,4.34393,5.744791,7.622974,10.0,...,3.260044,5.253411,6.508295,5.535831,5.369019,4.84,160.0,6.469848,107.0,3.0
4,2016,ARM,Armenia,Caucasus & Central Asia,,,,5.003205,8.80875,10.0,...,4.575152,9.319612,6.491481,6.79753,7.378069,7.57,29.0,7.241402,57.0,2.0


## Wrangling

Next I look at the columns in the file and at what information each one holds. For this I relied mainly on Kaggle and its description of each column.

In [51]:
null_cols = data.isnull().sum()
null_cols[null_cols > 0].head()

pf_rol_procedural    578
pf_rol_civil         578
pf_rol_criminal      578
pf_rol                80
pf_ss_homicide        80
dtype: int64
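Raw null counts are easier to judge relative to the number of rows. A minimal sketch, using a small made-up frame (the values are illustrative and not taken from hfi_cc_2018.csv), that reports the share of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "pf_rol_civil": [4.5, np.nan, np.nan, 5.8],
    "pf_rol": [5.3, 3.8, np.nan, 5.7],
    "hf_score": [7.6, 5.1, 5.6, 6.5],
})

# isnull().mean() gives the fraction of missing entries per column.
null_share = toy.isnull().mean()
# Keep only columns with at least one gap, worst first.
null_share = null_share[null_share > 0].sort_values(ascending=False)
print(null_share)
```

On the real data the same two lines would apply to `data` instead of `toy`.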

In [52]:
data.columns

Index(['year', 'ISO_code', 'countries', 'region', 'pf_rol_procedural',
       'pf_rol_civil', 'pf_rol_criminal', 'pf_rol', 'pf_ss_homicide',
       'pf_ss_disappearances_disap',
       ...
       'ef_regulation_business_bribes', 'ef_regulation_business_licensing',
       'ef_regulation_business_compliance', 'ef_regulation_business',
       'ef_regulation', 'ef_score', 'ef_rank', 'hf_score', 'hf_rank',
       'hf_quartile'],
      dtype='object', length=123)
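With 123 columns, it can help to group the names by prefix before choosing which ones to keep. The sketch below uses a handful of names copied from the output above; the grouping into `pf`, `ef`, `hf` and a catch-all `other` label is an assumption made here for illustration, based on how the score columns are named.

```python
from collections import Counter

# A few column names copied from the data.columns output above.
columns = [
    "year", "ISO_code", "countries", "region",
    "pf_rol_procedural", "pf_rol_civil", "pf_ss_homicide",
    "ef_regulation", "ef_score", "ef_rank",
    "hf_score", "hf_rank", "hf_quartile",
]

def family(col):
    """Return the indicator family of a column, or 'other' (hypothetical label)."""
    prefix = col.split("_")[0]
    return prefix if prefix in {"pf", "ef", "hf"} else "other"

# Count how many columns fall into each family.
prefix_counts = Counter(family(col) for col in columns)
print(prefix_counts)
```

On the real frame, `columns` would be replaced by `data.columns`.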

Using Python, I keep only the columns I am most interested in and rename them to make the file more readable.
The columns I keep are the following:
    Civil_justice
    Freedom_of_association
    Freedom_of_expression
    Same_sex_relationships
    Personal_Freedom
    Economic_Freedom
    Human_Freedom

In [53]:
new_columns = ['year', 'countries', 'pf_rol_civil', 'pf_association_association', 'pf_expression', 
               'pf_identity_sex', 'pf_score', 'ef_score', 'hf_score']
data = data[new_columns]
data.head()

Unnamed: 0,year,countries,pf_rol_civil,pf_association_association,pf_expression,pf_identity_sex,pf_score,ef_score,hf_score
0,2016,Albania,4.547244,10.0,8.607143,10.0,7.596281,7.54,7.56814
1,2016,Algeria,,5.0,7.380952,0.0,5.281772,4.99,5.135886
2,2016,Angola,,2.5,6.452381,0.0,6.111324,5.17,5.640662
3,2016,Argentina,5.79196,7.5,8.738095,10.0,8.099696,4.84,6.469848
4,2016,Armenia,,7.5,7.154762,10.0,6.912804,7.57,7.241402


In [54]:
data = data.rename(index=str, columns={"year": "Year", "countries": "Country", 
                                       "pf_rol_civil": "Civil_justice", 
                                       "pf_association_association": "Freedom_of_association",
                                       "pf_expression": "Freedom_of_expression", 
                                       "pf_identity_sex": "Same_sex_relationships", 
                                       "pf_score": "Personal_Freedom",
                                       "ef_score": "Economic_Freedom",
                                       "hf_score": "Human_Freedom"})
data.head()

Unnamed: 0,Year,Country,Civil_justice,Freedom_of_association,Freedom_of_expression,Same_sex_relationships,Personal_Freedom,Economic_Freedom,Human_Freedom
0,2016,Albania,4.547244,10.0,8.607143,10.0,7.596281,7.54,7.56814
1,2016,Algeria,,5.0,7.380952,0.0,5.281772,4.99,5.135886
2,2016,Angola,,2.5,6.452381,0.0,6.111324,5.17,5.640662
3,2016,Argentina,5.79196,7.5,8.738095,10.0,8.099696,4.84,6.469848
4,2016,Armenia,,7.5,7.154762,10.0,6.912804,7.57,7.241402
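The selection and renaming steps above could also be driven by a single mapping, so the list of kept columns and their new names cannot drift apart. A minimal sketch on a one-row toy frame (the zero values and the extra `ef_rank` column are placeholders, not real data):

```python
import pandas as pd

# Mapping from original column names to the readable names used above.
rename_map = {
    "year": "Year",
    "countries": "Country",
    "pf_rol_civil": "Civil_justice",
    "pf_association_association": "Freedom_of_association",
    "pf_expression": "Freedom_of_expression",
    "pf_identity_sex": "Same_sex_relationships",
    "pf_score": "Personal_Freedom",
    "ef_score": "Economic_Freedom",
    "hf_score": "Human_Freedom",
}

# One-row toy frame with an extra column that should be discarded.
toy = pd.DataFrame([{**{col: 0.0 for col in rename_map}, "ef_rank": 34.0}])

# Select the keys of the mapping, then rename them in one chained step.
slim = toy[list(rename_map)].rename(columns=rename_map)
print(slim.columns.tolist())
```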


In [55]:
null_cols = data.isnull().sum()
null_cols[null_cols > 0]

Civil_justice             578
Freedom_of_association    329
Freedom_of_expression      80
Same_sex_relationships     80
Personal_Freedom           80
Economic_Freedom           80
Human_Freedom              80
dtype: int64

Once I have the columns I want, I check for duplicate rows. As expected, there are none.

In [56]:
before = len(data)
data = data.drop_duplicates()
after = len(data)
print('Number of duplicate records dropped: ', str(before - after))

Number of duplicate records dropped:  0
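The same check can be done without comparing lengths before and after. A small sketch on a toy frame (values are illustrative) that counts duplicates directly with `duplicated()`:

```python
import pandas as pd

# Toy frame with one exact repeat of the first row.
toy = pd.DataFrame({
    "Year": [2016, 2016, 2015],
    "Country": ["Albania", "Albania", "Albania"],
    "Human_Freedom": [7.57, 7.57, 7.60],
})

# duplicated() flags every repeat of an earlier row; the first copy is not flagged.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()
print(n_duplicates, len(deduped))
```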


Now I look at the rows that contain NaNs in order to drop them.

In [58]:
data[data.isnull().any(axis=1)].head()

Unnamed: 0,Year,Country,Civil_justice,Freedom_of_association,Freedom_of_expression,Same_sex_relationships,Personal_Freedom,Economic_Freedom,Human_Freedom
1,2016,Algeria,,5.0,7.380952,0.0,5.281772,4.99,5.135886
2,2016,Angola,,2.5,6.452381,0.0,6.111324,5.17,5.640662
4,2016,Armenia,,7.5,7.154762,10.0,6.912804,7.57,7.241402
7,2016,Azerbaijan,,2.5,4.708462,10.0,5.676553,6.49,6.083277
8,2016,Bahamas,6.008696,,8.895833,10.0,7.454538,7.34,7.397269


In [59]:
data = data.dropna()
data.head()

Unnamed: 0,Year,Country,Civil_justice,Freedom_of_association,Freedom_of_expression,Same_sex_relationships,Personal_Freedom,Economic_Freedom,Human_Freedom
0,2016,Albania,4.547244,10.0,8.607143,10.0,7.596281,7.54,7.56814
3,2016,Argentina,5.79196,7.5,8.738095,10.0,8.099696,4.84,6.469848
5,2016,Australia,7.525648,10.0,9.392857,10.0,9.184438,7.98,8.582219
6,2016,Austria,7.872188,10.0,9.333333,10.0,9.246948,7.58,8.413474
10,2016,Bangladesh,3.712171,5.0,7.04199,5.0,5.3026,6.3,5.8013
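Calling `dropna()` with no arguments removes every row that has at least one NaN in any column, which is why countries such as Algeria and the Bahamas disappear above. The sketch below, built from three rows copied from the earlier output, contrasts that with a `subset` that only requires the headline score. Whether the looser filter suits this analysis is a judgment call; it is shown only to make the trade-off visible.

```python
import numpy as np
import pandas as pd

# Three rows copied from the outputs above.
sample = pd.DataFrame({
    "Country": ["Albania", "Algeria", "Bahamas"],
    "Civil_justice": [4.547244, np.nan, 6.008696],
    "Freedom_of_association": [10.0, 5.0, np.nan],
    "Human_Freedom": [7.56814, 5.135886, 7.397269],
})

# Any NaN in any column drops the row.
strict = sample.dropna()
# Only rows missing Human_Freedom are dropped.
lenient = sample.dropna(subset=["Human_Freedom"])
print(len(strict), len(lenient))
```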


## Analysis

From here on I start selecting the information I want to keep. In this case I take the data for 2016, the latest year available, to find the countries with the most and the least freedom, as measured by the Human_Freedom column.

In [66]:
# Keep only the 2016 rows (2016 is the latest year in the dataset).
data_2016 = data[data["Year"] == 2016]

data_2016_top5 = data_2016.sort_values(by='Human_Freedom', ascending=False).head()
data_2016_last5 = data_2016.sort_values(by='Human_Freedom', ascending=True).head()

display(data_2016_top5)
display(data_2016_last5)

Unnamed: 0,Year,Country,Civil_justice,Freedom_of_association,Freedom_of_expression,Same_sex_relationships,Personal_Freedom,Economic_Freedom,Human_Freedom
107,2016,New Zealand,7.877339,10.0,9.52381,10.0,9.284819,8.49,8.88741
63,2016,Hong Kong,7.724473,7.5,8.666667,10.0,8.58368,8.97,8.77684
5,2016,Australia,7.525648,10.0,9.392857,10.0,9.184438,7.98,8.582219
27,2016,Canada,7.181646,10.0,9.511905,10.0,9.151727,7.98,8.565863
106,2016,Netherlands,8.7146,10.0,9.72619,10.0,9.398842,7.71,8.554421


Unnamed: 0,Year,Country,Civil_justice,Freedom_of_association,Freedom_of_expression,Same_sex_relationships,Personal_Freedom,Economic_Freedom,Human_Freedom
157,2016,Venezuela,3.271891,10.0,5.109508,10.0,5.521449,2.88,4.200724
44,2016,Egypt,3.763257,5.0,6.460099,0.0,3.894554,5.72,4.807277
68,2016,Iran,5.211233,2.5,5.357637,0.0,4.532449,6.03,5.281225
47,2016,Ethiopia,3.896895,0.0,5.252983,0.0,5.06409,5.73,5.397045
103,2016,Myanmar,3.657112,7.5,7.544895,5.0,5.463361,5.42,5.44168
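Sorting the whole frame twice works, but pandas also offers `nlargest` and `nsmallest` for exactly this kind of top and bottom selection. A short sketch using a few 2016 Human_Freedom values copied from the outputs above:

```python
import pandas as pd

# A handful of 2016 rows copied from the tables above.
sample_2016 = pd.DataFrame({
    "Country": ["Albania", "New Zealand", "Venezuela", "Hong Kong", "Egypt", "Australia"],
    "Human_Freedom": [7.56814, 8.88741, 4.200724, 8.77684, 4.807277, 8.582219],
})

# Highest and lowest scores, already ordered from most to least extreme.
top2 = sample_2016.nlargest(2, "Human_Freedom")
bottom2 = sample_2016.nsmallest(2, "Human_Freedom")
print(top2["Country"].tolist(), bottom2["Country"].tolist())
```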


I also produce a short statistical summary of the main metrics for both groups, so that the main differences can be seen at a glance.

In [67]:
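# important_columns is not defined in the cells shown here; this list is
# inferred from the rows of the output tables below.
important_columns = ['Civil_justice', 'Freedom_of_association', 'Freedom_of_expression',
                     'Same_sex_relationships', 'Personal_Freedom', 'Economic_Freedom', 'Human_Freedom']
stats_top = data_2016_top5[important_columns].describe().transpose()
stats_last = data_2016_last5[important_columns].describe().transpose()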

display(stats_top)
display(stats_last)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Civil_justice,5.0,7.804741,0.571336,7.181646,7.525648,7.724473,7.877339,8.7146
Freedom_of_association,5.0,9.5,1.118034,7.5,10.0,10.0,10.0,10.0
Freedom_of_expression,5.0,9.364286,0.407953,8.666667,9.392857,9.511905,9.52381,9.72619
Same_sex_relationships,5.0,10.0,0.0,10.0,10.0,10.0,10.0,10.0
Personal_Freedom,5.0,9.120701,0.315323,8.58368,9.151727,9.184438,9.284819,9.398842
Economic_Freedom,5.0,8.226,0.502623,7.71,7.98,7.98,8.49,8.97
Human_Freedom,5.0,8.673351,0.150444,8.554421,8.565863,8.582219,8.77684,8.88741


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Civil_justice,5.0,3.960078,0.737141,3.271891,3.657112,3.763257,3.896895,5.211233
Freedom_of_association,5.0,5.0,3.952847,0.0,2.5,5.0,7.5,10.0
Freedom_of_expression,5.0,5.945025,1.042464,5.109508,5.252983,5.357637,6.460099,7.544895
Same_sex_relationships,5.0,3.0,4.472136,0.0,0.0,0.0,5.0,10.0
Personal_Freedom,5.0,4.895181,0.684909,3.894554,4.532449,5.06409,5.463361,5.521449
Economic_Freedom,5.0,5.156,1.290477,2.88,5.42,5.72,5.73,6.03
Human_Freedom,5.0,5.02559,0.525481,4.200724,4.807277,5.281225,5.397045,5.44168
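To put a number on the difference between the two groups, the group means can be subtracted directly. The sketch below re-enters the Economic_Freedom and Human_Freedom values of the ten countries listed above. It is an illustration of the comparison, not an additional result from the notebook.

```python
import pandas as pd

# Values copied from the top-5 table above.
top5 = pd.DataFrame({
    "Economic_Freedom": [8.49, 8.97, 7.98, 7.98, 7.71],
    "Human_Freedom": [8.88741, 8.77684, 8.582219, 8.565863, 8.554421],
})
# Values copied from the bottom-5 table above.
last5 = pd.DataFrame({
    "Economic_Freedom": [2.88, 5.72, 6.03, 5.73, 5.42],
    "Human_Freedom": [4.200724, 4.807277, 5.281225, 5.397045, 5.44168],
})

# Difference between the group means, per metric.
mean_gap = top5.mean() - last5.mean()
print(mean_gap.round(3))
```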


Now I turn to the three countries in this dataset that interest me most: New Zealand, because it ranks first; Venezuela, because it ranks last among the rows that remain after the dropna() step; and Spain, because it is my country.
I look at how Human Freedom and Civil Justice evolved over the last 8 years, examining both the year-to-year variation and the change between 2008 and 2016.

In [68]:
data_new_zealand = data[(data['Country']=='New Zealand')]
data_new_zealand = data_new_zealand.sort_values(by='Year', ascending=False)

columns_analysis = ['Year', 'Civil_justice', 'Human_Freedom']
stats_new_zealand = data_new_zealand[columns_analysis]
stats_new_zealand = stats_new_zealand.set_index('Year')

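display(stats_new_zealand)
# Rows are sorted newest-first, so pct_change() compares each row with the
# row above it, i.e. each year with the following year.
stats_new_zealand.pct_change()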

Year,Civil_justice,Human_Freedom
2016,7.877339,8.88741
2015,7.840253,8.87689
2014,7.775732,8.921386
2013,7.5,8.868585
2012,7.5,8.784639
2011,7.599611,8.79173
2010,7.599611,8.879077
2009,7.599611,8.797717
2008,7.599611,8.910542


Year,Civil_justice,Human_Freedom
2016,,
2015,-0.004708,-0.001184
2014,-0.008229,0.005013
2013,-0.035461,-0.005918
2012,0.0,-0.009465
2011,0.013281,0.000807
2010,0.0,0.009935
2009,0.0,-0.009163
2008,0.0,0.012824
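Because the frame above is sorted with `ascending=False`, `pct_change()` measures each year against the following one rather than the previous one. If the usual chronological reading is wanted, one option is to sort oldest-first before taking the change. A sketch with the New Zealand Civil_justice values copied from the table above:

```python
import pandas as pd

# New Zealand Civil_justice values for 2016 down to 2008, copied from the table above.
civil_justice = pd.Series(
    [7.877339, 7.840253, 7.775732, 7.5, 7.5, 7.599611, 7.599611, 7.599611, 7.599611],
    index=pd.Index(range(2016, 2007, -1), name="Year"),
    name="Civil_justice",
)

# Sort oldest-first so each change is measured against the previous year.
yoy = civil_justice.sort_index().pct_change()
print(yoy.round(6))
```

With this ordering the first year (2008) has no predecessor and comes out as NaN, and the sign of each entry reads forward in time.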


In [69]:
data_spain = data[(data['Country']=='Spain')]
data_spain = data_spain.sort_values(by='Year', ascending=False)

stats_spain = data_spain[columns_analysis]
stats_spain = stats_spain.set_index('Year')

display(stats_spain)
stats_spain.pct_change()

Year,Civil_justice,Human_Freedom
2016,6.596637,8.159107
2015,6.512283,8.175966
2014,6.418713,8.197269
2013,6.2,8.005897
2012,6.2,8.016794
2011,6.456055,8.125422
2010,6.456055,8.2462
2009,6.456055,8.101026
2008,6.456055,8.247249


Year,Civil_justice,Human_Freedom
2016,,
2015,-0.012788,0.002066
2014,-0.014368,0.002606
2013,-0.034074,-0.023346
2012,0.0,0.001361
2011,0.041299,0.01355
2010,0.0,0.014864
2009,0.0,-0.017605
2008,0.0,0.01805


In [70]:
data_venezuela = data[(data['Country']=='Venezuela')]
data_venezuela = data_venezuela.sort_values(by='Year', ascending=False)

stats_venezuela = data_venezuela[columns_analysis]
stats_venezuela = stats_venezuela.set_index('Year')

display(stats_venezuela)
stats_venezuela.pct_change()

Year,Civil_justice,Human_Freedom
2016,3.271891,4.200724
2015,2.92087,4.24175
2014,3.48551,4.57373
2013,3.3,5.013857
2012,3.3,5.22565
2011,3.781688,5.332458
2010,3.781688,5.156619
2009,3.781688,5.283514
2008,3.781688,5.257879


Year,Civil_justice,Human_Freedom
2016,,
2015,-0.107284,0.009766
2014,0.193312,0.078265
2013,-0.053223,0.096229
2012,0.0,0.042242
2011,0.145966,0.020439
2010,0.0,-0.032975
2009,0.0,0.024608
2008,0.0,-0.004852


I finish the yearly analysis by measuring how much the Civil Justice and Human Freedom variables have changed over the last few years for each of the three countries.

In [81]:
# Compound annual change over the 8 years from 2008 to 2016, in percent.
var_ven_per_year = ((stats_venezuela.loc[2016, 'Civil_justice']/stats_venezuela.loc[2008, 'Civil_justice'])**(1/8)-1)*100
# Total change between 2008 and 2016, in percent.
var_ven = ((stats_venezuela.loc[2016, 'Civil_justice']/stats_venezuela.loc[2008, 'Civil_justice'])-1)*100

print(f"The ANNUAL change in Civil Justice in Venezuela over the last 8 years was {var_ven_per_year:.2f}%.")
print(f"The TOTAL change in Civil Justice in Venezuela between 2008 and 2016 was {var_ven:.2f}%.")

The ANNUAL change in Civil Justice in Venezuela over the last 8 years was -1.79%.
The TOTAL change in Civil Justice in Venezuela between 2008 and 2016 was -13.48%.


-1.7937480969623665

-13.480681457155242
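The cell above covers only Civil Justice for Venezuela. The same two formulas can be wrapped in a small helper and applied to all three countries. The function name `total_and_annual_change` is introduced here for illustration and is not part of the original notebook; the 2008 and 2016 values are copied from the tables above.

```python
def total_and_annual_change(start_value, end_value, n_years):
    """Return (total %, compound annual %) change between two values."""
    ratio = end_value / start_value
    total = (ratio - 1) * 100
    annual = (ratio ** (1 / n_years) - 1) * 100
    return total, annual

# (2008 value, 2016 value) of Civil_justice, copied from the tables above.
civil_justice_2008_2016 = {
    "New Zealand": (7.599611, 7.877339),
    "Spain": (6.456055, 6.596637),
    "Venezuela": (3.781688, 3.271891),
}

for country, (start, end) in civil_justice_2008_2016.items():
    total, annual = total_and_annual_change(start, end, n_years=8)
    print(f"{country}: total {total:.2f}%, annual {annual:.2f}%")
```

Passing the Human_Freedom values instead would give the corresponding figures for the second metric.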

In [None]:
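import matplotlib.pyplot as plt

# Country is a string column; newer pandas versions raise on it unless the
# correlation is restricted to numeric columns.
plt.matshow(data.corr(numeric_only=True))
plt.show()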