# Investigate the influence of a country's nature and wellbeing on its CO2 emissions

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Data sets
This analysis is based on data extracted from [Gapminder](https://www.google.com/url?q=http://www.gapminder.org/data/&sa=D&ust=1532469042121000), a website that has collected a lot of
information about how people live their lives in different countries, tracked across the years and across a number of different indicators. 

I've focused my analysis on environmental pollution and what influences it. 

The analysis consists of the following indicators:
 
- [CO2 emissions (tonnes per person)](https://cdiac.ess-dive.lbl.gov/)
Carbon dioxide emissions from the burning of fossil fuels (metric tonnes of CO2 per person). 
This indicator has been chosen to analyze the ecological behaviour of each country: the more CO2 a country emits per person, the less ecologically friendly it is considered here.  

- [Forest coverage(%)](https://www.fao.org/forestry/sofo/en/)
Percentage of total land area that has been covered with forest during the given year.
This indicator has been chosen to analyze the amount of nature that surrounds the people in each country, and also to see its evolution across the years. 

- [Democracy score](http://www.systemicpeace.org/inscrdata.html)
Summary measure of a country's democratic and free nature in all independent countries with total population greater than 500,000 in 2018. 
For a better understanding of this index, -10 is the lowest value and 10 the highest. 
This indicator has been chosen to understand the influence of the people on their country's decisions. 
_This dataset was extracted directly from the source link, since it was a newer version than the one provided in Gapminder._

- [Human Development Index (HDI)](http://hdr.undp.org/en/indicators/137506)
Index used to rank countries by level of "human development". It contains three dimensions: health level, educational level and living standard. 
For a better understanding of this index, the score can be read as follows: 

| Human development | Score   |
|-----------------------------|-------------|
| Very high | >= 0.800    |
| High | 0.700–0.799 |
| Medium | 0.550–0.699 |
| Low | < 0.550     |
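The tiers in the table can be expressed as a small helper. This is only a sketch: the function name `hdi_category` is hypothetical and not part of the notebook, and the thresholds are copied directly from the table above.

```python
def hdi_category(score):
    """Map an HDI score to the tier listed in the table above.

    `hdi_category` is a hypothetical helper used for illustration only.
    """
    # NaN is the only value that is not equal to itself; treat it as "no tier"
    if score != score:
        return None
    if score >= 0.800:
        return "Very high"
    if score >= 0.700:
        return "High"
    if score >= 0.550:
        return "Medium"
    return "Low"


# Example calls with arbitrary scores
print(hdi_category(0.745))
print(hdi_category(0.479))
```

Applied to a dataframe column, it could be used as `df['2014'].map(hdi_category)`, assuming the year columns hold the raw scores.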

### Objective

The main purpose of this analysis is to understand whether the wellbeing of a society (HDI), its democratic system (polity) or its surrounding nature (forest coverage) affects its ecological impact (CO2 emissions). 
Do countries with a better democratic system, understood as one where the people have more influence on decisions, emit less CO2? Within these countries, does wellbeing have an impact? Do people with better education, health and living standards change their country's ecological impact?
And lastly, does the amount of nature that surrounds these people influence their actions? 

With the last variable, forest coverage, I also want to see if there is any relation between the increase in CO2 emissions and the decrease in forest coverage. 
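One possible way to look at such a relation is sketched below, assuming both indicators come in the wide format (one row per country, one column per year). The numbers are made up for illustration and are not Gapminder values, and `to_long` is a hypothetical helper. A correlation computed this way would only describe an association, not a cause.

```python
import pandas as pd

# Toy wide-format frames with made-up values (not real Gapminder data)
co2_toy = pd.DataFrame({"country": ["A", "B", "C"],
                        "1990": [1.0, 2.0, 3.0],
                        "2014": [2.0, 4.0, 6.0]})
forest_toy = pd.DataFrame({"country": ["A", "B", "C"],
                           "1990": [30.0, 20.0, 10.0],
                           "2014": [28.0, 16.0, 4.0]})


def to_long(df, value_name):
    """Turn year columns into rows: one row per (country, year) pair."""
    return df.melt(id_vars="country", var_name="year", value_name=value_name)


# Align both indicators on country and year, then correlate them
merged = to_long(co2_toy, "co2").merge(to_long(forest_toy, "forest"),
                                       on=["country", "year"])
corr = merged["co2"].corr(merged["forest"])  # Pearson correlation by default
print(merged)
print(corr)
```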

In [1]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### General Properties

#### CO2 emissions (tonnes per person) 

In [2]:
# Import the csv file as a dataframe
df_co2 = pd.read_csv('Data/co2_emissions_tonnes_per_person.csv')

#print first 10 lines
df_co2.head(10)

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
0,Afghanistan,,,,,,,,,,...,0.0529,0.0637,0.0854,0.154,0.242,0.294,0.412,0.35,0.316,0.299
1,Albania,,,,,,,,,,...,1.38,1.28,1.3,1.46,1.48,1.56,1.79,1.68,1.73,1.96
2,Algeria,,,,,,,,,,...,3.22,2.99,3.19,3.16,3.42,3.3,3.29,3.46,3.51,3.72
3,Andorra,,,,,,,,,,...,7.3,6.75,6.52,6.43,6.12,6.12,5.87,5.92,5.9,5.83
4,Angola,,,,,,,,,,...,0.98,1.1,1.2,1.18,1.23,1.24,1.25,1.33,1.25,1.29
5,Antigua and Barbuda,,,,,,,,,,...,4.81,4.91,5.14,5.19,5.45,5.54,5.36,5.42,5.36,5.38
6,Argentina,,,,,,,,,,...,4.14,4.43,4.38,4.68,4.41,4.56,4.6,4.57,4.46,4.75
7,Armenia,,,,,,,,,,...,1.46,1.48,1.73,1.91,1.51,1.47,1.71,1.98,1.9,1.9
8,Australia,,,,,,,,,,...,17.3,17.8,17.8,18.1,18.2,17.7,17.4,17.0,16.1,15.4
9,Austria,,,,,,,,0.0517,,...,8.99,8.71,8.39,8.28,7.49,8.03,7.69,7.31,7.28,6.8


In [3]:
# Print the shape: rows are countries, columns are 'country' plus one column per year
df_co2.shape

(192, 216)

In [4]:
# Count missing values
df_co2.isna().sum()

country      0
1800       187
1801       187
1802       185
1803       187
1804       186
1805       187
1806       187
1807       186
1808       187
1809       187
1810       186
1811       186
1812       186
1813       186
1814       186
1815       186
1816       186
1817       186
1818       186
1819       185
1820       185
1821       185
1822       185
1823       185
1824       185
1825       185
1826       185
1827       185
1828       185
          ... 
1985        20
1986        20
1987        20
1988        20
1989        20
1990        16
1991        15
1992         4
1993         4
1994         3
1995         3
1996         3
1997         3
1998         3
1999         3
2000         3
2001         3
2002         2
2003         2
2004         2
2005         2
2006         2
2007         1
2008         1
2009         1
2010         1
2011         1
2012         0
2013         0
2014         0
Length: 216, dtype: int64

Even though the dataset covers a very wide range of years, the number of missing values is considerably high before the 1990s: between 1800 and 1828 only 5 to 7 of the 192 countries have a value, and 20 countries are still missing in 1989. 
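To compare years more easily, the counts can be turned into a share of countries per year. The sketch below uses a small made-up frame in the same wide layout; with the real data, `toy` would be replaced by `df_co2`.

```python
import numpy as np
import pandas as pd

# Toy frame in the same layout as df_co2 (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "1989": [np.nan, np.nan, 1.0, np.nan],
    "1990": [0.5, np.nan, 1.1, 2.0],
    "1991": [0.6, 0.2, 1.2, 2.1],
})

# isna() gives booleans, so the column mean is the fraction of missing values
missing_share = toy.drop(columns="country").isna().mean()
print(missing_share)
```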

In [5]:
# Check for duplicated rows
df_co2.duplicated().sum()

0

#### Human Development Index (HDI)

In [6]:
# Import the csv file as a dataframe
df_hdi = pd.read_csv('Data/hdi_human_development_index.csv')

#print first 10 lines
df_hdi.head(10)


Unnamed: 0,country,1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Afghanistan,0.295,0.3,0.309,0.305,0.3,0.324,0.328,0.332,0.335,...,0.415,0.433,0.434,0.448,0.454,0.463,0.47,0.476,0.479,0.479
1,Albania,0.635,0.618,0.603,0.608,0.616,0.628,0.637,0.636,0.646,...,0.703,0.713,0.721,0.725,0.738,0.752,0.759,0.761,0.762,0.764
2,Algeria,0.577,0.581,0.587,0.591,0.595,0.6,0.609,0.617,0.627,...,0.69,0.697,0.705,0.714,0.724,0.732,0.737,0.741,0.743,0.745
3,Andorra,,,,,,,,,,...,,,,,0.819,0.819,0.843,0.85,0.857,0.858
4,Angola,,,,,,,,,,...,0.454,0.468,0.48,0.488,0.495,0.508,0.523,0.527,0.531,0.533
5,Antigua and Barbuda,,,,,,,,,,...,0.781,0.786,0.788,0.783,0.782,0.778,0.781,0.782,0.784,0.786
6,Argentina,0.705,0.713,0.72,0.725,0.728,0.731,0.738,0.746,0.753,...,0.788,0.792,0.794,0.802,0.816,0.822,0.823,0.825,0.826,0.827
7,Armenia,0.634,0.628,0.595,0.593,0.597,0.603,0.609,0.618,0.632,...,0.707,0.721,0.725,0.72,0.729,0.732,0.736,0.739,0.741,0.743
8,Australia,0.866,0.867,0.871,0.874,0.876,0.885,0.888,0.891,0.894,...,0.918,0.921,0.925,0.927,0.927,0.93,0.933,0.936,0.937,0.939
9,Austria,0.794,0.798,0.804,0.806,0.812,0.816,0.819,0.823,0.833,...,0.86,0.864,0.87,0.872,0.88,0.884,0.887,0.892,0.892,0.893


In [7]:
# Print the shape: rows are countries, columns are 'country' plus one column per year
df_hdi.shape

(187, 27)

This dataset is much smaller than the CO2 emissions one: it has 26 year columns (1990 to 2015), whereas the CO2 dataset has 215 (1800 to 2014). 
Another difference to consider is the latest year: here it is 2015, but in the CO2 emissions dataset it is 2014. 
I'll check the next datasets, but so far **the year range of the analysis should go from 1990 until 2014**. 

There's also a **difference in the number of countries** (187 here versus 192), which should be compared and unified before proceeding with the analysis phase. 
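Both checks (shared years and differing countries) can be done with set operations. This is a minimal sketch on made-up frames; `year_columns` is a hypothetical helper, and it assumes every column other than `country` is a year.

```python
import pandas as pd

# Toy stand-ins for two wide-format datasets (made-up values)
co2_toy = pd.DataFrame({"country": ["A", "B", "C"],
                        "1989": [1, 2, 3], "1990": [1, 2, 3], "2014": [1, 2, 3]})
hdi_toy = pd.DataFrame({"country": ["A", "B"],
                        "1990": [0.5, 0.6], "2014": [0.6, 0.7], "2015": [0.6, 0.7]})


def year_columns(df):
    """Return the set of column labels that are not the country column."""
    return {col for col in df.columns if col != "country"}


# Years present in both datasets
common_years = sorted(year_columns(co2_toy) & year_columns(hdi_toy))

# Countries present in the first dataset but not in the second
missing_from_hdi = sorted(set(co2_toy["country"]) - set(hdi_toy["country"]))

print(common_years)
print(missing_from_hdi)
```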

In [8]:
# Count missing values
df_hdi.isna().sum()

country     0
1990       44
1991       44
1992       44
1993       44
1994       44
1995       40
1996       40
1997       40
1998       40
1999       37
2000       20
2001       20
2002       20
2003       18
2004       15
2005        6
2006        6
2007        6
2008        6
2009        6
2010        0
2011        0
2012        0
2013        0
2014        0
2015        0
dtype: int64

In addition to the missing years, this dataset also has several missing values, especially before the year 2000: between 1990 and 1994 almost 24%
of the countries have no value (44 out of 187).

In [9]:
# Check for duplicated rows
df_hdi.duplicated().sum()

0

#### Forest coverage(%)

In [10]:
# Import the csv file as a dataframe
# Assumption: the file sits in Data/ like the other datasets
# (the bare filename raised a FileNotFoundError)
df_fc = pd.read_csv('Data/forest_coverage_percent.csv')

#print first 10 lines
df_fc.head(10)

In [None]:
# Print the shape: rows are countries, columns are 'country' plus one column per year
df_fc.shape

In this case the number of countries matches the CO2 indicator, although I'll still have to check that the country names match. The range of years, however, is considerably smaller, and it matches the HDI dataset. 

In [None]:
# Count missing values
df_fc.isna().sum()

In [None]:
# Check for duplicated rows
df_fc.duplicated().sum()

#### Democracy score (Polity)

This dataset was obtained from the official website, because it was more up to date than the version provided by Gapminder. 
The main difference is the format of the file: it's provided as .xls instead of .csv, it contains more variables, and the year is stored as a regular column instead of being spread across the column headers. 

In [None]:
# Import the excel file as a dataframe
df_polity = pd.read_excel('Data/p4v2018.xls')

#print first 10 lines
df_polity.head(10)

In [None]:
# The dataset is formatted differently from the others, so it will have to be adapted to their format
# check columns in the dataset
df_polity.info()

In [None]:
# The indicator we need, and that was listed in Gapminder is polity2
# Only the columns Country, year and polity2 are needed
df_polity = df_polity[['country','year','polity2']]
df_polity.head(10)

In [None]:
# Due to the difference in format I can't use shape to determine amount of years and countries
df_polity.nunique() 

In [None]:
# Get year range
print(df_polity['year'].min(),"-", df_polity['year'].max())

This dataset has a greater year range and more countries than the others, so I'll have to limit it for the analysis.

In [None]:
# Check for duplicated rows
df_polity.duplicated().sum()

In [None]:
# check the duplicated line
df_polity[df_polity.duplicated()]

There is one duplicated row, but the bigger problem is that the dataset contains countries that no longer exist, such as Yugoslavia, which showed up in the duplicate check above. 
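One way to spot such countries is to look at the last year each one appears in. The sketch below uses a made-up long-format frame with hypothetical country names; with the real data, `polity_toy` would be `df_polity`, and 2014 is the end of the year range chosen earlier.

```python
import pandas as pd

# Toy long-format frame (hypothetical countries, made-up scores)
polity_toy = pd.DataFrame({
    "country": ["Oldland", "Oldland", "Newland", "Newland"],
    "year": [1990, 1991, 1990, 2014],
    "polity2": [-5, -5, 10, 10],
})

# Last year with a record for each country
last_year = polity_toy.groupby("country")["year"].max()

# Countries whose records stop before the end of the analysis range
discontinued = sorted(last_year[last_year < 2014].index)
print(discontinued)
```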

In [None]:
# Check amount of "valid" countries in the dataset
# filtering with year == 2014 because that will be the end of the year range I'll use
df_polity[df_polity.year == 2014].count()

In [None]:
# Count missing values
df_polity.isna().sum()

In [None]:
#df_polity[df_polity.year >= 1990].isnull().any(axis=1)

#df_polity[df_polity.isnull().any(axis=1)].query('year > 1990')

In [None]:
#df_polity.isnull().values.any(axis=1)

### Data Cleaning

#### Apply the year range to all datasets

Only the data for the years 1990 to 2014 will be analyzed; the remaining years will be dropped from each dataframe. 
For the first three indicators (CO2, forest coverage and HDI) this can be done by selecting the columns that fall within the year range. 
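As an alternative to dropping columns by slice, the wanted columns can be built from the year range itself. This is only a sketch on a made-up frame, and it assumes the year column labels are strings, as they are when the CSV files are read with `pd.read_csv`.

```python
import pandas as pd

# Toy wide-format frame (made-up values)
wide = pd.DataFrame({"country": ["A"], "1989": [1.0], "1990": [1.1],
                     "2014": [1.2], "2015": [1.3]})

# Year labels for 1990..2014 inclusive, as strings
wanted = {str(year) for year in range(1990, 2015)}

# Keep 'country' plus every year column inside the range, in original order
keep = ["country"] + [col for col in wide.columns if col in wanted]
trimmed = wide[keep]
print(trimmed)
```

Building the list from the range avoids depending on which years a particular file happens to contain.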

In [None]:
# Drop the columns outside of the year range (1990-2014)
df_co2 = df_co2.drop(df_co2.loc[:,'1800':'1989'].columns, axis = 1) 
df_co2.head()

In [None]:
# I'll use the list of columns to filter the year range in the other dataframes
columns = list(df_co2.columns) 

df_hdi = df_hdi[columns]
df_fc = df_fc[columns]

In [None]:
# Check new structure
df_hdi.head()

In [None]:
# Check new structure
df_fc.head()

In the case of polity, since the years are not columns as in the previous dataframes, I'll use pandas query to filter only the rows belonging to the desired year range.

In [None]:
df_polity = df_polity.query('year > 1989 and year < 2015')

In [None]:
# Remove missing data (reassign instead of using inplace=True, since df_polity is a filtered slice)
df_polity = df_polity.dropna(subset=['polity2'])

#### Unify list of countries
Although every dataframe has a similar number of countries, I still have to check whether they match. 

Additionally, for the polity indicator, I'll remove all countries that have no data in the last year of the analysis. I'll assume these countries no longer have a current political system, as is the case with Yugoslavia, the one duplicated row. 

I'll use the valid countries from the polity indicator as the initial list of valid countries for the analysis. 

Then, I'll remove the countries not existing in the other dataframes. 

Lastly, I'll use that list to filter the valid data from the datasets. 
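The three steps above can be sketched with plain sets. The frames below are made-up stand-ins with hypothetical countries A to E; the point is that each dataset is filtered using its own rows.

```python
import pandas as pd

# Toy stand-ins for the four datasets (hypothetical countries, made-up values)
polity_toy = pd.DataFrame({
    "country": ["A", "A", "B", "C", "D"],
    "year": [2013, 2014, 2014, 2014, 1991],
    "polity2": [8, 8, -2, 5, -7],
})
co2_toy = pd.DataFrame({"country": ["A", "B", "C"], "2014": [1.0, 2.0, 3.0]})
fc_toy = pd.DataFrame({"country": ["A", "B", "E"], "2014": [30.0, 10.0, 50.0]})
hdi_toy = pd.DataFrame({"country": ["A", "B", "C"], "2014": [0.7, 0.5, 0.8]})

# Step 1: countries with a polity score in the last year of the range
valid = set(polity_toy.loc[polity_toy["year"] == 2014, "country"])

# Step 2: keep only countries that also exist in every other dataset
for other in (co2_toy, fc_toy, hdi_toy):
    valid &= set(other["country"])

# Step 3: filter each dataset against the shared list
valid_list = sorted(valid)
co2_f = co2_toy[co2_toy["country"].isin(valid_list)]
fc_f = fc_toy[fc_toy["country"].isin(valid_list)]
hdi_f = hdi_toy[hdi_toy["country"].isin(valid_list)]
print(valid_list)
```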

In [None]:
# Keep only the countries that have data in the year 2014 
# This will also remove the one duplicate we found

countries  = df_polity.query('year == 2014')
countries = countries[['country']]


In [None]:
# Keep only the countries that exist in the other datasets

countries = countries[countries.country.isin(df_co2['country'])]
countries = countries[countries.country.isin(df_fc['country'])]
countries = countries[countries.country.isin(df_hdi['country'])]

In [None]:
# Check the amount of valid countries for the analysis
countries.shape

In [None]:
# Keep in each indicator only the countries that have data in all datasets
df_polity = df_polity[df_polity.country.isin(countries['country'])]
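# Each dataframe is filtered from its own rows; filtering df_fc into
# df_co2 or df_hdi would overwrite them with forest coverage data
df_co2 = df_co2[df_co2.country.isin(countries['country'])]
df_fc = df_fc[df_fc.country.isin(countries['country'])]
df_hdi = df_hdi[df_hdi.country.isin(countries['country'])]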

In [None]:
# To facilitate the analysis the year has to be used as the column names, 
# and the polity score will be found in the intersection between the country and the year

df_polity_pivot = pd.pivot_table(df_polity, values='polity2', index=['country'],
                    columns=['year'], dropna = True)

df_polity_pivot.head(10)

In [None]:
df_polity.head(10)

In [None]:
# Move 'country' from the index back to a regular column so that all datasets are formatted the same way
df_polity_pivot = df_polity_pivot.reset_index()
df_polity_pivot.head()

In [None]:
df_polity_pivot.shape
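One detail worth checking after the pivot is the type of the year labels. If `year` is numeric in the polity file, the pivoted columns would be integers, while the CSV-based dataframes use string labels such as `'1990'`. The sketch below, on a made-up frame, shows one way to align them; that the real file stores years as numbers is an assumption.

```python
import pandas as pd

# Toy long-format frame with numeric years (made-up scores)
long_toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1990, 1991, 1990, 1991],
    "polity2": [5, 6, -3, -2],
})

# Long to wide: one row per country, one column per year
wide_toy = long_toy.pivot_table(values="polity2", index="country",
                                columns="year").reset_index()

# Convert every column label to a string so it matches the other datasets
wide_toy.columns = [str(col) for col in wide_toy.columns]
print(wide_toy)
```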

In [None]:
df_polity 

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions
