# Investigate the influence of the nature and wellbeing of a Country in its Co2 emissions

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Data sets
This analysis is based on date extracted from [Gapminder](https://www.google.com/url?q=http://www.gapminder.org/data/&sa=D&ust=1532469042121000), a website that has collected a lot of
information about how people live their lives in dierent countries, tracked across the years, and on a number of diferent indicators. 

I've focused my analysis on enviromental pollution and what influences it. 

The analysis consists of the following indicators:
 
- [C02 Emission (tonnes per person)](https://cdiac.ess-dive.lbl.gov/)
Carbon dioxide emissions from the burning of fossil fuels (metric tonnes of C02 per person). 
This indicator has been chosen to analyze the ecological behaviour of each country, the more Co2 emissions a country makes the less ecological friendly it is.  

- [Forest coverage(%)](https://www.fao.org/forestry/sofo/en/)
Percentage of total land area that has been covered with forest during the given year.
This indicator has been chosen to analyze the amount of nature that surrounds the people in each country, and also to see its evolution accross the years. 

- [Democracy score](http://www.systemicpeace.org/inscrdata.html)
Summary measure of a country's democratic and free nature in all independent countries with total population greater than 500,000 in 2018. 
For a better understanding of this index, -10 is the lowest value and 10 the highest. 
This indicator has been chosen to understand the influece of the people on the countries' decisions. 
_This dataset was extracted directly from the sorucelink, since it was a newer version than the one provided in gapminder._

- [Human Development Index (HDI)](http://hdr.undp.org/en/indicators/137506)
Index used to rank countries by level of "human development". It contains three dimensions: health level, educational level and living standard. 
For a better understanding of this index, the score can be understood the following way: 

| Human development | Score   |
|-----------------------------|-------------|
| Very high | >= 0.800    |
| High | 0.700–0.799 |
| Medium | 0.550–0.699 |
| Low | < 0.550     |

### Objective

The main purpose of this analysis is to understand if the wellbeing of the society (HDI), the democratic system (polity) or the surrounding nature (forest coverage) impact somehow its ecological impact (emissions of Co2). 
Do countries with a better democratic system, understood as a more influence of the people in its decisions, less Co2 emissions? Within these countries, does the wellbeing have an impact? Do people with a better education, health system and living standard impact on its countries ecological impact?
And lastly, has the amount of nature that surrounds these people impact on their actions? 

With the last variable, forest coverage, I also want to see if there is any relation between the increase of Co2 emissions and the decrease of the forest coverage. 

In [1]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### General Properties

#### C02 Emission (tonnes per person) 

In [2]:
# Import the csv file as a dataframe
df_co2 = pd.read_csv('Data/co2_emissions_tonnes_per_person.csv')

#print first 10 lines
df_co2.head(10)

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
0,Afghanistan,,,,,,,,,,...,0.0529,0.0637,0.0854,0.154,0.242,0.294,0.412,0.35,0.316,0.299
1,Albania,,,,,,,,,,...,1.38,1.28,1.3,1.46,1.48,1.56,1.79,1.68,1.73,1.96
2,Algeria,,,,,,,,,,...,3.22,2.99,3.19,3.16,3.42,3.3,3.29,3.46,3.51,3.72
3,Andorra,,,,,,,,,,...,7.3,6.75,6.52,6.43,6.12,6.12,5.87,5.92,5.9,5.83
4,Angola,,,,,,,,,,...,0.98,1.1,1.2,1.18,1.23,1.24,1.25,1.33,1.25,1.29
5,Antigua and Barbuda,,,,,,,,,,...,4.81,4.91,5.14,5.19,5.45,5.54,5.36,5.42,5.36,5.38
6,Argentina,,,,,,,,,,...,4.14,4.43,4.38,4.68,4.41,4.56,4.6,4.57,4.46,4.75
7,Armenia,,,,,,,,,,...,1.46,1.48,1.73,1.91,1.51,1.47,1.71,1.98,1.9,1.9
8,Australia,,,,,,,,,,...,17.3,17.8,17.8,18.1,18.2,17.7,17.4,17.0,16.1,15.4
9,Austria,,,,,,,,0.0517,,...,8.99,8.71,8.39,8.28,7.49,8.03,7.69,7.31,7.28,6.8


In [3]:
# print shape to see the amount of countries (x) and amount of years (y)
df_co2.shape

(192, 216)

In [4]:
# Count missing values
df_co2.isna().sum()

country      0
1800       187
1801       187
1802       185
1803       187
1804       186
1805       187
1806       187
1807       186
1808       187
1809       187
1810       186
1811       186
1812       186
1813       186
1814       186
1815       186
1816       186
1817       186
1818       186
1819       185
1820       185
1821       185
1822       185
1823       185
1824       185
1825       185
1826       185
1827       185
1828       185
          ... 
1985        20
1986        20
1987        20
1988        20
1989        20
1990        16
1991        15
1992         4
1993         4
1994         3
1995         3
1996         3
1997         3
1998         3
1999         3
2000         3
2001         3
2002         2
2003         2
2004         2
2005         2
2006         2
2007         1
2008         1
2009         1
2010         1
2011         1
2012         0
2013         0
2014         0
Length: 216, dtype: int64

Even though the dataset consists of a very wide range of years, the amount of missing values is considerably high in the years before the 90's. 

In [5]:
# Check for duplicated rows
df_co2.duplicated().sum()

0

#### Human Development Index (HDI)

In [6]:
# Import the csv file as a dataframe
df_hdi = pd.read_csv('Data/hdi_human_development_index.csv')

#print first 10 lines
df_hdi.head(10)


Unnamed: 0,country,1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Afghanistan,0.295,0.3,0.309,0.305,0.3,0.324,0.328,0.332,0.335,...,0.415,0.433,0.434,0.448,0.454,0.463,0.47,0.476,0.479,0.479
1,Albania,0.635,0.618,0.603,0.608,0.616,0.628,0.637,0.636,0.646,...,0.703,0.713,0.721,0.725,0.738,0.752,0.759,0.761,0.762,0.764
2,Algeria,0.577,0.581,0.587,0.591,0.595,0.6,0.609,0.617,0.627,...,0.69,0.697,0.705,0.714,0.724,0.732,0.737,0.741,0.743,0.745
3,Andorra,,,,,,,,,,...,,,,,0.819,0.819,0.843,0.85,0.857,0.858
4,Angola,,,,,,,,,,...,0.454,0.468,0.48,0.488,0.495,0.508,0.523,0.527,0.531,0.533
5,Antigua and Barbuda,,,,,,,,,,...,0.781,0.786,0.788,0.783,0.782,0.778,0.781,0.782,0.784,0.786
6,Argentina,0.705,0.713,0.72,0.725,0.728,0.731,0.738,0.746,0.753,...,0.788,0.792,0.794,0.802,0.816,0.822,0.823,0.825,0.826,0.827
7,Armenia,0.634,0.628,0.595,0.593,0.597,0.603,0.609,0.618,0.632,...,0.707,0.721,0.725,0.72,0.729,0.732,0.736,0.739,0.741,0.743
8,Australia,0.866,0.867,0.871,0.874,0.876,0.885,0.888,0.891,0.894,...,0.918,0.921,0.925,0.927,0.927,0.93,0.933,0.936,0.937,0.939
9,Austria,0.794,0.798,0.804,0.806,0.812,0.816,0.819,0.823,0.833,...,0.86,0.864,0.87,0.872,0.88,0.884,0.887,0.892,0.892,0.893


In [7]:
# print shape to see the amount of countries (x) and amount of years (y)
df_hdi.shape

(187, 27)

This result is already very small compared to the Co2 emissions data set that consists of records from the last 216 years compared to 27 in this one. 
Another big difference to consider is that the biggest year also differs, here is 2015 but in the Co2 emissions dataset was 2014. 
I'll check the next datasets, but so far **the year range of the analysis should go from 1990 until 2014**. 

There's also a **difference in the amount of couentries** that should also be compared and unified before proceding with the analysis phase. 

In [8]:
# Count missing values
df_hdi.isna().sum()

country     0
1990       44
1991       44
1992       44
1993       44
1994       44
1995       40
1996       40
1997       40
1998       40
1999       37
2000       20
2001       20
2002       20
2003       18
2004       15
2005        6
2006        6
2007        6
2008        6
2009        6
2010        0
2011        0
2012        0
2013        0
2014        0
2015        0
dtype: int64

Additional to the missing years, this dataset also has several missing values, especially before the year 2000 where almost 24%
of the countries have missing data (44 out of 187)

In [9]:
# Check for duplicated rows
df_hdi.duplicated().sum()

0

#### Forest coverage(%)

In [10]:
# Import the csv file as a dataframe
df_fc = pd.read_csv('Data/forest_coverage_percent.csv')

#print first 10 lines
df_fc.head(10)

Unnamed: 0,country,1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Afghanistan,2.07,2.07,2.07,2.07,2.07,2.07,2.07,2.07,2.07,...,2.07,2.07,2.07,2.07,2.07,2.07,2.07,2.07,2.07,2.07
1,Albania,28.8,28.7,28.6,28.6,28.5,28.4,28.4,28.3,28.2,...,28.5,28.5,28.4,28.4,28.3,28.3,28.3,28.2,28.2,28.2
2,Algeria,0.7,0.7,0.69,0.69,0.69,0.68,0.68,0.67,0.67,...,0.68,0.71,0.74,0.77,0.81,0.81,0.81,0.81,0.82,0.82
3,Andorra,34.0,34.0,34.0,34.0,34.0,34.0,34.0,34.0,34.0,...,34.0,34.0,34.0,34.0,34.0,34.0,34.0,34.0,34.0,34.0
4,Angola,48.9,48.8,48.7,48.6,48.5,48.4,48.3,48.2,48.1,...,47.3,47.2,47.1,47.0,46.9,46.8,46.7,46.6,46.5,46.4
5,Antigua and Barbuda,23.4,23.3,23.3,23.2,23.1,23.1,23.0,22.9,22.9,...,22.3,22.3,22.3,22.3,22.3,22.3,22.3,22.3,22.3,22.3
6,Argentina,12.7,12.6,12.5,12.4,12.3,12.2,12.1,12.0,11.9,...,10.9,10.8,10.7,10.6,10.4,10.3,10.2,10.1,10.0,9.91
7,Armenia,,,11.8,11.8,11.7,11.7,11.7,11.7,11.7,...,11.7,11.7,11.6,11.6,11.6,11.6,11.6,11.7,11.7,11.7
8,Australia,16.7,16.7,16.7,16.7,16.8,16.8,16.8,16.8,16.8,...,16.5,16.4,16.3,16.1,16.0,16.1,16.1,16.2,16.2,16.2
9,Austria,45.7,45.8,45.9,46.0,46.0,46.1,46.2,46.3,46.3,...,46.7,46.7,46.7,46.7,46.8,46.8,46.8,46.8,46.9,46.9


In [11]:
# print shape to see the amount of countries (x) and amount of years (y)
df_fc.shape

(192, 27)

In this case the amount of countries matches with the Co2 indicator, I'll have to check if the values match. But the range of years is considerable lesss, and it matches the HDI dataset. 

In [12]:
# Count missing values
df_fc.isna().sum()

country     0
1990       30
1991       27
1992        8
1993        4
1994        4
1995        4
1996        4
1997        4
1998        4
1999        4
2000        2
2001        2
2002        2
2003        2
2004        2
2005        2
2006        0
2007        0
2008        0
2009        0
2010        0
2011        1
2012        1
2013        1
2014        1
2015        1
dtype: int64

In [13]:
# Check for duplicated rows
df_fc.duplicated().sum()

0

#### Democracy score (Polity)

This dataset was obtained from the official website, because it was more updated that the version provided by gapminder. 
The main difference is found in the format of the file, it's provided as an .xls format instead of .csv, contains more variables and the year is listed as one more column instead of being the first row of the dataset. 

In [14]:
# Import the excel file as a dataframe
df_polity = pd.read_excel('Data/p4v2018.xls')

#print first 10 lines
df_polity.head(10)

Unnamed: 0,cyear,ccode,scode,country,year,flag,fragment,democ,autoc,polity,...,interim,bmonth,bday,byear,bprec,post,change,d4,sf,regtrans
0,21800,2,USA,United States,1800,0,,7,3,4,...,,1.0,1.0,1800.0,1.0,4.0,88.0,1.0,,
1,21801,2,USA,United States,1801,0,,7,3,4,...,,,,,,,,,,
2,21802,2,USA,United States,1802,0,,7,3,4,...,,,,,,,,,,
3,21803,2,USA,United States,1803,0,,7,3,4,...,,,,,,,,,,
4,21804,2,USA,United States,1804,0,,7,3,4,...,,,,,,,,,,
5,21805,2,USA,United States,1805,0,,7,3,4,...,,,,,,,,,,
6,21806,2,USA,United States,1806,0,,7,3,4,...,,,,,,,,,,
7,21807,2,USA,United States,1807,0,,7,3,4,...,,,,,,,,,,
8,21808,2,USA,United States,1808,0,,7,3,4,...,,,,,,,,,,
9,21809,2,USA,United States,1809,0,,9,0,9,...,,3.0,5.0,1809.0,1.0,9.0,5.0,1.0,,2.0


In [15]:
# the dataset is formatted diferently than the others, so it will have to be adapted to the format
# check columns in the dataset
df_polity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17562 entries, 0 to 17561
Data columns (total 36 columns):
cyear       17562 non-null int64
ccode       17562 non-null int64
scode       17562 non-null object
country     17562 non-null object
year        17562 non-null int64
flag        17562 non-null int64
fragment    3220 non-null float64
democ       17562 non-null int64
autoc       17562 non-null int64
polity      17562 non-null int64
polity2     17325 non-null float64
durable     16298 non-null float64
xrreg       17562 non-null int64
xrcomp      17562 non-null int64
xropen      17562 non-null int64
xconst      17562 non-null int64
parreg      17562 non-null int64
parcomp     17562 non-null int64
exrec       17361 non-null float64
exconst     17562 non-null int64
polcomp     17436 non-null float64
prior       1352 non-null float64
emonth      1408 non-null float64
eday        1408 non-null float64
eyear       1408 non-null float64
eprec       1412 non-null float64
interim     1200 

In [16]:
# The indicator we need, and that was listed in Gapminder is polity2
# Only the columns Country, year and polity2 are needed
df_polity = df_polity[['country','year','polity2']]
df_polity.head(10)

Unnamed: 0,country,year,polity2
0,United States,1800,4.0
1,United States,1801,4.0
2,United States,1802,4.0
3,United States,1803,4.0
4,United States,1804,4.0
5,United States,1805,4.0
6,United States,1806,4.0
7,United States,1807,4.0
8,United States,1808,4.0
9,United States,1809,9.0


In [17]:
# Due to the difference in format I can't use shape to determine amount of years and countries
df_polity.nunique() 

country    195
year       219
polity2     21
dtype: int64

In [18]:
# Get year range
print(df_polity['year'].min(),"-", df_polity['year'].max())

1800 - 2018


This dataset is the one that a greater year range and also more countries than the other, I'll have to limit it for the analysis.

In [19]:
# Check for duplicated rows
df_polity.duplicated().sum()

1

In [20]:
# check the duplicated line
df_polity[df_polity.duplicated()]

Unnamed: 0,country,year,polity2
7897,Yugoslavia,1991,-5.0


Even though there's a line duplicated, the problem is that the dataset has countries, like Yugoslavia that just appeared, that don't exist anymore. I'll check within the year range that appears in every other dataset the fluctuation in the amount of countries to understand how much data is missing and where. 

In [21]:
# Check amount of countries within the year range
df_polity.query('year > 1989 and year < 2015').groupby('year').size()

year
1990    147
1991    161
1992    161
1993    164
1994    163
1995    163
1996    163
1997    163
1998    163
1999    163
2000    163
2001    163
2002    164
2003    164
2004    164
2005    164
2006    166
2007    165
2008    166
2009    166
2010    166
2011    168
2012    167
2013    167
2014    167
dtype: int64

In [22]:
#Check missing data within the year range
df_polity.query('year > 1989 and year < 2015').isna().sum()

country     0
year        0
polity2    59
dtype: int64

After assessing the data there are the steps for cleaning the data will be:

**1. Formatting the data**
In order to analyze the data together I need to merge those dataframes, but in the way they are formatted right now it's not possible. 

The dataframes must have a format like the polity dataframe has, where the year is a value in a column next to the country. 

The expected output from the merged dataframes should have this format:

| Country   | Year | Co2   | forest_coverage | HDI | polity2 |
|-----------|------|-------|-----------------|-----|---------|
|     ...   | ...  | ...   |    ...          | ... |    ...  |

**2. Unifying list of countries and years**
Every dataframe has a different range of years and some countries differ from one another. All these values should be consistent accross dataframes. To accomplish that the merge between the dataframes should be an inner join, that way only countries and years existing in all dataframes will remain. 
The expected output will be the data belonging to the year range 1990-2014, that is present in all data sets. 

**3. Fill missing values**
Every dataframe has several data missing, but most of it will be dismissed during the inner join of the dataframes because it belongs to older years.

For visualization purposes, I'd like to keep most of the records instead of dropping the ones with missing data in order to have the first and the last observed value during the period. 

Given that the older values are more inconsisten, to minimize the missing values I will limit the analysis to a 20 year range: 1994-2014. 

For the values that remain missing after the merge, I'll refill them using the forward fill method. This way the values will be consistent along time. 

With the remaining missing values, dropping them would leave an incosistent dataset that could generate fake outliers in the analysis, and similar would be filling with the mean, given that all of the indicadors can present a notable evolution over time, specially with the polity2 indicator. 

### Data Cleaning

#### Formatting the data and unifying list of countries and years 

In [23]:
# unpivot the data
df_co2 = df_co2.melt(id_vars=['country'], var_name=['year'])

# Use the indicator as a column name
df_co2.rename(columns={"value": "co2"}, inplace=True)

# Check the new structure
df_co2.sample(10)

Unnamed: 0,country,year,co2
38568,Tajikistan,2000,0.36
34650,Lao,1980,0.0574
15395,China,1880,
31779,Madagascar,1965,0.0966
34398,Canada,1979,18.2
25475,Panama,1932,
2727,"Congo, Rep.",1814,
27755,Mauritius,1944,
34224,Dominica,1978,0.346
32878,Denmark,1971,11.5


In [24]:
# check the new structure columns
df_co2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41280 entries, 0 to 41279
Data columns (total 3 columns):
country    41280 non-null object
year       41280 non-null object
co2        16905 non-null float64
dtypes: float64(1), object(2)
memory usage: 967.6+ KB


In [25]:
# unpivot the data
df_fc = df_fc.melt(id_vars=['country'], var_name=['year'])

# Use the indicator as a column name
df_fc.rename(columns={"value": "forest_coverage"}, inplace=True)

# Check the new structure
df_fc.sample(10)

Unnamed: 0,country,year,forest_coverage
1829,Malaysia,1999,66.0
1208,Ethiopia,1996,14.3
2914,Chile,2005,21.6
3339,India,2007,23.1
4917,Nauru,2015,0.0
424,Costa Rica,1992,49.5
2905,Bulgaria,2005,33.6
234,Croatia,1991,
1056,Liechtenstein,1995,41.9
784,Belgium,1994,


In [26]:
# check the new structure columns
df_fc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4992 entries, 0 to 4991
Data columns (total 3 columns):
country            4992 non-null object
year               4992 non-null object
forest_coverage    4882 non-null float64
dtypes: float64(1), object(2)
memory usage: 117.1+ KB


In [27]:
# unpivot the data
df_hdi = df_hdi.melt(id_vars=['country'], var_name=['year'])

# Use the indicator as a column name
df_hdi.rename(columns={"value": "human_development"}, inplace=True)

# Check the new structure
df_hdi.sample(10)# check the new structure columns

Unnamed: 0,country,year,human_development
813,Greece,1994,0.772
3929,Algeria,2011,0.732
4265,South Korea,2012,0.891
2171,Myanmar,2001,0.435
1951,Italy,2000,0.828
2250,Argentina,2002,0.77
1195,Hungary,1996,0.745
1331,Botswana,1997,0.572
4849,Uganda,2015,0.493
3413,Djibouti,2008,0.436


In [28]:
# check the new structure columns
df_hdi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4862 entries, 0 to 4861
Data columns (total 3 columns):
country              4862 non-null object
year                 4862 non-null object
human_development    4322 non-null float64
dtypes: float64(1), object(2)
memory usage: 114.0+ KB


In [29]:
# Merge the dataframes in one
# co2, hdi and forest coverage can be easily merged because they have the same datatypes
df = df_co2.merge(df_fc, how ='inner', on = ['country','year'])
df = df.merge(df_hdi, how ='inner', on = ['country','year'])

In [30]:
# melt function converted the year to a string value
# but in the polity2 dataframe it's type is int, because it was already formatted in the source file 
# change the type of the column to int to match polity and merge
df['year'] = df['year'].astype('int')
df = df.merge(df_polity, how ='inner', on = ['country','year'])

# Check the combined dataframe
df.sample(10)

Unnamed: 0,country,year,co2,forest_coverage,human_development,polity2
2299,Moldova,2005,1.18,11.0,0.648,9.0
1574,Portugal,2000,6.06,36.5,0.782,10.0
902,Cuba,1996,2.46,21.3,0.665,-7.0
2022,Qatar,2003,60.3,0.0,0.826,-10.0
915,Fiji,1996,1.05,53.1,0.675,5.0
1116,Niger,1997,0.0893,1.19,0.24,-6.0
470,France,1993,6.23,26.8,0.803,9.0
1393,Lesotho,1999,0.987,1.38,0.445,2.0
1946,Cyprus,2003,7.8,18.6,0.823,10.0
3596,Costa Rica,2014,1.63,53.4,0.775,10.0


The data is in the expected format, but there's still missing data, as expected

In [31]:
#check missing data
df.isna().sum()

country                0
year                   0
co2                    6
forest_coverage       39
human_development    256
polity2               37
dtype: int64

In [32]:
# Check that the year range is the expected
print(df['year'].min() ,"-", df['year'].max())

1990 - 2014


#### Handle missing values

In [33]:
# Keep only records in the defined year range for analysis (1994-2014)
df =  df.query('year > 1993')

In [34]:
# Check amount of countries
df['country'].nunique()

151

In [35]:
#check missing data
df.isna().sum()

country                0
year                   0
co2                    0
forest_coverage       13
human_development    171
polity2               32
dtype: int64

Now I'll refill the missing values with a backward fill, this method takes the propagates the non-null values backward.
To avoid mixing data within countries I'll order the dataframe first. 

In [36]:
#df_combined.fillna(method='ffill', inplace =True)
df.sort_values(by=['country', 'year'])
df.fillna(method='bfill', inplace =True)

In [37]:
#check missing data
df.isna().sum()

country              0
year                 0
co2                  0
forest_coverage      0
human_development    0
polity2              0
dtype: int64

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1 (Replace this header name!)

In [44]:
# To find the correlation among 
# the columns using pearson method 
df.corr(method ='kendall') 

Unnamed: 0,year,co2,forest_coverage,human_development,polity2
year,1.0,0.031416,0.000194,0.114876,0.054623
co2,0.031416,1.0,-0.02631,0.642775,0.260232
forest_coverage,0.000194,-0.02631,1.0,0.050585,0.184299
human_development,0.114876,0.642775,0.050585,1.0,0.361635
polity2,0.054623,0.260232,0.184299,0.361635,1.0


In [38]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [39]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed. Make sure that you are clear with regards to the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it is satisfies all the areas of the rubric (found on the project submission page at the end of the lesson). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.

## Submitting your Project 

> Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [40]:
#from subprocess import call
#call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])