**The data is stored in the data folder:**

First, we have the **World Happiness Report**, published by the *Sustainable Development Solutions Network*. It scores happiness in different countries according to economic production, social support, and other factors. We have data from 2015, 2016, and 2017, stored in the following files:
- 2015.csv
- 2016.csv
- 2017.csv

We also have data on happiness and alcohol consumption:
- HappinessAlcoholConsumption.csv

## Numpy

Why Numpy? Numpy provides a powerful data structure called the **array**. An array is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key. With arrays, we can easily store and access values. **Numpy functions** such as `numpy.sum` and `numpy.mean` make it easier to work with collections of values.

First, we want to create an array.
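
As a minimal sketch of the idea, the snippet below builds a small array and applies `numpy.sum` and `numpy.mean` to it. The variable name `scores` and its four values are illustrative only.

```python
import numpy as np

# Illustrative values only; any list of numbers works.
scores = np.array([7.587, 7.561, 7.527, 7.522])

print(scores[0])        # access a single element by its index
print(scores[1:3])      # slice out a sub-array
print(np.sum(scores))   # total of all elements
print(np.mean(scores))  # arithmetic mean of all elements
```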

### Import the libraries 

In [2]:
import pandas as pd
import numpy as np

### Load the data

The **World Happiness Report** data has the following columns:
- **Country**: Name of the country.
- **Region**: Region the country belongs to.
- **Happiness Rank**: Rank of the country based on the Happiness Score.
- **Happiness Score**: A metric measured in 2015 by asking the sampled people the question: "How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest."
- **Standard Error**: The standard error of the happiness score.
- **Economy (GDP per Capita)**: The extent to which GDP contributes to the calculation of the Happiness Score.
- **Family**: The extent to which Family contributes to the calculation of the Happiness Score.
- **Health (Life Expectancy)**: The extent to which Life Expectancy contributes to the calculation of the Happiness Score.
- **Freedom**: The extent to which Freedom contributes to the calculation of the Happiness Score.
- **Trust (Government Corruption)**: The extent to which Perception of Corruption contributes to the calculation of the Happiness Score.
- **Generosity**: The extent to which Generosity contributes to the calculation of the Happiness Score.
- **Dystopia Residual**: The extent to which Dystopia Residual contributes to the calculation of the Happiness Score.

In [3]:
df_2015 = pd.read_csv('data/2015.csv')
df_2016 = pd.read_csv('data/2016.csv')
df_2017 = pd.read_csv('data/2017.csv')

### Examine the data 

If you evaluate `df_2015` directly, `pandas` will try to display the whole dataframe. Imagine you have a million data points: the output would be far too long to read. But do we need that much information to understand the structure of the data? No. Instead, we use `df.head()` to show only the first few rows. You can also choose how many rows to show by calling `df.head(n)`; for example, `df.head(10)` displays the first 10 rows.
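
A small sketch of `head` on a made-up dataframe follows. The dataframe `toy` and its column names are hypothetical stand-ins, used here so the example runs without the CSV files.

```python
import pandas as pd

# Hypothetical stand-in for a larger dataframe (20 rows).
toy = pd.DataFrame({
    "Country": [f"Country {i}" for i in range(20)],
    "Score": [i * 0.5 for i in range(20)],
})

first_five = toy.head()    # no argument: the first 5 rows
first_ten = toy.head(10)   # explicit n: the first 10 rows
print(first_ten)
```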

### shape

In [4]:
df_2015.shape

(159, 12)

In [15]:
df_2015.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,La La Land,North America,1000,,,,,,,,,
1,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
2,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
3,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
4,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531


Do you see anything unusual? 

You should have noticed the anomaly in the first row. The row for 'La La Land' contains several `NaN` values, which indicate missing data in our dataframe. Data can be missing for different reasons: for example, the survey wasn't able to capture the information for this country, or someone intentionally deleted the data to bias the report. In our case, we added 'La La Land' on purpose to show you how to deal with missing data.

Missing data don't always show up in the first few rows, so you may not catch them if you only display the head of the dataframe. To check whether anything is missing, we can use `df.isna()`:

In [32]:
df_2015.isna().head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,False,False


To better visualize how much data is missing for each column, we can sum them up:

In [21]:
df_2015.isna().sum()

Country                          0
Region                           0
Happiness Rank                   0
Happiness Score                  1
Standard Error                   1
Economy (GDP per Capita)         1
Family                           1
Health (Life Expectancy)         1
Freedom                          1
Trust (Government Corruption)    1
Generosity                       1
Dystopia Residual                1
dtype: int64
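
The per-column counts do not say which rows are affected. One way to find them, sketched here on a hypothetical dataframe `toy`, is to combine `isna()` with `any(axis=1)` and use the result as a row filter.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the first row has missing values.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Score": [np.nan, 7.5, 7.4, 7.3],
    "Family": [np.nan, 1.3, 1.4, 1.2],
})

has_missing = toy.isna().any(axis=1)   # True for rows with at least one NaN
rows_with_missing = toy[has_missing]   # keep only those rows
print(rows_with_missing)
```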

It looks like only the row we added has missing data. One way to deal with missing data is to impute it, that is, to fill in substitute values. There are many ways to impute data; the easiest is to fill every missing value with zero. Conveniently, calling `df.fillna(0)` does this for us.

In [25]:
df_2015.fillna(0).head(1)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,La La Land,North America,1000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
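
Zero is not always a sensible substitute. As one alternative, the sketch below fills each numeric column with that column's mean. The dataframe `toy` is hypothetical, and mean imputation is shown only as an illustration, not as the method this notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Score": [np.nan, 6.0, 8.0],
})

col_means = toy.mean(numeric_only=True)  # mean of each numeric column, NaN skipped
filled = toy.fillna(col_means)           # fill NaN using the matching column's mean
print(filled)
```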


Another option is to drop the rows that have missing data. This is usually not a good idea because we might be removing important information from our dataframe. But in our case, the first row is completely useless, so we'll drop the data. We can do that by calling `df.drop`. See [this documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) for more details. 

Make sure you understand what the two 0's stand for.

In [None]:
# Recent pandas versions require `axis` to be passed by keyword
df_2015 = df_2015.drop(0, axis=0)

In [31]:
df_2015.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
1,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
2,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
3,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
4,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
5,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176
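
Dropping by label requires knowing which row to remove. When the goal is simply to remove every row that contains missing values, `df.dropna()` is an alternative. The sketch below compares the two on a hypothetical dataframe `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the row with label 0 has a missing score.
toy = pd.DataFrame({
    "Country": ["La La Land", "B", "C"],
    "Score": [np.nan, 7.5, 7.4],
})

by_label = toy.drop(0, axis=0)   # drop the row labelled 0 (axis=0 means rows)
by_missing = toy.dropna()        # drop every row that contains a NaN
print(by_label.equals(by_missing))
```

Neither call modifies `toy`; both return a new dataframe, which is why the notebook assigns the result back to `df_2015`.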


Now, try examining `df_2016` and `df_2017`.

In [8]:
# your code here


In [9]:
# your code here


Try to display only the last few rows of `df_2015`.

In [11]:
# your code here


Besides looking at the raw values, we also want to check whether there are any outliers. How can we do that? We can examine each column's mean, min, max, and so on. We could certainly use `numpy` to compute these values column by column, but with many columns this is too much work. Luckily, `pandas` has a method called `df.describe()` that makes this easy.

In [13]:
df_2015.describe()

Unnamed: 0,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
count,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0,158.0
mean,79.493671,5.375734,0.047885,0.846137,0.991046,0.630259,0.428615,0.143422,0.237296,2.098977
std,45.754363,1.14501,0.017146,0.403121,0.272369,0.247078,0.150693,0.120034,0.126685,0.55355
min,1.0,2.839,0.01848,0.0,0.0,0.0,0.0,0.0,0.0,0.32858
25%,40.25,4.526,0.037268,0.545808,0.856823,0.439185,0.32833,0.061675,0.150553,1.75941
50%,79.5,5.2325,0.04394,0.910245,1.02951,0.696705,0.435515,0.10722,0.21613,2.095415
75%,118.75,6.24375,0.0523,1.158448,1.214405,0.811013,0.549092,0.180255,0.309883,2.462415
max,158.0,7.587,0.13693,1.69042,1.40223,1.02525,0.66973,0.55191,0.79588,3.60214
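
For comparison, here is a rough sketch of the column-by-column `numpy` approach next to `describe()`, using a hypothetical dataframe `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two numeric columns.
toy = pd.DataFrame({
    "Score": [7.5, 7.4, 5.0, 2.8],
    "Family": [1.3, 1.4, 1.0, 0.0],
})

# Manual approach: one column at a time.
for col in toy.columns:
    values = toy[col].to_numpy()
    print(col, np.mean(values), np.min(values), np.max(values))

# pandas approach: all numeric columns at once.
summary = toy.describe()
print(summary)
```

One difference to keep in mind: `np.std` computes the population standard deviation by default (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the two only agree if you pass `ddof=1` to `np.std`.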


You can also check that the columns of each dataframe are consistent with the list we provided.

In [33]:
df_2015.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual'],
      dtype='object')

In [34]:
df_2016.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Lower Confidence Interval', 'Upper Confidence Interval',
       'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
       'Freedom', 'Trust (Government Corruption)', 'Generosity',
       'Dystopia Residual'],
      dtype='object')

In [35]:
df_2017.columns

Index(['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
       'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
       'Health..Life.Expectancy.', 'Freedom', 'Generosity',
       'Trust..Government.Corruption.', 'Dystopia.Residual'],
      dtype='object')
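
Comparing the three outputs by eye is error-prone. A possible shortcut is to compare the column names as sets. In the sketch below the names are copied from the outputs above so that it runs on its own; with the dataframes loaded, `set(df_2015.columns)` would give the same set.

```python
# Column names copied from the outputs above.
cols_2015 = ['Country', 'Region', 'Happiness Rank', 'Happiness Score',
             'Standard Error', 'Economy (GDP per Capita)', 'Family',
             'Health (Life Expectancy)', 'Freedom',
             'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']
cols_2016 = ['Country', 'Region', 'Happiness Rank', 'Happiness Score',
             'Lower Confidence Interval', 'Upper Confidence Interval',
             'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
             'Freedom', 'Trust (Government Corruption)', 'Generosity',
             'Dystopia Residual']
cols_2017 = ['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high',
             'Whisker.low', 'Economy..GDP.per.Capita.', 'Family',
             'Health..Life.Expectancy.', 'Freedom', 'Generosity',
             'Trust..Government.Corruption.', 'Dystopia.Residual']

only_2015 = set(cols_2015) - set(cols_2016)                       # in 2015 but not 2016
common_all = set(cols_2015) & set(cols_2016) & set(cols_2017)     # shared by all three
print(sorted(only_2015))
print(sorted(common_all))
```

The 2017 file uses dots where the other two use spaces and parentheses, so most of its columns do not match by name even though they describe the same quantities.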

### Merge, join, and concatenate

Notice that the three dataframes have some columns in common, so we are going to combine them. To tell the years apart, we'll add an extra column `year` that indicates which year each row comes from. You can add a column to a dataframe by writing `df['column name'] = value`.

Add a `year` column to all three dataframes:

In [36]:
# your code here

df_2015['year'] = 2015
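
To see why the `year` column matters once the dataframes are combined, here is a minimal sketch on two hypothetical dataframes, `a` and `b`. Stacking them with `pd.concat` is one possible way to combine dataframes and is shown here as an assumption about the next step.

```python
import pandas as pd

# Hypothetical dataframes standing in for two yearly reports.
a = pd.DataFrame({"Country": ["A", "B"], "Score": [7.5, 7.4]})
b = pd.DataFrame({"Country": ["A", "B"], "Score": [7.6, 7.3]})

# Assigning a single value fills the whole column with that value.
a["year"] = 2015
b["year"] = 2016

# Stack the rows; ignore_index=True renumbers the rows from 0.
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```

Without the `year` column, the two rows for country "A" in `combined` would be indistinguishable.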