# <span style="color:green">SOLUTION</span>: Explaining and Predicting Life Satisfaction

## Overview

This exercise uses data combined from two datasets to explain and predict **life satisfaction** across different developed countries.

Life satisfaction is measured (quantified) as follows:

> The indicator considers people's evaluation of their life as a whole. It is a weighted-sum of different
response categories based on people's rates of their **current life relative to the best and worst possible lives
for them on a scale from 0 to 10**, using the Cantril Ladder (known also as the "Self-Anchoring Striving Scale") ([source](http://www.oecd.org/statistics/OECD-Better-Life-Index-definitions-2019.pdf))

In this exercise, you will use other country-level variables, such as water quality, air pollution, and GDP per capita, to **explain** and **predict** life satisfaction.

## How to Verify Your Answers

Throughout this notebook, you will see `assert` statements.

An `assert` statement raises an `AssertionError` if the condition you pass to it evaluates to `False`.

If you run a cell containing an `assert` statement and no output is shown, **that's a good thing and means that the condition 'passed.'**

For example, say that you run into a section that asks you:

### <strong><span style="color:red">Challenge:</span></strong> Define a function `my_test_func` that accepts a parameter `x` and adds the integer `1` to `x`, then returns the result

In [12]:
def my_test_func(x):
    pass  # Define me!

You'd define the function as follows:

In [13]:
def my_test_func(x):
    return x + 1

A corresponding `assert` statement would look like this:

In [14]:
assert my_test_func(2) == 3, "2 plus 1 should equal 3!"
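For contrast, here is a minimal sketch of what a failing assertion looks like. The function `broken_func` is hypothetical and deliberately wrong; it is not part of the exercise. The `try`/`except` is only there so the example can print the message instead of stopping execution.

```python
def broken_func(x):
    # Hypothetical, deliberately incorrect implementation: adds 2 instead of 1.
    return x + 2


try:
    assert broken_func(2) == 3, "2 plus 1 should equal 3!"
except AssertionError as err:
    # The text after the comma in the assert becomes the error message.
    caught_message = str(err)
    print(f"AssertionError: {caught_message}")
```

In a notebook cell without the `try`/`except`, the same failure would appear as a traceback ending in `AssertionError` with that message.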

You do not need to modify the contents of the `assert` statements. The goal is to run those cells successfully after you've answered a challenge problem.

## Links

Raw data sources:

- https://stats.oecd.org/index.aspx?DataSetCode=BLI
- https://www.imf.org/external/index.htm

Definitions: http://www.oecd.org/statistics/OECD-Better-Life-Index-definitions-2019.pdf

## Load Data from CSV

In [15]:
import pandas as pd

df = pd.read_csv("data/country_satis.csv", encoding="utf-8", index_col="Country")

In [16]:
df.head()

| Country | SW_LIFS | CG_SENG | CG_VOTO | EQ_AIRP | EQ_WATER | ES_EDUA | ES_EDUEX | ES_STCS | HO_BASE | HO_HISH | ... | JE_LMIS | JE_LTUR | JE_PEARN | PS_FSAFEN | PS_REPH | SC_SNTWS | WL_EWLH | WL_TNOW | GDPC | UNEMP |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Australia | 7.3 | 2.7 | 91 | 5 | 93 | 81 | 21.0 | 502 | 0.9 | 20 | ... | 5.4 | 1.31 | 49126 | 63.5 | 1.1 | 95 | 13.04 | 14.35 | 52725.75 | 5.158 |
| Austria | 7.1 | 1.3 | 80 | 16 | 92 | 85 | 17.0 | 492 | 0.9 | 21 | ... | 3.5 | 1.84 | 50349 | 80.6 | 0.5 | 92 | 6.66 | 14.55 | 58849.55 | 4.5 |
| Belgium | 6.9 | 2.0 | 89 | 15 | 84 | 77 | 19.3 | 503 | 1.9 | 21 | ... | 3.7 | 3.54 | 49675 | 70.1 | 1.0 | 91 | 4.75 | 15.7 | 54028.79 | 5.358 |
| Brazil | 6.4 | 2.2 | 79 | 10 | 73 | 49 | 16.2 | 395 | 6.7 | 21 | ... | 4.8 | 1.77 | 40863 | 35.6 | 26.7 | 90 | 7.13 | 14.91 | 15336.84 | 11.925 |
| Canada | 7.4 | 2.9 | 68 | 7 | 91 | 91 | 17.3 | 523 | 0.2 | 22 | ... | 6.0 | 0.77 | 47622 | 82.2 | 1.3 | 93 | 3.69 | 14.56 | 51190.13 | 5.667 |


Each column name is a code that corresponds to a human-readable indicator:

| Code | Indicator |
| ---- | ----- |
| JE_LMIS | Labour market insecurity |
| CG_SENG | Stakeholder engagement for developing regulations |
| HO_BASE | Dwellings without basic facilities |
| HO_HISH | Housing expenditure |
| PS_FSAFEN | Feeling safe walking alone at night |
| HO_NUMR | Rooms per person |
| IW_HADI | Household net adjusted disposable income |
| IW_HNFW | Household net wealth |
| JE_EMPL | Employment rate |
| JE_LTUR | Long-term unemployment rate |
| JE_PEARN | Personal earnings |
| SC_SNTWS | Quality of support network |
| ES_EDUA | Educational attainment |
| ES_STCS | Student skills |
| ES_EDUEX | Years in education |
| EQ_AIRP | Air pollution |
| EQ_WATER | Water quality |
| CG_VOTO | Voter turnout |
| HS_LEB | Life expectancy |
| HS_SFRH | Self-reported health |
| SW_LIFS | Life satisfaction |
| PS_REPH | Homicide rate |
| WL_EWLH | Employees working very long hours |
| WL_TNOW | Time devoted to leisure and personal care |
| GDPC | Gross domestic product per capita, current prices, 2019 |
| UNEMP | Unemployment rate |

Here you will be using `SW_LIFS` (long name: Life satisfaction) as the **response** (**endogenous**) variable.

In [17]:
df['SW_LIFS'].describe()

count    40.000000
mean      6.535000
std       0.752279
min       4.700000
25%       5.900000
50%       6.500000
75%       7.225000
max       7.600000
Name: SW_LIFS, dtype: float64

> Note: this dataset has been pre-cleaned and merged so that you can focus on the regression problem itself. If you'd like to see how the dataset was formed, see [this notebook](./merge_data.ipynb).

## Break out the `x` and `y` arrays

### <strong><span style="color:red">Challenge:</span></strong> Use `df.pop(...)` with the `SW_LIFS` column name to assign to the variable `y`

In [1]:
# Take the y-variable from df, in place.
# The remaining df becomes our exogenous data (X matrix).

y = df.pop("SW_LIFS")

In [19]:
assert "y" in locals()
assert isinstance(y, pd.Series)
assert len(y)
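To see what `pop` does to a frame, here is a small sketch on a toy `DataFrame`. The values and the names `toy` and `toy_y` are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; the numbers are invented.
toy = pd.DataFrame(
    {"SW_LIFS": [7.0, 6.0, 5.0], "GDPC": [50.0, 40.0, 30.0]},
    index=pd.Index(["A", "B", "C"], name="Country"),
)

# pop() returns the column as a Series and removes it from the frame in place.
toy_y = toy.pop("SW_LIFS")

print(list(toy.columns))  # only the remaining explanatory column is left
print(toy_y.name)
```

Because the removal happens in place, running the same `pop` call a second time raises a `KeyError`; in a notebook, reload the data before re-running such a cell.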

Here is a sample of the `y` vector:

In [20]:
y.sample(10)

Country
Finland           7.6
Chile             6.5
Turkey            5.5
Czech Republic    6.7
Hungary           5.6
Japan             5.9
Greece            5.4
Brazil            6.4
Denmark           7.6
United States     6.9
Name: SW_LIFS, dtype: float64

## Make Statistical Inference Across the Full Dataset

In this section, you will use `LinearRegression` to draw conclusions about the _explainability_ of the `SW_LIFS` variable.

In [21]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

### <strong><span style="color:red">Challenge:</span></strong> Fit the model using `df` and `y` over the full dataset

Hint: the solution uses a _method_ of the `model` object created above.

In [2]:
model.fit(df, y)

In [23]:
assert hasattr(model, "coef_")

In [24]:
model.score(df, y)

0.9443334835430864

In [25]:
assert isinstance(model.score(df, y), float)

### <strong><span style="color:red">Question:</span></strong> What does the result of `model.score()` _mean_ here?

<span style="color:blue">Your answer:</span>

[Type your answer here.]
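As a hint toward an answer: for regressors such as `LinearRegression`, `score()` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below recomputes it by hand on synthetic data; `X_demo`, `y_demo`, and `demo_model` are illustrative names and the data is randomly generated, not the country dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=40)

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, demo_model.score(X_demo, y_demo))
```

Both printed numbers should agree. Note that this is an in-sample figure: the model is scored on the same rows it was fitted to.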

## Build Test and Train Datasets

In this section, you will experiment with the _predictability_ of `SW_LIFS` by training on an in-sample portion of the data and testing on an out-of-sample portion of the data.

### <strong><span style="color:red">Challenge:</span></strong> Use `train_test_split()` to break the data into training and test sets

In [26]:
from sklearn.model_selection import train_test_split

In [4]:
# The default test_size of 0.25 yields the 75/25 split checked below.
X_train, X_test, y_train, y_test = train_test_split(df, y)

In [28]:
assert len(y_train) == 0.75 * len(df), "Did you use train_test_split() to define y_train and other variables?"

In [29]:
print(f"total rows: {len(df)}")
print(f"train rows: {len(y_train)}")
print(f"test rows: {len(y_test)}")

total rows: 40
train rows: 30
test rows: 10
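The same 30/10 split can be reproduced on synthetic arrays. This sketch assumes the default `test_size` (0.25) and adds `random_state=0`, which is my own choice for repeatability and not something the notebook specifies.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 40 synthetic rows; row i of X_demo is [2i, 2i + 1] and its target is i.
X_demo = np.arange(80).reshape(40, 2)
y_demo = np.arange(40)

# random_state fixes the shuffle so repeated runs give the same split.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

print(len(y_tr), len(y_te))
```

Rows of `X` and `y` are shuffled together, so each feature row stays paired with its own target value.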


## Use `LinearRegression` for Predicting Out-of-Sample

In [30]:
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.pipeline import make_pipeline

### <strong><span style="color:red">Challenge:</span></strong> Create an _instance_ of an `sklearn` linear model, such as `sklearn.linear_model.LinearRegression`

Name the variable `regression_model`.

You can choose from a number of other models besides `LinearRegression` if you are feeling adventurous:

> https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model

In [31]:
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()

In [32]:
from sklearn.base import BaseEstimator

assert "regression_model" in locals(), "Did you define a variable called 'regression_model'?"
assert isinstance(regression_model, BaseEstimator), "regression_model should be a linear model instance"

In [33]:
pipe = make_pipeline(
    SelectKBest(f_regression, k=4),
    regression_model
)

In [34]:
assert len(pipe) == 2
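To illustrate what the first pipeline step does, here is a sketch on synthetic data where only the first two of ten columns influence the target. All names (`X_demo`, `y_demo`, `demo_pipe`) and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data: ten columns, but only columns 0 and 1 drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 10))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=40)

demo_pipe = make_pipeline(SelectKBest(f_regression, k=4), LinearRegression())
demo_pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
selector = demo_pipe.named_steps["selectkbest"]
kept = np.flatnonzero(selector.get_support())  # indices of the 4 retained columns
print(kept)
```

`SelectKBest` keeps the `k` columns with the highest univariate F-statistics, so the regression that follows sees 4 columns rather than all 10.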

### <strong><span style="color:red">Challenge:</span></strong> `.fit()` the pipeline on `X_train` and `y_train` data

In [35]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('selectkbest',
                 SelectKBest(k=4,
                             score_func=<function f_regression at 0x133523790>)),
                ('linearregression', LinearRegression())])

In [36]:
pipe.score(X_test, y_test)

0.47865866955495884

In [37]:
assert isinstance(pipe.score(X_test, y_test), float)

### <strong><span style="color:red">Question:</span></strong> Why is `pipe.score(X_test, y_test)` lower than the earlier `model.score(df, y)`?

<span style="color:blue">Your answer:</span>

[Type your answer here.]
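One factor worth considering is the gap between in-sample and out-of-sample scoring when there are many columns and few rows. The sketch below is a hypothetical setup, not the country data: 40 synthetic rows with 24 mostly irrelevant columns, fitted on 30 rows and scored on both portions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: only column 0 carries signal; the other 23 are pure noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 24))
y_demo = X_demo[:, 0] + rng.normal(size=40)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
demo_model = LinearRegression().fit(X_tr, y_tr)

in_sample = demo_model.score(X_tr, y_tr)      # scored on rows used for fitting
out_of_sample = demo_model.score(X_te, y_te)  # scored on held-out rows

print(f"in-sample R^2:     {in_sample:.3f}")
print(f"out-of-sample R^2: {out_of_sample:.3f}")
```

With nearly as many coefficients as training rows, the fit can chase noise in the training portion, which inflates the in-sample score without helping on unseen rows.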