# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, load the data with `pd.read_csv()`, as you have done before, and save it to the DataFrame `df`.

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")

# YOUR CODE HERE
df = pd.read_csv(WHRDataSet_filename)

df.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

<Double click this Markdown cell to make it editable, and record your answers here.>
1. The World Happiness Report Dataset: WHR2018Chapter2OnlineData.csv
2. We will be predicting the label "Life Ladder", which is an evaluation by people based on how good they think their lives are. https://news.gallup.com/poll/122453/understanding-gallup-uses-cantril-scale.aspx 
3. This will be a supervised learning problem, since there is a label that we are trying to predict. The "Life Ladder" values in the dataset are continuous, so this will be a regression problem. 
4. The features in the dataset are 'country', 'year', 'Log GDP per capita', 'Social support', 'Healthy life expectancy at birth', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Positive affect', 'Negative affect', 'Confidence in national government', 'Democratic Quality', 'Delivery Quality', 'Standard deviation of ladder by country-year', 'Standard deviation/Mean of ladder by country-year', 'GINI index (World Bank estimate)', 'GINI index (World Bank estimate), average 2000-15', 'gini of household income reported in Gallup, by wp5-year'. 
5. Higher "Life Ladder" values indicate that people feel they are living good lives, while lower values indicate that people feel their lives are not as good as they could be. These self-evaluations are plausibly tied to how happy people are with their lives, so the aim of our machine learning model is to predict the Life Ladder score for a country as a proxy for "happiness". Our group felt that this problem is not a natural fit for a company, since the prediction doesn't feel like a product or tool to be sold, but it may be valuable to researchers and policy makers. When policy makers are considering a policy, they could estimate how it would change one of the features we are training on, such as Social support, and then feed the adjusted features into the model to get a predicted Life Ladder score. To estimate how a policy would affect a given feature, they could run surveys to get people's input or analyze past data. Although the WHR is reported at the country level, the same approach might also apply at the state and city level.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
# Inspecting the data: 
print(f"Shape:\n{df.shape}")
print()
print(f"Columns:\n{df.columns}")
print()
print(f"Data types:\n{df.dtypes}")
print()
print(f"Pandas describe:\n{df.describe()}")
print()

# Look for missing values: 
nan_count = np.sum(df.isnull(), axis = 0)
print(f"Missing value count:\n{nan_count}")

Shape:
(1562, 19)

Columns:
Index(['country', 'year', 'Life Ladder', 'Log GDP per capita',
       'Social support', 'Healthy life expectancy at birth',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Positive affect', 'Negative affect',
       'Confidence in national government', 'Democratic Quality',
       'Delivery Quality', 'Standard deviation of ladder by country-year',
       'Standard deviation/Mean of ladder by country-year',
       'GINI index (World Bank estimate)',
       'GINI index (World Bank estimate), average 2000-15',
       'gini of household income reported in Gallup, by wp5-year'],
      dtype='object')

Data types:
country                                                      object
year                                                          int64
Life Ladder                                                 float64
Log GDP per capita                                          float64
Social support                            

In [4]:
df.corr()

Unnamed: 0,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
year,1.0,-0.014505,0.05114,-0.052845,0.100904,0.134332,-0.014111,-0.051141,-0.02473,0.171805,-0.018925,-0.0174,-0.011016,0.273838,0.20968,-0.053342,-0.026927,0.071773
Life Ladder,-0.014505,1.0,0.779476,0.700299,0.729852,0.526058,0.20491,-0.425013,0.554462,-0.267492,-0.085543,0.607034,0.706673,-0.154257,-0.756076,-0.097255,-0.172745,-0.29408
Log GDP per capita,0.05114,0.779476,1.0,0.658591,0.841612,0.362998,-0.000334,-0.350142,0.311868,-0.120597,-0.162,0.630107,0.77037,-0.086494,-0.566376,-0.342142,-0.314639,-0.35585
Social support,-0.052845,0.700299,0.658591,1.0,0.586759,0.418213,0.077543,-0.217857,0.459656,-0.352552,-0.160353,0.536387,0.54501,-0.174091,-0.594465,-0.148387,-0.128284,-0.314072
Healthy life expectancy at birth,0.100904,0.729852,0.841612,0.586759,1.0,0.340026,0.047079,-0.311037,0.297759,-0.105255,-0.188827,0.597106,0.721081,-0.06587,-0.526026,-0.306798,-0.364279,-0.42289
Freedom to make life choices,0.134332,0.526058,0.362998,0.418213,0.340026,1.0,0.357158,-0.496932,0.615916,-0.284391,0.408096,0.445323,0.486678,-0.081104,-0.369111,0.044033,0.057697,0.108313
Generosity,-0.014111,0.20491,-0.000334,0.077543,0.047079,0.357158,1.0,-0.305019,0.380896,-0.117508,0.275648,0.118966,0.203871,-0.182119,-0.193145,-0.016602,-0.04381,0.194036
Perceptions of corruption,-0.051141,-0.425013,-0.350142,-0.217857,-0.311037,-0.496932,-0.305019,1.0,-0.302946,0.267359,-0.436614,-0.322063,-0.514183,0.30173,0.378509,0.158565,0.170775,-0.043064
Positive affect,-0.02473,0.554462,0.311868,0.459656,0.297759,0.615916,0.380896,-0.302946,1.0,-0.384112,0.144219,0.369666,0.365544,-0.069609,-0.410061,0.371113,0.298045,0.121792
Negative affect,0.171805,-0.267492,-0.120597,-0.352552,-0.105255,-0.284391,-0.117508,0.267359,-0.384112,1.0,-0.159316,-0.198636,-0.211019,0.510342,0.520042,0.171791,0.074559,0.148413


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

<Double click this Markdown cell to make it editable, and record your answers here.>

1. The new column list, with 'Life Ladder' as the label and the rest as features, is 'Life Ladder', 'Log GDP per capita', 'Social support',
       'Healthy life expectancy at birth', 'Freedom to make life choices',
       'Generosity', 'Perceptions of corruption', 'Positive affect',
       'Negative affect', 'Confidence in national government',
       'Democratic Quality', 'Delivery Quality',
       'Standard deviation of ladder by country-year',
       'Standard deviation/Mean of ladder by country-year',
       'GINI index (World Bank estimate), average 2000-15'.
   I'm going to keep all of the features except for 'country', 'year', 'GINI index (World Bank estimate)', and 'gini of household income reported in Gallup, by wp5-year'.
2. For data preparation, I plan on removing missing values and removing some columns that wouldn't make sense for prediction. Since each country has multiple rows of data, the data for each country needs to be aggregated. Since I'm using a decision tree, the data won't need to be rescaled or normalized. I'll also split the data into training and test data sets. 
3. I'm going to use a decision tree for the regression problem. I'll use the scikit-learn DecisionTreeRegressor(). The hope is that the decision tree model will be more intuitive to interpret. This will help to pinpoint what features have a strong effect on the life ladder score.
4. I'm going to split my dataset into 70/15/15 for training, validation and testing. I'll train one model with default values and evaluate the model with the validation set. Then I'm going to create a param grid and use scikit-learn's GridSearchCV to perform cross validation. For that, I'll resplit the data 85/15. 85% will be for training and cross validation and 15% for the final test. After finding the best params, I'll do a final test to see how well the model performs on the testing dataset. I'll use either MSE or RMSE to evaluate the models. 
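
One way the interpretability goal in item 3 could be checked is through a fitted tree's `feature_importances_` attribute. The sketch below is illustrative only: it runs on synthetic data with hypothetical feature names (`feat_a`, `feat_b`, `feat_noise`), not on the WHR columns, and the `max_depth=4` setting is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the label depends strongly on feat_a,
# weakly on feat_b, and not at all on feat_noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "feat_a": rng.normal(size=500),
    "feat_b": rng.normal(size=500),
    "feat_noise": rng.normal(size=500),
})
y_demo = 3.0 * X_demo["feat_a"] + 0.5 * X_demo["feat_b"] + rng.normal(scale=0.1, size=500)

tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; a larger value means the feature
# accounted for more of the variance reduction across the tree's splits.
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances describe how the tree used each feature, not a causal effect, so they are best read as a rough ranking.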

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [5]:
# Import additional packages. 
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import math

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [6]:
# There are a lot of missing values, and it doesn't make sense to try to replace the data with an average value, in this case. 

# Since the features "GINI index (World Bank estimate)" and "gini of household income reported in Gallup, by wp5-year" 
# have so many missing values, I'm just going to remove those features entirely. 

df.drop(columns=['GINI index (World Bank estimate)', 'gini of household income reported in Gallup, by wp5-year'], inplace=True)

# Check missing values again: 
nan_count = np.sum(df.isnull(), axis = 0)
print(f"Missing value count:\n{nan_count}")

Missing value count:
country                                                0
year                                                   0
Life Ladder                                            0
Log GDP per capita                                    27
Social support                                        13
Healthy life expectancy at birth                       9
Freedom to make life choices                          29
Generosity                                            80
Perceptions of corruption                             90
Positive affect                                       18
Negative affect                                       12
Confidence in national government                    161
Democratic Quality                                   171
Delivery Quality                                     171
Standard deviation of ladder by country-year           0
Standard deviation/Mean of ladder by country-year      0
GINI index (World Bank estimate), average 2000-15    176
dtype: int
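
The decision to drop the sparsest columns and then drop incomplete rows can also be expressed with an explicit missing-value threshold. The following is a minimal sketch on a toy DataFrame; the column names and the 30% cutoff are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): col_sparse is mostly missing, col_dense has one gap.
toy = pd.DataFrame({
    "label": [1.0, 2.0, 3.0, 4.0, 5.0],
    "col_dense": [0.1, np.nan, 0.3, 0.4, 0.5],
    "col_sparse": [np.nan, np.nan, np.nan, 0.9, np.nan],
})

# Fraction of missing values per column.
missing_frac = toy.isnull().mean()

# Hypothetical cutoff: drop any column with more than 30% missing values.
threshold = 0.3
sparse_cols = missing_frac[missing_frac > threshold].index.tolist()
toy_reduced = toy.drop(columns=sparse_cols)

# Then drop the remaining rows that still contain a missing value.
toy_clean = toy_reduced.dropna(axis=0).reset_index(drop=True)
print(sparse_cols, toy_clean.shape)
```

Dropping sparse columns first keeps more rows than calling `dropna` on the full frame, which is the same ordering used in this notebook.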

In [7]:
# Remove rows with missing values: 
df.dropna(axis=0, inplace=True)

# Check missing values again: 
nan_count = np.sum(df.isnull(), axis = 0)
print(f"Missing value count:\n{nan_count}\n")

# Reset index since we removed rows. 
df = df.reset_index(drop=True)

# Check shape of the dataframe: 
print(f"Dataframe shape:\n{df.shape}\n")

# Check dataframe: 
print(df.head())

Missing value count:
country                                              0
year                                                 0
Life Ladder                                          0
Log GDP per capita                                   0
Social support                                       0
Healthy life expectancy at birth                     0
Freedom to make life choices                         0
Generosity                                           0
Perceptions of corruption                            0
Positive affect                                      0
Negative affect                                      0
Confidence in national government                    0
Democratic Quality                                   0
Delivery Quality                                     0
Standard deviation of ladder by country-year         0
Standard deviation/Mean of ladder by country-year    0
GINI index (World Bank estimate), average 2000-15    0
dtype: int64

Dataframe shape:
(1120, 17)

 

Since each country has multiple rows for data from different years, the initial idea was to combine all the data into one row
per country. 

    # Combine values for each country into one row. Find the average value of each feature for that country. 
    df = df.groupby(['country']).mean()
    
    # Then reset the index: 
    df = df.reset_index()

After doing so, the dataset only has 135 examples. This is not a lot to train the model on, especially after splitting the 
dataset into training, validation, and test datasets. 

After some thought, I realized that the examples for each country did not need to be combined. If we combined them, each row would have a different country name, so the "country" column would provide almost no predictive value and would need to be removed anyway. I'm more interested in how the other features affect the Life Ladder score, so I'll just remove the country column without combining the values. This also gives us more examples for training and validation.

In [8]:
# # Combine values for each country into one row. Find the average value of each feature for that country. 
# df = df.groupby(['country']).mean()

# # Then reset the index: 
# df = df.reset_index()

In [9]:
# Remove 'year' and 'country'. The year doesn't provide any important information,
# and 'country' identifies a row rather than describing it, so neither should be
# used as a predictive feature.
df.drop(columns=['year', 'country'], inplace=True)

# Also drop "Standard deviation/Mean of ladder by country-year", since it is computed from the Life Ladder value (the label).
df.drop(columns=['Standard deviation/Mean of ladder by country-year'], inplace=True)

In [10]:
df

Unnamed: 0,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,"GINI index (World Bank estimate), average 2000-15"
0,4.634252,9.077325,0.821372,66.576630,0.528605,-0.016183,0.874700,0.552678,0.246335,0.300681,-0.045108,-0.420024,1.764947,0.30325
1,5.510124,9.246649,0.784502,68.028885,0.601512,-0.174559,0.847675,0.606636,0.271393,0.364894,-0.060784,-0.328862,1.921203,0.30325
2,4.550648,9.258439,0.759477,68.291374,0.631830,-0.132977,0.862905,0.633609,0.338379,0.338095,0.070411,-0.330956,2.315580,0.30325
3,4.813763,9.278097,0.625587,68.512100,0.734648,-0.030553,0.882704,0.684911,0.334543,0.498786,0.314873,-0.187407,2.660069,0.30325
4,4.606651,9.303031,0.639356,68.691956,0.703851,-0.086883,0.884793,0.688370,0.350427,0.506978,0.251629,-0.152544,2.729001,0.30325
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1115,4.955101,7.534424,0.896476,47.654789,0.469531,-0.072877,0.858691,0.669279,0.177311,0.407084,-1.125315,-1.555728,1.853195,0.43200
1116,4.690188,7.565154,0.799274,48.949745,0.575884,-0.076716,0.830937,0.711885,0.182288,0.527755,-1.026085,-1.526321,1.964805,0.43200
1117,4.184451,7.562753,0.765839,50.051235,0.642034,-0.045885,0.820217,0.725214,0.239111,0.566209,-0.985267,-1.484067,2.079248,0.43200
1118,3.703191,7.556052,0.735800,50.925652,0.667193,-0.094585,0.810457,0.715079,0.178861,0.590012,-0.893078,-1.357514,2.198865,0.43200


In [11]:
df.columns

Index(['Life Ladder', 'Log GDP per capita', 'Social support',
       'Healthy life expectancy at birth', 'Freedom to make life choices',
       'Generosity', 'Perceptions of corruption', 'Positive affect',
       'Negative affect', 'Confidence in national government',
       'Democratic Quality', 'Delivery Quality',
       'Standard deviation of ladder by country-year',
       'GINI index (World Bank estimate), average 2000-15'],
      dtype='object')

In [12]:
df.shape

(1120, 14)

# Initial evaluation

In [13]:
# Split the data: 
X = df.drop(columns='Life Ladder', inplace=False)
y = df['Life Ladder']

# Train, val, test. Note: test_size=0.21 of the remaining 85% gives roughly 67/18/15 rather than exactly 70/15/15.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.21, random_state=42)
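
The two chained `train_test_split` calls above apply the second `test_size` to the rows left over after the first split, not to the full data set. The short calculation below works out the resulting shares, and the second fraction that an exact 70/15/15 split would need; it is plain arithmetic and does not touch the data.

```python
# Fractions of the full data set produced by two chained splits.
first_test_size = 0.15   # held out as the test set
second_test_size = 0.21  # fraction of the remaining 85% used for validation

remaining = 1.0 - first_test_size
val_frac = remaining * second_test_size
train_frac = remaining * (1.0 - second_test_size)
print(f"train={train_frac:.4f}, val={val_frac:.4f}, test={first_test_size:.4f}")

# For an exact 15% validation share, the second fraction would be 0.15 / 0.85.
exact_second_test_size = 0.15 / remaining
print(f"second test_size for 70/15/15: {exact_second_test_size:.4f}")
```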

In [14]:
X

Unnamed: 0,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,"GINI index (World Bank estimate), average 2000-15"
0,9.077325,0.821372,66.576630,0.528605,-0.016183,0.874700,0.552678,0.246335,0.300681,-0.045108,-0.420024,1.764947,0.30325
1,9.246649,0.784502,68.028885,0.601512,-0.174559,0.847675,0.606636,0.271393,0.364894,-0.060784,-0.328862,1.921203,0.30325
2,9.258439,0.759477,68.291374,0.631830,-0.132977,0.862905,0.633609,0.338379,0.338095,0.070411,-0.330956,2.315580,0.30325
3,9.278097,0.625587,68.512100,0.734648,-0.030553,0.882704,0.684911,0.334543,0.498786,0.314873,-0.187407,2.660069,0.30325
4,9.303031,0.639356,68.691956,0.703851,-0.086883,0.884793,0.688370,0.350427,0.506978,0.251629,-0.152544,2.729001,0.30325
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1115,7.534424,0.896476,47.654789,0.469531,-0.072877,0.858691,0.669279,0.177311,0.407084,-1.125315,-1.555728,1.853195,0.43200
1116,7.565154,0.799274,48.949745,0.575884,-0.076716,0.830937,0.711885,0.182288,0.527755,-1.026085,-1.526321,1.964805,0.43200
1117,7.562753,0.765839,50.051235,0.642034,-0.045885,0.820217,0.725214,0.239111,0.566209,-0.985267,-1.484067,2.079248,0.43200
1118,7.556052,0.735800,50.925652,0.667193,-0.094585,0.810457,0.715079,0.178861,0.590012,-0.893078,-1.357514,2.198865,0.43200


In [15]:
y

0       4.634252
1       5.510124
2       4.550648
3       4.813763
4       4.606651
          ...   
1115    4.955101
1116    4.690188
1117    4.184451
1118    3.703191
1119    3.735400
Name: Life Ladder, Length: 1120, dtype: float64

In [16]:
# Train the model: 

initial_model = DecisionTreeRegressor()

initial_model.fit(X_train, y_train)

initial_train_predictions = initial_model.predict(X_train)
initial_val_predictions = initial_model.predict(X_val)

train_mse = mean_squared_error(y_train, initial_train_predictions)
val_mse = mean_squared_error(y_val, initial_val_predictions)

print(f"Train MSE: {train_mse}\n")
print(f"Val MSE: {val_mse}\n\n")

train_rmse = math.sqrt(train_mse)
val_rmse = math.sqrt(val_mse)

print(f"Train RMSE: {train_rmse}\n")
print(f"Val RMSE: {val_rmse}")

Train MSE: 5.789219659357277e-10

Val MSE: 0.22676690818453765


Train RMSE: 2.4060797283875022e-05

Val RMSE: 0.4762004915836791


In [17]:
train_r2_score = r2_score(y_train, initial_train_predictions)
val_r2_score = r2_score(y_val, initial_val_predictions)

print(f"Train R2: {train_r2_score}\n")
print(f"Val R2: {val_r2_score}")

Train R2: 0.9999999995355874

Val R2: 0.8400258596268183


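The training error is essentially zero (RMSE of about 2.4e-05), while the validation RMSE is about 0.48. A decision tree with default settings can keep growing until it memorizes the training set, so this gap indicates overfitting, even though the validation error is still fairly low.

Both R2 scores are high (about 1.00 on training and 0.84 on validation), which is a good sign, but the near-perfect training score reflects the same overfitting.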
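
The train/validation gap described above can be reproduced on synthetic data by comparing a tree with no depth limit against a depth-limited one. This is a sketch under assumptions: the data-generating function, the noise level and `max_depth=4` are all made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption): a noisy function of 5 random features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 5))
y_demo = X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.5, size=600)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

results = {}
for name, depth in [("unconstrained", None), ("max_depth=4", 4)]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Store (train RMSE, validation RMSE) for each configuration.
    results[name] = (rmse(model, X_tr, y_tr), rmse(model, X_va, y_va))
    print(f"{name}: train RMSE={results[name][0]:.3f}, val RMSE={results[name][1]:.3f}")
```

The unconstrained tree fits the training rows almost exactly, so its training RMSE says little about how it will generalize; the validation RMSE is the number to compare.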

# Use Grid Search CV

In [18]:
max_depth_list = list(range(2, 50, 2))
min_samples_split_list = list(range(2, 50, 2))
min_samples_leaf_list = list(range(1, 50, 5))

param_grid = {'max_depth': max_depth_list, 'min_samples_split': min_samples_split_list,
             'min_samples_leaf': min_samples_leaf_list}

estimator = DecisionTreeRegressor()

# 85/15 train, test:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

In [19]:
grid_search = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=5)

grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")

Best params: {'max_depth': 8, 'min_samples_leaf': 11, 'min_samples_split': 26}
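
When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's default score, which for a regressor is R2. Since the project plan names MSE or RMSE as the metric, the search could instead be run with an explicit scoring string. The sketch below assumes synthetic data and a deliberately small grid so that it runs quickly; it is not the grid used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), not the WHR features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.3, size=300)

# Deliberately small grid so the sketch runs quickly.
small_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid=small_grid,
    cv=5,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the negated MSE
)
search.fit(X_demo, y_demo)

# RMSE derived from the best mean cross-validated MSE.
best_cv_rmse = float(np.sqrt(-search.best_score_))
print(search.best_params_, best_cv_rmse)
```

Setting `random_state` on the estimator makes the selected parameters reproducible between runs, which the default `DecisionTreeRegressor()` does not guarantee when candidate splits tie.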


In [20]:
print(grid_search.best_estimator_)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=8,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=11, min_samples_split=26,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')


In [21]:
best_model = grid_search.best_estimator_

train_predictions = best_model.predict(X_train)
test_predictions = best_model.predict(X_test)

In [22]:
mse_train = mean_squared_error(y_train, train_predictions)
mse_test = mean_squared_error(y_test, test_predictions)

print(f"Train MSE: {mse_train}\n")
print(f"Test MSE: {mse_test}\n\n")

rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"Train RMSE: {rmse_train}\n")
print(f"Test RMSE: {rmse_test}")

Train MSE: 0.11499499600012367

Test MSE: 0.28349385985186987


Train RMSE: 0.33910912108069824

Test RMSE: 0.5324414144785038


In [23]:
r2_score_train = r2_score(y_train, train_predictions)
r2_score_test = r2_score(y_test, test_predictions)

print(f"Train R2: {r2_score_train}\n")
print(f"Test R2: {r2_score_test}")

Train R2: 0.9105545491576945

Test R2: 0.7986279997223231


The R2 scores for training and testing are both quite high as well. 

The training RMSE is higher than it was for the initial model but still low, which indicates that the tuned tree is not overfitting as much as before. The test RMSE (about 0.53) is a little higher than the initial model's validation RMSE (about 0.48), but the two values are close.

In [24]:
y.describe()

count    1120.000000
mean        5.398692
std         1.142460
min         2.693061
25%         4.553355
50%         5.256277
75%         6.210087
max         7.970892
Name: Life Ladder, dtype: float64

The RMSE value has the same units as the Life Ladder score. Since the scores in this data set range from about 2.7 to 8.0 with a standard deviation of about 1.14, a test RMSE of about 0.53 is still a sizable error. Though the new model is not as overfit as before, it is still not generalizing as well as it could be. 
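
To put the RMSE in context, it can be compared with the label's standard deviation, which is roughly the RMSE of a model that always predicts the mean. The calculation below uses the numbers printed above; it is only approximate, because the standard deviation comes from all 1120 rows rather than from the test rows alone.

```python
# Values copied from the outputs above.
test_rmse_tree = 0.5324414144785038  # test RMSE of the tuned decision tree
label_std = 1.142460                 # std of Life Ladder from y.describe()

# A model that always predicts the mean has an RMSE close to the label's
# standard deviation, so this ratio shows how much of that baseline error remains.
ratio = test_rmse_tree / label_std
print(f"RMSE / std = {ratio:.3f}")

# Rough R2 implied by the ratio (approximate, because label_std is computed
# on all rows, not only on the test rows).
approx_r2 = 1.0 - ratio ** 2
print(f"approximate R2 = {approx_r2:.3f}")
```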

This could be due to the dataset being rather small, which makes it more likely for the model to overfit. To improve upon the model in the future, more data will likely be needed. I could also try a different model, such as Linear Regression. I noticed during data inspection that quite a few of the features had pretty high correlation coefficient values with the label. 

# Trying Linear Regression

In [25]:
from sklearn.linear_model import LinearRegression

In [26]:
df.corr()['Life Ladder'].abs().sort_values()

Confidence in national government                    0.100633
Standard deviation of ladder by country-year         0.109855
GINI index (World Bank estimate), average 2000-15    0.162708
Negative affect                                      0.215237
Generosity                                           0.226619
Perceptions of corruption                            0.463530
Freedom to make life choices                         0.541789
Positive affect                                      0.571141
Democratic Quality                                   0.626150
Social support                                       0.686535
Delivery Quality                                     0.722175
Healthy life expectancy at birth                     0.750038
Log GDP per capita                                   0.778659
Life Ladder                                          1.000000
Name: Life Ladder, dtype: float64

In [35]:
all_features = df.corr()['Life Ladder'].abs().sort_values(ascending=False).index.tolist()

all_features.remove('Life Ladder')

all_features

['Log GDP per capita',
 'Healthy life expectancy at birth',
 'Delivery Quality',
 'Social support',
 'Democratic Quality',
 'Positive affect',
 'Freedom to make life choices',
 'Perceptions of corruption',
 'Generosity',
 'Negative affect',
 'GINI index (World Bank estimate), average 2000-15',
 'Standard deviation of ladder by country-year',
 'Confidence in national government']

In [38]:
# Train, val, test. Note: test_size=0.21 of the remaining 85% gives roughly 67/18/15 rather than exactly 70/15/15.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.21, random_state=42)

In [52]:
RMSE_list = []

# all_features[0:i+1] selects the top i+1 features, so the first model uses
# 2 features and the last two iterations both use all 13.
for i in range(1, len(all_features) + 1, 1):
    features = all_features[0:i+1]
    lr_model = LinearRegression()
    lr_model.fit(X_train[features], y_train)
    predictions = lr_model.predict(X_val[features])
    rmse_score = math.sqrt(mean_squared_error(y_val, predictions))
    RMSE_list.append(rmse_score)

In [53]:
min_val = min(RMSE_list)
min_index = RMSE_list.index(min_val)

print(min_val)
print(min_index)

0.5022535828744622
11


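The lowest RMSE on the validation set is at index 11 of the list. Because the loop slices `all_features[0:i+1]` starting from `i = 1`, index 11 corresponds to the model trained on all 13 features, not the top 11. The RMSE is also just about the same as the one I got for the decision tree. 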

In [55]:
# Evaluate on test data. `lr_model` and `features` come from the final loop iteration, which used all 13 features.

test_predictions = lr_model.predict(X_test[features])
test_rmse = math.sqrt(mean_squared_error(y_test, test_predictions))
print(test_rmse)

0.566126190384712


The test RMSE (about 0.57) did not show any improvement over the decision tree's test RMSE (about 0.53). 
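
The feature sweep above could be restructured so that it starts from a single feature, ranks features on the training rows only, and keeps the best model for the final test evaluation instead of relying on whatever the last loop iteration left behind. The following is a minimal sketch on synthetic data; the column names `f0` to `f3` and the data-generating function are assumptions, and this is not the code that produced the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): f0 and f1 are informative, f2 and f3 are noise.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(400, 4)), columns=["f0", "f1", "f2", "f3"])
y_demo = 2.0 * X_demo["f0"] + 1.0 * X_demo["f1"] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.15 / 0.85, random_state=1)

# Rank features by absolute correlation with the label, using training rows only.
ranked = X_tr.corrwith(y_tr).abs().sort_values(ascending=False).index.tolist()

best = None  # (validation RMSE, feature list, fitted model)
for k in range(1, len(ranked) + 1):
    cols = ranked[:k]  # top-k features, starting from a single feature
    model = LinearRegression().fit(X_tr[cols], y_tr)
    val_rmse = float(np.sqrt(mean_squared_error(y_va, model.predict(X_va[cols]))))
    if best is None or val_rmse < best[0]:
        best = (val_rmse, cols, model)

# Evaluate the retained best model once on the held-out test rows.
best_val_rmse, best_cols, best_model = best
test_rmse = float(np.sqrt(mean_squared_error(y_te, best_model.predict(X_te[best_cols]))))
print(len(best_cols), best_val_rmse, test_rmse)
```

Keeping the best model and its feature list together in one tuple avoids a mismatch between the model used for the test score and the model that won on validation.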