# Lab 6: Train Various Regression Models and Compare Their Performances

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In this lab assignment, you will train various regression models (regressors) and compare their performances. You will train, test and evaluate individual models as well as ensemble models. You will:

1. Build your DataFrame and define your ML problem:
    * Load the Airbnb "listings" data set
    * Define the label - what are you predicting?
    * Identify the features
2. Create labeled examples from the data set.
3. Split the data into training and test data sets.
4. Train, test and evaluate two individual regressors.
5. Use the stacking ensemble method to train the same regressors.
6. Train, test and evaluate Gradient Boosted Decision Trees.
7. Train, test and evaluate Random Forest.
8. Visualize and compare the performance of all of the models.

<font color='red'><b>Note:</b></font><br>
<font color='red'><b>1. Some of the code cells in this notebook may take a while to run.</b></font><br>
<font color='red'><b>2. Ignore warning messages that pertain to deprecated packages.</b></font>

## Part 1. Build Your DataFrame and Define Your ML Problem

#### Load a Data Set and Save it as a Pandas DataFrame

We will work with the data set ``airbnbData_train``. This data set already has all the necessary preprocessing steps implemented, including one-hot encoding of the categorical variables, scaling of all numerical variable values, and imputing missing values. It is ready for modeling.

<b>Task</b>: In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`.

You will be working with the file named "airbnbData_train.csv" that is located in a folder named "data_regressors".

In [3]:
# YOUR CODE HERE
filename = os.path.join("data_regressors", "airbnbData_train.csv")
df = pd.read_csv(filename)

# Inspect the size and column names of the loaded DataFrame
print(df.shape)
print(df.columns)

(28022, 50)
Index(['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified',
       'has_availability', 'instant_bookable', 'host_response_rate',
       'host_acceptance_rate', 'host_listings_count',
       'host_total_listings_count', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'price', 'minimum_nights', 'maximum_nights',
       'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_cou

#### Define the Label

Your goal is to train a machine learning model that predicts the price of an Airbnb listing. This is an example of supervised learning and is a regression problem. In our dataset, our label will be the `price` column and the label contains continuous values.

#### Evaluation Metrics for Regressors

So far, we have mostly focused on classification problems. For this assignment, we will focus on a regression problem and predict a continuous outcome. There are different evaluation metrics that are used to determine the performance of a regressor. We will use two metrics to evaluate our regressors: RMSE (root mean square error) and $R^2$ (coefficient of determination).

RMSE:<br>
RMSE is the square root of the mean of the squared differences between the predicted values and the actual values, so it measures the typical size of a prediction error in the units of the label. We will compute the RMSE on the test set. To compute the RMSE, we will use the scikit-learn ```mean_squared_error()``` function. Since RMSE measures prediction error, lower RMSE values indicate better performance: the model fits the data well and makes more accurate predictions. Higher RMSE values indicate that the model is not performing well.

$R^2$:<br>
$R^2$ measures the proportion of the variability in the label that the model explains, here computed on the test data. An $R^2$ value of 1 is perfect and 0 implies no explanatory value (the score can even be negative when a model does worse than always predicting the mean of the label). We can use scikit-learn's ```r2_score()``` function to compute it. Since $R^2$ measures how well the model fits the data, a higher $R^2$ value indicates good performance and a lower $R^2$ indicates poor performance.
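To make the two definitions concrete, here is a minimal sketch that computes both metrics by hand with NumPy. The four values are toy numbers chosen for illustration only; they are not taken from the Airbnb data.

```python
import numpy as np

# Toy values chosen only for illustration (not taken from the Airbnb data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

# RMSE: square root of the mean squared prediction error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('RMSE: {0:.4f}'.format(rmse))   # RMSE: 0.8660
print('R2: {0:.2f}'.format(r2))       # R2: 0.85
```

The squared errors are 1, 0, 1 and 1, so the mean squared error is 0.75 and the RMSE is its square root. The total sum of squares around the mean (6) is 20, which gives $R^2 = 1 - 3/20 = 0.85$.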

#### Identify Features

Our features will be all of the remaining columns in the dataset.

## Part 2. Create Labeled Examples from the Data Set 

<b>Task</b>: In the code cell below, create labeled examples from DataFrame `df`.

In [9]:
# YOUR CODE HERE
y = df['price']
X = df.loc[:, df.columns != 'price']

## Part 3. Create Training and Test Data Sets

<b>Task</b>: In the code cell below, create training and test sets out of the labeled examples. Create a test set that is 30 percent of the size of the data set. Save the results to variables `X_train, X_test, y_train, y_test`.

In [13]:
# YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1234)
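As a rough illustration of Parts 2 and 3, the sketch below separates a label from its features and holds out 30 percent of the rows. The 10-row DataFrame `toy_df` and its values are hypothetical stand-ins, not the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row DataFrame standing in for the real data set
toy_df = pd.DataFrame({
    'accommodates': np.arange(10),
    'bedrooms': np.arange(10) % 3,
    'price': np.linspace(0.0, 1.0, 10),
})

# The label is one column; the features are all the remaining columns
toy_y = toy_df['price']
toy_X = toy_df.drop(columns='price')

# Hold out 30 percent of the rows; random_state makes the split repeatable
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.30, random_state=1234)

print(toy_X_train.shape, toy_X_test.shape)   # (7, 2) (3, 2)
```

Fixing `random_state` means that every model later in the lab is compared on the same held-out rows.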

## Part 4: Train, Test and Evaluate Two Regression Models: Linear Regression and Decision Tree

### a. Train, Test and Evaluate a Linear Regression

You will use the scikit-learn `LinearRegression` class to create a linear regression model. For more information, consult the online [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

First let's import `LinearRegression`:

In [14]:
from sklearn.linear_model import LinearRegression

<b>Task</b>: Initialize a scikit-learn `LinearRegression` model object with no arguments, and fit the model to the training data. The model object should be named `lr_model`.

In [16]:
# YOUR CODE HERE
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

<b>Task:</b> Test your model on the test set (`X_test`). Call the ``predict()`` method  to use the fitted model to generate a vector of predictions on the test set. Save the result to the variable ``y_lr_pred``.

In [17]:
# Call predict() to use the fitted model to make predictions on the test data
# YOUR CODE HERE
y_lr_pred = lr_model.predict(X_test)

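To compute the RMSE, we will use the scikit-learn ```mean_squared_error()``` function, which computes the mean squared error between the predicted values and the actual values: ```y_lr_pred``` and ```y_test```. In order to obtain the root mean squared error, we will specify the parameter `squared=False`. (Recent scikit-learn releases deprecate the `squared` parameter; if your version does not accept it, take the square root of the mean squared error with `np.sqrt()` instead.)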

To compute the $R^2$, we will use the scikit-learn ```r2_score()``` function. 

<b>Task</b>: In the code cell below, do the following:

1. Call the `mean_squared_error()` function with arguments `y_test` and `y_lr_pred` and the parameter `squared=False` to find the RMSE. Save your result to the variable `lr_rmse`.

2. Call the `r2_score()` function with the arguments `y_test` and `y_lr_pred`.  Save the result to the variable `lr_r2`.

In [None]:
# 1. Compute the RMSE using mean_squared_error()
# YOUR CODE HERE



# 2. Compute the R2 score using r2_score()
# YOUR CODE HERE

print('[LR] Root Mean Squared Error: {0}'.format(lr_rmse))
print('[LR] R2: {0}'.format(lr_r2))
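For reference, the whole fit, predict and evaluate cycle can be sketched on synthetic data as follows. The data from `make_regression` and all `demo_` names are assumptions made for this illustration. The RMSE is computed as the square root of the MSE, which should work whether or not the installed scikit-learn version accepts `squared=False`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Airbnb features and label (an assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)

# Fit on the training portion, then predict on the held-out portion
demo_lr = LinearRegression()
demo_lr.fit(Xd_train, yd_train)
demo_pred = demo_lr.predict(Xd_test)

# The square root of the MSE gives the RMSE in any scikit-learn version
demo_rmse = np.sqrt(mean_squared_error(yd_test, demo_pred))
demo_r2 = r2_score(yd_test, demo_pred)

print('[demo LR] Root Mean Squared Error: {0}'.format(demo_rmse))
print('[demo LR] R2: {0}'.format(demo_r2))
```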

### b. Train, Test and Evaluate a Decision Tree Using GridSearch

You will use the scikit-learn `DecisionTreeRegressor` class to create a decision tree regressor. For more information, consult the online [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).

First let's import `DecisionTreeRegressor`:

In [None]:
from sklearn.tree import DecisionTreeRegressor

#### Set Up a Parameter Grid 

<b>Task</b>: Create a dictionary called `param_grid` that contains possible hyperparameter values for `max_depth` and `min_samples_leaf`. The dictionary should contain the following key/value pairs:

* a key called 'max_depth' with a value which is a list consisting of the integers 4 and 8
* a key called 'min_samples_leaf' with a value which is a list consisting of the integers 25 and 50

In [None]:
# YOUR CODE HERE
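A parameter grid is an ordinary dictionary that maps hyperparameter names to lists of candidate values. The sketch below uses deliberately different, illustrative values (the task above specifies the ones to use in the lab) and enumerates the combinations that a grid search would try.

```python
from itertools import product

# Illustrative values only; the task above specifies the values to use in the lab
example_grid = {
    'max_depth': [2, 3],
    'min_samples_leaf': [5, 10, 20],
}

# Every combination of values is one candidate model for the grid search
combos = [dict(zip(example_grid.keys(), values))
          for values in product(*example_grid.values())]

print(len(combos))   # 6
print(combos[0])     # {'max_depth': 2, 'min_samples_leaf': 5}
```

With 3-fold cross-validation, each of these candidates is trained three times, so the cost of a grid search grows quickly as values are added.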

<b>Task:</b> Use `GridSearchCV` to fit a grid of decision tree regressors and search over the different values of hyperparameters `max_depth` and `min_samples_leaf` to find the ones that result in the best 3-fold cross-validation (CV) score.


You will pass the following arguments to `GridSearchCV()`:

1. A decision tree **regressor** model object.
2. The `param_grid` variable.
3. The number of folds (`cv=3`).
4. The scoring method `scoring='neg_root_mean_squared_error'`. Note that `neg_root_mean_squared_error` returns the negative RMSE.


Complete the code in the cell below.

In [None]:
print('Running Grid Search...')

# 1. Create a DecisionTreeRegressor model object without supplying arguments. 
#    Save the model object to the variable 'dt_regressor'

dt_regressor = # YOUR CODE HERE


# 2. Run a Grid Search with 3-fold cross-validation and assign the output to the object 'dt_grid'.
#    * Pass the model and the parameter grid to GridSearchCV()
#    * Set the number of folds to 3
#    * Specify the scoring method

dt_grid = # YOUR CODE HERE


# 3. Fit the model (use the 'dt_grid' variable) on the training data and assign the fitted model to the 
#    variable 'dt_grid_search'

dt_grid_search = # YOUR CODE HERE

print('Done')
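The pattern described above can be sketched on synthetic data as follows. The data, the grid values and the `demo_` names are assumptions for illustration; only the `GridSearchCV` arguments mirror the ones listed in the task.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the training set (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

demo_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'max_depth': [2, 4], 'min_samples_leaf': [5, 10]},  # illustrative values
    cv=3,
    scoring='neg_root_mean_squared_error',
)

# fit() returns the fitted GridSearchCV object itself
demo_search = demo_grid.fit(X_demo, y_demo)

# best_score_ is the mean negative RMSE across the folds, so flip the sign
demo_cv_rmse = -1 * demo_search.best_score_
print(demo_search.best_params_)
print('CV RMSE: {:.2f}'.format(demo_cv_rmse))
```

scikit-learn scorers follow a "higher is better" convention, which is why error metrics such as RMSE are exposed in negated form.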


The code cell below prints the RMSE score of the best model using the `best_score_` attribute of the fitted grid search object `dt_grid_search`. This is the mean cross-validation score on the training data, not a test-set score. Note that specifying a scoring method of `neg_root_mean_squared_error` will result in the negative RMSE, so we will multiply `dt_grid_search.best_score_` by -1 to obtain the RMSE.

In [None]:
rmse_DT = -1 * dt_grid_search.best_score_
print("[DT] RMSE for the best model is : {:.2f}".format(rmse_DT) )

<b>Task</b>: In the code cell below, obtain the best model hyperparameters identified by the grid search and save them to the variable `dt_best_params`.

In [None]:
dt_best_params = # YOUR CODE HERE

dt_best_params

<b>Task</b>: In the code cell below, initialize a `DecisionTreeRegressor` model object, supplying the best values of hyperparameters `max_depth` and `min_samples_leaf` as arguments.  Name the model object `dt_model`. Then fit the model `dt_model` to the training data.

In [None]:
# YOUR CODE HERE

<b>Task:</b> Test your model `dt_model` on the test set `X_test`. Call the ``predict()`` method  to use the fitted model to generate a vector of predictions on the test set. Save the result to the variable ``y_dt_pred``. Evaluate the results by computing the RMSE and R2 score in the same manner as you did above. Save the results to the variables `dt_rmse` and `dt_r2`.

Complete the code in the cell below to accomplish this.

In [None]:
# 1. Use the fitted model to make predictions on the test data
# YOUR CODE HERE


# 2. Compute the RMSE using mean_squared_error()
# YOUR CODE HERE


# 3. Compute the R2 score using r2_score()
# YOUR CODE HERE


print('[DT] Root Mean Squared Error: {0}'.format(dt_rmse))
print('[DT] R2: {0}'.format(dt_r2))

## Part 5: Train, Test and Evaluate Ensemble Models: Stacking 

You will use the stacking ensemble method to train two regression models. You will use the scikit-learn `StackingRegressor` class. For more information, consult the online [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html).

First let's import `StackingRegressor`:

In [None]:
from sklearn.ensemble import StackingRegressor

In this part of the assignment, we will use two models jointly. In the code cell below, we create a list of tuples, each consisting of a shorthand name that we choose and the corresponding scikit-learn model object. We will specify the hyperparameters for the decision tree that we determined through the grid search above.

In [None]:
estimators = [("DT", DecisionTreeRegressor(max_depth=8, min_samples_leaf=25)),
              ("LR", LinearRegression())
             ]

<b>Task</b>: 


1. Create a `StackingRegressor` model object. Call `StackingRegressor()` with the following parameters:
    * Assign the list `estimators` to the parameter `estimators`.
    * Use the parameter `passthrough=False`.

    Assign the result to the variable `stacking_model`.

2. Fit `stacking_model` to the training data.

As you read up on the definition of the `StackingRegressor` class, you will notice that by default, the predictions of the individual models are combined using a ridge regression model (`RidgeCV`), which serves as the "final estimator".

In [None]:
print('Implement Stacking...')

# YOUR CODE HERE


print('End')
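To see what stacking does under the hood, here is a small sketch on synthetic data. The data, the tree depth and the `demo_` names are assumptions for illustration. It checks which final estimator is used by default and how many inputs that estimator receives.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the lab data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)

# Each tuple is (shorthand name, model object)
demo_estimators = [('DT', DecisionTreeRegressor(max_depth=4, random_state=0)),
                   ('LR', LinearRegression())]

demo_stack = StackingRegressor(estimators=demo_estimators, passthrough=False)
demo_stack.fit(Xd_train, yd_train)

# With passthrough=False the final estimator sees one column per base model
print(type(demo_stack.final_estimator_).__name__)
print(demo_stack.final_estimator_.coef_.shape)

demo_stack_pred = demo_stack.predict(Xd_test)
```

The two coefficients of the final estimator are the weights given to the decision tree's and the linear regression's predictions. Setting `passthrough=True` would additionally feed the original features to the final estimator.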

<b>Task:</b> Use the `predict()` method to test your ensemble model `stacking_model` on the test set (`X_test`). Save the result to the variable `stacking_pred`. Evaluate the results by computing the RMSE and R2 score. Save the results to the variables `stack_rmse` and `stack_r2`.

Complete the code in the cell below to accomplish this.

In [None]:
# 1. Use the fitted model to make predictions on the test data
# YOUR CODE HERE


# 2. Compute the RMSE 
# YOUR CODE HERE


# 3. Compute the R2 score
# YOUR CODE HERE

   
print('[Stacking] Root Mean Squared Error: {0}'.format(stack_rmse))
print('[Stacking] R2: {0}'.format(stack_r2))

## Part 6: Train, Test and Evaluate Ensemble Models: Gradient Boosted Decision Trees

You will use the scikit-learn `GradientBoostingRegressor` class to create a gradient boosted decision tree. For more information, consult the online [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html).

First let's import `GradientBoostingRegressor`:

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

Let's assume you already performed a grid search to find the best model hyperparameters for your gradient boosted decision tree. (We are omitting this step to save computation time.) The best values are: `max_depth=2`, and `n_estimators = 300`. 

<b>Task</b>: Initialize a `GradientBoostingRegressor` model object with the above values as arguments. Save the result to the variable `gbdt_model`. Fit the `gbdt_model` model to the training data.

In [None]:
print('Begin GBDT Implementation...')

# YOUR CODE HERE

print('End')
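Gradient boosting adds trees one at a time, each one correcting the errors of the ensemble so far. The sketch below illustrates this on synthetic data with small, illustrative settings (not the hyperparameter values given above); the `demo_` names are assumptions. It uses `staged_predict()` to track the test RMSE as trees are added.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the lab data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)

# Small illustrative settings so the sketch runs quickly
demo_gbdt = GradientBoostingRegressor(max_depth=2, n_estimators=50, random_state=0)
demo_gbdt.fit(Xd_train, yd_train)

# staged_predict() yields the test predictions after each successive tree
stage_rmse = [np.sqrt(mean_squared_error(yd_test, pred))
              for pred in demo_gbdt.staged_predict(Xd_test)]

print('RMSE after 1 tree: {:.2f}'.format(stage_rmse[0]))
print('RMSE after {} trees: {:.2f}'.format(len(stage_rmse), stage_rmse[-1]))
```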

<b>Task:</b> Use the `predict()` method to test your model `gbdt_model` on the test set `X_test`. Save the result to the variable ``y_gbdt_pred``. Evaluate the results by computing the RMSE and R2 score in the same manner as you did above. Save the results to the variables `gbdt_rmse` and `gbdt_r2`.

Complete the code in the cell below to accomplish this.

In [None]:
# 1. Use the fitted model to make predictions on the test data
# YOUR CODE HERE


# 2. Compute the RMSE 
# YOUR CODE HERE


# 3. Compute the R2 score 
# YOUR CODE HERE


print('[GBDT] Root Mean Squared Error: {0}'.format(gbdt_rmse))
print('[GBDT] R2: {0}'.format(gbdt_r2))                 

## Part 7: Train, Test and Evaluate Ensemble Models: Random Forest

You will use the scikit-learn `RandomForestRegressor` class to create a random forest model. For more information, consult the online [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

First let's import `RandomForestRegressor`:

In [None]:
from sklearn.ensemble import RandomForestRegressor

Let's assume you already performed a grid search to find the best model hyperparameters for your random forest model. (We are omitting this step to save computation time.) The best values are: `max_depth=32`, and `n_estimators = 300`. 

<b>Task</b>: Initialize a `RandomForestRegressor` model object with the above values as arguments. Save the result to the variable `rf_model`. Fit the `rf_model` model to the training data.

In [None]:
print('Begin RF Implementation...')

# YOUR CODE HERE

print('End')
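Unlike boosting, a random forest trains its trees independently and averages their predictions. The sketch below checks this on synthetic data with small, illustrative settings (not the hyperparameter values given above); the `demo_` names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the lab data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)

# Small illustrative settings so the sketch runs quickly
demo_rf = RandomForestRegressor(max_depth=8, n_estimators=50, random_state=0)
demo_rf.fit(Xd_train, yd_train)
demo_rf_pred = demo_rf.predict(Xd_test)

# The forest's prediction is the average of its individual trees' predictions
tree_preds = np.array([tree.predict(Xd_test) for tree in demo_rf.estimators_])
avg_of_trees = tree_preds.mean(axis=0)

print(len(demo_rf.estimators_))
print(np.allclose(demo_rf_pred, avg_of_trees))
```

Averaging many decorrelated trees reduces variance, which is why a forest can use much deeper trees than a boosted ensemble without overfitting as badly.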

<b>Task:</b> Use the `predict()` method to test your model `rf_model` on the test set `X_test`. Save the result to the variable ``y_rf_pred``. Evaluate the results by computing the RMSE and R2 score in the same manner as you did above. Save the results to the variables `rf_rmse` and `rf_r2`.

Complete the code in the cell below to accomplish this.

In [None]:
# 1. Use the fitted model to make predictions on the test data
# YOUR CODE HERE


# 2. Compute the RMSE 
# YOUR CODE HERE


# 3. Compute the R2 score 
# YOUR CODE HERE


print('[RF] Root Mean Squared Error: {0}'.format(rf_rmse))
print('[RF] R2: {0}'.format(rf_r2))                 

## Part 8: Visualize and Compare Model Performance

The code cell below will plot the RMSE and R2 score for each regressor. 

<b>Task:</b> Complete the code in the cell below.

In [None]:
RMSE_Results = [stack_rmse, lr_rmse, dt_rmse, gbdt_rmse, rf_rmse]
R2_Results = [stack_r2, lr_r2, dt_r2, gbdt_r2, rf_r2]

rg= np.arange(5)
width = 0.35

# 1. Create bar plot with RMSE results
# YOUR CODE HERE


# 2. Create bar plot with R2 results
# YOUR CODE HERE



labels = ['Stacking','LR', 'DT', 'GBDT', 'RF']
plt.xticks(rg + width/2, labels)

plt.xlabel("Models")
plt.ylabel("RMSE/R2")


plt.ylim([0,1])
plt.title('Model Performance')
plt.legend(loc='upper left', ncol=2)
plt.show()
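A grouped bar chart of this kind can be sketched as follows. The scores and model names below are made-up placeholders, not results from this lab; the point is only to show how shifting the second series by one bar width places the bars side by side.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up placeholder scores for three hypothetical models (not lab results)
placeholder_rmse = [0.70, 0.65, 0.60]
placeholder_r2 = [0.50, 0.55, 0.62]
placeholder_labels = ['Model A', 'Model B', 'Model C']

pos = np.arange(len(placeholder_labels))
bar_width = 0.35

fig, ax = plt.subplots()

# Shift the second series by one bar width so the bars sit side by side
ax.bar(pos, placeholder_rmse, bar_width, label='RMSE')
ax.bar(pos + bar_width, placeholder_r2, bar_width, label='R2')

# Center each tick label between the two bars of its group
ax.set_xticks(pos + bar_width / 2)
ax.set_xticklabels(placeholder_labels)
ax.set_ylim([0, 1])
ax.legend(loc='upper left', ncol=2)
```

In a notebook the figure is displayed when the cell finishes; in a script, call `plt.show()` afterwards.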


<b>Analysis</b>: Compare and contrast the resulting $R^2$ and RMSE scores of the ensemble models and the individual models. Are the ensemble models performing better? Which is the best performing model? Explain.

<Double click this Markdown cell to make it editable, and record your findings here.>