Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# CSE204 - Introduction to Machine Learning - Lab Session 9: Decision Trees and Ensemble Methods

<img src="https://raw.githubusercontent.com/adimajo/CSE204-2021/master/data/logo.jpg" style="float: left; width: 15%" />

[CSE204-2021](https://moodle.polytechnique.fr/course/view.php?id=12838) Lab session #09

J.B. Scoggins - Adrien Ehrhardt

## Introduction

In this lab you will learn how to use decision trees and ensemble methods to build a model that predicts the average price of houses for neighborhoods in the US state of California. We will make heavy use of the following libraries, with which you are already familiar, to explore the dataset and build our models:

- [Pandas](https://pandas.pydata.org/) - python data analysis library
- [Scikit-learn](https://scikit-learn.org/stable/) - python machine learning library

Recall from the lecture that decision trees can be a powerful way to generate "cheap" (because fast to build) and efficient models for classification and regression. Some of the advantages of decision trees over other models include:

- They are easy to use and require little data preprocessing,
- They can be easily interpreted,
- They are useful for feature selection,
- They are fast to build and evaluate,
- They are non-linear.

On the other hand, some of the disadvantages include:

- Greedy tree-building algorithms are not guaranteed to find the optimal tree (finding it is NP-complete),
- The number of samples needed grows exponentially with tree depth (each extra level roughly halves the samples available per node),
- Trees are unstable: small differences in the data (e.g. a different subsample) can produce a totally different tree,
- Decision trees tend to overfit the data,
- Since each split only considers a single feature at a time, they have difficulty capturing additive structure.

Often, decision trees can be poor classifiers or regressors. However, because of their fast training speed, they can be used to generate ensemble models, such as random forests, with boosting and bagging.
The algorithm that is boosted or bagged is often called the **weak learner**, meaning that it would be pretty weak on its own, but ensembling many weak learners can lead to a **strong learner**.
We will play with some of these concepts during this lab in order to compare the resulting model to basic decision tree performance. Before we get started, let's import and load the different packages we will use.

**Crash course on ensemble methods**

Ensemble methods can roughly be divided into two categories: **bagging** and **boosting**.

**Bagging**, or bootstrap aggregation, is a method which yields $B$ bootstrap samples, *i.e.* $B$ new training datasets of $(x_i^{(b)}, y_i^{(b)})_1^n$ with $1 \leq b \leq B$, where the $n$ samples are drawn **with replacement** from the original dataset $(x_i, y_i)_1^n$. A **weak learner** (e.g. a decision tree) is learned on each bootstrap sample, yielding a model $\hat{f}^{(b)}$. For a new point $x$ for which we want to predict $y$, the estimate is given by:
$$\hat{f}(x) = \dfrac{1}{B} \sum_{b=1}^B \hat{f}^{(b)}(x).$$

In the case of decision trees, recall that $\hat{f}^{(b)}(x)$ is the mean (resp. the mode) of the leaf of the tree where the new point $x$ lands for regression (resp. classification). **Random Forests** are a generalization of **bagging** where the set of features $(X^j)_{j \in S_b}$ available for each decision tree is sampled **without replacement** (obviously!) from the original set of features $(X^j)_1^d$, generally keeping only a fraction of them (*i.e.* $|S_b| = k \ll d$). The combination of bootstrapping the samples and drawing the features yields a robust prediction (much less variance than a single decision tree) and better performance.
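
The bagging estimate above can be sketched by hand on synthetic data. This is only an illustration under stated assumptions: the noisy sine dataset, `B = 25` and the depth limit are arbitrary choices, not part of the lab.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy regression data (assumption: a noisy sine curve, not the housing data)
n = 200
X = rng.uniform(0, 6, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)

B = 25  # number of bootstrap samples (arbitrary)
trees = []
for b in range(B):
    # Draw n indices with replacement: this is one bootstrap sample
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeRegressor(max_depth=4, random_state=b)
    trees.append(tree.fit(X[idx], y[idx]))

# Bagged prediction: average of the B individual tree predictions
X_new = np.linspace(0, 6, 50).reshape(-1, 1)
all_preds = np.stack([t.predict(X_new) for t in trees])  # shape (B, 50)
bagged_pred = all_preds.mean(axis=0)
print(bagged_pred[:5])
```

Each tree sees a slightly different dataset, so the individual predictions differ; averaging them is what reduces the variance.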

**Boosting**

Contrary to **bagging**, which can be done in parallel (*i.e.* we fit $B$ independent models), in **boosting** we fit models sequentially. A first model (*e.g.* a decision tree) is fit on the original dataset. Individual errors are computed, *i.e.* $E(y_i, \hat{f}^{(0)}(x_i))$, and points in the training dataset are weighted according to this error (*i.e.* we progressively concentrate on the points for which we make the biggest errors). Iterating this process, we get:
$$\hat{F}^{(0)}(x) = \hat{f}^{(0)}(x),$$
$$\hat{F}^{(b)}(x) = \hat{F}^{(b-1)}(x) + \alpha_b \hat{f}^{(b)}(x),$$

for $1 \leq b \leq B$ where $\alpha_b$ is a coefficient to determine and $\hat{F}^{(B)}$ is our final predictor.
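
The additive update $\hat{F}^{(b)} = \hat{F}^{(b-1)} + \alpha_b \hat{f}^{(b)}$ can be illustrated with a minimal sketch. Note that, for simplicity, it fits each new tree to the current residuals with a fixed $\alpha$ (gradient boosting for the squared error), which is a simpler variant than the sample reweighting described above. The toy data, `alpha = 0.5` and `B = 20` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

alpha = 0.5  # fixed coefficient alpha_b (assumption: the same for every b)
B = 20

# F^(0): a first shallow tree fit on the original targets
f0 = DecisionTreeRegressor(max_depth=2).fit(X, y)
F = f0.predict(X)
train_mse = [np.mean((y - F) ** 2)]

for b in range(1, B + 1):
    residuals = y - F  # errors of the current ensemble
    f_b = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + alpha * f_b.predict(X)  # F^(b) = F^(b-1) + alpha * f^(b)
    train_mse.append(np.mean((y - F) ** 2))

print(train_mse[0], train_mse[-1])
```

The training error shrinks as trees are added; whether the test error follows is exactly what a held-out set is for.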

In [None]:
# Import required packages
from __future__ import print_function  # __future__ imports must precede all other statements
from IPython.display import display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeRegressor, export_graphviz, plot_tree
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor

import graphviz

# Setup pandas options
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

## Data exploration with Pandas

As we will be using the Pandas library, it is important you remember the core functionality from your previous labs.
If you want a quick Pandas refresher, you can follow a simple Pandas tutorial provided by the [Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/).

This lab will partially resemble [lab_session_01](https://adimajo.github.io/CSE204-2021/lab_session_01/lab_session_01.html), insofar as we will explore a dataset and fit decision trees (as well as ensemble methods by the end of the lab).

## Step 1: Get to know your dataset

In this lab we will make use of the [California housing price dataset](https://developers.google.com/machine-learning/crash-course/california-housing-data-description) (this will be useful later on). This data was compiled from the 1990 census taken in California.  You can download the dataset with the Pandas code below.

In [None]:
# Get the housing data
housing_data = pd.read_csv(
    "https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")

As you are probably well aware at this point, one of the most important aspects of machine learning and data science is to first understand your dataset. Whenever you are introduced to a new dataset, it is crucial to learn everything you can about the data to help your model building strategy. The Pandas library is very useful for this purpose.
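
As a quick reminder of what Pandas offers here, the sketch below summarizes a tiny made-up data frame. The column names and values are hypothetical and unrelated to the housing data.

```python
import pandas as pd

# Toy data frame (hypothetical columns, not the housing data)
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = toy.describe()
print(summary)

# Rows are samples, columns are variables
n_samples, n_columns = toy.shape
print(n_samples, n_columns)
```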

**Exercise 1.1:** Use the `describe()` method to print information about the `housing_data` data frame and answer the following questions:

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

1. How many features are in the dataset?

In [None]:
# number_of_features = ...  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()

2. How many samples are there?

In [None]:
# number_of_samples = ...  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()

3. Do the means and standard deviations make sense for what you expect each variable to be?
  - Are the latitude and longitude values consistent with California? (don't be afraid to check with Google maps - use either Markdown or code cell).

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

  - What do you think the units of `total_rooms` and `total_bedrooms` are?

YOUR ANSWER HERE

  - What about the `median_income`?

YOUR ANSWER HERE

4. Which feature(s) are we likely to want to predict (using all the others)? Put its name (as a `str`) in `feature_to_predict`.

In [None]:
# feature_to_predict = ...
# YOUR CODE HERE
raise NotImplementedError()

5. Can you detect if there are any outliers in the data based on the minimum and maximum values?

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Exercise 1.2** From the previous question, it seems logical to think that the `total_rooms` and `total_bedrooms` are for an entire block, not a single house.  Let's check this assumption by creating two new variables called `rooms_per_person` and `bedrooms_per_person`.

In [None]:
# housing_data["rooms_per_person"] = ...  # <- TO UNCOMMENT AND COMPLETE
# housing_data["bedrooms_per_person"] = ...  # <- TO UNCOMMENT AND COMPLETE

# YOUR CODE HERE
raise NotImplementedError()

**Exercise 1.3:** Get a visual feel for the data by drawing a scatter plot (e.g. with `plt.scatter`) of latitude versus longitude, using color for the median house value.  Compare with a map of California such as [this one](https://www.google.com/search?q=california+maps&client=firefox-b-d&tbm=isch&source=iu&ictx=1&fir=KnETshNcnsi1VM%253A%252CMK2MjhZw7xRERM%252C_&vet=1&usg=AI4_-kSz1S_ut8rli9wcyp0A12LG1aVofg&sa=X&ved=2ahUKEwju59vrxsDhAhXqyIUKHZAkBBgQ9QEwBXoECAgQDg#imgrc=KnETshNcnsi1VM:).  (Hint: normalize the median house value by the maximum.)

- Does this make sense?
- Where are the most expensive homes?
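
The mechanics of coloring a scatter plot by a third quantity can be sketched on random points. Everything here (the coordinates, the `value` array, the colormap) is a made-up assumption, not the housing data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=300)
y = rng.uniform(0, 1, size=300)
value = x + y  # hypothetical quantity to display as color

fig, ax = plt.subplots()
# c= takes one number per point; dividing by the maximum maps it into (0, 1]
points = ax.scatter(x, y, c=value / value.max(), cmap="viridis", s=10)
fig.colorbar(points, ax=ax, label="value / max(value)")
ax.set_xlabel("x")
ax.set_ylabel("y")
```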

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

**Exercise 1.4:** Before building a model, let's build some expectation for what the important features are (just like we did for [lab_session_01](https://adimajo.github.io/CSE204-2021/lab_session_01/lab_session_01.html)). This will help us interpret our decision trees later on. Use the `corr()` method from Pandas to build a correlation matrix.  Recall that a correlation matrix tells us how correlated any two features are. Values close to 1.0 indicate a strong positive correlation between the two features. Likewise, values close to -1.0 mean that the features are strongly negatively correlated. Values close to 0 indicate no linear correlation.

Plot a visual representation of the correlation matrix using the `imshow()` method from matplotlib.

List which features (or groups of features) are strongly correlated (positive or negative).
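
As an illustration of what to look for, here is a minimal sketch on synthetic columns built to be positively correlated, negatively correlated, or unrelated. The column names `a` to `d` and the colormap are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),  # moves with a
    "c": -a + rng.normal(scale=0.1, size=500),     # moves against a
    "d": rng.normal(size=500),                     # independent noise
})

corr = toy.corr()  # Pearson correlation by default
print(corr)

fig, ax = plt.subplots()
# Fix the color scale to [-1, 1] so that 0 sits in the middle of the colormap
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
```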

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

**Exercise 1.5:** Use scatter plots to visualize each pair of correlated variables.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Exercise 1.6:** Notice anything strange about the scatter plot with `median_house_value`? What's going on here?  Use a [histogram](https://pandas.pydata.org/docs/reference/api/pandas.Series.hist.html) to get an idea of the distribution of this feature.

- What does the histogram tell us about the distribution?
- What could this mean about how the data was collected or processed to build the database?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Step 2: Train a simple decision tree to predict median house values

At this point we have a pretty good idea about our dataset. Let's try and create a simple model based on a decision tree.

**Exercise 2.1:** Implement the `custom_train_test_split` function below to split the dataset into a training and a validation set. Use the [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from Scikit-learn and forward the `shuffle` argument of `custom_train_test_split` to it (we first call it with `shuffle=False`). Check that the reduced datasets make sense with the `describe()` function.

In [None]:
def custom_train_test_split(hs_data: pd.DataFrame = housing_data,
                            feature_to_predict: str = feature_to_predict,
                            test_size: float = 0.3,
                            shuffle: bool = False):
    """
    Wrapper around sklearn's train_test_split

    :param pd.DataFrame hs_data: our dataset (default: housing_data)
    :param str feature_to_predict: the feature to predict (default: feature_to_predict defined earlier)
    :param float test_size: proportion of samples in the test data (default: 0.3)
    :param bool shuffle: passed to sklearn => whether to shuffle the data before splitting; if False, the last test_size fraction of the rows becomes the test set
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
training_features, testing_features, training_target, testing_target = custom_train_test_split(shuffle=False)
training_features.describe()

**Exercise 2.2:** As a second check, redo the scatter plot from Exercise 1.3 for both the training and testing subsets.  

- Is this what you expected?  
- What went wrong?

Copy / paste your scatter plot from 1.3 below and run it on unshuffled data.

In [None]:
# Your plot on unshuffled data here
# YOUR CODE HERE
raise NotImplementedError()

This is an important lesson.  In all of our dataset checking from Exercise 1, we didn't get a feel for the ordering of the data.  Ideally, we want our validation and training data to be identically distributed; however, we did not take into account that the dataset has a clear ordering.  Normally, `train_test_split()` shuffles the data by default to avoid such issues, but we have turned this off explicitly.
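
The effect of the `shuffle` flag is easy to see on a tiny sorted array. This is a sketch with made-up data; `random_state=0` is only there to make the shuffled split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Twelve samples sorted by value (assumption: stands in for an ordered dataset)
X = np.arange(12).reshape(-1, 1)
y = np.arange(12)

# Without shuffling, the test set is simply the tail of the data
_, X_test_ordered, _, _ = train_test_split(X, y, test_size=0.25, shuffle=False)
print(X_test_ordered.ravel())  # -> [ 9 10 11]

# With shuffling, test samples are drawn from the whole range
_, X_test_shuffled, _, _ = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0)
print(sorted(X_test_shuffled.ravel()))
```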

Call `custom_train_test_split` again with `shuffle=True` **and redo the scatter plot**. Have we been able to solve this problem?

In [None]:
training_features, testing_features, training_target, testing_target = custom_train_test_split(shuffle=True)

In [None]:
# Your plot here
# YOUR CODE HERE
raise NotImplementedError()

**Exercise 2.3:** Use the `DecisionTreeRegressor` from Scikit-learn to create a new decision tree model with a maximum depth of 3 and the squared error as splitting criterion, and train it on the training dataset using the `fit()` method.

*Hint*: You can find the API for DecisionTreeRegressor [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).
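
For reference, the general fit / predict / score pattern looks like the sketch below, shown on synthetic data with a different depth than the one asked for. The squared error is the default splitting criterion, so it is not passed explicitly; the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth bounds the number of successive splits from root to leaf
toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
toy_tree.fit(X_train, y_train)

toy_pred = toy_tree.predict(X_test)
toy_mse = mean_squared_error(y_test, toy_pred)  # arguments are (y_true, y_pred)
print(toy_tree.get_depth(), toy_mse)
```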

In [None]:
# model = ...
# YOUR CODE HERE
raise NotImplementedError()

**Exercise 2.4:** Let's get a sense of how well the tree fits the data.

* Use the tree's `predict` method on the testing features to get a prediction for the testing dataset.

In [None]:
# prediction = ...  # <- TO UNCOMMENT AND COMPLETE

# YOUR CODE HERE
raise NotImplementedError()

* Compute the mean squared error using the `mean_squared_error()` function from Scikit-learn and assign it to `MSE`.

In [None]:
# MSE = ...  # <- TO UNCOMMENT AND COMPLETE

# YOUR CODE HERE
raise NotImplementedError()
print(MSE)

* Plot a scatter plot of the predicted median house values versus the testing target values. 

In [None]:
# plt...  # <- TO UNCOMMENT AND COMPLETE

# YOUR CODE HERE
raise NotImplementedError()

- Is the prediction good?
- What does the scatter plot look like?
- How many unique values are there in our predictions? Why is this?

YOUR ANSWER HERE

**Exercise 2.5** You can visualize the trained decision tree with the following code. As stated in the beginning, one advantage of such trees is that they are easily interpretable with graphs like this one. If you did not manage to install graphviz, the following cell will raise an error. You may leave it as is and use the subsequent cell.

In [None]:
# Note, you may need to change 'model' to your tree's name
dot_data = export_graphviz(
    model, out_file=None, feature_names=list(training_features), filled=True, 
    rounded=True, special_characters=True)
graph = graphviz.Source(dot_data) 
graph

In [None]:
fig, axes = plt.subplots(figsize=(2, 2), dpi=600)
# class_names only applies to classifiers, so it is omitted for this regressor
plot_tree(model,
          feature_names=list(training_features),
          filled=True,
          ax=axes);

- What are the main features used in the tree?

YOUR ANSWER HERE

- How do these compare to the features correlated to `median_house_value` you identified in Exercise 1.4?

YOUR ANSWER HERE

**Exercise 2.6** Repeat 2.3 and 2.4 **below** *for different values of the maximum tree-depth*.

- How does the fit improve with increasing depth?

Note that you can try visualizing the tree with larger depth values, but it becomes difficult as the tree grows.
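
When sweeping the depth, it helps to track the training error next to the test error. The sketch below does this on synthetic data (an assumption, as are the depth range and the noise level) to show the typical pattern.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

depths = range(1, 13)
train_mse, test_mse = [], []
for d in depths:
    tree = DecisionTreeRegressor(max_depth=d, random_state=0).fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, tree.predict(X_train)))
    test_mse.append(mean_squared_error(y_test, tree.predict(X_test)))

for d, tr, te in zip(depths, train_mse, test_mse):
    print(f"depth={d:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

The training error can only go down as the tree grows, whereas the test error eventually stops improving: this gap is the overfitting mentioned in the introduction.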

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

## Step 3: Ensemble methods

We have seen that a single decision tree is not a very good estimator.  We can improve our predictions using a number of ensemble techniques such as bagging, random forests, and boosting.  Using Scikit-learn, it is easy to try all of these models.  Here are the main classes that we will test:

- [BaggingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor)
- [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- [ExtraTreesRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)
- [AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html)

Follow each link to read about the details of each method. In the following exercises you will train each type of regressor using a `max_depth` of 10 and measure their prediction error. Finally, you can perform a hyperparameter study to try and find the optimal model/parameter pair which minimizes the MSE on the testing dataset.
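
To give an idea of how uniform the Scikit-learn interface is across these classes, here is a sketch on a synthetic regression problem. The dataset (`make_friedman1`), the depth of 4 and the 30 estimators are arbitrary assumptions, deliberately different from the values asked for in the exercises.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              ExtraTreesRegressor, RandomForestRegressor)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

depth = 4  # arbitrary
toy_models = {
    "single tree": DecisionTreeRegressor(max_depth=depth, random_state=0),
    # The base tree is passed positionally because the name of this
    # keyword argument differs between scikit-learn releases
    "bagging": BaggingRegressor(DecisionTreeRegressor(max_depth=depth),
                                n_estimators=30, random_state=0),
    "random forest": RandomForestRegressor(max_depth=depth, n_estimators=30,
                                           random_state=0),
    "extra trees": ExtraTreesRegressor(max_depth=depth, n_estimators=30,
                                       random_state=0),
    "adaboost": AdaBoostRegressor(DecisionTreeRegressor(max_depth=depth),
                                  n_estimators=30, random_state=0),
}

toy_errors = {}
for name, m in toy_models.items():
    m.fit(X_train, y_train)
    toy_errors[name] = mean_squared_error(y_test, m.predict(X_test))
    print(f"{name}: {toy_errors[name]:.2f}")
```

Bagging and boosting wrap an explicit base estimator, while the forest variants build their own trees and take `max_depth` directly.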

In [None]:
def fit_and_get_test_error(model,
                           training_features: np.array = training_features,
                           training_target: np.array = training_target,
                           testing_features: np.array = testing_features,
                           testing_target: np.array = testing_target,
                           task: str = "regression"):
    # Fit on the training set, then score on the test set
    model.fit(training_features, training_target)
    prediction = model.predict(testing_features)
    if task == "regression":
        # scikit-learn metrics expect (y_true, y_pred)
        return mean_squared_error(testing_target, prediction)
    else:
        return accuracy_score(testing_target, prediction)

**Exercise 3.1:** Create and fit a bagging regressor based on decision trees with a max depth of 10 using 100 trees, and maximum samples and features of 50%.  Compute the MSE on the testing dataset and compare the true median house values with the predictions using a scatter plot.

In [None]:
# model = ...  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()
bagging_prediction = fit_and_get_test_error(model)
print(bagging_prediction)
plt.scatter(testing_target, model.predict(testing_features));

**Exercise 3.2:** Create and fit a random forest regressor with a max depth of 10.  Compute the MSE on the testing dataset and compare the true median house values with the predictions using a scatter plot.

In [None]:
# model = ...  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()
rf_prediction = fit_and_get_test_error(model)
print(rf_prediction)
plt.scatter(testing_target, model.predict(testing_features));

**Exercise 3.3:** Create and fit an extremely randomized trees regressor (ExtraTreesRegressor) with a max depth of 10.  Compute the MSE on the testing dataset and compare the true median house values with the predictions using a scatter plot.

In [None]:
# model = ...  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()
xtra_prediction = fit_and_get_test_error(model)
print(xtra_prediction)
plt.scatter(testing_target, model.predict(testing_features));

**Exercise 3.4:** Create and fit a tree ensemble regressor with boosting (AdaBoostRegressor) with a max depth of 10.  Compute the MSE on the testing dataset and compare the true median house values with the predictions using a scatter plot.

In [None]:
# model = ...  # <- TO UNCOMMENT AND COMPLETE
# YOUR CODE HERE
raise NotImplementedError()
ada_prediction = fit_and_get_test_error(model)
print(ada_prediction)
plt.scatter(testing_target, model.predict(testing_features));

**Exercise 3.5:** Plot the MSE for each of the four ensemble methods above versus the maximum depth used (from 2 to 20).

- At what maximum depth do the different methods more or less converge?
- Which is the best method in terms of minimum MSE obtained?

In [None]:
# Generate MSE data
min_depth = 2
max_depth = 20
estimators = 100
mse = np.zeros((max_depth-min_depth+1, 4))

for i, D in enumerate(range(min_depth, max_depth + 1)):
    # model = ...  # <- TO UNCOMMENT AND COMPLETE
    # mse[i, ...] = fit_and_get_test_error(model)  # <- TO UNCOMMENT AND COMPLETE
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# Plot MSE curves for each model on a single plot
for m in range(4):
    plt.plot(range(min_depth, max_depth + 1), mse[:,m])
plt.legend(['Bagging', 'Random Forest', 'Extremely Randomized Trees', 'Boosted'])
plt.show()

YOUR ANSWER HERE

## Conclusions

That's it for this lab!  Note there are still several ways you can try and improve the regression.  It might even be worth trying to use an entirely different model such as a neural network to see if you can obtain better performance.  Over time, you will build up an intuition for which models might work best on different datasets.  

If you want more practice with trees and ensemble methods, check out the Covertype dataset below, which is a classification problem.  As a bonus, you can reuse everything you learned here by replacing "Regressor" with "Classifier".  Happy coding!
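
The switch from regression to classification can be sketched as follows on a small synthetic problem. The dataset generated with `make_classification` and all hyperparameters are assumptions chosen to run quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem (assumption: a stand-in for a real dataset)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

toy_classifiers = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("forest", RandomForestClassifier(max_depth=5, n_estimators=50,
                                      random_state=0)),
]

toy_accuracy = {}
for name, clf in toy_classifiers:
    clf.fit(X_train, y_train)
    # Accuracy replaces the MSE as the score: higher is better
    toy_accuracy[name] = accuracy_score(y_test, clf.predict(X_test))
print(toy_accuracy)
```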

In [None]:
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
covertype = datasets.fetch_covtype()

In [None]:
print(covertype.DESCR)

In [None]:
covertype_df = pd.DataFrame(covertype.data)
covertype_df['cover'] = covertype.target

In [None]:
training_features, testing_features, training_target, testing_target = custom_train_test_split(
    hs_data=covertype_df,
    feature_to_predict='cover',
    shuffle=True)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()