<a href="https://colab.research.google.com/github/acmucsd-projects/AI-Tutorial-Resources/blob/main/3%20%7C%20Trees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Trees**

Contributors: Katie, Nolan



>[Trees](#scrollTo=wsJ6RPtF-Cx2)

>>[What is a Decision Tree? 🌲](#scrollTo=lLwFDSMgHTW5)

>>[Classification vs. Regression](#scrollTo=st73yMAujg-c)

>>>[Classification](#scrollTo=eyiXq2gHE86D)

>>>[Regression (and Encoding)](#scrollTo=nxy50DpMlODL)

>>[Hyperparameters](#scrollTo=eumrl3UzjbwP)

>>>[Criterion](#scrollTo=eumrl3UzjbwP)

>>>[Min Samples Split and Max Depth](#scrollTo=qENYOQ52tVE4)

>>[Random Forests 🌲🌲🌲](#scrollTo=Q3JEjIk__eL2)

>>>[Regression](#scrollTo=1_5fuqXRKqUk)

>>>[Classification (and Accuracy)](#scrollTo=_eDmcuJGI-Eq)

>>>[Grid Searching](#scrollTo=Q18ekTUN2XSz)

>>>[XGBoost](#scrollTo=Wd_dEknvibHx)

>>[The Confusion Matrix](#scrollTo=cvYFWC-IcHul)

>>>[More Metrics](#scrollTo=4C68SKBLcOoi)



## **What is a Decision Tree? 🌲**

A decision tree is a **model that splits data into smaller groups** by asking a series of yes/no questions. The model chooses these questions based on the **features** (i.e., columns) of the data.

> The goal of a decision tree is to **make accurate decisions or predictions about new data**, based on its knowledge from current data!

A decision tree is a **non-parametric supervised learning** method.
- **non-parametric**: The number of parameters in the model is not fixed in advance.
  - It learns its structure (e.g., which feature to split, threshold value of split, how many splits to make, etc.) through training.
  - Simple linear regression (one feature), a parametric method, always has exactly 2 parameters: slope and intercept.
- **supervised learning**: The training data must include the actual values of what we want to predict.
  - To predict if someone has diabetes based on their glucose levels and BMI, we need data on other patients' glucose levels, BMI, and *whether they actually have diabetes or not*.
  - The model is only as good as the data we give it!

As the name suggests, this tree-like structure looks like this:

<img src='https://waz.smartdraw.com/decision-tree/img/structure-of-a-decision-tree.png?bn=15100111939' width=500>

- Root Node: The starting point of the tree! Asks the first question to split the data.
- Decision Node: Branched from the root node or another decision node. Asks another question to split the data.
- Leaf Node: An end point of the tree. Makes a final decision or prediction!
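
To make the structure concrete, here is a minimal hand-written sketch of a tiny tree as nested yes/no questions. The function name `choose_top` and the thresholds (75ºF, 50%) are made up for illustration; a real model learns its own questions from the data.

```python
def choose_top(temperature_f, chance_of_rain):
    """A hand-written 'decision tree' with illustrative, made-up thresholds."""
    # Root node: the first yes/no question
    if temperature_f <= 75:
        # Decision node: a follow-up question
        if chance_of_rain <= 50:
            return 'long sleeve'  # leaf node
        return 'raincoat'         # leaf node
    return 't-shirt'              # leaf node

print(choose_top(80, 0))
print(choose_top(60, 90))
```

Each `if` plays the role of a root or decision node, and each `return` plays the role of a leaf node.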

## **Classification vs. Regression**

A decision tree can be used to predict both categorical and numerical values!

> Scikit-learn (`sklearn`), a popular machine learning library for Python, has both classification and regression models for decision trees.

Classification: predicts category
- Predict diabetes (yes, no)
- Predict letter grade (A, B, C, D, F)
- in `sklearn`: `DecisionTreeClassifier()`
([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html))

Regression: predicts number(s)
- Predict age
- Predict tomorrow's temperature and humidity
- in `sklearn`: `DecisionTreeRegressor()` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html))

### **Classification**

We'll start with a classification example.

Let's say you want to decide what top to wear today based on the weather. You have some data from 20 other days that include the temperature, chance of rain, and what you wore that day.

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Create data (feel free to change any of the values)
weather = pd.DataFrame(columns=['Temperature (ºF)', 'Chance of Rain (%)', 'Top'],
                       data=[[80, 0, 't-shirt'],
                             [70, 10, 'long sleeve'],
                             [80, 10, 't-shirt'],
                             [50, 100, 'raincoat'],
                             [60, 70, 'raincoat'],
                             [90, 0, 't-shirt'],
                             [70, 20, 'jacket'],
                             [80, 0, 't-shirt'],
                             [60, 20, 'long sleeve'],
                             [90, 10, 't-shirt'],
                             [70, 0, 'jacket'],
                             [70, 10, 'long sleeve'],
                             [80, 0, 't-shirt'],
                             [50, 0, 'jacket'],
                             [90, 20, 't-shirt'],
                             [70, 80, 'raincoat'],
                             [80, 0, 't-shirt'],
                             [60, 10, 'long sleeve'],
                             [70, 0, 'long sleeve'],
                             [60, 30, 'jacket']])
weather.head()

Before we create our model, it is good practice to first split the data into two parts: **training** and **testing**.

> The model will only learn from the training data, and we will evaluate its performance on the testing data it has not seen before. Since we have the actual top worn in the test set as well, we can determine how accurate the model truly is!

To visualize this, let's use `train_test_split` from `sklearn`'s `model_selection` module. ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html))
- Takes in `n` DataFrames and splits them into `n` training and `n` testing sets.
- For our purposes, we'll split `weather` into 2 DataFrames where
  - the first one contains all the features the model will learn from (temperature, chance of rain) and
  - the second one contains the feature the model will predict (top).
- We'll call these X and y.

In [None]:
from sklearn.model_selection import train_test_split
from IPython.display import display # just for visual purposes

In [None]:
X = weather[['Temperature (ºF)', 'Chance of Rain (%)']]
y = weather[['Top']]

# Randomly splits the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# Notice that the row indices in each set are the same
display(X_train, y_train)

In [None]:
# Notice that the row indices in each set are the same
display(X_test, y_test)

Now we're ready to create and train our model!

In [None]:
# Creates model with default hyperparameters
dt_class = DecisionTreeClassifier()

# Trains model on only the training data
dt_class.fit(X_train, y_train)

This doesn't show us much... We can visualize our tree using `sklearn`'s `plot_tree`. ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html))

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt # just for visual purposes

In [None]:
plt.figure(figsize=(8, 8))
plot_tree(dt_class,
          feature_names=X_train.columns, # must be in same order as data
          class_names=['jacket', 'long sleeve', 'raincoat', 't-shirt'], # must be in alphabetical order
          impurity=False
          )
plt.show()

The first line of each decision node is the question that determines how to split the data. The root node splits the data by `Temperature (ºF) <= 75.0`. Left arrows always mean `True` and right arrows always mean `False`.

`samples` is the number of rows that reached that node. In the rightmost node, 6 rows *do not* have a temperature <= 75, and are classified as t-shirts.

`value` shows the distribution of counts at that node. The root node is [2, 5, 2, 6], which equates to 2 jackets, 5 long sleeves, 2 raincoats, and 6 t-shirts.

`class` is the majority class at that node.
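
Rather than typing `class_names` by hand, one option is to read the label order from the fitted model's `classes_` attribute, which stores the class labels in sorted order. A minimal sketch on a tiny made-up dataset (the names `toy_X`, `toy_y`, and `toy_tree` are illustrative, not part of the notebook):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset, just to show the label order
toy_X = pd.DataFrame({'Temperature (ºF)': [80, 70, 50, 60, 90, 65],
                      'Chance of Rain (%)': [0, 10, 100, 20, 0, 80]})
toy_y = ['t-shirt', 'long sleeve', 'raincoat', 'jacket', 't-shirt', 'raincoat']

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# classes_ holds the labels in sorted order, the same order used by `value`
class_names = list(toy_tree.classes_)
print(class_names)
```

Passing `class_names=list(model.classes_)` to `plot_tree` avoids mistakes if a label is added or renamed later.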

### **Regression (and Encoding)**

Let's try making a decision tree using the same `weather` DataFrame. This time, we will use the `DecisionTreeRegressor()` to predict `Temperature (ºF)`.



In [None]:
X = weather[['Chance of Rain (%)', 'Top']]
y = weather[['Temperature (ºF)']]

One issue is that the `Top` column is categorical. A decision tree cannot use categorical data because it uses mathematical comparisons (<, >, =) to make these splits. (What is `Top <= 'jacket'`?)

Decision Trees, and most machine learning models, can only use **numerical features** to help them make predictions. To combat this, we can use an **encoding technique**, or a method to convert categorical variables into numerical ones!

- `X`: only numerical variables
- `y`: either numerical or categorical variables

One such technique is **One Hot Encoding**, from `sklearn`'s `preprocessing` module. ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html))

In [None]:
from sklearn.preprocessing import OneHotEncoder

For a categorical column `C`, one-hot encoding creates a separate column for each unique value in `C`. Each row gets a 1 in the column that matches its value, and 0 in all the others. Check out the example below.

In [None]:
# Create example categorical column
top_only = pd.DataFrame(columns=['Top'],
                        data=[['t-shirt'], ['jacket'], ['long sleeve'], ['raincoat']])
top_only

In [None]:
# Create One Hot Encoder
one_hot = OneHotEncoder(sparse_output=False)

# fit: stores mapping of the data ('jacket' = [1 0 0 0], 't-shirt' = [0 0 0 1], etc.)
# transform: applies the mapping to the data (returns a 2D array)
top_encoded = one_hot.fit_transform(top_only[['Top']]) # DataFrame with only categorical column(s)

# Turn 2D array into DataFrame
pd.DataFrame(data=top_encoded,
             columns=one_hot.get_feature_names_out(['Top'])) # Column name(s), in a list

Now it's your turn! One Hot Encode the `Top` column in the `weather` DataFrame. Make sure you add the other columns back too.

In [None]:
# For your convenience, the weather DataFrame again
weather.head()

In [None]:
# One Hot Encode the `Top` column

# TODO: Create One Hot Encoder and set sparse_output=False
one_hot = ...

# TODO: Use the encoder to fit and transform the `Top` column of the `weather` DataFrame
weather_encoded = ...

# TODO: Turn 2D array into DataFrame
encoded_df = ...

# TODO: Concat encoded_df with `Temperature (ºF)` and `Chance of Rain (%)` columns from `weather`
encoded_df = ...
encoded_df

Use the `DecisionTreeRegressor()` and our newly encoded `encoded_df` DataFrame to predict `Temperature (ºF)`.

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
### TODO: Define X and y and complete a train test split

X = ... # which columns?
y = ... # which columns?

X_train, X_test, y_train, y_test = ..., ..., ..., ...

In [None]:
### TODO: Create and train a decision tree regression model

dt_reg = ...
...

In [None]:
# Visualize tree! Uncomment when you're done

# plt.figure(figsize=(10, 8))
# plot_tree(dt_reg,
#           feature_names=X_train.columns, # must be in same order as data
#           )
# plt.show()

In [None]:
#@title **Answer**
# One Hot Encode the `Top` column

# TODO: Create One Hot Encoder and set sparse_output=False
one_hot = OneHotEncoder(sparse_output=False)

# TODO: Use the encoder to fit and transform the `weather` DataFrame
weather_encoded = one_hot.fit_transform(weather[['Top']])

# TODO: Turn 2D array into DataFrame
encoded_df = pd.DataFrame(data=weather_encoded,
                          columns=one_hot.get_feature_names_out(['Top']))

# TODO: Concat encoded_df with `Chance of Rain (%)` and `Temperature (ºF)` columns from `weather`
encoded_df = pd.concat([encoded_df, weather[['Chance of Rain (%)', 'Temperature (ºF)']]], axis=1)

### TODO: Define X and y and complete a train test split
X = encoded_df.drop(columns=['Temperature (ºF)']) # which columns?
y = encoded_df['Temperature (ºF)'] # which columns?
X_train, X_test, y_train, y_test = train_test_split(X, y)

### TODO: Create and train decision tree regression model
dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train, y_train)

# Visualize tree! Uncomment when you're done
plt.figure(figsize=(12, 8))
plot_tree(dt_reg,
          feature_names=X_train.columns, # must be in same order as data
          )
plt.show()

If you know the game *20 Questions*, you might have noticed that decision trees follow a similar logic. They ask the "best" questions to help narrow down the possible answers until only one is left!

> But how does a decision tree know which feature and split is the "best"???


## **Hyperparameters**

The `DecisionTreeClassifier()` and `DecisionTreeRegressor()` have many **hyperparameters**, such as `criterion`, `min_samples_split`, `max_depth`, etc. These are different from *parameters* because we get to choose them, rather than the model.

### **Criterion**

The `criterion` hyperparameter **controls how the tree decides to split the data**. Each criterion is a mathematical formula that scores candidate splits, and the tree picks the split with the best score!

Classifiers and Regressors have different criteria. Some examples include:
- For classification:
  - [Gini impurity](https://www.learndatasci.com/glossary/gini-impurity/): `criterion = 'gini'` (default)
  - [Entropy](https://www.geeksforgeeks.org/how-to-calculate-entropy-in-decision-tree/): `criterion = 'entropy'`

- For regression:
  - [Mean squared error](https://www.geeksforgeeks.org/retrieving-node-mse-in-decisiontreeregressor/): `criterion = 'squared_error'` (default)
  - [Mean absolute error](https://www.geeksforgeeks.org/how-to-calculate-mean-absolute-error-in-python/): `criterion = 'absolute_error'`

We won't go in detail about them in this notebook, but feel free to read more about them by clicking on each link.
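
As a small worked example, the two classification criteria can be computed by hand for the root node from earlier, whose class counts were [2, 5, 2, 6]. This is a sketch of the standard formulas (Gini impurity $1 - \sum p_i^2$ and entropy $-\sum p_i \log_2 p_i$), not `sklearn`'s internal code.

```python
import numpy as np

# Class counts at the root node: [jacket, long sleeve, raincoat, t-shirt]
counts = np.array([2, 5, 2, 6])
p = counts / counts.sum()  # proportion of each class

gini = 1 - np.sum(p ** 2)          # Gini impurity
entropy = -np.sum(p * np.log2(p))  # entropy in bits (all counts here are nonzero)

print(f'Gini impurity: {gini:.3f}')
print(f'Entropy: {entropy:.3f}')
```

Both values are 0 for a node that contains a single class, and grow as the classes become more mixed.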

Since our regression model from earlier was created using the default hyperparameters, its **criterion** is **mean squared error**. Let's check out the predictions! (Make sure to run the cells in the Answer section if you didn't fill out the code!)

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
# The .predict() method generates predictions based on the trained model
y_train_pred = dt_reg.predict(X_train)
y_test_pred = dt_reg.predict(X_test)

# Test the MSE of the training set
train_mse = mean_squared_error(y_train, y_train_pred)

# Test the MSE of the testing set
test_mse = mean_squared_error(y_test, y_test_pred)

print('Training Set MSE: ', train_mse)
print('Testing Set MSE: ', test_mse)

What do these mean? MSE is the **average squared difference** between the actual and predicted temperatures. So a training MSE of ~24 doesn't mean that the model was 24º off. Taking the square root puts the error back in degrees: $\sqrt{24}$ ~5º! Similarly, a testing MSE of ~49 corresponds to $\sqrt{49}$ = 7º. (The split is random, so your exact numbers may differ.)
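
To see the arithmetic on a small scale, here is a sketch with four made-up temperatures. The arrays `actual` and `predicted` are invented for illustration and are not from the `weather` data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([70, 80, 60, 90])     # made-up temperatures
predicted = np.array([75, 80, 50, 85])  # made-up predictions

# Squared errors are 25, 0, 100, 25, so the mean is 150 / 4
mse = mean_squared_error(actual, predicted)
rmse = mse ** 0.5  # square root puts the error back in ºF

print('MSE: ', mse)
print('RMSE:', round(rmse, 2))
```

Notice that the single 10º miss contributes most of the MSE, because squaring punishes large errors more heavily.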

Let's visualize the predictions vs. actual temperatures.

In [None]:
predictions = pd.DataFrame({'actual temp': y_train})
predictions['predicted temp'] = y_train_pred
predictions

Pretty good! But remember we trained the model with this data. We will see the model's true performance by looking at the test set's predictions.

In [None]:
predictions = pd.DataFrame({'actual temp': y_test})
predictions['predicted temp'] = y_test_pred
predictions

Still pretty good, but we can see larger differences (60 vs. 70).

### **Min Samples Split and Max Depth**

Ultimately, we want to use hyperparameters to make sure the training and testing MSE (or other criterion) are both low in value and similar to one another. Decision trees tend to **overfit** the training data, which means that the **model might work really well for the data it was trained on, but not as effectively for unseen data**. This is because decision trees, by default, keep splitting the data until every specific detail of the training data is found, rather than finding general patterns.
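
Overfitting is easy to reproduce on synthetic data. The sketch below uses a made-up noisy sine curve (the names `X_demo`, `y_demo`, and `deep_tree` are illustrative): a tree with default hyperparameters memorizes the training rows, so its training error is essentially zero while its testing error is not.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))               # one made-up feature
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.5, 200)  # signal plus noise

Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# Default hyperparameters: the tree keeps splitting until it cannot split further
deep_tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)

deep_train_mse = mean_squared_error(yd_train, deep_tree.predict(Xd_train))
deep_test_mse = mean_squared_error(yd_test, deep_tree.predict(Xd_test))
print('Training Set MSE: ', deep_train_mse)
print('Testing Set MSE: ', deep_test_mse)
```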

The `min_samples_split` and `max_depth` hyperparameters can help reduce overfitting by limiting when the tree can split.
- `min_samples_split` (int): sets the minimum number of samples required to split a node (default=`2`)
- `max_depth` (int): sets the maximum number of levels a tree can split from root to leaf (default=`None`)

Let's try visualizing a new `DecisionTreeRegressor()` with a `min_samples_split` of 5 and `max_depth` of 3.

In [None]:
dt_new = DecisionTreeRegressor(min_samples_split=5, max_depth=3)
dt_new.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(12, 8))
plot_tree(dt_new,
          feature_names=X_train.columns, # must be in same order as data
          )
plt.show()

As you can see, no node with fewer than 5 samples was split, and the entire tree has a maximum depth of 3.
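
The same check can be done in code instead of by eye. This sketch fits a tree on made-up data (`X_check`, `y_check`, and `small_tree` are illustrative names) and inspects the fitted `tree_` attribute, where a node with no left child (`-1`) is a leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_check = rng.uniform(0, 100, size=(60, 2))           # made-up features
y_check = X_check[:, 0] * 0.5 + rng.normal(0, 5, 60)  # made-up target

small_tree = DecisionTreeRegressor(min_samples_split=5, max_depth=3, random_state=0)
small_tree.fit(X_check, y_check)

# A node was split if it has a left child; leaves are marked with -1
is_split_node = small_tree.tree_.children_left != -1
samples_at_split_nodes = small_tree.tree_.n_node_samples[is_split_node]

print('Depth:', small_tree.get_depth())
print('Smallest node that was split:', samples_at_split_nodes.min())
```

Note that `min_samples_split` limits which nodes may be split, so a leaf can still end up with fewer than 5 samples.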

Let's see if this changed our MSEs.

In [None]:
# The .predict() method generates predictions based on the trained model
new_train_pred = dt_new.predict(X_train)
new_test_pred = dt_new.predict(X_test)

# Test the MSE of the training set
train_mse_new = mean_squared_error(y_train, new_train_pred)

# Test the MSE of the testing set
test_mse_new = mean_squared_error(y_test, new_test_pred)

print('Training Set MSE: ', train_mse_new)
print('Testing Set MSE: ', test_mse_new)

Though the MSE is worse for the training set, it is actually **better** for the testing set. In addition, the values are closer to one another, making the model more generalizable!

Your turn! Create a new `DecisionTreeRegressor()` with `criterion='absolute_error'`, `min_samples_split=4`, and `max_depth=4`. Print out the MSEs for the training and testing sets.

In [None]:
# TODO: Create new regressor and fit to training data
dt_abs = ...
...

# TODO: Use .predict() to generate predictions
abs_train_pred = ...
abs_test_pred = ...

# TODO: Find the MSE of the training set
train_mse_abs = ...

# TODO: Find the MSE of the testing set
test_mse_abs = ...

print('Training Set MSE: ', train_mse_abs)
print('Testing Set MSE: ', test_mse_abs)

Better or worse than the `squared_error` criterion? Depending on the dataset, some criteria are better than others...

In [None]:
#@title **Answer**
# TODO: Create new regressor and fit to training data
dt_abs = DecisionTreeRegressor(criterion='absolute_error', min_samples_split=4, max_depth=4)
dt_abs.fit(X_train, y_train)

# TODO: Use .predict() to generate predictions
abs_train_pred = dt_abs.predict(X_train)
abs_test_pred = dt_abs.predict(X_test)

# TODO: Find the MSE of the training set
train_mse_abs = mean_squared_error(y_train, abs_train_pred)

# TODO: Find the MSE of the testing set
test_mse_abs = mean_squared_error(y_test, abs_test_pred)

print('Training Set MSE: ', train_mse_abs)
print('Testing Set MSE: ', test_mse_abs)

## **Random Forests 🌲🌲🌲**

Now that we know the basic structure of decision tree models, we'll introduce **Random Forests**.
> Just as a forest is an area with many trees, a **random forest** is a **model** with many **decision trees**!

<img src="https://serokell.io/files/vz/vz1f8191.Ensemble-of-decision-trees.png" width=500>

We call it a *random* forest because this machine learning algorithm uses **bootstrapping** and **random feature selection** in each decision tree.
- **Bootstrapping**: For each decision tree in the forest, we randomly select data points from the original dataset, but with replacement. A single data point can be selected more than once, while others might not be selected at all.
- **Random Feature Selection**: Each decision tree considers only a random subset of the columns (features) when choosing how to split, rather than always looking at every column.

> *These strategies ensure that each tree learns from a slightly different version of the dataset and helps the model avoid overfitting.*

Random Forests can be used for both classification and regression tasks:
- **Classification**: Each tree produces a categorical prediction, and the final result is the **category** chosen by the **majority** of trees (majority voting).
- **Regression**: Each tree produces a numerical prediction, and the final result is the **average** of all the individual predictions.
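
The pieces above can be sketched with plain Python and NumPy. Everything here is made up for illustration (10 pretend rows, five pretend tree predictions); it shows the ideas of bootstrapping, majority voting, and averaging, not how `sklearn` implements them.

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
row_ids = np.arange(10)  # pretend the dataset has 10 rows

# Bootstrapping: draw 10 rows *with replacement*, so repeats are possible
bootstrap_ids = rng.choice(row_ids, size=len(row_ids), replace=True)
left_out = sorted(set(row_ids.tolist()) - set(bootstrap_ids.tolist()))
print('Rows drawn:   ', sorted(bootstrap_ids.tolist()))
print('Rows left out:', left_out)

# Classification: majority vote across (made-up) tree predictions
tree_votes = ['t-shirt', 'jacket', 't-shirt', 't-shirt', 'raincoat']
majority = Counter(tree_votes).most_common(1)[0][0]

# Regression: average of (made-up) tree predictions
tree_values = [68.0, 72.0, 70.0, 74.0, 66.0]
average = np.mean(tree_values)
print('Majority vote:', majority)
print('Average:', average)
```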

From `sklearn`'s `ensemble` module, we use **`RandomForestClassifier`** ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)) and **`RandomForestRegressor`** ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)) to create these models. They both have similar hyperparameters to decision trees, such as `criterion`, `max_depth`, and `min_samples_split`.
> One **new** hyperparameter is **`n_estimators`**, which decides the **number of decision trees** the model uses (default is 100).
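
As a quick check of how `n_estimators` and averaging fit together, the sketch below trains a small forest on made-up data (`X_small`, `y_small`, and `small_forest` are illustrative names) and compares the forest's prediction with the average of its individual trees, which are stored in the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_small = rng.uniform(0, 10, size=(100, 3))          # made-up features
y_small = X_small[:, 0] * 2 + rng.normal(0, 1, 100)  # made-up target

small_forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_small, y_small)

# estimators_ holds the fitted trees, one per n_estimators
per_tree = np.array([tree.predict(X_small) for tree in small_forest.estimators_])
manual_average = per_tree.mean(axis=0)

print('Number of trees:', len(small_forest.estimators_))
print('Matches forest.predict:', np.allclose(manual_average, small_forest.predict(X_small)))
```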


### **Regression**

We'll explore the regressor first, and you will try a classifier on your own.


In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

# Turn data into DataFrame (fetch once and reuse the result)
california = fetch_california_housing()
housing = pd.DataFrame(data=california.data, columns=california.feature_names)
housing['value'] = california.target

The `housing` DataFrame contains data about California housing in "block groups," which are areas with about 600 - 3,000 people each. The `value` column contains the median house value of a block group, in hundreds of thousands of dollars (e.g., 4.2 ~ 420,000). See the dataset and description below for further info on the other columns.

In [None]:
housing.head()

In [None]:
print(fetch_california_housing().DESCR)

Let's use the `RandomForestRegressor` to train a model to predict the median house value given the other features. We follow a similar process as before:
1. Encode categorical data (if necessary)
2. Split X and y data into train and test sets
3. Create model and fit training data
4. Generate predictions on the testing data
5. Compare actual and predicted values using metric(s)
6. Adjust hyperparameters to improve model and metric

In [None]:
# Step 2
X = housing.drop(columns='value')
y = housing['value']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Step 3
forest_reg = RandomForestRegressor()
forest_reg.fit(X_train, y_train)

# Step 4
y_pred = forest_reg.predict(X_test)

# Step 5
mse = mean_squared_error(y_test, y_pred)
print(f'MSE: {mse:.4f}, which implies that the average prediction is about {mse**0.5:.4f} off, or ${(mse**0.5)*100000:.2f} off.')

# Step 6: Adjust hyperparameters in Step 3 and rerun. Feel free to try this on your own!

### **Classification (and Accuracy)**

The `criterion` and evaluation metric of a model do not need to be the same. In fact, it's common (in classification) to use a `criterion` like Gini impurity or entropy during training, while evaluating model performance using a metric such as **accuracy**.

- The **accuracy** of a model is the **proportion** of **correct predictions** out of **all predictions** made.
  - *Easy to interpret* and provides a *general sense of performance*.
  - Best used when the classes within a categorical variable are *balanced and equally represented*.

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

- For instance, given a dataset of 3 cats and 2 dogs, if I predict 2 cats and 2 dogs correctly, my accuracy is $\frac{4}{5} = 0.8$.

> **In `sklearn`, the `metrics` module can calculate accuracy using the `accuracy_score` function**. ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html))
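
Here is the cats-and-dogs example from above as a short sketch. The label lists are made up to match it: 3 cats and 2 dogs, with one cat mistaken for a dog.

```python
from sklearn.metrics import accuracy_score

y_true = ['cat', 'cat', 'cat', 'dog', 'dog']   # actual labels: 3 cats, 2 dogs
y_guess = ['cat', 'cat', 'dog', 'dog', 'dog']  # one cat is mistaken for a dog

acc = accuracy_score(y_true, y_guess)  # argument order: (actual, predicted)
print('Accuracy:', acc)
```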



**Your turn**! The `forest` DataFrame (how fitting!), seen below, contains data on 30x30m patches of forest in the US.
- The `type` column is each patch's cover type, which is the dominant species of tree, as an **integer 1-7**.
- See more info on each feature column [here](https://archive.ics.uci.edu/dataset/31/covertype). There are 54 features, so it's a good idea to set some hyperparameters to limit overfitting and save time!

Use the `RandomForestClassifier` to train a model to predict a patch's cover type. Try to achieve an **accuracy above 90% within 2 minutes**!

In [None]:
from sklearn.datasets import fetch_covtype
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Turn data into DataFrame (fetch once and reuse the result)
covtype = fetch_covtype()
forest = pd.DataFrame(data=covtype.data, columns=covtype.feature_names)
forest['type'] = covtype.target

In [None]:
forest.head()

In [None]:
print(fetch_covtype().DESCR)

In [None]:
%%time
# TODO: Create a RandomForestClassifier that predicts the cover type of a patch.
# Try out various hyperparameters such that your code runs within 2 minutes
# and your accuracy is above 0.9!

...

In [None]:
#@title **Answer**
%%time
# TODO: Create a RandomForestClassifier that predicts the cover type of a patch.
# Try out various hyperparameters such that your code runs within 2 minutes
# and your accuracy is above 0.9!

X = forest.drop(columns='type')
y = forest['type']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Note your hyperparameters can be different
forest_cls = RandomForestClassifier(n_estimators=50, max_depth=100, min_samples_split=10)

# Takes a while to run (~1.5 mins)
forest_cls.fit(X_train, y_train)

y_pred = forest_cls.predict(X_test)

accuracy = accuracy_score(y_test, y_pred) # argument order: (actual, predicted)
print('Accuracy:', accuracy)

### **Grid Searching**

So far, we've manually chosen our hyperparameters through trial and error. While this approach does provide valuable intuition about how different hyperparameters affect model performance, it's often inefficient, subjective, and may lead to suboptimal results (especially as the number of hyperparameters increases).

A more systematic and unbiased approach is to use automated hyperparameter tuning methods, one of the most common being **grid search**.

> Grid searching tests every possible combination of values from a list you provide and evaluates each one by training and testing the model multiple times on different splits of the data to find the best-performing setup.

`sklearn`'s `model_selection` module allows us to grid search using the **`GridSearchCV`** class. ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html))

> "CV" stands for Cross-Validation, a method used to make sure a machine learning model works well on new, unseen data. Instead of training and testing the model just once, cross-validation splits the data into several parts (called *folds*). The model is trained on some parts and tested on the rest, then this is repeated several times using different splits. This helps prevent overfitting and gives a more accurate picture of how the model will perform in real-world situations.
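
To see what "folds" look like, this sketch splits 10 made-up rows into 5 folds with `sklearn`'s `KFold`. It only illustrates the splitting idea; when given an integer `cv` and a classifier, `GridSearchCV` uses a stratified variant that also balances the classes in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_rows = np.arange(10).reshape(-1, 1)  # 10 made-up rows

kfold = KFold(n_splits=5)
folds = list(kfold.split(X_rows))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f'Fold {i}: train on rows {train_idx.tolist()}, test on rows {test_idx.tolist()}')
```

Every row is used for testing exactly once and for training in the other four folds.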

That being said, we might make a dictionary of lists like this to create our grid of possible hyperparameter combinations:

<pre>
param_grid = {
  'n_estimators': [50, 100, 150],
  'max_depth': [None, 10, 20],
  'min_samples_split': [3, 5, 10]
}
</pre>

And our new model like this:

<pre>
forest_cls = RandomForestClassifier()

grid_search = GridSearchCV(estimator=forest_cls, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train, y_train)
</pre>

This would result in 3 `n_estimators` x 3 `max_depth` x 3 `min_samples_split` = 27 hyperparameter combinations, each trained and scored on 5 folds, for 27 x 5 = 135 random forest fits in total. The combination with the best average accuracy wins. Conceptually, the search works through the grid one combination at a time, like this (the exact order `sklearn` uses may differ):
1. `n_estimators=50`, `max_depth=None`, `min_samples_split=3`
2. `n_estimators=50`, `max_depth=None`, `min_samples_split=5`
3. `n_estimators=50`, `max_depth=None`, `min_samples_split=10`
4. `n_estimators=50`, `max_depth=10`, `min_samples_split=3`
5. `n_estimators=50`, `max_depth=10`, `min_samples_split=5`
6. And so on.

This can be time consuming, but very rewarding to achieve a great model. We'll explore how to fully code this in the next section!
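
Before launching a long search, it can help to count how much work the grid implies. This sketch uses `sklearn`'s `ParameterGrid` to enumerate the combinations; the name `demo_grid` is illustrative and mirrors the grid described above.

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [3, 5, 10]
}

combos = list(ParameterGrid(demo_grid))
n_folds = 5
print('Combinations:', len(combos))
print('Total fits:  ', len(combos) * n_folds)
print('First combination:', combos[0])
```

With the default `refit=True`, `GridSearchCV` also trains the winning combination one more time on all of the training data.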







### **XGBoost**

XGBoost stands for **extreme gradient boosting**. This algorithm, like random forests, uses many decision trees to make predictions. However, the **key difference** is that it builds these trees **sequentially**, rather than independently.
> This means that each new decision tree learns from the errors of the previous ones!

The learning process in XGBoost is as follows:
1. Start with a simple **initial prediction** for every instance (an instance is one row of the data).
2. Measure how far off the current predictions are for each instance. These errors are called *residuals* (more generally, the gradients of the loss function).
  - Instances that are **predicted well have small errors**, while those **predicted poorly have large errors** and influence the next tree more.
3. Train a new tree to predict those errors, then add its output (scaled by a learning rate) to the current predictions, focusing more on the difficult cases the earlier trees struggled with.
4. Repeat steps 2 and 3 for the third tree, fourth tree, etc...
  - Each new tree aims to **correct the errors of the previous ones**, gradually improving overall accuracy!
  - However, this causes XGBoost to be more prone to overfitting, so it is important to adjust the hyperparameters of the model.

> How does the algorithm do this? **Gradient Descent Optimization!** Feel free to read more about gradient descent [here](https://www.ibm.com/think/topics/gradient-descent), or check out [this video](https://www.youtube.com/watch?v=TyvYZ26alZs&ab_channel=Econoscent) on XGBoost to visualize it better.
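
The idea that each new tree learns from the errors of the previous ones can be sketched by hand for a regression problem. This is a simplified illustration using `sklearn`'s `DecisionTreeRegressor` and a squared-error loss, not XGBoost's actual implementation (which adds regularization and other optimizations). The data, the `learning_rate` of 0.3, and the 20 rounds are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_boost = rng.uniform(0, 10, size=(200, 1))                # made-up feature
y_boost = np.sin(X_boost[:, 0]) + rng.normal(0, 0.2, 200)  # made-up target

learning_rate = 0.3
current_pred = np.full_like(y_boost, y_boost.mean())  # start from a constant guess
initial_mse = mean_squared_error(y_boost, current_pred)

for round_number in range(20):
    residuals = y_boost - current_pred  # errors of the trees built so far
    tree = DecisionTreeRegressor(max_depth=2).fit(X_boost, residuals)
    current_pred += learning_rate * tree.predict(X_boost)  # nudge predictions toward the target

final_mse = mean_squared_error(y_boost, current_pred)
print(f'MSE before boosting: {initial_mse:.4f}')
print(f'MSE after 20 trees:  {final_mse:.4f}')
```

Each shallow tree is a weak model on its own, but adding many small corrections steadily lowers the training error.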

From the `xgboost` module, we can use the `XGBClassifier` and `XGBRegressor` to create these models.
- Check out their documentation: [`xgboost`](https://xgboost.readthedocs.io/en/release_3.0.0/index.html), [`XGBClassifier`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.ensemble.XGBClassifier), [`XGBRegressor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.ensemble.XGBRegressor)

In the previous question, we predicted the cover type of a forest patch using `RandomForestClassifier`. Let's try doing the same with the `XGBClassifier` and compare our results!



In [None]:
from xgboost import XGBClassifier, XGBRegressor

# The `forest` DataFrame again, for your convenience
forest.head()

In [None]:
X = forest.drop(columns='type')
y = forest['type']
X_train, X_test, y_train, y_test = train_test_split(X, y)

The XGBClassifier internally expects class labels to start at 0 and be contiguous integers (0, 1, 2...). However, our labels range from 1 to 7. We'll use a `LabelEncoder` to adjust this! ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html))
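
Here is a minimal sketch of what `LabelEncoder` does to labels like ours. The array `cover_types` is made up for illustration; `demo_le` is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

cover_types = np.array([1, 2, 7, 3, 2, 5, 4, 6])  # made-up labels in the range 1-7

demo_le = LabelEncoder()
encoded = demo_le.fit_transform(cover_types)  # labels become 0-6
decoded = demo_le.inverse_transform(encoded)  # and can be mapped back

print('Encoded:', encoded.tolist())
print('Decoded:', decoded.tolist())
```

`inverse_transform` is handy for turning a model's encoded predictions back into the original cover types.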

In [None]:
# Uncomment to see the unadjusted error

# forest_xgb = XGBClassifier()
# forest_xgb.fit(X_train, y_train)

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test) # use .transform, rather than .fit_transform, to avoid possible mismatches

forest_xgb = XGBClassifier(n_estimators=50, eval_metric='merror') # limit number of trees to 50, 'merror' = multiclass error rate (1 - accuracy)

# Create a list of possible values to try for each hyperparameter
param_grid = {
    'gamma': [0.2, 0.3],
    'max_depth': [10, 12],
    'subsample': [0.6, 0.8]
}

# Create grid search with 3 folds and choose combination with best accuracy
grid_search = GridSearchCV(estimator=forest_xgb, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=-1)

# Takes a while to run (~12 mins)
grid_search.fit(X_train, y_train_enc)

In [None]:
# Best parameter combination and score
print("Best parameters found:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)

# Get best model and evaluate accuracy on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Test set accuracy:", accuracy_score(y_test_enc, y_pred))

The best combination of hyperparameters (that we tested) is `gamma=0.3`, `max_depth=12`, and `subsample=0.6`, which achieved a 95% accuracy!

Your turn! Use the `XGBRegressor` to predict the median house value of a block group. You can use grid search, but it's not required (in the interest of time). Compare its MSE to our `RandomForestRegressor`!

In [None]:
# The `housing` DataFrame again, for your convenience
housing.head()

In [None]:
# TODO

...

In [None]:
#@title **Answer**

X = housing.drop(columns='value')
y = housing['value']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Without grid search
housing_xgb = XGBRegressor(n_estimators=50, eval_metric='rmse') # limit number of trees to 50, 'rmse' = root mean squared error
housing_xgb.fit(X_train, y_train) # ~3 secs

y_pred = housing_xgb.predict(X_test)
rmse = mean_squared_error(y_test, y_pred) ** 0.5

print(f'RMSE: {rmse:.4f}, which implies that a typical prediction is about ${rmse*100000:.2f} off.')


# With grid search
housing_xgb = XGBRegressor(n_estimators=100, eval_metric='rmse')

param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 6],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'min_child_weight': [1, 5],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1, 5]
}

grid_search = GridSearchCV(estimator=housing_xgb, param_grid=param_grid, cv=3, scoring='neg_root_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train) # ~3 mins

best_rmse = -grid_search.best_score_
print("Best cross-validation RMSE:", best_rmse)
print("Best Parameters:", grid_search.best_params_)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
rmse = mean_squared_error(y_test, y_pred) ** 0.5

print(f'Test RMSE: {rmse:.4f}, which implies that a typical prediction is about ${rmse*100000:.2f} off.')

In general, XGBoost classification tends to take longer than regression because it involves extra work:
- Classification needs label encoding as an extra preprocessing step; regression does not.
- Multiclass classification builds one tree per class per boosting round (e.g., 3 classes = 3x more trees).
- And more...

So even with similar data and settings, classification is usually slower because it performs more computation.
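
As a rough back-of-the-envelope check on the "one tree per class per round" point, the sketch below counts how many trees get built. `total_trees` is a hypothetical helper written for this illustration (it is not part of XGBoost), and it assumes regression and binary classification build a single tree per round.

```python
def total_trees(n_rounds, n_classes=None):
    """Rough tree count (hypothetical helper, not an XGBoost function).

    Regression and binary classification: one tree per boosting round.
    Multiclass classification: one tree per class per boosting round.
    """
    if n_classes is None or n_classes <= 2:
        return n_rounds
    return n_rounds * n_classes

regression_trees = total_trees(n_rounds=50)
multiclass_trees = total_trees(n_rounds=50, n_classes=3)

print("Regression:", regression_trees)        # 50
print("3-class classification:", multiclass_trees)  # 150
```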



## **The Confusion Matrix**

A **confusion matrix** is a table used to evaluate the performance of a **classification model** by comparing the predicted labels with the actual labels. It is especially useful in binary and multiclass classification problems.

<img src='https://plat.ai/wp-content/uploads/Table1-2.png.webp' width=500>

- **True Positive (TP)**: The model correctly predicted the positive class.

- **True Negative (TN)**: The model correctly predicted the negative class.

- **False Positive (FP)**: The model incorrectly predicted positive when it is actually negative (Type I error).

- **False Negative (FN)**: The model incorrectly predicted negative when it is actually positive (Type II error).

In the context of the confusion matrix, accuracy (the overall proportion of correct predictions) can also be written as:

$$\text{Accuracy} = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}$$
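
To make the four cells concrete, here is a minimal sketch that counts them from a small set of made-up labels and then applies the accuracy formula. The helper `confusion_counts` and the example labels are ours, for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn

# Made-up labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tp, tn, fp, fn)  # 3 3 1 1
print(accuracy)        # 0.75
```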


### **More Metrics**

Classification models have many other metrics, such as **precision, recall, and F1-score**. Each has its own situational benefits:

**Precision**: Out of all the instances the model predicted as positive, how many were **actually positive?**
- A **high precision** indicates your model's positive predictions are **mostly correct** (few false positives).
- A **low precision** indicates your model's positive predictions are **mostly incorrect** (many false positives).
> This would be a good metric to use when **false positives are costly**. For example, in an email spam filter, you would want to avoid marking real emails as spam.

$$\text{Precision} = \frac{\text{TP}}{\text{TP + FP}}$$

**Recall**: Out of all actual positive instances, how many did the model correctly identify?
- A **high recall** indicates your model **identifies** most of the actual positives (few false negatives).
- A **low recall** indicates your model **misses** most of the actual positives (many false negatives).
> This would be a good metric to use when **missing a positive is costly**. For example, in a disease detection model, you would want to catch as many sick patients as possible.

$$\text{Recall} = \frac{\text{TP}}{\text{TP + FN}}$$

**F1-Score**: The harmonic mean of precision and recall, which is pulled toward the lower of the two values.
- A **high F1-Score** means **both** precision and recall are high (few false positives and few false negatives).
- A **low F1-Score** means **at least one** of precision and recall is low (many false positives and/or many false negatives).
> This would be a good metric to use when you want to **balance false positives and false negatives** and need a single metric. Also useful when **classes are imbalanced**.



$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
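
The three formulas can be sketched in a few lines of plain Python. The counts below are made up for illustration (they are not from any dataset in this notebook), and the function names are ours.

```python
def precision_score_from_counts(tp, fp):
    """Of everything predicted positive, the fraction that was actually positive."""
    return tp / (tp + fp)

def recall_score_from_counts(tp, fn):
    """Of everything actually positive, the fraction the model caught."""
    return tp / (tp + fn)

def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * (precision * recall) / (precision + recall)

# Made-up counts for illustration
tp, fp, fn = 8, 2, 4

precision = precision_score_from_counts(tp, fp)  # 8 / 10 = 0.8
recall = recall_score_from_counts(tp, fn)        # 8 / 12, about 0.667
f1 = f1_from_precision_recall(precision, recall)
arithmetic_mean = (precision + recall) / 2

print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1-Score: {f1:.3f}')
print(f'Arithmetic mean: {arithmetic_mean:.3f}')
```

Note that the F1-score lands below the arithmetic mean of precision and recall: the harmonic mean penalizes the weaker of the two.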

For regression, one other metric (besides RMSE, MSE, MAE) is the **Coefficient of Determination ($R^2$)**, which tells you how well your model fits the data.

It usually outputs a number between 0 and 1 (it can even be negative if the model fits worse than always predicting the mean), which tells you:
- 0: Your model explains **none** of the variation in the data.
- 1: Your model explains **all** of the variation perfectly.
- In between: Your model explains **some**, but not all, of the variation.

Feel free to read more [here](https://www.scribbr.com/statistics/coefficient-of-determination/)!
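
As a minimal sketch, $R^2$ can be computed by hand using its standard definition, $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ (the formula is not spelled out above, so treat it as background we are assuming). The helper `r_squared` and the toy numbers are ours, for illustration only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error left over after the model's predictions
    ss_res = sum((actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred))
    # Total variation of the true values around their mean
    ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

good_r2 = r_squared(y_true, [2.5, 5.5, 7.0, 9.0])  # close predictions
mean_r2 = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
bad_r2 = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])   # worse than predicting the mean

print(good_r2)  # 0.975
print(mean_r2)  # 0.0
print(bad_r2)   # -3.0
```

The three cases line up with the list above: close predictions give a value near 1, always predicting the mean gives 0, and predictions worse than the mean give a negative value.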




For this final question, calculate the accuracy, precision, and recall of this confusion matrix:

|                      | Predicted Positive | Predicted Negative |
|----------------------|--------------------|--------------------|
| Actual Positive      |        50 (TP)      |       10 (FN)       |
| Actual Negative      |         5 (FP)      |       35 (TN)       |


In [None]:
# TODO

accuracy = ...
print('Accuracy: ', accuracy)

precision = ...
print('Precision: ', precision)

recall = ...
print('Recall: ', recall)

In [None]:
#@title **Answer**

accuracy = (50 + 35) / (50 + 10 + 5 + 35)
print('Accuracy: ', accuracy)

precision = (50) / (50 + 5)
print('Precision: ', precision)

recall = (50) / (50 + 10)
print('Recall: ', recall)