<img src="https://drive.google.com/uc?id=1-hPP-XPm9_5M3orUgmompcVleQ5xvPST" style="Width:1000px">

# Introducing the `XGBoost` Library

We will continue working on our Earthquake damage predictions, but this time, we will use a different approach: `Boosted Trees` with `XGBoost`.

`XGBoost` stands for `eXtreme Gradient BOOSTing`, and it is one of the most popular machine learning libraries. Many `Kaggle` competitions have been won using `XGBoost`. The library is dedicated to `ensemble learning` with decision trees, and it offers a large number of options. It can train `RandomForest`-style ensembles, but its main strength is `BoostedTrees`: trees are added one at a time, and each new tree is fitted to correct the errors of the ensemble built so far (`gradient boosting`).

I highly recommend that you <a href="https://xgboost.readthedocs.io/en/stable/">read the very complete documentation</a> of `XGBoost` if you are serious about applying machine learning to tabular data: boosted trees are one of the most powerful families of algorithms for this kind of data at the moment.

Part of the success of `XGBoost` is that it borrows principles from `Deep-learning` and applies them to `DecisionTrees`: for instance, you can control the `learning rate` of the `boosted trees` in `XGBoost`, something you are already familiar with through your use of the `SGDClassifier` and `SGDRegressor` classes, and through our lecture on Wednesday of week 1.
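To make the role of the `learning rate` in boosting concrete, here is a minimal sketch of gradient boosting written from scratch. It is an illustration only, not how `XGBoost` is implemented: it assumes a toy one-dimensional regression problem, a squared-error loss, and one-split trees (stumps). The helper names `fit_stump` and `boost` are made up for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residuals (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        error = ((residual - np.where(left, left_value, right_value)) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def boost(x, y, n_estimators, learning_rate):
    """Run gradient boosting with stumps and return the training MSE after each round."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    history = []
    for _ in range(n_estimators):
        residual = y - prediction  # negative gradient of the squared error
        threshold, left_value, right_value = fit_stump(x, residual)
        update = np.where(x <= threshold, left_value, right_value)
        prediction += learning_rate * update  # the learning rate shrinks each tree's contribution
        history.append(float(((y - prediction) ** 2).mean()))
    return history

# Toy data (an assumption for this sketch, unrelated to the earthquake dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

slow = boost(x, y, n_estimators=20, learning_rate=0.1)
fast = boost(x, y, n_estimators=20, learning_rate=0.5)
print(f"Training MSE after 20 rounds: lr=0.1 -> {slow[-1]:.3f}, lr=0.5 -> {fast[-1]:.3f}")
```

A smaller learning rate needs more boosting rounds to reach the same training loss, which is why `learning_rate` and `n_estimators` are usually tuned together.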

Here, we will see that `XGBoost` will outperform all of the classifiers we developed in our previous notebook.


# Opening the data

As in the previous exercise, open the data in `earthquake_nepal.csv`, separate the target (`damage_grade`) from the features, and do a `train_test_split` with 80% of the data in the train set (use `random_state=42` to be consistent with my results). <br>
Convert your `y_train` and `y_test` to categorical data using a `label_encoder`.
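As a quick reminder of what a `LabelEncoder` does, here is a small sketch on made-up labels (the values below are hypothetical and are not taken from `earthquake_nepal.csv`). It maps the sorted original labels to consecutive integers starting at 0, which is the form recent versions of `XGBClassifier` expect for the target.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

grades = np.array([3, 1, 2, 3, 2])  # hypothetical damage grades

encoder = LabelEncoder()
encoded = encoder.fit_transform(grades)

print(encoder.classes_)                    # the sorted original labels
print(encoded)                             # integer codes starting at 0
print(encoder.inverse_transform(encoded))  # back to the original grades
```

Keeping the fitted encoder around lets you translate predictions back to the original grades with `inverse_transform`.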

**Note**: this time, I am giving you more data. In fact, one order of magnitude more data! But you will see that `XGBoost` can handle this relatively easily, and this will contribute to boosting our performance. We should achieve an **accuracy > 70%**.


In [None]:
from nbta.utils import download_data
download_data(id='1q2id-UaRkjFRm_bCefaAPXONI5hB15mK')

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Importing the dataset
data = pd.read_csv('raw_data/earthquake_nepal.csv')

In [None]:
from sklearn.preprocessing import LabelEncoder

X = data.drop(columns='damage_grade')
y = data.damage_grade

l_enc = LabelEncoder()

y = l_enc.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [None]:
X_train

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('data_preproc',
                         X_shape = X_train.shape,
                         y_values = pd.DataFrame(y_train).value_counts().shape[0],
                         y_type = type(pd.DataFrame(y_train).value_counts().index[0][0])
)

result.write()
print(result.check())

# Open the saved `preproc` pipeline

Now, using `joblib`, load your saved `preproc` pipeline, save it in a variable named `preproc`, and transform your `X_train` and `X_test` with this pipeline.
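If you want to convince yourself that a pipeline survives the trip to disk, here is a small round-trip sketch. It uses a toy `StandardScaler` pipeline and random data as stand-ins (both are assumptions for this example; your real `preproc` pipeline is whatever you saved in the previous notebook).

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # stand-in for the data the pipeline was fitted on
X_new = rng.normal(loc=5.0, scale=2.0, size=(10, 3))   # stand-in for unseen data

pipeline = make_pipeline(StandardScaler()).fit(X_fit)

# Save the fitted pipeline to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_preproc.joblib")
    dump(pipeline, path)
    restored = load(path)

# transform() reuses the statistics learned during fit(); nothing is re-fitted
same = np.allclose(pipeline.transform(X_new), restored.transform(X_new))
print(same)
```

Note that only `transform` is called on the new data: calling `fit_transform` again would re-learn the preprocessing statistics and defeat the purpose of loading a fitted pipeline.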

In [None]:
from joblib import load

preproc = load('preproc.joblib')

In [None]:
X_train = preproc.transform(X_train)

In [None]:
X_test = preproc.transform(X_test)

# Create a validation set

As we would do when training neural networks, `XGBoost` can use a validation set to track the performance of our algorithm during training. So let's create an `X_val` and `y_val` by further splitting `X_train` and `y_train` 80%/20%. Don't forget to assign the 80% part back to the names `X_train` and `y_train`: otherwise, the rows of `X_val` and `y_val` will still be present in your training data, and you will have a data leak.
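The sketch below shows the two successive splits on hypothetical row identifiers rather than on the real data, so that we can check that the three resulting sets do not share any row. The `random_state` on the second split is my addition for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(1000)  # hypothetical row identifiers standing in for the real rows

# First split: train / test
train_ids, test_ids = train_test_split(row_ids, train_size=0.8, random_state=42)
# Second split: the train part is split again, and train_ids is overwritten
train_ids, val_ids = train_test_split(train_ids, train_size=0.8, random_state=42)

print(len(train_ids), len(val_ids), len(test_ids))

# Any row present in two sets would be a data leak
overlap = (set(train_ids) & set(val_ids)) | (set(train_ids) & set(test_ids)) | (set(val_ids) & set(test_ids))
print(len(overlap))
```

Because `train_ids` is overwritten by the second split, the validation rows are removed from the training set, which is exactly what prevents the leak.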


In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.8)

# XGBoost: A gentle Introduction

Let's get acquainted with the basic use of the `XGBClassifier` class. Import it from the `xgboost` library, and train a plain-vanilla version of the classifier. Save the accuracy score on the test set into a variable called `xgb_accuracy`.

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

xgb_accuracy = accuracy_score(y_test, xgb.predict(X_test))
xgb_accuracy

<details><summary>🔍 Observations</summary><br>
    Right off the bat, the <code>XGBoost</code> classifier performs as strongly as our strongest previous classifier!</details>

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('base_accuracy',
                         score = xgb_accuracy
)

result.write()
print(result.check())

### `XGBoost` hyperparameters

Now, let's explore some of the basic hyperparameters of `XGBoost`. We will focus here on the ones that are particular to the algorithm, and not the ones similar to `DecisionTrees`. The hyperparameters we will play with include:

* `n_estimators`: the number of boosting rounds we want our algorithm to go through (one tree is added per class at each round).
* `learning_rate`: the step size that shrinks the contribution of each new tree, which plays the same role as the learning rate in gradient descent.
* `early_stopping_rounds`: the number of boosting rounds without improvement on the evaluation set after which training stops.

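To be able to use `early_stopping_rounds`, we need to pass an evaluation set to the `eval_set` argument of our `fit()` function. Here we pass both our training set (the `X_train` and `y_train`) and our validation set (the `X_val` and `y_val`), so that we can compare the two; when several sets are passed, `XGBoost` uses the last one for early stopping.
⚠️ This needs to be passed as a **list of (X, y) tuples**.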
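To make the early stopping rule concrete, here is a plain Python sketch of the "no improvement for a given number of rounds" logic. It is an illustration of the rule, not `XGBoost`'s actual code: the function name `early_stopping_round` and the loss values are made up for this example.

```python
def early_stopping_round(val_losses, patience):
    """Return (stop_round, best_round) for a 'no improvement in `patience` rounds' rule."""
    best_round = 0
    for current_round, loss in enumerate(val_losses):
        if loss < val_losses[best_round]:
            best_round = current_round  # new best validation loss
        elif current_round - best_round >= patience:
            return current_round, best_round  # too long without improvement: stop
    return len(val_losses) - 1, best_round  # never triggered: all rounds were used

# Hypothetical validation losses, one value per boosting round
val_losses = [0.90, 0.80, 0.75, 0.74, 0.76, 0.77, 0.78, 0.79]
stop_round, best_round = early_stopping_round(val_losses, patience=3)
print(stop_round, best_round)
```

With these made-up losses, the best round is the fourth one (index 3), and training stops three rounds later, once the validation loss has failed to improve three times in a row.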

We also need to set the `eval_metric` to reflect a valid cost function for classification (`"accuracy"` cannot be a cost function) when we create our `XGBClassifier`.
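Why can't accuracy serve as a cost function? One way to see it is that accuracy only looks at the predicted class, while a log-loss also looks at how confident the prediction was. The sketch below computes both by hand with `numpy` on made-up probabilities (the helper functions and the two probability tables are assumptions for illustration).

```python
import numpy as np

def mlogloss(y_true, proba):
    """Multiclass log-loss: mean negative log-probability given to the true class."""
    return float(-np.mean(np.log(proba[np.arange(len(y_true)), y_true])))

def accuracy(y_true, proba):
    """Share of samples whose most probable class is the true class."""
    return float(np.mean(proba.argmax(axis=1) == y_true))

y_true = np.array([0, 1, 2])

# Two hypothetical models: both predict the right class, with different confidence
confident = np.array([[0.90, 0.05, 0.05],
                      [0.05, 0.90, 0.05],
                      [0.05, 0.05, 0.90]])
hesitant = np.array([[0.4, 0.3, 0.3],
                     [0.3, 0.4, 0.3],
                     [0.3, 0.3, 0.4]])

for name, proba in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(y_true, proba), round(mlogloss(y_true, proba), 3))
```

Both models get the same accuracy, but the log-loss tells them apart, and it changes smoothly as the probabilities change. That smoothness is what a gradient-based algorithm needs.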

To get a feel for how `XGBoost` works with a validation set, go ahead and create a new `XGBClassifier` with `n_estimators=100`, an `eval_metric` of `"mlogloss"` (multinomial log-loss), and a `learning_rate=1.8`, then pass it your `train` and `validation` sets during fitting.

In [None]:
xgb = XGBClassifier(eval_metric='mlogloss', n_estimators=100, learning_rate=1.8)

In [None]:
xgb.fit(X_train, y_train, eval_set=[(X_train, y_train),(X_val, y_val)])

## ☝️ Training performance at each iteration: learning curve

As you can see, `XGBoost` outputs a log of the `losses` for each set in your `eval_set`, named in the order you passed them: `validation_0` is your training set (if you passed `(X_train, y_train)` first), and `validation_1` is your validation set.

If you remember last week's lecture on training curves (and I hope you do!) you will recognise the value of this data: we can plot a learning curve, and see what our algorithm is doing.

**Do the following:**
* See what the `evals_result()` method returns when called on your trained `XGBClassifier` (*tip*: it returns the losses recorded at each boosting round for both of your evaluation sets). Save this into a variable called `results`
* Explore `results`: it is a dictionary-like object, so see what its keys are, and what the keys of the objects stored under them are.
* Once you understand your `results`, write a simple Python function that will draw both training curves on a plot. I suggest using a `figsize` of `(15, 10)`, or something similar, to show the curves nicely.
* Draw the learning curve for the `XGBClassifier` that you trained above

Then, based on your observations, answer the following question:
* Does the algorithm `'underfit'`, `'overfit'`, or is it `'balanced'`?

Save your answer as a string in a variable named `performance`
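If you are unsure how to read the two curves, the sketch below summarises them numerically. It assumes `results` has the nested layout `{'validation_0': {'mlogloss': [...]}, 'validation_1': {'mlogloss': [...]}}`; the function name `summarize_curves` and the loss values are made up for this example, and they are not the ones your model will produce.

```python
def summarize_curves(results, metric="mlogloss"):
    """Summarise train/validation curves stored in an evals_result-style dictionary."""
    train = results["validation_0"][metric]
    val = results["validation_1"][metric]
    best_round = min(range(len(val)), key=val.__getitem__)  # round with the lowest validation loss
    return {
        "best_val_round": best_round,
        "final_gap": val[-1] - train[-1],                  # large positive gap: hint of overfitting
        "val_rises_after_best": val[-1] > val[best_round],  # validation loss going back up
    }

# Hypothetical curves: the train loss keeps falling while the validation loss turns back up
toy_results = {
    "validation_0": {"mlogloss": [0.90, 0.70, 0.55, 0.45, 0.38]},
    "validation_1": {"mlogloss": [0.92, 0.78, 0.74, 0.76, 0.80]},
}
summary = summarize_curves(toy_results)
print(summary)
```

A training loss that keeps decreasing while the validation loss rises again is the classic signature of overfitting; two curves that stay high and close together point to underfitting instead.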

In [None]:
results = xgb.evals_result()

In [None]:
results.keys()

In [None]:
results['validation_0'].keys()

In [None]:
import matplotlib.pyplot as plt

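def draw_learning_curves(results):
    """Plot the train and validation mlogloss recorded at each boosting round."""
    fig, ax = plt.subplots(1, 1, figsize=(15, 10))

    # validation_0 / validation_1 follow the order of the eval_set list
    ax.plot(results['validation_0']['mlogloss'], c='blue', label='Train set')
    ax.plot(results['validation_1']['mlogloss'], c='orange', label='Validation set')
    ax.set_xlabel('Boosting round')
    ax.set_ylabel('mlogloss')
    ax.legend()

    return ax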

In [None]:
draw_learning_curves(results);

In [None]:
performance = 'overfit'

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('fitting',
                         performance = performance
)

result.write()
print(result.check())

## Changing Hyperparameters

Now, let's change the hyperparameters of the `XGBClassifier` to:
* `eval_metric='mlogloss'` 
* `n_estimators=150` 
* `learning_rate=0.3`

Retrain your model, and draw the learning curve. What do you think your model is doing now?

In [None]:
xgb = XGBClassifier(n_estimators=150, eval_metric='mlogloss', learning_rate=0.3)

In [None]:
xgb.fit(X_train, y_train, eval_set=[(X_train, y_train),(X_val, y_val)])

In [None]:
draw_learning_curves(xgb.evals_result());

In [None]:
accuracy_score(y_test, xgb.predict(X_test))

<details><summary><strong>💡 Observations and strategy</strong></summary><br>
    You should now see that the training is much smoother, and that the loss on your validation set decreases steadily.</details>
    
## Controlling overfitting

Now, let's control overfitting by tweaking two hyperparameters:
* Create a new `XGBClassifier` with the exact same hyperparameters as above
* Set the new `min_child_weight` parameter to `6`
* Set the new `max_depth` parameter to `7`
* Retrain the model, and check the learning curve
* Calculate an `accuracy_score` on the test set and save it in a variable named `final_score`

In [None]:
xgb = XGBClassifier(n_estimators=150,
                    eval_metric='mlogloss',
                    learning_rate=0.3,
                    min_child_weight=6,
                    max_depth=7)

xgb.fit(X_train, y_train, eval_set=[(X_train, y_train),(X_val, y_val)])


In [None]:
draw_learning_curves(xgb.evals_result());

In [None]:
final_score = accuracy_score(y_test, xgb.predict(X_test))
final_score

<details><summary><strong>💡 Observations and strategy</strong></summary><br>
    We are improving a little bit on our model, but not very much. Feel free to do a grid search or other hyperparameter search if you want to.</details>
    
# How did I come up with these hyperparameters?

So, how did I come up with these sets of `hyperparameters` to improve on our model? Simply put, I used `hyperparameter search`, in particular the `BayesSearchCV` class from the <a href="https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html">scikit-optimize</a> library.

But this takes a lot of time, so rather than have you do it in this notebook, I did it for you. But how do you start optimizing a complex algorithm like `XGBoost`? This is not easy, because there are many hyperparameters to tweak.

A good way to start is by <a href="https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/">following this article that explains how to do hyperparameter tuning</a> for `XGBoost`: it gives clear guidelines on which hyperparameters to optimize first, and which ones to leave for later.

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('final_score',
                         score = final_score
)

result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.