This Notebook is authored by Rocky Jagtiani for all his learner friends enrolled on govt and private platforms. Connect with me here : https://linkedin.com/in/rocky-jagtiani-3b390649/

Introduction
--
**XGBoost** stands for <b><font color='darkblue'> extreme gradient boosting </font></b>. It is an implementation of gradient boosted decision trees with several additional features focused on speed and performance.

We have made predictions with the **random forest method**, which achieves better performance than a single decision tree simply by averaging the predictions of many decision trees.

We refer to the random forest method as an **"ensemble method"**. By definition, ensemble methods combine the predictions of several models (e.g., several trees, in the case of random forests).

Now in this NB, we'll learn about **another ensemble method** called **gradient boosting**.

Steps in Gradient Boosting
--
Gradient boosting is a method that goes through cycles to iteratively add models into an ensemble.

It begins by initializing the ensemble with a single model, whose predictions can be pretty naive. (Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.)

Then, we start the cycle:

**First**, we use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add the predictions from all models in the ensemble.

These predictions are used to calculate a loss function (like mean squared error, for instance).

Then, we use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine model parameters so that adding this new model to the ensemble will reduce the loss. (*note: The "gradient" in "gradient boosting" refers to the fact that we'll use gradient descent on the loss function to determine the parameters in this new model.*)

**Finally**, we add the new model to the ensemble, and ...
... **repeat**!

![XGBOOst_Image](https://drive.google.com/uc?id=11kQ-s_oc1G-xHY5qGXubSGeI6YFZq3Sf)
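
The cycle described above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: it uses one-feature decision stumps and squared error, and the helper names (`fit_stump`, `boost`, `mse`) and the tiny dataset are made up for this sketch. It is not how XGBoost is implemented internally.

```python
# Minimal gradient boosting for squared error, using one-feature decision stumps.
# All names here (fit_stump, boost, mse) are illustrative, not part of XGBoost.

def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_estimators=20, learning_rate=0.5):
    base = sum(ys) / len(ys)                 # naive initial model: always predict the mean
    stumps = []
    preds = [base] * len(ys)
    history = [mse(ys, preds)]               # loss after each round
    for _ in range(n_estimators):
        # For squared error, the negative gradient is simply the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)     # new model fitted to the current errors
        stumps.append(stump)                 # add it to the ensemble ...
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))       # ... and repeat

    def predict(x):
        # The ensemble prediction is the sum of all member predictions.
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]
predict, history = boost(xs, ys)
print("MSE before boosting:", round(history[0], 3))
print("MSE after boosting: ", round(history[-1], 3))
```

Each round fits a new stump to the errors of the current ensemble, so the training loss recorded in `history` never goes up from one round to the next.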

In the coded examples, we'll work with the **XGBoost library**. (<font color='darkblue'> xgboost.XGBRegressor  and  XGBClassifier</font> )

<font color='green'> Note : Scikit-learn has its own implementation of gradient boosting, but XGBoost has some technical advantages. The two corresponding Scikit-learn classes are <i>sklearn.ensemble.GradientBoostingClassifier</i> and <i>sklearn.ensemble.GradientBoostingRegressor</i>.</font>

As the NB progresses, you will see that the XGBRegressor class has many **tunable parameters** (*compared to the sklearn.ensemble classes*).

In [None]:
# importing required Libraries
import numpy as np
from scipy.stats import uniform, randint

# importing some predefined datasets
from sklearn.datasets import load_breast_cancer, load_diabetes, load_wine

# importing different metrics for performance measurement
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# importing modules for applying CROSS Validation and GridSearch on the dataset to find best parameters 
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split

# importing the XGBoost library. It contains the XGBRegressor and XGBClassifier classes
import xgboost as xgb

In [None]:
# defining a function to print scores in a formatted way
def display_scores(scores):
    print("Scores: {0}\nMean: {1:.3f}\nStd: {2:.3f}".format(scores, np.mean(scores), np.std(scores)))

In [None]:
# Code Example 1 : Use XGBRegressor (i.e. a regression model) to predict diabetes progression
diabetes = load_diabetes()

# diabetes dataset description (column names and types) here: https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset
X = diabetes.data
y = diabetes.target # a quantitative measure of disease progression one year after baseline

# see sample i/p features and the o/p target
print(X[:5])
print(y[:5])

xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=10) # choose any non-negative integer for random_state
# objective="reg:squarederror" (older releases called it "reg:linear") tells the Regressor Model to minimise mse
# mse --> mean squared error

# split data into training and test set, for both features and target
# (default split: 75% train / 25% test; no random_state is set here, so your RMSE will vary between runs)
train_X, test_X, train_y, test_y = train_test_split(X, y)

xgb_model.fit(train_X, train_y)

# predicting on test data
y_pred = xgb_model.predict(test_X)

mse = mean_squared_error(test_y, y_pred)
print("-----------------------------")
print("RMSE : ", np.sqrt(mse))

[[ 0.03807591  0.05068012  0.06169621  0.02187235 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990842 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632783 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06832974 -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 -0.00567061 -0.04559945 -0.03419447
  -0.03235593 -0.00259226  0.00286377 -0.02593034]
 [-0.08906294 -0.04464164 -0.01159501 -0.03665645  0.01219057  0.02499059
  -0.03603757  0.03430886  0.02269202 -0.00936191]
 [ 0.00538306 -0.04464164 -0.03638469  0.02187235  0.00393485  0.01559614
   0.00814208 -0.00259226 -0.03199144 -0.04664087]]
[151.  75. 141. 206. 135.]
-----------------------------
RMSE :  58.80307813639512


In [None]:
# optional
# uncomment below line to print the xgb_model. You would see a very large no. of parameters.
# xgb_model

In [None]:
# Code Example 2 : Use XGBClassifier (i.e. a classifier model) to predict whether a breast tumour
# is Malignant (cancerous) or Benign (non-cancerous), based on measurements of its cell nuclei.
# So it is a Binary classification problem.
# about the dataset here https://scikit-learn.org/dev/datasets/index.html#breast-cancer-dataset
cancer = load_breast_cancer()

X = cancer.data
y = cancer.target

# split data into training and test set, for both features and target
# (default split: 75% train / 25% test; no random_state is set here, so results vary between runs)
train_X, test_X, train_y, test_y = train_test_split(X, y)

xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
# objective="binary:logistic" trains the Classifier with the logistic loss, so it outputs a probability for the positive class

xgb_model.fit(train_X, train_y)

# predicting on test data
y_pred = xgb_model.predict(test_X)

print(confusion_matrix(test_y, y_pred))
print(accuracy_score(test_y, y_pred))  # your accuracy could be between 93 to 99%, which is very good

[[51  2]
 [ 1 89]]
0.9790209790209791


In [None]:
# Code Example 3 : Use XGBClassifier (i.e. a classifier model) to predict which class a Wine sample belongs to
# So it is a Multiclass classification problem.
# This Wine_Dataset has just 3 Classes with [59,71,48] samples per class.
# description of this dataset here : https://scikit-learn.org/stable/datasets/index.html#wine-dataset
wine = load_wine()

X = wine.data
y = wine.target

# split data into training and test set, for both features and target
# (default split: 75% train / 25% test; no random_state is set here, so results vary between runs)
train_X, test_X, train_y, test_y = train_test_split(X, y)

xgb_model = xgb.XGBClassifier(objective="multi:softprob", random_state=42)  # random_state chosen as in Example 2
# objective="multi:softprob" tells the algorithm to calculate a probability for every possible outcome
# (in this case, one probability for each of the three wine classes) and to put the Wine sample into the
# class with the highest probability. Now, you may ask: "What if two classes get the same probability?"
# In such a case the sample is put in the first of those classes, e.g. if p=0.4 for both class_1 and class_2,
# then the sample is put into class_1.

xgb_model.fit(train_X, train_y)

y_pred = xgb_model.predict(test_X)

print(confusion_matrix(test_y, y_pred))
print(accuracy_score(test_y, y_pred))  # your accuracy could be between 93 to 99%, which is very good

[[16  0  0]
 [ 1 19  1]
 [ 0  0  8]]
0.9555555555555556
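
The class-selection rule described in the comments of Example 3 can be sketched as follows. The probability rows are hypothetical values invented for this illustration (they are not output from the wine model), and `pick_class` is a made-up helper name. The sketch assumes a plain "take the index of the largest probability" rule.

```python
# Hypothetical per-class probabilities, shaped like what a multi:softprob model
# produces for three samples and three classes (each row sums to 1).
probabilities = [
    [0.80, 0.15, 0.05],
    [0.10, 0.25, 0.65],
    [0.40, 0.40, 0.20],   # tie between class 0 and class 1
]


def pick_class(row):
    """Return the index of the highest probability; on a tie, the lowest index wins."""
    return max(range(len(row)), key=lambda i: row[i])


predicted = [pick_class(row) for row in probabilities]
print(predicted)  # [0, 2, 0]
```

The third row shows the tie case: both class 0 and class 1 have probability 0.40, and the first of them is selected.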


<font color='red'> Q> Would you get a more reliable performance estimate by testing the XGBRegressor using CROSS Validation ?
</font>  

In [None]:
# Code Example 4 : Use XGBRegressor over the same diabetes dataset using KFold Cross validation
# In case if you want to revise CROSS Validation , refer my NB here : https://colab.research.google.com/drive/19FP6eA0EK6SpbGcneuzMyHQfxVpBy-j4 

diabetes = load_diabetes()

X = diabetes.data
y = diabetes.target

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []

for train_index, test_index in kfold.split(X):   
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    xgb_model = xgb.XGBRegressor(objective="reg:squarederror")
    xgb_model.fit(X_train, y_train)
    
    y_pred = xgb_model.predict(X_test)
    
    scores.append(mean_squared_error(y_test, y_pred))
    
display_scores(np.sqrt(scores))


Scores: [55.30444573 55.59151472 63.44642064 57.82986083 58.71808276]
Mean: 58.178
Std: 2.937
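
As a quick check of what `display_scores` reports, the mean and standard deviation above can be recomputed from the five fold scores using only the standard library. This sketch assumes the population standard deviation, which is what `np.std` computes by default (`ddof=0`).

```python
import statistics

# Per-fold RMSE values copied from the output above.
fold_rmse = [55.30444573, 55.59151472, 63.44642064, 57.82986083, 58.71808276]

mean_rmse = statistics.mean(fold_rmse)
std_rmse = statistics.pstdev(fold_rmse)   # population std, like np.std with its default ddof=0

print("Mean: {0:.3f}".format(mean_rmse))  # Mean: 58.178
print("Std: {0:.3f}".format(std_rmse))    # Std: 2.937
```

The spread between folds (roughly 55 to 63) is a reminder that a single train/test split can land on an unusually easy or hard subset.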


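<font color='green'> You should get a more reliable performance estimate this time, because CROSS Validation scores the model on every fold of the data instead of on a single test split. Remember that CROSS Validation does not by itself remove Overfitting; it makes Overfitting easier to detect, since a model that only memorises its training folds scores poorly on the held-out fold. </font>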
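
To make the fold mechanics concrete, here is a minimal sketch of how k-fold splitting partitions the sample indices. `k_fold_indices` is a hypothetical helper written for this illustration; scikit-learn's `KFold` does the same job, and in Example 4 it additionally shuffles the rows because `shuffle=True` was set.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1                  # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in a test fold exactly once, so each score is computed on data the model did not see during that fold's training.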

In [None]:
# Redoing Code Example 4 using cross_val_score()

xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)

# cv=5 gives five folds, matching the five scores printed below
scores = cross_val_score(xgb_model, X, y, scoring="neg_mean_squared_error", cv=5)

display_scores(np.sqrt(-scores)) # we negate the scores because cross_val_score() returns negated MSE values
                                 # (scikit-learn scorers follow "higher is better"), so -scores makes them positive
                                 # scoring="neg_mean_squared_error" parameter was also discussed in CROSS Validation NB 

Scores: [56.04057166 56.14039793 60.3213523  59.67532995 60.7722925 ]
Mean: 58.590
Std: 2.071


Parameter Tuning
--
XGBoost has a few parameters that can dramatically affect accuracy and training speed. The <u>most important parameters</u> you should understand are:

**n_estimators** <br>
n_estimators specifies how many times to go through the modeling cycle described above. It is equal to the number of models that we include in the ensemble.

> Too low a value causes underfitting, which leads to inaccurate predictions on both training data and test data.

> Too high a value causes overfitting, which causes accurate predictions on training data, but inaccurate predictions on test data (which is what we care about).

**Typical values range from 100-1000**, though this depends a lot on the ***learning_rate*** parameter.
<hr>

**learning_rate** <br>
Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number (known as the learning rate) before adding them in.

This means each tree we add to the ensemble helps us less. So, we can ***set a higher value for n_estimators without overfitting.*** If we use <font color='darkblue'><b>early stopping</b>, the appropriate number of trees will be determined automatically. </font> ( <small>Extra reading : https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/ </small>.   <small><b>Remember :</b> It is generally a good idea to choose early_stopping_rounds as a reasonable fraction of the total number of training rounds (10% in the linked article's example).</small> )

<font color='darkgreen'><b>
In general, a small learning rate combined with a large number of estimators yields a more accurate XGBoost model, though training takes longer because it runs more iterations of the cycle. XGBoost's native default is learning_rate=0.3; older releases of its scikit-learn wrapper used learning_rate=0.1.</b></font>
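
A small piece of arithmetic shows why a lower learning rate calls for more estimators. The sketch below makes a strongly idealised assumption: every new model predicts the current residual exactly, so after scaling by the learning rate the remaining error shrinks by a factor of (1 - learning_rate) per round. Real trees never fit that perfectly, so treat the numbers as an intuition aid only. `rounds_needed` is a hypothetical helper name.

```python
def rounds_needed(learning_rate, target_fraction=0.01):
    """Rounds until the remaining residual drops to target_fraction of its start,
    assuming (idealised) that each new model fits the current residual exactly."""
    remaining = 1.0
    rounds = 0
    while remaining > target_fraction:
        remaining *= (1.0 - learning_rate)   # only a fraction of the correction is applied
        rounds += 1
    return rounds


for lr in (0.5, 0.1, 0.05):
    print("learning_rate =", lr, "-> rounds needed:", rounds_needed(lr))
```

Halving the learning rate roughly doubles the number of rounds needed, which is why a small learning_rate is usually paired with a large n_estimators.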

<hr>

**early_stopping_rounds**<br>
`early_stopping_rounds` offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model to stop iterating when the validation score stops improving, even if we are much below the value for `n_estimators`. It's smart to set a high value for n_estimators and then use early_stopping_rounds to find the optimal time to stop iterating.

Since random chance sometimes causes a single round where validation scores don't improve, you need to specify how many consecutive rounds without improvement to allow before stopping. Setting early_stopping_rounds=5 or 10 is a reasonable choice. With early_stopping_rounds=5, training stops after 5 consecutive rounds in which the validation score has not improved.

When using early_stopping_rounds, you also need to set aside some data for calculating the validation scores - this is done by setting the `eval_set` parameter. Note that the snippets in this NB pass early_stopping_rounds to fit(), which works in older XGBoost releases; from XGBoost 2.0 onwards it has to be passed to the model constructor instead.  **For example:**

<font color='blue'><i>
my_model = XGBRegressor(n_estimators=500) <br>
my_model.fit(X_train, y_train, <br>
             early_stopping_rounds=5, <br> 
             eval_set=[(X_valid, y_valid)], <br>
             verbose=False)
</i></font>

<hr>

**n_jobs**

On larger datasets where runtime is a consideration, you can use `parallelism` to build your models faster. It's common to set the parameter `n_jobs` equal to the `number of cores` on your machine. On smaller datasets, this won't help.

The resulting model won't be any better.  But, it's useful in large datasets where you would otherwise `spend a long time` waiting during the fit command.

Here's the modified example: <br>
<font color='blue'><i>
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4) <br>
my_model.fit(X_train, y_train,  <br>
             early_stopping_rounds=5,  <br>
             eval_set=[(X_valid, y_valid)], <br>
             verbose=False)
             </i></font>

In [None]:
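# Code Example 5 : Use XGBClassifier with Parameters. This can shorten training and may give better performance too.

# n_estimators => the number of boosted trees to train
# early_stopping_rounds => training stops once the validation metric has not improved for n consecutive rounds
# eval_set=[(X_test, y_test)] => the data on which the validation metric is computed after every round
# eval_metric="error"  => the metric on which the stopping decision is based
#              error => Binary classification error rate. It is calculated as #(wrong cases)/#(all cases)

# so if we keep n_estimators = 150 (default is 100) and early_stopping_rounds = 5, then
# training stops as soon as the error rate has failed to improve for 5 consecutive rounds.

cancer = load_breast_cancer()

X = cancer.data
y = cancer.target

# default split: 75% train / 25% test (no random_state is set here, so your log will differ from the one below)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# if more than one evaluation metric is given, the last one is used for early stopping
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=10, eval_metric="error", n_estimators=150)

# NOTE: passing early_stopping_rounds to fit() matches the older XGBoost release this NB was run on;
# from XGBoost 2.0 onwards, pass early_stopping_rounds=5 to the XGBClassifier constructor instead.
xgb_model.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_test, y_test)])

y_pred = xgb_model.predict(X_test)

print("-------------- accuracy_score is ----------------")
accuracy_score(y_test, y_pred)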

[0]	validation_0-error:0.076923
Will train until validation_0-error hasn't improved in 5 rounds.
[1]	validation_0-error:0.076923
[2]	validation_0-error:0.076923
[3]	validation_0-error:0.06993
[4]	validation_0-error:0.06993
[5]	validation_0-error:0.055944
[6]	validation_0-error:0.062937
[7]	validation_0-error:0.055944
[8]	validation_0-error:0.048951
[9]	validation_0-error:0.041958
[10]	validation_0-error:0.041958
[11]	validation_0-error:0.034965
[12]	validation_0-error:0.034965
[13]	validation_0-error:0.041958
[14]	validation_0-error:0.034965
[15]	validation_0-error:0.041958
[16]	validation_0-error:0.041958
Stopping. Best iteration:
[11]	validation_0-error:0.034965

-------------- accuracy_score is ----------------


0.965034965034965
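
The stopping decision visible in the log above can be replayed with a short sketch. It is a simplified model of the rule (stop once the metric has gone a given number of rounds without a strict improvement), not XGBoost's actual implementation. `early_stop` and `patience` are names invented for this illustration.

```python
# Validation errors copied from the training log above (rounds 0 to 16).
errors = [0.076923, 0.076923, 0.076923, 0.06993, 0.06993, 0.055944,
          0.062937, 0.055944, 0.048951, 0.041958, 0.041958, 0.034965,
          0.034965, 0.041958, 0.034965, 0.041958, 0.041958]


def early_stop(scores, patience):
    """Return (stop_round, best_round, best_score) for a lower-is-better metric."""
    best_round, best_score = 0, scores[0]
    for round_no, score in enumerate(scores):
        if score < best_score:                    # a strict improvement resets the wait
            best_round, best_score = round_no, score
        elif round_no - best_round >= patience:   # waited long enough without improvement
            return round_no, best_round, best_score
    return len(scores) - 1, best_round, best_score


stop_round, best_round, best_score = early_stop(errors, patience=5)
print(stop_round, best_round, best_score)  # 16 11 0.034965
```

Rounds 12 and 14 only tie the best error of round 11, so they do not count as improvements. After five such rounds, training stops at round 16 and round 11 is reported as the best iteration, matching the log.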

<font color='darkblue'><b> Important </b> </font> <br>
**xgb_model.fit()** will produce a model from the last iteration, ***not the best one***, so to get the optimum model consider `retraining` with n_estimators set to **xgb_model.best_iteration + 1** (iterations are counted from 0, so best iteration 11 means 12 trees).

In [None]:
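# best_ntree_limit is available in the older XGBoost release used here; it was removed in XGBoost 2.0
print("best score: {0}, best iteration: {1}, best ntree limit {2}".format(xgb_model.best_score, xgb_model.best_iteration, xgb_model.best_ntree_limit))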

best score: 0.034965, best iteration: 11, best ntree limit 12


In [None]:
# best ntree limit 12 => means you should fix the n_estimators parameter to 12 in XGBClassifier
xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=10, eval_metric="error", n_estimators=12) 

xgb_model.fit(X_train, y_train, eval_set=[(X_test, y_test)])

y_pred = xgb_model.predict(X_test)

print("-------------- accuracy_score is ----------------")
accuracy_score(y_test, y_pred)

[0]	validation_0-error:0.076923
[1]	validation_0-error:0.076923
[2]	validation_0-error:0.076923
[3]	validation_0-error:0.06993
[4]	validation_0-error:0.06993
[5]	validation_0-error:0.055944
[6]	validation_0-error:0.062937
[7]	validation_0-error:0.055944
[8]	validation_0-error:0.048951
[9]	validation_0-error:0.041958
[10]	validation_0-error:0.041958
[11]	validation_0-error:0.034965
-------------- accuracy_score is ----------------


0.965034965034965

Wow !!
--

You have done a lot by now. Let me recap :
<font color = 'green'> <br>
1. You understood the steps in Gradient Boosting, i.e. add a new model to the ensemble in each round until we get a good performance metric.

2. Then you used the XGBoost library, specifically two of its classes: <b>XGBRegressor</b> and <b>XGBClassifier</b>.

3. You ran through some code examples : <br>
<font color = 'green'>Example 1 :</font> Used XGBRegressor (i.e. a regression model) to predict diabetes progression.
<br><font color = 'green'>Example 2 :</font> Used XGBClassifier (i.e. a classifier model) to predict whether a tumour is malignant or benign. This was a <b>Binary Classification Problem</b>
<br><font color = 'green'>Example 3 :</font> Again used XGBClassifier, this time over the Wine_dataset, to classify each sample into one of the three classes. This was a <b>Multiclass Classification Problem</b>
<br><font color = 'green'>Example 4 :</font> You smartly applied CROSS VALIDATION <br>
<b> You learned the 4 most used parameters, which when <u>tuned</u> improve the performance of XGBoost </b>
<br><font color = 'green'>Example 5 :</font>  You smartly applied the parameters and <b><u>tuned</u></b> the Model.
</font>

Resources & References
--
1. <a href='https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e'>Good Blog article_1</a>

2. <a href='https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/'>Good Blog article_2</a>

3. <a href='https://xgboost.readthedocs.io/en/latest/parameter.html'>Official Docs of XGBoost parameters</a>

4. <a href='https://xgboost.readthedocs.io/en/latest/tutorials/index.html'>Official Docs on XGBoost tutorials and Examples</a>
