## Overview  
This notebook implements **XGBoost** on the Iris dataset and tunes its hyperparameters with **GridSearchCV**.

## Problem Statement
This notebook is intended as code-along and self-study reference material. It combines original work with material from a [Towards Data Science](https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7) article.

## Table of Contents  

* [Gradient Boosting Theory](#theory)
* [Import Libraries](#import_libraries)
* [Import Data](#import_data)
* [Splitting the data into training and testing sets](#split_data)
* [Restructure data into DMatrix](#DMatrix)
* [Define Parameters](#param)
* [Create and Train the Model](#train_model)
* [Evaluation Metrics](#eval_metrics)
* [Gridsearch](#grid_search)
* [Predictions from Gridsearch Model](#gs_predict)

<a class="anchor" id="theory"></a>
## Gradient Boosting Theory

**`Boosting`** is an ensemble technique that combines many models. Rather than training the models in isolation from one another, **boosting trains models in succession**, with **each new model trained to correct the errors made by the previous ones**. Models are added sequentially until a fixed number of rounds is reached or no further improvement is seen.

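**`Gradient Boosting`** is an approach where each new model is trained to predict the residual errors of the ensemble built so far (more generally, the negative gradient of the loss). The new model's predictions are then added to the ensemble's predictions.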
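
The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not how XGBoost is implemented: it assumes a squared-error loss, a single feature, and one-split "stumps" as the weak learners. The helper names (`fit_stump`, `predict_stump`, `boost`) and the synthetic sine data are made up for this example.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    prediction = np.full_like(y, y.mean(), dtype=float)  # start from the mean
    history = []
    for _ in range(n_rounds):
        residuals = y - prediction        # errors of the ensemble so far
        stump = fit_stump(x, residuals)   # the new model is trained on those errors
        prediction = prediction + learning_rate * predict_stump(stump, x)
        history.append(float(np.mean((y - prediction) ** 2)))
    return prediction, history

# Synthetic data, chosen only for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

_, mse_history = boost(x, y)
print(f"MSE after round 1: {mse_history[0]:.4f}, after round 20: {mse_history[-1]:.4f}")
```

The training error shrinks from round to round because every stump is fitted to whatever the previous rounds left unexplained.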


<a class="anchor" id="import_libraries"></a>
## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
%matplotlib inline

<a class="anchor" id="import_data"></a>
## Import Data

For this project we use the Iris dataset that ships with scikit-learn:

In [10]:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

The iris dataset contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset:

    Iris-setosa (n=50)
    Iris-versicolor (n=50)
    Iris-virginica (n=50)

The four features of the Iris dataset:

    sepal length in cm
    sepal width in cm
    petal length in cm
    petal width in cm
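
The class counts and feature names listed above can be checked directly on the object returned by `load_iris`. A short sketch, assuming scikit-learn and NumPy are installed; the variable name `class_counts` is only for this example:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

print(iris.data.shape)            # (number of samples, number of features)
print(list(iris.target_names))    # species names, indexed by class label
print(iris.feature_names)         # the four measurements

# Number of samples per class label (0, 1, 2)
class_counts = np.bincount(iris.target)
print(class_counts)
```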

<a class="anchor" id="split_data"></a>
## Splitting the data into training and testing sets

In [12]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
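
The call above shuffles the data without a fixed seed, so every run produces a different split, and the classes can end up unevenly represented (the test set in this notebook's run contains 14, 4, and 12 samples per class). A possible variant, shown here only as a sketch, fixes the seed and stratifies by label. Both `random_state=42` and `stratify=y` are additions that the original notebook does not use, so the metrics reported below would not be reproduced exactly with this split. The `_s` suffix on the variable names is just to keep them separate from the notebook's own variables.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both subsets;
# random_state makes the shuffle repeatable (42 is an arbitrary choice)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

test_counts = np.bincount(y_test_s)
print(test_counts)  # samples per class in the test set
```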

<a class="anchor" id="DMatrix"></a>
## Restructure data into DMatrix

In [16]:
D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

<a class="anchor" id="param"></a>
## Define Parameters

See the official documentation: [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html)

In [17]:
param = {
    'eta': 0.3, 
    'max_depth': 3,  
    'objective': 'multi:softprob',  
    'num_class': 3} 

steps = 20  # The number of training iterations

* **eta** (also called the learning rate): scales the contribution of each new tree before it is added to the ensemble. Smaller values shrink each tree's weight, which makes boosting more conservative and helps prevent overfitting.
* **max_depth**: the maximum depth of the decision trees being trained
* **objective**: the loss function being used; `multi:softprob` outputs a probability for each class
* **num_class**: the number of classes in the dataset
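
To get a feel for what `eta` does, consider an idealized case in which every new tree predicts the current residual exactly and only a fraction `eta` of that prediction is added. Real trees limited by `max_depth` will not fit the residuals perfectly, so this is a simplifying assumption, and `remaining_error` is a hypothetical helper written for this sketch:

```python
def remaining_error(eta, rounds, initial_error=1.0):
    """Error left after `rounds` boosting steps in the idealized case
    where each tree predicts the current residual exactly."""
    error = initial_error
    for _ in range(rounds):
        error -= eta * error  # only a fraction eta of the correction is applied
    return error

for eta in (1.0, 0.3, 0.05):
    print(f"eta={eta}: remaining error after 20 rounds = {remaining_error(eta, 20):.6f}")
```

Under this assumption the error decays as `(1 - eta) ** rounds`. A small `eta` therefore needs more rounds to reach the same training error, which is the trade-off between `eta` and `steps` in the cell above.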

<a class="anchor" id="train_model"></a>
## Create and Train the Model

In [18]:
model = xgb.train(param, D_train, steps)



<a class="anchor" id="eval_metrics"></a>
## Evaluation Metrics

In [20]:
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, accuracy_score

Predict on the test data, then compare the predictions to the actual values.

In [22]:
predictions = model.predict(D_test)
best_preds = np.argmax(predictions, axis=1)  # index of the most probable class in each row
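
With the `multi:softprob` objective, `predict` returns one row per sample and one column per class, and the predicted label is the column with the highest probability. The conversion can be illustrated with a small array; the probabilities below are invented for the example and are not output from the model:

```python
import numpy as np

# Hypothetical class probabilities for three samples (each row sums to 1)
example_probs = np.array([
    [0.90, 0.07, 0.03],
    [0.10, 0.25, 0.65],
    [0.20, 0.70, 0.10],
])

# Pick the column index of the largest value in each row
example_labels = np.argmax(example_probs, axis=1)
print(example_labels)  # [0 2 1]

# The row-by-row version gives the same result
row_by_row = np.asarray([np.argmax(row) for row in example_probs])
```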

In [24]:
print("Precision = {}".format(precision_score(y_test, best_preds, average='macro')))
print("Recall = {}".format(recall_score(y_test, best_preds, average='macro')))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Precision = 0.9333333333333332
Recall = 0.9722222222222222
Accuracy = 0.9666666666666667


In [25]:
print(confusion_matrix(y_test,best_preds))
print('\n')
print(classification_report(y_test,best_preds))

[[14  0  0]
 [ 0  4  0]
 [ 0  1 11]]


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       0.80      1.00      0.89         4
           2       1.00      0.92      0.96        12

    accuracy                           0.97        30
   macro avg       0.93      0.97      0.95        30
weighted avg       0.97      0.97      0.97        30
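
The macro-averaged scores printed above can be recomputed by hand from the confusion matrix, which helps when reading the report. In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. The sketch below copies the matrix from the output above; the variable names are chosen for this example:

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([[14, 0, 0],
               [0, 4, 0],
               [0, 1, 11]])

true_positives = np.diag(cm)

# Precision: correct predictions of a class / everything predicted as that class
precision_per_class = true_positives / cm.sum(axis=0)
# Recall: correct predictions of a class / everything that truly is that class
recall_per_class = true_positives / cm.sum(axis=1)

# 'macro' averaging is the unweighted mean over classes
macro_precision = precision_per_class.mean()
macro_recall = recall_per_class.mean()
accuracy = true_positives.sum() / cm.sum()

print(f"Precision = {macro_precision:.4f}")
print(f"Recall = {macro_recall:.4f}")
print(f"Accuracy = {accuracy:.4f}")
```

The single misclassified sample (a true class 2 predicted as class 1) lowers the precision of class 1 to 4/5 and the recall of class 2 to 11/12, which accounts for all three scores.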



<a class="anchor" id="grid_search"></a>
## GridSearchCV

In [27]:
from sklearn.model_selection import GridSearchCV

# Create an instance of the model to use
clf = xgb.XGBClassifier()

# Define the parameter values to test
parameters = {
    "eta": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

# Instantiate GridSearchCV with the model and parameters
grid = GridSearchCV(clf,
                    parameters,
                    n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

# Fit the estimator to the training data
grid.fit(X_train, y_train)





GridSearchCV(cv=3,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs...
                                     num_parallel_tree=None, random_state=None,
                                     reg_alpha=None, reg_lambda=None,
                                     scale_pos_weight=None, subsample=None,
                                     tree_method=None, va
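
Before running a grid search it is worth estimating how many models it will train, since `GridSearchCV` tries every combination of the listed values on every cross-validation fold. The count for the grid above can be worked out with the standard library alone; the dictionary is copied from the cell above under the name `param_grid_example`, chosen for this sketch:

```python
from math import prod

# Same values as the grid defined above
param_grid_example = {
    "eta": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}
cv_folds = 3  # matches cv=3 above

# Every combination is evaluated once per fold
n_combinations = prod(len(values) for values in param_grid_example.values())
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 3840 11520
```

With the default `refit=True`, one more model is trained on the full training set using the best combination. On a dataset as small as Iris this is manageable, but the count grows multiplicatively with every value added to the grid.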

Inspect the best parameter combination, stored in the `best_params_` attribute:

In [28]:
grid.best_params_

{'colsample_bytree': 0.5,
 'eta': 0.3,
 'gamma': 0.3,
 'max_depth': 3,
 'min_child_weight': 1}

Inspect the best estimator, stored in the `best_estimator_` attribute:

In [30]:
grid.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, eta=0.3, gamma=0.3,
              gpu_id=-1, importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

<a class="anchor" id="gs_predict"></a>
## Predictions from Gridsearch Model

In [31]:
grid_predictions = grid.predict(X_test)

In [35]:
print('Predictions from model with GridSearchCV parameters:')
print('\n')
print(confusion_matrix(y_test,grid_predictions))
print('\n')
print(classification_report(y_test,grid_predictions))
print('\n')
print("Precision = {}".format(precision_score(y_test, grid_predictions, average='macro')))
print("Recall = {}".format(recall_score(y_test, grid_predictions, average='macro')))
print("Accuracy = {}".format(accuracy_score(y_test, grid_predictions)))

Predictions from model with GridSearchCV parameters:


[[14  0  0]
 [ 0  4  0]
 [ 0  2 10]]


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       0.67      1.00      0.80         4
           2       1.00      0.83      0.91        12

    accuracy                           0.93        30
   macro avg       0.89      0.94      0.90        30
weighted avg       0.96      0.93      0.94        30



Precision = 0.8888888888888888
Recall = 0.9444444444444445
Accuracy = 0.9333333333333333


In [36]:
print('Predictions from model with manually defined parameters (above):')
print('\n')
print(confusion_matrix(y_test,best_preds))
print('\n')
print(classification_report(y_test,best_preds))
print('\n')
print("Precision = {}".format(precision_score(y_test, best_preds, average='macro')))
print("Recall = {}".format(recall_score(y_test, best_preds, average='macro')))
print("Accuracy = {}".format(accuracy_score(y_test, best_preds)))

Predictions from model with manually defined parameters (above):


[[14  0  0]
 [ 0  4  0]
 [ 0  1 11]]


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       0.80      1.00      0.89         4
           2       1.00      0.92      0.96        12

    accuracy                           0.97        30
   macro avg       0.93      0.97      0.95        30
weighted avg       0.97      0.97      0.97        30



Precision = 0.9333333333333332
Recall = 0.9722222222222222
Accuracy = 0.9666666666666667
