----
# XGBoost practice
---


### Dataset
We will again use the [Pima Indians Diabetes dataset](pima-indians-diabetes.csv), whose description is [here](http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names). It records measurements for several hundred patients, together with an indication of whether or not they tested positive for diabetes (the class label). The classification task is therefore to predict whether a patient will test positive for diabetes, based on the various measurements.

The dataset has been downloaded as a csv file in the current directory. 

## Exercise 0: prelude
Install and import necessary modules and functions including: 
 * pandas for loading and parsing data. 
 * `cross_val_score` from sklearn to do cross validation automatically. 
 * `xgboost` because we are using this model today. 

Load the pima data into a pandas dataframe. Do some exploration to gain some understanding of the dataset. 
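
One possible way to do the loading step is sketched below. It assumes the CSV has no header row and that the class label is the last of nine columns; the column names and the `load_pima` helper are hypothetical labels introduced here, loosely following the attribute order in the dataset description. If your copy of the file already has a header row, drop the `header` and `names` arguments.

```python
import pandas as pd

# Assumed column labels (not part of the file itself): eight measurements
# followed by the class label, in the order given by the dataset description.
COLUMN_NAMES = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "diabetes_pedigree", "age", "class",
]


def load_pima(source="pima-indians-diabetes.csv"):
    """Read the CSV (path or file-like object) and split features from the label."""
    df = pd.read_csv(source, header=None, names=COLUMN_NAMES)
    data = df.drop(columns="class")   # the eight measurement columns
    classlabel = df["class"]          # 1 = tested positive, 0 = tested negative
    return data, classlabel
```

Calling `data, classlabel = load_pima()` yields the two objects used in the later cells; `data.describe()` and `classlabel.value_counts()` are quick first steps for exploration.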

## Exercise 1: Define `xgboost` model
Define an `xgboost` model for our data. Since we are doing (binary) classification, we will use `XGBClassifier`. 
Do: 
 * Read the docs:
   * [Get started](https://xgboost.readthedocs.io/en/latest/get_started.html)
   * [XGBClassifier]
 * Take note of the various parameters.
 * Instantiate the classifier. 

### Performance evaluation
a) Report the AUC of an XGBoost classifier on the dataset using 10-fold cross-validation.

b) Manually vary each of the following XGBoost parameters and observe the effect on AUC:

 * `max_depth`
 * `learning_rate`
 * `n_estimators`
 * `gamma`
 * `min_child_weight`
 * `reg_lambda`

## Exercise 2: Effect of hyperparameters
The following example illustrates how to explore a classifier's performance as we vary one hyperparameter (`max_depth` in this case) by plotting the parameter value against the performance score.


In [None]:
import matplotlib.pyplot as plt  # needed for the plot at the end of this cell

# Define a range of values for the hyperparameter.
max_depths = range(1,15,2)

# A list to accumulate the performance measure. We use ROC AUC here; change it as you wish.
performance_scores=[]

# Define the model. 
xgboostmodel = XGBClassifier(
    n_estimators=100,
    eta=0.3,
    min_child_weight=1,
    max_depth=3, 
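    # the next two arguments silence deprecation warnings on older xgboost releases;
    # `use_label_encoder` is itself deprecated in recent releases and can be dropped there
    eval_metric='logloss',  # log loss for binary labels ('mlogloss' is the multiclass variant)
    use_label_encoder=False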
)

# In a loop, assign a different hyperparameter value to the model and record the performance measure.
for m_depth in max_depths:
    # reassign the hyperparameter
    xgboostmodel.set_params(max_depth=m_depth)
    
    # Compute the performance score (mean ROC AUC over 10-fold cross-validation)
    score = cross_val_score(xgboostmodel, data, classlabel, cv=10, scoring='roc_auc').mean()
    
    # print it out if you wish
    # print("10-fold cross-validation AUC =", score)
    performance_scores.append(score)

plt.plot(max_depths, performance_scores)
plt.xlabel('Max-Depth', fontsize=16)
plt.ylabel('AUC', fontsize=16)

## Your turn
Repeat the above for different hyperparameters and perhaps different performance measures (AUC, accuracy, F1, etc.).

Do it at least for the following hyperparameters:  

`eta [default=0.3]`

    - Analogous to learning rate in GBM; `XGBClassifier` exposes it as `learning_rate` (`eta` is the native alias)
    - Makes the model more robust by shrinking the weights on each step
    - Typical final values to be used: 0.01-0.2
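
For reference, the shrinkage that `eta` applies can be written as a one-line update rule. The notation below ($\hat{y}_i^{(t)}$ for the prediction after $t$ trees, $f_t$ for the tree added at step $t$) is an assumption chosen for illustration, not taken from this notebook:

$$
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i)
$$

With $\eta = 0.3$ each new tree contributes 30% of its fitted correction, so smaller values usually need a larger `n_estimators` to reach the same fit.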
    
    
`n_estimators` 
(covered before)



`min_child_weight [default=1]`
    - Defines the minimum sum of instance weights (the “sum of weights” of the observations) required in a child node.
    - Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
    - Values that are too high can lead to under-fitting, so it should be tuned using CV.
    
In short: stop trying to split once the (weighted) sample size in a node goes below a given threshold.



`gamma [default=0]`
    - A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
    - Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.

In short: the complexity cost of introducing an additional leaf.


`lambda [default=1]`
    - L2 regularization term on weights (analogous to Ridge regression)
    - This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
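
The roles of `min_child_weight`, `gamma` and `lambda` can be seen together in the standard XGBoost split-gain expression, sketched here as a reference. The symbols are assumptions introduced for illustration: $G_L, G_R$ are the sums of first derivatives of the loss over the instances in the left and right child, and $H_L, H_R$ are the corresponding sums of second derivatives.

$$
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
$$

A split is kept only when the gain is positive, so a larger $\gamma$ prunes more splits and a larger $\lambda$ shrinks every term in the brackets. `min_child_weight` additionally requires $H_L$ and $H_R$ to each reach the threshold. As a small worked example with made-up numbers, $G_L=-4$, $H_L=2$, $G_R=3$, $H_R=2$ and $\lambda=1$ give $\tfrac{1}{2}\left[\tfrac{16}{3}+\tfrac{9}{3}-\tfrac{1}{5}\right]\approx 4.07$, so this split survives only if $\gamma < 4.07$.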
    
    
`max_leaf_nodes`
    - The maximum number of terminal nodes or leaves in a tree.
    - Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
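    - If this is defined, GBM will ignore `max_depth`. Note that `max_leaf_nodes` is the scikit-learn GBM name; the corresponding XGBoost parameter is `max_leaves`.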
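
The loop from Exercise 2 can be generalised so that the same code serves every hyperparameter listed above. The sketch below is one way to do it; `sweep_hyperparameter` is a hypothetical helper name, and it works with any scikit-learn compatible estimator rather than being specific to XGBoost.

```python
from sklearn.base import clone
from sklearn.model_selection import cross_val_score


def sweep_hyperparameter(model, name, values, X, y, scoring="roc_auc", cv=10):
    """Return the mean cross-validated score for each candidate value of `name`."""
    scores = []
    for value in values:
        # Work on a fresh copy so the caller's model is left unchanged.
        candidate = clone(model).set_params(**{name: value})
        fold_scores = cross_val_score(candidate, X, y, cv=cv, scoring=scoring)
        scores.append(fold_scores.mean())
    return scores
```

For instance, `sweep_hyperparameter(xgboostmodel, "learning_rate", [0.01, 0.05, 0.1, 0.3], data, classlabel)` returns one mean AUC per value, ready to pass to `plt.plot`; setting `scoring="accuracy"` or `scoring="f1"` switches the performance measure.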

## Exercise 3: Grid Search
Sklearn provides `GridSearchCV` to systematically search for a good **combination** of hyperparameters.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
xgboostmodel = XGBClassifier(
    n_estimators=50,
    eta=0.3,
    min_child_weight=1,
    max_depth=3, 
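    # the next two arguments silence deprecation warnings on older xgboost releases;
    # `use_label_encoder` is itself deprecated in recent releases and can be dropped there
    eval_metric='logloss',  # log loss for binary labels ('mlogloss' is the multiclass variant)
    use_label_encoder=False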
)

n_estimators = [50,100,200,300]
max_depth = [2,4,6,8,10]
tuned_parameters = dict(max_depth=max_depth, n_estimators=n_estimators)

kfold = StratifiedKFold(n_splits=10, random_state=7, shuffle=True)

clf = GridSearchCV(xgboostmodel, tuned_parameters, cv=kfold, scoring='roc_auc')

clf.fit(data,classlabel)

print("Best parameters set found on development set:")
print(clf.best_params_)
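
Beyond the single best combination, it can help to look at the score for every point on the grid. The sketch below turns the `cv_results_` of a fitted `GridSearchCV` into a two-way table; `summarize_grid_search` is a hypothetical helper name, and it assumes the grid varies exactly two parameters, as the one above does.

```python
import pandas as pd


def summarize_grid_search(search, row_param, col_param):
    """Pivot a fitted GridSearchCV's mean test scores into a 2-D table."""
    results = pd.DataFrame(search.cv_results_)
    return results.pivot(
        index=f"param_{row_param}",      # one row per value of the first parameter
        columns=f"param_{col_param}",    # one column per value of the second
        values="mean_test_score",        # mean score across the CV folds
    )
```

With the search above, `summarize_grid_search(clf, "max_depth", "n_estimators")` gives the mean AUC for each depth and tree count, and `clf.best_score_` reports the score that belongs to `clf.best_params_`.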
