<a href="https://colab.research.google.com/github/aminmohebbi11/mcmedhacks_2022/blob/main/Case_Study_XGBOOST_(June_15).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment Week 1-2: Case Study XGBOOST (June 15)

Created by: **Hosein Beheshtifard & Aref Motamedi**

**Overview**:

XGBoost (eXtreme Gradient Boosting) is an open-source software library that provides a regularizing gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It is an optimized, distributed library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately.
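To make the idea of gradient boosting concrete, here is a minimal sketch of boosting with squared-error loss, where each new shallow tree is fit to the residuals of the current ensemble. It is an illustration on synthetic data only, not XGBoost's actual implementation (which adds regularization, second-order gradient information, and many optimizations). The names `X_toy`, `y_toy`, and `mse_history` are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative data, not the diabetes dataset)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the targets)
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Add a damped version of the new tree's prediction to the ensemble
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

The `learning_rate` and `max_depth` values play the same role here as the XGBoost hyperparameters of the same name used later in this notebook.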

![](https://upload.wikimedia.org/wikipedia/commons/6/69/XGBoost_logo.png)

![](https://miro.medium.com/max/777/1*l4PN8hyAO4fMLxUbIxcETA.png)



To get a better understanding of this library, let's test it on a small, real-world project.

<h2> Dataset </h2>

We will use a clean dataset of 70,692 survey responses to the CDC's BRFSS2015. It has an equal 50-50 split of respondents with no diabetes and with either prediabetes or diabetes. The target variable Diabetes_binary has 2 classes. 0 is for no diabetes, and 1 is for prediabetes or diabetes. This dataset has 21 feature variables and is balanced.



Link to download the dataset: https://drive.google.com/file/d/1UXvhGWwUApkDEX9Tt4enQ5l1D6jNBPb4/view?usp=sharing

<h2> Importing libraries and packages </h2>

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb

from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, cross_val_score, train_test_split, StratifiedKFold, GridSearchCV

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, log_loss
from sklearn.metrics import roc_curve, roc_auc_score, auc
from xgboost import XGBClassifier

plt.style.use("Solarize_Light2")

import warnings
warnings.filterwarnings("ignore")

<h2> Loading the dataset </h2>


In [2]:
# load the dataset
from google.colab import files
uploaded = files.upload()

Saving Copy of diabetes_binary_5050split_health_indicators_BRFSS2015.csv to Copy of diabetes_binary_5050split_health_indicators_BRFSS2015.csv


In [3]:
data = pd.read_csv("Copy of diabetes_binary_5050split_health_indicators_BRFSS2015.csv")
data

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,3.0,5.0,30.0,0.0,1.0,4.0,6.0,8.0
1,0.0,1.0,1.0,1.0,26.0,1.0,1.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,12.0,6.0,8.0
2,0.0,0.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,10.0,0.0,1.0,13.0,6.0,8.0
3,0.0,1.0,1.0,1.0,28.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,3.0,0.0,1.0,11.0,6.0,8.0
4,0.0,0.0,0.0,1.0,29.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70687,1.0,0.0,1.0,1.0,37.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,0.0,0.0,6.0,4.0,1.0
70688,1.0,0.0,1.0,1.0,29.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,2.0,0.0,0.0,1.0,1.0,10.0,3.0,6.0
70689,1.0,1.0,1.0,1.0,25.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,5.0,15.0,0.0,1.0,0.0,13.0,6.0,4.0
70690,1.0,1.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,11.0,2.0,4.0


<h2> Pre-processing </h2>

In this section, we prepare the data before modeling, which helps ensure or improve performance.
This step usually involves a variety of operations, such as normalization, data cleaning, handling missing values, and feature reduction.



In [4]:
data.isna().sum()

Diabetes_binary         0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64

*   Fortunately, our dataset does not contain any missing values.

Let's see how many unique values are in each attribute:



In [5]:
data.nunique()

Diabetes_binary          2
HighBP                   2
HighChol                 2
CholCheck                2
BMI                     80
Smoker                   2
Stroke                   2
HeartDiseaseorAttack     2
PhysActivity             2
Fruits                   2
Veggies                  2
HvyAlcoholConsump        2
AnyHealthcare            2
NoDocbcCost              2
GenHlth                  5
MentHlth                31
PhysHlth                31
DiffWalk                 2
Sex                      2
Age                     13
Education                6
Income                   8
dtype: int64

* Non-binary categorical features - GenHlth, Education, Income.
* Continuous features - BMI, MentHlth, PhysHlth, Age



In [6]:
cat_col = ['GenHlth', 'Education', 'Income']

In [7]:
num = ['BMI', 'MentHlth', 'PhysHlth', 'Age']

Let's one-hot encode the non-binary categorical features, so that each category becomes its own indicator column:


In [8]:
data = pd.get_dummies(data, columns=cat_col)
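As a small illustration of what `pd.get_dummies` does, the sketch below encodes a toy frame with a single categorical column. The frame `toy` and its values are invented for this example; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame: one numeric column and one categorical column stored as floats
toy = pd.DataFrame({
    "BMI": [26.0, 31.0, 22.0, 28.0],
    "GenHlth": [1.0, 3.0, 5.0, 3.0],
})

# Each distinct GenHlth value becomes its own indicator column;
# columns not listed in `columns` are left unchanged
encoded = pd.get_dummies(toy, columns=["GenHlth"])

print(list(encoded.columns))
print(encoded)
```

Only the categories that actually occur in the data get a column, which is why the real dataset grows from 21 to 37 feature columns after encoding `GenHlth`, `Education`, and `Income`.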

We can also divide the values of BMI (body mass index) into groups:

In [9]:
data['bmi_group'] = pd.cut(data['BMI'], (0, 16, 18.5, 25, 30, 35, 40, np.inf), labels=[1, 2, 3, 4, 5, 6, 7])
data.BMI = data['bmi_group']
data.drop('bmi_group', axis=1, inplace=True)
data.BMI = data.BMI.astype('float')
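To see how the binning behaves, the sketch below applies the same bin edges and labels to a handful of hand-picked BMI values (the sample values are illustrative). Note that `pd.cut` uses right-closed intervals by default, so a BMI of exactly 18.5 falls into group 2.

```python
import numpy as np
import pandas as pd

# Hand-picked BMI values, one per bin
bmi = pd.Series([15.0, 18.5, 22.0, 27.5, 33.0, 38.0, 45.0])

# Same edges and labels as in the cell above; intervals are (left, right]
edges = (0, 16, 18.5, 25, 30, 35, 40, np.inf)
groups = pd.cut(bmi, edges, labels=[1, 2, 3, 4, 5, 6, 7]).astype("float")

print(groups.tolist())
```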

Many other preprocessing operations such as normalization and feature reduction can be performed at this step, but for the purposes of this exercise, we will not go any further.
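For reference, normalization with `StandardScaler` (already imported above) could look like the following sketch. The array `toy_features` is invented for illustration. In a real pipeline the scaler would typically be fit on the training split only and then applied to the validation split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy numeric features on very different scales
toy_features = np.array([
    [26.0, 5.0],
    [31.0, 0.0],
    [22.0, 30.0],
    [28.0, 3.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy_features)

# After scaling, each column has mean 0 and unit standard deviation
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Tree-based models split on feature thresholds, so they are generally insensitive to this kind of rescaling, which is one reason skipping it here is reasonable.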

Finally, we need to separate the labels from the features:

In [10]:
X = data.drop(['Diabetes_binary'], axis=1)
y = data['Diabetes_binary']

<h2> XGBOOST </h2>


First, we split our dataset into training and validation sets. For each set, we need both the data (**X**) and the labels (**y**):


In [11]:
seed = 123 # a value for random_state
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=seed, shuffle=True, stratify=y)
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((56553, 37), (14139, 37), (56553,), (14139,))
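The `stratify=y` argument keeps the class proportions identical in both splits. A minimal sketch on synthetic, perfectly balanced labels (the arrays `X_toy` and `y_toy` are made up for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples with a 50-50 class split, like the diabetes dataset
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 50 + [1] * 50)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, shuffle=True, stratify=y_toy
)

# Both splits keep the original class ratio
print(X_tr.shape, X_va.shape)
print(y_tr.mean(), y_va.mean())
```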

You can see the parameters and default values of XGBoost Classifier:

In [12]:
xgb.XGBClassifier().get_params()

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1}

[Learn about XGBoost Parameters](https://xgboost.readthedocs.io/en/stable/parameter.html)

<h1> EXERCISE 1. </h1>

1. Create an XGBoost classifier model with the parameters defined below:

*   Learning rate = 0.3 (`learning_rate`)
*   Maximum depth of tree = 4 (`max_depth`)
*   `n_estimators` = 100
*   `subsample` = 0.5
*   `colsample_bytree` = 1
*   `random_state` = seed (defined earlier)
*   evaluation metric = 'auc' (`eval_metric`)

2. After creating the model, fit it to our data. (Set `early_stopping_rounds = 10` and `verbose=True`.)

3. Check the accuracy for the trained model.

In [13]:

# 1. Define the model
model_xgboost = XGBClassifier(max_depth=4, learning_rate=0.3, n_estimators=100,
                              colsample_bytree=1, subsample=0.5,
                              eval_metric='auc', random_state=seed)

# Evaluation sets; only needed when early stopping is enabled
eval_set = [(X_train, y_train), (X_valid, y_valid)]

# 2. Fit the model
# Early stopping needs an eval_set to monitor; this variant was left disabled in the recorded run:
# model_xgboost.fit(X_train, y_train, eval_set=eval_set, early_stopping_rounds=10, verbose=True)
model_xgboost.fit(X_train, y_train, verbose=True)

# 3. Check the result
from sklearn.metrics import classification_report
pred_train = model_xgboost.predict(X_train)
pred_valid = model_xgboost.predict(X_valid)
# Score each split against its own labels
print('Train Accuracy: ', accuracy_score(y_train, pred_train))
print('Validation Accuracy: ', accuracy_score(y_valid, pred_valid))
print('Classification Report (train):')
print(classification_report(y_train, pred_train))

Train Accuracy:  0.7613566035400421
Validation Accuracy:  0.7512553928849283
Classification Report (train):
              precision    recall  f1-score   support

         0.0       0.78      0.72      0.75     28276
         1.0       0.74      0.80      0.77     28277

    accuracy                           0.76     56553
   macro avg       0.76      0.76      0.76     56553
weighted avg       0.76      0.76      0.76     56553



<h2> Tuning hyperparameters </h2>

As the previous section showed, many hyperparameters are involved in creating a model. To achieve better results, we need to tune them.

<h2> EXERCISE 2. </h2>

Try different values for the most important hyperparameters to find the best combination. 

To tune the hyperparameters, use `GridSearchCV` to try different combinations; you can also use `my_roc_auc_score` as the evaluation function.

**Note:** In order to find the best set, it is important to monitor and analyze the results of changing each hyperparameter.
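scikit-learn accepts any callable with the signature `(estimator, X, y)` as a `scoring` argument, which is how a function like `my_roc_auc_score` can be plugged in. The sketch below checks, on synthetic data, that such a callable gives the same cross-validated scores as the built-in `'roc_auc'` scorer. `GaussianNB` is only a lightweight stand-in for the XGBoost model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

def my_roc_auc_score(model, X, y):
    """ROC AUC computed from the predicted probability of the positive class."""
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

# Synthetic data and a fixed, reproducible set of folds
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=123)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

custom_scores = cross_val_score(GaussianNB(), X_toy, y_toy,
                                scoring=my_roc_auc_score, cv=kfold)
builtin_scores = cross_val_score(GaussianNB(), X_toy, y_toy,
                                 scoring="roc_auc", cv=kfold)

print(custom_scores.round(3))
print(builtin_scores.round(3))
```

The same callable can be passed to `GridSearchCV(scoring=...)`, either directly or as one entry of a scoring dictionary.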

In [14]:
# you can use different values for the below hyperparameters.
learning_rate_list = [0.02, 0.1, 0.3]
max_depth_list = [2, 3, 4]
n_estimators_list = [100, 200, 300]

params_dict = {"learning_rate": learning_rate_list,
               "max_depth": max_depth_list,
               "n_estimators": n_estimators_list}

num_combinations = 1
for v in params_dict.values(): num_combinations *= len(v) 

print(num_combinations)
params_dict

27


{'learning_rate': [0.02, 0.1, 0.3],
 'max_depth': [2, 3, 4],
 'n_estimators': [100, 200, 300]}
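The count of 27 comes from the Cartesian product of the three value lists. As a sketch, the grid can be enumerated explicitly with the standard library. This only mimics what a grid search iterates over; it is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

params_dict = {
    "learning_rate": [0.02, 0.1, 0.3],
    "max_depth": [2, 3, 4],
    "n_estimators": [100, 200, 300],
}

# Build one dict per combination of hyperparameter values
names = list(params_dict)
combinations = [dict(zip(names, values)) for values in product(*params_dict.values())]

print(len(combinations))
print(combinations[0])
print(combinations[-1])
```

With k-fold cross-validation, every combination is trained k times, so the total number of fits grows quickly as values are added to any list.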

In [35]:
from sklearn.metrics import make_scorer

# Custom scorer: ROC AUC computed from the predicted probability of class 1
def my_roc_auc_score(model, X, y):
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

model = XGBClassifier(subsample=0.5,
                      colsample_bytree=1,
                      random_state=seed,
                      eval_metric='auc')

# Track two metrics during the search.
# Note: despite its name, the 'Accuracy' entry applies roc_auc_score to hard
# class predictions; use the string 'accuracy' to score plain accuracy instead.
scoring = {
    'AUC': 'roc_auc',
    'Accuracy': make_scorer(roc_auc_score)
}
num_folds = 10
kfold = StratifiedKFold(n_splits=num_folds, random_state=seed, shuffle=True)
grid = GridSearchCV(estimator=model, param_grid=params_dict, scoring=scoring,
                    n_jobs=-1, refit="AUC", cv=kfold)

# Fit the grid search. Note: the recorded run searches on the validation split
# (X_valid, y_valid); searching on (X_train, y_train) is the usual choice.
%time best_model = grid.fit(X_valid, y_valid)


CPU times: user 13 s, sys: 664 ms, total: 13.6 s
Wall time: 11min 55s


In [43]:
best_model

GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=123, shuffle=True),
             estimator=XGBClassifier(eval_metric='auc', random_state=123,
                                     subsample=0.5),
             n_jobs=-1,
             param_grid={'learning_rate': [0.02, 0.1, 0.3],
                         'max_depth': [2, 3, 4],
                         'n_estimators': [100, 200, 300]},
             refit='AUC',
             scoring={'AUC': 'roc_auc', 'Accuracy': make_scorer(roc_auc_score)})

With `model_name.best_params_` you can find the best combination in your model.

With `pd.DataFrame(model_name.cv_results_)` you can get the results in a dataframe.

Sort the dataframe based on the evaluation metric.

In [44]:
print(f'Best score: {best_model.best_score_}')
print(f'Best model: {best_model.best_params_}')

df_cv_results = pd.DataFrame(best_model.cv_results_)

# Your code goes here! 


Best score: 0.8308384389876112
Best model: {'learning_rate': 0.02, 'max_depth': 4, 'n_estimators': 300}
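One possible way to sort the results by the evaluation metric is sketched below on a small synthetic problem, with a decision tree standing in for XGBoost so that it runs quickly. With a scoring dictionary, `cv_results_` names its columns after the dictionary keys, for example `mean_test_AUC` and `rank_test_AUC`. The data, the parameter grid, and the names `search` and `ranked` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=123)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=123),
    param_grid={"max_depth": [2, 3, 4], "min_samples_leaf": [1, 10]},
    scoring={"AUC": "roc_auc", "Accuracy": "accuracy"},
    refit="AUC",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=123),
)
search.fit(X_toy, y_toy)

# One row per hyperparameter combination; rank 1 is the best mean AUC
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_AUC", "std_test_AUC", "mean_test_Accuracy", "rank_test_AUC"]
ranked = results.sort_values("rank_test_AUC")[columns].reset_index(drop=True)

print(ranked.head())
```

Applied to this notebook, the equivalent call would be `df_cv_results.sort_values("rank_test_AUC")`, since the scoring dictionary there also uses the key `'AUC'`.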


<h2> Exercise 3.</h2>

1. Train and test the model using the best hyperparameters you found.
2. Calculate the AUC metric for both train and test sets.
3. Present the classification results in the form of a confusion matrix.
4. Determine recall and precision using the confusion matrix.

In [45]:
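# Evaluate the refit best model on both splits.
# The grid search above was fit on (X_valid, y_valid), so X_train is the unseen split here.
pred_train = best_model.predict(X_train)
pred_valid = best_model.predict(X_valid)
print('Accuracy on X_valid (used to fit the search): ', accuracy_score(y_valid, pred_valid))
print('Accuracy on X_train (unseen by the search): ', accuracy_score(y_train, pred_train))
print('\nConfusion Matrix (X_train):')
print(confusion_matrix(y_train, pred_train))
print('\nClassification Report (X_train):')
print(classification_report(y_train, pred_train))

Accuracy on X_valid (used to fit the search):  0.7636325058349247
Accuracy on X_train (unseen by the search):  0.749473944795148

Confusion Matrix (X_train):
[[19786  8490]
 [ 5678 22599]]

Classification Report (X_train):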
              precision    recall  f1-score   support

         0.0       0.78      0.70      0.74     28276
         1.0       0.73      0.80      0.76     28277

    accuracy                           0.75     56553
   macro avg       0.75      0.75      0.75     56553
weighted avg       0.75      0.75      0.75     56553
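Step 4 of the exercise asks for recall and precision derived from the confusion matrix. A minimal sketch using the matrix printed above, treating class 1 (prediabetes or diabetes) as the positive class:

```python
# Confusion matrix from the output above.
# scikit-learn convention: rows are true classes, columns are predicted classes.
confusion = [[19786, 8490],
             [5678, 22599]]

tn, fp = confusion[0]  # true class 0: correctly rejected, falsely flagged
fn, tp = confusion[1]  # true class 1: missed, correctly detected

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f}, recall={recall:.2f}, accuracy={accuracy:.4f}")
```

These values agree with the class `1.0` row of the classification report and with the accuracy printed for the same split.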



## References
1. [XGBoost Documentation](https://xgboost.readthedocs.io/en/stable/)
