## 02-Lab - XGBoost Classification

In this lab, you will work with the Heart dataset to predict whether a person has AHD. You will also compare XGBoost with the other tree-based algorithms we have covered so far. In the second part of the lab, you will generate the label map for XGBoost.

The data is available at: https://raw.githubusercontent.com/colaberry/DSin100days/master/data/Heart.csv

Some of the data in this lab are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) by G. James, D. Witten, T. Hastie and R. Tibshirani.

In [None]:
# Importing pandas
import pandas as pd

heart = pd.read_csv('https://raw.githubusercontent.com/colaberry/DSin100days/master/data/Heart.csv', na_values='?').dropna()
heart.info()
heart.head()


In [None]:
# Keep the two features (Age, MaxHR) and the target (AHD)
data_set = heart[["Age","MaxHR","AHD"]]

In [None]:
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap 
from sklearn.preprocessing import LabelEncoder
import numpy as np 

labels = LabelEncoder().fit_transform(data_set["AHD"].values) 
colors = ['yellow','black']
cmap= ListedColormap(colors)
plt.figure(figsize=(10,10))
plt.xlabel('Age', fontsize=15)
plt.ylabel('MaxHR', fontsize=15)
plt.scatter(data_set['Age'].values, data_set['MaxHR'].values, c=labels, cmap=cmap )
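
To read the scatter plot, it helps to know which class each color stands for. The sketch below is a minimal illustration, assuming the `AHD` column holds the strings `"No"` and `"Yes"`; the list `ahd_values` is a hypothetical stand-in for `data_set["AHD"].values`. `LabelEncoder` sorts the class names, so the first class maps to 0 (drawn in the first color, yellow) and the second to 1 (black).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for data_set["AHD"].values
ahd_values = ["No", "Yes", "Yes", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(ahd_values)

# classes_ is sorted, so index 0 is "No" and index 1 is "Yes"
label_to_int = {str(label): int(i) for i, label in enumerate(encoder.classes_)}
print(label_to_int)      # {'No': 0, 'Yes': 1}
print(encoded.tolist())  # [0, 1, 1, 0, 1]
```

Keeping the fitted encoder in a variable, rather than calling `fit_transform` on a throwaway instance, also makes `encoder.inverse_transform` available for turning predictions back into the original labels.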


In [None]:
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split 

X = data_set[['Age','MaxHR']].values
y = labels.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1)
print("y value min and max are : {},{}".format(min(y),max(y)))

## Part 1: Comparing all the classifiers 

In [None]:
import xgboost as xgb
# XGBoost classifier with random_state=12, max_depth=2 and n_estimators=100
xgb_clf = xgb.XGBClassifier(random_state=12, max_depth=2, n_estimators=100)
xgb_pred = xgb_clf.fit(X_train, y_train).predict(X_test)
xgb_acc = metrics.accuracy_score(y_test, xgb_pred)
print("Accuracy of the xgboost classifier on the test set {}".format(xgb_acc))


Accuracy of the xgboost classifier on the test set 0.7

XGBoost also provides a random forest classifier. A good exercise is to compare all the tree-based classifiers: the XGBoost classifier, the XGBoost random forest, a decision tree, a random forest and boosted trees. We do this below.

In [None]:
from xgboost import XGBRFClassifier
# XGBoost random forest with the same parameters as the XGBoost classifier
xrf_clf = XGBRFClassifier(random_state=12, max_depth=2, n_estimators=100)
xrf_pred = xrf_clf.fit(X_train, y_train).predict(X_test)
xrf_acc = metrics.accuracy_score(y_test, xrf_pred)
print("Accuracy of the xgboost Random forest classifier on the test set {}".format(round(xrf_acc,2)))

In [None]:
# Decision tree classifier with random_state=12 and max_depth=2
dt_clf = tree.DecisionTreeClassifier(random_state=12, max_depth=2)
dt_pred = dt_clf.fit(X_train, y_train).predict(X_test)
dt_acc = metrics.accuracy_score(y_test, dt_pred)
print("Accuracy of the decision tree classifier on the test set {}".format(round(dt_acc,2)))

Accuracy of the decision tree classifier on the test set 0.73

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Random forest classifier with random_state=12, n_estimators=100 and max_depth=2
rf_clf = RandomForestClassifier(random_state=12, n_estimators=100, max_depth=2)
rf_pred = rf_clf.fit(X_train, y_train).predict(X_test)
rf_acc = metrics.accuracy_score(y_test, rf_pred)
print("Accuracy of the random forest classifier on the test set {}".format(round(rf_acc,2)))

Accuracy of the random forest classifier on the test set 0.73

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
# Gradient boosting classifier with random_state=12, n_estimators=100, learning_rate=0.01 and max_depth=2
gb_clf = GradientBoostingClassifier(random_state=12, n_estimators=100, learning_rate=0.01, max_depth=2)
gb_pred = gb_clf.fit(X_train, y_train).predict(X_test)
gb_acc = metrics.accuracy_score(y_test, gb_pred)
print("Accuracy of the gradient boosting classifier on the test set {}".format(round(gb_acc,2)))

Accuracy of the gradient boosting classifier on the test set 0.73


Ultimately, you should obtain the following results:


| Tree method             | Test accuracy |
|-------------------------|---------------|
| XGBoost (normal)        | 70%           |
| Decision tree           | 73%           |
| XGBoost (random forest) | 77%           |
| Random forest           | 77%           |
| Boosted trees           | 73%           |


In the table above, the random forest classifiers score highest in both the scikit-learn and the XGBoost implementations. A caveat is that we did not carry out any parameter tuning: parameters such as the maximum depth and the number of estimators were fixed in advance. In a real use case, one would run `GridSearchCV` or another parameter search to identify the best parameters for each model. The table should therefore be taken with a grain of salt, since it is a poorly optimized comparison evaluated on only 10% of the data. What we can see is that the biggest differences lie between the random forest methods and the non-random-forest methods. Random forests often perform well in practice.
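
The parameter search mentioned above could look roughly like the following. This is a minimal sketch, not the lab's solution: it uses synthetic two-feature data from `make_classification` as a stand-in for the Age/MaxHR features, and the grid values in `param_grid` are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic two-feature data standing in for the Age/MaxHR features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=1
)

# Illustrative grid; a real search would usually cover more values
param_grid = {"max_depth": [2, 3, 4], "n_estimators": [50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=12),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training split only
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", round(search.best_score_, 2))
print("Test accuracy:", round(search.score(X_te, y_te), 2))
```

The same pattern applies to the other classifiers by swapping the estimator and its grid. Because the search only sees the training split, the held-out test accuracy remains an honest estimate.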

Next, let's look at generating the label map for XGBoost.

## Part 2: Label map


In [None]:
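def to_3d(x, y, plot_step=0.01):
    """Build a mesh grid spanning both feature columns of x, padded by 1 on each side.

    y is unused; it is kept so that calls such as to_3d(X, y) still work.
    A small plot_step over wide feature ranges yields a very large grid.
    """
    x_min, x_max = x[:, 0].min() - 1, x[:, 0].max() + 1
    y_min, y_max = x[:, 1].min() - 1, x[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    return xx, yy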

def plot_contour(xx,yy,Z): 
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)
    return cs

In [None]:

xx, yy = to_3d(X,y)

# Generate the label map for xgb_clf by predicting a class at every grid point
Z = xgb_clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)


plt.figure(figsize=(10,10))
cmap= ListedColormap(colors)

_ = plot_contour(xx, yy, Z)  # grid coordinates and the predicted class at each grid point
plt.scatter(data_set["Age"].values, data_set["MaxHR"].values, c=labels,cmap=cmap )
plt.show()

<img src="../../../images/xgb_label_map.png">
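
The label map is built in three steps: lay a grid over the feature space, predict a class for every grid point, and reshape the predictions back to the grid's shape so `contourf` can color each region. The toy sketch below shows the array shapes involved on a tiny grid. The function `toy_predict` is a hypothetical stand-in for `xgb_clf.predict`, and the grid bounds are arbitrary.

```python
import numpy as np

def toy_predict(points):
    # Hypothetical rule standing in for xgb_clf.predict:
    # class 1 when the two features sum to more than 3, else class 0
    return (points[:, 0] + points[:, 1] > 3).astype(int)

# 3 values along the first feature, 4 along the second
xx_demo, yy_demo = np.meshgrid(np.arange(0, 3, 1.0), np.arange(0, 4, 1.0))

# Flatten both grids and pair them up: one row per grid point
grid_points = np.c_[xx_demo.ravel(), yy_demo.ravel()]

# Predict per point, then restore the 2-D grid layout
Z_demo = toy_predict(grid_points).reshape(xx_demo.shape)

print(xx_demo.shape)      # (4, 3)
print(grid_points.shape)  # (12, 2)
print(Z_demo)
```

Each row of `Z_demo` corresponds to one value of the second feature and each column to one value of the first, which is the layout `plt.contourf(xx, yy, Z)` expects.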