# Extreme Gradient Boosting with XGBoost

## Classification

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

df = datasets.load_breast_cancer()
cancer = pd.DataFrame(data= df.data, columns=df.feature_names)
cancer['target'] = df.target
cancer.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


### XGBoost: Fit/Predict

In [2]:
import xgboost as xgb

# Create arrays for the features and the target: X, y
X, y = cancer.iloc[:,:-1], cancer.iloc[:,-1]

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective="binary:logistic",
                         n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

accuracy: 0.956140
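
The accuracy above is the fraction of test-set predictions that match the true labels. A minimal sketch of that calculation in plain Python follows; the helper name `accuracy` and the small label lists are hypothetical and are not taken from the breast cancer data.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that equal the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    matches = sum(1 for p, t in zip(predictions, labels) if p == t)
    return matches / len(labels)


# Hypothetical labels and predictions, for illustration only
y_true = [0, 1, 1, 0, 1, 1, 0, 1]
y_hat = [0, 1, 0, 0, 1, 1, 1, 1]

acc = accuracy(y_hat, y_true)
print("accuracy: %f" % acc)
```

This mirrors the `np.sum(preds==y_test)/y_test.shape[0]` expression in the cell above: count the matches, then divide by the number of test rows.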


### Decision Trees

Instantiate a `DecisionTreeClassifier` called `dt_clf_4` with a `max_depth` of 4. This parameter caps the number of successive split points between the root and any leaf node.

In [3]:
# Import the necessary modules
from sklearn.tree import DecisionTreeClassifier

# Instantiate the classifier
dt_clf_4 = DecisionTreeClassifier(max_depth=4)

# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)

# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)

accuracy: 0.9736842105263158
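
To make the idea of depth concrete, here is a small sketch that represents a tree as nested dictionaries and counts the successive splits on the way to a leaf. The tree layout, the thresholds, and the helper names `tree_depth` and `predict` are all hypothetical; this is not how scikit-learn stores its trees.

```python
# Hypothetical hand-built tree: internal nodes are dicts, leaves are class labels.
toy_tree = {
    "feature": "mean radius",
    "threshold": 15.0,  # made-up threshold, for illustration only
    "left": {
        "feature": "worst concave points",
        "threshold": 0.14,  # made-up threshold, for illustration only
        "left": 1,
        "right": 0,
    },
    "right": 0,
}


def tree_depth(node):
    """Return the largest number of splits between this node and any leaf."""
    if not isinstance(node, dict):
        return 0  # a leaf contributes no further splits
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))


def predict(node, sample):
    """Walk the tree for one sample; return (label, number of splits used)."""
    n_splits = 0
    while isinstance(node, dict):
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
        n_splits += 1
    return node, n_splits


sample = {"mean radius": 12.0, "worst concave points": 0.10}
label, splits_used = predict(toy_tree, sample)
print("depth:", tree_depth(toy_tree))
print("label:", label, "splits used:", splits_used)
```

A tree limited to `max_depth=4` would never use more than four splits for any sample; this toy tree has a depth of 2.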


### Measuring accuracy
- Use XGBoost's native learning API, which has built-in cross-validation via `xgb.cv()`.
- Before calling `xgb.cv()`, you must explicitly convert your data into a `DMatrix`.
- `nfold` is the number of cross-validation folds (3), and `num_boost_round` is the number of trees to build (5).
- `metrics` is the metric to compute (here `"error"`, which is then converted to an accuracy as 1 - error).

**The example in the cell below uses XGBoost's native learning API, not its scikit-learn-compatible API.**

In [4]:
# X and y were already created above:
# X, y = cancer.iloc[:,:-1], cancer.iloc[:,-1]

# Create the DMatrix for X and y: cancer_dmatrix
cancer_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
# "binary:logistic" is the binary classification objective, as used for xg_cl above
params = {"objective":"binary:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=cancer_dmatrix, params=params,
                   nfold=3, num_boost_round=5,
                   metrics="error", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the accuracy after the final boosting round
print((1 - cv_results["test-error-mean"]).iloc[-1])

   train-error-mean  train-error-std  test-error-mean  test-error-std
0          0.025480         0.002452         0.066824        0.019564
1          0.021969         0.001257         0.061524        0.013876
2          0.014945         0.006589         0.056252        0.010004
3          0.012306         0.003300         0.052734        0.011418
4          0.010549         0.004314         0.054497        0.012485
0.9455026455026455
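
The fold mechanics behind `nfold` can be sketched in plain Python: split the row indices into folds, hold each fold out once, and average the per-fold error before converting it to an accuracy. The helper `kfold_indices` and the per-fold error values below are hypothetical, and the sketch does not shuffle or stratify rows, which a real cross-validation routine may do.

```python
from statistics import mean


def kfold_indices(n_samples, nfold):
    """Split range(n_samples) into nfold (train, test) index pairs."""
    base, extra = divmod(n_samples, nfold)
    folds, start = [], 0
    for i in range(nfold):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first folds
        test_idx = list(range(start, start + size))
        train_idx = [j for j in range(n_samples) if j < start or j >= start + size]
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = kfold_indices(n_samples=10, nfold=3)
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Hypothetical per-fold test errors, for illustration only
fold_errors = [0.05, 0.06, 0.055]
mean_error = mean(fold_errors)
print("mean error:", mean_error, "accuracy:", 1 - mean_error)
```

Each row lands in exactly one test fold, so every observation is used for evaluation once and for training in the remaining folds.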


### Measuring AUC

Now that you've used cross-validation to compute the average out-of-sample accuracy (after converting from an error), it is easy to compute any other metric you are interested in: pass the metric name (or a list of names) to the `metrics` parameter of `xgb.cv()`.

- Perform 3-fold cross-validation with 5 boosting rounds and "auc" as your metric.
- Print the `"test-auc-mean"` column of `cv_results`.

In [5]:
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=cancer_dmatrix, params=params,
                   nfold=3, num_boost_round=5,
                   metrics="auc", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

# Print the AUC after the final boosting round
print(cv_results["test-auc-mean"].iloc[-1])

   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.987225       0.001302       0.961473      0.024760
1        0.993245       0.004294       0.969078      0.022616
2        0.995224       0.003751       0.972491      0.024377
3        0.997125       0.002042       0.971355      0.025405
4        0.997610       0.001870       0.974002      0.026528
0.9740019437090325
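
One way to read an AUC value is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair. The function name `auc_by_pairs` and the four scores are hypothetical, and this brute-force approach is only meant for illustration, not as the method XGBoost uses internally.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores, for illustration only
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_by_pairs(labels, scores)
print("auc:", auc)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. A value of 1.0 means every positive outranks every negative, and 0.5 is what random scoring would give on average.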
