# Hands on - Comparing GradientBoostingClassifier vs. XGBoost

Please install xgboost first; it is not part of the Anaconda distribution. Of course, this only has to be done once :)

In [None]:
!pip install xgboost
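
To check whether the installation worked before moving on, a small sketch like the following can help. The helper name `is_installed` is made up for this illustration; it only relies on Python's standard `importlib` module.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the given top-level package."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("xgboost"):
    import xgboost
    print("xgboost version:", xgboost.__version__)
else:
    print("xgboost not found; run `pip install xgboost` first")
```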

# Import and Prepare Data

In [2]:
import pandas as pd
churn = pd.read_csv("https://raw.githubusercontent.com/casbdai/datasets/main/churn.csv")

## Separate Features and Labels

In [3]:
X = churn.drop("churn", axis=1)  # Features
y = churn["churn"]  # Target variable

## Dummy coding

In [4]:
X = pd.get_dummies(X, drop_first=True)
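
To see what dummy coding with `drop_first=True` does, here is a minimal sketch on a toy DataFrame. The column names `contract` and `tenure` are invented for illustration and are not taken from the churn data.

```python
import pandas as pd

# Toy data (hypothetical columns, not the real churn dataset)
toy = pd.DataFrame({
    "contract": ["monthly", "yearly", "monthly", "two_year"],
    "tenure": [1, 24, 5, 48],
})

# Numeric columns are kept as they are; each categorical column is replaced
# by one indicator column per category, minus the first category.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

The dropped category ("monthly" here) becomes the baseline: a row with all of its `contract_*` indicators set to zero belongs to it.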

# Compare GradientBoostingClassifier and XGBoost

The gradient boosting algorithm is one of the most powerful algorithms available today. It is used very frequently in research and practice, and many winning submissions to Kaggle competitions are based on gradient boosting. In short, data scientists love it because it is fast, accurate, and adapts well to many prediction problems.

Sklearn offers a "standard" implementation of the gradient boosting algorithm. However, the same algorithm can be implemented in different software libraries. The software library "XGBoost" offers an even more powerful implementation of gradient boosting; data scientists usually mean this implementation when they speak of "boosting."

In the following, we briefly compare the standard GradientBoostingClassifier() with an XGBClassifier() on the customer churn data.

## 1) XGBoost

In [5]:
%%time

# Import functions
from xgboost import XGBClassifier  # XGBoost's counterpart to GradientBoostingClassifier()
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Instantiate the Model
clf_xgb = XGBClassifier(learning_rate=0.1,
                        n_estimators=500, 
                        max_depth=10,
                        use_label_encoder=False)

# Create Test and Training Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit Model to Data
clf_xgb.fit(X_train, y_train)

# Make Predictions on Test Data
y_pred = clf_xgb.predict(X_test)

# Evaluate Performance
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.82      0.86       925
           1       0.73      0.84      0.78       534

    accuracy                           0.83      1459
   macro avg       0.81      0.83      0.82      1459
weighted avg       0.84      0.83      0.83      1459

CPU times: total: 22.9 s
Wall time: 6.32 s


## 2) Sklearn's GradientBoostingClassifier

In [6]:
%%time 

# Import functions
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Instantiate the Model
clf_skl = GradientBoostingClassifier(learning_rate=0.1,
                                     n_estimators=500, 
                                     max_depth=10)

# Create Test and Training Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit Model to Data
clf_skl.fit(X_train, y_train)

# Make Predictions on Test Data
y_pred = clf_skl.predict(X_test)

# Evaluate Performance
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87       925
           1       0.77      0.80      0.78       534

    accuracy                           0.84      1459
   macro avg       0.82      0.83      0.82      1459
weighted avg       0.84      0.84      0.84      1459

CPU times: total: 17.8 s
Wall time: 17.5 s


## 3) Comparison

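The F1-scores for the churn class are identical (0.78), and overall accuracy is nearly the same (0.83 vs. 0.84). But there are small differences in precision and recall: XGBoost reaches a higher recall for churners (0.84 vs. 0.80), while sklearn's model is more precise (0.77 vs. 0.73).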
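
The F1-score is the harmonic mean of precision and recall, so two models can trade one for the other and still end up with the same F1. The short sketch below recomputes it from the churn-class (label 1) values in the two reports above. The inputs are the rounded figures from those reports, so the results are approximate; the function name `f1_from` is made up for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Churn-class values copied from the two classification reports above
xgb_f1 = f1_from(precision=0.73, recall=0.84)
skl_f1 = f1_from(precision=0.77, recall=0.80)

print(f"XGBoost: {xgb_f1:.2f}")  # prints "XGBoost: 0.78"
print(f"sklearn: {skl_f1:.2f}")  # prints "sklearn: 0.78"
```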

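However, XGBoost is roughly two to three times faster (compare the "Wall time" values: 6.32 s vs. 17.5 s). Note that XGBoost's total CPU time is actually higher (22.9 s vs. 17.8 s), which indicates that its speed-up comes from spreading the work across several cores. Results depend a bit on your machine and setup.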
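
The `%%time` magic times the whole cell, including the imports and the train/test split. To time only the fitting step, something like the following sketch could be used. It runs on synthetic stand-in data from `make_classification` with smaller models so that it finishes quickly (both are assumptions for illustration, not the settings used above), and the helper `time_fit` is a made-up name. The XGBoost model is limited to one thread via `n_jobs=1`, so both libraries are compared on a single core; it is skipped if xgboost is not installed.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def time_fit(model, X, y):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Synthetic stand-in data (assumption); swap in X_train, y_train from above
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)

models = {"sklearn": GradientBoostingClassifier(n_estimators=50, max_depth=3)}

try:
    from xgboost import XGBClassifier
    # n_jobs=1 restricts XGBoost to a single thread
    models["xgboost (1 thread)"] = XGBClassifier(n_estimators=50, max_depth=3, n_jobs=1)
except ImportError:
    pass  # xgboost is optional for this sketch

timings = {name: time_fit(model, X_demo, y_demo) for name, model in models.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f} s")
```

Running this with the real training data and the original hyperparameters, once with `n_jobs=1` and once without, would show how much of the gap is due to multithreading on your machine.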