## Extreme Gradient Boosting for Class Imbalance

The Kaggle Credit Card Fraud Detection dataset can be downloaded here:
https://github.com/nsethi31/Kaggle-Data-Credit-Card-Fraud-Detection/blob/master/creditcard.csv?raw=true

In [3]:
# Install XGBoost from the command line if it is not already available:
# conda install -c conda-forge xgboost

In [16]:
%%time
import pandas as pd
# Adjust the filename if needed; the link above serves the file as creditcard.csv
data = pd.read_csv('creditcards.csv')

CPU times: user 1.41 s, sys: 163 ms, total: 1.57 s
Wall time: 1.6 s


In [17]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [18]:
data.shape

(284807, 31)

In [19]:
data['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

In [20]:
%%time
# Split data into train and test splits

from sklearn.model_selection import train_test_split

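# convert the DataFrame to a numpy array
data = data.values
# split into input and output elements:
# column 0 is the 'Unnamed: 0' index column (dropped), the last column is 'Class'
X, y = data[:, 1:-1], data[:, -1]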

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

CPU times: user 220 ms, sys: 95.7 ms, total: 316 ms
Wall time: 322 ms


In [21]:
# Count samples per class in the full dataset, then in the test split
import numpy as np
unique, counts = np.unique(y, return_counts=True)
print(np.asarray((unique, counts)).T)

unique, counts = np.unique(y_test, return_counts=True)
print(np.asarray((unique, counts)).T)

[[0.00000e+00 2.84315e+05]
 [1.00000e+00 4.92000e+02]]
[[0.0000e+00 9.3825e+04]
 [1.0000e+00 1.6200e+02]]


In [22]:
counts

array([93825,   162])

In [23]:
# calculate heuristic class weighting
from sklearn.utils.class_weight import compute_class_weight

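# calculate class weighting according to training data
# (recent scikit-learn versions require keyword arguments and an array of classes)
weighting = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_train)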
print(weighting)

[  0.50086619 289.12121212]
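
The 'balanced' heuristic can be reproduced by hand: each class weight is n_samples / (n_classes * class_count). The sketch below assumes the training-split counts implied by the outputs above (full-data counts minus test-split counts); the variable names are illustrative.

```python
# Class counts in the training split, derived from the outputs above:
# full-data counts minus test-split counts.
n_neg = 284315 - 93825   # legitimate transactions in the training split
n_pos = 492 - 162        # fraudulent transactions in the training split

n_samples = n_neg + n_pos
n_classes = 2

# 'balanced' heuristic: n_samples / (n_classes * count_of_that_class)
weight_neg = n_samples / (n_classes * n_neg)
weight_pos = n_samples / (n_classes * n_pos)

print('%.8f %.8f' % (weight_neg, weight_pos))
```

The two values should agree with the array printed by `compute_class_weight` above.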




### For XGBoost:

XGBoost is trained to minimize a loss function. The “gradient” in gradient boosting is the derivative of that loss with respect to the model's current prediction, so its magnitude tracks how wrong the prediction is. A small gradient means a small error and, in turn, a small change to the model to correct it. A large gradient during training results in a large correction.

* *Small Gradient:* Small error, so a small correction to the model
* *Large Gradient:* Large error, so a large correction to the model
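
For binary classification with the logistic (log) loss, the gradient with respect to the raw score is the predicted probability minus the label. A minimal sketch, assuming that loss; `logloss_gradient` is an illustrative helper, not an XGBoost function:

```python
def logloss_gradient(p, y):
    """Gradient of the log loss w.r.t. the raw score (log-odds): p - y."""
    return p - y

# A fraud case (y = 1) that the model already scores highly: small gradient
small = logloss_gradient(0.95, 1)
# The same fraud case scored as almost certainly legitimate: large gradient
large = logloss_gradient(0.05, 1)

print('small error gradient: %.2f' % small)
print('large error gradient: %.2f' % large)
```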


The **scale_pos_weight** value is used to scale the gradient for the positive class.


By default, the scale_pos_weight hyperparameter is set to 1.0, which weights positive and negative examples equally when boosting decision trees. Values above 1.0 give errors on positive examples proportionally more weight than errors on negative examples.


Scaling the gradient for the positive class has the effect of scaling errors made by the model during training on the positive class and encourages the model to over-correct them. In turn, this can help the model achieve better performance when making predictions on the positive class. 
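
One way to picture this: the per-example gradient and Hessian of the log loss are multiplied by scale_pos_weight whenever the label is positive. The sketch below illustrates that idea under the assumption of a logistic objective; it is not XGBoost's source code, `weighted_grad_hess` is a hypothetical helper, and 577.242 is the estimate printed later in this notebook.

```python
def weighted_grad_hess(p, y, scale_pos_weight=1.0):
    """Log-loss gradient and Hessian, scaled for positive examples only."""
    w = scale_pos_weight if y == 1 else 1.0
    grad = (p - y) * w
    hess = p * (1.0 - p) * w
    return grad, hess

# Same mistake size (probability off by 0.9) on each class
g_pos, h_pos = weighted_grad_hess(0.1, 1, scale_pos_weight=577.242)
g_neg, h_neg = weighted_grad_hess(0.9, 0, scale_pos_weight=577.242)

print('positive-class gradient: %.1f' % g_pos)
print('negative-class gradient: %.1f' % g_neg)
```

With equal-sized mistakes, the positive-class gradient is larger in magnitude by exactly the scale_pos_weight factor, which is what pushes the trees to correct fraud misclassifications first.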




#### Observe how the estimated value of scale_pos_weight is calculated

In [27]:
%%time
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

unique, counts = np.unique(y_train, return_counts=True)

# estimate scale_pos_weight as the ratio of negative to positive training examples
estimate = counts[0] / counts[1]
print('Estimate: %.3f' % estimate)

# define model
# try with and without scale_pos_weight
model = XGBClassifier(scale_pos_weight=estimate, use_label_encoder=False)
#model = XGBClassifier()
# fit model
model.fit(X_train, y_train)

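# evaluate model
# note: predict() returns hard 0/1 class labels, not probabilities, so the
# ROC AUC reported below is computed from thresholded predictions
y_pred = model.predict(X_test)

auc = roc_auc_score(y_test, y_pred)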

# summarize performance
print('ROC AUC = %.3f' % auc)

Estimate: 577.242
ROC AUC = 0.895
CPU times: user 3min 14s, sys: 614 ms, total: 3min 15s
Wall time: 58 s
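
Note that `roc_auc_score` accepts either hard labels or continuous scores, and the two generally give different values because hard labels collapse the ranking to a single threshold. A small sketch on made-up toy values (not the fraud data) illustrates the difference:

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities (illustrative values only)
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]

# Hard labels at a 0.5 threshold, similar to what predict() returns
y_label = [1 if s >= 0.5 else 0 for s in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)

print('ROC AUC from scores: %.3f' % auc_from_scores)
print('ROC AUC from labels: %.3f' % auc_from_labels)
```

With the fitted model above, the score-based variant would pass `model.predict_proba(X_test)[:, 1]` to `roc_auc_score` in place of the output of `model.predict(X_test)`.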


In [None]:
0.88
0.933