# XGBoost Compare
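What parameters do I need to set for the native Python API (`xgb.train`) to match the validation AUC of the scikit-learn API (`xgb.XGBClassifier`)? With what look like the same settings on the same split, the native API stops at a best validation AUC of 0.91917, while the scikit-learn API reaches 0.93642. The dataset `data.npy` can be downloaded from Google Drive if needed: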

https://drive.google.com/open?id=1gUQ7z2lNNsSCYCxInFqNMyUaQGB63zB9

In [10]:
import numpy, cudf, cupy
X = numpy.load('data.npy')
X_train = cudf.DataFrame.from_gpu_matrix( cupy.asarray(X) )
split = 3*len(X_train)//4
X_train.shape

(590540, 217)

In [9]:
import xgboost as xgb
print("XGBoost version:", xgb.__version__)

xgb_parms = { 
    'n_estimators':2000,
    'max_depth':12, 
    'learning_rate':0.02, 
    'subsample':0.8,
    'colsample_bytree':0.4, 
    'missing':-1, 
    'eval_metric':'auc',
    'tree_method':'gpu_hist' 
}
train = xgb.DMatrix(data=X_train.iloc[:split,:-1],label=X_train.iloc[:split,-1])
valid = xgb.DMatrix(data=X_train.iloc[split:,:-1],label=X_train.iloc[split:,-1])
clf = xgb.train(xgb_parms, dtrain=train,
    num_boost_round=2000,evals=[(train,'train'),(valid,'valid')],
    early_stopping_rounds=100,maximize=True,
    verbose_eval=50)

XGBoost version: 1.0.0-SNAPSHOT
[0]	train-auc:0.86773	valid-auc:0.81684
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.

Will train until valid-auc hasn't improved in 100 rounds.
[50]	train-auc:0.91358	valid-auc:0.87507
[100]	train-auc:0.93154	valid-auc:0.88947
[150]	train-auc:0.94443	valid-auc:0.89634
[200]	train-auc:0.95914	valid-auc:0.90377
[250]	train-auc:0.97378	valid-auc:0.91124
[300]	train-auc:0.98300	valid-auc:0.91501
[350]	train-auc:0.98819	valid-auc:0.91609
[400]	train-auc:0.99130	valid-auc:0.91739
[450]	train-auc:0.99360	valid-auc:0.91818
[500]	train-auc:0.99505	valid-auc:0.91844
[550]	train-auc:0.99621	valid-auc:0.91851
[600]	train-auc:0.99713	valid-auc:0.91912
[650]	train-auc:0.99780	valid-auc:0.91831
Stopping. Best iteration:
[585]	train-auc:0.99688	valid-auc:0.91917



In [12]:
import xgboost as xgb
print("XGBoost version:", xgb.__version__)

clf = xgb.XGBClassifier( 
    n_estimators=2000,
    max_depth=12, 
    learning_rate=0.02, 
    subsample=0.8,
    colsample_bytree=0.4, 
    missing=-1, 
    eval_metric='auc',
    tree_method='gpu_hist' 
)
h = clf.fit(X_train.iloc[:split,:-1].to_pandas(), X_train.iloc[:split,-1].to_pandas(), 
    eval_set=[(X_train.iloc[:split,:-1].to_pandas(), X_train.iloc[:split,-1].to_pandas()),\
                (X_train.iloc[split:,:-1].to_pandas(), X_train.iloc[split:,-1].to_pandas())],
    verbose=50, early_stopping_rounds=100)

XGBoost version: 1.0.0-SNAPSHOT
[0]	validation_0-auc:0.86434	validation_1-auc:0.82828
Multiple eval metrics have been passed: 'validation_1-auc' will be used for early stopping.

Will train until validation_1-auc hasn't improved in 100 rounds.
[50]	validation_0-auc:0.91345	validation_1-auc:0.87605
[100]	validation_0-auc:0.94021	validation_1-auc:0.89421
[150]	validation_0-auc:0.95937	validation_1-auc:0.90838
[200]	validation_0-auc:0.97598	validation_1-auc:0.91897
[250]	validation_0-auc:0.98571	validation_1-auc:0.92563
[300]	validation_0-auc:0.99142	validation_1-auc:0.93043
[350]	validation_0-auc:0.99471	validation_1-auc:0.93317
[400]	validation_0-auc:0.99663	validation_1-auc:0.93468
[450]	validation_0-auc:0.99769	validation_1-auc:0.93553
[500]	validation_0-auc:0.99832	validation_1-auc:0.93601
[550]	validation_0-auc:0.99879	validation_1-auc:0.93628
[600]	validation_0-auc:0.99910	validation_1-auc:0.93637
Stopping. Best iteration:
[537]	validation_0-auc:0.99868	validation_1-auc:0.93642
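
One possible source of the gap, offered as a guess and not something the logs above confirm: the two APIs take some of these settings in different places. In the native API the number of rounds comes from `num_boost_round` rather than `n_estimators`, `missing` is an argument of `xgb.DMatrix` rather than a booster parameter, and `xgb.train` uses a regression objective by default whereas `XGBClassifier` defaults to `binary:logistic`. The sketch below only rearranges a scikit-learn-style parameter dict into those three places. The helper `split_sklearn_params` is hypothetical (it is not part of xgboost), and the block does not train anything.

```python
# Hypothetical helper (not part of xgboost): split scikit-learn-style keyword
# arguments into the three places the native API expects them.
SKLEARN_DEFAULT_OBJECTIVE = "binary:logistic"  # XGBClassifier's default objective


def split_sklearn_params(sk_params):
    """Return (booster_params, num_boost_round, dmatrix_kwargs)."""
    params = dict(sk_params)  # copy so the caller's dict is left untouched

    # n_estimators is the wrapper's name; xgb.train takes num_boost_round.
    num_boost_round = params.pop("n_estimators")

    # missing is consumed when the DMatrix is built, not by the booster.
    dmatrix_kwargs = {}
    if "missing" in params:
        dmatrix_kwargs["missing"] = params.pop("missing")

    # Without an explicit objective, xgb.train falls back to regression.
    params.setdefault("objective", SKLEARN_DEFAULT_OBJECTIVE)
    return params, num_boost_round, dmatrix_kwargs


# Same settings as the XGBClassifier call in the question.
sk_params = {
    "n_estimators": 2000,
    "max_depth": 12,
    "learning_rate": 0.02,
    "subsample": 0.8,
    "colsample_bytree": 0.4,
    "missing": -1,
    "eval_metric": "auc",
    "tree_method": "gpu_hist",
}

booster_params, num_boost_round, dmatrix_kwargs = split_sklearn_params(sk_params)
print(booster_params)
print(num_boost_round, dmatrix_kwargs)

# Intended use with the cells above (not executed here):
#   train = xgb.DMatrix(data=..., label=..., **dmatrix_kwargs)
#   clf = xgb.train(booster_params, dtrain=train,
#                   num_boost_round=num_boost_round, ...)
```

Even with these aligned, the two runs may not match digit for digit, since other defaults and the random seed can still differ. The round 0 scores already disagree in the logs above, which at least suggests a configuration difference and not only sampling noise.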

