# About this kernel

Before I get started, I just wanted to say: huge props to Inversion! The official starter kernel is **AWESOME**; it's so simple, clean, straightforward, and pragmatic. It certainly saved me a lot of time wrangling with data, so that I can directly start tuning my models (real data scientists will call me lazy, but hey I'm an engineer I just want my stuff to work).

I noticed two tiny problems with it:
* It takes a lot of RAM to run, which means that if you are using a GPU, it might crash as you try to fill missing values.
* It takes a while to run (roughly 3500 seconds, which is just under an hour; again, I'm a lazy guy and I don't like waiting).

With this kernel, I bring some small changes:
* Decrease RAM usage, so that it won't crash when you change it to GPU. I simply changed when we are deleting unused variables.
* Decrease **running time from ~3500s to ~40s** (yes, that's almost 90x faster), at the cost of a slight decrease in score. This is done by adding a single argument.

Again, my changes are super minimal (cause Inversion's kernel was already so awesome), but I hope it will save you some time and trouble (so that you can start working on cool stuff).


### Changelog

**V4**
* Change some wording
* Print XGBoost version
* Add random state to XGB for reproducibility

In [5]:
import os

import numpy as np
import pandas as pd
from sklearn import preprocessing
import xgboost as xgb

In [6]:
print("XGBoost version:", xgb.__version__)

XGBoost version: 0.90


# Efficient Preprocessing

This preprocessing method is more careful with RAM usage, which avoids crashing the kernel when you switch from CPU to GPU. Otherwise, it is exactly the same procedure as the official starter.
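
To make the RAM point concrete, here is a minimal sketch on small synthetic tables. The column names, table sizes, and the `megabytes` helper are made up for illustration and are not the competition data. It compares the memory held while the raw tables are still alive with the memory held once they are deleted right after the merge.

```python
import gc

import numpy as np
import pandas as pd

# Small synthetic stand-ins for the transaction and identity tables (hypothetical data).
rng = np.random.default_rng(0)
transaction = pd.DataFrame(
    rng.normal(size=(1000, 20)),
    columns=[f"V{i}" for i in range(20)],
    index=pd.RangeIndex(1000, name="TransactionID"),
)
identity = pd.DataFrame(
    rng.normal(size=(300, 5)),
    columns=[f"id_{i}" for i in range(5)],
    index=pd.RangeIndex(300, name="TransactionID"),
)

merged = transaction.merge(identity, how="left", left_index=True, right_index=True)


def megabytes(df):
    """Approximate in-memory size of a DataFrame in MB (hypothetical helper)."""
    return df.memory_usage(deep=True).sum() / 1e6


# While the raw tables are alive, we pay for them and for the merged copy.
held_before = megabytes(transaction) + megabytes(identity) + megabytes(merged)

# Delete the raw tables as soon as the merged frame exists.
del transaction, identity
gc.collect()

held_after = megabytes(merged)
print(f"before del: {held_before:.2f} MB, after del: {held_after:.2f} MB")
```

On the real data the difference is much larger, since each raw table is hundreds of columns wide.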

In [7]:
%%time
import pandas as pd

train_transaction = pd.read_csv('ieee-fraud-detection/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('ieee-fraud-detection/test_transaction.csv', index_col='TransactionID')

train_identity = pd.read_csv('ieee-fraud-detection/train_identity.csv', index_col='TransactionID')
test_identity  = pd.read_csv('ieee-fraud-detection/test_identity.csv',  index_col='TransactionID')

sample_submission = pd.read_csv('ieee-fraud-detection/sample_submission.csv', index_col='TransactionID')

train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)

print(train.shape)
print(test.shape)

# (590540, 433)
# (506691, 432)
# CPU times: user 5min 17s, sys: 18.3 s, total: 5min 36s
# Wall time: 5min 35s


(590540, 433)
(506691, 432)
CPU times: user 30.6 s, sys: 6.73 s, total: 37.3 s
Wall time: 37.5 s


In [8]:
%who

clf	 np	 os	 pd	 preprocessing	 sample_submission	 test	 test_identity	 test_transaction	 
train	 train_identity	 train_transaction	 xgb	 


In [9]:
%%time

# Free the raw tables right after the merge; this is the RAM-saving change.
del train_transaction, train_identity, test_transaction, test_identity

# 'isFraud' stays in X_train for now; it is split off into y_train further down.
X_test = test.copy()
X_train = train
del train, test


CPU times: user 721 ms, sys: 516 ms, total: 1.24 s
Wall time: 1.24 s


In [10]:
%who

X_test	 X_train	 clf	 np	 os	 pd	 preprocessing	 sample_submission	 xgb	 



In [11]:
testset_isFraud=pd.read_csv("2019-9-22-yuchi3.csv")

In [12]:
%%time
# Attach the isFraud values loaded from the CSV to the test rows.
# list(...) assigns by position, so the CSV rows must be in the same order as X_test.
X_test['isFraud'] = list(testset_isFraud['isFraud'])

CPU times: user 58.5 ms, sys: 7.64 ms, total: 66.2 ms
Wall time: 65.1 ms


In [13]:
%who

X_test	 X_train	 clf	 np	 os	 pd	 preprocessing	 sample_submission	 testset_isFraud	 
xgb	 


In [None]:
%%time
X_train=pd.concat([X_train,X_test],sort=True).reset_index(drop=True)

In [None]:
%%time
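y_train = X_train['isFraud'].copy()
# drop() returns a new frame, so assign it back; otherwise the target stays in the features
X_train = X_train.drop('isFraud', axis=1)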

In [None]:
len(y_train)

In [None]:
X_train = X_train.fillna(-999)
X_test  = X_test.fillna(-999)

from sklearn import preprocessing
# Label Encoding
for f in X_train.columns:
    if X_train[f].dtype=='object' or X_test[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X_train[f].values) + list(X_test[f].values))
        X_train[f] = lbl.transform(list(X_train[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values))   
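
The loop above fits each `LabelEncoder` on the train and test values together. Here is a small sketch of why that matters, using made-up card names rather than the real columns: an encoder fitted on train alone refuses labels it has never seen, while one fitted on the union handles both frames.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical categorical column; "discover" only shows up in the test rows.
train_col = pd.Series(["visa", "mastercard", "visa"])
test_col = pd.Series(["visa", "discover"])

# Encoder fitted on train only: transforming the test rows fails on the unseen label.
train_only = preprocessing.LabelEncoder().fit(train_col)
try:
    train_only.transform(test_col)
    unseen_ok = True
except ValueError:
    unseen_ok = False

# Encoder fitted on the union of both columns, as in the loop above.
shared = preprocessing.LabelEncoder().fit(list(train_col) + list(test_col))
train_codes = shared.transform(train_col)
test_codes = shared.transform(test_col)

print("train-only encoder handles test labels:", unseen_ok)
print("train codes:", list(train_codes))
print("test codes:", list(test_codes))
```

The codes are just positions in the sorted list of labels, so they carry no meaning beyond identity; tree models cope with that fine.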

# Training

To activate GPU usage, simply use `tree_method='gpu_hist'` (took me an hour to figure out, I wish XGBoost documentation was clearer about that).

In [3]:
%who

X_test	 X_train	 f	 lbl	 np	 os	 pd	 preprocessing	 sample_submission	 
testset_isFraud	 xgb	 y_train	 


In [None]:
%%time
import xgboost as xgb
clf = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=9,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    missing=-999,
    random_state=2019,
    tree_method='gpu_hist'  # THE MAGICAL PARAMETER
)

In [None]:
# len(feature_need)?

In [None]:
# %time clf.fit(X_train[feature_need], y_train)
%time clf.fit(X_train, y_train)


Some of you must be wondering how we were able to decrease the fitting time by that much. The reason is not only that we are running on a GPU, but also that we are computing an approximation of the exact algorithm (which is a greedy one). This hurts your score slightly, but it is much faster as a result.
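
To give a feel for what "approximation" means here, below is a simplified NumPy sketch of split finding on a single feature. It is not XGBoost's actual implementation: it uses plain squared-error gain, and the function names and the `max_bins=32` default are my own choices for illustration. The exact greedy search tries a threshold between every pair of distinct values, while the histogram version only tries a handful of quantile bin edges.

```python
import numpy as np


def gain(left_sum, left_n, total_sum, total_n):
    """Squared-error reduction from splitting a node into a left and right part."""
    right_sum, right_n = total_sum - left_sum, total_n - left_n
    return left_sum**2 / left_n + right_sum**2 / right_n - total_sum**2 / total_n


def best_split_exact(x, y):
    """Exact greedy search: try a threshold between every pair of distinct sorted values."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    left_sums = np.cumsum(ys)
    best_gain, best_thr, tried = -np.inf, None, 0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate equal values
        tried += 1
        g = gain(left_sums[i - 1], i, left_sums[-1], len(xs))
        if g > best_gain:
            best_gain, best_thr = g, (xs[i - 1] + xs[i]) / 2
    return best_thr, best_gain, tried


def best_split_hist(x, y, max_bins=32):
    """Histogram search: bucket x into quantile bins and try only the bin edges."""
    edges = np.unique(np.quantile(x, np.linspace(0, 1, max_bins + 1)[1:-1]))
    bins = np.searchsorted(edges, x, side="right")  # bin k holds edges[k-1] <= x < edges[k]
    sums = np.bincount(bins, weights=y, minlength=len(edges) + 1)
    counts = np.bincount(bins, minlength=len(edges) + 1)
    left_sums, left_ns = np.cumsum(sums), np.cumsum(counts)
    best_gain, best_thr, tried = -np.inf, None, 0
    for k, edge in enumerate(edges):
        if left_ns[k] == 0 or left_ns[k] == len(x):
            continue  # one side would be empty
        tried += 1
        g = gain(left_sums[k], left_ns[k], left_sums[-1], len(x))
        if g > best_gain:
            best_gain, best_thr = g, edge
    return best_thr, best_gain, tried


# Synthetic feature with a jump in the target at x = 3.7.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=10_000)
y = (x > 3.7).astype(float) + rng.normal(0, 0.1, size=10_000)

thr_exact, gain_exact, tried_exact = best_split_exact(x, y)
thr_hist, gain_hist, tried_hist = best_split_hist(x, y)
print(f"exact: threshold={thr_exact:.3f}, candidates tried={tried_exact}")
print(f"hist:  threshold={thr_hist:.3f}, candidates tried={tried_hist}")
```

The histogram version lands close to the exact threshold while checking a tiny fraction of the candidates, which is the trade-off described above: a slightly worse split for a lot less work.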

So why am I not using CPU with `tree_method='hist'`? If you try it out yourself, you'll see that it takes ~7 min, which is still far from the GPU fitting time. Similarly, `tree_method='gpu_exact'` takes ~4 min, but likely yields better accuracy than `gpu_hist` or `hist`.

The [docs on parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) have a section on `tree_method` that goes over the details of each option.

In [None]:
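remain_list = list(X_train.columns)  # assumption: not defined above; taken to be the columns clf was fitted on
len(remain_list)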

In [None]:
%%time
sample_submission['isFraud'] = clf.predict_proba(X_test[remain_list])[:,1]
sample_submission.to_csv('33_xgboost.csv')
from IPython.display import FileLink
FileLink('33_xgboost.csv')