# About this kernel

Before I get started, I just wanted to say: huge props to Inversion! The official starter kernel is **AWESOME**; it's so simple, clean, straightforward, and pragmatic. It certainly saved me a lot of time wrangling with data, so that I can directly start tuning my models (real data scientists will call me lazy, but hey I'm an engineer I just want my stuff to work).

I noticed two tiny problems with it:
* It takes a lot of RAM to run, which means that if you are using a GPU, it might crash as you try to fill missing values.
* It takes a while to run (roughly 3500 seconds, which is just under an hour; again, I'm a lazy guy and I don't like waiting).

With this kernel, I bring some small changes:
* Decrease RAM usage, so that it won't crash when you change it to GPU. I simply changed when we are deleting unused variables.
* Decrease **running time from ~3500s to ~40s** (yes, that's almost 90x faster), at the cost of a slight decrease in score. This is done by adding a single argument.

Again, my changes are super minimal (cause Inversion's kernel was already so awesome), but I hope it will save you some time and trouble (so that you can start working on cool stuff).


### Changelog

**V4**
* Change some wording
* Print XGBoost version
* Add random state to XGB for reproducibility

In [None]:
import os

import numpy as np
import pandas as pd
from sklearn import preprocessing
import xgboost as xgb

In [None]:
print("XGBoost version:", xgb.__version__)

# Efficient Preprocessing

This preprocessing method is more careful with RAM usage, which avoids crashing the kernel when you switch from CPU to GPU. Otherwise, it is exactly the same procedure as the official starter.
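
As a rough illustration of why the timing of `del` matters: `fillna` returns a new DataFrame, so the original and the filled copy briefly coexist in memory. The sketch below uses a toy frame to measure that. The column names and the `mem_mb` helper are made up for illustration and are not part of the starter kernel.

```python
import numpy as np
import pandas as pd

def mem_mb(df):
    """Approximate in-memory size of a DataFrame in megabytes (hypothetical helper)."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2

# Toy stand-in for the merged competition data; the columns are illustrative only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.normal(size=100_000),
    "dist": rng.normal(size=100_000),
})
toy.loc[::10, "amount"] = np.nan  # every 10th row is missing

before = mem_mb(toy)
filled = toy.fillna(-999)         # returns a new object; `toy` is still alive here
peak = before + mem_mb(filled)    # both frames exist at this point
del toy                           # drop the original once it is no longer needed
print(f"original: {before:.1f} MB, while both exist: {peak:.1f} MB")
```

The fewer large frames are still alive when `fillna` runs, the lower that peak is, which is why the intermediate tables are deleted first in the cell below.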

In [None]:
%%time
train_transaction = pd.read_csv('../input/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('../input/test_transaction.csv', index_col='TransactionID')

train_identity = pd.read_csv('../input/train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('../input/test_identity.csv', index_col='TransactionID')

sample_submission = pd.read_csv('../input/sample_submission.csv', index_col='TransactionID')

train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)

print(train.shape)
print(test.shape)

y_train = train['isFraud'].copy()
del train_transaction, train_identity, test_transaction, test_identity

# Drop target, fill in NaNs
X_train = train.drop('isFraud', axis=1)
X_test = test.copy()

del train, test

X_train = X_train.fillna(-999)
X_test = X_test.fillna(-999)

# Label Encoding
for f in X_train.columns:
    if X_train[f].dtype=='object' or X_test[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X_train[f].values) + list(X_test[f].values))
        X_train[f] = lbl.transform(list(X_train[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values))   

# Training

To activate GPU usage, simply use `tree_method='gpu_hist'` (took me an hour to figure out, I wish XGBoost documentation was clearer about that).
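
One caveat, which depends on the version printed above: XGBoost 2.0 and later deprecate `gpu_hist` in favor of `tree_method='hist'` combined with `device='cuda'`. Here is a minimal sketch of picking the arguments from a version string. `gpu_params` is a hypothetical helper, not part of XGBoost.

```python
def gpu_params(version):
    """Return GPU-related XGBClassifier kwargs for an XGBoost version string.

    Hypothetical helper. Assumption: releases before 2.0 expect
    tree_method='gpu_hist'; releases from 2.0 on expect
    tree_method='hist' together with device='cuda'.
    """
    major = int(version.split(".")[0])
    if major >= 2:
        return {"tree_method": "hist", "device": "cuda"}
    return {"tree_method": "gpu_hist"}

print(gpu_params("0.90"))   # {'tree_method': 'gpu_hist'}
print(gpu_params("2.0.3"))  # {'tree_method': 'hist', 'device': 'cuda'}
```

It could be used as `xgb.XGBClassifier(..., **gpu_params(xgb.__version__))` in place of the hard-coded `tree_method` argument.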

In [None]:
clf = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=9,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    missing=-999,
    random_state=2019,
    tree_method='gpu_hist'  # THE MAGICAL PARAMETER
)

In [None]:
%time clf.fit(X_train, y_train)

Some of you must be wondering how we were able to decrease the fitting time by that much. The reason is not only that we are running on the GPU, but also that we are computing an approximation of the real underlying algorithm (which is a greedy algorithm). This hurts your score slightly, but is much faster as a result.

So why am I not using CPU with `tree_method='hist'`? If you try it out yourself, you'll realize it'll take ~ 7 min, which is still far from the GPU fitting time. Similarly, `tree_method='gpu_exact'` will take ~ 4 min, but likely yields better accuracy than `gpu_hist` or `hist`.

The [docs on parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) have a section on `tree_method` that goes over the details of each option.
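
To make the exact-versus-approximate trade-off concrete, here is a toy sketch of split finding on a single feature. It is not XGBoost's implementation: it scores splits with a plain squared-error criterion, and the function names are made up. The exact version tries every gap between sorted values, while the histogram version only tries a handful of quantile bin edges.

```python
import numpy as np

def split_sse(left, right):
    """Sum of squared errors when each side is predicted by its own mean."""
    return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

def best_split_exact(x, y):
    """Greedy exact search: evaluate a threshold between every pair of distinct neighbours."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_thr, best_sse = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # identical values cannot be separated
        sse = split_sse(ys[:i], ys[i:])
        if sse < best_sse:
            best_thr, best_sse = (xs[i] + xs[i - 1]) / 2, sse
    return best_thr, best_sse

def best_split_hist(x, y, n_bins=16):
    """Approximate search: only evaluate thresholds at quantile bin edges."""
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))
    best_thr, best_sse = None, np.inf
    for thr in edges:
        mask = x <= thr
        if mask.all() or not mask.any():
            continue  # a split must leave samples on both sides
        sse = split_sse(y[mask], y[~mask])
        if sse < best_sse:
            best_thr, best_sse = float(thr), sse
    return best_thr, best_sse

# Synthetic data with a true step at x = 0.37
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 2000)
y = (x > 0.37).astype(float) + rng.normal(0, 0.1, 2000)

exact_thr, exact_sse = best_split_exact(x, y)
hist_thr, hist_sse = best_split_hist(x, y)
print("exact:", exact_thr, exact_sse)
print("hist: ", hist_thr, hist_sse)
```

The histogram search evaluates 15 candidates instead of roughly 2000, and its best split can never beat the exact one because its candidates are a subset. That is the same speed-for-accuracy trade described above, in miniature.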

In [None]:
sample_submission['isFraud'] = clf.predict_proba(X_test)[:,1]
sample_submission.to_csv('simple_xgboost.csv')

## Stupid XGBoost
With no hyperparameter tuning at all, XGBoost produces a score of 0.938. To get to 1st place I need to improve by 1.1%.

### Examining the results

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% for validation; the fixed seed keeps the split reproducible
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=2019)

In [None]:
clf = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=9,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    missing=-999,
    random_state=2019,
    tree_method='gpu_hist'  # THE MAGICAL PARAMETER
)

In [None]:
%time clf.fit(X_train, y_train)

In [None]:
y_val_pred = clf.predict_proba(X_val)[:,1]

In [None]:
import sklearn
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
import seaborn as sns
import numpy as np

import matplotlib.pyplot as plt
"""
IMPORTANT must upgrade Seaborn to use in google Colab.
Classification_report is just the sklearn classification report
Classification_report will show up in the shell and notebooks
Results from confusion_viz will appear in notebooks only
"""
def classification_visualization(y_true,y_pred):
    """
    Prints the results of the functions. That's it
    """
    print(classification_report(y_true,y_pred))
    confusion_viz(y_true, y_pred)  # draws the heatmap; printing the returned Axes adds nothing
def confusion_viz(y_true, y_pred):
    """
    Uses labels as given
    Pass y_true,y_pred, same as any sklearn classification problem
    Inspired from code from a Ryan Herr Lambda School Lecture
    """
    y_true = np.array(y_true).ravel()
    labels = unique_labels(y_true,y_pred)
    matrix = confusion_matrix(y_true, y_pred)
    graph = sns.heatmap(matrix, annot=True,
                       fmt=',', linewidths=1,linecolor='grey',
                       square=True,
                       xticklabels=["Predicted\n" + str(i) for i in labels],
                       yticklabels=["Actual\n" + str(i) for i in labels],
                       robust=True,
                       cmap=sns.color_palette("coolwarm"))
    plt.yticks(rotation=0)
    plt.xticks(rotation=0)
    return graph

In [None]:
classification_visualization(y_val,y_val_pred.round() )

In [None]:
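from sklearn.metrics import precision_recall_curve
precision, recall, _ = precision_recall_curve(y_val, y_val_pred)  # returns (precision, recall, thresholds)
plt.plot(recall, precision)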

In [None]:
from sklearn.metrics import precision_recall_curve


def plt_prc(y_true, y_pred):
    # precision_recall_curve returns (precision, recall, thresholds)
    precision, recall, _ = precision_recall_curve(y_true, y_pred)
    plt.figure()
    lw = 2
    plt.plot(recall, precision, color='darkorange',
             lw=lw, label='Precision-recall curve')
    # A random classifier's precision equals the share of positives
    plt.axhline(np.mean(y_true), color='navy', lw=lw, linestyle='--',
                label='No-skill baseline')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-recall curve')
    plt.legend(loc="lower left")
    plt.show()
plt_prc(y_val,y_val_pred)

So this XGBoost model, with the parameters from the notebook I forked, did a good job of classifying the 0's, but less than fifty % of a
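
Note that `y_val_pred.round()` above effectively applies a 0.5 cutoff to the predicted probabilities. Here is a small sketch, on made-up labels and scores rather than the competition data, of how precision and recall shift when that cutoff moves. `precision_recall_at` is a hypothetical helper.

```python
import numpy as np

def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when scores >= threshold are labelled positive."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up example: six negatives followed by four positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.35, 0.40, 0.70, 0.90]

for t in (0.5, 0.3):
    p, r = precision_recall_at(y_true, y_score, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

In this toy case, lowering the cutoff from 0.5 to 0.3 catches every positive at the cost of more false alarms. Whether that trade is worthwhile on the real data depends on the precision-recall curve plotted above.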

In [None]:
from sklearn.metrics import average_precision_score
average_precision_score(y_val,y_val_pred)

## Stupid hyper parameter tuning.
Let's just try some hyperparameter tuning with no data analysis.

In [None]:
!pip install scikit-optimize

In [None]:
1e3

In [None]:
import skopt
from skopt import gbrt_minimize, gp_minimize
from skopt.utils import use_named_args
from skopt.space import Real, Categorical, Integer  

dim_n_estimators = Integer(low=100, high=1000, name='n_estimators')
dim_max_depth = Integer(low=1, high=20, name='max_depth')
dim_learning_rate = Real(1e-3,1e-1, name="learning_rate")

dimensions = [dim_n_estimators, dim_max_depth, dim_learning_rate]

default_parameters = [500, 9, 0.05]

def create_model(n_estimators, max_depth, learning_rate):
    
    clf = xgb.XGBClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        learning_rate=learning_rate,
        subsample=0.9,
        colsample_bytree=0.9,
        missing=-999,
        random_state=2019,
        tree_method='gpu_hist'
    )
    return clf


In [None]:
@use_named_args(dimensions=dimensions)
def fitness(n_estimators, max_depth, learning_rate):
    model = create_model(n_estimators = n_estimators, max_depth=max_depth, learning_rate=learning_rate)
    model.fit(X_train,y_train)
    y_val_pred = model.predict_proba(X_val)[:,1]  # score the freshly fitted model, not the global clf
    score = average_precision_score(y_val,y_val_pred)
    print("Average_Precision_Score = %f" %score)
    
    del model
    
    return -score
    
    

In [None]:
gp_result = gp_minimize(func=fitness,
                            dimensions=dimensions,
                            n_calls=12,
                            noise= 0.01,
                            n_jobs=-1,
                            kappa = 5,
                            x0=default_parameters)