# Motivation
Tree-based models like Random Forest and XGBoost have become very popular for tabular (structured) data problems and have gained a lot of traction in Kaggle competitions, for well-deserved reasons. Many of the notebooks for this competition are inspired by the fast.ai ML course. This notebook also uses fast.ai, but takes another approach: **Deep Learning**. 
This goes a bit against the industry consensus that Deep Learning is better suited to unstructured data such as images, audio, or text, and usually isn't very good at handling tabular data. Yet the introduction of embeddings for categorical data has changed this perspective, so we'll use fast.ai's tabular model to tackle this competition and see how well a Deep Learning approach can do. 

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
from fastai import *
from fastai.tabular import *
from fastai.data.transforms import Normalize
import pandas as pd
import numpy as np

# Load Data
After importing the necessary fast.ai modules, mainly 'fastai.tabular', let's load the data. 

In [3]:
# Read the training and validation splits, dropping the 'Id' and 'BUTTER' columns.
train_df = pd.read_csv('train_split.csv', low_memory=False).drop(columns=['Id','BUTTER'])
valid_df = pd.read_csv('validation.csv', low_memory=False).drop(columns=['Id','BUTTER'])

In [4]:
train_df.head()

Unnamed: 0,B_OWNPV_CHI2,B_IPCHI2_OWNPV,B_FDCHI2_OWNPV,B_DIRA_OWNPV,B_PT,Kst_892_0_IP_OWNPV,Kst_892_0_cosThetaH,Kplus_IP_OWNPV,Kplus_P,piminus_IP_OWNPV,piminus_P,gamma_PT,piminus_ETA,Kplus_ETA,signal
0,28.878847,2.662533,2924.690991,0.999997,19085.568945,0.569198,-0.575502,0.581565,66850.893711,0.637969,14298.486178,7940.694301,2.628526,2.680116,1.0
1,34.233566,0.092746,346.948714,0.999997,6631.244546,0.248707,-0.615941,0.277898,39274.475071,0.148815,11553.163934,3904.681337,3.292504,3.085754,1.0
2,36.113632,2.442423,238.553023,0.999986,7740.918989,0.222347,0.249383,0.216576,27757.153899,0.24984,24081.196003,4738.891687,3.433676,3.121906,1.0
3,14.286133,6.337556,227.375132,0.999806,6740.281614,0.347316,0.591884,0.306927,10593.207077,0.400748,11343.521945,3308.94375,2.291867,2.200712,0.0
4,60.474274,7.632751,106.73065,0.999905,5556.388794,0.204273,0.65585,0.1966,11801.249543,0.223101,25940.693317,4026.326871,3.290073,3.281829,0.0


In [5]:
len(train_df),len(valid_df)

(169935, 42727)

# Data Pre-processing
The competition's evaluation metric is RMSLE (root mean squared log error). So if we take the log of the target and train the model to predict that, we can just use the good old RMSE as our loss function. It's just easier this way.
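
To see why this works, here is a minimal sketch with made-up numbers (not taken from the competition data). It assumes the common `log(1 + x)` form of RMSLE; the competition's exact definition may differ. RMSE computed on the log-transformed values comes out identical to RMSLE on the raw values.

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error."""
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

def rmsle(pred, actual):
    """Root mean squared log error, using log(1 + x) (an assumption)."""
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2)))

# Illustrative values only
actual = np.array([10_000.0, 25_000.0, 60_000.0])
pred = np.array([12_000.0, 20_000.0, 66_000.0])

# RMSE on log-transformed values matches RMSLE on the raw values
rmse_on_logs = rmse(np.log1p(pred), np.log1p(actual))
print(rmsle(pred, actual), rmse_on_logs)
```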

In [6]:
# Define the pre-processing steps for our fast.ai DataLoaders
procs=[Normalize]

Here the only step is normalization of the continuous columns. A dataset with missing values or categorical columns would also need 'FillMissing' and 'Categorify'. Plain and simple. 
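
As a rough illustration of what normalization does, the sketch below standardizes two continuous columns with statistics taken from the training rows only and reuses them for the validation rows. It uses plain pandas with made-up values, and the helper name `standardize` is hypothetical; fast.ai's `Normalize` handles this internally and may differ in details.

```python
import pandas as pd

def standardize(train, valid, cols):
    """Scale cols to zero mean and unit variance using training statistics."""
    means = train[cols].mean()
    stds = train[cols].std(ddof=0)
    train_out, valid_out = train.copy(), valid.copy()
    train_out[cols] = (train[cols] - means) / stds
    # Validation rows reuse the training mean and std, never their own
    valid_out[cols] = (valid[cols] - means) / stds
    return train_out, valid_out

# Made-up values, for illustration only
train = pd.DataFrame({"B_PT": [5000.0, 7000.0, 9000.0], "Kplus_ETA": [2.2, 2.8, 3.4]})
valid = pd.DataFrame({"B_PT": [7000.0], "Kplus_ETA": [2.8]})

train_n, valid_n = standardize(train, valid, ["B_PT", "Kplus_ETA"])
print(train_n.round(3))
print(valid_n.round(3))
```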

# Building the Model


In [7]:
train_df.dtypes
g = train_df.columns.to_series().groupby(train_df.dtypes).groups
g

{float64: ['B_OWNPV_CHI2', 'B_IPCHI2_OWNPV', 'B_FDCHI2_OWNPV', 'B_DIRA_OWNPV', 'B_PT', 'Kst_892_0_IP_OWNPV', 'Kst_892_0_cosThetaH', 'Kplus_IP_OWNPV', 'Kplus_P', 'piminus_IP_OWNPV', 'piminus_P', 'gamma_PT', 'piminus_ETA', 'Kplus_ETA', 'signal']}

Have a look at the column types to see which are categorical and which are continuous. We'll use this to build the fast.ai DataLoaders for training our learner. 
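
One way to make this split programmatically is pandas' `select_dtypes`. The sketch below uses a small made-up frame; the `color` column is hypothetical and is only there to show a categorical column being picked up.

```python
import pandas as pd

toy = pd.DataFrame({
    "B_PT": [19085.6, 6631.2],
    "Kplus_ETA": [2.68, 3.09],
    "color": ["red", "blue"],   # hypothetical categorical column
    "signal": [1.0, 0.0],
})
target = "signal"

# Numeric columns are treated as continuous, everything else as categorical
features = toy.drop(columns=[target])
cont_cols = features.select_dtypes(include="number").columns.tolist()
cat_cols = features.select_dtypes(exclude="number").columns.tolist()
print(cont_cols, cat_cols)
```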

In [8]:
train_df.columns

Index(['B_OWNPV_CHI2', 'B_IPCHI2_OWNPV', 'B_FDCHI2_OWNPV', 'B_DIRA_OWNPV',
       'B_PT', 'Kst_892_0_IP_OWNPV', 'Kst_892_0_cosThetaH', 'Kplus_IP_OWNPV',
       'Kplus_P', 'piminus_IP_OWNPV', 'piminus_P', 'gamma_PT', 'piminus_ETA',
       'Kplus_ETA', 'signal'],
      dtype='object')

In [9]:
# Prepare the categorical and continuous column lists for building the tabular DataLoaders.
cat_vars = []

cont_vars = ['B_OWNPV_CHI2', 'B_IPCHI2_OWNPV', 'B_FDCHI2_OWNPV',
       'B_DIRA_OWNPV', 'B_PT', 'Kst_892_0_IP_OWNPV', 'Kst_892_0_cosThetaH',
       'Kplus_IP_OWNPV', 'Kplus_P', 'piminus_IP_OWNPV', 'piminus_P',
       'gamma_PT', 'piminus_ETA', 'Kplus_ETA']

In [10]:
train_df

Unnamed: 0,B_OWNPV_CHI2,B_IPCHI2_OWNPV,B_FDCHI2_OWNPV,B_DIRA_OWNPV,B_PT,Kst_892_0_IP_OWNPV,Kst_892_0_cosThetaH,Kplus_IP_OWNPV,Kplus_P,piminus_IP_OWNPV,piminus_P,gamma_PT,piminus_ETA,Kplus_ETA,signal
0,28.878847,2.662533,2924.690991,0.999997,19085.568945,0.569198,-0.575502,0.581565,66850.893711,0.637969,14298.486178,7940.694301,2.628526,2.680116,1.0
1,34.233566,0.092746,346.948714,0.999997,6631.244546,0.248707,-0.615941,0.277898,39274.475071,0.148815,11553.163934,3904.681337,3.292504,3.085754,1.0
2,36.113632,2.442423,238.553023,0.999986,7740.918989,0.222347,0.249383,0.216576,27757.153899,0.249840,24081.196003,4738.891687,3.433676,3.121906,1.0
3,14.286133,6.337556,227.375132,0.999806,6740.281614,0.347316,0.591884,0.306927,10593.207077,0.400748,11343.521945,3308.943750,2.291867,2.200712,0.0
4,60.474274,7.632751,106.730650,0.999905,5556.388794,0.204273,0.655850,0.196600,11801.249543,0.223101,25940.693317,4026.326871,3.290073,3.281829,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169930,12.893907,0.151000,984.507893,1.000000,11121.462006,0.491973,-0.548488,0.541038,72639.491690,0.398007,22999.329625,3426.012947,3.220792,3.165776,0.0
169931,33.269855,0.422886,528.518223,0.999999,6547.421388,0.324063,-0.397384,0.286336,50331.007785,0.539486,8898.424053,3657.986285,3.077444,3.243626,1.0
169932,8.028983,2.256855,377.706643,0.999878,5160.700365,0.626228,-0.176330,0.598550,16262.726940,0.705622,5497.144994,4276.250211,2.379503,2.772514,1.0
169933,59.286201,3.385165,675.095332,0.999828,9506.037618,0.278407,0.948246,0.231088,11309.374549,0.308349,20879.660445,3915.660229,2.323603,2.348653,0.0


In [11]:
# Rearrange the training set before feeding it into the DataLoaders
dep_var = 'signal'
df = train_df[cat_vars + cont_vars + [dep_var]].copy()

Time to create our validation set, the most important step. Since this dataset is somewhat of a time series, we need to make sure that validation set entries happen AFTER all entries in the training set; otherwise the model will be cheating and won't generalize well. Note that 'TabularDataLoaders.from_df' holds out a random 20% of the rows unless 'valid_idx' is passed. 
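
A minimal sketch of a time-ordered split, assuming the rows are already sorted by time (this notebook does not show a timestamp column, so that ordering is an assumption). The helper `last_rows_as_validation` is hypothetical; the resulting validation indices are the kind of value that `TabularDataLoaders.from_df` accepts through its `valid_idx` argument.

```python
import numpy as np

def last_rows_as_validation(n_rows, valid_pct=0.2):
    """Return (train_idx, valid_idx) with the final valid_pct of rows held out."""
    n_valid = int(round(n_rows * valid_pct))
    cut = n_rows - n_valid
    return np.arange(cut), np.arange(cut, n_rows)

# Ten rows, last 20% held out
train_idx, valid_idx = last_rows_as_validation(10, valid_pct=0.2)
print(train_idx, valid_idx)
```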

In [12]:
from fastai.tabular.data import TabularDataLoaders

In [13]:
# Use fast.ai's TabularDataLoaders factory to wrap our training data, ready for training
data = TabularDataLoaders.from_df(df, cat_names=cat_vars, cont_names=cont_vars, procs=procs, y_names=dep_var)


In [14]:
data.show_batch()

Unnamed: 0,B_OWNPV_CHI2,B_IPCHI2_OWNPV,B_FDCHI2_OWNPV,B_DIRA_OWNPV,B_PT,Kst_892_0_IP_OWNPV,Kst_892_0_cosThetaH,Kplus_IP_OWNPV,Kplus_P,piminus_IP_OWNPV,piminus_P,gamma_PT,piminus_ETA,Kplus_ETA,signal
0,49.068467,1.471852,1668.45617,0.999998,12160.437408,0.546918,-0.199049,0.643068,31573.789036,0.40007,21161.294931,7545.394002,2.887407,2.923944,1.0
1,33.170727,2.205191,237.649367,0.999838,4611.243265,0.269019,-0.204553,0.334923,10187.02379,0.184128,8503.592415,3647.61422,2.782956,2.380438,1.0
2,20.954386,4.815278,318.533618,0.999995,8162.881784,0.239195,-0.620043,0.259206,67188.006805,0.226728,16899.85915,3513.940735,3.598323,3.394847,0.0
3,47.126931,1.010613,141.36317,0.999985,10409.026333,0.207534,0.845583,0.183372,9423.700688,0.235512,20424.845633,8933.82221,3.269624,3.455156,0.0
4,51.543175,6.694719,1358.092348,0.999755,5369.266572,0.368295,-0.22923,0.463762,10638.671141,0.252619,8251.985794,3681.267843,2.84702,2.538678,0.0
5,7.046338,0.1302,575.715647,0.999999,10972.552768,0.381012,0.083197,0.445311,18344.432063,0.338042,18515.90833,5858.030274,2.600619,2.381447,0.0
6,17.545283,3.161073,5675.258246,0.999994,9714.529301,2.128631,-0.364013,2.092698,18613.690025,0.797955,9480.605847,6362.613294,2.500552,2.53193,1.0
7,62.182065,4.493889,297.55596,0.999846,7516.475617,0.20509,-0.219213,0.27778,6659.065614,0.139449,7538.023205,5542.593749,2.482442,2.369295,0.0
8,10.599662,3.35516,656.072323,0.999982,13040.566425,0.329322,-0.315135,0.283032,66975.312773,0.502113,18291.413885,3647.159631,2.851285,2.821613,1.0
9,29.712267,5.72716,3668.974616,0.99998,5840.80705,0.881368,0.137922,0.895917,20967.991946,0.934745,13767.69699,3793.189221,3.02752,3.145737,1.0


# Model

Finally, it's time for some training. We will fire up a fast.ai 'tabular_learner' from the DataLoaders we just created.

In [15]:
from fastai.tabular.learner import tabular_learner
from fastai.tabular.data import *

In [16]:
dls = TabularDataLoaders.from_df(df, '.', procs=procs, cat_names=cat_vars, cont_names=cont_vars, 
                                 y_names="signal", bs=64)
learn = tabular_learner(dls)

In [17]:
learn.model

TabularModel(
  (embeds): ModuleList()
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(14, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): BatchNorm1d(14, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=14, out_features=200, bias=False)
      (2): ReLU(inplace=True)
    )
    (1): LinBnDrop(
      (0): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=200, out_features=100, bias=False)
      (2): ReLU(inplace=True)
    )
    (2): LinBnDrop(
      (0): Linear(in_features=100, out_features=1, bias=True)
    )
  )
)

As the printout shows, the embedding list is empty because this dataset has no categorical columns, and the embedding dropout is 0. The 14 continuous columns go through a batch norm layer and then two fully connected layers with 200 and 100 nodes, each preceded by BatchNorm and followed by ReLU, before a final linear layer with a single output. Quite standard.
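
As a sanity check on the printed architecture, the trainable parameters can be counted by hand. The sketch below is plain Python with layer sizes copied from the printout; it assumes every `BatchNorm1d` contributes one weight and one bias per feature (`affine=True`) and that the `Linear` layers marked `bias=False` have no bias term.

```python
def linear_params(n_in, n_out, bias):
    """Weights plus optional bias of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def batchnorm_params(n_features):
    """Learnable scale and shift of an affine batch norm layer."""
    return 2 * n_features

# Layer sizes as shown in the printed TabularModel
layers = [
    ("bn_cont", batchnorm_params(14)),
    ("layers.0 batchnorm", batchnorm_params(14)),
    ("layers.0 linear", linear_params(14, 200, bias=False)),
    ("layers.1 batchnorm", batchnorm_params(200)),
    ("layers.1 linear", linear_params(200, 100, bias=False)),
    ("layers.2 linear", linear_params(100, 1, bias=True)),
]
total = sum(n for _, n in layers)
for name, n in layers:
    print(f"{name:20s} {n:6d}")
print("total", total)
```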

In [18]:
row, clas, probs = learn.predict(df.iloc[0])

In [19]:
clas, probs

(tensor([0.1176]), tensor([0.1176]))

Use fast.ai's *lr_find* function to find a suitable learning rate, then train with 'fit one cycle'. 
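
For intuition about what a one-cycle schedule does to the learning rate, here is a small NumPy sketch: a cosine warm-up to the peak rate followed by a cosine decay. The helper `one_cycle_lrs` is hypothetical, and the constants (`pct_start=0.25`, starting at `lr_max/25`, ending at `lr_max/1e5`) are assumed defaults; check them against the fast.ai version in use.

```python
import numpy as np

def cosine_interp(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (1 + np.cos(np.pi * pct))

def one_cycle_lrs(n_steps, lr_max, pct_start=0.25, div=25.0, div_final=1e5):
    """Learning rate per step: warm up to lr_max, then anneal toward zero."""
    n_up = max(1, int(n_steps * pct_start))
    n_down = n_steps - n_up
    up = cosine_interp(lr_max / div, lr_max, np.linspace(0, 1, n_up, endpoint=False))
    down = cosine_interp(lr_max, lr_max / div_final, np.linspace(0, 1, n_down))
    return np.concatenate([up, down])

lrs = one_cycle_lrs(n_steps=100, lr_max=3e-4)
print(lrs[0], lrs.max(), lrs[-1])
```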

In [20]:
# fit_one_cycle is patched onto Learner by fastai.callback.schedule;
# importing fastai.tabular.all makes sure that module is loaded.
from fastai.tabular.all import *

In [21]:
# Train through the Learner, not learn.model (the bare PyTorch module has no fit_one_cycle)
learn.fit_one_cycle(5, 3e-4, wd=0.2)

The best result reaches 0.227 RMSLE, which I think beats #1 on the Kaggle leaderboard. 

# Conclusion
I think overall people still prefer XGBoost or Random Forest for tabular Kaggle competitions, since they usually yield the best scores. However, Deep Learning is also a viable approach, though it lacks a bit on the explainability side. At the very least it can be used for ensembling, so it's worth exploring. 

In [1]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor,GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import scipy.sparse

In [2]:
class GradientBoostingFeatureGenerator(BaseEstimator, TransformerMixin):
    """
    Feature generator from a gradient boosting

    References:
        - Practical Lessons from Predicting Clicks on Ads at Facebook
    """

    def __init__(
        self,
        stack_to_X=True,
        sparse_feat=True,
        add_probs=True,
        criterion="friedman_mse",
        init=None,
        learning_rate=0.1,
        loss="deviance",
        max_depth=3,
        max_features=None,
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        min_impurity_split=None,
        min_samples_leaf=1,
        min_samples_split=2,
        min_weight_fraction_leaf=0.0,
        n_estimators=50,
        n_iter_no_change=None,
        presort="auto",
        random_state=None,
        subsample=1.0,
        tol=0.0001,
        validation_fraction=0.1,
        verbose=0,
        warm_start=False,
    ):

        # Deciding whether to append features or simply return generated features
        self.stack_to_X = stack_to_X
        self.sparse_feat = sparse_feat
        self.add_probs = add_probs

        # GBM hyperparameters
        self.criterion = criterion
        self.init = init
        self.learning_rate = learning_rate
        self.loss = loss
        self.max_depth = max_depth
        self.max_features = max_features
        self.max_leaf_nodes = max_leaf_nodes
        self.min_impurity_decrease = min_impurity_decrease
        self.min_impurity_split = min_impurity_split
        self.min_samples_leaf = min_samples_leaf
        self.min_samples_split = min_samples_split
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.n_estimators = n_estimators
        self.n_iter_no_change = n_iter_no_change
        self.presort = presort
        self.random_state = random_state
        self.subsample = subsample
        self.tol = tol
        self.validation_fraction = validation_fraction
        self.verbose = verbose
        self.warm_start = warm_start

    def _get_leaves(self, X):
        X_leaves = self.gbm.apply(X)
        n_rows, n_cols, _ = X_leaves.shape
        X_leaves = X_leaves.reshape(n_rows, n_cols)

        return X_leaves

    def _decode_leaves(self, X):
        # One-hot encode the leaf indices with the fitted encoder
        if self.sparse_feat:
            return scipy.sparse.csr_matrix(self.encoder.transform(X))
        else:
            return self.encoder.transform(X).toarray()

    def fit(self, X, y):

        self.gbm = GradientBoostingClassifier(
            criterion=self.criterion,
            init=self.init,
            learning_rate=self.learning_rate,
            loss=self.loss,
            max_depth=self.max_depth,
            max_features=self.max_features,
            max_leaf_nodes=self.max_leaf_nodes,
            min_impurity_decrease=self.min_impurity_decrease,
            min_impurity_split=self.min_impurity_split,
            min_samples_leaf=self.min_samples_leaf,
            min_samples_split=self.min_samples_split,
            min_weight_fraction_leaf=self.min_weight_fraction_leaf,
            n_estimators=self.n_estimators,
            n_iter_no_change=self.n_iter_no_change,
            presort=self.presort,
            random_state=self.random_state,
            subsample=self.subsample,
            tol=self.tol,
            validation_fraction=self.validation_fraction,
            verbose=self.verbose,
            warm_start=self.warm_start,
        )

        self.gbm.fit(X, y)
        self.encoder = OneHotEncoder(categories="auto")
        X_leaves = self._get_leaves(X)
        self.encoder.fit(X_leaves)
        return self

    def transform(self, X):
        """
        Generates leaf features using the fitted self.gbm and saves them in R.
        If 'self.stack_to_X==True' then '.transform' returns the original features with 'R' appended as columns.
        If 'self.stack_to_X==False' then '.transform' returns only the leaf features from 'R'.
        If 'self.add_probs==True' and the features are stacked, the predicted class probabilities are appended too.
        The blocks are stacked as sparse matrices and the result is returned as a dense array.
        """
        R = self._decode_leaves(self._get_leaves(X))

        if self.stack_to_X:
            # Original features first, then the one-hot leaf indicators
            blocks = [scipy.sparse.csr_matrix(X), scipy.sparse.csr_matrix(R)]
            if self.add_probs:
                blocks.append(scipy.sparse.csr_matrix(self.gbm.predict_proba(X)))
            X_new = scipy.sparse.hstack(blocks, format="csr")
        else:
            X_new = scipy.sparse.csr_matrix(R)

        # X_new is always sparse here, so .toarray() is safe in every branch
        return X_new.toarray()


In [4]:
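#from sktools import GradientBoostingFeatureGenerator
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston
boston_data = load_boston()
boston, y = boston_data['data'], boston_data['target']
# Binarize the target: 1 if above the mean, else 0
y = np.where(y > y.mean(), 1, 0)
mf = GradientBoostingFeatureGenerator(sparse_feat=False)
mf.fit(boston, y)
# boston is a NumPy array, so slice the first two rows instead of calling .head(2)
mf.transform(boston[:2])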
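
The core idea of the class above, taken from the Facebook paper it references, is to treat the leaf that each sample lands in, per tree, as a categorical feature. Here is a minimal sketch of that idea on synthetic data; the dataset and the small hyperparameters are illustrative assumptions, not the setup used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
gbm.fit(X, y)

# apply() gives the leaf index per sample and tree:
# shape (n_samples, n_estimators, 1) for a binary target
leaves = gbm.apply(X)[:, :, 0]

# One-hot encode the leaf indices so each (tree, leaf) pair becomes a binary column
encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)

# Append the leaf indicators to the original features
X_new = np.hstack([X, leaf_features.toarray()])
print(leaves.shape, leaf_features.shape, X_new.shape)
```

Each row of `leaf_features` has exactly one active column per tree, so a linear model trained on `X_new` can pick up the interactions the trees found.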