# Motivation
Tree-based models like Random Forest and XGBoost have become very popular for tabular (structured) data problems and have gained a lot of traction in Kaggle competitions, for well-deserved reasons. Many of the notebooks for this competition are inspired by the fast.ai ML course. This notebook also uses fast.ai, but with a different approach: **Deep Learning**.
This goes a bit against the industry consensus that Deep Learning is better suited to unstructured data such as images, audio, or text, and usually does not handle tabular data very well. However, the introduction of embeddings for categorical data changed this perspective, so we'll use fast.ai's tabular model to tackle this competition and see how well a Deep Learning approach can do.

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
from fastai.tabular.all import *  # fastai v2 entry point for tabular work
from fastai.data.transforms import Normalize
import pandas as pd
import numpy as np

# Load Data
After importing the necessary fast.ai modules (mainly 'fastai.tabular'), let's load the data.

In [3]:
# Read in the dataset. Since Test.csv and Valid.csv don't have labels, this dataset will be used to create our own validation set.
train_df = pd.read_csv('train_fai.csv', low_memory=False).drop(columns=['Id','BUTTER'])
valid_df = pd.read_csv('validation_fai.csv', low_memory=False).drop(columns=['Id','BUTTER'])

In [4]:
train_df.head()

Unnamed: 0,B_OWNPV_CHI2,B_IPCHI2_OWNPV,B_FDCHI2_OWNPV,B_DIRA_OWNPV,B_PT,Kst_892_0_IP_OWNPV,Kst_892_0_cosThetaH,Kplus_IP_OWNPV,Kplus_P,piminus_IP_OWNPV,piminus_P,gamma_PT,piminus_ETA,Kplus_ETA,signal
0,28.878847,2.662533,2924.690991,0.999997,19085.568945,0.569198,-0.575502,0.581565,66850.893711,0.637969,14298.486178,7940.694301,2.628526,2.680116,1.0
1,34.233566,0.092746,346.948714,0.999997,6631.244546,0.248707,-0.615941,0.277898,39274.475071,0.148815,11553.163934,3904.681337,3.292504,3.085754,1.0
2,36.113632,2.442423,238.553023,0.999986,7740.918989,0.222347,0.249383,0.216576,27757.153899,0.24984,24081.196003,4738.891687,3.433676,3.121906,1.0
3,14.286133,6.337556,227.375132,0.999806,6740.281614,0.347316,0.591884,0.306927,10593.207077,0.400748,11343.521945,3308.94375,2.291867,2.200712,0.0
4,60.474274,7.632751,106.73065,0.999905,5556.388794,0.204273,0.65585,0.1966,11801.249543,0.223101,25940.693317,4026.326871,3.290073,3.281829,0.0


In [5]:
len(train_df),len(valid_df)

(170352, 42310)

# Data Pre-processing
The competition's evaluation method uses RMSLE (root mean squared log error). So if we take the log of the target and of our predictions, we can just use the good old RMSE as our loss function. It's just easier this way.
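To make the RMSLE/RMSE relationship concrete, here is a small dependency-free sketch. It assumes the log(1 + x) variant of RMSLE, and the numbers are made up purely for illustration; they are not taken from the competition data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) (an assumed variant)."""
    return math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )

# Made-up positive values, for illustration only.
actual = [100.0, 2500.0, 40000.0]
predicted = [110.0, 2300.0, 42000.0]

# Route 1: compute RMSLE directly on the raw values.
score_direct = rmsle(actual, predicted)

# Route 2: log-transform both sides first, then use plain RMSE.
log_actual = [math.log1p(v) for v in actual]
log_predicted = [math.log1p(v) for v in predicted]
score_via_logs = rmse(log_actual, log_predicted)

print(score_direct, score_via_logs)  # the two numbers agree
```

In practice this means the model can be trained on the log-transformed target with an ordinary MSE/RMSE loss.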

In [6]:
# Defining pre-processing we want for our fast.ai DataBunch
procs=[Normalize]

The usual fast.ai recipe is to fix the missing values, categorify all categorical columns, then normalize. The procs list above only applies Normalize, the last of these steps. Plain and simple.
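For intuition about the normalize step, here is a minimal sketch of what standardizing one continuous column looks like. It is not fast.ai's implementation; the helper names `fit_normalizer` and `normalize` and the small `eps` term are assumptions for illustration. The values are the first five B_PT entries from the preview above, rounded.

```python
from statistics import fmean, pstdev

def fit_normalizer(column):
    """Return (mean, std) computed from the training values of one column."""
    return fmean(column), pstdev(column)

def normalize(column, mean, std, eps=1e-7):
    """Shift and scale values with statistics learned on the training set."""
    return [(v - mean) / (std + eps) for v in column]

# First five B_PT values from the training preview, rounded to two decimals.
train_b_pt = [19085.57, 6631.24, 7740.92, 6740.28, 5556.39]

mean, std = fit_normalizer(train_b_pt)
scaled = normalize(train_b_pt, mean, std)

# After scaling, the column has (almost exactly) zero mean and unit spread.
print(fmean(scaled), pstdev(scaled))
```

The important design point is that the mean and standard deviation come from the training rows only and are then reused unchanged on validation and test rows.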

# Building the Model


In [7]:
train_df.dtypes
g = train_df.columns.to_series().groupby(train_df.dtypes).groups
g

{float64: ['B_OWNPV_CHI2', 'B_IPCHI2_OWNPV', 'B_FDCHI2_OWNPV', 'B_DIRA_OWNPV', 'B_PT', 'Kst_892_0_IP_OWNPV', 'Kst_892_0_cosThetaH', 'Kplus_IP_OWNPV', 'Kplus_P', 'piminus_IP_OWNPV', 'piminus_P', 'gamma_PT', 'piminus_ETA', 'Kplus_ETA', 'signal']}

Have a look at all the column types and see which are categorical and which are continuous. We'll use this to build the fast.ai DataBunch for training our learner. 

In [8]:
train_df.columns

Index(['B_OWNPV_CHI2', 'B_IPCHI2_OWNPV', 'B_FDCHI2_OWNPV', 'B_DIRA_OWNPV',
       'B_PT', 'Kst_892_0_IP_OWNPV', 'Kst_892_0_cosThetaH', 'Kplus_IP_OWNPV',
       'Kplus_P', 'piminus_IP_OWNPV', 'piminus_P', 'gamma_PT', 'piminus_ETA',
       'Kplus_ETA', 'signal'],
      dtype='object')

In [9]:
# prepare categorical and continuous data columns for building the Tabular DataBunch.
cat_vars = []

cont_vars = ['B_OWNPV_CHI2', 'B_IPCHI2_OWNPV', 'B_FDCHI2_OWNPV',
       'B_DIRA_OWNPV', 'B_PT', 'Kst_892_0_IP_OWNPV', 'Kst_892_0_cosThetaH',
       'Kplus_IP_OWNPV', 'Kplus_P', 'piminus_IP_OWNPV', 'piminus_P',
       'gamma_PT', 'piminus_ETA', 'Kplus_ETA']

In [10]:
train_df

Unnamed: 0,B_OWNPV_CHI2,B_IPCHI2_OWNPV,B_FDCHI2_OWNPV,B_DIRA_OWNPV,B_PT,Kst_892_0_IP_OWNPV,Kst_892_0_cosThetaH,Kplus_IP_OWNPV,Kplus_P,piminus_IP_OWNPV,piminus_P,gamma_PT,piminus_ETA,Kplus_ETA,signal
0,28.878847,2.662533,2924.690991,0.999997,19085.568945,0.569198,-0.575502,0.581565,66850.893711,0.637969,14298.486178,7940.694301,2.628526,2.680116,1.0
1,34.233566,0.092746,346.948714,0.999997,6631.244546,0.248707,-0.615941,0.277898,39274.475071,0.148815,11553.163934,3904.681337,3.292504,3.085754,1.0
2,36.113632,2.442423,238.553023,0.999986,7740.918989,0.222347,0.249383,0.216576,27757.153899,0.249840,24081.196003,4738.891687,3.433676,3.121906,1.0
3,14.286133,6.337556,227.375132,0.999806,6740.281614,0.347316,0.591884,0.306927,10593.207077,0.400748,11343.521945,3308.943750,2.291867,2.200712,0.0
4,60.474274,7.632751,106.730650,0.999905,5556.388794,0.204273,0.655850,0.196600,11801.249543,0.223101,25940.693317,4026.326871,3.290073,3.281829,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
170347,12.893907,0.151000,984.507893,1.000000,11121.462006,0.491973,-0.548488,0.541038,72639.491690,0.398007,22999.329625,3426.012947,3.220792,3.165776,0.0
170348,33.269855,0.422886,528.518223,0.999999,6547.421388,0.324063,-0.397384,0.286336,50331.007785,0.539486,8898.424053,3657.986285,3.077444,3.243626,1.0
170349,30.405409,6.838541,245.848798,0.999909,6634.286140,0.286608,0.522046,0.194381,32017.565715,0.446696,22491.555816,4975.574892,3.510440,3.603324,0.0
170350,8.028983,2.256855,377.706643,0.999878,5160.700365,0.626228,-0.176330,0.598550,16262.726940,0.705622,5497.144994,4276.250211,2.379503,2.772514,1.0


In [11]:
# rearrange the training set before feeding it into the databunch
dep_var = 'signal'
df = train_df[cat_vars + cont_vars + [dep_var]].copy()

Time to create our validation set, the most important step. Since this dataset is somewhat of a time series, we need to make sure the validation set entries happen AFTER all entries in the training set; otherwise the model will be cheating and won't generalize well. 
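One way to get such a fixed, ordered split is to append the validation rows after the training rows and tell fast.ai exactly which row positions form the validation set. As far as I know, 'TabularDataLoaders.from_df' falls back to a random split when no 'valid_idx' is given, so this has to be requested explicitly. The sketch below only computes the indices; the helper name `holdout_indices` and the `full_df` frame mentioned in the comments are hypothetical.

```python
def holdout_indices(n_train, n_valid):
    """Row positions of the validation rows when the validation frame is
    appended directly after the training frame."""
    return list(range(n_train, n_train + n_valid))

# Row counts reported by len(train_df), len(valid_df) in this notebook.
n_train, n_valid = 170352, 42310
valid_idx = holdout_indices(n_train, n_valid)

print(valid_idx[0], valid_idx[-1], len(valid_idx))

# With fast.ai, this list could then be used along these lines (untested sketch):
#   full_df = pd.concat([train_df, valid_df], ignore_index=True)
#   TabularDataLoaders.from_df(full_df, ..., valid_idx=valid_idx)
```

Every training row then sits before every validation row, which is the ordering the paragraph above asks for.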

In [12]:
from fastai.tabular.data import TabularDataLoaders

In [13]:
# Use the fast.ai data block API to put our training data into the DataBunch, ready for training
data = TabularDataLoaders.from_df(df, cat_names=cat_vars, cont_names=cont_vars, procs=procs, y_names=dep_var)


In [14]:
data.show_batch()

Unnamed: 0,B_OWNPV_CHI2,B_IPCHI2_OWNPV,B_FDCHI2_OWNPV,B_DIRA_OWNPV,B_PT,Kst_892_0_IP_OWNPV,Kst_892_0_cosThetaH,Kplus_IP_OWNPV,Kplus_P,piminus_IP_OWNPV,piminus_P,gamma_PT,piminus_ETA,Kplus_ETA,signal
0,51.995236,4.536544,227.362384,0.999955,7470.622076,0.26175,-0.135363,0.271361,64960.340843,0.30077,28086.437589,4510.429184,3.933022,4.168918,0.0
1,14.716708,4.041287,305.552841,0.99964,8040.760226,0.284416,0.963321,0.234621,12560.041526,0.309272,21362.095821,4475.803225,2.674244,2.735106,0.0
2,13.067174,2.175954,2202.285649,0.999994,5000.619747,1.020902,0.168225,1.284032,22554.572532,0.794177,25058.322271,3084.997013,3.234347,3.143624,0.0
3,34.836101,5.82578,269.906071,0.999966,4345.999782,0.243963,-0.335012,0.248214,46810.636893,0.378233,7568.501019,3785.708201,3.30618,3.944712,0.0
4,23.053198,0.220533,254.570546,0.999977,7222.524932,0.231138,0.595629,0.258591,14921.83099,0.227353,22410.328154,3357.786811,2.649248,2.516724,1.0
5,56.961617,1.071579,7396.173316,0.999991,10932.127912,0.956862,0.302077,1.125037,15972.121027,0.829219,20041.38865,7273.423291,2.716556,2.765273,1.0
6,52.504078,2.8095,424.516591,0.999986,11318.878908,0.240555,0.821447,0.178708,22659.734635,0.290096,29210.216846,4920.521947,2.626247,2.72329,0.0
7,49.335835,0.204402,312.98463,0.999998,6794.391658,0.289229,-0.020246,0.256601,22694.666279,0.463536,5017.078215,5633.497556,2.89816,3.138744,1.0
8,27.568237,5.246316,8109.486695,1.0,23112.449267,0.963872,-0.207135,0.990731,75402.694862,1.063759,32684.861144,11296.381975,2.930873,2.882131,0.0
9,34.638153,3.022676,293.552399,0.999869,6891.697772,0.211486,-0.036645,0.198235,17151.726074,0.256143,7748.160506,3692.788503,2.208954,2.469869,1.0


# Model

Finally, it's time for some training. We will fire up a fast.ai 'tabular_learner' from the DataBunch we just created.

In [17]:
from fastai.tabular.learner import tabular_learner
from fastai.tabular.data import *

In [18]:
dls = TabularDataLoaders.from_df(df, '.', procs=procs, cat_names=cat_vars, cont_names=cont_vars, 
                                 y_names="signal", bs=64)
learn = tabular_learner(dls)

In [19]:
learn.model

TabularModel(
  (embeds): ModuleList()
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(14, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): BatchNorm1d(14, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=14, out_features=200, bias=False)
      (2): ReLU(inplace=True)
    )
    (1): LinBnDrop(
      (0): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Linear(in_features=200, out_features=100, bias=False)
      (2): ReLU(inplace=True)
    )
    (2): LinBnDrop(
      (0): Linear(in_features=100, out_features=1, bias=True)
    )
  )
)

As can be seen above, the embedding list is empty because we have no categorical columns, and the embedding dropout is set to 0. The continuous columns go through a batch norm layer, then into two fully connected layers with 200 and 100 nodes, each preceded by BatchNorm and followed by ReLU, and finally a linear layer with a single output. Quite standard.
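To get a feel for how small this network is, we can count its trainable parameters by hand from the printed layer sizes. This is plain arithmetic, not a fast.ai call; the helper names are made up for this sketch, and it assumes each affine BatchNorm1d layer contributes one weight and one bias per feature.

```python
def linear_params(n_in, n_out, bias):
    """Weights of a fully connected layer, plus one bias per output if enabled."""
    return n_in * n_out + (n_out if bias else 0)

def batchnorm_params(n_features):
    """affine=True gives one learnable weight and one bias per feature."""
    return 2 * n_features

# Layer sizes copied from the printed TabularModel above.
layer_params = {
    "bn_cont BatchNorm1d(14)": batchnorm_params(14),
    "layers.0 BatchNorm1d(14)": batchnorm_params(14),
    "layers.0 Linear(14, 200)": linear_params(14, 200, bias=False),
    "layers.1 BatchNorm1d(200)": batchnorm_params(200),
    "layers.1 Linear(200, 100)": linear_params(200, 100, bias=False),
    "layers.2 Linear(100, 1)": linear_params(100, 1, bias=True),
}

for name, count in layer_params.items():
    print(f"{name}: {count}")

total = sum(layer_params.values())
print("total trainable parameters:", total)  # 23357
```

The running mean and variance kept by the batch norm layers are buffers rather than trained weights, so they are left out of this count.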

In [None]:
row, clas, probs = learn.predict(df.iloc[0])

In [None]:
clas, probs

Use fast.ai's *lr_find* function to find a suitable learning rate, then do a 'fit one cycle' training. 

In [None]:
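learn.lr_find()  # plot loss against learning rate to help pick a value
learn.fit_one_cycle(2, 1e-2, wd=0.2)  # train for 2 epochs with the one-cycle policy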

In [None]:
learn.fit_one_cycle(5, 3e-4, wd=0.2)
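For readers wondering what 'fit one cycle' does to the learning rate, here is a rough dependency-free sketch of the schedule shape: a cosine warm-up to the maximum rate followed by a cosine decay. The defaults (div=25, div_final=1e5, pct_start=0.25) reflect my understanding of fast.ai's defaults and should be treated as assumptions; the function names are made up, and the momentum schedule is ignored.

```python
import math

def cosine_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pos))

def one_cycle_lr(pct, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pct, where pct runs from 0 to 1."""
    if pct < pct_start:
        # Warm-up phase: rise from lr_max / div to lr_max.
        return cosine_anneal(lr_max / div, lr_max, pct / pct_start)
    # Decay phase: fall from lr_max to lr_max / div_final.
    return cosine_anneal(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

lr_max = 3e-4  # the maximum rate passed to fit_one_cycle above
schedule = [one_cycle_lr(i / 10, lr_max) for i in range(11)]

for i, lr in enumerate(schedule):
    print(f"{i * 10:3d}% of training: lr = {lr:.2e}")
```

So the value handed to 'fit_one_cycle' is the peak of the schedule, not a constant rate used throughout training.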

In [None]:
# learn.fit_one_cycle(5, 3e-4, wd=0.2)

The best result reaches 0.227 RMSLE, which I think beats the #1 entry on the Kaggle leaderboard. 

# Conclusion
I think overall people still prefer XGBoost or Random Forest for tabular Kaggle competitions, since they usually yield the best scores. However, Deep Learning is also a viable approach, though it lacks a bit on the explainability side. At the very least it can be used for ensembling, so it's worth exploring. 