## Simple Binary Classification with defaults

In this notebook we will use the Adult Census dataset. Download the data from [here](https://www.kaggle.com/wenruliu/adult-income-dataset/downloads/adult.csv/2).

In [1]:
import sys
sys.path.insert(0, '/root/gdrive/MyDrive/pytorch-widedeep/')

In [2]:
import numpy as np
import pandas as pd
import torch

from pytorch_widedeep.preprocessing import WidePreprocessor, TabPreprocessor
from pytorch_widedeep.training import Trainer
from pytorch_widedeep.models import Wide, TabMlp, TabResnet, TabTransformer, WideDeep
from pytorch_widedeep.metrics import Accuracy, Precision

In [3]:
df = pd.read_csv('data/adult/adult.csv')
df.head()

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [4]:
# For convenience, we'll replace '-' with '_'
df.columns = [c.replace("-", "_") for c in df.columns]
# binary target
df['income_label'] = (df["income"].apply(lambda x: ">50K" in x)).astype(int)
df.drop('income', axis=1, inplace=True)
df.head()

| | age | workclass | fnlwgt | education | educational_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | 0 |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | 0 |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | 1 |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | 1 |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | 0 |


### Preparing the data

Have a look at notebooks one and two if you want a good understanding of the next few lines of code (although this is not required in order to use the package).

In [5]:
wide_cols = ['education', 'relationship','workclass','occupation','native_country','gender']
crossed_cols = [('education', 'occupation'), ('native_country', 'occupation')]
cat_embed_cols = [('education',16), ('relationship',8), ('workclass',16), ('occupation',16),('native_country',16)]
continuous_cols = ["age","hours_per_week"]
target_col = 'income_label'

In [6]:
# TARGET
target = df[target_col].values

# wide
wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(df)

# deeptabular
tab_preprocessor = TabPreprocessor(embed_cols=cat_embed_cols, continuous_cols=continuous_cols)
X_tab = tab_preprocessor.fit_transform(df)

In [7]:
print(X_wide)
print(X_wide.shape)

[[  1  17  23 ...  89  91 316]
 [  2  18  23 ...  89  92 317]
 [  3  18  24 ...  89  93 318]
 ...
 [  2  20  23 ...  90 103 323]
 [  2  17  23 ...  89 103 323]
 [  2  21  29 ...  90 115 324]]
(48842, 8)
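
The 8 columns of `X_wide` correspond to the 6 `wide_cols` plus the 2 `crossed_cols`. The sketch below is a conceptual illustration of this kind of encoding in plain Python, not the `WidePreprocessor` implementation. It assumes that a crossed column is formed by joining the two raw values and that every (column, value) pair receives its own integer index starting at 1. The helper `encode_wide` and the toy rows are hypothetical.

```python
def encode_wide(rows, wide_cols, crossed_cols):
    """Label-encode wide and crossed columns with one shared vocabulary."""
    # Build the crossed features by joining the two raw values.
    all_cols = list(wide_cols) + ["_".join(pair) for pair in crossed_cols]
    expanded = []
    for row in rows:
        new_row = dict(row)
        for a, b in crossed_cols:
            new_row[f"{a}_{b}"] = f"{row[a]}-{row[b]}"
        expanded.append(new_row)

    # Assign indices column by column, so values of the same column get
    # consecutive integers. Index 0 is left unused.
    vocab = {}
    for col in all_cols:
        for row in expanded:
            vocab.setdefault((col, row[col]), len(vocab) + 1)

    encoded = [[vocab[(col, row[col])] for col in all_cols] for row in expanded]
    return encoded, vocab


rows = [
    {"education": "11th", "occupation": "Machine-op-inspct"},
    {"education": "HS-grad", "occupation": "Farming-fishing"},
    {"education": "HS-grad", "occupation": "Machine-op-inspct"},
]
X_toy, vocab = encode_wide(
    rows, ["education", "occupation"], [("education", "occupation")]
)
for encoded_row in X_toy:
    print(encoded_row)
print(len(vocab))
```

Under these assumptions no index is shared between columns, so the number of distinct values in the encoded array equals the size of the vocabulary.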


In [8]:
print(X_tab)
print(X_tab.shape)

[[ 1.          1.          1.         ...  1.         -0.99512893
  -0.03408696]
 [ 2.          2.          1.         ...  1.         -0.04694151
   0.77292975]
 [ 3.          2.          2.         ...  1.         -0.77631645
  -0.03408696]
 ...
 [ 2.          4.          1.         ...  1.          1.41180837
  -0.03408696]
 [ 2.          1.          1.         ...  1.         -1.21394141
  -1.64812038]
 [ 2.          5.          7.         ...  1.          0.97418341
  -0.03408696]]
(48842, 7)
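
The 7 columns of `X_tab` are the 5 categorical columns followed by the 2 continuous ones. Judging from the output above, the categorical columns are label-encoded per column and the continuous columns are standardized. The following is a minimal sketch of that idea in plain Python, not the `TabPreprocessor` code. It assumes codes start at 1 within each column and that standardization uses the population standard deviation. The helper `encode_tab` and the toy rows are hypothetical.

```python
from statistics import mean, pstdev


def encode_tab(rows, cat_cols, cont_cols):
    """Label-encode categorical columns and standardize continuous ones."""
    # One encoder per categorical column; codes start at 1 within each column.
    encoders = {col: {} for col in cat_cols}
    for row in rows:
        for col in cat_cols:
            encoders[col].setdefault(row[col], len(encoders[col]) + 1)

    # Mean and population standard deviation per continuous column.
    stats = {}
    for col in cont_cols:
        values = [row[col] for row in rows]
        stats[col] = (mean(values), pstdev(values))

    encoded = []
    for row in rows:
        cats = [float(encoders[col][row[col]]) for col in cat_cols]
        conts = [(row[col] - stats[col][0]) / stats[col][1] for col in cont_cols]
        encoded.append(cats + conts)
    return encoded


rows = [
    {"education": "11th", "age": 20},
    {"education": "HS-grad", "age": 30},
    {"education": "11th", "age": 40},
]
X_toy = encode_tab(rows, cat_cols=["education"], cont_cols=["age"])
for encoded_row in X_toy:
    print(encoded_row)
```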


### Defining the model

Here we use `TabTransformer` as the `deeptabular` component, so the tabular data is preprocessed again with `for_tabtransformer=True`.

In [9]:
# for TabTransformer we only need the names of the columns
cat_embed_cols_for_transformer = [el[0] for el in cat_embed_cols]

In [10]:
cat_embed_cols_for_transformer

['education', 'relationship', 'workclass', 'occupation', 'native_country']

In [11]:
# deeptabular
tab_preprocessor = TabPreprocessor(embed_cols=cat_embed_cols_for_transformer, 
                                   continuous_cols=continuous_cols, 
                                   for_tabtransformer=True)
X_tab = tab_preprocessor.fit_transform(df)

In [12]:
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
deeptabular = TabTransformer(column_idx=tab_preprocessor.column_idx,
                             embed_input=tab_preprocessor.embeddings_input,
                             continuous_cols=continuous_cols)
model = WideDeep(wide=wide, deeptabular=deeptabular)
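
`Wide` is sized with the number of distinct indices in `X_wide`. As far as we understand it, the component keeps one weight per index and adds up the weights selected by each row, which is equivalent to a linear model on one-hot encoded features. The sketch below checks that equivalence in plain Python. It is an illustration under that assumption, not the library code; the function names and the weights are made up.

```python
def lookup_forward(encoded_row, weights, bias):
    """Sum the weights selected by the integer indices, then add the bias."""
    return sum(weights[i] for i in encoded_row) + bias


def one_hot_forward(encoded_row, weights, bias):
    """Same prediction computed as a dot product with a one-hot vector."""
    one_hot = [0.0] * len(weights)
    for i in encoded_row:
        one_hot[i] = 1.0
    return sum(x * w for x, w in zip(one_hot, weights)) + bias


# Toy setup: 7 distinct indices (1..7). Position 0 is an unused slot.
weights = [0.0, 0.5, -0.25, 1.0, 0.125, -1.5, 2.0, 0.75]
bias = 0.25
row = [2, 3, 7]

print(lookup_forward(row, weights, bias))
print(one_hot_forward(row, weights, bias))
```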

In [13]:
trainer = Trainer(model, objective='binary', metrics=[Accuracy, Precision])

In [14]:
trainer.fit(X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=100, batch_size=128, val_split=0.2)

epoch 1: 100%|██████████| 306/306 [00:07<00:00, 41.80it/s, loss=0.382, metrics={'acc': 0.8205, 'prec': 0.6516}]
valid: 100%|██████████| 77/77 [00:00<00:00, 92.01it/s, loss=0.356, metrics={'acc': 0.8368, 'prec': 0.6783}]
epoch 2: 100%|██████████| 306/306 [00:07<00:00, 40.01it/s, loss=0.359, metrics={'acc': 0.8323, 'prec': 0.6818}]
valid: 100%|██████████| 77/77 [00:00<00:00, 90.39it/s, loss=0.352, metrics={'acc': 0.8371, 'prec': 0.6773}]
epoch 3: 100%|██████████| 306/306 [00:07<00:00, 39.47it/s, loss=0.354, metrics={'acc': 0.8363, 'prec': 0.6936}]
valid: 100%|██████████| 77/77 [00:00<00:00, 87.19it/s, loss=0.353, metrics={'acc': 0.8335, 'prec': 0.6456}]
epoch 4: 100%|██████████| 306/306 [00:07<00:00, 40.06it/s, loss=0.349, metrics={'acc': 0.8374, 'prec': 0.6957}]
valid: 100%|██████████| 77/77 [00:00<00:00, 84.28it/s, loss=0.355, metrics={'acc': 0.8318, 'prec': 0.638}]
epoch 5: 100%|██████████| 306/306 [00:07<00:00, 40.20it/s, loss=0.347, metrics={'acc': 0.8378, 'prec': 0.6962}]
valid: 10
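
The totals in the progress bars (306 training and 77 validation batches) follow from the dataset size, `val_split` and `batch_size`. This is a minimal sketch of that arithmetic, assuming the validation set size is rounded up and the last, smaller batch is kept. `batches_per_epoch` is a hypothetical helper, not part of the package.

```python
import math


def batches_per_epoch(n_rows, val_split, batch_size):
    """Return (training batches, validation batches) for one epoch."""
    # Assumption: the validation size is rounded up to a whole row.
    n_valid = math.ceil(n_rows * val_split)
    n_train = n_rows - n_valid
    # Assumption: the final, partially filled batch is not dropped.
    return math.ceil(n_train / batch_size), math.ceil(n_valid / batch_size)


print(batches_per_epoch(48842, 0.2, 128))
print(batches_per_epoch(48842, 0.2, 64))
```

With `batch_size=64`, the same arithmetic gives the 611 and 153 batches seen in the later run.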

It is also worth mentioning that one can build a model from the individual components independently. For example, a model comprising only the `wide` component is simply a linear model. The cells below first train a variant of the full model with `embed_continuous=False`, and then build a `wide`-only model:

In [15]:
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
deeptabular = TabTransformer(column_idx=tab_preprocessor.column_idx,
                             embed_input=tab_preprocessor.embeddings_input,
                             continuous_cols=continuous_cols,
                             embed_continuous = False)
model = WideDeep(wide=wide, deeptabular=deeptabular)

In [16]:
trainer = Trainer(model, objective='binary', metrics=[Accuracy, Precision])

In [17]:
trainer.fit(X_wide=X_wide, X_tab=X_tab, target=target, n_epochs=5, batch_size=64, val_split=0.2)

epoch 1: 100%|██████████| 611/611 [00:15<00:00, 40.18it/s, loss=0.385, metrics={'acc': 0.819, 'prec': 0.6495}]
valid: 100%|██████████| 153/153 [00:01<00:00, 97.30it/s, loss=0.372, metrics={'acc': 0.8322, 'prec': 0.6831}]
epoch 2: 100%|██████████| 611/611 [00:15<00:00, 40.39it/s, loss=0.364, metrics={'acc': 0.8282, 'prec': 0.6697}]
valid: 100%|██████████| 153/153 [00:01<00:00, 99.11it/s, loss=0.363, metrics={'acc': 0.8312, 'prec': 0.6605}]
epoch 3: 100%|██████████| 611/611 [00:14<00:00, 40.75it/s, loss=0.357, metrics={'acc': 0.8323, 'prec': 0.6784}]
valid: 100%|██████████| 153/153 [00:01<00:00, 98.44it/s, loss=0.361, metrics={'acc': 0.8319, 'prec': 0.6642}]
epoch 4: 100%|██████████| 611/611 [00:14<00:00, 40.97it/s, loss=0.353, metrics={'acc': 0.8346, 'prec': 0.6838}]
valid: 100%|██████████| 153/153 [00:01<00:00, 93.52it/s, loss=0.356, metrics={'acc': 0.8342, 'prec': 0.6679}]
epoch 5: 100%|██████████| 611/611 [00:15<00:00, 40.24it/s, loss=0.348, metrics={'acc': 0.8371, 'prec': 0.6894}]
v

In [36]:
model = WideDeep(wide=wide)

In [37]:
trainer = Trainer(model, objective='binary', metrics=[Accuracy, Precision])

In [15]:
trainer.fit(X_wide=X_wide, target=target, n_epochs=10, batch_size=64, val_split=0.2)

epoch 1:   0%|          | 0/611 [00:00<?, ?it/s]


KeyError: 'deeptabular'

The only requirement is that the model component must be passed to `WideDeep` before being fed to the `Trainer`. This is because the `Trainer` is coded to train a model that has a parent called `model` and children that correspond to the model components: `wide`, `deeptabular`, `deeptext` and `deepimage`. In addition, `WideDeep` builds the last connection between the output of those components and the final output neuron(s).
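
The parent and children structure can be pictured with a toy container in plain Python. This is only a sketch of the idea, not the `WideDeep` implementation. It assumes that the outputs of the components that are present are simply added together. The class `ToyWideDeep` and the two toy components are hypothetical.

```python
class ToyWideDeep:
    """Toy stand-in for a container that combines optional components."""

    COMPONENT_NAMES = ("wide", "deeptabular", "deeptext", "deepimage")

    def __init__(self, **components):
        unknown = set(components) - set(self.COMPONENT_NAMES)
        if unknown:
            raise ValueError(f"unknown components: {sorted(unknown)}")
        if not components:
            raise ValueError("at least one component is required")
        # Keep the components as named children of the parent model.
        self.children = dict(components)

    def __call__(self, inputs):
        # Assumption: the outputs of the present components are added up.
        return sum(self.children[name](inputs[name]) for name in self.children)


def toy_wide(x):
    return 0.5 * x


def toy_deeptabular(x):
    return 2.0 * x


wide_only = ToyWideDeep(wide=toy_wide)
both = ToyWideDeep(wide=toy_wide, deeptabular=toy_deeptabular)

print(wide_only({"wide": 4.0}))
print(both({"wide": 4.0, "deeptabular": 1.0}))
```

In this sketch a single component is still wrapped in the container, so whatever trains it can always address the children by name.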