### Model parameters

* n_d : int (default=8) 

Width of the decision prediction layer. Bigger values give the model more capacity, at the risk of overfitting. Values typically range from 8 to 64.

* n_a: int (default=8)

Width of the attention embedding for each mask. According to the paper, n_d=n_a is usually a good choice.

* n_steps : int (default=3)

Number of steps in the architecture (usually between 3 and 10)

* gamma : float (default=1.3)

This is the coefficient for feature reuse in the masks. A value close to 1 makes mask selection least correlated between layers, because a feature already used at one step is strongly discouraged at later steps. Values range from 1.0 to 2.0.
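
To illustrate the effect of gamma, here is a minimal sketch of the prior-scale update described in the TabNet paper, where each feature's prior is multiplied by (gamma - mask) after every step. The function name `update_prior` and the toy mask values are hypothetical, and this is not the library's own code.

```python
def update_prior(prior, mask, gamma):
    """Scale each feature's prior by (gamma - mask), as described in the TabNet paper."""
    return [p * (gamma - m) for p, m in zip(prior, mask)]

# Hypothetical mask from one step: feature 0 received all the attention.
mask = [1.0, 0.0, 0.0]
prior = [1.0, 1.0, 1.0]

strict = update_prior(prior, mask, gamma=1.0)   # feature 0 can no longer be selected
relaxed = update_prior(prior, mask, gamma=2.0)  # feature 0 stays available

print(strict)   # [0.0, 1.0, 1.0]
print(relaxed)  # [1.0, 2.0, 2.0]
```

With gamma=1.0 a fully used feature gets a prior of 0 for the next step, while larger values leave room for reuse.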

* cat_idxs : list of int (default=[] - Mandatory for embeddings)

List of categorical feature indices.

* cat_dims : list of int (default=[] - Mandatory for embeddings)

List of the number of modalities (number of unique values) of each categorical feature. /!\ No new modalities can be predicted.
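
Since no new modalities can be predicted, values that only show up after training need special handling. One possible workaround, sketched below, is to reserve an extra index for unknown values when encoding. This is an assumption for illustration, not something the library does for you, and the helpers `fit_encoder` and `encode` are hypothetical names.

```python
def fit_encoder(values):
    """Map each category seen during training to an integer, reserving 0 for unknowns."""
    return {v: i + 1 for i, v in enumerate(sorted(set(values)))}

def encode(values, mapping):
    """Encode values; anything unseen at fit time falls back to index 0."""
    return [mapping.get(v, 0) for v in values]

train_col = ["red", "green", "blue", "green"]
mapping = fit_encoder(train_col)
cat_dim = len(mapping) + 1  # +1 for the reserved "unknown" slot

print(encode(["blue", "purple"], mapping))  # [1, 0]
print(cat_dim)  # 4
```

The value of `cat_dim` computed this way is what would go into `cat_dims` for that column.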

* cat_emb_dim : list of int (optional)

List of embedding sizes for each categorical feature. (default=1)

* n_independent : int (default=2)

Number of independent Gated Linear Units layers at each step. Usual values range from 1 to 5.

* n_shared : int (default=2)

Number of shared Gated Linear Units layers at each step. Usual values range from 1 to 5.

* epsilon : float (default 1e-15)

Should be left untouched.

* seed : int (default=0)

Random seed for reproducibility

* momentum : float (default=0.02)

Momentum for batch normalization, typically ranges from 0.01 to 0.4.

* clip_value : float (default None)

If a float is given, the gradient is clipped at clip_value.

* lambda_sparse : float (default = 1e-3)

This is the extra sparsity loss coefficient as proposed in the original paper. The bigger this coefficient is, the sparser your model will be in terms of feature selection. Depending on the difficulty of your problem, reducing this value could help.
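
As a rough illustration, the sparsity term in the original paper is the average entropy of the attention masks, which is then scaled by lambda_sparse and added to the task loss. The sketch below follows that formula in plain Python. The function name `sparsity_loss` and the nested-list layout are assumptions, and the small `epsilon` inside the logarithm is presumably what the `epsilon` parameter above is for.

```python
import math

def sparsity_loss(masks, epsilon=1e-15):
    """Average entropy of the attention masks, indexed as masks[step][sample][feature]."""
    n_steps = len(masks)
    batch_size = len(masks[0])
    total = 0.0
    for step in masks:
        for row in step:
            total += sum(-m * math.log(m + epsilon) for m in row)
    return total / (n_steps * batch_size)

# One step, one sample, four features.
sparse_mask = [[[1.0, 0.0, 0.0, 0.0]]]      # all attention on a single feature
dense_mask = [[[0.25, 0.25, 0.25, 0.25]]]   # attention spread evenly

print(sparsity_loss(sparse_mask))  # close to 0
print(sparsity_loss(dense_mask))   # close to log(4)
```

A sparse mask has near-zero entropy, so a larger lambda_sparse pushes the model toward masks of the first kind.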

* optimizer_fn : torch.optim (default=torch.optim.Adam)

PyTorch optimizer function.

* optimizer_params: dict (default=dict(lr=2e-2))

Parameters compatible with optimizer_fn, used to initialize the optimizer. Since Adam is the default optimizer, this is where the initial learning rate used for training is defined. As mentioned in the original paper, a large initial learning rate of 0.02 with decay is a good option.

* scheduler_fn : torch.optim.lr_scheduler (default=None)

PyTorch scheduler used to change the learning rate during training.

* scheduler_params : dict

Dictionary of parameters to apply to the scheduler_fn. Ex : {"gamma": 0.95, "step_size": 10}
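
To make the example parameters concrete, the sketch below reproduces the learning rates a StepLR-style schedule would produce: the rate is multiplied by gamma once every step_size epochs. It assumes scheduler_fn is `torch.optim.lr_scheduler.StepLR` and an initial learning rate of 2e-2. The helper `step_lr` is a hypothetical stand-in written in plain Python, not the PyTorch class.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate after `epoch` epochs under a StepLR-style schedule."""
    return initial_lr * gamma ** (epoch // step_size)

# Epochs 0 to 9 keep the initial rate, epoch 10 applies the first decay.
for epoch in (0, 9, 10, 20):
    print(epoch, step_lr(2e-2, gamma=0.95, step_size=10, epoch=epoch))
```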

* model_name : str (default = 'DreamQuarkTabNet')

Name of the model used for saving to disk. You can customize this to easily retrieve and reuse your trained models.

* verbose : int (default=1)

Verbosity for notebook plots: set to 1 to see every epoch, 0 to see none.

* device_name : str (default='auto')

'cpu' for cpu training, 'gpu' for gpu training, 'auto' to automatically detect gpu.
* mask_type: str (default='sparsemax')

Either "sparsemax" or "entmax": this is the masking function to use for selecting features.
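
For intuition about the default masking function, here is a minimal plain-Python sketch of sparsemax, the projection of a score vector onto the probability simplex. Unlike softmax, it can assign exactly zero weight to some features. The function below is an illustrative implementation under that definition, not the library's own code.

```python
def sparsemax(z):
    """Euclidean projection of the scores z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(z_sorted, start=1):
        cumulative += value
        # The k largest scores stay in the support while this condition holds.
        if 1 + k * value > cumulative:
            tau = (cumulative - 1) / k
    return [max(v - tau, 0.0) for v in z]

print(sparsemax([2.0, 1.0, 0.1]))  # [1.0, 0.0, 0.0]
print(sparsemax([0.5, 0.5, 0.0]))  # [0.5, 0.5, 0.0]
```

The outputs always sum to 1, and low-scoring features drop out of the mask entirely.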


* n_shared_decoder : int (default=1)  

Number of shared GLU blocks in the decoder. This is only useful for TabNetPretrainer.

* n_indep_decoder : int (default=1)

Number of independent GLU blocks in the decoder. This is only useful for TabNetPretrainer.

### Fit parameters

* X_train : np.array

Training features

* y_train : np.array

Training targets

* eval_set: list of tuple

List of evaluation sets, each given as a tuple (X, y).
The last one is used for early stopping.

* eval_name: list of str

List of eval set names.

* eval_metric : list of str

List of evaluation metrics.
The last metric is used for early stopping.

* max_epochs : int (default = 200)

Maximum number of epochs for training.

* patience : int (default = 10)

Number of consecutive epochs without improvement before performing early stopping.

If patience is set to 0, then no early stopping will be performed.

Note that if patience is enabled, then best weights from best epoch will automatically be loaded at the end of fit.
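
The patience rule above can be sketched as follows. This is a simplified illustration in plain Python, assuming a metric where higher is better. The function name `best_epoch_with_patience` and the toy scores are hypothetical, and the library's own callback may differ in details.

```python
def best_epoch_with_patience(scores, patience, maximize=True):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation scores."""
    best_epoch, best_score, waited = 0, scores[0], 0
    for epoch, score in enumerate(scores[1:], start=1):
        improved = score > best_score if maximize else score < best_score
        if improved:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            # patience == 0 disables early stopping entirely.
            if patience > 0 and waited >= patience:
                return epoch, best_epoch
    return len(scores) - 1, best_epoch

valid_accuracy = [0.70, 0.75, 0.74, 0.73, 0.76, 0.75]
print(best_epoch_with_patience(valid_accuracy, patience=2))  # (3, 1)
print(best_epoch_with_patience(valid_accuracy, patience=0))  # (5, 4)
```

With patience=2 training stops at epoch 3 and the weights from epoch 1 would be restored, so the later improvement at epoch 4 is never reached.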

* weights : int or dict (default=0)

/!\ Only for TabNetClassifier. Sampling parameter:

- 0 : no sampling
- 1 : automated sampling with inverse class occurrences
- dict : keys are classes, values are weights for each class
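
To show what sampling with inverse class occurrences means, the sketch below derives per-class weights from label counts. It is only an illustration: the helper name `inverse_frequency_weights` is hypothetical, and the library's exact normalization is not assumed here.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class sampling weight proportional to 1 / class count."""
    counts = Counter(labels)
    return {cls: 1.0 / n for cls, n in counts.items()}

y = [0, 0, 0, 0, 1]  # imbalanced toy labels
class_weights = inverse_frequency_weights(y)
sample_weights = [class_weights[label] for label in y]

print(class_weights)   # {0: 0.25, 1: 1.0}
print(sample_weights)  # [0.25, 0.25, 0.25, 0.25, 1.0]
```

The weights of each class sum to the same total, so both classes are equally likely to be drawn. A dict of this shape (keys are classes, values are weights) is also the form expected when weights are passed explicitly.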

* loss_fn : torch.loss or list of torch.loss

Loss function for training (defaults to mse for regression and cross entropy for classification). When using TabNetMultiTaskClassifier, you can set a list with the same length as the number of tasks; each task will be assigned its own loss function.

* batch_size : int (default=1024)

Number of examples per batch. Large batch sizes are recommended.

* virtual_batch_size : int (default=128)

Size of the mini batches used for "Ghost Batch Normalization". /!\ virtual_batch_size should divide batch_size
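
The divisibility constraint is easier to see with a small sketch of how a batch would be cut into virtual batches, each of which is then normalized on its own. The helper `ghost_batches` is a hypothetical name, and raising an error on a non-divisible size is this sketch's own choice, not necessarily the library's behavior.

```python
def ghost_batches(batch, virtual_batch_size):
    """Split one batch into the chunks that would each be normalized separately."""
    if len(batch) % virtual_batch_size != 0:
        raise ValueError("virtual_batch_size should divide batch_size")
    return [batch[i:i + virtual_batch_size]
            for i in range(0, len(batch), virtual_batch_size)]

batch = list(range(1024))  # stand-in for 1024 training examples
chunks = ghost_batches(batch, virtual_batch_size=128)
print(len(chunks), len(chunks[0]))  # 8 128
```

With the defaults (batch_size=1024, virtual_batch_size=128), each batch yields 8 equally sized virtual batches.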

* num_workers : int (default=0)

Number of workers used in torch.utils.data.DataLoader.

* drop_last : bool (default=False)

Whether to drop last batch if not complete during training

* callbacks : list of callback function

List of custom callbacks.

* pretraining_ratio : float


* warm_start : bool (default=False)

In order to match the scikit-learn API, this is set to False by default. Setting it to True allows fitting the same model twice, starting the second fit from a warm start.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings(action='ignore')

from pytorch_tabnet.tab_model import TabNetClassifier
import torch
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score
from sklearn.model_selection import train_test_split
np.random.seed(0)


import os
import wget
from pathlib import Path
import shutil
import gzip


#### Download ForestCoverType dataset

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz"

dataset_name = 'forest-cover-type'
tmp_out = Path('./data/'+dataset_name+'.gz')
out = Path(os.getcwd()+'/data/'+dataset_name+'.csv')

In [3]:
out.parent.mkdir(parents=True, exist_ok=True)
if out.exists():
    print('File already exists')
else:
    print("Downloading file...")
    wget.download(url, tmp_out.as_posix())
    # Decompress the downloaded archive into the csv file
    with gzip.open(tmp_out, 'rb') as f_in:
        with open(out, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

File already exists


#### Load data and split 

In [4]:
target = "Covertype"

bool_columns = [
    "Wilderness_Area1", "Wilderness_Area2", "Wilderness_Area3",
    "Wilderness_Area4", "Soil_Type1", "Soil_Type2", "Soil_Type3", "Soil_Type4",
    "Soil_Type5", "Soil_Type6", "Soil_Type7", "Soil_Type8", "Soil_Type9",
    "Soil_Type10", "Soil_Type11", "Soil_Type12", "Soil_Type13", "Soil_Type14",
    "Soil_Type15", "Soil_Type16", "Soil_Type17", "Soil_Type18", "Soil_Type19",
    "Soil_Type20", "Soil_Type21", "Soil_Type22", "Soil_Type23", "Soil_Type24",
    "Soil_Type25", "Soil_Type26", "Soil_Type27", "Soil_Type28", "Soil_Type29",
    "Soil_Type30", "Soil_Type31", "Soil_Type32", "Soil_Type33", "Soil_Type34",
    "Soil_Type35", "Soil_Type36", "Soil_Type37", "Soil_Type38", "Soil_Type39",
    "Soil_Type40"
]

int_columns = [
    "Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology",
    "Vertical_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points"
]

feature_columns = (
    int_columns + bool_columns + [target])

In [5]:
train = pd.read_csv(out,  header = None, names = feature_columns)

n_total = len(train)

# Train, val and test split follows
# Rory Mitchell, Andrey Adinets, Thejaswi Rao, and Eibe Frank.
# Xgboost: Scalable GPU accelerated learning. arXiv:1806.11248, 2018.

train_val_indices, test_indices = train_test_split(
    range(n_total), test_size=0.2, random_state=0)
train_indices, valid_indices = train_test_split(
    train_val_indices, test_size=0.2 / 0.6, random_state=0)
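
As a quick check of the two `train_test_split` calls above, the arithmetic below works out which share of the full dataset ends up in each split, ignoring rounding to whole rows. The variable names are illustrative only.

```python
test_fraction = 0.2
# The second split takes 0.2 / 0.6 of the remaining 80% for validation.
valid_fraction = (0.2 / 0.6) * (1 - test_fraction)
train_fraction = 1 - test_fraction - valid_fraction

print(round(train_fraction, 4), round(valid_fraction, 4), test_fraction)  # 0.5333 0.2667 0.2
```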

#### simple preprocessing 

In [6]:
categorical_columns = [ ]
categorical_dims = {}

for col in train.columns[train.dtypes == object]:
    print(col, train[col].nunique())
    l_enc = LabelEncoder()
    train[col] = train[col].fillna("VV_likely")
    train[col] = l_enc.fit_transform(train[col].values)
    categorical_columns.append(col)
    categorical_dims[col] = len(l_enc.classes_)

for col in train.columns[train.dtypes == 'float64']:
    # Fill missing values of this column with its mean on the training rows only
    train[col] = train[col].fillna(train.loc[train_indices, col].mean())

#### Define categorical features for categorical embeddings

In [7]:
unused_feat = []

features = [col for col in train.columns if col not in unused_feat + [target]]

cat_idxs = [i for i, f in enumerate(features) if f in categorical_columns]

cat_dims = [categorical_dims[f] for i, f in enumerate(features) if f in categorical_columns]

#### Network parameters 

In [8]:
clf = TabNetClassifier(
    n_steps=1,
    cat_idxs=cat_idxs,
    cat_dims=cat_dims,
    cat_emb_dim=1,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_params={"step_size": 50,  # how to use learning rate scheduler
                      "gamma": 0.9},
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    mask_type='entmax',  # "sparsemax",
    lambda_sparse=0,  # don't penalize for sparser attention
)

Device used : cpu


#### Training

In [9]:
if os.getenv("CI", False):
    # Take only a subsample to run CI
    X_train = train[features].values[train_indices][:1000, :]
    y_train = train[target].values[train_indices][:1000]
else:
    X_train = train[features].values[train_indices]
    y_train = train[target].values[train_indices]

X_valid = train[features].values[valid_indices]
y_valid = train[target].values[valid_indices]

X_test = train[features].values[test_indices]
y_test = train[target].values[test_indices]

In [10]:
max_epochs = 50 if not os.getenv("CI", False) else 2

In [None]:
clf.fit(
    X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    max_epochs=max_epochs,
    patience=50,  # please be patient ^^
    batch_size=1024,
    virtual_batch_size=128,
    num_workers=1,
    drop_last=False
)

In [None]:
# plot losses
plt.plot(clf.history['loss'])

In [None]:
# plot accuracy
plt.plot(clf.history['train_accuracy'])
plt.plot(clf.history['valid_accuracy'])

In [None]:
# To get final results you may need to use a mapping for classes 
# as you are allowed to use targets like ["yes", "no", "maybe", "I don't know"]

preds_mapper = {idx: class_name for idx, class_name in enumerate(clf.classes_)}

preds = clf.predict_proba(X_test)

y_pred = np.vectorize(preds_mapper.get)(np.argmax(preds, axis=1))

test_acc = accuracy_score(y_pred=y_pred, y_true=y_test)

print(f"BEST VALID SCORE FOR {dataset_name} : {clf.best_cost}")
print(f"FINAL TEST SCORE FOR {dataset_name} : {test_acc}")

In [None]:
# or you can simply use the predict method

y_pred = clf.predict(X_test)
test_acc = accuracy_score(y_pred=y_pred, y_true=y_test)
print(f"FINAL TEST SCORE FOR {dataset_name} : {test_acc}")