fastai tabular and text combination model
Type Name Latest commit message Commit time
scratch initial commit Apr 7, 2019
.gitignore initial commit Apr 7, 2019
README.md factorize version 2. Add predict_one_item Jun 30, 2019
fastai_tab_text.py
fastai_tabtext2.py
mercari-language-model.ipynb initial commit Apr 7, 2019
mercari-tabular-text-training-complete.ipynb simplify show_batch() text, add testing session for mercari databunch Apr 8, 2019
mercari-tabular-text-version-2-complete.ipynb factorize version 2. Add predict_one_item Jun 30, 2019
mercari-tabular-text-version-2-experimenting.ipynb
pet-finder-fastai-api-experiment.ipynb initial commit Apr 7, 2019

README.md

Demo for fastai tabular + text databunch with end-to-end classification/regression training

UPDATE: There is a faster way to build and train an end-to-end tabular + text regression model, with a better loss, using an entirely different approach (it should work for both classification and regression tasks).

This approach reuses the existing tabular and text databunches together with the existing tabular model and text model (AWD_LSTM), so it is shorter to implement. Because all of these pieces are already in the fastai library, the implementation should be less error-prone and easier to keep up to date. You can follow the discussion on the fastai forum.

The main code for this approach is in fastai_tabtext2.py and this notebook. The dataset is from the Mercari Price Kaggle competition.

If you still want to learn about the fastai data block API (which can be helpful if you want to build your own data API that differs from tabular or text), take a look at the materials below.


Inspired by Wayde Gilliam's detailed blog post on the fastai data block API.

  • Several new TabularText classes (inheriting from ItemLists, LabelLists and DataBunch) handle both tabular data (continuous and categorical) and text data (converted to numerical ids). All preprocessing steps from the tabular processor (FillMissing, Categorify, Normalize) and the text processor (Tokenizer and Numericalize) are included.

  • An RNN model (e.g. AWD LSTM) is combined with a multilayer perceptron (MLP) head to train on both text and tabular data. The usual training tools from the fastai learner are available: fit_one_cycle, learning rate schedules, callbacks, and training different layer groups with freeze_to (differential learning rates).
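
The differential learning rates mentioned above give smaller rates to the earlier layer groups (the pretrained encoder) and larger rates to the head. Here is a minimal plain-Python sketch of that idea, assuming the rates are spaced geometrically between a lowest and a highest value, similar to how fastai v1 expands `slice(lo, hi)`. The helper `spread_learning_rates` and the choice of four layer groups are hypothetical and only for illustration.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Return n_groups learning rates spaced geometrically from lowest to highest."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    if n_groups == 1:
        return [highest]
    # Constant multiplicative step between consecutive layer groups.
    ratio = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * ratio ** i for i in range(n_groups)]


# Hypothetical example: 4 layer groups (3 encoder groups + the MLP head).
rates = spread_learning_rates(1e-5, 1e-2, n_groups=4)
for group, lr in enumerate(rates):
    print(f"group {group}: lr={lr:.1e}")
```

The first group (closest to the input) gets the lowest rate, so the pretrained language-model weights change slowly while the newly initialised head learns faster.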

The code that builds the TabularText databunch has been tested on data from the Kaggle PetFinder competition by comparing its output with that of the Tabular databunch and the Text databunch. The notebook for that can be found here.

The code that builds the TabularText learner has been used to train on data from the Mercari Price Kaggle competition. The model trained successfully, with the loss decreasing over epochs, but the results are only average compared to that competition's leaderboard (the reason is not yet clear; more testing is needed). See the training notebook.

The entire source code is in fastai_tab_text.py. You can create a TabularText databunch by following the same DataBunch API described in the fastai docs. You can also look at my training notebook for both databunch creation and learner training.

  • Example of creating a TabularText databunch from a pandas DataFrame (using the Mercari dataset)
cat_names=['category1','category2','category3','brand_name','shipping'] # categorical
cont_names= list(set(train_df.columns) - set(cat_names) - {'price','text'}) # continuous
dep_var = 'price' # label
procs = [FillMissing,Categorify, Normalize]
txt_cols=['text'] # text

def get_tabulartext_databunch(bs=100,val_idxs=val_idxs,path=mercari_path):
    data_lm = load_data(path, 'data_lm.pkl', bs=bs) # data_lm.pkl from mercari-language-model notebook
    collate_fn = partial(mixed_tabular_pad_collate, pad_idx=1, pad_first=True)
    reset_seed()
    return (TabularTextList.from_df(train_df, cat_names, cont_names, txt_cols, vocab=data_lm.vocab, procs=procs, path=path)
                            .split_by_idx(val_idxs)
                            .label_from_df(cols=dep_var)
                            .add_test(TabularTextList.from_df(test_df, cat_names, cont_names, txt_cols,path=path))
                            .databunch(bs=bs,collate_fn=collate_fn, no_check=False))

data = get_tabulartext_databunch(bs=100)
data.show_batch()
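
The collate function above is called with `pad_idx=1` and `pad_first=True`, which means shorter token-id sequences in a batch are padded at the front with id 1 so that every row has the same length. The plain-Python sketch below illustrates only that padding step. `pad_token_ids` is a hypothetical helper, not the `mixed_tabular_pad_collate` from fastai_tab_text.py, which presumably also batches the categorical and continuous columns.

```python
def pad_token_ids(sequences, pad_idx=1, pad_first=True):
    """Pad lists of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_idx] * (max_len - len(seq))
        # pad_first=True puts the padding before the real tokens.
        padded.append(padding + list(seq) if pad_first else list(seq) + padding)
    return padded


# Made-up token ids for three item descriptions of different lengths.
batch = [[2, 15, 8, 40], [2, 7], [2, 9, 33]]
padded_batch = pad_token_ids(batch)
print(padded_batch)
```

Padding at the front keeps the real tokens at the end of each sequence, so the last RNN states are computed on actual text rather than on padding.
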
  • Example of creating a TabularText learner and starting one-cycle training (note: this is a regression problem)
encoder_name = 'bs60-awdlstm-enc-stage2' # encoder from mercari-language-model notebook
def get_tabulartext_learner(data,params):
    learn= tabtext_learner(data,AWD_LSTM,metrics=[root_mean_squared_error],
                               callback_fns=[partial(SaveModelCallback, monitor='root_mean_squared_error',mode='min',every='improvement',name='best_nn')],
                               **params)
    learn.load_encoder(encoder_name)
    return learn

params={
    'layers':[500,400,200], # neural network at model's head
    'ps': [0.001,0.,0.], # dropout for NN at model's head
    'bptt':70,
    'max_len':20*70,
    'drop_mult': 1., # drop_mult: multiply to different dropouts in AWD LSTM
    'lin_ftrs': [300], # linear layer to AWD_LSTM output, before combining to embeddings
    'ps_lin_ftrs': [0.], # dropout for this linear layer at AWD_LSTM output
    # set 'lin_ftrs': None if you want AWD LSTM output (1200) to be combined straight to embeddings
    'emb_drop': 0., # embeddings dropout
    'y_range': [0,6], # restrict y range for regression problem
    'use_bn': True,    
}


learn = get_tabulartext_learner(data,params) # the function above takes only data and params
print(learn.model)

learn.fit_one_cycle(3,max_lr=1e-02,pct_start=0.3,moms=(0.8,0.7))
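
For readers unfamiliar with `fit_one_cycle`, the sketch below approximates the schedule implied by `max_lr=1e-02`, `pct_start=0.3` and `moms=(0.8,0.7)`: the learning rate warms up over the first 30% of all iterations and then anneals, while momentum moves in the opposite direction. The cosine annealing and the divisors (`div_factor=25`, final divisor `25e4`) are assumptions based on the fastai v1 defaults as I understand them, and `one_cycle_schedule` is a hypothetical helper, not a fastai function.

```python
import math


def cosine_anneal(start, end, pct):
    """Move from start (pct=0) to end (pct=1) along half a cosine wave."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle_schedule(n_iter, max_lr=1e-2, pct_start=0.3, moms=(0.8, 0.7),
                       div_factor=25.0, final_div=25.0 * 1e4):
    """Return a list of (learning_rate, momentum) pairs, one per iteration."""
    warmup = int(n_iter * pct_start)
    schedule = []
    for i in range(n_iter):
        if i < warmup:
            # Phase 1: lr rises to max_lr, momentum falls from moms[0] to moms[1].
            pct = i / warmup
            lr = cosine_anneal(max_lr / div_factor, max_lr, pct)
            mom = cosine_anneal(moms[0], moms[1], pct)
        else:
            # Phase 2: lr decays far below its start, momentum rises back.
            pct = (i - warmup) / max(1, n_iter - warmup - 1)
            lr = cosine_anneal(max_lr, max_lr / final_div, pct)
            mom = cosine_anneal(moms[1], moms[0], pct)
        schedule.append((lr, mom))
    return schedule


# Assumed total of 100 iterations across all epochs, for illustration only.
schedule = one_cycle_schedule(n_iter=100)
for step in (0, 30, 99):
    lr, mom = schedule[step]
    print(f"iteration {step}: lr={lr:.2e}, momentum={mom:.3f}")
```

In the real call the cycle spans all iterations of the 3 epochs, so the peak learning rate is reached roughly at the end of the first epoch.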

Requirements: fastai version 1.0.51 (with PyTorch 1.0). Visit https://github.com/fastai/fastai#installation for more.

Any feedback is welcome! Follow the discussion on fastai forums: https://forums.fast.ai/t/build-mixed-databunch-and-train-end-to-end-model-for-tabular-categorical-continuous-data-and-text-data/
