[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/georgianpartners/Multimodal-Toolkit/blob/master/notebooks/text_w_tabular_classification.ipynb)

# Training a BertWithTabular Model for Clothing Review Recommendation Prediction

This guide closely follows the [example](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb#scrollTo=bwl3I_VGAZXb) from HuggingFace for text classification on the GLUE dataset.

Install `multimodal-transformers`, `datasets` (for getting a dataset), and `tensorboard` (for training logs).

In [1]:
!pip install multimodal-transformers
!pip install datasets
!pip install tensorboard


## All other imports are here:

In [71]:
from dataclasses import dataclass, field
import json
import logging
import os
from typing import Optional

import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoConfig, Trainer, EvalPrediction, set_seed
from transformers.training_args import TrainingArguments

import sys

from multimodal_transformers.data import load_data_from_folder
from multimodal_transformers.model import TabularConfig
from multimodal_transformers.model import AutoModelWithTabular

logging.basicConfig(level=logging.INFO)
os.environ["COMET_MODE"] = "DISABLED"

**Important Note:** If you run this notebook on Google Colab and face any issue with numpy, please restart the kernel once and try again. Refer to this [issue](https://github.com/georgian-io/Multimodal-Toolkit/issues/71) for more information.

## Dataset

Our dataset is the [Women's Clothing E-Commerce Reviews](https://huggingface.co/datasets/Censius-AI/ECommerce-Women-Clothing-Reviews) dataset. It contains reviews written by customers about clothing items, along with whether they recommend the item. We download the dataset here.

In [72]:
from datasets import load_dataset

dataset = load_dataset("Censius-AI/ECommerce-Women-Clothing-Reviews")

#### Let us take a look at what the dataset looks like

In [73]:
data_df = dataset['train'].to_pandas()
data_df.head(5)

|  | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 |  | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 |  | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


We see that the data contains text in the `Review Text` and `Title` columns, as well as tabular features: categorical ones such as `Division Name`, `Department Name`, and `Class Name`, and numerical ones such as `Age` and `Rating`.

In [74]:
data_df.describe(include=object)

|  | Title | Review Text | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|
| count | 19676 | 22641 | 23472 | 23472 | 23472 |
| unique | 13993 | 22634 | 3 | 6 | 20 |
| top | Love it! | Perfect fit and i've gotten so many compliment... | General | Tops | Dresses |
| freq | 136 | 3 | 13850 | 10468 | 6319 |


In [75]:
data_df.describe()

|  | Unnamed: 0 | Clothing ID | Age | Rating | Recommended IND | Positive Feedback Count |
|---|---|---|---|---|---|---|
| count | 23486.0 | 23486.0 | 23486.0 | 23486.0 | 23486.0 | 23486.0 |
| mean | 11742.5 | 918.118709 | 43.198544 | 4.196032 | 0.822362 | 2.535936 |
| std | 6779.968547 | 203.29898 | 12.279544 | 1.110031 | 0.382216 | 5.702202 |
| min | 0.0 | 0.0 | 18.0 | 1.0 | 0.0 | 0.0 |
| 25% | 5871.25 | 861.0 | 34.0 | 4.0 | 1.0 | 0.0 |
| 50% | 11742.5 | 936.0 | 41.0 | 5.0 | 1.0 | 1.0 |
| 75% | 17613.75 | 1078.0 | 52.0 | 5.0 | 1.0 | 3.0 |
| max | 23485.0 | 1205.0 | 99.0 | 5.0 | 1.0 | 122.0 |
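
Two things in these summaries are worth spelling out before training: the text and categorical columns have fewer non-null entries than the 23486 rows, and the mean of `Recommended IND` shows that the classes are imbalanced. The following is a small arithmetic sketch that uses only the numbers printed above; the names `non_null` and `majority_baseline_accuracy` are chosen here for illustration.

```python
# Numbers copied from the describe() outputs above.
n_rows = 23486
non_null = {
    "Title": 19676,
    "Review Text": 22641,
    "Division Name": 23472,
    "Department Name": 23472,
    "Class Name": 23472,
}

# Missing values per column = total rows - non-null count.
missing = {col: n_rows - count for col, count in non_null.items()}
print(missing)

# The mean of a 0/1 label is the share of positives, so always predicting
# "recommended" would already be right about 82% of the time.
positive_rate = 0.822362
majority_baseline_accuracy = max(positive_rate, 1 - positive_rate)
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

On the real frame, `data_df.isna().sum()` reports the missing counts directly. The imbalance is a good reason to look at metrics such as ROC AUC and PR AUC rather than accuracy alone.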


We split our data into training, validation, and test sets in an 8:1:1 ratio. We also save the splits to `train.csv`, `val.csv`, and `test.csv`, as this is the format our dataloader requires.

In [76]:
train_df, val_df, test_df = np.split(data_df.sample(frac=1), [int(.8*len(data_df)), int(.9 * len(data_df))])
print('Num examples train-val-test')
print(len(train_df), len(val_df), len(test_df))
train_df.to_csv('train.csv')
val_df.to_csv('val.csv')
test_df.to_csv('test.csv')

Num examples train-val-test
18788 2349 2349
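
The split above shuffles without a fixed seed, so the rows in each split change from run to run. Below is a minimal sketch of the same 8:1:1 boundary arithmetic with a seeded shuffle, written in plain Python so that it runs on its own. The helper `split_indices` and the seed value are assumptions made for illustration and are not part of the toolkit.

```python
import random


def split_indices(n_rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle row positions with a fixed seed and cut them into three splits."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Same boundaries as the np.split call: int(0.8 * n) and int(0.9 * n).
    train_end = int(train_frac * n_rows)
    val_end = int((train_frac + val_frac) * n_rows)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = split_indices(23486)
print(len(train_idx), len(val_idx), len(test_idx))
```

With pandas, the positions can be applied as `data_df.iloc[train_idx]`. Passing `random_state` to `data_df.sample(frac=1, random_state=...)` is another way to make the shuffle repeatable.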


## We then define our Experiment Parameters
We use data classes to hold the arguments for the model, data, and training.

In [77]:
@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={
            "help": "Path to pretrained model or model identifier from huggingface.co/models"
        }
    )
    config_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "Pretrained config name or path if not the same as model_name"
        },
    )
    tokenizer_name: Optional[str] = field(
        default=None,
        metadata={
            "help": "Pretrained tokenizer name or path if not the same as model_name"
        },
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={
            "help": "Where do you want to store the pretrained models downloaded from s3"
        },
    )


@dataclass
class MultimodalDataTrainingArguments:
    """
    Arguments pertaining to how we combine tabular features
    Using `HfArgumentParser` we can turn this class
    into argparse arguments to be able to specify them on
    the command line.
    """

    data_path: str = field(
        metadata={"help": "the path to the csv file containing the dataset"}
    )
    column_info_path: Optional[str] = field(
        default=None,
        metadata={
            "help": "the path to the json file detailing which columns are text, categorical, numerical, and the label"
        },
    )

    column_info: Optional[dict] = field(
        default=None,
        metadata={
            "help": "a dict referencing the text, categorical, numerical, and label columns; "
            "its keys are text_cols, num_cols, cat_cols, and label_col"
        },
    )

    categorical_encode_type: str = field(
        default="ohe",
        metadata={
            "help": "sklearn encoder to use for categorical data",
            "choices": ["ohe", "binary", "label", "none"],
        },
    )
    numerical_transformer_method: str = field(
        default="yeo_johnson",
        metadata={
            "help": "sklearn numerical transformer to preprocess numerical data",
            "choices": ["yeo_johnson", "box_cox", "quantile_normal", "none"],
        },
    )
    task: str = field(
        default="classification",
        metadata={
            "help": "The downstream training task",
            "choices": ["classification", "regression"],
        },
    )

    mlp_division: int = field(
        default=4,
        metadata={
            "help": "the ratio of the number of "
            "hidden dims in a current layer to the next MLP layer"
        },
    )
    combine_feat_method: str = field(
        default="individual_mlps_on_cat_and_numerical_feats_then_concat",
        metadata={
            "help": "method to combine categorical and numerical features, "
            "see README for all the methods"
        },
    )
    mlp_dropout: float = field(
        default=0.1, metadata={"help": "dropout ratio used for MLP layers"}
    )
    numerical_bn: bool = field(
        default=True,
        metadata={"help": "whether to use batchnorm on numerical features"},
    )
    categorical_bn: bool = field(
        default=True,
        metadata={"help": "whether to use batchnorm on categorical features"},
    )
    use_simple_classifier: bool = field(
        default=True,
        metadata={"help": "whether to use single layer or MLP as final classifier"},
    )
    mlp_act: str = field(
        default="relu",
        metadata={
            "help": "the activation function to use for finetuning layers",
            "choices": ["relu", "prelu", "sigmoid", "tanh", "linear"],
        },
    )
    gating_beta: float = field(
        default=0.2,
        metadata={
            "help": "the beta hyperparameters used for gating tabular data "
            "see https://www.aclweb.org/anthology/2020.acl-main.214.pdf"
        },
    )

    def __post_init__(self):
        assert self.column_info != self.column_info_path
        if self.column_info is None and self.column_info_path:
            with open(self.column_info_path, "r") as f:
                self.column_info = json.load(f)

### Here are the data and training parameters we will use.
For the model, we can specify any supported HuggingFace model class (see the README for details), or any `AutoModel` checkpoint that belongs to one of the supported model classes. For the data, we need a dictionary that specifies which columns are the `text` columns, the `numerical feature` columns, the `categorical feature` columns, and the `label` column. For classification, we can also name each label in the label column through the `label_list`. Alternatively, we can specify these columns in a JSON file and pass its path as the `column_info_path` argument to `MultimodalDataTrainingArguments`.

In [78]:
text_cols = ["Title", "Review Text"]
cat_cols = ["Clothing ID", "Division Name", "Department Name", "Class Name"]
numerical_cols = ["Rating", "Age", "Positive Feedback Count"]

column_info_dict = {
    "text_cols": text_cols,
    "num_cols": numerical_cols,
    "cat_cols": cat_cols,
    "label_col": "Recommended IND",
    "label_list": ["Not Recommended", "Recommended"],
}


model_args = ModelArguments(model_name_or_path="bert-base-uncased")

data_args = MultimodalDataTrainingArguments(
    data_path=".",
    combine_feat_method="individual_mlps_on_cat_and_numerical_feats_then_concat",
    column_info=column_info_dict,
    task="classification",
)

training_args = TrainingArguments(
    output_dir="./logs/model_name",
    logging_dir="./logs/runs",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=32,
    num_train_epochs=1,
    logging_steps=25,
    eval_steps=250,
)

set_seed(training_args.seed)
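
The `column_info_path` route is not exercised in this notebook. As a minimal sketch, the same column dictionary can be written to a JSON file and read back the way `__post_init__` does. The temporary directory and the file name `column_info.json` are assumptions made to keep the example self-contained.

```python
import json
import os
import tempfile

column_info = {
    "text_cols": ["Title", "Review Text"],
    "num_cols": ["Rating", "Age", "Positive Feedback Count"],
    "cat_cols": ["Clothing ID", "Division Name", "Department Name", "Class Name"],
    "label_col": "Recommended IND",
    "label_list": ["Not Recommended", "Recommended"],
}

# Write the dict to a JSON file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
column_info_path = os.path.join(tmp_dir, "column_info.json")
with open(column_info_path, "w") as f:
    json.dump(column_info, f, indent=2)

# This mirrors what __post_init__ does when only column_info_path is given.
with open(column_info_path, "r") as f:
    loaded_column_info = json.load(f)

print(loaded_column_info == column_info)
```

Given the `__post_init__` defined earlier, `MultimodalDataTrainingArguments(data_path=".", column_info_path=column_info_path)` would then fill `column_info` from the file.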

## Now we can load our model and data. 
### We first instantiate our HuggingFace tokenizer
This is needed to prepare our custom torch dataset. See `torch_dataset.py` for details.

In [79]:
tokenizer_path_or_name = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path
print('Specified tokenizer: ', tokenizer_path_or_name)
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_path_or_name,
    cache_dir=model_args.cache_dir,
)

Specified tokenizer:  bert-base-uncased


### Load dataset csvs to torch datasets
The function `load_data_from_folder` expects a path to a folder that contains `train.csv`, `test.csv`, and/or `val.csv` containing the respective split datasets. 

In [80]:
# Get Datasets
train_dataset, val_dataset, test_dataset = load_data_from_folder(
    data_args.data_path,
    data_args.column_info["text_cols"],
    tokenizer,
    label_col=data_args.column_info["label_col"],
    label_list=data_args.column_info["label_list"],
    categorical_cols=data_args.column_info["cat_cols"],
    numerical_cols=data_args.column_info["num_cols"],
    sep_text_token_str=tokenizer.sep_token,
)

INFO:multimodal_transformers.data.data_utils:3 numerical columns
INFO:multimodal_transformers.data.data_utils:1238 categorical columns
INFO:multimodal_transformers.data.data_utils:3 numerical columns
INFO:multimodal_transformers.data.load_data:Text columns: ['Title', 'Review Text']
INFO:multimodal_transformers.data.load_data:Raw text example: Absolutely wonderful - silky and sexy and comfortable
INFO:multimodal_transformers.data.data_utils:1238 categorical columns
INFO:multimodal_transformers.data.data_utils:3 numerical columns
INFO:multimodal_transformers.data.load_data:Text columns: ['Title', 'Review Text']
INFO:multimodal_transformers.data.load_data:Raw text example: Great fitting pants! [SEP] Love these pants i bought them in two colors! flattering fitting!! im 5'4 120lbs 36" hip. i bought these in size 25. sometimes i don't like petit sizing but these petite at 30" long inseam still gave me enough length to wear with heals. there is the usual good and bad with linen - after you we
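
The raw text examples in the log suggest how the text columns are combined: the fields are joined with the tokenizer's separator token, and a missing `Title` is simply left out. The sketch below reproduces that behaviour in plain Python. `join_text_columns` is a hypothetical helper written for illustration, not the toolkit's implementation, and the second review text is shortened.

```python
def join_text_columns(row, text_cols, sep_token="[SEP]"):
    """Join the non-missing text fields of one row with the separator token."""
    parts = []
    for col in text_cols:
        value = row.get(col)
        # Skip missing values (None or NaN) and empty strings.
        if value is None or value != value or str(value).strip() == "":
            continue
        parts.append(str(value).strip())
    return f" {sep_token} ".join(parts)


text_cols = ["Title", "Review Text"]
with_title = {
    "Title": "Great fitting pants!",
    "Review Text": "Love these pants i bought them in two colors!",
}
without_title = {
    "Title": None,
    "Review Text": "Absolutely wonderful - silky and sexy and comfortable",
}

joined_with_title = join_text_columns(with_title, text_cols)
joined_without_title = join_text_columns(without_title, text_cols)
print(joined_with_title)
print(joined_without_title)
```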

In [81]:
num_labels = len(np.unique(train_dataset.labels))
num_labels

2

In [82]:
config = AutoConfig.from_pretrained(
    model_args.config_name if model_args.config_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
)
tabular_config = TabularConfig(
    num_labels=num_labels,
    cat_feat_dim=train_dataset.cat_feats.shape[1],
    numerical_feat_dim=train_dataset.numerical_feats.shape[1],
    **vars(data_args)
)
config.tabular_config = tabular_config
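
The `cat_feat_dim` passed above is the width of the encoded categorical matrix, not the number of raw categorical columns. Assuming the default `categorical_encode_type="ohe"` means standard one-hot encoding, every distinct value in every categorical column gets its own indicator column. That would explain how four raw columns, including the high-cardinality `Clothing ID`, become the 1238 columns reported in the log. A toy sketch in plain Python follows; `one_hot_encode` is a hypothetical helper, not the toolkit's encoder.

```python
def one_hot_encode(rows, cat_cols):
    """One-hot encode categorical columns; returns (feature_names, encoded_rows)."""
    # One indicator column per (column, distinct value) pair, in a stable order.
    feature_names = []
    for col in cat_cols:
        for value in sorted({row[col] for row in rows}):
            feature_names.append((col, value))
    encoded = [
        [1 if row[col] == value else 0 for col, value in feature_names]
        for row in rows
    ]
    return feature_names, encoded


rows = [
    {"Division Name": "General", "Department Name": "Dresses"},
    {"Division Name": "General", "Department Name": "Tops"},
    {"Division Name": "General Petite", "Department Name": "Bottoms"},
]
feature_names, encoded = one_hot_encode(rows, ["Division Name", "Department Name"])
print(len(feature_names))  # 2 divisions + 3 departments
print(encoded[0])
```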

In [83]:
model = AutoModelWithTabular.from_pretrained(
    model_args.model_name_or_path,  # load weights from the model checkpoint, not the config name
    config=config,
    cache_dir=model_args.cache_dir,
)

Some weights of BertWithTabular were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['tabular_combiner.cat_mlp.layers.0.weight', 'tabular_combiner.num_mlp.bn.0.weight', 'classifier.bias', 'tabular_combiner.num_mlp.bn.0.running_var', 'classifier.weight', 'tabular_classifier.weight', 'tabular_classifier.bias', 'tabular_combiner.num_mlp.layers.0.weight', 'tabular_combiner.num_mlp.layers.1.weight', 'tabular_combiner.num_mlp.bn.0.bias', 'tabular_combiner.num_mlp.bn.0.running_mean', 'tabular_combiner.num_mlp.layers.0.bias', 'tabular_combiner.cat_mlp.layers.0.bias', 'tabular_combiner.num_mlp.layers.1.bias', 'tabular_combiner.num_mlp.bn.0.num_batches_tracked']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### We need to define a task-specific way of computing relevant metrics:

In [84]:
import numpy as np
from scipy.special import softmax
from sklearn.metrics import (
    auc,
    precision_recall_curve,
    roc_auc_score,
    f1_score,
    confusion_matrix,
    matthews_corrcoef,
)


def calc_classification_metrics(p: EvalPrediction):
    predictions = p.predictions[0]
    pred_labels = np.argmax(predictions, axis=1)
    pred_scores = softmax(predictions, axis=1)[:, 1]
    labels = p.label_ids
    if len(np.unique(labels)) == 2:  # binary classification
        roc_auc_pred_score = roc_auc_score(labels, pred_scores)
        precisions, recalls, thresholds = precision_recall_curve(labels, pred_scores)
        fscore = (2 * precisions * recalls) / (precisions + recalls)
        fscore[np.isnan(fscore)] = 0
        ix = np.argmax(fscore)
        threshold = thresholds[ix].item()
        pr_auc = auc(recalls, precisions)
        tn, fp, fn, tp = confusion_matrix(labels, pred_labels, labels=[0, 1]).ravel()
        result = {
            "roc_auc": roc_auc_pred_score,
            "threshold": threshold,
            "pr_auc": pr_auc,
            "recall": recalls[ix].item(),
            "precision": precisions[ix].item(),
            "f1": fscore[ix].item(),
            "tn": tn.item(),
            "fp": fp.item(),
            "fn": fn.item(),
            "tp": tp.item(),
        }
    else:
        acc = (pred_labels == labels).mean()
        f1 = f1_score(y_true=labels, y_pred=pred_labels, average="macro")  # the default average="binary" raises on multiclass labels
        result = {
            "acc": acc,
            "f1": f1,
            "acc_and_f1": (acc + f1) / 2,
            "mcc": matthews_corrcoef(labels, pred_labels),
        }

    return result
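
In the binary branch, the reported `threshold`, `precision`, `recall`, and `f1` are taken at the score threshold that maximizes F1 along the precision-recall curve. The sketch below shows that selection step on made-up labels and scores in plain Python. `best_f1_threshold` is a hypothetical helper that follows the same "predict positive when score >= threshold" convention. It is a simplified illustration, not a replacement for `precision_recall_curve`.

```python
def best_f1_threshold(labels, scores):
    """Return (threshold, precision, recall, f1) for the threshold with the highest F1."""
    best = (None, 0.0, 0.0, 0.0)
    for threshold in sorted(set(scores)):
        preds = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        if f1 > best[3]:
            best = (threshold, precision, recall, f1)
    return best


# Made-up example: three positives, two negatives.
toy_labels = [0, 0, 1, 1, 1]
toy_scores = [0.2, 0.6, 0.55, 0.7, 0.9]
threshold, precision, recall, f1 = best_f1_threshold(toy_labels, toy_scores)
print(threshold, precision, recall, round(f1, 4))
```

Here the best threshold is 0.55, which keeps all three positives and admits one false positive.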

In [86]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=calc_classification_metrics,
)

## Launching the training is as simple as calling `trainer.train()` 🤗

Note: For this demo run we set `training_args.max_steps` to 10 so that training finishes quickly, which is why the output below shows only 10 steps. The setting is left out of the code above to keep the starter code simple.

In [87]:
%%time
trainer.train()

  0%|          | 0/10 [00:00<?, ?it/s]

100%|██████████| 10/10 [02:54<00:00, 17.41s/it]

{'train_runtime': 174.0874, 'train_samples_per_second': 1.838, 'train_steps_per_second': 0.057, 'train_loss': 0.512493371963501, 'epoch': 0.02}
CPU times: user 9min 32s, sys: 1min 12s, total: 10min 44s
Wall time: 2min 54s





TrainOutput(global_step=10, training_loss=0.512493371963501, metrics={'train_runtime': 174.0874, 'train_samples_per_second': 1.838, 'train_steps_per_second': 0.057, 'train_loss': 0.512493371963501, 'epoch': 0.02})

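### Evaluating on the validation data

Note: To keep the demo fast, the cell below evaluates on a random 1% subset of the validation set (24 examples, judging by the confusion-matrix counts in the output), so the metrics are noisy. Pass `val_dataset` directly to evaluate on the full split.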

In [99]:
%%time
trainer.evaluate(eval_dataset=torch.utils.data.random_split(val_dataset, [0.01, 0.99])[0])

100%|██████████| 3/3 [00:03<00:00,  1.16s/it]

CPU times: user 15 s, sys: 546 ms, total: 15.5 s
Wall time: 5.97 s





{'eval_loss': 0.4994272291660309,
 'eval_roc_auc': 0.8210526315789474,
 'eval_threshold': 0.8211019039154053,
 'eval_pr_auc': 0.9517217548568722,
 'eval_recall': 1.0,
 'eval_precision': 0.8260869565217391,
 'eval_f1': 0.9047619047619047,
 'eval_tn': 0,
 'eval_fp': 5,
 'eval_fn': 0,
 'eval_tp': 19,
 'eval_runtime': 5.9443,
 'eval_samples_per_second': 4.037,
 'eval_steps_per_second': 0.505,
 'epoch': 0.02}
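
The confusion-matrix counts and the precision/recall/F1 values in this output come from different decision rules. In `calc_classification_metrics`, `tn`, `fp`, `fn`, and `tp` are computed from the argmax predictions, while `precision`, `recall`, and `f1` are read off at the best-F1 threshold (about 0.821 here). A quick arithmetic check using only the numbers printed above:

```python
# Counts and metrics copied from the evaluation output above.
tn, fp, fn, tp = 0, 5, 0, 19
reported_precision = 0.8260869565217391
reported_recall = 1.0
reported_f1 = 0.9047619047619047

# Precision and recall implied by the argmax confusion matrix.
argmax_precision = tp / (tp + fp)  # 19 / 24
argmax_recall = tp / (tp + fn)     # 19 / 19

# The reported precision matches 19 / 23 instead, i.e. one fewer false
# positive at the tuned threshold than with argmax.
tuned_precision = 19 / 23
tuned_f1 = 2 * tuned_precision * reported_recall / (tuned_precision + reported_recall)

print(round(argmax_precision, 4), round(tuned_precision, 4), round(tuned_f1, 4))
```

Note also that after only 10 training steps the argmax rule predicts "Recommended" for every one of the 24 examples (`tn` and `fn` are both 0).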

## Inference

Below is an example of running inference after training the model. We use a test batch of size 32. If you want to test a single example, ensure you still pass it as a batch to the model.

In [100]:
test_batch = test_dataset[:32]
test_batch.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels', 'cat_feats', 'numerical_feats'])

In [101]:
model.eval()
with torch.no_grad():
    _, logits, classifier_outputs = model(
        test_batch["input_ids"],
        attention_mask=test_batch["attention_mask"],
        token_type_ids=test_batch["token_type_ids"],
        cat_feats=test_batch["cat_feats"],
        numerical_feats=test_batch["numerical_feats"],
    )

In [102]:
acc = torch.sum(logits.argmax(axis=1) == test_batch["labels"]) / logits.shape[0]
print(f"Accuracy: {acc}")

Accuracy: 0.875
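
To turn logits into readable predictions, apply a softmax over the class dimension and map the argmax index to the `label_list` defined earlier. The sketch below does this in plain Python with made-up logits, since the real values depend on the trained model. The `softmax` helper and `example_logits` are illustrative only.

```python
import math

label_list = ["Not Recommended", "Recommended"]


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for two examples; real values would come from `logits` above.
example_logits = [[-0.4, 1.2], [0.9, -0.3]]

predictions = []
for row in example_logits:
    probs = softmax(row)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    predictions.append((label_list[predicted], probs))
    print(label_list[predicted], [round(p, 3) for p in probs])
```

On the tensor from the cell above, `torch.softmax(logits, dim=1)` gives the same probabilities. For a single example, a slice such as `test_dataset[:1]` should keep the batch dimension, assuming the dataset handles it the same way as `test_dataset[:32]`.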
