[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/georgianpartners/Multimodal-Toolkit/blob/master/notebooks/text_w_tabular_classification.ipynb)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Training a BertWithTabular Model for Clothing Review Recommendation Prediction

This guide closely follows the [example](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/trainer/01_text_classification.ipynb#scrollTo=bwl3I_VGAZXb) from HuggingFace for text classification on the GLUE dataset.

Install `multimodal-transformers` for the model and `kaggle` so we can download the dataset. The cell also upgrades `pandas`, `accelerate`, and `transformers`.

In [2]:
!pip install multimodal-transformers
!pip install -q kaggle
!pip install pandas --upgrade
!pip install -U accelerate
!pip install -U transformers

Collecting multimodal-transformers
  Downloading multimodal_transformers-0.2a0-py3-none-any.whl (22 kB)
Collecting transformers>=4.26.1 (from multimodal-transformers)
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting sacremoses~=0.0.53 (from multimodal-transformers)
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting networkx~=2.6.3 (from multimodal-transformers)
  Downloading networkx-2.6.3-py3-none-any.whl (1.9 MB)
Collecting scikit-learn~=1.0.2 (from multimodal-transformers)
  Downloading scikit_learn-1.0.2-cp310-cp310-manylinux_2_17_x86_6

Collecting pandas
  Downloading pandas-2.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)

## Setting up Kaggle
To get the dataset from Kaggle, we must upload our `kaggle.json` file, which contains our Kaggle API token. See https://www.kaggle.com/docs/api for details.

In [3]:
# from google.colab import files
# files.upload()

In [4]:
# ! mkdir ~/.kaggle
# ! cp kaggle.json ~/.kaggle/
# ! chmod 600 ~/.kaggle/kaggle.json
# ! kaggle datasets list

## All other imports are here:

In [5]:
from dataclasses import dataclass, field
import json
import logging
import os
from typing import Optional
import accelerate

import numpy as np
import pandas as pd
from transformers import (
    AutoTokenizer,
    AutoConfig,
    Trainer,
    EvalPrediction,
    set_seed
)
from transformers.training_args import TrainingArguments

from multimodal_transformers.data import load_data_from_folder
from multimodal_transformers.model import TabularConfig
from multimodal_transformers.model import AutoModelWithTabular

logging.basicConfig(level=logging.INFO)
os.environ['COMET_MODE'] = 'DISABLED'

cur_dir = os.getcwd()
print(f"current directory: {cur_dir}")
if cur_dir != '/content/drive/MyDrive/Research/r4.2':
  os.chdir('drive/MyDrive/Research/r4.2/')




current directory: /content


## Dataset

Our dataset is the [Women's E-Commerce Clothing Reviews](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews) dataset from Kaggle. It contains reviews written by customers about clothing items, as well as whether they recommend the item or not. We download the dataset here.

In [6]:
# !kaggle datasets download -d nicapotato/womens-ecommerce-clothing-reviews
# !unzip womens-ecommerce-clothing-reviews.zip
# !ls

#### Let us take a look at what the dataset looks like

In [7]:
# data_df = pd.read_csv('ExtractedData/dayr4.2.csv')
# data_df.head(5)

We see that the data contains text in the `Review Text` and `Title` columns as well as tabular features in the `Division Name`, `Department Name`, and `Class Name` columns.

In [8]:
# data_df.describe()

In this demonstration, we split our data 8:1:1 into training, validation, and test sets. We also save the splits to `train.csv`, `val.csv`, and `test.csv`, as this is the format our dataloader requires.


In [9]:
# train_df, val_df, test_df = np.split(data_df.sample(frac=1), [int(.8*len(data_df)), int(.9 * len(data_df))])
# print('Num examples train-val-test')
# print(len(train_df), len(val_df), len(test_df))
# train_df.to_csv('train.csv')
# val_df.to_csv('val.csv')
# test_df.to_csv('test.csv')
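
A plain shuffle can leave rare labels out of a small split, so a stratified split is one alternative to the cell above. The sketch below is a minimal illustration using scikit-learn's `train_test_split`. The toy frame, the `stratified_split` helper, and the seed are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, label_col, seed=42):
    """Split df 8:1:1 while keeping label proportions similar in each part."""
    # First carve off 20% of the rows, stratified on the label column.
    train_df, rest_df = train_test_split(
        df, test_size=0.2, stratify=df[label_col], random_state=seed
    )
    # Then halve the remainder into validation and test sets.
    val_df, test_df = train_test_split(
        rest_df, test_size=0.5, stratify=rest_df[label_col], random_state=seed
    )
    return train_df, val_df, test_df


# Hypothetical toy data: 80 rows of label 0 and 20 rows of label 1.
toy = pd.DataFrame(
    {"text": [f"row {i}" for i in range(100)], "insider": [0] * 80 + [1] * 20}
)
train_df, val_df, test_df = stratified_split(toy, "insider")
print("Num examples train-val-test")
print(len(train_df), len(val_df), len(test_df))
```

Stratification needs at least two rows per label in every frame being split, so extremely rare labels may still have to be handled separately.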

## We then define our Experiment Parameters
We use Data Classes to hold each of our arguments for the model, data, and training.

In [10]:
@dataclass
class ModelArguments:
  """
  Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
  """

  model_name_or_path: str = field(
      metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
  )
  config_name: Optional[str] = field(
      default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
  )
  tokenizer_name: Optional[str] = field(
      default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
  )
  cache_dir: Optional[str] = field(
      default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
  )


@dataclass
class MultimodalDataTrainingArguments:
  """
  Arguments pertaining to how we combine tabular features
  Using `HfArgumentParser` we can turn this class
  into argparse arguments to be able to specify them on
  the command line.
  """

  data_path: str = field(metadata={
                            'help': 'the path to the csv file containing the dataset'
                        })
  column_info_path: str = field(
      default=None,
      metadata={
          'help': 'the path to the json file detailing which columns are text, categorical, numerical, and the label'
  })

  column_info: dict = field(
      default=None,
      metadata={
          'help': 'a dict referencing the text, categorical, numerical, and label columns'
                  'its keys are text_cols, num_cols, cat_cols, and label_col'
  })

  categorical_encode_type: str = field(default='ohe',
                                        metadata={
                                            'help': 'sklearn encoder to use for categorical data',
                                            'choices': ['ohe', 'binary', 'label', 'none']
                                        })
  numerical_transformer_method: str = field(default='yeo_johnson',
                                            metadata={
                                                'help': 'sklearn numerical transformer to preprocess numerical data',
                                                'choices': ['yeo_johnson', 'box_cox', 'quantile_normal', 'none']
                                            })
  task: str = field(default="classification",
                    metadata={
                        "help": "The downstream training task",
                        "choices": ["classification", "regression"]
                    })

  mlp_division: int = field(default=4,
                            metadata={
                                'help': 'the ratio of the number of '
                                        'hidden dims in a current layer to the next MLP layer'
                            })
  combine_feat_method: str = field(default='individual_mlps_on_cat_and_numerical_feats_then_concat',
                                    metadata={
                                        'help': 'method to combine categorical and numerical features, '
                                                'see README for all the method'
                                    })
  mlp_dropout: float = field(default=0.1,
                              metadata={
                                'help': 'dropout ratio used for MLP layers'
                              })
  numerical_bn: bool = field(default=True,
                              metadata={
                                  'help': 'whether to use batchnorm on numerical features'
                              })
  use_simple_classifier: bool = field(default=True,
                                      metadata={
                                          'help': 'whether to use single layer or MLP as final classifier'
                                      })
  mlp_act: str = field(default='relu',
                        metadata={
                            'help': 'the activation function to use for finetuning layers',
                            'choices': ['relu', 'prelu', 'sigmoid', 'tanh', 'linear']
                        })
  gating_beta: float = field(default=0.2,
                              metadata={
                                  'help': "the beta hyperparameters used for gating tabular data "
                                          "see https://www.aclweb.org/anthology/2020.acl-main.214.pdf"
                              })

  def __post_init__(self):
      assert self.column_info != self.column_info_path
      if self.column_info is None and self.column_info_path:
          with open(self.column_info_path, 'r') as f:
              self.column_info = json.load(f)

### Here are the data and training parameters we will use.
For the model, we can specify any supported HuggingFace model class (see the README for details) or any AutoModel from those supported classes. For the data, we need a dictionary that identifies the `text` columns, the `numerical feature` columns, the `categorical feature` columns, and the `label` column. For classification, we can also specify what each value in the label column means through the `label list`. Alternatively, these columns can be given in a JSON file whose path is passed to `MultimodalDataTrainingArguments` as `column_info_path`.
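
As an illustration of the JSON route, the sketch below writes a column-info file and reads it back with `json.load`, the same call `__post_init__` uses. The feature column names are taken from the commented-out lists in this notebook. The label column name, the label names, and the file name are assumptions made for the example.

```python
import json
import os
import tempfile

# Feature columns come from the commented-out lists in this notebook;
# "label_col" and "label_list" values are assumptions for illustration.
column_info = {
    "text_cols": ["Title", "Review Text"],
    "num_cols": ["Rating", "Age", "Positive Feedback Count"],
    "cat_cols": ["Clothing ID", "Division Name", "Department Name", "Class Name"],
    "label_col": "Recommended IND",
    "label_list": ["Not Recommended", "Recommended"],
}

# Write the dictionary to a JSON file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "column_info.json")
with open(path, "w") as f:
    json.dump(column_info, f, indent=2)

# Read it back the way __post_init__ does.
with open(path, "r") as f:
    loaded = json.load(f)

# The help text for column_info names these four keys.
required_keys = {"text_cols", "num_cols", "cat_cols", "label_col"}
missing = required_keys - loaded.keys()
print("missing keys:", sorted(missing))
```

Passing the resulting path as `column_info_path` instead of `column_info` should then give an equivalent configuration.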

In [30]:
if cur_dir != '/content/drive/MyDrive/Research/r4.2/MockDataset':
  os.chdir('/content/drive/MyDrive/Research/r4.2/MockDataset')

# text_cols = ['Title', 'Review Text']
# cat_cols = ['Clothing ID', 'Division Name', 'Department Name', 'Class Name']
# numerical_cols = ['Rating', 'Age', 'Positive Feedback Count']
# cat_cols = ['user', 'day', 'week', 'role', 'dept', 'team']
cat_cols = []
text_cols = ["Unnamed: 0", "starttime", "endtime", "isweekday", "isweekend", "b_unit", "f_unit", "ITAdmin", "O", "C", "E", "A", "N", "n_allact", "allact_n-pc0", "allact_n-pc1", "allact_n-pc2", "allact_n-pc3", "n_workhourallact", "workhourallact_n-pc0", "workhourallact_n-pc1", "workhourallact_n-pc2", "workhourallact_n-pc3", "n_afterhourallact", "afterhourallact_n-pc0", "afterhourallact_n-pc1", "afterhourallact_n-pc2", "afterhourallact_n-pc3", "n_logon", "logon_n-pc0", "logon_n-pc1", "logon_n-pc2", "logon_n-pc3", "n_workhourlogon", "workhourlogon_n-pc0", "workhourlogon_n-pc1", "workhourlogon_n-pc2", "workhourlogon_n-pc3", "n_afterhourlogon", "afterhourlogon_n-pc0", "afterhourlogon_n-pc1", "afterhourlogon_n-pc2", "afterhourlogon_n-pc3", "n_usb", "usb_mean_usb_dur", "usb_n-pc0", "usb_n-pc1", "usb_n-pc2", "usb_n-pc3", "n_workhourusb", "workhourusb_mean_usb_dur", "workhourusb_n-pc0", "workhourusb_n-pc1", "workhourusb_n-pc2", "workhourusb_n-pc3", "n_afterhourusb", "afterhourusb_mean_usb_dur", "afterhourusb_n-pc0", "afterhourusb_n-pc1", "afterhourusb_n-pc2", "afterhourusb_n-pc3", "n_file", "file_mean_file_len", "file_mean_file_depth", "file_mean_file_nwords", "file_n-disk0", "file_n-disk1", "file_n-pc0", "file_n-pc1", "file_n-pc2", "file_n-pc3", "file_n_otherf", "file_otherf_mean_file_len", "file_otherf_mean_file_depth", "file_otherf_mean_file_nwords", "file_otherf_n-disk0", "file_otherf_n-disk1", "file_otherf_n-pc0", "file_otherf_n-pc1", "file_otherf_n-pc2", "file_otherf_n-pc3", "file_n_compf", "file_compf_mean_file_len", "file_compf_mean_file_depth", "file_compf_mean_file_nwords", "file_compf_n-disk0", "file_compf_n-disk1", "file_compf_n-pc0", "file_compf_n-pc1", "file_compf_n-pc2", "file_compf_n-pc3", "file_n_phof", "file_phof_mean_file_len", "file_phof_mean_file_depth", "file_phof_mean_file_nwords", "file_phof_n-disk0", "file_phof_n-disk1", "file_phof_n-pc0", "file_phof_n-pc1", "file_phof_n-pc2", "file_phof_n-pc3", "file_n_docf", "file_docf_mean_file_len", 
"file_docf_mean_file_depth", "file_docf_mean_file_nwords", "file_docf_n-disk0", "file_docf_n-disk1", "file_docf_n-pc0", "file_docf_n-pc1", "file_docf_n-pc2", "file_docf_n-pc3", "file_n_txtf", "file_txtf_mean_file_len", "file_txtf_mean_file_depth", "file_txtf_mean_file_nwords", "file_txtf_n-disk0", "file_txtf_n-disk1", "file_txtf_n-pc0", "file_txtf_n-pc1", "file_txtf_n-pc2", "file_txtf_n-pc3", "file_n_exef", "file_exef_mean_file_len", "file_exef_mean_file_depth", "file_exef_mean_file_nwords", "file_exef_n-disk0", "file_exef_n-disk1", "file_exef_n-pc0", "file_exef_n-pc1", "file_exef_n-pc2", "file_exef_n-pc3", "n_workhourfile", "workhourfile_mean_file_len", "workhourfile_mean_file_depth", "workhourfile_mean_file_nwords", "workhourfile_n-disk0", "workhourfile_n-disk1", "workhourfile_n-pc0", "workhourfile_n-pc1", "workhourfile_n-pc2", "workhourfile_n-pc3", "workhourfile_n_otherf", "workhourfile_otherf_mean_file_len", "workhourfile_otherf_mean_file_depth", "workhourfile_otherf_mean_file_nwords", "workhourfile_otherf_n-disk0", "workhourfile_otherf_n-disk1", "workhourfile_otherf_n-pc0", "workhourfile_otherf_n-pc1", "workhourfile_otherf_n-pc2", "workhourfile_otherf_n-pc3", "workhourfile_n_compf", "workhourfile_compf_mean_file_len", "workhourfile_compf_mean_file_depth", "workhourfile_compf_mean_file_nwords", "workhourfile_compf_n-disk0", "workhourfile_compf_n-disk1", "workhourfile_compf_n-pc0", "workhourfile_compf_n-pc1", "workhourfile_compf_n-pc2", "workhourfile_compf_n-pc3", "workhourfile_n_phof", "workhourfile_phof_mean_file_len", "workhourfile_phof_mean_file_depth", "workhourfile_phof_mean_file_nwords", "workhourfile_phof_n-disk0", "workhourfile_phof_n-disk1", "workhourfile_phof_n-pc0", "workhourfile_phof_n-pc1", "workhourfile_phof_n-pc2", "workhourfile_phof_n-pc3", "workhourfile_n_docf", "workhourfile_docf_mean_file_len", "workhourfile_docf_mean_file_depth", "workhourfile_docf_mean_file_nwords", "workhourfile_docf_n-disk0", "workhourfile_docf_n-disk1", 
"workhourfile_docf_n-pc0", "workhourfile_docf_n-pc1", "workhourfile_docf_n-pc2", "workhourfile_docf_n-pc3", "workhourfile_n_txtf", "workhourfile_txtf_mean_file_len", "workhourfile_txtf_mean_file_depth", "workhourfile_txtf_mean_file_nwords", "workhourfile_txtf_n-disk0", "workhourfile_txtf_n-disk1", "workhourfile_txtf_n-pc0", "workhourfile_txtf_n-pc1", "workhourfile_txtf_n-pc2", "workhourfile_txtf_n-pc3", "workhourfile_n_exef", "workhourfile_exef_mean_file_len", "workhourfile_exef_mean_file_depth", "workhourfile_exef_mean_file_nwords", "workhourfile_exef_n-disk0", "workhourfile_exef_n-disk1", "workhourfile_exef_n-pc0", "workhourfile_exef_n-pc1", "workhourfile_exef_n-pc2", "workhourfile_exef_n-pc3", "n_afterhourfile", "afterhourfile_mean_file_len", "afterhourfile_mean_file_depth", "afterhourfile_mean_file_nwords", "afterhourfile_n-disk0", "afterhourfile_n-disk1", "afterhourfile_n-pc0", "afterhourfile_n-pc1", "afterhourfile_n-pc2", "afterhourfile_n-pc3", "afterhourfile_n_otherf", "afterhourfile_otherf_mean_file_len", "afterhourfile_otherf_mean_file_depth", "afterhourfile_otherf_mean_file_nwords", "afterhourfile_otherf_n-disk0", "afterhourfile_otherf_n-disk1", "afterhourfile_otherf_n-pc0", "afterhourfile_otherf_n-pc1", "afterhourfile_otherf_n-pc2", "afterhourfile_otherf_n-pc3", "afterhourfile_n_compf", "afterhourfile_compf_mean_file_len", "afterhourfile_compf_mean_file_depth", "afterhourfile_compf_mean_file_nwords", "afterhourfile_compf_n-disk0", "afterhourfile_compf_n-disk1", "afterhourfile_compf_n-pc0", "afterhourfile_compf_n-pc1", "afterhourfile_compf_n-pc2", "afterhourfile_compf_n-pc3", "afterhourfile_n_phof", "afterhourfile_phof_mean_file_len", "afterhourfile_phof_mean_file_depth", "afterhourfile_phof_mean_file_nwords", "afterhourfile_phof_n-disk0", "afterhourfile_phof_n-disk1", "afterhourfile_phof_n-pc0", "afterhourfile_phof_n-pc1", "afterhourfile_phof_n-pc2", "afterhourfile_phof_n-pc3", "afterhourfile_n_docf", "afterhourfile_docf_mean_file_len", 
"afterhourfile_docf_mean_file_depth", "afterhourfile_docf_mean_file_nwords", "afterhourfile_docf_n-disk0", "afterhourfile_docf_n-disk1", "afterhourfile_docf_n-pc0", "afterhourfile_docf_n-pc1", "afterhourfile_docf_n-pc2", "afterhourfile_docf_n-pc3", "afterhourfile_n_txtf", "afterhourfile_txtf_mean_file_len", "afterhourfile_txtf_mean_file_depth", "afterhourfile_txtf_mean_file_nwords", "afterhourfile_txtf_n-disk0", "afterhourfile_txtf_n-disk1", "afterhourfile_txtf_n-pc0", "afterhourfile_txtf_n-pc1", "afterhourfile_txtf_n-pc2", "afterhourfile_txtf_n-pc3", "afterhourfile_n_exef", "afterhourfile_exef_mean_file_len", "afterhourfile_exef_mean_file_depth", "afterhourfile_exef_mean_file_nwords", "afterhourfile_exef_n-disk0", "afterhourfile_exef_n-disk1", "afterhourfile_exef_n-pc0", "afterhourfile_exef_n-pc1", "afterhourfile_exef_n-pc2", "afterhourfile_exef_n-pc3", "n_email", "email_mean_n_des", "email_mean_n_atts", "email_mean_n_exdes", "email_mean_n_bccdes", "email_mean_email_size", "email_mean_email_text_slen", "email_mean_email_text_nwords", "email_n-Xemail1", "email_n-exbccmail1", "email_n-pc0", "email_n-pc1", "email_n-pc2", "email_n-pc3", "n_workhouremail", "workhouremail_mean_n_des", "workhouremail_mean_n_atts", "workhouremail_mean_n_exdes", "workhouremail_mean_n_bccdes", "workhouremail_mean_email_size", "workhouremail_mean_email_text_slen", "workhouremail_mean_email_text_nwords", "workhouremail_n-Xemail1", "workhouremail_n-exbccmail1", "workhouremail_n-pc0", "workhouremail_n-pc1", "workhouremail_n-pc2", "workhouremail_n-pc3", "n_afterhouremail", "afterhouremail_mean_n_des", "afterhouremail_mean_n_atts", "afterhouremail_mean_n_exdes", "afterhouremail_mean_n_bccdes", "afterhouremail_mean_email_size", "afterhouremail_mean_email_text_slen", "afterhouremail_mean_email_text_nwords", "afterhouremail_n-Xemail1", "afterhouremail_n-exbccmail1", "afterhouremail_n-pc0", "afterhouremail_n-pc1", "afterhouremail_n-pc2", "afterhouremail_n-pc3", "n_http", "http_mean_url_len", 
"http_mean_url_depth", "http_mean_http_c_len", "http_mean_http_c_nwords", "http_n-pc0", "http_n-pc1", "http_n-pc2", "http_n-pc3", "http_n_otherf", "http_otherf_mean_url_len", "http_otherf_mean_url_depth", "http_otherf_mean_http_c_len", "http_otherf_mean_http_c_nwords", "http_otherf_n-pc0", "http_otherf_n-pc1", "http_otherf_n-pc2", "http_otherf_n-pc3", "http_n_socnetf", "http_socnetf_mean_url_len", "http_socnetf_mean_url_depth", "http_socnetf_mean_http_c_len", "http_socnetf_mean_http_c_nwords", "http_socnetf_n-pc0", "http_socnetf_n-pc1", "http_socnetf_n-pc2", "http_socnetf_n-pc3", "http_n_cloudf", "http_cloudf_mean_url_len", "http_cloudf_mean_url_depth", "http_cloudf_mean_http_c_len", "http_cloudf_mean_http_c_nwords", "http_cloudf_n-pc0", "http_cloudf_n-pc1", "http_cloudf_n-pc2", "http_cloudf_n-pc3", "http_n_jobf", "http_jobf_mean_url_len", "http_jobf_mean_url_depth", "http_jobf_mean_http_c_len", "http_jobf_mean_http_c_nwords", "http_jobf_n-pc0", "http_jobf_n-pc1", "http_jobf_n-pc2", "http_jobf_n-pc3", "http_n_leakf", "http_leakf_mean_url_len", "http_leakf_mean_url_depth", "http_leakf_mean_http_c_len", "http_leakf_mean_http_c_nwords", "http_leakf_n-pc0", "http_leakf_n-pc1", "http_leakf_n-pc2", "http_leakf_n-pc3", "http_n_hackf", "http_hackf_mean_url_len", "http_hackf_mean_url_depth", "http_hackf_mean_http_c_len", "http_hackf_mean_http_c_nwords", "http_hackf_n-pc0", "http_hackf_n-pc1", "http_hackf_n-pc2", "http_hackf_n-pc3", "n_workhourhttp", "workhourhttp_mean_url_len", "workhourhttp_mean_url_depth", "workhourhttp_mean_http_c_len", "workhourhttp_mean_http_c_nwords", "workhourhttp_n-pc0", "workhourhttp_n-pc1", "workhourhttp_n-pc2", "workhourhttp_n-pc3", "workhourhttp_n_otherf", "workhourhttp_otherf_mean_url_len", "workhourhttp_otherf_mean_url_depth", "workhourhttp_otherf_mean_http_c_len", "workhourhttp_otherf_mean_http_c_nwords", "workhourhttp_otherf_n-pc0", "workhourhttp_otherf_n-pc1", "workhourhttp_otherf_n-pc2", "workhourhttp_otherf_n-pc3", 
"workhourhttp_n_socnetf", "workhourhttp_socnetf_mean_url_len", "workhourhttp_socnetf_mean_url_depth", "workhourhttp_socnetf_mean_http_c_len", "workhourhttp_socnetf_mean_http_c_nwords", "workhourhttp_socnetf_n-pc0", "workhourhttp_socnetf_n-pc1", "workhourhttp_socnetf_n-pc2", "workhourhttp_socnetf_n-pc3", "workhourhttp_n_cloudf", "workhourhttp_cloudf_mean_url_len", "workhourhttp_cloudf_mean_url_depth", "workhourhttp_cloudf_mean_http_c_len", "workhourhttp_cloudf_mean_http_c_nwords", "workhourhttp_cloudf_n-pc0", "workhourhttp_cloudf_n-pc1", "workhourhttp_cloudf_n-pc2", "workhourhttp_cloudf_n-pc3", "workhourhttp_n_jobf", "workhourhttp_jobf_mean_url_len", "workhourhttp_jobf_mean_url_depth", "workhourhttp_jobf_mean_http_c_len", "workhourhttp_jobf_mean_http_c_nwords", "workhourhttp_jobf_n-pc0", "workhourhttp_jobf_n-pc1", "workhourhttp_jobf_n-pc2", "workhourhttp_jobf_n-pc3", "workhourhttp_n_leakf", "workhourhttp_leakf_mean_url_len", "workhourhttp_leakf_mean_url_depth", "workhourhttp_leakf_mean_http_c_len", "workhourhttp_leakf_mean_http_c_nwords", "workhourhttp_leakf_n-pc0", "workhourhttp_leakf_n-pc1", "workhourhttp_leakf_n-pc2", "workhourhttp_leakf_n-pc3", "workhourhttp_n_hackf", "workhourhttp_hackf_mean_url_len", "workhourhttp_hackf_mean_url_depth", "workhourhttp_hackf_mean_http_c_len", "workhourhttp_hackf_mean_http_c_nwords", "workhourhttp_hackf_n-pc0", "workhourhttp_hackf_n-pc1", "workhourhttp_hackf_n-pc2", "workhourhttp_hackf_n-pc3", "n_afterhourhttp", "afterhourhttp_mean_url_len", "afterhourhttp_mean_url_depth", "afterhourhttp_mean_http_c_len", "afterhourhttp_mean_http_c_nwords", "afterhourhttp_n-pc0", "afterhourhttp_n-pc1", "afterhourhttp_n-pc2", "afterhourhttp_n-pc3", "afterhourhttp_n_otherf", "afterhourhttp_otherf_mean_url_len", "afterhourhttp_otherf_mean_url_depth", "afterhourhttp_otherf_mean_http_c_len", "afterhourhttp_otherf_mean_http_c_nwords", "afterhourhttp_otherf_n-pc0", "afterhourhttp_otherf_n-pc1", "afterhourhttp_otherf_n-pc2", "afterhourhttp_otherf_n-pc3", 
"afterhourhttp_n_socnetf", "afterhourhttp_socnetf_mean_url_len", "afterhourhttp_socnetf_mean_url_depth", "afterhourhttp_socnetf_mean_http_c_len", "afterhourhttp_socnetf_mean_http_c_nwords", "afterhourhttp_socnetf_n-pc0", "afterhourhttp_socnetf_n-pc1", "afterhourhttp_socnetf_n-pc2", "afterhourhttp_socnetf_n-pc3", "afterhourhttp_n_cloudf", "afterhourhttp_cloudf_mean_url_len", "afterhourhttp_cloudf_mean_url_depth", "afterhourhttp_cloudf_mean_http_c_len", "afterhourhttp_cloudf_mean_http_c_nwords", "afterhourhttp_cloudf_n-pc0", "afterhourhttp_cloudf_n-pc1", "afterhourhttp_cloudf_n-pc2", "afterhourhttp_cloudf_n-pc3", "afterhourhttp_n_jobf", "afterhourhttp_jobf_mean_url_len", "afterhourhttp_jobf_mean_url_depth", "afterhourhttp_jobf_mean_http_c_len", "afterhourhttp_jobf_mean_http_c_nwords", "afterhourhttp_jobf_n-pc0", "afterhourhttp_jobf_n-pc1", "afterhourhttp_jobf_n-pc2", "afterhourhttp_jobf_n-pc3", "afterhourhttp_n_leakf", "afterhourhttp_leakf_mean_url_len", "afterhourhttp_leakf_mean_url_depth", "afterhourhttp_leakf_mean_http_c_len", "afterhourhttp_leakf_mean_http_c_nwords", "afterhourhttp_leakf_n-pc0", "afterhourhttp_leakf_n-pc1", "afterhourhttp_leakf_n-pc2", "afterhourhttp_leakf_n-pc3", "afterhourhttp_n_hackf", "afterhourhttp_hackf_mean_url_len", "afterhourhttp_hackf_mean_url_depth", "afterhourhttp_hackf_mean_http_c_len", "afterhourhttp_hackf_mean_http_c_nwords", "afterhourhttp_hackf_n-pc0", "afterhourhttp_hackf_n-pc1", "afterhourhttp_hackf_n-pc2", "afterhourhttp_hackf_n-pc3"]
numerical_cols = []
unique_labels = [0, 1, 2, 3]
column_info_dict = {
    'text_cols': text_cols,
    'num_cols': numerical_cols,
    'cat_cols': cat_cols,
    'label_col': 'insider',
    'label_list': unique_labels
}


model_args = ModelArguments(
    model_name_or_path='bert-base-uncased'
)

data_args = MultimodalDataTrainingArguments(
    data_path='.',
    combine_feat_method='text_only',
    column_info=column_info_dict,
    task='classification',
    categorical_encode_type=None,
    numerical_transformer_method="none"
)

training_args = TrainingArguments(
    output_dir="./logs/model_name",
    logging_dir="./logs/runs",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy='steps',
    logging_steps=25,
    eval_steps=250,
)

set_seed(training_args.seed)

In [12]:
print(len(text_cols))

502


## Now we can load our model and data.
### We first instantiate our HuggingFace tokenizer
This is needed to prepare our custom torch dataset. See `torch_dataset.py` for details.

In [13]:
tokenizer_path_or_name = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path
print('Specified tokenizer: ', tokenizer_path_or_name)
tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_path_or_name,
    cache_dir=model_args.cache_dir,
)

Specified tokenizer:  bert-base-uncased



### Load dataset csvs to torch datasets
The function `load_data_from_folder` expects a path to a folder that contains `train.csv`, `test.csv`, and/or `val.csv` containing the respective split datasets.

In [14]:
# Get Datasets
train_dataset, val_dataset, test_dataset = load_data_from_folder(
    data_args.data_path,
    data_args.column_info['text_cols'],
    tokenizer,
    label_col=data_args.column_info['label_col'],
    label_list=data_args.column_info['label_list'],
    categorical_cols=data_args.column_info['cat_cols'],
    numerical_cols=data_args.column_info['num_cols'],
    sep_text_token_str=tokenizer.sep_token,
    categorical_encode_type=data_args.categorical_encode_type,
    numerical_transformer_method=data_args.numerical_transformer_method,
)
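
The `sep_text_token_str` argument suggests that the values of the text columns are joined into one string per row before tokenization. The sketch below illustrates that idea in plain Python. It is an assumption about the behavior, not the library's actual code, and the `join_text_columns` helper and the sample row are hypothetical.

```python
def join_text_columns(row, text_cols, sep_token="[SEP]"):
    """Join the string form of each text column with the separator token."""
    return f" {sep_token} ".join(str(row[col]) for col in text_cols)


# Hypothetical row with two text columns.
row = {"Title": "Great dress", "Review Text": "Fits well and looks nice."}
joined = join_text_columns(row, ["Title", "Review Text"])
print(joined)
```

If rows are combined this way, listing hundreds of columns in `text_cols` may well produce sequences longer than the 512 tokens `bert-base-uncased` accepts, so it is worth checking how much of each row survives truncation.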

In [15]:
num_labels = len(np.unique(train_dataset.labels))
num_labels

3
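
The cell above reports 3 distinct labels in the training split, while `unique_labels` declares 4. A per-split count makes this kind of mismatch visible. The sketch below uses made-up label arrays (hypothetical stand-ins for `train_dataset.labels` and the validation labels) and a hypothetical `label_report` helper.

```python
import numpy as np


def label_report(labels, declared_labels):
    """Return per-label counts and the declared labels that never occur."""
    values, counts = np.unique(labels, return_counts=True)
    present = {int(v): int(c) for v, c in zip(values, counts)}
    missing = [lab for lab in declared_labels if lab not in present]
    return present, missing


declared = [0, 1, 2, 3]
# Made-up labels for illustration only.
train_labels = np.array([0, 0, 0, 1, 0, 2, 0, 0])
val_labels = np.array([0, 0, 2, 0])

for name, labels in [("train", train_labels), ("val", val_labels)]:
    present, missing = label_report(labels, declared)
    print(name, "counts:", present, "missing:", missing)
```

A label that is missing from a split cannot contribute to that split's metrics, which matters when interpreting the evaluation results.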

In [16]:
config = AutoConfig.from_pretrained(
            model_args.config_name
            if model_args.config_name
            else model_args.model_name_or_path,
            cache_dir=model_args.cache_dir,
        )
tabular_config = TabularConfig(
    num_labels=num_labels,
    cat_feat_dim=train_dataset.cat_feats.shape[1]
    if train_dataset.cat_feats is not None
    else 0,
    numerical_feat_dim=train_dataset.numerical_feats.shape[1]
    if train_dataset.numerical_feats is not None
    else 0,
    **vars(data_args),
)
config.tabular_config = tabular_config

In [17]:
model = AutoModelWithTabular.from_pretrained(
        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
        config=config,
        cache_dir=model_args.cache_dir
    )

Some weights of BertWithTabular were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'tabular_classifier.bias', 'classifier.bias', 'tabular_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### We need to define a task-specific way of computing relevant metrics:

In [40]:
import numpy as np
from scipy.special import softmax
from sklearn.metrics import (
    auc,
    precision_recall_curve,
    roc_auc_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    matthews_corrcoef,
)

def calc_classification_metrics(p: EvalPrediction):
  predictions = p.predictions[0]
  pred_labels = np.argmax(predictions, axis=1)
  # Probability assigned to class 1; suitable as a ROC AUC score only when the task is binary
  pred_scores = softmax(predictions, axis=1)[:, 1]
  labels = p.label_ids
  print(np.unique(labels))
  # if len(np.unique(labels)) == 2:  # binary classification
  if len(unique_labels) == 2:  # binary classification
      roc_auc_pred_score = roc_auc_score(labels, pred_scores)
      precisions, recalls, thresholds = precision_recall_curve(labels, pred_scores)
      fscore = (2 * precisions * recalls) / (precisions + recalls)
      fscore[np.isnan(fscore)] = 0
      ix = np.argmax(fscore)
      threshold = thresholds[ix].item()
      pr_auc = auc(recalls, precisions)
      tn, fp, fn, tp = confusion_matrix(labels, pred_labels, labels=[0, 1]).ravel()
      result = {
              'roc_auc': roc_auc_pred_score,
              'threshold': threshold,
              'pr_auc': pr_auc,
              'recall': recalls[ix].item(),
              'precision': precisions[ix].item(), 'f1': fscore[ix].item(),
              'tn': tn.item(), 'fp': fp.item(), 'fn': fn.item(), 'tp': tp.item()
            }
  else:
      acc = (pred_labels == labels).mean()
      roc_auc_pred_score = roc_auc_score(labels, pred_scores, average='weighted')
      prec_score = precision_score(labels, pred_labels, average='weighted')
      rec_score = recall_score(labels, pred_labels, average='weighted')

      f1 = f1_score(y_true=labels, y_pred=pred_labels, average='weighted')
      result = {
          "acc": acc,
          'roc_auc': roc_auc_pred_score,
          'recall': rec_score,
          'precision': prec_score,
          "f1": f1,
          "acc_and_f1": (acc + f1) / 2,
          "mcc": matthews_corrcoef(labels, pred_labels)
      }

  return result
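
`calc_classification_metrics` turns logits into probabilities with a softmax and then keeps column 1. The sketch below, written with NumPy only and using made-up logits, shows what those two steps produce. Column 1 is the probability of class 1, which is a natural score for binary problems. With more than two classes, a single column describes only one class.

```python
import numpy as np


def stable_softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up logits for two examples and four classes.
logits = np.array([[4.0, 0.5, -1.0, -2.0],
                   [0.1, 0.2, 3.0, -1.0]])
probs = stable_softmax(logits)

pred_labels = probs.argmax(axis=1)  # most likely class per row
class1_scores = probs[:, 1]         # probability of class 1 only

print(pred_labels)
print(class1_scores)
```

Here the second row is confidently class 2, yet its class-1 score is small, so ranking examples by `class1_scores` would say little about how well class 2 is detected.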

In [41]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=calc_classification_metrics,
)

## Launching the training is as simple as calling trainer.train() 🤗

In [20]:
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize

import psutil
import humanize
import os
import GPUtil as GPU
import gc
gc.collect()

GPUs = GPU.getGPUs()
# Note: Colab provides at most one GPU, and one is not guaranteed
gpu = GPUs[0]
def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " |     Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total     {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

Collecting gputil
  Downloading GPUtil-1.4.0.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Created wheel for gputil: filename=GPUtil-1.4.0-py3-none-any.whl size=7394 sha256=42ced780eb16389d93b8d7881dcbe0746488cd3813cfb0a18f1db9252fa0079a
  Stored in directory: /root/.cache/pip/wheels/a9/8a/bd/81082387151853ab8b6b3ef33426e98f5cbfebc3c397a9d4d0
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0
Gen RAM Free: 47.4 GB  |     Proc size: 6.0 GB
GPU RAM Free: 13748MB | Used: 1353MB | Util   9% | Total     15360MB


In [42]:
%%time
trainer.train()
trainer.save_model()



Step,Training Loss,Validation Loss,Acc,Roc Auc,Recall,Precision,F1,Acc And F1,Mcc
250,0.1234,0.043735,0.996498,0.429145,0.996498,0.993009,0.99475,0.995624,0.0
500,0.0,0.04373,0.996498,0.518503,0.996498,0.993009,0.99475,0.995624,0.0


[0 2]


  _warn_prf(average, modifier, msg_start, len(result))


[0 2]


  _warn_prf(average, modifier, msg_start, len(result))


CPU times: user 5min 41s, sys: 5.64 s, total: 5min 47s
Wall time: 7min 29s


TrainOutput(global_step=500, training_loss=0.014554292136339193, metrics={'train_runtime': 449.2191, 'train_samples_per_second': 4.45, 'train_steps_per_second': 1.113, 'total_flos': 525973166785536.0, 'train_loss': 0.014554292136339193, 'epoch': 1.0})

In [50]:
from pprint import pformat

# Evaluation
eval_results = {}
if training_args.do_eval:
    eval_result = trainer.evaluate(eval_dataset=val_dataset)
    print(pformat(eval_result, indent=4))

    output_eval_file = os.path.join(
        training_args.output_dir, "eval_metric_results_classification_fold_0.txt"
    )
    if trainer.is_world_process_zero():
        with open(output_eval_file, "w") as writer:
            print("***** Eval results classification *****")
            for key, value in eval_result.items():
                print("  %s = %s" % (key, value))
                writer.write("%s = %s\n" % (key, value))

    eval_results.update(eval_result)

[0 2]
{   'epoch': 1.0,
    'eval_acc': 0.9964982491245623,
    'eval_acc_and_f1': 0.9956243468765202,
    'eval_f1': 0.9947504446284782,
    'eval_loss': 0.04372965916991234,
    'eval_mcc': 0.0,
    'eval_precision': 0.9930087605083183,
    'eval_recall': 0.9964982491245623,
    'eval_roc_auc': 0.5185025817555938,
    'eval_runtime': 99.5575,
    'eval_samples_per_second': 20.079,
    'eval_steps_per_second': 2.511}
***** Eval results classification *****
  eval_loss = 0.04372965916991234
  eval_acc = 0.9964982491245623
  eval_roc_auc = 0.5185025817555938
  eval_recall = 0.9964982491245623
  eval_precision = 0.9930087605083183
  eval_f1 = 0.9947504446284782
  eval_acc_and_f1 = 0.9956243468765202
  eval_mcc = 0.0
  eval_runtime = 99.5575
  eval_samples_per_second = 20.079
  eval_steps_per_second = 2.511
  epoch = 1.0


  _warn_prf(average, modifier, msg_start, len(result))
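
The validation metrics above combine an accuracy near 0.9965 with an MCC of exactly 0.0, and the `_warn_prf` warning suggests that at least one class was never predicted. That pattern is what a model produces when it assigns every example to the majority class. The sketch below is a hand computation, not the notebook's metric function. It assumes a binary split of 1,992 majority and 7 minority validation examples; those counts are not stated in the notebook and are inferred only because they reproduce the reported accuracy.

```python
import math

# Assumed counts (inferred, not stated in the notebook).
n_major, n_minor = 1992, 7
n_total = n_major + n_minor


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from confusion counts."""
    p = safe_div(tp, tp + fp)
    r = safe_div(tp, tp + fn)
    return p, r, safe_div(2 * p * r, p + r)


# A classifier that always predicts the majority class:
# every majority example is a hit, every minority example is a miss.
p_major, r_major, f_major = precision_recall_f1(tp=n_major, fp=n_minor, fn=0)
p_minor, r_minor, f_minor = precision_recall_f1(tp=0, fp=0, fn=n_minor)

# Support-weighted averages, as with average='weighted'.
w_major, w_minor = n_major / n_total, n_minor / n_total
accuracy = n_major / n_total
precision = w_major * p_major + w_minor * p_minor
recall = w_major * r_major + w_minor * r_minor
f1 = w_major * f_major + w_minor * f_minor

# Binary MCC with the minority class as "positive"; taken as 0.0 when the
# denominator is zero, which matches the 0.0 reported above.
tp, fp, fn, tn = 0, 0, n_minor, n_major
mcc = safe_div(
    tp * tn - fp * fn,
    math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
)

print(f"acc={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f} mcc={mcc}")
```

Under these assumptions the weighted scores agree with the reported `eval_acc`, `eval_precision`, `eval_recall` and `eval_f1` to six decimals, so the high accuracy here is likely a reflection of class imbalance rather than evidence that the model has learned the minority class.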


In [53]:
from pprint import pformat

logging.info("*** Test ***")

predictions = trainer.predict(test_dataset=test_dataset).predictions[0]
output_test_file = os.path.join(
    training_args.output_dir, "test_results_classification_fold_0.txt"
)
eval_result = trainer.evaluate(eval_dataset=test_dataset)
print(pformat(eval_result, indent=4))
if trainer.is_world_process_zero():
    with open(output_test_file, "w") as writer:
        print("***** Test results classification *****")
        writer.write("index\tprediction\n")
        predictions = np.argmax(predictions, axis=1)
        for index, item in enumerate(predictions):
          item = test_dataset.get_labels()[item]
          writer.write("%d\t%s\n" % (index, item))
    output_test_file = os.path.join(
        training_args.output_dir,
        "test_metric_results_classification_fold_0.txt",
    )
    with open(output_test_file, "w") as writer:
        print("***** Test results classification *****")
        for key, value in eval_result.items():
            print("  %s = %s" % (key, value))
            writer.write("%s = %s\n" % (key, value))
    eval_results.update(eval_result)


[0 2]


  _warn_prf(average, modifier, msg_start, len(result))


[0 2]
{   'epoch': 1.0,
    'eval_acc': 0.9979989994997499,
    'eval_acc_and_f1': 0.9974992503763148,
    'eval_f1': 0.9969995012528797,
    'eval_loss': 0.02499542199075222,
    'eval_mcc': 0.0,
    'eval_precision': 0.9960020030025019,
    'eval_recall': 0.9979989994997499,
    'eval_roc_auc': 0.04855889724310776,
    'eval_runtime': 89.988,
    'eval_samples_per_second': 22.214,
    'eval_steps_per_second': 2.778}
***** Test results classification *****
***** Test results classification *****
  eval_loss = 0.02499542199075222
  eval_acc = 0.9979989994997499
  eval_roc_auc = 0.04855889724310776
  eval_recall = 0.9979989994997499
  eval_precision = 0.9960020030025019
  eval_f1 = 0.9969995012528797
  eval_acc_and_f1 = 0.9974992503763148
  eval_mcc = 0.0
  eval_runtime = 89.988
  eval_samples_per_second = 22.214
  eval_steps_per_second = 2.778
  epoch = 1.0


  _warn_prf(average, modifier, msg_start, len(result))
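
The test `eval_roc_auc` of about 0.049 sits far below the 0.5 expected from random scores. One way to read that number: for binary labels, ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below illustrates that interpretation in plain Python. The scores are hypothetical and `pairwise_auc` is an illustrative helper, not part of scikit-learn or the toolkit.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores for the positive class: positives mostly score LOWER
# than negatives, which pushes the AUC below 0.5.
pos = [0.10, 0.20]
neg = [0.30, 0.40, 0.15, 0.50]

auc = pairwise_auc(pos, neg)
print(auc)

# Swapping the roles of the two groups mirrors the value around 0.5
# when there are no ties.
print(pairwise_auc(neg, pos))
```

An AUC well below 0.5 means the scores rank the positive class systematically below the negative class. With only a handful of positive examples, though, each one contributes a large share of the pairs, so the value can swing widely between splits (compare roughly 0.52 on validation with 0.049 on test).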


In [None]:
del model
del config
del tabular_config
del trainer
torch.cuda.empty_cache()


### Check that our training was successful using TensorBoard

In [43]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [None]:
%tensorboard --logdir ./logs/runs --port=6006