# Tweet Turing Test: Detecting Disinformation on Twitter  

|          | Group #2 - Disinformation Detectors                     |
|---------:|---------------------------------------------------------|
| Members  | John Johnson, Katy Matulay, Justin Minnion, Jared Rubin |
| Notebook | `05_multimodal.ipynb`                                   |
| Purpose  | Combining tabular data with a BERT transformer.         |

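This notebook combines each tweet's engineered tabular features with the text representation from a fine-tuned transformer (e.g. RoBERTa), using the `multimodal_transformers` package from Georgian's Multimodal-Toolkit. It trains the combined classifier, evaluates it on a held-out test set, and saves the model, predictions, metrics, and run arguments.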

*Assumptions*  
 - The dataset being used has binary class labels following convention: 0 = authentic tweet; 1 = troll tweet
 - The execution environment has internet access (to download models from huggingface.co)
 - The execution environment has a CUDA-capable GPU available

*General Notes*
 - Notebook kernel must be completely restarted between runs to release reserved VRAM from the GPU.
 - Notebook contains our usual code to load dataset file from GCP bucket, but model files are always saved locally regardless of `local_or_cloud` setting.
 - Notebook is based on a [notebook by georgian-io (github.com)](https://github.com/georgian-io/Multimodal-Toolkit/blob/master/notebooks/text_w_tabular_classification.ipynb) from their [Multimodal-Toolkit repository (github.com)](https://github.com/georgian-io/Multimodal-Toolkit) and [accompanying blog post (medium.com)](https://medium.com/georgian-impact-blog/how-to-incorporate-tabular-data-with-huggingface-transformers-b70ac45fcfb4).

# 1 - Setup

In [1]:
# confirm installed version of `transformers` matches the version required by `multimodal_transformers`
supported_transformers_versions = {'4.25.1', '4.26.1'}

import transformers
assert (transformers.__version__ in supported_transformers_versions), \
    "Unsupported version of 'transformers' installed."

In [2]:
# imports from Python standard library
import os
import json
import logging
import shutil
import zipfile
from dataclasses import dataclass, field
from pathlib import Path
from pprint import pprint
from typing import Optional

# imports requiring installation
#   connection to Google Cloud Storage
from google.cloud import storage            # pip install google-cloud-storage
from google.oauth2 import service_account   # pip install google-auth

#  data science packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo    # for debugging GPU memory
from sklearn.metrics import (
    accuracy_score, auc, brier_score_loss, confusion_matrix, f1_score, precision_recall_curve, 
    precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from scipy.special import softmax

# 🤗 (Hugging Face) packages
import evaluate
from transformers import (
    AutoTokenizer, AutoConfig, AutoModel, 
    BertTokenizerFast, DistilBertTokenizerFast, RobertaTokenizerFast, BertweetTokenizer, XLMRobertaTokenizerFast,
    TrainingArguments, Trainer, EvalPrediction, 
    set_seed
)

# Georgian packages
from multimodal_transformers.data.load_data import load_train_val_test_helper   # shhh, didn't have a leading underscore
from multimodal_transformers.model import (
    TabularConfig, AutoModelWithTabular, BertWithTabular, DistilBertWithTabular, RobertaWithTabular
)

In [3]:
# imports from tweet_turing.py
import tweet_turing as tur      # note - different import approach from prior notebooks

# imports from tweet_turing_paths.py
from tweet_turing_paths import local_data_paths, local_snapshot_paths, gcp_data_paths, \
    gcp_snapshot_paths, gcp_project_name, gcp_bucket_name, gcp_key_file

In [4]:
# pandas options
pd.set_option('display.max_colwidth', None)

## Local or Cloud?

Decide here whether to run the notebook with local data or GCP bucket data:
 - If the working directory of this notebook has a "../data/" folder with data loaded (e.g. working on a local computer, or data files have been copied to a cloud VM), use the "local files" option and comment out the "gcp bucket files" option.
 - If this notebook is being run from a GCP VM (preferably in the `us-central1` location), use the "gcp bucket files" option and comment out the "local files" option.

In [5]:
# option: local files
local_or_cloud: str = "local"   # comment/uncomment this line or next

# option: gcp bucket files
#local_or_cloud: str = "cloud"   # comment/uncomment this line or previous

# don't comment/uncomment for remainder of cell
if (local_or_cloud == "local"):
    data_paths = local_data_paths
    snapshot_paths = local_snapshot_paths
elif (local_or_cloud == "cloud"):
    data_paths = gcp_data_paths
    snapshot_paths = gcp_snapshot_paths
else:
    raise ValueError("Variable 'local_or_cloud' can only take on one of two values, 'local' or 'cloud'.")
    # subsequent cells will not do this final "else" check

In [6]:
# this cell only needs to run its code if local_or_cloud=="cloud"
#   (though it is harmless if run when local_or_cloud=="local")
gcp_storage_client: Optional[storage.Client] = None
gcp_bucket: Optional[storage.Bucket] = None

if (local_or_cloud == "cloud"):
    gcp_storage_client = tur.get_gcp_storage_client(project_name=gcp_project_name, key_file=gcp_key_file)
    gcp_bucket = tur.get_gcp_bucket(storage_client=gcp_storage_client, bucket_name=gcp_bucket_name)

In [7]:
# debug
# from huggingface tutorial: https://huggingface.co/docs/transformers/perf_train_gpu_one#efficient-training-on-a-single-gpu
def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()

# 2 - Load Dataset

This notebook starts from the full dataset (with engineered features) produced by notebook **`06_feature_engineering.ipynb`**.

In [8]:
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "06data_full_final_en2.parquet.gz"
parq_path = Path(snapshot_paths['parq_snapshot'], parq_filename)

if (local_or_cloud == "local"):
    df_full = pd.read_parquet(parq_path, engine='pyarrow')
elif (local_or_cloud == "cloud"):
    df_full = tur.get_gcp_object_from_parq_as_df(bucket=gcp_bucket, object_name=parq_path)

In [9]:
df_full.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3596578 entries, 0 to 3596577
Data columns (total 54 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   external_author_id             string 
 1   author                         string 
 2   following                      int64  
 3   followers                      int64  
 4   updates                        int64  
 5   is_retweet                     int64  
 6   tweet_id                       string 
 7   has_url                        int64  
 8   emoji_count                    int64  
 9   following_ratio                float64
 10  class_numeric                  int8   
 11  RUS_lett_count                 int64  
 12  emoji_flagUS                   int64  
 13  emoji_police                   int64  
 14  emoji_check                    int64  
 15  emoji_exclamation              int64  
 16  emoji_fist                     int64  
 17  emoji_collision                int64  
 18  em

In [10]:
# # make a smaller slice of full dataset for testing
# n_tweets = 100000

# df_small = df_full.groupby(by='class_numeric').sample(n=(n_tweets//2), random_state=42).reset_index()

# # confirm stratified
# df_small['class_numeric'].value_counts()
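
The balanced-sampling pattern in the commented-out cell above (and reused later when selecting rows for training) can be checked on a toy frame. This is a minimal sketch: `toy_df`, `toy_small`, and the row counts are made up for illustration, and `reset_index(drop=True)` is used here so the old index is not kept as a column.

```python
import pandas as pd

# Toy stand-in for df_full: 6 authentic (0) and 4 troll (1) tweets.
toy_df = pd.DataFrame({
    "tweet_id": [str(i) for i in range(10)],
    "class_numeric": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

n_tweets = 6  # total rows wanted, split evenly across the two classes

# groupby(...).sample(n=k) draws k rows from *each* group,
# so the result has the same number of rows per class.
toy_small = (
    toy_df.groupby(by="class_numeric")
    .sample(n=n_tweets // 2, random_state=42)
    .reset_index(drop=True)
)

class_counts = toy_small["class_numeric"].value_counts().to_dict()
print(class_counts)
```

Because `n` applies per group, the smaller class must contain at least `n_tweets // 2` rows, otherwise pandas raises an error (unless sampling with replacement).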

# 3 - Setup Experimental Parameters

## 3.1 - Define dataclasses

In [10]:
# define dataclasses to store the experiment parameters
@dataclass
class ModelArguments:
    model_name_or_path: str = field()
    config_name:  Optional[str] = field(default=None)
    tokenizer_name: Optional[str] = field(default=None)
    cache_dir: Optional[str] = field(default=None)
    num_labels: int = field(default=2)

@dataclass
class MultimodalDataTrainingArguments:
    #data_path: str = field()
    data_df: pd.DataFrame = field()     # modified to use dataframe

    column_info_path: str = field(default=None)
    column_info: dict = field(default=None)

    categorical_encode_type: str = field(default='ohe')
    numerical_transform_method: str = field(default='yeo_johnson')
    
    task: str = field(default='classification')
    mlp_division: int = field(default=4)
    combine_feat_method: str = field(default='individual_mlps_on_cat_and_numerical_feats_then_concat')
    mlp_dropout: float = field(default=0.1)
    numerical_bn: bool = field(default=True)
    use_simple_classifier: bool = field(default=True)
    mlp_act: str = field(default='relu')
    gating_beta: float = field(default=0.2)

    def __post_init__(self):
        assert self.column_info != self.column_info_path
        if (self.column_info is None and self.column_info_path):
            with open(self.column_info_path, mode='r') as f:
                self.column_info = json.load(f)
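
The `__post_init__` hook above lets the column information arrive either as a dict or as a path to a JSON file. The sketch below shows that behavior in isolation; `ColumnInfoDemo` is a hypothetical, stripped-down stand-in for `MultimodalDataTrainingArguments`, and the JSON content is invented for the example.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ColumnInfoDemo:  # hypothetical, simplified stand-in
    column_info_path: Optional[str] = None
    column_info: Optional[dict] = None

    def __post_init__(self):
        # Read the JSON file only when no dict was passed directly.
        if self.column_info is None and self.column_info_path:
            with open(self.column_info_path, mode="r", encoding="utf-8") as f:
                self.column_info = json.load(f)


# Option 1: point the dataclass at a JSON file.
with tempfile.TemporaryDirectory() as tmp_dir:
    info_file = Path(tmp_dir, "column_info.json")
    info_file.write_text(json.dumps({"label_col": "class_numeric"}), encoding="utf-8")
    from_file = ColumnInfoDemo(column_info_path=str(info_file))

# Option 2: pass the dict directly (what this notebook does).
from_dict = ColumnInfoDemo(column_info={"label_col": "class_numeric"})

print(from_file.column_info == from_dict.column_info)
```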

## 3.2 - Select Data / Model / Training args

In [11]:
# model choices available
multimodal_model_classes = {
    # key = (huggingface) model name
    # val = multimodal-transformers ModelWithTabular class (each a subclass of their respective ModelForSequenceClassification)
    'bert-base-uncased': BertWithTabular,
    'distilbert-base-uncased': DistilBertWithTabular,
    'roberta-base': RobertaWithTabular,
    'vinai/bertweet-base': RobertaWithTabular,
    'Twitter/twhin-bert-base': BertWithTabular, 
}

In [12]:
# Step 1 - Select the fine_tuned model
fine_tuned_models_folder = '../data/models/'
selected_model_folder_name = 'roberta-base-200k'  # no error checking implemented so type carefully
base_hf_model_name = 'roberta-base'  # select from keys of `multimodal_model_classes` above

model_path = Path(fine_tuned_models_folder, selected_model_folder_name)
model_class = multimodal_model_classes[base_hf_model_name]

# Step 2 - (mostly automatic) Choose a descriptive run name for this training/model.
#   Suggested format: multi-modal__[fine-tuned-model-name]__[YYYY-MM-DD]
stub = selected_model_folder_name # 'distilbert-base-uncased-50k'
date = pd.Timestamp.now().strftime(format=r"%Y-%m-%d")
run_name = f"multi-modal__{stub}__{date}"

# Step 3 - (mostly automatic) Choose a folder name for where to store the output of the model
#   Will be created as a subfolder of `../data/models/multimodal`
output_dir_name = f"multi-modal__{stub}"

# Step 4 - Choose which columns from dataframe will be used.
label_col = 'class_numeric'
text_cols = ['cleaned_tweet']
categorical_cols = [
    'is_retweet',
    'has_url',
    'emoji_flagUS',
    'emoji_police',
    'emoji_check',
    'emoji_exclamation',
    'emoji_fist',
    'emoji_collision',
    'emoji_prohibited',
    'emoji_loudcryface',
    'emoji_smilinghearteye',
    'emoji_fire',
    'emoji_redheart',
    'emoji_tearsjoy',
    'emoji_thumbsup',
    'emoji_claphands',
    'emoji_blowingkiss',
    'emoji_partypop',
    'emoji_raisehands',
    'region_United_States',
    'region_Unknown',
    'region_New_York_NY',
    'region_United_Kingdom',
    'region_Los_Angeles_CA',
    'region_Boston_MA',
    'region_London',
    'region_New_York_and_the_World',
    'region_New_York_City',
    'region_Pale_Blue_Dot',
    'region_Atlanta_GA',
    'region_Australia',
    'region_Global',
    'region_Washington_DC',
    'region_All_Other',
    'multi_authors',
]
numericical_cols = [
    'following', 
    'followers',
    'updates',
    'emoji_count',
    'following_ratio',
    'RUS_lett_count',
    'num_dashes',
    'num_commas',
    'num_hashs',
    'num_URLs',
    'median_word_length',
    'sentiment',
    'emoji_sentiment',
    'tweet_length',
]

# don't edit this
column_info_dict = {
    'text_cols': text_cols,
    'num_cols': numericical_cols,
    'cat_cols': categorical_cols,
    'label_col': label_col,
    'label_list': ['Authentic', 'Troll']
}

# Step 5 - specify dataset's dataframe and how many rows from dataset to feed
#   also stratify by `label_col`
n_rows = 1000000
dataset_dataframe = df_full.groupby(by=label_col).sample(n=(n_rows//2), random_state=42).reset_index()
#dataset_dataframe = df_full

# Step 6 - Set model arguments
model_args = ModelArguments(
    model_name_or_path=model_path,
    tokenizer_name=base_hf_model_name,
    config_name=base_hf_model_name,
    num_labels=2,
)

# Step 7 - Set data arguments
data_args = MultimodalDataTrainingArguments(
    data_df=dataset_dataframe,
    combine_feat_method='weighted_feature_sum_on_transformer_cat_and_numerical_feats',
    column_info=column_info_dict,
    task='classification',
    categorical_encode_type=None,
    numerical_transform_method="yeo_johnson",
    #use_simple_classifier=False,
)

# Step 8 - Set training arguments
multimodal_parent_output_dir = '../data/models/multimodal/'

training_args = TrainingArguments(
    ## file args
    run_name=run_name,
    output_dir=Path(multimodal_parent_output_dir, output_dir_name),
    logging_dir=Path(multimodal_parent_output_dir, output_dir_name, "runs"),
    overwrite_output_dir=True,
    save_strategy='epoch',
    save_total_limit=1,
    ## training hyperparams
    num_train_epochs=3,
    per_device_train_batch_size=160,
    per_device_eval_batch_size=160,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,   # comment out this line for DistilBERT
    weight_decay=0.01,
    seed=42,
    ## eval/log strategies
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    log_level='warning',
    disable_tqdm=False,
)

set_seed(training_args.seed)

## 3.2 - Setup Tokenizer

In [13]:
# use tokenizer associated with our chosen model type
#   - let AutoTokenizer pick the class for now but confirm the expected one gets picked
#       - BERTbase      -> BertTokenizerFast
#       - DistilBERT    -> DistilBertTokenizerFast
#       - RoBERTa       -> RobertaTokenizerFast
#       - BERTweet      -> BertweetTokenizer
#       - TwHIN-BERT    -> XLMRobertaTokenizerFast
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name)
print(f"Model class specified:\t'{model_args.tokenizer_name}'")
print(f"Tokenizer class chosen:\t'{tokenizer.__class__.__name__}'")

Model class specified:	'roberta-base'
Tokenizer class chosen:	'RobertaTokenizerFast'


## 3.3 - Training / Validation / Test Split

### 3.3.1 - Filter out any tweets that were used for fine tuning

Filtering matters most for the validation and test datasets: tweets already seen during fine-tuning would leak into evaluation and inflate the reported metrics.

In [14]:
# check for a file `fine_tune_tweet_ids.json` within fine-tuned model directory
fine_tuned_tweets_filename = 'fine_tune_tweet_ids.json'
fine_tuned_tweets_file = Path(model_args.model_name_or_path, fine_tuned_tweets_filename)

if (not fine_tuned_tweets_file.exists()):
    print("Heads up: no fine-tuned tweet file found at", fine_tuned_tweets_file)
else:
    # load the list of tweets already used for fine-tuning
    with fine_tuned_tweets_file.open(encoding='utf-8') as f:
        tweet_id_list = json.load(f)

    # filter the dataframe
    filtered_df = data_args.data_df.loc[~data_args.data_df['tweet_id'].isin(tweet_id_list)]

    # calculate delta
    delta_tweets = data_args.data_df.shape[0] - filtered_df.shape[0]
    print(f"Number of tweets filtered out:{delta_tweets:>9,}")

    # re-assign the filtered dataframe
    data_args.data_df = filtered_df

Number of tweets filtered out:   68,177
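
The filter above is a negated membership test. A minimal sketch on toy data (all IDs below are invented) shows what it keeps and how the "filtered out" count is derived:

```python
import pandas as pd

# Toy stand-in for data_args.data_df; tweet_id is stored as a string.
toy_df = pd.DataFrame({
    "tweet_id": ["101", "102", "103", "104", "105"],
    "class_numeric": [0, 1, 0, 1, 1],
})

# IDs already used for fine-tuning ("999" does not occur in the frame).
already_used_ids = ["102", "105", "999"]

# isin() marks rows whose ID is in the list; ~ inverts the mask to keep the rest.
keep_mask = ~toy_df["tweet_id"].isin(already_used_ids)
filtered_toy_df = toy_df.loc[keep_mask]

n_removed = len(toy_df) - len(filtered_toy_df)
print(f"Number of tweets filtered out:{n_removed:>9,}")
```

The IDs in the list need the same type as the column. If the JSON file held integers while `tweet_id` is a string column, `isin` would match nothing and the filter would silently keep every row.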


### 3.3.2 - Split (and Tokenize)

The tokenizer defined above is invoked within `load_train_val_test_helper`.

In [15]:
# aiming for a 70% / 15% / 15% split
train_df, test_and_val_df = train_test_split(data_args.data_df, train_size=0.7, random_state=42, shuffle=True)
test_df, val_df = train_test_split(test_and_val_df, train_size=0.5, shuffle=False)

# using this helper function (potentially not intended for public interface) because it allows
#   us to bring in our dataframe directly (rather than needing to import from CSV)
train_dataset, val_dataset, test_dataset = load_train_val_test_helper(
    train_df=train_df,
    val_df=val_df,
    test_df=test_df,
    text_cols=data_args.column_info['text_cols'],
    tokenizer=tokenizer,
    label_col=data_args.column_info['label_col'],
    label_list=data_args.column_info['label_list'],
    categorical_cols=data_args.column_info['cat_cols'],
    numerical_cols=data_args.column_info['num_cols'],
    sep_text_token_str=tokenizer.sep_token,
    categorical_encode_type=data_args.categorical_encode_type,
    numerical_transformer_method=data_args.numerical_transform_method
)

print(f"Training dataset:  {len(train_dataset):>12,} samples")
print(f"Validation dataset:{len(val_dataset):>12,} samples")
print(f"Testing dataset:   {len(test_dataset):>12,} samples")

Training dataset:       652,276 samples
Validation dataset:     139,774 samples
Testing dataset:        139,773 samples
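
The two-stage split can be sanity-checked on a small frame. This sketch uses 100 invented tweet IDs and the same `train_test_split` arguments as the cell above, then confirms the 70/15/15 sizes and that no ID lands in more than one split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy_df = pd.DataFrame({"tweet_id": [str(i) for i in range(100)]})

# Stage 1: shuffle and carve off 70% for training.
train_toy, rest_toy = train_test_split(toy_df, train_size=0.7, random_state=42, shuffle=True)

# Stage 2: cut the remainder in half; it is already shuffled, so shuffle=False is fine.
test_toy, val_toy = train_test_split(rest_toy, train_size=0.5, shuffle=False)

split_sizes = (len(train_toy), len(val_toy), len(test_toy))

# Every ID should appear in exactly one split.
id_sets = [set(part["tweet_id"]) for part in (train_toy, val_toy, test_toy)]
no_overlap = len(set.union(*id_sets)) == len(toy_df)

print(split_sizes, no_overlap)
```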


In [16]:
# check output
print(">> Dataset:\n", train_dataset[0], end="\n\n")
print(">> Origin DF:\n", train_df.iloc[0][text_cols + categorical_cols + numericical_cols + ["class_numeric"]], end="\n\n")

print(f">> Unique labels: {np.unique(train_dataset.labels)}")

>> Dataset:
 {'input_ids': tensor([    0, 11329,  2923,  7782, 40702,   192,   114,    52,    64,  1955,
           66,    99,   189,    33,  1102, 10683,   201,    10, 18695,   217,
            5,  1421,  2733,   341,     2,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,  
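
The long run of `1` values in `input_ids` is padding. The sketch below copies the first 30 IDs from the printed sample and separates real tokens from padding. It assumes the usual `roberta-base` special-token IDs (`<s>` = 0, `<pad>` = 1, `</s>` = 2); in the notebook these can be read from `tokenizer.bos_token_id`, `tokenizer.pad_token_id`, and `tokenizer.eos_token_id`.

```python
# First 30 token IDs of the sample printed above.
input_ids = [0, 11329, 2923, 7782, 40702, 192, 114, 52, 64, 1955,
             66, 99, 189, 33, 1102, 10683, 201, 10, 18695, 217,
             5, 1421, 2733, 341, 2, 1, 1, 1, 1, 1]

# Assumption: special-token IDs of the roberta-base vocabulary.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2

# An attention mask is 1 for real tokens and 0 for padding.
attention_mask = [0 if token_id == PAD_ID else 1 for token_id in input_ids]
n_real_tokens = sum(attention_mask)

print(f"{n_real_tokens} real tokens, {len(input_ids) - n_real_tokens} padding tokens shown")
```

So this tweet occupies 25 positions (including the start and end markers), and the rest of the fixed-length sequence is padding.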

## 3.4 - Setup Model

In [17]:
# help out with setup
pprint(model_args)

ModelArguments(model_name_or_path=WindowsPath('../data/models/roberta-base-200k'),
               config_name='roberta-base',
               tokenizer_name='roberta-base',
               cache_dir=None,
               num_labels=2)


In [18]:
# make a 🤗 transformer config
config = AutoConfig.from_pretrained(model_args.config_name)

# setup the multimodal-transformers-specific TabularConfig
tabular_config = TabularConfig(
    num_labels=model_args.num_labels,                           # default is 2
    cat_feat_dim=train_dataset.cat_feats.shape[1],              # number of cat feature columns
    numerical_feat_dim=train_dataset.numerical_feats.shape[1],  # number of num feature columns
    **vars(data_args)                                           # dump in everything else as kwarg
)

# add TabularConfig to 🤗 transformer config
config.tabular_config = tabular_config

# define which class we're going to use for model
#model = DistilBertWithTabular.from_pretrained(
model = model_class.from_pretrained(
    pretrained_model_name_or_path=model_args.model_name_or_path,
    config=config,
)

Some weights of RobertaWithTabular were not initialized from the model checkpoint at ..\data\models\roberta-base-200k and are newly initialized: ['tabular_combiner.layer_norm.weight', 'tabular_combiner.num_bn.num_batches_tracked', 'tabular_combiner.num_bn.running_var', 'tabular_combiner.num_layer.bias', 'tabular_combiner.cat_layer.weight', 'tabular_combiner.weight_num', 'tabular_classifier.bias', 'tabular_combiner.weight_cat', 'tabular_combiner.num_bn.running_mean', 'tabular_combiner.num_layer.weight', 'tabular_combiner.num_bn.bias', 'tabular_classifier.weight', 'tabular_combiner.num_bn.weight', 'tabular_combiner.layer_norm.bias', 'tabular_combiner.cat_layer.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 3.5 - Setup Evaluation Metrics

In [19]:
def compute_binary_classification_metrics(eval_pred: EvalPrediction) -> dict[str, float]:
    logits_obj, labels = eval_pred

    if (isinstance(logits_obj, tuple)):
        # not quite sure what is causing a tuple to be returned here but this line of code cost 3+ hours of sadness
        logits=logits_obj[0]
    else:
        logits=logits_obj

    # for each row of eval_pred, pick the larger column and return its column index (0 or 1)
    #   - result looks like: [0, 1, 0, 0, 1, ...]
    predictions_as_labels = np.argmax(logits, axis=1)   # axis=1 gives max column value in each row

    # for each row of eval_pred, apply softmax to ~normalize as scores summing to 1, 
    #   and grab the probability of class=1 (i.e. second column of each row)
    #   - result looks like: [0.993, 0.003, 0.234, ...]
    predictions_as_probs = softmax(logits, axis=1)[:,1]

    # binary classification metrics
    #   label-based
    metric_accuracy = accuracy_score(y_true=labels, y_pred=predictions_as_labels)
    metric_f1score = f1_score(y_true=labels, y_pred=predictions_as_labels, zero_division=0)
    metric_precision = precision_score(y_true=labels, y_pred=predictions_as_labels, zero_division=0)
    metric_recall = recall_score(y_true=labels, y_pred=predictions_as_labels, zero_division=0)

    #   probability-based
    metric_roc_auc = roc_auc_score(y_true=labels, y_score=predictions_as_probs)
    metric_brier = brier_score_loss(y_true=labels, y_prob=predictions_as_probs)

    #   confusion matrix
    tn, fp, fn, tp = confusion_matrix(labels, predictions_as_labels, labels=[0, 1]).ravel()

    # package up the results
    result = {
        'accuracy': metric_accuracy,
        'f1score': metric_f1score,
        'precision': metric_precision,
        'recall': metric_recall,
        'roc_auc': metric_roc_auc,
        'brier_score': metric_brier,
        'tn': tn.item(),
        'fp': fp.item(),
        'fn': fn.item(),
        'tp': tp.item()
    }

    return result
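
To make the logits handling concrete, here is a small sketch of the same argmax/softmax steps on four invented logit rows. The values and labels are illustrative only; the library calls are the ones already used in the function above.

```python
import numpy as np
from scipy.special import softmax
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Invented logits: one row per tweet, columns = [authentic, troll].
toy_logits = np.array([[ 2.0, -1.0],
                       [-0.5,  1.5],
                       [ 0.2,  0.1],
                       [-2.0,  3.0]])
toy_labels = np.array([0, 1, 1, 1])

# Hard labels: index of the larger logit in each row.
pred_labels = np.argmax(toy_logits, axis=1)

# Soft scores: softmax per row, keep the probability of class 1 (troll).
troll_probs = softmax(toy_logits, axis=1)[:, 1]

toy_accuracy = accuracy_score(toy_labels, pred_labels)
toy_roc_auc = roc_auc_score(toy_labels, troll_probs)
tn, fp, fn, tp = confusion_matrix(toy_labels, pred_labels, labels=[0, 1]).ravel()

print(pred_labels, np.round(troll_probs, 3), toy_accuracy, toy_roc_auc)
```

The third row is a troll tweet whose troll probability falls just under 0.5, so it counts as a false negative for the label-based metrics. It still ranks above the only authentic tweet, which is why ROC AUC stays at 1.0 while accuracy drops to 0.75.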

## 3.6 - Setup / Run Trainer

In [20]:
time_training_start = pd.Timestamp.now()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_binary_classification_metrics
)

result = trainer.train()

time_training_stop = pd.Timestamp.now()
time_training = time_training_stop - time_training_start
print("\nTraining duration:", str(time_training), end="\n\n")

# debug
print_summary(result)



{'loss': 0.1132, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}


{'eval_loss': 0.01712989993393421, 'eval_accuracy': 0.9983258689026572, 'eval_f1score': 0.9983473175692855, 'eval_precision': 0.9979667047909518, 'eval_recall': 0.9987282207808724, 'eval_roc_auc': 0.9999635338431926, 'eval_brier_score': 0.0023339306778908868, 'eval_tn': 68863, 'eval_fp': 144, 'eval_fn': 90, 'eval_tp': 70677, 'eval_runtime': 285.4048, 'eval_samples_per_second': 489.739, 'eval_steps_per_second': 3.062, 'epoch': 1.0}
{'loss': 0.0096, 'learning_rate': 1.6666666666666667e-05, 'epoch': 2.0}


{'eval_loss': 0.006480858661234379, 'eval_accuracy': 0.9991557800449297, 'eval_f1score': 0.9991663134096368, 'eval_precision': 0.9991239597021463, 'eval_recall': 0.9992086707080984, 'eval_roc_auc': 0.999994177132113, 'eval_brier_score': 0.0009177940917554313, 'eval_tn': 68945, 'eval_fp': 62, 'eval_fn': 56, 'eval_tp': 70711, 'eval_runtime': 285.1189, 'eval_samples_per_second': 490.231, 'eval_steps_per_second': 3.065, 'epoch': 2.0}
{'loss': 0.0044, 'learning_rate': 0.0, 'epoch': 3.0}


{'eval_loss': 0.005546794272959232, 'eval_accuracy': 0.9991843976705252, 'eval_f1score': 0.9991946308724833, 'eval_precision': 0.9990817004082901, 'eval_recall': 0.9993075868695861, 'eval_roc_auc': 0.999979264320124, 'eval_brier_score': 0.0008191003349945584, 'eval_tn': 68942, 'eval_fp': 65, 'eval_fn': 49, 'eval_tp': 70718, 'eval_runtime': 285.5344, 'eval_samples_per_second': 489.517, 'eval_steps_per_second': 3.061, 'epoch': 3.0}
{'train_runtime': 34809.6031, 'train_samples_per_second': 56.215, 'train_steps_per_second': 0.088, 'train_loss': 0.04240188717179178, 'epoch': 3.0}

Training duration: 0 days 09:40:10.093864

Time: 34809.60
Samples/second: 56.22
GPU memory occupied: 7861 MB.
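
The step counts and learning-rate values in this log follow from the batch settings. The arithmetic below is a sketch under stated assumptions: a single GPU, the `Trainer` default learning rate of 5e-5, and a linear decay schedule with no warmup (none of these are overridden in `training_args`). The way steps are counted is a reading of `Trainer` behavior, not something stated in the notebook.

```python
import math

train_samples = 652_276      # training set size printed by the split cell
eval_samples = 139_774       # validation set size
per_device_batch = 160       # per_device_train_batch_size / per_device_eval_batch_size
grad_accum_steps = 4         # gradient_accumulation_steps
epochs = 3                   # num_train_epochs
base_lr = 5e-5               # assumption: Trainer default learning rate

# Mini-batches per epoch, then optimizer updates per epoch
# (one update per `grad_accum_steps` mini-batches).
batches_per_epoch = math.ceil(train_samples / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum_steps
total_updates = updates_per_epoch * epochs

# Evaluation runs one forward pass per batch, with no accumulation.
eval_batches = math.ceil(eval_samples / per_device_batch)

# Linear decay to zero: the rate shrinks in proportion to the updates already done.
lr_after_epoch_1 = base_lr * (1 - updates_per_epoch / total_updates)

print(total_updates, eval_batches, lr_after_epoch_1)
```

These values line up with the run: 3,057 total optimizer steps (the number in the `checkpoint-3057` directory name), 874 evaluation batches, and a logged learning rate of about 3.33e-05 after the first epoch. The effective training batch size is 160 × 4 = 640 tweets per update.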


## 3.7 - Save trained model

In [21]:
trainer.save_model()    # defaults to self.args.output_dir

In [22]:
# optional - delete checkpoint directories
checkpoint_dirs = [
    f"{trainer.args.output_dir}/{directory}"
    for directory in os.listdir(trainer.args.output_dir)
        if (
            os.path.isdir(os.path.join(trainer.args.output_dir, directory))
            and
            directory.startswith('checkpoint')
        )
]

for checkpoint_dir in checkpoint_dirs:
    print(f"Attempting to delete '{checkpoint_dir}' ...", end='')
    shutil.rmtree(checkpoint_dir)
    print(" success")

Attempting to delete '..\data\models\multimodal\multi-modal__roberta-base-200k/checkpoint-3057' ... success


## 3.8 - Generate Model Predictions and Evaluate

The `Trainer.predict()` method also calculates performance metrics, so a single pass through the test dataset covers both tasks.

### 3.8.1 - Calculate Predictions with Test Dataset

The predictions are then saved as a JSON file.

In [23]:
# make predictions
predictions_output = trainer.predict(test_dataset, metric_key_prefix='test')

# predictions output is a subclass of NamedTuple, so to save to JSON we convert to a dict first
#   Note: according to python docs, the leading underscore is to avoid name conflicts, not as the usual "discouraged from use" meaning
#   Source: https://docs.python.org/3.10/library/collections.html#collections.somenamedtuple._asdict
predictions_dict = predictions_output._asdict()

# process the incoming object depending on its type
if (isinstance(predictions_dict['predictions'], tuple)):
    predictions = predictions_dict['predictions'][0]
else:
    predictions = predictions_dict['predictions']

# overwrite the dict values with vanilla Python types
predictions_dict['run_name'] = trainer.args.run_name
predictions_dict['predictions'] = predictions.tolist()
predictions_dict['label_ids'] = predictions_dict['label_ids'].tolist()

# sort the keys
dict_order = ['run_name', 'metrics', 'predictions', 'label_ids']
predictions_dict = {key: predictions_dict[key] for key in dict_order}

# save predictions to `output_dir`
predictions_file = Path(trainer.args.output_dir, 'predictions.json')
with predictions_file.open(mode='w', encoding='utf-8') as fp:
    json.dump(predictions_dict, fp, indent=4)


### 3.8.2 - Display and Save Metrics

In [24]:
# display
pprint(predictions_dict['metrics'], sort_dicts=False)

# consolidate
metrics_dict = {
    'run_name': trainer.args.run_name,
    'final_test_metrics': predictions_dict['metrics'],
    'trainer_log': trainer.state.log_history
}

# save
metrics_file = Path(trainer.args.output_dir, 'model_metrics.json')
with metrics_file.open(mode='w', encoding='utf-8') as fp:
    json.dump(metrics_dict, fp, indent=4)

{'test_loss': 0.005593133624643087,
 'test_accuracy': 0.9991343106322395,
 'test_f1score': 0.9991441565698361,
 'test_precision': 0.9989957709226178,
 'test_recall': 0.9992925863044708,
 'test_roc_auc': 0.9999797125755825,
 'test_brier_score': 0.0008616273300967795,
 'test_tn': 69022,
 'test_fp': 71,
 'test_fn': 50,
 'test_tp': 70630,
 'test_runtime': 553.8012,
 'test_samples_per_second': 252.388,
 'test_steps_per_second': 1.578}
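
As a cross-check, the label-based test metrics can be recomputed from the four confusion-matrix counts alone. This sketch uses only the `test_tn`, `test_fp`, `test_fn`, and `test_tp` values printed above and the standard metric definitions.

```python
# Confusion-matrix counts from the test metrics above (positive class = troll).
tn, fp, fn, tp = 69_022, 71, 50, 70_630

total = tn + fp + fn + tp                # should equal the test set size
accuracy = (tp + tn) / total             # share of all tweets labelled correctly
precision = tp / (tp + fp)               # of tweets flagged as troll, share that are troll
recall = tp / (tp + fn)                  # of troll tweets, share that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total:,}  accuracy={accuracy:.6f}  precision={precision:.6f}  "
      f"recall={recall:.6f}  f1={f1:.6f}")
```

The total of 139,773 matches the size of the test dataset, and the recomputed values agree with `test_accuracy`, `test_precision`, `test_recall`, and `test_f1score`: 121 misclassified tweets in all, 71 authentic tweets flagged as troll and 50 troll tweets missed.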


### 3.8.3 - All The Args

In [25]:
# consolidate
all_args = {
    'run_name': trainer.args.run_name,
    'column_info_dict': column_info_dict,
    'data_args': {k: v for (k, v) in data_args.__dict__.items() if (k != 'data_df')},
    'model_args': model_args.__dict__,
    'training_args': training_args.__dict__
}

# fix args that don't JSON-serialize
all_args['model_args']['model_name_or_path'] = all_args['model_args']['model_name_or_path'].as_posix()
all_args['training_args']["__cached__setup_devices"] = str(all_args['training_args']["__cached__setup_devices"])

# save
all_args_file = Path(trainer.args.output_dir, 'all_args.json')
with all_args_file.open(mode='w', encoding='utf-8') as fp:
    json.dump(all_args, fp, indent=4)

### 3.8.4 - Tweet IDs used

In [26]:
tweet_ids = {
    'train': train_df['tweet_id'].to_list(),
    'val': val_df['tweet_id'].to_list(),
    'test': test_df['tweet_id'].to_list()
}

tweet_ids_filename = Path(trainer.args.output_dir, 'tweet_ids.json')
with tweet_ids_filename.open(mode='w', encoding='utf-8') as fp:
    json.dump(tweet_ids, fp)

----