## Description

In this notebook, we demonstrate how to train and evaluate a CTC model on a specific fold, as detailed in Sections 3 and 4 of the paper.

In [2]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import sys
sys.path.insert(0, sys.path[0] + '/..')
import os

In [3]:
import torch

torch.cuda.is_available()  # make sure torch is installed correctly and a CUDA device is visible

True

In [4]:
val_fold = 'img_class'  # this fold of papers is used as the validation set
test_fold = 'misc'  # this fold of papers is used as the test set
seed = 42  # random seed
data_dir = '../data/intermediate_data'  # the directory holding all downloaded training-related files
output_dir = '../_tmp'  # the directory to store model outputs and predictions
os.makedirs(output_dir, exist_ok=True)
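
The two fold names determine the split: papers in `val_fold` form the validation set and papers in `test_fold` form the test set. Presumably the remaining papers are used for training. The snippet below is a minimal sketch of that kind of split, not the repository's implementation. The `fold` field, the `split_by_fold` helper, and the fold name `other_fold` are hypothetical names chosen for illustration.

```python
def split_by_fold(records, val_fold, test_fold):
    """Partition records into train/valid/test lists by their 'fold' value.

    Assumption: every record is a dict with a 'fold' key (hypothetical schema).
    """
    splits = {"train": [], "valid": [], "test": []}
    for rec in records:
        if rec["fold"] == val_fold:
            splits["valid"].append(rec)
        elif rec["fold"] == test_fold:
            splits["test"].append(rec)
        else:
            # every fold that is neither validation nor test goes to training
            splits["train"].append(rec)
    return splits


# Toy records; 'other_fold' is a made-up fold name.
toy_records = [
    {"id": 0, "fold": "img_class"},
    {"id": 1, "fold": "misc"},
    {"id": 2, "fold": "other_fold"},
    {"id": 3, "fold": "other_fold"},
]
toy_splits = split_by_fold(toy_records, val_fold="img_class", test_fold="misc")
print({name: len(rows) for name, rows in toy_splits.items()})
```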

### Train a CTC model for a specific fold

The `Config` object collects the settings used for training and evaluation.

In [5]:
from common_utils.common_ML_utils import Config

config = Config(
        seed = seed,
        BS = 32,  # training batch size
        lr = 2e-5,
        num_MLP_layers = 1,  # number of MLP layers after the transformer encoder
        num_epochs = 2,  # the number of training epochs
        num_labels = 5,  # number of label classes (5 in our problem)
        input_cols = ['region_type', 'row_id', 'reverse_row_id', 
                    'col_id', 'reverse_col_id', 'has_reference', 
                    'cell_content', 'row_context', 'col_context', 
                    'text_sentence_no_mask'],  
                    # features used in the cell representation in paper Sec 4
        test_fold = test_fold,
        valid_fold = val_fold,
        pretrained = "allenai/scibert_scivocab_uncased",  # the underlying pretrained model
        augment = True,  # whether or not to augment the minority classes
        use_labels = True,  # supervised training
        drop_duplicates = True,  # whether or not to drop duplicates in the training set
        eval_steps = 300,  
        input_file = f'{data_dir}/CTC.pkl',
        eval_BS = 512,  # evaluation batch size
        name = f'CTC_{val_fold}_{test_fold}_{seed}',  # file name of the saved model
        save_dir = output_dir  # the dir to save the best model in training
    )
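
The config sets `eval_steps = 300` and a `save_dir` for "the best model in training". One plausible reading is that the trainer evaluates on the validation fold every 300 steps and keeps the best-scoring checkpoint. The sketch below illustrates that pattern under this assumption. It is not the trainer's code, and the helper names, the total step count, and the scores are made up.

```python
def eval_points(total_steps, eval_steps):
    """Return the 1-based training steps at which a periodic evaluation runs."""
    return list(range(eval_steps, total_steps + 1, eval_steps))


def best_checkpoint(val_scores):
    """Return the (step, score) pair with the highest validation score."""
    return max(val_scores.items(), key=lambda item: item[1])


# Hypothetical run: 1000 optimizer steps, evaluation every 300 steps.
points = eval_points(total_steps=1000, eval_steps=300)

# Made-up validation scores, one per evaluation point.
toy_scores = dict(zip(points, [0.80, 0.91, 0.88]))
best_step, best_score = best_checkpoint(toy_scores)
print(points, best_step, best_score)
```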

In [6]:
from CTC.trainers import train_CTC_notebook

# train a model and save it to save_dir/name
train_CTC_notebook(config)

Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<<< Augmenting data for CTC!
0    35960
2    23536
3    21571
1    16298
4     6181
Name: labels, dtype: int64
Dropping duplicates!
103546
86649
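
The five label counts printed above sum to 103546, which matches the first number printed after "Dropping duplicates!". This suggests the counts describe the augmented training set before deduplication, and that 86649 rows remain afterwards. The short sketch below redoes this arithmetic from the logged numbers. The interpretation of the two row counts is an inference from the log, not something the output states.

```python
# Label counts copied from the log above (label id -> number of rows).
label_counts = {0: 35960, 2: 23536, 3: 21571, 1: 16298, 4: 6181}

# Row counts printed around "Dropping duplicates!" (assumed before/after).
rows_before, rows_after = 103546, 86649

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}
removed = rows_before - rows_after

for label in sorted(shares):
    print(f"label {label}: {shares[label]:.1%} of rows")
print(f"rows removed as duplicates: {removed} ({removed / rows_before:.1%})")
```

Even after augmentation, the smallest class (label 4) remains well below the others, so the class distribution is still imbalanced.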


Map:   0%|          | 0/86649 [00:00<?, ? examples/s]

Map:   0%|          | 0/2681 [00:00<?, ? examples/s]

Map:   0%|          | 0/1585 [00:00<?, ? examples/s]

cuda


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5416/5416 [43:31<00:00,  2.07it/s]
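
The 5416 iterations in the progress bar are consistent with the config: 86649 deduplicated training rows at batch size 32 give 2708 batches per epoch, and two epochs give 5416 steps. The sketch below reproduces this with ceiling division. It assumes the last partial batch is kept and that there is no gradient accumulation, neither of which the log confirms. The `num_batches` helper is a name introduced here.

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


# Training: 86649 rows, BS = 32, num_epochs = 2 (values from the config and log).
train_steps = num_batches(86649, 32) * 2

# Evaluation: the two smaller splits in the log, with eval_BS = 512.
eval_batches = [num_batches(n, 512) for n in (2681, 1585)]

print(train_steps, eval_batches)
```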


### Evaluate a CTC model for a specific fold

In [7]:
from CTC.experiments import CTCEvalExpNB

exp = CTCEvalExpNB(model_path=f'{output_dir}/CTC_{val_fold}_{test_fold}_{seed}')

Compute the validation and test F1 scores, then save the model's predictions.

In [9]:
val_f1, val_cr = exp.compute_cr('valid')
test_f1, test_cr = exp.compute_cr('test')

exp.generate_preds(f'{output_dir}/{val_fold}_{test_fold}_{seed}_CTC_preds')

print(f"valid f1: {val_f1}, test f1: {test_f1}")

Map:   0%|          | 0/2681 [00:00<?, ? examples/s]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:13<00:00,  2.23s/it]


Map:   0%|          | 0/1585 [00:00<?, ? examples/s]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.98s/it]


Map:   0%|          | 0/4266 [00:00<?, ? examples/s]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:21<00:00,  2.36s/it]

valid f1: 0.9638194441795349, test f1: 0.9779179692268372
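
The output does not say how the reported F1 is averaged over the five classes. For reference, the sketch below computes per-class F1 and their unweighted (macro) average in plain Python. It illustrates the metric in general and is not the code behind `compute_cr`, which may use a different averaging scheme. The toy labels are made up.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy example with three classes (made-up labels).
toy_true = [0, 0, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2]
toy_scores = per_class_f1(toy_true, toy_pred)
print(toy_scores, macro_f1(toy_true, toy_pred))
```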



