## Description

This notebook demonstrates how to train and evaluate a cell type classification (CTC) model on a specific fold, as detailed in Sections 3 and 4 of the paper.

In [2]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import sys
sys.path.insert(0, sys.path[0] + '/..')
import os
import pandas as pd

In [2]:
import torch

torch.cuda.is_available()  # confirm that PyTorch is installed correctly and can see a CUDA device

True
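
The check above returns `True` only when PyTorch can see a CUDA device. As a minimal sketch, the same check can be wrapped to choose a device string with a CPU fallback. The helper name `pick_device` is hypothetical and not part of this repository, and whether the training code accepts a CPU device is not shown in this notebook.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```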

In [3]:
val_fold = 'img_class'  # this fold of papers is used as the validation set
test_fold = 'misc'  # this fold of papers is used as the test set
seed = 42  # random seed
data_dir = '../data'  # directory holding all downloaded training-related files
output_dir = '../_tmp'  # directory for model outputs and predictions
os.makedirs(output_dir, exist_ok=True)
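
How `seed` is consumed inside the training code is not shown in this notebook. As a rough sketch of what seeding usually involves, the hypothetical helper below (`seed_everything` is an illustrative name, not a function from this repository) seeds Python's RNG and, when they are installed, NumPy and PyTorch.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)  # the same seed reproduces the same draw
```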

### Data Preparation

Data files we need to run this notebook:
- `cell_type.jsonl` (download [here](https://github.com/allenai/S2abEL/blob/main/data/release_data.tar.gz))
- `cell_type_additional_data.jsonl` (contains context spans and region for each cell, download [here](https://github.com/allenai/S2abEL/blob/main/data/train_data.tar.gz))
- `papers_with_text.pkl` (run [text extraction notebook](text_extraction.ipynb) to produce this)
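
Before loading anything, it can help to confirm that the three files listed above are in place. The following is a small sketch under the assumption that they sit directly in `../data`; `missing_files` is a hypothetical helper, not part of this repository.

```python
import os

REQUIRED_FILES = [
    "cell_type.jsonl",
    "cell_type_additional_data.jsonl",
    "papers_with_text.pkl",
]


def missing_files(data_dir: str, names=REQUIRED_FILES) -> list:
    """Return the required file names that are not present in data_dir."""
    return [n for n in names if not os.path.isfile(os.path.join(data_dir, n))]


missing = missing_files("../data")
if missing:
    print("Missing from ../data:", ", ".join(missing))
else:
    print("All data files found.")
```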

In [5]:
cell_types = pd.read_json(os.path.join(data_dir, 'cell_type.jsonl'), lines=True)
cell_types_additional_data = pd.read_json(os.path.join(data_dir, 'cell_type_additional_data.jsonl'), lines=True)
cell_data = cell_types.merge(cell_types_additional_data, on='cell_id')

papers_with_text = pd.read_pickle(os.path.join(data_dir, 'papers_with_text.pkl'))

In [6]:
from common_utils.common_data_processing_utils import assemble_ctc_data

ctc = assemble_ctc_data(cell_data, papers_with_text)
display(ctc.head())

ctc.to_pickle(os.path.join(data_dir, 'CTC.pkl'))

| | cell_id | cell_type | region_type | row_pos | col_pos | reverse_row_pos | reverse_col_pos | fold | cell_content | has_reference | row_context | col_context | cell_reference | context_sentences | labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1606.02270v2/table_02.csv/0/1 | dataset | 1 | 0 | 1 | 9 | 3 | qa | CBT-NE | 0 | [unused11] [unused10] CBT-NE [unused10] CBT-NE... | CBT-NE [unused10] valid [unused10] - [unused10... | | [unused7] Figure 2: An abridged example from C... | 1 |
| 1 | 1606.02270v2/table_02.csv/0/2 | dataset | 1 | 0 | 2 | 9 | 2 | qa | CBT-NE | 0 | [unused11] [unused10] CBT-NE [unused10] CBT-NE... | CBT-NE [unused10] test [unused10] 81.6 [unused... | | [unused7] Figure 2: An abridged example from C... | 1 |
| 2 | 1606.02270v2/table_02.csv/0/3 | dataset | 1 | 0 | 3 | 9 | 1 | qa | CBT-CN | 0 | [unused11] [unused10] CBT-NE [unused10] CBT-NE... | CBT-CN [unused10] valid [unused10] - [unused10... | | As has been done previously, we train separate... | 1 |
| 3 | 1606.02270v2/table_02.csv/0/4 | dataset | 1 | 0 | 4 | 9 | 0 | qa | CBT-CN | 0 | [unused11] [unused10] CBT-NE [unused10] CBT-NE... | CBT-CN [unused10] test [unused10] 81.6 [unused... | | As has been done previously, we train separate... | 1 |
| 4 | 1606.02270v2/table_02.csv/1/0 | other | 0 | 1 | 0 | 8 | 4 | qa | Model | 0 | Model [unused10] valid [unused10] test [unused... | [unused11] [unused10] Model [unused10] Humans ... | | Our model achieves state-of-the-art results on... | 0 |
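
Each row carries a `fold` value (`qa` in the rows above), while `val_fold` and `test_fold` name the folds held out for validation and testing. The sketch below illustrates that idea on toy rows, assuming every remaining fold is used for training. The actual split logic lives in the training code and is not shown here; `split_by_fold` and the toy rows are hypothetical.

```python
def split_by_fold(rows, val_fold, test_fold):
    """Partition rows into train/valid/test according to their "fold" value."""
    splits = {"train": [], "valid": [], "test": []}
    for row in rows:
        if row["fold"] == val_fold:
            splits["valid"].append(row)
        elif row["fold"] == test_fold:
            splits["test"].append(row)
        else:
            splits["train"].append(row)
    return splits


# Toy rows for illustration only; real rows come from CTC.pkl.
toy_rows = [
    {"cell_id": "a", "fold": "qa"},
    {"cell_id": "b", "fold": "img_class"},
    {"cell_id": "c", "fold": "misc"},
    {"cell_id": "d", "fold": "qa"},
]
splits = split_by_fold(toy_rows, val_fold="img_class", test_fold="misc")
print({name: len(part) for name, part in splits.items()})
```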


### Train a CTC model for a specific fold

The `Config` object defines the training and evaluation settings.

In [7]:
from common_utils.common_ML_utils import Config
from common_utils.common_data_processing_utils import cell_rep_features

config = Config(
    seed=seed,
    BS=32,  # training batch size
    lr=2e-5,  # learning rate
    num_MLP_layers=1,  # number of MLP layers after the transformer encoder
    num_epochs=2,  # number of training epochs
    num_labels=5,  # number of label classes (5 in this task)
    input_cols=cell_rep_features,  # features used in the cell representation (Sec. 4 of the paper)
    test_fold=test_fold,
    valid_fold=val_fold,
    pretrained="allenai/scibert_scivocab_uncased",  # the underlying pretrained model
    augment=True,  # whether to augment the minority classes
    use_labels=True,  # supervised training
    drop_duplicates=True,  # whether to drop duplicates in the training set
    eval_steps=300,
    input_file=f'{data_dir}/CTC.pkl',
    eval_BS=512,  # evaluation batch size
    name=f'CTC_{val_fold}_{test_fold}_{seed}',  # file name of the saved model
    save_dir=output_dir,  # directory where the best model from training is saved
)

In [8]:
from CTC.trainers import train_CTC_notebook

# Train a model and save it to save_dir/name
train_CTC_notebook(config)

Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<<< Augmenting data for CTC!
0    35960
2    23536
3    21571
1    16298
4     6181
Name: labels, dtype: int64
Dropping duplicates!
103546
86649
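
The numbers in this log can be cross-checked with a few lines of arithmetic. The sketch below reads the two values printed after "Dropping duplicates!" as the row counts before and after deduplication, which is an interpretation of the log rather than something the notebook states.

```python
# Class counts copied from the augmentation log above (label -> rows).
class_counts = {0: 35960, 2: 23536, 3: 21571, 1: 16298, 4: 6181}
rows_before = 103546  # first number printed after "Dropping duplicates!"
rows_after = 86649    # second number printed after "Dropping duplicates!"

total = sum(class_counts.values())
dropped = rows_before - rows_after
print(total == rows_before)            # per-class counts add up to the first number
print(dropped)                         # rows removed as duplicates
print(f"{dropped / rows_before:.1%}")  # share of rows removed
```

Under that reading, roughly one row in six is dropped as a duplicate before training.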


Map:   0%|          | 0/86649 [00:00<?, ? examples/s]

Map:   0%|          | 0/2681 [00:00<?, ? examples/s]

Map:   0%|          | 0/1585 [00:00<?, ? examples/s]

cuda


100%|█████████████████████████████████████████████████████████████████████████| 5416/5416 [43:55<00:00,  2.05it/s]
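
The 5416 iterations in the progress bar are consistent with the settings above: 86649 training examples, a batch size of 32, and 2 epochs. The sketch below reproduces that count, assuming the last partial batch of each epoch is kept and no gradient accumulation is used; `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Number of training iterations when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 86649 training examples, BS = 32, num_epochs = 2 (values from the log and Config)
print(total_training_steps(86649, 32, 2))  # 5416
```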


### Evaluate a CTC model for a specific fold

In [5]:
from CTC.experiments import CTCEvalExpNB

exp = CTCEvalExpNB(model_path=f'{output_dir}/CTC_{val_fold}_{test_fold}_{seed}')
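
The `model_path` passed here has to match the `save_dir` and `name` used in the training `Config`. One way to keep the two in sync is to build the path in a single place; the helper below is a hypothetical sketch, not part of this repository.

```python
import os


def ctc_model_path(output_dir: str, val_fold: str, test_fold: str, seed: int) -> str:
    """Build the model path using the naming pattern from the training Config."""
    return os.path.join(output_dir, f"CTC_{val_fold}_{test_fold}_{seed}")


path = ctc_model_path("../_tmp", "img_class", "misc", 42)
print(path)
```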

Compute the F1 scores on the validation and test sets.
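
For readers unfamiliar with multi-class F1, the sketch below computes a macro-averaged F1 on toy labels in plain Python. The averaging mode used by `compute_cr` is not stated in this notebook, so macro averaging is an assumption here, and `macro_f1` is a hypothetical helper rather than the repository's implementation.

```python
def macro_f1(y_true, y_pred, num_labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, num_labels=3), 4))
```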

In [9]:
val_f1, val_cr = exp.compute_cr('valid')
test_f1, test_cr = exp.compute_cr('test')

exp.generate_preds(save_path=f'{output_dir}/{val_fold}_{test_fold}_{seed}_CTC_preds')

print(f"valid f1: {val_f1}, test f1: {test_f1}")

Map:   0%|          | 0/2681 [00:00<?, ? examples/s]

100%|███████████████████████████████████████████████████████████████████████████████| 6/6 [00:13<00:00,  2.25s/it]


Map:   0%|          | 0/1585 [00:00<?, ? examples/s]

100%|███████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.99s/it]


Map:   0%|          | 0/4266 [00:00<?, ? examples/s]

100%|███████████████████████████████████████████████████████████████████████████████| 9/9 [00:21<00:00,  2.39s/it]

valid f1: 0.9709063768386841, test f1: 0.9772870540618896
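
The batch counts in the evaluation progress bars (6, 4, and 9) are consistent with an evaluation batch size of 512 applied to 2681 validation examples, 1585 test examples, and 4266 examples in the third pass. That 4266 equals 2681 + 1585, which suggests the third pass covers both splits, though the notebook does not say so explicitly. A short sketch of the arithmetic, with `num_batches` as a hypothetical helper:

```python
import math

EVAL_BS = 512  # eval_BS from the training Config


def num_batches(num_examples: int, batch_size: int = EVAL_BS) -> int:
    """Number of batches when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


# Example counts taken from the "Map" progress bars in the output above.
for name, n in [("valid", 2681), ("test", 1585), ("valid + test", 2681 + 1585)]:
    print(name, n, num_batches(n))
```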



