## Description

In this notebook, we demonstrate how to train and evaluate ED-related models on a specific fold, as detailed in Sections 3 and 4 of the paper.

In [2]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import sys
sys.path.insert(0, sys.path[0] + '/..')

import json
import os
import pandas as pd

In [3]:
val_fold = 'img_class'  # this fold of papers is used as the validation set
test_fold = 'misc'  # this fold of papers is used as the test set
seed = 42  # random seed
data_dir = '../data/intermediate_data'  # directory holding all downloaded training-related files
output_dir = '../_tmp'  # directory for model outputs and predictions
os.makedirs(output_dir, exist_ok=True)

In [None]:
import torch

torch.cuda.is_available()  # make sure torch is installed correctly and can see a GPU

### Generate end-to-end EL data

We did not upload end-to-end training and test data for EL. Instead, we generate them here, because they depend on the outputs of the earlier ASR and DR steps.
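
Because the cells below read files written by earlier stages, it can help to confirm that those files exist before starting. The following is a minimal sketch and not part of the original notebook: `missing_inputs` is a hypothetical helper, and the paths listed are the ones the later cells load.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the entries of `paths` that do not exist on disk (hypothetical helper)."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# Settings mirror the configuration cell above.
val_fold, test_fold, seed = 'img_class', 'misc', 42
data_dir, output_dir = '../data/intermediate_data', '../_tmp'

required = [
    f'{data_dir}/methods.json',
    f'{data_dir}/datasets.json',
    f'{data_dir}/CTC.pkl',
    f'{data_dir}/EL_bm25f_ent_can.pkl',
    f'{output_dir}/{val_fold}_{test_fold}_{seed}_CER_preds',
]

missing = missing_inputs(required)
if missing:
    print('Missing inputs:', *missing, sep='\n  ')
```

If anything is reported as missing, the corresponding upstream notebook or download step probably needs to be run first.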

#### Build PwC entity vocab

In [4]:
with open(f'{data_dir}/methods.json') as f:
    methods = json.load(f)

with open(f'{data_dir}/datasets.json') as f:
    datasets = json.load(f)


ents = methods + datasets
ent_map = {}

# For each PwC entity, keep its name, full name, description, and PwC URL.
for m in ents:
    # A missing key and an explicit None are both normalised to ''.
    name = m.get('name') or ''
    full_name = m.get('full_name') or ''
    description = m.get('description') or ''
    ent_map[m['url']] = (name, full_name, description, m['url'])
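
To make the layout of `ent_map` concrete, here is a small illustration on made-up records. The entity names and `example.org` URLs are placeholders, not real PwC entries; only the key names and the `(name, full_name, description, url)` tuple order come from the cell above.

```python
# Toy records with the same keys as the PwC JSON entries (values are made up).
toy_ents = [
    {'name': 'ResNet', 'full_name': 'Residual Network', 'description': None,
     'url': 'https://example.org/method/resnet'},
    {'name': 'MNIST', 'full_name': None, 'description': 'Handwritten digits.',
     'url': 'https://example.org/dataset/mnist'},
]

toy_map = {}
for e in toy_ents:
    # `e.get(key) or ''` maps both a missing key and an explicit None to ''.
    fields = tuple(e.get(k) or '' for k in ('name', 'full_name', 'description'))
    toy_map[e['url']] = fields + (e['url'],)

print(toy_map['https://example.org/dataset/mnist'])
# ('MNIST', '', 'Handwritten digits.', 'https://example.org/dataset/mnist')
```

Keying the map by URL means a candidate URL produced by a retrieval step can be turned into its textual fields with a single dictionary lookup.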

#### Generate data for EL model

In [6]:
from ED.utils import convert_EL_cans_to_ML_data, replace_EL_non_train_folds


top_n_to_rank = 50
CTC = pd.read_pickle(f'{data_dir}/CTC.pkl')
CER_output = pd.read_pickle(f'{output_dir}/{val_fold}_{test_fold}_{seed}_CER_preds')

# Generate test data, where candidate entities come from the DR + ASR predictions.
test_data = CER_output[CER_output.fold == test_fold].merge(CTC[['ext_id', 'row_context', 'col_context', 'row_id', 'col_id', 'reverse_row_id', 'reverse_col_id', 'region_type', 'text_sentence_no_mask']], on='ext_id', how='inner')
test_data = convert_EL_cans_to_ML_data(test_data, ent_map, 'candidates_100', 'candidates_100', add_GT=False, top_n=top_n_to_rank)
test_data.to_pickle(f'{output_dir}/test_{val_fold}_{test_fold}_{seed}_ED_e2e')


# Generate training data, where candidate entities come from the BM25F outputs.
# Also generate validation data, where candidate entities come from the DR + ASR predictions plus the ground truth.
EL_mixed_cans = pd.read_pickle(f'{data_dir}/EL_bm25f_ent_can.pkl')
cans = EL_mixed_cans.merge(CTC[['ext_id', 'fold', 'cell_type', 'cell_content', 'row_context', 'col_context', 'row_id', 'col_id', 'reverse_row_id', 'reverse_col_id', 'region_type', 'text_sentence_no_mask']], on='ext_id', how='inner')
train_val_data = replace_EL_non_train_folds(cans, CER_output[CER_output.fold==val_fold], test_fold=test_fold)
train_val_data = convert_EL_cans_to_ML_data(train_val_data, ent_map, 'candidates_100', 'candidates_100', add_GT=True, top_n=top_n_to_rank)
train_val_data.to_pickle(f'{output_dir}/train_val_{val_fold}_{test_fold}_{seed}_ED_e2e')
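
Both merges above use `how='inner'`, so any `ext_id` that is absent from `CTC` is silently dropped. The toy example below (made-up values, not the real data) shows that behaviour and one simple way to list what was lost.

```python
import pandas as pd

# Toy stand-ins for the candidate table and CTC (values are made up).
cans = pd.DataFrame({'ext_id': ['a', 'b', 'c'],
                     'candidates_100': [['e1'], ['e2'], ['e3']]})
ctc = pd.DataFrame({'ext_id': ['a', 'b', 'd'],
                    'fold': ['misc', 'img_class', 'misc']})

# An inner join keeps only the ext_ids present in both frames.
merged = cans.merge(ctc, on='ext_id', how='inner')
dropped = sorted(set(cans['ext_id']) - set(merged['ext_id']))

print(len(cans), '->', len(merged), 'rows; dropped ext_ids:', dropped)
# 3 -> 2 rows; dropped ext_ids: ['c']
```

Running the same set difference on the real frames is a cheap sanity check if the generated data ends up smaller than expected.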

### Train an end-to-end ED model

In [4]:
from ED.experiments import EDExp

# Set up the configuration for the e2e EL model, a cross-encoder architecture.
exp = EDExp(seed=seed,
        BS=32,
        epoch=2,
        lr=2e-5,
        eval_BS=512,
        test_fold=test_fold,
        val_fold=val_fold,
        train_file_path=f'{output_dir}/train_val_{val_fold}_{test_fold}_{seed}_ED_e2e',  # Training and validation data generated in the above cell
        test_file_path=f'{output_dir}/test_{val_fold}_{test_fold}_{seed}_ED_e2e',  # Test data generated in the above cell
        ctc_pred_file_path=f'{output_dir}/{val_fold}_{test_fold}_{seed}_CTC_preds',  # predictions generated by the CTC model
        save_dir=output_dir,
        eval_steps=300,
        name=f"ED_e2e_{val_fold}_{test_fold}_{seed}",
        mode='all')  # consider both inKB and outKB mentions

exp.train()

Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Dropping duplicates!
328913
328913


Map:   0%|          | 0/328913 [00:00<?, ? examples/s]

Map:   0%|          | 0/29852 [00:00<?, ? examples/s]

Training length 328913
Validation length 29852


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20558/20558 [5:32:19<00:00,  1.03it/s]
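
The progress-bar total of 20558 can be reproduced from the settings above. This is a back-of-the-envelope check that assumes one optimizer step per batch, no gradient accumulation, a single device, and that the final partial batch is kept; the training code itself is not shown here, so treat these as assumptions. The reading of `eval_steps` as "validate every 300 steps" is likewise an assumption.

```python
import math

train_examples = 328913   # "Training length" printed above
batch_size = 32           # BS
epochs = 2                # epoch
eval_steps = 300          # eval_steps

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # 10279 20558

# If validation runs every `eval_steps` steps (assumed), this is how often it fires.
print(total_steps // eval_steps)      # 68
```

At the reported 1.03 it/s, 20558 steps take roughly five and a half hours, in line with the 5:32:19 shown in the progress bar.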


### Evaluate an end-to-end ED model
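
The evaluation cell below passes `threshold=0.5` and `inKB_acc_at_topks=[1]`. A common way to use such a threshold in entity linking is to predict outKB whenever the best candidate's score falls below it, and to count an inKB mention as correct at top k when its gold entity is among the k best-ranked candidates. The sketch below illustrates that idea only; `link_mention` and `inkb_acc_at_k` are hypothetical names, and we have not confirmed that `EDExp.test` applies exactly this rule.

```python
def link_mention(scored_candidates, threshold=0.5):
    """Hypothetical decision rule: return the best entity URL, or None for outKB."""
    if not scored_candidates:
        return None
    url, score = max(scored_candidates, key=lambda pair: pair[1])
    return url if score >= threshold else None


def inkb_acc_at_k(ranked_urls_per_mention, gold_urls, k=1):
    """Fraction of inKB mentions whose gold URL is among the top-k ranked URLs."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_urls_per_mention, gold_urls))
    return hits / len(gold_urls)


print(link_mention([('url_a', 0.91), ('url_b', 0.40)]))   # url_a
print(link_mention([('url_a', 0.30), ('url_b', 0.12)]))   # None
print(inkb_acc_at_k([['url_a', 'url_b'], ['url_c', 'url_d']],
                    ['url_a', 'url_d'], k=1))              # 0.5
```

Under this kind of rule, raising the threshold would tend to trade inKB recall for outKB recall, which is why the threshold is reported alongside the metrics.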

In [5]:
# ed_preds = exp.compute_preds_on_GT_CTC()

# Test the e2e EL model with an inKB threshold of 0.5, and measure accuracy at top 1.
perf = exp.test(threshold=0.5, inKB_acc_at_topks=[1])
display(perf)

<<< fold = misc


Map:   0%|          | 0/15350 [00:00<?, ? examples/s]

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 307/307 [00:00<00:00, 606.70it/s]


| | fold | e2e outKB prec | e2e outKB recall | e2e outKB f1 | outKB support | inKB support | e2e gloabl acc | threshold | e2e inKB acc@1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | misc | 0.776824 | 0.804444 | 0.790393 | 225 | 82 | 0.641694 | 0.5 | 0.195122 |
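
The reported numbers can be cross-checked with a little arithmetic. The F1 line uses the standard harmonic-mean definition. The per-class correct counts are not part of the output; they are inferred here from the reported rates and supports, and the reading of the global accuracy as (correct outKB + correct inKB at top 1) / all mentions is an assumption that happens to match the table.

```python
precision, recall = 0.776824, 0.804444
out_support, in_support = 225, 82
global_acc, inkb_acc_at_1 = 0.641694, 0.195122

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))   # approximately 0.790393, as in the table

# Counts inferred from the reported rates (an inference, not reported output).
out_correct = round(recall * out_support)        # 181
in_correct = round(inkb_acc_at_1 * in_support)   # 16
total = out_support + in_support                 # 307

# Assumed definition of global accuracy; compare with the reported 0.641694.
print(out_correct, in_correct, round((out_correct + in_correct) / total, 6))
```

The total of 307 mentions also matches the 307 iterations in the test progress bar, and 307 × 50 candidates gives the 15350 examples shown in the `Map` output.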
