#### The Data
The test data is in 'paper_dev.jsonl'.

- Generate random sentences for the NEI class
- Generate predicted pages for the other classes

In [1]:
import importlib
from mda.src.dataset.DatasetGenerator import DatasetGenerator

In [3]:
ds_generators = DatasetGenerator(dataset_root='/local/fever-common/', 
                                 out_dir='working/data/out/dev/', 
                                 database_path='/local/fever-common/data/fever/fever.db')

#### Generate a dataset with predicted pages / documents per claim.

A single routine generates the closest docs for both the NEI class and the other classes.

If a sample has 'n' evidences available in the dataset, we generate 'n' predictions for that sample.

If a sample has fewer than 5 evidences available, i.e. when n < 5, we still generate 5 predictions for it.

We remove the predicted line ids from the samples and replace them with tags such as -1 and -2, which indicate whether the class was NEI or not.
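The prediction-count rule described above can be sketched as follows. This is only an illustration: `num_predictions` is a hypothetical helper, not part of DatasetGenerator, and it assumes the minimum of 5 corresponds to the `k` argument.

```python
def num_predictions(n_evidences: int, k: int = 5) -> int:
    """Return how many predictions to generate for one sample.

    Hypothetical helper: assumes the floor of 5 predictions is the `k` argument.
    """
    # A sample with n evidences gets n predictions, but never fewer than k.
    return max(n_evidences, k)


for n in (1, 5, 8):
    print(f"n={n} -> {num_predictions(n)} predictions")
```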

In [3]:
ds_generators.generate_page_predictions(split='paper_dev', k=5)

Saving prepared dataset to paper_dev.ns.pages.p5.jsonl


100%|██████████| 9999/9999 [08:40<00:00, 19.22it/s]


In [6]:
!head -1 working/data/out/dev/paper_dev_pipeline.ns.pages.p5.jsonl

{"id": 91198, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Colin Kaepernick became a starting quarterback during the 49ers 63rd season in the National Football League.", "evidence": [[[108548, null, "Colin_Kaepernick", -1], [-1, null, "Colin_Kaepernick", -2], [-1, null, "Pistol_offense", -2], [-1, null, "2016_San_Francisco_49ers_season", -2], [-1, null, "2014_San_Francisco_49ers_season", -2]]]}


#### Generate a dataset with the predicted sentences

Re-initialize the generator with init_ranker=False. Skipping the ranker saves memory.

The input to this step is the file produced by the page predictions step. Ensure this file has been generated and is available in the *out_dir*:

- paper_dev_pipeline.ns.pages.p5.jsonl
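Before starting the long-running sentence step, it may help to confirm that the input file is in place. A minimal sketch, assuming the same *out_dir* as in the cells of this notebook; `require_input` is a hypothetical helper, not part of DatasetGenerator.

```python
from pathlib import Path


def require_input(out_dir: str, filename: str) -> Path:
    """Return the path to an expected input file, or raise if it is missing."""
    path = Path(out_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected input file not found: {path}")
    return path


if __name__ == "__main__":
    try:
        # Paths taken from the cells in this notebook.
        print(require_input("working/data/out/dev/",
                            "paper_dev_pipeline.ns.pages.p5.jsonl"))
    except FileNotFoundError as err:
        print(err)
```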

In [2]:
ds_generators = DatasetGenerator(dataset_root='/local/fever-common/', 
                                 out_dir='working/data/out/dev/', 
                                 database_path='/local/fever-common/data/fever/fever.db', init_ranker=False)

In [3]:
ds_generators.generate_sentence_predictions(split='paper_dev', k=5)

  0%|          | 0/9999 [00:00<?, ?it/s]

Saving prepared dataset to paper_dev_pipeline.ps.pages.p5.jsonl


  0%|          | 24/9999 [01:59<13:48:12,  4.98s/it]


KeyboardInterrupt: 

In [5]:
!head -1 working/data/out/dev/paper_dev_pipeline.ps.pages.p5.jsonl

{"id": 91198, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Colin Kaepernick became a starting quarterback during the 49ers 63rd season in the National Football League.", "evidence": [[[108548, null, "Colin_Kaepernick", -1, [["During the 2013 season , his first full season as a starter , Kaepernick helped the 49ers reach the NFC Championship , losing to the Seattle Seahawks ."], []]], [-1, null, "Colin_Kaepernick", -2, [["He remained the team 's starting quarterback for the rest of the season and went on to lead the 49ers to their first Super Bowl appearance since 1994 , losing to the Baltimore Ravens .", "Colin Rand Kaepernick -LRB- -LSB- ` k\u00e6p\u0259rn\u026ak -RSB- ; born November 3 , 1987 -RRB- is an American football quarterback who is currently a free agent .", "In 2016 , Kaepernick gained national attention when he began protesting by not standing while the United States national anthem was being performed before the start of games , motivated by what 