In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from datasets import load_dataset
from tqdm import tqdm

from qasper.dataset_reader import QasperReader
from qasper.models import qasper, gpt35
from qasper.utils import print_wrap

In [3]:
dataset = load_dataset("allenai/qasper", trust_remote_code=True)

Generating train split: 100%|██████████| 888/888 [00:00<00:00, 2034.19 examples/s]
Generating validation split: 100%|██████████| 281/281 [00:00<00:00, 1568.52 examples/s]
Generating test split: 100%|██████████| 416/416 [00:00<00:00, 1751.79 examples/s]


In [4]:
dataset['validation']

Dataset({
    features: ['id', 'title', 'abstract', 'full_text', 'qas', 'figures_and_tables'],
    num_rows: 281
})

In [5]:
dataset['validation'][0]

{'id': '1912.01214',
 'title': 'Cross-lingual Pre-training Based Transfer for Zero-shot Neural Machine Translation',
 'abstract': 'Transfer learning between different language pairs has shown its effectiveness for Neural Machine Translation (NMT) in low-resource scenario. However, existing transfer methods involving a common target language are far from success in the extreme scenario of zero-shot translation, due to the language space mismatch problem between transferor (the parent model) and transferee (the child model) on the source side. To address this challenge, we propose an effective transfer learning approach based on cross-lingual pre-training. Our key idea is to make all source languages share the same feature space and thus enable a smooth transition for zero-shot translation. To this end, we introduce one monolingual pre-training method and two bilingual pre-training methods to obtain a universal encoder for different languages. Once the universal encoder is constructed, t

In [6]:
reader = QasperReader()

In [7]:
def instance_generator(split):
    for article in split:
        for instance in reader._article_to_instances(article):
            yield instance

In [8]:
# randomly sample 100 instances
import random
random.seed(42)
instances = list(instance_generator(dataset['validation']))
instances = random.sample(instances, 100)

In [9]:
reader._stats

defaultdict(int,
            {'number of documents': 281,
             'number of questions': 1005,
             'number of answers': 3015,
             'questions with multiple answers': 1005,
             'extractive questions': 962,
             'extractive questions with multiple spans': 406,
             'multiple_evidence_spans_count': 536,
             'answers with table or figure as evidence': 212,
             'freeform answers': 431,
             'yes/no questions': 208,
             'answers with no evidence': 212,
             'unanswerable questions': 163,
             'number of truncated contexts': 15})

In [10]:
len(instances)

100

In [11]:
instance = instances[0]
print(instance.keys())
print('QUESTION WITH CONTEXT:')
# print_wrap(instance['s_question_with_context'])
print(instance['s_question_with_context'])

dict_keys(['question_with_context', 's_question_with_context', 'paragraph_indices', 'global_attention_mask', 'evidence', 'answer', 'metadata'])
QUESTION WITH CONTEXT:
Did they experiment with this new dataset?
Introduction
How humans process language has become increasingly relevant in natural language processing since physiological data during language understanding is more accessible and recorded with less effort. In this work, we focus on eye-tracking and electroencephalography (EEG) recordings to capture the reading process. On one hand, eye movement data provides millisecond-accurate records about where humans look when they are reading, and is highly correlated with the cognitive load associated with different stages of text processing. On the other hand, EEG records electrical brain activity across the scalp and is a direct measure of physiological processes, including language processing. The combination of both measurement methods enables us to study the language understanding

In [12]:
print_wrap(' '.join(instance['question_with_context']))

<s> Did Ġthey Ġexperiment Ġwith Ġthis Ġnew Ġdataset ? </s> Introduction </s> How
Ġhumans Ġprocess Ġlanguage Ġhas Ġbecome Ġincreasingly Ġrelevant Ġin Ġnatural
Ġlanguage Ġprocessing Ġsince Ġphysiological Ġdata Ġduring Ġlanguage
Ġunderstanding Ġis Ġmore Ġaccessible Ġand Ġrecorded Ġwith Ġless Ġeffort . ĠIn
Ġthis Ġwork , Ġwe Ġfocus Ġon Ġeye - tracking Ġand Ġelectro ence phal ography Ġ(
EE G ) Ġrecordings Ġto Ġcapture Ġthe Ġreading Ġprocess . ĠOn Ġone Ġhand , Ġeye
Ġmovement Ġdata Ġprovides Ġmillisec ond - acc urate Ġrecords Ġabout Ġwhere
Ġhumans Ġlook Ġwhen Ġthey Ġare Ġreading , Ġand Ġis Ġhighly Ġcorrelated Ġwith
Ġthe Ġcognitive Ġload Ġassociated Ġwith Ġdifferent Ġstages Ġof Ġtext Ġprocessing
. ĠOn Ġthe Ġother Ġhand , ĠEEG Ġrecords Ġelectrical Ġbrain Ġactivity Ġacross
Ġthe Ġscalp Ġand Ġis Ġa Ġdirect Ġmeasure Ġof Ġphysiological Ġprocesses ,
Ġincluding Ġlanguage Ġprocessing . ĠThe Ġcombination Ġof Ġboth Ġmeasurement
Ġmethods Ġenables Ġus Ġto Ġstudy Ġthe Ġlanguage Ġunderstanding Ġprocess Ġin Ġa

In [13]:
print('QUESTION:', instance['metadata']['question'])
print('ANSWER:', instance['answer'])

QUESTION: Did they experiment with this new dataset?
ANSWER: No


In [14]:
instance['metadata']['article_id']

'1912.00903'

## Evaluate F1 score

In [15]:
# `evaluate` is redefined in a later cell, so it is not imported here.
from qasper.evaluator import token_f1_score, get_answers_and_evidence

In [16]:
qasper_answer = qasper.predict(instance)[0]
print(qasper_answer)

Input ids are automatically padded from 4697 to 5120 to be a multiple of `config.attention_window`: 1024


, namely one word, i.e. we are examining the effects of attention delay on


In [17]:
token_f1_score(qasper_answer, instance['answer'])

0
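
The score of 0 follows from how token-level F1 works: the prediction and the gold answer ("No") share no tokens. The block below is a minimal sketch of a SQuAD-style token F1, assuming lowercasing, punctuation and article removal, and whitespace tokenization. `normalize` and `sketch_token_f1` are hypothetical names used for illustration; the actual `qasper.evaluator.token_f1_score` may differ in its details.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sketch_token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (illustrative helper)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(sketch_token_f1("No", "No"))
print(sketch_token_f1("Yes, the researchers experimented with the new dataset", "No"))
```

Under these assumptions an exact match scores 1.0, while a fluent but wrong answer with no shared tokens scores 0.0.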

In [18]:
gpt35_answer = gpt35.predict(instance)
print_wrap(gpt35_answer)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Yes, the researchers experimented with the new dataset, Zurich Cognitive
Language Processing Corpus (ZuCo) 2.0. This corpus includes simultaneous eye
movement and brain activity recordings during natural reading and task-specific
reading during annotation. The dataset consists of physiological data from 18
subjects reading 739 English sentences from Wikipedia. The researchers conducted
a detailed technical validation of the data to ensure the quality of the
recordings.


The corpus construction involved recording data from 19 participants, with data
from one participant discarded due to technical issues. The participants read
sentences selected from the Wikipedia corpus, and the dataset included both
normal reading and task-specific reading paradigms. The experimental design
included participants reading sentences at their own speed, using a control pad
to move to the next sentence and answer control questions.


Data acquisition took place in a controlled environment, with eye-trackin

In [19]:
token_f1_score(gpt35_answer, instance['answer'])

0

In [20]:
def evaluate(gold, predicted):
    """Score predictions against gold answers, keeping the best F1 per question.

    Evidence scoring is disabled because the predictions carry no evidence.
    """
    max_answer_f1s = []
    max_evidence_f1s = []
    max_answer_f1s_by_type = {
        "extractive": [],
        "abstractive": [],
        "boolean": [],
        "none": [],
    }
    num_missing_predictions = 0
    for question_id in gold:
        if question_id not in predicted:
            # A missing prediction counts as zero F1.
            num_missing_predictions += 1
            max_answer_f1s.append(0.0)
            max_evidence_f1s.append(0.0)
            continue
        answer_f1s_and_types = [
            (token_f1_score(predicted[question_id]["answer"], reference["answer"]),
             reference["type"])
            for reference in gold[question_id]
        ]
        # Keep the best-scoring reference; ties resolve to the first one listed.
        max_answer_f1, answer_type = max(answer_f1s_and_types, key=lambda x: x[0])
        max_answer_f1s.append(max_answer_f1)
        max_answer_f1s_by_type[answer_type].append(max_answer_f1)
        # evidence_f1s = [
        #     paragraph_f1_score(predicted[question_id]["evidence"], reference["evidence"])
        #     for reference in gold[question_id]
        # ]
        # max_evidence_f1s.append(max(evidence_f1s))

    def mean(values):
        # An empty list (no questions of that type in the sample) averages to 0.0.
        return sum(values) / len(values) if values else 0.0

    return {
        "Answer F1": mean(max_answer_f1s),
        "Answer F1 by type": {key: mean(value) for key, value in max_answer_f1s_by_type.items()},
        # "Evidence F1": mean(max_evidence_f1s),
        "Missing predictions": num_missing_predictions
    }
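
The input shapes that `evaluate` expects can be illustrated on toy data: `gold` maps a question id to a list of reference dicts with `answer` and `type`, and `predicted` maps a question id to a dict with `answer`. The sketch below is an assumption-laden illustration, not the notebook's scorer. It swaps in a hypothetical `exact_match` function for `token_f1_score`, and the question ids and answers are made up.

```python
def exact_match(prediction, reference):
    # Stand-in scorer for illustration only; the notebook uses token_f1_score.
    return float(prediction.strip().lower() == reference.strip().lower())


def best_reference_score(prediction, references, scorer=exact_match):
    """Return (score, answer type) for the best-matching reference."""
    scored = [(scorer(prediction, ref["answer"]), ref["type"]) for ref in references]
    return max(scored, key=lambda pair: pair[0])


# Made-up data in the same shape as the gold and predicted dictionaries.
toy_gold = {
    "q1": [
        {"answer": "Yes", "type": "boolean"},
        {"answer": "No", "type": "boolean"},
    ],
    "q2": [{"answer": "toy corpus A", "type": "extractive"}],
}
toy_predicted = {"q1": {"answer": "no"}, "q2": {"answer": "toy corpus B"}}

for qid, references in toy_gold.items():
    print(qid, best_reference_score(toy_predicted[qid]["answer"], references))
```

Taking the maximum over references means a prediction is credited if it agrees with any one annotator, and the answer type of that best reference decides which per-type bucket the score lands in.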

In [21]:
# gold_data = json.load(open(args.gold))
gold_answers_and_evidence = get_answers_and_evidence(dataset['validation'])

In [22]:
gold_answers_and_evidence.keys()

dict_keys(['b6f15fb6279b82e34a5bf4828b7b5ddabfdf1d54', 'f5e6f43454332e0521a778db0b769481e23e7682', '9a05a5f4351db75da371f7ac12eb0b03607c4b87', '5eda469a8a77f028d0c5f1acd296111085614537', '18c5d366b1da8447b5404eab71f4cc658ba12e6f', 'b5e4866f0685299f1d7af267bbcc4afe2aab806f', '1f085b9bb7bfd0d6c8cba1a9d73f08fcf2da7590', 'b6ae8e10c6a0d34c834f18f66ab730b670fb528c', 'a87a009c242d57c51fc94fe312af5e02070f898b', 'ef4dba073d24042f24886580ae77add5326f2130', '2df4a045a9cd7b44874340b6fdf9308d3c55327a', 'a313e98994fc039a82aa2447c411dda92c65a470', '37861be6aecd9242c4fdccdfcd06e48f3f1f8f81', '7e62a53823aba08bc26b2812db016f5ce6159565', '9eabb54c2408dac24f00f92cf1061258c7ea2e1a', '3d013f15796ae7fed5272183a166c45f16e24e39', '9ee07edc371e014df686ced4fb0c3a7b9ce3d5dc', 'd3aa0449708cc861a51551b128d73e11d62207d2', 'cfbec1ef032ac968560a7c76dec70faf1269b27c', 'c0e341c4d2253eb42c8840381b082aae274eddad', '1ec152119cf756b16191b236c85522afeed11f59', '891c2001d6baaaf0da4e65b647402acac621a7d2', '66c96c297c2cffdf5013

In [23]:
gold_answers_and_evidence['b6f15fb6279b82e34a5bf4828b7b5ddabfdf1d54']

[{'answer': 'BIBREF19, BIBREF20',
  'evidence': ['Table TABREF19 and TABREF26 report zero-shot results on Europarl and Multi-UN evaluation sets, respectively. We compare our approaches with related approaches of pivoting, multilingual NMT (MNMT) BIBREF19, and cross-lingual transfer without pretraining BIBREF16. The results show that our approaches consistently outperform other approaches across languages and datasets, especially surpass pivoting, which is a strong baseline in the zero-shot scenario that multilingual NMT systems often fail to beat BIBREF19, BIBREF20, BIBREF23. Pivoting translates source to pivot then to target in two steps, causing inefficient translation process. Our approaches use one encoder-decoder model to translate between any zero-shot directions, which is more efficient than pivoting. Regarding the comparison between transfer approaches, our cross-lingual pretraining based transfer outperforms transfer method that does not use pretraining by a large margin.'],
 

In [None]:
predicted_answers_and_evidence = {}

In [25]:
for instance in tqdm(instances):
    question_id = instance["metadata"]["question_id"]

    if question_id in predicted_answers_and_evidence: # keep this to conserve API requests
        continue

    # prediction_data = json.loads(line)
    # pred_answer = qasper.predict(instance)[0]
    pred_answer = gpt35.predict(instance)

    predicted_answers_and_evidence[question_id] = {
        "answer": pred_answer,
        # "evidence": prediction_data["predicted_evidence"]
    }


100%|██████████| 100/100 [07:07<00:00,  4.27s/it]


In [None]:
# save predictions
import json
import os

os.makedirs('output', exist_ok=True)  # make sure the target directory exists
with open('output/gpt35-predictions.json', 'w') as f:
    json.dump(predicted_answers_and_evidence, f)
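
Since each prediction costs an API request, it may help to reload a saved file before rerunning the prediction loop. The following is a minimal sketch under that assumption. `load_predictions` is a hypothetical helper, and the demo writes to a temporary directory with a made-up question id rather than to the notebook's `output/` folder.

```python
import json
import os
import tempfile


def load_predictions(path):
    """Return saved predictions, or an empty dict if the file does not exist."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


# Demo in a temporary directory so nothing in the working tree is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "predictions.json")
    first = load_predictions(demo_path)  # file not written yet
    with open(demo_path, "w") as f:
        json.dump({"toy-question-id": {"answer": "No"}}, f)
    second = load_predictions(demo_path)  # file now exists

print(first, second)
```

Seeding `predicted_answers_and_evidence` with the loaded dictionary would let the `if question_id in predicted_answers_and_evidence` check in the loop skip questions that already have an answer.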

In [28]:
evaluation_output = evaluate(
    {k: v for k, v in gold_answers_and_evidence.items()
     if k in predicted_answers_and_evidence},
    predicted_answers_and_evidence)

In [29]:
len(gold_answers_and_evidence)

1005

In [30]:
evaluation_output

{'Answer F1': 0.06489175253739486,
 'Answer F1 by type': {'extractive': 0.08545044887734807,
  'abstractive': 0.08790371215230747,
  'boolean': 0.0013445793337097684,
  'none': 0.0},
 'Missing predictions': 0}
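
One plausible reason for the near-zero boolean score is answer length: token F1 penalizes long predictions when the reference is a single token such as "Yes" or "No". The sketch below works through the arithmetic, assuming the prediction contains the one reference token exactly once. `f1_from_counts` is a hypothetical helper, not part of the `qasper` package.

```python
def f1_from_counts(overlap, prediction_length, reference_length):
    """Token F1 computed directly from overlap and length counts."""
    if overlap == 0:
        return 0.0
    precision = overlap / prediction_length
    recall = overlap / reference_length
    return 2 * precision * recall / (precision + recall)


# A one-token reference matched once by predictions of growing length.
for prediction_length in (1, 10, 100):
    print(prediction_length, round(f1_from_counts(1, prediction_length, 1), 4))
```

Even a correct "Yes" or "No" buried in a 100-token answer scores roughly 0.02 under this metric, and a verbose answer that never states the gold token scores 0.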

In [None]:
# print(json.dumps(evaluation_output, indent=2))