# A Notebook to Explore Multi-Answer QA Datasets
... and to test tools for interacting with them

Contains a few sections:
1. Explore Natural Questions
2. Explore AmbigQA
3. Explore QAMPARI
4. Load a DPR retriever checkpoint
5. Use BM25 Search

Note that the AmbigQA section:
- uses utils for seeing how many questions have no positive contexts
- investigates how many annotations are conflicting (with examples)

In [38]:
import sh
import json

from multiqa_utils import retrieval_utils as ru

%load_ext autoreload
%autoreload 2

In [2]:
base_path = "/scratch/ddr8143/repos/DPR/downloads/data"
ambigqa_path = f"{base_path}/ambigqa"
ambigqa_light_path = f"{base_path}/ambigqa_light"
nq_path = f"{base_path}/retriever"
qp_path = f"{base_path}/qampari"

## Explore Natural Questions

In [12]:
sh.ls(nq_path)

LICENSE  README  nq-adv-hn-train.json  nq-dev.json  nq-train.json

In [3]:
nq_data = json.load(open(f"{nq_path}/nq-dev.json"))

In [7]:
print("NQ Keys:", nq_data[0].keys())
print()
for k in ["question", "answers"]:
    print(k + ": ", nq_data[0][k])

NQ Keys: dict_keys(['dataset', 'question', 'answers', 'positive_ctxs', 'negative_ctxs', 'hard_negative_ctxs'])

question:  who sings does he love me with reba
answers:  ['Linda Davis']


In [8]:
# Then look at a hard negative context structure
print(nq_data[0]["hard_negative_ctxs"][0])

{'title': "Why Don't You Love Me (Beyoncé song)", 'text': 'song. According to the lyrics of "Why Don\'t You Love Me", Knowles impersonates a woman who questions her love interest about the reason for which he does not value her fabulousness, convincing him she\'s the best thing for him as she sings: "Why don\'t you love me... when I make me so damn easy to love?... I got beauty... I got class... I got style and I got ass...". The singer further tells her love interest that the decision not to choose her is "entirely foolish". Originally released as a pre-order bonus track on the deluxe edition of "I Am...', 'score': 14.678405, 'title_score': 0, 'passage_id': '14525568'}


In [9]:
# And let's look at the structure of a positive context
nq_data[1]["positive_ctxs"][0]

{'title': 'Great Lakes',
 'text': 'Great Lakes The Great Lakes (), also called the Laurentian Great Lakes and the Great Lakes of North America, are a series of interconnected freshwater lakes located primarily in the upper mid-east region of North America, on the Canada–United States border, which connect to the Atlantic Ocean through the Saint Lawrence River. They consist of Lakes Superior, Michigan, Huron, Erie, and Ontario, although hydrologically, there are four lakes, Superior, Erie, Ontario, and Michigan-Huron. The lakes are interconnected by the Great Lakes Waterway. The Great Lakes are the largest group of freshwater lakes on Earth by total area, and second largest',
 'score': 1000,
 'title_score': 1,
 'passage_id': '151960'}
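
Since every record carries the same three context lists, a quick per-record summary can help when scanning the file. The sketch below is only an illustration: `summarize_record` is a hypothetical helper (not part of `multiqa_utils`), and the toy record reuses the key names printed above with made-up values.

```python
def summarize_record(record):
    """Count the answers and the contexts of each kind in one DPR-format record."""
    ctx_fields = ["positive_ctxs", "negative_ctxs", "hard_negative_ctxs"]
    summary = {
        "question": record["question"],
        "num_answers": len(record["answers"]),
    }
    for field in ctx_fields:
        # .get() guards against records that omit one of the context lists
        summary[field] = len(record.get(field, []))
    return summary


# Toy record with the same keys as nq_data[0]; the contexts are placeholders.
toy_record = {
    "dataset": "nq_dev_psgs_w100",
    "question": "who sings does he love me with reba",
    "answers": ["Linda Davis"],
    "positive_ctxs": [{"title": "t", "text": "x", "passage_id": "1"}],
    "negative_ctxs": [],
    "hard_negative_ctxs": [{"title": "t", "text": "y", "passage_id": "2"}],
}
print(summarize_record(toy_record))
```

On the real data this could be called as `summarize_record(nq_data[0])`.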

## Explore AmbigQA

**First there's the light data**

In [13]:
sh.ls(ambigqa_light_path)

dev.json  train.json

In [11]:
abl_data = json.load(open(f"{ambigqa_light_path}/dev.json"))

In [14]:
print("ABL Keys:", abl_data[0].keys())

ABL Keys: dict_keys(['annotations', 'id', 'question'])


In [15]:
for i in range(2):
    print(f"============================= Example {i} ==================================")
    print(json.dumps(abl_data[i], indent=4))
    print()

{
    "annotations": [
        {
            "type": "singleAnswer",
            "answer": [
                "Tony Goldwyn",
                "Goldwyn"
            ]
        }
    ],
    "id": "-807825952267713091",
    "question": "Who plays the doctor in dexter season 1?"
}

{
    "annotations": [
        {
            "type": "singleAnswer",
            "answer": [
                "usually continues uninterrupted until death"
            ]
        },
        {
            "type": "singleAnswer",
            "answer": [
                "constant",
                "usually continues uninterrupted until death"
            ]
        }
    ],
    "id": "8266116451988110240",
    "question": "How often does spermatogeneis\u2014the production of sperm\u2014occur?"
}



**Then there's all the data**

In [16]:
ab_data = json.load(open(f"{ambigqa_path}/dev.json"))

In [17]:
print("AB Keys:", ab_data[0].keys())

AB Keys: dict_keys(['viewed_doc_titles', 'used_queries', 'annotations', 'nq_answer', 'id', 'nq_doc_title', 'question'])


In [23]:
for i in range(4):
    print(i)
    print("     q:  ", ab_data[i]["question"])
    for a in ab_data[i]["annotations"]:
        if "answer" in a:
            print("         SA| t:", a["type"], "a:", a["answer"])
        else:
            print("         MA| t:", [dd["answer"] for dd in a["qaPairs"]])
    
    print("     nqa:", ab_data[i]["nq_answer"])

0
     q:   Who plays the doctor in dexter season 1?
         SA| t: singleAnswer a: ['Tony Goldwyn', 'Goldwyn']
     nqa: ['Tony Goldwyn']
1
     q:   How often does spermatogeneis—the production of sperm—occur?
         SA| t: singleAnswer a: ['usually continues uninterrupted until death']
         SA| t: singleAnswer a: ['constant', 'usually continues uninterrupted until death']
     nqa: ['74 days']
2
     q:   When was the first remote control tv invented?
         SA| t: singleAnswer a: ['1950']
         SA| t: singleAnswer a: ['1950']
     nqa: ['1950']
3
     q:   Why did the st louis cardinals move to arizona?
         SA| t: singleAnswer a: ['mediocrity of the Cardinals,a then-21-year-old stadium,game attendance to dwindle']
         MA| t: [['overall mediocrity of the Cardinals'], ['old stadium'], ['game attendance to dwindle']]
     nqa: ['1988']


**Then, let's look at a result from BM25 to see how many questions have no positive contexts**

In [44]:
ambigqa_bm25_outdir = "/scratch/ddr8143/repos/pyserini/runs"
ambigqa_bm25_pathname_fxn = lambda hits: ru.bm25_out_name(ambigqa_bm25_outdir, "ambigqa_light", "dev", hits)
ru.display_no_positive([100, 400, 1000], ambigqa_bm25_pathname_fxn)

Hits | No Positive Contexts Retrieved
---- | ------------------------------
 100 | 319/2002 (15.93%)
 400 | 221/2002 (11.04%)
1000 | 170/2002 (8.49%)
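
The table comes from `ru.display_no_positive`, whose implementation is not shown in this notebook. As a rough sketch of the kind of count it reports, one could treat a question as having a positive context when any gold answer appears as a case-insensitive substring of any retrieved passage. That matching rule, the `(answers, passages)` input shape, and both function names are assumptions made for illustration.

```python
def has_positive(answers, passages):
    """True if any answer occurs (case-insensitively) in any passage text."""
    lowered = [p.lower() for p in passages]
    return any(a.lower() in p for a in answers for p in lowered)


def count_no_positive(examples):
    """Return (questions with no positive passage, total questions)."""
    missing = sum(1 for answers, passages in examples if not has_positive(answers, passages))
    return missing, len(examples)


# Toy input: one (gold answers, retrieved passage texts) pair per question.
toy_examples = [
    (["Linda Davis"], ["Does He Love You is a duet by Reba and linda davis"]),
    (["1950"], ["A passage that never mentions the year"]),
]
missing, total = count_no_positive(toy_examples)
print(f"{missing}/{total} ({100 * missing / total:.2f}%)")  # 1/2 (50.00%)
```

Substring matching is deliberately loose; a stricter variant would normalize punctuation and match on token boundaries.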


**And let's look at the annotation quality**

In [45]:
train_dataset_path = '/'.join([ambigqa_light_path, "train.json"])
train_dataset = json.load(open(train_dataset_path))

In [49]:
strange_ds = []
for d in train_dataset:
    if len(d["annotations"]) > 1:
        strange_ds.append(d)
print(f"{len(strange_ds)} questions in the training set have more than one annotation")
print()
print("For example:")
strange_ds[2]

149 questions in the training set have more than one annotation

For example:


{'annotations': [{'type': 'singleAnswer', 'answer': ['New Delhi']},
  {'type': 'singleAnswer',
   'answer': ['New Delhi', 'New Delhi, India', 'Delhi']}],
 'id': '3978528412752837293',
 'question': "India's first ever all india institute of ayurveda has come up in which city?"}

In [50]:
# Some more examples that have conflicting annotations:
for d in strange_ds[:10]:
    answers = []
    for anns in d['annotations']:
        if anns['type'] == 'multipleQAs':
            for qap in anns['qaPairs']:
                answers.extend(qap['answer'])
        else:
            answers.extend(anns['answer'])
    print(d["question"])
    print("       A:", answers)
    print("       A:", set(answers))

Dogri language is spoken in which state of india?
       A: ['Jammu and Kashmir, Himachal Pradesh, Punjab', 'Jammu and Kashmir, Himachal Pradesh, Punjab']
       A: {'Jammu and Kashmir, Himachal Pradesh, Punjab'}
Who plays granny on once upon a time?
       A: ['Beverley Elliott', 'Elliott', 'Beverley Elliott']
       A: {'Beverley Elliott', 'Elliott'}
India's first ever all india institute of ayurveda has come up in which city?
       A: ['New Delhi', 'New Delhi', 'New Delhi, India', 'Delhi']
       A: {'Delhi', 'New Delhi', 'New Delhi, India'}
Type of epithelial tissue containing cells that can change shapes as the tissue stretches?
       A: ['Transitional epithelium', 'a type of stratified epithelium', 'Transitional epithelium']
       A: {'Transitional epithelium', 'a type of stratified epithelium'}
When did the ranch season 2 come out?
       A: ['June 16, 2017', 'December 15, 2017', 'June 16, 2017', '2017', 'December 15, 2017', '2017']
       A: {'December 15, 2017', '2017', 'Ju
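
Many of the multi-annotation questions above only repeat the same answer, so it can be useful to separate those from real disagreements. One possible rule, sketched here as an assumption rather than an established criterion, is to call a pair of annotations conflicting when their lowercased answer sets share nothing. The helper names are hypothetical; the two toy inputs copy annotations from the dev examples printed earlier.

```python
def annotation_answers(annotation):
    """Return the lowercased set of answer strings in one AmbigQA annotation."""
    if annotation["type"] == "multipleQAs":
        answers = [a for qa in annotation["qaPairs"] for a in qa["answer"]]
    else:
        answers = annotation["answer"]
    return {a.lower() for a in answers}


def annotations_conflict(annotations):
    """True if any two annotations have no answer string in common."""
    sets = [annotation_answers(a) for a in annotations]
    return any(
        sets[i].isdisjoint(sets[j])
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
    )


agreeing = [
    {"type": "singleAnswer", "answer": ["usually continues uninterrupted until death"]},
    {"type": "singleAnswer", "answer": ["constant", "usually continues uninterrupted until death"]},
]
disagreeing = [
    {"type": "singleAnswer",
     "answer": ["mediocrity of the Cardinals,a then-21-year-old stadium,game attendance to dwindle"]},
    {"type": "multipleQAs",
     "qaPairs": [{"answer": ["overall mediocrity of the Cardinals"]},
                 {"answer": ["old stadium"]},
                 {"answer": ["game attendance to dwindle"]}]},
]
print(annotations_conflict(agreeing))     # False
print(annotations_conflict(disagreeing))  # True
```

Exact string overlap is a blunt test: the second example is flagged even though both annotators list the same reasons in different wording.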

## Explore QAMPARI

In [24]:
sh.ls(qp_path)

dev_data.jsonl	test_data.jsonl  train_data.jsonl

In [25]:
# Each line of the JSONL file is one JSON record
with open(f"{qp_path}/train_data.jsonl") as f:
    qp_data = [json.loads(line) for line in f]

In [32]:
print(qp_data[0].keys())
print()
print(f"{qp_data[0]['qid']}: {qp_data[0]['question_text']}")
print()
print(qp_data[0]['entities'][0])
print()
print(qp_data[0]['answer_list'][0])

dict_keys(['entities', 'question_text', 'answer_list', 'qid'])

0__wikidata_simple__train: Which movie, clip, TV show etc. had Chezhiyan as director of photography?

{'entity_url': 'https://en.wikipedia.org/wiki/Chezhiyan', 'entity_text': 'Chezhiyan', 'aliases': ['Chezhiyan']}

{'answer_text': 'To Let', 'aid': '0__wikidata_simple__train__0', 'aliases': ['To Let'], 'answer_url': 'https://en.wikipedia.org/wiki/To_Let_(film)', 'proof': [{'proof_text': 'To let is a 2017 indian tamil-language drama film written, directed and filmed by chezhiyan.', 'found_in_url': 'https://en.wikipedia.org/wiki/To_Let_(film)', 'pid': '0__wikidata_simple__train__0__0'}]}
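
For evaluation it is often handy to flatten each record into the list of strings that count as correct. The sketch below does that using only the field names visible in the output above; `answer_strings` is a hypothetical helper, and the toy record is a trimmed copy of `qp_data[0]`.

```python
def answer_strings(record):
    """Collect each answer's text and aliases, in order, without duplicates."""
    strings = []
    for answer in record["answer_list"]:
        # aliases usually repeat answer_text, so skip anything already seen
        for s in [answer["answer_text"], *answer.get("aliases", [])]:
            if s not in strings:
                strings.append(s)
    return strings


toy_qp_record = {
    "qid": "0__wikidata_simple__train",
    "question_text": "Which movie, clip, TV show etc. had Chezhiyan as director of photography?",
    "answer_list": [
        {"answer_text": "To Let", "aid": "0__wikidata_simple__train__0", "aliases": ["To Let"]},
    ],
}
print(answer_strings(toy_qp_record))  # ['To Let']
```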


## Load a DPR Checkpoint

In [34]:
import transformers
import torch
import sh

In [35]:
the_model = torch.load(
    "/scratch/ddr8143/repos/DPR/downloads/checkpoint/retriever/single/nq/bert-base-encoder.cp",
    map_location=torch.device('cpu'),
)
print(the_model.keys())
model_dict = the_model["model_dict"]

odict_keys(['model_dict', 'optimizer_dict', 'scheduler_dict', 'offset', 'epoch', 'encoder_params'])


In [36]:
# Get the model structure:
top_levels = {k.split(".")[0] for k in model_dict}
print(top_levels)
q_model_ks = [k for k in model_dict.keys() if 'question_model' in k and 'embeddings' in k]
c_model_ks = [k for k in model_dict.keys() if 'ctx_model' in k and 'embeddings' in k]
print()
print("Question model")
for k in q_model_ks:
    print(">>", k)
print()
print("Context model")
for k in c_model_ks:
    print(">>", k)

{'ctx_model', 'question_model'}

Question model
>> question_model.embeddings.word_embeddings.weight
>> question_model.embeddings.position_embeddings.weight
>> question_model.embeddings.token_type_embeddings.weight
>> question_model.embeddings.LayerNorm.weight
>> question_model.embeddings.LayerNorm.bias

Context model
>> ctx_model.embeddings.word_embeddings.weight
>> ctx_model.embeddings.position_embeddings.weight
>> ctx_model.embeddings.token_type_embeddings.weight
>> ctx_model.embeddings.LayerNorm.weight
>> ctx_model.embeddings.LayerNorm.bias


In [37]:
# Then we can also load the base model
from transformers import BertModel, BertConfig
cfg = BertConfig.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
hf_bert = model.state_dict()
model_embed_ks = [k for k in hf_bert.keys() if "embeddings" in k]
model_embed_ks

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['embeddings.position_ids',
 'embeddings.word_embeddings.weight',
 'embeddings.position_embeddings.weight',
 'embeddings.token_type_embeddings.weight',
 'embeddings.LayerNorm.weight',
 'embeddings.LayerNorm.bias']
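
Comparing the two key listings, the checkpoint keys look like the Hugging Face keys with a `question_model.` or `ctx_model.` prefix, and the Hugging Face model additionally lists `embeddings.position_ids`. A minimal sketch of stripping the prefix follows. `extract_encoder` is a hypothetical helper, the toy dictionary uses placeholder values instead of tensors, and whether every non-embedding key lines up the same way is not checked here.

```python
def extract_encoder(model_dict, prefix):
    """Keep the entries under `prefix` and drop the prefix from their keys."""
    dotted = prefix + "."
    return {k[len(dotted):]: v for k, v in model_dict.items() if k.startswith(dotted)}


# Toy stand-in for the checkpoint: key names from the output above, placeholder values.
toy_model_dict = {
    "question_model.embeddings.word_embeddings.weight": 0,
    "question_model.embeddings.LayerNorm.bias": 1,
    "ctx_model.embeddings.word_embeddings.weight": 2,
}
question_encoder = extract_encoder(toy_model_dict, "question_model")
print(sorted(question_encoder))

# Keys the Hugging Face model lists that the stripped checkpoint does not provide
toy_hf_keys = {
    "embeddings.position_ids",
    "embeddings.word_embeddings.weight",
    "embeddings.LayerNorm.bias",
}
print(sorted(toy_hf_keys - set(question_encoder)))  # ['embeddings.position_ids']
```

With the real checkpoint, the stripped dictionary could presumably be passed to `model.load_state_dict(..., strict=False)` so that a missing `position_ids` buffer does not raise, but that step is untested in this notebook.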

## Test using BM25 search

In [11]:
from pyserini.search.lucene import LuceneSearcher
from pyserini.index.lucene import IndexReader

In [12]:
#LuceneSearcher.list_prebuilt_indexes()

In [13]:
searcher = LuceneSearcher.from_prebuilt_index('wikipedia-dpr')

Attempting to initialize pre-built index wikipedia-dpr.
/home/ddr8143/.cache/pyserini/indexes/index-wikipedia-dpr-20210120-d1b9e6.c28f3a56b2dfcef25bf3bf755c264d04 already exists, skipping download.
Initializing wikipedia-dpr...


In [4]:
hits = searcher.search('hubble space telescope')

In [5]:
for i in range(0, 10):
    doc = searcher.doc(hits[i].docid)
    docstr = doc.raw()[:100].replace("\n", "")
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}, {docstr}')

 1 500264          17.42493, {  "id" : "500264",  "contents" : "\"Hubble Space Telescope\"\nHubble Space Telescope The Hubble S
 2 500350          17.02356, {  "id" : "500350",  "contents" : "\"Hubble Space Telescope\"\nThese are often European in origin,
 3 500368          16.58024, {  "id" : "500368",  "contents" : "\"Hubble Space Telescope\"\nvery narrow field—Lucky Cam, for ex
 4 500266          16.45677, {  "id" : "500266",  "contents" : "\"Hubble Space Telescope\"\ndata, while the Goddard Space Fligh
 5 500367          16.37399, {  "id" : "500367",  "contents" : "\"Hubble Space Telescope\"\nthe visible, ultraviolet, and infra
 6 500265          16.27738, {  "id" : "500265",  "contents" : "\"Hubble Space Telescope\"\nultraviolet, visible, and near infr
 7 500362          16.26323, {  "id" : "500362",  "contents" : "\"Hubble Space Telescope\"\n(SCRS). , the Trump Administration 
 8 14244283        16.20809, {  "id" : "14244283",  "contents" : "\"Hubble (film)\"\nHubble (film) Hubbl

In [6]:
print(hits[0].raw)

{
  "id" : "500264",
  "contents" : "\"Hubble Space Telescope\"\nHubble Space Telescope The Hubble Space Telescope (HST) is a space telescope that was launched into low Earth orbit in 1990 and remains in operation. Although not the first space telescope, Hubble is one of the largest and most versatile and is well known as both a vital research tool and a public relations boon for astronomy. The HST is named after the astronomer Edwin Hubble and is one of NASA's Great Observatories, along with the Compton Gamma Ray Observatory, the Chandra X-ray Observatory and the Spitzer Space Telescope. With a mirror, Hubble's four main instruments observe in the near"
}
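
The raw hit is a JSON string whose `contents` field holds the quoted page title, a newline, and then the passage text. A small sketch for splitting it into parts is shown below; `split_contents` is a hypothetical helper, and the toy string is built with `json.dumps` from a shortened copy of the record above rather than taken from a live searcher.

```python
import json


def split_contents(raw_json):
    """Split a raw wikipedia-dpr record into (id, title, passage text)."""
    record = json.loads(raw_json)
    # contents looks like: "Title"\npassage text
    title, _, text = record["contents"].partition("\n")
    return record["id"], title.strip('"'), text


toy_raw = json.dumps({
    "id": "500264",
    "contents": '"Hubble Space Telescope"\nHubble Space Telescope The Hubble Space Telescope (HST) is a space telescope',
})
doc_id, title, text = split_contents(toy_raw)
print(doc_id, "|", title, "|", text[:40])
```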


In [7]:
# python -m pyserini.search.lucene \
#   --index wikipedia-dpr \
#   --topics dpr-nq-test \
#   --output runs/run.dpr.nq-test.bm25.trec