# Prep, Run and Eval QAMPARI Baselines

This notebook covers:
- New utilities for viewing retrieval results (positive contexts)
- New utilities for calculating retrieval metrics (recall, precision, F1)
- An evaluation of the qmp_bm25 and qmp_dpr retrieval performance (full set and split by question type)

TODO:
- Color the answer list based on whether each answer was found in the positive contexts.
- Visualize the top-k contexts too (not just the positive ones).
- Match answers the same way QAMPARI determined its positive contexts; for element 2 they clearly were not found by exact match. Figure out how this was done.
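
As a starting point for the first TODO item, here is a minimal sketch of coloring answers by whether they appear in the positive contexts. It is not part of `multiqa_utils`: the function name `color_answers`, the `"text"` field on each context, and the case-insensitive substring match are all assumptions made for illustration. As the last TODO item notes, QAMPARI's own matching may not be exact match.

```python
# ANSI color codes for terminal / notebook output
GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"


def color_answers(answers, positive_ctxs):
    """Wrap each answer in green if it occurs in any positive context, else red.

    Assumes each context is a dict with a "text" field (hypothetical schema)
    and uses case-insensitive substring matching.
    """
    texts = [ctx.get("text", "").lower() for ctx in positive_ctxs]
    colored = []
    for ans in answers:
        found = any(ans.lower() in t for t in texts)
        colored.append(f"{GREEN if found else RED}{ans}{RESET}")
    return colored


# Toy example (made-up context text, not QAMPARI data)
example_answers = ["Chicago Defender", "Michigan Chronicle"]
example_ctxs = [{"title": "Chicago Defender", "text": "The Chicago Defender is a newspaper."}]
print(", ".join(color_answers(example_answers, example_ctxs)))
```

In the toy example the first answer is printed in green and the second in red, because only the first appears in the context text.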

In [16]:
import json
import jsonlines
import sh

import multiqa_utils.general_utils as gu
import multiqa_utils.qampari_utils as qu
import multiqa_utils.eval_utils as eu

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Inspect QAMPARI provided retrieval results

The QAMPARI authors provide BM25 and DPR retrieval predictions [here](https://samsam3232.github.io/qampari).

In [17]:
qmp_dwn_path = "/scratch/ddr8143/multiqa/qampari_data/qampari_downloads/"

In [18]:
print(sh.ls(qmp_dwn_path))
print(">> fid_bm25_results")
print(sh.ls(f"{qmp_dwn_path}/fid_bm25_results"))
print()
print(">> fid_dpr_results")
print(sh.ls(f"{qmp_dwn_path}/fid_dpr_results"))

fid_bm25_results  fid_dpr_results  rag_results

>> fid_bm25_results
full_dev_data.jsonl	  full_test_data_gold.jsonl.gz
full_dev_data_gold.jsonl  full_train_data.jsonl.gz
full_test_data.jsonl.gz


>> fid_dpr_results
full_dev_data.jsonl  full_test_data.jsonl.gz  full_train_data.jsonl.gz



In [19]:
fid_bm25_dev = gu.loadjsonl(f"{qmp_dwn_path}/fid_bm25_results/full_dev_data.jsonl")
fid_dpr_dev = gu.loadjsonl(f"{qmp_dwn_path}/fid_dpr_results/full_dev_data.jsonl")
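
The implementation of `gu.loadjsonl` is not shown in this notebook. A minimal standard-library sketch of a loader with the same purpose is below; the name `load_jsonl` and the handling of `.gz` files (which the train and test downloads use) are assumptions, not the utility's actual behavior.

```python
import gzip
import json


def load_jsonl(path):
    """Load a .jsonl or .jsonl.gz file into a list of dicts (one per line)."""
    # Pick a gzip-aware opener when the file name ends in .gz
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not raise an error
        return [json.loads(line) for line in f if line.strip()]
```

Loading everything into a list is convenient for indexing (`data[2]`), at the cost of holding the whole file in memory.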

In [32]:
qu.print_retrieval_data(fid_dpr_dev[2])

['New Pittsburgh Courier', 'The Michigan FrontPage', 'Michigan Chronicle', 'Chicago Defender', 'Atlanta Daily World']
Type:                564__wikidata_simple__dev
Question:            Which entity does Real Times have control over?
Question Keywords:   Which, entity, does, Real, Times, control, over
Answers:             New Pittsburgh Courier, The Michigan FrontPage, Michigan Chronicle, Chicago Defender, Atlanta Daily World
Len pos contexts:    5
Len ctxs:            200
----------------------------------
136.7562 | New Pittsburgh Courier
    >> The New Pittsburgh Courier is a weekly African-American newspaper based in Pittsburgh, Pennsylvania, United States. It is owned by
       Real Times. The newspaper is named after the original "Pittsburgh Courier" (1907–65), Which in the 1930s and 1940s was one
       of the largest and most i

In [18]:
qu.print_data(fid_bm25_dev[2])

Type:                958__wikidata_comp__dev
Question:            Where did the White House Deputy Chief of Staff receive their education?
Question Keywords:   Where, did, White, House, Deputy, Chief, Staff, receive, education
Answers:             Hillcrest High School, University of Auckland, UC Berkeley School of Law, Kenyon College, Stanford University
Len pos contexts:    10
Len ctxs:            200
----------------------------------
182.5096 | Harriet Miers
    >> Harriet Ellan Miers (born August 10, 1945) is an American lawyer who served as White House Counsel to President George W. Bush from
       2005 to 2007. A member of the Republican Party since 1988, she previously served as White House Staff Secretary from 2001
       to 2003 and White House Deputy Chief of Staff

## Calculate Metrics on Data

**Calculating metrics on the fid_\<retrieval\>_dev dataset.**

In [19]:
for kv in [20, 100, 200]:
    print(f"Performance @{kv}")
    dataset_results = eu.evaluate_dataset(fid_bm25_dev, k=kv)
    for k, v in dataset_results.items():
        if "avg" in k:
            print(f"{k + ':':25} {v*100.0:0.2f}%")
    print()

Performance @20
avg_recall:               36.18%
avg_precision:            23.56%
avg_f1:                   25.47%

Performance @100
avg_recall:               56.17%
avg_precision:            15.35%
avg_f1:                   21.06%

Performance @200
avg_recall:               62.40%
avg_precision:            12.07%
avg_f1:                   17.38%



In [17]:
eu.viz_correct_answers_context_list(fid_bm25_dev[1])

Question: Which Judge of the United States Court of Appeals for the Second Circuit works for Yale Law School?
Answers: [['Charles Edward Clark'], ['Henry Wade Rogers'], ['John M. Walker, Jr.'], ['Ralph K. Winter, Jr.'], ['Thomas Walter Swan'], ['Guido Calabresi']]
-------
Returned Answers: ['Charles Edward Clark', 'Ralph K. Winter, Jr.', 'Thomas Walter Swan', 'Guido Calabresi']
[Recall: 66.67%] 4 out of 6 in context list
[Precision: 9.00%] 18 out of 200 contexts contained an answer
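
The two lines printed above are consistent with these per-question definitions: recall is the number of gold answers found somewhere in the retrieved contexts divided by the number of gold answers (4 of 6), and precision is the number of retrieved contexts containing at least one answer divided by the number of retrieved contexts (18 of 200). The sketch below implements those definitions for a single question. It is an assumption about how `eu.evaluate_dataset` works, not its actual code; `retrieval_metrics` is a hypothetical name, and matching is case-insensitive substring matching over each answer's aliases.

```python
def retrieval_metrics(answers, ctx_texts, k):
    """Recall, precision and F1 at k for one question.

    answers:   list of alias lists, e.g. [["Guido Calabresi"], ["Henry Wade Rogers"]]
    ctx_texts: ranked list of retrieved context strings
    """
    top_k = [t.lower() for t in ctx_texts[:k]]

    def matches(aliases, text):
        # An answer counts as present if any of its aliases is a substring
        return any(alias.lower() in text for alias in aliases)

    # Answers that appear in at least one of the top-k contexts
    found = sum(any(matches(al, t) for t in top_k) for al in answers)
    # Top-k contexts that contain at least one answer
    hits = sum(any(matches(al, t) for al in answers) for t in top_k)

    recall = found / len(answers) if answers else 0.0
    precision = hits / len(top_k) if top_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Toy example (made-up contexts, not QAMPARI data)
toy_answers = [["Guido Calabresi"], ["Henry Wade Rogers"]]
toy_ctxs = [
    "Guido Calabresi is a judge.",
    "Unrelated text.",
    "A second passage about Guido Calabresi.",
    "Nothing relevant here.",
]
print(retrieval_metrics(toy_answers, toy_ctxs, k=4))
```

If F1 is computed per question and then averaged, as in this sketch, that would explain why the reported `avg_f1` is not the harmonic mean of `avg_precision` and `avg_recall`.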


In [20]:
for kv in [20, 100, 200]:
    print(f"Performance @{kv}")
    dataset_results = eu.evaluate_dataset(fid_dpr_dev, k=kv)
    for k, v in dataset_results.items():
        if "avg" in k:
            print(f"{k + ':':25} {v*100.0:0.2f}%")
    print()

Performance @20
avg_recall:               21.29%
avg_precision:            14.64%
avg_f1:                   14.52%

Performance @100
avg_recall:               34.14%
avg_precision:            9.52%
avg_f1:                   12.29%

Performance @200
avg_recall:               40.48%
avg_precision:            7.90%
avg_f1:                   10.92%



In [21]:
eu.viz_correct_answers_context_list(fid_dpr_dev[1])

Question: Where was a Bishop of Bradford taught?
Answers: [['Nottingham High School'], ['Marlborough College'], ["King's College"], ["King's College London"], ['University of Birmingham']]
-------
Returned Answers: ['Nottingham High School', "King's College", "King's College London", 'University of Birmingham']
[Recall: 80.00%] 4 out of 5 in context list
[Precision: 3.00%] 6 out of 200 contexts contained an answer


**Splitting out results by question type**
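
The cells in this section select questions with a substring test on the question id (`qt in q['id']`). An alternative sketch that parses the type out of the id and groups the dataset in one pass is shown here. It assumes ids follow the pattern seen in the printed examples, such as `564__wikidata_simple__dev`; the helper names `question_type` and `group_by_type` are hypothetical.

```python
from collections import defaultdict


def question_type(qid):
    """Extract the question type from an id like '564__wikidata_simple__dev'."""
    parts = qid.split("__")
    if len(parts) < 2:
        return "unknown"  # id does not follow the assumed pattern
    # 'wikidata_simple' -> 'simple'
    return parts[1].rsplit("_", 1)[-1]


def group_by_type(dataset):
    """Group question dicts (each with an 'id' field) by question type."""
    groups = defaultdict(list)
    for q in dataset:
        groups[question_type(q["id"])].append(q)
    return dict(groups)


# Toy example using the two ids printed earlier in this notebook
toy = [{"id": "564__wikidata_simple__dev"}, {"id": "958__wikidata_comp__dev"}]
print({qt: len(qs) for qt, qs in group_by_type(toy).items()})
```

Grouping once avoids re-scanning the full dataset for every question type and value of k.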

In [39]:
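# BM25 recall minus DPR recall (percentage points), by question type and k
print(50.72-31.02,  # simple @100
54.46-36.17,        # comp @100
72.16-38.84,        # intersection @100
56.89-36.03,        # simple @200
72.16-42.87,        # comp @200 (note: the BM25 comp recall @200 reported below is 61.77, not 72.16)
76.92-47.89)        # intersection @200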

19.7 18.29 33.31999999999999 20.86 29.29 29.03
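
The hand-typed differences above are BM25 recall minus DPR recall for each question type at k = 100 and k = 200. A sketch that computes the same gaps from dictionaries keyed by (question type, k) is less prone to copy errors. The recall values are copied from the outputs in this notebook; the intersection @200 pair (76.92, 47.89) is taken from the cell above, and the comp @200 entry uses the 61.77 that the BM25 results report. The dictionary names are hypothetical.

```python
# Average recall (%) by (question type, k), copied from the notebook outputs
bm25_recall = {
    ("simple", 100): 50.72, ("comp", 100): 54.46, ("intersection", 100): 72.16,
    ("simple", 200): 56.89, ("comp", 200): 61.77, ("intersection", 200): 76.92,
}
dpr_recall = {
    ("simple", 100): 31.02, ("comp", 100): 36.17, ("intersection", 100): 38.84,
    ("simple", 200): 36.03, ("comp", 200): 42.87, ("intersection", 200): 47.89,
}

# Gap in percentage points, rounded to avoid floating-point noise in the printout
recall_gap = {
    key: round(bm25_recall[key] - dpr_recall[key], 2) for key in bm25_recall
}

for (qtype, k), gap in recall_gap.items():
    print(f"{qtype:13} @{k:<3} {gap:6.2f}")
```

Rounding to two decimals matches the precision of the reported metrics and avoids values such as 33.31999999999999 in the output.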


In [33]:
print("BM25 Results")
for qt in ['simple', 'comp', 'intersection']:
    print(">> Question type:", qt)
    for kv in [20, 100, 200]:
        print(f"Performance @{kv}")
        dataset_results = eu.evaluate_dataset([q for q in fid_bm25_dev if qt in q['id']], k=kv)
        for k, v in dataset_results.items():
            if "avg" in k:
                print(f"{k + ':':25} {v*100.0:0.2f}%")
        print()

BM25 Results
>> Question type: simple
Performance @20
avg_recall:               30.47%
avg_precision:            19.85%
avg_f1:                   21.12%

Performance @100
avg_recall:               50.72%
avg_precision:            13.44%
avg_f1:                   18.80%

Performance @200
avg_recall:               56.89%
avg_precision:            10.32%
avg_f1:                   15.35%

>> Question type: comp
Performance @20
avg_recall:               35.52%
avg_precision:            26.08%
avg_f1:                   26.48%

Performance @100
avg_recall:               54.46%
avg_precision:            16.89%
avg_f1:                   21.58%

Performance @200
avg_recall:               61.77%
avg_precision:            13.70%
avg_f1:                   18.47%

>> Question type: intersection
Performance @20
avg_recall:               51.27%
avg_precision:            28.98%
avg_f1:                   34.67%

Performance @100
avg_recall:               72.16%
avg_precision:            17.80%
avg_f1:  

In [34]:
print("DPR Results")
for qt in ['simple', 'comp', 'intersection']:
    print(">> Question type:", qt)
    for kv in [20, 100, 200]:
        print(f"Performance @{kv}")
        dataset_results = eu.evaluate_dataset([q for q in fid_dpr_dev if qt in q['id']], k=kv)
        for k, v in dataset_results.items():
            if "avg" in k:
                print(f"{k + ':':25} {v*100.0:0.2f}%")
        print()

DPR Results
>> Question type: simple
Performance @20
avg_recall:               20.21%
avg_precision:            14.08%
avg_f1:                   13.34%

Performance @100
avg_recall:               31.02%
avg_precision:            8.47%
avg_f1:                   10.98%

Performance @200
avg_recall:               36.03%
avg_precision:            6.63%
avg_f1:                   9.36%

>> Question type: comp
Performance @20
avg_recall:               21.43%
avg_precision:            16.65%
avg_f1:                   16.54%

Performance @100
avg_recall:               36.17%
avg_precision:            12.39%
avg_f1:                   15.86%

Performance @200
avg_recall:               42.87%
avg_precision:            10.95%
avg_f1:                   14.97%

>> Question type: intersection
Performance @20
avg_recall:               23.73%
avg_precision:            13.03%
avg_f1:                   14.41%

Performance @100
avg_recall:               38.84%
avg_precision:            7.82%
avg_f1:       