# Test Out Heuristic Approach on Simple QAMPARI Questions
My hypothesis was that a very simple heuristic could solve the simple QAMPARI questions: run a BM25 search over full Wikipedia pages using the question as the query, then measure recall by checking whether each answer appears as an exact (case-insensitive) string match in the retrieved pages.

Steps:

1. Extract the simple QAMPARI questions.
2. Postprocess the Wikipedia dump into a page index for Pyserini.
3. Get the results for 100 and 500 hits.

Additionally, as a sanity check, I would expect a BM25 search that uses an answer as the query to bring up pages that contain that answer, so I test this too.
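
The matching step of this heuristic can be sketched without a search index. The helper below is a minimal illustration, assuming the retrieved pages are already available as a list of strings in rank order. `answer_hit_ranks` and the toy pages are hypothetical and are not part of the notebook.

```python
def answer_hit_ranks(answers, pages):
    """Map each answer to the 0-based ranks of the pages that contain it.

    Matching is a case-insensitive substring test, mirroring the
    exact-match check described above.
    """
    lowered_pages = [p.lower() for p in pages]
    return {
        a: [rank for rank, page in enumerate(lowered_pages) if a.lower() in page]
        for a in answers
    }


# Toy, made-up pages standing in for BM25 results (rank order)
pages = [
    "A page that mentions To Let.",
    "A list page that mentions Kalloori and TO LET.",
    "An unrelated page.",
]
ranks = answer_hit_ranks(["To Let", "Kalloori", "Paradesi"], pages)

# Answers with no matching page count as misses
missing = [a for a, r in ranks.items() if not r]
print(ranks)
print(missing)
```

An answer counts as recalled if its rank list is non-empty, so the miss rate for a question is the share of answers with an empty list.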

## Current Results
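Average percentage of a question's answers that never appear in the retrieved pages:

- 49.7% missing from the top 100 (question as query)
- 41.5% missing from the top 500 (question as query)
- 26.5% missing from the top 100 (answer as query)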

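These results are quite unexpected, which suggests that I'm doing something wrong. In particular, the sanity check that uses the answer itself as the query should miss far fewer than a quarter of the answers.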

In [4]:
from collections import defaultdict

import json
import jsonlines
import os

from pyserini.search.lucene import LuceneSearcher



In [3]:
with open('/scratch/ddr8143/multiqa/qampari_data/qp_simple_train.json') as f:
    qp_simple_data = json.load(f)

In [5]:
index_path = "/scratch/ddr8143/multiqa/indexes/full_page_qampari_wikidata_index"
searcher = LuceneSearcher(index_path)
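
For reference, step 2 of the plan could look roughly like the sketch below. This is a guess at the preprocessing, not the code that built the index above. It assumes Pyserini's JSON collection format, in which each document is an object with `id` and `contents` fields. The `pages` input and the `write_page_collection` helper are hypothetical. If the index stores raw documents, `hit.raw` may be the stored JSON string, so writing with `ensure_ascii=False` keeps non-ASCII text matchable by a plain substring test.

```python
import json
import os
import tempfile


def write_page_collection(pages, out_path):
    """Write (title, text) pairs as one JSON document per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for i, (title, text) in enumerate(pages):
            # "id" and "contents" are the field names assumed for the collection
            doc = {"id": str(i), "contents": f"{title}\n{text}"}
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


# Made-up pages for illustration only
pages = [("Example page", "Body text of the page."), ("Café", "Another body.")]
out_path = os.path.join(tempfile.mkdtemp(), "pages.jsonl")
write_page_collection(pages, out_path)

with open(out_path, encoding="utf-8") as f:
    raw_lines = f.read().splitlines()
docs = [json.loads(line) for line in raw_lines]
```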

In [9]:
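# Use the question as the query
def query_question(searcher, qdata, k):
    """Map each answer to the ranks of the top-k hits whose raw text contains it."""
    q = qdata['question_text']
    alist = [a['answer_text'] for a in qdata['answer_list']]
    hits = searcher.search(q, k=k)
    # Lowercase each page once rather than once per answer
    raws = [hit.raw.lower() for hit in hits]
    # Start every answer with an empty list so answers with no hits are still reported
    ans_contained = {a: [] for a in alist}
    for a in alist:
        for i, raw in enumerate(raws):
            if a.lower() in raw:
                ans_contained[a].append(i)
    return ans_contained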

In [10]:
# Use the answers as the query
def query_answers(searcher, qdata, k):
    """Search with each answer as the query and record the ranks of hits that contain it."""
    alist = [a['answer_text'] for a in qdata['answer_list']]
    # Start every answer with an empty list so answers with no hits are still reported
    ans_hits = {a: [] for a in alist}
    for a in alist:
        hits = searcher.search(a, k=k)
        for i, hit in enumerate(hits):
            if a.lower() in hit.raw.lower():
                ans_hits[a].append(i)
    return ans_hits

**Run hit queries and/or load output**

In [11]:
# Run Queries on the full simple dataset
hit_data_out_path = "/scratch/ddr8143/multiqa/qampari_data/hit_data.jsonl"

In [12]:
if not os.path.exists(hit_data_out_path):
    # jsonlines.open closes the underlying file when the block exits
    with jsonlines.open(hit_data_out_path, mode='w') as writer:
        for qd in qp_simple_data:
            qq100 = query_question(searcher, qd, 100)
            qq500 = query_question(searcher, qd, 500)
            qa100 = query_answers(searcher, qd, 100)
            qdata_out = {
                "question_data": qd,
                "query_question_contains_answer_k100": qq100,
                "query_question_contains_answer_k500": qq500,
                "query_answers_contains_answer_k100": qa100,
            }
            writer.write(qdata_out)
else:
    print(f"Path already exists: {hit_data_out_path}")

Path already exists: /scratch/ddr8143/multiqa/qampari_data/hit_data.jsonl


In [15]:
# Load the output
with jsonlines.open(hit_data_out_path) as reader:
    all_hits_data = list(reader)
len(all_hits_data)

28574

**Look at some examples**

In [16]:
# Look at some examples
all_hits_data[0]['query_question_contains_answer_k100']

{'To Let': [16, 25, 38],
 'Kalloori': [],
 'Thenmerku Paruvakaatru': [],
 'A Little Dream': [],
 'Paradesi': [],
 'Sawaari': [55]}

In [17]:
all_hits_data[0]['query_question_contains_answer_k500']

{'To Let': [16, 25, 38, 155, 206, 210, 257, 321, 408],
 'Kalloori': [],
 'Thenmerku Paruvakaatru': [],
 'A Little Dream': [],
 'Paradesi': [],
 'Sawaari': [55]}
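
Because each value is a list of ranks, the k=500 results can be cut at any smaller cutoff without rerunning the search. The sketch below assumes that the top-100 hits are a prefix of the top-500 hits, which is consistent with the two outputs above but is not verified in general. `missing_at_k` is a hypothetical helper.

```python
def missing_at_k(rank_lists, k):
    """Return the answers that have no hit at a 0-based rank below k."""
    return [a for a, ranks in rank_lists.items() if not any(r < k for r in ranks)]


# Values copied from the k=500 output above
k500 = {
    'To Let': [16, 25, 38, 155, 206, 210, 257, 321, 408],
    'Kalloori': [],
    'Thenmerku Paruvakaatru': [],
    'A Little Dream': [],
    'Paradesi': [],
    'Sawaari': [55],
}

missing_100 = missing_at_k(k500, 100)
missing_50 = missing_at_k(k500, 50)
print(missing_100)
print(missing_50)
```

With a cutoff of 100 this reproduces the empty entries of the k=100 output, and with a cutoff of 50 it also reports 'Sawaari', whose only hit is at rank 55.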

In [22]:
print("The first 10 answer hits:")
{k: v[:10] for k, v in all_hits_data[0]['query_answers_contains_answer_k100'].items()}

The first 10 answer hits:


{'To Let': [1, 2, 3, 34, 42, 49, 63],
 'Kalloori': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'Thenmerku Paruvakaatru': [0, 1, 2, 3, 4, 5, 6, 8],
 'A Little Dream': [4, 5, 6, 23, 28, 84],
 'Paradesi': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'Sawaari': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}

## Now Analyze Results

In [23]:
all_ans = []
missing_top_100 = []
missing_top_500 = []
missing_in_own = []
for hits_data in all_hits_data:
    top_100 = hits_data['query_question_contains_answer_k100']
    top_500 = hits_data['query_question_contains_answer_k500']
    top_100_own = hits_data['query_answers_contains_answer_k100']
    # The keys of each dict are the answers; an empty rank list means the answer was never found
    all_ans.append(list(top_100))
    missing_top_100.append([a for a, v in top_100.items() if len(v) == 0])
    missing_top_500.append([a for a, v in top_500.items() if len(v) == 0])
    missing_in_own.append([a for a, v in top_100_own.items() if len(v) == 0])

In [29]:
k = 100  # index of the example question to inspect (not a retrieval cutoff)
print(all_hits_data[k]['question_data']['question_text'])
print(f'             all answers: {len(all_ans[k])}',  all_ans[k])
print()
print(f'    missing from top 100: {len(missing_top_100[k])}', missing_top_100[k])
print()
print(f'    missing from top 500: {len(missing_top_500[k])}', missing_top_500[k])
print()
print(f' missing from own search: {len(missing_in_own[k])}', missing_in_own[k])

Which software, art, etc. has Charles McPherson as performer?
             all answers: 16 ['Come Play with Me (album)', 'Con Alma!', 'Live in Tokyo', 'Bebop Revisited!', "McPherson's Mood", 'The Quintet/Live!', 'Free Bop!', 'From This Moment On!', "Today's Man", 'Beautiful!', 'Charles McPherson', 'Horizons', 'First Flight Out', 'Manhattan Nocturne (album)', 'Siku Ya Bibi', 'New Horizons']

    missing from top 100: 11 ['Con Alma!', 'Live in Tokyo', "McPherson's Mood", 'The Quintet/Live!', 'Free Bop!', 'From This Moment On!', "Today's Man", 'Beautiful!', 'First Flight Out', 'Siku Ya Bibi', 'New Horizons']

    missing from top 500: 4 ['Con Alma!', "Today's Man", 'Beautiful!', 'Siku Ya Bibi']

 missing from own search: 5 ['Free Bop!', 'From This Moment On!', 'Beautiful!', 'First Flight Out', 'Siku Ya Bibi']


In [33]:
percent_results = {
    "perc_miss_top_100": [len(missing_top_100[k]) * 100.0 / len(all_ans[k]) for k in range(len(all_ans))],
    "perc_miss_top_500": [len(missing_top_500[k]) * 100.0 / len(all_ans[k]) for k in range(len(all_ans))],
    "perc_miss_own_100": [len(missing_in_own[k]) * 100.0 / len(all_ans[k]) for k in range(len(all_ans))],
}

In [34]:
for pname, p in percent_results.items():
    print(f"{pname:15} | min {min(p):0.2f} avg {sum(p)/len(p):0.2f} max {max(p):0.2f}")

perc_miss_top_100 | min 0.00 avg 49.67 max 100.00
perc_miss_top_500 | min 0.00 avg 41.48 max 100.00
perc_miss_own_100 | min 0.00 avg 26.53 max 100.00
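
The averages above are macro-averages: each question contributes one percentage, regardless of how many answers it has. A pooled (micro) miss rate over all answers is a useful cross-check, since a question with a single answer can only score 0% or 100%. The sketch below uses made-up inputs shaped like `missing_top_100` and `all_ans`; `macro_and_micro_miss` is a hypothetical helper.

```python
def macro_and_micro_miss(missing, all_answers):
    """Return (macro, micro) miss percentages.

    macro: mean of the per-question miss percentages.
    micro: missed answers divided by total answers, pooled over all questions.
    """
    per_question = [100.0 * len(m) / len(a) for m, a in zip(missing, all_answers)]
    macro = sum(per_question) / len(per_question)
    micro = 100.0 * sum(len(m) for m in missing) / sum(len(a) for a in all_answers)
    return macro, micro


# Made-up example: one single-answer question (missed) and one
# four-answer question with a single miss
missing = [["x"], ["a"]]
all_answers = [["x"], ["a", "b", "c", "d"]]
macro, micro = macro_and_micro_miss(missing, all_answers)
print(f"macro {macro:0.2f} micro {micro:0.2f}")
```

If the two numbers differ substantially on the real data, the misses are concentrated in questions with few (or many) answers, which would help narrow down where the heuristic fails.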
