# Test Out Heuristic Approach on Simple QAMPARI Questions

My hypothesis was that a very simple heuristic could solve the simple QAMPARI questions: run a BM25 search over full Wikipedia pages with the question as the query, then measure recall by checking whether each answer string appears (case-insensitive exact match) in the retrieved pages.

Steps:
1) Extract the simple QAMPARI questions
2) Postprocess the Wikipedia dump into a page index for Pyserini
3) Get the results for the top 100 and top 500 hits
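
Step 2 could look roughly like the sketch below. It assumes the postprocessed dump yields `(title, text)` pairs and writes them in the JSON Lines layout that Pyserini's `JsonCollection` reads, where each record has an `id` and a `contents` field. The helper name `write_page_collection`, the output filename, and the sample pages are hypothetical and are not the notebook's actual code.

```python
import json
import os
import tempfile


def write_page_collection(pages, out_dir, filename="pages.jsonl"):
    """Write (title, text) pairs as one JSON document per line."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, filename)
    with open(out_path, "w", encoding="utf-8") as f:
        for title, text in pages:
            # Put the title into contents as well, so BM25 can match on it.
            doc = {"id": title, "contents": f"{title}\n{text}"}
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return out_path


# Placeholder pages, for illustration only.
pages = [
    ("Thenmerku Paruvakaatru", "Placeholder text mentioning cinematography by chezhiyan."),
    ("To Let", "Placeholder page text."),
]
out_path = write_page_collection(pages, tempfile.mkdtemp())

# Read the file back to confirm each line is a standalone JSON document.
with open(out_path, encoding="utf-8") as f:
    docs = [json.loads(line) for line in f]
print([d["id"] for d in docs])
```

If the collection is then indexed with `python -m pyserini.index.lucene --collection JsonCollection ...`, my understanding is that the `--storeRaw` flag is needed for `hit.raw` to return the stored document. That is worth verifying against the Pyserini docs.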

Additionally, as a sanity check, I would expect a BM25 search that uses the answer itself as the query to return pages that contain that answer, so I test this too.
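
The matching step of the heuristic can be sketched without any search engine. This minimal version assumes the retrieved pages are already available as a ranked list of strings. The function names `answer_ranks` and `recall` and the sample pages are made up for illustration.

```python
def answer_ranks(answers, page_texts):
    """Map each answer to the ranks of the pages whose text contains it."""
    lowered = [text.lower() for text in page_texts]
    # Start every answer with an empty list so misses stay visible.
    ranks = {answer: [] for answer in answers}
    for answer in answers:
        needle = answer.lower()
        for rank, text in enumerate(lowered):
            if needle in text:  # case-insensitive substring match
                ranks[answer].append(rank)
    return ranks


def recall(ranks):
    """Fraction of answers found in at least one retrieved page."""
    return sum(1 for r in ranks.values() if r) / len(ranks)


# Placeholder pages in ranked order.
pages = ["Page mentioning Kalloori.", "Unrelated page.", "To Let and KALLOORI"]
found = answer_ranks(["Kalloori", "To Let", "Paradesi"], pages)
print(found)          # {'Kalloori': [0, 2], 'To Let': [2], 'Paradesi': []}
print(recall(found))  # 2 of 3 answers found
```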

### Current Results
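- 49.7% of answers missing from the top 100 hits (question as query)
- 41.5% of answers missing from the top 500 hits (question as query)
- 26.5% of answers missing from the top 100 hits when the answer itself is the query

Each number is the per-question miss percentage, averaged over all simple questions.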

These results are much worse than expected, which suggests that I'm doing something wrong.

### TODOs
- Pull out the utils into a separate util file.
- Clean up all the testing code.
- Verify the file locations and make sure that they're still correct here.

In [1]:
import sh
import json
import jsonlines
import os

from collections import defaultdict

import html
import re

from multiqa_utils.wikipedia_utils import load_postprocess_dump

%load_ext autoreload
%autoreload 2

In [2]:
base_path = "/scratch/ddr8143/repos/DPR/downloads/data"
ambigqa_path = f"{base_path}/ambigqa"
ambigqa_light_path = f"{base_path}/ambigqa_light"
nq_path = f"{base_path}/retriever"
qp_path = f"{base_path}/qampari"

wikipath = "/scratch/ddr8143/wikipedia/parsed_dumps"
wikipath_out = "/scratch/ddr8143/wikipedia/postprocessed_qampari"

# Find QAMPARI Subset

In [3]:
sh.ls(qp_path)

dev_data.jsonl	test_data.jsonl  train_data.jsonl

In [3]:
# Use a context manager so the file handle is closed after loading.
with open('/scratch/ddr8143/multiqa/qampari_data/qp_simple_train.json') as f:
    qp_simple_data = json.load(f)

In [5]:
qp_simple_data[0]['question_text']

'Which movie, clip, TV show etc. had Chezhiyan as director of photography?'

In [6]:
qp_simple_data[100]['question_text']

'Which software, art, etc. has Charles McPherson as performer?'

In [8]:
qp_simple_data[2000]['question_text']

'Which software, art, etc. has The Dodos as performer?'

In [9]:
i = 0
with jsonlines.open(f"{qp_path}/train_data.jsonl") as reader:
    for obj in reader:
        i += 1

In [10]:
i

61911

In [11]:
len(qp_simple_data) / i

0.4615334916250747

In [27]:
example = {k: v for k, v in qp_simple_data[0]['answer_list'][2].items() if "aid" not in k}

In [28]:
example['proof']

[{'proof_text': 'It stars saranya ponvannan in her 100th film, along with vijay sethupathi and vasundhra chiyertra. The music was composed by n. R. Raghunanthan with cinematography by chezhiyan and editing by mu.',
  'found_in_url': 'https://en.wikipedia.org/wiki/Thenmerku_Paruvakaatru',
  'pid': '0__wikidata_simple__train__2__0'}]

In [29]:
example['proof'] = [{k: v for k, v in ep.items() if "pid" not in k} for ep in example['proof']]

In [30]:
print(qp_simple_data[0]["question_text"])
example

Which movie, clip, TV show etc. had Chezhiyan as director of photography?


{'answer_text': 'Thenmerku Paruvakaatru',
 'aliases': ['Thenmerku Paruvakaatru'],
 'answer_url': 'https://en.wikipedia.org/wiki/Thenmerku_Paruvakaatru',
 'proof': [{'proof_text': 'It stars saranya ponvannan in her 100th film, along with vijay sethupathi and vasundhra chiyertra. The music was composed by n. R. Raghunanthan with cinematography by chezhiyan and editing by mu.',
   'found_in_url': 'https://en.wikipedia.org/wiki/Thenmerku_Paruvakaatru'}]}
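
The extraction for step 1 is not shown in this notebook. Below is a minimal sketch of one way it might be done, assuming each record carries a `qid` field whose value contains `wikidata_simple` for simple questions. The `pid` shown above (`0__wikidata_simple__train__2__0`) hints at this naming, but the field name, the marker, and the sample records are all assumptions.

```python
import io
import json


def filter_simple(lines, marker="wikidata_simple"):
    """Keep records whose qid contains the marker (assumed naming scheme)."""
    records = (json.loads(line) for line in lines if line.strip())
    return [rec for rec in records if marker in rec.get("qid", "")]


# Tiny in-memory stand-in for train_data.jsonl; the qids are made up.
fake_file = io.StringIO(
    '{"qid": "0__wikidata_simple__train", "question_text": "q0"}\n'
    '{"qid": "1__wikidata_comp__train", "question_text": "q1"}\n'
    '{"qid": "2__wikidata_simple__train", "question_text": "q2"}\n'
)
simple = filter_simple(fake_file)
print(len(simple))  # 2
```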

# Inspect Wikidata

In [18]:
index_path = "/scratch/ddr8143/multiqa/indexes/full_page_qampari_wikidata_index"

In [19]:
from pyserini.search.lucene import LuceneSearcher
from pyserini.index.lucene import IndexReader

In [20]:
searcher = LuceneSearcher(index_path)

In [61]:
qdata = qp_simple_data[0]
q = qdata['question_text']
alist = [a['answer_text'] for a in qdata['answer_list']]
print(q)
print(alist)

Which movie, clip, TV show etc. had Chezhiyan as director of photography?
['To Let', 'Kalloori', 'Thenmerku Paruvakaatru', 'A Little Dream', 'Paradesi', 'Sawaari']


In [66]:
hits = searcher.search(q, k=1000)

In [67]:
ans_cont = defaultdict(list)
for a in alist:
    for i, hit in enumerate(hits):
        if a.lower() in hit.raw.lower():
            ans_cont[a].append(i)
    print(ans_cont[a])

[16, 25, 38, 155, 206, 210, 257, 321, 408, 587, 617, 636, 641, 665, 674, 727, 735, 749, 774, 776, 788, 808, 898, 899, 922, 963, 985]
[]
[]
[]
[521]
[55]


In [69]:
ans_hits = defaultdict(list)
for a in alist:
    hits = searcher.search(a, k=100)
    for i, hit in enumerate(hits):
        if a.lower() in hit.raw.lower():
            ans_hits[a].append(i)

In [70]:
ans_hits

defaultdict(list,
            {'To Let': [1, 2, 3, 34, 42, 49, 63],
             'Kalloori': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
             'Thenmerku Paruvakaatru': [0, 1, 2, 3, 4, 5, 6, 8],
             'A Little Dream': [4, 5, 6, 23, 28, 84],
             'Paradesi': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                          21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
                          ... (output truncated)


In [4]:
index_path = "/scratch/ddr8143/multiqa/indexes/full_page_qampari_wikidata_index"

In [5]:
from pyserini.search.lucene import LuceneSearcher
from pyserini.index.lucene import IndexReader

In [6]:
searcher = LuceneSearcher(index_path)

In [7]:
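def query_question(searcher, qdata, k):
    """Search with the question text; map each answer to the ranks of the hits that contain it."""
    q = qdata['question_text']
    alist = [a['answer_text'] for a in qdata['answer_list']]
    hits = searcher.search(q, k=k)
    # Lowercase each hit once instead of once per answer.
    raw_hits = [hit.raw.lower() for hit in hits]
    ans_contained = defaultdict(list)
    for a in alist:
        # Touch the key so answers with no matching hit still appear with an empty list.
        ans_contained[a]
        for i, raw in enumerate(raw_hits):
            # Case-insensitive substring match against the raw stored page.
            if a.lower() in raw:
                ans_contained[a].append(i)
    return ans_contained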

In [8]:
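def query_answers(searcher, qdata, k):
    """Search with each answer as the query; map the answer to the ranks of the hits that contain it."""
    alist = [a['answer_text'] for a in qdata['answer_list']]
    ans_hits = defaultdict(list)
    for a in alist:
        # Touch the key so answers with no matching hit still appear with an empty list.
        ans_hits[a]
        hits = searcher.search(a, k=k)
        for i, hit in enumerate(hits):
            # Case-insensitive substring match against the raw stored page.
            if a.lower() in hit.raw.lower():
                ans_hits[a].append(i)
    return ans_hits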
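
The bare `ans_contained[a]` and `ans_hits[a]` statements in the two functions above look like no-ops, but with a `defaultdict` a plain lookup inserts the missing key. That is what makes answers with zero hits show up as empty lists in the saved results. A small standalone illustration, where the variable name `hits_for_answer` is just for this example:

```python
import json
from collections import defaultdict

hits_for_answer = defaultdict(list)
hits_for_answer["Kalloori"]           # lookup alone inserts 'Kalloori': []
hits_for_answer["To Let"].append(16)  # normal use

# A defaultdict serializes like a regular dict, empty lists included.
as_json = json.dumps(hits_for_answer)
print(as_json)  # {"Kalloori": [], "To Let": [16]}
```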

In [None]:
hit_data_path = "/scratch/ddr8143/multiqa/qampari_data/hit_data.jsonl"
# The context manager closes the file even if a query fails partway through.
with jsonlines.open(hit_data_path, mode='w') as writer:
    for qd in qp_simple_data:
        qq100 = query_question(searcher, qd, 100)
        qq500 = query_question(searcher, qd, 500)
        qa100 = query_answers(searcher, qd, 100)
        qdata_out = {
            "question_data": qd,
            "query_question_contains_answer_k100": qq100,
            "query_question_contains_answer_k500": qq500,
            "query_answers_contains_answer_k100": qa100,
        }
        writer.write(qdata_out)

In [13]:
#!rm /scratch/ddr8143/multiqa/qampari_data/hit_data.jsonl

In [31]:
# jsonlines.open closes the underlying file when the block exits.
with jsonlines.open("/scratch/ddr8143/multiqa/qampari_data/hit_data.jsonl") as reader:
    all_hits_data = list(reader)

In [32]:
len(all_hits_data)

28574

In [34]:
all_hits_data[0]['question_data']['question_text']

'Which movie, clip, TV show etc. had Chezhiyan as director of photography?'

In [35]:
all_hits_data[0].keys()

dict_keys(['question_data', 'query_question_contains_answer_k100', 'query_question_contains_answer_k500', 'query_answers_contains_answer_k100'])

In [37]:
all_hits_data[0]['query_question_contains_answer_k100']

{'To Let': [16, 25, 38],
 'Kalloori': [],
 'Thenmerku Paruvakaatru': [],
 'A Little Dream': [],
 'Paradesi': [],
 'Sawaari': [55]}

In [38]:
all_hits_data[0]['query_question_contains_answer_k500']

{'To Let': [16, 25, 38, 155, 206, 210, 257, 321, 408],
 'Kalloori': [],
 'Thenmerku Paruvakaatru': [],
 'A Little Dream': [],
 'Paradesi': [],
 'Sawaari': [55]}
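
In the two outputs above, each k=100 list is exactly the part of the k=500 list below rank 100. If that holds in general (it should whenever the ranking for a fixed query is deterministic, which I am assuming here), the k=100 results could be derived from the k=500 search instead of running a second query. A sketch, with the hypothetical helper `restrict_to_top_k`:

```python
def restrict_to_top_k(ranks_by_answer, k):
    """Keep only the hit ranks below k for every answer."""
    return {a: [r for r in ranks if r < k] for a, ranks in ranks_by_answer.items()}


# Subset of the k=500 output shown above.
top_500 = {
    'To Let': [16, 25, 38, 155, 206, 210, 257, 321, 408],
    'Kalloori': [],
    'Sawaari': [55],
}
top_100 = restrict_to_top_k(top_500, 100)
print(top_100)  # {'To Let': [16, 25, 38], 'Kalloori': [], 'Sawaari': [55]}
```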

In [41]:
all_hits_data[0]['query_answers_contains_answer_k100'];

In [None]:
# Num missing in top 100
# Num missing in top 500
# Num that don't show up in their own bm25 search

In [43]:
all_ans = []
missing_top_100 = []
missing_top_500 = []
missing_in_own = []
for hit_data in all_hits_data:
    top_100 = hit_data['query_question_contains_answer_k100']
    top_500 = hit_data['query_question_contains_answer_k500']
    top_100_own = hit_data['query_answers_contains_answer_k100']
    all_ans.append(list(top_100))
    # An answer counts as missing when no retrieved page contains it.
    missing_top_100.append([a for a, v in top_100.items() if len(v) == 0])
    missing_top_500.append([a for a, v in top_500.items() if len(v) == 0])
    missing_in_own.append([a for a, v in top_100_own.items() if len(v) == 0])

In [63]:
k = 100
print(all_hits_data[k]['question_data']['question_text'])
print('             all answers:', len(all_ans[k]), all_ans[k])
print('    missing from top 100:', len(missing_top_100[k]), missing_top_100[k])
print('    missing from top 500:', len(missing_top_500[k]), missing_top_500[k])
print(' missing from own search:', len(missing_in_own[k]), missing_in_own[k])

Which software, art, etc. has Charles McPherson as performer?
             all answers: 16 ['Come Play with Me (album)', 'Con Alma!', 'Live in Tokyo', 'Bebop Revisited!', "McPherson's Mood", 'The Quintet/Live!', 'Free Bop!', 'From This Moment On!', "Today's Man", 'Beautiful!', 'Charles McPherson', 'Horizons', 'First Flight Out', 'Manhattan Nocturne (album)', 'Siku Ya Bibi', 'New Horizons']
    missing from top 100: 11 ['Con Alma!', 'Live in Tokyo', "McPherson's Mood", 'The Quintet/Live!', 'Free Bop!', 'From This Moment On!', "Today's Man", 'Beautiful!', 'First Flight Out', 'Siku Ya Bibi', 'New Horizons']
    missing from top 500: 4 ['Con Alma!', "Today's Man", 'Beautiful!', 'Siku Ya Bibi']
 missing from own search: 5 ['Free Bop!', 'From This Moment On!', 'Beautiful!', 'First Flight Out', 'Siku Ya Bibi']
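
Several answers above carry a Wikipedia-style disambiguator such as `(album)`, and the data also provides an `aliases` list per answer. One thing that might be worth checking is whether strict matching on `answer_text` alone undercounts hits. The sketch below is only a possible diagnostic, not the method used in this notebook. The helper names and the alias in the example are made up.

```python
import re


def answer_variants(answer):
    """Collect lowercase strings that could count as a match for one answer."""
    names = [answer['answer_text']] + list(answer.get('aliases', []))
    variants = set()
    for name in names:
        variants.add(name.lower())
        # Also try the name without a trailing "(...)" disambiguator.
        stripped = re.sub(r'\s*\([^)]*\)\s*$', '', name)
        if stripped:
            variants.add(stripped.lower())
    return variants


def contains_answer(answer, page_text):
    """True if any variant of the answer occurs in the page text."""
    text = page_text.lower()
    return any(v in text for v in answer_variants(answer))


# The alias here is invented for the example.
ans = {'answer_text': 'Come Play with Me (album)', 'aliases': ['Come Play with Me']}
print(sorted(answer_variants(ans)))
print(contains_answer(ans, "He recorded Come Play With Me."))
```

Looser matching cuts both ways: short titles such as 'Horizons' probably already match unrelated pages, so this could inflate hit counts as easily as it recovers true ones.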


In [69]:
perc_miss_top_100 = [len(missing_top_100[k]) * 100.0 / len(all_ans[k]) for k in range(len(all_ans))]
perc_miss_top_500 = [len(missing_top_500[k]) * 100.0 / len(all_ans[k]) for k in range(len(all_ans))]
perc_miss_own_100 = [len(missing_in_own[k]) * 100.0 / len(all_ans[k]) for k in range(len(all_ans))]

In [70]:
print(max(perc_miss_top_100), min(perc_miss_top_100))
print(max(perc_miss_top_500), min(perc_miss_top_500))
print(max(perc_miss_own_100), min(perc_miss_own_100))

100.0 0.0
100.0 0.0
100.0 0.0


In [72]:
print(sum(perc_miss_top_100)/len(perc_miss_top_100))

49.667740914731


In [73]:
print(sum(perc_miss_top_500)/len(perc_miss_top_500))

41.48405195593409


In [75]:
print(sum(perc_miss_own_100)/len(perc_miss_own_100))

26.530668766019904
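
The three averages above are means of per-question percentages, so a question with 2 answers weighs as much as one with 16. Pooling all answers before dividing can give a noticeably different number, and comparing the two might help localize the problem. A toy sketch with made-up data, where the function names are mine:

```python
def macro_miss_pct(missing, total):
    """Mean of per-question miss percentages (what the cells above compute)."""
    per_question = [len(m) * 100.0 / len(t) for m, t in zip(missing, total)]
    return sum(per_question) / len(per_question)


def micro_miss_pct(missing, total):
    """Missed answers over all answers, pooled across questions."""
    return sum(len(m) for m in missing) * 100.0 / sum(len(t) for t in total)


# Toy data: one question with 1 of 2 answers missing, one with 0 of 8 missing.
toy_all_ans = [['a', 'b'], ['c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']]
toy_missing = [['a'], []]
print(macro_miss_pct(toy_missing, toy_all_ans))  # 25.0
print(micro_miss_pct(toy_missing, toy_all_ans))  # 10.0
```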
