# error stats

## vanilla SBERT
### Recall improvement
- add title for dense embeddings: 46.86% -> 50.75%
    - additional correct questions: 64
    - relative boost 8.3%
    - increase 3.89 pts
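
One way to picture the difference between the two embedding inputs is the sketch below. It assumes the title is simply prepended to the paragraph text before encoding; `build_passage` and the `sep` separator are hypothetical, and the exact input format used in these experiments is not shown in this notebook.

```python
from typing import Optional


def build_passage(text: str, section_title: Optional[str] = None, sep: str = " [SEP] ") -> str:
    """Return the string that would be embedded for one paragraph.

    Hypothetical helper: with a title, the title is prepended (title-augmented
    input); without one, the paragraph text is used alone ("notitle").
    """
    if section_title:
        return f"{section_title}{sep}{text}"
    return text


# Paragraph and section name taken from the full-title case later in this notebook.
paragraph = (
    "We use three types of units: (1) words (2) characters and character "
    "sequences and (3) outputs of morphological analysis."
)
print(build_passage(paragraph))                   # notitle input
print(build_passage(paragraph, "Subword Units"))  # title-augmented input
```

Keeping the separator as a parameter makes it easy to try different title formats without touching the encoding step.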

### Answer performance boost
- unified: 35.86 -> 36.69
    - relative boost 2.3%
    - increase 0.83 pts
- unifiedlarge: 31.99 -> 32.05
    - no obvious improvement
- llama3: 36.22 -> 37.20
    - relative boost 2.7%
    - increase 0.98 pts
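
The absolute and relative gains can be recomputed from the before/after scores listed above. The sketch below is plain arithmetic; `improvement` is an illustrative helper, not part of the project code.

```python
from typing import Tuple


def improvement(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    gain = after - before
    return round(gain, 2), round(gain / before * 100, 1)


# Before/after scores copied from the notes above.
scores = {
    "recall (add title)": (46.86, 50.75),
    "unified": (35.86, 36.69),
    "unifiedlarge": (31.99, 32.05),
    "llama3": (36.22, 37.20),
}
for name, (before, after) in scores.items():
    points, relative = improvement(before, after)
    print(f"{name}: +{points} pts, +{relative}% relative")
```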

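multisection error = low recall caused by failing to retrieve all of the passages that come from different sections (reported as the share of multisection questions among all error cases)
Number of multisection errors for notitle SBERT: 138
Number of multisection errors for full-title SBERT: 134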
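The mode-specific failures do not overlap: 21 questions fail only with full titles, which implies that 134 - 21 = 113 questions fail in both modes and 138 - 113 = 25 fail only without titles.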

If the multisection problem can be solved, the number of correctly answered questions can be expected to increase by 130+. Recall is expected to double.

In [None]:
from collections import Counter
import argparse
import string
import re
import json
import jsonlines
from pathlib import Path
from typing import List, Dict, Set, Union
import numpy as np

import my_evaluator
import utils

In [3]:
# load required test data files

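# gold data
gold_path: Path = Path("qasper/test_gold.json")

with open(gold_path) as f:
    gold_data = json.load(f)
# The import cell only provides `my_evaluator`, so the helper is called under that name.
gold_answers_and_evidence = my_evaluator.get_answers_and_evidence(
    gold_data, True
)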

# processed test data
processed_paper_path: Path = Path("qasper/test_papers.json")
test_papers: Dict[str, Dict] = utils.load_json(processed_paper_path)
processed_questions_path: Path = Path("qasper/test_questions.json")
test_questions: Dict[str, Dict] = utils.load_json(processed_questions_path)

print(len(test_papers))
print(len(gold_answers_and_evidence))
print(len(test_questions))

416
1451
1451


In [4]:
for i, data in enumerate(test_papers.values()):
    if i == 3:
        break
    print(data)

{'1': {'section_name': 'Introduction', 'section_description': 'The introduction of the paper "End-to-End Trainable Non-Collaborative Dialog System" sets the stage for the remainder of the paper by highlighting the difference between collaborative and non-collaborative dialog systems. It discusses how collaborative dialog systems have achieved notable successes in tasks where users work together with the system to achieve a common goal. In contrast, non-collaborative dialog systems, such as those found in negotiation and persuasion, do not share the same goal. The introduction notes that users in non-collaborative settings often employ social content to build rapport and trust, which is not explicitly addressed in current research. The authors aim to address this gap by introducing a hierarchical intent annotation scheme and a neural network model that generates diverse and coherent responses, which can be applied to various non-collaborative dialog tasks. The introduction provides a cl

In [8]:
# Maximum number of gold evidence paragraphs in any single annotation
max_gold_paras_count = 0
for question_id, references in gold_answers_and_evidence.items():
    for i, possible_ans in enumerate(references):
        gold_paras_count = len(possible_ans['evidence'])
        if gold_paras_count > max_gold_paras_count:
            max_gold_paras_count = gold_paras_count
        print(f"[{question_id}] # of gold paras for annotator {i}: {gold_paras_count}")

print(max_gold_paras_count)

[397a1e851aab41c455c2b284f5e4947500d797f0] # of gold paras for annotator 0: 1
[397a1e851aab41c455c2b284f5e4947500d797f0] # of gold paras for annotator 1: 1
[397a1e851aab41c455c2b284f5e4947500d797f0] # of gold paras for annotator 2: 1
[397a1e851aab41c455c2b284f5e4947500d797f0] # of gold paras for annotator 3: 1
[397a1e851aab41c455c2b284f5e4947500d797f0] # of gold paras for annotator 4: 1
[397a1e851aab41c455c2b284f5e4947500d797f0] # of gold paras for annotator 5: 1
[cc8b4ed3985f9bfbe1b5d7761b31d9bd6a965444] # of gold paras for annotator 0: 4
[cc8b4ed3985f9bfbe1b5d7761b31d9bd6a965444] # of gold paras for annotator 1: 2
[cc8b4ed3985f9bfbe1b5d7761b31d9bd6a965444] # of gold paras for annotator 2: 2
[cc8b4ed3985f9bfbe1b5d7761b31d9bd6a965444] # of gold paras for annotator 3: 1
[cc8b4ed3985f9bfbe1b5d7761b31d9bd6a965444] # of gold paras for annotator 4: 3
[cc8b4ed3985f9bfbe1b5d7761b31d9bd6a965444] # of gold paras for annotator 5: 1
[f7662b11e87c1e051e13799413f3db459ac3e19c] # of gold paras for a

In [None]:
# Average number of answer references per question

all_gold_sections: Dict[str, List[List[str]]] = {}
for question_id, references in gold_answers_and_evidence.items():
    paper_id: str = test_questions[question_id]["from_paper"]
    all_gold_sections[question_id] = []
    for reference in references:
        gold_sections: List[str] = my_evaluator.get_sections(reference["evidence"], test_papers, paper_id)
        all_gold_sections[question_id].append(gold_sections)
reference_counts: List[int] = [len(answers) for answers in all_gold_sections.values()]
print(f"Total questions: {len(all_gold_sections)}")
print(f"Max number of answer references per question: {max(reference_counts)}")
print(f"Average number of answer references per question: {np.mean(reference_counts):.2f}")
print(f"Median number of answer references per question: {np.median(reference_counts):.2f}")

KeyError: 'section_name'

In [5]:
# Total questions that require answer from multiple sections

all_section_occurences: Dict[str, List[int]] = {}
for question_id, references in all_gold_sections.items():
    all_section_occurences[question_id] = [len(sections) for sections in references]

# Keep a question if at least one annotator's evidence spans more than one section.
multisection_occurences: Dict[str, List[int]] = {
    qid: occurs
    for qid, occurs in all_section_occurences.items()
    if any(occur_count > 1 for occur_count in occurs)
}
print(f"Total questions that require answer from multiple sections: {len(multisection_occurences)}")
print(f"Percentage: {len(multisection_occurences)/len(all_section_occurences)*100:.2f}%")

Total questions that require answer from multiple sections: 276
Percentage: 19.02%


In [6]:
all_gold_sections["55bafa0f7394163f4afd1d73340aac94c2d9f36c"]

[['Training and Evaluation Data', 'Experiments'],
 ['Introduction', 'Training and Evaluation Data']]
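
The multisection test used above can be illustrated on this one question. The sketch below is a minimal standalone version; `section_counts` and `is_multisection` are illustrative names, and the section lists are the ones printed for this question ID.

```python
from typing import List


def section_counts(references: List[List[str]]) -> List[int]:
    """Number of gold sections cited by each annotator."""
    return [len(sections) for sections in references]


def is_multisection(references: List[List[str]]) -> bool:
    """True if at least one annotator cites evidence from more than one section."""
    return any(count > 1 for count in section_counts(references))


refs = [
    ["Training and Evaluation Data", "Experiments"],
    ["Introduction", "Training and Evaluation Data"],
]
print(section_counts(refs), is_multisection(refs))
```

Note that the two annotators agree on only one section here, so even a retriever that finds every section one annotator cited may miss part of the other annotator's evidence.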

In [7]:
# Questions for which at least two annotators cite evidence from more than one section
for qid, occurs in multisection_occurences.items():
    multi_hit = sum(1 for occ in occurs if occ > 1)
    if multi_hit >= 2:
        print(f"Question `{qid}`: {occurs}")

Question `cc8b4ed3985f9bfbe1b5d7761b31d9bd6a965444`: [3, 1, 2, 1, 1, 1]
Question `f7662b11e87c1e051e13799413f3db459ac3e19c`: [2, 2, 1, 1, 1, 1]
Question `b584739622d0c53830e60430b13fd3ae6ff43669`: [2, 2, 1, 2, 2]
Question `fe52b093735bb456d7e699aa9a2b806d2b498ba0`: [2, 1, 2, 1]
Question `8b4bd0a962241ea548752212ebac145e2ced7452`: [2, 1, 2, 3]
Question `371433bd3fb5042bacec4dfad3cfff66147c14f0`: [2, 1, 1, 2]
Question `c19e9fd2f1c969e023fb99b74e78eb1f3db8e162`: [2, 1, 2, 2]
Question `6c50871294562e4886ede804574e6acfa8d1a5f9`: [2, 0, 4, 0]
Question `5ed02ae6c534cd49d405489990f0e4ba0330ff1b`: [2, 0, 2]
Question `935873b97872820b7b6100d6a785fba286b94900`: [0, 2, 2, 1]
Question `108f99fcaf620fab53077812e8901870896acf36`: [3, 0, 3, 2]
Question `ebe1084a06abdabefffc66f029eeb0b69f114fd9`: [2, 2, 2]
Question `cfdd583d01abaca923f5c466bb20e1d4b8c749ff`: [2, 3, 3]
Question `c0355afc7871bf2e12260592873ffdb5c0c4c919`: [2, 2, 1]
Question `67cb001f8ca122ea859724804b41529fea5faeef`: [2, 1, 2]
Question `

In [None]:
RETRIEVER = "sbert"
READER = "unified"
TOPK = 3
modes = ["notitle", "full"]

error_cases: Dict[str, Dict[str, Dict]] = {}
for mode in modes:
    error_case_path: Path = Path(f"results/{RETRIEVER}-{READER}-{mode}-top{TOPK}/low_recall_cases.jsonl")
    with jsonlines.open(error_case_path) as reader:
        multisection_errors: Dict[str, Dict] = {}
        all_error_cases: List[Dict] = []
        for question in reader:
            all_error_cases.append(question)
            question_id = question["qid"]
            if question_id in multisection_occurences:
                multisection_errors[question_id] = {
                    "question_text": question["question"],
                    "from_paper": question["from_paper"],
                    "gold": question["gold"],
                    "gold_section": question["gold_section"],
                    "predicted": question["predicted"],
                    "predicted_section": question["predicted_section"],
                }
        error_cases[mode] = multisection_errors

    print(f"\n=================  {RETRIEVER} - {mode} ========================")
    print(f"Total error cases that require answer from multiple sections: {len(multisection_errors)}")
    print(f"Percentage (compared to all multisection questions): {len(multisection_errors)/len(multisection_occurences)*100:.2f}%")
    print(f"Percentage (compared to all error cases): {len(multisection_errors)/len(all_error_cases)*100:.2f}%")
    print(f"Percentage (compared to all questions): {len(multisection_errors)/len(all_section_occurences)*100:.2f}%")
    
# export error cases
output_file: Path = Path(f"demo/multisection_error_cases_{READER}.json")
with open(output_file, "w+") as f:
    json.dump(error_cases, f, indent=4)


=================  sbert - notitle ========================
Total error cases that require answer from multiple sections: 138
Percentage (compared to all multisection questions): 50.00%
Percentage (compared to all error cases): 18.45%
Percentage (compared to all questions): 9.51%

=================  sbert - full ========================
Total error cases that require answer from multiple sections: 134
Percentage (compared to all multisection questions): 48.55%
Percentage (compared to all error cases): 19.59%
Percentage (compared to all questions): 9.24%
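
The printed percentages also let the total number of low-recall cases per mode be recovered, even though the cell does not print it. The sketch below is a back-of-the-envelope reconstruction from rounded percentages, not a value read from the result files; `implied_total` is an illustrative helper.

```python
def implied_total(count: int, percent: float) -> int:
    """Integer nearest to count / (percent / 100)."""
    return round(count / (percent / 100))


# Counts and "compared to all error cases" percentages copied from the output above.
total_errors_notitle = implied_total(138, 18.45)
total_errors_full = implied_total(134, 19.59)

print(total_errors_notitle, total_errors_full)
print(total_errors_notitle - total_errors_full)
```

Under this reading the totals come out at 748 and 684 low-recall cases, a difference of 64, which is consistent with the 64 additional correct questions noted at the top of these notes.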


In [9]:
notitle_qids: Set[str] = set(error_cases["notitle"])
full_qids: Set[str] = set(error_cases["full"])

# only_in_full: Set[str] = full_qids - notitle_qids
# for qid in only_in_full:
#     print(qid)
# print("====================")
# only_in_notitle: Set[str] = notitle_qids - full_qids
# for qid in only_in_notitle:
#     print(qid)

# fails with notitle but not with full title (fixed by adding titles)
qids_better_in_full: List[str] = []

for qid in notitle_qids:
    if qid not in full_qids:
        qids_better_in_full.append(qid)
        print(qid)

print("====================")
print(len(qids_better_in_full))

bd817a520a62ddd77e65e74e5a7e9006cdfb19b3
8126c6b8a0cab3e22661d3d71d96aa57360da65c
477da8d997ff87400c6aad19dcc74f8998bc89c3
1a419468d255d40ae82ed7777618072a48f0091b
de4e949c6917ff6933f5fa2a3062ba703aba014c
ab37ae82e38f64d3fa95782f2c791488f26cd43f
c58ef13abe5fa91a761362ca962d7290312c74e4
91e361e85c6d3884694f3c747d61bfcef171bab0
aa287673534fc05d8126c8e3486ca28821827034
230f127e83ac62dd65fccf6b1a4960cf0f7316c7
8c89f1d1b3c2a45c0254c4c8d6e700ab9a4b4ffb
3fd8eab282569b1c18b82f20d579b335ae70e79f
de0154affd86c608c457bf83d888bbd1f879df93
1ec0be667a6594eb2e07c50258b120e693e040a8
330fe3815f74037a9be93a4c16610c736a2a27b3
d0dc6729b689561370b6700b892c9de8871bb44d
e8fa4303b36a47a5c87f862458442941bbdff7d9
4367617c0b8c9f33051016e8d4fbb44831c54d0f
14fdc8087f2a62baea9d50c4aa3a3f8310b38d17
58a3cfbbf209174fcffe44ce99840c758b448364
1f8044487af39244d723582b8a68f94750eed2cc
f8264609a44f059b74168995ffee150182a0c14f
2ca3ca39d59f448e30be6798514709be7e3c62d8
47d54a6dd50cab8dab64bfa1f9a1947a8190080c
f94cea545f745994

In [10]:
# fails with full title but not with notitle (broken by adding titles)

qids_better_in_notitle: List[str] = []

for qid in full_qids:
    if qid not in notitle_qids:
        qids_better_in_notitle.append(qid)
        print(qid)

print("====================")
print(len(qids_better_in_notitle))

2bd702174e915d97884d1571539fb1b5b0b7123a
c176eb1ccaa0e50fb7512153f0716e60bf74aa53
3d662fb442d5fc332194770aac835f401c2148d9
a98ae529b47362f917a398015c8525af3646abf0
87c00edc497274ae6a972c3097818de85b1b384f
7239c02a0dcc0c3c9d9cddb5e895bcf9cfcefee6
45e6532ac06a59cb6a90624513242b06d7391501
7380e62edcb11f728f6d617ee332dc8b5752b185
887d7f3edf37ccc6bf2e755dae418b04d2309686
35c01dc0b50b73ee5ca7491d7d373f6e853933d2
1e11e74481ead4b7635922bbe0de041dc2dde28d
1f053f338df6d238cb163af1a0b1b073e749ed8a
5d03a82a70f7b1ab9829891403ec31607828cbd5
b6e97d1b1565732b1b3f1d74e6d2800dd21be37a
7ece07a84635269bb19796497847e4517d1e3e61
5ae005917efc17a505ba1ba5e996c4266d6c74b6
78c7318b2218b906a67d8854f3e511034075f79a
a8f189fad8b72f8b2b4d2da4ed8475d31642d9e7
275b2c22b6a733d2840324d61b5b101f2bbc5653
9da181ac8f2600eb19364c1b1e3cdeb569811a11
d38b3e0896b105d171e69ce34c689e4a7e934522
====================
21


In [11]:
# Sanity check: the two mode-specific lists are disjoint by construction, so this should be empty.
qids_better_in_both: List[str] = list(set(qids_better_in_full).intersection(set(qids_better_in_notitle)))

for qid in qids_better_in_both:
    print(qid)

print("====================")
print(len(qids_better_in_both))

====================
0


In [12]:
qids_fail_in_both: List[str] = list(notitle_qids.intersection(full_qids))

for qid in qids_fail_in_both:
    print(qid)

print("====================")
print(len(qids_fail_in_both))

c2ce25878a17760c79031a426b6f38931cd854b2
75c221920bee14a6153bd5f4c1179591b2f48d59
5ed02ae6c534cd49d405489990f0e4ba0330ff1b
344238de7208902f7b3a46819cc6d83cc37448a0
6cad6f074b0486210ffa4982c8d1632f5aa91d91
649e77ac2ecce42ab2efa821882675b5a0c993cb
f1f7a040545c9501215d3391e267c7874f9a6004
98b97d24f31e9c535997e9b6cb126eb99fc72a90
31e6062ba45d8956791e1b86bad7efcb6d1b191a
058b6e3fdbb607fa7dbfc688628b3e13e130c35a
101d7a355e8bf6d1860917876ee0b9971eae7a2f
4e2b12cfc530a4682b06f8f5243bc9f64bd41135
f7b91b99279833f9f489635eb8f77c6d13136098
4288621e960ffbfce59ef1c740d30baac1588b9b
71413505d7d6579e2a453a1f09f4efd20197ab4b
ce8d8de78a21a3ba280b658ac898f73d0b52bf1b
b0e894536857cb249bd75188c3ca5a04e49ff0b6
415014a5bcd83df52c9307ad16fab1f03d80f705
42279c3a202a93cfb4aef49212ccaf401a3f8761
90eeb1b27f84c83ffcc8a88bc914a947c01a0c8b
23cbf6ab365c1eb760b565d8ba51fb3f06257d62
654306d26ca1d9e77f4cdbeb92b3802aa9961da1
ecaa10a2d9927fa6ab6a954488f12aa6b42ddc1a
c035a011b737b0a10deeafc3abe6a282b389d48b
a3a867f7b3557c16
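
The four comparison cells above amount to splitting two error sets into the regions of a Venn diagram. The sketch below shows that split on toy IDs and then derives the region sizes from the counts reported in this notebook; `partition_errors` and `implied_sizes` are illustrative helpers, and the derived sizes are implied by the printed counts rather than read from a printed total.

```python
from typing import Dict, Set


def partition_errors(notitle: Set[str], full: Set[str]) -> Dict[str, Set[str]]:
    """Split two error sets into the three regions of their Venn diagram."""
    return {
        "only_notitle": notitle - full,
        "only_full": full - notitle,
        "both": notitle & full,
    }


def implied_sizes(n_notitle: int, n_full: int, n_only_full: int) -> Dict[str, int]:
    """Derive the region sizes from two set sizes and one known difference."""
    n_both = n_full - n_only_full
    return {
        "only_notitle": n_notitle - n_both,
        "only_full": n_only_full,
        "both": n_both,
    }


# Toy IDs (not real question IDs) to show the partition.
regions = partition_errors({"q1", "q2", "q3"}, {"q2", "q3", "q4"})
print({name: sorted(ids) for name, ids in regions.items()})

# Sizes reported in this notebook: 138 notitle errors, 134 full errors, 21 only in full.
print(implied_sizes(138, 134, 21))
```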

# error case similarities

In [None]:
import os

# Set the cache location before importing the Hugging Face libraries so that it takes effect.
os.environ['HF_HOME'] = '/workspace/P76125041/.cache/'

import json
from pathlib import Path
from typing import List, Dict, Tuple, Optional, Union
from sentence_transformers import SentenceTransformer, util
import jsonlines
import torch
import re
# from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import time
import argparse

import utils



In [14]:
embedding_model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1", cache_folder="/workspace/P76125041/.cache/")

## NOTITLE CASE 1

In [15]:
MODE = "notitle"
TOPK = 10
paper_id = "2004.04124"
paper_para_embeddings: Dict[str, List[List[float]]] = utils.load_json(Path(f"qasper/embeddings/test_embeddings_sbert_{MODE}.json"))
test_papers: Dict[str, Dict] = utils.load_json(Path("qasper/test_papers.json"))
raw_paras: List[str] = [para["text"] for para in test_papers[paper_id].values()]

In [16]:
question = "Does LadaBERT ever outperform its knowledge distillation teacher in terms of accuracy on some problems?"

# from section: 
    #  Lightweight Adaptation of BERT ::: Overview
    #  Experiments ::: Performance Comparison
gold = [
    "The overall pipeline of LadaBERT (Lightweight Adaptation of BERT) is illustrated in Figure FIGREF8. As shown in the figure, the pre-trained BERT model (e.g., BERT-Base) is served as the teacher as well as the initial status of the student model. Then, the student model is compressed towards smaller parameter size through a hybrid model compression framework in an iterative manner until the target compression ratio is reached. Concretely, in each iteration, the parameter size of student model is first reduced by $1-\\Delta $ based on weight pruning and matrix factorization, and then the parameters are fine-tuned by the loss function of knowledge distillation. The motivation behind is that matrix factorization and weight pruning are complementary with each other. Matrix factorization calculates the optimal approximation under a certain rank, while weight pruning introduces additional sparsity to the decomposed matrices. Moreover, weight pruning and matrix factorization generates better initial and intermediate status of the student model, which improve the efficiency and effectiveness of knowledge distillation. In the following subsections, we will introduce the algorithms in detail.",
    
    "The evaluation results of LadaBERT and state-of-the-art approaches are listed in Table TABREF40, where the models are ranked by parameter sizes for feasible comparison. As shown in the table, LadaBERT consistently outperforms the strongest baselines under similar model sizes. In addition, the performance of LadaBERT demonstrates the superiority of hybrid combination of SVD-based matrix factorization, weight pruning and knowledge distillation."
]

# from section:
    # Experiments ::: Performance Comparison
    # Introduction
    # Experiments ::: Learning curve comparison
predicted = [
    "With model size of $2.5\\times $ reduction, LadaBERT-1 performs significantly better than BERT-PKD, boosting the performance by relative 8.9, 8.1, 6.1, 3.8 and 5.8 percentages on MNLI-m, MNLI-mm, SST-2, QQP and QNLI datasets respectively. Recall that BERT-PKD initializes the student model by selecting 3 of 12 layers in the pre-trained BERT-Base model. It turns out that the discarded layers have huge impact on the model performance, which is hard to be recovered by knowledge distillation. On the other hand, LadaBERT generates the student model by iterative pruning on the pre-trained teacher. In this way, the original knowledge in the teacher model can be preserved to the largest extent, and the benefit of which is complementary to knowledge distillation.",
    
    "To further demonstrate the efficiency of LadaBERT, we visualize the learning curves on MNLI-m and QQP datasets in Figure FIGREF42 and FIGREF42, where LadaBERT-3 is compared to the strongest baseline, TinyBERT, under $7.5 \\times $ compression ratio. As shown in the figures, LadaBERT-3 achieves good performances much faster and results in a better convergence point. After training $2 \\times 10^4$ steps (batches) on MNLI-m dataset, the performance of LadaBERT-3 is already comparable to TinyBERT after convergence (approximately $2 \\times 10^5$ steps), achieving nearly $10 \\times $ acceleration. And on QQP dataset, both performance improvement and training speed acceleration is very significant. This clearly shows the superiority of combining matrix factorization, weight pruning and knowledge distillation in a reinforce manner. Instead, TinyBERT is based on pure knowledge distillation, so the learning speed is much slower.",
    
    "We conduct extensive experiments on five public datasets of natural language understanding. As an example, the performance comparison of LadaBERT and state-of-the-art models on MNLI-m dataset is illustrated in Figure FIGREF1. We can see that LadaBERT outperforms other BERT-oriented model compression baselines at various model compression ratios. Especially, LadaBERT-1 outperforms BERT-PKD significantly under $2.5\\times $ compression ratio, and LadaBERT-3 outperforms TinyBERT under $7.5\\times $ compression ratio while the training speed is accelerated by an order of magnitude."
    ]



question_embedding: List[float] = embedding_model.encode([question])
gold_embeddings: List[List[float]] = embedding_model.encode(gold)
predicted_embeddings: List[List[float]] = embedding_model.encode(predicted)

gold_similarity : List[float] = util.dot_score(question_embedding, gold_embeddings)[0].cpu().tolist()
predicted_similarity : List[float] = util.dot_score(question_embedding, predicted_embeddings)[0].cpu().tolist()

print(f"Question: {question}")
print(f"Similarity scores of golden evidences:{gold_similarity}")
print(f"Similarity scores of predicted evidences:{predicted_similarity}")

print("+"*100)
all_stored_embeddings = paper_para_embeddings[paper_id]
all_stored_similarity: List[float] = util.dot_score(question_embedding, all_stored_embeddings)[0].cpu().tolist()
para_score_pairs = list(zip(raw_paras, all_stored_similarity))
topk_para_score_pairs: List[Tuple[str, float]] = sorted(para_score_pairs, key=lambda x: x[1], reverse=True)[:TOPK]
topk_paras: List[str] = [para for para, _ in topk_para_score_pairs]

for rank, (para, score) in enumerate(topk_para_score_pairs, 1):
    for gold_id, gold_para in enumerate(gold):
        if gold_para in para:
            print(f"Gold paragraph {gold_id} found in top{TOPK}! (score={score}), ranked {rank}")

Question: Does LadaBERT ever outperform its knowledge distillation teacher in terms of accuracy on some problems?
Similarity scores of golden evidences:[0.49882906675338745, 0.5327474474906921]
Similarity scores of predicted evidences:[0.6653413772583008, 0.6210382580757141, 0.5310972929000854]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Gold paragraph 1 found in top10! (score=0.5327473878860474), ranked 3
Gold paragraph 0 found in top10! (score=0.4988291561603546), ranked 7
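
With the gold paragraphs ranked 3rd and 7th, a small top-k cutoff recovers only part of the evidence. The sketch below is one simple way to express that as evidence recall at k; `evidence_recall_at_k` is a hypothetical helper, and the metric implemented in `my_evaluator` may be defined differently.

```python
from typing import List


def evidence_recall_at_k(gold_ranks: List[int], k: int) -> float:
    """Fraction of gold paragraphs whose 1-based rank falls within the top k."""
    if not gold_ranks:
        return 0.0
    hits = sum(1 for rank in gold_ranks if rank <= k)
    return hits / len(gold_ranks)


# Ranks of the two gold paragraphs, copied from the output above.
gold_ranks = [3, 7]
for k in (3, 5, 10):
    print(f"recall@{k} = {evidence_recall_at_k(gold_ranks, k):.2f}")
```

At the top-3 cutoff used for the error-case export, only one of the two gold paragraphs is retrieved, which is why this question counts as a low-recall case.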


In [17]:
# Compare question similarity for a full paragraph vs. only its answer-bearing first sentence
factoid_question = "Which tasks are explored in this paper?"
paragraph = "We have conducted experiments to compare SLRTM with several strong topic model baselines on two tasks: generative model evaluation (i.e. test set perplexity) and document classification. The results on several benchmark datasets quantitatively demonstrate SLRTM's advantages in modeling documents. We further provide some qualitative results on topic2sentence, the generated sentences for different topics clearly demonstrate the power of SLRTM in topic-sensitive short text conversations."
sentence = "We have conducted experiments to compare SLRTM with several strong topic model baselines on two tasks: generative model evaluation (i.e. test set perplexity) and document classification. "


question_embedding: List[List[float]] = embedding_model.encode([factoid_question])
para_embeddings: List[List[float]] = embedding_model.encode([paragraph])
sent_embeddings: List[List[float]] = embedding_model.encode([sentence])

para_sim : List[float] = util.dot_score(question_embedding, para_embeddings)[0].cpu().tolist()
sent_sim : List[float] = util.dot_score(question_embedding, sent_embeddings)[0].cpu().tolist()

In [18]:
print(para_sim)

[0.17536188662052155]


In [19]:
print(sent_sim)

[0.19177640974521637]
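
The scores above come from `util.dot_score`. Assuming the `cos-v1` model returns unit-length embeddings, as its model card states, the dot product equals cosine similarity. The pure-Python sketch below illustrates that identity on toy vectors; the helper names are illustrative only.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, not real embeddings.
q = [3.0, 4.0]
p = [4.0, 3.0]
print(cosine(q, p))
print(dot(normalize(q), normalize(p)))
```

If the embeddings were not normalized, longer texts could receive larger dot scores simply because of vector length, so the paragraph-versus-sentence comparison would not be a fair one.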


In [20]:
multi_question = "Which neural network architecture do they use for the dialog agent and user simulator?"
summary = "The text describes the designs of a dialog agent and a user simulator. The dialog agent tracks dialog state, performs API calls to knowledge bases (KB), and generates system actions and responses based on query results, utilizing an LSTM to maintain and update its state after each dialog turn. Similarly, the user simulator, initialized with a randomly sampled goal at the start of the conversation, also uses an LSTM to manage its state throughout the interaction."
para1 = "Figure 1 illustrates the design of the dialog agent. The dialog agent is capable of tracking dialog state, issuing API calls to knowledge bases (KB), and producing corresponding system actions and responses by incorporating the query results, which are key skill sets BIBREF26 in conducting task-oriented dialogs. State of the dialog agent is maintained in the LSTM BIBREF35 state and being updated after the processing of each turn."
para2 = "Figure 2 shows the design of the user simulator. User simulator is given a randomly sampled goal at the beginning of the conversation. Similar to the design of the dialog agent, state of the user simulator is maintained in the state of an LSTM. "
sbert1 = "Our contribution in this work is two-fold. Firstly, we propose an iterative dialog policy learning method that jointly optimizes the dialog agent and the user simulator in end-to-end trainable neural dialog systems. Secondly, we design a novel neural network based user simulator for task-oriented dialogs that can be trained in a data-driven manner without requiring the design of complex rules."
sbert2 = "In supervised pre-training, the dialog agent and the user simulator are trained separately against dialog corpus. We use the same set of neural network model configurations for both agents. Hidden layer sizes of the dialog-level LSTM for dialog modeling and utterance-level LSTM for utterance encoding are both set as 150. We perform mini-batch training using Adam optimization method BIBREF41 . Initial learning rate is set as 1e-3. Dropout BIBREF42 ( INLINEFORM0 ) is applied during model training to prevent to model from over-fitting."

question_embedding: List[List[float]] = embedding_model.encode([multi_question])
sum_embeddings: List[List[float]] = embedding_model.encode([summary])
p1_embeddings: List[List[float]] = embedding_model.encode([para1])
p2_embeddings: List[List[float]] = embedding_model.encode([para2])
sr1_embeddings: List[List[float]] = embedding_model.encode([sbert1])
sr2_embeddings: List[List[float]] = embedding_model.encode([sbert2])

summary_sim : List[float] = util.dot_score(question_embedding, sum_embeddings)[0].cpu().tolist()
p1_sim : List[float] = util.dot_score(question_embedding, p1_embeddings)[0].cpu().tolist()
p2_sim : List[float] = util.dot_score(question_embedding, p2_embeddings)[0].cpu().tolist()
sr1_sim : List[float] = util.dot_score(question_embedding, sr1_embeddings)[0].cpu().tolist()
sr2_sim : List[float] = util.dot_score(question_embedding, sr2_embeddings)[0].cpu().tolist()

print(summary_sim)
print(p1_sim)
print(p2_sim)
print(sr1_sim)
print(sr2_sim)

[0.6100060343742371]
[0.45549678802490234]
[0.5478793382644653]
[0.6787906885147095]
[0.5924627184867859]
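
Sorting the five scores makes the comparison easier to read. The sketch below uses the printed values rounded to four decimals and reads `para1`/`para2` as the gold paragraphs and `sbert1`/`sbert2` as the paragraphs SBERT retrieved, which is what the variable names suggest but is an assumption here.

```python
# Scores copied from the output above, rounded to four decimals.
scores = {
    "summary": 0.6100,
    "para1 (gold)": 0.4555,
    "para2 (gold)": 0.5479,
    "sbert1 (retrieved)": 0.6788,
    "sbert2 (retrieved)": 0.5925,
}

# Highest similarity first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for rank, (name, score) in enumerate(ranked, 1):
    print(f"{rank}. {name}: {score:.4f}")
```

Under that reading, the summary of the two gold paragraphs outscores either gold paragraph on its own, though one retrieved paragraph still ranks above it.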


In [22]:
multi_question = "Do they use external financial knowledge in their approach?"
# summary = "The text describes the designs of a dialog agent and a user simulator. The dialog agent tracks dialog state, performs API calls to knowledge bases (KB), and generates system actions and responses based on query results, utilizing an LSTM to maintain and update its state after each dialog turn. Similarly, the user simulator, initialized with a randomly sampled goal at the start of the conversation, also uses an LSTM to manage its state throughout the interaction."
gold1 = "The BLSTM models take as input a headline sentence of size L tokens where L is the length of the longest sentence in the training texts. Each word is converted into a 300 dimension vector using the word2vec model trained over the financial text. Any text that is not recognised by the word2vec model is represented as a vector of zeros; this is also used to pad out the sentence if it is shorter than L."
gold2 = "We additionally trained a word2vec BIBREF10 word embedding model on a set of 189,206 financial articles containing 161,877,425 tokens, that were manually downloaded from Factiva. The articles stem from a range of sources including the Financial Times and relate to companies from the United States only. We trained the model on domain specific data as it has been shown many times that the financial domain can contain very different language."
sbert1 = "Domain specific terminology is expected to play a key part in this task, as reporters, investors and analysts in the financial domain will use a specific set of terminology when discussing financial performance. Potentially, this may also vary across different financial domains and industry sectors. Therefore, we took an exploratory approach and investigated how various features and learning algorithms perform differently, specifically SVR and BLSTMs. We found that BLSTMs outperform an SVR without having any knowledge of the company that the sentiment is with respect to. For replicability purposes, with this paper we are releasing our source code and the finance specific BLSTM word embedding model."
sbert2 = "We are grateful to Nikolaos Tsileponis (University of Manchester) and Mahmoud El-Haj (Lancaster University) for access to headlines in the corpus of financial news articles collected from Factiva. This research was supported at Lancaster University by an EPSRC PhD studentship."

question_embedding: List[List[float]] = embedding_model.encode([multi_question])
# sum_embeddings: List[List[float]] = embedding_model.encode([summary])
p1_embeddings: List[List[float]] = embedding_model.encode([gold1])
p2_embeddings: List[List[float]] = embedding_model.encode([gold2])
sr1_embeddings: List[List[float]] = embedding_model.encode([sbert1])
sr2_embeddings: List[List[float]] = embedding_model.encode([sbert2])

# summary_sim : List[float] = util.dot_score(question_embedding, sum_embeddings)[0].cpu().tolist()
p1_sim : List[float] = util.dot_score(question_embedding, p1_embeddings)[0].cpu().tolist()
p2_sim : List[float] = util.dot_score(question_embedding, p2_embeddings)[0].cpu().tolist()
sr1_sim : List[float] = util.dot_score(question_embedding, sr1_embeddings)[0].cpu().tolist()
sr2_sim : List[float] = util.dot_score(question_embedding, sr2_embeddings)[0].cpu().tolist()

# print(summary_sim)
print(p1_sim)
print(p2_sim)
print(sr1_sim)
print(sr2_sim)

[0.11605284363031387]
[0.2582840919494629]
[0.38228026032447815]
[0.3351742625236511]


## FULL TITLE CASE 1

In [132]:
MODE = "full"
TOPK = 10
paper_id = "1805.11937"
paper_para_embeddings: Dict[str, List[List[float]]] = utils.load_json(Path(f"qasper/embeddings/test_embeddings_sbert_{MODE}.json"))
test_papers: Dict[str, Dict] = utils.load_json(Path("qasper/test_papers.json"))
raw_paras: List[str] = [para["text"] for para in test_papers[paper_id].values()]


In [133]:
# full title

question = "What type of morphological features are used?"

# from section: 
    # Introduction
    # Predicted Morphological Tags
predicted = [
    "Morphological analysis already provides the aforementioned information about the words. However access to useful morphological features may be problematic due to software licensing issues, lack of robust morphological analyzers and high ambiguity among analyses. Character-level models (CLM), being a cheaper and accessible alternative to morphology, have been reported as performing competitively on various NLP tasks BIBREF0 , BIBREF1 , BIBREF2 . However the extent to which these tasks depend on morphology is small; and their relation to semantics is weak. Hence, little is known on their true ability to reveal the underlying morphological structure of a word and their semantic capabilities. Furthermore, their behaviour across languages from different families; and their limitations and strengths such as handling of long-range dependencies, reaction to model complexity or performance on out-of-domain data are unknown. Analyzing such issues is a key to fully understanding the character-level models.",
    "We use a simple method based on bidirectional LSTMs to train three types of base semantic role labelers that employ (1) words (2) characters and character sequences and (3) gold morphological analysis. The gold morphology serves as the upper bound for us to compare and analyze the performances of character-level models on languages of varying morphological typologies. We carry out an exhaustive error analysis for each language type and analyze the strengths and limitations of character-level models compared to morphology. In regard to the diversity hypothesis which states that diversity of systems in ensembles lead to further improvement, we combine character and morphology-level models and measure the performance of the ensemble to better understand how similar they are.",
    "Although models with access to gold morphological tags achieve better F1 scores than character models, they can be less useful a in real-life scenario since they require gold tags at test time. To predict the performance of morphology-level models in such a scenario, we train the same models with the same parameters with predicted morphological features. Predicted tags were only available for German, Spanish, Catalan and Czech. Our results given in Fig. 5 , show that (except for Czech), predicted morphological tags are not as useful as characters alone."
]

# from section: 
    # Subword Units
gold = [
    "We use three types of units: (1) words (2) characters and character sequences and (3) outputs of morphological analysis. Words serve as a lower bound; while morphology is used as an upper bound for comparison. Table 1 shows sample outputs of various $\\rho $ functions.",
    "Here, char function simply splits the token into its characters. Similar to n-gram language models, char3 slides a character window of width $n=3$ over the token. Finally, gold morphological features are used as outputs of morph-language. Throughout this paper, we use morph and oracle interchangably, i.e., morphology-level models (MLM) have access to gold tags unless otherwise is stated. For all languages, morph outputs the lemma of the token followed by language specific morphological tags. As an exception, it outputs additional information for some languages, such as parts-of-speech tags for Turkish. Word segmenters such as Morfessor and Byte Pair Encoding (BPE) are other commonly used subword units. Due to low scores obtained from our preliminary experiments and unsatisfactory results from previous studies BIBREF13 , we excluded these units."
]

question_embedding: List[float] = embedding_model.encode([question])
gold_embeddings: List[List[float]] = embedding_model.encode(gold)
predicted_embeddings: List[List[float]] = embedding_model.encode(predicted)


gold_similarity : List[float] = util.dot_score(question_embedding, gold_embeddings)[0].cpu().tolist()
predicted_similarity : List[float] = util.dot_score(question_embedding, predicted_embeddings)[0].cpu().tolist()

print(f"Question: {question}")
print(f"Similarity scores of golden evidences:{gold_similarity}")
print(f"Similarity scores of predicted evidences:{predicted_similarity}")

print("+"*100)
all_stored_embeddings = paper_para_embeddings[paper_id]
all_stored_similarity: List[float] = util.dot_score(question_embedding, all_stored_embeddings)[0].cpu().tolist()
para_score_pairs = list(zip(raw_paras, all_stored_similarity))
topk_para_score_pairs: List[Tuple[str, float]] = sorted(para_score_pairs, key=lambda x: x[1], reverse=True)[:TOPK]
topk_paras: List[str] = [para for para, _ in topk_para_score_pairs]

for rank, (para, score) in enumerate(topk_para_score_pairs, 1):
    for gold_id, gold_para in enumerate(gold):
        if gold_para in para:
            print(f"Gold paragraph {gold_id} found in top{TOPK}! (score={score}), ranked {rank}")

Question: What type of morphological features are used?
Similarity scores of golden evidences:[0.4760808050632477, 0.41868165135383606]
Similarity scores of predicted evidences:[0.46930956840515137, 0.4313605725765228, 0.4032084047794342]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Gold paragraph 0 found in top10! (score=0.4157830476760864), ranked 6
Gold paragraph 1 found in top10! (score=0.3529841899871826), ranked 9
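
In this case the gold paragraphs score lower against the stored `full` embeddings than when their raw text is re-encoded in the cell. The sketch below tabulates that gap from the printed values (rounded to four decimals). It assumes the difference comes from the title text included in the stored embeddings, which this notebook does not verify directly.

```python
# Scores copied from the output above, rounded to four decimals.
raw_scores = [0.4761, 0.4187]     # gold paragraphs re-encoded from raw text
stored_scores = [0.4158, 0.3530]  # same paragraphs scored via the stored "full" embeddings

# Negative delta means the stored embedding is less similar to the question.
deltas = [stored - raw for raw, stored in zip(raw_scores, stored_scores)]
for gold_id, (raw, stored, delta) in enumerate(zip(raw_scores, stored_scores, deltas)):
    print(f"gold {gold_id}: raw={raw:.4f} stored={stored:.4f} delta={delta:+.4f}")
```

In the notitle case earlier in this notebook the stored and re-encoded scores agree to about seven decimals, which fits the assumption that the gap here is introduced by the title.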


In [None]:
"""
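Gold retrieved, answered correctly (working as expected):
95bbd91badbfe979899cca6655afc945ea8a6926 = [2, 2]

Gold not retrieved, answered incorrectly:
Why? Is it because of multisection? -> 50%
30870a962cf88ac8c8e6b7b795936fd62214f507 = [2, 2]


Gold retrieved, but still answered incorrectly:
55bafa0f7394163f4afd1d73340aac94c2d9f36c = [2, 2]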



"""