Questions we aim to answer in the analysis
1. Behavior on top relevant documents [How many of the top documents for this system were relevant and could they be categorized and distinguished from others?]
2. Behavior on top non-relevant documents [Why were the top non-relevant documents retrieved?] Behavior on unretrieved relevant documents [Why weren’t these relevant documents retrieved within the top 1000?]

    x. Beadplot observations [How does the ranking (especially among the top 50 documents) of this system compare to all other systems?]

3. Base Query observations [What did the system think were the important terms of the original query, and were they good?]
4. Expanded Query observations [If the system expanded the query (4 out of 6 systems did), what were the important terms of the expansion, and were they helpful?]
5. Blunders of system [What obvious mistakes did the system make that it could have easily avoided? Examples might be bad stemming of words or bad handling of hyphenation] Other features of note [Anything else.]
6. What should the system do to improve performance? [The individual’s conclusion as to why the system did not retrieve well, and recommendations as to what would have made a better retrieval.]
7. What added information would help performance? How can system get that information? [Is there implicit information in the query, that a human would understand but the system didn’t? Examples might be world knowledge (like Germany is part of Europe).]

**Important**: define local paths to result files so that these can be used for analysis. If your files are in a different location or format, change the values of these constants.

In [204]:
# Path to output file of the model in TREC format
MODEL_OUTPUT_PATH = './LambdaRANK_resuls_01.trec'
# Separator used in output file between values
SEPARATOR = ' '
# Path to MS-MARCO evaluation queries
MSMARCO_QUERIES_PATH = 'collections/msmarco-passage/msmarco-test2019-queries.tsv' 
# Path to Qrels file of the aforementioned queries
MSMARCO_QRELS_PATH = 'collections/msmarco-passage/2019qrels-pass.txt'
# Path to indexes of the MSMARCO dataset
INDEX_PATH = 'indexes/lucene-index-msmarco-passage'
# Path to TREC evaluation file
TREC_EVAL_PATH = 'tools/eval/trec_eval.9.0.4/trec_eval'

In [22]:
import pandas as pd

# Reads the output of a model.
# Lines in the output should be in the form [query_id, doc_id, rank, score] with sep = '\t'.
def read_results_tsv(loc):
    d = pd.read_csv(loc, sep='\t', header=None, names=['query_id', 'doc_id', 'rank', 'score'])
    return d

# Reads the output of a model.
# Lines in the output should be in the form [query_id, doc_id, rank, score] with sep = ' '.
def read_results_csv(loc):
    d = pd.read_csv(loc, sep=' ', header=None, names=['query_id', 'doc_id', 'rank', 'score'])
    return d

# Gets the ranking of a query.
def get_ranking_by_query_id(d, query_id):
    ranking = d.loc[d.query_id == query_id][['doc_id', 'rank']].sort_values(by=['rank'])['doc_id'].tolist()
    return ranking

In [184]:
import numpy as np

# Reads the relevant documents from the given qrels file.
def read_qrels(loc):
    d = pd.read_csv(loc, names=['query_id', 'Q0', 'doc_id', 'rating'], sep=' ', header=None)
    del d['Q0']
    return d

def read_trec_results(separator, path):
    '''Reads a results file in TREC format: query_id, Q0, doc_id, rank, score, run_name, separated by the given separator.'''
    d = pd.read_csv(path, sep=separator, header=None, names=['query_id', 'Q0', 'doc_id', 'rank', 'score', 'run_name'])
    # Remove redundant columns
    del d['Q0']
    del d['run_name']
    return d

# Gets the non-relevant documents (rating < 2) for the given query id.
def get_non_relevant_doc_ids(qrels, query_id):
    doc_ids = qrels.loc[(qrels.query_id == query_id) & (qrels.rating < 2)][['doc_id', 'rating']]
    return doc_ids

# Gets the relevant documents (rating >= 2) for the given query id.
def get_relevant_doc_ids(qrels, query_id):
    doc_ids = qrels.loc[(qrels.query_id == query_id) & (qrels.rating >= 2)][['doc_id', 'rating']]
    return doc_ids

# Counts, per query, how many of the top-n ranked documents are relevant (rating >= 2).
# Note: despite the name, this is a hit count in the top n, not a normalized recall value.
def get_recall_per_query(qrels, results, n):
    recalls = {}
    for query_id in qrels.query_id.unique():
        ranking = get_ranking_by_query_id(results, query_id)
        relevant = set(get_relevant_doc_ids(qrels, query_id).doc_id.tolist())
        # Slicing avoids an IndexError when fewer than n documents were retrieved.
        recalls[query_id] = sum(1 for doc_id in ranking[:n] if doc_id in relevant)
    return recalls

# Constructs a vector which counts the number of retrieved documents for each rating.
def get_relevance_vector(qrels, results):
    v = np.zeros(qrels.rating.max() + 1)
    for query_id in qrels.query_id.unique():
        # Use the results passed in, not the global run `d`.
        ranking = get_ranking_by_query_id(results, query_id)
        judged = qrels.loc[qrels.query_id == query_id]
        for doc_id in ranking:
            ratings = judged.loc[judged.doc_id == doc_id].rating.tolist()
            # Retrieved documents without a judgment are skipped.
            if ratings:
                v[ratings[0]] += 1
    return v

# Given a ranking, return all documents that are relevant, but not in the ranking for the given query.
def get_relevant_doc_ids_not_retrieved(qrels, query_id, ranking):
    relevant_doc_ids = get_relevant_doc_ids(qrels, query_id).doc_id.tolist()
    relevant_doc_ids_not_retrieved = []
    for doc_id in relevant_doc_ids:
        if not doc_id in ranking:
            relevant_doc_ids_not_retrieved.append(doc_id)
    return relevant_doc_ids_not_retrieved

# Given a ranking, return all documents that are relevant and are in the ranking for the given query.
def get_relevant_doc_ids_retrieved(qrels, query_id, ranking):
    relevant_doc_ids = get_relevant_doc_ids(qrels, query_id).doc_id.tolist()
    relevant_doc_ids_retrieved = []
    for doc_id in relevant_doc_ids:
        if doc_id in ranking:
            relevant_doc_ids_retrieved.append(doc_id)
    return relevant_doc_ids_retrieved

# Given a ranking, return all documents that are non-relevant and are in the ranking for the given query.
def get_non_relevant_doc_ids_retrieved(qrels, query_id, ranking):
    non_relevant_doc_ids = get_non_relevant_doc_ids(qrels, query_id).doc_id.tolist()
    non_relevant_doc_ids_retrieved = []
    for doc_id in non_relevant_doc_ids:
        if doc_id in ranking:
            non_relevant_doc_ids_retrieved.append(doc_id)
    return non_relevant_doc_ids_retrieved



# For some reason, qrels contains fewer unique query ids than the results, i.e., not every query has relevant items (by a long shot).
# print(qrels.query_id.unique())
# print(d.query_id.unique())

In [4]:
def read_queries(loc):
    queries = pd.read_csv(loc, header=None, sep='\t', names=['query_id', 'string'])
    return queries

def get_query(queries, query_id):
    return queries.loc[queries.query_id == query_id].string.tolist()[0]

In [23]:
from pyserini.index import IndexReader

# Load datafiles:
d = read_trec_results(SEPARATOR, MODEL_OUTPUT_PATH)
qrels = read_qrels(MSMARCO_QRELS_PATH)
queries = read_queries(MSMARCO_QUERIES_PATH)
index = IndexReader(INDEX_PATH)

In [6]:
# Gets the document vector for the given doc_id.
def get_doc_vec(doc_id):
    # The index expects string ids, while doc ids read by pandas are integers.
    return index.get_document_vector(str(doc_id))

# Tokenizes a given query.
def tokenize(query):
    return index.analyze(query)
        

In [7]:
# There are on average ~215 judged documents per query in the qrels.
# This means that, on average, at most 215 of the documents a model retrieves for a query have a relevance judgment.
print(len(qrels) / len(qrels.query_id.unique()))

215.34883720930233


### 1) Behavior on top relevant documents. How many of the top documents for this system were relevant and could they be categorized and distinguished from others?

In [205]:
import subprocess

# This question can be answered by calculating the metrics reported by the official trec_eval tool.
# https://www-nlpir.nist.gov/projects/trecvid/trecvid.tools/trec_eval_video/A.README
# -M 100 limits the evaluation to the first 100 retrieved documents per query.
cmd = subprocess.Popen([TREC_EVAL_PATH, '-c', '-mofficial', '-mndcg', '-mrecall', '-M', '100', MSMARCO_QRELS_PATH, MODEL_OUTPUT_PATH], stdout=subprocess.PIPE)
cmd_out, cmd_err = cmd.communicate()
print(cmd_out.decode("utf-8"))

runid                 	all	pyterrier
num_q                 	all	43
num_ret               	all	4205
num_rel               	all	4102
num_rel_ret           	all	1374
map                   	all	0.2843
gm_map                	all	0.1417
Rprec                 	all	0.3485
bpref                 	all	0.3533
recip_rank            	all	0.8303
iprec_at_recall_0.00  	all	0.8629
iprec_at_recall_0.10  	all	0.6288
iprec_at_recall_0.20  	all	0.5335
iprec_at_recall_0.30  	all	0.4198
iprec_at_recall_0.40  	all	0.2964
iprec_at_recall_0.50  	all	0.2596
iprec_at_recall_0.60  	all	0.2021
iprec_at_recall_0.70  	all	0.1176
iprec_at_recall_0.80  	all	0.0695
iprec_at_recall_0.90  	all	0.0461
iprec_at_recall_1.00  	all	0.0186
P_5                   	all	0.6186
P_10                  	all	0.5791
P_15                  	all	0.5581
P_20                  	all	0.5337
P_30                  	all	0.4783
P_100                 	all	0.3195
P_200                 	all	0.1598
P_500                 	all	0.0639
P_1000               

#### BM25 Results
We can see that of the 4102 relevant documents, BM25 managed to find 2814. The precision metrics show that relevant documents are concentrated at the top of the ranking: precision at 10 retrieved documents is 60%, but precision at 100 retrieved documents is only 30%. This means that BM25 does not clearly separate relevant from non-relevant documents across the retrieved documents.
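
To make the drop-off concrete, the rounded figures quoted above can be turned into per-query counts. This is only an arithmetic sketch based on the rounded 60% and 30% values and the 2814 and 4102 counts from the text, not numbers read from the run file.

```python
# Rounded precision values quoted in the text (assumed, not read from the run file).
p_at_10 = 0.60
p_at_100 = 0.30

# Average number of relevant documents per query in each prefix of the ranking.
rel_top_10 = p_at_10 * 10      # about 6 relevant documents in ranks 1-10
rel_top_100 = p_at_100 * 100   # about 30 relevant documents in ranks 1-100

# Precision restricted to ranks 11-100.
rel_tail = rel_top_100 - rel_top_10
p_tail = rel_tail / 90

# Overall recall from the counts quoted in the text.
recall = 2814 / 4102

print(f"relevant in ranks 11-100: {rel_tail:.0f}, precision there: {p_tail:.3f}")
print(f"overall recall: {recall:.3f}")
```

Under these rounded values, roughly one in four documents between ranks 11 and 100 is relevant, compared with six in ten in the top 10.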

#### MonoT5 Results
(_Note that these results are for `k=100`, meaning that only 100 documents are retrieved per query._)
Of the 4102 relevant documents, 916 were retrieved over all queries (1372 of the 4300 retrieved documents were relevant). Precision is very high over the first few documents (<10) but decreases significantly as more documents are retrieved:

|Rank|Precision|
|---|---|
|5   |0.8977|
|10  |0.8070|
|15  |0.7380|
|20  |0.6930|
|30  |0.6217|
|100 |0.3191|

Also notice that the mean reciprocal rank is very high (> 0.97). MAP is low (~0.37), but that is probably because the model retrieves only a small number of documents per query (`k=100`), while a query can have more than 200 relevant documents.
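
The effect of the cutoff on MAP can be illustrated with a small sketch. The functions below are plain re-implementations of precision at k, reciprocal rank and average precision written for this illustration (they are not the trec_eval code), and the query with 200 relevant documents is hypothetical.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the first k ranked documents that are relevant."""
    top = ranking[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

def average_precision(ranking, relevant):
    """Sum of the precision at each relevant hit, divided by the number of relevant documents."""
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

# Hypothetical query: 200 relevant documents, but the run is cut off at k=100.
relevant = set(range(200))
perfect_but_truncated = list(range(100))  # every retrieved document is relevant

print(precision_at_k(perfect_but_truncated, relevant, 10))
print(reciprocal_rank(perfect_but_truncated, relevant))
print(average_precision(perfect_but_truncated, relevant))
```

Even a ranking in which every retrieved document is relevant cannot exceed an average precision of 0.5 for such a query, because half of the relevant documents are never retrieved.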

### 2) Behavior on top non-relevant documents. Why were the top non-relevant documents retrieved? Behavior on unretrieved relevant documents. Why weren’t these relevant documents retrieved within the top 1000?

In [155]:
i = 17

query_id = qrels['query_id'].unique()[i]
print(query_id)
# List the positions and ids of all judged queries that mention "law".
for t, q_id in enumerate(qrels['query_id'].unique()):
    if "law" in get_query(queries, q_id):
        print(t, q_id)
query = get_query(queries, query_id)
ranking = get_ranking_by_query_id(d, query_id)
print(tokenize(query))
# Helper functions available for this analysis:
# get_non_relevant_doc_ids_retrieved
# get_relevant_doc_ids_retrieved
# get_relevant_doc_ids_not_retrieved
docs = get_relevant_doc_ids_not_retrieved(qrels, query_id, ranking)
doc_vec = [get_doc_vec(doc) for doc in docs]
# Print the 10 most frequent terms of each relevant document that was not retrieved.
for n, vec in enumerate(doc_vec):
    top_terms = [k for k, v in sorted(vec.items(), key=lambda item: item[1], reverse=True)[:10]]
    print(n, docs[n], top_terms)

443396
17 443396
['lp', 'law', 'definit']
0 1055834 ['person', 'mental', 'grave', 'disabl', 'hi', 'reason', 'shelter', 'lp', 'becaus', 'conservatorship']
1 1055835 ['conservatorship', 'probat', 'financ', 'â', 'code', 'law', 'gener', 'themselv', 'type', 'most']
2 2130373 ['individu', 'act', 'involuntari', 'facil', 'ag', 'voluntari', 'baker', 'person', 'here', 'parent']
3 2130375 ['act', 'health', 'mental', 'baker', 'treatment', 'individu', 'proceed', 'florida', 'voluntari', 'emerg']
4 2130381 ['treatment', 'act', 'admiss', 'voluntari', 'baker', 'parent', 'individu', 'minor', 'year', 'mental']
5 3440848 ['hold', 'section', 'dai', 'treatment', 'code', 'addit', 'institut', 'psychiatr', 'patient', 'california']
6 3440850 ['section', 'danger', 'involuntarili', 'other', 'code', 'lp', 'offic', 'institut', 'mental', 'petrisâ']
7 3440851 ['code', 'institut', 'section', 'welfar', 'hospit', 'mental', '5260', 'hold', 'wic', 'fall']
8 3440853 ['hold', 'right', 'patient', 'patientsâ', 'legal', 'invol

In [164]:
i = 5
query_id = qrels['query_id'].unique()[i]
query = get_query(queries, query_id)
print(query_id, query)
ranking = get_ranking_by_query_id(d, query_id)
j = 0  # number of relevant documents seen so far
p = 0  # counter for the optional term check below (disabled)
# Walk down the ranking and print every relevant document (rating > 1) with its top terms.
for pos, doc_id in enumerate(ranking):
    rating = qrels.loc[(qrels.query_id == query_id) & (qrels.doc_id == doc_id)][['rating']].values
    # Documents without a judgment are treated as non-relevant.
    rating = rating[0][0] if len(rating) > 0 else 0

    if rating > 1:
        j += 1
        top_10 = [k for k, v in sorted(get_doc_vec(doc_id).items(), key=lambda item: item[1], reverse=True)[:10]]
#         if 'suicid' in top_10[:3]:
#             p += 1
        print(pos, p, j, rating, doc_id, top_10)

130510 definition declaratory judgment
0 0 1 2 1494936 ['declaratori', 'judgment', 'parti', 'legal', 'involv', 'sometim', 'conclus', 'appeal', 'right', 'court']
4 0 2 3 8612906 ['declaratori', 'judgment', 'state', 'disput', 'fact', 'patent', 'piec', 'own', 'properti', 'rule']
5 0 3 3 8612903 ['definit', 'other', 'rule', 'parti', 'declaratori', 'award', 'judgment', 'rate', 'civil', 'damag']
6 0 4 3 799647 ['judgment', 'ani', 'parti', 'declaratori', 'case', 'disput', 'duti', 'right', 'court', 'howev']
10 0 5 2 1494935 ['declaratori', 'act', 'judgment', 'mai', 'injunct', 'advoc', 'procedur', 'statutori', 'provis', 'seek']
11 0 6 2 1494938 ['judgment', 'declareâ', 'â', 'featur', 'author', 'forc', 'it', 'right', 'parti', 'declaratori']
13 0 7 3 8612909 ['legal', 'call', 'determin', 'judgment', 'resolv', 'litig', 'declar', 'also', 'court', 'uncertainti']
15 0 8 3 8612910 ['which', 'law', 'question', 'anyth', 'express', 'right', 'court', 'done', 'parti', 'simpli']
17 0 9 3 8612902 ['other', '

#### BM25 Results
In this case, synonyms and similar terms for military such as 'veteran' and 'medic' could have helped in finding this document.

#### MonoT5 Results
The non-relevant documents that MonoT5 retrieves are related to _some_ of the tokens in the query. For example, for the query "causes of military suicide", retrieved non-relevant documents often contain the word "military" or "suicide", but not both. The relevant documents that are not retrieved, on the other hand, often contain many synonyms or related words. For this query those are words like "ptsd", "trauma", "veteran", "vietnam" and "iraq": words that a human would recognize as related to military suicide, but that the system does not connect to the query.

The same holds for other queries. An interesting example is "does legionella pneumophila cause pneumonia". Here the model retrieves relevant documents that contain terms like "legionella", "pneumophila" and "pneumonia", but fails to retrieve relevant documents that contain words like "bacteria", "disease" and "organ".
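
One way to check this observation is to measure what fraction of the query tokens occurs in a document. The sketch below uses hand-written, hypothetical token lists. In the notebook the tokens would come from `tokenize` and `get_doc_vec`, and the function name `query_coverage` is introduced here for illustration only.

```python
def query_coverage(query_tokens, doc_tokens):
    """Fraction of distinct query tokens that occur in the document."""
    query_terms = set(query_tokens)
    if not query_terms:
        return 0.0
    return len(query_terms & set(doc_tokens)) / len(query_terms)

# Hypothetical, hand-written token lists for illustration only.
query = ["caus", "militari", "suicid"]
retrieved_non_relevant = ["militari", "base", "train", "soldier"]
missed_relevant = ["ptsd", "veteran", "iraq", "trauma", "vietnam"]

coverage_non_relevant = query_coverage(query, retrieved_non_relevant)
coverage_missed = query_coverage(query, missed_relevant)

print(coverage_non_relevant)
print(coverage_missed)
```

A retrieved non-relevant document with partial coverage next to a missed relevant document with zero coverage would match the pattern described above.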

### 3) Base Query observations. What did the system think were the important terms of the original query, and were they good?

#### BM25 Results
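BM25 has no explicit notion of important query terms. Terms are only weighted implicitly through their inverse document frequency, and the analyzer removes unimportant words (stopwords) before matching.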
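
One point worth noting for the BM25 case: the scoring function itself weights each query term by its inverse document frequency (IDF), so rarer terms count for more. Below is a minimal sketch of the IDF component in the form commonly used by Lucene-based implementations. The collection size and document frequencies are made up for illustration and are not taken from the MS MARCO index.

```python
import math

def bm25_idf(doc_freq, num_docs):
    """IDF component of BM25: rare terms get a larger weight than common ones."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical collection size and document frequencies, for illustration only.
NUM_DOCS = 1_000_000
doc_freqs = {"what": 400_000, "three": 50_000, "percent": 20_000}

weights = {term: bm25_idf(df, NUM_DOCS) for term, df in doc_freqs.items()}
for term, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}: {weight:.2f}")
```

With these assumed frequencies, the rarest term receives the largest weight, which is the closest BM25 comes to deciding which query terms are important.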

#### MonoT5 Results
Unfortunately, this information is not available for MonoT5.

### 4) Expanded Query observations. If the system expanded the query (4 out of 6 systems did), what were the important terms of the expansion, and were they helpful?

#### BM25 Results
BM25 does not use expanded queries.

#### MonoT5 Results
MonoT5 does not expand queries.

### 5) Blunders of system. What obvious mistakes did the system make that it could have easily avoided? Examples might be bad stemming of words or bad handling of hyphenation. Other features of note. Anything else.

We can answer this question by looking at the queries with the worst recall.

In [195]:
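rank = 5  # position in the list of queries sorted by ascending number of relevant hits
doc = 0

# Number of relevant documents in the top 5 for each query, sorted from worst to best.
recalls = [(k, v) for k, v in get_recall_per_query(qrels, d, 5).items()]
recalls.sort(key=lambda x: x[1])
worst_query_id = recalls[rank][0]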

print('Query:', get_query(queries, worst_query_id))
print('Recall:', recalls[rank][1])
print(tokenize(get_query(queries, worst_query_id)))

# relevant = get_relevant_doc_ids_not_retrieved(qrels, worst_query_id, get_ranking_by_query_id(d, worst_query_id))
# print(get_doc_vec(relevant[0]))
# print(index.doc(str(relevant[doc])).raw())

# worst_query_id = recalls[2][0]

# print(get_query(queries, worst_query_id))
# print(tokenize(get_query(queries, worst_query_id)))

# relevant = get_relevant_doc_ids_not_retrieved(qrels, worst_query_id, get_ranking_by_query_id(d, worst_query_id))
# print(index.doc(str(relevant[0])).raw())


Query: what are the three percenters?
Recall: 0
['what', 'three', 'percent']


#### BM25 Results
In this case, the problem with the worst-performing query is that "percenters" is stemmed to "percent", which now matches any document that uses the word "percent". This is an obvious stemming issue. It could be resolved by using named-entity recognition (NER) to protect names such as "three percenters" from stemming.
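
A minimal sketch of the idea of shielding names from the stemmer follows. It makes strong assumptions: `toy_stem` is a crude suffix stripper standing in for the real stemmer, the protected phrase list is hand-written instead of coming from an NER component, and stopword removal is left out.

```python
def toy_stem(token):
    """Very crude suffix stripper, a stand-in for the real stemmer (assumption)."""
    for suffix in ("ers", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Hypothetical list of names that should survive analysis unchanged.
PROTECTED_PHRASES = {("three", "percenters")}

def analyze(text):
    """Lowercase, split on whitespace, and stem everything outside protected phrases."""
    tokens = text.lower().replace("?", "").split()
    output = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PROTECTED_PHRASES:
            output.extend(pair)   # keep the named entity intact
            i += 2
        else:
            output.append(toy_stem(tokens[i]))
            i += 1
    return output

print(analyze("what are the three percenters?"))
print(toy_stem("percenters"))
```

The protected phrase keeps "percenters" intact, while the unprotected stemmer call collapses it to "percent".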

In the second query, a spelling mistake prevents good retrieval, and again synonyms for WW1 might be useful.

#### MonoT5 Results
This model uses the same stemmer as the BM25 model and therefore makes the same mistake of reducing "percenters" to "percent". Acronyms such as "lps" are also reduced to "lp", which yields vastly different results.


### 6) What should the system do to improve performance? The individual’s conclusion as to why the system did not retrieve well, and recommendations as to what would have made a better retrieval.

#### BM25 Results

#### MonoT5 Results
It is evident that the mistakes are subtle but can have a profound impact on the results. A common mistake is that documents are retrieved that match only part of the query. In addition, the tokenization of the queries may remove crucial information from the query. The relevant documents that the model missed often contain terms that are closely related to the query but do not appear in it. The model may perform better if it could capture the semantic context of the query.

### 7) What added information would help performance? How can system get that information? Is there implicit information in the query, that a human would understand but the system didn’t? Examples might be world knowledge (like Germany is part of Europe).

#### BM25 Results
In general, adding similar terms and synonyms could benefit retrieval greatly, as queries are often very short and may miss key terms. A spell checker may also help, as some queries contain spelling errors that prevent the intended word from being matched.
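
As a rough illustration of the spell-checking idea, the sketch below corrects query tokens against a vocabulary using `difflib.get_close_matches` from the Python standard library. The vocabulary and the misspelled query are hypothetical. In practice the vocabulary could be taken from the terms in the index, and the cutoff of 0.8 is an arbitrary choice.

```python
import difflib

# Hypothetical vocabulary; in practice this could be the set of terms in the index.
VOCABULARY = ["military", "suicide", "pneumonia", "legionella", "judgment", "definition"]

def correct_token(token, vocabulary, cutoff=0.8):
    """Return the closest vocabulary term, or the token itself if nothing is close enough."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_query(query, vocabulary):
    """Correct every whitespace-separated token of the query independently."""
    return " ".join(correct_token(token, vocabulary) for token in query.lower().split())

print(correct_query("definiton of pnemonia", VOCABULARY))
```

Tokens without a sufficiently similar vocabulary entry are left unchanged, which keeps short function words from being rewritten.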

#### MonoT5 Results
By using query expansion to extend the query with other relevant terms, the performance could be significantly improved.
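
A minimal sketch of what such an expansion could look like is given below, assuming a naive form of pseudo-relevance feedback: the most frequent terms of the top-ranked documents that are not already in the query are appended to it. This is a simplification and not an established method such as RM3, and the function name and token lists are hypothetical.

```python
from collections import Counter

def expand_query(query_tokens, feedback_docs, num_terms=3):
    """Append the most frequent new terms from the feedback documents to the query."""
    counts = Counter()
    for doc_tokens in feedback_docs:
        counts.update(token for token in doc_tokens if token not in query_tokens)
    expansion = [term for term, _ in counts.most_common(num_terms)]
    return list(query_tokens) + expansion

# Hypothetical top-ranked documents, already tokenized; for illustration only.
query = ["militari", "suicid"]
top_docs = [
    ["militari", "veteran", "ptsd", "suicid"],
    ["veteran", "ptsd", "iraq", "suicid"],
    ["veteran", "trauma", "militari"],
]

expanded = expand_query(query, top_docs, num_terms=2)
print(expanded)
```

The expanded query could then be passed to the first-stage retriever, so that relevant documents that mention only the related terms have a chance of entering the candidate list.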