# MS-MARCO Passage (Re-)Ranking using Learn-to-Rank in `pygaggle`
This notebook walks through the basics of learning to rank (L2R) using [pygaggle](https://github.com/castorini/pygaggle).

The following CLI command runs passage re-ranking on the MS MARCO dev split using the MonoT5 model:

```shell
python -um pygaggle.run.evaluate_passage_ranker --split dev \
                                                --method t5 \
                                                --model castorini/monot5-base-msmarco \
                                                --dataset collections/msmarco-passage \
                                                --model-type t5-base \
                                                --task msmarco \
                                                --index-dir indexes/msmarco-passage \
                                                --batch-size 32 \
                                                --output-file runs/run.monot5.ans_small.dev.tsv
```

However, we will do this in a more Pythonic way. First, we import the relevant libraries and the (pretrained) ranking model `MonoT5` ([background article](https://towardsdatascience.com/asking-the-right-questions-training-a-t5-transformer-model-on-a-new-task-691ebba2d72c), [paper](https://arxiv.org/pdf/2003.06713.pdf)):

In [33]:
from pygaggle.rerank.base import Query, Text, hits_to_texts
from pygaggle.rerank.transformer import MonoT5
from pyserini.search import SimpleSearcher
from tqdm.notebook import tqdm

# Load the pretrained re-ranker (the default checkpoint is downloaded on first use).
reranker = MonoT5()

Some weights of the model checkpoint at castorini/monot5-base-msmarco were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Now we will import the text from the MS-MARCO dataset and try to re-rank some passages for the queries.

**Why re-rank instead of rank?** Scoring every passage in the corpus with a learned model is computationally infeasible. Instead, a conventional retrieval method (here BM25) first produces a ranked list of _candidates_, which the L2R model then re-ranks.
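The two-stage idea can be illustrated without any model at all. The following is a minimal sketch, not pygaggle's implementation: the corpus, `cheap_score` (a stand-in for BM25), and `expensive_score` (a stand-in for a neural re-ranker) are all hypothetical. The point is only that the costly scorer runs on `k` candidates rather than on the whole corpus.

```python
# Toy corpus standing in for MS MARCO passages (hypothetical data).
CORPUS = {
    "d1": "ptolemy proposed the geocentric theory",
    "d2": "copernicus proposed the heliocentric theory",
    "d3": "the theory of relativity was proposed by einstein",
    "d4": "grand rapids is a city in minnesota",
}

def cheap_score(query, passage):
    """Stage 1 stand-in for BM25: number of distinct shared terms."""
    return len(set(query.split()) & set(passage.split()))

def expensive_score(query, passage):
    """Stage 2 stand-in for a neural re-ranker: Jaccard similarity of term sets."""
    q_terms, p_terms = set(query.split()), set(passage.split())
    return len(q_terms & p_terms) / len(q_terms | p_terms)

def retrieve(query, corpus, k):
    """Score the whole corpus cheaply and keep the top-k document ids."""
    scored = sorted(corpus, key=lambda d: cheap_score(query, corpus[d]), reverse=True)
    return scored[:k]

def rerank(query, corpus, candidates):
    """Apply the expensive scorer to the candidates only, best first."""
    scored = [(d, expensive_score(query, corpus[d])) for d in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

query = "who proposed the geocentric theory"
candidates = retrieve(query, CORPUS, k=3)
reranked = rerank(query, CORPUS, candidates)
for rank, (doc_id, score) in enumerate(reranked, start=1):
    print(rank, doc_id, f"{score:.3f}")
```

The quality of the final list is bounded by the first stage: a relevant passage that the retriever does not return can never be recovered by the re-ranker.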

We run the example from the pygaggle [GitHub page](https://github.com/castorini/pygaggle) as a toy problem:

In [9]:
# Here's our query:
query = Query('who proposed the geocentric theory')

# fetch some passages to rerank from MS MARCO with Pyserini (BM25)
searcher = SimpleSearcher('indexes/msmarco-passage/lucene-index-msmarco')
hits = searcher.search(query.text)
texts = hits_to_texts(hits)

# Optionally: print out the passages prior to reranking (might be interesting to see how the order changes):
# for i in range(0, 10):
#     print(f'{i+1:2} {texts[i].metadata["docid"]:15} {texts[i].score:.5f} {texts[i].text}')

# Rerank:
reranked = reranker.rerank(query, texts)
reranked.sort(key=lambda x: x.score, reverse=True)
# Optionally: print out reranked results:
# for i in range(0, 10):
#     print(f'rank: {i+1:2}, score: {reranked[i].score:.5f}, document: {reranked[i].text}')

# We print the first result as a proof of success:
print(f'rank: {0+1:2}, score: {reranked[0].score:.5f}, document: {reranked[0].text}')

rank:  1, score: -0.00887, document: {
  "id" : "7744105",
  "contents" : "For Earth-centered it was  Geocentric Theory proposed by greeks under the guidance of Ptolemy and Sun-centered was Heliocentric theory proposed by Nicolas Copernicus in 16th century A.D. In short, Your Answers are: 1st blank - Geo-Centric Theory. 2nd blank - Heliocentric Theory."
}
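The score is negative because, as we understand it from the MonoT5 paper, it is a log-probability: the model's logits for the target tokens "true" and "false" are normalized with a softmax, and the log-probability of "true" serves as the relevance score. The sketch below reproduces that computation in plain Python; the logit values are made up for illustration and are not taken from the model.

```python
import math

def log_prob_true(true_logit, false_logit):
    """Log-softmax over the two logits, returning the entry for 'true'."""
    m = max(true_logit, false_logit)  # subtract the max for numerical stability
    log_norm = m + math.log(math.exp(true_logit - m) + math.exp(false_logit - m))
    return true_logit - log_norm

# Hypothetical logits: the model strongly prefers "true" for this passage.
score = log_prob_true(true_logit=4.0, false_logit=-0.7)
print(f"score: {score:.5f}, probability of 'true': {math.exp(score):.4f}")
```

Under this reading, scores are always at most 0, and a score of -0.00887 corresponds to a probability of about 0.991 that the passage is relevant.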


A single toy query is not enough to evaluate a ranker, so next we load all test queries in the dataset ([download link](https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-test2019-queries.tsv.gz)). The download is gzip-compressed: decompress it and place the resulting `msmarco-test2019-queries.tsv` in the `collections/msmarco-passage` directory with the other files.
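The download link points to a `.gz` archive, while the cell below opens a plain `.tsv` file. One way to unpack the archive from within Python is sketched here; the helper name `gunzip` is our own, and the sketch assumes the archive has already been saved into `collections/msmarco-passage`.

```python
import gzip
import shutil
from pathlib import Path

def gunzip(src, dst):
    """Decompress a .gz file to dst, streaming so the data never sits fully in memory."""
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst

# Assumed location of the downloaded archive.
SRC = Path("collections/msmarco-passage/msmarco-test2019-queries.tsv.gz")
if SRC.exists():  # only runs once the archive has been downloaded
    gunzip(SRC, SRC.with_suffix(""))  # strips ".gz", leaving "...queries.tsv"
```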

In [29]:
queries = []
QUERIES_PATH = 'collections/msmarco-passage/msmarco-test2019-queries.tsv'
with open(QUERIES_PATH) as f:
    content = f.readlines()
    content = [x.strip().split('\t') for x in content] 
    queries = [Query(x[1], x[0]) for x in content]
for q in queries[:10]:
    print(q.id, q.text)

1108939 what slows down the flow of blood
1112389 what is the county for grand rapids, mn
792752 what is ruclip
1119729 what do you do when you have a nosebleed from having your nose
1105095 where is sugar lake lodge located
1105103 where is steph currys home in nc
1128373 iur definition
1127622 meaning of heat capacity
1124979 synonym for treatment
885490 what party is paul ryan in


We are now ready to re-rank the candidates for every query. First, we define functions that write the rankings to a file (in either `csv` or `tsv` format).

In [30]:
def output_to_csv(queries, rankings, file_path='runs/monot5.csv'):
    '''Write one comma-separated line per result: query_id, doc_id, rank, score.
    Ranks are 0-based.
    '''
    with open(file_path, 'w') as f:
        for q, q_rank in zip(queries, rankings):
            for j, r in enumerate(q_rank):
                f.write(f"{q.id},{r.metadata['docid']},{j},{r.score}\n")

def output_to_tsv(queries, rankings, file_path='runs/monot5.tsv'):
    '''Write one tab-separated line per result: query_id, doc_id, rank, score.
    Ranks are 0-based.
    '''
    with open(file_path, 'w') as f:
        for q, q_rank in zip(queries, rankings):
            for j, r in enumerate(q_rank):
                f.write(f"{q.id}\t{r.metadata['docid']}\t{j}\t{r.score}\n")
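Since the two writers differ only in their delimiter, they could also be expressed as one function built on Python's `csv` module, which handles quoting if an id ever contains the delimiter. This is a sketch of that alternative, not part of the original notebook: `to_rows`, `write_run`, and the demo objects are hypothetical stand-ins that mimic only the `id`, `metadata`, and `score` attributes used above.

```python
import csv
import os
import tempfile
from types import SimpleNamespace

def to_rows(queries, rankings):
    """Flatten per-query rankings into (query_id, doc_id, rank, score) tuples."""
    for q, q_rank in zip(queries, rankings):
        for rank, r in enumerate(q_rank):  # 0-based rank, as in the functions above
            yield (q.id, r.metadata["docid"], rank, r.score)

def write_run(rows, file_path, delimiter=","):
    """Write rows with the given delimiter: ',' for csv, '\\t' for tsv."""
    with open(file_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=delimiter, lineterminator="\n")
        writer.writerows(rows)

# Hypothetical stand-ins for pygaggle's Query and Text objects.
demo_queries = [SimpleNamespace(id="q1")]
demo_rankings = [[
    SimpleNamespace(metadata={"docid": "d7"}, score=-0.5),
    SimpleNamespace(metadata={"docid": "d3"}, score=-1.25),
]]

demo_path = os.path.join(tempfile.gettempdir(), "demo_run.tsv")
write_run(to_rows(demo_queries, demo_rankings), demo_path, delimiter="\t")
```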

With these functions in place, we can re-rank the candidates for all queries. The script `l2r-passage-ranking.py` runs the same procedure.

In [None]:
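# Perform ranking (takes ~15 min.)
rankings = []

for query in tqdm(queries):
    # Retrieve fresh BM25 candidates for this query; reusing `texts` from the
    # toy example would re-rank the same passages for every query.
    hits = searcher.search(query.text)
    texts = hits_to_texts(hits)
    reranked = reranker.rerank(query, texts)
    reranked.sort(key=lambda x: x.score, reverse=True)
    rankings.append(reranked)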


In [None]:
output_to_csv(queries, rankings)
# output_to_tsv(queries, rankings)