# Introduction
This notebook walks through the basics of learning to rank (L2R) using [pygaggle](https://github.com/castorini/pygaggle).

The following CLI command evaluates the monoT5 reranker on the dev split of the MS MARCO passage ranking task:

```shell
python -um pygaggle.run.evaluate_passage_ranker --split dev \
                                                --method t5 \
                                                --model castorini/monot5-base-msmarco \
                                                --dataset collections/msmarco-passage \
                                                --model-type t5-base \
                                                --task msmarco \
                                                --index-dir indexes/msmarco-passage \
                                                --batch-size 32 \
                                                --output-file runs/run.monot5.ans_small.dev.tsv
```

Instead of using the CLI, we will do this directly from Python. First, we import the relevant libraries and the (pretrained) ranking algorithm `MonoT5` ([blog post on T5](https://towardsdatascience.com/asking-the-right-questions-training-a-t5-transformer-model-on-a-new-task-691ebba2d72c), [paper](https://arxiv.org/pdf/2003.06713.pdf)):

In [1]:
from pygaggle.rerank.base import Query, Text, hits_to_texts
from pygaggle.rerank.transformer import MonoT5

# Loads the pretrained checkpoint (downloaded on first use)
reranker = MonoT5()

2021-02-27 15:22:53 [INFO] loader: Loading faiss.


Some weights of the model checkpoint at castorini/monot5-base-msmarco were not used when initializing T5ForConditionalGeneration: ['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Next, we fetch passages from the MS MARCO dataset and re-rank them for a query.

**Why re-rank instead of rank?** Scoring every passage in the corpus with a learned model is computationally infeasible, so a conventional retrieval method (here BM25) first produces a ranked list of _candidates_, which the L2R algorithm then re-ranks.
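
The two-stage idea can be illustrated with a toy sketch in plain Python. Both scoring functions below are hypothetical stand-ins chosen for illustration (term overlap for the cheap first stage, bigram overlap for the costly second stage); they are not what Pyserini or pygaggle compute.

```python
# Toy two-stage ranking: a cheap first stage narrows the corpus,
# a costlier second stage re-orders only the surviving candidates.
# cheap_score and expensive_score are hypothetical stand-ins.

def cheap_score(query, passage):
    """First stage: number of distinct query terms found in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def bigrams(tokens):
    """Set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))

def expensive_score(query, passage):
    """Second stage stand-in: number of shared bigrams."""
    return len(bigrams(query.lower().split()) & bigrams(passage.lower().split()))

def retrieve_then_rerank(query, corpus, k=3):
    # Stage 1: score every passage cheaply and keep the top-k candidates.
    candidates = sorted(corpus, key=lambda p: cheap_score(query, p), reverse=True)[:k]
    # Stage 2: apply the costly scorer to k passages only, not the whole corpus.
    return sorted(candidates, key=lambda p: expensive_score(query, p), reverse=True)

corpus = [
    "the geocentric theory was proposed by ptolemy",
    "copernicus proposed the heliocentric theory",
    "bananas are rich in potassium",
    "the theory of relativity was proposed by einstein",
    "ptolemy lived in alexandria",
]
top = retrieve_then_rerank("who proposed the geocentric theory", corpus, k=3)
for rank, passage in enumerate(top, start=1):
    print(rank, passage)
```

The second stage runs on only `k` passages regardless of corpus size, which is what makes an expensive model affordable.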

We run the example from the pygaggle [GitHub page](https://github.com/castorini/pygaggle) as a toy problem:

In [28]:
# Here's our query:
query = Query('who proposed the geocentric theory')

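# Fetch some passages to rerank from MS MARCO with Pyserini (BM25)
from pyserini.search import SimpleSearcher
searcher = SimpleSearcher('indexes/msmarco-passage/lucene-index-msmarco')
hits = searcher.search(query.text)
# Convert the Pyserini hits into pygaggle Text objects, which the reranker expects
texts = hits_to_texts(hits)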

# Optionally: print out the passages prior to reranking (might be interesting to see how the order changes):
# for i in range(0, 10):
#     print(f'{i+1:2} {texts[i].metadata["docid"]:15} {texts[i].score:.5f} {texts[i].text}')

# Rerank:
reranked = reranker.rerank(query, texts)
reranked.sort(key=lambda x: x.score, reverse=True)
print(reranked[0].metadata)
# Print out reranked results:
for i in range(0, 10):
    print(f'{i+1:2}, score: {reranked[i].score:.5f}, document: {reranked[i].text}')

{'raw': '{\n  "id" : "7744105",\n  "contents" : "For Earth-centered it was  Geocentric Theory proposed by greeks under the guidance of Ptolemy and Sun-centered was Heliocentric theory proposed by Nicolas Copernicus in 16th century A.D. In short, Your Answers are: 1st blank - Geo-Centric Theory. 2nd blank - Heliocentric Theory."\n}', 'docid': '7744105'}
 1, score: -0.00887, document: {
  "id" : "7744105",
  "contents" : "For Earth-centered it was  Geocentric Theory proposed by greeks under the guidance of Ptolemy and Sun-centered was Heliocentric theory proposed by Nicolas Copernicus in 16th century A.D. In short, Your Answers are: 1st blank - Geo-Centric Theory. 2nd blank - Heliocentric Theory."
}
 2, score: -0.01464, document: {
  "id" : "6217200",
  "contents" : "The geocentric model, also known as the Ptolemaic system, is a theory that was developed by philosophers in Ancient Greece and was named after the philosopher Claudius Ptolemy who lived circa 90 to 168 A.D. It was developed 

A single toy query is not enough to evaluate the ranker on the whole dev set, so next we load all queries from the dataset.

In [27]:
QUERIES_PATH = 'collections/msmarco-passage/queries.dev.small.tsv'
with open(QUERIES_PATH) as f:
    # Each line has the form '<query_id>\t<query_text>'; strip the trailing newline before splitting
    content = [line.strip().split('\t') for line in f]
# Query takes the text first and the id second
queries = [Query(text, qid) for qid, text in content]
for q in queries[:10]:
    print(q.id, q.text)

1048585 what is paula deen's brother
2  Androgen receptor define
524332 treating tension headaches without medication
1048642 what is paranoid sc
524447 treatment of varicose veins in legs
786674 what is prime rate in canada
1048876 who plays young dr mallard on ncis
1048917 what is operating system misconfiguration
786786 what is priority pass
524699 tricare service number


We are now ready to re-rank the candidates for all queries. First, we define functions that write the rankings to a file, in either `csv` or `tsv` format.

In [47]:
def output_to_csv(queries, rankings, file_path='runs/monot5.csv'):
    '''Write one line per passage: query_id,doc_id,rank,score (comma-separated).
    '''
    with open(file_path, 'w') as f:
        # rankings[i] is the reranked list of Text objects for queries[i]
        for q, ranking in zip(queries, rankings):
            for rank, text in enumerate(ranking):  # rank is 0-based
                fields = [q.id, text.metadata['docid'], rank, text.score]
                f.write(','.join(str(x) for x in fields) + '\n')

def output_to_tsv(queries, rankings, file_path='runs/monot5.tsv'):
    '''Write one line per passage: query_id, doc_id, rank, score (tab-separated).
    '''
    with open(file_path, 'w') as f:
        for q, ranking in zip(queries, rankings):
            for rank, text in enumerate(ranking):  # rank is 0-based
                fields = [q.id, text.metadata['docid'], rank, text.score]
                f.write('\t'.join(str(x) for x in fields) + '\n')

# Try both writers on the toy query from above
query.id = 11
output_to_csv([query], [reranked], 'testrun.csv')
output_to_tsv([query], [reranked], 'testrun.tsv')
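
With an output function in place, the remaining step is to loop over `queries`, retrieve candidates for each one, and rerank them. The sketch below shows one possible shape for that loop. The function `rerank_all` and its `retrieve` and `rerank` parameters are hypothetical names introduced here for illustration; the stubs at the bottom only stand in for Pyserini and pygaggle so that the block runs on its own.

```python
from types import SimpleNamespace

def rerank_all(queries, retrieve, rerank, k=1000):
    """Return one ranked list per query, best passage first.

    retrieve(query, k) -> list of candidate objects with .text, .score, .metadata
    rerank(query, candidates) -> the same objects with updated .score
    Both callables are hypothetical hooks, not part of pygaggle.
    """
    rankings = []
    for q in queries:
        candidates = retrieve(q, k)
        scored = rerank(q, candidates)
        rankings.append(sorted(scored, key=lambda t: t.score, reverse=True))
    return rankings

# Stand-in stubs so the sketch is runnable without an index or a model.
def fake_retrieve(q, k):
    passages = ['a b', 'a', 'c']
    return [SimpleNamespace(text=p, score=0.0, metadata={'docid': str(i)})
            for i, p in enumerate(passages[:k])]

def fake_rerank(q, candidates):
    # Prefer short passages that contain the query text.
    for c in candidates:
        c.score = -float(len(c.text)) if q.text in c.text else -100.0
    return candidates

demo_queries = [SimpleNamespace(id='11', text='a')]
rankings = rerank_all(demo_queries, fake_retrieve, fake_rerank, k=3)
print([t.metadata['docid'] for t in rankings[0]])
```

With the objects defined earlier in this notebook, the hooks could plausibly be `retrieve = lambda q, k: hits_to_texts(searcher.search(q.text, k=k))` and `rerank = reranker.rerank`, assuming the installed Pyserini version accepts `k` as a keyword argument. The resulting `rankings` list has the shape the output functions expect.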