
# INFS7410 Week 9 Practical

##### version 1.0

###### The INFS7410 Teaching Team

##### Tutorial Etiquette:
*Please refrain from loud noises, irrelevant conversations and use of mobile phones during practical activities. Be respectful of everyone's opinions and ideas during the practical activities. You will be asked to leave if you disturb. Remember the tutor is there to help you understand and learn, not to provide debugging of your code or solutions to assignments.*


---

### About today's Practical
In this week's practical, you will learn about and implement a powerful ranking model: the BERT reranker (also known as monoBERT). Unlike traditional bag-of-words models such as BM25 or feature-based learning-to-rank models, the BERT reranker uses a transformer model (BERT) to encode query-document pairs and directly output estimated relevance scores, following the model architecture shown in the lecture:![monobert](monobert.png)
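
As a rough illustration of what "encoding a query-document pair" means, the sketch below lays out the single input sequence that a BERT-style cross-encoder receives: `[CLS] query [SEP] passage [SEP]`, with segment (token type) ids of 0 for the query part and 1 for the passage part. This is a simplification under stated assumptions: whitespace splitting stands in for BERT's WordPiece tokenizer, only the passage is truncated, and `build_pair_input` is a hypothetical helper, not part of `transformers`.

```python
def build_pair_input(query: str, passage: str, max_length: int = 512):
    """Lay out a query-passage pair the way a BERT cross-encoder sees it.

    Assumption: whitespace tokens stand in for WordPiece tokens.
    """
    q_tokens = query.split()
    p_tokens = passage.split()

    # Reserve room for [CLS] and two [SEP] tokens, then truncate the passage.
    budget = max_length - len(q_tokens) - 3
    p_tokens = p_tokens[:max(budget, 0)]

    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    # Segment ids: 0 for [CLS] + query + first [SEP], 1 for passage + final [SEP].
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_input("what is ir", "ir is information retrieval")
print(tokens)
print(token_type_ids)
```

Because the query and passage share one sequence, every transformer layer can attend across both, which is what lets the model judge relevance directly rather than comparing two separately computed representations.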

To make our lives easier, in this practical we use `transformers`, the most popular Python transformer library, developed by [Huggingface](https://huggingface.co/transformers/), to test the effectiveness of monoBERT. If you are interested in transformer models and NLP, we highly recommend having a look at their [lectures](https://huggingface.co/course/chapter1).

First, run the following cell to install transformers and PyTorch:

In [1]:
!pip install transformers==4.6.1 torch==1.8.1

Collecting transformers==4.6.1
  Downloading transformers-4.6.1-py3-none-any.whl (2.2 MB)
Collecting torch==1.8.1
  Downloading torch-1.8.1-cp37-none-macosx_10_9_x86_64.whl (119.5 MB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Installing collected packages: huggingface-hub, transformers, torch
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.0.12
    Uninstalling huggingface-hub-0.0.12:
      Successfully uninstalled huggingface-hub-0.0.12
  Attempting uninstall: transformers
    Found existing installation: transformers 4.9.1
    Uninstalling transformers-4.9.1:
      Successfully uninstalled transformers-4.9.1
Successfully installed huggingface-hub-0.0.8 torch-1.8.1 transformers

Then, run the following cell to define our monoBERT scoring function.

Some things to be aware of before you continue with the exercises:
- We do not train monoBERT ourselves because training it requires powerful GPUs. Instead, we use an off-the-shelf monoBERT model provided by the pyserini team. The paper ["Passage Re-ranking with BERT"](https://arxiv.org/pdf/1901.04085.pdf) describes in detail how the monoBERT model used in this practical was trained.

- The calls `monoBERT = AutoModelForSequenceClassification.from_pretrained('castorini/monobert-large-msmarco', cache_dir="./cache")` and `tokenizer = AutoTokenizer.from_pretrained('castorini/monobert-large-msmarco', cache_dir="./cache")` download the monoBERT model (around 1.3 GB) and its tokenizer into the cache folder. This may be slow, depending on your location. If it is too slow for you, try downloading the model from our Google Drive [link](https://drive.google.com/file/d/1MGSCmOWoKpzSywapFmZ4RyjQSrDKGspw/view?usp=sharing). Unzip the folder and change `castorini/monobert-large-msmarco` to the folder path for both the model and the tokenizer.
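
One way to handle both options without editing the loading code each time is to pick the model source at run time. The sketch below is only an illustration: `resolve_model_source` is a hypothetical helper, the folder name `./monobert-large-msmarco` is an assumed unzip location, and the check relies on the fact that a model folder saved in the Hugging Face format contains a `config.json` file.

```python
from pathlib import Path

HUB_NAME = "castorini/monobert-large-msmarco"


def resolve_model_source(local_dir: str, hub_name: str = HUB_NAME) -> str:
    """Prefer an unzipped local copy of the model; otherwise use the Hub name."""
    # A folder saved in the Hugging Face format contains a config.json file.
    if (Path(local_dir) / "config.json").is_file():
        return local_dir
    return hub_name


# Assumed unzip location; adjust to wherever you extracted the archive.
source = resolve_model_source("./monobert-large-msmarco")
print(source)
```

The returned string can then be passed as the first argument to both `from_pretrained` calls.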

In [13]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Local folder with the unzipped model; use 'castorini/monobert-large-msmarco' to download it instead.
monoBERT = AutoModelForSequenceClassification.from_pretrained('monobert-large-msmarco', cache_dir="./cache")
tokenizer = AutoTokenizer.from_pretrained('monobert-large-msmarco', cache_dir="./cache")
# Move the model to the same device as the inputs and switch to inference mode.
monoBERT.to(DEVICE)
monoBERT.eval()

def monoBERT_score(query: str, passage: str):
    # Encode the pair as one sequence: [CLS] query [SEP] passage [SEP].
    ret = tokenizer.encode_plus(query,
                                passage,
                                max_length=512,
                                truncation=True,
                                return_token_type_ids=True,
                                return_tensors='pt')
    input_ids = ret['input_ids'].to(DEVICE)
    tt_ids = ret['token_type_ids'].to(DEVICE)
    with torch.no_grad():
        output, = monoBERT(input_ids, token_type_ids=tt_ids, return_dict=False)
        # Softmax over the two logits; the last entry is the probability of "relevant".
        score = torch.nn.functional.softmax(output, 1)[0, -1].cpu().item()

    return score

Now let us try the following example to check that monoBERT works:

In [80]:
query = "what is information retrieval"
passage1 = "Information retrieval (IR) is the process of obtaining information system resources that are relevant to an information need from a collection of those resources."
passage2 = "Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus."

score1 = monoBERT_score(query, passage1)
score2 = monoBERT_score(query, passage2)

print("Relevance score for passage 1:", score1)
print("Relevance score for passage 2:", score2)

Relevance score for passage 1: 0.9972731471061707
Relevance score for passage 2: 8.926082955440506e-06


This means monoBERT considers passage 1 relevant with 99.7% confidence, and has almost no confidence that passage 2 is relevant.
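
To see where such a probability comes from, recall that `monoBERT_score` applies a softmax to the model's two output logits and keeps the last entry, the probability of the "relevant" class. A minimal pure-Python sketch with made-up logits (the values are illustrative, not actual model outputs):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order [not relevant, relevant].
logits = [-2.5, 3.4]
probs = softmax(logits)
relevance_score = probs[-1]
print(f"P(relevant) = {relevance_score:.4f}")
```

Since the two probabilities sum to 1, the score depends only on the gap between the two logits: a large positive gap pushes the score towards 1, and a large negative gap pushes it towards 0.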

---

# Exercise: Build BM25 + monoBERT re-ranking pipeline

In this exercise, you need to implement the monoBERT re-ranking pipeline, which uses BM25 as the first-stage retriever and then uses monoBERT to re-rank the top k documents. This picture describes the pipeline:
![monobert_ranking](monobert_ranking.png)

Hints:
- You can use the pyserini searcher to get the BM25 ranking list, and use `hit.raw` to get the raw text, as we did in the Week 2 practical.
- You need to sort the top k docids by their monoBERT scores (re-ranking) before writing them to the run file.
- monoBERT inference is very expensive, so you cannot re-rank to a very deep cut-off (try small numbers first, such as the top 10 or 20). It is much faster on a GPU, though; if you do not have a GPU machine, you can try running this notebook on [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb), which provides GPU resources for free.
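
The hints above amount to three steps: take the top k first-stage hits, score each one with the re-ranker, and sort by the new score before writing the run file. The sketch below shows the shape of this with a toy term-overlap scorer standing in for `monoBERT_score` and a plain list standing in for pyserini hits. `rerank`, `format_run`, `toy_score`, the example passages and the run tag are all hypothetical; the six-column line layout is the standard TREC run format.

```python
from typing import Callable, List, Tuple


def rerank(query: str,
           candidates: List[Tuple[str, str]],
           score_fn: Callable[[str, str], float],
           k: int = 10) -> List[Tuple[str, float]]:
    """Re-score the top-k (docid, text) candidates and sort by the new score.

    Candidates beyond rank k are dropped in this sketch.
    """
    scored = [(docid, score_fn(query, text)) for docid, text in candidates[:k]]
    return sorted(scored, key=lambda x: x[1], reverse=True)


def format_run(qid: str, ranking: List[Tuple[str, float]],
               tag: str = "bm25-monobert") -> List[str]:
    """Format a ranking as TREC run lines: qid Q0 docid rank score tag."""
    return [f"{qid} Q0 {docid} {rank} {score:.6f} {tag}"
            for rank, (docid, score) in enumerate(ranking, start=1)]


def toy_score(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query terms that appear in the text."""
    q_terms = set(query.lower().split())
    return len(q_terms & set(text.lower().split())) / len(q_terms)


# Hypothetical first-stage ranking, best BM25 hit first.
bm25_hits = [("d1", "memory retrieval in psychology"),
             ("d2", "information retrieval is the science of search"),
             ("d3", "what is a database")]
ranking = rerank("what is information retrieval", bm25_hits, toy_score, k=3)
for line in format_run("q1", ranking):
    print(line)
```

Passing the scorer in as `score_fn` keeps the pipeline logic independent of the model, so you can debug it with a cheap scorer before switching to monoBERT.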

In [79]:
# TODO: Build the pipeline
from pyserini.search import SimpleSearcher
from pyserini.analysis import Analyzer, get_lucene_analyzer
from pyserini.index import IndexReader
from collections import Counter
import math
import json

searcher = SimpleSearcher('indexes/lucene-index-msmarco-passage-noProcessing/')
searcher.set_analyzer(get_lucene_analyzer(stemming=False, stopwords=False))

query = 'what is information retrieval?'
hits = searcher.search(query)
for i in range(0, 5):
    print(f'{i+1} {hits[i].docid} {hits[i].raw}')


lucene_analyzer = get_lucene_analyzer(stemming=False, stopwords=False)
analyzer = Analyzer(lucene_analyzer)
searcher = SimpleSearcher('indexes/lucene-index-msmarco-passage-vectors-noProcessing/')
searcher.set_analyzer(lucene_analyzer)

index_reader = IndexReader('indexes/lucene-index-msmarco-passage-vectors-noProcessing/')

def search(query: str, k: int = 10):
    """Return the top-k BM25 hits as (passage_json, bm25_score) tuples."""
    hits = searcher.search(query, k=k)
    q_terms = set(analyzer.analyze(query))
    result = []
    for hit in hits:
        # Terms that occur in both the query and the passage.
        doc_terms = index_reader.get_document_vector(hit.docid).keys()
        matched = q_terms & set(doc_terms)
        # Sum the per-term BM25 weights to get the passage's BM25 score.
        bm25_score = sum(
            index_reader.compute_bm25_term_weight(hit.docid, term, analyzer=None)
            for term in matched
        )
        result.append((json.loads(hit.raw), bm25_score))
    return result


# First stage: retrieve the top 100 passages with BM25.
bm25 = search(query, k=100)

# Second stage: score every retrieved passage with monoBERT.
mono = []
for content, _ in sorted(bm25, key=lambda x: x[1], reverse=True):
    score = monoBERT_score(query, content["contents"])
    mono.append((score, content))

# Re-rank by monoBERT score, highest first.
mono.sort(key=lambda x: x[0], reverse=True)

for mo in mono:
    print(mo)

1 1404848 {
  "id" : "1404848",
  "contents" : "Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for document themselves, searching for metadata that describe data and for databases such as text, image or sound. Automated information retrieval systems are used to reduce what has been called information overload."
}
2 7668955 {
  "id" : "7668955",
  "contents" : "Psychologists distinguish among three necessary stages in the learning and memory process: encoding, storage, and retrieval (Melton, 1963). Encoding is defined as initial learning of information; storage refers to maintaining information over time; retrieval is the ability to access information when you need it.he key to improving oneâs memory is to improve processes