
# INFS7410 Week 10 Practical

##### version 1.3

###### The INFS7410 Teaching Team

##### Tutorial Etiquette:
*Please refrain from loud noises, irrelevant conversations and use of mobile phones during practical activities. Be respectful of everyone's opinions and ideas during the practical activities. You will be asked to leave if you disturb others. Remember the tutor is there to help you understand and learn, not to debug your code or provide solutions to assignments.*


---

### About today's Practical
In this week's practical, you will learn about and implement a dense retriever called ANCE and a contextualized exact term matching method called TILDEv2. Unlike monoBERT, these methods can pre-compute document representations offline and only need to encode the query during online inference, so they enjoy low query latency. However, since a GPU is required to encode the whole passage collection, in this practical we still compute document representations "on the fly", just as we did for monoBERT in the last practical. For your project, we will provide pre-computed passage representations for you to download.
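
The offline/online split described above can be sketched as follows. This is only a minimal illustration, not the ANCE implementation: `toy_encode` is a hypothetical stand-in for a neural encoder, and the file name `passage_embeddings.npy` and the three example passages are assumptions made for the sketch.

```python
import os
import tempfile
import zlib

import numpy as np

DIM = 8  # toy embedding size; the ANCE embeddings in this practical have 768 dimensions


def toy_encode(text, dim=DIM):
    """Hypothetical stand-in for a neural encoder: a deterministic pseudo-random vector per text."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    return rng.standard_normal(dim).astype(np.float32)


passages = [
    "airport lounge access programme",
    "dolphins kept in enclosures",
    "priority boarding rules",
]

# Offline: encode every passage once and store the embedding matrix on disk.
path = os.path.join(tempfile.mkdtemp(), "passage_embeddings.npy")
np.save(path, np.stack([toy_encode(p) for p in passages]))

# Online: load the stored matrix and encode only the query.
passage_matrix = np.load(path)
query_vector = toy_encode("what is priority pass")
scores = passage_matrix @ query_vector  # one dot product per passage
print(scores.shape)
```

At query time only one encoder call is needed (for the query); the passage side reduces to loading a matrix and computing dot products.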



## Approximate Nearest Neighbor Search with ANCE

ANCE is a typical dense retriever, proposed in this research [paper](https://arxiv.org/pdf/2007.00808.pdf). It pre-computes the document embeddings, computes the query embedding on the fly, and then uses the dot product to compute the similarity between the query and each document. The model architecture, as shown in the lecture, is:

![ANCE.png](ANCE.png)

First, download the model from this [link](https://drive.google.com/file/d/1rbi-C6Ku5p1fG0ivL7cE3zunxzk60usA/view?usp=sharing), unzip it and put it in the same folder as this notebook, then run the following cell to initialize the model and tokenizer.

In [6]:
from modeling import AnceModel
from transformers import AutoTokenizer
device = 'cpu'
ance_model = AnceModel.from_pretrained('ANCE_Model').eval()
ance_model.to(device)
ance_tokenizer = AutoTokenizer.from_pretrained('ANCE_Model')

Now let's use the ANCE model to compute the embedding of the query "what is priority pass".

In [7]:
query = 'what is priority pass'
inputs = ance_tokenizer(
            [query],
            max_length=64,
            padding='longest',
            truncation=True,
            add_special_tokens=True,
            return_tensors='pt'
        )
query_embeddings = ance_model(inputs["input_ids"]).detach().cpu().numpy().flatten()
print(query_embeddings)

[-1.14771283e+00  2.84968227e-01 -6.51907682e-01  3.53783146e-02
 -4.06788290e-01 -5.92642009e-01  1.45296276e+00  4.34321091e-02
  1.87019634e+00  5.45588970e-01  4.63530093e-01  1.34180403e+00
  1.18982032e-01 -1.52603483e+00 -2.41587698e-01 -2.20729017e+00
  1.03682518e+00  2.32850671e-01 -8.05468932e-02 -9.90941346e-01
 -4.33072895e-01  8.97244215e-01 -5.80886245e-01 -6.69079125e-01
  7.66609907e-01 -4.21835512e-01 -1.94484517e-02 -8.76279056e-01
  1.16014814e+00  6.31652176e-01 -2.08898067e-01 -1.25032341e+00
  1.07165284e-01  5.60523927e-01  1.26234651e+00 -1.81427896e+00
  3.76601785e-01 -1.95675898e+00  2.46329337e-01  7.16556132e-01
  8.58545229e-02  7.42675781e-01 -1.56575203e+00  4.38455269e-02
  8.25252056e-01  1.64467371e+00 -2.11803341e+00 -1.18672204e+00
 -2.99669236e-01 -4.55030590e-01  6.03962615e-02  1.31918323e+00
 -1.39826381e+00 -1.63742101e+00 -2.34666601e-01 -2.26846266e+00
 -3.45829457e-01 -4.09893811e-01 -8.11094940e-01  1.45112848e+00
 -1.34125590e+00  8.86524

Now that we have the query embedding, we compute two passage embeddings: passage 1 is highly relevant to the query, and passage 2 is not related to it.

In [8]:
passage_1 = 'Priority Pass is the world’s leading airport lounge access programme, providing an airport lounge for wherever your travel takes you, regardless of your class of travel or airline flown.'
passage_2 = 'Sea World and SeaWorld both keep animals in cramped and unnatural enclosures, and they both force dolphins to participate in shows and breeding programmes.'
passage1_inputs = ance_tokenizer(
            [passage_1],
            max_length=512,
            padding='longest',
            truncation=True,
            add_special_tokens=True,
            return_tensors='pt'
        )
passage2_inputs = ance_tokenizer(
            [passage_2],
            max_length=512,
            padding='longest',
            truncation=True,
            add_special_tokens=True,
            return_tensors='pt'
        )
passage1_inputs.to(device)
passage2_inputs.to(device)
passage1_embeddings = ance_model(passage1_inputs["input_ids"]).detach().cpu().numpy().flatten()
passage2_embeddings = ance_model(passage2_inputs["input_ids"]).detach().cpu().numpy().flatten()
print(passage1_embeddings.shape)
print(passage2_embeddings.shape)

(768,)
(768,)


Now that we have the query embedding and the two passage embeddings, we can use the dot product to calculate the similarity between the query and each passage.

In [9]:
import numpy as np
score_1 = np.dot(query_embeddings, passage1_embeddings)
score_2 = np.dot(query_embeddings, passage2_embeddings)
print(score_1)
print(score_2)

713.283
697.2628
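
When there are more than two passages, the same dot products can be computed with a single matrix-vector product and then sorted. A minimal sketch with made-up 4-dimensional vectors (the values are illustrative only, not ANCE outputs):

```python
import numpy as np

# Made-up embeddings for illustration only.
query_vec = np.array([1.0, 0.0, 2.0, -1.0])
passage_vecs = np.array([
    [0.5, 0.1, 1.5, -0.5],   # passage 0
    [-1.0, 2.0, 0.0, 1.0],   # passage 1
    [1.0, 0.0, 1.0, 0.0],    # passage 2
])

scores = passage_vecs @ query_vec   # dot product of the query with each row
ranking = np.argsort(-scores)       # passage indices, highest score first

for rank, idx in enumerate(ranking, start=1):
    print(rank, idx, scores[idx])
```

For ranking, only the relative order of the scores matters; the absolute values are not probabilities.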


### Exercise: build a two-stage ranking pipeline with BM25 + ANCE
We provide the following function `ance_encode` for you to compute the embedding (vector) for a given piece of text. Similar to last week's exercise, where you built a ranking pipeline with BM25 + monoBERT, you can use pyserini's built-in `SimpleSearcher` to perform BM25 retrieval and use each hit's `raw` field (e.g. `hits[i].raw`) to get the document text.

In [11]:
def ance_encode(text, device='cpu', max_length=64):
    # tokenize the text (a query or a passage);
    # the default max_length=64 suits queries, while the passages above were encoded with max_length=512
    inputs = ance_tokenizer(
            [text],
            max_length=max_length,
            padding='longest',
            truncation=True,
            add_special_tokens=True,
            return_tensors='pt'
        )
    # move the inputs to the device the model is on
    # use 'cuda:0' if you are using GPU
    inputs.to(device)
    # compute the embedding and return it as a flat numpy vector
    embeddings = ance_model(inputs["input_ids"]).detach().cpu().numpy().flatten()
    return embeddings
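
The overall shape of the two-stage pipeline in this exercise can be sketched without any model or index. In the sketch below, everything is a hypothetical stand-in: `first_stage` imitates BM25 with a simple term-overlap count, `toy_count_encode` imitates an encoder such as `ance_encode`, and the three-document `collection` is made up.

```python
import numpy as np


def first_stage(query, collection, k):
    """Toy first stage (stand-in for BM25): rank by the number of shared lowercase terms."""
    q_terms = set(query.lower().split())
    scored = [
        (len(q_terms & set(text.lower().split())), doc_id)
        for doc_id, text in collection.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


def rerank(query, candidate_ids, collection, encode):
    """Second stage: score each candidate by the dot product of query and passage embeddings."""
    q_vec = encode(query)
    scored = [
        (float(np.dot(q_vec, encode(collection[doc_id]))), doc_id)
        for doc_id in candidate_ids
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def toy_count_encode(text):
    """Hypothetical encoder: counts of three hand-picked words."""
    words = text.lower().split()
    return np.array([words.count(w) for w in ("priority", "pass", "lounge")], dtype=float)


collection = {
    "d1": "priority pass lounge access",
    "d2": "priority mail",
    "d3": "sea world dolphins",
}
candidates = first_stage("what is priority pass", collection, k=2)
reranked = rerank("what is priority pass", candidates, collection, toy_count_encode)
print(reranked)
```

In the exercise itself, the first stage would be the pyserini BM25 search and the `encode` argument would be `ance_encode`.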

In [26]:
from pyserini.search import SimpleSearcher
from pyserini.analysis import Analyzer, get_lucene_analyzer
from pyserini.index import IndexReader
from collections import Counter
import json

searcher = SimpleSearcher('indexes/lucene-index-msmarco-passage-noProcessing/')
searcher.set_analyzer(get_lucene_analyzer(stemming=False, stopwords=False))

hits = searcher.search(query)
for i in range(0, 5):
    print(f'{i+1} {hits[i].docid} {hits[i].raw}')


lucene_analyzer = get_lucene_analyzer(stemming=False, stopwords=False)
analyzer = Analyzer(lucene_analyzer)
searcher = SimpleSearcher('indexes/lucene-index-msmarco-passage-vectors-noProcessing/')
searcher.set_analyzer(lucene_analyzer)

index_reader = IndexReader('indexes/lucene-index-msmarco-passage-vectors-noProcessing/')

def search(k: int=10):
    # first-stage BM25 retrieval for the global `query`
    hits = searcher.search(query, k=k)
    q_terms = analyzer.analyze(query)
    result = []
    for hit in hits:
        # term frequencies of the document, keyed by analyzed term
        tf = index_reader.get_document_vector(hit.docid)

        # query terms that also occur in the document
        c = list((Counter(q_terms) & Counter(tf.keys())).elements())

        # sum the BM25 weights of the matching terms
        bm25_score = 0
        for term in c:
            bm25_score += index_reader.compute_bm25_term_weight(hit.docid, term, analyzer=None)
        content = json.loads(hit.raw)
        result.append((content, bm25_score))
    return result


bm25 = search(k=50)
bm25.sort(key=lambda x:x[1],reverse=True)


query_embedding = ance_encode(query)

result = []

for i in bm25:
    passage_embedding = ance_encode(i[0]["contents"])
    score = np.dot(query_embedding, passage_embedding)
    result.append((score, i[0]["id"]))
    
result.sort(key=lambda x:x[0], reverse=True)    

# print the top 5 re-ranked passages as (score, id) pairs
for item in result[:5]:
    print(item)




1 8636705 {
  "id" : "8636705",
  "contents" : "Priority Pass Coupons. The 4 most popular Priority Pass coupons & PriorityPass promo codes for March 2017. Priority Pass provides airport VIP lounge access irrespective of who you are flying with, what class you are traveling in, or whether you belong to an airline lounge program."
}
2 8636699 {
  "id" : "8636699",
  "contents" : "Priority Pass Coupons. Rated 4.5 from 88 votes. The 4 most popular Priority Pass coupons & PriorityPass promo codes for March 2017. Priority Pass provides airport VIP lounge access irrespective of who you are flying with, what class you are traveling in, or whether you belong to an airline lounge program."
}
3 8636702 {
  "id" : "8636702",
  "contents" : "Priority Pass Promo Codes. There are 10 priority pass coupon codes, coupons, discounts for you to consider including 10 prioritypass.com promo codes and 0 deals in April 2017. Priority Pass is the world’s largest independent airport lounge access program."
}


## Compute contextualized term weights with TILDEv2

Unlike dense retrievers, TILDEv2, which was proposed in our research [paper](https://arxiv.org/pdf/2108.08513.pdf), uses contextualized term weights to re-rank documents. In particular, instead of relying on traditional bag-of-words statistics as BM25 and TF-IDF do, TILDEv2 exploits BERT to compute contextualized term weights, as in the model architecture shown in the lecture:

![tildev2.png](tildev2.png)

Documents can then be re-ranked by summing the weights of the document terms that also appear in the query. Let's first check out how to use TILDEv2 to compute contextualized term weights.

First, download the model from this [link](https://drive.google.com/file/d/1g7oA9EZqpnQDNn4Uwv7atsMMk1g3P-Im/view?usp=sharing), unzip it and put it in the same folder as this notebook, then run the following cell to initialize the model and tokenizer.

In [19]:
from modeling import TILDEv2
from transformers import AutoTokenizer

model = TILDEv2.from_pretrained("tildev2-noexp").eval()
tokenizer = AutoTokenizer.from_pretrained("tildev2-noexp")

Now, let's use TILDEv2 to compute contextualized term weights of the passage: "Unlike cats, dogs are usually great exercise pals. Many breads enjoy running and hiking, and will happily trek along on any trip."

In [20]:
text = "Unlike cats, dogs are usually great exercise pals. Many breads enjoy running and hiking, \
and will happily trek along on any trip."
inputs = tokenizer(text, return_tensors='pt')
token_ids, token_weights = model.encode(**inputs)

The cell below prints out the term weight for each token in the passage. However, before you continue, here is something to think about:
- If the scoring function were just term frequency, what would the term weights for the terms "cats" and "dogs" be?
- Which term should be more important in this passage, cats or dogs?

Now let's run the following cell to print out the contextualized term weights computed by TILDEv2:

In [21]:
for token_id, weight in zip(token_ids, token_weights):
    print(f"{tokenizer.decode(token_id)}: {weight}")

unlike: 3.254988431930542
cats: 3.862273931503296
dogs: 4.299813747406006
usually: 1.7652360200881958
great: 1.009302020072937
exercise: 4.089148044586182
pal: 7.571357250213623
many: 0.0
bread: 8.876324653625488
enjoy: 3.8195455074310303
running: 3.348376989364624
hiking: 3.4978909492492676
happily: 4.266161918640137
trek: 3.4675521850585938
along: 2.8126070499420166
trip: 2.9206016063690186


What do you observe? Do the term weights make sense to you? Note that TILDEv2 removes some predefined stopwords, such as `and`, as well as special tokens. You can think of TILDEv2 as giving zero weight to those tokens.
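
To make the re-ranking rule concrete, here is a minimal sketch that scores two toy queries against a few of the weights printed above (rounded to two decimals). It is a simplification: it matches whole words in a plain Python list, whereas TILDEv2 matches token ids, and `toy_tilde_score` is a hypothetical helper, not part of the provided `modeling` module. Taking the largest weight when a term occurs more than once mirrors the scoring function provided in the exercise below.

```python
# Weights copied (rounded) from the printed output; word-level matching is a simplification.
passage_weights = [("cats", 3.86), ("dogs", 4.30), ("exercise", 4.09), ("trip", 2.92)]


def toy_tilde_score(query_terms, weighted_tokens):
    """Sum, over query terms, the largest weight among the matching passage tokens."""
    score = 0.0
    for term in query_terms:
        matches = [weight for token, weight in weighted_tokens if token == term]
        if matches:  # query terms that do not occur in the passage contribute nothing
            score += max(matches)
    return score


print(toy_tilde_score(["dogs", "trip"], passage_weights))
print(toy_tilde_score(["cats", "food"], passage_weights))
```

Query terms that are absent from the passage, such as "food" here, add nothing to the score, which is why this is an exact term matching method.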

### Exercise: build a two-stage ranking pipeline with BM25 + TILDEv2

We provide the following function `tildev2_scoreing` for you to compute the relevance score given a query-document text pair. Similar to last week's exercise, where you built a ranking pipeline with BM25 + monoBERT, you can use pyserini's built-in `SimpleSearcher` to perform BM25 retrieval and use each hit's `raw` field (e.g. `hits[i].raw`) to get the document text.

In [22]:
import numpy as np
stop_ids = model.get_stop_ids(tokenizer)

def tildev2_scoreing(query, document):
    # get document term weights
    inputs = tokenizer(document, return_tensors='pt')
    token_ids, token_weights = model.encode(**inputs)
    token_ids = np.array(token_ids)
    token_weights = np.array(token_weights)
    
    # get query token ids
    query_ids = tokenizer(query, add_special_tokens=False)["input_ids"]
    query_ids = [tok_id for tok_id in query_ids if tok_id not in stop_ids]  # remove stopwords for query
    
    # use query token ids to match term weights in the document
    token_idx = [np.where(token_ids == tok_id) for tok_id in query_ids]
    score = 0
    for idx in token_idx:
        if len(idx[0]) != 0:
            score += np.max(token_weights[idx])  # if a query term appears multiple times in the passage, use the max score
    return score

In [None]:
print(tildev2_scoreing("I like dogs", text))
print(tildev2_scoreing("I like cats", text))

In [27]:
from pyserini.search import SimpleSearcher
from pyserini.analysis import Analyzer, get_lucene_analyzer
from pyserini.index import IndexReader
from collections import Counter
import json

searcher = SimpleSearcher('indexes/lucene-index-msmarco-passage-noProcessing/')
searcher.set_analyzer(get_lucene_analyzer(stemming=False, stopwords=False))

hits = searcher.search(query)
for i in range(0, 5):
    print(f'{i+1} {hits[i].docid} {hits[i].raw}')


lucene_analyzer = get_lucene_analyzer(stemming=False, stopwords=False)
analyzer = Analyzer(lucene_analyzer)
searcher = SimpleSearcher('indexes/lucene-index-msmarco-passage-vectors-noProcessing/')
searcher.set_analyzer(lucene_analyzer)

index_reader = IndexReader('indexes/lucene-index-msmarco-passage-vectors-noProcessing/')

def search(k: int=10):
    # first-stage BM25 retrieval for the global `query`
    hits = searcher.search(query, k=k)
    q_terms = analyzer.analyze(query)
    result = []
    for hit in hits:
        # term frequencies of the document, keyed by analyzed term
        tf = index_reader.get_document_vector(hit.docid)

        # query terms that also occur in the document
        c = list((Counter(q_terms) & Counter(tf.keys())).elements())

        # sum the BM25 weights of the matching terms
        bm25_score = 0
        for term in c:
            bm25_score += index_reader.compute_bm25_term_weight(hit.docid, term, analyzer=None)
        content = json.loads(hit.raw)
        result.append((content, bm25_score))
    return result


bm25 = search(k=50)
bm25.sort(key=lambda x:x[1],reverse=True)

result = []

for i in bm25:
    score = tildev2_scoreing(query, i[0]["contents"])
    result.append((score, i[0]["id"]))
    
result.sort(key=lambda x:x[0], reverse=True)    


# print the top 5 re-ranked passages as (score, id) pairs
for item in result[:5]:
    print(item)

1 8636705 {
  "id" : "8636705",
  "contents" : "Priority Pass Coupons. The 4 most popular Priority Pass coupons & PriorityPass promo codes for March 2017. Priority Pass provides airport VIP lounge access irrespective of who you are flying with, what class you are traveling in, or whether you belong to an airline lounge program."
}
2 8636699 {
  "id" : "8636699",
  "contents" : "Priority Pass Coupons. Rated 4.5 from 88 votes. The 4 most popular Priority Pass coupons & PriorityPass promo codes for March 2017. Priority Pass provides airport VIP lounge access irrespective of who you are flying with, what class you are traveling in, or whether you belong to an airline lounge program."
}
3 8636702 {
  "id" : "8636702",
  "contents" : "Priority Pass Promo Codes. There are 10 priority pass coupon codes, coupons, discounts for you to consider including 10 prioritypass.com promo codes and 0 deals in April 2017. Priority Pass is the world’s largest independent airport lounge access program."
}
