# Context Retrieval Demo

In [1]:
import json 
import pandas as pd
from datasets import load_dataset
import os
import torch
from torch import nn
from datetime import datetime
import time
from sentence_transformers import SentenceTransformer, CrossEncoder, evaluation, losses, InputExample, datasets
from sentence_transformers import util as sentenceutils
import pickle
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
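# get_device_name raises an error on machines without a GPU, so check availability first
print("GPU is:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "not available (using CPU)")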

GPU is: Tesla T4


## Eli5 Dataset 

The Eli5 dataset consists of questions and answers written by Reddit users on a wide range of topics, collected from the “Explain like I’m 5” Reddit posts. It also contains relevant Wikipedia passages as supporting documents for each query and answer. The dataset was retrieved from [Hugging Face](https://huggingface.co/datasets/vblagoje/lfqa_support_docs).

#### Data Format 
The original Hugging Face dataset was pre-processed to separate the answers from the passages. The resulting format is as follows:
1. **id:** A unique ID for each query 
2. **input:** A unique query from a Reddit user 
3. **answer:** An answer from a Reddit user (some queries have more than one answer) 
4. **passages:** A set of 7 relevant Wikipedia passages. Each passage is a dictionary containing a unique Wikipedia ID, a title, the passage text, and a relevance score computed by a cross-encoder.

**Train Set:** ~ 223K records

#### Re-Ranking
The original dataset does not indicate which Wikipedia passages are most relevant, but the semantic search input requires one passage per query. To resolve this, a cross-encoder re-ranker was used to rank the 7 passages so that the top (most relevant) passage is selected for the input pairs. The answer column was also re-ranked to improve the performance of the answer generation model. 
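The selection step can be sketched as follows. This is a minimal illustration, assuming each passage dictionary carries a `cross-score` field as in the re-ranked data; `top_passage` is a hypothetical helper name and the scores in `toy_passages` are made-up values, not taken from the dataset.

```python
def top_passage(passages):
    """Return the passage dictionary with the highest cross-encoder score."""
    return max(passages, key=lambda p: p['cross-score'])

# toy record with made-up scores (illustration only)
toy_passages = [
    {'title': 'Wood drying', 'text': '...', 'cross-score': 0.12},
    {'title': 'Woodland', 'text': '...', 'cross-score': 0.99},
]

best = top_passage(toy_passages)
print(best['title'])
```

If the passages are already stored in descending score order, taking index 0 of the list gives the same result.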

In [3]:
# NOTE: usr_dir is not defined in the cells shown; set it to the project root before running
with open(usr_dir + '/data/Eli5/Eli5_reranked/eli5_train_reranked.json', 'r') as f:
    eli5 = json.load(f)

In [4]:
eli5 = pd.read_json(eli5, orient='records')

In [5]:
eli5.head()

Unnamed: 0,id,input,answer,passages
0,32wvn8,what's the difference between a forest and a w...,[{'text': 'They're used interchangeably a lot....,"[{'wikipedia_id': '66986', 'title': 'Woodland'..."
1,1yc9zg,Are there any good source material on the Wars...,[{'text': 'Many of the relevant primary source...,"[{'wikipedia_id': '57561029', 'title': 'Barbar..."
2,elzx1n,we do we instinctively grab a part of our body...,[{'text': 'A) instinct. To protect it from fur...,"[{'wikipedia_id': '25294051', 'title': 'Franz ..."
3,1j7pwx,Following the passing of the Thirteenth Amendm...,"[{'text': 'It was less a few dark corners, and...","[{'wikipedia_id': '5858078', 'title': 'Reconst..."
4,3qr7uu,"In medieval and pre-modern times, political en...",[{'text': 'Twenty years of peace is much bette...,"[{'wikipedia_id': '26368', 'title': 'Richard I..."


**Example: question, answer, passages**

In [6]:
print('question: {}\n\nanswer: {}\n\npassages: \n'.format(eli5['input'][0], eli5['answer'][0][0]['text']))
eli5['passages'][0]

question: what's the difference between a forest and a wood?

answer: They're used interchangeably a lot. You'll get different answers from different resources, but the general consensus seems to be that woods are smaller than forests.

 >  A wood is an area covered in trees, larger than a grove or a copse. A forest is also an area covered in trees, but it is larger than a wood

 >  The U.S. National Vegetation Classification system differentiates them according to their densities: 25 to 60 percent of a a wood is covered by tree canopies, while 60 to 100 percent of a forest is canopied.

passages: 



[{'wikipedia_id': '66986',
  'title': 'Woodland',
  'section': '',
  'start_paragraph_id': 1,
  'start_character': 0,
  'end_paragraph_id': 1,
  'end_character': 506,
  'text': 'A woodland or wood (or in the U.S., the "plurale tantum" woods) is a low-density forest forming open habitats with plenty of sunlight and limited shade. Woodlands may support an understory of shrubs and herbaceous plants including grasses. Woodland may form a transition to shrubland under drier conditions or during early stages of primary or secondary succession. Higher density areas of trees with a largely closed canopy that provides extensive and nearly continuous shade are referred to as forests. \n',
  'bleu_score': None,
  'meta': None,
  'cross-score': 0.9974125028},
 {'wikipedia_id': '4396843',
  'title': 'Wood drying',
  'section': 'Section::::Types of wood.\n',
  'start_paragraph_id': 9,
  'start_character': 0,
  'end_paragraph_id': 9,
  'end_character': 386,
  'text': 'Wood is divided, according to it

### 'Question, Passage' Input Pairs

The bi-encoder requires a list of passages to encode. There are several formatting options for the input list: 
1. List of passages 
2. List of Wikipedia titles (found in the passages column of the dataset) and passages: 'title, passage'
3. List of queries and passages: 'query, passage'

For this demo, only the list of passages is used for the bi-encoder. 
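The three formatting options can be sketched with a small helper. This is an illustration only: `build_input` and its `fmt` values are hypothetical names, and joining the two parts with `', '` is an assumption based on the 'title, passage' notation in the list.

```python
def build_input(query, passage, fmt='passage'):
    """Format one corpus entry for the bi-encoder.

    fmt: 'passage', 'title_passage' or 'query_passage' (hypothetical labels).
    """
    if fmt == 'passage':
        return passage['text']
    if fmt == 'title_passage':
        return passage['title'] + ', ' + passage['text']
    if fmt == 'query_passage':
        return query + ', ' + passage['text']
    raise ValueError('unknown format: {}'.format(fmt))

# toy example (not taken from the dataset)
toy_passage = {'title': 'Woodland', 'text': 'A woodland is a low-density forest.'}
toy_query = 'what is a woodland?'

for fmt in ('passage', 'title_passage', 'query_passage'):
    print(build_input(toy_query, toy_passage, fmt))
```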

**Note:** The Facebook DPR encoder requires the input to be the Wikipedia title followed by the passage, separated by a '[SEP]' token. To use the DPR encoder, use 'passages_dpr' instead of 'passages' (see code below). DPR has two separate encoders: one for the passage and one for the query. See the [Sentence Transformer documentation](https://www.sbert.net/docs/pretrained_models.html) for more details on DPR and other pretrained encoders. 

In [7]:
questions = eli5['input'].tolist() # remove this 

# keep only the text of the first (top-ranked) passage for each query
passages = [p[0]['text'] for p in eli5['passages']]

In [8]:
# passage format for DPR context encoder only 
# needs 'title [SEP] passage' as format

passages_dpr = [p[0]['title'] + ' [SEP] ' + p[0]['text'] for p in eli5['passages']]

## Semantic Search & Re-Ranker

The semantic search function performs the initial passage retrieval using a bi-encoder. The retrieved passages are then re-ranked using a cross-encoder. Both encoders are pre-trained, and both steps are implemented in the same function below. 
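The two-stage flow can be illustrated without loading any models. The sketch below is not the notebook's implementation: it uses hand-written 2-d vectors in place of bi-encoder embeddings and a lookup table of made-up scores in place of the cross-encoder, and the names `retrieve` and `rerank` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, corpus_vecs, top_k):
    """Stage 1: score every corpus vector by dot product and keep the top_k."""
    scored = [{'corpus_id': i, 'score': dot(query_vec, v)}
              for i, v in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda h: h['score'], reverse=True)[:top_k]

def rerank(hits, rescore):
    """Stage 2: attach a second score to each hit and sort by it."""
    for hit in hits:
        hit['cross-score'] = rescore(hit['corpus_id'])
    return sorted(hits, key=lambda h: h['cross-score'], reverse=True)

# toy stand-ins for embeddings and cross-encoder scores (made-up values)
corpus_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query_vec = [1.0, 0.2]
toy_cross_scores = {0: 0.3, 1: 0.9, 2: 0.1}

hits = retrieve(query_vec, corpus_vecs, top_k=2)
reranked = rerank(hits, toy_cross_scores.get)
print([h['corpus_id'] for h in hits])
print([h['corpus_id'] for h in reranked])
```

The cheap first stage narrows the corpus to a few candidates, so the more expensive second scorer only has to run on `top_k` items rather than on every passage.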

In [9]:
# load encoders 
bi_encoder = SentenceTransformer('msmarco-bert-base-dot-v5')
cross_encoder = CrossEncoder('/contextretrieval/cross-encoder/ms-marco-MiniLM-L-6-v2',default_activation_function=nn.Sigmoid())

In [None]:
# embed all passages in corpus
# this can take a while depending on the size of the dataset - to speed things up, pre-compute the embeddings & load them for future use
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)

# save corpus embeddings 
with open('query_msmarco-bert-base-dot-v5.pickle', 'wb') as pkl:
    pickle.dump(corpus_embeddings, pkl)
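
The compute-once, load-afterwards pattern mentioned in the comment above can be wrapped in a small helper. This is a sketch under assumptions: `load_or_compute` is a hypothetical name, and it simply pickles whatever object the supplied function returns.

```python
import os
import pickle

def load_or_compute(path, compute_fn):
    """Load a pickled object from path if it exists; otherwise compute and save it."""
    if os.path.exists(path):
        with open(path, 'rb') as pkl:
            return pickle.load(pkl)
    result = compute_fn()
    with open(path, 'wb') as pkl:
        pickle.dump(result, pkl)
    return result
```

With the objects defined in this notebook, a call could look like `load_or_compute(some_path, lambda: bi_encoder.encode(passages, convert_to_tensor=True))`, where `some_path` is a placeholder for wherever the embeddings should be stored.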

In [12]:
# load corpus embeddings 
with open('/data/Eli5/biencoder_embeddings/msmarco-bert-base-dot-v5.pickle', 'rb') as pkl:
    corpus_embeddings = pickle.load(pkl)

In [13]:
top_k=10
def search_and_rank(query):
    
    # ------ PASSAGE RETRIEVAL ------
    start_time = time.time()
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = sentenceutils.semantic_search(question_embedding, corpus_embeddings, top_k=top_k, score_function=sentenceutils.dot_score)
    hits = hits[0]  # Get the hits for the first query
    end_time = time.time()
    
    print("Input question:", query)
    print("\n-------------------------\n")
    print("Top {} passages (after {:.3f} seconds):".format(top_k, end_time - start_time))
    
    for hit in hits:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']]))
        hit['passage'] = passages[hit['corpus_id']]
    
    # ------ RE-RANKER -----
    # score passages
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)
    
    # attach the cross-encoder score to each hit (sorting happens below)
    for hit, score in zip(hits, cross_scores):
        hit['cross-score'] = score

    print("\n-------------------------\n")
    print("Top-3 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))

In [14]:
search_and_rank("What affect continental drift?")

Input question: What affect continental drift?

-------------------------

Top 10 passages (after 0.301 seconds):
	167.490	The theory of plate tectonics demonstrates that the continents of the Earth are moving across the surface at the rate of a few centimeters per year. This is expected to continue, causing the plates to relocate and collide. Continental drift is facilitated by two factors: the energy generation within the planet and the presence of a hydrosphere. With the loss of either of these, continental drift will come to a halt. The production of heat through radiogenic processes is sufficient to maintain mantle convection and plate subduction for at least the next 1.1 billion years.

	167.490	The theory of plate tectonics demonstrates that the continents of the Earth are moving across the surface at the rate of a few centimeters per year. This is expected to continue, causing the plates to relocate and collide. Continental drift is facilitated by two factors: the energy genera