# Part 2

## Setup and import

In [2]:
!pip install beir

Collecting beir
  Downloading beir-0.2.3.tar.gz (52 kB)
Collecting sentence-transformers
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
Collecting pytrec_eval
  Downloading pytrec_eval-0.5.tar.gz (15 kB)
Collecting faiss_cpu
  Downloading faiss_cpu-1.7.1.post2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.4 MB)
Collecting elasticsearch
  Downloading elasticsearch-7.15.2-py2.py3-none-any.whl (379 kB)
Collecting tensorflow-text
  Downloading tensorflow_text-2.7.3-cp37-cp37m-manylinux2010_x86_64.whl (4.9 MB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)

In [3]:
!pip install sentence_transformers



In [4]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.15.1-py3-none-any.whl (290 kB)
[?25l[K     |█▏                              | 10 kB 30.6 MB/s eta 0:00:01[K     |██▎                             | 20 kB 20.1 MB/s eta 0:00:01[K     |███▍                            | 30 kB 11.4 MB/s eta 0:00:01[K     |████▌                           | 40 kB 9.3 MB/s eta 0:00:01[K     |█████▋                          | 51 kB 5.1 MB/s eta 0:00:01[K     |██████▊                         | 61 kB 5.6 MB/s eta 0:00:01[K     |████████                        | 71 kB 5.4 MB/s eta 0:00:01[K     |█████████                       | 81 kB 6.1 MB/s eta 0:00:01[K     |██████████▏                     | 92 kB 6.3 MB/s eta 0:00:01[K     |███████████▎                    | 102 kB 5.0 MB/s eta 0:00:01[K     |████████████▍                   | 112 kB 5.0 MB/s eta 0:00:01[K     |█████████████▌                  | 122 kB 5.0 MB/s eta 0:00:01[K     |██████████████▋                 | 133 kB 5.0 MB/s eta 0:00:01

In [5]:
!pip install progress

Collecting progress
  Downloading progress-1.6.tar.gz (7.8 kB)
Building wheels for collected packages: progress
  Building wheel for progress (setup.py) ... done
  Created wheel for progress: filename=progress-1.6-py3-none-any.whl size=9628 sha256=8d83a93c6c0f23c20e29e3331de301d2d3babc90d4ffc7397c995b7d7e1810bb
  Stored in directory: /root/.cache/pip/wheels/8e/d7/61/498d8e27dc11e9805b01eb3539e2ee344436fc226daeb5fe87
Successfully built progress
Installing collected packages: progress
Successfully installed progress-1.6


In [6]:
!pip install hnswlib

Collecting hnswlib
  Downloading hnswlib-0.5.2.tar.gz (29 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: hnswlib
  Building wheel for hnswlib (PEP 517) ... done
  Created wheel for hnswlib: filename=hnswlib-0.5.2-cp37-cp37m-linux_x86_64.whl size=1326111 sha256=f6471105673f2c873b9367815eff455b75642a60520330924a06b4bb652b4a63
  Stored in directory: /root/.cache/pip/wheels/b4/11/b3/337c4a361b31217d62c3b420ad66fe20d381f1ebb29b046095
Successfully built hnswlib
Installing collected packages: hnswlib
Successfully installed hnswlib-0.5.2


Here are all the imports we need:

- pandas and numpy : data manipulation
- beir, datasets : loading the dbpedia and SQuAD datasets
- typing : type hints for the functions
- sentence_transformers : text encoding
- faiss, hnswlib, pickle : approximate nearest neighbors
- operator : sorting of similarity lists
- time : computation time comparison
- tqdm : progress bars

In [7]:
import pandas as pd
import numpy as np

from beir import util
from beir.datasets.data_loader import GenericDataLoader

from datasets import load_dataset, load_metric

from typing import List
from typing import Dict
from typing import Tuple

from sentence_transformers import SentenceTransformer
import sentence_transformers.util

import faiss
import hnswlib
import pickle

from operator import itemgetter

import time

from tqdm.autonotebook import tqdm


## Create the mix dataset

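### SQuAD part

Here we simply load the SQuAD dataset (or SQuAD v2 when the squad_v2 flag below is set to True), which contains the questions and their contexts. We only take the first 10% of the training split.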

In [8]:
squad_v2 = False

In [9]:
train_dataset = load_dataset("squad_v2" if squad_v2 else "squad", split='train[:10%]')


Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.63 MiB, post-processed: Unknown size, total: 119.14 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...



Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


In [10]:
train_dataset[0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}

We now collect the distinct contexts in a list to build our new dataset.

In [11]:
list_context = []

for elem in train_dataset:
    context = elem["context"]
    if (not context in list_context):
        list_context.append(context)
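
The membership test on a list rescans the whole list for every element. As a minimal sketch (the helper name unique_in_order is hypothetical, not part of the notebook), the same order-preserving deduplication can be done with a dictionary, assuming Python 3.7+ where dictionaries keep insertion order:

```python
def unique_in_order(items):
    """Return the distinct items, keeping the order of first appearance."""
    # Dictionary keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Toy data standing in for the SQuAD contexts.
contexts = ["ctx A", "ctx B", "ctx A", "ctx C", "ctx B"]
deduplicated = unique_in_order(contexts)
print(deduplicated)
```

On a few thousand contexts the difference is small, but it grows quickly with the size of the dataset.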

### The Dbpedia part

We load the new dataset (dbpedia).

In [12]:
dataset = "dbpedia-entity"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

datasets/dbpedia-entity.zip:   0%|          | 0.00/610M [00:00<?, ?iB/s]

  0%|          | 0/4635922 [00:00<?, ?it/s]

We create a function that retrieves contexts of at least 50 characters, without duplicates.

In [13]:
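def get_n_random_text(n, corpus_dataset):
    # Draw n document ids at random (np.random.choice samples with replacement by default)
    corpus_list = list(corpus_dataset)
    random_element = np.random.choice(corpus_list, n)

    # Keep only the text of each selected document
    random_element = [corpus_dataset[elem]["text"] for elem in random_element]

    # Drop the texts shorter than 50 characters
    random_element = [elem for elem in random_element if len(elem) >= 50]

    # Remove duplicates (np.unique also sorts the texts)
    random_element = np.unique(random_element)

    return random_element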

Now we add these contexts to our list to form the dataset.

In [14]:
new_context = get_n_random_text(10000 - len(list_context) + 1500, corpus)

In [15]:
for elem in new_context:
  if (not elem in list_context):
    list_context.append(elem)

In [52]:
len(list_context)

11412

We now have 11412 contexts, so we can run our computations on a larger amount of data.

### Create question and context list

In this part we pair each context with its index so that we can refer to it later. We also build the list of questions.

In [16]:
list_context_index = [[i, list_context[i]] for i in range(len(list_context))]

In [17]:
question_dict = {}

for elem in train_dataset:
    question = elem["question"]
    context = elem["context"]

    index = -1
    for context_elem in list_context_index:
        if (context_elem[1] == context):
            index = context_elem[0]
            break

    if (not question in question_dict):
        question_dict[question] = [index]
    else:
        question_dict[question].append(index)
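
The loop above scans the whole context list for every question. A possible alternative, sketched here under the assumption that the contexts are unique (the function name build_question_dict and the toy data are hypothetical), is to build a context-to-index dictionary once and look each context up directly:

```python
def build_question_dict(examples, contexts):
    """Map each question to the list of indices of its valid contexts."""
    # Assumes contexts are unique, so each context has exactly one index.
    index_of = {context: i for i, context in enumerate(contexts)}

    question_dict = {}
    for example in examples:
        # -1 marks a context that is missing from the list, as in the loop above.
        index = index_of.get(example["context"], -1)
        question_dict.setdefault(example["question"], []).append(index)
    return question_dict


# Toy data with the same keys as the SQuAD examples.
toy_contexts = ["ctx A", "ctx B"]
toy_examples = [
    {"question": "q1", "context": "ctx B"},
    {"question": "q2", "context": "ctx A"},
    {"question": "q1", "context": "ctx A"},
]
toy_question_dict = build_question_dict(toy_examples, toy_contexts)
print(toy_question_dict)
```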

In [18]:
question_list = []

for elem in train_dataset:
    question = elem["question"]
    if (not question in question_list):
        question_list.append(question)

## Compute MRR


To measure quality we compute the Mean Reciprocal Rank (MRR). This metric uses the inverse of the rank of the first correct result: if the context c we are looking for appears in the nth position, the question scores 1/n (and 0 if it does not appear in the returned list). We sum these scores over all questions and divide by the number of questions.

In [19]:
def compute_MRR(q_dict, q_result_list):

    total_sum = 0
    nb_q = len(q_result_list)

    for elem in q_result_list:
        valid_context = q_dict[elem[0]]

        find_the_good_elem = False
        index = 0

        for i in range(len(elem[1])):
          
            if (elem[1][i][0] in valid_context):
                find_the_good_elem = True
                index = i
                break

        if (find_the_good_elem):
            total_sum += (1 / (index + 1))

    return total_sum / nb_q
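
To make the metric concrete, here is a small worked example on toy data. It is a compact sketch with the same input layout as compute_MRR (a list of [question, [[context_index, score], ...]] entries); the function name mean_reciprocal_rank and the toy values are assumptions made for illustration.

```python
def mean_reciprocal_rank(relevant, ranked_results):
    """Average of 1/rank of the first relevant context, 0 when none is found."""
    total = 0.0
    for question, ranking in ranked_results:
        for rank, (context_id, _score) in enumerate(ranking, start=1):
            if context_id in relevant[question]:
                total += 1.0 / rank
                break
    return total / len(ranked_results)


relevant = {"q1": [0], "q2": [3], "q3": [7]}
ranked = [
    ["q1", [[0, 0.9], [5, 0.4]]],  # relevant context at rank 1 -> 1
    ["q2", [[2, 0.8], [3, 0.7]]],  # relevant context at rank 2 -> 1/2
    ["q3", [[1, 0.6], [4, 0.5]]],  # relevant context not returned -> 0
]
mrr = mean_reciprocal_rank(relevant, ranked)  # (1 + 1/2 + 0) / 3
print(mrr)
```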

## Choose the best model

We create the list of models that we want to compare. Each model is paired with the similarity function it was trained for (cosine similarity or dot product).

In [20]:
model_list = [['msmarco-distilbert-base-v4', 'cos'], ['msmarco-distilbert-base-v3', 'cos'], ['msmarco-distilbert-base-dot-prod-v3', 'dot'], ['msmarco-distilbert-base-tas-b', 'dot']]

For each model we compute the similarities between every context and the question, sort this list, and finally compute the MRR.

For the comparison we will compare the MRR and the execution time.

In [21]:
def try_a_model_part(model, simil_function):
    model = SentenceTransformer(model)

    list_formated_context = model.encode(list_context[:2000], device='cuda', show_progress_bar=True)
    list_formated_context = [[list_context[i], list_formated_context[i], i] for i in range(len(list_formated_context))]

    list_question_result = model.encode(question_list[:2000], device='cuda', show_progress_bar=False)
    list_question_result = [[question_list[i], list_question_result[i]] for i in range(len(list_question_result))]

    list_final = []

    for elem in list_question_result:
        question_formated = elem[1]

        list_sim_question = []

        for context_elem in list_formated_context:

            if (simil_function == 'cos'):
                list_sim_question.append([context_elem[2], sentence_transformers.util.pytorch_cos_sim(context_elem[1], question_formated)])
            elif (simil_function == 'dot'):
                list_sim_question.append([context_elem[2], sentence_transformers.util.dot_score(context_elem[1], question_formated)])
        
        list_sim_question = sorted(list_sim_question, key=itemgetter(1), reverse=True)

        list_final.append([elem[0], list_sim_question[:20]])

    return list_final
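
The two similarity functions can rank the same contexts differently, which is why each model is paired with one of them. The following pure-Python sketch illustrates this on hand-made 2D vectors; toy_dot and toy_cos are hypothetical helpers written for this example, not the sentence_transformers functions used above.

```python
import math


def toy_dot(a, b):
    """Dot product: sensitive to both direction and vector length."""
    return sum(x * y for x, y in zip(a, b))


def toy_cos(a, b):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return toy_dot(a, b) / (math.sqrt(toy_dot(a, a)) * math.sqrt(toy_dot(b, b)))


question = [1.0, 0.0]
aligned_ctx = [1.0, 0.0]  # same direction as the question, small norm
long_ctx = [3.0, 3.0]     # different direction, large norm

# The dot product prefers the long vector, the cosine prefers the aligned one.
print(toy_dot(question, aligned_ctx), toy_dot(question, long_ctx))
print(toy_cos(question, aligned_ctx), toy_cos(question, long_ctx))
```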

In [22]:
str_result = ""

for elem in model_list:
    start_time = time.time()
    temp = try_a_model_part(elem[0], elem[1])
    MRR_value = compute_MRR(question_dict, temp)
    total_time = time.time() - start_time

    str_result += "For the model " + elem[0] + " we have a MRR of " + str(MRR_value) + " and a computation time of " + str(round(total_time, 2)) + " secondes.\n"


In [23]:
print(str_result)

For the model msmarco-distilbert-base-v4 we have a MRR of 0.6180692501276664 and a computation time of 351.85 secondes.
For the model msmarco-distilbert-base-v3 we have a MRR of 0.6031518018917787 and a computation time of 338.44 secondes.
For the model msmarco-distilbert-base-dot-prod-v3 we have a MRR of 0.6123553308340267 and a computation time of 187.18 secondes.
For the model msmarco-distilbert-base-tas-b we have a MRR of 0.6924865720150152 and a computation time of 187.06 secondes.



The base-v4 and base-v3 models give good results, around 0.6 MRR, but with execution times of more than 300 seconds.

The dot-prod-v3 and tas-b models also reach at least 0.6 MRR, but with less than 200 seconds of computing time.

Given these results, we choose the last model tested, 'msmarco-distilbert-base-tas-b', which has an MRR of 0.69 and a computation time of 187.06 seconds. We can now use it to encode the entire dataset.

In [24]:
def run_the_full_model(model):

    model = SentenceTransformer(model)

    list_formated_context = model.encode(list_context, device='cuda', show_progress_bar=True)
    list_formated_context = [[list_context[i], list_formated_context[i], i] for i in range(len(list_formated_context))]

    return list_formated_context

In [25]:
formated_context = run_the_full_model('msmarco-distilbert-base-tas-b')

Batches:   0%|          | 0/357 [00:00<?, ?it/s]

In [26]:
model = SentenceTransformer('msmarco-distilbert-base-tas-b')

## Nearest Neighbors


For the nearest neighbor search we use the algorithm from the hnswlib library. The code is adapted from the example in the project's GitHub repository; we simply use our previously computed data.
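
For reference, the exact search that an approximate index tries to reproduce can be sketched in a few lines of pure Python. This is a brute-force baseline written for illustration (the function exact_top_k and the toy vectors are assumptions); it scores every vector by inner product, which is what makes it slow on large collections.

```python
import heapq


def exact_top_k(query, vectors, k):
    """Return the k (index, inner product) pairs with the highest inner product."""
    scored = (
        (sum(q * v for q, v in zip(query, vector)), index)
        for index, vector in enumerate(vectors)
    )
    # nlargest keeps only the k best scores, best first.
    return [(index, score) for score, index in heapq.nlargest(k, scored)]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
top_two = exact_top_k([1.0, 0.0], toy_vectors, 2)
print(top_two)
```

An approximate index trades a little accuracy against this exhaustive scan in exchange for much faster queries.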

###  hnswlib

The analysis function is evaluated on the first 500 questions. We test it with several parameter combinations and keep the one that gives the best result.

In [27]:
def compure_MMR_for_param(param1, param2, param3):
    start_time = time.time()

    dim = len(formated_context[0][1])
    num_elements = len(formated_context)

    p = hnswlib.Index(space = 'ip', dim = dim)
    p.init_index(max_elements = num_elements, ef_construction = param1, M = param2)

    datas = [elem[1] for elem in formated_context]
    indexs = [elem[2] for elem in formated_context]

    p.add_items(datas, indexs)

    list_question_result = []

    list_question_result = model.encode(question_list[:500], device='cuda', show_progress_bar=False)
    list_question_result = [[question_list[i], list_question_result[i]] for i in range(len(list_question_result))]

    final_list = []

    for elem in list_question_result:
        labels, distances = p.knn_query(elem[1], k = param3)
        list_to_add = [elem[0]]
        list_index_dist = []
        for i in range(len(distances[0])):
          list_index_dist.append([labels[0][i], distances[0][i]])
        list_to_add.append(list_index_dist)
        final_list.append(list_to_add)

    MRR_value = compute_MRR(question_dict, final_list)

    total_time = round(time.time() - start_time, 2)

    print("ef_construction = ", param1, ", M = ", param2, ", k = ", param3, ". And we have a MRR of ", MRR_value, "for a time of ", total_time, "secondes.")
    
    return MRR_value

In [28]:
param1_list = [100, 150, 200, 250, 300]
param2_list = [10, 14, 18, 22, 26]
param3_list = [15, 30, 45, 60]

In [29]:
best_value = 0

for param1 in param1_list:
    for param2 in param2_list:
        for param3 in param3_list:
          value = compure_MMR_for_param(param1, param2, param3)

          print(value)

          if (value > best_value):
            best_value = value
            print("\n\n--------------------------------------------------------------------------")
            print("We have a new best value of MRR of ", best_value, "for the param ", param1, param2, param3, "!")
            print("--------------------------------------------------------------------------\n\n")

ef_construction =  100 , M =  10 , k =  15 . And we have a MRR of  0.5894110334110333 for a time of  1.31 secondes.
0.5894110334110333


--------------------------------------------------------------------------
We have a new best value of MRR of  0.5894110334110333 for the param  100 10 15 !
--------------------------------------------------------------------------


ef_construction =  100 , M =  10 , k =  30 . And we have a MRR of  0.6140867782862762 for a time of  1.25 secondes.
0.6140867782862762


--------------------------------------------------------------------------
We have a new best value of MRR of  0.6140867782862762 for the param  100 10 30 !
--------------------------------------------------------------------------


ef_construction =  100 , M =  10 , k =  45 . And we have a MRR of  0.6167620659683108 for a time of  1.75 secondes.
0.6167620659683108


--------------------------------------------------------------------------
We have a new best value of MRR of  0.61676206

So we found our optimal parameters: ef_construction = 300, M = 22, and k = 60 (the list of the 60 closest contexts).
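
The exhaustive search over the three parameter lists can also be written without nested loops. The sketch below uses itertools.product; grid_search and toy_score are hypothetical names, and toy_score is only a stand-in for the real MRR evaluation so that the example runs on its own.

```python
from itertools import product


def grid_search(score_fn, *param_lists):
    """Try every combination of parameters and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for params in product(*param_lists):
        score = score_fn(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(ef_construction, m, k):
    # Hypothetical stand-in for the MRR evaluation; it peaks at (200, 18, 45).
    return -abs(ef_construction - 200) - abs(m - 18) - abs(k - 45)


best_params, best_score = grid_search(
    toy_score,
    [100, 150, 200, 250, 300],
    [10, 14, 18, 22, 26],
    [15, 30, 45, 60],
)
print(best_params, best_score)
```

Returning the best parameters from the search, rather than only printing them, avoids relying on loop variables that remain defined after the loops finish.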

We can now create the model that corresponds to these parameters and create the function that will allow us to use it and read it in part 1.

In [30]:
dim = len(formated_context[0][1])
num_elements = len(formated_context)

p = hnswlib.Index(space = 'ip', dim = dim)
p.init_index(max_elements = num_elements, ef_construction = 300, M = 22)

datas = [elem[1] for elem in formated_context]
indexs = [elem[2] for elem in formated_context]

p.add_items(datas, indexs)

In [50]:
def find_best_context(question):

    formated_question = model.encode(question, device='cuda', show_progress_bar=False)

    labels, distances = p.knn_query(formated_question, k = 60)  # k = 60, the best value found above

    context_list = []

    for label in labels[0]:
      context_list.append(formated_context[label][0])

    return context_list

In [51]:
find_best_context(question_list[0])

['The Gospel of Mary is an apocryphal book discovered in 1896 in a 5th-century papyrus codex. The codex Papyrus Berolinensis 8502 was purchased in Cairo by German scholar Karl Reinhardt.Although the work is popularly known as the Gospel of Mary, it is not canonical nor is it technically classed as a gospel by scholastic consensus.',
 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a sim