# Part 2

## Setup and import

In [51]:
!pip install beir



In [52]:
!pip install sentence_transformers



In [53]:
!pip install datasets



In [54]:
!pip install progress



In [55]:
!pip install hnswlib



We list all the imports we need:

- pandas and numpy : data manipulation
- beir datasets : loading the dbpedia dataset
- typing : type annotations for the functions
- sentence_transformers : text encoding
- faiss, hnswlib, pickle : approximate nearest neighbors
- operator : sorting of similarity lists
- time : computation time comparison

In [56]:
import pandas as pd
import numpy as np

from beir import util
from beir.datasets.data_loader import GenericDataLoader

from datasets import load_dataset

from typing import Callable, Dict, List, Tuple

from sentence_transformers import SentenceTransformer
import sentence_transformers.util

import faiss
import hnswlib
import pickle

from operator import itemgetter

import time

## Create the mix dataset

### Squadv2 part

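Here we load the SQuAD dataset, which contains the questions and their contexts. The `squad_v2` flag selects between `squad_v2` and `squad`; it is set to `False` below, so `squad` is loaded, and only the first 10% of the training split is used.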

In [57]:
squad_v2 = False

In [58]:
train_dataset = load_dataset("squad_v2" if squad_v2 else "squad", split='train[:10%]')

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


In [59]:
train_dataset[0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}

We now collect the distinct contexts in a list to build our new dataset.

In [60]:
list_context = []

for elem in train_dataset:
    context = elem["context"]
    if (not context in list_context):
        list_context.append(context)
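
The membership test above scans the whole list on every iteration, which becomes slow as the number of contexts grows. A minimal sketch of an order-preserving alternative, assuming the contexts are plain strings (the helper name `unique_in_order` and the sample data are illustrative, not part of the notebook):

```python
from typing import Iterable, List


def unique_in_order(items: Iterable[str]) -> List[str]:
    """Return the distinct items, keeping the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


sample_contexts = ["ctx A", "ctx B", "ctx A", "ctx C", "ctx B"]
deduplicated = unique_in_order(sample_contexts)
print(deduplicated)
```

The result should match what the loop builds, since both keep the first occurrence of each context.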

### The Dbpedia part

We load the new dataset (DBpedia).

In [61]:
dataset = "dbpedia-entity"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

We create a function that retrieves contexts of at least 50 characters, without duplicates.

In [62]:
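def get_n_random_text(n : int, corpus_dataset : Dict[str, Dict[str, str]]) -> List[str]:

    # Draw n document ids (np.random.choice samples with replacement by default).
    corpus_list = list(corpus_dataset)
    random_element = np.random.choice(corpus_list, n)

    # Keep only the text of each sampled document.
    random_element = [corpus_dataset[elem]["text"] for elem in random_element]

    # Discard texts shorter than 50 characters.
    random_element = [elem for elem in random_element if len(elem) >= 50]

    # Remove duplicates (np.unique also sorts) and return a plain Python list.
    return np.unique(random_element).tolist()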
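
Note that `np.random.choice` draws with replacement by default, so the function can return fewer than `n` texts even before the length filter is applied. A hedged sketch of one alternative, using only the standard library: filter and deduplicate first, then draw without replacement. The function name `sample_long_texts`, the `min_chars` and `seed` parameters, and the toy corpus are assumptions made for the illustration.

```python
import random
from typing import Dict, List


def sample_long_texts(n: int, corpus: Dict[str, Dict[str, str]],
                      min_chars: int = 50, seed: int = 0) -> List[str]:
    """Draw up to n distinct texts of at least min_chars characters."""
    # Deduplicate and filter before sampling so the requested size is respected.
    candidates = sorted({doc["text"] for doc in corpus.values()
                         if len(doc["text"]) >= min_chars})
    rng = random.Random(seed)  # fixed seed: reproducible draw
    return rng.sample(candidates, min(n, len(candidates)))


toy_corpus = {
    "d1": {"title": "a", "text": "x" * 60},
    "d2": {"title": "b", "text": "too short"},
    "d3": {"title": "c", "text": "y" * 80},
    "d4": {"title": "d", "text": "x" * 60},  # duplicate of d1
}
picked = sample_long_texts(5, toy_corpus)
print(len(picked))
```

With this approach the number of returned texts only falls below `n` when the corpus itself does not contain enough distinct long texts.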

We now add these contexts to the SQuAD ones to form our dataset.

In [63]:
new_context = get_n_random_text(10000 - len(list_context) + 1500, corpus)

In [64]:
for elem in new_context:
  if (not elem in list_context):
    list_context.append(elem)

In [65]:
len(list_context)

11413

So we now have 11413 contexts, and we can run our computations on a larger amount of data.

### Create question and context list

In this part we build a list that pairs each context with its index, so that we have a reference to it. We also build the list of questions.

In [66]:
list_context_index = [[i, list_context[i]] for i in range(len(list_context))]

In [67]:
question_dict = {}

for elem in train_dataset:
    question = elem["question"]
    context = elem["context"]

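    index = -1
    # Use a distinct loop variable so the outer `elem` is not overwritten.
    for indexed_context in list_context_index:
        if (indexed_context[1] == context):
            index = indexed_context[0]
            break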

    if (not question in question_dict):
        question_dict[question] = [index]
    else:
        question_dict[question].append(index)

In [68]:
question_list = []

for elem in train_dataset:
    question = elem["question"]
    if (not question in question_list):
        question_list.append(question)
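
The two cells above look up each context by scanning a list. As a sketch of the same bookkeeping with dictionary lookups (the toy `contexts` and `examples` below stand in for `list_context` and `train_dataset`; the variable names are illustrative):

```python
from typing import Dict, List

contexts = ["ctx A", "ctx B", "ctx C"]
examples = [
    {"question": "q1", "context": "ctx B"},
    {"question": "q2", "context": "ctx A"},
    {"question": "q1", "context": "ctx C"},
]

# Map each context string to its position once, instead of scanning per question.
context_to_index: Dict[str, int] = {text: i for i, text in enumerate(contexts)}

question_to_indices: Dict[str, List[int]] = {}
for example in examples:
    # -1 mirrors the notebook's sentinel for a context missing from the list.
    index = context_to_index.get(example["context"], -1)
    question_to_indices.setdefault(example["question"], []).append(index)

# Dict keys keep insertion order, so this is also the deduplicated question list.
questions = list(question_to_indices)
print(question_to_indices, questions)
```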

## Compute MRR


To evaluate retrieval quality we compute the MRR (Mean Reciprocal Rank). This metric uses the inverse of the rank of the result: if we search for context c and it appears in the nth position, the question scores 1/n (and 0 if the context is not among the returned results). We sum these scores over all the questions and divide by the number of questions.

In [69]:
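def compute_MRR(q_dict : Dict[str, List[int]], q_result_list : List[list]) -> float:
    # Each element of q_result_list is [question, [[context_index, score], ...]],
    # with the candidates already sorted from best to worst.

    total_sum = 0
    nb_q = len(q_result_list)

    for elem in q_result_list:
        valid_context = q_dict[elem[0]]

        find_the_good_elem = False
        index = 0

        # Look for the rank of the first valid context.
        for i in range(len(elem[1])):
            if (elem[1][i][0] in valid_context):
                find_the_good_elem = True
                index = i
                break

        # A question with no valid context in its results contributes 0.
        if (find_the_good_elem):
            total_sum += (1 / (index + 1))

    return total_sum / nb_q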
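
To make the metric concrete, here is a small self-contained sketch on made-up data. It uses a simplified input format (plain lists of context ids per question) rather than the notebook's nested lists, and the name `mean_reciprocal_rank` is illustrative.

```python
from typing import Dict, List


def mean_reciprocal_rank(relevant: Dict[str, List[int]],
                         ranked: Dict[str, List[int]]) -> float:
    """Average of 1/rank of the first relevant result; 0 when none is retrieved."""
    total = 0.0
    for question, results in ranked.items():
        for position, context_id in enumerate(results, start=1):
            if context_id in relevant[question]:
                total += 1.0 / position
                break
    return total / len(ranked)


relevant = {"q1": [0], "q2": [3], "q3": [7]}
ranked = {"q1": [0, 5, 2], "q2": [4, 3, 1], "q3": [1, 2, 9]}
# q1 is found at rank 1, q2 at rank 2, q3 is missed: (1 + 0.5 + 0) / 3
print(mean_reciprocal_rank(relevant, ranked))
```

Because a missed question counts as 0, the MRR depends on how many candidates are returned per question: a longer result list can only raise it.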

## Choose the best model

We create the list of models that we want to compare. Each model is paired with the similarity function it is meant to be used with (cosine similarity or dot product).

In [70]:
model_list = [['msmarco-distilbert-base-v4', 'cos'], ['msmarco-distilbert-base-v3', 'cos'], ['msmarco-distilbert-base-dot-prod-v3', 'dot'], ['msmarco-distilbert-base-tas-b', 'dot']]

For each model we compute the similarities between every context and the question, sort this list, and finally compute the MRR.

For the comparison we will compare the MRR and the execution time.

In [71]:
def try_a_model_part(model : str, simil_function : str) -> List[List[str or List[List[int]]]]:
    model = SentenceTransformer(model)

    list_formated_context = model.encode(list_context[:2000], device='cuda', show_progress_bar=True)
    list_formated_context = [[list_context[i], list_formated_context[i], i] for i in range(len(list_formated_context))]

    list_question_result = model.encode(question_list[:2000], device='cuda', show_progress_bar=False)
    list_question_result = [[question_list[i], list_question_result[i]] for i in range(len(list_question_result))]

    list_final = []

    for elem in list_question_result:
        question_formated = elem[1]

        list_sim_question = []

        for context_elem in list_formated_context:

            if (simil_function == 'cos'):
                list_sim_question.append([context_elem[2], sentence_transformers.util.pytorch_cos_sim(context_elem[1], question_formated)])
            elif (simil_function == 'dot'):
                list_sim_question.append([context_elem[2], sentence_transformers.util.dot_score(context_elem[1], question_formated)])
        
        list_sim_question = sorted(list_sim_question, key=itemgetter(1), reverse=True)

        list_final.append([elem[0], list_sim_question[:20]])

    return list_final
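
The two similarity functions can rank the same contexts differently: the dot product rewards vectors with a large norm, while cosine similarity only looks at direction. The following sketch illustrates this on hand-written 2-dimensional vectors, in pure Python; the helper names and the toy vectors are assumptions for the illustration, not the notebook's embeddings.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Vector, v: Vector) -> float:
    # Dot product divided by the product of the two vector norms.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def top_k(query: Vector, contexts: List[Vector],
          score: Callable[[Vector, Vector], float], k: int) -> List[Tuple[int, float]]:
    """Return the k best (context_index, score) pairs, highest score first."""
    scored = [(i, score(vec, query)) for i, vec in enumerate(contexts)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


query = [1.0, 0.0]
contexts = [[3.0, 3.0], [1.0, 0.1], [0.0, 2.0]]
print(top_k(query, contexts, dot, 2))     # the long vector (index 0) ranks first
print(top_k(query, contexts, cosine, 2))  # the best-aligned vector (index 1) ranks first
```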

In [72]:
str_result = ""

for elem in model_list:
    start_time = time.time()
    temp = try_a_model_part(elem[0], elem[1])
    MRR_value = compute_MRR(question_dict, temp)
    total_time = time.time() - start_time

    str_result += "For the model " + elem[0] + " we have a MRR of " + str(MRR_value) + " and a computation time of " + str(round(total_time, 2)) + " secondes.\n"

In [73]:
print(str_result)

For the model msmarco-distilbert-base-v4 we have a MRR of 0.61883334041904 and a computation time of 351.68 secondes.
For the model msmarco-distilbert-base-v3 we have a MRR of 0.6035525153939627 and a computation time of 349.94 secondes.
For the model msmarco-distilbert-base-dot-prod-v3 we have a MRR of 0.6124762715462214 and a computation time of 190.22 secondes.
For the model msmarco-distilbert-base-tas-b we have a MRR of 0.6919105011496064 and a computation time of 187.79 secondes.



For the base v4 and base v3 models we get good results, around 0.6 MRR, but with execution times of more than 300 seconds.

The dot-prod and tas-b models also reach at least 0.6 MRR, with less than 200 seconds of computing time.

We therefore choose the last model tested, 'msmarco-distilbert-base-tas-b', with an MRR of 0.69 and a computation time of 187.79 seconds. We can now run it on the entire dataset.

In [74]:
def run_the_full_model(model : str) -> List[List[List[int] or List[List[int]] or int]]:

    model = SentenceTransformer(model)

    list_formated_context = model.encode(list_context, device='cuda', show_progress_bar=True)
    list_formated_context = [[list_context[i], list_formated_context[i], i] for i in range(len(list_formated_context))]

    return list_formated_context

In [75]:
formated_context = run_the_full_model('msmarco-distilbert-base-tas-b')

In [76]:
model = SentenceTransformer('msmarco-distilbert-base-tas-b')

## Nearest neighbors


For the nearest-neighbor search we use the algorithm of the hnswlib library. The code is adapted from the example in the project's GitHub repository; we simply feed it our previously computed embeddings.

###  hnswlib

To analyse the parameters, we evaluate on the first 500 questions, test several parameter combinations, and keep the best result.

In [77]:
def compure_MMR_for_param(param1 : int, param2 : int, param3 : int) -> float:
    # param1 = ef_construction, param2 = M, param3 = k (number of neighbors returned)
    start_time = time.time()

    dim = len(formated_context[0][1])
    num_elements = len(formated_context)

    p = hnswlib.Index(space = 'ip', dim = dim)
    p.init_index(max_elements = num_elements, ef_construction = param1, M = param2)

    datas = [elem[1] for elem in formated_context]
    indexs = [elem[2] for elem in formated_context]

    p.add_items(datas, indexs)

    list_question_result = []

    list_question_result = model.encode(question_list[:500], device='cuda', show_progress_bar=False)
    list_question_result = [[question_list[i], list_question_result[i]] for i in range(len(list_question_result))]

    final_list = []

    for elem in list_question_result:
        labels, distances = p.knn_query(elem[1], k = param3)
        list_to_add = [elem[0]]
        list_index_dist = []
        for i in range(len(distances[0])):
          list_index_dist.append([labels[0][i], distances[0][i]])
        list_to_add.append(list_index_dist)
        final_list.append(list_to_add)

    MRR_value = compute_MRR(question_dict, final_list)

    total_time = round(time.time() - start_time, 2)

    print("ef_construction = ", param1, ", M = ", param2, ", k = ", param3, ". And we have a MRR of ", MRR_value, "for a time of ", total_time, "secondes.")
    
    return MRR_value

In [78]:
param1_list = [100, 150, 200, 250, 300]
param2_list = [10, 14, 18, 22, 26]
param3_list = [15, 30, 45, 60]

In [79]:
best_value = 0

for param1 in param1_list:
    for param2 in param2_list:
        for param3 in param3_list:
          value = compure_MMR_for_param(param1, param2, param3)

          print(value)

          if (value > best_value):
            best_value = value
            print("\n\n--------------------------------------------------------------------------")
            print("We have a new best value of MRR of ", best_value, "for the param ", param1, param2, param3, "!")
            print("--------------------------------------------------------------------------\n\n")

ef_construction =  100 , M =  10 , k =  15 . And we have a MRR of  0.5866302752802751 for a time of  1.37 secondes.
0.5866302752802751


--------------------------------------------------------------------------
We have a new best value of MRR of  0.5866302752802751 for the param  100 10 15 !
--------------------------------------------------------------------------


ef_construction =  100 , M =  10 , k =  30 . And we have a MRR of  0.6113932342262767 for a time of  1.24 secondes.
0.6113932342262767


--------------------------------------------------------------------------
We have a new best value of MRR of  0.6113932342262767 for the param  100 10 30 !
--------------------------------------------------------------------------


ef_construction =  100 , M =  10 , k =  45 . And we have a MRR of  0.6119476272015288 for a time of  2.05 secondes.
0.6119476272015288


--------------------------------------------------------------------------
We have a new best value of MRR of  0.61194762

So we found our optimal parameters: ef_construction = 300, M = 22, with the list of the 60 closest contexts (k = 60).
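
The three nested loops used for this search can also be written with `itertools.product`. The sketch below is a generic version under stated assumptions: `grid_search` is an illustrative helper, and `toy_score` is a made-up stand-in for the role played by `compure_MMR_for_param` (its optimum is arbitrary and unrelated to the results above).

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

Params = Tuple[int, int, int]


def grid_search(score: Callable[[int, int, int], float],
                ef_values: Iterable[int],
                m_values: Iterable[int],
                k_values: Iterable[int]) -> Tuple[Optional[Params], float]:
    """Return the (ef_construction, M, k) triple with the highest score."""
    best_params, best_score = None, float("-inf")
    for params in product(ef_values, m_values, k_values):
        value = score(*params)
        if value > best_score:
            best_params, best_score = params, value
    return best_params, best_score


def toy_score(ef_construction: int, m: int, k: int) -> float:
    # Made-up objective for the demo, peaking at (200, 18, 45).
    return -abs(ef_construction - 200) - abs(m - 18) - abs(k - 45)


best_params, best_score = grid_search(
    toy_score, [100, 150, 200, 250, 300], [10, 14, 18, 22, 26], [15, 30, 45, 60])
print(best_params, best_score)
```

Returning the best triple alongside the best score avoids having to read the parameters back from the printed log.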

We can now build the index that corresponds to these parameters and write the function that lets us query it from part 1.

In [80]:
dim = len(formated_context[0][1])
num_elements = len(formated_context)

p = hnswlib.Index(space = 'ip', dim = dim)
p.init_index(max_elements = num_elements, ef_construction = 300, M = 22)

datas = [elem[1] for elem in formated_context]
indexs = [elem[2] for elem in formated_context]

p.add_items(datas, indexs)

In [81]:
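def find_best_context(question : str, k : int = 60) -> List[str]:

    formated_question = model.encode(question, device='cuda', show_progress_bar=False)

    # k defaults to 60, the number of neighbors selected above, instead of
    # relying on the leftover global param3 from the parameter search.
    labels, distances = p.knn_query(formated_question, k = k)

    context_list = []

    # formated_context[label] is [context_text, embedding, index].
    for label in labels[0]:
      context_list.append(formated_context[label][0])

    return context_list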

## Analysis of the results

In [82]:
find_best_context(question_list[0])

['Mary Salome and Zebedee by Tilman Riemenschneider (c.1460-1531) originally formed the right wing of an altarpiece showing the family of the Virgin Mary. The central scene would have shown St Anne seated with her daughter Mary and the Christ Child. Mary Salome was another daughter of St Anne, half sister of the Virgin and wife of Zebedee. Tilman Riemenschnieder was one of the most important sculptors in southern Germany in the late fifteenth and sixteenth century.',
 'Jeanne Jugan (October 25, 1792 – August 29, 1879), also known as Sister Mary of the Cross, L.S.P., was a French woman who became known for the dedication of her life to the neediest of the elderly poor. Her service resulted in the establishment of the Little Sisters of the Poor, who care for the elderly who have no other resources throughout the world. She has been declared a saint by the Catholic Church.',
 "Mary of Lusignan (French: Marie de Lusignan; before March 1215 – 5 July 1251 or 1253), was the wife of Count Walt

For question 0, we can see that context 0, the one associated with the question, appears among the first results (in 1st position, in fact).

In [83]:
find_best_context("What is a solar panel ?")

['Solar power is the conversion of sunlight into electricity, either directly using photovoltaics (PV), or indirectly using concentrated solar power (CSP). CSP systems use lenses or mirrors and tracking systems to focus a large area of sunlight into a small beam. PV converts light into electric current using the photoelectric effect.',
 'In the last two decades, photovoltaics (PV), also known as solar PV, has evolved from a pure niche market of small scale applications towards becoming a mainstream electricity source. A solar cell is a device that converts light directly into electricity using the photoelectric effect. The first solar cell was constructed by Charles Fritts in the 1880s. In 1931 a German engineer, Dr Bruno Lange, developed a photo cell using silver selenide in place of copper oxide. Although the prototype selenium cells converted less than 1% of incident light into electricity, both Ernst Werner von Siemens and James Clerk Maxwell recognized the importance of this disco

As for the question "What is a solar panel ?", we see that the texts cannot provide a simple answer. All the texts are indeed related to the question (especially the first context), but finding a definition in them is not necessarily easy. The model retrieves contexts related to the question without checking whether the answer can actually be found in them.

In [84]:
find_best_context("How old are you ?")

['The eligible age-range for contestants is currently fifteen to twenty-eight years old. The initial age limit was sixteen to twenty-four in the first three seasons, but the upper limit was raised to twenty-eight in season four, and the lower limit was reduced to fifteen in season ten. The contestants must be legal U.S. residents, cannot have advanced to particular stages of the competition in previous seasons (varies depending on the season, currently by the semi-final stage until season thirteen), and must not hold any current recording or talent representation contract by the semi-final stage (in previous years by the audition stage).',
 'The Newcastle 85+ Study is a longitudinal study of health and aging of people over 85 years old. It began in 2006, led by Professor Tom Kirkwood at Newcastle University, and included over 1,000 85-year olds born in 1921 and registered with GPs in Newcastle and North Tyneside.11% of those studied said their health was excellent when compared with ot

For the question "How old are you ?" our model is lost. This is expected: the question needs context to be answered, that is, the answer varies with the moment it is asked and the person it is asked of. We therefore cannot really say that any context is correct: how would we decide whether a text about a person or a text about an event is a valid answer? This is a first error that shows the limitations of our model.

In [85]:
find_best_context("Wich is the first album solo of beyonce ?")

['Beyoncé\'s first solo recording was a feature on Jay Z\'s "\'03 Bonnie & Clyde" that was released in October 2002, peaking at number four on the U.S. Billboard Hot 100 chart. Her first solo album Dangerously in Love was released on June 24, 2003, after Michelle Williams and Kelly Rowland had released their solo efforts. The album sold 317,000 copies in its first week, debuted atop the Billboard 200, and has since sold 11 million copies worldwide. The album\'s lead single, "Crazy in Love", featuring Jay Z, became Beyoncé\'s first number-one single as a solo artist in the US. The single "Baby Boy" also reached number one, and singles, "Me, Myself and I" and "Naughty Girl", both reached the top-five. The album earned Beyoncé a then record-tying five awards at the 46th Annual Grammy Awards; Best Contemporary R&B Album, Best Female R&B Vocal Performance for "Dangerously in Love 2", Best R&B Song and Best Rap/Sung Collaboration for "Crazy in Love", and Best R&B Performance by a Duo or Grou

For this question we see two behaviours of the model. It detects texts that are related to the question, so the first results match Beyoncé's artistic history and include texts about her album. But the first results also contain many texts related to Beyoncé without necessarily concerning her first album, as well as results about other artists. The model thus detects themes well, but it reaches its limits when a question combines more than one theme. We could add a method that increases the score when a text covers all the themes of the question, as a bonus.
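
One possible form of such a bonus, sketched under strong assumptions: candidates come with a score where higher is better, and "covering the themes" is approximated by the share of question words found in the text. The function `rerank_with_coverage`, the `bonus` weight, and the example texts and scores are all hypothetical; this was not evaluated in the notebook.

```python
from typing import List, Tuple


def rerank_with_coverage(question: str,
                         candidates: List[Tuple[str, float]],
                         bonus: float = 0.5) -> List[Tuple[str, float]]:
    """Add bonus * (share of question words found in the text) to each score."""
    terms = {word.lower().strip("?.,!") for word in question.split()}
    terms.discard("")  # a lone "?" becomes an empty string after stripping
    rescored = []
    for text, score in candidates:
        words = {word.lower().strip("?.,!") for word in text.split()}
        coverage = len(terms & words) / len(terms) if terms else 0.0
        rescored.append((text, score + bonus * coverage))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


candidates = [
    ("Beyonce performed at the ceremony.", 0.80),
    ("The first solo album of Beyonce was released in 2003.", 0.75),
]
reranked = rerank_with_coverage("first solo album of beyonce ?", candidates)
print(reranked[0][0])
```

In this toy case the text that mentions every word of the question overtakes the one that only mentions the artist. A real version would need to ignore stop words and handle accents and spelling variants.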