<a href="https://colab.research.google.com/github/christopherdiamana/question_answering_engine/blob/master/Create_a_searchable_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create a searchable index

In this part, we will build a searchable index based on semantic similarity and approximate nearest-neighbour search.

*Installing the required packages*

In [None]:
! pip install datasets transformers

In [2]:
from datasets import load_dataset, load_metric

In [None]:
datasets = load_dataset("squad_v2")

In [4]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

We will use the SQuAD validation split as our "test" data, since we don't have access to the official test set.

In [5]:
test_data = datasets["validation"]

In [30]:
contexts = test_data["context"]

In [31]:
len(contexts)

11873

In [32]:
set_of_contexts = set(contexts)

In [33]:
len(set_of_contexts)

1204

In [63]:
contexts_list = list(set_of_contexts)

We observe that many questions refer to the same context: out of the 11,873 validation samples, only 1,204 contexts are unique.
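One caveat, given that this notebook later aims for a reproducible dataset: Python randomizes string hashes per process by default, so `list(set(contexts))` can return the same contexts in a different order on each run. A minimal sketch of an order-preserving alternative follows; the helper name `unique_in_order` and the toy data are illustrative assumptions, not part of the original notebook.

```python
def unique_in_order(items):
    """Return the distinct items, keeping the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+), so the
    # result is identical on every run, unlike iterating over a set of strings.
    return list(dict.fromkeys(items))


# Toy stand-in for test_data["context"] (hypothetical values)
toy_contexts = ["ctx B", "ctx A", "ctx B", "ctx C", "ctx A"]
print(unique_in_order(toy_contexts))  # ['ctx B', 'ctx A', 'ctx C']
```

Applied here, this would read `contexts_list = unique_in_order(contexts)`, which yields the same 1,204 contexts in a stable order.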

## 1 - DBpedia contexts

Here we will mix our test data with the DBpedia entity dataset.

We will add DBpedia passages of at least 50 words to our corpus of unique SQuAD v2 validation contexts until the corpus reaches 10K unique contexts.

We will use the BEIR library to download the DBpedia corpus.

In [None]:
!pip install beir

In [7]:
from beir import util
from beir.datasets.data_loader import GenericDataLoader

```
  Dataset  |   BEIR-Name
-----------|-----------------
  DBpedia  | dbpedia-entity
```


In [8]:
beir_name = "dbpedia-entity"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(beir_name)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")



In [13]:
corpus

{'<dbpedia:Animalia_(book)>': {'text': "Animalia is an illustrated children's book by Graeme Base. It was originally published in 1986, followed by a tenth anniversary edition in 1996, and a 25th anniversary edition in 2012. Over three million copies have been sold.   A special numbered and signed anniversary edition was also published in 1996, with an embossed gold jacket.",
  'title': 'Animalia (book)'},
 '<dbpedia:Academy_Award_for_Best_Production_Design>': {'text': "The Academy Awards are the oldest awards ceremony for achievements in motion pictures. The Academy Award for Best Production Design recognizes achievement in art direction on a film. The category's original name was Best Art Direction, but was changed to its current name in 2012 for the 85th Academy Awards.  This change resulted from the Art Director's branch of the Academy being renamed the Designer's branch.",
  'title': 'Academy Award for Best Production Design'},
 '<dbpedia:An_American_in_Paris>': {'text': 'An Ameri

To keep things simple, we will use the descriptions as the searchable attribute (the `text` field in DBpedia, the `context` field in SQuAD).

In [27]:
texts = [sample["text"] for sample in corpus.values()]

In [29]:
len(texts)

4635922

We will add enough DBpedia paragraphs of at least 50 words to the unique SQuAD validation contexts to reach 10K samples.

In [40]:
len(contexts_list)

1204

We will make sure that the created dataset is reproducible by fixing the random seed in our code.

In [46]:
import random

In [60]:
SEED = 4

In [61]:
random.seed(SEED)

In [62]:
random.shuffle(texts)

In [64]:
for text in texts:
  if len(text.split()) >= 50:
    contexts_list.append(text)
    if len(contexts_list) >= 10000:
      break

In [65]:
len(contexts_list)

10000

In [66]:
len(set(contexts_list))

10000
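The shuffle-and-filter steps above can also be wrapped in a single function. The following is a sketch under stated assumptions rather than the notebook's own code: the function name `extend_corpus` and its parameters are hypothetical, it uses a local `random.Random` instance instead of the global seed, and it skips duplicates explicitly instead of checking for them afterwards.

```python
import random


def extend_corpus(base, candidates, target_size=10_000, min_words=50, seed=4):
    """Pad `base` with randomly chosen, long-enough candidates up to `target_size`."""
    rng = random.Random(seed)  # local generator: the global random state is untouched
    pool = list(candidates)    # copy, so the caller's list is not shuffled in place
    rng.shuffle(pool)

    result = list(base)
    seen = set(result)
    for text in pool:
        if len(result) >= target_size:
            break
        # Keep passages with enough whitespace-separated words, skipping duplicates
        if len(text.split()) >= min_words and text not in seen:
            result.append(text)
            seen.add(text)
    return result
```

With the variables from this notebook, a call would look like `contexts_list = extend_corpus(contexts_list, texts)`. Copying the 4.6M-entry `texts` list costs extra memory; shuffling in place, as done above, avoids that at the price of mutating `texts`.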

## 2 - Corpus indexing

In this section, we will index our corpus by projecting it into a vector space using an asymmetric similarity model of our choice.
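"Asymmetric" here refers to the setting where a short query is matched against longer passages. Once both are embedded, search reduces to ranking passages by a similarity score. The sketch below illustrates that ranking step only: the vectors are hand-written toy values, not real model embeddings, the helper names are hypothetical, and the ranking is an exact dot-product search, which is the baseline that approximate nearest-neighbour indexes trade a little accuracy to speed up.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, passage_vecs, k=2):
    """Return (index, score) pairs for the k passages with the highest dot product."""
    scored = ((i, dot(query_vec, vec)) for i, vec in enumerate(passage_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 3-dimensional "embeddings" (placeholder values for illustration only)
passage_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.5, 0.5, 0.5],
]
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, passage_vecs))
```

Exact search like this scans every passage for every query, which is manageable for 10K contexts but motivates an approximate index as the corpus grows.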