For demonstration purposes, let us test the **Dense Passage Retrieval (DPR)** pipeline on a larger dataset. We can use pre-built datasets that are commonly used for Question Answering (QA) tasks; one of the most popular is **Natural Questions**. Such datasets are quite large and provide both questions and passages, which makes them suitable for testing retrieval and reading models.

**Natural Questions (NQ) Dataset:**

The **Natural Questions (NQ)** dataset contains real user queries issued to Google Search, each paired with a Wikipedia page that may contain the answer. It is a widely used benchmark for testing retrieval models.

To demonstrate the DPR pipeline with a larger dataset, we'll use a small portion of the **Natural Questions (NQ)** dataset, which is publicly available via Hugging Face's `datasets` library.
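As a rough sketch of what a single NQ record looks like, the snippet below builds a made-up record that mimics the layout described on the dataset card (a `question` with a `text` field, and a `document` whose `tokens` carry an `is_html` flag) and recovers plain passage text from it. The record contents and the helper name `page_text` are illustrative assumptions, not data taken from the dataset.

```python
# Hypothetical record shaped like a Hugging Face `natural_questions` example.
# The values are invented for illustration; only the field layout is assumed
# to match the dataset card.
toy_record = {
    "question": {"text": "what is the capital of france"},
    "document": {
        "title": "Paris",
        "tokens": {
            "token": ["<p>", "Paris", "is", "the", "capital", "of", "France", ".", "</p>"],
            "is_html": [True, False, False, False, False, False, False, False, True],
        },
    },
}


def page_text(record):
    """Join the non-HTML tokens of a record's Wikipedia page into plain text."""
    tokens = record["document"]["tokens"]
    words = [tok for tok, is_html in zip(tokens["token"], tokens["is_html"]) if not is_html]
    return " ".join(words)


passage = page_text(toy_record)
print(toy_record["question"]["text"])
print(passage)
```

The resulting string is the kind of plain text a retriever can embed; the HTML markup tokens are dropped because they carry no answer content.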

In [1]:
# Step 1: Install Haystack
!pip install farm-haystack[inference] --quiet

In [2]:
# Step 1 (continued): Install FAISS support and the datasets library
!pip install farm-haystack[faiss] datasets --quiet

In [3]:
# Step 2: Import Required Libraries
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import DensePassageRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import print_answers
from datasets import load_dataset


In [4]:
# Step 3: Set Up the FAISS Document Store
document_store = FAISSDocumentStore(embedding_dim=768)  # 768 is the embedding dimension used by DPR
# Do not run this cell a second time; it will raise an exception.
# If you need to rerun it, first delete the current runtime, then run all cells.

In [5]:
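# Step 4: Load NQ Dataset

# Stream the 'train' split of Natural Questions and keep only the first 50 records.
# Without streaming, split="train[:50]" downloads all 287 shards before slicing;
# the original run of this cell did that and ended in a DatasetGenerationError.
nq_stream = load_dataset("natural_questions", split="train", streaming=True)

def page_text(item):
    # NQ records have no 'context' field. The Wikipedia page is stored as tokens
    # flagged with is_html (layout per the dataset card), so keep the non-HTML ones.
    tokens = item["document"]["tokens"]
    return " ".join(t for t, is_html in zip(tokens["token"], tokens["is_html"]) if not is_html)

# Convert the records into a format compatible with Haystack
documents = [{"content": page_text(item)} for item in nq_stream.take(50)]

print(f"Loaded {len(documents)} documents.")

# Write the documents to the document store
document_store.write_documents(documents)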


DatasetGenerationError: An error occurred while generating the dataset

In [None]:
# Step 5: Initialize the Dense Passage Retriever (DPR)

"""
The Dense Passage Retriever (DPR) retrieves documents using dense embeddings.
DPR requires two models:

Query Embedding Model:   Embeds the query into a vector space.
Passage Embedding Model: Embeds the documents into the same vector space
                         for similarity comparison.
"""
# Initialize Dense Passage Retriever for dense vector-based retrieval
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=False  # Set to True if you want to use a GPU
)

# Update the document store with embeddings for the documents
document_store.update_embeddings(retriever)
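To make the idea of comparing queries and passages in one vector space concrete, here is a minimal sketch in plain Python. It is not Haystack's or FAISS's implementation: the 3-dimensional vectors are invented stand-ins for DPR's 768-dimensional embeddings, and the helper names `dot` and `retrieve` are hypothetical. It only illustrates ranking passages by dot product with the query vector.

```python
# Toy 3-dimensional vectors standing in for DPR's 768-dimensional embeddings.
# All numbers are made up; a real run would get them from the two encoders.
passage_vectors = {
    "Paris is the capital of France.": [0.9, 0.1, 0.0],
    "Einstein developed the theory of relativity.": [0.0, 0.2, 0.9],
    "FAISS is a library for similarity search.": [0.1, 0.8, 0.1],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of a question about France


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, vectors, top_k=1):
    """Return the top_k (passage, score) pairs ranked by dot product."""
    scored = [(text, dot(query_vec, vec)) for text, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


best = retrieve(query_vector, passage_vectors, top_k=1)
print(best[0][0])
```

In the real pipeline the document store holds the passage vectors and performs this search over an index rather than a Python loop, which is what keeps retrieval fast on large collections.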


In [None]:
# Step 6: Initialize the Reader
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

In [None]:
# Step 7: Build the Pipeline
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)


In [None]:
# Step 8: Ask a Question and Get an Answer

query = "What is the capital of France?"

# Run the pipeline and get answers
prediction = pipeline.run(query=query, params={"Retriever": {"top_k": 1}, "Reader": {"top_k": 1}})

# Print the answers
print_answers(prediction, details="minimum")


In [None]:
# Step 9: Test with Another Query

query_2 = "Who developed the theory of relativity?"

# Run the pipeline and get answers
prediction_2 = pipeline.run(query=query_2, params={"Retriever": {"top_k": 1}, "Reader": {"top_k": 1}})

# Print the answers
print_answers(prediction_2, details="minimum")
