# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by loading the API keys the notebook needs from Colab Secrets:

In [1]:
import os
from google.colab import userdata

OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")  # read the key from Colab Secrets
PINECONE_API_KEY = userdata.get("PINECONE_API_KEY")

assert OPENAI_API_KEY, "OPENAI_API_KEY is missing from Colab Secrets."
assert PINECONE_API_KEY, "PINECONE_API_KEY is missing from Colab Secrets."

# Optional but useful: set them as env vars so the rest of the notebook can use os.getenv(...)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY

print("Secrets loaded OK:", bool(OPENAI_API_KEY), bool(PINECONE_API_KEY))
print("Pinecone key prefix:", PINECONE_API_KEY[:5])

Secrets loaded OK: True True
Pinecone key prefix: pcsk_


# Install Dependencies

In [2]:
!pip install -qU datasets pinecone-client sentence-transformers torch

# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas dataframe, keep only the title and context columns, and drop any duplicate context passages.

In [3]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
# Select only the columns we need for retrieval + QA
df = df[["title", "context"]].copy()

# Drop rows that contain duplicate context passages (keeps first occurrence)
df = df.drop_duplicates(subset=["context"]).reset_index(drop=True)

df

| | title | context |
|---|---|---|
| 0 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... |
| 1 | University_of_Notre_Dame | As at most other universities, Notre Dame's st... |
| 2 | University_of_Notre_Dame | The university is the major seat of the Congre... |
| 3 | University_of_Notre_Dame | The College of Engineering was established in ... |
| 4 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are... |
| ... | ... | ... |
| 18886 | Kathmandu | Institute of Medicine, the central college of ... |
| 18887 | Kathmandu | Football and Cricket are the most popular spor... |
| 18888 | Kathmandu | The total length of roads in Nepal is recorded... |
| 18889 | Kathmandu | The main international airport serving Kathman... |


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve using another vector (the query vector). To create the index, we first need to initialize a connection to Pinecone. This requires a free [API key](https://app.pinecone.io/); we then initialize the connection like so:

In [5]:
!pip install -qU langchain-pinecone pinecone-notebooks

In [27]:
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# create the Pinecone client; a serverless index takes its cloud and region
# from `spec`, so no separate `environment` argument is needed
pc = Pinecone(api_key=PINECONE_API_KEY)

Now we create a new index called "question-answering" (we can name the index anything we want). We set the metric to "cosine" and the dimension to 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimensional vectors.

In [7]:
from pinecone import Pinecone, ServerlessSpec

# assumes you already set PINECONE_API_KEY earlier
pc = Pinecone(api_key=PINECONE_API_KEY)

index_name = "question-answering"
spec = ServerlessSpec(cloud="aws", region="us-east-1")

# List existing indexes (new SDK returns an object with .names())
existing_indexes = pc.list_indexes().names()

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=384,
        metric="cosine",
        spec=spec,
    )

index = pc.Index(index_name)
print("Connected to index:", index_name)

Connected to index: question-answering
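A freshly created index can take a moment before it accepts requests. The helper below is a minimal sketch of a readiness check; `wait_until_ready`, `timeout_s`, and `poll_s` are hypothetical names introduced here, and the sketch assumes that `describe_index(name).status["ready"]` is available in the installed Pinecone SDK version.

```python
import time


def wait_until_ready(client, name, timeout_s=60.0, poll_s=1.0):
    """Poll describe_index until the index reports ready or the timeout expires.

    Assumption: client.describe_index(name).status["ready"] is a boolean,
    as in recent versions of the Pinecone Python SDK.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if client.describe_index(name).status["ready"]:
            return True
        time.sleep(poll_s)
    # The index never reported ready within the allotted time
    return False
```

In this notebook it could be called as `wait_until_ready(pc, index_name)` right after `pc.create_index(...)` and before `pc.Index(index_name)`.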


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
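To make the ranking step concrete, here is a small sketch of cosine similarity and top-k selection in plain Python. The passage ids and the 3-dimensional vectors are made up for illustration (the real retriever produces 384-dimensional embeddings), and Pinecone performs this search for us at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, passages, k):
    """Return the k (passage_id, score) pairs most similar to the query."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in passages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy embeddings, not output of the real retriever
passages = {
    "egypt_oil": [0.9, 0.1, 0.0],
    "youtube": [0.0, 0.2, 0.9],
    "moon": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]

for passage_id, score in top_k(query, passages, k=2):
    print(passage_id, round(score, 3))
```

The query vector points in nearly the same direction as the `egypt_oil` vector, so that passage ranks first.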

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [8]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Retriever model: produces 384-dim embeddings optimized for cosine similarity
retriever = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1", device=device)
retriever

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/multi-qa-MiniLM-L6-cos-v1
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We do this in batches, which speeds up both encoding and uploading to the Pinecone index. For each context passage, Pinecone needs an id (a unique string), the context embedding, and metadata. The metadata is a dictionary of data relevant to the embedding, such as the article title and the context passage itself.

In [9]:
from tqdm.auto import tqdm


batch_size = 64
num_rows = len(df)

for start_idx in tqdm(range(0, num_rows, batch_size)):
    # 1) Compute the end index for this batch (handles the final smaller batch)
    end_idx = min(start_idx + batch_size, num_rows)

    # 2) Slice the dataframe for this batch
    batch_df = df.iloc[start_idx:end_idx]

    # 3) Collect the raw text contexts we want to embed
    batch_contexts = batch_df["context"].tolist()

    # 4) Convert contexts -> embedding vectors (384 floats each)
    batch_embeddings = retriever.encode(
        batch_contexts,
        convert_to_numpy=True,
    ).tolist()

    # 5) Build metadata for each record (stored alongside vectors in Pinecone)
    batch_titles = batch_df["title"].tolist()
    batch_metadata = [
        {"title": title, "context": context}
        for title, context in zip(batch_titles, batch_contexts)
    ]

    # 6) Create stable unique IDs (Pinecone wants string IDs)
    batch_ids = [str(row_id) for row_id in range(start_idx, end_idx)]

    # 7) Upsert: (id, vector, metadata)
    vectors_to_upsert = list(zip(batch_ids, batch_embeddings, batch_metadata))
    index.upsert(vectors=vectors_to_upsert)

index.describe_index_stats()



  0%|          | 0/296 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891,
 'vector_type': 'dense'}
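The batch arithmetic can be checked without touching the index. The sketch below uses a hypothetical helper, `batch_bounds`, that reproduces the `start_idx`/`end_idx` logic of the loop above for 18,891 rows and a batch size of 64.

```python
def batch_bounds(num_rows, batch_size):
    """Yield (start, end) index pairs that cover range(num_rows) in batches."""
    for start in range(0, num_rows, batch_size):
        # min() shortens the final batch when num_rows is not a multiple of batch_size
        yield start, min(start + batch_size, num_rows)


bounds = list(batch_bounds(18891, 64))
print("batches:", len(bounds))
print("last batch:", bounds[-1])
```

This gives 296 batches, matching the progress bar total, and the final batch covers only the last 11 rows (18880 to 18890).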

In [10]:
print("df type:", type(df))
print("df value is None:", df is None)

df type: <class 'pandas.core.frame.DataFrame'>
df value is None: False


# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [11]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

QuestionAnsweringPipeline: {'model': 'ElectraForQuestionAnswering', 'dtype': 'float32', 'device': 'cuda', 'input_modalities': 'text'}

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context passages most likely to contain an answer to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [12]:
from typing import List

# gets context passages from the pinecone index
def get_context(question: str, top_k: int) -> List[str]:
    """Retrieve the top_k most relevant context passages for a given question."""
    # 1) Convert the question to an embedding vector
    question_vector = retriever.encode(question, convert_to_numpy=True).tolist()

    # 2) Query Pinecone for the most similar context vectors
    query_result = index.query(
        vector=question_vector,
        top_k=top_k,
        include_metadata=True,
    )

    # 3) Extract the stored context strings from metadata
    matches = query_result.get("matches", [])
    contexts = [
        match["metadata"]["context"]
        for match in matches
        if match.get("metadata") and "context" in match["metadata"]
    ]

    return contexts

In [13]:
# Test it

test_question = "What is the main topic of this dataset?"
contexts = get_context(test_question, top_k=3)

print("Returned contexts:", len(contexts))
print(contexts[0][:300])

Returned contexts: 3
"Data on ethnic groups are important for putting into effect a number of federal statutes (i.e., enforcing bilingual election rules under the Voting Rights Act; monitoring and enforcing equal employment opportunities under the Civil Rights Act). Data on Ethnic Groups are also needed by local governm


In [14]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results by the reader's score (highest first) and pretty-print them;
    # pprint returns None, so this function prints the results rather than returning them
    pprint(sorted(results, key=lambda x: x['score'], reverse=True))

In [15]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

['Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013), which makes Egypt as the largest oil producer not member of the Organization of the Petroleum Exporting Countries (OPEC) and the second-largest dry natural gas producer in Africa. In 2013, Egypt was the largest consumer of oil and natural gas in Africa, as more than 20% of total oil consumption and more than 40% of total dry natural gas consumption in Africa. Also, Egypt possesses the largest oil refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is currently planning to build its first nuclear power plant in El Dabaa city, northern Egypt.']

As we can see, the retriever is working fine and returns the context passage that contains the answer to our question. Now let's use the reader to extract the exact answer from the context passage.

In [16]:
extract_answer(question, context)

[{'answer': '691,000 bbl/d',
  'context': 'Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of '
             'natural gas (in 2013), which makes Egypt as the largest oil '
             'producer not member of the Organization of the Petroleum '
             'Exporting Countries (OPEC) and the second-largest dry natural '
             'gas producer in Africa. In 2013, Egypt was the largest consumer '
             'of oil and natural gas in Africa, as more than 20% of total oil '
             'consumption and more than 40% of total dry natural gas '
             'consumption in Africa. Also, Egypt possesses the largest oil '
             'refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is '
             'currently planning to build its first nuclear power plant in El '
             'Dabaa city, northern Egypt.',
  'end': 33,
  'score': 0.9999855201981802,
  'start': 20}]
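The `start` and `end` fields in the reader output are character offsets into the context, so slicing the context with them recovers the answer text. Here is a quick check using the values from the output above; `answer_span` is a hypothetical helper, and the context string is shortened to its opening words.

```python
def answer_span(prediction):
    """Slice the answer text out of the context using the reader's offsets."""
    return prediction["context"][prediction["start"]:prediction["end"]]


# Values copied from the reader output above (context shortened)
prediction = {
    "answer": "691,000 bbl/d",
    "context": "Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013)",
    "start": 20,
    "end": 33,
}

# The sliced span should match the answer string the reader reported
assert answer_span(prediction) == prediction["answer"]
print(answer_span(prediction))
```

These offsets are useful for highlighting the answer inside the passage when displaying results.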


The reader model extracted the correct answer, *691,000 bbl/d*, from the context passage with a confidence score above 0.99. Let's run a few more queries.

In [17]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Hurley and Chen',
  'context': 'According to a story that has often been repeated in the media, '
             'Hurley and Chen developed the idea for YouTube during the early '
             'months of 2005, after they had experienced difficulty sharing '
             "videos that had been shot at a dinner party at Chen's apartment "
             'in San Francisco. Karim did not attend the party and denied that '
             'it had occurred, but Chen commented that the idea that YouTube '
             'was founded after a dinner party "was probably very strengthened '
             'by marketing ideas around creating a story that was very '
             'digestible".',
  'end': 79,
  'score': 0.9999276399612427,
  'start': 64}]


In [18]:
question = "What is Albert Eistein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'his theories of special relativity and general relativity',
  'context': 'Albert Einstein is known for his theories of special relativity '
             'and general relativity. He also made important contributions to '
             'statistical mechanics, especially his mathematical treatment of '
             'Brownian motion, his resolution of the paradox of specific '
             'heats, and his connection of fluctuations and dissipation. '
             'Despite his reservations about its interpretation, Einstein also '
             'made contributions to quantum mechanics and, indirectly, quantum '
             'field theory, primarily through his theoretical studies of the '
             'photon.',
  'end': 86,
  'score': 0.9500380754470825,
  'start': 29}]


Let's run another question, this time with the top 3 context passages from the retriever.

In [19]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Armstrong',
  'context': 'The trip to the Moon took just over three days. After achieving '
             'orbit, Armstrong and Aldrin transferred into the Lunar Module, '
             'named Eagle, and after a landing gear inspection by Collins '
             'remaining in the Command/Service Module Columbia, began their '
             'descent. After overcoming several computer overload alarms '
             'caused by an antenna switch left in the wrong position, and a '
             'slight downrange error, Armstrong took over manual flight '
             'control at about 180 meters (590 ft), and guided the Lunar '
             'Module to a safe landing spot at 20:18:04 UTC, July 20, 1969 '
             '(3:17:04 pm CDT). The first humans on the Moon would wait '
             'another six hours before they ventured out of their craft. At '
             '02:56 UTC, July 21 (9:56 pm CDT July 20), Armstrong became the '
             'first human to set foot on the Moon.',

The result looks pretty good.

In [20]:
# pc.delete_index(index_name)

### Add a few more questions. What did you observe?

In [21]:
question = "What is the capital of autralia?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Leulumoega',
  'context': 'The capital village of each district administers and coordinates '
             "the affairs of the district and confers each district's "
             'paramount title, amongst other responsibilities. For example, '
             "the District of A'ana has its capital at Leulumoega. The "
             "paramount title of A'ana is the TuiA'ana. The orator group which "
             'confers this title – the Faleiva (House of Nine) – is based at '
             'Leulumoega. This is also the same for the other districts. In '
             'the district of Tuamasaga, the paramount title of the district – '
             'the Malietoa title – is conferred by the FaleTuamasaga based in '
             'Afega.',
  'end': 378,
  'score': 0.023090426844646572,
  'start': 368}]


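Correct answer: Canberra

The reader did its job and returned an answer, but the wrong one. The retrieved passage does not mention Australia at all, and the reader's low score (about 0.02) reflects that.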

In [23]:
question = "Which planet is known as the red planet?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Neptune',
  'context': 'From its discovery in 1846 until the subsequent discovery of '
             'Pluto in 1930, Neptune was the farthest known planet. When Pluto '
             'was discovered it was considered a planet, and Neptune thus '
             'became the penultimate known planet, except for a 20-year period '
             "between 1979 and 1999 when Pluto's elliptical orbit brought it "
             'closer to the Sun than Neptune. The discovery of the Kuiper belt '
             'in 1992 led many astronomers to debate whether Pluto should be '
             'considered a planet or as part of the Kuiper belt. In 2006, the '
             'International Astronomical Union defined the word "planet" for '
             'the first time, reclassifying Pluto as a "dwarf planet" and '
             'making Neptune once again the outermost known planet in the '
             'Solar System.',
  'end': 344,
  'score': 3.642711268259449e-13,
  'start': 337}]


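Correct answer: Mars

Again, the reader returned a planet, but not the correct one. The retrieved passage is about Neptune and Pluto, and the reader's score (about 3.6e-13) is effectively zero.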
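One way to act on these scores is to discard low-confidence predictions instead of showing them. The sketch below is an illustration only: `confident_answers` is a hypothetical helper, and the 0.5 cutoff is an arbitrary assumption that would need tuning on real questions.

```python
def confident_answers(results, min_score=0.5):
    """Keep only predictions whose reader score reaches min_score, best first."""
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


# Answers and scores copied from the reader outputs in this notebook
results = [
    {"answer": "Leulumoega", "score": 0.023090426844646572},
    {"answer": "691,000 bbl/d", "score": 0.9999855201981802},
    {"answer": "Neptune", "score": 3.642711268259449e-13},
]

print([r["answer"] for r in confident_answers(results)])
```

Only the Egypt answer survives the cutoff. A fixed threshold is a blunt tool, though: the correct "Sir Robert Walpole" answer in the next example scores only about 0.14, so a cutoff of 0.5 would discard it as well.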

In [25]:
question = "Who is the first Prime Minister of the UK?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Sir Robert Walpole',
  'context': 'The term prime minister in the sense that we know it originated '
             'in the 18th century in the United Kingdom when members of '
             'parliament disparagingly used the title in reference to Sir '
             'Robert Walpole. Over time, the title became honorific and '
             'remains so in the 21st century.',
  'end': 196,
  'score': 0.1390971839427948,
  'start': 178},
 {'answer': 'First Lord of the Treasury',
  'context': "The United Kingdom's constitution, being uncodified and largely "
             'unwritten, makes no mention of a prime minister. Though it had '
             'de facto existed for centuries, its first mention in official '
             'state documents did not occur until the first decade of the '
             'twentieth century. Accordingly, it is often said "not to exist", '
             'indeed there are several instances of parliament declaring this '
             'to be the case. The pr

Correct answer: Sir Robert Walpole is generally considered to be the first Prime Minister of Great Britain, taking office in 1721.

In this case, the reader's top-ranked answer is correct, so it is not always a miss.