<a href="https://colab.research.google.com/github/appulchen/lab-abstractive-question-answering/blob/main/lab_abstractive_question_answering_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB | Abstractive Question Answering

Abstractive question answering generates multi-sentence answers to open-ended questions. It usually works by searching a large document store for relevant passages and then using those passages as context to generate an answer. This notebook demonstrates how to build an abstractive question-answering system with Pinecone. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

In [None]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# 768-dim embedding model
retriever = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device=device)

print("Retriever loaded with dimension:", retriever.get_sentence_embedding_dimension())


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Retriever loaded with dimension: 768


# Install Dependencies

In [None]:
!pip uninstall -y pinecone-client
!pip uninstall -y pinecone
!pip install -q "pinecone>=3.0.0"


Found existing installation: pinecone-client 6.0.0
Uninstalling pinecone-client-6.0.0:
  Successfully uninstalled pinecone-client-6.0.0
Found existing installation: pinecone 8.0.0
Uninstalling pinecone-8.0.0:
  Successfully uninstalled pinecone-8.0.0


In [None]:
from google.colab import userdata
import os

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["PINECONE_API_KEY"] = PINECONE_API_KEY

print("Secrets loaded!")


Secrets loaded!


# Load and Prepare Dataset

Our source data comes from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. Since indexing the entire dataset would take some time, this demo uses only 50,000 passages whose `section_title` field contains "History". You can use the complete dataset if you want; the Pinecone vector database can manage millions of documents.

In [None]:
from datasets import load_dataset

# load the dataset from huggingface in streaming mode and shuffle it
wiki_data = load_dataset(
    'wiki_snippets',
    'wikipedia_en_100_0',
    split='train',
    streaming=True
).shuffle(seed=960)


We load the dataset in streaming mode so that we don't have to wait for the whole dataset (over 9 GB) to download. Instead, records are downloaded one at a time as we iterate.
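
The lazy behaviour of streaming can be illustrated without downloading anything. The sketch below uses a hypothetical `fake_stream` generator as a stand-in for the streamed dataset (it is not part of the `datasets` library) and `itertools.islice` to take the first few records whose section title contains "History". The counter shows that the stream is consumed only as far as needed.

```python
from itertools import islice

# counts how many records the stand-in stream has actually produced
pulled = {"count": 0}

def fake_stream(n_records):
    """Hypothetical stand-in for a streamed dataset: yields one record at a time."""
    for i in range(n_records):
        pulled["count"] += 1
        title = "History" if i % 3 == 0 else "Geography"
        yield {"section_title": title, "passage_text": f"passage {i}"}

# lazily keep only records whose section title mentions "History"
history_records = (r for r in fake_stream(1_000_000) if "History" in r["section_title"])

# take the first 5 matches; nothing beyond the 5th match is generated
first_five = list(islice(history_records, 5))

print(len(first_five), "matches after pulling", pulled["count"], "records")
```

Because both the filter and `islice` are lazy, only a small prefix of the million simulated records is ever produced.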

In [None]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'_id': '{"datasets_id": 5607441, "wiki_id": "Volga_State_University_of_Water_Transport", "sp": 15, "sc": 398, "ep": 24, "ec": 18}',
 'datasets_id': 5607441,
 'wiki_id': 'Volga_State_University_of_Water_Transport',
 'start_paragraph': 15,
 'start_character': 398,
 'end_paragraph': 24,
 'end_character': 18,
 'article_title': 'Volga State University of Water Transport',
 'section_title': 'Programs & Navigation & Electromechanical Engineering',
 'passage_text': 'belfry and the cupola with the cross of the house church were lost, and the interior has been redeveloped.  For summer holidays, a sports camp, Vodnik, on the coast of the Gorky sea, is made available for staff and students .  Departments  Navigation  Prepares engineers to navigate for sea and river vessels. The curriculum is includes modern methods and training facilities, including specialized simulators, in compliance with the requirements of the International Convention on the Training and Certification of Seafarers and Watchk

In [None]:
# keep only documents whose section_title contains "History" (lazy generator over the stream)
history = (item for item in wiki_data if "History" in item["section_title"])


Let's iterate through the dataset and apply our filter to select the 50,000 historical passages. We will extract `article_title`, `section_title` and `passage_text` from each document.

In [None]:
from tqdm.auto import tqdm  # progress bar

total_doc_count = 50000

counter = 0
docs = []

# iterate through the dataset and apply our filter
for d in tqdm(history, total=total_doc_count):
    # extract the fields we need - article, section, and passage
    docs.append({
        "article_title": d["article_title"],
        "section_title": d["section_title"],
        "passage_text": d["passage_text"]
    })

    # increase the counter on every iteration
    counter += 1

    # stop after 50,000 documents
    if counter == total_doc_count:
        break



In [None]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

| | article_title | section_title | passage_text |
|---|---|---|---|
| 0 | Ace-Ten games | History & Games with national or regional stat... | is uncertain, it is most likely to have been i... |
| 1 | Ikot Inuen | History & Culture | Government Area, settled first in Ikot Inyang ... |
| 2 | Glasgow Corporation Water Works | History | Katrine scheme. The council then sought advice... |
| 3 | Glasgow Corporation Water Works | History | work to proceed at multiple faces. 25 bridges ... |
| 4 | Glasgow Corporation Water Works | History | by this time pneumatic drills and better explo... |


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can later retrieve using another vector (the query vector). To build the index, we must first establish a connection with Pinecone. For this, we need a Pinecone API key. You can get one for free at [app.pinecone.io](https://app.pinecone.io/). We then initialize the connection as follows:

In [None]:

from google.colab import userdata
import os

# read the Pinecone API key from Colab secrets (get a key at app.pinecone.io)
PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY


Next, we set up the index specification, which defines the cloud provider and region where the index is deployed. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

Now we create a new index. We will name it "abstractive-question-answering", but you can name it anything you want. We specify the metric type as "cosine" and the dimension as 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimensional vectors.

In [None]:
from pinecone import Pinecone, ServerlessSpec
import os

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])


In [None]:
index_name = "abstractive-question-answering"


In [None]:
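import os
import time

# cloud provider and region for the serverless index
# (assumed defaults; override them via environment variables to match your Pinecone project)
cloud = os.environ.get("PINECONE_CLOUD", "aws")
region = os.environ.get("PINECONE_REGION", "us-east-1")

# create the index only if it does not already exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,   # must match the retriever's embedding dimension
        metric="cosine",
        spec=ServerlessSpec(
            cloud=cloud,
            region=region
        )
    )

# wait until the index is ready to accept requests
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)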


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use Cosine Similarity to compute the similarity between query and context vectors generated by this model (Pinecone automatically does this for us).
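
To make the similarity computation concrete, here is a minimal pure-Python sketch of cosine similarity and a brute-force top-k search over toy 3-dimensional vectors. The helper names `cosine_similarity` and `top_k` and the vectors themselves are illustrative assumptions. Pinecone performs the search for us on the server side; this loop only illustrates how passages are scored and ranked.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, passages, k):
    """Score every passage against the query and return the k best (id, score) pairs."""
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in passages.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# toy "embeddings" (real ones from the retriever have 768 dimensions)
passages = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.6, 0.8, 0.0],
    "p3": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, passages, k=2)
print(results)
```

A passage pointing in the same direction as the query scores close to 1, while an orthogonal passage scores 0, regardless of vector length.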

The retriever was already loaded in the first code cell of this notebook.

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, passage text, etc.
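
The record structure described above can be sketched independently of Pinecone and of the real retriever. In the example below, `fake_embed` is a hypothetical stand-in for `retriever.encode` that returns small dummy vectors, and `build_batches` is an illustrative helper name; only the shape of each record (`id`, `values`, `metadata`) mirrors what is described in this section.

```python
def fake_embed(texts, dim=4):
    """Hypothetical stand-in for the retriever: one fixed-size dummy vector per text."""
    return [[float(len(t) % (j + 2)) for j in range(dim)] for t in texts]

def build_batches(docs, batch_size):
    """Yield lists of {"id", "values", "metadata"} records, batch_size documents at a time."""
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        embeddings = fake_embed([d["passage_text"] for d in batch])
        yield [
            # the id is the document's position in the full list, as a unique string
            {"id": str(start + k), "values": emb, "metadata": doc}
            for k, (doc, emb) in enumerate(zip(batch, embeddings))
        ]

docs = [
    {"article_title": f"Article {i}", "section_title": "History", "passage_text": "text " * (i + 1)}
    for i in range(10)
]

batches = list(build_batches(docs, batch_size=4))
print([len(b) for b in batches])
```

The last batch is simply shorter when the number of documents is not a multiple of the batch size, so no document is dropped or duplicated.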

In [None]:
from tqdm.auto import tqdm

index_name = "abstractive-question-answering"
index = pc.Index(index_name)      # connect once

batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    end_i = min(i + batch_size, len(df))
    batch = df.iloc[i:end_i]

    # 1. generate embeddings (768-dim from all-mpnet-base-v2)
    embeddings = retriever.encode(
        batch["passage_text"].tolist()
    ).tolist()

    # 2. build metadata + unique string IDs
    metadatas = batch[["article_title", "section_title", "passage_text"]].to_dict(orient="records")
    ids = batch.index.astype(str).tolist()   # unique per row

    # 3. create vector list
    vectors = [
        {
            "id": ids[k],
            "values": embeddings[k],
            "metadata": metadatas[k],
        }
        for k in range(len(embeddings))
    ]

    # 4. upsert into Pinecone
    index.upsert(vectors=vectors)

# After loop, check that vectors are there
index.describe_index_stats()



{'_response_info': {'raw_headers': {'connection': 'keep-alive',
                                    'content-length': '189',
                                    'content-type': 'application/json',
                                    'date': 'Fri, 21 Nov 2025 18:01:38 GMT',
                                    'grpc-status': '0',
                                    'server': 'envoy',
                                    'x-envoy-upstream-service-time': '35',
                                    'x-pinecone-request-id': '3616941773234647733',
                                    'x-pinecone-request-latency-ms': '35'}},
 'dimension': 768,
 'index_fullness': 0.0,
 'memoryFullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'__default__': {'vector_count': 50000}},
 'storageFullness': 0.0,
 'total_vector_count': 50000,
 'vector_type': 'dense'}

# Initialize Generator

We will use ELI5 BART as the generator. It is a sequence-to-sequence model trained on the "Explain Like I'm 5" (ELI5) dataset. Sequence-to-sequence models take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string that concatenates the query and the relevant documents providing the context for the answer. The documents are separated by a special token `<P>`, so the input string looks as follows:

>question: What is a sonic boom? context: `<P>` A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. `<P>` Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. `<P>` Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.
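
The input format shown above can be sketched as a small helper. The function name `build_lfqa_input` is a hypothetical choice for illustration; it only assembles the string layout described in this section (a `question:` prefix, a `context:` prefix, and one `<P>` marker per passage) and makes no claim about how the model was trained.

```python
def build_lfqa_input(question, passages):
    """Assemble 'question: ... context: <P> ... <P> ...' from a question and passages."""
    # prefix every passage with the <P> separator and join them with single spaces
    context = " ".join(f"<P> {p.strip()}" for p in passages)
    return f"question: {question} context: {context}"

example = build_lfqa_input(
    "What is a sonic boom?",
    [
        "A sonic boom is a sound associated with shock waves.",
        "Sonic booms generate enormous amounts of sound energy.",
    ],
)
print(example)
```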

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190), and how the ELI5 BART model was trained is described [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

All the components of our abstractive QA system are complete and ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects.

In [None]:
def query_pinecone(query, top_k):
    # generate embeddings for the query
    xq = retriever.encode([query]).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(
        vector=xq[0],
        top_k=top_k,
        include_metadata=True
    )
    return xc

In [None]:
def format_query(query, context):
    # extract passage_text from each Pinecone match and prefix it with the <P> tag
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    # concatenate all context passages, one per line
    context = "\n".join(context)
    # concatenate the query and the context passages
    # (note: this omits the "question:" / "context:" prefixes shown in the example above)
    query = f"{query}\n{context}"
    return query

Let's test the helper functions. We will query the Pinecone index with the `query_pinecone` function we created earlier to get context passages and then pass them to the `format_query` function.

In [None]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

QueryResponse(matches=[{'id': '2352',
 'metadata': {'article_title': 'Renewable energy in California',
              'passage_text': 'Mining, began receiving electricity from a 12.5 '
                              'mile 2,500 AC power line that originated in '
                              'Bodie, California.  With the first three-phase '
                              'hydroelectric system being built in Germany '
                              'back in 1891, the U.S. gets its first three '
                              'phase system in 1893 in Mill Creek, California: '
                              'featuring a line connection that extended 8 '
                              'miles and carried 2,400 volts of electricity. '
                              'Folsom, California received the same type of '
                              'system in 1893 as well, except it had 11,000 '
                              'volt alternators put in place, and its power '
                              'lin

In [None]:
from pprint import pprint

In [None]:
# format the query the way the generator expects its input
query = format_query(query, result["matches"])
print(query)

when was the first electric power system built?
<P> Mining, began receiving electricity from a 12.5 mile 2,500 AC power line that originated in Bodie, California.  With the first three-phase hydroelectric system being built in Germany back in 1891, the U.S. gets its first three phase system in 1893 in Mill Creek, California: featuring a line connection that extended 8 miles and carried 2,400 volts of electricity. Folsom, California received the same type of system in 1893 as well, except it had 11,000 volt alternators put in place, and its power line extended all the way to the state capitol, Sacramento.  The acquisition of Colgate hydroelectric plants


The output looks great. Now let's write a function to generate answers.

In [None]:
def generate_answer(query):
    # tokenize the query to get input_ids
    inputs = tokenizer([query], max_length=1024, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return pprint(answer)

In [None]:
generate_answer(query)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


('The first electric power system was built in Germany back in 1891. The first '
 'hydroelectric plant was built in Germany back in 1891. The first electric '
 'power plant was built in Mill Creek')


As we can see, the generator used the provided context to answer our question. Let's run some more queries.

In [None]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The first wireless message was sent by a telegraph. The telegraph was a '
 'device that used electricity to send a message. The telegraph was a device '
 'that used electricity to send a message')


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [None]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

control of 56 of the 120 Knesset seats  October 29, 1969 (Wednesday)  At 10:30 in the evening at the University of California, Los Angeles (UCLA) campus, the first message was sent over ARPANET, the forerunner of the internet.  Leonard Kleinrock would recall later that the first message, transmitted from UCLA to the computer at the Stanford Research Institute (SRI) was intended to be transmitting the letters "L-O-G", after which Stanford would add two more letters to send back the word "LOGIN".  Charley Kline, a 21-year old UCLA student, was asked by Kleinrock to help send a
---
name, with a different technology.  Products and technology  Cellemetry operated by sending messages over the signalling channel of the analog cellular network. It used a non-dialable telephone number as the device identifier and inserted a device generated data message in place of the phone serial number. The Cellemetry device would then send out a registration message to the home cellular system. The Cellemet

In this case, the answer looks plausible. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, as with this question about COVID-19:

In [None]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('COVID-19 is not a new virus, it is a retrovirus. It is a retrovirus that has '
 'been around for a long time. It is a retrovirus')


In [None]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

in New York was chair and principal organizer of the NIAID/NIH Conference "Emerging Viruses: The Evolution of Viruses and Viral Diseases" held 1–3 May 1989 in Washington, DC. In the article summarizing the conference the authors writeChallenged by the sudden appearance of AIDS as a major public health crisis [...] jointly sponsored the conference "Emerging Viruses: The Evolution of Viruses and Viral Diseases" [...] It was convened to consider the mechanisms of viral emergence and possible strategies for anticipating, detecting, and preventing the emergence of new viral diseases in the future. They further noteSurprisingly, most emergent viruses are zoonotic, with
---
natural animal reservoirs a more frequent source of new viruses than is the sudden evolution of a new entity. The most frequent factor in emergence is human behavior that increases the probability of transfer of viruses from their endogenous animal hosts to man.In a 1991 paper Morse underlines how the emergence of new infe
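
One possible way to reduce answers built on irrelevant passages is to discard low-scoring matches before calling the generator. The sketch below is an assumption-laden illustration, not part of the original pipeline: `filter_matches` is a hypothetical helper, the matches are plain dictionaries shaped like the query results used above (with a `score` field), and the threshold of 0.5 is an arbitrary value that would need tuning on real data.

```python
def filter_matches(matches, min_score):
    """Keep only matches whose similarity score reaches the threshold."""
    return [m for m in matches if m["score"] >= min_score]

# toy matches mimicking the shape of a query result (scores are made up)
matches = [
    {"id": "a", "score": 0.82, "metadata": {"passage_text": "relevant passage"}},
    {"id": "b", "score": 0.31, "metadata": {"passage_text": "unrelated passage"}},
]

relevant = filter_matches(matches, min_score=0.5)

if relevant:
    print("passages to send to the generator:", [m["id"] for m in relevant])
else:
    # with no sufficiently similar passage, skip generation instead of guessing
    print("no relevant context found")
```

When the filtered list is empty, the calling code can report that no context was found rather than passing unrelated text to the generator.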

Let’s finish with a final few questions.

In [None]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

In [None]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

In [None]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

As we can see, the model can generate some decent answers.

#### Add a few more questions

In [None]:
query = "How is the chinese moon calendar organized?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Chinese calendar is based on the Gregorian calendar, which is based on '
 'the Gregorian calendar, which is based on the Gregorian calendar, which is '
 'based on the Gregorian calendar.')


In [None]:
query = "Who was the most famous poet in the 1900s?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('I\'m not sure if this counts as a "poetry" question, but I\'m going to give '
 "it a shot. I'm going to focus on American poetry in the early 1900s")


In [None]:
query = "Of what components is iron made of?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('Iron is made up of iron oxide, which is a very dense metal with a very high '
 'melting point. Iron oxide is also a very reactive metal, meaning that it '
 'reacts with other elements to')
