# LAB | Abstractive Question Answering

Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [None]:
# !pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch

# Load and Prepare Dataset

Our source data will be taken from the Wikimedia Wikipedia dataset. Since the original Wiki Snippets dataset is deprecated, we use the official Wikimedia dataset and split articles into passages ourselves. We will filter for articles containing "History" content and create ~50,000 passages for this demo. Pinecone vector database can effortlessly manage millions of documents for you.

In [7]:
from datasets import load_dataset

# Load Wikipedia dataset - using wikimedia's official dataset
wiki_data = load_dataset(
    'wikimedia/wikipedia',
    '20231101.simple',
    split='train',
    streaming=True
).shuffle(seed=960)

We are loading the dataset in the streaming mode so that we don't have to wait for the whole dataset to download (which is over 9GB). Instead, we iteratively download records one at a time.

In [8]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'id': '803',
 'url': 'https://simple.wikipedia.org/wiki/Slavery',
 'title': 'Slavery',
 'text': 'Slavery is when a person is treated as the property of another person. This person is usually called a slave and the owner is called a slavemaster. It often means that slaves are forced to work, or else they will be punished by the law (if slavery is legal in that place) or by their master.\n\nThere is evidence that even before there was writing, there was slavery. There have been different types of slavery, and they have been in almost all cultures and continents. Some societies had laws about slavery, or had an economy that was built on it. Ancient Greece and Ancient Rome had many slaves.\n\nDuring the 20th century, almost all countries made laws forbidding slavery. The Universal Declaration of Human Rights says that slavery is wrong. Slavery is now banned by international law. Nevertheless, there are still different forms of slavery in some countries. The Islamic Republic of Mauritania 

In [16]:
# filter only documents with "History" in the text (simulating section_title filter)
# Since wikimedia dataset doesn't have section_title, we filter for articles containing "History" sections
history = wiki_data.filter(lambda x: 'History' in x['text'] or 'history' in x['title'].lower())

Let's iterate through the dataset and apply our filter to select passages from history-related articles. Since the Wikimedia dataset doesn't have pre-chunked passages, we'll split the article text into smaller passages (~500 characters each). We'll extract `article_title`, `section_title` and `passage_text` from each document.

In [17]:
from tqdm.auto import tqdm  # progress bar
import re

total_doc_count = 50000

counter = 0
docs = []

def split_into_passages(text, max_length=500):
    """Split text into passages of roughly max_length characters, breaking at sentence boundaries."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    passages = []
    current_passage = ""
    
    for sentence in sentences:
        if len(current_passage) + len(sentence) < max_length:
            current_passage += sentence + " "
        else:
            if current_passage:
                passages.append(current_passage.strip())
            current_passage = sentence + " "
    
    if current_passage:
        passages.append(current_passage.strip())
    
    return passages

# iterate through the dataset and apply our filter
for d in tqdm(history, total=total_doc_count):
    # Split the article text into passages
    passages = split_into_passages(d['text'])
    
    for passage in passages:
        if len(passage) > 50:  # Skip very short passages
            doc = {
                'article_title': d['title'],
                'section_title': 'History',  # Using placeholder since wikimedia doesn't have sections
                'passage_text': passage
            }
            docs.append(doc)
            counter += 1
            
            if counter >= total_doc_count:
                break
    
    if counter >= total_doc_count:
        break

print(f"Collected {len(docs)} passages")

 11%|█         | 5506/50000 [00:12<01:38, 450.29it/s]

Collected 50000 passages





In [18]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

Unnamed: 0,article_title,section_title,passage_text
0,Brazil,History,Brazil (officially called Federative Republic ...
1,Brazil,History,That group of indigenous people is often calle...
2,Brazil,History,They began to import black slaves from Africa ...
3,Brazil,History,This caused some fights with the Spaniards (pe...
4,Brazil,History,"This led to an increase in slave revolts, espe..."


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages which we can retrieve later using another vector (query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need an API from Pinecone. You can get one for free from [here](https://app.pinecone.io/), and after that, we initialize the connection as follows:

In [24]:
import os
import dotenv
from pinecone import Pinecone

dotenv.load_dotenv()  # Load environment variables from .env file

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)

Now we setup our index specification, this allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [25]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index. We will name it "abstractive-question-answering" — you can name it anything we want. We specify the metric type as "cosine" and dimension as 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimension vectors.

In [30]:
index_name = "history-passages-index"  #give your index a meaningful name

In [31]:
import time

# check if index already exists (it shouldn't if this is first time)
#initialize the index, and insure the stats are all zeros
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,  # dimension of the embeddings
        metric='cosine',
        spec=spec
    )
    # wait until the index is ready
    while True:
        index_description = pc.describe_index(index_name)
        if index_description.status['ready']:
            break
        time.sleep(5)

# connect to the index
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use Cosine Similarity to compute the similarity between query and context vectors generated by this model (Pinecone automatically does this for us).

In [None]:
import torch # What does torch do? It is a library for tensor computation and deep learning, often used for building and training neural networks. It provides efficient operations on multi-dimensional arrays (tensors) and supports GPU acceleration, making it popular for machine learning tasks.
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base', device=device)
retriever




To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, passage text, etc.

In [35]:
# we will use batches of 64
batch_size = 64

#You will create embedding for the passage_text variable and be use to include the meta data in each batch
for i in tqdm(range(0, int(0.1*len(df)), batch_size)):
    # find end of batch
    batch_end = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:batch_end]
    # generate embeddings for batch
    embeddings = retriever.encode(batch['passage_text'].tolist(), show_progress_bar=False)
# check that we have all vectors in index
index.describe_index_stats()

100%|██████████| 79/79 [09:58<00:00,  7.58s/it]


{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

# Initialize Generator

We will use ELI5 BART for the generator which is a Sequence-To-Sequence model trained using the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-To-Sequence models can take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token &lt;P>, so the input string will look as follows:

>question: What is a sonic boom? context: &lt;P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. &lt;P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. &lt;P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190) and how ELI5 BART model was trained is available [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [36]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


All the components of our abstract QA system are complete and ready to be queried. But first, let's write some helper functions to retrieve context passages from Pinecone index and to format the query in the way the generator expects the input.

In [40]:
def query_pinecone(query, top_k):
    # generate embeddings for the query
    xq = retriever.encode([query], show_progress_bar=False).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(vector=xq[0], top_k=top_k, include_metadata=True)
    return xc

In [41]:
def format_query(query, context):
    # extract passage_text from Pinecone search result and add the <P> tag
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    # concatinate all context passages
    context = " ".join(context)
    # contcatinate the query and context passages
    query = f"{query} {context}"
    return query

Let's test the helper functions. We will query the Pinecone index function we created earlier with the `query_pinecone` to get context passages and pass them to the `format_query` function.

In [42]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

{'matches': [], 'namespace': '', 'usage': {'read_units': 1}}

In [43]:
from pprint import pprint

In [44]:
# format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

'when was the first electric power system built? '


The output looks great. Now let's write a function to generate answers.

In [45]:
def generate_answer(query):
    # tokenize the query to get input_ids
    inputs = tokenizer([query], max_length=1024, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return pprint(answer)

In [46]:
generate_answer(query)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


('Electricity was first used in the 19th century. The first electric power '
 'system built was a steam engine powered by steam.')


As we can see, the generator used the provided context to answer our question. Let's run some more queries.

In [47]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The first wireless message was sent by a telegraph. It was sent by a '
 'telegraph operator.')


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [48]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

In this case, the answer looks correct. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, like with this question about COVID-19:

In [49]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

'COVID-19 is a name for the COVID-19 Task Force. It was formed in 1943.'


In [50]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

Let’s finish with a final few questions.

In [51]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

"I'm not sure if this is what you're looking for, but I'm curious about it."


In [52]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The first man to walk on the moon was Neil Armstrong. He walked on the moon '
 'in 1969.')


In [53]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

("I don't know if this counts, but I'm curious about the cost of the moon "
 'landing.')


As we can see, the model can generate some decent answers.

#### Add a few more questions

In [57]:
query = "what is the history of Brazil?"
context = query_pinecone(query, top_k=1)
query = format_query(query, context["matches"])
generate_answer(query)

("Brazil is a country with a very long history. It's a country with a very "
 'long history.')
