[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/search/question-answering/extractive-question-answering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/search/question-answering/extractive-question-answering.ipynb)

# Extractive Question Answering with Sentence Transformers in Spanish

This notebook demonstrates how to build an extractive question-answering application in Spanish. The demo has three components:

- A vector index in the Pinecone vector database to store embeddings and run semantic search
- A retriever model that embeds the text passages used as context
- A reader model that extracts answers from the retrieved context for each question

We will use the **SQuAD dataset**, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by installing the packages the notebook needs:

# Install Dependencies

In [1]:
!pip install -qU datasets pinecone-client sentence-transformers torch python-dotenv

## Loading the libraries

In [2]:
from tqdm.auto import tqdm
from pprint import pprint

import torch
from sentence_transformers import SentenceTransformer, util
from datasets import load_dataset
from transformers import pipeline

import pinecone

from dotenv import load_dotenv
import os

# Load Dataset

For this demo we load the Spanish SQuAD dataset (`squad_es`) from the Hugging Face Hub. We load the dataset into a pandas dataframe, keep only the title and context columns, and drop any duplicate context passages.

In [3]:
# load the squad dataset into a pandas dataframe
df = load_dataset("squad_es", 'v2.0.0', split="train").to_pandas()

Dataset squad_es downloaded and prepared to /root/.cache/huggingface/datasets/squad_es/v2.0.0/2.0.0/bcada4f600192451443b95e24f609325705c5185b8aad97bffa8bc3784a867ad. Subsequent calls will reuse this data.


Let's inspect the dataset. It always helps to analyze and understand the data we are working with.

In [4]:
df.head(5)

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 56be85543aeaaa14008c9063 | Beyoncé Knowles | Beyoncé Giselle Knowles-Carter (nacida el 4 de... | ¿Cuándo Beyonce comenzó a ser popular? | {'text': ['a finales de 1990'], 'answer_start'... |
| 1 | 56be85543aeaaa14008c9065 | Beyoncé Knowles | Beyoncé Giselle Knowles-Carter (nacida el 4 de... | ¿En qué áreas compitió Beyonce cuando era niña? | {'text': ['canto y'], 'answer_start': [197]} |
| 2 | 56be85543aeaaa14008c9066 | Beyoncé Knowles | Beyoncé Giselle Knowles-Carter (nacida el 4 de... | ¿Cuándo Beyonce dejó Destiny 's Child y se con... | {'text': ['(200'], 'answer_start': [527]} |
| 3 | 56bf6b0f3aeaaa14008c9601 | Beyoncé Knowles | Beyoncé Giselle Knowles-Carter (nacida el 4 de... | ¿En qué ciudad y estado creció Beyonce? | {'text': ['Houston, Texas'], 'answer_start': [... |
| 4 | 56bf6b0f3aeaaa14008c9602 | Beyoncé Knowles | Beyoncé Giselle Knowles-Carter (nacida el 4 de... | ¿En qué década Beyonce se hizo famoso? | {'text': ['finales de 1990'], 'answer_start': ... |


In [5]:
# select only title and context column
df = df[["title", "context"]]
# drop rows containing duplicate context passages
df = df.drop_duplicates(subset="context", ignore_index=True)
df.head(10)

| | title | context |
|---|---|---|
| 0 | Beyoncé Knowles | Beyoncé Giselle Knowles-Carter (nacida el 4 de... |
| 1 | Beyoncé Knowles | Después de la disolución de Destiny 's Child e... |
| 2 | Beyoncé Knowles | Una auto-descrita "feminista moderna", Beyoncé... |
| 3 | Beyoncé Knowles | Beyoncé Giselle Knowles nació en Houston, Texa... |
| 4 | Beyoncé Knowles | Beyoncé asistió a St. Mary 's Elementary Schoo... |
| 5 | Beyoncé Knowles | A los ocho años, Knowles y su amiga de la infa... |
| 6 | Beyoncé Knowles | El grupo cambió su nombre a Destiny 's Child e... |
| 7 | Beyoncé Knowles | LeToya Luckett y Roberson quedaron descontento... |
| 8 | Beyoncé Knowles | Los miembros restantes de la banda grabaron "I... |
| 9 | Beyoncé Knowles | En julio de 2002, Beyoncé continuó su carrera ... |
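
SQuAD pairs several questions with the same passage (rows 0 to 4 in the first preview all share one context), which is why duplicates are dropped before indexing. As a rough illustration of what `drop_duplicates(subset="context")` does, here is a minimal pure-Python sketch; the helper name `drop_duplicate_contexts` and the toy rows are made up for this example.

```python
def drop_duplicate_contexts(rows):
    """Keep the first row for each distinct context, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row["context"] not in seen:
            seen.add(row["context"])
            # keep only the two columns the notebook uses downstream
            unique.append({"title": row["title"], "context": row["context"]})
    return unique

# Toy rows (hypothetical): two questions share the same passage
rows = [
    {"title": "A", "context": "passage one", "question": "q1"},
    {"title": "A", "context": "passage one", "question": "q2"},
    {"title": "A", "context": "passage two", "question": "q3"},
]
unique_rows = drop_duplicate_contexts(rows)
print(unique_rows)
```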


# Initialize the Retriever model

Next, we need to initialize our retriever. The retriever will generate the embeddings for all context passages (context vectors/embeddings) and for our questions (query vectors/embeddings).

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
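
To make the similarity measure concrete, here is a minimal pure-Python sketch of cosine similarity. The three-dimensional vectors are toy values chosen for illustration, not real embeddings, and the function name is our own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]      # toy "question" embedding
close = [2.0, 0.0, 2.0]      # same direction, different magnitude
unrelated = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, close))      # close to 1: same direction
print(cosine_similarity(query, unrelated))  # 0: no shared direction
```

Because the measure depends only on direction, scaling a vector does not change its score, which is why `close` matches the query almost perfectly.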

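We will use a multilingual SentenceTransformer model named `sentence-transformers/distiluse-base-multilingual-cased-v1`. According to the Sentence Transformers documentation, this v1 checkpoint covers 15 languages, Spanish among them; the 50+ language coverage belongs to the v2 variant.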

In [6]:
# set device to GPU if available
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
# load the retriever model from huggingface model hub
retriever = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1', device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)

From the final `Dense` layer in the model description (`out_features: 512`), we know that our embeddings will have 512 dimensions. We need this value to create the vector index that will store the context passages we use to extract answers.

In [None]:
# Our sentences to encode
sentences = ["This is an example sentence", "Esta es una sentencia de ejemplo"]
# Create the embeddings for our sentences
embeddings = retriever.encode(sentences, convert_to_tensor=True)
# Show the final embeddings
print(embeddings)

tensor([[-0.0389,  0.0185, -0.0407,  ...,  0.0101, -0.0166, -0.0014],
        [-0.0206, -0.0221, -0.0081,  ...,  0.0643,  0.0062,  0.0091]],
       device='cuda:0')


In [None]:
# Calculate the cosine similarity between these two sentences
similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])
print(similarity)

tensor([[0.7632]], device='cuda:0')


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we retrieve by querying with another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [8]:
# Load .env file with environment variables
load_dotenv()

# connect to pinecone environment
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-west4-gcp-free"  # find next to API key in console
)
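
Reading the key with `os.environ[...]` raises a bare `KeyError` when the variable is missing. As an optional convenience, a small helper can fail with a clearer message. This is only a sketch: `require_env` and the `DEMO_API_KEY` variable are hypothetical names, not part of Pinecone or python-dotenv.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell"
        )
    return value

# Placeholder value for the demo only; never hard-code a real key
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))
```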

Now we create a new index called `extractive-question-answering`. We set the metric to "cosine" and the dimension to 512 because the retriever produces 512-dimensional embeddings that we compare with cosine similarity.

In [None]:
index_name = "extractive-question-answering"

# check if the extractive-question-answering index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=512,
        metric="cosine"
    )

# connect to extractive-question-answering index we created
index = pinecone.Index(index_name)
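
Conceptually, a query against a cosine index returns the stored vectors that score highest against the query vector. The brute-force sketch below shows only what is being computed; it is not how Pinecone is implemented, and the two-dimensional vectors and the `top_k` helper are toy assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(vec_id, cosine(query_vec, vec)) for vec_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "index": id -> vector
stored = {"0": [1.0, 0.0], "1": [0.7, 0.7], "2": [0.0, 1.0]}
matches = top_k([1.0, 0.1], stored, k=2)
print(matches)
```

A managed index exists precisely so that we do not have to scan every stored vector like this on each query.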

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We do this in batches so that both encoding and uploading to the Pinecone index are faster. Each record we pass to Pinecone needs an id (a unique value), the context embedding, and metadata. The metadata is a dictionary containing data relevant to the embedding, such as the article title and the context passage itself.

In [None]:
# we will use batches of 64
batch_size = 64
# To minimize compute time for this demo, we limit the number of context passages we work with
max_context = batch_size*50
print("Max number of context passages:", max_context)

Max number of context passages: 3200
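
The batching arithmetic can be sketched on its own. The hypothetical helper below yields `(start, end)` index pairs and also clamps to the dataset size, so ids and rows stay aligned if fewer rows than the limit are available. The dataset size of 5000 is an assumed value for illustration.

```python
def batch_ranges(n_items, batch_size, limit=None):
    """Yield (start, end) index pairs covering at most `limit` items."""
    total = n_items if limit is None else min(n_items, limit)
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

# 5000 is an assumed dataset size; the limit mirrors batch_size * 50
ranges = list(batch_ranges(n_items=5000, batch_size=64, limit=64 * 50))
print(len(ranges), ranges[0], ranges[-1])

# With a small dataset, the last batch is simply shorter
small = list(batch_ranges(n_items=100, batch_size=64, limit=64 * 50))
print(small)
```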


In [None]:
# Populate the index only if it is (nearly) empty, i.e. holds fewer than 100 vectors
index_stats_response = index.describe_index_stats()
if index_stats_response['total_vector_count'] < 100:
    for i in tqdm(range(0, max_context, batch_size)):
        # find end of batch
        i_end = min(i + batch_size, max_context)
        # extract batch
        batch = df.iloc[i:i_end]
        # generate embeddings for batch
        emb = retriever.encode(batch['context'].tolist()).tolist()
        # get metadata (title and context for each passage)
        meta = batch.to_dict(orient='records')
        # create unique IDs from the row positions
        ids = [f"{idx}" for idx in range(i, i_end)]
        # combine into (id, embedding, metadata) tuples
        to_upsert = list(zip(ids, emb, meta))
        # upsert/insert these records to pinecone
        _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()

{'dimension': 512,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 3200}},
 'total_vector_count': 3200}

# Initialize the Reader model

We use the `timpal0l/mdeberta-v3-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [None]:
# Set the reader model
model_name = 'timpal0l/mdeberta-v3-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7f3e987d79d0>

Let's write some helper functions to execute our queries. The `get_context` function retrieves from the Pinecone index the context passages most likely to contain answers to our question, and the `extract_answer` function extracts the answers from these context passages.

In [None]:
# gets context passages from the pinecone index
def get_context(question, top_k):
    # generate embeddings for the question
    xq = retriever.encode([question]).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(xq, top_k=top_k, include_metadata=True)
    # extract the context passage from pinecone search result
    c = [x["metadata"]['context'] for x in xc["matches"]]
    return c
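
The list comprehension in `get_context` assumes every match carries a `context` entry in its metadata. As a sketch of a more defensive variant, the hypothetical helper below works on a plain dict shaped like the response the notebook reads (`matches`, each with `score` and `metadata`). The response here is hand-written, and `min_score` is an assumed extra parameter.

```python
def contexts_from_matches(response, min_score=0.0):
    """Collect (score, context) pairs from a dict-shaped query response."""
    pairs = []
    for match in response.get("matches", []):
        score = match.get("score", 0.0)
        context = match.get("metadata", {}).get("context")
        # skip matches without a stored passage or below the score cutoff
        if context is not None and score >= min_score:
            pairs.append((score, context))
    return pairs

# Hand-written stand-in for a response; ids, scores, and texts are made up
fake_response = {
    "matches": [
        {"id": "3", "score": 0.62, "metadata": {"title": "T", "context": "first passage"}},
        {"id": "4", "score": 0.41, "metadata": {"title": "T", "context": "second passage"}},
        {"id": "9", "score": 0.12, "metadata": {"title": "T"}},
    ]
}
pairs = contexts_from_matches(fake_response, min_score=0.3)
print(pairs)
```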

In [None]:
# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results by the reader score, highest first
    sorted_result = sorted(results, key=lambda x: x['score'], reverse=True)
    # pprint returns None, so print first and return the list itself
    pprint(sorted_result)
    return sorted_result

It is time to test our question-answering pipeline and see how good our embeddings are. To answer a question, the retriever first has to return the appropriate context:

In [None]:
question = "¿Donde nació Beyoncé?"
context = get_context(question, top_k = 1)
context

['Beyoncé Giselle Knowles nació en Houston, Texas, hija de Celestine Ann "Tina" Knowles, una peluquera y dueña de salón, y Mathew Knowles, un gerente de ventas de Xerox. El nombre de Beyoncé es un homenaje al apellido de soltera de su madre. La hermana menor de Beyoncé, Solange, también es cantante y ex miembro de Destiny \'s Child. Mathew es afroamericano, mientras que Tina es de ascendencia criolla de Luisiana (con ascendencia africana, nativa americana, francesa, cajún, y distante irlandesa y española). A través de su madre, Beyoncé es descendiente del líder acadiano Joseph Broussard. Fue criada en un hogar metodista.']

We can increase the `top_k` parameter to retrieve more passages. It does not look necessary for this example, but let's try it:

In [None]:
question = "¿Donde nació Beyoncé?"
context = get_context(question, top_k = 3)
context

['Beyoncé Giselle Knowles nació en Houston, Texas, hija de Celestine Ann "Tina" Knowles, una peluquera y dueña de salón, y Mathew Knowles, un gerente de ventas de Xerox. El nombre de Beyoncé es un homenaje al apellido de soltera de su madre. La hermana menor de Beyoncé, Solange, también es cantante y ex miembro de Destiny \'s Child. Mathew es afroamericano, mientras que Tina es de ascendencia criolla de Luisiana (con ascendencia africana, nativa americana, francesa, cajún, y distante irlandesa y española). A través de su madre, Beyoncé es descendiente del líder acadiano Joseph Broussard. Fue criada en un hogar metodista.',
 'Beyoncé asistió a St. Mary \'s Elementary School en Fredericksburg, Texas, donde se matriculó en clases de baile. Su talento como cantante fue descubierto cuando la instructora de baile Darlette Johnson comenzó a tararear una canción y ella la terminó, capaz de tocar las notas agudas. El interés de Beyoncé en la música y la realización continuó después de ganar un 

As we can see, the retriever is working fine and returns the context passage that contains the answer to our question. Now let's use the reader to extract the exact answer from the context passage.

In [None]:
extract_answer(question, context)

[{'answer': ' Houston, Texas,',
  'context': 'Beyoncé Giselle Knowles nació en Houston, Texas, hija de '
             'Celestine Ann "Tina" Knowles, una peluquera y dueña de salón, y '
             'Mathew Knowles, un gerente de ventas de Xerox. El nombre de '
             'Beyoncé es un homenaje al apellido de soltera de su madre. La '
             'hermana menor de Beyoncé, Solange, también es cantante y ex '
             "miembro de Destiny 's Child. Mathew es afroamericano, mientras "
             'que Tina es de ascendencia criolla de Luisiana (con ascendencia '
             'africana, nativa americana, francesa, cajún, y distante '
             'irlandesa y española). A través de su madre, Beyoncé es '
             'descendiente del líder acadiano Joseph Broussard. Fue criada en '
             'un hogar metodista.',
  'end': 48,
  'score': 0.9763014316558838,
  'start': 32},
 {'answer': ' Fredericksburg, Texas,',
  'context': "Beyoncé asistió a St. Mary 's Elementary School en "


The reader model extracted the correct answer, *Houston, Texas*, from the context passage with a score of about 0.98. Note that this score is the model's confidence in the extracted span, not an accuracy measure. Let's run a few more queries.
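
The `start` and `end` fields in the reader output are character offsets into the context string. The sketch below checks this on a shortened prefix of the passage above, using the offsets from the output (32 and 48). The NFC normalization is a precaution we add so that accented letters count as single characters.

```python
import unicodedata

# Shortened prefix of the retrieved passage, normalized so "é" and "ó" are one code point each
context = unicodedata.normalize(
    "NFC", "Beyoncé Giselle Knowles nació en Houston, Texas, hija de Celestine Ann"
)
# Values copied from the reader output above
prediction = {"answer": " Houston, Texas,", "start": 32, "end": 48}

span = context[prediction["start"]:prediction["end"]]
print(repr(span))
# The raw span keeps surrounding whitespace and punctuation; strip them for display
print(span.strip(" ,"))
```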

In [None]:
question = "¿Cómo se llama la hermana menor de Beyoncé?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': ' Solange,',
  'context': 'Beyoncé Giselle Knowles nació en Houston, Texas, hija de '
             'Celestine Ann "Tina" Knowles, una peluquera y dueña de salón, y '
             'Mathew Knowles, un gerente de ventas de Xerox. El nombre de '
             'Beyoncé es un homenaje al apellido de soltera de su madre. La '
             'hermana menor de Beyoncé, Solange, también es cantante y ex '
             "miembro de Destiny 's Child. Mathew es afroamericano, mientras "
             'que Tina es de ascendencia criolla de Luisiana (con ascendencia '
             'africana, nativa americana, francesa, cajún, y distante '
             'irlandesa y española). A través de su madre, Beyoncé es '
             'descendiente del líder acadiano Joseph Broussard. Fue criada en '
             'un hogar metodista.',
  'end': 277,
  'score': 0.9883753061294556,
  'start': 268}]


Let's run another question, this time with the top 3 context passages from the retriever.

In [None]:
question = "¿Quién investiga a Bond?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': ' Sr. White,',
  'context': 'Bond desobedece la orden de M y viaja a Roma para asistir al '
             'funeral de Sciarra. Esa noche visita a la viuda de Sciarra, '
             'Lucía, quien le habla de Espectro, una organización criminal a '
             'la que pertenecía su marido. Bond se infiltra en una reunión del '
             'Espectro, donde identifica al líder, Franz Oberhauser. Cuando '
             'Oberhauser se dirige a Bond por su nombre, se escapa y es '
             'perseguido por el Sr. Hinx, un asesino del Espectro. Moneypenny '
             'informa a Bond que la información que recogió lleva al Sr. '
             'White, ex miembro de Quantum, una subsidiaria de Spectre. Bond '
             'le pide que investigue a Oberhauser, quien fue dado por muerto '
             'años antes.',
  'end': 498,
  'score': 0.3282511830329895,
  'start': 487},
 {'answer': ' Nine Eyes',
  'context': 'Bond y Swann regresan a Londres donde conocen a M, Bill Tanner, '

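The reader returns a plausible span, although the top score (about 0.33) is much lower than in the earlier examples, which signals lower confidence in this answer.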
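
Reader scores vary widely between questions (about 0.99 for the Solange question versus about 0.33 here), so one option is to accept an answer only above a cutoff. The sketch below uses the scores from the outputs above; the helper name and the 0.5 threshold are arbitrary assumptions and would need tuning on real data.

```python
def best_answer(results, min_score=0.5):
    """Return the highest-scoring answer dict, or None if it falls below min_score."""
    if not results:
        return None
    top = max(results, key=lambda r: r["score"])
    return top if top["score"] >= min_score else None

# Answers and scores copied from the reader outputs above
confident = [{"answer": " Solange,", "score": 0.9883753061294556}]
uncertain = [{"answer": " Sr. White,", "score": 0.3282511830329895}]

print(best_answer(confident))   # kept: score is above the cutoff
print(best_answer(uncertain))   # None: score is below the cutoff
```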