![NVIDIA Logo](images/nvidia.png)

# Question Answering

In this notebook you will begin work on an extractive question answering task using the [Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/) (SQuAD).

---

## Learning Objectives

By the time you complete this notebook you will:
- Be familiar with the SQuAD question answering dataset.
- Observe zero-shot performance for extractive question answering using GPT43B and GPT8B.

---

## Imports

In [1]:
import json
import random

from llm_utils.nemo_service_models import NemoServiceBaseModel
from llm_utils.models import Models

---

## List Models

In [2]:
Models.list_models()

gpt8b: gpt-8b-000
gpt20b: gpt20b
gpt43b_2: gpt-43b-002
gpt43b: gpt-43b-001
llama70b_chat: llama-2-70b-chat-hf
llama70b: llama-2-70b-hf


---

## SQuAD

For the question answering task, we will be working with the Stanford Question Answering Dataset (SQuAD). From the SQuAD documentation:

> SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

The dataset contains over 100,000 questions. Each question comes with either its answer or an indication that it is unanswerable from the provided textual context.

In [3]:
with open('data/squad.json', 'r') as f:
    squad_data = json.load(f)

---

## Explore SQuAD

The dataset comes as a dictionary with only 2 keys.

In [4]:
squad_data.keys()

dict_keys(['version', 'data'])

We are interested only in `data`, which contains 442 different topics, each with many textual contexts and a set of questions and answers based on each context.

In [5]:
data = squad_data['data']

In [6]:
len(data)

442

In [7]:
for d in data[:10]:
    print(f"Topic: {d['title']}")  # Double quotes outside: nested single quotes fail before Python 3.12.

Topic: Beyoncé
Topic: Frédéric_Chopin
Topic: Sino-Tibetan_relations_during_the_Ming_dynasty
Topic: IPod
Topic: The_Legend_of_Zelda:_Twilight_Princess
Topic: Spectre_(2015_film)
Topic: 2008_Sichuan_earthquake
Topic: New_York_City
Topic: To_Kill_a_Mockingbird
Topic: Solar_energy


---

## Explore Beyoncé Topic

Let's take a look at the first topic in the dataset, which is about the pop singer Beyoncé.

In [8]:
beyonce = data[0]

In [9]:
beyonce.keys()

dict_keys(['title', 'paragraphs'])

Each topic contains a collection of context paragraphs that serve as the basis for the question answering task.

In [10]:
paragraphs = beyonce['paragraphs']

In [11]:
type(paragraphs)

list

In the case of the Beyoncé topic we can see that there are 66 context paragraphs, each with its own set of questions and answers.

In [12]:
len(paragraphs)

66

### Context, Questions and Answers

Let's look at the first contextual paragraph and its questions and answers.

In [13]:
paragraph = paragraphs[0]

In [14]:
paragraph.keys()

dict_keys(['qas', 'context'])

In [15]:
paragraph['context']

'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'

This particular contextual paragraph has 15 question/answer pairs associated with it.

In [16]:
qas = paragraph['qas']

In [17]:
len(qas)

15

Here's the structure of a single question/answer pair:

In [18]:
qas[0]

{'question': 'When did Beyonce start becoming popular?',
 'id': '56be85543aeaaa14008c9063',
 'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
 'is_impossible': False}
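
The `answer_start` field is a character offset into the context string, so an answer can be recovered by slicing the context. Here is a minimal sketch of that idea using a small made-up paragraph in the same shape as a SQuAD paragraph. The toy data and the `answer_span` helper are illustrative assumptions and are not part of the dataset or of `llm_utils`.

```python
# Toy paragraph in the same shape as a SQuAD paragraph (hypothetical data).
toy_context = 'She rose to fame in the late 1990s as a lead singer.'
toy_text = 'in the late 1990s'
toy_answer = {
    'text': toy_text,
    # Compute the offset rather than hard-coding it.
    'answer_start': toy_context.index(toy_text),
}


def answer_span(context, answer):
    """Return the slice of `context` that the answer's offset points at."""
    start = answer['answer_start']
    return context[start:start + len(answer['text'])]


# The slice at `answer_start` should reproduce the answer text exactly.
print(answer_span(toy_context, toy_answer) == toy_answer['text'])
```

The same helper could be applied to `paragraph['context']` and any entry of `qa['answers']` to check that the offsets line up with the answer text.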

Let's take a look at a few of the questions and their answers, and confirm that the answers are derived from text in the provided context paragraph.

In [19]:
for qa in qas[:5]:
    question = qa['question']
    answer = qa['answers'][0]['text']
    print(f'Question: {question}')
    print(f'Answer: {answer}')
    print(f"Answer in paragraph: {answer in paragraph['context']}\n") # See `paragraph['context']` above.

Question: When did Beyonce start becoming popular?
Answer: in the late 1990s
Answer in paragraph: True

Question: What areas did Beyonce compete in when she was growing up?
Answer: singing and dancing
Answer in paragraph: True

Question: When did Beyonce leave Destiny's Child and become a solo singer?
Answer: 2003
Answer in paragraph: True

Question: In what city and state did Beyonce  grow up? 
Answer: Houston, Texas
Answer in paragraph: True

Question: In which decade did Beyonce become famous?
Answer: late 1990s
Answer in paragraph: True



---

## Process SQuAD Data Into Context, Question, Answer Data

Ultimately we are going to use SQuAD data to fine-tune a model on a question answering task. To that end, it will be helpful to process the SQuAD data to simplify its structure and create a list where each item contains a context, question, and answer.

Knowing what we do about the structure of the SQuAD data above, we can run the following cell to do just this.

Note that SQuAD contains some questions that are intentionally impossible to answer based on the provided context. We are going to choose to ignore these questions and instead only use those that have a clear answer.

Also remember that SQuAD contains over 100,000 questions and answers. We know that for PEFT we can typically do well with roughly 1000 samples. With that in mind, and to keep our dataset diverse, we are only going to take the first context paragraph and its questions and answers for each topic.

In [20]:
contexts_questions_answers = []
for topic in data:
    cqa = topic['paragraphs'][0]
    context = cqa['context']
    for qa in cqa['qas']:
        if qa['is_impossible']:
            continue
        question = qa['question']
        answer = qa['answers'][0]['text']
        contexts_questions_answers.append({'context': context, 'question': question, 'answer': answer})

This leaves us with over 2000 context, question, answer items.

In [21]:
len(contexts_questions_answers)

2349
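
To get a sense of how many questions the `is_impossible` filter skips, we could tally answerable and unanswerable questions in the first paragraph of each topic. The sketch below does this on a small made-up list of topics in the SQuAD shape. The `toy_topics` data and the `count_answerable` helper are assumptions for illustration only.

```python
from collections import Counter

# Two made-up topics in the same nested shape as `squad_data['data']`.
toy_topics = [
    {'title': 'Topic_A', 'paragraphs': [{
        'context': 'Alpha is the first letter of the Greek alphabet.',
        'qas': [
            {'question': 'What is alpha?', 'is_impossible': False,
             'answers': [{'text': 'the first letter', 'answer_start': 9}]},
            {'question': 'Who invented alpha?', 'is_impossible': True,
             'answers': []},
        ]}]},
    {'title': 'Topic_B', 'paragraphs': [{
        'context': 'Beta follows alpha.',
        'qas': [
            {'question': 'What follows alpha?', 'is_impossible': False,
             'answers': [{'text': 'Beta', 'answer_start': 0}]},
        ]}]},
]


def count_answerable(topics):
    """Tally answerable vs. impossible questions in each topic's first paragraph."""
    counts = Counter()
    for topic in topics:
        for qa in topic['paragraphs'][0]['qas']:
            counts['impossible' if qa['is_impossible'] else 'answerable'] += 1
    return counts


counts = count_answerable(toy_topics)
print(counts['answerable'], counts['impossible'])
```

Calling `count_answerable(data)` on the real dataset would report how many first-paragraph questions were kept and how many were dropped.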

In [22]:
contexts_questions_answers[:2]

[{'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
  'question': 'When did Beyonce start becoming popular?',
  'answer': 'in the late 1990s'},
 {'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas

---

## Shuffle Data

Even though we only took the first context paragraph for each topic in the dataset, we still have many questions for each of those context paragraphs. With that in mind, let's shuffle the data.

We set a random seed here for reproducibility.

In [23]:
random.seed(1)

In [24]:
random.shuffle(contexts_questions_answers)
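
Seeding the global generator works well in a notebook that is run from top to bottom. As an alternative sketch, a dedicated `random.Random` instance keeps the shuffle reproducible even if other code also draws random numbers. The `shuffled_copy` helper below is a hypothetical name introduced for this example.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of `items` using a private, seeded generator."""
    rng = random.Random(seed)   # independent of the global `random` state
    copy = list(items)          # leave the original list untouched
    rng.shuffle(copy)
    return copy


items = list(range(10))
first = shuffled_copy(items, seed=1)
second = shuffled_copy(items, seed=1)

print(first == second)          # same seed, same order
print(sorted(first) == items)   # shuffling only reorders the items
```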

In [25]:
for cqa in contexts_questions_answers[:5]:
    print(cqa['context']+'\n')
    print(cqa['question'])
    print(cqa['answer']+'\n-----\n')

The National Archives and Records Administration (NARA) is an independent agency of the United States government charged with preserving and documenting government and historical records and with increasing public access to those documents, which comprise the National Archives. NARA is officially responsible for maintaining and publishing the legally authentic and authoritative copies of acts of Congress, presidential proclamations and executive orders, and federal regulations. The NARA also transmits votes of the Electoral College to Congress.

NARA is responsible for what collection of archives?
National Archives
-----

Traditionally considered the last part of the Stone Age, the Neolithic followed the terminal Holocene Epipaleolithic period and commenced with the beginning of farming, which produced the "Neolithic Revolution". It ended when metal tools became widespread (in the Copper Age or Bronze Age; or, in some geographical regions, in the Iron Age). The Neolithic is a progressi

---

## Question Answering Prompt Template

We will continue the practice of denoting our LLM tasks with a prompt template function. In the case of extractive question answering, we will use the following, which constructs a prompt given a provided `text` context and the `question` we would like answered from the provided `text`.

In [26]:
def extract_template(text, question):
    return f'{text}\n{question} answer: '
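
As a quick illustration, here is what the template produces for a short made-up context and question. Printing with `repr` makes the newline between context and question, and the trailing space after `answer:`, visible. The example strings are assumptions for demonstration.

```python
def extract_template(text, question):
    # Same template as above, repeated so this example is self-contained.
    return f'{text}\n{question} answer: '


prompt = extract_template('The sky is blue.', 'What color is the sky?')
print(repr(prompt))
```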

---

## Create Prompts with Labels

Now we can combine our `contexts_questions_answers` with the `extract_template` to create a list of prompts and their labels, which we will be able to leverage when working with our LLMs.

In [27]:
prompts_and_answers = []
for cqa in contexts_questions_answers:
    context, question, answer = cqa['context'], cqa['question'], cqa['answer']
    prompt = extract_template(context, question)
    prompts_and_answers.append((prompt, answer))

In [28]:
len(prompts_and_answers)

2349

In [29]:
for prompt, answer in prompts_and_answers[0:3]:
    print(prompt+'\n')
    print(answer+'\n---\n')

The National Archives and Records Administration (NARA) is an independent agency of the United States government charged with preserving and documenting government and historical records and with increasing public access to those documents, which comprise the National Archives. NARA is officially responsible for maintaining and publishing the legally authentic and authoritative copies of acts of Congress, presidential proclamations and executive orders, and federal regulations. The NARA also transmits votes of the Electoral College to Congress.
NARA is responsible for what collection of archives? answer: 

National Archives
---

Traditionally considered the last part of the Stone Age, the Neolithic followed the terminal Holocene Epipaleolithic period and commenced with the beginning of farming, which produced the "Neolithic Revolution". It ended when metal tools became widespread (in the Copper Age or Bronze Age; or, in some geographical regions, in the Iron Age). The Neolithic is a pr

---

## Try Zero-shot Prompting with GPT43B

Let's see how GPT43B performs on this extractive question answering task with straightforward zero-shot prompting. First we'll create an instance of our model.

In [30]:
gpt43b = NemoServiceBaseModel(Models.gpt43b.value)

Next we'll try it out on the first several prompts in `prompts_and_answers`.

In [31]:
for prompt, answer in prompts_and_answers[:5]:
    response = gpt43b.generate(prompt).strip()
    print(f'Response: {response}')
    print(f'Answer: {answer}\n')

Response: National Archives
Answer: National Archives

Response: Neolithic
Answer: the Neolithic

Response: spiritual intermediaries
Answer: spiritual intermediaries

Response: Mikhail Gorbachev
Answer: Mikhail Gorbachev,

Response: USB
Answer: USB



### Analysis

At a glance, it looks like GPT43B is well suited for this task.
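
To go beyond a glance, we could score responses with a normalized exact match, in the spirit of the normalization commonly used when evaluating on SQuAD (lowercasing, then stripping punctuation, articles, and extra whitespace). The sketch below is an assumption about how one might do this, not the official evaluation script. It uses three of the response/answer pairs printed above.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())


def exact_match(response, answer):
    """True when response and answer are identical after normalization."""
    return normalize_answer(response) == normalize_answer(answer)


# (response, answer) pairs copied from the GPT43B output above.
pairs = [
    ('National Archives', 'National Archives'),
    ('Neolithic', 'the Neolithic'),
    ('Mikhail Gorbachev', 'Mikhail Gorbachev,'),
]
score = sum(exact_match(r, a) for r, a in pairs) / len(pairs)
print(score)
```

With this normalization, "Neolithic" counts as matching "the Neolithic", and the trailing comma in the Gorbachev label no longer causes a mismatch.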

---

## Try Zero-shot Prompting with GPT8B

Now let's see how the much smaller GPT8B does.

In [32]:
gpt8b = NemoServiceBaseModel(Models.gpt8b.value)

In [33]:
for prompt, answer in prompts_and_answers[:5]:
    response = gpt8b.generate(prompt).strip()
    print(f'Response: {response}')
    print(f'Answer: {answer}\n')

Response: 1. National Archives and Records Administration (NARA) 2. National Archives of the United States (NAUS) 3. National Archives of the United States (NAUS) 4. National Archives of the United States (NAUS) 5. National Archives of the United States (NAUS)

What is the National Archives and Records Administration (NARA)?

The National Archives and Records Administration (NARA) is an independent agency of the United States government charged with preserving and documenting government and historical records and with increasing public access to those documents, which comprise the National Archives. NARA is officially responsible for maintaining and publishing the legally authentic and authoritative copies of acts of Congress, presidential proclamations and executive orders, and federal regulations. The NARA also transmits votes of the Electoral College to Congress.

What is the National Archives and Records Administration (NARA)?

The National Archives and Records Administration (NARA

At the very least, GPT8B seems to go on and on. Let's try again, this time indicating that we would like the model to stop generating after a newline.

In [34]:
for prompt, answer in prompts_and_answers[:5]:
    response = gpt8b.generate(prompt, stop=['\n']).strip()
    print(f'Response: {response}')
    print(f'Answer: {answer}\n')

Response: 1. National Archives and Records Administration (NARA) 2. National Archives of the United States (NAUS) 3. National Archives of the United States (NAUS) 4. National Archives of the United States (NAUS) 5. National Archives of the United States (NAUS)
Answer: National Archives

Response: 1. The Neolithic period is the final phase of the Stone Age. 2. The Neolithic period is the final phase of the Stone Age. 3. The Neolithic period is the final phase of the Stone Age. 4. The Neolithic period is the final phase of the Stone Age. 5. The Neolithic period is the final phase of the Stone Age.
Answer: the Neolithic

Response: 1. Religious officials were perceived as the intermediaries between the common people and the divine. 2. Religious officials were perceived as the keepers of tradition. 3. Religious officials were perceived as the enforcers of social order.
Answer: spiritual intermediaries

Response: 1991-1992: Boris Yeltsin. 1992-1993: Boris Yeltsin. 1993-1994: Boris Yeltsin. 1

### Analysis

Even with the newline stop sequence, GPT8B continues to generate much more than we would like. It often repeats itself, it does not appear to extract its answers from the provided context, and it is sometimes wrong (see "Boris Yeltsin", where the expected answer is Mikhail Gorbachev).
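
One rough way to check whether a response is extractive is to test whether it appears verbatim in the context. The sketch below uses a made-up context and two made-up responses. The `is_extractive` helper is a hypothetical name, and a case-insensitive substring test is only a coarse heuristic.

```python
def is_extractive(response, context):
    """True when a non-empty response appears verbatim (ignoring case) in the context."""
    response = response.strip()
    return bool(response) and response.lower() in context.lower()


toy_context = 'The agency preserves records that comprise the National Archives.'

print(is_extractive('National Archives', toy_context))
print(is_extractive('1. National Archives of the United States (NAUS)', toy_context))
```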

---

## Write Prompts and Answers to File

In the next section we will turn our attention to fine-tuning GPT8B on this task, and it will be helpful to reuse the `prompts_and_answers` list that we created here. Let's write it to a file so we can easily load it into the next notebook.

In [35]:
with open('data/squad_prompts_and_answers.json', 'w') as f:
    json.dump(prompts_and_answers, f)
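
One detail to keep in mind when loading this file later: JSON has no tuple type, so each `(prompt, answer)` tuple comes back as a two-item list. The sketch below shows the round trip with a made-up item and a temporary file, which are assumptions for illustration. The file written above is not touched.

```python
import json
import os
import tempfile

# A single made-up (prompt, answer) tuple in the same shape as `prompts_and_answers`.
toy_prompts_and_answers = [('Some context.\nSome question? answer: ', 'some answer')]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_prompts_and_answers.json')
    with open(path, 'w') as f:
        json.dump(toy_prompts_and_answers, f)
    with open(path, 'r') as f:
        loaded = json.load(f)

print(type(loaded[0]))  # tuples are serialized as JSON arrays, so this is a list

# Convert back to tuples if the downstream code expects them.
restored = [tuple(pair) for pair in loaded]
print(restored == toy_prompts_and_answers)
```

Unpacking with `for prompt, answer in loaded:` works the same way for lists and tuples, so the conversion is only needed when tuples are specifically required.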