# Chapter 7: Question Answering

## Building a Review-Based QA System

### The Dataset

> SubjQA is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review and a span is highlighted as the answer to the question (with some questions having no answer). Moreover, both questions and answer spans are assigned a subjectivity label by annotators. Questions such as "How much does this product weigh?" is a factual question (i.e., low subjectivity), while "Is this easy to use?" is a subjective question (i.e., high subjectivity).
>
> In short, SubjQA provides a setting to study how well extractive QA systems perform on finding answer that are less factual and to what extent modeling subjectivity can improve the performance of QA systems.

Let's download the `subjqa` dataset and poke around a bit.

See the [dataset card for `subjqa` on the Hugging Face Hub](https://huggingface.co/datasets/subjqa).

In [None]:
from datasets import get_dataset_config_names

domains = get_dataset_config_names("subjqa")
domains

In [None]:
from datasets import load_dataset

subjqa = load_dataset("subjqa", name="electronics")
subjqa.shape

In [None]:
import json

# Pretty-print one training example to see its nested structure.
print(json.dumps(subjqa["train"][1], indent=2))

Notice that `answers` is a nested field with children such as `text`, `answer_start`, and `answer_subj_level`.

To expand the children of `answers` into their own columns, use [`Dataset.flatten`](https://huggingface.co/docs/datasets/process#flatten):

In [None]:
subjqa["validation"].flatten().features.to_dict()
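
To make the effect of flattening concrete, here is a plain-Python sketch of the same idea, independent of the `datasets` library: nested keys become dotted column names. The `flatten_record` helper and the sample record are hypothetical and only loosely shaped like a SubjQA example.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the dotted prefix along.
            flat.update(flatten_record(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical record, for illustration only.
record = {
    "id": "abc123",
    "question": "How is the bass?",
    "answers": {"text": ["the bass is weak"], "answer_start": [10]},
}

flat = flatten_record(record)
print(list(flat))
# ['id', 'question', 'answers.text', 'answers.answer_start']
```

This is why the DataFrame columns used later have names like `answers.text` and `answers.answer_start`.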

In [None]:
import pandas as pd

# Flatten the nested columns, then convert each split to a DataFrame.
dfs = {split: dset.to_pandas() for split, dset in subjqa.flatten().items()}

for split, df in dfs.items():
    print(f"Number of questions in {split}: {df['id'].nunique()}")

Note that the `subjqa` dataset is quite small. That is in keeping with real-world scenarios, since labeled data is hard to find and expensive to create.

Now that each split is a `pandas.DataFrame`, we can use methods like [`sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) to take a closer look.

In [None]:
qa_cols = [
    "title",
    "question",
    "answers.text",
    "answers.answer_start",
    "context"
]

sample_df = dfs["train"][qa_cols].sample(2, random_state=7)
sample_df

In [None]:
start_idx = sample_df["answers.answer_start"].iloc[0][0]
end_idx = start_idx + len(sample_df["answers.text"].iloc[0][0])
sample_df["context"].iloc[0][start_idx:end_idx]
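
The slicing in the cell above relies on `answer_start` being a character offset into `context`, so the end of the span is the start plus the length of the answer text. A minimal sketch of that arithmetic on a made-up example (the context, the answer, and the `extract_span` helper are all hypothetical):

```python
# Hypothetical context and answer, for illustration only.
context = "I love these headphones. The bass is weak, but the mids are clear."
answer_text = "The bass is weak"
answer_start = context.index(answer_text)  # character offset of the answer


def extract_span(context, answer_start, answer_text):
    """Recover the answer by slicing the context with character offsets."""
    end = answer_start + len(answer_text)
    return context[answer_start:end]


span = extract_span(context, answer_start, answer_text)
print(span)  # The bass is weak
```

If the recovered span does not match the stored answer text, the offsets and the context are out of sync, which is worth checking before training.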

In [None]:
import matplotlib.pyplot as plt

counts = {}

question_types = [
    "What",
    "How",
    "Is",
    "Does",
    "Do",
    "Was",
    "Where",
    "Why"
]

for q in question_types:
    # Summing the boolean mask avoids a KeyError when no question starts with q.
    counts[q] = dfs["train"]["question"].str.startswith(q).sum()

pd.Series(counts).sort_values().plot.barh()
plt.title("Frequency of Question Types")
plt.show()
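
One caveat with prefix matching is that `"Do"` is also a prefix of `"Does"`, so questions starting with "Does" are counted under both types. The sketch below contrasts prefix matching with matching on the first word, using a handful of made-up questions (not taken from SubjQA):

```python
from collections import Counter

# Hypothetical questions, for illustration only.
questions = [
    "Does the keyboard feel sturdy?",
    "Do the earbuds fit well?",
    "How is the battery life?",
    "Is the screen bright?",
    "How does it sound?",
]

question_types = ["What", "How", "Is", "Does", "Do", "Was", "Where", "Why"]

# Prefix matching: "Do" also matches questions that start with "Does".
prefix_counts = {
    q: sum(text.startswith(q) for text in questions) for q in question_types
}

# First-word matching: each question is counted under at most one type.
first_words = Counter(text.split()[0] for text in questions)
word_counts = {q: first_words.get(q, 0) for q in question_types}

print(prefix_counts["Do"], word_counts["Do"])  # 2 1
```

Whether this matters depends on how precisely the chart needs to separate question types; for a rough overview, prefix matching is usually good enough.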

In [None]:
for question_type in ["How", "What", "Is"]:
    for question in (
        dfs["train"][dfs["train"].question.str.startswith(question_type)]
        .sample(n=3, random_state=42)["question"]
    ):
        print(question)

## Extracting Answers from Text

On SQuAD2.0:

> Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuAD 2.0, the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD 2.0 is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD 1.1 achieves only 66% F1 on SQuAD 2.0. 

Suggested reading:
* [SQuAD2.0 - The Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/)
* *Know What You Don't Know: Unanswerable Questions for SQuAD* by Rajpurkar, Jia, and Liang, 2018
* [Question Answering on SQuAD2.0](https://paperswithcode.com/sota/question-answering-on-squad20) on paperswithcode.com

### Tokenizing Text for QA

* The training dataset is small, with only 1,295 examples.
* Since the structure of the labels for QA (predicting the start and end of an answer span) is the same across datasets, starting from a model already fine-tuned on a large-scale QA dataset is the sensible approach.

We will use [`deepset/minilm-uncased-squad2`](https://huggingface.co/deepset/minilm-uncased-squad2#minilm-l12-h384-uncased-for-qa).

In [None]:
from transformers import AutoTokenizer

model_ckpt = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
question = "How much music can this hold?"
context = """An MP3 is about 1 MB/minute, so about 6000 hours depending on file size."""

inputs = tokenizer(
    question,
    context,
    return_tensors="pt"
)

`inputs` has `input_ids` and `attention_mask` as expected, but notice how `token_type_ids` is `0` for question tokens and `1` for context tokens.

In [None]:
pd.DataFrame(
    dict(
        (k, v.tolist()[0]) 
        for k,v in inputs.items()
    )
).T
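
To see where the `0`/`1` pattern comes from, the sketch below builds the `[CLS] question [SEP] context [SEP]` layout by hand. It assumes the usual BERT-style convention, where `[CLS]` and the first `[SEP]` belong to segment `0` and the final `[SEP]` belongs to segment `1`. Whitespace splitting stands in for the real WordPiece tokenizer, and `build_pair_inputs` is a hypothetical helper, not part of `transformers`.

```python
def build_pair_inputs(question_tokens, context_tokens):
    """Lay out a question/context pair the way BERT-style models expect."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    n_first = len(question_tokens) + 2  # [CLS] + question + first [SEP]
    n_second = len(context_tokens) + 1  # context + final [SEP]
    token_type_ids = [0] * n_first + [1] * n_second
    attention_mask = [1] * len(tokens)  # no padding, so attend to everything
    return tokens, token_type_ids, attention_mask


# Whitespace splitting is only a stand-in for the real tokenizer.
question_tokens = "how much music can this hold?".split()
context_tokens = "an mp3 is about 1 mb/minute".split()

tokens, token_type_ids, attention_mask = build_pair_inputs(
    question_tokens, context_tokens
)
for token, type_id in zip(tokens, token_type_ids):
    print(f"{type_id}  {token}")
```

The real tokenizer splits words into more pieces, so the sequence lengths will differ, but the segment layout should follow the same pattern.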

In [None]:
print(tokenizer.decode(inputs["input_ids"][0]))

In [None]:
import torch

from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)

with torch.no_grad():
    outputs = model(**inputs)
print(outputs)

In [None]:
start_logits = outputs.start_logits
end_logits = outputs.end_logits

In [None]:
print(f"Input IDs shape: {inputs.input_ids.size()}")
print(f"Start logits shape: {start_logits.size()}")
print(f"End logits shape: {end_logits.size()}")

In [None]:
start_idx = torch.argmax(start_logits)
end_idx = torch.argmax(end_logits) + 1

answer_span = inputs["input_ids"][0][start_idx:end_idx]
answer = tokenizer.decode(answer_span)

print(f"Question: {question}")
print(f"Answer: {answer}")
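
Taking the two `argmax` values independently works for this example, but nothing forces the end index to come after the start index. A common remedy, sketched below with toy logits in plain Python, is to search over all valid `(start, end)` pairs and pick the one with the highest summed logit. The `best_span` helper, the `max_answer_len` default, and the numbers are hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) for the best span with start <= end."""
    best = (None, None, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy example: independent argmax would give start=3 and end=1.
tokens = ["[CLS]", "how", "[SEP]", "6000", "hours", "[SEP]"]
start_logits = [0.1, 0.2, 0.3, 4.0, 0.5, 0.1]
end_logits = [0.1, 5.0, 0.2, 0.3, 3.5, 0.4]

start, end, score = best_span(start_logits, end_logits)
print(" ".join(tokens[start : end + 1]))  # 6000 hours
```

Note that `end` here is inclusive, which is why the slice uses `end + 1`, mirroring the `+ 1` in the cell above.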

The same steps, now wrapped in a Hugging Face `pipeline`:

In [None]:
from transformers import pipeline

pipe = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer
)
# Recent versions of transformers spell this argument top_k (topk is deprecated).
pipe(
    question=question,
    context=context,
    top_k=5
)
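
Each candidate returned by the pipeline comes with a `score` between 0 and 1. One common way to produce such scores, sketched below, is to turn the start and end logits into probabilities with a softmax and score each valid span by the product of the two. This is an assumption about the general approach, not a description of what `pipeline` does internally, and `softmax`, `top_k_spans`, and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def top_k_spans(start_logits, end_logits, k=3, max_answer_len=15):
    """Rank valid spans by P(start) * P(end) and keep the k best."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    candidates = [
        {"start": s, "end": e, "score": p_start[s] * p_end[e]}
        for s in range(len(p_start))
        for e in range(s, min(s + max_answer_len, len(p_end)))
    ]
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]


# Toy logits, for illustration only.
start_logits = [0.1, 0.2, 0.3, 4.0, 0.5, 0.1]
end_logits = [0.1, 5.0, 0.2, 0.3, 3.5, 0.4]

for candidate in top_k_spans(start_logits, end_logits, k=3):
    print(candidate)
```

Because the scores are products of probabilities, they are comparable across candidates for the same input but should not be read as calibrated confidences.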

For a question that cannot be answered from the context, this model assigns high start and end scores to the `[CLS]` token, and the output is mapped to the empty string.

In [None]:
pipe(
    question="How many roads must a man walk?",
    context=context,
    handle_impossible_answer=True
)
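
The abstain decision can be sketched as a comparison between the "null" score at the `[CLS]` position (index 0) and the best score of any real span. The sketch below follows the thresholding idea commonly used for SQuAD 2.0-style models; the `answer_or_abstain` helper, the `null_threshold` default, and the toy logits are hypothetical, and the pipeline's own logic may differ in its details.

```python
def answer_or_abstain(start_logits, end_logits, null_threshold=0.0, max_answer_len=15):
    """Return (start, end) for the best span, or None to abstain."""
    # Score of predicting "no answer": both start and end point at [CLS].
    null_score = start_logits[0] + end_logits[0]

    best_start, best_end, best_score = None, None, float("-inf")
    for s in range(1, len(start_logits)):  # skip the [CLS] position
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_start, best_end, best_score = s, e, score

    # Abstain when the null score beats the best span by more than the threshold.
    if null_score - best_score > null_threshold:
        return None
    return best_start, best_end


# Toy logits: the first pair favours [CLS], the second favours a real span.
unanswerable = answer_or_abstain([6.0, 0.1, 0.2, 0.3], [5.5, 0.2, 0.1, 0.4])
answerable = answer_or_abstain([0.5, 0.1, 4.0, 0.2], [0.3, 0.2, 0.1, 3.0])

print(unanswerable)  # None
print(answerable)    # (2, 3)
```

Raising `null_threshold` makes the system answer more often, while lowering it makes the system abstain more readily; the right value is usually tuned on a validation set.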