# Workshop Week 10: Question Answering

#### Please follow the instructions in this notebook and from the workshop instructor.

Read the first part and try to understand the code. Then run it, look at the outputs, then complete the tasks.

Types of QA systems:

    Extractive QA systems: These systems extract the answer directly from the given text by identifying the relevant section of text that contains the answer.

    Abstractive QA systems: These systems generate a new answer by understanding the meaning of the question and synthesizing information from various sources.

Classical (before deep neural learning) QA systems:

    Information Retrieval based QA systems: These systems use information retrieval techniques to search for relevant documents and retrieve the most relevant answers.

    Knowledge Graph based QA systems: These systems represent information in a structured format and use graph-based algorithms to answer questions.

    Watson QA system: This system, developed by IBM, uses a combination of natural language processing, machine learning, and information retrieval techniques to answer questions in a wide range of domains.
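To make the information-retrieval idea concrete, here is a minimal sketch of the retrieval step only: it ranks candidate passages for a question with a simple TF-IDF overlap score. The function names (`tokenize`, `rank_passages`) and the smoothed IDF formula are illustrative assumptions, not part of any specific classical system.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_passages(question, passages):
    """Return passage indices sorted from best to worst TF-IDF overlap with the question."""
    tokenized = [tokenize(p) for p in passages]
    n_docs = len(passages)
    # Document frequency: the number of passages in which each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    question_terms = set(tokenize(question))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in question_terms:
            if term in term_freq:
                # Smoothed IDF (an assumption): rare terms weigh more than common ones
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += (term_freq[term] / max(len(tokens), 1)) * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

passages = [
    "Russia is the largest country in the world by area.",
    "Paris is the capital and most populous city of France.",
    "George Washington was the first president of the United States.",
]
ranking = rank_passages("What is the capital of France?", passages)
print(passages[ranking[0]])
```

A full IR-based QA system would follow this ranking step with answer extraction from the top-ranked passages.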

Evaluation of QA and Stanford Question Answering Dataset (SQuAD):

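SQuAD is a popular reading-comprehension dataset used for evaluating QA systems. Version 1.1 contains more than 100,000 question-answer pairs written by crowdworkers on Wikipedia articles, and every answer is a span of text from the corresponding passage. Version 2.0 adds more than 50,000 unanswerable questions. Systems are usually scored with Exact Match (EM) and token-level F1 against the reference answers.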
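As an illustration of how SQuAD-style scoring treats answers, the sketch below computes Exact Match after normalizing both strings (lowercasing, removing punctuation and English articles, collapsing whitespace). It follows the general approach of SQuAD-style evaluation but is a simplified re-implementation; the function names are assumptions.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, strip punctuation, drop English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    """Return 1 if the normalized prediction equals any normalized reference, else 0."""
    normalized_prediction = normalize_answer(prediction)
    return int(any(normalized_prediction == normalize_answer(gt) for gt in ground_truths))

# Trailing punctuation is ignored, so this counts as a match
print(exact_match("Harper Lee.", ["Harper Lee"]))
# The extra space in "D. C" survives normalization, so this does not match
print(exact_match("Washington, D. C", ["Washington, D.C."]))
```

Exact Match is strict: a decoding artifact such as an extra space can turn a correct answer into a miss, which is one reason it is usually reported together with a token-overlap score.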

Language models for QA systems:

    BiDAF (Bidirectional Attention Flow): This model uses a bidirectional attention mechanism to encode the question and the passage and identify the most relevant words and phrases.

    Encoder-decoder transformers: These models use transformer networks to encode the input text and generate the output answer.

    SpanBERT: This model extends BERT (Bidirectional Encoder Representations from Transformers) by masking contiguous spans of tokens during pre-training and learning to predict each masked span from its boundary tokens. For extractive QA, it is fine-tuned like BERT to predict the start and end positions of the answer span in the passage.
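Extractive models of this kind typically output one start score and one end score per token, and the answer is the span with the best combined score. The sketch below searches all valid spans (start at or before end, limited length) over plain Python lists. It is a generic span-selection illustration, not SpanBERT's own procedure; `best_span` and `max_answer_len` are hypothetical names.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring span; `end` is inclusive."""
    best = (0, 0, float("-inf"))
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        last_end = min(start + max_answer_len, len(end_scores))
        for end in range(start, last_end):
            score = start_score + end_scores[end]
            if score > best[2]:
                best = (start, end, score)
    return best

# Toy scores: the best end score (index 0) lies BEFORE the best start score (index 2)
start_scores = [0.1, 0.2, 5.0, 0.3, 0.1]
end_scores = [4.0, 0.1, 0.2, 3.5, 0.3]
print(best_span(start_scores, end_scores))
```

Taking the two argmax positions independently would give start 2 and end 0 for these toy scores, which is not a valid span; the joint search returns the span from 2 to 3 instead.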

In [2]:
# %pip install transformers
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load a BERT model fine-tuned on SQuAD 2.0 and its tokenizer (this checkpoint is BERT, not BiDAF)
tokenizer = AutoTokenizer.from_pretrained('deepset/bert-base-cased-squad2')
model = AutoModelForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')

# Define a sample question and passage
question = "What is the capital of France?"
passage = "France, officially the French Republic, is a country primarily located in Western Europe, consisting of metropolitan France and several overseas regions and territories. Paris is the capital and most populous city of France."

# Encode the question and passage; 'only_second' truncates only the passage if the pair is too long
inputs = tokenizer(question, passage, return_tensors='pt', max_length=512, truncation='only_second')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention_mask = inputs['attention_mask']

# Pass the encoded input through the BERT QA model (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, return_dict=True)
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Take the most likely start and end token positions (the +1 makes the end exclusive for slicing)
start_index = torch.argmax(start_logits)
end_index = torch.argmax(end_logits) + 1

input_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer_tokens = input_ids[0][start_index:end_index]
answer = tokenizer.decode(answer_tokens)
print("Answer0:", answer)

# Shrink the span by one position for each [CLS] or [SEP] special token it contains
for i, token in enumerate(answer_tokens):
    if token == tokenizer.cls_token_id:
        start_index += 1
    elif token == tokenizer.sep_token_id:
        end_index -= 1
answer_tokens = input_ids[0][start_index:end_index]

# Decode the answer tokens to get the final answer
answer = tokenizer.decode(answer_tokens)
print("Answer:", answer)

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Answer0: Paris
Answer: Paris


## Task 1: Construct Q/A system

Task Description: In this task, you will be given a set of questions and a corresponding set of passages. Your goal is to use a QA model to find the answer to each question in its corresponding passage.

Please review the code to understand it, run the first part, and complete the rest to make a QA system.

Then follow the instructions from the workshop Instructor.


Example Questions and Passages:

Question 1: What is the capital of the United States?

Passage 1: The capital of the United States is Washington, D.C. It is located on the east coast of the country, and is home to many important government buildings and monuments.

Question 2: Who wrote the novel "To Kill a Mockingbird"?

Passage 2: "To Kill a Mockingbird" is a novel written by Harper Lee. It was published in 1960 and has since become a classic of American literature.

Question 3: What is the largest country in the world by area?

Passage 3: Russia is the largest country in the world by area. It covers more than 17 million square kilometers and spans 11 time zones.

Question 4: What is the capital of France?

Passage 4: Paris is the capital and most populous city of France. It is located in the north-central part of the country, and is known for its rich history, art, and culture.

Question 5: Who was the first president of the United States?

Passage 5: George Washington was the first president of the United States. He served from 1789 to 1797, and is widely regarded as one of the most important figures in American history.

In [3]:
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the QA model and tokenizer
model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Define a set of questions and passages
questions = [
    "What is the capital of the United States?",
    "Who wrote the novel \"To Kill a Mockingbird\"?",
    "What is the largest country in the world by area?",
    "What is the capital of France?",
    "Who was the first president of the United States?"
]
passages = [
    "The capital of the United States is Washington, D.C. It is located on the east coast of the country, and is home to many important government buildings and monuments.",
    "\"To Kill a Mockingbird\" is a novel written by Harper Lee. It was published in 1960 and has since become a classic of American literature.",
    "Russia is the largest country in the world by area. It covers more than 17 million square kilometers and spans 11 time zones.",
    "Paris is the capital and most populous city of France. It is located in the north-central part of the country, and is known for its rich history, art, and culture.",
    "George Washington was the first president of the United States. He served from 1789 to 1797, and is widely regarded as one of the most important figures in American history."
]

# Loop over each question and passage, and use the QA model to find the answer
for i, (question, passage) in enumerate(zip(questions, passages)):
    # Encode the question and passage; 'only_second' truncates only the passage if the pair is too long
    inputs = tokenizer(question, passage, return_tensors='pt', max_length=512, truncation='only_second')
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    # Pass the encoded input through the QA model
    outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits


    # Decode the predicted start and end positions to get the answer
    start_index = torch.argmax(start_logits)
    end_index = torch.argmax(end_logits) + 1

    # Slice the predicted answer span out of the input IDs and decode it to text
    answer_tokens = input_ids[0][start_index:end_index]
    answer = tokenizer.decode(answer_tokens)

    # Print the question, passage, and answer
    print("Question {}: {}".format(i+1, question))
    print("Passage {}: {}".format(i+1, passage))
    print("Answer {}: {}\n".format(i+1, answer))

Question 1: What is the capital of the United States?
Passage 1: The capital of the United States is Washington, D.C. It is located on the east coast of the country, and is home to many important government buildings and monuments.
Answer 1: Washington, D. C

Question 2: Who wrote the novel "To Kill a Mockingbird"?
Passage 2: "To Kill a Mockingbird" is a novel written by Harper Lee. It was published in 1960 and has since become a classic of American literature.
Answer 2: Harper Lee

Question 3: What is the largest country in the world by area?
Passage 3: Russia is the largest country in the world by area. It covers more than 17 million square kilometers and spans 11 time zones.
Answer 3: Russia

Question 4: What is the capital of France?
Passage 4: Paris is the capital and most populous city of France. It is located in the north-central part of the country, and is known for its rich history, art, and culture.
Answer 4: Paris

Question 5: Who was the first president of the United States

## Task 2: Use the QA code above

Apply the code to one of the Assignment 1 articles. Write a question and a ground-truth answer, predict an answer using the code, and evaluate the predicted answer against the ground truth using precision and recall.
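One possible way to compute precision and recall for a predicted answer is to compare it with the ground truth at the token level, as in the sketch below. The whitespace tokenization, the function name `precision_recall_f1`, and the example ground truth are assumptions for illustration; your instructor may expect a different definition.

```python
from collections import Counter

def precision_recall_f1(prediction, ground_truth):
    """Token-level precision, recall, and F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Count the tokens shared by both answers (respecting repeated tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)   # share of predicted tokens that are correct
    recall = overlap / len(truth_tokens)     # share of reference tokens that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical ground truth written by hand for the question about occupation
p, r, f = precision_recall_f1("doctor", "doctor and statistician")
print("Precision: {:.2f}, Recall: {:.2f}, F1: {:.2f}".format(p, r, f))
```

Here the prediction is fully contained in the ground truth, so precision is 1.0, but it covers only one of three reference tokens, so recall is about 0.33.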

In [4]:
article = "Hans Rosling, a Swedish doctor who transformed himself into a   statistician by converting dry numbers into dynamic graphics that challenged preconceptions about global health and gloomy prospects for population growth, died on Tuesday in Uppsala, Sweden. He was 68."
questions = ["What was Hans Rosling's occupation?",
             "Where did Hans Rosling die?",
             "When did Hans Rosling die?",
            "How old was Hans Rosling when he died?",
            "Is Hans Rosling still alive?"]
print("Article: {}".format(article))

# Loop over each question against the same article
for i, question in enumerate(questions):
    # Encode the question and article; the model accepts at most 512 tokens, so only the article is truncated
    inputs = tokenizer(question, article, return_tensors='pt', max_length=512, truncation='only_second')
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    # Pass the encoded input through the QA model
    outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Decode the predicted start and end positions to get the answer
    start_index = torch.argmax(start_logits)
    end_index = torch.argmax(end_logits) + 1

    # Slice the predicted answer span out of the input IDs and decode it to text
    answer_tokens = input_ids[0][start_index:end_index]
    answer = tokenizer.decode(answer_tokens)

    # Print the question and answer
    print("Question {}: {}".format(i+1, question))
    print("Answer {}: {}\n".format(i+1, answer))

Article: Hans Rosling, a Swedish doctor who transformed himself into a   statistician by converting dry numbers into dynamic graphics that challenged preconceptions about global health and gloomy prospects for population growth, died on Tuesday in Uppsala, Sweden. He was 68.


Question 1: What was Hans Rosling's occupation?
Answer 1: doctor

Question 2: Where did Hans Rosling die?
Answer 2: Uppsala, Sweden

Question 3: When did Hans Rosling die?
Answer 3: Tuesday

Question 4: How old was Hans Rosling when he died?
Answer 4: 68

Question 5: Is Hans Rosling still alive?
Answer 5: died on Tuesday in Uppsala, Sweden. He was 68



In [5]:
article = "SEOUL, South Korea - A special prosecutor investigating the corruption scandal that led to President Park's impeachment summoned the de facto head of Samsung for questioning on Wednesday, calling him a bribery suspect. The de facto leader, Jay Y. Lee, the vice chairman of Samsung, will be questioned on Thursday, according to the special prosecutor's office, which recommended that he also be investigated on suspicion of perjury. Mr. Lee effectively runs Samsung, South Korea's largest conglomerate; he is the son of its chairman, Lee   who has been incapacitated with health problems. He is expected to be asked whether   donations that Samsung made to two foundations controlled by Choi   a longtime friend of the president, amounted to bribes, and what role, if any, he played in the decision to give the money. Investigators at the special prosecutor's office have questioned other senior Samsung executives as suspects about the bribery accusations. Neither Samsung nor Mr. Lee responded immediately to the announcement on Wednesday. Allegations that Ms. Park helped Ms. Choi extort millions in bribes from Samsung and other companies are at the heart of the corruption scandal that led to the National Assembly's vote to impeach her last month. Since then, Ms. Park's powers have been suspended, and she is on trial at the Constitutional Court, which will ultimately decide whether to end her presidency. Last month, Mr. Lee testified at a National Assembly hearing that he was not involved in the decision by Samsung to make the donations. He also said that the donations were not voluntary, suggesting that the company was a victim of extortion, not a participant in bribery. The reference on Wednesday to possible perjury charges against Mr. Lee stemmed from that testimony. The special prosecutor's office said it had evidence that Mr. Lee had \"received a request for bribery from the president and ordered Samsung subsidiaries to send bribes to destinations designated by the president.\" It asked the National Assembly to file a perjury complaint against Mr. Lee, which would authorize the special prosecutor to open an investigation of that charge. Asked whether investigators would seek to arrest Mr. Lee on bribery charges, a spokesman for the special prosecutor's office, Lee   said, \"All possibilities are open.\" In November, state prosecutors indicted Ms. Choi on charges of coercing 53 big businesses, including Samsung, to contribute $69 million to her two foundations. They identified Ms. Park as an accomplice but stopped short of filing any charges against the businesses, all of which insisted that they were under government pressure to donate. In its impeachment bill, the National Assembly asserted that the donations were bribes, made with the expectation of political favors from the president. The special prosecutor, which took over the investigations from the state prosecutors last month, has been looking into possible bribery charges against not only Ms. Park but the businesses, particularly Samsung. Ms. Park cannot be indicted while in office. Samsung gave the largest donations to Ms. Choi's foundations, totaling $17 million. Unlike the other corporate contributors, it went beyond support for the foundations, signing an $18 million contract with a sports management company that Ms. Choi ran in Germany, to fund a program for training Korean equestrians, which mainly benefited Ms. Choi's daughter. Samsung also contributed $1.3 million to a winter sports program for young athletes that Ms. Choi and her nephew ran. Also on Wednesday, the special prosecutor's office said it had acquired a tablet computer used by Ms. Choi that contained emails she exchanged with a Samsung executive. The emails contained information about the financial support provided by Samsung, the prosecutor's office said. The special prosecutor has been investigating whether Samsung gave its support to Ms. Choi in exchange for a decision by the   National Pension Service to support a contentious merger of two Samsung affiliates in 2015. Moon   chairman of the pension fund, was arrested last month on charges that he illegally pressured the fund to back that merger when he was South Korea's health and welfare minister. The national pension fund's support was crucial for the merger, which analysts said helped Mr. Lee inherit control of Samsung from his father."
question = "Who is the vice chairman of Samsung"

# Encode the question and article; the article is longer than the model's 512-token limit,
# so let the tokenizer truncate only the article (this keeps the final [SEP] token in place)
inputs = tokenizer(question, article, return_tensors='pt', max_length=512, truncation='only_second')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# Pass the encoded input through the QA model
outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Decode the predicted start and end positions to get the answer
start_index = torch.argmax(start_logits)
end_index = torch.argmax(end_logits) + 1

# Slice the predicted answer span out of the input IDs and decode it to text
answer_tokens = input_ids[0][start_index:end_index]
answer = tokenizer.decode(answer_tokens)

# Print the question and answer
print("Question: " + question)
print("Answer: " + answer)

Question: Who is the vice chairman of Samsung
Answer: Jay Y. Lee
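The cell above only lets the model see the first 512 tokens of the article, so an answer located later in the text could not be found. A common workaround is to split the tokens into overlapping windows, run the QA model on each window, and keep the highest-scoring answer. The sketch below shows only the windowing step on a plain list; `sliding_windows`, `window_size`, and `stride` are hypothetical names and values chosen for illustration.

```python
def sliding_windows(token_ids, window_size=384, stride=128):
    """Split a token list into overlapping windows.

    Returns (offset, window) pairs; consecutive windows share `stride` tokens.
    """
    if stride >= window_size:
        raise ValueError("stride must be smaller than window_size")
    step = window_size - stride
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + window_size]))
        # Stop once the current window reaches the end of the token list
        if start + window_size >= len(token_ids):
            break
        start += step
    return windows

# Stand-in for the token IDs of a long article
tokens = list(range(1000))
windows = sliding_windows(tokens)
for offset, window in windows:
    print("offset:", offset, "length:", len(window))
```

The overlap reduces the chance that an answer is cut in half at a window boundary. Hugging Face tokenizers provide similar behaviour through the `stride` and `return_overflowing_tokens` arguments, which may be preferable in practice because they also handle the question and special tokens.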
