# Use an LLM to Automatically Generate Questions and Answers from a Document

In [None]:
import boto3
import urllib.request
import math
import re
import json
from utils.book_helper import extract_chapter

Set up the boto3 client for Bedrock model invocations.

In [None]:
bedrock_runtime = boto3.client("bedrock-runtime")
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

## Download Content
In this example, we'll use the book [The Adventures of Sherlock Holmes](https://www.gutenberg.org/cache/epub/1661/pg1661.txt) as the knowledge source.
The book is made available for free by [Project Gutenberg](https://www.gutenberg.org) and has a copyright status of public domain. For more information, refer to the details [here](https://www.gutenberg.org/ebooks/1661).

In [None]:
download_file_path = "data/sherlock_holmes.txt"

In [None]:
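import os

target_url = "https://www.gutenberg.org/cache/epub/1661/pg1661.txt" # The Adventures of Sherlock Holmes
# Create the target directory if it does not exist yet, so the download does not fail.
os.makedirs(os.path.dirname(download_file_path), exist_ok=True)
urllib.request.urlretrieve(target_url, download_file_path)
# Read with an explicit encoding so the result does not depend on the platform default.
with open(download_file_path, "r", encoding="utf-8") as f:
    data = f.read()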

In [None]:
chapter_contents = extract_chapter(data)
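
Because `extract_chapter` comes from a local helper module whose output is not shown here, it can help to inspect the chunks before sending them to the model. The following is a minimal sketch that assumes `chapter_contents` is a list of strings; `summarize_chunks` and the sample data are hypothetical and only illustrate the idea.

```python
def summarize_chunks(chunks, preview_chars=60):
    """Return (index, character length, one-line preview) for each text chunk."""
    summary = []
    for idx, chunk in enumerate(chunks):
        # Collapse all whitespace so the preview fits on a single line.
        preview = " ".join(chunk.split())[:preview_chars]
        summary.append((idx, len(chunk), preview))
    return summary


# Stand-in data for illustration; in the notebook, pass chapter_contents instead.
sample_chunks = [
    "I. A SCANDAL IN BOHEMIA\n\nTo Sherlock Holmes she is always the woman.",
    "II. THE RED-HEADED LEAGUE\n\nI had called upon my friend.",
]
for idx, length, preview in summarize_chunks(sample_chunks):
    print(f"{idx:>3}  {length:>7} chars  {preview}")
```

A listing like this makes it easier to see which indices hold story text and which hold front or back matter before choosing a slice.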

## Automate Question Generation
To evaluate a retriever system, we first need a test set of questions about the documents. These questions need to be diverse, relevant, and coherent. Writing them by hand is slow, because you must first understand the documents and then spend time composing questions for them.

In the following step, we'll use a Bedrock model (e.g. Claude 3 Sonnet) to create a question and its answer from a given document chunk.

In [None]:
def generate_questions(bedrock_runtime, model_id, documents):

    prompt_template = """The question should be diverse in nature \
across the document. The question should not contain options and should not start with Q1/Q2. \
Restrict the question to the context information provided.\

<document>
{{document}}
</document>


Your response must use the following format:

Question: question
Answer: answer

Here are a few examples of the question and answer format:

### Example
Question: What does John like to do when he's free?
Answer: John likes to read books and play soccer.

### Example
Question: When did Alice start her new role in company A?
Answer: Alice started her new role last week. She's excited to get back to the workforce after a long break.


Think step by step and pay attention to the number of questions to create. Only return the question and answer. Do not provide any other explanation or pretext.

"""
    system_prompt = """You are a professor. Your task is to set up 1 question for an upcoming \
quiz/examination based on the given document wrapped in <document></document> XML tags."""

    prompt = prompt_template.replace("{{document}}", documents)
    temperature = 0.9
    top_k = 250
    messages = [{"role": "user", "content": [{"text": prompt}]}]
    # Base inference parameters to use.
    inference_config = {"temperature": temperature, "maxTokens": 512, "topP": 1.0}
    # Additional inference parameters to use.
    additional_model_fields = {"top_k": top_k}

    # Send the message.
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=messages,
        system=[{"text": system_prompt}],
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields
    )
    result = response['output']['message']['content'][0]['text']
    q_pos = [(a.start(), a.end()) for a in list(re.finditer("Question:", result))]
    a_pos = [(a.start(), a.end()) for a in list(re.finditer("Answer:", result))]

    data_samples = {}
    questions = []
    answers = []

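    # Pair each "Question:" marker with the "Answer:" marker at the same index.
    # zip() stops at the shorter list, so a response that is missing an answer
    # does not raise an IndexError.
    for idx, (q, a) in enumerate(zip(q_pos, a_pos)):
        q_start = q[1]
        a_start, a_end = a
        # Slice up to the next marker; surrounding whitespace is stripped below.
        question = result[q_start:a_start].strip()
        if idx == len(q_pos) - 1:
            answer = result[a_end:].strip()
        else:
            next_q_start = q_pos[idx+1][0]
            answer = result[a_end:next_q_start].strip()
        print("===============")
        print(f"Question: {question}")
        print(f"Answer: {answer}")
        questions.append(question)
        answers.append(answer)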
    data_samples['question'] = questions
    data_samples['ground_truth'] = answers
    return data_samples
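
The parsing step in `generate_questions` can be exercised without calling Bedrock. As an alternative sketch (not the notebook's own method), a single regular expression can pair each `Question:` with the `Answer:` that follows it. `parse_qa_pairs` and `QA_PATTERN` are hypothetical names, and the sketch assumes the model follows the `Question:` / `Answer:` format requested in the prompt.

```python
import re

# Capture the text after "Question:" up to "Answer:", then the text after
# "Answer:" up to the next "Question:" or the end of the response.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<question>.*?)\s*Answer:\s*(?P<answer>.*?)\s*(?=Question:|\Z)",
    re.DOTALL,
)


def parse_qa_pairs(text):
    """Return {'question': [...], 'ground_truth': [...]} parsed from model output."""
    questions, answers = [], []
    for match in QA_PATTERN.finditer(text):
        question = match.group("question").strip()
        answer = match.group("answer").strip()
        # Skip pairs where either side is empty so both lists stay aligned.
        if question and answer:
            questions.append(question)
            answers.append(answer)
    return {"question": questions, "ground_truth": answers}


# Stand-in response text for illustration.
sample = (
    "Question: Who is Holmes?\nAnswer: A consulting detective.\n\n"
    "Question: Where does he live?\nAnswer: 221B Baker Street."
)
print(parse_qa_pairs(sample))
```

Dropping incomplete pairs at parse time keeps the `question` and `ground_truth` lists the same length, at the cost of silently returning fewer pairs than were requested.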

Collect the generated Q&A pairs into a dataset with the following structure:

```
{
  "question" : [...],
  "ground_truth" : [ ... ]
}
``` 

In [None]:
data_samples = {}
data_samples['question'] = []
data_samples['ground_truth'] = []
for chapter_content in chapter_contents[2:12]: # use chunks 2 through 11 only; the chunks outside this range are skipped to avoid creating QA for content not directly related to the book.
    ds = generate_questions(bedrock_runtime, model_id, chapter_content)
    data_samples['question'].extend(ds['question'])
    data_samples['ground_truth'].extend(ds['ground_truth'])
data_samples['question'] = data_samples['question'][:10]
data_samples['ground_truth'] = data_samples['ground_truth'][:10]

*Note:* After the QA pairs are generated, verify each question and answer to make sure no information is missing. If any question or answer is empty, rerun the previous cell to regenerate the QA dataset. A missing question or answer will cause inconsistencies in the RAG evaluation.
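
The check described in this note can also be done programmatically. The following is a minimal sketch that assumes the dataset uses the `question` and `ground_truth` keys shown earlier; `find_incomplete_pairs` is a hypothetical helper name.

```python
def find_incomplete_pairs(samples):
    """Return the indices of pairs whose question or answer is missing or blank."""
    questions = samples.get("question", [])
    answers = samples.get("ground_truth", [])
    bad = []
    # Walk the longer list so that a length mismatch is reported as well.
    for idx in range(max(len(questions), len(answers))):
        question = questions[idx] if idx < len(questions) else ""
        answer = answers[idx] if idx < len(answers) else ""
        if not question.strip() or not answer.strip():
            bad.append(idx)
    return bad


# Stand-in data for illustration; in the notebook, pass data_samples instead.
demo_samples = {
    "question": ["Who is Holmes?", "   "],
    "ground_truth": ["A consulting detective.", "Baker Street."],
}
print(find_incomplete_pairs(demo_samples))
```

An empty list means every pair has both a question and an answer; any reported index points to a pair worth regenerating.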

## Save the Q&A Dataset
In this final step, we'll save the Q&A output into a JSON file. This file will be used in the next [notebook](rag_evaluation.ipynb) which will focus on performing RAG evaluation. 

In [None]:
with open("data/qa_samples.json", "w") as f:
    f.write(json.dumps(data_samples))
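
Since the next notebook depends on this file, it may be worth confirming that the saved JSON loads back unchanged. The sketch below writes to a temporary directory rather than `data/qa_samples.json`; `save_and_reload` and the demo data are hypothetical and only illustrate the round trip.

```python
import json
import os
import tempfile


def save_and_reload(samples, path):
    """Write samples to path as JSON, then read the file back and return it."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(samples, f, ensure_ascii=False, indent=2)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Stand-in data for illustration; in the notebook, pass data_samples instead.
demo = {"question": ["Who is Holmes?"], "ground_truth": ["A consulting detective."]}
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(demo, os.path.join(tmp_dir, "qa_samples.json"))
print(reloaded == demo)
```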

In [None]:
%store chapter_contents