This notebook demonstrates the generation of question-answer (QA) pairs from text extracted from a PDF document using a pre-trained T5 model.

Key steps include:
1. Importing necessary libraries, including `fitz` (PyMuPDF) for PDF handling and `transformers` for model processing.
2. Extracting text from the PDF file, page by page.
3. Splitting the extracted text into fixed-size chunks of whitespace-separated words, small enough to fit the T5 model's input.
4. Generating QA pairs for each chunk using a pre-trained T5 model (`valhalla/t5-small-qa-qg-hl`):
   - The model first generates one question from the chunk.
   - It then generates an answer from that question, using the same chunk as context.
5. Printing the generated questions and answers for each chunk of text.


In [1]:
import fitz  # PyMuPDF
from transformers import T5Tokenizer, T5ForConditionalGeneration



In [2]:
# Function to extract text from a PDF, page by page
def extract_text_from_pdf(pdf_path):
    # Open the document in a context manager so the file is closed afterwards
    with fitz.open(pdf_path) as doc:
        # Concatenate the plain text of every page in document order
        return "".join(page.get_text() for page in doc)

In [3]:
# Function to split text into smaller chunks
# Note: chunk_size counts whitespace-separated words, not model tokens
def split_text_into_chunks(text, chunk_size=512):
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks
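
To make the chunking behaviour concrete, here is a minimal sketch that repeats the function above and runs it on a toy string. The sample sentence and the names `sample` and `sample_chunks` are illustrative assumptions, not part of the original notebook.

```python
def split_text_into_chunks(text, chunk_size=512):
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Toy input (assumption): seven words, split into groups of three
sample = "one two three four five six seven"
sample_chunks = split_text_into_chunks(sample, chunk_size=3)
print(sample_chunks)  # ['one two three', 'four five six', 'seven']
```

The last chunk simply holds whatever words remain. Because the split is by words, a chunk can still tokenize to more tokens than it has words, which is why the generation step also truncates its inputs.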

In [4]:
# Function to generate QA pairs for each chunk
def generate_qa_pairs_for_chunks(chunks, model, tokenizer):
    qa_pairs = []
    for chunk in chunks:
        question_prompt = f"generate question: {chunk} </s>"
        
        # Generate question
        question_inputs = tokenizer.encode(question_prompt, return_tensors="pt", max_length=512, truncation=True)
        question_outputs = model.generate(question_inputs, max_length=50, num_beams=5, early_stopping=True)
        question = tokenizer.decode(question_outputs[0], skip_special_tokens=True)
        
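        # Note: the model card for valhalla/t5-small-qa-qg-hl shows QA inputs in the
        # form "question: ... context: ...", so this prompt differs from that format.
        answer_prompt = f"generate answer: context: {chunk} question: {question} </s>"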

        # Generate answer
        answer_inputs = tokenizer.encode(answer_prompt, return_tensors="pt", max_length=512, truncation=True)
        answer_outputs = model.generate(answer_inputs, max_length=50, num_beams=5, early_stopping=True)
        answer = tokenizer.decode(answer_outputs[0], skip_special_tokens=True)
        
        qa_pairs.append((question, answer))
    return qa_pairs
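
The model can return an answer that merely repeats the question, as in the Chunk 4 output recorded in this notebook. One possible way to screen out such pairs is sketched below. `filter_degenerate_pairs` is a hypothetical helper, not part of the original notebook, and the example pairs are made up except for the first, which is taken from the recorded output.

```python
def filter_degenerate_pairs(qa_pairs):
    """Drop pairs whose answer is blank or repeats the question (hypothetical helper)."""
    kept = []
    for question, answer in qa_pairs:
        # Compare case-insensitively and ignore surrounding whitespace
        q = question.strip().lower()
        a = answer.strip().lower()
        if a and a != q:
            kept.append((question, answer))
    return kept

# Illustrative input (assumption): one usable pair, one echoed question, one blank answer
example_pairs = [
    ("What is the basic neutron cycle?", "a slow neutron is absorbed by a fissile nucleus"),
    ("What was the plan?", "What was the plan?"),
    ("What is X?", "   "),
]
filtered = filter_degenerate_pairs(example_pairs)
print(filtered)
```

This only catches exact echoes and blank answers. It does not judge whether an answer is actually correct for its question.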

In [5]:
# Main function
def main(pdf_path):
    input_text = extract_text_from_pdf(pdf_path)
    chunks = split_text_into_chunks(input_text, chunk_size=100)
    
    tokenizer = T5Tokenizer.from_pretrained("valhalla/t5-small-qa-qg-hl")
    model = T5ForConditionalGeneration.from_pretrained("valhalla/t5-small-qa-qg-hl")
    
    qa_pairs = generate_qa_pairs_for_chunks(chunks, model, tokenizer)
    
    for i, (question, answer) in enumerate(qa_pairs):
        print(f"Chunk {i+1} - Question: {question}")
        print(f"Chunk {i+1} - Answer: {answer}\n")

In [6]:
# Path to the PDF file
pdf_path = '/Users/zarinadossayeva/Desktop/WIL_LLM/Canteach/CANTEACH_Documents/00000_General_Project/19950101.pdf'
main(pdf_path)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Chunk 1 - Question: What is Canada's leading role in nuclear development?
Chunk 1 - Answer: Canada's leading role and eminent accomplishments in nuclear development now span more than half a century

Chunk 2 - Question: What is Canada's view of Canada's nuclear achievements?
Chunk 2 - Answer: Canada's nuclear achievement makes an interesting and timely story

Chunk 3 - Question: What is the historical, technical and economic perspective of the future of nuclear power?
Chunk 3 - Answer: an historical, technical and economic perspective

Chunk 4 - Question: What was Macpherson's career in the nudear industry?
Chunk 4 - Answer: What was Macpherson's career in the nudear industry?

Chunk 5 - Question: What was C.D. Howe's decision to move to Canada?
Chunk 5 - Answer: Howe's decision was culmination of a year-long discussion with Britain and the United States to move to Canada the heavy water and uranium dioxide research

Chunk 6 - Question: What story did Douglas and Harris portray in the 

In [10]:
pdf_path = '/Users/zarinadossayeva/Desktop/WIL_LLM/Canteach/CANTEACH_Documents/33100_Main_Heat_Transport_System/20043704.pdf'
main(pdf_path)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Chunk 1 - Question: What are the Design Requirements and Engineering Considerations?
Chunk 1 - Answer: Design Requirements and Engineering Considerations 2-1 Chapter 2

Chunk 2 - Question: What type of medium is used to slow down?
Chunk 2 - Answer: the reactor coolant

Chunk 3 - Question: What is the basic neutron cycle?
Chunk 3 - Answer: a slow neutron is absorbed by a fissile nucleus

Chunk 4 - Question: What is an example of an "economy of neutrons"?
Chunk 4 - Answer: the process must exhibit an "economy of neutrons"

Chunk 5 - Question: What is one of the constraints of the basic neutron cycle?
Chunk 5 - Answer: the reactor system must perform the desired function (ie, generate X MWe)

Chunk 6 - Question: What is the only naturally occurring fuel of significant quantities?
Chunk 6 - Answer: 23.,U

Chunk 7 - Question: What is the CANDU approach?
Chunk 7 - Answer: the probability of fission must be enhanced

Chunk 8 - Question: What is the most important factor in a CANDU reactor?
Ch