## Generating QA pairs for LLM fine-tuning using the OpenAI API

### Overview
This script automates the extraction of text from PDF files, splits the text into manageable chunks, generates question-answer (QA) pairs using an OpenAI language model, and saves the results to a JSON file. The JSON output is structured as a list of prompt-response records intended for fine-tuning large language models (LLMs). The script uses the OpenAI API, which requires a paid API key for access.

### Configuration
The script begins by setting up necessary configurations, including the source directory for the PDF files, parameters for text chunking, and the OpenAI model settings. It also sets the OpenAI API key required for accessing the language model. This API key must be a valid, paid key to use the OpenAI services.
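
As an alternative to writing the key into the script, the key can be read from the environment. The following is a minimal sketch, not part of the original script: `load_api_key` and `mask_key` are hypothetical helper names, and it assumes the key has already been exported as `OPENAI_API_KEY` in the shell.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper: raises instead of silently continuing without a key.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


def mask_key(key, visible=4):
    """Hide all but the last few characters so the key can be logged safely."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

Calling `mask_key(load_api_key())` in a debugging print confirms that a key is present without exposing it in the notebook output.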

### Functions
*Extract Text from PDFs*
This function reads all PDF files in the specified directory and extracts their text content. It catches errors per file, so the script continues to run even if some files cannot be processed. The extracted text is stored in a list with one entry per page, because the loader returns one document per page.
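
The skip-on-error pattern described above can be sketched independently of any PDF library. In this illustration `extract_all` and `fake_loader` are hypothetical names; the fake loader stands in for a real PDF loader and is an assumption made only so the example runs on its own.

```python
def extract_all(paths, load_text):
    """Apply load_text to each path, recording failures instead of stopping."""
    texts, failures = [], []
    for path in paths:
        try:
            texts.append(load_text(path))
        except Exception as exc:  # broad on purpose: one bad file must not stop the batch
            failures.append((path, str(exc)))
    return texts, failures


def fake_loader(path):
    """Stand-in for a real PDF loader; fails on one specific file name."""
    if path.endswith("broken.pdf"):
        raise ValueError("cannot parse file")
    return f"text of {path}"


texts, failures = extract_all(["a.pdf", "broken.pdf", "b.pdf"], fake_loader)
print(texts)
print(failures)
```

Returning the failures alongside the texts makes it possible to report which files were skipped once the run finishes.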

*Chunk Text*
This function splits the extracted text into smaller chunks with a specified overlap. The chunking process makes it easier for the language model to process the text, as smaller chunks are more manageable. The function takes in parameters for chunk size and overlap, allowing flexibility in how the text is split.
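
To make the effect of chunk size and overlap concrete, here is a simplified fixed-width sliding window. It is only an illustration: the script itself uses LangChain's `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences, so its chunks will generally differ from these. The name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap that helps preserve context across chunk boundaries.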

*Generate QA Pairs*
This function generates QA pairs from the text chunks using the OpenAI language model. It runs the model on each chunk of text and parses the generated output to extract questions and answers. The function also limits the number of QA pairs generated per chunk to the specified maximum, ensuring that the output is not overly verbose and remains relevant. Since this function relies on the OpenAI API, it requires a paid API key to function.
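
The parsing and capping step can be sketched without calling the API by working on a sample response string. This is a regex-based variant, not the script's own string-splitting logic, and `parse_qa_pairs` is a hypothetical name. One behavioural difference is assumed here: questions that have no matching answer are skipped, whereas the script keeps them with an empty response.

```python
import re

# Matches one <question>...</question> immediately followed by <answer>...</answer>
QA_PATTERN = re.compile(
    r"<question>(.*?)</question>\s*<answer>(.*?)</answer>", re.DOTALL
)


def parse_qa_pairs(response, max_pairs=5):
    """Extract up to max_pairs tagged question-answer pairs from model output."""
    matches = QA_PATTERN.findall(response)[:max_pairs]
    return [
        {"prompt": question.strip(), "response": answer.strip()}
        for question, answer in matches
    ]


sample = (
    "<question>What is SMR?</question>\n"
    "<answer>Small Modular Reactor.</answer>\n"
    "<question>Where is Darlington?</question>\n"
    "<answer>Ontario.</answer>\n"
    "<question>Is it nuclear?</question>\n"
    "<answer>Yes.</answer>\n"
)
pairs = parse_qa_pairs(sample, max_pairs=2)
print(pairs)
```

With `max_pairs=2`, only the first two pairs from the sample are kept, mirroring the per-chunk limit described above.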

*Save QA Pairs to JSON*
This function cleans up the QA pairs by removing any special tags (e.g., `<question>`, `</question>`, `<answer>`, `</answer>`) and saves the cleaned pairs to a JSON file. The JSON format is required for fine-tuning LLMs with the generated QA pairs. Each QA pair is saved as a dictionary with "prompt" and "response" keys.
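
The cleaning step and the resulting record shape can be illustrated with a single pair. This sketch uses a regular expression instead of the chained `replace` calls in the script, and `clean_pair` is a hypothetical helper name; the sample pair is made up for the example.

```python
import json
import re

# Matches the opening and closing question/answer tags
TAG_PATTERN = re.compile(r"</?(?:question|answer)>")


def clean_pair(pair):
    """Strip the tags and surrounding whitespace from one QA pair."""
    return {
        "prompt": TAG_PATTERN.sub("", pair["prompt"]).strip(),
        "response": TAG_PATTERN.sub("", pair["response"]).strip(),
    }


raw_pairs = [
    {
        "prompt": "<question>What is SMR?</question>",
        "response": "<answer>Small Modular Reactor.</answer>",
    },
]
cleaned = [clean_pair(pair) for pair in raw_pairs]
serialized = json.dumps(cleaned, ensure_ascii=False, indent=4)
print(serialized)
```

The printed text is a JSON array of objects, each with a "prompt" and a "response" string, which is the same shape the script writes to its output file.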

*Main Execution*
The main execution block orchestrates the entire process. It first extracts text from all PDF files in the specified directory. Then, it splits the extracted text into chunks. For each chunk, the script generates QA pairs and collects them. Finally, it saves all generated QA pairs to a JSON file. Throughout the process, the script prints debugging information to the console, including the number of files processed, chunks created, QA pairs generated, and the final count of QA pairs saved.

### Summary of Key Features
Text Extraction: The script extracts text from PDF files in the specified directory, handling errors to ensure robustness.

Text Chunking: It splits the extracted text into smaller, manageable chunks based on specified chunk size and overlap parameters.

QA Pair Generation: The script generates QA pairs using the OpenAI language model, which requires a paid API key. It ensures relevance by limiting the number of QA pairs per chunk.

JSON Saving: Generated QA pairs are cleaned and saved in a JSON format suitable for fine-tuning large language models (LLMs).

Error Handling: The script includes basic error handling to manage issues with PDF file processing, ensuring continuity of the overall process.

### Code:

In [None]:
# pip install openai langchain pandas tqdm PyMuPDF
# pip install -U langchain-community
# pip install openpyxl

In [None]:
# Importing necessary libraries
import os
import glob
import json
import pandas as pd
from tqdm import tqdm
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyMuPDFLoader

# Load configuration
SOURCE_DIRECTORY = '/Users/zarinadossayeva/Desktop/WIL_LLM/CNSC_QA_pairs_JSON/CNSC_docs_1_10'
chunk_size = 1000
chunk_overlap = 100
model = "gpt-3.5-turbo-0125"
temperature = 0
max_tokens = None
max_qa_per_chunk = 5  # Limiting the number of QA pairs per chunk

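# Set OpenAI API key (placeholder: replace with your own key and never commit a real one)
api_key = 'open-api-key'
os.environ['OPENAI_API_KEY'] = api_key

# Verify the API key is set without printing the secret itself
print(f"OpenAI API key set: {bool(os.environ.get('OPENAI_API_KEY'))}")  # Debugging line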

# Define a function to read all PDFs and extract text
def extract_text_from_pdfs(directory):
    """Extracts text from all PDF files in the specified directory.

    Args:
        directory (str): The directory containing PDF files.

    Returns:
        list: A list of strings, one per page, since PyMuPDFLoader returns one document per page.
    """
    text_data = []
    pdf_files = glob.glob(os.path.join(directory, "*.pdf"))
    print(f"Found {len(pdf_files)} PDF files")  # Debugging line

    if not pdf_files:
        print("No PDF files found in the specified directory.")  # Debugging line

    for pdf_file in pdf_files:
        print(f"Processing file: {pdf_file}")  # Debugging line
        try:
            loader = PyMuPDFLoader(file_path=pdf_file)
            documents = loader.load()  # Expecting a list of documents
            for document in documents:
                text = document.page_content
                text_data.append(text)
                print(f"Extracted text from {pdf_file} (length {len(text)}): {text[:500]}...")  # Debugging line
        except Exception as e:
            print(f"Error processing file {pdf_file}: {e}")  # Debugging line

    return text_data

# Convert documents to chunks
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Splits the text into smaller chunks with a specified overlap.

    Args:
        text (str): The text to be split.
        chunk_size (int, optional): The size of each chunk. Defaults to 1000.
        chunk_overlap (int, optional): The overlap between chunks. Defaults to 100.

    Returns:
        list: A list of text chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.split_text(text)
    print(f"Chunked text into {len(chunks)} chunks (total length {len(text)}):")  # Debugging line
    if chunks:
        print(f"First chunk: {chunks[0]}")  # Debugging line
    return chunks

# Prompt to use OpenAI API as QA generator
qa_prompt_template = """
You are an intelligent and supportive assistant.
Your task is to create question-answer pairs from the given context.
Question type: Context based/Yes-No/ short question answer/ long question answer.
Just use the information in the context to write question and answer.
Please don't make up anything outside the given context.

Text: {context}

Generate as many question-answer pairs as possible. Always use tags to enclose question answers as shown in below examples.

<question>What is SMR?</question>
<answer>SMR stands for Small Modular Reactors, which are smaller, more flexible nuclear energy plants that can be deployed in various settings, including large established grids, smaller grids, remote off grid communities, and resource projects. They are designed to provide non-emitting baseload generation and can support intermittent renewable sources like wind and solar. They are also capable of producing steam for industrial purposes.</answer>
<question>What are the key objectives of the SMR project at the Darlington site in Ontario?</question>
<answer>The key objectives of the SMR project at the Darlington site in Ontario are to maintain a diverse generation supply mix to minimize carbon emissions from electricity generation in the province, to demonstrate a First-Of-A-Kind (FOAK) SMR to be ready for deployment across Canada by 2030, and to ensure economic development by securing Canadian content both for domestic and export projects from the developer in exchange for providing the opportunity to deploy their FOAK unit and be a first mover towards an SMR fleet.</answer>
... (continue as needed)
"""

# Initialize the LangChain prompt
qa_prompt = PromptTemplate(template=qa_prompt_template, input_variables=["context"])
llm = ChatOpenAI(model=model, temperature=temperature, max_tokens=max_tokens)

# Create an LLMChain
qa_chain = LLMChain(prompt=qa_prompt, llm=llm)

# Generate QA pairs for every chunk
def generate_qa_pairs(chunks, max_qa_per_chunk):
    """Generates question-answer pairs from text chunks using the language model.

    Args:
        chunks (list): The list of text chunks.
        max_qa_per_chunk (int): The maximum number of QA pairs to generate per chunk.

    Returns:
        list: A list of dictionaries, each containing a prompt and response.
    """
    qa_pairs = []
    for i, chunk in enumerate(tqdm(chunks, desc="Generating QA pairs")):
        try:
            print(f"Processing chunk {i + 1}/{len(chunks)} (length {len(chunk)}): {chunk[:500]}...")  # Debugging line
            response = qa_chain.run({"context": chunk})
            print(f"Generated QA pairs for chunk {i + 1}/{len(chunks)}: {response}")  # Debugging line
            
            # Parse the QA pairs from the response
            qa_list = response.split('<question>')
            for qa in qa_list[1:max_qa_per_chunk+1]:
                if '<answer>' in qa:
                    question = "<question>" + qa.split('</question>')[0] + "</question>"
                    answer = "<answer>" + qa.split('<answer>')[1].split('</answer>')[0] + "</answer>"
                    qa_pairs.append({"prompt": question, "response": answer})
                else:
                    qa_pairs.append({"prompt": qa.strip(), "response": ""})
            
        except Exception as e:
            print(f"Error generating QA for chunk {i + 1}/{len(chunks)}: {e}")
    return qa_pairs

# Save questions and answers to a JSON file
def save_questions_to_json(qa_pairs, output_file="CNSC_QA_pairs_1_10.json"):
    """Saves the generated QA pairs to a JSON file.

    Args:
        qa_pairs (list): The list of QA pairs to save.
        output_file (str, optional): The name of the output JSON file. Defaults to "CNSC_QA_pairs_1_10.json".
    """
    cleaned_qa_pairs = []
    for qa in qa_pairs:
        prompt = qa['prompt'].replace("<question>", "").replace("</question>", "").strip()
        response = qa['response'].replace("<answer>", "").replace("</answer>", "").strip()
        cleaned_qa_pairs.append({"prompt": prompt, "response": response})
    
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(cleaned_qa_pairs, f, ensure_ascii=False, indent=4)
    print(f"Saved {len(cleaned_qa_pairs)} QA pairs to {output_file}")

# Main execution
if __name__ == "__main__":
    # Run the entire pipeline: extract text, chunk text,
    # generate QA pairs, and save to JSON.
    text_data = extract_text_from_pdfs(SOURCE_DIRECTORY)
    all_qa_pairs = []
    
    for text in text_data:
        chunks = chunk_text(text, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        qa_pairs = generate_qa_pairs(chunks, max_qa_per_chunk)
        all_qa_pairs.extend(qa_pairs)
    
    print(f"Total QA pairs generated: {len(all_qa_pairs)}")  # Printing total QA pairs generated
    save_questions_to_json(all_qa_pairs)