## Evaluating QA pairs for relevance to LLM fine-tuning

This code evaluates the relevance of question-answer (QA) pairs using the OpenAI API, specifically the GPT-3.5 Turbo model (`gpt-3.5-turbo-0125`) accessed through LangChain. It loads QA pairs from a JSON file, runs each pair through a predefined relevance evaluation prompt, and saves the results. The process is broken down into several key steps:

*Configuration:* The model, temperature, and maximum tokens for the OpenAI API are set. The OpenAI API key is also configured.
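
The code cell in this notebook assigns a placeholder string to `api_key` and copies it into `OPENAI_API_KEY`. As a minimal sketch of an alternative, the key could be read from an environment variable that is already set outside the notebook. The helper name `get_openai_api_key` is hypothetical and is not part of the original code.

```python
import os


def get_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, failing early if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return api_key
```

With this approach the key never appears in the notebook itself, and a missing key is reported before any QA pair is sent to the API.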

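*Prompt Template Definition:* A detailed prompt template is defined, instructing the model to flag a QA pair as irrelevant if it refers to sections of a document, tables, figures, appendices, equations, or specific parts of a text, or contains other information that may cause LLMs to hallucinate. The prompt includes three examples to guide the model.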
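
To illustrate how the `{question}` and `{answer}` placeholders are filled, here is a small sketch that uses plain `str.format` on a shortened copy of the template. It only approximates what the LangChain `PromptTemplate` does with its default formatting; the names `SHORT_TEMPLATE` and `fill_prompt` are assumptions made for this example.

```python
# Shortened stand-in for the full evaluation prompt (instructions and examples omitted)
SHORT_TEMPLATE = """Question: {question}

Answer: {answer}

Is this QA pair irrelevant? Answer with 'Yes' or 'No'."""


def fill_prompt(question, answer):
    """Substitute one QA pair into the template placeholders."""
    return SHORT_TEMPLATE.format(question=question, answer=answer)


filled = fill_prompt(
    "What is Appendix D focused on?",
    "Appendix D is focused on radionuclide-specific recommendations.",
)
print(filled)
```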

*Model Initialization:* The LangChain model is initialized with the specified configuration.

*Function Definitions:*

- `load_qa_pairs`: This function loads QA pairs from a specified JSON file.
- `flag_irrelevant_qa_pairs`: This function evaluates each QA pair using the OpenAI API, sets `is_irrelevant` to true or false based on the model's response (or to `None` if the call raises an error), and collects the results.
- `save_flagged_qa_pairs`: This function saves the flagged QA pairs to a specified JSON file.
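
The flagging step depends on turning the model's free-text reply into a boolean. Chat models do not always reply with a bare "Yes" or "No", so a slightly more tolerant parser may be useful. The sketch below is one possible approach, not the notebook's own logic; `parse_verdict` is a hypothetical helper that returns `None` when the reply is neither a clear yes nor a clear no.

```python
import re


def parse_verdict(raw_reply):
    """Map a model reply to True (irrelevant), False (relevant), or None (unclear)."""
    # Skip leading quotes or punctuation, then require a whole word "yes" or "no"
    match = re.match(r"\W*(yes|no)\b", raw_reply.strip().lower())
    if match is None:
        return None
    return match.group(1) == "yes"
```

Returning `None` for unclear replies keeps them distinguishable from pairs that the model explicitly judged relevant, in the same way that the notebook uses `None` for failed API calls.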

*Main Execution:*

- The script loads the QA pairs from the input JSON file.
- It processes the first 1000 QA pairs, evaluating their relevance using the model.
- The flagged results are then saved to an output JSON file.

In [5]:
import os
import json
from tqdm import tqdm
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Configuration
model = "gpt-3.5-turbo-0125"
temperature = 0
max_tokens = 100

# Set your OpenAI API key
api_key = 'API-key'
os.environ['OPENAI_API_KEY'] = api_key

# Initialize the LangChain model
llm = ChatOpenAI(model=model, temperature=temperature, max_tokens=max_tokens)

# Define the prompt template to evaluate relevance
evaluation_prompt_template = """
You are an intelligent assistant. Evaluate the relevance of the following question-answer pair for fine-tuning 
large language models (LLMs). Flag it as irrelevant if it contains any references to sections of a document, tables, figures, appendices, equations, specific parts of a text, or any information that may cause LLMs to hallucinate.

Example 1:
Question: What does the section on skin contamination in the Radionuclide Information Booklet provide guidance on?
Answer: The section on skin contamination provides guidance to licensees on evaluating skin dose as a result of a skin contamination incident.
Is this QA pair irrelevant? Yes

Example 2:
Question: What does the term "estimation" refer to in the context of this document?
Answer: In the context of this document, estimation refers to two types of approaches to estimating doses: indirect monitoring and dose modelling.
Is this QA pair irrelevant? Yes

Example 3:
Question: What is Appendix D focused on?
Answer: Appendix D is focused on radionuclide-specific recommendations related to bioassay measurements and internal dosimetry for Tritium.
Is this QA pair irrelevant? Yes

Question: {question}

Answer: {answer}

Is this QA pair irrelevant? Answer with 'Yes' or 'No'.
"""

# Initialize the prompt
evaluation_prompt = PromptTemplate(template=evaluation_prompt_template, input_variables=["question", "answer"])
evaluation_chain = LLMChain(prompt=evaluation_prompt, llm=llm)

def load_qa_pairs(input_file):
    """Load QA pairs from a JSON file."""
    with open(input_file, 'r', encoding='utf-8') as f:
        return json.load(f)

def flag_irrelevant_qa_pairs(qa_pairs):
    """Flag irrelevant QA pairs using the OpenAI API."""
    flagged_qa_pairs = []
    for qa in tqdm(qa_pairs, desc="Flagging QA pairs"):
        question = qa["prompt"]
        answer = qa["response"]
        try:
            response = evaluation_chain.run({"question": question, "answer": answer}).strip().lower()
            # Accept replies such as "yes." or "yes, because ..." and not only the bare word "yes"
            is_irrelevant = response.startswith("yes")
            flagged_qa_pairs.append({
                "prompt": question,
                "response": answer,
                "is_irrelevant": is_irrelevant
            })
        except Exception as e:
            print(f"Error evaluating QA pair: {e}")
            flagged_qa_pairs.append({
                "prompt": question,
                "response": answer,
                "is_irrelevant": None
            })
    return flagged_qa_pairs

def save_flagged_qa_pairs(flagged_qa_pairs, output_file):
    """Save flagged QA pairs to a JSON file."""
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(flagged_qa_pairs, f, ensure_ascii=False, indent=4)
    print(f"Saved {len(flagged_qa_pairs)} flagged QA pairs to {output_file}")

# Main execution
if __name__ == "__main__":
    input_file = "/Users/zarinadossayeva/Desktop/WIL_LLM/CNSC_QA_pairs_JSON/CNSC_QA_pairs/CNSC_QA_pairs_41_50.json"
    output_file = "flagged_questions_41_50_28Jul4.json"
    
    # Load the QA pairs from the input file
    qa_pairs = load_qa_pairs(input_file)
    
    # Process only the first 1000 QA pairs
    first_1000_qa_pairs = qa_pairs[:1000]
    
    # Flag irrelevant QA pairs
    flagged_qa_pairs = flag_irrelevant_qa_pairs(first_1000_qa_pairs)
    
    # Save the flagged QA pairs to the output file
    save_flagged_qa_pairs(flagged_qa_pairs, output_file)

Flagging QA pairs: 100%|████████████████████| 1000/1000 [06:42<00:00,  2.49it/s]

Saved 1000 flagged QA pairs to flagged_questions_41_50_28Jul4.json
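
Once the flags are saved, a natural follow-up is to count how many pairs fall into each category and keep only the relevant ones for fine-tuning. The sketch below assumes the output structure written by `save_flagged_qa_pairs` (a list of dicts with `prompt`, `response`, and `is_irrelevant`); the function names `summarize_flags` and `keep_relevant` are hypothetical, and the sample records are made up for illustration.

```python
from collections import Counter


def summarize_flags(flagged_qa_pairs):
    """Count irrelevant, relevant, and unevaluated (None) QA pairs."""
    counts = Counter(item.get("is_irrelevant") for item in flagged_qa_pairs)
    return {
        "irrelevant": counts[True],
        "relevant": counts[False],
        "unevaluated": counts[None],
    }


def keep_relevant(flagged_qa_pairs):
    """Return only the pairs explicitly judged relevant, without the flag field."""
    return [
        {"prompt": item["prompt"], "response": item["response"]}
        for item in flagged_qa_pairs
        if item.get("is_irrelevant") is False
    ]


# Illustrative records only; real data would come from the saved JSON file
sample = [
    {"prompt": "q1", "response": "a1", "is_irrelevant": True},
    {"prompt": "q2", "response": "a2", "is_irrelevant": False},
    {"prompt": "q3", "response": "a3", "is_irrelevant": None},
]
summary = summarize_flags(sample)
relevant_only = keep_relevant(sample)
```

Pairs with `is_irrelevant` set to `None` are excluded from the kept set here, so they can be re-evaluated separately instead of silently entering the fine-tuning data.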



