# Question-Answer Validation: Large Language Model

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alvaro-francisco-gil/language-technologies-applications/blob/main/exercises/02_classification_by_llm.ipynb)
[![View on GitHub](https://img.shields.io/badge/Open%20on-GitHub-blue?logo=github)](https://github.com/alvaro-francisco-gil/language-technologies-applications/blob/main/exercises/02_classification_by_llm.ipynb)



This experiment measures how well the truthfulness of a question-answer pair can be inferred. This notebook covers the large language model approach, using zero-shot and few-shot prompting.

## Imports

If you are running this notebook in Google Colab, you can install the required packages by running the following cell:


In [1]:
# !pip install torch transformers datasets openai python-dotenv pandas numpy matplotlib

In [2]:
# Standard library
import json
import os
import random
import time
from pathlib import Path
from typing import List, Dict, Any

# Third-party packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset
from dotenv import load_dotenv
from openai import OpenAI
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load OPENAI_API_KEY from a local .env file into the environment;
# OpenAI() picks it up from there when the client is created
load_dotenv()
openai_api_key = os.getenv('OPENAI_API_KEY')

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


## Dataset Preparation

In [3]:
boolq = load_dataset("boolq")

print(f"Available splits: {boolq.keys()}")
print(f"Training set size: {boolq['train'].shape}")
print(f"Validation set size: {boolq['validation'].shape}")

print("\nExample data point:")
boolq["train"][0]

Available splits: dict_keys(['train', 'validation'])
Training set size: (9427, 3)
Validation set size: (3270, 3)

Example data point:


{'question': 'do iran and afghanistan speak the same language',
 'answer': True,
 'passage': 'Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi (fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.'}

For this task we use BoolQ, a dataset of yes/no questions, each paired with a boolean answer and a supporting passage. Because our aim is not to answer the question directly but to evaluate whether a given answer is correct, we first transform the dataset into question-answer pairs. Each data point yields two pairs: one with the correct answer and one with the answer negated. The passage is not included in the pair text.

In [4]:
# Create new training and validation datasets with correct and incorrect question-answer pairs
new_train_data = []
new_validation_data = []

# Process training data
for example in boolq["train"]:
    # Create correct pair
    correct_pair = {
        "text": f"The question is '{example['question']}' and the answer is '{example['answer']}'",
        "label": "correct"
    }
    new_train_data.append(correct_pair)
    
    # Create incorrect pair
    incorrect_pair = {
        "text": f"The question is '{example['question']}' and the answer is '{not example['answer']}'",
        "label": "incorrect"
    }
    new_train_data.append(incorrect_pair)

# Process validation data
for example in boolq["validation"]:
    # Create correct pair
    correct_pair = {
        "text": f"The question is '{example['question']}' and the answer is '{example['answer']}'",
        "label": "correct"
    }
    new_validation_data.append(correct_pair)
    
    # Create incorrect pair
    incorrect_pair = {
        "text": f"The question is '{example['question']}' and the answer is '{not example['answer']}'",
        "label": "incorrect"
    }
    new_validation_data.append(incorrect_pair)

print("Training examples:")
print(new_train_data[0])
print(new_train_data[1])
print("\nValidation examples:")
print(new_validation_data[0])
print(new_validation_data[1])

Training examples:
{'text': "The question is 'do iran and afghanistan speak the same language' and the answer is 'True'", 'label': 'correct'}
{'text': "The question is 'do iran and afghanistan speak the same language' and the answer is 'False'", 'label': 'incorrect'}

Validation examples:
{'text': "The question is 'does ethanol take more energy make that produces' and the answer is 'False'", 'label': 'correct'}
{'text': "The question is 'does ethanol take more energy make that produces' and the answer is 'True'", 'label': 'incorrect'}


In [5]:
len(new_train_data)

18854

By construction, the dataset is perfectly balanced: 50% of the pairs are correct and 50% are incorrect.
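
The balance can be verified by counting labels. The snippet below is a minimal sketch: `sample_pairs` is a hypothetical stand-in for `new_train_data`, so it runs without downloading the dataset.

```python
from collections import Counter

# Hypothetical stand-in for new_train_data: two questions, two pairs each
sample_pairs = [
    {"text": "The question is 'q1' and the answer is 'True'", "label": "correct"},
    {"text": "The question is 'q1' and the answer is 'False'", "label": "incorrect"},
    {"text": "The question is 'q2' and the answer is 'False'", "label": "correct"},
    {"text": "The question is 'q2' and the answer is 'True'", "label": "incorrect"},
]

# Count how many pairs carry each label
label_counts = Counter(example["label"] for example in sample_pairs)
print(label_counts)

# Every question contributes exactly one pair per label
assert label_counts["correct"] == label_counts["incorrect"]
```

Running the same count on `new_train_data` should give 9427 pairs per label, since the 18854 pairs come from 9427 questions.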

Next, we separate the inputs (X) from the labels (y).

In [6]:
X_train = [example['text'] for example in new_train_data]
y_train = [example['label'] for example in new_train_data]

X_validation = [example['text'] for example in new_validation_data]
y_validation = [example['label'] for example in new_validation_data]

print(X_train[0])
print(y_train[0])


The question is 'do iran and afghanistan speak the same language' and the answer is 'True'
correct


Let's encode the labels as integers: 1 for `correct` and 0 for `incorrect`.

In [7]:
y_train = [1 if label == 'correct' else 0 for label in y_train]
y_validation = [1 if label == 'correct' else 0 for label in y_validation]

print(y_train[0])
print(y_validation[0])

1
1


## Large Language Model Approach

In [8]:
def create_few_shot_examples(train_data: List[Dict[str, Any]], num_examples: int = 3) -> str:
    """Create few-shot examples from training data for the prompt."""
    examples = random.sample(train_data, min(num_examples, len(train_data)))
    few_shot_text = ""
    for example in examples:
        few_shot_text += f"Text: {example['text']}\n"
        few_shot_text += f"Label: {example['label']}\n\n"
    return few_shot_text

def evaluate_with_llm(
    test_data: List[Dict[str, Any]],
    train_data: List[Dict[str, Any]],
    model_name: str = "gpt-4",
    num_few_shot: int = 3,
    results_dir: str = "results",
    timeout: float = 900.0
) -> Dict[str, Any]:
    """
    Evaluate test data using LLM with few-shot examples.
    
    Args:
        test_data: List of test examples
        train_data: List of training examples for few-shot learning
        model_name: Name of the LLM model to use
        num_few_shot: Number of few-shot examples to include
        results_dir: Directory to store results
        timeout: Timeout for API calls in seconds
        
    Returns:
        Dictionary containing results and metadata
    """
    client = OpenAI(timeout=timeout)
    
    # Create results directory if it doesn't exist
    Path(results_dir).mkdir(parents=True, exist_ok=True)
    
    # Generate few-shot examples
    few_shot_examples = create_few_shot_examples(train_data, num_few_shot)
    
    results = {
        "model": model_name,
        "num_few_shot": num_few_shot,
        "total_examples": len(test_data),
        "predictions": [],
        "metadata": {
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "timeout": timeout
        }
    }
    
    for i, example in enumerate(test_data):
        prompt = f"""Given the following examples:

{few_shot_examples}

Now evaluate this new example:
Text: {example['text']}

Is the answer correct? Respond with {{"answer": "correct"}} or {{"answer": "incorrect"}}"""

        try:
            response = client.with_options(timeout=timeout).chat.completions.create(
                model=model_name,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that evaluates whether answers to questions are correct based on the given text."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0
            )
            
            prediction = response.choices[0].message.content
            results["predictions"].append({
                "example_id": i,
                "prediction": prediction,
                "ground_truth": example["label"]
            })
            
            # Save intermediate results every 10 examples, overwriting a single
            # checkpoint file instead of creating a new timestamped file each time
            if (i + 1) % 10 == 0:
                filename = f"{results_dir}/results_{model_name}_{num_few_shot}shot_checkpoint.json"
                with open(filename, "w") as f:
                    json.dump(results, f, indent=2)
                    
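        except Exception as e:
            print(f"Error processing example {i}: {str(e)}")
            # Keep the ground truth on failed rows so they stay aligned with the data
            results["predictions"].append({
                "example_id": i,
                "error": str(e),
                "ground_truth": example["label"]
            })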
    
    # Save final results
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    filename = f"{results_dir}/results_{model_name}_{num_few_shot}shot_{timestamp}.json"
    with open(filename, "w") as f:
        json.dump(results, f, indent=2)
    
    return results
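
The function above stores the raw model reply in `prediction`, so the reply still has to be mapped to a label before it can be compared with the encoded ground truth. The following is a minimal sketch of one way to do that. `parse_prediction` is a hypothetical helper, not part of the original notebook, and the plain-text fallback is an assumption about how the model might deviate from the requested JSON.

```python
import json
from typing import Optional

def parse_prediction(raw: str) -> Optional[int]:
    """Map a raw model reply to 1 (correct), 0 (incorrect) or None (unparseable)."""
    mapping = {"correct": 1, "incorrect": 0}
    try:
        # Expected case: the reply is a JSON object such as {"answer": "correct"}
        answer = json.loads(raw).get("answer", "")
    except (json.JSONDecodeError, AttributeError):
        # Fallback for non-JSON replies. Check "incorrect" first because
        # it contains "correct" as a substring.
        text = raw.lower()
        if "incorrect" in text:
            return 0
        if "correct" in text:
            return 1
        return None
    return mapping.get(str(answer).strip().lower())

print(parse_prediction('{"answer": "correct"}'))
print(parse_prediction('The answer is incorrect.'))
```

Replies that map to `None` are best counted separately rather than silently treated as wrong.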

In [9]:
# Note: gpt-4o-mini can be passed as model_name instead of gpt-4

# Run evaluation with different few-shot settings
results_0shot = evaluate_with_llm(
    test_data=new_validation_data,
    train_data=new_train_data,
    model_name="gpt-4",
    num_few_shot=0
)

results_3shot = evaluate_with_llm(
    test_data=new_validation_data,
    train_data=new_train_data,
    model_name="gpt-4",
    num_few_shot=3
)

# Convert results to DataFrame
def results_to_df(results: Dict[str, Any]) -> pd.DataFrame:
    """Convert results dictionary to pandas DataFrame."""
    predictions = results["predictions"]
    df = pd.DataFrame(predictions)
    df["model"] = results["model"]
    df["num_few_shot"] = results["num_few_shot"]
    return df

# Create DataFrames
df_0shot = results_to_df(results_0shot)
df_3shot = results_to_df(results_3shot)

# Combine results
df = pd.concat([df_0shot, df_3shot], ignore_index=True)

# Save to CSV
df.to_csv("results/llm_classification_results.csv", index=False)


Error processing example 1455: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
Error processing example 1456: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
Error processing example 1457: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code

KeyboardInterrupt: 
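
The output above shows the loop continuing to call the API after the quota ran out, logging one error per example until it was interrupted by hand. One possible mitigation is to stop after a few consecutive failures. The sketch below illustrates the idea without touching the OpenAI API: `classify_all`, `flaky_classify` and `max_consecutive_errors` are hypothetical names, and the classifier is a stub that simulates the quota error.

```python
from typing import Any, Callable, Dict, List

def classify_all(
    examples: List[Dict[str, Any]],
    classify: Callable[[Dict[str, Any]], str],
    max_consecutive_errors: int = 3,
) -> List[Dict[str, Any]]:
    """Call classify on each example, aborting after repeated failures."""
    predictions: List[Dict[str, Any]] = []
    consecutive_errors = 0
    for i, example in enumerate(examples):
        try:
            prediction = classify(example)
        except Exception as e:
            consecutive_errors += 1
            predictions.append({"example_id": i, "error": str(e)})
            if consecutive_errors >= max_consecutive_errors:
                print(f"Stopping at example {i}: {consecutive_errors} consecutive errors")
                break
            continue
        # A successful call resets the failure counter
        consecutive_errors = 0
        predictions.append({
            "example_id": i,
            "prediction": prediction,
            "ground_truth": example["label"],
        })
    return predictions

# Hypothetical classifier that starts failing at the third example
def flaky_classify(example: Dict[str, Any]) -> str:
    if example["id"] >= 2:
        raise RuntimeError("insufficient_quota")
    return '{"answer": "correct"}'

examples = [{"id": n, "label": "correct"} for n in range(10)]
predictions = classify_all(examples, flaky_classify)
print(len(predictions))
```

In `evaluate_with_llm`, the same counter could wrap the existing `try`/`except`, so that a run ends shortly after the quota is exhausted instead of iterating over the remaining examples.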

In [None]:
df.head()
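
Once predictions are available, accuracy per few-shot setting is the natural summary. Because the dataset is balanced, chance level is 50%. The sketch below works on a list of row dictionaries, such as the one `df.to_dict("records")` would return. `accuracy_by_setting` and the sample `records` are hypothetical, and the substring check is a simplifying assumption about the reply format.

```python
from collections import defaultdict
from typing import Any, Dict, List

def accuracy_by_setting(records: List[Dict[str, Any]]) -> Dict[int, float]:
    """Return accuracy per num_few_shot, skipping rows without a prediction."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for row in records:
        prediction = row.get("prediction")
        if not isinstance(prediction, str):
            continue  # errored rows have a missing or NaN prediction
        # Assumption: any reply not containing "incorrect" means "correct"
        predicted = "incorrect" if "incorrect" in prediction.lower() else "correct"
        totals[row["num_few_shot"]] += 1
        hits[row["num_few_shot"]] += int(predicted == row["ground_truth"])
    return {setting: hits[setting] / totals[setting] for setting in totals}

# Hypothetical rows in the shape produced by results_to_df
records = [
    {"prediction": '{"answer": "correct"}', "ground_truth": "correct", "num_few_shot": 0},
    {"prediction": '{"answer": "correct"}', "ground_truth": "incorrect", "num_few_shot": 0},
    {"prediction": '{"answer": "incorrect"}', "ground_truth": "incorrect", "num_few_shot": 3},
    {"prediction": float("nan"), "ground_truth": "correct", "num_few_shot": 3},
    {"error": "insufficient_quota", "num_few_shot": 3},
]

accuracy = accuracy_by_setting(records)
print(accuracy)
```

Rows that failed with an API error are excluded from the denominator here; reporting how many were skipped alongside the accuracy keeps the comparison between settings honest.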

Thanks for reading the code. Connect with me on [LinkedIn](https://www.linkedin.com/in/alvaro-francisco-gil/) or [GitHub](https://github.com/alvaro-francisco-gil) if you have any questions or comments.

*Álvaro Francisco Gil*