## Using an LLM as a Judge for RAG system evaluation

In [None]:
from groq import Groq

# Groq() reads the API key from the GROQ_API_KEY environment variable
client = Groq()



In [4]:
completion = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {
            "role": "user",
            "content": "Explain why fast inference is critical for reasoning models"
        }
    ]
)
print(completion.choices[0].message.content)

Fast inference is critical for reasoning models because it enables these models to make decisions quickly and efficiently, which is essential for many real-world applications. Here are some reasons why fast inference is crucial for reasoning models:

1. **Real-time Decision-Making**: Many applications, such as self-driving cars, robotics, and video games, require rapid decision-making to respond to changing situations. Fast inference allows reasoning models to make decisions in real-time, which is critical for these applications.
2. **Low Latency**: In many applications, low latency is essential to provide a responsive and interactive experience. Fast inference helps to reduce the latency between input and output, making the system more responsive and user-friendly.
3. **Scalability**: As the number of users or inputs increases, the system needs to be able to handle the increased load without a significant decrease in performance. Fast inference enables reasoning models to scale more e

In [1]:
import sys
from pathlib import Path

# Add the src folder to sys.path
src_path = Path(r"D:\Study and Planning\Projects\Story Retrieval and Classification System\src")
if str(src_path) not in sys.path:
    sys.path.append(str(src_path))

In [None]:
# Function to create system + user messages for the evaluator
def generate_eval_messages(question, rag_answer, gt_answer):
    messages = [
        {
            "role": "system",
            "content": (
                "You are an evaluator for retrieval-augmented generation (RAG) answers. "
                "Your task is to score the RAG model's answers based on correctness, completeness, "
                "and relevance compared to the query itself and the ground truth answer.\n\n"
                "Scoring Guidelines:\n"
                "1. Score from 1 to 5:\n"
                "   - 5: Completely correct, fully matches or explains the ground truth.\n"
                "   - 4: Mostly correct, minor details missing or slightly inaccurate.\n"
                "   - 3: Partially correct, some important points missing.\n"
                "   - 2: Mostly incorrect, significant errors.\n"
                "   - 1: Completely incorrect or irrelevant.\n"
                "2. Provide a short reasoning (1-2 sentences) explaining the score.\n"
                "3. ALWAYS output ONLY in JSON format:\n"
                '{"score": <number>, "reasoning": "<your short reasoning>"}'
            )
        },
        {
            "role": "user",
            "content": f"""
Question:
{question}

RAG Answer:
{rag_answer}

Ground Truth:
{gt_answer}
"""
        }
    ]
    return messages

In [18]:
import json

# Load the evaluation set: a list of entries with "question", "rag_answer", and "GT_answer" keys
with open("evaluation_data.json", "r", encoding="utf-8") as f:
    eval_data = json.load(f)
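
Before sending entries to the judge, it can help to confirm that each one carries the three keys the evaluation loop reads. The following is a minimal sketch; `REQUIRED_KEYS`, `find_invalid_entries`, and the sample entries are hypothetical names and data introduced for illustration, not part of the original notebook.

```python
# Keys the evaluation loop expects on every entry
REQUIRED_KEYS = ("question", "rag_answer", "GT_answer")


def find_invalid_entries(entries):
    """Return (index, missing_keys) pairs for entries lacking required fields."""
    problems = []
    for index, entry in enumerate(entries):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical entries, for illustration only
sample_entries = [
    {"question": "Q1", "rag_answer": "A1", "GT_answer": "G1"},
    {"question": "Q2", "rag_answer": "A2"},
]
problems = find_invalid_entries(sample_entries)
print(problems)
```

Running a check like this on `eval_data` right after loading would surface malformed entries before any API calls are made.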

In [21]:
results = []

for entry in eval_data:
    question = entry["question"]
    rag_answer = entry["rag_answer"]
    gt_answer = entry["GT_answer"]

    messages = generate_eval_messages(question, rag_answer, gt_answer)

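    # Ask the judge model for a verdict; 150 tokens should cover the short JSON reply
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=messages,
        max_tokens=150
    )

    # Keep the raw reply text; it is stored unparsed and may not be valid JSON
    result_text = completion.choices[0].message.content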
    results.append({
        "question": question,
        "rag_answer": rag_answer,
        "GT_answer": gt_answer,
        "evaluation": result_text
    })

# Save the results to a new JSON file
with open("eval_results.json", "w") as f:
    json.dump(results, f, indent=2)

print("Evaluation complete! Results saved to eval_results.json")

Evaluation complete! Results saved to eval_results.json
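
The loop stores each judge reply as raw text, so the scores still have to be extracted before they can be aggregated. The sketch below shows one possible way to do that, assuming the replies follow the `{"score": ..., "reasoning": ...}` shape requested in the system prompt. `parse_evaluation`, `summarize_scores`, and the sample replies are hypothetical and introduced only for illustration.

```python
import json
import re


def parse_evaluation(text):
    """Extract {"score", "reasoning"} from a judge reply; return None if unusable."""
    # The reply may wrap the JSON in prose or code fences, so take the span
    # from the first "{" to the last "}" instead of parsing the whole reply.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    score = parsed.get("score")
    # Reject booleans, non-numbers, and anything outside the 1-5 rubric
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 1 <= score <= 5:
        return None
    return {"score": score, "reasoning": str(parsed.get("reasoning", ""))}


def summarize_scores(results):
    """Return (mean score or None, number of unparseable replies)."""
    scores, failures = [], 0
    for item in results:
        parsed = parse_evaluation(item["evaluation"])
        if parsed is None:
            failures += 1
        else:
            scores.append(parsed["score"])
    mean = sum(scores) / len(scores) if scores else None
    return mean, failures


# Hypothetical judge replies, for illustration only
sample = [
    {"evaluation": '{"score": 4, "reasoning": "Minor detail missing."}'},
    {"evaluation": 'Here is my verdict:\n{"score": 2, "reasoning": "Wrong character."}'},
    {"evaluation": '{"score": 5, "reasoning": "Cut off'},
]
mean_score, failed = summarize_scores(sample)
print(mean_score, failed)
```

Counting unparseable replies separately keeps truncated or malformed outputs from silently skewing the mean.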


In [None]:
import time
import torch
import random


In [None]:
from RAG.stores.llm.providers import quantized_model


def random_prompt(length_tokens: int) -> str:
    # Build a prompt of `length_tokens` whitespace-separated pseudo-words
    # ("token0" .. "token999"); a real tokenizer may split each into several tokens
    words = [f"token{random.randint(0, 999)}" for _ in range(length_tokens)]
    return " ".join(words)


In [None]:
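prompt = random_prompt(200)  # 200 whitespace-separated pseudo-words as input

# Load the model before starting the timer so load time is not counted as latency
llm = quantized_model()

start_time = time.perf_counter()
answer = llm.run_chat(messages=prompt)
end_time = time.perf_counter()

latency = end_time - start_time
print(f"Latency: {latency:.4f} seconds")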


### Calculating the latency and generation speed for the 4-bit quantized Qwen model

In [None]:
latencies = []
tokens_per_second_list = []

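for _ in range(10):
    prompt = random_prompt(200)  # 200 whitespace-separated pseudo-words
    num_tokens = len(prompt.split())  # word count of the input prompt, used as a token proxy

    start_time = time.perf_counter()
    response = llm.run_chat(prompt)
    end_time = time.perf_counter()

    latency = end_time - start_time
    latencies.append(latency)

    # Note: this divides the input word count by latency, so it only
    # approximates generation speed; the length of `response` is not measured
    speed = num_tokens / latency
    tokens_per_second_list.append(speed)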

avg_latency = sum(latencies) / len(latencies)
avg_speed = sum(tokens_per_second_list) / len(tokens_per_second_list)

print(f"Average Latency: {avg_latency:.4f} seconds")
print(f"Average Generation Speed: {avg_speed:.2f} tokens/sec")
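
The loop above divides the prompt's word count by the latency, which measures how fast the input is consumed rather than how fast text is generated. A hedged sketch of an output-based measurement follows. It assumes the model call returns the generated text as a string, which the notebook does not confirm for `llm.run_chat`; `benchmark`, `fake_generate`, and `count_tokens` are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
import statistics
import time


def benchmark(generate, prompts, count_tokens=lambda text: len(text.split())):
    """Time `generate` on each prompt and report latency and output throughput."""
    latencies, speeds = [], []
    for prompt in prompts:
        start = time.perf_counter()
        output = generate(prompt)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        # Throughput is based on what the model produced, not on the prompt
        speeds.append(count_tokens(output) / elapsed)
    return {
        "avg_latency_s": statistics.mean(latencies),
        "avg_tokens_per_s": statistics.mean(speeds),
    }


def fake_generate(prompt):
    # Stand-in for a real model call (assumption: returns generated text)
    time.sleep(0.01)
    return "one two three four five"


stats = benchmark(fake_generate, ["a", "b", "c"])
print(stats)
```

With a real model, passing the model's own tokenizer as `count_tokens` would give a token count closer to what the model actually produced.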


### Calculating the latency and generation speed for the 8-bit quantized Qwen model

In [None]:
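from RAG.stores.llm.providers import quantized_model

# Same constructor call as in the 4-bit section; the notebook does not show how
# the 8-bit variant is selected, so it is presumably configured inside quantized_model()
llm = quantized_model()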

In [None]:
latencies = []
tokens_per_second_list = []

for _ in range(10):
    prompt = random_prompt(200)  # 200 whitespace-separated pseudo-words
    num_tokens = len(prompt.split())  # word count of the input prompt, used as a token proxy

    start_time = time.perf_counter()
    response = llm.run_chat(prompt)
    end_time = time.perf_counter()

    latency = end_time - start_time
    latencies.append(latency)

    # Input word count divided by latency; the length of `response` is not measured
    speed = num_tokens / latency
    tokens_per_second_list.append(speed)

avg_latency = sum(latencies) / len(latencies)
avg_speed = sum(tokens_per_second_list) / len(tokens_per_second_list)

print(f"Average Latency: {avg_latency:.4f} seconds")
print(f"Average Generation Speed: {avg_speed:.2f} tokens/sec")


### Calculating the latency and generation speed for the main fp16 Qwen model

In [None]:
from RAG.stores.llm.providers import GenerativeModel
llm = GenerativeModel()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
latencies = []
tokens_per_second_list = []

for _ in range(10):
    prompt = random_prompt(200)  # 200 whitespace-separated pseudo-words
    num_tokens = len(prompt.split())  # word count of the input prompt, used as a token proxy

    start_time = time.perf_counter()
    response = llm.run_chat(prompt)
    end_time = time.perf_counter()

    latency = end_time - start_time
    latencies.append(latency)

    # Input word count divided by latency; the length of `response` is not measured
    speed = num_tokens / latency
    tokens_per_second_list.append(speed)

avg_latency = sum(latencies) / len(latencies)
avg_speed = sum(tokens_per_second_list) / len(tokens_per_second_list)

print(f"Average Latency: {avg_latency:.4f} seconds")
print(f"Average Generation Speed: {avg_speed:.2f} tokens/sec")
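
When comparing the three model variants, a single mean over ten runs can be distorted by a slow first call (for example, cache warm-up). The sketch below shows one way to summarize a list of latencies more robustly. `summarize_latencies` and its `warmup` parameter are hypothetical, and the example values are placeholders, not measurements from any model.

```python
import statistics


def summarize_latencies(latencies, warmup=1):
    """Summarize latency samples after discarding the first `warmup` runs."""
    samples = latencies[warmup:]
    if len(samples) < 2:
        raise ValueError("need at least two samples after warm-up")
    return {
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "stdev_s": statistics.stdev(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


# Placeholder values for illustration, not measurements from any model
example = [4.0, 2.0, 2.5, 1.5, 2.0]
summary = summarize_latencies(example)
print(summary)
```

Applying the same summary to the `latencies` list from each section would make the 4-bit, 8-bit, and fp16 runs directly comparable.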
