# LLM Reliability Engine

## Environment Setup

In [55]:
import os
import json
import time
from dotenv import load_dotenv
from tqdm import tqdm

from openai import OpenAI

load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

print("✅ Environment ready")


✅ Environment ready


### LLM Call Helper

The LLM call helper provides a single interface for interacting with the language model throughout the project.

Instead of calling the model directly in multiple locations, this function centralizes all model interactions. This approach improves maintainability, consistency, and reproducibility during testing.

The helper serves several purposes:

- **Consistency**: Ensures all prompts use the same model configuration and parameters.
- **Control**: Allows temperature and model settings to be adjusted from a single location.
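- **Reproducibility**: Defaults temperature to 0 during reliability evaluation, which minimizes sampling randomness. Outputs can still differ between runs, which is exactly what the tests in this notebook measure.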
- **Scalability**: Simplifies future extensions such as logging, response validation, or retry logic.

In production AI systems, abstraction layers like this are commonly used to separate business logic from model interaction, making evaluation and iteration safer and more controlled.

In [56]:
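def call_llm(prompt, temperature=0, model="gpt-4o-mini"):
    """Send a single-turn user prompt and return the reply as stripped text."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    # message.content may be None, so fall back to an empty string before stripping
    return (response.choices[0].message.content or "").strip()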
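The helper section lists retry logic as a possible extension. A minimal sketch of what that could look like, assuming a generic wrapper around any callable; the name `with_retries` and its parameters are hypothetical and not part of the original notebook. It is demonstrated on a stand-in function rather than a real API call.

```python
import time


def with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on an exception, wait and try again up to `attempts` times.

    Hypothetical helper: a real notebook would likely catch only
    transient API errors instead of every Exception.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for an LLM call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated transient failure")
    return "OK"


result = with_retries(flaky_call, attempts=3, base_delay=0, sleep=lambda _: None)
print(result, calls["count"])
```

With the notebook's helper this would be used as `with_retries(lambda: call_llm(prompt))`, which keeps the retry policy separate from the model call itself.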


### Multi-Run Testing Engine

Large Language Models are probabilistic systems, meaning that identical prompts can produce different outputs across multiple executions. A prompt that appears correct in a single test may still fail when used repeatedly in production.

To evaluate prompt reliability, this function executes the same prompt multiple times and records all responses. This enables systematic identification of failure patterns such as:

- format drift
- inconsistent field naming
- variation in structure or wording
- unexpected additional explanations

By analyzing outputs across multiple runs (5, 10, and 15 iterations), reliability can be measured empirically rather than assumed from a single successful response.

This approach mirrors real-world AI quality assurance workflows, where prompt behavior is validated through repeated testing before deployment into production systems.

In [65]:
def run_multiple_tests(prompt, runs=5, delay=1, temperature=0, model="gpt-4o-mini"):
    results = []
    for i in tqdm(range(runs)):
        output = call_llm(prompt, temperature=temperature, model=model)
        results.append({"run": i + 1, "output": output})
        time.sleep(delay)
    return results


In [66]:
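def save_results(results, filepath):
    # Create the parent directory only when the path includes one;
    # os.makedirs("") raises FileNotFoundError for a bare filename.
    directory = os.path.dirname(filepath)
    if directory:
        os.makedirs(directory, exist_ok=True)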

    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    print(f"✅ Results saved to {filepath}")


In [67]:
test_prompt = "Reply with exactly one word: OK"
outputs = run_multiple_tests(test_prompt, runs=3, delay=1)
outputs


100%|██████████| 3/3 [00:04<00:00,  1.66s/it]


[{'run': 1, 'output': 'OK'},
 {'run': 2, 'output': 'OK'},
 {'run': 3, 'output': 'OK'}]
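Once several runs are collected, one simple way to put a number on reliability is the share of runs that match the most common output. The sketch below assumes the result structure returned by `run_multiple_tests`; the function name `consistency_rate` is hypothetical, and the sample data is synthetic.

```python
from collections import Counter


def consistency_rate(results):
    """Return the share of runs whose output equals the most common output."""
    outputs = [r["output"] for r in results]
    if not outputs:
        return 0.0
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / len(outputs)


# Synthetic results in the same shape run_multiple_tests returns.
sample = [
    {"run": 1, "output": "OK"},
    {"run": 2, "output": "OK"},
    {"run": 3, "output": "OK."},
    {"run": 4, "output": "OK"},
]
print(consistency_rate(sample))
```

A rate of 1.0 means every run produced the same string; anything lower points to drift worth inspecting.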

### V1 uncontrolled generation

In [68]:
extraction_prompt_v1 = """
Extract information from this customer feedback:

"I ordered item #12345 on March 15th. The delivery was fast but the packaging was damaged."
"""


In [69]:
print(call_llm(extraction_prompt_v1))


- **Order Number:** #12345
- **Order Date:** March 15th
- **Delivery Speed:** Fast
- **Issue:** Damaged packaging


In [70]:
def preview_runs(results, n=5):
    for r in results[:n]:
        print(f"Run {r['run']}: {r['output']}\n---")


In [71]:
sentiment_v1_runs_5 = run_multiple_tests(sentiment_prompt_v1, runs=5, delay=1)
preview_runs(sentiment_v1_runs_5)
save_results(sentiment_v1_runs_5, "results/sentiment_v1_runs5.json")


100%|██████████| 5/5 [00:09<00:00,  1.88s/it]

Run 1: The sentiment of the customer message is positive.
---
Run 2: The sentiment of the customer message is positive.
---
Run 3: The sentiment of the customer message is positive.
---
Run 4: The sentiment of the customer message is positive.
---
Run 5: The sentiment of the customer message is positive.
---
✅ Results saved to results/sentiment_v1_runs5.json





### Extraction Task — Baseline Testing (v1, 5 Runs)

The initial extraction prompt (v1) represents a zero-shot baseline without structural constraints or output formatting requirements.

The objective of this stage is to observe how the model behaves when given minimal guidance and to identify reliability issues that may emerge across repeated executions.

The prompt is executed five times using identical inputs to evaluate:

- consistency of output structure
- stability of field naming
- presence of additional explanatory text
- variation in how extracted information is represented

At this stage, variability is expected. The goal is not correctness alone, but to identify failure patterns that could affect downstream systems relying on structured outputs.


In [72]:
extraction_v1_runs_5 = run_multiple_tests(extraction_prompt_v1, runs=5, delay=1)
preview_runs(extraction_v1_runs_5)
save_results(extraction_v1_runs_5, "results/extraction_v1_runs5.json")


100%|██████████| 5/5 [00:11<00:00,  2.27s/it]

Run 1: - **Order Number:** #12345
- **Order Date:** March 15th
- **Delivery Speed:** Fast
- **Issue:** Damaged packaging
---
Run 2: - **Order Number**: #12345
- **Order Date**: March 15th
- **Delivery Speed**: Fast
- **Issue**: Damaged packaging
---
Run 3: - **Order Number**: #12345
- **Order Date**: March 15th
- **Delivery Speed**: Fast
- **Issue**: Damaged packaging
---
Run 4: - **Order Number**: #12345
- **Order Date**: March 15th
- **Delivery Speed**: Fast
- **Issue**: Damaged packaging
---
Run 5: - **Order Number**: #12345
- **Order Date**: March 15th
- **Delivery Speed**: Fast
- **Issue**: Damaged packaging
---
✅ Results saved to results/extraction_v1_runs5.json





In [73]:
extraction_v1_runs_5[-1]["output"]


'- **Order Number**: #12345\n- **Order Date**: March 15th\n- **Delivery Speed**: Fast\n- **Issue**: Damaged packaging'

In [74]:
from collections import Counter

def unique_output_count(results):
    outputs = [r["output"] for r in results]
    return len(set(outputs)), Counter(outputs).most_common(3)


In [75]:
extraction_v1_runs_10 = run_multiple_tests(extraction_prompt_v1, runs=10, delay=1)
save_results(extraction_v1_runs_10, "results/extraction_v1_runs10.json")

unique_output_count(extraction_v1_runs_10)



100%|██████████| 10/10 [00:24<00:00,  2.41s/it]

✅ Results saved to results/extraction_v1_runs10.json





(3,
 [('Here is the extracted information from the customer feedback:\n\n- **Order Number**: #12345\n- **Order Date**: March 15th\n- **Delivery Speed**: Fast\n- **Packaging Condition**: Damaged',
   6),
  ('- **Order Number**: #12345\n- **Order Date**: March 15th\n- **Delivery Speed**: Fast\n- **Packaging Condition**: Damaged',
   3),
  ('- **Order Number**: #12345\n- **Order Date**: March 15th\n- **Delivery Speed**: Fast\n- **Issue**: Damaged packaging',
   1)])
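The three variants above differ in two ways: one adds an introductory sentence, and the last field is named either "Packaging Condition" or "Issue". A minimal sketch of how these two failure patterns could be detected automatically, assuming the outputs are markdown bullet lists with bolded labels as shown; the helper names are hypothetical.

```python
import re

# Matches "- **Label**:" as well as "- **Label:**" at the start of a line.
FIELD_PATTERN = re.compile(r"^- \*\*(.+?):?\*\*", re.MULTILINE)


def field_names(output):
    """Return the bolded field labels from a markdown bullet list, in order."""
    return tuple(FIELD_PATTERN.findall(output))


def has_preamble(output):
    """True when the output starts with something other than a bullet."""
    return not output.lstrip().startswith("- ")


# Two of the outputs recorded in the 10-run test.
with_intro = (
    "Here is the extracted information from the customer feedback:\n\n"
    "- **Order Number**: #12345\n- **Order Date**: March 15th\n"
    "- **Delivery Speed**: Fast\n- **Packaging Condition**: Damaged"
)
plain = (
    "- **Order Number**: #12345\n- **Order Date**: March 15th\n"
    "- **Delivery Speed**: Fast\n- **Issue**: Damaged packaging"
)

print(has_preamble(with_intro), has_preamble(plain))
# Field names that appear in one output but not the other.
print(set(field_names(with_intro)) ^ set(field_names(plain)))
```

Comparing field names as sets separates naming drift from cosmetic differences such as where the colon sits relative to the bold markers.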

### V2 structured output

In [76]:
sentiment_prompt_v2 = """
Classify the sentiment of the following customer message.

Respond with ONLY one word.
Allowed responses:
Positive
Negative
Neutral

Customer message:
"I love this product! It's exactly what I needed."
"""


In [77]:
print(call_llm(sentiment_prompt_v2))


Positive


### Product Description Generation — Structured Prompt (v2)

The initial product description prompt (v1) produced outputs with significant variation in length, tone, and structure. While variability is expected in generative tasks, uncontrolled variation can lead to inconsistent brand messaging and unpredictable content quality.

In this iteration, explicit structural constraints are introduced to guide model behavior while preserving creative flexibility. The prompt specifies:

- target length range
- tone and style expectations
- inclusion requirements
- formatting restrictions

The objective is not to eliminate variation entirely, but to reduce unwanted variability and improve consistency across repeated executions. This reflects real-world content generation workflows where AI outputs must remain aligned with brand and editorial standards.
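Some of the constraints in `product_prompt_v2` (60 to 80 words, price mentioned once, no headings or bullet points) can be checked mechanically; tone cannot. A minimal sketch of such a check, assuming whitespace-delimited word counts; the function `check_product_description` is hypothetical and not part of the original notebook.

```python
import re


def check_product_description(text, price="$29.99", min_words=60, max_words=80):
    """Return a list of violated requirements (an empty list means all checks pass)."""
    problems = []
    word_count = len(text.split())
    if not (min_words <= word_count <= max_words):
        problems.append(f"word_count={word_count}")
    if text.count(price) != 1:
        problems.append("price_not_exactly_once")
    # Lines starting with '#', or with '-', '*', '•' plus a space, count as headings or bullets.
    if re.search(r"^\s*(#|[-*•]\s)", text, flags=re.MULTILINE):
        problems.append("heading_or_bullet")
    return problems


# Synthetic examples for illustration only.
too_short = "- Wireless mouse for $29.99."
passing = " ".join(["word"] * 69 + ["$29.99"])

print(check_product_description(too_short))
print(check_product_description(passing))
```

Returning a list of problems rather than a single boolean makes it easier to tally which requirement fails most often across repeated runs.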

In [78]:
product_prompt_v2 = """
Create a product description for a wireless mouse that costs $29.99.

Requirements:
- Length: 60–80 words
- Professional marketing tone
- Include price once
- Do not include headings or bullet points
"""


In [79]:
extraction_prompt_v2 = """
Extract information from the customer feedback below.

Return the result in EXACTLY this format.
Do not add explanations or additional text.

Order Number:
Order Date:
Delivery Speed:
Issue:

Customer feedback:
"I ordered item #12345 on March 15th. The delivery was fast but the packaging was damaged."
"""



In [80]:
extraction_v2_runs_5 = run_multiple_tests(extraction_prompt_v2, runs=5, delay=1)
unique_output_count(extraction_v2_runs_5)


100%|██████████| 5/5 [00:10<00:00,  2.08s/it]


(2,
 [('Order Number: 12345  \nOrder Date: March 15th  \nDelivery Speed: Fast  \nIssue: Packaging was damaged',
   4),
  ('Order Number: 12345  \nOrder Date: March 15th  \nDelivery Speed: Fast  \nIssue: Damaged packaging',
   1)])

Based on the failure analysis from v1 testing, the prompts were updated to introduce explicit structure and output constraints.

The goal of v2 is to reduce variability caused by:
- format drift
- inconsistent field naming
- additional explanatory text

No few-shot examples or reasoning steps are introduced yet. The focus is purely on structural consistency.


After introducing explicit output structure requirements, the introductory sentence and the field-name changes seen in v1 (such as "Issue" versus "Packaging Condition") no longer appeared.

The model consistently returned the same schema and avoided introductory explanations; the two distinct outputs differ only in how the Issue value is phrased. This suggests that structural constraints alone can substantially improve format reliability in extraction tasks.


In [81]:
extraction_v2_runs_10 = run_multiple_tests(
    extraction_prompt_v2,
    runs=10,
    delay=1
)

save_results(
    extraction_v2_runs_10,
    "results/extraction_v2_runs10.json"
)

unique_output_count(extraction_v2_runs_10)


100%|██████████| 10/10 [00:21<00:00,  2.12s/it]

✅ Results saved to results/extraction_v2_runs10.json





(2,
 [('Order Number: 12345  \nOrder Date: March 15th  \nDelivery Speed: Fast  \nIssue: Packaging was damaged',
   5),
  ('Order Number: 12345  \nOrder Date: March 15th  \nDelivery Speed: Fast  \nIssue: Damaged packaging',
   5)])

After introducing explicit structural constraints, output variability dropped from four unique formats across the v1 runs to two in v2. The remaining variation occurred only in value phrasing ("Packaging was damaged" vs. "Damaged packaging"), split evenly at five runs each in the 10-run test, while the schema and structure remained stable.

This demonstrates that structural constraints alone significantly improve format reliability, but do not fully eliminate variation in how values are worded.
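The claim that only value phrasing varies can be checked by parsing each output into fields and comparing keys and values separately. A minimal sketch, assuming the `Key: value` line format requested in `extraction_prompt_v2`; the helper `parse_fields` is hypothetical.

```python
def parse_fields(output):
    """Parse 'Key: value' lines into a dict, ignoring trailing spaces and other lines."""
    fields = {}
    for line in output.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


# The two distinct outputs recorded in the v2 runs.
variant_a = (
    "Order Number: 12345  \nOrder Date: March 15th  \n"
    "Delivery Speed: Fast  \nIssue: Packaging was damaged"
)
variant_b = (
    "Order Number: 12345  \nOrder Date: March 15th  \n"
    "Delivery Speed: Fast  \nIssue: Damaged packaging"
)

a, b = parse_fields(variant_a), parse_fields(variant_b)
same_schema = list(a) == list(b)
differing = [k for k in a if a[k] != b.get(k)]
print(same_schema, differing)
```

For these two variants the field names and order match, and `Issue` is the only field whose value differs.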


### V3 structured reasoning + consistent behavior

While structural constraints in v2 significantly improved output consistency, reliability can still degrade when inputs become more complex or ambiguous.

In this iteration, advanced prompt engineering techniques are introduced:

- Few-shot prompting to demonstrate the desired output behavior through examples
- Chain-of-Thought reasoning to guide the model through intermediate reasoning steps before producing structured output

The goal of v3 is to improve robustness and maintain consistent behavior across a wider range of inputs while preserving the structured format introduced in v2.


### Structured LLM Output Helper (JSON Mode)

In production AI systems, model outputs are often consumed by downstream applications such as databases, automation workflows, or analytics pipelines. Free-form text responses introduce reliability risks because small variations in formatting can break parsing logic.

This helper function extends the standard LLM call by enforcing structured JSON output using the model’s response formatting capabilities. By requiring valid JSON responses:

- outputs become machine-readable and pipeline-safe
- format drift is significantly reduced
- schema validation becomes possible
- automated evaluation and reliability scoring can be applied

This approach reflects real-world AI system design, where structured outputs are preferred over natural language responses when consistency and automation are required.

In [82]:
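def call_llm_json(prompt, temperature=0, model="gpt-4o-mini"):
    """Call the model in JSON mode and return the parsed response as a Python object."""
    # JSON mode constrains the reply to syntactically valid JSON but does not enforce a schema;
    # the prompt itself must also mention JSON for the API to accept the request.
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)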


In [83]:
extraction_prompt_v3 = """
You are an information extraction engine. Extract structured fields from customer feedback.

Think step-by-step internally to identify the correct values, normalize wording, and map to the schema.
Do NOT output your reasoning. Output ONLY valid JSON.

Schema (required keys):
{
  "order_number": "string or null",
  "order_date": "string or null",
  "delivery_speed": "fast|normal|slow|null",
  "issue": "damaged_packaging|late_delivery|wrong_item|missing_items|customer_support|other|null"
}

Normalization rules:
- order_number: digits only (e.g., "12345")
- delivery_speed: map phrases to one of fast/normal/slow
- issue: choose the closest enum value (e.g., packaging damaged → "damaged_packaging")

Examples:

Feedback: "Order #555 arrived late and the box was crushed."
JSON:
{"order_number":"555","order_date":null,"delivery_speed":"slow","issue":"damaged_packaging"}

Feedback: "I ordered item #999 on April 2. Delivery was quick but support never replied."
JSON:
{"order_number":"999","order_date":"April 2","delivery_speed":"fast","issue":"customer_support"}

Now extract from this feedback:
"I ordered item #12345 on March 15th. The delivery was fast but the packaging was damaged."
"""


In [84]:
call_llm_json(extraction_prompt_v3)


{'order_number': '12345',
 'order_date': 'March 15th',
 'delivery_speed': 'fast',
 'issue': 'damaged_packaging'}
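The multi-run cell that follows calls `eval_extraction_json`, whose definition is not included in this notebook export. A minimal sketch of what such a validator might look like, assuming it checks the keys and enum values listed in the schema of `extraction_prompt_v3` and returns a `(valid, reason)` pair. The reason labels are hypothetical, and `null` is assumed to mean JSON null rather than the string "null".

```python
REQUIRED_KEYS = {"order_number", "order_date", "delivery_speed", "issue"}
ALLOWED_SPEED = {"fast", "normal", "slow", None}
ALLOWED_ISSUE = {
    "damaged_packaging", "late_delivery", "wrong_item",
    "missing_items", "customer_support", "other", None,
}


def _str_or_none(value):
    return value is None or isinstance(value, str)


def eval_extraction_json(obj):
    """Return (valid, reason) for one parsed model response (assumed behavior)."""
    if not isinstance(obj, dict):
        return False, "not_a_dict"
    if set(obj.keys()) != REQUIRED_KEYS:
        return False, "wrong_keys"
    number = obj["order_number"]
    # The prompt's normalization rule asks for digits only (or null).
    if number is not None and not (isinstance(number, str) and number.isdigit()):
        return False, "bad_order_number"
    if not _str_or_none(obj["order_date"]):
        return False, "bad_order_date"
    speed, issue = obj["delivery_speed"], obj["issue"]
    if not _str_or_none(speed) or speed not in ALLOWED_SPEED:
        return False, "bad_delivery_speed"
    if not _str_or_none(issue) or issue not in ALLOWED_ISSUE:
        return False, "bad_issue"
    return True, "ok"


# The response shown in the cell output.
recorded = {
    "order_number": "12345",
    "order_date": "March 15th",
    "delivery_speed": "fast",
    "issue": "damaged_packaging",
}
print(eval_extraction_json(recorded))
print(eval_extraction_json({**recorded, "order_number": "#12345"}))
```

Checking keys before values means a schema failure is reported once rather than as several value errors.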

In [85]:
def run_multiple_tests_json(prompt, runs=15, delay=1, temperature=0, model="gpt-4o-mini"):
    results = []
    for i in tqdm(range(runs)):
        obj = call_llm_json(prompt, temperature=temperature, model=model)
        valid, reason = eval_extraction_json(obj)
        results.append({"run": i+1, "output": obj, "valid": valid, "reason": reason})
        time.sleep(delay)
    return results

extraction_v3_runs_15 = run_multiple_tests_json(extraction_prompt_v3, runs=15, delay=1)
save_results(extraction_v3_runs_15, "results/extraction_v3_runs15.json")

valid_rate = sum(r["valid"] for r in extraction_v3_runs_15) / len(extraction_v3_runs_15)
unique_outputs = len({json.dumps(r["output"], sort_keys=True) for r in extraction_v3_runs_15})

valid_rate, unique_outputs


100%|██████████| 15/15 [00:28<00:00,  1.92s/it]

✅ Results saved to results/extraction_v3_runs15.json





(1.0, 1)

In [86]:
valid_rate = sum(r["valid"] for r in extraction_v3_runs_15) / len(extraction_v3_runs_15)
unique_outputs = len({json.dumps(r["output"], sort_keys=True) for r in extraction_v3_runs_15})

print("valid_rate:", valid_rate)
print("unique_outputs:", unique_outputs)


valid_rate: 1.0
unique_outputs: 1


In [87]:
sentiment_prompt_v3 = """
You are a sentiment classification engine.

Task:
Classify the customer message sentiment as exactly one of:
Positive, Negative, Neutral

Output rules:
- Output ONLY the single word label.
- No punctuation, no explanation.

Examples:
Message: "This is amazing — I’m so happy with it."
Sentiment: Positive

Message: "Terrible experience. I want a refund."
Sentiment: Negative

Message: "It arrived yesterday."
Sentiment: Neutral

Now classify:
Message: "I love this product! It's exactly what I needed."
Sentiment:
"""


In [88]:
print(call_llm(sentiment_prompt_v3))


Positive


In [89]:
ALLOWED_SENTIMENT = {"Positive", "Negative", "Neutral"}

def eval_sentiment(output: str):
    return output in ALLOWED_SENTIMENT

def run_multiple_tests_text(prompt, runs=15, delay=1, temperature=0, model="gpt-4o-mini"):
    results = []
    for i in tqdm(range(runs)):
        out = call_llm(prompt, temperature=temperature, model=model)
        results.append({"run": i+1, "output": out, "valid": eval_sentiment(out)})
        time.sleep(delay)
    return results

sentiment_v3_runs_15 = run_multiple_tests_text(sentiment_prompt_v3, runs=15, delay=1)
save_results(sentiment_v3_runs_15, "results/sentiment_v3_runs15.json")

valid_rate = sum(r["valid"] for r in sentiment_v3_runs_15) / len(sentiment_v3_runs_15)
unique_outputs = len({r["output"] for r in sentiment_v3_runs_15})

valid_rate, unique_outputs


100%|██████████| 15/15 [00:23<00:00,  1.56s/it]

✅ Results saved to results/sentiment_v3_runs15.json





(1.0, 1)
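`eval_sentiment` uses an exact membership test, so an output like `positive.` would count as invalid even though the label itself is right. A minimal sketch of a more granular check that separates formatting slips from genuinely wrong outputs; the helper names and the three category labels are hypothetical.

```python
import string

ALLOWED_SENTIMENT = {"Positive", "Negative", "Neutral"}


def normalize_label(output):
    """Strip surrounding whitespace and punctuation, then normalize capitalization."""
    return output.strip(string.punctuation + string.whitespace).capitalize()


def classify_output(output):
    """Return 'exact', 'format_only', or 'invalid' for one model output."""
    if output in ALLOWED_SENTIMENT:
        return "exact"
    if normalize_label(output) in ALLOWED_SENTIMENT:
        return "format_only"
    return "invalid"


for sample in ["Positive", "positive.", "The sentiment is positive."]:
    print(repr(sample), classify_output(sample))
```

Keeping the strict check as the pass criterion and using the lenient one only for diagnosis preserves the original reliability metric.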

In [91]:
call_llm_json(product_prompt_v3)


{'title': 'Wireless Mouse with Ergonomic Shape and Silent Clicks',
 'bullets': ['2.4GHz wireless connectivity for a reliable connection',
  'Silent clicks for a distraction-free environment',
  'Ergonomic shape designed for comfort during long hours of use',
  '12-month battery life to minimize interruptions'],
 'short_description': 'A wireless mouse featuring silent clicks and an ergonomic design, perfect for remote workers and students.',
 'long_description': 'Enhance your productivity with this wireless mouse, designed for comfort and efficiency. The 2.4GHz wireless connectivity ensures a stable connection, while the silent clicks allow you to work without disturbing those around you. Its ergonomic shape provides support for extended use, making it ideal for remote workers and students alike. With an impressive 12-month battery life, you can focus on your tasks without the hassle of frequent recharging.',
 'aio_snippet': 'This wireless mouse offers silent clicks and an ergonomic des

In [103]:
product_prompt_v3 = """
You are a product copy generation engine for ecommerce.

Goal:
Generate consistent, brand-safe marketing copy from product facts.

OUTPUT RULES (STRICT):
- Output ONLY valid JSON (no markdown, no backticks, no commentary).
- Follow the schema EXACTLY and include ALL keys.
- Do NOT include any extra keys.
- bullets must contain EXACTLY 3 items (no more, no fewer).
- short_description must be <= 40 words.
- long_description must be EXACTLY 105–110 words.
  Count words before output. If you are outside 105–110, rewrite until within range.
- aio_snippet must be 1–2 sentences.

Brand voice:
- Tone: premium, confident, helpful
- Style: clear, benefit-oriented, not hypey

Compliance rules (hard constraints):
- Do NOT claim "best", "#1", "guaranteed", or make medical/health promises.
- Do NOT mention competitors.
- Do NOT invent specifications not provided.
- If a spec is missing, omit it rather than guessing.

Schema (required keys and types):
{
  "title": "string",
  "bullets": ["string", "string", "string"],
  "short_description": "string",
  "long_description": "string",
  "aio_snippet": "string"
}

Few-shot example (match this structure exactly):

Input:
Product: "Stainless Steel Water Bottle"
Price: "$19.99"
Key features: "750ml", "double-wall insulation", "leak-proof lid"
Audience: "commuters and gym users"

Output:
{
  "title": "Stainless Steel Water Bottle (750ml) with Leak-Proof Lid",
  "bullets": [
    "Keeps drinks hot or cold longer with double-wall insulation",
    "Leak-proof lid designed for bags and on-the-go use",
    "750ml capacity suits commuting, workouts, and daily hydration"
  ],
  "short_description": "A durable 750ml bottle with double-wall insulation and a leak-proof lid—ideal for commuting and training.",
  "long_description": "Stay hydrated wherever the day takes you. This 750ml stainless steel bottle features double-wall insulation to help maintain your drink’s temperature for longer. The leak-proof lid is built for life on the move—toss it in a work bag or gym backpack with confidence. With a clean, durable finish and a practical size, it’s an easy everyday upgrade for commuters and fitness routines alike.",
  "aio_snippet": "A 750ml insulated stainless steel bottle with a leak-proof lid, designed for commuters and gym users who want reliable temperature retention on the go."
}

Now generate output for:

Input:
Product: "Wireless Mouse"
Price: "$29.99"
Key features: "2.4GHz wireless", "silent clicks", "ergonomic shape", "12-month battery life"
Audience: "remote workers and students"

Output:
"""


In [104]:
def run_multiple_tests_product_json(prompt, runs=15, delay=1, temperature=0, model="gpt-4o-mini"):
    results = []

    for i in tqdm(range(runs)):
        obj = call_llm_json(prompt, temperature=temperature, model=model)

        valid, reason = eval_product_json(obj)

        results.append({
            "run": i + 1,
            "output": obj,
            "valid": valid,
            "reason": reason
        })

        time.sleep(delay)

    return results


In [105]:
def eval_product_json(obj):
    required = {"title", "bullets", "short_description", "long_description", "aio_snippet"}
    if not isinstance(obj, dict):
        return False, "not_a_dict"

    if set(obj.keys()) != required:
        return False, "wrong_keys"

    if not isinstance(obj["title"], str) or not obj["title"].strip():
        return False, "bad_title"

    if not isinstance(obj["bullets"], list) or len(obj["bullets"]) != 3:
        return False, "bullets_not_3"

    if any(not isinstance(b, str) or not b.strip() for b in obj["bullets"]):
        return False, "bad_bullets"

    # Word count checks
    if len(obj["short_description"].split()) > 40:
        return False, "short_too_long"

    # Note: the prompt asks for 105 to 110 words, while this check accepts a wider 90 to 120 band
    long_wc = len(obj["long_description"].split())
    if not (90 <= long_wc <= 120):
        return False, "long_out_of_range"

    # Compliance check
    banned_phrases = ["#1", "best", "guaranteed"]

    text_all = " "._
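The cell above is cut off in this export partway through the compliance check. The sketch below is not a reconstruction of the original code; it shows one way a banned-phrase scan over the text fields could work, assuming the same JSON structure. The helper `find_banned_phrases` is hypothetical.

```python
import re

BANNED_PHRASES = ["#1", "best", "guaranteed"]


def find_banned_phrases(obj, banned=BANNED_PHRASES):
    """Return the banned phrases found in any text field of a product JSON object."""
    parts = [
        obj.get("title", ""),
        obj.get("short_description", ""),
        obj.get("long_description", ""),
        obj.get("aio_snippet", ""),
    ]
    parts.extend(obj.get("bullets", []))
    text_all = " ".join(str(p) for p in parts).lower()
    found = []
    for phrase in banned:
        # Lookarounds stop "best" matching inside "bestseller" and "#1" inside "#12".
        pattern = r"(?<!\w)" + re.escape(phrase.lower()) + r"(?!\w)"
        if re.search(pattern, text_all):
            found.append(phrase)
    return found


# Synthetic objects for illustration only.
clean = {"title": "Wireless Mouse", "bullets": ["A bestseller", "Model #12"]}
flagged = {"title": "The Best Wireless Mouse", "bullets": ["Guaranteed comfort"]}

print(find_banned_phrases(clean))
print(find_banned_phrases(flagged))
```

A plain substring test would be simpler but would flag words such as "bestseller", which is why the sketch matches on word boundaries instead.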


In [101]:
product_v3_runs_15 = run_multiple_tests_product_json(product_prompt_v3, runs=15, delay=1)


100%|██████████| 15/15 [01:28<00:00,  5.91s/it]


In [106]:
save_results(product_v3_runs_15, "results/product_v3_runs15.json")

valid_rate = sum(r["valid"] for r in product_v3_runs_15) / len(product_v3_runs_15)
unique_outputs = len({json.dumps(r["output"], sort_keys=True) for r in product_v3_runs_15})

reasons = {}
for r in product_v3_runs_15:
    reasons[r["reason"]] = reasons.get(r["reason"], 0) + 1

print("valid_rate:", valid_rate)
print("unique_outputs:", unique_outputs)
print("failure_reasons:", reasons)


✅ Results saved to results/product_v3_runs15.json
valid_rate: 0.0
unique_outputs: 12
failure_reasons: {'long_out_of_range': 15}
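Every run failed with `long_out_of_range`, but the reason label alone does not say whether the descriptions were too short or too long. A minimal sketch for summarizing the word counts, assuming the result structure produced by `run_multiple_tests_product_json`; the helper names are hypothetical and the sample data is synthetic, not taken from the recorded runs.

```python
def long_description_word_counts(results):
    """Return the whitespace-delimited word count of long_description for each run."""
    counts = []
    for r in results:
        output = r.get("output")
        text = output.get("long_description", "") if isinstance(output, dict) else ""
        counts.append(len(str(text).split()))
    return counts


def summarize(counts, low=90, high=120):
    """Report min/max and how many runs fall below or above the accepted band."""
    if not counts:
        return {}
    return {
        "min": min(counts),
        "max": max(counts),
        "below": sum(c < low for c in counts),
        "above": sum(c > high for c in counts),
    }


# Synthetic results for illustration only.
fake_results = [
    {"run": 1, "output": {"long_description": "word " * 68}},
    {"run": 2, "output": {"long_description": "word " * 75}},
    {"run": 3, "output": {"long_description": "word " * 130}},
]
counts = long_description_word_counts(fake_results)
print(counts, summarize(counts))
```

Applied to `product_v3_runs_15`, a summary like this would show the direction of the miss, which helps decide whether the prompt, the few-shot example, or the accepted range needs adjusting.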
