# Evaluating LLM Recipe Processing with LLM-Aware Metrics
This notebook demonstrates how to use the LLM-aware metrics module to evaluate recipe processing tasks. We'll focus on:

1. Loading and preprocessing recipe data
2. Evaluating structural consistency
3. Analyzing prompt-aware and aggregated similarity scores
4. Comparing response quality

## Setup and Imports

In [1]:
import json
from pathlib import Path
import pandas as pd
from typing import Dict, Any

# Import our custom metrics
from llm_metrics.semantic_similarity_metrics import BERTScore
from examples.llm_aware_metrics.code.prompt_aware import PromptAwareMetric
from examples.llm_aware_metrics.code.schema_based import SchemaAwareMetric
from examples.llm_aware_metrics.code.aggregated_similarity_score import AggregatedSimilarityMetric

## Load and Prepare Data

In [2]:
def load_recipe_data(file_path: str) -> Dict[str, Any]:
    """Load recipe conversion data from JSON file."""
    with open(file_path, 'r') as f:
        data = json.load(f)

    return {
        "system_prompt": data["system-prompt"],
        "user_prompt": data["prompt"],
        "llm_response": data["response"],
    }

# Load our recipe data
recipe_data = load_recipe_data('../../data/R3_conversion_1-shot-0.3.json')
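
The loader above assumes the JSON file has the top-level keys `system-prompt`, `prompt`, and `response`. The sketch below illustrates that layout with a hypothetical placeholder record written to a temporary file (the real file's contents are not shown in this notebook). The helper `load_recipe_data_checked` is an illustrative variant, not part of the metrics module; it reports all missing keys at once instead of stopping at the first `KeyError`.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

# Keys that load_recipe_data reads from the file.
EXPECTED_KEYS = ("system-prompt", "prompt", "response")


def load_recipe_data_checked(file_path: str) -> Dict[str, Any]:
    """Load a recipe conversion record and report every missing key."""
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    missing = [key for key in EXPECTED_KEYS if key not in data]
    if missing:
        raise KeyError(f"{file_path} is missing keys: {missing}")

    return {
        "system_prompt": data["system-prompt"],
        "user_prompt": data["prompt"],
        "llm_response": data["response"],
    }


# Hypothetical placeholder record, used only to show the expected layout.
sample = {
    "system-prompt": "You convert recipes to JSON.",
    "prompt": "Convert this recipe: ...",
    "response": '{"recipe_name": "Example"}',
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    loaded = load_recipe_data_checked(str(sample_path))

print(sorted(loaded))
```

Note that `llm_response` stays a JSON string here, which matches how it is passed to the metrics later in the notebook.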

## Define Recipe Schema

In [3]:
# Define the recipe schema (JSON Schema)
RECIPE_SCHEMA = {
    "type": "object",
    "required": [
        "recipe_name",
        "macronutrients",
        "food_role",
        "ingredients",
        "hasDairy",
        "hasNuts",
        "hasMeat",
        "prep_time",
        "cook_time",
        "serves",
        "instructions"
    ],
    "properties": {
        "recipe_name": {"type": "string"},
        "macronutrients": {
            "type": "object",
            "patternProperties": {
                "^.*$": {
                    "type": "object",
                    "properties": {
                        "measure": {"type": "string"},
                        "unit": {"type": "string"}
                    }
                }
            }
        },
        "food_role": {
            "type": "array",
            "items": {
                "type": "string",
                "enum": ["Main Course", "Side Dish", "Beverage", "Dessert"]
            }
        },
        "ingredients": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "quantity": {"type": "object"},
                    "unit": {"type": "string"}
                }
            }
        },
        "hasDairy": {"type": "boolean"},
        "hasNuts": {"type": "boolean"},
        "hasMeat": {"type": "boolean"},
        "prep_time": {"type": "string"},
        "cook_time": {"type": "string"},
        "serves": {"type": "number"},
        "instructions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "original_text": {"type": "string"},
                    "input_condition": {"type": "array"},
                    "tasks": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "action_name": {"type": "string"},
                                "output_quality": {"type": "array"},
                                "background_knowledge": {
                                    "type": "object",
                                    "properties": {
                                        "tool": {"type": "array"},
                                        "failure": {"type": "array"}
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

In [4]:
# Create a reference response (this would typically be human-annotated data)
reference_response = {
    "recipe_name": "Peekaboo Sugar Eggs",
    "macronutrients": {},  # No nutritional information provided
    "food_role": ["Dessert"],
    "ingredients": [
        {
            "name": "granulated sugar",
            "quantity": "4",
            "unit": "cups"
        },
        {
            "name": "powdered sugar",
            "quantity": "3/4",
            "unit": "cup"
        },
        {
            "name": "egg white",
            "quantity": "1",
            "unit": "whole"
        },
        {
            "name": "food coloring",
            "quantity": "as needed",
            "unit": "drops"
        }
    ],
    "hasDairy": False,
    "hasNuts": False,
    "hasMeat": False,
    "prep_time": "30 minutes",
    "cook_time": "20 minutes",
    "serves": 2,  # Makes two 3-inch eggs
    "instructions": [
        {
            "original_text": "In a small bowl, mix the egg white with food coloring until the color is evenly distributed and the egg is frothy.",
            "input_condition": ["have_egg_white", "have_food_coloring"],
            "tasks": [
                {
                    "action_name": "mix egg white and food coloring",
                    "output_quality": ["color evenly distributed", "mixture is frothy"],
                    "background_knowledge": {
                        "tool": ["small bowl", "mixing utensil"],
                        "failure": ["uneven color distribution"]
                    }
                }
            ]
        }
        # ... more instructions would follow
    ]
}

In [5]:
# Initialize our metrics
bert_score = BERTScore()
schema_metric = SchemaAwareMetric(bert_score)
prompt_metric = PromptAwareMetric(bert_score)
aggregated_metric = AggregatedSimilarityMetric(bert_score)

## Evaluate Metrics

### Schema-based evaluation
This metric checks if the LLM's output strictly adheres to our predefined JSON schema for recipes. The schema requires specific fields like `recipe_name`, `ingredients`, `instructions`, etc., with defined data types and structures.

In [6]:
schema_score = schema_metric.calculate_with_prompt(
    json.dumps(reference_response),
    recipe_data["llm_response"],
    recipe_data["system_prompt"],
    metadata={"schema": RECIPE_SCHEMA}
)

print(f"Schema-based score: {schema_score}")

Schema-based score: 0.0


  warn("One or more outputs do not match schema. Returning 0.0.")


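Our score of 0.0 indicates that at least one output did not validate against our schema. This is not surprising because:
- Recipe parsing is a complex task requiring understanding of both content and structure
- The LLM needs to transform unstructured text into highly structured JSON
- Our schema validation is binary (pass/fail) and quite strict
- Even small deviations from the expected structure result in a failed validation

The warning does not say which output failed. Note that the reference response stores each ingredient's `quantity` as a string (for example `"4"`), while the schema declares `quantity` as an `object`. If the metric also validates the reference, the reference itself may be the output that fails.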
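
Because a binary score does not say what went wrong, it can help to list the violations. The sketch below is a simplified stand-in for full JSON Schema validation, using only the standard library: it checks required fields and top-level types and does not descend into nested objects, `enum`, or `patternProperties`. The function names and the toy schema and response are hypothetical, chosen only for illustration.

```python
import json
from typing import Any, Dict, List

# Map JSON Schema type names to the Python types json.loads produces.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def matches_type(value: Any, type_name: str) -> bool:
    """Check a parsed JSON value against a JSON Schema type name."""
    expected = JSON_TYPES.get(type_name)
    if expected is None:
        return True  # type names this sketch does not know are not checked
    # bool is a subclass of int in Python, so exclude it from "number".
    if type_name == "number" and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def top_level_violations(instance: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """List missing required fields and top-level type mismatches."""
    problems = [
        f"missing required field: {field}"
        for field in schema.get("required", [])
        if field not in instance
    ]
    for field, spec in schema.get("properties", {}).items():
        if field in instance and "type" in spec:
            if not matches_type(instance[field], spec["type"]):
                problems.append(
                    f"{field}: expected {spec['type']}, "
                    f"got {type(instance[field]).__name__}"
                )
    return problems


# Toy schema and response, hypothetical and much smaller than RECIPE_SCHEMA.
toy_schema = {
    "type": "object",
    "required": ["recipe_name", "serves", "hasDairy"],
    "properties": {
        "recipe_name": {"type": "string"},
        "serves": {"type": "number"},
        "hasDairy": {"type": "boolean"},
    },
}
toy_response = json.loads('{"recipe_name": "Peekaboo Sugar Eggs", "serves": "2"}')

violations = top_level_violations(toy_response, toy_schema)
for violation in violations:
    print(violation)
```

Running a check like this on both the reference and the LLM response would show which of the two triggers the warning.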

### Prompt-aware evaluation
This metric evaluates how well the LLM's output captures the content while considering the context provided in the system and user prompts.

In [7]:
# 2. Prompt-aware evaluation
prompt_score = prompt_metric.calculate_with_prompt(
    json.dumps(reference_response),
    recipe_data["llm_response"],
    recipe_data["system_prompt"],
    recipe_data["user_prompt"]
)

print(f"Prompt-aware score: {prompt_score}")

Prompt-aware score: {'precision': 0.5313692688941956, 'recall': 0.4975050091743469, 'f1': 0.5138798356056213}


Using BERTScore as the base metric, we get three scores:
- Precision: 0.53 - How much of the LLM's output is relevant
- Recall: 0.50 - How much of the expected content is captured
- F1: 0.51 - Harmonic mean of precision and recall

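These scores suggest that:
- The output only partially covers the reference content (recall of about 0.50); BERTScore measures embedding similarity, so this is not a literal percentage of fields captured
- Precision and recall are close (0.53 vs. 0.50), so the output is not dominated by either extra or missing content
- The moderate overlap is consistent with a response that follows the basic recipe structure but misses some details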
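
As a quick arithmetic check, the F1 value can be recomputed from the reported precision and recall using the harmonic mean, F1 = 2PR / (P + R). The helper name `f1_from` is only for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the prompt-aware output above.
precision = 0.5313692688941956
recall = 0.4975050091743469

f1 = f1_from(precision, recall)
print(round(f1, 4))
```

The result agrees with the reported `f1` to about four decimal places.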

### Aggregated similarity score evaluation
This metric measures how similar the LLM's output is to the reference response, focusing on key elements like `recipe_name`, `ingredients`, and `instructions`.

In [8]:
# 3. Aggregated similarity (alignment) score evaluation
alignment_score = aggregated_metric.calculate_with_prompt(
    json.dumps(reference_response),
    recipe_data["llm_response"],
    recipe_data["system_prompt"],
    recipe_data["user_prompt"],
    metadata={
        "key_elements": [
            "recipe_name",
            "ingredients",
            "instructions",
            "prep_time",
            "cook_time"
        ]
    }
)

print(f"Alignment score: {alignment_score}")

Alignment score: {'precision': 0.5740346511205038, 'recall': 0.5878864526748657, 'f1': 0.5779176155726115}


The scores are:
- Precision: 0.57 - Higher than the prompt-aware precision (0.53)
- Recall: 0.59 - Higher than the prompt-aware recall (0.50), indicating better coverage of the key elements
- F1: 0.58 - Higher than the prompt-aware F1 (0.51)

The aggregated similarity scores are likely higher because:
- This metric focuses on the listed key elements, so differences in other fields weigh less
- It's more forgiving of structural deviations while focusing on content alignment
- The LLM captured the essential recipe elements better than the output as a whole
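
The notebook does not show how `AggregatedSimilarityMetric` combines scores internally. One plausible scheme, stated here purely as an assumption, is to score each key element separately and average the results. The sketch below illustrates that idea with a crude token-overlap score standing in for BERTScore; all function names and the toy data are hypothetical.

```python
import json
import re
from collections import Counter
from typing import Any, Dict, List


def overlap_scores(candidate: str, reference: str) -> Dict[str, float]:
    """Token-overlap precision/recall/F1; a crude stand-in for BERTScore."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    shared = sum((cand & ref).values())
    precision = shared / sum(cand.values()) if cand else 0.0
    recall = shared / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def aggregate_by_element(
    candidate: Dict[str, Any],
    reference: Dict[str, Any],
    key_elements: List[str],
) -> Dict[str, float]:
    """Average per-element scores over the key elements (assumed scheme)."""
    per_element = [
        overlap_scores(
            json.dumps(candidate.get(key, "")),
            json.dumps(reference.get(key, "")),
        )
        for key in key_elements
    ]
    return {
        name: sum(score[name] for score in per_element) / len(per_element)
        for name in ("precision", "recall", "f1")
    }


# Toy data, hypothetical and much smaller than the real responses.
reference_toy = {"recipe_name": "Peekaboo Sugar Eggs", "prep_time": "30 minutes"}
candidate_toy = {"recipe_name": "Sugar Eggs", "prep_time": "30 minutes"}

scores = aggregate_by_element(
    candidate_toy, reference_toy, ["recipe_name", "prep_time"]
)
print(scores)
```

Under this scheme the averaged F1 is generally not the harmonic mean of the averaged precision and recall. The reported values show the same pattern (F1 of 0.578 versus a harmonic mean of about 0.581), which is consistent with per-element averaging but does not confirm it.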

## Analysis of Results

The combined metrics tell us that:
1. While the LLM's output did not pass strict schema validation (schema score: 0.0), it did capture meaningful recipe information
2. The content quality is moderate (prompt-aware F1: 0.51), suggesting room for improvement in content extraction
3. The output aligns somewhat better with the reference on the key elements (alignment F1: 0.58)

### Areas for Improvement
1. JSON Structure: The LLM needs better guidance on producing valid, schema-conformant JSON
2. Content Extraction: Some recipe details were missed or incorrectly formatted
3. Schema Compliance: Binary pass/fail validation hides how close the output came to the schema
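
A schema score of 0.0 does not by itself distinguish a response that is not valid JSON from one that is valid JSON with the wrong shape. The sketch below separates the two cases before any schema check. It assumes the response may be wrapped in a Markdown code fence, which is a guess about the model's output rather than something shown in this notebook; `parse_llm_json` is a hypothetical helper.

```python
import json
from typing import Any, Optional, Tuple

# Markdown code-fence marker (three backticks), built indirectly.
FENCE = "`" * 3


def parse_llm_json(response: str) -> Tuple[Optional[Any], Optional[str]]:
    """Return (parsed, None) on success or (None, error message) on failure."""
    text = response.strip()

    # Drop a surrounding Markdown fence if present.
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)]
        # Remove an optional language tag such as "json" on the opening line.
        first_line, _, rest = text.partition("\n")
        if first_line.strip().isalpha() or not first_line.strip():
            text = rest

    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg} at line {exc.lineno}"


# A fenced but valid response, and a truncated one (both hypothetical).
fenced = FENCE + "json\n" + '{"recipe_name": "Example"}' + "\n" + FENCE
ok, err = parse_llm_json(fenced)
bad, bad_err = parse_llm_json('{"recipe_name": "Example"')

print(ok, err)
print(bad, bad_err)
```

If parsing succeeds, the schema mismatch lies in the structure; if it fails, the prompt or decoding settings are the more likely place to look.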

### Next Steps
1. Consider using a more lenient schema validation approach
2. Add intermediate structure validation to guide the LLM
3. Experiment with different prompt formats to improve JSON structure adherence
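
As one possible form of the more lenient validation in step 1, the score could be the fraction of required fields that are present with the expected top-level type, rather than pass/fail. This is a minimal sketch under that assumption, not part of the metrics module; `lenient_schema_score` and the toy data are hypothetical, and nested structure is not checked.

```python
from typing import Any, Dict

# Map JSON Schema type names to the Python types json.loads produces.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def lenient_schema_score(instance: Dict[str, Any], schema: Dict[str, Any]) -> float:
    """Fraction of required fields present with the expected top-level type."""
    required = schema.get("required", [])
    if not required:
        return 1.0

    properties = schema.get("properties", {})
    satisfied = 0
    for field in required:
        if field not in instance:
            continue
        value = instance[field]
        type_name = properties.get(field, {}).get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            satisfied += 1  # no type constraint this sketch can check
        elif type_name == "number" and isinstance(value, bool):
            continue  # bool is an int subclass in Python, not a JSON number
        elif isinstance(value, expected):
            satisfied += 1
    return satisfied / len(required)


# Toy schema and response, hypothetical and smaller than RECIPE_SCHEMA.
toy_schema = {
    "required": ["recipe_name", "serves", "hasDairy", "ingredients"],
    "properties": {
        "recipe_name": {"type": "string"},
        "serves": {"type": "number"},
        "hasDairy": {"type": "boolean"},
        "ingredients": {"type": "array"},
    },
}
toy_response = {"recipe_name": "Example", "serves": "2", "ingredients": []}

score = lenient_schema_score(toy_response, toy_schema)
print(score)
```

Here two of the four required fields are satisfied (`serves` has the wrong type and `hasDairy` is missing), so the score reflects partial compliance instead of collapsing to 0.0.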