# Prompt Optimization

- Now that we have a calibrated judge, we can use it to optimize the agent's system prompt.
- This closes the loop: SME feedback is integrated directly into the judge, and the agent's prompt is then optimized against that SME-calibrated judge.

In [0]:
%pip install -U -qqqq backoff databricks-openai uv databricks-agents mlflow==3.9.0rc0 dspy
dbutils.library.restartPython()

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
import mlflow
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness
from mlflow.genai.datasets import create_dataset
import json
from pathlib import Path

CONFIG = json.loads(Path("config/dc_assistant.json").read_text())

# Extract configuration variables
EXPERIMENT_ID = CONFIG["mlflow"]["experiment_id"]
ALIGNED_JUDGE_NAME = CONFIG["judges"]["aligned_judge_name"]
PROMPT_NAME = CONFIG["prompt_registry"]["prompt_name"]
REFLECTION_MODEL = CONFIG["prompt_registry"]["reflection_model"]
OPTIMIZATION_DATASET_NAME = CONFIG["optimization"]["optimization_dataset_name"]

# Set the MLflow experiment
mlflow.set_experiment(experiment_id=EXPERIMENT_ID)

<Experiment: artifact_location='dbfs:/databricks/mlflow-tracking/2517718719552044', creation_time=1768690316954, experiment_id='2517718719552044', last_update_time=1768803175855, lifecycle_stage='active', name=('/Users/austin.choi@databricks.com/GenAI/mlflow updates/AC updates '
 'dc-assistant-agent_experiment'), tags={'mlflow.databricks.managed_evals.experiment_permissions_check': '',
 'mlflow.databricks.review_app.experiment_permissions_check': '',
 'mlflow.experiment.sourceName': '/Users/austin.choi@databricks.com/GenAI/mlflow '
                                 'updates/AC updates '
                                 'dc-assistant-agent_experiment',
 'mlflow.experimentKind': 'genai_development',
 'mlflow.experimentType': 'MLFLOW_EXPERIMENT',
 'mlflow.latestTraceEvaluationTimestampMs': '1768760238842',
 'mlflow.ownerEmail': 'austin.choi@databricks.com',
 'mlflow.ownerId': '3275534733162887',
 'product': 'mlflow',
 'purpose': 'football_analysis'}>
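
The cell above indexes several nested keys in `config/dc_assistant.json` and would raise a `KeyError` if any were absent. As a minimal sketch, assuming the config has exactly the sections read above, a small check can report missing keys before anything else runs. The helper name `missing_config_keys` and the placeholder values are hypothetical and not part of the original notebook.

```python
# Keys the notebook reads from the config file (taken from the cell above).
REQUIRED_KEYS = {
    "mlflow": ["experiment_id"],
    "judges": ["aligned_judge_name"],
    "prompt_registry": ["prompt_name", "reflection_model"],
    "optimization": ["optimization_dataset_name"],
}


def missing_config_keys(config: dict) -> list:
    """Hypothetical helper: list every required section or key that is absent."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            # The whole section is missing or has the wrong type.
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing


# Placeholder values for illustration only; real values come from the setup notebook.
sample_config = {
    "mlflow": {"experiment_id": "0"},
    "judges": {"aligned_judge_name": "placeholder_judge"},
    "prompt_registry": {
        "prompt_name": "placeholder_prompt",
        "reflection_model": "placeholder_model",
    },
    "optimization": {"optimization_dataset_name": "placeholder_dataset"},
}

print(missing_config_keys(sample_config))
print(missing_config_keys({"mlflow": {}}))
```

Running a check like this right after `json.loads` makes a misconfigured file fail with a readable message instead of a bare `KeyError` partway through the notebook.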

In [0]:
# Build the optimization dataset: each record pairs a user question with the expected agent behavior

optimization_dataset = [
    {
        "inputs": {
            "input": [
                 {"role": "user", "content": "Who are the primary ball carriers for the 2024 Detroit Lions on 3rd and short?"},
          ]
        },
        "expectations": {
            "expected_response": "The agent should call `users.wesley_pasfield.who_got_ball_by_down_distance` with arguments for the Detroit Lions, 2024 season, 3rd down, and short distance. The response should list key players like David Montgomery or Jahmyr Gibbs and their involvement and give tactical recommendations for how the defense should react."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "What are the tendencies of the 2024 San Francisco 49ers in a 2 minute drill?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.tendencies_two_minute_drill` with `team='SF'`, `season=2024`. The response should analyze their play selection (pass vs run) and pace during 2-minute situations and give tactical recommendations for how the defense should react."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "How effective are screen passes for the 2024 Miami Dolphins?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.screen_play_tendencies` with `team='MIA'`, `season=2024`. The response should provide stats on screen play frequency and success and give tactical recommendations for how the defense should react."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "What do the 2024 Philadelphia Eagles tend to do on 1st down?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.tendencies_by_down_distance` with `team='PHI'`, `season=2024`, `down=1`. The response should break down run/pass ratios and preferred play types on first down and give tactical recommendations for how the defense should react."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "Who gets the ball when the 2024 Dallas Cowboys are in the red zone?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.who_got_ball_by_offense_situation` with `team='DAL'`, `season=2024`. The response should identify targets specifically in the red zone context and give tactical recommendations for how the defense should react. If there are any data quality issues related to red zone data, they should be explicitly stated to the end user."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "What are the redzone tendencies of the Buffalo Bills offense?"}]
        },
        "expectations": {
             "expected_response": "The agent should assume season 2024. It should call general tendency tools like `users.wesley_pasfield.tendencies_by_down_distance` or `users.wesley_pasfield.tendencies_by_drive_start` to give an overview of the offense and give tactical recommendations for how the defense should react."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "How do the 2024 Bills use motion?"}]
        },
        "expectations": {
             "expected_response": "The agent should identify and articulate that it doesn't have access to motion data, and offer alternative options of data pulls to make."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "How does the 2024 Kansas City Chiefs offense perform against the blitz?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.success_by_pass_rush_and_coverage` with `team='KC'`, `season=2024`. Analyze performance vs pressure and give overall success metrics by different strategies and give tactical recommendations for how the defense should react."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "How do the 2024 LA Rams play in the second half?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.tendencies_by_score_2nd_half` with `team='LA'`, `season=2024`. Analyze if they are conservative or aggressive based on score and give tactical recommendations for how the defense should react."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "What formations does the 2024 Green Bay Packers offense prefer?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.tendencies_by_offense_formation` with `team='GB'`, `season=2024`. The response should detail the most common formations and their usage rates and give tactical recommendations for defensive adjustments."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "Who gets the ball on 3rd and long for the 2024 Minnesota Vikings?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.who_got_ball_by_down_distance` with `team='MIN'`, `season=2024`, `down=3`, and long distance parameters. The response should identify key receivers and their target share and give tactical recommendations for coverage schemes."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "What are the 2024 Seattle Seahawks' tendencies when starting a drive in their own territory?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.tendencies_by_drive_start` with `team='SEA'`, `season=2024`. The response should analyze play-calling patterns based on field position and give tactical recommendations for defensive game planning."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "How do the 2023 Tampa Bay Buccaneers attack defenses in 2 minute drills?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.who_got_ball_two_minute_drill` with `team='TB'`, `season=2023`. The response should identify primary targets and route concepts used in hurry-up situations and give tactical recommendations for coverage adjustments."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "What's the Baltimore Ravens run game strategy on 2nd and short in 2024?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.who_got_ball_by_down_distance` with `team='BAL'`, `season=2024`, `down=2`, and short distance. The response should detail rushing personnel and tendency rates and give tactical recommendations for run defense alignment."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "Which formations does the Cincinnati Bengals use most when they're trailing?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.tendencies_by_offense_formation` with `team='CIN'`, `season=2024`. It may also call `users.wesley_pasfield.tendencies_by_score_2nd_half` to filter by trailing scenarios. The response should detail formation preferences when behind and give tactical recommendations, and emphasize that it does not have a perfect tool to answer that question."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "What pass rush strategies work best against the 2024 New York Jets offense?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.success_by_pass_rush_and_coverage` with `team='NYJ'`, `season=2024`. The response should detail which pass rush schemes are most effective and give tactical recommendations for defensive coordinator playcalling."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "How does the 2023 Arizona Cardinals offense distribute the ball out of 11 personnel?"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.who_got_ball_by_down_distance_and_form` with `team='ARI'`, `season=2023`, and 11 personnel parameters. The response should identify target distribution from that formation and give tactical recommendations for defensive personnel matching."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "Tell me about the 2024 New Orleans Saints screen game effectiveness"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.screen_play_tendencies` with `team='NO'`, `season=2024`. The response should provide frequency, success rate, and preferred screen concepts and give tactical recommendations for screen defense."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "Tell me about the 2024 New Orleans Saints preferences after getting a turnover"}]
        },
        "expectations": {
             "expected_response": "The agent should call `users.wesley_pasfield.first_play_after_turnover` with `team='NO'`, `season=2024`. The response should provide frequency, success rate, and preferred play types and comment on aggressiveness, and give tactical recommendations for defenses in these situations."
        }
    },
    {
        "inputs": {
             "input": [
                  {"role": "user", "content": "What are the 2024 Washington Commanders 1st down tendencies from different field positions?"}]
        },
        "expectations": {
             "expected_response": "The agent should call both `users.wesley_pasfield.tendencies_by_down_distance` with `team='WAS'`, `season=2024`, `down=1` AND `users.wesley_pasfield.tendencies_by_drive_start` to correlate first down play calls with field position. The response should provide comprehensive tendency analysis and give tactical recommendations for situational defense."
        }
    }
]

# Save optimization dataset to MLflow GenAI dataset
print(f"Creating MLflow GenAI dataset: {OPTIMIZATION_DATASET_NAME}")
optimization_dataset_mlflow = create_dataset(
    name=OPTIMIZATION_DATASET_NAME,
)
print(f"Created optimization dataset: {optimization_dataset_mlflow.name}")

# Add records to the dataset
optimization_dataset_mlflow = optimization_dataset_mlflow.merge_records(optimization_dataset)
print(f"Added {len(optimization_dataset)} records to optimization dataset")



📊 Creating MLflow GenAI dataset: users.wesley_pasfield.dcassistant_optimization_data
✅ Created optimization dataset: users.wesley_pasfield.dcassistant_optimization_data
✅ Added 20 records to optimization dataset
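
Every record above follows the same shape: `inputs.input` is a list of chat messages and `expectations.expected_response` is a free-text description of the desired behavior. As a rough sketch, assuming that shape is the one the optimizer should receive, a pre-flight check can catch malformed records before `merge_records` is called. The helper `record_problems` is hypothetical and not part of MLflow.

```python
def record_problems(record: dict) -> list:
    """Hypothetical helper: return schema problems for one record (empty if none)."""
    problems = []

    messages = record.get("inputs", {}).get("input")
    if not isinstance(messages, list) or not messages:
        problems.append("inputs.input must be a non-empty list of messages")
    else:
        for i, msg in enumerate(messages):
            # Each message needs at least a role and a content field.
            if not isinstance(msg, dict) or not {"role", "content"} <= set(msg):
                problems.append(f"inputs.input[{i}] needs 'role' and 'content'")

    expected = record.get("expectations", {}).get("expected_response")
    if not isinstance(expected, str) or not expected.strip():
        problems.append("expectations.expected_response must be a non-empty string")

    return problems


# One well-formed record (copied from the dataset above) and one malformed record.
good = {
    "inputs": {
        "input": [{"role": "user", "content": "How do the 2024 Bills use motion?"}]
    },
    "expectations": {
        "expected_response": "The agent should identify and articulate that it doesn't have access to motion data, and offer alternative options of data pulls to make."
    },
}
bad = {"inputs": {"input": []}, "expectations": {}}

print(record_problems(good))
print(record_problems(bad))
```

Applied to the full list, something like `[p for r in optimization_dataset for p in record_problems(r)]` would be expected to come back empty before the records are merged.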


In [0]:
from mlflow.genai.scorers import get_scorer, RelevanceToQuery

# Load aligned judge using config variables (already loaded in previous cell)
aligned_judge = get_scorer(name=ALIGNED_JUDGE_NAME, experiment_id=EXPERIMENT_ID)
print(aligned_judge.instructions)

Evaluate if the response in {{ outputs }} appropriately analyzes the available data and provides an actionable recommendation the question in {{ inputs }}. The response should be accurate, contextually relevant, and give a strategic advantage to the  person making the request. Your grading criteria should be:  1: Completely unacceptable. Incorrect data interpretation or no recommendations 2: Mostly unacceptable. Irrelevant or spurious feedback or weak recommendations provided with minimal strategic advantage 3:  Somewhat acceptable. Relevant feedback provided with some strategic advantage 4: Mostly acceptable. Relevant feedback provided with strong strategic advantage 5 Completely acceptable. Relevant feedback provided with excellent strategic advantage

If the inputs show an analysis that cites tendencies and success rates but: (a) does not benchmark those numbers against other coverage shells or personnel looks in the same dataset, (b) treats situational proxies (e.g., 3rd down) as n
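
The rubric above grades on a 1 to 5 scale, while the optimizer used later treats 1.0 as a perfect score, so the ratings have to be rescaled. A minimal sketch of the arithmetic follows. `normalize_rating` mirrors the divide-by-5 approach used in this notebook; `min_max_rating` is an alternative shown only for comparison and is an assumption, not something the notebook uses. The ratings list is illustrative.

```python
def normalize_rating(raw: float, scale_max: float = 5.0) -> float:
    """Divide-by-max scaling: a 1-5 rating lands in the range 0.2-1.0."""
    return raw / scale_max


def min_max_rating(raw: float, lo: float = 1.0, hi: float = 5.0) -> float:
    """Alternative (assumption, not used here): a 1-5 rating lands in 0.0-1.0."""
    return (raw - lo) / (hi - lo)


# Illustrative ratings: nineteen 3s and a single 1.
ratings = [3] * 19 + [1]
mean_raw = sum(ratings) / len(ratings)

print(f"mean raw rating: {mean_raw:.2f}/5.0")
print(f"divide-by-max:   {normalize_rating(mean_raw):.2f}")
print(f"min-max:         {min_max_rating(mean_raw):.3f}")
```

With divide-by-max scaling, the lowest possible rating still maps to 0.2 rather than 0, which compresses the range the optimizer sees. Whether that matters depends on how the optimizer compares candidates, so the choice is a judgment call.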

In [0]:
from agent import AGENT, SYSTEM_PROMPT
import copy
from typing import List, Dict, Any
import warnings
import logging

# Suppress warnings comprehensively
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=UserWarning, module='pydantic')
logging.getLogger('mlflow.genai.judges.instructions_judge').setLevel(logging.ERROR)

# Load prompt using config variables (already loaded in previous cells)
print(f"Loading prompt: {PROMPT_NAME} - using current production prompt")
system_prompt = mlflow.genai.load_prompt(f"prompts:/{PROMPT_NAME}@production")
print(f"✅ Loaded prompt: {system_prompt.uri}")

# Define objective function to convert Feedback to numerical scores
candidate_counter = {"count": 0, "current_prompt": None, "scores": []}

def objective_function(scores: dict) -> float:
    """
    Extract the numerical score from the judge's Feedback object.
    
    The judge returns a Feedback object with feedback.value as a string (e.g., '5').
    We need to convert this to a float for GEPA optimization.
    """
    feedback = scores.get(ALIGNED_JUDGE_NAME)
    
    # Extract the float value from the Feedback object
    if feedback and hasattr(feedback, 'feedback') and hasattr(feedback.feedback, 'value'):
        try:
            raw_score = float(feedback.feedback.value)
            
            # Scale the 1-5 rating to 0.2-1.0 for GEPA (which assumes 1.0 is perfect)
            normalized_score = raw_score / 5.0
            print(normalized_score)
            
            # Track raw scores for human readability
            candidate_counter["scores"].append(raw_score)
            
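            # Print a summary every len(optimization_dataset) scores.
            # Note: GEPA also scores small subsamples, so after the first full
            # pass this window can span more than one candidate prompt.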
            if len(candidate_counter["scores"]) == len(optimization_dataset):
                avg_score = sum(candidate_counter["scores"]) / len(candidate_counter["scores"])
                candidate_counter["count"] += 1
                print(f"\n✅ Candidate #{candidate_counter['count']} Average Score: {avg_score:.2f}/5.0")
                candidate_counter["scores"] = []  # Reset for next candidate
            
            return normalized_score
        except (ValueError, TypeError) as e:
            logging.warning(f"Could not convert feedback value to float: {e}")
            return 0.6  # Default to middle score (3.0/5.0)
    
    # Fallback
    return 0.6

# Define predict_fn following the exact pattern from MLflow docs
last_prompt_hash = {"hash": None}

def predict_fn(input):
    """Predict function that uses the agent with the MLflow prompt."""
    # Load the current prompt version (will be optimized during the process)
    prompt = mlflow.genai.load_prompt(system_prompt.uri)
    
    # Use prompt.format() to ensure MLflow tracks usage
    system_content = prompt.format()
    
    # Check if the prompt has changed
    current_hash = hash(system_content)
    if last_prompt_hash["hash"] != current_hash:
        last_prompt_hash["hash"] = current_hash
        print(f"\n🆕 NEW PROMPT CANDIDATE DETECTED!")
        print(f"📝 Prompt (first 10000 chars): {system_content[:10000]}...")
        print(f"📏 Full length: {len(system_content)} characters\n")
    
    # Extract the user message from the input list
    user_message = input[0]['content']
    
    # Create input messages as a simple list of dicts
    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_message}
    ]
    
    # Call the agent with the input dict structure
    response = AGENT.predict({"input": messages})
    # Return the text output
    return response

# Optimize the prompt
result = mlflow.genai.optimize_prompts(
    predict_fn=predict_fn,
    train_data=optimization_dataset,
    prompt_uris=[system_prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model=REFLECTION_MODEL, 
        max_metric_calls=75, 
        display_progress_bar=True,
    ),
    scorers=[aligned_judge],
    aggregation=objective_function,
)

# Use the optimized prompt
optimized_prompt = result.optimized_prompts[0]
print(f"\n" + "="*80)
print("OPTIMIZATION COMPLETE")
print("="*80)
print(f"\nOptimized template:\n{optimized_prompt.template}")
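# The score attributes may be missing on some result objects, so read them defensively
initial_score = getattr(result, "initial_eval_score", None)
final_score = getattr(result, "final_eval_score", None)
print(f"\nInitial score: {initial_score if initial_score is not None else 'N/A'}")
print(f"Final score: {final_score if final_score is not None else 'N/A'}")
# Promote only when both scores exist and the optimized prompt actually improved
improved = initial_score is not None and final_score is not None and final_score > initial_score
prod_var = 'prod' if improved else 'no_action'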



Loading prompt: users.wesley_pasfield.dcassistant - using current production prompt


2025/11/26 06:02:42 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.
2025/11/26 06:02:42 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset. To disable this check, set the MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION environment variable to True.


✅ Loaded prompt: prompts:/users.wesley_pasfield.dcassistant/1

🆕 NEW PROMPT CANDIDATE DETECTED!
📝 Prompt (first 10000 chars): You are an assistant that helps Defensive NFL coaches prepare for facing a specific offense. Your role is to interpret user input, and leverage your available tools to better understand how offenses will approach certain situations. Answer users questions, and use your tools to extrapolate and provide additional relevant information as well. If no season is provided, assume 2024. For queries with a redzone parameter, ALWAYS pass FALSE for the redzone parameter unless the user explicitly asks about the redzone....
📏 Full length: 515 characters



GEPA Optimization:   0%|          | 0/75 [00:00<?, ?rollouts/s]

0.6
0.6
0.6
0.6


  PydanticSerializationUnexpectedValue(Expected 9 fields but got 5: Expected `Message` - serialized value may not be as expected [field_name='message', input_value=Message(content='{"result...er_specific_fields=None), input_type=Message])
  PydanticSerializationUnexpectedValue(Expected `StreamingChoices` - serialized value may not be as expected [field_name='choices', input_value=Choices(finish_reason='st...r_specific_fields=None)), input_type=Choices])
  return self.__pydantic_serializer__.to_python(


0.6




0.6
0.6



0.6
0.6
0.6




0.2
0.6




0.6
0.6
0.6




0.6
0.6
0.6
0.6


GEPA Optimization:  27%|██▋       | 20/75 [01:51<05:06,  5.57s/rollouts]

0.6

✅ Candidate #1 Average Score: 2.90/5.0
Iteration 0: Base program full valset score: 0.58
Iteration 1: Selected program 0 score: 0.58
0.6




0.6
0.6
Iteration 1: Proposed new text for users.wesley_pasfield.dcassistant: You are an assistant that helps NFL defensive coaches prepare for facing a specific offense. Your job is to interpret the user’s intent, call the available tools to gather the right data, and turn that into concise, actionable scouting insights with clear defensive countermeasures. When a season is not specified, assume 2024. For any tool that accepts a redzone parameter, ALWAYS pass redzone=false unless the user explicitly asks about the red zone.

Core operating principles:
- Always ground your analysis in tool outputs. Summarize the most relevant metrics (plays, pass/run split, avg EPA, success rate, avg yards, air yards, YAC, first-down rate) and translate them into what the defense should expect.
- Pair raw findings with actionable, coverage- and front-specific recommendations (e.g., bracket/bracket variant, press-bail technique, match rules, peel rules for edge defenders on screens, LB drop landmarks, s



0.6


GEPA Optimization:  35%|███▍      | 26/75 [04:06<08:43, 10.67s/rollouts]

0.6
Iteration 1: New subsample score 1.4 is not better than old score 1.7999999999999998, skipping
Iteration 2: Selected program 0 score: 0.58

🆕 NEW PROMPT CANDIDATE DETECTED!
📝 Prompt (first 10000 chars): You are an assistant that helps Defensive NFL coaches prepare for facing a specific offense. Your role is to interpret user input, and leverage your available tools to better understand how offenses will approach certain situations. Answer users questions, and use your tools to extrapolate and provide additional relevant information as well. If no season is provided, assume 2024. For queries with a redzone parameter, ALWAYS pass FALSE for the redzone parameter unless the user explicitly asks about the redzone....
📏 Full length: 515 characters

0.4
0.8
0.6
Iteration 2: Proposed new text for users.wesley_pasfield.dcassistant: You are an assistant for NFL defensive coaches preparing to face a specific offense. Your job is to interpret user intent, call the right tools, synthesize findi

GEPA Optimization:  43%|████▎     | 32/75 [05:44<08:50, 12.34s/rollouts]

0.6
Iteration 2: New subsample score 1.6 is not better than old score 1.8, skipping
Iteration 3: Selected program 0 score: 0.58

🆕 NEW PROMPT CANDIDATE DETECTED!
📝 Prompt (first 10000 chars): You are an assistant that helps Defensive NFL coaches prepare for facing a specific offense. Your role is to interpret user input, and leverage your available tools to better understand how offenses will approach certain situations. Answer users questions, and use your tools to extrapolate and provide additional relevant information as well. If no season is provided, assume 2024. For queries with a redzone parameter, ALWAYS pass FALSE for the redzone parameter unless the user explicitly asks about the redzone....
📏 Full length: 515 characters

0.6




0.6
0.6
Iteration 3: Proposed new text for users.wesley_pasfield.dcassistant: You are an assistant for Defensive NFL coaches preparing to face specific offenses. Your job is to interpret user questions, call the provided analytics tools appropriately, synthesize the outputs, and deliver concise, actionable scouting insights with concrete defensive countermeasures.

Core operating rules:
- Default season: 2024 unless the user specifies otherwise.
- Any tool with a redzone parameter must receive FALSE unless the user explicitly asks about red zone.
- Always cite sample sizes (plays) when highlighting tendencies, and flag any slice with very small n (e.g., n < 10) as low-confidence.
- When giving performance metrics (EPA, success rate, yards), provide context:
  - Compare against the offense’s other situations in the same tool output (e.g., other distance buckets, formations, drive-start zones).
  - If relevant, compare across zones in the same dataset (e.g., Own <25 vs Own 25-50 vs Oppon

GEPA Optimization:  51%|█████     | 38/75 [07:36<08:47, 14.25s/rollouts]

0.6
Iteration 3: New subsample score 1.7999999999999998 is not better than old score 1.7999999999999998, skipping
Iteration 4: Selected program 0 score: 0.58

🆕 NEW PROMPT CANDIDATE DETECTED!
📝 Prompt (first 10000 chars): You are an assistant that helps Defensive NFL coaches prepare for facing a specific offense. Your role is to interpret user input, and leverage your available tools to better understand how offenses will approach certain situations. Answer users questions, and use your tools to extrapolate and provide additional relevant information as well. If no season is provided, assume 2024. For queries with a redzone parameter, ALWAYS pass FALSE for the redzone parameter unless the user explicitly asks about the redzone....
📏 Full length: 515 characters

0.6




0.6

✅ Candidate #2 Average Score: 2.85/5.0
0.6
Iteration 4: Proposed new text for users.wesley_pasfield.dcassistant: You are an assistant for Defensive NFL coaches preparing to face a specific offense. Your job is to interpret user questions, call the appropriate tools, synthesize the outputs, and deliver concise, actionable scouting insights. Always assume season 2024 unless a different season is explicitly provided. For any tool with a redzone parameter, ALWAYS pass redzone=false unless the user explicitly asks about the red zone; if they do, pass redzone=true.

Core tools you can use (and when to use them):
- users.wesley_pasfield.tendencies_by_offense_formation: Use to quantify formation and personnel usage, pass/run splits, EPA, success rate, and yards. Summarize top formations and personnel packages and their effectiveness.
- users.wesley_pasfield.tendencies_by_down_distance: Use to profile run/pass rates, EPA, success, and yards by down and distance buckets. In red zone queries



0.6


GEPA Optimization:  59%|█████▊    | 44/75 [09:28<08:02, 15.56s/rollouts]

0.6
Iteration 4: New subsample score 1.4 is not better than old score 1.7999999999999998, skipping
Iteration 5: Selected program 0 score: 0.58

🆕 NEW PROMPT CANDIDATE DETECTED!
📝 Prompt (first 10000 chars): You are an assistant that helps Defensive NFL coaches prepare for facing a specific offense. Your role is to interpret user input, and leverage your available tools to better understand how offenses will approach certain situations. Answer users questions, and use your tools to extrapolate and provide additional relevant information as well. If no season is provided, assume 2024. For queries with a redzone parameter, ALWAYS pass FALSE for the redzone parameter unless the user explicitly asks about the redzone....
📏 Full length: 515 characters

0.6
0.4
0.6
Iteration 5: Proposed new text for users.wesley_pasfield.dcassistant: You are an assistant for Defensive NFL coaches preparing to face specific offenses. Your job is to interpret the coach’s question, choose the right data tools, a



0.8


GEPA Optimization:  67%|██████▋   | 50/75 [11:41<07:19, 17.58s/rollouts]

0.6
Iteration 5: New subsample score 1.6 is not better than old score 1.6, skipping
Iteration 6: Selected program 0 score: 0.58

🆕 NEW PROMPT CANDIDATE DETECTED!
📝 Prompt (first 10000 chars): You are an assistant that helps Defensive NFL coaches prepare for facing a specific offense. Your role is to interpret user input, and leverage your available tools to better understand how offenses will approach certain situations. Answer users questions, and use your tools to extrapolate and provide additional relevant information as well. If no season is provided, assume 2024. For queries with a redzone parameter, ALWAYS pass FALSE for the redzone parameter unless the user explicitly asks about the redzone....
📏 Full length: 515 characters

0.2
0.6
0.6
Iteration 6: Proposed new text for users.wesley_pasfield.dcassistant: You are a Defensive NFL Game-Planning Assistant for coordinators preparing to face specific offenses. Your job is to interpret the coach’s question, call the right data tools, 



0.6
0.6
Iteration 6: New subsample score 1.7999999999999998 is better than old score 1.4. Continue to full eval and add to candidate pool.
1.0
0.6




0.6




0.6

✅ Candidate #3 Average Score: 2.80/5.0




0.6




0.6




0.6
0.6




0.6
0.6




0.6




0.6
0.8
0.6
0.6
0.6
0.6
0.6
0.6


GEPA Optimization:  67%|██████▋   | 50/75 [15:57<07:58, 19.14s/rollouts]

0.6
Iteration 6: New program is on the linear pareto front
Iteration 6: Full valset score for new program: 0.63
Iteration 6: Full train_val score for new program: 0.63
Iteration 6: Individual valset scores for new program: [0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 1.0, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.8, 0.6]
Iteration 6: New valset pareto front scores: [0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 1.0, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.8, 0.6]
Iteration 6: Full valset pareto front score: 0.63
Iteration 6: Updated valset pareto front programs: [{0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {1}, {0, 1}, {1}, {0, 1}]
Iteration 6: Best valset aggregate score so far: 0.63
Iteration 6: Best program as per aggregate score on train_val: 1
Iteration 6: Best program as per aggregate score on valset: 1
Iteration 6: Best score on valset: 0.63
Iteration 6: Best score on train_val: 0.63
Iteration 6: 





OPTIMIZATION COMPLETE

Optimized template:
You are a Defensive NFL Game-Planning Assistant for coordinators preparing to face specific offenses. Your job is to interpret the coach’s question, call the right data tools, synthesize the outputs, and deliver concise, actionable defensive recommendations. Always assume season=2024 unless the user specifies another season. If a tool accepts a redzone parameter, ALWAYS pass FALSE unless the user explicitly asks about the red zone.

Core principles
- Be tool-driven: fetch data before concluding. Then translate numbers into concrete defensive tactics (fronts, pressures, coverages, leverage, brackets, adjustments by down/distance and personnel).
- Be precise and cautious with samples: call out sample sizes for any grouping <30 plays; avoid elevating very small-n findings; label them as exploratory only.
- Benchmark findings: compare within-shell/rush-number groupings and against the team’s other options in the same dataset. Where relevant, indi

In [0]:
# Register the optimized prompt as a new version in the prompt registry
# (the production alias is assigned separately, after the promotion check)

system_prompt = mlflow.genai.register_prompt(
    name=PROMPT_NAME,
    template=optimized_prompt.template,
    commit_message=f"Post optimize_prompts optimization using {ALIGNED_JUDGE_NAME}",
    tags={"finalscore": str(result.final_eval_score), "optimization": "GEPA", "judge": ALIGNED_JUDGE_NAME},
    )

print(system_prompt)

# Optional - Execute Another Evaluation Job to compare performance

- The GEPA optimization process validates that the new prompt exceeds the performance of the prior prompt
- If you have multiple judges or a separate evaluation set, you can rerun evaluation jobs with both the old prompt and the prompt produced by GEPA
- The general principle is to promote a new version of the prompt only if it outperforms the prior prompt
- In this example, we simply promote the prompt to the production alias based on its superior performance in the GEPA prompt optimization step
- After registering the new prompt, the serving endpoint probably needs to be redeployed as well (create a new version of the endpoint and deploy it to production); confirm this for your own deployment
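
The promotion rule above can be sketched as a small gate function. This is only an illustration: `decide_prod_gate` and `min_gain` are hypothetical names that are not part of MLflow or this notebook, and the two scores are the aggregate validation scores reported in the GEPA log (0.58 for the original prompt, 0.63 for the new candidate). It returns the string `'prod'` only when the candidate beats the baseline by more than the margin, which is the value the promotion cell below checks for.

```python
def decide_prod_gate(baseline_score: float, candidate_score: float, min_gain: float = 0.0) -> str:
    """Return 'prod' if the candidate beats the baseline by more than min_gain, else 'hold'.

    Hypothetical helper for illustration; min_gain is an assumed safety margin.
    """
    if candidate_score - baseline_score > min_gain:
        return "prod"
    return "hold"


# Aggregate validation scores taken from the GEPA log output above
baseline_score = 0.58   # original prompt (program 0)
candidate_score = 0.63  # best new program found by GEPA

gate = decide_prod_gate(baseline_score, candidate_score)
print(gate)  # prints "prod"
```

A non-zero `min_gain` is one way to avoid promoting a prompt on a difference small enough to be judge noise; choosing that margin is a judgment call for your own evaluation set.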

In [0]:
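def prompt_promotion(prompt_name, prod_gate, new_prompt):
  """Point the production alias at new_prompt when prod_gate is 'prod'."""
  if prod_gate == 'prod':
    mlflow.genai.set_prompt_alias(
        name=prompt_name,  # use the argument rather than the PROMPT_NAME global
        alias="production",
        version=new_prompt.version
    )
    print(f"Registered {prompt_name} as production version {new_prompt.version}")
  else:
    print("No improvement in prompt score, production alias not updated")

# prod_var is assumed to be set earlier in the notebook ('prod' to promote)
prompt_promotion(PROMPT_NAME, prod_var, system_prompt)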
