# Prompt Evaluation and Tracing with MLflow
This notebook is designed to help you evaluate and trace prompts using MLflow. It provides a structured way to log prompt evaluations, track their performance, and visualize the results over time.

Key Benefits of MLflow Prompt Evaluation
* Effective Evaluation: `MLflow's LLM Evaluation API provides a simple and consistent way to evaluate prompts across different models and datasets without writing boilerplate code.
* Compare Results: Compare evaluation results with ease in the MLflow UI.
* Tracking Results: Track evaluation results in MLflow Experiment to maintain the history of prompt performance and different evaluation settings.
* Tracing: Inspect model behavior during inference deeply with traces generated during evaluation.

In [None]:
!python -m pip install pandas mlflow evaluate litellm textstat --quiet
!python -m pip install dspy google-genai --quiet


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


# Make sure to RESTART the kernel after installing new packages.

#

In [2]:
import mlflow
mlflow.set_tracking_uri("http://20.75.92.162:5000/")

In [3]:
from google import genai
client = genai.Client()

### 1. Create a new prompt

In [4]:
import mlflow

# Use double curly braces for variables in the template
initial_template = """\
Summarize content you are provided with in {{ num_sentences }} sentences.

Sentences: {{ sentences }}
"""

# Register a new prompt
prompt = mlflow.register_prompt(
    name="b6-gcp-anshupandey-summarization-prompt",
    template=initial_template,
    # Optional: Provide a commit message to describe the changes
    commit_message="Initial commit",
)

# The prompt object contains information about the registered prompt
print(f"Created prompt '{prompt.name}' (version {prompt.version})")

  prompt = mlflow.register_prompt(
2025/11/03 09:21:56 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: b6-gcp-anshupandey-summarization-prompt, version 1


Created prompt 'b6-gcp-anshupandey-summarization-prompt' (version 1)


### 2. Prepare Evaluation Data

In [5]:
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "Artificial intelligence has transformed how businesses operate in the 21st century. Companies are leveraging AI for everything from customer service to supply chain optimization. The technology enables automation of routine tasks, freeing human workers for more creative endeavors. However, concerns about job displacement and ethical implications remain significant. Many experts argue that AI will ultimately create more jobs than it eliminates, though the transition may be challenging.",
            "Climate change continues to affect ecosystems worldwide at an alarming rate. Rising global temperatures have led to more frequent extreme weather events including hurricanes, floods, and wildfires. Polar ice caps are melting faster than predicted, contributing to sea level rise that threatens coastal communities. Scientists warn that without immediate and dramatic reductions in greenhouse gas emissions, many of these changes may become irreversible. International cooperation remains essential but politically challenging.",
            "The human genome project was completed in 2003 after 13 years of international collaborative research. It successfully mapped all of the genes of the human genome, approximately 20,000-25,000 genes in total. The project cost nearly $3 billion but has enabled countless medical advances and spawned new fields like pharmacogenomics. The knowledge gained has dramatically improved our understanding of genetic diseases and opened pathways to personalized medicine. Today, a complete human genome can be sequenced in under a day for about $1,000.",
            "Remote work adoption accelerated dramatically during the COVID-19 pandemic. Organizations that had previously resisted flexible work arrangements were forced to implement digital collaboration tools and virtual workflows. Many companies reported surprising productivity gains, though concerns about company culture and collaboration persisted. After the pandemic, a hybrid model emerged as the preferred approach for many businesses, combining in-office and remote work. This shift has profound implications for urban planning, commercial real estate, and work-life balance.",
            "Quantum computing represents a fundamental shift in computational capability. Unlike classical computers that use bits as either 0 or 1, quantum computers use quantum bits or qubits that can exist in multiple states simultaneously. This property, known as superposition, theoretically allows quantum computers to solve certain problems exponentially faster than classical computers. Major technology companies and governments are investing billions in quantum research. Fields like cryptography, material science, and drug discovery are expected to be revolutionized once quantum computers reach practical scale.",
        ],
        "targets": [
            "AI has revolutionized business operations through automation and optimization, though ethical concerns about job displacement persist alongside predictions that AI will ultimately create more employment opportunities than it eliminates.",
            "Climate change is causing accelerating environmental damage through extreme weather events and melting ice caps, with scientists warning that without immediate reduction in greenhouse gas emissions, many changes may become irreversible.",
            "The Human Genome Project, completed in 2003, mapped approximately 20,000-25,000 human genes at a cost of $3 billion, enabling medical advances, improving understanding of genetic diseases, and establishing the foundation for personalized medicine.",
            "The COVID-19 pandemic forced widespread adoption of remote work, revealing unexpected productivity benefits despite collaboration challenges, and resulting in a hybrid work model that impacts urban planning, real estate, and work-life balance.",
            "Quantum computing uses qubits existing in multiple simultaneous states to potentially solve certain problems exponentially faster than classical computers, with major investment from tech companies and governments anticipating revolutionary applications in cryptography, materials science, and pharmaceutical research.",
        ],
    }
)

### 3. Define prediction function

In [6]:


def predict(data: pd.DataFrame) -> list[str]:
    predictions = []
    prompt = mlflow.genai.load_prompt("prompts:/b6-gcp-anshupandey-summarization-prompt/1")

    for _, row in data.iterrows():
        # Fill in variables in the prompt template
        content = prompt.format(sentences=row["inputs"], num_sentences=1)

        completion = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=content,
        )
        predictions.append(completion.text)

    return predictions

### 4. Run Evaluation

In [7]:
mlflow.set_experiment('b6-gcp-testing')

2025/11/03 09:24:13 INFO mlflow.tracking.fluent: Experiment with name 'b6-gcp-testing' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/592752536522566076', creation_time=1762142054149, experiment_id='592752536522566076', last_update_time=1762142054149, lifecycle_stage='active', name='b6-gcp-testing', tags={}>

In [8]:

with mlflow.start_run(run_name="anshu-prompt-evaluation"):
    mlflow.log_param("model", "gemini-2.0-flash")

    results = mlflow.evaluate(
        model=predict,
        data=eval_data,
        targets="targets",
        extra_metrics=[
            mlflow.metrics.latency(),
            mlflow.metrics.flesch_kincaid_grade_level(),
            mlflow.metrics.ari_grade_level(),
        ],
    )

2025/11/03 09:26:58 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.
2025/11/03 09:26:58 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/11/03 09:27:17 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


üèÉ View run anshu-prompt-evaluation at: http://20.75.92.162:5000/#/experiments/592752536522566076/runs/71706bab20ee48bf9ca87e9e4c303d42
üß™ View experiment at: http://20.75.92.162:5000/#/experiments/592752536522566076


### 5. View Results
<img src="https://mlflow.org/docs/latest/assets/images/prompt-evaluation-result-7c106f17187fdc750439725d086c389b.png" alt="MLflow LLM Evaluation UI" width="800">

<img src = "https://mlflow.org/docs/latest/assets/images/prompt-evaluation-chart-8a93612e37184b8279c699fd6640013d.png" >

TypeError: 'EvaluationResult' object is not iterable

## Prompt Optimization

In [13]:
import os
from typing import Any
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.genai.optimize import OptimizerConfig, LLMParams

In [14]:
mlflow.set_tracking_uri("http://20.75.92.162:5000/")

from google import genai
client = genai.Client()

In [15]:
# Register the initial prompt
initial_template = """
Answer to this math question: {{question}}.
Return the result in a JSON string in the format of {"answer": "xxx"}.
"""

prompt = mlflow.genai.register_prompt(
    name="b6-gcp-anshu-math",
    template=initial_template,
)


2025/11/03 09:49:03 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: b6-gcp-anshu-math, version 1


In [16]:

# The data can be a list of dictionaries, a pandas DataFrame, or an mlflow.genai.EvaluationDataset
# It needs to contain inputs and expectations where each row is a dictionary.
train_data = [
    {
        "inputs": {"question": "Given that $y=3$, evaluate $(1+y)^y$."},
        "expectations": {"answer": "64"},
    },
    {
        "inputs": {
            "question": "The midpoint of the line segment between $(x,y)$ and $(-9,1)$ is $(3,-5)$. Find $(x,y)$."
        },
        "expectations": {"answer": "(15,-11)"},
    },
    {
        "inputs": {
            "question": "What is the value of $b$ if $5^b + 5^b + 5^b + 5^b + 5^b = 625^{(b-1)}$? Express your answer as a common fraction."
        },
        "expectations": {"answer": "\\frac{5}{3}"},
    },
    {
        "inputs": {"question": "Evaluate the expression $a^3\\cdot a^2$ if $a= 5$."},
        "expectations": {"answer": "3125"},
    },
    {
        "inputs": {"question": "Evaluate $\\lceil 8.8 \\rceil+\\lceil -8.8 \\rceil$."},
        "expectations": {"answer": "17"},
    },
]

eval_data = [
    {
        "inputs": {
            "question": "The sum of 27 consecutive positive integers is $3^7$. What is their median?"
        },
        "expectations": {"answer": "81"},
    },
    {
        "inputs": {"question": "What is the value of $x$ if $x^2 - 10x + 25 = 0$?"},
        "expectations": {"answer": "5"},
    },
    {
        "inputs": {
            "question": "If $a\\ast b = 2a+5b-ab$, what is the value of $3\\ast10$?"
        },
        "expectations": {"answer": "26"},
    },
    {
        "inputs": {
            "question": "Given that $-4$ is a solution to $x^2 + bx -36 = 0$, what is the value of $b$?"
        },
        "expectations": {"answer": "-5"},
    },
]


In [39]:

# Define a custom scorer function to evaluate prompt performance with the @scorer decorator.
# The scorer function for optimization can take inputs, outputs, and expectations.
@scorer
def exact_match(expectations: dict[str, Any], outputs: dict[str, Any]) -> bool:
    print(expectations,outputs)
    if isinstance(expectations,dict) and isinstance(outputs,dict):
        return expectations["answer"] == outputs["answer"]
    else:
        return expectations==outputs



In [40]:
prompt.uri

'prompts:/b6-gcp-anshu-math/1'

In [45]:
def predict(question: str) -> str:
    prompt = mlflow.genai.load_prompt("prompts:/b6-gcp-anshu-math/1")

    content = prompt.format(question=question)

    completion = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=content,)
    
    if '```' in completion.text:
        return completion.text[7:-3].strip()

    return completion.text

In [46]:
from mlflow.genai.optimize import GepaPromptOptimizer

In [47]:
result = mlflow.genai.optimize_prompts(predict_fn=predict,
                                       train_data=train_data,
                                       prompt_uris=["prompts:/b6-gcp-anshu-math/1"],
                                       optimizer=GepaPromptOptimizer(reflection_model="gemini:/gemini-2.0-flash"),
                                       scorers=[exact_match])

2025/11/03 10:13:18 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset.
  return _dataset_source_registry.resolve(
  return _dataset_source_registry.resolve(


{'answer': '(15,-11)'} {"answer": "21,-11"}
{'answer': '3125'} {"answer": "3125"}
{'answer': '17'} {"answer": "0"}
{'answer': '\\frac{5}{3}'} {"answer": "5/2"}
{'answer': '64'} {"answer": "64"}
Iteration 0: Base program full valset score: 0.0
Iteration 1: Selected program 0 score: 0.0
{'answer': '(15,-11)'} {"answer": "15,-11"}
{'answer': '17'} {"answer": "0"}
{'answer': '\\frac{5}{3}'} {"answer": "5/2"}
Iteration 1: Proposed new text for b6-gcp-anshu-math: Solve the provided math question. The answer should be returned as a JSON string with the key "answer". The value associated with "answer" should be the solution to the math problem. Ensure the answer is formatted correctly according to mathematical conventions. Here are specific formatting requirements to follow:

*   **Fractions:** Express fractions in their simplest form as common fractions (e.g., "5/3").
*   **Ceiling Function:** For ceiling function problems (denoted by $\lceil x \rceil$), the ceiling function returns the small

2025/11/03 10:16:13 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: b6-gcp-anshu-math, version 3


üèÉ View run bedecked-jay-261 at: http://20.75.92.162:5000/#/experiments/592752536522566076/runs/77fcb548d44a427cb52a42ca0b38e243
üß™ View experiment at: http://20.75.92.162:5000/#/experiments/592752536522566076
