<a href="https://colab.research.google.com/github/anshupandey/prompt_engineering/blob/main/code8_prompt_evaluation_tracing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Evaluation and Tracing with MLflow
This notebook is designed to help you evaluate and trace prompts using MLflow. It provides a structured way to log prompt evaluations, track their performance, and visualize the results over time.

Key benefits of MLflow prompt evaluation:
* Effective Evaluation: MLflow's LLM Evaluation API provides a simple, consistent way to evaluate prompts across different models and datasets without writing boilerplate code.
* Compare Results: Compare evaluation results side by side in the MLflow UI.
* Tracking Results: Track evaluation results in an MLflow Experiment to keep a history of prompt performance under different evaluation settings.
* Tracing: Inspect model behavior during inference in depth using the traces generated during evaluation.

In [1]:
!pip install mlflow openai --quiet

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.7/24.7 MB[0m [31m75.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m50.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m247.0/247.0 kB[0m [31m18.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m147.8/147.8 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m114.9/114.9 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.0/85.0 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m676.2/676.2 kB[0m [31m38.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m203.4/203.4 kB[0m [31m14.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [4]:
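import os
import mlflow
from openai import OpenAI

# Placeholder key: replace it with your own, or set OPENAI_API_KEY in the
# environment beforehand instead of hardcoding it in the notebook.
os.environ['OPENAI_API_KEY'] = "sk-proj-xx-xxx-xx-xx-xx"
# Point MLflow at the tracking server that stores prompts, runs, and traces.
mlflow.set_tracking_uri("http://20.80.248.210:5000/")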

### 1. Create a new prompt

In [5]:
import mlflow

# Use double curly braces for variables in the template
initial_template = """\
Summarize content you are provided with in {{ num_sentences }} sentences.

Sentences: {{ sentences }}
"""

# Register a new prompt
prompt = mlflow.register_prompt(
    name="anshu-summarization-prompt",
    template=initial_template,
    # Optional: Provide a commit message to describe the changes
    commit_message="Initial commit",
)

# The prompt object contains information about the registered prompt
print(f"Created prompt '{prompt.name}' (version {prompt.version})")

2025/07/18 12:02:53 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for prompt version to finish creation. Prompt name: anshu-summarization-prompt, version 3


Created prompt 'anshu-summarization-prompt' (version 3)
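
To make the double-curly-brace convention concrete, here is a minimal sketch of how `{{ name }}` placeholders can be filled in. The `render_template` helper is hypothetical and written only for illustration; it is not MLflow's implementation, and the prompt object's own `format` method (used later in this notebook) may behave differently at the edges.

```python
import re


# Hypothetical helper for illustration; MLflow's own formatting logic may differ.
def render_template(template: str, **variables: object) -> str:
    """Replace each {{ name }} placeholder with the matching keyword argument."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    # Allow optional whitespace inside the braces, as in "{{ num_sentences }}".
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "Summarize content you are provided with in {{ num_sentences }} sentences.\n"
    "\n"
    "Sentences: {{ sentences }}\n"
)
rendered = render_template(
    template, num_sentences=1, sentences="MLflow tracks experiments."
)
print(rendered)
```

Raising on a missing variable is a design choice of this sketch: it surfaces a misspelled placeholder immediately instead of sending a half-filled prompt to the model.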


### 2. Prepare Evaluation Data

In [6]:
import pandas as pd

eval_data = pd.DataFrame(
    {
        "inputs": [
            "Artificial intelligence has transformed how businesses operate in the 21st century. Companies are leveraging AI for everything from customer service to supply chain optimization. The technology enables automation of routine tasks, freeing human workers for more creative endeavors. However, concerns about job displacement and ethical implications remain significant. Many experts argue that AI will ultimately create more jobs than it eliminates, though the transition may be challenging.",
            "Climate change continues to affect ecosystems worldwide at an alarming rate. Rising global temperatures have led to more frequent extreme weather events including hurricanes, floods, and wildfires. Polar ice caps are melting faster than predicted, contributing to sea level rise that threatens coastal communities. Scientists warn that without immediate and dramatic reductions in greenhouse gas emissions, many of these changes may become irreversible. International cooperation remains essential but politically challenging.",
            "The human genome project was completed in 2003 after 13 years of international collaborative research. It successfully mapped all of the genes of the human genome, approximately 20,000-25,000 genes in total. The project cost nearly $3 billion but has enabled countless medical advances and spawned new fields like pharmacogenomics. The knowledge gained has dramatically improved our understanding of genetic diseases and opened pathways to personalized medicine. Today, a complete human genome can be sequenced in under a day for about $1,000.",
            "Remote work adoption accelerated dramatically during the COVID-19 pandemic. Organizations that had previously resisted flexible work arrangements were forced to implement digital collaboration tools and virtual workflows. Many companies reported surprising productivity gains, though concerns about company culture and collaboration persisted. After the pandemic, a hybrid model emerged as the preferred approach for many businesses, combining in-office and remote work. This shift has profound implications for urban planning, commercial real estate, and work-life balance.",
            "Quantum computing represents a fundamental shift in computational capability. Unlike classical computers that use bits as either 0 or 1, quantum computers use quantum bits or qubits that can exist in multiple states simultaneously. This property, known as superposition, theoretically allows quantum computers to solve certain problems exponentially faster than classical computers. Major technology companies and governments are investing billions in quantum research. Fields like cryptography, material science, and drug discovery are expected to be revolutionized once quantum computers reach practical scale.",
        ],
        "targets": [
            "AI has revolutionized business operations through automation and optimization, though ethical concerns about job displacement persist alongside predictions that AI will ultimately create more employment opportunities than it eliminates.",
            "Climate change is causing accelerating environmental damage through extreme weather events and melting ice caps, with scientists warning that without immediate reduction in greenhouse gas emissions, many changes may become irreversible.",
            "The Human Genome Project, completed in 2003, mapped approximately 20,000-25,000 human genes at a cost of $3 billion, enabling medical advances, improving understanding of genetic diseases, and establishing the foundation for personalized medicine.",
            "The COVID-19 pandemic forced widespread adoption of remote work, revealing unexpected productivity benefits despite collaboration challenges, and resulting in a hybrid work model that impacts urban planning, real estate, and work-life balance.",
            "Quantum computing uses qubits existing in multiple simultaneous states to potentially solve certain problems exponentially faster than classical computers, with major investment from tech companies and governments anticipating revolutionary applications in cryptography, materials science, and pharmaceutical research.",
        ],
    }
)
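
Before spending API calls on an evaluation, it can help to sanity-check the paired data. The sketch below uses a hypothetical `check_eval_pairs` helper (not part of MLflow or pandas) and a shortened, made-up sample rather than the rows above; it verifies that every input has a non-empty target and reports how long each target is relative to its input.

```python
# Hypothetical helper, not part of MLflow or pandas.
def check_eval_pairs(inputs: list[str], targets: list[str]) -> list[float]:
    """Validate input/target pairs; return each target's word count relative to its input."""
    if len(inputs) != len(targets):
        raise ValueError(f"{len(inputs)} inputs but {len(targets)} targets")
    ratios = []
    for i, (text, summary) in enumerate(zip(inputs, targets)):
        if not text.strip() or not summary.strip():
            raise ValueError(f"row {i} has an empty input or target")
        ratios.append(len(summary.split()) / len(text.split()))
    return ratios


# Illustrative sample only; shortened from the style of the evaluation rows.
sample_inputs = [
    "Remote work adoption accelerated dramatically during the pandemic. "
    "Many companies reported productivity gains."
]
sample_targets = ["The pandemic accelerated remote work."]
ratios = check_eval_pairs(sample_inputs, sample_targets)
print(ratios)
```

With the DataFrame from the cell above, the same check could be run as `check_eval_pairs(eval_data["inputs"].tolist(), eval_data["targets"].tolist())`.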

### 3. Define prediction function

In [7]:
client = OpenAI()

def predict(data: pd.DataFrame) -> list[str]:
    predictions = []
    prompt = mlflow.load_prompt("prompts:/anshu-summarization-prompt/3")

    for _, row in data.iterrows():
        # Fill in variables in the prompt template
        content = prompt.format(sentences=row["inputs"], num_sentences=1)
        completion = client.responses.create(
            model="gpt-4o-mini",
            input=[{"role": "user", "content": content}],
            temperature=0.1,
        )
        predictions.append(completion.output_text)

    return predictions
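
The contract that matters for evaluation is simple: the function receives the input rows and returns one prediction string per row, in the same order. The following sketch shows that contract offline, assuming a hypothetical `make_predict` factory and a stub `fake_complete` function in place of the OpenAI call, so it runs without an API key or a tracking server.

```python
from typing import Callable


# Hypothetical names throughout; `complete` stands in for the OpenAI call above.
def make_predict(complete: Callable[[str], str], num_sentences: int = 1):
    """Build a predict function that returns one summary per input row, in order."""

    def predict(rows: list[dict]) -> list[str]:
        predictions = []
        for row in rows:
            # Same wording as the registered template, filled in by hand here.
            content = (
                f"Summarize content you are provided with in {num_sentences} sentences.\n\n"
                f"Sentences: {row['inputs']}\n"
            )
            predictions.append(complete(content))
        return predictions

    return predict


def fake_complete(content: str) -> str:
    # Stub model: return the first sentence of the text after the "Sentences: " marker.
    text = content.split("Sentences: ", 1)[1].strip()
    return text.split(". ")[0].rstrip(".") + "."


predict_offline = make_predict(fake_complete)
outputs = predict_offline(
    [
        {"inputs": "AI is changing business. Concerns remain."},
        {"inputs": "Ice caps are melting. Sea levels rise."},
    ]
)
print(outputs)
```

Injecting the completion function keeps the prompt-building logic testable on its own; swapping the stub for a real client call should not require changing `predict`.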

### 4. Run Evaluation

In [10]:

with mlflow.start_run(run_name="anshu-prompt-evaluation"):
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.1)

    results = mlflow.evaluate(
        model=predict,
        data=eval_data,
        targets="targets",
        extra_metrics=[
            mlflow.metrics.latency(),
            # Use GPT-4o as the judge model for answer similarity. Other providers, such as
            # Anthropic, Bedrock, and Databricks, are also supported.
            mlflow.metrics.genai.answer_similarity(model="openai:/gpt-4o"),
            mlflow.metrics.toxicity(),
            mlflow.metrics.flesch_kincaid_grade_level()
        ],
    )

2025/07/18 12:27:32 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2025/07/18 12:27:40 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...





🏃 View run anshu-prompt-evaluation at: http://20.80.248.210:5000/#/experiments/0/runs/dbc4f300bb5a480b81e06b55c5d6c7ef
🧪 View experiment at: http://20.80.248.210:5000/#/experiments/0
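
Of the metrics requested above, the Flesch-Kincaid grade level is the easiest to reason about by hand: it combines average sentence length with average syllables per word, as 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below applies that formula with a deliberately crude syllable counter. It is an illustration under that assumption only; MLflow's built-in metric uses a dedicated readability implementation and will not match these numbers exactly.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, with a floor of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level of `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = flesch_kincaid_grade("The cat sat on the mat.")
dense = flesch_kincaid_grade(
    "Quantum computing theoretically revolutionizes cryptography "
    "and pharmaceutical discovery."
)
print(f"simple: {simple:.2f}, dense: {dense:.2f}")
```

A higher score means the text demands a higher reading level, so a one-sentence summary packed with long technical words can score well above the passage it summarizes.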


### 5. View Results
<img src="https://mlflow.org/docs/latest/assets/images/prompt-evaluation-result-7c106f17187fdc750439725d086c389b.png" alt="MLflow LLM Evaluation UI" width="800">

<img src="https://mlflow.org/docs/latest/assets/images/prompt-evaluation-chart-8a93612e37184b8279c699fd6640013d.png" alt="MLflow prompt evaluation chart">

## Tracing
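
Conceptually, a trace is a record of what went into a call, what came out, and how long it took. The sketch below illustrates that idea with a hypothetical `traced` decorator and an in-memory `TRACE_LOG`; it is not how MLflow implements tracing, and the autologging call in the next cell captures considerably more detail and stores it on the tracking server.

```python
import functools
import time

# Hypothetical in-memory trace store; MLflow records far richer span data than this.
TRACE_LOG: list[dict] = []


def traced(func):
    """Record the inputs, output, and wall-clock duration of each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        TRACE_LOG.append(
            {
                "name": func.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "duration_s": time.perf_counter() - start,
            }
        )
        return result

    return wrapper


@traced
def answer(question: str) -> str:
    # Stand-in for a model call.
    return f"Stub answer to: {question}"


answer("What is quantum computing?")
print(TRACE_LOG[0]["name"], "->", TRACE_LOG[0]["output"])
```
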

In [11]:
mlflow.openai.autolog()

with mlflow.start_run(run_name="anshu123"):
    response = client.responses.create(
        model="gpt-4o",
        input="What is quantum computing?",
    )
    print(response.output_text)

Quantum computing is an advanced field of computing that leverages the principles of quantum mechanics to process information in fundamentally different ways than classical computers. Here are some key aspects:

1. **Qubits**: Unlike classical bits, which can be either 0 or 1, qubits can exist in superpositions, where they hold both 0 and 1 simultaneously. This property allows quantum computers to process a vast amount of possibilities at once.

2. **Entanglement**: This is a quantum phenomenon where qubits become interconnected such that the state of one qubit can depend on the state of another, no matter how far apart they are. Entanglement enables complex computations that are difficult for classical computers.

3. **Quantum Gates**: Quantum computations are conducted using quantum gates, which manipulate qubits through quantum operations. These gates differ significantly from classical logic gates due to their ability to maintain superpositions and entanglements.

4. **Quantum Deco