# MLflow 3.3 Evaluation API Demo

## Setup

First, we will import the necessary libraries and configure the OpenAI client.

In [20]:
from dotenv import load_dotenv

import os
import openai
import mlflow
from mlflow.genai.scorers import Correctness, Guidelines

load_dotenv()

client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

Next, start the tracking server by running the following command in a separate terminal:

```bash
mlflow ui --port 5000
```

Then, we point MLflow at the tracking server and create a new experiment for logging these evaluations.

In [21]:
mlflow.set_tracking_uri('http://localhost:5000')

mlflow.set_experiment("mlflow-genai-eval")

<Experiment: artifact_location='mlflow-artifacts:/653076567649131449', creation_time=1755125128603, experiment_id='653076567649131449', last_update_time=1755125128603, lifecycle_stage='active', name='mlflow-genai-eval', tags={}>

## Define Evaluation Dataset

Next, we define a simple evaluation dataset. Each record contains only `inputs` and `expectations` (the expected outputs). This list of dictionaries is just one option: the evaluation data can be structured in other ways, such as a pandas DataFrame, and we can even create an evaluation set from previously generated traces.

In [23]:
dataset = [
    {
        "inputs": {"question": "Can MLflow manage prompts?"},
        "expectations": {"expected_response": "Yes!"},
    },
    {
        "inputs": {"question": "Can MLflow create a taco for my lunch?"},
        "expectations": {
            "expected_response": "No, unfortunately, MLflow is not a taco maker."
        },
    },
]
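
Because each record is a plain dictionary, its layout can be checked before an evaluation is run. The sketch below is a minimal illustration, assuming a hypothetical helper named `check_records` that is not part of MLflow. It only verifies that every record has a non-empty `inputs` dictionary and that `expectations`, when present, is also a dictionary.

```python
# Hypothetical helper (not an MLflow API): sanity-check evaluation records
# that follow the same layout as the `dataset` list above.
def check_records(records):
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    problems = []
    for i, record in enumerate(records):
        inputs = record.get("inputs")
        if not isinstance(inputs, dict) or not inputs:
            problems.append(f"record {i}: 'inputs' must be a non-empty dict")
        # `expectations` is treated as optional here (an assumption of this sketch).
        expectations = record.get("expectations", {})
        if not isinstance(expectations, dict):
            problems.append(f"record {i}: 'expectations' must be a dict")
    return problems


sample = [
    {
        "inputs": {"question": "Can MLflow manage prompts?"},
        "expectations": {"expected_response": "Yes!"},
    },
    {"inputs": "Can MLflow create a taco for my lunch?"},  # malformed on purpose
]

print(check_records(sample))
```

Running a check like this first can surface a malformed record before any model or judge calls are spent on it.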

## Define Predict Function

Next, we define our `predict_fn`. This is the function that is used to generate predictions. Note how its argument (`question`) matches the key inside `inputs` in our dataset.

In [24]:
def predict_fn(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5-nano", messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content
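
The name matching matters because the keys of each record's `inputs` are passed to `predict_fn` as keyword arguments. The sketch below imitates that dispatch locally. It assumes a hypothetical stand-in, `fake_predict_fn`, so that no OpenAI call is made, and a hypothetical `call_with_inputs` helper. It illustrates the idea and is not MLflow's implementation.

```python
import inspect


def fake_predict_fn(question: str) -> str:
    # Stand-in for the real predict_fn: returns a canned string instead of calling a model.
    return f"(answer to: {question})"


def call_with_inputs(fn, inputs):
    """Call fn(**inputs) after checking that the keys match fn's parameter names."""
    params = set(inspect.signature(fn).parameters)
    unexpected = set(inputs) - params
    if unexpected:
        raise TypeError(
            f"inputs keys {sorted(unexpected)} do not match parameters {sorted(params)}"
        )
    return fn(**inputs)


# Keys match the parameter name, so the call goes through.
print(call_with_inputs(fake_predict_fn, {"question": "Can MLflow manage prompts?"}))

# A misspelled key ("questions") is rejected before the function is called.
try:
    call_with_inputs(fake_predict_fn, {"questions": "typo in the key"})
except TypeError as err:
    print("mismatch:", err)
```

If a dataset used a different key, such as `prompt`, the function signature would need to change to `predict_fn(prompt: str)` accordingly.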

## Run the Evaluation

Finally, we run the evaluation. We will use the built-in `Correctness` scorer, along with two custom `Guidelines` scorers. `Guidelines` makes it easy to add custom LLM-based true/false criteria written in natural language.

In [25]:
mlflow.openai.autolog()

results = mlflow.genai.evaluate(
    data=dataset,
    predict_fn=predict_fn,
    scorers=[
        # Built-in LLM judge
        Correctness(),
        # Custom criteria using LLM judge
        Guidelines(name="is_english", guidelines="The answer must be in English"),
        Guidelines(name="is_concise", guidelines="The answer must answer the question concisely, without excessive explanation or followup questions."),
    ],
)

2025/08/14 09:26:00 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset.


Evaluating: 100%|██████████| 2/2 [Elapsed: 00:19, Remaining: 00:00]

🏃 View run aged-hen-421 at: http://localhost:5000/#/experiments/653076567649131449/runs/965f93bfcb964b54a53a8dc17521bf0b
🧪 View experiment at: http://localhost:5000/#/experiments/653076567649131449

## View results in MLflow UI

Now that we have run the evaluation, we can view the results in the MLflow UI by opening the run URL printed in the output above.