https://github.com/mlflow/mlflow/pull/14678/files#diff-2af9f81cd25764f6674ae7394b905beb245d4d8fc52dc6e767607581af8aaee2

# Part 1: Autologging and Tracing

## MLflow for GenAI Guide Overview

This is part 1 of a step-by-step guide to using MLflow to experiment with and deploy your generative AI projects.

Generative AI projects tend to get complicated quickly. GenAI applications often have many components, including both deterministic functions and GenAI model calls. They often use models from multiple providers, and may need to handle different modalities (e.g., text, images, and audio). Testing and iterating on these projects often involves changes to multiple components, including prompts, models, and application logic, and it can be difficult to track and evaluate the impact of these changes.

MLflow helps to solve these problems by providing a suite of tools for tracing and visualizing all of your GenAI model calls, evaluating your models and applications, building application logic into custom models, tracking and versioning your models, and deploying your models to production.

The full guide has four parts and covers every phase of the process, from prototyping and informal experimentation through deployment.

In this first part of the guide, we will walk through the process of prototyping a GenAI application. We will build a social media style transfer system that generates new posts based on some context and instructions in the style of a set of provided example posts. Part 1 will focus on:

- Testing the viability of the project and recording those tests with MLflow tracing
- Informal comparison of different models and prompts

Subsequent sections will cover:

- More rigorous comparison and evaluation of the project with MLflow evaluation metrics
- Encapsulating the application logic in a custom model and registering it in the model registry
- Deploying the model to staging, evaluating it, and promoting it to production

MLflow offers many different tools and approaches that can be used at each step of the process. This guide will introduce some of those options and walk through the thought process behind choosing each particular approach.


## Prerequisites and Setup

This guide assumes you have a basic understanding of Python and the MLflow library. You will also need to have the OpenAI Python SDK installed.

Inside your Python environment, install the MLflow and OpenAI Python SDK packages:

In [0]:
!pip install mlflow openai


Start a new MLflow experiment with the following Python code:


In [0]:
import mlflow

mlflow.set_experiment("/Shared/genai-social")

<Experiment: artifact_location='dbfs:/databricks/mlflow-tracking/2056821910552801', creation_time=1744470614881, experiment_id='2056821910552801', last_update_time=1744470614881, lifecycle_stage='active', name='/Shared/genai-social', tags={'mlflow.experiment.sourceName': '/Shared/genai-social',
 'mlflow.experimentType': 'MLFLOW_EXPERIMENT',
 'mlflow.ownerEmail': 'basak@marvelousmlops.io',
 'mlflow.ownerId': '4485001664756549'}>

## Initial Prototyping with MLflow autologging and tracing

Before we have written any code, we generally have some idea we want to try out. In this case, we want to answer the question: "Can a GenAI model generate social media posts that match our brand voice?" More specifically, we want to see if we can use GenAI to generate posts for the MLflow LinkedIn account in a style that is consistent with the existing posts.

The very first step is to try out a few different prompts to see whether this is plausible. At this phase, we are often working in a notebook environment, and are not particularly systematic about structuring our code or recording formal experiments. Still, we might stumble upon some interesting ideas we want to apply later in the project, so having *some* system for recording our tests is helpful. To that end, we can use MLflow's autologging and tracing features to record our experiments.

We will use OpenAI's GPT-4o model for experimenting in this stage: it's powerful, reasonably inexpensive, easy to use, and many developers are already familiar with it.

### Choosing a Model Interface

MLflow offers many different ways to interface with the model—let's look at some of the options and decide which makes the most sense at this point.

1. `mlflow.openai.autolog()`: This approach will allow us to use the native OpenAI Python SDK and will automatically log traces of inputs, outputs, errors, and other information to MLflow.
2. A manually logged OpenAI model: Logging the model with `mlflow.openai.log_model()` allows us to [log a model with a custom input template](https://mlflow.org/docs/latest/llms/openai/guide/index.html#direct-openai-service-usage) but will require us to manually load the logged model to use it.
3. [MLflow AI Gateway](https://mlflow.org/docs/latest/llms/deployments/index.html): We could configure an AI gateway endpoint with multiple models and providers, making it easy to switch between models and providers.

At this point in the project, we are not especially concerned with logging specific model configurations or setting up a gateway endpoint to manage multiple models. We just want to get started quickly and record our tests. To that end, we will use the first option: `mlflow.openai.autolog()`. This approach requires just one line of code to enable, and otherwise lets us use the native OpenAI Python SDK for experimentation.

You might also wonder why we are using MLflow at all at this early point in the project. We could just use the OpenAI Python SDK—or even ChatGPT—to try out our ideas. One major benefit we have already alluded to is MLflow tracing. Tracing gives us a systematic way to record and learn from our experiments—capturing the exact prompts, parameters, and outputs of every model call. This helps us track what works (and what doesn't), debug issues, and preserve valuable discoveries that we might want to revisit later in the project.

## MLflow Tracing with Autologging

Enable autologging for the OpenAI Python SDK by running the following:

In [0]:
import mlflow

mlflow.openai.autolog()

In [0]:
from openai import OpenAI

API_KEY = dbutils.secrets.get(scope="mlflow_genai", key="OPENAI_API_KEY")
client = OpenAI(api_key=API_KEY)

Now, when we make a call to the OpenAI API, MLflow will automatically log the inputs, outputs, and other information with MLflow tracing:

In [0]:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the best ramen place in Amsterdam?"}],
)

Trace(request_id=tr-4682c34dfb5242e49e949da3824f17de)

In [0]:
trace = mlflow.get_last_active_trace()
mlflow.MlflowClient().set_trace_tag(trace.info.request_id, "model", "gpt-4o")

### What is a Trace?

MLflow tracing provides a record of every call to a model, including the inputs, outputs, errors, and other information. An individual trace is made up of:

- **Trace Info**: Metadata about the trace, like timing and status
- **Trace Data**: The actual content of the trace, including inputs, outputs, and any intermediate steps

You can learn much more about MLflow tracing [here](https://mlflow.org/docs/latest/llms/index.html).
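
To make the two parts of a trace concrete, here is a minimal sketch that pulls a few fields out of a trace object. `summarize_trace` is a hypothetical helper, not part of MLflow. The attribute names (`info.request_id`, `info.status`, `info.execution_time_ms`, `data.spans`) are assumed from MLflow 2.x and may differ in other versions, so the helper falls back to `None` when a field is missing. The demo uses a stand-in object so that it runs without a tracking server.

```python
from types import SimpleNamespace


def summarize_trace(trace):
    """Return a small dict describing a trace (hypothetical helper, not an MLflow API)."""
    info, data = trace.info, trace.data
    spans = getattr(data, "spans", None) or []
    return {
        # Trace Info: metadata such as the ID, status, and timing
        "request_id": getattr(info, "request_id", None),
        "status": getattr(info, "status", None),
        "execution_time_ms": getattr(info, "execution_time_ms", None),
        # Trace Data: the recorded steps (spans) of the call
        "span_names": [getattr(span, "name", None) for span in spans],
    }


# Stand-in shaped like a trace (illustration only)
demo_trace = SimpleNamespace(
    info=SimpleNamespace(request_id="tr-demo", status="OK", execution_time_ms=850),
    data=SimpleNamespace(spans=[SimpleNamespace(name="Completions")]),
)
demo_summary = summarize_trace(demo_trace)
print(demo_summary)

# In the notebook: summarize_trace(mlflow.get_last_active_trace())
```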


### Generating our First Post

Now that we have a basic system for logging our experiments, let's see if we can get a basic social media style transfer system working. As a quick reminder, we want to build a GenAI system that generates new LinkedIn posts based on some provided context and instructions in a style that is consistent with a set of provided example posts.

We will need:

- An example post to use as a style reference (eventually, we will use a set of multiple examples)
- A source document or website with content to use as a source of information for a new post
- A template prompt to tie the above information together
- A system prompt explaining the task to the model

Let's try the following. Feel free to experiment with your own prompts and ideas.


**Prompts:**

In [0]:
system_prompt_1 = """You are a social media content specialist who can precisely match writing styles. Your task is to:

1. Analyze the provided example post(s) to understand their style and tone
2. Generate a new LinkedIn post about the given topic that perfectly matches this style
3. Return only the generated post, nothing else.

"""

user_template = """

example posts:

{example_posts}

topic:

{topic}

additional instructions:

{additional_instructions}

"""

**Example Post:**

In [0]:
example_post = """MLflow's GenAI evaluation metrics now work as callable functions as of MLflow 2.17, making them easier to use and integrate.

Now you can use metrics like answer_relevance, answer_correctness, faithfulness, and toxicity directly as functions—no need to go through mlflow.evaluate() anymore if you're just prototyping with individual metrics or integrating metric calls into systems where mlflow.evaluate is not necessary.

This means:

🔍 Easier debugging during prototyping

🔌 More flexible integration options

🎯 Works with or without other MLflow features

Learn more:

📚 Docs: https://lnkd.in/gyBzcrDr

📝 Release notes: https://lnkd.in/gBrNQfFC

#MachineLearning #AI #LLMs #LLMOps #Evals"""

**New Post Topic:**

We will mostly get source information from the MLflow docs and blog posts. Let's write a quick helper function to get the text from a webpage and convert it to markdown. We will use the [`markdownify`](https://github.com/matthewwithanm/python-markdownify) library to do this—you can install it with `pip install markdownify`.


In [0]:
!pip install markdownify

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
import requests
from markdownify import markdownify

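def webpage_to_markdown(url):
    """Fetch a webpage and return its content converted to markdown."""
    # Get webpage content; fail early on HTTP errors instead of converting an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_content = response.text

    # Convert to markdown
    markdown_content = markdownify(html_content)

    return markdown_content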

Now we can use this function to get the text from the MLflow docs:


In [0]:
url = "https://mlflow.org/docs/latest/llms/chat-model-intro/index.html"
markdown_content = webpage_to_markdown(url)

We now have a system prompt, a template for our user prompt, an example post, and a source of information. Now we just need some custom instructions and a helper function to assemble the final set of messages to send to the model.

**Additional Instructions:**


In [0]:
additional_instructions = """This post will be written for the MLflow LinkedIn account.

Maintain a professional but approachable tone. Developers are the primary audience.

No marketing slop."""

**Prompt Formatting Helper Function:**


In [0]:
def generate_prompt(
    system, user_template, example_posts, topic, additional_instructions
):
    """Generate a prompt for the LLM based on the example posts, topic, and additional instructions."""
    example_posts = "\n".join(
        [f"Example {i+1}:\n{post}" for i, post in enumerate(example_posts)]
    )
    prompt = user_template.format(
        example_posts=example_posts,
        topic=topic,
        additional_instructions=additional_instructions,
    )

    formatted_prompt = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]

    return formatted_prompt
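
To see what this helper produces, here is a condensed, self-contained restatement of the same logic run on toy inputs. The name `build_messages`, the toy template, and the toy strings are placeholders for illustration, not the prompts used in this guide.

```python
def build_messages(system, user_template, example_posts, topic, additional_instructions):
    """Condensed restatement of generate_prompt above, for illustration only."""
    # Number the example posts and join them into a single block of text
    numbered = "\n".join(
        f"Example {i}:\n{post}" for i, post in enumerate(example_posts, start=1)
    )
    user_content = user_template.format(
        example_posts=numbered,
        topic=topic,
        additional_instructions=additional_instructions,
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_content},
    ]


toy_template = (
    "example posts:\n{example_posts}\n\ntopic:\n{topic}\n\n"
    "additional instructions:\n{additional_instructions}"
)
toy_messages = build_messages(
    system="Match the style of the examples.",
    user_template=toy_template,
    example_posts=["First post", "Second post"],
    topic="A plain string, not a list",
    additional_instructions="Keep it short.",
)
print(toy_messages[1]["content"])
```

Because `topic` is interpolated with `str.format`, it should be a string. Passing a list would insert the list's Python representation, brackets and escaped newlines included, into the prompt.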

In [0]:
messages = generate_prompt(system_prompt_1, user_template, [example_post], markdown_content, additional_instructions)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)

Trace(request_id=tr-da1e040687d849d2a043c25ecbe84d9c)

[Image: First Result]

This is a reasonable first result, and gives us some confidence that this project is feasible. If you're following along, we encourage you to try out your own prompts and ideas. What is working? What isn't? What hypotheses can you come up with that you would like to test more rigorously?
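
The generated post lives in the first choice of the chat completion. As a small sketch, a helper like the hypothetical `extract_post` below makes it easy to print results while iterating. The demo uses a stand-in object shaped like the SDK response so that it runs without an API call.

```python
from types import SimpleNamespace


def extract_post(response):
    """Return the text of the first choice of a chat completion response."""
    content = response.choices[0].message.content
    # `content` can be None (for example, when the model returns no text)
    return (content or "").strip()


# Stand-in with the same shape as an OpenAI chat completion (illustration only)
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="  New tutorial: ...\n"))]
)
print(extract_post(fake_response))

# In the notebook: print(extract_post(response))
```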

### Tagging Traces

Now, as we iterate on the prompts and helper functions, the results will automatically be logged to MLflow tracing. We can add tags to the traces to help us organize our experiments. For example, suppose we want to add tags to record the platform for which we are generating posts. We can add a tag to the last active trace as follows:


In [0]:
trace = mlflow.get_last_active_trace()
mlflow.MlflowClient().set_trace_tag(trace.info.request_id, "platform", "LinkedIn")

Note that we can also add tags to the trace in the UI. We can use the tags to [search and filter](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.search_traces) traces.
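
As a rough sketch of filtering by tag, the hypothetical helper `build_tag_filter` below assembles a filter string for `mlflow.search_traces`. The `tags.<key> = '<value>'` syntax and the `AND` combination are assumptions about MLflow's search filter grammar, so check the linked documentation for your MLflow version. Only simple keys and values without quotes are handled.

```python
def build_tag_filter(tags):
    """Build a filter string such as "tags.platform = 'LinkedIn'" from a dict of tags."""
    clauses = []
    for key, value in tags.items():
        if "'" in str(value):
            raise ValueError(f"Quotes in tag values are not handled by this sketch: {value!r}")
        clauses.append(f"tags.{key} = '{value}'")
    return " AND ".join(clauses)


filter_string = build_tag_filter({"platform": "LinkedIn", "model": "gpt-4o"})
print(filter_string)

# In the notebook, with traces already logged to the active experiment:
# import mlflow
# traces_df = mlflow.search_traces(filter_string=filter_string)
```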

### Tracing additional models

We might also be interested in trying out some different models. For example, we might want to try the Gemini 2.0 Flash model as an alternative to GPT-4o. We want to make sure that calls to new models are traced alongside our OpenAI traces. We have a couple of options for making that happen:

1. Use the [`google-generativeai`](https://ai.google.dev/api?lang=python) library and the `mlflow.gemini.autolog()` function to trace the Gemini model calls.
2. Use Gemini [via the OpenAI SDK](https://ai.google.dev/gemini-api/docs/openai), so calls will be traced because of our existing autologging setup.

We will use the second option here—it ensures that our existing helper functions and prompt formatting will work and that our traces will be captured in the same format.


**Set up Gemini:**

We can set up an OpenAI client for Gemini as follows:

In [0]:
import os

os.environ["GEMINI_API_KEY"] = dbutils.secrets.get(scope="mlflow_genai", key="GEMINI_API_KEY")

gemini_client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

Note: [Gemini is now accessible from the OpenAI Library](https://developers.googleblog.com/en/gemini-is-now-accessible-from-the-openai-library/)


In [0]:
response = gemini_client.chat.completions.create(
    model="gemini-2.0-flash-exp",
    messages=messages,
)

Trace(request_id=tr-901881af8a4345f4a1b8c5be0148d143)

While we're at it, let's tag the latest trace with both the platform and the model. The model name is already captured in the trace, but tagging will make it easier to distinguish in the UI:


In [0]:
trace = mlflow.get_last_active_trace()
mlflow.MlflowClient().set_trace_tag(trace.info.request_id, "platform", "LinkedIn")
mlflow.MlflowClient().set_trace_tag(trace.info.request_id, "model", "gemini-2.0-flash-exp")

## Gemini Trace

From here, we can continue to try out different models and prompts, tagging traces to track our experiments across platforms, models, and prompts.
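
Since the same tagging calls recur after every experiment, they can be wrapped in a small helper. The sketch below is one possible shape. `tag_trace` and `RecordingClient` are hypothetical names, and the only MLflow call the helper relies on is `set_trace_tag`, as used above. The demo uses a stand-in client so that the block runs without a tracking server.

```python
class RecordingClient:
    """Stand-in for mlflow.MlflowClient that records calls (illustration only)."""

    def __init__(self):
        self.calls = []

    def set_trace_tag(self, request_id, key, value):
        self.calls.append((request_id, key, value))


def tag_trace(client, request_id, tags):
    """Apply every key/value pair in `tags` to the trace with the given ID."""
    for key, value in tags.items():
        client.set_trace_tag(request_id, key, str(value))


demo_client = RecordingClient()
tag_trace(demo_client, "tr-demo", {"platform": "LinkedIn", "model": "gemini-2.0-flash-exp"})
print(demo_client.calls)

# In the notebook:
# trace = mlflow.get_last_active_trace()
# tag_trace(mlflow.MlflowClient(), trace.info.request_id, {"platform": "LinkedIn"})
```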

While these informal experiments are great for developing intuitions and initial hypotheses, they don't provide a systematic way to evaluate our application. Throughout development, we'll want to make various changes to our system—but without rigorous evaluation, measuring their impact is mostly guesswork. Building out a proper evaluation suite will help us validate our current hypotheses and give us a framework for measuring future changes.


## Conclusion

In this first part of a four-part guide to using MLflow in a GenAI project, from conception through deployment, we prototyped our GenAI system and informally compared models and prompts. In particular, we saw how:

- MLflow autologging lets us capture traces of our early informal experiments with just one line of code: `mlflow.openai.autolog()`.
- MLflow tracing lets us record and learn from informal experiments.
- We can compare models from different providers and tag their respective traces.

The next section of this guide shows how to build on our informal experiments and develop a more rigorous evaluation suite.


# Part 2: Structured Evaluation


## MLflow for GenAI Guide Overview

Now that we have some promising initial results, it's time to move from informal experimentation to systematic evaluation. Our goal is to build an evaluation framework that will help us:

- Compare different models and prompts quantitatively
- Measure key qualities like factual accuracy and style consistency
- Create benchmarks we can use to measure future improvements

We will use MLflow's evaluation tools to build this framework and run our first formal comparison of candidate systems.

## What are we trying to evaluate?

The first step in evaluating our system is to formalize our questions and hypotheses. This is important: don't start coding an evaluation suite until you know what you want to test. Here are some questions you might consider before starting:

- Which variable(s) are we holding constant? Which are we changing? For example: we might choose to use the same set of example posts while varying the prompt or model.
- What does it mean for one generated post to be "better" than another? If you were to design a rubric for comparing the quality of the generated posts, what would it look like?
- What are the most important issues to avoid when generating posts? For example, we probably want to make sure to avoid generating posts that contain factual errors or harmful/toxic content.

Let's suppose our testing has identified two candidate prompts. Furthermore, we want to compare the performance of the Gemini 2.0 Flash model to the GPT-4o model. How do we want to judge which prompt and model is better? While there could be many different ways to evaluate our system, let's start with a few basics:

1. We want to make sure the generated posts are grounded in the source information. All factual information in the generated post should be present in the source information.
2. We want to make sure the generated posts do not contain any harmful or toxic content. Though it is unlikely that a model will generate harmful content on purpose, it's still worth checking for.
3. We want to make sure the generated posts are in the style of the example posts.

This should be enough to get started. Over time, as we get more data and feedback, we can refine our evaluation suite.

## Mapping our Evaluation Criteria to Metrics

[Image: Mapping evaluation criteria to metrics]

Now that we know what criteria we want to use to compare our generated posts, we can use MLflow to build our evaluation suite using [MLflow's LLM Evaluation features](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html).

The first step is to map our evaluation criteria to MLflow metrics. An [MLflow metric](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html#llm-evaluation-metrics) is either a heuristic-based function that calculates a numerical score (like ROUGE or BLEU) or an LLM-as-a-Judge metric that uses another LLM to evaluate and score model outputs. We will largely rely on LLM-as-a-Judge metrics because they can better evaluate qualities like writing style, factual consistency, and tone that are crucial for social media content.
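
To illustrate what a heuristic-based metric looks like, here is a simplified unigram-overlap score in the spirit of ROUGE-1 recall. It is a teaching sketch with naive whitespace tokenization, not the implementation MLflow uses for its built-in metrics, and `unigram_recall` is a hypothetical name.

```python
from collections import Counter


def unigram_recall(prediction, reference):
    """Fraction of reference words that also appear in the prediction (clipped counts)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not ref_counts:
        return 0.0
    # Each reference word is matched at most as many times as it occurs in the prediction
    overlap = sum(min(count, pred_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


score = unigram_recall(
    prediction="mlflow traces every model call",
    reference="mlflow traces model calls",
)
print(score)
```

A score like this is cheap and deterministic, but it only counts shared words. It cannot tell whether two posts share a tone, which is where a judge model becomes useful.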

To check for groundedness and toxicity, we can use the built-in [faithfulness](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.faithfulness) and [toxicity](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html?highlight=toxicity#mlflow.metrics.toxicity) metrics. There is no built-in metric for style similarity, so we can create our own.

Ultimately, we will want to set up our evaluation suite to run all at once with `mlflow.evaluate()`, but we can start by testing our metrics one at a time. MLflow metrics work as Python callable functions, making it easy to test them individually as we are configuring our evaluation suite. Let's work through them one at a time.

### Faithfulness

*Faithfulness* checks whether the generated post is grounded in the source information. We could also consider the [answer_correctness](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_correctness) metric, but there is a subtle difference between the two: *faithfulness* measures whether the output is supported by the provided context, while *answer_correctness* measures whether the output is correct relative to a target answer. We are more interested in the former, so we will use the `faithfulness` metric.

Here's how it works. We will use GPT-4o as the judge model, passed explicitly through the `model` argument, though you can [choose your preferred model to use as a judge](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html#selecting-the-judge-model).



In [0]:
!pip install aiohttp

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
%restart_python

In [0]:
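import os

from mlflow.metrics.genai import faithfulness

# The judge model reads the OpenAI API key from the environment
os.environ["OPENAI_API_KEY"] = dbutils.secrets.get(scope="mlflow_genai", key="OPENAI_API_KEY")

faithfulness_metric = faithfulness(model="openai:/gpt-4o")

# `response`, `additional_instructions`, and `markdown_content` were defined in Part 1;
# re-run those cells first if the kernel was restarted above
result = response.choices[0].message.content

print(
    faithfulness_metric(
        predictions=result,
        inputs=additional_instructions,  # ignored
        context=markdown_content,
    )
)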

MetricValue(scores=[5], justifications=["All claims in the output are directly supported by the provided context. The output accurately describes the features and benefits of MLflow's `ChatModel` class, its comparison with `PythonModel`, and the tutorial's content, all of which are consistent with the context provided. Therefore, the faithfulness score is 5."], aggregate_results={'mean': 5.0, 'variance': 0.0, 'p90': 5.0})
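
The returned `MetricValue` exposes `scores`, `justifications`, and `aggregate_results`, as the output above shows. As a small sketch, the hypothetical helper `low_scores` below flags low-scoring outputs together with the judge's reasoning. The threshold of 4 is an arbitrary choice for illustration, and the demo uses a stand-in object so that it runs without calling a judge model.

```python
from types import SimpleNamespace


def low_scores(metric_value, min_score=4):
    """Return (index, score, justification) for each score below `min_score`."""
    return [
        (i, score, justification)
        for i, (score, justification) in enumerate(
            zip(metric_value.scores, metric_value.justifications)
        )
        # Skip entries without a score (for example, if judging failed for that row)
        if score is not None and score < min_score
    ]


# Stand-in with the same attributes as a MetricValue (illustration only)
demo_value = SimpleNamespace(
    scores=[5, 2],
    justifications=["All claims are supported.", "Two claims are not in the context."],
    aggregate_results={"mean": 3.5},
)
print(low_scores(demo_value))

# In the notebook: pass the value returned by faithfulness_metric(...) to low_scores
```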


### Toxicity

We can use `toxicity` similarly. Note that we do not pass a model to the `toxicity` metric: it uses the specialized `facebook/roberta-hate-speech-dynabench-r4-target` model for toxicity detection.

In [0]:
from mlflow.metrics import toxicity

toxicity_metric = toxicity()

toxicity_score = toxicity_metric(predictions=result)
# toxicity_score.aggregate_results["mean"]



In [0]:
toxicity_score

### Style Similarity

For style similarity, we need to create a custom metric. We will use the `make_genai_metric` function to create a custom metric that compares the generated post to a set of example posts and returns a score between 0 and 5.

The first step in creating an effective metric is to define some examples with inputs, scores, and justifications. We will use the `EvaluationExample` class to define our examples.

First, let's copy over some more example posts from the [MLflow LinkedIn account](https://www.linkedin.com/company/mlflow-org/posts/). We will use these posts throughout the rest of this guide (recall: we will be keeping the examples fixed but varying the prompt and model).

In [0]:
post_example_1 = """MLflow's GenAI evaluation metrics now work as callable functions as of MLflow 2.17, making them easier to use and integrate.
Now you can use metrics like answer_relevance, answer_correctness, faithfulness, and toxicity directly as functions—no need to go through mlflow.evaluate() anymore if you're just prototyping with individual metrics or integrating metric calls into systems where mlflow.evaluate is not necessary.
This means:
🔍 Easier debugging during prototyping
🔌 More flexible integration options
🎯 Works with or without other MLflow features
Check it out in action ⬇️
Learn more:
📚 Docs: https://lnkd.in/gyBzcrDr
📝 Release notes: https://lnkd.in/gBrNQfFC
#MachineLearning #AI #LLMs #LLMOps #Evals"""

post_example_2 = """If you're already building with Python ML libraries, adding mlflow.autolog() to your code instantly gives you production-grade experiment tracking, model management, and observability—no extra infrastructure or code changes needed.
The automatic logging works across a remarkable breadth of libraries—from GenAI frameworks like LangChain, OpenAI, and LlamaIndex to traditional ML and deep learning libraries like PyTorch, scikit-learn, and Fastai.
MLflow's autolog feature changes this equation. With a single line—mlflow.autolog()—you get automatic logging of:
📊 Training metrics and parameters for scikit-learn, PyTorch, many other ML frameworks
🔄 LLM traces, prompts, responses, and tool calls for OpenAI and LangChain
📦 Model signatures and artifacts
💾 Dataset information and example inputs
The best part is that it works out of the box with the most popular libraries in the Python ML ecosystem: no need to modify your existing training code or add manual logging statements.
Read more: https://lnkd.in/e_aTp6HH
#machinelearning #mlops #ai #llmops"""

post_example_3 = """New tutorial: Step-by-step guide to building a tool-calling LLM application using MLflow's ChatModel wrapper and tracing system.
This tutorial shows you how to:
🔧 Create a tool-calling model using mlflow.pyfunc.ChatModel
🔄 Implement OpenAI function calling with automatic input/output handling
🔍 Add comprehensive tracing to debug multi-step LLM interactions
🚀 Deploy your model with full MLflow lifecycle management
The guide includes a practical example building a weather information agent, showing how ChatModel simplifies complex LLM patterns while providing enterprise-grade observability.
Check out the complete tutorial here: https://lnkd.in/gdTw8N2S
#MLOps #AIEngineering #LLMOps #AI"""

example_posts = [post_example_1, post_example_2, post_example_3]


In [0]:
similar_post = """
MLflow's ChatModel and PythonModel classes serve different needs when deploying GenAI applications. Here's when to use each:
ChatModel simplifies GenAI deployment with standardized OpenAI-compatible interfaces. This means:
🔗 Immediate compatibility with existing OpenAI-based tools and workflows
🚀 Pre-defined model signatures that work out of the box
📊 Streamlined integration with MLflow's tracking and evaluation features
PythonModel is your choice when you need complete control over:
🛠️ Custom input/output schemas for specialized use cases
🔄 Complex data transformations beyond standard chat patterns
⚙️ Fine-grained model behavior and deployment configurations
For most conversational AI applications, ChatModel's standardized approach helps you avoid common deployment pitfalls while maintaining consistent interfaces across your GenAI services. Consider PythonModel when your use case requires specialized data handling or custom interaction patterns.
See the comment below for links to in-depth tutorials on ChatModel 👇 
#MLflow #LLMOps #MachineLearning #GenAI"""

In [0]:
dissimilar_post = """
🔥 HOLY MOLY! MLflow Just Dropped Something INSANE for AI Deployment! 🤯
TWO EPIC WAYS to deploy your next-gen AI:
1️⃣ ChatModel: The No-BS Fast Track!
INSTANT OpenAI compatibility 🤝
Zero headaches, works RIGHT NOW 🚀
All the tracking & metrics you're craving 📈
2️⃣ PythonModel: For When You Need to GO WILD!
Customize EVERYTHING 🎨
Transform data like a BOSS 💪
Ultimate control = Ultimate POWER! ⚡️
Don't sleep on this update! Your AI deployment game is about to get ABSOLUTELY CRACKED! 🚀✨
#MLflowGang #AIrevolution #FutureIsNow #TechTwitter
"""

Now we can set up two `EvaluationExample` objects, one for the similar post and one for the dissimilar post. These need to include the input, output, examples, score, and justification.

Note that we provide examples via the `grading_context` argument. Furthermore, we're just passing the "additional instructions" as the input. These are not relevant to the style similarity metric. We want to be careful not to send the full message history or the source information as this could impair the metric's ability to evaluate the style similarity.


In [0]:
from mlflow.metrics.genai import EvaluationExample

evaluation_example_1 = EvaluationExample(
    input=additional_instructions,
    output=similar_post,
    grading_context={"examples": example_posts},
    score=5,
    justification="This post is a perfect match to the style of the example posts."
)

evaluation_example_2 = EvaluationExample(
    input=additional_instructions,
    output=dissimilar_post,
    grading_context={"examples": example_posts},
    score=1,
    justification=("The post earns a 1/5 for maintaining the basic bullet-point structure and use of emojis, "
                   "but significantly overplays the informal tone with phrases like 'HOLY MOLY!' and 'ABSOLUTELY CRACKED' "
                   "that aren't present in the examples. While the example posts balance professional "
                   "enthusiasm with technical detail, this submission sacrifices information density for "
                   "excessive hype and casual language that goes well beyond the controlled informality shown "
                   "in the reference posts."
                   )
)

Now that we have our examples, we can use the [`make_genai_metric`](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.make_genai_metric) function to create a custom metric. There are a few key components we need to provide in order to define a new GenAI metric:

- a definition, which describes the basic intent of the metric
- a grading prompt, which describes the scoring system and provides any other necessary notes
- the evaluation examples, which we created above

In this case, we also set a custom model: we're using the Anthropic Claude 3.5 Sonnet model for this metric.

In [0]:
from mlflow.metrics.genai import make_genai_metric

style_similarity_metric = make_genai_metric(
    name="style_similarity",
    definition=(
        "Style similarity measures how well a generated social media post matches the style, tone, "
        "and vocabulary of provided example posts. This includes analyzing the similarity of tone, "
        "word choice, punctuation, sentence structure, and stylistic elements like hashtags and emojis. "
        "Content similarity should not factor into the style similarity score. Post length is of minimal importance."
    ),
    grading_prompt=(
        "Style Similarity: Score the generated post's similarity to the example posts on a scale from 0 to 5:\n"
        "- Score 0: No stylistic similarity at all\n"
        "- Score 1: Minimal stylistic similarity\n"
        "- Score 2: Some stylistic elements match but significant differences exist\n"
        "- Score 3: Moderate stylistic similarity in tone, vocabulary, or structure\n"
        "- Score 4: High stylistic similarity across most elements\n"
        "- Score 5: Could be written by the same author\n\n"
        "Consider:\n"
        "- Tone: similarity in voice and attitude\n"
        "- Vocabulary: similarity in word choice and complexity\n"
        "- Style: similarity in punctuation, sentence structure, hashtags, and emojis"
    ),
    examples=[evaluation_example_1, evaluation_example_2],
    version="v1",
    model="anthropic:/claude-3-5-sonnet-20241022",
    parameters={"temperature": 0.0, "max_tokens": 1000},
    aggregations=["mean", "variance", "p90"],
    grading_context_columns=["examples"],
    greater_is_better=True
)

In [0]:
import os
os.environ["ANTHROPIC_API_KEY"] = dbutils.secrets.get(scope="mlflow_genai", key="ANTHROPIC_API_KEY")


In [0]:
too_formal_example = """MLflow has introduced distinct deployment paradigms through its ChatModel and PythonModel classes, each serving specific implementation requirements in GenAI applications.
ChatModel implements a standardized deployment framework utilizing OpenAI-compatible interfaces, offering several advantages:
Full compatibility with existing OpenAI infrastructure and workflows
Implementation of predefined model signatures ensuring immediate functionality
Seamless integration with MLflow's comprehensive tracking and evaluation systems
Conversely, PythonModel provides advanced customization capabilities for specialized requirements:
Implementation of bespoke input/output schemas
Advanced data transformation protocols beyond standard conversational patterns
Granular control over model behavior and deployment specifications
For standard conversational AI implementations, ChatModel's structured approach mitigates common deployment challenges while maintaining consistent interfaces across GenAI services. PythonModel remains the optimal choice for implementations requiring specialized data handling protocols or custom interaction patterns.
For detailed implementation guidelines, please refer to the accompanying documentation.
Reference: MLflow Documentation
"""

style_similarity_metric(
    predictions=too_formal_example,
    inputs=additional_instructions,
    examples=[example_posts],
)

MetricValue(scores=[2], justifications=['The output maintains some basic structural elements from the example posts (bullet points, technical content), but lacks many key stylistic elements present in the examples. It\'s missing emojis, hashtags, and the more engaging, direct tone of the examples. The writing is overly formal and documentation-like, using phrases like "conversely" and "mitigates common deployment challenges" where the examples use more approachable language. The post also lacks the clear call-to-action links and interactive elements ("Check it out ⬇️") that characterize the example posts.'], aggregate_results={'mean': 2.0, 'variance': 0.0, 'p90': 2.0})
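
The `aggregate_results` field summarizes the per-row scores using the aggregations we requested in `make_genai_metric`. As a rough illustration of what `mean`, `variance`, and `p90` represent, here is a standard-library sketch. The helper name `aggregate_scores` is hypothetical, and the use of population variance and linear-interpolation percentiles is an assumption about conventions rather than a copy of MLflow's implementation.

```python
import math
import statistics


def aggregate_scores(scores):
    """Summarize judge scores with mean, population variance, and 90th percentile."""
    ordered = sorted(scores)
    # Linear-interpolation percentile: 90% of the way through the sorted list.
    position = 0.9 * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    p90 = ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)
    return {
        "mean": float(statistics.fmean(ordered)),
        "variance": float(statistics.pvariance(ordered)),
        "p90": float(p90),
    }


# A single score, as in the call above, and a hypothetical multi-row case.
single = aggregate_scores([2])
several = aggregate_scores([2, 4, 4, 5])
```

With only one score, the aggregates carry no extra information; they become useful once the metric runs over a full evaluation dataset.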

#### Tying it all together with `mlflow.evaluate()`

Now that we have our metrics, it's time to structure our evaluation suite and run our experiments. As we mentioned earlier, we want to identify the best model and system prompt combination.

For the purposes of this guide, we will use our `webpage_to_markdown` function to generate a set of ten different sources for posts. We will then generate four different posts for each source, using two different prompts and two different models. We will save the results of our experiments as a Pandas DataFrame and use the `mlflow.evaluate()` function to evaluate the generated posts using our three metrics. In a real-world scenario, we would want to run a larger experiment with more examples and, potentially, a wider range of candidate models and prompts.

Note that we could, instead, simply track these experiments as individual MLflow runs. However, using `mlflow.evaluate()` provides a more structured way to compare models and prompts across multiple metrics, with built-in support for viewing and analyzing evaluation results in the MLflow UI.

We also have a few different options for how to use `mlflow.evaluate()`. We can pass the model to `mlflow.evaluate()`, which will call on the model to generate fresh predictions each time we run the evaluation suite, or we can pass a dataset including all the necessary inputs, outputs, and context. We will go with the latter approach. This way, if we need to debug or update our evaluation setup, we do not need to re-run the generation step: we can simply run the updated evaluations on the static dataset.
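
One way to keep the generation and evaluation steps decoupled is to persist the generated rows once and reload them for each evaluation run. The following is a minimal sketch using only the standard library; the file name, the helper names `save_rows` and `load_rows`, and the placeholder rows are all hypothetical and not part of the guide's code.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical generated rows; in the guide these come from the model calls.
generated_rows = [
    {"model": "openai", "system_prompt": "concise", "output": "Example post A"},
    {"model": "gemini", "system_prompt": "detailed", "output": "Example post B"},
]


def save_rows(rows, path):
    """Write one JSON object per line so the generation step runs only once."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_rows(path):
    """Reload the static dataset for any later evaluation run."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


dataset_path = Path(tempfile.mkdtemp()) / "eval_dataset.jsonl"
save_rows(generated_rows, dataset_path)
reloaded_rows = load_rows(dataset_path)
```

The reloaded rows can then be turned into a DataFrame and evaluated as often as needed without making further model calls.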

First, we'll set up our key experiment variables:

In [0]:
system_prompt_2 = """You are a social media content specialist with expertise in matching writing styles and voice across platforms. Your task is to:
1. Analyze the provided example post(s) by examining:
   - Writing style, tone, and voice
   - Sentence structure and length
   - Use of hashtags, emojis, and formatting
   - Engagement techniques and calls-to-action
2. Generate a new LinkedIn post about the given topic that matches:
   - The identified writing style and tone
   - Similar structure and formatting choices
   - Equivalent use of platform features and hashtags
   - Comparable engagement elements
3. Return only the generated post, formatted exactly as it would appear on LinkedIn, without any additional commentary or explanations."""

system_prompts = {"concise": system_prompt_1, "detailed": system_prompt_2}


clients = {"openai": OpenAI(), "gemini": OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)}

Next, let's define the pages we will use as sources. We will pass each of these ten MLflow documentation pages to the `webpage_to_markdown` function and then generate four posts per source: one for each combination of our two prompts and two models.


In [0]:
mlflow_pages = {"Tutorial: Custom GenAI Models using ChatModel": "https://mlflow.org/docs/latest/llms/chat-model-guide/index.html",
                "MLflow Tracing Schema": "https://mlflow.org/docs/latest/llms/tracing/tracing-schema.html",
                "MLflow AI Gateway (Experimental)": "https://mlflow.org/docs/latest/llms/deployments/index.html",
                "MLflow LLM Evaluation": "https://mlflow.org/docs/latest/llms/llm-evaluate/index.html",
                "LLM Evaluation with MLflow Example Notebook": "https://mlflow.org/docs/latest/llms/llm-evaluate/notebooks/question-answering-evaluation.html",
                "MLflow Tracing for LLM Observability": "https://mlflow.org/docs/latest/llms/tracing/index.html",
                "Deep Learning": "https://mlflow.org/docs/latest/deep-learning/index.html",
                "DSPy Quickstart": "https://mlflow.org/docs/latest/llms/dspy/notebooks/dspy_quickstart.html",
                "Building Custom Python Function Models with MLflow": "https://mlflow.org/docs/latest/traditional-ml/creating-custom-pyfunc/index.html",
                "Quickstart with MLflow PyTorch Flavor": "https://mlflow.org/docs/latest/deep-learning/pytorch/quickstart/pytorch_quickstart.html"
                }

### Generate the Evaluation Dataset


Now that we have set up our experimental variables and identified the sources we want to use for post generation, we can generate an evaluation dataset. For each source we will generate four different posts, using two different system prompts and two different models. We will record:

- the model used
- the system prompt used
- the name of the source webpage
- the generated post
- the example posts used (even though we aren't comparing different examples in the experiment, we still need to include them in the dataset for the evaluation suite to work because they are used by the style similarity metric).
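
To make the size of the experiment concrete, the combinations can be enumerated as a simple grid. This sketch uses placeholder page names rather than the real URLs; the variable names are illustrative only.

```python
from itertools import product

# Placeholder names standing in for the real pages, prompts, and clients.
page_names = [f"page_{i}" for i in range(10)]
prompt_names = ["concise", "detailed"]
client_names = ["openai", "gemini"]

# One row per (page, prompt, client) combination.
grid = [
    {"context_page": page, "system_prompt": prompt, "model": client}
    for page, prompt, client in product(page_names, prompt_names, client_names)
]

rows_per_page = len(grid) // len(page_names)
print(len(grid), rows_per_page)  # 40 4
```

Ten pages with four posts each gives 40 rows in total, or ten rows for each model and prompt combination.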

We will also use some slightly more detailed tracing as we generate this evaluation dataset—we will trace the markdown conversion, the prompt generation, and the post generation. This way, if we encounter any issues generating the evaluation dataset, we can easily identify the source of the issue.
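
To illustrate what function-level tracing captures, here is a toy stand-in for a tracing decorator. It is not MLflow's implementation: `trace_sketch`, `recorded_spans`, and `build_greeting` are hypothetical names, and the sketch only records each call's inputs, output, and duration in an in-memory list.

```python
import functools
import time

# Hypothetical in-memory store; a real tracing backend would persist spans elsewhere.
recorded_spans = []


def trace_sketch(span_type):
    """Toy tracing decorator: records inputs, output, and duration of each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            recorded_spans.append({
                "name": func.__name__,
                "span_type": span_type,
                "inputs": {"args": args, "kwargs": kwargs},
                "outputs": result,
                "duration_s": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@trace_sketch(span_type="FUNCTION")
def build_greeting(name):
    return f"Hello, {name}!"


greeting = build_greeting("MLflow")
```

Because each decorated helper records its own inputs and outputs, a failure in one step (for example, an empty markdown conversion) can be spotted without re-running the whole pipeline.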


Here's the function we use to generate the evaluation dataset. Note that, to get the detailed tracing, we added `@mlflow.trace(span_type="FUNCTION")` decorators to the `webpage_to_markdown` and `generate_prompt` functions.



In [0]:

import pandas as pd
import mlflow

def generate_evaluation_dataset(system_prompts: dict, clients: dict, user_instruction: str, example_posts: list, context_pages: dict):
    """Generate one post per (page, system prompt, client) combination.

    Relies on `webpage_to_markdown`, `generate_prompt`, `user_template`, and
    `additional_instructions` defined earlier in the notebook.
    """
    results = []

    for page_name, page_url in context_pages.items():
        for prompt_name, system_prompt in system_prompts.items():
            for client_name, client in clients.items():
                # One parent span per generated post; the traced helper calls nest under it
                with mlflow.start_span(name="eval_dataset_generation",
                                       span_type="CHAIN",
                                       ) as parent_span:
                    model = "gpt-4o" if client_name == "openai" else "gemini-2.0-flash-exp"
                    parent_span.set_inputs({"model": model, "system_prompt": system_prompt,
                                            "context_page": page_name})
                    page_content = webpage_to_markdown(page_url)
                    messages = generate_prompt(system_prompt, user_template, example_posts, page_content, additional_instructions)

                    response = client.chat.completions.create(
                        model=model,
                        messages=messages,
                    )
                    output = response.choices[0].message.content

                    # Record the configuration alongside the generated post
                    results.append({
                        'model': client_name,
                        'system_prompt': prompt_name,
                        'context_page': page_name,
                        'user_instruction': user_instruction,
                        'output': output
                    })
                    parent_span.set_outputs({"output": output})

    return pd.DataFrame(results)

user_instruction = "Create a LinkedIn post about the benefits of using MLflow for machine learning projects."


eval_dataset = generate_evaluation_dataset(system_prompts, clients, user_instruction, example_posts, mlflow_pages)

[Trace(request_id=tr-12fcc941e32d4b05b72b1b05b0020524), Trace(request_id=tr-4f20f4162f1f44419b8d37a276128ba0), Trace(request_id=tr-378d52c602cb4b83a451d1d2f90e0e4a), Trace(request_id=tr-04573910a8f046dd84bdfa72434790ee), Trace(request_id=tr-5a2934b3320943a1b90ba7bdf8d7d723), Trace(request_id=tr-67ef6bbe01434bf4989e55ebe038d6ef), Trace(request_id=tr-ba03a9d4912b47c49a58d93bd0b65cb3), Trace(request_id=tr-ef544575b3664c7a892ba7728af4bc4f), Trace(request_id=tr-98b58fdf965c446daa983aa48dc6463e), Trace(request_id=tr-24f1c9306a214287886b86e292c75b09)]

In [0]:
eval_dataset.head(4)

| | model | system_prompt | context_page | user_instruction | output |
|---|---|---|---|---|---|
| 0 | openai | concise | Tutorial: Custom GenAI Models using ChatModel | Create a LinkedIn post about the benefits of u... | New Tutorial Alert: Dive into Custom GenAI Mod... |
| 1 | gemini | concise | Tutorial: Custom GenAI Models using ChatModel | Create a LinkedIn post about the benefits of u... | New tutorial: Build custom GenAI models with M... |
| 2 | openai | detailed | Tutorial: Custom GenAI Models using ChatModel | Create a LinkedIn post about the benefits of u... | 🎓 New Tutorial Alert: Creating Custom GenAI Mo... |
| 3 | gemini | detailed | Tutorial: Custom GenAI Models using ChatModel | Create a LinkedIn post about the benefits of u... | New tutorial alert! 🚨 Learn how to build custo... |


In [0]:
eval_dataset['examples'] = [example_posts] * len(eval_dataset)
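
One detail worth knowing about the `[example_posts] * len(eval_dataset)` pattern: list multiplication repeats a reference to the same list object rather than copying it. That is fine for read-only use, as in this guide, but it matters if any row's list is later modified. The sketch below uses placeholder data to show the difference.

```python
example_posts = ["first example post", "second example post"]  # placeholder data
n_rows = 3

# Same pattern as the guide: every element refers to the one list object.
shared = [example_posts] * n_rows
# Alternative: give each row its own copy if rows may be modified independently.
copied = [list(example_posts) for _ in range(n_rows)]

# Mutating one "row" of the shared version affects all rows and the original list.
shared[0].append("added later")
# Mutating one row of the copied version affects only that row.
copied[0].append("only in row 0")
```

After these two appends, every entry in `shared` has three items, while only the first entry in `copied` does.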

### Run the Evaluation Suite on the Evaluation Dataset

Now that we have our evaluation dataset, we can run the evaluation suite on it. We will use the `mlflow.evaluate()` function to evaluate the generated posts using our three metrics.

We will group the evaluation by model and system prompt, and then evaluate each group. Organizing the evaluation into separate runs gives us the easiest way to view the evaluation results in the MLflow UI.

There are a few important things to note here:

- We listed our metrics in the `extra_metrics` argument. This is because we are not using the default metrics associated with a particular task type and are instead using only the specific metrics we decided on for our use case.
- In the `evaluator_config` argument, we set the `col_mapping` to map the inputs and context to the columns in our evaluation dataset. This allows the evaluation suite to correctly map the inputs and context to the evaluation results.
- For the sake of organization, we created a parent run and nested each of the evaluation runs within it.
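
To make the role of `col_mapping` concrete, here is a toy sketch of the idea: each argument a metric expects is looked up in the dataset row, using the mapped column name when one is given. The function `resolve_metric_inputs` is hypothetical and only illustrates the concept; it is not how MLflow implements the lookup internally.

```python
def resolve_metric_inputs(row, metric_args, col_mapping):
    """Toy sketch: find each metric argument in the row, via col_mapping if present."""
    resolved = {}
    for arg in metric_args:
        # Fall back to a column with the same name as the argument.
        column = col_mapping.get(arg, arg)
        resolved[arg] = row[column]
    return resolved


# Placeholder row with the same column names as the evaluation dataset.
row = {
    "user_instruction": "Create a LinkedIn post.",
    "context_page": "MLflow Tracing Schema",
    "examples": ["example post"],
    "output": "generated post",
}
col_mapping = {
    "inputs": "user_instruction",
    "context": "context_page",
    "examples": "examples",
}

resolved = resolve_metric_inputs(row, ["inputs", "context", "examples"], col_mapping)
```

In this sketch, a metric asking for `context` receives whatever is stored in the `context_page` column, so the quality of a context-based metric depends on what that column actually contains.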


In [0]:
example_posts

["MLflow's GenAI evaluation metrics now work as callable functions as of MLflow 2.17, making them easier to use and integrate.\nNow you can use metrics like answer_relevance, answer_correctness, faithfulness, and toxicity directly as functions—no need to go through mlflow.evaluate() anymore if you're just prototyping with individual metrics or integrating metric calls into systems where mlflow.evalaute is not necessary.\nThis means:\n🔍 Easier debugging during prototyping\n🔌 More flexible integration options\n🎯 Works with or without other MLflow features\nCheck it out in action ⬇️\nLearn more:\n📚 Docs: https://lnkd.in/gyBzcrDr\n📝 Release notes: https://lnkd.in/gBrNQfFC\n#MachineLearning #AI #LLMs #LLMOps #Evals",
 "If you're already building with Python ML libraries, adding mlflow.autolog() to your code instantly gives you production-grade experiment tracking, model management, and observability—no extra infrastructure or code changes needed.\nThe automatic logging works across a remark


In [0]:
import uuid
mlflow.set_tracking_uri("file:./../mlruns")
mlflow.set_experiment("/Shared/genai-social")

# Get unique combinations of model and system_prompt
model_prompts = eval_dataset[['model', 'system_prompt']].drop_duplicates()

# Create parent run for all evaluations
with mlflow.start_run(run_name=f"social-post-generation-eval-{uuid.uuid4()}") as parent_run:
    mlflow.log_param("evaluation_type", "social_post_generation")

    # Evaluate each configuration subset
    for _, row in model_prompts.iterrows():
        model = row['model']
        system_prompt = row['system_prompt']

        # Subset the data for this configuration
        subset_df = eval_dataset[
            (eval_dataset['model'] == model) & 
            (eval_dataset['system_prompt'] == system_prompt)
        ]

        # Create child run for this specific configuration
        with mlflow.start_run(run_name=f"{model}-{system_prompt}", nested=True) as child_run:
            # Log configuration parameters
            mlflow.log_params({
                "model": model,
                "system_prompt": system_prompt,
            })

            # Run evaluation
            eval_results = mlflow.evaluate(
                data=subset_df,
                predictions="output",
                extra_metrics=[
                    style_similarity_metric,
                    faithfulness_metric,
                    toxicity_metric
                ],
                evaluator_config={
                    "col_mapping": {
                        "inputs": "user_instruction",
                        "context": "context_page",
                        "examples": "examples"   
                    }
                }
            )

2025/04/13 17:22:31 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...



2025/04/13 17:22:56 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...



2025/04/13 17:23:20 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...



2025/04/13 17:23:45 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
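
As an aside, the per-configuration subsetting in the evaluation loop above could also be expressed with pandas `groupby`, which yields each subset directly. This is a sketch on a small placeholder DataFrame, not a change to the guide's code; `toy_dataset` and `subsets` are illustrative names.

```python
import pandas as pd

# Small placeholder dataset with the same grouping columns as the guide.
toy_dataset = pd.DataFrame({
    "model": ["openai", "gemini"] * 4,
    "system_prompt": ["concise", "concise", "detailed", "detailed"] * 2,
    "output": [f"post {i}" for i in range(8)],
})

subsets = {}
# Grouping by a list of two columns yields (model, system_prompt) tuples as keys.
for (model, system_prompt), subset_df in toy_dataset.groupby(["model", "system_prompt"]):
    subsets[f"{model}-{system_prompt}"] = subset_df

print(sorted(subsets))
```

Each value in `subsets` plays the same role as `subset_df` in the loop above and could be passed to `mlflow.evaluate()` inside its own nested run.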





The evaluation UI gives us a number of ways to view the results. We can, for example, select the relevant runs, group by the context page name, and compare the faithfulness metric in order to see how well each model/system prompt combination performed on each page.

We can also load the evaluation results into a Pandas DataFrame for further custom analysis. Let's get the last five run IDs (parent run + four child runs), load the evaluation results into a DataFrame, and then analyze the results.


In [0]:
import mlflow
import pandas as pd

# Initialize the MLflow client
mlflow_client = mlflow.MlflowClient()

# Get recent runs
recent_runs = mlflow_client.search_runs(
    experiment_ids=[mlflow.get_experiment_by_name("/Shared/genai-social").experiment_id],
    max_results=5,
    order_by=["start_time DESC"]
)

# Extract relevant data from the runs
data = []
for run in recent_runs:
    run_data = {
        'run_id': run.info.run_id,
        'model': run.data.params.get('model'),
        'system_prompt': run.data.params.get('system_prompt'),
        'style_similarity/v1/score': run.data.metrics.get('style_similarity/v1/mean'),
        'faithfulness/v1/score': run.data.metrics.get('faithfulness/v1/mean'),
        'toxicity/v1/score': run.data.metrics.get('toxicity/v1/mean')
    }
    data.append(run_data)

# Create a DataFrame from the extracted data
df = pd.DataFrame(data)

# Group by 'model' and 'system_prompt' and calculate the mean of the scores
summary_df = df.groupby(['model', 'system_prompt']).agg({
    'style_similarity/v1/score': 'mean',
    'faithfulness/v1/score': 'mean',
    'toxicity/v1/score': 'mean'
}).round(3)

# Rename columns for better readability
summary_df.columns = ['Style Similarity', 'Faithfulness', 'Toxicity']

# Print the summary DataFrame
print(summary_df)

                      Style Similarity  Faithfulness  Toxicity
model  system_prompt                                          
gemini concise                     4.0           1.1       NaN
       detailed                    4.0           1.0       NaN
openai concise                     4.0           1.0       NaN
       detailed                    4.0           1.0       NaN
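
Note that the summary has four rows even though we retrieved five runs. The parent run has no `model` or `system_prompt` parameter, and pandas `groupby` omits groups whose keys are missing by default. The sketch below reproduces that behavior with placeholder records; the column names and values are illustrative.

```python
import pandas as pd

# Placeholder run records; the parent run has no model or system_prompt params.
runs = pd.DataFrame([
    {"run": "parent", "model": None, "system_prompt": None, "style": None},
    {"run": "child-1", "model": "openai", "system_prompt": "concise", "style": 4.0},
    {"run": "child-2", "model": "gemini", "system_prompt": "concise", "style": 4.0},
])

# Default groupby behavior: rows with a missing group key are left out.
summary = runs.groupby(["model", "system_prompt"])["style"].mean()
print(summary)
```

Only the two child configurations appear in the result, which is why no explicit filtering of the parent run is needed in the analysis above.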



The ability to load the evaluation results into a Pandas DataFrame also allows us to use our preferred plotting libraries to visualize the results. For example, the following was created using `matplotlib`.

The model and prompt combinations perform relatively similarly. Based on this experiment, the `gemini-2.0-flash-exp` model with the `detailed` system prompt appears to offer the best combination of style similarity and faithfulness. In a production setting, we would want to run a larger experiment to confirm this result.

## Conclusion

In this second part of a four-part guide detailing how MLflow integrates with a GenAI project, from conception through deployment, we evaluated our AI system. In particular, we saw how:

- MLflow's callable metrics let us quickly and easily test the metrics we wanted to use to evaluate our system
- `mlflow.evaluate()` can be used to run an evaluation suite, including two LLM-as-judge metrics, on our AI system

In the next section, we will see how to encapsulate our application logic in a custom MLflow model.

# Part 3: Custom MLflow Model