# Generative AI Evaluation Metrics in MLflow

MLflow 2.8 introduced new [Generative AI Metrics](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#generative-ai-metrics) that use LLMs to evaluate model output text. There are a few different GenAI metrics to choose from:
- [Answer Correctness](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_correctness), which compares a model's output to a ground truth answer
- [Answer Relevance](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_relevance), which evaluates how appropriate and applicable a response is with respect to the input
- [Answer Similarity](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_similarity), which assesses the semantic similarity of a generated response to a ground truth answer
- [Faithfulness](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.faithfulness), which tests the factual consistency of a model's response with some provided context (e.g. in a RAG system)
- [Relevance](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.relevance), which examines the output with respect to the input and provided context (e.g. in a RAG system) and rates its relevance and significance. Note that this differs from the `Answer Relevance` metric, which does not have a context component.

These all work in fundamentally the same way: pick a judge model and (optionally) define a few graded examples, then pass the resulting metric to `mlflow.evaluate()`. Let's go through them in turn.

In [2]:
# setup
import openai
import pandas as pd
from dotenv import load_dotenv
load_dotenv(override=True)

True

## Answer Correctness

In [3]:
from mlflow.metrics.genai import EvaluationExample, answer_correctness

example1 = EvaluationExample(
    input="What is MLflow Tracking?",
    output="MLflow Tracking's API and UI log ML workflow aspects like parameters, "
    "code, metrics, and artifacts. It provides a unified view of a model's "
    "development, aiding team analysis. Designed for diverse environments "
    "like scripts or notebooks, it simplifies result logging to files or servers, "
    "enhancing run comparisons for users.",
    score=5,
    grading_context={
        "targets": "MLflow Tracking provides both an API and UI dedicated to the logging "
        "of parameters, code versions, metrics, and artifacts during the ML process. This "
        "centralized repository captures details such as parameters, metrics, artifacts, "
        "data, and environment configurations, giving teams insight into their models’ "
        "evolution over time. Whether working in standalone scripts, notebooks, or other "
        "environments, Tracking facilitates the logging of results either to local files or a "
        "server, making it easier to compare multiple runs across different users."
    },
    justification="The answer gives a correct summary of MLflow Tracking. "
    "The answer does not include any inaccuracies or significant omissions.",
)

example2 = EvaluationExample(
    input="What is the MLflow Model Registry?",
    output="MLflow's Model Registry is a version control hub for managing "
    "ML model versions. It helps track model stages and facilitates a "
    "smooth transition to production with a centralized model store and UI.",
    score=3,
    grading_context={
        "targets": "A systematic approach to model management, the Model Registry assists "
        "in handling different versions of models, discerning their current state, and "
        "ensuring a smooth transition from development to production. It offers a centralized "
        "model store, APIs, and UI to collaboratively manage an MLflow Model’s full lifecycle, "
        "including model lineage, versioning, stage transitions, and annotations."
    },
    justification="The output inaccurately characterizes the Model Registry as a version control "
    "only system, thereby failing to capture its broader role in the full lifecycle management "
    "of machine learning models. This overlooks key features such as collaboration, annotations, "
    "and comprehensive lifecycle management, leading to a deduction in the correctness score."
)

example3 = EvaluationExample(
    input="What is automatic logging in MLflow?",
    output="Automatic logging in MLflow is an AI-driven feature for optimizing data storage, "
    "leveraging algorithms to enhance data retrieval and backups within the ML workflow.",
    score=1,
    grading_context={
        "targets": "Automatic logging allows you to log metrics, parameters, and models "
        "without the need for explicit log statements. There are two ways to use autologging: Call "
        "mlflow.autolog() before your training code. This will enable autologging for each supported "
        "library you have installed as soon as you import it. Use library-specific autolog calls for "
        "each library you use in your code."
    },
    justification="The output erroneously represents automatic logging as a data storage optimization "
    "feature, which is entirely incorrect. Automatic logging in MLflow is actually designed to log "
    "metrics, parameters, and models automatically during the machine learning model training process. "
    "This significant misstatement of MLflow's functionality warrants the score of 1."
)

# Construct the metric using OpenAI GPT-4 as the judge.
# The examples defined above are not passed here; to supply them as grading
# references, add examples=[example1, example2, example3].
answer_correctness_metric = answer_correctness(model="openai:/gpt-4")

print(answer_correctness_metric)

EvaluationMetric(name=answer_correctness, greater_is_better=True, long_name=answer_correctness, version=v1, metric_details=
Task:
You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_correctness based on the input and output.
A definition of answer_correctness and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

Metric definition:
Answer correctness is evaluated on the accuracy of the provided output based on the provided targets, which is the ground truth. Scor
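
The printed metric details are a prompt template with `{input}`, `{output}`, and `{grading_context_columns}` placeholders. As a rough illustration of how such a template might be filled for a single row, here is a minimal sketch. It is not MLflow's internal code: `JUDGE_TEMPLATE` and `fill_judge_prompt` are hypothetical names, the template is shortened to the placeholder section, and rendering the grading context as `key: value` lines is an assumption.

```python
# Hypothetical sketch: fill a judge-prompt template for one evaluation row.
# Only the placeholder part of the printed template is reproduced here.
JUDGE_TEMPLATE = (
    "Input:\n{input}\n\n"
    "Output:\n{output}\n\n"
    "{grading_context_columns}"
)


def fill_judge_prompt(input_text, output_text, grading_context):
    """Render the template; grading context becomes 'key: value' lines (an assumption)."""
    context_block = "\n".join(
        f"{key}: {value}" for key, value in grading_context.items()
    )
    return JUDGE_TEMPLATE.format(
        input=input_text,
        output=output_text,
        grading_context_columns=context_block,
    )


prompt = fill_judge_prompt(
    "How to exit vim?",
    "Press ESC, then type :q and press Enter.",
    {"targets": "To exit vim, press ESC to enter command mode, then type :q and press Enter."},
)
print(prompt)
```

For Answer Correctness the grading context holds the ground truth (`targets`), which is why the example definitions above pass it through `grading_context`.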

In [7]:
eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Delta Lake?",
            "How to exit vim?",
            "How to exit emacs?",
        ],
        "ground_truth": [
            "MLflow is an open source platform for managing the end-to-end machine learning lifecycle.",
            "Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.",
            "To exit vim, press ESC to enter command mode, then type :q and press Enter.",
            "To exit emacs, press Ctrl+x, then Ctrl+c."
        ],
    }
)


In [8]:
import mlflow

with mlflow.start_run() as run:
    system_prompt = "Concisely answer the following question."
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
        targets="ground_truth",
        extra_metrics=[
            answer_correctness_metric,
        ],
    )

results.metrics

2023/11/06 11:25:17 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/11/06 11:25:17 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/11/06 11:25:46 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


2023/11/06 11:25:57 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/11/06 11:25:57 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/11/06 11:25:57 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/11/06 11:25:57 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/11/06 11:25:57 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/11/06 11:25:57 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_correctness


{'toxicity/v1/mean': 0.0008873457809386309,
 'toxicity/v1/variance': 6.291812835807224e-07,
 'toxicity/v1/p90': 0.0017894275253638628,
 'toxicity/v1/ratio': 0.0,
 'flesch_kincaid_grade_level/v1/mean': 12.0,
 'flesch_kincaid_grade_level/v1/variance': 23.380000000000003,
 'flesch_kincaid_grade_level/v1/p90': 16.88,
 'ari_grade_level/v1/mean': 15.549999999999999,
 'ari_grade_level/v1/variance': 28.0875,
 'ari_grade_level/v1/p90': 20.66,
 'exact_match/v1': 0.0,
 'answer_correctness/v1/mean': 4.75,
 'answer_correctness/v1/variance': 0.1875,
 'answer_correctness/v1/p90': 5.0}
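
The three `answer_correctness` aggregates can be reproduced by hand from the per-row scores (5, 4, 5, 5) listed in the results table further down. The sketch below assumes a population variance (dividing by n) and a linearly interpolated 90th percentile; both conventions are assumptions, chosen because they agree with the numbers reported above. The `percentile` helper is illustrative only.

```python
import math
import statistics

# Per-row answer_correctness scores as reported in this notebook's results table.
scores = [5, 4, 5, 5]


def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100] (an assumed convention)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


mean_score = statistics.fmean(scores)
variance_score = statistics.pvariance(scores)  # population variance (divide by n)
p90_score = percentile(scores, 90)

print(mean_score, variance_score, p90_score)  # 4.75 0.1875 5.0
```

With only four rows, a single score of 4 is enough to move the mean to 4.75, so the per-row justifications are usually more informative than the aggregates.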

In [9]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
df = results.tables["eval_results_table"]
df[["inputs", "outputs", "answer_correctness/v1/score", "answer_correctness/v1/justification"]].style


| | inputs | outputs | answer_correctness/v1/score | answer_correctness/v1/justification |
|---|---|---|---|---|
| 0 | What is MLflow? | MLflow is an open-source platform used for managing the entire machine learning life cycle, including tracking experiments, packaging and reproducibility of models, and deployment of machine learning models. It provides tools and APIs to help data scientists and engineers keep track of experiments, manage and version trained models, and enable model deployment as web services in various environments. | 5 | The output provided by the model is correct. It accurately describes MLflow as an open-source platform used for managing the entire machine learning life cycle, including tracking experiments, packaging and reproducibility of models, and deployment of machine learning models. This aligns perfectly with the provided targets, demonstrating a high degree of accuracy and semantic similarity. Therefore, according to the grading rubric, the model's answer_correctness score is 5. |
| 1 | What is Delta Lake? | Delta Lake is an open-source storage layer that sits on top of existing data lakes and provides ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema enforcement, and data versioning capabilities. It enables reliable and efficient data ingestion, processing, and query operations on big data platforms. | 4 | The output provided by the model is mostly correct. It correctly identifies Delta Lake as an open-source storage layer that provides ACID transactions and sits on top of existing data lakes. It also correctly mentions its capabilities such as schema enforcement and data versioning. However, it does not specifically mention that Delta Lake brings ACID transactions to Apache Spark, which is a minor omission. Therefore, the model's answer_correctness score is 4 |
| 2 | How to exit vim? | To exit vim, press the "Esc" key to enter command mode, then type ":q" and press "Enter". If you have unsaved changes, you can add an exclamation mark to force the exit, like ":q!". | 5 | The output provided by the model is completely correct. It accurately describes the process of exiting vim, which aligns perfectly with the provided targets. The output mentions pressing the "Esc" key to enter command mode, then typing ":q" and pressing "Enter". It also includes the additional information about forcing an exit with unsaved changes by adding an exclamation mark, like ":q!". This demonstrates a high degree of accuracy and semantic similarity to the targets, which is why it gets a score of 5 according to the grading rubric. |
| 3 | How to exit emacs? | To exit Emacs, you can use the keyboard command "Ctrl-x Ctrl-c" or type "M-x save-buffers-kill-emacs" and press Enter. | 5 | The output provided by the model is correct. It accurately provides the keyboard command "Ctrl-x Ctrl-c" to exit Emacs, which aligns perfectly with the provided targets. Additionally, the model provides an extra command "M-x save-buffers-kill-emacs" to exit Emacs, which is also correct, although it was not mentioned in the targets. Therefore, the model's answer_correctness score is 5 according to the grading rubric. |


## Answer Relevance

In [None]:
from mlflow.metrics.genai import answer_relevance

example1 = EvaluationExample(
    input="What is MLflow Tracking?",
    output="MLflow Tracking provides both an API and UI dedicated to the logging "
    "of parameters, code versions, metrics, and artifacts during the ML process. "
    "This centralized repository captures details such as parameters, metrics, "
    "artifacts, data, and environment configurations, giving teams insight into their "
    "models’ evolution over time.",
    score=5,
    justification="The answer directly addresses the input question and "
    "provides a concise and clear description of MLflow Tracking.",
)

example2 = EvaluationExample(
    input="What is MLflow Model Registry?",
    output="MLflow Model Registry is a component of MLflow that helps in managing "
    "and deploying models in production. It provides versioning and stage transitions. "
    "MLflow also has a model evaluation feature for evaluating ML models.",
    score=3,
    justification="The answer provides a general idea about MLflow Model Registry and "
    "includes correct details about versioning and stage transitions. The mention of model evaluation "
    "is irrelevant to the input question, hence the score of 3.",
)

example3 = EvaluationExample(
    input="What is automatic logging in MLflow?",
    output="Delta Lake is an open-source storage layer that brings ACID transactions to Apache "
    "Spark and big data workloads.",
    score=1,
    justification="The output is completely irrelevant to the input question about "
    "automatic logging in MLflow, hence the score of 1.",
)

# Construct the metric using OpenAI GPT-4 as the judge
answer_relevance_metric = answer_relevance(model="openai:/gpt-4", examples=[example1, example2, example3])

print(answer_relevance_metric)


In [None]:
import mlflow

with mlflow.start_run() as run:
    system_prompt = "Concisely answer the following question."
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
        targets="ground_truth",
        extra_metrics=[
            answer_relevance_metric,
        ],  # use the answer relevance metric created above
    )

results.metrics

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
df = results.tables["eval_results_table"]

# show only the inputs, outputs, and answer relevance score/justification columns
df[["inputs", "outputs", "answer_relevance/v1/score", "answer_relevance/v1/justification"]].style

## Faithfulness
The [*faithfulness*](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.faithfulness) metric assesses "how factually consistent the output is to the context." So for this metric, we require input, output, and context.

In [None]:
from mlflow.metrics.genai import faithfulness, EvaluationExample

# Create good, middling, and bad examples for faithfulness in the context of this problem
faithfulness_examples = [
    EvaluationExample(
        input="What is MLflow Tracking?",
        output="MLflow Tracking offers an API and UI for centralized logging of machine "
        "learning parameters, code versions, metrics, and artifacts, supporting a variety "
        "of environments and enabling easier comparison of multiple runs across users.",
        score=5,
        justification="The output provides a clear answer including only details from the provided context",
        grading_context={
            "context": "MLflow Tracking provides both an API and UI dedicated to the logging "
            "of parameters, code versions, metrics, and artifacts during the ML process. This "
            "centralized repository captures details such as parameters, metrics, artifacts, "
            "data, and environment configurations, giving teams insight into their models’ "
            "evolution over time. Whether working in standalone scripts, notebooks, or other "
            "environments, Tracking facilitates the logging of results either to local files or a "
            "server, making it easier to compare multiple runs across different users."
        },
    ),
    EvaluationExample(
        input="What is MLflow Model Registry?",
        output="The Model Registry centralizes MLflow Models in a repository, offering APIs, a "
        "user interface, and features like versioning, state tracking, and annotations, as "
        "well as the ability to archive, delete, and search models for a seamless transition "
        "from development to production and ongoing management.",
        score=3,
        justification="The output receives a score of 3 rather than a lower score like 1 "
        "because it does accurately reflect most of the core details mentioned in the "
        "original context such as centralized storage, APIs, UI, versioning, state tracking, "
        "and annotations. However, it introduces extraneous details like the ability to "
        "'archive, delete, and search models,' which are not derived from the context. "
        "Additionally, it substitutes 'repository' for 'store' and introduces 'ongoing "
        "management,' slight deviations from the original description. These inaccuracies "
        "and additions are not fundamentally wrong, but they are not faithful to the original "
        "context, warranting a deduction in score.",
        grading_context={
            "context": "A systematic approach to model management, the Model Registry assists "
            "in handling different versions of models, discerning their current state, and "
            "ensuring a smooth transition from development to production. It offers a centralized "
            "model store, APIs, and UI to collaboratively manage an MLflow Model’s full lifecycle, "
            "including model lineage, versioning, stage transitions, and annotations."
        },
    ),
    EvaluationExample(
        input="What is automatic logging in MLflow?",
        output="Automatic logging in MLflow is a machine learning technique used to automate the "
        "process of cutting down trees for lumber.",
        score=1,
        justification="The output is entirely incorrect and fails to utilize the given context. "
        "The context is about MLflow's automatic logging for metrics, parameters, and models in "
        "machine learning projects. Instead, the output discusses using machine learning to automate "
        "the process of cutting down trees, which is completely unrelated to the actual context provided.",
        grading_context={
            "context": "Automatic logging allows you to log metrics, parameters, and models without the "
            "need for explicit log statements. There are two ways to use autologging: Call mlflow.autolog() "
            "before your training code. This will enable autologging for each supported library you have "
            "installed as soon as you import it. Use library-specific autolog calls for each library you use "
            "in your code."
        },
    ),
]

faithfulness_metric = faithfulness(
    model="openai:/gpt-4", examples=faithfulness_examples
)
print(faithfulness_metric)

In [None]:
eval_df = pd.DataFrame(
    {
        "question": [
            "What is MLflow?",
            "What is Delta Lake?",
            "How to exit vim?",
            "How to exit emacs?",
        ],
        # Adjacent string literals are concatenated as-is, so each section
        # ends with an explicit "\n" (or space) to keep sentences apart.
        "context": [
            "MLflow, at its core, provides a suite of tools aimed at simplifying the ML "
            "workflow. It is tailored to assist ML practitioners throughout the various "
            "stages of ML development and deployment. Despite its expansive offerings, "
            "MLflow’s functionalities are rooted in several foundational components:\n"
            "Tracking: MLflow Tracking provides both an API and UI dedicated to the "
            "logging of parameters, code versions, metrics, and artifacts during the ML "
            "process. This centralized repository captures details such as parameters, "
            "metrics, artifacts, data, and environment configurations, giving teams "
            "insight into their models’ evolution over time. Whether working in standalone "
            "scripts, notebooks, or other environments, Tracking facilitates the logging "
            "of results either to local files or a server, making it easier to compare "
            "multiple runs across different users.\n"
            "Model Registry: A systematic approach to model management, the Model Registry "
            "assists in handling different versions of models, discerning their current "
            "state, and ensuring a smooth transition from development to production. It "
            "offers a centralized model store, APIs, and UI to collaboratively manage an "
            "MLflow Model’s full lifecycle, including model lineage, versioning, stage "
            "transitions, and annotations.\n"
            "AI Gateway: This server, equipped with a set of standardized APIs, streamlines "
            "access to both SaaS and OSS LLM models. It serves as a unified interface, "
            "bolstering security through authenticated access, and offers a common set of "
            "APIs for prominent LLMs.\n"
            "Evaluate: Designed for in-depth model analysis, this set of tools facilitates "
            "objective model comparison, be it traditional ML algorithms or cutting-edge "
            "LLMs.\n"
            "Prompt Engineering UI: A dedicated environment for prompt engineering, this "
            "UI-centric component provides a space for prompt experimentation, refinement, "
            "evaluation, testing, and deployment.\n"
            "Recipes: Serving as a guide for structuring ML projects, Recipes, while "
            "offering recommendations, are focused on ensuring functional end results "
            "optimized for real-world deployment scenarios.\n"
            "Projects: MLflow Projects standardize the packaging of ML code, workflows, "
            "and artifacts, akin to an executable. Each project, be it a directory with "
            "code or a Git repository, employs a descriptor or convention to define its "
            "dependencies and execution method.",
            "Delta Lake is an open source project that enables building a Lakehouse "
            "architecture on top of data lakes. Delta Lake provides ACID transactions, "
            "scalable metadata handling, and unifies streaming and batch data processing "
            "on top of existing data lakes, such as S3, ADLS, GCS, and HDFS. "
            "Specifically, Delta Lake offers:\n"
            "ACID transactions on Spark: Serializable isolation levels ensure that readers "
            "never see inconsistent data.\n"
            "Scalable metadata handling: Leverages Spark distributed processing power to "
            "handle all the metadata for petabyte-scale tables with billions of files at ease.\n"
            "Streaming and batch unification: A table in Delta Lake is a batch table as well "
            "as a streaming source and sink. Streaming data ingest, batch historic backfill, "
            "interactive queries all just work out of the box.\n"
            "Schema enforcement: Automatically handles schema variations to prevent insertion "
            "of bad records during ingestion.\n"
            "Time travel: Data versioning enables rollbacks, full historical audit trails, "
            "and reproducible machine learning experiments.\n"
            "Upserts and deletes: Supports merge, update and delete operations to enable "
            "complex use cases like change-data-capture, slowly-changing-dimension (SCD) "
            "operations, streaming upserts, and so on.",
            "After saving your changes, you can quit Vim with :q. Or the saving and "
            "quitting can be combined into one operation with :wq or :x. "
            "If you want to discard any changes, enter :q! to quit Vim without saving.",
            "C-x C-c\n\n    Kill Emacs (save-buffers-kill-terminal).\n"
            "C-z\n\n    On a text terminal, suspend Emacs; on a graphical display, "
            "iconify (or “minimize”) the selected frame (suspend-frame).\n"
            "Killing Emacs means terminating the Emacs program. To do this, type "
            "C-x C-c (save-buffers-kill-terminal). A two-character key sequence is "
            "used to make it harder to type by accident. If there are any modified "
            "file-visiting buffers when you type C-x C-c, Emacs first offers to save "
            "these buffers. If you do not save them all, it asks for confirmation "
            "again, since the unsaved changes will be lost. Emacs also asks for "
            "confirmation if any subprocesses are still running, since killing "
            "Emacs will also kill the subprocesses (see Running Shell Commands from Emacs).\n"
            "C-x C-c behaves specially if you are using Emacs as a server. If you "
            "type it from a client frame, it closes the client connection. See Using "
            "Emacs as a Server.",
        ],
    }
)
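
Long context strings like these are assembled from adjacent string literals. Python joins adjacent literals with nothing in between, so any space or newline that should separate two sentences has to be written inside the literals themselves. A minimal sketch, reusing two of the Vim sentences (the variable names are illustrative only):

```python
# Adjacent string literals are concatenated with nothing in between.
without_separator = (
    "quitting can be combined into one operation with :wq or :x."
    "If you want to discard any changes, enter :q! to quit Vim without saving."
)

# Ending a literal with a space (or "\n") keeps the sentences apart.
with_separator = (
    "quitting can be combined into one operation with :wq or :x. "
    "If you want to discard any changes, enter :q! to quit Vim without saving."
)

print(":x.If" in without_separator)  # True: the sentences run together
print(":x. If" in with_separator)    # True: the space is preserved
```

This matters for a context-based metric because the judge model sees exactly the text stored in the `context` column.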


In [None]:
with mlflow.start_run() as run:
    system_prompt = "Concisely answer the following question using only the information provided in the context."
    basic_context_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Context:\n{context}\n\nQuestion:\n{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_context_model.model_uri,
        eval_df,
        model_type="question-answering",
        evaluators="default",
        predictions="result",
        evaluator_config={
            "col_mapping": {
                "inputs": "question",
                "context": "context"
            }
        },
        extra_metrics=[faithfulness_metric],  # use the faithfulness metric
    )

results.metrics
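
The `col_mapping` entry in `evaluator_config` relates the names a metric expects (`inputs`, `context`) to the column names in `eval_df` (`question`, `context`). The sketch below only illustrates that idea and is not MLflow's implementation; `resolve_columns` is a hypothetical helper, and falling back to the same name when no mapping is given is an assumption.

```python
def resolve_columns(expected_names, col_mapping):
    """Map each name a metric expects to a dataset column, defaulting to the same name."""
    return {name: col_mapping.get(name, name) for name in expected_names}


# Same mapping as in the evaluate() call above.
col_mapping = {"inputs": "question", "context": "context"}

# One illustrative row with the column names used by eval_df.
row = {
    "question": "How to exit vim?",
    "context": "After saving your changes, you can quit Vim with :q.",
}

resolved = resolve_columns(["inputs", "context"], col_mapping)
metric_args = {name: row[column] for name, column in resolved.items()}
print(metric_args)
```

Here the `"context": "context"` entry is an identity mapping; the `"inputs": "question"` entry is the one that bridges the two naming schemes.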

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
df = results.tables["eval_results_table"]

# show only the question, outputs, and faithfulness score/justification columns
df[["question", "outputs", "faithfulness/v1/score", "faithfulness/v1/justification"]].style