# Generative AI Evaluation Metrics in MLflow

MLflow 2.8 introduced new [Generative AI Metrics](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#generative-ai-metrics) that use LLMs to evaluate model output text. There are a few different GenAI metrics to choose from:
- [Answer Correctness](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_correctness), which compares a model's output to a ground truth answer
- [Answer Relevance](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_relevance), which evaluates how appropriate and applicable a response is with respect to the input
- [Answer Similarity](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.answer_similarity), which assesses the semantic similarity of a generated response to a ground truth answer
- [Faithfulness](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.faithfulness), which tests the factual consistency of a model's response with some provided context (e.g. in a RAG system)
- [Relevance](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#mlflow.metrics.genai.relevance), which examines the output with respect to the input and provided context (e.g. in a RAG system) and rates its relevance and significance. Note that this differs from the `Answer Relevance` metric, which does not have a context component.

These all work in fundamentally the same way: pick a judge model and (optionally) define one or more examples, then pass the resulting metric to `mlflow.evaluate()`. Let's try it out.

In [4]:
# setup
import openai
import pandas as pd
from dotenv import load_dotenv
load_dotenv(override=True)

True

In [5]:
from mlflow.metrics.genai import EvaluationExample, answer_relevance

example1 = EvaluationExample(
    input="What is MLflow Tracking?",
    output="MLflow Tracking provides both an API and UI dedicated to the logging "
    "of parameters, code versions, metrics, and artifacts during the ML process. "
    "This centralized repository captures details such as parameters, metrics, "
    "artifacts, data, and environment configurations, giving teams insight into their "
    "models’ evolution over time.",
    score=5,
    justification="The answer directly addresses the input question and "
    "provides a concise and clear description of MLflow Tracking.",
)

example2 = EvaluationExample(
    input="What is MLflow Model Registry?",
    output="MLflow Model Registry is a component of MLflow that helps in managing "
    "and deploying models in production. It provides versioning and stage transitions. "
    "MLflow also has a model evaluation feature for evaluating ML models.",
    score=3,
    justification="The answer provides a general idea about MLflow Model Registry and "
    "includes correct details about versioning and stage transitions. The mention of model evaluation "
    "is irrelevant to the input question, hence the score of 3.",
)

example3 = EvaluationExample(
    input="What is automatic logging in MLflow?",
    output="Delta Lake is an open-source storage layer that brings ACID transactions to Apache "
    "Spark and big data workloads.",
    score=1,
    justification="The output is completely irrelevant to the input question about "
    "automatic logging in MLflow, hence the score of 1.",
)

# Construct the metric using OpenAI GPT-4 as the judge
answer_relevance_metric = answer_relevance(model="openai:/gpt-4", examples=[example1, example2, example3])

print(answer_relevance_metric)


EvaluationMetric(name=answer_relevance, greater_is_better=True, long_name=answer_relevance, version=v1, metric_details=
Task:
You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called answer_relevance based on the input and output.
A definition of answer_relevance and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

Metric definition:
Answer relevance measures the appropriateness and applicability of the output with respect to the input. Scores should reflect the extent to 
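The printed template contains the placeholders `{input}`, `{output}`, and `{grading_context_columns}`. As a rough illustration of how such a template could be filled for one evaluation row, here is a minimal sketch using plain `str.format`. The shortened template, the `render_judge_prompt` helper, and the "Additional information" label are all hypothetical; this is not MLflow's internal code.

```python
# Illustrative only: a shortened template that reuses the placeholder
# names from the printed metric details. Not MLflow's actual implementation.
JUDGE_TEMPLATE = (
    "Input:\n{input}\n\n"
    "Output:\n{output}\n\n"
    "{grading_context_columns}"
)


def render_judge_prompt(question, answer, context=None):
    """Fill the placeholders for one evaluation row (hypothetical helper)."""
    # Metrics without a context component leave this block empty.
    context_block = "" if context is None else f"Additional information:\n{context}"
    return JUDGE_TEMPLATE.format(
        input=question,
        output=answer,
        grading_context_columns=context_block,
    ).rstrip()


prompt = render_judge_prompt(
    "How to exit vim?",
    "Press Esc, type :q, then press Enter.",
)
print(prompt)
```

For a context-free metric such as answer relevance, the context block stays empty, so the judge sees only the input and the output.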

In [6]:
eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is MLflow?",
            "What is Delta Lake?",
            "How to exit vim?",
            "How to exit emacs?",
        ],
        "ground_truth": [
            "MLflow is an open source platform for managing the end-to-end machine learning lifecycle.",
            "Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.",
            "To exit vim, press ESC to enter command mode, then type :q and press Enter.",
            "To exit emacs, press Ctrl+x, then Ctrl+c."
        ]
    }
)


In [7]:
import mlflow

with mlflow.start_run() as run:
    system_prompt = "Concisely answer the following question."
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
        extra_metrics=[answer_relevance_metric],  # use the answer relevance metric created above
    )

results.metrics

2023/10/30 17:27:25 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/10/30 17:27:25 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/10/30 17:27:31 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...



2023/10/30 17:27:36 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/10/30 17:27:36 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/10/30 17:27:36 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/10/30 17:27:36 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/10/30 17:27:36 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/10/30 17:27:36 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: answer_relevance



{'answer_relevance/v1/mean': 5.0,
 'answer_relevance/v1/variance': 0.0,
 'answer_relevance/v1/p90': 5.0}
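The three aggregate values summarize the per-row judge scores. The sketch below shows one way such a summary could be computed with the standard library only. It assumes population variance and a linear-interpolation percentile (NumPy's defaults); the `summarize_scores` helper is hypothetical and is not MLflow's own aggregation code.

```python
import math
import statistics


def summarize_scores(scores):
    """Aggregate per-row judge scores into mean, variance, and p90."""
    ordered = sorted(scores)
    # Linear-interpolation percentile (assumption: mirrors NumPy's default).
    rank = 0.9 * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    p90 = ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
    return {
        "mean": statistics.fmean(ordered),
        # Population variance (assumption), i.e. divide by n rather than n - 1.
        "variance": float(statistics.pvariance(ordered)),
        "p90": float(p90),
    }


# The four rows in this run all received a score of 5.
summary = summarize_scores([5, 5, 5, 5])
print(summary)
```

With four identical scores of 5, the mean and p90 are both 5.0 and the variance is 0.0, matching the values reported above. A variance of zero on such a small evaluation set mostly signals that the questions were easy, not that the metric is insensitive.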

In [8]:
results.tables["eval_results_table"]


| | inputs | ground_truth | outputs | token_count | answer_relevance/v1/score | answer_relevance/v1/justification |
|---|---|---|---|---|---|---|
| 0 | What is MLflow? | MLflow is an open source platform for managing... | MLflow is an open-source platform that assists... | 22 | 5 | The output directly addresses the input questi... |
| 1 | What is Delta Lake? | Delta Lake is an open-source storage layer tha... | Delta Lake is an open-source data lake storage... | 49 | 5 | The output provides a comprehensive and accura... |
| 2 | How to exit vim? | To exit vim, press ESC to enter command mode, ... | To exit Vim, you can press the Esc key to swit... | 26 | 5 | The output directly answers the input question... |
| 3 | How to exit emacs? | To exit emacs, press Ctrl+x, then Ctrl+c. | To exit Emacs, you can use the keyboard shortc... | 41 | 5 | The output directly addresses the input questi... |
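The per-row table is most useful for finding the rows the judge scored poorly. The sketch below filters rows under a cutoff using plain Python dictionaries that mirror two columns of the table above. The `rows_below` helper and the cutoff of 4 are illustrative assumptions; with pandas, a boolean mask on the `answer_relevance/v1/score` column would do the same job.

```python
SCORE_KEY = "answer_relevance/v1/score"

# Two columns copied from the evaluation results table above.
rows = [
    {"inputs": "What is MLflow?", SCORE_KEY: 5},
    {"inputs": "What is Delta Lake?", SCORE_KEY: 5},
    {"inputs": "How to exit vim?", SCORE_KEY: 5},
    {"inputs": "How to exit emacs?", SCORE_KEY: 5},
]


def rows_below(rows, threshold):
    """Return the rows whose judge score is strictly below the threshold."""
    return [row for row in rows if row[SCORE_KEY] < threshold]


# The cutoff of 4 is an arbitrary choice for illustration.
flagged = rows_below(rows, threshold=4)
print(f"{len(flagged)} of {len(rows)} rows scored below 4")
```

In this run every row scored 5, so nothing is flagged. On a larger evaluation set, the flagged rows and their justifications are a natural starting point for prompt or model changes.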


In [3]:
import openai

prompt = "test openai chat completion @OpenAI"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
)

completion = response.choices[0]

completion




<OpenAIObject at 0x289cc2630> JSON: {
  "index": 0,
  "message": {
    "role": "assistant",
    "content": "Hello! How can I assist you today?"
  },
  "finish_reason": "stop"
}