# Introduction

MLflow 2.8 introduced [LLM-as-judge genai metrics](https://mlflow.org/docs/latest/python_api/mlflow.metrics.html#generative-ai-metrics). [This notebook](mlflow-genai-metrics.ipynb) shows how to use preconfigured genai metrics, such as answer correctness and answer relevance, to evaluate models in MLflow. Here, we show how to create [custom genai metrics](https://mlflow.org/docs/latest/llms/llm-evaluate/notebooks/question-answering-evaluation.html#Custom-LLM-judged-metric-for-professionalism), which let you use an LLM to judge model outputs against any criteria you can describe.

To define a custom metric, use the `mlflow.metrics.make_genai_metric` function. You must provide:
- A `definition`, which describes what the judge LLM is measuring;
- A `grading_prompt`, which describes the grading scale and criteria;
- Grading `examples`, each with an input, an output, a score, and a justification; and
- A grading `model` and related `parameters` (e.g., GPT-4 with a temperature setting).

Here's an example metric that is graded on the basis of "accessibility." A model should receive a poor "accessibility" score if it answers questions with unnecessary jargon, technical language, or confusing sentence structures. A model should receive a high "accessibility" score if it answers questions with clear and concise language.
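The grading prompt is ordinary text: a one-line instruction followed by anchored score descriptions. As a minimal sketch, the rubric can be kept as a dictionary and rendered into that text. The helper `build_grading_prompt` below is hypothetical and is not part of MLflow; its output is only meant to be passed as the `grading_prompt` argument.

```python
def build_grading_prompt(metric_name, instruction, rubric):
    """Render a rubric into a grading prompt string.

    `rubric` maps a numeric score to a description of a response that
    earns it. Hypothetical helper for illustration; not an MLflow API.
    """
    lines = [f"{metric_name}: {instruction}"]
    # Emit the score anchors in ascending order so the scale reads low to high.
    for score in sorted(rubric):
        lines.append(f"- Score {score}: {rubric[score]}")
    return "\n".join(lines)


accessibility_rubric = {
    0: (
        "The response is filled with dense jargon, technical language, and "
        "complex sentence structures, making it very difficult for a general "
        "audience to understand."
    ),
    5: (
        "The response uses some technical terms or complex sentences but "
        "generally remains understandable to a broad audience."
    ),
    10: (
        "The response is exceptionally clear and concise, free of unnecessary "
        "jargon, and easily understandable by a wide audience, regardless of "
        "their background."
    ),
}

grading_prompt = build_grading_prompt(
    "Accessibility",
    "Evaluate how accessible the language in the model's response is.",
    accessibility_rubric,
)
print(grading_prompt)
```

Keeping the anchors in a dictionary makes it easier to add intermediate scores later without hand-editing a long string.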

In [5]:
# setup
import openai
import pandas as pd
from dotenv import load_dotenv
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

load_dotenv(override=True)

True

In [6]:

answer_accessibility = make_genai_metric(
    name="accessibility",
    definition=(
        "Accessibility in this context refers to the use of language that is easily "
        "understandable by a wide audience, minimizing technical jargon, complex "
        "sentence structures, or specialized terminology. It involves using clear, "
        "concise language and presenting information in a straightforward manner."
    ),
    grading_prompt=(
        "Accessibility: Evaluate how accessible the language in the model's response "
        "is. \n"
        "- Score 0: The response is filled with dense jargon, technical language, and "
        "complex sentence structures, making it very difficult for a general audience "
        "to understand.\n"
        "- Score 5: The response uses some technical terms or complex sentences but "
        "generally remains understandable to a broad audience.\n"
        "- Score 10: The response is exceptionally clear and concise, free of "
        "unnecessary jargon, and easily understandable by a wide audience, regardless "
        "of their background."
    ),
    examples=[
        EvaluationExample(
            input="What is machine learning?",
            output=(
                "Machine learning is a subset of artificial intelligence where computers are "
                "programmed to learn from data. Unlike traditional programming, where we "
                "explicitly code every decision the computer should make, machine learning "
                "allows the computer to uncover patterns and make decisions based on past "
                "observations."
            ),
            score=7,
            justification=(
                "This response provides a direct explanation of machine learning with some "
                "level of complexity. It avoids heavy jargon and is fairly accessible, yet "
                "it involves concepts like 'subset of artificial intelligence' and 'traditional "
                "programming,' which add minor complexity."
            ),
        ),
        EvaluationExample(
            input="What is the historical foundation of the Iliad?",
            output=(
                "The Iliad's narrative, steeped in mythological lore, ostensibly traces its lineage "
                "back to the epoch of the Trojan War, a putative event of the Bronze Age. However, "
                "its veracity as a historical document is contentious, with characters like Achilles "
                "and Hector likely being more allegorical or aggrandized than actual historical "
                "personages."
            ),
            score=3,
            justification=(
                "The response incorporates more complex and less commonly used terms such as 'ostensibly', "
                "'putative', and 'allegorical', which can make the text less accessible to a general "
                "audience. While it addresses the topic, the use of these terms and a somewhat circuitous "
                "explanation style obscure the clarity, warranting a score of 3."
            ),
        ),
    ],
    version="v1",
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

print(answer_accessibility)

EvaluationMetric(name=accessibility, greater_is_better=True, long_name=accessibility, version=v1, metric_details=
Task:
You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called accessibility based on the input and output.
A definition of accessibility and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before completing the task.

Input:
{input}

Output:
{output}

{grading_context_columns}

Metric definition:
Accessibility in this context refers to the use of language that is easily understandable by a wide audience, minimizing technical jargon, complex senten

In [7]:
import pandas as pd

eval_df = pd.DataFrame(
    {
        "inputs": [
            "Explain the concept of supply and demand in economics.",
            "Describe the process of photosynthesis.",
            "What are the principles of object-oriented programming?",
            "How does quantum computing differ from classical computing?",
            "Discuss the themes in Shakespeare's Hamlet.",
        ]
    }
)

In [9]:
import mlflow

with mlflow.start_run() as run:
    system_prompt = "Concisely answer the following question."
    basic_qa_model = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
        extra_metrics=[
            answer_accessibility
        ],  # use the accessibility metric created above
    )

results.metrics

2023/11/16 10:43:12 INFO mlflow.models.evaluation.base: Evaluating the model with the default evaluator.
2023/11/16 10:43:12 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2023/11/16 10:43:24 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
Using default facebook/roberta-hate-speech-dynabench-r4-target checkpoint



2023/11/16 10:43:40 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: token_count
2023/11/16 10:43:40 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: toxicity
2023/11/16 10:43:40 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: flesch_kincaid_grade_level
2023/11/16 10:43:40 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: ari_grade_level
2023/11/16 10:43:40 INFO mlflow.models.evaluation.default_evaluator: Evaluating builtin metrics: exact_match
2023/11/16 10:43:40 INFO mlflow.models.evaluation.default_evaluator: Evaluating metrics: accessibility



{'toxicity/v1/mean': 0.00015637676988262684,
 'toxicity/v1/variance': 4.611356448080803e-10,
 'toxicity/v1/p90': 0.00018160425242967903,
 'toxicity/v1/ratio': 0.0,
 'flesch_kincaid_grade_level/v1/mean': 18.28,
 'flesch_kincaid_grade_level/v1/variance': 19.749599999999997,
 'flesch_kincaid_grade_level/v1/p90': 22.919999999999998,
 'ari_grade_level/v1/mean': 21.560000000000002,
 'ari_grade_level/v1/variance': 25.238399999999995,
 'ari_grade_level/v1/p90': 26.84,
 'accessibility/v1/mean': 8.2,
 'accessibility/v1/variance': 3.3600000000000003,
 'accessibility/v1/p90': 10.0}
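The keys in this dictionary appear to follow a `name/version/aggregation` pattern, which is inferred from the output above rather than from documentation. Under that assumption, a small sketch like the following regroups the flat dictionary by metric name. The function `group_metrics` is a hypothetical helper, and the `metrics` literal copies a subset of the values printed above.

```python
from collections import defaultdict

# A subset of the metrics printed above, copied verbatim.
metrics = {
    "toxicity/v1/mean": 0.00015637676988262684,
    "flesch_kincaid_grade_level/v1/mean": 18.28,
    "accessibility/v1/mean": 8.2,
    "accessibility/v1/variance": 3.3600000000000003,
    "accessibility/v1/p90": 10.0,
}


def group_metrics(flat_metrics):
    """Group 'name/version/aggregation' keys by metric name.

    Assumes every key has at least two '/' separators, as in the
    output shown above. Hypothetical helper, not an MLflow API.
    """
    grouped = defaultdict(dict)
    for key, value in flat_metrics.items():
        # Split from the right so only the last two parts are treated
        # as version and aggregation.
        name, _version, aggregation = key.rsplit("/", 2)
        grouped[name][aggregation] = value
    return dict(grouped)


grouped = group_metrics(metrics)
print(grouped["accessibility"])
```

In the notebook itself, `results.metrics` could be passed in place of the hand-copied `metrics` dictionary.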

In [12]:
pd.set_option("display.max_columns", None)
pd.set_option("display.expand_frame_repr", False)
df = results.tables["eval_results_table"]
df[
    [
        "inputs",
        "outputs",
        "accessibility/v1/score",
        "accessibility/v1/justification",
    ]
].style


Unnamed: 0,inputs,outputs,accessibility/v1/score,accessibility/v1/justification
0,Explain the concept of supply and demand in economics.,"Supply and demand is a fundamental concept in economics that explains the relationship between the quantity of a product that producers are willing to supply and the quantity that consumers are willing to buy. The demand curve represents the quantity of a product that consumers are willing and able to purchase at various prices, while the supply curve represents the quantity of a product that producers are willing and able to sell at various prices. The intersection of these curves determines the equilibrium price and quantity of a product in a market.",8,"The response provides a clear and concise explanation of the concept of supply and demand in economics. It uses some technical terms such as 'demand curve', 'supply curve', and 'equilibrium price', which are necessary for the explanation. However, these terms are well-explained and the sentences are structured in a straightforward manner, making the response generally accessible to a broad audience. The response could be slightly more accessible if it provided simpler definitions or examples for these terms. Therefore, the score is 8."
1,Describe the process of photosynthesis.,"Photosynthesis is the process in which green plants, algae, and some bacteria convert sunlight, water, and carbon dioxide into oxygen and glucose. This process occurs in the chloroplasts of the plant cells, where chlorophyll captures sunlight energy. The energy is used to split water molecules, releasing oxygen as a by-product and producing high-energy molecules called ATP and NADPH. Carbon dioxide is then absorbed from the air and combined with ATP and NADPH to synthesize glucose. Oxygen is released back into the atmosphere through small pores called stomata. Overall, photosynthesis is a vital process that sustains life on Earth by generating oxygen and providing energy-rich molecules for organisms.",8,"The response provides a detailed explanation of photosynthesis using mostly simple and clear language. It does use some technical terms such as 'chloroplasts', 'chlorophyll', 'ATP', 'NADPH', and 'stomata', which are necessary for a complete explanation of the process. However, these terms are explained in the context, making the information more accessible. The sentence structure is straightforward and the information is presented in a logical order. Therefore, the response is quite accessible to a broad audience, but not entirely free of technical terms."
2,What are the principles of object-oriented programming?,"The principles of object-oriented programming include encapsulation, inheritance, and polymorphism.",10,"The response is clear and concise, providing a straightforward answer to the question. The terms used, while technical, are fundamental to the topic of object-oriented programming and are presented without additional jargon or complex sentence structures. This makes the response easily understandable to a wide audience, regardless of their background in programming. Therefore, it scores a perfect 10 on the accessibility scale."
3,How does quantum computing differ from classical computing?,"Quantum computing differs from classical computing by harnessing the principles of quantum mechanics, such as superposition and entanglement, to perform computational tasks. It utilizes quantum bits or qubits, which can represent multiple states simultaneously, enabling parallel processing and potentially exponential speedup over classical computers.",5,"The response uses some technical terms such as 'quantum mechanics', 'superposition', 'entanglement', 'quantum bits' or 'qubits', and 'parallel processing'. These terms are necessary to accurately explain the difference between quantum and classical computing, but they may not be easily understood by a general audience. However, the overall structure of the sentences is not overly complex, and the response does make an effort to explain the concepts in a straightforward manner, such as by stating that qubits can represent multiple states simultaneously. Therefore, the response is somewhat accessible, but not exceptionally so, warranting a score of 5."
4,Discuss the themes in Shakespeare's Hamlet.,"The themes in Shakespeare's Hamlet include the complex nature of revenge, the uncertainty of mortality, the consequences of inaction, the corrupting influence of power, the struggle with existential questions, and the destructive effects of deception and appearance versus reality.",10,"The response is clear, concise, and free of unnecessary jargon. It uses simple language to explain the themes in Shakespeare's Hamlet, making it easily understandable by a wide audience, regardless of their background. The themes are listed in a straightforward manner, and the language used is not overly complex or technical. Therefore, it scores a perfect 10 on the accessibility scale."


In [11]:
df

Unnamed: 0,inputs,outputs,token_count,toxicity/v1/score,flesch_kincaid_grade_level/v1/score,ari_grade_level/v1/score,accessibility/v1/score,accessibility/v1/justification
0,Explain the concept of supply and demand in ec...,Supply and demand is a fundamental concept in ...,96,0.000158,15.3,17.8,8,The response provides a clear and concise expl...
1,Describe the process of photosynthesis.,Photosynthesis is the process in which green p...,136,0.000146,11.4,14.0,8,The response provides a detailed explanation o...
2,What are the principles of object-oriented pro...,The principles of object-oriented programming ...,16,0.000141,22.5,26.0,10,"The response is clear and concise, providing a..."
3,How does quantum computing differ from classic...,Quantum computing differs from classical compu...,57,0.00014,19.0,22.6,5,The response uses some technical terms such as...
4,Discuss the themes in Shakespeare's Hamlet.,The themes in Shakespeare's Hamlet include the...,49,0.000197,23.2,27.4,10,"The response is clear, concise, and free of un..."
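The aggregate accessibility metrics reported earlier can be reproduced by hand from the five per-row scores in this table (8, 8, 10, 5, 10). The sketch below assumes that `variance` is the population variance and that `p90` is a linearly interpolated 90th percentile. Those assumptions are consistent with the reported numbers but are not taken from MLflow's source. The `percentile` function is a hypothetical helper.

```python
import math
from statistics import fmean, pvariance

# Per-row accessibility scores from the results table above.
scores = [8, 8, 10, 5, 10]


def percentile(values, q):
    """Return the q-th percentile (0 to 100) using linear interpolation."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    # Interpolate between the two neighbouring order statistics.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


mean_score = fmean(scores)
variance_score = pvariance(scores)  # population variance (divides by n)
p90_score = percentile(scores, 90)

print(mean_score, variance_score, p90_score)
```

These come out to 8.2, approximately 3.36, and 10.0, matching `accessibility/v1/mean`, `accessibility/v1/variance`, and `accessibility/v1/p90` up to floating-point rounding.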
