# LLM Evaluation with MLflow

* https://mlflow.org/docs/latest/llms/llm-evaluate/notebooks/question-answering-evaluation.html
* https://openrouter.ai/meta-llama/llama-3.2-3b-instruct:free/api
* https://openrouter.ai/google/gemini-flash-1.5-8b-exp

In [2]:
import os

assert (
    "OPENAI_API_BASE" in os.environ
), "OPENAI_API_BASE environment variable must be set"
assert "OPENAI_API_KEY" in os.environ, "OPENAI_API_KEY environment variable must be set"
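
The assertions above stop at the first missing variable, and Python skips `assert` statements entirely when run with `-O`. A minimal sketch of an alternative that reports every missing variable at once is shown below. The helper name `missing_env_vars` and the sample base URL are assumptions for illustration; they are not part of the original notebook.

```python
import os

# Variables the notebook expects to be set before calling the model.
REQUIRED_VARS = ("OPENAI_API_BASE", "OPENAI_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demonstration against a hypothetical environment with only the base URL set.
# The URL is an assumed OpenAI-compatible OpenRouter endpoint.
sample_env = {"OPENAI_API_BASE": "https://openrouter.ai/api/v1"}
missing = missing_env_vars(REQUIRED_VARS, env=sample_env)
print(missing)  # ['OPENAI_API_KEY']
```

In a notebook, calling `missing_env_vars(REQUIRED_VARS)` with no `env` argument checks the real process environment, and a non-empty result can be raised as an error.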

In [3]:
import mlflow
import openai
import pandas as pd

In [4]:
eval_df = pd.DataFrame(
    {
        "inputs": [
            "How does useEffect() work?",
            "What does the static keyword in a function mean?",
            "What does the 'finally' block in Python do?",
            "What is the difference between multiprocessing and multithreading?",
        ],
        "ground_truth": [
            "The useEffect() hook tells React that your component needs to do something after render. React will remember the function you passed (we’ll refer to it as our “effect”), and call it later after performing the DOM updates.",
            "Static members belongs to the class, rather than a specific instance. This means that only one instance of a static member exists, even if you create multiple objects of the class, or if you don't create any. It will be shared by all objects.",
            "'Finally' defines a block of code to run when the try... except...else block is final. The finally block will be executed no matter if the try block raises an error or not.",
            "Multithreading refers to the ability of a processor to execute multiple threads concurrently, where each thread runs a process. Whereas multiprocessing refers to the ability of a system to run multiple processors in parallel, where each processor can run one or more threads.",
        ],
    }
)

In [6]:
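# Assumes an MLflow tracking server is already running locally on port 5000.
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Group all runs from this notebook under one experiment.
mlflow.set_experiment(experiment_name="llm-qa-evaluation")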

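with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    # Log a chat-completion model definition; requests go to the
    # OpenAI-compatible endpoint configured in OPENAI_API_BASE.
    basic_qa_model = mlflow.openai.log_model(
        model="meta-llama/llama-3.2-3b-instruct:free",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            # "{question}" is the prompt template variable filled from each evaluation input.
            {"role": "user", "content": "{question}"},
        ],
    )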
    results = mlflow.evaluate(
        basic_qa_model.model_uri,
        eval_df,
        targets="ground_truth",  # specify which column corresponds to the expected output
        model_type="question-answering",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )
results.metrics

2024/11/17 09:21:48 INFO mlflow.models.evaluation.default_evaluator: Computing model predictions.
2024/11/17 09:22:06 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2024/11/17 09:22:06 INFO mlflow.tracking._tracking_service.client: 🏃 View run efficient-snake-149 at: http://127.0.0.1:5000/#/experiments/771549027936709739/runs/974201f6da88405ca81748d087099357.
2024/11/17 09:22:06 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://127.0.0.1:5000/#/experiments/771549027936709739.


{'exact_match/v1': 0.0}
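
A score of 0.0 is expected here: exact match only counts a prediction when it is identical to the target string, and a free-form two-sentence answer will almost never reproduce the reference text character for character. The sketch below illustrates the idea in plain Python. It is not MLflow's implementation, and the function name and toy strings are hypothetical.

```python
def exact_match(predictions, targets):
    """Fraction of predictions that are string-identical to their target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(pred == target for pred, target in zip(predictions, targets))
    return hits / len(predictions)


# Toy data for illustration only (not taken from the evaluation run).
targets = ["Paris", "4"]
print(exact_match(["Paris", "4"], targets))                  # 1.0
print(exact_match(["The capital is Paris.", "4"], targets))  # 0.5
```

The second call shows how a correct but differently worded answer still counts as a miss, which is why this metric says little about the quality of generated answers.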

In [7]:
results.tables["eval_results_table"]


| | inputs | ground_truth | outputs | token_count |
|---|---|---|---|---|
| 0 | How does useEffect() work? | The useEffect() hook tells React that your com... | The `useEffect()` hook in React is used to han... | 84 |
| 1 | What does the static keyword in a function mean? | Static members belongs to the class, rather th... | The static keyword in a function means that th... | 67 |
| 2 | What does the 'finally' block in Python do? | 'Finally' defines a block of code to run when ... | The 'finally' block in Python is a special blo... | 61 |
| 3 | What is the difference between multiprocessing... | Multithreading refers to the ability of a proc... | Multiprocessing and multithreading are two dif... | 83 |
