# Evaluate LLMs on Perplexity and Toxicity

MLflow can now evaluate LLMs on perplexity and toxicity. This notebook briefly walks through using `mlflow.evaluate` to compare the perplexity and toxicity of the outputs of two models.

To get started with evaluation, check out [this notebook](./compare-llms-with-mlflow.ipynb).

## 1. Setup
We will compare OpenAI's `gpt-3.5-turbo` and Meta's `Llama-2-70b-chat` (via the Mosaic inference starter tier) using MLflow AI Gateway, but you can use any models compatible with the MLflow Evaluation functionality.

In [0]:
!pip install --upgrade 'mlflow[gateway]'
dbutils.library.restartPython()

In [0]:

# Configure keys: read them from Databricks secrets (replace the scope with your own)
import os
from dotenv import load_dotenv
import mlflow.gateway

load_dotenv()

OPENAI_API_KEY = dbutils.secrets.get(scope="daniel.liden", key="OPENAI_API_KEY")
MOSAIC_API_KEY = dbutils.secrets.get(scope="daniel.liden", key="MOSAIC_API_KEY")

In [0]:
# Only necessary if you haven't already created the routes

# mlflow.gateway.delete_route("dl-gpt-3_5-turbo")
mlflow.gateway.create_route(
    name="dl-gpt-3_5-turbo",
    route_type="llm/v1/chat",
    model= {
        "name": "gpt-3.5-turbo", 
        "provider": "openai",
        "openai_config": {
          "openai_api_key": OPENAI_API_KEY,
        }
    }
)

# mlflow.gateway.delete_route("dl-llama-70b-chat-mosaic")
mlflow.gateway.create_route(
    name="dl-llama-70b-chat-mosaic",
    route_type="llm/v1/chat",
    model= {
        "name": "llama2-70b-chat", 
        "provider": "mosaicml",
        "mosaicml_config": {
          "mosaicml_api_key": MOSAIC_API_KEY,
        }
    }
)

## 2. Register MLflow AI Gateway models

Now log and register pyfunc models that wrap the AI Gateway routes, storing the run IDs so we can later evaluate each model in the same run that logged it.

Note: `mlflow.evaluate` evaluation datasets can't have dictionaries within columns, but dictionaries are what the chat models expect as inputs. Later, we will convert the input dictionaries to JSON strings, so the predict methods defined below include logic for converting JSON strings back to dictionaries when needed.
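The JSON round trip described in this note can be sketched with the standard library alone. This is a minimal illustration, not part of the notebook's pipeline: the helper name `to_messages` is hypothetical, and it mirrors the `json.loads(x) if isinstance(x, str) else x` check used in the predict methods.

```python
import json


def to_messages(value):
    """Return chat messages as Python objects, decoding a JSON string if needed.

    Hypothetical helper: values that are already decoded pass through untouched.
    """
    return json.loads(value) if isinstance(value, str) else value


messages = [{"role": "user", "content": "Very concisely explain MLflow runs."}]

encoded = json.dumps(messages)   # string form, safe to store in an evaluation dataset column
decoded = to_messages(encoded)   # list-of-dicts form that the chat route expects

print(type(encoded).__name__)
print(decoded == messages)
print(to_messages(messages) is messages)
```

Because the helper checks the type before decoding, the same predict method can accept both the raw input example (lists of dicts) and the JSON-encoded evaluation dataset.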

In [0]:
import json  # used by the predict methods to decode JSON-encoded messages
import os
import pandas as pd
import mlflow
from mlflow.gateway import MlflowGatewayClient


# configure the predict methods for the custom models
def predict_mosaic(data):
    client = MlflowGatewayClient("databricks")
    # Convert JSON strings in 'messages' column to dicts if necessary
    data["messages"] = data["messages"].apply(lambda x: json.loads(x) if isinstance(x, str) else x)
    payload = data.to_dict(orient="records")
    return [
        client.query(route="dl-llama-70b-chat-mosaic",
                     data=query)['candidates'][0]['message']['content']
        for query in payload
    ]

def predict_openai(data):
    client = MlflowGatewayClient("databricks")
    # Convert JSON strings in 'messages' column to dicts if necessary
    data["messages"] = data["messages"].apply(lambda x: json.loads(x) if isinstance(x, str) else x)
    payload = data.to_dict(orient="records")
    return [
        client.query(route="dl-gpt-3_5-turbo",
                     data=query)['candidates'][0]['message']['content']
        for query in payload
    ]


# generate input examples
input_example = pd.DataFrame.from_dict(
    {
        "messages": [
            [
                {"role": "user", "content": "Very concisely explain MLflow runs."},
                {"role": "assistant", "content": "No."},
                {"role": "user", "content": "Very concisely explain MLflow runs."},
            ],
            [{"role": "user", "content": "Very concisely explain MLflow artifacts."}],
        ],
        "temperature": 0.6,
        "max_tokens": 50,
    }
)

# generate model signatures
signature = mlflow.models.infer_signature(
    input_example, ["MLflow runs are...", "MLflow artifacts are..."]
)

# record run ids and log models
run_ids = []

with mlflow.start_run(run_name="log_mosaic_model"):
    run_ids.append(mlflow.active_run().info.run_id)
    mosaic_model_info = mlflow.pyfunc.log_model(
        python_model=predict_mosaic,
        registered_model_name="dl-llama-70b-chat-mosaic-gateway",
        artifact_path="dbfs/daniel.liden/mlflow/dl-llama-70b-chat-mosaic-gateway/",
        input_example=input_example,
        signature=signature,
    )

with mlflow.start_run(run_name="log_openai_model"):
    run_ids.append(mlflow.active_run().info.run_id)
    openai_model_info = mlflow.pyfunc.log_model(
        python_model=predict_openai,
        registered_model_name="dl-gpt-3_5-turbo-gateway",
        artifact_path="dbfs/daniel.liden/mlflow/dl-gpt-3_5-turbo-gateway/",
        input_example=input_example,
        signature=signature,
    )

# optionally load the models
loaded_mosaic_model = mlflow.pyfunc.load_model(mosaic_model_info.model_uri)
loaded_openai_model = mlflow.pyfunc.load_model(openai_model_info.model_uri)

## 3. Define an evaluation dataset

As noted above, we convert the input dicts into JSON strings as required by `mlflow.evaluate`.

In [0]:
import json

data = {
        "messages": [
            [{"role": "user", "content": "Very concisely explain MLflow runs."}],
            [{"role": "user", "content": "Very concisely explain MLflow artifacts."}],
            [{"role": "user", "content": "How does backpropagation work?"}],
            [{"role": "user", "content": "Differences between RNN and LSTM?"}],
            [{"role": "user", "content": "What is transfer learning?"}],
            [{"role": "user", "content": "Explain the concept of overfitting."}],
            [{"role": "user", "content": "How are CNNs used in image recognition?"}],
            [{"role": "user", "content": "What does the term 'bias' mean in AI?"}],
            [{"role": "user", "content": "Describe reinforcement learning briefly."}],
            [{"role": "user", "content": "Tell me about gradient descent."}],
        ],
        "temperature": 0.6,
        "max_tokens": 50,
    }

data["messages"] = [json.dumps(item) for item in data["messages"]]
eval_df = pd.DataFrame.from_dict(data)
eval_df


## 4. Evaluate with perplexity and toxicity
Perplexity and toxicity are now default evaluation metrics for models with `model_type="text"`, so we don't need to do anything special to add these metrics.
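As background on what the perplexity number means, it is conventionally defined as the exponential of the average negative log-likelihood per token under some language model. The sketch below only illustrates that definition with made-up token probabilities. It does not reproduce how MLflow computes the metric, and the function name `perplexity` and the input values are assumptions chosen for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_logprobs` holds natural-log probabilities that a language model
    assigned to each token of a text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Made-up example: every token was assigned probability 0.25, so the model is
# as uncertain as a uniform choice among 4 tokens and perplexity is close to 4.
uniform_logprobs = [math.log(0.25)] * 4
print(perplexity(uniform_logprobs))

# Made-up example: higher token probabilities give lower perplexity.
confident_logprobs = [math.log(0.9)] * 4
print(perplexity(confident_logprobs))
```

Lower perplexity means the scoring model found the text more predictable, which is why it is often read as a rough fluency signal rather than a measure of correctness.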

In [0]:
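# Resume the run that logged the Mosaic model so the metrics land in the same run
with mlflow.start_run(run_id=run_ids[0]):
    eval_df["model_name"] = "mosaic_llama"
    mlflow.evaluate(
        # Alternatively, pass the loaded model: model=loaded_mosaic_model
        model=mosaic_model_info.model_uri,
        data=eval_df,
        model_type="text",
    )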

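# Resume the run that logged the OpenAI model so the metrics land in the same run
with mlflow.start_run(run_id=run_ids[1]):
    eval_df["model_name"] = "openai"
    mlflow.evaluate(
        # Alternatively, pass the loaded model: model=loaded_openai_model
        model=openai_model_info.model_uri,
        data=eval_df,
        model_type="text",
    )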

You can view the results in the MLflow UI or load the evaluation table as follows:

In [0]:
mlflow.load_table("eval_results_table.json")