# Comparing OpenAI Models to Open-Source Transformers Models with MLflow

With MLflow, you can compare outputs from open-source LLMs with outputs from hosted proprietary models such as those from OpenAI, Anthropic, or Cohere.

To run this example, make sure you have your `OPENAI_API_KEY` set in your environment or in a `.env` file.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

assert (
    "OPENAI_API_KEY" in os.environ
), "Please set the OPENAI_API_KEY environment variable."
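
The assertion above only checks that the variable exists, so an empty value would still pass. As a minimal sketch (the helper name `missing_env_vars` is hypothetical and not part of MLflow or `python-dotenv`), a check that also treats empty values as missing could look like this:

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Illustrative call with a placeholder value instead of a real key.
missing = missing_env_vars(["OPENAI_API_KEY"], env={"OPENAI_API_KEY": "placeholder"})
print(missing)
```

Calling `missing_env_vars(["OPENAI_API_KEY"])` after `load_dotenv()` would run the same check against the real environment.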

## Log the OpenAI Model
To use an OpenAI model for evaluation, we first need to log it in MLflow format with `mlflow.openai.log_model()`.

In [None]:
import mlflow
import openai

mlflow.set_experiment("compare-openai-transformers-4")

with mlflow.start_run(run_name="log_model_gpt-3.5-turbo"):
    gpt3_5_turbo_model_info = mlflow.openai.log_model(
        model="gpt-3.5-turbo",
        task=openai.ChatCompletion,
        artifact_path="gpt_3_5_turbo_model",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant",
            }
        ],
    )

Once we've logged the model, we can load it and call it as follows:

In [None]:
model = mlflow.pyfunc.load_model(gpt3_5_turbo_model_info.model_uri)
print(model.predict("Where would you likely find a whale?"))

## Log the Transformers Model

We'll use the Hugging Face `gpt2-xl` model and load it as a text-generation pipeline.

In [None]:
from transformers import pipeline

# device_map="auto" requires the `accelerate` package to be installed.
gpt2_pipe = pipeline("text-generation", model="gpt2-xl", device_map="auto")

To use this model with `mlflow.evaluate()`, we need to wrap it in a custom pyfunc model class.

In [None]:
import pandas as pd

class PyfuncGpt2(mlflow.pyfunc.PythonModel):
    """PyfuncGpt2 extends mlflow.pyfunc.PythonModel to create a custom MLflow
    model for text generation with the Hugging Face `gpt2-xl` pipeline.
    """

    def __init__(self):
        """
        Initializes a new instance of the PyfuncGpt2 class.

        Takes no arguments; the pipeline itself is created in load_context().
        """
        super().__init__()

    def load_context(self, context):
        """
        Loads the `gpt2-xl` text-generation pipeline when MLflow loads the model.

        Args:
            context: The MLflow context.
        """
        self.model = pipeline("text-generation", model="gpt2-xl", device_map="auto")

    def predict(self, context, model_input):
        """
        Generates text based on the provided model_input using the loaded model.

        Args:
            context: The MLflow context.
            model_input: A pandas DataFrame, a list of strings, or a single string.

        Returns:
            list: A list of generated texts.
        """
        if isinstance(model_input, pd.DataFrame):
            model_input = model_input.values.flatten().tolist()
        elif not isinstance(model_input, list):
            model_input = [model_input]

        generated_text = []

        for input_text in model_input:
            output = self.model(
                "Answer the following question or instruction.\nQuestion: "
                + input_text
                + "\nAnswer: ",
                return_full_text=False,
                do_sample=True,
                top_k=5,
                temperature=0.7,
                max_new_tokens=15,
                repetition_penalty=1.1,
            )
            output_text = output[0]["generated_text"]
            cutoff_index = output_text.find("Question: ")
            # Keep only the text before 'Question: ' if it appears in the output;
            # otherwise keep the full text.
            short_output = output_text if cutoff_index == -1 else output_text[:cutoff_index]
            generated_text.append(short_output)

        return generated_text
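
The prompt construction and the truncation step inside `predict` can also be factored into small pure functions, which makes them testable without loading `gpt2-xl`. The following is a minimal sketch, not part of the original class; the names `build_prompt`, `truncate_at_marker`, `PROMPT_TEMPLATE`, and `STOP_MARKER` are introduced here for illustration only.

```python
# Same prompt text and stop marker as in PyfuncGpt2.predict (names are illustrative).
PROMPT_TEMPLATE = (
    "Answer the following question or instruction.\nQuestion: {question}\nAnswer: "
)
STOP_MARKER = "Question: "


def build_prompt(question):
    """Wrap a question in the instruction-style prompt."""
    return PROMPT_TEMPLATE.format(question=question)


def truncate_at_marker(text, marker=STOP_MARKER):
    """Return the text before the first `marker`, or all of it if the marker is absent."""
    head, _, _ = text.partition(marker)
    return head


print(build_prompt("What is the capital of Canada?"))
print(truncate_at_marker("Ottawa.\nQuestion: What is"))
```

`str.partition` returns the whole string as its first element when the marker is not found, so it matches the `find`-based logic in `predict`.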


And we can log this model as follows:

In [None]:
gpt2 = PyfuncGpt2()

with mlflow.start_run(run_name="log_model_gpt2"):
    gpt2_model_info = mlflow.pyfunc.log_model(
        artifact_path="gpt_2_xl_model",
        python_model=gpt2,
    )


We can load and use this model in the same way as the OpenAI model. MLflow provides a consistent API, which makes it straightforward to compare the two models even though one is open source and the other is accessed through a proprietary API.

In [None]:
model = mlflow.pyfunc.load_model(gpt2_model_info.model_uri)
print(model.predict("Where would you likely find a whale?"))

## Comparing the Models

We can use `mlflow.evaluate()` (as described in [this post](https://medium.com/@dliden/comparing-llms-with-mlflow-1c69553718df)) to compare the two models. First, we generate a comparison dataset.

In [None]:
eval_df = pd.DataFrame(
    {
        "question": [
            "The Bering Strait is a narrow sea passage that separates Russia and Alaska. The crossing is approximately 82 kilometers wide and is named after Vitus Bering. What does the passage say is 82 kilometers wide?",
            "Does the following sentence express a positive or negative sentiment? The movie was enjoyable and exceeded my expectations.",
            "Can the hypothesis be inferred from the premise? Premise: The trophy would not fit in the brown suitcase because it was too big. Hypothesis: The trophy was too big for the suitcase.",
            "Where would you likely find a whale?",
            "What is the capital of Canada?",
        ]
    }
)

In [None]:
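# Keep one result per model so the loop does not overwrite earlier results.
evaluation_results = {}
for model_info in [gpt3_5_turbo_model_info, gpt2_model_info]:
    # Reopen the run that logged the model, using its stored run ID.
    with mlflow.start_run(run_id=model_info.run_id):
        evaluation_results[model_info.artifact_path] = mlflow.evaluate(
            model=model_info.model_uri,
            model_type="text",
            data=eval_df,
        )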
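
The idea behind the comparison is simply to collect each model's answer to every question in one place. The sketch below illustrates that idea in plain Python with stand-in callables; it does not use the MLflow API, and `compare_outputs` and the stub models are hypothetical names introduced only for this example.

```python
def compare_outputs(questions, models):
    """Collect each model's answer to each question, one row per question.

    `models` maps a display name to a callable that takes a list of inputs and
    returns a list of outputs, the same list-in, list-out shape as
    PyfuncGpt2.predict.
    """
    rows = []
    for question in questions:
        row = {"question": question}
        for name, predict in models.items():
            row[name] = predict([question])[0]
        rows.append(row)
    return rows


# Stand-in "models" so the sketch runs without API keys or model downloads.
stub_models = {
    "model_a": lambda qs: [q.upper() for q in qs],
    "model_b": lambda qs: [q[::-1] for q in qs],
}

rows = compare_outputs(["abc"], stub_models)
print(rows)
```

With real models, each callable could wrap the `predict` method of a model returned by `mlflow.pyfunc.load_model`, assuming it accepts a list of strings as `PyfuncGpt2` does.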