# Local LLM Inference Server with MLflow

In this notebook, we will create a custom MLflow `PythonModel` (PyFunc model) and serve it on our local machine. We will then use MLflow AI Gateway to access the model via an OpenAI-compatible endpoint that works with the OpenAI Python SDK.

Our custom model will randomly select a model and a system prompt, hinting at how we could further develop this approach into a more sophisticated model evaluation system.

## Model Setup

First, we load our environment variables (the API keys) from a `.env` file. The remaining imports and the system prompts live in the model file defined in the next section.

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

## Define the Model

Note that, at present, the schema of MLflow's `ChatModel` is not compatible with AI Gateway, so we need to use a `PythonModel` instead.

See [here](https://github.com/mlflow/mlflow/blob/0866e4c4b06a598144c0771592fbdaf7f0fadd6d/mlflow/gateway/providers/mlflow.py#L145C1-L170C1) for details on the schema AI Gateway sends to the MLflow model server.

AI Gateway sends a request to the model server formatted as follows:

```
# Example request to MLflow REST API for chat:
# {
#     "inputs": ["question"],
#     "params": {"temperature": 0.2},
# }
```

Some notes:
- AI Gateway expects messages formatted in an OpenAI-compatible way.
- Those messages are translated to the above format, which is what is sent to the model server.
- This will not send a message history. Only the most recent message is sent.
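
To make the translation described in these notes concrete, here is a minimal sketch of the mapping from OpenAI-style messages to the `inputs`/`params` body. The function name `to_model_server_payload` is hypothetical and this is not the Gateway's actual code. It assumes, per the notes above, that only the last message's content is forwarded; the real provider code linked above may treat multi-message requests differently (for example, by rejecting them).

```python
from typing import Any, Dict, List


def to_model_server_payload(
    messages: List[Dict[str, str]], **params: Any
) -> Dict[str, Any]:
    """Illustrative only: build an {"inputs": [...], "params": {...}} body."""
    if not messages:
        raise ValueError("at least one message is required")
    # Assumption: only the most recent message's content is forwarded.
    payload: Dict[str, Any] = {"inputs": [messages[-1]["content"]]}
    if params:
        payload["params"] = params
    return payload


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "What is MLflow?"},
]
print(to_model_server_payload(history, temperature=0.2))
# {'inputs': ['What is MLflow?'], 'params': {'temperature': 0.2}}
```

This is why `predict` in the model below reads its input as `model_input[0]`: the model server receives a one-element list containing a plain string.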

Here is the code for our custom model:

In [74]:
%%writefile my_chat_pyfunc.py

import mlflow.pyfunc
from mlflow.pyfunc import PythonModel
from openai import OpenAI
import os
import random
from dotenv import load_dotenv
from mlflow.models import set_model

class TestModel(PythonModel):
    def __init__(self):
        if not os.getenv("OPENAI_API_KEY") and os.path.exists(".env"):
            load_dotenv()
        self.openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.gemini_client = OpenAI(
            api_key=os.getenv("GEMINI_API_KEY"),
            base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
        )
    
    def _select_model(self):
        return "gpt-4o-mini" if random.random() < 0.5 else "gemini-1.5-flash-latest"

    def _select_sys_prompt(self):
        return random.choice([
            "You are a helpful assistant.",
            "You are watching TV and don't want to be bothered."
        ])

    def predict(self, context, model_input):
        # Get the content string from the input
        content = model_input[0]
        
        # Format messages with system prompt
        messages = [
            {"role": "system", "content": self._select_sys_prompt()},
            {"role": "user", "content": content}
        ]

        # Get response from model
        model = self._select_model()
        client = self.openai_client if model == "gpt-4o-mini" else self.gemini_client
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.8,
            max_tokens=1000
        )
        
        # Return string response in list as expected by AI Gateway.
        # It will be translated to a dictionary by Gateway.
        return [response.choices[0].message.content]

set_model(TestModel())

Overwriting my_chat_pyfunc.py
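
The `predict` method above hard-codes `temperature=0.8` and `max_tokens=1000`. If we wanted to honor caller-supplied inference parameters, one possible approach is sketched below. The helper `resolve_params` and the `DEFAULT_PARAMS` dictionary are hypothetical names, not part of the original notebook. The sketch assumes `predict` is declared with MLflow's optional `params` argument, and that the model is logged with a signature declaring those parameters, since MLflow generally only forwards params it knows about.

```python
from typing import Any, Dict, Optional

# Defaults mirror the values hard-coded in TestModel.predict above.
DEFAULT_PARAMS: Dict[str, Any] = {"temperature": 0.8, "max_tokens": 1000}


def resolve_params(params: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge caller-supplied inference params over the defaults.

    Keys not listed in DEFAULT_PARAMS are dropped so they are never
    forwarded to the provider; None values fall back to the default.
    """
    resolved = dict(DEFAULT_PARAMS)
    for key, value in (params or {}).items():
        if key in DEFAULT_PARAMS and value is not None:
            resolved[key] = value
    return resolved


# Hypothetical use inside the model (an assumption, not the original code):
#
#     def predict(self, context, model_input, params=None):
#         options = resolve_params(params)
#         response = client.chat.completions.create(
#             model=model, messages=messages, **options
#         )

print(resolve_params({"temperature": 0.2, "stream": True}))
# {'temperature': 0.2, 'max_tokens': 1000}
```

Filtering against an allow-list keeps unsupported options such as `stream` from reaching the provider call, which matters here because the model does not implement streaming.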


## Test the Model

Let's load the model and make sure it works.

In [82]:
import mlflow
code_path = "my_chat_pyfunc.py"

# Input example should match how it arrives from AI Gateway
input_example = ["Hello, I have a question for you."]

with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        "my_chat_pyfunc",
        python_model=code_path,
        input_example=input_example,
    )

loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

# Test the model with array input (how it arrives via serve)
result = loaded_model.predict(input_example)
print("Serve format test:", result)

2024/12/11 14:44:53 INFO mlflow.models.model: Found the following environment variables used during model inference: [GEMINI_API_KEY, OPENAI_API_KEY]. Please check if you need to set them when deploying the model. To disable this message, set environment variable `MLFLOW_RECORD_ENV_VARS_IN_MODEL_LOGGING` to `false`.


Serve format test: ["I'm currently focused on my show, but feel free to ask your question, and I'll do my best to help!"]


## Serve the Custom Model Locally

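We can serve the model with the `mlflow models serve` CLI command. We use the following (substitute the URI of your own logged model, e.g. the value of `model_info.model_uri`):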

`export $(cat .env | xargs) && mlflow models serve -m file:///Users/daniel.liden/git/devrel-examples/notebooks/mlflow/mlruns/0/1a5f86514a6c4cfbac6db9aa38c44fda/artifacts/test_model -p 5002 --env-manager local`

So the model is now being served at `http://127.0.0.1:5002`.

For full compatibility with the OpenAI SDK, we need to use the MLflow AI Gateway, which provides an OpenAI-compatible endpoint. Here's how we configure it to work with MLflow model serving.

In [72]:
%%writefile gateway.yaml
endpoints:
  - name: chat
    endpoint_type: llm/v1/chat
    model:
      provider: mlflow-model-serving
      name: my_chat_pyfunc
      config:
        model_server_url: http://127.0.0.1:5002
        openai_api_key: $OPENAI_API_KEY
        gemini_api_key: $GEMINI_API_KEY

Overwriting gateway.yaml


Now we start both servers.

In [None]:
# Start the model server (run in a separate terminal):
# export $(cat .env | grep -e OPENAI_API_KEY -e GEMINI_API_KEY | xargs) && mlflow models serve -m models:/test_model/latest -p 5002 --env-manager local

# Start the gateway (run in a separate terminal):
# mlflow gateway start --config-path gateway.yaml --port 7000
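
Since both servers take a moment to start, it can help to wait until the model server responds before sending requests. The helper below is a minimal sketch using only the standard library; `wait_until_ready` is a hypothetical name, and the sketch assumes the MLflow model server answers `GET /ping` with HTTP 200 once it is up.

```python
import time
import urllib.request


def wait_until_ready(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers connection refused, timeouts, and urllib's URLError:
            # the server is not accepting requests yet.
            pass
        time.sleep(interval)
    return False


# Example (assumes the model server from the command above on port 5002):
#     wait_until_ready("http://127.0.0.1:5002/ping")
```

The function returns `False` rather than raising, so a notebook cell can decide whether to retry or stop.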

And now we can query our custom model using the OpenAI SDK, with the caveat noted above: we cannot send a message history, only the most recent message.

Note that we also did not configure our model to handle streaming or inference parameters.

In [83]:
from openai import OpenAI

# Initialize client pointing to the local gateway
client = OpenAI(
    base_url="http://localhost:7000/v1",
    api_key="not-needed"
)

# Format the input as a list of message dictionaries (OpenAI chat schema)
messages = [
    {
        "role": "user",
        "content": "Hello",
    }
]

# Make the request
response = client.chat.completions.create(
    model="chat",
    messages=messages,
    temperature=0.7
)

print(response)

ChatCompletion(id=None, choices=[Choice(finish_reason=None, index=0, logprobs=None, message=ChatCompletionMessage(content='(No response, eyes glued to the TV screen.)\n', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1733950135, model='my_chat_pyfunc', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=None, prompt_tokens=None, total_tokens=None, completion_tokens_details=None, prompt_tokens_details=None))
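
Because the model only ever receives a single message, one possible client-side workaround for multi-turn conversations is to flatten the history into one user message before calling the gateway. This is a rough sketch, not something the notebook does; `flatten_history` is a hypothetical helper, and the `role: content` transcript format is an arbitrary choice.

```python
from typing import Dict, List


def flatten_history(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Collapse a conversation into a single user message."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return [{"role": "user", "content": transcript}]


history = [
    {"role": "user", "content": "My name is Ada."},
    {"role": "assistant", "content": "Nice to meet you, Ada."},
    {"role": "user", "content": "What is my name?"},
]
single = flatten_history(history)
print(single[0]["content"])

# The flattened list can then be sent like any other request, e.g.:
#     client.chat.completions.create(model="chat", messages=single)
```

The trade-off is that role information is reduced to plain text inside one message, so the model's system prompt still applies but the providers no longer see distinct conversation turns.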


And here is how we can use the MLflow REST API directly (without invoking the Gateway).

In [89]:
import requests

# Test direct model server
response = requests.post(
    "http://127.0.0.1:5002/invocations",
    json={
        "inputs": ["Hello, I have a question for you."]
    },
    headers={"Content-Type": "application/json"}
)

print("Status Code:", response.status_code)
print("Response:", response.json())

Status Code: 200
Response: {'predictions': ["Of course! I'm here to help. What's your question?"]}
