# Creating a Large Language Model Inference Service

Welcome to the fourth part of our tutorial series on building a question-answering application over a corpus of private documents using Large Language Models (LLMs). In the previous Notebooks, we created vector embeddings of our documents, set up a Vector Store Inference Service, and tested its performance.

<figure>
  <img src="images/llm.jpg" alt="llm" style="width:100%">
  <figcaption>
      Photo by <a href="https://unsplash.com/@deepmind?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Google DeepMind</a> on <a href="https://unsplash.com/photos/LaKwLAmcnBc?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
  </figcaption>
</figure>

Now, we're moving towards the next crucial step: creating an Inference Service for the Large Language Model (LLM). This Inference Service will be the centerpiece of our question-answering system, working in tandem with the Vector Store Inference Service to deliver comprehensive and accurate answers to user queries.

In this Notebook, we'll guide you through the steps required to set up this LLM Inference Service. We'll discuss how to create a Docker image for the custom predictor, explain the role of the transformer component, define a KServe InferenceService manifest in YAML, and deploy the service.

By the end of this Notebook, you'll have a fully functioning LLM Inference Service that can take user queries, interact with the Vector Store, and provide insightful responses.

Let's dive in! Start by importing the libraries you'll need:

In [None]:
import os
import base64
import logging
import warnings
import subprocess

import mlflow
import requests
import boto3
import s3transfer
import ipywidgets as widgets

warnings.filterwarnings('ignore')

In [None]:
# Get all loggers
loggers = logging.Logger.manager.loggerDict.values()

# Iterate over all loggers and set their level to ERROR,
# as we don't want to pollute the output of the code cells
# with debugging messages.
for logger in loggers:
    if isinstance(logger, logging.Logger):
        logger.setLevel(logging.ERROR)

In [None]:
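def encode_base64(message: str) -> str:
    """Return `message` as a base64 string, the form a Kubernetes Secret expects."""
    # Encode as UTF-8 rather than ASCII so that non-ASCII characters
    # (for example in a password) do not raise a UnicodeEncodeError.
    encoded_bytes = base64.b64encode(message.encode('utf-8'))
    return encoded_bytes.decode('ascii')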
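
Kubernetes expects the values under a Secret's `data` field to be base64-encoded, which is why the helper above exists. Note that base64 is an encoding, not encryption. Here is a quick round-trip check using only the standard library; the sample value `my-user` is made up for illustration:

```python
import base64

# Encode a sample value the same way the helper does (the value is hypothetical)
sample = "my-user"
encoded = base64.b64encode(sample.encode("utf-8")).decode("ascii")
print(encoded)

# Decoding recovers the original string; Kubernetes performs this step
# before exposing the value to a container as an environment variable.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```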

# Architecture

In this setup, an additional component, called a "transformer", plays a pivotal role in processing user queries and integrating the Vector Store Inference Service with the LLM Inference Service. The transformer's role is to intercept the user's request, extract the necessary information, and then communicate with the Vector Store Inference Service to retrieve the relevant context. The transformer then takes this context, combines it with the user's query, and forwards the enriched prompt to the LLM predictor.

Here's a detailed look at the process:

1. Intercepting the User's Request: The transformer acts as a gateway between the user and the LLM inference service. When a user sends a query, it first reaches the transformer. The transformer extracts the query from the request.
1. Communicating with the Vector Store Inference Service: The transformer then takes the user's query and sends a POST request to the Vector Store Inference Service including the user's query in the payload, just like you did in the previous Notebook.
1. Receiving and Processing the Context: The Vector Store Inference Service responds by sending back the relevant context.
1. Combining the Context with the User's Query: The transformer then combines the received context with the user's original query using a prompt template. This creates an enriched prompt that contains both the user's original question and the relevant context from our documents.
1. Forwarding the Enriched Prompt to the LLM Predictor: Finally, the transformer forwards this enriched prompt to the LLM predictor. The predictor processes it and generates a response, which is sent back to the user.
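
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code shipped in the `transformer` folder: the prompt template, the payload shape (`instances`, `input`, `prompt`), and the function names are all assumptions, and a stub stands in for the POST request to the Vector Store so the sketch runs without a live service.

```python
from typing import Callable, List

# Hypothetical prompt template; the real one lives in the transformer source.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, documents: List[str]) -> str:
    """Combine the retrieved documents and the user's question into one prompt."""
    context = "\n".join(doc.strip() for doc in documents)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def preprocess(request: dict, retrieve: Callable[[str], List[str]]) -> dict:
    """Mimic the transformer: extract the query, fetch context, enrich the prompt."""
    question = request["instances"][0]["input"]  # assumed payload shape
    documents = retrieve(question)  # stands in for the Vector Store request
    return {"instances": [{"prompt": build_prompt(question, documents)}]}


def fake_retrieve(query: str) -> List[str]:
    """Stub retriever that returns a fixed document instead of calling a service."""
    return ["The sky is blue."]


enriched = preprocess({"instances": [{"input": "What color is the sky?"}]}, fake_retrieve)
print(enriched["instances"][0]["prompt"])
```

Passing the retriever in as a function keeps the prompt-building logic testable on its own; in a real transformer it would be replaced by the call to the Vector Store Inference Service.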

As such, you should build two custom Docker images at this point: one for the predictor and one for the transformer. The source code and the Dockerfiles are provided in the corresponding folders: `llm` and `transformer`. Alternatively, for your convenience, you can use the images we have pre-built for you:

- Predictor: `gcr.io/mapr-252711/ezua-demos/llm-predictor:1.0`
- Transformer: `gcr.io/mapr-252711/ezua-demos/llm-transformer:1.0`
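
If you prefer to build the images yourself, both builds follow the same pattern. Here is a rough sketch, assuming Docker is available where you run it and that the `llm` and `transformer` folders each contain a Dockerfile. The registry `registry.example.com/demo`, the image names, and the helper functions are placeholders, not values from this tutorial:

```python
import subprocess
from typing import List

REGISTRY = "registry.example.com/demo"  # hypothetical registry, replace with yours


def image_commands(name: str, context_dir: str, tag: str = "1.0") -> List[List[str]]:
    """Return the docker build and push commands for one image."""
    image = f"{REGISTRY}/{name}:{tag}"
    return [
        ["docker", "build", "-t", image, context_dir],
        ["docker", "push", image],
    ]


def build_and_push(name: str, context_dir: str, tag: str = "1.0") -> None:
    """Run the commands; check=True stops at the first failing step."""
    for command in image_commands(name, context_dir, tag):
        subprocess.run(command, check=True)


# Print the commands instead of running them, so the sketch has no side effects
for cmd in image_commands("llm-predictor", "llm") + image_commands("llm-transformer", "transformer"):
    print(" ".join(cmd))
```

Calling `build_and_push("llm-predictor", "llm")` would then execute the build and push for the predictor image.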

Once ready, proceed with the next steps.

# Creating the Inference Service

As before, you'll need to provide a few variables:

1. The domain of your EzAF cluster (e.g., hpe-ezaf)
2. Your username and password so you can create the secret your inference service will need to connect to the Vector Store.
3. The custom predictor Docker image you built.
4. The custom transformer Docker image you built.

In [1]:
# Add heading
heading = widgets.HTML("<h2>MLflow Credentials</h2>")
display(heading)

ezaf_env_input = widgets.Text(description='EZAF Env:')
username_input = widgets.Text(description='Username:')
password_input = widgets.Password(description='Password:')
predictor_input = widgets.Text(description='Predictor image:')
transformer_input = widgets.Text(description='Transformer image:')
submit_button = widgets.Button(description='Submit')
success_message = widgets.Output()

ezaf_env = None
username = None
password = None
predictor = None
transformer = None

def submit_button_clicked(b):
    global ezaf_env, username, password, predictor, transformer
    ezaf_env = ezaf_env_input.value
    username = username_input.value
    password = password_input.value
    predictor = predictor_input.value
    transformer = transformer_input.value
    with success_message:
        success_message.clear_output()
        print("Credentials submitted successfully!")
    submit_button.disabled = True

submit_button.on_click(submit_button_clicked)

# Set margin on the submit button
submit_button.layout.margin = '20px 0 20px 0'

# Display inputs and button
display(ezaf_env_input, username_input, password_input, predictor_input, transformer_input, submit_button, success_message)

In [None]:
isvc = """
apiVersion: v1
kind: Secret
metadata:
  name: keycloak-secret
type: Opaque
data:
  EZAF_ENV: {0}
  USERNAME: {1}
  PASSWORD: {2}

---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm
spec:
  predictor:
    timeout: 600
    containers:
    - name: kserve-container
      image: {3}
      imagePullPolicy: Always
      resources:
        requests:
          memory: "8Gi"
          cpu: "1000m"
        limits:
          memory: "8Gi"
          cpu: "2000m"
  transformer:
    timeout: 600
    containers:
      - image: {4}
        name: kserve-container
        args: ["--use_ssl"]
        env:
          - name: EZAF_ENV
            valueFrom:
              secretKeyRef:
                key: EZAF_ENV
                name: keycloak-secret
          - name: USERNAME
            valueFrom:
              secretKeyRef:
                key: USERNAME
                name: keycloak-secret
          - name: PASSWORD
            valueFrom:
              secretKeyRef:
                key: PASSWORD
                name: keycloak-secret
""".format(encode_base64(ezaf_env),
           encode_base64(username),
           encode_base64(password),
           predictor,
           transformer)

with open("llm/isvc.yaml", "w") as f:
    f.write(isvc)

In [None]:
# check=True raises CalledProcessError if kubectl exits with a non-zero status
subprocess.run(["kubectl", "apply", "-f", "llm/isvc.yaml"], check=True)
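
Applying the manifest returns before the service is ready to serve requests. Here is a minimal sketch of waiting for readiness, assuming the service is named `llm` as in the manifest above, that `kubectl` is on the PATH, and that it points at the right namespace. The helper names and the 600-second timeout are illustrative choices:

```python
import subprocess
from typing import List


def build_wait_command(name: str = "llm", timeout_s: int = 600) -> List[str]:
    """Build a `kubectl wait` command that blocks until the service is Ready."""
    return [
        "kubectl", "wait",
        "--for=condition=Ready",
        f"inferenceservice/{name}",
        f"--timeout={timeout_s}s",
    ]


def wait_until_ready(name: str = "llm", timeout_s: int = 600) -> bool:
    """Return True if kubectl reported the InferenceService Ready before the timeout."""
    result = subprocess.run(build_wait_command(name, timeout_s))
    return result.returncode == 0


# Show the command that would be executed
print(" ".join(build_wait_command()))
```

Calling `wait_until_ready()` right after the `kubectl apply` cell gives you a simple signal for when to move on to the next Notebook.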

# Conclusion and Next Steps

Congratulations on completing this crucial step in this tutorial series! You've successfully built a Large Language Model (LLM) Inference Service, and you've learned about the role of a transformer in enriching user queries with relevant context from our documents. Together with the Vector Store Inference Service, these components form the backbone of our question-answering application.

However, the journey doesn't stop here. The next and final step is to test our LLM Inference Service, ensuring that it's working as expected and delivering accurate responses. This will help us gain confidence in our setup and prepare us for real-world applications.

In the next Notebook, we will guide you through the process of invoking the LLM Inference Service. We will show you how to construct suitable requests, communicate with the service, and interpret the responses.