The purpose of this notebook is to deploy the model to be used by the QA Bot accelerator.  This notebook is available at https://github.com/databricks-industry-solutions/diy-llm-qa-bot.

##Introduction

In this notebook, we will take the custom model registered with MLflow in the prior notebook and deploy it to Databricks model serving ([AWS](https://docs.databricks.com/machine-learning/model-serving/index.html)|[Azure](https://learn.microsoft.com/en-us/azure/databricks/machine-learning/model-serving/)).  Databricks model serving provides containerized deployment options for registered models, through which authenticated applications can interact with the model via a REST API.  This gives MLOps teams an easy way to deploy, manage, and integrate their models with various applications.

In [0]:
%run "./utils/config_utils"

In [0]:
import mlflow
import requests
import json
import time
from mlflow.utils.databricks_utils import get_databricks_host_creds

In [0]:
latest_version = mlflow.MlflowClient().get_latest_versions(config['registered_model_name'], stages=['Production'])[0].version
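
The lookup above indexes `[0]` directly, so it would raise an `IndexError` if no version of the model is in the Production stage. A minimal sketch of a more defensive selection follows. `pick_latest_version` is a hypothetical helper, not part of MLflow, and it only assumes that each entry exposes a `version` attribute, as the objects returned by `get_latest_versions` do.

```python
from types import SimpleNamespace

def pick_latest_version(versions):
    """Return the highest version value found, or raise a clear error."""
    if not versions:
        raise ValueError("No model version found in the requested stage")
    # int() lets the comparison work whether versions arrive as strings or ints
    return max(versions, key=lambda v: int(v.version)).version

# stand-in objects for illustration only; real code would pass the list
# returned by mlflow.MlflowClient().get_latest_versions(...)
example = [SimpleNamespace(version="2"), SimpleNamespace(version="10")]
print(pick_latest_version(example))
```

Comparing numerically matters here because a plain string comparison would rank "2" above "10".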

##Step 1: Deploy Model Serving Endpoint

Models may typically be deployed to model serving endpoints using either the Databricks workspace user-interface or a REST API.  Because our model depends on the deployment of a sensitive environment variable, we will need to leverage a relatively new model serving feature that is currently only available via the REST API.

Note the `env_vars` section of the served model config below: you can now store a key in a secret scope and pass it to the model serving endpoint as an environment variable.

In [0]:

served_models = [
    {
      "name": "current",
      "model_name": config['registered_model_name'],
      "model_version": latest_version,
      "workload_size": "Small",
      "scale_to_zero_enabled": "true",
      "env_vars": [{
        "env_var_name": "OPENAI_API_KEY",
        "secret_scope": config['openai_key_secret_scope'],
        "secret_key": config['openai_key_secret_key'],
      }]
    }
]
traffic_config = {"routes": [{"served_model_name": "current", "traffic_percentage": "100"}]}

In [0]:
traffic_config

Out[12]: {'routes': [{'served_model_name': 'current', 'traffic_percentage': '100'}]}
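
Inside the serving container, the secret referenced by `env_vars` should surface as an ordinary environment variable. The sketch below shows one way model code might read it and fail early if it is absent. `get_openai_key` is a hypothetical helper introduced for illustration, and the error handling is an assumption rather than anything the accelerator prescribes.

```python
import os

def get_openai_key(env=os.environ):
    """Read OPENAI_API_KEY from the given mapping, failing with a clear message."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; check the env_vars entry of the served model config"
        )
    return key

# illustration with an explicit mapping instead of the real environment
print(len(get_openai_key({"OPENAI_API_KEY": "dummy-key"})))
```

Failing at load time with an explicit message tends to be easier to debug than an authentication error raised on the first request.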

In [0]:
def endpoint_exists():
  """Check if an endpoint with the serving_endpoint_name exists"""
  url = f"https://{serving_host}/api/2.0/serving-endpoints/{config['serving_endpoint_name']}"
  headers = { 'Authorization': f'Bearer {creds.token}' }
  response = requests.get(url, headers=headers)
  return response.status_code == 200

def wait_for_endpoint():
  """Wait until deployment is ready, then return endpoint config"""
  headers = { 'Authorization': f'Bearer {creds.token}' }
  endpoint_url = f"https://{serving_host}/api/2.0/serving-endpoints/{config['serving_endpoint_name']}"
  while True:
    response = requests.get(endpoint_url, headers=headers)
    response.raise_for_status() # fail fast on HTTP errors, including on the first poll
    state = response.json()["state"]
    # done once the endpoint is ready and no config update is in progress
    if state["ready"] != "NOT_READY" and state["config_update"] != "IN_PROGRESS":
      return response.json()
    print("Waiting 30s for deployment or update to finish")
    time.sleep(30)

def create_endpoint():
  """Create serving endpoint and wait for it to be ready"""
  print(f"Creating new serving endpoint: {config['serving_endpoint_name']}")
  endpoint_url = f'https://{serving_host}/api/2.0/serving-endpoints'
  headers = { 'Authorization': f'Bearer {creds.token}' }
  request_data = {"name": config['serving_endpoint_name'], "config": {"served_models": served_models}}
  json_bytes = json.dumps(request_data).encode('utf-8')
  response = requests.post(endpoint_url, data=json_bytes, headers=headers)
  response.raise_for_status()
  wait_for_endpoint()
  displayHTML(f"""Created the <a href="/#mlflow/endpoints/{config['serving_endpoint_name']}" target="_blank">{config['serving_endpoint_name']}</a> serving endpoint""")
  
def update_endpoint():
  """Update serving endpoint and wait for it to be ready"""
  print(f"Updating existing serving endpoint: {config['serving_endpoint_name']}")
  endpoint_url = f"https://{serving_host}/api/2.0/serving-endpoints/{config['serving_endpoint_name']}/config"
  headers = { 'Authorization': f'Bearer {creds.token}' }
  request_data = { "served_models": served_models, "traffic_config": traffic_config }
  json_bytes = json.dumps(request_data).encode('utf-8')
  response = requests.put(endpoint_url, data=json_bytes, headers=headers)
  response.raise_for_status()
  wait_for_endpoint()
  displayHTML(f"""Updated the <a href="/#mlflow/endpoints/{config['serving_endpoint_name']}" target="_blank">{config['serving_endpoint_name']}</a> serving endpoint""")

In [0]:
# gather other inputs the API needs
serving_host = spark.conf.get("spark.databricks.workspaceUrl")
creds = get_databricks_host_creds()

# kick off endpoint creation/update
if not endpoint_exists():
  create_endpoint()
else:
  update_endpoint()

Creating new serving endpoint: icbf_llm-qabot-endpoint
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
Waiting 30s for deployment or update to finish
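
As the output shows, the deployment took several minutes, and the polling loop in `wait_for_endpoint` would keep waiting indefinitely if the endpoint never became ready. A hedged sketch of the same polling pattern with an upper bound follows. `poll_until`, `fetch_state`, and the default timeout are hypothetical names and values chosen for illustration; the state keys mirror those checked in `wait_for_endpoint`, and the state sequence is simulated rather than taken from a real endpoint.

```python
import time

def poll_until(fetch_state, timeout_s=1800, interval_s=30,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch_state() until the endpoint is ready or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        not_ready = state.get("ready") == "NOT_READY"
        updating = state.get("config_update") == "IN_PROGRESS"
        if not (not_ready or updating):
            return state
        if clock() >= deadline:
            raise TimeoutError(f"Endpoint not ready after {timeout_s}s: {state}")
        sleep(interval_s)

# simulated sequence of states, standing in for real GET responses
states = iter([
    {"ready": "NOT_READY", "config_update": "IN_PROGRESS"},
    {"ready": "READY", "config_update": "NOT_UPDATING"},
])
result = poll_until(lambda: next(states), sleep=lambda seconds: None)
print(result["ready"])
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting.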


##Step 2: Test Endpoint API

Next, we can use the code below to set up a function to query this endpoint.  This code is a slightly modified version of the code available through the *Query Endpoint* UI on the serving endpoint page:

In [0]:
import os
import requests
import numpy as np
import pandas as pd
import json

endpoint_url = f"""https://{serving_host}/serving-endpoints/{config['serving_endpoint_name']}/invocations"""


def create_tf_serving_json(data):
    return {
        "inputs": {name: data[name].tolist() for name in data.keys()}
        if isinstance(data, dict)
        else data.tolist()
    }


def score_model(dataset):
    url = endpoint_url
    headers = {
        "Authorization": f"Bearer {creds.token}",
        "Content-Type": "application/json",
    }
    ds_dict = (
        {"dataframe_split": dataset.to_dict(orient="split")}
        if isinstance(dataset, pd.DataFrame)
        else create_tf_serving_json(dataset)
    )
    data_json = json.dumps(ds_dict, allow_nan=True)
    response = requests.request(method="POST", headers=headers, url=url, data=data_json)
    if response.status_code != 200:
        raise Exception(
            f"Request failed with status {response.status_code}, {response.text}"
        )

    return response.json()
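
For a DataFrame input, `score_model` sends a `dataframe_split` payload, which is the dictionary that `DataFrame.to_dict(orient="split")` produces: an `index` list, a `columns` list, and a row-wise `data` list. The sketch below builds the same shape by hand for a single `question` column so the request body is easy to inspect. `build_split_payload` is a hypothetical helper, not part of the accelerator.

```python
import json

def build_split_payload(questions):
    """Build a dataframe_split request body for a one-column 'question' frame."""
    return {
        "dataframe_split": {
            "index": list(range(len(questions))),
            "columns": ["question"],
            "data": [[q] for q in questions],  # one inner list per row
        }
    }

payload = build_split_payload(["Que es una demanda de alimentos?"])
print(json.dumps(payload, indent=2))
```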



And now we can test the endpoint:

In [0]:
# assemble question input
queries = pd.DataFrame({'question':[
  "Que es una demanda de alimentos?"
]})

score_model(queries)

Out[16]: {'predictions': [{'answer': 'Una demanda de alimentos es un proceso legal en el que se busca obtener el pago de la cuota de alimentos de una persona obligada a brindarlos. Este proceso se lleva a cabo en el ámbito civil y se adelanta ante el Juez de Familia competente. En la demanda, se puede solicitar el embargo de bienes y derechos del progenitor obligado a brindar alimentos. El Juez tiene la facultad de ordenar al pagador o empleador de esa persona que descuente y consigne a órdenes del juzgado una parte del salario mensual del demandado, así como de sus prestaciones sociales.',
   'source': 'https://www.icbf.gov.co/cual-es-la-diferencia-entre-el-proceso-ejecutivo-de-alimentos-y-el-de-inasistencia-alimentaria',
   'output_metadata': {'token_usage': {'prompt_tokens': 387,
     'completion_tokens': 141,
     'total_tokens': 528},
    'model_name': 'gpt-3.5-turbo'}}]}
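
The response nests each answer under `predictions`, alongside its source URL and token usage. A small sketch of pulling those fields out follows. `summarize_predictions` is a hypothetical helper, and the sample dictionary is an abbreviated copy of the output above, not a new result.

```python
def summarize_predictions(response):
    """Return (answer, source, total_tokens) tuples from a scoring response."""
    rows = []
    for pred in response.get("predictions", []):
        usage = pred.get("output_metadata", {}).get("token_usage", {})
        rows.append((pred.get("answer"), pred.get("source"), usage.get("total_tokens")))
    return rows

# abbreviated copy of the response shown above
sample = {"predictions": [{
    "answer": "Una demanda de alimentos es un proceso legal ...",
    "source": "https://www.icbf.gov.co/cual-es-la-diferencia-entre-el-proceso-ejecutivo-de-alimentos-y-el-de-inasistencia-alimentaria",
    "output_metadata": {"token_usage": {"prompt_tokens": 387,
                                        "completion_tokens": 141,
                                        "total_tokens": 528}},
}]}
for answer, source, tokens in summarize_predictions(sample):
    print(tokens, source)
```

Using `.get` with defaults keeps the helper from failing if a prediction omits the metadata block.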

Some observed limitations:
* If we allow the endpoint to scale to zero, we will save cost when the bot is not queried. However, the first request after a long pause can take a few minutes, because the endpoint must first scale up from zero nodes.
* The timeout limit for a serverless model serving request is 60 seconds. If more than 3 questions are submitted in the same request, the model may time out.
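
One way to work within the request timeout noted above is to split a longer list of questions into small batches and send each batch as its own request. The sketch below assumes a batch size of 3, taken from the observation above, and a `score_fn` callable that accepts a list of questions and returns a dictionary with a `predictions` list (for example, a thin wrapper that builds a DataFrame and calls `score_model`). `chunked` and `score_in_batches` are hypothetical helpers.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def score_in_batches(questions, score_fn, batch_size=3):
    """Score questions in small batches and concatenate the predictions."""
    predictions = []
    for batch in chunked(questions, batch_size):
        predictions.extend(score_fn(batch)["predictions"])
    return predictions

# stand-in scorer for illustration; it simply echoes each question back
def fake_score(batch):
    return {"predictions": [{"answer": q} for q in batch]}

results = score_in_batches([f"q{i}" for i in range(7)], fake_score)
print(len(results))
```

Smaller batches trade a few extra round trips for a lower chance that any single request exceeds the timeout.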

© 2023 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License. All included or referenced third party libraries are subject to the licenses set forth below.

| library | description | license | source |
|---------|-------------|---------|--------|
| langchain | Building applications with LLMs through composability | MIT | https://pypi.org/project/langchain/ |
| tiktoken | Fast BPE tokeniser for use with OpenAI's models | MIT | https://pypi.org/project/tiktoken/ |
| faiss-cpu | Library for efficient similarity search and clustering of dense vectors | MIT | https://pypi.org/project/faiss-cpu/ |
| openai | Python client library for the OpenAI API | MIT | https://pypi.org/project/openai/ |