This solution accelerator notebook is available at [Databricks Industry Solutions](https://github.com/databricks-industry-solutions/semantic-caching).

# Create and deploy a standard RAG chain

In this notebook, we will build a standard RAG chatbot without semantic caching to serve as a benchmark. We will utilize the [Databricks Mosaic AI Agent Framework](https://www.databricks.com/product/machine-learning/retrieval-augmented-generation), which enables rapid prototyping of the initial application. In the following cells, we will define a chain, log and register it using MLflow and Unity Catalog, and finally deploy it behind a [Databricks Mosaic AI Model Serving](https://docs.databricks.com/en/machine-learning/model-serving/index.html) endpoint.

## Cluster configuration
We recommend using a cluster with the following specifications to run this solution accelerator:
- Unity Catalog enabled cluster 
- Databricks Runtime 15.4 LTS ML or above
- Single-node cluster: e.g. `m6id.2xlarge` on AWS or `Standard_D8ds_v4` on Azure Databricks.

In [0]:
%pip install -r requirements.txt --quiet
dbutils.library.restartPython()

In [0]:
from config import Config
config = Config()

In [0]:
%run ./99_init $reset_all_data=false

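Here, we read the workspace URL and API token into `HOST` and `TOKEN` and export them as the environment variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`. The chain uses these variables to authenticate against our Vector Search index, both in this notebook and later on the Model Serving endpoint.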

In [0]:
import os

HOST = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

os.environ['DATABRICKS_HOST'] = HOST
os.environ['DATABRICKS_TOKEN'] = TOKEN

## Create and register a chain to MLflow 

The next cell defines a standard RAG chain using LangChain. When executed, it writes its content to the `chain/chain.py` file, which is then used to log the chain in MLflow.

In [0]:
%%writefile chain/chain.py
from databricks.vector_search.client import VectorSearchClient
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatDatabricks
from langchain_community.vectorstores import DatabricksVectorSearch
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from operator import itemgetter
from config import Config
import mlflow
import os

# Enable MLflow Tracing
mlflow.langchain.autolog()

# load parameters
config = Config()

# Connect to the Vector Search Index
vs_index = VectorSearchClient(
    workspace_url=os.environ['DATABRICKS_HOST'],
    personal_access_token=os.environ['DATABRICKS_TOKEN'],
    disable_notice=True,
    ).get_index(
    endpoint_name=config.VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=config.VS_INDEX_FULLNAME,
)

# Turn the Vector Search index into a LangChain retriever
vector_search_as_retriever = DatabricksVectorSearch(
    vs_index,
    text_column="content",
    columns=["id", "content", "url"],
).as_retriever(search_kwargs={"k": 3}) # Number of search results that the retriever returns
# Enable the RAG Studio Review App and MLflow to properly track and display retrieved chunks for evaluation
mlflow.models.set_retriever_schema(primary_key="id", text_column="content", doc_uri="url")

# Method to format the docs returned by the retriever into the prompt (keep only the text from chunks)
def format_context(docs):
    chunk_contents = [f"Passage: {d.page_content}\n" for d in docs]
    return "".join(chunk_contents)

# Prompt template to be used to prompt the LLM
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", f"{config.LLM_PROMPT_TEMPLATE}"),
        ("user", "{question}"),
    ]
)

# Our foundation model answering the final prompt
model = ChatDatabricks(
    endpoint=config.LLM_MODEL_SERVING_ENDPOINT_NAME,
    extra_params={"temperature": 0.01, "max_tokens": 500}
)

# Return the string contents of the most recent messages: [{...}] from the user to be used as input question
def extract_user_query_string(chat_messages_array):
    return chat_messages_array[-1]["content"]

# RAG Chain
chain = (
    {
        "question": itemgetter("messages") | RunnableLambda(extract_user_query_string),
        "context": itemgetter("messages")
        | RunnableLambda(extract_user_query_string)
        | vector_search_as_retriever
        | RunnableLambda(format_context),
    }
    | prompt
    | model
    | StrOutputParser()
)

# Tell MLflow logging where to find your chain.
mlflow.models.set_model(model=chain)
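
To make the data flow through the chain easier to follow, here is a minimal sketch of what the two helper functions do to a request. It does not call Vector Search or the LLM: `Doc` is a hypothetical stand-in for the document objects the retriever returns (only `page_content` is used), and the sample messages and chunk texts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document; only page_content is used."""
    page_content: str


def extract_user_query_string(chat_messages_array):
    # Same logic as in chain.py: take the content of the last message
    return chat_messages_array[-1]["content"]


def format_context(docs):
    # Same logic as in chain.py: one "Passage: ..." line per retrieved chunk
    return "".join(f"Passage: {d.page_content}\n" for d in docs)


# Illustrative request in the {"messages": [...]} shape the chain expects
request = {
    "messages": [
        {"role": "user", "content": "First question"},
        {"role": "assistant", "content": "Earlier answer"},
        {"role": "user", "content": "What is Model Serving?"},
    ]
}

question = extract_user_query_string(request["messages"])
context = format_context([Doc("First chunk."), Doc("Second chunk.")])

print(question)  # What is Model Serving?
print(context)   # Passage: First chunk.\nPassage: Second chunk.\n
```

Note that only the last message reaches the retriever and the prompt, so earlier turns of the conversation are ignored by this chain.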

In this cell, we log the chain to MLflow. Note that we pass `config.py` as a dependency, which allows the chain to load the necessary parameters when deployed to another compute environment or to a Model Serving endpoint. MLflow returns a trace of the inference that shows a detailed breakdown of the latency and the input and output of each step in the chain.

In [0]:
# Log the model to MLflow
config_file_path = "config.py"

# Start an MLflow run and log the chain from its code file
with mlflow.start_run(run_name="rag_chatbot"):
    logged_chain_info = mlflow.langchain.log_model(
        lc_model=os.path.join(os.getcwd(), 'chain/chain.py'),  # Chain code file e.g., /path/to/the/chain.py 
        artifact_path="chain",  # Required by MLflow
        input_example=config.INPUT_EXAMPLE,  # MLflow will execute the chain before logging & capture its output schema.
        code_paths=[config_file_path],  # Include the config file in the model
    )

# Test the chain locally
chain = mlflow.langchain.load_model(logged_chain_info.model_uri)
chain.invoke(config.INPUT_EXAMPLE)

If we are happy with the logged chain, we will go ahead and register the chain in Unity Catalog.

In [0]:
# Register to UC
uc_registered_model_info = mlflow.register_model(
  model_uri=logged_chain_info.model_uri, 
  name=config.MODEL_FULLNAME
  )

## Deploy the chain to a Model Serving endpoint

We deploy the chain using custom functions defined in the `utils.py` script.

In [0]:
import utils
utils.deploy_model_serving_endpoint(
  spark, 
  config.MODEL_FULLNAME,
  config.CATALOG,
  config.LOGGING_SCHEMA,
  config.ENDPOINT_NAME,
  HOST,
  TOKEN,
  )

Wait until the endpoint is ready. This may take some time (~15 minutes), so grab a coffee!

In [0]:
utils.wait_for_model_serving_endpoint_to_be_ready(config.ENDPOINT_NAME)
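
The implementation of `utils.wait_for_model_serving_endpoint_to_be_ready` is not shown in this notebook. As a rough sketch of the general idea, assuming a polling approach, the function below repeatedly asks for the endpoint state until it is ready or a timeout elapses. The names `wait_until_ready` and `get_state`, the `"READY"` string, and the default timings are all hypothetical; `get_state` stands for any callable that returns the current state, for example one backed by the Databricks SDK or REST API.

```python
import time


def wait_until_ready(get_state, timeout_seconds=1800, poll_seconds=30, sleep=time.sleep):
    """Poll get_state() until it returns "READY" or the timeout elapses.

    get_state is a hypothetical callable returning the endpoint state as a string.
    Returns the number of seconds spent waiting.
    """
    waited = 0
    while True:
        state = get_state()
        if state == "READY":
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint still in state {state!r} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a fake state source that becomes ready on the third poll
fake_states = iter(["NOT_READY", "NOT_READY", "READY"])
seconds_waited = wait_until_ready(lambda: next(fake_states), sleep=lambda s: None)
print(seconds_waited)  # 60
```

Injecting `sleep` keeps the sketch easy to test without actually waiting.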

Once the endpoint is up and running, let's send a request and see how it responds. If the following cell fails with a 404 Not Found error, wait a minute and re-run the cell.

In [0]:
import utils
data = {
    "inputs": {
        "messages": [
            {
                "content": "What is Model Serving?",
                "role": "user"
            }
        ]
    }
}
# Now, call the function with the correctly formatted data
utils.send_request_to_endpoint(
    config.ENDPOINT_NAME, 
    data,
    )
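
For readers who want to see what such a request might look like without the helper, here is a minimal sketch using only the Python standard library. It is not the implementation of `utils.send_request_to_endpoint`, which is not shown here. The function names `build_invocation_request` and `send_with_retry`, the retry count, and the delay are assumptions; the sketch posts to the endpoint's `/serving-endpoints/<name>/invocations` route with a bearer token and retries only on a 404 response, mirroring the advice above.

```python
import json
import time
import urllib.error
import urllib.request


def build_invocation_request(host, token, endpoint_name, payload):
    """Build (but do not send) a POST request for a Model Serving endpoint."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_with_retry(send, request, retries=3, delay_seconds=60):
    """Call send(request), retrying only when the endpoint answers 404."""
    for attempt in range(retries + 1):
        try:
            return send(request)
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 404, or a 404 on the last attempt
            if err.code != 404 or attempt == retries:
                raise
            time.sleep(delay_seconds)


# Example with placeholder values; nothing is sent over the network here
example_request = build_invocation_request(
    "https://example.cloud.databricks.com/",
    "placeholder-token",
    "my-endpoint",
    {"inputs": {"messages": [{"role": "user", "content": "What is Model Serving?"}]}},
)
print(example_request.full_url)
```

In the notebook, one could pass `urllib.request.urlopen` as `send` together with a request built from `HOST`, `TOKEN`, `config.ENDPOINT_NAME`, and `data`.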

In this notebook, we built a standard RAG chatbot without semantic caching to serve as a benchmark. We will compare this chain against the chain with semantic caching, which we will build in the next notebook, `03_rag_chatbot_with_cache`.

© 2024 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License.