
# 1/ Data preparation for LLM Chatbot RAG

## Building and indexing our knowledge base into Databricks Vector Search

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/llm-rag-managed-flow-1.png?raw=true" style="float: right; width: 800px; margin-left: 10px">

In this notebook, we'll ingest our documentation pages and index them with a Vector Search index to help our chatbot provide better answers.

Preparing high-quality data is key to your chatbot's performance. We recommend taking time to implement these next steps with your own dataset.

Thankfully, Lakehouse AI provides state-of-the-art solutions to accelerate your AI and LLM projects, and also simplifies data ingestion and preparation at scale.

For this example, we will use Databricks documentation from [docs.databricks.com](https://docs.databricks.com):
- Download the web pages
- Split the pages into small chunks of text
- Compute the embeddings using a Databricks Foundation model as part of our Delta Table
- Create a Vector Search index based on our Delta Table  


In [0]:
%pip install --quiet -U mlflow[databricks] lxml==4.9.3 transformers==4.49.0 langchain==0.3.25 databricks-vectorsearch==0.55 bs4==0.0.2 markdownify==0.14.1
dbutils.library.restartPython()

In [0]:
%run ../_resources/00-init $reset_all_data=false

## Extracting Databricks documentation sitemap and pages

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/llm-rag-data-prep-1.png?raw=true" style="float: right; width: 600px; margin-left: 10px">

First, let's create our raw dataset as a Delta Lake table.

For this demo, we will directly download a few documentation pages from `docs.databricks.com` and save the HTML content.

Here are the main steps:

- Run a quick script to extract the page URLs from the `sitemap.xml` file
- Download the web pages
- Use BeautifulSoup to extract the ArticleBody
- Save the HTML results in a Delta Lake table
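
The first step above (extracting page URLs from `sitemap.xml`) can be sketched with the standard library alone. This is a minimal illustration, not the helper used by `_resources/00-init`: the function name `extract_urls` and the inlined two-entry sitemap are assumptions made so the example runs offline.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# XML namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str, max_urls: Optional[int] = None) -> List[str]:
    """Return the <loc> values of a sitemap document, in document order."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return urls if max_urls is None else urls[:max_urls]


# Hypothetical sitemap content, inlined for illustration only.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.databricks.com/en/index.html</loc></url>
  <url><loc>https://docs.databricks.com/en/getting-started/index.html</loc></url>
</urlset>"""

print(extract_urls(sample_sitemap, max_urls=1))
```

Limiting the number of URLs (`max_urls`) keeps a demo download small; a real job would typically process the full list.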

In [0]:
spark.sql("DROP TABLE IF EXISTS raw_documentation")

if not spark.catalog.tableExists("raw_documentation") or spark.table("raw_documentation").isEmpty():
    # Download Databricks documentation to a DataFrame (see _resources/00-init for more details)
    download_and_write_databricks_documentation_to_table(table_name="raw_documentation")
    # download_databricks_documentation_articles()

display(spark.table("raw_documentation").limit(2))


### Splitting documentation pages into small chunks

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/llm-rag-data-prep-2.png?raw=true" style="float: right; width: 600px; margin-left: 10px">

LLMs typically have a maximum input context length, and you won't be able to compute embeddings for very long texts.
In addition, the longer your context is, the longer the model will take to provide a response.

Document preparation is key for your model to perform well, and multiple strategies exist depending on your dataset:

- Split documents into small chunks (paragraph, h2...)
- Truncate documents to a fixed length
- Split into big chunks and ask a model to summarize each chunk as a one-off job, for faster live inference
- Create multiple agents to evaluate each bigger document in parallel, and ask a final agent to craft your answer...

The right chunk size depends on your content and on how you'll use it to craft your prompt: adding multiple small chunks to your prompt might give different results than sending a single big one.
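
To make the difference between truncating and chunking concrete, here is a small sketch that works on a list of tokens. The helper names `truncate` and `chunk_with_overlap` are hypothetical, and whitespace-separated words stand in for real model tokens.

```python
def truncate(tokens, max_tokens):
    """Keep only the first max_tokens tokens; everything after is lost."""
    return tokens[:max_tokens]


def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size sharing `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy input: words stand in for tokens.
tokens = "a b c d e f g h i j".split()
print(truncate(tokens, 4))
print(chunk_with_overlap(tokens, chunk_size=4, overlap=1))
```

Truncation drops the tail of the document, while overlapping chunks keep all of the content and repeat a little context at each boundary.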


### Splitting our big documentation pages into smaller chunks (h2 sections)

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/chunk-window-size.png?raw=true" style="float: right" width="700px">
<br/>
In this demo, we have big documentation articles, which are too long for the prompt to our model.

Passing several full articles as RAG context would exceed our maximum input size. Some recent studies also suggest that a bigger window size isn't always better, as LLMs seem to focus on the beginning and end of the prompt.

In our case, we'll split these articles between HTML `h2` tags, remove HTML and ensure that each chunk is less than 500 tokens using LangChain. 

#### LLM Window size and Tokenizer

The same sentence can be split into different tokens by different models. LLMs ship with a `Tokenizer` that you can use to count the tokens in a given sentence; the token count is usually higher than the word count (see the [Hugging Face documentation](https://huggingface.co/docs/transformers/main/tokenizer_summary) or [OpenAI](https://github.com/openai/tiktoken)).

Make sure the tokenizer you use here matches your model. Databricks DBRX Instruct uses the same tokenizer as GPT-4. The cell below counts tokens with the `transformers` library; note that it loads the `openai-gpt` tokenizer, so the counts approximate DBRX Instruct token counts rather than match them exactly. This will also keep our document token size below our embedding max size (1024).

<br/>
<br style="clear: both">
<div style="background-color: #def2ff; padding: 15px;  border-radius: 30px; ">
  <strong>Information</strong><br/>
  Remember that the following steps are specific to your dataset. This is a critical part of building a successful RAG assistant.
  <br/> Always take time to manually review the chunks created and ensure that they make sense and contain relevant information.
</div>
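
A quick way to support that manual review is to print a few size statistics for the chunks you produced. The sketch below is only an illustration: `summarize_chunks` is a hypothetical helper, and it counts whitespace-separated words by default as a cheap stand-in for a real tokenizer (pass your own `count_tokens` function to use one).

```python
def summarize_chunks(chunks, max_tokens=500, count_tokens=None):
    """Return basic size statistics for a list of text chunks."""
    # Assumption: word count approximates token count when no counter is given.
    count = count_tokens or (lambda text: len(text.split()))
    sizes = [count(c) for c in chunks]
    if not sizes:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "oversized": []}
    return {
        "count": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": sum(sizes) / len(sizes),
        # Positions of chunks that exceed the size limit, for manual inspection.
        "oversized": [i for i, s in enumerate(sizes) if s > max_tokens],
    }


# Made-up chunks, for illustration only.
sample_chunks = ["Delta Lake stores data in tables", "Vector Search", "a " * 10]
stats = summarize_chunks(sample_chunks, max_tokens=8)
print(stats)
```

Statistics only point you at suspicious chunks (very small, very large); reading a sample of them is still the real check.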

In [0]:
import re
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from transformers import AutoTokenizer, OpenAIGPTTokenizer

max_chunk_size = 500

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=max_chunk_size, chunk_overlap=50)
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[("##", "header2")])

# Split on H2, but merge small h2 chunks together to avoid having too small chunks. 
def split_html_on_h2(html, min_chunk_size=20, max_chunk_size=500):
    if not html:
        return []
    #removes b64 images captured in the md    
    html = re.sub(r'data:image\/[a-zA-Z]+;base64,[A-Za-z0-9+/=\n]+', '', html, flags=re.MULTILINE)
    chunks = []
    previous_chunk = ""
    for c in md_splitter.split_text(html):
        content = c.metadata.get('header2', "") + "\n" + c.page_content
        if len(tokenizer.encode(previous_chunk + content)) <= max_chunk_size / 2:
            previous_chunk += content + "\n"
        else:
            chunks.extend(text_splitter.split_text(previous_chunk.strip()))
            previous_chunk = content + "\n"
    if previous_chunk:
        chunks.extend(text_splitter.split_text(previous_chunk.strip()))
    return [c for c in chunks if len(tokenizer.encode(c)) > min_chunk_size]

# Let's try our chunking function
html = spark.table("raw_documentation").limit(1).collect()[0]['text']
split_html_on_h2(html)
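
The merging logic in `split_html_on_h2` can be easier to follow in isolation. The sketch below reproduces only the greedy merge of small sections, under two simplifying assumptions: a whitespace word count replaces `tokenizer.encode`, and oversized sections are not split further (the real function delegates that to `text_splitter`). The name `merge_small_sections` is hypothetical.

```python
def count_tokens(text):
    # Stand-in for len(tokenizer.encode(text)): counts whitespace-separated words.
    return len(text.split())


def merge_small_sections(sections, max_chunk_size=500, min_chunk_size=20):
    """Greedily merge consecutive sections while they stay under half the limit."""
    chunks, current = [], ""
    for section in sections:
        if count_tokens(current + section) <= max_chunk_size / 2:
            # Still small: keep accumulating into the current chunk.
            current += section + "\n"
        else:
            # Adding this section would make the chunk too big: flush and restart.
            if current.strip():
                chunks.append(current.strip())
            current = section + "\n"
    if current.strip():
        chunks.append(current.strip())
    # Drop chunks too small to carry useful information.
    return [c for c in chunks if count_tokens(c) > min_chunk_size]


# Toy sections with tiny limits so the behavior is visible.
sections = ["one two three", "four five", "six seven eight nine ten eleven"]
print(merge_small_sections(sections, max_chunk_size=12, min_chunk_size=2))
```

With these toy limits, the first two sections are merged into one chunk and the third becomes its own chunk.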

### Creating the chunks and saving them to our Delta Table

The last step is to apply our UDF to all our documentation text and save the resulting chunks to our `databricks_documentation` table.

*Note that this part would typically be set up as a production-grade job, running as soon as a new documentation page is updated. <br/> This could be set up as a Delta Live Table pipeline to incrementally consume updates.*

In [0]:
%sql
--Note that we need to enable Change Data Feed on the table to create the index
CREATE TABLE IF NOT EXISTS databricks_documentation (
  id BIGINT GENERATED BY DEFAULT AS IDENTITY,
  url STRING,
  content STRING
) TBLPROPERTIES (delta.enableChangeDataFeed = true); 

In [0]:
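# Let's create a user-defined function (UDF) to chunk all our documents with Spark
# These imports may already be provided by ../_resources/00-init; repeating them is harmless
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf

@pandas_udf("array<string>")
def parse_and_split(docs: pd.Series) -> pd.Series:
    # Apply the chunking function to every document in the batch
    return docs.apply(split_html_on_h2)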
    
(spark.table("raw_documentation")
      .filter('text is not null')
      .repartition(30)
      .withColumn('content', F.explode(parse_and_split('text')))
      .drop("text")
      .write.mode('overwrite').saveAsTable("databricks_documentation"))

display(spark.table("databricks_documentation"))

## What's required for our Vector Search Index

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/databricks-vector-search-managed-type.png?raw=true" style="float: right" width="800px">

Databricks provides multiple types of vector search indexes:

- **Managed embeddings**: you provide a text column and endpoint name and Databricks synchronizes the index with your Delta table  **(what we'll use in this demo)**
- **Self-managed embeddings**: you compute the embeddings and save them as a field of your Delta Table; Databricks will then synchronize the index
- **Direct index**: when you want to use and update the index without having a Delta Table

In this demo, we will show you how to set up a **Managed Embeddings** index *(self-managed embeddings are covered in the advanced demo).*


## Introducing Databricks GTE Embeddings Foundation Model endpoints

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/llm-rag-data-prep-4.png?raw=true" style="float: right; width: 600px; margin-left: 10px">

Foundation Models are provided by Databricks, and can be used out-of-the-box.

Databricks supports several endpoint types to compute embeddings or evaluate a model:
- A **foundation model endpoint**, provided by Databricks (ex: DBRX, MPT, GTE). **This is what we'll be using in this demo.**
- An **external endpoint**, acting as a gateway to an external model (ex: Azure OpenAI)
- A **custom**, fine-tuned model hosted on Databricks Model Serving

Open the [Model Serving Endpoint page](/ml/endpoints) to explore and try the foundation models.

For this demo, we will use the foundation models `GTE` (embeddings) and `DBRX` (chat). <br/><br/>

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/databricks-foundation-models.png?raw=true" width="600px" >

In [0]:
import mlflow.deployments
deploy_client = mlflow.deployments.get_deploy_client("databricks")

#Embeddings endpoints convert text into a vector (array of float). Here is an example using GTE:
response = deploy_client.predict(endpoint="databricks-gte-large-en", inputs={"input": ["What is Apache Spark?"]})
embeddings = [e['embedding'] for e in response.data]
print(embeddings)
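
Once text is turned into vectors, "similar content" becomes a numeric comparison. Cosine similarity is one common way to compare two embedding vectors; the sketch below illustrates the idea with made-up 3-dimensional vectors and is not a description of how Vector Search ranks results internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)


# Toy vectors for illustration; real GTE embeddings have many more dimensions.
query = [1.0, 0.0, 1.0]
documents = {
    "spark": [0.9, 0.1, 0.8],
    "billing": [0.0, 1.0, 0.1],
}

# Pick the document whose vector points in the direction closest to the query.
best_match = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print(best_match)
```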

### Creating our Vector Search Index with Managed Embeddings and GTE

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/chatbot-rag/llm-rag-data-prep-3.png?raw=true" style="float: right; width: 600px; margin-left: 10px">

With managed embeddings, Databricks will automatically compute the embeddings for us. This is the easiest mode to get started with on Databricks.

A vector search index uses a **Vector Search endpoint** to serve the embeddings (you can think of it as your Vector Search API endpoint).

Multiple indexes can use the same endpoint.

Let's start by creating one.


In [0]:
from databricks.vector_search.client import VectorSearchClient
vsc = VectorSearchClient()

if not endpoint_exists(vsc, VECTOR_SEARCH_ENDPOINT_NAME):
    vsc.create_endpoint(name=VECTOR_SEARCH_ENDPOINT_NAME, endpoint_type="STANDARD")

wait_for_vs_endpoint_to_be_ready(vsc, VECTOR_SEARCH_ENDPOINT_NAME)
print(f"Endpoint named {VECTOR_SEARCH_ENDPOINT_NAME} is ready.")


<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/index_creation.gif?raw=true" width="600px" style="float: right">

You can view your endpoint on the [Vector Search Endpoints UI](#/setting/clusters/vector-search). Click on the endpoint name to see all indexes that are served by the endpoint.


### Creating the Vector Search Index

All we now have to do is ask Databricks to create the index.

Because it's a managed embedding index, we just need to specify the text column and our embedding foundation model (`GTE`).  Databricks will compute the embeddings for us automatically.

This can be done using the API, or in a few clicks within the Unity Catalog Explorer menu.


In [0]:
from databricks.sdk import WorkspaceClient
import databricks.sdk.service.catalog as c

#The table we'd like to index
source_table_fullname = f"{catalog}.{db}.databricks_documentation"
# Where we want to store our index
vs_index_fullname = f"{catalog}.{db}.databricks_documentation_vs_index"

if not index_exists(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname):
  print(f"Creating index {vs_index_fullname} on endpoint {VECTOR_SEARCH_ENDPOINT_NAME}...")
  try:
    vsc.create_delta_sync_index(
      endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
      index_name=vs_index_fullname,
      source_table_name=source_table_fullname,
      pipeline_type="TRIGGERED",
      primary_key="id",
      embedding_source_column='content', #The column containing our text
      embedding_model_endpoint_name='databricks-gte-large-en' #The embedding endpoint used to create the embeddings
    )
  except Exception as e:
    display_quota_error(e, VECTOR_SEARCH_ENDPOINT_NAME)
    raise e
  #Let's wait for the index to be ready and all our embeddings to be created and indexed
  wait_for_index_to_be_ready(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname)
else:
  #Trigger a sync to update our vs content with the new data saved in the table
  wait_for_index_to_be_ready(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname)
  vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname).sync()

print(f"index {vs_index_fullname} on table {source_table_fullname} is ready")

## Searching for similar content

That's all we have to do. Databricks will capture and synchronize new entries from your Delta table each time the index is synced (the index above uses `pipeline_type="TRIGGERED"`, so a sync has to be triggered).

Note that depending on your dataset size and model size, the index can take a few seconds to start and to index your embeddings.

Let's give it a try and search for similar content.

*Note: `similarity_search` also supports a `filters` parameter. This is useful to add a security layer to your RAG system: you can filter out sensitive content based on who is making the call (for example, filter on a specific department based on the user preference).*
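
As a rough sketch of that idea, the helper below builds a filter restricting results to the caller's departments. Several things here are assumptions: the `department` column does not exist in this demo's table (it only has `id`, `url` and `content`), `build_filters` is a hypothetical helper, and the dict-style filter where a list value means "match any of these values" should be checked against the Vector Search documentation for your endpoint type.

```python
def build_filters(user_departments):
    """Build a filters dict limiting results to the caller's departments."""
    if not user_departments:
        # Fail closed: never run an unfiltered search for a user with no access.
        raise ValueError("user has no department; refusing to build an open filter")
    # Assumption: a list value matches rows whose column equals any listed value.
    return {"department": sorted(set(user_departments))}


filters = build_filters(["sales", "hr", "sales"])
print(filters)

# Hypothetical usage, assuming the index has a `department` column:
# results = vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname).similarity_search(
#     query_text=question,
#     columns=["url", "content"],
#     filters=filters,
#     num_results=3)
```

Raising an error when the user has no department keeps the filter from silently turning into an unrestricted search.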

In [0]:
import mlflow.deployments
deploy_client = mlflow.deployments.get_deploy_client("databricks")

question = "How can I track billing usage on my workspaces?"

results = vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname).similarity_search(
  query_text=question,
  columns=["url", "content"],
  num_results=1)
docs = results.get('result', {}).get('data_array', [])
docs
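
The rows in `data_array` are plain lists, which are awkward to use downstream. The sketch below turns them into dictionaries keyed by column name. It assumes the response also carries the column names under `manifest.columns` (each with a `name` key); the sample response is made up for illustration, so verify the shape against an actual `results` object.

```python
def rows_to_dicts(results):
    """Convert a similarity search response into a list of {column: value} dicts."""
    # Assumption: column names are listed under manifest.columns[*].name.
    columns = [c["name"] for c in results.get("manifest", {}).get("columns", [])]
    rows = results.get("result", {}).get("data_array", [])
    return [dict(zip(columns, row)) for row in rows]


# Made-up response mirroring the assumed shape; not real search output.
sample_results = {
    "manifest": {"columns": [{"name": "url"}, {"name": "content"}, {"name": "score"}]},
    "result": {
        "data_array": [
            ["https://docs.databricks.com/en/example.html", "Example chunk text", 0.61]
        ]
    },
}

for doc in rows_to_dicts(sample_results):
    print(doc["url"], doc["score"])
```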

## Next step: Deploy our chatbot model with RAG using DBRX

We've seen how Databricks Lakehouse AI makes it easy to ingest and prepare your documents, and deploy a Vector Search index on top of it with just a few lines of code and configuration.

This simplifies and accelerates your data projects so that you can focus on the next step: creating your real-time chatbot endpoint with well-crafted prompt augmentation.

Open the [02-Deploy-RAG-Chatbot-Model]($./02-Deploy-RAG-Chatbot-Model) notebook to create and deploy a chatbot endpoint.