## Install Required Python Libraries
First, install the required Python libraries.


In [0]:
%pip install -q -U llama-index

In [0]:
%pip install "mlflow>=3.0" databricks-feature-engineering --upgrade
%pip install llama-index-llms-databricks

In [0]:
dbutils.library.restartPython()

Next, explore the PDF content that was parsed into a structured format in the knowledge base table.


In [0]:
%sql

-- select * from <CATALOG_NAME>.<SCHEMA_NAME>.parsed_policy_pdfs;
select * from hytech_workshop.ai_agent.parsed_policy_pdfs;

In [0]:
pdfs_df = spark.table("hytech_workshop.ai_agent.parsed_policy_pdfs")

In [0]:
import pandas as pd
from typing import Iterator
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import Document
from llama_index.llms.databricks import Databricks
from pyspark.sql.functions import pandas_udf

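# Chunking UDF. SentenceSplitter splits on sentence boundaries and token counts and
# does not call an LLM, so the Settings.llm assignment below does not affect the chunks.
@pandas_udf("array<string>")
def chunk_pdf_content(content: pd.Series) -> pd.Series:
    # Note: "databricks-gte-large-en" is an embedding model endpoint, not a chat LLM.
    # Fill in the placeholders; do not commit a real token to source control.
    llm = Databricks(
        model="databricks-gte-large-en",
        api_key="<INSERT YOUR PERSONAL ACCESS TOKEN HERE>",
        api_base="https://<DATABRICKS WORKSPACE URL>/serving-endpoints"
    )
    Settings.llm = llm
    # Aim for chunks of up to 500 tokens, with 50 tokens of overlap between neighbors
    splitter = SentenceSplitter(chunk_size=500, chunk_overlap=50)

    def extract_and_split(txt):
        # Rows with missing or empty content produce no chunks
        if not txt:
            return []
        nodes = splitter.get_nodes_from_documents([Document(text=txt)])
        return [n.text for n in nodes]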

    return content.apply(extract_and_split)
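
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified, dependency-free sketch of overlapping chunking. It counts whitespace-separated words rather than tokens and ignores sentence boundaries, so it only illustrates the overlap idea and is not what `SentenceSplitter` does internally. The `chunk_words` helper is hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split text into chunks of at most chunk_size words, where consecutive
    chunks share chunk_overlap words (illustrative helper, not a LlamaIndex API)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, chunk_overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk repeats the last word of the previous one, which is the role the 50-token overlap plays in the UDF: text that straddles a chunk boundary still appears intact in at least one chunk.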

In [0]:
from pyspark.sql.functions import explode

# Split each PDF's content into chunks, then explode to one row per chunk
exploded_chunks_dbx_df = (
    pdfs_df
    .withColumn("content", explode(chunk_pdf_content("content")))
    .selectExpr("doc_uri as pdf_name", "content")
)
display(exploded_chunks_dbx_df)

## Save Chunked Data as a Delta Table

This cell saves the exploded chunks DataFrame as a Delta table with Change Data Feed enabled. The table is the source for the vector search index created in the next section; the index, not the table itself, is what enables semantic search over the PDF chunks.


In [0]:
# Replace these values with your own catalog and schema names
catalog = 'hytech_workshop'
schema = 'ai_agent'

exploded_chunks_dbx_df.write.mode('overwrite').option("delta.enableChangeDataFeed", "true").saveAsTable(f"{catalog}.{schema}.policy_pdfs_chunked_db")
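
A vector search index needs a unique primary key column in its source table, and the table written above contains only `pdf_name` and `content`. One possible way to derive a stable key is sketched below in plain Python. The `chunk_id` and `add_ids` helpers, the `id` column name, and the hashing scheme are all assumptions for illustration, not part of the original notebook.

```python
import hashlib
from typing import Dict, List


def chunk_id(pdf_name: str, position: int) -> str:
    """Deterministic key built from the document name and the chunk's position in it."""
    raw = f"{pdf_name}:{position}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def add_ids(pdf_name: str, chunks: List[str]) -> List[Dict[str, str]]:
    """Attach an id to every chunk of a single document."""
    return [
        {"id": chunk_id(pdf_name, i), "pdf_name": pdf_name, "content": text}
        for i, text in enumerate(chunks)
    ]


rows = add_ids("policy_a.pdf", ["first chunk", "second chunk"])
for row in rows:
    print(row["id"], row["content"])
```

Keying on position rather than on the chunk text keeps ids unique even when two chunks of the same document happen to contain identical text. In Spark, a comparable result could likely be obtained with `posexplode` in place of `explode`, followed by `sha2` over the document name and position, before the table is saved.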

## Create a Vector Search Index for `policy_pdfs_chunked_db`

To enable efficient semantic search over your PDF chunks, you need to create a vector search index on your Delta table (`policy_pdfs_chunked_db`). Follow these steps:

1. **Ensure Requirements:**
   - Your workspace must have Unity Catalog enabled.
   - Serverless compute must be enabled.
   - You must have `CREATE TABLE` privileges on the target schema.

2. **Create the Vector Search Index:**
   - You can use the Databricks UI, Python SDK, or REST API. The UI is the simplest method.

   **Using the Databricks UI:**
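   - Open Catalog Explorer from the sidebar of your Databricks workspace (the **Catalog** tab; older workspaces label it **Data**).
   - Locate your Delta table (`policy_pdfs_chunked_db`).
   - Open the table and choose the option to create a vector search index.
   - Follow the prompts to configure the index: choose the primary key, the vector search endpoint, the text column to embed (here `content`) together with an embedding model, and the sync mode.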
   - Click "Create".

   **Using Python SDK or REST API:**
   - Refer to the [Databricks documentation](https://learn.microsoft.com/en-us/azure/databricks/vector-search/create-vector-search/) for code examples and API details.

3. **After Creation:**
   - The index stays in sync with your Delta table: continuously in Continuous sync mode, or each time a sync is triggered in Triggered mode.
   - You can now perform semantic search queries using the vector search endpoint.

> **Note:** Do not use a column named `_id` in your source table, as it is reserved for vector search indexes.
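
The SDK route and a follow-up query can be sketched as below, using `VectorSearchClient` from the `databricks-vectorsearch` package (which the install cells above do not include). This is a minimal sketch under several assumptions: the endpoint name `policy_vs_endpoint`, the index name, and the `id` primary key column are hypothetical, the endpoint is assumed to exist already, and using `databricks-gte-large-en` as the embedding model is a choice rather than a requirement. The client is imported lazily so the spec can be inspected outside Databricks.

```python
from typing import Any, Dict


def build_index_spec(catalog: str, schema: str) -> Dict[str, Any]:
    """Assemble keyword arguments for VectorSearchClient.create_delta_sync_index."""
    table = f"{catalog}.{schema}.policy_pdfs_chunked_db"
    return {
        "endpoint_name": "policy_vs_endpoint",  # hypothetical, must already exist
        "source_table_name": table,
        "index_name": f"{table}_index",         # hypothetical index name
        "pipeline_type": "TRIGGERED",           # sync on demand rather than continuously
        "primary_key": "id",                    # hypothetical unique key column
        "embedding_source_column": "content",   # text column to embed
        "embedding_model_endpoint_name": "databricks-gte-large-en",
    }


def create_index(spec: Dict[str, Any]):
    """Create the index. Needs databricks-vectorsearch and a Databricks workspace."""
    from databricks.vector_search.client import VectorSearchClient

    client = VectorSearchClient()
    return client.create_delta_sync_index(**spec)


def search(index, question: str, k: int = 3):
    """Return the k chunks most similar to the question."""
    return index.similarity_search(
        query_text=question,
        columns=["pdf_name", "content"],
        num_results=k,
    )


spec = build_index_spec("hytech_workshop", "ai_agent")
print(spec["source_table_name"])
# hytech_workshop.ai_agent.policy_pdfs_chunked_db
```

In a Databricks notebook, `index = create_index(spec)` followed by `search(index, "your question")` would exercise both steps once the index has finished its first sync.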