# Using Databricks Vector Search with the Foundation Model API

In [0]:
%pip install --upgrade --force-reinstall databricks-vectorsearch databricks-genai-inference
dbutils.library.restartPython()

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Collecting databricks-vectorsearch
  Using cached databricks_vectorsearch-0.22-py3-none-any.whl (8.5 kB)
Collecting databricks-genai-inference
  Using cached databricks_genai_inference-0.1.3-py3-none-any.whl (15 kB)
Collecting mlflow-skinny<3,>=2.4.0
  Using cached mlflow_skinny-2.9.2-py3-none-any.whl (4.7 MB)
Collecting requests>=2
  Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Collecting protobuf<5,>=3.12.0
  Using cached protobuf-4.25.2-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
Collecting pyyaml>=5.4.1
  Using cached PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (705 kB)
Collecting pydantic>=2.4.2
  Using cached pydantic-2.5.3-py3-none-any.whl (381 kB)
Collecting typing-extensions>=4.7.1
  Using cached typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Collecting databricks-sdk>=0.12.0
  Using cached databricks_sdk-0.18.0-py3-none-any.wh

## Setup
The cells below create a temporary catalog, schema, and table for this example. If you do not have permission to create catalogs, set `CATALOG` to an existing catalog that you can write to.
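
One way to support both cases is to route the choice through a small helper that also records whether the catalog is temporary. This is a minimal sketch: `choose_catalog` and the `existing_catalog` override are hypothetical names, not part of the notebook, and `"main"` only stands in for a catalog you already have access to.

```python
def choose_catalog(generated_name, existing_catalog=None):
    """Return (catalog_name, is_temporary).

    existing_catalog is a hypothetical override for users who cannot
    create catalogs; when it is set, the catalog is not temporary.
    """
    if existing_catalog:
        return existing_catalog, False
    return generated_name, True


# "main" is only an example of a pre-existing catalog name
catalog, is_temporary = choose_catalog("user_20240125_abc123", existing_catalog="main")
print(catalog, is_temporary)
```

Keeping the `is_temporary` flag around makes it easy to skip dropping a catalog that the notebook did not create.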

In [0]:
import re
from datetime import datetime
import uuid

# Assuming DB_USER is fetched from the Databricks utility
DB_USER = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')

# Sanitize the DB_USER by replacing non-alphanumeric characters with underscores
DB_USER_SANITIZED = re.sub(r'\W', '_', DB_USER)

# Append a timestamp and a UUID to ensure uniqueness
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
unique_id = str(uuid.uuid4()).split('-')[0]  # Get the first part of the UUID

# Create a definitely unique DB_USER identifier
DB_USER_UNIQUE = f"{DB_USER_SANITIZED}_{timestamp}_{unique_id}"

print(DB_USER_UNIQUE)

daniel_liden_databricks_com_20240125_232233_f4bec934


In [0]:
CATALOG = DB_USER_UNIQUE
DB = "FM_API_EXAMPLES"
VS_ENDPOINT_NAME = "test_endpoint"
VS_INDEX_NAME = "FM_API_EXAMPLES_VS_INDEX"
SOURCE_TABLE_NAME = "FM_API_EXAMPLES_DATA"

VS_INDEX_FULLNAME = f"{CATALOG}.{DB}.{VS_INDEX_NAME}"
SOURCE_TABLE_FULLNAME = f"{CATALOG}.{DB}.{SOURCE_TABLE_NAME}"
DATABRICKS_TOKEN = dbutils.secrets.get(scope="daniel.liden", key="rag_demo")

In [0]:
# Set up the catalog, schema, and source table
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType

spark.sql(f"CREATE CATALOG IF NOT EXISTS {CATALOG}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{DB}")
spark.sql(
    f"""CREATE TABLE IF NOT EXISTS {SOURCE_TABLE_FULLNAME} (
        id STRING,
        text STRING,
        date DATE,
        title STRING
    )
    USING delta 
    TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
"""
)

DataFrame[]

In [0]:
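# Caution: this drops the table created in the previous cell. Re-run that cell
# before creating the index, which needs the source table to exist with
# Change Data Feed enabled.
spark.sql(f"DROP TABLE {SOURCE_TABLE_FULLNAME}")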

DataFrame[]

In [0]:
from databricks.vector_search.client import VectorSearchClient
vsc = VectorSearchClient()

[NOTICE] Using a notebook authentication token. Recommended for development only. For improved performance, please use Service Principal based authentication. To disable this message, pass disable_notice=True to VectorSearchClient().


In [0]:
# Set up a delta sync index with managed embeddings computed from the `text` column
i = vsc.create_delta_sync_index(
    endpoint_name=VS_ENDPOINT_NAME,
    index_name=VS_INDEX_FULLNAME,
    source_table_name=SOURCE_TABLE_FULLNAME,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="text",
    embedding_model_endpoint_name="databricks-bge-large-en"
)
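
Index creation is asynchronous, so the index may not be ready to sync or query right after this call returns. A generic polling helper is one way to wait for it. This is a sketch under assumptions: `wait_until` is a hypothetical helper, and the `status`/`ready` keys in the commented usage are an assumption about the shape of `describe()` that may differ between client versions.

```python
import time


def wait_until(check_ready, timeout_s=600, poll_s=10, sleep=time.sleep):
    """Poll check_ready() until it returns True or timeout_s seconds have elapsed.

    Returns True if the check succeeded, False on timeout.
    """
    waited = 0
    while waited <= timeout_s:
        if check_ready():
            return True
        sleep(poll_s)
        waited += poll_s
    return False


# Hypothetical usage in this notebook (response keys are an assumption):
# idx = vsc.get_index(endpoint_name=VS_ENDPOINT_NAME, index_name=VS_INDEX_FULLNAME)
# wait_until(lambda: idx.describe().get("status", {}).get("ready", False))
```

The `sleep` parameter exists only so the helper can be exercised without real delays.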

In [0]:
# Two made-up sample documents to index
from datetime import datetime


smarter_overview = {"text":"""
S.M.A.R.T.E.R. Initiative: Strategic Management for Achieving Results through Efficiency and Resources
Introduction
The S.M.A.R.T.E.R. Initiative, standing for "Strategic Management for Achieving Results through Efficiency and Resources," is a groundbreaking project aimed at revolutionizing the way our organization operates. In today's rapidly changing business landscape, achieving success demands a strategic approach that leverages resources effectively while optimizing efficiency. The S.M.A.R.T.E.R. Initiative is designed to do just that.

Background
As markets evolve and competition intensifies, organizations must adapt to stay relevant and profitable. Traditional methods of operation often become inefficient and costly. The S.M.A.R.T.E.R. Initiative was conceived as a response to this challenge, with the primary goal of enhancing strategic management practices to achieve better results.

Objectives
1. Resource Optimization
One of the key objectives of the S.M.A.R.T.E.R. Initiative is to optimize resource allocation. This involves identifying underutilized resources, streamlining processes, and reallocating resources to areas that contribute most to our strategic goals.

2. Efficiency Improvement
Efficiency is at the core of the S.M.A.R.T.E.R. Initiative. By identifying bottlenecks and improving processes, we aim to reduce operational costs, shorten project timelines, and enhance overall productivity.

3. Strategic Alignment
For any organization to succeed, its activities must be aligned with its strategic objectives. The S.M.A.R.T.E.R. Initiative will ensure that every action and resource allocation is in sync with our long-term strategic goals.

4. Results-driven Approach
The ultimate measure of success is results. The S.M.A.R.T.E.R. Initiative will foster a results-driven culture within our organization, where decisions and actions are guided by their impact on our bottom line and strategic objectives.

Key Components
The S.M.A.R.T.E.R. Initiative comprises several key components:

1. Data Analytics and Insights
Data is the foundation of informed decision-making. We will invest in advanced data analytics tools to gain insights into our operations, customer behavior, and market trends. These insights will guide our resource allocation and strategy.

2. Process Automation
Automation will play a vital role in enhancing efficiency. Routine and repetitive tasks will be automated, freeing up our workforce to focus on more strategic activities.

3. Performance Metrics and KPIs
To ensure that our efforts are aligned with our objectives, we will establish a comprehensive set of Key Performance Indicators (KPIs). Regular monitoring and reporting will provide visibility into our progress.

4. Training and Development
Enhancing our workforce's skills is essential. We will invest in training and development programs to equip our employees with the knowledge and tools needed to excel in their roles.

Implementation Timeline
The S.M.A.R.T.E.R. Initiative will be implemented in phases over the next three years. This phased approach allows for a smooth transition and ensures that each component is integrated effectively into our operations.

Conclusion
The S.M.A.R.T.E.R. Initiative represents a significant step forward for our organization. By strategically managing our resources and optimizing efficiency, we are positioning ourselves for sustained success in a competitive marketplace. This initiative is a testament to our commitment to excellence and our dedication to achieving exceptional results.

As we embark on this journey, we look forward to the transformative impact that the S.M.A.R.T.E.R. Initiative will have on our organization and the benefits it will bring to our employees, customers, and stakeholders.
""", "title": "Project Kickoff", "date": datetime.strptime("2024-01-16", "%Y-%m-%d")}

smarter_kpis = {"text": """S.M.A.R.T.E.R. Initiative: Key Performance Indicators (KPIs)
Introduction
The S.M.A.R.T.E.R. Initiative (Strategic Management for Achieving Results through Efficiency and Resources) is designed to drive excellence within our organization. To measure the success and effectiveness of this initiative, we have established three concrete and measurable Key Performance Indicators (KPIs). This document outlines these KPIs and their associated targets.

Key Performance Indicators (KPIs)
1. Resource Utilization Efficiency (RUE)
Objective: To optimize resource utilization for cost-efficiency.

KPI Definition: RUE will be calculated as (Actual Resource Utilization / Planned Resource Utilization) * 100%.

Target: Achieve a 15% increase in RUE within the first year.

2. Time-to-Decision Reduction (TDR)
Objective: To streamline operational processes and reduce decision-making time.

KPI Definition: TDR will be calculated as (Pre-Initiative Decision Time - Post-Initiative Decision Time) / Pre-Initiative Decision Time.

Target: Achieve a 20% reduction in TDR for critical business decisions.

3. Strategic Goals Achievement (SGA)
Objective: To ensure that organizational activities align with strategic goals.

KPI Definition: SGA will measure the percentage of predefined strategic objectives achieved.

Target: Achieve an 80% Strategic Goals Achievement rate within two years.

Conclusion
These three KPIs, Resource Utilization Efficiency (RUE), Time-to-Decision Reduction (TDR), and Strategic Goals Achievement (SGA), will serve as crucial metrics for evaluating the success of the S.M.A.R.T.E.R. Initiative. By tracking these KPIs and working towards their targets, we aim to drive efficiency, optimize resource utilization, and align our actions with our strategic objectives. This focus on measurable outcomes will guide our efforts towards achieving excellence within our organization.""",
"title": "Project KPIs", "date": datetime.strptime("2024-01-16", "%Y-%m-%d")}

In [0]:
import re

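def chunk_text(text, chunk_size, overlap):
    """Split text into overlapping chunks of roughly chunk_size words.

    Each chunk is extended to the next word that ends a sentence (., !, or ?),
    and consecutive chunks start chunk_size - overlap words apart.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    chunks = []
    index = 0

    while index < len(words):
        end = index + chunk_size
        # Extend the chunk until words[end] ends a sentence
        while end < len(words) and not re.match(r'.*[.!?]\s*$', words[end]):
            end += 1
        chunks.append(' '.join(words[index:end + 1]))
        if end >= len(words) - 1:
            break  # the text is exhausted; skip a redundant trailing chunk
        index += chunk_size - overlap

    return chunks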

# Chunk each document and attach its metadata; the id (title + chunk number) is the index's primary key
chunks = []
documents = [smarter_overview, smarter_kpis]  # Replace with your actual documents

for document in documents:
    for i, c in enumerate(chunk_text(document["text"], 150, 25)):
        chunk = {}
        chunk["text"] = c
        chunk["title"] = document["title"]
        chunk["date"] = document["date"]
        chunk["id"] = document["title"] + "_" + str(i)

        chunks.append(chunk)
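
Because `id` serves as the primary key of the index, it can be worth checking the chunk list for duplicate ids and unexpected chunk sizes before writing it to the table. The helper below is a sketch; `summarize_chunks` and the sample data are hypothetical and only mirror the `id`/`text` keys used above.

```python
from collections import Counter


def summarize_chunks(chunks):
    """Return (word_counts, duplicate_ids) for a list of chunk dicts.

    word_counts is a list of (id, number_of_words) pairs in input order;
    duplicate_ids is a sorted list of ids that occur more than once.
    """
    word_counts = [(c["id"], len(c["text"].split())) for c in chunks]
    id_counts = Counter(c["id"] for c in chunks)
    duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)
    return word_counts, duplicate_ids


# Made-up sample with one deliberately duplicated id
sample = [
    {"id": "Project KPIs_0", "text": "one two three"},
    {"id": "Project KPIs_1", "text": "three four"},
    {"id": "Project KPIs_1", "text": "five"},
]
word_counts, duplicate_ids = summarize_chunks(sample)
print(word_counts)
print(duplicate_ids)
```

With the id scheme used here, duplicates would only appear if two documents shared the same title.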


In [0]:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType, DateType


schema = StructType(
    [
        StructField("id", StringType(), True),
        StructField("text", StringType(), True),
        StructField("title", StringType(), True),
        StructField("date", DateType(), True),
    ]
)

if chunks:
    result_df = spark.createDataFrame(chunks, schema=schema)
    result_df.write.format("delta").mode("append").saveAsTable(
        SOURCE_TABLE_FULLNAME
    )

In [0]:
# Trigger a sync so the index picks up the rows appended to the source table
index = vsc.get_index(endpoint_name=VS_ENDPOINT_NAME,
                      index_name=VS_INDEX_FULLNAME)
index.sync()

{}

In [0]:
# Query the index and return the text of the three most similar chunks
index.similarity_search(columns=["text"],
                        query_text="What is the TDR Target for the SMARTER initiative?",
                        num_results = 3)


{'manifest': {'column_count': 2,
  'columns': [{'name': 'text'}, {'name': 'score'}]},
 'result': {'row_count': 3,
  'data_array': [['S.M.A.R.T.E.R. Initiative: Key Performance Indicators (KPIs) Introduction The S.M.A.R.T.E.R. Initiative (Strategic Management for Achieving Results through Efficiency and Resources) is designed to drive excellence within our organization. To measure the success and effectiveness of this initiative, we have established three concrete and measurable Key Performance Indicators (KPIs). This document outlines these KPIs and their associated targets. Key Performance Indicators (KPIs) 1. Resource Utilization Efficiency (RUE) Objective: To optimize resource utilization for cost-efficiency. KPI Definition: RUE will be calculated as (Actual Resource Utilization / Planned Resource Utilization) * 100%. Target: Achieve a 15% increase in RUE within the first year. 2. Time-to-Decision Reduction (TDR) Objective: To streamline operational processes and reduce decision-mak
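
The response pairs a `manifest` of column names with a `data_array` of rows, so it can be reshaped into one dict per hit. The following is a minimal sketch based only on the response shape shown above; `rows_from_response` is a hypothetical helper and the sample values are made up.

```python
def rows_from_response(response):
    """Turn a similarity_search response into a list of {column: value} dicts."""
    columns = [col["name"] for col in response["manifest"]["columns"]]
    data = response.get("result", {}).get("data_array", [])
    return [dict(zip(columns, row)) for row in data]


# Hand-written sample in the same shape as the output above (values are made up)
sample_response = {
    "manifest": {"column_count": 2, "columns": [{"name": "text"}, {"name": "score"}]},
    "result": {
        "row_count": 2,
        "data_array": [["first chunk", 0.71], ["second chunk", 0.64]],
    },
}

for row in rows_from_response(sample_response):
    print(f"{row['score']:.2f}  {row['text']}")
```

Reading column names from the manifest keeps the helper working when the `columns` argument of the query changes.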

## Cleanup
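The cell below deletes the Vector Search index and then drops the catalog with `CASCADE`, which removes the schema and source table along with everything else in that catalog. If you set `CATALOG` to an existing catalog instead of the temporary one, drop only the schema or table you created.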

In [0]:
# Delete the index
vsc.delete_index(endpoint_name=VS_ENDPOINT_NAME,
                 index_name=VS_INDEX_FULLNAME)

# Drop the catalog and everything in it (schema and source table)
spark.sql(f"DROP CATALOG IF EXISTS {CATALOG} CASCADE")

DataFrame[]