<a href="https://colab.research.google.com/github/dimitarpg13/agentic_architectures_and_design_patterns/blob/main/notebooks/live_web_search/duckduckgo_langgraph_databricks_rag_ex1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG with Databricks Vector Search, DuckDuckGo, and LangGraph

This notebook demonstrates a production-ready RAG system that combines:
- **Databricks Vector Search**: Enterprise-grade vector database
- **DuckDuckGo**: Live web search for current information
- **LangGraph**: Intelligent workflow orchestration
- **Unity Catalog**: Data governance and management

## Architecture Overview

```
User Query → Router Agent
                ↓
    ┌───────────┴───────────┐
    ↓                       ↓
Databricks Vector      DuckDuckGo
   Search                Web Search
    ↓                       ↓
    └───────────┬───────────┘
                ↓
        Context Synthesis
                ↓
          LLM Generation
                ↓
          Final Answer
```

## Features
- 🏢 Enterprise-grade vector search with Databricks
- 🔍 Hybrid retrieval (vector DB + web search)
- 🔄 Intelligent routing with LangGraph
- 📊 Unity Catalog integration
- ⚡ High-performance Delta Lake storage
- 🔐 Built-in security and governance

## 1. Install Required Packages

In [None]:
# Install required packages
!pip install -q databricks-vectorsearch databricks-sdk langchain langchain-openai \
    langgraph langchain-community duckduckgo-search mlflow pandas

## 2. Import Dependencies

In [None]:
import os
from typing import TypedDict, Annotated, List, Dict, Any, Optional
import operator
import json

# Databricks imports
from databricks.vector_search.client import VectorSearchClient
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import *

# LangChain imports
from langchain_community.tools import DuckDuckGoSearchResults
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_core.documents import Document

# LangGraph imports
from langgraph.graph import StateGraph, END

# Other imports
import pandas as pd
from datetime import datetime

print("✅ All imports successful!")

## 3. Configure Databricks Connection

### Authentication Options:
1. **Databricks Notebook**: Automatically authenticated
2. **Local/External**: Use personal access token
3. **Service Principal**: For production deployments

In [None]:
# Databricks Configuration
# Option 1: Running in Databricks notebook (auto-authenticated)
# No configuration needed - credentials are automatically available

# Option 2: Running locally or outside Databricks
DATABRICKS_HOST = "https://your-workspace.cloud.databricks.com"  # Your workspace URL
DATABRICKS_TOKEN = "your-personal-access-token"  # Your PAT

# Set environment variables for local development
os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST
os.environ["DATABRICKS_TOKEN"] = DATABRICKS_TOKEN

# OpenAI Configuration
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

print("✅ Configuration set")
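
Hard-coded tokens are easy to leak when a notebook is shared. As a minimal sketch, assuming the host and token have already been exported in the shell, the credentials can be read from the environment instead. The helper name `load_databricks_credentials` is hypothetical and not part of any Databricks library.

```python
import os
from typing import Mapping


def load_databricks_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Read the workspace URL and token from the environment (hypothetical helper)."""
    required = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Strip a trailing slash so the URL has one canonical form.
    return {
        "host": env["DATABRICKS_HOST"].rstrip("/"),
        "token": env["DATABRICKS_TOKEN"],
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check without touching real credentials.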

## 4. Set Up Unity Catalog Resources

Define the catalog, schema, and table names for organizing your vector data.

In [None]:
# Unity Catalog Configuration
CATALOG_NAME = "main"  # Or your custom catalog
SCHEMA_NAME = "rag_demo"  # Schema for RAG application
TABLE_NAME = "knowledge_base"  # Delta table for documents
VECTOR_SEARCH_ENDPOINT = "rag_endpoint"  # Vector Search endpoint name
INDEX_NAME = "knowledge_base_index"  # Vector search index name

# Full qualified names
FULL_TABLE_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}"
FULL_INDEX_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.{INDEX_NAME}"

print(f"📊 Catalog: {CATALOG_NAME}")
print(f"📁 Schema: {SCHEMA_NAME}")
print(f"📄 Table: {FULL_TABLE_NAME}")
print(f"🔍 Index: {FULL_INDEX_NAME}")

## 5. Initialize Databricks Clients

In [None]:
# Initialize Databricks clients
try:
    # Workspace client for catalog operations
    workspace_client = WorkspaceClient()
    
    # Vector Search client
    vector_search_client = VectorSearchClient(
        workspace_url=DATABRICKS_HOST,
        personal_access_token=DATABRICKS_TOKEN,
        disable_notice=True
    )
    
    print("✅ Databricks clients initialized successfully!")
    
except Exception as e:
    print(f"❌ Error initializing clients: {e}")
    print("\nTroubleshooting:")
    print("1. Verify DATABRICKS_HOST and DATABRICKS_TOKEN are set correctly")
    print("2. Ensure your PAT has the necessary permissions")
    print("3. Check network connectivity to Databricks workspace")

## 6. Create Unity Catalog Schema (if not exists)

In [None]:
# Note: This would typically run on Databricks using Spark SQL
# Here's the SQL command to run in Databricks:

create_schema_sql = f"""
-- Create catalog if not exists
CREATE CATALOG IF NOT EXISTS {CATALOG_NAME};

-- Create schema if not exists
CREATE SCHEMA IF NOT EXISTS {CATALOG_NAME}.{SCHEMA_NAME}
COMMENT 'Schema for RAG application with vector search';
"""

print("SQL to run in Databricks SQL Editor or notebook:")
print(create_schema_sql)
print("\n⚠️  Run the above SQL in your Databricks workspace before proceeding")

## 7. Create Sample Documents Dataset

In [None]:
# Sample documents for the knowledge base
sample_documents = [
    {
        "id": "doc_001",
        "content": """Databricks is a unified data analytics platform built on Apache Spark. 
        It provides collaborative notebooks, automated cluster management, and production-grade 
        data pipelines. Databricks simplifies big data processing and machine learning workflows.""",
        "title": "Introduction to Databricks",
        "category": "platform",
        "source": "databricks_docs",
        "date": "2024-01-15"
    },
    {
        "id": "doc_002",
        "content": """Unity Catalog is Databricks' unified governance solution for data and AI. 
        It provides centralized access control, auditing, lineage, and data discovery across 
        Databricks workspaces. Unity Catalog works with Delta Lake to provide fine-grained 
        governance for tables, views, and models.""",
        "title": "Unity Catalog Overview",
        "category": "governance",
        "source": "unity_catalog_guide",
        "date": "2024-01-20"
    },
    {
        "id": "doc_003",
        "content": """Databricks Vector Search is a serverless vector database that makes it easy to 
        build retrieval-augmented generation (RAG) applications. It automatically indexes and 
        syncs embeddings from Delta tables, provides high-performance similarity search, and 
        integrates seamlessly with Unity Catalog for governance.""",
        "title": "Databricks Vector Search",
        "category": "vector_search",
        "source": "vector_search_docs",
        "date": "2024-02-01"
    },
    {
        "id": "doc_004",
        "content": """Delta Lake is an open-source storage layer that brings ACID transactions to 
        Apache Spark and big data workloads. It provides time travel, schema enforcement, and 
        unified batch and streaming processing. Delta Lake is the foundation for the lakehouse 
        architecture.""",
        "title": "Delta Lake Fundamentals",
        "category": "storage",
        "source": "delta_lake_guide",
        "date": "2024-01-10"
    },
    {
        "id": "doc_005",
        "content": """MLflow is an open-source platform for managing the machine learning lifecycle. 
        It includes experiment tracking, model registry, and model deployment capabilities. 
        MLflow integrates with Databricks to provide a complete MLOps solution.""",
        "title": "MLflow Platform",
        "category": "mlops",
        "source": "mlflow_docs",
        "date": "2024-01-25"
    },
    {
        "id": "doc_006",
        "content": """LangChain is a framework for developing applications powered by language models. 
        It provides abstractions for prompts, chains, agents, and memory. LangChain works 
        seamlessly with Databricks for building production RAG applications.""",
        "title": "LangChain Framework",
        "category": "frameworks",
        "source": "langchain_guide",
        "date": "2024-02-05"
    },
    {
        "id": "doc_007",
        "content": """Our company policy on data governance: All production data must be stored in 
        Unity Catalog with appropriate access controls. Data classification should follow 
        the company's data classification standard. PII data requires encryption at rest 
        and column-level access controls.""",
        "title": "Data Governance Policy",
        "category": "policy",
        "source": "company_handbook",
        "date": "2024-01-01"
    }
]

# Convert to DataFrame
docs_df = pd.DataFrame(sample_documents)

print(f"✅ Created {len(sample_documents)} sample documents")
print("\nSample data:")
print(docs_df[['id', 'title', 'category']].to_string())

## 8. Initialize Embeddings Model

In [None]:
# Initialize OpenAI embeddings
embeddings_model = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536  # Default output dimension for text-embedding-3-small
)

# Test embeddings
test_text = "This is a test document."
test_embedding = embeddings_model.embed_query(test_text)

print(f"✅ Embeddings model initialized")
print(f"📊 Embedding dimension: {len(test_embedding)}")
print(f"🔢 Sample embedding (first 5 values): {test_embedding[:5]}")

## 9. Generate Embeddings for Documents

In [None]:
# Generate embeddings for all documents
print("🔄 Generating embeddings for documents...")

embeddings_list = []
for doc in sample_documents:
    # Combine title and content for better embeddings
    text_to_embed = f"{doc['title']}: {doc['content']}"
    embedding = embeddings_model.embed_query(text_to_embed)
    embeddings_list.append(embedding)
    print(f"  ✓ Embedded: {doc['id']}")

# Add embeddings to dataframe
docs_df['embedding'] = embeddings_list

print(f"\n✅ Generated {len(embeddings_list)} embeddings")

## 10. Create Delta Table with Documents

**Note**: This section shows the SQL commands to run in Databricks.
In a real Databricks notebook, you would use Spark DataFrames.

In [None]:
# SQL to create the Delta table in Databricks
create_table_sql = f"""
CREATE TABLE IF NOT EXISTS {FULL_TABLE_NAME} (
  id STRING,
  content STRING,
  title STRING,
  category STRING,
  source STRING,
  date STRING,
  embedding ARRAY<FLOAT>,
  created_at TIMESTAMP
)
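USING DELTA
COMMENT 'Knowledge base for RAG application'
-- Delta Sync indexes read the table's change data feed, so enable it up front
TBLPROPERTIES (delta.enableChangeDataFeed = true);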
"""

print("SQL to create Delta table:")
print(create_table_sql)
print("\n" + "="*80)

# In a Databricks notebook with Spark, you would do:
spark_save_code = f"""
# Convert pandas DataFrame to Spark DataFrame
from pyspark.sql.functions import current_timestamp

spark_df = spark.createDataFrame(docs_df)
spark_df = spark_df.withColumn("created_at", current_timestamp())

# Write to Delta table
spark_df.write.format("delta") \\
    .mode("overwrite") \\
    .option("mergeSchema", "true") \\
    .saveAsTable("{FULL_TABLE_NAME}")

print(f"✅ Saved {{len(docs_df)}} documents to {FULL_TABLE_NAME}")
"""

print("\nPython code to run in Databricks Spark notebook:")
print(spark_save_code)
print("\n⚠️  Run the above code in your Databricks notebook to create and populate the table")

## 11. Create Vector Search Endpoint

In [None]:
def create_vector_search_endpoint(endpoint_name: str):
    """
    Create a Vector Search endpoint if it doesn't exist.
    """
    try:
        # Check if endpoint already exists
        try:
            endpoint = vector_search_client.get_endpoint(endpoint_name)
            print(f"✅ Endpoint '{endpoint_name}' already exists")
            return endpoint
        except Exception:
            # Endpoint doesn't exist, create it
            print(f"🔄 Creating endpoint '{endpoint_name}'...")
            
            endpoint = vector_search_client.create_endpoint(
                name=endpoint_name,
                endpoint_type="STANDARD"  # check the Vector Search docs for the endpoint types your workspace supports
            )
            
            print(f"✅ Created endpoint '{endpoint_name}'")
            print(f"⏳ Endpoint is initializing (this may take a few minutes)...")
            
            return endpoint
            
    except Exception as e:
        print(f"❌ Error creating endpoint: {e}")
        return None

# Create endpoint
endpoint = create_vector_search_endpoint(VECTOR_SEARCH_ENDPOINT)

if endpoint:
    print(f"\n📍 Endpoint Name: {VECTOR_SEARCH_ENDPOINT}")
    print(f"📊 Endpoint Status: Check in Databricks UI under Compute > Vector Search")

## 12. Create Vector Search Index

In [None]:
def create_vector_search_index(
    endpoint_name: str,
    index_name: str,
    table_name: str,
    embedding_dimension: int = 1536
):
    """
    Create a Vector Search index on the Delta table.
    """
    try:
        # Check if index already exists
        try:
            index = vector_search_client.get_index(endpoint_name, index_name)
            print(f"✅ Index '{index_name}' already exists")
            return index
        except Exception:
            # Index doesn't exist, create it
            print(f"🔄 Creating index '{index_name}'...")
            
            index = vector_search_client.create_delta_sync_index(
                endpoint_name=endpoint_name,
                index_name=index_name,
                source_table_name=table_name,
                pipeline_type="TRIGGERED",  # or "CONTINUOUS" for real-time sync
                primary_key="id",
                embedding_dimension=embedding_dimension,
                embedding_vector_column="embedding"
            )
            
            print(f"✅ Created index '{index_name}'")
            print(f"⏳ Index is being built (this may take several minutes)...")
            print(f"📊 Monitor progress in Databricks UI")
            
            return index
            
    except Exception as e:
        print(f"❌ Error creating index: {e}")
        print("\nCommon issues:")
        print("1. Endpoint not ready - wait for endpoint to be ONLINE")
        print("2. Table doesn't exist - create the Delta table first")
        print("3. Permission issues - ensure you have CREATE privilege")
        return None

# Create index
index = create_vector_search_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT,
    index_name=FULL_INDEX_NAME,
    table_name=FULL_TABLE_NAME,
    embedding_dimension=1536
)

if index:
    print(f"\n🔍 Index Name: {FULL_INDEX_NAME}")
    print(f"📊 Source Table: {FULL_TABLE_NAME}")

## 13. Test Vector Search

In [None]:
def test_vector_search(query: str, k: int = 3):
    """
    Test vector search with a query.
    """
    print(f"\n🔍 Searching for: '{query}'")
    print("="*60)
    
    try:
        # Generate query embedding
        query_embedding = embeddings_model.embed_query(query)
        
        # Search the index
        results = vector_search_client.get_index(
            endpoint_name=VECTOR_SEARCH_ENDPOINT,
            index_name=FULL_INDEX_NAME
        ).similarity_search(
            query_vector=query_embedding,
            columns=["id", "title", "content", "category", "source"],
            num_results=k
        )
        
        print(f"✅ Found {len(results.get('result', {}).get('data_array', []))} results\n")
        
        # Display results
        for i, result in enumerate(results.get('result', {}).get('data_array', []), 1):
            print(f"Result {i}:")
            print(f"  ID: {result[0]}")
            print(f"  Title: {result[1]}")
            print(f"  Content: {result[2][:100]}...")
            print(f"  Category: {result[3]}")
            print(f"  Score: {result[-1] if len(result) > 5 else 'N/A'}")
            print()
        
        return results
        
    except Exception as e:
        print(f"❌ Error searching: {e}")
        print("\nPossible issues:")
        print("1. Index not ready - wait for indexing to complete")
        print("2. No data in table - ensure documents were loaded")
        print("3. Endpoint offline - check endpoint status")
        return None

# Test search
test_results = test_vector_search("What is Unity Catalog?")
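
Indexing each row by position (`result[0]`, `result[1]`, ...) breaks silently if the `columns` list changes. The sketch below maps rows onto the requested column names instead. It assumes the response shape this notebook already relies on (`{"result": {"data_array": [...]}}`, with the score appended after the requested columns). `rows_to_dicts` is a hypothetical helper, not part of the Vector Search client.

```python
from typing import Any, Dict, List


def rows_to_dicts(response: Dict[str, Any], columns: List[str]) -> List[Dict[str, Any]]:
    """Map each row of `data_array` onto the requested column names.

    Assumes the shape used in this notebook:
    {"result": {"data_array": [[col_1, ..., col_n, score], ...]}}.
    """
    rows = response.get("result", {}).get("data_array") or []
    parsed = []
    for row in rows:
        record = dict(zip(columns, row))
        # A trailing value beyond the requested columns is treated as the score.
        record["score"] = row[len(columns)] if len(row) > len(columns) else None
        parsed.append(record)
    return parsed


# Usage with a hand-written response in the assumed shape.
fake_response = {
    "result": {"data_array": [["doc_002", "Unity Catalog Overview", 0.91]]}
}
parsed_rows = rows_to_dicts(fake_response, ["id", "title"])
```

Keeping the column list in one place means the search call and the parser cannot drift apart.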

## 14. Initialize DuckDuckGo Web Search

In [None]:
# Initialize DuckDuckGo search
web_search = DuckDuckGoSearchResults(
    num_results=3,
    output_format="list"
)

# Test web search
test_search = web_search.run("latest databricks features 2024")
print("✅ DuckDuckGo search initialized")
print(f"Test search returned results: {len(test_search) if isinstance(test_search, list) else 'Yes'}")

## 15. Define RAG State for LangGraph

In [None]:
class DatabricksRAGState(TypedDict):
    """
    State for the Databricks RAG workflow.
    """
    # Input
    query: str
    
    # Routing
    use_vector_search: bool
    use_web_search: bool
    
    # Retrieval results
    vector_results: List[Dict[str, Any]]
    web_results: List[Dict[str, Any]]
    
    # Processing
    combined_context: str
    sources: List[str]
    
    # Output
    answer: str
    confidence: float
    metadata: Dict[str, Any]

print("✅ RAG state defined")

## 16. Create Router Node

In [None]:
# Initialize LLM
llm = ChatOpenAI(model="gpt-4", temperature=0)

def route_query(state: DatabricksRAGState) -> DatabricksRAGState:
    """
    Route query to appropriate retrieval sources.
    """
    print("\n" + "="*60)
    print("🔀 ROUTER NODE")
    print("="*60)
    
    query = state["query"]
    
    router_prompt = f"""Analyze this query and determine the best retrieval strategy:

Query: "{query}"

Available sources:
1. Databricks Vector Search: Internal knowledge base about Databricks, Unity Catalog, 
   Delta Lake, MLflow, data governance, and company policies
2. Web Search: Current news, latest features, recent announcements, and real-time information

Decide:
- use_vector_search: true if query is about internal documentation or company knowledge
- use_web_search: true if query requires current/recent external information

Both can be true for queries needing both internal and external context.

Examples:
- "What is Unity Catalog?" → vector: true, web: false
- "Latest Databricks announcements 2024" → vector: false, web: true
- "Compare Databricks Vector Search to Pinecone" → vector: true, web: true

Respond with ONLY JSON using lowercase booleans: {{"use_vector_search": <true|false>, "use_web_search": <true|false>}}"""
    
    response = llm.invoke([HumanMessage(content=router_prompt)])
    
    try:
        decision = json.loads(response.content)
        use_vector = decision.get("use_vector_search", True)
        use_web = decision.get("use_web_search", False)
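    except (json.JSONDecodeError, AttributeError, TypeError):
        # Fall back to the knowledge base when the reply is not a JSON object
        use_vector = True
        use_web = False
    
    print(f"📊 Routing Decision:")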
    print(f"   Databricks Vector Search: {use_vector}")
    print(f"   DuckDuckGo Web Search: {use_web}")
    
    return {
        "use_vector_search": use_vector,
        "use_web_search": use_web
    }

print("✅ Router node created")
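
Chat models sometimes wrap JSON in a code fence or add a sentence around it, which makes a bare `json.loads` fail and quietly triggers the fallback. A tolerant parser for the router reply could look like the sketch below. `parse_routing_decision` and `DEFAULT_ROUTE` are hypothetical names, and the defaults mirror the fallback used in `route_query`.

```python
import json
import re
from typing import Dict

# Same fallback as the router node: knowledge base only.
DEFAULT_ROUTE = {"use_vector_search": True, "use_web_search": False}


def parse_routing_decision(text: str) -> Dict[str, bool]:
    """Extract the routing flags from an LLM reply, tolerating text around the JSON."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not match:
        return dict(DEFAULT_ROUTE)
    try:
        decision = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(DEFAULT_ROUTE)
    if not isinstance(decision, dict):
        return dict(DEFAULT_ROUTE)
    # Missing keys fall back to their defaults one by one.
    return {
        key: bool(decision.get(key, default))
        for key, default in DEFAULT_ROUTE.items()
    }
```

Because the function is pure, the fallback paths can be checked without calling the model.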

## 17. Create Databricks Vector Search Node

In [None]:
def search_databricks_vectors(state: DatabricksRAGState) -> DatabricksRAGState:
    """
    Search Databricks Vector Search index.
    """
    print("\n" + "="*60)
    print("🔍 DATABRICKS VECTOR SEARCH NODE")
    print("="*60)
    
    if not state.get("use_vector_search", False):
        print("⏭️  Skipping vector search")
        return {"vector_results": []}
    
    query = state["query"]
    
    try:
        # Generate query embedding
        print(f"🔄 Generating query embedding...")
        query_embedding = embeddings_model.embed_query(query)
        
        # Search the index
        print(f"🔎 Searching Databricks Vector Search index...")
        results = vector_search_client.get_index(
            endpoint_name=VECTOR_SEARCH_ENDPOINT,
            index_name=FULL_INDEX_NAME
        ).similarity_search(
            query_vector=query_embedding,
            columns=["id", "title", "content", "category", "source"],
            num_results=3
        )
        
        # Parse results
        vector_results = []
        data_array = results.get('result', {}).get('data_array', [])
        
        for result in data_array:
            vector_results.append({
                "id": result[0],
                "title": result[1],
                "content": result[2],
                "category": result[3],
                "source": result[4],
                "score": result[-1] if len(result) > 5 else None
            })
        
        print(f"✅ Found {len(vector_results)} results from Databricks Vector Search")
        for i, result in enumerate(vector_results, 1):
            print(f"   {i}. {result['title']} (category: {result['category']})")
        
        return {"vector_results": vector_results}
        
    except Exception as e:
        print(f"❌ Error searching vectors: {e}")
        return {"vector_results": []}

print("✅ Databricks vector search node created")

## 18. Create Web Search Node

In [None]:
def search_web(state: DatabricksRAGState) -> DatabricksRAGState:
    """
    Search the web using DuckDuckGo.
    """
    print("\n" + "="*60)
    print("🌐 WEB SEARCH NODE")
    print("="*60)
    
    if not state.get("use_web_search", False):
        print("⏭️  Skipping web search")
        return {"web_results": []}
    
    query = state["query"]
    
    try:
        print(f"🔎 Searching DuckDuckGo for: {query}")
        raw_results = web_search.run(query)
        
        # Parse results
        web_results = []
        if isinstance(raw_results, str):
            snippets = raw_results.split('snippet: ')
            for snippet in snippets[1:]:
                parts = snippet.split('title: ')
                if len(parts) > 1:
                    title = parts[1].split('link: ')[0].strip()
                    link = parts[1].split('link: ')[1].strip() if 'link: ' in parts[1] else ""
                    web_results.append({
                        "title": title,
                        "snippet": parts[0].strip(),
                        "url": link
                    })
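        elif isinstance(raw_results, list):
            # output_format="list" returns dicts keyed by snippet/title/link;
            # rename "link" to "url" so build_context can cite the source
            web_results = [
                {
                    "title": r.get("title", ""),
                    "snippet": r.get("snippet", ""),
                    "url": r.get("link", r.get("url", ""))
                }
                for r in raw_results if isinstance(r, dict)
            ]
        
        print(f"✅ Found {len(web_results)} web results")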
        for i, result in enumerate(web_results[:3], 1):
            print(f"   {i}. {result.get('title', 'N/A')[:60]}...")
        
        return {"web_results": web_results}
        
    except Exception as e:
        print(f"❌ Error searching web: {e}")
        return {"web_results": []}

print("✅ Web search node created")

## 19. Create Context Builder Node

In [None]:
def build_context(state: DatabricksRAGState) -> DatabricksRAGState:
    """
    Combine results from Databricks and web search.
    """
    print("\n" + "="*60)
    print("🔨 CONTEXT BUILDER NODE")
    print("="*60)
    
    context_parts = []
    sources = []
    
    # Add Databricks Vector Search results
    vector_results = state.get("vector_results", [])
    if vector_results:
        context_parts.append("=== DATABRICKS KNOWLEDGE BASE ===")
        for i, result in enumerate(vector_results, 1):
            context_parts.append(f"\n[Document {i}]")
            context_parts.append(f"Title: {result['title']}")
            context_parts.append(f"Content: {result['content']}")
            context_parts.append(f"Category: {result['category']}")
            sources.append(f"Databricks: {result['source']} ({result['id']})")
        print(f"📊 Added {len(vector_results)} documents from Databricks")
    
    # Add web search results
    web_results = state.get("web_results", [])
    if web_results:
        context_parts.append("\n\n=== WEB SEARCH RESULTS ===")
        for i, result in enumerate(web_results, 1):
            context_parts.append(f"\n[Web Result {i}]")
            context_parts.append(f"Title: {result.get('title', 'N/A')}")
            context_parts.append(f"Content: {result.get('snippet', 'N/A')}")
            url = result.get('url', '')
            if url:
                sources.append(f"Web: {url}")
        print(f"🌐 Added {len(web_results)} web search results")
    
    combined_context = "\n".join(context_parts)
    
    if not combined_context.strip():
        combined_context = "No relevant context found."
    
    print(f"\n✅ Built context: {len(combined_context)} characters")
    print(f"📚 Total sources: {len(sources)}")
    
    return {
        "combined_context": combined_context,
        "sources": sources
    }

print("✅ Context builder node created")

## 20. Create Answer Generator Node

In [None]:
def generate_answer(state: DatabricksRAGState) -> DatabricksRAGState:
    """
    Generate final answer using LLM.
    """
    print("\n" + "="*60)
    print("✨ ANSWER GENERATOR NODE")
    print("="*60)
    
    query = state["query"]
    context = state.get("combined_context", "")
    sources = state.get("sources", [])
    
    generation_prompt = f"""You are a helpful AI assistant with access to both internal 
Databricks documentation and live web search results.

Context:
{context}

Question: {query}

Instructions:
1. Answer the question based on the provided context
2. Distinguish between Databricks knowledge base and web search results
3. If information comes from Databricks docs, emphasize it's from internal knowledge
4. If information comes from web search, note it's from external sources
5. Be specific and cite which sources you're using
6. If context is insufficient, acknowledge the limitation
7. Provide a confidence score (0-1) based on source quality and relevance

Format your response as:
Answer: [your detailed answer]
Confidence: [0.0-1.0]

Answer:"""
    
    try:
        response = llm.invoke([HumanMessage(content=generation_prompt)])
        answer_text = response.content
        
        # Extract confidence if present
        confidence = 0.8  # Default
        if "Confidence:" in answer_text:
            try:
                conf_str = answer_text.split("Confidence:")[1].split()[0]
                confidence = float(conf_str)
            except (IndexError, ValueError):
                pass  # keep the default when the value is missing or not a number
        
        print(f"✅ Generated answer ({len(answer_text)} characters)")
        print(f"📊 Confidence: {confidence:.2f}")
        
        metadata = {
            "vector_results_count": len(state.get("vector_results", [])),
            "web_results_count": len(state.get("web_results", [])),
            "total_sources": len(sources),
            "timestamp": datetime.now().isoformat()
        }
        
        return {
            "answer": answer_text,
            "confidence": confidence,
            "metadata": metadata
        }
        
    except Exception as e:
        print(f"❌ Error generating answer: {e}")
        return {
            "answer": "I encountered an error generating the answer.",
            "confidence": 0.0,
            "metadata": {}
        }

print("✅ Answer generator node created")
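
The generator stores the raw model reply, so the `Confidence:` line stays inside the answer text and an out-of-range value is accepted as-is. A minimal sketch of separating the two and clamping the score to the 0-1 range follows. `split_answer_and_confidence` is a hypothetical helper, and the `0.8` default simply mirrors the one used in the node.

```python
import re
from typing import Tuple


def split_answer_and_confidence(text: str, default: float = 0.8) -> Tuple[str, float]:
    """Separate the answer body from a trailing 'Confidence: x' line."""
    match = re.search(r"Confidence:\s*([0-9]*\.?[0-9]+)", text)
    if not match:
        return text.strip(), default
    # Clamp to [0, 1] in case the model reports something like 1.7.
    confidence = min(1.0, max(0.0, float(match.group(1))))
    answer = text[: match.start()].strip()
    if answer.startswith("Answer:"):
        answer = answer[len("Answer:"):].strip()
    return answer, confidence
```

Returning the cleaned answer separately keeps the score out of the text shown to the user.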

## 21. Build the LangGraph Workflow

In [None]:
# Create the workflow
workflow = StateGraph(DatabricksRAGState)

# Add nodes
workflow.add_node("route", route_query)
workflow.add_node("search_databricks", search_databricks_vectors)
workflow.add_node("search_web", search_web)
workflow.add_node("build_context", build_context)
workflow.add_node("generate", generate_answer)

# Define flow
workflow.set_entry_point("route")

# Parallel retrieval
workflow.add_edge("route", "search_databricks")
workflow.add_edge("route", "search_web")

# Both feed into context builder
workflow.add_edge("search_databricks", "build_context")
workflow.add_edge("search_web", "build_context")

# Context builder to generator
workflow.add_edge("build_context", "generate")

# Generator is the end
workflow.add_edge("generate", END)

# Compile
databricks_rag_app = workflow.compile()

print("\n✅ Databricks RAG Application Ready!")
print("\n📊 Workflow: Route → [Databricks Vector Search + Web Search] → Context → Generate")
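
The two retrieval nodes can run side by side because each returns only the key it owns (`vector_results` or `web_results`). The plain-Python sketch below illustrates that idea of merging partial updates into a shared state. It is an illustration only, not LangGraph's implementation, and `run_step` and the two fake nodes are hypothetical.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def run_step(state: State, nodes: List[Node]) -> State:
    """Run nodes against the same input state and merge their partial updates."""
    updates: State = {}
    for node in nodes:
        update = node(state)
        overlap = updates.keys() & update.keys()
        if overlap:
            # Two branches writing one key would need an explicit merge rule.
            raise ValueError(f"Conflicting writes to: {sorted(overlap)}")
        updates.update(update)
    return {**state, **updates}


def fake_vector_search(state: State) -> State:
    return {"vector_results": [{"title": "Unity Catalog Overview"}]}


def fake_web_search(state: State) -> State:
    return {"web_results": []}


state = run_step(
    {"query": "What is Unity Catalog?"},
    [fake_vector_search, fake_web_search],
)
```

If both branches ever needed to write the same key, the state definition would need a merge rule for that key. The `Annotated` and `operator` imports at the top of the notebook are the usual ingredients for declaring one.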

## 22. Helper Function to Query RAG System

In [None]:
def ask_databricks_rag(query: str, verbose: bool = True) -> Dict[str, Any]:
    """
    Query the Databricks RAG system.
    """
    if verbose:
        print("\n" + "="*80)
        print("🤖 DATABRICKS RAG SYSTEM")
        print("="*80)
        print(f"\n❓ Query: {query}\n")
    
    # Initialize state
    initial_state = {
        "query": query,
        "use_vector_search": False,
        "use_web_search": False,
        "vector_results": [],
        "web_results": [],
        "combined_context": "",
        "sources": [],
        "answer": "",
        "confidence": 0.0,
        "metadata": {}
    }
    
    try:
        # Run workflow
        final_state = databricks_rag_app.invoke(initial_state)
        
        if verbose:
            print("\n" + "="*80)
            print("📝 FINAL ANSWER")
            print("="*80)
            print(final_state["answer"])
            
            print("\n" + "="*80)
            print("📊 METADATA")
            print("="*80)
            metadata = final_state.get("metadata", {})
            print(f"Databricks Results: {metadata.get('vector_results_count', 0)}")
            print(f"Web Results: {metadata.get('web_results_count', 0)}")
            print(f"Confidence: {final_state.get('confidence', 0):.2%}")
            
            if final_state.get("sources"):
                print("\n" + "="*80)
                print("📚 SOURCES")
                print("="*80)
                for i, source in enumerate(final_state["sources"], 1):
                    print(f"{i}. {source}")
        
        return final_state
        
    except Exception as e:
        print(f"\n❌ Error: {e}")
        import traceback
        traceback.print_exc()
        return {"answer": f"Error: {e}", "sources": []}

print("✅ Query helper function ready!")

## 23. Example 1: Query Using Databricks Vector Search Only

In [None]:
# Example 1: Internal knowledge query
result1 = ask_databricks_rag("What is Unity Catalog and how does it work with Delta Lake?")

## 24. Example 2: Query Using Web Search Only

In [None]:
# Example 2: Current events query
result2 = ask_databricks_rag("What are the latest Databricks product announcements in 2024?")

## 25. Example 3: Hybrid Query (Both Sources)

In [None]:
# Example 3: Hybrid query
result3 = ask_databricks_rag(
    "How does Databricks Vector Search compare to other vector databases like Pinecone?"
)

## 26. Example 4: Company Policy Query

In [None]:
# Example 4: Company policy
result4 = ask_databricks_rag("What is our company's data governance policy?")

## 27. Add New Documents to Vector Search

In [None]:
def add_documents_to_databricks(
    documents: List[Dict[str, Any]],
    table_name: str = FULL_TABLE_NAME
):
    """
    Add new documents to Databricks Vector Search.
    
    In production, this would:
    1. Generate embeddings
    2. Insert into Delta table
    3. Sync the Vector Search index (TRIGGERED pipelines need an explicit sync)
    """
    print(f"📝 Adding {len(documents)} documents to Databricks...")
    
    # Generate embeddings
    for doc in documents:
        text_to_embed = f"{doc['title']}: {doc['content']}"
        doc['embedding'] = embeddings_model.embed_query(text_to_embed)
    
    # Convert to DataFrame
    new_docs_df = pd.DataFrame(documents)
    
    # In Databricks notebook with Spark:
    spark_code = f"""
from pyspark.sql.functions import current_timestamp

# Convert to Spark DataFrame
spark_df = spark.createDataFrame(new_docs_df)
spark_df = spark_df.withColumn("created_at", current_timestamp())

# Append to existing table
spark_df.write.format("delta") \\
    .mode("append") \\
    .saveAsTable("{table_name}")

print("✅ Added {len(documents)} documents to {table_name}")

# The index above uses pipeline_type="TRIGGERED", so start a sync after appending;
# a CONTINUOUS pipeline would pick up new rows on its own
print("⏳ Trigger an index sync to make the new documents searchable")
"""
    
    print("\nRun this code in your Databricks notebook:")
    print(spark_code)
    
    return new_docs_df

# Example: Add a new document
new_documents = [
    {
        "id": "doc_008",
        "content": """Databricks Model Serving provides a unified interface for deploying 
        and serving ML models at scale. It supports both real-time and batch inference, 
        automatic scaling, and A/B testing capabilities.""",
        "title": "Databricks Model Serving",
        "category": "ml_serving",
        "source": "model_serving_docs",
        "date": "2024-02-10"
    }
]

new_docs_df = add_documents_to_databricks(new_documents)
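
The function above embeds one document per call. When many documents are added at once, batching the embedding requests usually reduces round trips. The following is a minimal sketch under stated assumptions: `embed_in_batches` and `fake_embed_batch` are hypothetical names introduced here, and the embedding function is passed in as a parameter so the example runs without any external service.

```python
from typing import Any, Callable, Dict, List


def embed_in_batches(
    documents: List[Dict[str, Any]],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 16,
) -> List[Dict[str, Any]]:
    """Return copies of the documents with an 'embedding' field added."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    enriched: List[Dict[str, Any]] = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # Same "title: content" text the notebook embeds per document
        texts = [f"{doc['title']}: {doc['content']}" for doc in batch]
        vectors = embed_batch(texts)  # one call per batch
        if len(vectors) != len(batch):
            raise ValueError("embedding function returned the wrong number of vectors")
        for doc, vector in zip(batch, vectors):
            enriched.append({**doc, "embedding": vector})  # copy, do not mutate input
    return enriched


# Stand-in embedding function for illustration only; it is not a real model.
def fake_embed_batch(texts: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in texts]


docs = [{"title": "A", "content": "alpha"}, {"title": "B", "content": "beta"}]
result = embed_in_batches(docs, fake_embed_batch, batch_size=1)
print([doc["embedding"] for doc in result])
```

With a LangChain embeddings object such as the notebook's `embeddings_model`, the `embed_documents` method could be passed as `embed_batch`, since it accepts a list of strings and returns a list of vectors.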

## 28. Monitor Vector Search Performance

In [None]:
def get_index_status():
    """
    Check the status of the vector search index.
    """
    try:
        index = vector_search_client.get_index(
            endpoint_name=VECTOR_SEARCH_ENDPOINT,
            index_name=FULL_INDEX_NAME
        )
        
        print("\n" + "="*60)
        print("📊 VECTOR SEARCH INDEX STATUS")
        print("="*60)
        print(f"Index Name: {FULL_INDEX_NAME}")
        print(f"Endpoint: {VECTOR_SEARCH_ENDPOINT}")
        print(f"Source Table: {FULL_TABLE_NAME}")
        print(f"\nStatus: Check Databricks UI for detailed metrics")
        print(f"Path: Compute > Vector Search > {VECTOR_SEARCH_ENDPOINT}")
        
        return index
        
    except Exception as e:
        print(f"❌ Error getting index status: {e}")
        return None

# Check status
index_status = get_index_status()
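
The function above confirms that the index can be fetched but does not report its state. The index handle in the `databricks-vectorsearch` client exposes a `describe()` method that returns a dictionary of index metadata. The sketch below summarizes such a dictionary. The assumed layout (`status.ready`, `status.detailed_state`, `status.indexed_row_count`) should be verified against your workspace, and `summarize_index_status` is a hypothetical helper introduced here.

```python
from typing import Any, Dict


def summarize_index_status(description: Dict[str, Any]) -> Dict[str, Any]:
    """Extract readiness fields from an index description dictionary.

    Assumed layout: {"status": {"ready": bool, "detailed_state": str,
    "indexed_row_count": int}}. Missing fields fall back to defaults.
    """
    status = description.get("status") or {}
    return {
        "ready": bool(status.get("ready", False)),
        "state": status.get("detailed_state", "UNKNOWN"),
        "indexed_rows": status.get("indexed_row_count"),
    }


# Sample dictionary for illustration only (not real output)
sample_description = {
    "status": {"ready": True, "detailed_state": "ONLINE", "indexed_row_count": 8}
}
print(summarize_index_status(sample_description))
```

With a live index, this could be called as `summarize_index_status(index_status.describe())` once `index_status` is not `None`.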

## 29. Production Deployment Checklist

### 1. Infrastructure Setup
- ✅ Create dedicated Databricks workspace for production
- ✅ Set up Unity Catalog with proper governance
- ✅ Configure Vector Search endpoint (choose an endpoint type that matches production query load, e.g. `STANDARD`)
- ✅ Set up Delta tables with appropriate partitioning

### 2. Security & Governance
- ✅ Implement row-level and column-level security
- ✅ Set up audit logging
- ✅ Configure access controls (AAD/OAuth)
- ✅ Enable data lineage tracking
- ✅ Encrypt sensitive data

### 3. Performance Optimization
- ✅ Optimize embedding generation (batch processing)
- ✅ Implement caching for frequent queries
- ✅ Set up continuous indexing for real-time updates
- ✅ Monitor query latency and throughput
- ✅ Use Z-ordering for Delta tables
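
Caching for frequent queries can be sketched as a small in-memory cache with a time-to-live. This is an illustration only: `TTLCache` and `fake_rag` are hypothetical names, and a production deployment would more likely use a shared cache service. The sketch assumes the RAG entry point takes a query string and returns a result object.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, query: str, compute: Callable[[str], Any]) -> Any:
        key = " ".join(query.lower().split())  # normalize case and whitespace
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]  # fresh hit: skip the expensive call
        value = compute(query)
        self._entries[key] = (now, value)
        return value


calls = []


def fake_rag(query: str) -> dict:
    """Stand-in for a real RAG call; records each invocation."""
    calls.append(query)
    return {"answer": f"answer to: {query}", "sources": []}


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_compute("What is Unity Catalog?", fake_rag)
second = cache.get_or_compute("what is  unity catalog?", fake_rag)
print(len(calls))  # the second lookup is served from the cache
```

Answers that draw on web search go stale quickly, so a short TTL (or skipping the cache for web-routed queries) is the safer default.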

### 4. Monitoring & Observability
- ✅ Set up MLflow tracking for model versions
- ✅ Implement logging for all RAG operations
- ✅ Create dashboards for system metrics
- ✅ Set up alerts for anomalies
- ✅ Track user feedback and quality metrics
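
Logging for RAG operations and latency tracking can start with a decorator built on the standard library. The sketch below is one possible shape, not the notebook's implementation; `log_latency` and `answer_query` are hypothetical names.

```python
import functools
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("rag")


def log_latency(func: Callable[..., Any]) -> Callable[..., Any]:
    """Log how long each call takes, and log failures before re-raising."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed = time.perf_counter() - start
            logger.exception("%s failed after %.3fs", func.__name__, elapsed)
            raise
        elapsed = time.perf_counter() - start
        logger.info("%s succeeded in %.3fs", func.__name__, elapsed)
        return result

    return wrapper


@log_latency
def answer_query(query: str) -> dict:
    # Placeholder for a real RAG call
    return {"answer": query.upper(), "sources": []}


logging.basicConfig(level=logging.INFO)
print(answer_query("hello")["answer"])
```

The same decorator could wrap individual graph nodes (retrieval, web search, generation) so that dashboards can break latency down by stage.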

### 5. Cost Optimization
- ✅ Use appropriate cluster sizes
- ✅ Implement auto-scaling
- ✅ Monitor compute and storage costs
- ✅ Optimize query patterns
- ✅ Consider spot instances for non-critical workloads

## 30. Best Practices & Troubleshooting

### Vector Search Best Practices

**1. Index Management:**
```python
# Use CONTINUOUS sync for real-time updates
pipeline_type="CONTINUOUS"

# Use TRIGGERED for batch updates
pipeline_type="TRIGGERED"
```
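
The practical difference between the two modes is who starts the sync after new rows land in the Delta table. The sketch below assumes the index handle exposes a `sync()` method, as the `databricks-vectorsearch` index object does for Delta Sync indexes. `refresh_index` and `_FakeIndex` are hypothetical names, and the fake index exists only so the example runs without a workspace.

```python
def refresh_index(index, pipeline_type: str) -> bool:
    """Trigger a sync for TRIGGERED pipelines; CONTINUOUS ones need none.

    Returns True if a sync was requested, False otherwise.
    """
    mode = pipeline_type.upper()
    if mode == "TRIGGERED":
        index.sync()  # ask Vector Search to pick up new table rows
        return True
    if mode == "CONTINUOUS":
        return False  # the pipeline keeps the index up to date on its own
    raise ValueError(f"Unknown pipeline_type: {pipeline_type!r}")


class _FakeIndex:
    """Stand-in for a real index handle; counts sync requests."""

    def __init__(self) -> None:
        self.sync_calls = 0

    def sync(self) -> None:
        self.sync_calls += 1


fake_index = _FakeIndex()
print(refresh_index(fake_index, "TRIGGERED"))
print(refresh_index(fake_index, "CONTINUOUS"))
```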

**2. Query Optimization:**
- Use appropriate `num_results` (3-5 for most cases)
- Add filters for better relevance
- Leverage metadata for filtering
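
The points above can be sketched as a helper that assembles the arguments for a similarity search. The keyword names (`query_vector`, `columns`, `filters`, `num_results`) follow the `similarity_search` method of the `databricks-vectorsearch` index object. The dictionary filter form is an assumption that may not apply to every endpoint type, and `build_search_kwargs` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def build_search_kwargs(
    query_vector: List[float],
    columns: List[str],
    num_results: int = 4,
    filters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a vector similarity search."""
    if num_results < 1:
        raise ValueError("num_results must be at least 1")
    if not columns:
        raise ValueError("request at least one column")
    kwargs: Dict[str, Any] = {
        "query_vector": query_vector,
        "columns": columns,  # request only the columns you need
        "num_results": num_results,
    }
    if filters:
        kwargs["filters"] = filters  # metadata filter, e.g. on 'category'
    return kwargs


search_kwargs = build_search_kwargs(
    query_vector=[0.1, 0.2, 0.3],
    columns=["id", "title", "content"],
    filters={"category": "ml_serving"},
)
print(search_kwargs)
```

With a live index handle, the result could be passed as `index.similarity_search(**search_kwargs)`.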

**3. Embedding Strategy:**
- Combine title and content for richer embeddings
- Normalize text before embedding
- Use consistent embedding models
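
A minimal sketch of the first two points: normalize whitespace and Unicode form, then join title and content in the same `title: content` layout the notebook embeds. The helper names are hypothetical, and the normalization steps shown are one reasonable choice rather than a requirement.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def build_embedding_text(title: str, content: str) -> str:
    """Combine title and content into the string that gets embedded."""
    return f"{normalize_text(title)}: {normalize_text(content)}"


# Triple-quoted content often carries newlines and indentation like this
print(build_embedding_text(
    "Databricks Model Serving",
    "Serves ML models \n        at scale.",
))
# → Databricks Model Serving: Serves ML models at scale.
```

Applying the same function at indexing time and at query time keeps the two sides consistent.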

### Common Issues

**Issue 1: Index Not Found**
- Verify index name is fully qualified (catalog.schema.index)
- Check index creation completed successfully
- Ensure endpoint is ONLINE
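
The first check is easy to automate before calling the API. This sketch validates the three-part form only; it does not handle backtick-quoted identifiers that contain dots. `split_qualified_name` is a hypothetical helper and the example names are made up.

```python
from typing import Tuple


def split_qualified_name(name: str) -> Tuple[str, str, str]:
    """Split 'catalog.schema.index' into its parts, or raise ValueError."""
    parts = [part.strip() for part in name.split(".")]
    if len(parts) != 3 or any(not part for part in parts):
        raise ValueError(
            f"Expected 'catalog.schema.index', got {name!r}"
        )
    return parts[0], parts[1], parts[2]


# Hypothetical names for illustration
print(split_qualified_name("main.rag_demo.docs_index"))

try:
    split_qualified_name("docs_index")
except ValueError as err:
    print(err)
```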

**Issue 2: Slow Queries**
- Check index sync status
- Optimize table partitioning
- Use column pruning in queries
- Consider increasing endpoint capacity

**Issue 3: Poor Result Quality**
- Improve document chunking strategy
- Enhance metadata richness
- Try different embedding models
- Implement reranking
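
As one example of a chunking strategy, the sketch below splits text into fixed-size word windows that overlap, so a sentence cut at a boundary still appears whole in a neighboring chunk. It is a simple baseline, not the notebook's method: `chunk_words` is a hypothetical helper, and the default sizes are placeholders to tune against your own documents.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of chunk_size words sharing `overlap` words."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# → ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk would then be embedded and stored as its own row, with the parent document's metadata copied onto it.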

### Resources

- [Databricks Vector Search Docs](https://docs.databricks.com/en/generative-ai/vector-search.html)
- [Unity Catalog Guide](https://docs.databricks.com/en/data-governance/unity-catalog/index.html)
- [Delta Lake Documentation](https://docs.delta.io/)
- [LangChain Databricks Integration](https://python.langchain.com/docs/integrations/vectorstores/databricks_vector_search)