# Hands-On Workshop: Hybrid Retrieval Process with LLM-Generated SPARQL Queries

Welcome to the hands-on workshop on the hybrid retrieval process! This notebook demonstrates how to combine vector search and SPARQL queries to retrieve and analyze supplier information from both unstructured and structured data. The SPARQL queries are generated dynamically by a large language model (LLM).

## Objectives
By the end of this workshop, you will:
- Understand how to initialize and use the hybrid retrieval process.
- Learn how to generate SPARQL queries dynamically using LLMs.
- Combine results from vector search and SPARQL to generate meaningful answers.

## Prerequisites
Before running this notebook, ensure the following:
- Python 3.11 or higher is installed.
- All dependencies listed in `requirements.txt` are installed.
- The configuration files (`env_cloud.json` and `env_config.json`) are properly set up in the `config/` directory.
- The SAP HANA database is accessible and contains the required data.

## Step 1: Import Required Modules
We will start by importing the modules used throughout the notebook. The clients themselves are initialized in Step 3.

In [None]:
import os
import json
from hdbcli import dbapi
from gen_ai_hub.proxy.langchain.openai import OpenAIEmbeddings, ChatOpenAI
from gen_ai_hub.proxy.gen_ai_hub_proxy import GenAIHubProxyClient
from ai_core_sdk.ai_core_v2_client import AICoreV2Client
from langchain_community.vectorstores.hanavector import HanaDB
from langchain.prompts import PromptTemplate

print("Modules imported successfully!")

## Step 2: Load Configuration Files
We will load the configuration files (`env_cloud.json` and `env_config.json`) to set up the SAP HANA and AI Core environments.

In [None]:
# Load configuration files
with open('config/env_cloud.json') as f:
    hana_env_c = json.load(f)

with open('config/env_config.json') as f:
    aicore_config = json.load(f)

print("Configuration files loaded successfully!")
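
A key missing from either file would otherwise only surface as a `KeyError` when the clients are initialized in Step 3. The following is a minimal sketch of an early check, assuming the key names that Step 3 reads. The helper `missing_keys` and the sample dictionary are hypothetical and not part of the workshop code.

```python
# Keys that Step 3 reads from the two configuration files.
HANA_KEYS = ["url", "port", "user", "pwd"]
AICORE_KEYS = [
    "AICORE_BASE_URL",
    "AICORE_AUTH_URL",
    "AICORE_CLIENT_ID",
    "AICORE_CLIENT_SECRET",
    "AICORE_RESOURCE_GROUP",
]


def missing_keys(config, required):
    """Return the required keys that are absent or empty in config."""
    return [key for key in required if not config.get(key)]


# Hypothetical, incomplete HANA configuration used only for illustration.
sample_hana_config = {"url": "example.hana.host", "port": 443, "user": "WORKSHOP"}
print(missing_keys(sample_hana_config, HANA_KEYS))
```

In the notebook, the same helper could be applied to `hana_env_c` and `aicore_config` right after loading them, stopping early if either list is non-empty.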

## Step 3: Initialize SAP HANA and AI Core Clients
We will initialize the SAP HANA database connection and the AI Core client for embedding and language model operations.

In [None]:
# Initialize AI Core client
ai_core_client = AICoreV2Client(
    base_url=aicore_config['AICORE_BASE_URL'],
    auth_url=aicore_config['AICORE_AUTH_URL'],
    client_id=aicore_config['AICORE_CLIENT_ID'],
    client_secret=aicore_config['AICORE_CLIENT_SECRET'],
    resource_group=aicore_config['AICORE_RESOURCE_GROUP']
)

# Initialize GenAI Hub Proxy Client
proxy_client = GenAIHubProxyClient(ai_core_client=ai_core_client)

# Initialize OpenAI Embedding Model
embedding_model = OpenAIEmbeddings(proxy_model_name='text-embedding-ada-002', proxy_client=proxy_client)

# Initialize LLM for SPARQL query generation
llm = ChatOpenAI(proxy_model_name='gpt-4o', proxy_client=proxy_client)

# Connect to SAP HANA
HANA_SCHEMA = "RAG"
HANA_EMBED_TABLE = "SUPPLIERS_EMBED_ADA"

conn_db_api = dbapi.connect(
    address=hana_env_c['url'],
    port=hana_env_c['port'],
    user=hana_env_c['user'],
    password=hana_env_c['pwd'],
    currentSchema=HANA_SCHEMA
)

# Initialize HanaDB Vector Store
db_ada_table = HanaDB(
    embedding=embedding_model,
    connection=conn_db_api,
    table_name=HANA_EMBED_TABLE,
    content_column="CONTENT",
    metadata_column="METADATA",
    vector_column="VECTOR"
)

print("SAP HANA and AI Core clients initialized successfully!")

## Step 4: Vector Search
In this step, we will retrieve relevant unstructured data using vector embeddings.

In [None]:
# Define a sample question
question = "Which suppliers are located in high-risk countries?"

# Retrieve vector context
def retrieve_vector_context(question, top_k=25):
    """Return the top_k document chunks most similar to the question."""
    retriever = db_ada_table.as_retriever(search_kwargs={'k': top_k})
    return retriever.invoke(question)

vector_context = retrieve_vector_context(question)
print("Vector Context Retrieved:")
print(vector_context)
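
The retriever returns a list of LangChain `Document` objects, each exposing `page_content` and `metadata`, so printing the list directly can be hard to read. Below is a rough sketch of a formatter. The function `format_documents` is hypothetical, and the sample objects are `SimpleNamespace` stand-ins with illustrative content so the snippet runs without a database.

```python
from types import SimpleNamespace


def format_documents(docs, max_chars=200):
    """Render each retrieved chunk as a numbered, truncated line of text."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        text = " ".join(doc.page_content.split())  # collapse all whitespace runs
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        lines.append(f"[{i}] {text}")
    return "\n".join(lines)


# Stand-ins for retrieved Document objects (illustrative content only).
sample_docs = [
    SimpleNamespace(
        page_content="StandSolutions is a manufacturer\nbased in Moscow.",
        metadata={},
    ),
    SimpleNamespace(page_content="x" * 300, metadata={}),
]
print(format_documents(sample_docs))
```

With the real retriever output, `print(format_documents(vector_context))` would give one short line per chunk instead of the raw object representation.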

## Step 5: Generate SPARQL Query with LLM
We will now generate a SPARQL query dynamically using the LLM based on the RDF context and the user question.

In [None]:
# Define the context about your RDF schema
rdf_context = """
    Your RDF graph uses the following structure:

    Namespaces:
    - rag: <http://sap.com/rag/>

    Supplier entities (identified by SUPPLIER_NAME) have the following properties:
    - rag:locatedIn → the supplier's country (IRI reference).
    - rag:hasSupplierType → type of supplier (Literal: "Manufacturer" or "Reseller").
    - rag:hasSupplierId → supplier ID (Literal string).
    - rag:hasAddress → supplier address (Literal string).
    - rag:locatedInCity → supplier city (Literal string).
    - rag:hasEmail → supplier email (Literal string).
    - rag:hasPhone → supplier phone (Literal string).
    - rag:hasWebsite → supplier website (Literal string).

    Country entities (identified by SUPPLIER_COUNTRY) have the following properties:
    - rag:hasGeopoliticalRisk → geopolitical risk level (Literal string: "High", "Medium", "Low").

    Example triples:
    - rag:StandSolutions rag:locatedIn rag:Russia .
    - rag:StandSolutions rag:hasSupplierType "Manufacturer" .
    - rag:StandSolutions rag:hasAddress "Tverskaya St 7, Moscow, Russia" .
    - rag:Russia rag:hasGeopoliticalRisk "High" .

    """

# Define SPARQL generation prompt
sparql_prompt_template = PromptTemplate(
    template="""
        Generate a SPARQL query for the user question.

        Given the following RDF context: {rdf_context}
        
        And the user question:
        '{question}'

        Instructions:
        - Query the GRAPH <rag_suppliers>.
        - Always use the 'rag:' prefix for entities and predicates.
        - Available query variables include: ?supplierName, ?supplierType, ?supplierId, ?address, ?city, ?email, ?phone, ?website, ?country, ?risk.
        - You must only use available variables. Do NOT use any incorrect or undefined variables.
        - Every variable you SELECT must be BOUND (linked to an entity or literal) inside WHERE.
        - If you need to access values like geopolitical risk ("High", "Low"), you must bind it first:
            Example: ?country rag:hasGeopoliticalRisk ?risk .
        - Apply FILTERS *after* binding the variable.
        - DO NOT directly compare inside triple patterns like ?country rag:hasGeopoliticalRisk "High" if you SELECT ?risk.
        - When using IRIs for suppliers or countries, replace spaces with underscores (e.g., 'North Korea' → North_Korea). Do not remove spaces entirely.
        - Return ONLY the SPARQL query (PREFIX ... SELECT ... FROM ... WHERE ...).
        - Do NOT wrap it with CALL SPARQL_EXECUTE.
        - Format it as SPARQL query (NO code blocks like ```sparql or ```).

        Add the prefix declaration at the start of the query, like:
        PREFIX rag: <http://sap.com/rag/>

        Example:
        PREFIX rag: <http://sap.com/rag/>
        SELECT ?supplierName ?supplierType ?country ?risk
        FROM <rag_suppliers>
        WHERE {{
            ?supplierName rag:locatedIn ?country .
            ?supplierName rag:hasSupplierType ?supplierType .
            ?country rag:hasGeopoliticalRisk ?risk .
            FILTER(?risk = "High")
        }}
        
        """,
    input_variables=["rdf_context", "question"]
)

# Generate SPARQL query using LLM
sparql_llm_chain = sparql_prompt_template | llm
sparql_query = sparql_llm_chain.invoke({
    "rdf_context": rdf_context,
    "question": question
}).content.strip()

print("Generated SPARQL Query:")
print(sparql_query)
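
Even with explicit instructions, an LLM may occasionally wrap its output in Markdown code fences or return something other than a read-only query. The sketch below shows one possible sanity check before execution. The function `clean_sparql` is hypothetical, and the keyword check is deliberately coarse: it would also reject a valid query whose string literals happen to contain a word such as "Load".

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly for readability


def clean_sparql(raw):
    """Strip surrounding code fences and reject anything that is not a SELECT query."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, with or without a language tag.
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.rstrip().endswith(FENCE):
        text = text.rstrip()[: -len(FENCE)]
    text = text.strip()
    if not re.search(r"\bSELECT\b", text, flags=re.IGNORECASE):
        raise ValueError("Expected a SELECT query")
    # Coarse guard against SPARQL Update keywords.
    if re.search(r"\b(INSERT|DELETE|DROP|CLEAR|LOAD|CREATE)\b", text, flags=re.IGNORECASE):
        raise ValueError("Update operations are not allowed")
    return text


# Illustrative LLM output that ignored the "no code blocks" instruction.
sample = (
    FENCE + "sparql\n"
    "PREFIX rag: <http://sap.com/rag/>\n"
    "SELECT ?country\n"
    "FROM <rag_suppliers>\n"
    "WHERE { ?country rag:hasGeopoliticalRisk ?risk . }\n"
    + FENCE
)
print(clean_sparql(sample))
```

In the notebook this could be applied as `sparql_query = clean_sparql(sparql_query)` before Step 6.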

## Step 6: Execute SPARQL Query
We will execute the SPARQL query and retrieve results from the knowledge graph.

In [None]:
# Execute the SPARQL_EXECUTE stored procedure
cursor = conn_db_api.cursor()
try:
    resp = cursor.callproc('SPARQL_EXECUTE', (
        sparql_query,
        'Accept: application/sparql-results+csv',
        '?',
        None
    ))
    # resp[2] holds the query response, resp[3] the response metadata
    results = resp[2]
    metadata = resp[3]
except Exception as e:
    # Chain the original exception so the full traceback is preserved
    raise RuntimeError(f"SPARQL_EXECUTE failed: {e}") from e
finally:
    # Release the cursor whether or not the call succeeded
    cursor.close()

# Print results
print("Query Response:", results)
print("Response Metadata:", metadata)

kg_context = results, metadata
print("\nKnowledge Graph Context Retrieved:")
print(kg_context)
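
Because the request asks for `application/sparql-results+csv`, the response can be turned into rows with the standard `csv` module. This is a minimal sketch, assuming `results` arrives as a CSV string whose header row lists the selected variable names. The function `parse_sparql_csv` is hypothetical, and the sample response is illustrative, modeled on the example triples in Step 5.

```python
import csv
import io


def parse_sparql_csv(csv_text):
    """Parse SPARQL CSV results into a list of dicts keyed by variable name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Illustrative response shaped like the example triples in Step 5.
sample_csv = (
    "supplierName,country,risk\n"
    "http://sap.com/rag/StandSolutions,http://sap.com/rag/Russia,High\n"
)
rows = parse_sparql_csv(sample_csv)

# Keep only the local name after the last slash of each supplier IRI.
suppliers = [row["supplierName"].rsplit("/", 1)[-1] for row in rows]
print(suppliers)
```

Structured rows like these make it easier to check which suppliers the knowledge graph actually returned before passing the context to the final prompt.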

## Step 7: Generate Final Answer
Finally, we will combine the results from vector search and SPARQL to generate a meaningful answer.

In [None]:
final_prompt_template = PromptTemplate(
    template="""
        Context: You are tasked with helping retrieve and summarize supplier information.

        Available information:
        - Unstructured document chunks ({vector_context}).
        - Structured knowledge graph results ({kg_context}).

        User Question:
        {question}

        Instructions:
        - Only suppliers that appear in the knowledge graph may be included in the final answer.
        - Use unstructured document data *only* to enrich or clarify facts about suppliers already found in the knowledge graph.
        - Do not include any supplier names, risk classifications, or attributes that are mentioned *only* in unstructured data.
        - If a supplier appears in documents but is not present in the knowledge graph, exclude it from the answer and state: "Information not available."
        - Do not infer or assume facts — all conclusions must be backed by knowledge graph validation.
        - Optionally, include supporting information from unstructured data *if* it aligns with the knowledge graph result.
        - Prefer clarity and structured presentation (use lists or bullets). Optionally, return structured JSON if many suppliers are involved.

        Return:
        - A clean, human-readable summary limited to validated suppliers only.
        - Ensure all risk indicators are grounded in validated graph entries.

        """,
    input_variables=["vector_context", "kg_context", "question"]
)

# Generate final answer using LLM
final_answer_llm_chain = final_prompt_template | llm
final_answer = final_answer_llm_chain.invoke({
    "vector_context": vector_context,
    "kg_context": kg_context,
    "question": question
}).content.strip()

print("\n=== Final Answer ===\n")
print(final_answer)

## Step 8: Close Database Connection
Ensure the database connection is closed when the notebook is no longer in use.

In [None]:
# Cleanup
if conn_db_api:
    conn_db_api.close()
    print("Database connection closed.")
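
When the same steps run as a script rather than as notebook cells, an error in an earlier step would skip this cleanup. One way to guard against that is `contextlib.closing`, which calls `close()` on exit even if an exception is raised. The sketch below uses a hypothetical `FakeConnection` stand-in so it runs without a database; the assumption is that the real connection object can be wrapped the same way, since it exposes `close()`.

```python
from contextlib import closing


class FakeConnection:
    """Stand-in for a database connection; only tracks whether close() was called."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


conn = FakeConnection()
try:
    with closing(conn):
        # Simulate a step failing while the connection is open.
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(conn.closed)
```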