## Querying the RAG System with LangChain ##
An LLM is a very capable tool, but its knowledge is limited to the public data it was trained on. It doesn't know about Parasol Company's internal IT procedures or the solutions to our past incidents. How can we make it answer questions using our specific, private data?

There are a few ways to solve this problem:

- **Full Retraining**: This involves re-training the entire model from scratch with our data included. This is incredibly expensive and time-consuming, feasible for only a handful of organizations.

- **Fine-Tuning**: We can "tune" an existing model on our data. This is much faster and cheaper. It's excellent for teaching the model a specific style, tone, or new skill, but less effective for injecting large amounts of factual knowledge. The model must also be re-tuned whenever the data is updated.

- **Retrieval-Augmented Generation (RAG)**: This is the technique we will use. We put our private knowledge into an external database (in our case, a `Milvus` vector database) and "retrieve" the most relevant pieces of information when a user asks a question. We then feed this retrieved context, along with the original question, to the LLM. The LLM uses this specific context to generate a highly relevant and accurate answer. This is powerful because we can continuously update our knowledge base without ever having to retrain the model.
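
To make the "update the knowledge base without retraining" point concrete, here is a minimal, standard-library-only sketch. It is not how Milvus or LangChain work internally: `ToyKnowledgeBase`, `tokenize`, and `build_prompt` are hypothetical names, the ranking uses plain word overlap instead of vector embeddings, and the incident texts are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToyKnowledgeBase:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.documents = []

    def add(self, text):
        # Adding a document makes it searchable right away.
        self.documents.append(text)

    def retrieve(self, question, k=1):
        """Return the k documents sharing the most words with the question."""
        q_words = tokenize(question)
        ranked = sorted(
            self.documents,
            key=lambda doc: len(q_words & tokenize(doc)),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(question, context_docs):
    """Combine retrieved context and the question into one prompt string."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion:\n{question}"


# Made-up example data, not real Parasol incidents.
kb = ToyKnowledgeBase()
kb.add("VPN login failures were resolved by resetting the user's token.")
kb.add("Printer queue stuck: restarted the print spooler service.")

question = "How do we fix a stuck printer queue?"
before = kb.retrieve(question)

# New knowledge is added to the store; no model is retrained.
kb.add("Email sync errors were fixed by re-creating the mail profile.")
after = kb.retrieve("How do we fix email sync errors?")

print(build_prompt(question, before))
```

The prompt printed at the end is what would be sent to the LLM. The model itself never changes; only the contents of the store do.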

In the previous step, your data science pipeline successfully fetched closed incident tickets from our mock `ServiceNow API`, generated vector embeddings from them, and loaded them into a Milvus database.

In this notebook, we will use RAG to ask questions about IT support issues and see how the LLM can provide precise answers based on the historical incident data we just ingested.

### Requirements and Imports

Import the required libraries.

In [None]:
# Uncomment the following line only if you have not selected the right workbench image, or are using this notebook outside of the workshop environment.
# !pip install --no-cache-dir --no-dependencies --disable-pip-version-check -r requirements.txt

import json
import transformers
from langchain.chains import RetrievalQA
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate
from langchain_community.llms import VLLMOpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus # Use the standard Milvus vector store

# Turn off warnings when downloading the embedding model
transformers.logging.set_verbosity_error()

### LangChain elements

Again, we are going to use LangChain to define our task pipeline.

First, the **LLM** where we will send our queries.

In [None]:
# LLM Inference Server URL
inference_server_url = "http://granite-3-1-8b-instruct-predictor.shared-llm.svc.cluster.local:8080"

# LLM definition
llm = VLLMOpenAI(           # vLLM's OpenAI-compatible API client; the model itself runs on OpenShift AI, not OpenAI.
    openai_api_key="EMPTY",   # That is why no real OpenAI key is needed here.
    openai_api_base=f"{inference_server_url}/v1",
    model_name="granite-3-1-8b-instruct",
    top_p=0.92,
    temperature=0.01,
    max_tokens=512,
    presence_penalty=1.03,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

Next, the connection to the **vector database**. It holds the ServiceNow data that was pulled from the mock API and stored as embeddings by the pipeline we deployed and ran in the earlier step.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

# 1. Define the embedding model (must match the ingestion pipeline)
print("Loading embedding model...")
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'},
    show_progress=False,
)
print("Embedding model loaded.")

# 2. Define connection arguments
connection_args = {
    "host": "vectordb-milvus", # The Kubernetes service name for Milvus
    "port": "19530"
}
print(f"Connecting to Milvus at: {connection_args['host']}:{connection_args['port']}")

# 3. Connect to the Milvus vector store, specifying the correct text and vector fields
vector_db = Milvus(
    embedding_function=embeddings,
    connection_args=connection_args,
    collection_name="servicenow_incidents",
    vector_field="embedding",      # Specify the name of your vector field
    text_field="resolution_notes"  # Field whose text LangChain returns as page_content
)
print("Successfully connected to Milvus collection 'servicenow_incidents'.")


# 4. Create a retriever to search for relevant documents
retriever = vector_db.as_retriever(search_kwargs={"k": 3})
print("Retriever created successfully.")

Later in this notebook we will define the prompt **template** used to make our query. Note that this template contains a `{context}` placeholder. That's where the documents returned from the vector database will be injected.

### Step 1: Retrieval - Finding Relevant Documents

Let's first understand what happens during the **retrieval** step. We'll manually search the vector database to see what documents are considered most relevant to our query, and then show how these documents are used as context for the LLM.
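
As a rough mental model of what "most relevant" means here, the sketch below ranks a few hand-written vectors by cosine similarity to a query vector and keeps the top `k`. This is only an illustration under stated assumptions: the three-dimensional vectors and `doc_*` names are invented, real embeddings have many more dimensions, and the similarity metric actually used depends on how the Milvus collection's index was configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the ids of the k vectors most similar to query_vec."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in indexed.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "embeddings"; a real embedding model produces these vectors.
toy_index = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.2, 0.0]

nearest = top_k(query_vec, toy_index, k=3)
print(nearest)
```

The `k` argument plays the same role as `search_kwargs={"k": 3}` on the retriever: it caps how many documents are passed on as context.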



In [None]:
# Define our query
query = "Give me information on INC001004"
print(f"🔍 Query: '{query}'\n")

# Step 1: RETRIEVAL - Let's see what documents the vector database finds
print("=" * 60)
print("STEP 1: RETRIEVAL - Finding relevant documents")
print("=" * 60)

# Perform the search manually to show the retrieval step
# (invoke() replaces the deprecated get_relevant_documents() in recent LangChain releases)
retrieved_docs = retriever.invoke(query)

print(f"📊 Found {len(retrieved_docs)} relevant documents (top-{len(retrieved_docs)}):\n")

for i, doc in enumerate(retrieved_docs, 1):
    print(f"📄 Document {i}:")
    print(f"   Incident ID: {doc.metadata.get('incident_pk', 'N/A')}")
    print(f"   Description: {doc.metadata.get('short_description', 'N/A')}")
    print(f"   Resolution: {doc.page_content[:200]}{'...' if len(doc.page_content) > 200 else ''}")
    print("-" * 50)

print("\n💡 These documents will now be used as context for the LLM to generate an answer.\n")

### Step 2: Generation - Using Retrieved Context to Generate an Answer

Now let's see how the LLM uses the retrieved documents as context to generate a relevant answer.

In [None]:
# Step 2: GENERATION - Using the retrieved documents to generate an answer
print("=" * 60)
print("STEP 2: GENERATION - Creating answer using retrieved context")
print("=" * 60)

# 1. Define a prompt template that shows the LLM how to use the retrieved context
prompt_template_str = """
<|system|>
You are a helpful, respectful and honest assistant named "Parasol Assistant".
You will be given context from past incident tickets and a question.
Your answer should be based only on the provided context.
If the context does not contain the answer, say that you don't have enough information.

Context:
{context}

Question:
{question}
<|assistant|>
"""

prompt = PromptTemplate(
    template=prompt_template_str, input_variables=["context", "question"]
)

# 2. Create the RAG chain with the prompt
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

# 3. Execute the full RAG chain (retrieval + generation)
print("🤖 Sending query to LLM with retrieved context...")
resp = rag_chain.invoke({"query": query})

# 4. Display the final answer
print("\n🎯 Final Answer from LLM:\n")
print("-" * 30)
print(resp["result"])
print("-" * 30)

print(f"\n✅ This answer was generated using the {len(resp['source_documents'])} documents retrieved above as context.")


## Understanding the RAG Process

Congratulations! You've now seen the complete RAG process broken down into its two main components:

1. **🔍 Retrieval**: The system searched the Milvus vector database and, using vector embeddings, found the documents most semantically similar to your query.

2. **🤖 Generation**: The LLM used those retrieved documents as context to generate a specific, relevant answer based on your company's actual incident data.

This two-step process is what makes RAG so powerful: it combines the reasoning capabilities of large language models with the specific, up-to-date knowledge stored in your private databases.
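
To make the generation step more concrete, the sketch below approximates what the `chain_type="stuff"` setting used earlier does conceptually: the text of every retrieved document is joined into one string and substituted into the `{context}` slot of the prompt. This is a simplified illustration, not LangChain's implementation. `stuff_prompt` is a hypothetical helper, the template is shortened, the document texts are placeholders, and the blank-line separator is an assumption.

```python
# Shortened stand-in for the notebook's prompt template.
TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n"


def stuff_prompt(template, documents, question, separator="\n\n"):
    """Join all document texts and fill the template's placeholders."""
    context = separator.join(documents)
    return template.format(context=context, question=question)


# Placeholder texts standing in for retrieved resolution notes.
docs = [
    "Resolution note one (placeholder).",
    "Resolution note two (placeholder).",
]
final_prompt = stuff_prompt(TEMPLATE, docs, "What fixed the issue?")
print(final_prompt)
```

Because every retrieved document ends up in a single prompt, the number of documents (`k`) and their length directly affect how much of the model's context window is used.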

## Key Benefits You've Just Demonstrated

- **Factual Accuracy**: The LLM's answer is grounded in real incident data, not just its training knowledge
- **Transparency**: You can see exactly which documents influenced the answer
- **Updatable Knowledge**: Adding new incidents to Milvus immediately makes them available for future queries
- **Cost Effective**: No need to retrain expensive models when your data changes
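
As an illustration of the transparency point, the sketch below shows one possible way to turn a list of source documents into a short, readable list. It assumes each document exposes `metadata` and `page_content` attributes and an `incident_pk` metadata key, as in the retrieval cell earlier. `summarize_sources` is a hypothetical helper, and the sample documents are placeholders built with `SimpleNamespace` so the snippet runs without a Milvus connection.

```python
from types import SimpleNamespace


def summarize_sources(source_documents, max_chars=80):
    """Return one summary line per source document."""
    lines = []
    for i, doc in enumerate(source_documents, 1):
        incident = doc.metadata.get("incident_pk", "N/A")
        snippet = doc.page_content[:max_chars]
        lines.append(f"{i}. {incident}: {snippet}")
    return lines


# Placeholder documents; in the notebook these would be resp["source_documents"].
fake_docs = [
    SimpleNamespace(
        metadata={"incident_pk": "INC-EXAMPLE-1"},
        page_content="Placeholder resolution text.",
    ),
    SimpleNamespace(metadata={}, page_content="Another placeholder."),
]

summary = summarize_sources(fake_docs)
for line in summary:
    print(line)
```

In the notebook, calling the same helper on `resp["source_documents"]` would list the incidents that were passed to the LLM as context.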

For Parasol Company, this means support engineers can now ask complex questions and receive answers based on the collective knowledge from thousands of past incident tickets. This is the foundation of a powerful system that can reduce resolution times, improve support consistency, and accelerate new hire training.