## Extending the capabilities of our model

An LLM is a very capable tool, but only within the limits of the information it was trained on. After all, you only know what you know, right? But what if you need to ask about something that is not in the training data, or something that is only loosely related to it?

There are different ways to solve this problem, depending on the resources you have and the time or money you can spend on it. Here are a few options:

- Fully retrain the model to include the information you need. For an LLM, this is only possible for the handful of companies in the world that can afford thousands of GPUs running for weeks.
- Fine-tune the model with this new information. This requires far fewer resources and can usually be done in minutes or hours, depending on the size of the model. However, because it does not fully retrain the model, the new information may not be completely integrated into the answers. Fine-tuning excels at giving the model a better understanding of a specific context or vocabulary, and is somewhat less effective at injecting new knowledge. You also have to retrain and redeploy the model every time you want to add more information.
- Put this new information in a database, retrieve the parts relevant to the query, and add them to the query as context before sending it to the LLM. This technique is called **Retrieval Augmented Generation, or RAG**. It is attractive because you don't have to retrain or fine-tune the model to benefit from this new knowledge, which you can easily update at any time.
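
To make the third option more concrete, here is a minimal, library-free sketch of the RAG flow: pick the passages most relevant to a question, then prepend them to the question as context. The keyword-overlap scoring, the `tokenize`, `retrieve` and `build_prompt` helpers, and the sample passages are all illustrative assumptions. A real setup would rely on embeddings and a vector database rather than word overlap.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words, ignoring punctuation."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return set(cleaned.split())


def retrieve(query, passages, k=2):
    """Return the k passages sharing the most words with the query (toy relevance score)."""
    query_words = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(query_words & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Prepend the retrieved passages to the question as context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion:\n{query}"


# Hypothetical knowledge base, for illustration only
knowledge_base = [
    "Drivers must yield to pedestrians in a crosswalk.",
    "The office cafeteria opens at 8 am.",
    "A red traffic light means drivers must stop.",
]

question = "Must drivers stop at a red light?"
top_passages = retrieve(question, knowledge_base, k=1)
rag_prompt = build_prompt(question, top_passages)
print(rag_prompt)  # This augmented prompt is what would be sent to the LLM
```

The model itself is never modified here: only the prompt changes, which is why the knowledge base can be updated at any time.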

We have already prepared a Vector Database using [Milvus](https://milvus.io/), where we have stored (in the form of [Embeddings](https://www.ibm.com/topics/embedding)) the content of the [California Driver's Handbook](https://www.dmv.ca.gov/portal/handbook/california-driver-handbook/).
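
As a rough intuition for what the vector database does: each chunk of text is stored as a vector, and a query is answered by finding the stored vectors closest to the query's own vector. The sketch below uses made-up 3-dimensional vectors and plain cosine similarity. Real embeddings have hundreds of dimensions and a vector database uses optimized indexes, so treat this purely as an illustration; the `toy_store` contents and the `top_k` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=3):
    """Return the k stored texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store.items()]
    scored.sort(reverse=True)  # Highest similarity first
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings
toy_store = {
    "speed limits": [0.9, 0.1, 0.0],
    "parking rules": [0.1, 0.9, 0.0],
    "right of way": [0.7, 0.2, 0.1],
}

query_vec = [1.0, 0.0, 0.0]
nearest = top_k(query_vec, toy_store, k=2)
print(nearest)
```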

In this Notebook, we are going to use RAG to **make some queries about a Claim** and see how this new knowledge can help without having to modify our LLM.

### Requirements and Imports

If you selected the right workbench image as per the Lab's instructions, you should already have all the libraries you need. If not, uncomment the first line in the next cell to install the required packages.

In [13]:
# Uncomment the following line only if you have not selected the right workbench image, or are using this notebook outside of the workshop environment.
# !pip install --no-cache-dir --no-dependencies --disable-pip-version-check -r requirements.txt

import json
import transformers
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_core.callbacks.streaming_stdout import StreamingStdOutCallbackHandler  # Needed by the LLM definition below
from langchain_community.llms import VLLMOpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus # Use the standard Milvus vector store

# Turn off warnings when downloading the embedding model
transformers.logging.set_verbosity_error()

### LangChain elements

Again, we are going to use LangChain to define our task pipeline.

First, the **LLM** where we will send our queries.

In [14]:
# LLM Inference Server URL
inference_server_url = "http://granite-3-1-8b-instruct-predictor.shared-llm.svc.cluster.local:8080"

# LLM definition
llm = VLLMOpenAI(           # We are using the vLLM OpenAI-compatible API client. But the Model is running on OpenShift AI, not OpenAI.
    openai_api_key="EMPTY",   # And that is why we don't need an OpenAI key for this.
    openai_api_base=f"{inference_server_url}/v1",
    model_name="granite-3-1-8b-instruct",
    top_p=0.92,
    temperature=0.01,
    max_tokens=512,
    presence_penalty=1.03,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

Then the connection to the **vector database** where we have prepared and stored the California Driver's Handbook.

In [15]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus

# 1. Define the embedding model (must match the ingestion pipeline)
print("Loading embedding model...")
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'},
    show_progress=False,
)
print("Embedding model loaded.")

# 2. Define connection arguments
connection_args = {
    "host": "vectordb-milvus", # The Kubernetes service name for Milvus
    "port": "19530"
}
print(f"Connecting to Milvus at: {connection_args['host']}:{connection_args['port']}")

# 3. Connect to the Milvus vector store, specifying the correct text and vector fields
vector_db = Milvus(
    embedding_function=embeddings,
    connection_args=connection_args,
    collection_name="servicenow_incidents",
    vector_field="embedding",      # Name of the field that holds the vectors
    text_field="resolution_notes"  # Name of the field LangChain uses as page_content
)
print("Successfully connected to Milvus collection 'servicenow_incidents'.")


# 4. Create a retriever to search for relevant documents
retriever = vector_db.as_retriever(search_kwargs={"k": 3})
print("Retriever created successfully.")

Loading embedding model...
Embedding model loaded.
Connecting to Milvus at: vectordb-milvus:19530
Successfully connected to Milvus collection 'servicenow_incidents'.
Retriever created successfully.


We will now define the **template** to use to make our query. Note that this template now contains a **Context** section. That's where the documents returned from the vector database will be injected.
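
The injection step can be pictured with plain string formatting. In this sketch the template text, the sample documents, and the `fill_template` helper are hypothetical. The chain used later in this notebook handles this step internally, and joining the documents with a blank line is an assumption about typical behaviour rather than a description of its implementation.

```python
# Hypothetical template with two placeholders: one for the retrieved
# documents and one for the user's question
TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question:
{question}
"""


def fill_template(template, documents, question):
    """Join the retrieved documents and substitute them into the template."""
    context = "\n\n".join(documents)  # Assumed separator between documents
    return template.format(context=context, question=question)


# Stand-ins for what a retriever would return
retrieved_documents = [
    "Passage A returned by the vector database.",
    "Passage B returned by the vector database.",
]

filled_prompt = fill_template(TEMPLATE, retrieved_documents, "What does passage A say?")
print(filled_prompt)
```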

We are now ready to query the model!

In the `claims` folder we have JSON files with examples of claims that could be received. We are going to read the first claim and ask a question related to it.
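
Reading a claim could look roughly like the sketch below. The file name, the `subject` and `content` fields, and the `load_first_claim` helper are assumptions made for illustration, since the actual layout of the claim files is not shown here. A temporary folder with one sample file is created so the sketch runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_first_claim(folder):
    """Return the parsed content of the first JSON file (sorted by name) in a folder."""
    claim_files = sorted(Path(folder).glob("*.json"))
    if not claim_files:
        raise FileNotFoundError(f"No JSON files found in {folder}")
    with claim_files[0].open(encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample claim, written to a temporary folder for the demo
with tempfile.TemporaryDirectory() as tmp:
    sample = {"subject": "Example claim", "content": "Example description of the incident."}
    (Path(tmp) / "claim1.json").write_text(json.dumps(sample), encoding="utf-8")
    claim = load_first_claim(tmp)

print(claim["content"])
```

In the workshop environment, the same helper would be pointed at the `claims` folder instead of a temporary one.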

### First test, no additional knowledge

Let's start with a first query about the claim, but without help from our vector database.

We can see that the answer is valid. Here the model is using its general understanding of traffic regulation.

### Second test, with added knowledge

We will use the same prompt and query, but this time the model will have access to some references from the California Driver's Handbook.

In [16]:
# The 'llm' and 'retriever' objects should already be defined from the previous cells.

# 1. Define a prompt template suitable for a general Q&A over the retrieved documents.
#    This template tells the LLM how to use the context from Milvus to answer the question.
prompt_template_str = """
<|system|>
You are a helpful, respectful and honest assistant named "Parasol Assistant".
You will be given context from past incident tickets and a question.
Your answer should be based only on the provided context.
If the context does not contain the answer, say that you don't have enough information.

Context:
{context}

Question:
{question}
<|assistant|>
"""

prompt = PromptTemplate(
    template=prompt_template_str, input_variables=["context", "question"]
)

# 2. Create the RAG chain with the new prompt.
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

# 3. Define the query and invoke the chain.
query = "Give me information on INC001004"
print(f"Executing RAG chain with query: '{query}'")

resp = rag_chain.invoke({"query": query})

# 4. Print the results.
print("\n--- Answer from LLM ---\n")
print(resp["result"])
print("\n--- Sources Retrieved from Milvus ---\n")
for doc in resp["source_documents"]:
    print(f"  - Incident: {doc.metadata.get('incident_pk', 'N/A')}")
    print(f"    Description: {doc.metadata.get('short_description', 'N/A')}")
    print("-" * 20)

Executing RAG chain with query: 'Give me information on INC001004'
INC001004 refers to an incident where the primary network interface card (NIC) of the main email server failed. The issue was resolved by rerouting traffic to the secondary NIC and ordering a replacement primary NIC. Once installed and configured, full redundancy will be restored. Users confirmed that email services were restored after the failover. No issues were found with the VPN concentrator or client software.
--- Answer from LLM ---

INC001004 refers to an incident where the primary network interface card (NIC) of the main email server failed. The issue was resolved by rerouting traffic to the secondary NIC and ordering a replacement primary NIC. Once installed and configured, full redundancy will be restored. Users confirmed that email services were restored after the failover. No issues were found with the VPN concentrator or client software.

--- Sources Retrieved from Milvus ---

  - Incident: INC001005
    De

That is pretty neat! Now the model refers more precisely to the rules that must be observed.

But where did we get this information from? We can look into the sources associated with the answers from the vector database.

In [None]:
def format_sources(input_list):
    """Build a 'source, page: N, M, ...' string from the retrieved documents.

    Expects each document to store its source and page under metadata["metadata"].
    """
    sources = ""
    if len(input_list) != 0:
        first = input_list[0].metadata["metadata"]
        sources += first["source"] + ', page: ' + str(first["page"])
        page_list = [first["page"]]
        for item in input_list:
            page = item.metadata["metadata"]["page"]
            if page not in page_list: # Avoid duplicates
                page_list.append(page)
                sources += ', ' + str(page)
    return sources


results = format_sources(resp['source_documents'])

print(results)

That's it! We now know how to complement our LLM with some external knowledge!