# RAG example with LangChain, Milvus, and vLLM

This Jupyter notebook sets up a Retrieval-Augmented Generation (RAG) chatbot using LangChain, Milvus (a vector database), and a vLLM inference server hosting Mistral-7B-Instruct-v0.2.

### Requirements:

 - A Milvus instance, either standalone or cluster.
 - Connection credentials to Milvus, available as environment variables: MILVUS_USERNAME and MILVUS_PASSWORD.
 - A vLLM inference endpoint. This example uses its OpenAI-compatible API.
 - The packages and imports set up in the steps below.
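
Because the Milvus credentials are read from the environment later in the notebook, it can help to fail early when they are missing. The following is a minimal sketch of such a check; the helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os

# Variable names taken from the requirements list above
REQUIRED_VARS = ("MILVUS_USERNAME", "MILVUS_PASSWORD")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment
print(missing_env_vars(env={"MILVUS_USERNAME": "demo"}))  # ['MILVUS_PASSWORD']
```

Calling `missing_env_vars()` with no arguments checks the real process environment; an empty list means both variables are set.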

## Step 1 Install required libraries
Installs necessary libraries:
 - langchain: For managing LLM workflows.
 - pymilvus: For connecting to Milvus (a vector database).
 - sentence-transformers: For text embeddings.
 - openai: For interacting with OpenAI-compatible inference servers.

In [16]:
!pip install -q einops==0.7.0 langchain==0.1.9 pymilvus==2.3.6 sentence-transformers==2.4.0 openai==1.13.3;
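
Since the install command pins exact versions, one way to confirm that the running kernel actually picked them up is to compare the pins against the installed package metadata. This is only a sketch: the `PINS` dictionary mirrors the command above, and the helper `find_mismatches` is a hypothetical name introduced for illustration.

```python
from importlib import metadata

# Version pins copied from the pip install command above
PINS = {
    "einops": "0.7.0",
    "langchain": "0.1.9",
    "pymilvus": "2.3.6",
    "sentence-transformers": "2.4.0",
    "openai": "1.13.3",
}

def find_mismatches(pins, get_version=metadata.version):
    """Map package name -> (expected, installed or None) for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Report anything that differs from the pinned versions
for name, (expected, installed) in find_mismatches(PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```

No output means every pinned package is installed at the expected version.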


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Step 2 Test connection to the LLM

In [17]:
!curl  https://vllm-vllm.apps.lisbon-sgioia-01.lis.ciscodemo.int/v1/models | python -m json.tool

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   497  100   497    0     0  33133      0 --:--:-- --:--:-- --:--:-- 33133
{
    "object": "list",
    "data": [
        {
            "id": "mistralai/Mistral-7B-Instruct-v0.2",
            "object": "model",
            "created": 1752067307,
            "owned_by": "vllm",
            "root": "mistralai/Mistral-7B-Instruct-v0.2",
            "parent": null,
            "permission": [
                {
                    "id": "modelperm-3ef50ce2bf294e4698f3b95037d2e256",
                    "object": "model_permission",
                    "created": 1752067307,
                    "allow_create_engine": false,
                    "allow_sampling": true,
                    "allow_logprobs": true,
                    "allow_search_indices": false,
                    "allow_view": true,
                    "allow_fine_tun

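The same check can be done programmatically by parsing the JSON that `/v1/models` returns and confirming that the expected model is being served. The sketch below works on a trimmed copy of the response shown above rather than calling the endpoint; the helper `served_model_ids` is a hypothetical name.

```python
import json

# Trimmed copy of the /v1/models response shown above (fields not needed here are omitted)
sample_response = """
{
    "object": "list",
    "data": [
        {"id": "mistralai/Mistral-7B-Instruct-v0.2", "object": "model", "owned_by": "vllm"}
    ]
}
"""

def served_model_ids(response_text):
    """Extract the model ids from an OpenAI-style model list response."""
    payload = json.loads(response_text)
    return [entry["id"] for entry in payload.get("data", [])]

EXPECTED_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
ids = served_model_ids(sample_response)
print(EXPECTED_MODEL in ids)  # True
```
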
## Step 3 Import Libraries & Set Up Configuration

Imports necessary components from langchain, including:
 - RetrievalQA: Enables retrieval-augmented question-answering.
 - Milvus: Connects to Milvus for vector search.
 - VLLMOpenAI: Uses an inference server compatible with OpenAI’s API.

In [18]:
import os
from langchain.callbacks.base import BaseCallbackHandler
from langchain.chains import RetrievalQA
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.llms import VLLMOpenAI
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import Milvus

## Step 4 Base parameters, inference server, and Milvus info

Define the model and Milvus vector database configuration.
 - LLM Configuration:
   - Uses an OpenAI-compatible inference server (vLLM).
   - Loads the Mistral-7B-Instruct-v0.2 model.
   - Sets hyperparameters like temperature (controls randomness) and top_p (nucleus sampling).
 - Milvus Vector Database Configuration:
   - Connects to a Milvus instance (vectordb-milvus).
   - Uses a collection named "Football" for storing and retrieving embeddings.

In [19]:
INFERENCE_SERVER_URL = "https://vllm-vllm.apps.lisbon-sgioia-01.lis.ciscodemo.int/v1"
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"
MAX_TOKENS = 100
TOP_P = 0.95
TEMPERATURE = 0.5
PRESENCE_PENALTY = 1.03
MILVUS_HOST = "vectordb-milvus.milvus.svc.cluster.local"
MILVUS_PORT = 19530
MILVUS_USERNAME = os.getenv('MILVUS_USERNAME')
MILVUS_PASSWORD = os.getenv('MILVUS_PASSWORD')
MILVUS_COLLECTION = "Football"
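
To make the effect of TEMPERATURE and TOP_P more concrete, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy set of token scores. It illustrates the general technique only; it is not vLLM's implementation, and the function name and toy logits are assumptions made for this example.

```python
import math

def nucleus_distribution(logits, temperature=0.5, top_p=0.95):
    """Return the renormalized token probabilities that survive top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Numerically stable softmax
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: value / total for tok, value in exps.items()}
    # Keep the most likely tokens until their cumulative probability reaches top_p
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling
    kept_total = sum(kept.values())
    return {tok: prob / kept_total for tok, prob in kept.items()}

toy_logits = {"goal": 2.0, "match": 1.0, "banana": -3.0}  # made-up scores
print(nucleus_distribution(toy_logits))
```

With these toy scores the very unlikely token is cut off by the top-p threshold, so only the two plausible tokens remain candidates for sampling.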

## Step 5 Load Embeddings Model

Uses nomic-ai/nomic-embed-text-v1 from Hugging Face for text embeddings. The model converts text into numerical vectors that Milvus can compare in a similarity search.
The cell then connects to Milvus, the vector database, which:
 - Stores documents as embeddings.
 - Allows fast similarity search for retrieval-augmented generation (RAG).
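
As a rough illustration of what similarity search over embeddings means, the sketch below ranks a few toy vectors by cosine similarity to a query vector. It is a brute-force stand-in, not how Milvus works internally: Milvus relies on vector indexes and on the distance metric configured for the collection. The vectors and document ids are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy 3-dimensional "embeddings"; real embedding vectors have hundreds of dimensions
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))  # ['doc_a', 'doc_b']
```
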

In [20]:
model_kwargs = {'trust_remote_code': True}
embeddings = HuggingFaceEmbeddings(
    model_name="nomic-ai/nomic-embed-text-v1",
    model_kwargs=model_kwargs,
    show_progress=False
)

store = Milvus(
    embedding_function=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT, "user": MILVUS_USERNAME, "password": MILVUS_PASSWORD},
    collection_name=MILVUS_COLLECTION,
    metadata_field="metadata",
    text_field="page_content",
    drop_old=False
    )

You try to use a model that was created with version 2.4.0.dev0, however, your version is 2.4.0. This might cause unexpected behavior or errors. In that case, try to update to the latest version.



<All keys matched successfully>


## Step 6 Define the Prompt Template & create a Retrieval-QA Chain

This cell first defines the prompt template, then uses VLLMOpenAI to connect to the inference server running Mistral-7B:
 - Enables streaming responses.
 - Uses the API base URL (INFERENCE_SERVER_URL).

In this section, the notebook combines a retrieval system with a language model (LLM) to create a question-answering (QA) pipeline, using LangChain’s RetrievalQA class, which integrates document retrieval with LLM-based generation. The goal is to improve the LLM’s responses by retrieving relevant documents from the vector database (Milvus) before generating an answer.
Instead of relying solely on the model’s pre-trained knowledge, RAG grounds each answer in the retrieved documents. This helps factual accuracy but does not guarantee it: the answer can only be as good as the content stored in the collection.
The RetrievalQA chain first retrieves the most similar documents from Milvus and then passes them as context to Mistral-7B for response generation.

Finally, the cell creates the Retrieval-QA chain, which combines:
 - Milvus (retriever) for document search.
 - Mistral-7B (LLM) for answer generation.
 - Similarity search (k=4) to retrieve the four most relevant documents.
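
Conceptually, the retrieve-then-generate step amounts to joining the retrieved passages into one context string and substituting it, together with the question, into the prompt template. The sketch below shows that idea with plain Python; the `Doc` class is a hypothetical stand-in for a LangChain document, the shortened template is not the notebook's prompt, and the blank-line separator is an assumption about how passages are joined.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str

# Shortened template for illustration; the notebook's real template is defined in the cell below
PROMPT = "Context:\n{context}\n\nQuestion: {question}"

def build_prompt(docs, question, template=PROMPT, separator="\n\n"):
    """Join the retrieved passages and fill them into the prompt template."""
    context = separator.join(doc.page_content for doc in docs)
    return template.format(context=context, question=question)

retrieved = [Doc("First passage."), Doc("Second passage.")]
print(build_prompt(retrieved, "What do the passages say?"))
```

The resulting string is what the LLM receives, which is why the quality of the retrieved passages directly bounds the quality of the answer.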

In [21]:
template="""<s>[INST] <<SYS>>
You are CiscoBot, a helpful, respectful, and knowledgeable assistant.

Your task is to answer questions based on the context provided. When responding:

Focus on clarity and helpfulness: Provide answers that are informative, accurate, and as detailed as necessary to address the question effectively.
Be respectful and positive: Ensure your responses are polite, considerate, and promote a positive experience for the user.
Safety is key: Avoid sharing harmful, unethical, discriminatory, or illegal content in your answers. Always prioritize social responsibility and inclusivity in your responses.
Explain when unsure: If a question is unclear or not factually sound, offer a polite explanation of why and avoid providing incorrect or speculative information.
Always aim to provide the most reliable and accurate information, ensuring your responses are constructive and relevant to the user's inquiry.
<</SYS>>

Context: 
{context}

Question: {question} [/INST]
"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
import httpx

# Note: verify=False disables TLS certificate verification for the inference endpoint.
# Use it only for demo or lab endpoints, never in production.
llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base=INFERENCE_SERVER_URL,
    model_name=MODEL_NAME,
    max_tokens=MAX_TOKENS,
    top_p=TOP_P,
    temperature=TEMPERATURE,
    presence_penalty=PRESENCE_PENALTY,
    streaming=True,
    verbose=False,
    callbacks=[StreamingStdOutCallbackHandler()],  # prints tokens to stdout as they arrive
    async_client=httpx.AsyncClient(verify=False),
    http_client=httpx.Client(verify=False)
)

qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 4}
            ),
        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
        return_source_documents=True
        )

os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Step 7 Query Example

In [23]:
question = "Who was the winner of the PFA Premier League Team of the Year in 2024?"
result = qa_chain.invoke({"query": question})


The PFA (Professional Footballers' Association) Player of the Year Award in 2024 was won by Phil Foden. However, the context provided does not mention the PFA Premier League Team of the Year. To answer that question, I would need to access more information or clarify which league's team you are asking about.

## Step 8 Get Sources

Lists the distinct sources of the retrieved documents, removing duplicates based on each document's metadata source.

In [24]:
def remove_duplicates(input_list):
    unique_list = []
    for item in input_list:
        if item.metadata['source'] not in unique_list:
            unique_list.append(item.metadata['source'])
    return unique_list

results = remove_duplicates(result['source_documents'])

for s in results:
    print(s)

https://www.transfermarkt.com/
https://www.fabrizioromano.org/
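
The same de-duplication can be written more compactly with `dict.fromkeys`, which keeps the first occurrence of each key in order. The variant below also tolerates documents that have no `source` entry in their metadata. It is a sketch: the `Doc` class is a hypothetical stand-in for the retrieved documents, and the `"unknown"` fallback is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document with metadata."""
    metadata: dict = field(default_factory=dict)

def unique_sources(docs, default="unknown"):
    """Return each distinct metadata source once, in first-seen order."""
    sources = (doc.metadata.get("source", default) for doc in docs)
    return list(dict.fromkeys(sources))

docs = [
    Doc({"source": "https://www.transfermarkt.com/"}),
    Doc({"source": "https://www.fabrizioromano.org/"}),
    Doc({"source": "https://www.transfermarkt.com/"}),
]
print(unique_sources(docs))
```
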


## Step 9 Install Gradio

Gradio is an open-source Python library for creating user-friendly web interfaces for machine learning models and applications. It allows developers to quickly build interactive demos with just a few lines of code. Users can input text, images, or other data types and receive real-time responses from models.

In [None]:
!pip install gradio;

## Step 10 Launch Gradio Chatbot
Creates a Gradio UI:
 - Allows users to enter questions.
 - Uses retrieval-augmented generation (RAG) to provide answers.
 - Launches the chatbot with sharing enabled, which asks Gradio to create a temporary public link in addition to the local URL.

In [None]:
import gradio as gr

def rag_query(user_input, _=None):  # `_` absorbs the unused slider value
    # RetrievalQA returns a dict; 'result' holds the generated answer
    response = qa_chain.invoke({"query": user_input})
    return response['result']

demo = gr.Interface(
    fn=rag_query,
    inputs=["text", "slider"],
    outputs=["text"],
    title="🔍 RAG Chatbot",
    description="Ask a question, and the chatbot will retrieve relevant documents before answering."
)

# share expects a boolean; True requests a temporary public link
demo.launch(share=True)

In [None]:
import gradio as gr
import requests

# Query the LLM directly through the vLLM OpenAI-compatible chat API (no retrieval step).
# INFERENCE_SERVER_URL and MODEL_NAME are the values defined in Step 4.
def query_llm(prompt):
    api_url = f"{INFERENCE_SERVER_URL}/chat/completions"

    data = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}]
    }

    response = requests.post(api_url, json=data, timeout=60)
    # Raise on HTTP errors instead of silently returning None
    response.raise_for_status()
    # Return only the generated text, not the raw Response object
    return response.json()["choices"][0]["message"]["content"]

# Create a Gradio interface
iface = gr.Interface(
    fn=query_llm,
    inputs="text",
    outputs="text",
    title="LLM Query Interface",
    description="Enter your prompt to query the language model."
)

# Launch the interface
iface.launch(share=True)