Docling RAG test based on https://github.com/docling-project/docling/blob/main/docs/examples/rag_milvus.ipynb

In [None]:
! pip install --upgrade pymilvus docling openai torch python-dotenv tqdm

In [2]:
import torch

# Check if GPU or MPS is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    raise OSError(
        "No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured."
    )

CUDA GPU is enabled: NVIDIA GeForce RTX 3060 Ti


In [3]:
from dotenv import load_dotenv

# Load environment variables (e.g. OPENAI_API_KEY) from a local .env file
load_dotenv()

True

In [4]:
from openai import OpenAI

openai_client = OpenAI()

In [5]:
def emb_text(text):
    """Return the embedding vector for a single string."""
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )



In [6]:
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])

1536
[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]


Process Data Using Docling

In [7]:
from docling_core.transforms.chunker import HierarchicalChunker

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
chunker = HierarchicalChunker()

# Convert the input file to Docling Document
source = "https://milvus.io/docs/overview.md"
doc = converter.convert(source).document

# Perform hierarchical chunking
texts = [chunk.text for chunk in chunker.chunk(doc)]


Load Data into Milvus

In [8]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"

In [9]:
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

In [10]:
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details.
)
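
The `metric_type="IP"` setting ranks results by inner product. A minimal sketch of why that is a reasonable choice here, assuming the embeddings are unit-length (OpenAI documents its embeddings as normalized to length 1): for unit vectors the inner product equals cosine similarity. The helper names and the toy 3-dimensional vectors are illustrative only and are not part of pymilvus.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


# Toy vectors standing in for real 1536-dimensional embeddings.
query = normalize([1.0, 2.0, 2.0])
docs = {
    "close": normalize([1.0, 2.0, 1.5]),
    "far": normalize([-2.0, 0.5, 0.0]),
}

# Rank by inner product, highest first, as an IP search would.
ranked = sorted(docs, key=lambda name: inner_product(query, docs[name]), reverse=True)
print(ranked)
```

If the vectors were not normalized, inner product would also reward vector length, and the ranking could differ from a cosine ranking.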

In [11]:
from tqdm import tqdm

data = []

for i, chunk in enumerate(tqdm(texts, desc="Processing chunks")):
    embedding = emb_text(chunk)
    data.append({"id": i, "vector": embedding, "text": chunk})

milvus_client.insert(collection_name=collection_name, data=data)

Processing chunks: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:21<00:00,  1.78it/s]


{'insert_count': 38, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], 'cost': 0}
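
The loop above sends one embeddings request per chunk, which is why 38 chunks took about 21 seconds. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so requests can be batched. The sketch below is a hedged illustration: `embed_in_batches` and `batch_size` are hypothetical names, and `embed_batch` is any caller-supplied function that maps a list of strings to a list of vectors. A stub stands in for the API call so the example runs offline.

```python
def embed_in_batches(texts, embed_batch, batch_size=16):
    """Embed texts in fixed-size batches and build Milvus-style rows.

    embed_batch is a caller-supplied function: list[str] -> list[list[float]].
    """
    rows = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        for offset, (text, vector) in enumerate(zip(batch, vectors)):
            # Row ids follow the position of each text in the full list.
            rows.append({"id": start + offset, "vector": vector, "text": text})
    return rows


def fake_embed_batch(batch):
    """Offline stand-in for a real embedding call; returns 2-d toy vectors."""
    return [[float(len(text)), 1.0] for text in batch]


rows = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print([row["id"] for row in rows])
```

With the real client, `embed_batch` could wrap `openai_client.embeddings.create(input=batch, model="text-embedding-3-small")` and return the `.embedding` of each item in `.data`. That wiring is an assumption here and was not run in this notebook.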

Build RAG

In [12]:
question = (
    "What are the three deployment modes of Milvus, and what are their differences?"
)

In [13]:
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],
    limit=3,
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],
)

In [14]:
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))

[
    [
        "Milvus offers three deployment modes, covering a wide range of data scales\u2014from local prototyping in Jupyter Notebooks to massive Kubernetes clusters managing tens of billions of vectors:",
        0.6503317356109619
    ],
    [
        "- Milvus Lite is a Python library that can be easily integrated into your applications. As a lightweight version of Milvus, it\u2019s ideal for quick prototyping in Jupyter Notebooks or running on edge devices with limited resources. Learn more.\n- Milvus Standalone is a single-machine server deployment, with all components bundled into a single Docker image for convenient deployment. Learn more.\n- Milvus Distributed can be deployed on Kubernetes clusters, featuring a cloud-native architecture designed for billion-scale or even larger scenarios. This architecture ensures redundancy in critical components. Learn more.",
        0.634893536567688
    ],
    [
        "- What is Milvus?\n- Unstructured Data, Embeddings, and Milvus\

Use an LLM to get a RAG response

In [15]:
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
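
Joining every retrieved chunk works for three short passages. For larger `limit` values, one option is to cap the context size and drop low-scoring hits. A minimal sketch, where `build_context`, `max_chars`, and `min_score` are hypothetical names and the thresholds are arbitrary; it takes `(text, score)` pairs shaped like `retrieved_lines_with_distances`:

```python
def build_context(hits, max_chars=2000, min_score=0.0):
    """Join (text, score) pairs, best first, within a character budget."""
    parts = []
    used = 0
    for text, score in sorted(hits, key=lambda hit: hit[1], reverse=True):
        if score < min_score:
            break  # remaining hits score even lower
        if used + len(text) > max_chars:
            continue  # skip chunks that would exceed the budget
        parts.append(text)
        used += len(text) + 1  # +1 for the joining newline
    return "\n".join(parts)


sample_hits = [("short passage", 0.65), ("x" * 50, 0.63), ("low score", 0.10)]
print(build_context(sample_hits, max_chars=30, min_score=0.5))
```

Higher scores mean closer matches only because the collection uses inner product; with a distance metric such as L2 the sort order and threshold would need to be reversed.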

In [16]:
SYSTEM_PROMPT = """
You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
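
The f-string above is evaluated once, so asking a second question means re-running the cell. A small sketch of a reusable alternative, assuming the same prompt wording; `PROMPT_TEMPLATE` and `build_user_prompt` are hypothetical names introduced for illustration:

```python
PROMPT_TEMPLATE = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""


def build_user_prompt(context, question):
    """Fill the template with a retrieved context and a user question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


example_prompt = build_user_prompt("Milvus Lite is a Python library.", "What is Milvus Lite?")
print(example_prompt)
```

Only the template itself is parsed by `str.format`, so curly braces inside the retrieved text are inserted unchanged.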

In [17]:
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)

Milvus offers three deployment modes:

1. **Milvus Lite**: This is a Python library that is ideal for quick prototyping in Jupyter Notebooks or for use on edge devices with limited resources. It is the lightweight version of Milvus and can be easily integrated into applications.

2. **Milvus Standalone**: This deployment mode is a single-machine server deployment, where all the components are bundled into a single Docker image for convenient deployment. It is suitable for environments where a single machine is sufficient.

3. **Milvus Distributed**: Designed for large-scale scenarios, it can be deployed on Kubernetes clusters. It features a cloud-native architecture suitable for managing billion-scale or even larger data. This mode ensures redundancy in critical components and is intended for massive deployments.

Each mode is tailored to different scales and environments, from local development and prototyping to massive, production-level deployments.
