<a href="https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_llamaindex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# RAG with LlamaIndex

| Step | Tech | Execution |
| --- | --- | --- |
| Embedding | Hugging Face / Sentence Transformers | 💻 Local |
| Vector store | Milvus | 💻 Local |
| Gen AI | Hugging Face Inference API | 🌐 Remote |

## Overview

This example leverages the official [LlamaIndex Docling extension](../../integrations/llamaindex/).

The extensions presented here, `DoclingReader` and `DoclingNodeParser`, enable you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich format for advanced, document-native grounding.

## Setup

- 👉 For best conversion speed, use GPU acceleration whenever available; e.g., if running on Colab, use a GPU-enabled runtime.
- This notebook uses the Hugging Face Inference API; for a higher LLM quota, provide a token via the `HF_TOKEN` environment variable.
- Requirements can be installed as shown below (`--no-warn-conflicts` is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage):

In [1]:
%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv

  Preparing metadata (setup.py) ... done
  Building wheel for pylatexenc (setup.py) ... done


In [2]:
import os
from pathlib import Path
from tempfile import mkdtemp
from warnings import filterwarnings

from dotenv import load_dotenv


def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)


load_dotenv()

filterwarnings(action="ignore", category=UserWarning, module="pydantic")
filterwarnings(action="ignore", category=FutureWarning, module="easyocr")
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

We can now define the main parameters:

In [12]:
SOURCE = "/content/강의계획서(학사).pdf"  # local PDF: course syllabus (undergraduate)
QUERY = "학습 목표와 학습 내용은?"  # "What are the learning objectives and learning content?"

In [13]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
GEN_MODEL = HuggingFaceInferenceAPI(
    token=_get_env_from_colab_or_os("HF_TOKEN"),
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
)

embed_dim = len(EMBED_MODEL.get_text_embedding("hi"))

## Using Markdown export

To create a simple RAG pipeline, we can:
- define a `DoclingReader`, which by default exports to Markdown, and
- use a standard node parser for these Markdown-based docs, e.g. a `MarkdownNodeParser`.

In [14]:
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.readers.docling import DoclingReader
from llama_index.vector_stores.milvus import MilvusVectorStore

reader = DoclingReader()
node_parser = MarkdownNodeParser()

vector_store = MilvusVectorStore(
    uri=str(Path(mkdtemp()) / "docling.db"),  # or set as needed
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=reader.load_data(SOURCE),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])

2025-03-25 02:51:28,686 [DEBUG][_create_connection]: Created new connection using: 0478f75597f74cf5b47d7394553c7899 (async_milvus_client.py:600)
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


Q: 학습 목표와 학습 내용은?
A: The learning goals and contents are as follows:

Week 1:
- Learning goal: Understand the learning objectives, get introduced to artificial intelligence search, and learn about the course plan and project execution method.
- Learning content: Introduction to the course, learning objectives, artificial intelligence search introduction, course plan, project execution method, and importance of artificial intelligence search.

Week 2:
- Learning goal: Learn about major data search methods.
- Learning content: Keywords, vector space, probability, database schema, Elasticsearch.

Week 3:
- Learning goal: Understand the principles of artificial intelligence-based data search.
- Learning content: Tokens, document loader, embedding, vector storage, text division.

Week 4:
- Learning goal: Learn about search enhancement and generation (RAG) search methods.
- Learning content: Vector store, context compression, ensemble, long context rewriting, and part-of-speech.

Week 5:
- L

[('## ◆주별수업계획\n\n| 주        | 구간                | 수업내용                                                                                                    | 교수학습 방법   | 교재/자료   | 비고   |\n|-----------|---------------------|-------------------------------------------------------------------------------------------------------------|-----------------|-------------|--------|\n| 1주       | 03/04~ 03/10        | 학습목표 강의소개및인공지능검색소개 학습내용 강의계획및프로젝트진행방법,인공지능 검색중요성                 |                 |             |        |\n| 2주       | 03/11~ 03/17        | 학습목표 주요자료검색방법 학습내용 키워드, 벡터공간, 확률, 데이터베이스 스키마 , 일래스틱서치               |                 |             |        |\n| 3주       | 03/18~ 03/24        | 학습목표 인공지능기반자료검색원리 학습내용 토큰, 문서 로더, 임베딩, 벡터저장소, 텍스트 분할                 |                 |             |        |\n| 4주       | 03/25~ 03/31        | 학습목표 검색증강생성(RAG) 검색 방법 학습내용 벡터스토어,문맥압축,앙상블,긴문맥재정 렬, 형태소              |                 |             |        |\n| 5주       | 04/01~ 04/07        | 학습목표
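As a rough illustration of what heading-based Markdown splitting involves, the sketch below splits a Markdown string into one chunk per heading. It is a simplified stand-in in plain Python, not the actual implementation of `MarkdownNodeParser`; the function name `split_markdown_by_heading` and the sample text are hypothetical.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at every ATX heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A line such as "## Title" closes the running chunk and starts a new one.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty after stripping whitespace.
    return [chunk for chunk in chunks if chunk]


# Hypothetical sample document, loosely modeled on a weekly course plan.
sample = "# Syllabus\nIntro text.\n\n## Week 1\nCourse overview.\n\n## Week 2\nSearch methods."
chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(repr(chunk))
```

This sketch does not treat fenced code blocks specially, so a `#` comment inside a fence would be mistaken for a heading; a production parser needs to handle that case.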

## Using Docling format

To leverage Docling's rich native format, we:
- create a `DoclingReader` with JSON export type, and
- employ a `DoclingNodeParser` in order to appropriately parse that Docling format.

Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):

In [15]:
from llama_index.node_parser.docling import DoclingNodeParser

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
node_parser = DoclingNodeParser()

vector_store = MilvusVectorStore(
    uri=str(Path(mkdtemp()) / "docling.db"),  # or set as needed
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=reader.load_data(SOURCE),
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])

2025-03-25 02:57:49,479 [DEBUG][_create_connection]: Created new connection using: ca487063292842369aa6974ab0d52e25 (async_milvus_client.py:600)


Q: 학습 목표와 학습 내용은?
A: The learning goals and content can be found under the heading "강의내용" (course content) in the context. The learning goals are not explicitly stated, but they can be inferred from the course content provided. The course content includes topics such as the basics of programming, data structures, algorithms, and software development, among others.

Sources:


[('◆강의내용',
  {'schema_name': 'docling_core.transforms.chunker.DocMeta',
   'version': '1.0.0',
   'doc_items': [{'self_ref': '#/texts/1',
     'parent': {'$ref': '#/body'},
     'children': [],
     'content_layer': 'body',
     'label': 'text',
     'prov': [{'page_no': 1,
       'bbox': {'l': 28.35,
        't': 739.18,
        'r': 77.46,
        'b': 731.041,
        'coord_origin': 'BOTTOMLEFT'},
       'charspan': [0, 5]}]}],
   'headings': ['강의계획서(학사)'],
   'origin': {'mimetype': 'application/pdf',
    'binary_hash': 13855459962613527174,
    'filename': '강의계획서(학사).pdf'}}),
 ('시기, 1 = 내용. 시기, 2 = 시험방법/제출방법. 시기, 3 = 평가기준',
  {'schema_name': 'docling_core.transforms.chunker.DocMeta',
   'version': '1.0.0',
   'doc_items': [{'self_ref': '#/tables/2',
     'parent': {'$ref': '#/body'},
     'children': [],
     'content_layer': 'body',
     'label': 'table',
     'prov': [{'page_no': 2,
       'bbox': {'l': 27.108963012695312,
        't': 186.52886962890625,
        'r': 
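The grounding information can be pulled out of a node's metadata with a few lines of plain Python. The sketch below assumes the metadata layout shown in the output above (`doc_items` → `prov` → `page_no` / `bbox`); the helper name `extract_grounding` is hypothetical, and the example dictionary is an abridged copy of the first source node.

```python
def extract_grounding(metadata: dict) -> list[tuple[int, dict]]:
    """Collect (page number, bounding box) pairs from a node's metadata dict."""
    grounding = []
    for item in metadata.get("doc_items", []):
        for prov in item.get("prov", []):
            grounding.append((prov["page_no"], prov["bbox"]))
    return grounding


# Abridged copy of the first source node's metadata from the output above.
example_metadata = {
    "doc_items": [
        {
            "self_ref": "#/texts/1",
            "label": "text",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {
                        "l": 28.35,
                        "t": 739.18,
                        "r": 77.46,
                        "b": 731.041,
                        "coord_origin": "BOTTOMLEFT",
                    },
                    "charspan": [0, 5],
                }
            ],
        }
    ],
    "headings": ["강의계획서(학사)"],
}

for page_no, bbox in extract_grounding(example_metadata):
    print(f"page {page_no}: l={bbox['l']}, t={bbox['t']}, r={bbox['r']}, b={bbox['b']}")
```

With real query results, the same helper could be applied to `n.metadata` for each `n` in `result.source_nodes`.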

## With Simple Directory Reader

To demonstrate this usage pattern, we first set up a test document directory.

In [16]:
import shutil
from pathlib import Path
from tempfile import mkdtemp

import requests

tmp_dir_path = Path(mkdtemp())
if SOURCE.startswith(("http://", "https://")):
    # Remote source: download it into the test directory.
    r = requests.get(SOURCE)
    r.raise_for_status()
    (tmp_dir_path / f"{Path(SOURCE).name}.pdf").write_bytes(r.content)
else:
    # Local source, as configured above: copy it under its existing name.
    shutil.copy(SOURCE, tmp_dir_path / Path(SOURCE).name)

Using the `reader` and `node_parser` definitions from any of the above variants, usage with `SimpleDirectoryReader` then looks as follows:

In [17]:
from llama_index.core import SimpleDirectoryReader

dir_reader = SimpleDirectoryReader(
    input_dir=tmp_dir_path,
    file_extractor={".pdf": reader},
)

vector_store = MilvusVectorStore(
    uri=str(Path(mkdtemp()) / "docling.db"),  # or set as needed
    dim=embed_dim,
    overwrite=True,
)
index = VectorStoreIndex.from_documents(
    documents=dir_reader.load_data(),  # files are taken from input_dir
    transformations=[node_parser],
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    embed_model=EMBED_MODEL,
)
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])

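`SimpleDirectoryReader` raises a `ValueError` when `input_dir` contains no files, so it can help to check the directory before building the reader. The following is a minimal sketch of such a check in plain Python; the helper name `find_files` and the placeholder file are hypothetical and not part of LlamaIndex or Docling.

```python
from pathlib import Path
from tempfile import mkdtemp


def find_files(input_dir: Path, suffix: str = ".pdf") -> list[Path]:
    """Return files in input_dir with the given suffix, failing early if none exist."""
    files = sorted(
        p for p in input_dir.iterdir() if p.is_file() and p.suffix.lower() == suffix
    )
    if not files:
        raise FileNotFoundError(
            f"No {suffix} files in {input_dir}; populate it before indexing."
        )
    return files


demo_dir = Path(mkdtemp())
# Stand-in file for the demonstration only; it is not a valid PDF.
(demo_dir / "example.pdf").write_bytes(b"%PDF-1.4 placeholder")
print([p.name for p in find_files(demo_dir)])
```

Calling `find_files(tmp_dir_path)` before constructing `SimpleDirectoryReader` would surface an empty directory with a more specific message.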