<a href="https://colab.research.google.com/github/crystalloide/RAG/blob/main/rag_langchain_2026.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/docling-project/docling/blob/main/docs/examples/rag_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG with LangChain

| Step | Tech | Execution |
| --- | --- | --- |
| Embedding | Hugging Face / Sentence Transformers | 💻 Local |
| Vector store | Milvus | 💻 Local |
| Gen AI | Hugging Face Inference API | 🌐 Remote |

This example leverages the
[LangChain Docling](../../integrations/langchain/) integration, along with a Milvus vector store
and sentence-transformers embeddings.

The `DoclingLoader` component lets you:

- use various document types in LLM applications with ease and speed, and
- leverage Docling's rich format for advanced, document-native grounding.

`DoclingLoader` supports two export modes:

- `ExportType.MARKDOWN`: captures each input document as a separate LangChain document, or
- `ExportType.DOC_CHUNKS` (default): splits each input document into chunks and then captures each chunk as a separate LangChain document.

The example lets you explore both modes via the `EXPORT_TYPE` parameter; depending on the value set, the example pipeline is configured accordingly.
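The difference between the two export modes can be pictured with a small, dependency-free sketch. `ToyExportType` and `toy_load` are hypothetical stand-ins invented for illustration; they are not part of `langchain_docling`, and the real chunking done by Docling is not reproduced here.

```python
from enum import Enum


class ToyExportType(Enum):
    """Hypothetical stand-in for langchain_docling's ExportType."""

    MARKDOWN = "markdown"
    DOC_CHUNKS = "doc_chunks"


def toy_load(sources, export_type=ToyExportType.DOC_CHUNKS):
    """Return (source, text) pairs shaped like a loader's output.

    `sources` maps a source name to a list of already-chunked strings.
    """
    if export_type is ToyExportType.MARKDOWN:
        # One output document per input document.
        return [(name, "\n\n".join(chunks)) for name, chunks in sources.items()]
    if export_type is ToyExportType.DOC_CHUNKS:
        # One output document per chunk.
        return [(name, chunk) for name, chunks in sources.items() for chunk in chunks]
    raise ValueError(f"Unexpected export type: {export_type}")


# Made-up input: a single source that has three chunks.
sources = {"report.pdf": ["Abstract ...", "1 Introduction ...", "2 Getting started ..."]}
whole_docs = toy_load(sources, ToyExportType.MARKDOWN)
chunk_docs = toy_load(sources, ToyExportType.DOC_CHUNKS)
print(len(whole_docs), len(chunk_docs))
```

With one input source holding three chunks, the Markdown mode yields a single document while the chunk mode yields three.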

## Setup

👉 For best conversion speed, use GPU acceleration whenever available; for example, on Colab, use a GPU-enabled runtime.

This notebook uses the Hugging Face Inference API; to increase the LLM quota, a token can be provided via the `HF_TOKEN` environment variable.
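As a quick sanity check before running the notebook, the sketch below reports whether `HF_TOKEN` is visible to the Python process without printing the secret itself. The helper name `describe_hf_token` is hypothetical and exists only for this illustration.

```python
import os


def describe_hf_token(env=os.environ):
    """Report whether HF_TOKEN is set, without revealing its value."""
    token = env.get("HF_TOKEN")
    if not token:
        return "HF_TOKEN is not set; the default quota applies."
    return f"HF_TOKEN is set ({len(token)} characters)."


print(describe_hf_token())
```

Since the notebook calls `load_dotenv()`, a local `.env` file containing an `HF_TOKEN=...` line is one way to provide the variable.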

Dependencies can be installed as shown below (the `--no-warn-conflicts` option is meant for Colab's preconfigured Python environment; it can be removed for stricter usage).

In [10]:
%pip install -q --progress-bar off --no-warn-conflicts langchain-classic langchain-docling langchain-core langchain-huggingface langchain-milvus "pymilvus[milvus_lite]" langchain python-dotenv

In [3]:
import os
from pathlib import Path
from tempfile import mkdtemp

from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_docling.loader import ExportType


def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)


load_dotenv()

# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

HF_TOKEN = _get_env_from_colab_or_os("HF_TOKEN")
FILE_PATH = ["https://arxiv.org/pdf/2408.09869"]  # Docling Technical Report
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
#GEN_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
GEN_MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
    "Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")

## Document loading

Now we can instantiate our loader and load the documents.

In [4]:
from langchain_docling import DoclingLoader

from docling.chunking import HybridChunker

loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=EXPORT_TYPE,
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)

docs = loader.load()

[INFO] 2026-01-11 19:09:44,250 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2026-01-11 19:09:44,255 [RapidOCR] device_config.py:57: Using GPU device with ID: 0
[INFO] 2026-01-11 19:09:44,258 [RapidOCR] download_file.py:68: Initiating download: https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/v3.5.0/torch/PP-OCRv4/det/ch_PP-OCRv4_det_infer.pth
[INFO] 2026-01-11 19:09:45,497 [RapidOCR] download_file.py:82: Download size: 13.83MB
[INFO] 2026-01-11 19:09:45,625 [RapidOCR] download_file.py:95: Successfully saved to: /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_PP-OCRv4_det_infer.pth
[INFO] 2026-01-11 19:09:45,626 [RapidOCR] main.py:50: Using /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_PP-OCRv4_det_infer.pth
[INFO] 2026-01-11 19:09:46,073 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2026-01-11 19:09:46,074 [RapidOCR] device_config.py:57: Using GPU device with ID: 0

> Note: a message saying `"Token indices sequence length is longer than the specified maximum sequence length..."` can be ignored in this case; details are available [here](https://github.com/docling-project/docling-core/issues/119#issuecomment-2577418826).

Determining the splits:

In [5]:
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    splits = docs
elif EXPORT_TYPE == ExportType.MARKDOWN:
    from langchain_text_splitters import MarkdownHeaderTextSplitter

    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "Header_1"),
            ("##", "Header_2"),
            ("###", "Header_3"),
        ],
    )
    splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
else:
    raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
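To give an intuition for what header-based splitting does in the `MARKDOWN` branch, here is a simplified, standard-library-only sketch. `toy_split_by_headers` is a hypothetical helper written for this illustration; it is not how `MarkdownHeaderTextSplitter` is implemented and it ignores details such as fenced code blocks.

```python
def toy_split_by_headers(markdown_text, headers_to_split_on):
    """Split Markdown at the given header markers (simplified stand-in)."""
    # Check longer markers first so that "##" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    sections = []
    current_meta, current_lines = {}, []

    def flush():
        # Emit the accumulated body text together with the active headers.
        text = "\n".join(current_lines).strip()
        if text:
            sections.append({"metadata": dict(current_meta), "text": text})

    for line in markdown_text.splitlines():
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                current_lines.clear()
                # A new header resets headers at the same or a deeper level.
                for other_marker, other_name in markers:
                    if len(other_marker) >= len(marker):
                        current_meta.pop(other_name, None)
                current_meta[name] = line[len(marker) + 1 :].strip()
                break
        else:
            # Not a header line: keep it as body text of the current section.
            current_lines.append(line)
    flush()
    return sections


text = "# Title\nintro\n## Part A\nalpha\n## Part B\nbeta"
sections = toy_split_by_headers(text, [("#", "Header_1"), ("##", "Header_2")])
for section in sections:
    print(section)
```

Each resulting section carries the headers it sits under as metadata, which is the kind of information the `Header_1`, `Header_2`, and `Header_3` names in the cell above are meant to hold.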

Inspecting some sample splits:

In [6]:
for d in splits[:3]:
    print(f"- {d.page_content=}")
print("...")

- d.page_content='Version 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨ uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
- d.page_content='1 Introduction\nConverting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge var

## Ingestion

In [8]:
import json

from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus

embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)

# Reuse MILVUS_URI from the setup cell (a local database file in a temp directory)
vectorstore = Milvus.from_documents(
    documents=splits,
    embedding=embedding,
    collection_name="docling_demo",
    connection_args={"uri": MILVUS_URI},
    index_params={"index_type": "FLAT"},
    drop_old=True,
)
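The `FLAT` index type requests an exhaustive search: the query vector is compared against every stored vector. The sketch below shows that idea in plain Python. The three-dimensional vectors are made up, the helper names are hypothetical, and the use of cosine similarity is an assumption for illustration rather than a statement about the metric configured in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flat_search(query_vector, indexed, k):
    """Score every stored vector and return the k best (score, text) pairs."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up "embeddings"; real ones come from the embedding model.
indexed = [
    ("layout analysis model", [0.9, 0.1, 0.0]),
    ("table structure recognition", [0.7, 0.3, 0.1]),
    ("installation instructions", [0.0, 0.2, 0.9]),
]
hits = flat_search([1.0, 0.0, 0.0], indexed, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

Exhaustive search is exact but scales linearly with the number of stored vectors, which is acceptable for a single-document demo like this one.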

## RAG

In [11]:
from langchain_classic.chains import create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_huggingface import HuggingFaceEndpoint, ChatHuggingFace

retriever = vectorstore.as_retriever(search_kwargs={"k": TOP_K})

# Create the endpoint with the conversational task
endpoint = HuggingFaceEndpoint(
    repo_id=GEN_MODEL_ID,
    huggingfacehub_api_token=HF_TOKEN,
    task="conversational",
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
)

# Wrap the endpoint as a chat model for compatibility with LangChain chains
llm = ChatHuggingFace(llm=endpoint)


def clip_text(text, threshold=100):
    return f"{text[:threshold]}..." if len(text) > threshold else text
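Conceptually, a retrieval chain fetches the top passages for the question, "stuffs" them into the prompt's `{context}` slot, and passes the filled prompt to the model. The sketch below mimics that data flow with stand-ins: `toy_retriever`, `toy_llm`, and `toy_rag_chain` are hypothetical, the passages are placeholders, and joining passages with a blank line is an assumption. Only the prompt text and the `input` / `context` / `answer` keys mirror this notebook.

```python
PROMPT_TEXT = (
    "Context information is below.\n---------------------\n{context}\n"
    "---------------------\nGiven the context information and not prior knowledge, "
    "answer the query.\nQuery: {input}\nAnswer:\n"
)


def toy_retriever(question, k=3):
    """Hypothetical retriever: returns canned passages instead of querying Milvus."""
    passages = [
        "placeholder passage A",
        "placeholder passage B",
        "placeholder passage C",
        "placeholder passage D",
    ]
    return passages[:k]


def toy_llm(prompt):
    """Hypothetical model: reports the prompt size instead of generating text."""
    return f"(model output for a prompt of {len(prompt)} characters)"


def toy_rag_chain(question, k=3):
    """Retrieve, stuff the passages into the prompt, then call the model."""
    context_docs = toy_retriever(question, k=k)
    # "Stuffing": all retrieved passages are joined into one context string.
    context = "\n\n".join(context_docs)
    prompt = PROMPT_TEXT.format(context=context, input=question)
    return {"input": question, "context": context_docs, "answer": toy_llm(prompt)}


resp = toy_rag_chain("Which are the main AI models in Docling?", k=2)
print(sorted(resp))
```

Because every retrieved passage ends up in a single prompt, `TOP_K` and the chunk size together determine how much of the model's context window is used.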

In [12]:
question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
resp_dict = rag_chain.invoke({"input": QUESTION})

clipped_answer = clip_text(resp_dict["answer"], threshold=200)
print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{clipped_answer}")
for i, doc in enumerate(resp_dict["context"]):
    print()
    print(f"Source {i + 1}:")
    print(f"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}")
    for key in doc.metadata:
        if key != "pk":
            val = doc.metadata.get(key)
            clipped_val = clip_text(val) if isinstance(val, str) else val
            print(f"  {key}: {clipped_val}")

Question:
Which are the main AI models in Docling?

Answer:
 The main AI models in Docling are a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, while TableFormer is a state-of-the-art table st...

Source 1:
  text: "3.2 AI models\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re..."
  source: https://arxiv.org/pdf/2408.09869
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'content_layer': 'body', 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 404.873, 'r': 504.003, 'b': 330.866, 'coord_orig