# RAG with LangChain

This example uses the LangChain Docling integration together with a Milvus vector store and sentence-transformers embeddings.

The `DoclingLoader` component enables you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich format for advanced, document-native grounding.


## Setup

In [1]:
!pip install -q --progress-bar off --no-warn-conflicts langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Environment and Imports

In [None]:
import os
from pathlib import Path
from tempfile import mkdtemp

from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_docling.loader import ExportType

def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata
        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)

load_dotenv()

# Avoid tokenizer parallelism warnings
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# Configuration variables
HF_TOKEN = _get_env_from_colab_or_os('HF_TOKEN')
FILE_PATH = ['tests/AR_2020_WEB2.pdf']
EMBED_MODEL_ID = 'sentence-transformers/all-MiniLM-L6-v2'
GEN_MODEL_ID = 'google/flan-t5-base'  # Lightweight, reliable model for Q&A
EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = 'What are the financials for donated food?'
PROMPT = PromptTemplate.from_template(
    """Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {input}
Answer:
""",)
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / 'docling.db')




## Document Loading

In [3]:
from langchain_docling import DoclingLoader
from docling.chunking import HybridChunker

loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=EXPORT_TYPE,
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)

docs = loader.load()

Token indices sequence length is longer than the specified maximum sequence length for this model (681 > 512). Running this sequence through the model will result in indexing errors


## Determining the Splits

In [4]:
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    splits = docs
elif EXPORT_TYPE == ExportType.MARKDOWN:
    from langchain_text_splitters import MarkdownHeaderTextSplitter

    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ('#', 'Header_1'),
            ('##', 'Header_2'),
            ('###', 'Header_3'),
        ],
    )
    splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
else:
    raise ValueError(f'Unexpected export type: {EXPORT_TYPE}')
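
When `EXPORT_TYPE` is `ExportType.MARKDOWN`, the splitting step cuts each document at its Markdown headers and records the enclosing headers as metadata. The following is a simplified, standard-library-only sketch of that idea; it is not the `MarkdownHeaderTextSplitter` implementation, and the function name `split_markdown_by_headers` and the sample text are hypothetical.

```python
import re


def split_markdown_by_headers(text, max_level=3):
    """Split Markdown into sections, tracking the active header at each level.

    Hypothetical helper for illustration only; not part of LangChain.
    """
    header_re = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    sections = []
    active = {}  # header level -> header title currently in effect
    buffer = []  # body lines collected since the last header

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            metadata = {f"Header_{lvl}": title for lvl, title in sorted(active.items())}
            sections.append({"page_content": body, "metadata": metadata})
        buffer.clear()

    for line in text.splitlines():
        match = header_re.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for lvl in [l for l in active if l >= level]:
                del active[lvl]
            active[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections


# Made-up sample text, not taken from the PDF used in this notebook.
sample = "# Report\nIntro text.\n## Finance\nDonated food rose.\n## People\nVolunteers helped."
sections = split_markdown_by_headers(sample)
for section in sections:
    print(section["metadata"], "->", section["page_content"])
```

Each section carries the headers that enclose it, so a retrieved chunk can still be traced back to its place in the document.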

### Inspecting Sample Splits

In [5]:
for d in splits[:3]:
    print(f"- {d.page_content=}")
print('...')

- d.page_content='bridging the gap between poverty and prosperity'
- d.page_content='A message from Co-Founder, President, and CEO, David Kieser\nNo one could have predicted the events of 2020. The global COVID-19 pandemic created a dynamic year. With the help of volunteers, donors, staff, and most importantly, the blessings of God, Midwest Food Bank responded nimbly to the changing landscape.\nAll  MFB  locations  remained  open  and  responsive  to  the  need  of  our nonprofit partners. We enacted safety protocols and reduced volunteer numbers  to  maintain  social  distancing  guidelines.  To  allow  partner agencies to receive food from MFB safely, we altered our distribution model.  Community,  business,  and  donor  support  funded  operations and helped with food purchases. More details on our response to the pandemic are on page 14.'
- d.page_content='Noteworthy in 2020:\n• MFB distributed a record amount of food, 37% more than in 2019.\n· In 2020, we sent a record number of f

## Ingestion

In [6]:
import json  # used in the RAG cell below to print source text

from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus

embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)

# Reuse MILVUS_URI from the configuration cell instead of creating a second temp directory
vectorstore = Milvus.from_documents(
    documents=splits,
    embedding=embedding,
    collection_name='docling_demo',
    connection_args={'uri': MILVUS_URI},
    index_params={'index_type': 'FLAT'},
    drop_old=True,
)
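
The `FLAT` index type requested above means an exhaustive search: the query vector is compared against every stored vector, with no approximation. The sketch below illustrates that idea in plain Python. The tiny three-dimensional vectors, the `flat_search` helper, and the choice of cosine similarity are all assumptions for illustration; the metric actually used depends on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flat_search(query_vec, stored, k=3):
    """Score every stored vector against the query and keep the k best.

    Hypothetical helper: this is brute-force search, the behaviour a FLAT
    index is meant to provide, not Milvus code.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ones come from the embedding model and are much longer.
toy_store = [
    ("donated food totals", [0.9, 0.1, 0.0]),
    ("volunteer hours", [0.1, 0.9, 0.0]),
    ("message from the CEO", [0.0, 0.2, 0.9]),
]
hits = flat_search([1.0, 0.0, 0.0], toy_store, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

Exhaustive search is exact but scales linearly with the number of stored vectors, which is acceptable for a single-document demo like this one.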

## RAG
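
The chain in this section retrieves the `TOP_K` most similar chunks, "stuffs" their text into the `{context}` slot of `PROMPT`, and sends the filled prompt to the LLM. The sketch below imitates only the stuffing step with plain string formatting. The `stuff_documents` helper and the blank-line separator are assumptions for illustration, and the two sample chunks are sentences taken from the source output shown in this section.

```python
# Same wording as PROMPT in the configuration cell, as a plain string.
PROMPT_TEXT = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {input}\n"
    "Answer:\n"
)


def stuff_documents(chunks, question, separator="\n\n"):
    """Join retrieved chunk texts and fill the prompt template.

    Hypothetical helper; the separator is an assumed default.
    """
    context = separator.join(chunks)
    return PROMPT_TEXT.format(context=context, input=question)


retrieved = [
    "Donated food, vehicles, equipment and services, 2018 = $229,928,875.",
    "Donated food, vehicles, equipment and services, 2019 = $270,649,371.",
]
final_prompt = stuff_documents(retrieved, "What are the financials for donated food?")
print(final_prompt)
```

Because every retrieved chunk is placed in one prompt, `TOP_K` and the chunk size together determine how much of the model's context window is consumed.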

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_huggingface import HuggingFaceEndpoint

retriever = vectorstore.as_retriever(search_kwargs={'k': TOP_K})
llm = HuggingFaceEndpoint(
    repo_id=GEN_MODEL_ID,
    temperature=0.5,
    max_new_tokens=512,
    huggingfacehub_api_token=HF_TOKEN,
)

def clip_text(text, threshold=100):
    return f"{text[:threshold]}..." if len(text) > threshold else text

question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
resp_dict = rag_chain.invoke({'input': QUESTION})

clipped_answer = clip_text(resp_dict['answer'], threshold=200)
print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{clipped_answer}")
for i, doc in enumerate(resp_dict['context']):
    print(f"\nSource {i + 1}:")
    print(f"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}")
    for key, val in doc.metadata.items():
        if key != 'pk':
            clipped_val = clip_text(val) if isinstance(val, str) else val
            print(f"  {key}: {clipped_val}")

Question:
What are the financials for donated food?

Answer:
YEARS ENDING DECEMBER 31, 2018, 2019 AND 2020
Donated food, vehicles, equipment and services, 2018 = $229,928,875. Donated food, vehicles, equipment and services, 2019 = $270,649,371. Donated food, ve...

Source 1:
  text: "YEARS ENDING DECEMBER 31, 2018, 2019 AND 2020\nDonated food, vehicles, equipment and services, 2018 = $229,928,875. Donated food, vehicles, equipment and services, 2019 = $270,649,371. Donated food, vehicles, equipment and services, 2020* = $397,204,944. General public contributions and grants, 2018 = $7,436,910. General public contributions and gra..."
  dl_meta