# Building an AI-Powered Document Retrieval System with Docling and Granite

*Using IBM Granite Models*

## Recipe Overview

Welcome to this Granite recipe. In it, you'll learn how to build an AI-powered document retrieval system. The recipe guides you through:

- **Document Processing:** Learn to handle documents from various sources, parse and transform them into usable formats, and store them in vector databases using Docling.
- **Retrieval-Augmented Generation (RAG):** Understand how to connect large language models (LLMs) like Granite with external knowledge bases to enhance query responses and generate valuable insights.
- **LangChain for Workflow Integration:** Discover how to use LangChain to streamline and orchestrate document processing and retrieval workflows, enabling seamless interaction between different components of the system.

This recipe leverages three cutting-edge technologies:

1. **[Docling](https://docling-project.github.io/docling/):** An open-source toolkit for parsing and converting documents.
2. **[Granite](https://www.ibm.com/granite/docs/models/granite/):** A state-of-the-art LLM available via an [API](https://www.ibm.com/topics/api) through Replicate, providing robust natural language capabilities.
3. **[LangChain](https://github.com/langchain-ai/langchain):** A powerful framework for building applications powered by language models, designed to simplify complex workflows and integrate external tools seamlessly.

By the end of this recipe, you will:
- Gain proficiency in document processing and chunking.
- Integrate vector databases to enhance retrieval capabilities.
- Utilize RAG to perform efficient and accurate data retrieval for real-world applications.

This recipe is designed for AI developers, researchers, and enthusiasts looking to enhance their knowledge of document management and advanced NLP techniques.



## Prerequisites

- Familiarity with Python programming.
- Basic understanding of large language models and natural language processing concepts.


## Step 1: Setting up the environment

Install dependencies.

In [None]:
! echo "::group::Install Dependencies"
%pip install uv
! uv pip install git+https://github.com/ibm-granite-community/utils \
    transformers \
    langchain \
    'langchain_huggingface[full]' \
    langchain_milvus 'pymilvus[milvus_lite]' \
    docling \
    'langchain_replicate @ git+https://github.com/ibm-granite-community/langchain-replicate.git'
! echo "::endgroup::"

::group::Install Dependencies
Using Python 3.12.11 environment at: /usr
Resolved 137 packages in 671ms
Audited 137 packages in 7ms
::endgroup::


## Step 2: Selecting System Components

### Choose your Embeddings Model

Specify the model to use for generating embedding vectors from text. Here we use one of the new [Granite Embeddings models](https://huggingface.co/collections/ibm-granite/granite-embedding-models-6750b30c802c1926a35550bb).

To use a model from another provider, replace this code cell with one from [this Embeddings Model recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Embeddings_Models.ipynb).

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer

embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbeddings(
    model_name=embeddings_model_path,
)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)
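
As a rough illustration of how embedding vectors are compared, the sketch below computes cosine similarity between small hand-made vectors. The three-dimensional vectors and the `cosine_similarity` helper are assumptions made up for this example; a real embeddings model produces vectors with hundreds of dimensions, and the vector database performs this kind of comparison for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; real ones come from the embeddings model.
query_vec = [0.9, 0.1, 0.0]
fight_vec = [0.8, 0.2, 0.1]   # stands in for text about a fight result
recipe_vec = [0.0, 0.1, 0.9]  # stands in for unrelated text

fight_score = cosine_similarity(query_vec, fight_vec)
recipe_score = cosine_similarity(query_vec, recipe_vec)
print(fight_score > recipe_score)
```

Texts with similar meaning should end up with a higher score than unrelated texts, which is the property retrieval relies on.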

### Use the Granite model

Select a Granite model from the [`ibm-granite`](https://replicate.com/ibm-granite) org on Replicate. Here we use the Replicate LangChain client to connect to the model.

To get set up with Replicate, see [Getting Started with Replicate](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Getting_Started/Getting_Started_with_Replicate.ipynb).

To connect to a model on a provider other than Replicate, substitute this code cell with one from the [LLM component recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).

In [None]:
from langchain_replicate import ChatReplicate
from ibm_granite_community.notebook_utils import get_env_var

model_path = "ibm-granite/granite-4.0-h-small"
model = ChatReplicate(
    model=model_path,
    replicate_api_token=get_env_var("API"),
    model_kwargs={
        "max_tokens": 1000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 100, # Set the minimum number of tokens to generate as output.
    },
)

API not found in Google Colab secrets.
Please enter your API: ··········
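
The prompt in the output above appears because the value was not found in the notebook's stored secrets. A minimal sketch of that lookup pattern, using only the standard library, is shown here. `read_secret` is a hypothetical helper written for illustration and is not the implementation of `get_env_var`; the variable name and value are placeholders.

```python
import os
from getpass import getpass


def read_secret(name: str, prompt_if_missing: bool = True) -> str:
    """Return a secret from the environment, optionally prompting for it."""
    value = os.environ.get(name)
    if value:
        return value
    if not prompt_if_missing:
        raise KeyError(f"{name} is not set")
    # getpass hides the typed value, so it does not end up in notebook output.
    value = getpass(f"Please enter your {name}: ")
    os.environ[name] = value  # cache for later cells in this session
    return value


# Hypothetical variable name and value, set here only so the example runs
# without prompting.
os.environ["EXAMPLE_API_TOKEN"] = "not-a-real-token"
print(read_secret("EXAMPLE_API_TOKEN"))
```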


Now that we have connected to the model, let's try asking it a question.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

query = "Who won in the Pantoja vs Asakura fight at UFC 310?"

# Create a Granite prompt for question-answering
prompt_template = ChatPromptTemplate.from_template(template="{input}")

chain = prompt_template | model

output = chain.invoke({"input": query})

print(output.text())

As of my last update in April 2023, there has been no UFC event numbered UFC 310. The UFC event numbering typically progresses sequentially, and as of my last update, the most recent UFC event was UFC 302, which took place on June 1, 2024. It's possible that you might be referring to a different event or there might have been a misunderstanding regarding the event number or date.

If you're referring to a specific fight between Pantoja and Asakura, it's also possible that such a matchup hasn't occurred in the UFC or that the information might be outdated or incorrect. Fighters' schedules and matchups can change frequently due to various factors such as injuries, weight class changes, or other logistical considerations.

For the most accurate and up-to-date information on UFC events and fight outcomes, I recommend checking the official UFC website or their official social media channels.


We know that UFC 310 took place in 2024, so this answer is wrong. The model does not know the result, although it does acknowledge that its information may be outdated. Let's see whether it knows some specific UFC rules.

In [None]:
query1 = "How much weight allowance is allowed in non championship fights in the UFC?"

output = chain.invoke({"input": query1})

print(output.text())

In non-championship fights in the UFC, there is typically no weight allowance. Fighters are expected to make the agreed-upon weight limit for their respective weight class. If a fighter fails to make weight, they may face penalties such as a percentage of their purse being deducted and the fight proceeding as a catchweight bout if both fighters agree.

However, in some cases, particularly for title fights, a small weight allowance may be granted. For example, in championship bouts, a 1-pound allowance is sometimes permitted. This means that the champion can weigh up to 1 pound over the limit on fight night, while the challenger must make the exact weight limit. If the champion fails to make even the 1-pound allowance, the fight may be cancelled or turned into a non-title bout.

It's important to note that weight allowances and penalties can vary depending on the specific circumstances and the UFC's discretion. The primary goal is to ensure fighter safety and fairness in competition.


According to the official UFC rules, this is also incorrect. Let's give the model some documents that contain this information.

### Choose your Vector Database

Specify the database to use for storing and retrieving embedding vectors.

To connect to a vector database other than Milvus, replace this code cell with one from [this Vector Store recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Vector_Stores.ipynb).

In [None]:
from langchain_milvus import Milvus
import tempfile

db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)

The vector database will be saved to /tmp/milvus_pen_ir43.db
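
Conceptually, a vector database stores each embedding next to its text and returns the entries nearest to a query vector. The sketch below shows that idea with a brute-force, in-memory stand-in. `ToyVectorStore` and its two-dimensional vectors are invented for this example; it ranks by Euclidean distance as one common choice, and it says nothing about how Milvus indexes or scores vectors.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToyVectorStore:
    """Keeps (vector, text) pairs in memory and searches them by distance."""

    rows: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, vector: list[float], text: str) -> int:
        """Store one entry and return its position as a simple id."""
        self.rows.append((vector, text))
        return len(self.rows) - 1

    def search(self, query: list[float], k: int = 2) -> list[str]:
        """Return the texts of the k entries closest to the query vector."""
        ranked = sorted(self.rows, key=lambda row: math.dist(query, row[0]))
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "fight results")
store.add([0.0, 1.0], "weigh-in rules")
store.add([0.9, 0.1], "post-fight interviews")

print(store.search([1.0, 0.0], k=2))
```

A brute-force scan like this compares the query against every stored vector, which is fine for a handful of entries; dedicated vector databases exist to make the same lookup practical at much larger scale.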


## Step 3: Building the Vector Database

In this example, we use [Docling](https://docling-project.github.io/docling/) to convert a set of source documents into text, split the text into chunks, derive embedding vectors with the embeddings model, and load them into the vector database. This vector database lets us search across our documents, which is what makes RAG possible.

### Use Docling to download the documents, convert to text, and split into chunks

Here we have found a website that gives us information on UFC 310, as well as a PDF of the official UFC rules. Below, we will see that Docling can both convert and chunk the two documents.

In [None]:
# Docling imports
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc.labels import DocItemLabel
from langchain_core.documents import Document

# Here are our documents. Feel free to add more in any format that Docling supports.
sources = [
    "https://www.ufc.com/news/main-card-results-highlights-winner-interviews-ufc-310-pantoja-vs-asakura",
    "https://media.ufc.tv/discover-ufc/Unified_Rules_MMA.pdf",
]

converter = DocumentConverter()

# Convert and chunk our documents, keeping only chunks that contain body text
doc_id = 0
texts: list[Document] = [
    Document(page_content=chunk.text, metadata={"doc_id": (doc_id:=doc_id+1), "source": source})
    for source in sources
    for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(converter.convert(source=source).document)
    if any(item.label in (DocItemLabel.TEXT, DocItemLabel.PARAGRAPH) for item in chunk.meta.doc_items)
]

print(f"{len(texts)} document chunks created")

Token indices sequence length is longer than the specified maximum sequence length for this model (517 > 512). Running this sequence through the model will result in indexing errors


24 document chunks created
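
The chunking step above groups document text into pieces sized for the embeddings model. A simplified sketch of that idea follows: it greedily packs paragraphs into chunks under a token budget. It is not `HybridChunker`'s algorithm; `pack_paragraphs` is a hypothetical helper, and counting whitespace-separated words stands in for the real tokenizer.

```python
def pack_paragraphs(paragraphs: list[str], max_tokens: int) -> list[str]:
    """Greedily merge consecutive paragraphs into chunks of at most max_tokens.

    Token counts are approximated by whitespace-separated words. A single
    paragraph longer than the budget is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in paragraphs:
        n_tokens = len(paragraph.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_tokens
    if current:
        chunks.append("\n".join(current))
    return chunks


paragraphs = [
    "one two three",
    "four five",
    "six seven eight nine",
    "ten",
]
for chunk in pack_paragraphs(paragraphs, max_tokens=5):
    print(repr(chunk))
```

Keeping paragraphs intact, rather than cutting at a fixed character count, helps each chunk stay readable on its own when it is later shown to the model as context.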


In [None]:
# Print all created documents
for document in texts:
    print(f"Document ID: {document.metadata['doc_id']}")
    print(f"Source: {document.metadata['source']}")
    print(f"Content:\n{document.page_content}")
    print("=" * 80)  # Separator for clarity

Document ID: 1
Source: https://www.ufc.com/news/main-card-results-highlights-winner-interviews-ufc-310-pantoja-vs-asakura
Content:
- [UFC Video Archive](https://imgvideoarchive.com/client/ufc?utm_source=ufc&utm_medium=website&utm_campaign=partner_marketing)
- [PODCASTS](https://www.ufc.com/podcasts)
- [SHOP](https://www.ufcstore.com/en/?_s=bm-UFCStore-UFC.com-Shop-UFC_Navigation-2025)
- [VENUM](https://www.ufcstore.com/en/venum/br-4523273600+z-959633-3205242604?_s=bm-UFCStore_Venum-UFC.com-Shop-UFC_Navigation-2025)
- [Apparel](https://www.ufcstore.com/en/apparel/c-3450654379+z-983054-2354459266?_s=bm-UFCStore_Apparel-UFC.com-Shop-UFC_Navigation-2025)
- [UFC COLLECTIBLES](https://ufccollectibles.com/?utm_source=referral&utm_medium=ufc%20website%20navigation%20link&utm_campaign=partner-referral)
- [UFC STRIKE](https://ufcstrike.com/)
- [WHAT'S NEW](/consumer-products)
- [Thorne Performance Solutions](https://www.thorne.com/partners/ufc)
[Previous](/news/prelim-results-highlights-winner-i

### Populate the vector database

NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

In [None]:
ids = vector_db.add_documents(texts)
print(f"{len(ids)} documents added to the vector database")

24 documents added to the vector database


## Step 4: RAG with Granite

Now that we have successfully converted our documents and vectorized them, we can set up our RAG pipeline.

### Retrieve relevant chunks



Here we test the `as_retriever` method by searching our newly created vector database for chunks that are relevant to our original query.



In [None]:
retriever = vector_db.as_retriever()

docs = retriever.invoke(query)
print(docs)

[Document(metadata={'pk': 461324973141131265, 'doc_id': 2, 'source': 'https://www.ufc.com/news/main-card-results-highlights-winner-interviews-ufc-310-pantoja-vs-asakura'}, page_content='See The Fight Results, Watch Post-Fight Interviews With The Main Card Winners And More From UFC 310: Pantoja vs Asakura, Live From T-Mobile Arena In Las Vegas, Nevada\nBy E. Spencer Kyte, On X @spencerkyte\n• Dec. 8, 2024\nThe UFC 310 preliminary card slate was outstanding, featuring six finishes and trio of entertaining three-round battles, setting the stage for a captivating pay-per-view main card at T-Mobile Arena in Las Vegas.\nAnd the action in the Octagon delivered in a massive way.\nDooho Choi kicked off the festivities with a standout performance against Nate Landwehr, finishing from a mounted crucifix in the third round before Bryce Mitchell followed suit one fight later, putting Kron Gracie to sleep with a pair of thudding elbows from inside his guard. After heavyweight contenders Ciryl Gane a
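
Printing the raw list is hard to read. A small helper such as the hypothetical `preview` below can show one line per result instead. It assumes only that each result has `metadata` and `page_content` attributes, as the `Document` objects above do; `SimpleNamespace` objects with made-up content stand in for them so the sketch runs without the retriever.

```python
from types import SimpleNamespace


def preview(docs, width: int = 60) -> list[str]:
    """Return one summary line per document: doc_id, source, text snippet."""
    lines = []
    for doc in docs:
        # Collapse whitespace so multi-line chunks fit on a single line.
        snippet = " ".join(doc.page_content.split())
        if len(snippet) > width:
            snippet = snippet[:width] + "..."
        doc_id = doc.metadata.get("doc_id", "?")
        source = doc.metadata.get("source", "unknown")
        lines.append(f"[{doc_id}] {source}: {snippet}")
    return lines


# Stand-in objects with made-up content.
fake_docs = [
    SimpleNamespace(
        metadata={"doc_id": 2, "source": "example.html"},
        page_content="First line\nsecond line of the chunk",
    ),
    SimpleNamespace(metadata={}, page_content="x" * 80),
]
for line in preview(fake_docs, width=20):
    print(line)
```

In the notebook, the same helper could be called on the list returned by `retriever.invoke(query)`.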

It looks like the retriever pulled some chunks that contain the information we are looking for. Let's go ahead and construct our RAG pipeline.

### Create the prompt for Granite

Next, we construct the prompt pipeline. It builds a prompt that holds the chunks retrieved by our previous search and feeds them to the model as context for answering our question.

In [None]:
from ibm_granite_community.langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

# Assemble the retrieval-augmented generation chain
combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=prompt_template,
)
rag_chain = create_retrieval_chain(
    retriever=vector_db.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)
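
The "stuff" strategy places all retrieved chunks into a single model call. The sketch below shows one generic way to do that with plain strings. It is an illustration only: `build_stuffed_prompt` and its prompt wording are assumptions, and the `create_stuff_documents_chain` helper used above may pass the documents to the model differently.

```python
def build_stuffed_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and a question into one prompt string."""
    # Number each chunk so an answer can refer back to its source passage.
    context = "\n\n".join(
        f"[Document {i}]\n{chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the documents below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


# Made-up chunks used only to show the resulting layout.
prompt = build_stuffed_prompt(
    "Who won the main event?",
    ["  Fighter A defeated Fighter B.  ", "The event took place in Las Vegas."],
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits within the model's context window.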

### Generate a retrieval-augmented response to a question

The pipeline uses the query to locate relevant documents in the vector database and passes them to the model as context for answering the query.

In [None]:
output = rag_chain.invoke({"input": query})

print(output['answer'])

Alexandre Pantoja won the fight against Kai Asakura at UFC 310. He successfully defended his UFC flyweight title by submitting Asakura in the second round. This was Pantoja's third consecutive successful title defense, and he continues to be a dominant force in the 125-pound weight class. The fight took place on December 7, 2024, at T-Mobile Arena in Las Vegas, Nevada. Pantoja's performance was impressive, and he remains on a seven-fight winning streak.


Awesome! It looks like the model answered our first question. Let's see if it can also find the rule we were looking for.

In [None]:
output = rag_chain.invoke({"input": query1})

print(output['answer'])

In non-championship fights in the UFC, there is allowed a 1 pound weigh allowance. This means that a fighter can weigh up to 1 pound over the specified weight limit for their weight class and still be eligible to compete in the fight. This allowance is in place to account for slight variations in weight leading up to the fight and to ensure that the fight can still proceed as scheduled. It is important for fighters to make weight in order to compete safely and fairly within their designated weight class.


Awesome! We can now see that we have created a pipeline that can successfully leverage knowledge from multiple document types for generation.

## Next Steps

- Explore advanced RAG workflows for other industries.
- Experiment with other document types and larger datasets.
- Optimize prompt engineering for better Granite responses.

Thank you for using this recipe!