UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

UniDoc-RL teaser

🤗 Hugging Face Releases

🚀 Overview

  • UniDoc-RL is a unified reinforcement learning framework for visual document RAG, where an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning within a single decision process.
  • It formulates visual evidence acquisition as a hierarchical sequential decision-making problem, progressively refining information from coarse document retrieval to fine-grained image selection and region cropping.
  • To support this training paradigm, we build and release a high-quality dataset of multi-turn reasoning trajectories with fine-grained action annotations, and validate the framework through extensive experiments on three benchmarks.

In UniDoc-RL, the model interacts with an external environment through structured actions such as <search>, <select>, <bbox>, and <answer>. This design enables the agent to progressively gather evidence from coarse page-level retrieval to fine-grained region inspection, which is especially useful for charts, tables, dense text regions, and multi-page evidence aggregation.
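To illustrate how such structured actions can be pulled out of model output, here is a minimal parser sketch. The tag set follows the actions listed above, but the exact argument formats (plain-text query, page id, box coordinates) are illustrative assumptions, not the repository's actual parser.

```python
import re

# Matches one structured action tag, e.g. <search>query</search>.
# The tag names follow the actions described above; the argument
# formats inside the tags are assumptions for illustration.
ACTION_RE = re.compile(r"<(search|select|bbox|answer)>(.*?)</\1>", re.DOTALL)

def parse_actions(model_output: str) -> list[tuple[str, str]]:
    """Extract (action, argument) pairs in the order they appear."""
    return [(m.group(1), m.group(2).strip()) for m in ACTION_RE.finditer(model_output)]

print(parse_actions("Let me look this up. <search>top-1 accuracy ablation</search>"))
# → [('search', 'top-1 accuracy ablation')]
```

The environment can then dispatch on the action name to run retrieval, page selection, or region cropping.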

UniDoc-RL framework

We emphasize two key properties of UniDoc-RL: progressive evidence acquisition and region-grounded interaction. The agent first searches over page images, then selects the most informative visual evidence, and finally performs targeted crop-and-zoom operations when the answer depends on local fine-grained content.

UniDoc-RL case study

⚙️ Train Model with UniDoc-RL

Training Dependencies

conda create -n unidoc_rl python=3.10
conda activate unidoc_rl

git clone https://github.com/deepglint/UniDoc-RL.git
cd UniDoc-RL

Step 1. Prepare Data.

Benchmark & Training Data

First, download the training datasets used in the paper from their official release pages or project links. After downloading, reorganize all samples into a unified JSON format so they can be further converted for synthetic data generation.

Every sample should contain:

  • a unique sample id,
  • the user question,
  • the reference answer,
  • the document or page identifier,
  • optional metadata such as page index, evidence type, and question type.
{
  "uid": "sample_000001",
  "query": "What is the reported top-1 accuracy in the ablation study?",
  "reference_answer": "84.7%",
  "meta_info": {
    "file_name": "example_document",
    "reference_page": [12],
    "source_type": "Text/Table",
    "query_type": "Single-Hop"
  }
}
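A small validation helper can catch malformed samples before conversion. This is a sketch; the required keys simply mirror the example above.

```python
import json

# Top-level keys taken from the unified JSON example above.
REQUIRED_KEYS = {"uid", "query", "reference_answer", "meta_info"}

def validate_sample(sample: dict) -> list[str]:
    """Return a list of problems with one unified-format sample (empty if OK)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - sample.keys())]
    meta = sample.get("meta_info", {})
    if "file_name" not in meta:
        problems.append("meta_info.file_name is required to locate the document")
    return problems

sample = json.loads("""{
  "uid": "sample_000001",
  "query": "What is the reported top-1 accuracy in the ablation study?",
  "reference_answer": "84.7%",
  "meta_info": {"file_name": "example_document", "reference_page": [12]}
}""")
print(validate_sample(sample))  # → []
```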

Step 2. Build Training Corpus & Run Multi-Modal Search Engine.

UniDoc-RL trains against a retrieval environment built on document page images. You should first convert each document into page-level images and organize them into a corpus directory.
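For example, page images can be laid out one sub-directory per document. The layout and the `{doc}_{page:04d}.png` naming below are assumptions for illustration; adapt them to whatever `ingestion.py` actually expects.

```python
from pathlib import Path
import shutil

def organize_corpus(page_images: dict[str, list[str]], corpus_dir: str) -> None:
    """Copy page images into corpus_dir/<doc>/<doc>_<page>.png.

    page_images maps a document name to its ordered page-image paths.
    The naming scheme here is an assumed convention, not the tool's spec.
    """
    root = Path(corpus_dir)
    for doc, pages in page_images.items():
        doc_dir = root / doc
        doc_dir.mkdir(parents=True, exist_ok=True)
        for i, src in enumerate(pages):
            shutil.copy(src, doc_dir / f"{doc}_{i:04d}.png")
```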

Install the dependencies for the retrieval toolkit:

cd tools
pip install -r requirements.txt

Build the retrieval index:

cd tools/search_engine
python ingestion.py --dataset_dir /path/to/your/corpus

Then launch the search engine API:

cd tools/search_engine
SEARCH_DATASET_DIR=/path/to/your/corpus \
SEARCH_PORT=9001 \
python search_engine_api.py

By default, the retrieval module uses a ColQwen-style vision-language embedding backend and returns the top-k relevant page images for each search action.
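Once the API is up, a search action can be issued over HTTP. The sketch below uses only the standard library; the JSON payload fields (`query`, `top_k`) and the response shape are assumptions about the API's contract.

```python
import json
import urllib.request

SEARCH_URL = "http://127.0.0.1:9001/search"  # matches SEARCH_PORT above

def build_search_request(query: str, top_k: int = 5) -> urllib.request.Request:
    """Build a POST request for one search action (payload field names assumed)."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode()
    return urllib.request.Request(
        SEARCH_URL, data=payload, headers={"Content-Type": "application/json"}
    )

def search(query: str, top_k: int = 5) -> list[dict]:
    """Send the request and return the retrieved page-image records."""
    with urllib.request.urlopen(build_search_request(query, top_k)) as resp:
        return json.loads(resp.read())
```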

Step 3. Construct High-quality CoT & Learn Patterns via SFT.

Before RL, UniDoc-RL supports constructing high-quality multi-turn trajectories with the synthetic data pipeline in synthetic_data/. This stage can generate search, rerank, crop, and answer traces that are useful for bootstrapping the model with supervised fine-tuning.

Install the dependencies for synthetic trajectory generation:

cd synthetic_data
pip install -r requirements.txt

Example command:

cd synthetic_data
python generate_cot_data.py \
  --data-path /path/to/data.json \
  --output-path /path/to/output.jsonl \
  --search-engine-url http://127.0.0.1:9001/search \
  --layout-parser-url http://127.0.0.1:30000 \
  --llm-engine-url http://127.0.0.1:9009/v1 \
  --vllm-engine-url http://127.0.0.1:9000/v1 \
  --enable-rerank \
  --enable-crop

To convert the generated JSONL trajectories into the open-source ShareGPT-style multimodal format expected by LLaMA-Factory:

cd synthetic_data
python convert_to_llamafactory.py \
  --input-path /path/to/output.jsonl \
  --output-path ../LLaMA-Factory/data/unidoc_sft.json

Once converted, these trajectories serve as chat-style training data for SFT. In our framework, this stage teaches the model the basic interaction pattern of:

  • reasoning before acting,
  • issuing retrieval queries when evidence is missing,
  • selecting the most relevant page image,
  • requesting region magnification only when needed,
  • answering after sufficient evidence has been collected.
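The interaction pattern above can be sketched as a rollout loop. This is a pure skeleton: `policy` and `env` are placeholders standing in for the real LVLM and retrieval environment, and the turn budget is an assumed hyperparameter.

```python
def rollout(question: str, policy, env, max_turns: int = 8):
    """Run one coarse-to-fine episode until the policy answers.

    policy(history) -> (action, argument), with action one of
    "search", "select", "bbox", "answer"; env(action, argument) -> an
    observation string (retrieved pages, crops, ...). Both callables
    are placeholders for the real model and environment.
    """
    history = [("question", question)]
    for _ in range(max_turns):
        action, arg = policy(history)
        history.append((action, arg))
        if action == "answer":
            return arg, history          # answer after sufficient evidence
        history.append(("observation", env(action, arg)))
    return None, history                 # evidence budget exhausted
```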

SFT with LLaMA-Factory

This repository already includes a local copy of LLaMA-Factory, which can be used to run multimodal SFT on the trajectories generated above.

First, install the LLaMA-Factory dependencies:

cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

The corresponding dataset entry has already been registered in LLaMA-Factory/data/dataset_info.json under the name unidoc, so in most cases you only need to prepare unidoc_sft.json with the messages and images fields.
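If you assemble unidoc_sft.json yourself, each entry follows the ShareGPT-style multimodal layout with messages and images fields; in LLaMA-Factory's convention, the conversation must contain one `<image>` placeholder per path in `images`. The helper below is a minimal sketch of that shape, not the repository's converter.

```python
import json

def make_sft_entry(question: str, trajectory: list[dict], image_paths: list[str]) -> dict:
    """Assemble one ShareGPT-style record with messages and images fields.

    One <image> placeholder is emitted per image path, per LLaMA-Factory's
    multimodal data convention. `trajectory` holds the remaining turns from
    the generated CoT (role/content dicts).
    """
    messages = [{"role": "user", "content": "<image>" * len(image_paths) + question}]
    messages += trajectory
    return {"messages": messages, "images": image_paths}

entry = make_sft_entry(
    "What is the reported top-1 accuracy?",
    [{"role": "assistant", "content": "<answer>84.7%</answer>"}],
    ["corpus/example_document/page_0012.png"],
)
print(json.dumps(entry, indent=2))
```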

For SFT with Qwen2.5-VL:

cd LLaMA-Factory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
llamafactory-cli train examples/train_full/qwen2_5vl_full_sft.yaml

Step 4. Run RL Training with SFT Model.

Start Training

Install the dependencies for RL training:

cd RLTrain
pip install -r requirements.txt

If you use the model-based evaluator during training, start the evaluation service first:

cd tools/model_eval
pip install -r ../requirements.txt
EVAL_MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
EVAL_PORT=8003 \
python evaluator_api.py

Set the required environment variables:

export MODEL_PATH=/path/to/your/sft-checkpoint
export TRAIN_FILE=/path/to/your/train.parquet
export VAL_FILE=/path/to/your/val.parquet
export SEARCH_URL=http://127.0.0.1:9001/search
export RM_URL=http://127.0.0.1:8003/eval
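With RM_URL pointing at the evaluator, reward queries during training are plain HTTP POSTs. The sketch below only shows how such a request could be built; the payload field names (question, prediction, reference) are assumptions about the evaluator's interface.

```python
import json
import os
import urllib.request

def build_eval_request(question: str, prediction: str, reference: str) -> urllib.request.Request:
    """POST body for the model-based reward evaluator (field names assumed)."""
    url = os.environ.get("RM_URL", "http://127.0.0.1:8003/eval")
    payload = json.dumps(
        {"question": question, "prediction": prediction, "reference": reference}
    ).encode()
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
```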

Then start training:

cd RLTrain
bash train.sh

The default launcher configures GRPO-style RL optimization with multi-turn rollouts, multimodal prompts, external retrieval, and model-based reward evaluation.

🙏 Acknowledgements

This work is implemented based on and inspired by several open-source projects, including VRAG-RL, verl, and LLaMA-Factory. We sincerely thank the authors and contributors of these projects for making their code and ideas publicly available.

📝 Citation

If you find this repository useful, please cite the corresponding paper. The bibliographic information will be updated once the public paper entry is finalized.

@misc{unidocrl2026,
      title={UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards},
      author={Author List To Be Updated},
      year={2026},
      note={Project page and paper link will be updated.}
}
