ALDEN is a multi-modal reinforcement learning framework designed for Agentic Visually-Rich Document Understanding (A-VRDU). Built upon Qwen2.5-VL, it introduces a novel fetch action, a cross-level reward, and a visual semantic anchoring mechanism to enable efficient navigation and reasoning over long, high-resolution documents.

This repository contains the official implementation of our paper: *ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents*.
First, set up the main training environment:

```bash
conda create -n alden python=3.10
conda activate alden
git clone https://github.com/gipplab/ALDEN.git
cd ./ALDEN
pip install -e .
```

Next, create a separate environment for the single-vector retriever:

```bash
conda create -n alden-sv python=3.10
conda activate alden-sv
cd ./ALDEN
pip install -r single-vec_retriever_requirements.txt
cd ./flashrag
pip install -e .
```

Finally, create an environment for the multi-vector retriever:

```bash
conda create -n alden-mv python=3.10
conda activate alden-mv
cd ./ALDEN
pip install -r multi-vec_retriever_requirements.txt
cd ./flashrag
pip install -e .
```
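To confirm the retriever environments are usable, a quick import check can help. This is a minimal sketch, not part of the official setup; it only assumes that PyTorch, FAISS, and the locally installed flashrag package are importable after the steps above.

```python
# Minimal environment sanity check (not part of the official setup).
# Run inside the alden-sv or alden-mv environment after installation.
import torch
import faiss      # expected to come with the retriever requirements
import flashrag   # the local package installed with `pip install -e .`

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("faiss:", faiss.__version__)
print("flashrag imported from:", flashrag.__file__)
```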
We provide the processed training corpus on Hugging Face: SkyFishQ/ALDEN.

If you wish to build the corpus from scratch using your own data:

- Modify `raw_data_path` and `target_path` in `rag_serving/build_corpus.py`.
- Run the build script:

```bash
python rag_serving/build_corpus.py
```
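For a quick look at the released corpus before indexing, the snippet below loads it from the Hugging Face Hub. This is a sketch only; the split and column names are assumptions and may differ from the actual layout of SkyFishQ/ALDEN.

```python
# Sketch: inspect the released corpus (split/column names are assumptions).
from datasets import load_dataset

corpus = load_dataset("SkyFishQ/ALDEN", split="train")  # split name is an assumption
print(corpus)               # features and number of rows
print(corpus[0].keys())     # fields available in one example
```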
We use flashrag to build the dense retrieval indices for the document images.

```bash
cd ./flashrag/flashrag/retriever
```

Single-vector image index (vdr-2b-v1):

```bash
python index_builder.py \
    --retrieval_method vdr-2b-v1 \
    --model_path llamaindex/vdr-2b-v1 \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 512 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal image \
    --sentence_transformer \
    --save_embedding
```

Single-vector text index (gte-Qwen2-1.5B-instruct):

```bash
python index_builder.py \
    --retrieval_method gte-Qwen2-1.5B-instruct \
    --model_path Alibaba-NLP/gte-Qwen2-1.5B-instruct \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 4096 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal text \
    --sentence_transformer \
    --save_embedding
```

Multi-vector text index (jina-colbert-v2):

```bash
python index_builder.py \
    --retrieval_method jina-colbert-v2 \
    --model_path jinaai/jina-colbert-v2 \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 4096 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal text \
    --save_embedding
```

Multi-vector image index (colqwen2-v1.0):

```bash
python index_builder.py \
    --retrieval_method colqwen2-v1.0 \
    --model_path vidore/colqwen2-v1.0 \
    --corpus_path /path/to/your/images_corpus/images.parquet \
    --save_dir /path/to/save/images_index \
    --max_length 4096 \
    --batch_size 128 \
    --faiss_type Flat \
    --index_modal image \
    --save_embedding
```

Note: Please replace `/path/to/your/...` with your actual file paths.
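The index builder writes a FAISS index (and, with `--save_embedding`, the raw embeddings) to `--save_dir`. As a rough sanity check, you can load that index directly with FAISS and run a nearest-neighbour search against a query embedding. The snippet below is a sketch only: the index filename and the use of `SentenceTransformer` for the gte-Qwen2 encoder are assumptions, not guaranteed by flashrag's output layout.

```python
# Sketch: query a built Flat index directly with FAISS.
# The index filename and the sentence-transformers loading below are assumptions;
# adjust them to match the actual contents of your save_dir.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

index = faiss.read_index("/path/to/save/images_index/gte-Qwen2-1.5B-instruct_Flat.index")  # assumed filename
encoder = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)

query = "What is the total revenue reported in 2023?"
q_emb = encoder.encode([query], normalize_embeddings=True).astype(np.float32)

scores, doc_ids = index.search(q_emb, 5)  # top-5 nearest corpus entries
print(doc_ids[0], scores[0])
```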
ALDEN uses a decoupled architecture where the environment (RAG tools) and the agent (RL training) run separately.
First, launch the RAG environment server, which handles the `<search>` and `<fetch>` actions.

1. Get the server IP:

   ```bash
   hostname --ip-address
   ```

   Take note of this IP address; you will need it to configure the training script.

2. Start the service:

   ```bash
   python rag_serving/serving.py \
       --config rag_serving/serving_config_single-vec.yaml \
       --num_retriever 8 \
       --port 42354
   ```

   or

   ```bash
   python rag_serving/serving.py \
       --config rag_serving/serving_config_multi-vec.yaml \
       --num_retriever 8 \
       --port 42354
   ```
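Once the server is up, the trainer talks to it over HTTP. The snippet below is a hypothetical smoke test for the service; the endpoint path (`/search`) and the JSON payload fields are illustrative assumptions, not the actual interface defined in `rag_serving/serving.py`.

```python
# Hypothetical smoke test for the RAG serving process.
# The route and payload fields below are assumptions; check
# rag_serving/serving.py for the real request schema.
import requests

SERVER_URL = "http://<server-ip>:42354"   # replace with the IP from `hostname --ip-address`

resp = requests.post(
    f"{SERVER_URL}/search",                              # assumed route
    json={"query": "net income in 2022", "top_k": 5},    # assumed fields
    timeout=30,
)
print(resp.status_code)
print(resp.json())
```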
Once the tool server is running, start the training. Ensure the server URL in the training script points to the IP obtained in Step 1.
```bash
bash examples/baselines/qwen2_5_vl_7b_doc_agent_ppo.sh
```

To run inference on test sets:

```bash
bash examples/baselines/qwen2_5_vl_7b_doc_agent_generation.sh
```

To merge the trained actor checkpoint shards into a standalone model:

```bash
python3 scripts/model_merger.py \
    --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
```
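After merging, the resulting checkpoint can be loaded like any other Qwen2.5-VL model. The snippet below is a sketch; the output directory of `model_merger.py` is an assumption, so point `MERGED_DIR` at wherever the merged Hugging Face-format weights were actually written.

```python
# Sketch: load the merged checkpoint with Hugging Face transformers.
# MERGED_DIR is an assumption; point it to the directory written by model_merger.py.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MERGED_DIR = "checkpoints/easy_r1/exp_name/global_step_1/actor/huggingface"  # assumed output path

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MERGED_DIR, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MERGED_DIR)
print(model.config.model_type)
```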
If you find this project useful, please cite our paper:

```bibtex
@article{yang2025alden,
title={ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents},
author={Yang, Tianyu and Ruas, Terry and Tian, Yijun and Wahle, Jan Philip and Kurzawe, Daniel and Gipp, Bela},
journal={arXiv preprint arXiv:2510.25668},
year={2025}
}
```
This work is built upon the following excellent open-source projects:
- EasyR1: For the RL infrastructure.
- VAGEN: For visual agent baselines.
- verl: For efficient RL training.
- ReCall: For RAG integration concepts.
We greatly appreciate their valuable contributions to the community.