This repository provides a framework for evaluating and comparing retrieval strategies for retrieval-augmented generation (RAG) systems on the LongDocURL benchmark.
The primary goal is to analyze how different document parsing and retrieval methods affect the quality of LLM responses on long documents. The following questioning strategies are compared:
- Questioning without any retrieved data.
- Questioning using the cut-off paradigm from LongDocURL.
- Questioning using a PyMuPDF-based RAG algorithm.
- Questioning using a PageR-based RAG algorithm.
- Questioning using a MinerU-based RAG algorithm.
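
The strategies above differ mainly in how the context passed to the LLM is built. The following is a minimal sketch of that idea, not the repository's actual code: `build_context`, `keyword_retriever`, and the strategy names are hypothetical, and the cut-off paradigm is approximated here as keeping only the leading chunks of the document, which may differ from the benchmark's actual procedure. Under this sketch, the PyMuPDF, PageR, and MinerU variants would share the `"rag"` branch and differ only in the parser that produced `chunks`.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: maps (question, chunks, k) to the k most relevant chunks.
Retriever = Callable[[str, Sequence[str], int], list]


def build_context(
    strategy: str,
    question: str,
    chunks: Sequence[str],
    retrieve: Optional[Retriever] = None,
    budget: int = 2,
    top_k: int = 2,
) -> str:
    """Return the context string given to the LLM for one strategy."""
    if strategy == "none":
        # No retrieved data: the model sees only the question.
        return ""
    if strategy == "cutoff":
        # Assumption: cut-off is modelled as keeping the first `budget` chunks.
        return "\n".join(chunks[:budget])
    if strategy == "rag":
        if retrieve is None:
            raise ValueError("the 'rag' strategy needs a retriever")
        return "\n".join(retrieve(question, chunks, top_k))
    raise ValueError(f"unknown strategy: {strategy}")


def keyword_retriever(question: str, chunks: Sequence[str], k: int) -> list:
    """Toy retriever for illustration: rank chunks by shared words."""
    words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

Keeping the retriever as a parameter means each parsing backend can be evaluated with the same prompt-building code, so differences in answer quality can be attributed to parsing and retrieval rather than to prompt construction.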

The experiments used the following tool and model versions:

- PageR: commit fc509ea8fdd1e639e30daddd19f689491d694881 (main branch).
- MinerU: v3.0.4.
- Tesseract OCR: v5.5.0.20241111 (System dependency).
- Python: 3.13.7.
- Full Python environment details can be found in requirements.txt.
- Embeddings Model: distiluse-base-multilingual-cased-v1.
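
To illustrate where the embedding model fits, here is a minimal sketch of embedding-based retrieval by cosine similarity. It is an assumption about the general technique, not the repository's implementation: `top_k_chunks`, `cosine`, and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in so the example runs without downloading a model. With the `sentence-transformers` package installed, `SentenceTransformer("distiluse-base-multilingual-cased-v1").encode` could be passed as `embed` instead.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(
    question: str,
    chunks: Sequence[str],
    embed: Callable[[str], Vector],
    k: int = 3,
) -> list:
    """Return the k chunks whose embeddings are most similar to the question."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(chunk)), chunk) for chunk in chunks]
    # Sort by score only, highest first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


def toy_embed(text: str) -> list:
    """Stand-in embedder: counts of a few fixed keywords (illustration only)."""
    vocab = ("table", "figure", "revenue")
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]
```

For example, `top_k_chunks("revenue table", chunks, toy_embed, k=1)` selects the chunk that mentions both keywords. A real pipeline would typically embed the chunks once and cache the vectors rather than re-embedding them for every question.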