Official repository for DualGraphRAG, as described in the paper: "When to Query, When to Retrieve: A Dataset and Method for Semi-Structured Question Answering".
This repository contains both the implementation of the framework and the SpecsQA dataset used for evaluation.
DualGraphRAG is a novel Retrieval-Augmented Generation (RAG) framework designed for reliable and efficient question answering over semi-structured content (e.g., documents containing tables and technical specifications).
Traditional RAG systems often struggle with the global reasoning and exactness required for tasks such as exhaustive listing or aggregation. DualGraphRAG addresses this by representing the corpus through two aligned graph views:
- Textual Knowledge Graph (TKG): Captures natural language context and is robust to noise and paraphrase.
- Symbolic Knowledge Graph (SKG): Encodes structured evidence (such as product specification tables) as logical triples, enabling exact symbolic operations and complex reasoning via SPARQL.
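To make the SKG idea concrete, the following is a minimal sketch of how a specification table could be flattened into triples and filtered exactly. It is an illustration only, not the repository's implementation: the product IDs, attribute names, the `ex:` namespace, and the helpers `table_to_triples` and `select_subjects` are all hypothetical, and the SPARQL string is shown only as the kind of query a triple store would run.

```python
from typing import Callable, Dict, List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: object


def table_to_triples(product_id: str, spec_table: Dict[str, object]) -> List[Triple]:
    """Turn one product's spec table into (subject, predicate, object) triples."""
    return [Triple(product_id, attr, value) for attr, value in spec_table.items()]


def select_subjects(
    triples: List[Triple], conditions: Dict[str, Callable[[object], bool]]
) -> List[str]:
    """Return, sorted, every subject whose attributes satisfy all conditions."""
    by_subject: Dict[str, Dict[str, object]] = {}
    for t in triples:
        by_subject.setdefault(t.subject, {})[t.predicate] = t.obj
    return sorted(
        subject
        for subject, attrs in by_subject.items()
        if all(p in attrs and test(attrs[p]) for p, test in conditions.items())
    )


# Hypothetical spec tables (not taken from SpecsQA).
SPECS = {
    "tv_a": {"screen_size_inch": 55, "panel": "OLED"},
    "tv_b": {"screen_size_inch": 65, "panel": "QLED"},
    "tv_c": {"screen_size_inch": 65, "panel": "OLED"},
}
TRIPLES = [t for pid, table in SPECS.items() for t in table_to_triples(pid, table)]

# The same multi-condition question, written as SPARQL over an assumed
# ex: namespace. It is not executed here.
EQUIVALENT_SPARQL = """
PREFIX ex: <http://example.org/>
SELECT ?product WHERE {
  ?product ex:panel "OLED" ;
           ex:screen_size_inch ?size .
  FILTER(?size >= 60)
}
"""

# "Which OLED TVs are 60 inches or larger?"
result = select_subjects(
    TRIPLES,
    {
        "panel": lambda v: v == "OLED",
        "screen_size_inch": lambda v: v >= 60,
    },
)
print(result)
```

Because the filter runs over every triple rather than over a top-k retrieved subset, the answer is exhaustive by construction, which is the property that inverse and multi-condition questions depend on.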
The system also features dynamic per-question routing, allowing it to decide when to rely on semantic retrieval over text and when to execute exact symbolic queries.
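As a rough illustration of per-question routing, the sketch below sends questions that look like listing or aggregation requests to the symbolic path and everything else to semantic retrieval. The keyword heuristic, the cue list, and the `route` function are assumptions made for this example; the actual router in DualGraphRAG may work quite differently (for instance, by asking an LLM).

```python
import re

# Hypothetical cue words suggesting exhaustive listing, counting, or comparison.
AGGREGATION_CUES = re.compile(
    r"\b(all|every|list|how many|count|average|most|least|largest|smallest)\b",
    re.IGNORECASE,
)


def route(question: str) -> str:
    """Return 'symbolic' for listing/aggregation-style questions, else 'semantic'."""
    return "symbolic" if AGGREGATION_CUES.search(question) else "semantic"


for q in (
    "List all TVs with an OLED panel",
    "How many monitors support 144 Hz?",
    "What does the warranty cover?",
):
    print(f"{route(q):9s} <- {q}")
```

A keyword rule is brittle on paraphrases, so it should be read only as a stand-in for whatever decision mechanism the framework actually uses.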
The repository is organized as follows:
- DualGraph code/: Implementation of the DualGraphRAG system.
  - configs/: Configuration templates for the project.
  - prompts/: LLM prompts used for graph extraction, SPARQL generation, and response synthesis.
  - dualgraphrag/: Core indexing, retrieval, and querying logic.
  - evaluation/: Tools and metrics for system evaluation.
- SpecsQA/: The evaluation benchmark and dataset processing tools.
  - scraped_data/: Raw HTML product pages (packaged as .tar.xz).
  - questions.json: 117 manually annotated questions for benchmark evaluation.
  - scraping/: Tools used to harvest the raw HTML dataset.
  - databuilder/: Preprocessing tools to prepare the raw data for indexing.
SpecsQA is a challenging benchmark for technical question answering on product specifications.
- Source: Scraped snapshot of the Samsung UK website (November 2025).
- Content: 2,162 product pages across 26 categories of consumer electronics.
- Questions: 117 manually annotated questions categorized into:
  - Inverse Queries: Questions requiring exhaustive lists of products.
  - Multi-condition Queries: Filtering based on multiple technical attributes.
  - Reasoning & Comparison: Questions requiring aggregation or comparison across products.
Detailed installation and execution instructions are provided in the subdirectories DualGraph code/ and SpecsQA/.
If you find our work useful, please cite it:
@article{dualgraphrag2026,
  title={Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering},
  author={Mateusz Czy\.{z}nikiewicz and Ryszard Tuora and Adam Kozakiewicz and Tomasz Zi\k{e}tkiewicz and Mateusz Gali\'{n}ski and Micha\l{} T. Godziszewski and Micha\l{} Karpowicz and Timothy Hospedales and Cristina Cornelio},
  journal={arXiv preprint},
  year={2026},
  note={Under submission}
}

This repository was forked from the following three repositories in the SamsungLabs GitHub organization:
- DualGraph: Core project, which also contains the data preprocessing and evaluation code.
- DualGraph_scraping: Code used to scrape the product pages.
- DualGraph_dataset: Raw scraped data for the SpecsQA dataset.
© 2026 Samsung Labs. All rights reserved.