GraphSkill is a Python research framework for evaluating LLMs on graph reasoning tasks through both code generation and text reasoning pipelines.
It currently supports two benchmarks:

- `complexgraph`
- `gtools`
The repository includes multiple runner scripts for zero-shot, few-shot, chain-of-thought, retrieval-assisted, and agentic baselines, with standardized result outputs and evaluation metrics.
- `runners/` - experiment entry points (`run_*.py`)
- `utils/` - shared utilities (dataset loading, LLM wrappers, evaluation, code execution)
- `prompts/` - benchmark test cases and prompting assets
- `data/` - retrieval/documentation data and supporting JSON resources
- `LLM_generation_results/` - generated outputs and evaluation artifacts
- `.env.example` - environment variable template for API keys
- Python 3
- API keys for whichever model providers you plan to use (for example OpenAI, DeepSeek, Together, Hugging Face)
```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

For development tooling (tests, linting, formatting):

```shell
pip install -r requirements-dev.txt
```

Create a local `.env` file and populate it with your API keys:

```shell
cp .env.example .env
```

Then load the variables into your shell:

```shell
source load_env.sh
```

Run the zero-shot code generation baseline:
```shell
python runners/run_zs_coding.py --benchmark complexgraph --model deepseek-chat --dataset small --max_instances 10
```

Run the CodeGraph baseline (code generation with retrieved examples):

```shell
python runners/run_codegraph.py --benchmark gtools --model deepseek-chat --dataset small --max_instances 10
```

Evaluate an existing results file without re-running generation:

```shell
python runners/run_zs_coding.py --benchmark complexgraph --evaluate_only --results_file ./LLM_generation_results/complexgraph/code_generation/zero_shot/small/<model>/all_results.json
```

Key arguments:

- `--benchmark`: `complexgraph` or `gtools`
- `--dataset`: `small`, `large`, or `composite` for `complexgraph`; `small` or `large` for `gtools`
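Before launching runs like those above, it can help to confirm that the API keys from `.env` are actually visible to Python. A minimal stdlib check is sketched below; the environment variable names are illustrative and should be matched to whichever providers you configured:

```python
import os

# Illustrative provider key names -- adjust to match your .env file.
EXPECTED_KEYS = ["OPENAI_API_KEY", "DEEPSEEK_API_KEY", "TOGETHER_API_KEY"]

# Collect any expected keys that are unset or empty.
missing = [k for k in EXPECTED_KEYS if not os.environ.get(k)]

if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("All expected API keys are set.")
```

Running this after `source load_env.sh` should report no missing keys for the providers you use.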
Use each runner's built-in help for full arguments:
```shell
python runners/run_zs_coding.py --help
python runners/run_codegraph.py --help
```

Runners save outputs under `LLM_generation_results/...`, typically including:

- per-task `*_results.json`
- `all_results.json`
- `all_results_with_eval.json`
- `evaluation_metrics.json`
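To inspect an `evaluation_metrics.json` file programmatically, a small loader like the following can be used. It assumes the file is a flat metric-name-to-value mapping; the metric names in the demo are illustrative, not the repo's actual schema:

```python
import json
import os
import tempfile

def load_metrics(path):
    """Load a metrics JSON file (assumed to be a flat
    metric-name -> value mapping) and return it sorted by name."""
    with open(path) as f:
        metrics = json.load(f)
    return dict(sorted(metrics.items()))

# Demo with a stand-in file; metric names here are hypothetical.
sample = {"pass_rate": 0.62, "exec_success": 0.88}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)

metrics = load_metrics(f.name)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")

os.unlink(f.name)  # clean up the stand-in file
```

In practice you would point `load_metrics` at a real path such as the `evaluation_metrics.json` produced under `LLM_generation_results/...`.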
The repo contains several baselines under `runners/`, including:

- `run_zs_coding.py`
- `run_codegraph.py`
- `run_fs_coding.py`
- `run_pie_coding.py`
- `run_graphteam_coding.py`
- `run_zs_textReason.py`
- `run_fs_textReason.py`
- `run_cot_textReason.py`
- `run_bag_textReason.py`
- `run_zs_retrieval_textReason.py`
- `run_zs_sentbert_codingagent.py`
- `run_zs_tfidf_codingagent.py`
- `run_zs_retagent_codingagent.py`
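To run several of these baselines back to back, one option is a small Python driver that shells out to each script. This is only a sketch, assuming the runners share the argument interface shown in the quick-start commands above; the script selection and argument values are illustrative:

```python
import shlex
# import subprocess  # uncomment the run() call below to actually execute

# A hypothetical subset of baselines to sweep over.
SCRIPTS = [
    "runners/run_zs_coding.py",
    "runners/run_fs_coding.py",
    "runners/run_cot_textReason.py",
]

def build_cmd(script, benchmark="complexgraph", model="deepseek-chat",
              dataset="small", max_instances=10):
    """Assemble the argument list assumed to be shared by the runners."""
    return ["python", script,
            "--benchmark", benchmark,
            "--model", model,
            "--dataset", dataset,
            "--max_instances", str(max_instances)]

for script in SCRIPTS:
    cmd = build_cmd(script)
    print(shlex.join(cmd))
    # subprocess.run(cmd, check=True)  # launch for real
```

Printing the commands first (rather than executing immediately) makes it easy to sanity-check the sweep before spending API budget.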
- Keep `.env` local and never commit secrets.
- Large datasets can require longer runtimes due to code execution and evaluation.
```bibtex
@misc{wang2026graphskill,
  title={GraphSkill: Documentation-Guided Hierarchical Retrieval-Augmented Coding for Complex Graph Reasoning},
  author={Fali Wang and Chenglin Weng and Xianren Zhang and Siyuan Hong and Hui Liu and Suhang Wang},
  year={2026},
  eprint={2603.06620},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2603.06620},
}
```
This project is distributed under the terms in the `LICENSE` file.