diff --git a/README.md b/README.md index e027f88..6c8f7e5 100644 --- a/README.md +++ b/README.md @@ -15,15 +15,23 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per ## News -🔥🔥🔥 [2025/09/26] Featured papers: +🔥🔥🔥 [2025/10/03] Featured papers: -- 🔥🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group. +- 🔥🔥 [BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software](https://arxiv.org/abs/2509.25248) from Arizona State University. -- 🔥🔥 [SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?](https://arxiv.org/abs/2509.16941) from Scale AI. +- 🔥🔥 [Devstral: Fine-tuning Language Models for Coding Agent Applications](https://arxiv.org/abs/2509.25193) from Mistral AI. -- 🔥 [SWE-QA: Can Language Models Answer Repository-level Code Questions?](https://arxiv.org/abs/2509.14635) from Shanghai Jiao Tong University. +- 🔥🔥 [LLaDA-MoE: A Sparse MoE Diffusion Language Model](https://arxiv.org/abs/2509.24389) from Ant Group. -🔥🔥🔥 [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"! +- 🔥🔥 [Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents](https://arxiv.org/abs/2509.23045) from Moonshot AI. + +- 🔥🔥 [ML2B: Multi-Lingual ML Benchmark For AutoML](https://arxiv.org/abs/2509.22768) from HSE University. + +- 🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group. + +- 🔥 [SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?](https://arxiv.org/abs/2509.16941) from Scale AI. + +🔥🔥     [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"! 🔥🔥     [2025/08/15] 80 papers from ACL 2025 have been added. Search for the keyword "ACL 2025"! 
@@ -393,6 +401,8 @@ These LLMs are not specifically trained for code, but have demonstrated varying 90. **LongCat-Flash**: "LongCat-Flash Technical Report" [2025-09] [[paper](https://arxiv.org/abs/2509.01322)] +91. **LLaDA-MoE**: "LLaDA-MoE: A Sparse MoE Diffusion Language Model" [2025-09] [[paper](https://arxiv.org/abs/2509.24389)] + ### 2.2 Existing LLM Adapted to Code These models are general-purpose LLMs further pretrained on code-related data. @@ -531,7 +541,7 @@ These models are Transformer encoders, decoders, and encoder-decoders pretrained 11. **AST-T5** (Span Corruption): "AST-T5: Structure-Aware Pretraining for Code Generation and Understanding" [2024-01] [ICML 2024] [[paper](https://arxiv.org/abs/2401.03003)] -12. **DivoT5**: "Directional Diffusion-Style Code Editing Pre-traini" [2025-01] [[paper](https://arxiv.org/abs/2501.12079)] +12. **DivoT5**: "Directional Diffusion-Style Code Editing Pre-training" [2025-01] [[paper](https://arxiv.org/abs/2501.12079)] #### UniLM @@ -757,6 +767,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities 34. "DELTA-Code: How Does RL Unlock and Transfer New Programming Algorithms in LLMs?" [2025-09] [[paper](https://arxiv.org/abs/2509.21016)] +35. **Critique-Coder**: "Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning" [2025-09] [[paper](https://arxiv.org/abs/2509.22824)] + ## 3. When Coding Meets Reasoning ### 3.1 Coding for Reasoning @@ -889,6 +901,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities 64. **PBB**: "Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training" [2025-06] [[paper](https://arxiv.org/abs/2506.18777)] +65. 
"On Code-Induced Reasoning in LLMs" [2025-09] [[paper](https://arxiv.org/abs/2509.21499)] + ### 3.2 Code Simulation - "Code Simulation Challenges for Large Language Models" [2024-01] [[paper](https://arxiv.org/abs/2401.09074)] @@ -1075,6 +1089,12 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities 77. **MapCoder-Lite**: "MapCoder-Lite: Squeezing Multi-Agent Coding into a Single Small LLM" [2025-09] [[paper](https://arxiv.org/abs/2509.17489)] +78. **Devstral**: "Devstral: Fine-tuning Language Models for Coding Agent Applications" [2025-09] [[paper](https://arxiv.org/abs/2509.25193)] + +79. **Lita**: "Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs" [2025-09] [[paper](https://arxiv.org/abs/2509.25873)] + +80. **Kimi-Dev**: "Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents" [2025-09] [[paper](https://arxiv.org/abs/2509.23045)] + ### 3.4 Interactive Coding - "Interactive Program Synthesis" [2017-03] [[paper](https://arxiv.org/abs/1703.03539)] @@ -1487,6 +1507,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities - [**LaTeX**] "Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models" [2025-09] [[paper](https://arxiv.org/abs/2509.17589)] +- "CodeChemist: Functional Knowledge Transfer for Low-Resource Code Generation via Test-Time Scaling" [2025-10] [[paper](https://arxiv.org/abs/2510.00501)] + ## 5. Methods/Models for Downstream Tasks For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF, and (occasionally) static program analysis); the second column contains non-Transformer neural methods (e.g. LSTM, CNN, GNN); the third column contains Transformer based methods (e.g. BERT, GPT, T5). @@ -1643,6 +1665,10 @@ For each task, the first column contains non-neural methods (e.g. 
n-gram, TF-IDF - "TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla" [2025-09] [[paper](https://arxiv.org/abs/2509.09101)] +- "SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code" [2025-09] [[paper](https://arxiv.org/abs/2509.24507)] + +- "LongCodeZip: Compress Long Context for Code Language Models" [2025-10] [[paper](https://arxiv.org/abs/2510.00446)] + ### Code RAG - "CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation" [2024-05] [[paper](https://arxiv.org/abs/2405.02355)] @@ -2389,6 +2415,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF - "AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench" [2025-07] [[paper](https://arxiv.org/abs/2507.02554)] +- "ML2B: Multi-Lingual ML Benchmark For AutoML" [2025-09] [[paper](https://arxiv.org/abs/2509.22768)] + ### Text-To-SQL - "PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models" [2021-09] [EMNLP 2021] [[paper](https://arxiv.org/abs/2109.05093)] @@ -2667,6 +2695,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF - "DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction" [2025-09] [[paper](https://arxiv.org/abs/2509.14507)] +- "A State-of-the-Art SQL Reasoning Model using RLVR" [2025-09] [[paper](https://arxiv.org/abs/2509.21459)] + +- "Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling" [2025-09] [[paper](https://arxiv.org/abs/2509.24403)] + ### Program Proof - "Baldur: Whole-Proof Generation and Repair with Large Language Models" [2023-03] [FSE 2023] [[paper](https://arxiv.org/abs/2303.04910)] @@ -2717,6 +2749,10 @@ For each task, the first column contains non-neural methods (e.g. 
n-gram, TF-IDF - "PyVeritas: On Verifying Python via LLM-Based Transpilation and Bounded Model Checking for C" [2025-08] [[paper](https://arxiv.org/abs/2508.08171)] +- "InvBench: Can LLMs Accelerate Program Verification with Invariant Synthesis?" [2025-09] [[paper](https://arxiv.org/abs/2509.21629)] + +- "Towards Repository-Level Program Verification with Large Language Models" [2025-09] [[paper](https://arxiv.org/abs/2509.25197)] + ### Test Generation - "Unit Test Case Generation with Transformers and Focal Context" [2020-09] [AST@ICSE 2022] [[paper](https://arxiv.org/abs/2009.05617)] @@ -2871,6 +2907,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF - "Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning" [2025-08] [[paper](https://arxiv.org/abs/2508.05710)] +- "Navigating the Labyrinth: Path-Sensitive Unit Test Generation with Large Language Models" [2025-09] [[paper](https://arxiv.org/abs/2509.23812)] + ### Oracle Generation - "Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers" [2020-09] [[paper](https://arxiv.org/abs/2009.05634)] @@ -3193,6 +3231,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF - "LLaVul: A Multimodal LLM for Interpretable Vulnerability Reasoning about Source Code" [2025-09] [[paper](https://arxiv.org/abs/2509.17337)] +- "Improving Code Localization with Repository Memory" [2025-10] [[paper](https://arxiv.org/abs/2510.01003)] + ### Malicious Code Detection - "I-MAD: Interpretable Malware Detector Using Galaxy Transformer", 2019-09, Comput. Secur. 2021, [[paper](https://arxiv.org/abs/1909.06865)] @@ -3511,6 +3551,8 @@ For each task, the first column contains non-neural methods (e.g. 
n-gram, TF-IDF - "Prompting for Performance: Exploring LLMs for Configuring Software" [2025-07] [[paper](https://arxiv.org/abs/2507.09790)] +- "BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software" [2025-09] [[paper](https://arxiv.org/abs/2509.25248)] + ### Code QA & Reasoning - "DialogAgent: An Auto-engagement Agent for Code Question Answering Data Production" [2024-12] [[paper](https://arxiv.org/abs/2412.08069)] @@ -3535,6 +3577,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF - "Complex Logical Instruction Generation" [2025-08] [[paper](https://arxiv.org/abs/2508.09125)] +- "Regression Language Models for Code" [2025-09] [[paper](https://arxiv.org/abs/2509.26476)] + ### Software Modeling - "Towards using Few-Shot Prompt Learning for Automating Model Completion" [2022-12] [[paper](https://arxiv.org/abs/2212.03404)] @@ -3723,6 +3767,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF - "Localizing Malicious Outputs from CodeLLM" [2025-09] [[paper](https://arxiv.org/abs/2509.17070)] +- "SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios" [2025-09] [[paper](https://arxiv.org/abs/2509.22097)] + +- "Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks" [2025-10] [[paper](https://arxiv.org/abs/2510.01359)] + ### Correctness - "An Empirical Evaluation of GitHub Copilot's Code Suggestions" [2022-05] [MSR 2022] [[paper](https://ieeexplore.ieee.org/document/9796235)] @@ -3905,6 +3953,8 @@ For each task, the first column contains non-neural methods (e.g. 
n-gram, TF-IDF - "How Programming Concepts and Neurons Are Shared in Code Language Models" [2025-06] [ACL 2025 Findings] [[paper](https://arxiv.org/abs/2506.01074)] +- "Analyzing Latent Concepts in Code Language Models" [2025-10] [[paper](https://arxiv.org/abs/2510.00476)] + ### API Usage - "How and Why LLMs Use Deprecated APIs in Code Completion? An Empirical Study" [2024-06] [[paper](https://arxiv.org/abs/2406.09834)] @@ -4467,34 +4517,36 @@ $^\diamond$ Machine/human prompts - "A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions", 2022-08, arXiv, [[paper](https://arxiv.org/abs/2208.13629)] - "A survey on deep learning approaches for text-to-SQL", 2023-01, VLDB J., [[paper](https://link.springer.com/article/10.1007/s00778-022-00776-8)] -| Date | Venue | Benchmark | Size | Language | Source | -| ------- | ------------------- | ---------------- | ---------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| 2017-08 | arXiv | WikiSQL | 80654 | | "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning" [[paper](https://arxiv.org/abs/1709.00103)] [[data](https://github.com/salesforce/WikiSQL)] | -| 2018-06 | CL 2018 | Advising | 4570 | | "Improving Text-to-SQL Evaluation Methodology" [[paper](https://arxiv.org/abs/1806.09029)] [[data](https://github.com/jkkummerfeld/text2sql-data/)] | -| 2018-09 | EMNLP 2018 | Spider | 10181 | | "Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task" [[paper](https://arxiv.org/abs/1809.08887)] [[data](https://yale-lily.github.io/spider)] | -| 2019-06 | ACL 2019 | SParC | 12726 | | "SParC: Cross-Domain Semantic Parsing in Context" [[paper](https://arxiv.org/abs/1906.02285)] [[data](https://yale-lily.github.io/sparc)] | -| 2019-07 | WWW 2020 | MIMICSQL | 
10000 | | "Text-to-SQL Generation for Question Answering on Electronic Medical Records" [[paper](https://arxiv.org/abs/1908.01839)] [[data](https://github.com/wangpinggl/TREQS)] | -| 2019-09 | EMNLP-IJCNLP 2019 | CoSQL | 15598 | | "CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases" [[paper](https://arxiv.org/abs/1909.05378)] [[data](https://yale-lily.github.io/cosql)] | -| 2020-05 | LREC 2020 | Criteria-to-SQL | 2003 | | "Dataset and Enhanced Model for Eligibility Criteria-to-SQL Semantic Parsing" [[paper](https://aclanthology.org/2020.lrec-1.714/)] [[data](https://github.com/xiaojingyu92/Criteria2SQL)] | -| 2020-10 | EMNLP 2020 Findings | Squall | 11276 | | "On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries" [[paper](https://arxiv.org/abs/2010.11246)] [[data](https://github.com/tzshi/squall)] | -| 2020-10 | NAACL-HLT 2021 | Spider-Realistic | 508 | | "Structure-Grounded Pretraining for Text-to-SQL" [[paper](https://arxiv.org/abs/2010.12773)] [[data](https://zenodo.org/records/5205322)] | -| 2021-06 | ACL/IJCNLP 2021 | Spider-Syn | 8034 | | "Towards Robustness of Text-to-SQL Models against Synonym Substitution" [[paper](https://arxiv.org/abs/2106.01065)] [[data](https://arxiv.org/abs/2106.01065)] | -| 2021-06 | NLP4Prog 2021 | SEDE | 12023 | | "Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data" [[paper](https://arxiv.org/abs/2106.05006)] [[data](https://github.com/hirupert/sede)] | -| 2021-06 | ACL/IJCNLP 2021 | KaggleDBQA | 400 | | "KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers" [[paper](https://arxiv.org/abs/2106.11455)] [[data](https://github.com/chiahsuan156/KaggleDBQA)] | -| 2021-09 | EMNLP | Spider-DK | 535 | | "Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization" [[paper](https://arxiv.org/abs/2109.05157)] [[data](https://github.com/ygan/Spider-DK)] | -| 2022-05 | NAACL 2022 Findings | Spider-SS/CG | 
8034/45599 | | "Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment" [[paper](https://arxiv.org/abs/2205.02054)] [[data](https://github.com/ygan/SpiderSS-SpiderCG)] | -| 2023-05 | arXiv | BIRD | 12751 | | "Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs" [[paper](https://arxiv.org/abs/2305.03111)] [[data](https://bird-bench.github.io/)] | -| 2023-06 | ACL 2023 | XSemPLR | 24.4K | | "XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations" [[paper](https://arxiv.org/abs/2306.04085)] [[data](https://github.com/psunlpgroup/XSemPLR)] | -| 2024-05 | ACL 2024 Findings | EHR-SeqSQL | 31669 | | "EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records" [[paper](https://arxiv.org/abs/2406.00019)] | -| 2024-06 | NAACL 2024 | BookSQL | 100K | | "BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain" [[paper](https://arxiv.org/abs/2406.07860)] [[data](https://github.com/Exploration-Lab/BookSQL)] | -| 2024-08 | ACL 2024 Findings | MultiSQL | 9257 | | "MultiSQL: A Schema-Integrated Context-Dependent Text2SQL Dataset with Diverse SQL Operations" [[paper](https://aclanthology.org/2024.findings-acl.823/)] [[data](https://github.com/grandchicken/MultiSQL)] | -| 2024-09 | arXiv | BEAVER | 93 | | "BEAVER: An Enterprise Benchmark for Text-to-SQL" [[paper](https://arxiv.org/abs/2409.02038)] | -| 2024-10 | arXiv | PRACTIQ | 2812 | | "PRACTIQ: A Practical Conversational Text-to-SQL dataset with Ambiguous and Unanswerable Queries" [[paper](https://arxiv.org/abs/2410.11076)] | -| 2024-10 | arXiv | BIS | 239 | | "BIS: NL2SQL Service Evaluation Benchmark for Business Intelligence Scenarios" [[paper](https://arxiv.org/abs/2410.22925)] [[data](https://github.com/boracaglayan/bis-nl2sql)] | -| 2024-11 | ICLR 2025 | Spider 2.0 | 632 | | "Spider 2.0: Evaluating Language Models on Real-World Enterprise 
Text-to-SQL Workflows" [[paper](https://arxiv.org/abs/2411.07763)] [[data](https://github.com/xlang-ai/Spider2)] | -| 2025-01 | arXiv | Dialect2SQL | 9428 | | "Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija" [[paper](https://arxiv.org/abs/2501.11498)] | -| 2025-05 | arXiv | LogicCat | 4038 | | "LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges" [[paper](https://arxiv.org/abs/2505.18744)] | -| 2025-05 | arXiv | BiomedSQL | 68,000 | | "BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases" [[paper](https://arxiv.org/abs/2505.20321)] [[data](https://github.com/NIH-CARD/biomedsql)] | +| Date | Venue | Benchmark | Size | Language | Source | +| ------- | ------------------- | ---------------- | ---------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| 2017-08 | arXiv | WikiSQL | 80654 | | "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning" [[paper](https://arxiv.org/abs/1709.00103)] [[data](https://github.com/salesforce/WikiSQL)] | +| 2018-06 | CL 2018 | Advising | 4570 | | "Improving Text-to-SQL Evaluation Methodology" [[paper](https://arxiv.org/abs/1806.09029)] [[data](https://github.com/jkkummerfeld/text2sql-data/)] | +| 2018-09 | EMNLP 2018 | Spider | 10181 | | "Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task" [[paper](https://arxiv.org/abs/1809.08887)] [[data](https://yale-lily.github.io/spider)] | +| 2019-06 | ACL 2019 | SParC | 12726 | | "SParC: Cross-Domain Semantic Parsing in Context" [[paper](https://arxiv.org/abs/1906.02285)] [[data](https://yale-lily.github.io/sparc)] | +| 2019-07 | WWW 2020 | MIMICSQL | 10000 | | "Text-to-SQL Generation for Question Answering on 
Electronic Medical Records" [[paper](https://arxiv.org/abs/1908.01839)] [[data](https://github.com/wangpinggl/TREQS)] | +| 2019-09 | EMNLP-IJCNLP 2019 | CoSQL | 15598 | | "CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases" [[paper](https://arxiv.org/abs/1909.05378)] [[data](https://yale-lily.github.io/cosql)] | +| 2020-05 | LREC 2020 | Criteria-to-SQL | 2003 | | "Dataset and Enhanced Model for Eligibility Criteria-to-SQL Semantic Parsing" [[paper](https://aclanthology.org/2020.lrec-1.714/)] [[data](https://github.com/xiaojingyu92/Criteria2SQL)] | +| 2020-10 | EMNLP 2020 Findings | Squall | 11276 | | "On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries" [[paper](https://arxiv.org/abs/2010.11246)] [[data](https://github.com/tzshi/squall)] | +| 2020-10 | NAACL-HLT 2021 | Spider-Realistic | 508 | | "Structure-Grounded Pretraining for Text-to-SQL" [[paper](https://arxiv.org/abs/2010.12773)] [[data](https://zenodo.org/records/5205322)] | +| 2021-06 | ACL/IJCNLP 2021 | Spider-Syn | 8034 | | "Towards Robustness of Text-to-SQL Models against Synonym Substitution" [[paper](https://arxiv.org/abs/2106.01065)] [[data](https://arxiv.org/abs/2106.01065)] | +| 2021-06 | NLP4Prog 2021 | SEDE | 12023 | | "Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data" [[paper](https://arxiv.org/abs/2106.05006)] [[data](https://github.com/hirupert/sede)] | +| 2021-06 | ACL/IJCNLP 2021 | KaggleDBQA | 400 | | "KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers" [[paper](https://arxiv.org/abs/2106.11455)] [[data](https://github.com/chiahsuan156/KaggleDBQA)] | +| 2021-09 | EMNLP | Spider-DK | 535 | | "Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization" [[paper](https://arxiv.org/abs/2109.05157)] [[data](https://github.com/ygan/Spider-DK)] | +| 2022-05 | NAACL 2022 Findings | Spider-SS/CG | 8034/45599 | | "Measuring and Improving Compositional 
Generalization in Text-to-SQL via Component Alignment" [[paper](https://arxiv.org/abs/2205.02054)] [[data](https://github.com/ygan/SpiderSS-SpiderCG)] | +| 2023-05 | arXiv | BIRD | 12751 | | "Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs" [[paper](https://arxiv.org/abs/2305.03111)] [[data](https://bird-bench.github.io/)] | +| 2023-06 | ACL 2023 | XSemPLR | 24.4K | | "XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations" [[paper](https://arxiv.org/abs/2306.04085)] [[data](https://github.com/psunlpgroup/XSemPLR)] | +| 2024-05 | ACL 2024 Findings | EHR-SeqSQL | 31669 | | "EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records" [[paper](https://arxiv.org/abs/2406.00019)] | +| 2024-06 | NAACL 2024 | BookSQL | 100K | | "BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain" [[paper](https://arxiv.org/abs/2406.07860)] [[data](https://github.com/Exploration-Lab/BookSQL)] | +| 2024-08 | ACL 2024 Findings | MultiSQL | 9257 | | "MultiSQL: A Schema-Integrated Context-Dependent Text2SQL Dataset with Diverse SQL Operations" [[paper](https://aclanthology.org/2024.findings-acl.823/)] [[data](https://github.com/grandchicken/MultiSQL)] | +| 2024-09 | arXiv | BEAVER | 93 | | "BEAVER: An Enterprise Benchmark for Text-to-SQL" [[paper](https://arxiv.org/abs/2409.02038)] | +| 2024-10 | arXiv | PRACTIQ | 2812 | | "PRACTIQ: A Practical Conversational Text-to-SQL dataset with Ambiguous and Unanswerable Queries" [[paper](https://arxiv.org/abs/2410.11076)] | +| 2024-10 | arXiv | BIS | 239 | | "BIS: NL2SQL Service Evaluation Benchmark for Business Intelligence Scenarios" [[paper](https://arxiv.org/abs/2410.22925)] [[data](https://github.com/boracaglayan/bis-nl2sql)] | +| 2024-11 | ICLR 2025 | Spider 2.0 | 632 | | "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows" 
[[paper](https://arxiv.org/abs/2411.07763)] [[data](https://github.com/xlang-ai/Spider2)] | +| 2025-01 | arXiv | Dialect2SQL | 9428 | | "Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija" [[paper](https://arxiv.org/abs/2501.11498)] | +| 2025-05 | arXiv | LogicCat | 4038 | | "LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges" [[paper](https://arxiv.org/abs/2505.18744)] | +| 2025-05 | arXiv | BiomedSQL | 68,000 | | "BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases" [[paper](https://arxiv.org/abs/2505.20321)] [[data](https://github.com/NIH-CARD/biomedsql)] | +| 2025-09 | arXiv | PARROT | 598 | | "PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation" [[paper](https://arxiv.org/abs/2509.23338)] [[data](https://github.com/weAIDB/PARROT)] | +| 2025-09 | arXiv | MultiSpider 2.0 | 5056 | | "Multilingual Text-to-SQL: Benchmarking the Limits of Language Models with Collaborative Language Agents" [[paper](https://arxiv.org/abs/2509.24405)] [[data](https://github.com/phkhanhtrinh23/Multilingual_Text_to_SQL)] | #### Code Translation @@ -4552,6 +4604,8 @@ $^\diamond$ Machine/human prompts | 2024-01 | arXiv | DebugBench | 4253 | C++, Java, Python | "DebugBench: Evaluating Debugging Capability of Large Language Models" [[paper](https://arxiv.org/abs/2401.04621)] [[data](https://github.com/thunlp/DebugBench)] | | 2024-11 | arXiv | MdEval | 3513 | 18 | "MdEval: Massively Multilingual Code Debugging" [[paper](https://arxiv.org/abs/2411.02310)] | | 2025-01 | arXiv | unnamed | 48,398 | Python | "Suggesting Code Edits in Interactive Machine Learning Notebooks Using Large Language Models" [[paper](https://arxiv.org/abs/2501.09745)] [[data](https://zenodo.org/records/14281690)] | +| 2025-09 | arXiv | AgentPack | 1.3M | 20+ | "AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans" [[paper](https://arxiv.org/abs/2509.21891)] 
[[data](https://huggingface.co/datasets/nuprl/AgentPack)] | +| 2025-09 | arXiv | OCEDataFT | 20K | Python | "Generating High-Quality Datasets for Code Editing via Open-Source Language Models" [[paper](https://arxiv.org/abs/2509.25203)] [[data](https://github.com/zkzhang88/OpenCodeEdit)] | \* These are code-change datasets, and only a subset therein concerns bug fixing. @@ -4602,6 +4656,7 @@ $^\diamond$ Machine/human prompts | 2025-03 | arXiv | DSDBench | 1117 | Python | "Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors" [[paper](https://arxiv.org/abs/2503.22388)] [[data](https://github.com/KevinCL16/DSDBench)] | | 2025-05 | arXiv | SecVulEval | 25440 | C/C++ | "SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection" [[paper](https://arxiv.org/abs/2505.19828)] [[data](https://github.com/basimbd/SecVulEval)] | | 2025-05 | arXiv | SV-TrustEval-C | 3,337 | C | "SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis" [[paper](https://arxiv.org/abs/2505.20630)] [[data](https://github.com/Jackline97/SV-TrustEval-C)] | +| 2025-09 | arXiv | MULocBench | 1,100 | Python | "A Benchmark for Localizing Code and Non-Code Issues in Software Projects" [[paper](https://arxiv.org/abs/2509.25242)] [[data](https://huggingface.co/datasets/somethingone/MULocBench)] | #### Code Retrieval