From ac5fdefaed07174b712fcdf4d961814e0af9b28e Mon Sep 17 00:00:00 2001 From: Geralt <94539084+Geralt-Targaryen@users.noreply.github.com> Date: Mon, 22 Sep 2025 18:05:27 +0800 Subject: [PATCH] latest papers 09-22 --- README.md | 37 +++++++++++++++++++++++++++---------- 1 file changed, 27 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index 5da08da..c20dad2 100644 --- a/README.md +++ b/README.md @@ -15,17 +15,17 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per ## News -🔥🔥🔥 [2025/09/12] Featured papers: +🔥🔥🔥 [2025/09/22] Featured papers: -- 🔥🔥 [LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering](https://arxiv.org/abs/2509.09614) from Salesforce AI Research. +- 🔥🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group. -- 🔥🔥 [Astra: A Multi-Agent System for GPU Kernel Performance Optimization](https://arxiv.org/abs/2509.07506) from Stanford University. +- 🔥🔥 [SWE-QA: Can Language Models Answer Repository-level Code Questions?](https://arxiv.org/abs/2509.14635) from Shanghai Jiao Tong University. -- 🔥🔥 [GRACE: Graph-Guided Repository-Aware Code Completion through Hierarchical Code Fusion](https://arxiv.org/abs/2509.05980) from Zhejiang University. +- 🔥 [LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering](https://arxiv.org/abs/2509.09614) from Salesforce AI Research. -- 🔥 [LongCat-Flash Technical Report](https://arxiv.org/abs/2509.01322) from Meituan. +- 🔥 [Astra: A Multi-Agent System for GPU Kernel Performance Optimization](https://arxiv.org/abs/2509.07506) from Stanford University. -- 🔥 [Towards Better Correctness and Efficiency in Code Generation](https://arxiv.org/abs/2508.20124) from Qwen Team. 
+- 🔥 [GRACE: Graph-Guided Repository-Aware Code Completion through Hierarchical Code Fusion](https://arxiv.org/abs/2509.05980) from Zhejiang University. 🔥🔥🔥 [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"! @@ -33,11 +33,11 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per 🔥         [2024/09/06] **Our survey has been accepted for publication by [Transactions on Machine Learning Research (TMLR)](https://jmlr.org/tmlr/).** -🔥🔥🔥 [2025/06/25] News from Codefuse +🔥🔥🔥 [2025/09/22] News from CodeFuse -- [GALLa: Graph Aligned Large Language Models](https://arxiv.org/abs/2409.04183) is accepted by ACL 2025 main conference. [[repo](https://github.com/codefuse-ai/GALLa)] +- [CGM (Code Graph Model)](https://arxiv.org/abs/2505.16901) is accepted by NeurIPS 2025. CGM currently ranks 1st among open-source models on the [SWE-Bench leaderboard](https://www.swebench.com/). [[repo](https://github.com/codefuse-ai/CodeFuse-CGM)] -- [CGM (Code Graph Model)](https://arxiv.org/abs/2505.16901) is released, **currently ranking 1st among open-source models on [SWE-Bench leaderboard](https://www.swebench.com/)**. [[repo](https://github.com/codefuse-ai/CodeFuse-CGM)] +- [GALLa: Graph Aligned Large Language Models](https://arxiv.org/abs/2409.04183) is accepted by ACL 2025 main conference. [[repo](https://github.com/codefuse-ai/GALLa)]

@@ -553,6 +553,8 @@ These models are Transformer encoders, decoders, and encoder-decoders pretrained 2. **Dream-Coder**: "Dream-Coder 7B: An Open Diffusion Language Model for Code" [2025-09] [[paper](https://arxiv.org/abs/2509.01142)] +3. "Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation" [2025-09] [[paper](https://arxiv.org/abs/2509.11252)] + ### 2.4 (Instruction) Fine-Tuning on Code These models apply Instruction Fine-Tuning techniques to enhance the capacities of Code LLMs. @@ -687,6 +689,10 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities 65. **SCoder**: "SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs" [2025-09] [[paper](https://arxiv.org/abs/2509.07858)] +66. "Do Code Semantics Help? A Comprehensive Study on Execution Trace-Based Information for Code Large Language Models" [2025-09] [[paper](https://arxiv.org/abs/2509.11686)] + +67. "SCoGen: Scenario-Centric Graph-Based Synthesis of Real-World Code Problems" [2025-09] [[paper](https://arxiv.org/abs/2509.14281)] + ### 2.5 Reinforcement Learning on Code 1. **CompCoder**: "Compilable Neural Code Generation with Compiler Feedback" [2022-03] [ACL 2022] [[paper](https://arxiv.org/abs/2203.05132)] @@ -753,6 +759,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities 32. "Towards Better Correctness and Efficiency in Code Generation" [2025-08] [[paper](https://arxiv.org/abs/2508.20124)] +33. "Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization" [2025-09] [[paper](https://arxiv.org/abs/2509.12434)] + ## 3. When Coding Meets Reasoning ### 3.1 Coding for Reasoning @@ -1993,7 +2001,7 @@ For each task, the first column contains non-neural methods (e.g. 
n-gram, TF-IDF - "NL-Debugging: Exploiting Natural Language as an Intermediate Representation for Code Debugging" [2025-05] [[paper](https://arxiv.org/abs/2505.15356)] -- "The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models" [2025-05] [EASE, June 2025] [[paper](https://arxiv.org/abs/2505.02931)] +- "The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models" [2025-05] [EASE, June 2025] [[paper](https://arxiv.org/abs/2505.02931)] - "Adversarial Reasoning for Repair Based on Inferred Program Intent" [2025-05] [ISSTA 2025] [[paper](https://arxiv.org/abs/2505.13008)] @@ -2251,6 +2259,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF - "Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning" [2025-08] [[paper](https://arxiv.org/abs/2508.03501)] +- "An Empirical Study on Failures in Automated Issue Solving" [2025-09] [[paper](https://arxiv.org/abs/2509.13941)] + ### Frontend Development - "Seeking the user interface", 2014-09, ASE 2014, [[paper](https://dl.acm.org/doi/10.1145/2642937.2642976)] @@ -2647,6 +2657,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF - "Evaluating NL2SQL via SQL2NL" [2025-09] [[paper](https://arxiv.org/abs/2509.04657)] +- "DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction" [2025-09] [[paper](https://arxiv.org/abs/2509.14507)] + ### Program Proof - "Baldur: Whole-Proof Generation and Repair with Large Language Models" [2023-03] [FSE 2023] [[paper](https://arxiv.org/abs/2303.04910)] @@ -3403,6 +3415,8 @@ For each task, the first column contains non-neural methods (e.g. 
n-gram, TF-IDF - "Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects" [2025-07] [[paper](https://arxiv.org/abs/2507.19271)] +- "CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects" [2025-09] [[paper](https://arxiv.org/abs/2509.14856)] + ### Log Analysis - "LogStamp: Automatic Online Log Parsing Based on Sequence Labelling" [2022-08] [[paper](https://arxiv.org/abs/2208.10282)] @@ -3851,6 +3865,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF - "When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions" [2025-07] [[paper](https://arxiv.org/abs/2507.20439)] +- "Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations" [2025-09] [[paper](https://arxiv.org/abs/2509.13680)] + ### Interpretability - "A Critical Study of What Code-LLMs (Do Not) Learn" [2024-06] [ACL 2024 Findings] [[paper](https://arxiv.org/abs/2406.11930)] @@ -4424,6 +4440,7 @@ $^\diamond$ Machine/human prompts | 2025-03 | ACL 2025 | LONGCODEU | 3983 | Python | "LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding" [[paper](https://arxiv.org/abs/2503.04359)] | | 2025-05 | arXiv | CodeSense | 4495 | Python, C, Java | "CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning" [[paper](https://arxiv.org/abs/2506.00750)] [[data](https://codesense-bench.github.io/)] | | 2025-07 | arXiv | CORE | 12,533 | C/C++, Java, Python | "CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks" [[paper](https://arxiv.org/abs/2507.05269)] [[data](https://corebench.github.io/)] | +| 2025-09 | arXiv | SWE-QA | 576 | Python | "SWE-QA: Can Language Models Answer Repository-level Code Questions?" [[paper](https://arxiv.org/abs/2509.14635)] [[data](https://github.com/peng-weihan/SWE-QA-Bench)] | #### Text-to-SQL