diff --git a/README.md b/README.md
index c20dad2..e027f88 100644
--- a/README.md
+++ b/README.md
@@ -15,17 +15,13 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per
 ## News
 
-🔥🔥🔥 [2025/09/22] Featured papers:
+🔥🔥🔥 [2025/09/26] Featured papers:
 
 - 🔥🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group.
 
-- 🔥🔥 [SWE-QA: Can Language Models Answer Repository-level Code Questions?](https://arxiv.org/abs/2509.14635) from Shanghai Jiao Tong University.
+- 🔥🔥 [SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?](https://arxiv.org/abs/2509.16941) from Scale AI.
 
-- 🔥 [LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering](https://arxiv.org/abs/2509.09614) from Salesforce AI Research.
-
-- 🔥 [Astra: A Multi-Agent System for GPU Kernel Performance Optimization](https://arxiv.org/abs/2509.07506) from Stanford University.
-
-- 🔥 [GRACE: Graph-Guided Repository-Aware Code Completion through Hierarchical Code Fusion](https://arxiv.org/abs/2509.05980) from Zhejiang University.
+- 🔥 [SWE-QA: Can Language Models Answer Repository-level Code Questions?](https://arxiv.org/abs/2509.14635) from Shanghai Jiao Tong University.
 
 🔥🔥🔥 [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"!
@@ -35,14 +31,10 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per
 🔥🔥🔥 [2025/09/22] News from Codefuse
 
-- [CGM (Code Graph Model)](https://arxiv.org/abs/2505.16901) is accepted to NeurIPS 2025. CGM currently ranks 1st among open-source models on [SWE-Bench leaderboard](https://www.swebench.com/). [[repo](https://github.com/codefuse-ai/CodeFuse-CGM)]
+- [CGM (Code Graph Model)](https://arxiv.org/abs/2505.16901) is accepted to NeurIPS 2025. CGM currently ranks 1st among open-weight models on [SWE-Bench-Lite leaderboard](https://www.swebench.com/). [[repo](https://github.com/codefuse-ai/CodeFuse-CGM)]
 
 - [GALLa: Graph Aligned Large Language Models](https://arxiv.org/abs/2409.04183) is accepted by ACL 2025 main conference. [[repo](https://github.com/codefuse-ai/GALLa)]
-
 
-
-
 
-
 #### How to Contribute
 
 If you find a paper to be missing from this repository, misplaced in a category, or lacking a reference to its journal/conference information, please do not hesitate to create an issue.
@@ -693,6 +685,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 67. "SCoGen: Scenario-Centric Graph-Based Synthesis of Real-World Code Problems" [2025-09] [[paper](https://arxiv.org/abs/2509.14281)]
 
+68. "Verification Limits Code LLM Training" [2025-09] [[paper](https://arxiv.org/abs/2509.20837)]
+
 ### 2.5 Reinforcement Learning on Code
 
 1. **CompCoder**: "Compilable Neural Code Generation with Compiler Feedback" [2022-03] [ACL 2022] [[paper](https://arxiv.org/abs/2203.05132)]
@@ -761,6 +755,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 33. "Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization" [2025-09] [[paper](https://arxiv.org/abs/2509.12434)]
 
+34. "DELTA-Code: How Does RL Unlock and Transfer New Programming Algorithms in LLMs?" [2025-09] [[paper](https://arxiv.org/abs/2509.21016)]
+
 ## 3. When Coding Meets Reasoning
 
 ### 3.1 Coding for Reasoning
@@ -1077,6 +1073,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 76. "GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging" [2025-08] [[paper](https://arxiv.org/abs/2508.18993)]
 
+77. **MapCoder-Lite**: "MapCoder-Lite: Squeezing Multi-Agent Coding into a Single Small LLM" [2025-09] [[paper](https://arxiv.org/abs/2509.17489)]
+
 ### 3.4 Interactive Coding
 
 - "Interactive Program Synthesis" [2017-03] [[paper](https://arxiv.org/abs/1703.03539)]
@@ -1185,6 +1183,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 - "CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance" [2025-07] [[paper](https://arxiv.org/abs/2507.10646)]
 
+- "SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement" [2025-09] [[paper](https://arxiv.org/abs/2509.18808)]
+
 ### 3.5 Frontend Navigation
 
 - "MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding" [2021-10] [ACL 2022] [[paper](https://arxiv.org/abs/2110.08518)]
@@ -1295,6 +1295,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 - "UI-Venus Technical Report: Building High-performance UI Agents with RFT" [2025-08] [[paper](https://arxiv.org/abs/2508.10833)]
 
+- "Mano Report" [2025-09] [[paper](https://arxiv.org/abs/2509.17336)]
+
 ## 4. Code LLM for Low-Resource, Low-Level, and Domain-Specific Languages
 
 - [**Ruby**] "On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages" [2022-04] [ICPC 2022] [[paper](https://arxiv.org/abs/2204.09653)]
@@ -1483,6 +1485,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 - [**CUDA**] "Astra: A Multi-Agent System for GPU Kernel Performance Optimization" [2025-09] [[paper](https://arxiv.org/abs/2509.07506)]
 
+- [**LaTeX**] "Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models" [2025-09] [[paper](https://arxiv.org/abs/2509.17589)]
+
 ## 5. Methods/Models for Downstream Tasks
 
 For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF, and (occasionally) static program analysis); the second column contains non-Transformer neural methods (e.g. LSTM, CNN, GNN); the third column contains Transformer based methods (e.g. BERT, GPT, T5).
@@ -2225,6 +2229,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 - "GRACE: Graph-Guided Repository-Aware Code Completion through Hierarchical Code Fusion" [2025-09] [[paper](https://arxiv.org/abs/2509.05980)]
 
+- "CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion" [2025-09] [[paper](https://arxiv.org/abs/2509.16112)]
+
+- "RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation" [2025-09] [[paper](https://arxiv.org/abs/2509.16198)]
+
 ### Issue Resolution
 
 - "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" [2023-10] [ICLR 2024] [[paper](https://arxiv.org/abs/2310.06770)]
@@ -3183,6 +3191,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 - "Code-SPA: Style Preference Alignment to Large Language Models for Effective and Robust Code Debugging" [2025-07] [ACL 2025 Findings] [[paper](https://aclanthology.org/2025.findings-acl.912/)]
 
+- "LLaVul: A Multimodal LLM for Interpretable Vulnerability Reasoning about Source Code" [2025-09] [[paper](https://arxiv.org/abs/2509.17337)]
+
 ### Malicious Code Detection
 
 - "I-MAD: Interpretable Malware Detector Using Galaxy Transformer", 2019-09, Comput. Secur. 2021, [[paper](https://arxiv.org/abs/1909.06865)]
@@ -3337,6 +3347,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 - "Evaluating Generated Commit Messages with Large Language Models" [2025-07] [[paper](https://arxiv.org/abs/2507.10906)]
 
+- "CoRaCMG: Contextual Retrieval-Augmented Framework for Commit Message Generation" [2025-09] [[paper](https://arxiv.org/abs/2509.18337)]
+
 ### Code Review
 
 - "Using Pre-Trained Models to Boost Code Review Automation" [2022-01] [ICSE 2022] [[paper](https://arxiv.org/abs/2201.06850)]
@@ -3417,6 +3429,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 - "CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects" [2025-09] [[paper](https://arxiv.org/abs/2509.14856)]
 
+- "Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach" [2025-09] [[paper](https://arxiv.org/abs/2509.21170)]
+
 ### Log Analysis
 
 - "LogStamp: Automatic Online Log Parsing Based on Sequence Labelling" [2022-08] [[paper](https://arxiv.org/abs/2208.10282)]
@@ -3707,6 +3721,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 - "A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code" [2025-08] [[paper](https://arxiv.org/abs/2508.18106)]
 
+- "Localizing Malicious Outputs from CodeLLM" [2025-09] [[paper](https://arxiv.org/abs/2509.17070)]
+
 ### Correctness
 
 - "An Empirical Evaluation of GitHub Copilot's Code Suggestions" [2022-05] [MSR 2022] [[paper](https://ieeexplore.ieee.org/document/9796235)]
@@ -4201,6 +4217,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 - "ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming" [2025-05] [ACL 2025] [[paper](https://arxiv.org/abs/2505.16667)]
 
+- "Intuition to Evidence: Measuring AI's True Impact on Developer Productivity" [2025-09] [[paper](https://arxiv.org/abs/2509.19708)]
+
 ## 8. Datasets
 
 ### 8.1 Pretraining
@@ -4692,6 +4710,7 @@ $^\diamond$ Machine/human prompts
 | 2025-07 | arXiv | LiveRepoReflection | 1888 | C++, Go, Java, JS, Python, Rust | "Turning the Tide: Repository-based Code Reflection" [[paper](https://arxiv.org/abs/2507.09866)] |
 | 2025-07 | arXiv | SWE-Perf | 140 | Python | "SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?" [2025-07] [[paper](https://arxiv.org/abs/2507.12415)] [[data](https://github.com/swe-perf/swe-perf)] |
 | 2025-09 | arXiv | RepoDebug | 30696 | 8 | "RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models" [[paper](https://arxiv.org/abs/2509.04078)] |
+| 2025-09 | arXiv | SWE-Bench Pro | 1865 | Python, Go, JS, TS | "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" [[paper](https://arxiv.org/abs/2509.16941)] [[data](https://github.com/scaleapi/SWE-bench_Pro-os)] |
 
 \*Line Completion/API Invocation Completion/Function Completion