This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code](https://arxiv.org/abs/2311.07989).

## News

🔥🔥🔥 [2025/09/26] Featured papers:

- 🔥🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group.

- 🔥🔥 [SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?](https://arxiv.org/abs/2509.16941) from Scale AI.

- 🔥 [SWE-QA: Can Language Models Answer Repository-level Code Questions?](https://arxiv.org/abs/2509.14635) from Shanghai Jiao Tong University.

🔥🔥🔥 [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"!


🔥🔥🔥 [2025/09/22] News from CodeFuse

- [CGM (Code Graph Model)](https://arxiv.org/abs/2505.16901) is accepted to NeurIPS 2025. CGM currently ranks 1st among open-weight models on [SWE-Bench-Lite leaderboard](https://www.swebench.com/). [[repo](https://github.com/codefuse-ai/CodeFuse-CGM)]

- [GALLa: Graph Aligned Large Language Models](https://arxiv.org/abs/2409.04183) is accepted to the ACL 2025 main conference. [[repo](https://github.com/codefuse-ai/GALLa)]

#### How to Contribute

If you find a paper to be missing from this repository, misplaced in a category, or lacking a reference to its journal/conference information, please do not hesitate to create an issue.

67. "SCoGen: Scenario-Centric Graph-Based Synthesis of Real-World Code Problems" [2025-09] [[paper](https://arxiv.org/abs/2509.14281)]

68. "Verification Limits Code LLM Training" [2025-09] [[paper](https://arxiv.org/abs/2509.20837)]

### 2.5 Reinforcement Learning on Code

1. **CompCoder**: "Compilable Neural Code Generation with Compiler Feedback" [2022-03] [ACL 2022] [[paper](https://arxiv.org/abs/2203.05132)]

33. "Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization" [2025-09] [[paper](https://arxiv.org/abs/2509.12434)]

34. "DELTA-Code: How Does RL Unlock and Transfer New Programming Algorithms in LLMs?" [2025-09] [[paper](https://arxiv.org/abs/2509.21016)]

## 3. When Coding Meets Reasoning

### 3.1 Coding for Reasoning

76. "GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging" [2025-08] [[paper](https://arxiv.org/abs/2508.18993)]

77. **MapCoder-Lite**: "MapCoder-Lite: Squeezing Multi-Agent Coding into a Single Small LLM" [2025-09] [[paper](https://arxiv.org/abs/2509.17489)]

### 3.4 Interactive Coding

- "Interactive Program Synthesis" [2017-03] [[paper](https://arxiv.org/abs/1703.03539)]

- "CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance" [2025-07] [[paper](https://arxiv.org/abs/2507.10646)]

- "SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement" [2025-09] [[paper](https://arxiv.org/abs/2509.18808)]

### 3.5 Frontend Navigation

- "MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding" [2021-10] [ACL 2022] [[paper](https://arxiv.org/abs/2110.08518)]

- "UI-Venus Technical Report: Building High-performance UI Agents with RFT" [2025-08] [[paper](https://arxiv.org/abs/2508.10833)]

- "Mano Report" [2025-09] [[paper](https://arxiv.org/abs/2509.17336)]

## 4. Code LLM for Low-Resource, Low-Level, and Domain-Specific Languages

- [**Ruby**] "On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages" [2022-04] [ICPC 2022] [[paper](https://arxiv.org/abs/2204.09653)]

- [**CUDA**] "Astra: A Multi-Agent System for GPU Kernel Performance Optimization" [2025-09] [[paper](https://arxiv.org/abs/2509.07506)]

- [**LaTeX**] "Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models" [2025-09] [[paper](https://arxiv.org/abs/2509.17589)]

## 5. Methods/Models for Downstream Tasks

For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF, and (occasionally) static program analysis); the second column contains non-Transformer neural methods (e.g. LSTM, CNN, GNN); the third column contains Transformer based methods (e.g. BERT, GPT, T5).

- "GRACE: Graph-Guided Repository-Aware Code Completion through Hierarchical Code Fusion" [2025-09] [[paper](https://arxiv.org/abs/2509.05980)]

- "CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion" [2025-09] [[paper](https://arxiv.org/abs/2509.16112)]

- "RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation" [2025-09] [[paper](https://arxiv.org/abs/2509.16198)]

### Issue Resolution

- "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" [2023-10] [ICLR 2024] [[paper](https://arxiv.org/abs/2310.06770)]

- "Code-SPA: Style Preference Alignment to Large Language Models for Effective and Robust Code Debugging" [2025-07] [ACL 2025 Findings] [[paper](https://aclanthology.org/2025.findings-acl.912/)]

- "LLaVul: A Multimodal LLM for Interpretable Vulnerability Reasoning about Source Code" [2025-09] [[paper](https://arxiv.org/abs/2509.17337)]

### Malicious Code Detection

- "I-MAD: Interpretable Malware Detector Using Galaxy Transformer" [2019-09] [Comput. Secur. 2021] [[paper](https://arxiv.org/abs/1909.06865)]

- "Evaluating Generated Commit Messages with Large Language Models" [2025-07] [[paper](https://arxiv.org/abs/2507.10906)]

- "CoRaCMG: Contextual Retrieval-Augmented Framework for Commit Message Generation" [2025-09] [[paper](https://arxiv.org/abs/2509.18337)]

### Code Review

- "Using Pre-Trained Models to Boost Code Review Automation" [2022-01] [ICSE 2022] [[paper](https://arxiv.org/abs/2201.06850)]

- "CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects" [2025-09] [[paper](https://arxiv.org/abs/2509.14856)]

- "Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach" [2025-09] [[paper](https://arxiv.org/abs/2509.21170)]

### Log Analysis

- "LogStamp: Automatic Online Log Parsing Based on Sequence Labelling" [2022-08] [[paper](https://arxiv.org/abs/2208.10282)]

- "A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code" [2025-08] [[paper](https://arxiv.org/abs/2508.18106)]

- "Localizing Malicious Outputs from CodeLLM" [2025-09] [[paper](https://arxiv.org/abs/2509.17070)]

### Correctness

- "An Empirical Evaluation of GitHub Copilot's Code Suggestions" [2022-05] [MSR 2022] [[paper](https://ieeexplore.ieee.org/document/9796235)]

- "ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming" [2025-05] [ACL 2025] [[paper](https://arxiv.org/abs/2505.16667)]

- "Intuition to Evidence: Measuring AI's True Impact on Developer Productivity" [2025-09] [[paper](https://arxiv.org/abs/2509.19708)]

## 8. Datasets

### 8.1 Pretraining
| 2025-07 | arXiv | LiveRepoReflection | 1888 | C++, Go, Java, JS, Python, Rust | "Turning the Tide: Repository-based Code Reflection" [[paper](https://arxiv.org/abs/2507.09866)] |
| 2025-07 | arXiv | SWE-Perf | 140 | Python | "SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?" [[paper](https://arxiv.org/abs/2507.12415)] [[data](https://github.com/swe-perf/swe-perf)] |
| 2025-09 | arXiv | RepoDebug | 30696 | 8 | "RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models" [[paper](https://arxiv.org/abs/2509.04078)] |
| 2025-09 | arXiv | SWE-Bench Pro | 1865 | Python, Go, JS, TS | "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" [[paper](https://arxiv.org/abs/2509.16941)] [[data](https://github.com/scaleapi/SWE-bench_Pro-os)] |

\*Line Completion/API Invocation Completion/Function Completion
