README.md: 25 changes (15 additions, 10 deletions)
@@ -15,21 +15,17 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per

## News

-🔥🔥🔥 [2025/10/11] Featured papers:
+🔥🔥🔥 [2025/10/13] Featured papers:

-- 🔥🔥 [EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models](https://arxiv.org/abs/2510.03760) from City University of Hong Kong.
-
-- 🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group.
+- 🔥🔥 [LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?](https://arxiv.org/abs/2510.09595) from University of Michigan.

-- 🔥 [BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software](https://arxiv.org/abs/2509.25248) from Arizona State University.
+- 🔥🔥 [Scaling Laws for Code: A More Data-Hungry Regime](https://arxiv.org/abs/2510.08702) from Harbin Institute of Technology.

-- 🔥 [Devstral: Fine-tuning Language Models for Coding Agent Applications](https://arxiv.org/abs/2509.25193) from Mistral AI.
+- 🔥🔥 [BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution](https://arxiv.org/abs/2510.08697) from Monash University.

-- 🔥 [LLaDA-MoE: A Sparse MoE Diffusion Language Model](https://arxiv.org/abs/2509.24389) from Ant Group.
-
-- 🔥 [Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents](https://arxiv.org/abs/2509.23045) from Moonshot AI.
+- 🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group.

-- 🔥 [ML2B: Multi-Lingual ML Benchmark For AutoML](https://arxiv.org/abs/2509.22768) from HSE University.
+- 🔥🔥 [EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models](https://arxiv.org/abs/2510.03760) from City University of Hong Kong.

🔥🔥     [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"!

@@ -521,6 +517,8 @@ These models are Transformer encoders, decoders, and encoder-decoders pretrained

27. **Mellum**: "Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding" [2025-10] [[paper](https://arxiv.org/abs/2510.05788)]

28. "Scaling Laws for Code: A More Data-Hungry Regime" [2025-10] [[paper](https://arxiv.org/abs/2510.08702)]

#### Encoder-Decoder

1. **PyMT5** (Span Corruption): "PyMT5: multi-mode translation of natural language and Python code with transformers" [2020-10] [EMNLP 2020] [[paper](https://arxiv.org/abs/2010.03150)]
@@ -3603,6 +3601,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "When Names Disappear: Revealing What LLMs Actually Understand About Code" [2025-10] [[paper](https://arxiv.org/abs/2510.03178)]

- "MEC3O: Multi-Expert Consensus for Code Time Complexity Prediction" [2025-10] [[paper](https://arxiv.org/abs/2510.09049)]

### Software Modeling

- "Towards using Few-Shot Prompt Learning for Automating Model Completion" [2022-12] [[paper](https://arxiv.org/abs/2212.03404)]
@@ -4391,6 +4391,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks" [2025-07] [[paper](https://arxiv.org/abs/2507.10535)]

- "BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution" [2025-10] [[paper](https://arxiv.org/abs/2510.08697)]

- "How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective" [2025-10] [[paper](https://arxiv.org/abs/2510.08720)]

#### Program Synthesis

| Date | Venue | Benchmark | Size | Language | Source |
@@ -4485,6 +4489,7 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
| 2025-08 | arXiv | FPBench | 1800 | Python | "Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework" [[paper](https://arxiv.org/abs/2508.03622)] [[data](https://github.com/JialinLi13/FaultyPremise)] |
| 2025-08 | arXiv | AutoCodeBench | 3,920 | 20 | "AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators" [[paper](https://arxiv.org/abs/2508.09101)] [[data](https://autocodebench.github.io/)] |
| 2025-08 | arXiv | AetherCode | 456 | C++ | "AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions" [[paper](https://arxiv.org/abs/2508.16402)] [[data](https://huggingface.co/datasets/m-a-p/AetherCode)] |
+| 2025-10 | arXiv | LiveOIBench | 403 | | "LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?" [[paper](https://arxiv.org/abs/2510.09595)] [[data](https://liveoibench.github.io/)] |

\* Automatically mined/human-annotated
