This is a new programming benchmark for LLMs with two goals:
- Support as many niche ("low-resource") programming languages as possible, such as Julia, Fortran, R, and others.
  Although LLMs are remarkably good at Python and other high-resource languages, they are much worse at low-resource programming languages. A high-quality benchmark is necessary both to measure LLM capabilities and to make LLMs better.
- Make it easy to support new programming languages and a wide variety of programming tasks.
  Writing a good benchmark is hard, and we don't want to duplicate effort for each language. Prior work, such as MultiPL-E, reduces the effort needed, but only supports a small sliver of programming tasks. Our goal is to reduce effort even further and support a much broader range of programming tasks than MultiPL-E.
We start with BigCodeBench, an LLM programming benchmark with complex Python programming tasks. Each task is accompanied by a reference solution and a comprehensive test suite, both written in Python, of course. We proceed in three steps:
- We prompt a reasoning model to reformulate the task, including the reference solution and test suite, to use standard I/O and re-use as much of the existing code as possible (see the sketch after this list). The result is a new benchmark -- still for Python -- that uses standard I/O. We can partially validate this translation step by running the updated code. With just one attempt, using o4-mini, ~75% of the ~1,000 updated problems pass their own test suite.
- We prompt a model to update the task description to remove references to Python and Python terminology. This requires human validation, but our preliminary work indicates that gpt-4.1-mini does this task well. Notice that we do not need to translate the test suite for each language: since the task uses standard I/O, the tests can stay in Python even when the program is in another language.
- We build containers for each niche language. We have a few already in the containers directory. The Julia container is well documented and should be a template for building other containers.
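To make the reformulation concrete, here is a minimal, hypothetical sketch of what a task looks like after the standard-I/O rewrite, together with a Python test that drives the candidate program through a subprocess. The task, command, and helper names are illustrative only, not the actual benchmark code; the point is that the test observes only standard input and output, so it stays in Python regardless of the language the solution is written in.

```python
import subprocess

# Hypothetical task after the rewrite: "read a list of integers from standard
# input and print the sum of the even ones." The original BigCodeBench-style
# task would have been a Python function plus unit tests that call it directly.

def run_solution(command, stdin_text):
    """Run a candidate program in any language and capture its stdout."""
    result = subprocess.run(
        command,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout.strip()

def test_even_sum(command):
    # The test stays in Python even if the solution is JavaScript, Julia,
    # Fortran, R, etc. -- it only sees stdin and stdout.
    assert run_solution(command, "1 2 3 4\n") == "6"
    assert run_solution(command, "1 3 5\n") == "0"

if __name__ == "__main__":
    # Swap in however the container runs the program, e.g.
    # ["node", "solution.js"] or ["julia", "solution.jl"].
    test_even_sum(["python3", "solution.py"])
    print("ok")
```

This is also why step 2 can strip Python terminology from the task description without touching the test suite.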
To follow these directions, you will need:
- Docker or Podman to run containers.
- A Python environment with `tqdm`, `datasets`, and `litellm` installed.
- The `jq` tool, which will be available from your Linux package manager.
- Generate and Execute Completions: Our benchmark execution script will generate completions with the LLM and execute the generated code in a container. For example, the following command benchmarks JavaScript:

  ```bash
  python3 -m bigcodebench_multipl.run_benchmark generate \
      --model-name openai/gpt-4.1-nano \
      --temperature 0.2 \
      --num-concurrent-requests 50 \
      --max-tokens 2000 \
      --lang "JavaScript using Node 24" \
      --container-name "ghcr.io/arjunguha/bcb_multipl-js" \
      --output-path js.jsonl
  ```

  This assumes that you can run 50 containers concurrently. If you have fewer cores, you can reduce the number of concurrent requests.
  Notice that we use `JavaScript using Node 24` as the language. This is the version of Node that is installed in the container. It is important to convey this information to the model, since JavaScript on the web is quite different from server-side JavaScript.

  The output JSONL file has a field called `program` with the model-generated JavaScript code (see the sketch after this list).

- Compute Pass@1: This is just the fraction of programs that pass all tests.

  ```bash
  ./bin/pass1.sh js.jsonl
  ```
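Beyond the aggregate score, the output JSONL is easy to post-process yourself. Below is a minimal sketch that assumes the `program` field described above; the `status` field used to count passes is a guess, so check your own output (or `bin/pass1.sh`) for the field name it actually records.

```python
import json

# Load the benchmark output produced by run_benchmark.
with open("js.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# Inspect the first model-generated program (the `program` field).
print(records[0]["program"])

# Pass@1 is the fraction of programs that pass all tests. The `status`
# field name here is hypothetical -- use whatever field pass1.sh reads.
passed = sum(1 for r in records if r.get("status") == "ok")
print(f"pass@1 = {passed / len(records):.3f}")
```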
To support a new programming language, you will need to write a container that can run its code. You also need to decide which libraries should be available in the container. The best way to do this is to first generate completions for your language with some model, ideally whichever model you think is strongest for that language. Then use your text-processing skills to extract the list of libraries that the model is trying to use. With that list, you can build a container that has those libraries installed.
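As a concrete, illustrative example of that text-processing step, the sketch below scans generated JavaScript programs for `require`/`import` statements and tallies the libraries they reference. The regular expressions are rough heuristics, not part of the benchmark tooling, and would need to be adapted per language (e.g., `using Foo` in Julia, `library(foo)` in R).

```python
import collections
import json
import re

# Rough heuristics for spotting library use in generated JavaScript.
PATTERNS = [
    re.compile(r"""require\(['"]([^'"]+)['"]\)"""),
    re.compile(r"""^\s*import .* from ['"]([^'"]+)['"]""", re.MULTILINE),
]

counts = collections.Counter()
with open("js.jsonl", encoding="utf-8") as f:
    for line in f:
        program = json.loads(line)["program"]
        for pattern in PATTERNS:
            counts.update(pattern.findall(program))

# The most frequently requested libraries are good candidates to install
# in the container.
for library, n in counts.most_common(20):
    print(f"{n:5d}  {library}")
```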
We recommend modifying an existing container. Look at the Julia container in `containers/jl` for a well-documented example.
See the README in the `bigcodebench_multipl` directory for instructions on how to construct the benchmark.
- A preliminary version of the benchmark is in this Hugging Face Dataset.
- We are putting generations from models in this Hugging Face Dataset.
- See this Google Sheet for model evaluations.
This work is supported by grants from the U.S. National Science Foundation (SES-2326173) and the U.S. Department of Energy, Office of Science (DE-SC0025613).