Polars Bench Submission

Text-to-Polars code generation using Qwen2.5-Coder (MLX backend, 4-bit quantization).

Approach

Model: mlx-community/Qwen2.5-Coder-7B-Instruct-4bit (or 3B variant for speed)
Prompting: System instruction with Polars-specific syntax rules + 5 carefully chosen few-shot examples targeting common LLM failure modes (date handling, sort direction, membership tests, scalar extraction)
Self-repair loop: If generated code throws an exception, the error is fed back to the model for one retry
Output parsing: Strips markdown fences and special tokens (<|im_end|>), then extracts the executable expression
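The parsing and self-repair steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src/: the names extract_code and generate_with_repair, the list of special tokens, and the wording of the repair prompt are all assumptions made for the example. The generate and run callables stand in for the model call and the executor.

```python
import re
from typing import Any, Callable

# Build the fence marker programmatically so this snippet stays easy to embed.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)

# Assumed token list; only <|im_end|> is named in this README.
SPECIAL_TOKENS = ("<|im_end|>", "<|im_start|>", "<|endoftext|>")


def extract_code(raw: str) -> str:
    """Drop special tokens and markdown fences, returning the bare code."""
    for token in SPECIAL_TOKENS:
        raw = raw.replace(token, "")
    match = FENCE_RE.search(raw)
    if match:
        raw = match.group(1)
    return raw.strip()


def generate_with_repair(
    generate: Callable[[str], str],
    run: Callable[[str], Any],
    question: str,
) -> Any:
    """Run generated code; on an exception, feed the error back for one retry."""
    code = extract_code(generate(question))
    try:
        return run(code)
    except Exception as exc:
        # Hypothetical repair prompt: the real wording is not shown in this README.
        repair_prompt = (
            f"{question}\n\nPrevious attempt:\n{code}\n\n"
            f"Error: {exc}\nReturn corrected code only."
        )
        # Single retry: a second failure propagates to the caller.
        return run(extract_code(generate(repair_prompt)))
```

Capping the loop at one retry keeps the worst-case latency per question at two generations, which matters when the score depends on elapsed time.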

Results on local eval set

16/16 correct in ~39s (N/T ≈ 0.41, i.e. 16 correct answers over 39 seconds)

Setup

Apple Silicon (MLX backend):

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-apple.txt
python data/make_data.py

Linux / CUDA (transformers + bitsandbytes backend):

python3 -m venv .venv
source .venv/bin/activate
# Install torch with the CUDA version matching your driver (example: cu121)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
python data/make_data.py

Run

Benchmark server (used by the platform runner):

bash start.sh
# or: uvicorn server:app --host 0.0.0.0 --port 9000

The server exposes:

POST /chat — receives {question_id, message, schema, data_path?, data_b64?}, returns {question_id, response}
GET /health — readiness probe
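A request to POST /chat could be built as in the sketch below. It assumes the server accepts a JSON body and listens on port 9000 as in the uvicorn command above; the helper names, the example question, and the shape of the schema value (a column-to-dtype mapping) are hypothetical, since the README does not specify them.

```python
import json
import urllib.request
from typing import Any, Optional


def build_chat_request(
    base_url: str,
    question_id: str,
    message: str,
    schema: Any,
    data_path: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST /chat request with the fields listed above."""
    payload = {"question_id": question_id, "message": message, "schema": schema}
    if data_path is not None:
        # data_path is optional in the endpoint description.
        payload["data_path"] = data_path
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON reply ({question_id, response})."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Usage (requires a running server):
#   req = build_chat_request(
#       "http://localhost:9000", "q1", "Total amount by region?",
#       schema={"region": "str", "amount": "f64"},  # hypothetical schema shape
#   )
#   print(send(req)["response"])
```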

Local eval loop (development only):

python run.py

The model backend is selected automatically: MLX on Apple Silicon, transformers (4-bit via bitsandbytes, float16 fallback) on Linux/CUDA.
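One plausible way to implement this automatic selection is sketched below. The function name select_backend, the returned labels, and the rule that the float16 fallback applies when bitsandbytes is not installed are assumptions for illustration; the actual logic lives in src/model.py and may differ.

```python
import importlib.util
import platform
from typing import Optional


def select_backend(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    has_bitsandbytes: Optional[bool] = None,
) -> str:
    """Pick a backend label from the host platform.

    Arguments default to the values detected on the current machine;
    passing them explicitly makes the function easy to test.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if has_bitsandbytes is None:
        # find_spec returns None when the package is not installed.
        has_bitsandbytes = importlib.util.find_spec("bitsandbytes") is not None

    if system == "Darwin" and machine == "arm64":
        return "mlx"  # Apple Silicon
    if has_bitsandbytes:
        return "transformers-4bit"
    return "transformers-fp16"  # assumed fallback condition
```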

Files

server.py — FastAPI inference server (benchmark entrypoint)
start.sh — Starts the server on port 9000
src/model.py — Code generator (MLX on Apple Silicon, transformers on Linux/CUDA)
src/prompt.py — System instruction + few-shot examples
src/executor.py — Safe code execution with timeout and output cleanup
src/evaluator.py — Eval loop with self-repair retry
run.py — Local evaluation script (development only)
data/eval_set.json — Ground-truth test cases
data/make_data.py — Generates synthetic sales parquet

Key optimizations

Targeted few-shots — each example fixes a specific Polars footgun (.dt.month() needs parentheses, .is_in() rather than pandas-style .isin(), descending=True rather than pandas-style ascending=False)
Explicit syntax rules in system prompt — cheaper than adding more few-shots
Self-repair — catches transient generation errors with one retry
max_tokens=200 — covers complex group-by chains without over-generating

Notes

Developed on Apple Silicon (M-series).
Linux target model: Qwen/Qwen2.5-Coder-7B-Instruct (same model family, downloaded from the Hugging Face Hub on first run).
