We provide the scripts used to generate the 4-tiered test suite dataset, fine-tune the models, and evaluate them as described in the paper. All fine-tuned models and the dataset are also available on Hugging Face.
The base dataset comes from Hugging Face's open-r1/verifiable-coding-problems-python-v2. We used OpenAI's o1-pro, o3, and o4-mini to generate a 4-tiered test suite for each data point. The final dataset contains 15.5K data points, each paired with a 4-tiered test suite (~62K test cases in total).
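For a quick look at the starting point, the base dataset can be inspected straight from the Hub. Below is a minimal sketch; it assumes the standard train split, and the column names of the final tiered dataset may differ:

from datasets import load_dataset

# Load the base problems that the 4-tiered test suites were generated for.
base = load_dataset("open-r1/verifiable-coding-problems-python-v2", split="train")
print(len(base))       # number of base problems
print(base[0].keys())  # columns available for a single data point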
We verify the effectiveness of TAROT on the 1.5B, 3B, and 7B variants of Qwen2.5-Instruct and Qwen2.5-Coder-Instruct, on Qwen3-4B-Instruct-2507, and on the 2B and 9B variants of Gemma2, using the GRPO algorithm.
To streamline Reinforcement Fine-Tuning (RFT) and to keep the fine-tuning process consistent across model families, we used dstack.ai to provision cloud VMs, Hugging Face's Open-R1 to define the high-level fine-tuning recipes, trl to define the custom reward system, vLLM for faster generation during GRPO, and E2B for safe code execution in a cloud sandbox.
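As a rough, hypothetical sketch of how these pieces fit together: trl's GRPO reward interface is a callable over a batch of completions that returns one score per completion, and E2B's Python SDK provides the sandboxed execution. The function name, tier names, weights, and the test_suites dataset column below are illustrative; the actual implementations live under the custom_rewards folder.

from e2b_code_interpreter import Sandbox

# Hypothetical per-tier weights; the real weighting is defined per reward file.
TIER_WEIGHTS = {"basic": 0.25, "edge": 0.25, "performance": 0.25, "adversarial": 0.25}

def tiered_reward(completions, test_suites, **kwargs):
    """Score each completion by the weighted set of test tiers it passes."""
    rewards = []
    sandbox = Sandbox()  # E2B cloud sandbox; auto-expires after its timeout
    for code, suite in zip(completions, test_suites):
        # Assumes each completion is plain Python source; a real reward would
        # first extract the code block from the model's response.
        score = 0.0
        for tier, tests in suite.items():
            # Run the generated code together with the tier's assertions;
            # execution.error is None only when every assertion passed.
            execution = sandbox.run_code(code + "\n" + tests)
            if execution.error is None:
                score += TIER_WEIGHTS.get(tier, 0.0)
        rewards.append(score)
    return rewards

Passing a callable like this to trl's GRPOTrainer via reward_funcs is what lets the tiered test results drive the policy updates.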
The exact fine-tuning process is specified in the dstack YAML file, which you can find here. To run the YAML, execute the following commands after setting up dstack (the how-to guide is here).
$ export HF_TOKEN=[YOUR-HUGGING-FACE-TOKEN]
$ export HUGGINGFACE_TOKEN=[YOUR-HUGGING-FACE-TOKEN]
$ export WANDB_API_KEY=[YOUR-WANDB-API-KEY]
$ export E2B_API_KEY=[YOUR-E2B-API-KEY]
# TARGET_BASE_MODEL: base model (e.g., Qwen/Qwen2.5-1.5B-Instruct)
# TARGET_YAML: RFT recipe (e.g., qwen2.5-1.5b/basic_only.yaml)
# TARGET_REWARD: reward function (e.g., custom_rewards/basic_only.py)
# MAX_MODEL_LEN: max generation length (e.g., 4096)
$ TARGET_BASE_MODEL=.... \
TARGET_YAML=.... \
TARGET_REWARD=.... \
MAX_MODEL_LEN=.... \
dstack apply -f 8x80GB.task.dstack.yml -d

The list of TARGET_YAML recipes can be found under the gemma2-2b-it, gemma2-9b-it, qwen2.5-1.5b, qwen2.5-3b, qwen2.5-7b, qwen2.5-coder-1.5b, qwen2.5-coder-3b, qwen2.5-coder-7b, and qwen3-4b-it folders, and the list of custom reward functions under the custom_rewards folder.
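For example, to fine-tune Qwen2.5-1.5B-Instruct with the basic-only recipe and reward:

$ TARGET_BASE_MODEL=Qwen/Qwen2.5-1.5B-Instruct \
TARGET_YAML=qwen2.5-1.5b/basic_only.yaml \
TARGET_REWARD=custom_rewards/basic_only.py \
MAX_MODEL_LEN=4096 \
dstack apply -f 8x80GB.task.dstack.yml -d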
All fine-tuned models were evaluated on HumanEval, HumanEval+, MBPP, MBPP+, LiveCodeBench_v5, CodeForces, and CruxEval using the Evalchemy framework to ensure evaluation consistency.
Detailed evaluation steps can be found in the official Evalchemy README, but here is a brief walkthrough for convenience.
# Create and activate conda environment
$ conda create --name evalchemy python=3.10
$ conda activate evalchemy
# Clone the repo
$ git clone https://github.com/mlfoundations/evalchemy.git
$ cd evalchemy
# Install dependencies
$ pip install -e .
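# Log in to Hugging Face to access the models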
$ huggingface-cli login
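# Run the evaluation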
$ python -m eval.eval \
--model vllm \
--tasks HumanEval,HumanEvalPlus,MBPP,MBPPPlus,LiveCodeBenchv5_official,CodeForces,CruxEval \
--model_args "pretrained=[TARGET-MODEL],tensor_parallel_size=[NUM-GPUs]" \
--batch_size 16 \
--output_path logs