<a href="https://colab.research.google.com/github/OpenPipe/ART/blob/auto-art/clean_auto_art.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To train a model for your custom task, add your OpenRouter API key in the [Configuration](#Configuration) section, then click _Runtime_ and press _Run all_. Make sure you've enabled a free Tesla T4 GPU!

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord_pill.png" height="50"></a>
<a href="https://art.openpipe.ai"><img src="https://github.com/openpipe/art/raw/main/assets/Documentation_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

**Custom Task Training with ART**

This notebook shows how to train a Qwen 2.5 7B model to perform any single-turn task you describe - no labeled data needed! Simply describe what you want the model to learn, and this notebook will:

1. Generate diverse input examples for your task
2. Create an appropriate system prompt
3. Train the model using RULER's automatic evaluation
4. Test the trained model on new inputs

RULER judges what makes a good output from your task description alone, by comparing several candidate responses to the same input, so no expected outputs are required!

You will learn how to use RULER to train without labeled data, define custom [rollouts](https://art.openpipe.ai/resources/glossary#rollout), and run a [training loop](https://art.openpipe.ai/fundamentals/training-loop) that automatically improves your model.

In [2]:
%%capture
#@title Installation
# Note: the %%capture cell magic must be the first line of the cell to take effect.

!uv pip install openpipe-art==0.3.11.post2 langchain-core tenacity --prerelease allow --no-cache-dir

<a name="Configuration"></a>

### 🎯 Configuration - Edit These Settings

Add an OpenRouter key and customize your training by modifying the values below:

In [3]:
# Required - Used for generating training inputs and RULER evaluation
OPENROUTER_API_KEY = ""

# Optional - Enables metric logging
WANDB_API_KEY = ""

# Describe your custom task (be specific!)
TASK_DESCRIPTION = """
Convert informal bug reports into structured JIRA-style tickets with these exact sections:
- SUMMARY: (one line title)
- PRIORITY: (Critical/High/Medium/Low based on impact)
- STEPS TO REPRODUCE: (numbered list)
- EXPECTED RESULT: (what should happen)
- ACTUAL RESULT: (what actually happens)
- ENVIRONMENT: (extracted system/version info)
"""

# Choose the base model to train
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"  # Options: "Qwen/Qwen2.5-1.5B-Instruct", "Qwen/Qwen2.5-3B-Instruct", etc.
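
Because the task description asks for exact section names, a quick structural check can be handy when eyeballing model outputs later. The sketch below is not part of ART or RULER (RULER does the actual scoring); `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names introduced here, and the check assumes each section header appears as `NAME:` at the start of a line, optionally preceded by a list dash or bold markers.

```python
import re

# Section names copied from TASK_DESCRIPTION above.
REQUIRED_SECTIONS = [
    "SUMMARY",
    "PRIORITY",
    "STEPS TO REPRODUCE",
    "EXPECTED RESULT",
    "ACTUAL RESULT",
    "ENVIRONMENT",
]


def missing_sections(ticket: str) -> list[str]:
    """Return the required section names that do not appear as a line-leading header."""
    missing = []
    for name in REQUIRED_SECTIONS:
        # Allow optional list dashes or bold markers around the header, e.g. "- **SUMMARY:**".
        pattern = rf"^[\s\-\*]*{re.escape(name)}\s*\**\s*:"
        if not re.search(pattern, ticket, flags=re.MULTILINE | re.IGNORECASE):
            missing.append(name)
    return missing


example = """SUMMARY: App crashes on login
PRIORITY: High
STEPS TO REPRODUCE:
1. Open the app
2. Tap "Log in"
EXPECTED RESULT: The home screen loads
ACTUAL RESULT: The app closes immediately
"""
print(missing_sections(example))  # ['ENVIRONMENT']
```

A check like this only confirms that the headers are present; it says nothing about whether the content under each header is any good, which is the part RULER is meant to judge.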

In [4]:
#@title Advanced Settings

# Model configuration
MODEL_NAME = "custom-task-model-001"  # Name for your trained model
PROJECT_NAME = "custom-task-training"  # Project name for tracking

# Training configuration
TRAINING_CONFIG = {
    "num_training_inputs": 25,  # Number of training inputs to generate
    "groups_per_step": 2,  # Inputs to process per training step
    "num_epochs": 1,  # Number of times through all data
    "rollouts_per_group": 4,  # Different responses per input (for RULER comparison)
    "learning_rate": 1e-5,  # Learning rate
    "max_training_steps": None,  # Maximum training steps (set to None for no limit)
}

# Evaluation configuration
RULER_MODEL = "openrouter/moonshotai/kimi-k2"  # Model for RULER evaluation
SYSTEM_PROMPT_GENERATION_MODEL="openrouter/moonshotai/kimi-k2"
INPUT_GENERATION_MODEL="openrouter/moonshotai/kimi-k2"
NUM_TEST_INPUTS = 5  # Number of test inputs to generate

# GPU configuration (for T4 — keep these as-is unless you have a reason to change them)
MAX_SEQ_LENGTH = 4096  # Maximum sequence length
GPU_MEMORY_UTILIZATION = 0.8  # GPU memory usage (0.0-1.0)
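
To get a feel for how the training settings interact, the arithmetic can be sketched as below. `training_budget` is a hypothetical helper, not an ART function. It assumes that a final, smaller batch still counts as a training step (hence the ceiling), that RULER is called once per group, and it ignores `max_training_steps`.

```python
import math


def training_budget(num_training_inputs, groups_per_step, num_epochs, rollouts_per_group):
    """Estimate step, rollout, and judge-call counts for a training configuration."""
    # Assumption: a trailing partial batch is still processed as its own step.
    steps_per_epoch = math.ceil(num_training_inputs / groups_per_step)
    return {
        "steps": steps_per_epoch * num_epochs,
        "rollouts_per_full_step": groups_per_step * rollouts_per_group,
        "total_rollouts": num_training_inputs * num_epochs * rollouts_per_group,
        "ruler_calls": num_training_inputs * num_epochs,  # one judged group per input per epoch
    }


# Default values from TRAINING_CONFIG above.
print(training_budget(25, 2, 1, 4))
# {'steps': 13, 'rollouts_per_full_step': 8, 'total_rollouts': 100, 'ruler_calls': 25}
```

Raising `rollouts_per_group` multiplies inference cost but leaves the number of judge calls unchanged, while raising `num_training_inputs` increases both.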

In [5]:
import os

# Required
if OPENROUTER_API_KEY:
    os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY
else:
    raise ValueError(
        "OPENROUTER_API_KEY is required for data generation and RULER evaluation."
    )

# Optional
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY
else:
    print("WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.")
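
If you would rather not paste keys into the notebook itself, one option is to fall back to an environment variable that was set elsewhere. This is a minimal sketch using only the standard library; `resolve_key` is a hypothetical helper and `EXAMPLE_API_KEY` is a placeholder variable name used purely for the demonstration.

```python
import os


def resolve_key(name: str, inline_value: str = "") -> str:
    """Prefer a value typed into the notebook; otherwise fall back to the environment."""
    return inline_value or os.environ.get(name, "")


os.environ["EXAMPLE_API_KEY"] = "from-env"  # placeholder variable for the demo
print(resolve_key("EXAMPLE_API_KEY"))            # from-env
print(resolve_key("EXAMPLE_API_KEY", "inline"))  # inline
```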


#@title Run this cell to train your model!
import json
import asyncio
from typing import List, Dict, Tuple
from pydantic import BaseModel, Field
from litellm import acompletion
from tqdm import tqdm

class TrainingInput(BaseModel):
    input: str = Field(description="The input text for the task")

class TrainingDataset(BaseModel):
    inputs: List[TrainingInput] = Field(description="List of training inputs")

async def generate_training_inputs(task_description: str, num_examples: int = 50) -> List[str]:
    """Generate diverse training inputs for the given task"""

    system_prompt = f"""You are a helpful assistant that generates diverse, high-quality training inputs.

Task: {task_description}

Generate {num_examples} diverse INPUT examples that someone might provide for this task.
Make sure the inputs:
1. Cover a wide range of cases and edge cases
2. Are realistic and practical
3. Vary in length and complexity
4. Represent real-world scenarios

Only generate the INPUTS, not the outputs. RULER will evaluate the model's attempts automatically.
"""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Generate {num_examples} input examples for the task described above. Return them in the form of a list."}
    ]

    print(f"Generating {num_examples} training inputs...")
    response = await acompletion(
        model=INPUT_GENERATION_MODEL,
        messages=messages,
        response_format=TrainingDataset,
        temperature=0.4,
    )

    dataset = TrainingDataset.model_validate_json(response.choices[0].message.content)

    return [ex.input for ex in dataset.inputs]

# Generate training inputs

training_inputs = await generate_training_inputs(TASK_DESCRIPTION, num_examples=TRAINING_CONFIG["num_training_inputs"])
print(f"\nGenerated {len(training_inputs)} training inputs!")
print("\nFirst 5 examples:")
for i, input_text in enumerate(training_inputs[:5]):
    print(f"\nExample {i+1}: {input_text}")
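
LLM-generated lists sometimes contain empty strings or near-identical entries, which waste training steps. The notebook does not filter them; the sketch below shows one way it could be done. `clean_inputs` is a hypothetical helper, and treating entries as duplicates when they differ only in case or whitespace is an assumption about what counts as "the same" input.

```python
def clean_inputs(raw_inputs: list[str]) -> list[str]:
    """Strip surrounding whitespace, drop empties, and remove duplicates while keeping order."""
    seen = set()
    cleaned = []
    for text in raw_inputs:
        stripped = text.strip()
        # Compare on a normalized key so case and spacing differences do not hide duplicates.
        key = " ".join(stripped.split()).casefold()
        if stripped and key not in seen:
            seen.add(key)
            cleaned.append(stripped)
    return cleaned


sample = ["App crashes on login ", "", "app  crashes on login", "Search returns no results"]
print(clean_inputs(sample))
# ['App crashes on login', 'Search returns no results']
```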

#@title Model Creation Code
import art
from art.local import LocalBackend
import random

random.seed(42)

# Declare the model
model = art.TrainableModel(
    name=MODEL_NAME,
    project=PROJECT_NAME,
    base_model=BASE_MODEL,
)

# To run on a T4, we need to override some config defaults.
model._internal_config = art.dev.InternalModelConfig(
    init_args=art.dev.InitArgs(
        max_seq_length=MAX_SEQ_LENGTH,
    ),
    engine_args=art.dev.EngineArgs(
        enforce_eager=True,
        gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
    ),
)

# Initialize the server
backend = LocalBackend(
    in_process=True,
    path="./.art",
)

# Register the model with the local Backend
await model.register(backend)

print("Model created!")
print("Base model:", BASE_MODEL)
print("Model name:", MODEL_NAME)
print("Project name:", PROJECT_NAME)

#@title Rollout Function Code

import art
import weave
from litellm import acompletion
from art.utils.litellm import convert_litellm_choice_to_openai

if os.getenv("WANDB_API_KEY", ""):
    weave.init(PROJECT_NAME, settings={"print_call_link": False})

# Generate a system prompt for the task
async def generate_system_prompt(task_description: str) -> str:
    """Generate an appropriate system prompt for the task"""

    messages = [
        {
            "role": "system",
            "content": "Generate a clear, concise system prompt for a model that will perform the following task. The prompt should be direct and instructional."
        },
        {
            "role": "user",
            "content": f"Task: {task_description}\n\nGenerate a system prompt for this task."
        }
    ]

    response = await acompletion(
        model=SYSTEM_PROMPT_GENERATION_MODEL,
        messages=messages,
        temperature=0.3,
    )

    return response.choices[0].message.content.strip()

SYSTEM_PROMPT = await generate_system_prompt(TASK_DESCRIPTION)
print(f"Generated system prompt:\n\n{SYSTEM_PROMPT}")

class TaskInput(BaseModel):
    step: int
    input_text: str

@weave.op
async def rollout(model: art.Model, task_input: TaskInput) -> art.Trajectory:
    """Execute a single rollout for the custom task"""

    traj = art.Trajectory(
        reward=0.0,
        messages_and_choices=[],
        metadata={
            "step": task_input.step,
            "input": task_input.input_text,
        },
    )

    # Build the conversation
    traj.messages_and_choices = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task_input.input_text},
    ]

    # Get model response
    if model.trainable:
        litellm_model_name = f"hosted_vllm/{model.name}"
    else:
        litellm_model_name = model.name

    response = await acompletion(
        model=litellm_model_name,
        base_url=model.inference_base_url,
        api_key=model.inference_api_key,
        temperature=0.7,
        messages=traj.messages(),
        caching=False,
    )

    # Add the model's response to the trajectory
    traj.messages_and_choices.append(
        convert_litellm_choice_to_openai(response.choices[0])
    )

    return traj

print("\nRollout function defined!")
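
During training, several rollouts for the same input are run concurrently and a limited number of failures is tolerated. In this notebook that is handled by `art.gather_trajectory_groups`; the sketch below is not its implementation, only a standard-library illustration of the pattern. `fake_rollout` and `gather_group` are hypothetical stand-ins, and the failure on the third attempt is simulated.

```python
import asyncio


async def fake_rollout(input_text: str, attempt: int) -> dict:
    """Stand-in for a model call: returns a dict shaped loosely like a trajectory."""
    await asyncio.sleep(0.01)  # simulate network latency
    if attempt == 2:  # simulate one failed inference request
        raise RuntimeError("simulated inference failure")
    return {"input": input_text, "attempt": attempt, "reward": 0.0}


async def gather_group(input_text: str, rollouts_per_group: int) -> list[dict]:
    """Run all rollouts for one input concurrently and keep the ones that succeeded."""
    results = await asyncio.gather(
        *(fake_rollout(input_text, i) for i in range(rollouts_per_group)),
        return_exceptions=True,  # collect exceptions instead of cancelling the whole group
    )
    return [r for r in results if not isinstance(r, BaseException)]


# In a script; inside a notebook cell, use `await gather_group(...)` instead of asyncio.run.
group = asyncio.run(gather_group("login button does nothing", rollouts_per_group=4))
print(len(group))  # 3
```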


import art
from art.rewards import ruler_score_group

# Test RULER with example outputs for a text formalization task
test_input = "hey can u send me the report asap? thx"

base_messages = [
    {"role": "system", "content": "Convert informal text to formal business language."},
    {"role": "user", "content": test_input},
]

good_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "Could you please send me the report at your earliest convenience? Thank you."},
    ],
    reward=0,
)

mediocre_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "Can you send me the report soon? Thanks."},
    ],
    reward=0,
)

bad_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "hey send report quick thx"},
    ],
    reward=0,
)

sample_group = art.TrajectoryGroup(
    trajectories=[good_trajectory, mediocre_trajectory, bad_trajectory]
)

# RULER will score these based on how well they accomplish the task
judged_group = await ruler_score_group(sample_group, RULER_MODEL, debug=True)
assert judged_group is not None

# Display rankings
sorted_trajectories = sorted(
    judged_group.trajectories, key=lambda t: t.reward, reverse=True
)
for rank, traj in enumerate(sorted_trajectories, 1):
    messages = traj.messages()
    print(f"\nRank {rank}: Score {traj.reward:.3f}")
    print(f"  Response: {messages[-1]['content']}")


#@title Training Loop Code
# Training configuration
from art.utils import iterate_dataset

# Convert training inputs to TaskInput objects
training_task_inputs = [
    TaskInput(step=0, input_text=inp)
    for inp in training_inputs
]

# Create training iterator
training_iterator = iterate_dataset(
    training_task_inputs,
    groups_per_step=TRAINING_CONFIG["groups_per_step"],
    num_epochs=TRAINING_CONFIG["num_epochs"],
    initial_step=await model.get_step(),
)

print(f"Starting training with {len(training_task_inputs)} inputs...")
print(f"Training for {TRAINING_CONFIG['num_epochs']} epoch(s)")
print(f"Generating {TRAINING_CONFIG['rollouts_per_group']} responses per input for RULER to compare")
print("\nWhy multiple responses? RULER needs to compare different attempts to learn what's good!")

for batch, epoch, global_step, epoch_step in training_iterator:
    print(f"\nTraining step {global_step}, epoch {epoch}, epoch step {epoch_step}")
    print(f"Batch contains {len(batch)} inputs")

    # Create trajectory groups for this batch
    groups = []
    for task_input in batch:
        # Update step number
        task_input.step = global_step

        # Generate multiple responses for each input (RULER will compare these)
        groups.append(
            art.TrajectoryGroup(
                (
                    rollout(model, task_input)
                    for _ in range(TRAINING_CONFIG["rollouts_per_group"])
                )
            )
        )

    # Gather all trajectory groups
    finished_groups = await art.gather_trajectory_groups(
        groups,
        pbar_desc="Generating responses",
        max_exceptions=TRAINING_CONFIG["rollouts_per_group"] * len(batch),
    )

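    # Use RULER to score each group
    judged_groups = []
    for group in finished_groups:
        judged_group = await ruler_score_group(
            group,
            RULER_MODEL,
            debug=False
        )
        # ruler_score_group can return None; keep only groups that were actually scored
        if judged_group is not None:
            judged_groups.append(judged_group)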

    # Train on the scored trajectories
    await model.delete_checkpoints()
    await model.train(
        judged_groups,
        config=art.TrainConfig(learning_rate=TRAINING_CONFIG["learning_rate"]),
        _config={"logprob_calculation_chunk_size": 8},
    )

    print(f"Completed training step {global_step}")

    # Stop after configured steps (if limit is set).
    # global_step is assumed to be 0-based, so step N-1 is the Nth completed step.
    if TRAINING_CONFIG["max_training_steps"] and global_step + 1 >= TRAINING_CONFIG["max_training_steps"]:
        print(f"Reached maximum training steps ({TRAINING_CONFIG['max_training_steps']})")
        break

print("\n✅ Training completed!")

WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.
TRAINING_CONFIG["num_training_inputs"] 25
Generating 25 training inputs...
messages[0] {'role': 'system', 'content': "You are a helpful assistant that generates diverse, high-quality training inputs.\n\nTask: \nConvert informal bug reports into structured JIRA-style tickets with these exact sections:\n- SUMMARY: (one line title)\n- PRIORITY: (Critical/High/Medium/Low based on impact)\n- STEPS TO REPRODUCE: (numbered list)\n- EXPECTED RESULT: (what should happen)\n- ACTUAL RESULT: (what actually happens)\n- ENVIRONMENT: (extracted system/version info)\n\n\nGenerate 25 diverse INPUT examples that someone might provide for this task.\nMake sure the inputs:\n1. Cover a wide range of cases and edge cases\n2. Are realistic and practical\n3. Vary in length and complexity\n4. Represent real-world scenarios\n\nOnly generate the INPUTS, not the outputs. RULER will evaluate the model's attempts automatically.\n"}

Generated


Please restructure your imports with 'import unsloth' at the top of your file.
  import unsloth  # type: ignore


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.1: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit with actual GPU utilization = 78.25%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 4096. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 5.67 GB. Also swap space = 2 GB.
INFO 07-29 00:15:30 [config.py:717] This model s


KeyboardInterrupt: 

In [None]:
#@title Test Your Model!

# Generate test inputs
print("Generating test inputs...")
test_inputs = await generate_training_inputs(TASK_DESCRIPTION, num_examples=NUM_TEST_INPUTS)

print(f"\n🧪 Testing the trained model on {len(test_inputs)} new inputs:\n")
print("=" * 80)

for i, test_input in enumerate(test_inputs):
    print(f"\nTest {i+1}:")
    print(f"Input: {test_input}")

    # Run the model
    test_task_input = TaskInput(
        step=999,
        input_text=test_input
    )
    result_trajectory = await rollout(model, test_task_input)

    # Extract the model's response
    messages = result_trajectory.messages()
    model_response = messages[-1]['content'] if messages else "No response"

    print(f"Model output: {model_response}")
    print("-" * 80)

print("\n🎉 Testing completed!")
print(f"\nYour model '{MODEL_NAME}' has been trained to: {TASK_DESCRIPTION}")
print("\nTo use this model in production:")
print("1. The model checkpoint is saved in ./.art/")
print("2. You can load it using the vLLM library")
print("3. Or continue training with more examples by adjusting the configuration at the top")

### Next Steps

Congratulations! You've successfully trained a custom model for your task using only:
- A task description
- Example inputs (no outputs needed!)
- RULER's automatic evaluation

Here are some ways to improve results:

1. **More diverse inputs**: Generate more varied input examples
2. **Longer training**: Increase the number of training steps
3. **More comparisons**: Increase `rollouts_per_group` for better RULER comparisons
4. **Task refinement**: Make your task description more specific and detailed
5. **Hyperparameter tuning**: Adjust learning rate, batch size, etc.
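
The point about `rollouts_per_group` rests on rewards being compared within a group. One common way to turn per-group scores into a learning signal is group-relative normalization, as used in GRPO-style methods. Whether ART's trainer does exactly this is an assumption here, and `group_relative_advantages` is a hypothetical helper, not an ART function.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Center rewards on the group mean and scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # Identical rewards carry no preference signal within the group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


print(group_relative_advantages([0.5, 0.5, 0.5, 0.5]))  # [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages([1.0, 0.0]))            # [1.0, -1.0]
```

Under this scheme, a group whose responses all receive the same score contributes nothing to the update, so sampling more responses per input raises the chance that the group contains a useful spread of quality.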

Remember: RULER works out what "good" means from your task description alone, with no labeled data required!

For more advanced use cases, check out the [ART documentation](https://art.openpipe.ai).