<a href="https://colab.research.google.com/github/arcticfly/2048-tutorial/blob/main/examples/auto-art.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To train a model for your custom task, click _Runtime_ and press _Run all_. Make sure you've enabled a free Tesla T4 GPU!

<div class="align-center">
<a href="https://github.com/openpipe/art"><img src="https://github.com/openpipe/art/raw/main/assets/ART_pill.png" height="50"></a>
<a href="https://discord.gg/zbBHRUpwf4"><img src="https://github.com/openpipe/art/raw/main/assets/Discord_pill.png" height="50"></a>
<a href="https://art.openpipe.ai"><img src="https://github.com/openpipe/art/raw/main/assets/Documentation_pill.png" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/openpipe/art).

</div>

<a href="https://art.openpipe.ai/"><img src="https://github.com/openpipe/art/raw/main/assets/Header_separator.png" height="5"></a>

**Custom Task Training with ART**

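This notebook shows how to train a Qwen 2.5 1.5B model (the default `BASE_MODEL`; larger variants such as 3B or 7B can be selected in the configuration) to perform any single-turn task you describe, with no labeled data needed! Simply describe what you want the model to learn, and this notebook will: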

1. Generate diverse input examples for your task
2. Create an appropriate system prompt
3. Train the model using RULER's automatic evaluation
4. Test the trained model on new inputs

RULER judges output quality using only your task description, so no expected outputs are required!

You will learn how to use RULER for unsupervised learning, define custom [rollouts](#Rollout), and run a [training loop](#Loop) that automatically improves your model.

### Installation

In [None]:
%%capture
!uv pip install openpipe-art==0.3.11.post2 langchain-core tenacity --prerelease allow --no-cache-dir

<a name="Environment-Variables"></a>
### Environment Variables

**OpenAI (required for RULER and input generation)**

OpenAI provides access to GPT models which we'll use for:
1. Generating diverse training inputs for your task
2. RULER evaluation during training (comparing model outputs)

**Weights & Biases (optional)**

The notebook can log metrics to Weights & Biases. If you want to track your training progress, provide your API key below.

In [None]:
import os

# Required
OPENAI_API_KEY = ""
if OPENAI_API_KEY:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
else:
    raise ValueError(
        "OPENAI_API_KEY is required for data generation and RULER evaluation."
    )

# Optional
WANDB_API_KEY = ""
if WANDB_API_KEY:
    os.environ["WANDB_API_KEY"] = WANDB_API_KEY
else:
    print("WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.")


WANDB_API_KEY is not set. We'll skip logging metrics to Weights & Biases.
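
If you would rather not paste keys into the notebook, one option is to fall back to values that are already present in the environment. The sketch below is only an illustration: `resolve_api_key` is a hypothetical helper (not part of ART), and the stand-in dictionary replaces `os.environ` so the example runs without real credentials.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    name: str,
    explicit: str = "",
    env: Optional[Mapping[str, str]] = None,
    required: bool = False,
) -> Optional[str]:
    """Prefer an explicitly provided key, else fall back to the environment."""
    env = os.environ if env is None else env
    value = explicit or env.get(name, "")
    if not value and required:
        raise ValueError(f"{name} is required but was not provided.")
    # Return None for optional keys that are missing
    return value or None


# Stand-in environment for illustration only (no real keys involved)
fake_env = {"OPENAI_API_KEY": "sk-example"}
openai_key = resolve_api_key("OPENAI_API_KEY", env=fake_env, required=True)
wandb_key = resolve_api_key("WANDB_API_KEY", env=fake_env)
print(openai_key, wandb_key)
```

In the notebook you would drop the `env` argument so the helper reads from `os.environ`, then export the resolved value the same way the cell above does.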


<a name="Configuration"></a>

### 🎯 Configuration - Edit These Settings

Customize your training by modifying the values below:

In [None]:
# ============= MAIN CONFIGURATION =============

# Describe your custom task (be specific!)
TASK_DESCRIPTION = """
Convert informal bug reports into structured JIRA-style tickets with these exact sections:
- SUMMARY: (one line title)
- PRIORITY: (Critical/High/Medium/Low based on impact)
- STEPS TO REPRODUCE: (numbered list)
- EXPECTED RESULT: (what should happen)
- ACTUAL RESULT: (what actually happens)
- ENVIRONMENT: (extracted system/version info)
"""

# More example task descriptions:
# - "Summarize product reviews focusing on pros and cons in bullet points"
# - "Extract key facts, dates, and entities from news articles"
# - "Rewrite modern text in Shakespearean style with appropriate vocabulary"
# - "Convert technical documentation into simple explanations for beginners"
# - "Transform long paragraphs into concise bullet point summaries"
# - "Identify and extract action items from meeting transcripts"

# Model configuration
BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # Options: "Qwen/Qwen2.5-7B-Instruct", "Qwen/Qwen2.5-3B-Instruct", etc.
MODEL_NAME = "custom-task-model-001"  # Name for your trained model
PROJECT_NAME = "custom-task-training"  # Project name for tracking

# Training configuration
TRAINING_CONFIG = {
    "num_training_inputs": 25,  # Number of training inputs to generate
    "groups_per_step": 2,  # Inputs to process per training step
    "num_epochs": 1,  # Number of times through all data
    "rollouts_per_group": 4,  # Different responses per input (for RULER comparison)
    "learning_rate": 1e-5,  # Learning rate
    "max_training_steps": 5,  # Maximum training steps (set to None for no limit)
}

# Evaluation configuration
RULER_MODEL = "openai/gpt-4.1-mini"  # Model for RULER evaluation
NUM_TEST_INPUTS = 5  # Number of test inputs to generate

# GPU configuration (for T4)
MAX_SEQ_LENGTH = 4096  # Maximum sequence length
GPU_MEMORY_UTILIZATION = 0.8  # GPU memory usage (0.0-1.0)

# ============= END CONFIGURATION =============

print(f"Task: {TASK_DESCRIPTION}")
print(f"Model: {BASE_MODEL}")
print(f"Training inputs: {TRAINING_CONFIG['num_training_inputs']}")

Task: 
Convert informal bug reports into structured JIRA-style tickets with these exact sections:
- SUMMARY: (one line title)
- PRIORITY: (Critical/High/Medium/Low based on impact)
- STEPS TO REPRODUCE: (numbered list)
- EXPECTED RESULT: (what should happen)
- ACTUAL RESULT: (what actually happens)
- ENVIRONMENT: (extracted system/version info)

Model: Qwen/Qwen2.5-1.5B-Instruct
Training inputs: 25


<a name="Task-Definition"></a>

### Task Overview

The model will learn to perform the task described above purely from the description - no labeled examples needed!

💡 **Tip**: To change the task or any settings, go back to the [Configuration](#Configuration) section above.

### Generate Training Data

Now we'll use GPT-4.1 to generate training inputs for your task.

**Important**: We only need inputs, NOT expected outputs! RULER removes the need for labeled outputs by:
1. Having the model generate multiple attempts for each input
2. Comparing these attempts based on your task description
3. Scoring the attempts so that training teaches the model to prefer the better ones

This is the power of RULER - unsupervised learning from just a task description!

In [None]:
import json
import asyncio
from typing import List, Dict, Tuple
from pydantic import BaseModel, Field
from litellm import acompletion
from tqdm import tqdm

class TrainingInput(BaseModel):
    input: str = Field(description="The input text for the task")

class TrainingDataset(BaseModel):
    inputs: List[TrainingInput] = Field(description="List of training inputs")

async def generate_training_inputs(task_description: str, num_examples: int = 50) -> List[str]:
    """Generate diverse training inputs for the given task"""

    system_prompt = f"""You are a helpful assistant that generates diverse, high-quality training inputs.

Task: {task_description}

Generate {num_examples} diverse INPUT examples that someone might provide for this task.
Make sure the inputs:
1. Cover a wide range of cases and edge cases
2. Are realistic and practical
3. Vary in length and complexity
4. Represent real-world scenarios

Only generate the INPUTS, not the outputs. RULER will evaluate the model's attempts automatically.
"""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Generate {num_examples} input examples for the task described above."}
    ]

    print(f"Generating {num_examples} training inputs...")
    response = await acompletion(
        model="openai/gpt-4.1",
        messages=messages,
        response_format=TrainingDataset,
        temperature=0.8,
    )

    dataset = TrainingDataset.model_validate_json(response.choices[0].message.content)
    return [ex.input for ex in dataset.inputs]

# Generate training inputs
training_inputs = await generate_training_inputs(TASK_DESCRIPTION, num_examples=TRAINING_CONFIG["num_training_inputs"])
print(f"\nGenerated {len(training_inputs)} training inputs!")
print("\nFirst 5 examples:")
for i, input_text in enumerate(training_inputs[:5]):
    print(f"\nExample {i+1}: {input_text}")

Generating 25 training inputs...

Generated 25 training inputs!

First 5 examples:

Example 1: When I try to upload a profile picture, it keeps spinning forever and never completes. Using Chrome on Windows 10.

Example 2: App crashes every time I click on the "Create New Project" button. Noticed on v2.3.1, using MacBook Pro, macOS 12.1.

Example 3: Logged out automatically after a few minutes even though 'Remember Me' is checked. Using Firefox 113 on Ubuntu 20.04.

Example 4: Notifications aren't showing up for new messages. I have all notification settings enabled. Using iOS app version 4.7.2.

Example 5: I get a 404 error page when accessing the settings page from the dashboard. Only happens for admin users.
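
Generated inputs can occasionally repeat or come back empty, which wastes rollouts on a small dataset. A minimal sketch of a cleanup pass follows; `dedupe_inputs` is a hypothetical helper, and treating inputs as duplicates when they match after whitespace and case normalization is an assumption you may want to loosen or tighten.

```python
import re
from typing import List


def dedupe_inputs(inputs: List[str]) -> List[str]:
    """Drop empty inputs and duplicates (ignoring case and extra whitespace)."""
    seen = set()
    unique = []
    for text in inputs:
        # Normalize only for comparison; keep the original wording for training
        key = re.sub(r"\s+", " ", text).strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(text.strip())
    return unique


sample_inputs = [
    'App crashes every time I click on the "Create New Project" button.',
    'app crashes every time I click on the  "Create New Project" button. ',
    "",
    "Notifications aren't showing up for new messages.",
]
cleaned = dedupe_inputs(sample_inputs)
print(len(cleaned))
```

Applied to the notebook, this would be `training_inputs = dedupe_inputs(training_inputs)` before the inputs are wrapped for training.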


### Creating a Model

Now we'll create a model that will learn your task.

In [None]:
import art
from art.local import LocalBackend
import random

random.seed(42)

# Declare the model
model = art.TrainableModel(
    name=MODEL_NAME,
    project=PROJECT_NAME,
    base_model=BASE_MODEL,
)

# To run on a T4, we need to override some config defaults.
model._internal_config = art.dev.InternalModelConfig(
    init_args=art.dev.InitArgs(
        max_seq_length=MAX_SEQ_LENGTH,
    ),
    engine_args=art.dev.EngineArgs(
        enforce_eager=True,
        gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
    ),
)

# Initialize the server
backend = LocalBackend(
    in_process=True,
    path="./.art",
)

# Register the model with the local Backend
await model.register(backend)

INFO 07-15 20:59:55 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-15 20:59:56 [__init__.py:239] Automatically detected platform cuda.



Please restructure your imports with 'import unsloth' at the top of your file.
  import unsloth  # type: ignore


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.5.1: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit with actual GPU utilization = 78.25%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 4096. Num Sequences = 224.
Unsloth: vLLM's KV Cache can use up to 10.32 GB. Also swap space = 2 GB.
INFO 07-15 21:00:34 [config.py:717] This mode

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/270 [00:00<?, ?B/s]

INFO 07-15 21:00:41 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-15 21:00:41 [cuda.py:289] Using XFormers backend.
INFO 07-15 21:00:42 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 07-15 21:00:42 [model_runner.py:1108] Starting to load model unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit...
INFO 07-15 21:00:42 [loader.py:1187] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 07-15 21:00:43 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/1.53G [00:00<?, ?B/s]

INFO 07-15 21:01:46 [weight_utils.py:281] Time spent downloading weights for unsloth/qwen2.5-1.5b-instruct-unsloth-bnb-4bit: 62.462090 seconds
INFO 07-15 21:01:46 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 07-15 21:01:48 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 07-15 21:01:48 [model_runner.py:1140] Model loading took 1.4698 GiB and 65.994480 seconds
INFO 07-15 21:01:58 [worker.py:287] Memory profiling takes 9.42 seconds
INFO 07-15 21:01:58 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.80) = 11.79GiB
INFO 07-15 21:01:58 [worker.py:287] model weights take 1.47GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 9.07GiB.
INFO 07-15 21:01:59 [executor_base.py:112] # cuda blocks: 21229, # CPU blocks: 4681
INFO 07-15 21:01:59 [executor_base.py:117] Maximum concurrency for 4096 tokens per request: 82.93x
INFO 07-15 21:02:02 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 13.34 seconds
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm', 'q_norm', 'k_norm']
Unsloth

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Unsloth 2025.5.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


<a name="Rollout"></a>

### Defining a Rollout

A rollout is a single episode where the model attempts to complete your task. For single-turn tasks, this is straightforward:
1. Present the input to the model
2. Get the model's response
3. The response is automatically evaluated using RULER

The rollout function below uses a system prompt that is generated automatically from your task description.

In [None]:
#@title Rollout Function

import art
import weave
from litellm import acompletion
from art.utils.litellm import convert_litellm_choice_to_openai

if os.getenv("WANDB_API_KEY", ""):
    weave.init(PROJECT_NAME, settings={"print_call_link": False})

# Generate a system prompt for the task
async def generate_system_prompt(task_description: str) -> str:
    """Generate an appropriate system prompt for the task"""

    messages = [
        {
            "role": "system",
            "content": "Generate a clear, concise system prompt for a model that will perform the following task. The prompt should be direct and instructional."
        },
        {
            "role": "user",
            "content": f"Task: {task_description}\n\nGenerate a system prompt for this task."
        }
    ]

    response = await acompletion(
        model="openai/gpt-4.1",
        messages=messages,
        temperature=0.3,
    )

    return response.choices[0].message.content.strip()

SYSTEM_PROMPT = await generate_system_prompt(TASK_DESCRIPTION)
print(f"Generated system prompt:\n{SYSTEM_PROMPT}")

class TaskInput(BaseModel):
    step: int
    input_text: str

@weave.op
async def rollout(model: art.Model, task_input: TaskInput) -> art.Trajectory:
    """Execute a single rollout for the custom task"""

    traj = art.Trajectory(
        reward=0.0,
        messages_and_choices=[],
        metadata={
            "step": task_input.step,
            "input": task_input.input_text,
        },
    )

    # Build the conversation
    traj.messages_and_choices = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task_input.input_text},
    ]

    # Get model response
    if model.trainable:
        litellm_model_name = f"hosted_vllm/{model.name}"
    else:
        litellm_model_name = model.name

    response = await acompletion(
        model=litellm_model_name,
        base_url=model.inference_base_url,
        api_key=model.inference_api_key,
        temperature=0.7,
        messages=traj.messages(),
        caching=False,
    )

    # Add the model's response to the trajectory
    traj.messages_and_choices.append(
        convert_litellm_choice_to_openai(response.choices[0])
    )

    return traj

print("Rollout function defined!")




Generated system prompt:
Convert informal bug reports into structured JIRA-style tickets with the following sections:  
- SUMMARY: Provide a concise, one-line title.  
- PRIORITY: Assign Critical, High, Medium, or Low based on the described impact.  
- STEPS TO REPRODUCE: List the steps as a numbered list.  
- EXPECTED RESULT: Describe what should happen.  
- ACTUAL RESULT: Describe what actually happens.  
- ENVIRONMENT: Extract and summarize any relevant system or version information.  
Use only the information provided in the report. If any section is missing information, write "Not specified."
Rollout function defined!
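
Each rollout makes a network call to an inference endpoint, and such calls can fail transiently. The sketch below shows one way to retry an async call with exponential backoff. `with_retries` and `flaky` are hypothetical names used for illustration; the `tenacity` package installed earlier offers a more complete version of this pattern.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    make_call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await make_call(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2**attempt)
    raise RuntimeError("attempts must be at least 1")


calls = {"count": 0}


async def flaky() -> str:
    """Stand-in for an inference call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# In a notebook cell, use `await with_retries(...)` instead of asyncio.run
result = asyncio.run(with_retries(flaky, attempts=3, base_delay=0.01))
print(result, calls["count"])
```

Inside `rollout`, the same wrapper could surround the `acompletion` call by passing a zero-argument lambda that builds the coroutine.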


### How RULER works

**RULER** (Relative Universal LLM-Elicited Rewards) evaluates model outputs WITHOUT needing expected answers!

How it works:
1. Give the same input to the model multiple times
2. Get different responses (due to temperature/randomness)
3. Have an LLM judge compare these responses based on your task description
4. Assign relative scores (0-1) based on quality
5. Train the model to prefer higher-scored responses

This is powerful because:
- No labeled data needed - just inputs!
- The judge understands your task from the description alone
- It naturally handles subjective or creative tasks
- The model learns what "good" means for your specific task

Let's see RULER in action:

In [None]:
import art
from art.rewards import ruler_score_group

# Test RULER with example outputs for a text formalization task
test_input = "hey can u send me the report asap? thx"

base_messages = [
    {"role": "system", "content": "Convert informal text to formal business language."},
    {"role": "user", "content": test_input},
]

good_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "Could you please send me the report at your earliest convenience? Thank you."},
    ],
    reward=0,
)

mediocre_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "Can you send me the report soon? Thanks."},
    ],
    reward=0,
)

bad_trajectory = art.Trajectory(
    messages_and_choices=[
        *base_messages,
        {"role": "assistant", "content": "hey send report quick thx"},
    ],
    reward=0,
)

sample_group = art.TrajectoryGroup(
    trajectories=[good_trajectory, mediocre_trajectory, bad_trajectory]
)

# RULER will score these based on how well they accomplish the task
judged_group = await ruler_score_group(sample_group, RULER_MODEL, debug=True)
assert judged_group is not None

# Display rankings
sorted_trajectories = sorted(
    judged_group.trajectories, key=lambda t: t.reward, reverse=True
)
for rank, traj in enumerate(sorted_trajectories, 1):
    messages = traj.messages()
    print(f"\nRank {rank}: Score {traj.reward:.3f}")
    print(f"  Response: {messages[-1]['content']}")


Rank 1: Score 1.000
  Response: Could you please send me the report at your earliest convenience? Thank you.

Rank 2: Score 0.600
  Response: Can you send me the report soon? Thanks.

Rank 3: Score 0.000
  Response: hey send report quick thx
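
To see why relative scores are enough to train on, it helps to look at how a group of rewards might become a learning signal. The sketch below assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation). This is an illustration under that assumption, not ART's internal implementation, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center rewards on the group mean and scale by the group std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Scores from the RULER ranking above
ranked = group_relative_advantages([1.0, 0.6, 0.0])
print([round(a, 2) for a in ranked])

# Identical scores carry no signal: every advantage is zero
tied = group_relative_advantages([0.5, 0.5, 0.5])
print(tied)
```

Under this assumption, the best response gets a positive advantage and the worst a negative one, while a group of identical scores contributes nothing. That is one reason each input needs several distinct responses.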


<a name="Loop"></a>

### Training Loop

Now we'll train the model on your task. The training process:

1. For each input, generate multiple different responses
2. RULER compares these responses and scores them based on your task description
3. Train the model to prefer higher-scored responses
4. Repeat for each batch until the configured number of epochs or the maximum number of training steps is reached

No labeled outputs needed - RULER figures out what's good based on your task description alone!

In [None]:
# Training configuration
from art.utils import iterate_dataset

# Convert training inputs to TaskInput objects
training_task_inputs = [
    TaskInput(step=0, input_text=inp)
    for inp in training_inputs
]

# Create training iterator
training_iterator = iterate_dataset(
    training_task_inputs,
    groups_per_step=TRAINING_CONFIG["groups_per_step"],
    num_epochs=TRAINING_CONFIG["num_epochs"],
    initial_step=await model.get_step(),
)

print(f"Starting training with {len(training_task_inputs)} inputs...")
print(f"Training for {TRAINING_CONFIG['num_epochs']} epoch(s)")
print(f"Generating {TRAINING_CONFIG['rollouts_per_group']} responses per input for RULER to compare")
print(f"\nWhy multiple responses? RULER needs to compare different attempts to learn what's good!")

for batch, epoch, global_step, epoch_step in training_iterator:
    print(f"\nTraining step {global_step}, epoch {epoch}, epoch step {epoch_step}")
    print(f"Batch contains {len(batch)} inputs")

    # Create trajectory groups for this batch
    groups = []
    for task_input in batch:
        # Update step number
        task_input.step = global_step

        # Generate multiple responses for each input (RULER will compare these)
        groups.append(
            art.TrajectoryGroup(
                (
                    rollout(model, task_input)
                    for _ in range(TRAINING_CONFIG["rollouts_per_group"])
                )
            )
        )

    # Gather all trajectory groups
    finished_groups = await art.gather_trajectory_groups(
        groups,
        pbar_desc="Generating responses",
        max_exceptions=TRAINING_CONFIG["rollouts_per_group"] * len(batch),
    )

    # Use RULER to score each group
    judged_groups = []
    for group in finished_groups:
        judged_group = await ruler_score_group(
            group,
            RULER_MODEL,
            debug=False
        )
        # Skip any group the judge could not score (the result may be None)
        if judged_group is not None:
            judged_groups.append(judged_group)

    # Train on the scored trajectories
    await model.delete_checkpoints()
    await model.train(
        judged_groups,
        config=art.TrainConfig(learning_rate=TRAINING_CONFIG["learning_rate"]),
        _config={"logprob_calculation_chunk_size": 8},
    )

    print(f"Completed training step {global_step}")

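    # Stop after configured steps (if limit is set). The check runs after training and
    # steps are 0-indexed, so a fresh run trains steps 0 through max_training_steps.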
    if TRAINING_CONFIG["max_training_steps"] and global_step >= TRAINING_CONFIG["max_training_steps"]:
        print(f"Reached maximum training steps ({TRAINING_CONFIG['max_training_steps']})")
        break

print("\n✅ Training completed!")

Starting training with 25 inputs...
Training for 1 epoch(s)
Generating 4 responses per input for RULER to compare

Why multiple responses? RULER needs to compare different attempts to learn what's good!


Iterating dataset:   0%|          | 0/13 [00:00<?, ?batch/s]


Training step 0, epoch 0, epoch step 0
Batch contains 2 inputs


Generating responses:   0%|          | 0/8 [00:00<?, ?it/s]

 (subsequent messages of this type will be suppressed)


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Packed 7 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 30,000,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 9,232,384/5,000,000,000 (0.18% trained)


Unsloth: Will smartly offload gradients to save VRAM!
Completed training step 0

Training step 1, epoch 0, epoch step 1
Batch contains 2 inputs


Generating responses:   0%|          | 0/8 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0000
Packed 8 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

Completed training step 1

Training step 2, epoch 0, epoch step 2
Batch contains 2 inputs


Generating responses:   0%|          | 0/8 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0001
Packed 8 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

Completed training step 2

Training step 3, epoch 0, epoch step 3
Batch contains 2 inputs


Generating responses:   0%|          | 0/8 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0002
Packed 8 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

Completed training step 3

Training step 4, epoch 0, epoch step 4
Batch contains 2 inputs


Generating responses:   0%|          | 0/8 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0003
Packed 8 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

Completed training step 4

Training step 5, epoch 0, epoch step 5
Batch contains 2 inputs


Generating responses:   0%|          | 0/8 [00:00<?, ?it/s]

Deleted checkpoint ./.art/custom-task-training/models/custom-task-model-001/0004
Packed 8 trajectories into 1 sequences of length 2048


train:   0%|          | 0/1 [00:00<?, ?it/s]

Completed training step 5
Reached maximum training steps (5)

✅ Training completed!
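
The numbers in the log above (13 batches, 8 responses per step, steps 0 through 5) follow directly from the configuration. The sketch below reproduces that arithmetic. `training_budget` is a hypothetical helper, and it assumes a fresh model starting at step 0 and a final partial batch at the end of each epoch.

```python
import math
from typing import Dict, Optional


def training_budget(
    num_inputs: int,
    groups_per_step: int,
    num_epochs: int,
    rollouts_per_group: int,
    max_training_steps: Optional[int] = None,
) -> Dict[str, int]:
    """Estimate batches, executed steps, and rollouts per step for a run."""
    # The last batch of an epoch may hold fewer than groups_per_step inputs
    batches_per_epoch = math.ceil(num_inputs / groups_per_step)
    total_batches = batches_per_epoch * num_epochs
    steps_run = total_batches
    if max_training_steps:
        # The loop checks the limit after training, so step max_training_steps still runs
        steps_run = min(total_batches, max_training_steps + 1)
    return {
        "total_batches": total_batches,
        "steps_run": steps_run,
        "rollouts_per_full_step": groups_per_step * rollouts_per_group,
    }


budget = training_budget(25, 2, 1, 4, max_training_steps=5)
print(budget)
```

With the default settings this gives 13 batches, 6 executed steps, and 8 rollouts per full step, so only 12 of the 25 generated inputs are used before the step limit stops training.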


### Testing Your Trained Model

Let's test your trained model on some new inputs to see how well it learned the task!
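
Beyond reading outputs by eye, a quick structural check can confirm that a response contains every section the default task description asks for. This is a minimal sketch: `missing_sections` is a hypothetical helper, the section list is copied from the default `TASK_DESCRIPTION`, and it checks only for the presence of headers, not the quality of their content.

```python
from typing import List

REQUIRED_SECTIONS = [
    "SUMMARY:",
    "PRIORITY:",
    "STEPS TO REPRODUCE:",
    "EXPECTED RESULT:",
    "ACTUAL RESULT:",
    "ENVIRONMENT:",
]


def missing_sections(output: str) -> List[str]:
    """Return the required section headers that do not start any line."""
    # Tolerate leading list or emphasis markers such as "- " or "**"
    lines = [line.strip().lstrip("-*# ") for line in output.splitlines()]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(line.startswith(section) for line in lines)
    ]


sample_output = """SUMMARY: PDF upload issue

PRIORITY: High

STEPS TO REPRODUCE:
1. Attempt to upload a 5MB .pdf file to the project documents section.

EXPECTED RESULT:
The PDF file should be successfully uploaded without any issues.

ACTUAL RESULT:
The generic error message is displayed, indicating an upload failure.

ENVIRONMENT:
- System: Not specified.
"""
print(missing_sections(sample_output))
print(missing_sections("SUMMARY: only a title"))
```

If you change the task, replace `REQUIRED_SECTIONS` with whatever structure your own description requires, or skip this check for free-form tasks.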

In [None]:
# Generate test inputs
print("Generating test inputs...")
test_inputs = await generate_training_inputs(TASK_DESCRIPTION, num_examples=NUM_TEST_INPUTS)

print(f"\n🧪 Testing the trained model on {len(test_inputs)} new inputs:\n")
print("=" * 80)

for i, test_input in enumerate(test_inputs):
    print(f"\nTest {i+1}:")
    print(f"Input: {test_input}")

    # Run the model
    test_task_input = TaskInput(
        step=999,
        input_text=test_input
    )
    result_trajectory = await rollout(model, test_task_input)

    # Extract the model's response
    messages = result_trajectory.messages()
    model_response = messages[-1]['content'] if messages else "No response"

    print(f"Model output: {model_response}")
    print("-" * 80)

print("\n🎉 Testing completed!")
print(f"\nYour model '{MODEL_NAME}' has been trained to: {TASK_DESCRIPTION}")
print("\nTo use this model in production:")
print("1. The model checkpoint is saved in ./.art/")
print("2. You can load it using the vLLM library")
print("3. Or continue training with more examples by adjusting the configuration at the top")

Generating test inputs...
Generating 5 training inputs...

🧪 Testing the trained model on 5 new inputs:


Test 1:
Input: Hey, I was trying to upload a .pdf file to the project documents section, but it kept failing with some generic error message. I tried it in both Chrome and Firefox, same result. The file is about 5MB, so shouldn't be too big. Everything else seems to work fine.
Model output: SUMMARY: PDF upload issue

PRIORITY: High

STEPS TO REPRODUCE:
1. Attempt to upload a 5MB .pdf file to the project documents section.
2. Observe the generic error message that appears.
3. Confirm the file size is 5MB.
4. Use both Chrome and Firefox browsers.

EXPECTED RESULT:
The PDF file should be successfully uploaded without any issues.

ACTUAL RESULT:
The generic error message is displayed, indicating an upload failure.

ENVIRONMENT:
- System: Not specified.
- Version: Not specified.
--------------------------------------------------------------------------------

Test 2:
Input: When I use t

### Next Steps

Congratulations! You've successfully trained a custom model for your task using only:
- A task description
- Example inputs (no outputs needed!)
- RULER's automatic evaluation

Here are some ways to improve results:

1. **More diverse inputs**: Generate more varied input examples
2. **Longer training**: Increase the number of training steps
3. **More comparisons**: Increase `rollouts_per_group` for better RULER comparisons
4. **Task refinement**: Make your task description more specific and detailed
5. **Hyperparameter tuning**: Adjust `learning_rate`, `groups_per_step`, and other values in `TRAINING_CONFIG`

Remember: RULER learns what "good" means from your task description alone - no labeled data required!

For more advanced use cases, check out the [ART documentation](https://art.openpipe.ai).