# ACT Training - One Step Walkthrough

This notebook demonstrates the training process for the Action Chunking Transformer (ACT) policy, stripped down to the absolute essentials. We will run a single training step (on a GPU if one is available, otherwise on the CPU) to understand the data flow, policy architecture, and loss calculation.

We will cover:
1.  **Configuration**: Setting up the minimal training and policy configs.
2.  **Dataset**: Loading a real robotics dataset (`lerobot/pusht`).
3.  **Policy**: Instantiating the ACT policy and its underlying model.
4.  **Preprocessing**: Creating data processors for normalization.
5.  **Training Step**: Running the forward pass, calculating loss, and updating weights.

This is a simplified version of `lerobot/scripts/lerobot_train.py`.

In [11]:
%load_ext autoreload
%autoreload 2

import torch
from torch.utils.data import DataLoader
from pathlib import Path
import logging

# LeRobot imports
from lerobot.configs.train import TrainPipelineConfig, DatasetConfig
from lerobot.policies.act.configuration_act import ACTConfig
from lerobot.datasets.factory import make_dataset
from lerobot.policies.factory import make_policy, make_pre_post_processors
from lerobot.optim.factory import make_optimizer_and_scheduler
from lerobot.utils.utils import init_logging
from lerobot.optim.optimizers import AdamWConfig

# Configure logging to see what's happening
init_logging()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Configuration

We need two main configurations:
- `ACTConfig`: Defines the policy architecture (ResNet backbone, Transformer layers, VAE settings).
- `TrainPipelineConfig`: Defines training parameters (batch size, dataset, optimizer).

We'll use the `lerobot/pusht` dataset, a common benchmark for 2D manipulation.

**Note**: No device is set explicitly in these configs, so LeRobot auto-detects one (the log output below reports CUDA). A single step also runs fine on a CPU, but you should use a GPU for real training.

In [12]:
# Define the Policy Configuration

policy_cfg = ACTConfig(
    # Input/Output structure
    n_obs_steps=1,
    chunk_size=100,
    n_action_steps=100,
    # Architecture details
    dim_model=512,
    n_heads=8,
    n_encoder_layers=4,
    n_decoder_layers=1,
    dropout=0.1,
    # VAE settings
    use_vae=True,
    latent_dim=32,
    # Vision Backbone
    vision_backbone="resnet18",
    pretrained_backbone_weights="ResNet18_Weights.IMAGENET1K_V1",
    # Training settings
    optimizer_lr=1e-5,
    optimizer_weight_decay=1e-4,
    kl_weight=10.0,
)

# Define the Dataset Configuration
dataset_cfg = DatasetConfig(
    repo_id="lerobot/pusht",
    root="data",  # Local cache directory
    revision="main",
)

# Combine into the main Training Configuration
cfg = TrainPipelineConfig(
    dataset=dataset_cfg,
    policy=policy_cfg,
    batch_size=8,
    num_workers=0,  # 0 for main process only (easier for debugging)
    save_checkpoint=False,
    steps=10,  # Nominal value; this notebook runs only 1 step manually
    output_dir=Path("outputs/train/notebook_test"),
    optimizer=AdamWConfig(
        lr=1e-5,
        weight_decay=1e-4,
        grad_clip_norm=10.0,
    ),
)

print("Configuration ready.")

INFO 2025-11-26 12:05:18 ils/utils.py:43 Cuda backend detected, using cuda.


Configuration ready.


## 2. Dataset Loading

We use `make_dataset` to download (if needed) and prepare the dataset. `LeRobotDataset` handles synchronization between different modalities (images, robot states, actions) and calculates necessary statistics for normalization.

In [6]:
# Create the dataset
dataset = make_dataset(cfg)

print(f"Dataset loaded: {dataset.repo_id}")
print(f"Number of episodes: {dataset.num_episodes}")
print(f"Number of frames: {dataset.num_frames}")
print(f"Features: {dataset.features.keys()}")

# We can check the statistics (mean/std) computed from the dataset
# These will be used by the policy to normalize inputs
stats = dataset.meta.stats
print("Stats keys:", stats.keys())

Dataset loaded: lerobot/pusht
Number of episodes: 206
Number of frames: 25650
Features: dict_keys(['observation.image', 'observation.state', 'action', 'episode_index', 'frame_index', 'timestamp', 'next.reward', 'next.done', 'next.success', 'index', 'task_index'])
Stats keys: dict_keys(['index', 'next.success', 'observation.state', 'next.done', 'observation.image', 'timestamp', 'episode_index', 'frame_index', 'action', 'task_index', 'next.reward'])
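
As a quick sanity check on the numbers printed above, the following sketch works out the average episode length and how many batches one pass over the dataset would take with the configured `batch_size=8`. It assumes the final partial batch is kept (that is, `drop_last=False`); the variable names are illustrative and not part of LeRobot.

```python
import math

num_episodes = 206   # from the dataset printout
num_frames = 25650   # from the dataset printout
batch_size = 8       # cfg.batch_size

# Average number of frames per episode.
avg_episode_len = num_frames / num_episodes

# Assumption: the last, smaller batch is kept, so round up.
batches_per_epoch = math.ceil(num_frames / batch_size)
last_batch_size = num_frames % batch_size or batch_size

print(f"{avg_episode_len:.1f} frames per episode on average")
print(f"{batches_per_epoch} batches per epoch, last batch has {last_batch_size} samples")
```

So a single training step with 8 samples touches only a tiny fraction of the dataset, which is why the full script runs for many thousands of steps.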


## 3. Policy Initialization

`make_policy` initializes the ACT policy. It infers input/output shapes from the dataset metadata. The policy contains:
- **Backbone**: ResNet18 for processing images.
- **VAE Encoder**: Encodes action sequences into a latent space (during training).
- **Transformer Encoder/Decoder**: Predicts action chunks from observations and latent z.


In [13]:
policy = make_policy(
    cfg=cfg.policy,
    ds_meta=dataset.meta,
)

print("Policy initialized:", policy.name)
# Move to the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy.to(device)
print(f"Policy moved to {device}")

Policy initialized: act
Policy moved to cuda


## 4. Preprocessing and Optimization

We need:
1.  **Processors**: To normalize raw data (images [0-255] -> [0-1], vectors -> mean 0 std 1).
2.  **Optimizer**: AdamW is standard for Transformers.
3.  **DataLoader**: PyTorch's standard data loader.
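
The mean/std normalization from item 1 can be sketched in plain Python as follows. This is a minimal illustration, not LeRobot's processor implementation: the `normalize` and `unnormalize` helpers and the statistics are hypothetical, and it assumes the postprocessor applies the inverse mapping to predicted actions.

```python
def normalize(values, mean, std, eps=1e-8):
    """Shift and scale each dimension to roughly zero mean and unit std."""
    return [(v - m) / (s + eps) for v, m, s in zip(values, mean, std)]


def unnormalize(values, mean, std, eps=1e-8):
    """Inverse of normalize: map normalized values back to raw units."""
    return [v * (s + eps) + m for v, m, s in zip(values, mean, std)]


# Hypothetical 2-D state and dataset statistics, for illustration only.
state = [300.0, 150.0]
mean = [250.0, 200.0]
std = [100.0, 50.0]

normed = normalize(state, mean, std)
restored = unnormalize(normed, mean, std)

print("normalized:", normed)
print("restored:  ", restored)
```

The small `eps` guards against division by zero for features whose standard deviation is zero.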

In [14]:
# 1. Create Pre- and Post-processors
# These handle normalization using the dataset statistics we saw earlier
preprocessor, postprocessor = make_pre_post_processors(
    policy_cfg=cfg.policy, dataset_stats=dataset.meta.stats
)

# 2. Create Optimizer
optimizer, lr_scheduler = make_optimizer_and_scheduler(cfg, policy)

# 3. Create DataLoader
dataloader = DataLoader(
    dataset,
    num_workers=cfg.num_workers,
    batch_size=cfg.batch_size,
    shuffle=True,
    drop_last=False,
)

print("Pipeline components ready.")

Pipeline components ready.


## 5. The Training Step

Now we perform a single training iteration manually. This corresponds to the `update_policy` function in `lerobot_train.py`.

**The Flow:**
1.  **Fetch Batch**: Get a dictionary of tensors (images, states, actions) from the dataloader.
2.  **Preprocess**: Normalize the batch (on CPU).
3.  **Move to Device**: Send data to GPU/CPU.
4.  **Forward Pass**: `policy(batch)` computes the loss.
    -   ACT uses a VAE objective: `loss = reconstruction_loss + kl_weight * kld_loss`.
    -   It tries to reconstruct the `action` chunk given the `observation` and the `latent`.
5.  **Backward Pass**: Compute gradients.
6.  **Optimizer Step**: Update model weights.
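
The objective in step 4 can be sketched in plain Python. This is an illustration under stated assumptions rather than LeRobot's code: it uses the standard closed-form KL divergence between a diagonal Gaussian and a unit Gaussian, and it assumes padded timesteps (flagged by `action_is_pad`) contribute zero error while still counting toward the mean. The helper names and toy values are hypothetical.

```python
import math


def masked_l1(pred, target, is_pad):
    """Mean absolute error over a chunk, with padded timesteps zeroed out."""
    total, count = 0.0, 0
    for p_step, t_step, pad in zip(pred, target, is_pad):
        for p, t in zip(p_step, t_step):
            total += 0.0 if pad else abs(p - t)
            count += 1
    return total / count


def gaussian_kld(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return sum(-0.5 * (1 + lv - m * m - math.exp(lv)) for m, lv in zip(mu, log_var))


# Toy chunk of 3 timesteps with 2-D actions; the last step is padding.
pred = [[0.1, 0.2], [0.4, 0.0], [9.0, 9.0]]
target = [[0.0, 0.0], [0.5, 0.2], [0.0, 0.0]]
is_pad = [False, False, True]

# Toy 2-D latent posterior parameters.
mu, log_var = [1.0, 0.0], [0.0, 0.0]

kl_weight = 10.0
l1 = masked_l1(pred, target, is_pad)
kld = gaussian_kld(mu, log_var)
loss = l1 + kl_weight * kld

print(f"l1={l1:.3f} kld={kld:.3f} loss={loss:.3f}")
```

With `kl_weight=10.0`, the KL term can easily dominate the total early in training, even when the L1 term is below 1.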

In [15]:
# 1. Fetch a single batch
dl_iter = iter(dataloader)
batch = next(dl_iter)
print("Raw batch keys:", batch.keys())

# 2. Preprocess (Normalization)
# Note: The preprocessor expects data on CPU usually, but handles device internally if configured.
batch = preprocessor(batch)

# 3. Move batch to device
for key in batch:
    if isinstance(batch[key], torch.Tensor):
        batch[key] = batch[key].to(device)

# Set policy to training mode
policy.train()

# 4. Forward Pass
# The policy.forward() method handles the complex logic of the ACT model:
# - Encoding images with ResNet
# - Encoding action sequence with VAE encoder (to get z)
# - Decoding with the Transformer
# - Computing L1 loss (action reconstruction) and KL loss (VAE regularization)
loss, output_dict = policy.forward(batch)

print("\n--- Training Step Results ---")
print(f"Total Loss: {loss.item():.4f}")
print("Component Losses:", output_dict)

# 5. Backward Pass
optimizer.zero_grad()
loss.backward()

# Clip gradients (standard practice for Transformers).
# clip_grad_norm_ returns the total norm measured *before* clipping.
grad_norm = torch.nn.utils.clip_grad_norm_(
    policy.parameters(), cfg.optimizer.grad_clip_norm
)
print(f"Gradient Norm: {grad_norm.item():.4f}")

# 6. Optimizer Step
optimizer.step()

print("Weights updated successfully!")

Raw batch keys: dict_keys(['observation.image', 'observation.state', 'action', 'episode_index', 'frame_index', 'timestamp', 'next.reward', 'next.done', 'next.success', 'index', 'task_index', 'action_is_pad', 'task'])

--- Training Step Results ---
Total Loss: 66.0094
Component Losses: {'l1_loss': 0.7924197912216187, 'kld_loss': 6.521697044372559}
Gradient Norm: 1103.7797
Weights updated successfully!
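
The printed numbers can be cross-checked by hand. The sketch below recombines the two component losses using `kl_weight=10.0` from the policy config, and estimates the factor by which the gradients were scaled, assuming the usual `max_norm / (total_norm + 1e-6)` rule used by `clip_grad_norm_`. The variable names are illustrative.

```python
# Values copied from the output above.
l1_loss = 0.7924197912216187
kld_loss = 6.521697044372559
kl_weight = 10.0  # policy_cfg.kl_weight

total = l1_loss + kl_weight * kld_loss
print(f"Recomputed total loss: {total:.4f}")

# The reported gradient norm is the pre-clipping value.
grad_norm = 1103.7797
max_norm = 10.0  # cfg.optimizer.grad_clip_norm

# Gradients are scaled down only if the norm exceeds max_norm.
clip_coef = min(1.0, max_norm / (grad_norm + 1e-6))
print(f"Gradient scale factor: {clip_coef:.5f}")
```

The recomputed total matches the reported loss, and the scale factor shows that the gradients were shrunk by roughly two orders of magnitude before the optimizer step.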


## Conclusion

You've successfully run one step of ACT training! 

**Key Takeaways:**
- The `make_dataset` and `make_policy` helpers simplify initialization significantly.
- `preprocessor` is critical; the model expects normalized inputs.
- The policy's `forward` method encapsulates the specific algorithm logic (ACT's VAE + Transformer), returning a ready-to-optimize loss.

In the full training script, this loop repeats for thousands of steps, with added logic for:
- Checkpointing (saving the model).
- Evaluation (running the policy in simulation or on a real robot).
- Logging (WandB).
- Distributed training (multi-GPU).
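
To make the shape of such a loop concrete, here is a minimal skeleton in plain Python. It is a sketch only and does not reproduce the structure of the real training script: `cycle`, `train`, `update_step`, `log_every`, and `save_every` are hypothetical names, and the "batches" are plain numbers standing in for real data.

```python
from itertools import islice


def cycle(iterable):
    """Yield items forever, restarting the iterable when it is exhausted."""
    while True:
        yield from iterable


def train(batches, update_step, steps, log_every=2, save_every=5):
    """Run `steps` updates, recording where logging and saving would happen."""
    events = []
    for step, batch in enumerate(islice(cycle(batches), steps), start=1):
        loss = update_step(batch)  # forward, backward, optimizer step
        if step % log_every == 0:
            events.append(("log", step, loss))
        if step % save_every == 0:
            events.append(("save", step))
    return events


# Stand-in data and update function, for illustration only.
events = train(batches=[1.0, 2.0, 3.0], update_step=lambda b: b * 0.5, steps=5)
print(events)
```

Cycling over the data means the number of training steps is decoupled from the dataset size, so the loop simply starts a new pass whenever the batches run out.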