# TinyVLA: Fast-Iteration VLA Training on Colab

Train a minimal Vision-Language-Action (VLA) model in 1-3 minutes on a free Colab GPU!

**What you'll learn:**
- VLA architecture basics (ViT + Transformer + Action head)
- Fast iteration techniques
- Synthetic dataset generation

**Runtime:** Enable a GPU before running any cells: Runtime → Change runtime type → GPU (T4)

## Setup

In [None]:
# Install dependencies
!pip install -q torch torchvision transformers matplotlib tqdm tensorboard pillow

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Upload Code Files

Upload these files to Colab:
- `tiny_vla_dataset.py`
- `tiny_vla_model.py`
- `train_tiny_vla.py`
- `inference_tiny_vla.py`

If the files are already in your current directory (for example, after cloning a repository with the cell below), you can skip the upload.
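
Before moving on, it can help to confirm that all four files are actually visible from the notebook's working directory. The snippet below is a minimal sketch: the helper name `missing_files` is hypothetical and is not part of the TinyVLA code, and it only checks that the files exist, not that they import cleanly.

```python
from pathlib import Path

# File names taken from the upload list above.
REQUIRED_FILES = [
    "tiny_vla_dataset.py",
    "tiny_vla_model.py",
    "train_tiny_vla.py",
    "inference_tiny_vla.py",
]


def missing_files(directory="."):
    """Return the required files that are not present in `directory`."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_files()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files found.")
```

If anything is reported missing, upload it (or `%cd` into the cloned repository) before running the import cells.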

In [None]:
# Alternative: Clone from GitHub if you've uploaded there
# !git clone https://github.com/your-username/tiny-vla.git
# %cd tiny-vla

## Quick Setup Test

In [None]:
# Test imports and setup
from tiny_vla_dataset import BlockFindDataset
from tiny_vla_model import create_tiny_vla

# Create a small test dataset
print("Creating test dataset...")
test_dataset = BlockFindDataset(num_samples=10)

# Visualize a sample
test_dataset.visualize_sample(0)

# Display the image
from IPython.display import Image, display
display(Image('sample_visualization.png'))
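
To give a feel for how a synthetic dataset of this kind can be generated, here is a minimal sketch that produces instruction/action pairs with the standard library only. It is not the code in `tiny_vla_dataset.py`: the color list, the instruction template, the block size, and the mapping from direction words to 2-D actions (image coordinates, so "up" is negative y) are all assumptions made for illustration, and image rendering is left out.

```python
import random

# Assumed vocabulary, based on the instruction strings used in this notebook.
COLORS = ["red", "blue", "green", "yellow"]

# Assumed convention: image coordinates, where y grows downward.
DIRECTIONS = {
    "up": (0.0, -1.0),
    "down": (0.0, 1.0),
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
}


def make_sample(rng, image_size=64, block_size=8):
    """Draw one synthetic sample: block position, instruction, target action."""
    color = rng.choice(COLORS)
    direction = rng.choice(sorted(DIRECTIONS))
    # Keep the whole block inside the image.
    x = rng.randrange(0, image_size - block_size + 1)
    y = rng.randrange(0, image_size - block_size + 1)
    return {
        "instruction": f"Push the {color} block {direction}",
        "block_xy": (x, y),
        "action": DIRECTIONS[direction],
    }


def make_dataset(num_samples, seed=0):
    """Generate a reproducible list of samples from a fixed seed."""
    rng = random.Random(seed)
    return [make_sample(rng) for _ in range(num_samples)]


for sample in make_dataset(3, seed=44):
    print(sample)
```

Seeding a private `random.Random` instance keeps train, validation, and test splits reproducible and independent of any other random calls in the notebook.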

In [None]:
# Create and inspect model
print("Creating TinyVLA model...")
model = create_tiny_vla()

# Test forward pass
batch_size = 4
images = torch.randn(batch_size, 3, 64, 64)
instructions = [
    "Push the red block up",
    "Move blue block left",
    "Push green block down",
    "Move yellow block right"
]

images, input_ids, attention_mask = model.prepare_inputs(images, instructions)

with torch.no_grad():
    actions = model(images, input_ids, attention_mask)

print(f"\nForward pass successful!")
print(f"Action predictions shape: {actions.shape}")
print(f"Sample prediction: {actions[0]}")
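
For intuition about the "ViT + Transformer + Action head" layout, the following is a minimal PyTorch sketch of one way such a model could be wired. It is not the implementation in `tiny_vla_model.py`: the class name `SketchVLA`, the patch size, vocabulary size, mean pooling, and concatenation-based fusion are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn


def _encoder(dim, layers, heads=4):
    """Small Transformer encoder stack operating on (batch, seq, dim) tensors."""
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=layers, enable_nested_tensor=False)


class SketchVLA(nn.Module):
    """Illustrative only: patch-based vision encoder + text encoder + MLP action head."""

    def __init__(self, image_size=64, patch_size=8, vocab_size=1000, max_text_len=32,
                 vision_dim=128, lang_dim=128, layers=2, action_dim=2):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A conv with stride == kernel size cuts the image into non-overlapping patches.
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=patch_size, stride=patch_size)
        self.patch_pos = nn.Parameter(torch.zeros(1, num_patches, vision_dim))
        self.vision_encoder = _encoder(vision_dim, layers)
        self.token_embed = nn.Embedding(vocab_size, lang_dim)
        self.token_pos = nn.Parameter(torch.zeros(1, max_text_len, lang_dim))
        self.lang_encoder = _encoder(lang_dim, layers)
        self.action_head = nn.Sequential(
            nn.Linear(vision_dim + lang_dim, 128), nn.GELU(), nn.Linear(128, action_dim)
        )

    def forward(self, images, input_ids, attention_mask):
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, P, Dv)
        vision = self.vision_encoder(patches + self.patch_pos).mean(dim=1)
        tokens = self.token_embed(input_ids) + self.token_pos[:, : input_ids.shape[1]]
        pad = attention_mask == 0  # True marks padding tokens to ignore
        lang = self.lang_encoder(tokens, src_key_padding_mask=pad)
        # Mean-pool over real (non-padding) tokens only.
        keep = (~pad).unsqueeze(-1).to(lang.dtype)
        lang = (lang * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return self.action_head(torch.cat([vision, lang], dim=-1))


sketch = SketchVLA().eval()
dummy_images = torch.randn(4, 3, 64, 64)
dummy_ids = torch.randint(0, 1000, (4, 8))
dummy_mask = torch.ones(4, 8, dtype=torch.long)
with torch.no_grad():
    dummy_actions = sketch(dummy_images, dummy_ids, dummy_mask)
print(dummy_actions.shape)  # torch.Size([4, 2])
```

The sketch uses separate names (`sketch`, `dummy_*`) so it does not overwrite the `model` and `images` variables created in the cell above.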

## Training

Train the model for 20 epochs (1-3 minutes on a T4 GPU).

In [None]:
# Option 1: Run training script directly
!python train_tiny_vla.py

In [None]:
# Option 2: Train inline with custom config (faster for prototyping)
from train_tiny_vla import TinyVLATrainer
from tiny_vla_dataset import create_dataloaders
from tiny_vla_model import create_tiny_vla

# Create model
model = create_tiny_vla()

# Create dataloaders (smaller dataset for faster iteration)
train_loader, val_loader, test_loader = create_dataloaders(
    train_size=4000,  # Reduce for faster training
    val_size=500,
    test_size=500,
    batch_size=64,
    num_workers=2  # Colab has limited CPU
)

# Create trainer
device = 'cuda' if torch.cuda.is_available() else 'cpu'
trainer = TinyVLATrainer(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    device=device,
    lr=3e-4
)

# Train
trainer.train(num_epochs=10)  # Reduce epochs for faster testing
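
When shrinking the dataset or the epoch count for faster iteration, it is useful to know how many optimizer steps a run actually performs. The arithmetic below is a small sketch using the values from the cell above; it assumes one optimizer step per batch and a standard PyTorch-style `DataLoader`, which may not match what `create_dataloaders` and `TinyVLATrainer` do internally. The helper `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(train_size, batch_size, num_epochs, drop_last=False):
    """Return (batches per epoch, total steps), assuming one step per batch."""
    if drop_last:
        per_epoch = train_size // batch_size
    else:
        per_epoch = math.ceil(train_size / batch_size)
    return per_epoch, per_epoch * num_epochs


per_epoch, total = optimizer_steps(train_size=4000, batch_size=64, num_epochs=10)
print(f"{per_epoch} batches per epoch, {total} steps in total")
# 63 batches per epoch, 630 steps in total
```

Halving `train_size` or `num_epochs` roughly halves the step count, which is usually the quickest lever for shortening a prototyping run.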

## Visualize Training Progress

In [None]:
# Load tensorboard
%load_ext tensorboard
%tensorboard --logdir logs

## Inference & Evaluation

In [None]:
# Load trained model and visualize predictions
from inference_tiny_vla import TinyVLAInference
from tiny_vla_dataset import BlockFindDataset

# Create inference object
device = 'cuda' if torch.cuda.is_available() else 'cpu'
inference = TinyVLAInference('checkpoints/best_model.pt', device=device)

# Create test dataset
test_dataset = BlockFindDataset(num_samples=1000, seed=44)

# Visualize predictions
inference.visualize_predictions(test_dataset, num_samples=8)

# Display
from IPython.display import Image, display
display(Image('predictions.png'))

In [None]:
# Evaluate accuracy
metrics = inference.evaluate_accuracy(test_dataset, num_samples=1000)

# Print results
print("\n" + "="*50)
print("Final Test Results")
print("="*50)
for key, value in metrics.items():
    print(f"{key:30s}: {value:.4f}")
print("="*50)

## Interactive Demo

Test the model on individual samples

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Pick a random sample
idx = np.random.randint(len(test_dataset))
sample = test_dataset[idx]

# Get prediction
image = sample['image']
instruction = sample['instruction']
action_gt = sample['action'].numpy()
action_pred = inference.predict(image, instruction)

# Visualize
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
image_display = image.permute(1, 2, 0).numpy()
ax.imshow(image_display)
ax.set_title(
    f"Instruction: {instruction}\n"
    f"Ground Truth: [{action_gt[0]:.2f}, {action_gt[1]:.2f}]\n"
    f"Prediction: [{action_pred[0]:.2f}, {action_pred[1]:.2f}]\n"
    f"Error: {np.linalg.norm(action_pred - action_gt):.3f}",
    fontsize=12
)

# Draw arrows
center_x, center_y = 32, 32
arrow_scale = 15

# Ground truth (green)
ax.arrow(center_x, center_y, action_gt[0]*arrow_scale, action_gt[1]*arrow_scale,
         head_width=3, head_length=3, fc='green', ec='green', linewidth=3, label='GT')

# Prediction (red)
ax.arrow(center_x, center_y, action_pred[0]*arrow_scale, action_pred[1]*arrow_scale,
         head_width=3, head_length=3, fc='red', ec='red', linewidth=3, linestyle='--', label='Pred')

ax.legend()
ax.axis('off')
plt.tight_layout()
plt.show()

print(f"\nInstruction: {instruction}")
print(f"Ground Truth Action: {action_gt}")
print(f"Predicted Action: {action_pred}")
print(f"L2 Error: {np.linalg.norm(action_pred - action_gt):.4f}")
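
The error printed above is the Euclidean (L2) distance between the predicted and ground-truth action vectors. As a sketch of what `np.linalg.norm(action_pred - action_gt)` computes, here is the same quantity written out with the standard library, plus an average over several samples. The helper names are hypothetical, and the metrics returned by `inference.evaluate_accuracy` may be defined differently.

```python
import math


def l2_error(pred, target):
    """Euclidean distance between two action vectors of equal length."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))


def mean_l2_error(preds, targets):
    """Average L2 error over paired lists of predictions and targets."""
    errors = [l2_error(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)


# A prediction of [0, 0] against a unit-length target has an error of about 1.
print(l2_error([0.0, 0.0], [0.0, 1.0]))
print(mean_l2_error([[0.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]]))
```

Because the error is a distance, a value near 0 means the predicted arrow nearly overlaps the ground-truth arrow in the plot.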

## Download Trained Model

Download your trained model to use locally

In [None]:
from google.colab import files

# Download checkpoint
files.download('checkpoints/best_model.pt')

# Download training logs
!zip -r logs.zip logs/
files.download('logs.zip')

## Experiments to Try

### 1. Architecture Ablations

In [None]:
# Try smaller model (faster training)
config_small = {
    'image_size': 64,
    'vision_embed_dim': 128,  # Reduced from 192
    'vision_layers': 2,       # Reduced from 4
    'lang_embed_dim': 128,    # Reduced from 256
    'lang_layers': 2,         # Reduced from 4
    'action_dim': 2,
}

model_small = create_tiny_vla(config_small)
# ... train and compare results

### 2. Data Efficiency

In [None]:
# Train on different dataset sizes
for train_size in [500, 1000, 2000, 4000, 8000]:
    print(f"\nTraining on {train_size} samples...")
    # ... create dataloaders and train
    # Compare final validation error

## Next Steps

1. **Scale up**: Increase model size gradually
2. **Real data**: Replace synthetic data with real robot demonstrations
3. **Pretrained models**: Use SigLIP + Phi-2 backbones
4. **Advanced techniques**: Try LoRA, cross-attention, diffusion policies

Check the README for more details!