# Introduction and Deployment of VLA: SmolVLA

## VLA: Vision-Language-Action Model
Unlike a general-purpose VLM, a VLA is built specifically for robotics: it uses a language instruction, together with what the robot sees, to work out the correct action for the robot to take.

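SmolVLA is an example of a VLA foundation model whose inputs are camera images, the robot's current state, and a language instruction, and whose output is the action the robot should take.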
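
To make that input/output contract concrete, here is a minimal sketch in plain Python. `Observation`, `ToyVLAPolicy`, and the six-value state are hypothetical names and values chosen for illustration. They are not part of LeRobot or SmolVLA, and the "policy" simply returns the current state instead of running a neural network.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Hypothetical container for what a VLA sees at one time step."""
    image: List[List[List[float]]]  # camera frame, laid out as (C, H, W)
    state: List[float]              # robot joint positions


class ToyVLAPolicy:
    """Stand-in policy: same interface shape as a VLA, but no learning."""

    def select_action(self, observation: Observation, instruction: str) -> List[float]:
        if not instruction:
            raise ValueError("a VLA needs a language instruction")
        # A real model would fuse image, state, and text here.
        # This toy version just holds the current joint positions.
        return list(observation.state)


obs = Observation(image=[[[0.0]]], state=[0.1, -0.2, 0.3, 0.0, 0.5, 1.0])
action = ToyVLAPolicy().select_action(obs, "pick up the cube")
print(len(action))  # one value per joint
```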

For a more comprehensive overview of everything LeRobot has to offer, see [LeRobot Repo](https://learnopencv.com/vision-language-action-models-lerobot-policy/).

Walk through each cell of the notebook to better understand the VLA model.

In [None]:
!pip install ipykernel matplotlib
!pip install lerobot
!conda install ffmpeg -c conda-forge -y
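
Before moving on, it can help to confirm that the installs worked. The check below is a small sketch that uses only the standard library. `check_environment` is a hypothetical helper name, and it only tests that each package is importable and that an `ffmpeg` executable is on the `PATH`, not that the versions are compatible.

```python
import importlib.util
import shutil


def check_environment(packages=("torch", "matplotlib", "lerobot")):
    """Return {name: bool} for each package, plus an 'ffmpeg' entry."""
    # find_spec locates a top-level package without importing it.
    status = {name: importlib.util.find_spec(name) is not None for name in packages}
    status["ffmpeg"] = shutil.which("ffmpeg") is not None
    return status


for name, ok in check_environment().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```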

### Import Libraries

In [None]:
import torch
import matplotlib.pyplot as plt
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
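# SmolVLA's policy class lives in the smolvla subpackage; the alias keeps the
# name LeRobotPolicy used in the cells below. Module paths vary across LeRobot
# releases (newer ones drop the `common` segment), so adjust if the import fails.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy as LeRobotPolicy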

def show_image(img, instruction):
    """Display an image with its instruction."""
    plt.imshow(img.permute(1, 2, 0))  # Convert from (C, H, W) to (H, W, C)
    plt.title(f"Instruction: \"{instruction}\"")
    plt.axis('off')
    plt.show()
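
The `permute(1, 2, 0)` call in `show_image` is needed because PyTorch image tensors are usually stored channels-first, while `plt.imshow` expects channels-last. The sketch below shows the same reordering with NumPy's `transpose` on a made-up 3x4x5 array, so it runs without a dataset.

```python
import numpy as np

# A made-up "image": 3 channels, 4 rows, 5 columns, i.e. (C, H, W).
chw = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# Move the channel axis to the end: (C, H, W) -> (H, W, C).
hwc = np.transpose(chw, (1, 2, 0))

print(chw.shape, "->", hwc.shape)

# The pixel at row 2, column 3 keeps its three channel values.
assert (hwc[2, 3] == chw[:, 2, 3]).all()
```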

### Load the Pre-Trained VLA (SmolVLA)

In [None]:
device = torch.device("cpu")
print(f"Using device: {device}")

# Load the pre-trained SmolVLA policy
# This downloads the model weights from the Hugging Face Hub
policy = LeRobotPolicy.from_pretrained("lerobot/smolvla_base").to(device)

# Set the model to evaluation mode (disables things like dropout)
policy.eval()

print("SmolVLA model loaded successfully!")

### Load Sample Data

In [None]:
dataset_repo_id = "lerobot/svla_so100_pickplace"
dataset = LeRobotDataset(dataset_repo_id)

print(f"Loaded dataset with {len(dataset)} total steps.")

# Let's get a single sample from the dataset (e.g., sample #1000)
sample_idx = 1000
sample = dataset[sample_idx]

# Let's see what's in our sample
print("\nSample keys:")
print(sample.keys())

# The 'observation' key contains the inputs for the model
print("\nObservation keys:")
print(sample['observation'].keys())
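
Printing only the keys hides the tensor shapes, which are often what you need next. As a sketch, the hypothetical helper `describe` below walks a nested dictionary and lists each leaf with its shape (or its type when there is no `.shape`). The `fake_sample` dictionary is invented for the demo and only mirrors the nesting used in this notebook.

```python
import numpy as np


def describe(sample, prefix=""):
    """Return one 'path: shape-or-type' line per leaf of a nested dict."""
    lines = []
    for key, value in sample.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the path.
            lines.extend(describe(value, prefix=path + "/"))
        elif hasattr(value, "shape"):
            lines.append(f"{path}: shape={tuple(value.shape)}")
        else:
            lines.append(f"{path}: {type(value).__name__}")
    return lines


# Invented stand-in data, not a real dataset sample.
fake_sample = {
    "observation": {
        "images.top": np.zeros((3, 8, 8)),
        "state": np.zeros(6),
    },
    "action": np.zeros(6),
    "instruction": "pick up the cube",
}

print("\n".join(describe(fake_sample)))
```

With the notebook's data, the equivalent call would be `describe(sample)`.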

### Prepare Inputs and Run Inference

In [None]:
# Get the instruction and observation from our sample
instruction = sample['instruction']

# Prepare the observation dictionary
# We add a batch dimension (B) to each tensor, e.g., (C, H, W) -> (B, C, H, W)
observation = {
    'images.top': sample['observation']['images.top'].to(device).unsqueeze(0),
    'state': sample['observation']['state'].to(device).unsqueeze(0)
}

# Run inference!
# torch.no_grad() turns off gradient tracking, which inference does not need; this saves memory.
with torch.no_grad():
    action = policy.select_action(observation, instruction)

print(f"Instruction: \"{instruction}\"")
print(f"\nPredicted Action: {action.cpu().numpy()}")

# Let's also see what the "ground truth" action was (what the human did)
ground_truth_action = sample['action'].numpy()
print(f"Ground Truth Action: {ground_truth_action}")
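
Two printed arrays are hard to compare by eye. One simple option, sketched below with NumPy, is to drop the batch dimension from the prediction and report the per-dimension absolute error and its mean. The numbers are invented placeholders, not SmolVLA outputs, `action_error` is a hypothetical helper, and the sketch assumes the prediction has shape `(1, action_dim)`.

```python
import numpy as np


def action_error(predicted, ground_truth):
    """Per-dimension absolute error and its mean between two action vectors."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    if predicted.ndim == 2 and predicted.shape[0] == 1:
        predicted = predicted[0]  # remove the batch dimension: (1, D) -> (D,)
    if predicted.shape != ground_truth.shape:
        raise ValueError(f"shape mismatch: {predicted.shape} vs {ground_truth.shape}")
    abs_err = np.abs(predicted - ground_truth)
    return abs_err, float(abs_err.mean())


# Placeholder values for illustration only.
pred = np.array([[1.0, 2.0, 3.0]])   # shape (1, 3), batched
truth = np.array([1.5, 2.0, 2.0])    # shape (3,)

per_dim, mean_err = action_error(pred, truth)
print("Per-dimension error:", per_dim)
print("Mean absolute error:", mean_err)
```

With the notebook's variables, the call would be `action_error(action.cpu().numpy(), ground_truth_action)`.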

### Visualize the Result

In [None]:
top_image = observation['images.top'][0].cpu()

# Use our helper function to show the image and instruction
show_image(top_image, instruction)