End-to-end Physical AI pipeline combining voice commands, scene reasoning, and learned manipulation
Built with NVIDIA Cosmos Reason 2 + GR00T N1.6 + Parakeet ASR
NOVA demonstrates the complete NVIDIA Physical AI stack working together on Pollen Robotics' Reachy 2 humanoid robot:
"Pick up the red cube"
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ NOVA PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ VOICE INPUT SCENE UNDERSTANDING │
│ ┌─────────────┐ ┌──────────────────┐ │
│ │ Parakeet │──────────────▶│ Cosmos Reason 2 │ │
│ │ ASR │ text │ "I see a red │ │
│ │ (600M) │ command │ cube at..." │ │
│ └─────────────┘ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ GR00T N1.6 │ │
│ │ Action Policy │ │
│ │ (Fine-tuned) │ │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Reachy 2 │ │
│ │ (14-DOF Arms) │ │
│ └──────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
| Component | Technology | What It Does |
|---|---|---|
| Voice Input | Parakeet CTC 0.6B | 6% WER, 50x real-time transcription |
| Scene Reasoning | Cosmos Reason 2 | Object detection, spatial understanding, task planning |
| Action Policy | GR00T N1.6 | Vision-language-action model for manipulation |
| Simulation | MuJoCo + reachy2_mujoco | High-fidelity physics with domain randomization |
| Dataset | LeRobot v2.1 | 100 expert episodes, HuggingFace compatible |
Training dataset:

- 100 episodes of pick-and-place demonstrations
- 32 task variations (4 objects × 8 colors)
- Domain randomization: position, lighting, camera jitter
- Format: LeRobot v2.1 with parquet + H264 video
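For quick inspection, the published episodes can be loaded with the LeRobot library. A minimal sketch, assuming a recent `lerobot` release (the import path has moved between versions) and the `ganatrask/reachy2_100` repo id used by the training command later in this README:

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("ganatrask/reachy2_100")
print(f"{ds.num_episodes} episodes, {len(ds)} frames")

frame = ds[0]  # dict of tensors keyed by modality (video, state, action, ...)
print(sorted(frame.keys()))
```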
Training summary:

| Detail | Value |
|---|---|
| GPU | NVIDIA A100-SXM4-80GB |
| Training Steps | 30,000 |
| Batch Size | 64 |
| Final Loss | ~0.008-0.01 |
| Checkpoints | 15K, 21K, 24K, 30K |
```bash
# Create conda environments
conda create -n reachy_groot python=3.10 -y
conda create -n reachy_cosmos python=3.10 -y

# Clone repository with submodules
git clone --recurse-submodules https://github.com/ganatrask/NOVA.git
cd NOVA

# If already cloned without submodules:
git submodule update --init --recursive

# Activate main environment
conda activate reachy_groot

# Install reachy2_mujoco (simulation)
pip install -e libs/reachy2_mujoco

# Install Isaac-GR00T (action policy)
cd Isaac-GR00T
pip install uv && uv sync --python 3.10 && uv pip install -e .
cd ..

# Install remaining dependencies
pip install -r requirements.txt
```

Run each component in its own terminal:

Terminal 1: MuJoCo Server

```bash
reachy2-mujoco --headless
```

Terminal 2: Cosmos Reasoning Server

```bash
conda activate reachy_cosmos
python scripts/cosmos_server.py --port 8100
```

Terminal 3: Gradio GUI

```bash
conda activate reachy_groot
python scripts/pipeline_gui.py --model-path checkpoints/groot-reachy2-pickplace/checkpoint-30000
```

Open http://localhost:7860 in your browser.
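To script against the Cosmos server without the GUI, an HTTP call like the one below should be the shape of it. The endpoint name and payload fields are placeholders; the real request/response schema is defined in `scripts/cosmos_server.py` and `scripts/cosmos_client.py`:

```python
import requests

# Hypothetical request against the Cosmos reasoning server started above;
# "/analyze" and the payload key are illustrative, not the actual route.
resp = requests.post(
    "http://localhost:8100/analyze",
    json={"command": "Pick up the red cube"},
    timeout=60,
)
print(resp.json())
```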
```
MuJoCo Simulation
│
├── Domain Randomization
│   ├── Object: cube, rect_box, cylinder, capsule
│   ├── Color: 8 variations
│   ├── Position: workspace-aware random
│   ├── Lighting: 0.5-1.0 intensity
│   └── Camera: ±2° jitter
│
├── External Cameras
│   ├── front_cam (640×480, 108° FOV)
│   └── workspace_cam (640×480, 70° FOV)
│
└── LeRobot v2.1 Dataset
    ├── Parquet files (states + actions)
    ├── MP4 videos (H264)
    └── stats.json (normalization)
```
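A schematic of the sampling logic above; the real implementation lives in `scripts/data_collector.py`, and the helper name, color list, and workspace bounds here are illustrative:

```python
import random

OBJECTS = ["cube", "rect_box", "cylinder", "capsule"]
COLORS = ["red", "white"]  # extend to the 8 variations used in the dataset

def sample_episode_config() -> dict:
    """Draw one randomized episode configuration (illustrative)."""
    return {
        "object": random.choice(OBJECTS),
        "color": random.choice(COLORS),
        # workspace-aware position: these bounds are made up for the sketch
        "position_xy": (random.uniform(0.2, 0.5), random.uniform(-0.3, 0.3)),
        "light_intensity": random.uniform(0.5, 1.0),     # 0.5-1.0 intensity
        "camera_jitter_deg": random.uniform(-2.0, 2.0),  # ±2° jitter
    }
```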
```python
# Right arm (8 values)
action = [
    shoulder_pitch,  # -180° to 90°
    shoulder_roll,   # -180° to 10°
    elbow_yaw,       # -90° to 90°
    elbow_pitch,     # -125° to 0°
    wrist_roll,      # -100° to 100°
    wrist_pitch,     # -45° to 45°
    wrist_yaw,       # -30° to 30°
    gripper,         # 0 (closed) to 1 (open)
]
```
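For intuition, the limits above map raw joint commands into a normalized range like this. A sketch only: in the actual pipeline, normalization statistics come from the dataset's `stats.json`, not hard-coded limits:

```python
import numpy as np

# Joint limits copied from the action list above (degrees; gripper is 0-1)
JOINT_LIMITS = np.array([
    [-180.0,  90.0],   # shoulder_pitch
    [-180.0,  10.0],   # shoulder_roll
    [ -90.0,  90.0],   # elbow_yaw
    [-125.0,   0.0],   # elbow_pitch
    [-100.0, 100.0],   # wrist_roll
    [ -45.0,  45.0],   # wrist_pitch
    [ -30.0,  30.0],   # wrist_yaw
    [   0.0,   1.0],   # gripper
])

def to_unit_range(action: np.ndarray) -> np.ndarray:
    """Scale an 8-dim right-arm action into [-1, 1] (sketch only)."""
    lo, hi = JOINT_LIMITS[:, 0], JOINT_LIMITS[:, 1]
    return 2.0 * (action - lo) / (hi - lo) - 1.0
```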
Scene analysis output from Cosmos Reason 2:

```json
{
  "objects": [
    {"object": "cube", "color": "red", "position": "center-right"},
    {"object": "box", "color": "white", "position": "center"}
  ],
  "gr00t_instruction": "Pick up the red cube and place it in the white box"
}
```

Project structure:

```
NOVA/
├── scripts/
│   ├── pipeline_gui.py              # Gradio web interface
│   ├── data_collector.py            # Dataset collection
│   ├── eval_closed_loop.py          # Policy evaluation in simulation
│   ├── parakeet_asr.py              # Voice transcription
│   ├── cosmos_reason.py             # Scene understanding
│   ├── cosmos_server.py             # HTTP server for Cosmos
│   ├── cosmos_client.py             # HTTP client for Cosmos
│   └── robot_pipeline.py            # Full Voice→Reason→Act
├── configs/
│   ├── reachy2_modality_config.py   # GR00T modality config
│   └── modality_reachy2.json        # Action space definition
├── Isaac-GR00T/                     # NVIDIA Isaac-GR00T (submodule)
├── libs/
│   └── reachy2_mujoco/              # Pollen Robotics simulator (submodule)
├── requirements.txt                 # Python dependencies
└── README.md
```
Note: Checkpoints and datasets are downloaded separately from HuggingFace.
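One way to fetch them is with `huggingface_hub`; a sketch, assuming the `ganatrask/NOVA` repo id from the Resources table at the end of this README (the exact file layout on the Hub may differ):

```python
from huggingface_hub import snapshot_download

# Download the fine-tuned checkpoint snapshot into the expected local path
snapshot_download(
    repo_id="ganatrask/NOVA",
    local_dir="checkpoints/groot-reachy2-pickplace",
)
```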
This section explains how to fine-tune GR00T N1.6 on your own dataset and add support for new robot embodiments.
GR00T uses "embodiment tags" to identify different robots. To add your own:
Apply the patch to add your embodiment tag:
```bash
cd Isaac-GR00T
# Apply the REACHY2 patch (or create your own)
patch -p1 < ../patches/add_reachy2_embodiment.patch
```

The patch modifies two files:

`gr00t/data/embodiment_tags.py` - Add enum entry:

```python
class EmbodimentTag(Enum):
    # ... existing tags ...
    REACHY2 = "reachy2"
    """
    The Pollen Robotics Reachy 2 humanoid robot.
    """
```

`gr00t/model/gr00t_n1d6/processing_gr00t_n1d6.py` - Add projector index:

```python
EMBODIMENT_TAG_TO_PROJECTOR_INDEX = {
    # ... existing mappings ...
    "new_embodiment": 10,
    "reachy2": 11,  # Use index 10-15 for custom robots
}
```

Create a modality configuration that defines your robot's action space:
`configs/reachy2_modality_config.py`:

```python
from gr00t.configs.data.embodiment_configs import register_modality_config
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.data.types import (
    ActionConfig, ActionFormat, ActionRepresentation,
    ActionType, ModalityConfig,
)

register_modality_config(
    config={
        "video": ModalityConfig(
            delta_indices=[0],
            modality_keys=["front_cam"],  # Your camera key
        ),
        "state": ModalityConfig(
            delta_indices=[0],
            modality_keys=["arm_joints"],  # Your state key
        ),
        "action": ModalityConfig(
            delta_indices=list(range(0, 16)),  # Action horizon
            modality_keys=["arm_joints", "gripper"],
            action_configs=[
                ActionConfig(
                    rep=ActionRepresentation.RELATIVE,
                    type=ActionType.NON_EEF,
                    format=ActionFormat.DEFAULT,
                    state_key="arm_joints",
                ),
                ActionConfig(
                    rep=ActionRepresentation.ABSOLUTE,
                    type=ActionType.NON_EEF,
                    format=ActionFormat.DEFAULT,
                ),
            ],
        ),
        "language": ModalityConfig(
            delta_indices=[0],
            modality_keys=["annotation.human.task_description"],
        ),
    },
    embodiment_tag=EmbodimentTag.REACHY2,  # Your tag
)
```

Use the data collector to gather demonstrations:
```bash
# Collect 100 episodes with domain randomization
python scripts/data_collector.py \
    --episodes 100 \
    --output reachy2_dataset \
    --arm both \
    --randomize-object \
    --randomize-color \
    --cameras front_cam workspace_cam
```

Dataset format follows LeRobot v2.1:
```
reachy2_dataset/
├── meta/
│   ├── info.json              # Dataset metadata
│   ├── stats.json             # Normalization statistics
│   ├── tasks.jsonl            # Task descriptions
│   └── episodes.jsonl         # Episode info
├── data/chunk-000/
│   └── episode_*.parquet      # State/action data
└── videos/chunk-000/
    └── observation.images.*/
        └── episode_*.mp4      # Camera videos
```
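A quick sanity check on a freshly collected dataset, using only the standard library and the layout above (metadata keys can vary across LeRobot versions):

```python
import json
from pathlib import Path

root = Path("reachy2_dataset")
info = json.loads((root / "meta" / "info.json").read_text())
episodes = sorted((root / "data" / "chunk-000").glob("episode_*.parquet"))
print(f"{len(episodes)} episodes, fps={info.get('fps')}")
```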
```bash
# Clone and setup Isaac-GR00T
git clone --recurse-submodules https://github.com/NVIDIA/Isaac-GR00T
cd Isaac-GR00T

# Apply embodiment patch
patch -p1 < ../patches/add_reachy2_embodiment.patch

# Install dependencies
conda create -n groot python=3.10 -y
conda activate groot
pip install torch torchvision
pip install flash-attn --no-build-isolation
pip install -e .

# Login to HuggingFace
huggingface-cli login
```

```bash
# Full training (2x A100, 30K steps)
python -m gr00t.train \
    --dataset_repo_id ganatrask/reachy2_100 \
    --embodiment_tag reachy2 \
    --video_backend decord \
    --num_gpus 2 \
    --batch_size 64 \
    --max_steps 30000 \
    --save_steps 3000 \
    --output_dir ./checkpoints/groot-reachy2
```

| Resource | Quick Test | Full Training |
|---|---|---|
| CPU Cores | 8 | 32 |
| Memory | 32 GiB | 128 GiB |
| GPU | 1x A100 | 2x A100 |
| Wall Time | 1 hour | 6 hours |
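For the quick-test column, a scaled-down run of the same command should work. The flags below mirror the full run; the step, batch, and save counts are illustrative guesses, not a project-provided recipe:

```bash
# Illustrative quick test on a single A100
python -m gr00t.train \
    --dataset_repo_id ganatrask/reachy2_100 \
    --embodiment_tag reachy2 \
    --video_backend decord \
    --num_gpus 1 \
    --batch_size 32 \
    --max_steps 3000 \
    --save_steps 1000 \
    --output_dir ./checkpoints/groot-reachy2-quicktest
```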
```python
from gr00t.data.embodiment_tags import EmbodimentTag
from gr00t.policy.gr00t_policy import Gr00tPolicy

# Load modality config first (must happen before policy init)
import importlib.util

spec = importlib.util.spec_from_file_location(
    "modality_config",
    "configs/reachy2_modality_config.py",
)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Load policy
policy = Gr00tPolicy(
    embodiment_tag=EmbodimentTag.REACHY2,
    model_path="checkpoints/groot-reachy2/checkpoint-30000",
    device="cuda",
    strict=True,
)

# Run inference
obs = {
    "video": {"front_cam": image[None, None, :, :, :]},
    "state": {"arm_joints": joints[None, None, :]},
    "language": {"annotation.human.task_description": [["Pick up the cube"]]},
}
action, _ = policy.get_action(obs)
```
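The `image` and `joints` arrays in the snippet above are placeholders. Dummy inputs of plausible shape, assuming 640×480 RGB from `front_cam` and a 16-dim state for two 8-DOF arms (check your dataset's `meta/stats.json` for the real state dimension):

```python
import numpy as np

image = np.zeros((480, 640, 3), dtype=np.uint8)  # H x W x C camera frame
joints = np.zeros(16, dtype=np.float32)          # assumed two-arm state dim
```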
| Error | Solution |
|---|---|
| `REACHY2 not found` | Apply the patch: `patch -p1 < patches/add_reachy2_embodiment.patch` |
| `Already registered` | Modality config loaded twice; add a guard clause to prevent re-registration (see sketch below) |
| OOM on A100 | Reduce the batch size: `--batch_size 32` |
| `torchcodec not available` | Use `--video_backend decord` |
| Duplicate enum key | Re-clone Isaac-GR00T and apply the patch once |
| `'reachy2' not in projector index` | Add the mapping to `processing_gr00t_n1d6.py` (see patch) |
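For the `Already registered` row, one concrete guard is to cache the config module in `sys.modules` so its `register_modality_config()` call runs only once per process, even when two entry points both load it. A minimal sketch; the helper name is ours, not part of the repo:

```python
import importlib.util
import sys

def load_reachy2_config_once(path="configs/reachy2_modality_config.py"):
    """Execute the modality config at most once per process."""
    name = "reachy2_modality_config"
    if name in sys.modules:  # already loaded: skip re-registration
        return sys.modules[name]
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # register before exec to survive re-entry
    spec.loader.exec_module(module)
    return module
```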
| Metric | Value |
|---|---|
| Dataset collection rate | ~2 episodes/min |
| GR00T inference | ~40ms/step (A100) |
| Cosmos reasoning | ~500ms/query |
| Parakeet transcription | 50x real-time |
| Resource | URL |
|---|---|
| GitHub | ganatrask/NOVA |
| Model | ganatrask/NOVA |
| Dataset | ganatrask/reachy2_100 |
| GR00T N1.6 | nvidia/GR00T-N1.6-3B |
| Cosmos Reason 2 | nvidia-cosmos/cosmos-reason2 |
| Parakeet ASR | nvidia/parakeet-ctc-0.6b |
| reachy2_mujoco | pollen-robotics/reachy2_mujoco |
| LeRobot | huggingface/lerobot |
| Component | Specification |
|---|---|
| Training GPU | NVIDIA A100-SXM4-80GB |
| VRAM Usage | 44GB / 80GB |
| Training Time | ~6 hours (30K steps) |
| Inference | Works on RTX 5090 |
- Pollen Robotics - Reachy 2 humanoid robot & MuJoCo simulation
- NVIDIA - GR00T N1.6, Cosmos Reason 2, Parakeet ASR
- HuggingFace - LeRobot framework & model hosting
- DeepMind - MuJoCo physics engine
- VESSL AI - GPU compute credits for model training
This project uses NVIDIA models under their respective licenses:
- NVIDIA GR00T N1.6: NVIDIA Open Model License
- NVIDIA Cosmos Reason 2: License
- NVIDIA Parakeet ASR: CC-BY-4.0