Inference code for the Unified Video Action (UVA) model.

```bash
git clone https://github.com/cfeng16/uva_baseline.git
cd uva_baseline
pip install -e .
```

Policy output shape: `(pred_horizon, 10)`, denormalized absolute actions.

| Dims | Description |
|---|---|
| 0-2 | Position (x, y, z) |
| 3-8 | Rotation (6D representation) |
| 9 | Gripper |
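
To make the layout in the table concrete, here is a minimal sketch that slices a predicted action array into its three components. The function name `split_action` is hypothetical and is not part of the repo; only the dimension ranges come from the table above.

```python
import numpy as np


def split_action(action):
    """Split (..., 10) actions into position, 6D rotation, and gripper parts.

    `split_action` is an illustrative helper, not a function from this repo.
    """
    action = np.asarray(action)
    if action.shape[-1] != 10:
        raise ValueError(f"expected last dim of 10, got {action.shape[-1]}")
    position = action[..., 0:3]   # dims 0-2: x, y, z
    rot6d = action[..., 3:9]      # dims 3-8: 6D rotation representation
    gripper = action[..., 9:10]   # dim 9: gripper
    return position, rot6d, gripper


# Example with a placeholder array shaped like a 16-step prediction.
dummy = np.zeros((16, 10), dtype=np.float32)
position, rot6d, gripper = split_action(dummy)
print(position.shape, rot6d.shape, gripper.shape)
```

Keeping the gripper slice as `9:10` rather than `9` preserves the trailing axis, so the three parts can be concatenated back into a 10D action without reshaping.
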
```python
from inference_droid import UVAPolicy

# Load policy
policy = UVAPolicy("path/to/checkpoint.ckpt", device="cuda:0")

# Prepare observations.
# `images` is a 16-frame history clip, not a single image.
obs_dict = {
    "agentview_rgb": images,  # (16, 3, 224, 224) float32 in [0, 1]
}

# Predict action with language instruction
action = policy.predict(obs_dict, language_instruction="pick up the red cup")
print(f"Action shape: {action.shape}")  # typically (16, 10) for UniFlow stage2

# Execute only the first 8 actions, then replan from the next 16-frame history clip.
exec_action = action[:8]
```

Note: For the UniFlow stage2 checkpoint, the policy takes 16 observed frames as input, and the current wrapper returns the full predicted action horizon, typically 16 future 10D actions. In rollout, execute only the first 8 actions, then collect a fresh 16-frame history and call the policy again.

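
The predict, execute 8, replan cycle described above can be sketched as a small loop. This is only an illustration under assumptions: `rollout`, `get_history_fn`, and `step_fn` are hypothetical names, and `predict_fn` stands in for `policy.predict`. How frames are captured and how actions are sent to the robot depends on your setup.

```python
def rollout(predict_fn, get_history_fn, step_fn, instruction,
            max_steps, exec_horizon=8):
    """Receding-horizon rollout: predict a chunk, execute a prefix, replan.

    All callables are placeholders supplied by the caller:
      predict_fn(obs_dict, language_instruction=...) -> sequence of actions
      get_history_fn() -> current 16-frame history clip
      step_fn(action) -> sends one action to the robot or simulator
    """
    steps = 0
    while steps < max_steps:
        obs_dict = {"agentview_rgb": get_history_fn()}
        action = predict_fn(obs_dict, language_instruction=instruction)
        chunk = action[:exec_horizon]  # execute only the first actions
        if len(chunk) == 0:
            raise RuntimeError("policy returned no actions")
        for single_action in chunk:
            if steps >= max_steps:
                break
            step_fn(single_action)
            steps += 1
    return steps
```

With `max_steps=20` and the default `exec_horizon=8`, this loop calls the policy three times and executes 8, 8, and 4 actions.
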
At the beginning of an episode, if you have fewer than 16 real frames, pad the history by repeating the earliest available frame. This matches UVA's multistep observation padding behavior.
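
A minimal sketch of that padding rule, assuming frames are stacked along the first axis with the oldest frame first. The helper name `pad_history` is hypothetical and not part of the repo.

```python
import numpy as np


def pad_history(frames, history_len=16):
    """Return exactly `history_len` frames, oldest first.

    If fewer frames are available, the earliest frame is repeated at the
    front. If more are available, only the most recent ones are kept.
    `pad_history` is an illustrative helper, not a function from this repo.
    """
    frames = np.asarray(frames)
    num_frames = frames.shape[0]
    if num_frames == 0:
        raise ValueError("need at least one frame to build a history")
    if num_frames >= history_len:
        return frames[-history_len:]
    padding = np.repeat(frames[:1], history_len - num_frames, axis=0)
    return np.concatenate([padding, frames], axis=0)


# Example: three real frames at the start of an episode.
first_frames = np.zeros((3, 3, 224, 224), dtype=np.float32)
history = pad_history(first_frames)
print(history.shape)  # (16, 3, 224, 224)
```
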
The checkpoint predicts UVA-format 10D actions:

```
pos(3) + rot6d(6) + gripper(1)
```

If a downstream UniFlow controller expects the older 7D action format:

```
pos(3) + euler_xyz(3) + gripper(1)
```

use the helper below:

```python
from inference_droid import uva_action_10d_to_uniflow_7d_euler_xyz

action_10d = policy.predict(obs_dict, language_instruction="pick up the cup")
exec_action_10d = action_10d[:8]
exec_action_7d = uva_action_10d_to_uniflow_7d_euler_xyz(exec_action_10d)
```

If you use this Euler-conversion helper, make sure pytorch3d is installed in your environment, because the helper uses the repo's rotation utility. The main inference path does not need Euler conversion.

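
For readers unfamiliar with the 6D rotation representation, the sketch below shows the Gram-Schmidt step that such representations commonly use to recover a rotation matrix. It is not the repo's helper. The function name `rot6d_to_matrix` is hypothetical, and the choice to stack the basis vectors as rows is an assumption, so check the repo's rotation utility before relying on it. The Euler axis order is left out for the same reason.

```python
import numpy as np


def rot6d_to_matrix(rot6d):
    """Convert (..., 6) rotation vectors to (..., 3, 3) rotation matrices.

    Illustrative only. ASSUMPTION: the two 3-vectors are orthonormalized
    with Gram-Schmidt and stacked as rows together with their cross product.
    """
    rot6d = np.asarray(rot6d, dtype=np.float64)
    a1, a2 = rot6d[..., :3], rot6d[..., 3:6]
    # First basis vector: normalize a1.
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Second basis vector: remove the b1 component from a2, then normalize.
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    # Third basis vector: completes a right-handed frame.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-2)


# Example: the 6D vector made of the x and y unit axes.
print(rot6d_to_matrix([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
```

The result is orthonormal with determinant 1 for any input whose two halves are non-zero and not parallel, which is why the 6D form can be predicted by a network without explicit constraints.
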
Start server:

```bash
python inference_droid.py -i path/to/checkpoint.ckpt --port 8766 --device cuda:0
```

Client:

```python
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://server_ip:8766")

obs_dict = {
    "agentview_rgb": images,
    "language_instruction": "pick up the cup",
}
socket.send_pyobj(obs_dict)
action = socket.recv_pyobj()  # typically (16, 10); execute action[:8]
```
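
A plain `REQ` socket waits indefinitely in `recv_pyobj()` if the server is down or unreachable. The sketch below wraps the same request in a helper with a receive timeout, using standard pyzmq socket options. The name `request_action` and the default timeout are assumptions for illustration and are not part of the repo.

```python
import zmq


def request_action(endpoint, obs_dict, timeout_ms=5000):
    """Send one observation dict and return the reply, or raise TimeoutError.

    `request_action` is an illustrative helper, not a function from this repo.
    A fresh REQ socket is used per call, because a REQ socket that timed out
    while waiting for a reply cannot send again.
    """
    context = zmq.Context.instance()
    socket = context.socket(zmq.REQ)
    socket.setsockopt(zmq.LINGER, 0)             # do not block on close
    socket.setsockopt(zmq.RCVTIMEO, timeout_ms)  # give up waiting after this
    socket.connect(endpoint)
    try:
        socket.send_pyobj(obs_dict)
        return socket.recv_pyobj()
    except zmq.Again as exc:
        raise TimeoutError(
            f"no reply from {endpoint} within {timeout_ms} ms"
        ) from exc
    finally:
        socket.close()
```

Usage would look like `action = request_action("tcp://server_ip:8766", obs_dict)`, followed by `action[:8]` as before.
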