UVA Baseline

Inference code for the Unified Video Action (UVA) model.

Installation

git clone https://github.com/cfeng16/uva_baseline.git
cd uva_baseline
pip install -e .

Action Format

Policy output shape: (pred_horizon, 10). Actions are denormalized and absolute.

Dims	Description
0-2	Position (x, y, z)
3-8	Rotation (6D representation)
9	Gripper
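
As a minimal sketch of how the layout above might be unpacked, the snippet below slices a (pred_horizon, 10) array into its three parts and recovers a rotation matrix from the 6D representation by Gram-Schmidt orthonormalization. The function names `split_action` and `rot6d_to_matrix` are hypothetical and not part of this repo. The sketch assumes the six rotation values are the first two rows of the rotation matrix (the pytorch3d convention); the repo's own rotation utility is the authority on the actual convention.

```python
import numpy as np


def split_action(action):
    """Split (..., 10) actions into position, 6D rotation, and gripper.

    Hypothetical helper; slice boundaries follow the Dims table above.
    """
    action = np.asarray(action, dtype=np.float64)
    if action.shape[-1] != 10:
        raise ValueError(f"expected last dim 10, got {action.shape[-1]}")
    position = action[..., 0:3]   # dims 0-2
    rot6d = action[..., 3:9]      # dims 3-8
    gripper = action[..., 9]      # dim 9
    return position, rot6d, gripper


def rot6d_to_matrix(rot6d):
    """Convert (..., 6) rotations to (..., 3, 3) matrices via Gram-Schmidt.

    ASSUMPTION: the six values are the first two rows of the matrix.
    """
    rot6d = np.asarray(rot6d, dtype=np.float64)
    a1, a2 = rot6d[..., :3], rot6d[..., 3:]
    # Normalize the first vector.
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Remove the component of a2 along b1, then normalize.
    b2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = b2 / np.linalg.norm(b2, axis=-1, keepdims=True)
    # The third row completes a right-handed frame.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-2)
```

With an array returned by the policy, `position, rot6d, gripper = split_action(action)` followed by `rot6d_to_matrix(rot6d)` would give one 3x3 matrix per predicted step under the stated assumption.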

Usage

from inference_droid import UVAPolicy

# Load policy
policy = UVAPolicy("path/to/checkpoint.ckpt", device="cuda:0")

# Prepare observations.
# `images` is a 16-frame history clip, not a single image.
obs_dict = {
    "agentview_rgb": images,  # (16, 3, 224, 224) float32 in [0, 1]
}

# Predict action with language instruction
action = policy.predict(obs_dict, language_instruction="pick up the red cup")

print(f"Action shape: {action.shape}")  # typically (16, 10) for UniFlow stage2

# Execute only the first 8 actions, then replan from the next 16-frame history clip.
exec_action = action[:8]

Note: For the UniFlow stage2 checkpoint, the policy takes 16 observed frames as input and the current wrapper returns the full predicted action horizon, typically 16 future 10D actions. In rollout, only the first 8 actions should be executed before collecting a fresh 16-frame history and calling the policy again.

At the beginning of an episode, if you have fewer than 16 real frames, pad the history by repeating the earliest available frame. This matches UVA's multistep observation padding behavior.
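
The padding and replanning scheme described above could be sketched roughly as follows. `pad_history` and `rollout` are hypothetical helpers, not part of this repo, and `get_frame` and `apply_action` stand in for whatever camera and robot interfaces you use. The sketch also assumes one new frame is captured per executed action, which may not match your control and camera rates.

```python
from collections import deque

import numpy as np

HISTORY_LEN = 16  # frames the policy expects as input
EXEC_STEPS = 8    # actions executed before replanning


def pad_history(frames, history_len=HISTORY_LEN):
    """Return a (history_len, ...) clip, repeating the earliest frame if short."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one real frame")
    frames = frames[-history_len:]  # keep only the most recent frames
    missing = history_len - len(frames)
    # Prepend copies of the earliest available frame.
    return np.stack([frames[0]] * missing + frames, axis=0)


def rollout(predict, get_frame, apply_action, instruction, num_replans):
    """Receding-horizon loop: predict, execute the first 8 actions, replan.

    `predict` is expected to behave like `policy.predict` in Usage.
    ASSUMPTION: one frame is captured after each executed action.
    """
    history = deque(maxlen=HISTORY_LEN)  # drops the oldest frame automatically
    history.append(get_frame())
    for _ in range(num_replans):
        obs_dict = {"agentview_rgb": pad_history(history)}
        actions = predict(obs_dict, language_instruction=instruction)
        for action in actions[:EXEC_STEPS]:
            apply_action(action)
            history.append(get_frame())
```

A call such as `rollout(policy.predict, get_frame, apply_action, "pick up the red cup", num_replans=10)` would execute 80 actions in total under these assumptions. The first call sees one real frame repeated 16 times, and the history is fully real from the second replan onward.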

UniFlow 7D Euler Output

The checkpoint predicts UVA-format 10D actions:

pos(3) + rot6d(6) + gripper(1)

If a downstream UniFlow controller expects the older 7D action format:

pos(3) + euler_xyz(3) + gripper(1)

use the helper below:

from inference_droid import uva_action_10d_to_uniflow_7d_euler_xyz

action_10d = policy.predict(obs_dict, language_instruction="pick up the cup")
exec_action_10d = action_10d[:8]
exec_action_7d = uva_action_10d_to_uniflow_7d_euler_xyz(exec_action_10d)

This Euler-conversion helper requires pytorch3d, so make sure it is installed in your environment before calling it. The main inference path does not need Euler conversion; only this helper relies on the repo's rotation utility.

ZMQ Server Mode

Start server:

python inference_droid.py -i path/to/checkpoint.ckpt --port 8766 --device cuda:0

Client:

import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
# Replace server_ip with the server's hostname or IP; the port must match --port.
socket.connect("tcp://server_ip:8766")

obs_dict = {
    "agentview_rgb": images,
    "language_instruction": "pick up the cup"
}
socket.send_pyobj(obs_dict)
action = socket.recv_pyobj()  # typically (16, 10); execute action[:8]
