This repository contains the implementation of the World-Value-Action Model (WAV), a world-model-based framework. The central idea is to unify instruction-conditioned video prediction, trajectory value estimation, action decoding, and latent trajectory planning within a single multi-view diffusion transformer for long-horizon robotic manipulation.
Instead of planning directly in action space, WAV performs iterative inference in a compact latent trajectory space. This design biases sampling toward feasible futures, allows trajectory-level evaluation before action execution, and improves long-horizon decision making in both simulated and real-world settings.
- A unified multi-view transformer backbone with video, value, and action experts.
- A three-stage training recipe: task-specific video adaptation, trajectory value learning, and action post-training.
- Latent trajectory planning at inference time through iterative elite reweighting in latent space.
- Support for LIBERO closed-loop evaluation, open-loop validation, and real-world deployment.
- Release inference & training code
- Release model weights
Direct action prediction is often insufficient for long-horizon embodied tasks because it provides limited trajectory-level reasoning. WAV addresses this by first imagining future visual trajectories, then evaluating their long-horizon quality, and finally decoding executable robot actions from optimized trajectory features.
The paper motivates this design from a model-based planning perspective:
- direct action-space planning suffers from vanishing feasible mass as the horizon grows;
- latent planning reweights probability mass toward feasible trajectories;
- iterative latent inference is necessary to concentrate samples on high-value futures.
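The iterative elite reweighting used for latent planning is closely related to cross-entropy-method-style optimization. The following is a toy, self-contained sketch of that idea, not the paper's actual sampler: the latent dimensionality, population sizes, and value function are all stand-ins.

```python
import random
import statistics

def elite_reweight_plan(value_fn, dim=4, pop=64, n_elite=8, iters=5, seed=0):
    """CEM-style sketch: iteratively refit a Gaussian to the highest-value latents."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [1.0] * dim
    for _ in range(iters):
        # Sample candidate latent trajectories from the current distribution.
        cands = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        # Keep the elite fraction with the highest trajectory value.
        elites = sorted(cands, key=value_fn, reverse=True)[:n_elite]
        # Reweight: refit the sampling distribution to the elites, so the next
        # round concentrates probability mass on high-value futures.
        mean = [statistics.fmean(z[i] for z in elites) for i in range(dim)]
        std = [max(statistics.pstdev([z[i] for z in elites]), 1e-3) for i in range(dim)]
    return mean

# Toy value function: prefer latents near a fixed target trajectory.
TARGET = [0.5, -0.3, 0.2, 0.8]
value = lambda z: -sum((a - b) ** 2 for a, b in zip(z, TARGET))
best = elite_reweight_plan(value)
```

Each iteration prunes low-value samples and re-centers the proposal distribution, which is the "concentrate samples on high-value futures" behavior described above.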
WAV decomposes planning and control into three tightly coupled modules:
- Instruction-conditioned video generation. A multi-view diffusion transformer predicts future visual trajectories conditioned on history frames and language instructions.
- Trajectory value estimation. A value expert evaluates candidate futures and provides the trajectory-level signal used for latent planning.
- Action decoding. An action expert predicts executable action chunks from optimized video and value features, optionally conditioned on robot state.
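At a glance, the three modules compose into an imagine-evaluate-decode loop. The sketch below uses stubbed callables purely to illustrate the call pattern; the real model interfaces live in the diffusion-transformer code, and `plan_and_act` and its arguments are hypothetical names.

```python
def plan_and_act(video_model, value_model, action_model, obs, instruction, n_candidates=8):
    """Imagine candidate futures, score them, decode actions from the best one."""
    # 1. Instruction-conditioned video generation: sample candidate futures.
    futures = [video_model(obs, instruction) for _ in range(n_candidates)]
    # 2. Trajectory value estimation: score each candidate future.
    _, best_future = max(((value_model(f), f) for f in futures), key=lambda vf: vf[0])
    # 3. Action decoding: predict an executable action chunk from the best future.
    return action_model(best_future, obs)

# Stub models that only illustrate the interfaces, not real networks.
video_model = lambda obs, instr: {"frames": [obs], "instr": instr}
value_model = lambda future: float(len(future["frames"]))
action_model = lambda future, obs: [0.0] * 7  # e.g. one 7-DoF action step
actions = plan_and_act(video_model, value_model, action_model,
                       obs=[0.0], instruction="pick up the cup")
```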
git clone https://github.com/Win-commit/WAV.git
cd WAV
conda create -n wav python=3.10.4
conda activate wav
pip install -r requirements.txt

Please download the following pretrained weights before training or inference:
- LTX_video_part
- GE_base
After downloading the checkpoints, please update the corresponding paths in your config file:
pretrained_model_name_or_path: PATH/TO/LTX_video_part
diffusion_model:
model_path: PATH/TO/GE_base_fast.safetensors
This codebase uses a LeRobot-like layout. A typical dataset is organized as:
ROOT_PATH/
└── DATASETNAME/
├── data/
│ └── chunk-000/
│ ├── episode_000000.parquet
│ └── ...
├── meta/
│ ├── episodes.jsonl
│ ├── tasks.jsonl
│   └── info.json
│
└── videos/
└── chunk-000/
├── CAMERA_A/
│ ├── episode_000000.mp4
│ └── ...
└── CAMERA_B/
├── episode_000000.mp4
└── ...
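Under this layout, episode files can be located programmatically. The helper below is a small illustration assuming the `chunk-NNN` / `episode_NNNNNN` naming shown above; the function name and signature are ours, not part of the codebase.

```python
from pathlib import Path

def episode_paths(root, dataset, camera, episode, chunk=0):
    """Return (parquet, mp4) paths for one episode in the LeRobot-like layout."""
    base = Path(root) / dataset
    chunk_dir = f"chunk-{chunk:03d}"       # e.g. chunk-000
    ep = f"episode_{episode:06d}"          # e.g. episode_000000
    parquet = base / "data" / chunk_dir / f"{ep}.parquet"
    video = base / "videos" / chunk_dir / camera / f"{ep}.mp4"
    return parquet, video

parquet, video = episode_paths("/datasets", "DATASETNAME", "CAMERA_A", episode=0)
```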
We provide scripts/get_statistics.py to compute normalization statistics:
python scripts/get_statistics.py \
--data_root PATH/TO/YOUR/DATASET/data/ \
--data_name DATASETNAME \
--data_type eef \
--action_key actions \
--state_key state \
--value_key state_value \
--save_path PATH/TO/YOUR/DATASET/meta/stats.jsonl

After running the script, you will get a jsonl file of statistics. Specify the path of this jsonl file in your configs:
data:
train:
...
stat_file: PATH/OF/FILE.jsonl
val:
...
stat_file: PATH/OF/FILE.jsonl
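If you want to inspect or reproduce the statistics file by hand, the sketch below computes per-dimension mean/std for each key and writes one JSON object per line. The exact schema (field names, extra statistics such as min/max) may differ from what `scripts/get_statistics.py` produces, so treat this as illustrative only.

```python
import json
import os
import tempfile

def write_stats(episodes, keys, path):
    """Compute per-dimension mean/std for each key over all timesteps; write jsonl."""
    with open(path, "w") as f:
        for key in keys:
            rows = [row[key] for ep in episodes for row in ep]  # flatten timesteps
            n, dims = len(rows), len(rows[0])
            mean = [sum(r[d] for r in rows) / n for d in range(dims)]
            std = [(sum((r[d] - mean[d]) ** 2 for r in rows) / n) ** 0.5
                   for d in range(dims)]
            f.write(json.dumps({"key": key, "mean": mean, "std": std}) + "\n")

# Tiny demo: one episode with two timesteps of a 2-D "actions" signal.
demo = [[{"actions": [1.0, 2.0]}, {"actions": [3.0, 4.0]}]]
out = os.path.join(tempfile.gettempdir(), "stats_demo.jsonl")
write_stats(demo, ["actions"], out)
rec = json.loads(open(out).read().splitlines()[0])
```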
For unseen robots or customized new tasks, we recommend performing this video adaptation step to achieve better performance.
i. Modify the config in configs/ltx_model/*/video_model.yaml (more dataset details can be found in data/*_dataset.py):
data:
train / val:
data_roots: [ROOT_PATH_TO_YOUR_DATASETS, ]
domains: [DATASETNAME, ]
# rewrite to the camera names used in your dataset
valid_cam: ["observation.images.top_head", "observation.images.hand_left", "observation.images.hand_right"]
...
ii. Disable the value and action experts as below in configs/ltx_model/*/video_model.yaml:
return_video: True
return_value: False
return_action: False
train_mode: 'video_only'
diffusion_model:
config:
value_expert: False
action_expert: False
iii. Run
bash scripts/train.sh main.py configs/ltx_model/*/video_model.yaml
i. Modify the config in configs/ltx_model/*/value_model.yaml
diffusion_model:
model_path: PATH_TO_VIDEO_POST_TRAINING_CHECKPOINT_SAFETENSOR
data:
train / val:
data_roots: [ROOT_PATH_TO_YOUR_DATASETS, ]
domains: [DATASETNAME, ]
# rewrite to the camera names used in your dataset
valid_cam: ["observation.images.top_head", "observation.images.hand_left", "observation.images.hand_right"]
# rewrite to the keys used in your dataset
value_key: "state_value"
value_dense: True
...
More dataset details can be found in data/*_dataset.py.
ii. Enable the value expert as below in configs/ltx_model/*/value_model.yaml:
return_video: False
return_value: True
return_action: False
train_mode: 'value_only'
diffusion_model:
config:
value_expert: True
action_expert: False
noisy_video: True
iii. Run
bash scripts/train.sh main.py configs/ltx_model/*/value_model.yaml
i. Modify the config in configs/ltx_model/*/policy_model.yaml
diffusion_model:
model_path: PATH_TO_VALUE_POST_TRAINING_CHECKPOINT_SAFETENSOR
data:
train / val:
data_roots: [ROOT_PATH_TO_YOUR_DATASETS, ]
domains: [DATASETNAME, ]
# rewrite to the camera names used in your dataset
valid_cam: ["observation.images.top_head", "observation.images.hand_left", "observation.images.hand_right"]
# rewrite to the keys used in your dataset
action_key: "action"
state_key: "observation.state"
action_type: "absolute" # "absolute", "delta" or "relative"
action_space: "joint"
...
More dataset details can be found in data/*_dataset.py.
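The `action_type` field above distinguishes absolute targets from per-step deltas. The conversion below is a hedged sketch of one common convention (delta anchored at the previous action, starting from the initial state); the repository's exact convention, including what "relative" is relative to, should be checked in data/*_dataset.py.

```python
def absolute_to_delta(actions, initial_state):
    """delta[t] = actions[t] - actions[t-1], with the first delta taken from initial_state."""
    deltas, prev = [], initial_state
    for a in actions:
        deltas.append([x - p for x, p in zip(a, prev)])
        prev = a
    return deltas

def delta_to_absolute(deltas, initial_state):
    """Inverse of absolute_to_delta: cumulatively sum deltas from the initial state."""
    abs_actions, prev = [], initial_state
    for d in deltas:
        prev = [p + x for p, x in zip(prev, d)]
        abs_actions.append(list(prev))
    return abs_actions

# 1-D example trajectory: the round trip recovers the absolute actions exactly.
actions = [[1.0], [3.0], [2.0]]
deltas = absolute_to_delta(actions, [0.0])
```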
ii. Enable the action expert as below in configs/ltx_model/*/policy_model.yaml:
return_video: False
return_value: True
return_action: True
train_mode: 'action_full'
diffusion_model:
config:
value_expert: True
action_expert: True
noisy_video: True
iii. Run
bash scripts/train.sh main.py configs/ltx_model/*/policy_model.yaml
bash scripts/infer.sh \
main.py \
PATH/TO/CONFIG \
PATH/TO/CHECKPOINT \
PATH/TO/OUTPUTS \
DomainName

This path is useful for quick qualitative inspection and open-loop video/value/action prediction.
We provide both WebSocket-based serving and HTTP-based robot deployment.
python3 web_infer_scripts/main_server.py \
-c PATH/TO/CONFIG \
-w PATH/TO/CHECKPOINT \
--host 0.0.0.0 \
--port $PORT \
--domain_name $DOMAIN_NAME \
--action_dim $ACTION_DIM \
--norm_type $NORM_TYPE \
--device 0

A minimal test client is available in web_infer_scripts/simple_client.py.
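A client sends observations and receives action chunks from the server. The request schema sketched below is purely illustrative (the field names `images`, `state`, and `instruction` are assumptions on our part); the authoritative message format is defined by web_infer_scripts/simple_client.py and main_server.py.

```python
import json

def build_request(images, state, instruction):
    """Assemble one inference request as a JSON string (hypothetical field names)."""
    return json.dumps({
        "images": images,           # e.g. base64-encoded camera frames keyed by view
        "state": state,             # current robot proprioceptive state
        "instruction": instruction, # natural-language task description
    })

payload = build_request({"top_head": "<b64>"}, [0.0] * 7, "fold the towel")
```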
python3 web_infer_utils/Real_deploy.py \
-c PATH/TO/CONFIG \
-w PATH/TO/CHECKPOINT \
--host 0.0.0.0 \
--port $PORT \
--domain_name $DOMAIN_NAME \
--action_dim $ACTION_DIM \
--norm_type $NORM_TYPE \
--device 0

This codebase builds on the Genie-Envisioner implementation.
If you find this project useful, please consider citing the paper once the public version is released.
@article{li2026world,
title={World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems},
author={Li, Runze and Zhang, Hongyin and Jin, Junxi and Zeng, Qixin and Zhuang, Zifeng and Tang, Yiqi and Lyu, Shangke and Wang, Donglin},
journal={arXiv preprint arXiv:2604.14732},
year={2026}
}


