PoCo: Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
Binyuan Huang1, Yuning Lu2,†, Weinan Jia2, Hualiang Wang3, Mu Liu4, Daiqing Yang1
1Wuhan University | 2University of Science and Technology of China | 3Hong Kong University of Science and Technology | 4Tsinghua University
†Corresponding author
PoCo (Position Embedding as a Context Controller) targets multi-reference, multi-shot video generation, where visually or semantically overlapping references often lead to reference confusion and incorrect shot-to-reference grounding.
PoCo addresses this with:
- SideInfo-RoPE: extends 3D-RoPE with a side-information axis (e.g., `@character_i`) to enable reference-aware attention routing beyond text-only semantic cues.
- Hierarchical Cross-Attention (HCA): applies a hierarchical text-visibility mask, where reference tokens attend globally while shot tokens attend only to their corresponding shot captions.
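The text-visibility rule in HCA can be illustrated with a small mask builder. This is a minimal sketch under stated assumptions, not the released implementation: the function name `hierarchical_text_mask`, the `REFERENCE` sentinel, and the per-token group-id layout are all hypothetical, and the real model may organize tokens differently.

```python
import numpy as np

REFERENCE = -1  # hypothetical group id marking reference-image tokens


def hierarchical_text_mask(query_group, text_shot):
    """Build a boolean mask of shape (num_queries, num_text_tokens).

    query_group[i] is REFERENCE for a reference token, else the shot index
    of visual token i. text_shot[j] is the shot index whose caption contains
    text token j. True means query i may attend to text token j.
    """
    query_group = np.asarray(query_group)
    text_shot = np.asarray(text_shot)
    # Shot tokens see only the caption tokens of their own shot.
    same_shot = query_group[:, None] == text_shot[None, :]
    # Reference tokens see every caption token.
    is_reference = (query_group == REFERENCE)[:, None]
    return same_shot | is_reference


# Two reference tokens, then two visual tokens each for shots 0 and 1.
query_group = [REFERENCE, REFERENCE, 0, 0, 1, 1]
# Three caption tokens for shot 0, two caption tokens for shot 1.
text_shot = [0, 0, 0, 1, 1]
mask = hierarchical_text_mask(query_group, text_shot)
```

In a cross-attention layer, a mask like this would typically be applied by setting the logits at `False` positions to negative infinity before the softmax.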
- [2026-02-21] PoCo was accepted to CVPR 2026.
- [2026-04-04] Released the project page and arXiv version.
- [2026-04-04] Released the initial GitHub repository.
- arXiv paper
- Project page
- Initial GitHub repository
- Core model code
GIF previews are shown below. Click each preview to open the full .mp4.
| Case | @Character1 | @Character2 | Video |
|---|---|---|---|
| Case 1 | ![]() | ![]() | ![]() |
| Case 2 | ![]() | ![]() | ![]() |
| Case 3 | ![]() | ![]() | ![]() |
These cases highlight scenarios where references share highly overlapping semantic attributes, making shot-level identity control difficult for text-only semantic routing.
A: Shot1 -> @Character1, Shot2 -> @Character2, Shot3 -> @Character1
B: Shot1 -> @Character2, Shot2 -> @Character1, Shot3 -> @Character2
| Case | @Character1 | @Character2 | Video A | Video B |
|---|---|---|---|---|
| Hard Case 1 | ![]() | ![]() | ![]() | ![]() |
| Hard Case 2 | ![]() | ![]() | ![]() | ![]() |
PoCo integrates reference images, per-shot captions, and latent video features in a multi-shot diffusion transformer pipeline. Its core modules are:
- SideInfo-RoPE: introduces a side-information axis in rotary embeddings to disambiguate reference identity during attention.
- HCA: applies hierarchical caption masking to preserve both global reference conditioning and per-shot text alignment.
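To make the idea of a side-information axis concrete, the sketch below applies a rotary embedding over four axes (time, height, width, side) instead of three. It is an illustration under assumptions, not the authors' code: the function names, the even split of the head dimension across axes, the frequency base of 10000, and the use of an integer side index per token (for example, one index per `@character_i`) are all hypothetical choices.

```python
import numpy as np


def axis_angles(pos, dim, base=10000.0):
    """Rotation angles for one axis, shape (num_tokens, dim // 2)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(pos, freqs)


def sideinfo_rope(x, positions, axis_dims):
    """Rotate x of shape (num_tokens, head_dim).

    positions has shape (num_tokens, 4) with columns (t, h, w, side).
    axis_dims gives the even number of channels assigned to each axis
    and must sum to head_dim.
    """
    angles = np.concatenate(
        [axis_angles(positions[:, i], d) for i, d in enumerate(axis_dims)],
        axis=1,
    )  # (num_tokens, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Standard 2D rotation of each (even, odd) channel pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q = rng.standard_normal((1, 16))
k = rng.standard_normal((1, 16))
axis_dims = (4, 4, 4, 4)  # assumed equal split across t, h, w, side

pos_q = np.array([[2, 1, 3, 1]])  # a shot token tagged with side index 1
pos_k = np.array([[0, 0, 0, 2]])  # a reference token tagged with side index 2
score = sideinfo_rope(q, pos_q, axis_dims) @ sideinfo_rope(k, pos_k, axis_dims).T

# Shifting both tokens by the same offset leaves the attention logit unchanged.
shift = np.array([[5, 5, 5, 5]])
score_shifted = (
    sideinfo_rope(q, pos_q + shift, axis_dims)
    @ sideinfo_rope(k, pos_k + shift, axis_dims).T
)
```

As with ordinary RoPE, the attention logit depends only on the position difference along each axis, so in this sketch whether a shot token and a reference token share a side index changes the score independently of the text conditioning.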
- Binyuan Huang: by_huang@whu.edu.cn
- Yuning Lu (Corresponding Author): luyuningx@gmail.com

















