GitHub - byhuang123/PoCo: [CVPR2026] Official implementation of our paper “Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation”

PoCo: Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation

Binyuan Huang¹, Yuning Lu²,†, Weinan Jia², Hualiang Wang³, Mu Liu⁴, Daiqing Yang¹

¹Wuhan University | ²University of Science and Technology of China | ³Hong Kong University of Science and Technology | ⁴Tsinghua University

†Corresponding author

PoCo (Position Embedding as a Context Controller) targets multi-reference, multi-shot video generation, where visually or semantically overlapping references often lead to reference confusion and incorrect shot-to-reference grounding.

PoCo addresses this with:

SideInfo-RoPE: extends 3D-RoPE with a side-information axis (e.g., @character_i) to enable reference-aware attention routing beyond text-only semantic cues.
Hierarchical Cross-Attention (HCA): applies a hierarchical text-visibility mask, where reference tokens attend globally while shot tokens attend only to their corresponding shot captions.
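
The core model code is not yet part of this repository, so the following is only a minimal sketch of the side-information idea, not the authors' implementation. It assumes the side axis is handled like a fourth rotary axis next to (t, h, w), with each axis rotating its own slice of the head channels. The function names, the channel split `dims`, and the integer side IDs are all hypothetical.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate channel pairs by pos-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def side_info_rope(vec, t, h, w, side, dims=(4, 4, 4, 4)):
    """Hypothetical 4-axis RoPE: one channel slice each for t, h, w, and side info."""
    assert len(vec) == sum(dims)
    out, start = [], 0
    for pos, d in zip((t, h, w, side), dims):
        out += rope_rotate(vec[start:start + d], pos)
        start += d
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Identical content vectors, so any score difference comes from position only.
content = [1.0] * 16
query = side_info_rope(content, t=3, h=0, w=0, side=1)      # shot token tagged with side ID 1
key_same = side_info_rope(content, t=0, h=0, w=0, side=1)   # reference with side ID 1
key_other = side_info_rope(content, t=0, h=0, w=0, side=2)  # reference with side ID 2
print(dot(query, key_same) > dot(query, key_other))
```

Because rotary embeddings make the attention logit depend on relative offsets, a query and key that share a side ID see no rotation along that axis, while a mismatched pair is rotated apart. This gives attention a routing signal that does not rely on the text description of the reference.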

🔥 News

[2026-02-21] PoCo was accepted by CVPR 2026.
[2026-04-04] Released the project page and arXiv version.
[2026-04-04] Released the initial GitHub repository.

🚀 Release Plan

arXiv paper
Project page
Initial GitHub repository
Core model code

🎬 Demo Videos

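GIF previews are shown below. Click each preview to open the full .mp4.

Case	@Character1	@Character2	Video
Case 1	(reference image)	(reference image)	(GIF preview)
Case 2	(reference image)	(reference image)	(GIF preview)
Case 3	(reference image)	(reference image)	(GIF preview)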

Hard Cases: Identity Exchange Control Under Semantic Overlap

These cases highlight scenarios where references share highly overlapping semantic attributes, making shot-level identity control difficult for text-only semantic routing.

A: Shot1 -> @Character1, Shot2 -> @Character2, Shot3 -> @Character1
B: Shot1 -> @Character2, Shot2 -> @Character1, Shot3 -> @Character2
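
As an illustration of how the two assignments differ, the sketch below expands each one into a sequence of side-information IDs, one per frame. This is an assumption about how such a mapping could feed a side-information axis; the `SIDE_ID` table, the per-frame granularity, and the function name are hypothetical and not taken from the released code.

```python
# Hypothetical integer IDs for the two reference handles.
SIDE_ID = {"@Character1": 1, "@Character2": 2}

# Shot -> reference assignments from configurations A and B.
CASE_A = {1: "@Character1", 2: "@Character2", 3: "@Character1"}
CASE_B = {1: "@Character2", 2: "@Character1", 3: "@Character2"}

def per_frame_side_ids(assignment, frames_per_shot):
    """Expand a shot -> reference assignment into one side ID per frame."""
    ids = []
    for shot in sorted(assignment):
        ids += [SIDE_ID[assignment[shot]]] * frames_per_shot
    return ids

print(per_frame_side_ids(CASE_A, 2))  # [1, 1, 2, 2, 1, 1]
print(per_frame_side_ids(CASE_B, 2))  # [2, 2, 1, 1, 2, 2]
```

The two configurations produce complementary ID sequences even when the shot captions are kept fixed, which is the kind of control that text-only routing struggles to provide when the references look or read alike.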

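Case	@Character1	@Character2	Video A	Video B
Hard Case 1	(reference image)	(reference image)	(GIF preview)	(GIF preview)
Hard Case 2	(reference image)	(reference image)	(GIF preview)	(GIF preview)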

🧩 Pipeline

PoCo integrates reference images, per-shot captions, and latent video features in a multi-shot diffusion transformer pipeline. Its core modules are:

SideInfo-RoPE: introduces a side-information axis in rotary embeddings to disambiguate reference identity during attention.
HCA: applies hierarchical caption masking to preserve both global reference conditioning and per-shot text alignment.
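
The hierarchical caption mask can be sketched as a boolean matrix over (visual token, text token) pairs. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the per-shot captions are concatenated into one text sequence, that each text token is labeled with its shot index, and that each visual token is labeled either `None` (a reference token) or with its shot index. The function name and labeling scheme are hypothetical.

```python
def build_hca_mask(visual_owner, text_owner):
    """Return mask[i][j], True when visual token i may attend to text token j.

    visual_owner: None for a reference token, or the shot index of a shot token.
    text_owner:   the shot index of the caption each text token belongs to.
    """
    return [
        [owner is None or owner == caption for caption in text_owner]
        for owner in visual_owner
    ]

# One reference token, then one token from shot 0 and one from shot 1.
visual_owner = [None, 0, 1]
# Two caption tokens for shot 0, one for shot 1.
text_owner = [0, 0, 1]

for row in build_hca_mask(visual_owner, text_owner):
    print(row)
```

The reference row is all `True` (global visibility), while each shot row is `True` only over its own caption tokens. In a real attention layer the `False` entries would typically be filled with a large negative value before the softmax.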

📬 Contact

Binyuan Huang: by_huang@whu.edu.cn
Yuning Lu (Corresponding Author): luyuningx@gmail.com
