byhuang123/PoCo

PoCo: Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation

Project Page | arXiv Paper | CVPR 2026

Binyuan Huang¹, Yuning Lu²†, Weinan Jia², Hualiang Wang³, Mu Liu⁴, Daiqing Yang¹

¹Wuhan University | ²University of Science and Technology of China | ³Hong Kong University of Science and Technology | ⁴Tsinghua University

† Corresponding author

PoCo (Position Embedding as a Context Controller) targets multi-reference, multi-shot video generation, where visually or semantically overlapping references often lead to reference confusion and incorrect shot-to-reference grounding.

PoCo addresses this with:

  • SideInfo-RoPE: extends 3D-RoPE with a side-information axis (e.g., @character_i) to enable reference-aware attention routing beyond text-only semantic cues.
  • Hierarchical Cross-Attention (HCA): applies a hierarchical text-visibility mask, where reference tokens attend globally while shot tokens attend only to their corresponding shot captions.
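
To make the side-information axis concrete, here is a minimal pure-Python sketch, not the released PoCo code. It assumes the channels are split into four equal groups rotated by the positions (t, h, w, side) in that order; the helper names `rope_rotate`, `sideinfo_rope`, and `dot` are hypothetical, and the actual channel split and frequency schedule used by PoCo may differ.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of vec by angles pos * freq_i (standard RoPE)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = pos * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def sideinfo_rope(vec, t, h, w, side):
    """Rotate four equal channel groups by the (t, h, w, side) positions.

    The equal split and the axis order are assumptions of this sketch.
    """
    g = len(vec) // 4
    out = []
    for k, pos in enumerate((t, h, w, side)):
        out += rope_rotate(vec[k * g:(k + 1) * g], pos)
    return out

def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))

# A shot token and two reference tokens with identical features and identical (t, h, w).
# Only the side index differs, so text or appearance alone could not tell them apart.
feat = [0.5, -1.0, 0.3, 0.8, 1.2, -0.4, 0.7, 0.1]
shot_tok = sideinfo_rope(feat, 0, 0, 0, side=1)  # shot tagged with @character_1
ref_1 = sideinfo_rope(feat, 0, 0, 0, side=1)
ref_2 = sideinfo_rope(feat, 0, 0, 0, side=2)
score_match = dot(shot_tok, ref_1)
score_other = dot(shot_tok, ref_2)
```

In this toy setting `score_match` exceeds `score_other`, because the rotary score depends only on the relative side-index offset and peaks at offset zero when the features are identical. That is the sense in which a side axis can route attention between references that look alike.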

🔥 News

  • [2026-02-21] PoCo was accepted by CVPR 2026.
  • [2026-04-04] Released the project page and arXiv version.
  • [2026-04-04] Released the initial GitHub repository.

🚀 Release Plan

  • arXiv paper (released 2026-04-04)
  • Project page (released 2026-04-04)
  • Initial GitHub repository (released 2026-04-04)
  • Core model code

🎬 Demo Videos

GIF previews are shown below. Click each preview to open the full .mp4.

| Case | @Character1 | @Character2 | Video |
| --- | --- | --- | --- |
| Case 1 | Case 1 reference 1 | Case 1 reference 2 | Case 1 GIF preview |
| Case 2 | Case 2 reference 1 | Case 2 reference 2 | Case 2 GIF preview |
| Case 3 | Case 3 reference 1 | Case 3 reference 2 | Case 3 GIF preview |

Hard Cases: Identity Exchange Control Under Semantic Overlap

These cases highlight scenarios where references share highly overlapping semantic attributes, making shot-level identity control difficult for text-only semantic routing.

A: Shot1 -> @Character1, Shot2 -> @Character2, Shot3 -> @Character1
B: Shot1 -> @Character2, Shot2 -> @Character1, Shot3 -> @Character2
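
The two configurations above can be read as per-shot lookup tables. The snippet below is only an illustrative sketch: `parse_assignment` is a hypothetical helper that parses the notation used in this README, and it says nothing about how PoCo itself consumes these assignments.

```python
import re

def parse_assignment(spec):
    """Parse 'Shot1 -> @Character1, ...' into a {shot_index: character_index} dict."""
    pairs = re.findall(r"Shot(\d+)\s*->\s*@Character(\d+)", spec)
    return {int(shot): int(char) for shot, char in pairs}

config_a = parse_assignment(
    "Shot1 -> @Character1, Shot2 -> @Character2, Shot3 -> @Character1"
)
config_b = parse_assignment(
    "Shot1 -> @Character2, Shot2 -> @Character1, Shot3 -> @Character2"
)

# B assigns the other character in every shot, so A and B differ only in identity routing.
fully_swapped = all(config_a[shot] != config_b[shot] for shot in config_a)
```

Because A and B differ only in which character each shot points to, comparing Video A with Video B isolates identity control from the rest of the generation.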

| Case | @Character1 | @Character2 | Video A | Video B |
| --- | --- | --- | --- | --- |
| Hard Case 1 | Hard Case 1 reference 1 | Hard Case 1 reference 2 | Hard Case 1 Video A GIF preview | Hard Case 1 Video B GIF preview |
| Hard Case 2 | Hard Case 2 reference 1 | Hard Case 2 reference 2 | Hard Case 2 Video A GIF preview | Hard Case 2 Video B GIF preview |

🧩 Pipeline

Figure: PoCo pipeline.

PoCo integrates reference images, per-shot captions, and latent video features in a multi-shot diffusion transformer pipeline. Its core modules are:

  • SideInfo-RoPE: introduces a side-information axis in rotary embeddings to disambiguate reference identity during attention.
  • HCA: applies hierarchical caption masking to preserve both global reference conditioning and per-shot text alignment.
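
The hierarchical caption masking described above can be sketched as a boolean visibility matrix between visual query tokens and caption tokens. This is a minimal sketch under stated assumptions, not the released implementation: `hca_text_mask` is a hypothetical helper, reference tokens are marked with `None`, and every caption token is assumed to belong to exactly one shot.

```python
def hca_text_mask(query_shot_ids, text_shot_ids):
    """Build a text-visibility mask with one row per query token.

    query_shot_ids: None for a reference token, else the shot index of a shot token.
    text_shot_ids:  the shot index that each caption token belongs to.
    mask[i][j] is True when query token i may attend to caption token j.
    """
    return [
        [q is None or q == t for t in text_shot_ids]
        for q in query_shot_ids
    ]

# One reference token, two tokens from shot 0, one token from shot 1.
queries = [None, 0, 0, 1]
# Two caption tokens for shot 0, then two caption tokens for shot 1.
captions = [0, 0, 1, 1]
mask = hca_text_mask(queries, captions)
```

In a typical attention implementation, entries that are `False` would be set to negative infinity before the softmax, so reference tokens see every caption while each shot token sees only its own shot's caption.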

📬 Contact

About

[CVPR2026] Official implementation of our paper “Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation”
