Wan2.2 TI2V training

I noticed that when DreamZero is trained with the Wan2.2 TI2V backbone, the CLIP embedding of the first video frame is injected via cross-attention. As far as I understand, this differs from the standard conditioning setup in vanilla Wan2.2 TI2V. Could you clarify the motivation behind this design choice?