Hi, thanks for your great work!
I have some questions about your released i2v models. Based on unet_i2vgen.py and #49 , I understand that i2vgen-xl is a single-stage model that takes both the image and the text as conditions during video generation. However, according to the technical report, i2vgen is a two-stage diffusion model that takes the image as the condition in the base stage and the text in the refinement stage. I am therefore curious about the role the input text plays in this single-stage generation process. What is the difference between the reported two-stage model and the released one-stage model? Can the one-stage model be regarded as an image animator guided by the input text, similar to PIA (https://arxiv.org/pdf/2312.13964.pdf)? Additionally, I would like to know which dataset was used to train this open-source model.
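For context, here is a minimal sketch of how I currently invoke the released single-stage model, which shows both the image and the text prompt being passed as conditions in a single call. This assumes the diffusers port (`I2VGenXLPipeline` with the `ali-vilab/i2vgen-xl` checkpoint) rather than this repo's native inference scripts, and the image path and prompt are placeholders:

```python
# Minimal sketch: single-stage I2VGen-XL conditioned on BOTH an image and a text prompt.
# Assumes the diffusers port (I2VGenXLPipeline) and the "ali-vilab/i2vgen-xl" checkpoint;
# "example_input.png" and the prompt are hypothetical placeholders.
import torch
from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image

pipeline = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()

image = load_image("example_input.png").convert("RGB")  # image condition
prompt = "a dog running on the beach at sunset"         # text condition

generator = torch.manual_seed(0)
frames = pipeline(
    prompt=prompt,            # text guidance
    image=image,              # first-frame / content guidance
    num_inference_steps=50,
    guidance_scale=9.0,
    generator=generator,
).frames[0]

export_to_gif(frames, "output.gif")
```

Since both conditions enter the same denoising pass here, I am trying to understand how this relates to the separate base/refinement conditioning described in the report.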
Thank you!
Hello, thank you for your interest in our work. We have open-sourced the single-stage I2VGen-XL model here, which is capable of fully retaining the content of the input image. Its training data is the same as that of the two-stage model. Our primary intention in open-sourcing this model is to support the community's research. Currently, there are no plans to open-source the two-stage version of I2VGen-XL. However, our HiGen method will be open-sourced soon; it includes the two-stage process and can serve as an alternative. Thank you for your attention.