
Incorrect FID-VID and FVD #25

Fanghaipeng opened this issue Jul 28, 2023 · 15 comments

@Fanghaipeng

Thanks for the great work, @Wangt-CN.

I tried to reproduce the results using "gen_eval.sh," but I noticed that the FID-VID and FVD do not match the results reported in the paper. Can you help me with this issue? Is it possible that I am using the incorrect checkpoints?

[Screenshot: evaluation output, 2023-07-28 14:11]

Downloaded checkpoints:
Model checkpoint (.pth): TikTok Training Data (FID-FVD: 18.8)

FID-VID: resnet-50-kinetics.pth : "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-50-kinetics.pth"

FVD: i3d_pretrained_400.pt : "https://drive.google.com/file/d/1mQK8KD8G6UWRa5t87SRMm5PVXtlpneJT/edit"
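In case it helps others set up the same evaluation, here is a minimal sketch of how these two backbone checkpoints could be fetched programmatically. The output filenames are my own choice, and the Google Drive step assumes a recent gdown; downloading via a browser or wget works just as well.

```python
# Convenience sketch for fetching the two evaluation backbones listed above.
# Output filenames are arbitrary; the Google Drive file id is taken from the link above.
import urllib.request
import gdown  # pip install gdown (recent versions support the id= keyword)

# 3D ResNet-50 (Kinetics) weights used for FID-VID, hosted as a GitHub release asset.
urllib.request.urlretrieve(
    "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-50-kinetics.pth",
    "resnet-50-kinetics.pth",
)

# I3D weights used for FVD, hosted on Google Drive.
gdown.download(id="1mQK8KD8G6UWRa5t87SRMm5PVXtlpneJT", output="i3d_pretrained_400.pt")
```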

@zhangtao22

Which checkpoint did you use to evaluate?

@zhangtao22

@Fanghaipeng Thanks!

@zhangtao22

exp_folder=$1
pred_folder="${2:-${exp_folder}/pred_gs3.0_scale-cond1.0-ref1.0}"
gt_folder=${3:-${exp_folder}/gt}
What did you assign to these parameters when you executed gen_eval.sh? How do they correspond to that checkpoint?

@fwbx529

fwbx529 commented Aug 8, 2023

Hi, I have a similar problem.
I ran the evaluation using this script (with the More TikTok-Style Training Data (FID-FVD: 15.7) checkpoint):

AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet/tiktok_S256L16_xformers_tsv.py --eval_visu --root_dir run_test --local_train_batch_size 32 --local_eval_batch_size 32 --log_dir exp/tiktok_ft --epochs 20 --deepspeed --eval_step 500 --save_step 500 --gradient_accumulate_steps 1 --learning_rate 2e-4 --fix_dist_seed --loss_target "noise" --train_yaml /root/autodl-tmp/DisCo/data/composite_offset/train_TiktokDance-poses-masks.yaml --val_yaml /root/autodl-tmp/DisCo/data/composite_offset/new10val_TiktokDance-poses-masks.yaml --unet_unfreeze_type "all" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask --conds "poses" "masks" --eval_save_filename outputs/ --guidance_scale 1.5 --pretrained_model /root/autodl-tmp/DisCo/checkpoints/moretiktok_cfg/mp_rank_00_model_states.pt
and performed the FVD calculation using sh gen_eval.sh run_test/exp/tiktok_ft/outputs
(with exp_folder=$1, pred_folder="${2:-${exp_folder}/pred_gs1.5_scale-cond1.0-ref1.0}", gt_folder=${3:-${exp_folder}/gt})
the outputs are:
{"FID": 28.28052071476202} {"FVD-3DRN50": 20.343344017827633, "FVD-3DInception": 496.9720583513131} {"L1": 0.00036944843636224373, "SSIM": 0.6731481792654774, "LPIPS": 0.2868661347549863, "PSNR": 29.18716913133768, "clean-fid": 28.280520714758126}

FID/L1/SSIM/LPIPS/PSNR are similar to the paper's values, but FVD-3DRN50 (FID-VID) and FVD-3DInception (FVD) are different. (This is the DisCo † (w/ HAP, CFG) setting.)

Also, let me attach some GIFs generated for FVD, to check whether the generated results are correct.
[Attached GIFs: TiktokDance_00337_0001, TiktokDance_202_006_1x1_00000, TiktokDance_201_002_1x1_00000]
The code runs the evaluation on videos 337/338/201/202/203.

@fwbx529

fwbx529 commented Aug 8, 2023

BTW, I found that the GIF generation using imageio at 3 fps (gen_eval.sh) results in a tbr of 24.25, so when ffmpeg converts it to video, the original 16-frame GIF turns into a 128-frame video.
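For what it's worth, here is the rough arithmetic behind that frame-count blow-up (my own sketch, not code from the repo):

```python
# A 16-frame GIF written at 3 fps spans 16 / 3 ≈ 5.33 s of wall-clock time.
# If ffmpeg re-times it at a tbr of ~24.25 fps when converting to video, the output
# ends up with roughly 5.33 * 24.25 ≈ 129 frames instead of the original 16.
n_frames_gif = 16
gif_fps = 3
video_tbr = 24.25

duration_s = n_frames_gif / gif_fps              # ≈ 5.33 s
n_frames_video = round(duration_s * video_tbr)   # ≈ 129 frames
print(f"{duration_s:.2f} s of GIF -> ~{n_frames_video} frames after re-timing")
```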

I tried changing the 3 fps to 25 fps in gen_eval.sh, and the results become even weirder: {"FVD-3DRN50": 96.4811157462753, "FVD-3DInception": 382.5581035752218}
I also don't quite understand sample_duration=16 when dealing with a 16-frame GIF. Does it effectively mean nothing, i.e., the whole 16-frame video is used for evaluation? Or should the correct version be the original 128-frame video, split into 8 clips?

PS: the above results were generated using PyTorch 2.0 (with some code changes for loading the ckpt) and other newer packages. When I reproduce the exact pip package versions from the README, the '25 fps' FVD-3DRN50 is 96.15072454699538, and the other results are similar, as stated in #36.

@linziqu

linziqu commented Aug 21, 2023

I am facing a similar issue!
I tried this script (with the More TikTok-Style Training Data (FID-FVD: 15.7) checkpoint), but I cannot get the results reported in the paper.
How can I generate videos similar to those provided in the project?

@Delicious-Bitter-Melon

Delicious-Bitter-Melon commented Aug 29, 2023

I also cannot reproduce the results using "gen_eval.sh" with the FID-VID: 18.86 model provided by the official implementation. When using the default guidance scale 3.0, my result is {'FVD-3DRN50': 21.664065154647858, 'FVD-3DInception': 567.6111442260626}. And using the optimal guidance scale 1.5 as reported in the paper, my result is {'FVD-3DRN50': 23.933738779128873, 'FVD-3DInception': 564.9114347158875}, compared to the paper's FID-VID 18.86 and FVD 393.34.

@Wangt-CN
Owner

Wangt-CN commented Sep 12, 2023

Dear all, so sorry for the delay; I could not access the computing resources for this project after my internship ended in July. A few days ago, I got temporary access and revisited this codebase. I used a completely new environment to make sure that this codebase can be reproduced in most setups.

  1. For the image evaluation metrics, there seems to be no confusion (btw, for FID, we use the pytorch-fid package to report the results).

  2. For the video metrics, first of all, I can successfully reproduce our paper results (I used the DisCo model, not DisCo+, and I will further verify the DisCo+ model) [@Fanghaipeng @linziqu @Yongssss]:

| Metrics | FID-FVD | FVD |
| --- | --- | --- |
| Paper | 18.86 | 393.34 |
| Reproduce | 19.28 | 385.93 |

And we used this resnet-kinetics and i3d checkpoint model (under eval_fvd). I think the different results may be due to a different checkpoint model, which I forgot to sync from the corporate storage.

  3. Moreover, after checking @fwbx529's comments (thanks @fwbx529!), we found that the current GIF generation process is indeed sub-optimal and that different video formats may cause totally different results. (But note that we used this computing script for all the models, so the comparison is still fair.) To address this possible issue, we revised the code and follow mvcd to directly use frames (rather than generating an additional video file) for the FVD computation. Here are the brief new results:
| Metrics | FID-FVD | FVD |
| --- | --- | --- |
| Dreampose | 80.51 | 551.56 |
| DisCo | 59.90 | 292.80 |
| DisCo+ | 55.17 | 267.75 |

We can see that for both the baseline and our model, we get better FVD but higher FID-FVD. (PS: we use the same generated frames for calculating both the previous metric in the paper and this new metric.) We plan to use this new metric calculation to avoid confusion, and we have updated the evaluation code in the latest commit (note: if you want to reproduce the previous results, do not pull the latest commit and just download the FVD pretrained models). We will update the paper ASAP. For reference, the Fréchet-distance step shared by both video metrics is sketched at the end of this comment.

If you meet any further problems about the reproduction, please comment here.
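For readers digging into the numbers: the final step of both FVD-3DRN50 (FID-VID) and FVD-3DInception (FVD) is a Fréchet distance between the feature statistics of real and generated clips. Below is a minimal sketch of just that step, assuming per-clip feature arrays have already been extracted with the 3D ResNet-50 / I3D backbones; it is illustrative only, not the repo's actual evaluation code, which also handles clip sampling and checkpoint loading.

```python
# Minimal Frechet-distance sketch for FVD / FID-VID, assuming feature arrays of
# shape (num_clips, feature_dim) already extracted by the video backbone.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts
    # that can appear from numerical error.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```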

@asdasdad738

I cannot reproduce the results with the checkpoint (TikTok Training Data), using the updated evaluation code and the provided vision models for the FVD metric.
I use the following commands to run inference and compute the metric:

NCCL_ASYNC_ERROR_HANDLING=0 python finetune_sdm_yaml.py \
--cf config/ref_attn_clip_combine_controlnet/tiktok_S256L16_xformers_tsv.py \
--eval_visu --root_dir DisCo-main/ --local_train_batch_size 128 --local_eval_batch_size 128 \
--log_dir exp/tiktok-cfg \
--epochs 20 --deepspeed --eval_step 500 --save_step 500 --gradient_accumulate_steps 1 \
--learning_rate 2e-4 --fix_dist_seed --loss_target "noise"  \
--train_yaml TSV_dataset/composite_offset/train_TiktokDance-poses-masks.yaml \
--val_yaml TSV_dataset/composite_offset/new10val_TiktokDance-poses-masks.yaml \
--unet_unfreeze_type "null" --guidance_scale 1.5 --drop_ref 0.05 --refer_sdvae --ref_null_caption False \
--combine_clip_local --combine_use_mask --conds "poses" "masks" \
--pretrained_model pretrained_model/tiktok-cfg.pt \
--eval_save_filename tiktok-cfg-check

sh gen_eval.sh exp/tiktok-cfg/tiktok-cfg-check
And the result is:
{"FVD-3DRN50": 95.92242708773176, "FVD-3DInception": 407.55894385101624}

@Wangt-CN
Owner

Wangt-CN commented Nov 3, 2023

Hi, we can try to find the issue. There are actually two steps to get the results (a. generate the images; b. compute the metric).

  1. Here are the generated images (pred folder) from my reproduction. Could you please download them, run gen_eval.sh on this prediction, and see if you get the ~60 FVD result?
  2. Then we will know whether the issue is in the generation process or the evaluation process. Btw, did you follow the repo installation instructions to install the packages?

@asdasdad738

asdasdad738 commented Nov 9, 2023

Sorry for the late reply. I have tried the images you gave me and got the correct result:
{"FVD-3DRN50": 60.145482643211295, "FVD-3DInception": 294.8605752686153}.
It seems that some factor in the generation process affects the result. I took some time to look for the reason. Here are some results that I obtained:
[Screenshot: FVD results obtained with different numbers of GPUs]

When I use 4 GPUs, I get a result relatively close to 60. I have no way to test with 8 GPUs; maybe you have tried running inference with fewer than 8 GPUs? The number of GPUs seems to affect the test result, but I haven't found out why. Maybe it's because of mpirun? I'm not sure.

@fwbx529

fwbx529 commented Nov 15, 2023

Hi, sorry for the late reply; I am currently working on another project, not focusing on video generation.
When I tested FVD before, I used a ckpt downloaded online (as I mentioned in another issue), and I believe the number difference is due to that.
For the fps problem, I think @Wangt-CN has given very detailed corrections and results above. Thanks for the reply!

@LucasLOOT

@Wangt-CN
In the context of the paper, was the FVD metric computed for video reconstruction tasks or image animation tasks? Based on the surrounding context, it seems that the FVD metric was evaluated in the context of image animation tasks. My question pertains to the absence of ground truth in image animation tasks. How is the comparison of feature distribution distance between generated and real videos handled in the absence of ground truth? Specifically, if the generated videos are created by driving source frames with a pose sequence, where are the corresponding real videos obtained? Are they directly sourced from the videos corresponding to the source frames?

@Fanghaipeng
Author

> Hi, we can try to find the issue. There are actually two steps to get the results (a. generate the images; b. compute the metric).
>
>   1. Here are the generated images (pred folder) from my reproduction. Could you please download them, run gen_eval.sh on this prediction, and see if you get the ~60 FVD result?
>   2. Then we will know whether the issue is in the generation process or the evaluation process. Btw, did you follow the repo installation instructions to install the packages?

When I used 4 NVIDIA A100 GPUs with batch_size=2 and nframe=16, and ran gen_eval_tm.sh, I obtained results similar to @asdasdad738: "FVD-3DRN50": 95.32, "FVD-3DInception": 409.01. Additionally, when I used the "pred folder" provided by @Wangt-CN, I got the correct results: "FVD-3DRN50": 60.16, "FVD-3DInception": 294.88. Therefore, I suspect that this issue is caused during the generation process. Can you help me solve this problem? @Wangt-CN
