
Problem in reproducing fid score #22

Closed · howtowhy opened this issue Apr 6, 2023 · 9 comments
Labels: help wanted (Extra attention is needed)

Comments


howtowhy commented Apr 6, 2023

Thank you for the amazing work again.
I am having some trouble reproducing the FID score.

  1. I failed to download the tfrecord files, so I used the public 7 GB 256 dataset as a zip file
     (https://www.kaggle.com/datasets/denislukovnikov/ffhq256-images-only).
  2. I assume the ffhq256x256.zip file format is the same as the preprocessed one.
  3. I trained with this data and checked the training score.
  4. The FID goes up during training, from 20 to 25.

Could you give me some advice about this situation? I will try again with the 1024 (90 GB) version following your advice. Also, if you could give me some personal help, please email me at howtowhy@gmail.com; I would really appreciate it. I really want to reference your work, but the dataset problem is not easy for me.

Thank you so much.

Xiaoming-Zhao (Collaborator) commented Apr 7, 2023

> I assume the ffhq256x256.zip file format is the same as the preprocessed one.

I am not sure about this one, as I have never tried the link you referred to.

> I trained with this data and checked the training score.

This is strange. Do you mind sharing your hardware setup, e.g., the number of GPUs? For GAN training, the batch size matters a lot. If possible, please use the same batch size as specified in the paper.

> I really want to reference your work, but the dataset problem is not easy for me.

I am sorry, and I definitely understand that this is a headache. By the way, I recently came across a tool, rclone, which interacts well with Google Drive (on a remote server, etc.). Please refer to its documentation for how to set it up. The only part that needs some effort is setting up a Google Cloud API, which is not hard if you follow the instructions.
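For reference, a minimal sketch of pulling the data with rclone from Python, assuming you have already run `rclone config` and created a Google Drive remote named "gdrive" (the remote path below is a placeholder, not the actual dataset location):

```python
# Minimal sketch: copy a Google Drive folder to a local directory with rclone.
# Assumes an rclone remote named "gdrive" has already been configured; the
# source path is a placeholder.
import subprocess

subprocess.run(
    ["rclone", "copy", "gdrive:path/to/ffhq_tfrecords", "./data/ffhq", "--progress"],
    check=True,  # raise if rclone exits with a non-zero status
)
```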

Hope these help.

Xiaoming-Zhao added the help wanted label on Apr 8, 2023

parkjh688 commented Apr 11, 2023

Hi, @Xiaoming-Zhao!

I'm facing a similar issue. I trained my model with the following configuration: 256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002}, using the ffhq256x256 dataset from Kaggle. However, my FID score keeps increasing after the first 1000 steps, as shown in the graph below:

[Image: FID curve during training]

In the original paper, the model was trained with a batch size of 64, so I also tried that, but the FID score still increased. Previously, when I trained with a batch size of 8, the FID score was around 20 and remained stable until the end of 5000 steps. However, this time, even though I started with a lower FID score of 18, it kept increasing after around 1000 steps.

I'm wondering if using ffhq256x256 data from Kaggle instead of real 256-sized data could be causing overfitting, since I think the Kaggle set is a little smaller. Are there any other possible reasons for this behavior?

Thanks!

Xiaoming-Zhao (Collaborator) commented Apr 11, 2023

Hi @parkjh688, I need a bit more information if possible.

> In the original paper, the model was trained with a batch size of 64, so I also tried that

How many GPUs did you use to train GMPI? The reason I am asking is that the batch_size specified in curriculums.py is the batch size per GPU. The batch size of 64 stated in the paper comes from 8 (per GPU) x 8 (#GPUs) = 64.
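As a quick sanity check, here is a tiny sketch (with hypothetical values) of how the per-GPU batch_size in curriculums.py relates to the batch size reported in the paper:

```python
# Hypothetical values: batch_size in curriculums.py is per GPU, so the
# effective (paper) batch size is per_gpu_batch_size * num_gpus.
per_gpu_batch_size = 8
num_gpus = 8
effective_batch_size = per_gpu_batch_size * num_gpus
print(effective_batch_size)  # 64, the batch size stated in the paper
```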

> Previously, when I trained with a batch size of 8, the FID score was around 20 and remained stable until the end of 5000 steps.

What dataset did you use for that training? And how many GPUs were you using?

> I'm wondering if using ffhq256x256 data from Kaggle instead of real 256-sized data could be causing overfitting.

One caveat I can see is that the Kaggle dataset uses a different resizing method from the one the pre-trained StyleGAN2 uses. Specifically:
a. The Kaggle one uses bicubic, as specified on its webpage.
b. StyleGAN2 is trained on images downscaled with Lanczos (see here).

So you may want to process the dataset following the official instructions to double-check.
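For reference, a minimal sketch (not the repo's official preprocessing script; the folder names are placeholders) of downscaling FFHQ images to 256x256 with Lanczos via Pillow, matching the filter StyleGAN2 was trained with:

```python
# Minimal sketch: resize 1024x1024 FFHQ images to 256x256 with Lanczos.
# "ffhq1024" and "ffhq256_lanczos" are placeholder folder names.
from pathlib import Path
from PIL import Image

src_dir = Path("ffhq1024")
dst_dir = Path("ffhq256_lanczos")
dst_dir.mkdir(exist_ok=True)

for path in sorted(src_dir.glob("*.png")):
    img = Image.open(path).convert("RGB")
    img = img.resize((256, 256), resample=Image.LANCZOS)  # Lanczos, not bicubic
    img.save(dst_dir / path.name)
```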

> I trained my model with the following configuration: 256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002}

I noticed that you have a batch_split of 16. This means that you split the 64 images into 16 mini-batches and accumulate gradients 16 times. Theoretically, this should be fine. However, I would recommend using a single forward pass if possible to avoid any hidden issues.
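To illustrate what batch_split does, here is a minimal sketch of gradient accumulation (a toy model and loss, not the repo's actual training loop):

```python
# Toy illustration of batch_split: a per-GPU batch is cut into `batch_split`
# chunks, gradients are accumulated, and one optimizer step is taken.
import torch

batch_size, batch_split = 64, 16
chunk = batch_size // batch_split  # 4 images per forward pass

model = torch.nn.Linear(128, 1)      # stand-in for the generator/discriminator
opt = torch.optim.Adam(model.parameters(), lr=2e-3)
data = torch.randn(batch_size, 128)  # stand-in batch of features

opt.zero_grad()
for i in range(batch_split):
    x = data[i * chunk:(i + 1) * chunk]
    loss = model(x).mean() / batch_split  # scale so gradients match one big pass
    loss.backward()                       # accumulate gradients across chunks
opt.step()                                # single update for the whole batch
```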

Hope these help.

parkjh688 commented:

> The reason I am asking is that the batch_size specified in curriculums.py is the batch size per GPU. The batch size of 64 stated in the paper comes from 8 (per GPU) x 8 (#GPUs) = 64.

Oh, I didn't know that. I used 6 GPUs, and my configuration in curriculums.py was 256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002}. So if I want to train the model with a batch size of 64, I should train with 4 GPUs and the configuration should be 256: {'batch_size': 16, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002}, since 16 (per GPU) x 4 (#GPUs) = 64.

> What dataset did you use for that training? And how many GPUs were you using?

I used the Kaggle dataset linked above.

Thanks!

Xiaoming-Zhao (Collaborator) commented:

Got it. So the Kaggle dataset is indeed able to reproduce the FID, based on your previous statement:

> Previously, when I trained with a batch size of 8, the FID score was around 20 and remained stable until the end of 5000 steps.

Then I would recommend reducing the batch_split to see whether it is the culprit for the strange FID curve you showed before. As I mentioned, theoretically, having batch_split = 16 should be fine, but I am not sure whether there are some hidden issues there.

Hope this helps.

howtowhy (Author) commented Apr 19, 2023

Hello, thank you for your detailed help.
I downloaded the 1024 images and preprocessed them to 256 with the script.
I used the following options with 8 GPUs:

"res_dict": {
256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002},
512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
},

"res_dict_learnable_param": {
    256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002},
    512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
    1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
},

But the FID goes up, and the result was like this.
Could you give me advice on this situation?

[Image: FID curve]

Xiaoming-Zhao (Collaborator) commented Apr 19, 2023

Do you mind trying the default configuration:

"res_dict": {
256: {'batch_size': 8, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
},

See the discussion above in this issue. Essentially:

  1. We use a batch_size of 8 because it is the batch size per GPU.
    a. One caveat is that a larger batch size does not always mean better results. Although a larger batch size could help the generator's learning, it could also give the discriminator more power and break the balance between the generator and the discriminator.
    b. I have not tried a batch size of 64 x 8 = 512, so I am not sure whether it would work.
  2. Maybe reduce the batch_split from 16 to 1 (or 2) if your GPU memory allows it. Theoretically, having batch_split = 16 should be fine, but I am not sure whether there are some hidden issues there.

Hope these help.

howtowhy (Author) commented:

Hello! Thank you for your kind help.
I ran the script with 8 GPUs and the batch size you suggested.

"res_dict": {
        256: {'batch_size': 8, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
        512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
        1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
    },

    "res_dict_learnable_param": {
        256: {'batch_size': 8, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
        512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
        1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
    },

The FID score at 256 was 18.92-21.52 (with one peak perturbation).
However, the paper reports an FID of 11.4.
Is there anything I missed?
Thank you so much for your quick help.

[Image: FID curve]

Xiaoming-Zhao (Collaborator) commented Apr 21, 2023

This curve looks reasonable to me. I am not sure about the peak, but I guess it may be due to some randomness.

Regarding the FID: the FID score largely depends on the number of images used to compute it. The more images you use, the more likely you are to obtain a lower score.

However, FID with many images is costly to compute. Therefore, during training, we use a small number of images to get a sense of the FID trend.

During the full evaluation, we use 50k fake and real images, as stated in the paper; this follows StyleGAN's papers:

N_IMGS=50000
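For reference, a minimal sketch of computing FID between two image folders with the pytorch-fid package (which may differ from this repo's internal FID code; the folder names are placeholders, and each folder would hold N_IMGS images for the final number):

```python
# Minimal sketch using pytorch-fid; folder names are placeholders.
# For the paper's final numbers, each folder would contain 50,000 images.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

fid = calculate_fid_given_paths(
    ["real_images_50k", "fake_images_50k"],
    batch_size=50,
    device="cuda" if torch.cuda.is_available() else "cpu",
    dims=2048,  # standard InceptionV3 pool3 feature dimension
)
print(f"FID: {fid:.2f}")
```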

Hope this resolves your confusion.
