Projected GANs for image-to-image translation? #71
Hi :) Can you post some loss curves / logits etc? I assume the discriminator quickly overpowers the generator?
You're welcome :) This does not look like collapse, though; the losses are not directly comparable (PG uses the hinge loss, SG uses the non-saturating loss). So training seems stable. If I am not mistaken, you cut off the x-axis for the LPIPS plot, so it appears that PG is initially better than your baseline? It is also still improving, just more slowly. Are you using a dataset with many faces? This is currently a weakness of PG that I mentioned in other issues.
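To make the comparison above concrete, here is a minimal numpy sketch of the two discriminator objectives mentioned (hinge for Projected GAN, non-saturating/softplus for StyleGAN); the function names are my own, not from either codebase. The different shapes of these losses are why the raw curves are not directly comparable.

```python
import numpy as np

def d_hinge_loss(real_logits, fake_logits):
    # Hinge loss: penalizes real logits below +1 and fake logits
    # above -1; saturates (zero gradient) once the margin is met.
    return (np.maximum(0.0, 1.0 - real_logits).mean()
            + np.maximum(0.0, 1.0 + fake_logits).mean())

def d_nonsat_loss(real_logits, fake_logits):
    # Non-saturating (softplus) loss: never fully saturates, so the
    # absolute loss value keeps shrinking as D grows more confident.
    return (np.log1p(np.exp(-real_logits)).mean()
            + np.log1p(np.exp(fake_logits)).mean())
```

Comparing the raw logits (as done in the plots here) is therefore more meaningful than comparing the loss values themselves.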
The first plot shows the raw logits of the discriminator (without the sigmoid for the StyleGAN D), not the loss. Thus, it should at least be comparable to the StyleGAN logits in terms of the gap between real/fake examples (though perhaps the scale of the logits differs?). You're right that it's not collapsing, but from my previous experience with the non-saturating/Wasserstein losses I would expect the gap to be slightly smaller. Then again, perhaps the poor sample quality is due to the feature network rather than the training dynamics.
Yeah, correct!
Yeah, I'm currently using a dataset of only human bodies (example illustration). I'll do some further ablations with CLIP/mocov2 resnet!
Thanks for the reference, it seems very interesting and quite relevant to my dataset :)
In my experience, the training dynamics are quite different from standard GAN training, so previous experiences might be misleading :)
Awesome, keep me updated on the results! You could also take this further and pretrain a model on your specific dataset with e.g. MoCoV2. That should definitely give you useful features for human body generation.
Thanks for the answers, will post an update later!
A quick question regarding normalization of input images: do you use the standard ImageNet normalization for the feature network in D, or do you keep the StyleGAN normalization to the range [-1, 1]?
Thanks! Do you keep the [-1, 1] normalization for the generator?
Yes :)
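For anyone following along: bridging the two conventions discussed above is a one-liner. This is a minimal numpy sketch (the helper name and use of numpy rather than torch tensors are my own) of re-normalizing a generator output in [-1, 1] to the ImageNet statistics that a pretrained feature network typically expects.

```python
import numpy as np

# Standard ImageNet channel statistics, reshaped for (C, H, W) broadcasting.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def stylegan_to_imagenet(img):
    """Map a (3, H, W) image in [-1, 1] to ImageNet-normalized values."""
    img01 = (img + 1.0) / 2.0              # [-1, 1] -> [0, 1]
    return (img01 - IMAGENET_MEAN) / IMAGENET_STD
```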
Gonna close this now, keep me updated on your results :)
Here is an update after a couple of days of experiments with projected GANs for image inpainting of human figures. In my experience, projected GANs converge quickly; however, they are prone to mode collapse early in training, and some feature networks are more unstable to train than others. All models that I've trained have suffered from mode collapse, where the model either generates deterministic completions for the same conditional input, or its diversity is limited to simple semantic changes (e.g. only changing the color of the clothes, not the general appearance).

To mitigate this issue, I've experimented with blurring the discriminator images and with turning the separable/patch discriminator on and off. Generally, blurring for the first N iterations (tested with 200k-1M images) improves diversity somewhat, but it is still far from the diversity of the baseline.

This figure shows various experiments with a ViT discriminator, with separable/patch/blurring turned on/off. I've used the ViT model from "Masked Autoencoders Are Scalable Vision Learners". The figure includes the logits of the discriminator, FID-CLIP (from the paper you linked), LPIPS diversity, and LPIPS. You can observe that blurring improves diversity; however, blurring for too long (1M images) seemed to collapse training. The image quality of the model is quite similar to the baseline, but the diversity is significantly worse. I also noticed some surprising results in these runs, e.g. ViT with only blur trains fine, while adding the patch/separable options can collapse training; this might be randomness, though. Note that some plots do not show the full FID-CLIP curve, as I implemented this metric in the middle of the training runs.

This figure shows the different feature networks that I've tested. The models are:
From the experiments, I find that the RN50 CLIP, RN50 DensePose, and MAE ViT models provide quite good features for human figure synthesis. In summary, I find the results promising and will continue my experiments. The current issues are training instability and reduced diversity. I believe the root cause is instability early in training, where simply blurring the first iterations diminishes the issue with no observable cost to image quality. I'm happy to hear any suggestions or ideas to combat these issues :)
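The blur warm-up described in the update above can be implemented as a simple linear fade. Below is a hedged sketch of such a schedule; the function and parameter names are mine (loosely in the style of StyleGAN-family training configs), and the defaults are illustrative rather than what was used in these runs. The returned sigma would drive a Gaussian blur applied to both real and fake discriminator inputs.

```python
def blur_sigma(cur_nimg, blur_init_sigma=10.0, blur_fade_kimg=200):
    # Linearly fade the blur strength to zero over the first
    # `blur_fade_kimg` thousand images seen by the discriminator
    # (200k-1M images were tested in the experiments above).
    fade = max(0.0, 1.0 - cur_nimg / (blur_fade_kimg * 1e3))
    return fade * blur_init_sigma
```

A sigma of 0 means no blur, so after the fade the discriminator sees unmodified images.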
You could try generating images at a smaller resolution first and then moving to higher resolutions; this can help stabilize training. You could also try adding noise to the images, to manually give the real and fake data distributions some overlap; this trick has been found useful for both GANs and diffusion models. Hope this helps; I find what you're doing interesting :)
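The noise trick suggested above (often called instance noise) amounts to perturbing real and generated images with the same noise model before they reach the discriminator. A minimal numpy sketch, with a hypothetical helper name; in practice `sigma` would be annealed toward zero over training:

```python
import numpy as np

def add_instance_noise(images, sigma=0.1, rng=None):
    # Add identical-variance Gaussian noise to a batch of images so the
    # real and fake distributions overlap, easing the discriminator's
    # job early in training. sigma=0 leaves the images unchanged.
    if rng is None:
        rng = np.random.default_rng()
    return images + sigma * rng.standard_normal(images.shape)
```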
Hi,
Are you familiar with any work that has applied projected GANs for image-to-image translation? I spent a couple of days trying to get projected GANs to work for image inpainting of human bodies. However, I continuously struggled with the discriminator learning to discriminate between real/generated examples very early in training (often less than 100k images).
I experimented with several methods to prevent this behavior:
Note that the discriminator never observes the conditional information; I only input the generated/real RGB image.
Also, the discriminator follows the implementation in this repo.
I would appreciate any tips or related work that might be relevant to this use case.