
Fine-tuning code #9

Open

alxiang opened this issue Aug 1, 2024 · 29 comments

@alxiang

alxiang commented Aug 1, 2024

Congrats on the release! Will code be released to fine-tune the [schnell] model? (is it even fine-tuneable?)

@franz101

franz101 commented Aug 1, 2024

This model is so incredible. Can't wait for LoRA to be supported too. Really fantastic work!!

@bghira

bghira commented Aug 1, 2024

both of the released sets of weights, the Schnell and the Dev model, are distilled from the Pro model, and probably not directly tunable in the traditional sense

@billnye2

billnye2 commented Aug 2, 2024

Why does being distilled prevent fine-tuning? At the end of the day, it's still just layers of parameters, so I'm not sure why LoRA wouldn't work.
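For readers wondering what "just layers of parameters" means in the LoRA case, here is a minimal toy sketch of a LoRA adapter on a single linear layer. It is illustrative only, not any specific library's implementation: the base weight stays frozen and only the low-rank A/B factors receive gradients.

```python
# Toy LoRA adapter on one linear layer (illustration, not a library's implementation).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # base weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Output = frozen base projection + scaled low-rank update
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 64))                        # only lora_a / lora_b get grads
```

The same idea applies per attention/MLP projection inside a diffusion transformer; whether the distilled weights tolerate the resulting updates is the separate question discussed below.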

@bghira

bghira commented Aug 2, 2024

what schedule do you tune it with?

@billnye2

billnye2 commented Aug 2, 2024

Hmm, not sure. I'm not really familiar with the training details; does distilling use a scheduler dependent on the teacher model?

@bghira

bghira commented Aug 2, 2024

yes and the adversarial discriminator they created from the teacher model

@billnye2

billnye2 commented Aug 2, 2024

What would happen if we just fine-tuned it with a normal scheduler (without the adversarial objective)? Is that possible?

@bghira

bghira commented Aug 2, 2024

it will likely go out of distribution and enter representation collapse. the public Flux release seems more about their commercial model personalisation services than actually providing a fine-tuneable model to the community

@bghira

bghira commented Aug 2, 2024

reference to earlier request from sdxl turbo which never got answered: Stability-AI/generative-models#199

when you see "Turbo" models on CivitAI they did so by merging the SDXL Turbo checkpoint with a finetuned SDXL model. same for Lightning checkpoints, because Bytedance never released their training code or discriminator either

the turbo models are pretty useless for continued training or development, not sure why they call it a developer model

@billnye2

billnye2 commented Aug 2, 2024

Interesting. That would be a shame, since this model is so good; the community will likely move on. Appreciate your time, thank you. You mention it would go out of distribution. Excuse my lack of knowledge (not well educated on the details), but doesn't the student model learn to mimic the distribution of the teacher? So if we directly finetune the student with the MSE loss on real data (rather than the output of the teacher), wouldn't the distribution be very close? For small LoRAs, with few training images and few steps, would it still go out of distribution? Trying to see if there's any way to salvage some sort of LoRA training.

@bghira

bghira commented Aug 2, 2024

i should say that training can "work" but typically training a distilled model has major limitations, such as representation collapse in a short ~10k step training session. you don't have much time to show it stuff, because it is degrading the whole time it's being trained. you can't train it on too many images as a result, nor too many epochs on fewer images. it's not a great situation

@bghira

This comment was marked as outdated.

@nom

nom commented Aug 3, 2024

i should say that training can "work" but typically training a distilled model has major limitations, such as representation collapse in a short ~10k step training session. you don't have much time to show it stuff, because it is degrading the whole time it's being trained. you can't train it on too many images as a result, nor too many epochs on fewer images. it's not a great situation

This makes sense for adversarial diffusion distillation used for the "Schnell" model, as it's been observed that ADD made SDXL turbo harder to tune. Probably due to ADD resulting in brittle and more compressed representations.

The "Dev" model OTOOH was post-trained with guidance distillation, which I'm not sure suffers from the same tunability challenges? IIUC guidance distillation just makes the conditional noise predictions really high quality, removing the need for unconditional noise predictions. If so, finetuning a guidance distilled model would likely result in adapting these high-quality predictions to the new distribution, rather than causing a dramatic collapse.

@nom

nom commented Aug 3, 2024

small update, loading the model and a lora adapter in training mode requires FSDP, as it won't fit on a single 80G card. you probably need a multi-GPU system with a bunch of A6000s, L40, A100-40, A100-80 or H100 (or better)

Is that with AMP / deepspeed etc?
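For anyone unfamiliar with the FSDP setup being referenced, here is a minimal sketch of sharded training with bf16 mixed precision, launched with `torchrun --nproc_per_node=N`. A toy stack of linear layers stands in for the Flux transformer, and the wiring is a typical-pattern assumption, not the configuration anyone in this thread actually used.

```python
# Minimal FSDP + bf16 mixed-precision sketch (toy model in place of the Flux transformer).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(16)]).cuda()

    # Shard parameters/grads/optimizer state across ranks and keep compute in bf16,
    # which is what lets a model too big for one 80G card be trained at all.
    model = FSDP(model,
                 mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                                reduce_dtype=torch.bfloat16))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for _ in range(10):
        x = torch.randn(2, 4096, device="cuda")
        loss = model(x).pow(2).mean()                 # dummy objective for the sketch
        opt.zero_grad(); loss.backward(); opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```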

@aarongerber

i should say that training can "work" but typically training a distilled model has major limitations, such as representation collapse in a short ~10k step training session. you don't have much time to show it stuff, because it is degrading the whole time it's being trained. you can't train it on too many images as a result, nor too many epochs on fewer images. it's not a great situation

I am thinking of this model in the context of LLMs and efforts there.

Allow me a few dumb questions with dumb comments. I wonder if any of their solutions would be transferable here, since it is a transformer model. Is this distilling in Flux anything like quantization, or more like removing layers from the model? LLM enthusiasts found ways to revert quantization for an unreleased Mistral model (perhaps they did this based on the parent model, and that's what you're talking about), and for models without enough layers/parameters, they added additional layers into the model to increase the training capability (not sure how that worked). They also figured out how to fine-tune without full precision.

@evkogs

evkogs commented Aug 3, 2024

the turbo models are pretty useless for continued training or development, not sure why they call it a developer model

Dude, you are being cited everywhere, let's not jump to conclusions this fast 😂 There's no paper or any other info on what exactly they used for it. Let's wait a bit and try different things; I'm 100% sure it's tunable, though maybe not with just an MSE loss.

@FurkanGozukara

FurkanGozukara commented Aug 3, 2024

Certainly fine-tuneable.

Of course, only if the authors publish training code or some people develop it themselves.

Probably a single A6000 would be sufficient.

@Visual-Synthesizer

Certainly fine-tuneable.

Of course, only if the authors publish training code or some people develop it themselves.

Probably a single A6000 would be sufficient.

We shall see. SD3 can be fine-tuned, but the results are subpar...

@jardelva96

Fine-tuning the Schnell and Dev models, distilled from the Pro model, presents challenges due to the nature of model distillation. Distilled models often face representation collapse during short training sessions (~10,000 steps), limiting the number of images and epochs that can be used without degrading the model's performance. Adversarial Diffusion Distillation (ADD) makes models brittle and harder to tune. The Dev model, using guidance distillation, may avoid some tuning challenges as it focuses on high-quality conditional noise predictions.

Effective fine-tuning requires advanced hardware like multi-GPU setups (e.g., A100-80, H100) and techniques like Fully Sharded Data Parallel (FSDP). While small-scale fine-tuning with LoRA might work, extensive fine-tuning remains problematic without significant engineering efforts.
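As a sketch of the small-scale LoRA route, the snippet below attaches an adapter to the Flux transformer using diffusers and peft. The repo id, target module names, rank, and learning rate are illustrative assumptions rather than a verified recipe (and the gated FLUX.1-dev weights require Hugging Face authentication).

```python
# Hedged sketch: attaching a LoRA adapter to the Flux transformer with peft.
# Repo id, target_modules, rank, and dtype are assumptions for illustration.
import torch
from diffusers import FluxTransformer2DModel
from peft import LoraConfig, get_peft_model

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer",
    torch_dtype=torch.bfloat16)
transformer.requires_grad_(False)                     # freeze the base model

lora_config = LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
transformer = get_peft_model(transformer, lora_config)
transformer.print_trainable_parameters()              # only the LoRA weights train

optimizer = torch.optim.AdamW(
    (p for p in transformer.parameters() if p.requires_grad), lr=1e-4)
```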

@bghira

bghira commented Aug 4, 2024

the diffusers team decided not to release a finetuning script for now. i am assuming it's because of problems faced during the training. they are, however, working on it, and we will see what comes. but from comments by e.g. neggles (a BFL employee) it sounds like they (BFL) are not directly interested in helping make it trainable, though they also insisted that a friend of theirs did finetune it (without mentioning what hardware or script was used)

that said, Ostris has a script here (https://github.com/ostris/ai-toolkit/) and it looks like the same approach i used during my tests. he says he has a rank 256 lora running in 38gb of vram - haven't tested his version personally, but that's way lower than i was able to get the model going at. no idea what resolution he's using or any other details yet
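On how a VRAM figure like that can come down, two generic levers are usually combined with LoRA training: gradient checkpointing on the frozen base model and an 8-bit optimizer for the adapter weights. A hedged sketch follows; it is not Ostris's or SimpleTuner's actual configuration, and the rank and hyperparameters are assumptions.

```python
# Hedged sketch of two generic VRAM levers often paired with LoRA training:
# gradient checkpointing and an 8-bit optimizer. Not anyone's verified config.
import torch
import bitsandbytes as bnb
from diffusers import FluxTransformer2DModel
from peft import LoraConfig, get_peft_model

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer",
    torch_dtype=torch.bfloat16)
transformer.enable_gradient_checkpointing()           # trade compute for activation memory
transformer = get_peft_model(
    transformer,
    LoraConfig(r=256, lora_alpha=256,                 # rank 256 is an assumption
               target_modules=["to_q", "to_k", "to_v", "to_out.0"]))

trainable = [p for p in transformer.parameters() if p.requires_grad]
optimizer = bnb.optim.AdamW8bit(trainable, lr=1e-4)   # 8-bit optimizer states
```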

@FurkanGozukara

Certainly fine-tuneable.
Of course, only if the authors publish training code or some people develop it themselves.
Probably a single A6000 would be sufficient.

We shall see. SD3 can be fine-tuned, but the results are subpar...

well, I haven't had a chance to research it yet. Finding optimal params takes real effort; I did over 100 different full trainings to find them for SDXL.

@FurkanGozukara

38gb of vram

Sounds alright to me. It is just starting, so it may get way lower; that happened before with SD 1.5 and SDXL. As I predicted, a single concept should be pretty easy to fine-tune into FLUX with, at worst, a single RTX A6000 GPU. But it depends on the training script and the hyperparameters.

@bghira

bghira commented Aug 4, 2024

furkan, can you chill out unless you have something constructive to add?

@CCpt5

CCpt5 commented Aug 4, 2024

the diffusers team decided not to release a finetuning script for now. i am assuming it's because of problems faced during the training. they are, however, working on it, and we will see what comes. but from comments by e.g. neggles (a BFL employee) it sounds like they (BFL) are not directly interested in helping make it trainable, though they also insisted that a friend of theirs did finetune it (without mentioning what hardware or script was used)

that said, Ostris has a script here (https://github.com/ostris/ai-toolkit/) and it looks like the same approach i used during my tests. he says he has a rank 256 lora running in 38gb of vram - haven't tested his version personally, but that's way lower than i was able to get the model going at. no idea what resolution he's using or any other details yet

He's posted some comments on his twitter: @ostrisai

Interesting bit here:
(screenshot attached)

@bghira

bghira commented Aug 4, 2024

updated with a guide on how to train flux dev or schnell: https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md

@cchance27

updated with a guide on how to train flux dev or schnell: https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md

Wow, didn't expect movement on this so quickly. Hopefully we see some people testing it and more optimizations.

Question: why does it say it won't work on Apple MPS? Is it a hard limitation, or just something lacking the dev work to make it possible?

@billnye2

billnye2 commented Aug 4, 2024

updated with a guide on how to train flux dev or schnell: https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md

What optimizations or other tricks did you use to combat the distillation issues / high VRAM? Going to try it out, thank you.

@bghira

bghira commented Aug 4, 2024

apple mps doesn't have fp64 support on the GPU. you can only train on CPU 😩

AMD ROCm has kernel size limits that the enormous Flux model trips, requiring attention slicing, which isn't really worth it.

@AMohamedAakhil

does it support LoRAs yet?
