Fine-tuning code #9
This model is so incredible. Can't wait for LoRA to be supported too. Really fantastic work!!
Both of the released sets of weights, the Schnell and the Dev model, are distilled from the Pro model, and are probably not directly tunable in the traditional sense.
Why does being distilled prevent fine-tuning? At the end of the day, it's still just layers of parameters; I'm not sure why LoRA wouldn't work.
What schedule do you tune it with?
Hmm, not sure. I'm not really familiar with the training details; does distilling use a scheduler dependent on the teacher model?
Yes, and the adversarial discriminator they created from the teacher model.
What would happen if we just fine-tuned it using a normal scheduler (without the adversarial objective)? Is that possible?
It will likely go out of distribution and enter representation collapse. The public Flux release seems to be more about their commercial model-personalisation services than about actually providing a fine-tuneable model to the community.
For reference, an earlier request about SDXL Turbo that never got answered: Stability-AI/generative-models#199. When you see "Turbo" models on CivitAI, they were made by merging the SDXL Turbo checkpoint with a fine-tuned SDXL model; the same goes for Lightning checkpoints, because ByteDance never released their training code or discriminator either. The Turbo models are pretty useless for continued training or development. Not sure why they call it a developer model.
Interesting. That would be a shame, since this model is so good; the community will likely move on. Appreciate your time, thank you. You mention it would go out of distribution. Excuse my lack of knowledge (not well educated on the details), but doesn't the student model learn to mimic the distribution of the teacher? So if we directly fine-tune the student with an MSE loss on real data (rather than the output of the teacher), wouldn't the distribution be very close? For small LoRAs, with few training images and few steps, would it still go out of distribution? Trying to see if there's any way to salvage some sort of LoRA training.
I should say that training can "work", but training a distilled model typically has major limitations, such as representation collapse within a short ~10k-step training session. You don't have much time to show it new data, because it is degrading the whole time it's being trained. As a result, you can't train it on too many images, nor for too many epochs on fewer images. It's not a great situation.
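For reference, the "normal" (non-adversarial) objective discussed above is just the standard denoising MSE loss: add noise to a real latent and regress the noise. A minimal toy sketch (the `ToyDenoiser` model and the linear noise schedule here are illustrative assumptions, not Flux's actual architecture or schedule):

```python
import torch

class ToyDenoiser(torch.nn.Module):
    """Stand-in for a noise-prediction network (hypothetical)."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def forward(self, xt, cond):
        return self.net(xt + cond)

def mse_training_step(model, x0, cond, num_timesteps=1000):
    # Sample a random timestep per example and add matching noise.
    t = torch.randint(0, num_timesteps, (x0.shape[0],))
    noise = torch.randn_like(x0)
    alpha = 1.0 - t.float() / num_timesteps  # toy linear schedule
    alpha = alpha.view(-1, *([1] * (x0.dim() - 1)))
    xt = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
    # Regress the injected noise with plain MSE.
    return torch.nn.functional.mse_loss(model(xt, cond), noise)
```

The concern raised in the thread is that a distilled student was never trained with this loss directly, so optimizing it can pull the weights away from the distribution the distillation baked in.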
This makes sense for the adversarial diffusion distillation (ADD) used for the Schnell model, as it's been observed that ADD made SDXL Turbo harder to tune, probably because ADD results in brittle, more compressed representations. The Dev model, OTOH, was post-trained with guidance distillation, which I'm not sure suffers from the same tunability challenges. IIUC, guidance distillation just makes the conditional noise predictions really high quality, removing the need for unconditional noise predictions. If so, fine-tuning a guidance-distilled model would likely adapt these high-quality predictions to the new distribution, rather than causing a dramatic collapse.
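To make the distinction concrete, here's a toy sketch of what guidance distillation removes. Classifier-free guidance needs two forward passes per denoising step, while a guidance-distilled student takes the guidance scale as an extra input and does one pass. Everything here (the tiny linear networks, the scale embedding) is an illustrative assumption, not Flux's real architecture:

```python
import torch

class ToyDenoiser(torch.nn.Module):
    """Stand-in teacher noise predictor (hypothetical)."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def forward(self, x, cond):
        return self.net(x + cond)

def cfg_prediction(model, x, cond, uncond, scale):
    # Classic classifier-free guidance: two forward passes per step.
    eps_c = model(x, cond)
    eps_u = model(x, uncond)
    return eps_u + scale * (eps_c - eps_u)

class GuidanceDistilledDenoiser(torch.nn.Module):
    """One-pass student: the guidance scale enters as an embedding.
    A real distillation run would train this to match cfg_prediction."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)
        self.scale_embed = torch.nn.Linear(1, dim)

    def forward(self, x, cond, scale):
        s = self.scale_embed(torch.full((x.shape[0], 1), float(scale)))
        return self.net(x + cond + s)
```

The upshot for tuning: the student's output distribution already corresponds to guided predictions, so fine-tuning moves those predictions rather than re-learning guidance from scratch.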
Is that with AMP / DeepSpeed etc.?
I am thinking of this model in the context of LLMs and efforts there. Allow me a few dumb questions with dumb comments. I wonder if any of their solutions would be transferable here, since it is a transformer model. Is this distilling in Flux anything like quantization, or more like removing layers from the model? LLM enthusiasts found ways to revert quantization for an unreleased Mistral model (perhaps they did this based on the parent model, and that's what you're talking about), and for models without enough layers/parameters, they added additional layers into the model to increase training capability (not sure how that worked). They also figured out how to fine-tune without full precision.
Dude, you are being cited everywhere, let's not jump to conclusions this fast 😂 There's no paper or any other info on what exactly they used for it. Let's wait a bit and try different things. I'm 100% sure it's tunable, maybe just not with an MSE loss.
Certainly fine-tuneable, of course, if only the authors publish training code or some people develop it themselves. Probably a single A6000 would be sufficient.
We shall see. SD3 can be fine-tuned, but the results are sub-par...
Fine-tuning the Schnell and Dev models, distilled from the Pro model, presents challenges due to the nature of model distillation. Distilled models often face representation collapse during short training sessions (~10,000 steps), limiting the number of images and epochs that can be used without degrading the model's performance. Adversarial Diffusion Distillation (ADD) makes models brittle and harder to tune. The Dev model, which uses guidance distillation, may avoid some of these tuning challenges, as it focuses on high-quality conditional noise predictions. Effective fine-tuning requires advanced hardware like multi-GPU setups (e.g., A100 80GB, H100) and techniques like Fully Sharded Data Parallel (FSDP). While small-scale fine-tuning with LoRA might work, extensive fine-tuning remains problematic without significant engineering effort.
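A back-of-envelope memory estimate shows why full fine-tuning pushes into FSDP/multi-GPU territory. Assuming a ~12B-parameter model (Flux's approximate size), bf16 weights and gradients, and fp32 Adam moments; the exact breakdown is an assumption, and activations add on top:

```python
# Rough VRAM estimate for full fine-tuning of a ~12B-parameter model.
# Assumptions: bf16 weights/grads, fp32 Adam first and second moments.
params = 12e9
weights = params * 2        # bf16 weights: 2 bytes each
grads = params * 2          # bf16 gradients: 2 bytes each
adam = params * 4 * 2       # fp32 Adam m and v: 4 bytes each, two states
total_gb = (weights + grads + adam) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~144 GB
```

Even before activations, that footprint exceeds a single 80 GB card, which is why FSDP sharding (or parameter-efficient methods like LoRA) comes up.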
The diffusers team decided not to release a fine-tuning script for now; I'm assuming it's because of problems faced during training. They are, however, working on it, and we will see what comes of it. But from comments by e.g. neggles (a BFL employee), it sounds like they (BFL) are not directly interested in helping make it trainable, though they also insisted that a friend of theirs did fine-tune it (without mentioning what hardware or script was used). That said, Ostris has a script here (https://github.com/ostris/ai-toolkit/), and it looks like the same approach I used during my tests. He says he has a rank-256 LoRA running in 38 GB of VRAM. I haven't tested his version personally, but that's way lower than I was able to get the model going at; no idea what resolution he's using or any other details yet.
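For readers unfamiliar with why a rank-256 LoRA fits where full fine-tuning doesn't: only a low-rank adapter is trained while the distilled base weights stay frozen. A minimal plain-PyTorch sketch of the idea (this is a generic illustration, not the implementation in ai-toolkit or any specific trainer):

```python
import torch

class LoRALinear(torch.nn.Module):
    """Wrap a frozen base Linear with a trainable low-rank adapter."""
    def __init__(self, base: torch.nn.Linear, rank=16, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the distilled base weights
        self.down = torch.nn.Linear(base.in_features, rank, bias=False)
        self.up = torch.nn.Linear(rank, base.out_features, bias=False)
        torch.nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank update.
        return self.base(x) + self.scale * self.up(self.down(x))
```

Because gradients and optimizer states only exist for the small `down`/`up` matrices, the memory cost scales with the rank rather than the full parameter count.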
Well, I haven't had a chance to research it yet. Finding optimal params takes real effort; I did over 100 different full trainings to find them for SDXL.
Sounds alright to me. It's just getting started, so it may get way lower; that happened before with SD 1.5 and SDXL. As I predicted, a single concept should be pretty easy to fine-tune into FLUX with, at worst, a single RTX A6000 GPU. But it depends on the training script and the hyperparameters.
furkan, can you chill out unless you have something constructive to add? |
He's posted some comments on his twitter: @ostrisai |
Updated with a guide on how to train Flux Dev or Schnell: https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md
Wow, didn't expect movement on this so quickly. Hopefully we see some people testing it and more optimizations. Question: why does it say it won't work on Apple MPS? Is that a hard limitation, or just something lacking the dev work to make it possible?
What optimizations or things did you do to combat the distillation / high VRAM? Going to try it out, thank you.
Apple MPS doesn't have fp64 support on the GPU; you can only train on CPU 😩 AMD ROCm has kernel-size limits that are tripped by the enormous Flux model, and it requires attention slicing, which isn't really worth it.
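For context on the attention-slicing workaround mentioned here: attention is computed over chunks of the query dimension so each kernel launch stays small, at the cost of extra launches. A minimal sketch of the general technique (a generic illustration, not diffusers' or any backend's actual implementation):

```python
import torch

def sliced_attention(q, k, v, slice_size=64):
    """Exact attention computed over query chunks to limit kernel size.

    q, k, v: tensors of shape (..., seq_len, head_dim).
    """
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(0, q.shape[-2], slice_size):
        qs = q[..., i:i + slice_size, :]
        # Each chunk attends to the full key/value sequence.
        attn = torch.softmax(qs @ k.transpose(-2, -1) * scale, dim=-1)
        out[..., i:i + slice_size, :] = attn @ v
    return out
```

Slicing over queries is mathematically exact (each query row's softmax is independent), so the result matches unsliced attention; the trade-off is purely speed versus per-launch memory.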
Does it support LoRAs yet?
Congrats on the release! Will code be released to fine-tune the [schnell] model? (is it even fine-tuneable?)