Fine-tuning code #9
This model is so incredible. Can't wait for LoRA to be supported too. Really fantastic work!!
Both of the released sets of weights, the Schnell and the Dev model, are distilled from the Pro model, and are probably not directly tunable in the traditional sense.
Why does being distilled prevent fine-tuning? At the end of the day, it's still just layers of parameters; I'm not sure why LoRA wouldn't work.
What schedule do you tune it with?
Hmm, not sure. I'm not really familiar with the training details; does distilling use a scheduler dependent on the teacher model?
Yes, and the adversarial discriminator they created from the teacher model.
What would happen if we just fine-tuned it using a normal scheduler (without the adversarial objective)? Is that possible?
It will likely go out of distribution and enter representation collapse. The public Flux release seems to be more about their commercial model-personalisation services than about actually providing a fine-tuneable model to the community.
For reference, an earlier request about SDXL Turbo that never got answered: Stability-AI/generative-models#199. When you see "Turbo" models on CivitAI, they were made by merging the SDXL Turbo checkpoint with a fine-tuned SDXL model; the same goes for Lightning checkpoints, because ByteDance never released their training code or discriminator either. The Turbo models are pretty useless for continued training or development. Not sure why they call it a developer model.
Interesting. That would be a shame, since this model is so good; the community will likely move on. Appreciate your time, thank you. You mention it would go out of distribution. Excuse my lack of knowledge (not well educated on the details), but doesn't the student model learn to mimic the distribution of the teacher? So if we directly fine-tune the student with an MSE loss on real data (rather than the output of the teacher), wouldn't the distribution be very close? For small LoRAs, with few training images and few steps, would it still go out of distribution? Trying to see if there's any way to salvage some sort of LoRA training.
I should say that training can "work", but training a distilled model typically has major limitations, such as representation collapse within a short ~10k-step training session. You don't have much time to show it new data, because it is degrading the whole time it's being trained. As a result, you can't train it on too many images, nor for too many epochs on fewer images. It's not a great situation.
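For reference, the "normal" (non-adversarial) objective discussed above is just the standard denoising MSE loss: add noise to a real latent and regress the noise. A minimal toy sketch (the `ToyDenoiser` model and the linear noise schedule here are illustrative assumptions, not Flux's actual architecture or schedule):

```python
import torch

class ToyDenoiser(torch.nn.Module):
    """Stand-in for a noise-prediction network (hypothetical)."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def forward(self, xt, cond):
        return self.net(xt + cond)

def mse_training_step(model, x0, cond, num_timesteps=1000):
    # Sample a random timestep per example and add matching noise.
    t = torch.randint(0, num_timesteps, (x0.shape[0],))
    noise = torch.randn_like(x0)
    alpha = 1.0 - t.float() / num_timesteps  # toy linear schedule
    alpha = alpha.view(-1, *([1] * (x0.dim() - 1)))
    xt = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
    # Regress the injected noise with plain MSE.
    return torch.nn.functional.mse_loss(model(xt, cond), noise)
```

The concern raised in the thread is that a distilled student was never trained with this loss directly, so optimizing it can pull the weights away from the distribution the distillation baked in.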
This makes sense for the adversarial diffusion distillation (ADD) used for the Schnell model, as it's been observed that ADD made SDXL Turbo harder to tune, probably because ADD results in brittle, more compressed representations. The Dev model, OTOH, was post-trained with guidance distillation, which I'm not sure suffers from the same tunability challenges. IIUC, guidance distillation just makes the conditional noise predictions really high quality, removing the need for unconditional noise predictions. If so, fine-tuning a guidance-distilled model would likely adapt these high-quality predictions to the new distribution, rather than causing a dramatic collapse.
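To make the distinction concrete, here's a toy sketch of what guidance distillation removes. Classifier-free guidance needs two forward passes per denoising step, while a guidance-distilled student takes the guidance scale as an extra input and does one pass. Everything here (the tiny linear networks, the scale embedding) is an illustrative assumption, not Flux's real architecture:

```python
import torch

class ToyDenoiser(torch.nn.Module):
    """Stand-in teacher noise predictor (hypothetical)."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def forward(self, x, cond):
        return self.net(x + cond)

def cfg_prediction(model, x, cond, uncond, scale):
    # Classic classifier-free guidance: two forward passes per step.
    eps_c = model(x, cond)
    eps_u = model(x, uncond)
    return eps_u + scale * (eps_c - eps_u)

class GuidanceDistilledDenoiser(torch.nn.Module):
    """One-pass student: the guidance scale enters as an embedding.
    A real distillation run would train this to match cfg_prediction."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)
        self.scale_embed = torch.nn.Linear(1, dim)

    def forward(self, x, cond, scale):
        s = self.scale_embed(torch.full((x.shape[0], 1), float(scale)))
        return self.net(x + cond + s)
```

The upshot for tuning: the student's output distribution already corresponds to guided predictions, so fine-tuning moves those predictions rather than re-learning guidance from scratch.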
Is that with AMP / DeepSpeed etc.?
I am thinking of this model in the context of LLMs and efforts there. Allow me a few dumb questions with dumb comments. I wonder if any of their solutions would be transferable here, since it is a transformer model. Is this distilling in Flux anything like quantization, or more like removing layers from the model? LLM enthusiasts found ways to revert quantization for an unreleased Mistral model (perhaps they did this based on the parent model, and that's what you're talking about), and for models without enough layers/parameters, they added additional layers into the model to increase training capability (not sure how that worked). They also figured out how to fine-tune without full precision.
Dude, you are being cited everywhere, let's not jump to conclusions this fast 😂 There's no paper or any other info on what exactly they used for it. Let's wait a bit and try different things. I'm 100% sure it's tunable, maybe just not with an MSE loss.
Certainly fine-tuneable, of course, if only the authors publish training code or some people develop it themselves. Probably a single A6000 would be sufficient.
We shall see. SD3 can be fine-tuned, but the results are sub-par...
Fine-tuning the Schnell and Dev models, distilled from the Pro model, presents challenges due to the nature of model distillation. Distilled models often face representation collapse during short training sessions (~10,000 steps), limiting the number of images and epochs that can be used without degrading the model's performance. Adversarial Diffusion Distillation (ADD) makes models brittle and harder to tune. The Dev model, which uses guidance distillation, may avoid some of these tuning challenges, as it focuses on high-quality conditional noise predictions. Effective fine-tuning requires advanced hardware like multi-GPU setups (e.g., A100 80GB, H100) and techniques like Fully Sharded Data Parallel (FSDP). While small-scale fine-tuning with LoRA might work, extensive fine-tuning remains problematic without significant engineering effort.
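A back-of-envelope memory estimate shows why full fine-tuning pushes into FSDP/multi-GPU territory. Assuming a ~12B-parameter model (Flux's approximate size), bf16 weights and gradients, and fp32 Adam moments; the exact breakdown is an assumption, and activations add on top:

```python
# Rough VRAM estimate for full fine-tuning of a ~12B-parameter model.
# Assumptions: bf16 weights/grads, fp32 Adam first and second moments.
params = 12e9
weights = params * 2        # bf16 weights: 2 bytes each
grads = params * 2          # bf16 gradients: 2 bytes each
adam = params * 4 * 2       # fp32 Adam m and v: 4 bytes each, two states
total_gb = (weights + grads + adam) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~144 GB
```

Even before activations, that footprint exceeds a single 80 GB card, which is why FSDP sharding (or parameter-efficient methods like LoRA) comes up.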
The diffusers team decided not to release a fine-tuning script for now; I'm assuming it's because of problems faced during training. They are, however, working on it, and we will see what comes of it. But from comments by e.g. neggles (a BFL employee), it sounds like they (BFL) are not directly interested in helping make it trainable, though they also insisted that a friend of theirs did fine-tune it (without mentioning what hardware or script was used). That said, Ostris has a script here (https://github.com/ostris/ai-toolkit/), and it looks like the same approach I used during my tests. He says he has a rank-256 LoRA running in 38 GB of VRAM. I haven't tested his version personally, but that's way lower than I was able to get the model going at; no idea what resolution he's using or any other details yet.
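For readers unfamiliar with why a rank-256 LoRA fits where full fine-tuning doesn't: only a low-rank adapter is trained while the distilled base weights stay frozen. A minimal plain-PyTorch sketch of the idea (this is a generic illustration, not the implementation in ai-toolkit or any specific trainer):

```python
import torch

class LoRALinear(torch.nn.Module):
    """Wrap a frozen base Linear with a trainable low-rank adapter."""
    def __init__(self, base: torch.nn.Linear, rank=16, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the distilled base weights
        self.down = torch.nn.Linear(base.in_features, rank, bias=False)
        self.up = torch.nn.Linear(rank, base.out_features, bias=False)
        torch.nn.init.zeros_(self.up.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank update.
        return self.base(x) + self.scale * self.up(self.down(x))
```

Because gradients and optimizer states only exist for the small `down`/`up` matrices, the memory cost scales with the rank rather than the full parameter count.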
Well, I haven't had a chance to research it yet. Finding optimal params takes real effort; I did over 100 different full trainings to find them for SDXL.
Sounds alright to me. It's just getting started, so it may get way lower; that happened before with SD 1.5 and SDXL. As I predicted, a single concept should be pretty easy to fine-tune into FLUX with, at worst, a single RTX A6000 GPU. But it depends on the training script and the hyperparameters.
furkan, can you chill out unless you have something constructive to add? |
He's posted some comments on his twitter: @ostrisai |
Updated with a guide on how to train Flux Dev or Schnell: https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md
Wow, didn't expect movement on this so quickly. Hopefully we see some people testing it and more optimizations. Question: why does it say it won't work on Apple MPS? Is that a hard limitation, or just something lacking the dev work to make it possible?
What optimizations or things did you do to combat the distillation / high VRAM? Going to try it out, thank you.
Apple MPS doesn't have fp64 support on the GPU; you can only train on CPU 😩 AMD ROCm has kernel-size limits that are tripped by the enormous Flux model, and it requires attention slicing, which isn't really worth it.
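For context on the attention-slicing workaround mentioned here: attention is computed over chunks of the query dimension so each kernel launch stays small, at the cost of extra launches. A minimal sketch of the general technique (a generic illustration, not diffusers' or any backend's actual implementation):

```python
import torch

def sliced_attention(q, k, v, slice_size=64):
    """Exact attention computed over query chunks to limit kernel size.

    q, k, v: tensors of shape (..., seq_len, head_dim).
    """
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(0, q.shape[-2], slice_size):
        qs = q[..., i:i + slice_size, :]
        # Each chunk attends to the full key/value sequence.
        attn = torch.softmax(qs @ k.transpose(-2, -1) * scale, dim=-1)
        out[..., i:i + slice_size, :] = attn @ v
    return out
```

Slicing over queries is mathematically exact (each query row's softmax is independent), so the result matches unsliced attention; the trade-off is purely speed versus per-launch memory.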
Does it support LoRAs yet?
Congrats on the release! Will code be released to fine-tune the [schnell] model? (is it even fine-tuneable?)