bug/feature: Fixing mixed precision training #290

Open
wants to merge 69 commits into main

Conversation

isamu-isozaki (Contributor)

This makes it so we can do lower precision training. @piercus might be interested

@isamu-isozaki isamu-isozaki changed the title from "Adding grad scaler" to "bug/feature: Adding grad scaler" Feb 16, 2024
@isamu-isozaki isamu-isozaki marked this pull request as draft February 16, 2024 19:14
@isamu-isozaki isamu-isozaki marked this pull request as ready for review February 21, 2024 02:54
@isamu-isozaki isamu-isozaki marked this pull request as draft February 21, 2024 02:58
@isamu-isozaki isamu-isozaki marked this pull request as ready for review February 21, 2024 03:01
@limiteinductive (Member) left a comment

@deltheil PTAL. @isamu-isozaki could you rebase?

@limiteinductive (Member)

When trying with bfloat16 I got this error:
[error screenshot]

@limiteinductive (Member)

Even with float16 I have an error:
[error screenshot]

@limiteinductive limiteinductive removed the request for review from deltheil February 21, 2024 09:05
@limiteinductive (Member) left a comment

See runtime bugs found

@isamu-isozaki (Contributor, Author)

isamu-isozaki commented Feb 21, 2024

@limiteinductive Ok, a slight update. For bfloat16, grad scaling doesn't seem necessary because it has exactly the same number of exponent bits as float32; for fp16, though, it is necessary. Traditionally, the steps to make this work for float16 seem to be:

  1. Keep the model in fp32 precision.
  2. Have a scaler.
  3. Autocast to fp16 during training.

For autocasting, it's usually done like this:
with autocast(device_type='cuda', dtype=torch.float16):
    ...  # forward pass and loss computation

but in accelerate they do

if self.mixed_precision == "fp16" and is_torch_version(">=", "1.10"):
    model.forward = torch.cuda.amp.autocast(dtype=torch.float16)(model.forward)

and then do

model.forward = convert_outputs_to_fp32(model.forward)

I think I'll try to find a simpler way since these need the model to be "wrapped".
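
For reference, a minimal sketch of that canonical recipe (model, optimizer, data_loader and loss_fn are placeholders here, not names from this PR):

import torch

scaler = torch.cuda.amp.GradScaler()  # step 2: have a scaler

for inputs, targets in data_loader:
    optimizer.zero_grad()
    # step 3: autocast to fp16 for the forward pass; the weights themselves stay fp32 (step 1)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()  # scale the loss so fp16 gradients don't underflow
    scaler.step(optimizer)         # unscales the gradients internally, then steps the optimizer
    scaler.update()                # adjust the scale factor for the next iteration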

@isamu-isozaki (Contributor, Author)

Now that I think about it, even in bfloat16 the gradients should be fp32

@isamu-isozaki isamu-isozaki marked this pull request as draft February 21, 2024 15:21
@isamu-isozaki (Contributor, Author)

Ok, fixed. For this kind of training, the models being trained have to be in fp32, while all other models can stay in self.dtype.
The reason I modified the dtypes in the pipelines is that they automatically convert the training model's fp32 weights to self.dtype.
For the training step and evaluation, I think we have to use autocast; my guess is that it's the simplest way.
In my DINO PR, these changes made the gradients fp32 while training in bfloat16.
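
As a quick sanity check of that behaviour (a toy example, not the trainer code from this PR): float32 parameters combined with a bfloat16 autocast region still produce float32 gradients.

import torch

model = torch.nn.Linear(8, 4).cuda()    # parameters are float32
x = torch.randn(2, 8, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).sum()               # forward runs in bfloat16 where appropriate

loss.backward()
print(model.weight.dtype, model.weight.grad.dtype)  # torch.float32 torch.float32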

@limiteinductive limiteinductive added run-ci Run CI and removed run-ci Run CI labels Feb 26, 2024
@limiteinductive (Member) left a comment

@deltheil PTAL

@deltheil (Member) left a comment

Some comments, PTAL cc @limiteinductive

@@ -105,6 +106,9 @@ def wrapper(self: Trainer[BaseConfig, Any], config: ModelConfigT) -> fl.Module:
    if config.requires_grad is not None:
        model.requires_grad_(requires_grad=config.requires_grad)
    learnable_parameters = [param for param in model.parameters() if param.requires_grad]
    if self.config.training.automatic_mixed_precision:
        for learnable_parameter in learnable_parameters:
            learnable_parameter.to(dtype=float32)

Member:

I guess this is because with AMP some layers/ops behave better (range) with float32, right? Doesn't this deserve a comment?

Comment on lines 25 to 28
automatic_mixed_precision: bool = (
    True  # Enables automatic mixed precision which allows float32 gradients while working with lower precision.
)
dtype: str = "float32"

Member:

Since AMP is on by default (opt-out), and given that this dtype config is used jointly with it via the autocast(dtype=self.dtype, enabled=self.config.training.automatic_mixed_precision) call, any reason not to pick a sane default here? Namely: either float16 or bfloat16. Might also deserve a comment.

(accelerate even offers to "Choose from ‘no’,‘fp16’,‘bf16 or ‘fp8’.")

Member:

We could set it to False by default, and even though float32 + amp doesn't do much, I don't see this as a big issue; I would rather keep the API straightforward.

if self.scaler is None:
    backward(tensors=scaled_loss)
    return
self.scaler.scale(scaled_loss).backward()  # type: ignore

Member:

Note: as discussed offline, a follow up PR will be needed to properly unscale the gradients before gradient clipping (which happens in between the backward step and optimizer step, as per https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients)

But also, what about gradient accumulation? https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-accumulation
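
For the gradient clipping follow-up, the pattern in the linked AMP notes is to unscale first so the clipping threshold applies to the true gradient magnitudes. A minimal sketch (toy model and optimizer, not the trainer from this PR):

import torch

model = torch.nn.Linear(8, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 8, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).sum()

scaler.scale(loss).backward()
scaler.unscale_(optimizer)                                        # gradients are now back at their true scale
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip the unscaled gradients
scaler.step(optimizer)   # skips the step if any gradient is inf/nan
scaler.update()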

@isamu-isozaki (Contributor, Author)

I found that if we do

# For all parameters we train in automatic mixed precision we want them to be in float32.
for learnable_parameter in learnable_parameters:
    learnable_parameter.to(dtype=float32)

while it's clearer, it doesn't actually set the dtype of the parameters to float32, since Tensor.to returns a new tensor instead of converting the parameter in place. So the correct way to do it is to set the parameter dtype in the model before register_model (when requires_grad is set), and then, if not using amp, set the grad to self.dtype, I think.
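
A minimal illustration of that pitfall (toy module, not the trainer code): Tensor.to() returns a new tensor, so calling it on each parameter in a loop leaves the module untouched, whereas Module.to() converts the parameters in place.

import torch

model = torch.nn.Linear(4, 4).to(dtype=torch.float16)

for p in model.parameters():
    p.to(dtype=torch.float32)      # result is discarded; the parameter keeps its dtype
print(model.weight.dtype)          # torch.float16

model.to(dtype=torch.float32)      # Module.to() converts the parameters in place
print(model.weight.dtype)          # torch.float32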
