bug/feature: Fixing mixed precision training #290

Open
wants to merge 69 commits into main

Conversation

isamu-isozaki (Contributor)

This makes it so we can do lower precision training. @piercus might be interested

@isamu-isozaki isamu-isozaki changed the title from "Adding grad scaler" to "bug/feature: Adding grad scaler" Feb 16, 2024
@isamu-isozaki isamu-isozaki marked this pull request as draft February 16, 2024 19:14
@isamu-isozaki isamu-isozaki marked this pull request as ready for review February 21, 2024 02:54
@isamu-isozaki isamu-isozaki marked this pull request as draft February 21, 2024 02:58
@isamu-isozaki isamu-isozaki marked this pull request as ready for review February 21, 2024 03:01
@limiteinductive (Member) left a comment

@deltheil PTAL. @isamu-isozaki could you rebase?

@limiteinductive (Member)

When trying with bfloat16 I got this error:
[error screenshot]

@limiteinductive (Member)

Even with float16 I have an error:
[error screenshot]

@limiteinductive limiteinductive removed the request for review from deltheil February 21, 2024 09:05
@limiteinductive (Member) left a comment

See runtime bugs found

@isamu-isozaki (Contributor, Author)

isamu-isozaki commented Feb 21, 2024

@limiteinductive Ok, a slight update. For bfloat16, grad scaling doesn't seem necessary because it has exactly the same number of exponent bits as float32; for fp16, though, it is necessary. Traditionally, the steps to make this work for float16 seem to be:

  1. Keep the model in fp32 precision.
  2. Have a scaler.
  3. Autocast to fp16 during training.

For autocasting, it's usually done like this:
with autocast(device_type='cuda', dtype=torch.float16):
    ...  # forward pass and loss computation

but in accelerate they do

if self.mixed_precision == "fp16" and is_torch_version(">=", "1.10"):
    model.forward = torch.cuda.amp.autocast(dtype=torch.float16)(model.forward)

and then do

model.forward = convert_outputs_to_fp32(model.forward)

I think I'll try to find a simpler way since these need the model to be "wrapped".
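
For reference, a minimal sketch of that canonical recipe (model, optimizer, data_loader and loss_fn are placeholders here, not names from this PR):

import torch

scaler = torch.cuda.amp.GradScaler()  # step 2: have a scaler

for inputs, targets in data_loader:
    optimizer.zero_grad()
    # step 3: autocast to fp16 for the forward pass; the weights themselves stay fp32 (step 1)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
    scaler.scale(loss).backward()  # scale the loss so fp16 gradients don't underflow
    scaler.step(optimizer)         # unscales the gradients internally, then steps the optimizer
    scaler.update()                # adjust the scale factor for the next iteration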

@isamu-isozaki (Contributor, Author)

Now that I think about it, even in bfloat16 the gradients should be fp32

@isamu-isozaki isamu-isozaki marked this pull request as draft February 21, 2024 15:21
@isamu-isozaki (Contributor, Author)

Ok, fixed. For this kind of training, the models being trained have to be in fp32, while all other models can stay in self.dtype.
The reason I modified the dtypes in the pipelines is that they automatically convert the training model's fp32 weights to self.dtype.
For the training step and evaluation, I think we have to use autocast; my guess is that it's the simplest way.
In my DINO PR, these changes made the gradients fp32 while training in bfloat16.
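
As a quick sanity check of that behaviour (a toy example, not the trainer code from this PR): float32 parameters combined with a bfloat16 autocast region still produce float32 gradients.

import torch

model = torch.nn.Linear(8, 4).cuda()    # parameters are float32
x = torch.randn(2, 8, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).sum()               # forward runs in bfloat16 where appropriate

loss.backward()
print(model.weight.dtype, model.weight.grad.dtype)  # torch.float32 torch.float32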

@limiteinductive limiteinductive added run-ci Run CI and removed run-ci Run CI labels Feb 26, 2024
@limiteinductive (Member) left a comment

@deltheil PTAL

@deltheil (Member) left a comment

Some comments, PTAL cc @limiteinductive

@@ -105,6 +106,9 @@ def wrapper(self: Trainer[BaseConfig, Any], config: ModelConfigT) -> fl.Module:
    if config.requires_grad is not None:
        model.requires_grad_(requires_grad=config.requires_grad)
    learnable_parameters = [param for param in model.parameters() if param.requires_grad]
    if self.config.training.automatic_mixed_precision:
        for learnable_parameter in learnable_parameters:
            learnable_parameter.to(dtype=float32)

Member:

I guess this is because with AMP some layers/ops behave better (range) with float32, right? Doesn't this deserve a comment?

Comment on lines 25 to 28
automatic_mixed_precision: bool = (
    True  # Enables automatic mixed precision which allows float32 gradients while working with lower precision.
)
dtype: str = "float32"

Member:

Since AMP is on by default (opt-out), and given that this dtype config is used jointly with it via the autocast(dtype=self.dtype, enabled=self.config.training.automatic_mixed_precision) call, any reason not to pick a sane default here? Namely: either float16 or bfloat16. Might also deserve a comment.

(accelerate even offers to "Choose from ‘no’,‘fp16’,‘bf16 or ‘fp8’.")

Member:

We could set it to False by default, and even though float32 + amp doesn't do much, I don't see this as a big issue; I would rather keep the API straightforward.

if self.scaler is None:
    backward(tensors=scaled_loss)
    return
self.scaler.scale(scaled_loss).backward()  # type: ignore

Member:

Note: as discussed offline, a follow up PR will be needed to properly unscale the gradients before gradient clipping (which happens in between the backward step and optimizer step, as per https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients)

But also, what about gradient accumulation? https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-accumulation
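
For the gradient clipping follow-up, the pattern in the linked AMP notes is to unscale first so the clipping threshold applies to the true gradient magnitudes. A minimal sketch (toy model and optimizer, not the trainer from this PR):

import torch

model = torch.nn.Linear(8, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 8, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).sum()

scaler.scale(loss).backward()
scaler.unscale_(optimizer)                                        # gradients are now back at their true scale
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip the unscaled gradients
scaler.step(optimizer)   # skips the step if any gradient is inf/nan
scaler.update()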

@isamu-isozaki (Contributor, Author)

I found that if we do

# For all parameters we train in automatic mixed precision we want them to be in float32.
for learnable_parameter in learnable_parameters:
    learnable_parameter.to(dtype=float32)

while it's clearer, it doesn't actually set the dtype of the parameters to float32, since Tensor.to returns a new tensor instead of converting the parameter in place. So the correct way to do it is to set the parameter dtype in the model before register_model (when requires_grad is set), and then, if not using amp, set the grad to self.dtype, I think.
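
A minimal illustration of that pitfall (toy module, not the trainer code): Tensor.to() returns a new tensor, so calling it on each parameter in a loop leaves the module untouched, whereas Module.to() converts the parameters in place.

import torch

model = torch.nn.Linear(4, 4).to(dtype=torch.float16)

for p in model.parameters():
    p.to(dtype=torch.float32)      # result is discarded; the parameter keeps its dtype
print(model.weight.dtype)          # torch.float16

model.to(dtype=torch.float32)      # Module.to() converts the parameters in place
print(model.weight.dtype)          # torch.float32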
