
[multimodal] Enable backbone freezing to boost finetuning speed and save GPU usage #3220

Merged: 13 commits merged into autogluon:master on May 25, 2023

Conversation

@FANGAreNotGnu (Contributor) commented May 17, 2023

The current two-stage/layerwise-decay learning rate settings support using different learning rates for head and non-head layers, and they allow the user to set the learning rate of non-head layers to 0. However,

  • Most models are more complicated than backbone + head, and we may want to combine two-stage lr on head vs. non-head layers with layer freezing on backbone vs. non-backbone layers. For example, backbone_lr=1e-5, neck_lr=1e-5, head_lr=1e-3 and backbone_lr=0, neck_lr=1e-4, head_lr=1e-4 may both converge quickly and well, while backbone_lr=0, neck_lr=0, head_lr=1e-4 may not.
  • Only setting lr=0 in two-stage mode does not save GPU memory. Setting requires_grad=False for the backbone parameters saves a lot of GPU memory and lets us run a larger batch_size or a larger model.

So this PR introduces a new hyperparameter, model.mmdet.frozen_layers, that disables gradient updates for the backbone. It can be used together with any lr_choice, i.e. "single_lr", "two_stage", or "layerwise_decay".
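As a rough illustration, backbone freezing would be enabled through the predictor's hyperparameter overrides, following the usual object-detection usage pattern. This is a hedged sketch: the value format of frozen_layers, the lr_choice value, and the training-data path are assumptions for illustration, not taken from this PR.

```python
from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(
    problem_type="object_detection",
    hyperparameters={
        # Freeze the detector backbone (its parameters get requires_grad=False).
        # The list-of-patterns value format is an assumption for illustration.
        "model.mmdet.frozen_layers": ["backbone"],
        # Freezing composes with any lr_choice; option name taken from the PR
        # description, the exact accepted values may differ in the merged config.
        "optimization.lr_choice": "two_stage",
    },
)
predictor.fit("trainval_coco.json")  # hypothetical COCO-format annotation file
```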

Future work:
Due to bandwidth limits, this is only added to lit_mmdet. We will need to benchmark other problem types and add the option to the corresponding lit modules. The lit modules may also need a refactor (adding a base module for a better OOP design).

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@FANGAreNotGnu changed the title from [WIP][multimodal] Enable backbone freezing to boost finetuning speed and save GPU usage to [multimodal] Enable backbone freezing to boost finetuning speed and save GPU usage on May 17, 2023
@FANGAreNotGnu added the model list checked label (You have updated the model list after modifying multimodal unit tests/docs) on May 17, 2023
@tonyhoo (Collaborator) commented May 17, 2023

Thanks @FANGAreNotGnu, can you run some quick tests to make sure the memory footprint is reduced after this change?

@FANGAreNotGnu (Contributor, Author) commented:

> Thanks @FANGAreNotGnu, can you run some quick tests to make sure the memory footprint is reduced after this change?

Yes, I did. With this change, YOLOX-L GPU usage at batch_size=8 drops from >16 GB to <10 GB, and the trainable parameter count drops from 54M to 27M.
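For reference, the kind of check described above can be reproduced with a few lines of generic PyTorch (not code from this PR):

```python
import torch


def count_trainable_params(model: torch.nn.Module) -> int:
    """Count parameters that still receive gradients (and therefore optimizer state)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# After a few training steps on GPU, peak memory can be compared before/after freezing:
#   torch.cuda.max_memory_allocated() / 1024**3   # peak allocated GPU memory in GiB
```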

@@ -390,6 +391,18 @@ def get_layer_ids(

return name_to_id

def get_backbone_layer_names(self):
backbone_layer_names = []
backbone_layers_patterns = [
Collaborator:

Does this cover all backbone layer names used by mmdet?

@FANGAreNotGnu (Contributor, Author):

It works for common models in mmdet 2. I just found out that it may not work for DINO (in mmdet 3); that will be fixed in the next PR, which enables DINO support.

"encoder",
]
for n, _ in self.named_parameters():
for pattern in backbone_layers_patterns:
Collaborator:

What if a layer name contains both "backbone" and "encoder"? Will the layer name be added twice in the current implementation? Is that intended?

@FANGAreNotGnu (Contributor, Author):

Thanks for pointing this out. I made the edits to use any().

Ideally, "model.backbone.xxx" parameter names are used by non-transformer models and "model.encoder.xxx" by transformer-based models, so there should be no conflicts. It also wouldn't cause a problem even if there were.

@@ -628,6 +628,73 @@ def get_trainable_params_efficient_finetune(
return trainable_param_names


def apply_freeze_backbone_lr(
Contributor:

Is freezing the backbone a special case of two-stage lr? The difference is whether requires_grad is set to False for the backbone parameters.

@FANGAreNotGnu (Contributor, Author):

The frozen layers in the freeze-backbone setting should be a subset of the low-learning-rate layers in the two-stage setting. See the PR description for details.

Contributor:

I see. But in your example, neck_lr and head_lr are still the same, so the corresponding layers, including the neck, can be treated as "head" layers.

Contributor:

I guess most users don't need to set different learning rates for the neck and the head. So, from the learning-rate point of view, it's still two-stage.

Contributor:

I think we may need to improve apply_two_stages_lr or apply_layerwise_lr_decay to support requires_grad=False for parameters whose learning rate is 0. For object detection models, the head_layer_names attribute can include both neck and head layers.

@FANGAreNotGnu (Contributor, Author):

Our current two-stage lr implementation has two issues:

  • It only supports splitting head layers vs. all other layers.
  • It uses the base lr on the other layers and lr * lr_mult on the head layers.

To support freezing the backbone within the current two-stage implementation, we would need to:

  • Add a hyperparameter providing the option to freeze layers whose lr is 0.
  • Add a hyperparameter to choose between splitting head layers vs. other layers and backbone layers vs. other layers.
  • Use the base lr on head/non-backbone layers and lr / lr_mult on non-head/backbone layers.

The third change is an implicit API change and may confuse users.

@FANGAreNotGnu (Contributor, Author):

Two-stage lr on head layers vs. other layers may still be useful, since it converges faster.

Contributor:

Right. apply_two_stages_lr can't directly support lr=0 for backbone parameters because of lr * lr_mult. I think there are two choices:

  • Change the design of lr_mult in two-stage lr: let lr be the head lr, and let the backbone use lr * lr_mult with 0 <= lr_mult < 1.
  • Use apply_layerwise_lr_decay instead of apply_two_stages_lr. apply_layerwise_lr_decay can simulate the case where the head uses lr and the backbone uses a learning rate of 0. In general, layerwise learning rate decay should perform better than two-stage if we can find a good lr_decay; see the sketch after this comment.
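To illustrate how layerwise decay can simulate a frozen backbone: each layer group's learning rate is scaled down geometrically with its distance from the head, so a small decay factor drives the backbone's effective lr toward 0. A generic sketch (not the AutoGluon implementation; the layer-id convention is simplified):

```python
def layerwise_lr(base_lr: float, layer_id: int, num_layers: int, decay: float) -> float:
    """Per-layer learning rate: the head (highest layer_id) keeps base_lr,
    earlier layers are scaled down geometrically."""
    return base_lr * decay ** (num_layers - layer_id)


# With base_lr=1e-4, decay=0.1 and layer ids 0..6, the head keeps 1e-4 while the
# earliest backbone group gets 1e-4 * 0.1**6 = 1e-10, i.e. effectively frozen,
# although (unlike requires_grad=False) gradients and optimizer state are still
# computed and stored for it.
for layer_id in range(7):
    print(layer_id, layerwise_lr(1e-4, layer_id, num_layers=6, decay=0.1))
```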

Contributor:

I think the definition of head layers can be model-specific. For object detection models, we can count the neck layers as head, since we want to finetune them together with the real head layers.

@github-actions

Job PR-3220-d859300 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3220/d859300/index.html

@zhiqiangdon (Contributor) commented:

> [quotes the problem statement and bullet points from the PR description above]

Looks like this PR only addresses the second point, i.e., setting requires_grad=False for the backbone parameters.

@github-actions

Job PR-3220-48164f5 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3220/48164f5/index.html

@github-actions

Job PR-3220-136b7dc is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3220/136b7dc/index.html

@github-actions

Job PR-3220-24f3ae3 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3220/24f3ae3/index.html

@FANGAreNotGnu (Contributor, Author) commented:

Changed the design based on some offline discussions.

@github-actions

Job PR-3220-9b0744f is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3220/9b0744f/index.html

@github-actions

Job PR-3220-2b21df0 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3220/2b21df0/index.html

@github-actions

Job PR-3220-833f1fd is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3220/833f1fd/index.html

multimodal/src/autogluon/multimodal/optimization/utils.py (outdated review comments, resolved)
multimodal/src/autogluon/multimodal/models/utils.py (outdated review comments, resolved)
@@ -149,6 +149,7 @@ model:
- "image"
max_img_num_per_col: 1
output_bbox_format: "xyxy" # now support xyxy or xywh, for bbox format details see https://keras.io/api/keras_cv/bounding_box/formats/
frozen_layers: null
Contributor:

What if we remove the null? Is there a difference between leaving the value empty and writing null?

@FANGAreNotGnu (Contributor, Author):

I think there's no difference. We are using both in the configs.

@github-actions

Job PR-3220-d2af22a is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3220/d2af22a/index.html

@zhiqiangdon (Contributor) left a review:

LGTM.

@zhiqiangdon merged commit 476164f into autogluon:master on May 25, 2023
29 checks passed