[multimodal] Enable backbone freezing to boost finetuning speed and save GPU usage #3220
Conversation
Thanks @FANGAreNotGnu, can you run some quick tests to make sure the memory footprint has been reduced after this change?
Yes I did. The YOLOX-L GPU usage with batch_size=8 is reduced from >16GB to <10GB (trainable parameters from 54M to 27M).
@@ -390,6 +391,18 @@ def get_layer_ids(

        return name_to_id

    def get_backbone_layer_names(self):
        backbone_layer_names = []
        backbone_layers_patterns = [
does this cover all backbone layer names used by mmdet?
It works for common models in mmdet2. Just found out that it may not work for DINO (in mmdet3); will fix this in the next PR that enables DINO support.
"encoder", | ||
] | ||
for n, _ in self.named_parameters(): | ||
for pattern in backbone_layers_patterns: |
What if the layer name has both "backbone" and "encoder" in it? The layer name will be added twice in current impl? Is that intended?
Thanks for pointing this out. Made the edits to use any().
Ideally, the "model.backbone.xxx" parameter name is used by non-transformer models and "model.encoder.xxx" by transformer-based models, so there should be no conflicts. It also won't cause a problem even if there is an overlap.
@@ -628,6 +628,73 @@ def get_trainable_params_efficient_finetune(

    return trainable_param_names


def apply_freeze_backbone_lr(
Is freezing the backbone a special case of two-stage lr? The difference is whether the requires_grad of the backbone parameters is set to False.
The frozen layers in the freeze-backbone setting should be a subset of the low-learning-rate layers in the two-stage setting. See the PR description for details.
I see. But in your example, the neck_lr and head_lr are still the same, so the corresponding layers, including the neck, can be treated as "head" layers.
I guess most users don't need to set different learning rates for neck and head. So, from the learning-rate point of view, it's still two-stage.
I think we may need to improve apply_two_stages_lr or apply_layerwise_lr_decay to support requires_grad=False for the parameters whose learning rate is 0. For object detection models, the head_layer_names attribute can have both neck and head layers.
Our current two-stage lr implementation has two issues:
- It only supports head layers vs. other layers.
- It uses the base lr on other layers, and lr * lr_mult on head layers.

To support backbone freezing in the current two-stage implementation, we would need to:
- Add a hyperparameter providing the option to freeze layers with lr=0.
- Add a hyperparameter to support using head layers vs. other layers, or backbone layers vs. other layers.
- Use the base lr on head/non-backbone layers, and lr / lr_mult on non-head/backbone layers.

The third change involves an implicit change in API and may cause user confusion.
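For reference, a rough sketch of the two-stage grouping described above; this is illustrative only, not AutoGluon's actual apply_two_stages_lr, and the freeze_non_head flag stands in for the proposed lr=0 + requires_grad=False behavior.

```python
def two_stage_param_groups(model, head_layer_names, lr, lr_mult, freeze_non_head=False):
    """Sketch: head layers get lr * lr_mult, other layers get the base lr.
    freeze_non_head illustrates the proposed freezing of non-head parameters."""
    head_params, other_params = [], []
    for name, param in model.named_parameters():
        if any(name.startswith(h) for h in head_layer_names):
            head_params.append(param)
        else:
            if freeze_non_head:
                # Skipping gradients entirely is what actually saves memory.
                param.requires_grad = False
            other_params.append(param)
    return [
        {"params": other_params, "lr": 0.0 if freeze_non_head else lr},  # base lr or frozen
        {"params": head_params, "lr": lr * lr_mult},                     # boosted head lr
    ]

# Usage idea: torch.optim.AdamW(two_stage_param_groups(model, ["head."], 1e-4, 10))
```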
Two-stage lr on head layers vs. other layers may still be useful since it converges faster.
Right. apply_two_stages_lr can't directly support lr=0 for backbone parameters due to lr * lr_mult. I think there are two choices:
- Change the design of lr_mult in two-stage lr: let lr be the head lr, and have the backbone lr use lr * lr_mult but with 0 <= lr_mult < 1.
- Use apply_layerwise_lr_decay instead of apply_two_stages_lr. apply_layerwise_lr_decay can simulate the case where the head uses lr and the backbone uses a 0 learning rate. In general, layerwise learning rate decay should perform better than two-stage if we can have a good lr_decay.
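As a toy illustration of the second option (not the real apply_layerwise_lr_decay, whose grouping is more involved), a small lr_decay drives the backbone learning rates toward 0 while requires_grad stays True, so no memory is saved:

```python
def layerwise_lrs(num_layers, lr, lr_decay):
    """Toy illustration: layer 0 is the head, higher ids are closer to the input.
    Each layer's lr is lr * lr_decay ** layer_id, so a small lr_decay pushes the
    backbone lrs toward 0 (parameters still require grad, hence no memory saving)."""
    return [lr * (lr_decay ** layer_id) for layer_id in range(num_layers)]

print(layerwise_lrs(4, 1e-4, 0.1))  # roughly [1e-04, 1e-05, 1e-06, 1e-07]
print(layerwise_lrs(4, 1e-4, 0.0))  # head keeps 1e-4, every other layer gets 0.0
```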
I think the definition of head layers can be model-specific. For object detection models, we can count the neck layers as head since we want to finetune them together with the real head layers.
Looks like this PR only addresses the second point, i.e., setting requires_grad=False for backbone parameters.
Changed the design based on some offline discussions.
@@ -149,6 +149,7 @@ model:
      - "image"
    max_img_num_per_col: 1
    output_bbox_format: "xyxy" # now support xyxy or xywh, for bbox format details see https://keras.io/api/keras_cv/bounding_box/formats/
    frozen_layers: null
What if we remove null? What's the difference between nothing and null?
I think there's no difference. We are using both in the configs.
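A quick way to confirm this (AutoGluon's multimodal configs are YAML loaded via OmegaConf): in YAML an empty value and an explicit null both parse to None, so the two spellings behave the same.

```python
from omegaconf import OmegaConf

# An empty value and an explicit `null` both parse to None.
cfg_null = OmegaConf.create("frozen_layers: null")
cfg_empty = OmegaConf.create("frozen_layers:")
print(cfg_null.frozen_layers, cfg_empty.frozen_layers)  # None None
```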
LGTM.
The current two-stage/layerwise-decay learning rate settings support using different lr in head and non-head layers. They also allow users to set the lr of non-head layers to 0. However:
- backbone_lr=1e-5, neck_lr=1e-5, head_lr=1e-3 and backbone_lr=0, neck_lr=1e-4, head_lr=1e-4 may both converge fast and nicely, but backbone_lr=0, neck_lr=0, head_lr=1e-4 may not be a good one.
- Setting require_grad=False for backbone parameters can save lots of GPU memory and enables us to run a larger batch_size or a larger model.

So here we introduce a new hyperparameter model.mmdet.frozen_layers that disables gradient updates for the backbone. It can be used together with any lr_choice, i.e. "single_lr", "two_stage", or "layerwise_decay".

Future work: due to bandwidth limits, this is only added to lit_mmdet. We will need to benchmark on other problem types and add it to the corresponding lit modules. The lit modules may also need a refactor (add a base module for a better OOP design).
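A hedged usage sketch of the new option; the value format for model.mmdet.frozen_layers and the other hyperparameter keys below are assumptions, not confirmed by this thread.

```python
from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(problem_type="object_detection")
predictor.fit(
    train_data="train_coco.json",  # placeholder path to COCO-format annotations
    hyperparameters={
        # Assumed value format: a list of name patterns to freeze; None/null keeps
        # the default behavior of training all layers.
        "model.mmdet.frozen_layers": ["backbone"],
        # Per the description above, freezing works with any lr_choice.
        "optimization.lr_choice": "two_stage",
    },
)
```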
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.