
Unmentioned but critical LayerNorm #3

Open
gathierry opened this issue Mar 18, 2022 · 5 comments

Comments

@gathierry
Owner

To achieve results comparable to the original paper, LayerNorm is applied to the feature before the NF. This is never mentioned in the paper and the usage is very tricky (but it is the only way that works for me); see the sketch after this list:

  • resnet18 and wide-resnet-50: use a trainable LayerNorm
  • CaiT and DeiT: use the final norm from the pre-trained model and fix its affine parameters
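A minimal sketch of the two usages, assuming a (B, C, H, W) feature map for the ResNets and a (B, N, C) token feature for CaiT/DeiT; all shapes and the `final_norm` stand-in are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

# Case 1 (resnet18 / wide-resnet-50): a trainable LayerNorm over the
# full (C, H, W) feature shape, learned together with the NF.
B, C, H, W = 4, 256, 16, 16           # illustrative shapes
feature = torch.randn(B, C, H, W)
norm = nn.LayerNorm([C, H, W], elementwise_affine=True)  # trainable affine
feature = norm(feature)

# Case 2 (CaiT / DeiT): reuse the pre-trained model's final norm and
# freeze its affine parameters before feeding tokens to the NF.
tokens = torch.randn(B, 196, 384)     # (B, N, C), illustrative
final_norm = nn.LayerNorm(384)        # stands in for the model's own `norm`
for p in final_norm.parameters():
    p.requires_grad = False           # "fix its affine parameters"
tokens = final_norm(tokens)
```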
@cytotoxicity8

I measured the performance of the models without the LayerNorm parts. For both resnet18 and wide-resnet50, AUROC was quite similar, sometimes even better than the original ones. DeiT also showed comparable performance (lower by about 0.03~0.05). However, with CaiT the loss was extremely high and AUROC was 0.5! I can't understand why these models behave so differently depending on Layer Normalization.

@cytotoxicity8

cytotoxicity8 commented May 28, 2022

[image: loss curves; the red one is w/o elementwise-affine]
I am experimenting with ways to improve FastFlow; discussion is always welcome.

@AncientRemember

Use x = x.flatten(2).transpose(1, 2) to reshape the feature map from BCHW to (B, N, C); that way the LayerNorm does not depend on the input size.
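A small sketch of that reshaping, with illustrative shapes (the variable names are not from the repo):

```python
import torch
import torch.nn as nn

# Reshape BCHW -> (B, N, C) so LayerNorm only needs the channel count C
# and no longer depends on H and W.
B, C, H, W = 4, 256, 16, 16
x = torch.randn(B, C, H, W)
x = x.flatten(2).transpose(1, 2)   # (B, C, H*W) -> (B, H*W, C)
x = nn.LayerNorm(C)(x)             # normalizes over C only
# Undo the reshape if the NF expects a spatial feature map again:
x = x.transpose(1, 2).reshape(B, C, H, W)
```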

@AncientRemember

AncientRemember commented Sep 23, 2022

Maybe using BN after the conv2d will work.
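A hedged sketch of that idea: BatchNorm2d's statistics are per channel, so it is input-size independent by construction (the layers here are placeholders, not FastFlow's actual layers):

```python
import torch
import torch.nn as nn

# BatchNorm2d after a conv as an alternative to LayerNorm on BCHW features.
conv = nn.Conv2d(256, 256, kernel_size=3, padding=1)
bn = nn.BatchNorm2d(256)           # per-channel statistics, any H/W
x = torch.randn(4, 256, 16, 16)
x = bn(conv(x))
```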

@gathierry
Owner Author

gathierry commented Sep 23, 2022

Well, after learning more about transformers, I realize that adding LayerNorm to intermediate output feature maps is very common, for example when using transformers as the backbone in semantic segmentation (https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation/blob/87e6f90577435c94f3e92c7db1d36edc234d91f6/mmseg/models/backbones/swin_transformer.py#L620). So I guess that's why the paper never mentioned it.
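For reference, a sketch of the pattern the linked Swin backbone follows: one LayerNorm per stage, applied to each intermediate output before it is handed to the downstream head (channel sizes are illustrative, not Swin's actual configuration):

```python
import torch
import torch.nn as nn

# One LayerNorm per backbone stage, applied to each intermediate
# output feature map over its channel dimension.
stage_channels = [96, 192, 384, 768]
norms = nn.ModuleList(nn.LayerNorm(c) for c in stage_channels)

def norm_stage_output(i, x):
    """Normalize the (B, C, H, W) output of stage i over its channels."""
    B, C, H, W = x.shape
    x = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
    x = norms[i](x)
    return x.transpose(1, 2).reshape(B, C, H, W)
```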

And for resnet, maybe LayerNorm is not necessary, as pointed out by @cytotoxicity8.
