Questions regarding backbone network #26

Closed
henriquepm opened this issue Feb 15, 2023 · 5 comments


@henriquepm

Hi! First of all thank you for the great quality of this work, both the paper and the code.
I have a couple of doubts regarding the backbone:

  1. As mentioned in issue Questions on architecture design choices #24, the image features in the repo come from concatenating the output of the second layer with the upsampled output of the third layer. The paper instead states that the features come from concatenating the output of the third layer with the upsampled output of the last layer, yielding feature maps of dimension C x H/8 x W/8, whereas the approach in the code would produce feature maps of dimension C x H/4 x W/4. Which of the two approaches produced the results reported in the paper? And does the difference have a significant effect on performance (if both have been tested)?
  2. The paper mentions that the ResNet-101 backbone is initialized from COCO pretraining, citing the DETR paper, while in the code the network is initialized from the torchvision default weights (ImageNet pretraining). In the experiments section of the paper, the effect of input resolution is discussed, and it is hypothesised that the decreasing performance at higher resolutions could be explained by worse transfer due to a mismatch with the pretraining scale. Do the results in that section come from the approach described in the paper (COCO pretraining) or the one in the code? If you have run experiments with both, does this make any significant difference?
    Thanks again.
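To make the two candidate resolutions concrete, here is a quick sketch. The input size is assumed for illustration, and the channel dimension C is left symbolic; this is just the spatial arithmetic behind the two readings, not code from the repo:

```python
# Assumed input resolution, for illustration only.
H, W = 448, 800

# Resolution stated in the paper: C x H/8 x W/8.
paper_hw = (H // 8, W // 8)
# Resolution I read from the code: C x H/4 x W/4.
code_hw = (H // 4, W // 4)

print(paper_hw)  # (56, 100)
print(code_hw)   # (112, 200)
```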
@aharley
Owner

aharley commented Feb 16, 2023

Thanks for these questions.

  1. The paper results come from this repo (or a slightly messier version of it). I will either update the dimension line in the paper, or add an experiment with H/8 x W/8. (Do you already know if H/8 x W/8 is much different?)
  2. That's a great point. I need to think and check back to see why the paper says COCO while the code clearly indicates ImageNet. It could be that we used COCO inits very early on, then switched to ImageNet while simplifying the codebase.

@henriquepm
Author

Thanks for the quick answer. I don't know at the moment; I'm planning to run some experiments with the backbones and wanted to understand the starting point as well as possible.

@aharley
Owner

aharley commented Feb 23, 2023

@henriquepm I'm coming back to this to check the /4 and /8 stuff. I added a bunch of shape prints to the forward of Encoder_res101, and right now I'm not sure why you said "the approach in the code will produce FM of dimension C x H/4 x W/4".

def forward(self, x):
    print('x in', x.shape)
    x1 = self.backbone(x)
    print('x1', x1.shape)
    x2 = self.layer3(x1)
    print('x2', x2.shape)
    x = self.upsampling_layer(x2, x1)
    print('x up', x.shape)
    x = self.depth_layer(x)
    print('x d', x.shape)
    return x

The output is:

x in torch.Size([6, 3, 448, 800])
x1 torch.Size([6, 512, 56, 100])
x2 torch.Size([6, 1024, 28, 50])
x up torch.Size([6, 512, 56, 100])
x d torch.Size([6, 128, 56, 100])

which looks like H/8, W/8 like the paper said. I may easily have missed something because I haven't used the repo in a little bit, so please let me know if you see something wrong.
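As a quick sanity check on those prints, the effective stride of the final encoder output relative to the network input (sizes copied from the output above):

```python
# Shapes copied from the printed output above.
in_h, in_w = 448, 800    # 'x in': network input
out_h, out_w = 56, 100   # 'x d': final encoder output

stride_h = in_h // out_h
stride_w = in_w // out_w
print(stride_h, stride_w)  # 8 8 -> H/8 x W/8, matching the paper
```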

@aharley aharley reopened this Feb 23, 2023
@henriquepm
Author

Hey, that looks totally right, sorry about that.
I went back to the notebook where I was dissecting the network: I was comparing the feature size against the output of the ResNet's first conv layer instead of the actual input, so I was missing a factor of 1/2.
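A minimal sketch of that off-by-2, assuming the standard stride-2 first conv in the ResNet stem:

```python
in_h = 448          # actual network input height
conv1_h = in_h // 2 # first conv output (already downsampled by 2)
feat_h = 56         # encoder feature height

true_stride = in_h // feat_h         # measured against the input: 8 -> H/8
apparent_stride = conv1_h // feat_h  # measured against conv1: 4 -> looks like H/4
print(true_stride, apparent_stride)
```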

@aharley
Owner

aharley commented Feb 23, 2023

Perfect, no problem. Thanks for confirming so quickly!

@aharley aharley closed this as completed Feb 23, 2023