After reading your paper, I have a question: how do you handle the multi-patch (256) inputs in the encoder? It seems that the encoder fuses the 256 patches and learns a single feature map (of size (H/16, W/16, D)) for the whole original image (rather than patch-wise feature maps), and this feature map is then decoded to generate the segmentation map. How are the 256 patches processed and fused in the encoder?
The size of each patch is 16×16. If the size of the input image is H×W, then the sequence length is (H/16)*(W/16) = HW/256, not 256.
The output feature of the encoder has size (HW/256, 1024), where HW/256 is the sequence length and 1024 is the embedding dimension. We then reshape it into a feature map of size (H/16, W/16, 1024) and connect it to the decoder. Please refer to Figure 1 in the main paper for more detail.
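Here is a minimal PyTorch sketch of the sequence-to-feature-map reshape described above, assuming patch size 16 and embedding dimension 1024; the variable names and shapes are illustrative, not taken from the repo.

```python
import torch

# Shapes follow the discussion above: patch size 16, embedding dimension 1024.
# B, H, W and encoder_tokens are illustrative placeholders.
B, H, W, D = 2, 512, 512, 1024          # batch, image height/width, embed dim
L = (H // 16) * (W // 16)               # sequence length = HW/256

encoder_tokens = torch.randn(B, L, D)   # encoder output: (B, HW/256, 1024)

# Reshape the token sequence back into a 2-D feature map for the decoder:
# (B, HW/256, 1024) -> (B, H/16, W/16, 1024) -> (B, 1024, H/16, W/16)
feature_map = encoder_tokens.reshape(B, H // 16, W // 16, D).permute(0, 3, 1, 2)
print(feature_map.shape)                # torch.Size([2, 1024, 32, 32])
```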
Thanks for your help. I think that in the encoder, all layers perform their computations (MSA & MLP) in an inter-patch way, which doesn't consider intra-patch information. Could this affect the ability to capture small or local features?
Agreed. The only intra-patch processing happens in the linear projection layer, which maps each 16×16×3 (RGB, 3-channel) patch to a 1×1×1024 token, and after that there is no opportunity for intra-patch interaction within the 1×1 token.
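For reference, here is a minimal sketch of that linear projection step, assuming the common ViT-style patch embedding where a strided convolution with kernel_size = stride = 16 is equivalent to flattening each 16×16×3 patch and applying one linear layer; the layer and variable names are illustrative, not from the repo.

```python
import torch
import torch.nn as nn

# Linear projection discussed above: each 16x16x3 patch -> one 1x1x1024 token.
patch_embed = nn.Conv2d(in_channels=3, out_channels=1024, kernel_size=16, stride=16)

x = torch.randn(1, 3, 512, 512)              # an RGB image
tokens = patch_embed(x)                      # (1, 1024, 32, 32): one 1024-d vector per patch
tokens = tokens.flatten(2).transpose(1, 2)   # (1, HW/256, 1024) token sequence

# After this step each patch is a single token, so the subsequent MSA/MLP layers
# only mix information *between* patches, not within a patch.
print(tokens.shape)
```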