Using ConViT for object detection [Discussion] #21

sfarkya04 · 2021-12-03T19:42:54Z

Thank you for your work and for providing the code.

Based on my limited knowledge of the transformers used in detectors and trackers, I see a common trend of using a ResNet-50 backbone for feature extraction before the transformer layers. The only paper I could find without a conv backbone is WB-DETR, which introduces a module similar to the T2T module that it claims captures "rich local information" from each patch. In general, the argument I have read is that transformers do not capture the local information within a patch well and hence miss small objects.

Do you think ConViT would be able to handle object detection if used as the backbone in the DETR framework instead of ResNet-50 + Transformer? Intuitively, the GPSA from your paper should be able to capture convolutional features from the image as well as the SA properties, based on the loss function from DETR.

I would be happy to try it out if you think it's possible.
I have been trying to find an architecture that doesn't use conv layers and works on a patch-based representation, and your work seemed very relevant.

Any comments would be really helpful.

Thank you.
Saurabh

sdascoli · 2021-12-04T20:35:39Z

Dear Saurabh,
Thanks for this good question. I would tend to think that if patch-based models do not have enough local information to perform detection, then the ConViT might suffer from the same issue, as it is also patch-based: the ConViT introduces locality between patches, not within a given patch, and that might not be enough. However, I am not a specialist in detection, and this could totally be worth trying; if the locality of the ConViT is not enough on its own, perhaps one could reduce the patch size, or introduce a small CNN in the patchifier, as suggested in https://arxiv.org/pdf/2106.14881.pdf.
Best wishes,
Stéphane
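
To make the "reduce the patch size" suggestion concrete, here is a rough sketch of how the token count, and the number of token pairs a single self-attention map has to score, grow as patches shrink. The 224-pixel input and the patch sizes 32, 16 and 8 are assumptions chosen for illustration (they are not from this thread), and `patch_grid` is a hypothetical helper, not part of the ConViT code.

```python
def patch_grid(image_size, patch_size):
    """Return (num_tokens, num_token_pairs) for a square image split into square patches.

    num_token_pairs is the size of one full self-attention map (tokens x tokens).
    Class tokens and other extra tokens are ignored in this sketch.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size      # patches along one image side
    tokens = side * side                 # total patch tokens
    return tokens, tokens * tokens       # attention scores every token pair


# Assumed example values: a 224x224 input and three candidate patch sizes.
for patch_size in (32, 16, 8):
    tokens, pairs = patch_grid(224, patch_size)
    print(f"patch {patch_size:>2}: {tokens:>4} tokens, {pairs:>7} attention pairs")
```

Halving the patch size quadruples the number of tokens and multiplies the attention map size by 16, so finer patches give each token a smaller image region at a steep compute and memory cost.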
