This repository has been archived by the owner on Dec 5, 2022. It is now read-only.

Using ConViT for object detection [Discussion] #21

Open
sfarkya04 opened this issue Dec 3, 2021 · 1 comment


sfarkya04 commented Dec 3, 2021

Thank you for your work and for providing the code.

Based on my limited knowledge of the transformers used in detectors and trackers, I see a common trend of using a ResNet-50 backbone for feature extraction before the transformer layers. The only paper I could find without a conv backbone is WB-DETR, which introduces a module similar to the T2T module and claims that it captures "rich local information" from the patch. In general, the argument I have read is that transformers do not capture the local information within a patch well and hence miss small objects.

Do you think ConViT would be able to handle object detection if used as the backbone in the DETR framework instead of ResNet-50 + Transformer? Intuitively, the GPSA from your paper should be able to capture convolutional features from the image as well as the SA properties, driven by the loss function from DETR.

I would be happy to try it out if you think it's possible.
I have been trying to find an architecture that doesn't use conv layers and works on a patch-based representation, and your work seemed very relevant.

Any comments would be really helpful.

Thank you.
Saurabh
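
For readers unfamiliar with the GPSA mentioned above, here is a minimal sketch of the gating idea for a single attention row: a content-based softmax and a position-based softmax are blended by a sigmoid gate. This is an illustration, not the repository's code. The function names, the 1-D patch layout, and the `-alpha * distance**2` positional score are assumptions made for this example; the paper uses a learned positional embedding over 2-D offsets.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_attention_row(content_scores, positional_scores, gate_logit):
    """Blend content and positional attention for one query patch.

    gate = sigmoid(gate_logit); a gate near 1 favours the positional
    (convolution-like) term, a gate near 0 favours content attention.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    content = softmax(content_scores)
    positional = softmax(positional_scores)
    return [(1.0 - gate) * c + gate * p for c, p in zip(content, positional)]


# Hypothetical setup: 5 patches in a row, query is the middle patch (index 2).
query_index = 2
alpha = 1.0  # assumed locality strength, not a value from the paper
positional_scores = [-alpha * (j - query_index) ** 2 for j in range(5)]
content_scores = [0.0] * 5  # uninformative content scores for the demo

local_row = gated_attention_row(content_scores, positional_scores, gate_logit=4.0)
global_row = gated_attention_row(content_scores, positional_scores, gate_logit=-20.0)
print("gate high:", [round(w, 3) for w in local_row])
print("gate low: ", [round(w, 3) for w in global_row])
```

With a high gate the row concentrates on the query's neighbours; with a low gate it falls back to whatever the content scores say. Note that the locality here is between patches, which is the point taken up in the reply below.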

Contributor

sdascoli commented Dec 4, 2021

Dear Saurabh,
Thanks for this good question. I would tend to think that if patch-based models do not have enough local information to perform detection, then the ConViT might suffer from the same issue, as it is also patch-based: the ConViT introduces locality between patches, not within a given patch, and that might not be enough. However, I am not a specialist in detection, and this could well be worth trying; if the locality of the ConViT is not enough on its own, perhaps one could reduce the patch size, or introduce a small CNN in the patchifier, as suggested in https://arxiv.org/pdf/2106.14881.pdf.
Best wishes,
Stéphane
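
To put a rough number on the "reduce the patch size" suggestion, the sketch below counts tokens and query-key pairs per attention layer for a few patch sizes. The 224x224 input resolution and the helper name `token_stats` are assumptions for illustration; the thread does not specify a resolution, and detection pipelines often use larger images, which makes the growth steeper.

```python
def token_stats(image_size, patch_size):
    """Return (num_tokens, num_attention_pairs) for a square image.

    Assumes non-overlapping square patches and full self-attention,
    so the number of query-key pairs is num_tokens squared.
    """
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    per_side = image_size // patch_size
    num_tokens = per_side * per_side
    return num_tokens, num_tokens * num_tokens


for patch_size in (16, 8, 4):
    tokens, pairs = token_stats(224, patch_size)
    print(f"patch {patch_size:>2}: {tokens:>5} tokens, {pairs:>8} attention pairs")
```

Halving the patch size quadruples the token count and multiplies the attention pairs by sixteen, which is one reason a small convolutional stem in the patchifier may be the cheaper way to add within-patch locality.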
