Some questions about the implementation details #2
Hi, thanks for your nice work and open-source project. I have the following questions about the implementation details of the teacher part:

1. How did you get the inference performance (e.g., 42.5 mAP in Table 2(a)) of the trained C-FF (Crossing Feature-level Fusion) teacher? Is it an ensemble of two inference parts (high/low-resolution inputs), an ensemble of three parts (low/high/fusion), or only from the fusion part?
2. Does the teacher's detection head for the fused features (returned by the C-FF module) share exactly the same learnable weights as the head used for the high/low-resolution inputs?

Thanks a lot for your reply.

Comments
Hi, thanks for your issue. The C-FF module is used to fuse the features of the high- and low-resolution inputs. The output of C-FF is a two-dimensional score, so we obtain the new features as a weighted sum of those two features using the obtained scores. After that, we feed the new features into the detection head to get the final results. For the second question, the fused features share the same detection head as the high- and low-resolution features; that is, we only have one detection head for the teacher. Hope this answers your questions.
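A minimal sketch of what such a crossing feature-level fusion block could look like in PyTorch. The module name, channel width, and softmax gating here are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFeatureFusion(nn.Module):
    """Hypothetical C-FF-style block: predict a 2-channel score map and
    use it as per-location weights to fuse the high- and low-resolution
    features (assumed to share the same spatial size)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Maps the concatenated features to a two-dimensional score.
        self.score = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, feat_hr: torch.Tensor, feat_lr: torch.Tensor) -> torch.Tensor:
        # Softmax over the two score channels so the weights sum to 1.
        w = F.softmax(self.score(torch.cat([feat_hr, feat_lr], dim=1)), dim=1)
        # Weighted sum: channel 0 weights the high-res feature, channel 1 the low-res one.
        return w[:, 0:1] * feat_hr + w[:, 1:2] * feat_lr
```

Per the answer above, the fused output then goes through the single detection head shared with the high- and low-resolution branches.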
Hi, so for question 1 you mean the inference performance is obtained from the fusion part only?
Yes. The inference performance is obtained from the fusion part.
Hi, another question: does a similar performance improvement for low-resolution input (e.g., FCOS 35.9 -> 37.8 -> 39.7 for vanilla multi-scale -> aligned multi-scale -> distilled student) also hold for other one-stage detectors like RetinaNet?
I think there is a similar performance improvement for RetinaNet with the 1x learning schedule. I trained RetinaNet and MEInst on the 3x schedule and they all follow the same trend. The performance of RetinaNet with 3x is reported in our camera-ready paper.
Hi, I'm planning to use the distillation method on my own customized dataset.
Hi, I'm sorry, I have another question about the teacher's training details, concerning the definition of positive/negative samples.
In my setting, the three features share the same spatial size, so you indeed only assign positive/negative samples once. The implementation trick is to make your high-resolution input size divisible by 256 (the stride of P7 is 128); in this way, your low-resolution input size is divisible by 128. Note that this operation can still respect the usual max/min valid input size, like (800, 1333), via a padding operation.
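A small sketch of the size arithmetic described above; `pad_to_multiple` is a hypothetical helper, not part of the released code:

```python
def pad_to_multiple(h: int, w: int, multiple: int = 256) -> tuple[int, int]:
    """Round the high-resolution input size up to a multiple of 256 so that,
    after 2x downscaling, the low-resolution size is still divisible by 128
    (the stride of P7) and the two feature pyramids align spatially."""
    pad = lambda x: ((x + multiple - 1) // multiple) * multiple
    return pad(h), pad(w)

# Example: an (800, 1333) image is padded to (1024, 1536); the 2x-downscaled
# input is then (512, 768), and both sizes are divisible by their P7 strides.
print(pad_to_multiple(800, 1333))  # (1024, 1536)
```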
Hi, thanks for your kind suggestions. I wonder, do you have any experimental experience with using fewer FPN levels? In my customized training setting with my own designed backbones, perhaps only the P3 and P4 features would be used.
I ran Faster R-CNN using only single-level features; it follows the same trend as the framework using multi-level features.
Thanks.
Hi, I'm sorry, I have another question about the implementation details of training a multi-scale fusion teacher. Thanks a lot for your reply.
Hi, the three features generated by the low-/high-resolution inputs and C-FF share the same regression target, which is our proposed "aligned" concept. There are two ways to implement this. The first is to set unique parameters for each type of feature: for example, FPN strides are 8 to 128 (P3-P7) for the high-resolution input and 4 to 64 (P2-P6) for the low-resolution input. If you do this, you will find that these features share the same regression target. I recommend the second way: simply swap the fused/low-resolution features in place of the high-resolution features, and use the same high-resolution hyperparameter setting for all of these features.
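A short sketch of the stride bookkeeping behind the first way; the dictionaries below are illustrative only. Expressed in the coordinate frame of the high-resolution image, the low-resolution pyramid's P2-P6 strides line up exactly with the high-resolution pyramid's P3-P7 strides, which is why the regression targets coincide:

```python
# Strides in the high-resolution image's coordinate frame.
hr_strides = {f"P{l}": 2 ** l for l in range(3, 8)}        # P3-P7: 8..128
# The low-resolution input is 2x smaller, so its native P2-P6 strides (4..64),
# mapped back to the high-resolution frame, match P3-P7 above.
lr_strides = {f"P{l}": (2 ** l) * 2 for l in range(2, 7)}  # P2-P6: 8..128

print(list(hr_strides.values()))  # [8, 16, 32, 64, 128]
print(list(lr_strides.values()))  # [8, 16, 32, 64, 128]
```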
Hi, thanks for your reply.
Yes. You can refer to lines 144-208 of our released code in arch_t.py. You will find that I still use the ground truth of the high-resolution input; I only swap in different features.
So once training of the multi-scale fusion teacher is finished, during test mode with low-resolution input, do the predicted bbox coordinates need to be manually divided by a factor of 2 to produce the final detection results?
It depends on your implementation. If you implement it the second way, you do not need to divide by a factor of 2. My inference implementation of RetinaNet in detectron2 is as follows; you can see that I only choose the features I would use in the first three rows.

```python
results = self.inference(anchors, hr_pred_logits, hr_pred_anchor_deltas, hr_images.image_sizes)  # high resolution
processed_results = []
```
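For context, a hedged sketch of how that snippet could sit inside a detectron2-style RetinaNet test path. Apart from the names shown in the snippet above, every call here is an assumption about the surrounding code, and depending on the detectron2 version the head outputs may also need the usual (N, HWA, K) transpose before decoding:

```python
from detectron2.modeling.meta_arch.retinanet import RetinaNet

class MultiScaleTeacher(RetinaNet):  # hypothetical wrapper, for illustration
    def forward_test(self, batched_inputs):
        hr_images = self.preprocess_image(batched_inputs)
        features = self.backbone(hr_images.tensor)
        # Choose the features to use (cf. "the first three rows" above).
        features = [features[f] for f in self.head_in_features]
        hr_pred_logits, hr_pred_anchor_deltas = self.head(features)
        anchors = self.anchor_generator(features)
        # Only the high-resolution branch is decoded; no manual division by 2
        # is needed because, in the second implementation way, all branches
        # share the high-resolution hyperparameters and regression targets.
        return self.inference(anchors, hr_pred_logits,
                              hr_pred_anchor_deltas, hr_images.image_sizes)
```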
Ok, thanks for your kind replies.