
Some questions about the implementation details #2

Closed
v-qjqs opened this issue Apr 16, 2021 · 18 comments

@v-qjqs

v-qjqs commented Apr 16, 2021

Hi, thanks for your nice work and open-source project. I have the following questions about the implementation details of the teacher part:

  1. How did you get the inference performance (e.g. 42.5 mAP in Table 2(a)) of the trained C-FF (Crossing Feature-level Fusion) teacher? Is it an ensemble of two inference parts (high/low-resolution inputs), an ensemble of three parts (low/high/fusion), or only from the fusion part?

  2. Does the teacher's detection head for the fused features (returned by the C-FF module) share exactly the same learnable weights as the head used for the high/low-resolution inputs?

Thanks a lot for your reply.

@qqlu
Collaborator

qqlu commented Apr 16, 2021

Hi, thanks for your questions.

The C-FF module is used to fuse the features of the high- and low-resolution inputs. The output of C-FF is a two-dimensional score, so we obtain the new features by adding the two features weighted by that score. After that, we feed the new features into the detection head to get the final results.

For the second question, the fused features share the same detection head as the high- and low-resolution features. That is, there is only one detection head in the teacher.
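
If it helps, here is a minimal sketch of that weighted fusion in PyTorch (the module name, the 1x1 conv used to predict the score, and the softmax normalization are my own illustrative choices, not necessarily how the released code implements C-FF):

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossingFeatureLevelFusion(nn.Module):
    """Predict a 2-way score and blend two spatially aligned features with it."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, feat_high, feat_low):
        # feat_high / feat_low: (N, C, H, W), already the same spatial size
        w = F.softmax(self.score(torch.cat([feat_high, feat_low], dim=1)), dim=1)
        # weighted sum of the two features using the two score channels
        return w[:, 0:1] * feat_high + w[:, 1:2] * feat_low

# The single shared detection head is then applied to all three feature sets:
#   head(feat_high), head(feat_low), head(fused)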

Hope this could answer your questions.

@v-qjqs
Author

v-qjqs commented Apr 16, 2021

Hi, so for question 1 you mean the inference performance comes from the fusion part only.
Thanks very much.

@qqlu
Copy link
Collaborator

qqlu commented Apr 16, 2021

Yes. The inference performance comes from the fusion part.

@v-qjqs
Author

v-qjqs commented Apr 16, 2021

Hi, another question: does a similar performance improvement for low-resolution input (e.g. FCOS 35.9 -> 37.8 -> 39.7 for vanilla multi-scale -> aligned multi-scale -> distilled student) also exist for other one-stage detectors like RetinaNet?
I'm trying to transfer your distillation method to my own customized dataset with a RetinaNet detector, and I mainly focus on the low-resolution input because the high-resolution input is not expected to be used during inference of the student model.
Thanks very much.

@qqlu
Collaborator

qqlu commented Apr 16, 2021

I think there is a similar performance improvement for RetinaNet with the 1x learning schedule. I trained RetinaNet and MEInst with the 3x schedule and they follow a similar trend. The performance of RetinaNet with 3x is reported in our camera-ready paper.

@v-qjqs
Author

v-qjqs commented Apr 16, 2021

Hi, I'm planning to use the distillation method on my own customized dataset.
Thanks a lot.

@v-qjqs v-qjqs changed the title from "Some questions about the implementation details on teacher" to "Some questions about the implementation details" Apr 16, 2021
@v-qjqs v-qjqs closed this as completed Apr 16, 2021
@v-qjqs v-qjqs reopened this Apr 17, 2021
@v-qjqs
Author

v-qjqs commented Apr 17, 2021

Hi, I'm sorry, I have another question about the teacher's training details, concerning the definition of positive/negative samples.
If I use RetinaNet to train a multi-scale fusion teacher, do the three features (low/high/fusion FPN features, e.g. P(s), P'(s-1), P(s)^T in Fig. 2) that have the same spatial size share exactly the same positive/negative samples? In other words, is the assigning/matching process that decides positive/negative samples performed only once?
Thanks a lot.

@qqlu
Collaborator

qqlu commented Apr 17, 2021

In my setting, the three features share the same spatial size, so the assignment of positive/negative samples is indeed performed only once.

The implementation is to make your high-resolution input size divisible by 256 (the stride of P7 is 128); in this way, your low-resolution input size is divisible by 128. Note that this can also be kept within the max/min valid input size, like (800, 1333), by padding.
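
Here is a rough sketch of the sizing arithmetic (plain numbers only, not our data-loading code; the function name and the example input are illustrative):

import math

def hr_lr_sizes(h, w, min_size=800, max_size=1333, divisor=256):
    # standard shortest-edge resize, kept within the (min_size, max_size) range
    scale = min(min_size / min(h, w), max_size / max(h, w))
    h, w = int(round(h * scale)), int(round(w * scale))
    # pad the high-resolution input up to a multiple of 256 (the stride of P7
    # is 128), so the 0.5x low-resolution copy is automatically a multiple of
    # 128 and the FPN maps of the two inputs line up
    hr = (math.ceil(h / divisor) * divisor, math.ceil(w / divisor) * divisor)
    lr = (hr[0] // 2, hr[1] // 2)
    return hr, lr

print(hr_lr_sizes(480, 640))  # -> ((1024, 1280), (512, 640))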

@v-qjqs
Author

v-qjqs commented Apr 17, 2021

Hi, thanks for your kind suggestions. I wonder whether you have experimented with using fewer FPN features? In my customized training setting with my own designed backbones, perhaps only the P3 and P4 features would be used.

@qqlu
Collaborator

qqlu commented Apr 17, 2021

I ran Faster R-CNN using only single-level features, and it follows the same trend as the framework using multi-level features.

@v-qjqs
Author

v-qjqs commented Apr 17, 2021

Thanks.

@v-qjqs v-qjqs closed this as completed Apr 17, 2021
@v-qjqs
Author

v-qjqs commented Apr 18, 2021

Hi, I'm sorry, I have another question about the implementation details of training a multi-scale fusion teacher.
Which regression target is used for the supervision of positive samples in the fusion branch (the feature returned from the C-FF module)? Is it the gt bbox annotations from the high-resolution (2x) input or from the low-resolution input?
Also, I'm not sure whether the gt annotations from the low-resolution input are used for training the low-resolution branch or not.
Is it fine for all three branches (low/high/fusion) to use exactly the same gt coordinate target, taken only from the high-resolution input?

Thanks a lot for your reply.

@v-qjqs v-qjqs reopened this Apr 18, 2021
@qqlu
Collaborator

qqlu commented Apr 18, 2021

Hi, the three features generated by the low-/high-resolution inputs or by C-FF share the same regression target, which is our proposed "aligned" concept. There are two implementation ways. The first one is to set unique parameters for each type of feature. For example, FPN strides are 8 to 128 (P3-P7) for the high-resolution input and 4 to 64 (P2-P6) for the low-resolution input. If you do it like this, you will find that these features share the same regression target.

I recommend the second way: keep the high-resolution pipeline and only swap the fused/low-resolution features in for the high-resolution features, i.e. use the same high-resolution hyperparameter setting for all of these features.
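
To make the "aligned" idea of the first way concrete, here is a tiny numeric check with standard RetinaNet-style box deltas (all numbers are illustrative):

import math

def deltas(gt, anchor):
    # (dx/wa, dy/ha, log(gw/wa), log(gh/ha)) in center/size parameterization
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchor
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))

# High-resolution input: a GT box matched to an anchor on P4 (stride 16).
hr = deltas(gt=(400, 400, 120, 80), anchor=(396, 396, 128, 128))
# Low-resolution (0.5x) input: the same object and the corresponding anchor on
# P3 (stride 8); every coordinate is simply halved.
lr = deltas(gt=(200, 200, 60, 40), anchor=(198, 198, 64, 64))

assert all(abs(a - b) < 1e-6 for a, b in zip(hr, lr))  # identical regression targets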

@v-qjqs
Author

v-qjqs commented Apr 18, 2021

Hi, thanks for your reply.
For the second way, do you mean it is unnecessary to divide the gt bbox annotations (which are from the high-resolution input) by a factor of 2 when doing regression supervision for the low-resolution input? If so, the gt annotations might exceed the image boundary of the low-resolution input.

@qqlu
Collaborator

qqlu commented Apr 18, 2021

Yes. You can refer to lines 144-208 of our released code arch_t.py. You will find that I still use the ground truth of the high-resolution input; I only change the features.
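
The following is not the released arch_t.py, just a sketch of the idea in those lines, assuming a subclass of detectron2's RetinaNet (anchor_generator, head, label_anchors and losses are its standard members; the hr_/lr_/fused_ variable names are mine, and the usual reshaping of the raw head outputs is omitted for brevity):

# Anchors and targets are built once, from the high-resolution input only.
anchors = self.anchor_generator(hr_features)
gt_labels, gt_boxes = self.label_anchors(anchors, gt_instances)  # hr ground truth, assigned once

losses = {}
for name, feats in [("hr", hr_features), ("lr", lr_features), ("fused", fused_features)]:
    # only the features change; the shared head and the targets stay the same
    pred_logits, pred_deltas = self.head(feats)
    branch = self.losses(anchors, pred_logits, gt_labels, pred_deltas, gt_boxes)
    losses.update({name + "_" + k: v for k, v in branch.items()})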

@v-qjqs
Author

v-qjqs commented Apr 18, 2021

So when the training of the multi-scale fusion teacher is finished, during test mode with low-resolution input, will the predicted bbox coordinates be manually divided by a factor of 2 to form the final detection results?

@qqlu
Collaborator

qqlu commented Apr 18, 2021

It depends on your implementation. If you implement it in the second way, you do not need to divide by a factor of 2.

My inference implementation of RetinaNet in detectron2 is as follows. You can see that I only choose the features I want to use among the first three rows.

# Keep whichever of the three branches you want to evaluate; each call below
# overwrites `results`. All branches reuse the same anchors and the
# high-resolution image sizes, so no manual division by 2 is needed.
results = self.inference(anchors, hr_pred_logits, hr_pred_anchor_deltas, hr_images.image_sizes)  # high resolution
results = self.inference(anchors, lr_pred_logits, lr_pred_anchor_deltas, hr_images.image_sizes)  # low resolution
results = self.inference(anchors, fr_pred_logits, fr_pred_anchor_deltas, hr_images.image_sizes)  # fused resolution

processed_results = []
for results_per_image, input_per_image, image_size in zip(
    results, batched_inputs, t_images.image_sizes
):
    # detector_postprocess rescales the boxes back to the original image size
    height = input_per_image.get("height", image_size[0])
    width = input_per_image.get("width", image_size[1])
    r = detector_postprocess(results_per_image, height, width)
    processed_results.append({"instances": r})
return processed_results

@v-qjqs
Author

v-qjqs commented Apr 18, 2021

Ok, thanks for your kind replies.

@v-qjqs v-qjqs closed this as completed Apr 18, 2021