Some questions about the implementation details #2
Hi, thanks for your nice work and open-source project. I have the following questions about the implementation details of the teacher part:

1. How did you get the inference performance (e.g., 42.5 mAP in Table 2(a)) of the trained C-FF (Crossing Feature-level Fusion) teacher? Is it an ensemble of two inference parts (high/low-resolution inputs), an ensemble of three parts (low/high/fusion), or only from the fusion part?
2. Does the teacher's detection head for the fused features (returned by the C-FF module) share exactly the same learnable weights as the head used for the high/low-resolution inputs?

Thanks a lot for your reply.

Comments
Hi, thanks for your issue. The C-FF module is used to fuse the features of the high- and low-resolution inputs. The output of C-FF is a two-dimensional score, so we obtain the new features as a weighted sum of those two features using the obtained scores. After that, we feed the new features into the detection head to get the final results. For the second question, the fused features share the same detection head as the high- and low-resolution features; that is, we only have one detection head for the teacher. Hope this answers your questions.
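A minimal sketch of what such a crossing feature-level fusion block could look like in PyTorch. The module name, channel width, and softmax gating here are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFeatureFusion(nn.Module):
    """Hypothetical C-FF-style block: predict a 2-channel score map and
    use it as per-location weights to fuse the high- and low-resolution
    features (assumed to share the same spatial size)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Maps the concatenated features to a two-dimensional score.
        self.score = nn.Conv2d(2 * channels, 2, kernel_size=1)

    def forward(self, feat_hr: torch.Tensor, feat_lr: torch.Tensor) -> torch.Tensor:
        # Softmax over the two score channels so the weights sum to 1.
        w = F.softmax(self.score(torch.cat([feat_hr, feat_lr], dim=1)), dim=1)
        # Weighted sum: channel 0 weights the high-res feature, channel 1 the low-res one.
        return w[:, 0:1] * feat_hr + w[:, 1:2] * feat_lr
```

Per the answer above, the fused output then goes through the single detection head shared with the high- and low-resolution branches.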
Hi, so for question 1 you mean the inference performance is obtained from the fusion part only?
Yes. The inference performance is obtained from the fusion part.
Hi, another question: does a similar performance improvement for low-resolution input (e.g., FCOS 35.9 -> 37.8 -> 39.7 for vanilla multi-scale -> aligned multi-scale -> distilled student) also hold for other one-stage detectors like RetinaNet?
I think there is a similar performance improvement for RetinaNet with the 1x learning schedule. I trained RetinaNet and MEInst on the 3x schedule and they all follow the same trend. The performance of RetinaNet with 3x is reported in our camera-ready paper.
Hi, I'm planning to use the distillation method on my own customized dataset.
Hi, I'm sorry, I have another question about the teacher's training details, concerning the definition of positive/negative samples.
In my setting, the three features share the same spatial size, so you indeed only assign positive/negative samples once. The implementation trick is to make your high-resolution input size divisible by 256 (the stride of P7 is 128); in this way, your low-resolution input size is divisible by 128. Note that this operation can still respect the usual max/min valid input size, like (800, 1333), via a padding operation.
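A small sketch of the size arithmetic described above; `pad_to_multiple` is a hypothetical helper, not part of the released code:

```python
def pad_to_multiple(h: int, w: int, multiple: int = 256) -> tuple[int, int]:
    """Round the high-resolution input size up to a multiple of 256 so that,
    after 2x downscaling, the low-resolution size is still divisible by 128
    (the stride of P7) and the two feature pyramids align spatially."""
    pad = lambda x: ((x + multiple - 1) // multiple) * multiple
    return pad(h), pad(w)

# Example: an (800, 1333) image is padded to (1024, 1536); the 2x-downscaled
# input is then (512, 768), and both sizes are divisible by their P7 strides.
print(pad_to_multiple(800, 1333))  # (1024, 1536)
```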
Hi, thanks for your kind suggestions. I wonder, do you have any experimental experience with using fewer FPN levels? In my customized training setting with my own designed backbones, perhaps only the P3 and P4 features would be used.
I ran Faster R-CNN using only single-level features; it follows the same trend as the framework using multi-level features.
Thanks.
Hi, I'm sorry, I have another question about the implementation details of training a multi-scale fusion teacher. Thanks a lot for your reply.
Hi, the three features generated by the low-/high-resolution inputs and C-FF share the same regression target, which is our proposed "aligned" concept. There are two ways to implement this. The first is to set unique parameters for each type of feature: for example, FPN strides are 8 to 128 (P3-P7) for the high-resolution input and 4 to 64 (P2-P6) for the low-resolution input. If you do this, you will find that these features share the same regression target. I recommend the second way: simply swap the fused/low-resolution features in place of the high-resolution features, and use the same high-resolution hyperparameter setting for all of these features.
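A short sketch of the stride bookkeeping behind the first way; the dictionaries below are illustrative only. Expressed in the coordinate frame of the high-resolution image, the low-resolution pyramid's P2-P6 strides line up exactly with the high-resolution pyramid's P3-P7 strides, which is why the regression targets coincide:

```python
# Strides in the high-resolution image's coordinate frame.
hr_strides = {f"P{l}": 2 ** l for l in range(3, 8)}        # P3-P7: 8..128
# The low-resolution input is 2x smaller, so its native P2-P6 strides (4..64),
# mapped back to the high-resolution frame, match P3-P7 above.
lr_strides = {f"P{l}": (2 ** l) * 2 for l in range(2, 7)}  # P2-P6: 8..128

print(list(hr_strides.values()))  # [8, 16, 32, 64, 128]
print(list(lr_strides.values()))  # [8, 16, 32, 64, 128]
```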
Hi, thanks for your reply.
Yes. You can refer to lines 144-208 of our released code in arch_t.py. You will find that I still use the ground truth of the high-resolution input; I only swap in different features.
So once training of the multi-scale fusion teacher is finished, during test mode with low-resolution input, do the predicted bbox coordinates need to be manually divided by a factor of 2 to produce the final detection results?
It depends on your implementation. If you implement it the second way, you do not need to divide by a factor of 2. My inference implementation of RetinaNet in detectron2 is as follows; you can see that I only choose the features I would use in the first three rows.

```python
results = self.inference(anchors, hr_pred_logits, hr_pred_anchor_deltas, hr_images.image_sizes)  # high resolution
processed_results = []
```
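For context, a hedged sketch of how that snippet could sit inside a detectron2-style RetinaNet test path. Apart from the names shown in the snippet above, every call here is an assumption about the surrounding code, and depending on the detectron2 version the head outputs may also need the usual (N, HWA, K) transpose before decoding:

```python
from detectron2.modeling.meta_arch.retinanet import RetinaNet

class MultiScaleTeacher(RetinaNet):  # hypothetical wrapper, for illustration
    def forward_test(self, batched_inputs):
        hr_images = self.preprocess_image(batched_inputs)
        features = self.backbone(hr_images.tensor)
        # Choose the features to use (cf. "the first three rows" above).
        features = [features[f] for f in self.head_in_features]
        hr_pred_logits, hr_pred_anchor_deltas = self.head(features)
        anchors = self.anchor_generator(features)
        # Only the high-resolution branch is decoded; no manual division by 2
        # is needed because, in the second implementation way, all branches
        # share the high-resolution hyperparameters and regression targets.
        return self.inference(anchors, hr_pred_logits,
                              hr_pred_anchor_deltas, hr_images.image_sizes)
```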
Ok, thanks for your kind replies.