Learning Efficient Object Detection Models with Knowledge Distillation

File metadata and controls

32 lines (17 loc) · 3 KB

Contributions:

Techniques for applying knowledge distillation to object detection: a weighted loss for imbalanced data and hint learning for feature adaptation.

Takeaways


Quotes

We transfer the teacher's regression output as a form of upper bound; that is, if the student's regression output is better than that of the teacher, no additional loss is applied.

When the teacher's output is very similar to the hard label, with the probability for one class very close to 1 and most others very close to 0, the temperature parameter is introduced to soften the output. Using a higher temperature forces the softmax to produce softer labels so that the classes with near-zero probabilities will not be ignored by the cost function. This is especially pertinent to simpler tasks, such as classification on small datasets like MNIST. But for harder problems where the prediction error is already high, a larger temperature introduces more noise, which is detrimental to learning. Thus, lower temperature values are used in [20] for classification on larger datasets. For even harder problems such as object detection, we find that using no temperature parameter at all (equivalent to a temperature of 1) in the distillation loss works best in practice (see supplementary material for an empirical study).
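
A minimal PyTorch sketch of temperature-softened distillation, to make the role of the temperature concrete; the function name and tensor shapes are illustrative assumptions, not the paper's implementation:

```python
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross entropy between temperature-softened teacher and student outputs.

    student_logits, teacher_logits: tensors of shape (N, num_classes).
    A higher temperature flattens the teacher distribution so near-zero
    classes still contribute to the gradient; temperature = 1 leaves the
    softmax unchanged, which the paper reports works best for detection.
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    return -(soft_targets * log_student).sum(dim=1).mean()
```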

Unlike distillation for discrete categories, the teacher's regression outputs can provide very wrong guidance to the student model, since the real-valued regression outputs are unbounded.

Instead of using the teacher's regression output directly as a target, we exploit it as an upper bound for the student to achieve. The student's regression vector should in general be as close to the ground-truth label as possible, but once the quality of the student surpasses that of the teacher by a certain margin, we do not apply additional loss to the student.
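
A minimal sketch of this teacher-bounded regression idea, assuming an L2 error and a margin term; the function and variable names are illustrative:

```python
import torch

def bounded_regression_loss(student_reg, teacher_reg, gt_reg, margin=0.0):
    """Penalize the student's regression only when it is worse than the teacher.

    student_reg, teacher_reg, gt_reg: tensors of shape (N, 4) (box offsets).
    The teacher acts as an upper bound rather than a target: if the student's
    error is already lower than the teacher's by at least the margin, no
    additional loss is applied for that sample.
    """
    student_err = ((student_reg - gt_reg) ** 2).sum(dim=1)
    teacher_err = ((teacher_reg - gt_reg) ** 2).sum(dim=1)
    mask = (student_err + margin > teacher_err).float()
    return (mask * student_err).mean()
```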

When the hint and guided layers are convolutional layers, we use 1 × 1 convolutions to save memory.
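
A minimal PyTorch sketch of hint learning with a 1 × 1 convolutional adaptation layer; the module name and channel arguments are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintAdapter(nn.Module):
    """Adapt the student's guided feature map to the teacher's hint layer."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # A 1x1 convolution matches channel counts with little extra memory.
        self.adapt = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # L2 hint loss between adapted student features and (frozen) teacher
        # features of the same spatial size.
        return F.mse_loss(self.adapt(student_feat), teacher_feat.detach())
```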

This observation may support the hypothesis that CNN-based object detectors are highly over-parameterized.

This suggests that it is worth having even higher capacity models for such large scale datasets.

To prevent classes with small probability from being ignored by the objective function, a soft label with high temperature, also named the weighted cross-entropy loss, is proposed for the proposal classification task in Sec. 3.2.
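
A minimal sketch of a class-weighted soft cross-entropy for the imbalanced proposal classification task; the argument names and weight values are illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def weighted_soft_cross_entropy(student_logits, teacher_probs, class_weights):
    """Soft cross entropy with per-class weights.

    student_logits: (N, num_classes) raw scores from the student.
    teacher_probs:  (N, num_classes) soft labels from the teacher.
    class_weights:  (num_classes,) weights chosen to counter the dominance
                    of the background class (a hyperparameter).
    """
    log_student = F.log_softmax(student_logits, dim=1)
    per_sample = -(class_weights * teacher_probs * log_student).sum(dim=1)
    return per_sample.mean()
```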

‘Car’ shares more common visual characteristics with ‘Truck’ than with ‘Person’. Such structural information is not available in the ground-truth annotations.


**Miscellaneous**

The compression ratio can be measured by the ranks of the weight matrices of a neural network.
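
A minimal sketch of how a retained rank maps to a compression ratio for a single weight matrix via truncated SVD; this is illustrative and not tied to any specific layer in the paper:

```python
import numpy as np

def truncated_svd_compression(weight, rank):
    """Return a rank-`rank` approximation of `weight` and its compression ratio.

    weight: (m, n) weight matrix. Storing the truncated factors costs
    rank * (m + n + 1) values versus m * n for the original matrix.
    """
    U, S, Vt = np.linalg.svd(weight, full_matrices=False)
    approx = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]
    m, n = weight.shape
    compression_ratio = (m * n) / (rank * (m + n + 1))
    return approx, compression_ratio
```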