
Questionable distillation technique #9

Open
andyperio opened this issue Apr 28, 2024 · 1 comment

andyperio commented Apr 28, 2024

I noticed that in Table 12 of your paper, the hyperparameter ($\beta$) is set to a very low value, 1e-8, which suggests that the proposed code-based distillation process plays an almost negligible role during training. This is quite puzzling.
Could you explain?

tyfeld (Collaborator) commented Apr 29, 2024

> I noticed that in Table 12 of your paper, the hyperparameter ($\beta$) is set to a very low value, 1e-8, which suggests that the proposed code-based distillation process plays an almost negligible role during training. This is quite puzzling. Could you explain?

Regarding the disparity between the hyperparameters for code distillation and class distillation in Equation 7, it is important to consider how each loss is computed. The class-based distillation loss computes the KL divergence between the student predictions and the teacher's soft labels, whereas the code-based loss compares the target node representations against all M codes of the codebook embeddings.
It is worth noting that the number of classes and the number of codes differ by several orders of magnitude, which results in a scale gap of roughly 10^7 on large-scale datasets. Consequently, despite the much smaller hyperparameter, the raw code-distillation loss is proportionally larger, so the gradients propagated during backpropagation remain at a level comparable to those of the class-distillation loss.
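For intuition only, here is a minimal PyTorch sketch of how the two terms might be combined. The tensor names, the sizes, and the exact form of the code-based loss are assumptions made for illustration (one plausible reading of "comparing the target node representations with all M codes"), not the repository's actual implementation; only the scale argument it demonstrates comes from the explanation above.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only (hypothetical, not taken from the paper):
N, C, M, D = 256, 40, 32_768, 128   # nodes, classes, codebook codes, hidden dim

student_logits = torch.randn(N, C)  # student class predictions
teacher_logits = torch.randn(N, C)  # teacher soft labels
student_repr   = torch.randn(N, D)  # student node representations
teacher_repr   = torch.randn(N, D)  # teacher node representations
codebook       = torch.randn(M, D)  # codebook embeddings

# Class-based distillation: KL divergence over the C classes per node.
class_kd = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)

# Code-based distillation (assumed form): match student and teacher
# similarity scores against every one of the M codes, aggregated over codes.
student_scores = student_repr @ codebook.t()   # (N, M)
teacher_scores = teacher_repr @ codebook.t()   # (N, M)
code_kd = ((student_scores - teacher_scores) ** 2).sum(dim=-1).mean()

# Because code_kd aggregates over M >> C terms, its raw magnitude dwarfs
# class_kd, and the tiny beta (1e-8 in Table 12) rescales it so that both
# terms contribute gradients of comparable size.
alpha, beta = 1.0, 1e-8
total_loss = alpha * class_kd + beta * code_kd
print(f"class_kd={class_kd.item():.3e}  code_kd={code_kd.item():.3e}  "
      f"total={total_loss.item():.3e}")
```

Running this with the hypothetical sizes above shows the raw code-based term coming out several orders of magnitude larger than the class-based term, which is why the small $\beta$ is needed to balance the two.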
