
knowledge distillation loss #3

Closed
segalinc opened this issue Mar 11, 2021 · 3 comments

segalinc commented Mar 11, 2021

Hi,
Do you apply the KL divergence loss to both valence/arousal and expression?
Can you provide more details about it? For instance, what exactly do you pass to the loss? Do you build distributions from the predictions and draw samples to pass to the loss, digitize the predictions before passing them to the loss, or pass the predictions as they are?
Did you also use a temperature parameter?
Thank you

antoinetlc (Contributor) commented

Hello,

Thank you for your questions.
I used distillation only for the categorical emotions, not for the valence and arousal values. I replaced the cross-entropy loss for the categorical emotions with a KL divergence term between the teacher and student predictions (after applying a softmax to each, in order to get probability distributions). The part of the loss that deals with valence and arousal stayed the same (see the paper). I did not use a temperature parameter.
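As a concrete illustration, here is a minimal PyTorch sketch of that distillation term, assuming both networks output raw logits over the emotion categories. The tensor names and the function itself are illustrative, not the repository's actual code:

```python
import torch.nn.functional as F

def categorical_distillation_loss(student_logits, teacher_logits):
    """KL divergence between teacher and student class distributions."""
    # Softmax the teacher logits to obtain the target distribution.
    teacher_probs = F.softmax(teacher_logits, dim=1)
    # F.kl_div expects its first argument in log space.
    student_log_probs = F.log_softmax(student_logits, dim=1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```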

From my experience, this worked well with the AffectNet dataset. However, it might be data-dependent, and there are other ways of doing the distillation, such as:

  • Keep the cross entropy between the student prediction and the label coming from the dataset, and add a KL divergence term between the student and teacher predictions as a regularization (in this case, add a coefficient in front of the KL divergence term to control the amount of regularization you want).
  • Add distillation for valence and arousal as well. In this case, I would use an L2 loss between the valence and arousal predictions of the teacher and student instead of the KL divergence, since this is a regression problem (a KL divergence on negative regression values will break). Again, this distillation term could either replace the original loss with the dataset label or come on top of it as a regularization. See the sketch after this list for one way of combining these terms.
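For example, a sketch of the regularized variant described above, combining a supervised cross entropy, a weighted KL distillation term, and an L2 distillation term for valence/arousal. The weights and tensor names are placeholders for illustration, not values from the paper:

```python
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, labels,
                  student_va, teacher_va,
                  kd_weight=0.5, va_weight=1.0):
    # Supervised cross entropy against the dataset labels.
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence between teacher and student class distributions (regularizer).
    kd = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="batchmean")
    # L2 distillation on the valence/arousal regression outputs.
    va = F.mse_loss(student_va, teacher_va)
    return ce + kd_weight * kd + va_weight * va
```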

It is hard to know in advance which variant will work better, as it is data-dependent. I would recommend trying them on a validation set and selecting the model that gives the best accuracy there.

Hope it helps!

segalinc (Author) commented Mar 13, 2021 via email

nlml commented Sep 29, 2022

Hi there. I'm not sure if you'll see this, but this aspect of your paper is still unclear to me.

You say you use knowledge distillation, but what are you distilling from? Did you train a teacher network first on AffectNet (i.e. no other datasets), and then use that network's predictions for the knowledge distillation loss component when training the student?

Thanks!
Liam
