knowledge distillation loss #3
Hi,
Do you apply the KL divergence loss to both valence/arousal and expression? Can you provide more details about it? For instance, what do you pass to the loss? Do you create distributions from the predictions and draw samples to pass to the loss, digitize the predictions before passing them, or pass the predictions as they are? Did you also use a temperature parameter?
Thank you

Comments
Hello,
Thank you for your questions.
I used distillation only for the categorical emotions and not for the valence and arousal values. I replaced the cross-entropy loss for the categorical emotions with a KL divergence term between the teacher and student predictions (after applying a softmax, in order to get probability distributions). The part of the loss that deals with valence and arousal stayed the same (see paper). I did not use a temperature parameter.
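For concreteness, that categorical term could look like the following minimal PyTorch sketch (an illustration of the description above, not the repository's actual code; the function name, tensor names, and shapes are assumptions):

```python
import torch.nn.functional as F

def emotion_distillation_loss(student_logits, teacher_logits):
    # Raw logits of shape (batch, n_emotion_classes) from both networks.
    # A plain softmax (no temperature, i.e. T = 1) turns them into
    # probability distributions; F.kl_div expects its first argument
    # as log-probabilities.
    student_log_probs = F.log_softmax(student_logits, dim=1)
    teacher_probs = F.softmax(teacher_logits, dim=1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# The valence/arousal part of the total loss stays unchanged (see paper), e.g.:
# total_loss = emotion_distillation_loss(s_logits, t_logits) + va_loss
```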
From my experience, this worked well with the AffectNet dataset. However, it might be data dependent, and there are other ways of doing the distillation, such as (both variants are sketched in code after this list):
- Keep the cross entropy between the student prediction and the label coming from the dataset, and add a KL divergence term between the student and teacher predictions as a regularization (in this case, add a coefficient in front of the KL divergence term to control the amount of regularization you want).
- Add distillation for valence and arousal as well. In this case, I would use an L2 loss between the valence and arousal predictions of the teacher and student instead of the KL divergence, as this is a regression problem (a KL divergence with negative values will break). Again, this distillation term could either replace the original loss computed with the dataset label or come on top of it as a regularization.
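Both variants could be sketched as follows (again an illustration, not code from the repository; the function names, the `alpha` default, and the (batch, 2) valence/arousal shape are assumptions):

```python
import torch.nn.functional as F

def regularized_emotion_loss(student_logits, teacher_logits, labels, alpha=0.5):
    # Variant 1: keep the cross entropy against the dataset labels and add
    # the teacher-student KL divergence as a regularizer; alpha controls the
    # amount of regularization (illustrative default, tune on validation).
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="batchmean")
    return ce + alpha * kl

def va_distillation_loss(student_va, teacher_va):
    # Variant 2: valence and arousal are real-valued and can be negative,
    # so an L2 (MSE) loss between teacher and student predictions is used
    # instead of a KL divergence. Both inputs have shape (batch, 2).
    return F.mse_loss(student_va, teacher_va)
```

As noted in the list above, the valence/arousal term could either replace the original loss computed with the dataset labels or be added on top of it with its own coefficient.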
It is hard to know what will work better, as it is data dependent. I would recommend trying the options on a validation set and selecting the model that gives you the best accuracy there.
Hope it helps!

Hi Antoine,
Thank you for the detailed reply, now everything is much clearer. In fact, it wasn't clear to me whether you had applied it to valence and arousal, because distillation on a regression output sounded weird.
Really appreciated,
Cristina
Hi there. Not sure if you'll see this, but this aspect of your paper is still unclear to me. You say you use knowledge distillation, but what are you distilling from? Did you train a teacher network first on AffectNet (i.e. no other datasets), and then use that network's predictions for the knowledge distillation loss component when training the student? Thanks!