
The centroid and star-shaped structure of Distral #8

Closed
c4cld opened this issue Jul 20, 2021 · 5 comments

Comments

@c4cld

c4cld commented Jul 20, 2021

Description

In Section 2.3, Policy Gradient and a Better Parameterization, of 'Distral: Robust Multitask Reinforcement Learning', the authors argue that the centroid and star-shaped structure of Distral helps learn a better distilled policy. However, the explanation is brief. Could you explain the advantages of the centroid and star-shaped structure in more detail? I have tried to contact the authors of the paper but have not received a reply, so I am asking here.

@shagunsodhani
Contributor

Hi! Thank you for the question. The paper mentions that Distral learns a distilled policy in the space of policies, which is better than learning in the space of parameters. The basic idea is that it should be easier to interpolate in the space of functions/policies than in the parameter space. Two networks can make very similar predictions on any given input while having very different weights. Averaging their predictions gives meaningful predictions, while averaging their weights could produce a model worse than either of the two originals. This is also related to how we perform ensembling: we average the predictions of multiple models, not their weights.
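
To make this concrete, here is a toy sketch in plain NumPy (my own illustration, not from the paper): two networks that compute the exact same function with different weights, where averaging predictions is harmless but averaging weights changes the function.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    # A one-hidden-layer MLP with tanh activation.
    return np.tanh(x @ W1 + b1) @ W2 + b2

# Network A: a random 2-layer MLP.
W1 = rng.normal(size=(3, 8)); b1 = rng.normal(size=8)
W2 = rng.normal(size=(8, 1)); b2 = rng.normal(size=1)

# Network B: the same function with its hidden units permuted.
# It produces identical outputs to A, yet its weight matrices differ.
perm = rng.permutation(8)
W1p, b1p, W2p = W1[:, perm], b1[perm], W2[perm, :]

x = rng.normal(size=(5, 3))
out_a = mlp(x, W1, b1, W2, b2)
out_b = mlp(x, W1p, b1p, W2p, b2)
assert np.allclose(out_a, out_b)  # same predictions, different weights

# Function-space average: identical to the original predictions.
pred_avg = 0.5 * (out_a + out_b)

# Parameter-space average: a different (generally worse) function.
param_avg = mlp(x, 0.5 * (W1 + W1p), 0.5 * (b1 + b1p),
                0.5 * (W2 + W2p), b2)

print(np.abs(pred_avg - out_a).max())   # ~0.0
print(np.abs(param_avg - out_a).max())  # typically far from 0
```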

The paper does not comment on the star/centroid structure being better. If you think otherwise, could you please point me to the relevant line in the paper?

@c4cld
Author

c4cld commented Jul 26, 2021

@shagunsodhani Thank you for your help. The paper mentions the star/centroid structure in Section 2.3, Policy Gradient and a Better Parameterization. Could you explain the advantages of the centroid and star-shaped structure in more detail?

[Screenshot of Section 2.3, Policy Gradient and a Better Parameterization, from the Distral paper]

@shagunsodhani
Contributor

One advantage could be reduced computation: every task model distills against a single central model, so the number of distillation operations is linear in the number of models. With, say, a fully-connected topology, the number of operations would be quadratic. The obvious limitation is that all information exchange is bottlenecked on the central model.
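
To illustrate the counting argument, here is a rough NumPy sketch (the function names and the softmax policy parameterization are my own assumptions, not from the Distral code): with n task policies, the star topology needs n KL terms against the central policy, whereas an all-pairs topology needs n*(n-1).

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the action dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl(log_p, log_q):
    # KL(p || q) averaged over a batch of states.
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean()

def star_distill_terms(task_logits, central_logits):
    # Star topology: one KL(pi_i || pi_0) term per task -> n terms (linear).
    log_p0 = log_softmax(central_logits)
    return sum(kl(log_softmax(l), log_p0) for l in task_logits)

def pairwise_distill_terms(task_logits):
    # Fully-connected topology: a KL term for every ordered pair
    # -> n * (n - 1) terms (quadratic).
    logs = [log_softmax(l) for l in task_logits]
    return sum(kl(li, lj)
               for i, li in enumerate(logs)
               for j, lj in enumerate(logs) if i != j)

# Hypothetical usage: 4 tasks, batch of 32 states, 5 discrete actions.
rng = np.random.default_rng(0)
task_logits = [rng.normal(size=(32, 5)) for _ in range(4)]
central_logits = rng.normal(size=(32, 5))
print(star_distill_terms(task_logits, central_logits))  # 4 KL terms
print(pairwise_distill_terms(task_logits))              # 12 KL terms
```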

@c4cld
Author

c4cld commented Jul 27, 2021

@shagunsodhani Thank you very much!

@shagunsodhani
Contributor

Cool - closing the issue - feel free to reopen if needed :)
