PPO Implementation Details - Checklist #53

Closed
9 of 13 tasks
herbiebradley opened this issue Oct 20, 2022 · 2 comments

herbiebradley commented Oct 20, 2022

The 37 Implementation Details of PPO, a blog post published at ICLR, describes a number of PPO implementation details that improve both efficiency and model performance. See also: Andrychowicz et al., Engstrom et al.

Some of these optimizations are minor and probably irrelevant, many are already implemented here, and some may provide performance boosts to trlx. This issue documents the details as a checklist, to track this repository's progress towards covering the full list.

  • 1. Vectorized Architecture - trlx already does this.
  • 2. Weight and bias initialization. Any layers initialized from scratch should use orthogonal initialization with gain sqrt(2) and biases of 0, with the policy network's last layer scaled by 0.01 after initialization (see the init sketch below).
  • 3. Adam optimizer initialization. Andrychowicz et al. recommend 1e-7 as the Adam epsilon (and actually find that the PyTorch default of 1e-8 is the worst of the choices tested).
  • 4. Optimizer weight decay and learning rate annealing. Currently the code does not appear to use the weight_decay: 1e-6 config value at all. It also uses cosine annealing instead of linear, and decays the learning rate not to 0 (as recommended by Andrychowicz et al.) but to 1.412e-4 by default. It may be worth testing linear annealing to 0 to see if it makes a difference (a combined optimizer/schedule sketch is included below).
  • 5. Generalized Advantage Estimation. Correctly implemented in trlx (a reference GAE sketch is included below).
  • 6. Mini-batch updates. In trlx this is being done in make_experience.
  • 7. Normalization of Advantages (at the mini-batch level). I believe this is already done, since whiten appears to be called at the mini-batch level (see the whitening sketch below).
  • 8. Clipped surrogate objective. Done in trlx.
  • 9. Value function loss clipping. Done in trlx (both clipped losses are shown in the sketch below).
  • 10. Overall loss and entropy bonus. Entropy is not used for regularization in trlx. OpenAI set the entropy coefficient to 0 for MuJoCo anyway, and Andrychowicz et al. find that entropy regularization does not help performance, so this may not be worth implementing.
  • 11. Global gradient clipping. The trlx grad_clip config option does not appear to be connected to anything. Andrychowicz et al. find a small performance boost from ensuring the global norm of the gradients of all parameters does not exceed 0.5 (see the clipping sketch below).
  • 12. KL approximation. Check that the unbiased estimator is being used (an example estimator is sketched below).
  • 13. Shared vs separate policy/value networks. Irrelevant in trlx due to the hydra heads implementation.

Other items in the blog post are specific to environments or network architectures that trlx does not target. Andrychowicz et al. also cover other hyperparameter choices not mentioned here which may be of interest.
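
Rough sketches for some of the items above follow, written as standalone PyTorch rather than trlx code. First, the initialization scheme from item 2; the helper name init_linear and the layer shapes are placeholders.

```python
import torch.nn as nn


def init_linear(layer: nn.Linear, gain: float = 2 ** 0.5) -> nn.Linear:
    """Orthogonal weight init with the given gain and zero bias."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer


# Hidden layers use gain sqrt(2); the policy head is scaled down to 0.01
# so the initial action distribution is close to uniform.
hidden = init_linear(nn.Linear(64, 64))
policy_head = init_linear(nn.Linear(64, 6), gain=0.01)
value_head = init_linear(nn.Linear(64, 1), gain=1.0)  # the blog uses gain 1.0 for the value head
```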
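
Items 3 and 4 could look roughly like the following; the model, learning rate, and step count are placeholders, and a LambdaLR linear schedule is just one way to anneal to 0.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # placeholder for the actual policy/value model
total_updates = 10_000   # placeholder for the real number of optimizer steps

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,
    eps=1e-7,           # Andrychowicz et al.'s recommendation instead of the 1e-8 default
    weight_decay=1e-6,  # the value already present in the trlx config
)

# Linear anneal from the initial LR down to 0 over the course of training.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_updates)
)
```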
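
For comparison against the existing implementation, a textbook GAE loop (item 5) looks like this; tensor names and shapes are illustrative only.

```python
import torch


def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """GAE over a rollout of length T.

    rewards, dones: shape (T,); values: shape (T + 1,), where the extra
    entry is the bootstrap value for the state after the rollout.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        last_gae = delta + gamma * lam * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values[:-1]
    return advantages, returns
```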
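
Item 7 amounts to something like the whitening helper below, applied per mini-batch rather than over the whole rollout buffer; the function name mirrors trlx's whiten, but the body here is a generic sketch, not the trlx implementation.

```python
import torch


def whiten(advantages: torch.Tensor, shift_mean: bool = True) -> torch.Tensor:
    """Normalize to unit variance and (optionally) zero mean."""
    mean, var = advantages.mean(), advantages.var()
    whitened = (advantages - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        whitened += mean
    return whitened


# e.g. inside the PPO update loop, per mini-batch:
# mb_advantages = whiten(mb_advantages)
```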
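
Items 8 and 9 together are roughly the following loss computation; all argument names are placeholders and clip_range = 0.2 is just a common default.

```python
import torch


def ppo_losses(logprobs, old_logprobs, advantages,
               values, old_values, returns, clip_range=0.2):
    """Clipped surrogate policy loss (item 8) and clipped value loss (item 9)."""
    # Policy: take the pessimistic (max) of the unclipped and clipped surrogate losses.
    ratio = torch.exp(logprobs - old_logprobs)
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Value: clip the new value prediction so it stays close to the old one.
    values_clipped = old_values + torch.clamp(values - old_values, -clip_range, clip_range)
    vf_loss1 = (values - returns) ** 2
    vf_loss2 = (values_clipped - returns) ** 2
    vf_loss = 0.5 * torch.max(vf_loss1, vf_loss2).mean()
    return pg_loss, vf_loss
```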
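
Wiring up the grad_clip option from item 11 would be a one-liner between the backward pass and the optimizer step; the model variable is a placeholder.

```python
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

model = nn.Linear(8, 2)  # placeholder for the actual policy/value model

# loss.backward()
clip_grad_norm_(model.parameters(), max_norm=0.5)  # 0.5 per Andrychowicz et al.
# optimizer.step()
```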
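
For item 12, what I believe is the estimator being referred to (the unbiased, low-variance k3 estimator from John Schulman's "Approximating KL Divergence" note) can be written as:

```python
import torch


def approx_kl(logprobs: torch.Tensor, old_logprobs: torch.Tensor) -> torch.Tensor:
    """Estimate KL(old || new) from samples drawn under the old policy."""
    logratio = logprobs - old_logprobs
    ratio = torch.exp(logratio)
    return ((ratio - 1) - logratio).mean()
```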

Dahoas (Collaborator) commented Oct 21, 2022

Thanks for this!

LouisCastricato (Contributor) commented

Closing
