
Self Imitation Learning #21

flrngel commented Sep 16, 2018

https://arxiv.org/abs/1806.05635

Abstract

  • SIL (Self Imitation Learning) verifies that exploiting past good experiences can indirectly drive deep exploration.
  • competitive with state-of-the-art exploration methods

1. Introduction

  • Atari "Montezuma's Revenge"
    • A2C ends up with a poor policy
    • exploiting the experiences in which the agent picks up the key enables it to keep exploring onwards
  • Main contributions are,
    1. To study how exploiting past good experiences affects learning (SIL)
    2. Theoretical justification of SIL, derived from the lower bound of the optimal Q-function
    3. SIL is very simple to implement
    4. Generically applicable to any actor-critic architecture

2. Related work

  • Exploration
  • this paper differs in that it exploits what the agent has already experienced but not yet learned
  • Episodic control
    • Lengyel & Dayan, 2008
    • an extreme way of exploiting past experiences, in the sense that it repeats the best past outcome
  • Experience replay
  • Experience replay for actor-critic
    • actor-critic framework can also utilize experience replay
    • difference between off-policy and on-policy learning (Stack Overflow)
    • off-policy evaluation involves importance sampling (ACER and Reactor use Retrace for evaluation), which may not benefit much from past experiences if the past policy is very different from the current one (see the toy illustration after this list)
    • this paper does not involve importance sampling and is applicable to both discrete and continuous control
  • Connection between policy gradient and Q-learning
  • Learning from imperfect demonstrations
    • prior works [Liang et al., 2016; Abolafia et al., 2018] used a classification loss without justification
    • the paper proposes a new objective, provides a theoretical justification, and systematically investigates how it drives exploration in RL
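
To make the importance-sampling point concrete, a toy illustration (my own, not from the paper): when the current policy has drifted far from the behavior policy that generated an old transition, the importance weight π(a|s)/μ(a|s) becomes tiny (or is truncated, Retrace-style), so the old experience contributes little; SIL instead weights by the clipped advantage, so a high-return old experience still drives learning.

```python
# Toy numbers (mine, not from the paper): old good experience, policies have diverged.
pi_current = 0.02       # current policy's probability of the stored action
mu_old     = 0.90       # behavior policy's probability when the data was collected
R, V       = 1.0, 0.3   # stored return and current value estimate

is_weight  = min(pi_current / mu_old, 1.0)   # truncated importance weight (Retrace-style)
sil_weight = max(R - V, 0.0)                 # SIL's (R - V)_+ weight

print(is_weight)    # ~0.022 -> the experience is almost ignored under importance sampling
print(sil_weight)   # 0.7    -> SIL still learns from it because the return was high
```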

3. Self Imitation Learning

  • the goal is to imitate the agent's past good experiences in the actor-critic framework
  • proposes storing past episodes together with their cumulative rewards in a replay buffer
  • off-policy actor-critic loss (algorithm figure and equation screenshots from the paper omitted; restated below)
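
For reference, the SIL objective restated from the paper as I remember it (notation may differ from the omitted screenshots): D is the replay buffer, R the stored discounted return, (·)_+ = max(·, 0), and β^sil a value-loss weight.

```latex
% SIL off-policy actor-critic loss (restated from memory; notation mine)
\mathcal{L}^{sil} = \mathbb{E}_{(s,a,R) \sim \mathcal{D}}
    \left[ \mathcal{L}^{sil}_{policy} + \beta^{sil} \mathcal{L}^{sil}_{value} \right], \qquad
\mathcal{L}^{sil}_{policy} = -\log \pi_\theta(a \mid s)\, \big(R - V_\theta(s)\big)_+, \qquad
\mathcal{L}^{sil}_{value} = \tfrac{1}{2}\, \big\| \big(R - V_\theta(s)\big)_+ \big\|^2
```

The clipping means only transitions whose stored return exceeds the current value estimate contribute to the update, i.e. only good past experiences are imitated.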

4. Theoretical Justification

(derivation screenshots from the paper omitted)
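
The gist of the justification, as I understand it: the discounted return of any past trajectory is a lower bound of the optimal Q-value, so regressing V_θ toward returns that exceed it (and reinforcing the corresponding actions) pushes the agent toward the optimum without importance sampling. A minimal sketch of the loss computation under my own assumptions (plain NumPy; function names and constants are mine, not the authors' code):

```python
import numpy as np

# Illustrative sketch of the SIL losses (not the authors' implementation).
# Assumes a replay buffer of (state, action, R), where R is the discounted
# return computed once an episode finishes, plus externally supplied value
# estimates V_theta(s) and log-probabilities log pi_theta(a|s) for the batch.

GAMMA = 0.99       # discount factor (assumed)
BETA_SIL = 0.01    # value-loss weight beta^sil (assumed)


def discounted_returns(rewards, gamma=GAMMA):
    """R_t = sum_k gamma^k * r_{t+k} for one finished episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def sil_loss(returns, values, log_probs, beta_sil=BETA_SIL):
    """Combined SIL policy + value loss for a sampled batch."""
    advantage = np.maximum(returns - values, 0.0)      # (R - V)_+ : keep only good experiences
    policy_loss = -np.mean(log_probs * advantage)      # -log pi(a|s) * (R - V)_+
    value_loss = 0.5 * np.mean(advantage ** 2)         # 1/2 * ||(R - V)_+||^2
    return policy_loss + beta_sil * value_loss


# Example: one short episode turned into buffer entries, then a SIL loss.
rewards = [0.0, 0.0, 1.0]
R = discounted_returns(rewards)            # per-step returns to store in the buffer
V = np.array([0.2, 0.1, 0.5])              # pretend value estimates V_theta(s)
logp = np.array([-0.7, -1.2, -0.3])        # pretend log pi_theta(a|s)
print(sil_loss(R, V, logp))
```

In the actual method these losses are minimized with an autodiff framework alongside the usual A2C update, and (as far as I remember) the buffer is sampled with priority proportional to the clipped advantage; the NumPy version above only illustrates the arithmetic.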

5. Experiment

(result figures from the paper omitted)

6. Conclusion

  • a proper level of exploitation of past experiences during learning can drive deep exploration, and SIL and exploration methods can be complementary
  • balancing exploration and exploitation in terms of collecting and learning from experiences is an important future research direction