Adapting A3C LSTM for Pong #48

Open · MatheusMRFM opened this issue Aug 1, 2017 · 21 comments

@MatheusMRFM

Did anyone manage to get the A3C LSTM from this repo to work for Pong (using the OpenAI Gym)?

I have already tried several different optimizers, learning rates, and network architectures, but still no success. I even altered the code in this repo to replicate the architecture used in the A3C from the OpenAI starter agent, but the agent still maintains a mean reward of about -20.5 forever. I left it training until it reached 70k global episodes, but it never improved. With some architectures, the agent would simply collapse to a policy that executes a single action all the time.

If anyone managed to get this implementation to work for Pong, I would really appreciate some hints.

@BenPorebski

@MatheusMRFM, I have managed to modify the A3C LSTM implementation to work for Pong using the OpenAI Gym.

I have not observed the -20-reward-forever problem that you are seeing. The most common problem I observed was that the agent would train up to a mean reward of -15, then the gradient would explode and everything would be forgotten. This was mitigated by increasing the size of the LSTM episode buffer from 30 to 100.

I have been training this agent for about 16 hours now and the mean reward is +13. Although I am pleased to get this working, my training time is a lot slower than what was reported in DeepMind's asynchronous methods paper.

If you need some more hints, take a look at my changes.
pong_a3c_train.py
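
In case it helps, the buffer length I mentioned is the rollout length at which the worker flushes its episode buffer into a training step. Roughly, the relevant part of the worker loop looks like this (just a sketch; the variable names follow the notebook and may not match your copy exactly):

# Flush the rollout into a training step once it reaches the chosen length.
# 30 is the original value; 100 is what stopped the gradient blow-ups for me.
if len(episode_buffer) == 100 and not d and episode_step_count != max_episode_length - 1:
    # Bootstrap from the current value estimate, train on the rollout, then clear it.
    v1 = sess.run(self.local_AC.value,
                  feed_dict={self.local_AC.inputs: [s],
                             self.local_AC.state_in[0]: rnn_state[0],
                             self.local_AC.state_in[1]: rnn_state[1]})[0, 0]
    v_l, p_l, e_l, g_n, v_n = self.train(episode_buffer, sess, gamma, v1)
    episode_buffer = []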

@MatheusMRFM (Author)

Thanks for the response!

Actually, I found one tiny detail that made a huge impact on the training process:

  • In the current implementation, the 'train' function feeds the LSTM state used for updating the global network as rnn_state = self.local_AC.state_init. If you check OpenAI's A3C, they instead use the LSTM state from the current worker's last batch (unless the episode has ended).

Therefore, I corrected this by storing the LSTM state with each batch and recovering the last LSTM state before updating the global network. You can see this in the 'train' method inside Worker.py in my updated repository on GitHub.
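
Conceptually, the change looks something like this (just a sketch, not my exact code; I am guessing at a few of this repo's placeholder names such as target_v, inputs, actions, and advantages, so adjust to match your copy):

# In Worker.work(): remember the LSTM state the worker has at the start of each
# training batch, instead of throwing it away.
rnn_state = self.local_AC.state_init
self.batch_rnn_state = rnn_state

# In Worker.train(): feed that remembered state rather than resetting to
# state_init, and keep the state that comes out of this batch for the next update.
feed_dict = {self.local_AC.target_v: discounted_rewards,
             self.local_AC.inputs: np.vstack(observations),
             self.local_AC.actions: actions,
             self.local_AC.advantages: advantages,
             self.local_AC.state_in[0]: self.batch_rnn_state[0],
             self.local_AC.state_in[1]: self.batch_rnn_state[1]}
outputs = sess.run([self.local_AC.value_loss, self.local_AC.policy_loss,
                    self.local_AC.entropy, self.local_AC.grad_norms,
                    self.local_AC.var_norms, self.local_AC.state_out,
                    self.local_AC.apply_grads],
                   feed_dict=feed_dict)
self.batch_rnn_state = outputs[5]   # the LSTM state after this batch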

With this, I managed to get a mean reward of +15 within 17k episodes on Pong. That took around one or two days using 8 threads, so it is still quite slow.

@BenPorebski

Good to hear that you have made it work, and thanks for sharing your code.

Interesting find. It looks like my code matches what you are doing with the LSTM state, although I might be missing something.

I suspect that the slow training times we are seeing are the result of non-optimal hyperparameters, or some sort of inefficiency in the code.

My training progress with Pong can be seen in the attached image; 18 hours is about 460 epochs running over 16 cores + 1 GPU.
[Image: Pong training progress]

@awjuliani (Owner)

Thanks @BenPorebski and @MatheusMRFM for pointing this out. I'm really glad you've both figured this out, as I know it was giving a number of people issues. I've updated the code in the Jupyter notebook to reflect the proper way of remembering the LSTM state during training.

@BenPorebski

Thanks for the update and for writing the initial agent @awjuliani!
I spent quite some time trying to implement A3C from scratch without any luck. Your code definitely put me on the right track.

@BenPorebski

@awjuliani, I'm just testing out your changes, but unpacking the sess.run with self.batch_rnn_state seems to throw an error from other calls to train. Not sure if this is just broken in my code.

v_l,p_l,e_l,g_n,v_n, self.batch_rnn_state,_ = sess.run([self.local_AC.value_loss, self.local_AC.policy_loss, self.local_AC.entropy,

File "tf_a3c_parallel.py", line 319, in
worker_work = lambda: worker.work(max_episode_length, gamma, sess, coord, saver)
File "tf_a3c_parallel.py", line 233, in work
v_l,p_l,e_l,g_n,v_n = self.train(episode_buffer, sess, gamma, v1)
File "tf_a3c_parallel.py", line 175, in train
feed_dict=feed_dict)
ValueError: not enough values to unpack (expected 7, got 6)

I have removed the self.batch_rnn_state from the unpack and it seems to be running, but I'm not sure if it has undone your intended changes.

@DMTSource (Contributor) commented Aug 23, 2017

@BenPorebski I had that problem because I missed adding one input var to that line. You can check the history of the file to see the change (it's a pain because of the notebook formatting):

v_l, p_l, e_l, g_n, v_n, self.batch_rnn_state, _ = sess.run(
    [self.local_AC.value_loss,
     self.local_AC.policy_loss,
     self.local_AC.entropy,
     self.local_AC.grad_norms,
     self.local_AC.var_norms,
     self.local_AC.state_out,    # the extra output that has to be unpacked
     self.local_AC.apply_grads],
    feed_dict=feed_dict)

@BenPorebski

@DMTSource, oh, of course!! Thank you.
Sorry @awjuliani, ignore my last.

@chrisplyn

@BenPorebski is your program applicable to other Atari games?

@BenPorebski

@chrisplyn, I've not tested it, but it probably would work for the other Atari games. You might need to double check that the number of actions is set correctly for the new environment.
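
For a Gym environment that mostly means making sure the action count fed to the network matches the environment, e.g. something along these lines:

import gym
env = gym.make('Pong-v0')       # swap in whichever Atari game you want to try
a_size = env.action_space.n     # use this as the network's number of actions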

@chrisplyn

@BenPorebski Thanks Ben, I am also trying to leverage your code to play flappybird-v0. I don't understand the LSTM part of your code; can you point me to a tutorial I can read?

@BenPorebski

@chrisplyn, it's been a while since I last played with this code, so I'm not sure I understand it well enough to explain. But the general idea of using an LSTM is for it to function as a short-term memory over the last 50 frames, which is set on line 227. I might be wrong, but I believe it makes the computation a bit quicker than feeding an 84x84x50 array through the entire network.

If you are after a bit more detail than that, I might have some time tomorrow to have a bit of a play. Maybe it will refresh my memory.
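
To make that a bit more concrete, the difference is roughly between these two input shapes (an illustrative NumPy sketch only; the 84x84 frame size and 256-unit LSTM are my assumptions about the preprocessing and network size):

import numpy as np

# Frame-stacking approach: the history is baked into the input, so every forward
# pass pushes all 50 frames through the convolutional layers.
stacked_input = np.zeros((1, 84, 84, 50), dtype=np.float32)

# LSTM approach: only the current frame goes through the conv layers, and the
# recurrent state (cell + hidden vectors) carries whatever the network has
# chosen to remember about the earlier frames.
single_frame = np.zeros((1, 84, 84, 1), dtype=np.float32)
lstm_state = (np.zeros((1, 256), dtype=np.float32),   # cell state
              np.zeros((1, 256), dtype=np.float32))   # hidden state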

@chrisplyn

@BenPorebski Thanks a lot Ben. Based on your experiments, does A3C work better than DQN techniques at playing Atari games?

@BenPorebski commented Nov 9, 2017

@chrisplyn In my limited experience playing with RL, I found A3C to be more stable than DQN. It could be that I botched my DQN implementation, but it struggled to learn to play Pong with a positive score. The A3C implementation that I posted does work; however, I still found replicate training runs would occasionally go out of control and forget everything they had learnt.

I don't fully understand why this happens. I'm not sure if it is something specific to my implementation of the RL algorithms, or if it is a common experience.

@chrisplyn

@BenPorebski Hi Ben, do you know how to do separate testing using the trained model?

@BenPorebski

@chrisplyn, Hi Chris, do you mean having the trained model play without learning?

@chrisplyn

@BenPorebski Yeah. I am not quite familiar with TensorFlow; can you share your code for doing this?

@BenPorebski

@chrisplyn, I think this worked for playing Pong the last time I checked: https://gist.github.com/BenPorebski/0df9a2da264bdc33aec26f0809685d8a
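
If the gist link ever goes stale, the general pattern is just to rebuild the graph and environment exactly as in the training script, restore the checkpoint, and run the policy without ever calling train(). A rough sketch (master_network, process_frame, env, and the './model' checkpoint directory are my guesses at how the training script names things):

import numpy as np
import tensorflow as tf

saver = tf.train.Saver()
with tf.Session() as sess:
    # Restore the trained weights into the same graph that was built for training.
    saver.restore(sess, tf.train.latest_checkpoint('./model'))

    s = env.reset()
    rnn_state = master_network.state_init
    done, total_reward = False, 0
    while not done:
        # Run the policy only; no gradients, no train() call anywhere.
        a_dist, rnn_state = sess.run(
            [master_network.policy, master_network.state_out],
            feed_dict={master_network.inputs: [process_frame(s)],
                       master_network.state_in[0]: rnn_state[0],
                       master_network.state_in[1]: rnn_state[1]})
        a = np.argmax(a_dist[0])          # act greedily for evaluation
        s, r, done, _ = env.step(a)
        total_reward += r
    print('episode reward:', total_reward)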

@rjgarciap

@BenPorebski Hi Ben, have you since found the reason why it happens that "replicate training runs occasionally go out of control and forget everything they have learnt"?

It also happens to me with another A3C implementation, and I would like to know whether it is a common experience due to A3C's inherent instability, or due to the way A3C was implemented.

@BenPorebski

@rjgarciap Hi Ricardo, I have not thoroughly explored this any further with my implementation. However, this does seem to be a very common experience in reinforcement learning [1][2].

@rjgarciap

@BenPorebski Thank you very much.
