Adapting A3C LSTM for Pong #48

Open · MatheusMRFM opened this issue Aug 1, 2017 · 21 comments

@MatheusMRFM

Did anyone manage to get the A3C LSTM from this repo to work for Pong (using the OpenAI Gym)?

I have already tried several different optimizers, learning rates, and network architectures, but still no success. I even altered the code in this repo to replicate the architecture used in the A3C from the OpenAI starter agent, but the agent still maintains a mean reward of about -20.5 forever. I left it training until it reached 70k global episodes, but it never improved. With some architectures, the agent would simply collapse to a policy that executes a single action all the time.

If anyone managed to get this implementation to work for Pong, I would really appreciate some hints.

@BenPorebski

@MatheusMRFM, I have managed to modify the A3C LSTM implementation to work for Pong using the OpenAI Gym.

I have not observed the -20-reward-forever problem that you are seeing. The most common problem I observed was that the agent would train up to a mean reward of -15, then the gradient would explode and everything would be forgotten. This was mitigated by increasing the size of the LSTM episode buffer from 30 to 100.

I have been training this agent for about 16 hours now and the mean reward is +13. Although I am pleased to get this working, my training time is a lot slower than what was reported in DeepMind's asynchronous methods paper.

If you need some more hints, take a look at my changes.
pong_a3c_train.py
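
In case it helps, the buffer length I mentioned is the rollout length at which the worker flushes its episode buffer into a training step. Roughly, the relevant part of the worker loop looks like this (just a sketch; the variable names follow the notebook and may not match your copy exactly):

# Flush the rollout into a training step once it reaches the chosen length.
# 30 is the original value; 100 is what stopped the gradient blow-ups for me.
if len(episode_buffer) == 100 and not d and episode_step_count != max_episode_length - 1:
    # Bootstrap from the current value estimate, train on the rollout, then clear it.
    v1 = sess.run(self.local_AC.value,
                  feed_dict={self.local_AC.inputs: [s],
                             self.local_AC.state_in[0]: rnn_state[0],
                             self.local_AC.state_in[1]: rnn_state[1]})[0, 0]
    v_l, p_l, e_l, g_n, v_n = self.train(episode_buffer, sess, gamma, v1)
    episode_buffer = []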

@MatheusMRFM (Author)

Thanks for the response!

Actually, I found one tiny detail that made a huge impact on the training process:

  • In the current implementation, the 'train' function feeds the LSTM state used for updating the global network as rnn_state = self.local_AC.state_init. If you check OpenAI's A3C, they instead use the LSTM state from the current worker's last batch (unless the episode has ended).

Therefore, I corrected this by storing the LSTM state with each batch and recovering the last LSTM state before updating the global network. You can see this in the 'train' method inside Worker.py in my updated repository on GitHub.
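
Conceptually, the change looks something like this (just a sketch, not my exact code; I am guessing at a few of this repo's placeholder names such as target_v, inputs, actions, and advantages, so adjust to match your copy):

# In Worker.work(): remember the LSTM state the worker has at the start of each
# training batch, instead of throwing it away.
rnn_state = self.local_AC.state_init
self.batch_rnn_state = rnn_state

# In Worker.train(): feed that remembered state rather than resetting to
# state_init, and keep the state that comes out of this batch for the next update.
feed_dict = {self.local_AC.target_v: discounted_rewards,
             self.local_AC.inputs: np.vstack(observations),
             self.local_AC.actions: actions,
             self.local_AC.advantages: advantages,
             self.local_AC.state_in[0]: self.batch_rnn_state[0],
             self.local_AC.state_in[1]: self.batch_rnn_state[1]}
outputs = sess.run([self.local_AC.value_loss, self.local_AC.policy_loss,
                    self.local_AC.entropy, self.local_AC.grad_norms,
                    self.local_AC.var_norms, self.local_AC.state_out,
                    self.local_AC.apply_grads],
                   feed_dict=feed_dict)
self.batch_rnn_state = outputs[5]   # the LSTM state after this batch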

With this, I managed to get a mean reward of +15 within 17k episodes on Pong. That took around one or two days using 8 threads, so it is still quite slow.

@BenPorebski

Good to hear that you have made it work, and thanks for sharing your code.

Interesting find. It looks like my code matches what you are doing with the LSTM state, although I might be missing something.

I suspect that the slow training times we are seeing are the result of non-optimal hyperparameters, or some sort of inefficiency in the code.

My training progress with Pong can be seen in the attached image; 18 hours is about 460 epochs running over 16 cores + 1 GPU.
[Image: Pong training progress]

@awjuliani (Owner)

Thanks @BenPorebski and @MatheusMRFM for pointing this out. I'm really glad you've both figured this out, as I know it was giving a number of people issues. I've updated the code in the Jupyter notebook to reflect the proper way of remembering the LSTM state during training.

@BenPorebski

Thanks for the update and for writing the initial agent @awjuliani!
I spent quite some time trying to implement A3C from scratch without any luck. Your code definitely put me on the right track.

@BenPorebski

@awjuliani, I'm just testing out your changes, but unpacking the sess.run with self.batch_rnn_state seems to throw an error from other calls to train. Not sure if this is just broken in my code.

v_l,p_l,e_l,g_n,v_n, self.batch_rnn_state,_ = sess.run([self.local_AC.value_loss, self.local_AC.policy_loss, self.local_AC.entropy,

File "tf_a3c_parallel.py", line 319, in
worker_work = lambda: worker.work(max_episode_length, gamma, sess, coord, saver)
File "tf_a3c_parallel.py", line 233, in work
v_l,p_l,e_l,g_n,v_n = self.train(episode_buffer, sess, gamma, v1)
File "tf_a3c_parallel.py", line 175, in train
feed_dict=feed_dict)
ValueError: not enough values to unpack (expected 7, got 6)

I have removed the self.batch_rnn_state from the unpack and it seems to be running, but I'm not sure if it has undone your intended changes.

@DMTSource (Contributor) commented Aug 23, 2017

@BenPorebski I had that problem because I missed adding one input var to that line. You can check the history of the file to see the change (it's a pain because of the notebook formatting):

v_l, p_l, e_l, g_n, v_n, self.batch_rnn_state, _ = sess.run(
    [self.local_AC.value_loss,
     self.local_AC.policy_loss,
     self.local_AC.entropy,
     self.local_AC.grad_norms,
     self.local_AC.var_norms,
     self.local_AC.state_out,    # the extra output that has to be unpacked
     self.local_AC.apply_grads],
    feed_dict=feed_dict)

@BenPorebski

@DMTSource, oh, of course!! Thank you.
Sorry @awjuliani, ignore my last.

@chrisplyn

@BenPorebski is your program applicable to other Atari games?

@BenPorebski

@chrisplyn, I've not tested it, but it probably would work for the other Atari games. You might need to double check that the number of actions is set correctly for the new environment.
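
For a Gym environment that mostly means making sure the action count fed to the network matches the environment, e.g. something along these lines:

import gym
env = gym.make('Pong-v0')       # swap in whichever Atari game you want to try
a_size = env.action_space.n     # use this as the network's number of actions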

@chrisplyn

@BenPorebski Thanks Ben, I am also trying to leverage your code to play flappybird-v0. I don't understand the LSTM part of your code; can you point me to a tutorial I can read?

@BenPorebski

@chrisplyn, it's been a while since I last played with this code, so I'm not sure I understand it well enough to explain. But the general idea of using an LSTM is for it to function as a short-term memory over the last 50 frames, which is set on line 227. I might be wrong, but I believe it makes the computation a bit quicker than feeding an 84x84x50 array through the entire network.

If you are after a bit more detail than that, I might have some time tomorrow to have a bit of a play. Maybe it will refresh my memory.
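
To make that a bit more concrete, the difference is roughly between these two input shapes (an illustrative NumPy sketch only; the 84x84 frame size and 256-unit LSTM are my assumptions about the preprocessing and network size):

import numpy as np

# Frame-stacking approach: the history is baked into the input, so every forward
# pass pushes all 50 frames through the convolutional layers.
stacked_input = np.zeros((1, 84, 84, 50), dtype=np.float32)

# LSTM approach: only the current frame goes through the conv layers, and the
# recurrent state (cell + hidden vectors) carries whatever the network has
# chosen to remember about the earlier frames.
single_frame = np.zeros((1, 84, 84, 1), dtype=np.float32)
lstm_state = (np.zeros((1, 256), dtype=np.float32),   # cell state
              np.zeros((1, 256), dtype=np.float32))   # hidden state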

@chrisplyn

@BenPorebski Thanks a lot Ben. Based on your experiments, does A3C work better than DQN techniques at playing Atari games?

@BenPorebski commented Nov 9, 2017

@chrisplyn In my limited experience playing with RL, I found A3C to be more stable than DQN. It could be that I botched my DQN implementation, but it struggled to learn to play Pong with a positive score. The A3C implementation that I posted does work; however, I still found replicate training runs would occasionally go out of control and forget everything they had learnt.

I don't fully understand why this happens. I'm not sure if it is something specific to my implementation of the RL algorithms, or if it is a common experience.

@chrisplyn

@BenPorebski Hi Ben, do you know how to do separate testing using the trained model?

@BenPorebski

@chrisplyn, Hi Chris, do you mean having the trained model play without learning?

@chrisplyn

@BenPorebski Yeah. I am not quite familiar with TensorFlow; can you share your code for doing this?

@BenPorebski

@chrisplyn, I think this worked for playing Pong the last time I checked: https://gist.github.com/BenPorebski/0df9a2da264bdc33aec26f0809685d8a
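
If the gist link ever goes stale, the general pattern is just to rebuild the graph and environment exactly as in the training script, restore the checkpoint, and run the policy without ever calling train(). A rough sketch (master_network, process_frame, env, and the './model' checkpoint directory are my guesses at how the training script names things):

import numpy as np
import tensorflow as tf

saver = tf.train.Saver()
with tf.Session() as sess:
    # Restore the trained weights into the same graph that was built for training.
    saver.restore(sess, tf.train.latest_checkpoint('./model'))

    s = env.reset()
    rnn_state = master_network.state_init
    done, total_reward = False, 0
    while not done:
        # Run the policy only; no gradients, no train() call anywhere.
        a_dist, rnn_state = sess.run(
            [master_network.policy, master_network.state_out],
            feed_dict={master_network.inputs: [process_frame(s)],
                       master_network.state_in[0]: rnn_state[0],
                       master_network.state_in[1]: rnn_state[1]})
        a = np.argmax(a_dist[0])          # act greedily for evaluation
        s, r, done, _ = env.step(a)
        total_reward += r
    print('episode reward:', total_reward)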

@rjgarciap

@BenPorebski Hi Ben, have you since found the reason why it happens that "replicate training runs occasionally go out of control and forget everything they have learnt"?

It also happens to me with another A3C implementation, and I would like to know whether it is a common experience due to A3C's inherent instability, or due to the way A3C was implemented.

@BenPorebski

@rjgarciap Hi Ricardo, I have not thoroughly explored this any further with my implementation. However, this does seem to be a very common experience in reinforcement learning [1][2].

@rjgarciap

@BenPorebski Thank you very much.
