Adapting A3C LSTM for Pong #48
Comments
@MatheusMRFM, I have managed to modify the A3C LSTM implementation to work for Pong using the OpenAI Gym. I have not observed the -20-reward-forever problem that you are seeing. The most common problem I observed was that the agent would train up to a mean reward of -15, then the gradient would explode and everything would be forgotten. This was mitigated by increasing the size of the LSTM episode buffer from 30 to 100. I have been training this agent for about 16 hours now and the mean reward is +13. Although I am pleased to get this working, my training time is a lot slower than what was reported in DeepMind's asynchronous methods paper. If you need some more hints, take a look at my changes.
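For reference, the buffer-size change is just the rollout cutoff inside the worker's `work` loop. A minimal sketch of that part, assuming the same variable names as the notebook (`episode_buffer`, `d`, `rnn_state`, `episode_step_count`):

```python
# Inside Worker.work(): train on longer partial rollouts before flushing the buffer.
# Raising this cutoff from 30 to 100 is what stopped the gradient blow-ups for me.
ROLLOUT_LENGTH = 100  # was 30

if len(episode_buffer) == ROLLOUT_LENGTH and not d and episode_step_count != max_episode_length - 1:
    # The episode is not over, so bootstrap the return from the current value estimate.
    v1 = sess.run(self.local_AC.value,
                  feed_dict={self.local_AC.inputs: [s],
                             self.local_AC.state_in[0]: rnn_state[0],
                             self.local_AC.state_in[1]: rnn_state[1]})[0, 0]
    v_l, p_l, e_l, g_n, v_n = self.train(episode_buffer, sess, gamma, v1)
    episode_buffer = []
```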
Thanks for the response! Actually, I found one tiny detail that made a huge impact on the training process: the LSTM state was not being carried over between consecutive training batches, so every update effectively started from a reset recurrent state.
Therefore, I corrected this by inserting the LSTM state into the batch and recovering the last LSTM state before updating the global network. You can see this in my updated repository on GitHub, in the 'train' method inside Worker.py. With this, I managed to get a mean reward of +15 within 17k episodes in Pong. This took me around one or two days using 8 threads, so it is still quite slow as well.
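For anyone who wants the shape of the fix without digging through the repo, the relevant part of the `train` method ends up looking roughly like this (a sketch only; `batch_rnn_state`, `state_in`, and `state_out` follow the notebook's naming, and the code that builds the batch arrays is omitted):

```python
# Sketch of the LSTM-state bookkeeping inside Worker.train().
# self.batch_rnn_state is set to self.local_AC.state_init at the start of each
# episode (in Worker.work) and then carried across successive train() calls.
feed_dict = {self.local_AC.target_v: discounted_rewards,
             self.local_AC.inputs: np.vstack(observations),
             self.local_AC.actions: actions,
             self.local_AC.advantages: advantages,
             # Start this rollout from where the previous rollout's LSTM state
             # ended, instead of resetting to zeros on every update.
             self.local_AC.state_in[0]: self.batch_rnn_state[0],
             self.local_AC.state_in[1]: self.batch_rnn_state[1]}

v_l, p_l, e_l, g_n, v_n, self.batch_rnn_state, _ = sess.run(
    [self.local_AC.value_loss,
     self.local_AC.policy_loss,
     self.local_AC.entropy,
     self.local_AC.grad_norms,
     self.local_AC.var_norms,
     self.local_AC.state_out,   # fetch the final LSTM state and remember it
     self.local_AC.apply_grads],
    feed_dict=feed_dict)
```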
Thanks @BenPorebski and @MatheusMRFM for pointing this out. I'm really glad you've both figured this out, as I know it was giving a number of people issues. I've updated the code in the Jupyter notebook to reflect the proper way of remembering the LSTM state during training now.
Thanks for the update and for writing the initial agent @awjuliani!
@awjuliani, I'm just testing out your changes, but unpacking the sess.run with self.batch_rnn_state seems to throw an error on other calls to train. I'm not sure if this is just broken in my code.
I have removed self.batch_rnn_state from the unpacking and it seems to be running, but I'm not sure whether that has undone your intended changes.
@BenPorebski I had that problem because I missed adding one variable to that line; you can check the history of the file to see it (it's a pain because of the notebook formatting): `v_l, p_l, e_l, g_n, v_n, self.batch_rnn_state, _ = sess.run([self.local_AC.value_loss, ...`
@DMTSource, oh, of course!! Thank you. |
@BenPorebski is your program applicable to other Atari games? |
@chrisplyn, I've not tested it, but it probably would work for the other Atari games. You might need to double-check that the number of actions is set correctly for the new environment.
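Something along these lines is enough to check it (a sketch; 'Breakout-v0' is just a placeholder id, and `a_size` is the name the notebook uses for the number of actions):

```python
import gym

# Hypothetical check: let Gym report the discrete action count for the new game
# instead of hard-coding the value that was used for Pong.
env = gym.make('Breakout-v0')   # swap in whichever environment id you need
a_size = env.action_space.n     # number of discrete actions for this game
print(env.spec.id, 'has', a_size, 'actions')
```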
@BenPorebski Thanks Ben, I am also trying to leverage your code to play flappybird-v0. I don't understand the LSTM part of your code; can you refer me to any tutorial I can read?
@chrisplyn, it's been a while since I was last playing with this code, so I'm not sure I understand it well enough to explain. But the general idea of using an LSTM is for it to function as a short-term memory of the last 50 frames, which is set on line 227. I might be wrong, but I believe it makes the computation a bit quicker than feeding an 84x84x50 array through the entire network. If you are after a bit more detail than that, I might have some time tomorrow to have a bit of a play; maybe it will refresh my memory.
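If it helps, a stripped-down sketch of that wiring in TF 1.x looks roughly like this (the 84x84 frame size and the 256-unit cell follow the notebook; everything else is just an illustration, not the exact code):

```python
import tensorflow as tf

# One flattened 84x84 grayscale frame per time step.
inputs = tf.placeholder(tf.float32, [None, 84 * 84])
frames = tf.reshape(inputs, [-1, 84, 84, 1])

# Convolutional features of the single current frame only.
conv = tf.layers.conv2d(frames, 16, 8, strides=4, activation=tf.nn.elu)
conv = tf.layers.conv2d(conv, 32, 4, strides=2, activation=tf.nn.elu)
flat = tf.layers.flatten(conv)

# The LSTM carries information about earlier frames forward in its state,
# so the network never has to see a big stack of recent frames as one input.
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(256)
rnn_in = tf.expand_dims(flat, [0])          # treat the batch as one time sequence
step_size = tf.shape(frames)[:1]
state_in = lstm_cell.zero_state(1, tf.float32)
lstm_out, state_out = tf.nn.dynamic_rnn(
    lstm_cell, rnn_in, initial_state=state_in, sequence_length=step_size)
rnn_out = tf.reshape(lstm_out, [-1, 256])   # fed into the policy and value heads
```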
@BenPorebski Thanks a lot Ben. Based on your experiments, does A3C work better than DQN techniques at playing Atari games?
@chrisplyn In my limited experience playing with RL, I found A3C to be more stable than DQN. It could be that I botched my DQN implementation, but it struggled to learn to play Pong with a positive score. The A3C implementation that I posted does work; however, I still found replicate training runs to occasionally go out of control and forget everything they had learnt. I don't fully understand why this happens. I'm not sure if it is something specific to my implementation of the RL algorithms, or if it is a common experience.
@BenPorebski Hi Ben, do you know how to do separate testing using the trained model? |
@chrisplyn, Hi Chris, do you mean having the trained model play without learning? |
@BenPorebski yeah, I am not quite familiar with tensorflow. Can you share your code for doing this?
@chrisplyn, I think this worked for playing Pong the last time I checked: https://gist.github.com/BenPorebski/0df9a2da264bdc33aec26f0809685d8a
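The general pattern in that gist is just: restore the checkpoint and run the policy network forward, without ever calling train. A rough sketch, assuming the notebook's names (`master_network` built under the 'global' scope, `process_frame` for preprocessing, `model_path` for the checkpoint directory):

```python
import numpy as np
import tensorflow as tf
import gym

# Assumes the graph (master_network) has already been built exactly as in the
# notebook, and that model_path points at the directory the Saver wrote to.
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint(model_path))

    env = gym.make('Pong-v0')
    s = process_frame(env.reset())
    rnn_state = master_network.state_init
    done, total_reward = False, 0

    while not done:
        # Forward pass only: action probabilities and the next LSTM state.
        a_dist, rnn_state = sess.run(
            [master_network.policy, master_network.state_out],
            feed_dict={master_network.inputs: [s],
                       master_network.state_in[0]: rnn_state[0],
                       master_network.state_in[1]: rnn_state[1]})
        a = np.random.choice(len(a_dist[0]), p=a_dist[0])  # sample from the policy
        s_raw, r, done, _ = env.step(a)
        s = process_frame(s_raw)
        total_reward += r

    print('Episode reward:', total_reward)
```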
@BenPorebski Hi Ben, have you since found the reason why this happens: "I still found replicate training runs to occasionally go out of control and forget everything it has learnt"? It also happens to me with another A3C implementation, and I would like to know whether it is a common experience, due to A3C instability, or due to the way A3C was implemented.
@rjgarciap Hi Ricardo, I have not thoroughly explored this any further with my implementation. However, this does seem to be a very common experience in reinforcement learning [1][2]. |
@BenPorebski Thank you very much. |
Original issue from @MatheusMRFM:
Did anyone manage to get the A3C LSTM in this repo to work for Pong (using the OpenAI Gym)?
I have already tried several different optimizers, learning rates, and network architectures, but still no success. I even altered the code in this repo to try to replicate the architecture used in the A3C from the OpenAI starter agent, but with no success: the agent maintains a mean reward of about -20.5 forever. I left it training until it reached 70k global episodes, but it didn't get any better. With some architectures, the agent would simply diverge to a policy where it executes only a single action all the time.
If anyone has managed to get this implementation to work for Pong, I would really appreciate some hints.