
DQN solution results peak at ~35 reward #30

Open
nerdoid opened this issue Nov 15, 2016 · 77 comments

Comments

@nerdoid

commented Nov 15, 2016

Hi Denny,

Thanks for this wonderful resource. It's been hugely helpful. Can you say what your results are when training the DQN solution? I've been unable to reproduce the results of the DeepMind paper. I'm using your DQN solution below, although I did try mine first :)

While training, the TensorBoard "episode_reward" graph peaks at an average reward of ~35, and then tapers off. When I run the final checkpoint with a fully greedy policy (no epsilon), I get similar rewards.

The DeepMind paper cites rewards around 400.

I have tried 1) implementing the error clipping from the paper, 2) changing the RMSProp arguments to (I think) reflect those from the paper, 3) changing the "replay_memory_size" and "epsilon_decay_steps" to the paper's 1,000,000, 4) enabling all six agent actions from the OpenAI Gym. Still, average reward peaks at ~35.

Any pointers would be greatly appreciated.

Screenshot is of my still-running training session at episode 4878 with all above modifications to dqn.py:

[screenshot: TensorBoard episode_reward graph, 2016-11-15 11:35 AM]

@dennybritz

Owner

commented Nov 15, 2016

I've been seeing the same and I haven't quite figured out what's wrong yet (debugging turnaround time is quite long). I recently implemented A3C and also saw results peak earlier than they should.

The solutions share some of the code so right now my best guess is that it could be something related to the image preprocessing code or the way consecutive states are stacked before being fed to the CNN. Perhaps I'm doing something wrong there and throwing away information.

If you have time to look into that of course it'd be super appreciated.

@fferreres


commented Nov 15, 2016

I saw the same on a3c after 1.9 million steps:

[chart: A3C on Breakout]

And I attributed it to the fact that I only used 4 learners (2 of which are hyperthreads; only 2 real cores). With fewer agents there is less stochasticity, so training is orders of magnitude slower than in the paper. The variance remained huge too, and it plateaus, maybe too quickly, at a score of about 38.

I started reviewing DQN first (though I got a high-level view of how A3C is implemented too and will come back to it later), and have asked some basic questions so far, as I was seeing the same level of performance.

One thing that would help diagnose performance is adding the graph to the summaries so we can inspect it in TensorBoard. Other than that, it could be many things. The most impactful is exactly how the backprop is behaving, and (for me) whether the gather operation and the way the optimization is requested produce exactly what is wanted. Probably they do, but that's where I am looking now.

@dennybritz

Owner

commented Nov 15, 2016

Thanks for helping out.

The most impactful is exactly how the backprop is behaving, and (for me) whether the gather operation and the way the optimization is requested produce exactly what is wanted. Probably they do, but that's where I am looking now.

If the backprop calculation were wrong, I think the model wouldn't learn at all. It seems unlikely that it starts learning but bottoms out because of an incorrect gradient. Unlikely, but not impossible. It's definitely worth looking into, but it'd be harder to debug than the preprocessing. A simple thing to try would be to multiply with a mask matrix (a one-hot vector) instead of using tf.gather.
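For what it's worth, the two selection schemes pick out the same values in the forward pass; here is a NumPy sketch (illustrative only, not the repo's code) of gather-style indexing versus the one-hot mask:

```python
import numpy as np

# Selecting Q(s, a) for the actions taken in a batch, two ways.
q_values = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])   # batch of Q(s, .) predictions
actions = np.array([2, 0])               # actions taken in each state

gathered = q_values[np.arange(len(actions)), actions]   # gather-style
mask = np.eye(q_values.shape[1])[actions]               # one-hot rows
masked = (q_values * mask).sum(axis=1)                  # mask-style

assert np.allclose(gathered, masked)     # both give [3.0, 4.0]
```

If the forward values match but training still differs, the problem would have to be in the gradients the two ops produce, which is exactly what swapping them in the graph would reveal.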

Adding the graph to TensorBoard should be simple. It can be passed to the summary writer that is created in the code, e.g. tf.train.SummaryWriter(..., sess.graph).

@nerdoid

Author

commented Nov 15, 2016

Thanks for the responses.

So I manually performed some environment steps and displayed their preprocessed forms. An example is below. It looks OK to me. The paper also mentions taking the max of the current and previous frames for each pixel location in order to handle flickering in some games. It seems to be unnecessary here since I didn't observe the ball disappearing between one step and the next.

I'm happy to check out the stacking code next. Looking at TensorFlow's Convolution documentation, is it possible that information is being lost by passing the state histories to conv2d as channels? Depending on how conv2d combines channel filters, maybe history is getting reduced to one state in the first convolutional layer. Maybe depthwise_conv2d is relevant or, alternatively, flattening the stack prior to feeding in to the network. I'm more comfortable with RNNs than with CNNs, so I could be off base here though :)
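For reference, a small NumPy sketch of the stacking scheme as I understand it (shapes assumed from the paper: four 84x84 grayscale frames stacked as the channel dimension of one conv2d input):

```python
import numpy as np

# Four consecutive 84x84 grayscale frames become the channels of a single
# conv2d input (N, H, W, C); each frame here is filled with its timestep.
frames = [np.full((84, 84), t, dtype=np.float32) for t in range(4)]
state = np.stack(frames, axis=-1)         # shape (84, 84, 4)
batch = state[np.newaxis, ...]            # conv2d expects (N, H, W, C)
assert batch.shape == (1, 84, 84, 4)

# Appending a new frame drops the oldest one, sliding the history window:
new_frame = np.full((84, 84), 4, dtype=np.float32)
next_state = np.append(state[:, :, 1:], new_frame[:, :, np.newaxis], axis=2)
assert next_state[0, 0].tolist() == [1, 2, 3, 4]
```

If the per-channel values come out in the right temporal order like this, the history is intact going into the first conv layer.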

[screenshot: preprocessed frame, 2016-11-15 3:28 PM]

@dennybritz

Owner

commented Nov 15, 2016

Depending on how conv2d combines channel filters, maybe history is getting reduced to one state in the first convolutional layer. Maybe depthwise_conv2d is relevant or, alternatively, flattening the stack prior to feeding in to the network. I'm more comfortable with RNNs than with CNNs, so I could be off base here though :)

I think you are right in that the history is being reduced to one representation after the first convolutional layer, but I think that's how it's supposed to be - you want to combine all the historic frames to create a new representation. If you used depthwise_conv2d, you would never combine information across frames.

I do agree that it seems likely that something is wrong in the CNN though... It could be a missing bias term (the fully connected layer also has no bias initializer), a wrong initializer, etc.

@dennybritz

Owner

commented Nov 15, 2016

I just looked at the paper again:

The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.

In my code the final hidden layer does not have a ReLU activation (or bias term). Maybe that's it?

@nerdoid

Author

commented Nov 15, 2016

Nice catch. That does seem like it would limit the network's expressiveness. Unless you already have something running, I'll go ahead and fire up an instance and give it a shot with ReLU on the hidden fully connected layer. Can also add biases to both fully connected layers (probably initialized to zero?).

@dennybritz

Owner

commented Nov 15, 2016

I'm not running it right now so it'd be great if you could try. In my experience the biases never helped much but adding them won't hurt. Zero-initialized should be fine.

@fferreres


commented Nov 16, 2016

  1. I don't know the right way to add the graph for review in TensorBoard, since at the point where tf.train.SummaryWriter(..., sess.graph) would be called, the session hasn't been created yet (that happens much later). If I try to change things I'll refactor the wrong way.

  2. Some other things that can affect performance and that are discussed in several forums as things that have been done by the DeepMind team:

  • Clipping of rewards to the range [-1, 1] - I think Breakout is not affected
  • Clipping of the gradients too, to limit any one update the hard way if necessary (not sure how they do this, or how real it is; they'd need to work with the compute_gradients output)
  • They consider a terminal state after LOSS OF LIFE, not after the episode ends. This is reported to have a large impact on learning speed and on the score you can achieve (likely because, without it, it's hard for the agent to notice that losing a life is really very bad and not "free")
  3. I really suspect the backprop is causing most of the trouble, and there are two potential problems with it.

The first (a) is the Gather operation being used instead of a one-hot (which you mentioned, Denny); the second (b) is asking the optimizer to treat variables differently for different examples (instead of variables we want treated uniformly across an entire mini-batch, which is usually how it will "think": that the same rules apply to the whole mini-batch).

a) The Gather operation behaves very differently from a one-hot. Gather tells the optimizer to completely disregard the impact on the actions not taken, as if they didn't affect the total cost. The resulting updates ignore that we have put a lot of effort into fine-tuning the other values; in the next state those may be the right actions, and we want Q to provide good estimates of those state-action values. With Gather (instead of a one-hot) the optimizer reduces the cost for one action without balancing the need to leave the value estimates for the other actions unchanged. So, regardless of how Gather works, a one-hot is needed so that when the optimizer looks at the derivatives, it takes the other actions' cost into account, and thus tries to find the direction that improves the current action's cost (lowering it) while affecting the other action values (i.e. the current outputs) as little as possible.

b) The other is regarding the mini-batch. I am not sure how TensorFlow mini-batching works, but it may not be designed to have variables corresponding to different examples within a single batch. Another way to say it: I think mini-batching assumes the set of variables being optimized doesn't change between examples within a batch. Notice how the optimizer always works across trainable variables, and the list of variables that minimize takes as an argument most likely expects variables used across the whole mini-batch for every example. E.g. you may be optimizing only x1 and not x2, but it may not work if you want to optimize x1 in the first example of the batch and x2 in the second. I only suspect this must be so, since in many other contexts the mini-batch is derived from a cost function that is static (not fixing different variables across the batch).

I hope I am not saying something too stupid; if so, apologies. Bottom line is that reward and gradient clipping (I just noticed TensorFlow makes that easy to implement), ending the episode when a life is lost, using a one-hot instead of Gather, and a ReLU at the final fully connected layer may, combined, boost learning significantly here, and whatever applies may carry over to A3C (which is what I care about most).

@fferreres


commented Nov 16, 2016

Depending on how conv2d combines channel filters, maybe history is getting reduced to one state in the first convolutional layer. Maybe depthwise_conv2d is relevant or, alternatively, flattening the stack prior to feeding in to the network. I'm more comfortable with RNNs than with CNNs, so I could be off base here though :)

I just edited my response, which was wrong. I really don't know how conv layers usually combine the channels. The simplest case is averaging; in CS231n they say sum, as in many other resources. The history isn't necessarily lost, and will be spread across the filters.

@nerdoid

Author

commented Nov 16, 2016

Unfortunately, it seems to have plateaued again. Currently at episode 3866. Here's the code I used:

fc1 = tf.contrib.layers.fully_connected(flattened, 512, activation_fn=tf.nn.relu, biases_initializer=tf.constant_initializer(0.0))
self.predictions = tf.contrib.layers.fully_connected(fc1, len(VALID_ACTIONS), biases_initializer=tf.constant_initializer(0.0))

[screenshot: episode_reward plateau, 2016-11-16 11:58 AM]

@fferreres


commented Nov 16, 2016

Is something like this possible? If an action is not taken, set its target value equal to the current Q prediction for that action. For the (one) action taken, set the target y value to our TD target q(s',a'). Or does it need a one-hot? If that's possible, then

self.y_pl is filled with the TD target we calculated (using s',a') at the index corresponding to the action taken, and all other y targets (actions not chosen) are filled with the predicted value from self.predictions (same value in Q and target y). Then just:

# Calculate the loss
self.losses = tf.squared_difference(self.y_pl, self.predictions)
self.loss = tf.reduce_mean(self.losses)
@dennybritz

Owner

commented Nov 17, 2016

Thanks for the comments. Here are a few thoughts:

Clipping of rewards to the range [-1, 1] - I think Breakout is not affected

Good point. Likely to make a difference for some of the games.

Clipping of the gradients too, to limit any one update the hard way if necessary (not sure how they do this or how real, they'd need to mess with compute gradients output.

This won't make a difference in learning ability, it'll only help you avoid exploding gradients. So as long as you don't get NaN values for your gradients this is not necessary. It's easy to add though.
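For anyone who wants to add it anyway, gradient-norm clipping is a one-liner; a NumPy stand-in sketch (in TensorFlow this would sit between compute_gradients and apply_gradients):

```python
import numpy as np

def clip_by_norm(grad, clip_norm):
    """Rescale grad so its L2 norm is at most clip_norm (leaves direction unchanged)."""
    norm = np.linalg.norm(grad)
    return grad if norm <= clip_norm else grad * (clip_norm / norm)

g = np.array([30.0, 40.0])               # norm 50
clipped = clip_by_norm(g, 10.0)
assert np.allclose(clipped, [6.0, 8.0])  # norm shrunk to 10, direction kept
```

As noted above, this only guards against exploding updates; it doesn't change what the model can learn.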

They consider a terminal state after LOSS OF LIFE, not after the episode ends. This is reported to have a large impact on learning speed and on the score you can achieve (likely because, without it, it's hard for the agent to notice that losing a life is really very bad and not "free")

Really good point and may be the key here. This could make a huge difference. I looked at the paper again and I actually can't find details on this. Can you point me to it?

But with Gather (instead of a one-hot) the optimizer reduces the cost for one action, without balancing the need to leave the value estimates for the other actions unchanged. So, regardless of how Gather works, a one-hot is needed so that when the optimizer looks at the derivatives, it takes the other actions' cost into account, and thus tries to find the direction that improves the current action's cost (lowering it) while affecting the other action values (i.e. the current outputs) as little as possible.

I don't think this is right. It is correct to only take into account the current action. Remember that the goal of the value function approximator is to predict the total value of a state-action pair, q(s,a). The value of an (s,a) pair is completely independent of the other actions in the same state. In fact, the only reason our estimator even outputs values for all actions is because it's faster. In theory you could have a separate estimator for each action (as is done in regular Q-Learning). You may be thinking of policy-based methods like policy gradients that predict probabilities for all actions. There we need to take into account the actions that are not taken, but not here. I'm pretty confident the current code is correct.

The other is regarding the mini-batch. I am not sure how TensorFlow mini-batching works, but it may not be designed to have variables corresponding to different examples within a single batch

I'm not sure if I understand this point. Can you explain more? However, I'm pretty confident that the optimization is right.

So I think it may be worth looking at the reward structure of the environment and what they do in the original paper...

I'll try to refactor the code on the weekend to look at the gradients in Tensorboard. That should give some insight into what's going on.

@fferreres


commented Nov 17, 2016

Denny,

Really good point and may be the key here. This could make a huge difference. I looked at the paper again and I actually can't find details on this. Can you point me to it?

In the paper, under Methods, Training, they mention this:

For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.

I've seen forums where people take this for granted; they tried it in at least Breakout and it made a large difference.

SKIP THIS - It's just background on how I was confused...

Regarding the optimization, thanks for taking the time to explain. I highly appreciate it, and will sit with it and try to become more familiar with the concepts. My comment was inspired by a comment by Karpathy, who implemented REINFORCE.js and on that page literally mentions, under Bells & Whistles, Modeling Q(s,a):

Step 3. Forward fθ(st) and apply a simple L2 regression loss on the dimension a(t) of the output, to be y. This gradient will have a very simple form where the predicted value is simply subtracted from y. The other output dimensions have zero gradient.

But I see how I misread it. It confuses me a bit that having 10 outputs can be the same as having 10 networks dedicated to each action (not that both don't work, but how can the 10 outputs be as expressive?).

I am also probably mixing things up... such as when doing classification with softmax, where we set labels for all classes: the wrong ones we push closer to 0 and the right one towards 1. That trains the network to not only output something closer to 1 for the right label, but also to push the other labels closer to 0. In this case (DQN) I thought similarly: that our "0" for the other actions was whatever value the function estimates (not relevant in the current state, but something we don't want to change much because those actions were not taken), and that the "1" was the target we actually calculate for the action taken. I know, I am still pulling my hair out... :-)

@dennybritz

Owner

commented Nov 17, 2016

For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.

Hm, but that doesn't mean the end of an episode happens when a life is lost. It could also mean that they use the life counter to end the episode once the count reaches 0. It's a bit ambiguous. It's definitely worth trying out.

@nerdoid

Author

commented Nov 17, 2016

I'm having a hard time wrapping my head around the impact of ending an episode at the loss of a life considering that training uses random samples of replay memory rather than just the current episode's experience. Starting a new episode is not intrinsically punishing, is it? A negative reward for losing a life is obviously desirable though. I'd love to be wrong because this would be an easy win if it does make a difference.

@nerdoid

Author

commented Nov 17, 2016

On another note, I manually performed a few steps (fire, left, left) while printing out the processed state stack one depth (time step) at a time, and it checks out. The stacking code looks good to me.

@dennybritz

Owner

commented Nov 17, 2016

@nerdoid I'm not 100% sure either, but here are some thoughts:

  • Since the episode isn't ended, you could potentially stack frames across lives. That doesn't seem right, since there's a visual reset when you lose a life. But I agree this isn't likely to make a huge difference, given that such samples from the replay memory should be pretty rare.
  • Intuitively, the agent isn't punished for losing a life because it knows that it can get rewards with the next life. Imagine a scenario where you are just about to lose a life. What should the Q value be? It depends on how many lives you have left. With no lives left the value should be 0, but with multiple lives left the value should be > 0. However, the estimator has no notion of how many lives are left (it's not even in the image), so it would make sense that it becomes confused about these kinds of states. It becomes an easier learning problem when the target values of states just before losing a life are set to 0.

I know it's a bit far fetched of an explanation and I don't have a clear notion of how it affects things mathematically, but it's definitely worth trying.

@dennybritz

Owner

commented Nov 17, 2016

@fferreres I think your main confusion is that you are mixing up policy based methods (like REINFORCE and actor critic) with value-based methods like Q-Learning. In policy-based methods you do need to take all actions into account because the predictions are a probability distribution. However, in the DQN code we are not even using a softmax ;)

@dennybritz

Owner

commented Nov 17, 2016

Just to clarify the above: by not ending the episode when a life is lost, you are basically teaching the agent that upon losing a life the visual state is reset and it can accumulate more reward in the same episode, starting from scratch. In a way you are even encouraging it to get into states that lead to a loss of life, because it learns that these states lead to a visual reset but not the end of the episode, which is good for getting more reward. So I do believe this can make a huge difference. Does that make sense?

@nerdoid

Author

commented Nov 17, 2016

Ahhh, yeah that makes a lot of sense. Thanks. I wonder then about the relative impact of doing a full episode restart vs. just reinitializing the state stack and anything else like that. Either way it's worth trying.

Although I do still have this nagging feeling that the environment could just give a large negative reward for losing a life. Is that incorrect?

@dennybritz

Owner

commented Nov 17, 2016

Thanks. I wonder then about the relative impact of doing a full episode restart vs. just reinitializing the state stack and anything else like that.

I think the key is that the targets are set to 0 when a life is lost, i.e. this line here:

targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * discount_factor * np.amax(q_values_next, axis=1)

So when a life is lost done should be true and the env should have a full reset.
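A tiny worked example of that line (values made up) shows how done == True kills the bootstrap term and collapses the target to the immediate reward:

```python
import numpy as np

# Illustrative batch: np.invert on the boolean done flags zeroes out the
# discounted max-Q term wherever done is True.
reward_batch = np.array([0.0, 1.0, 0.0])
done_batch = np.array([False, False, True])
q_values_next = np.array([[0.5, 2.0], [1.0, 3.0], [4.0, 5.0]])
discount_factor = 0.99

targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * \
    discount_factor * np.amax(q_values_next, axis=1)

assert np.allclose(targets_batch, [1.98, 3.97, 0.0])
```

Treating loss-of-life as done therefore pins those targets to the reward alone, which is exactly the effect being discussed.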

Although I do still have this nagging feeling that the environment could just give a large negative reward for losing a life. Is that incorrect?

It could. I agree this should also improve things but the learning problem would still be confusing to the estimator. The Q value (no matter if negative or positive) depends on how many lives you have left, but that information is nowhere in the feature representation.

@dennybritz

Owner

commented Nov 17, 2016

I added a wrapper: 18528d0#diff-14c9132835285c0345296e54d07e7eb2

I'll try to run this for A3C, would be great if you could try it for Q-Learning. Crossing my fingers that this fixes it ;)

@nerdoid

Author

commented Nov 17, 2016

Awesome! Yes, I will give this a go for the Q-Learning in just a bit here.

@fferreres


commented Nov 17, 2016

@dennybritz, yeah, I have things mixed up. Too many things to try to understand too quickly (new to Python, NumPy, TensorFlow, vim, machine learning, RL, and refreshing statistics/probability 101) :-)

@nerdoid

Author

commented Nov 17, 2016

Not sure if either of you have encountered this yet, but when a gym monitor is active, attempting to reset the environment when the episode is technically still running results in an error with message: "you cannot call reset() unless the episode is over." Here is the offending line.

A few ideas come to mind:

  1. Wrap the env.reset() call in a try/except. If we end up in the except, it just means we lost a life rather than finishing, so we skip resetting the state stack and continue like nothing happened. It's important then to move the if done: break after the state = next_state at the end of the episode loop. We still benefit from the correct done flag being added to the replay memory in the previous step.
  2. Start and stop the monitor every record_video_every episodes. We do the bookkeeping manually and start the monitor only when we want to record a new video. There is probably overhead in starting and stopping the monitor repeatedly, though; I don't know.

The first option seems less hacky to me, but I might be overlooking something.
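A sketch of what option 1 could look like (the helper name and the fake environment are hypothetical, just to show the control flow; real gym's monitor raises when reset() is called mid-episode):

```python
class FakeMonitoredEnv:
    """Stand-in for a gym env wrapped in a monitor."""
    def __init__(self, episode_over):
        self.episode_over = episode_over
    def reset(self):
        if not self.episode_over:
            raise RuntimeError("you cannot call reset() unless the episode is over")
        return "initial_observation"

def safe_reset(env):
    """Reset if possible; on failure assume only a life was lost and keep going."""
    try:
        return env.reset(), True      # real episode boundary
    except RuntimeError:
        return None, False            # monitor refused: life lost, don't reset stack

obs, did_reset = safe_reset(FakeMonitoredEnv(episode_over=True))
assert did_reset and obs == "initial_observation"
obs, did_reset = safe_reset(FakeMonitoredEnv(episode_over=False))
assert not did_reset and obs is None
```

The caller would only reinitialize the state stack when did_reset is true.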

@fferreres


commented Nov 18, 2016

Here are a few threads discussing low scores vs paper. It can be a source of troubleshooting ideas:

tambetm/simple_dqn#32

https://groups.google.com/forum/m/#!topic/deep-q-learning/JV384mQcylo

https://docs.google.com/spreadsheets/d/1nJIvkLOFk_zEfejS-Z-pHn5OGg-wDYJ8GI5Xn2wsKIk/htmlview

I re-read all of dqn.py (not the notebook) and, assuming all the np slices, advanced indexing, and functions operating over an axis are OK, I still can't find what could be causing the score to flatten at ~35.

I think troubleshooting may be time-consuming, but it is also a good learning experience.

In some of the posts people mention taking a max over consecutive frames; others say the stacking is of consecutive frames. I've also seen a Karpathy post explaining policy gradients where he used the difference of frames.

This was reported regarding loss of life:

I regarded the loss of life as the terminal state but didn't reset the game, which is different from DeepMind.
Without the modification (original simple_dqn), it was 155 test score in 22 epoch in kung_fu_master.
With this modification, I got 9,500 test score in 22 epoch and 19,893 test score in 29 epoch.

@dennybritz

Owner

commented Nov 18, 2016

Thanks for researching this. I made a bunch of changes to the DQN code on a separate branch. It now resets the env when losing a life, writes gradient summaries and the graph to TensorBoard, and has gradient clipping: https://github.com/dennybritz/reinforcement-learning/compare/dqn?expand=1#diff-67242013ab205792d1514be55451053c

I'm re-running it now. If this doesn't "fix" it, then in my opinion the most likely source of error is the frame processing. Looking at the threads:

  • They use a fixed frame skip in the Atari environment. OpenAI Gym defaults to a random frame skip in (3, 5). It seems like you can use a BreakoutDeterministic-v0 environment to get a fixed frame skip.
  • It seems like they max-pool the pixel values over consecutive frames, but the gym implementation seems to just skip frames and take the last one.
  • They extract luminance from the RGB data; I do grayscale (I have no idea if this is actually different)

I can't imagine that each of these makes a huge difference, but who knows. Here's the full paragraph:

Working directly with raw Atari 2600 frames, which are 210 x 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites Atari 2600 can display at once. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84 x 84. The function φ from algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).
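That paragraph translates roughly to the following NumPy sketch (the 84x84 rescale is omitted since it needs an image library, and the Rec. 601 luminance weights are my assumption about what "Y channel" means here):

```python
import numpy as np

def encode_frame(frame, prev_frame):
    """Max over two consecutive RGB frames (de-flicker), then extract luminance."""
    flicker_free = np.maximum(frame, prev_frame)
    r, g, b = flicker_free[..., 0], flicker_free[..., 1], flicker_free[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b   # Rec. 601 Y channel (assumed)

prev = np.zeros((210, 160, 3))
curr = np.full((210, 160, 3), 255.0)
y = encode_frame(curr, prev)
assert y.shape == (210, 160)
assert np.allclose(y, 255.0)   # all-white RGB maps to full luminance
```

The difference from plain grayscale conversion would come down to the exact channel weights used, which may or may not matter in practice.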
@fferreres


commented Nov 18, 2016

Something entirely plausible is that we haven't tried "hard enough". Look, for example, at Breakout on this page:

https://github.com/Jabberwockyll/deep_rl_ale/wiki

It remains close to 0 even at 4 million steps, then starts climbing, but needs millions and millions more to make a difference. Also, in many of the learning charts there are long periods of diminishing scores, but they recover later. It sometimes takes a million steps to do that.

Our tries are about 2-3 million steps. I should check the reward curves from the paper, but maybe there's nothing wrong, if the plateau is actually just the policy taking a "nap". Unfortunately, I am stuck with a MacBook Air (slow). Maybe with $10 I could try A3C on a 32-CPU machine... I'm not very familiar with AWS, but a spot instance for a day would do it.

@dennybritz

Owner

commented Nov 18, 2016

I'd stick with trying DQN instead of A3C for now - there are fewer potential sources of error.

If the implementation you linked to works correctly (it seems like it does) it should be easy to figure out what the difference is:

  • Reward/State Preprocessing
  • Network Architecture
  • Optimizer (They also used a different implementation of RMSProp)

Looking at it again, there's another difference from my code: they take 4 steps in the environment before making an update, while I update after each step. That's not shown in the algorithm section of the paper, but it's mentioned in the appendix. So that's another thing to add.

Good point about the training time. If it takes 10M steps, that's 2.5M updates (assuming you update only every 4 steps), and that'd be like 2-3 days at the current training speed I get on my GPU.
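The schedule is trivial but worth pinning down; a sketch (names illustrative, not from the repo):

```python
# Update the network only every 4th environment step, as the paper's
# appendix describes.
UPDATE_EVERY = 4

def should_update(step):
    """True on steps where a gradient update is performed."""
    return step > 0 and step % UPDATE_EVERY == 0

# 10M environment steps -> 2.5M updates, matching the estimate above:
updates = sum(should_update(t) for t in range(1, 10_000_001))
assert updates == 2_500_000
```

In the training loop this means the replay-memory append happens every step, while the sample-and-train call sits behind the should_update check.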

@IbrahimSobh

This comment has been minimized.

Copy link

commented Mar 9, 2017

Dears

Any updates regarding this issue?

@fferreres


commented Mar 11, 2017

I wish someone else with time could help diagnose what's wrong. I know Denny is probably very busy with language translation stuff.

The value of this repository is that Denny's code is very easy to follow (documented, commented), even if it takes time at points. I stopped learning about RL when I couldn't figure out why the DQN issue couldn't be diagnosed by anyone reading this repo, and got stuck in general. I am slowly reading up on new algorithms like Neural Episodic Control (https://arxiv.org/pdf/1703.01988.pdf). I take the blame, in that this RL literature is fascinating, but the details and fine-tuning are not gentle on amateurs.

Any kind soul that knows what may be causing things in the DQN code, please help us get unstuck.

@dennybritz

Owner

commented Mar 11, 2017

I should have time again to look into it starting late next week, I've been busy with another project. Thanks for all the suggestions so far.

@IbrahimSobh


commented Mar 11, 2017

Thank you Denny,

I think the same cause is affecting both DQN and A3C ...

@ppwwyyxx


commented Mar 20, 2017

@cgel I think the optimizer doesn't actually matter very much. In my own DQN implementation I just used a no-brainer AdamOptimizer as usual and it's working fine.
I had the same experience that the terminal flag makes little difference.

On the issue itself, it looks like the score you're looking at is the training score (i.e. with epsilon-greedy exploration). Using greedy evaluation improves it a lot (and that's what everyone reports). When my agent gets a 400 test score, the training score (with eps = 0.1) is only around 70.
And by the way, in evaluation, loss of life is not the end of an episode (unlike in training).
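To make the training-versus-evaluation distinction concrete, a minimal epsilon-greedy sketch (illustrative, not the repo's implementation):

```python
import random

def epsilon_greedy(q_values, eps):
    """With probability eps pick a random action, else the argmax action."""
    if random.random() < eps:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

q = [0.1, 0.9, 0.3]
assert epsilon_greedy(q, eps=0.0) == 1   # eps = 0: pure greedy evaluation
```

Training would call this with eps around 0.1 (annealed), evaluation with eps at 0.05 or 0, which is where the 70-versus-400 gap comes from.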

@ERed


commented Mar 27, 2017

I'm currently training DQN with some small changes mentioned by @ppwwyyxx. Will let you know the results asap.
Btw, thanks @dennybritz for this nice repo.

@ERed

commented Mar 29, 2017

No improvement. But I think it's just a misunderstanding by me.

Don't know if @dennybritz has already tried it, but I changed the environment so that every time a life was lost the episode was considered finished and the environment restarted.

This is where I went wrong, because that way you are not letting the agent see advanced states of the game. So the key is to set the reward to -1 when a life is lost and not reset the environment, just keep playing so the agent gets to play 5 episodes per game. The point is that every state-action pair that leads to a loss of life should get a Q value of -1. So we should still set done==True on a loss of life so np.invert() can do its job, but only reset the environment when env.env.ale.lives() == 0.

Next week I'll have some time to try this out and will let you know. Maybe it's not the root cause of the problem, but I thought it might be worth sharing in case it helps someone or gives some ideas.

Let's fix this once and for all between all of us ;)
If someone gets to try this please let us know :)
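A minimal sketch of the terminal-on-life-lost idea discussed here (FakeBreakout is just a toy stand-in for illustration, not the real gym env; a real version would wrap gym's step/reset and ale.lives()):

```python
class EpisodicLifeWrapper:
    """Signal done=True to the agent when a life is lost, but only truly
    reset the underlying environment when all lives are gone."""

    def __init__(self, env):
        self.env = env
        self.lives = 0

    def reset(self):
        # Only reset the real env if the previous game actually ended.
        if self.lives == 0:
            obs = self.env.reset()
        else:
            obs, _, _, _ = self.env.step(0)  # no-op; keep playing
        self.lives = self.env.lives()
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        lives = self.env.lives()
        if 0 < lives < self.lives:
            done = True  # life lost: terminal for the replay memory
        self.lives = lives
        return obs, reward, done, info


class FakeBreakout:
    """Toy stand-in: 2 lives, loses one life every 3 steps."""
    def __init__(self):
        self._lives, self._t = 2, 0
    def reset(self):
        self._lives, self._t = 2, 0
        return "obs"
    def step(self, action):
        self._t += 1
        if self._t % 3 == 0:
            self._lives -= 1
        return "obs", 0.0, self._lives == 0, {}
    def lives(self):
        return self._lives


env = EpisodicLifeWrapper(FakeBreakout())
env.reset()
dones = [env.step(0)[2] for _ in range(6)]
print(dones)  # life lost at steps 3 and 6 -> [False, False, True, False, False, True]
```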

@cgel

commented Mar 31, 2017

@ppwwyyxx I have not run an extensive optimizer search, but in my experience DQN is very sensitive to it. Regarding what epsilon you should use, the original DQN used 0.1 for training and 0.05 for testing. The numbers you give seem a bit strange to me: when I see a testing score > 300, the training score is > 150.

@ERed you are right in saying that you should not reset the environment, but you seem to be overcomplicating the solution. There is no reason for the Q value of a lost life to be -1; it should be 0. The easiest solution is to pass a terminal flag to the agent and keep the game running. But again, I have already tried it and it barely made any difference.

@hengyuan-hu

commented Apr 18, 2017

Marking loss of life as termination of the game can bring a big improvement in some games (at least in SpaceInvaders, as I tested). I think the intuition is that, otherwise, the Q-value for the same state under different numbers of remaining lives should be different, but the agent cannot differentiate them given only the raw pixel input. In SpaceInvaders, this increased the score from ~600 to ~2000 with other variables controlled.

@fuxianh

commented Dec 17, 2017

@dennybritz Hi, I think I met the same problem as you. I tried the OpenAI Gym baselines on BreakoutNoFrameskip-v4, and the reward converges to around 15 [in the DQN Nature paper it's around 400]. Did you figure out the problem? If possible, please give me some suggestions about this.

@dylanthomas

commented Dec 18, 2017

@jpangburn

commented Jan 10, 2018

I've found more success with ACKTR than with OpenAI's deepq implementation. Here's a video of it getting 424: breakout. This wasn't an especially good score; it was just a random run I recorded to show a friend. I've seen it win the round and score over 500.

@hengyuan-hu

commented Jan 10, 2018

@cr7anand

commented Feb 15, 2018

@dennybritz firstly, thank you for creating this easy-to-follow DeepRL repo. I've been using it as a reference to code up my own DQN implementation for Breakout in PyTorch. Is the average score the agent achieves still saturating around 30? In your DQN implementation the experience replay buffer has size=500,000, right? I feel that the size of the replay buffer is critical to replicating DeepMind's performance on these games, and would like to hear your thoughts on this. Have you tried increasing the replay size to 1,000,000 as suggested in the paper?

@zmonoid

commented Jun 4, 2018

Wonder if anyone is still following this post.

DQN is very sensitive to your optimizer and your random seed. Shown below are 4 runs with the same settings.
image
image

It is very hard to find a good optimizer like the one below:
image
image
This one uses Adagrad: lr=0.01, epsilon=0.01 (you need to change adagrad.py in PyTorch for this optimizer). Strangely, RMSprop with lr=0.01, epsilon=0.01, decay=0, momentum=0, which according to the source code should be exactly the same, simply does not work. Same for Adam. Good luck with your RL research.
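For what it's worth, the claimed equivalence can be checked directly. A numpy sketch of the textbook update rules (following PyTorch's conventions as I understand them, so treat the formulas as an assumption) shows that Adagrad and RMSprop with decay=0 coincide only on the first step, because Adagrad accumulates squared gradients while RMSprop with alpha=0 forgets all but the last one:

```python
import numpy as np

def adagrad_step(param, grad, state_sum, lr=0.01, eps=0.01):
    # Adagrad: accumulate the sum of all squared gradients.
    state_sum = state_sum + grad**2
    return param - lr * grad / (np.sqrt(state_sum) + eps), state_sum

def rmsprop_step(param, grad, square_avg, lr=0.01, eps=0.01, alpha=0.0):
    # RMSprop: exponential moving average; alpha=0 keeps only the last grad.
    square_avg = alpha * square_avg + (1 - alpha) * grad**2
    return param - lr * grad / (np.sqrt(square_avg) + eps), square_avg

p_a = p_r = np.array([1.0])
s_a = s_r = np.array([0.0])
grads = [np.array([0.5]), np.array([0.3])]

p_a, s_a = adagrad_step(p_a, grads[0], s_a)
p_r, s_r = rmsprop_step(p_r, grads[0], s_r)
equal_after_step1 = np.allclose(p_a, p_r)

p_a, s_a = adagrad_step(p_a, grads[1], s_a)
p_r, s_r = rmsprop_step(p_r, grads[1], s_r)
equal_after_step2 = np.allclose(p_a, p_r)

print(equal_after_step1, equal_after_step2)  # True False
```

So the two optimizers really are different dynamics after the very first update, which may explain part of the sensitivity.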

@ppwwyyxx

commented Jun 4, 2018

DQN is very sensitive to your optimizer, and your random seed

... in certain implementation and certain tasks.

Breakout is in fact very stable given a good implementation of either DQN or A3C. And DQN is also quite stable compared to policy-based algorithms like A3C or PPO.

@zmonoid

commented Jun 4, 2018

@ppwwyyxx Notice in your implementation:

gradproc.GlobalNormClip(10)

This may be very important...
Actually, PPO and TRPO can be regarded as forms of gradient clipping, which is what makes them stable.
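For readers unfamiliar with it, global norm clipping (what gradproc.GlobalNormClip(10) above does; tf.clip_by_global_norm is the TensorFlow equivalent) can be sketched in plain numpy:

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients down by the same factor if their joint L2 norm
    exceeds clip_norm; otherwise leave them untouched."""
    global_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if global_norm > clip_norm:
        scale = clip_norm / global_norm
        grads = [g * scale for g in grads]
    return grads, global_norm

# Small gradients pass through unchanged...
small_clipped, small_norm = clip_by_global_norm([np.array([3.0, 4.0])], 10.0)
print(small_norm, small_clipped[0])  # 5.0 [3. 4.]

# ...large ones are rescaled so their joint norm equals clip_norm.
large_clipped, large_norm = clip_by_global_norm([np.array([30.0, 40.0])], 10.0)
print(np.sqrt(np.sum(large_clipped[0]**2)))  # 10.0
```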

@ppwwyyxx

commented Jun 4, 2018

IIRC that line of code has no visible effect on training breakout as the gradients are far from large enough to reach that threshold. I'll remove it.

@zmonoid

commented Jun 5, 2018

@ppwwyyxx I tried the Adam optimizer with epsilon=1e-2, the same as your setting, and it works.

@boranzhao

commented Jun 7, 2018

I tested with the gym environment Breakout-v4 and found that env.ale.lives() does not change instantly after losing a life (i.e., when the ball disappears from the screen). Instead, the value changes roughly 8 steps after the life is lost. Could somebody verify this? If it is true, then the episode will end (according to whether a life was lost) at a fake terminal state that is 8 steps after the true terminal state. This is probably one reason why DQN using the gym Atari environment cannot reach high enough scores.

zmonoid added a commit to zmonoid/reinforcement-learning that referenced this issue Jun 7, 2018

Update dqn.py
Larger epsilon is needed here to solve the problem dennybritz#30
@zmonoid zmonoid referenced this issue Jun 7, 2018
@SamKirkiles

commented Jun 24, 2018

I've done a brief search over RMSProp momentum hyperparameters and I believe it could be the source of the issue.

I ran three models for 2000 episodes each. The number of frames varies based on the length of the episodes in each test. For scale, the red model ran for around 997,000 frames while the other two ran for around 700k.

The momentum values were as follows:
dark blue - 0.0
red - 0.5
light blue - 0.9

Because red performed best at 0.5, I will run a full training cycle and report back with my findings.

momentum

It seems strange to use a momentum value of 0.0 while the paper uses 0.95. The logic here could be that momentum doesn't work well on a moving target, but I haven't seen hard evidence that this is the case.

@MagiKoco

commented Jul 24, 2018

@fuxianh I met the same problem as you when running the DQN code from the OpenAI baselines on BreakoutNoFrameskip-v4. Have you solved this problem, and could you share your experience? Thank you very much!

@fg91

commented Jul 25, 2018

hey guys,

I've been working on implementing my own version of DQN recently and I encountered the same problem: the average reward (over the last 100 rounds in my plot) peaked at around 35 for Breakout.

I used Adam with a learning rate of 0.00005 (see blue curve in the plot below). Reducing the learning rate to 0.00001 increased the reward to ~50 (see violet curve).

I then got the reward to > 140 by passing the terminal flag to the replay memory when a life is lost (without resetting the game), so it does make a huge difference also for Breakout.


breakout

Today I went over the entire code again, also outputting some intermediate results. A few hours later, at dinner, it struck me that I had seen 6 Q values that morning, while I had the feeling there should be 4.

A quick test confirms that for BreakoutDeterministic-v3 the number of actions env.action_space.n is 6:
NOOP, FIRE, RIGHT, LEFT, RIGHTFIRE, LEFTFIRE
Test this with env.unwrapped.get_action_meanings()

DeepMind used a minimal set of 4 actions:
PLAYER_A_NOOP, PLAYER_A_FIRE, PLAYER_A_RIGHT, PLAYER_A_LEFT

I assume that this subtle difference could have a huge effect, as it alters the task's difficulty.

I will immediately try to run my code with the minimal set of 4 actions and post the result as soon as I know :)

Fabio

@fg91

commented Aug 16, 2018

I recently implemented DQN myself and initially encountered the same problem: The reward reached a plateau at around 35 in the Breakout environment.

I certainly found DQN to be quite fiddly to get working. There are lots of small details to get right or it just doesn't work well but after many experiments, the agent now reaches an average training reward per episode (averaged over the last 100 episodes) of ~300 and an average evaluation reward per episode (averaged over 10000 frames as suggested by Mnih et al. 2013) of slightly less than 400.

I commented my code (in a single jupyter nb) as best as I could so that it hopefully is as easy as possible to follow along if you are interested.

dqn_breakout

As opposed to previous posts in this thread, I also did not find DQN to be very sensitive to the random seed:

tensorboad_sceenshot

My hints:

  1. Use the right initializer! The DQN uses the ReLU activation function, and the matching initializer is He initialization.
    https://www.youtube.com/watch?v=s2coXdufOzE&t=157s
    In TensorFlow, use tf.variance_scaling_initializer with scale = 2

  2. I found that Adam works fairly well so before implementing the RMSProp optimizer DeepMind used (not the same as in TensorFlow) I would try to get it to work with Adam. I used a learning rate of 1e-5. In my experiments larger learning rates had the effect that the reward plateaued at around 100 to 150.

  3. Make sure you are updating the networks at the right frequency: the paper says that the target network update frequency is "measured in the number of parameter updates", whereas in the code it is measured in the number of action choices/frames the agent sees.

  4. Make sure that your agent is actually trying to learn the same task as in the DeepMind paper!!!
    4.1) I found that passing the terminal flag to the replay memory, when a life is lost, has a huge difference also in Breakout. This makes sense since there is no negative reward for losing a life and the agent does "not notice that losing a life is bad".
    4.2) DeepMind used a minimal set of 4 actions in Breakout, several versions of open ai gym's Breakout have 6 actions. Additional actions can alter the difficulty of the task the agent is supposed to learn drastically! The Breakout-v4 and BreakoutDeterministic-v4 environments have 4 actions (Check with env.unwrapped.get_action_meanings()).

  5. Use the Huber loss function, the Mnih et al 2013 paper called this error clipping.

  6. Normalize the input to the interval [0,1].

I hope you can get it to work in your own implementation, good luck!!
Fabio

breakout
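To make hints 5 and 6 concrete, here is a minimal numpy sketch of the Huber loss (error clipping) and the input normalization (values are illustrative only):

```python
import numpy as np

def huber_loss(td_error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it (error clipping)."""
    abs_err = np.abs(td_error)
    quadratic = np.minimum(abs_err, delta)
    linear = abs_err - quadratic
    return 0.5 * quadratic**2 + delta * linear

# Inside [-1, 1] it matches 0.5 * error^2 ...
loss_in = huber_loss(np.array([0.5]))
print(loss_in)   # [0.125]

# ... outside, it grows linearly instead of quadratically.
loss_out = huber_loss(np.array([3.0]))
print(loss_out)  # [2.5]  (= |3| - 0.5)

def preprocess(frame):
    """Normalize uint8 pixels to [0, 1] (hint 6)."""
    return frame.astype(np.float32) / 255.0

norm_frame = preprocess(np.array([[0, 128, 255]], dtype=np.uint8))
print(norm_frame)
```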

@sloanqin

commented Sep 6, 2018

Has anyone solved this problem for denny's code?

@stillarrow

commented Sep 20, 2018

Is there anything we can do to solve the problem? Should we use the deterministic environment, like 'BreakoutDeterministic-v4', and should we change the optimizer from RMSProp to Adam, since it seems that DeepMind switched to Adam in their Rainbow paper?

@fg91

commented Sep 21, 2018

If you want to try to improve Denny's code, I suggest you start with the following:

  1. Use Adam and a learning rate of 0.00001. DeepMind used 0.0000625 in the Rainbow paper for all the environments. I don't think that RMSProp is the problem in Denny's code in general but maybe the learning rate is not right and I can confirm that Adam with 0.00001 works well in Breakout.

  2. Make sure that the terminal flag is passed when a life is lost!

  3. Use the tf.variance_scaling_initializer with scale = 2, not Xavier (I assume the implementation uses RELU as activation function in the hidden layers like in the DeepMind papers?!).

  4. Does the implementation put 1 million transitions into the replay memory or less?

  5. I don't think the "Deterministic" matters that much atm...

  6. Make sure you are updating the networks in the right frequency (see my previous post).

Keep us updated, if you try these suggestions :)
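Point 6 (updating the networks at the right frequency) can be made concrete with a small counter sketch. The numbers below are the usual DQN defaults (train every 4 frames, sync the target net every 10,000 parameter updates) and should be treated as assumptions, not as what Denny's code currently does:

```python
# Train every 4 frames; copy the online net to the target net every
# 10,000 *parameter updates* (as in the paper), not every 10,000 frames.
TRAIN_EVERY_N_FRAMES = 4
TARGET_UPDATE_EVERY_N_PARAM_UPDATES = 10_000

param_updates = 0
target_syncs = 0
for frame in range(1, 120_001):
    if frame % TRAIN_EVERY_N_FRAMES == 0:
        param_updates += 1
        if param_updates % TARGET_UPDATE_EVERY_N_PARAM_UPDATES == 0:
            target_syncs += 1

# Counting in parameter updates, a sync happens every 40,000 frames here;
# counting in frames would make it happen 4x as often.
print(param_updates, target_syncs)  # 30000 3
```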
