
A bug in the implementation #16

Closed
karpathy opened this Issue Dec 18, 2016 · 6 comments


karpathy commented Dec 18, 2016

Hello, I spotted what I believe might be a bug in the DQN implementation on line 291 here:

https://github.com/devsisters/DQN-tensorflow/blob/master/dqn/agent.py#L291

The code clips self.delta with tf.clip_by_value, I assume with the intention of being robust when the discrepancy in Q is above a threshold:

self.delta = self.target_q_t - q_acted
self.clipped_delta = tf.clip_by_value(self.delta, self.min_delta, self.max_delta, name='clipped_delta')
self.global_step = tf.Variable(0, trainable=False)
self.loss = tf.reduce_mean(tf.square(self.clipped_delta), name='loss')

However, clip_by_value has a local gradient of zero everywhere outside the [min_delta, max_delta] range. Therefore, with the current code, whenever the discrepancy exceeds min/max delta the gradient becomes exactly zero in backprop. This might not be what you intend, and it is certainly not standard, I believe.

I think you probably want to clip the gradient here, not the raw Q. In that case you would have to use the Huber loss:

def clipped_error(x):
    # Huber loss with delta = 1: quadratic for |x| < 1, linear beyond
    return tf.select(tf.abs(x) < 1.0, 0.5 * tf.square(x), tf.abs(x) - 0.5)  # tf.select(condition, true, false)

and apply this to self.delta instead of tf.square. This would have the desired effect of increased robustness to outliers.
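For concreteness, here is a minimal sketch of how the loss lines quoted above might look with the Huber loss (same variable names as the snippet, and the same older TensorFlow API in which tf.select is what later became tf.where):

def clipped_error(x):
    # Huber loss with delta = 1: quadratic for |x| < 1, linear beyond,
    # so the backprop gradient stays bounded instead of being zeroed out
    return tf.select(tf.abs(x) < 1.0, 0.5 * tf.square(x), tf.abs(x) - 0.5)

self.delta = self.target_q_t - q_acted
self.global_step = tf.Variable(0, trainable=False)
self.loss = tf.reduce_mean(clipped_error(self.delta), name='loss')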


carpedm20 commented Dec 19, 2016

Thanks for noticing this issue. I've kept repeating this mistake ever since I misread "Human-level control through deep reinforcement learning."

carpedm20 closed this Dec 19, 2016

matthiasplappert added a commit to keras-rl/keras-rl that referenced this issue Dec 19, 2016

Fix bad clipping of error in DQN and DDPG.
The problem here was that we clipped the error in such a way that the
gradient would be zero everywhere outside of the clip region. This is
obviously not desired, as pointed out by @karpathy, and was thus fixed by
using the Huber loss.

For details, see
https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
and devsisters/DQN-tensorflow#16

carpedm20 referenced this issue in carpedm20/deep-rl-tensorflow Dec 20, 2016


vuoristo commented Dec 20, 2016

Here's a code snippet from the DeepMind implementation of DQN (NeuralQLearner.lua), available at https://sites.google.com/a/deepmind.com/dqn/:

if self.clip_delta then
    delta[delta:ge(self.clip_delta)] = self.clip_delta
    delta[delta:le(-self.clip_delta)] = -self.clip_delta
end

Doesn't this mean that the original implementation has the same issue? Maybe this should be mentioned as a difference from the original implementation?


solmonk commented Dec 20, 2016

@vuoristo The original implementation directly clips the error term that is used as the gradient, as described in the paper. delta there has the same name and value, but serves a different role: it is essentially passed as the second argument to backward(), which is the gradient with respect to the module's output. I am pretty sure that both implementations clip the same thing now.
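To make the equivalence concrete, here is a small numpy sketch (my own illustration, not code from either repository) comparing the gradient of the Huber loss with the gradient of a squared, value-clipped delta like the one at the top of this issue (with a 0.5 factor to keep the algebra clean):

import numpy as np

def huber_grad(delta, clip=1.0):
    # gradient of the Huber loss w.r.t. delta: the error itself inside the clip
    # region, +/-clip outside -- i.e. exactly the clamped delta that the Lua code
    # feeds into backward() as the gradient
    return np.clip(delta, -clip, clip)

def value_clip_grad(delta, clip=1.0):
    # gradient of 0.5 * clip(delta)^2 w.r.t. delta: zero outside the clip region,
    # which is the problem @karpathy points out above
    return np.where(np.abs(delta) <= clip, delta, 0.0)

deltas = np.array([-3.0, -0.5, 0.5, 3.0])
print(huber_grad(deltas))       # [-1.  -0.5  0.5  1. ]
print(value_clip_grad(deltas))  # [ 0.  -0.5  0.5  0. ]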


jaara commented Jun 5, 2017

Hi, I believe that the linear part of the Huber function should start at 2 (i.e. delta = 2). That results in:

def clipped_error(x):
    # Huber loss with delta = 2: quadratic for |x| < 2, linear beyond
    return tf.where(tf.abs(x) < 2.0, 0.5 * tf.square(x), 2 * (tf.abs(x) - 1))

Explanation
I believe that the Huber loss was primarily used to clip the gradient resulting from Q-value overestimation, e.g. when your target value really differs from the current Q value in:

Q(s,a) -> r + max Q_(s_, a)

You don't want to do large updates, because this is an estimate anyway. This could also be the only source of large updates, because the reward is already clipped to [-1, 1].

But to qualify as correct Q-learning, you want to compute an expectation over possible future states s_. This is what you would get with MSE, or with a Huber loss that behaves as MSE in the [-2, 2] region; 2 is the distance between the possible (clipped) rewards.

Intuition:
If your current estimate for a state s and an action a terminating the episode is Q(s, a) = 1, and you now get a reward of -1 (in a stochastic environment), the error is |-1 - 1| = 2, so the loss is L = huber(2). And that's exactly where you still want it to behave as MSE.
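Plugging those numbers in (a quick check in plain Python, not repository code):

def huber(x, delta):
    # quadratic for |x| <= delta, linear beyond, with the pieces matched at |x| = delta
    return 0.5 * x**2 if abs(x) <= delta else delta * (abs(x) - 0.5 * delta)

err = abs(-1 - 1)             # reward -1 observed while the estimate is Q(s, a) = 1
print(huber(err, delta=1.0))  # 1.5 -> already in the linear, gradient-clipped regime
print(huber(err, delta=2.0))  # 2.0 -> still the quadratic, MSE-like regime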

Note: Some additional learning rate tuning might be necessary.

Discussion welcome. Opinion from @karpathy welcome.


ppwwyyxx commented Jun 6, 2017

The reward is clipped, but the Q value is not. The distance between Q values can be larger than 2.


jaara commented Jun 6, 2017

@ppwwyyxx I assume you mean that the error in Q can be larger than 2. That's true, and that's where the Huber loss helps: it avoids large updates. Once it converges, however, the error is 0.

What I was saying is that because the maximal distance between possible rewards is 2, the delta argument of the Huber loss function should also be 2. With this loss function, Q-learning should converge to the same solution as with MSE, giving proper Q-value estimates (means):

Q(s, a) = E[ r + max Q(s_, a) ]

That's how Q-learning is defined.

With delta=1, it converges to a different solution, where Q values are different.

Q(s, a) ≠ E[ r + max Q(s_, a) ] 

Also see the corresponding reddit discussion.
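Here's a small numerical sketch of that claim (my own illustration, not repository code). Suppose a terminal transition gives reward +1 with probability 0.8 and -1 with probability 0.2, so the correct Q value (the mean) is 0.6. Minimizing the expected loss over a scalar estimate q:

import numpy as np

def huber(x, delta):
    # standard Huber loss: quadratic for |x| <= delta, linear beyond
    return np.where(np.abs(x) <= delta, 0.5 * x**2, delta * (np.abs(x) - 0.5 * delta))

rewards = np.array([1.0, -1.0])
probs = np.array([0.8, 0.2])
qs = np.linspace(-1.0, 1.0, 20001)

def minimizer(delta):
    # expected loss of estimate q under the reward distribution, minimized by grid search
    losses = (probs[:, None] * huber(rewards[:, None] - qs[None, :], delta)).sum(axis=0)
    return qs[np.argmin(losses)]

print(minimizer(delta=2.0))  # ~0.60: the mean, i.e. the usual Q-learning fixed point
print(minimizer(delta=1.0))  # ~0.75: a different solution, biased away from the mean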
