Setting networks to be equal #5
Hi Fausto,

It is a tricky thing, and if you have empirical results either way I would be interested in seeing what they suggest. The thinking behind using tau for incremental updates is that the single big change to the target network in the original DQN architecture can actually be disruptive to training, since the target Q values are suddenly drawn from a potentially very different distribution than they were a moment ago. The idea behind using tau is to eliminate this with a slow, smooth change over time.

As you say though, the interpolation may result in target values the original Q-network would never have produced. Overall I think this is alright, since it matters more that the target values are relatively stable and uncorrelated with the primary network than that they are exactly correct. (The Q values from the primary network aren't actually correct themselves, just closer approximations.) If they are somewhat off that is okay, since they will continue to be pushed in the right direction.

I hope that long explanation provided some context. As I said earlier, it may be that in certain cases one strategy ends up working better than the other depending on the specific task. I will however change the wording in the notebook to make it clear that the tau updating strategy is being employed.
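For context, the soft (tau) update being discussed amounts to target <- tau * main + (1 - tau) * target, applied to every weight. Below is a minimal sketch of that rule using plain NumPy arrays rather than the notebook's TensorFlow ops; the names soft_update, main_weights, and target_weights are illustrative, not taken from the notebook.

```python
import numpy as np

def soft_update(main_weights, target_weights, tau):
    """Move each target weight a fraction tau of the way toward the main network's weight.
    With tau = 1.0 this reduces to a hard copy, i.e. the two networks become equal."""
    return [tau * w_main + (1.0 - tau) * w_target
            for w_main, w_target in zip(main_weights, target_weights)]

# Illustrative usage: two tiny "networks" represented as lists of weight arrays.
main_weights = [np.random.randn(4, 3), np.random.randn(3)]
target_weights = [np.random.randn(4, 3), np.random.randn(3)]

target_weights = soft_update(main_weights, target_weights, tau=0.001)  # slow tracking
target_weights = soft_update(main_weights, target_weights, tau=1.0)    # hard copy
```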
Hello Arthur,

In relation to the first question in my message: do you confirm that the update you perform right at the beginning of training (right after sess.run(init)), which is meant to make the two networks (main and target) equal, is actually applied with tau != 1.0?
I apologize for not realizing you were referring specifically to the initial setting of the networks. You are right that they should be initialized to the same values at the beginning of training. Though, since those values are initialized randomly in both cases (and as such produce random Q values), it may not make much of an empirical difference.
Good point!
Fausto Milletarì
I have tried your simple grid world: size 5x5, 10K iterations of pre-training, 10K iterations for annealing the chance of a random action, experience replay buffer of 250K, learning rate 0.0001, tau 0.001. It seems that making the networks equal at the beginning of training actually harms performance. In orange: the performance when, at the beginning, we perform just a standard update with tau = 0.001. Hopefully I have no bugs that influence this evaluation. (I can think of reasons why the loss would go up a little after going down; now I'm checking that it still goes down as the algorithm runs further.)
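For reference, the settings described above could be expressed roughly as follows; the variable names are only my assumptions about how the notebook might spell them, not quoted from it.

```python
# Hypothetical names for the hyperparameters described above.
pre_train_steps = 10_000    # steps of random actions before training starts
annealing_steps = 10_000    # steps over which the random-action probability is annealed
buffer_size     = 250_000   # experience replay buffer capacity
learning_rate   = 0.0001
tau             = 0.001     # soft-update rate for the target network
```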
Thanks for running these experiments! It looks like having the networks start correlated is indeed detrimental. I am going to remove the initial update line.
The statement:

updateTarget(targetOps,sess) #Set the target network to be equal to the primary network.

uses the op list targetOps, which is built with tau=0.001. Therefore the networks are not the same after the op is executed.

Can you confirm this issue? Does it have an impact on the results?
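To illustrate the concern: the target-update ops are built once with a fixed tau, so running them performs an interpolation rather than a copy. Here is a rough sketch of that construction in TensorFlow 1.x style; update_target_graph and update_target below are my paraphrase of the notebook's helpers, not a verbatim quote.

```python
import tensorflow as tf  # TF 1.x-style Variable API assumed

def update_target_graph(tf_vars, tau):
    """Build one assign op per target variable:
    target <- tau * main + (1 - tau) * target.
    The first half of tf_vars is assumed to belong to the main network,
    the second half to the target network."""
    total_vars = len(tf_vars)
    op_holder = []
    for idx, var in enumerate(tf_vars[0:total_vars // 2]):
        target_var = tf_vars[idx + total_vars // 2]
        op_holder.append(
            target_var.assign(tau * var.value() + (1.0 - tau) * target_var.value()))
    return op_holder

def update_target(op_holder, sess):
    for op in op_holder:
        sess.run(op)

# With tau = 0.001 the target network only moves 0.1% of the way toward the
# main network per call, so a single call right after initialization does NOT
# make the two networks equal; only tau = 1.0 would.
```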
Another thing: this idea of obtaining, after the update, a network whose weights are a convex combination of the weights of the target and main networks seems a bit weird to me, even if it comes from a DeepMind paper. Starting from the same initialization and using very small update weights it might work, but in general it should not work at all: interpolating between the network weights of different runs can potentially disrupt performance. (I will have a look at the paper, though I would love to hear your comments as an expert in this field.)
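To make that concern concrete, here is a tiny illustration (my own example, not from this thread) that averaging the weights of two unrelated nonlinear networks generally does not produce a function that lies between them, which is why weight interpolation is only gentle when the networks are already close (same initialization, tiny tau).

```python
import numpy as np

def tiny_net(w1, w2, x):
    # A two-layer tanh network used only for illustration.
    return np.tanh(x @ w1) @ w2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))

# Two independently initialized networks.
a1, a2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))
b1, b2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))

# Convex combination of the *weights* (tau = 0.5 for emphasis).
m1, m2 = 0.5 * a1 + 0.5 * b1, 0.5 * a2 + 0.5 * b2

blend_of_outputs = 0.5 * tiny_net(a1, a2, x) + 0.5 * tiny_net(b1, b2, x)
output_of_blend = tiny_net(m1, m2, x)

# Because of the nonlinearity these generally differ substantially.
print(np.abs(blend_of_outputs - output_of_blend).max())
```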