A new way to produce stronger weights? #814
Comments
Now LZ-halfblood-W1-P1600 and LZ-halfblood-W3-p1600 are being tested on CGOS.
Interesting.
The match between weight1 and 257aeeb8 is ongoing, and weight1 currently leads 27-16.
So can it be concluded that the learning rate is too high for now?
Maybe it is related to "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour": https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf
This is a good idea. Can you share the source code for how to do this?
I think the method is very simple, something like: wfmix = (wf1 + wf2) / 2
@larrywang30092 I'm sure the code will definitely disappoint you. It's far simpler than you may think.
|
Thank you, @MingWR |
I think the actual effect of averaging the two weights is reducing the noise. Maybe we introduced too much noise in the self-play games?
I don't think noise is the problem. I think that each of the current strong weight files has particular weak points, where the policy priors don't include the correct response in some situations, and those situations are different for each weight file. Averaging two similarly strong networks would make sure that the resulting policy priors include the correct response to each position where one of the networks knows the correct answer. MCTS will then sort out the correct response in each situation, as long as the policy net ensures it is searched. In this way, the combination can reasonably be greater than the sum of its parts.
@gcp: If the procedure introduced here is a type of regularisation (and is shown to produce stronger weight files), would it not make sense to try adjusting the regularisation term in tfprocess.py?
Interesting. I already introduced that concept a few days ago ( #794 ), but I never tried to check its strength like you did. Also note that you can't merge differently sized (e.g. 5-block vs 6-block) network weights.
Instead of validating via games, you might validate by testing prediction accuracy on pro games. That should be a lot faster for determining which networks are 'good' vs 'bad'. Then only do play testing on the good networks.
@wpstmxhs When I started merging two weights, I was just thinking that maybe the learning rate is too high, so "pulling the new weight back a little" might work. Therefore I only tried to average the new weights trained after 257 with 257 itself. I never thought it would really work until my friends helped test the strength.
For fun I just averaged the last 20 networks that didn't pass with each other by simply running the above script 20x. I don't know if the network I created is better. The only thing I notice is that the network I created considers 5x the number of moves every turn (all of them with 1 visit).
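One arithmetic caveat: if the pairwise script is chained 20 times, the mix is exponentially weighted (the most recent network gets weight 1/2, the earliest only 1/2^19), not uniform. A uniform mean over N networks can be kept incrementally instead; a sketch with NumPy arrays standing in for flattened weight files (names are illustrative):

```python
import numpy as np

def uniform_average(weight_arrays):
    # Incremental mean: after processing k arrays, avg is exactly the
    # mean of the first k, each with equal weight 1/k. Avoids both the
    # exponential weighting of chained pairwise averages and holding
    # all arrays in memory at once.
    avg = None
    for k, w in enumerate(weight_arrays, start=1):
        w = np.asarray(w, dtype=np.float64)
        avg = w.copy() if avg is None else avg + (w - avg) / k
    return avg
```

Whether the uniform or the exponentially weighted mix is stronger is an empirical question; both are cheap to try.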
@MingWR I see. I also tried to extrapolate a new, stronger future network's weights from weaker networks of the past. Playing with network weights was really fun.
@LetterRip I don't think so. IMHO, because Zero's networks are not trained on human games, validating their accuracy against pro games doesn't make sense. As you know, Leela Zero networks developed some new technical moves, like the early 3-3 point invasion, and the attachment right after the star point (4-4) or komoku (4-3). Therefore it's normal for them not to fit human moves. Also, this project's goal is to make a Zero-based strong Go AI, not a human-pro-like AI.
The network I created from the last 20 networks that didn't pass won 4 out of 5 games against the current best with 400 playouts. Even though I used several networks that are below 10%, it doesn't seem to be any worse, and it might even be better.
@MartinDevelopment Can you continue this a bit more? If you can show that you got a much stronger net in this way with statistical significance, this may be big...
How about promoting the new mixed network weights to best network weights and letting many people use them for making self-play games? I would like to hear @gcp 's thoughts.
So is the effect of this just averaging out the changes between the networks and keeping what's identical between them?
I just added another 10 networks. If anyone would like to test it out, you can download it here: https://drive.google.com/file/d/1t6TG4hGdZqkIbNf_FBQyHEtchc9ztCaj
Running ./leelaz.exe --gpu=1 -g -p 1600 --noponder -t 1 -q -d -r 1 -w newWeight.txt |
@zediir It should smear the policy priors, and probably make the search tree wider but less deep. It may also reduce errors in the value calculation, but I'm unsure about that part. Are you using the FPU reduction code for your test?
Yes. I'm running on next-branch.
@zediir I think averaging weights between networks weakens some over-fitted weight values and emphasizes the more common, legitimate values. That made the network stronger.
It seems fun to try an automated system like the zero.sjeng.org server.
Maybe it would make a super-duper stronger new network, without needing more distributed effort.
Averaging weights from nearby networks decreases noise from the mini-batch gradient calculations. If the network is very near the optimum, then the weights don't change much during training, and we can think of the network weights as the optimal weights plus some noise caused by the stochastic gradient calculation. Averaging the weights decreases the amount of noise in the weights and brings the network closer to the optimum. The same effect can be had if the learning rate is decreased.
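The noise argument above can be checked numerically: if each snapshot is the optimum plus independent zero-mean noise, the mean of N snapshots shrinks the noise norm by roughly sqrt(N). A synthetic sketch (random vectors standing in for network weights, nothing Leela-specific):

```python
import numpy as np

rng = np.random.default_rng(0)
optimum = np.zeros(10_000)                        # stand-in for the optimal weights
snapshots = [optimum + rng.normal(scale=0.1, size=optimum.size)
             for _ in range(8)]                   # SGD iterates = optimum + noise

single_err = np.linalg.norm(snapshots[0] - optimum)
avg_err = np.linalg.norm(np.mean(snapshots, axis=0) - optimum)
ratio = single_err / avg_err                      # expect roughly sqrt(8) ~ 2.8
print(round(ratio, 2))
```

With correlated noise (consecutive SGD checkpoints are not fully independent) the gain is smaller, which is one reason averaging well-separated snapshots tends to work better than averaging adjacent ones.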
I noticed that the weight files produced by this script balloon in size because of rounding errors. It shouldn't make any difference to the outcome to use single-precision rounding, like regular LeelaZ weight files do...
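One way to keep the file size down, as a hedged sketch (assuming the stored values are single precision, as the comment above suggests): round each averaged value back to float32 before writing, and use NumPy's shortest round-trip formatting so no excess digits from double-precision arithmetic end up in the file:

```python
import numpy as np

def format_value(v):
    # format_float_positional with unique=True emits the shortest decimal
    # string that round-trips exactly at float32 precision, instead of the
    # ~17 digits a double's repr can produce after averaging.
    return np.format_float_positional(np.float32(v), unique=True)
```

For example, averaging in double precision can turn a compact stored value into a long decimal; re-rounding to float32 before formatting restores the short form without changing the single-precision value the engine actually loads.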
@pangafu Can your test program run on Windows?
@optimistman Yes, I tested it on Ubuntu and Windows.
This thread reminds me of a paper some of you may find interesting: Exploiting Cyclic Symmetry in Convolutional Neural Networks. It has some drop-in CNN layers that can be used with any network design to directly handle rotation (and, with some more effort, transposition). Effectively it looks at every rotation in the layer and averages the results before the next layer, so all rotations produce the same results. Which is sort of what we do here in this thread. ;)
Could someone translate it for me into plain C? [...] I would appreciate it :)
let me try:
Thanks a lot
Could this be related? https://arxiv.org/abs/1803.05407
This seems to indicate that it is more than a simple learning rate decrease :) |
I tested this over the weekend by averaging ed26f and 4 other failed networks that were started from ed26f and trained for 8k or 16k steps. The averaged network is stronger than the ed26f it started from:
The strength gain is about the same that the learning rate decrease gave. |
And did you test an average starting from the latest network after the learning rate decrease, to see how it does?
That's a very nice result. And since the paper indicates this technique puts us in an entirely different place than what we get from SGD, I would bet that no matter how much we decrease the learning rate, we will keep seeing good results from this. I've been trying to set up a working environment to try implementing the paper directly, but I'm having problems with TensorFlow for now xD
If it were up to me the next network would be 20x128. Just my gut feeling or a hunch. |
SWA in general is a good idea, which was proposed a long time ago for stochastic approximation. Even for model averaging, I've seen similar implementations before.
Ttl, I'm trying to average some weights, but I don't know what to use. What do you use to average weights? A script? And how do you run the validation matches that give you results like:
Is it a command of leelaz.exe?
@herazul There's a validation/ folder with a program in there designed to run two leela programs in head to head. :) |
Ok, I see the folder in the source code, but can I use it with the release version of Leela? If not, do I need to compile it myself to use it? (PS: I'm a noob in C++)
I installed VS2017 and managed to build leelaz, but not the validation project.
If you just want to average weights without using the SWA pipeline, why not just use #814 (comment) https://github.com/pangafu/Hybrid_LeelaZero? Using SWA requires training. I guess @Ttl also did it that way in #814 (comment). SWA was only implemented in #1064.
There is also this https://github.com/gcp/leela-zero/blob/next/training/tf/average_weights.py |
@alreadydone I tried to use it but it doesn't work. I opened an issue; we'll see. I managed to get it to work; the problem was a string on Windows.
So now that it works, I have a question: did anyone manage to average the latest SWA-trained networks and get an improvement? I'm three experiments in and still have nothing worthwhile.
We tried to combine 2 or 3 strong weights by simply "adding them together":
We picked 257aeeb8 (the strongest one by now on http://zero.sjeng.org/ ) and some other weight files which won over 40% against 257aeeb8 in SPRT. We made some "hybrid" weight files by
linear superposition: 0.5*weight1 + 0.5*weight2; 0.25*weight1 + 0.25*weight2 + 0.5*weight3... Surprisingly, we got several weight files much stronger than 257aeeb8... Here are two of them. Both of these "hybrid" files can win ~70% of their matches against 257aeeb8 (1600 playouts).
weight1.zip
weight3.zip
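The linear superposition described above can be sketched as a hypothetical NumPy snippet over flattened weight vectors (file parsing omitted; the function name and the requirement that coefficients sum to 1 are assumptions of this sketch, not taken from the attached files):

```python
import numpy as np

def linear_mix(weight_vectors, coeffs):
    # Weighted superposition of same-architecture networks,
    # e.g. 0.25*w1 + 0.25*w2 + 0.5*w3.
    coeffs = np.asarray(coeffs, dtype=np.float64)
    if not np.isclose(coeffs.sum(), 1.0):
        raise ValueError("coefficients should sum to 1")
    stacked = np.stack([np.asarray(w, dtype=np.float64)
                        for w in weight_vectors])
    # Contract the coefficient vector against the first axis of the stack.
    return np.tensordot(coeffs, stacked, axes=1)
```

Keeping the coefficients summing to 1 preserves the overall scale of the weights, which matters because batch-norm statistics and the value head are calibrated to that scale.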