
A new way to produce stronger weight? #814

Open
2532796145 opened this issue Feb 4, 2018 · 126 comments
@2532796145

We tried combining 2 or 3 strong weight files by simply "adding them together":
We picked 257aeeb8 (the strongest one so far on http://zero.sjeng.org/ ) and some other weight files that won over 40% against 257aeeb8 in SPRT. We made some "hybrid" weight files by linear superposition: 0.5*weight1 + 0.5*weight2; 0.25*weight1 + 0.25*weight2 + 0.5*weight3; and so on. Surprisingly, we got several weight files much stronger than 257aeeb8. Here are two of them. Both of these "hybrid" files win ~70% of their matches against 257aeeb8 (1600 playouts).

weight1.zip
weight3.zip
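The linear superposition described above can be sketched in a few lines of Python. This is a hedged sketch, not the authors' actual code: `combine` and the toy values are hypothetical, and it assumes each weight file has already been parsed into a list of rows of floats, with coefficients summing to 1 for a convex mix.

```python
def combine(weight_files, coeffs):
    """Linearly combine parsed weight files.

    weight_files: one entry per network, each a list of rows of floats.
    coeffs: one coefficient per network; they should sum to 1.
    """
    mixed = []
    for rows in zip(*weight_files):
        # Each row position is combined element-wise across all networks.
        mixed.append([sum(c * row[i] for c, row in zip(coeffs, rows))
                      for i in range(len(rows[0]))])
    return mixed

# 0.25*weight1 + 0.25*weight2 + 0.5*weight3 on tiny one-row "files":
w1, w2, w3 = [[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]
print(combine([w1, w2, w3], [0.25, 0.25, 0.5]))  # [[3.5, 4.5]]
```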

@leelaup

leelaup commented Feb 4, 2018

Now LZ-halfblood-W1-P1600 and LZ-halfblood-W3-p1600 are testing on cgos.

@l1t1

l1t1 commented Feb 4, 2018

interesting

@godmoves
Contributor

godmoves commented Feb 4, 2018

The match between weight1 and 257aeeb8 is ongoing, and weight1 currently leads 27-16.

@thynson
Contributor

thynson commented Feb 4, 2018

So can it be concluded that the learning rate is too high for now?

@Godady

Godady commented Feb 4, 2018

Maybe it has something to do with "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" -- https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf

@larrywang30092

This is a good idea. Can you share the source code for how to do this?
We are working on mixing the weights in a more "scientific" way, in which the weights in each layer are diagonalized to generate eigenvalues and eigenvectors. We will then generate symmetrized weights in conformity with the inherent symmetry of the game. Next, we will explore the "entanglement" between layers and see if we can find a better mixing formula.

@leelaup

leelaup commented Feb 4, 2018

I think the method is very simple, something like: wfmix = (wf1 + wf2) / 2

@MingWR

MingWR commented Feb 4, 2018

@larrywang30092 I'm sure the code will disappoint you. It's far simpler than you might think.

w1 = open('oldWeight1.txt', 'rt')
w2 = open('oldWeight2.txt', 'rt')
weight = open('newWeight.txt', 'wt')

# A weight file is plain text: a version line ('1') followed by
# 67 lines of space-separated floats.
for n in range(67):
    v1 = [float(x) for x in w1.readline().split()]
    v2 = [float(x) for x in w2.readline().split()]
    if n == 0:
        weight.write('1')  # keep the version line unchanged
    else:
        weight.write('\n')
        for x, y in zip(v1, v2):
            weight.write(f'{(x + y) / 2} ')
print('Finished.')
w1.close()
w2.close()
weight.close()

@larrywang30092

Thank you, @MingWR

@Splee99

Splee99 commented Feb 4, 2018

I think the actual effect of averaging the two weights is reducing noise. Maybe we introduced too much noise in the self-play games?

@jkiliani

jkiliani commented Feb 4, 2018

I don't think noise is the problem. I think that each of the current strong weight files has particular weak points, where the policy priors don't include the correct response in some situations, and those situations are different for each weight file. The averaging of two similarly strong networks would make sure that the resulting policy priors include the correct response to each position where one of the networks knows the correct answer. MCTS will then sort out the correct response in each situation, as long as the policy net ensures it is searched. In this way, the combination can reasonably be greater than the sum of its parts.
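The argument above can be illustrated with a toy numeric example. Note the hedge: these priors are entirely made up, and the example averages network *outputs* to illustrate the reasoning, whereas the thread averages weights (for nearly identical networks the effects are related but not the same).

```python
# Toy policy priors over four moves; entirely made-up numbers.
net_a = [0.90, 0.05, 0.01, 0.04]   # knows the move at index 0, blind to 2
net_b = [0.01, 0.05, 0.90, 0.04]   # knows the move at index 2, blind to 0
avg = [(a + b) / 2 for a, b in zip(net_a, net_b)]
print(avg)
# Both critical moves now carry a large prior (~0.455), so MCTS will
# expand both, instead of never visiting the one its net was blind to.
```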

@jkiliani

jkiliani commented Feb 4, 2018

@gcp: If the procedure introduced here is a type of regularisation (and is shown to produce stronger weight files), would it not make sense to try adjusting the regularisation term in tfprocess.py?

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

Interesting. I already introduced that concept a few days ago (#794).
It's easy to write the code even if you're not familiar with deep learning,
because every network weight file is a plain-text file, and all of its contents are network weights except for the first value (the version number, '1').

But I never tested its strength the way you did.
I only checked that it seems to work reasonably well against Zen7 9-dan.
Very interesting experimental result.

Also note that you can't merge weights of different-sized networks (e.g. 5 blocks vs 6 blocks).
I tried using net2net to convert them to the same block count and then merge, but no luck; it didn't work correctly. IMHO that's because the weights come from differently structured networks.
So you can't merge different-sized network weights, even if you use net2net.

@ghost

ghost commented Feb 4, 2018

Instead of validating via games, you might validate via prediction accuracy on pro games. That should be a lot faster for determining which networks are 'good' vs 'bad'. Then only do play testing on the good networks.

@MingWR

MingWR commented Feb 4, 2018

@wpstmxhs When I started merging two weights, I was just thinking that maybe the learning rate is too high, so "pulling the new weight back a little" might work. Therefore I only tried averaging the new weights trained after 257 with 257 itself. I never thought it would really work until my friends helped test the strength.

@MartinDevelopment

MartinDevelopment commented Feb 4, 2018

For fun I just averaged the last 20 networks that didn't pass with each other, by simply running the above script 20 times. I don't know if the network I created is better. The only thing I notice is that it considers 5x as many moves every turn (all of them with 1 visit).
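One caveat worth noting: repeatedly running the pairwise script does not weight the 20 networks equally; the most recent file gets weight 1/2 and earlier ones decay geometrically. A uniform mean over all n files can be kept incrementally. This is a sketch under that observation; `running_mean` is a hypothetical helper, not code from the thread.

```python
def running_mean(weight_vectors):
    """Uniform average of any number of equal-shape weight vectors."""
    avg = None
    for n, w in enumerate(weight_vectors, start=1):
        if avg is None:
            avg = list(w)
        else:
            # avg <- avg + (w - avg)/n keeps an exact 1/n weighting
            # for every vector seen so far.
            avg = [a + (x - a) / n for a, x in zip(avg, w)]
    return avg

print(running_mean([[1.0], [2.0], [6.0]]))  # [3.0]
```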

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

@MingWR I see. I also tried to extrapolate a new, stronger future network's weights from weaker networks of the past. Playing with network weights was really fun.

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

@LetterRip I don't think so. IMHO, since Zero's networks are not trained from human games, validating their accuracy on pro games doesn't make sense.

As you know, Leela Zero's networks developed some new technical moves, like the early 3-3 invasion and the attachment right after the star point (4-4) or komoku (4-3). So it's normal for them not to fit human moves.

Also, this project's goal is to make a strong zero-based Go AI, not a human-pro-like AI.

@MartinDevelopment

The network I created from the last 20 networks that didn't pass won 4 out of 5 games against the current best with 400 playouts. Even though I used several networks that were below 10%, it doesn't seem to be any worse, and it might even be better.

@jkiliani

jkiliani commented Feb 4, 2018

@MartinDevelopment Can you continue this a bit more? If you can show that you got a much stronger net in this way with statistical significance, this may be big....

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

How about promoting the new mixed network weights to best-network status and letting many people use them to generate self-play games?

I would like to hear @gcp's thoughts.

@zediir
Contributor

zediir commented Feb 4, 2018

So is the effect of this just averaging out the changes of the networks and keeping what's identical between them?

@MartinDevelopment

I just added another 10 networks. If anyone would like to test it out, you can download it here: https://drive.google.com/file/d/1t6TG4hGdZqkIbNf_FBQyHEtchc9ztCaj

@zediir
Contributor

zediir commented Feb 4, 2018

Running

./leelaz.exe --gpu=1 -g -p 1600 --noponder -t 1 -q -d -r 1 -w newWeight.txt
vs
./leelaz.exe --gpu=1 -g -p 1600 --noponder -t 1 -q -d -r 1 -w 257aeeb863dc51bfc598838361225459257377a4b2c9abd3e1ac6cdba1fcc88f

@jkiliani

jkiliani commented Feb 4, 2018

@zediir It should smear the policy priors, and probably make the search tree wider but less deep in this way. It may also reduce errors in the value calculation, but I'm unsure about that part. Are you using FPU reduction code for your test?

@zediir
Contributor

zediir commented Feb 4, 2018

yes. I'm running on next-branch.

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

@zediir I think averaging weights between networks weakens over-fitted weight values and emphasizes the more common legitimate values; that's what made the network stronger.

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

It seems fun to try an automated system like the zero.sjeng.org server:

  1. Take the network candidates, mix the strongest ones, and make the result a new candidate.
  2. Evaluate each network's strength by playing games, as the zero server does.

Loop steps 1-2. Maybe it produces a super-duper stronger new network, without needing more distributed effort.

@Ttl
Member

Ttl commented Feb 4, 2018

Averaging weights from nearby networks decreases noise from mini-batch gradient calculations. If the network is very near the optimum then the weights don't change much during the training and we can think that the network weights are the optimal weights plus some noise caused by the stochastic gradient calculation. Averaging the weights decreases the amount of noise in weights and brings the network closer to the optimum.

The same effect can be had by decreasing the learning rate.
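Ttl's noise argument can be illustrated with a toy example. All numbers here are synthetic: the "checkpoints" stand in for nearby networks, modeled as the (pretend-known) optimum plus SGD-style noise; averaging them cancels part of the noise.

```python
import random

random.seed(0)
optimum = 1.0   # pretend this is the optimal value of a single weight
# Five "checkpoints": the optimum plus SGD-style noise (synthetic numbers).
checkpoints = [optimum + random.gauss(0, 0.1) for _ in range(5)]
avg = sum(checkpoints) / len(checkpoints)

worst = max(abs(w - optimum) for w in checkpoints)
# The mean of the noise has smaller magnitude than the worst single draw,
# so the averaged weight lands closer to the optimum.
print(abs(avg - optimum) < worst)  # True
```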

@jkiliani

jkiliani commented Feb 4, 2018

I noticed that the weight files produced by this script balloon in size because of rounding errors. It shouldn't make any difference to the outcome to use single precision rounding like in regular LeelaZ weight files...
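The size fix suggested above can be done with the standard library alone. This is a sketch, not leelaz code: `to_f32` is a hypothetical helper that quantizes a Python double to IEEE-754 single precision, after which nine significant digits are guaranteed to round-trip the value.

```python
import struct

def to_f32(x):
    """Round-trip a Python float through IEEE-754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

x = (0.1 + 0.2) / 2          # averaging often produces long decimal reprs
print(repr(x))               # 0.15000000000000002 in double precision
short = '%.9g' % to_f32(x)   # 9 significant digits round-trip any float32
print(short)                 # a much shorter string, same float32 value
```

Writing `'%.9g'` of the float32-quantized average keeps the file near the size of a regular weight file without changing the single-precision value that gets loaded.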

@optimistman

@pangafu Can your test program run on Windows?

@pangafu

pangafu commented Feb 13, 2018

@optimistman Yes, I tested it on Ubuntu and Windows.

@roy7
Collaborator

roy7 commented Feb 22, 2018

This thread reminds me of a paper which some of you may find interesting: Exploiting Cyclic Symmetry in Convolutional Neural Networks.

It has some drop-in CNN layers that can be used with any network design to directly handle rotation (and, with some more effort, transposition). Effectively it takes every rotation in the layer and averages the results coming out before the next layer, so all rotations give the same results. Which is sort of what we do here in this thread. ;)
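The rotation-averaging idea can be sketched on a tiny grid. This is a toy illustration of the principle only (reflections omitted, and nothing here is from the paper's actual layer code); `rot90` and `symmetrize` are hypothetical helpers.

```python
def rot90(g):
    """Rotate a square grid 90 degrees counterclockwise."""
    n = len(g)
    return [[g[c][n - 1 - r] for c in range(n)] for r in range(n)]

def symmetrize(g):
    """Average a feature map over its four rotations, so the output is
    the same no matter how the input is rotated."""
    rots = [g]
    for _ in range(3):
        rots.append(rot90(rots[-1]))
    n = len(g)
    return [[sum(r[i][j] for r in rots) / 4 for j in range(n)]
            for i in range(n)]

print(symmetrize([[1.0, 2.0], [3.0, 4.0]]))  # [[2.5, 2.5], [2.5, 2.5]]
```

Averaging over the symmetry group makes the result invariant: `symmetrize(rot90(g))` equals `symmetrize(g)`, which is the property the drop-in layers exploit.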

@Marcin1960

Marcin1960 commented Feb 23, 2018

Could someone translate it for me into plain C?
Or into pseudocode?

[...]
v1 = [float(x) for x in w1.readline().split()]
[...]
for i,x in enumerate(v1):
weight.write(f'{(x + v2[i]) / 2} ')
[...]

I will appreciate :)

@AAPMTG306

let me try:
I assume there are 64 elements in v1 (v2 would be read the same way).

char line[4096];
float v1[64];
int i;
FILE *f = fopen("filename.txt", "r");
fscanf(f, "%4095[^\n]", line);
char *p = line;
for (i = 0; i < 64; i++)
    v1[i] = strtof(p, &p);   /* strtof advances p past each number */
fclose(f);
...
FILE *fw = fopen("weights.txt", "w");
for (i = 0; i < 64; i++)
{
    float newval = (v1[i] + v2[i]) / 2;
    fprintf(fw, "%f ", newval);
}
fclose(fw);

@Marcin1960

Thanx a lot

@barrtgt

barrtgt commented Mar 15, 2018

Could this be related? https://arxiv.org/abs/1803.05407

@remdu
Contributor

remdu commented Mar 15, 2018

This seems to indicate that it is more than a simple learning rate decrease :)
The algorithm they present seems like a slightly more advanced form of what was done here. It would be very interesting to implement it for Leela Zero.
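For reference, the core of the SWA algorithm from that paper (arXiv:1803.05407) is just a uniform running average of iterates sampled every few steps. The following is a toy sketch on a made-up 1-D quadratic, not Leela Zero training code; every variable here is hypothetical.

```python
import random

random.seed(1)

# Toy problem: minimize f(w) = (w - 3)^2 with noisy gradients, then keep a
# uniform running average of iterates sampled every `cycle` steps, in the
# spirit of SWA. All numbers are synthetic.
w, lr, cycle = 0.0, 0.1, 10
swa_avg, n_avg = 0.0, 0
for step in range(1, 501):
    grad = 2 * (w - 3) + random.gauss(0, 1.0)   # noisy gradient estimate
    w -= lr * grad
    if step % cycle == 0:
        swa_avg = (swa_avg * n_avg + w) / (n_avg + 1)  # uniform running mean
        n_avg += 1

# The averaged iterate sits close to the optimum w* = 3, while the final
# iterate keeps fluctuating with the gradient noise.
print(round(swa_avg, 2))
```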

@Ttl
Member

Ttl commented Mar 19, 2018

I tested this over the weekend by averaging ed26f and 4 other failed networks that were started from ed26f and trained for 8k or 16k steps. The averaged network is stronger than the ed26f it started from:

115 wins, 76 losses
The first net is better than the second
averaged v ../autog ( 191 games)
              wins        black       white
averaged  115 60.21%   45 63.38%   70 58.33%
../autog   76 39.79%   26 36.62%   50 41.67%
                       71 37.17%  120 62.83%

The strength gain is about the same as the learning rate decrease gave.

@herazul

herazul commented Mar 19, 2018

And did you test an average starting from the latest network after the learning rate decrease, to see how it does?

@remdu
Contributor

remdu commented Mar 19, 2018

That's a very nice result. And since the paper indicates this technique puts us in an entirely different place than what we get from SGD, I would bet that no matter how much we decrease the learning rate, we will keep seeing good results from this.

I've been trying to set up a working environment to implement the paper directly, but I'm having problems with TensorFlow for now xD

@Marcin1960

If it were up to me the next network would be 20x128. Just my gut feeling or a hunch.

@bobchennan

SWA in general is a good idea; it was proposed a long time ago for stochastic approximation.

I've also seen similar implementations of plain model averaging before.
One example is Kaldi, a famous speech recognition toolkit.
This is the code that takes the average of different networks, and this is one script using the trick.
Basically, we train multiple models on multiple machines and then take the average. A similar idea is also known as parallel SGD in this paper.

@herazul

herazul commented Mar 30, 2018

Ttl, I'm trying to average some weights, but I don't know what to use. What do you use to average weights? A script?

And how do you run these validation match that give you result like :

115 wins, 76 losses
The first net is better than the second
averaged v ../autog ( 191 games)
              wins        black       white
averaged  115 60.21%   45 63.38%   70 58.33%
../autog   76 39.79%   26 36.62%   50 41.67%
                       71 37.17%  120 62.83%

Is it a command of leelaz.exe?

@roy7
Collaborator

roy7 commented Mar 30, 2018

@herazul There's a validation/ folder with a program in there designed to run two leela programs in head to head. :)

@herazul

herazul commented Mar 30, 2018

OK, I see the folder in the source code, but can I use it with the release version of Leela? If not, do I need to compile it myself to use it? (PS: I'm a noob in C++)

@herazul

herazul commented Mar 30, 2018

I installed VS2017 and managed to build leelaz, but not the validation project.

@alreadydone
Contributor

If you just want to average weights and are not using the SWA pipeline, why not just use #814 (comment) or https://github.com/pangafu/Hybrid_LeelaZero? Using SWA requires training. I guess @Ttl also did it that way in #814 (comment). SWA was only implemented in #1064.

@remdu
Contributor

remdu commented Mar 30, 2018

@herazul

herazul commented Mar 30, 2018

@alreadydone I tried to use it, but it didn't work. I opened an issue; we'll see.

Update: I managed to get it working; the problem was a string on Windows.

@herazul

herazul commented Mar 31, 2018

So now that it works, I have a question: has anyone managed to average the latest SWA-trained networks and get an improvement? I'm on my third experiment and still have nothing worthwhile.
