
A new way to produce stronger weight? #814

Open
2532796145 opened this issue Feb 4, 2018 · 126 comments
@2532796145

We tried combining 2 or 3 strong weight files by simply "adding them together":
We picked 257aeeb8 (the strongest one so far on http://zero.sjeng.org/ ) and some other weight files that won over 40% against 257aeeb8 in SPRT. We made some "hybrid" weight files by linear superposition: 0.5*weight1 + 0.5*weight2; 0.25*weight1 + 0.25*weight2 + 0.5*weight3; and so on. Surprisingly, we got several weight files much stronger than 257aeeb8. Here are two of them. Both of these "hybrid" files win ~70% of their matches against 257aeeb8 (1600 playouts).

weight1.zip
weight3.zip
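The linear superposition described above can be sketched in a few lines of Python. This is a hedged sketch, not the authors' actual code: `combine` and the toy values are hypothetical, and it assumes each weight file has already been parsed into a list of rows of floats, with coefficients summing to 1 for a convex mix.

```python
def combine(weight_files, coeffs):
    """Linearly combine parsed weight files.

    weight_files: one entry per network, each a list of rows of floats.
    coeffs: one coefficient per network; they should sum to 1.
    """
    mixed = []
    for rows in zip(*weight_files):
        # Each row position is combined element-wise across all networks.
        mixed.append([sum(c * row[i] for c, row in zip(coeffs, rows))
                      for i in range(len(rows[0]))])
    return mixed

# 0.25*weight1 + 0.25*weight2 + 0.5*weight3 on tiny one-row "files":
w1, w2, w3 = [[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]
print(combine([w1, w2, w3], [0.25, 0.25, 0.5]))  # [[3.5, 4.5]]
```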

@leelaup

leelaup commented Feb 4, 2018

Now LZ-halfblood-W1-P1600 and LZ-halfblood-W3-p1600 are testing on cgos.

@l1t1

l1t1 commented Feb 4, 2018

interesting

@godmoves
Contributor

godmoves commented Feb 4, 2018

The match between weight1 and 257aeeb8 is ongoing, and weight1 currently leads 27-16.

@thynson
Contributor

thynson commented Feb 4, 2018

So can it be concluded that the learning rate is too high for now?

@Godady

Godady commented Feb 4, 2018

Maybe it has something to do with "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" -- https://research.fb.com/wp-content/uploads/2017/06/imagenet1kin1h5.pdf

@larrywang30092

This is a good idea. Can you share the source code for how to do this?
We are working on mixing the weights in a more "scientific" way, in which the weights in each layer are diagonalized to generate eigenvalues and eigenvectors. We will then generate symmetrized weights in conformity with the inherent symmetry of the game. Next, we will explore the "entanglement" between layers and see if we can find a better mixing formula.

@leelaup

leelaup commented Feb 4, 2018

I think the method is very simple, something like: wfmix = (wf1 + wf2) / 2

@MingWR

MingWR commented Feb 4, 2018

@larrywang30092 I'm sure the code will disappoint you. It's far simpler than you might think.

w1 = open('oldWeight1.txt', 'rt')
w2 = open('oldWeight2.txt', 'rt')
weight = open('newWeight.txt', 'wt')

# A weight file is plain text: a version line ('1') followed by
# 67 lines of space-separated floats.
for n in range(67):
    v1 = [float(x) for x in w1.readline().split()]
    v2 = [float(x) for x in w2.readline().split()]
    if n == 0:
        weight.write('1')  # keep the version line unchanged
    else:
        weight.write('\n')
        for x, y in zip(v1, v2):
            weight.write(f'{(x + y) / 2} ')
print('Finished.')
w1.close()
w2.close()
weight.close()

@larrywang30092

Thank you, @MingWR

@Splee99

Splee99 commented Feb 4, 2018

I think the actual effect of averaging the two weights is reducing noise. Maybe we introduced too much noise in the self-play games?

@jkiliani

jkiliani commented Feb 4, 2018

I don't think noise is the problem. I think that each of the current strong weight files has particular weak points, where the policy priors don't include the correct response in some situations, and those situations are different for each weight file. The averaging of two similarly strong networks would make sure that the resulting policy priors include the correct response to each position where one of the networks knows the correct answer. MCTS will then sort out the correct response in each situation, as long as the policy net ensures it is searched. In this way, the combination can reasonably be greater than the sum of its parts.
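The argument above can be illustrated with a toy numeric example. Note the hedge: these priors are entirely made up, and the example averages network *outputs* to illustrate the reasoning, whereas the thread averages weights (for nearly identical networks the effects are related but not the same).

```python
# Toy policy priors over four moves; entirely made-up numbers.
net_a = [0.90, 0.05, 0.01, 0.04]   # knows the move at index 0, blind to 2
net_b = [0.01, 0.05, 0.90, 0.04]   # knows the move at index 2, blind to 0
avg = [(a + b) / 2 for a, b in zip(net_a, net_b)]
print(avg)
# Both critical moves now carry a large prior (~0.455), so MCTS will
# expand both, instead of never visiting the one its net was blind to.
```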

@jkiliani

jkiliani commented Feb 4, 2018

@gcp: If the procedure introduced here is a type of regularisation (and is shown to produce stronger weight files), would it not make sense to try adjusting the regularisation term in tfprocess.py?

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

Interesting. I already introduced that concept a few days ago (#794).
It's easy to write the code even if you're not familiar with deep learning,
because every network weight file is a plain-text file, and all of its contents are network weights except for the first value (the version number, '1').

But I never tested its strength the way you did.
I only checked that it seems to work reasonably well against Zen7 9-dan.
Very interesting experimental result.

Also note that you can't merge weights of different-sized networks (e.g. 5 blocks vs 6 blocks).
I tried using net2net to convert them to the same block count and then merge, but no luck; it didn't work correctly. IMHO that's because the weights come from differently structured networks.
So you can't merge different-sized network weights, even if you use net2net.

@ghost

ghost commented Feb 4, 2018

Instead of validating via games, you might validate via prediction accuracy on pro games. That should be a lot faster for determining which networks are 'good' vs 'bad'. Then only do play testing on the good networks.

@MingWR

MingWR commented Feb 4, 2018

@wpstmxhs When I started merging two weights, I was just thinking that maybe the learning rate is too high, so "pulling the new weight back a little" might work. Therefore I only tried averaging the new weights trained after 257 with 257 itself. I never thought it would really work until my friends helped test the strength.

@MartinDevelopment

MartinDevelopment commented Feb 4, 2018

For fun I just averaged the last 20 networks that didn't pass with each other, by simply running the above script 20 times. I don't know if the network I created is better. The only thing I notice is that it considers 5x as many moves every turn (all of them with 1 visit).
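One caveat worth noting: repeatedly running the pairwise script does not weight the 20 networks equally; the most recent file gets weight 1/2 and earlier ones decay geometrically. A uniform mean over all n files can be kept incrementally. This is a sketch under that observation; `running_mean` is a hypothetical helper, not code from the thread.

```python
def running_mean(weight_vectors):
    """Uniform average of any number of equal-shape weight vectors."""
    avg = None
    for n, w in enumerate(weight_vectors, start=1):
        if avg is None:
            avg = list(w)
        else:
            # avg <- avg + (w - avg)/n keeps an exact 1/n weighting
            # for every vector seen so far.
            avg = [a + (x - a) / n for a, x in zip(avg, w)]
    return avg

print(running_mean([[1.0], [2.0], [6.0]]))  # [3.0]
```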

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

@MingWR I see. I also tried to extrapolate a new, stronger future network's weights from weaker networks of the past. Playing with network weights was really fun.

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

@LetterRip I don't think so. IMHO, since Zero's networks are not trained from human games, validating their accuracy on pro games doesn't make sense.

As you know, Leela Zero's networks developed some new technical moves, like the early 3-3 invasion and the attachment right after the star point (4-4) or komoku (4-3). So it's normal for them not to fit human moves.

Also, this project's goal is to make a strong zero-based Go AI, not a human-pro-like AI.

@MartinDevelopment

The network I created from the last 20 networks that didn't pass won 4 out of 5 games against the current best with 400 playouts. Even though I used several networks that were below 10%, it doesn't seem to be any worse, and it might even be better.

@jkiliani

jkiliani commented Feb 4, 2018

@MartinDevelopment Can you continue this a bit more? If you can show that you got a much stronger net in this way with statistical significance, this may be big....

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

How about promoting the new mixed network weights to best-network status and letting many people use them to generate self-play games?

I would like to hear @gcp's thoughts.

@zediir
Contributor

zediir commented Feb 4, 2018

So is the effect of this just averaging out the changes of the networks and keeping what's identical between them?

@MartinDevelopment

I just added another 10 networks. If anyone would like to test it out, you can download it here: https://drive.google.com/file/d/1t6TG4hGdZqkIbNf_FBQyHEtchc9ztCaj

@zediir
Contributor

zediir commented Feb 4, 2018

Running

./leelaz.exe --gpu=1 -g -p 1600 --noponder -t 1 -q -d -r 1 -w newWeight.txt
vs
./leelaz.exe --gpu=1 -g -p 1600 --noponder -t 1 -q -d -r 1 -w 257aeeb863dc51bfc598838361225459257377a4b2c9abd3e1ac6cdba1fcc88f

@jkiliani

jkiliani commented Feb 4, 2018

@zediir It should smear the policy priors, and probably make the search tree wider but less deep in this way. It may also reduce errors in the value calculation, but I'm unsure about that part. Are you using FPU reduction code for your test?

@zediir
Contributor

zediir commented Feb 4, 2018

yes. I'm running on next-branch.

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

@zediir I think averaging weights between networks weakens over-fitted weight values and emphasizes the more common legitimate values; that's what made the network stronger.

@wpstmxhs
Contributor

wpstmxhs commented Feb 4, 2018

It seems fun to try an automated system like the zero.sjeng.org server:

  1. Take the network candidates, mix the strongest ones, and make the result a new candidate.
  2. Evaluate each network's strength by playing games, as the zero server does.

Loop steps 1-2. Maybe it produces a super-duper stronger new network, without needing more distributed effort.

@Ttl
Member

Ttl commented Feb 4, 2018

Averaging weights from nearby networks decreases noise from mini-batch gradient calculations. If the network is very near the optimum then the weights don't change much during the training and we can think that the network weights are the optimal weights plus some noise caused by the stochastic gradient calculation. Averaging the weights decreases the amount of noise in weights and brings the network closer to the optimum.

The same effect can be had by decreasing the learning rate.
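Ttl's noise argument can be illustrated with a toy example. All numbers here are synthetic: the "checkpoints" stand in for nearby networks, modeled as the (pretend-known) optimum plus SGD-style noise; averaging them cancels part of the noise.

```python
import random

random.seed(0)
optimum = 1.0   # pretend this is the optimal value of a single weight
# Five "checkpoints": the optimum plus SGD-style noise (synthetic numbers).
checkpoints = [optimum + random.gauss(0, 0.1) for _ in range(5)]
avg = sum(checkpoints) / len(checkpoints)

worst = max(abs(w - optimum) for w in checkpoints)
# The mean of the noise has smaller magnitude than the worst single draw,
# so the averaged weight lands closer to the optimum.
print(abs(avg - optimum) < worst)  # True
```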

@jkiliani

jkiliani commented Feb 4, 2018

I noticed that the weight files produced by this script balloon in size because of rounding errors. It shouldn't make any difference to the outcome to use single precision rounding like in regular LeelaZ weight files...
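The size fix suggested above can be done with the standard library alone. This is a sketch, not leelaz code: `to_f32` is a hypothetical helper that quantizes a Python double to IEEE-754 single precision, after which nine significant digits are guaranteed to round-trip the value.

```python
import struct

def to_f32(x):
    """Round-trip a Python float through IEEE-754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

x = (0.1 + 0.2) / 2          # averaging often produces long decimal reprs
print(repr(x))               # 0.15000000000000002 in double precision
short = '%.9g' % to_f32(x)   # 9 significant digits round-trip any float32
print(short)                 # a much shorter string, same float32 value
```

Writing `'%.9g'` of the float32-quantized average keeps the file near the size of a regular weight file without changing the single-precision value that gets loaded.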

@optimistman

@pangafu Can your test program run on Windows?

@pangafu

pangafu commented Feb 13, 2018

@optimistman Yes, I tested it on Ubuntu and Windows.

@roy7
Collaborator

roy7 commented Feb 22, 2018

This thread reminds me of a paper which some of you may find interesting: Exploiting Cyclic Symmetry in Convolutional Neural Networks.

It has some drop-in CNN layers that can be used with any network design to directly handle rotation (and, with some more effort, transposition). Effectively it takes every rotation in the layer and averages the results coming out before the next layer, so all rotations give the same results. Which is sort of what we do here in this thread. ;)
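The rotation-averaging idea can be sketched on a tiny grid. This is a toy illustration of the principle only (reflections omitted, and nothing here is from the paper's actual layer code); `rot90` and `symmetrize` are hypothetical helpers.

```python
def rot90(g):
    """Rotate a square grid 90 degrees counterclockwise."""
    n = len(g)
    return [[g[c][n - 1 - r] for c in range(n)] for r in range(n)]

def symmetrize(g):
    """Average a feature map over its four rotations, so the output is
    the same no matter how the input is rotated."""
    rots = [g]
    for _ in range(3):
        rots.append(rot90(rots[-1]))
    n = len(g)
    return [[sum(r[i][j] for r in rots) / 4 for j in range(n)]
            for i in range(n)]

print(symmetrize([[1.0, 2.0], [3.0, 4.0]]))  # [[2.5, 2.5], [2.5, 2.5]]
```

Averaging over the symmetry group makes the result invariant: `symmetrize(rot90(g))` equals `symmetrize(g)`, which is the property the drop-in layers exploit.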

@Marcin1960

Marcin1960 commented Feb 23, 2018

Could someone translate it for me into plain C?
Or into pseudocode?

[...]
v1 = [float(x) for x in w1.readline().split()]
[...]
for i,x in enumerate(v1):
weight.write(f'{(x + v2[i]) / 2} ')
[...]

I will appreciate :)

@AAPMTG306

let me try:
I assume there are 64 elements in v1 (v2 would be read the same way).

char line[4096];
float v1[64];
int i;
FILE *f = fopen("filename.txt", "r");
fscanf(f, "%4095[^\n]", line);
char *p = line;
for (i = 0; i < 64; i++)
    v1[i] = strtof(p, &p);   /* strtof advances p past each number */
fclose(f);
...
FILE *fw = fopen("weights.txt", "w");
for (i = 0; i < 64; i++)
{
    float newval = (v1[i] + v2[i]) / 2;
    fprintf(fw, "%f ", newval);
}
fclose(fw);

@Marcin1960

Thanx a lot

@barrtgt

barrtgt commented Mar 15, 2018

Could this be related? https://arxiv.org/abs/1803.05407

@remdu
Contributor

remdu commented Mar 15, 2018

This seems to indicate that it is more than a simple learning rate decrease :)
The algorithm they present seems like a slightly more advanced form of what was done here. It would be very interesting to implement it for Leela Zero.
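For reference, the core of the SWA algorithm from that paper (arXiv:1803.05407) is just a uniform running average of iterates sampled every few steps. The following is a toy sketch on a made-up 1-D quadratic, not Leela Zero training code; every variable here is hypothetical.

```python
import random

random.seed(1)

# Toy problem: minimize f(w) = (w - 3)^2 with noisy gradients, then keep a
# uniform running average of iterates sampled every `cycle` steps, in the
# spirit of SWA. All numbers are synthetic.
w, lr, cycle = 0.0, 0.1, 10
swa_avg, n_avg = 0.0, 0
for step in range(1, 501):
    grad = 2 * (w - 3) + random.gauss(0, 1.0)   # noisy gradient estimate
    w -= lr * grad
    if step % cycle == 0:
        swa_avg = (swa_avg * n_avg + w) / (n_avg + 1)  # uniform running mean
        n_avg += 1

# The averaged iterate sits close to the optimum w* = 3, while the final
# iterate keeps fluctuating with the gradient noise.
print(round(swa_avg, 2))
```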

@Ttl
Member

Ttl commented Mar 19, 2018

I tested this over the weekend by averaging ed26f and 4 other failed networks that were started from ed26f and trained for 8k or 16k steps. The averaged network is stronger than the ed26f it started from:

115 wins, 76 losses
The first net is better than the second
averaged v ../autog ( 191 games)
              wins        black       white
averaged  115 60.21%   45 63.38%   70 58.33%
../autog   76 39.79%   26 36.62%   50 41.67%
                       71 37.17%  120 62.83%

The strength gain is about the same as the learning rate decrease gave.

@herazul

herazul commented Mar 19, 2018

And did you test an average starting from the latest network after the learning rate decrease, to see how it does?

@remdu
Contributor

remdu commented Mar 19, 2018

That's a very nice result. And since the paper indicates this technique puts us in an entirely different place than what we get from SGD, I would bet that no matter how much we decrease the learning rate, we will keep seeing good results from this.

I've been trying to set up a working environment to implement the paper directly, but I'm having problems with TensorFlow for now xD

@Marcin1960

If it were up to me the next network would be 20x128. Just my gut feeling or a hunch.

@bobchennan

SWA in general is a good idea; it was proposed a long time ago for stochastic approximation.

I've also seen similar implementations of plain model averaging before.
One example is Kaldi, a famous speech recognition toolkit.
This is the code that takes the average of different networks, and this is one script using the trick.
Basically, we train multiple models on multiple machines and then take the average. A similar idea is also known as parallel SGD in this paper.

@herazul

herazul commented Mar 30, 2018

Ttl, I'm trying to average some weights, but I don't know what to use. What do you use to average weights? A script?

And how do you run these validation match that give you result like :

115 wins, 76 losses
The first net is better than the second
averaged v ../autog ( 191 games)
              wins        black       white
averaged  115 60.21%   45 63.38%   70 58.33%
../autog   76 39.79%   26 36.62%   50 41.67%
                       71 37.17%  120 62.83%

Is it a command of leelaz.exe?

@roy7
Collaborator

roy7 commented Mar 30, 2018

@herazul There's a validation/ folder with a program in there designed to run two leela programs in head to head. :)

@herazul

herazul commented Mar 30, 2018

OK, I see the folder in the source code, but can I use it with the release version of Leela? If not, do I need to compile it myself to use it? (PS: I'm a noob in C++)

@herazul

herazul commented Mar 30, 2018

I installed VS2017 and managed to build leelaz, but not the validation project.

@alreadydone
Contributor

If you just want to average weights and are not using the SWA pipeline, why not just use #814 (comment) or https://github.com/pangafu/Hybrid_LeelaZero? Using SWA requires training. I guess @Ttl also did it that way in #814 (comment). SWA was only implemented in #1064.

@remdu
Contributor

remdu commented Mar 30, 2018

@herazul

herazul commented Mar 30, 2018

@alreadydone I tried to use it, but it didn't work. I opened an issue; we'll see.

Update: I managed to get it working; the problem was a string on Windows.

@herazul

herazul commented Mar 31, 2018

So now that it works, I have a question: has anyone managed to average the latest SWA-trained networks and get an improvement? I'm on my third experiment and still have nothing worthwhile.
