
Quantize network files #1733

Closed
gcp opened this issue Aug 15, 2018 · 47 comments

@gcp
Member

gcp commented Aug 15, 2018

This won't fundamentally solve #1731, but it's a simple and fully backwards-compatible optimization:

We already know that using fp16 doesn't really affect the network strength. This means we should be able to write out the network files with far fewer significant digits than we do now, and maybe cut the size of everything almost in half.

Given the experience with things like Bfloat16, it may even be that fewer digits than in "half" format are needed, as long as the decimal point is in the right spot (which it is, given that we're text based).

This could be nicely tested by quantizing some existing networks and scheduling matches between them. It would be interesting to see how far we can go before the precision loss is really noticeable.
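To get a feel for how decimal digits map onto the binary formats mentioned above, here is a minimal sketch (with made-up weight values, not taken from any real network file) that reformats numbers to a given number of significant digits and measures the worst relative error; roughly 2-3 significant digits corresponds to a Bfloat16-sized significand and 3-4 to half precision.

# Sketch only: hypothetical weight values.
weights = [0.0374821, -1.20417, 0.00041931, 7.1348e-06, -0.91342]

for digits in (2, 3, 4, 5):
    rounded = [float('{:.{}g}'.format(w, digits)) for w in weights]
    worst = max(abs(r - w) / abs(w) for w, r in zip(weights, rounded))
    print('{}g -> worst relative error {:.1e}'.format(digits, worst))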

@l1t1

l1t1 commented Aug 15, 2018

Writing the network files as IEEE 754 float instead of double could save half the space, but writing them as plain text cannot.
I know little about Bfloat16; could you attach a test weight file in Bfloat16 format?

The original ELF format (https://github.com/pytorch/ELF/releases/download/pretrained-go-19x19-v0/pretrained-go-19x19-v0.bin) is smaller than the gz.

@gcp
Member Author

gcp commented Aug 15, 2018

Writing the files in binary (be it half, double or Bfloat16) isn't supported by the clients, the training code, or the server, and would need a network format update.

How can I give you a Bfloat16 file if literally nothing supports it?

We already write out files with float precision, I'm saying we can use even less than that exactly because we do not use a fixed binary format, i.e. the exact opposite of what you claim is true.

@ihavnoid
Member

Another hilarious but effective trick is to replace all instances of 0.xxx with .xxx, e.g. print .372 instead of 0.372; most numbers are less than 1.000, so this should save quite a bit of space.
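For illustration, a sketch of that trick applied to already-formatted numbers (a hypothetical helper, not the actual weight writer):

def strip_leading_zero(s):
    # "0.372" -> ".372", "-0.372" -> "-.372"; leaves values like "10.1" alone
    if s.startswith('0.'):
        return s[1:]
    if s.startswith('-0.'):
        return '-' + s[2:]
    return s

print(strip_leading_zero('0.372'))    # .372
print(strip_leading_zero('-0.0118'))  # -.0118
print(strip_leading_zero('10.1'))     # 10.1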

@l1t1

l1t1 commented Aug 15, 2018

Compressed to gz, .372 is about the same size as 0.372:

nozero.txt.gz
zero.txt.gz
nozero.txt
zero.txt

date time size file
2018-08-15 17:35 174 nozero.txt
2018-08-15 17:36 60 nozero.txt.gz
2018-08-15 17:35 203 zero.txt
2018-08-15 17:36 59 zero.txt.gz

@gjm11

gjm11 commented Aug 15, 2018

Leela Nozero. (Sorry.)

@OmnipotentEntity
Contributor

OmnipotentEntity commented Aug 15, 2018

I tested removing the leading zeroes on the newest network as a real life test and the results were:

File Size (bytes)
9c56ae62.in 276756703
9c56ae62.out 253294256
9c56ae62.in.gz 96011046
9c56ae62.out.gz 94786826

So removing the leading zeroes really does seem to help.

Method for gzip was just to call the gzip command with default settings. I recompressed the downloaded weights because the server seems to use more aggressive compression settings. So for the sake of comparison, and to see if removing the leading zeroes really helps in the limit, I also tried more aggressive compression (7-Zip with LZMA2 at compression level 9 with a huge dictionary):

File Size (bytes)
9c56ae62.in.7z 78915725
9c56ae62.out.7z 78395896

So even under very aggressive compression settings, simply removing the 0 is a net gain.
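Roughly the same comparison can be reproduced with Python's standard library codecs (a sketch; the filename is whatever weight file is being tested, and the stdlib LZMA encoder is not identical to the 7-Zip LZMA2 settings used above):

import gzip
import lzma

# Hypothetical filename: the weight file with the leading zeroes removed.
with open('9c56ae62.out', 'rb') as f:
    data = f.read()

print('raw     ', len(data))
print('gzip -9 ', len(gzip.compress(data, compresslevel=9)))
print('xz -9e  ', len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)))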

@Ttl
Member

Ttl commented Aug 15, 2018

Here is a proof of concept quantization script:

import sys, os

output_name = os.path.splitext(sys.argv[1])
output_name = output_name[0] + '_compressed' + output_name[1]
output = open(output_name, 'w')

def format_n(x):
    x = float(x)
    if abs(x) < 1e-6:
        return '0'
    if abs(x) < 1e-2:
        x = '{:.2e}'.format(x)
        x = x.replace('e-0', 'e-')
        return x
    x = '{:.5f}'.format(x).rstrip('0').rstrip('.')
    if x.startswith('0.'):
        x = x[1:]
    if x.startswith('-0.'):
        x = '-' + x[2:]
    return x

with open(sys.argv[1], 'r') as f:
    for line in f:
        line = ' '.join(map(format_n, line.split(' ')))
        output.write(line + '\n')

output.close()

This has somewhere around half-precision accuracy. I didn't test the playing strength, but the net output still looks very close to single precision. It drops the .gz file size from 96 MB to 56 MB. The Leela Chess protobuf format is 44 MB, and I guess in half precision it would be around half of that.

@gcp
Member Author

gcp commented Aug 15, 2018

So even under very aggressive compression settings, simply removing the 0 is a net gain.

It's still fewer symbols that have to be encoded as being present, so it makes sense.

@gcp
Member Author

gcp commented Aug 15, 2018

I did a test with:

#!/usr/bin/env python3
import sys

with open(sys.argv[1], 'r') as w:
    with open('quant.txt', 'w') as q:
        w.readline()  # skip network version
        q.write('1\n')
        for line in w:
            weights = [float(x) for x in line.split()]
            print("weights: {}".format(len(weights)))
            str_weights = " ".join([str("{:.2g}".format(x)) for x in weights])
            q.write(str_weights + "\n")

This didn't use the zero-stripping, round-to-zero, etc. tricks.

-rw-rw-r-- 1 morbo morbo 93595393 aug 15 15:10 c910dee9c057d144efaafb603ce69d564825ca316dd8129ce0f05c1231d1f9bc.gz
-rw-rw-r-- 1 morbo morbo 36229210 aug 15 15:32 quant2.txt.gz
-rw-rw-r-- 1 morbo morbo 51016744 aug 15 15:29 quant3.txt.gz
-rw-rw-r-- 1 morbo morbo 64418836 aug 15 15:27 quant4.txt.gz
-rw-rw-r-- 1 morbo morbo 79537730 aug 15 15:24 quant5.txt.gz

I tested quant2 (:2g) which has about 7 bits of mantissa (similar to Bfloat16). It's 11-11 vs the unquantized network so far.

@gcp
Member Author

gcp commented Aug 15, 2018

I guess I should incorporate some of the tricks in this script and then do a comparison of whether rounding half style (big significand, fewer exponent bits, like @Ttl's script) or Bfloat16 style (small significand, keep the exponent bits) is more effective for us.
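For reference, the difference between the two styles can be sketched with numpy as a stand-in (made-up values; Bfloat16 is emulated here by truncating float32, which is not exactly how real hardware rounds):

import numpy as np

def to_bfloat16(a):
    # Emulate Bfloat16 by zeroing the low 16 bits (all mantissa bits) of float32.
    bits = np.asarray(a, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

w = np.array([0.0374821, -1.20417, 4.1931e-4, 7.1348e-6, 68000.0], dtype=np.float32)
print('fp16    :', w.astype(np.float16).astype(np.float32))  # 68000 overflows to inf
print('bfloat16:', to_bfloat16(w))                           # keeps float32's exponent range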

@betterworld
Contributor

@Ttl

x = x.replace('0.', '.')

Careful with 10.1

@Ttl
Member

Ttl commented Aug 15, 2018

Well spotted. There actually was one weight starting with "10." in the weight file.

I calculated some L2-norms between original and quantized weights and "5f" seems to be more accurate than "5g":

{:.5f}, 3.53505565914826e-05
{:.2g}, 0.12159673216677779
{:.3g}, 0.04279058207864573
{:.4g}, 0.00163199402697143
{:.5g}, 0.00022178429992239518

For comparison, numpy.float16 has an L2-norm of 0.006657267984862084. {:.2g} seems significantly less accurate than half precision.
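The exact normalization used for these numbers isn't shown above; a plain (unnormalized) version of the comparison could look like this, assuming both files share the same layout apart from the number formatting (filenames are hypothetical):

def load(path):
    with open(path) as f:
        return [float(x) for line in f for x in line.split()]

orig = load('weights.txt')
quant = load('weights_quantized.txt')
l2 = sum((a - b) ** 2 for a, b in zip(orig, quant)) ** 0.5
print('Weight file difference L2-norm:', l2)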

@gcp
Member Author

gcp commented Aug 16, 2018

The test match of c910dee vs the quant2 version ended in an 84-76 win for the quant2 version.

"5f" seems to be more accurate than "5g"

But what about the file sizes? Edit: Hmm, trying to make sense of the descriptions of both, 5f seems like it should mostly be smaller? g is apparently like f but switches to exponential notation for very large or small numbers, and has one less digit of precision. So your conclusion is that very small or very big weights are completely ignorable for the computation?

@gcp
Member Author

gcp commented Aug 16, 2018

{:.2g} seems significantly more inaccurate than half precision.

That makes sense as it's 11 bits of significand compared to 6.5 bits. However, Bfloat only has 7 and that's apparently the hot thing in machine learning land. From the above test, it seems fine for us.

I think the question is whether you want to keep more precision in the values, or be able to handle the odd value that is very big or small? From your results, small values could just be rounded to zero. But looking in the actual network files, this means a large part of the last few innerproduct layers is fully redundant, as it's almost all tiny values.

I'm somewhat partial to using g/Bfloat16 style because it seems more robust to keep some precision in all values and not just round those to 0. But maybe that's just my preconception/imagination, as BatchNorm means very small coefficients or biases can't really have much meaning, given the distribution of input values? (But those last layers are exactly the place where there is no batchnorm!)

Any comments/thoughts?

@Ttl
Member

Ttl commented Aug 16, 2018

The minimum positive non-denormal half-precision number is about 6e-5. Since numbers smaller than that would in any case be rounded to zero when storing them in half precision, we might as well round them to zero in the weight file already. But I'm not completely sure whether that has drawbacks when restoring training from the quantized weights?

Just using "g" formatting should be a safe choice. Very small numbers could probably be written with less precision without many issues.
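For reference, the half-precision limits being discussed can be checked with numpy (assuming numpy is available):

import numpy as np

print('smallest normal fp16   :', np.finfo(np.float16).tiny)  # ~6.104e-05
print('smallest subnormal fp16:', 2.0 ** -24)                 # ~5.96e-08
print('largest fp16           :', np.finfo(np.float16).max)   # 65504.0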

@gcp
Member Author

gcp commented Aug 16, 2018

But I'm not completely sure if it has some drawbacks when restoring training from the quantized weights?

Right, that's something to be careful about.

3g with the extra tricks (but no round-to-zero) should halve the file size and (as far as I can tell) be very safe. Want to modify your script and make a pull request for it?

gcp pushed a commit that referenced this issue Aug 16, 2018
Use smaller precision to store the weights to decrease the file size.

See discussion in issue #1733.

Pull request #1736.
@l1t1

l1t1 commented Aug 18, 2018

my test of elf v0

C:\Users\aaa\Downloads>python quantize_weights.py  elf_converted_weights.txt
Weight file difference L2-norm: 0.0034445704188723683

C:\Users\aaa\Downloads>dir elf_converted_weights*

2018/05/04  02:45       260,579,331 elf_converted_weights.txt
2018/08/17  15:24       154,259,295 elf_converted_weights_quantized.txt

Could you set up a test match of ELF v0 vs elf_v0_quantized?

@diadorak

We don't need to use the ELF network. Matching a 5x64 network against the quantized version of itself should be enough, right?

@gcp
Member Author

gcp commented Aug 20, 2018

There might be an argument that the deeper the network, the more the intermediate rounding can affect things. I'd just queue the current best vs a quantized version of itself. Or best 40b vs quant 40b.

@Mardak
Collaborator

Mardak commented Aug 20, 2018

Here are some numbers from the 40b quantized test:
gzip size quantized: 96MB, original: 183MB
plain text size quantized: 352MB, original: 534MB

quantized:  -.00371     -.00158     -.00174     -.00292     -.00653    -.0118     -.000745     -.00946    .0194     -.0037     .000385     -.00168     -.000181   -.00304     -.0117    -.000807     -.00911
original:  -0.00370743 -0.00157953 -0.00174433 -0.00292096 -0.0065255 -0.0117573 -0.000745065 -0.0094582 0.0194095 -0.0036998 0.000384994 -0.00167523 -0.0001806 -0.00303627 -0.011729 -0.000807318 -0.0091053

@l1t1

l1t1 commented Aug 20, 2018

In the a304 vs b072 match, both networks came from the same training games and the quantized file won; it is hard to explain.

@gcp
Member Author

gcp commented Aug 21, 2018

it is hard to explain.

It's very easy, it's called statistical variance.

@Gondolieri

Another idea for an explanation: perhaps very small values should be 0.0 in a 'correct' weight file. The training hasn't reached that value yet, but the script erases the bad influence.
So far I haven't double-checked whether this may be true.

@gcp
Member Author

gcp commented Aug 21, 2018

The script doesn't round to zero.

@jokkebk

jokkebk commented Aug 21, 2018

Assuming the network has not yet converged to "optimal weights" for most values when measured with rounded numbers, the rounding should "improve performance" on about 50% of the rounded weights and "reduce performance" on the other 50%.

Has anyone measured how much the values change from one promoted network to the next? (I'm unfortunately not familiar with the weight file format, so no, I haven't.) If the unquantized change per step is large, rounding should do little. But if, say, 20% of the weights change so little that they won't get over the "0.5" rounding threshold in one go, we'd effectively be locking part of the network in place.

Of course, if the server keeps the full-precision network for training, and it's just the training games run by clients (and maybe the match games) that use the reduced version, it should mostly be a non-issue, seeing how close in performance the quantized versions are.

@gcp
Member Author

gcp commented Aug 22, 2018

There are about 10 bits of precision in the significand; if weights don't change by a fraction of at least 1/1000 over the 256k training steps, they will indeed be locked in place. That seems fine to me!

Of course, if the server keeps the full-precision network for training, and it's just the training games run by clients (and maybe the match games) that use the reduced version, it should mostly be a non-issue, seeing how close in performance the quantized versions are.

This is partially true in that during a training run the server has the unquantized weights. But every run does restart from the actual network, which will be quantized now.

@gcp
Member Author

gcp commented Aug 22, 2018

I changed the server so the next bunch of networks are all going to be quantized out of the box. All test results so far were extremely good. This is more or less the last step before switching to 40b, I believe.

@l1t1

l1t1 commented Aug 22, 2018

The winrate of the quantized 20b is low:

Start Date Network Hashes Wins / Losses Games SPRT
2018-08-22 21:49 f37a584b  VS  7141e697 6 : 27 (18.18%) 33 / 400 fail

@l1t1

l1t1 commented Aug 23, 2018

Could you test an original (unquantized) version of f37a584b?

@gcp
Member Author

gcp commented Aug 23, 2018

Looks like it was just bad luck, the other networks are more normal.

@l1t1

l1t1 commented Aug 23, 2018

68a8e58e is also bad luck. The luck of b072 is too good to believe.

@gcp
Member Author

gcp commented Aug 23, 2018

Uh, we did quite a few more tests, with 99b64745 for example.

And 68a8 is now at 50%, with only 150 games played. Stop drawing conclusions from no data.

@l1t1

l1t1 commented Aug 23, 2018

What about force-promoting 1667? It is the best quantized 20b.

@alphaladder

@l1t1 Friend, have some patience and trust GCP.

@l1t1

l1t1 commented Aug 24, 2018

I only regret that when the server makes a mistake it is fine, like lz169, while a human cannot make the same mistake, like 1667; it seems unfair, and it seems the server has more rights than humans.
I don't have a high-end computer and cannot affect the matches, so I can only suggest running 400 games for each match and then deciding whether to promote, and if the winrate after 400 games is bigger than 0.54, running 50 more games.

170 passed; it seems that each time I lose patience, the weights are born. Nevertheless, it's a good thing.

@gonzalezjo

gonzalezjo commented Aug 24, 2018

I did a lot of research on this, and wrote some code to automate tests for various approaches to shrinking network files. My work is documented here: LeelaChessZero/lc0#121

As a TLDR of my research issue:

Using the state of the art in compression techniques, I was able to reduce the Leela Chess Zero 20b plaintext net from 373MB uncompressed / 129MB gzipped to just under 50MB, compressed. Achieving better compression is possible still, depending on how okay you are with varying degrees of quantization.

There is still further research to be done (by me, later) using Dropbox DivANS instead of BSC.

Anyway, in order to get those results, a few things had to be done. First, the weights file was parsed, creating an in-memory array of floats. Second, the state of the art in lossy float compression, zfp, was run on the array with an error tolerance of 1e-5 and told to quantize. The output was a bit over 52.4MB. Finally, bsc, a novel compressor, was used on the output of zfp. This squeezed a few more megabytes from the file.

zfp is magical as a quantizer.
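As a rough idea of the zfp step, here is a sketch using the zfpy Python bindings (an assumption on my part: the linked experiment may have driven zfp differently, and the filename is hypothetical):

import numpy as np
import zfpy  # Python bindings for zfp

# Parse the plain-text weight file into one flat float32 array.
with open('weights.txt') as f:
    f.readline()  # skip the format version line
    weights = np.array([float(x) for line in f for x in line.split()],
                       dtype=np.float32)

compressed = zfpy.compress_numpy(weights, tolerance=1e-5)
restored = zfpy.decompress_numpy(compressed)
print(len(compressed), 'bytes, max abs error:', np.max(np.abs(restored - weights)))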

@l1t1

l1t1 commented Aug 24, 2018

The early promoted weights are playing more games in their matches.
Is that done to get a more accurate Elo for them?

@gonzalezjo

I collected some data on compression of leelaz-model-swa-12-96000.txt. Using an error tolerance of 1 / 2 ^ 8, the results are as follows:

14712009 leelaz-model-swa-12-96000.7z
12067786 leelaz-model-swa-12-96000.bsc
276757452 leelaz-model-swa-12-96000.uncompressed.txt
12432145 leelaz-model-swa-12-96000.ppmd


The numbers are in bytes.

@l1t1

l1t1 commented Aug 25, 2018

I downloaded https://s3.amazonaws.com/libbsc/bsc-3.1.0-x86.zip and
https://s3.amazonaws.com/libbsc/bsc-3.1.0-x64.zip
Compared to gz, bsc gives about 3/4 of the gz size on a normal weight file, about 7/10 of the gz size on a quantized file, and about 9/10 of the bin size on the torch file.

table

C:\>bsc e elf_converted_weights.txt elf0
This is bsc, Block Sorting Compressor. Version 3.1.0. 8 July 2012.
Copyright (c) 2009-2012 Ilya Grebnov <Ilya.Grebnov@gmail.com>.

elf_converted_weights.txt compressed 260579331 into 72602044 in 35.428 seconds.

2018/05/04  08:07        93,944,064 62b5417b64c46976795d10a6741801f15f857e5029681a42d02c9852097df4b9.gz

C:\>bsc e d:b1c3 b1c3.bsc
This is bsc, Block Sorting Compressor. Version 3.1.0. 8 July 2012.
Copyright (c) 2009-2012 Ilya Grebnov <Ilya.Grebnov@gmail.com>.

d:b1c3 compressed 180740951 into 35075710 in 8.752 seconds.

2018/08/27  07:22        50,198,595 b1c37640fbd1a80249693b236f551d1886f7479e1b57051f2d5e098d2abffd02.gz

C:\>bsc\bsc e pretrained-go-19x19-v0.bin elf0.bsc
This is bsc, Block Sorting Compressor. Version 3.1.0. 8 July 2012.
Copyright (c) 2009-2012 Ilya Grebnov <Ilya.Grebnov@gmail.com>.

pretrained-go-19x19-v0.bin compressed 74052822 into 65747456 in 6.895 seconds.

C:\>bsc\bsc e b072 b072.bsc
This is bsc, Block Sorting Compressor. Version 3.1.0. 8 July 2012.
Copyright (c) 2009-2012 Ilya Grebnov <Ilya.Grebnov@gmail.com>.

b072 compressed 369407814 into 68850202 in 16.802 seconds.

2018/08/21  01:03       369,407,814 b072
2018/08/28  07:28        68,850,202 b072.bsc
2018/08/28  07:25       100,358,641 b0729bbd04f7e6a7d6b42b0d53cb472490fd66c17bfd02e131655939d14cc1ba.gz

ChinChangYang added a commit to ChinChangYang/leela-zero that referenced this issue Aug 25, 2018
* Script for quantizing weights.

Use smaller precision to store the weights to decrease the file size.

See discussion in issue leela-zero#1733.

Pull request leela-zero#1736.
@l1t1

l1t1 commented Aug 28, 2018

Some more compression by deleting the decimal point, e.g.
4.82e-5 -> 482e-7
.000639 -> 639e-6

before
d:b1c3 compressed 180740951 into 35075710 in 8.752 seconds.
after
d:b1c3 compressed 180739513 into 35076652 in 8.705 seconds.

The text size is smaller, but the size after compression is bigger.

@jonlave

jonlave commented Aug 31, 2018

When are we switching to 40b? There have been several cycles now. Did we face any issues recently that are delaying the switchover?

@roy7
Collaborator

roy7 commented Aug 31, 2018

The only thing left is that gcp is travelling; once he is back we might make the move. :)

@l1t1

l1t1 commented Sep 1, 2018

Deleting the decimal point:

def format_n(x):
    x = float(x)
    x = '{:.3g}'.format(x)
    x = x.replace('e-0', 'e-')
    if x.startswith('0.'):
        x = x[1:]
    if x.startswith('-0.'):
        x = '-' + x[2:]
    # delete point
    y = x.find('e')
    if y > 0:
        z = x.find('.')
        s = x[:y].replace('.', '') + 'e' + str(int(x[y + 1:]) - (y - z) + 1)
        if len(s) < len(x):
            return s
    return x

def format_n(x):
    x = float(x)
    e = '{:.2e}'.format(x)
    x = '{:.3g}'.format(x)
    if len(e) == len(x):
        x = e
    x = x.replace('e-0', 'e-')
    if x.startswith('0.'):
        x = x[1:]
    if x.startswith('-0.'):
        x = '-' + x[2:]
    # delete point
    y = x.find('e')
    if y > 0:
        z = x.find('.')
        # delete 0
        n = int(x[y + 1:])
        if x[y - 1] == '0':
            y = y - 1
        if x[y - 1] == '0':
            y = y - 1
        s = x[:y].replace('.', '') + 'e' + str(n - (y - z) + 1)
        if len(s) < len(x):
            return s
    return x

It shows that the bigger file has the smaller compressed result, so the conversion is useless:
d:\leelazb_quantized.txt compressed 22933127 into 4849952 in 3.760 seconds.
d:\leelazb_quantized1.txt compressed 23120676 into 4833380 in 3.837 seconds.

@l1t1

l1t1 commented Sep 3, 2018

Anyway, the bsc tool from #1733 (comment)
is worth considering; it compresses to about 0.7 of the gz size and is fast.

@l1t1

l1t1 commented Sep 4, 2018

I also tested Google Brotli; its performance is much worse.

AncalagonX pushed a commit to AncalagonX/leela-zero that referenced this issue Sep 5, 2018
Use smaller precision to store the weights to decrease the file size.

See discussion in issue leela-zero#1733.

Pull request leela-zero#1736.
@l1t1

l1t1 commented Sep 10, 2018

https://github.com/moinakg/pcompress
can use the bsc algorithm on Linux

AncalagonX pushed a commit to AncalagonX/leela-zero that referenced this issue Oct 21, 2018
Use smaller precision to store the weights to decrease the file size.

See discussion in issue leela-zero#1733.

Pull request leela-zero#1736.
@gcp
Member Author

gcp commented Oct 23, 2018

Note that if we use another codec instead of zlib, we still need to be able to include it in Leela Zero (so it needs to have suitable licensing).

zfp sounds neat. I'll close this as we now have a simple text-based quantizer in use.

@gcp gcp closed this as completed Oct 23, 2018