LZ winrate vs various anchors #1557
Comments
Nice job! |
nice diagram |
Nice work! We've been doing something similar in MiniGo, where we play the best network against ~20 older networks (last, 2 ago, 4 ago, 10 ago...). That has reduced the noise in our signal, but we need it more because we often see -200 Elo followed by +200 Elo. |
Chart updated
|
Can you give us the numbers in raw text? One could use those to estimate the self-play inflation factor. |
|
If it's 200 games per match, why are some winrates not a multiple of 0.5%? |
@ryouiki Could you please add testing LeelaMaster G9? It seems G9 is a bit stronger than G13... |
@TFiFiE Even though I gave the -m10 option, duplicated games can still occur. Duplicated matches are not counted. LZ#150 vs ELF Match Detail
match 20 = match 98 |
Recent networks beat G9 easily.
each cell represents winrate from 200 matches (400 visits) |
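For a sense of how much noise a 200-game cell carries, here is a minimal sketch of a Wilson score interval for a win rate. The function name `wilson_interval` and the example counts are illustrative assumptions, not something used in the thread.

```python
import math
from typing import Tuple

def wilson_interval(wins: int, games: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate (z = 1.96)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1.0 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half

# Hypothetical cell: 100 wins out of 200 games.
low, high = wilson_interval(100, 200)
print(f"winrate 50.0%, interval {low:.1%} to {high:.1%}")
```

With 200 games the interval is several percentage points wide on each side, so small differences between neighbouring networks can be noise.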
@ryouiki Ok, thanks a lot. |
If not every match consists of 200 (unique) games, maybe (also) give it as "wins-losses" or "wins/games" to prevent the unnecessary bit of information loss. |
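The "wins/games" reporting suggested above could be sketched like this, counting each distinct game only once. The representation of a game as a `(moves, won)` pair and the name `tally_unique` are assumptions for illustration; the thread does not say how duplicates were actually detected.

```python
from typing import Iterable, Tuple

def tally_unique(games: Iterable[Tuple[str, bool]]) -> Tuple[int, int]:
    """Return (wins, unique_games), counting each distinct move sequence once."""
    seen = set()
    wins = 0
    for moves, won in games:
        if moves in seen:
            continue  # duplicated game: not counted again
        seen.add(moves)
        wins += int(won)
    return wins, len(seen)

# Hypothetical records: the third game repeats the first.
games = [("B[pd];W[dp]", True), ("B[dd];W[pp]", False), ("B[pd];W[dp]", True)]
wins, total = tally_unique(games)
print(f"{wins}/{total}")
```

Once a duplicate is dropped the denominator is no longer 200, which is why a cell such as 100/199 is not a multiple of 0.5%.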
@TFiFiE There were 14 duplicated matches from total 24200 matches. |
Your numbers suggest the supposed rating differences of the networks are in reality inflated by a factor of about 4. |
An alternative interpretation is higher visits amplify strength differences. |
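To check an inflation estimate like the one above, a measured win rate can be converted to an Elo difference with the standard logistic Elo formula and compared with the nominal rating gap. This is a rough sketch; the helper names and the example numbers (a nominal 400 Elo gap, a measured 64% win rate) are hypothetical and not taken from the chart.

```python
import math

def elo_diff(winrate: float) -> float:
    """Elo difference implied by a win rate under the logistic Elo model."""
    if not 0.0 < winrate < 1.0:
        raise ValueError("winrate must be strictly between 0 and 1")
    return 400.0 * math.log10(winrate / (1.0 - winrate))

def inflation_factor(nominal_diff: float, winrate: float) -> float:
    """Ratio of the nominal rating gap to the gap implied by the measured win rate."""
    return nominal_diff / elo_diff(winrate)

# Hypothetical example: ratings say +400 Elo, but the match ended at 64%.
print(round(elo_diff(0.64), 1), round(inflation_factor(400.0, 0.64), 2))
```

The ratio only says the two numbers disagree. As noted above, part of the gap may also come from the different visit counts.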
please add b20v16 compare result |
Chart updated LZ#151, LZ#152 matches. |
Chart updated LZ#153 matches.
|
best_5b.txt.gz |
I'll test the 64x5 net. (trained by @NhanHo)
each cell represents winrate from 200 matches (400 visits) |
Seems impressive. This net would probably work nicely for mobile? |
how about the strongest 15b weight 4dad, maybe lz153 will be the last king of 15b. |
@l1t1 Maybe I could try. However, it will become worthless when there comes legit promotion. |
gcp said in a post that he would start the next size after a reasonably long period |
#1113 |
@ryouiki can you upload the 200 games archive of lz 153 vs 20b v17, i find them more convenient to watch on my local sgf reader. thanks |
@wonderingabout Here is LZ153 vs 20B v17 Feel free to request any SGFs you would like to watch. |
thanks |
btw i wanted to mention last time that around 5-10% of the games in the old archive, and in this one too, are wrongly resigned. i linked it here: http://eidogo.com/#3EYojEPLI edit : game 9 of the archive too edit 2 : game 15 of the archive seems so too edit 3: game 21 too it seems |
hmm that's frustrating, where does this happen? In official games too? |
(btw I would also resign if my opponent would continue playing in such a bad position :P , but I should assume leelaz has no such emotions :) ) |
This variation comes from the lower visit count. FYI, I use the option -r10 to speed up testing, while official matches use -r5. |
i see, but the data should be taken with a 10% delta considering the wrongly resigned games, i think |
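One way to read the "10% delta" is as a worst-case bound: if up to a given fraction of games ended with the wrong result, the true win rate lies somewhere between "all suspect games were really losses" and "all were really wins". A minimal sketch, with a hypothetical function name and made-up counts:

```python
from typing import Tuple

def winrate_bounds(wins: int, games: int, suspect_fraction: float) -> Tuple[float, float]:
    """Worst-case win rate range if up to suspect_fraction of games had the wrong result."""
    suspect = round(games * suspect_fraction)
    lowest = max(wins - suspect, 0) / games       # every suspect game was a wrongly scored win
    highest = min(wins + suspect, games) / games  # every suspect game was a wrongly scored loss
    return lowest, highest

# Hypothetical cell: 120 wins in 200 games, up to 10% of results in doubt.
print(winrate_bounds(120, 200, 0.10))
```

This is deliberately pessimistic. Wrong resignations probably hit both sides, so the real shift is likely smaller.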
I'm testing 20B V18 and LZ#154. However, they are not performing as well as expected so far. |
@ryouiki you mean they are both weaker than their Elo shows? |
In my test, 20 block V18 was inferior to V17.
each cell represents winrate from 200 matches (400 visits) |
20b v18 failed at 0.3 |
Could you also run the tests against LeelaZero + PhoenixGo's weights (#1477 )? Another external benchmark would be very interesting! Thanks. |
I tried! But, LZ+phoenixGo was not stable to test massive matches. Crashed after 30-40 games. 😟 |
@ryouiki can i kindly ask you again the game archives of lz 156 vs 20b v17 + lz 157 vs 20b v17 thanks edit : v18 and v19 showed to be weaker, so i'm interested to see how v17 plays against latest official lz networks rather |
Is it possible to add a GX37? |
LZ#156vs20V17(v400).zip Graph updated. (some data points are missing though) |
thanks a lot, i greatly appreciate |
hi again ryouiki thanks again |
@ryouiki edit : if you do it for lz 164, i'd rather like lz163+164 vs lz 157 |
I ran over ten thousand matches to test recent LZ networks against various anchor networks, to see how well each LZ network performs against other known networks.
e.g. Each green dot is the win rate (y) of one LZ network (x) versus the ELF network (200 visits / 200 matches). The green line shows the win rate trend.
Sometimes a newer network performs worse than its predecessor against a certain anchor.
However, they seem to follow the winning trends in the long run.
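The post does not say how the trend line in the chart was computed. As one possible approach, a least-squares straight line through the (network number, win rate) points would smooth out the match-to-match noise. The function name and the data points below are made up for illustration.

```python
from typing import Sequence, Tuple

def fit_trend(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical win rates of consecutive networks against one anchor.
networks = [150, 151, 152, 153]
winrates = [0.40, 0.37, 0.45, 0.47]
slope, intercept = fit_trend(networks, winrates)
print(f"trend: {slope:+.3f} win rate per network")
```

A positive slope despite individual dips matches the observation that a newer network can lose ground against one anchor while the long-run trend still rises.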
Edit) update : ~LZ#150, LeelaMaster G13
Edit) update : duplicated match fixed
Edit) update 06.28 : ~LZ#152, 20 Block V16
Edit) update 07.01 : ~LZ#153
Edit) update 07.04 : 20 Block V17
Edit) update 07.23 : ~LZ#157