
Should we update our selection process? #2143

Open
Friday9i opened this issue Jan 10, 2019 · 29 comments

Comments

@Friday9i

Friday9i commented Jan 10, 2019

Should we update our selection process?
Here, I will try to demonstrate as rigorously as possible that "yes, we should change the match conditions, as they are biased and quite probably suboptimal!" ;-(
The good news is that alternative approaches are available:

  • One is clearly better (some will debate this point for sure, but apparently Deepmind did it, so that should somewhat settle the discussion ;-). Unfortunately, this approach does not seem very easy to implement...
  • Fortunately, another approach should be efficient, is probably easy to put in place for LZ, and has been used almost systematically in computer chess tournaments for 20 years: that should inspire us!

Here is the global conclusion:
All in all, our selection process should be upgraded:
1) possibly by using an opening book in match games (as done for chess): that would improve the situation, limiting the clear risk of bias and rock-paper-scissors we may experience today, thanks to tests in more diverse situations
2) or ideally by matching candidate nets against a panel of opponents (as apparently done by Deepmind for AlphaZero chess), but that's not easy to set up...

Hence, we should think about upgrading our selection process and match conditions.

Now, the details!
Sorry, it's a bit long, but I'm trying to be explicit and as rigorous as possible in the demonstration: that takes some space...

Reminder: here are the usual arguments that our current "selection process" (ie matching candidate nets vs the current best net in a match without noise, and applying SPRT) is efficient:

  • Experimentally, the selection process works quite well, as LZ improves over time! I have no doubt about that and am not contesting the fact that it works ;-)
  • "The current match process tests LZ's absolute strength without noise, that's the best we can do". I'm contesting this point, as this test seems biased, with a strong risk of a rock-paper-scissors situation (leading to a suboptimal selection process). And indeed, LZ's progression is quite slow, as we will see below.

What could be a better selection process?
Let's imagine we get access to several versions of AlphaGo, several versions of MiniGo, several versions of Tencent's engine, etc... for a total of 40 different Go engines.
From that, we could easily build an extremely solid selection process:

  1. First, let's test these engines against the current best net, LZ200, to find the approximate number of visits each of them needs to get a ~50% winrate against LZ200@1600 visits. Eg the currently weaker MiniGo may need ~5000 visits to match LZ200@1600 visits, while the stronger AlphaGo may only need ~200 visits and AZ 250 visits.
    Hence, a match of 400 games (5 games with White and 5 with Black against each of these 40 engines) should give a 50% winrate for LZ200 (with of course some statistical noise).
  2. To test a new "LZ candidate net" (which would become LZ201 if promoted), we could test it against this panel of 40 opponents (without noise) and select it if it gets more than a 55% winrate. I think we can agree that "this would be an extremely reliable test: a candidate net getting a >55% winrate would be a true improvement over LZ200, which only gets a 50% winrate against this same panel of opponents". A minimal sketch of this procedure follows below.
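To make this concrete, here is a minimal Python sketch of such a panel test, assuming a hypothetical play_games(a, visits_a, b, visits_b, n) helper that plays n games (half with each color) and returns a's winrate; the crude calibration search and all numbers are illustrative, not a real implementation:

```python
def calibrate_panel(engines, best_net, play_games, target=0.50):
    """Find, for each engine, a visit count giving roughly `target` winrate vs best_net@1600."""
    panel = []
    for engine in engines:
        visits = 1600
        for _ in range(8):                       # crude doubling/halving search
            wr = play_games(engine, visits, best_net, 1600, n=50)
            if abs(wr - target) < 0.05:          # close enough to ~50%
                break
            visits = visits * 2 if wr < target else visits // 2
        panel.append((engine, visits))
    return panel

def gate_candidate(candidate, panel, play_games, threshold=0.55, games_per_opponent=10):
    """Promote the candidate iff it scores above `threshold` against the whole panel."""
    games = wins = 0
    for engine, visits in panel:
        wr = play_games(candidate, 1600, engine, visits, n=games_per_opponent)
        wins += wr * games_per_opponent
        games += games_per_opponent
    return wins / games > threshold
```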

In these match games, which openings would be played? Answer: of course, it would be a diverse panel of openings having almost nothing in common with the 30-move opening currently played systematically in match games between the LZ candidate net and the LZ official net (please take a look for yourself at http://zero.sjeng.org/match-games/5c37063ff06758029e301a8c for example to be convinced: symmetries differ, but there are only 1 or 2 different games played up to move 30 in all match games!). In a match against various engines, we would of course get many different openings, as some engines like to open with the 3-4 point, some may play the 5-4 point, answers after an initial 4-4 point differ, etc.
In this test against a panel of engines, LZ would play various openings and would probably never play its "preferred 30-move opening" (as other engines have no reason to play that specific opening): is that an issue for the reliability of the test? The answer IMHO is "Of course it's not a problem, it's even a clear advantage (vs current match conditions), because it tests LZ in a wide and diverse panel of situations, revealing much more precisely its global strength and its ability to play in diverse situations" ;-)
Currently, our selection process leads to "a chosen 30-move opening" which is played again and again, and if the candidate net gets a >55% winrate after this single opening against a single engine, it gets a PASS. Isn't that a bit narrow, meaning there is a clear bias?
Hopefully, we can agree that testing a candidate LZ net against a large panel of opponents (and consequently against many diverse openings) is a more solid test than the current selection process (ie a match against one single engine and after a single 30-move opening)!

If not clear enough, let's add an example:
Let's imagine a candidate net getting a 58% winrate in this test against the panel of 40 other engines (vs a 50% winrate for LZ200 against this same panel): it seems that the candidate net is a "clear improvement" over LZ200. Still, what would be its result in a match against LZ200? If it has "bad luck", the single opening chosen against LZ200 (played again and again, for ~30 moves, in every game!) may unfortunately put it at a disadvantage: in that case, its winrate may be significantly below 55% and it would be rejected by our current selection process. What a pity: we reject a strong net due to a single bad opening, ignoring all the other things it learned better than LZ200!
On the contrary, a candidate net getting a 50% winrate against the panel of opponents (ie not better than LZ200) may get lucky against LZ200 if the 30-move opening chosen against LZ200 happens to be good for it, and hence it could perfectly well get a PASS with our current selection process (despite not being stronger, or possibly even being slightly worse)...

Because we test several nets, there is a statistically large chance that we select nets mainly because they get an advantage from a good opening rather than because they are overall better nets...! This is the perfect context for a "rock-paper-scissors situation": among several tested nets, one chooses a good opening against LZ200 and is selected to become LZ201 (while not being stronger against other engines), then LZ202 is selected against LZ201 for the same reason, but in the end no progress was made, and LZ200 may beat LZ202 because the chosen opening favours LZ200 vs LZ202...
Conclusion: our current match conditions are biased and may lead to rock-paper-scissors issues. On the contrary, matching candidate nets against a panel of opponents (leading to a large panel of diverse openings) would be significantly better at revealing the "true Go strength of candidate nets" (as it's hard to imagine how a rock-paper-scissors situation could appear against a panel of competing engines and various openings).

And this is not a theoretical situation:

  • LC0 (Leela Chess) faced a similar rock-paper-scissors issue, recently gaining 2000 self-play Elo points (!) while NOT improving at all against other engines... (note: the selection process is different as there is no gating, but the conclusion holds: there is a real risk of a rock-paper-scissors issue arising from "iterative self-play improvements")
  • AlphaZero chess faced a comparable problem too (aggravated by draws), so Deepmind implemented a promotion process against a panel of opponents, cf Demis Hassabis's interview during the World Chess Championship (between humans ;-), more info here (AlphaZero will play chess #2038, but the YouTube video is unfortunately not available anymore).
  • Today's test of LZ200 vs ELFv1 shows no improvement in winrate since LZ195, despite 5 promotions (LZ195 also had a ~66% winrate against ELFv1): that may be the unfortunate result of this rock-paper-scissors issue caused by our biased selection process based on a single opening played again and again.

Conclusion 2: rock-paper-scissors happens in real life with iterative selection processes such as the one used by LZ. Hence, our selection process may be plagued by that same issue, and alternatives should be studied...

What alternative selection process(es) could be used?
Approach 1: ideally, LZ candidates would be tested against a panel of opponents, as that seems close to optimal, but it is not easy to set up. We could compare the results of LZ200 and of candidate nets against a panel of engines, such as for example LZ157@~5K visits, LZ174@~3K visits, LZ199@1600 visits and possibly some bjiyxo nets, 20b nets, ELFv0 and ELFv1, MiniGo nets, ... If a candidate net performs significantly better than LZ200 against this panel of engines (eg getting 53% vs only 46% for LZ200), we should select it; if it is not significantly better, we should reject it.
Approach 2: we could simply play match games of the candidate net vs LZ200 while using a large panel of fair openings (each played once with White and once with Black) to test networks in a good diversity of situations, and apply SPRT (a minimal sketch follows below). That's probably not as good as testing a candidate net against a panel of several engines, but it seems better than simply testing a candidate net against the single LZ200 net with a single "30-move preferred opening"... An example of openings that could be used is here: #2104, with a panel of "120 fair 2-move openings" (by "fair", I mean the winrate stays close to the winrate of optimal moves as judged by LZ, and also close to 50%).
Approach 3: approaches 1 and 2 may be combined, by testing candidate nets against a panel of different engines and after a panel of opening moves (each played twice, once with White and once with Black).
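As a rough illustration of Approach 2, here is a minimal sketch of an SPRT loop over a panel of fair openings, each played once per color. The Elo hypotheses (0 vs ~35, roughly the current 55% gate) are illustrative, and play_game() is a hypothetical helper returning 1 when the candidate wins:

```python
import math

def elo_to_p(elo):
    """Expected score for an Elo advantage `elo`."""
    return 1.0 / (1.0 + 10 ** (-elo / 400.0))

def sprt(wins, losses, elo0=0.0, elo1=35.0, alpha=0.05, beta=0.05):
    """Return 'accept', 'reject' or 'continue' for H1: the candidate is elo1 stronger."""
    p0, p1 = elo_to_p(elo0), elo_to_p(elo1)
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))
    lower, upper = math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)
    if llr > upper:
        return "accept"
    if llr < lower:
        return "reject"
    return "continue"

def run_match(openings, play_game):
    """Step through the opening panel, both colors each, applying SPRT after every game."""
    wins = losses = 0
    for opening in openings:                         # e.g. the ~120 fair 2-move openings
        for candidate_color in ("black", "white"):   # each opening once per color
            if play_game(opening, candidate_color):
                wins += 1
            else:
                losses += 1
            verdict = sprt(wins, losses)
            if verdict != "continue":
                return verdict
    return "continue"
```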

Note: in computer chess tournaments, opening books are almost universally used, after being introduced by Nunn 20 years ago! So I'm not reinventing the wheel ;-)
Cf this article presenting the context and the approach (https://en.chessbase.com/post/test-your-engines-the-silver-openings-suite) and https://chess.com/computer-chess-championship for example (where you can find the tournament rules: the first round does not use openings as diversity comes from the large panel of engines, then for the 2nd round an opening book is used). The TCEC tournament (http://tcec.chessdom.com) where LC0 plays also uses an opening book.

Global conclusion: all in all, our selection process should be upgraded:
1) possibly by using an opening book (as done for chess): that would improve the situation, limiting the clear risk of bias and rock-paper-scissors we may experience today, thanks to tests in more diverse situations
2) or ideally by matching candidate nets against a panel of opponents (as apparently done by Deepmind for AlphaZero chess), but that's not easy to set up...
3) Playing match games with for example "-m 20 --randomvisits 200" should also give some welcome diversity in games, but it may introduce other biases (eg LZ and ELF are quite different, so the impact of the noise may be unfair). Hence it's not clear whether this approach would be beneficial or not.

Note: playing a part of the training games with a "zero-opening book" may be useful to get more diversity too (in addition to the random noise -m 30), but that's another story ;-)

@Ishinoshita

Ishinoshita commented Jan 10, 2019

@Friday9i Yeah, lack of diversity in match games is always a potential threat with respect to the rock-paper-scissors pitfall, so I personally would feel more confident with more diversity (but intuition sometimes sucks...).

Approach 2.2: implement the AZ paper 2 chess match conditions against SF (randomization by sampling suboptimal moves in the search with some softmax temperature, within some winrate bound, during say the first 30 moves). Less cumbersome than managing opening books, yet still potentially a larger spectrum of the opening lines the policy 'contains'.

Would still be LZ X vs LZ X+1, but LZ X-larger policy spectrum vs LZ X+1-larger policy spectrum.
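A minimal sketch of what such randomization could look like, assuming a hypothetical search_results dict of {move: (visits, winrate)}; the bound, temperature and visit-count sampling are illustrative, not LZ's or LC0's actual implementation:

```python
import random

def pick_move(search_results, move_number, max_moves=30, wr_bound=0.02, temp=1.0):
    """search_results: {move: (visits, winrate)} as reported by the tree search."""
    if move_number >= max_moves:
        # past the opening: deterministic, most-visited move as usual
        return max(search_results, key=lambda m: search_results[m][0])
    # opening phase: keep only moves whose winrate is within wr_bound of the best,
    # then sample proportionally to (visits+1)^(1/temp)
    best_wr = max(wr for _, wr in search_results.values())
    allowed = {m: v for m, (v, wr) in search_results.items() if best_wr - wr <= wr_bound}
    weights = [(v + 1) ** (1.0 / temp) for v in allowed.values()]   # +1 avoids zero weights
    return random.choices(list(allowed), weights=weights, k=1)[0]
```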

The AZ2 option was already added to the LC0 code if I recall correctly (don't know if it's used for promotion yet, though).
See the LCZero blog about v0.19.1-rc2.

Edit: With opening randomization, the SPRT threshold might need to be revisited. Or not. Its function is just to guarantee some sufficient gating; its absolute meaning does not matter. You already pointed out that its meaning is subject to some reservations, given the use of a single opening line...

@l1t1

l1t1 commented Jan 10, 2019

so a short version?

@Friday9i

Friday9i commented Jan 10, 2019

@Ishinoshita Thanks for your comments, and indeed an AZ2 option to introduce diversity is already implemented in LC0 (but I don't know either if it is already used for promotion). Regarding SPRT, it would not need to be changed if we use a panel of openings.
@l1t1 lol, I admit it is long :-)
Short version:

  1. Match games are all identical for 30 moves, which is a bias, as we are testing the strength of candidate nets against the single current net with a single 30-move opening, with a high (and demonstrated) risk of a rock-paper-scissors situation... Ie candidate nets may be promoted just because they beat the current net with an innovative opening, without being better overall.
  2. A fair test of candidate nets would be to test them against a large panel of different engines: if a candidate does significantly better than the official net, it should be promoted (and rejected otherwise). Deepmind implemented that approach (and it leads to testing the candidate net in a large panel of diverse openings, which limits the risk of rock-paper-scissors)! But it's not easy to implement ;-(
  3. A simple approach would limit the risks and be easy to implement: test candidate nets against the current net after playing a large panel of openings (each twice, once with Black and once with White; see the scheduling sketch below). That gives the needed diversity (to avoid rock-paper-scissors), it is fair (because even if an opening is a bit unbalanced, the candidate net plays it both with White and with Black ;-) and it is easy to implement. BTW, it has been used in computer chess tournaments for 20 years, for exactly the same reason: to avoid rock-paper-scissors (without noise and with deterministic engines, it's easy to "cheat" by knowing the top engines' moves and preparing answers to beat them; forcing engines to play a set of random openings twice negates that risk and makes tournaments fair).
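A minimal sketch of the scheduling in point 3 (each opening played once per color); the 2-move openings listed here are placeholders, the real panel is in #2104:

```python
OPENINGS = [("Q16", "D3"), ("R16", "C4"), ("Q15", "D17")]   # placeholder 2-move openings

def build_schedule(openings):
    """Return (opening, candidate_color) pairs so each opening is played with both colors."""
    return [(opening, color) for opening in openings for color in ("B", "W")]

schedule = build_schedule(OPENINGS)
# Even if one opening slightly favours a side, each net gets it once per color,
# so the imbalance cancels out over the pair of games.
```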

@alreadydone

alreadydone commented Jan 10, 2019

A problem is that autogtp doesn't currently process the --randomtemp option; otherwise I would start by trying --randomtemp 0.001 --randomcnt 30. --randomtemp 1 (the default) would be too high IMO.

https://github.com/gcp/leela-zero/blob/6d1649774ac8a2524185d02b9c5aca2d4f3c30e1/autogtp/Management.cpp#L259-L273

@Friday9i

Friday9i commented Jan 10, 2019

Yeah, or combining -m 30 and --randomvisits 200 (or something around 200): from tests I did, it has quite a small impact on playing strength and provides reasonable diversity. But it is still almost always the same opening for part of the 30 moves: better, but not ideal.

@alreadydone

I prefer randomtemp, but either way we need a new release of autogtp :)

@gjm11

gjm11 commented Jan 10, 2019

It seems like it would be a bad idea to adopt a selection scheme that doesn't give any advantage to a network that plays better openings, because playing better openings is part of what it means to be a stronger player. Some sort of panel selection would be lovely, given a good selection of opponents of strength comparable to current-LZ that can be implemented inside the existing LZ code. That includes earlier LZ versions, and both versions of Elf, and ... what else?

@Friday9i

Friday9i commented Jan 10, 2019

@gjm11 I agree that opening skills are important, but 2-move openings are just the very beginning of the game: a net still has all the time it needs to demonstrate its opening skills, and it will need to demonstrate those skills on various openings rather than on a single one :-). And there are ~120 fair 2-move openings available, which is enough to generate 240 games without any risk of redundancy.
But yeah, if I understand well your point on opening panel selection, it would be nice indeed, and it is in line with the idea that Albert Silver proposes in the link above (here it is again: https://en.chessbase.com/post/test-your-engines-the-silver-openings-suite), isn't it?

@l1t1

l1t1 commented Jan 11, 2019

So many methods and so much data are used in training, but the method for judging a weight's strength is still playing games. Is there a more mathematical method?

@l1t1

l1t1 commented Jan 11, 2019

I have an idea: each move of a game has a winrate according to a specific weight, so can we conclude something meaningful by comparing the winrate series of different weights?
eg if ELFv1 gives a series, say .55 .48 .51 .36 ...
while LZ200 gives .53 .49 .50 .40 ...
we calculate the sum of variances
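A minimal sketch of that comparison, using the illustrative numbers above (whether the divergence of the two series says anything about strength is exactly the open question):

```python
elf_v1 = [0.55, 0.48, 0.51, 0.36]   # illustrative numbers from the comment
lz_200 = [0.53, 0.49, 0.50, 0.40]

# per-move difference between the two evaluations of the same game
diffs = [a - b for a, b in zip(elf_v1, lz_200)]
mean_diff = sum(diffs) / len(diffs)
variance = sum((d - mean_diff) ** 2 for d in diffs) / len(diffs)
print(f"mean difference {mean_diff:+.3f}, variance of differences {variance:.5f}")
```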

@Splee99

Splee99 commented Jan 11, 2019

From my observations of games between different bots, a higher estimate of the win rate could mean either the bot is weaker because it couldn't see any "hidden" danger or it is stronger because it sees the "zigzag" route to victory earlier.

@featurecat

Good idea, I think this is a necessary and obvious improvement. Lacking a wide variety of strong opponents, we can use an option like -m, so that mathematically we are 50% likely to get 400 distinctly different games for the first 30 moves or so.

@Marcin1960

What about ignoring early wins? At least for a while?

@betterworld

betterworld commented Jan 14, 2019

I hacked up something to force LZ to play many different openings. Of course you could use the random move feature, but this is another approach based on my avoidmove branch. For the openings I created a new branch: https://github.com/betterworld/leela-zero/commits/avoid-opening

The description (as in the commit messages):

The GTP command is lz-analyze avoid-opening <move_number> <directory>.
It will read all *.sgf files in <directory> and add the first
<move_number> moves of each game to a list of known openings.

Then LZ will only play openings that do not occur in this list.
How it works:

If all <move_number>-1 moves are from a known opening, then the next
move will be avoided as if it were an illegal move. So if <move_number>
is even, White will need to find a different move, and if it is odd,
Black needs to find a different move. Of course, LZ already knows about
this at the beginning of the game, so it may choose to play new moves
earlier in order to avoid getting into the variation where the forbidden
move would come up.

The GTP command lz-analyze b avoid-opening <moves> max=<count> <dirpath>
specifies that only those openings will be avoided that occur more than
<count> times in the SGF files in the directory.
If one opening occurs with two different symmetries, it will be counted
as two occurrences.

This might not be an option for the official match games because it will be tricky to get this to work in a distributed setup. However I thought this might be a nice way to generate a bunch of games with various openings (which I haven't tried yet).
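For illustration, here is a minimal Python sketch of the bookkeeping this command describes (collect the first <move_number> moves of every *.sgf in a directory and keep the openings seen more than <count> times); the regex-based SGF parsing and file paths are assumptions, not the branch's actual code:

```python
import glob
import re
from collections import Counter

MOVE_RE = re.compile(r";\s*([BW])\[([a-t]{0,2})\]")   # naive SGF move matcher

def opening_of(sgf_text, move_number):
    """First `move_number` moves of a game, e.g. (('B', 'pd'), ('W', 'dp'), ...)."""
    return tuple(MOVE_RE.findall(sgf_text)[:move_number])

def known_openings(directory, move_number, max_count=0):
    """Openings occurring more than `max_count` times, mirroring the max=<count> option."""
    counts = Counter()
    for path in glob.glob(f"{directory}/*.sgf"):
        with open(path) as f:
            counts[opening_of(f.read(), move_number)] += 1
    # symmetric duplicates are counted separately, as in the branch description
    return {op for op, n in counts.items() if n > max_count}

avoid = known_openings("./match-sgfs", 30)
print(f"{len(avoid)} distinct 30-move openings to avoid")
```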

@Friday9i

Friday9i commented Jan 14, 2019

Nice, great work!
And that could be used to automatically create a panel of zero openings:

  1. use 1600 playouts to select preferred opening
  2. use 1600 playouts while removing the preferred move and, if predicted winrate is close enough (eg within 2%) to the winrate of the best move, this is the 2nd opening move
  3. iterate until the threshold (of 2%) is exceeded, and that finishes the first black moves (basically, you should get 5 moves once symmetries are taken into account with a 2% threshold: 4-4, 4-3, 3-3, 5-4, 5-3)
  4. go to 2nd move (ie the first white move) and do the same for each of the 5 first black moves: select the best one, then the 2nd best, ...

That's basically what I did manually to create a 2-ply opening book, getting around 120 openings
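A minimal sketch of that loop, where evaluate(prefix) is a hypothetical stand-in for a ~1600-playout engine query returning {move: winrate} after the given moves; the 2% threshold and 2-ply depth are the ones described above:

```python
def moves_within_threshold(evaluate, prefix, threshold=0.02):
    """All moves whose winrate is within `threshold` of the best move after `prefix`."""
    winrates = evaluate(prefix)                 # e.g. {"Q16": 0.50, "R16": 0.49, ...}
    best = max(winrates.values())
    return [m for m, wr in winrates.items() if best - wr <= threshold]

def build_book(evaluate, depth=2, threshold=0.02):
    """Enumerate every `depth`-ply opening whose moves all stay within the threshold."""
    openings = [[]]
    for _ in range(depth):
        openings = [prefix + [move]
                    for prefix in openings
                    for move in moves_within_threshold(evaluate, prefix, threshold)]
    return openings

# With ~5 fair first moves (4-4, 4-3, 3-3, 5-4, 5-3 up to symmetry) and a similar
# number of fair replies, a 2-ply book of roughly 100+ openings falls out.
```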

@Splee99

Splee99 commented Jan 14, 2019

Even in self-play games I saw many interesting openings, although they were created "randomly". The question is, however, how to let the new network adopt any new openings. Another question is then: why does LZ stick with the old knowledge and refuse to learn new things?

@l1t1

l1t1 commented Jan 14, 2019

http://www.yss-aya.com/cgos/19x19/cross/LZ_9229_192b15_p6k.html
shows 9229 p6000 is a good benchmark

@Friday9i

New idea for an "Approach 4", to get more diversity in openings in test matches (reminder: diversity limits the risk of rock-paper-scissors issues :-):
Instead of playing all match games with identical conditions (no noise and 1600 visits, which leads to identical openings for ~30 moves), we could mix the number of visits, for example using a panel of 500, 600, ..., 1600, ..., 3000 visits for the first 10 or 20 moves (for both nets, of course), then come back to 1600 visits.
Advantage: both nets would still play at full strength, but as the moves played depend on the number of visits, that would provide a reasonable diversity of openings.
Disadvantages: it still requires an update of LZ to manage 2 different visit numbers (1. the number of visits for the first n moves and 2. a different number of visits for the following moves), and the diversity would still be relatively low (maybe only 5 to 10 different openings? But that's already much better than just 1 or 2 openings for 30 moves as of today).
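A minimal sketch of how the visit schedule could be assigned; the panel values, the 20-move cutoff and the visits_for_move() hook are illustrative assumptions:

```python
VISIT_PANEL = [500, 600, 800, 1000, 1200, 1600, 2000, 2400, 3000]
OPENING_MOVES = 20        # both nets use the varied budget for these first moves
STANDARD_VISITS = 1600    # and the usual budget afterwards

def visits_for_move(game_index, move_index):
    """Visit budget for a given move of a given game (identical for both nets)."""
    if move_index < OPENING_MOVES:
        return VISIT_PANEL[game_index % len(VISIT_PANEL)]
    return STANDARD_VISITS

# e.g. game 0 opens at 500 visits, game 1 at 600, ..., giving up to
# len(VISIT_PANEL) distinct opening lines instead of one or two.
```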

@l1t1

l1t1 commented Jan 18, 2019

can we use the top 3 Elo weights (which do not need to be promoted) to play self-play games?

@dbosst

dbosst commented Jan 23, 2019

If a new candidate only shows improvement for a specific opening then using a variety of openings for selection will mean making it much more difficult to pass -- we may not ever descend down the mountain because we can only take gradual steps.

I think only using a variety of opponents might be the right idea, at least until progress stalls -- then use a panel of openings and >50% gating.

@Friday9i

Friday9i commented Jan 24, 2019

@dbosst in line ;-)
Selecting successive nets on a single opening is like selecting decathletes by only looking at their performance in the 100 m...
For sure, among decathletes, the best 100 m performers have a good chance of being very good decathletes overall, but this test is biased: there is certainly a correlation between top 100 m skills and top decathlon skills, but the correlation is not perfect (top decathletes may not be experts at the 100 m, and top 100 m performers may not be very good at the decathlon overall)...
That's what we are doing with LZ: we are selecting nets able to get an upside vs a single opponent in the opening (ie "top 100 m performers") and not too bad thereafter. For sure this is correlated with overall Go performance (the decathlon test), but the test is biased, so it's not an efficient selection process. To get an efficient selection, we should test nets in a wide variety of situations: forcing them to play selection matches against a panel of various engines (eg LZ + MiniGo + PhoenixGo) and/or a panel of various LZ nets (eg LZ157@3K visits + LZ nets from LZ190 to LZ200 for example) and/or a panel of openings would ensure a wide variety of situations, ie a less biased test.

By the way, I'm currently testing LZ202 vs LZ199/LZ200/LZ201 with different set-ups and with openings, and it's performing significantly worse... It seems we have selected a "weaker overall net": LZ202 was apparently able to outperform LZ201 sufficiently in the opening to get an edge (and to keep that edge until the endgame), reach a 55% winrate overall and be selected. We selected an excellent 100 m runner, but unfortunately it performs worse at the decathlon than the previous athlete (ie LZ202 is better in the opening than LZ201, but weaker overall at Go)...

@dbosst

dbosst commented Jan 24, 2019

I believe the difference in thinking comes from the consequences of what it means to be "stronger" or "weaker" at Go. From a human side, it can mean (1) comprehension of different aspects of the game, or (2) your winning rate against a ranked opponent. Using a panel of opening moves will select candidates that are stronger according to (1), but I believe this can lead to the same problem, where you end up selecting nets that regress in strength in the opening (since we are not making sure candidate nets are equally strong in the opening) and are stronger in some other aspect of the game, yet are still weaker according to (2). In that case you end up with the same see-saw problem, where the net doesn't get stronger according to (2) over time, since it may end up regressing in the opening. I don't see how to efficiently separate the training of the opening from other aspects of the game.

@Friday9i

Friday9i commented Jan 24, 2019

@dbosst: I agree with you about the theoretical reverse risk.
But IMHO this risk is much lower than the current risk if the nets play a wide panel of short openings. That's why I proposed a panel of 2-move zero openings: after only 2 forced moves (eg a 5-3 corner move answered by a 4-3 in another corner), the 2 nets have all the moves they need to show their general Go skills in the rest of the game :-): opening & fuseki skills, middle-game skills, skills at ko, endgame skills...
I would even say that the nets will really demonstrate their opening skills after 2-move openings, while they are not really demonstrating their opening skills in current matches: in current matches, they only demonstrate that they have studied the chosen opening a lot, utterly master it and know how to refute variations, but they don't show their opening skills at all (against other opponents, they would be forced to play other openings, and we don't test at all whether they can handle them!)... It's meant to test opening skills but, practically speaking, it just tests their skills on 1 opening among billions of openings...

And of course, to be fair:

  • each opening should be played 2 times with color reversed: even if an opening is unbalanced, it does not favour one net vs the other because they play it twice (with reversed colors)
  • a wide panel of openings should be played: even if a net doesn't like an opening (neither with white nor with black), it's not a big deal as it is just one opening among a large panel of openings. It's negligible as long as it is just one opening among many, but if a net doesn't like many openings, then it will lose and that's perfectly fair and desirable to reject it!

On the contrary, what is the current situation? Both nets play only one or two 30-move openings in ALL match games, and we measure their skills after that very specific 30-move opening. Is that representative of their global Go skills? Of course a bad net will lose, but between two more or less equally good nets, the chosen opening will probably favour one or the other, and we will select it. Is it really better, or is it just better at that specific opening? We don't know, as we test 2 dimensions (1. skill at choosing a 30-move opening, 2. skill thereafter) with one single measure...

Hence, playing a panel of openings should greatly limit the risk of bias that we face with the current "selection process".
And once again, that's what computer chess tournaments have used all the time for 20 years to get diversity in games and fair results between engines: without openings, 2 engines playing against each other will play the same opening again and again, and this bias leads to erratic results. (Note: for tournaments with many engines, openings are not needed, as enough diversity comes from the high number of engine pairs: that's the alternative approach of matching the candidate net against a panel of engines & nets!)

@Ishinoshita

Ishinoshita commented Jan 24, 2019

@Friday9i Interesting experiment, confirming your initial concerns.

But I still believe it's definitely dangerous to use any fixed openings, coming from LZ or from other engines, a human database or whatever, in the gating scheme of an RL pipeline, as you would arbitrarily anchor this closed-loop process to an external and fixed reference, introducing a major bias. The only fixed reference should be the game rules and goal. A bit like training Antifish, but against a fixed panel: you may succeed in beating your panel, but there is no guarantee it's optimal.

IMHO, gating should be based on what openings the networks prefer. And randomization (which is fine IMO) should rely only on the networks themselves. In this respect, and again, I like the AZ chess scheme of picking slightly suboptimal moves whose values are close enough to the 'best' move, during the first 30 moves eg. Incidentally, this is also probably the simplest approach to implement (though I'm no dev to say so; just judging from the fact that it was quickly added to LC0). Cohort-based gating is more ambitious and robust against the rock-paper-scissors trap, but probably much heavier to implement on the server side.

Matches with other networks are useful, but only for assessing 'real' progress against some known anchors.

NB: this remark applies only to some of your gating schemes; I note that others are based on the networks' preferences only.

@Friday9i

I think we more or less agree regarding the selection process, from best to worst:
Best approach: selecting nets by observing winrate improvements against a wide panel of other engines (Phoenix, MiniGo, ...) and previous LZ nets. But probably quite hard to implement ;-(
2nd or 3rd best approach (2nd in your mind, 3rd in mine): selecting nets through "noised matches" with AZ's approach (ie each move is chosen among moves with a winrate close to the winrate of the best move). This ensures a reasonable diversity in test matches, ie it tests the nets' skills in a wide panel of openings they like, which is good. Still, there is a bias too: for example, if you test LZ vs ELF, LZ will play some alternative moves while ELF will probably not (as ELF is "very sharp", the winrate of alternative moves would be below the threshold...): this introduces a bias, as the impact of the noise depends on the net, and it may reinforce the selection of sharp nets, which may not be very good :-(. Probably quite easy to implement, so it's desirable indeed.
3rd or 2nd best approach: using a panel of openings to test the candidate net vs the current best net. That tests the opening skills in diverse situations, which is important, and it should not be a serious bias (otherwise computer chess would have noticed it after 20 years of using it!). But if you are very worried about the "fixed openings used", they could be dynamically adjusted over time: for each match, each candidate net could select its "n preferred openings" (with AZ's method for example) and the n+n preferred openings of the candidate net and the current best net could be used for the match (each 2 times, with colors reversed). Probably not easy to implement, but probably not so hard either?
"4th approach": the current approach ;-(. Basically it tests skills on a single opening: that cannot be representative of a net's strength in diverse openings (which it would encounter against diverse opponents). Hence, it's biased. But already implemented ;-)

Do you agree (more or less :-)?

@Ishinoshita

@Friday9i Lol! Again a rock-paper-scissors situation. Transitivity is not a given, even among humans.

As it turns out, I strongly disagree with your 1st/best approach (that was 'more or less' the central point of my last post ;-), disagree less with your 3rd unless openings are dynamically adjusted (which amounts 'more or less' to your 2nd, which I fully agree with)! We also both agree on the 4th...

That being said, what I agree with or not doesn't matter much. Let those who know, and, more importantly, those who know how to code this and are ready to dedicate time to it, make the decision.

@jillybob

jillybob commented Mar 1, 2019

Minigo's evaluation shows that LZ makes a large Elo improvement from one net to the next. However, it shows little improvement delta beyond one net (see below). A way to make improvement more general is to match the test net against the current net and the previous 2 nets. This should improve the robustness of future "best nets". I'd personally suggest:

1. If >53% winrate over current net, go to step 2
2.  If >53% winrate over 2 previous best nets then promote.

This promotion protocol is supported by the previous discussion of what the selection process could be. There has been much discussion above as to the best promotion schedule, but what the evidence shows is that we need to focus not just on beating the previous net, but on making sure the next best net is able to beat others too.

The reduced winrate requirement is justified by the increased statistical power from the larger number of match games.
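A minimal sketch of this two-stage gate, assuming a hypothetical match_winrate(a, b) helper that plays a batch of games and returns a's winrate:

```python
def should_promote(candidate, current, previous_two, match_winrate, threshold=0.53):
    """Two-stage gate: >53% vs the current best net, then >53% vs the two previous bests."""
    if match_winrate(candidate, current) <= threshold:
        return False                              # step 1 failed
    return all(match_winrate(candidate, prev) > threshold
               for prev in previous_two)          # step 2
```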

Graph of ELO delta over model numbers https://i.imgur.com/ueIXzm8.png

Minigo Model graphs: https://cloudygo.com/leela-zero/graphs

@Marcin1960

"I'd personally suggest

  1. If >53% winrate over current net, go to step 2
  2. If >53% winrate over 2 previous best nets then promote."

Makes perfect sense!

@iopq

iopq commented Mar 1, 2019

Or even a 50% win rate with a certain confidence threshold. That way it doesn't take much longer to test.
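For example (a minimal sketch, assuming a simple normal-approximation confidence interval; the numbers are illustrative):

```python
import math

def winrate_lower_bound(wins, games, z=1.96):
    """Lower end of a ~95% normal-approximation confidence interval for the winrate."""
    p = wins / games
    return p - z * math.sqrt(p * (1 - p) / games)

# 210 wins out of 400 games is 52.5% observed, but the lower bound is ~47.6%,
# so this result alone does not establish >50% with 95% confidence.
print(winrate_lower_bound(210, 400))
```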
