Should we update our selection process? #2143
Comments
@Friday9i Yeah, lack of diversity in match games is always a potential threat with respect to the rock-paper-scissors pitfall, so I personally would feel more confident with more diversity (but intuition sometimes sucks...). Approach 2.2: implement the AZ paper 2 chess match condition against SF (randomization by sampling suboptimal moves in search with some softmax temperature and within some winrate bound, during, say, the first 30 moves). Less cumbersome than managing opening books, and still potentially a larger spectrum of the opening lines the policy 'contains'. It would still be LZ X vs LZ X+1, but LZ X with a larger policy spectrum vs LZ X+1 with a larger policy spectrum. The AZ2 options were already added to the LC0 code if I recall correctly (I don't know if they are used for promotion yet, though). Edit: with opening randomization, the SPRT threshold might need to be revisited. Or not. Its function is just to guarantee some sufficient gating; its absolute meaning does not matter. You already pointed out that its meaning is subject to some reservations given the use of a single opening line...
So, a short version?
@Ishinoshita Thanks for your comments, and indeed an AZ2-style option to introduce diversity is already implemented in LC0 (but I don't know either if it is already used for promotion). Regarding SPRT, it would not need to be changed if we use a panel of openings.
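To make the suggestion above concrete, here is a minimal sketch of that kind of randomization: during the first 30 moves, sample among moves whose estimated winrate is within a small margin of the best move, using a softmax temperature over visit counts; afterwards, play the most-visited move as usual. The parameter names and values are illustrative assumptions, not the actual LC0/LZ implementation:

```python
import numpy as np

def pick_match_move(moves, visits, q_values, move_number,
                    temperature=1.0, q_margin=0.02, cutoff=30):
    """Sketch of AZ-style opening randomization for match games.

    moves     : legal moves returned by the search
    visits    : MCTS visit counts, one per move
    q_values  : MCTS winrate estimates, one per move
    q_margin  : only moves within this winrate margin of the best
                move may be sampled (keeps the strength impact small)
    cutoff    : number of opening moves to randomize
    """
    visits = np.asarray(visits, dtype=float)
    q_values = np.asarray(q_values, dtype=float)

    if move_number >= cutoff:
        # After the opening, play deterministically as today.
        return moves[int(np.argmax(visits))]

    # Softmax over visit counts (i.e. proportional to visits^(1/T)),
    # restricted to near-optimal moves.
    eligible = q_values >= q_values.max() - q_margin
    logits = np.where(eligible,
                      np.log(np.maximum(visits, 1e-9)) / temperature,
                      -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return moves[int(np.random.choice(len(moves), p=probs))]
```

The winrate margin is what distinguishes this from the -m self-play noise: only moves the net already rates about as highly as its best move can be picked, so the strength cost should stay small.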
A problem is that autogtp doesn't currently process ...
Yeah, or combining -m 30 and --randomvisits 200 (or something around 200): from tests I did, it has quite a small impact on playing strength and provides reasonable diversity. But it is still almost always the same opening for part of the 30 moves: better, but not ideal.
I prefer randomtemp, but either way we need a new release of autogtp :)
It seems like it would be a bad idea to adopt a selection scheme that doesn't give any advantage to a network that plays better openings, because playing better openings is part of what it means to be a stronger player. Some sort of panel selection would be lovely, given a good selection of opponents of strength comparable to current LZ that can be implemented inside the existing LZ code. That includes earlier LZ versions, and both versions of Elf, and ... what else?
@gjm11 I agree that opening skills are important, but 2-move openings are just the very beginning of the game: a net still has all the time it needs to demonstrate its opening skills, and it will need to demonstrate those skills on various openings rather than on a single one :-). And there are ~120 fair 2-move openings available, which is enough to generate 240 games without any risk of redundancy.
There are so many methods and so much data used in training, but the method of judging a weight's strength is still playing games. Is there a more mathematical method?
I have an idea: each move of a game has a winrate according to a specific weight, so can we conclude something meaningful by comparing the winrate series of different weights?
From my observations of games between different bots, a higher estimate of the win rate could mean either the bot is weaker because it couldn't see any "hidden" danger or it is stronger because it sees the "zigzag" route to victory earlier.
Good idea, I think this is a necessary and obvious improvement. Lacking a wide variety of strong opponents, we can use an option like -m, so that mathematically we are 50% likely to get 400 distinctly different games for the first 30 moves or so.
What about ignoring early wins? At least for a while?
I hacked up something to force LZ to play many different openings. Of course you could use the random move feature, but this is another approach. The description (as in the commit messages):
This might not be an option for the official match games because it will be tricky to get this to work in a distributed setup. However, I thought this might be a nice way to generate a bunch of games with various openings (which I haven't tried yet).
Nice, great work!
That's basically what I did manually to create a 2-ply opening book, getting around 120 openings.
Even in self-play games I saw many interesting openings, although they were created "randomly". The question is, however, how to let the new network adopt any new openings. Another question is then: why does LZ stick with the old knowledge and refuse to learn new things?
http://www.yss-aya.com/cgos/19x19/cross/LZ_9229_192b15_p6k.html
New idea for an "Approach 4", to get more diversity in openings in test matches (reminder: diversity limits the risks of rock-paper-scissors issues :-):
Can we use the top 3 Elo weights (which do not need to be promoted) to play self-play games?
If a new candidate only shows improvement for a specific opening then using a variety of openings for selection will mean making it much more difficult to pass -- we may not ever descend down the mountain because we can only take gradual steps. I think only using a variety of opponents might be the right idea, at least until progress stalls -- then use a panel of openings and >50% gating.
@dbosst in line ;-) By the way, I'm currently testing LZ202 vs LZ199/LZ200/LZ201 with different set-ups and with openings, and it's performing significantly worse... It seems we have selected a "weaker overall net", but LZ202 was apparently able to outperform LZ201 sufficiently in the opening to get an edge (and to keep that edge until the endgame), reach a 55% winrate overall and be selected. We selected an excellent 100-m runner but unfortunately it performs worse at the decathlon than the previous athlete (ie LZ202 is better at the opening than LZ201, but weaker overall at Go)...
I believe the difference in thinking comes from the consequences of what it means to be "stronger" or "weaker" at Go. From a human side, it can mean (1) comprehension of different aspects of the game, or (2) your winning rate against a ranked opponent. Doing a panel of opening moves will select candidates that are stronger according to (1), but I believe this can lead to the same problem, where you end up selecting nets that: regress in strength in the opening (since we are not making sure candidate nets are equally strong in the opening), are stronger in some other aspect of the game, but are still weaker according to (2). In that case you end up with the same see-saw problem where the net doesn't get stronger according to (2) over time, since it may keep regressing in the opening. I don't see how to efficiently separate the training of the opening from other aspects of the game.
@dbosst: I agree with you about the theoretical reverse risk. And of course, to be fair:
By contrast, what is the current situation? Both nets play only one or two 30-move openings in ALL match games, and we measure their skills after that very specific 30-move opening. Is that representative of their global Go skills? Of course, a bad net will lose, but among two more or less equally good nets, the chosen opening will probably favor one or the other net, and we will select it. Is it really better, or is it just better at that specific opening? We don't know, as we test 2 dimensions (1. skill at choosing a 30-move opening, 2. skill thereafter) with one single measure... Hence, playing a panel of openings should greatly limit the risks of bias that we face with the current "selection process".
@Friday9i Interesting experiment, confirming your initial concerns. But I still believe it's definitely dangerous to use any fixed openings, coming from LZ or from other engines, a human database or whatever, in the gating scheme of an RL pipeline, as you would arbitrarily anchor this closed-loop process to an external and fixed reference, introducing a major bias. The only fixed reference should be the game rules and goal. A bit like training Antifish, but against a fixed panel: you may succeed in beating your panel, but there is no guarantee it's optimal. IMHO, gating should be based on what openings the networks prefer, and randomization (which is fine IMO) should rely only on the networks themselves. In this respect, and again, I like the AZ chess scheme of picking slightly suboptimal moves whose values are close enough to the 'best' move, during e.g. the first 30 moves. Incidentally, this is also probably the simplest approach to implement (though I'm no dev to say so; just judging from the fact it was quickly added to LC0). Cohort-based gating is more ambitious and robust against the rock-paper-scissors trap, but probably much heavier to implement on the server side. Matches with other networks are useful, but only for assessing 'real' progress against some known anchors. NB: this remark applies only to some of your gating schemes; I note that others are based on network preference only.
I think we more or less agree regarding the selection process, from best to worst: Do you agree (more or less :-)?
@Friday9i Lol! Again a rock-paper-scissors situation. Transitivity is not a given, even among humans. As it turns out, I strongly disagree with your 1st/best approach (that was 'more or less' the central point of my last post ;-), disagree less with your 3rd unless openings are dynamically adjusted, which amounts 'more or less' to your 2nd, which I fully agree with! We also both agree on the 4th... That being said, what I agree with or not doesn't matter much. Let those who know, and, more importantly, those who know how to code this and are ready to dedicate time to it, make the decision.
Minigo's data has shown that LZ gets a great Elo improvement from one net to the next. However, it shows little improvement delta past one net (see below). A way to improve generalised improvement is to match the test net against the current net and the previous 2 nets. This should improve the robustness of future "best nets". I'd personally suggest
This promotion protocol has been supported by previous discussion over what the selection process could be. However, there has been much discussion above as to the best promotion schedule. What the evidence shows is that we need to focus not just on beating the previous net, but on making sure the next best net is able to beat others too. The reduced requirement for winrate is due to increased statistical power with more match games. Graph of Elo delta over model numbers: https://i.imgur.com/ueIXzm8.png Minigo model graphs: https://cloudygo.com/leela-zero/graphs
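On the "increased statistical power" point, a quick back-of-the-envelope calculation shows how the passing threshold can be lowered as the number of match games grows while keeping the same false-positive risk. This is a minimal sketch using a normal approximation to the binomial; it does not reproduce the project's actual gating numbers:

```python
import math

# One-sided z quantiles for a few significance levels.
Z = {0.05: 1.6449, 0.025: 1.9600, 0.01: 2.3263}

def min_passing_winrate(n_games, alpha=0.05):
    """Smallest observed winrate that is significantly above 50%
    at level alpha (normal approximation, draws ignored)."""
    return 0.5 + Z[alpha] * math.sqrt(0.25 / n_games)

for n in (100, 200, 400, 800, 1600):
    print(n, f"{min_passing_winrate(n):.1%}")
# 100 games -> 58.2%, 400 games -> 54.1%, 1600 games -> 52.1%
```

For instance, a 55% gate over 400 games and a ~52.5% gate over 1600 games are about equally demanding statistically: both sit roughly two standard errors above 50%.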
"I'd personally suggest
Makes perfect sense!
Or even a 50% win rate with a certain confidence threshold. That way it doesn't take much longer to test.
Should we update our selection process?
Here, I will try to demonstrate as properly as possible that "yes, we should change our match conditions, as they are biased and quite probably suboptimal!" ;-(
The good news is that alternative approaches are available:
Here is the global conclusion:
All in all, our selection process should be upgraded:
1) possibly by using an opening book in match games (as done for chess): that would improve the situation, limiting the clear risk of bias and rock-paper-scissors cycles we may experience today, thanks to tests with more diverse situations
2) or ideally by matching candidate nets against a panel of opponents (as apparently done by DeepMind for AlphaZero chess), but that's not easy to set up...
Hence, we should think about upgrading our selection process and match conditions.
Now, the details!
Sorry, it's a bit long, but I'm trying to be explicit and as rigorous as possible in the demonstration: that takes some space...
Reminder: there are convincing arguments that our current "selection process" (ie matching candidate nets vs the current best net in a match without noise, and applying SPRT) is efficient:
What could be a better selection process?
Let's imagine we get access to several versions of AlphaGo, several versions of MiniGo, several versions of Tencent's engine, etc... for a total of 40 different Go engines.
From that, we could easily build an extremely solid selection process:
Hence, a match of 400 games (5 games with white and 5 with black against each of these 40 engines) should give a 50% winrate for LZ200 (with of course some statistical noise: the standard error of a winrate measured over 400 games is about sqrt(0.25/400) ≈ 2.5%).
In these match games, which openings would be played? Answer: of course, it would be a diverse panel of openings having almost nothing in common with the 30-move opening currently played systematically in match games between an LZ candidate net and the LZ official net (please take a look for yourself at http://zero.sjeng.org/match-games/5c37063ff06758029e301a8c for example to be convinced: symmetries differ, but there are only 1 or 2 different games played up to move 30 in all match games!). In a match against various engines, we would of course get many different openings, as some engines like to open with the 3-4 point, some may play the 5-4 point, answers after an initial 4-4 point differ, etc.
In this test against a panel of engines, LZ would play various openings and would probably never play its "preferred 30-move opening" (as other engines have no reason to play that specific opening): is that an issue for the reliability of the test? The answer IMHO is "Of course it's not a problem, it's even a clear advantage (vs current match conditions) because it tests LZ in a wide and diverse panel of situations, revealing much more precisely its global strength and ability to play in diverse situations" ;-)
Currently, our selection process leads to "a chosen 30-move opening" which is played again and again, and if the candidate net gets a 55% winrate after this single opening against a single engine, it gets a PASS. Isn't that a bit narrow, meaning there is a clear bias?
Hopefully we can agree that testing a candidate LZ net against a large panel of opponents (and consequently against many diverse openings) is a more solid test than the current selection process (ie a match against one single engine and after a single 30-move opening)!
If not clear enough, let's add an example:
Let's imagine a candidate net getting a 58% winrate in this test against the panel of 40 other engines (vs a 50% winrate for LZ200 against this same panel of engines): it seems that the candidate net is a "clear improvement" over LZ200. Still, what would its result be in a match against LZ200? If it has "bad luck", the single opening chosen against LZ200 (played again and again, for ~30 moves, in every game!) may unfortunately put it at a disadvantage: in that case, its winrate may be significantly below 55% and it would be rejected by our current selection process. What a pity, we reject a strong net due to a single bad opening, ignoring all the other things it learned better than LZ200!
On the contrary, a candidate net getting a 50% winrate against the panel of opponents (ie not better than LZ200) may get some luck against LZ200 if the 30-move opening chosen against LZ200 is by chance good for it, and hence it could perfectly well get a PASS with our current selection process (despite not being stronger, or possibly even being slightly worse)...
Because we test several nets, there is statistically a big chance that we select nets mainly because they get an advantage with a good opening rather than because they are overall better nets...! This is the perfect context for a "rock-paper-scissors situation": among several tested nets, one chooses a good opening against LZ200 and is selected to become LZ201 (while not being stronger against other engines), then LZ202 is selected against LZ201 for the same reason, but in the end no progress was made and LZ200 may beat LZ202 because the chosen opening favours LZ200 vs LZ202...
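To put a rough number on this bias, here is a toy Monte Carlo simulation. It assumes a candidate that is NOT better overall (its average winrate over all openings is exactly 50%), but whose winrate on any individual opening deviates from 50% by a random per-opening bias (standard deviation of 5 percentage points, an arbitrary figure chosen purely for illustration). It then estimates how often such a candidate still reaches a 55% winrate over ~400 games when the match uses a single opening versus a panel of 120 openings:

```python
import numpy as np

rng = np.random.default_rng(0)

def pass_rate(n_openings, games=400, gate=0.55,
              opening_sigma=0.05, trials=20000):
    """Probability that an equal-strength-on-average candidate reaches `gate`."""
    passes = 0
    per_opening = games // n_openings   # games played on each opening
    for _ in range(trials):
        # Per-opening winrates of the candidate vs the incumbent.
        p = np.clip(0.5 + rng.normal(0.0, opening_sigma, n_openings), 0.0, 1.0)
        wins = rng.binomial(per_opening, p).sum()
        if wins / (per_opening * n_openings) >= gate:
            passes += 1
    return passes / trials

print("single opening :", pass_rate(1))    # roughly 15-20% under these assumptions
print("120 openings   :", pass_rate(120))  # only a few percent
```

Under these made-up assumptions, a single-opening match promotes a not-actually-better net several times more often than a 120-opening match would, which is exactly the entry point of the rock-paper-scissors cycle described above.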
Conclusion: our current match conditions are biased and may lead to rock-paper-scissors issues. By contrast, matching candidate nets against a panel of opponents (leading to a large panel of diverse openings) would be significantly better at revealing the "true Go strength" of candidate nets (as it's hard to imagine how a rock-paper-scissors situation may appear against a panel of competing engines and various openings).
And this is not a theoretical situation:
Conclusion 2: rock-paper-scissors happens in real life with iterative selection processes, as used by LZ. Hence, our selection process may be plagued by that same issue, and alternatives should be studied...
What alternative selection process(es) could be used?
Approach 1: ideally, LZ candidates could be tested against a panel of opponents, as this seems quite optimal, but it is not easy to set up. We could compare the results of LZ200 and of candidate nets against a panel of engines, such as for example LZ157@~5k visits, LZ174@~3k visits, LZ199@1600 visits and possibly some bjiyxo nets, 20b nets, ELFv0 and ELFv1, MiniGo nets, ... If a candidate net performs significantly better than LZ200 against this panel of engines (eg getting 53% vs only 46% for LZ200), we should select it; if not significantly better, we should reject it.
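For the "significantly better" test in Approach 1, a standard two-proportion comparison would be enough. A minimal sketch (the 53% vs 46% figures are just the illustrative numbers from the paragraph above, assuming 400 panel games for each net):

```python
import math

def panel_improvement_z(cand_wins, cand_games, base_wins, base_games):
    """Z-score for 'the candidate's panel winrate is higher than the
    current best net's panel winrate' (two-proportion z-test,
    normal approximation)."""
    p_cand = cand_wins / cand_games
    p_base = base_wins / base_games
    pooled = (cand_wins + base_wins) / (cand_games + base_games)
    se = math.sqrt(pooled * (1 - pooled) * (1 / cand_games + 1 / base_games))
    return (p_cand - p_base) / se

# 53% vs 46% over 400 games each:
print(f"z = {panel_improvement_z(212, 400, 184, 400):.2f}")
# z ~ 1.98, i.e. significant at roughly the 2.5% level (one-sided)
```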
Approach 2: we could simply play match games of the candidate net vs LZ200 while using a large panel of fair openings (each played once with white and once with black) to test networks in a good diversity of situations, and apply SPRT. That's probably not as good as testing a candidate net against a panel of several engines, but it seems better than simply testing a candidate net against the single LZ200 net with a single "30-move preferred opening"... An example of openings that could be used is here: #2104, with a panel of "120 fair 2-move openings" (by "fair", I mean the winrate stays close to the winrate of optimal moves as judged by LZ, and also close to 50%).
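Here is a minimal sketch of how Approach 2 could be run: each fair opening is scheduled once with the candidate as black and once as white, and the usual SPRT log-likelihood ratio is updated after every game. The 50%/55% bounds and 5% error rates below are illustrative assumptions (the server's actual SPRT parameters may differ), and play_game is just a placeholder for whatever plays one match game from a given opening:

```python
import math
import random

# SPRT between H0: winrate = 50% and H1: winrate = 55%, 5% error rates.
P0, P1, ALPHA, BETA = 0.50, 0.55, 0.05, 0.05
UPPER = math.log((1 - BETA) / ALPHA)   # cross this -> candidate PASSES
LOWER = math.log(BETA / (1 - ALPHA))   # cross this -> candidate FAILS

def run_gating_match(openings, play_game):
    """openings  : list of fair opening sequences (eg the ~120 from #2104)
       play_game : play_game(opening, candidate_is_black) -> True if the
                   candidate wins that game."""
    # Each opening is played twice, once per colour, in random order.
    schedule = [(o, black) for o in openings for black in (True, False)]
    random.shuffle(schedule)

    llr = 0.0
    for opening, candidate_is_black in schedule:
        if play_game(opening, candidate_is_black):
            llr += math.log(P1 / P0)
        else:
            llr += math.log((1 - P1) / (1 - P0))
        if llr >= UPPER:
            return "PASS"
        if llr <= LOWER:
            return "FAIL"
    return "inconclusive"  # book exhausted before SPRT decided
```

Shuffling the schedule just avoids long streaks of the same opening or colour biasing an early SPRT stop.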
Approach 3: approaches 1 and 2 may be combined, by testing candidate nets against a panel of different engines and after a panel of opening moves (played twice, once with white and once with black).
Note: in computer chess tournaments, opening books are almost universally used, after being introduced by Nunn 20 years ago! So I'm not reinventing the wheel ;-)
Cf. this article presenting the context and the approach (https://en.chessbase.com/post/test-your-engines-the-silver-openings-suite) and https://chess.com/computer-chess-championship for example (where you can find the tournament rules: the first round does not use openings, as diversity comes from the large panel of engines, then for the 2nd round an opening book is used). The TCEC tournament (http://tcec.chessdom.com), where LC0 plays, also uses an opening book.
Global conclusion: all in all, our selection process should be upgraded:
1) possibly by using an opening book (as done for chess): that would improve the situation, limiting the clear risk of bias and rock-paper-scissors cycles we may experience today, thanks to tests with more diverse situations
2) or ideally by matching candidate nets against a panel of opponents (as apparently done by DeepMind for AlphaZero chess), but that's not easy to set up...
3) playing match games with for example "-m 20 --randomvisits 200" should give some welcome diversity in games too, but it may introduce other biases (eg LZ and ELF are quite different, so the impact of the noise may be unfair). Hence it's not clear if this approach would be beneficial or not.
Note: playing a part of the training games with a "zero-opening book" may be useful to get more diversity too (in addition to the random noise -m 30), but that's another story ;-)