How to find information about the new best network #78

I noticed a new network (fe3f6...) in http://zero.sjeng.org/networks/
Can you provide some information about win rate over the previous best network?
Number of games it was trained on? How long it took to train?
I'm thirsty for details :)

Comments
19k games. Learning on 38k is running now. I observed that the network now thinks white has an advantage in the opening. I think this is because it learned that if black passes before capturing much or gaining territory (not something it understands at this point), white will win on komi. |
Thanks for the information. I see that the network file is named based on the number of games it was trained on, which will help answer this question in the future. |
The files on the server are named after the hash of the contents, though. I just do this to keep track of which is which. I also tested a smaller network (to control for overfitting) but it was not better. |
I was referring to "19k.txt" inside of fe3f6...gz |
I appreciate updates like that. I think it will help keep contributors motivated to start up autogtp.exe when they get some feedback on how their contribution helps to make progress. Btw, how good would the current network (19k) play against a human? |
Including the win rate info in the best network would be great. @lithander I don't think the 19k network is better than a human beginner. |
It barely knows how to count I think. |
I was quite curious and replayed a few of the games. Interestingly it seems like the newest version has a small understanding of specific shapes. |
The learning now has the policy network achieving a 4% prediction rate, which is very far from random (0.3%). I wonder if this is just learning to understand what the legal moves are (many training games have almost-filled boards) or if it can already statically see some capture and defense moves. |
Could you please clarify what prediction rate means? Is this in regards to a dataset of human professional games (GoKifu), as they used in Fig.3 of https://www.nature.com/articles/nature24270? |
It's a prediction rate over the dataset of games from the program itself. So in 4% of the cases, it correctly guesses the move it would eventually play after searching. This is a sign play is starting to become more non-random. (Or, as said above, maybe simply that the network now understands you can't place stones on top of each other) |
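As a concrete illustration of the metric being discussed: this is a top-1 accuracy of the raw policy output against the moves actually chosen after search. A minimal sketch (hypothetical array names, not the project's training code):

```python
import numpy as np

def prediction_rate(policy_outputs, search_moves):
    """Top-1 prediction rate: fraction of positions where the network's
    highest-probability move matches the move chosen after the search.

    policy_outputs: (N, 362) array of move probabilities (361 points + pass).
    search_moves:   (N,) array of indices of the moves actually played.
    """
    top_moves = np.argmax(policy_outputs, axis=1)
    return float(np.mean(top_moves == search_moves))

# Random guessing over ~361 points is roughly 1/361 ≈ 0.3%,
# so a 4% rate already reflects some learned structure.
```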
Hi, you mentioned that learning on 38k is running now, and I wonder how much time will this 38k learning process take? Could you give an estimation based on your experience? Thank you. |
Any idea what the prediction rate would be for a strong fully trained Leela Zero? Did you ever happen to try loading in Leela's own human game data into Leela Zero's architecture just to see what happens? |
That's how the supervised network in the README.md was built. It does about 52.5% on humans, but the prediction rates for those vary a bit with the exact dataset. Also, it trained for only a few days; you can probably get quite a bit more by running it for a few weeks. The prediction rate on human games for a Zero net trained on its own play (i.e. what we're building now) should be less, because it won't predict the bad moves those puny humans play by imitation. |
"The 19K game network beats the 9k game network 63% of the time. A 38K network is training now." I wonder, should the older less informed games to be discarded at certain moments? |
AlphaGo Zero used a window of 500k games, IIRC. |
1/10th for Leela Zero? 50K? |
Are you sure about using a 500k game window instead of a 100-300k window? Because our network is only 6 blocks, it should improve faster. |
I'm not sure about anything. But it's important to keep a window of the old games, or the network forgets the basic things it has learned before. This is a very common problem for reinforcement learning setups. |
You could probably experiment with the window size, by testing two networks with the same recent data but different numbers of older games against each other. I doubt this would make any sense before we have at least 150-200k games though. |
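For illustration, the sliding-window idea under discussion amounts to keeping only the most recent N self-play games and sampling training data from that pool. A minimal sketch (window size and function names are illustrative, not the project's actual pipeline):

```python
from collections import deque
import random

WINDOW_SIZE = 500_000  # AlphaGo Zero reportedly used ~500k games; a smaller net may want less

game_window = deque(maxlen=WINDOW_SIZE)  # oldest games drop off automatically

def add_selfplay_game(game):
    game_window.append(game)

def sample_training_games(batch_size):
    # Mixing recent games with older ones helps the network keep
    # the basics it learned earlier (avoids catastrophic forgetting).
    return random.sample(list(game_window), min(batch_size, len(game_window)))
```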
I might suggest a slight wording change, given the confusion some people had over kyu: "The 19K game network beats the 9k game network 63% of the time. A 38K network is training now." |
Replacing K with T to mark a thousand games wouldn't be a bad idea either, since K is a rank measurement in Baduk. |
Sorry. Although I play Go, it didn't even occur to me that saying 9K in that context might confuse people. I just changed it to full numbers with no abbreviation. |
Is the network which learns from 38k games still under training? |
To keep the progression I would see 76K as next :) |
The current ones did not beat the 19k games network yet. So clearly it's not all so easy. I am retraining 49k with stronger regularization, and starting 62k soon. |
@gcp Can you publish some games between different networks? I think it might be a way to motivate people who join the training of leelaz. |
@godmoves it hasn't been updated for a while, but you can get them here: https://sjeng.org/zero/ |
The strength evolution curve in the DeepMind paper is not strictly monotonic. That suggests to me that they must have allowed their network to update occasionally even when the evaluator did not prove a strength increase, presumably to get out of local maxima. |
@roy7: We just had another mis-promotion with a91721af. It passed SPRT at 11-1, and then the losses started coming in. Maybe the code for promoting networks should have a minimum number of games requirement (like 50-100), in addition to an SPRT pass? |
@jkiliani Makes sense since shorter games arrive earlier in the results. What's a reasonable minimum though? 100 feels high... imagine if it's like 1-99? Perhaps a minimum 50 games, or even 30, would avoid these 11-1 situations well enough. |
Don't know, anything upward of 50 would be my choice. But even 30 might prevent a case like today. I don't see the problem with even 100 though, since match games are generated really quickly these days. The delay from waiting hardly seems significant. |
Wait, what do you mean 1-99? I would only implement a minimum to pass, not to fail. Or maybe allow a pass SPRT, but remove the automatic promotion?

A minimum to fail isn't needed in any case: even if a net does fail 1-11, more results are going to come in due to queue_buffer anyway, and if it returns to null this way, it still has a chance to pass. The difference between fail and pass is that promotion is permanent, so you need a higher confidence.

By the way, could you mark mistaken passes in the match list somehow? Maybe blue color for the bar of any net that was promoted, to be kept even if the net returns to null or fails after all? This would make it easier to find the passed nets in the list, which (in my opinion) include those who failed after passing, since they produced self-play games and were used as reference net for further matches. |
@gcp Something is odd with the current training. It's a lot slower than before. I remember you mentioned somewhere that we now do 256k steps, but I only see a 128k-step one. Is it possible that the server is displaying the wrong step count? So the 128k is actually 256k, and the 64k is actually the 128k one, etc.? (I forgot which issue, or I would post this there.) |
Not really @Dorus. 7b7d5d59 was the 64k net, the 128k net is due in 20 minutes, and the 256k probably around 9 hours from now. I don't see any irregularity... |
It's also possible a network weaker in short games but much stronger in long games could start with a fail but end up over 55%. I was going to just make it a 100-game minimum before any pass or fail, but good catch there. Perhaps 100 minimum to pass but no minimum to fail, then? |
Here we go @jkiliani leela-zero/leela-zero-server@0251e63 |
That would be my suggestion. There is already an effective minimum to fail of around 50 games, by the queue_buffer. Edit: Looks good to me. |
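The actual change is in the server commit linked above; as a rough sketch of the rule being discussed (Python for illustration, not the server's implementation), the gating amounts to requiring both an SPRT pass and a minimum number of games before promotion, with no minimum applied to failing:

```python
def should_promote(wins, losses, sprt_passed, min_games=100):
    """Promote only after an SPRT pass AND a minimum sample size,
    so a burst of early wins from short games can't trigger a
    premature promotion. Failing is not gated: a net that starts
    badly keeps receiving results and can still recover."""
    return sprt_passed and (wins + losses) >= min_games
```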
@jkiliani Oops, you are right. I must have remembered it wrong because of the in-between steps. |
The AZ method working demonstrates that this is not true. The networks are not doing mutations and intersections between a population as in an evolutionary process, and being improved from that. It's training from external data that improves them (data that happens to be generated by them, but that's actually unrelated, which is exactly what AZ shows!). The analogy and comparison with evolutionary methods is just totally and utterly wrong, and again, the only "adaptation" that is being done is really us cherry-picking the best among completely non-adapted networks. |
This will be totally off topic, but still... Obviously it's not just a comparison, it IS an evolutionary method, and that's just totally and utterly right. If you adopt the AZ method you are no longer selecting networks, but you must still be selecting something: that something is the individual knowledge elements that we humans would call moves/tesujis/techniques/strategies/other things we cannot even conceive but the network can, and the "survival of the fittest" part is not "surviving a match between networks" but surviving the training phase, which decides what goes into the next network.

Knowledge doesn't appear by magic. Especially in a "Zero" situation, where you scrupulously keep the Designer away, there must be a selection mechanism. How can you expect to keep moving towards the optimum if you exclude the existence of any mechanism to guide you in the right direction? Why do you suppose the AZ method works?

By the way, the "happens to be generated by them, but that's actually unrelated" is just as it should be. You don't need to be clever and try to generate good "mutations". Anything goes. We evolved from single cells just from random mutations. It's the selection part that's important. |
Did you just throw together a bunch of random keywords and compile them into a post?
Because MCTS guides the learning algorithm. This guidance is so strong you need very little other help. |
Sure, and where does it guide it? It helps you choose the good moves over the bad ones. That's my point. What's yours? |
Training does not happen "at random" like it would during real evolution. It does pick random moves, and the subset of moves that is picked has some effect, but for each picked move it modifies the neural net to output a result closer to the output of the MCTS at that move. This modification is not random: the neural net is one big mathematical formula, and that formula is adjusted to match the expected result more closely.

Also, the move picked by the MCTS is not necessarily a 'good' or 'bad' move. Even if it is a bad move, next-generation networks will pick that move more often and will start running MCTS on the following moves as well. At some point the search, one (or several) moves deeper, will discover a way to refute the previously considered move, and the top-level move will become unfavorable. The search will then switch to the previously considered second-best move, which might have been a good move all along. The knowledge about the now-unused search path is not lost, though: as long as the neural net is large enough, future training will not completely erase it. Eventually the network will reach the limit of what it can learn; this happens when each training step makes it forget just as much as it learns.

With the AGZ approach, you train a new net, and if it is not stronger, you roll back to the current net. All the training you just did is then 100% lost. With AZ, you might progress even faster, because you never roll back any training. You might have some trouble once the network reaches capacity, because you are not hand-picking the best net, but if you set the learning rate low enough, new networks will not forget as much and you stay at roughly the same strength. And if you really want to, you can still hand-pick the very last net you get from an AZ method.

Anyway, I feel like I'm just writing the long version of what @gcp wrote a few posts above:
I have some doubt that an AZ method can surpass an AGZ method, but I have little doubt an AZ method can be equally fast, if not faster, than AGZ, as long as your methods are sound. An AZ method provides much less feedback, so you need a very good idea of when to drop the learning rate, how efficient your MCTS is, and that you have no bugs, etc. Moving to an AZ method from an already saturated AGZ net like we have now will probably not result in a much stronger net, but it should also not result in catastrophic forgetting. |
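To make the "modified to match the MCTS output" point above concrete, here is a small sketch of the AlphaGo-Zero-style training targets (illustrative code, not the project's actual training scripts): the policy head is pulled toward the search's visit distribution, and the value head toward the final game result.

```python
import numpy as np

def training_targets(mcts_visit_counts, game_result):
    """Targets for one position, in the AlphaGo Zero scheme.

    mcts_visit_counts: (362,) visit counts from the search (361 points + pass).
    game_result:       +1 if the player to move won the game, -1 otherwise.
    """
    visits = np.asarray(mcts_visit_counts, dtype=np.float64)
    policy_target = visits / visits.sum()
    value_target = game_result
    return policy_target, value_target

# The loss then combines cross-entropy between the policy head and
# policy_target, squared error between the value head and value_target,
# and L2 regularization on the weights.
```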
I'm sorry, but you're really just trying to knock down a straw man, based on faulty assumptions about what 'must' be the case for something to qualify as an evolutionary method. In fact, there are very few assumptions that must be met, and this project (and the AGZ project, and even the AZ project; see below) meets those few assumptions just fine. You don't need a large population size to be an evolutionary method; you can have a population of just 1 surviving individual per generation, which is what we're working with. Mutations need not be entirely random; they can be produced by any process, including neural net training, as long as some level of inheritance is maintained. Intersection (i.e. crossover) is also not necessary; in fact, it can be considered simply another variant of generalized 'mutation'. And honestly, I think you are simply using an unrelated meaning of the term 'adaptation'. I already linked to the Wikipedia article which explains the usage of 'adaptation' that I'm referring to, but here it is again: https://en.wikipedia.org/wiki/Adaptation

In the current project, when a champion network (population size = 1) is used to train a variant of itself (mutation), that new variant isn't just a random network out of the blue; it is still based largely on the original champ (inheritance). However, the mutant first must pass the SPRT gauntlet (selection) against the champ before becoming the champ's successor. If it fails, it is left behind (death, aka extinction). If it succeeds, it becomes the new champ (generations), and the process continues (iteration).

From an evolutionary process point of view, it's irrelevant whether the mutation step is unguided or guided. If it's unguided, it will still work; it will just take a lot longer. If it's guided, it's still producing a child variant largely related to its parent; it's just that the odds are much higher that the child will be better than its parent, so the overall process can go faster. [There's an inherent risk of getting stuck in a local optimum, especially with a population size of just 1, but the learning rate parameter apparently is intended to prevent that (similar to simulated annealing, I presume).]

Even if this project were to take on the AZ method of always accepting the mutant child without testing SPRT against the parent, it would still count as an evolutionary process; you would just take on a broader perspective, considering the training games themselves as part of the 'environment' left over by previous generations of parent/ancestor networks. Perhaps more like cultural evolution than biological evolution, but evolution nonetheless (see memetic/Baldwinian algorithm below). Yes, the neural net training does most of the heavy lifting in terms of generating candidates that are likely to be more 'fit' to their environment. But also yes, it can still be considered a form of mutation. Guided or unguided doesn't matter. In this project's case, it's guided. Still mutation. Still evolutionary.

Here is a collection of relevant topics which buttress my points:
The point is, there are many many forms of evolutionary computation, and this project definitely falls under that broad category. You could even argue that it uses one of the oldest forms of evolutionary computation, namely Evolution Strategies. It just uses a souped-up mutation operator that's way better than random mutation. And it keeps some additional legacy inheritance from long-dead ancestors in the form of a long window of training games that they produced. None of that disqualifies this project from being a form of evolutionary computation, and indeed, an evolutionary algorithm with the special case of having a population size of just 1 (or 2 depending on how you delineate a 'population').

The reason I'm dragging this on is because understanding that this is the case will allow us to borrow knowledge and techniques from evolutionary computation which can help us overcome actual practical problems that we may run into, such as getting stuck in local optima (most likely simple solution would be increase pop. size to > 1), or perhaps adopting / experimenting with certain techniques or operators, such as the simple linear 'hybrid' technique that seems to work pretty well (which can be explained/understood partly by analogy with a generalized 'crossover' operator).

Denying outright that it's a form of evolutionary computation/algorithm won't serve any useful purpose, IMHO. Yes, the main workhorse is of course the neural net training algorithm; that fact has never been under contention. That doesn't mean it's not also an evolutionary process. The two methods blend together quite well, in fact. |
I think you could easily get away with W + L > 50. By the time you get past 50, the very short games that may have tipped the SPRT balance to 'PASS' will have already been diluted by many more longer games, undoubtedly bringing the balance back to 'green bar' (or even 'fail'). And from then on the SPRT should function as intended. It's really only when you get a whole bunch of early wins all at once (in the first 10-20 results returned) that you're in danger of a really bad false positive. In fact, going by the old statistics-class rule of thumb for minimum sample sizes, you could probably get away with waiting for only 30 games. But why not stick with 50, as that's the number we currently use to prevent a false negative? There are 20 out of the last 100 matches where 50 < N < 100, i.e. 20% of matches don't need to go all the way to 100 but would be forced to. |
I disagree. You're still selecting networks, but the networks are also participating in a kind of 'cultural evolution' (which itself does indeed involve the evolution of 'something', that being a library of games and/or game positions). However, at the end of the day, it is the networks that we select and use to play further games (inside and outside of the training process). The training game window exists as a kind of persistent (yet evolving) environment that the current network exists within, and which the 'souped-up' mutation operator of neural network training uses to produce new candidate networks which are more likely to be better than their parents than random.

(That's not to say that the evolving library of games is not something worthwhile on its own; indeed, most of what the Go community has gotten from Deep Mind has been a bunch of games -- no one has access to play any of the Alphas directly anymore, they can just look back on the history of their games and try to learn from them.)

In a sense, the window of training games is like a 'ghost population' of dead ancestors (or their memoirs) whose effect on the environment indirectly influences the mutation/evolution of the lone survivor (who also leaves its own memoirs/ghost, and so on), via the super-mutation operator of neural network training. This could also be considered a form of co-evolution, which is a very powerful evolutionary engine, and may partially explain why the Alpha methods work so well. See https://en.wikipedia.org/wiki/Coevolution. |
@Dorus, good explanation of the effect of MCTS on the training. Well said.
I was just about to say that we'll eventually always hit the 'no free lunch' limit either way, which is in reference to the no free lunch theorems, which is an interesting topic all on its own. But then I just happened to run across this other NFL article, https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization, which has a subsection titled Coevolutionary Free Lunches. You'd never guess WTF it says (I swear I just found this like 5 minutes ago!):
So now I'm like, "Yeah. What they said." :-) [BTW, that Wikipedia reference [9] is to "Wolpert, D.H., and Macready, W.G. (2005) "Coevolutionary free lunches," IEEE Transactions on Evolutionary Computation, 9(6): 721–735". I found a few slightly different PDF versions of it:
|
Meanwhile I ran a match between 82 (0fb6) and 87 (92dd) to check whether we are actually making progress. The rating difference between the two is 248, which corresponds to a winrate of 80.7%. And the result was:
Within the 95% confidence interval, the winrate is between 64.6% and 81.6%, which translates to a rating difference between 104 and 259. So while it seems the rating difference is not fully cumulative, we are still making some progress. |
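For reference, the winrate-to-Elo conversion used here follows the standard logistic formula; a quick sketch (Python, purely illustrative):

```python
import math

def winrate_from_elo(diff):
    """Expected winrate for an Elo rating difference `diff`."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def elo_from_winrate(winrate):
    """Rating difference implied by an observed winrate."""
    return -400.0 * math.log10(1.0 / winrate - 1.0)

print(winrate_from_elo(248))    # ~0.807, the 80.7% quoted above
print(elo_from_winrate(0.646))  # ~104
print(elo_from_winrate(0.816))  # ~259
```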
oh! |