FPU could be a trainable target #1109
Comments
I like this idea. Rather than coming up with fpu formulas ourselves, use the NN instead. Edit: Can the client be modified to output this additional training data for future use? |
Yes but not in a backwards compatible manner. That is, you can add support for it, but you can't really make data "for future use" without making the current pipeline a mess. |
Deleted a bunch of offtopic posts; I replied to them in the original thread, where that question was also already asked... |
It seems like a good step in the direction DeepMind seems to be taking with MCTSnets (https://arxiv.org/pdf/1802.04697.pdf), giving more and more control of the search to a neural network. It might be trickier than it seems, since min(E(v)) will tend to underestimate the unexplored moves' values, sometimes by a lot. But maybe that's OK. |
Noise and randomcnt should help with improving min(E(v)) as it does currently for the policy output. |
For what it's worth, I had a try at some point with using the last expanded node's Q value as FPU. I think I had results equivalent to sqrt_tvp when using that minus 0.1*sqrt_tvp (I'll look at my notes this evening). The idea was that the last expanded node is the expanded node with the lowest policy, so it can be found easily, and it is the closest to the unexpanded node with the highest policy (which is the only one we really care about). |
It is presumably worth thinking about how to define the loss function (probably cross-entropy?). Should it include unexpanded nodes or not? Also, nodes with few visits (like <10) probably won't be good training targets. It will take some time to fill in the details. |
You need an output for them, so yes, you have to include them one way or another.
Why on earth wouldn't they be? You're looking for a good first estimate of E(v). A 10-playout search is massively better than nothing. |
In a situation where we are going to try and directly train FPU for all nodes, would we maybe want to just expand every child of the root? Then training the FPU always has at least one nn_eval of the resulting board state after the move. No guessing or unknown targets at all. |
I mean, we do need some output for unvisited nodes, but calculating the loss function with them is something to think about. Of course we don't want those values to be random by excluding them from the loss function, but doing it naively may be undesirable for practical reasons, since those moves will dominate in number. I also agree that a few playouts are probably better than nothing, but many playouts will definitely be better, so the two shouldn't be treated as equal. Perhaps we can allocate larger weights to more-visited nodes in the loss function, but in a nonlinear way, such that the weight starts from a finite (non-zero) value at visits=0 and increases slowly as the number of visits of the node increases. |
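The nonlinear visit weighting described above could be sketched like this (the logarithmic form, the base weight, and the function names are my own illustrative assumptions, not anything from the thread):

```python
import math

def visit_weight(n, base=1.0, scale=0.1):
    # Hypothetical weighting: finite (non-zero) at n = 0 visits,
    # growing slowly (sublinearly) as the visit count increases.
    return base + scale * math.log1p(n)

def weighted_fpu_loss(targets, predictions, visits):
    # Visit-weighted mean squared error over the children of a node.
    num = sum(visit_weight(n) * (t - p) ** 2
              for t, p, n in zip(targets, predictions, visits))
    den = sum(visit_weight(n) for n in visits)
    return num / den
```

With this shape, an error on a heavily visited child costs more than the same error on an unvisited child, but unvisited children still contribute to the loss.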
Since policy output can be trained on few playouts with no problems, I think E(v) output shouldn't be different. |
Assume that move A has 10k visits with a winrate of 55% and move B has only 10 visits, also with a 55% winrate. The training data for policy training will clearly include a much higher probability for A, so a more visited node is rewarded, in some sense. But if we naively apply equal weights for the FPU, then, since the quality of the training data for a move depends on the number of visits for that node, this effect is not well taken into account. |
I don't see what your point is. If the policy probability of a move is 1/3199 because it got a single visit, we don't weight it down because the error on the probability is huge. |
For the probability it does not matter, since it is naturally low and its impact on the search is quite small. But for the FPU it becomes a different story, since a less-visited node can still have a high FPU. Of course it will (hopefully) be better than nothing, but I don't think it should be treated the same as nodes with more visits. |
Would you do this by essentially using a triple head for the network, instead of the current dual one? With the FPU head identical to the policy head in architecture, and the loss function for training weighted 1/3 each for cross-entropy loss, regular MSE, and FPU MSE? |
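A minimal sketch of that equally weighted three-head loss (the 1/3 weights come from the suggestion above; all function names and the data layout are my own assumptions, not an existing implementation):

```python
import numpy as np

def softmax_cross_entropy(pi_target, policy_logits):
    # Cross-entropy between the search-derived target distribution
    # and the softmax of the policy logits.
    p = np.exp(policy_logits - policy_logits.max())
    p /= p.sum()
    return float(-np.sum(pi_target * np.log(p + 1e-12)))

def triple_head_loss(pi_target, policy_logits, z, value_pred,
                     q_targets, q_preds, w=(1/3, 1/3, 1/3)):
    # Policy cross-entropy + value MSE to the game result
    # + FPU-head MSE to the observed child evaluations.
    l_policy = softmax_cross_entropy(pi_target, policy_logits)
    l_value = (z - value_pred) ** 2
    l_fpu = float(np.mean((q_targets - q_preds) ** 2))
    return w[0] * l_policy + w[1] * l_value + w[2] * l_fpu
```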
You might not need the value head as it could be equivalent to max(E(v)) of the "FPU" head. Of course, practical results override theoretical possibilities. |
Well, the training targets are different, aren't they? Value head would always be trained to reduce MSE to game result, while FPU head would be trained to reduce MSE to child evaluation. |
It's actually hard to make this true because of how the search behaves. And I'd think that in that case, you'd want the high FPU, for sure. (I.e., in Leela's search output you aren't likely to find a move with a high score that has low visits.) |
True, in general you see better results when training the value head to the 0/1 value instead of the search output. So for that a triple head would work. But it might even work without it. |
The idea sounds really promising, since you would essentially amplify the training data by using the child node data much more than currently. Training to both game result and search output may also regularise the network better. |
That is absolutely true in the N_visits->inf limit, but when the total number of visits is limited, it is not always the case, I suppose. The estimated winrate for a move can always change as the search progresses, and it becomes more accurate as the node is visited more. So my question is: isn't it better to consider the data for more-visited nodes as more important?
This is interesting, but what would be the loss function that accounts for visited nodes, for example? Could you elaborate a bit? |
I remember that I initially misread AGZ and thought they were training both the policy and the value head on the tree search result (exponentiated visit counts of child nodes for the policy head & action value accumulated at the root for the value head). I understand that such a training target for the value head may suffer from serious convergence issues. |
The shogi program became pretty strong using this loss function: grad = (1 - lambda) * (eval_winrate - t) + lambda * (eval_winrate - teacher_winrate), where eval_winrate = root_winrate. |
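Written out as (hypothetical) code, the quoted gradient is a simple blend of the game-result error and the teacher-search error; the function name and argument names are mine:

```python
def elmo_style_grad(eval_winrate, game_result, teacher_winrate, lam):
    # grad = (1 - lambda) * (eval_winrate - t)
    #      + lambda * (eval_winrate - teacher_winrate)
    # where t is the game result and eval_winrate is the root winrate.
    return ((1.0 - lam) * (eval_winrate - game_result)
            + lam * (eval_winrate - teacher_winrate))
```

lam = 0 trains purely toward the game result; lam = 1 purely toward the teacher's search output.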
There should have been a direct link to the originating thread in the first post. |
I personally think this is false. It could be close enough to the truth that it would work, though. Anyway, if at some point we explore options to extend the predictions the net makes in order to make the tree search more effective, why limit ourselves to a head for predicting the Q value of child nodes? There are a lot of tree statistics apart from visit counts and Q values that could be very interesting to have the net predict. I think the way it's currently done, there is a lack of information from the net. When it says a move has a 0.50 prior, what does it mean exactly? That in half the example positions it was visited 2 times and in the other half it was searched 3200 times? Or that it was consistently visited 1600 times? I'm wondering, is it possible for the net to output upper and lower bounds for the predictions it makes? For training, it would be about not punishing it when it gives an upper bound larger than what the value truly ended up being, for example. It seems like it might be possible for the policy head, for example? Please correct me if I'm wrong though. |
Someone from OGS was doing this tonight: https://gist.github.com/AncalagonX/75892d7f4f9fed74d4cf7c91f719d92f, to basically force root to explore every child to a minimum # of visits before normal search takes over. For a simple test, the # could be 1. So every root child gets 1 visit and FPU estimates aren't even used/needed anymore for the root node. No idea what this does for strength. |
I'm the OGS guy who provided that initial snippet. I've slightly expanded on this idea and given the highest "value"-head search results about half the processor time, up to a max of 1000 visits, and given the "vanilla winrate + puct" search results the other half. Effectively, any root node with >= 1000 visits is allowed to compete with the "vanilla winrate + puct" results in an "eval-sorted" (value-sorted, a.k.a. winrate-sorted) results list. The highest-scoring result is played. I have only had time to play one game vs. unchanged LZ 0.13, but by move 100 both versions agreed the game was a 63.5-38.5 split, and the split got more severe after that until modified LZ won the game. |
My previous post isn't statistically significant, so I'll come back later with larger numbers of game results. But the extremely polarized nature of LZ's policy network led me down this road. Often, it only considers a SINGLE move, and dumps untold numbers of visits into examining a SINGLE root node (a.k.a. next move candidate). Even when it considers multiple moves, it is still too heavy-handed in how it rates policy vs. non-policy moves. It seems that allocating a certain percent of visits to non-policy root nodes may have significant value. It allows the network to choose a better value-ranked root node when there is no better (value+policy)-ranked root node. In my case above, "winrate" was used until 6400 playouts (not visits) on that turn, with a max of 1000 visits allowed per root node. After 6400 playouts were performed that turn, the default "winrate + puct" was used for the remaining 3600 playouts to a max of 10-11k playouts that turn. Finally, an "eval-sorted" (value-sorted) results list was used to pick from any move with the highest eval that had more than 1000 visits. I will bring more match results back. My implementation is crude, not only in theory but also in programming ability. I'm not a statistics guy, nor a CS major. If there's really something here after I bring back more matches, someone else should be able to improve upon it. Again, this effort was chiefly motivated by how sharp LZ's policy network has become in allowing alternate moves to be given CPU time for consideration. Edit: 2 wins 1 loss. Don't get too excited. I might not be able to bring more data for a while. |
I think we tried this when experimenting with FPU stuff, or it was even the initial default, and it was just bad (it's equivalent to FPU=>1.0 IIRC). Most of those "let's explore more" ideas seem appealing but they generally just make the program weaker. Spending a lot of effort on moves you know are extremely likely to be bad just isn't very efficient. Also it doesn't really have anything to do with FPU, because that controls only the first visit. |
An easy way to test how big an impact more correct FPU could have is to change the code to send one visit down each child while carefully not counting those as actual visits; i.e., count a visit only after a child is returned as uct_best_child, and use the rest as a replacement for FPU. Comparing that at a fixed number of "actual visits" to the current implementation with the same visits gives an upper bound on how much better LZ can get with perfect FPU. |
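One way the suggested experiment could be sketched (all class and method names here are hypothetical illustrations, not LZ code): give each child one free network evaluation that stands in for the FPU, and only count visits routed through uct_best_child:

```python
class Node:
    def __init__(self, children_evals):
        # children_evals: {move: one-off network eval of the child}.
        # These free evals replace the FPU formula but are carefully
        # NOT counted as actual visits.
        self.free_eval = dict(children_evals)
        self.visits = {m: 0 for m in children_evals}
        self.value_sum = {m: 0.0 for m in children_evals}

    def q(self, move):
        # Fall back to the free evaluation until a real visit arrives.
        if self.visits[move] == 0:
            return self.free_eval[move]
        return self.value_sum[move] / self.visits[move]

    def record_visit(self, move, value):
        # Only results returned via uct_best_child land here, so the
        # free evaluations never inflate the visit counts.
        self.visits[move] += 1
        self.value_sum[move] += value
```

Comparing such a search at a fixed number of real visits against the current implementation would then bound the gain from a perfect FPU.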
Chao Gao, Martin Mueller, and Ryan Hayward have a new paper titled "Three-Head Neural Network Architecture for Monte Carlo Tree Search". Since the full-text isn't available, I opened an issue to ask whether it's related cgao3/benzene-vanilla-cmake#1 |
Three-Head Neural Network Architecture for Monte Carlo Tree Search AlphaGo Zero pioneered the concept of two-head neural networks in Monte Carlo Tree Search (MCTS), where the policy output is used for prior action probability and the state-value estimate is used for leaf node evaluation. We propose a three-head neural net architecture with policy, state- and action-value outputs, which could lead to more efficient MCTS since neural leaf estimate can still be back-propagated in tree with delayed node expansion and evaluation. To effectively train the newly introduced action-value head on the same game dataset as for two-head nets, we exploit the optimal relations between parent and children nodes for data augmentation and regularization. In our experiments for the game of Hex, the action-value head learning achieves similar error as the state-value prediction of a two-head architecture. The resulting neural net models are then combined with the same Policy Value MCTS (PV-MCTS) implementation. We show that, due to more efficient use of neural net evaluations, PV-MCTS with three-head neural nets consistently performs better than the two-head ones, significantly outplaying the state-of-the-art player MoHex-CNN. |
Great, so confirmed. Where did you find it? I searched the title and found nothing. |
Three-Head Neural Network Architecture for Monte Carlo Tree Search |
So the architecture in the paper above looks pretty interesting. @gcp would changes to support different architecture(s) like this be welcome? Or would you prefer to have some performance evidence of nets trained using a proposed architecture first? i.e. try it all out on a branch first. I think I know what your answer will be :) |
The additional action value head sounds super interesting, it would be great if that could be implemented. As the authors say, it's likely that the results will be even better once a few self-play cycles are done, so that the system can adapt better. For testing, like they did, we should already be able to see an effect when training on already existing data. |
Very interesting indeed. Having this Q value head would open the field of exploration even more, I presume, in terms of tree search algorithmics. The authors insist that
If I read it correctly, the paper doesn't mention the possibility of using the action-value, or some value derived from it, as FPU for the expanded node. I also suppose that other approaches could be used to train this action-value head, like using the winrate for each move having received visits as the training target (V%). As of now, however, I understand that these values are thrown away and not uploaded as part of the training data, correct? |
I actually tested this idea (but NOT in the LZ framework) just in the last few days. Here are my notes: Unless I'm misinterpreting the paper, I believe their proposed loss function does not result in the learning of an accurate action-value head, due to noticeable bias in its construction that never vanishes even in the limit of lots of data. In particular, a hypothetical neural net that makes perfect action-value predictions on a large set of training data would NOT experience zero gradient under their loss function. So this makes me much less excited than I was (I guess my expectations were too high). But I do think there's still possible value in trying it out, or something similar. I am NOT confident that I have not made a mistake here, so take this with a grain of salt. Here is the argument to show why a locally perfect neural net is not a local minimum of their proposed loss function. Consider a specific Go position s, and assume we have an extraordinarily large number of data samples for that position, (s,a,z), where a is the actual move played in that data sample and z (-1 or 1) is the value target resulting from finishing the game after playing a. Assume that there are only a few good moves, which are played in almost all samples and all have almost identical winning rates, but that we still have so extraordinarily many samples that every move (even the really bad ones) has enough samples that you could get a good estimate of q(s,a) for it simply by taking the average value of z over those samples, despite those samples being only a tiny proportion of the data. Assume that our neural net currently has weights such that, when run on s, the policy head outputs the exact distribution of a, the value head outputs exactly the average value of z, and the action-value head for each action a' outputs exactly the average value of z restricted to samples with action a'. This is the "perfect" output given this data set.
Now, imagine providing this entire dataset as one giant batch, set the L2 regularization coefficient to 0 so we can ignore it, and perform a gradient step. What happens? Here are the components of their loss function. (Note that they define q(s,a) as the value from the perspective of the OTHER player after you make move a, so for a good move it should be equal to the negative of v(s).)
Basically, term (5) makes it so that any time an action is not the one chosen, its action-value gets pushed a little towards being a loss (a win for the opponent) if the game itself was a loss. This happens on every sample, so despite us by assumption having enough data to get a good action-value estimate on some rare move a', if it is percentage-wise rare it will still get drowned out even with infinite data and still be forced close to being viewed as a loss for us (a win for the opponent). Now, a counterpoint: the term (3) 0.5(z+q(s,a))^2 does make it so that for moves that are played frequently, the action-value head will be incentivized to converge to a value close to the empirical q(s,a) statistics. And if they are played frequently, then term (5) will only contribute a weak effect due to the 1/|A(s)| factor. So it could be great for that use. The strength gain they observed in the paper might be simply because they had poor settings around FPU-like things beforehand and anything reasonable like this would be better, or it might be because of these genuinely better q(s,a) estimates for common moves. So there's still much promise here. However, I do suspect that the action-value head in this architecture is useless for setting first-play urgency for most moves past the first few that the net considers. Because at that point, the "data augmentation" term (5) becomes dominant and pulls the neural net's output to think that the move completely loses regardless of whether it does or not, so it's almost equivalent to outright pruning all moves that fall below some threshold in the NN policy; and if you wanted that behavior, you don't really need a fancy third head for it. I'll post a followup reply here with some pictures that I got from testing that are consistent with this. |
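A small numerical toy (entirely my own construction, with made-up numbers and simplified term weights, so only the direction of the effect is meaningful) is consistent with this drag: even with plenty of samples of a rare-but-winning move a', minimizing term (3) plus term (5) pulls its action-value well away from the empirical estimate:

```python
import numpy as np

# 10,000 hypothetical samples of one position s; a rare move a' is
# played in ~1% of them and always wins for us (z = +1), while the
# other games are an even coin flip. Since q(s,a') is defined from
# the opponent's perspective, a perfect head would output about -1.
rng = np.random.default_rng(0)
n, a_size = 10_000, 50
played = rng.random(n) < 0.01
z = np.where(played, 1.0, rng.choice([-1.0, 1.0], size=n))

# Minimize term (3) + term (5) over a scalar q for move a':
#   sum_{played} (z + q)^2  +  (1/|A(s)|) * sum_{not played} (z + q)^2
# The closed-form minimizer is a weighted average of -z.
w = np.where(played, 1.0, 1.0 / a_size)
q_star = -(w * z).sum() / w.sum()

q_empirical = -z[played].mean()  # the unbiased estimate: exactly -1 here
print(q_empirical, q_star)       # q_star is dragged far from -1, toward 0
```

Despite every sample of a' being a win, the many non-a' samples drown it out, exactly the never-vanishing bias described above.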
@lightvector How strong is your network? Could you release it and we can have a test |
It's not currently a LZ-compatible network because it has multiple architectural differences, so you might have trouble testing it with LZ. The input features are also different: it takes liberties and ladders as input features and uses a one-hot representation for the history. You can see most of the differences here (except for a missing value head, I haven't pushed that yet): https://github.com/lightvector/GoNN/blob/master/model.py. If you still think it would actually be useful to you, I could post the weights for a three-headed version of this model. The weights would simply be a TensorFlow Saver.save() of an instance of this model graph, except with two fewer blocks. Edit: and of course, no idea how strong it is; as I mentioned, I only just finished implementing MCTS on my own. Learning how all this stuff works one step at a time... :) |
@lightvector so If I understand correctly it seems the behaviour you describe is a result of using "the AND constraint" as a way of saying "if the position is lost then it doesn't matter what is played; all moves result in a loss". This doesn't immediately sound like a bad thing to consider to me so what am I missing? Do you think that loss function shouldn't be attempting to combine a statement that would be true only for perfect play? |
@Hersmunch yes I think their AND constraint could be problematic, although it remains to be seen how much it matters when applying it to real search.
Augmenting the data in this way sounds reasonable but any time you data augment you need to be aware of bias. For example imagine a magic oracle handed you a million opening positions where white played two hoshi as the first two moves (in addition to a dozen other moves) where the game theoretic truth was that white had a forced loss after that, but gave you no examples of openings with two hoshi leading to white win (and imagine two hoshi alone is a win, the losses are because of subtle further mistakes). If you naively just add these positions to training, even though the data is truly "correct", the bias will probably cause the net to wrongly learn to dislike playing two hoshi as white.
(Hersmunch also asked in the same comment: "Also, could a different loss function be used that only considers the move that was actually played? Perhaps something like multiplying the (z+q(s,a'))^2 term of your (5) above by 1 if a' was the move played and 0 if it isn't?") |
Just learned about this old paper from DM: https://arxiv.org/pdf/1511.06581.pdf |
It's very common in the beginning, where 3-3 and pincers might be good moves but are hated by the bot. It may not search them, but if you play them it doesn't change the evaluation. |
Not something I'd do for LZ right now, but an interesting idea to push things further:
Someone on the Leela Chess Zero list asked:
He was given various explanations why it's not needed, bad idea, etc. But I think it's a good idea!
If this doesn't work well, you could keep the regular value head too. But anyway, the idea is this: instead of trying to come up with tricky formulas for FPU, just make the FPU a trainable target.
The only catch is that you won't necessarily have expanded every node at the root, so the question is what to pick as the training target for those moves. I propose min(E(v)), i.e. the minimum over all children, somewhat similar to setting the search probability to 0% for unsearched moves.
You can't use the current training data for this, you'd need a new format that stores the vector of winrates after every search. It's also possible this doesn't help with the current FPU reduction, which IMHO just implements the above by trying to translate the search probabilities into good FPU guesses.
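The min(E(v)) target construction described above could be sketched as follows (the data layout and names are illustrative assumptions, not the actual pipeline format):

```python
def fpu_targets(child_q, legal_moves):
    # child_q: {move: averaged winrate E(v)} for children the search
    # actually expanded; legal_moves: all legal moves at the root.
    # Unsearched moves get min(E(v)) over the expanded children,
    # analogous to giving them a 0% search probability.
    floor = min(child_q.values())
    return {m: child_q.get(m, floor) for m in legal_moves}
```

For example, fpu_targets({'a': 0.6, 'b': 0.4}, ['a', 'b', 'c']) fills in 0.4 for the unexpanded move 'c'.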