FPU could be a trainable target #1109

Open
gcp opened this issue Mar 26, 2018 · 45 comments
@gcp
Member

gcp commented Mar 26, 2018

Not something I'd do for LZ right now, but an interesting idea to push things further:

Someone on the Leela Chess Zero list asked:

Hello, I still cannot understand the reason for using both a Policy and a Value head for position evaluation in the A0/LC0 projects. I am sure it has a good reason, but I still cannot understand why the policy head is necessary.
If you can evaluate a position P by its Value and know which moves are legal, you can simply select the best move based only on its Value. So couldn't the Policy head be replaced by just choosing the next position that has a relatively good Value?

He was given various explanations of why it's not needed, why it's a bad idea, etc. But I think it's a good idea!

Yes.

Your idea is correct, i.e. instead of search probabilities, the policy head could output estimated winrates for each move.

It's easy to understand if you consider that the probabilities are currently a normalized mapping from 0 to 1.0 - but there's no need for the network output to be this. Nothing prevents you from multiplying that by E(v) and scaling it.

As to why this isn't done:

a) The 2 "heads" are essentially free since computing them is a tiny fraction of the cost of evaluating the network itself.
b) The formulation of the search algorithm requires a probability (not an evaluation) in its current form.

Now, that all being said, one big problem of the current algorithm is what to use as the FPU. And what you proposed is exactly to calculate and train this.

Thus, one could use this design:

a) a policy head producing search probabilities for every move
b) a value head producing move evaluations for every move

Now E(v) is indeed max(children), and we can immediately initialize the FPU of every move, as that is exactly what the value head produces.

Also consider that FPU reductions exist exactly to solve the problem that we have only one E(v) for the root and a bunch of probabilities for the moves, which we then have to turn into an E(v) for every move!

If this doesn't work well, you could keep the regular value head too. But anyway, the idea is this: instead of trying to come up with tricky formulas for FPU, just make the FPU a trainable target.

The only catch is that you won't necessarily have expanded every node at the root, so the question is what to pick as the training target for those moves. I propose min(E(v)), i.e. the minimum over all children, somewhat similar to setting the search probability to 0% for unsearched moves.

You can't use the current training data for this; you'd need a new format that stores the vector of winrates after every search. It's also possible this doesn't improve on the current FPU reduction, which IMHO just implements the above by trying to translate the search probabilities into good FPU guesses.
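
To make the proposal concrete, here is a minimal sketch (hypothetical Python, not actual LZ pipeline code) of how the per-move value target could be assembled from one finished root search, with min() over the visited children standing in for the unvisited moves:

```python
# Hypothetical sketch, not LZ code: build the per-move value target for one
# position after search, following the proposal above.  Visited children get
# their searched winrate; unvisited children get the minimum over the visited
# ones, analogous to giving unsearched moves a 0% search probability.
# Assumes at least one child was visited.

def move_value_targets(children):
    """children: list of (move, visits, winrate) tuples for the root's children."""
    floor = min(w for _, v, w in children if v > 0)
    return {move: (winrate if visits > 0 else floor)
            for move, visits, winrate in children}
```

The training record would then have to carry this vector of winrates for every searched position, which is exactly the new data format mentioned above.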

@ghost

ghost commented Mar 26, 2018

I like this idea. Rather than coming up with fpu formulas ourselves, use the NN instead.

Edit: Can the client be modified to output this additional training data for future use?

@gcp
Member Author

gcp commented Mar 26, 2018

Edit: Can the client be modified to output this additional training data for future use?

Yes but not in a backwards compatible manner. That is, you can add support for it, but you can't really make data "for future use" without making the current pipeline a mess.

@leela-zero leela-zero deleted a comment from jkiliani Mar 26, 2018
@leela-zero leela-zero deleted a comment from jkiliani Mar 26, 2018
@gcp
Member Author

gcp commented Mar 26, 2018

Deleted a bunch of offtopic posts; I replied to them in the original thread, where that question had already been asked...

@remdu
Contributor

remdu commented Mar 26, 2018

It seems like a good step in the direction DeepMind seems to be taking with MCTSnets (https://arxiv.org/pdf/1802.04697.pdf), giving more and more control of the search to a neural network. It might be trickier than it seems, since min(E(v)) will tend to underestimate the values of unexplored moves, sometimes by a lot. But maybe that's OK.

@ghost

ghost commented Mar 26, 2018

Noise and randomcnt should help improve min(E(v)), as they currently do for the policy output.

@remdu
Contributor

remdu commented Mar 26, 2018

For what it's worth, I had a try at some point with using the last expanded node's Q value as FPU. I think I had results equivalent to sqrt_tvp when using that minus 0.1*sqrt_tvp (I'll look at my notes this evening). The idea was that the last expanded node is the expanded node with the lowest policy, so it can be found easily, and it is the closest to the unexpanded node with the highest policy (which is the only one we really care about).

@isty2e

isty2e commented Mar 26, 2018

It is presumably worth thinking about how to define the loss function (probably cross-entropy?). Should it include unexpanded nodes or not? Also, nodes with few visits (like <10) probably won't be good training targets. It will take some time to fill in the details.

@gcp
Member Author

gcp commented Mar 26, 2018

Should it include unexpanded nodes or not?

You need an output for them, so yes, you have to include them one way or another.

Also, nodes with few visits (like <10) probably won't be good training targets.

Why on earth wouldn't they be? You're looking for a good first estimate of E(v). A 10-playout search is massively better than nothing.

@roy7
Collaborator

roy7 commented Mar 26, 2018

In a situation where we are going to try and directly train FPU for all nodes, would we maybe want to just expand every child of the root? That way, training the FPU always has at least one nn_eval of the resulting board state after the move. No guessing or unknown targets at all.

@isty2e

isty2e commented Mar 26, 2018

I mean, we do need some output for unvisited nodes, but whether to include them in the loss function is something to think about. Of course we don't want those values to be random by excluding them from the loss function, but including them naively may be undesirable in practice, since those moves will dominate in number.

Also, I do agree that a few playouts are probably better than nothing, but many playouts will definitely be better, so the two should not be treated equally. Perhaps we can give larger weights to more-visited nodes in the loss function, but in a nonlinear way, such that the weight starts from a finite (non-zero) value when visits = 0 and increases slowly as the number of visits of the node grows.
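
A sketch of the kind of weighting curve being suggested (constants and names purely illustrative, not from any existing code):

```python
import math

# Illustrative only: a per-move loss weight that is finite and non-zero at
# visits == 0 and grows slowly (here logarithmically) with the visit count,
# as suggested above.  The constants are placeholders, not tuned values.
def target_weight(visits, base=0.25, scale=0.25):
    return base + scale * math.log1p(visits)

# target_weight(0) == 0.25, target_weight(10) ~= 0.85, target_weight(10000) ~= 2.55
# A weighted loss would multiply each move's squared error by this factor.
```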

@ghost

ghost commented Mar 26, 2018

Since the policy output can be trained on few playouts with no problems, I think the E(v) output shouldn't be any different.

@isty2e

isty2e commented Mar 26, 2018

Assume that move A has 10k visits with a winrate of 55% and move B has only 10 visits, also with a 55% winrate. The training data for policy training will clearly include a much higher probability for A, so a more-visited node is rewarded, in some sense. But if we naively apply equal weights for the FPU target, that effect is not taken into account, even though the quality of the training data for a move depends on how many visits its node received.

@gcp
Member Author

gcp commented Mar 26, 2018

I don't see what your point is. If the policy probability of a move is 1/3199 because it got a single visit, we don't weight it down because the error on the probability is huge.

@isty2e

isty2e commented Mar 26, 2018

For the probability it does not matter, since it is naturally low and its impact on the search is quite small. But for the FPU, it becomes a different story since a less visited node can still have a high FPU. Of course it will (hopefully) be better than nothing, but I don't think it should be treated equally to nodes with more visits.

@jkiliani

Would you do this by essentially using a triple head for the network, instead of the current dual one? With the FPU head identical to the policy head in architecture, and the loss function for training weighted 1/3 each for cross-entropy loss, regular MSE and FPU MSE?
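
Something along these lines, presumably (a rough sketch with placeholder names, not the actual training code):

```python
# Rough sketch of the equally weighted three-term loss described above.
# All names are placeholders; inputs are plain Python lists and floats.

def triple_head_loss(policy_logp, search_probs,   # per-move log-probs vs. search probabilities
                     value_out, game_result,      # scalar evaluation vs. +/-1 outcome
                     fpu_out, fpu_target):        # per-move evaluations vs. per-move targets
    policy_loss = -sum(t * lp for t, lp in zip(search_probs, policy_logp))  # cross-entropy
    value_loss = (value_out - game_result) ** 2                             # regular MSE
    fpu_loss = sum((o - t) ** 2 for o, t in zip(fpu_out, fpu_target)) / len(fpu_out)  # FPU MSE
    return (policy_loss + value_loss + fpu_loss) / 3.0
```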

@gcp
Member Author

gcp commented Mar 26, 2018

Would you do this by essentially using a triple head for the network, instead of the current dual one?

You might not need the value head as it could be equivalent to max(E(v)) of the "FPU" head.

Of course, practical results override theoretical possibilities.

@jkiliani

jkiliani commented Mar 26, 2018

Well, the training targets are different, aren't they? Value head would always be trained to reduce MSE to game result, while FPU head would be trained to reduce MSE to child evaluation.

@gcp
Member Author

gcp commented Mar 26, 2018

But for the FPU, it becomes a different story since a less visited node can still have a high FPU.

It's actually hard to make this true because of how the search behaves. And I'd think that in that case, you'd want the high FPU, for sure.

(i.e. in Leela's search output you aren't likely to find a move with a high score that has low visits)

@gcp
Member Author

gcp commented Mar 26, 2018

Well, the training targets are different, aren't they? Value head would always be trained to reduce MSE to game result, while FPU head would be trained to reduce MSE to value head.

True, in general you see better results when training the value head to the 0/1 value instead of the search output. So for that a triple head would work. But it might even work without it.

@jkiliani

The idea sounds really promising, since you would essentially amplify the training data by using the child node data much more than currently. Training to both game result and search output may also regularise the network better.

@isty2e

isty2e commented Mar 26, 2018

It's actually hard to make this true because of how the search behaves.

That is absolutely true in the N_visits -> inf limit, but when the total number of visits is limited, it is not always the case, I suppose. The estimated winrate for a move can always change as the search progresses, and it becomes more accurate as the node is visited more. So my question is: isn't it better to treat the data from more-visited nodes as more important?

FPU head would be trained to reduce MSE to value head.

This is interesting, but what would be the loss function that accounts for visited nodes, for example? Could you elaborate a bit?

@Ishinoshita

Ishinoshita commented Mar 26, 2018

@jkiliani: "Training to both game result and search output may also regularise the network better."

I remember that I initially misread AGZ and thought they were training both the policy and the value head on the tree search result (exponentiated visit counts of child nodes for the policy head & action value accumulated at the root for the value head). I understand that such a training target for the value head may suffer from serious convergence issues.
However, using a full game to generate a single training sample is such a waste. There might be more than one lesson to learn from a game, at various stages of the game. Now, reading the discussion in this thread and considering that the value head is now quite knowledgeable, I'm wondering if a target for the value head that mixes the game outcome and the tree search result (as a kind of baseline) would be meaningful enough to allow sampling positions multiple times per game, while preventing overfitting too much on the game outcome, since the latter would be 'tempered' by the current evaluation from the tree search. This would not require redefining the value head, just the training target for value.

@lp200

lp200 commented Mar 26, 2018

The shogi program YaneuraOu became pretty strong using this loss function:

grad = (1 - lambda) * (eval_winrate - t) + lambda * (eval_winrate - teacher_winrate)

eval_winrate = root_winrate
teacher_winrate = search_winrate
t = game_result
https://github.com/yaneurao/YaneuraOu/blob/master/source/learn/learner.cpp#L1123
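
Rewritten in Python for readability (an illustration of the quoted snippet, not the actual learner code), the gradient amounts to training the evaluation towards a target that interpolates between the game result and the deeper search's winrate:

```python
# Illustration of the gradient quoted above.  Algebraically it equals
# eval_winrate - ((1 - lam) * game_result + lam * teacher_winrate),
# i.e. the target is a lambda-blend of the game outcome and the search result.
def blended_grad(eval_winrate, game_result, teacher_winrate, lam):
    return (1.0 - lam) * (eval_winrate - game_result) \
         + lam * (eval_winrate - teacher_winrate)
```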

@TFiFiE
Contributor

TFiFiE commented Mar 27, 2018

There should have been a direct link to the originating thread in the first post.

@remdu
Contributor

remdu commented Mar 28, 2018

You might not need the value head as it could be equivalent to max(E(v)) of the "FPU" head.

I personally think this is false. It could be close enough to the truth that it would work, though.

Anyway, if at some point we explore options to extend the predictions the net makes in order to make the tree search more effective, why limit ourselves to a head that predicts the Q values of child nodes? There are a lot of tree statistics apart from visit counts and Q values that could be very interesting to have the net predict. I think the way it's currently done, there is a lack of information from the net. When it says a move has a 0.50 prior, what does that mean exactly? That in half the example positions it was visited 2 times and in the other half it was searched 3200 times? Or that it was consistently visited 1600 times?
It's important to consider that for the "move value head" proposed in this thread. If this head says a move has a value of 0.10, it seems important to know whether it was a sure win in 1/10 of the training positions and a suicide in the other 9/10, or whether it was consistently 0.10. If the net were able to output such statistics, that would make it better at finding, say, capture moves that are normally ruled out.

I'm wondering, is it possible for the net to output upper and lower bounds for the predictions it makes? For training, it would be about not punishing it when it gives an upper bound larger than what the value truly ended up being, for example. It seems like it might be possible for the policy head, for example? Please correct me if I'm wrong, though.

@roy7
Collaborator

roy7 commented Mar 29, 2018

Someone from OGS was doing this tonight,

https://gist.github.com/AncalagonX/75892d7f4f9fed74d4cf7c91f719d92f

to basically force the root to explore every child to a minimum visit count before normal search takes over. For a simple test, the minimum could be 1. So every root child gets 1 visit and fpu estimates aren't even used/needed any more for the root node. No idea what this does for strength.

@AncalagonX
Contributor

I'm the OGS guy who provided that initial snippet. I've slightly expanded on this idea: the highest "value" head search results get about half the processor time, up to a max of 1000 visits, and the "vanilla winrate + puct" search results get the other half.

Effectively, any root node with >= 1000 visits is allowed to compete with the "vanilla winrate + puct" results in an "eval-sorted" (value-sorted a.k.a. winrate-sorted) results list. The highest-scoring result is played.

I have only had time to play one game vs. unchanged LZ 0.13, but by move 100 both versions agreed the game was a 63.5-38.5 split, and the split got more severe after that until the modified LZ won the game.

@AncalagonX
Contributor

AncalagonX commented Mar 29, 2018

My previous post isn't statistically significant, so I'll come back later with larger numbers of game results.

But the extremely polarized nature of LZ's policy network led me down this road. Often, it only considers a SINGLE move, and dumps untold numbers of visits into examining a SINGLE root node (a.k.a. next move candidate). Even when it considers multiple moves, it is still too heavy-handed in how it rates policy vs. non-policy moves.

It seems that allocating a certain percent of visits to non-policy root nodes may have significant value. It allows the network to choose a better value-ranked root node when there is no better (value+policy)-ranked root node.

In my case above, "winrate" was used until 6400 playouts (not visits) on that turn, with a max of 1000 visits allowed per root node. After 6400 playouts were performed that turn, the default "winrate + puct" was used for the remaining 3600 playouts to a max of 10-11k playouts that turn. Finally, an "eval-sorted" (value-sorted) results list was used to pick from any move with the highest eval that had more than 1000 visits.

I will bring more match results back. My implementation is crude, not only in theory but also in programming ability. I'm not a statistics guy, nor a CS major. If there's really something here after I bring back more matches, someone else should be able to improve upon it. Again, this effort was chiefly motivated by how sharp LZ's policy network has become, and by wanting to give alternate moves some CPU time for consideration.

Edit: 2 wins 1 loss. Don't get too excited. I might not be able to bring more data for a while.

@gcp
Member Author

gcp commented Mar 29, 2018

So every root child gets 1 visit and fpu estimates aren't even used/needed any more for the root node. No idea what this does for strength.

I think we tried this when experimenting with FPU stuff, or it was even the initial default, and it was just bad (it's equivalent to FPU=>1.0 IIRC).

Most of those "let's explore more" ideas seem appealing but they generally just make the program weaker. Spending a lot of effort on moves you know are extremely likely to be bad just isn't very efficient.

Also it doesn't really have anything to do with FPU, because that controls only the first visit.

@ghost

ghost commented Mar 29, 2018

An easy way to test how big an impact a more correct FPU could have is to change the code to send one visit down each child while carefully not counting those as actual visits, i.e. count a visit only after a child is returned as uct_best_child, and use the rest as a replacement for FPU.

Comparing that at a fixed number of "actual visits" to the current implementation with the same visits gives an upper bound on how much better LZ could get with a perfect FPU.
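
A sketch of that experiment (hypothetical Python, not the leelaz C++ tree code):

```python
# Hypothetical sketch of the "probe visit" experiment described above: every
# child gets one network evaluation that is used in place of an FPU estimate,
# but is never counted in the visit/value statistics the search reports.

class Child:
    def __init__(self, move):
        self.move = move
        self.visits = 0          # incremented only when uct_best_child picks it
        self.value_sum = 0.0
        self.probe_eval = None   # one-off eval standing in for the FPU

    def selection_value(self):
        if self.visits > 0:
            return self.value_sum / self.visits
        return self.probe_eval   # "perfect FPU" for a never-visited child

def probe_children(children, net_eval):
    for child in children:
        child.probe_eval = net_eval(child.move)  # not added to visits/value_sum
```

Matching the two versions at the same number of counted visits then bounds what a perfect FPU could be worth.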

@alreadydone
Contributor

alreadydone commented Jun 5, 2018

Chao Gao, Martin Mueller, and Ryan Hayward have a new paper titled "Three-Head Neural Network Architecture for Monte Carlo Tree Search". Since the full text isn't available, I opened an issue to ask whether it's related: cgao3/benzene-vanilla-cmake#1

@jillybob

jillybob commented Jun 5, 2018

Three-Head Neural Network Architecture for Monte Carlo Tree Search
Chao Gao, Martin Müller, Ryan Hayward

AlphaGo Zero pioneered the concept of two-head neural networks in Monte Carlo Tree Search (MCTS), where the policy output is used for prior action probability and the state-value estimate is used for leaf node evaluation. We propose a three-head neural net architecture with policy, state- and action-value outputs, which could lead to more efficient MCTS since neural leaf estimate can still be back-propagated in tree with delayed node expansion and evaluation. To effectively train the newly introduced action-value head on the same game dataset as for two-head nets, we exploit the optimal relations between parent and children nodes for data augmentation and regularization. In our experiments for the game of Hex, the action-value head learning achieves similar error as the state-value prediction of a two-head architecture. The resulting neural net models are then combined with the same Policy Value MCTS (PV-MCTS) implementation. We show that, due to more efficient use of neural net evaluations, PV-MCTS with three-head neural nets consistently performs better than the two-head ones, significantly outplaying the state-of-the-art player MoHex-CNN

@alreadydone
Contributor

alreadydone commented Jun 5, 2018

Great, so confirmed. Where did you find it? I searched the title and found nothing.
Edit: OK, I found it here: http://static.ijcai.org/2018-Program.html - I should have added quotation marks to my search.

@lp200

lp200 commented Jul 14, 2018

Three-Head Neural Network Architecture for Monte Carlo Tree Search
https://www.ijcai.org/proceedings/2018/0523.pdf

@Hersmunch
Member

So the architecture in the paper above looks pretty interesting. @gcp, would changes to support different architecture(s) like this be welcome? Or would you prefer to have some performance evidence of nets trained using a proposed architecture first, i.e. try it all out on a branch first? I think I know what your answer will be :)
I don't have the capability to train a network (as TensorFlow doesn't appear to support my AMD GPU) but wouldn't mind helping out with making changes to the leelaz code. So is it worth pursuing?

@odeint

odeint commented Jul 19, 2018

The additional action-value head sounds super interesting; it would be great if that could be implemented. As the authors say, it's likely that the results will be even better once a few self-play cycles are done, so that the system can adapt. For testing, as they did, we should already be able to see an effect when training on existing data.

@Ishinoshita

Very interesting indeed. Having this Q-value head would open the field of exploration even more, I presume, in terms of tree-search algorithms. The authors stress that

The main advantage brought by this architecture is that the action-value can be immediately back-propagated to the tree without expanding a node.

If I read it correctly, the paper doesn't mention the possibility of using the action-value, or some value derived from it, as the FPU for the expanded node.

I also suppose that other approaches could be used to train this action-value head, like using the winrate of each move that received visits as the training target (V%). As of now, however, I understand that these values are thrown away and not uploaded as part of the training data, correct?

@lightvector

lightvector commented Jul 19, 2018

I actually tested this idea (but NOT in the LZ framework) just in the last few days. Here are my notes:

Unless I'm misinterpreting the paper, I believe their proposed loss function does not result in the learning of an accurate action-value-head due to noticeable bias in its construction that never vanishes even in the limit of lots of data. In particular, a hypothetical neural net that makes perfect action-value predictions on a large set of training data would NOT experience zero gradient under their loss function.

So this makes me much less excited than I was (I guess my expectations were too high). But I do think there's still possible value here in trying it out or something similar.

I am NOT confident that I have not made a mistake here, so take this with a grain of salt.

Here is the argument to show why a locally perfect neural net is not a local minimum of their proposed loss function.

Consider a specific Go position s, and assume we have an extraordinarily large number of data samples for that position (s,a,z), where a is the actual move played in that data sample and z (-1 or 1) is the value target resulting from finishing the game after playing a. Assume that there are only a few good moves, played in almost all samples, that all have almost identical winning rates, but that we still have so extraordinarily many samples that every move (even the really bad ones) has enough samples that you could get a good estimate of q(s,a) for it simply by taking the average value of z over those samples, despite those samples being only a tiny proportion of the data.

Assume that our neural net currently has weights such that, when run on s, the policy head outputs the exact distribution of a, the value head outputs exactly the average value of z, and the action-value head for each action a' outputs exactly the average value of z restricted to samples with action a'. This is the "perfect" output given this data set. Now, imagine providing this entire dataset as one giant batch, set the L2 regularization coefficient to 0 so we can ignore it, and perform a gradient step. What happens? Here are the components of their loss function:

(Note that they define q(s,a) as the value from the perspective of the OTHER player after you make move a, so for a good move it should be equal to the negative of v(s).)

  • (1) -log p(a|s) : contributes 0 gradient, since the policy head exactly fits the batch statistics.
  • (2) 0.5(z-v(s))^2 : contributes 0 gradient, since the value head exactly fits the batch statistics.
  • (3) 0.5(z+q(s,a))^2 : contributes 0 gradient, since the action-value head exactly fits the batch statistics.
  • (4) (min_a' q(s,a') + v(s))^2 : contributes negligible gradient since almost all data samples are focused on good moves and those good moves have almost identical winrates, so min_a' q(s,a') and v(s) should be almost exactly negatives.
  • (5) sum_a' max(-z,0)/|A(s)| * (z+q(s,a'))^2 : Large gradient! For samples with z = 1, this does nothing. For samples with z = -1, it trains the action-value head to push its estimate of q(s,a') towards 1 for all a', by a quite large amount.

Basically, term (5) makes it so that any time an action is not the one chosen, its action-value gets pushed a little towards being a loss (a win for the opponent) whenever the game itself was a loss. This happens on every sample, so even though by assumption we have enough data to get a good action-value estimate for some rare move a', if it is percentage-wise rare it will still get drowned out even with infinite data and still be forced close to being viewed as a loss for us (a win for the opponent).
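
To put a rough number on that effect, here is a toy minimization (back-of-the-envelope, taking only terms (3) and (5) exactly as listed above; the assumptions - a rare move whose true result is 50/50, an overall loss rate of 50%, and |A(s)| = 362 - are illustrative):

```python
# Toy illustration of the bias described above.  For a move a' played in a
# fraction f of the samples, with a true 50% winrate (so the "right" q(s,a')
# is 0 from the opponent's perspective) and P(z = -1) = 0.5 overall, the
# expected per-sample loss as a function of q = q(s,a') is
#     f * 0.5 * (q**2 + 1)       from term (3)
#   + (0.5 / A) * (q - 1)**2     from term (5), with A = |A(s)|
# which is minimized at q = 1 / (1 + A*f), not at the correct value 0.

def biased_q(f, num_actions=362):
    return 1.0 / (1.0 + num_actions * f)

for f in (0.1, 0.01, 0.001):
    print(f, round(biased_q(f), 3))   # -> 0.027, 0.216, 0.734

# A move seen in 0.1% of samples gets pushed to roughly q = 0.73 (a win for
# the opponent) no matter how much data there is, while frequently played
# moves (A*f >> 1) stay close to their empirical value.
```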


Now, a counterpoint: the term (3) 0.5(z+q(s,a))^2 does make it so that for moves that are played frequently, the action-value head will be incentivized to converge to a value close to the empirical q(s,a) statistics. And if they are played frequently, then term (5) will contribute only a weak effect, due to the 1/|A(s)| factor. So it could be great for that use. The strength gain they observed in the paper might simply be because they had poor settings around FPU-like things beforehand and anything reasonable like this would be better, or it might be because of these genuinely better q(s,a) estimates for common moves. So there's still much promise here.

However, I do suspect that the action-value head in this architecture is useless for setting first-play urgency for most moves past the first few that the net considers. Because at that point, the "data augmentation" term (5) becomes dominant and pulls the neural net's output towards thinking that the move completely loses regardless of whether it does or not, so it's almost equivalent to outright pruning all moves that fall below some threshold in the NN policy - and if you wanted that behavior, you don't really need a fancy third head for it.

I'll post a followup reply here with some pictures that I got from testing that are consistent with this.

@lightvector

lightvector commented Jul 20, 2018

Here is the action-value head (a 12-block resnet, non-LZ-compatible architecture) of a neural net I trained on LZ data, all games from LZ105-LZ142 and ELF, for about 140M training samples' worth of batches. The action-values are viewed from the perspective of the current player, rather than from the opponent's perspective as the paper defines them.

The color scale is such that very roughly gray = 0%, red 10%, orange 25%, yellow 35%, green 50%, bluegreen 70%, whiteish green >80%. (None of these pictures have any bluegreen or whiteish green values).

The game is about even. You can see here that only 7 moves have the action-value head produce an estimate close to 50%; these are precisely the moves the policy head is putting enough mass on. The reddish-orange moves have "action values" of about 10-25%, and most of the board is dead gray, with less than 1% estimated action-value winrate.

For a lot of the gray moves, if you actually play them, they of course aren't anywhere near that bad. For example, if you play at A (M17), the value head thinks that after white's response black goes down from about 46% to about 45%, rather than all the way down to 1% or 0% like the action-value head is outputting. As expected from the theoretical analysis, term (5) has caused the action-value head to push them massively towards 0%, even if they actually are still okay or lose only slightly.

For comparison, here is the policy:

And here is the action-value and policy on another position:

The action-value head gives fairly reasonable numbers for about 8-10 of the moves, again mostly the ones the policy puts more than a few percent of weight on. All other moves (all the orange and red ones, plus of course the whole mass of grey) have massively lower action-values than the current value head would suggest - it still sees the position as about even, very slightly favoring white.

Some of the more-edgy green moves are still noticeably depressed towards zero. For example, the value head is about 55% for white right now, and the action-value head says playing F15 would drop white to 47%. If you actually play it, the value head drops to only 53%, and in fact goes back up slightly over the next few moves (bouncing around noisily).


I have NOT tested the strength of various ways of integrating the action-value head into the search, since I've only just barely finished implementing MCTS in my own sandbox codebase. If someone is excited enough to try training within LZ codebase and testing in actual play, that would of course be very interesting.

I do have the concern that if used in too "naive" of a way for purposes like FPU where the near-zero outputs of the action-value head are used in a way that effectively prunes almost all moves, it's possible that you may see improvements at low playouts but a weakening of strength at very high playouts due to blind spots compared to the current LZ FPU formula, since with FPU = ~loss_value you will obviously prune many moves that would have gotten at least a little exploration under the current LZ formula once you reached high enough playouts. But of course, testing still needed.

@Tsun-hung

@lightvector How strong is your network? Could you release it so we can have a test?

@lightvector

lightvector commented Jul 20, 2018

It's not currently an LZ-compatible network because it has multiple architectural differences, so you might have trouble testing it with LZ. The input features are also different - it takes liberties and ladders as input features and uses a one-hot representation for the history. You can see most of the differences here (except for a missing value head, which I haven't pushed yet): https://github.com/lightvector/GoNN/blob/master/model.py. If you still think it would actually be useful to you, I could post the weights for a three-headed version of this model. The weights would simply be a TensorFlow Saver.save() of an instance of this model graph, except with two fewer blocks.

Edit: and of course, no idea how strong it is, as I mentioned I only just finished implementing MCTS on my own. Learning how all this stuff works one step at a time... :)

@Hersmunch
Member

@lightvector So if I understand correctly, it seems the behaviour you describe is a result of using "the AND constraint" as a way of saying "if the position is lost then it doesn't matter what is played; all moves result in a loss". This doesn't immediately sound like a bad thing to consider, so what am I missing? Do you think the loss function shouldn't be attempting to encode a statement that would only be true for perfect play?
Also, could a different loss function be used that only considers the move that was actually played? Perhaps something like multiplying the (z+q(s,a'))^2 term of your (5) above by 1 if a' was the move played and 0 if it wasn't?

@lightvector

lightvector commented Jul 22, 2018 via email

@alreadydone
Contributor

alreadydone commented Apr 23, 2019

Just learned about this old paper from DM: https://arxiv.org/pdf/1511.06581.pdf
https://medium.freecodecamp.org/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682 Dueling DQN (aka DDQN)
(This doesn't have a policy network (uses epsilon-greedy) or tree search, though.)

@iopq

iopq commented Apr 24, 2019

But for the FPU, it becomes a different story since a less visited node can still have a high FPU.

It's actually hard to make this true because of how the search behaves. And I'd think that in that case, you'd want the high FPU, for sure.

(i.e. in Leela's search output you aren't likely to find a move with a high score that has low visits)

It's very common in the opening, where 3-3 and pincers might be good moves but are hated by the bot. It may not search them, but if you play them, the evaluation doesn't change.
