FPU could be a trainable target #1109

Open
gcp opened this issue Mar 26, 2018 · 45 comments
@gcp
Member

gcp commented Mar 26, 2018

Not something I'd do for LZ right now, but an interesting idea to push things further:

Someone on the Leela Chess Zero list asked:

Hello, I still cannot understand the reason for using both a Policy and a Value head for position evaluation in the A0/LC0 projects. I am sure it has a good reason, but I still cannot understand why the policy head is necessary.
If you can evaluate a position P by its Value and know which moves are legal, you can simply select the best move based only on its Value. So couldn't the Policy head be replaced by just choosing the next position that has a relatively good Value?

He was given various explanations of why it's not needed, why it's a bad idea, etc. But I think it's a good idea!

Yes.

Your idea is correct, i.e. instead of search probabilities, the policy head could output estimated winrates for each move.

It's easy to understand if you consider that the probabilities are currently a normalized mapping from 0 to 1.0 - but there's no need for the network output to be this. Nothing prevents you from multiplying that by E(v) and scaling it.

As to why this isn't done:

a) The 2 "heads" are essentially free since computing them is a tiny fraction of the cost of evaluating the network itself.
b) The formulation of the search algorithm requires a probability (not an evaluation) in its current form.

Now, that all being said, one big problem of the current algorithm is what to use as the FPU. And what you proposed is exactly to calculate and train this.

Thus, one could use this design:

a) a policy head producing search probabilities for every move
b) a value head producing move evaluations for every move

Now E(v) is indeed max(children), and we can immediately initialize the FPU of every move, as that is exactly what the value head produces.

Also consider that FPU reductions exist exactly to solve the problem that we have only one E(v) for the root and a bunch of probabilities for the moves, which we then have to turn into an E(v) for every move!

If this doesn't work well, you could keep the regular value head too. But anyway, the idea is this: instead of trying to come up with tricky formulas for FPU, just make the FPU a trainable target.

The only catch is that you won't necessarily have expanded every node at the root, so the question is what to pick as the training target for those moves. I propose min(E(v)), i.e. the minimum over all children, somewhat similar to setting the search probability to 0% for unsearched moves.

You can't use the current training data for this; you'd need a new format that stores the vector of winrates after every search. It's also possible this doesn't improve on the current FPU reduction, which IMHO just implements the above by trying to translate the search probabilities into good FPU guesses.
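
To make the proposal concrete, here is a minimal sketch (hypothetical Python, not actual LZ pipeline code) of how the per-move value target could be assembled from one finished root search, with min() over the visited children standing in for the unvisited moves:

```python
# Hypothetical sketch, not LZ code: build the per-move value target for one
# position after search, following the proposal above.  Visited children get
# their searched winrate; unvisited children get the minimum over the visited
# ones, analogous to giving unsearched moves a 0% search probability.
# Assumes at least one child was visited.

def move_value_targets(children):
    """children: list of (move, visits, winrate) tuples for the root's children."""
    floor = min(w for _, v, w in children if v > 0)
    return {move: (winrate if visits > 0 else floor)
            for move, visits, winrate in children}
```

The training record would then have to carry this vector of winrates for every searched position, which is exactly the new data format mentioned above.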

@ghost

ghost commented Mar 26, 2018

I like this idea. Rather than coming up with fpu formulas ourselves, use the NN instead.

Edit: Can the client be modified to output this additional training data for future use?

@gcp
Member Author

gcp commented Mar 26, 2018

Edit: Can the client be modified to output this additional training data for future use?

Yes but not in a backwards compatible manner. That is, you can add support for it, but you can't really make data "for future use" without making the current pipeline a mess.

@leela-zero leela-zero deleted a comment from jkiliani Mar 26, 2018
@leela-zero leela-zero deleted a comment from jkiliani Mar 26, 2018
@gcp
Member Author

gcp commented Mar 26, 2018

Deleted a bunch of offtopic posts; I replied to them in the original thread, where that question had already been asked...

@remdu
Contributor

remdu commented Mar 26, 2018

It seems like a good step in the direction DeepMind seems to be taking with MCTSnets (https://arxiv.org/pdf/1802.04697.pdf), giving more and more control of the search to a neural network. It might be trickier than it seems, since min(E(v)) will tend to underestimate the values of unexplored moves, sometimes by a lot. But maybe that's OK.

@ghost

ghost commented Mar 26, 2018

Noise and randomcnt should help improve min(E(v)), as they currently do for the policy output.

@remdu
Contributor

remdu commented Mar 26, 2018

For what it's worth, I had a try at some point with using the last expanded node's Q value as FPU. I think I had results equivalent to sqrt_tvp when using that minus 0.1*sqrt_tvp (I'll look at my notes this evening). The idea was that the last expanded node is the expanded node with the lowest policy, so it can be found easily, and it is the closest to the unexpanded node with the highest policy (which is the only one we really care about).

@isty2e

isty2e commented Mar 26, 2018

It is presumably worth thinking about how to define the loss function (probably cross-entropy?). Should it include unexpanded nodes or not? Also, nodes with few visits (like <10) probably won't be good training targets. It will take some time to fill in the details.

@gcp
Member Author

gcp commented Mar 26, 2018

Should it include unexpanded nodes or not?

You need an output for them, so yes, you have to include them one way or another.

Also, nodes with few visits (like <10) probably won't be good training targets.

Why on earth wouldn't they be? You're looking for a good first estimate of E(v). A 10-playout search is massively better than nothing.

@roy7
Collaborator

roy7 commented Mar 26, 2018

In a situation where we are going to try and directly train FPU for all nodes, would we maybe want to just expand every child of the root? That way, training the FPU always has at least one nn_eval of the resulting board state after the move. No guessing or unknown targets at all.

@isty2e

isty2e commented Mar 26, 2018

I mean, we do need some output for unvisited nodes, but whether to include them in the loss function is something to think about. Of course we don't want those values to be random by excluding them from the loss function, but including them naively may be undesirable in practice, since those moves will dominate in number.

Also, I do agree that a few playouts are probably better than nothing, but many playouts will definitely be better, so the two should not be treated equally. Perhaps we can give larger weights to more-visited nodes in the loss function, but in a nonlinear way, such that the weight starts from a finite (non-zero) value when visits = 0 and increases slowly as the number of visits of the node grows.
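
A sketch of the kind of weighting curve being suggested (constants and names purely illustrative, not from any existing code):

```python
import math

# Illustrative only: a per-move loss weight that is finite and non-zero at
# visits == 0 and grows slowly (here logarithmically) with the visit count,
# as suggested above.  The constants are placeholders, not tuned values.
def target_weight(visits, base=0.25, scale=0.25):
    return base + scale * math.log1p(visits)

# target_weight(0) == 0.25, target_weight(10) ~= 0.85, target_weight(10000) ~= 2.55
# A weighted loss would multiply each move's squared error by this factor.
```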

@ghost

ghost commented Mar 26, 2018

Since the policy output can be trained on few playouts with no problems, I think the E(v) output shouldn't be any different.

@isty2e

isty2e commented Mar 26, 2018

Assume that move A has 10k visits with a winrate of 55% and move B has only 10 visits, also with a 55% winrate. The training data for policy training will clearly include a much higher probability for A, so a more-visited node is rewarded, in some sense. But if we naively apply equal weights for the FPU target, that effect is not taken into account, even though the quality of the training data for a move depends on how many visits its node received.

@gcp
Member Author

gcp commented Mar 26, 2018

I don't see what your point is. If the policy probability of a move is 1/3199 because it got a single visit, we don't weight it down because the error on the probability is huge.

@isty2e

isty2e commented Mar 26, 2018

For the probability it does not matter, since it is naturally low and its impact on the search is quite small. But for the FPU, it becomes a different story since a less visited node can still have a high FPU. Of course it will (hopefully) be better than nothing, but I don't think it should be treated equally to nodes with more visits.

@jkiliani

Would you do this by essentially using a triple head for the network, instead of the current dual one? With the FPU head identical to the policy head in architecture, and the loss function for training weighted 1/3 each for cross-entropy loss, regular MSE and FPU MSE?
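
Something along these lines, presumably (a rough sketch with placeholder names, not the actual training code):

```python
# Rough sketch of the equally weighted three-term loss described above.
# All names are placeholders; inputs are plain Python lists and floats.

def triple_head_loss(policy_logp, search_probs,   # per-move log-probs vs. search probabilities
                     value_out, game_result,      # scalar evaluation vs. +/-1 outcome
                     fpu_out, fpu_target):        # per-move evaluations vs. per-move targets
    policy_loss = -sum(t * lp for t, lp in zip(search_probs, policy_logp))  # cross-entropy
    value_loss = (value_out - game_result) ** 2                             # regular MSE
    fpu_loss = sum((o - t) ** 2 for o, t in zip(fpu_out, fpu_target)) / len(fpu_out)  # FPU MSE
    return (policy_loss + value_loss + fpu_loss) / 3.0
```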

@gcp
Member Author

gcp commented Mar 26, 2018

Would you do this by essentially using a triple head for the network, instead of the current dual one?

You might not need the value head as it could be equivalent to max(E(v)) of the "FPU" head.

Of course, practical results override theoretical possibilities.

@jkiliani

jkiliani commented Mar 26, 2018

Well, the training targets are different, aren't they? Value head would always be trained to reduce MSE to game result, while FPU head would be trained to reduce MSE to child evaluation.

@gcp
Member Author

gcp commented Mar 26, 2018

But for the FPU, it becomes a different story since a less visited node can still have a high FPU.

It's actually hard to make this true because of how the search behaves. And I'd think that in that case, you'd want the high FPU, for sure.

(i.e. in Leela's search output you aren't likely to find a move with a high score that has low visits)

@gcp
Member Author

gcp commented Mar 26, 2018

Well, the training targets are different, aren't they? Value head would always be trained to reduce MSE to game result, while FPU head would be trained to reduce MSE to value head.

True, in general you see better results when training the value head to the 0/1 value instead of the search output. So for that a triple head would work. But it might even work without it.

@jkiliani

The idea sounds really promising, since you would essentially amplify the training data by using the child node data much more than currently. Training to both game result and search output may also regularise the network better.

@isty2e

isty2e commented Mar 26, 2018

It's actually hard to make this true because of how the search behaves.

That is absolutely true in the N_visits -> inf limit, but when the total number of visits is limited, it is not always the case, I suppose. The estimated winrate for a move can always change as the search progresses, and it becomes more accurate as the node is visited more. So my question is: isn't it better to treat the data from more-visited nodes as more important?

FPU head would be trained to reduce MSE to value head.

This is interesting, but what would be the loss function that accounts for visited nodes, for example? Could you elaborate a bit?

@Ishinoshita

Ishinoshita commented Mar 26, 2018

@jkiliani: "Training to both game result and search output may also regularise the network better."

I remember that I initially misread AGZ and thought they were training both the policy and the value head on the tree search result (exponentiated visit counts of child nodes for the policy head & action value accumulated at the root for the value head). I understand that such a training target for the value head may suffer from serious convergence issues.
However, using a full game to generate a single training sample is such a waste. There might be more than one lesson to learn from a game, at various stages of the game. Now, reading the discussion in this thread and considering that the value head is now quite knowledgeable, I'm wondering if a target for the value head that mixes the game outcome and the tree search result (as a kind of baseline) would be meaningful enough to allow sampling positions multiple times per game, while preventing overfitting too much on the game outcome, since the latter would be 'tempered' by the current evaluation from the tree search. This would not require redefining the value head, just the training target for value.

@lp200

lp200 commented Mar 26, 2018

The shogi program YaneuraOu became pretty strong using this loss function:

grad = (1 - lambda) * (eval_winrate - t) + lambda * (eval_winrate - teacher_winrate)

eval_winrate = root_winrate
teacher_winrate = search_winrate
t = game_result
https://github.com/yaneurao/YaneuraOu/blob/master/source/learn/learner.cpp#L1123
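
Rewritten in Python for readability (an illustration of the quoted snippet, not the actual learner code), the gradient amounts to training the evaluation towards a target that interpolates between the game result and the deeper search's winrate:

```python
# Illustration of the gradient quoted above.  Algebraically it equals
# eval_winrate - ((1 - lam) * game_result + lam * teacher_winrate),
# i.e. the target is a lambda-blend of the game outcome and the search result.
def blended_grad(eval_winrate, game_result, teacher_winrate, lam):
    return (1.0 - lam) * (eval_winrate - game_result) \
         + lam * (eval_winrate - teacher_winrate)
```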

@TFiFiE
Contributor

TFiFiE commented Mar 27, 2018

There should have been a direct link to the originating thread in the first post.

@remdu
Contributor

remdu commented Mar 28, 2018

You might not need the value head as it could be equivalent to max(E(v)) of the "FPU" head.

I personally think this is false. It could be close enough to the truth that it would work, though.

Anyway, if at some point we explore options to extend the predictions the net makes in order to make the tree search more effective, why limit ourselves to a head that predicts the Q values of child nodes? There are a lot of tree statistics apart from visit counts and Q values that could be very interesting to have the net predict. I think the way it's currently done, there is a lack of information from the net. When it says a move has a 0.50 prior, what does that mean exactly? That in half the example positions it was visited 2 times and in the other half it was searched 3200 times? Or that it was consistently visited 1600 times?
It's important to consider that for the "move value head" proposed in this thread. If this head says a move has a value of 0.10, it seems important to know whether it was a sure win in 1/10 of the training positions and a suicide in the other 9/10, or whether it was consistently 0.10. If the net were able to output such statistics, that would make it better at finding, say, capture moves that are normally ruled out.

I'm wondering, is it possible for the net to output upper and lower bounds for the predictions it makes? For training, it would be about not punishing it when it gives an upper bound larger than what the value truly ended up being, for example. It seems like it might be possible for the policy head, for example? Please correct me if I'm wrong, though.

@roy7
Collaborator

roy7 commented Mar 29, 2018

Someone from OGS was doing this tonight,

https://gist.github.com/AncalagonX/75892d7f4f9fed74d4cf7c91f719d92f

to basically force the root to explore every child to a minimum visit count before normal search takes over. For a simple test, the minimum could be 1. So every root child gets 1 visit and fpu estimates aren't even used/needed any more for the root node. No idea what this does for strength.

@AncalagonX
Contributor

I'm the OGS guy who provided that initial snippet. I've slightly expanded on this idea: the highest "value" head search results get about half the processor time, up to a max of 1000 visits, and the "vanilla winrate + puct" search results get the other half.

Effectively, any root node with >= 1000 visits is allowed to compete with the "vanilla winrate + puct" results in an "eval-sorted" (value-sorted a.k.a. winrate-sorted) results list. The highest-scoring result is played.

I have only had time to play one game vs. unchanged LZ 0.13, but by move 100 both versions agreed the game was a 63.5-38.5 split, and the split got more severe after that until the modified LZ won the game.

@AncalagonX
Contributor

AncalagonX commented Mar 29, 2018

My previous post isn't statistically significant, so I'll come back later with larger numbers of game results.

But the extremely polarized nature of LZ's policy network led me down this road. Often, it only considers a SINGLE move, and dumps untold numbers of visits into examining a SINGLE root node (a.k.a. next move candidate). Even when it considers multiple moves, it is still too heavy-handed in how it rates policy vs. non-policy moves.

It seems that allocating a certain percent of visits to non-policy root nodes may have significant value. It allows the network to choose a better value-ranked root node when there is no better (value+policy)-ranked root node.

In my case above, "winrate" was used until 6400 playouts (not visits) on that turn, with a max of 1000 visits allowed per root node. After 6400 playouts were performed that turn, the default "winrate + puct" was used for the remaining 3600 playouts to a max of 10-11k playouts that turn. Finally, an "eval-sorted" (value-sorted) results list was used to pick from any move with the highest eval that had more than 1000 visits.

I will bring more match results back. My implementation is crude, not only in theory but also in programming ability. I'm not a statistics guy, nor a CS major. If there's really something here after I bring back more matches, someone else should be able to improve upon it. Again, this effort was chiefly motivated by how sharp LZ's policy network has become, and by wanting to give alternate moves some CPU time for consideration.

Edit: 2 wins 1 loss. Don't get too excited. I might not be able to bring more data for a while.

@gcp
Member Author

gcp commented Mar 29, 2018

So every root child gets 1 visit and fpu estimates aren't even used/needed any more for the root node. No idea what this does for strength.

I think we tried this when experimenting with FPU stuff, or it was even the initial default, and it was just bad (it's equivalent to FPU=>1.0 IIRC).

Most of those "let's explore more" ideas seem appealing but they generally just make the program weaker. Spending a lot of effort on moves you know are extremely likely to be bad just isn't very efficient.

Also it doesn't really have anything to do with FPU, because that controls only the first visit.

@ghost

ghost commented Mar 29, 2018

An easy way to test how big an impact a more correct FPU could have is to change the code to send one visit down each child while carefully not counting those as actual visits, i.e. count a visit only after a child is returned as uct_best_child, and use the rest as a replacement for FPU.

Comparing that at a fixed number of "actual visits" to the current implementation with the same visits gives an upper bound on how much better LZ could get with a perfect FPU.
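
A sketch of that experiment (hypothetical Python, not the leelaz C++ tree code):

```python
# Hypothetical sketch of the "probe visit" experiment described above: every
# child gets one network evaluation that is used in place of an FPU estimate,
# but is never counted in the visit/value statistics the search reports.

class Child:
    def __init__(self, move):
        self.move = move
        self.visits = 0          # incremented only when uct_best_child picks it
        self.value_sum = 0.0
        self.probe_eval = None   # one-off eval standing in for the FPU

    def selection_value(self):
        if self.visits > 0:
            return self.value_sum / self.visits
        return self.probe_eval   # "perfect FPU" for a never-visited child

def probe_children(children, net_eval):
    for child in children:
        child.probe_eval = net_eval(child.move)  # not added to visits/value_sum
```

Matching the two versions at the same number of counted visits then bounds what a perfect FPU could be worth.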

@alreadydone
Contributor

alreadydone commented Jun 5, 2018

Chao Gao, Martin Mueller, and Ryan Hayward have a new paper titled "Three-Head Neural Network Architecture for Monte Carlo Tree Search". Since the full text isn't available, I opened an issue to ask whether it's related: cgao3/benzene-vanilla-cmake#1

@jillybob

jillybob commented Jun 5, 2018

Three-Head Neural Network Architecture for Monte Carlo Tree Search
Chao Gao, Martin Müller, Ryan Hayward

AlphaGo Zero pioneered the concept of two-head neural networks in Monte Carlo Tree Search (MCTS), where the policy output is used for prior action probability and the state-value estimate is used for leaf node evaluation. We propose a three-head neural net architecture with policy, state- and action-value outputs, which could lead to more efficient MCTS since neural leaf estimate can still be back-propagated in tree with delayed node expansion and evaluation. To effectively train the newly introduced action-value head on the same game dataset as for two-head nets, we exploit the optimal relations between parent and children nodes for data augmentation and regularization. In our experiments for the game of Hex, the action-value head learning achieves similar error as the state-value prediction of a two-head architecture. The resulting neural net models are then combined with the same Policy Value MCTS (PV-MCTS) implementation. We show that, due to more efficient use of neural net evaluations, PV-MCTS with three-head neural nets consistently performs better than the two-head ones, significantly outplaying the state-of-the-art player MoHex-CNN

@alreadydone
Contributor

alreadydone commented Jun 5, 2018

Great, so confirmed. Where did you find it? I searched the title and found nothing.
Edit: OK, I found it here: http://static.ijcai.org/2018-Program.html - I should have added quotation marks to my search.

@lp200

lp200 commented Jul 14, 2018

Three-Head Neural Network Architecture for Monte Carlo Tree Search
https://www.ijcai.org/proceedings/2018/0523.pdf

@Hersmunch
Member

So the architecture in the paper above looks pretty interesting. @gcp, would changes to support different architecture(s) like this be welcome? Or would you prefer to have some performance evidence of nets trained using a proposed architecture first, i.e. try it all out on a branch first? I think I know what your answer will be :)
I don't have the capability to train a network (as TensorFlow doesn't appear to support my AMD GPU) but wouldn't mind helping out with making changes to the leelaz code. So is it worth pursuing?

@odeint

odeint commented Jul 19, 2018

The additional action-value head sounds super interesting; it would be great if that could be implemented. As the authors say, it's likely that the results will be even better once a few self-play cycles are done, so that the system can adapt. For testing, as they did, we should already be able to see an effect when training on existing data.

@Ishinoshita

Very interesting indeed. Having this Q-value head would open the field of exploration even more, I presume, in terms of tree-search algorithms. The authors stress that

The main advantage brought by this architecture is that the action-value can be immediately back-propagated to the tree without expanding a node.

If I read it correctly, the paper doesn't mention the possibility of using the action-value, or some value derived from it, as the FPU for the expanded node.

I also suppose that other approaches could be used to train this action-value head, like using the winrate of each move that received visits as the training target (V%). As of now, however, I understand that these values are thrown away and not uploaded as part of the training data, correct?

@lightvector

lightvector commented Jul 19, 2018

I actually tested this idea (but NOT in the LZ framework) just in the last few days. Here are my notes:

Unless I'm misinterpreting the paper, I believe their proposed loss function does not result in the learning of an accurate action-value-head due to noticeable bias in its construction that never vanishes even in the limit of lots of data. In particular, a hypothetical neural net that makes perfect action-value predictions on a large set of training data would NOT experience zero gradient under their loss function.

So this makes me much less excited than I was (I guess my expectations were too high). But I do think there's still possible value here in trying it out or something similar.

I am NOT confident that I have not made a mistake here, so take this with a grain of salt.

Here is the argument to show why a locally perfect neural net is not a local minimum of their proposed loss function.

Consider a specific Go position s, and assume we have an extraordinarily large number of data samples for that position (s,a,z), where a is the actual move played in that data sample and z (-1 or 1) is the value target resulting from finishing the game after playing a. Assume that there are only a few good moves, played in almost all samples, that all have almost identical winning rates, but that we still have so extraordinarily many samples that every move (even the really bad ones) has enough samples that you could get a good estimate of q(s,a) for it simply by taking the average value of z over those samples, despite those samples being only a tiny proportion of the data.

Assume that our neural net currently has weights such that, when run on s, the policy head outputs the exact distribution of a, the value head outputs exactly the average value of z, and the action-value head for each action a' outputs exactly the average value of z restricted to samples with action a'. This is the "perfect" output given this data set. Now, imagine providing this entire dataset as one giant batch, set the L2 regularization coefficient to 0 so we can ignore it, and perform a gradient step. What happens? Here are the components of their loss function:

(Note that they define q(s,a) as the value from the perspective of the OTHER player after you make move a, so for a good move it should be equal to the negative of v(s).)

  • (1) -log p(a|s) : contributes 0 gradient, since the policy head exactly fits the batch statistics.
  • (2) 0.5(z-v(s))^2 : contributes 0 gradient, since the value head exactly fits the batch statistics.
  • (3) 0.5(z+q(s,a))^2 : contributes 0 gradient, since the action-value head exactly fits the batch statistics.
  • (4) (min_a' q(s,a') + v(s))^2 : contributes negligible gradient since almost all data samples are focused on good moves and those good moves have almost identical winrates, so min_a' q(s,a') and v(s) should be almost exactly negatives.
  • (5) sum_a' max(-z,0)/|A(s)| * (z+q(s,a'))^2 : Large gradient! For samples with z = 1, this does nothing. For samples with z = -1, it trains the action-value head to push its estimate of q(s,a') towards 1 for all a', by a quite large amount.

Basically, term (5) makes it so that any time an action is not the one chosen, its action-value gets pushed a little towards being a loss (a win for the opponent) whenever the game itself was a loss. This happens on every sample, so even though by assumption we have enough data to get a good action-value estimate for some rare move a', if it is percentage-wise rare it will still get drowned out even with infinite data and still be forced close to being viewed as a loss for us (a win for the opponent).
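
To put a rough number on that effect, here is a toy minimization (back-of-the-envelope, taking only terms (3) and (5) exactly as listed above; the assumptions - a rare move whose true result is 50/50, an overall loss rate of 50%, and |A(s)| = 362 - are illustrative):

```python
# Toy illustration of the bias described above.  For a move a' played in a
# fraction f of the samples, with a true 50% winrate (so the "right" q(s,a')
# is 0 from the opponent's perspective) and P(z = -1) = 0.5 overall, the
# expected per-sample loss as a function of q = q(s,a') is
#     f * 0.5 * (q**2 + 1)       from term (3)
#   + (0.5 / A) * (q - 1)**2     from term (5), with A = |A(s)|
# which is minimized at q = 1 / (1 + A*f), not at the correct value 0.

def biased_q(f, num_actions=362):
    return 1.0 / (1.0 + num_actions * f)

for f in (0.1, 0.01, 0.001):
    print(f, round(biased_q(f), 3))   # -> 0.027, 0.216, 0.734

# A move seen in 0.1% of samples gets pushed to roughly q = 0.73 (a win for
# the opponent) no matter how much data there is, while frequently played
# moves (A*f >> 1) stay close to their empirical value.
```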


Now, a counterpoint: the term (3) 0.5(z+q(s,a))^2 does make it so that for moves that are played frequently, the action-value head will be incentivized to converge to a value close to the empirical q(s,a) statistics. And if they are played frequently, then term (5) will contribute only a weak effect, due to the 1/|A(s)| factor. So it could be great for that use. The strength gain they observed in the paper might simply be because they had poor settings around FPU-like things beforehand and anything reasonable like this would be better, or it might be because of these genuinely better q(s,a) estimates for common moves. So there's still much promise here.

However, I do suspect that the action-value head in this architecture is useless for setting first-play urgency for most moves past the first few that the net considers. Because at that point, the "data augmentation" term (5) becomes dominant and pulls the neural net's output towards thinking that the move completely loses regardless of whether it does or not, so it's almost equivalent to outright pruning all moves that fall below some threshold in the NN policy - and if you wanted that behavior, you don't really need a fancy third head for it.

I'll post a followup reply here with some pictures that I got from testing that are consistent with this.

@lightvector

lightvector commented Jul 20, 2018

Here is the action-value head (a 12-block resnet, non-LZ-compatible architecture) of a neural net I trained on LZ data, all games from LZ105-LZ142 and ELF, for about 140M training samples' worth of batches. The action-values are viewed from the perspective of the current player, rather than from the opponent's perspective as the paper defines them.

The color scale is such that very roughly gray = 0%, red 10%, orange 25%, yellow 35%, green 50%, bluegreen 70%, whiteish green >80%. (None of these pictures have any bluegreen or whiteish green values).

The game is about even. You can see here that only 7 moves have the action-value head produce an estimate close to 50%; these are precisely the moves the policy head is putting enough mass on. The reddish-orange moves have "action values" of about 10-25%, and most of the board is dead gray, with less than 1% estimated action-value winrate.

For a lot of the gray moves, if you actually play them, they of course aren't anywhere near that bad. For example, if you play at A (M17), the value head thinks that after white's response black goes down from about 46% to about 45%, rather than all the way down to 1% or 0% like the action-value head is outputting. As expected from the theoretical analysis, term (5) has caused the action-value head to push them massively towards 0%, even if they actually are still okay or lose only slightly.

For comparison, here is the policy:

And here is the action-value and policy on another position:

The action-value head gives fairly reasonable numbers for about 8-10 of the moves, again mostly the ones the policy puts more than a few percent of weight on. All other moves (all the orange and red ones, plus of course the whole mass of grey) have massively lower action-values than the current value head would suggest - it still sees the position as about even, very slightly favoring white.

Some of the more-edgy green moves are still noticeably depressed towards zero. For example, the value head is about 55% for white right now, and the action-value head says playing F15 would drop white to 47%. If you actually play it, the value head drops to only 53%, and in fact goes back up slightly over the next few moves (bouncing around noisily).


I have NOT tested the strength of various ways of integrating the action-value head into the search, since I've only just barely finished implementing MCTS in my own sandbox codebase. If someone is excited enough to try training within LZ codebase and testing in actual play, that would of course be very interesting.

I do have the concern that if used in too "naive" of a way for purposes like FPU where the near-zero outputs of the action-value head are used in a way that effectively prunes almost all moves, it's possible that you may see improvements at low playouts but a weakening of strength at very high playouts due to blind spots compared to the current LZ FPU formula, since with FPU = ~loss_value you will obviously prune many moves that would have gotten at least a little exploration under the current LZ formula once you reached high enough playouts. But of course, testing still needed.

@Tsun-hung

@lightvector How strong is your network? Could you release it so we can have a test?

@lightvector

lightvector commented Jul 20, 2018

It's not currently an LZ-compatible network because it has multiple architectural differences, so you might have trouble testing it with LZ. The input features are also different - it takes liberties and ladders as input features and uses a one-hot representation for the history. You can see most of the differences here (except for a missing value head, which I haven't pushed yet): https://github.com/lightvector/GoNN/blob/master/model.py. If you still think it would actually be useful to you, I could post the weights for a three-headed version of this model. The weights would simply be a TensorFlow Saver.save() of an instance of this model graph, except with two fewer blocks.

Edit: and of course, no idea how strong it is, as I mentioned I only just finished implementing MCTS on my own. Learning how all this stuff works one step at a time... :)

@Hersmunch
Member

@lightvector So if I understand correctly, it seems the behaviour you describe is a result of using "the AND constraint" as a way of saying "if the position is lost then it doesn't matter what is played; all moves result in a loss". This doesn't immediately sound like a bad thing to consider, so what am I missing? Do you think the loss function shouldn't be attempting to encode a statement that would only be true for perfect play?
Also, could a different loss function be used that only considers the move that was actually played? Perhaps something like multiplying the (z+q(s,a'))^2 term of your (5) above by 1 if a' was the move played and 0 if it wasn't?

@lightvector

lightvector commented Jul 22, 2018 via email

@alreadydone
Contributor

alreadydone commented Apr 23, 2019

Just learned about this old paper from DM: https://arxiv.org/pdf/1511.06581.pdf
https://medium.freecodecamp.org/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682 Dueling DQN (aka DDQN)
(This doesn't have a policy network (uses epsilon-greedy) or tree search, though.)

@iopq

iopq commented Apr 24, 2019

But for the FPU, it becomes a different story since a less visited node can still have a high FPU.

It's actually hard to make this true because of how the search behaves. And I'd think that in that case, you'd want the high FPU, for sure.

(i.e. in Leela's search output you aren't likely to find a move with a high score that has low visits)

It's very common in the opening, where 3-3 and pincers might be good moves but are hated by the bot. It may not search them, but if you play them, the evaluation doesn't change.
