
Code release for Off-Belief Learning and Learned Belief Search #26

Open

mohitahuja1 opened this issue Jul 22, 2021 · 32 comments

Comments

@mohitahuja1

mohitahuja1 commented Jul 22, 2021

Hi @hengyuan-hu ,

Thanks a lot for your informative answers!

Would you have an approximate estimate on when you would be able to release the Off-Belief Learning (OBL) code?

Also, have you released the code for the auto-regressive belief model (part of the Learned Belief Search paper) which is used in the OBL implementation?

Thank you for your efforts in producing such valuable research and sharing it with the larger community!

Regards,
Mohit

@hengyuan-hu
Contributor

Hi Mohit, thanks for your interest! OBL & belief training are radically different, so we will release them as a new repo. We are aiming for a September release, as the code needs to go through quite a bit of cleanup and internal review.

@mohitahuja1
Author

Thanks @hengyuan-hu !

@0xJchen

0xJchen commented Jul 24, 2021

> Hi Mohit, thanks for your interest! OBL & belief training are radically different, so we will release them as a new repo. We are aiming for a September release, as the code needs to go through quite a bit of cleanup and internal review.

(Clarification: LBS refers to "Learned Belief Search: Efficiently Improving Policies in Partially Observable Settings")

Hi, hengyuan. I am trying to re-implement the belief model mentioned both in "OBL" and "LBS". May I ask some questions concerning the architecture of the belief model?

My understanding is as follows (in short, I found the AR belief model is like a seq2seq model, but here we regard the hand as the sequence instead of the replayed game as the sequence):

  1. The encoder LSTM takes a sequence of game observations as input (just sampled from the replay buffer). The output of the encoder LSTM is a context vector (for an LSTM, we need to concatenate the hidden and cell states).
    Q1: More specifically, the input of the decoder is the public & private observation as mentioned in the paper. Can we directly use the priv_s from OP training, which has dimension 783? Or the modified version generated in the interface EncodeARV0Belief()?
  2. The decoder takes the concatenation of the context vector and the embeddings of the previously decoded cards (positions 1 to i-1) as input, and outputs the prediction for the next card (position i).
  3. If the above speculation is right, then for one game sequence we only make a prediction for the last state (as we are unfolding the whole trajectory to the end). Another interpretation (maybe more reasonable) is that we repeat the above two steps for each state of the trajectory (unfold the sequence up to the current time step, predict the hand at that time step, and compute the loss against the ground-truth hand at that time step). From this perspective, we update the model at every state of the sequence, and at each state we predict the hand supervised by the cheated ground truth.
    Q2: Which of the above belief-model training schemes is right?

Q3: May I ask about the necessity of training a new policy model in the LBS paper? We could have directly borrowed the model trained in OP or SAD, as it is just used for the rollouts. I also noticed that this new policy model performs worse than its previous counterparts (in LBS's Figure 3, left, the BP reaches only around 23 points).

Thanks for sharing the code with the community and reading the verbose questions. I really learned a lot from your great work.

@hengyuan-hu
Contributor

  1. You mean the input of the encoder? If so, yes, the encoder of the belief model can just take priv_s as input.
  2. Yes, although at training time the decoder takes the "real" card at the previous positions as input instead of the sampled ones (see the sketch after this list). What you said is correct at inference time.
  3. The second one is right. We want the model to predict the hand at every time step, not just the final one.
  4. It is not necessary to train a new policy for LBS. Any blueprint should be fine.
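
To make the training scheme in points 2 & 3 concrete, here is a minimal, hypothetical sketch of an auto-regressive belief model in PyTorch. This is not the released code: the class name, dimensions, and the lack of masking for padded time steps are all simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder dimensions: priv_s is 783-dim, a hand has 5 slots, 25 card types.
PRIV_DIM, HID_DIM, HAND_LEN, NUM_CARDS = 783, 512, 5, 25

class ARBeliefModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(PRIV_DIM, HID_DIM)              # consumes the trajectory of priv_s
        self.card_emb = nn.Embedding(NUM_CARDS + 1, HID_DIM)   # +1 for a "start" token
        self.decoder = nn.LSTMCell(HID_DIM, HID_DIM)
        self.out = nn.Linear(HID_DIM, NUM_CARDS)

    def forward(self, priv_s, target_hand):
        # priv_s: [T, B, PRIV_DIM]; target_hand: [T, B, HAND_LEN] ground-truth card ids.
        T, B, _ = priv_s.shape
        ctx, _ = self.encoder(priv_s)                 # one context vector per time step: [T, B, HID_DIM]
        ctx = ctx.reshape(T * B, HID_DIM)             # predict the hand at EVERY time step
        tgt = target_hand.reshape(T * B, HAND_LEN)

        h, c = ctx, torch.zeros_like(ctx)             # initialize the decoder from the context
        start = torch.full((T * B,), NUM_CARDS, dtype=torch.long, device=priv_s.device)
        prev = self.card_emb(start)
        loss = 0.0
        for i in range(HAND_LEN):                     # oldest card position first
            h, c = self.decoder(prev, (h, c))
            loss = loss + F.cross_entropy(self.out(h), tgt[:, i])
            prev = self.card_emb(tgt[:, i])           # teacher forcing: feed the real previous card
        return loss / HAND_LEN
```

At inference time the teacher-forced `tgt[:, i]` would be replaced by a card sampled from the predicted distribution, as discussed further down in this thread.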

@0xJchen

0xJchen commented Jul 27, 2021

Thanks for the insightful explanation!

@0xJchen

0xJchen commented Jul 29, 2021

Hi, hengyuan. May I ask an additional question concerning Learned Belief Search that has confused me for some time?
The learned belief model seems to be deterministic once trained (it takes the deterministic priv_s at the current time step as input and gives the most probable hand configuration). So how could we sample hands from the belief model (changing the seed at test time would not introduce randomness)? In "Improved Policy via Search", we maintain an explicit distribution over cards, so sampling is possible. Could you please elaborate on that?

@hengyuan-hu
Contributor

The belief model is not deterministic. It outputs a probability distribution and we sample from it, from the oldest position to the latest, instead of taking the argmax.
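
For illustration, a sampling loop along those lines could look like the sketch below, reusing the hypothetical ARBeliefModel names and constants from the earlier sketch (again an assumption, not the actual implementation):

```python
import torch

@torch.no_grad()
def sample_hand(model, priv_s):
    # priv_s: [T, B, PRIV_DIM]; sample one hand per batch element for the latest time step.
    ctx, _ = model.encoder(priv_s)
    h = ctx[-1]                                        # context at the current step: [B, HID_DIM]
    c = torch.zeros_like(h)
    start = torch.full((h.size(0),), NUM_CARDS, dtype=torch.long, device=h.device)
    prev = model.card_emb(start)
    cards = []
    for _ in range(HAND_LEN):                          # from the oldest position to the latest
        h, c = model.decoder(prev, (h, c))
        probs = torch.softmax(model.out(h), dim=-1)
        card = torch.multinomial(probs, 1).squeeze(1)  # sample, do NOT take the argmax
        cards.append(card)
        prev = model.card_emb(card)                    # condition later positions on the sample
    return torch.stack(cards, dim=1)                   # [B, HAND_LEN]
```

Because every position is sampled and fed back in, repeated calls yield different hands that are all consistent with the same observation history.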

@0xJchen

0xJchen commented Jul 30, 2021

I got it. Thanks for the clear explanation.

@mohitahuja1
Author

Hi @hengyuan-hu ,

I hope you are well.

I have a few questions about Off-Belief Learning (including the belief model). Please can you share your views on the below:

  1. This is with reference to Figure 1 (RHS) in the OBL paper. Even though the fictitious state s_t' is different from the actual state s_t, the observations (o_t and o_t') are the same. We are applying the same action a_t to the same observation, therefore o_t+1 should be the same as o_t+1'. Also, r_t and r_t' will be the same. However, o_t+2 will be different from o_t+2', as a_t+1 and a_t+1' will be different (the policies will be different at t+2). Would you agree?

  2. How is the hand drawn from the belief model input into the OBL network? Is it added only to the private stream (3 layer ff network) or also to the public stream (1 layer ff network + 2 LSTMs)?

  3. In order to learn a grounded policy, it is suggested that we remove the other player's last action from the observation. Do we also remove information of this action from the "card knowledge" available to the agent as part of the observation?

  4. How is pi_1 (current and future policy) calculated from pi_0 (grounded blueprint policy)? Is pi_1 just a result of using the Q-values learnt by the Q-network in OBL? So, we generate the belief model using pi_0 and we train the OBL network using the belief model and the OBL network produces pi_1. Is this how pi_0 and pi_1 are related?

  5. Instead of using the BELIEF of the cards in our hand, can we use the ACTUAL (TRUE) cards in our hand and proceed with training OBL? In this case, we would use the belief cards (instead of the actual cards) while playing with an ad-hoc team mate (during evaluation). Do you think this approach is worth trying?

Thank you very much for all the help!

Regards,
Mohit

@hengyuan-hu
Contributor

  1. Exactly! The observation would be the same, so we don't need to reapply the network.
  2. By definition it is added to the private stream, since the hand is considered private information. Implementation-wise we just create a fictitious state as a clone of the real state and reset the hand to the sampled hand (see the sketch after this list). Everything else follows naturally, i.e. the encoder will split the input features properly into public & private parts.
  3. We did not suggest removing the other player's last action from the observation. The grounded policy is simply learned by using two purely random policies to generate trajectories.
  4. Yes, OBL(trained_belief(pi_0)) -> pi_1.
  5. If you use the actual cards in place of the fictitious hand, then OBL is no different from selfplay. If I understand the idea correctly, this will not give you any benefit when playing with an ad-hoc team mate. Remember that the belief model is not needed/used at all during evaluation.
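
A pseudocode-style sketch of point 2, with made-up method names (the repo does this inside its environment code, so this is only an illustration of the idea):

```python
import copy

def make_fictitious_state(real_state, belief_model, priv_s, player):
    """Hypothetical helper: clone the real state and swap in a hand sampled from
    the belief model. Only the private cards change; deck, fireworks, discards
    and the public card knowledge stay exactly as in the real state."""
    sampled_hand = sample_hand(belief_model, priv_s)   # see the earlier sampling sketch
    fict_state = copy.deepcopy(real_state)
    fict_state.set_hand(player, sampled_hand)          # assumed setter, for illustration only
    # The usual observation encoder is then applied to fict_state; it splits the
    # features into public & private parts exactly as it would for a real state.
    return fict_state
```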

@mohitahuja1
Author

Thank you so much!

Regards,
Mohit

@0xJchen

0xJchen commented Aug 9, 2021

Hi, hengyuan. May I ask a related question concerning the SAD agent used in another of your works, "Improving Policy via Search"? (Please let me know if this is an inappropriate place to discuss it.)
In the Hanabi_SPARTA codebase, a trained SAD agent is provided (I guess saved with torch::jit). I have successfully evaluated with that model. The environment follows the hanabi_SAD repo: pytorch==1.5.1, cuda==10.1.
But when I tried to take a model from this repo (hanabi_SAD) and load it in Hanabi_SPARTA (what I do is first load the provided model and then save it manually with torch.jit.save), I got the following error:

...
Starting game...
====> cards remaining: 40 , empty? 0 , countdown 0 , mulligans 3 , score 0
Current hands: 1o,1b,1b,3g,5y 2b,3b,5r,5g,1r
terminate called after throwing an instance of 'c10::Error'
  what():  Method 'forward' is not defined. (get_method at /home/xxx/anaconda3/envs/hanabi/lib/python3.7/site-packages/torch/include/torch/csrc/jit/api/object.h:93)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7fbda7966536 in /home/xxx/anaconda3/envs/hanabi/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: AsyncModelWrapper::batchForward() + 0x2c63 (0x7fbb8a30d773 in /home/xxx/anaconda3/envs/hanabi/lib/python3.7/site-packages/hanabi_lib-0.0.0-py3.7-linux-x86_64.egg/hanabi_lib.cpython-37m-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0xd6de4 (0x7fbdefc09de4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x9609 (0x7fbdf1458609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fbdf137f293 in /lib/x86_64-linux-gnu/libc.so.6)

I suppose the training script for the model in Hanabi_SPARTA is different from the one here. Could you please give me some suggestions on solving the above problem?

@hengyuan-hu
Contributor

Hi. It would be better to create a new issue as this is a different problem. The short answer is that the models saved in this repo are simply state_dicts saved with torch.save(), instead of TorchScript models saved with torch.jit.save. I tried my best to find an example conversion script (45b922e). Basically it loads the weights and saves them together with a minimal "forward" function that takes the input and then produces an action + updated hidden states, following the format in the SPARTA code base. I am not 100% sure that it will work since I am not actively using the SPARTA code base at this moment. But you may get the idea from the script.
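
The general shape of such a conversion might look like the sketch below. This is not the actual script from commit 45b922e; the network constructor, checkpoint filename, and the exact input/output format expected by SPARTA are assumptions.

```python
import torch
import torch.nn as nn

class JitWrapper(nn.Module):
    """Hypothetical wrapper that bundles the trained weights with a minimal forward()."""

    def __init__(self, net):
        super().__init__()
        self.net = net

    def forward(self, priv_s, hid):
        # Assumed signature: the underlying net returns per-action values and a new hidden state.
        adv, new_hid = self.net(priv_s, hid)
        action = adv.argmax(dim=-1)        # greedy action for the SPARTA-side rollouts
        return action, new_hid

# Usage sketch (names are placeholders):
# net = build_sad_network(...)                     # same architecture as at training time
# net.load_state_dict(torch.load("model.pthw"))    # checkpoints here are plain state_dicts
# torch.jit.save(torch.jit.script(JitWrapper(net)), "model_jit.pt")
```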

Please open a new issue if this indeed does not work and you would like to discuss more. I may have time to reproduce it later.

@0xJchen

0xJchen commented Aug 9, 2021

Hi, hengyuan. Thanks for the clarification and instant reply. The uploaded script works pretty well!

@mohitahuja1
Author

Hi @hengyuan-hu ,

I just have a couple of follow-up questions please.

Question 1 (related to the replay buffer in OBL):

The OBL paper mentions that the fictitious target G_t' = r_t' + r_t+1' + Q_theta'(o_t+2', a) is stored in the replay buffer along with the actual trajectory T.

As I understand, in order to train the network in OBL, we will just need o_t, a_t and G_t'. Using these, we can easily calculate Q_theta(o_t, a_t) and use G_t' as target to calculate loss. Is my understanding correct?

So why do we need to store the actual trajectory in the replay buffer? Shouldn't we only store o_t, a_t, and G_t'? Why do we need to additionally store o_t+2, r_t+2, terminal_t+2, etc.?

Please note: I am trying to implement a simplified version of OBL using Rainbow (instead of R2D2). So I do not need to extract sequences from the replay buffer (as I am not using recurrent networks). I just need to extract single instances of prioritized experiences from the replay buffer.

Question 2 (related to the application of the belief model to OBL):

I was not able to understand how I could use the output from the belief model as input to the OBL model. Can I concatenate the hand (from the belief model) to the input observation and feed it to OBL?

How does extracting a fictitious state from the observation work? Do you mind explaining this in a little more detail please?

Thank you very much for your continued support!

Regards,
Mohit

@hengyuan-hu
Contributor

Hanabi is partially observable, so we need a recurrent network (an LSTM in our case) to handle that. That's why we need to store the entire trajectory tau_t instead of just o_t. We store the entire trajectory because we train on every time step; o_t+2 is used to train the network with tau_{t+2}. I.e., given one trajectory stored in the replay buffer, we produce T OBL training targets, where T is the number of steps in the entire game.

We sample the player's hand from the belief model, copy the current game state to create a fictitious state, and reset the player's hand in the fictitious state. Once we have the fictitious state, we can produce the encoding (the inputs to the network) the same way as usual and feed it into the network.
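
Schematically, and with entirely made-up interfaces (the trajectory fields, env.apply, env.encode, etc. are placeholders rather than functions from the repo, and make_fictitious_state is the hypothetical helper sketched earlier in this thread), the per-time-step OBL targets from one stored trajectory could be computed roughly like this:

```python
def obl_targets(trajectory, belief_model, q_target, env):
    """One fictitious target G_t' = r_t' + r_{t+1}' + max_a Q_target(o_{t+2}', a)
    for every time step t of the stored trajectory (terminal handling omitted)."""
    targets = []
    for t in range(len(trajectory)):
        # Resample our hand given the observation history up to t, then clone the state.
        fict = make_fictitious_state(trajectory.state[t], belief_model,
                                     trajectory.priv_s[: t + 1], trajectory.player[t])
        r_t, fict = env.apply(fict, trajectory.action[t])             # our real action, fictitious state
        a_partner = trajectory.partner_policy.act(env.observe(fict))  # partner acts on s_{t+1}'
        r_t1, fict = env.apply(fict, a_partner)
        bootstrap = q_target(env.encode(fict)).max(-1).values         # bootstrap from o_{t+2}'
        targets.append(r_t + r_t1 + bootstrap)
    return targets
```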

@mohitahuja1
Author

mohitahuja1 commented Aug 15, 2021

Hi @hengyuan-hu ,

Thank you so much for the clarifications.

If you don't mind, I will share a few final questions so that my understanding is absolutely clear.

  1. I noticed that the input to SAD (and therefore OBL) contains the player's own hand. Do we make the hand visible to the player? Or are only those cards made visible that have been completely revealed through hints (with the rest encoded as zeroes)? I was confused about this, as the player's own hand should not be visible to the player (based on Hanabi's rules).

  2. As per my understanding, the observation (or state) in SAD is encoded as:

  • own hand
  • team mate's hand
  • remaining deck size
  • fireworks
  • information tokens
  • life tokens
  • discard pile
  • last action (of team mate)
  • card knowledge
  • greedy action

    As I understand, for OBL we do not need to pass the greedy action. So I have two sub-questions here:
    a. Does the observation contain the last action (of the team mate)?
    b. Is the card knowledge changed to the v0-belief for OBL (like we do in SAD)?

  3. As I understand from the paper, we are trying to get rid of arbitrary "conventions" that the agent learns during self-play. For example, an arbitrary convention could be something like "if the team mate hints color = white, then play the second card". This would not work in an ad-hoc (or zero-shot) setting, as the team mate is an unknown player and does not know these conventions. According to the paper, changing the agent's own hand to the belief hand and going through the OBL update (using fictitious states) ensures that these conventions aren't learnt by the agent. I understand that the belief hand is created using random policies and is therefore grounded, which gets rid of conventions. However, we still have the last action in the observation, so the agent can always learn conventions from the last action of the team mate (similar to self-play). Is that not true? How does changing the own card to the belief card get rid of conventions?

  4. Do we also add a belief hand (instead of the own hand) for the team mate in the fictitious state s_t+1'?

  5. During evaluation (or test time), what do we use for the own hand? You mentioned earlier that we do not use the belief hand during evaluation. So, is the own hand all zeroes (since the player can't see it)? Or do we encode only those cards that are completely revealed by hints (and keep the unknown cards in the hand as zero)?

  6. I'd like to run the evaluation with the clone-bot. Is the code open-sourced?

Thanks again for your great help and taking the time to answer our questions!

Regards,
Mohit

@hengyuan-hu
Contributor

hengyuan-hu commented Aug 16, 2021

Hello, you are welcome :)

  1. The "own hand" part of the input is in this repo ALWAYS 0 because, as you said, in Hanabi we can not see our own hand. This feature is here solely due to a legacy reason that when we developed SAD, we initially considered a "centralized value function" that see all the information (we had a flag to control the input so that the centralized Q saw all the input while the real Q function saw only the partial info). This was then abandoned because VDN worked better. In fact, when we developed OBL, this part of the feature was completely removed from the input vector.

  2. a) The input includes last_action field. b) the card knowledge is changed to v0-belief, same as SAD.

  3. Short answer: we include the last action as usual.

Long answer: The core of OBL is that we decouple the formation of convention (induced by belief) and the optimization of the policy by introducing the "fixed belief". In normal RL, these two problems are closely correlated as the change of the policy naturally affects how actions are interpreted.

Intuitively, you can think of belief as the interpreter of actions, and OBL fundamentally changes the way we interpret the actions. For example, the belief trained on top of a random policy will interpret all the actions AS IF they are issued by a random policy (we call it belief-0 as random policy can be seen as OBL level 0), i.e. those actions carries no meaning beyond the grounded information. In normal RL, the policy may discover randomly that "I hint red, my partner play 3rd card and we get reward", then this sequence of actions is reinforced and conventions are formed. In OBL, assuming we use belief-0, then these type of reinforcing loop is broken. When our partner wants to hint red to tell us to play the 3rd card, we "interpret" this action with belief-0, which basically asks the question "what cards would I be holding if my partner is random?". Such interpretation, realized by sampling our hand from belief-0 given the history of observations, will think that our 3rd card can be ANYTHING as long as it complies with the public knowledge. If we try to play that card, i.e. apply a move on the fictitious state with resampled hand, we will very likely fail to score a point but instead lose a life token.

Of course using belief-0 will not give us the best policy to play this game, so in higher level OBL, we use more complicated belief learned from prior OBL policies. As we shown in the paper, this sequence of learning will give us consistent results across multiple different runs.

Side note: Getting rid of last_action part of the input will still lead to formation of "arbitrary conventions" because 1) the model with LSTM can infer what happened, 2) even without LSTM, the model can use other parts of the input such as card knowledge to exchange information. e.g. if your 3rd card is hinted red then your 1st card can be safely discarded. 3) however, we did find that getting rid of last_action, combined with the color permutation method from the "other-play" paper, was sufficient to produce consistent policy across different runs but such policy is not very good at collaborating with others. It is like pushing the search space of the policy optimization into some weird corners of the parameter space with feature engineering, which is not a general method.

  1. No we don't sample hand for team mate. We only need to evaluate partner's policy on S_{t+1}'.

  2. See explanation of 1), the own hand part of the input should be ALL zero or removed completely.

  3. The bot will be open sourced together with the OBL code. Stay tuned.

Hope it helps.

@0xJchen

0xJchen commented Aug 16, 2021

Hi, hengyuan. Thanks for sharing these insights with the community. In the first part you discussed the centralized training scheme; may I ask some further questions about it?

  1. Is your previous attempt (training a "centralized value function") similar to an oracle critic, which takes the global observation as input, where during inference for each agent the unseen observation slots are simply set to zero?
  2. I have re-implemented the belief model from the Learned Belief Search paper and the belief quality matches well. I am wondering if the belief model works only when the blueprint policy is trained via a "centralized" approach (like VDN).

@hengyuan-hu
Contributor

  1. Yes, similar idea.

  2. We can train a belief model for all sorts of blueprints, including non-neural-network policies (i.e. rule-based policies). I don't see any reason why a centralized approach is necessary here.

@0xJchen

0xJchen commented Aug 17, 2021

> 1. Yes, similar idea.
> 2. We can train a belief model for all sorts of blueprints, including non-neural-network policies (i.e. rule-based policies). I don't see any reason why a centralized approach is necessary here.

Hi, hengyuan. Thanks for the reply. I see what you mean. Once the BP is determined, the belief model tailored to this BP should be able to capture the uncertainty induced by the partner's policy.

By the way, do you think it is possible to extend LBS to the multi-agent case by adopting the techniques introduced in the SPARTA work (i.e. range search / retrospective search), or does it face some unique challenges here?

@hengyuan-hu
Contributor

Oh, I think now I understand what you meant in your previous question. In principle, the belief model should be a function of the partner's policy. If we play with a partner (policy-A) in a two-player game (us, policy-A), we should use a belief model trained on data generated from games played by (policy-pi, policy-A), but from policy-pi's perspective. policy-pi can be anything: it can be the policy we are going to use (sound and on-distribution), or it can be the same as policy-A (sound but off-distribution; we hope the neural network generalizes well).

Understanding this point, it is easy to see the problem with adapting LBS to SPARTA-style multi-agent search. In LBS, we start by training policy-A in selfplay (policy-A, policy-A). Then we train the belief model of policy-A in the selfplay setting (policy-A, policy-A). This belief model is sound when used by LBS(policy-A) to play with policy-A, i.e. (LBS(policy-A), policy-A). The belief is wrong, however, if it is used in (MultiAgent-LBS(policy-A), MultiAgent-LBS(policy-A)). The belief model would need to be updated during search as well.
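
As a toy illustration of the first point (the env interface, seat indexing and .act() calls below are all hypothetical), belief training data for playing alongside policy-A would be generated roughly like this, labeled from policy-pi's perspective:

```python
def generate_belief_data(env, policy_pi, policy_A, num_games):
    """Hypothetical sketch: collect (priv_s, true hand) pairs for seat 0 (policy_pi)
    from games played by the pairing (policy_pi, policy_A)."""
    data = []
    policies = {0: policy_pi, 1: policy_A}
    for _ in range(num_games):
        obs, done = env.reset(), False
        while not done:
            seat = env.current_player()
            # Record seat 0's private observation together with its true hand (the label).
            data.append((obs[0].priv_s, env.true_hand(0)))
            obs, reward, done = env.step(policies[seat].act(obs[seat]))
        # In LBS both seats use the same selfplay policy, i.e. policy_pi == policy_A.
    return data
```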

@mohitahuja1
Author

mohitahuja1 commented Aug 18, 2021

> Hope it helps.

Thanks a lot Hengyuan. This is very helpful!

@0xJchen

0xJchen commented Aug 19, 2021

> Oh, I think now I understand what you meant in your previous question. In principle, the belief model should be a function of the partner's policy. If we play with a partner (policy-A) in a two-player game (us, policy-A), we should use a belief model trained on data generated from games played by (policy-pi, policy-A), but from policy-pi's perspective. policy-pi can be anything: it can be the policy we are going to use (sound and on-distribution), or it can be the same as policy-A (sound but off-distribution; we hope the neural network generalizes well).
>
> Understanding this point, it is easy to see the problem with adapting LBS to SPARTA-style multi-agent search. In LBS, we start by training policy-A in selfplay (policy-A, policy-A). Then we train the belief model of policy-A in the selfplay setting (policy-A, policy-A). This belief model is sound when used by LBS(policy-A) to play with policy-A, i.e. (LBS(policy-A), policy-A). The belief is wrong, however, if it is used in (MultiAgent-LBS(policy-A), MultiAgent-LBS(policy-A)). The belief model would need to be updated during search as well.

Hi, hengyuan. Thanks for the explanation, I got it. The learned belief model is fixed once trained, and thus can't adapt to other policies at runtime.

@mohitahuja1
Author

> 3. Short answer: we include the last action as usual.
>
> Long answer: The core of OBL is that we decouple the formation of conventions (induced by the belief) and the optimization of the policy by introducing a "fixed belief". In normal RL, these two problems are closely coupled, as a change of the policy naturally affects how actions are interpreted.

Hi @hengyuan-hu ,

I hope you are well. I have a few follow-up questions, please.

Question 1

Here is my understanding of point 3 of your previous message:

Instead of learning from the team mate's action (and thereby forming conventions), the agent learns from the belief-hand (and the rest of the observation).

In other words, action = f(belief_hand) instead of action = f(team_mate_last_action).

So, it does not form conventions with the team mate's actions but does form a relationship with the belief_hand (and the rest of the observation). And since the belief hand is grounded, this approach generalizes well with other players (ad-hoc team play).

Is my understanding correct?

Question 2

As you mentioned, the belief-hand is not used during evaluation. Instead, the unknown own-hand (all zeroes) is passed to the agent during evaluation (or test time) with other agents.

Since action = f(belief_hand), shouldn't we pass the belief_hand during evaluation too?

Additionally, do we not include the belief hand for s_{t+2}'? Do we pass the unknown own-hand (all zeroes) instead?

Question 3

I am building a simple model to test the Off-Belief Learning approach. I have trained a Hanabi playing agent using Rainbow-DQN. As expected, the self-play score of this agent is high but the cross-play score is low. I would like to use Off-Belief Learning to increase the cross-play score and test with other agents (ad-hoc play).

Is it possible to do this using Rainbow-DQN (which uses two feed forward layers)? Or do we necessarily require a network that uses LSTMs (and other components as described in the OBL paper)?

Question 4

As part of my implementation, I am using a simplified approach to calculate the belief-hand. It involves supervised learning.

I am building a 2-layer feed forward network to predict the cards in the agent's hand. The input to the network is the observation vector and the labels are the actual cards in hand. The data will be generated by playing games using the policy generated using Rainbow-DQN (or a random policy).

Does it make sense to try this approach?

Question 5

I have already implemented another simplified approach to calculate the belief-hand.

Instead of using a grounded policy and learning a belief model (as done in the Learned Belief Search paper), I am simply calculating the probability of each card (in the agent's hand) being one of the 25 possible cards. This probability is calculated by using the card knowledge (from hints) to find the total count of possible cards, and removing the cards visible in the game (discards, fireworks, the team mate's cards, etc.) from this total count. I then sample a card from this probability distribution and pass it to OBL as the belief hand.

This approach does not yield good results (no convergence, high loss, low scores). My assumption is that this is because there is no learning involved in calculating the belief-hand. Would you agree?

Question 6

According to the OBL paper: "The training loop sends a new copy of the model to the inference loop to have the inference model synced every 10 gradient steps." Does this mean that the target network is synced with the online network (main Q-network) after every 10 training steps (updates) of the online network?

Thank you very much for your help! I am looking forward to your views.

Regards,
Mohit

@hengyuan-hu
Contributor

hengyuan-hu commented Aug 26, 2021

  1. I don't quite understand your interpretation. The belief hand is never actually observed by the policy. It is used to create the fictitious state on which we produce the off-belief target. The core of OBL is the idea of using policy pi_0 to play until time t and using pi_1 to play from then on, and training pi_1 at time t. The learned belief, belief hand & fictitious state are the practical implementation of this idea.

  2. Again, the belief hand is never seen by the policy and is not part of the observation. It is used to create the fictitious state, on which we apply the action, receive the reward and compute the bootstrapped Q-values to get the target for Q-learning. Once the policy is trained, it is used the same way as any other policy, IQL, VDN, etc. For the same reason, we do not include a belief hand for s_{t+2}' anywhere in the input of the neural network.

  3. You may use a feed-forward network if you want. An LSTM is necessary and sound in theory because this is a partially observable environment. However, a feed-forward network should give you decent performance, as high as 24 when trained with enough data.

  4. We train the belief model with supervised learning as well. A (public-)LSTM will give you better performance but you are free to use any network. I may be missing your point here; have you done other simplifications that are not mentioned? We use a setting similar to RL for data generation so that we don't have an overfitting problem. You may need to pre-generate a lot of data otherwise.

  5. No, this method will work fine for OBL1 because it is indeed the analytical way to compute the belief given a random policy. We used it internally as well in our first version. A couple of things to be careful about: after sampling the first card, we need to update the distribution over the rest of the cards assuming we know the first card, and the same for every other card (see the sketch after this list). After sampling all cards, a good sanity check is that the sampled hand should be plausible 100% of the time. This belief is only good for OBL level 1. For the convergence issue, I don't think you are implementing OBL correctly given your previous questions.

  6. No. It just means the training worker sends the latest model to the inference worker every 10 batches. The target network & online network are synced every 2500 batches.
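
For reference, a small sketch of that renormalized, auto-regressive sampling (the card-knowledge mask and the count bookkeeping are simplified, and this is not the repo's v0-belief code):

```python
import torch

# Hanabi deck: 25 card types (5 colors x 5 ranks), with 3/2/2/2/1 copies per rank.
DECK_COUNT = torch.tensor([3, 2, 2, 2, 1], dtype=torch.float).repeat(5)  # length 25

def sample_hand_v0(knowledge_mask, visible_count, hand_len=5):
    """knowledge_mask: [hand_len, 25] 0/1 mask of card types compatible with the hints
    for each slot; visible_count: [25] copies already visible elsewhere (discards,
    fireworks, the partner's hand). Returns hand_len sampled card indices.
    Assumes the mask and counts are consistent, so each slot has positive mass."""
    remaining = (DECK_COUNT - visible_count).clamp(min=0)  # copies we could still be holding
    hand = []
    for i in range(hand_len):
        probs = knowledge_mask[i] * remaining
        probs = probs / probs.sum()                # normalize w.r.t. the remaining card counts
        card = torch.multinomial(probs, 1).item()
        hand.append(card)
        remaining[card] -= 1                       # condition later slots on this sample
    return hand
```

Sampling this way keeps the hand consistent with the public information, which is exactly the sanity check mentioned in point 5.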

@mohitahuja1
Author

Hi Hengyuan,

Thanks a lot for your response.

I completely misunderstood one important aspect of the OBL implementation.

My implementation of the observation vector contained own_hand (all zeroes). I was only changing the own_hand to the belief_hand in the observation vector. I was not changing the agent's hand in the actual fictitious game!

I now understand that the actual cards in the agent's hand need to be changed to the belief_hand in the fictitious step.

My apologies for this misunderstanding!

Thanks again for the clarifications. They have been extremely helpful.

Regards,
Mohit

@mohitahuja1
Author

mohitahuja1 commented Sep 7, 2021

Hi @hengyuan-hu ,

I hope you're doing well.

I have a few questions related to the implementation of OBL Level 1. Here, the belief hand is sampled from the probability distribution of cards in the agent's hand. The probability distribution is calculated using common card knowledge (hints) and accounting for other visible cards in the game (like discards, fireworks, team mate's cards, etc.).

Can you please provide your views on the questions below:

Question 1:

In the observation vector, is it ok to use common card knowledge (as implemented in the Hanabi paper -- Bard et al) or do we necessarily have to implement the v0-belief (as implemented in the SAD paper)? For now, I have used the common card knowledge in my implementation.

Question 2:

When I change the own-hand to the belief-hand in the fictitious state s_{t}', the card knowledge is completely removed (it shows color = None and rank = None for all cards in the agent's hand). This impacts the card knowledge for states s_{t}' and s_{t+2}'. I think I will need to ensure that the card knowledge is retained (not deleted) in the fictitious states. Would you agree?

Please note that in my current implementation, I have used the card knowledge corresponding to the own-hand for the state s_{t}' (I just use the observation vector corresponding to s_{t}). However, the card knowledge for the fictitious state s_{t+2}' is all None (except any hints coming from state s_{t+1}') -- this is due to the reason mentioned in the previous paragraph.

Question 3:

Based on my current implementation (using Rainbow), I get a score of 15/25 in self-play but only 9/25 for OBL (after 10 million training steps). As you can see, OBL is taking more time to train than self-play. The score for OBL is also plateauing earlier than self-play scores. Is this expected?

Thanks a lot for your help! I am looking forward to your response.

Regards,
Mohit

@hengyuan-hu
Contributor

> Here, the belief hand is sampled from the probability distribution of cards in the agent's hand. The probability distribution is calculated using common card knowledge (hints) and accounting for other visible cards in the game (like discards, fireworks, team mate's cards, etc.).

Have you normalized the distribution w.r.t. the card counts, similar to how we compute the v0 belief in our code? How do you sample the cards? Do you sample them auto-regressively, meaning that after sampling the first card, you adjust the distribution of the remaining positions assuming the first card is known?

Q1: It is OK to use the common card knowledge. It may be less than 1 point worse than the v0 belief, which is not critical, especially considering that your self-play agent is currently only at the 15-point level.

Q2: Why do you want to set the card knowledge to anything else? The fictitious states/transitions in OBL should not change or conflict with the known public knowledge at all. If you have done the hard-coded version (instead of the learned one) of the belief sampling as mentioned above, it is guaranteed by definition to comply with the public card knowledge. In short, you should not change the card knowledge, only the cards themselves.

Q3: In our experiments selfplay converges at 24 points while OBL converges at 21 points. It is very likely that OBL needs more data, but I don't think it should converge at 9 points. Similarly, selfplay should not stop at 15 points. But if you have a limited number of samples in mind then it might be a different story. Our policies are normally trained on 100 million games (6~7 billion transitions when converted to the feed-forward case), and each data point is used roughly 4 times during training.

@mohitahuja1
Author

Hi Hengyuan,

Thanks for providing the insights!

Q1. Yes, I have normalized the distribution and I sample auto-regressively.

Q2. As suggested, I will not change the card knowledge in the fictitious state.

Q3. I am trying to maximize the score using limited samples.

Quick question:

In OBL and self-play, how does the score (on evaluation games) change with the number of training steps? For self-play, I am seeing that the score changes rapidly in the first 5 million steps and then the curve flattens. Please see the graph below.

Are you seeing the same in your implementation too? Or does your score increase linearly at the same rate throughout training?

Regards,
Mohit

[Attached graph: self-play-scores]

@hengyuan-hu
Contributor

hengyuan-hu commented Sep 8, 2021

[Screenshot: learning curve from the SAD paper]

This is the learning curve shown in the SAD paper. Here 1 epoch = 1 batch of 128 game trajectories ~= 128 * 65 transitions, where 65 is my rough estimate of the average game length. The full learning curve spans roughly 65 hours.

This is the learning curve of the Rainbow agent mentioned in the original Hanabi Challenge paper, which takes 7 days to train. You may want to double-check their batch size, etc.
[Screenshot: learning curve of the Rainbow agent from the Hanabi Challenge paper]

From my experience, the most important factor here is the scale of the data. As mentioned, our training infrastructure generates and consumes billions of transitions per day. With this amount of data (and some simple techniques such as varying-epsilon exploration), even vanilla feed-forward Q-learning, without prioritized replay or a distributional Q-function, works well and reaches higher than 23.

@mohitahuja1
Author

Thanks a lot Hengyuan!
