Code release for Off-Belief Learning and Learned Belief Search #26
Comments
Hi Mohit, thanks for your interest! OBL & belief training are radically different and we will release them as a new repo. We are aiming for a September release, as the code still needs to go through a fair amount of cleanup & internal review.
Thanks @hengyuan-hu!
Hi, hengyuan. I am trying to re-implement the belief model mentioned in both "OBL" and "LBS". May I ask some questions concerning the architecture of the belief model? My understanding is as follows (in short, I found the AR belief model is like a …
Q3: May I ask about the necessity of training a new policy model in …

Thanks for sharing the code with the community and reading these verbose questions. I really learned a lot from your great work.
Thanks for the insightful explanation!
Hi, hengyuan. May I ask an additional question concerning …
The belief model is not deterministic. It outputs a probability distribution and we sample from it from the oldest position to the latest, instead of taking argmax.
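To make the sampling procedure above concrete, here is a minimal illustrative sketch in PyTorch. The class name, layer sizes, and the `encoding` input are placeholders rather than the actual repo architecture; the point is only the oldest-to-newest autoregressive loop and the use of `Categorical.sample()` instead of `argmax`.

```python
import torch
import torch.nn as nn

class ARBeliefSampler(nn.Module):
    """Illustrative autoregressive belief head (not the actual repo code).

    Given a trajectory encoding, it predicts the player's hand one card slot
    at a time, from the oldest slot to the newest, conditioning each slot on
    the cards already sampled for the previous slots.
    """

    def __init__(self, encode_dim=512, num_card_types=25, hand_size=5):
        super().__init__()
        self.hand_size = hand_size
        self.num_card_types = num_card_types
        self.card_emb = nn.Embedding(num_card_types + 1, 64)  # +1 for "no card yet"
        self.lstm = nn.LSTM(encode_dim + 64, 256, batch_first=True)
        self.out = nn.Linear(256, num_card_types)

    @torch.no_grad()
    def sample_hand(self, encoding):
        # encoding: [batch, encode_dim], e.g. the output of a trajectory encoder
        batch = encoding.size(0)
        prev = torch.full(
            (batch,), self.num_card_types, dtype=torch.long, device=encoding.device
        )  # "no card yet" token
        hidden = None
        hand = []
        for _ in range(self.hand_size):  # oldest slot -> newest slot
            inp = torch.cat([encoding, self.card_emb(prev)], dim=1).unsqueeze(1)
            out, hidden = self.lstm(inp, hidden)
            logits = self.out(out.squeeze(1))
            # sample instead of argmax: the belief is a distribution, not a point estimate
            prev = torch.distributions.Categorical(logits=logits).sample()
            hand.append(prev)
        return torch.stack(hand, dim=1)  # [batch, hand_size] card indices
```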
I got it. Thanks for the clear explanation.
Hi @hengyuan-hu, I hope you are well. I have a few questions about Off-Belief Learning (including the belief model). Could you please share your views on the below:
Thank you very much for all the help! Regards,
Thank you so much! Regards,
Hi, hengyuan. May I ask a related question concerning the SAD agent used in another of your works?
I suppose maybe the training script for the model in …
Hi. It would be better to create a new issue as this is a different problem. The short answer is that the models saved in this repo are simply state_dicts saved with torch.save(), instead of torchscript models saved with torch.jit.save. I tried my best to find an example conversion script (45b922e). Basically it loads the weights and saves them together with a minimal "forward" function that takes the input, produces an action, and updates the hidden states, following the format in the SPARTA code base. I am not 100% sure that it will work since I am not actively using the SPARTA code base at the moment, but you may get the idea from the script. Please open a new issue if this indeed does not work and you would like to discuss more. I may have time to reproduce it later.
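For anyone attempting the same conversion, the general recipe looks roughly like the sketch below. This is not the actual script from commit 45b922e: the wrapper class, file names, and the forward signature are placeholders, and the real input/output format has to match whatever the consuming code base (e.g. SPARTA) expects.

```python
import torch
import torch.nn as nn

class JitWrapper(nn.Module):
    """Illustrative wrapper: expose a minimal forward() around pretrained weights."""

    def __init__(self, net: nn.Module):
        super().__init__()
        self.net = net

    def forward(self, obs: torch.Tensor, h0: torch.Tensor, c0: torch.Tensor):
        # Placeholder forward: run the recurrent Q-network one step, pick the
        # greedy action, and return the updated hidden states. The actual
        # signature must match what the SPARTA search code calls.
        q, (h1, c1) = self.net(obs, (h0, c0))
        action = q.argmax(dim=-1)
        return action, h1, c1

# Hypothetical usage: rebuild the network with the same architecture used in
# training, load the plain state_dict, then save a torchscript module.
# net = MyR2D2Net(...)                                   # placeholder architecture
# net.load_state_dict(torch.load("weights.pthw", map_location="cpu"))
# wrapped = torch.jit.script(JitWrapper(net))            # or torch.jit.trace with example inputs
# torch.jit.save(wrapped, "model.pt")
```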
Hi, hengyuan. Thanks for the clarification and instant reply. The uploaded script works pretty well!
Hi @hengyuan-hu, I just have a couple of follow-up questions please.

Question 1 (related to the replay buffer in OBL): The OBL paper mentions that the fictitious target G_t' = r_t' + r_{t+1}' + Q_theta'(o_{t+2}', a) is stored in the replay buffer along with the actual trajectory tau. As I understand it, in order to train the network in OBL, we just need o_t, a_t and G_t'. Using these, we can easily calculate Q_theta(o_t, a_t) and use G_t' as the target to calculate the loss. Is my understanding correct? If so, why do we need to store the actual trajectory in the replay buffer? Shouldn't we only store o_t, a_t, and G_t'? Why do we need to additionally store o_{t+2}, r_{t+2}, terminal_{t+2}, etc.? Please note: I am trying to implement a simplified version of OBL using Rainbow (instead of R2D2), so I do not need to extract sequences from the replay buffer (as I am not using recurrent networks); I just need to extract single instances of prioritized experiences from the replay buffer.

Question 2 (related to the application of the belief model to OBL): I was not able to understand how I could use the output from the belief model as input to the OBL model. Can I concatenate the hand (from the belief model) to the input observation and feed it to OBL? How does extracting a fictitious state from the observation work? Do you mind explaining this in a little more detail please?

Thank you very much for your continued support! Regards,
Hanabi is partially observable so we need a recurrent network (LSTM in our case) to handle that. That's why we need to store the entire trajectory tau_t instead of just o_t. We store the entire trajectory because we train on every timestep: o_{t+2} is used to train the network with tau_{t+2}. I.e. given one trajectory stored in the replay buffer, we produce T OBL training targets, where T is the number of steps in the entire game.

We sample the player's hand from the belief model, copy the current game state to create the fictitious state, and reset the player's hand in the fictitious state. Once we have the fictitious state, we can produce the encoding (inputs to the network) the same way and feed it into the network.
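A rough pseudocode sketch of how one fictitious target might be produced at a single timestep is shown below. Every helper here (`sample_hand`, `clone`, `set_hand`, `step`, `observation`, `act`) is a placeholder for whatever the environment and belief APIs actually provide; it is only meant to illustrate the resample-hand / fictitious-rollout / bootstrap sequence described above, not the repo's implementation.

```python
def obl_fictitious_target(state_t, action_t, belief_model, policy, target_q, gamma=1.0):
    """Illustrative OBL target for one timestep (placeholder APIs throughout).

    Computes G_t' = r_t' + r_{t+1}' + Q_target(o_{t+2}', .) on a fictitious
    rollout in which the acting player's hand is resampled from the belief model.
    """
    me = state_t.cur_player()

    # 1. Resample our hand from the (fixed) belief model, conditioned on the
    #    observation history, and build a fictitious copy of the game state.
    sampled_hand = belief_model.sample_hand(state_t.observation_history(me))
    fict_state = state_t.clone()                 # placeholder deep copy of the game state
    fict_state.set_hand(player=me, hand=sampled_hand)

    # 2. Apply our real action a_t in the fictitious state.
    fict_state_t1, r_t = fict_state.step(action_t)

    # 3. Let the partner act with the current policy in the fictitious state.
    partner_action = policy.act(fict_state_t1.observation(fict_state_t1.cur_player()))
    fict_state_t2, r_t1 = fict_state_t1.step(partner_action)

    # 4. Bootstrap from the target network two steps ahead (a full implementation
    #    would typically use double Q-learning to select the bootstrap action).
    o_t2 = fict_state_t2.observation(me)
    bootstrap = 0.0 if fict_state_t2.terminal() else target_q(o_t2).max().item()
    return r_t + gamma * r_t1 + gamma ** 2 * bootstrap
```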
Hi @hengyuan-hu , Thank you so much for the clarifications. If you don't mind, I will share a few final questions so that my understanding is absolutely clear.
Thanks again for your great help and taking the time to answer our questions! Regards,
Hello, you are welcome :)
Long answer: The core of OBL is that we decouple the formation of conventions (induced by the belief) from the optimization of the policy by introducing a "fixed belief". In normal RL, these two problems are closely correlated, as changes to the policy naturally affect how actions are interpreted. Intuitively, you can think of the belief as the interpreter of actions, and OBL fundamentally changes the way we interpret actions. For example, a belief trained on top of a random policy will interpret all actions AS IF they were issued by a random policy (we call it belief-0, as the random policy can be seen as OBL level 0), i.e. those actions carry no meaning beyond the grounded information.

In normal RL, the policy may discover randomly that "I hint red, my partner plays the 3rd card and we get reward"; this sequence of actions is then reinforced and conventions are formed. In OBL, assuming we use belief-0, this type of reinforcing loop is broken. When our partner wants to hint red to tell us to play the 3rd card, we "interpret" this action with belief-0, which basically asks the question "what cards would I be holding if my partner were random?". Such an interpretation, realized by sampling our hand from belief-0 given the history of observations, will conclude that our 3rd card can be ANYTHING as long as it complies with the public knowledge. If we try to play that card, i.e. apply the move on the fictitious state with the resampled hand, we will very likely fail to score a point and instead lose a life token. Of course, using belief-0 will not give us the best policy for this game, so in higher-level OBL we use more sophisticated beliefs learned from prior OBL policies. As we show in the paper, this sequence of learning gives us consistent results across multiple different runs.

Side note: Getting rid of the last_action part of the input will still lead to the formation of "arbitrary conventions" because 1) the model with LSTM can infer what happened, and 2) even without LSTM, the model can use other parts of the input, such as card knowledge, to exchange information, e.g. if your 3rd card is hinted red then your 1st card can be safely discarded. 3) However, we did find that getting rid of last_action, combined with the color permutation method from the "other-play" paper, was sufficient to produce consistent policies across different runs, but such policies are not very good at collaborating with others. It is like pushing the policy optimization into some weird corner of the parameter space with feature engineering, which is not a general method.
Hope it helps.
Hi, hengyuan. Thanks for sharing these insights with the community. In the first part you discussed the …
Hi, hengyuan. Thanks for the reply. I got what you mean. Once … By the way, do you think it is possible to extend LBS to multi-agent search (as in SPARTA)?
Oh, I think now I understand what you mean in your previous question. In principle, the belief model should be a function of the partner's policy. If we play with a partner (policy-A) in a two-player game (us, policy-A), we should use a belief model trained on data generated from games played by (policy-pi, policy-A), but from policy-pi's perspective. policy-pi can be anything: it can be the policy we are going to use (sound and on-distribution), or it can be the same as policy-A (sound but off-distribution; we hope the neural network generalizes well).

Understanding this point, it is easy to see the problem of adapting LBS to SPARTA-style multi-agent search. In LBS, we start by training policy-A in selfplay (policy-A, policy-A). Then we train a belief model of policy-A in the selfplay setting (policy-A, policy-A). This belief model is sound when used by LBS(policy-A) to play with policy-A, i.e. (LBS(policy-A), policy-A). The belief is wrong, however, if it is used in (MultiAgent-LBS(policy-A), MultiAgent-LBS(policy-A)). The belief model would need to be updated during search as well.
Thanks a lot Hengyuan. This is very helpful!
Hi, hengyuan. Thanks for the explanation, I got it. The learned belief model is fixed once trained, and thus can't be adapted to other policies at runtime.
Hi @hengyuan-hu, I hope you are well. I have a few follow-up questions please.

Question 1: Here is my understanding of point 3 of your previous message: instead of learning from the team mate's action (and thereby forming conventions), the agent learns from the belief-hand (and the rest of the observation). In other words, action = f(belief_hand) instead of action = f(team_mate_last_action). So it does not form conventions with the team mate's actions, but does form a relationship with the belief_hand (and the rest of the observation). And since the belief hand is grounded, this approach generalizes well with other players (ad-hoc team play). Is my understanding correct?

Question 2: As you mentioned, the belief-hand is not used during evaluation. Instead, the unknown own-hand (all zeroes) is passed to the agent during evaluation (or test time) with other agents. Since action = f(belief_hand), shouldn't we pass the belief_hand during evaluation too? Additionally, do we not include the belief hand for s_{t+2}'? Do we pass the unknown own-hand (all zeroes) instead?

Question 3: I am building a simple model to test the Off-Belief Learning approach. I have trained a Hanabi-playing agent using Rainbow-DQN. As expected, the self-play score of this agent is high but the cross-play score is low. I would like to use Off-Belief Learning to increase the cross-play score and test with other agents (ad-hoc play). Is it possible to do this using Rainbow-DQN (which uses two feed-forward layers)? Or do we necessarily require a network that uses LSTMs (and the other components described in the OBL paper)?

Question 4: As part of my implementation, I am using a simplified approach to calculate the belief-hand that involves supervised learning. I am building a 2-layer feed-forward network to predict the cards in the agent's hand. The input to the network is the observation vector and the labels are the actual cards in hand. The data will be generated by playing games using the policy produced by Rainbow-DQN (or a random policy). Does it make sense to try this approach?

Question 5: I have already implemented another simplified approach to calculate the belief-hand. Instead of using a grounded policy and creating a belief hand (as done in the Learned Belief Search paper), I simply calculate the probability of each card (in the agent's hand) being one of the 25 possible cards. This probability is calculated by using card knowledge (from hints) to find the total count of possible cards, and removing the visible cards in the game (like discards, fireworks, team mate's cards, etc.) from this total count. I then sample a card from this probability distribution and pass it to OBL as the belief hand. This approach does not yield good results (no convergence, high loss, low scores). My assumption is that this is because there is no learning involved in calculating the belief-hand. Would you agree?

Question 6: According to the OBL paper: "The training loop sends a new copy of the model to the inference loop to have the inference model synced every 10 gradient steps." Does this mean that the target network is synced with the online network (main Q-network) after every 10 training steps (updates) of the online network?

Thank you very much for your help! I am looking forward to your views. Regards,
Hi Hengyuan, Thanks a lot for your response. I completely misunderstood one important aspect of the OBL implementation. My implementation of the observation vector contained own_hand (all zeroes). I was only changing the own_hand to the belief_hand in the observation vector. I was not changing the agent's hand in the actual fictitious game! I now understand that the actual cards in the agent's hand need to be changed to the belief_hand in the fictitious step. My apologies for this misunderstanding! Thanks again for the clarifications. They have been extremely helpful. Regards, |
Hi @hengyuan-hu, I hope you're doing well. I have a few questions related to the implementation of OBL Level 1. Here, the belief hand is sampled from the probability distribution of cards in the agent's hand. The probability distribution is calculated using common card knowledge (hints) and accounting for other visible cards in the game (like discards, fireworks, team mate's cards, etc.). Can you please provide your views on the questions below?

Question 1: In the observation vector, is it OK to use common card knowledge (as implemented in the Hanabi paper -- Bard et al.) or do we necessarily have to implement the v0-belief (as implemented in the SAD paper)? For now, I have used the common card knowledge in my implementation.

Question 2: When I change the own-hand to the belief-hand in the fictitious state s_{t}', the card knowledge is completely removed (it shows color = None and rank = None for all cards in the agent's hand). This impacts the card knowledge for states s_{t}' and s_{t+2}'. I think I will need to ensure that the card knowledge is retained (not deleted) in the fictitious states. Would you agree? Please note that in my current implementation, I have used the card knowledge corresponding to the own-hand for the state s_{t}' (I just use the observation vector corresponding to s_{t}). However, the card knowledge for the fictitious state s_{t+2}' is all None (except any hints coming from state s_{t+1}') -- this is due to the reason mentioned above.

Question 3: Based on my current implementation (using Rainbow), I get a score of 15/25 in self-play but only 9/25 for OBL (after 10 million training steps). As you can see, OBL is taking more time to train than self-play. The score for OBL is also plateauing earlier than the self-play score. Is this expected?

Thanks a lot for your help! I am looking forward to your response. Regards,
Have you normalized the distribution w.r.t. the card count, similar to how we compute the v0 belief in our code? How do you sample cards? Do you sample them auto-regressively, meaning that after sampling the first card, you adjust the distribution of the remaining positions assuming the first card is known?

Q1: It is OK to use common card knowledge. It may be less than 1 point worse than the v0 belief, but that is not critical, especially considering that your self-play agent is currently only at the 15-point level.

Q2: Why do you want to set the card knowledge to anything else? Fictitious states/transitions in OBL should not change or conflict with the known public knowledge at all. If you have done the hard-coded version (instead of the learned one) of the belief sampling as mentioned above, it is guaranteed to comply with the public card knowledge by definition. In short, you should not change the card knowledge, only the cards themselves.

Q3: In our experiments selfplay converges at 24 points while OBL converges at 21 points. It is very likely that OBL needs more data, but I don't think it should converge at 9 points. Similarly, selfplay should not stop at 15 points. But if you have a limited number of samples in mind then it might be a different story. Our policies are normally trained on 100 million games (6~7 billion transitions when converted to the feed-forward case), and each data point is used roughly 4 times during training.
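As an illustration of the two points in the first paragraph (count-normalized probabilities and autoregressive sampling), here is a small hand-coded sketch. The inputs `possible_mask` and `visible_counts` are assumptions standing in for however card knowledge and visible cards are tracked in a given implementation; this is not the repo's v0-belief code.

```python
import numpy as np

# Standard Hanabi deck: 5 colors x 5 ranks, with per-color rank counts 3,2,2,2,1.
DECK_COUNTS = np.tile(np.array([3, 2, 2, 2, 1]), 5).astype(np.float64)  # shape [25]

def sample_hand(possible_mask, visible_counts, rng):
    """Sample a hand autoregressively from a hand-coded (v0-style) belief.

    possible_mask:  [hand_size, 25] 0/1 mask derived from hint-based card knowledge
    visible_counts: [25] copies of each card already visible (discards, fireworks,
                    partner hands), subtracted from the deck counts
    """
    remaining = DECK_COUNTS - visible_counts
    hand = []
    for slot in range(possible_mask.shape[0]):         # oldest slot -> newest
        probs = possible_mask[slot] * np.clip(remaining, 0, None)
        probs = probs / probs.sum()                    # normalize w.r.t. card counts
        card = rng.choice(25, p=probs)
        hand.append(card)
        remaining[card] -= 1                           # condition later slots on this draw
    return hand

# Usage sketch:
# rng = np.random.default_rng(0)
# hand = sample_hand(possible_mask, visible_counts, rng)
```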
Hi Hengyuan, Thanks for providing the insights!

Q1. Yes, I have normalized the distribution and I sample auto-regressively.
Q2. As suggested, I will not change the card knowledge in the fictitious state.
Q3. I am trying to maximize the score using limited samples.

Quick question: In OBL and self-play, how does the score (on evaluation games) change with the number of training steps? For self-play, I am seeing that the score changes rapidly in the first 5 million steps and then the curve flattens (please see the graph below). Are you seeing the same in your implementation too? Or does your score increase linearly at the same rate throughout training? Regards,
Thanks a lot Hengyuan!
Hi @hengyuan-hu ,
Thanks a lot for your informative answers!
Would you have an approximate estimate of when you would be able to release the Off-Belief Learning (OBL) code?
Also, have you released the code for the auto-regressive belief model (part of the Learned Belief Search paper) which is used in the OBL implementation?
Thank you for your efforts in producing such valuable research and sharing it with the larger community!
Regards,
Mohit