Working with agent dimension in multi-agent workflows based on single policy (parameter sharing) #136
Thank you for your interest in Tianshou. We are just planning to support MARL, so this is a good issue for us to understand the requirements of MARL developers. Please take a look at #121.

Please check if my understanding is correct: the game you are playing falls into the category of multiple agents sharing a single policy (parameter sharing). If that is the case, I suggest creating a new DQN class that inherits from the DQN in Tianshou and overrides `process_fn`:

```python
class MobaDQN(DQNPolicy):
    def process_fn(self, batch: Batch, buffer: ReplayBuffer,
                   indice: np.ndarray) -> Batch:
        outputs = []
        for _batch, _buffer in split(batch, buffer):
            output = DQNPolicy.process_fn(self, _batch, _buffer, indice)
            outputs.append(output)
        return Batch.stack(outputs, axis=1)
```

One possible problem is that splitting the buffer in every `process_fn` call is repeated work. From the Tianshou side, maybe we can add an option to split the buffer right after collecting trajectories, so that the buffer is split only once and the split buffers are passed to `process_fn`, which would require some refactoring.

In addition, you may encounter another problem: the Tianshou 0.2.3 collector does not support vector rewards. This has been fixed in #125, provided that you have a version that includes that change.
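The `split` used in the snippet above is not an existing Tianshou function. A minimal sketch of what the batch side of such a helper could look like, assuming every per-agent field is stored as `[batch_size, n_agents, ...]` (the name `split_batch` and the field layout are assumptions for illustration, not Tianshou API):

```python
import numpy as np
from tianshou.data import Batch

def split_batch(batch: Batch, n_agents: int):
    """Yield one single-agent view of a joint batch per agent.

    Assumes obs/act/rew/obs_next are stored as [batch_size, n_agents, ...]
    and that `done` is shared by all agents.
    """
    for i in range(n_agents):
        yield Batch(obs=batch.obs[:, i], act=batch.act[:, i],
                    rew=batch.rew[:, i], done=batch.done,
                    obs_next=batch.obs_next[:, i])
```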
Yes, overall your description of the problem is accurate. If I understand your approach, the loop splits the joint batch and buffer per agent and runs each slice through the single-agent `process_fn`. I think this is a good approach, but I am facing some technical difficulties:
Yes, the
My opinion would be to give each agent a buffer holding only the information for `agent_i`. As for prioritized experience replay, this would be complicated: is the prioritized sampling different for each agent, or does each agent have the same sampling weight? In the former case, the prioritized sampling weight can be different for each agent, and it would be better to have totally separate buffers for each agent after the experiences are collected. For example:

```python
collectors = [copy(collector) for i in range(N)]
buffers = split(collector.buffer)
for collector, buffer in zip(collectors, buffers):
    collector.buffer = buffer
# do whatever you like with each collector,
# which holds only the transitions of agent_i
```
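Again, `split` here is pseudocode. A sketch of the buffer side, under the same assumed `[n_agents]` layout (the helper name `split_buffer` and the exact `add` field names are illustrative, not a fixed Tianshou API):

```python
import numpy as np
from tianshou.data import ReplayBuffer

def split_buffer(buffer: ReplayBuffer, n_agents: int):
    """Carve one single-agent buffer per agent out of a joint buffer.

    Assumes each stored transition holds obs/act/rew/obs_next with a
    leading [n_agents] axis and a shared scalar `done` flag.
    """
    buffers = [ReplayBuffer(size=len(buffer)) for _ in range(n_agents)]
    for idx in range(len(buffer)):
        t = buffer[idx]  # a Batch holding one joint transition
        for i, b in enumerate(buffers):
            b.add(obs=t.obs[i], act=t.act[i], rew=t.rew[i],
                  done=t.done, obs_next=t.obs_next[i])
    return buffers
```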
For the latter case, it is very annoying, because the importance weights have shape `[batch_size]` while everything else has shape `[batch_size, n_agent, ...]`. This would break the split. In this case, you have to hack the code yourself:

```python
class MobaDQN(DQNPolicy):
    def process_fn(self, batch: Batch, buffer: ReplayBuffer,
                   indice: np.ndarray) -> Batch:
        outputs = []
        # split everything else, but copy the importance weight
        for _batch, _buffer in split(batch, buffer):
            output = DQNPolicy.process_fn(self, _batch, _buffer, indice)
            outputs.append(output)
        data = Batch.stack(outputs, axis=1)
        # aggregate imp_weight over the n_agent dimension
        data['imp_weight'] = ...
        return data
```
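One plausible way to fill in that `...`, as a self-contained sketch: assume each per-agent output carries an importance weight of shape `[batch_size]` (the shapes and the mean reduction are illustrative choices, not the library's method):

```python
import numpy as np

# Stand-in for the per-agent weights produced inside process_fn.
batch_size, n_agents = 32, 4
per_agent_weight = np.random.rand(batch_size, n_agents)

# Collapse the agent axis so the prioritized buffer sees one priority per
# joint transition; mean is one option, max would favor the worst agent.
imp_weight = per_agent_weight.mean(axis=1)
assert imp_weight.shape == (batch_size,)
```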
Since you are doing something unusual, you have to be very familiar with the structure of your Batch objects: know their keys, split the right keys, avoid indexing an empty Batch, and copy the right keys. In general this is very application-dependent, and it seems Tianshou can help very little here.
In addition, I was wrong on this point.
Thanks for the quick reply, youkaichao. Regarding prioritized experience replay, I believe the two cases are similar, especially if we are traversing all the agents, so whatever is simpler to implement will be adopted. Besides, prioritized experience replay is not essential for my prototype.

Regarding the Batch: I am trying to use the numpy slicing syntax for the split function. However, even in the simple example from the documentation, it breaks and does not have a shape. I will investigate it.

In any case, I will experiment with this approach using DQN and other algorithms before posting more questions here. Maybe for on-policy algorithms this will be even simpler.
The implementation will be different.
Agree. It's up to you and heavily depends on your code.
Keep us informed. If it is still not resolved, you can post the content of your Batch object; maybe it is a bug in Batch, or we can point out the problem for you.
That's great! Keep us informed, please. We need feedback from multi-agent RL developers to understand the requirements in MARL.
```python
collectors = [copy(collector) for i in range(N)]
buffers = split(collector.buffer)
for collector, buffer in zip(collectors, buffers):
    collector.buffer = buffer
# do whatever you like with each collector,
# which holds only the transitions of agent_i
```

Are these approaches feasible? Do they require changes outside of the `Collector` (approach 3) or `ReplayBuffer` (approach 4) classes?
My proposal is that you collect experience with one collector first, and after collecting, you split that collector into multiple collectors.
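As a rough usage sketch of this proposal, reusing the hypothetical `split_buffer` helper from above and a `collector` built as usual (everything here assumes the joint `[n_agents]` data layout):

```python
from copy import copy

# Collect joint experience once with the shared collector ...
collector.collect(n_step=1000)

# ... then split it a single time, after collection.
buffers = split_buffer(collector.buffer, n_agents=4)
collectors = [copy(collector) for _ in buffers]
for c, b in zip(collectors, buffers):
    c.buffer = b  # each collector now holds only agent_i's transitions
```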
From the perspective of my own prototype, I think the approaches I proposed above are too complicated. I will consider your approach in the near future and see if I can figure out an efficient way of splitting the buffer. Today, I opted for an easy route for DQN that seems to be working:
I will test it and explore other extensions (e.g., prioritized replay) and algorithms (PPO, A2C, etc.). I am not sure if we should close this question now or leave it open for future updates. Thanks for helping me figure out how to customize Tianshou.
This seems resolved. Further discussion can move to #121.
Hi @p-veloso, today we merged the marl-example into master.
I am looking for a simple library to implement parameter sharing in multi-agent RL using single-agent RL algorithms. I have just discovered Tianshou, and it looks awesome, but I have a problem with the data dimension that represents the number of agents.
My project uses a custom grid-based environment where:
As far as I understand, Tianshou uses the collector both for getting the batches for the simulation (in one or multiple environments) and for retrieving batches for the training. Therefore, the
Notice that the number of samples in a batch from the perspective of the neural network (batch_size * n_agents) differs from the number of samples from the perspective of the environment (batch_size), which can be problematic. In the simulation, the agents should generate a coherent trajectory, so the n_agents dimension is important for indicating which action vectors should be passed to which environment. I can use the forward method of the neural network model to check whether the n_agents dimension exists; in that case, I merge batch_size and n_agents to run the network and then reshape the resulting q-values to extract the action vector for each environment, as sketched below.
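A minimal sketch of that flatten-and-restore trick (the class name `SharedQNet` and the layer sizes are made up for illustration; only the reshaping logic reflects the description above):

```python
import torch
import torch.nn as nn

class SharedQNet(nn.Module):
    """Hypothetical parameter-shared Q-network: merges the agent axis for
    the forward pass, then restores it so each env gets one action per agent."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions)
        )

    def forward(self, obs, state=None, info={}):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        if obs.dim() == 3:                       # [batch, n_agents, obs_dim]
            b, n, d = obs.shape
            q = self.net(obs.reshape(b * n, d))  # merge batch and agent axes
            return q.reshape(b, n, -1), state    # [batch, n_agents, n_actions]
        return self.net(obs), state
```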
However, this creates a problem on the training side, because the observations are stored with the n_agents dimension in the buffer, while for the training algorithm that dimension does not exist. For example, line 91 of dqn.py (tianshou 0.2.3) computes

```python
returns = buffer.rew[now] + self._gamma * returns
```

where `buffer.rew[now]` has shape `(batch_size, n_agents)` but `returns` has shape `(batch_size,)`, so this would break.
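The mismatch is easy to reproduce with plain numpy (the shapes are arbitrary, chosen only to illustrate the failure):

```python
import numpy as np

rew = np.ones((32, 4))    # buffer.rew[now]: shape (batch_size, n_agents)
returns = np.zeros(32)    # returns: shape (batch_size,)

try:
    rew + 0.99 * returns  # numpy aligns trailing axes: 4 vs 32 -> mismatch
except ValueError as e:
    print(e)  # operands could not be broadcast together ...
```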
What is the best way of addressing this? I foresee two possible strategies: