In [1]:
from pettingzoo.mpe import simple_spread_v3
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import clear_output

In [2]:
# Define the number of agents in the environment
num_agents = 3

# Initialize the 'simple_spread_v3' environment with specified parameters
env = simple_spread_v3.env(
    N=num_agents,                 # Number of agents and landmarks in the environment
    max_cycles=25,                # Maximum number of steps (cycles) per episode
    local_ratio=0.5,              # Weight applied to local versus global rewards
    continuous_actions=False,     # Use discrete action space
    render_mode='rgb_array',      # Set the rendering mode to return RGB frames
    # render_mode = 'human'    # display the environment's state in a window
)

In [3]:
env.reset()

In [4]:
env.agents

['agent_0', 'agent_1', 'agent_2']

In [5]:
for agent in env.agents:
    print(agent, env.observation_space(agent))

agent_0 Box(-inf, inf, (18,), float32)
agent_1 Box(-inf, inf, (18,), float32)
agent_2 Box(-inf, inf, (18,), float32)


## Agents' Observation Space
In the `simple_spread_v3` environment from PettingZoo's Multi-Agent Particle Environments (MPE), each agent's observation is a one-dimensional NumPy array with 18 elements, matching the `Box(-inf, inf, (18,), float32)` observation space printed above.

This observation vector comprises the following components:

1. **Agent's Own Velocity (`self_vel`)**:
   - **Dimensions**: 2
   - **Description**: The agent's current velocity in the 2D plane, represented by its x and y components.

2. **Agent's Own Position (`self_pos`)**:
   - **Dimensions**: 2
   - **Description**: The agent's current position coordinates in the environment.

3. **Relative Positions of Landmarks (`landmark_rel_positions`)**:
   - **Dimensions**: 2 per landmark
   - **Description**: The position of each landmark relative to the agent's current position. With 3 landmarks, this results in 6 dimensions.

4. **Relative Positions of Other Agents (`other_agent_rel_positions`)**:
   - **Dimensions**: 2 per other agent
   - **Description**: The positions of other agents relative to the current agent's position. With 2 other agents, this accounts for 4 dimensions.

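5. **Communication from Other Agents (`communication`)**:
   - **Dimensions**: 2 per other agent
   - **Description**: Communication signals received from other agents. With 2 other agents, this accounts for 4 dimensions. In this environment, agents are silent, so this component is typically zeroed out.

Together these components give 2 + 2 + 6 + 4 + 4 = 18 elements.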
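The layout above can be sketched as a small helper that slices a flat observation into named parts. This is only an illustration: the function name `split_observation` is hypothetical, and it assumes one landmark per agent and 2 communication values per other agent, which is what makes the parts add up to 18 for 3 agents.

```python
def split_observation(obs, num_agents=3):
    """Split a flat simple_spread observation into named parts.

    Assumes the layout [self_vel, self_pos, landmark_rel_positions,
    other_agent_rel_positions, communication] with one landmark per agent.
    """
    n_landmarks = num_agents      # N sets both the agent and landmark count
    n_others = num_agents - 1
    sizes = [
        ("self_vel", 2),
        ("self_pos", 2),
        ("landmark_rel_positions", 2 * n_landmarks),
        ("other_agent_rel_positions", 2 * n_others),
        ("communication", 2 * n_others),
    ]
    expected = sum(size for _, size in sizes)
    if len(obs) != expected:
        raise ValueError(f"expected {expected} values, got {len(obs)}")

    # Walk through the vector, cutting off one component at a time
    parts, start = {}, 0
    for name, size in sizes:
        parts[name] = obs[start:start + size]
        start += size
    return parts


# Placeholder observation; slicing works the same way on a NumPy array
parts = split_observation(list(range(18)))
for name, values in parts.items():
    print(f"{name:26s} {len(values)}")
```

The length check makes the helper fail loudly if the environment is created with a different `N`, since the vector length changes with the number of agents.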

In [6]:
for agent in env.agents:
    print(agent, env.action_space(agent))

agent_0 Discrete(5)
agent_1 Discrete(5)
agent_2 Discrete(5)


## Action space
`Discrete(5)` means that each agent chooses one of 5 actions at every step. In index order, these typically correspond to:

- `0` **No Action**: The agent applies no movement.
- `1` **Move Left**: The agent moves to the left.
- `2` **Move Right**: The agent moves to the right.
- `3` **Move Down**: The agent moves downward.
- `4` **Move Up**: The agent moves upward.
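To make sampled action indices easier to read in logs, the five actions can be given names. The sketch below is an assumption-laden convenience, not part of PettingZoo: the `SpreadAction` enum and `describe` helper are hypothetical, and the index order is taken from the list above.

```python
from enum import IntEnum


class SpreadAction(IntEnum):
    """Hypothetical names for the five discrete actions, in index order."""
    NO_ACTION = 0
    MOVE_LEFT = 1
    MOVE_RIGHT = 2
    MOVE_DOWN = 3
    MOVE_UP = 4


def describe(action_index):
    """Return a readable name for an action index such as one sampled from Discrete(5)."""
    return SpreadAction(action_index).name.lower()


print([describe(i) for i in range(len(SpreadAction))])
```

Because `IntEnum` members compare equal to plain integers, a member such as `SpreadAction.MOVE_UP` can be passed wherever an integer action is expected.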

# Use stable-baselines3

In [7]:
import gymnasium as gym
from pettingzoo.mpe import simple_spread_v3
import supersuit as ss
from stable_baselines3 import DQN
from stable_baselines3 import PPO
from supersuit import (
    pad_action_space_v0,
    pad_observations_v0,
    pettingzoo_env_to_vec_env_v1,
    concat_vec_envs_v1
)

In [8]:
# Initialize the environment
num_agents = 3
env = simple_spread_v3.parallel_env(
    N=num_agents, max_cycles=25, local_ratio=0.5, continuous_actions=True
)

In [9]:
# Pad action and observation spaces to handle varying spaces among agents
env = pad_observations_v0(env)
env = pad_action_space_v0(env)

# Convert the PettingZoo environment to a vectorized environment
env = pettingzoo_env_to_vec_env_v1(env)

# Concatenate vectorized environments for parallel execution
num_envs = 4  # Number of parallel environments
env = concat_vec_envs_v1(env, num_envs, num_cpus=1, base_class='stable_baselines3')

## Vectorized environment's Key Features and Benefits
### Parallel Execution:
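Vectorized environments step several environment copies together. This can significantly speed up data sampling, especially when the copies are distributed across multiple CPU cores. In the cell above `num_cpus=1`, so the 4 copies are stepped within a single process and the benefit comes mainly from batching.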
### Batched Actions and Observations:
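Instead of passing a single action and receiving a single observation and reward, vectorized environments handle batches of actions, observations, and rewards. At each step the policy therefore produces one action per sub-environment and receives one observation and reward per sub-environment in return. After `pettingzoo_env_to_vec_env_v1`, every agent counts as its own sub-environment, so 3 agents in 4 copies give a batch of 12.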
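The batch sizes involved can be worked out with simple arithmetic. The sketch below uses hypothetical helper names and assumes that each agent of each environment copy becomes one sub-environment, and that PPO collects `n_steps` transitions per sub-environment before an update. The result lines up with the `total_timesteps` of 12288 reported after the first iteration in the training log further down.

```python
def num_sub_envs(num_agents, num_envs):
    """Each agent of each environment copy becomes one sub-environment."""
    return num_agents * num_envs


def rollout_size(n_steps, num_agents, num_envs):
    """Transitions gathered before one PPO update: n_steps per sub-environment."""
    return n_steps * num_sub_envs(num_agents, num_envs)


print(num_sub_envs(3, 4))        # 12: batch dimension of actions and observations
print(rollout_size(1024, 3, 4))  # 12288
print(rollout_size(2048, 3, 4))  # 24576 (2048 is PPO's default n_steps)
```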

## Train each agent independently


In [21]:
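# Initialize an empty list to store one PPO model per agent index.
# Note: `env` is the vectorized environment in which every agent is a sub-environment,
# so each model created below is trained on the experience of all agents, not of a single agent.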
models = []

# Loop over the number of agents to create and configure a PPO model for each
for agent_id in range(num_agents):
    # Initialize a PPO model with the following parameters:
    # - 'MlpPolicy': multi-layer perceptron policy and value networks
    # - env: the vectorized environment the model is trained on
    # - n_steps=1024: steps collected in each sub-environment before a policy update
    # - verbose=3: print training logs (SB3 documents levels 0 = silent, 1 = info, 2 = debug)
    # - device="cpu": train the model on the CPU
    model = PPO(
        'MlpPolicy',
        env,
        n_steps=1024,
        verbose=3,
        device="cpu",
        # tensorboard_log="./ppo_logs/",
    )
    
    # Append the initialized model to the models list
    models.append(model)

Using cpu device
Using cpu device
Using cpu device


In [None]:
# Total number of iterations
total_iterations = 100

# Number of steps each agent learns per iteration
steps_per_agent = 50000

# In Stable Baselines3’s implementation of Proximal Policy Optimization (PPO),
# model.learn() continues training the existing policy rather than initializing a new one with each call.
# The timestep counter, however, restarts at 0 on every call unless reset_num_timesteps=False is passed.
for iteration in range(total_iterations):
    for agent_id, model in enumerate(models):
        print(f"Iteration {iteration + 1}, Training agent {agent_id + 1}")
        model.learn(total_timesteps=steps_per_agent)

Iteration 1, Training agent 1
------------------------------
| time/              |       |
|    fps             | 8330  |
|    iterations      | 1     |
|    time_elapsed    | 1     |
|    total_timesteps | 12288 |
------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 5147        |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 24576       |
| train/                  |             |
|    approx_kl            | 0.006066808 |
|    clip_fraction        | 0.0596      |
|    clip_range           | 0.2         |
|    entropy_loss         | -7.08       |
|    explained_variance   | 0.00142     |
|    learning_rate        | 0.0003      |
|    loss                 | 8.12        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.00447    |
|    std                  | 0.996       |
|    value_loss           | 34.2       
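The log above advances in blocks of 12288 timesteps, so a request for 50000 timesteps cannot be met exactly. The sketch below estimates how many timesteps one `learn()` call collects, under the assumption that data is gathered in whole rollouts of `n_steps` per sub-environment and that collection stops once the requested total is reached or exceeded. The helper name is hypothetical.

```python
import math


def timesteps_per_learn_call(total_timesteps, n_steps, n_sub_envs):
    """Return (number of rollouts, timesteps collected) for one learn() call.

    Assumes whole rollouts of n_steps * n_sub_envs transitions.
    """
    rollout = n_steps * n_sub_envs
    n_rollouts = math.ceil(total_timesteps / rollout)
    return n_rollouts, n_rollouts * rollout


n_rollouts, collected = timesteps_per_learn_call(50_000, 1024, 12)
print(n_rollouts, collected)   # 5 61440

# Total over 100 outer iterations and 3 models
print(100 * 3 * collected)     # 18432000
```

Under these assumptions the full loop collects roughly 18 million timesteps, which is useful to know before starting a run on a CPU.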

In [None]:
import sys
print("Stopping the notebook execution.")
sys.exit()

In [18]:
model = PPO('MlpPolicy', env, verbose=3, device="cpu")

Using cpu device


In [19]:
model.learn(total_timesteps=1_000_000)

------------------------------
| time/              |       |
|    fps             | 8395  |
|    iterations      | 1     |
|    time_elapsed    | 2     |
|    total_timesteps | 24576 |
------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 3304        |
|    iterations           | 2           |
|    time_elapsed         | 14          |
|    total_timesteps      | 49152       |
| train/                  |             |
|    approx_kl            | 0.005600501 |
|    clip_fraction        | 0.0511      |
|    clip_range           | 0.2         |
|    entropy_loss         | -7.08       |
|    explained_variance   | -0.00222    |
|    learning_rate        | 0.0003      |
|    loss                 | 8.63        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.00283    |
|    std                  | 0.996       |
|    value_loss           | 29.5        |
---------------------------

KeyboardInterrupt: 

### **How to Interpret Training Output**

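1. **Monitor Rewards (`ep_rew_mean`):**
   - A consistently rising mean reward (`ep_rew_mean`) is a good indicator that the agent is learning to perform better in the environment.
   - If the reward plateaus or drops significantly, this could suggest that the agent has reached a suboptimal solution or is struggling to learn.
   - The logs above contain no `rollout/` section. SB3 reports `ep_rew_mean` and `ep_len_mean` only when episode statistics are recorded, for example by wrapping the vectorized environment in `VecMonitor`.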

2. **Look at Episode Length (`ep_len_mean`):**
   - If your environment terminates episodes upon failure, a longer `ep_len_mean` indicates fewer failures.
   - In environments with fixed episode lengths, such as this one with `max_cycles=25`, this value stays constant.

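3. **Entropy (`entropy_loss`):**
   - SB3 logs `entropy_loss` as the negative of the mean policy entropy, so the high entropy expected early in training, while the agent is still exploring, shows up as a strongly negative value (about -7.08 in the logs above).
   - Over time, entropy should decrease as the policy becomes more deterministic, which means `entropy_loss` rises toward zero.

4. **Loss Metrics (`value_loss`, `policy_gradient_loss`):**
   - `value_loss` should generally decrease but not hit zero, as it represents the prediction error the value network is correcting. `policy_gradient_loss` usually stays small and fluctuates rather than decreasing steadily.
   - Sudden spikes in these values can indicate instability, such as an inappropriate learning rate or poor exploration.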
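The `entropy_loss` values in the logs can be sanity-checked by hand. SB3 logs `entropy_loss` as the negative mean entropy of the policy, and for a continuous policy that entropy is the entropy of a diagonal Gaussian. The sketch below assumes a 5-dimensional action with the same standard deviation in every dimension; the helper name is hypothetical and the `std` values in the loop are illustrative.

```python
import math


def diag_gaussian_entropy(action_dim, std):
    """Entropy (in nats) of a diagonal Gaussian with the same std in every dimension."""
    per_dim = 0.5 * math.log(2 * math.pi * math.e) + math.log(std)
    return action_dim * per_dim


# Assumption: 5 action dimensions, as in the continuous simple_spread action space
for std in (1.0, 0.5, 0.1):
    entropy = diag_gaussian_entropy(5, std)
    print(f"std={std:<4} entropy={entropy:6.2f} entropy_loss={-entropy:6.2f}")
```

With `std` near 1 the entropy is about 7.09, which is consistent with the logged `entropy_loss` of -7.08 at `std` 0.996. As `std` shrinks, the entropy falls and `entropy_loss` moves upward, eventually becoming positive.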

---

### **Common Issues Observed in Training Output**
- **Low FPS:** Indicates inefficiencies in the environment or code.
- **Flat Rewards (`ep_rew_mean`):** Indicates that the agent is not learning. Try adjusting hyperparameters like the learning rate, exploration strategy, or reward structure.
- **Diverging Loss Values:** A sign of instability. Ensure that your reward function is properly scaled and check for bugs in the environment.

---

By analysing these outputs, you can fine-tune your training process to achieve better results.

## **Additional Considerations:**

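- **Parameter Sharing:** The above setup employs parameter sharing: because every agent appears as a sub-environment of the vectorized environment, a single policy network is trained on, and acts for, all agents. This approach is commonly used in multi-agent reinforcement learning with homogeneous agents, since it reduces the number of parameters to train and lets each agent benefit from the experience of the others.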

- **Environment-Specific Adjustments:** Depending on the characteristics of the environment, you might need to apply additional wrappers or adjustments. For instance, an environment that provides visual observations may need wrappers for frame stacking or resizing. `simple_spread_v3` returns flat observation vectors, so none of these are required here.

- **Compatibility Issues:** Be aware of potential compatibility issues between different versions of the libraries. It's advisable to consult the official documentation and release notes for PettingZoo, SuperSuit, and Stable-Baselines3 to ensure seamless integration.

By following these steps, you can effectively wrap the `simple_spread_v3` environment using SuperSuit, making it compatible with Stable-Baselines3 for training multi-agent reinforcement learning models. 