In [24]:
from pettingzoo.mpe import simple_spread_v3
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import clear_output

In [25]:
# Define the number of agents in the environment
num_agents = 3

# Initialize the 'simple_spread_v2' environment with specified parameters
env = simple_spread_v3.env(
    N=num_agents,                 # Number of agents and landmarks in the environment
    max_cycles=25,                # Maximum number of steps (cycles) per episode
    local_ratio=0.5,              # Weight applied to local versus global rewards
    continuous_actions=False,     # Use discrete action space
    render_mode='rgb_array',      # Set the rendering mode to return RGB frames
    # render_mode = 'human'    # display the environment's state in a window
)

In [26]:
env.reset()

In [27]:
env.agents

['agent_0', 'agent_1', 'agent_2']

In [28]:
for agent in env.agents:
    print(agent, env.observation_space(agent))

agent_0 Box(-inf, inf, (18,), float32)
agent_1 Box(-inf, inf, (18,), float32)
agent_2 Box(-inf, inf, (18,), float32)


## Agents' Observation Space
In the `simple_spread_v2` environment from PettingZoo's Multi-Agent Particle Environments (MPE), each agent's observation is represented by a one-dimensional NumPy array with 18 elements, corresponding to the `Box(-inf, inf, (18,), float32)` observation space.

This observation vector comprises the following components:

1. **Agent's Own Velocity (`self_vel`)**:
   - **Dimensions**: 2
   - **Description**: The agent's current velocity in the 2D plane, represented by its x and y components.

2. **Agent's Own Position (`self_pos`)**:
   - **Dimensions**: 2
   - **Description**: The agent's current position coordinates in the environment.

3. **Relative Positions of Landmarks (`landmark_rel_positions`)**:
   - **Dimensions**: 2 per landmark
   - **Description**: The position of each landmark relative to the agent's current position. With 3 landmarks, this results in 6 dimensions.

4. **Relative Positions of Other Agents (`other_agent_rel_positions`)**:
   - **Dimensions**: 2 per other agent
   - **Description**: The positions of other agents relative to the current agent's position. With 2 other agents, this accounts for 4 dimensions.

5. **Communication from Other Agents (`communication`)**:
   - **Dimensions**: 2
   - **Description**: Communication signals received from other agents. In this environment, agents are silent, so this component is typically zeroed out.

In [29]:
for agent in env.agents:
    print(agent, env.action_space(agent))

agent_0 Discrete(5)
agent_1 Discrete(5)
agent_2 Discrete(5)


## Action space
This indicates that the agent can choose from 5 distinct actions, typically corresponding to:

No Action: The agent remains stationary.
Move Left: The agent moves to the left.
Move Right: The agent moves to the right.
Move Down: The agent moves downward.
Move Up: The agent moves upward.

In [30]:
import numpy as np

def close(observation, num_agents=3, proximity_threshold=0.1):
    """
    Determines if an agent is within a specified proximity to any other agent.

    Parameters:
        observation (np.ndarray): The observation vector of the agent.
        num_agents (int): Total number of agents in the environment.
        proximity_threshold (float): Distance threshold to consider "close proximity".

    Returns:
        bool: True if the agent is close to another agent, False otherwise.
    """
    # Length of each relative position vector (assuming 2D space)
    rel_pos_len = 2

    # Extract the relative positions of other agents from the observation
    start_index = 4 + num_agents * rel_pos_len  # Skip self_vel, self_pos, and landmark_rel_positions
    other_agent_rel_positions = observation[start_index:start_index + (num_agents - 1) * rel_pos_len]

    # Iterate through each relative position
    for i in range(0, len(other_agent_rel_positions), rel_pos_len):
        rel_pos = other_agent_rel_positions[i:i + rel_pos_len]
        distance = np.linalg.norm(rel_pos)  # Calculate Euclidean distance
        print(f"distance is {distance}")
        if distance < proximity_threshold:
            return True

    return False


In [31]:
# test the labelling function close()
env.reset()

# Number of agents
num_agents = len(env.agents)

# Iterate through agents to check proximity
for agent in env.agents:
    observation = env.observe(agent)
    print(f"observation is {observation}")
    if close(observation, num_agents):
        print(f"{agent} is in close proximity to another agent.")
    else:
        print(f"{agent} is not in close proximity to another agent.")

observation is [ 0.          0.         -0.94738203 -0.71392    -0.04125339  1.329869
  1.697496    0.15351829  0.54906064 -0.20961536  0.13211206 -0.15654828
  0.2871902   1.54135     0.          0.          0.          0.        ]
distance is 0.20484374463558197
distance is 1.567876935005188
agent_0 is not in close proximity to another agent.
observation is [ 0.          0.         -0.81527    -0.87046826 -0.17336544  1.4864174
  1.565384    0.31006655  0.4169486  -0.05306709 -0.13211206  0.15654828
  0.15507813  1.6978983   0.          0.          0.          0.        ]
distance is 0.20484374463558197
distance is 1.704965591430664
agent_1 is not in close proximity to another agent.
observation is [ 0.          0.         -0.6601919   0.82743    -0.3284436  -0.21148095
  1.4103059  -1.3878318   0.26187047 -1.7509654  -0.2871902  -1.54135
 -0.15507813 -1.6978983   0.          0.          0.          0.        ]
distance is 1.567876935005188
distance is 1.704965591430664
agent_2 is no

In [32]:
import numpy as np

def landmark(observation, arrival_threshold=0.1):
    """
    Determines if the agent has arrived at any landmark.

    Parameters:
    - observation: numpy array containing the agent's observation.
    - arrival_threshold: float, the distance below which a landmark is considered 'arrived at'.

    Returns:
    - bool: True if the agent has arrived at any landmark, False otherwise.
    """
    # Extract the number of landmarks from the observation
    num_landmarks = 3  # Each landmark has an (x, y) position

    # Iterate over each landmark's relative position
    for i in range(num_landmarks):
        # Extract the relative position (x, y) of the landmark
        rel_pos = observation[2 + 2 * i : 4 + 2 * i]
        # Calculate the Euclidean distance
        distance = np.linalg.norm(rel_pos)
        print(f"distance to landmark{i} is {distance}")
        # Check if the distance is below the arrival threshold
        if distance < arrival_threshold:
            return True
    return False


In [33]:
# Assuming 'env' is your initialized simple_spread_v2 environment
observations = env.reset()
agent_id = 'agent_0'  # Example agent ID

# Get the observation for the specified agent
observation = env.observe(agent)
print(f"observation is {observation}")

# Check if the agent has arrived at any landmark
if landmark(observation):
    print(f"{agent_id} has arrived at a landmark.")
else:
    print(f"{agent_id} has not arrived at any landmark.")


observation is [ 0.          0.          0.54260415 -0.7632381  -0.15707378  0.97000307
 -0.571121    1.7438174   0.05553359 -0.216192   -1.1693283  -0.07309182
 -1.1375309   1.5838761   0.          0.          0.          0.        ]
distance to landmark0 is 0.9364569187164307
distance to landmark1 is 0.9826382994651794
distance to landmark2 is 1.8349601030349731
agent_0 has not arrived at any landmark.


In [34]:
import numpy as np

def towards(observation, num_agents=3, num_landmarks=3):
    """
    Determines if the agent is heading towards a landmark that another agent is closer to.

    Parameters:
    - observation: numpy array containing the agent's observation.
    - num_agents: int, total number of agents in the environment.
    - num_landmarks: int, total number of landmarks in the environment.

    Returns:
    - bool: True if the agent is heading towards a landmark that another agent is closer to, False otherwise.
    """
    # Extract agent's velocity and position
    self_vel = observation[0:2]
    self_pos = observation[2:4]

    # Extract relative positions of landmarks
    landmark_rel_positions = observation[4:4 + 2 * num_landmarks].reshape((num_landmarks, 2))

    # Extract relative positions of other agents
    other_agent_rel_positions = observation[4 + 2 * num_landmarks:-4].reshape((num_agents - 1, 2))

    # Calculate absolute positions of landmarks
    landmark_positions = self_pos + landmark_rel_positions

    # Initialize a flag to indicate if the agent is heading towards an occupied landmark
    heading_towards_occupied = False

    # Iterate over each landmark
    for landmark_pos in landmark_positions:
        # Calculate distance from the agent to the landmark
        self_to_landmark = np.linalg.norm(landmark_pos - self_pos)

        # Assume the current agent is the closest to the landmark
        closest_agent_distance = self_to_landmark

        # Iterate over each other agent's relative position
        for rel_pos in other_agent_rel_positions:
            # Calculate the absolute position of the other agent
            other_agent_pos = self_pos + rel_pos
            # Calculate distance from the other agent to the landmark
            other_to_landmark = np.linalg.norm(landmark_pos - other_agent_pos)
            # Update the closest agent distance if the other agent is closer
            if other_to_landmark < closest_agent_distance:
                closest_agent_distance = other_to_landmark

        # Check if the current agent is not the closest to the landmark
        if closest_agent_distance < self_to_landmark:
            # Calculate the vector from the agent to the landmark
            to_landmark_vector = landmark_pos - self_pos
            # Normalize the vectors
            if np.linalg.norm(self_vel) > 0:
                self_vel_normalized = self_vel / np.linalg.norm(self_vel)
            else:
                self_vel_normalized = self_vel
            if np.linalg.norm(to_landmark_vector) > 0:
                to_landmark_normalized = to_landmark_vector / np.linalg.norm(to_landmark_vector)
            else:
                to_landmark_normalized = to_landmark_vector
            # Calculate the dot product to determine if the agent is heading towards the landmark
            dot_product = np.dot(self_vel_normalized, to_landmark_normalized)
            print(f"two unit vectors:{self_vel_normalized, to_landmark_normalized}")
            print(f"dot_product: {dot_product}")
            if dot_product > 0.87:  # Adjust the threshold as needed (cos(30) ~= 0.87)
                heading_towards_occupied = True
                break

    return heading_towards_occupied


In [35]:
# Assuming 'env' is your initialized simple_spread_v2 environment
observations = env.reset()
agent_id = 'agent_0'  # Example agent ID
agent_index = 0  # Index corresponding to agent_0

# Get the observation for the specified agent
observation = env.observe(agent)
print(f"observation is {observation}")

# Check if the agent is heading towards a landmark that another agent is closer to
if towards(observation, agent_index):
    print(f"{agent_id} is heading towards a landmark occupied by another agent.")
else:
    print(f"{agent_id} is not heading towards a landmark occupied by another agent.")


observation is [ 0.          0.          0.510663   -0.1971223  -1.0898983   0.9509414
 -1.0775567   0.92513096 -0.37778127 -0.58092666 -1.1393815   0.9458253
 -1.2440398   0.11394595  0.          0.          0.          0.        ]
two unit vectors:(array([0., 0.], dtype=float32), array([-0.75350773,  0.65743905], dtype=float32))
dot_product: 0.0
two unit vectors:(array([0., 0.], dtype=float32), array([-0.7587307 ,  0.65140444], dtype=float32))
dot_product: 0.0
agent_0 is not heading towards a landmark occupied by another agent.


In [36]:
import numpy as np

class RewardMachine:
    def __init__(self):
        self.state = 'u0'  # Initial state

    def transition(self, event):
        """
        Transitions the RM state based on the given event and returns the associated reward.
        u0: the initial state
        u1: the agent is moving to a not good landmark
        u2: the agent has arrived a landmark
        u3: the agent is very close to another agent
        """
        if self.state == 'u0':
            if event == 'towards':
                self.state = 'u1'
                return -1
            elif event == 'landmark':
                self.state = 'u2'
                return 10
            elif event == 'close':
                self.state = 'u3'
                return -10
            else:
                return 0
        elif self.state == 'u1':
            if event == 'towards':
                return -1
            elif event == 'landmark':
                self.state = 'u2'
                return 10
            elif event == 'close':
                self.state = 'u3'
                return -10
            else:
                self.state = 'u0'
                return 0
        elif self.state == 'u2':
            return 0
        elif self.state == 'u3':
            if event == 'close':
                return -10
            elif event == 'towards':
                self.state = 'u1'
                return 0
            else:
                self.state = 'u0'
                return 0

In [37]:
# Assuming 'env' is your initialized simple_spread_v2 environment
observations = env.reset()
agent_id = 'agent_0'  # Example agent ID
agent_index = 0  # Index corresponding to agent_0

# Get the observation for the specified agent
agent_obs = env.observe(agent)
print(f"observation is {agent_obs}")

# Example integration with the environment
def compute_reward(agent_obs, reward_machine):
    """
    Computes the reward for the agent based on its observation and the reward machine.
    """
    if landmark(agent_obs):
        event = 'landmark'
    elif close(agent_obs):
        event = 'close'
    elif towards(agent_obs):
        event = 'towards'
    else:
        event = 'none'

    reward = reward_machine.transition(event)
    print(f"the event: {event}")
    print(f"the RM state: {reward_machine.state}")
    print(f"the reward: {reward}")
    return reward

reward_machine = RewardMachine()

compute_reward(agent_obs, reward_machine)

observation is [ 0.          0.         -0.821315    0.81473726  0.5756123  -1.3639863
 -0.09783769 -1.4809564   0.07975288 -0.88076395  0.52989763 -0.4707043
  0.5894765  -0.5078848   0.          0.          0.          0.        ]
distance to landmark0 is 1.1568729877471924
distance to landmark1 is 1.4804688692092896
distance to landmark2 is 1.484184741973877
distance is 0.7087693810462952
distance is 0.7780935168266296
two unit vectors:(array([0., 0.], dtype=float32), array([ 0.38880405, -0.92132044], dtype=float32))
dot_product: 0.0
two unit vectors:(array([0., 0.], dtype=float32), array([-0.06592015, -0.99782485], dtype=float32))
dot_product: 0.0
two unit vectors:(array([0., 0.], dtype=float32), array([ 0.09018069, -0.99592537], dtype=float32))
dot_product: 0.0
the event: none
the RM state: u0
the reward: 0


0

In [38]:
# Initialize the environment
env = simple_spread_v3.env()
env.reset()

# Initialize the Reward Machine
reward_machine = RewardMachine()

# Run the environment loop
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()    
    # Update the Reward Machine
    compute_reward(observation, reward_machine)
    if termination or truncation:
        action = None
    else:
        # Replace with your policy's action selection
        action = env.action_space(agent).sample()

    # Step the environment
    env.step(action)

env.close()


distance to landmark0 is 1.152658462524414
distance to landmark1 is 1.0067602396011353
distance to landmark2 is 1.304940104484558
distance is 0.9787635803222656
distance is 1.0965368747711182
two unit vectors:(array([0., 0.], dtype=float32), array([0.6514576 , 0.75868505], dtype=float32))
dot_product: 0.0
two unit vectors:(array([0., 0.], dtype=float32), array([ 0.9990368 , -0.04388209], dtype=float32))
dot_product: 0.0
two unit vectors:(array([0., 0.], dtype=float32), array([0.8736684 , 0.48652202], dtype=float32))
dot_product: 0.0
the event: none
the RM state: u0
the reward: 0
distance to landmark0 is 0.9704020619392395
distance to landmark1 is 0.8317274451255798
distance to landmark2 is 0.32947659492492676
distance is 0.9787635803222656
distance is 0.9243974089622498
two unit vectors:(array([0., 0.], dtype=float32), array([-0.38822612,  0.92156416], dtype=float32))
dot_product: 0.0
two unit vectors:(array([0., 0.], dtype=float32), array([0.4780336 , 0.87834156], dtype=float32))
dot_

# Use stable-baselines3

In [39]:
import gymnasium as gym
from pettingzoo.mpe import simple_spread_v2
import supersuit as ss
from stable_baselines3 import DQN
from stable_baselines3 import PPO
from supersuit import (
    pad_action_space_v0,
    pad_observations_v0,
    pettingzoo_env_to_vec_env_v1,
    concat_vec_envs_v1
)

In [49]:
# Initialize the environment
num_agents = 3
env = simple_spread_v3.parallel_env(
    N=num_agents, max_cycles=25, local_ratio=0.5, continuous_actions=True
)

In [50]:
# Develop a custom environment wrapper that incorporates the Reward Machine logic.
import gym
from gym import spaces

class RewardMachineWrapper(gym.Wrapper):
    def __init__(self, env, reward_machines):
        super(RewardMachineWrapper, self).__init__(env)
        self.reward_machines = reward_machines

    def reset(self, **kwargs):
        self.reward_machine.state = 'u0'
        return self.env.reset(**kwargs)


In [51]:
# Pad action and observation spaces to handle varying spaces among agents
env = pad_observations_v0(env)
env = pad_action_space_v0(env)
# Create Reward Machines for each agent
agent_ids = env.possible_agents
reward_machines = {agent_id: RewardMachine() for agent_id in agent_ids}

In [52]:
# Wrap the environment with the RewardMachineWrapper
env = RewardMachineWrapper(env, reward_machines)

# Convert to a vectorized environment
env = ss.pettingzoo_env_to_vec_env_v1(env)

# Concatenate vectorized environments for parallel execution
num_envs = 4  # Number of parallel environments
env = ss.concat_vec_envs_v1(env, num_envs, num_cpus=1, base_class='stable_baselines3')

AssertionError: pettingzoo_env_to_vec_env takes in a pettingzoo ParallelEnv. Can create a parallel_env with pistonball.parallel_env() or convert it from an AEC env with `from pettingzoo.utils.conversions import aec_to_parallel; aec_to_parallel(env)``

In [18]:


# Convert the PettingZoo environment to a vectorized environment
env = pettingzoo_env_to_vec_env_v1(env)

# Concatenate vectorized environments for parallel execution
num_envs = 4  # Number of parallel environments
env = concat_vec_envs_v1(env, num_envs, num_cpus=1, base_class='stable_baselines3')

## Vectorized environment's Key Features and Benefits
### Parallel Execution:
Vectorized environments enable the execution of multiple environments in parallel. This can significantly speed up the data sampling process, especially when using multiple CPU cores or GPUs.
### Batched Actions and Observations:
Instead of passing a single action and receiving a single observation and reward, vectorized environments handle batches of actions, observations, and rewards. This means that the agent can take multiple actions and receive multiple observations and rewards in each step.

## Train each agent independently


In [19]:
# Initialize an empty list to store the models for each agent (each agent has an independent policy)
models = []

# Loop over the number of agents to create and configure a PPO model for each
for agent_id in range(num_agents):
    # Initialize a PPO model with the following parameters:
    # - 'MlpPolicy': Indicates the use of a multi-layer perceptron policy network
    # - env: The environment in which the agent will be trained
    # - verbose=3: Sets the verbosity level to 3 for detailed logging during training
    # - device="cpu": Specifies that the model should be trained on the CPU
    model = PPO('MlpPolicy', env,
                n_steps=1024,  # the number of steps the agent collects in each environment before performing a policy update
                verbose=3, device="cpu",
               # tensorboard_log="./ppo_logs/"
               )
    
    # Append the initialized model to the models list
    models.append(model)

Using cpu device
Using cpu device
Using cpu device


In [22]:
len(envs)

NameError: name 'envs' is not defined

In [None]:
# Assuming 'envs' is a list of your base environments for each agent
reward_machines = [RewardMachine() for _ in envs]
wrapped_envs = [RewardMachineWrapper(env, rm) for env, rm in zip(envs, reward_machines)]
models = [PPO('MlpPolicy', env, verbose=1) for env in wrapped_envs]

In [21]:
# Total number of iterations
total_iterations = 100

# Number of steps each agent learns per iteration
steps_per_agent = 50000

# In Stable Baselines3’s implementation of Proximal Policy Optimization (PPO), 
# the model.learn() function continues training the existing policy rather than initializing a new one with each call.
for iteration in range(total_iterations):
    for agent_id, model in enumerate(models):
        print(f"Iteration {iteration + 1}, Training agent {agent_id + 1}")
        model.learn(total_timesteps=steps_per_agent)

Iteration 1, Training agent 1
------------------------------
| time/              |       |
|    fps             | 8293  |
|    iterations      | 1     |
|    time_elapsed    | 1     |
|    total_timesteps | 12288 |
------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 5117        |
|    iterations           | 2           |
|    time_elapsed         | 4           |
|    total_timesteps      | 24576       |
| train/                  |             |
|    approx_kl            | 0.006126936 |
|    clip_fraction        | 0.062       |
|    clip_range           | 0.2         |
|    entropy_loss         | -7.08       |
|    explained_variance   | -0.00292    |
|    learning_rate        | 0.0003      |
|    loss                 | 12.3        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.00433    |
|    std                  | 0.996       |
|    value_loss           | 34.2       

KeyboardInterrupt: 

In [None]:
import sys
print("Stopping the notebook execution.")
sys.exit()

In [None]:
model = PPO('MlpPolicy', env, verbose=3, device="cpu")

In [None]:
model.learn(total_timesteps=1_000_000)

### **How to Interpret Training Output**

1. **Monitor Rewards (`ep_rew_mean`):**
   - A consistently rising mean reward (`ep_rew_mean`) is a good indicator that the agent is learning to perform better in the environment.
   - If the reward plateaus or drops significantly, this could suggest that the agent has reached a suboptimal solution or is struggling to learn.

2. **Look at Episode Length (`ep_len_mean`):**
   - If your environment terminates episodes upon failure, a longer `ep_len_mean` indicates fewer failures.
   - In environments with fixed episode lengths, this value might remain constant.

3. **Entropy (`entropy_loss`):**
   - Early in training, higher entropy values are expected because the agent is still exploring.
   - Over time, entropy should decrease as the agent settles on an optimal policy.

4. **Loss Metrics (`value_loss`, `policy_loss`):**
   - These should gradually decrease but not hit zero, as they represent the errors the model is correcting.
   - Sudden spikes in these values can indicate instability, such as an inappropriate learning rate or poor exploration.

---

### **Common Issues Observed in Training Output**
- **Low FPS:** Indicates inefficiencies in the environment or code.
- **Flat Rewards (`ep_rew_mean`):** Indicates that the agent is not learning. Try adjusting hyperparameters like the learning rate, exploration strategy, or reward structure.
- **Diverging Loss Values:** A sign of instability. Ensure that your reward function is properly scaled and check for bugs in the environment.

---

By analysing these outputs, you can fine-tune your training process to achieve better results. Let me know if you want help debugging a specific output!

## **Additional Considerations:**

- **Parameter Sharing:** The above setup employs parameter sharing, where a single policy network is shared among all agents. This approach is commonly used in multi-agent reinforcement learning to reduce computational complexity and improve coordination among agents.

- **Environment-Specific Adjustments:** Depending on the characteristics of the `simple_spread_v3` environment, you might need to apply additional wrappers or adjustments. For instance, if the environment provides visual observations, you may need to include wrappers for frame stacking or resizing.

- **Compatibility Issues:** Be aware of potential compatibility issues between different versions of the libraries. It's advisable to consult the official documentation and release notes for PettingZoo, SuperSuit, and Stable-Baselines3 to ensure seamless integration.

By following these steps, you can effectively wrap the `simple_spread_v3` environment using SuperSuit, making it compatible with Stable-Baselines3 for training multi-agent reinforcement learning models. 