Simple Tag
https://www.pettingzoo.ml/mpe/simple_tag

> This is a predator-prey environment. Good agents (green) are faster and receive a negative reward for being hit by adversaries (red) (-10 for each collision). Adversaries are slower and are rewarded for hitting good agents (+10 for each collision). Obstacles (large black circles) block the way. By default, there is 1 good agent, 3 adversaries and 2 obstacles.

Testing some hardcoded algorithms

In [1]:
import os
import time
import enum
import math
import random
import collections
import statistics

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import torch
import torch.nn
import torch.nn.functional as F

class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

class TimeDelta(object):
    def __init__(self, delta_time):
        """Convert time difference in seconds to days, hours, minutes, seconds.
        
        Parameters
        ==========
        delta_time : float
            Time difference in seconds.
        """
        self.fractional, seconds = math.modf(delta_time)
        seconds = int(seconds)
        minutes, self.seconds = divmod(seconds, 60)
        hours, self.minutes = divmod(minutes, 60)
        self.days, self.hours = divmod(hours, 24)
    
    def __repr__(self):
        return f"{self.days}-{self.hours:02}:{self.minutes:02}:{self.seconds + self.fractional:05.2f}"
        
from pettingzoo.mpe import simple_tag_v2
from pettingzoo.utils import random_demo

Arguments for instantiating the environment:

- num_good: number of good agents
- num_adversaries: number of adversaries
- num_obstacles: number of obstacles
- max_cycles: number of frames (a step for each agent) until game terminates
- continuous_actions: whether agent action spaces are discrete (default) or continuous

In [2]:
env = simple_tag_v2.env(
    num_good=3,
    num_adversaries=3,
    num_obstacles=2,
    max_cycles=300,
    continuous_actions=False
).unwrapped
print("Peek into unwrapped environment:", *dir(env))

Peek into unwrapped environment: __class__ __delattr__ __dict__ __dir__ __doc__ __eq__ __format__ __ge__ __getattribute__ __gt__ __hash__ __init__ __init_subclass__ __le__ __lt__ __module__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__ __weakref__ _accumulate_rewards _agent_selector _clear_rewards _dones_step_first _execute_world_step _index_map _reset_render _set_action _was_done_step action_space action_spaces agent_iter agents close continuous_actions current_actions last local_ratio max_cycles max_num_agents metadata np_random num_agents observation_space observation_spaces observe possible_agents render reset scenario seed state state_space step steps unwrapped viewer world


### What are the environment parameters?

Adversaries (red) try to capture non-adversaries (green). The map is a continuous 2D plane, and everything is initialized in the region [-1, +1]. There doesn't seem to be position clipping for out-of-bounds agents, but non-adversary agents are penalized for going out of bounds.
An agent's observation is an ndarray vector of concatenated data in the following order:

1. current velocity (2,)
2. current position (2,)
3. relative position (2,) of each landmark
4. relative position (2,) of each other agent
5. velocity (2,) of each other non-adversary agent

With 3 adversaries and 3 non-adversaries (and 2 landmarks), the adversary observation space is 24-dimensional and the non-adversary observation space is 22-dimensional.
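The dimensions above can be checked by adding up the pieces of the observation list. This is a minimal sketch, not code from PettingZoo: `observation_layout` and its part names are hypothetical, and it only assumes the ordering listed above, with velocities observed for *other* non-adversary agents only.

```python
def observation_layout(num_good, num_adversaries, num_obstacles, is_adversary):
    """Return (name, start, stop) slices for one agent's observation vector.

    Hypothetical helper: it only encodes the ordering described in the notes.
    """
    num_others = num_good + num_adversaries - 1
    # A good agent does not observe its own velocity in the "other" block.
    num_other_good = num_good if is_adversary else num_good - 1
    parts = [
        ("self_vel", 2),
        ("self_pos", 2),
        ("landmark_rel_pos", 2 * num_obstacles),
        ("other_rel_pos", 2 * num_others),
        ("other_good_vel", 2 * num_other_good),
    ]
    layout, start = [], 0
    for name, size in parts:
        layout.append((name, start, start + size))
        start += size
    return layout


adv_layout = observation_layout(3, 3, 2, is_adversary=True)
good_layout = observation_layout(3, 3, 2, is_adversary=False)
print(adv_layout[-1][2], good_layout[-1][2])  # 24 22
```

The `stop` index of the last slice is the total observation size, so the same helper can be reused for other agent counts.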

The environment is sequential. Agents move one at a time. Agents are either `adversary_*` for adversary or `agent_*` for non-adversary.

Actions:

- 0 is NOP
- 1 is go left
- 2 is go right
- 3 is go down
- 4 is go up
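For hardcoded policies it can help to have the action list as a lookup table. A small sketch follows; `ACTION_DIRECTIONS` and `direction_to_action` are illustrative names, not part of PettingZoo, and the unit vectors only encode the direction of each action, not how the environment scales it into acceleration.

```python
# Unit direction implied by each discrete action id (illustrative only).
ACTION_DIRECTIONS = {
    0: (0.0, 0.0),   # NOP
    1: (-1.0, 0.0),  # left
    2: (1.0, 0.0),   # right
    3: (0.0, -1.0),  # down
    4: (0.0, 1.0),   # up
}


def direction_to_action(dx, dy):
    """Return the action id whose direction equals (dx, dy), else 0 (NOP)."""
    for action, direction in ACTION_DIRECTIONS.items():
        if direction == (float(dx), float(dy)):
            return action
    return 0


print(direction_to_action(0, 1))  # 4
```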

In [3]:
# Print variables of the environment
# Documentation:   https://www.pettingzoo.ml/api
env.reset()
print("State size", env.state_space.shape)
print("Name of current agent", env.agent_selection)
print("Observation space of current agent", env.observation_space(env.agent_selection).shape)
print("Action space of current agent", env.action_space(env.agent_selection))
print("Sample random action from current agent", env.action_space(env.agent_selection).sample())
print("The agent names:", *env.agents)
print()

# select an agent in the environment world, after using env.unwrapped
agent = env.world.agents[0]
print("agent's name is", agent.name)
print("agent's velocity and position coordinates", agent.state.p_vel, agent.state.p_pos)
print("is agent an adversary?", agent.adversary)

landmark = env.world.landmarks[0]
print("landmark's name is", landmark.name)
print("landmark's position coordinates (doesn't move)", landmark.state.p_pos)

State size (138,)
Name of current agent adversary_0
Observation space of current agent (24,)
Action space of current agent Discrete(5)
Sample random action from current agent 4
The agent names: adversary_0 adversary_1 adversary_2 agent_0 agent_1 agent_2

agent's name is adversary_0
agent's velocity and position coordinates [0. 0.] [-0.84429837  0.7186429 ]
is agent an adversary? True
landmark's name is landmark 0
landmark's position coordinates (doesn't move) [ 0.34800373 -0.42134618]


In [3]:
# Demo environment with random policy
env.reset()
random_demo(env, render=True, episodes=5)

Average total reward -3542.3399623641044


-17711.699811820523

In [8]:
eps = 0.3

def hardcode_policy_1(observation, agent_name):
    """
    Parameters
    ==========
    observation : ndarray
    agent_name : str
    """
    if "adversary" in agent_name:
        # adversary
        if agent_name == "adversary_0":
            # randomly move down (3) or up (4); stays inside Discrete(5)
            return np.random.binomial(1, 0.3) + 3
    elif "agent" in agent_name:
        # non-adversary
        if agent_name == "agent_0":
            pass
    return 0

def hardcode_policy_2(observation, agent_name):
    """
    Parameters
    ==========
    observation : ndarray
    agent_name : str
    """
    if "adversary" in agent_name:
        # adversary
        if agent_name == "adversary_0":
            # get agent_0's position relative to adversary_0
            # (indices 12:14 of the 24-dimensional adversary observation)
            x, y = observation[12:14]
            if x < -eps: # go left
                return 1
            elif x > eps: # go right
                return 2
            elif y < -eps: # go down
                return 3
            elif y > eps: # go up
                return 4
            else:
                return random.randint(0, 4)
            # return np.random.binomial(2, 0.3) + 3
    elif "agent" in agent_name:
        # non-adversary
        if agent_name == "agent_0":
            return 0
            # return random.randint(0, 4)
            # return np.random.binomial(2, 0.3) + 3
    return 0

env.reset()
agent_rewards = 0
adversary_rewards = 0
for agent_step_idx, agent_name in enumerate(env.agent_iter()):
    env.render()
    observation, reward, done, info = env.last()
    if done:
        env.step(None)
    else:
        action = hardcode_policy_1(observation, agent_name)
        env.step(action)
    if "adversary" in agent_name:
        adversary_rewards += reward
    if "agent" in agent_name:
        agent_rewards += reward
    # time.sleep(0.1)

print(f"episode ran for {agent_step_idx + 1} steps")
print("agent_rewards", agent_rewards)
print("adversary_rewards", adversary_rewards)

episode ran for 1806 steps
agent_rewards -543.1800593395708
adversary_rewards 60.0
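The chase logic in `hardcode_policy_2` only looks at `agent_0`. One possible generalization, sketched here under the observation layout described earlier, is to chase whichever good agent is closest. `good_agent_offset` and `chase_nearest` are hypothetical helpers, they apply to adversary observations only, and unlike `hardcode_policy_2` this version returns NOP instead of a random action when the target is within `eps` on both axes.

```python
import math


def good_agent_offset(good_idx, num_adversaries=3, num_obstacles=2):
    """Start index of a good agent's relative position in an adversary's observation."""
    # 4 = own velocity + own position; the other adversaries come before good agents.
    return 4 + 2 * num_obstacles + 2 * (num_adversaries - 1) + 2 * good_idx


def chase_nearest(observation, num_good=3, num_adversaries=3, num_obstacles=2, eps=0.3):
    """Return a discrete action moving toward the nearest good agent."""
    best = None
    for j in range(num_good):
        k = good_agent_offset(j, num_adversaries, num_obstacles)
        x, y = observation[k], observation[k + 1]
        dist = math.hypot(x, y)
        if best is None or dist < best[0]:
            best = (dist, x, y)
    _, x, y = best
    # Move along the axis with the larger offset.
    if abs(x) >= abs(y):
        if x < -eps:
            return 1  # left
        if x > eps:
            return 2  # right
    else:
        if y < -eps:
            return 3  # down
        if y > eps:
            return 4  # up
    return 0  # NOP when already within eps of the target


obs = [0.0] * 24
obs[12:14] = [1.0, 0.2]    # agent_0: to the right
obs[14:16] = [-0.1, -0.5]  # agent_1: closest, mostly below
obs[16:18] = [2.0, 2.0]    # agent_2: far away
print(chase_nearest(obs))  # 3
```

With the default arguments, `good_agent_offset(0)` is 12, which matches the `observation[12:14]` slice used in `hardcode_policy_2`.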


### How to train the agents?

- Use the differentiable inter-agent learning (DIAL) algorithm.
- Use parameter sharing for DIAL agents, with separate parameter sets for adversary agents and good agents.
- It's not entirely clear whether the authors accumulate gradients for differentiable communication, but it

Messages are vectors. A length of 4 or 5 should work.

Concatenate the messages from all the actors and add them to the message input for the current agent.
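A rough sketch of the concatenation step, using plain lists in place of tensors so it runs without torch. `build_message_inputs` is a hypothetical helper, and whether an agent should also receive its own message is an assumption left to the `include_self` flag (off by default here).

```python
def build_message_inputs(messages, agent_names, include_self=False):
    """Concatenate per-agent message vectors into one input per receiving agent.

    messages: dict mapping agent name -> list of floats (the message vector).
    Messages are concatenated in the fixed order given by agent_names.
    """
    inputs = {}
    for receiver in agent_names:
        incoming = []
        for sender in agent_names:
            if include_self or sender != receiver:
                incoming.extend(messages[sender])
        inputs[receiver] = incoming
    return inputs


names = ["adversary_0", "adversary_1", "adversary_2"]
msgs = {name: [float(i)] * 4 for i, name in enumerate(names)}
msg_inputs = build_message_inputs(msgs, names)
print(len(msg_inputs["adversary_0"]))  # 8
```

Keeping the sender order fixed means each position in the concatenated input always corresponds to the same sender, which matters once the input feeds a network.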

The names of agents are: 
adversary_0 adversary_1 adversary_2 agent_0 agent_1 agent_2
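Since the plan is to share one parameter set among adversaries and another among good agents, the agent name is enough to pick the set. A minimal sketch, with `parameter_group` and the group labels as hypothetical names:

```python
def parameter_group(agent_name):
    """Map an agent name to the label of its shared parameter set."""
    return "adversary" if agent_name.startswith("adversary") else "good"


agent_names = "adversary_0 adversary_1 adversary_2 agent_0 agent_1 agent_2".split()
groups = {}
for name in agent_names:
    groups.setdefault(parameter_group(name), []).append(name)
print(groups)
```

In a training loop, each group label would index one network (and one optimizer) shared by all agents in that group.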

## Scratch work

In [15]:
a = torch.tensor([1,3,2,0])
torch.argmax(a).item(), torch.max(a), a[2]

(1, tensor(3), tensor(2))

In [3]:
d = {1: 'a', 2: 'b', 3: 'c'}
for i in d:
    print(i , end=' ')

1 2 3 

In [21]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.tensor(2, device=device)
b = torch.tensor(3)
a*b

tensor(6, device='cuda:0')

In [9]:
v = torch.arange(6)
a = torch.tensor([9, 8])

idx = 4

torch.hstack((v[:idx], a, v[idx + 2:]))


tensor([0, 1, 2, 3, 9, 8])

In [14]:
w = torch.tensor([0,1,2])
w.device
w.to(device)  # not in-place: the result is discarded, so w stays on the CPU
w.device

device(type='cpu')