# Tutorial 08: Multiagent Environments

This tutorial covers the implementation and execution of multiagent environments in Flow. It assumes some knowledge of, or experience with, writing custom environments and running experiments with RLlib; for more on these topics see `tutorial07_environments.ipynb` and `tutorial03_rllib.ipynb`, respectively. The rest of the tutorial is organized as follows. Section 1 describes the procedure through which custom environments can be augmented to generate multiagent environments. Section 2 then walks you through an example of running a multiagent environment in RLlib.

## 1. Creating a Multiagent Environment Class

In this section we set up a multiagent environment in which the agents use either a shared or a non-shared policy. We begin by importing the abstract multiagent environment class.

In [2]:
# import the base Multi-agent environment 

from flow.multiagent_envs.multiagent_env import MultiAgentEnv


### 1.1 Shared policies
In the multi-agent environment with a shared policy, different agents will use the same policy. 

We define the environment class and inherit from the multiagent version of the base environment.

In [None]:
class SharedMAEnv(MultiAgentEnv):
    pass

`Env` provides the interface for running and modifying a SUMO simulation. Using this class, we are able to start SUMO, provide a scenario that specifies a configuration and controllers, perform simulation steps, and reset the simulation to an initial configuration.

When compared to the single-agent environment, the multiagent environment changes the following methods:

* **apply_rl_actions**
* **get_state**
* **compute_reward**

Each of these components is covered in the next few subsections.

In the multiagent environment, each of these methods works with dictionaries keyed by agent ID, with the corresponding quantity (acceleration, observation, reward, etc.) as the value for each ID.

In [None]:
class SharedMAEnv(MultiAgentEnv): # update the environment class
    def _apply_rl_actions(self, rl_actions):
        """Split the accelerations by ring"""
        if rl_actions:
            rl_ids = list(rl_actions.keys())
            accel = list(rl_actions.values())
            self.k.vehicle.apply_acceleration(rl_ids, accel)
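
The dictionary convention can be tried out without a simulator. The following is a minimal sketch, not part of Flow: the agent IDs `rl_0` and `rl_1`, their values, and the helper `split_actions` are hypothetical. It shows how a per-agent action dictionary splits into parallel lists of IDs and accelerations, relying on Python dictionaries preserving insertion order so that the two lists stay aligned.

```python
# Hypothetical per-agent actions, keyed by agent (vehicle) ID.
rl_actions = {"rl_0": 0.5, "rl_1": -1.0}


def split_actions(rl_actions):
    """Return parallel lists of agent IDs and their accelerations.

    An empty or missing action dictionary yields two empty lists,
    mirroring the `if rl_actions:` guard in the method above.
    """
    if not rl_actions:
        return [], []
    return list(rl_actions.keys()), list(rl_actions.values())


rl_ids, accel = split_actions(rl_actions)
```

Here `rl_ids[i]` and `accel[i]` always refer to the same agent, which is what a call taking a list of IDs and a list of accelerations would need.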

The `get_state` and `compute_reward` methods follow the same dictionary structure: they store the observation and reward, respectively, as the value for each corresponding `rl_id`.

In [None]:
import numpy as np


class SharedMAEnv(MultiAgentEnv): # update the environment class

    def get_state(self):
        """See class definition."""
        obs = {}
        for rl_id in self.k.vehicle.get_rl_ids():
            # fall back to the vehicle itself if it has no leader
            lead_id = self.k.vehicle.get_leader(rl_id) or rl_id

            # normalizers
            max_speed = 15.
            max_length = self.env_params.additional_params['ring_length'][1]

            observation = np.array([
                self.k.vehicle.get_speed(rl_id) / max_speed,
                (self.k.vehicle.get_speed(lead_id) -
                 self.k.vehicle.get_speed(rl_id))
                / max_speed,
                self.k.vehicle.get_headway(rl_id) / max_length
            ])
            obs.update({rl_id: observation})

        return obs

    def compute_reward(self, rl_actions, **kwargs):
        """See class definition."""
        # in the warmup steps
        if rl_actions is None:
            return {}

        rew = {}
        for rl_id in rl_actions.keys():
            edge_id = rl_id.split('_')[1]
            edges = self.gen_edges(edge_id)
            vehs_on_edge = self.k.vehicle.get_ids_by_edge(edges)
            vel = np.array([
                self.k.vehicle.get_speed(veh_id)
                for veh_id in vehs_on_edge
            ])
            if any(vel < -100) or kwargs['fail']:
                return 0.

            target_vel = self.env_params.additional_params['target_velocity']
            max_cost = np.array([target_vel] * len(vehs_on_edge))
            max_cost = np.linalg.norm(max_cost)

            cost = vel - target_vel
            cost = np.linalg.norm(cost)

            rew[rl_id] = max(max_cost - cost, 0) / max_cost
        return rew
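
To make the reward arithmetic above concrete, here is a standalone sketch of the same normalization, `max(max_cost - cost, 0) / max_cost`, applied to made-up speeds. The function name `speed_reward`, the agent IDs, and the handling of an empty edge are assumptions for illustration only; they are not part of Flow.

```python
import numpy as np


def speed_reward(speeds, target_vel):
    """Normalized reward in [0, 1]; equals 1 when all speeds match target_vel."""
    vel = np.asarray(speeds, dtype=float)
    if vel.size == 0:
        # Assumption: an edge with no vehicles contributes no reward.
        return 0.0
    # Largest possible deviation cost: every vehicle at rest.
    max_cost = np.linalg.norm(np.full(vel.shape, target_vel))
    # Actual deviation of the observed speeds from the target.
    cost = np.linalg.norm(vel - target_vel)
    return max(max_cost - cost, 0) / max_cost


# Hypothetical agents: one whose vehicles all drive at the target speed,
# and one whose vehicles are all stopped.
rewards_by_agent = {
    "rl_0": speed_reward([20.0, 20.0], target_vel=20.0),
    "rl_1": speed_reward([0.0, 0.0], target_vel=20.0),
}
```

The first agent receives the maximum reward and the second receives none. Note that the sketch, like the method above, divides by `max_cost` and therefore assumes a nonzero target velocity.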

### 1.2 Non-shared policies (FIXME)

A non-shared environment is one in which the agents use different policies. In the following exercise the agents use the 'av' and the 'adversary' policies.
To create an environment for the multiagent non-shared policy, changes are to be made in the following methods:

* **apply_rl_actions**
* **get_state**
* **compute_reward**

The action_space and observation_space are the same as for the single-agent environment and can be exported from there. 

To make `apply_rl_actions` work in the multiagent environment, we define `rl_action` as a combination of each policy's action, with the adversary's action scaled by `perturb_weight`.
In the `get_state` method, we define the state for each policy. The adversary state and the agent state are identical.
In `compute_reward`, the agents receive opposing speed rewards: the agent receives the class definition reward and the adversary receives the negative of the agent reward.
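
The two ideas in this description, a weighted combination of actions and opposing rewards, can be sketched in isolation. The helper names `combine_actions` and `zero_sum_rewards` and all numeric values below are hypothetical, chosen only to illustrate the arithmetic.

```python
import numpy as np


def combine_actions(rl_actions, perturb_weight):
    """Add the adversary's action, scaled by perturb_weight, to the AV's action."""
    av_action = np.asarray(rl_actions["av"], dtype=float)
    adv_action = np.asarray(rl_actions["adversary"], dtype=float)
    return av_action + perturb_weight * adv_action


def zero_sum_rewards(agent_reward):
    """Give the adversary the negative of the agent's reward."""
    return {"av": agent_reward, "adversary": -agent_reward}


# Made-up actions for two RL vehicles and a small perturbation weight.
applied = combine_actions(
    {"av": [1.0, -0.5], "adversary": [1.0, 1.0]},
    perturb_weight=0.03,
)
reward_dict = zero_sum_rewards(0.8)
```

With a small `perturb_weight`, the adversary can only nudge the accelerations chosen by the 'av' policy, and the two rewards always sum to zero.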

In [3]:
import numpy as np
from flow.core import rewards


class NonSharedMAEnv(MultiAgentEnv):
    def _apply_rl_actions(self, rl_actions):
        # the names of all autonomous (RL) vehicles in the network
        sorted_rl_ids = [
            veh_id for veh_id in self.sorted_ids
            if veh_id in self.k.vehicle.get_rl_ids()
        ]
        # combine the actions of the two agents into a single rl_action
        av_action = rl_actions['av']
        adv_action = rl_actions['adversary']
        perturb_weight = self.env_params.additional_params['perturb_weight']
        rl_action = av_action + perturb_weight * adv_action
        # apply the combined action as accelerations to the rl vehicles
        self.k.vehicle.apply_acceleration(sorted_rl_ids, rl_action)

    def get_state(self, **kwargs):
        state = np.array([[
            self.k.vehicle.get_speed(veh_id) / self.k.scenario.max_speed(),
            self.k.vehicle.get_x_by_id(veh_id) / self.k.scenario.length()
        ] for veh_id in self.sorted_ids])
        state = np.ndarray.flatten(state)
        return {'av': state, 'adversary': state}

    def compute_reward(self, rl_actions, **kwargs):
        if self.env_params.evaluate:
            reward = np.mean(self.k.vehicle.get_speed(
                self.k.vehicle.get_ids()))
            return {'av': reward, 'adversary': -reward}
        else:
            reward = rewards.desired_velocity(self, fail=kwargs['fail'])
            return {'av': reward, 'adversary': -reward}


## 2. Running Multiagent Environment in RLlib

When running a scenario that uses a multiagent environment,
we specify certain parameters in `flow_params`. `flow_params` is the dictionary passed to the
`create_env` function, which defines the action and observation space for each agent during training.



In [None]:
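# parameter classes used below; HORIZON is assumed to be defined beforehand
from flow.core.params import SumoParams, EnvParams

flow_params = dict(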
    # name of the experiment
    exp_tag='multiagent_figure_eight',

    # name of the flow environment the experiment is running on
    env_name='MultiAgentAccelEnv',

    # name of the scenario class the experiment is running on
    scenario='Figure8Scenario',

    # simulator that is used by the experiment
    simulator='traci',

    # sumo-related parameters (see flow.core.params.SumoParams)
    sim=SumoParams(
        sim_step=0.1,
        render=False,
    ),

    # environment related parameters (see flow.core.params.EnvParams)
    env=EnvParams(
        horizon=HORIZON,
        additional_params={
            'target_velocity': 20,
            'max_accel': 3,
            'max_decel': 3,
            'perturb_weight': 0.03,
            'sort_vehicles': False
        },
    ))


### 2.1 Shared policies

When we run the shared policy, every agent is mapped to the same policy. In the example below, all agents
use the 'av' policy.

In [None]:
    ######################################################################
    # Start of new code
    ######################################################################
    def gen_policy():
        return (PPOPolicyGraph, obs_space, act_space, {})

    # Set up a single policy graph that is shared by all agents
    policy_graphs = {'av': gen_policy()}

    def policy_mapping_fn(_):
        return 'av'

    config.update({
        'multiagent': {
            'policy_graphs': policy_graphs,
            'policy_mapping_fn': tune.function(policy_mapping_fn),
            'policies_to_train': ['av']
        }
    })

    return alg_run, env_name, config
    ######################################################################
    # End of new code
    ######################################################################

### 2.2 Non-shared policies (FIXME)

When we run the non-shared policy, each agent is mapped to its own policy.

In [None]:


    ######################################################################
    # Start of new code
    ######################################################################
    
    def gen_policy():
        return (PPOPolicyGraph, obs_space, act_space, {})

    # Set up one policy graph per agent
    policy_graphs = {'av': gen_policy(), 'adversary': gen_policy()}

    def policy_mapping_fn(agent_id):
        return agent_id

    config.update({
        'multiagent': {
            'policy_graphs': policy_graphs,
            'policy_mapping_fn': tune.function(policy_mapping_fn)
        }
    })

    ######################################################################
    # End of new code
    ######################################################################

    return alg_run, env_name, config
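
The difference between the two configurations comes down to the mapping function. The sketch below is independent of RLlib and compares the two mappings on made-up agent IDs (`rl_0` and `rl_1` are hypothetical; 'av' and 'adversary' are the agent names used in the non-shared environment above).

```python
def shared_policy_mapping_fn(_agent_id):
    """Shared case: every agent, whatever its ID, uses the 'av' policy."""
    return "av"


def non_shared_policy_mapping_fn(agent_id):
    """Non-shared case: each agent uses the policy named after its own ID."""
    return agent_id


# Hypothetical agent IDs for the shared case.
shared_mapping = {
    agent_id: shared_policy_mapping_fn(agent_id)
    for agent_id in ["rl_0", "rl_1"]
}

# Agent IDs returned by the non-shared environment.
non_shared_mapping = {
    agent_id: non_shared_policy_mapping_fn(agent_id)
    for agent_id in ["av", "adversary"]
}
```

In the non-shared case the identity mapping only works because the agent IDs coincide with the keys of `policy_graphs`; with other agent IDs the function would need to translate each ID to a policy name explicitly.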