
# VMAS Soccer Environment Overview

In this notebook we use the **VMAS Soccer** scenario as a simple example to demonstrate how to train a multi-agent RL algorithm in a continuous-control environment.

## Basic Environment Description

- We use the **Soccer** scenario from the VMAS library (the scenario name passed to VMAS is `football`).
- By default there are **3 agents on the blue team**.
- These three blue agents cooperate to score goals against a **red team**.
- The red team is, by default, controlled by a **built-in scripted AI opponent** (not by our learning algorithm).
- The environment is wrapped with TorchRL and returns data as a **TensorDict**, with keys grouped by team.

This setup is useful to:
- Show how to plug a multi-agent algorithm into VMAS + TorchRL.
- Let students focus first on **training only one team (blue)** against a fixed AI opponent.
- Introduce the idea of **centralised critic / decentralised actors** in a simple, visual domain.

---

## `ai_red_agents` Flag and `agent_red` Key

The Soccer scenario exposes a configuration flag, passed as a keyword argument when creating the environment:

```python
ai_red_agents = True  # or False
```

### Case 1 – `ai_red_agents = True` (default)

- The **red team is controlled by a hand-crafted AI** inside the environment.
- Our learning algorithm controls only the **blue agents**.
- The TensorDict then only contains the **blue group**:
  - `agent_blue/observation`
  - `agent_blue/action`
  - `agent_blue/reward`
  - `agent_blue/terminated`, `agent_blue/truncated`, etc.
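
The `group/key` paths above are shorthand. In a TensorDict, nested entries are addressed with tuple keys such as `("agent_blue", "reward")`, which is the form the `RewardSum` transform in this notebook uses. The following is a minimal sketch of that layout, using plain Python dictionaries as a stand-in for a TensorDict. The helper `flatten_keys` and all shapes (6 environments, 3 agents, 8 observation features, 2 action dimensions) are illustrative assumptions, not values read from VMAS.

```python
# Plain-dict stand-in for the nested layout of one batch of environment data.
# All shapes are illustrative assumptions: (num_envs, n_agents, feature_dim).
num_envs, n_agents, obs_dim, act_dim = 6, 3, 8, 2

step_data = {
    "agent_blue": {
        "observation": (num_envs, n_agents, obs_dim),
        "action": (num_envs, n_agents, act_dim),
        "reward": (num_envs, n_agents, 1),
    },
}


def flatten_keys(nested, prefix=()):
    """Yield (tuple_key, value) pairs for every leaf of a nested dict."""
    for name, value in nested.items():
        key = prefix + (name,)
        if isinstance(value, dict):
            yield from flatten_keys(value, key)
        else:
            yield key, value


for key, shape in flatten_keys(step_data):
    # "/".join(key) reproduces the shorthand used in the text.
    print("/".join(key), "<->", key, "shape:", shape)
```

On the real environment, the same tuple keys would be used to index the TensorDict returned by a reset or rollout.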


---

### Case 2 – `ai_red_agents = False` (both teams learnable)

If we set:

```python
ai_red_agents = False
```

then the environment **no longer controls the red team with a built-in AI**.  
Instead, the red team becomes a second *learnable* group of agents.

In this case, the environment's TensorDict contains **two group keys**:

- `agent_blue` – the original blue team (3 agents).
- `agent_red` – the red team, now exposed as a second agent group.

That means we now have keys like:

- `agent_red/observation`
- `agent_red/action`
- `agent_red/reward`
- `agent_red/terminated`, `agent_red/truncated`, etc.

This is exactly where a **second algorithm instance** can be plugged in if we want the teams to **play against each other with learned policies**.
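
The `make_env` function in this notebook accumulates episode rewards only for `agent_blue`. When both teams learn, the red group would presumably need the same bookkeeping. A minimal sketch, assuming a hypothetical helper `reward_sum_keys` (not part of VMAS or TorchRL) that builds the key lists for every learnable group:

```python
def reward_sum_keys(ai_red_agents=True):
    """Return (in_keys, out_keys) covering every learnable group.

    Hypothetical helper: the group names follow the keys described above.
    """
    groups = ["agent_blue"]
    if not ai_red_agents:
        # The red team is only exposed as a group when it is learnable.
        groups.append("agent_red")
    in_keys = [(group, "reward") for group in groups]
    out_keys = [(group, "episode_reward") for group in groups]
    return in_keys, out_keys


in_keys, out_keys = reward_sum_keys(ai_red_agents=False)
print(in_keys)
print(out_keys)
```

The resulting lists could then be passed as `RewardSum(in_keys=in_keys, out_keys=out_keys)`, mirroring the single-group call used in `make_env`.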

---

## Using Two Algorithms (Blue vs Red)

A typical two-sided training setup could look like this:

- One algorithm instance (e.g., `MADDPG`) is configured for the **blue team**, using the group key `agent_blue`.
- A second algorithm instance (possibly the same class, possibly different) is configured for the **red team**, using the group key `agent_red`.
- During rollout:
  - Both policies are called: one to compute `agent_blue/action`, one to compute `agent_red/action`.
  - The environment uses both actions to step the dynamics.
- During training:
  - The replay buffer stores both groups in the same TensorDict.
  - Each algorithm reads only the group it is responsible for:
    - Blue algorithm reads from the `agent_blue` keys.
    - Red algorithm reads from the `agent_red` keys.
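
The data flow described above can be sketched in plain Python. This is only an illustration of the routing, not the `MAPPO` or `MADDPG` implementation: `RandomTeamPolicy`, `fake_observation`, the team size, and the feature dimensions are all hypothetical, nested dictionaries stand in for a TensorDict, and the environment step is omitted.

```python
import random

N_AGENTS = 3  # assumed team size
ACT_DIM = 2   # assumed action dimensionality
OBS_DIM = 4   # assumed observation size


class RandomTeamPolicy:
    """Hypothetical stand-in for one algorithm instance bound to one group."""

    def __init__(self, group, seed):
        self.group = group
        self.rng = random.Random(seed)

    def act(self, data):
        # Read only this group's observations, write only this group's actions.
        observations = data[self.group]["observation"]
        data[self.group]["action"] = [
            [self.rng.uniform(-1.0, 1.0) for _ in range(ACT_DIM)]
            for _ in observations
        ]
        return data

    def sample_for_training(self, buffer):
        # Each algorithm reads only its own group from the shared buffer.
        return [step[self.group] for step in buffer]


def fake_observation():
    """Build dummy observations for both groups (placeholder for env output)."""
    return {
        group: {"observation": [[0.0] * OBS_DIM for _ in range(N_AGENTS)]}
        for group in ("agent_blue", "agent_red")
    }


blue = RandomTeamPolicy("agent_blue", seed=0)
red = RandomTeamPolicy("agent_red", seed=1)

replay_buffer = []
for _ in range(5):
    data = fake_observation()
    data = blue.act(data)  # fills agent_blue/action
    data = red.act(data)   # fills agent_red/action
    replay_buffer.append(data)  # both groups are stored together

blue_batch = blue.sample_for_training(replay_buffer)
red_batch = red.sample_for_training(replay_buffer)
print(len(blue_batch), len(red_batch))
```

Keeping one shared buffer and filtering by group key means neither algorithm needs to know that the other exists.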



In [None]:
import torch
from matplotlib import pyplot as plt
from tqdm import tqdm

# Env
from torchrl.envs import RewardSum, TransformedEnv
from torchrl.envs.libs.vmas import VmasEnv

# Algorithms
from maddpg import MADDPG
from mappo import MAPPO

# Utils
torch.manual_seed(0)  # Fix the seed for reproducibility

In [None]:
# Devices

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vmas_device = device  # The device where the simulator is run (VMAS can run on GPU)

# Sampling
frames_per_batch = 6_000  # Number of team frames collected per training iteration
n_iters = 10  # Number of sampling and training iterations
total_frames = frames_per_batch * n_iters

# Training
num_epochs = 30  # Number of optimization steps per training iteration
minibatch_size = 400  # Size of the mini-batches in each optimization step
lr = 3e-4  # Learning rate
max_grad_norm = 1.0  # Maximum norm for the gradients

# PPO
clip_epsilon = 0.2  # clip value for PPO loss
gamma = 0.99  # discount factor
lmbda = 0.9  # lambda for generalised advantage estimation
entropy_eps = 1e-4  # coefficient of the entropy term in the PPO loss

In [None]:
max_steps = 1000  # Episode steps before done
num_vmas_envs = (
    frames_per_batch // max_steps
)  # Number of vectorized envs. frames_per_batch should be divisible by this number
scenario_name = "football"


In [None]:
def make_env(num_envs=10, device=vmas_device, max_steps=max_steps,
             n_agents=3, ai_red_agents=True):
    """Create the VMAS football environment wrapped for TorchRL.

    n_agents: Number of blue-team agents (forwarded as VMAS's `n_blue_agents`,
        which is the keyword the football scenario reads; its default is 3).
    ai_red_agents: If True, the red team is controlled by built-in AI agents.
        If False, both teams are controlled by learning agents, and the red
        team's specs and keys appear under the `agent_red` group.
    """
    base_env = VmasEnv(
        scenario=scenario_name,
        num_envs=num_envs,
        continuous_actions=True,
        max_steps=max_steps,
        device=device,
        n_blue_agents=n_agents,
        ai_red_agents=ai_red_agents,
    )
    # Accumulate the blue team's per-step rewards into an episode return.
    return TransformedEnv(
        base_env,
        RewardSum(
            in_keys=[("agent_blue", "reward")],
            out_keys=[("agent_blue", "episode_reward")],
        ),
    )

In [None]:
env = make_env(num_envs=num_vmas_envs)

In [None]:
algorithm = MAPPO(
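    learning_rate=3e-4,
    gamma=0.99,
    lmbda=0.95,  # NOTE: differs from `lmbda = 0.9` in the hyperparameter cell
    clip_epsilon=0.2,
    entropy_eps=0.001,  # NOTE: differs from `entropy_eps = 1e-4` in the hyperparameter cell
    max_grad_norm=1.0,
    total_frames=total_frames,
    frames_per_batch=frames_per_batch,
    minibatch_size=512,  # NOTE: differs from `minibatch_size = 400` in the hyperparameter cell
    num_epochs=30,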
    independent_critic=False,
    share_parameters=False,
    continuous_actions=True,  # make_env builds the environment with continuous_actions=True
    base_state_value_key="state_value",
    team_reward_aggregation_fn="mean",
    base_actor_logits_keys=["loc", "scale"],  # continuous policy outputs; the discrete alternative is "logits"

    do_absolute_evaluation=True,
    evaluation_interval=100_000,
    evaluation_episodes=10,
    evaluation_max_steps=300,
    n_agents=3,
    group="agent_blue",
    root_folder="./experiments/mappo_student_soccer",
)

In [None]:
algorithm.setup(make_env_fn=make_env)

In [None]:
results = algorithm.train()
print(results)

In [None]:
algorithm = MADDPG(
    learning_rate=3e-4,
    gamma=0.99,
    max_grad_norm=1.0,
    polyak_tau=0.995,
    memory_size=1_000_000,
    # Exploration
    sigma_init=0.9,
    sigma_end=0.1,
    annealing_steps_ratio=0.5,
    # Loss
    discrepancy_loss="l2",
    use_critic_target_network=True,
    use_policy_target_network=False,
    # Training loop
    total_frames=60_000,
    frames_per_batch=6_000,
    minibatch_size=400,
    num_epochs=50,
    # Evaluation
    do_absolute_evaluation=True,
    evaluation_interval=100_000,
    evaluation_episodes=10,
    evaluation_max_steps=300,
    n_agents=3,
    group="agent_blue",
    root_folder="./experiments/maddpg_student_soccer",
)

In [None]:
algorithm.setup(make_env_fn=make_env)

In [None]:
results = algorithm.train()
print(results)