# Collaboration and Competition

---

In this notebook, we implement a Multi-Agent Reinforcement Learning solution to the Unity ML-Agents Tennis environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [52]:
from sys import platform
import os.path
import time
import random
from itertools import count
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from typing import List
from unityagents import UnityEnvironment

In [None]:
!nvidia-smi

Next, we start the environment. **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```
The code below is specific to the servers used to train the agents in this notebook. Edit the paths so that they match your local environment.

In [11]:
if platform.startswith('darwin'):
    env = UnityEnvironment(file_name="./env/Tennis.app")
elif platform.startswith('linux'):
    env = UnityEnvironment(file_name="./env/Tennis_Linux_NoVis/Tennis.x86_64", no_graphics=True)
else:
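    # NotImplemented is a comparison sentinel, not an exception; raise the error class instead.
    raise NotImplementedError("Unsupported platform: {}".format(platform))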

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [12]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

Each observation consists of 8 variables corresponding to the position and velocity of the ball and racket. The environment stacks 3 consecutive observations, so each agent receives a state vector of length 24. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.
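
The startup log reports 8 observation variables per agent and 3 stacked vector observations, which is why the state length printed further down is 24 rather than 8. Below is a minimal sketch of splitting one state into its frames. It assumes the most recent frame is stored last, which is consistent with the reset state printed by the next cell (only the last 8 entries are non-zero). `split_frames` is a hypothetical helper, not part of the ML-Agents API.

```python
OBS_SIZE = 8      # variables per frame, from the environment's startup log
NUM_STACKED = 3   # stacked vector observations, from the same log

def split_frames(state, obs_size=OBS_SIZE, num_stacked=NUM_STACKED):
    """Split a flat stacked state into a list of per-frame observations."""
    if len(state) != obs_size * num_stacked:
        raise ValueError("unexpected state length: {}".format(len(state)))
    return [list(state[i * obs_size:(i + 1) * obs_size]) for i in range(num_stacked)]

# First agent's state right after a reset, copied from the notebook output.
state = [0.0] * 16 + [-6.65278625, -1.5, -0.0, 0.0, 6.83172083, 6.0, -0.0, 0.0]
frames = split_frames(state)
print(len(frames), len(frames[0]))  # 3 8
print(frames[-1])                   # the only frame with non-zero entries
```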

Run the code cell below to print some information about the environment.

In [13]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]


In [42]:
states.size

48

### 3. Take Random Actions in the Environment

In the next code cell, we demonstrate how to use the Python API to control the agents and receive feedback from the environment.

Once this cell is executed, you can watch how the agents perform when they select actions at random at each time step. If you are using a non-headless environment, a window should pop up that allows you to observe the agents.

In [8]:
# for i in range(1, 6):                                      # play game for 5 episodes
#     env_info = env.reset(train_mode=False)[brain_name]     # reset the environment
#     states = env_info.vector_observations                  # get the current state (for each agent)
#     scores = np.zeros(num_agents)                          # initialize the score (for each agent)
#     while True:
#         actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#         actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#         env_info = env.step(actions)[brain_name]           # send all actions to the environment
#         next_states = env_info.vector_observations         # get next state (for each agent)
#         rewards = env_info.rewards                         # get reward (for each agent)
#         dones = env_info.local_done                        # see if episode finished
#         scores += env_info.rewards                         # update the score (for each agent)
#         states = next_states                               # roll over states to next time step
#         if np.any(dones):                                  # exit loop if episode finished
#             break
#     print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

Score (max over agents) from episode 1: 0.0
Score (max over agents) from episode 2: 0.09000000171363354
Score (max over agents) from episode 3: 0.0
Score (max over agents) from episode 4: 0.0
Score (max over agents) from episode 5: 0.09000000171363354
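
The score printed for each episode is the larger of the two agents' undiscounted reward sums. The sketch below uses an invented reward sequence that follows the reward rules from section 2; it is one sequence that would give the 0.09 seen above, not a record of what actually happened. `episode_score` is a hypothetical helper.

```python
def episode_score(rewards_per_step):
    """Sum each agent's rewards over the episode and return the maximum."""
    num_agents = len(rewards_per_step[0])
    totals = [0.0] * num_agents
    for step_rewards in rewards_per_step:
        for agent_idx, reward in enumerate(step_rewards):
            totals[agent_idx] += reward
    return max(totals)

# Agent 0 returns the ball once (+0.1) and later misses it (-0.01);
# agent 1 never hits the ball and receives no reward.
rewards_per_step = [
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.0],
    [-0.01, 0.0],
]
print(round(episode_score(rewards_per_step), 2))  # 0.09
```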


### 4. Implementation

Now we train our own agents to solve the environment. During training, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [None]:
class MultiAgentReplayBuffer(object):
    def __init__(self, max_buffer_size, num_agents, state_shape, action_size):
        """
        Initialize a ReplayBuffer object.
        :param max_buffer_size: maximum size of the buffer
        :param num_agents: number of agents
        :param state_shape: shape of state with first dimension being the number of agents
        :param action_size: size of each action
        """
        self.max_buffer_size = max_buffer_size
        self.num_agents = num_agents
        self.state_shape = state_shape
        self.action_size = action_size
        self.states = np.full((self.max_buffer_size, *self.state_shape), np.nan)
        self.actions = np.full((self.max_buffer_size, self.num_agents, self.action_size), np.nan)
        self.rewards = np.full((self.max_buffer_size, self.num_agents, 1), np.nan)
        self.next_states = np.full((self.max_buffer_size, *self.state_shape), np.nan)
        self.dones = np.full((self.max_buffer_size, self.num_agents, 1), np.nan)
        self.buffer_size = 0
        self.idx = 0

    def __len__(self):
        """
        Return the current size of the buffer.
        :return: current size of the buffer
        """
        return self.buffer_size

    def upsert(self, states, actions, rewards, next_states, dones):
        """
        Insert samples into the buffer and increment the index.
        :param states: np.array of shape (num_agents, state_size)
        :param actions: np.array of shape (num_agents, action_size)
        :param rewards: np.array of shape (num_agents, 1)
        :param next_states: np.array of shape (num_agents, state_size)
        :param dones: np.array of shape (num_agents, 1)
        """
        self.states[self.idx] = states
        self.actions[self.idx] = actions
        self.rewards[self.idx] = rewards
        self.next_states[self.idx] = next_states
        self.dones[self.idx] = dones
        self.buffer_size = min(self.buffer_size + 1, self.max_buffer_size)
        self.idx = (self.idx + 1) % self.max_buffer_size

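    def sample(self, batch_size):
        """
        Sample a batch of transitions uniformly at random (with replacement).
        :param batch_size: number of transitions to sample
        :return: dict of np.arrays; the '*_flat' entries concatenate all agents into one vector per sample
        """
        if batch_size > self.buffer_size:
            raise ValueError("Not enough samples in buffer")
        # Draw from every filled slot; self.idx wraps back to 0 once the buffer is full.
        idx = np.random.randint(0, self.buffer_size, batch_size)
        return {
            'states': self.states[idx],
            # reshape(batch_size, -1) flattens (num_agents, size) into one row per sample
            'states_flat': self.states[idx].reshape(batch_size, -1),
            'actions': self.actions[idx],
            'actions_flat': self.actions[idx].reshape(batch_size, -1),
            'rewards': self.rewards[idx],
            'next_states': self.next_states[idx],
            'next_states_flat': self.next_states[idx].reshape(batch_size, -1),
            'dones': self.dones[idx]
        }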
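
The buffer above is a fixed-size ring: `idx` is the next write position and wraps around through the modulo, while `buffer_size` saturates at the capacity. Here is a stripped-down sketch of just that bookkeeping, storing plain Python values instead of arrays. `TinyRing` is a hypothetical class used only for illustration.

```python
class TinyRing:
    """Fixed-capacity ring buffer that overwrites its oldest entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = [None] * capacity
        self.size = 0   # number of valid entries, at most capacity
        self.idx = 0    # position of the next write

    def add(self, item):
        self.items[self.idx] = item
        self.size = min(self.size + 1, self.capacity)
        self.idx = (self.idx + 1) % self.capacity

ring = TinyRing(capacity=3)
for item in ["a", "b", "c", "d", "e"]:
    ring.add(item)

print(ring.items)           # ['d', 'e', 'c']: 'a' and 'b' were overwritten
print(ring.size, ring.idx)  # 3 2
```

After wrap-around, the write position (2 here) is smaller than the number of valid entries (3), so uniform sampling should draw indices from `range(size)` rather than `range(idx)`.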

In [None]:
class MATD3TwinCriticNetwork(nn.Module):
    def __init__(self, all_agents_state_size: int, all_agents_action_size: int, hidden_dims: List[int]=(256, 256), activation_fn = F.relu, device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
        """
        Centralized action-value function approximator that takes as input the actions of all agents and the states of all agents, and outputs the Q-value for each agent.
        :param all_agents_state_size: Dimension of the states of all agents concatenated.
        :param all_agents_action_size: Dimension of the actions of all agents concatenated.
        :param hidden_dims: List of hidden dimensions for the fully connected layers.
        :param activation_fn: Activation function for the fully connected layers.
        :param device: Device on which to run the neural network.
        """
        super(MATD3TwinCriticNetwork, self).__init__()
        self.device = device
        self.all_agents_state_size = all_agents_state_size
        self.all_agents_action_size = all_agents_action_size
        self.activation_fn = activation_fn

        input_dim = self.all_agents_state_size + self.all_agents_action_size
        output_dim = 1

        # Two independent critics (a and b). Only Linear layers are registered in the
        # ModuleLists: activation_fn (e.g. F.relu) is a plain function, not an nn.Module.
        # There is no output activation; a tanh here would cap Q-values to (-1, 1).
        dims = [input_dim, *hidden_dims, output_dim]
        self.layers_a = nn.ModuleList([nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])])
        self.layers_b = nn.ModuleList([nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])])

        self.to(device)

    def _cat(self, states, actions):
        """Convert the inputs to float32 tensors on the right device and concatenate them."""
        states = torch.as_tensor(states, dtype=torch.float32, device=self.device)
        actions = torch.as_tensor(actions, dtype=torch.float32, device=self.device)
        return torch.cat([states, actions], dim=1)

    def _run(self, layers, x):
        """Feed x through one critic: activation after each hidden layer, linear output."""
        for layer in layers[:-1]:
            x = self.activation_fn(layer(x))
        return layers[-1](x)

    def forward(self, states, actions):
        """
        Takes as input the states and actions of all agents, and outputs the Q-values of both critics.
        :param states: States of all agents concatenated.
        :param actions: Actions of all agents concatenated.
        """
        x = self._cat(states, actions)
        return self._run(self.layers_a, x), self._run(self.layers_b, x)

    def Qa(self, states, actions):
        """
        Takes as input the states and actions of all agents, and outputs the Q-value of critic a only.
        :param states: States of all agents concatenated.
        :param actions: Actions of all agents concatenated.
        """
        return self._run(self.layers_a, self._cat(states, actions))

    def load(self, states, actions, rewards, next_states, dones):
        """Placeholder: no implementation is given for this method."""
        raise NotImplementedError
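
The critic is built as a twin so that the bootstrap target can use the smaller of two Q estimates, the clipped double-Q idea from TD3. Below is a scalar sketch of that target for a single transition, with made-up numbers. `td3_target` is a hypothetical helper, and the default `gamma` is an assumption rather than a value taken from this notebook's config.

```python
def td3_target(reward, done, q_a_next, q_b_next, gamma=0.99):
    """Return r + gamma * (1 - done) * min(Qa', Qb')."""
    return reward + gamma * (1.0 - float(done)) * min(q_a_next, q_b_next)

# Non-terminal transition: bootstrap from the more pessimistic critic (0.4).
print(td3_target(reward=0.1, done=False, q_a_next=0.5, q_b_next=0.4))

# Terminal transition: the bootstrap term vanishes and the target is the reward.
print(td3_target(reward=-0.01, done=True, q_a_next=0.5, q_b_next=0.4))  # -0.01
```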

In [None]:
class MATD3ActorNetwork(nn.Module):
    def __init__(self, state_size: int, action_size: int, hidden_dims=(256,256), activation_fn = F.relu, device=torch.device("cuda" if torch.cuda.is_available() else "cpu")):
        """
        Policy that takes as input the state of an agent and outputs the actions for that agent.
        :param state_size: Dimension of the state of a single agent.
        :param action_size: Dimension of the actions of a single agent.
        :param hidden_dims: List of hidden dimensions for the fully connected layers.
        :param activation_fn: Activation function for the fully connected layers.
        :param device: Device on which to run the neural network.
        """
        super(MATD3ActorNetwork, self).__init__()
        self.device = device
        self.state_size = state_size
        self.action_size = action_size
        self.activation_fn = activation_fn

        input_dim = state_size
        output_dim = action_size

        # Only Linear layers are registered in the ModuleList: activation_fn (e.g. F.relu)
        # is a plain function, not an nn.Module, so it is applied in forward instead.
        dims = [input_dim, *hidden_dims, output_dim]
        self.layers = nn.ModuleList([nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])])
        self.to(device)

    def forward(self, state):
        """
        Takes as input the state of a single agent and outputs the action for that agent.
        :param state: State of the agent (NumPy array or tensor; a single state or a batch).
        """
        x = torch.as_tensor(state, dtype=torch.float32, device=self.device)
        for layer in self.layers[:-1]:
            x = self.activation_fn(layer(x))
        # tanh keeps every action component in [-1, 1]
        return torch.tanh(self.layers[-1](x))
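
Because the actor ends in `tanh`, its greedy action already lies in [-1, 1]; exploration then adds Gaussian noise and clips the result back into that range. Here is a pure-Python sketch for one action vector. `noisy_action` is a hypothetical helper, and the standard deviation is an arbitrary example value.

```python
import random

def noisy_action(greedy_action, std_dev, rng, low=-1.0, high=1.0):
    """Add zero-mean Gaussian noise to each component and clip to [low, high]."""
    return [min(high, max(low, a + rng.gauss(0.0, std_dev))) for a in greedy_action]

rng = random.Random(0)   # seeded so that repeated runs give the same noise
greedy = [0.95, -0.2]    # stands in for the tanh output of the actor
action = noisy_action(greedy, std_dev=0.3, rng=rng)
print(action)            # every component stays within [-1, 1]
```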

In [51]:
class MATD3Agent():
    def __init__(self, state_size, action_size, all_agents_state_size, all_agents_action_size, config):
        self.state_size = state_size
        self.action_size = action_size
        self.all_agents_state_size = all_agents_state_size
        self.all_agents_action_size = all_agents_action_size
        self.config = config

        self.critic_online = MATD3TwinCriticNetwork(self.all_agents_state_size, self.all_agents_action_size, hidden_dims=self.config.critic_hidden_dims, activation_fn=self.config.critic_activation_fn)
        self.critic_target = MATD3TwinCriticNetwork(self.all_agents_state_size, self.all_agents_action_size, hidden_dims=self.config.critic_hidden_dims, activation_fn=self.config.critic_activation_fn)
        self.update_critic_target(1.0)

        self.actor_online = MATD3ActorNetwork(self.state_size, self.action_size, hidden_dims=self.config.actor_hidden_dims, activation_fn=self.config.actor_activation_fn)
        self.actor_target = MATD3ActorNetwork(self.state_size, self.action_size, hidden_dims=self.config.actor_hidden_dims, activation_fn=self.config.actor_activation_fn)
        self.update_actor_target(1.0)

        self.value_optimizer = optim.Adam(self.critic_online.parameters(), lr=self.config.critic_lr)
        self.policy_optimizer = optim.Adam(self.actor_online.parameters(), lr=self.config.actor_lr)

    def optimize(self, agent_idx: int, experiences, target_next_actions, optimize_policy: bool):
        """
        Optimizes the models of the agent.
        :param agent_idx: Index of this agent.
        :param experiences: Experience sample from the MultiAgentReplayBuffer of all agents.
        :param target_next_actions: Noisy target actions for all agents, shape (num_agents, batch_size, action_size).
        :param optimize_policy: Whether to optimize the policy or not.
        """
        device = self.critic_online.device

        def to_tensor(array):
            # The replay buffer returns float64 NumPy arrays; the networks expect float32 tensors.
            return torch.as_tensor(array, dtype=torch.float32, device=device)

        batch_size = len(experiences['dones'])
        states_flat = to_tensor(experiences['states_flat'])
        states_self = to_tensor(experiences['states'][:, agent_idx, :])
        actions = to_tensor(experiences['actions'])
        actions_flat = to_tensor(experiences['actions_flat'])
        rewards_self = to_tensor(experiences['rewards'][:, agent_idx])
        next_states_flat = to_tensor(experiences['next_states_flat'])
        dones_self = to_tensor(experiences['dones'][:, agent_idx])

        # Move the batch axis first so that each flattened row holds one sample's joint action.
        noisy_target_next_actions_flat = to_tensor(np.swapaxes(target_next_actions, 0, 1).reshape(batch_size, -1))

        # Clipped double-Q target: bootstrap from the smaller of the two target critics.
        with torch.no_grad():
            qa_target, qb_target = self.critic_target(next_states_flat, noisy_target_next_actions_flat)
            target_q = rewards_self + (1 - dones_self) * self.config.gamma * torch.min(qa_target, qb_target)

        qa, qb = self.critic_online(states_flat, actions_flat)
        value_loss = F.mse_loss(qa, target_q) + F.mse_loss(qb, target_q)
        self.value_optimizer.zero_grad()
        value_loss.backward()
        nn.utils.clip_grad_norm_(self.critic_online.parameters(), self.config.critic_gradient_clip_value)
        self.value_optimizer.step()

        if optimize_policy:
            greedy_action = self.actor_online(states_self)
            # Replace this agent's stored action with the actor's output. The tensor stays in
            # the autograd graph so the policy gradient can flow back through the critic.
            joint_actions = actions.clone()
            joint_actions[:, agent_idx] = greedy_action
            q_value = self.critic_online.Qa(states_flat, joint_actions.reshape(batch_size, -1))
            policy_loss = -q_value.mean()
            self.policy_optimizer.zero_grad()
            policy_loss.backward()
            nn.utils.clip_grad_norm_(self.actor_online.parameters(), self.config.actor_gradient_clip_value)
            self.policy_optimizer.step()

    def select_online_action(self, state, std_dev: float, clip_range: float = 1.0):
        """
        Gets the action for exploration from the online actor, taking as input the state of the agent and outputting the greedy action with Gaussian noise added.
        :param state: State of the agent. Can be a single state or a batch of states.
        :param std_dev: Standard deviation of the distribution.
        :param clip_range: Range of the clipped Gaussian noise.
        """
        return self._select_action(self.actor_online, state, std_dev, clip_range)

    def select_target_action(self, state, std_dev: float, clip_range: float):
        """
        Gets the action from the target actor, taking as input the state of the agent and outputting the greedy action with Gaussian noise added.
        :param state: State of the agent. Can be a single state or a batch of states.
        :param std_dev: Standard deviation of the distribution.
        :param clip_range: Range of the clipped Gaussian noise.
        """
        return self._select_action(self.actor_target, state, std_dev, clip_range)

    def _select_action(self, model, state, std_dev: float, clip_range: float):
        """
        Gets the action from the given model, taking as input the state of the agent and outputting the greedy action with clipped Gaussian noise added.
        :param model: Model to use for the action selection.
        :param state: State of the agent. Can be a single state or a batch of states.
        :param std_dev: Standard deviation of the distribution.
        :param clip_range: Range of the clipped Gaussian noise.
        """
        state = torch.as_tensor(state, dtype=torch.float32, device=model.device)
        with torch.no_grad():
            greedy_action = model(state).cpu().numpy()
        # Sample independent noise per element so that batched states do not share one noise vector.
        noise = np.random.normal(0.0, scale=std_dev, size=greedy_action.shape)
        noise = np.clip(noise, -abs(clip_range), abs(clip_range))
        # The actor's tanh output bounds valid actions to [-1, 1].
        return np.clip(greedy_action + noise, -1.0, 1.0)

    def update_critic_target(self, tau: float):
        self._update_target(self.critic_target, self.critic_online, tau)

    def update_actor_target(self, tau: float):
        self._update_target(self.actor_target, self.actor_online, tau)


    @staticmethod
    def _update_target(target, online, tau: float):
        for target_param, param in zip(target.parameters(), online.parameters()):
            target_param.data.copy_(target_param.data * (1.0 - tau) + param.data * tau)
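
`_update_target` is a Polyak (soft) update: each target weight moves a fraction `tau` of the way toward its online counterpart, and `tau=1.0`, as used in the constructor, amounts to a hard copy. Here is a sketch on plain lists of floats. `soft_update` is a hypothetical stand-in for the parameter loop above.

```python
def soft_update(target, online, tau):
    """Return the target weights blended toward the online weights by tau."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

target = [0.0, 1.0]
online = [1.0, 3.0]

print(soft_update(target, online, tau=1.0))  # [1.0, 3.0], a hard copy
print(soft_update(target, online, tau=0.5))  # [0.5, 2.0]
```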




In [None]:
class MATD3Trainer():
    def __init__(self, config, env):
        self.config = config
        # Seed every random number generator used during training.
        torch.manual_seed(self.config.seed)
        np.random.seed(self.config.seed)
        random.seed(self.config.seed)

        self.env = env
        self.brain_name = self.env.brain_names[0]
        self.brain = self.env.brains[self.brain_name]
        env_info = self.env.reset(train_mode=True)[self.brain_name]
        self.num_agents = len(env_info.agents)

        self.action_size_per_agent = self.brain.vector_action_space_size
        self.all_agents_state_size = env_info.vector_observations.size
        self.all_agents_action_size = self.action_size_per_agent * self.num_agents

        self.replay_buffer = MultiAgentReplayBuffer(self.config.max_buffer_size, self.num_agents, env_info.vector_observations.shape, self.action_size_per_agent)

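        # Use the per-agent action size stored on the trainer instead of the notebook-level global.
        self.agents = [MATD3Agent(env_info.vector_observations[agent_idx].size, self.action_size_per_agent, self.all_agents_state_size, self.all_agents_action_size, self.config) for agent_idx in range(self.num_agents)]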

        self.total_timesteps = 0
        self.total_episodes = 0
        self.episode_rewards = []
        self.episode_timesteps = []
        self.episode_num = 0



    def train(self):
        for episode in range(1, self.config.max_episodes + 1):
            episode_start = time.time()
            env_info = self.env.reset(train_mode=True)[self.brain_name]
            for step in count():
                states = env_info.vector_observations
                actions = [agent.select_online_action(states[agent_idx], self.config.exp_std_dev, self.config.exp_clip_range) for agent_idx, agent in enumerate(self.agents)]
                actions = np.array(actions)
                env_info = self.env.step(actions)[self.brain_name]
                next_states = env_info.vector_observations
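                # The buffer stores rewards and dones with shape (num_agents, 1).
                rewards = np.array(env_info.rewards, dtype=float).reshape(-1, 1)
                dones = np.array(env_info.local_done, dtype=float).reshape(-1, 1)
                self.replay_buffer.upsert(states, actions, rewards, next_states, dones)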

                if len(self.replay_buffer) >= self.config.min_buffer_size:
                    experiences = self.replay_buffer.sample(self.config.batch_size)
                    noisy_target_next_actions = np.array([agent.select_target_action(experiences['next_states'][:, agent_idx, :], self.config.tps_std_dev, self.config.tps_clip_range) for agent_idx, agent in enumerate(self.agents)])

                    optimize_policy = step % self.config.optimize_policy_every_steps == 0
                    for agent_idx, agent in enumerate(self.agents):
                        agent.optimize(agent_idx, experiences, noisy_target_next_actions, optimize_policy)

                if step % self.config.update_critic_target_every_steps == 0:
                    for agent in self.agents:
                        agent.update_critic_target(self.config.tau)

                if step % self.config.update_actor_target_every_steps == 0:
                    for agent in self.agents:
                        agent.update_actor_target(self.config.tau)

                if np.any(dones):
                    break

            episode_end = time.time()
            episode_duration = episode_end - episode_start
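
The training loop updates the critics on every step once the buffer holds enough samples, but it only updates the policy when `step` is a multiple of `optimize_policy_every_steps`; this is the delayed-update part of TD3. The sketch below lists which steps would trigger a policy update, assuming a period of 2. The real value comes from `config`, which is not shown in this notebook, and `policy_update_steps` is a hypothetical helper.

```python
from itertools import count

def policy_update_steps(period, num_steps):
    """Return the step indices (starting at 0) on which the policy would be updated."""
    steps = []
    for step in count():
        if step >= num_steps:
            break
        if step % period == 0:
            steps.append(step)
    return steps

print(policy_update_steps(period=2, num_steps=7))  # [0, 2, 4, 6]
```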

In [136]:
s = env_info.vector_observations
a = np.full((5, *s.shape), np.nan)
for i in range(5):
    a[i] = s + i
a.shape

(5, 2, 24)

(5, 48)

In [148]:
np.array([[[2,2,2,3]],[[2,2,2,3]],[[2,2,2,3]]]).shape

(3, 1, 4)