# Multi Player Tennis

In this `Tennis` environment, two agents control rackets to play the game of tennis in accordance with the rules of the game.

An agent receives a positive reward in the following two cases:
- `+0.10` for hitting the ball over the net and landing it within the bounds of the opponent's court
- `+0.50` for refraining from a volley when the ball then lands outside the boundary of its own court

Otherwise, it receives a negative reward in the following scenarios:
- `-0.50` for missing a valid shot from the opponent
- `-0.05` for hitting the ball (without landing it) out of bounds of its opponent's court

It receives `-0.01` for all other cases:
- hitting the net while serving
- missing a serve
- hitting the ball out of bounds of its own court
- double hitting
- letting the ball touch its own court while sending
- letting the ball touch its own court more than once while receiving
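
The reward scheme above can be summarized as a lookup table. The sketch below is illustrative only: the event names and the `reward_for` helper are hypothetical labels chosen for this example, not identifiers from the environment.

```python
# Hypothetical event labels mapped to the rewards listed above.
REWARDS = {
    'valid_shot': 0.10,          # ball clears the net and lands in the opponent's court
    'left_out_ball': 0.50,       # agent lets a ball go that lands outside its own court
    'missed_valid_shot': -0.50,  # agent misses a valid shot from the opponent
    'hit_out_of_bounds': -0.05,  # ball leaves the opponent's court without landing in it
}
DEFAULT_PENALTY = -0.01          # net on serve, missed serve, double hit, and so on


def reward_for(event):
    """Return the reward for a labelled event, falling back to the default penalty."""
    return REWARDS.get(event, DEFAULT_PENALTY)


print(reward_for('valid_shot'))   # 0.1
print(reward_for('double_hit'))   # -0.01
```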


The observation space consists of a stack of 3 consecutive observation frames, where each frame is a vector of `9` variables corresponding to the position, velocity, and rotation of the ball and rackets.

Each agent receives its own, local observation. `3` continuous actions are available, corresponding to movement toward (or away from) the net, jumping, and rotating the racket.
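
To make the observation shape concrete, here is a minimal sketch of frame stacking. The environment does the stacking itself; the `FrameStack` helper, the zero-filled initial frames, and the oldest-first ordering are assumptions made for this illustration.

```python
from collections import deque

FRAME_SIZE = 9    # variables per frame, as stated above
NUM_FRAMES = 3    # frames stacked per observation


class FrameStack:
    """Hypothetical helper that keeps the most recent NUM_FRAMES frames."""

    def __init__(self):
        zero_frame = [0.0] * FRAME_SIZE
        self.frames = deque([zero_frame] * NUM_FRAMES, maxlen=NUM_FRAMES)

    def push(self, frame):
        assert len(frame) == FRAME_SIZE
        self.frames.append(list(frame))   # the oldest frame is dropped automatically
        # flatten oldest-to-newest into a single observation vector
        return [value for f in self.frames for value in f]


stack = FrameStack()
observation = stack.push([1.0] * FRAME_SIZE)
print(len(observation))   # 27
```

With `3` frames of `9` variables each, every agent sees a vector of `27` values per step.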

The task is episodic and ends when one agent wins a game. The environment is considered solved when the agents achieve an average score of `5.0` over `100` consecutive episodes, where each episode's score is the maximum over both agents.
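
The solving criterion can be sketched as a rolling mean over per-episode maxima. The function name `episodes_to_solve` and the score history are made up for this example, and the sketch waits for a full window of `100` episodes before testing the threshold, which is one reasonable reading of the criterion.

```python
from collections import deque

TARGET_SCORE = 5.0
WINDOW = 100


def episodes_to_solve(per_episode_agent_scores):
    """Return the 1-based episode at which the rolling mean first reaches the target, or None."""
    window = deque(maxlen=WINDOW)
    for episode, agent_scores in enumerate(per_episode_agent_scores, start=1):
        window.append(max(agent_scores))             # maximum over both agents
        if len(window) == WINDOW and sum(window) / WINDOW >= TARGET_SCORE:
            return episode
    return None


# made-up scores: 50 weak episodes followed by 150 strong ones
history = [(0.0, 0.1)] * 50 + [(6.0, 5.5)] * 150
print(episodes_to_solve(history))   # 134
```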

### Modified Reward Model 

The reward model for this environment has been modified from the [original reward model](https://github.com/Unity-Technologies/ml-agents/blob/com.unity.ml-agents_1.0.7/Project/Assets/ML-Agents/Examples/Tennis/Scripts/HitWall.cs#L42).

### Import Packages

In [None]:
import os
import copy
import random
import pprint
import numpy as np
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import animation
%matplotlib inline

from IPython.display import display as Display
from IPython.display import HTML
from pyvirtualdisplay import Display as VirtualDisplay
virtual_display = VirtualDisplay(visible=0, size=(1400, 900))   # headless X display used for rendering
virtual_display.start()

from mlagents_envs.base_env import ActionTuple
from mlagents_envs.environment import UnityEnvironment


device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

### Method, Parameters and Style-sheets for Rendering the Scene inside the Notebook

In [None]:
%%html
<style>
.output_wrapper button.btn.btn-default,
.output_wrapper .ui-dialog-titlebar,
.output_wrapper .mpl-message {
    display: none;
}
.output_wrapper .ui-dialog-titlebar + div {
    border: none !important;
    overflow: hidden !important;
}
</style>

In [None]:
def animate_frames(frames):
    'function to animate a list of frames'
    def display_animation(anim):
        plt.close(anim._fig)
        return HTML(anim.to_jshtml())
    plt.axis('off')
    cmap = None if len(frames[0].shape) == 3 else 'Greys'
    patch = plt.imshow(frames[0], cmap=cmap, aspect='auto')
    plt.gcf().set_size_inches(display_width * IMAGE_SCALE / DPI, display_height * IMAGE_SCALE / DPI)
    fanim = animation.FuncAnimation(plt.gcf(), lambda x: patch.set_data(frames[x]), frames = len(frames), interval=ANIMATION_INTERVAL)
    Display(display_animation(fanim))

In [None]:
DPI                                          = 96     # https://www.infobyip.com/detectmonitordpi.php
IMAGE_SCALE                                  = 1.4    # large scale == more buffer space == slow to render
matplotlib.rcParams['animation.embed_limit'] = 2**128 # animation buffer size
ANIMATION_INTERVAL                           = 60     # delay between frames in ms

---

### Start the Environment

In [None]:
MODEL_FILE               = './models/maddpg-tennis.pt'

ENVIRONMENT              = './Unity ML Agent/Tennis/Linux/Tennis.x86_64'

CAMERA_AGENT_NAME_PREFIX = 'Camera'

In [None]:
env = UnityEnvironment(file_name=ENVIRONMENT, seed=1, side_channels=[])

### Examine the State and Action Spaces

In [None]:
env.reset()

agent_info = {}
camera_agent, display_width, display_height = None, None, None
for behavior_name in env.behavior_specs.keys():
    print('\nBehavior Name: {}'.format(behavior_name))
    print('State Space: {}'.format(env.behavior_specs[behavior_name].observation_specs))
    print('Action Space: {}'.format(env.behavior_specs[behavior_name].action_spec))
    if not behavior_name.startswith(CAMERA_AGENT_NAME_PREFIX):
        agent_info[behavior_name] = {
            'state_size': env.behavior_specs[behavior_name].observation_specs[0].shape[0],
            'continuous_action_size': env.behavior_specs[behavior_name].action_spec.continuous_size,
            'discrete_action_n_branches': env.behavior_specs[behavior_name].action_spec.discrete_size,
            'discrete_action_branches': env.behavior_specs[behavior_name].action_spec.discrete_branches
        }
    else:
        camera_agent = behavior_name
        display_height = env.behavior_specs[behavior_name].observation_specs[0].shape[0]
        display_width = env.behavior_specs[behavior_name].observation_specs[0].shape[1]

### IDs and State Action Spaces of Different Agents in a Multi Agent Environment

In [None]:
pprint.pprint(agent_info)


num_agents = len(agent_info)
state_size = list(agent_info.values())[0]['state_size']
action_size = list(agent_info.values())[0]['continuous_action_size']

print('\nNumber of Agents: {}'.format(num_agents))
print('State Size: {}'.format(state_size))
print('Action Size: {}'.format(action_size))

### `Teams` and `Agents`

There can be multiple agent groups (`teams`) in the environment, and each team can have multiple `agents` sharing the same behaviour (state, action, and reward space). Different teams, however, usually follow different behaviours.

To see all the available `teams` and the `agents` (IDs) in each team, execute the following cell.

In [None]:
env.reset()
for agent_group in agent_info:
    decision_steps, _ = env.get_steps(agent_group)
    print('Team Name: {}\tAgent IDs: {}'.format(agent_group, decision_steps.agent_id))

### Take Random Actions in the environment

In [None]:
def random_action(agent_params):
    'sample a uniformly random action for a single agent'
    n_branches = agent_params['discrete_action_n_branches']
    _continuous = np.random.uniform(low=-1.0, high=1.0, size=(1, agent_params['continuous_action_size']))
    _discrete = np.zeros((1, n_branches), dtype=np.int32)     # stays empty when there are no discrete branches
    if n_branches > 0:
        _discrete = np.column_stack([
            np.random.randint(0, agent_params['discrete_action_branches'][branch_idx], size=(1), dtype=np.int32)
            for branch_idx in range(n_branches)
        ])
    return ActionTuple(continuous=_continuous, discrete=_discrete)

In [None]:
RENDER_REAL_TIME = True

In [None]:
if RENDER_REAL_TIME:
    %matplotlib notebook
    fig = plt.figure()
    plt.gcf().set_size_inches(display_width * IMAGE_SCALE / DPI, display_height * IMAGE_SCALE / DPI)
    plt.axis('off')
    img = plt.imshow(np.zeros((display_height, display_width, 3)))
else:
    %matplotlib inline
    frames = []

for episode_i in range(2):
    env.reset()                                                           # reset the environment
    score, terminated, experience = {}, {}, {}
    while True:
        for behavior_name, agent_params in agent_info.items():            # for every agent-group in the environment
            decision_steps, terminal_steps = env.get_steps(behavior_name) # get agents' status in the agent-group
            if behavior_name not in score:
                score[behavior_name] = {}
                for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                    score[behavior_name][i] = 0
            if behavior_name not in terminated:
                terminated[behavior_name] = {}
                for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                    terminated[behavior_name][i] = False
            if behavior_name not in experience:
                experience[behavior_name] = {}
                for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                    experience[behavior_name][i] = {
                        'state': None,
                        'action': None,
                        'reward': None,
                        'next_state': None
                    }
            for agent_id in decision_steps.agent_id:                          # for every agent in the agent-group                
                state = decision_steps[agent_id].obs                          # get the initial state for the agent

                # select an action based on the policy
                # policy.choose(behavior_name, agent_id, state)
                action = random_action(agent_params)                          # select an action

                experience[behavior_name][agent_id]['state'] = state
                experience[behavior_name][agent_id]['action'] = action
                env.set_action_for_agent(behavior_name, agent_id, action)     # send the action to the agent
                # env.set_actions(behavior_name, action)                      # send the action to the agent-group
        if camera_agent:                                                      # render the screen
            decision_steps, _ = env.get_steps(camera_agent)
            if (len(decision_steps.agent_id)):
                camera_agent_id = decision_steps.agent_id[0]
                image = decision_steps[camera_agent_id].obs
                if RENDER_REAL_TIME:
                    img.set_data(image[0])
                    fig.canvas.draw()
                else:
                    frames.append(image[0])
                    

        env.step()                                                            # one step through the environment

        for behavior_name in agent_info:                                      # for every agent-group in the environment
            decision_steps, terminal_steps = env.get_steps(behavior_name)     # get agents' status in the agent-group    
            for agent_id in decision_steps.agent_id:
                reward = decision_steps[agent_id].reward                      # get the reward
                next_state = decision_steps[agent_id].obs                     # get the next state
                score[behavior_name][agent_id] += reward
                experience[behavior_name][agent_id]['reward'] = reward
                experience[behavior_name][agent_id]['next_state'] = next_state
            for agent_id in terminal_steps.agent_id:                          # see if episode finished for an agent
                reward = terminal_steps[agent_id].reward
                score[behavior_name][agent_id] += reward
                experience[behavior_name][agent_id]['reward'] = reward
                experience[behavior_name][agent_id]['next_state'] = None
                terminated[behavior_name][agent_id] = True
                if terminal_steps[agent_id].interrupted:
                    print('agent #{} in the agent-group "{}" has reached the maximum number of steps in the episode'.format(
                        agent_id, behavior_name))
                    
        if camera_agent:                                                      # render the screen
            decision_steps, _ = env.get_steps(camera_agent)
            if (len(decision_steps.agent_id)):
                camera_agent_id = decision_steps.agent_id[0]
                image = decision_steps[camera_agent_id].obs
                if RENDER_REAL_TIME:
                    img.set_data(image[0])
                    fig.canvas.draw()
                else:
                    frames.append(image[0])

        # train the RL agent with the experience tuple
        # agent.step(experience)

        # if EVERY agent in EVERY agent-group is done
        if np.asarray([np.asarray(list(agent_group.values())).all() for agent_group in terminated.values()]).all():
            break

    print(score)

if not RENDER_REAL_TIME: animate_frames(frames)

In [None]:
env.close()

### Hyperparameters of the Model

In [None]:
SEED         = 6

NUM_AGENTS   = num_agents
ACTION_SIZE  = action_size
STATE_SIZE   = state_size

BUFFER_SIZE  = 5120
BATCH_SIZE   = 1024
GAMMA        = 0.99
TAU          = 3e-3

NOISE_MU     = 0.0
NOISE_THETA  = 0.15
NOISE_SIGMA  = 0.1

ACTOR_LR     = 1e-3
CRITIC_LR    = 2e-3

TARGET_SCORE = 5.0    # score at which the environment is considered solved

In [None]:
torch.manual_seed(SEED)
random.seed(SEED)

### Noise Model for Exploration

In [None]:
class OUNoise:
    def __init__(self, size, mu, theta, sigma):
        self.state = None
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        self.state = copy.copy(self.mu)

    def sample(self):
        x = self.state
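        # zero-mean Gaussian increments; random.random() is uniform on [0, 1) and would bias the noise upward
        dx = self.theta * (self.mu - x) + self.sigma * np.array([random.gauss(0.0, 1.0) for _ in range(len(x))])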
        self.state = x + dx
        return self.state
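
The update rule in `sample` has the form x ← x + θ(μ − x) + σε. A minimal standalone sketch of a single step follows, assuming ε is drawn from a standard normal distribution (the textbook form of the process). The `ou_step` function is a hypothetical helper written for this illustration.

```python
import random


def ou_step(x, mu=0.0, theta=0.15, sigma=0.1, rng=random):
    """One Ornstein-Uhlenbeck step: pull x toward mu, then add Gaussian noise."""
    return x + theta * (mu - x) + sigma * rng.gauss(0.0, 1.0)


# With sigma = 0 the process is deterministic and decays toward mu
x = 1.0
for _ in range(3):
    x = ou_step(x, sigma=0.0)
print(round(x, 6))   # 0.614125
```

Each step shrinks the distance to `mu` by a factor of `1 - theta`, so the noise is temporally correlated rather than independent from step to step.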

### Experience Replay Buffer

In [None]:
class Replay:
    def __init__(self, action_size, buffer_size, batch_size):
        self.action_size = action_size
        self.buffer = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self):
        return random.sample(self.buffer, k=self.batch_size)

    def __len__(self):
        return len(self.buffer)
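
The buffer above relies on two standard-library behaviours: a `deque` with `maxlen` silently evicts its oldest entries, and `random.sample` draws without replacement. A small sketch with a toy capacity (the integers stand in for transition tuples):

```python
import random
from collections import deque

buffer = deque(maxlen=4)            # tiny capacity, for illustration only
for transition in range(6):         # integers stand in for transition tuples
    buffer.append(transition)

print(list(buffer))                 # [2, 3, 4, 5]  (0 and 1 were evicted)

random.seed(0)
batch = random.sample(list(buffer), k=2)   # uniform sampling without replacement
print(len(batch), len(set(batch)))  # 2 2
```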

### The Actor-Critic Network

In [None]:
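def hidden_init(layer):
    'uniform initialization limits of +/- 1/sqrt(fan_in) for a linear layer'
    fan_in = layer.weight.data.size()[1]    # nn.Linear stores its weight as (out_features, in_features)
    lim = 1. / np.sqrt(fan_in)
    return -lim, lim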

class Actor(nn.Module):
    def __init__(self, state_size, action_size, fc1_units=256, fc2_units=128):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.bn2 = nn.BatchNorm1d(fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)
        self.reset_parameters()

    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, state):
        x = self.bn1(F.relu(self.fc1(state)))
        x = self.bn2(F.relu(self.fc2(x)))
        return torch.tanh(self.fc3(x))

class Critic(nn.Module):
    def __init__(self, state_size, action_size, fc1_units=256, fc2_units=128):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)
        self.reset_parameters()

    def reset_parameters(self):
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, state, action):
        x = self.bn1(F.relu(self.fc1(state)))
        x = F.relu(self.fc2(torch.cat((x, action), dim=1)))
        return self.fc3(x)
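
The hidden layers above are initialized uniformly within ±1/√f. A standalone sketch of that bound, taking f to be the number of inputs to the layer (its fan-in); the `fan_in_limits` name is hypothetical.

```python
import math


def fan_in_limits(fan_in):
    """Return the (low, high) bounds of the uniform initialization range."""
    lim = 1.0 / math.sqrt(fan_in)
    return -lim, lim


print(fan_in_limits(256))   # (-0.0625, 0.0625)
```

Wider layers therefore start with smaller weights, which keeps the scale of the pre-activations roughly independent of layer width.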

### Multi Agent DDPG

In [None]:
class MultiAgentDDPG:
    def __init__(self):
        self.actor = Actor(STATE_SIZE, ACTION_SIZE).to(device)
        self.target_actor = Actor(STATE_SIZE, ACTION_SIZE).to(device)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=ACTOR_LR)

        self.critic = Critic(STATE_SIZE * NUM_AGENTS, ACTION_SIZE * NUM_AGENTS).to(device)
        self.target_critic = Critic(STATE_SIZE * NUM_AGENTS, ACTION_SIZE * NUM_AGENTS).to(device)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=CRITIC_LR)

        self.noise = OUNoise(ACTION_SIZE, mu=NOISE_MU, theta=NOISE_THETA, sigma=NOISE_SIGMA)
        self.replay = Replay(ACTION_SIZE, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE)

        self.hard_update(self.target_actor, self.actor)
        self.hard_update(self.target_critic, self.critic)

    def act(self, agent_params, states, noise=True):
        state = torch.from_numpy(np.asarray(states)).float().to(device)
        self.actor.eval()
        with torch.no_grad():
            action = self.actor(state).cpu().numpy()
        self.actor.train()
        if noise:
            action += self.noise.sample()
        _continuous = np.clip(action, -1, 1)
        
        # dummy placeholder for the unused discrete action space
        n_branches = agent_params['discrete_action_n_branches']
        _discrete = np.zeros((1, n_branches), dtype=np.int32)
        if n_branches > 0:
            _discrete = np.column_stack([
                np.random.randint(0, agent_params['discrete_action_branches'][branch_idx], size=(1), dtype=np.int32)
                for branch_idx in range(n_branches)
            ])
        return ActionTuple(continuous=_continuous, discrete=_discrete)

    def step(self, experience):
        teams = list(experience.keys())
        full_state = np.asarray([exp['state'][0] for team in teams for exp in experience[team].values()])
        next_full_state = np.asarray([exp['next_state'][0] for team in teams for exp in experience[team].values()])
        full_action = np.asarray([exp['action'].continuous[0] for team in teams for exp in experience[team].values()])
        for team in teams:
            for exp in experience[team].values():
                self.replay.add((
                    exp['state'][0],
                    full_state,
                    exp['action'].continuous[0],
                    full_action,
                    exp['reward'],
                    exp['next_state'][0],
                    next_full_state,
                    exp['done']
                ))

        if len(self.replay) > self.replay.batch_size:
            self.learn()

    def learn(self):
        # sample a batch of transitions from the replay buffer
        transitions = self.replay.sample()
        states, full_state, actions, full_actions, rewards, next_states, next_full_state, dones = self.transpose_to_tensor(transitions)        

        # -------------------- update critic --------------------
        with torch.no_grad():
            target_next_actions = [self.target_actor(next_full_state[:, i, :]) for i in range(NUM_AGENTS)]
            target_next_actions = torch.cat(target_next_actions, dim=1)
            next_full_state = next_full_state.reshape((next_full_state.shape[0], -1))
            target_next_q_values = self.target_critic(next_full_state.to(device), target_next_actions.to(device))

        td_target = rewards.view(-1,1) + GAMMA * target_next_q_values * (1 - dones.view(-1,1))

        # compute Q values for the current states (for all agents) and actions (for all agents) using the critic
        full_actions = full_actions.reshape((full_actions.shape[0], -1))
        current_q_values = self.critic(full_state.reshape((full_state.shape[0], -1)).to(device), full_actions.to(device))

        # compute and minimize the critic loss
        critic_loss = F.mse_loss(current_q_values, td_target.detach())
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.critic.parameters(), 1)
        self.critic_optimizer.step()

        # -------------------- update actor --------------------
        actions = [self.actor(full_state[:, i, :]) for i in range(NUM_AGENTS)]
        actions = torch.cat(actions, dim=1)
        states = full_state.reshape((full_state.shape[0], -1))
        actor_loss = - self.critic(states.to(device), actions.to(device)).mean()
        
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # update target critic and actor models
        self.soft_update(self.target_actor, self.actor, TAU)
        self.soft_update(self.target_critic, self.critic, TAU)
        
    def transpose_to_tensor(self, tuples):
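        def to_tensor(x):
            # stack the per-transition values into one array first; building a tensor from a tuple of arrays is slow
            return torch.as_tensor(np.asarray(x), dtype=torch.float, device=device)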
        return list(map(to_tensor, zip(*tuples)))

    def hard_update(self, target_model, source_model):
        for target_param, param in zip(target_model.parameters(), source_model.parameters()):
            target_param.data.copy_(param.data)

    def soft_update(self, target_model, source_model, tau):
        for target_param, online_param in zip(target_model.parameters(), source_model.parameters()):
            target_param.data.copy_(target_param.data * (1.0 - tau) + online_param.data * tau)
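
Two pieces of arithmetic in the class above are easy to check by hand: the TD target used for the critic and the soft update of the target networks. The sketch below reproduces both on plain Python floats; the function names mirror the class, but the scalar and list inputs are simplifications made for this illustration.

```python
GAMMA = 0.99
TAU = 3e-3


def td_target(reward, next_q, done, gamma=GAMMA):
    """r + gamma * Q'(s', a') for non-terminal transitions, r alone for terminal ones."""
    return reward + gamma * next_q * (1.0 - float(done))


def soft_update(target, online, tau=TAU):
    """Blend each target parameter toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


print(td_target(0.1, 2.0, done=False))    # bootstraps from the next Q value
print(td_target(0.1, 2.0, done=True))     # terminal: reward only
blended = soft_update([0.0, 1.0], [1.0, 1.0])
print(blended)                            # first entry moves a fraction tau toward 1.0
```

With `TAU = 3e-3` the target networks trail the online networks slowly, which is what keeps the bootstrapped target stable.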

### Fault Tolerant Training Function

The training loop can be interrupted at any point and resumed later from the last saved checkpoint.
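
The resume pattern can be sketched independently of the networks. The example below is a stand-in that uses `json` and a toy loop rather than `torch.save` and real episodes; the `run` function, its `stop_after` parameter, and the file name are all hypothetical.

```python
import json
import os
import tempfile


def run(checkpoint_path, max_episodes, stop_after=None):
    """Toy loop that resumes from checkpoint_path if it exists."""
    state = {'episode': 0, 'scores': []}
    if os.path.isfile(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)                      # pick up where the last run stopped
    for episode in range(state['episode'], max_episodes):
        state['scores'].append(float(episode))       # placeholder for a real episode score
        state['episode'] = episode + 1
        with open(checkpoint_path, 'w') as f:
            json.dump(state, f)                       # checkpoint after every episode
        if stop_after is not None and state['episode'] >= stop_after:
            break                                     # simulate an interruption
    return state


path = os.path.join(tempfile.mkdtemp(), 'checkpoint.json')
first = run(path, max_episodes=5, stop_after=2)   # interrupted after 2 episodes
second = run(path, max_episodes=5)                # resumes at episode 2
print(first['episode'], second['episode'], second['scores'])
# 2 5 [0.0, 1.0, 2.0, 3.0, 4.0]
```

Saving the optimizer state and the score history alongside the weights, as the function below does, is what lets a resumed run continue as if it had never stopped.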

In [None]:
def train(max_episodes=2000, print_every=50, save_every=100, render_every=50):
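    '''
    Train the agents, resuming from MODEL_FILE when a checkpoint exists.

    PARAMS
    ======
        max_episodes (int): maximum number of episodes to run in this call
        print_every (int): print a persistent progress line after every n episodes
        save_every (int): save a checkpoint after every n episodes
        render_every (int): render an episode after every n such episodes (-1 for no rendering)
    '''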
    
    agent = MultiAgentDDPG()
    
    scores_deque = deque(maxlen=100)
    start_episode, solved = 0, False
    scores, mean_scores = [], []

    if os.path.isfile(MODEL_FILE):
        # load training checkpoints from file
        map_location = (lambda storage, loc: storage.cuda()) if torch.cuda.is_available() else 'cpu'
        checkpoint = torch.load(MODEL_FILE, map_location=map_location)
        agent.actor.load_state_dict(checkpoint['actor'])
        agent.target_actor.load_state_dict(checkpoint['target_actor'])
        agent.actor_optimizer.load_state_dict(checkpoint['actor_optimizer'])
        agent.critic.load_state_dict(checkpoint['critic'])
        agent.target_critic.load_state_dict(checkpoint['target_critic'])
        agent.critic_optimizer.load_state_dict(checkpoint['critic_optimizer'])
        scores = checkpoint['scores']
        mean_scores = checkpoint['mean_scores']
        start_episode = checkpoint['episode']
        solved = checkpoint['solved']
 
    if solved:
        return scores, mean_scores
    
    agent.actor.train()
    agent.critic.train()
    
    env = UnityEnvironment(file_name=ENVIRONMENT, seed=1, side_channels=[])
    
    if render_every != -1:
        %matplotlib notebook
        fig = plt.figure()
        plt.gcf().set_size_inches(display_width * IMAGE_SCALE / DPI, display_height * IMAGE_SCALE / DPI)
        plt.axis('off')
        img = plt.imshow(np.zeros((display_height, display_width, 3)))

    try:
        for episode in range(start_episode, start_episode + max_episodes):
            env.reset()                                                           # reset the environment
            score, terminated, experience = {}, {}, {}
            while True:
                for behavior_name, agent_params in agent_info.items():            # for every agent-group in the environment
                    decision_steps, terminal_steps = env.get_steps(behavior_name) # get agents' status in the agent-group
                    if behavior_name not in score:
                        score[behavior_name] = {}
                        for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                            score[behavior_name][i] = 0
                    if behavior_name not in terminated:
                        terminated[behavior_name] = {}
                        for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                            terminated[behavior_name][i] = False
                    if behavior_name not in experience:
                        experience[behavior_name] = {}
                        for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                            experience[behavior_name][i] = {
                                'state': None,
                                'action': None,
                                'reward': None,
                                'next_state': None,
                                'done': False
                            }
                    for agent_id in decision_steps.agent_id:                          # for every agent in the agent-group                
                        state = decision_steps[agent_id].obs                          # get the initial state for the agent

                        # select an action based on the policy
                        # policy.choose(behavior_name, agent_id, state)
                        action = agent.act(agent_params, state)                       # select an action

                        experience[behavior_name][agent_id]['state'] = state
                        experience[behavior_name][agent_id]['action'] = action
                        env.set_action_for_agent(behavior_name, agent_id, action)     # send the action to the agent
                        # env.set_actions(behavior_name, action)                  # send the action to the agent-group

                if camera_agent and (render_every != -1) and ((episode + 1) % render_every == 0):
                    decision_steps, _ = env.get_steps(camera_agent)
                    if (len(decision_steps.agent_id)):
                        camera_agent_id = decision_steps.agent_id[0]
                        image = decision_steps[camera_agent_id].obs
                        img.set_data(image[0])
                        fig.canvas.draw()

                env.step()                                                         # one step through the environment

                for behavior_name in agent_info:                                   # for every agent-group in the environment
                    decision_steps, terminal_steps = env.get_steps(behavior_name)     # get agents' status in the agent-group    
                    for agent_id in decision_steps.agent_id:
                        reward = decision_steps[agent_id].reward                      # get the reward
                        next_state = decision_steps[agent_id].obs                     # get the next state
                        score[behavior_name][agent_id] += reward
                        experience[behavior_name][agent_id]['reward'] = reward
                        experience[behavior_name][agent_id]['next_state'] = next_state
                    for agent_id in terminal_steps.agent_id:                          # see if episode finished for an agent
                        reward = terminal_steps[agent_id].reward
                        score[behavior_name][agent_id] += reward
                        experience[behavior_name][agent_id]['reward'] = reward
                        experience[behavior_name][agent_id]['next_state'] = experience[behavior_name][agent_id]['state']
                        experience[behavior_name][agent_id]['done'] = True
                        terminated[behavior_name][agent_id] = True
                        if terminal_steps[agent_id].interrupted:
                            print('agent #{} in the agent-group "{}" has reached the maximum number of steps in the episode'.format(agent_id, behavior_name)) 

                if camera_agent and (render_every != -1) and ((episode + 1) % render_every == 0):
                    decision_steps, _ = env.get_steps(camera_agent)
                    if (len(decision_steps.agent_id)):
                        camera_agent_id = decision_steps.agent_id[0]
                        image = decision_steps[camera_agent_id].obs
                        img.set_data(image[0])
                        fig.canvas.draw()

                # train the RL agent with the experience tuple
                agent.step(experience)

                # if EVERY agent in ANY agent-group is done
                if np.asarray([np.asarray(list(agent_group.values())).all() for agent_group in terminated.values()]).any():
                    break

            agent_scores = [sum([agent_score for agent_score in team_score.values()]) for team_score in score.values()]
            score = max(agent_scores)
            scores.append(score)
            scores_deque.append(score)
            mean_scores.append(np.mean(scores_deque))
            agent_scores = list(map(lambda x: '{:.4f}'.format(x), agent_scores))

            print('\r' + ' ' * 120, end='')
            print('\rEpisode {:04d}    Average Score: {:.4f}    Current Max Score: {:.4f}    Current Scores: {}'.format(episode + 1, mean_scores[-1], score, agent_scores), end='')
            if (episode + 1) % print_every == 0:
                print('\r' + ' ' * 120, end='')
                print('\rEpisode {:04d}    Average Score: {:.4f}'.format(episode + 1, mean_scores[-1]))

            if (episode + 1) % save_every == 0 or mean_scores[-1] >= TARGET_SCORE:
                torch.save({
                    'episode': episode + 1,
                    'actor': agent.actor.state_dict(),
                    'target_actor': agent.target_actor.state_dict(),
                    'actor_optimizer': agent.actor_optimizer.state_dict(),
                    'critic': agent.critic.state_dict(),
                    'target_critic': agent.target_critic.state_dict(),
                    'critic_optimizer': agent.critic_optimizer.state_dict(),
                    'scores': scores,
                    'mean_scores': mean_scores,
                    'solved': True if mean_scores[-1] >= TARGET_SCORE else False
                }, MODEL_FILE)
                if mean_scores[-1] >= TARGET_SCORE:
                    print('\nEnvironment solved in {:04d} episodes!\tAverage Score: {:.4f}'.format(episode + 1, mean_scores[-1]))
                    break

    except KeyboardInterrupt:
        print('\nTraining Interrupted')
    finally:
        env.close()                                                               # always shut down the Unity process

    return scores, mean_scores

### Training Loop

In [None]:
scores, mean_scores = train()

### Plot Scores

In [None]:
%matplotlib inline
fig, ax = plt.subplots()
x = np.arange(1, len(scores) + 1)
ax.plot(x, scores)
ax.plot(x, mean_scores)
ax.set_ylabel('Score')
ax.set_xlabel('Episode #')
plt.show()

---

### Load the Trained Agents for Visualization

In [None]:
env = UnityEnvironment(file_name=ENVIRONMENT, seed=1, side_channels=[])

agent = MultiAgentDDPG()

### Load the Saved Model

In [None]:
# load the weights from file
map_location = (lambda storage, loc: storage.cuda()) if torch.cuda.is_available() else 'cpu'
checkpoint = torch.load(MODEL_FILE, map_location=map_location)
agent.actor.load_state_dict(checkpoint['actor']);

### Start the Game

In [None]:
N_GAMES = 10
N_GAMES_TO_RENDER = 4
RENDER_REAL_TIME = False

In [None]:
if RENDER_REAL_TIME:
    %matplotlib notebook
    fig = plt.figure()
    plt.gcf().set_size_inches(display_width * IMAGE_SCALE / DPI, display_height * IMAGE_SCALE / DPI)
    plt.axis('off')
    img = plt.imshow(np.zeros((display_height, display_width, 3)))
else:
    %matplotlib inline
    frames = []
render_list = np.random.choice(range(N_GAMES), size=N_GAMES_TO_RENDER, replace=False)
for episode in range(N_GAMES):
    env.reset()
    terminated = {}
    while True:
        for behavior_name, agent_params in agent_info.items():
            decision_steps, terminal_steps = env.get_steps(behavior_name)
            if behavior_name not in terminated:
                terminated[behavior_name] = {}
                for i in np.concatenate((decision_steps.agent_id, terminal_steps.agent_id)):
                    terminated[behavior_name][i] = False
            for agent_id in decision_steps.agent_id:
                state = decision_steps[agent_id].obs
                action = agent.act(agent_params, state, noise=False)              # no exploration noise during evaluation
                env.set_action_for_agent(behavior_name, agent_id, action)
        if camera_agent and (RENDER_REAL_TIME or (episode in render_list)):
            decision_steps, _ = env.get_steps(camera_agent)
            if (len(decision_steps.agent_id)):
                camera_agent_id = decision_steps.agent_id[0]
                image = decision_steps[camera_agent_id].obs
                if RENDER_REAL_TIME:
                    img.set_data(image[0])
                    fig.canvas.draw()
                else:
                    frames.append(image[0])
        env.step()
        for behavior_name in agent_info:
            decision_steps, terminal_steps = env.get_steps(behavior_name)
            for agent_id in terminal_steps.agent_id:
                terminated[behavior_name][agent_id] = True
                if terminal_steps[agent_id].interrupted:
                    print('agent #{} in the agent-group "{}" has reached the maximum number of steps in the episode'.format(agent_id, behavior_name))
        if camera_agent and (RENDER_REAL_TIME or (episode in render_list)):
            decision_steps, _ = env.get_steps(camera_agent)
            if (len(decision_steps.agent_id)):
                camera_agent_id = decision_steps.agent_id[0]
                image = decision_steps[camera_agent_id].obs
                if RENDER_REAL_TIME:
                    img.set_data(image[0])
                    fig.canvas.draw()
                else:
                    frames.append(image[0])
        if np.asarray([np.asarray(list(agent_group.values())).all() for agent_group in terminated.values()]).any():
            break
%matplotlib inline
if not RENDER_REAL_TIME: animate_frames(frames)

In [None]:
env.close()

---

Next: [Additional Materials](./Additional.ipynb)