# Collaboration and Competition

---

This notebook is based on the [DDPG-pendulum](https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-pendulum) coding exercise from the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.


Here, we will train multiple agents with MADDPG in the **Soccer** environment.


### 1. Import the Necessary Packages

We begin by importing the necessary packages.

In [1]:
from unityagents import UnityEnvironment
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

from maddpg_agent import MADDPG

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Soccer.app"`
- **Windows** (x86): `"path/to/Soccer_Windows_x86/Soccer.exe"`
- **Windows** (x86_64): `"path/to/Soccer_Windows_x86_64/Soccer.exe"`
- **Linux** (x86): `"path/to/Soccer_Linux/Soccer.x86"`
- **Linux** (x86_64): `"path/to/Soccer_Linux/Soccer.x86_64"`
- **Linux** (x86, headless): `"path/to/Soccer_Linux_NoVis/Soccer.x86"`
- **Linux** (x86_64, headless): `"path/to/Soccer_Linux_NoVis/Soccer.x86_64"`

For instance, if you are using a Mac, then you downloaded `Soccer.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Soccer.app")
```

In [2]:
env = UnityEnvironment(file_name="Soccer_Windows_x86_64/Soccer.exe")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 2
        Number of External Brains : 2
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: GoalieBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 
Unity brain name: StrikerBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 112
        Number of stacked Vector Observation: 3
        Vector Action space type: discrete
        Vector Action space size (per agent): 6
        Vector Action descriptions: , , , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we obtain separate brains for the striker and goalie agents.

In [3]:
# print the brain names
print(env.brain_names)

# set the goalie brain
g_brain_name = env.brain_names[0]
g_brain = env.brains[g_brain_name]

# set the striker brain
s_brain_name = env.brain_names[1]
s_brain = env.brains[s_brain_name]

['GoalieBrain', 'StrikerBrain']


### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)

# number of agents 
num_g_agents = len(env_info[g_brain_name].agents)
print('Number of goalie agents:', num_g_agents)
num_s_agents = len(env_info[s_brain_name].agents)
print('Number of striker agents:', num_s_agents)

# number of actions
g_action_size = g_brain.vector_action_space_size
print('Number of goalie actions:', g_action_size)
s_action_size = s_brain.vector_action_space_size
print('Number of striker actions:', s_action_size)

# examine the state space 
g_states = env_info[g_brain_name].vector_observations
g_state_size = g_states.shape[1]
print('There are {} goalie agents. Each receives a state with length: {}'.format(g_states.shape[0], g_state_size))
s_states = env_info[s_brain_name].vector_observations
s_state_size = s_states.shape[1]
print('There are {} striker agents. Each receives a state with length: {}'.format(s_states.shape[0], s_state_size))

Number of goalie agents: 2
Number of striker agents: 2
Number of goalie actions: 4
Number of striker actions: 6
There are 2 goalie agents. Each receives a state with length: 336
There are 2 striker agents. Each receives a state with length: 336
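
The reported state length of 336 matches the brain summary printed when the environment started: 112 values per observation times 3 stacked observations. The sketch below checks that arithmetic and shows one way to view a state as separate frames. It assumes the stack is a plain concatenation of frames (the order of frames inside the vector is not stated here), and `fake_states` is placeholder data rather than real environment output.

```python
import numpy as np

obs_size = 112      # "Vector Observation space size (per agent)" from the log
num_stacked = 3     # "Number of stacked Vector Observation" from the log
state_size = obs_size * num_stacked

# Placeholder standing in for env_info[brain_name].vector_observations
fake_states = np.zeros((2, state_size))

# Assumption: frames are concatenated, so each row splits into 3 blocks of 112
frames = fake_states.reshape(fake_states.shape[0], num_stacked, obs_size)
print(state_size, frames.shape)
```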


In [5]:
# Set the parameters and initialise the agents

num_agents=num_g_agents+num_s_agents
state_sizes=[g_state_size,]*num_g_agents+[s_state_size,]*num_s_agents
action_sizes=[g_action_size,]*num_g_agents+[s_action_size,]*num_s_agents
names=[g_brain_name, s_brain_name]
agents=MADDPG(state_sizes, action_sizes, num_agents, random_seed=1, path="Soccer_checkpoint", discrete=True)
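
The lists built above put goalies first and strikers second, and the training loop relies on that order when it regroups the per-agent actions by brain before calling `env.step`. Here is a minimal sketch of that round trip with made-up values. The brain names are the ones printed earlier; `fake_actions` and `per_brain` are illustrative stand-ins, not output from `MADDPG.act` or the environment.

```python
import numpy as np

names = ["GoalieBrain", "StrikerBrain"]
num_g_agents, num_s_agents = 2, 2

# One made-up discrete action per agent, goalies first, then strikers
fake_actions = [np.array([0]), np.array([3]), np.array([5]), np.array([2])]

# Regroup into one (agents_per_brain, 1) array per brain
g_actions = np.vstack(fake_actions[:num_g_agents])
s_actions = np.vstack(fake_actions[num_g_agents:])
actions_dict = dict(zip(names, [g_actions, s_actions]))

# Flatten per-brain values back into the goalie-then-striker order
per_brain = {"GoalieBrain": [1.0, 1.0], "StrikerBrain": [-1.0, -1.0]}
flat = [r for name in names for r in per_brain[name]]
print(actions_dict, flat)
```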

### 3. Train the Agents with MADDPG

Run the code cells below to train the agents from scratch.  Alternatively, you can skip to Section 4 to load the pre-trained weights from file.

In [6]:
def trainer(n_episodes=4000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.9965):
    np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
    scores_deque = deque(maxlen=100)
    scores = []
    max_score = -np.inf*np.ones(num_agents)   # np.inf works on all NumPy versions (np.Inf was removed in NumPy 2.0)
    eps = eps_start                    # initialize epsilon
    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=True)                 # reset the environment    
        states = sum([list(env_info[name].vector_observations) for name in names], [])  # flatten: goalies first, then strikers
        #agents.reset()
        agents_score = np.zeros(num_agents)
        for t in range(max_t):
           
            actions = agents.act(states,eps)
            
            g_actions, s_actions = np.vstack(actions[:num_g_agents]), np.vstack(actions[-num_s_agents:])
            
            actions_dict = dict(zip([g_brain_name, s_brain_name], [g_actions, s_actions]))
            
            env_info = env.step(actions_dict)
            
            next_states = sum([list(env_info[name].vector_observations) for name in names], [])
            
            rewards = sum([list(env_info[name].rewards) for name in names], [])
            
            dones = sum([list(env_info[name].local_done) for name in names], [])
            
            agents.step(states, actions, rewards, next_states, dones, t)
                
            states = next_states.copy()
            agents_score += rewards                       # update the score (for each agent)
            
            if np.any(dones):
                break 
                
        scores_deque.append(agents_score)
        scores.append(agents_score)
        mean=np.mean(scores_deque,axis=0)
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        
        print('\rEpisode {}\tAverage Scores: {} \tScores: {}'.format(i_episode, mean , agents_score), end="")
        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Scores: {}'.format(i_episode, mean))
        
        idx=np.where(mean>=max_score)[0]
        max_score[idx]=mean[idx]
        agents.save(idx)
        
        if (max_score>.5).all():
            print("\n solved in",i_episode,"episode")
            break
            
    return np.array(scores)

In [None]:
scores=trainer()

Episode 71	Average Scores: [ 0.3614  0.4853 -0.4853 -0.3614] 	Scores: [ 1.0017  1.0017 -1.0017 -1.0017]
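
With the default arguments of `trainer`, epsilon is multiplied by `eps_decay` after every episode until it reaches `eps_end`. The sketch below replays that schedule on its own, without the environment, to show how slowly exploration winds down; at episode 71, for example, epsilon is still roughly 0.78. `epsilon_schedule` is a hypothetical helper written for this illustration and is not part of `maddpg_agent`.

```python
import math

def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.9965):
    """Return the epsilon used in each episode under multiplicative decay."""
    eps, values = eps_start, []
    for _ in range(n_episodes):
        values.append(eps)
        eps = max(eps_end, eps_decay * eps)
    return values

schedule = epsilon_schedule(4000)

# Closed-form estimate of the number of decay steps needed to reach eps_end
steps_to_floor = math.ceil(math.log(0.01 / 1.0) / math.log(0.9965))
print(schedule[0], schedule[-1], steps_to_floor)
```

If the agents need to exploit earlier, a smaller `eps_decay` shortens the schedule; the estimate above gives a quick way to compare candidates before starting a long run.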

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
itr=np.arange(1, len(scores)+1)
for i in range(num_agents):
    
    name=names[0] if i<num_g_agents else names[1]   # goalies come first in the score columns
    
    j= i if i<num_g_agents else i-num_g_agents
        
    plt.plot(itr, scores[:,i],label="agent "+name+" "+str(j))
    
ax.legend()    
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
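
Per-episode scores in this game jump between roughly -1 and 1, so the raw curves can be hard to read. One option is to plot a trailing mean, similar in spirit to the 100-episode average that `trainer` prints. The helper below is a sketch for that purpose; `rolling_mean` and the `demo` array are illustrative names and data, not part of the original notebook.

```python
import numpy as np

def rolling_mean(scores, window=100):
    """Trailing mean per agent; early episodes average over what is available."""
    scores = np.asarray(scores, dtype=float)
    csum = np.cumsum(scores, axis=0)
    out = np.empty_like(scores)
    for i in range(len(scores)):
        start = max(0, i - window + 1)
        total = csum[i] - (csum[start - 1] if start > 0 else 0.0)
        out[i] = total / (i - start + 1)
    return out

# Tiny made-up example: 4 episodes, 2 agents, window of 2
demo = np.array([[1.0, -1.0], [0.0, 0.0], [1.0, -1.0], [1.0, -1.0]])
smoothed = rolling_mean(demo, window=2)
print(smoothed)
```

Passing `rolling_mean(scores)[:, i]` to `plt.plot` instead of `scores[:, i]` would draw the smoothed curve for agent `i`.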

### 4. Watch the Smart Agents!

In the next code cell, we will load the trained weights and test the model.

In [None]:
agents.load()
np.set_printoptions(formatter={'float': '{: 0.4f}'.format})

for i in range(2):                                         # play game for 2 episodes
    env_info = env.reset(train_mode=False)                 # reset the environment    
    g_states = env_info[g_brain_name].vector_observations  # get initial state (goalies)
    s_states = env_info[s_brain_name].vector_observations  # get initial state (strikers)
    g_scores = np.zeros(num_g_agents)                      # initialize the score (goalies)
    s_scores = np.zeros(num_s_agents)                      # initialize the score (strikers)
    while True:
        
        states=list(g_states)+list(s_states)
        
        # select actions and send to environment
        
        actions = agents.act(states,0.)
            
        g_actions, s_actions = np.vstack(actions[:num_g_agents]), np.vstack(actions[-num_s_agents:])
        
        actions = dict(zip([g_brain_name, s_brain_name], [g_actions, s_actions]))
        
        env_info = env.step(actions)                       
        
        # get next states
        g_next_states = env_info[g_brain_name].vector_observations         
        s_next_states = env_info[s_brain_name].vector_observations
        
        # get reward and update scores
        g_rewards = env_info[g_brain_name].rewards  
        s_rewards = env_info[s_brain_name].rewards
        g_scores += g_rewards
        s_scores += s_rewards
        
        # check if episode finished
        done = np.any(env_info[g_brain_name].local_done)  
        
        # roll over states to next time step
        g_states = g_next_states
        s_states = s_next_states
        
        # exit loop if episode finished
        if done:                                           
            break
    print('Scores from episode {}: {} (goalies), {} (strikers)'.format(i+1, g_scores, s_scores))

When finished, you can close the environment.

In [None]:
env.close()