# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
import copy
import torch
import random
import numpy as np
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.nn.functional as F
from unityagents import UnityEnvironment
from collections import namedtuple, deque

In [2]:
env = UnityEnvironment(file_name="Tennis_Windows_x86_64/Tennis.exe")

# get the default brain
wimbledon = env.brain_names[0]
brain = env.brains[wimbledon]

env_info = env.reset(train_mode=True)[wimbledon]

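num_agents = len(env_info.agents)               # number of agents in the environment
action_size = brain.vector_action_space_size    # size of each agent's action
states = env_info.vector_observations           # one row of observations per agent
state_size = states.shape[1]                    # length of each agent's observation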

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

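The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. As the environment summary above shows, three consecutive observations are stacked, so each agent receives its own state vector of length 24. Each agent has two continuous actions, corresponding to movement toward (or away from) the net, and jumping.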

Run the code cell below to print some information about the environment.
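
As an offline illustration of these sizes, the sketch below rebuilds them from the figures in the environment summary (8 variables, 3 stacked observations, 2 actions, two agents) without launching Unity. The zero-filled arrays and the `*_demo` names are stand-ins chosen for this example, not values read from the environment.

```python
import numpy as np

OBS_VARS = 8          # variables per observation (from the brain summary)
STACKED_FRAMES = 3    # stacked vector observations (from the brain summary)
NUM_AGENTS_DEMO = 2   # two rackets
ACTION_SIZE_DEMO = 2  # move toward/away from the net, jump

state_size_demo = OBS_VARS * STACKED_FRAMES

# Stand-in for env_info.vector_observations: one row per agent.
states_demo = np.zeros((NUM_AGENTS_DEMO, state_size_demo))

# Random actions clipped to [-1, 1], one row per agent.
actions_demo = np.clip(np.random.randn(NUM_AGENTS_DEMO, ACTION_SIZE_DEMO), -1, 1)

# Both agents' observations flattened into one vector, as a joint critic would see them.
joint_state_demo = np.concatenate(states_demo)

print('Number of agents:', states_demo.shape[0])
print('State size per agent:', states_demo.shape[1])
print('Joint state size:', joint_state_demo.shape[0])
print('Action array shape:', actions_demo.shape)
```

The per-agent size of 24 and the joint size of 48 are the same numbers that show up in the tensor shapes further down.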

### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agents and receive feedback from the environment.

Once this cell is executed, a window should pop up in which you can watch how the agents perform when they select an action at random at each time step.

Of course, as part of the project, you'll have to change the code so that the agents are able to use their experiences to gradually choose better actions when interacting with the environment!

In [None]:
for i in range(1, 6):                                      # play game for 5 episodes
    env_info = env.reset(train_mode=False)[wimbledon]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[wimbledon]            # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        dones = env_info.local_done                        # see if episode finished
        scores += rewards                                  # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    print('Score (max over agents) from episode {}: {}'.format(i, np.max(scores)))

When finished, you can close the environment with `env.close()`. It is left open here because the cells below still use it.

## Hyperparameters

In [3]:
BUFFER_SIZE = int(1e5)     # replay buffer size
BATCH_SIZE = 5             # minibatch size
GAMMA = 0.99               # discount factor
TAU = 1e-3                 # for soft update of target parameters
LR_ACTOR = 1e-4            # learning rate of the actor
LR_CRITIC = 3e-4           # learning rate of the critic
update_every = 5           # number of timesteps after which to run an update
SN = 0.5                   # starting value for additive noise scale (exploratory actions)
ND = 0.999                 # noise decay rate (exploratory actions)
NM = 0.01                  # noise minimum to be maintained (exploratory actions)
UC = 3                     # number of update cycles to run at each update

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
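
To get a feel for the exploration schedule implied by `SN`, `ND` and `NM`, the sketch below counts how many noise draws it takes for the scale to decay from its starting value to the floor, using the same `max(decay * scale, floor)` rule as the agent's noise generator further down. The helper name `steps_until_floor` is made up for this illustration.

```python
import math

SN, ND, NM = 0.5, 0.999, 0.01   # same values as the hyperparameter cell

def steps_until_floor(start, decay, floor):
    """Count applications of scale = max(decay * scale, floor) until scale reaches floor."""
    scale, steps = start, 0
    while scale > floor:
        scale = max(decay * scale, floor)
        steps += 1
    return steps

n_steps = steps_until_floor(SN, ND, NM)

# Closed form: smallest n with SN * ND**n <= NM.
closed_form = math.ceil(math.log(NM / SN) / math.log(ND))

print(n_steps, closed_form)
```

With these settings the floor is reached after a few thousand draws, so a slower decay rate may be worth trying if the agents stop exploring too early.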

## DDPG agents
____
### Define Actor and Critic

In [4]:
class Actor(nn.Module):
    def __init__(self, state_size, action_size, hu=(256, 128), activ_in = F.relu, activ_out = torch.tanh):
        super(Actor, self).__init__()
        
        self.activ_in = activ_in
        self.activ_out = activ_out
        
        self.input_layer = nn.Linear(state_size, hu[0])
        self.hl1 = nn.Linear(hu[0], hu[1])
        self.output_layer = nn.Linear(hu[-1], action_size)
        
    def forward(self, state):
        x = state
        x = self.activ_in(self.input_layer(x))
        x = self.activ_in(self.hl1(x))
        return self.activ_out(self.output_layer(x))  
    
class Critic(nn.Module):
    def __init__(self, state_size, action_size, hu=(256, 128), activ_in = F.relu):
        super(Critic, self).__init__()
        
        self.activ_in = activ_in
        
        self.input_layer = nn.Linear(state_size, hu[0])
        self.hl1 = nn.Linear(hu[0]+action_size, hu[1])
        self.output_layer = nn.Linear(hu[-1], 1)
        
    def forward(self, state, action):
        x = state
        u = action
        
        x = self.activ_in(self.input_layer(x))
        x = torch.cat((x, u), dim=1)
        x = self.activ_in(self.hl1(x))
        return self.output_layer(x)
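
The critic feeds the action in after the first layer, which is why its second layer has `hu[0]+action_size` inputs. The NumPy sketch below traces the shapes for the joint inputs used later in the notebook (48 state values and 4 action values for two agents). Random matrices stand in for the `nn.Linear` weights and biases are omitted, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, joint_state, joint_action, hu = 5, 48, 4, (256, 128)

# Random weights stand in for the linear layers (biases left out for brevity).
W_in = rng.normal(size=(joint_state, hu[0]))
W_h1 = rng.normal(size=(hu[0] + joint_action, hu[1]))
W_out = rng.normal(size=(hu[1], 1))

def relu(z):
    return np.maximum(z, 0.0)

x = rng.normal(size=(batch, joint_state))            # joint states
u = rng.uniform(-1, 1, size=(batch, joint_action))   # joint actions

h = relu(x @ W_in)                    # (5, 256)
h = np.concatenate((h, u), axis=1)    # (5, 260): actions join after the first layer
h = relu(h @ W_h1)                    # (5, 128)
q = h @ W_out                         # (5, 1): one Q-value per sample

print(W_h1.shape, q.shape)
```

The concatenation only works when states and actions share the same batch dimension, which matters for the `learn` method below.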

### Define DDPG agent

In [5]:
class DDPG():
    def __init__(self, state_size, action_size, num_agents=2,
                 start_noise=SN, noise_decay=ND, noise_min=NM, add_noise=True):
        super(DDPG, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        
        ### Initialise actor online and target networks
        self.actor_online = Actor(state_size, action_size).to(device)
        self.actor_target = Actor(state_size, action_size).to(device)
        self.actor_optimizer = optim.Adam(self.actor_online.parameters(), lr=LR_ACTOR)
        
        ### Initialise critic online and target networks
        self.critic_online = Critic(num_agents*state_size, num_agents*action_size).to(device)
        self.critic_target = Critic(num_agents*state_size, num_agents*action_size).to(device)
        self.critic_optimizer = optim.Adam(self.critic_online.parameters(), lr=LR_CRITIC)
        
        ### Ensure online and target networks start off with same parameters
        self.equalize_OnlineTarget(self.actor_target, self.actor_online)
        self.equalize_OnlineTarget(self.critic_target, self.critic_online)
        
        ### Noise parameters for exploration
        self.noise_scale = start_noise
        self.noise_decay = noise_decay
        self.noise_min = noise_min
        self.add_noise = add_noise
        
    def equalize_OnlineTarget(self, target, online):
        for target_param, online_param in zip(target.parameters(), online.parameters()):
            target_param.data.copy_(online_param.data)
            
    def generate_noise(self):
        noise = np.random.normal(loc=0, scale=self.noise_scale, size=self.action_size)
        self.noise_scale = max(self.noise_decay*self.noise_scale, self.noise_min)
        return noise
    
    def act(self, state):
        state = torch.from_numpy(state).float().to(device)
        self.actor_online.eval()
        with torch.no_grad():
            action = self.actor_online(state).cpu().data.numpy()
        self.actor_online.train()
        if self.add_noise:
            action += self.generate_noise()
        return np.clip(action, -1, 1)
    
    def soft_update(self, online_model, target_model, tau):
        for target_param, online_param in zip(target_model.parameters(), online_model.parameters()):
            target_param.data.copy_(tau*online_param.data + (1.0-tau)*target_param.data)
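
The soft update moves each target parameter a fraction `tau` of the way toward its online counterpart. As a rough illustration, the NumPy sketch below applies the same blend to a small weight vector while the online weights are held fixed, and compares the remaining gap with the `(1 - tau)**k` shrinkage expected after `k` updates. The weight values are arbitrary.

```python
import numpy as np

TAU = 1e-3
online = np.array([1.0, -2.0, 0.5])   # stand-in for online-network weights (held fixed here)
target = np.zeros_like(online)        # stand-in for target-network weights

def soft_update(online_w, target_w, tau):
    """Return the blended target weights: tau*online + (1 - tau)*target."""
    return tau * online_w + (1.0 - tau) * target_w

k = 1000
for _ in range(k):
    target = soft_update(online, target, TAU)

remaining_gap = np.abs(online - target)
expected_gap = (1.0 - TAU) ** k * np.abs(online)   # initial gap is |online| because target starts at 0

print(remaining_gap, expected_gap)
```

With `TAU = 1e-3`, roughly a third of the initial gap is still left after 1000 updates, which is what keeps the bootstrapped targets slow-moving.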

In [6]:
### Define a Public Replaybuffer class to store experiences, organise and sample from them
class ReplayBuffer:
    def __init__(self, action_size, buffer_size, batch_size):
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["full_state", "state",\
                                                                "action", "reward",\
                                                                "full_next_state", "next_state",\
                                                                "done"])
    
    ### Adds an experience tuple to memory
    def add(self, full_state, state, action, reward, full_next_state, next_state, done):
        e = self.experience(full_state, state, action, reward, full_next_state, next_state, done)
        self.memory.append(e)

    ### Samples k (batch size) experience tuples randomly from memory
    def sample(self):
        experiences = random.sample(self.memory, k=self.batch_size)
        
        full_states = torch.from_numpy(np.vstack([e.full_state for e in experiences if e is not None])).float().to(device)
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        full_next_states = torch.from_numpy(np.vstack([e.full_next_state for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
        
        return (full_states, states, actions, rewards, full_next_states, next_states, dones)
    
    def __len__(self):
        return len(self.memory)
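
Because every stored `state` holds one row per agent, `np.vstack` in `sample` interleaves the agents: a batch of 5 experiences yields 10 rows rather than 5. The sketch below reproduces this with fake data and shows that a reshape recovers a per-agent view. The array contents are invented so that the rows are easy to tell apart.

```python
import numpy as np

num_agents, state_size, batch_size = 2, 24, 5

# Fake experiences: agent 0's row holds the experience index, agent 1's row its negative.
experiences = [np.stack([np.full(state_size, float(k)), np.full(state_size, -float(k))])
               for k in range(1, batch_size + 1)]

stacked = np.vstack(experiences)                                   # (10, 24): agents interleaved row by row
per_agent = stacked.reshape(batch_size, num_agents, state_size)    # (5, 2, 24)
joint = stacked.reshape(batch_size, num_agents * state_size)       # (5, 48): same layout as full_state

print(stacked.shape, per_agent.shape, joint.shape)
print(per_agent[:, 0, 0])   # agent 0's first feature across the batch
```

Indexing `per_agent[:, i, :]` gives agent `i` a proper `(batch, state_size)` input, whereas `stacked[i, :]` would pick a single row belonging to the first experience only.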

In [7]:
class MADDPG(object):
    def __init__(self, state_size, action_size, num_agents=2, update_cycles = UC):
        self.state_size = state_size
        self.action_size = action_size
        self.num_agents = num_agents
        self.full_action_len = self.action_size*num_agents
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE)
        
        ### Keep track of timesteps since last training update
        self.update = 0
        self.update_cycles = update_cycles
        
        self.maddpg_agents = [DDPG(state_size, action_size, num_agents) for i in range(self.num_agents)]
        
    def step(self, states, actions, rewards, next_states, dones):
        full_states = np.concatenate(states)
        full_next_states = np.concatenate(next_states)
        
        self.memory.add(full_states, states, actions, rewards, full_next_states, next_states, dones)
    
        self.update = (self.update +1)%update_every
        
        if (self.update==0):
            if len(self.memory) > BATCH_SIZE:
                for _ in range(self.update_cycles):
                    for agent in range(self.num_agents):
                        experiences = self.memory.sample()
                        self.learn(experiences, agent, GAMMA)
                    self.soft_update_all()
                    
    def soft_update_all(self):
        for agent in self.maddpg_agents:
            agent.soft_update(agent.actor_online, agent.actor_target, TAU)
            agent.soft_update(agent.critic_online, agent.critic_target, TAU)
            
    def learn(self, experiences, id_agent, gamma):
        full_states, states, actions, rewards, full_next_states, next_states, dones = experiences
        agent = self.maddpg_agents[id_agent]
        batch = full_states.shape[0]

        ### The buffer stacks one row per agent per experience; regroup as (batch, agent, features)
        states = states.view(batch, self.num_agents, -1)
        next_states = next_states.view(batch, self.num_agents, -1)
        full_actions = actions.view(batch, -1)
        reward = rewards[:, id_agent].unsqueeze(1)
        done = dones[:, id_agent].unsqueeze(1)

        ### Critic update: the TD target uses every agent's target actor and needs no gradient
        with torch.no_grad():
            full_next_actions = torch.cat([self.maddpg_agents[i].actor_target(next_states[:, i, :])
                                           for i in range(self.num_agents)], dim=1)
            Q_targets_next = agent.critic_target(full_next_states, full_next_actions)
            Q_targets = reward + gamma * Q_targets_next * (1 - done)
        Q_expected = agent.critic_online(full_states, full_actions)

        critic_loss = F.mse_loss(Q_expected, Q_targets)
        agent.critic_optimizer.zero_grad()
        critic_loss.backward()
        torch.nn.utils.clip_grad_norm_(agent.critic_online.parameters(), 1)
        agent.critic_optimizer.step()

        ### Actor update: only this agent's action carries gradient, the others are detached
        actions_pred = [self.maddpg_agents[i].actor_online(states[:, i, :]) for i in range(self.num_agents)]
        full_actions_pred = torch.cat([a if i == id_agent else a.detach()
                                       for i, a in enumerate(actions_pred)], dim=1)
        actor_loss = -agent.critic_online(full_states, full_actions_pred).mean()

        agent.actor_optimizer.zero_grad()
        actor_loss.backward()
        torch.nn.utils.clip_grad_norm_(agent.actor_online.parameters(), 1)
        agent.actor_optimizer.step()
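
The critic is trained toward a one-step TD target, the agent's reward plus `gamma` times the target critic's value of the next joint state and action, with the bootstrap term switched off when the episode has ended. The NumPy sketch below works through that arithmetic on made-up numbers and also shows a broadcasting pitfall: keeping the reward as a column of shape `(batch, 1)` matters. All values are illustrative.

```python
import numpy as np

gamma = 0.99
rewards = np.array([[0.1, 0.0], [0.0, -0.01], [0.0, 0.0]])   # (batch, num_agents), made-up values
dones = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])       # 1.0 marks a finished episode
q_next = np.array([[0.5], [0.7], [0.2]])                     # stand-in for the target critic's output

id_agent = 0
r = rewards[:, [id_agent]]   # (batch, 1): list index keeps the column dimension
d = dones[:, [id_agent]]     # (batch, 1)

td_target = r + gamma * q_next * (1.0 - d)                   # (batch, 1)

# Pitfall: a 1-D reward of shape (batch,) broadcasts against (batch, 1) into (batch, batch).
wrong = rewards[:, id_agent] + gamma * q_next

print(td_target.ravel(), td_target.shape, wrong.shape)
```

The second sample has `done = 1`, so its target reduces to the reward alone.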

In [8]:
test_agent = MADDPG(state_size, action_size) 

In [13]:
### Take one random step and store it; run this cell at least BATCH_SIZE times before sampling
env_info = env.reset(train_mode=False)[wimbledon]    # reset the environment
states = env_info.vector_observations                # get the current state (for each agent)

actions = np.random.randn(2, action_size)            # select an action (for each agent)
actions = np.clip(actions, -1, 1)                    # all actions between -1 and 1
env_info = env.step(actions)[wimbledon]              # send all actions to the environment
next_states = env_info.vector_observations           # get next state (for each agent)
rewards = env_info.rewards                           # get reward (for each agent)
dones = env_info.local_done                          # see if episode finished
test_agent.step(states, actions, rewards, next_states, dones)

In [18]:
experiences = test_agent.memory.sample()

In [20]:
full_states, states, actions, rewards, full_next_states, next_states, dones = experiences

In [91]:
full_actions = actions.reshape((5,4)) ### full actions needs to be tensor of bsize, 2xaction size
### same needs to be done for next_states and rewards and dones

In [96]:
next_states.shape

torch.Size([10, 24])

In [93]:
full_next_states.shape

torch.Size([5, 48])

In [45]:
### Fix: actions, rewards, next_states, dones must parse into 1,5 (5=batch size)

tensor([[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]], device='cuda:0')

In [94]:
full_next_actions = torch.cat([test_agent.maddpg_agents[i].actor_target(next_states[i,:]) for i in range(test_agent.num_agents)]).unsqueeze(dim=0)


In [95]:
full_next_actions.shape

torch.Size([1, 4])

In [92]:
Q_targets_next = test_agent.maddpg_agents[0].critic_target(full_next_states, full_next_actions)


RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 1 and 5 in dimension 0 at C:/w/1/s/tmp_conda_3.6_095855/conda/conda-bld/pytorch_1579082406639/work/aten/src\THC/generic/THCTensorMath.cu:71
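
The error comes from the concatenation inside the critic: the first-layer output has 5 rows (one per sample) while the joint action built above has a single row. The NumPy sketch below mirrors that mismatch and the batched alternative. The zero arrays are placeholders, and NumPy raises `ValueError` where PyTorch raises `RuntimeError`.

```python
import numpy as np

hidden = np.zeros((5, 256))         # first-layer output for a batch of 5
single_action = np.zeros((1, 4))    # joint action built from one sample only
batched_action = np.zeros((5, 4))   # joint action built for every sample in the batch

try:
    np.concatenate((hidden, single_action), axis=1)
    mismatch_raised = False
except ValueError:
    # Rows differ (5 vs 1), so concatenating along axis 1 is rejected.
    mismatch_raised = True

fixed = np.concatenate((hidden, batched_action), axis=1)

print(mismatch_raised, fixed.shape)
```

Building the joint action with one row per sample, that is with shape `(batch, num_agents * action_size)`, removes the mismatch.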

In [43]:
full_next_states.shape

torch.Size([5, 48])

In [None]:
### Scratch copy of the MADDPG.learn body, run on the batch unpacked above for one agent
id_agent, gamma = 0, GAMMA
agent = test_agent.maddpg_agents[id_agent]
batch = full_states.shape[0]

### Regroup the stacked per-agent rows so every tensor has the batch on dimension 0
states_pa = states.view(batch, test_agent.num_agents, -1)
next_states_pa = next_states.view(batch, test_agent.num_agents, -1)
full_actions = actions.view(batch, -1)
reward = rewards[:, id_agent].unsqueeze(1)
done = dones[:, id_agent].unsqueeze(1)

with torch.no_grad():
    full_next_actions = torch.cat([test_agent.maddpg_agents[i].actor_target(next_states_pa[:, i, :])
                                   for i in range(test_agent.num_agents)], dim=1)
    Q_targets_next = agent.critic_target(full_next_states, full_next_actions)
    Q_targets = reward + gamma * Q_targets_next * (1 - done)
Q_expected = agent.critic_online(full_states, full_actions)

critic_loss = F.mse_loss(Q_expected, Q_targets)
agent.critic_optimizer.zero_grad()
critic_loss.backward()
torch.nn.utils.clip_grad_norm_(agent.critic_online.parameters(), 1)
agent.critic_optimizer.step()

### Only the chosen agent's predicted action keeps its gradient
actions_pred = [test_agent.maddpg_agents[i].actor_online(states_pa[:, i, :])
                for i in range(test_agent.num_agents)]
full_actions_pred = torch.cat([a if i == id_agent else a.detach()
                               for i, a in enumerate(actions_pred)], dim=1)
actor_loss = -agent.critic_online(full_states, full_actions_pred).mean()

agent.actor_optimizer.zero_grad()
actor_loss.backward()
torch.nn.utils.clip_grad_norm_(agent.actor_online.parameters(), 1)
agent.actor_optimizer.step()