# Continuous Control

---

In this notebook, a DDPG agent is trained on the Reacher environment.

In [1]:
%matplotlib inline
import numpy as np
import torch

from matplotlib import pyplot as plt
from collections import deque
from unityagents import UnityEnvironment

from ddpg_agent import Agent
from opt import opt
from train_agent import train_agent
from utils import get_settings

### 1. Loading environment and Agent

In [2]:
env = UnityEnvironment(file_name='Reacher.app')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [3]:
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
env_info = env.reset(train_mode=True)[brain_name]
state_size, action_size = get_settings(env_info, brain)

Number of agents: 1
Number of actions: 4
States look like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]
States have length: 33


### 2. Hyperparameters to configure

* RANDOM_SEED : random seed
* BUFFER_SIZE : replay buffer size
* BATCH_SIZE : minibatch size
* GAMMA : discount factor
* TAU : for soft update of target parameters
* LR_ACTOR : learning rate of the actor
* LR_CRITIC : learning rate of the critic
* WEIGHT_DECAY : L2 weight decay
* NUM_EPISODES : number of episodes to train
* MAX_T : maximum number of time steps per episode
* SUCCESS_SCORE : success score

In [4]:
RANDOM_SEED = 0
BUFFER_SIZE = int(1e5)  
BATCH_SIZE = 128        
GAMMA = 0.99            
TAU = 1e-3              
LR_ACTOR = 1e-4         
LR_CRITIC = 3e-4        
WEIGHT_DECAY = 0.0001        
NUM_EPISODES = 100     
MAX_T = 100            
SUCCESS_SCORE = 30      
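
To make the role of `TAU` concrete, here is a minimal sketch of a soft update, in which the target parameters move a small step toward the local parameters. It uses plain Python lists as hypothetical stand-ins for network weights; the actual `Agent` in `ddpg_agent` is not shown in this notebook and may implement the update differently.

```python
def soft_update(local_params, target_params, tau):
    """Blend local parameters into the target parameters in place.

    Applies: target <- tau * local + (1 - tau) * target
    """
    for i, (local, target) in enumerate(zip(local_params, target_params)):
        target_params[i] = tau * local + (1.0 - tau) * target


TAU = 1e-3
local = [1.0, -2.0, 0.5]   # hypothetical local-network weights
target = [0.0, 0.0, 0.0]   # hypothetical target-network weights

soft_update(local, target, TAU)
print(target)              # each target weight has moved 0.1% of the way toward the local weight
```

With a small `TAU` such as `1e-3`, the target network changes slowly, which keeps the learning targets from shifting abruptly between updates.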


### 3. DDPG
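
DDPG trains its critic toward a one-step bootstrapped target computed from the target actor and target critic. The sketch below shows only that target computation on scalar values; `td_target` and its arguments are hypothetical names for illustration, not the code in `ddpg_agent`.

```python
def td_target(reward, next_q, done, gamma):
    """One-step critic target: r + gamma * Q'(s', mu'(s')) * (1 - done).

    next_q stands in for the target critic's value of the next state
    under the target actor's action.
    """
    return reward + gamma * next_q * (1.0 - float(done))


GAMMA = 0.99
y_mid = td_target(reward=0.1, next_q=2.0, done=False, gamma=GAMMA)  # bootstraps from next_q
y_end = td_target(reward=0.1, next_q=2.0, done=True, gamma=GAMMA)   # terminal step: reward only
print(y_mid, y_end)
```

On a terminal step the bootstrap term is dropped, so the target reduces to the immediate reward.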

### 4. Train an agent

In [5]:
agent = Agent(state_size, action_size, random_seed=RANDOM_SEED, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, gamma=GAMMA, tau=TAU, lr_actor=LR_ACTOR, lr_critic=LR_CRITIC, weight_decay=WEIGHT_DECAY)


In [None]:
scores = train_agent(env, agent, brain_name, n_episodes=NUM_EPISODES, max_t=MAX_T, success_score=SUCCESS_SCORE)
# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

Episode 7	Average Score: 0.30	Average Time: 61.842291831970215

With random seed 0, this DDPG agent solves the task. The figure below plots the average score per episode.
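
The "Average Score" in the training log is a running average over recent episodes. The internals of `train_agent` are not shown here, so the following is only a sketch of how such a success check could work, assuming a fixed-size window held in a `deque`; the function name, the window size, and the scores are made up for illustration.

```python
from collections import deque


def first_solved_episode(scores, success_score, window=100):
    """Return the first episode (1-based) at which the mean of the last
    `window` scores reaches `success_score`, or None if it never does."""
    recent = deque(maxlen=window)  # old scores fall off automatically
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= success_score:
            return episode
    return None


# Made-up scores that rise by 1 each episode, checked with a 10-episode window.
scores_demo = [float(i) for i in range(1, 61)]
solved_at = first_solved_episode(scores_demo, success_score=30, window=10)
print(solved_at)
```

Requiring a full window before declaring success avoids stopping early on a few lucky episodes.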

### 5. Run the agent in inference mode

In [5]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment
num_agents = len(env_info.agents)                      # number of agents in the environment
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    # NOTE: actions are sampled at random here, so this episode scores a random policy
    actions = np.random.randn(num_agents, action_size) # select a random action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += rewards                                  # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.1599999964237213


When finished, you can close the environment.

In [7]:
env.close()