# Continuous Control

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This step may take a few minutes!

Both versions of the environment are already saved in the Workspace and can be accessed at the file paths provided below.

Select one of the two options below by uncommenting the corresponding `env = ...` line before running the cell.

In [1]:
!pip -q install ./python
from unityagents import UnityEnvironment
import numpy as np

# select this option to load version 1 (with a single agent) of the environment
# env = UnityEnvironment(file_name='/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64')

# select this option to load version 2 (with 20 agents) of the environment
# env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.5 which is incompatible.


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.
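
The cell that performs this selection does not appear in this export. A minimal sketch, mirroring the two lines used in the training cells further down, might look like the following. The helper name `get_default_brain` is hypothetical, and the stub object only stands in for a `UnityEnvironment` so that the snippet runs without Unity; the brain name and action size are taken from the log output shown later in this notebook.

```python
from types import SimpleNamespace


def get_default_brain(env):
    """Return (brain_name, brain) for the first brain the environment exposes.

    Hypothetical helper, not part of unityagents. It only assumes that `env`
    has `brain_names` (a list) and `brains` (a dict keyed by brain name).
    """
    brain_name = env.brain_names[0]
    brain = env.brains[brain_name]
    return brain_name, brain


# Stand-in for a UnityEnvironment so the sketch runs without Unity installed.
stub_env = SimpleNamespace(
    brain_names=["ReacherBrain"],
    brains={"ReacherBrain": SimpleNamespace(vector_action_space_size=4)},
)

brain_name, brain = get_default_brain(stub_env)
print(brain_name, brain.vector_action_space_size)
```

With a real environment, the same two attribute lookups (`env.brain_names[0]` and `env.brains[brain_name]`) give the `brain_name` and `brain` variables that the later cells rely on.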

### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [  0.00000000e+00  -4.00000000e+00   0.00000000e+00   1.00000000e+00
  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00  -1.00000000e+01   0.00000000e+00
   1.00000000e+00  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   5.75471878e+00  -1.00000000e+00
   5.55726671e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
  -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agents while they are training**, and you should set `train_mode=True` to restart the environment.

In [5]:
env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 0.0
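
To make the score bookkeeping in the loop above concrete, here is a small sketch with made-up rewards for three agents over two steps. The reward values and the name `fake_rewards_per_step` are illustrative assumptions, not output from the Reacher environment; the sketch only shows how `scores += rewards` accumulates one total per agent and how the reported figure is the mean of those totals.

```python
import numpy as np

num_agents = 3                      # illustrative; the notebook uses 1 or 20
scores = np.zeros(num_agents)       # one running total per agent

# Made-up per-step rewards, shaped like env_info.rewards (a list of floats).
fake_rewards_per_step = [
    [0.0, 0.5, 0.0],
    [0.5, 0.5, 0.0],
]

for rewards in fake_rewards_per_step:
    scores += rewards               # NumPy adds the list element-wise

print(scores)            # per-agent totals
print(np.mean(scores))   # the "averaged over agents" figure
```

A purely random policy rarely touches the target, which is consistent with the 0.0 total reported above.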


When finished, you can close the environment.

In [6]:
env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agents while they are training.  However, **_after training the agents_**, you can download the saved model weights to watch the agents on your own machine! 

In [1]:
import numpy as np
import random
import copy
from collections import namedtuple, deque
from ddpg_agent import Agent
from unityagents import UnityEnvironment
import torch
import torch.nn.functional as F
import torch.optim as optim


#select this option to load version 2 (with 20 agents) of the environment
env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')



INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [8]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [  0.00000000e+00  -4.00000000e+00   0.00000000e+00   1.00000000e+00
  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00  -1.00000000e+01   0.00000000e+00
   1.00000000e+00  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   7.90150833e+00  -1.00000000e+00
   1.25147629e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
  -5.22214413e-01]


In [7]:
from tqdm import tqdm_notebook

In [None]:
# number of agents
num_agents = len(env_info.agents)
num_steps = 1000
num_episodes = 500

scores_window = deque(maxlen=100)
scores = np.zeros(num_agents)
scores_episode = []
agents = []

for i in range(num_agents):
    agents.append(Agent(state_size, action_size, random_seed=0))

for episode in tqdm_notebook(range(1, num_episodes+1)):
    print(f"starting episode {episode} ")
    env_info = env.reset(train_mode=True)[brain_name]
    states = env_info.vector_observations

    for agent in agents:
        agent.reset()

    scores = np.zeros(num_agents)

    for step in range(num_steps):
        actions = np.array([agents[i].act(states[i]) for i in range(num_agents)])

        env_info = env.step(actions)[brain_name]

        next_states = env_info.vector_observations
        rewards = env_info.rewards
        # print(rewards)
        dones = env_info.local_done

        for i in range(num_agents):
            agents[i].step(step, states[i], actions[i], rewards[i], next_states[i], dones[i]) 

        states = next_states
        scores += rewards

        if np.any(dones):
            break
    score = np.mean(scores)
    scores_window.append(score)       # save most recent score
    scores_episode.append(score)

    # report this episode's score and the running average over the last (up to) 100 episodes
    print(f"Episode:{episode}, Score:{score:.2f}, Average:{np.mean(scores_window):.2f}")

    if np.mean(scores_window)>=30.0:
        print(f"solved in {episode} episodes with average score of {np.mean(scores_window)}")
        torch.save(Agent.actor_local.state_dict(), 'checkpoint_actor.pth')
        torch.save(Agent.critic_local.state_dict(), 'checkpoint_critic.pth')
        break


# plot the scores
import matplotlib.pyplot as plt  # not imported earlier in this notebook

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores_episode)), scores_episode)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
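
The stopping test in the training loop above compares the mean of `scores_window` with 30.0. Because the deque can hold fewer than 100 entries during early episodes, that mean may cover only a handful of scores. One possible stricter variant, which is an assumption and not the author's code, also waits until the window is full. The name `is_solved` and its parameters are hypothetical.

```python
from collections import deque
from statistics import mean


def is_solved(scores_window, target=30.0, window=100):
    """Hypothetical helper: True once `window` scores exist and their mean >= target."""
    return len(scores_window) >= window and mean(scores_window) >= target


# Illustrative data only: a short burst of high scores should not count as solved.
scores_window = deque(maxlen=100)
scores_window.extend([35.0] * 10)
print(is_solved(scores_window))      # window not yet full

scores_window.extend([35.0] * 90)
print(is_solved(scores_window))      # 100 scores recorded, mean is 35.0
```

Inside the training loop, such a check could stand in for the bare `np.mean(scores_window)>=30.0` comparison if a full 100-episode average is wanted before saving the checkpoints.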

In [2]:
# using saved model
import numpy as np
import random
import copy
from collections import namedtuple, deque
from ddpg_agent import Agent
from unityagents import UnityEnvironment
import torch
import torch.nn.functional as F
import torch.optim as optim


#select this option to load version 2 (with 20 agents) of the environment
env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [3]:
# get the default brain
from ddpg_model import Actor, Critic
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

# reset the environment
env_info = env.reset(train_mode=False)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [  0.00000000e+00  -4.00000000e+00   0.00000000e+00   1.00000000e+00
  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00  -1.00000000e+01   0.00000000e+00
   1.00000000e+00  -0.00000000e+00  -0.00000000e+00  -4.37113883e-08
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   5.75471878e+00  -1.00000000e+00
   5.55726624e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
  -1.68164849e-01]


In [4]:
scores = np.zeros(num_agents)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

random_seed=0
Agent.actor_local = Actor(state_size, action_size, random_seed).to(device)
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
Agent.actor_local.load_state_dict(torch.load('checkpoint_actor.pth', map_location=device))

agents =[] 

for i in range(num_agents):
    agents.append(Agent(state_size, action_size, random_seed=0))
    
while True:
    actions = np.array([agents[i].act(states[i]) for i in range(num_agents)])

    env_info = env.step(actions)[brain_name]        # send the action to the environment
    next_states = env_info.vector_observations     # get the next state
    rewards = env_info.rewards                     # get the reward
    dones = env_info.local_done        

    states = next_states
    scores += rewards

    print(f'Score: {np.mean(scores)}')
    
    if np.any(dones):
        break
        
print(f'Scores: {scores}')

Initialising ReplayBuffer
Score: 0.0
Score: 0.0
Score: 0.0
Score: 0.0
Score: 0.0
Score: 0.0
Score: 0.0
Score: 0.0
Score: 0.0
Score: 0.0024999999441206455
Score: 0.006999999843537807
Score: 0.013499999698251487
Score: 0.025499999430030583
Score: 0.0429999990388751
Score: 0.06099999863654375
Score: 0.07949999822303652
Score: 0.10199999772012233
Score: 0.12949999710544943
Score: 0.1579999964684248
Score: 0.1904999957419932
Score: 0.22049999507144094
Score: 0.25149999437853693
Score: 0.2834999936632812
Score: 0.31449999297037723
Score: 0.34649999225512146
Score: 0.3804999914951622
Score: 0.41449999073520305
Score: 0.4484999899752438
Score: 0.4824999892152846
Score: 0.5164999884553254
Score: 0.5504999876953661
Score: 0.584499986935407
Score: 0.6184999861754477
Score: 0.6529999854043126
Score: 0.68899998459965
Score: 0.7249999837949872
Score: 0.758999983035028
Score: 0.7929999822750687
Score: 0.8269999815151096
Score: 0.8584999808110296
Score: 0.8899999801069498
Score: 0.92149997940287
...

Score: 10.9079997561872
Score: 10.937999755516648
Score: 10.969999754801393
Score: 11.001999754086137
Score: 11.033999753370882
Score: 11.065999752655625
Score: 11.09799975194037
Score: 11.131499751191587
Score: 11.167499750386924
Score: 11.205499749537557
Score: 11.24199974872172
Score: 11.277999747917056
Score: 11.313499747123569
Score: 11.34749974636361
Score: 11.381499745603652
Score: 11.418499744776636
Score: 11.45649974392727
Score: 11.492999743111431
Score: 11.52899974230677
Score: 11.564999741502106
Score: 11.600999740697443
Score: 11.636999739892781
Score: 11.672999739088118
Score: 11.708999738283456
Score: 11.74349973751232
Score: 11.777499736752361
Score: 11.811499735992403
Score: 11.845499735232442
Score: 11.877499734517187
Score: 11.909499733801931
Score: 11.941499733086676
Score: 11.973499732371419
Score: 12.003499731700867
Score: 12.03299973104149
Score: 12.063999730348588
Score: 12.09599972963333
Score: 12.127999728918075
Score: 12.161499728169293
Score: 12.195499727409

Score: 22.153999504819513
Score: 22.18999950401485
Score: 22.225999503210186
Score: 22.261999502405523
Score: 22.29649950163439
Score: 22.33049950087443
Score: 22.36449950011447
Score: 22.39849949935451
Score: 22.432499498594552
Score: 22.466499497834594
Score: 22.500499497074635
Score: 22.534499496314673
Score: 22.568499495554715
Score: 22.60399949476123
Score: 22.639999493956566
Score: 22.6779994931072
Score: 22.715999492257833
Score: 22.753499491419642
Score: 22.788999490626157
Score: 22.8229994898662
Score: 22.856999489106236
Score: 22.890999488346278
Score: 22.926999487541615
Score: 22.96149948677048
Score: 22.995499486010523
Score: 23.030499485228212
Score: 23.06449948446825
Score: 23.09799948371947
Score: 23.129999483004212
Score: 23.161999482288955
Score: 23.1939994815737
Score: 23.225999480858444
Score: 23.25949948010966
Score: 23.293499479349702
Score: 23.327499478589743
Score: 23.362999477796258
Score: 23.40099947694689
Score: 23.437499476131052
Score: 23.47349947532639
...

Score: 33.20399925783276
Score: 33.23799925707281
Score: 33.271999256312846
Score: 33.30599925555289
Score: 33.337999254837634
Score: 33.369999254122376
Score: 33.399999253451824
Score: 33.429499252792446
Score: 33.45449925223365
Score: 33.478499251697215
Score: 33.50249925116077
Score: 33.526499250624326
Score: 33.54899925012141
Score: 33.57399924956262
Score: 33.60199924893677
Score: 33.629999248310924
Score: 33.657999247685076
Score: 33.68599924705923
Score: 33.713999246433374
Score: 33.74199924580753
Score: 33.767999245226385
Score: 33.793999244645235
Score: 33.81949924407527
Score: 33.845499243494125
Score: 33.8719992429018
Score: 33.898499242309484
Score: 33.92599924169481
Score: 33.953999241068956
Score: 33.983999240398404
Scores: [ 32.42999928  33.51999925  35.54999921  38.85999913  32.75999927
  34.17999924  33.00999926  36.56999918  30.03999933  35.7499992
  32.21999928  36.51999918  31.4499993   35.21999921  31.89999929
  34.79999922  32.18999928  35.09999922  35.8999992   3