# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

## 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [3]:
# Jedi Not Working
#  %config Completer.use_jedi = False
! pip install ../python

Processing /home/workspace/python
Collecting tensorflow==1.7.1 (from unityagents==0.4.0)
  Using cached https://files.pythonhosted.org/packages/66/83/35c3f53129dfc80d65ebbe07ef0575263c3c05cc37f8c713674dcedcea6f/tensorflow-1.7.1-cp36-cp36m-manylinux1_x86_64.whl
Collecting jupyter (from unityagents==0.4.0)
  Using cached https://files.pythonhosted.org/packages/83/df/0f5dd132200728a86190397e1ea87cd76244e42d39ec5e88efd25b2abd7e/jupyter-1.0.0-py2.py3-none-any.whl
Collecting docopt (from unityagents==0.4.0)
Collecting qtconsole (from jupyter->unityagents==0.4.0)
  Using cached https://files.pythonhosted.org/packages/3a/57/c8fc1fc6fb6bc03caca20ace9cd0ac0e16cc052b51cbe3acbeeb53abcb18/qtconsole-5.1.1-py3-none-any.whl
Collecting jupyter-console (from jupyter->unityagents==0.4.0)
  Using cached https://files.pythonhosted.org/packages/59/cd/aa2670ffc99eb3e5bbe2294c71e4bf46a9804af4f378d09d7a8950996c9b/jupyter_console-6.4.0-py3-none-any.whl
Collecting qtpy (from qtconsole->jupyter->unityagents==0.4.

Collecting prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 (from jupyter-console->jupyter->unityagents==0.4.0)
  Using cached https://files.pythonhosted.org/packages/c6/37/ec72228971dbaf191243b8ee383c6a3834b5cde23daab066dfbfbbd5438b/prompt_toolkit-3.0.20-py3-none-any.whl
Building wheels for collected packages: unityagents
  Running setup.py bdist_wheel for unityagents ... done
  Stored in directory: /tmp/pip-ephem-wheel-cache-8rn14m_h/wheels/97/7a/24/09937717b9737178ae827bcef33ba219b540efd55be210010c
Successfully built unityagents
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.20 which is incompatible.
tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
Installing collected packages: tensorflow, qtpy, qtconsole, prompt-toolkit, jupyter-console, jupyter, docopt, unityagents, widgetsnbextension
  Found existing installation: prompt-toolkit 1.0.15
    Uninstalling pr

In [4]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

## 4.1 DDPG

### 4.1.1 Project Architecture

The architecture of this project extends the model architecture used in
project 1. Primarily, I built an environment manager API that abstracts the
minor differences between OpenAI Gym environments and Unity environments away
from the Trainer and Agent models, so that either environment can be used to
train an agent with minimal to no change in code.

This leads to the three primary classes within this project: the
`EnvironmentMgr`, the `Trainer`, and the `Agent` interfaces.

1. `EnvironmentMgr` - Each `EnvironmentMgr` class contains common commands the
   `Trainer` can interface with to command the environment to `start`, `step`,
   `reset`, `get_evn`, and `close`.
   
2. `Trainer` - This class is intended to hold all of the properties for the
   experiment and manipulate both the `Agent` and the `Environment`.
   
3. `Agent` - This is the class that holds the reinforcement learning agent and
   maintains a similar structure to other implementations, with minor edits for
   function encapsulation.
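
To illustrate the idea behind the environment manager, here is a minimal sketch of a shared interface and a trainer-side loop that touches only that interface. The method signatures, the return shape of `step`, the toy `CountingEnvMgr`, and `run_episode` are all assumptions made for illustration; the project's real classes may differ, and `get_evn` is left out because its behavior is not described here.

```python
from abc import ABC, abstractmethod


class EnvironmentMgr(ABC):
    """Hypothetical common interface; the project's real class may differ."""

    @abstractmethod
    def start(self):
        """Launch the environment and return a handle to it."""

    @abstractmethod
    def reset(self):
        """Begin a new episode and return the initial state."""

    @abstractmethod
    def step(self, action):
        """Apply an action and return (next_state, reward, done)."""

    @abstractmethod
    def close(self):
        """Release the environment's resources."""


class CountingEnvMgr(EnvironmentMgr):
    """Toy stand-in environment: the state is just a step counter."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0
        self.state_size = 1
        self.action_size = 1

    def start(self):
        self.t = 0
        return self

    def reset(self):
        self.t = 0
        return [float(self.t)]

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return [float(self.t)], 1.0, done

    def close(self):
        pass


def run_episode(envh, policy):
    """Trainer-side loop that relies only on the shared interface."""
    state = envh.reset()
    total, done = 0.0, False
    while not done:
        state, reward, done = envh.step(policy(state))
        total += reward
    return total
```

Because `run_episode` never refers to Gym or Unity directly, swapping one manager for another should not require changes on the trainer side.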

#### 4.1.1.1 Agent Selection

Due to the difficulty of the problem and the number of existing implementations
that use Deep Deterministic Policy Gradient (DDPG), I chose to implement a
similar version so that I could compare my code against the available
resources and solicit feedback from others reviewing my code.

Using the DDPG implementations from the Bipedal and Pendulum models as starting
points, I implemented my own version of the DDPG agent. I implemented the
Ornstein-Uhlenbeck process to add noise to my model, similar to the example.
Following the advice of the prompt, I also implemented methods to restrict
learning for the target Actor and Critic models, as well as a way to
randomly sample a subset of agents (if n>1) for learning.
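
For reference, a minimal sketch of an Ornstein-Uhlenbeck noise process follows. The class name `OUNoise` and the values of `mu`, `theta`, `sigma`, and `seed` are assumptions chosen for illustration, not necessarily those used in `reacher_agents`.

```python
import numpy as np


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        # mu, theta, sigma and seed are assumed defaults, not the project's values.
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.RandomState(seed)
        self.reset()

    def reset(self):
        """Restart the process at its long-run mean."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance one step: dx = theta * (mu - x) + sigma * N(0, 1)."""
        gaussian = self.rng.standard_normal(self.state.shape)
        dx = self.theta * (self.mu - self.state) + self.sigma * gaussian
        self.state = self.state + dx
        return self.state


noise = OUNoise(size=4)
action = np.zeros(4)  # stand-in for the actor's output
noisy_action = np.clip(action + noise.sample(), -1.0, 1.0)
```

The mean-reverting term `theta * (mu - x)` keeps consecutive samples correlated, which tends to produce smoother exploration than independent Gaussian noise.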

#### 4.1.1.2 Neural Network Model Architecture

After several reviews with fellow students and discussions with mentors in the
forums, I selected an `Actor` model consisting of `4` fully connected layers,
with hidden layers `256`, `128`, and `64` units wide, input units equal to the
state size, and output units equal to the action size. For the `Critic` model,
I again constructed `4` fully connected layers with hidden layers `256`, `128`,
and `64` units wide but, following the recommendation for agents of this
structure, injected the states as inputs into the first layer and the actions
into the second. The `Critic` outputs a single node.
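
The layout described above can be sketched in PyTorch as follows. This is an illustrative reconstruction from the description, not the code in `reacher_agents`; the class layout and attribute names are assumptions, while the state size of `33` and action size of `4` match the Reacher environment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a state to an action in [-1, 1] per dimension."""

    def __init__(self, state_size, action_size, hidden=(256, 128, 64)):
        super().__init__()
        sizes = (state_size, *hidden)
        self.hidden = nn.ModuleList(
            [nn.Linear(i, o) for i, o in zip(sizes[:-1], sizes[1:])]
        )
        self.out = nn.Linear(hidden[-1], action_size)

    def forward(self, state):
        x = state
        for layer in self.hidden:
            x = F.relu(layer(x))
        return torch.tanh(self.out(x))


class Critic(nn.Module):
    """Maps a (state, action) pair to a single Q-value."""

    def __init__(self, state_size, action_size, hidden=(256, 128, 64)):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden[0])
        # Actions are concatenated onto the first layer's output.
        self.fc2 = nn.Linear(hidden[0] + action_size, hidden[1])
        self.fc3 = nn.Linear(hidden[1], hidden[2])
        self.out = nn.Linear(hidden[2], 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat((x, action), dim=1)))
        x = F.relu(self.fc3(x))
        return self.out(x)


states = torch.randn(8, 33)  # batch of 8 Reacher observations
actor, critic = Actor(33, 4), Critic(33, 4)
actions = actor(states)
q_values = critic(states, actions)
```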

For activation functions, the `ReLU` function was used to minimize complexity and
the hyperbolic tangent function (`tanh`) was used as output for the `Actor`.

Weights were initialized from a uniform distribution over
$\pm\frac{1}{\sqrt{N_{input}}}$ for all of the layers save for the final layer, where
a uniform distribution over $\pm 3 \times 10^{-3}$ was used.
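
As a worked example of these bounds, the sketch below samples weight matrices with NumPy. The helper names `hidden_init_bound` and `init_weights` are hypothetical, and the layer sizes are taken from the architecture described earlier.

```python
import numpy as np


def hidden_init_bound(n_input):
    """Half-width of the uniform range for a layer with n_input inputs."""
    return 1.0 / np.sqrt(n_input)


def init_weights(n_input, n_output, final=False, seed=0):
    """Sample an (n_output, n_input) weight matrix from U(-bound, bound)."""
    bound = 3e-3 if final else hidden_init_bound(n_input)
    rng = np.random.RandomState(seed)
    return rng.uniform(-bound, bound, size=(n_output, n_input))


w_hidden = init_weights(33, 256)           # bound = 1/sqrt(33), about 0.174
w_final = init_weights(64, 4, final=True)  # bound = 0.003
```

The much narrower range on the final layer keeps the initial outputs close to zero.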

### 4.1.2 Primary Import and Utility Functionality

In [1]:
# %config Completer.use_jedi = True
!pip -q install toml
!pip install ../python
import numpy as np
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from reacher_agents.ddpg_agent import DDPGAgent
from reacher_agents.trainers import MultiAgentTrainer, SingleAgentTrainer

ENV_TYPE = 'unity'      # enum ('unity', 'gym') = choose which environment to run
CLOUD = True            # True if running in Udacity venv
BUFFER_SIZE = int(1e6)  # Replay buffer size
BATCH_SIZE = 16         # minibatch size
N_EPISODES = 1000       # 300|3000 max number of episodes to run
MAX_T = 1000            # Max time steps within an episode
N_WORKERS = 20          # number of workers to run in environment
MAX_WORKERS = 10        # number of workers to learn from an episode, ignored if N_WORKERS < MAX_WORKERS
LEARN_F = 20            # Learning frequency within episodes
GAMMA = 0.99            # discount factor
TAU = 1e-3              # soft update target parameter
LR_ACTOR = 1e-4         # learning rate for the actor
LR_CRITIC = 1e-4        # learning rate for the critic
WEIGHT_DECAY = 0.       #0.0001 - L2 weight decay parameter
WINDOW_LEN = 100        # window length for averaging
ACTOR_HIDDEN = (256, 128)
CRITIC_HIDDEN = (256, 128)
ADD_NOISE = False

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.20 which is incompatible.


In [None]:
!pip freeze > requirements.txt

In [2]:
def main():
    if ENV_TYPE.lower() == 'gym':
        import gym
        from reacher_agents.gym_environments import GymContinuousEnvMgr
    #     scenarios = {'LunarLanderContinuous-v2',
    #                  'BipedalWalker-v3',
    #                  'Pendulum-v0'}
        envh = GymContinuousEnvMgr('Pendulum-v0')
        root_name = 'gym'
        Trainer = SingleAgentTrainer
        upper_bound = 2.0
        solved = -250
    else:
        from reacher_agents.unity_environments import UnityEnvMgr
        root_name = 'unity'
        if CLOUD:
            # Headless Linux builds available in the Udacity workspace
            if N_WORKERS == 1:
                file_name = '/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64'
            else:
                file_name = '/data/Reacher_Linux_NoVis/Reacher.x86_64'
        elif N_WORKERS == 1:
            file_name = 'envs/Reacher_Windows_x86_64-one-agent/Reacher.exe'
        else:
            file_name = 'envs/Reacher_Windows_x86_64-twenty-agents/Reacher.exe'
        # Build the Unity manager once, inside this branch, so a Gym run
        # with CLOUD=True never references the unimported UnityEnvMgr
        envh = UnityEnvMgr(file_name)
        Trainer = MultiAgentTrainer
        upper_bound = 1.0
        solved = 30.0

    env = envh.start()
    state_size = envh.state_size
    action_size = envh.action_size
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # device = torch.device("cpu")
    agent = DDPGAgent(
        state_size=state_size,
        action_size=action_size,
        buffer_size=BUFFER_SIZE,
        batch_size=BATCH_SIZE,
        gamma=GAMMA,
        tau=TAU,
        lr_actor=LR_ACTOR,
        lr_critic=LR_CRITIC,
        learn_f=LEARN_F,
        weight_decay=WEIGHT_DECAY,
        device=device,
        random_seed=42,
        upper_bound=upper_bound,
        actor_hidden=ACTOR_HIDDEN,
        critic_hidden=CRITIC_HIDDEN,
        add_noise=ADD_NOISE,
    )
    trainer = Trainer(
        agent=agent,
        env=envh,
        n_episodes=N_EPISODES,
        max_t=MAX_T,
        window_len=WINDOW_LEN,
        solved=solved,
        n_workers=N_WORKERS,
        max_workers=MAX_WORKERS,  # note can be lower than n
        save_root=root_name,
    )
    return envh, agent, trainer

### 4.1.4 Grid Search

This section investigates learning rates of the `Actor` and `Critic` models as
well as learning frequencies.

Conducting a grid search over the actor and critic learning rates, I observed
the following results from 50-episode runs:
```
Actor LR: 1.0e-04	Critic LR: 1.0e-04
Episode 50	Average Score: 0.87

Actor LR: 1.0e-04	Critic LR: 1.0e-03
Episode 50	Average Score: 0.47

Actor LR: 1.0e-04	Critic LR: 2.0e-03
Episode 50	Average Score: 0.36

Actor LR: 1.0e-03	Critic LR: 1.0e-04
Episode 50	Average Score: 0.86

Actor LR: 1.0e-03	Critic LR: 1.0e-03
Episode 50	Average Score: 0.86

Actor LR: 1.0e-03	Critic LR: 2.0e-03
Episode 50	Average Score: 0.04

Actor LR: 2.0e-03	Critic LR: 1.0e-04
Episode 50	Average Score: 0.85

Actor LR: 2.0e-03	Critic LR: 1.0e-03
Episode 50	Average Score: 0.04

Actor LR: 2.0e-03	Critic LR: 2.0e-03
Episode 50	Average Score: 0.65
```

The highest scores after 50 episodes (0.85 to 0.87) came with a critic
learning rate of `1e-4`, regardless of the actor learning rate; only the
`1e-3`/`1e-3` pair matched them (0.86). Larger critic learning rates generally
scored lower.
I will select the learning rates `2e-3` and `1e-4` for the actor and critic,
respectively.
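
A grid search of this kind can be sketched with `itertools.product`. The `grid_search` helper and the placeholder `toy_evaluate` function are hypothetical; in practice `evaluate` would run a short training session and return the average score. The toy scoring rule is arbitrary and does not reproduce the results above.

```python
from itertools import product


def grid_search(evaluate, actor_lrs, critic_lrs):
    """Score every (actor LR, critic LR) pair and return results, best first."""
    results = []
    for lr_actor, lr_critic in product(actor_lrs, critic_lrs):
        score = evaluate(lr_actor, lr_critic)
        print(f'Actor LR: {lr_actor:.1e}\tCritic LR: {lr_critic:.1e}\tScore: {score:.2f}')
        results.append((score, lr_actor, lr_critic))
    return sorted(results, reverse=True)


def toy_evaluate(lr_actor, lr_critic):
    """Placeholder score with an arbitrary peak; replace with a training run."""
    return 1.0 - abs(lr_actor - 1e-3) * 100 - abs(lr_critic - 1e-4) * 100


ranked = grid_search(toy_evaluate, [1e-4, 1e-3, 2e-3], [1e-4, 1e-3, 2e-3])
best_score, best_lr_actor, best_lr_critic = ranked[0]
```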

When investigating the learning period for soft updating, the following was
observed:
```
Actor LR: 2.0e-03	Critic LR: 1.0e-04	L_Period: 1
Episode 50	Average Score: 0.86

Actor LR: 2.0e-03	Critic LR: 1.0e-04	L_Period: 5
Episode 50	Average Score: 0.77

Actor LR: 2.0e-03	Critic LR: 1.0e-04	L_Period: 10
Episode 50	Average Score: 0.44

Actor LR: 2.0e-03	Critic LR: 1.0e-04	L_Period: 15
Episode 50	Average Score: 0.66

Actor LR: 2.0e-03	Critic LR: 1.0e-04	L_Period: 20
Episode 50	Average Score: 0.66
```

Learning was fastest with the shortest period and generally slowed as the
period grew, although not monotonically (a period of `10` scored lowest). Even
so, I will likely need to maintain a period of `20` time steps, taking into
account previous advice from Udacity.
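
To make the interaction between the learning period and the soft update concrete, here is a minimal sketch using NumPy arrays in place of network parameters. The helper names `soft_update` and `should_learn` are assumptions; the values `tau=1e-3` and a period of `20` match the constants defined earlier in this notebook.

```python
import numpy as np


def soft_update(local_params, target_params, tau):
    """Blend in place: target = tau * local + (1 - tau) * target."""
    for local, target in zip(local_params, target_params):
        target *= (1.0 - tau)
        target += tau * local


def should_learn(t_step, learn_f):
    """True on every learn_f-th environment step."""
    return (t_step + 1) % learn_f == 0


local = [np.ones(3)]    # stand-in for the local network's weights
target = [np.zeros(3)]  # stand-in for the target network's weights
n_updates = 0
for t in range(100):
    if should_learn(t, learn_f=20):
        soft_update(local, target, tau=1e-3)
        n_updates += 1
```

With a period of `20`, only 5 updates happen in 100 steps, so the target weights move very little toward the local ones. This is one reason a long period slows early learning.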

### 4.1.5 Run for Record

After reviewing implementations from other students as well as comments from
the Mentor Advice board, I've chosen the following hyperparameters for
the run for record.

In [3]:
envh, agent, trainer = main()

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


TypeError: __init__() got an unexpected keyword argument 'actor_hidden'

In [None]:
trainer.agent.device

In [None]:
agent.actor_local

In [None]:
agent.critic_local

In [None]:
scores = trainer.train()

### 4.1.6 Visualize

Visualize the scores of your trained agent. 

* The `i_map` parameter rotates through the seaborn color palette (paired in 
  groups of 2)
  * 0: blue
  * 1: green
  * 2: red
  * 3: orange
  * 4: purple


In [None]:
def plot_scores(trainer, i_map=0):
    sns.set_style('darkgrid')
    sns.set_context('talk')
    sns.set_palette('Paired')
    cmap = sns.color_palette('Paired')
    if trainer.n_workers > 1:
        scores = np.mean(np.array(trainer.scores_).squeeze(), 1)
    else:
        scores = np.array(trainer.scores_).squeeze()
    alr, clr, lf = trainer.agent.lr_actor, trainer.agent.lr_critic, trainer.agent.learn_f
    score_df = pd.DataFrame({'scores': scores})
    score_df = score_df.assign(mean=lambda df: df.rolling(10).mean()['scores'])

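    fig, ax = plt.subplots(1, 1, figsize=(10, 8))

    # Each i_map selects one light/dark color pair; % 5 covers the five pairs listed above
    start = 2 * (i_map % 5)
    ax = score_df.plot(ax=ax, color=cmap[start:start + 2])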
    ax.set_title(f'DDPG Scores vs Time (LR=({alr:.1e}, {clr:.1e}), Lf={lf})')
    ax.set_xlabel('Episode #')
    ax.set_ylabel('Score')
    plt.show()
plot_scores(trainer)

### 4.1.7 Evaluation

In [None]:
agent.load(
    r'D:\udacity\deep-rl\projects/p2_reacher/cont-control/multi-checkpoint_actor-7.5.pth',
    r'D:\udacity\deep-rl\projects/p2_reacher/cont-control/multi-checkpoint_critic-7.5.pth',
)
# `Trainer` is local to main(), so reuse the class of the trainer it returned
etrainer = type(trainer)(
    agent=agent,
    env=envh,
    n_workers=N_WORKERS,
    n_episodes=100,
    max_t=1000,
    window_len=100,
    solved=30.0,
    max_workers=10,
)

In [None]:
scores = etrainer.eval(n_episodes=100, render=False)

In [None]:
plot_scores(etrainer, i_map=1)

In [None]:
envh.close()

## 4.2 Results

As shown above, the DDPG implementation provides consistent, albeit slow,
learning. The model was able to solve the environment using `20` agents in 
`TBR` episodes, much slower than what was demonstrated in the problem prompt.
The low learning rate and the reduction of soft updating to every `20` steps 
contributed to this. However, higher learning rates produced erratic or poor
performance at low episode counts (<50). Clearly, more tuning could improve
the learning speed.
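
For clarity on what "solved" means here, the sketch below finds the first episode at which the rolling average score reaches the target. The function name `episodes_to_solve` is hypothetical; the threshold of `30.0` and the window of `100` episodes match the `solved` and `WINDOW_LEN` values used above, and the synthetic scores are for illustration only.

```python
from collections import deque


def episodes_to_solve(episode_scores, solved=30.0, window_len=100):
    """Return the 1-based episode where the rolling mean first reaches `solved`."""
    window = deque(maxlen=window_len)
    for i_episode, score in enumerate(episode_scores, start=1):
        window.append(score)
        if len(window) == window_len and sum(window) / window_len >= solved:
            return i_episode
    return None  # never solved within the given scores


# Synthetic scores that ramp up and then plateau, purely for illustration.
synthetic_scores = [min(40.0, 0.2 * i) for i in range(400)]
print(episodes_to_solve(synthetic_scores))
```

For the 20-agent environment, each entry in `episode_scores` would be the mean score across all agents for that episode.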

## 4.3 Future Work

The training for this particular agent is very slow; further tuning of the 
hyperparameters should improve efficiency. However, applying newer 
Actor/Critic models such as Twin Delayed DDPG (TD3) would be a direct 
improvement over the DDPG implementation used here. Another avenue to explore
would be an on-policy method such as Asynchronous Advantage Actor-Critic (A3C)
to compare performance directly.
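
As a pointer for that future work, the sketch below shows two of the ideas TD3 adds on top of DDPG: a clipped double-Q target and target policy smoothing. It is a NumPy illustration only, not part of this project's code; the function names are hypothetical, and `noise_std=0.2` and `noise_clip=0.5` are assumed defaults.

```python
import numpy as np


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller critic estimate."""
    q_next = np.minimum(q1_next, q2_next)
    return reward + gamma * (1.0 - done) * q_next


def smoothed_target_action(action, rng, noise_std=0.2, noise_clip=0.5, bound=1.0):
    """Target policy smoothing: add clipped Gaussian noise to the target action."""
    noise = rng.normal(0.0, noise_std, size=action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    return np.clip(action + noise, -bound, bound)


rng = np.random.RandomState(0)
next_action = smoothed_target_action(np.array([0.9, -0.9, 0.0, 0.5]), rng)
target = td3_target(
    reward=np.array([1.0]),
    done=np.array([0.0]),
    q1_next=np.array([2.0]),
    q2_next=np.array([3.0]),
)
```

Taking the minimum of two critics counteracts the overestimation of Q-values that a single DDPG critic is prone to.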