## Project 2 Report - Continuous Control

- Author: Thiago Akio Nakamura
- Date: May 2021
- Repo: https://github.com/akionakamura/drlnd-p2

### Introduction
This document is the report for Project 2 of Udacity's Deep Reinforcement Learning Nanodegree. The goal of the project is to train a reinforcement learning agent to solve a continuous control problem. The report explains the algorithms used, the hyperparameters tested, and the results obtained. Readers are also encouraged to read the source code for the detailed implementations of the neural networks and of the agent itself.

### The environment
The `Reacher` environment consists of a double-jointed robotic arm that must follow a target location. A reward of +0.1 is provided for each step in which the agent's hand is in the goal location, so the agent's goal is to keep its hand at the target for as many time steps as possible. The observation space has 33 variables corresponding to the position, rotation, velocity, and angular velocity of the arm. The action space has 4 continuous variables, each in the range [-1, 1], corresponding to the torques applicable to the two joints.
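To make the action format concrete, the following minimal sketch builds a batch of random actions with the shape the environment expects. The counts of 20 agents and 4 action dimensions match the values printed when the environment is inspected in this report; the random policy and the helper name `random_actions` are illustrative assumptions, not the trained agent.

```python
import numpy as np

NUM_AGENTS = 20   # parallel arms in this build of the environment
ACTION_SIZE = 4   # torque values per arm


def random_actions(num_agents=NUM_AGENTS, action_size=ACTION_SIZE, seed=None):
    """Sample one action per agent and clip every entry to [-1, 1]."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((num_agents, action_size))
    return np.clip(actions, -1.0, 1.0)


actions = random_actions(seed=0)
print(actions.shape)
```

Clipping matters because samples from a standard normal distribution regularly fall outside the valid torque range.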

#### The Goal
The task is episodic, and in order to solve the environment, the agent must get an average score of +30 over 100 consecutive episodes.
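The solve criterion can be written as a small check on the per-episode scores. This is a sketch under the assumption that each entry in `scores` is already the episode score averaged across agents; the function name `first_solved_episode` is hypothetical and is not taken from the repository.

```python
from collections import deque

import numpy as np


def first_solved_episode(scores, window=100, threshold=30.0):
    """Return the index of the first episode at which the mean of the last
    `window` scores reaches `threshold`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores):
        recent.append(score)
        # Require a full window so early lucky episodes do not count.
        if len(recent) == window and np.mean(recent) >= threshold:
            return episode
    return None
```

Note that the check only fires once a full window of 100 episodes is available, which is what "100 consecutive episodes" requires.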

In [1]:
import time
from collections import deque
from dataclasses import asdict

from unityagents import UnityEnvironment
import mlflow
import numpy as np

from agents import MultiAgent
from experiment import RunExperiments, RunConfig



In [2]:
# Global constants for the project.
# Number of episodes to train all agents.
NUM_EPISODES = 500

# Consecutive runs to average the scores.
RESULT_WINDOW = 100

# Minimum average score over the window to complete the project.
MIN_AVERAGE_SCORE = 30

In [3]:
experiments_to_run = RunExperiments(
    learn_steps=[1, 2, 4],
    sync_steps=[4, 8, 16],
    batch_sizes=[16, 32, 64],
    gammas=[0.95, 0.99],
    epsilon_decays=[0.98, 0.99]
)
NUM_EXPERIMENTS = 8
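
The grid above has 3 x 3 x 3 x 2 x 2 = 108 possible combinations, of which only `NUM_EXPERIMENTS = 8` are run. The real `RunExperiments` class lives in the repository's `experiment` module and is not shown in this report. The sketch below is therefore only an assumption about how such a random search could work; `SketchConfig` and `sample_configs` are hypothetical names, and sampling without replacement is a choice made for the illustration.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the repository's RunConfig."""
    learn_step: int
    sync_step: int
    batch_size: int
    gamma: float
    epsilon_decay: float


GRID = {
    "learn_step": [1, 2, 4],
    "sync_step": [4, 8, 16],
    "batch_size": [16, 32, 64],
    "gamma": [0.95, 0.99],
    "epsilon_decay": [0.98, 0.99],
}


def sample_configs(grid, n, seed=None):
    """Draw n distinct configurations uniformly from the full grid."""
    rng = random.Random(seed)
    keys = list(grid)
    all_combos = list(itertools.product(*(grid[k] for k in keys)))
    chosen = rng.sample(all_combos, n)  # without replacement
    return [SketchConfig(**dict(zip(keys, combo))) for combo in chosen]


configs = sample_configs(GRID, 8, seed=0)
```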

In [4]:
env = UnityEnvironment(file_name='./Reacher_Linux/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_size -> 5.0
		goal_speed -> 1.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [5]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

In [6]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### Learning Algorithms

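Because the action space is continuous, value-based methods such as Q-Learning and DQN, which select actions by maximizing over a discrete set, do not apply directly. We therefore use Deep Deterministic Policy Gradient (DDPG), an off-policy actor-critic algorithm: an actor network maps states to continuous actions, and a critic network estimates the action value that is used to improve the actor.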

#### DDPG
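
In DDPG the critic is trained towards the one-step target `y = r + gamma * (1 - done) * Q'(s', mu'(s'))`, where `mu'` and `Q'` are the target actor and target critic. The sketch below computes that target with NumPy. The functions `toy_actor` and `toy_critic` are hypothetical stand-ins so the example runs without neural networks; they are not the networks from the repository.

```python
import numpy as np


def ddpg_critic_targets(rewards, next_states, dones, gamma,
                        target_actor, target_critic):
    """Compute y = r + gamma * (1 - done) * Q'(s', mu'(s')) for a batch."""
    next_actions = target_actor(next_states)
    next_q = target_critic(next_states, next_actions)
    # Terminal transitions (done = 1) keep only the immediate reward.
    return rewards + gamma * (1.0 - dones) * next_q


def toy_actor(states):
    """Hypothetical actor: 4 actions squashed into [-1, 1]."""
    return np.tanh(states[:, :4])


def toy_critic(states, actions):
    """Hypothetical critic: a constant Q-value of 1 for every sample."""
    return np.ones(len(states))


rewards = np.array([0.1, 0.0])
dones = np.array([0.0, 1.0])
next_states = np.zeros((2, 33))
targets = ddpg_critic_targets(rewards, next_states, dones, 0.99,
                              toy_actor, toy_critic)
```

The discount factor `gamma` here plays the role of the `gammas` values explored in the experiment grid.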

In [7]:
def train(num_episodes, env, agent):
    all_scores = []
    score_window = deque(maxlen=RESULT_WINDOW)
    solved = False
    best_mean_score = 0
    
    for episode_i in range(num_episodes):
    
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        states = env_info.vector_observations                  # get the current state (for each agent)
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)

        while True:
            actions = agent.act(states)
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished

            agent.step(states, actions, rewards, next_states, dones)

            states = next_states                               # roll over states to next time step
            scores += env_info.rewards                         # update the score (for each agent)
            
            if np.any(dones):                                  # exit loop if episode finished
                break
        
        agent.episode_finished()
        avg_score = np.mean(scores)
        score_window.append(avg_score)
        all_scores.append(avg_score)
        
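        current_mean = np.mean(score_window)
        # Only count as solved once the window holds RESULT_WINDOW episodes.
        if len(score_window) == RESULT_WINDOW and current_mean >= MIN_AVERAGE_SCORE:
            solved = True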
        
        if current_mean > best_mean_score:
            best_mean_score = current_mean
        print(f"\r  Average score on episode {episode_i}: {current_mean}", end="")
        
    return all_scores, solved, best_mean_score, current_mean
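
The training loop relies on only three agent methods: `act`, `step`, and `episode_finished`. The sketch below is a hypothetical stand-in with that interface, useful for smoke-testing a loop of this shape without any learning. It is not the repository's `MultiAgent`. In particular, scaling the noise by `epsilon` and decaying it once per episode is an assumed reading of the `epsilon_decay` parameter, not something confirmed by the code shown in this report.

```python
import numpy as np


class RandomStandInAgent:
    """Hypothetical agent exposing the three methods the loop calls."""

    def __init__(self, num_agents, action_size, epsilon_decay=0.99, seed=None):
        self.num_agents = num_agents
        self.action_size = action_size
        self.epsilon = 1.0
        self.epsilon_decay = epsilon_decay
        self.rng = np.random.default_rng(seed)

    def act(self, states):
        # Pure noise scaled by epsilon, clipped to the valid torque range.
        noise = self.rng.standard_normal((self.num_agents, self.action_size))
        return np.clip(self.epsilon * noise, -1.0, 1.0)

    def step(self, states, actions, rewards, next_states, dones):
        # A learning agent would store the transition and update networks here.
        pass

    def episode_finished(self):
        # Assumed schedule: multiply epsilon by the decay factor per episode.
        self.epsilon *= self.epsilon_decay


stand_in = RandomStandInAgent(20, 4, epsilon_decay=0.99, seed=0)
stand_in_actions = stand_in.act(np.zeros((20, 33)))
stand_in.episode_finished()
```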

### Training

#### Hyperparameters

In [None]:
%%time

experiment_scores = {}
mlflow.set_experiment("reacher-ddpg")
for i, config in enumerate(experiments_to_run.get_configs(NUM_EXPERIMENTS)):
    with mlflow.start_run(run_name=f"Config {i}"):
        start_time = time.time()
        print(f"Running experiment {i} with config: {config}")
        mlflow.log_params(asdict(config))
        
        agent = MultiAgent(
            num_agents,
            state_size,
            action_size,
            gamma=config.gamma,
            learn_step=config.learn_step,
            sync_step=config.sync_step,
            epsilon_decay=config.epsilon_decay,
            batch_size=config.batch_size
        )

        scores, solved, best_mean_score, last_mean_score = train(NUM_EPISODES, env, agent)
        experiment_scores[config] = scores
        end_time = time.time()
        elapsed = end_time - start_time
        print(f"  Ran experiment in {elapsed} seconds.")
        print(f"  Best mean score of {best_mean_score}.")
        print(f"  Latest mean score of {last_mean_score}.")
        mlflow.log_metric("best_mean_score", best_mean_score)
        mlflow.log_metric("last_mean_score", last_mean_score)

        if solved:
            print("  Solved.")

Running experiment 0 with config: RunConfig(learn_step=1, sync_step=8, batch_size=64, gamma=0.95, epsilon_decay=0.99)
  Average score on episode 9: 0.362449991898611276

### Results


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(18, 8))
for c, scores in experiment_scores.items():
    plt.plot(scores, label=c, alpha=0.7)
    
plt.title("Experiment Results")
plt.ylabel("Average Score")
plt.xlabel("Episode")
plt.legend(bbox_to_anchor=(0.5, -0.08))
plt.grid()
plt.show()

When finished, you can close the environment.

In [None]:
env.close()

### Conclusion

### Future Work

### References