# Continuous Control-Reacher

---

In this notebook, you will create an agent, load the trained weights for the actor, and watch the performance of the trained agent. The notebook is set up for macOS; on other operating systems, change the name of the environment file accordingly.

Remember to change the kernel to 'drlnd', which can be set up by following the instructions [here](https://github.com/udacity/deep-reinforcement-learning#dependencies).

## 1. Before we start

Before we get started, make sure that all necessary files are in the same folder as this notebook, not in a sub-folder. The required files are:

* ppo_checkpoint.pth, which contains the trained weights for the agent's network
* infrastructures.py
* agents.py
* the environment file; make sure it is named 'Reacher'
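The checklist above can be verified programmatically before running the rest of the notebook. The following is a minimal sketch, not part of the original project: the helper name `missing_files` is hypothetical, and the environment file name `Reacher.app` is assumed from the macOS setup used later in this notebook.

```python
from pathlib import Path

# Files this notebook expects next to it. 'Reacher.app' is the macOS build
# name used later in the notebook; change it for other operating systems.
REQUIRED_FILES = ['ppo_checkpoint.pth', 'infrastructures.py', 'agents.py', 'Reacher.app']

def missing_files(folder='.', required=REQUIRED_FILES):
    """Return the names in `required` that do not exist directly inside `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).exists()]
```

Calling `missing_files()` from the notebook's folder should return an empty list when everything is in place.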
    

## 2. Start the Environment

In [3]:
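import numpy as np
# load the saved reward history
rewards = np.load('rewards_history.npy')
# averages[i-1] is the mean of the first max(1, i-100) entries of rewards
averages = [np.mean(rewards[range(0,max(1,i-100))]) for i in range(1,1501)]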

In [7]:
np.sum(np.array(averages)>=30)

1276

In [8]:
1500-1276-100

124
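The cells above work from the reward history with a threshold of 30 and an offset of 100. For comparison, a common way to summarize such a history is a trailing 100-episode mean. The sketch below assumes `rewards_history.npy` holds one score per episode; the function names are hypothetical, and this is the conventional trailing-window computation, not the exact expression used in the cell above, so its result may differ from the numbers shown there.

```python
import numpy as np

def trailing_mean(scores, window=100):
    """Mean of the last `window` scores up to and including each episode."""
    scores = np.asarray(scores, dtype=float)
    return np.array([scores[max(0, i - window + 1):i + 1].mean()
                     for i in range(len(scores))])

def first_solved_episode(scores, target=30.0, window=100):
    """1-based index of the first episode whose full trailing window
    averages at least `target`, or None if that never happens."""
    means = trailing_mean(scores, window)
    for i in range(window - 1, len(means)):
        if means[i] >= target:
            return i + 1
    return None
```

With the real data, this could be called as `first_solved_episode(np.load('rewards_history.npy'))`.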

In [None]:
from unityagents import UnityEnvironment

In [None]:
# start environment, you might need to change the name
env = UnityEnvironment(file_name='Reacher.app')
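The cell above hard-codes the macOS build. On other operating systems, the file name can be picked from the current platform. This is only a sketch: `reacher_file_name` is a hypothetical helper, and the Linux and Windows file names below are assumptions that should be adjusted to match the build you actually downloaded.

```python
import platform

# Hypothetical file names per operating system; only 'Reacher.app' comes
# from this notebook. Adjust the others to match your downloaded build.
ENV_FILES = {
    'Darwin': 'Reacher.app',
    'Linux': 'Reacher.x86_64',
    'Windows': 'Reacher.exe',
}

def reacher_file_name(system=None):
    """Return the environment file name for `system` (default: this machine)."""
    system = system or platform.system()
    if system not in ENV_FILES:
        raise ValueError('No Reacher build configured for {!r}'.format(system))
    return ENV_FILES[system]
```

The environment could then be started with `env = UnityEnvironment(file_name=reacher_file_name())`.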

In [None]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

In [None]:
# reset the environment
env_info = env.reset(train_mode=False)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('Size of each state:', state_size)

In [None]:
import numpy as np
import torch
from collections import namedtuple, deque

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

## 3. Load .py files

Run each of the following two cells twice. The first run loads the contents of the .py file into the cell; the second run executes that code in this notebook.

In [None]:
%load infrastructures.py

In [None]:
%load agents.py

## 4. See the performance

In [None]:
config = {
    'environment': {
        'state_size':  env_info.vector_observations.shape[1],
        'action_size': brain.vector_action_space_size,
        'number_of_agents': len(env_info.agents)
    },
    'pytorch': {
        'device': torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    },
    'hyperparameters': {
        'discount_rate': 0.99,
        'tau': 0.95,
        'gradient_clip': 5,
        'rollout_length': 2048,
        'optimization_epochs': 10,
        'ppo_clip': 0.2,
        'log_interval': 2048,
        'max_steps': 1e5,
        'mini_batch_number': 32,
        'entropy_coefficent': 0.01,
        'episode_count': 250,
        'hidden_size': 512,
        'adam_learning_rate': 3e-4,
        'adam_epsilon': 1e-5
    }
}
    
import torch.optim as optim  # needed here unless the loaded .py files already import it

policy = PPOPolicyNetwork(config)
optimizer = optim.Adam(policy.parameters(), lr=config['hyperparameters']['adam_learning_rate'], eps=config['hyperparameters']['adam_epsilon'])
agent = PPOAgent(env, brain_name, policy, optimizer, config)
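The code in agents.py is not shown here, so how the hyperparameters are consumed is not known. In many PPO implementations, `discount_rate` and `tau` are the discount factor and the λ of generalized advantage estimation (GAE). Under that assumption, the sketch below shows how the two values would combine; `gae_advantages` is a hypothetical helper, and it ignores episode boundaries for brevity.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, discount_rate=0.99, tau=0.95):
    """Generalized advantage estimates for a single rollout.

    rewards[t] and values[t] are the reward and value estimate at step t;
    last_value is the value estimate for the state after the final step.
    Assumes the rollout contains no episode termination.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # one-step temporal-difference error
        delta = rewards[t] + discount_rate * next_value - values[t]
        # exponentially weighted sum of future TD errors
        gae = delta + discount_rate * tau * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

A smaller `tau` weights near-term TD errors more heavily (lower variance, more bias); `tau = 1` reduces to discounted returns minus the value baseline.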

In [None]:
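# map_location lets a checkpoint saved on a GPU load on a CPU-only machine
agent.network.load_state_dict(torch.load('ppo_checkpoint.pth', map_location=config['pytorch']['device']))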

In [None]:
agent.network.eval()

In [None]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions, _, _, _ = agent.network(states)           # select an action (for each agent); the network returns a 4-tuple
    actions = actions.detach().cpu().numpy()           # convert the action tensor to a NumPy array
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += rewards                                  # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

## 5. Close the environment

In [None]:
env.close()