# Navigation

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This cell may take a few minutes to run!

In [None]:
diode = False  # Set to True if running in the data diode
cloud = True   # Set to True if running in the cloud

In [2]:
if diode:
    import sys
    sys.path.append('../ml-agents-0.4.0/python')

if cloud and not diode:
    !pip -q install ./python

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.19 which is incompatible.


~~The environment is already saved in the Workspace and can be accessed at the file path provided below.  Please run the next code cell without making any changes.~~

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [4]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [5]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)

Number of agents: 1
Number of actions: 4
States look like: [ 1.          0.          0.          0.          0.84408134  0.          0.
  1.          0.          0.0748472   0.          1.          0.          0.
  0.25755     1.          0.          0.          0.          0.74177343
  0.          1.          0.          0.          0.25854847  0.          0.
  1.          0.          0.09355672  0.          1.          0.          0.
  0.31969345  0.          0.        ]
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agent while it is training**, and you should set `train_mode=True` to restart the environment.

In [7]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

Score: 2.0


When finished, you can close the environment.

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agent while it is training.  However, **_after training the agent_**, you can download the saved model weights to watch the agent on your own machine! 

### Model Architecture

In [1]:
diode = False  # Set to True if running in the data diode
cloud = True   # Set to True if running in the cloud

In [None]:
if diode:
    import sys
    sys.path.append('../ml-agents-0.4.0/python')

if cloud and not diode:
    !pip -q install ./python


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from collections import deque
import matplotlib.pyplot as plt

from unityagents import UnityEnvironment
import numpy as np
%matplotlib inline

In [None]:
# CITATION: From Udacity's Deep Learning course, CycleGAN exercise
# ACM: I don't think I need convolutional layers since I am only 
#      ingesting state data - not pixels
def my_conv(in_channels, out_channels, kernel_size, stride=2, padding=1, batch_norm=True):
    """Creates a convolutional layer, with optional batch normalization.
    """
    layers = []
    conv_layer = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, 
                           kernel_size=kernel_size, stride=stride, padding=padding, bias=False)
    
    layers.append(conv_layer)

    if batch_norm:
        layers.append(nn.BatchNorm2d(out_channels))
    return nn.Sequential(*layers)
    
# helper deconv function
def my_deconv(in_channels, out_channels, kernel_size, stride=2, padding=1, batch_norm=True):
    """Creates a transpose convolutional layer, with optional batch normalization.
    """
    layers = []
    # append transpose conv layer
    layers.append(nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride, padding, bias=False))
    # optional batch norm layer
    if batch_norm:
        layers.append(nn.BatchNorm2d(out_channels))
    return nn.Sequential(*layers)

In [None]:
# CITATION: Adapted from the Deep Q-Network exercise in Udacity's Deep Reinforcement Learning course

class QNetwork(nn.Module):
    """Deep Q Network Policy Model."""

    def __init__(self, state_size, action_size, seed_state, fc1_units=64, fc2_units=64):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed_state (int): Random seed
            fc1_units (int): Number of nodes in first hidden layer
            fc2_units (int): Number of nodes in second hidden layer
        """
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed_state)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        """Build a network that maps state -> action values."""
        # Changed architecture to leaky_relu
        x = F.leaky_relu(self.fc1(state))
        x = F.leaky_relu(self.fc2(x))
        return self.fc3(x)
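
To get a feel for the size of this network, the short sketch below counts its trainable parameters by hand. It assumes the default hidden sizes (`fc1_units=64`, `fc2_units=64`) together with the 37-dimensional state and 4 actions reported earlier in this notebook, and it relies on `nn.Linear` including a bias term by default. The helper names `linear_param_count` and `mlp_param_count` are made up for this illustration.

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_param_count(layer_sizes):
    """Sum the parameters of every consecutive pair of layer sizes."""
    return sum(
        linear_param_count(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# state_size -> fc1_units -> fc2_units -> action_size
layer_sizes = [37, 64, 64, 4]
total_params = mlp_param_count(layer_sizes)
print("Trainable parameters:", total_params)
```

A network this small trains comfortably on a CPU, which is consistent with the choice above to skip convolutional layers for vector observations.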


In [None]:
# CITATION: Adapted from the Deep Q-Network exercise in Udacity's Deep Reinforcement Learning course
import numpy as np
import random
from collections import namedtuple, deque
from abc import ABC, abstractmethod

import torch
import torch.nn.functional as F
import torch.optim as optim

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# @dataclass
class Agent(ABC):
    @abstractmethod
    def step(self, state, action, reward, next_state, done):
        pass
        
    @abstractmethod
    def act(self, state, eps=0.):
        pass
    
    
class EnvironmentMgr(ABC):
    @abstractmethod
    def __enter__(self):
        pass
    
    @abstractmethod
    def __exit__(self, e_type, e_value, e_traceback):
        pass

    
class UnityEnvironmentMgr(EnvironmentMgr):
    def __init__(self, file, seed):
        self.file = file
        
        self.seed = seed
        random.seed(seed)
        
    def __enter__(self):
        self.env = UnityEnvironment(file_name=self.file, worker_id=self.get_uid())
        return self.env
    
    def __exit__(self, e_type, e_value, e_traceback):
        self.env.close()
        
    def get_uid(self):
        self.id = random.randint(0,100)
        return self.id
        
class DQNAgent(Agent):
    """
    Interacts with and learns from the environment.
    
    Parameters
    ----------
    state_size : int
        dimension of each state
    action_size : int
        dimension of each action
    
    Optional Parameters
    -------------------
    device : str
        'cpu' | 'cuda:0'
    seed_state : int
        random seed
    lr : float
        learning rate
    update_every : int
        number of environment steps between learning updates
    tau : float
        interpolation parameter for the soft update of the target network
    gamma : float
        discount for future reward
    batch_size : int
        number of experiences sampled per training batch
    buffer_size : int
        maximum number of experiences kept in the replay buffer
    """
    def __init__(
        self,
        state_size: int,
        action_size: int,
        device: str = 'cpu',
        seed_state: int = 42,
        lr: float = 5e-4,
        update_every: int = 4,
        tau: float = 1e-3,
        gamma: float = 0.99,
        batch_size: int = 64,
        buffer_size=int(1e5),
    ):
        self.state_size = state_size
        self.action_size = action_size
        self.device = device
        self.seed_state = seed_state
        self.lr = lr
        self.update_every = update_every
        self.tau = tau
        self.gamma = gamma
        self.batch_size = batch_size
        self.buffer_size = buffer_size
        
        self.seed = random.seed(self.seed_state)
        
        # Q-Network
        self.qnetwork_local = QNetwork(self.state_size, self.action_size, self.seed_state).to(self.device)
        self.qnetwork_target = QNetwork(self.state_size, self.action_size, self.seed_state).to(self.device)
        self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=self.lr)

        # Replay memory (pass the device so sampled tensors land on the same device as the networks)
        self.memory = ReplayBuffer(
            self.action_size, self.buffer_size, self.batch_size, self.seed_state, self.device
        )
        # Initialize time step (for updating every self.update_every steps)
        self.t_step = 0
    
    def step(self, state, action, reward, next_state, done):
        """Advance Agent"""
        # Save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)
        
        # Learn every self.update_every time steps.
        self.t_step = (self.t_step + 1) % self.update_every
        if self.t_step == 0:
            # If enough samples are available in memory, get random subset and learn
            if len(self.memory) > self.batch_size:
                experiences = self.memory.sample()
                self.learn(experiences, self.gamma)

    def act(self, state, eps=0.):
        """Returns actions for given state as per current policy.
        
        Params
        ======
        state (array_like): current state
        eps (float): epsilon, for epsilon-greedy action selection
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
        self.qnetwork_local.eval()
        with torch.no_grad():
            action_values = self.qnetwork_local(state)
        self.qnetwork_local.train()

        # Epsilon-greedy action selection
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))

    def learn(self, experiences, gamma):
        """Update value parameters using given batch of experience tuples.

        Params
        ======
        experiences (Tuple[torch.Tensor]): batch tensors (s, a, r, s', done)
        gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        # Get max predicted Q values (for next states) from target model
        Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
        # Compute Q targets for current states 
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Get expected Q values from local model
        Q_expected = self.qnetwork_local(states).gather(1, actions)

        # Compute loss
        loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # ------------------- update target network ------------------- #
        self.soft_update(self.qnetwork_local, self.qnetwork_target, self.tau)                     

    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target

        Params
        ======
            local_model (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter 
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)


# @dataclass
class ReplayBuffer:
    """
    Fixed-size buffer to store experience tuples
    
    Parameters
    ----------
    action_size : int 
        dimension of each action
    buffer_size : int 
        maximum size of buffer
    batch_size  : int 
        size of each training batch
    seed_state : int
        random seed
    device : str
        'cpu' | 'cuda:0'
    """
    def __init__(
        self,
        action_size: int,
        buffer_size: int,
        batch_size: int,
        seed_state: int = 42,
        device: str = 'cpu',
    ):
        self.action_size = action_size
        self.buffer_size = buffer_size
        self.batch_size = batch_size
        self.seed_state = seed_state
        self.device = device
        
        self.memory = deque(maxlen=self.buffer_size)  
        self.experience = namedtuple("Experience", field_names=[
            "state", "action", "reward", "next_state", "done"
        ])
        self.seed = random.seed(self.seed_state)
    
    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)
    
    @staticmethod
    def _format_experiences(experiences, attr, to):
        attr_list = [
            getattr(e, attr)
            for e in experiences 
            if e is not None
        ]
        attr_tensor = torch.from_numpy(np.vstack(attr_list)).float()
        return attr_tensor.to(to)
        
    
    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)
        
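        # Float-valued fields can share the generic conversion helper
        states, rewards, next_states = (
            self._format_experiences(experiences, x, self.device)
            for x in ['state', 'reward', 'next_state']
        )
        # Actions index into the Q-values via gather(), so they must be long tensors.
        # The namedtuple field is 'action' (singular).
        actions = self._format_experiences(experiences, 'action', 'cpu').long().to(self.device)
        # Cast the boolean done flags to uint8 before conversion, then to float
        # so they can be used in the (1 - dones) mask
        dones = torch.from_numpy(np.vstack([
            e.done
            for e in experiences 
            if e is not None
        ]).astype(np.uint8)).float().to(self.device)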
  
        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
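
The two update rules inside `DQNAgent.learn` and `DQNAgent.soft_update` are easier to follow on plain numbers. The sketch below is a scalar walk-through, not the notebook's implementation: it reuses the same formulas on Python lists instead of tensors, and every input value is made up for illustration. The function names `td_target` and `soft_update_values` are hypothetical.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """r + gamma * max_a' Q_target(s', a'); the bootstrap term is dropped when done."""
    return reward + gamma * max(next_q_values) * (1 - int(done))


def soft_update_values(local_weights, target_weights, tau=1e-3):
    """tau * local + (1 - tau) * target, element by element."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_weights, target_weights)
    ]


# Made-up target-network Q-values for the four actions in the next state
next_q = [0.5, 2.0, -1.0, 0.0]

target_mid_episode = td_target(reward=1.0, next_q_values=next_q, done=False)
target_terminal = td_target(reward=1.0, next_q_values=next_q, done=True)
print("Mid-episode target:", target_mid_episode)  # reward plus discounted best next value
print("Terminal target:", target_terminal)        # reward only

# A large tau is used here only to make the blend visible
blended = soft_update_values(local_weights=[1.0, 0.0], target_weights=[0.0, 0.0], tau=0.5)
print("Blended weights:", blended)
```

With the default `tau=1e-3`, each call moves the target network only 0.1% of the way toward the local network, which is why the targets change slowly between learning steps.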

In [None]:
# CITATION: Adapted from the DQN exercise in Udacity's Deep Reinforcement Learning course

class Trainer(ABC):
    @abstractmethod
    def run(self):
        pass
    
class DQNTrainer(Trainer):
    def __init__(
        self,
        agent: Agent,
#         env: UnityEnvironmentMgr,
        env: UnityEnvironment,
        n_episodes=2000,
        max_t=10000,
        eps_start=1.0,
        eps_end=0.01, 
        eps_decay=0.995,
        print_every=100,
    ):
        """Deep Q-Learning.

        Parameters
        ----------
        agent : Agent
            agent to act upon
        # env : UnityEnvironmentMgr
        #     environment manager containing enter and exit methods to call UnityEnvironment
        env : UnityEnvironment
            Unity environment. DO NOT CLOSE it in v0.4.0: closing it will lock you
            out of your environment. NOTE TO UDACITY STAFF: please fix this issue by
            upgrading the UnityEnvironment requirements. See
            https://github.com/Unity-Technologies/ml-agents/issues/1167
        n_episodes : int
            maximum number of training episodes
        max_t : int
            maximum number of timesteps per episode
        eps_start : float
            starting value of epsilon, for epsilon-greedy action selection
        eps_end : float
            minimum value of epsilon
        eps_decay : float
            multiplicative factor (per episode) for decreasing epsilon
        print_every : int
            print a persistent progress line every this many episodes (default 100)
        """
        self.agent = agent
        self.env = env
        self.n_episodes = n_episodes
        self.max_t = max_t 
        self.eps_start = eps_start
        self.eps_end = eps_end 
        self.eps_decay = eps_decay
        
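        # The Banana navigation task counts as solved at an average score of +13
        # over 100 consecutive episodes; a threshold of 100 would never be reached.
        self.solved = 13.0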
        self.print_every = print_every
        self.brain_name = env.brain_names[0]
        
        self.scores_ = None
    
    def run(self):
        scores = []                        # list containing scores from each episode
        scores_window = self.init_scores_window()
        eps = self.init_eps()
        for i_episode in range(self.n_episodes):
            env_info = self.env.reset(train_mode=True)[self.brain_name] # reset the environment
            state = env_info.vector_observations[0]            # get the current state
            score = 0
            for t in range(self.max_t):
                action = self.agent.act(state, eps)
                # action = TODO(action_size)
                # TODO this should be handled by the bridge
                env_info = self.env.step(action)[self.brain_name]
                next_state = env_info.vector_observations[0]
                reward = env_info.rewards[0]
                done = env_info.local_done[0] 
                self.agent.step(state, action, reward, next_state, done)
                state = next_state
                score += reward
                if done:
                    break 
            scores_window.append(score)       # save most recent score
            scores.append(score)              # save most recent score
            eps = self.update_eps(eps)
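            mean_score = np.mean(scores_window)
            # i_episode is zero-based, so report i_episode + 1 as the episode number
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode + 1, mean_score), end="")
            if (i_episode + 1) % self.print_every == 0:
                print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode + 1, mean_score))
            if mean_score >= self.solved:
                # The average covers the last `maxlen` episodes, so subtract the window
                # length (not print_every) to report where the solving window began
                print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(
                    max(0, (i_episode + 1) - scores_window.maxlen), mean_score
                ))
                self.save('my-checkpoint.pth')
                break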
                
        self.scores_ = scores
        return scores

    def init_eps(self):
        return self.eps_start
    
    def init_scores_window(self, l=100):
        return deque(maxlen=l)
    
    def update_eps(self, eps):
        return max(self.eps_end, self.eps_decay * eps)
    
    def save(self, file):
        torch.save(self.agent.qnetwork_local.state_dict(), file)
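
To see how the exploration rate evolves under `update_eps`, the sketch below replays the same rule, `max(eps_end, eps_decay * eps)`, outside the trainer using the default settings from `DQNTrainer`. The function name `epsilon_schedule` is made up for this illustration, and the code only reproduces the schedule; it does not train anything.

```python
def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Return the epsilon used in each episode under multiplicative decay with a floor."""
    eps = eps_start
    schedule = []
    for _ in range(n_episodes):
        schedule.append(eps)
        eps = max(eps_end, eps_decay * eps)
    return schedule


schedule = epsilon_schedule(n_episodes=2000)

# Zero-based index of the first episode that runs at the floor value
first_floor_index = next(i for i, eps in enumerate(schedule) if eps == 0.01)

print("Epsilon in the first episode:", schedule[0])
print("Epsilon in the last episode:", schedule[-1])
print("First episode index at the floor:", first_floor_index)
```

With these defaults the agent keeps some extra exploration for on the order of 900 episodes before settling at `eps_end`. A smaller `eps_decay` reaches the floor sooner, at the cost of less early exploration.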


### Instantiate

In [None]:
if diode:
    file = "/../Banana_Linux_NoVis/Banana.x86_64"
elif cloud:
    file = "/data/Banana_Linux_NoVis/Banana.x86_64"
else:
    file = 'data/Banana_Windows_x86_64/Banana.exe'
    
# envh = UnityEnvironmentMgr(file, seed=2)
env = UnityEnvironment(file_name=file)

In [None]:
# get the default brain

brain_name = env.brain_names[0]
brain = env.brains[brain_name]
env_info = env.reset(train_mode=True)[brain_name]
action_size = brain.vector_action_space_size
state = env_info.vector_observations[0]
state_size = len(state)

In [None]:
agent = DQNAgent(state_size=state_size, action_size=action_size, seed_state=42, device=DEVICE)
# trainer = DQNTrainer(agent, envh)
trainer = DQNTrainer(agent, env)

### Run

In [None]:
scores = trainer.run()

In [None]:
scores

In [None]:


# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
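
Per-episode scores are noisy, so it can help to overlay a rolling average similar to the 100-episode window that `DQNTrainer.run` uses for its progress printout. The sketch below is one minimal way to compute it with the standard library; `rolling_mean` is a hypothetical helper, and the scores are made-up values, not training results.

```python
from collections import deque


def rolling_mean(values, window=100):
    """Mean of the most recent `window` values at each position."""
    recent = deque(maxlen=window)
    means = []
    for value in values:
        recent.append(value)
        means.append(sum(recent) / len(recent))
    return means


example_scores = [0.0, 2.0, 4.0, 6.0]  # made-up values for illustration
smoothed = rolling_mean(example_scores, window=2)
print(smoothed)
```

Passing the real `scores` list with `window=100` gives a curve that can be drawn on the same axes as the raw scores with a second `plt.plot` call.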

