# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```
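
The platform list above can also be turned into a small lookup, sketched here under some assumptions: `banana_path` is a hypothetical helper (not part of `unityagents`), `path/to` remains a placeholder for your own download folder, and the machine is assumed to be x86-based.

```python
import platform

def banana_path(system=None, machine=None, headless=False):
    """Hypothetical helper: pick the Banana build path for the current platform.

    The returned strings are the placeholder paths from the list above;
    replace "path/to" with the folder that holds your download.
    """
    system = system or platform.system()               # e.g. "Darwin", "Windows", "Linux"
    machine = (machine or platform.machine()).lower()  # e.g. "x86_64", "amd64"
    is_64bit = machine in ("x86_64", "amd64")          # assumption: x86 hardware only

    if system == "Darwin":
        return "path/to/Banana.app"
    if system == "Windows":
        folder = "Banana_Windows_x86_64" if is_64bit else "Banana_Windows_x86"
        return "path/to/{}/Banana.exe".format(folder)
    if system == "Linux":
        folder = "Banana_Linux_NoVis" if headless else "Banana_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return "path/to/{}/Banana.{}".format(folder, suffix)
    raise ValueError("No Banana build listed for system {!r}".format(system))

print(banana_path("Linux", "x86_64"))  # path/to/Banana_Linux/Banana.x86_64
```

The result could then be passed as `file_name` when creating the `UnityEnvironment`.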

In [2]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
env = UnityEnvironment(file_name="/home/arasdar/Banana_Linux/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we check for the first available brain and set it as the default brain we will control from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37
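
To make the four action indices from this section easier to read in logs, they can be mapped to names. This is a minimal sketch: `ACTION_NAMES` is an illustrative dictionary, not something provided by the environment, and the fixed seed only makes the sample repeatable.

```python
import numpy as np

# Action indices as described in this section (names are for display only).
ACTION_NAMES = {0: "walk forward", 1: "walk backward", 2: "turn left", 3: "turn right"}

rng = np.random.RandomState(0)                        # seeded for repeatability
actions = rng.randint(len(ACTION_NAMES), size=5)      # uniform random action indices
for a in actions:
    print(int(a), ACTION_NAMES[int(a)])
```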


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance when it selects an action uniformly at random at each time step.  A window should pop up that allows you to observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [28]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        print(state.shape)
        break
    
print("Score: {}".format(score))

(37,)
Score: -2.0
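
The interaction loop above is reused several times in this notebook, so it can be wrapped in a function. The sketch below assumes only the calls already used here (`env.reset`, `env.step`, `vector_observations`, `rewards`, `local_done`). `run_episode` is a hypothetical helper, and `FakeEnv` is a made-up stand-in so the example runs without Unity; it is not the real environment.

```python
import numpy as np
from types import SimpleNamespace

def run_episode(env, brain_name, choose_action, train_mode=True):
    """Play one episode with `choose_action(state)` and return the total reward."""
    env_info = env.reset(train_mode=train_mode)[brain_name]
    state = env_info.vector_observations[0]
    score = 0.0
    while True:
        action = choose_action(state)
        env_info = env.step(action)[brain_name]
        state = env_info.vector_observations[0]
        score += env_info.rewards[0]
        if env_info.local_done[0]:
            return score

class FakeEnv:
    """Hypothetical stand-in for UnityEnvironment: pays +1 for action 0, ends after n_steps."""
    def __init__(self, n_steps=5):
        self.n_steps = n_steps
        self.t = 0

    def _info(self, reward):
        info = SimpleNamespace(vector_observations=np.zeros((1, 37)),
                               rewards=[reward],
                               local_done=[self.t >= self.n_steps])
        return {"BananaBrain": info}

    def reset(self, train_mode=True):
        self.t = 0
        return self._info(0.0)

    def step(self, action):
        self.t += 1
        return self._info(1.0 if action == 0 else 0.0)

score = run_episode(FakeEnv(), "BananaBrain", lambda state: 0)
print("Score: {}".format(score))  # Score: 5.0
```

With the real environment, the random policy from the cell above would be `lambda state: np.random.randint(action_size)`.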


When finished, you can close the environment.

In [29]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [30]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    #print(state)
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

Score: 2.0


In [32]:
# Check which TensorFlow version is installed and whether a GPU is available
# (an empty device name means TensorFlow will run on the CPU)
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [50]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
batch = []
while True: # infinite number of steps
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    #print(state, action, reward, done)
    batch.append([action, state, reward, done])    # store the state the action was taken in
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
# print("Score: {}".format(score))

In [51]:
batch[0], batch[0][1].shape

([3, array([0.        , 1.        , 0.        , 0.        , 0.14733367,
         0.        , 0.        , 1.        , 0.        , 0.11928118,
         1.        , 0.        , 0.        , 0.        , 0.47576329,
         0.        , 1.        , 0.        , 0.        , 0.45386043,
         0.        , 0.        , 1.        , 0.        , 0.99189001,
         0.        , 0.        , 1.        , 0.        , 0.74783498,
         0.        , 0.        , 1.        , 0.        , 0.12713307,
         0.        , 0.        ]), 0.0, False], (37,))

In [52]:
batch[0][1].shape

(37,)

In [53]:
batch[0]

[3, array([0.        , 1.        , 0.        , 0.        , 0.14733367,
        0.        , 0.        , 1.        , 0.        , 0.11928118,
        1.        , 0.        , 0.        , 0.        , 0.47576329,
        0.        , 1.        , 0.        , 0.        , 0.45386043,
        0.        , 0.        , 1.        , 0.        , 0.99189001,
        0.        , 0.        , 1.        , 0.        , 0.74783498,
        0.        , 0.        , 1.        , 0.        , 0.12713307,
        0.        , 0.        ]), 0.0, False]

In [54]:
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
# infos = np.array([each[4] for each in batch])
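
The four list comprehensions above each pull one column out of `batch`. An equivalent way to do this, sketched here on a made-up two-step batch, is to transpose the list with `zip(*batch)`. The sample values are illustrative and do not come from the environment.

```python
import numpy as np

# Toy batch with the same [action, state, reward, done] layout as above.
batch = [
    [3, np.zeros(37), 0.0, False],
    [1, np.ones(37), 1.0, True],
]

# zip(*batch) groups the i-th entry of every step into one column.
actions, states, rewards, dones = (np.array(column) for column in zip(*batch))
print(actions.shape, states.shape, rewards.shape, dones.shape)  # (2,) (2, 37) (2,) (2,)
```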

In [55]:
# print(rewards[-20:])
print('shapes:', rewards.shape, states.shape, actions.shape, dones.shape)
print('dtypes:', rewards.dtype, states.dtype, actions.dtype, dones.dtype)
print('states:', np.max(states), np.min(states))
print('actions:', np.max(actions), np.min(actions))
# print((np.max(actions) - np.min(actions))+1)
print('rewards:', np.max(rewards), np.min(rewards))

shapes: (300,) (300, 37) (300,) (300,)
dtypes: float64 float64 int64 bool
states: 10.711230278015137 -9.99843692779541
actions: 3 0
rewards: 0.0 0.0


In [56]:
# The input data into the model
def model_input(state_size):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    reward = tf.placeholder(tf.float32, [], name='reward')
    return states, actions, reward

In [57]:
# Generator: generating/predicting the actions
def generator(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return logits
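
The expression `tf.maximum(alpha * bn1, bn1)` in the generator acts as a leaky ReLU for `0 < alpha < 1`: positive inputs pass through unchanged and negative inputs are scaled by `alpha`. A NumPy sketch of the same element-wise operation (the function name `leaky_relu` is only illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Element-wise max(alpha * x, x), mirroring the tf.maximum call in the generator."""
    return np.maximum(alpha * x, x)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x).tolist())  # [-0.2, 0.0, 3.0]
```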

In [58]:
def model_loss(states, actions, reward, # model input
               action_size, hidden_size): # model init
    actions_logits = generator(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    loss_prob = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                          labels=actions_labels))
    reward_prob = tf.nn.sigmoid(reward)
    loss = loss_prob * -reward_prob
    return actions_logits, loss, reward_prob
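
To see what `model_loss` computes, the same arithmetic can be reproduced in NumPy. This is a sketch under the assumption that it mirrors the TensorFlow ops above (one-hot labels, mean softmax cross-entropy, sigmoid of the episode reward); the helper names are illustrative and no gradients are involved.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax_cross_entropy(logits, labels):
    """Per-row cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

def loss_numpy(logits, actions, reward, action_size=4):
    labels = np.eye(action_size)[actions]                  # one-hot encode the actions
    loss_prob = softmax_cross_entropy(logits, labels).mean()
    reward_prob = sigmoid(reward)
    return loss_prob * -reward_prob, reward_prob

# Uniform logits: cross-entropy is ln(4), and sigmoid(0) is 0.5.
logits = np.zeros((3, 4))
actions = np.array([0, 2, 3])
loss, reward_prob = loss_numpy(logits, actions, reward=0.0)
print(round(float(loss), 4), float(reward_prob))  # -0.6931 0.5
```

Because `reward_prob` is a sigmoid, an episode reward of `1.0` maps to about `0.7311` and `3.0` to about `0.9526`, which are the `reward_prob` values that appear in the training log.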

In [59]:
def model_opt(loss, learning_rate):
    """
    Get the optimization operation
    :param loss: Generator loss Tensor for action prediction
    :param learning_rate: Learning rate for the Adam optimizer
    :return: The training operation that updates the generator variables
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    return opt

In [60]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.reward = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.loss, self.reward_prob = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, actions=self.actions, reward=self.reward) # model input

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

In [61]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(300, 37) actions:(300,)
action size:4


In [64]:
# Training parameters
# Network parameters
learning_rate = 0.001          # learning rate for adam
state_size = 37                # number of units for the input state/observation -- simulation
action_size = 4                # number of units for the output actions -- simulation
hidden_size = 64               # number of units in each generator hidden layer -- simulation

In [65]:
# Reset/init the graph/session
tf.reset_default_graph()        # returns None, so fetch the fresh graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

In [None]:
# Now train with experiences
saver = tf.train.Saver()
rewards_list, loss_list = [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-nav.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    for ep in range(1000):
        batch = [] # every data batch
        total_reward = 0
        #state = env.reset() # env first state
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment        
        state = env_info.vector_observations[0]   # get the initial state

        # Training steps
        #for _ in range(max_steps): # start=0, step=1, stop=max_steps/done/reward
        while True:
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
            action = np.argmax(action_logits)
            batch.append([state, action])
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name] # send the action to the environment
            state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            total_reward += reward
            if done:
                break
        
        # Training batches
        #batch = memory.buffer
        states = np.array([each[0] for each in batch])
        actions = np.array([each[1] for each in batch])
        loss, _, reward_prob = sess.run([model.loss, model.opt, model.reward_prob],
                                        feed_dict = {model.states: states, 
                                                     model.actions: actions,
                                                     model.reward: total_reward})
        # Print out
        print('Episode: {}'.format(ep),
              'total_reward: {}'.format(total_reward),
              'reward_prob: {:.4f}'.format(reward_prob),
              'loss: {:.4f}'.format(loss))
        # Plotting out
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, loss])
        # The task is episodic, and in order to solve the environment, 
        # your agent must get an average score of +13 over 100 consecutive episodes.
        if len(rewards_list) >= 100 and np.mean([r for _, r in rewards_list[-100:]]) >= 13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model-nav.ckpt')

Episode: 0 total_reward: 0.0 reward_prob: 0.5000 loss: -0.5734
Episode: 1 total_reward: 0.0 reward_prob: 0.5000 loss: -0.4462
Episode: 2 total_reward: 0.0 reward_prob: 0.5000 loss: -0.5719
Episode: 3 total_reward: 1.0 reward_prob: 0.7311 loss: -0.8171
Episode: 4 total_reward: 0.0 reward_prob: 0.5000 loss: -0.5855
Episode: 5 total_reward: 0.0 reward_prob: 0.5000 loss: -0.5623
Episode: 6 total_reward: 0.0 reward_prob: 0.5000 loss: -0.5724
Episode: 7 total_reward: 0.0 reward_prob: 0.5000 loss: -0.5276
Episode: 8 total_reward: 3.0 reward_prob: 0.9526 loss: -1.0407
Episode: 9 total_reward: 1.0 reward_prob: 0.7311 loss: -0.7913
Episode: 10 total_reward: 0.0 reward_prob: 0.5000 loss: -0.5851
Episode: 11 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6126
Episode: 12 total_reward: 0.0 reward_prob: 0.5000 loss: -0.5783
Episode: 13 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6387
Episode: 14 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3342
Episode: 15 total_reward: 0.0 reward_prob: 0.5000 ...

Episode: 128 total_reward: 2.0 reward_prob: 0.8808 loss: -1.1543
Episode: 129 total_reward: 3.0 reward_prob: 0.9526 loss: -1.2403
Episode: 130 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9577
Episode: 131 total_reward: 3.0 reward_prob: 0.9526 loss: -1.2494
Episode: 132 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9570
Episode: 133 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9615
Episode: 134 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9588
Episode: 135 total_reward: 3.0 reward_prob: 0.9526 loss: -1.2431
Episode: 136 total_reward: 2.0 reward_prob: 0.8808 loss: -1.1585
Episode: 137 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6533
Episode: 138 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6555
Episode: 139 total_reward: 3.0 reward_prob: 0.9526 loss: -1.2482
Episode: 140 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6612
Episode: 141 total_reward: 2.0 reward_prob: 0.8808 loss: -1.1572
Episode: 142 total_reward: 2.0 reward_prob: 0.8808 loss: -1.1598
...

Episode: 254 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3590
Episode: 255 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9733
Episode: 256 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6705
Episode: 257 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9767
Episode: 258 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6691
Episode: 259 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6714
Episode: 260 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6720
Episode: 261 total_reward: 2.0 reward_prob: 0.8808 loss: -1.1817
Episode: 262 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9920
Episode: 263 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6708
Episode: 264 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3669
Episode: 265 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6783
Episode: 266 total_reward: -2.0 reward_prob: 0.1192 loss: -0.1595
Episode: 267 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6679
Episode: 268 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9680
...

Episode: 380 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3621
Episode: 381 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6733
Episode: 382 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3598
Episode: 383 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6756
Episode: 384 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3607
Episode: 385 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6779
Episode: 386 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9832
Episode: 387 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6784
Episode: 388 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6758
Episode: 389 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6762
Episode: 390 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9926
Episode: 391 total_reward: -2.0 reward_prob: 0.1192 loss: -0.1611
Episode: 392 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6744
Episode: 393 total_reward: 2.0 reward_prob: 0.8808 loss: -1.1858
Episode: 394 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6767
...

Episode: 506 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3655
Episode: 507 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3661
Episode: 508 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9918
Episode: 509 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3651
Episode: 510 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6789
Episode: 511 total_reward: -2.0 reward_prob: 0.1192 loss: -0.1617
Episode: 512 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9911
Episode: 513 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6806
Episode: 514 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3645
Episode: 515 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9909
Episode: 516 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3651
Episode: 517 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6792
Episode: 518 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9963
Episode: 519 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3663
Episode: 520 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3673
...

Episode: 632 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6568
Episode: 633 total_reward: 2.0 reward_prob: 0.8808 loss: -1.1711
Episode: 634 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9860
Episode: 635 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6747
Episode: 636 total_reward: 2.0 reward_prob: 0.8808 loss: -1.1768
Episode: 637 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9836
Episode: 638 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3597
Episode: 639 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6775
Episode: 640 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3615
Episode: 641 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3628
Episode: 642 total_reward: 2.0 reward_prob: 0.8808 loss: -1.1958
Episode: 643 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6809
Episode: 644 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6846
Episode: 645 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6851
Episode: 646 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3655
...

Episode: 758 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3637
Episode: 759 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3614
Episode: 760 total_reward: 1.0 reward_prob: 0.7311 loss: -0.9937
Episode: 761 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6783
Episode: 762 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6839
Episode: 763 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6776
Episode: 764 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6768
Episode: 765 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6764
Episode: 766 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6774
Episode: 767 total_reward: -1.0 reward_prob: 0.2689 loss: -0.3650
Episode: 768 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6803
Episode: 769 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6886
Episode: 770 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6792
Episode: 771 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6777
Episode: 772 total_reward: 0.0 reward_prob: 0.5000 loss: -0.6780
...
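
The solve criterion quoted in the training cell (an average score of +13 over 100 consecutive episodes) can be checked on the recorded scores. A minimal sketch, assuming a plain list of per-episode scores; `is_solved` is a hypothetical helper and only inspects the most recent window.

```python
import numpy as np

def is_solved(scores, target=13.0, window=100):
    """Return True if the mean of the last `window` scores is at least `target`."""
    if len(scores) < window:
        return False          # not enough episodes yet to fill one window
    return float(np.mean(scores[-window:])) >= target

print(is_solved([13.0] * 100))          # True
print(is_solved([0.0] * 99 + [13.0]))   # False
```

With the `rewards_list` built during training, the scores would be `[r for _, r in rewards_list]`.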

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 
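
`running_mean` uses a cumulative sum so that each window average costs one subtraction. For an input of length `n` it returns `n - N + 1` values, which is why the plots below align the smoothed curve with the last `len(smoothed_arr)` episodes. A small worked example with made-up scores:

```python
import numpy as np

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))   # prepend 0 so cumsum[i] is the sum of x[:i]
    return (cumsum[N:] - cumsum[:-N]) / N    # sum of each length-N window, divided by N

scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(running_mean(scores, 2).tolist())  # [1.5, 2.5, 3.5, 4.5]
```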

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

In [33]:
# # import gym
# # # env = gym.make('CartPole-v0')
# # env = gym.make('CartPole-v1')
# # # env = gym.make('Acrobot-v1')
# # # env = gym.make('MountainCar-v0')
# # # env = gym.make('Pendulum-v0')
# # # env = gym.make('Blackjack-v0')
# # # env = gym.make('FrozenLake-v0')
# # # env = gym.make('AirRaid-ram-v0')
# # # env = gym.make('AirRaid-v0')
# # # env = gym.make('BipedalWalker-v2')
# # # env = gym.make('Copy-v0')
# # # env = gym.make('CarRacing-v0')
# # # env = gym.make('Ant-v2') #mujoco
# # # env = gym.make('FetchPickAndPlace-v1') # mujoco required!

# with tf.Session() as sess:
#     #sess.run(tf.global_variables_initializer())
#     saver.restore(sess, 'checkpoints/model-nav.ckpt')    
#     #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
#     # Episodes/epochs
#     for _ in range(1):
#         state = env.reset()
#         total_reward = 0

#         # Steps/batches
#         #for _ in range(111111111111111111):
#         while True:
#             env.render()
#             action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
#             action = np.argmax(action_logits)
#             state, reward, done, _ = env.step(action)
#             total_reward += reward
#             if done:
#                 break
                
#         # Closing the env
#         print('total_reward: {:.2f}'.format(total_reward))
#         env.close()