# Project: Train a Quadcopter How to Fly

Design an agent to fly a quadcopter, and then train it using a reinforcement learning algorithm of your choice! 

Try to apply the techniques you have learnt, but also feel free to come up with innovative ideas and test them.

## Instructions

Take a look at the files in the directory to better understand the structure of the project. 

- `task.py`: Define your task (environment) in this file.
- `agents/`: Folder containing reinforcement learning agents.
    - `policy_search.py`: A sample agent has been provided here.
    - `agent.py`: Develop your agent here.
- `physics_sim.py`: This file contains the simulator for the quadcopter.  **DO NOT MODIFY THIS FILE**.

For this project, you will define your own task in `task.py`.  Although we have provided an example task to get you started, you are encouraged to change it.  Later in this notebook, you will learn more about how to amend this file.

You will also design a reinforcement learning agent in `agent.py` to complete your chosen task.  

You are welcome to create any additional files to help you to organize your code.  For instance, you may find it useful to define a `model.py` file defining any needed neural network architectures.

## Controlling the Quadcopter

We provide a sample agent in the code cell below to show you how to use the sim to control the quadcopter.  This agent is even simpler than the sample agent that you'll examine (in `agents/policy_search.py`) later in this notebook!

The agent controls the quadcopter by setting the revolutions per second on each of its four rotors.  The provided agent in the `Basic_Agent` class below always selects a random action for each of the four rotors.  These four speeds are returned by the `act` method as a list of four floating-point numbers.  

For this project, the agent that you will implement in `agents/agent.py` will have a far more intelligent method for selecting actions!

In [None]:
import random

class Basic_Agent():
    def __init__(self, task):
        self.task = task
    
    def act(self):
        new_thrust = random.gauss(450., 25.)
        return [new_thrust + random.gauss(0., 1.) for x in range(4)]

Run the code cell below to have the agent select actions to control the quadcopter.  

Feel free to change the provided values of `runtime`, `init_pose`, `init_velocities`, and `init_angle_velocities` below to change the starting conditions of the quadcopter.

The `labels` list below annotates statistics that are saved while running the simulation.  All of this information is saved in a text file `data.txt` and stored in the dictionary `results`.  

In [None]:
%load_ext autoreload
%autoreload 2

import csv
import numpy as np
from task import Task

# Modify the values below to give the quadcopter a different starting position.
runtime = 5.                                     # time limit of the episode
init_pose = np.array([0., 0., 10., 0., 0., 0.])  # initial pose # [x, y, z, phi, theta, psi]
init_velocities = np.array([0., 0., 0.])         # initial velocities # x_velocity, y_velocity, z_velocity
init_angle_velocities = np.array([0., 0., 0.])   # initial angle velocities # phi_velocity, theta_velocity, psi_velocity
file_output = 'data.txt'                         # file name for saved results

# Setup
task = Task(init_pose, init_velocities, init_angle_velocities, runtime)
agent = Basic_Agent(task)
done = False
labels = ['time', 'x', 'y', 'z', 'phi', 'theta', 'psi', 'x_velocity',
          'y_velocity', 'z_velocity', 'phi_velocity', 'theta_velocity',
          'psi_velocity', 'rotor_speed1', 'rotor_speed2', 'rotor_speed3', 'rotor_speed4']
results = {x : [] for x in labels}

# Run the simulation, and save the results.
with open(file_output, 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(labels)
    
    # reset environment
    initial_state = task.reset ()
    print ('initial state: {} {} {}'.format (initial_state[:6], initial_state[6:12], initial_state[12:]))
    
    while True:
        rotor_speeds = agent.act()
        _, _, done = task.step(rotor_speeds)
        to_write = [task.sim.time] + list(task.sim.pose) + list(task.sim.v) + list(task.sim.angular_v) + list(rotor_speeds)
        for ii in range(len(labels)):
            results[labels[ii]].append(to_write[ii])
        writer.writerow(to_write)
        
        if done:
            break

Run the code cell below to visualize how the position of the quadcopter evolved during the simulation. The second panel plots the Euler angles (the rotation of the quadcopter about the $x$-, $y$-, and $z$-axes).

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, axs = plt.subplots (1, 2)

axs[0].plot(results['time'], results['x'], label='x')
axs[0].plot(results['time'], results['y'], label='y')
axs[0].plot(results['time'], results['z'], label='z')
axs[0].legend(loc='upper left')
axs[0].set (title='position')

axs[1].plot(results['time'], results['phi'], label='phi')
axs[1].plot(results['time'], results['theta'], label='theta')
axs[1].plot(results['time'], results['psi'], label='psi')
axs[1].legend(loc='upper left')
axs[1].set (title='Euler angles')

_ = plt.ylim()
fig.set_size_inches ((14., 6.), forward=True)
plt.show()

The next code cell visualizes the velocity of the quadcopter, together with the angular velocities (in radians per second) corresponding to each of the Euler angles.

In [None]:
fig, axs = plt.subplots (1, 2)

axs[0].plot(results['time'], results['x_velocity'], label='x_hat')
axs[0].plot(results['time'], results['y_velocity'], label='y_hat')
axs[0].plot(results['time'], results['z_velocity'], label='z_hat')
axs[0].legend(loc='upper left')
axs[0].set (title='velocity')

axs[1].plot(results['time'], results['phi_velocity'], label='phi_velocity')
axs[1].plot(results['time'], results['theta_velocity'], label='theta_velocity')
axs[1].plot(results['time'], results['psi_velocity'], label='psi_velocity')
axs[1].legend(loc='upper left')
axs[1].set (title='velocity in Euler angles')

_ = plt.ylim()
fig.set_size_inches ((14., 6.), forward=True)
plt.show()

Finally, you can use the code cell below to plot the agent's choice of actions.  

In [None]:
plt.plot(results['time'], results['rotor_speed1'], label='Rotor 1 revolutions / second')
plt.plot(results['time'], results['rotor_speed2'], label='Rotor 2 revolutions / second')
plt.plot(results['time'], results['rotor_speed3'], label='Rotor 3 revolutions / second')
plt.plot(results['time'], results['rotor_speed4'], label='Rotor 4 revolutions / second')
plt.legend()
_ = plt.ylim()

When specifying a task, you will derive the environment state from the simulator.  Run the code cell below to print the values of the following variables at the end of the simulation:
- `task.sim.pose` (the position of the quadcopter in ($x,y,z$) dimensions and the Euler angles),
- `task.sim.v` (the velocity of the quadcopter in ($x,y,z$) dimensions), and
- `task.sim.angular_v` (radians/second for each of the three Euler angles).

In [None]:
# the pose, velocity, and angular velocity of the quadcopter at the end of the episode
print(task.sim.pose)
print(task.sim.v)
print(task.sim.angular_v)

In the sample task in `task.py`, we use the 6-dimensional pose of the quadcopter to construct the state of the environment at each timestep.  However, when amending the task for your purposes, you are welcome to expand the size of the state vector by including the velocity information.  You can use any combination of the pose, velocity, and angular velocity - feel free to tinker here, and construct the state to suit your task.
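As a minimal sketch of how a larger state vector could be assembled, the snippet below concatenates pose, velocity, and angular velocity into one array. The helper name `build_state` is hypothetical and not part of the provided files; the example values only mimic the shapes of `task.sim.pose`, `task.sim.v`, and `task.sim.angular_v`.

```python
import numpy as np

def build_state(pose, v, angular_v):
    """Concatenate pose (6), linear velocity (3) and angular velocity (3)."""
    return np.concatenate([pose, v, angular_v])

# Example values shaped like task.sim.pose, task.sim.v and task.sim.angular_v.
pose = np.array([0., 0., 10., 0., 0., 0.])
v = np.zeros(3)
angular_v = np.zeros(3)

state = build_state(pose, v, angular_v)
print(state.shape)  # (12,)
```

If you extend the state like this, remember that `state_size` in your task has to grow accordingly (and is further multiplied by the number of action repeats).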

## The Task

A sample task has been provided for you in `task.py`.  Open this file in a new window now. 

The `__init__()` method is used to initialize several variables that are needed to specify the task.  
- The simulator is initialized as an instance of the `PhysicsSim` class (from `physics_sim.py`).  
- Inspired by the methodology in the original DDPG paper, we make use of action repeats.  For each timestep of the agent, we step the simulation `action_repeats` timesteps.  If you are not familiar with action repeats, please read the **Results** section in [the DDPG paper](https://arxiv.org/abs/1509.02971).
- We set the number of elements in the state vector.  For the sample task, we only work with the 6-dimensional pose information.  To set the size of the state (`state_size`), we must take action repeats into account.  
- The environment will always have a 4-dimensional action space, with one entry for each rotor (`action_size=4`). You can set the minimum (`action_low`) and maximum (`action_high`) values of each entry here.
- The sample task in this provided file is for the agent to reach a target position.  We specify that target position as a variable.

The `reset()` method resets the simulator.  The agent should call this method every time the episode ends.  You can see an example of this in the code cell below.

The `step()` method is perhaps the most important.  It accepts the agent's choice of action `rotor_speeds`, which is used to prepare the next state to pass on to the agent.  Then, the reward is computed from `get_reward()`.  The episode is considered done if the time limit has been exceeded, or the quadcopter has travelled outside of the bounds of the simulation.
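The interplay of `step()`, action repeats, and `get_reward()` can be sketched as follows. This is an illustration under stated assumptions, not the contents of `task.py`: `ToySim` is a hypothetical stand-in for the real simulator with made-up dynamics, and the method name `next_timestep` and the reward formula are assumed for the example.

```python
import numpy as np

class ToySim:
    """Hypothetical stand-in for the simulator: tracks time and a 6-d pose."""
    def __init__(self, runtime=5.0, dt=0.02):
        self.runtime, self.dt = runtime, dt
        self.time = 0.0
        self.pose = np.array([0., 0., 10., 0., 0., 0.])

    def next_timestep(self, rotor_speeds):
        # Toy dynamics (assumption): altitude drifts with mean rotor speed.
        self.pose = self.pose.copy()
        self.pose[2] += (np.mean(rotor_speeds) - 400.0) * 1e-3
        self.time += self.dt
        return self.time > self.runtime  # done flag


def step(sim, rotor_speeds, get_reward, action_repeats=3):
    """Apply one agent action for `action_repeats` simulator steps."""
    reward, poses = 0.0, []
    for _ in range(action_repeats):
        done = sim.next_timestep(rotor_speeds)
        reward += get_reward(sim)   # accumulate reward over the repeats
        poses.append(sim.pose)      # stack the pose of every repeat
    return np.concatenate(poses), reward, done


target = np.array([0., 0., 10.])

def reward_fn(sim):
    # Assumed reward: 1 at the target, minus 0.3 per unit of L1 distance.
    return 1.0 - 0.3 * np.abs(sim.pose[:3] - target).sum()

sim = ToySim()
next_state, reward, done = step(sim, [400.] * 4, reward_fn)
print(next_state.shape, reward, done)
```

The returned state has `action_repeats * 6` entries, which is why `state_size` must take action repeats into account.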

In the next section, you will learn how to test the performance of an agent on this task.

## The Agent

The sample agent given in `agents/policy_search.py` uses a very simplistic linear policy to directly compute the action vector as a dot product of the state vector and a matrix of weights. Then, it randomly perturbs the parameters by adding some Gaussian noise, to produce a different policy. Based on the average reward obtained in each episode (`score`), it keeps track of the best set of parameters found so far, how the score is changing, and accordingly tweaks a scaling factor to widen or tighten the noise.
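The idea described above (a linear policy, random perturbation, and an adaptive noise scale) can be sketched on a toy problem. This is not the code in `agents/policy_search.py`: the scoring function `evaluate`, the random `states`, and the constants for tightening and widening the noise are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(w, states):
    """Toy score (an assumption): actions should be close to 1.0."""
    actions = states @ w          # linear policy: action = state . weights
    return -np.mean((actions - 1.0) ** 2)

state_size, action_size = 6, 4
states = rng.normal(size=(32, state_size))

w = rng.normal(scale=0.1, size=(state_size, action_size))
best_w, best_score = w.copy(), evaluate(w, states)
initial_score = best_score
noise_scale = 0.1

for _ in range(200):
    # Perturb the best weights found so far with Gaussian noise.
    w = best_w + noise_scale * rng.normal(size=best_w.shape)
    score = evaluate(w, states)
    if score > best_score:
        best_w, best_score = w.copy(), score
        noise_scale = max(0.5 * noise_scale, 0.01)  # tighten the search
    else:
        noise_scale = min(2.0 * noise_scale, 3.2)   # widen the search

print(initial_score, best_score)
```

Because a candidate only replaces the best weights when it scores higher, the best score never decreases, but nothing guarantees that the search finds a good policy.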

Run the code cell below to see how the agent performs on the sample task.

In [None]:
import sys
import pandas as pd
from agents.policy_search import PolicySearch_Agent
from task import Task

num_episodes = 1000
target_pos = np.array([0., 0., 10.])
task = Task(target_pos=target_pos)
agent = PolicySearch_Agent(task) 

for i_episode in range(1, num_episodes+1):
    state = agent.reset_episode() # start a new episode
    while True:
        action = agent.act(state) 
        next_state, reward, done = task.step(action)
        agent.step(reward, done)
        state = next_state
        if done:
            print("\rEpisode = {:4d}, score = {:7.3f} (best = {:7.3f}), noise_scale = {}".format(
                i_episode, agent.score, agent.best_score, agent.noise_scale), end="")  # [debug]
            break
    sys.stdout.flush()

This agent should perform very poorly on this task.  And that's where you come in!

## Define the Task, Design the Agent, and Train Your Agent!

Amend `task.py` to specify a task of your choosing.  If you're unsure what kind of task to specify, you may like to teach your quadcopter to take off, hover in place, land softly, or reach a target pose.  

After specifying your task, use the sample agent in `agents/policy_search.py` as a template to define your own agent in `agents/agent.py`.  You can borrow whatever you need from the sample agent, including ideas on how you might modularize your code (using helper methods like `act()`, `learn()`, `reset_episode()`, etc.).

Note that it is **highly unlikely** that the first agent and task that you specify will learn well.  You will likely have to tweak various hyperparameters and the reward function for your task until you arrive at reasonably good behavior.

As you develop your agent, it's important to keep an eye on how it's performing. Use the code above as inspiration to build in a mechanism to log/save the total rewards obtained in each episode to file.  If the episode rewards are gradually increasing, this is an indication that your agent is learning.
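One possible logging mechanism is sketched below with the standard `csv` module. The function names, the file name `rewards.csv`, and the example reward values are hypothetical; in a real run the list would be filled with the total reward of each training episode.

```python
import csv

def save_episode_rewards(path, episode_rewards):
    """Write one (episode, total_reward) row per episode to a CSV file."""
    with open(path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['episode', 'total_reward'])
        for i_episode, total in enumerate(episode_rewards, start=1):
            writer.writerow([i_episode, total])

def load_episode_rewards(path):
    """Read the total rewards back as a list of floats."""
    with open(path, newline='') as csvfile:
        return [float(row['total_reward']) for row in csv.DictReader(csvfile)]

# Made-up values standing in for the rewards collected during training.
rewards = [-12.5, -8.0, -3.25]
save_episode_rewards('rewards.csv', rewards)
print(load_episode_rewards('rewards.csv'))
```

Keeping the rewards in a file makes it possible to compare several training runs after the notebook kernel has been restarted.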

***
### RL system - General
<img src=".\stuff\rl.system.gen.PNG" width="50%" \>

In general, an MDP consists of
- a Model (defined by transitions T)
- a finite State space S
- a finite Action space A
- Rewards R

RL algorithms can be divided into
- model-based algorithms, which need T and R
- value-based (model-free) algorithms, which need Q
- policy-based

Basically, most RL algorithms are formulated for, and work well with, discrete spaces (States, Actions).

### Quadcopter
- episodic task
- Goal: reach a target position - Pos (x, y, z) = (0, 0, 10)

#### State space S
Currently, every state holds the 6 degrees of freedom
- 3 translational, position (x, y, z) and
- 3 rotational, Euler angles ($\varphi, \vartheta, \psi$)

The state space is large and its values are continuous, but it is bounded by the physics simulation. Leaving these bounds or exceeding the runtime terminates the episode.

Because the state space is continuous, we need function approximation so that the agent can choose good actions while keeping computation time manageable.

Action repeats are used "in order to make the problems approximately fully observable in the high dimensional environment [...]. For each timestep of the agent, we step the simulation 3 timesteps, repeating the agent’s action and rendering each time" (arXiv:1509.02971, p. 5).

In [None]:
task = Task ()

print ('Size of a state (3 action repeats * 6 dof):', task.state_size)

print ('S lower boundary', task.sim.lower_bounds)
print ('S upper boundary', task.sim.upper_bounds)

#### Action space A
One action consists of 4 independent values, representing the thrust / rotor speeds.

The action space values are continuous.

In [None]:
print ('Size of an action:', task.action_size)
print ('A boundaries:', (task.action_low, task.action_high))

#### Reward function R
Currently, the reward function is
- R = 1 if the target position is reached
- else: R = 1 - 30% of the absolute difference between the actual position and the target position

Thus, the reward will be a large negative scalar if the quadcopter is far away from the target position.
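A minimal sketch of this reward is given below. It assumes that "absolute difference" means the sum of the absolute differences over x, y, and z; the function signature is hypothetical and differs from the `get_reward()` method of the task class.

```python
import numpy as np

def get_reward(position, target_pos):
    """1 at the target, decreasing linearly with the summed absolute error."""
    return 1.0 - 0.3 * np.abs(position - target_pos).sum()

target_pos = np.array([0., 0., 10.])
at_target = get_reward(np.array([0., 0., 10.]), target_pos)
on_ground = get_reward(np.array([0., 0., 0.]), target_pos)
print(at_target, on_ground)
```

Under this assumption a quadcopter 10 units below the target receives roughly 1 - 0.3 * 10 = -2 per step, so long episodes far from the target accumulate strongly negative returns.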

### My Environment
The environment is derived from the given class `Task`.

In [None]:
import numpy as np
from task import Task  # importing the module as `task` would shadow the Task instance above

class MyTask (Task):
    
    def __init__ (self):
        Task.__init__ (self)
        
        # Goal
        self.target_pos = np.array ([0., 0., 10.])
    
    # reward function (overwrite)
    def get_reward (self):
        return super ().get_reward ()
    

env = MyTask ()

### My Agent
The agent is defined in <a href=".\agents\agent.py">agent.py</a>.

This agent discretizes the actions:
- 3 actions: -1 = 'decrease_speed', 0 = 'no_action', 1 = 'increase_speed'
- if the boundaries of the action space would be exceeded, 'no_action' is taken instead
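The discretization above can be sketched for a single rotor as follows. The step size `delta` and the bounds used in the example are assumed values for illustration; the agent in `agent.py` may implement this differently.

```python
def apply_discrete_action(rotor_speed, action, delta, action_low, action_high):
    """Return the new rotor speed for an action in {-1, 0, 1}."""
    new_speed = rotor_speed + action * delta
    if new_speed < action_low or new_speed > action_high:
        return rotor_speed  # out of bounds: fall back to 'no_action'
    return new_speed

# Assumed bounds and step size, for illustration only.
low, high, delta = 0.0, 900.0, 10.0
print(apply_discrete_action(450.0, 1, delta, low, high))   # 460.0
print(apply_discrete_action(895.0, 1, delta, low, high))   # 895.0
print(apply_discrete_action(5.0, -1, delta, low, high))    # 5.0
```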

Actions per case.
The quadcopter has 4 motors:
<pre>
       front
     (1)   (2)
        \ /
left     o     right
        / \
     (3)   (4)
        rear
</pre>
Motors (1) and (4) rotate in direction dir1; motors (2) and (3) rotate in direction dir2.

- pitch +: increase speed of (3) and (4)
- pitch -: increase speed of (1) and (2)
- roll +: increase speed of (1) and (3)
- roll -: increase speed of (2) and (4)
- yaw +: increase speed of (2) and (3)
- yaw -: increase speed of (1) and (4)
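The case table above can be written as a small lookup, sketched below. Motor numbers follow the diagram; the names `MANEUVERS` and `apply_maneuver` and the speed increment `delta` are hypothetical.

```python
# Which motors speed up for each maneuver (numbers as in the diagram).
MANEUVERS = {
    'pitch+': (3, 4),
    'pitch-': (1, 2),
    'roll+': (1, 3),
    'roll-': (2, 4),
    'yaw+': (2, 3),
    'yaw-': (1, 4),
}

def apply_maneuver(rotor_speeds, maneuver, delta):
    """Return a copy of rotor_speeds with the maneuver's motors sped up."""
    new_speeds = list(rotor_speeds)
    for motor in MANEUVERS[maneuver]:
        new_speeds[motor - 1] += delta  # motors are numbered from 1
    return new_speeds

print(apply_maneuver([400., 400., 400., 400.], 'yaw+', 10.))
# [400.0, 410.0, 410.0, 400.0]
```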

#### Approach
- continuous S
- continuous A
- Q-Learning
  - off-policy method: the agent can follow an exploratory policy while learning the greedy one, which eases the exploration-exploitation dilemma
  - supports batch learning (for experience replay)
- option: experience replay
- option: function approximation
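The two building blocks named above, experience replay and the Q-learning target, can be sketched as follows. `ReplayMemory` and `q_targets` are hypothetical names, not the classes in `agents/agent.py`, and the Q-values in the example are made-up numbers standing in for the output of a function approximator.

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) tuples."""
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)  # oldest entries drop out first

    def store(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)


def q_targets(rewards, next_q_values, dones, gamma=0.99):
    """r + gamma * max_a' Q(s', a'), without bootstrapping on terminal steps."""
    return rewards + gamma * np.max(next_q_values, axis=1) * (1.0 - dones)


memory = ReplayMemory(max_size=3)
for i in range(5):
    memory.store((i, 0, 1.0, i + 1))
batch = memory.sample(2)

rewards = np.array([1.0, 1.0])
next_q = np.array([[0.5, 2.0], [3.0, 1.0]])  # made-up Q(s', a') values
dones = np.array([0.0, 1.0])                 # second transition is terminal
targets = q_targets(rewards, next_q, dones, gamma=0.5)
print(targets)  # [2. 1.]
```

Sampling uniformly from the buffer breaks the correlation between consecutive transitions, which is the usual motivation for experience replay.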

In [None]:
#DQN
import tensorflow as tf



In [None]:
## TODO: Train your agent here.
# first try with discrete actions
from agents.agent import Agent

num_episodes = 1000
env = MyTask ()
agent = Agent ()

"""
for i_episode in range(1, num_episodes+1):
    state = agent.reset_episode() # start a new episode
    while True:
        action = agent.act(state) 
        next_state, reward, done = task.step(action)
        agent.step(reward, done)
        state = next_state
        if done:
            print("\rEpisode = {:4d}, score = {:7.3f} (best = {:7.3f}), noise_scale = {}".format(
                i_episode, agent.score, agent.best_score, agent.noise_scale), end="")  # [debug]
            break
    sys.stdout.flush()
"""

for i_episode in range (1, num_episodes+1):
    # begin the episode
    state = env.reset ()
    print (state)

    while True:
        # agent selects an action
        action = agent.select_action (state)
        # agent performs the selected action; Task.step() returns three values
        next_state, reward, done = env.step (action)

        # agent performs internal updates based on sampled experience
        agent.step (state, action, reward, next_state, done)

        # update the state (s <- s') to next time step
        state = next_state
        if done:
            print("\rEpisode = {:4d}, score = {:7.3f} (best = {:7.3f}), noise_scale = {}".format(
                i_episode, agent.score, agent.best_score, agent.noise_scale), end="")  # [debug]
            break  # leave the step loop only once the episode has ended


## Plot the Rewards

Once you are satisfied with your performance, plot the episode rewards, either from a single run, or averaged over multiple runs. 

In [None]:
## TODO: Plot the rewards.

## Reflections

**Question 1**: Describe the task that you specified in `task.py`.  How did you design the reward function?

**Answer**:

**Question 2**: Discuss your agent briefly, using the following questions as a guide:

- What learning algorithm(s) did you try? What worked best for you?
- What was your final choice of hyperparameters (such as $\alpha$, $\gamma$, $\epsilon$, etc.)?
- What neural network architecture did you use (if any)? Specify layers, sizes, activation functions, etc.

**Answer**:

**Question 3**: Using the episode rewards plot, discuss how the agent learned over time.

- Was it an easy task to learn or hard?
- Was there a gradual learning curve, or an aha moment?
- How good was the final performance of the agent? (e.g. mean rewards over the last 10 episodes)

**Answer**:

**Question 4**: Briefly summarize your experience working on this project. You can use the following prompts for ideas.

- What was the hardest part of the project? (e.g. getting started, plotting, specifying the task, etc.)
- Did you find anything interesting in how the quadcopter or your agent behaved?

**Answer**:

In [None]:
import gym
import numpy as np

# Create the Cart-Pole game environment
env = gym.make('CartPole-v1')

In [None]:
from agents import agent
import tensorflow as tf
import time

tf.reset_default_graph()
ag = agent.Agentv2 (state_size=4, action_size=2)

# Make a bunch of random actions and store the experiences
# Start new episode
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())
for ii in range(20):

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        ag.store ((state, action, reward, next_state))

        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        ag.store ((state, action, reward, next_state))
        state = next_state

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras import optimizers

dqn = agent.MLPQNet (4, 2, 0.001)
dqn.print_architecture ()

In [None]:
# Sample mini-batch from memory
batch = ag._memory.sample (5)
states = np.array ([each[0] for each in batch])
actions = np.array ([each[1] for each in batch])
rewards = np.array ([each[2] for each in batch])
next_states = np.array ([each[3] for each in batch])

next_states[3] = np.zeros (states[0].shape)
print (next_states.shape)

In [None]:
# target values - the values we want to become
p_sa_ = dqn.Net.predict (next_states)
# set target values to 0 for states where episode ends
episode_ends = (next_states == np.zeros (states[0].shape)).all (axis=1)
p_sa_[episode_ends] = np.zeros (ag.action_size)
print (p_sa_)
target_vals = rewards + 0.7 * np.max (p_sa_, axis=1)
print ('target', target_vals)


# get approximate values
approx_vals = dqn.Net.predict (states)
print ('approx', approx_vals)

# update approximate value for taken action to target value
print (actions)
for i in range (approx_vals.shape[0]):
    approx_vals[i,actions[i]] = target_vals[i]

print (approx_vals)

In [None]:
# target values - the values we want to become
p_sa_ = dqn.Net.predict (next_states)
# set target values to 0 for states where episode ends
episode_ends = (next_states == np.zeros (states[0].shape)).all (axis=1)
p_sa_[episode_ends] = np.zeros (ag.action_size)
print (p_sa_)
target_vals = rewards + 0.7 * np.max (p_sa_, axis=1)
print ('target', target_vals)


# get approximate values
approx_vals = dqn.Net.predict (states)
print ('approx', approx_vals)

# update approximate value for taken action to target value
print (actions)
for i in range (approx_vals.shape[0]):
    approx_vals[i,actions[i]] = target_vals[i]

print ('new targets', approx_vals)

for state, action, reward, next_state in batch:
    # if done, make our target reward
    target = reward
    if not (next_state == np.zeros (states[0].shape)).all ():
        # predict the future discounted reward
        p_sa_ = dqn.Net.predict(np.reshape(next_state, [1, 4]))
        print ('action probs ns', p_sa_)
        target = reward + 0.7 * np.max (p_sa_)
        print ('target', target)
        # make the agent to approximately map
        # the current state to future discounted reward
        # We'll call that target_f
        target_f = dqn.Net.predict(np.reshape(state, [1, 4]))
        print ('action probs s', target_f)
        target_f[0][action] = target
        print ('new target', target_f)
        # Train the Neural Net with the state and target_f
        #dqn.Net.fit(state, target_f, epochs=1, verbose=0)

In [None]:
train_episodes = 1000
max_steps = 200
batch_size = 20


with tf.Session() as sess:
    #saver = tf.train.Saver ()
    sess.run (tf.global_variables_initializer())
    
    # Make a bunch of random actions and store the experiences
    # Start new episode
    env.reset()
    # Take one random step to get the pole and cart moving
    state, reward, done, _ = env.step(env.action_space.sample())
    for ii in range(20):

        # Make a random action
        action = env.action_space.sample()
        next_state, reward, done, _ = env.step(action)

        if done:
            # The simulation fails so no next state
            next_state = np.zeros(state.shape)
            # Add experience to memory
            ag.store ((state, action, reward, next_state))

            # Start new episode
            env.reset()
            # Take one random step to get the pole and cart moving
            state, reward, done, _ = env.step(env.action_space.sample())
        else:
            # Add experience to memory
            ag.store ((state, action, reward, next_state))
            state = next_state

    
    # GPI with DQN
    env.reset ()
    state, reward, done, _ = env.step (env.action_space.sample())
    
    step = 0
    rewards_list = []
    for ep in range (1, train_episodes):
        total_reward = 0
        t = 0
        while t < max_steps:
            step += 1
            # Uncomment this next line to watch the training
            # env.render ()

            # Explore or Exploit
            action = ag.act (state, step, sess)

            # Take action, get new state and reward
            next_state, reward, done, _ = env.step (action)

            total_reward += reward

            if done:
                # the episode ends so no next state
                next_state = np.zeros (state.shape)
                t = max_steps

                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training loss: {:.4f}'.format(loss),
                      'Explore P: {:.4f}'.format(ag.explore_p))

                rewards_list.append((ep, total_reward))

                # Add experience to memory
                ag.store ((state, action, reward, next_state))

                # Start new episode
                env.reset ()
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                ag.store ((state, action, reward, next_state))
                state = next_state
                t += 1

            loss = ag.learn (batch_size, sess)
    
    #timestr = time.strftime("%Y%m%d-%H%M%S")
    #saver.save (sess, "agent.brain_"+timestr+".ckpt")

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

eps, rews = np.array(rewards_list).T
smoothed_rews = running_mean(rews, 10)
plt.plot(eps[-len(smoothed_rews):], smoothed_rews)
plt.plot(eps, rews, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total Reward')

In [1]:
import gym
import numpy as np

# Create the Cart-Pole game environment
env = gym.make('CartPole-v1')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


In [2]:
from agents import agent
import tensorflow as tf
import time

agv3 = agent.Agentv3 (4, 2)

train_episodes = 1000
max_steps = 200
batch_size = 20

# Make a bunch of random actions and store the experiences
# Start new episode
env.reset()
# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())
for ii in range(20):

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        # Add experience to memory
        agv3.store ((state, action, reward, next_state))

        # Start new episode
        env.reset()
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        agv3.store ((state, action, reward, next_state))
        state = next_state

# GPI with DQN
env.reset ()
state, reward, done, _ = env.step (env.action_space.sample())

step = 0
rewards_list = []
for ep in range (1, train_episodes):
    total_reward = 0
    t = 0
    while t < max_steps:
        step += 1
        # Uncomment this next line to watch the training
        # env.render ()

        # Explore or Exploit
        action = agv3.act (np.reshape(state, [1, 4]), step)

        # Take action, get new state and reward
        next_state, reward, done, _ = env.step (action)

        total_reward += reward

        if done:
            # the episode ends so no next state
            next_state = np.zeros (state.shape)
            t = max_steps
            
            print('Episode: {}'.format(ep),
                  'Total reward: {}'.format(total_reward),
                  'Training loss: {}'.format(hist[0].history['loss']),
                  'Explore P: {:.4f}'.format(agv3.explore_p))

            rewards_list.append((ep, total_reward))

            # Add experience to memory
            agv3.store ((state, action, reward, next_state))

            # Start new episode
            env.reset ()
            # Take one random step to get the pole and cart moving
            state, reward, done, _ = env.step(env.action_space.sample())

        else:
            # Add experience to memory
            agv3.store ((state, action, reward, next_state))
            state = next_state
            t += 1

        hist = agv3.learn (batch_size)

Using TensorFlow backend.


Episode: 1 Total reward: 16.0 Training loss: [0.4688514769077301] Explore P: 0.9984
Episode: 2 Total reward: 61.0 Training loss: [0.39209699630737305] Explore P: 0.9924
Episode: 3 Total reward: 37.0 Training loss: [0.47647225856781006] Explore P: 0.9888
Episode: 4 Total reward: 19.0 Training loss: [0.5083566308021545] Explore P: 0.9869
Episode: 5 Total reward: 12.0 Training loss: [0.4448961019515991] Explore P: 0.9857
Episode: 6 Total reward: 12.0 Training loss: [0.39392074942588806] Explore P: 0.9846
Episode: 7 Total reward: 9.0 Training loss: [0.47412946820259094] Explore P: 0.9837
Episode: 8 Total reward: 22.0 Training loss: [0.48176196217536926] Explore P: 0.9816
Episode: 9 Total reward: 11.0 Training loss: [0.48420068621635437] Explore P: 0.9805
Episode: 10 Total reward: 10.0 Training loss: [0.45781558752059937] Explore P: 0.9795
Episode: 11 Total reward: 26.0 Training loss: [0.3893066644668579] Explore P: 0.9770
Episode: 12 Total reward: 42.0 Training loss: [0.45747628808021545] 

Episode: 98 Total reward: 15.0 Training loss: [0.40715810656547546] Explore P: 0.8004
Episode: 99 Total reward: 15.0 Training loss: [0.41753560304641724] Explore P: 0.7992
Episode: 100 Total reward: 29.0 Training loss: [0.3715587258338928] Explore P: 0.7969
Episode: 101 Total reward: 19.0 Training loss: [0.3689105808734894] Explore P: 0.7954
Episode: 102 Total reward: 19.0 Training loss: [0.40496885776519775] Explore P: 0.7939
Episode: 103 Total reward: 16.0 Training loss: [0.3429243862628937] Explore P: 0.7927
Episode: 104 Total reward: 52.0 Training loss: [0.41208159923553467] Explore P: 0.7886
Episode: 105 Total reward: 8.0 Training loss: [0.3405519127845764] Explore P: 0.7880
Episode: 106 Total reward: 10.0 Training loss: [0.3619441092014313] Explore P: 0.7872
Episode: 107 Total reward: 8.0 Training loss: [0.4001903831958771] Explore P: 0.7866
Episode: 108 Total reward: 30.0 Training loss: [0.42109522223472595] Explore P: 0.7843
Episode: 109 Total reward: 10.0 Training loss: [0.339

Episode: 193 Total reward: 24.0 Training loss: [0.3987644910812378] Explore P: 0.6829
Episode: 194 Total reward: 25.0 Training loss: [0.34927046298980713] Explore P: 0.6812
Episode: 195 Total reward: 10.0 Training loss: [0.3276810050010681] Explore P: 0.6806
Episode: 196 Total reward: 25.0 Training loss: [0.32336530089378357] Explore P: 0.6789
Episode: 197 Total reward: 25.0 Training loss: [0.4382794499397278] Explore P: 0.6772
Episode: 198 Total reward: 13.0 Training loss: [0.3519129753112793] Explore P: 0.6763
Episode: 199 Total reward: 14.0 Training loss: [0.3936276435852051] Explore P: 0.6754
Episode: 200 Total reward: 22.0 Training loss: [0.35049647092819214] Explore P: 0.6739
Episode: 201 Total reward: 15.0 Training loss: [0.3883589506149292] Explore P: 0.6730
Episode: 202 Total reward: 8.0 Training loss: [0.3456515669822693] Explore P: 0.6724
Episode: 203 Total reward: 12.0 Training loss: [0.3439328670501709] Explore P: 0.6716
Episode: 204 Total reward: 14.0 Training loss: [0.32

Episode: 288 Total reward: 14.0 Training loss: [0.3327372968196869] Explore P: 0.5881
Episode: 289 Total reward: 12.0 Training loss: [0.32124054431915283] Explore P: 0.5874
Episode: 290 Total reward: 10.0 Training loss: [0.32159867882728577] Explore P: 0.5868
Episode: 291 Total reward: 23.0 Training loss: [0.0008080892730504274] Explore P: 0.5855
Episode: 292 Total reward: 12.0 Training loss: [0.32261329889297485] Explore P: 0.5848
Episode: 293 Total reward: 19.0 Training loss: [0.37269487977027893] Explore P: 0.5837
Episode: 294 Total reward: 10.0 Training loss: [0.37318286299705505] Explore P: 0.5831
Episode: 295 Total reward: 10.0 Training loss: [0.3660750985145569] Explore P: 0.5826
Episode: 296 Total reward: 11.0 Training loss: [0.3329896032810211] Explore P: 0.5819
Episode: 297 Total reward: 11.0 Training loss: [0.34174370765686035] Explore P: 0.5813
Episode: 298 Total reward: 15.0 Training loss: [0.3299858272075653] Explore P: 0.5804
Episode: 299 Total reward: 7.0 Training loss:

Episode: 383 Total reward: 10.0 Training loss: [0.325427383184433] Explore P: 0.5203
Episode: 384 Total reward: 12.0 Training loss: [0.3306494355201721] Explore P: 0.5196
Episode: 385 Total reward: 15.0 Training loss: [0.3282284736633301] Explore P: 0.5189
Episode: 386 Total reward: 10.0 Training loss: [0.35573700070381165] Explore P: 0.5184
Episode: 387 Total reward: 13.0 Training loss: [0.35525602102279663] Explore P: 0.5177
Episode: 388 Total reward: 10.0 Training loss: [0.3276337683200836] Explore P: 0.5172
Episode: 389 Total reward: 9.0 Training loss: [0.35706570744514465] Explore P: 0.5167
Episode: 390 Total reward: 12.0 Training loss: [0.3359619081020355] Explore P: 0.5161
Episode: 391 Total reward: 13.0 Training loss: [0.3326334059238434] Explore P: 0.5155
Episode: 392 Total reward: 14.0 Training loss: [0.3550807535648346] Explore P: 0.5148
Episode: 393 Total reward: 11.0 Training loss: [0.33599162101745605] Explore P: 0.5142
Episode: 394 Total reward: 9.0 Training loss: [0.323

Episode: 478 Total reward: 18.0 Training loss: [0.35328003764152527] Explore P: 0.4620
Episode: 479 Total reward: 11.0 Training loss: [0.3296900689601898] Explore P: 0.4615
Episode: 480 Total reward: 8.0 Training loss: [0.331447958946228] Explore P: 0.4611
Episode: 481 Total reward: 10.0 Training loss: [0.32737964391708374] Explore P: 0.4607
Episode: 482 Total reward: 15.0 Training loss: [0.34854409098625183] Explore P: 0.4600
Episode: 483 Total reward: 14.0 Training loss: [0.3293924033641815] Explore P: 0.4594
Episode: 484 Total reward: 16.0 Training loss: [0.34883958101272583] Explore P: 0.4586
Episode: 485 Total reward: 9.0 Training loss: [0.3313477337360382] Explore P: 0.4582
Episode: 486 Total reward: 10.0 Training loss: [0.32101771235466003] Explore P: 0.4578
Episode: 487 Total reward: 12.0 Training loss: [0.34638458490371704] Explore P: 0.4572
Episode: 488 Total reward: 12.0 Training loss: [0.3222644031047821] Explore P: 0.4567
Episode: 489 Total reward: 9.0 Training loss: [0.32

Episode: 573 Total reward: 16.0 Training loss: [0.3292274475097656] Explore P: 0.4085
Episode: 574 Total reward: 14.0 Training loss: [0.3269740641117096] Explore P: 0.4080
Episode: 575 Total reward: 11.0 Training loss: [0.3408374786376953] Explore P: 0.4075
Episode: 576 Total reward: 14.0 Training loss: [0.00023407650587614626] Explore P: 0.4070
Episode: 577 Total reward: 10.0 Training loss: [9.919132025970612e-06] Explore P: 0.4066
Episode: 578 Total reward: 10.0 Training loss: [0.32730984687805176] Explore P: 0.4062
Episode: 579 Total reward: 13.0 Training loss: [0.32688215374946594] Explore P: 0.4057
Episode: 580 Total reward: 15.0 Training loss: [0.32345443964004517] Explore P: 0.4051
Episode: 581 Total reward: 9.0 Training loss: [1.1551023817446548e-05] Explore P: 0.4047
Episode: 582 Total reward: 10.0 Training loss: [0.3272782564163208] Explore P: 0.4043
Episode: 583 Total reward: 10.0 Training loss: [0.34412136673927307] Explore P: 0.4039
Episode: 584 Total reward: 11.0 Training

Episode: 668 Total reward: 10.0 Training loss: [0.32372692227363586] Explore P: 0.3635
Episode: 669 Total reward: 10.0 Training loss: [0.3205052316188812] Explore P: 0.3631
Episode: 670 Total reward: 9.0 Training loss: [0.3362914025783539] Explore P: 0.3628
Episode: 671 Total reward: 9.0 Training loss: [0.32577717304229736] Explore P: 0.3625
Episode: 672 Total reward: 12.0 Training loss: [0.34477970004081726] Explore P: 0.3621
Episode: 673 Total reward: 10.0 Training loss: [0.3204583525657654] Explore P: 0.3617
Episode: 674 Total reward: 11.0 Training loss: [0.32689714431762695] Explore P: 0.3613
Episode: 675 Total reward: 13.0 Training loss: [0.3455386757850647] Explore P: 0.3609
Episode: 676 Total reward: 16.0 Training loss: [0.3218079209327698] Explore P: 0.3603
Episode: 677 Total reward: 10.0 Training loss: [0.3443819284439087] Explore P: 0.3600
Episode: 678 Total reward: 12.0 Training loss: [0.32524117827415466] Explore P: 0.3595
Episode: 679 Total reward: 16.0 Training loss: [0.3

Episode: 763 Total reward: 24.0 Training loss: [7.786057949488168e-07] Explore P: 0.3269
Episode: 764 Total reward: 9.0 Training loss: [0.3260384798049927] Explore P: 0.3266
Episode: 765 Total reward: 10.0 Training loss: [0.32506000995635986] Explore P: 0.3263
Episode: 766 Total reward: 9.0 Training loss: [0.320679247379303] Explore P: 0.3260
Episode: 767 Total reward: 11.0 Training loss: [0.32257843017578125] Explore P: 0.3256
Episode: 768 Total reward: 19.0 Training loss: [0.32483866810798645] Explore P: 0.3250
Episode: 769 Total reward: 11.0 Training loss: [0.32857775688171387] Explore P: 0.3247
Episode: 770 Total reward: 16.0 Training loss: [0.3231072425842285] Explore P: 0.3242
Episode: 771 Total reward: 9.0 Training loss: [0.32352548837661743] Explore P: 0.3239
Episode: 772 Total reward: 22.0 Training loss: [0.3219069838523865] Explore P: 0.3232
Episode: 773 Total reward: 13.0 Training loss: [0.32096022367477417] Explore P: 0.3228
Episode: 774 Total reward: 12.0 Training loss: [0

Episode: 858 Total reward: 10.0 Training loss: [0.3227914571762085] Explore P: 0.2931
Episode: 859 Total reward: 16.0 Training loss: [0.3408365547657013] Explore P: 0.2926
Episode: 860 Total reward: 15.0 Training loss: [0.3250669538974762] Explore P: 0.2922
Episode: 861 Total reward: 8.0 Training loss: [0.32116350531578064] Explore P: 0.2920
Episode: 862 Total reward: 8.0 Training loss: [0.3216118812561035] Explore P: 0.2918
Episode: 863 Total reward: 8.0 Training loss: [0.3225131928920746] Explore P: 0.2915
Episode: 864 Total reward: 12.0 Training loss: [0.32264742255210876] Explore P: 0.2912
Episode: 865 Total reward: 14.0 Training loss: [0.3217160105705261] Explore P: 0.2908
Episode: 866 Total reward: 9.0 Training loss: [0.32475510239601135] Explore P: 0.2906
Episode: 867 Total reward: 17.0 Training loss: [0.32390254735946655] Explore P: 0.2901
Episode: 868 Total reward: 9.0 Training loss: [0.34008628129959106] Explore P: 0.2898
Episode: 869 Total reward: 15.0 Training loss: [0.3208

KeyboardInterrupt: 
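
The log above is long and partly truncated, so trends in reward and exploration probability are hard to see by eye. Below is a minimal sketch of how lines in this format could be parsed into records for summarizing or plotting. The names `LOG_LINE` and `parse_log` are hypothetical helpers, not part of the project files, and the regular expression assumes the exact `Episode: ... Total reward: ... Training loss: [...] Explore P: ...` layout printed above. Truncated lines and the traceback line are skipped rather than guessed at.

```python
import re
from statistics import mean

# Hypothetical pattern matching one complete log line as printed above.
LOG_LINE = re.compile(
    r"Episode: (\d+) Total reward: ([\d.]+) "
    r"Training loss: \[([\d.eE+-]+)\] Explore P: ([\d.]+)$"
)

def parse_log(text):
    """Return one dict per complete log line; skip truncated or unrelated lines."""
    records = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # truncated line, blank line, or traceback text
        episode, reward, loss, explore_p = match.groups()
        records.append({
            "episode": int(episode),
            "reward": float(reward),
            "loss": float(loss),
            "explore_p": float(explore_p),
        })
    return records

# A few lines copied from the output above, including one truncated line.
sample = """\
Episode: 576 Total reward: 14.0 Training loss: [0.00023407650587614626] Explore P: 0.4070
Episode: 577 Total reward: 10.0 Training loss: [9.919132025970612e-06] Explore P: 0.4066
Episode: 584 Total reward: 11.0 Training
KeyboardInterrupt:
"""

records = parse_log(sample)
print("parsed lines:", len(records))
print("mean reward:", mean(r["reward"] for r in records))
```

With the full output pasted into `sample`, the same records could be used to plot reward against episode number and check whether the agent is improving as `Explore P` decays.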