# Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import gym
import numpy as np

In [2]:
# Import TensorFlow and check which device (GPU or CPU) it will use
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.8.0
Default GPU Device: /device:GPU:0


>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull its contents into the `gym` directory.

>**Note:** Once OpenAI Gym is cloned, install it with `pip install -e gym/[all]`.

In [3]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




We interact with the simulation through `env`. To show the simulation running, use `env.render()` to render one frame. Passing an action as an integer to `env.step` advances the simulation by one step. `env.action_space` tells you how many actions are possible, and `env.action_space.sample()` returns a random one. This interface is common to all Gym environments. In the Cart-Pole game, there are two possible actions, pushing the cart left or right, encoded as 0 and 1.

Run the code below to step through the simulation with random actions. Uncomment the `env.render()` line to watch it in a window.

In [4]:
env.reset()
for _ in range(10):
    # env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

state, action, reward, done, info: [ 0.00461563 -0.24406416  0.01748367  0.33937548] 0 1.0 False {}
state, action, reward, done, info: [-0.00026566 -0.0491953   0.02427117  0.05225679] 1 1.0 False {}
state, action, reward, done, info: [-0.00124956  0.14557038  0.02531631 -0.23267065] 1 1.0 False {}
state, action, reward, done, info: [ 0.00166185 -0.04990399  0.0206629   0.06788914] 0 1.0 False {}
state, action, reward, done, info: [ 0.00066377  0.14491572  0.02202068 -0.21820357] 1 1.0 False {}
state, action, reward, done, info: [ 0.00356208  0.33971608  0.01765661 -0.50385971] 1 1.0 False {}
state, action, reward, done, info: [ 0.0103564   0.53458479  0.00757941 -0.79092644] 1 1.0 False {}
state, action, reward, done, info: [ 0.0210481   0.72960185 -0.00823911 -1.0812153 ] 1 1.0 False {}
state, action, reward, done, info: [ 0.03564013  0.92483159 -0.02986342 -1.37647225] 1 1.0 False {}
state, action, reward, done, info: [ 0.05413677  1.12031362 -0.05739287 -1.6783431 ] 1 1.0 False {}


To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [5]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor, and $s'$ is the next state reached from state $s$ by taking action $a$.

Previously, we used this equation to learn values for a Q-_table_. For this game, however, the number of possible states is huge. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued, so, floating-point precision aside, there are practically infinitely many states. Instead of a table, we'll use a neural network that approximates the Q-table lookup function.
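To get a feel for why a table does not scale, here is a rough sketch that counts the entries a Q-table would need if each of the four state values were discretized into a fixed number of bins. The bin counts and the helper name `q_table_entries` are arbitrary illustrative choices, not values from this notebook.

```python
# Count Q-table entries for a discretized Cart-Pole state.
# Assumption: each of the 4 state values is split into the same number of bins.
state_dims = 4   # cart position, cart velocity, pole angle, pole angular velocity
n_actions = 2    # push left, push right

def q_table_entries(bins_per_dim):
    """Number of (state, action) cells in a table with the given resolution."""
    return (bins_per_dim ** state_dims) * n_actions

for bins in (10, 100, 1000):
    print(bins, q_table_entries(bins))
```

Even a coarse grid of 100 bins per value already needs 200 million cells, most of which the agent would never visit.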

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q-value $Q(s, a)$ is calculated by passing a state into the network. The state goes through fully connected hidden layers, and the output layer produces one Q-value for each available action.
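As a rough illustration of that state-in, Q-values-out mapping, here is a minimal NumPy sketch of a forward pass through two ReLU hidden layers. It is not the TensorFlow model defined later in this notebook: the weights are random, the layer sizes are assumptions, and `q_values` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

state_size, hidden_size, action_size = 4, 64, 2

# Randomly initialized weights (hypothetical; a trained network would learn these)
W1 = rng.normal(scale=0.1, size=(state_size, hidden_size))
b1 = np.zeros(hidden_size)
W2 = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b2 = np.zeros(hidden_size)
W3 = rng.normal(scale=0.1, size=(hidden_size, action_size))
b3 = np.zeros(action_size)

def q_values(states):
    """Map a batch of states (N, 4) to Q-values (N, 2), one per action."""
    h1 = np.maximum(0.0, states @ W1 + b1)   # first hidden layer, ReLU
    h2 = np.maximum(0.0, h1 @ W2 + b2)       # second hidden layer, ReLU
    return h2 @ W3 + b3                      # linear output: one Q-value per action

state = np.array([[0.0046, -0.2441, 0.0175, 0.3394]])  # one Cart-Pole-like state
q = q_values(state)
greedy_action = int(np.argmax(q, axis=1)[0])
print(q.shape, greedy_action)
```

The greedy action is simply the index of the largest output, which is how the network replaces a row lookup in a Q-table.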

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'} Q(s', a')$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.
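The target and loss computation can be sketched with plain NumPy on a tiny hand-made mini-batch. All the numbers below are made up for illustration, and the value of `gamma` is an assumption. The `dones` mask drops the bootstrap term for transitions that end an episode.

```python
import numpy as np

gamma = 0.99  # discount factor (assumed value for this example)

# A hypothetical mini-batch of 3 transitions
q_current = np.array([[1.0, 2.0],    # Q(s, .) predicted for each sampled state
                      [0.5, 0.2],
                      [3.0, 1.0]])
q_next = np.array([[2.0, 4.0],       # Q(s', .) predicted for each next state
                   [1.0, 0.0],
                   [0.0, 0.0]])
actions = np.array([1, 0, 0])        # action taken in each transition
rewards = np.array([1.0, 1.0, 0.0])
dones = np.array([False, False, True])

# Target: r + gamma * max_a' Q(s', a'), with no bootstrap term for terminal s'
targets = rewards + gamma * np.max(q_next, axis=1) * (~dones)

# Q(s, a) for the actions actually taken
q_taken = q_current[np.arange(len(actions)), actions]

# Mean squared error between targets and current estimates
loss = np.mean((targets - q_taken) ** 2)
print(targets, q_taken, loss)
```

In a real training step the targets are treated as constants, and only $Q(s, a)$ is differentiated with respect to the weights.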

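Below is my implementation of the Q-network. It uses two fully connected layers with a GRU cell between them; note that the dense layers in the code below are linear, since no activation is passed to `tf.layers.dense`. Two layers seem to be good enough, though three might be better. Feel free to try it out.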

In [6]:
# Data of the model
def model_input(state_size, batch_size, lstm_size):
    # Calculating Qs
    actions = tf.placeholder(tf.int32, [None], name='actions')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    rewards = tf.placeholder(tf.float32, [None], name='rewards')
    
    # Calculated targetQs/nextQs
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
        
    # GRU: Gated Recurrent Units
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    # returning the given data to the model
    return actions, states, rewards, targetQs, cell, initial_state

In [7]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, batch_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        inputs_rnn = tf.reshape(inputs, [1, batch_size, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, outputs

In [10]:
def model_loss(num_classes, lstm_size, batch_size, 
               actions, states, rewards, targetQs, 
               cell, initial_state):
    
    # # Calculating targetQs/nextQs
    # actions_logits = sess.run(model.actions_logits, 
    #                           feed_dict = {model.states: states, model.initial_state: initial_state})
    # rewarded_actions_logits = np.multiply(actions_logits, rewards.reshape([-1, 1]))
    # Qs = np.max(rewarded_actions_logits, axis=1)
            
    # Calculating Qs
    actions_logits, outputs = generator(states=states, cell=cell, initial_state=initial_state,
                                        batch_size=batch_size, lstm_size=lstm_size, num_classes=num_classes)
    rewarded_actions_logits = tf.multiply(actions_logits, tf.reshape(rewards, shape=[-1, 1]))
    actions_onehot = tf.one_hot(indices=actions, depth=num_classes, dtype=actions_logits.dtype)
    Qs_onehot = tf.multiply(rewarded_actions_logits[:-1], actions_onehot[1:])
    Qs = tf.reduce_max(Qs_onehot, axis=1)
    
    # Calculating the loss: logits/predictions vs labels
    loss = tf.reduce_mean(tf.square(Qs - targetQs[1:]))
    #loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs, labels=tf.nn.sigmoid(targetQs[1:])))
    
    return actions_logits, outputs, loss

In [11]:
def model_opt(loss, learning_rate):
    """
    Build the optimization op for the Q-network
    :param loss: Loss tensor for the action-value prediction
    :param learning_rate: Learning rate for the Adam optimizer
    :return: The training op that applies the clipped gradients
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    #grads=tf.gradients(loss, g_vars)

    train_op = tf.train.AdamOptimizer(learning_rate)
    opt = train_op.apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [12]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, batch_size, learning_rate):

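        # Data of the Model: make the data available inside the framework
        # Note: batch_size=1 here because the mini-batch is fed to the RNN as a single
        # sequence (1 x N x H), so the initial state has shape [1, hidden_size]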
        self.actions, self.states, self.rewards, self.targetQs, cell, self.initial_state = model_input(
            state_size=state_size, batch_size=1, lstm_size=hidden_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.outputs, self.loss = model_loss(
            num_classes=action_size, lstm_size=hidden_size, batch_size=batch_size,
            states=self.states, actions=self.actions, rewards=self.rewards, targetQs=self.targetQs, 
            cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, that is, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older ones. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [13]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
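For reference, here is a small standalone sketch of how such a buffer behaves when it fills up and how a random mini-batch could be drawn from it. The class name `ReplayMemory` and the `add` and `sample` methods are illustrative additions; the training loop in this notebook reads `memory.buffer` directly instead.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer; `add` and `sample` are illustrative helper names."""
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        # When the deque is full, appending drops the oldest item on the left
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up temporal ordering
        return random.sample(list(self.buffer), batch_size)

memory_demo = ReplayMemory(max_size=3)
for t in range(5):
    memory_demo.add((t, 'state', 1.0))   # dummy (step, state, reward) tuples

oldest_step = memory_demo.buffer[0][0]   # steps 0 and 1 have been pushed out
batch_demo = memory_demo.sample(2)
print(oldest_step, len(memory_demo.buffer), len(batch_demo))
```

With a capacity of 3 and five insertions, only the three most recent experiences remain, which is the tube behavior described above.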

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent takes a random action, and with probability $1 - \epsilon$ it chooses the action with the highest $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
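A minimal sketch of this schedule and of the action choice follows. The exponential decay formula and the three `explore_*`/`decay_rate` values match the ones used later in this notebook; the helper names `exploration_probability` and `epsilon_greedy` and the example Q-values are hypothetical.

```python
import numpy as np

explore_start = 1.0    # exploration probability at the start of training
explore_stop = 0.01    # floor on the exploration probability
decay_rate = 0.0001    # exponential decay rate per environment step

def exploration_probability(step):
    """Epsilon after `step` environment steps."""
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * step)

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_example = np.array([0.2, 0.7])   # hypothetical Q-values for actions 0 and 1
for step in (0, 10_000, 100_000):
    eps = exploration_probability(step)
    print(step, round(float(eps), 4), epsilon_greedy(q_example, eps, rng))
```

Epsilon starts at 1.0, so the first actions are all random, and it approaches the floor of 0.01 after a few tens of thousands of steps.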

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game also ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t = 1$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**
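The steps above can be sketched end to end on a toy problem. To keep the example self-contained, it swaps in a Q-table for the network and a five-state chain for Cart-Pole; the environment, `step_env`, and all hyperparameter values here are assumptions made for illustration, not the setup used in this notebook.

```python
import random
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
random.seed(0)

n_states, n_actions, goal = 5, 2, 4      # toy chain: states 0..4, goal at state 4
gamma, alpha = 0.9, 0.5                  # discount factor, table learning rate
explore_start, explore_stop, decay_rate = 1.0, 0.1, 0.001

def step_env(s, a):
    """Toy simulator: action 1 moves right, action 0 moves left (floored at 0)."""
    s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
    done = s_next == goal
    return s_next, (1.0 if done else 0.0), done

memory = deque(maxlen=1000)              # memory D
Q = np.zeros((n_states, n_actions))      # tabular stand-in for the Q-network
total_step = 0

for episode in range(300):
    s = 0
    for t in range(50):
        total_step += 1
        eps = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step)
        # Epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step_env(s, a)
        memory.append((s, a, r, s_next, done))       # store the transition
        # Sample a random mini-batch (with replacement, for simplicity)
        for sj, aj, rj, sj_next, dj in random.choices(memory, k=8):
            target = rj if dj else rj + gamma * np.max(Q[sj_next])
            # Table analogue of a gradient step on (target - Q(s, a))^2
            Q[sj, aj] += alpha * (target - Q[sj, aj])
        s = s_next
        if done:
            break

greedy_policy = np.argmax(Q[:goal], axis=1)
print(greedy_policy)
```

Replacing the table update with an optimizer step on a network, and the chain with `env.step`, gives the structure of the deep Q-learning loop listed above.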

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [14]:
# print('state:', np.array(states).shape[1], 
#       'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

In [15]:
# Training parameters
train_episodes = 1000          # max number of episodes to learn from
max_steps = 3000000000         # max steps in an episode
learning_rate = 0.001          # learning rate for adam

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
action_size = 2                # number of units for the output actions -- simulation
batch_size = 50                # number of samples in the memory/ experience as mini-batch size

In [16]:
# Reset/init the graph/session
tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, batch_size=batch_size, 
              learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, 50, 64) (1, 64)
(1, 50, 64) (1, 64)
(50, 64)
(50, 2)


## Populate the memory (experience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [17]:
# Initialize the simulation
env.reset()

# Make a bunch of random actions and store the experiences
for _ in range(batch_size):
    
    # Take one random step to get the pole and cart moving
    action = env.action_space.sample()
    state, _, done, _ = env.step(action)
    reward = 1 - float(done)
    memory.buffer.append((action, state, reward))
    
    # End of the episodes which defines the goal of the episode/mission
    if done:
        # Start new episode
        env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, add an `env.render()` call inside the step loop. This is slow because rendering the frames takes longer than the training step itself. But it's cool to watch the agent get better at the game.

In [18]:
memory.buffer[0][1].shape

(4,)

In [19]:
# states, rewards, actions

In [None]:
# Now train with experiences
saver = tf.train.Saver()

# Per-episode total rewards and average losses, kept for plotting after training
rewards_list = [] # (episode, total reward)
loss_list = [] # (episode, average loss)

# TF session for training
with tf.Session() as sess:
    
    # Initialize/restore variables
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))

    # RNN state initialize for all episodes
    initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
    
    # Explore or exploit parameter
    total_step = 0
    
    # Training episodes/epochs
    for ep in range(train_episodes):
        
        # Start new episode
        env.reset()
        total_reward = 0
        loss_batch = []

        # Training steps/batches
        for _ in range(max_steps): # start=0, step=1, stop=max_steps/done/reward
            
            # Batch from the OLD memory
            batch = memory.buffer
            #actions = np.array([each[0] for each in batch])
            states = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])
            
            # Calculating the model output: next action and final state using OLD memory
            actions_logits, outputs = sess.run([model.actions_logits, model.outputs],
                                      feed_dict = {model.states: states, 
                                                   model.initial_state: initial_state})
            # If reward is 0 or the episode is done, then there should not be any output for state input
            rewarded_outputs = np.multiply(outputs, rewards.reshape([-1, 1]))
            rewarded_actions_logits = np.multiply(actions_logits, rewards.reshape([-1, 1]))
            last_action_logits = rewarded_actions_logits[-1]
            
            # Explore (Env) or Exploit (Model)
            # Take new action, get new state and reward
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from model
                action = np.argmax(last_action_logits)
            state, _, done, _ = env.step(action)
            reward = 1 - float(done)
            
            # New memory (time direction)
            memory.buffer.append((action, state, reward))
            
            # Feedback
            initial_state = rewarded_outputs[0].reshape([1, -1])
                        
            # Cumulative rewards
            total_reward += reward

            # Batch from the NEW memory
            batch = memory.buffer
            actions = np.array([each[0] for each in batch])
            states = np.array([each[1] for each in batch])
            rewards = np.array([each[2] for each in batch])

            # Calculating targetQs/nextQs using NEW memory
            actions_logits = sess.run(model.actions_logits, 
                                      feed_dict = {model.states: states, 
                                                   model.initial_state: initial_state})
            rewarded_actions_logits = np.multiply(actions_logits, rewards.reshape([-1, 1]))
            Qs = np.max(rewarded_actions_logits, axis=1)

            # Updating the model using NEW memory and targetQs/nextQs
            feed_dict = {model.actions: actions,
                         model.states: states, 
                         model.rewards: rewards, 
                         model.targetQs: Qs}
            loss, _ = sess.run([model.loss, model.opt], feed_dict)
            
            # For average loss in one episode/epoch
            loss_batch.append(loss)
            
            # At the end of steps/batches loop
            if done:
                break
                
        # At the end of each episode/epoch
        print('-------------------------------------------------------------------------------')
        print('Episode: {}'.format(ep),
              'Total reward: {}'.format(total_reward),
              'Average loss: {:.4f}'.format(np.mean(loss_batch)),
              'Explore P: {:.4f}'.format(explore_p))
        print('-------------------------------------------------------------------------------')

        # At the end of each episode/epoch
        # total rewards and losses for plotting
        rewards_list.append((ep, total_reward))
        loss_list.append((ep, np.mean(loss_batch)))
        
    # At the end of all training episodes/epochs
    # Save the trained model
    saver.save(sess, 'checkpoints/model-seq.ckpt')

Episode: 0 Total reward: 15.0 Average loss: 0.1261 Explore P: 0.9984
Episode: 1 Total reward: 14.0 Average loss: 0.5455 Explore P: 0.9969
Episode: 2 Total reward: 19.0 Average loss: 1.9856 Explore P: 0.9950
Episode: 3 Total reward: 13.0 Average loss: 0.6913 Explore P: 0.9936
...
Episode: 36 Total reward: 53.0 Average loss: 3.3253 Explore P: 0.9149
Episode: 37 Total reward: 16.0 Average loss: 2.5706 Explore P: 0.9133
Episode: 38 Total reward: 15.0 Average loss: 4.1112 Explore P: 0.9119
Episode: 39 Total reward: 19.0 Average loss: 3.5054 Explore P: 0.9101
...
Episode: 72 Total reward: 33.0 Average loss: 14.3393 Explore P: 0.8313
Episode: 73 Total reward: 33.0 Average loss: 17.4555 Explore P: 0.8285
Episode: 74 Total reward: 21.0 Average loss: 20.2812 Explore P: 0.8267
Episode: 75 Total reward: 17.0 Average loss: 25.4375 Explore P: 0.8253
...
Episode: 108 Total reward: 12.0 Average loss: 85.1939 Explore P: 0.7458
Episode: 109 Total reward: 52.0 Average loss: 71.4152 Explore P: 0.7419
Episode: 110 Total reward: 34.0 Average loss: 45.0769 Explore P: 0.7394
Episode: 111 Total reward: 20.0 Average loss: 82.3169 Explore P: 0.7379
...
Episode: 144 Total reward: 16.0 Average loss: 90.7053 Explore P: 0.6562
Episode: 145 Total reward: 12.0 Average loss: 192.7719 Explore P: 0.6554
Episode: 146 Total reward: 16.0 Average loss: 278.1204 Explore P: 0.6543
Episode: 147 Total reward: 95.0 Average loss: 97.9616 Explore P: 0.6481
...
Episode: 180 Total reward: 89.0 Average loss: 91.8395 Explore P: 0.5651
Episode: 181 Total reward: 23.0 Average loss: 191.3260 Explore P: 0.5637
Episode: 182 Total reward: 149.0 Average loss: 88.1907 Explore P: 0.5555
Episode: 183 Total reward: 26.0 Average loss: 184.5686 Explore P: 0.5540
...
Episode: 216 Total reward: 50.0 Average loss: 307.5875 Explore P: 0.4779
Episode: 217 Total reward: 22.0 Average loss: 243.2514 Explore P: 0.4768
Episode: 218 Total reward: 86.0 Average loss: 209.1240 Explore P: 0.4728
Episode: 219 Total reward: 15.0 Average loss: 474.3240 Explore P: 0.4720
...
Episode: 252 Total reward: 110.0 Average loss: 130.2872 Explore P: 0.4064
Episode: 253 Total reward: 72.0 Average loss: 236.6074 Explore P: 0.4035
Episode: 254 Total reward: 60.0 Average loss: 310.6504 Explore P: 0.4011
Episode: 255 Total reward: 73.0 Average loss: 227.3207 Explore P: 0.3982
...
Episode: 288 Total reward: 50.0 Average loss: 877.0593 Explore P: 0.3319
Episode: 289 Total reward: 106.0 Average loss: 254.2402 Explore P: 0.3285
Episode: 290 Total reward: 13.0 Average loss: 791.9738 Explore P: 0.3280
Episode: 291 Total reward: 17.0 Average loss: 1517.4861 Explore P: 0.3275
...
Episode: 324 Total reward: 11.0 Average loss: 1710.2477 Explore P: 0.2679
Episode: 325 Total reward: 14.0 Average loss: 2331.8733 Explore P: 0.2675
Episode: 326 Total reward: 12.0 Average loss: 1811.4525 Explore P: 0.2672
Episode: 327 Total reward: 199.0 Average loss: 269.6663 Explore P: 0.2621
...
Episode: 360 Total reward: 18.0 Average loss: 738.2315 Explore P: 0.2093
Episode: 361 Total reward: 15.0 Average loss: 886.4027 Explore P: 0.2090
Episode: 362 Total reward: 10.0 Average loss: 1029.5028 Explore P: 0.2088
Episode: 363 Total reward: 13.0 Average loss: 909.3879 Explore P: 0.2085
...
Episode: 396 Total reward: 13.0 Average loss: 558.8606 Explore P: 0.1498
Episode: 397 Total reward: 15.0 Average loss: 963.8141 Explore P: 0.1496
Episode: 398 Total reward: 199.0 Average loss: 199.6927 Explore P: 0.1468
Episode: 399 Total reward: 68.0 Average loss: 360.5232 Explore P: 0.1459
...
Episode: 432 Total reward: 16.0 Average loss: 1085.0983 Explore P: 0.1132
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 433 Total reward: 13.0 Average loss: 1040.9390 Explore P: 0.1131
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 434 Total reward: 127.0 Average loss: 232.9601 Explore P: 0.1118
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 435 Total reward: 199.0 Average loss: 40.9617 Explore P: 0.1098
-------------------------------------------------------------------------------
-----------------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 468 Total reward: 155.0 Average loss: 104.3406 Explore P: 0.0916
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 469 Total reward: 22.0 Average loss: 274.1132 Explore P: 0.0914
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 470 Total reward: 13.0 Average loss: 463.1880 Explore P: 0.0913
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 471 Total reward: 10.0 Average loss: 600.0411 Explore P: 0.0912
-------------------------------------------------------------------------------
-------------------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 504 Total reward: 11.0 Average loss: 599.5098 Explore P: 0.0714
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 505 Total reward: 91.0 Average loss: 281.3035 Explore P: 0.0709
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 506 Total reward: 136.0 Average loss: 71.3439 Explore P: 0.0700
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 507 Total reward: 15.0 Average loss: 289.7291 Explore P: 0.0699
-------------------------------------------------------------------------------
--------------------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 540 Total reward: 35.0 Average loss: 114.1508 Explore P: 0.0584
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 541 Total reward: 11.0 Average loss: 110.6627 Explore P: 0.0584
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 542 Total reward: 10.0 Average loss: 70.5617 Explore P: 0.0583
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 543 Total reward: 9.0 Average loss: 69.3563 Explore P: 0.0583
-------------------------------------------------------------------------------
-----------------------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 576 Total reward: 11.0 Average loss: 218.8190 Explore P: 0.0541
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 577 Total reward: 11.0 Average loss: 180.4467 Explore P: 0.0540
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 578 Total reward: 39.0 Average loss: 62.9306 Explore P: 0.0538
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 579 Total reward: 17.0 Average loss: 21.4977 Explore P: 0.0538
-------------------------------------------------------------------------------
----------------------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 612 Total reward: 199.0 Average loss: 55.5617 Explore P: 0.0364
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 613 Total reward: 199.0 Average loss: 60.6401 Explore P: 0.0359
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 614 Total reward: 199.0 Average loss: 65.2445 Explore P: 0.0354
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 615 Total reward: 199.0 Average loss: 68.0003 Explore P: 0.0349
-------------------------------------------------------------------------------
--------------------------------------------------------------------

## Visualizing training

Below I'll plot the total reward for each episode in grey. I'm plotting the rolling average over 10 episodes too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 
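
To see what `running_mean` computes, here is a small worked example. The function is the same as in the cell above; the input values are made up for illustration. Each output entry is the mean of a window of `N` consecutive values, obtained as the difference of two cumulative sums divided by `N`.

```python
import numpy as np

def running_mean(x, N):
    # Prepend a 0 so that cumsum[i] holds the sum of the first i values of x
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Sum of each length-N window is cumsum[i + N] - cumsum[i]
    return (cumsum[N:] - cumsum[:-N]) / N

# Illustrative values, not taken from the training run
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = running_mean(x, 2)
print(smoothed)       # [1.5 2.5 3.5 4.5]
print(len(smoothed))  # 4, i.e. len(x) - N + 1
```

The smoothed array is `N - 1` entries shorter than the input, which is why the plotting cells pair it with `eps[-len(smoothed_arr):]` rather than the full `eps`.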

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's check out how our trained agent plays the game.

In [23]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

test_episodes = 1
# CartPole-v0 ends every episode after 200 steps, so a larger bound has no effect here
test_max_steps = 200

with tf.Session() as sess:

    # Initialize/restore/load the trained model 
    #sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # iterations
    for ep in range(test_episodes):

        # Start a new episode/epoch
        env.reset()
        
        # number of env/rob steps
        for _ in range(test_max_steps):
            
            # Rendering the env graphics
            env.render()
            
            # Batch from the OLD memory
            batch = memory.buffer
            #actions = np.array([each[0] for each in batch])
            states = np.array([each[1] for each in batch])
            #rewards = np.array([each[2] for each in batch])
            
            # Calculating next action using OLD memory
            feed_dict={model.states: states, model.initial_state: initial_state}
            actions_logits, final_state = sess.run([model.actions_logits, model.final_state], feed_dict)
            last_action_logits = actions_logits[-1]
            
            # Take action, get new state and reward
            #action = env.action_space.sample()
            action = np.argmax(last_action_logits)
            state, _, done, _ = env.step(action)
            reward = 1 - float(done)
            
            # New memory (time direction)
            memory.buffer.append((action, state, reward))
            
            # Feedback loop/connection for NEW memory
            initial_state = final_state
            
            # The task is done or not;
            if done:
                break
                
# Closing the env
env.close()

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
INFO:tensorflow:Restoring parameters from checkpoints/model-seq.ckpt
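
The action-selection step in the test loop can be looked at in isolation. The sketch below uses a made-up array in place of the network output, and the helper names `greedy_action` and `shaped_reward` are hypothetical, introduced only for this illustration. It shows that the agent acts greedily at test time (no exploration) and that the reward is 1 for every step except the one that ends the episode.

```python
import numpy as np

def greedy_action(actions_logits):
    """Pick the highest-valued action at the latest time step.

    actions_logits has one row per time step and one column per action.
    """
    last_action_logits = actions_logits[-1]
    return int(np.argmax(last_action_logits))

def shaped_reward(done):
    """Reward used in the test loop: 1 while the episode continues, 0 on the final step."""
    return 1 - float(done)

# Made-up network output for three time steps and two actions (left, right)
actions_logits = np.array([[0.2, 0.1],
                           [0.4, 0.9],
                           [1.3, -0.5]])

action = greedy_action(actions_logits)
print(action)                # 0
print(shaped_reward(False))  # 1.0
print(shaped_reward(True))   # 0.0
```

Only the last row matters for choosing the action; the earlier rows are the network's outputs for the older entries in the memory buffer.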




## Extending this

So, Cart-Pole is a pretty simple game. However, the same approach can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of the small state vector that the environment hands us here, though, you'd want to use convolutional layers to extract the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
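
Before screen images reach the convolutional layers, they are usually shrunk and stacked so the network can see motion. The following is a minimal NumPy sketch of that input pipeline, not the preprocessing from the paper: the `preprocess` function, the `FrameStack` class, the factor-of-2 downsampling, and the random stand-in frame are all assumptions made for illustration.

```python
import numpy as np
from collections import deque

def preprocess(frame):
    """Turn an RGB frame (H, W, 3) of uint8 into a grayscale float32 image in [0, 1].

    Downsamples by keeping every second pixel in each direction (an assumption
    chosen for simplicity, not the resizing used in the paper).
    """
    gray = frame.astype(np.float32).mean(axis=2) / 255.0
    return gray[::2, ::2]

class FrameStack:
    """Keeps the k most recent preprocessed frames as one state array (hypothetical helper)."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        # At the start of an episode, fill the stack with copies of the first frame
        first = preprocess(frame)
        for _ in range(self.k):
            self.frames.append(first)
        return self.state()

    def push(self, frame):
        # The oldest frame is dropped automatically because of maxlen
        self.frames.append(preprocess(frame))
        return self.state()

    def state(self):
        # Shape (H/2, W/2, k): the frames become the channels of the conv input
        return np.stack(self.frames, axis=-1)

# Random stand-in for a 210x160 RGB game screen
rng = np.random.RandomState(0)
frame = rng.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

stack = FrameStack(k=4)
state = stack.reset(frame)
print(state.shape)  # (105, 80, 4)
```

The resulting array would take the place of the state vector used for Cart-Pole, with convolutional layers in front of the fully connected ones.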

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, "Human-level control through deep reinforcement learning", which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.