
# QGAN: (Q-Net) + GAN (G-Net and D-Net)

In this notebook, we'll use Q-GAN, a Q-network combined with a GAN (a generator network and a discriminator network), to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.8.0
Default GPU Device: /device:GPU:0


>**Note:** Make sure you have OpenAI Gym cloned. Then run the command `pip install -e gym/[all]`.

In [3]:
import gym
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing an action as an integer to `env.step` advances the simulation by one step. You can see how many actions are possible from `env.action_space`, and you can get a random action with `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to step through the simulation with random actions. Uncomment the `env.render()` line to watch it run.

In [4]:
env.reset()
rewards, states, actions, dones = [], [], [], []
for _ in range(10):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    states.append(state)
    rewards.append(reward)
    actions.append(action)
    dones.append(done)
    #     print('state, action, reward, done, info')
    #     print(state, action, reward, done, info)
    if done:
        # The episode is over: reset the environment before taking more steps
        env.reset()

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [5]:
print(rewards[-20:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
print('rewards min and max:', np.max(np.array(rewards)), np.min(np.array(rewards)))
print('state size:', np.array(states).shape, 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
(10,) (10, 4) (10,) (10,)
float64 float64 int64 bool
actions: 1 0
rewards min and max: 1.0 1.0
state size: (10, 4) action size: 2


An episode ends once the pole has fallen past a certain angle, and the game then has to be reset. For each frame while the simulation is running, the environment returns a reward of 1.0, so the longer the game runs, the more reward we get. Our network's goal is therefore to maximize the reward by keeping the pole vertical, which it does by moving the cart to the left and the right.
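To make "more reward for longer games" concrete, here is a minimal sketch of a discounted return for a reward of 1.0 per frame. The helper name `discounted_return` is hypothetical, and `gamma=0.99` is borrowed from the hyperparameters used later in this notebook.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    # Walk backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

short_episode = [1.0] * 10    # pole fell after 10 frames
long_episode = [1.0] * 200    # pole stayed up for 200 frames

print('short episode return:', discounted_return(short_episode))
print('long episode return:', discounted_return(long_episode))
```

The longer episode always yields the larger return, which is the signal the agent is trying to maximize.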

In [6]:
# Data of the model
def model_input(state_size):
    # Current states given: input data
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    
    # Current actions given: indices
    actions = tf.placeholder(tf.int32, [None], name='actions')
    
    # Next states given: next input data
    next_states = tf.placeholder(tf.float32, [None, state_size], name='next_states')
    
    # TargetQs/values
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    
    # returning the given data to the model
    return states, actions, next_states, targetQs

In [7]:
# Q: Qfunction/Encoder/Classifier
def qfunction(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('qfunction', reuse=reuse):        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits: squeezed/compressed/represented states into actions size
        return logits
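The hidden layers above apply `tf.maximum(alpha * x, x)` after batch normalization. For `0 < alpha < 1` this is a leaky ReLU: positive inputs pass through unchanged and negative inputs are scaled by `alpha`. The NumPy sketch below mirrors that rule outside of TensorFlow; the function name `leaky_relu` and the sample inputs are only for illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # Same element-wise rule as tf.maximum(alpha * x, x), valid for 0 < alpha < 1
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # negatives are scaled by alpha, the rest are unchanged
```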

In [8]:
# G: Generator/Decoder: actions can be given actions, generated actions
def generator(states, actions, state_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # Fuse compressed states (actions fake) with actions (actions real)
        x_fused = tf.concat(axis=1, values=[states, actions]) # NxD: axis1=N, and axis2=D
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=state_size)        
        #predictions = tf.sigmoid(logits)

        # return next_states_logits
        return logits

In [10]:
# D: Discriminator/Reward function
def discriminator(states, actions, next_states, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fuse states, actions, and next states (St, at, St+1) and (St, ~at, ~St+1)
        x_fused = tf.concat(axis=1, values=[states, actions, next_states]) # NxD: axis1=N, and axis2=D

        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        #         # Fused compressed states, actions, and compressed next_states (all three in action size)
        #         #h3 = tf.layers.dense(inputs=nl2, units=action_size)
        #         h3_fused = tf.concat(axis=1, values=[states, actions, nl2])
        #         bn3 = tf.layers.batch_normalization(h3_fused, training=training)        
        #         nl3 = tf.maximum(alpha * bn3, bn3)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)   
        #predictions = tf.sigmoid(logits)

        # return reward logits
        return logits

In [11]:
def model_loss(states, actions, next_states, targetQs, # model_input
               state_size, action_size, hidden_size): # model_init
    # DQN: Q-learning - Bellman equations: loss (targetQ - Q)^2
    actions_logits = qfunction(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_real = tf.one_hot(indices=actions, depth=action_size)
    Qs = tf.reduce_sum(tf.multiply(actions_logits, actions_real), axis=1)
    q_loss = tf.reduce_mean(tf.square(targetQs - Qs))

    # GAN: Generate next states
    next_states_logits = generator(states=states, actions=actions_real, 
                                   state_size=state_size, hidden_size=hidden_size)
        
    # GAN: Discriminate between fake and real
    actions_fake = tf.nn.softmax(actions_logits)
    d_logits_fake = discriminator(states=states, actions=actions_fake, next_states=next_states_logits, 
                                  hidden_size=hidden_size, reuse=False)
    d_logits_real = discriminator(states=states, actions=actions_real, next_states=next_states, 
                                  hidden_size=hidden_size, reuse=True)    

    # GAN: Adversarial training - G-learning - Relativistic GAN
    g_loss_fake = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.ones_like(d_logits_fake)))
    g_loss_real = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_real, labels=tf.zeros_like(d_logits_real)))
    g_loss = g_loss_real + g_loss_fake 
    
    # GAN: Adversarial training - G-learning - Variational AE
    g_loss_reconst = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=next_states_logits, labels=tf.sigmoid(x=next_states)))
    q_loss += g_loss
    g_loss += g_loss_reconst
    
    # GAN: Adversarial training - D-learning - Standard GAN
    d_loss_fake = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.zeros_like(d_logits_fake)))
    d_loss_real = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_real, labels=tf.ones_like(d_logits_real)))
    d_loss = d_loss_real + d_loss_fake
    
    # Rewards fake/real
    rewards_fake = tf.sigmoid(d_logits_fake)
    rewards_real = tf.sigmoid(d_logits_real)

    return actions_logits, q_loss, g_loss, d_loss, rewards_fake, rewards_real
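The GAN losses above all go through `tf.nn.sigmoid_cross_entropy_with_logits`, with the labels flipped between the discriminator and the generator. As a rough sketch, the pure-Python version below uses the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` on a single logit. The helper names and the two logit values are made up for illustration and are not taken from a training run.

```python
import math

def sigmoid_xent(logit, label):
    # Numerically stable sigmoid cross-entropy on one logit:
    # max(x, 0) - x * z + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up discriminator outputs for one real and one generated transition
d_logit_real, d_logit_fake = 2.0, -1.0

# Discriminator: real -> 1, fake -> 0
d_loss = sigmoid_xent(d_logit_real, 1.0) + sigmoid_xent(d_logit_fake, 0.0)
# Generator (as in the cell above): labels are swapped, real -> 0, fake -> 1
g_loss = sigmoid_xent(d_logit_real, 0.0) + sigmoid_xent(d_logit_fake, 1.0)

print('d_loss:', d_loss, 'g_loss:', g_loss)
print('reward real:', sigmoid(d_logit_real), 'reward fake:', sigmoid(d_logit_fake))
```

With these example logits the discriminator is already separating real from fake, so its loss is small while the generator's loss is large.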

In [12]:
def model_opt(q_loss, g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param q_loss: Qfunction/Value loss Tensor for next action prediction
    :param g_loss: Generator/Decoder loss Tensor for next state prediction
    :param d_loss: Discriminator/Reward loss Tensor for current reward function
    :param learning_rate: Learning Rate Placeholder
    :return: A tuple of (qfunction training, generator training, discriminator training)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    q_vars = [var for var in t_vars if var.name.startswith('qfunction')] # Q: action At/at
    g_vars = [var for var in t_vars if var.name.startswith('generator')] # G: next state St/st
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')] # D: reward Rt/rt

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
        q_opt = tf.train.AdamOptimizer(learning_rate).minimize(q_loss, var_list=q_vars)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return q_opt, g_opt, d_opt

In [13]:
class QGAN:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.next_states, self.targetQs = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.q_loss, self.g_loss, self.d_loss, self.rewards_fake, self.rewards_real = model_loss(
            state_size=state_size, action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, next_states=self.next_states, actions=self.actions, targetQs=self.targetQs) # model input data

        # Update the model: backward pass and backprop
        self.q_opt, self.g_opt, self.d_opt = model_opt(q_loss=self.q_loss, g_loss=self.g_loss, d_loss=self.d_loss, 
                                                       learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [14]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
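To see the "tube" behavior of a bounded `deque` on its own, here is a small standalone sketch using only the standard library. The capacity of 3 and the integer "experiences" are placeholders for illustration.

```python
from collections import deque
import random

buffer = deque(maxlen=3)      # tiny capacity so the eviction is easy to see
for experience in range(5):
    buffer.append(experience) # once full, the oldest item falls out the left side

print(list(buffer))  # → [2, 3, 4]

# Sampling without replacement, similar in spirit to Memory.sample
batch = random.sample(list(buffer), 2)
print(batch)
```

Note that sampling without replacement needs at least `batch_size` stored experiences, which is one reason the memory is pre-populated before training starts.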

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
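As a minimal sketch of an $\epsilon$-greedy choice, the function below picks a random action with probability `epsilon` and the highest-valued action otherwise. The name `epsilon_greedy` and the example Q-values are hypothetical; the training loop later in this notebook does the same thing with `env.action_space.sample()` and `np.argmax`.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly
        return rng.randrange(len(q_values))
    # Exploit: the action with the largest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.1, 0.9]  # made-up Q(s, a) for the two Cart-Pole actions
print(epsilon_greedy(q_values, epsilon=0.0))  # always greedy when epsilon is 0
print(epsilon_greedy(q_values, epsilon=1.0))  # always random when epsilon is 1
```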

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode $= 1, \dots, M$ **do**
  * **For** $t = 1, \dots, T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**
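The target and loss steps in the list above can be sketched for a single transition as follows. The function names and the example numbers are assumptions for illustration; the training cell later in this notebook computes the same quantities for a whole mini-batch at once.

```python
def q_target(reward, next_q_values, done, gamma=0.99):
    """Bellman target: r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def squared_td_error(target, q_sa):
    """Loss for one transition: (target - Q(s, a))^2."""
    return (target - q_sa) ** 2

# Made-up values for one transition
target = q_target(reward=1.0, next_q_values=[0.5, 2.0], done=False)
print('target:', target)
print('loss:', squared_td_error(target, q_sa=2.5))
```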

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [15]:
print('state size:', np.array(states).shape[1], 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

state size: 4 action size: 2


In [16]:
train_episodes = 2000          # max number of episodes to learn from
max_steps = 2000000000000000   # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
action_size = 2                # number of units for the output actions -- simulation

# Memory parameters
memory_size = 100000           # memory capacity
batch_size = 200               # experience mini-batch size
learning_rate = 0.001          # learning rate for adam
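To get a feel for the exploration settings, the sketch below evaluates the exponential decay used in the training loop, `explore_stop + (explore_start - explore_stop) * exp(-decay_rate * step)`, at a few step counts. The function name `explore_probability` and the chosen step counts are only for illustration.

```python
import math

explore_start, explore_stop, decay_rate = 1.0, 0.01, 0.0001

def explore_probability(step):
    # Decays from explore_start toward explore_stop as the global step grows
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)

for step in (0, 1000, 10000, 50000):
    print(step, round(explore_probability(step), 4))
```

With `decay_rate = 0.0001`, the agent still acts mostly at random for the first several thousand steps, so short early episodes barely move the exploration probability.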

In [17]:
tf.reset_default_graph()

model = QGAN(state_size=state_size, action_size=action_size, hidden_size=hidden_size, learning_rate=learning_rate)

## Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [18]:
# Initialize the simulation
env.reset()

# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

# init memory
memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for _ in range(batch_size):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

## Training

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
# Now train with experiences
saver = tf.train.Saver()

# Total rewards and losses list for plotting
rewards_list, rewards_fake_list, rewards_real_list = [], [], []
q_loss_list, g_loss_list, d_loss_list = [], [], [] 

# TF session for training
with tf.Session() as sess:
    
    # Initialize variables
    sess.run(tf.global_variables_initializer())

    #     # Restore/load the trained model 
    #     #saver.restore(sess, 'checkpoints/model.ckpt')    
    #     saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    step = 0
    for ep in range(train_episodes):
        
        # Env/agent steps/batches/minibatches
        total_reward, rewards_fake_mean, rewards_real_mean = 0, 0, 0
        q_loss, g_loss, d_loss = 0, 0, 0
        t = 0
        while t < max_steps:
            step += 1
            
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from model
                feed_dict = {model.states: state.reshape((1, *state.shape))}
                actions_logits = sess.run(model.actions_logits, feed_dict)
                action = np.argmax(actions_logits)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            # Cumulative reward
            total_reward += reward
            
            # Episode/epoch training is done/failed!
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('-------------------------------------------------------------------------------')
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Average reward fake: {}'.format(rewards_fake_mean),
                      'Average reward real: {}'.format(rewards_real_mean),
                      'Training q_loss: {:.4f}'.format(q_loss),
                      'Training g_loss: {:.4f}'.format(g_loss),
                      'Training d_loss: {:.4f}'.format(d_loss),
                      'Explore P: {:.4f}'.format(explore_p))
                print('-------------------------------------------------------------------------------')
                
                # total rewards and losses for plotting
                rewards_list.append((ep, total_reward))
                rewards_fake_list.append((ep, rewards_fake_mean))
                rewards_real_list.append((ep, rewards_real_mean))
                q_loss_list.append((ep, q_loss))
                g_loss_list.append((ep, g_loss))
                d_loss_list.append((ep, d_loss))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            #rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
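            # Compute the discriminator rewards and the Q-values used in the Bellman target.
            # NOTE: model.states is fed `states` here, so these logits are Q(s, .) for the sampled
            # states rather than Q(s', .); the commented-out lines below evaluate next_states instead.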
            feed_dict = {model.states: states, model.actions: actions, model.next_states: next_states}
            next_actions_logits, rewards_fake, rewards_real = sess.run([model.actions_logits, 
                                                                        model.rewards_fake, model.rewards_real], 
                                                                       feed_dict)
            #             feed_dict={model.states: next_states}
            #             next_actions_logits = sess.run(model.actions_logits, feed_dict)

            # Mean/average fake and real rewards or rewarded generated/given actions
            rewards_fake_mean = np.mean(rewards_fake.reshape(-1))
            rewards_real_mean = np.mean(rewards_real.reshape(-1))
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            next_actions_logits[episode_ends] = (0, 0) # NOTE: action size

            # Bellman equation: Qt = Rt + gamma * max(Qt+1)
            #targetQs = rewards_fake.reshape(-1) + (gamma * np.max(next_actions_logits, axis=1))
            targetQs = rewards_real.reshape(-1) + (gamma * np.max(next_actions_logits, axis=1))

            # Updating/training/optimizing the model
            feed_dict = {model.states: states, model.actions: actions, model.next_states: next_states,
                         model.targetQs: targetQs}
            q_loss, _ = sess.run([model.q_loss, model.q_opt], feed_dict)
            g_loss, _ = sess.run([model.g_loss, model.g_opt], feed_dict)
            d_loss, _ = sess.run([model.d_loss, model.d_opt], feed_dict)
            
    # Save the trained model
    saver.save(sess, 'checkpoints/model.ckpt')

-------------------------------------------------------------------------------
Episode: 0 Total reward: 3.0 Average reward fake: 0.49238717555999756 Average reward real: 0.5137783885002136 Training q_loss: 1.7447 Training g_loss: 2.1280 Training d_loss: 1.3535 Explore P: 0.9997
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1 Total reward: 20.0 Average reward fake: 0.45753809809684753 Average reward real: 0.581372082233429 Training q_loss: 2.1482 Training g_loss: 2.3678 Training d_loss: 1.1787 Explore P: 0.9977
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2 Total reward: 11.0 Average reward fake: 0.48067089915275574 Average reward real: 0.5992328524589539 Training q_loss: 2.3570 Training g_loss: 2.4151 Training d_loss: 1.2444 Explore P: 0.9966

-------------------------------------------------------------------------------
Episode: 23 Total reward: 12.0 Average reward fake: 0.2853304445743561 Average reward real: 0.7617213726043701 Training q_loss: 41.2186 Training g_loss: 4.0602 Training d_loss: 0.6950 Explore P: 0.9585
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 24 Total reward: 17.0 Average reward fake: 0.2708226442337036 Average reward real: 0.763134241104126 Training q_loss: 33.8181 Training g_loss: 3.9210 Training d_loss: 0.6887 Explore P: 0.9569
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 25 Total reward: 15.0 Average reward fake: 0.21555614471435547 Average reward real: 0.7679376006126404 Training q_loss: 52.8620 Training g_loss: 4.3614 Training d_loss: 0.6076 Explore P: 0

-------------------------------------------------------------------------------
Episode: 46 Total reward: 13.0 Average reward fake: 0.13974522054195404 Average reward real: 0.7270910143852234 Training q_loss: 54.0708 Training g_loss: 7.9285 Training d_loss: 0.6521 Explore P: 0.9070
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 47 Total reward: 15.0 Average reward fake: 0.22498540580272675 Average reward real: 0.8636030554771423 Training q_loss: 44.0944 Training g_loss: 7.0909 Training d_loss: 0.4533 Explore P: 0.9056
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 48 Total reward: 16.0 Average reward fake: 0.1732444316148758 Average reward real: 0.8634973168373108 Training q_loss: 19.4807 Training g_loss: 6.8697 Training d_loss: 0.4687 Explore P:

-------------------------------------------------------------------------------
Episode: 69 Total reward: 21.0 Average reward fake: 0.08576841652393341 Average reward real: 0.8094969987869263 Training q_loss: 16.5456 Training g_loss: 7.7211 Training d_loss: 0.3653 Explore P: 0.8701
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 70 Total reward: 21.0 Average reward fake: 0.1236233338713646 Average reward real: 0.8911586999893188 Training q_loss: 26.8123 Training g_loss: 7.7958 Training d_loss: 0.2881 Explore P: 0.8683
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 71 Total reward: 25.0 Average reward fake: 0.1155589297413826 Average reward real: 0.8698179721832275 Training q_loss: 40.1660 Training g_loss: 8.3179 Training d_loss: 0.2737 Explore P: 

-------------------------------------------------------------------------------
Episode: 92 Total reward: 12.0 Average reward fake: 0.16417770087718964 Average reward real: 0.8255947232246399 Training q_loss: 20.3453 Training g_loss: 6.6805 Training d_loss: 0.4158 Explore P: 0.8322
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 93 Total reward: 21.0 Average reward fake: 0.09896309673786163 Average reward real: 0.889380931854248 Training q_loss: 50.5444 Training g_loss: 8.4406 Training d_loss: 0.2561 Explore P: 0.8305
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 94 Total reward: 10.0 Average reward fake: 0.12189680337905884 Average reward real: 0.7895339727401733 Training q_loss: 66.9664 Training g_loss: 7.6847 Training d_loss: 0.4674 Explore P:

-------------------------------------------------------------------------------
Episode: 115 Total reward: 13.0 Average reward fake: 0.16008073091506958 Average reward real: 0.8993701338768005 Training q_loss: 21.9930 Training g_loss: 9.0285 Training d_loss: 0.2372 Explore P: 0.7932
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 116 Total reward: 12.0 Average reward fake: 0.11483463644981384 Average reward real: 0.8764762878417969 Training q_loss: 22.5850 Training g_loss: 9.3570 Training d_loss: 0.3566 Explore P: 0.7923
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 117 Total reward: 14.0 Average reward fake: 0.10147202759981155 Average reward real: 0.8284211754798889 Training q_loss: 51.4809 Training g_loss: 8.4966 Training d_loss: 0.3096 Explor

-------------------------------------------------------------------------------
Episode: 138 Total reward: 23.0 Average reward fake: 0.08448352664709091 Average reward real: 0.9364770650863647 Training q_loss: 59.6148 Training g_loss: 11.1949 Training d_loss: 0.2058 Explore P: 0.7596
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 139 Total reward: 13.0 Average reward fake: 0.18528667092323303 Average reward real: 0.8642404079437256 Training q_loss: 120.5441 Training g_loss: 9.6702 Training d_loss: 0.2471 Explore P: 0.7587
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 140 Total reward: 16.0 Average reward fake: 0.05586705729365349 Average reward real: 0.5924490690231323 Training q_loss: 79.1525 Training g_loss: 10.1023 Training d_loss: 0.8594 Exp

-------------------------------------------------------------------------------
Episode: 161 Total reward: 13.0 Average reward fake: 0.07780192047357559 Average reward real: 0.9042888879776001 Training q_loss: 44.2951 Training g_loss: 10.2622 Training d_loss: 0.1986 Explore P: 0.7330
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 162 Total reward: 13.0 Average reward fake: 0.10209199041128159 Average reward real: 0.8408126831054688 Training q_loss: 37.4648 Training g_loss: 10.0036 Training d_loss: 0.2750 Explore P: 0.7321
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 163 Total reward: 22.0 Average reward fake: 0.12451833486557007 Average reward real: 0.9172399640083313 Training q_loss: 54.9497 Training g_loss: 12.3827 Training d_loss: 0.2743 Exp

-------------------------------------------------------------------------------
Episode: 184 Total reward: 22.0 Average reward fake: 0.09327058494091034 Average reward real: 0.8954849243164062 Training q_loss: 34.4507 Training g_loss: 12.4889 Training d_loss: 0.2077 Explore P: 0.7034
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 185 Total reward: 11.0 Average reward fake: 0.12510348856449127 Average reward real: 0.9209123849868774 Training q_loss: 54.3235 Training g_loss: 11.3022 Training d_loss: 0.2148 Explore P: 0.7026
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 186 Total reward: 15.0 Average reward fake: 0.08291999995708466 Average reward real: 0.9022257924079895 Training q_loss: 53.5668 Training g_loss: 11.2378 Training d_loss: 0.1920 Exp

-------------------------------------------------------------------------------
Episode: 207 Total reward: 19.0 Average reward fake: 0.11830787360668182 Average reward real: 0.9042677879333496 Training q_loss: 80.3967 Training g_loss: 22.0622 Training d_loss: 0.3746 Explore P: 0.6775
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 208 Total reward: 34.0 Average reward fake: 0.10454370826482773 Average reward real: 0.9102213978767395 Training q_loss: 49.8894 Training g_loss: 14.7975 Training d_loss: 0.2571 Explore P: 0.6753
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 209 Total reward: 10.0 Average reward fake: 0.12024883925914764 Average reward real: 0.8683951497077942 Training q_loss: 52.3890 Training g_loss: 11.9054 Training d_loss: 0.2687 Exp

-------------------------------------------------------------------------------
Episode: 230 Total reward: 11.0 Average reward fake: 0.08188457787036896 Average reward real: 0.891049861907959 Training q_loss: 103.7363 Training g_loss: 13.1930 Training d_loss: 0.2568 Explore P: 0.6532
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 231 Total reward: 20.0 Average reward fake: 0.1034911721944809 Average reward real: 0.9215692281723022 Training q_loss: 97.7086 Training g_loss: 21.5357 Training d_loss: 0.1774 Explore P: 0.6519
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 232 Total reward: 12.0 Average reward fake: 0.08310386538505554 Average reward real: 0.9013051390647888 Training q_loss: 38.6748 Training g_loss: 17.8262 Training d_loss: 0.2095 Expl

-------------------------------------------------------------------------------
Episode: 253 Total reward: 12.0 Average reward fake: 0.1016903892159462 Average reward real: 0.9690136909484863 Training q_loss: 87.4731 Training g_loss: 34.8100 Training d_loss: 0.0874 Explore P: 0.6258
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 254 Total reward: 14.0 Average reward fake: 0.030375191941857338 Average reward real: 0.7756427526473999 Training q_loss: 52.2043 Training g_loss: 32.6791 Training d_loss: 0.5180 Explore P: 0.6249
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 255 Total reward: 10.0 Average reward fake: 0.09760329872369766 Average reward real: 0.9794689416885376 Training q_loss: 86.9761 Training g_loss: 28.4390 Training d_loss: 0.1184 Exp

-------------------------------------------------------------------------------
Episode: 277 Total reward: 16.0 Average reward fake: 0.05949859693646431 Average reward real: 0.8714891076087952 Training q_loss: 57.5561 Training g_loss: 20.0658 Training d_loss: 0.2223 Explore P: 0.6074
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 278 Total reward: 15.0 Average reward fake: 0.14490383863449097 Average reward real: 0.9556508660316467 Training q_loss: 338.6979 Training g_loss: 44.4638 Training d_loss: 0.2019 Explore P: 0.6065
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 279 Total reward: 19.0 Average reward fake: 0.048807233572006226 Average reward real: 0.9068893194198608 Training q_loss: 82.7221 Training g_loss: 59.0179 Training d_loss: 0.1928 E

-------------------------------------------------------------------------------
Episode: 300 Total reward: 12.0 Average reward fake: 0.039106953889131546 Average reward real: 0.931006908416748 Training q_loss: 87.5036 Training g_loss: 18.1184 Training d_loss: 0.1063 Explore P: 0.5857
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 301 Total reward: 8.0 Average reward fake: 0.0754343569278717 Average reward real: 0.9127566814422607 Training q_loss: 45.2610 Training g_loss: 19.4270 Training d_loss: 0.1597 Explore P: 0.5853
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 302 Total reward: 19.0 Average reward fake: 0.08245790749788284 Average reward real: 0.9571886658668518 Training q_loss: 47.9344 Training g_loss: 24.2777 Training d_loss: 0.1498 Explo

-------------------------------------------------------------------------------
Episode: 323 Total reward: 31.0 Average reward fake: 0.03882545605301857 Average reward real: 0.887051522731781 Training q_loss: 52.3983 Training g_loss: 29.7934 Training d_loss: 0.1740 Explore P: 0.5660
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 324 Total reward: 16.0 Average reward fake: 0.06428186595439911 Average reward real: 0.9518378973007202 Training q_loss: 72.4566 Training g_loss: 26.3127 Training d_loss: 0.1706 Explore P: 0.5651
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 325 Total reward: 9.0 Average reward fake: 0.06288467347621918 Average reward real: 0.9098309874534607 Training q_loss: 47.4164 Training g_loss: 24.1021 Training d_loss: 0.1713 Explo

-------------------------------------------------------------------------------
Episode: 346 Total reward: 13.0 Average reward fake: 0.04268277809023857 Average reward real: 0.8383259773254395 Training q_loss: 127.1045 Training g_loss: 100.8390 Training d_loss: 0.4957 Explore P: 0.5482
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 347 Total reward: 10.0 Average reward fake: 0.016872771084308624 Average reward real: 0.8758450150489807 Training q_loss: 154.2970 Training g_loss: 113.0521 Training d_loss: 0.2114 Explore P: 0.5477
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 348 Total reward: 16.0 Average reward fake: 0.03325643017888069 Average reward real: 0.8462120294570923 Training q_loss: 119.8175 Training g_loss: 100.5207 Training d_loss: 0.3

-------------------------------------------------------------------------------
Episode: 369 Total reward: 16.0 Average reward fake: 0.12659943103790283 Average reward real: 0.930429220199585 Training q_loss: 58.5179 Training g_loss: 28.6115 Training d_loss: 0.1620 Explore P: 0.5308
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 370 Total reward: 18.0 Average reward fake: 0.11878576129674911 Average reward real: 0.9680931568145752 Training q_loss: 64.0888 Training g_loss: 29.2921 Training d_loss: 0.1310 Explore P: 0.5298
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 371 Total reward: 15.0 Average reward fake: 0.06643658876419067 Average reward real: 0.8157814145088196 Training q_loss: 54.6435 Training g_loss: 22.4839 Training d_loss: 0.3075 Expl

-------------------------------------------------------------------------------
Episode: 393 Total reward: 8.0 Average reward fake: 0.026888974010944366 Average reward real: 0.8869056701660156 Training q_loss: 65.6704 Training g_loss: 46.4170 Training d_loss: 0.1893 Explore P: 0.5133
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 394 Total reward: 16.0 Average reward fake: 0.04528756067156792 Average reward real: 0.934663712978363 Training q_loss: 122.7788 Training g_loss: 42.5611 Training d_loss: 0.1171 Explore P: 0.5125
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 395 Total reward: 32.0 Average reward fake: 0.050971757620573044 Average reward real: 0.9284844994544983 Training q_loss: 45.9233 Training g_loss: 29.4323 Training d_loss: 0.1285 Ex

-------------------------------------------------------------------------------
Episode: 416 Total reward: 16.0 Average reward fake: 0.049928031861782074 Average reward real: 0.9449729919433594 Training q_loss: 92.8088 Training g_loss: 59.0074 Training d_loss: 0.1610 Explore P: 0.4972
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 417 Total reward: 13.0 Average reward fake: 0.037074651569128036 Average reward real: 0.8745570182800293 Training q_loss: 100.4506 Training g_loss: 57.3834 Training d_loss: 0.2213 Explore P: 0.4966
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 418 Total reward: 22.0 Average reward fake: 0.06511794775724411 Average reward real: 0.9367846846580505 Training q_loss: 75.8938 Training g_loss: 56.3870 Training d_loss: 0.1636 

-------------------------------------------------------------------------------
Episode: 439 Total reward: 11.0 Average reward fake: 0.09102150797843933 Average reward real: 0.9502384066581726 Training q_loss: 132.9667 Training g_loss: 61.5172 Training d_loss: 0.2176 Explore P: 0.4819
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 440 Total reward: 15.0 Average reward fake: 0.06969163566827774 Average reward real: 0.9806258678436279 Training q_loss: 78.3447 Training g_loss: 58.1484 Training d_loss: 0.1663 Explore P: 0.4812
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 441 Total reward: 12.0 Average reward fake: 0.36712124943733215 Average reward real: 0.9406410455703735 Training q_loss: 85.1915 Training g_loss: 53.4909 Training d_loss: 0.7241 Ex

-------------------------------------------------------------------------------
Episode: 462 Total reward: 11.0 Average reward fake: 0.10231783241033554 Average reward real: 0.9692592024803162 Training q_loss: 96.9803 Training g_loss: 67.1898 Training d_loss: 0.1883 Explore P: 0.4671
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 463 Total reward: 20.0 Average reward fake: 0.084307000041008 Average reward real: 0.9437669515609741 Training q_loss: 73.0474 Training g_loss: 52.4669 Training d_loss: 0.1689 Explore P: 0.4661
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 464 Total reward: 22.0 Average reward fake: 0.07618222385644913 Average reward real: 0.9342875480651855 Training q_loss: 79.0292 Training g_loss: 49.6432 Training d_loss: 0.1389 Explo

-------------------------------------------------------------------------------
Episode: 485 Total reward: 18.0 Average reward fake: 0.05863068997859955 Average reward real: 0.9450190663337708 Training q_loss: 99.9043 Training g_loss: 64.0149 Training d_loss: 0.1467 Explore P: 0.4521
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 486 Total reward: 20.0 Average reward fake: 0.030766822397708893 Average reward real: 0.8674182295799255 Training q_loss: 91.2550 Training g_loss: 48.1296 Training d_loss: 0.1934 Explore P: 0.4512
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 487 Total reward: 11.0 Average reward fake: 0.06325861066579819 Average reward real: 0.9366245865821838 Training q_loss: 79.9186 Training g_loss: 57.1687 Training d_loss: 0.1561 Ex

## Visualizing training

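Below I'll plot the total rewards for each episode, followed by the other quantities logged during training. In every plot the raw per-episode values are drawn in light grey and the 10-episode rolling average in blue.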

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

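def running_mean(x, N):
    """Rolling mean of x over a window of N samples (returns len(x) - N + 1 values)."""
    # Prepend 0 so that cumsum[i] is the sum of the first i samples
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # The difference of two cumulative sums N apart is the sum over one window
    return (cumsum[N:] - cumsum[:-N]) / N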
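
To see what the cumulative-sum trick in `running_mean` computes, here is a small worked example. The `rewards` values are made up for illustration, and the comparison against `np.convolve` is only a sanity check, not something the notebook itself does.

```python
import numpy as np

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Made-up per-episode rewards, only for illustration
rewards = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Windows of size 3: [10, 20, 30], [20, 30, 40], [30, 40, 50]
smoothed = running_mean(rewards, 3)
print(smoothed)  # [20. 30. 40.]

# The same result from an explicit moving-average convolution
reference = np.convolve(rewards, np.ones(3) / 3, mode='valid')
print(np.allclose(smoothed, reference))  # True
```

Note that the output is shorter than the input by `N - 1` samples, because only full windows are averaged.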

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')
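
The unpacking `eps, arr = np.array(rewards_list).T` only works if `rewards_list` holds one `(episode, value)` pair per episode; that layout is an assumption inferred from the plotting code above. The sketch below uses made-up pairs to show how the transpose splits them and why the smoothed curve is plotted against `eps[-len(smoothed_arr):]`.

```python
import numpy as np

# Hypothetical log: one (episode, total_reward) pair per episode
rewards_list = [(0, 12.0), (1, 14.0), (2, 23.0), (3, 13.0), (4, 16.0)]

# Shape (5, 2) -> transpose to (2, 5) -> one row of episodes, one row of values
eps, arr = np.array(rewards_list).T

# Rolling mean over N episodes, same formula as running_mean
N = 3
cumsum = np.cumsum(np.insert(arr, 0, 0))
smoothed_arr = (cumsum[N:] - cumsum[:-N]) / N

# The smoothed series has N - 1 fewer points, so it is aligned with the
# LAST episode of each window by taking the tail of the episode axis
x = eps[-len(smoothed_arr):]
print(x)             # [2. 3. 4.]
print(smoothed_arr)  # averages of episodes 0-2, 1-3 and 2-4
```

With this alignment, each point on the smoothed curve summarizes the `N` episodes ending at that x position, so the curve never uses rewards from later episodes.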

In [None]:
eps, arr = np.array(rewards_fake_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Fake rewards')

In [None]:
eps, arr = np.array(rewards_real_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Real rewards')

In [None]:
eps, arr = np.array(q_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Q losses')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')
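
The six plotting cells above differ only in the list they read and the y-axis label, so they could be folded into one helper. The following is a minimal sketch, not part of the original notebook: `plot_metric` is a hypothetical name, and the `demo` data is synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

def plot_metric(pairs, ylabel, window=10, ax=None):
    """Plot (episode, value) pairs: raw values in grey, rolling mean on top."""
    eps, arr = np.array(pairs).T
    smoothed = running_mean(arr, window)
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(eps[-len(smoothed):], smoothed)
    ax.plot(eps, arr, color='grey', alpha=0.3)
    ax.set_xlabel('Episode')
    ax.set_ylabel(ylabel)
    return ax

# Synthetic (episode, value) pairs standing in for a list such as rewards_list
demo = [(ep, float(ep % 7)) for ep in range(50)]
ax = plot_metric(demo, 'Total rewards')
```

In the notebook, each cell would then reduce to a single call such as `plot_metric(q_loss_list, 'Q losses')`, assuming the lists keep the `(episode, value)` layout used above.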

## Testing

Let's check out how our trained agent plays the game.

In [55]:
test_episodes = 1
# Effectively unbounded: an episode ends when the env reports done
test_max_steps = float('inf')

# # # Create the env after closing it.
# env = gym.make('CartPole-v0')
# # env = gym.make('Acrobot-v1')

with tf.Session() as sess:
    
    # Restore/load the trained model 
    #saver.restore(sess, 'checkpoints/QGAN-cartpole.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Test episodes
    for ep in range(test_episodes):
        
        # Start every episode from a fresh initial state
        state = env.reset()
        
        # Number of env steps taken in this episode
        t = 0
        while t < test_max_steps:
            
            # Render the env graphics
            env.render()
            
            # Get the greedy action from the trained model
            feed_dict = {model.states: state.reshape((1, *state.shape))}
            actions_logits = sess.run(model.actions_logits, feed_dict)
            action = np.argmax(actions_logits)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            # Stop this episode once the env reports it is done
            if done:
                break
            
            state = next_state
            t += 1

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt
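
At test time there is no exploration: the agent always takes the action with the largest logit. The sketch below reproduces just that selection step with NumPy. The observation and the logits are made-up values standing in for `env.reset()` and for the output of `sess.run(model.actions_logits, ...)`, and the `(1, 2)` logits shape is an assumption based on Cart-Pole having two actions.

```python
import numpy as np

# Hypothetical Cart-Pole observation (4 numbers)
state = np.array([0.02, -0.01, 0.03, 0.04])

# The model expects a batch, so a leading batch dimension of 1 is added
batch = state.reshape((1, *state.shape))
print(batch.shape)  # (1, 4)

# Made-up logits for a batch of one state and two actions
actions_logits = np.array([[0.3, 1.7]])

# Greedy choice: index of the largest logit (argmax flattens the array,
# which is safe here because the batch holds a single state)
action = int(np.argmax(actions_logits))
print(action)  # 1
```

With a batch of more than one state, `np.argmax(actions_logits, axis=1)` would be needed to get one action per row.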


In [56]:
# Closing the env
# WARNING: once closed, this env cannot be reused; create a new one with gym.make()
env.close()

## Extending this to Deep Convolutional QGAN

Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the small state vector we're using here, though, you'd want to use convolutional layers to extract the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
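
Before screen images reach the convolutional layers, they are usually shrunk and stacked so that the network can see motion. The following is a minimal NumPy sketch of that preprocessing under stated assumptions: the 210x160 RGB frame size, the factor-of-2 downsampling, and the stack of 4 frames are illustrative choices, `preprocess` is a hypothetical helper, and random arrays stand in for real frames.

```python
from collections import deque

import numpy as np

def preprocess(frame):
    """RGB uint8 frame (H, W, 3) -> grayscale float32 (H/2, W/2) in [0, 1]."""
    gray = frame.astype(np.float32).mean(axis=2) / 255.0
    # Keep every second row and column to halve the resolution
    return gray[::2, ::2]

stack_size = 4                     # assumed number of frames per state
frames = deque(maxlen=stack_size)  # oldest frame is dropped automatically

rng = np.random.RandomState(0)
for _ in range(stack_size):
    # Random pixels standing in for one rendered game frame
    frame = rng.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
    frames.append(preprocess(frame))

# Stack along the last axis: this array plays the role of `state`
state = np.stack(frames, axis=-1)
print(state.shape)  # (105, 80, 4)
```

A single frame cannot show which way the ball is moving, which is why several consecutive frames are stacked into one state.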

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.