
# QGAN: (Q-Net) + GAN (G-Net and D-Net)

More specifically, we'll use Q-GAN to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.8.0
Default GPU Device: /device:GPU:0


>**Note:** Make sure you have OpenAI Gym cloned. Then run this command 'pip install -e gym/[all]'.

In [40]:
import gym
# Create the Cart-Pole game environment
# env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [42]:
env.reset()
rewards, states, actions, dones = [], [], [], []
for _ in range(10):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    states.append(state)
    rewards.append(reward)
    actions.append(action)
    dones.append(done)
    #     print('state, action, reward, done, info')
    #     print(state, action, reward, done, info)
    if done:
    #         print('state, action, reward, done, info')
    #         print(state, action, reward, done, info)
        states.append(state)
        rewards.append(reward)
        actions.append(action)
        dones.append(done)

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [43]:
print(rewards[-20:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
print('rewards min and max:', np.max(np.array(rewards)), np.min(np.array(rewards)))
print('state size:', np.array(states).shape, 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
(10,) (10, 4) (10,) (10,)
float64 float64 int64 bool
actions: 1 0
rewards min and max: 1.0 1.0
state size: (10, 4) action size: 2


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

In [44]:
# Data of the model
def model_input(state_size):
    # Current states given
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    
    # Next states given
    next_states = tf.placeholder(tf.float32, [None, state_size], name='next_states')
    
    # Current actions given
    actions = tf.placeholder(tf.int32, [None], name='actions')

    # TargetQs/values
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    
    # returning the given data to the model
    return states, next_states, actions, targetQs

In [47]:
# Q: Qfunction/Encoder/Classifier
def qfunction(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('qfunction', reuse=reuse):        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        #predictions = tf.nn.softmax(logits)

        # return actions logits: Sqeezed/compressed/represented states into actions size
        return logits

In [48]:
# G: Generator/Decoder: actions can be given actions, generated actions
def generator(states, actions, state_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # Fuse compressed states (actions fake) with actions (actions real)
        x_fused = tf.concat(axis=1, values=[states, actions]) # NxD: axis1=N, and axis2=D
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=state_size)        
        #predictions = tf.sigmoid(logits)

        # return next_states_logits
        return logits

In [49]:
# D: Descriminator/Reward function
def discriminator(states, actions, next_states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=next_states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=action_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Fused compressed states, actions, and compressed next_states (all three in action size)
        #h3 = tf.layers.dense(inputs=nl2, units=action_size)
        h3_fused = tf.concat(axis=1, values=[states, actions, nl2])
        bn3 = tf.layers.batch_normalization(h3_fused, training=training)        
        nl3 = tf.maximum(alpha * bn3, bn3)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl3, units=1)   
        #predictions = tf.sigmoid(logits)

        # return reward logits
        return logits

In [50]:
def model_loss(states, actions, next_states, targetQs, # model_input
               state_size, action_size, hidden_size): # model_init
    # DQN: Q-learning - Bellman equations: loss (targetQ - Q)^2
    actions_logits = qfunction(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_real = tf.one_hot(indices=actions, depth=action_size)
    Qs = tf.reduce_sum(tf.multiply(actions_logits, actions_real), axis=1)
    q_loss = tf.reduce_mean(tf.square(targetQs - Qs))

    # GAN: Generate next states
    actions_fake = tf.nn.softmax(actions_logits)
    next_states_logits = generator(states=actions_fake, actions=actions_real, 
                                   state_size=state_size, hidden_size=hidden_size)
    
    # GAN: Discriminate between fake and real
    next_states_fake = tf.sigmoid(x=next_states_logits)
    d_logits_fake = discriminator(states=actions_fake, actions=actions_real, action_size=action_size,
                                  next_states=next_states_fake, hidden_size=hidden_size, reuse=False)
    next_states_real = tf.sigmoid(x=next_states) 
    d_logits_real = discriminator(states=actions_fake, actions=actions_real, action_size=action_size,
                                  next_states=next_states_real, hidden_size=hidden_size, reuse=True)    

    # GAN: Adverserial training - G-learning -  Relavistic GAN
    g_loss_fake = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.ones_like(d_logits_fake)))
    g_loss_real = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_real, labels=tf.zeros_like(d_logits_real)))
    g_loss = g_loss_real + g_loss_fake
    
    # VAE: Variational AE reconstruction/prediction loss
    loss_reconst = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=next_states_logits, labels=next_states_real))
    q_loss += g_loss + loss_reconst 
    g_loss += loss_reconst
    
    # GAN: Adverserial training - D-learning-  Standard GAN
    d_loss_fake = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.zeros_like(d_logits_fake)))
    d_loss_real = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_real, labels=tf.ones_like(d_logits_real)))
    d_loss = d_loss_real + d_loss_fake
    
    # Rewards fake/real
    rewards_fake = tf.sigmoid(d_logits_fake)
    rewards_real = tf.sigmoid(d_logits_real)

    return actions_logits, q_loss, g_loss, d_loss, rewards_fake, rewards_real

In [51]:
def model_opt(q_loss, g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param q_loss: Qfunction/Value loss Tensor for next action prediction
    :param g_loss: Generator/Decoder loss Tensor for next state prediction
    :param d_loss: Discriminator/Reward loss Tensor for current reward function
    :param learning_rate: Learning Rate Placeholder
    :return: A tuple of (qfunction training, generator training, discriminator training)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    q_vars = [var for var in t_vars if var.name.startswith('qfunction')] # Q: action At/at
    g_vars = [var for var in t_vars if var.name.startswith('generator')] # G: next state St/st
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')] # D: reward Rt/rt

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
        q_opt = tf.train.AdamOptimizer(learning_rate).minimize(q_loss, var_list=q_vars)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)

    return q_opt, g_opt, d_opt

In [52]:
class QGAN:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.next_states, self.actions, self.targetQs = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.q_loss, self.g_loss, self.d_loss, self.rewards_fake, self.rewards_real = model_loss(
            state_size=state_size, action_size=action_size, hidden_size=hidden_size, # model init parameters
            states=self.states, next_states=self.next_states, actions=self.actions, targetQs=self.targetQs) # model input data

        # Update the model: backward pass and backprop
        self.q_opt, self.g_opt, self.d_opt = model_opt(q_loss=self.q_loss, g_loss=self.g_loss, d_loss=self.d_loss, 
                                                       learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [53]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [54]:
print('state size:', np.array(states).shape[1], 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

state size: 4 action size: 2


In [56]:
train_episodes = 1000          # max number of episodes to learn from
max_steps = 2000000000000000   # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
action_size = 2                # number of units for the output actions -- simulation

# Memory parameters
memory_size = 100000           # memory capacity
batch_size = 200               # experience mini-batch size
learning_rate = 0.001          # learning rate for adam

In [57]:
tf.reset_default_graph()

model = QGAN(state_size=state_size, action_size=action_size, hidden_size=hidden_size, learning_rate=learning_rate)

## Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [58]:
# Initialize the simulation
env.reset()

# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

# init memory
memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for _ in range(batch_size):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

## Training

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
# Now train with experiences
saver = tf.train.Saver()

# Total rewards and losses list for plotting
rewards_list, rewards_fake_list, rewards_real_list = [], [], []
q_loss_list, g_loss_list, d_loss_list = [], [], [] 

# TF session for training
with tf.Session() as sess:
    
    # Initialize variables
    sess.run(tf.global_variables_initializer())

    #     # Restore/load the trained model 
    #     #saver.restore(sess, 'checkpoints/model.ckpt')    
    #     saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    step = 0
    for ep in range(train_episodes):
        
        # Env/agent steps/batches/minibatches
        total_reward, rewards_fake_mean, rewards_real_mean = 0, 0, 0
        q_loss, g_loss, d_loss = 0, 0, 0
        t = 0
        while t < max_steps:
            step += 1
            
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from model
                feed_dict = {model.states: state.reshape((1, *state.shape))}
                actions_logits = sess.run(model.actions_logits, feed_dict)
                action = np.argmax(actions_logits)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            # Cumulative reward
            total_reward += reward
            
            # Episode/epoch training is done/failed!
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('-------------------------------------------------------------------------------')
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Average reward fake: {}'.format(rewards_fake_mean),
                      'Average reward real: {}'.format(rewards_real_mean),
                      'Training q_loss: {:.4f}'.format(q_loss),
                      'Training g_loss: {:.4f}'.format(g_loss),
                      'Training d_loss: {:.4f}'.format(d_loss),
                      'Explore P: {:.4f}'.format(explore_p))
                print('-------------------------------------------------------------------------------')
                
                # total rewards and losses for plotting
                rewards_list.append((ep, total_reward))
                rewards_fake_list.append((ep, rewards_fake_mean))
                rewards_real_list.append((ep, rewards_real_mean))
                q_loss_list.append((ep, q_loss))
                g_loss_list.append((ep, g_loss))
                d_loss_list.append((ep, d_loss))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            #rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Calculating real current reward and next action
            feed_dict = {model.states: states, model.actions: actions, model.next_states: next_states}
            next_actions_logits, rewards_fake, rewards_real = sess.run([model.actions_logits, 
                                                                        model.rewards_fake, model.rewards_real], 
                                                                       feed_dict)
            #             feed_dict={model.states: next_states}
            #             next_actions_logits = sess.run(model.actions_logits, feed_dict)

            # Mean/average fake and real rewards or rewarded generated/given actions
            rewards_fake_mean = np.mean(rewards_fake.reshape(-1))
            rewards_real_mean = np.mean(rewards_real.reshape(-1))
            
            # Set target_Qs to 0 for states where episode ends
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            next_actions_logits[episode_ends] = (0, 0) # NOTE: action size

            # Bellman equation: Qt = Rt + max(Qt+1)
            #targetQs = rewards_fake.reshape(-1) + (gamma * np.max(next_actions_logits, axis=1))
            targetQs = rewards_real.reshape(-1) + (gamma * np.max(next_actions_logits, axis=1))

            # Updating/training/optimizing the model
            feed_dict = {model.states: states, model.actions: actions, model.next_states: next_states,
                         model.targetQs: targetQs}
            q_loss, _ = sess.run([model.q_loss, model.q_opt], feed_dict)
            g_loss, _ = sess.run([model.g_loss, model.g_opt], feed_dict)
            d_loss, _ = sess.run([model.d_loss, model.d_opt], feed_dict)
            
    # Save the trained model
    saver.save(sess, 'checkpoints/model.ckpt')

-------------------------------------------------------------------------------
Episode: 0 Total reward: 33.0 Average reward fake: 0.19600506126880646 Average reward real: 0.1960034817457199 Training q_loss: 2.6797 Training g_loss: 2.5373 Training d_loss: 1.8479 Explore P: 0.9967
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1 Total reward: 41.0 Average reward fake: 0.24042381346225739 Average reward real: 0.24041762948036194 Training q_loss: 4.5685 Training g_loss: 2.3783 Training d_loss: 1.7004 Explore P: 0.9927
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2 Total reward: 14.0 Average reward fake: 0.25487691164016724 Average reward real: 0.25488054752349854 Training q_loss: 6.5446 Training g_loss: 2.3374 Training d_loss: 1.6611 Explore P: 0.

-------------------------------------------------------------------------------
Episode: 23 Total reward: 22.0 Average reward fake: 0.4949415624141693 Average reward real: 0.49277472496032715 Training q_loss: 9.4793 Training g_loss: 2.0744 Training d_loss: 1.3908 Explore P: 0.9459
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 24 Total reward: 14.0 Average reward fake: 0.4950315058231354 Average reward real: 0.49310654401779175 Training q_loss: 7.3082 Training g_loss: 2.0752 Training d_loss: 1.3903 Explore P: 0.9446
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 25 Total reward: 16.0 Average reward fake: 0.49539700150489807 Average reward real: 0.4934948682785034 Training q_loss: 7.3157 Training g_loss: 2.0744 Training d_loss: 1.3902 Explore P: 0

-------------------------------------------------------------------------------
Episode: 46 Total reward: 25.0 Average reward fake: 0.4987887442111969 Average reward real: 0.4988235831260681 Training q_loss: 3.0473 Training g_loss: 2.0717 Training d_loss: 1.3863 Explore P: 0.9064
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 47 Total reward: 12.0 Average reward fake: 0.4989415109157562 Average reward real: 0.49893224239349365 Training q_loss: 2.9987 Training g_loss: 2.0690 Training d_loss: 1.3864 Explore P: 0.9053
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 48 Total reward: 9.0 Average reward fake: 0.4985668957233429 Average reward real: 0.49882981181144714 Training q_loss: 3.9884 Training g_loss: 2.0752 Training d_loss: 1.3857 Explore P: 0.9

-------------------------------------------------------------------------------
Episode: 69 Total reward: 13.0 Average reward fake: 0.4988137185573578 Average reward real: 0.49972784519195557 Training q_loss: 3.8863 Training g_loss: 2.0763 Training d_loss: 1.3849 Explore P: 0.8697
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 70 Total reward: 9.0 Average reward fake: 0.5002654790878296 Average reward real: 0.5005820393562317 Training q_loss: 4.3321 Training g_loss: 2.0734 Training d_loss: 1.3859 Explore P: 0.8689
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 71 Total reward: 21.0 Average reward fake: 0.49883323907852173 Average reward real: 0.500308096408844 Training q_loss: 2.8112 Training g_loss: 2.0740 Training d_loss: 1.3830 Explore P: 0.86

-------------------------------------------------------------------------------
Episode: 92 Total reward: 22.0 Average reward fake: 0.4997202157974243 Average reward real: 0.4991941750049591 Training q_loss: 4.6467 Training g_loss: 2.0631 Training d_loss: 1.3874 Explore P: 0.8218
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 93 Total reward: 20.0 Average reward fake: 0.4994674026966095 Average reward real: 0.4996006488800049 Training q_loss: 2.9229 Training g_loss: 2.0643 Training d_loss: 1.3861 Explore P: 0.8201
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 94 Total reward: 56.0 Average reward fake: 0.49896126985549927 Average reward real: 0.4993942379951477 Training q_loss: 4.6402 Training g_loss: 2.0659 Training d_loss: 1.3854 Explore P: 0.8

-------------------------------------------------------------------------------
Episode: 115 Total reward: 11.0 Average reward fake: 0.49916571378707886 Average reward real: 0.49931061267852783 Training q_loss: 2.8025 Training g_loss: 2.0700 Training d_loss: 1.3858 Explore P: 0.7450
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 116 Total reward: 17.0 Average reward fake: 0.5005367398262024 Average reward real: 0.5004507303237915 Training q_loss: 32.2110 Training g_loss: 2.0987 Training d_loss: 1.3895 Explore P: 0.7438
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 117 Total reward: 31.0 Average reward fake: 0.5000736117362976 Average reward real: 0.5032782554626465 Training q_loss: 3.1685 Training g_loss: 2.0854 Training d_loss: 1.3800 Explore P

-------------------------------------------------------------------------------
Episode: 138 Total reward: 62.0 Average reward fake: 0.49945706129074097 Average reward real: 0.500033438205719 Training q_loss: 2.5994 Training g_loss: 2.0739 Training d_loss: 1.3851 Explore P: 0.6910
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 139 Total reward: 19.0 Average reward fake: 0.4996790289878845 Average reward real: 0.5000208616256714 Training q_loss: 2.4610 Training g_loss: 2.0752 Training d_loss: 1.3857 Explore P: 0.6897
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 140 Total reward: 26.0 Average reward fake: 0.5002864003181458 Average reward real: 0.5006552338600159 Training q_loss: 2.8991 Training g_loss: 2.0664 Training d_loss: 1.3856 Explore P: 0

-------------------------------------------------------------------------------
Episode: 161 Total reward: 32.0 Average reward fake: 0.5002392530441284 Average reward real: 0.5001225471496582 Training q_loss: 2.7488 Training g_loss: 2.0655 Training d_loss: 1.3866 Explore P: 0.6379
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 162 Total reward: 38.0 Average reward fake: 0.5000689625740051 Average reward real: 0.5002164244651794 Training q_loss: 2.3616 Training g_loss: 2.0685 Training d_loss: 1.3858 Explore P: 0.6355
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 163 Total reward: 15.0 Average reward fake: 0.4993981122970581 Average reward real: 0.4996839463710785 Training q_loss: 2.4063 Training g_loss: 2.0745 Training d_loss: 1.3857 Explore P: 0

-------------------------------------------------------------------------------
Episode: 184 Total reward: 27.0 Average reward fake: 0.5003901720046997 Average reward real: 0.4995577335357666 Training q_loss: 2.1440 Training g_loss: 2.0654 Training d_loss: 1.3878 Explore P: 0.5759
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 185 Total reward: 40.0 Average reward fake: 0.4999585449695587 Average reward real: 0.5003758072853088 Training q_loss: 2.5240 Training g_loss: 2.0654 Training d_loss: 1.3852 Explore P: 0.5737
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 186 Total reward: 84.0 Average reward fake: 0.49960067868232727 Average reward real: 0.49992233514785767 Training q_loss: 2.4007 Training g_loss: 2.0707 Training d_loss: 1.3859 Explore P:

-------------------------------------------------------------------------------
Episode: 207 Total reward: 13.0 Average reward fake: 0.5009104013442993 Average reward real: 0.5003104209899902 Training q_loss: 2.1682 Training g_loss: 2.0776 Training d_loss: 1.3873 Explore P: 0.5226
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 208 Total reward: 30.0 Average reward fake: 0.5002678036689758 Average reward real: 0.5015771389007568 Training q_loss: 2.4005 Training g_loss: 2.0723 Training d_loss: 1.3840 Explore P: 0.5211
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 209 Total reward: 44.0 Average reward fake: 0.5007959604263306 Average reward real: 0.5013903379440308 Training q_loss: 2.4386 Training g_loss: 2.0739 Training d_loss: 1.3851 Explore P: 0

-------------------------------------------------------------------------------
Episode: 230 Total reward: 142.0 Average reward fake: 0.5000513195991516 Average reward real: 0.5010879039764404 Training q_loss: 2.7476 Training g_loss: 2.0733 Training d_loss: 1.3845 Explore P: 0.4710
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 231 Total reward: 9.0 Average reward fake: 0.5002407431602478 Average reward real: 0.5005209445953369 Training q_loss: 3.4696 Training g_loss: 2.0706 Training d_loss: 1.3860 Explore P: 0.4705
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 232 Total reward: 28.0 Average reward fake: 0.49997633695602417 Average reward real: 0.5006880164146423 Training q_loss: 2.2351 Training g_loss: 2.0749 Training d_loss: 1.3855 Explore P: 

-------------------------------------------------------------------------------
Episode: 253 Total reward: 59.0 Average reward fake: 0.4996771216392517 Average reward real: 0.4993983507156372 Training q_loss: 2.9747 Training g_loss: 2.0798 Training d_loss: 1.3869 Explore P: 0.4314
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 254 Total reward: 35.0 Average reward fake: 0.5012040734291077 Average reward real: 0.5007290840148926 Training q_loss: 2.4865 Training g_loss: 2.0736 Training d_loss: 1.3869 Explore P: 0.4299
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 255 Total reward: 81.0 Average reward fake: 0.5006187558174133 Average reward real: 0.500103771686554 Training q_loss: 2.3693 Training g_loss: 2.0716 Training d_loss: 1.3873 Explore P: 0.

-------------------------------------------------------------------------------
Episode: 276 Total reward: 44.0 Average reward fake: 0.4991869330406189 Average reward real: 0.4998537302017212 Training q_loss: 2.1741 Training g_loss: 2.0759 Training d_loss: 1.3850 Explore P: 0.3919
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 277 Total reward: 67.0 Average reward fake: 0.5009555816650391 Average reward real: 0.5006891489028931 Training q_loss: 2.1097 Training g_loss: 2.0697 Training d_loss: 1.3859 Explore P: 0.3893
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 278 Total reward: 35.0 Average reward fake: 0.5001368522644043 Average reward real: 0.5004146695137024 Training q_loss: 2.4805 Training g_loss: 2.0778 Training d_loss: 1.3856 Explore P: 0

## Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_fake_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Fake rewards')

In [None]:
eps, arr = np.array(rewards_real_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Real rewards')

In [None]:
eps, arr = np.array(q_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Q losses')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

## Testing

Let's checkout how our trained agent plays the game.

In [55]:
test_episodes = 1
test_max_steps = 20000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

# # # Create the env after closing it.
# env = gym.make('CartPole-v0')
# # env = gym.make('Acrobot-v1')
env.reset()

with tf.Session() as sess:
    
    # Restore/load the trained model 
    #saver.restore(sess, 'checkpoints/QGAN-cartpole.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # iterations
    for ep in range(test_episodes):
        
        # number of env/rob steps
        t = 0
        while t < test_max_steps:
            
            # Rendering the env graphics
            env.render()
            
            # Get action from DQAN
            feed_dict = {model.states: state.reshape((1, *state.shape))}
            actions_logits = sess.run(model.actions_logits, feed_dict)
            action = np.argmax(actions_logits)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            # The task is done or not;
            if done:
                t = test_max_steps
                env.reset()
                
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())
            else:
                state = next_state
                t += 1

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt


In [56]:
# Closing the env
# WARNING: If you close, you can NOT restart again!!!!!!
env.close()

## Extending this to Deep Convolutional QAN

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.