# DQAN (Deep Q-Adversarial Nets): DQN (Deep Q-Nets) + GAN (Gen. Adv. Nets)

In this notebook, we'll combine a DQN (deep Q-network) with a GAN (generative adversarial network) to build an agent that learns to play games through reinforcement learning without a hand-written reward function. We'll call this network a DQAN (deep Q-adversarial net).
The adversarial nets learn to maximize the current reward based on past rewards.
The Q-net learns to maximize future rewards based on the current reward.
Given a task and a signal for when the task is done or has failed, we should be able to learn the task.

# DQN
More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import gym
import tensorflow as tf
import numpy as np

>**Note:** Make sure you have OpenAI Gym cloned into the same directory as this notebook. I've included `gym` as a submodule, so you can run `git submodule update --init --recursive` to pull the contents into the `gym` directory.

In [2]:
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing an action as an integer to `env.step` advances the simulation by one step. You can see how many actions are possible from `env.action_space`, and you can get a random action with `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
env.reset()
rewards, states, actions, dones = [], [], [], []
for _ in range(10):
    # env.render()
    action = env.action_space.sample() # pick a random action
    state, reward, done, info = env.step(action) # advance the simulation one step
    states.append(state)
    rewards.append(reward)
    actions.append(action)
    dones.append(done)
    print('state, action, reward, done, info')
    print(state, action, reward, done, info)
    if done:
        # The episode is over: start a new one instead of stepping a finished environment
        env.reset()

state, action, reward, done, info
[-0.0048124  -0.2450929   0.03652628  0.32928597] 0 1.0 False {}
state, action, reward, done, info
[-0.00971426 -0.05050945  0.043112    0.04834181] 1 1.0 False {}
state, action, reward, done, info
[-0.01072445  0.14396864  0.04407883 -0.23043327] 1 1.0 False {}
state, action, reward, done, info
[-0.00784507 -0.05175455  0.03947017  0.07582134] 0 1.0 False {}
state, action, reward, done, info
[-0.00888016  0.14277998  0.0409866  -0.20415198] 1 1.0 False {}
state, action, reward, done, info
[-0.00602456  0.33729255  0.03690356 -0.48362911] 1 1.0 False {}
state, action, reward, done, info
[ 7.21286127e-04  5.31874773e-01  2.72309739e-02 -7.64456531e-01] 1 1.0 False {}
state, action, reward, done, info
[ 0.01135878  0.72661135  0.01194184 -1.04844818] 1 1.0 False {}
state, action, reward, done, info
[ 0.02589101  0.92157282 -0.00902712 -1.33735872] 1 1.0 False {}
state, action, reward, done, info
[ 0.04432246  1.11680731 -0.03577429 -1.63285246] 1 1.0 Fal

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [4]:
print(rewards[-20:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print(np.max(np.array(actions)), np.min(np.array(actions)))
print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print(np.max(np.array(rewards)), np.min(np.array(rewards)))
print(np.max(np.array(states)), np.min(np.array(states)))

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
(10,) (10, 4) (10,) (10,)
float64 float64 int64 bool
1 0
2
1.0 1.0
1.11680731300025 -1.6328524636998423


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max_{a'} Q(s', a')
$$

where $s$ is a state, $a$ is an action, $r$ is the reward received, $\gamma$ is the discount factor for future rewards, and $s'$ is the next state reached from state $s$ by taking action $a$.

Before, we used this equation to learn values for a Q-_table_. However, this game has a huge number of possible states. The state has four values: the position and velocity of the cart, and the angle and angular velocity of the pole. These are all real-valued numbers, so, ignoring floating-point precision, there are effectively infinitely many states. Instead of using a table, then, we'll replace it with a neural network that approximates the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q-values are calculated by passing a state into the network. After the fully connected hidden layers, the output layer produces one Q-value $Q(s, a)$ for each available action.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max_{a'}{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$.
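As a quick numeric check of the target and loss above, here is a minimal sketch in plain Python. The Q-values, reward, and discount below are made-up numbers chosen for illustration, and `td_target` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
# Hypothetical numbers for one transition (s, a, r, s'); none of them come from a trained model.
gamma = 0.99            # discount factor
reward = 1.0            # reward r observed after taking action a
q_s = [0.5, 0.8]        # assumed network output Q(s, .) for the two actions
q_next = [1.2, 0.7]     # assumed network output Q(s', .)
action = 0              # action actually taken in state s


def td_target(reward, q_next, gamma, done=False):
    """Target r + gamma * max_a' Q(s', a'); just r when the episode has ended."""
    return reward if done else reward + gamma * max(q_next)


target = td_target(reward, q_next, gamma)
loss = (target - q_s[action]) ** 2   # squared error between target and Q(s, a)

print(round(target, 3), round(loss, 4))
```

Only the Q-value of the action that was actually taken enters the loss; the other output of the network is left alone for this transition.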

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with leaky ReLU activations (`tf.maximum(alpha * h, h)`). Two layers seem to be good enough; three might be better. Feel free to try it out.

In [5]:
def model_input(state_size):
    # Given data
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    next_states = tf.placeholder(tf.float32, [None, state_size], name='next_states')
    
    # Actions as output
    actions = tf.placeholder(tf.int32, [None], name='actions')

    # Target Q values for training
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    return states, next_states, actions, targetQs

In [6]:
# Reference: signature of tf.layers.dense
# tf.layers.dense(
#     inputs,
#     units,
#     activation=None,
#     use_bias=True,
#     kernel_initializer=None,
#     bias_initializer=tf.zeros_initializer(),
#     kernel_regularizer=None,
#     bias_regularizer=None,
#     activity_regularizer=None,
#     kernel_constraint=None,
#     bias_constraint=None,
#     trainable=True,
#     name=None,
#     reuse=None
# )

In [7]:
# Q function
def generator(states, state_size, action_size, hidden_size, reuse=False, alpha=0.1): #training=True ~ batchnorm
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        #bn1 = tf.layers.batch_normalization(h1, training=training) #training=True ~ batchnorm
        nl1 = tf.maximum(alpha * h1, h1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        #bn2 = tf.layers.batch_normalization(h2, training=training) #training=True ~ batchnorm
        nl2 = tf.maximum(alpha * h2, h2)
        
        # Output head 1: one logit (Q-value) per action
        logits_actions = tf.layers.dense(inputs=nl2, units=action_size)
        #predictions = tf.nn.softmax(logits_actions)

        # Output head 2: predicted next state (this layer is created with trainable=False)
        logits_next_states = tf.layers.dense(inputs=nl2, units=state_size, trainable=False)
        #predictions = tf.nn.softmax(logits_next_states)

        # # Output layer
        # logits = tf.layers.dense(inputs=nl2, units=(state_size + action_size))

        # # Split the states and actions (opposite of fusion)
        # logits_next_states, logits_actions = tf.split(value=logits, axis=1, num_or_size_splits=[state_size, action_size])

        return logits_actions, logits_next_states
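
The `tf.maximum(alpha * h, h)` lines in the generator act as a leaky ReLU as long as `alpha` is between 0 and 1. A minimal sketch of the same expression in plain Python, where `leaky_relu` is an illustrative helper name and the inputs are made up:

```python
def leaky_relu(h, alpha=0.1):
    """Elementwise max(alpha * x, x), the same expression the layers apply with tf.maximum."""
    return [max(alpha * x, x) for x in h]


pre_activations = [-2.0, -0.5, 0.0, 1.5]   # made-up layer outputs before the nonlinearity
print(leaky_relu(pre_activations))
```

Positive values pass through unchanged, while negative values are scaled by `alpha` instead of being set to zero, so some gradient still flows through units with negative pre-activations.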

In [8]:
# This is a reward function: Rt(St+1, at) or Rt(~St+1, ~at)
def discriminator(next_states, actions, hidden_size, reuse=False, alpha=0.1): #training=True ~ batchnorm
    with tf.variable_scope('discriminator', reuse=reuse):
        # Stack/concatenate/fuse actions and states or 
        # predicted/reconstructed actions and states
        x_fused = tf.concat(values=(next_states, actions), axis=1)
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        #bn1 = tf.layers.batch_normalization(h1, training=True)
        nl1 = tf.maximum(alpha * h1, h1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        #bn2 = tf.layers.batch_normalization(h2, training=True)
        nl2 = tf.maximum(alpha * h2, h2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)   
        #predictions = tf.sigmoid(logits)

        # logits for loss and reward/prob/out
        return logits
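
To make the discriminator's input and output concrete, here is a small sketch in plain Python for a single sample. The observation, the action, and the logit are made-up values, and `one_hot` and `sigmoid` are illustrative helpers; in `model_loss` the next states are also squashed with a sigmoid before they are fused, which is left out here for brevity.

```python
import math


def one_hot(index, depth):
    """Return a list of length depth with 1.0 at position index and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


next_state = [0.02, 0.15, -0.03, -0.2]   # made-up 4-value Cart-Pole observation
action = 1                               # move right

# Per-sample equivalent of tf.concat(values=(next_states, actions), axis=1)
fused = next_state + one_hot(action, depth=2)
print(len(fused), fused)

logit = 0.0                              # assumed discriminator output for this input
print(sigmoid(logit))                    # reward-like score in (0, 1)
```

With four state values and two actions, the fused vector has six entries, and the sigmoid of the single output logit is what the notebook later treats as the reward.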

In [12]:
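# Q_t(s_t, a_t) = R_t(s_{t+1}, a_t) + gamma * max_a Q_{t+1}(s_{t+1}, a)
def model_loss(states, next_states, state_size, actions, action_size, hidden_size, targetQs, alpha=0.1):
    """
    Get the losses for the discriminator, the generator and the Q-values
    :param states: real current input states or observations given
    :param next_states: real next states observed after taking the actions
    :param actions: real actions given
    :param targetQs: target Q-values computed from the Bellman equation
    :return: A tuple of (discriminator loss, generator loss, Q loss, action logits, Qs, fake rewards, real rewards)
    """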
    # The fake/generated actions
    actions_logits, next_states_logits = generator(states=states, state_size=state_size, hidden_size=hidden_size, 
                                              action_size=action_size)
    #print(actions_logits.shape, next_states_logits.shape)
    actions_fake = tf.nn.softmax(actions_logits)
    next_states_fake = tf.sigmoid(x=next_states_logits)
    d_logits_fake = discriminator(next_states=next_states_fake, actions=actions_fake, hidden_size=hidden_size, reuse=False)

    # The real onehot encoded actions
    actions_real = tf.one_hot(actions, action_size)
    next_states_real = tf.sigmoid(x=next_states) 
    d_logits_real = discriminator(next_states=next_states_real, actions=actions_real, hidden_size=hidden_size, reuse=True)

    # Training the rewarding function
    d_loss_real = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_real, labels=tf.ones_like(d_logits_real)))
    d_loss_fake = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.zeros_like(d_logits_fake)))
    d_loss = d_loss_real + d_loss_fake
    
    # Train the generator to maximize the current reward (0-1)
    g_loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(logits=d_logits_fake, labels=tf.ones_like(d_logits_fake)))

    # Train the generator to maximize the future rewards: Bellman equations: loss (targetQ - Q)^2
    Qs = tf.reduce_sum(tf.multiply(actions_logits, actions_real), axis=1)
    q_loss = tf.reduce_mean(tf.square(targetQs - Qs))

    # The generated rewards for Bellman equation
    rewards_fake = tf.sigmoid(d_logits_fake)
    rewards_real = tf.sigmoid(d_logits_real)

    return d_loss, g_loss, q_loss, actions_logits, Qs, rewards_fake, rewards_real
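
To get a feel for the adversarial losses, the sketch below evaluates the sigmoid cross-entropy by hand for a single pair of logits. It uses the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` given in the TensorFlow documentation for `tf.nn.sigmoid_cross_entropy_with_logits`; the logits are assumed values and `sigmoid_cross_entropy` is an illustrative helper name.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Stable sigmoid cross-entropy for one logit x and one label z in {0, 1}."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Assumed logits: 0.0 corresponds to a discriminator that has not yet learned anything.
d_logit_real, d_logit_fake = 0.0, 0.0

# Discriminator: real pairs labelled 1, generated pairs labelled 0
d_loss = sigmoid_cross_entropy(d_logit_real, 1.0) + sigmoid_cross_entropy(d_logit_fake, 0.0)
# Generator: wants its pairs to be scored as real (label 1)
g_loss = sigmoid_cross_entropy(d_logit_fake, 1.0)

print(round(d_loss, 4), round(g_loss, 4))   # 1.3863 0.6931
```

These numbers, $2\ln 2$ and $\ln 2$, are close to the `d_loss` of 1.3826 and `g_loss` of 0.7213 logged for episode 0 in the training output, which is what one would expect from a freshly initialized discriminator.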

In [13]:
def model_opt(d_loss, g_loss, q_loss, learning_rate):
    """
    Get optimization operations
    :param d_loss: Discriminator/Reward loss Tensor for reward function
    :param g_loss: Generator/Q-value loss Tensor for action & next state prediction
    :param q_loss: Value loss Tensor
    :param learning_rate: Learning rate for the Adam optimizers
    :return: A tuple of (discriminator training operation, generator training operation, Q-value training operation)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars)
        q_opt = tf.train.AdamOptimizer(learning_rate).minimize(q_loss, var_list=g_vars)

    return d_opt, g_opt, q_opt

In [14]:
class DQAN:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.next_states, self.actions, self.targetQs = model_input(state_size=state_size)
        #print(self.states, self.next_states, self.actions, self.targetQs)

        # Create the Model: calculating the loss and forward pass
        self.d_loss, self.g_loss, self.q_loss, self.actions_logits, self.Qs, self.rewards_fake, self.rewards_real = model_loss(
            state_size=state_size, action_size=action_size, actions=self.actions, 
            states=self.states, next_states=self.next_states, 
            hidden_size=hidden_size, targetQs=self.targetQs)

        # Update the model: backward pass and backprop
        self.d_opt, self.g_opt, self.q_opt = model_opt(d_loss=self.d_loss, g_loss=self.g_loss, 
                                                       q_loss=self.q_loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it was created with a maximum length and is full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [15]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
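
The "tube" behaviour of a bounded `deque` is easy to see on a tiny example. This sketch uses string placeholders instead of real transitions, and the standard library's `random.sample` as a stand-in for the `np.random.choice(..., replace=False)` call used in `Memory.sample`.

```python
import random
from collections import deque

# A buffer that holds at most three experiences
buffer = deque(maxlen=3)
for experience in ['e1', 'e2', 'e3', 'e4']:
    buffer.append(experience)       # appending 'e4' pushes 'e1' out the other side

print(list(buffer))                 # ['e2', 'e3', 'e4']

# Draw a mini-batch of two distinct experiences (sampling without replacement)
random.seed(0)
batch = random.sample(list(buffer), 2)
print(len(batch))
```

Because sampling is without replacement, the batch size can never exceed the number of stored experiences, which is why the memory is pre-populated before training starts.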

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon). That is, with probability $\epsilon$ the agent will take a random action, and with probability $1 - \epsilon$ the agent will choose the action with the highest $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
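
A minimal sketch of $\epsilon$-greedy action selection in plain Python. The Q-values are made up, and `epsilon_greedy` with its `rng` argument is a hypothetical helper for illustration, not a function used elsewhere in this notebook.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, otherwise the argmax of q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit


q_values = [0.2, 0.9]    # made-up Q(s, .) for the two Cart-Pole actions

# epsilon = 0 always exploits, so the higher-valued action 1 is chosen
print(epsilon_greedy(q_values, epsilon=0.0))   # 1

# epsilon = 1 always explores, so both actions show up over many draws
rng = random.Random(0)
picks = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(1000)]
print(sorted(set(picks)))
```

Shrinking `epsilon` over the course of training moves the agent smoothly from the second behaviour toward the first.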

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t = 1$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s_t,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**
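
The steps above can be sketched end to end on a toy problem. This is only an illustration under loud assumptions: `BalanceEnv` is a hypothetical stand-in for Cart-Pole, a table initialized to zeros replaces the randomly initialized network, $\epsilon$ is held fixed, and the table update with learning rate `lr` stands in for the gradient step.

```python
import random
from collections import deque


class BalanceEnv:
    """Hypothetical toy stand-in for Cart-Pole: keep an integer position strictly inside (-2, 2)."""

    def reset(self):
        self.x = 0
        return self.x

    def step(self, action):
        self.x += 1 if action == 1 else -1   # action 0 = left, 1 = right
        done = abs(self.x) >= 2              # moved too far: the episode ends
        return self.x, 1.0, done             # reward of 1.0 per step, like Cart-Pole


random.seed(0)
gamma, lr, epsilon = 0.9, 0.5, 0.3           # assumed values; epsilon is fixed for brevity
episodes, max_steps, batch_size = 200, 20, 8

memory = deque(maxlen=500)                   # initialize the memory D
Q = {x: [0.0, 0.0] for x in range(-2, 3)}    # table in place of the action-value network
env = BalanceEnv()

for episode in range(episodes):
    state = env.reset()
    for t in range(max_steps):
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.randrange(2)
        else:
            action = max((0, 1), key=lambda a: Q[state][a])

        # Execute the action and store the transition in memory
        next_state, reward, done = env.step(action)
        memory.append((state, action, reward, next_state, done))

        # Sample a mini-batch and move Q(s_j, a_j) toward the target
        batch = random.sample(list(memory), min(batch_size, len(memory)))
        for s, a, r, s_next, ended in batch:
            target = r if ended else r + gamma * max(Q[s_next])
            Q[s][a] += lr * (target - Q[s][a])

        if done:
            break
        state = next_state

print({x: [round(q, 2) for q in Q[x]] for x in (-1, 0, 1)})
```

After training, the table should prefer moving back toward the center: at position 1 the value of going left exceeds the value of going right, and the reverse holds at position -1.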

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're also tuning the simulation.

In [16]:
train_episodes = 3000          # max number of episodes to learn from
max_steps = 200               # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
hidden_size = 64              # number of units in each Q-network hidden layer -- simulation
state_size = 4                # number of units for the input state/observation -- simulation
action_size = 2               # number of units for the output actions -- simulation

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 10                # experience mini-batch size

# Optimizer parameters
learning_rate = 0.001          # learning rate for adam
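
To see what the exploration parameters imply, the sketch below evaluates the exponential decay that the training loop further down uses for `explore_p`. The three values are copied from the cell above; `explore_probability` is an illustrative helper name.

```python
import math

explore_start, explore_stop, decay_rate = 1.0, 0.01, 0.0001   # values from the cell above


def explore_probability(step):
    """Exponential decay from explore_start toward explore_stop as the global step grows."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)


for step in (0, 2, 10_000, 100_000):
    print(step, round(explore_probability(step), 4))
```

At step 2 this gives roughly 0.9998, which matches the `Explore P` value logged for episode 0 in the training output. With `decay_rate = 0.0001`, it takes on the order of ten thousand steps before the agent exploits more often than it explores.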

In [17]:
tf.reset_default_graph()
model = DQAN(action_size=action_size, hidden_size=hidden_size, state_size=state_size, 
                 learning_rate=learning_rate)

## Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [18]:
# Initialize the simulation
env.reset()

# Take one random step to get the pole and cart moving
state, reward, done, _ = env.step(env.action_space.sample())

# init memory
memory = Memory(max_size=memory_size)

# Make a bunch of random actions and store the experiences
for _ in range(batch_size):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)

    if done:
        # The simulation fails so no next state
        next_state = np.zeros(state.shape)
        
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        
        # Start new episode
        env.reset()
        
        # Take one random step to get the pole and cart moving
        state, reward, done, _ = env.step(env.action_space.sample())
    else:
        # Add experience to memory
        memory.add((state, action, reward, next_state))
        state = next_state

## Training

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
# Now train with experiences
saver = tf.train.Saver()

# Total rewards and losses list for plotting
rewards_list, rewards_fake_list, rewards_real_list = [], [], []
d_loss_list, g_loss_list, q_loss_list = [], [], [] 

# TF session for training
with tf.Session() as sess:
    
    # Initialize variables
    sess.run(tf.global_variables_initializer())

    # Training episodes/epochs
    step = 0
    for ep in range(train_episodes):
        
        # Env/agent steps/batches/minibatches
        total_reward, rewards_fake_mean, rewards_real_mean = 0, 0, 0
        d_loss, g_loss, q_loss = 0, 0, 0
        t = 0
        while t < max_steps:
            step += 1
            
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from model
                feed_dict = {model.states: state.reshape((1, *state.shape))}
                actions_logits = sess.run(model.actions_logits, feed_dict)
                action = np.argmax(actions_logits)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
    
            # Cumulative reward
            total_reward += reward
            
            # Episode/epoch training is done/failed!
            if done:
                # the episode ends so no next state
                next_state = np.zeros(state.shape)
                t = max_steps
                
                print('-------------------------------------------------------------------------------')
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Average reward fake: {}'.format(rewards_fake_mean),
                      'Average reward real: {}'.format(rewards_real_mean),
                      'Training d_loss: {:.4f}'.format(d_loss),
                      'Training g_loss: {:.4f}'.format(g_loss),
                      'Training q_loss: {:.4f}'.format(q_loss),
                      'Explore P: {:.4f}'.format(explore_p))
                print('-------------------------------------------------------------------------------')
                
                # total rewards and losses for plotting
                rewards_list.append((ep, total_reward))
                rewards_fake_list.append((ep, rewards_fake_mean))
                rewards_real_list.append((ep, rewards_real_mean))
                d_loss_list.append((ep, d_loss))
                g_loss_list.append((ep, g_loss))
                q_loss_list.append((ep, q_loss))
                
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                
                # Start new episode
                env.reset()
                
                # Take one random step to get the pole and cart moving
                state, reward, done, _ = env.step(env.action_space.sample())

            else:
                # Add experience to memory
                memory.add((state, action, reward, next_state))
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            #rewards = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            
            # Train the model
            feed_dict = {model.states: states, model.next_states: next_states, model.actions: actions}
            rewards_fake, rewards_real = sess.run([model.rewards_fake, model.rewards_real], feed_dict)
            feed_dict={model.states: next_states}
            next_actions_logits = sess.run(model.actions_logits, feed_dict)

            # Mean/average fake and real rewards or rewarded generated/given actions
            rewards_fake_mean = np.mean(rewards_fake.reshape(-1))
            rewards_real_mean = np.mean(rewards_real.reshape(-1))
            
            # Zero the next-state Q-values where the episode ended, so the target is just the reward
            episode_ends = (next_states == np.zeros(states[0].shape)).all(axis=1)
            next_actions_logits[episode_ends] = (0, 0)

            # Bellman equation: Qt = Rt + gamma * max(Qt+1)
            targetQs = rewards_fake.reshape(-1) + (gamma * np.max(next_actions_logits, axis=1))

            # Updating the model
            feed_dict = {model.states: states, model.next_states: next_states, model.actions: actions, model.targetQs: targetQs}
            d_loss, _ = sess.run([model.d_loss, model.d_opt], feed_dict)
            g_loss, _ = sess.run([model.g_loss, model.g_opt], feed_dict)
            q_loss, _ = sess.run([model.q_loss, model.q_opt], feed_dict)
            
    # Save the trained model 
    saver.save(sess, 'checkpoints/DQAN-cartpole.ckpt')

-------------------------------------------------------------------------------
Episode: 0 Total reward: 2.0 Average reward fake: 0.4857206344604492 Average reward real: 0.48793157935142517 Training d_loss: 1.3826 Training g_loss: 0.7213 Training q_loss: 0.3421 Explore P: 0.9998
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1 Total reward: 19.0 Average reward fake: 0.4790409207344055 Average reward real: 0.5361893773078918 Training d_loss: 1.2777 Training g_loss: 0.7402 Training q_loss: 0.5650 Explore P: 0.9979
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2 Total reward: 9.0 Average reward fake: 0.4861293435096741 Average reward real: 0.5334004163742065 Training d_loss: 1.2992 Training g_loss: 0.7248 Training q_loss: 1.2741 Explore P: 0.9970
-

-------------------------------------------------------------------------------
Episode: 23 Total reward: 37.0 Average reward fake: 0.07496147602796555 Average reward real: 0.8563590049743652 Training d_loss: 0.2776 Training g_loss: 2.6004 Training q_loss: 36.1427 Explore P: 0.9533
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 24 Total reward: 20.0 Average reward fake: 0.04531712085008621 Average reward real: 0.925586998462677 Training d_loss: 0.1295 Training g_loss: 3.1313 Training q_loss: 711.0956 Explore P: 0.9514
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 25 Total reward: 22.0 Average reward fake: 0.03569464758038521 Average reward real: 0.9571079015731812 Training d_loss: 0.0811 Training g_loss: 3.3664 Training q_loss: 1904.9053 Explore

-------------------------------------------------------------------------------
Episode: 46 Total reward: 18.0 Average reward fake: 0.0012542642652988434 Average reward real: 0.9988151788711548 Training d_loss: 0.0024 Training g_loss: 6.6910 Training q_loss: 706.6235 Explore P: 0.9079
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 47 Total reward: 9.0 Average reward fake: 0.001158001134172082 Average reward real: 0.9985052943229675 Training d_loss: 0.0027 Training g_loss: 6.7720 Training q_loss: 19.1774 Explore P: 0.9070
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 48 Total reward: 16.0 Average reward fake: 0.0009750787867233157 Average reward real: 0.9936079978942871 Training d_loss: 0.0075 Training g_loss: 6.9384 Training q_loss: 911.0749 Exp

-------------------------------------------------------------------------------
Episode: 69 Total reward: 18.0 Average reward fake: 0.11191314458847046 Average reward real: 0.9220808744430542 Training d_loss: 0.3312 Training g_loss: 5.2811 Training q_loss: 1093.0222 Explore P: 0.8682
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 70 Total reward: 13.0 Average reward fake: 0.23422174155712128 Average reward real: 0.7822372317314148 Training d_loss: 0.8907 Training g_loss: 3.7306 Training q_loss: 2612.0793 Explore P: 0.8671
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 71 Total reward: 20.0 Average reward fake: 0.16909615695476532 Average reward real: 0.7002133727073669 Training d_loss: 0.7023 Training g_loss: 3.4263 Training q_loss: 2686.6580 Exp

-------------------------------------------------------------------------------
Episode: 92 Total reward: 17.0 Average reward fake: 0.22489003837108612 Average reward real: 0.5890467166900635 Training d_loss: 0.9612 Training g_loss: 1.5748 Training q_loss: 1513.7434 Explore P: 0.8256
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 93 Total reward: 11.0 Average reward fake: 0.2837497889995575 Average reward real: 0.831872284412384 Training d_loss: 0.5809 Training g_loss: 1.2960 Training q_loss: 85043.9141 Explore P: 0.8247
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 94 Total reward: 32.0 Average reward fake: 0.4360080659389496 Average reward real: 0.7237445116043091 Training d_loss: 0.9768 Training g_loss: 0.8569 Training q_loss: 1060.9236 Explo

-------------------------------------------------------------------------------
Episode: 115 Total reward: 14.0 Average reward fake: 0.1790730059146881 Average reward real: 0.5370428562164307 Training d_loss: 0.9327 Training g_loss: 1.7778 Training q_loss: 898.9618 Explore P: 0.7890
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 116 Total reward: 48.0 Average reward fake: 0.3045367896556854 Average reward real: 0.7632613182067871 Training d_loss: 0.7294 Training g_loss: 1.2031 Training q_loss: 33292.9258 Explore P: 0.7853
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 117 Total reward: 17.0 Average reward fake: 0.3115319609642029 Average reward real: 0.7854615449905396 Training d_loss: 0.6964 Training g_loss: 1.1647 Training q_loss: 6486.5757 Exp

-------------------------------------------------------------------------------
Episode: 138 Total reward: 24.0 Average reward fake: 0.3221326470375061 Average reward real: 0.5310887098312378 Training d_loss: 1.1072 Training g_loss: 1.1356 Training q_loss: 217.9126 Explore P: 0.7540
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 139 Total reward: 26.0 Average reward fake: 0.376272052526474 Average reward real: 0.512431263923645 Training d_loss: 1.2898 Training g_loss: 1.0273 Training q_loss: 89.7060 Explore P: 0.7520
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 140 Total reward: 23.0 Average reward fake: 0.7147653698921204 Average reward real: 0.48251086473464966 Training d_loss: 2.1943 Training g_loss: 0.3806 Training q_loss: 170.9080 Explore 

-------------------------------------------------------------------------------
Episode: 161 Total reward: 38.0 Average reward fake: 0.47569188475608826 Average reward real: 0.5328116416931152 Training d_loss: 1.2985 Training g_loss: 0.7597 Training q_loss: 800.6892 Explore P: 0.7163
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 162 Total reward: 13.0 Average reward fake: 0.47547999024391174 Average reward real: 0.5146859288215637 Training d_loss: 1.3276 Training g_loss: 0.7524 Training q_loss: 108.6393 Explore P: 0.7154
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 163 Total reward: 34.0 Average reward fake: 0.5025621652603149 Average reward real: 0.5047529935836792 Training d_loss: 1.3953 Training g_loss: 0.6974 Training q_loss: 167.5672 Expl

-------------------------------------------------------------------------------
Episode: 184 Total reward: 26.0 Average reward fake: 0.5167390704154968 Average reward real: 0.5160118341445923 Training d_loss: 1.3918 Training g_loss: 0.6632 Training q_loss: 99.7986 Explore P: 0.6574
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 185 Total reward: 66.0 Average reward fake: 0.519392728805542 Average reward real: 0.4848322868347168 Training d_loss: 1.4623 Training g_loss: 0.6620 Training q_loss: 237.2868 Explore P: 0.6532
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 186 Total reward: 17.0 Average reward fake: 0.5101035833358765 Average reward real: 0.5147465467453003 Training d_loss: 1.3789 Training g_loss: 0.6782 Training q_loss: 147.7136 Explore 

-------------------------------------------------------------------------------
Episode: 207 Total reward: 63.0 Average reward fake: 0.42344599962234497 Average reward real: 0.4876941740512848 Training d_loss: 1.2855 Training g_loss: 0.9155 Training q_loss: 107.7861 Explore P: 0.6068
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 208 Total reward: 61.0 Average reward fake: 0.48608484864234924 Average reward real: 0.4984307885169983 Training d_loss: 1.4111 Training g_loss: 0.7944 Training q_loss: 70.9301 Explore P: 0.6032
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 209 Total reward: 55.0 Average reward fake: 0.5102697610855103 Average reward real: 0.5120271444320679 Training d_loss: 1.3901 Training g_loss: 0.6775 Training q_loss: 87.9516 Explor

-------------------------------------------------------------------------------
Episode: 230 Total reward: 42.0 Average reward fake: 0.5282644033432007 Average reward real: 0.5222952961921692 Training d_loss: 1.4067 Training g_loss: 0.6503 Training q_loss: 540.0565 Explore P: 0.5446
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 231 Total reward: 37.0 Average reward fake: 0.4998428225517273 Average reward real: 0.5337164998054504 Training d_loss: 1.3277 Training g_loss: 0.7250 Training q_loss: 2116.4255 Explore P: 0.5426
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 232 Total reward: 49.0 Average reward fake: 0.4613123834133148 Average reward real: 0.5381189584732056 Training d_loss: 1.2590 Training g_loss: 0.8439 Training q_loss: 1274.8816 Expl

-------------------------------------------------------------------------------
Episode: 253 Total reward: 33.0 Average reward fake: 0.42002350091934204 Average reward real: 0.5142834186553955 Training d_loss: 1.2516 Training g_loss: 1.1239 Training q_loss: 1374.0295 Explore P: 0.4904
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 254 Total reward: 28.0 Average reward fake: 0.39204201102256775 Average reward real: 0.46898603439331055 Training d_loss: 1.3336 Training g_loss: 1.0823 Training q_loss: 860.7867 Explore P: 0.4891
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 255 Total reward: 53.0 Average reward fake: 0.3722448945045471 Average reward real: 0.5701927542686462 Training d_loss: 1.0997 Training g_loss: 1.4360 Training q_loss: 48.3938 Exp

-------------------------------------------------------------------------------
Episode: 276 Total reward: 42.0 Average reward fake: 0.4362109303474426 Average reward real: 0.5120466947555542 Training d_loss: 1.2700 Training g_loss: 1.0317 Training q_loss: 108.1025 Explore P: 0.4291
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 277 Total reward: 26.0 Average reward fake: 0.46622323989868164 Average reward real: 0.4624517560005188 Training d_loss: 1.4495 Training g_loss: 0.9444 Training q_loss: 98.3774 Explore P: 0.4280
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 278 Total reward: 41.0 Average reward fake: 0.5044244527816772 Average reward real: 0.5037802457809448 Training d_loss: 1.4412 Training g_loss: 0.7036 Training q_loss: 94.8034 Explore

-------------------------------------------------------------------------------
Episode: 299 Total reward: 63.0 Average reward fake: 0.3347374498844147 Average reward real: 0.5414596199989319 Training d_loss: 1.0695 Training g_loss: 1.4617 Training q_loss: 63.6905 Explore P: 0.3862
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 300 Total reward: 35.0 Average reward fake: 0.4720606803894043 Average reward real: 0.5671070218086243 Training d_loss: 1.2764 Training g_loss: 0.9716 Training q_loss: 456.8264 Explore P: 0.3849
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 301 Total reward: 50.0 Average reward fake: 0.5820040106773376 Average reward real: 0.5153404474258423 Training d_loss: 1.5491 Training g_loss: 0.5496 Training q_loss: 347.9914 Explore

-------------------------------------------------------------------------------
Episode: 322 Total reward: 76.0 Average reward fake: 0.5359446406364441 Average reward real: 0.5328087210655212 Training d_loss: 1.4033 Training g_loss: 0.6342 Training q_loss: 42.3100 Explore P: 0.3310
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 323 Total reward: 104.0 Average reward fake: 0.5191971659660339 Average reward real: 0.517368495464325 Training d_loss: 1.3925 Training g_loss: 0.6557 Training q_loss: 104.9964 Explore P: 0.3277
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 324 Total reward: 112.0 Average reward fake: 0.4584777355194092 Average reward real: 0.5031973123550415 Training d_loss: 1.3338 Training g_loss: 0.9764 Training q_loss: 2411.8755 Explo

-------------------------------------------------------------------------------
Episode: 345 Total reward: 199.0 Average reward fake: 0.5203058123588562 Average reward real: 0.5044618248939514 Training d_loss: 1.5018 Training g_loss: 0.6782 Training q_loss: 386.6452 Explore P: 0.2287
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 346 Total reward: 199.0 Average reward fake: 0.5125833749771118 Average reward real: 0.507994532585144 Training d_loss: 1.3962 Training g_loss: 0.6692 Training q_loss: 341.7980 Explore P: 0.2244
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 347 Total reward: 199.0 Average reward fake: 0.41424861550331116 Average reward real: 0.5212172269821167 Training d_loss: 1.2276 Training g_loss: 1.1884 Training q_loss: 158.7876 Exp

-------------------------------------------------------------------------------
Episode: 368 Total reward: 9.0 Average reward fake: 0.323591411113739 Average reward real: 0.7074069380760193 Training d_loss: 0.8588 Training g_loss: 1.1270 Training q_loss: 1623.8845 Explore P: 0.1751
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 369 Total reward: 11.0 Average reward fake: 0.3628038465976715 Average reward real: 0.7183411121368408 Training d_loss: 0.8915 Training g_loss: 1.0098 Training q_loss: 2037.3035 Explore P: 0.1750
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 370 Total reward: 16.0 Average reward fake: 0.35095542669296265 Average reward real: 0.7178520560264587 Training d_loss: 0.8782 Training g_loss: 1.0549 Training q_loss: 3366.5601 Expl

-------------------------------------------------------------------------------
Episode: 391 Total reward: 30.0 Average reward fake: 0.10348407924175262 Average reward real: 0.24329009652137756 Training d_loss: 1.8530 Training g_loss: 2.1776 Training q_loss: 2967.5894 Explore P: 0.1709
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 392 Total reward: 11.0 Average reward fake: 0.4303690493106842 Average reward real: 0.516538679599762 Training d_loss: 1.2557 Training g_loss: 0.8060 Training q_loss: 2347.0750 Explore P: 0.1707
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 393 Total reward: 10.0 Average reward fake: 0.37423184514045715 Average reward real: 0.5928791761398315 Training d_loss: 1.0810 Training g_loss: 1.0561 Training q_loss: 2060.4131 E

-------------------------------------------------------------------------------
Episode: 415 Total reward: 9.0 Average reward fake: 0.34877878427505493 Average reward real: 0.6036416292190552 Training d_loss: 1.0776 Training g_loss: 1.0604 Training q_loss: 1591.4636 Explore P: 0.1670
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 416 Total reward: 14.0 Average reward fake: 0.34315234422683716 Average reward real: 0.6744076013565063 Training d_loss: 0.9461 Training g_loss: 1.0615 Training q_loss: 1876.6088 Explore P: 0.1668
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 417 Total reward: 7.0 Average reward fake: 0.3596484065055847 Average reward real: 0.47103700041770935 Training d_loss: 1.3140 Training g_loss: 1.0188 Training q_loss: 3464.1750 Ex

-------------------------------------------------------------------------------
Episode: 439 Total reward: 133.0 Average reward fake: 0.24305656552314758 Average reward real: 0.6720507740974426 Training d_loss: 0.7423 Training g_loss: 1.6109 Training q_loss: 2785.4031 Explore P: 0.1606
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 440 Total reward: 135.0 Average reward fake: 0.030591070652008057 Average reward real: 0.9634857177734375 Training d_loss: 0.0685 Training g_loss: 3.8454 Training q_loss: 1791.1104 Explore P: 0.1586
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 441 Total reward: 149.0 Average reward fake: 0.44469889998435974 Average reward real: 0.41166409850120544 Training d_loss: 1.5519 Training g_loss: 0.8374 Training q_loss: 3485.

-------------------------------------------------------------------------------
Episode: 462 Total reward: 199.0 Average reward fake: 0.4757792353630066 Average reward real: 0.49551767110824585 Training d_loss: 1.3554 Training g_loss: 0.7614 Training q_loss: 253.3008 Explore P: 0.1065
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 463 Total reward: 199.0 Average reward fake: 0.5138950943946838 Average reward real: 0.5451699495315552 Training d_loss: 1.3370 Training g_loss: 0.6875 Training q_loss: 56.3705 Explore P: 0.1046
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 464 Total reward: 199.0 Average reward fake: 0.528123676776886 Average reward real: 0.5344921350479126 Training d_loss: 1.4094 Training g_loss: 0.6554 Training q_loss: 88.9191 Explo

-------------------------------------------------------------------------------
Episode: 485 Total reward: 199.0 Average reward fake: 0.36423593759536743 Average reward real: 0.7539551854133606 Training d_loss: 0.8582 Training g_loss: 1.8836 Training q_loss: 155.8489 Explore P: 0.0711
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 486 Total reward: 107.0 Average reward fake: 0.36974698305130005 Average reward real: 0.6632050275802612 Training d_loss: 0.9573 Training g_loss: 1.1120 Training q_loss: 184.9412 Explore P: 0.0704
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 487 Total reward: 15.0 Average reward fake: 0.3709791302680969 Average reward real: 0.6468536257743835 Training d_loss: 0.9693 Training g_loss: 1.5932 Training q_loss: 377.4391 Ex

-------------------------------------------------------------------------------
Episode: 508 Total reward: 7.0 Average reward fake: 0.14433881640434265 Average reward real: 0.9158690571784973 Training d_loss: 0.2941 Training g_loss: 2.7717 Training q_loss: 1854.3226 Explore P: 0.0691
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 509 Total reward: 12.0 Average reward fake: 0.289995014667511 Average reward real: 0.7051320672035217 Training d_loss: 0.7661 Training g_loss: 1.2736 Training q_loss: 6694.2212 Explore P: 0.0690
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 510 Total reward: 9.0 Average reward fake: 0.3137693405151367 Average reward real: 0.6600072383880615 Training d_loss: 0.8823 Training g_loss: 1.1697 Training q_loss: 10129.3145 Expl

-------------------------------------------------------------------------------
Episode: 531 Total reward: 10.0 Average reward fake: 0.3089780807495117 Average reward real: 0.5229489207267761 Training d_loss: 1.2614 Training g_loss: 1.1303 Training q_loss: 1932.1428 Explore P: 0.0678
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 532 Total reward: 8.0 Average reward fake: 0.337416410446167 Average reward real: 0.771574079990387 Training d_loss: 0.7651 Training g_loss: 1.6070 Training q_loss: 4587.4932 Explore P: 0.0678
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 533 Total reward: 9.0 Average reward fake: 0.2659350037574768 Average reward real: 0.5913032293319702 Training d_loss: 0.9905 Training g_loss: 1.3331 Training q_loss: 1176.7893 Explore

-------------------------------------------------------------------------------
Episode: 555 Total reward: 14.0 Average reward fake: 0.2126304805278778 Average reward real: 0.8329983949661255 Training d_loss: 0.5456 Training g_loss: 1.6462 Training q_loss: 3020.6274 Explore P: 0.0659
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 556 Total reward: 10.0 Average reward fake: 0.43542036414146423 Average reward real: 0.6664476990699768 Training d_loss: 1.0830 Training g_loss: 2.0257 Training q_loss: 1604.2598 Explore P: 0.0659
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 557 Total reward: 11.0 Average reward fake: 0.3505515158176422 Average reward real: 0.5896192789077759 Training d_loss: 1.2994 Training g_loss: 1.1341 Training q_loss: 4238.7778 Ex

-------------------------------------------------------------------------------
Episode: 579 Total reward: 15.0 Average reward fake: 0.37311241030693054 Average reward real: 0.7430019974708557 Training d_loss: 0.8592 Training g_loss: 0.9852 Training q_loss: 1078926.8750 Explore P: 0.0618
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 580 Total reward: 13.0 Average reward fake: 0.3121339976787567 Average reward real: 0.7437196373939514 Training d_loss: 0.7754 Training g_loss: 1.5761 Training q_loss: 6156.2031 Explore P: 0.0618
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 581 Total reward: 23.0 Average reward fake: 0.33561354875564575 Average reward real: 0.6585756540298462 Training d_loss: 0.9574 Training g_loss: 1.1071 Training q_loss: 3057.576

## Visualizing training

Below I'll plot the total reward for each episode in grey, with a 10-episode rolling average on top in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')
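
To make the smoothing step less of a black box, here is a small sketch of what `running_mean` computes. It reuses the cumulative-sum definition from the cell above and compares it against `np.convolve` with a box filter, which is just an alternative way to write the same average. The input values are made up for illustration and are not rewards from this run.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum trick as in the notebook cell above
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Made-up values, only to show the shapes involved
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = running_mean(x, 2)
print(smoothed)  # [1.5 2.5 3.5 4.5]

# The same window average written as a convolution with a box filter
box = np.convolve(x, np.ones(2) / 2, mode='valid')
print(np.allclose(smoothed, box))  # True
```

The smoothed array has `len(x) - N + 1` entries, which is why the plotting cells index the x-axis with `eps[-len(smoothed_arr):]`: each average is drawn at the last episode of its window.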

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')
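
A reference point can help when reading the D and G loss curves. The sketch below assumes the losses are the standard sigmoid cross-entropy GAN losses; that is an assumption, since the loss definitions are not shown in this part of the notebook. Under that assumption, a discriminator that outputs 0.5 for both real and fake samples gives a D loss of 2·ln 2 and a G loss of ln 2. The function `gan_losses` is a hypothetical helper, not something defined in the notebook.

```python
import numpy as np

def gan_losses(p_real, p_fake):
    """Standard GAN cross-entropy losses for single discriminator outputs.

    p_real: discriminator probability assigned to a real sample
    p_fake: discriminator probability assigned to a generated sample
    (Assumed loss form; the notebook's own definition may differ.)
    """
    d_loss = -np.log(p_real) - np.log(1.0 - p_fake)
    g_loss = -np.log(p_fake)
    return d_loss, g_loss

# Discriminator cannot tell real from fake
d_eq, g_eq = gan_losses(0.5, 0.5)
print(round(d_eq, 4), round(g_eq, 4))  # 1.3863 0.6931
```

Episodes in the log where both average rewards sit near 0.5 report `d_loss` around 1.39 and `g_loss` around 0.69, which is consistent with this reading. The match is only approximate, because the logged averages are means over a batch and the loss of a mean is not the mean of the losses.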

In [None]:
eps, arr = np.array(q_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Q losses')

## Testing

Let's check out how our trained agent plays the game.

In [54]:
test_episodes = 10
test_max_steps = 1000
with tf.Session() as sess:
    # Restore the trained model
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    saver.restore(sess, 'checkpoints/DQAN-cartpole.ckpt')
    
    # Episodes
    for ep in range(test_episodes):
        
        # Start every episode from a fresh initial state
        state = env.reset()
        
        # Number of environment steps taken in this episode
        t = 0
        while t < test_max_steps:
            env.render()
            
            # Get the greedy action from the DQAN
            feed_dict = {model.states: state.reshape((1, *state.shape))}
            actions_logits = sess.run(model.actions_logits, feed_dict)
            action = np.argmax(actions_logits)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            # Stop the episode once the task is done or failed
            if done:
                break
            
            state = next_state
            t += 1

INFO:tensorflow:Restoring parameters from checkpoints/DQAN-cartpole.ckpt


NameError: name 'base' is not defined

In [28]:
env.close()
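
The test loop above only renders the agent. To get a number out of testing, one option is to record the total reward per episode and average it. The sketch below shows that idea in a self-contained form: `StubEnv`, `evaluate`, and `greedy_policy` are hypothetical names introduced for illustration, and the stub stands in for a Gym environment with the 4-tuple `step` API used in this notebook. In the real notebook the policy would call `sess.run(model.actions_logits, ...)` instead of using fixed logits.

```python
import numpy as np

class StubEnv:
    """Hypothetical stand-in for a Gym env (old 4-tuple step API)."""
    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.t += 1
        state = np.full(4, self.t, dtype=np.float32)
        done = self.t >= self.episode_length
        return state, 1.0, done, {}

def evaluate(env, policy_fn, episodes=10, max_steps=1000):
    """Run the policy for several episodes and return the total reward of each."""
    returns = []
    for _ in range(episodes):
        state = env.reset()
        total = 0.0
        for _ in range(max_steps):
            action = policy_fn(state)
            state, reward, done, _ = env.step(action)
            total += reward
            if done:
                break
        returns.append(total)
    return np.array(returns)

def greedy_policy(state):
    # Placeholder logits; the notebook would compute these with the trained model
    fake_logits = np.array([0.1, 0.9])
    return int(np.argmax(fake_logits))

returns = evaluate(StubEnv(episode_length=5), greedy_policy, episodes=3)
print(returns.mean())  # 5.0
```

Keeping the environment loop separate from the policy function makes it easy to swap the stub for `env` and the placeholder for the trained network without touching the bookkeeping.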

## Extending this to Deep Convolutional QAN

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of a small state vector like the one we're using here, though, you'd want to use convolutional layers to extract the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
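
Before the convolutional layers see anything, the raw screen usually has to be turned into a compact input. Here is a minimal NumPy sketch of one way to do that: convert each frame to grayscale, downsample it, and stack the last few frames so the network can see motion. The names `preprocess_frame` and `FrameStack`, the 210x160 frame size, the factor-2 downsampling, and the stack of 4 frames are all assumptions chosen for illustration, not the notebook's method or the exact preprocessing from the paper.

```python
import numpy as np
from collections import deque

def preprocess_frame(frame):
    """Turn an RGB uint8 frame (H, W, 3) into a downsampled grayscale image in [0, 1]."""
    gray = frame.astype(np.float32).mean(axis=2)  # plain channel average
    small = gray[::2, ::2]                        # keep every second pixel
    return small / 255.0

class FrameStack:
    """Keeps the most recent frames and exposes them as one (H, W, num_frames) array."""
    def __init__(self, num_frames=4):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def reset(self, frame):
        # Fill the whole stack with the first frame of the episode
        processed = preprocess_frame(frame)
        for _ in range(self.num_frames):
            self.frames.append(processed)
        return self.state()

    def step(self, frame):
        # The deque drops the oldest frame automatically
        self.frames.append(preprocess_frame(frame))
        return self.state()

    def state(self):
        return np.stack(self.frames, axis=-1)

# Random stand-in for a screen image
rng = np.random.RandomState(0)
fake_frame = rng.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
stack = FrameStack(num_frames=4)
state = stack.reset(fake_frame)
print(state.shape)  # (105, 80, 4)
```

The stacked array would take the place of the Cart-Pole state vector as the network input, with convolutional layers in front of the fully connected ones.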

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper, which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf