## Deep Q-networks

While our ordinary $Q$-network could barely match the $Q$-table in a simple game environment, Deep $Q$-Networks (DQNs) are much more capable. To turn an ordinary $Q$-network into a DQN, we make the following improvements:
+ Going from a single-layer network to a multi-layer convolutional network.
+ Implementing Experience Replay, which will allow our network to train itself using stored memories from its experience.
+ Utilizing a second “target” network, which we will use to compute target $Q$-values during our updates.

<img src="images/deepq1.png" alt="" style="width: 800px;"/>


See https://jaromiru.com/2016/09/27/lets-make-a-dqn-theory/

### Convolutional Layers

Since our agent is going to be learning to play video games, it has to be able to make sense of the game’s screen output in a way that is at least similar to how humans or other intelligent animals are able to. Instead of considering each pixel independently, convolutional layers allow us to consider regions of an image, and maintain spatial relationships between the objects on the screen as we send information up to higher levels of the network.
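As a rough illustration of how a convolution shrinks the screen while keeping its 2-D layout, the sketch below computes the output size and parameter count of an unpadded ("valid") convolution. The helper names `conv_output_size` and `conv_param_count` are made up for this example; the layer sizes are those of the first two convolutional layers of the model built later in this notebook.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial size along one axis after an unpadded ('valid') convolution."""
    return (input_size - kernel_size) // stride + 1


def conv_param_count(kernel_size, in_channels, filters):
    """Weights of a square kernel plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters


# 84x84 screen with 4 stacked frames -> 32 filters of 8x8, stride 4
size_1 = conv_output_size(84, 8, 4)
params_1 = conv_param_count(8, 4, 32)

# -> 64 filters of 4x4, stride 2
size_2 = conv_output_size(size_1, 4, 2)
params_2 = conv_param_count(4, 32, 64)

print(size_1, params_1)  # 20 8224
print(size_2, params_2)  # 9 32832
```

Each output position summarizes a small region of the previous layer rather than a single pixel, which is what lets the network keep track of where objects are on the screen.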

### Experience Replay

The second major addition to make DQNs work is Experience Replay. 

The problem with online learning is that the *samples arrive in the order* in which they are experienced, so consecutive samples are highly correlated. Because of this, our network will most likely overfit to its most recent experience and fail to generalize properly.

The key idea of **experience replay** is that we store these transitions in a memory and, during each learning step, sample a random batch from it and perform a gradient descent step on that batch.

The Experience Replay buffer stores a fixed number of recent memories, and as new ones come in, the oldest ones are removed. When the time comes to train, we simply draw a batch of memories uniformly at random from the buffer and train our network on them.
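A minimal sketch of such a buffer, using only the standard library, might look as follows. The class name `ReplayBuffer` and its methods are hypothetical and chosen for illustration; this is not how the `SequentialMemory` class used later in this notebook is implemented.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item when a new one is appended.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal ordering.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

print(len(buffer))                               # 3
print([transition[0] for transition in buffer.buffer])  # [2, 3, 4]
batch = buffer.sample(batch_size=2)
```

With a capacity of 3, the two oldest transitions (states 0 and 1) have been dropped by the time five have been added.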

### Separate Target Network

This second network is used to generate the $Q$-target values that will be used to compute the loss for every action during training. 
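As a reminder of the usual DQN formulation (the symbols $\theta$ for the primary network's weights and $\theta^-$ for the target network's weights are introduced here for illustration and are not used elsewhere in this notebook), the target for a transition $(s, a, r, s')$ and the resulting temporal-difference error can be written as:

$$
y =
\begin{cases}
r & \text{if } s' \text{ is terminal} \\
r + \gamma \max_{a'} Q(s', a'; \theta^-) & \text{otherwise}
\end{cases}
\qquad
\delta = y - Q(s, a; \theta)
$$

The loss is some function of $\delta$ (for example its square), and only $\theta$ is adjusted by gradient descent; $\gamma$ is the discount factor.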

The issue is that at every step of training, the $Q$-network’s values shift, and if we use a constantly shifting set of values to adjust our network, the value estimates can easily spiral out of control. The network can become destabilized by falling into feedback loops between the target and estimated $Q$-values. To mitigate that risk, the target network’s weights are held fixed and only periodically or slowly updated towards the primary $Q$-network’s values. In this way training can proceed in a more stable manner.

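There are two ways to do this: copy the primary network's weights into the target network periodically and all at once (a "hard" update), or update it frequently but slowly (a "soft" update). In keras-rl the `target_model_update` argument selects between them: a value of 1 or more is the number of steps between hard updates, while a value below 1 is used as the soft-update rate. The agent configured below uses `target_model_update=10000`, that is, a hard update every 10,000 steps.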
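The two update styles can be sketched on plain lists of weights. The function names and the mixing fraction `tau` are hypothetical and serve only to illustrate the idea; real networks would apply the same rule to every weight tensor.

```python
def hard_update(target_weights, online_weights):
    """Copy the online weights into the target network all at once."""
    return list(online_weights)


def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau towards the online weight."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]


online = [1.0, -2.0]
target = [0.0, 0.0]

target = soft_update(target, online, tau=0.5)
print(target)  # [0.5, -1.0]

target = hard_update(target, online)
print(target)  # [1.0, -2.0]
```

In practice a soft update would use a much smaller `tau` than the 0.5 shown here, so that the target network trails the primary network slowly.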

While the DQN we have described above can learn Atari games given enough training, getting the network to perform well on those games takes at least a day of training on a powerful machine.

### Installations

In [None]:
import keras
print(keras.__version__)  # a bare expression in the middle of a cell is not displayed
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())  # CPU/GPU devices visible to TensorFlow
print("Running!\nPlease don't interrupt this cell. It might cause serious issues.")
!pip install gym keras-rl pyglet==1.2.4
# !apt-get install -y cmake zlib1g-dev libjpeg-dev xvfb ffmpeg xorg-dev python-opengl libboost-all-dev libsdl2-dev swig
!pip install 'gym[atari]'
print('Done!')

### Imports

In [1]:
from __future__ import division

from PIL import Image
import numpy as np
import gym

from keras.models import Model
from keras.layers import Dense, Activation, Flatten, Convolution2D, Permute, Input
from keras.optimizers import Adam
import keras.backend as K

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, BoltzmannQPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint

Using TensorFlow backend.


In [2]:
INPUT_SHAPE = (84, 84)
WINDOW_LENGTH = 4
ENV_NAME = 'BreakoutDeterministic-v4'

In [3]:
class AtariProcessor(Processor):
    def process_observation(self, observation):
        assert observation.ndim == 3  # (height, width, channel)
        img = Image.fromarray(observation)
        img = img.resize(INPUT_SHAPE).convert('L')  # resize and convert to grayscale
        processed_observation = np.array(img)
        assert processed_observation.shape == INPUT_SHAPE
        return processed_observation.astype('uint8')  # saves storage in experience memory

    def process_state_batch(self, batch):
        # We could perform this processing step in `process_observation`. In this case, however,
        # we would need to store a `float32` array instead, which is 4x more memory intensive than
        # an `uint8` array. This matters if we store 1M observations.
        processed_batch = batch.astype('float32') / 255.
        return processed_batch

    def process_reward(self, reward):
        return np.clip(reward, -1., 1.)
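The memory argument in the `process_state_batch` comment can be checked with a little arithmetic. This is a back-of-the-envelope estimate that counts raw pixel bytes only and ignores any overhead of the container that holds the observations; the variable names are made up for this example.

```python
frame_bytes_uint8 = 84 * 84 * 1      # one byte per pixel
frame_bytes_float32 = 84 * 84 * 4    # four bytes per pixel
num_observations = 1000000           # size of the experience memory

total_uint8_gb = frame_bytes_uint8 * num_observations / 1e9
total_float32_gb = frame_bytes_float32 * num_observations / 1e9

print(total_uint8_gb)    # 7.056
print(total_float32_gb)  # 28.224
```

Roughly 7 GB is already a lot for one machine, which is why the conversion to `float32` is deferred until a batch is actually fed to the network.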

Get the environment and extract the number of actions.

In [4]:
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

Next, we build our model. We use the same model that was described by Mnih et al. (2015).

In [5]:
input_shape = (WINDOW_LENGTH,) + INPUT_SHAPE

inp = Input(shape=input_shape)
X = Permute((2, 3, 1))(inp)
X = Convolution2D(32, (8, 8), strides=(4, 4))(X)
X = Activation('relu')(X)
X = Convolution2D(64, (4, 4), strides=(2, 2))(X)
X = Activation('relu')(X)
X = Convolution2D(64, (3, 3), strides=(1, 1))(X)  # Mnih et al. (2015) use stride 1 in the third conv layer
X = Activation('relu')(X)
X = Flatten()(X)
X = Dense(512)(X)
X = Activation('relu')(X)
X = Dense(nb_actions)(X)
X = Activation('linear')(X)
model = Model(inputs=inp, outputs=X)
model.summary()

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 4, 84, 84)         0         
_________________________________________________________________
permute_1 (Permute)          (None, 84, 84, 4)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 20, 20, 32)        8224      
_________________________________________________________________
activation_1 (Activation)    (None, 20, 20, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 9, 9, 64)          32832     
_________________________________________________________________
activation_2 (Activation)    (None, 9, 9, 64)          0         
_________________________________________________________________
conv

Finally, we configure and compile our agent. You can use any built-in Keras optimizer and even the built-in metrics! We start with the experience memory and the processor.

In [6]:
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
processor = AtariProcessor()

Select a policy. We use eps-greedy action selection, which means that a random action is selected with probability eps. We anneal eps from 1.0 to 0.1 over the course of 1M steps. This is done so that the agent initially explores the environment (high eps) and then gradually sticks to what it knows (low eps). We also set a dedicated eps value that is used during testing. Note that we set it to 0.05 so that the agent still performs some random actions. This ensures that the agent cannot get stuck.

In [7]:
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05,
                              nb_steps=1000000)
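To make the schedule concrete, here is a small standalone sketch of linear annealing combined with eps-greedy selection. It is not keras-rl's code: the function names are hypothetical, and the default values simply mirror the arguments passed to `LinearAnnealedPolicy` above.

```python
import random


def linear_epsilon(step, value_max=1.0, value_min=0.1, nb_steps=1000000):
    """Linearly decay epsilon from value_max to value_min, then hold it."""
    slope = -(value_max - value_min) / float(nb_steps)
    return max(value_min, slope * step + value_max)


def eps_greedy_action(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


print(linear_epsilon(0))         # 1.0
print(linear_epsilon(2000000))   # 0.1
print(eps_greedy_action([0.1, 0.9, 0.3], eps=0.0))  # 1
```

Halfway through the annealing period (step 500,000) this schedule gives an eps of about 0.55, so the agent still explores on more than half of its steps.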

The trade-off between exploration and exploitation is difficult and an ongoing research topic. If you want, you can experiment with the parameters or use a different policy. Another popular one is Boltzmann-style exploration: `policy = BoltzmannQPolicy(tau=1.)`. Feel free to give it a try!
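For intuition, Boltzmann-style exploration samples actions in proportion to a softmax over the $Q$-values, with the temperature `tau` controlling how random the choice is. The sketch below uses only the standard library; the function names are hypothetical, and it is an illustration of the idea rather than keras-rl's implementation.

```python
import math
import random


def boltzmann_probabilities(q_values, tau=1.0):
    """Softmax over Q-values; tau is the temperature."""
    # Subtracting the maximum keeps exp() from overflowing.
    q_max = max(q_values)
    exp_values = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(exp_values)
    return [e / total for e in exp_values]


def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = boltzmann_probabilities(q_values, tau)
    threshold = rng.random()
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if threshold < cumulative:
            return action
    return len(probs) - 1  # guard against floating-point round-off


probs = boltzmann_probabilities([1.0, 2.0, 3.0])
action = boltzmann_action([1.0, 2.0, 3.0])
```

A high `tau` flattens the probabilities towards uniform random play, while a low `tau` concentrates them on the action with the highest $Q$-value.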

In [8]:
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=50000, gamma=.99, target_model_update=10000,
               train_interval=4, delta_clip=1.)
dqn.compile(Adam(lr=.00025), metrics=['mae'])

Okay, now it's time to learn something! We capture the interrupt exception so that training
can be prematurely aborted. Notice that now you can use the built-in Keras callbacks!

In [9]:
weights_filename = 'dqn_{}_weights.h5f'.format(ENV_NAME)
checkpoint_weights_filename = 'dqn_' + ENV_NAME + '_weights_{step}.h5f'
log_filename = 'dqn_{}_log.json'.format(ENV_NAME)
callbacks = [ModelIntervalCheckpoint(checkpoint_weights_filename, interval=250000)]
callbacks += [FileLogger(log_filename, interval=100)]
# dqn.fit(env, callbacks=callbacks, nb_steps=1750000, log_interval=10000)
dqn.fit(env, callbacks=callbacks, nb_steps=1750000, log_interval=5000)

Training for 10000 steps ...
Interval 1 (0 steps performed)
28 episodes - episode_reward: 1.179 [0.000, 4.000] - ale.lives: 3.050

Interval 2 (5000 steps performed)
done, took 39.978 seconds


<keras.callbacks.History at 0x7f2d09a69c10>

After training is done, we save the final weights one more time.

In [10]:
dqn.save_weights(weights_filename, overwrite=True)

## Let's test it!

In [None]:
weights_filename = 'dqn_{}_weights.h5f'.format(ENV_NAME)
dqn.load_weights(weights_filename)

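### Renderer

The `Render` callback below draws each frame inline, giving a live animation in Jupyter while the test commands run; to use it, pass it to `dqn.test` via `callbacks=[Render()]`. If you are running on your local machine, you can skip this cell and run the full local environment with `visualize=True` instead (it looks nicer :) ).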

In [None]:
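import matplotlib.pyplot as plt
from IPython import display
from rl.callbacks import Callback

class Render(Callback):
    def on_step_end(self, step, logs={}):
        # Redraw the current frame in place of the previous one
        plt.clf()
        plt.imshow(env.render(mode='rgb_array'))
        display.display(plt.gcf())
        display.clear_output(wait=True)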

Finally, evaluate our algorithm for 10 episodes.

In [None]:
dqn.test(env, nb_episodes=10, visualize=False)

Testing for 10 episodes ...
Episode 1: reward: 0.000, steps: 100000
Episode 2: reward: 0.000, steps: 100000
Episode 3: reward: 0.000, steps: 100000
Episode 4: reward: 0.000, steps: 100000
Episode 5: reward: 0.000, steps: 100000
Episode 6: reward: 0.000, steps: 100000
Episode 7: reward: 0.000, steps: 100000
Episode 8: reward: 0.000, steps: 100000
Episode 9: reward: 0.000, steps: 100000
Episode 10: reward: 0.000, steps: 100000


<keras.callbacks.History at 0x7f2d09a69450>

And for those running locally

In [None]:
# dqn.test(env, nb_episodes=10, visualize=True)

And test it!

In [None]:
weights_filename = 'dqn_{}_weights.h5f'.format(ENV_NAME)
dqn.load_weights(weights_filename)
dqn.test(env, nb_episodes=10, visualize=True)

Testing for 10 episodes ...
Episode 1: reward: 0.000, steps: 100000
