
# Introduction to Deep Q-Learning

Based on the Analytics Vidhya tutorial: https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

## Install dependencies for the CartPole environment

In [0]:
!pip install --upgrade pip
!pip install h5py
!pip install keras-rl
!pip install gym
!pip install tensorflow==1.14

Requirement already up-to-date: pip in /usr/local/lib/python3.6/dist-packages (20.0.2)
Collecting tensorflow==1.14
  Downloading tensorflow-1.14.0-cp36-cp36m-manylinux1_x86_64.whl (109.2 MB)
Collecting tensorflow-estimator<1.15.0rc0,>=1.14.0rc0
  Downloading tensorflow_estimator-1.14.0-py2.py3-none-any.whl (488 kB)
Collecting tensorboard<1.15.0,>=1.14.0
  Downloading tensorboard-1.14.0-py3-none-any.whl (3.1 MB)
Installing collected packages: tensorflow-estimator, tensorboard, tensorflow
  Attempting uninstall: tensorflow-estimator
    Found existing installation: tensorflow-estimator 2.1.0
    Uninstalling tensorflow-estimator-2.1.0:
      Successfully uninstalled tensorflow-estimator-2.1.0
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.1.1
    Uninstalling tensorboard-

## Import modules

In [0]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


Next, create the environment and set the relevant variables:

In [0]:
ENV_NAME = 'CartPole-v0'

# Create the CartPole environment and seed it (and NumPy) for reproducibility
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
# Number of discrete actions available in the CartPole problem
nb_actions = env.action_space.n



Next, we build a very simple neural network with a single hidden layer. It maps an observation to one Q-value per action:

In [0]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
_________________________________________________________________
activation_2 (Activation)    (None, 2)                 0         
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
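The parameter counts in the summary can be checked by hand. The sketch below assumes only the standard dense-layer formula (one weight per input-unit pair plus one bias per unit); the helper name `dense_params` is hypothetical and not part of Keras.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


obs_dim = 4      # Flatten output shape (None, 4): the CartPole observation size
hidden = 16      # units in the hidden Dense layer
nb_actions = 2   # output units, one Q-value per action

dense_1 = dense_params(obs_dim, hidden)      # 4 * 16 + 16
dense_2 = dense_params(hidden, nb_actions)   # 16 * 2 + 2
total = dense_1 + dense_2

print(dense_1, dense_2, total)  # 80 34 114
```

The Flatten and Activation layers have no weights, which is why they report 0 parameters.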


Now configure and compile the agent. We use an epsilon-greedy policy for action selection and `SequentialMemory` as the replay buffer, which stores the observations, actions, and rewards the agent encounters so that training can sample from past experience.
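To make these two ingredients concrete before wiring them into keras-rl, here is a minimal plain-Python sketch of the ideas. It is not keras-rl's implementation: `eps_greedy_action` and `ReplayMemory` are hypothetical names, the `eps=0.1` value is only illustrative, and the real `SequentialMemory` stores its data differently.

```python
import random
from collections import deque


def eps_greedy_action(q_values, eps, rng):
    """With probability eps pick a uniformly random action, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayMemory:
    """Bounded FIFO store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, limit):
        # A deque with maxlen drops the oldest entry once the limit is reached
        self.buffer = deque(maxlen=limit)

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        # Uniform sampling without replacement from the stored transitions
        return rng.sample(list(self.buffer), batch_size)


rng = random.Random(123)
memory = ReplayMemory(limit=3)
for step in range(5):
    action = eps_greedy_action([0.2, 0.7], eps=0.1, rng=rng)
    memory.append((step, action, 1.0, step + 1, False))

print(len(memory.buffer))  # 3: only the newest transitions are kept
print(eps_greedy_action([0.2, 0.7], eps=0.0, rng=rng))  # 1: eps=0 is always greedy
```

The cell that follows uses the library's own `EpsGreedyQPolicy` and `SequentialMemory`, which play the same roles.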

In [0]:
policy = EpsGreedyQPolicy()
# Replay buffer holding up to 50,000 steps; window_length=1 uses one observation per state
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, policy=policy,
               nb_steps_warmup=10, target_model_update=1e-2)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

# Train the agent. Rendering is disabled (visualize=False) because it slows training considerably.
dqn.fit(env, nb_steps=5000, visualize=False, verbose=2)

Training for 5000 steps ...




   48/5000: episode: 1, duration: 0.589s, episode steps: 48, steps per second: 81, episode reward: 48.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.542 [0.000, 1.000], mean observation: 0.168 [-0.239, 0.766], loss: 0.450238, mean_absolute_error: 0.522733, mean_q: -0.001673
  104/5000: episode: 2, duration: 0.160s, episode steps: 56, steps per second: 349, episode reward: 56.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.500 [0.000, 1.000], mean observation: 0.102 [-0.150, 1.128], loss: 0.369756, mean_absolute_error: 0.456440, mean_q: 0.127125
  142/5000: episode: 3, duration: 0.111s, episode steps: 38, steps per second: 342, episode reward: 38.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.526 [0.000, 1.000], mean observation: 0.093 [-0.226, 0.714], loss: 0.296629, mean_absolute_error: 0.452627, mean_q: 0.308771
  181/5000: episode: 4, duration: 0.117s, episode steps: 39, steps per second: 332, episode reward: 39.000, mean reward: 1.000 [1.000, 1.000], mean act

<keras.callbacks.History at 0x7fa5f6a31908>
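For intuition about what each training step optimizes, the sketch below shows the one-step Q-learning target and a soft target-network update in plain Python. It is an illustration under stated assumptions, not keras-rl's code: the function names are hypothetical, `gamma=0.99` is an assumed discount factor, and `tau=1e-2` mirrors the `target_model_update=1e-2` setting on the understanding that keras-rl treats values below 1 as a soft-update rate.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max over a' of Q_target(s', a')."""
    if done:
        return reward  # no bootstrapping past a terminal state
    return reward + gamma * max(next_q_values)


def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau toward the matching online weight."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]


# Non-terminal step: reward plus the discounted best next Q-value (about 2.98)
print(td_target(1.0, [0.5, 2.0], done=False))
# Terminal step: the target is just the reward
print(td_target(1.0, [0.5, 2.0], done=True))
# The target weights drift slowly toward the online weights (roughly [0.01, 1.0])
print(soft_update([0.0, 1.0], [1.0, 1.0]))
```

The `loss` column in the training log is the error between the network's predicted Q-values and targets of this kind, while `mean_q` tracks the average predicted Q-value.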

Finally, test the trained agent over five episodes:

In [0]:
dqn.test(env, nb_episodes=5, visualize=False)

Testing for 5 episodes ...
Episode 1: reward: 67.000, steps: 67
Episode 2: reward: 82.000, steps: 82
Episode 3: reward: 57.000, steps: 57
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 130.000, steps: 130


<keras.callbacks.History at 0x7fa61e275c88>
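A quick way to read these results is to summarize the five episode rewards. The sketch below copies the numbers from the test output above; the 200-step cap is the episode limit of `CartPole-v0`, so a reward of 200 means the pole stayed up for the whole episode.

```python
from statistics import mean

test_rewards = [67.0, 82.0, 57.0, 200.0, 130.0]  # copied from the test output
episode_cap = 200  # CartPole-v0 ends an episode after 200 steps

average = mean(test_rewards)
capped = sum(r >= episode_cap for r in test_rewards)

print(average)  # 107.2
print(capped)   # 1 episode reached the cap
```

Five episodes are a small sample, and the spread from 57 to 200 suggests the policy is still unstable after 5000 training steps. Longer training would likely be needed before drawing conclusions.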