### Deep Q-learning using keras-rl

In this lab we will change gears just a bit and start using some of the deep Q-learning software that is generally available. While it is possible to write your own code, it takes quite a bit of time to do so: one needs to implement both Q-learning and the neural network code. (FWIW, I had thought to do this with scikit-learn, but it is still a lot of code to write.)
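
To give a feel for the Q-learning half of that work, here is a minimal sketch of the tabular Q-learning update in plain Python. It is not taken from the article or from keras-rl; the function name `q_update`, the table layout, and the values of `alpha` and `gamma` are assumptions chosen for illustration. Deep Q-learning replaces the table with a neural network, which is where most of the remaining code would go.

```python
def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error.

    Q is a list of rows, one per state, each holding one value per action.
    alpha (learning rate) and gamma (discount) are illustrative defaults.
    """
    # Bootstrapped target: the reward plus the discounted best next value,
    # or just the reward when the episode has ended.
    target = r if done else r + gamma * max(Q[s_next])
    td_error = target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Toy table with 3 states and 2 actions, all values starting at zero.
Q = [[0.0, 0.0] for _ in range(3)]
td = q_update(Q, s=0, a=1, r=1.0, s_next=1, done=False)
print(Q[0][1], td)
```

After this single update only the entry for state 0, action 1 has moved, by `alpha` times the TD error.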

For this lab we will follow the example at [A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python](https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/).

I encourage you to read through the whole article. However, for the computer lab experiment, go to the section __Implementing Deep Q-Learning in Python using Keras & OpenAI Gym__.

You should already have installed the OpenAI Gym environment using:
```text
conda install -c conda-forge gym
```
You can use __conda__ to install Keras similarly:
```text
conda install -c conda-forge keras
```

Keras does not contain the reinforcement learning code. It must be installed separately. The blog refers to __keras-rl__. However, it has subsequently been updated to __keras-rl2__. It cannot be installed using __conda__. Instead, you can either download it from the GitHub repository or, more simply, install it using __pip__.
```text
pip install keras-rl2
```
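
Before moving on, it can help to confirm that the packages are visible to the Python interpreter you plan to use. The following is a small sketch using only the standard library; the helper name `missing_modules` is made up for this example. Note that keras-rl2 is imported under the module name `rl`, as in the lab code.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules imported by the lab code; keras-rl2 installs as `rl`.
required = ["numpy", "gym", "keras", "tensorflow", "rl"]
print("missing:", missing_modules(required) or "none")
```

If anything is listed as missing, check that the terminal is using the same conda environment in which you ran the install commands.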

Finally, remember that the environment visualizations cannot (easily) be run from the __jupyter__ environment. Open a terminal prompt and run python on __cartpole6.py__. You should see a training phase for the deep learning neural network(s) for the cartpole problem, followed by a test of the trained neural network.

There are many things one might investigate for this lab. For instance, one might look into the Keras documentation for creating other neural network designs. See, for instance, [Keras Dense layer](https://keras.io/api/layers/core_layers/dense/). Also, you will observe that the deep learning neural network does not perform all that well. What might be done to improve its performance?
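
One direction to explore, offered only as a suggestion, is the exploration schedule. The lab code uses `EpsGreedyQPolicy` with a fixed exploration rate. A common alternative is to start with a high rate and reduce it as training proceeds. The sketch below shows the idea in plain Python; the function names and the values of `eps_start`, `eps_end`, and `anneal_steps` are assumptions for illustration and are not tuned for cartpole. keras-rl provides a `LinearAnnealedPolicy` wrapper for this purpose, which is not used here.

```python
import random


def linear_epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=1000):
    """Linearly decay epsilon from eps_start to eps_end over anneal_steps steps."""
    frac = min(max(step, 0), anneal_steps) / anneal_steps
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, otherwise the greedy action."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Epsilon falls during the first 1000 steps and then stays at eps_end.
for step in (0, 500, 1000, 5000):
    print(step, round(linear_epsilon(step), 3))
```

With `eps=0.0` the action choice is purely greedy, which is roughly what one wants when testing a trained agent.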

You do not need to answer my questions. Feel free to perform your own investigations!

--Doug

```python
#
# https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
#

import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
# keras-rl2 builds on tensorflow.keras, so Adam is imported from there
# rather than from keras.optimizers.
from tensorflow.keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions available in the Cartpole problem
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

# A small network: one hidden layer of 16 units and one linear output per action
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

# Epsilon-greedy exploration with a replay memory of up to 50000 transitions
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
# `lr` is deprecated in recent optimizers; use `learning_rate` instead
dqn.compile(Adam(learning_rate=1e-3), metrics=['mae'])

# Okay, now it's time to learn something! Visualizing the training slows it
# down quite a lot, so it is turned off here.
dqn.fit(env, nb_steps=5000, visualize=False, verbose=2)

# Run the trained agent for 5 episodes with the visualization turned on
dqn.test(env, nb_episodes=5, visualize=True)
```



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten (Flatten)           (None, 4)                 0         
                                                                 
 dense (Dense)               (None, 16)                80        
                                                                 
 activation (Activation)     (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 2)                 34        
                                                                 
 activation_1 (Activation)   (None, 2)                 0         
                                                                 
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
None


Training for 5000 steps ...
   10/5000: episode: 1, duration: 0.051s, episode steps:  10, steps per second: 196, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: --, mae: --, mean_q: --


   22/5000: episode: 2, duration: 0.435s, episode steps:  12, steps per second:  28, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.250 [0.000, 1.000],  loss: 0.468772, mae: 0.643310, mean_q: -0.288348
   32/5000: episode: 3, duration: 0.065s, episode steps:  10, steps per second: 154, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.100 [0.000, 1.000],  loss: 0.402383, mae: 0.580940, mean_q: -0.195034
   42/5000: episode: 4, duration: 0.065s, episode steps:  10, steps per second: 153, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.200 [0.000, 1.000],  loss: 0.346710, mae: 0.525158, mean_q: -0.098503




   54/5000: episode: 5, duration: 0.083s, episode steps:  12, steps per second: 145, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.083 [0.000, 1.000],  loss: 0.292350, mae: 0.452721, mean_q: 0.014419
   63/5000: episode: 6, duration: 0.069s, episode steps:   9, steps per second: 131, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 0.227877, mae: 0.391144, mean_q: 0.107997
   73/5000: episode: 7, duration: 0.066s, episode steps:  10, steps per second: 152, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 0.201441, mae: 0.352446, mean_q: 0.199884
   82/5000: episode: 8, duration: 0.060s, episode steps:   9, steps per second: 149, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 0.165367, mae: 0.309966, mean_q: 0.297860
   92/5000: episode: 9, duration: 0.069s, episode steps:  10, steps per seco

<keras.callbacks.History at 0x1b73805e850>
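
If you want to track how the episode reward changes during training, one option is to save the training output to a text file and pull the numbers out of it. The sketch below assumes the per-episode line format shown in the output above; the regular expression and the helper name `episode_rewards` are written for this example and are not part of keras-rl.

```python
import re

# Matches per-episode lines of the form shown in the training output above.
EPISODE_RE = re.compile(r"episode:\s*(\d+),.*?episode reward:\s*([-\d.]+)")


def episode_rewards(log_text):
    """Return a list of (episode, reward) pairs found in the log text."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in EPISODE_RE.finditer(log_text)]


# Two lines copied from the output above, shortened after the reward fields.
sample = """\
   22/5000: episode: 2, duration: 0.435s, episode steps:  12, steps per second:  28, episode reward: 12.000, mean reward:  1.000
   32/5000: episode: 3, duration: 0.065s, episode steps:  10, steps per second: 154, episode reward: 10.000, mean reward:  1.000
"""

rewards = episode_rewards(sample)
mean_reward = sum(r for _, r in rewards) / len(rewards)
print(rewards, mean_reward)
```

Lines that are cut off before the reward field, or that are not episode lines at all, are simply skipped.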