# Duelling-DQN: MountainCar-v0 | EPOCH Lab

A car sits on a one-dimensional track between two hills. The goal is to reach the flag on top of the right-hand hill, but the car's engine is too weak to climb it in a single pass, so the car has to drive back and forth to build up momentum. The observation has two components (position and velocity), and the agent chooses between three discrete actions (push left, no push, push right). A reward of -1 is given for every timestep until the goal is reached, and an episode is cut off after 200 steps, so the worst possible episode reward is -200.

- Python: 3.6.12
- Keras-GPU: 2.3.1
- KerasRL2: 1.0.4

In [1]:
import gym

import numpy as np

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

### Build OpenAI Gym Environment

Get the environment and extract the number of states and actions.

In [2]:
ENV_NAME = 'MountainCar-v0'

In [3]:
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)



[123]

In [4]:
states = env.observation_space.shape
actions = env.action_space.n

print('States:', states[0])
print('Actions:', actions)

States: 2
Actions: 3
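The two observation values and three actions printed above match MountainCar-v0, the environment named in `ENV_NAME`: the observation is (position, velocity) and the actions are push left, no push, and push right. As a rough illustration of why this task needs momentum, here is a minimal pure-Python sketch of the car's dynamics. The constants and update rule are written from memory of Gym's classic-control implementation and should be treated as assumptions; the fixed start position of -0.5 and the two hand-written policies are illustrative only (Gym samples the start position at random).

```python
import math

# Assumed constants, following Gym's classic-control MountainCar as recalled.
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL_POS = -1.2, 0.6, 0.07, 0.5


def step(position, velocity, action):
    """Advance the car by one timestep. action: 0 = left, 1 = none, 2 = right."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity, position >= GOAL_POS


def rollout(policy, start=-0.5, max_steps=200):
    """Run one episode and return the total reward (-1 per timestep)."""
    position, velocity, total = start, 0.0, 0.0
    for _ in range(max_steps):
        action = policy(position, velocity)
        position, velocity, done = step(position, velocity, action)
        total -= 1.0
        if done:
            break
    return total


def always_right(position, velocity):
    """Hypothetical baseline: push right on every step."""
    return 2


def follow_velocity(position, velocity):
    """Hypothetical heuristic: push in the direction the car is already moving."""
    return 2 if velocity >= 0 else 0


print("always right:   ", rollout(always_right))
print("follow velocity:", rollout(follow_velocity))
```

Under these assumptions, pushing right on every step never clears the hill and uses all 200 steps, whereas the momentum heuristic should reach the flag before the cut-off. This is the behaviour the DQN agent has to discover from a reward signal that is -1 everywhere.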


### Create Deep Learning Model

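Build a very simple model without worrying about the dueling architecture. If you enable the dueling network option in `DQNAgent`, the agent builds a dueling network on top of your model automatically. Alternatively, you can build a dueling network yourself and leave the option turned off. Note that the agent cell below trains a plain DQN by default; the dueling variant is provided as a commented-out line.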
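To make the dueling idea concrete: the network's head is split into a state value V(s) and per-action advantages A(s, a), which are recombined into Q-values. The NumPy sketch below shows the mean-subtracted combination that corresponds to `dueling_type='avg'`. It illustrates the arithmetic only and is not the library's own code; the function name `dueling_avg` and the sample numbers are made up for this example.

```python
import numpy as np


def dueling_avg(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a)."""
    value = np.asarray(value, dtype=float).reshape(-1, 1)  # shape (batch, 1)
    advantages = np.asarray(advantages, dtype=float)       # shape (batch, actions)
    return value + advantages - advantages.mean(axis=1, keepdims=True)


# One state with V(s) = -40 and three action advantages.
q = dueling_avg([-40.0], [[0.5, -0.5, 1.5]])
print(q)  # [[-40. -41. -39.]]
```

Subtracting the mean advantage keeps the decomposition identifiable: the Q-values of each state average out to V(s), so the value stream and the advantage stream cannot drift against each other.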

In [5]:
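def build_model(states, actions):
    model = Sequential()
    # keras-rl feeds observations with a leading window dimension
    # (window_length=1 in the memory), so flatten (1, 2) down to (2,).
    model.add(Flatten(input_shape = (1, ) + states))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(32, activation='relu'))
    # One linear output per action: these are the Q-value estimates.
    model.add(Dense(actions, activation = 'linear'))
    
    return model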

In [6]:
model = build_model(states, actions)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 2)                 0         
_________________________________________________________________
dense (Dense)                (None, 128)               384       
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 99        
Total params: 10,819
Trainable params: 10,819
Non-trainable params: 0
_________________________________________________________________
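The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A short sketch of that arithmetic (the helper name `dense_params` is made up for this example):

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Flattened input (2) followed by the four Dense layers from build_model.
layer_sizes = [2, 128, 64, 32, 3]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [384, 8256, 2080, 99]
print(sum(counts))  # 10819
```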


Configure and compile the agent. You can use any built-in tensorflow.keras optimizer, and any of its metrics.
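One setting in the agent cell that follows deserves a note: `target_model_update=1e-2`. In keras-rl, a value below 1 is interpreted as a soft-update rate, so the target network moves a small fraction of the way towards the online network after each training step. The sketch below shows that blending rule on plain NumPy arrays. It is an illustration under that assumption and not the library's implementation; `soft_update` is a made-up name.

```python
import numpy as np


def soft_update(target_weights, online_weights, tau=1e-2):
    """Blend each target tensor towards its online counterpart by a factor tau."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_weights, online_weights)]


# Toy weights: the target starts at 0 and the online network sits at 1.
target = [np.zeros(3)]
online = [np.ones(3)]
for _ in range(100):
    target = soft_update(target, online)

print(target[0])  # each entry equals 1 - 0.99**100, about 0.634
```

With a rate of 0.01, the target network needs on the order of a hundred updates to close most of the gap, which keeps the bootstrapped Q-targets slowly moving.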

In [7]:
memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()

dqn = DQNAgent(model=model, nb_actions=actions, memory=memory, nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
#duel_dqn = DQNAgent(model=model, nb_actions=actions, memory=memory, nb_steps_warmup=10, enable_dueling_network=True, dueling_type='avg', target_model_update=1e-2, policy=policy)

dqn.compile(Adam(lr=1e-3), metrics=['mae'])
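The `BoltzmannQPolicy` used above explores by sampling actions from a softmax over the Q-values rather than always taking the best one. Here is a minimal NumPy sketch of that selection rule. The temperature `tau=1.0` and the clipping range are assumed to mirror the library's defaults, and the function names and sample Q-values are made up for illustration.

```python
import numpy as np


def boltzmann_probs(q_values, tau=1.0, clip=(-500.0, 500.0)):
    """Softmax over Q-values with temperature tau (assumed defaults)."""
    q = np.asarray(q_values, dtype=np.float64)
    exp_values = np.exp(np.clip(q / tau, clip[0], clip[1]))
    return exp_values / exp_values.sum()


def sample_action(q_values, rng, tau=1.0):
    """Draw one action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, tau)
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(123)
q_values = [-40.0, -40.5, -39.5]  # made-up Q-values for the three actions
probs = boltzmann_probs(q_values)
print(probs, sample_action(q_values, rng))
```

Because only differences between Q-values matter in the softmax, actions whose estimates are close together are chosen almost uniformly, while a clearly better action dominates. A lower temperature would make the policy greedier.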

### Training Loop

In [8]:
dqn.fit(env, nb_steps=150000, visualize=False, verbose=2)

Training for 150000 steps ...




    200/150000: episode: 1, duration: 4.311s, episode steps: 200, steps per second:  46, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.130 [0.000, 2.000],  loss: 0.061807, mae: 0.814149, mean_q: -1.055888
    400/150000: episode: 2, duration: 3.830s, episode steps: 200, steps per second:  52, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.060 [0.000, 2.000],  loss: 0.006198, mae: 1.768291, mean_q: -2.607878
    600/150000: episode: 3, duration: 3.803s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.080 [0.000, 2.000],  loss: 0.026289, mae: 2.863523, mean_q: -4.214609
    800/150000: episode: 4, duration: 3.896s, episode steps: 200, steps per second:  51, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.995 [0.000, 2.000],  loss: 0.050789, mae: 3.993084, mean_q: -5.883262
   1000/150000: episode: 5, duration: 3.807s, ep

   7200/150000: episode: 36, duration: 3.760s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.150 [0.000, 2.000],  loss: 3.302649, mae: 26.229649, mean_q: -38.808105
   7400/150000: episode: 37, duration: 3.671s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.110 [0.000, 2.000],  loss: 2.956753, mae: 26.641050, mean_q: -39.466808
   7600/150000: episode: 38, duration: 3.522s, episode steps: 200, steps per second:  57, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.975 [0.000, 2.000],  loss: 4.120373, mae: 27.049522, mean_q: -39.955265
   7800/150000: episode: 39, duration: 3.638s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.165 [0.000, 2.000],  loss: 3.097409, mae: 27.403852, mean_q: -40.536224
   8000/150000: episode: 40, duratio

  14000/150000: episode: 70, duration: 3.767s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.985 [0.000, 2.000],  loss: 7.310924, mae: 35.670403, mean_q: -52.835567
  14200/150000: episode: 71, duration: 3.767s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.085 [0.000, 2.000],  loss: 8.669079, mae: 35.754822, mean_q: -52.918766
  14400/150000: episode: 72, duration: 3.841s, episode steps: 200, steps per second:  52, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.935 [0.000, 2.000],  loss: 4.750194, mae: 35.933838, mean_q: -53.359482
  14600/150000: episode: 73, duration: 3.739s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.005 [0.000, 2.000],  loss: 7.724871, mae: 36.065506, mean_q: -53.361401
  14800/150000: episode: 74, duratio

  20800/150000: episode: 104, duration: 3.832s, episode steps: 200, steps per second:  52, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.135 [0.000, 2.000],  loss: 9.429422, mae: 38.938541, mean_q: -57.562714
  21000/150000: episode: 105, duration: 3.696s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.020 [0.000, 2.000],  loss: 9.870944, mae: 38.836235, mean_q: -57.392773
  21200/150000: episode: 106, duration: 3.692s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.110 [0.000, 2.000],  loss: 6.743615, mae: 38.917789, mean_q: -57.728184
  21400/150000: episode: 107, duration: 3.670s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.870 [0.000, 2.000],  loss: 8.313572, mae: 39.051708, mean_q: -57.781178
  21600/150000: episode: 108, du

  27600/150000: episode: 138, duration: 3.809s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.870 [0.000, 2.000],  loss: 8.827886, mae: 40.112144, mean_q: -59.377430
  27800/150000: episode: 139, duration: 3.744s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.875 [0.000, 2.000],  loss: 9.151934, mae: 40.132145, mean_q: -59.428417
  28000/150000: episode: 140, duration: 3.575s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.010 [0.000, 2.000],  loss: 10.119085, mae: 40.153446, mean_q: -59.406406
  28200/150000: episode: 141, duration: 3.535s, episode steps: 200, steps per second:  57, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.030 [0.000, 2.000],  loss: 8.520949, mae: 40.084965, mean_q: -59.339218
  28400/150000: episode: 142, d

  34400/150000: episode: 172, duration: 3.561s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.995 [0.000, 2.000],  loss: 6.667616, mae: 41.075577, mean_q: -60.929478
  34600/150000: episode: 173, duration: 3.552s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.970 [0.000, 2.000],  loss: 8.393384, mae: 41.140156, mean_q: -61.026642
  34800/150000: episode: 174, duration: 3.685s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.075 [0.000, 2.000],  loss: 9.657503, mae: 41.132442, mean_q: -60.939453
  35000/150000: episode: 175, duration: 3.588s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.035 [0.000, 2.000],  loss: 9.528314, mae: 41.016964, mean_q: -60.756104
  35200/150000: episode: 176, du

  41200/150000: episode: 206, duration: 3.677s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.915 [0.000, 2.000],  loss: 9.122383, mae: 41.542133, mean_q: -61.542789
  41400/150000: episode: 207, duration: 3.618s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.110 [0.000, 2.000],  loss: 9.666497, mae: 41.508526, mean_q: -61.490704
  41600/150000: episode: 208, duration: 3.687s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.905 [0.000, 2.000],  loss: 10.179127, mae: 41.549442, mean_q: -61.506836
  41800/150000: episode: 209, duration: 3.629s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.040 [0.000, 2.000],  loss: 10.018546, mae: 41.422752, mean_q: -61.327255
  42000/150000: episode: 210, 

  48000/150000: episode: 240, duration: 3.487s, episode steps: 200, steps per second:  57, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.820 [0.000, 2.000],  loss: 5.417083, mae: 40.687531, mean_q: -60.383202
  48200/150000: episode: 241, duration: 3.670s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.040 [0.000, 2.000],  loss: 9.105212, mae: 40.726429, mean_q: -60.407761
  48400/150000: episode: 242, duration: 3.549s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.925 [0.000, 2.000],  loss: 6.980595, mae: 40.781216, mean_q: -60.445057
  48600/150000: episode: 243, duration: 3.801s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.885 [0.000, 2.000],  loss: 7.146545, mae: 40.885082, mean_q: -60.638683
  48800/150000: episode: 244, du

  54800/150000: episode: 274, duration: 3.644s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.040 [0.000, 2.000],  loss: 7.075037, mae: 40.157242, mean_q: -59.559494
  55000/150000: episode: 275, duration: 3.688s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.840 [0.000, 2.000],  loss: 9.401329, mae: 40.087219, mean_q: -59.370331
  55200/150000: episode: 276, duration: 3.696s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.025 [0.000, 2.000],  loss: 5.968007, mae: 40.141178, mean_q: -59.628292
  55400/150000: episode: 277, duration: 3.688s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.945 [0.000, 2.000],  loss: 10.022139, mae: 40.148163, mean_q: -59.364120
  55600/150000: episode: 278, d

  61600/150000: episode: 308, duration: 3.600s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.000 [0.000, 2.000],  loss: 9.895343, mae: 38.492584, mean_q: -56.946503
  61800/150000: episode: 309, duration: 3.610s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.905 [0.000, 2.000],  loss: 9.472767, mae: 38.443680, mean_q: -56.967804
  62000/150000: episode: 310, duration: 3.459s, episode steps: 200, steps per second:  58, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.950 [0.000, 2.000],  loss: 6.089652, mae: 38.461544, mean_q: -56.977783
  62200/150000: episode: 311, duration: 3.629s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.860 [0.000, 2.000],  loss: 9.792767, mae: 38.539135, mean_q: -57.039532
  62400/150000: episode: 312, du

  68400/150000: episode: 342, duration: 3.586s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.745 [0.000, 2.000],  loss: 6.219409, mae: 35.226974, mean_q: -52.038662
  68600/150000: episode: 343, duration: 3.695s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.825 [0.000, 2.000],  loss: 6.670076, mae: 35.244228, mean_q: -52.050518
  68800/150000: episode: 344, duration: 3.744s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.740 [0.000, 2.000],  loss: 6.055716, mae: 35.322289, mean_q: -52.171432
  69000/150000: episode: 345, duration: 3.547s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.760 [0.000, 2.000],  loss: 5.208149, mae: 35.326851, mean_q: -52.177399
  69200/150000: episode: 346, du

  75200/150000: episode: 376, duration: 3.459s, episode steps: 200, steps per second:  58, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.070 [0.000, 2.000],  loss: 6.276225, mae: 34.084518, mean_q: -50.393505
  75400/150000: episode: 377, duration: 3.700s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.150 [0.000, 2.000],  loss: 5.842795, mae: 33.992886, mean_q: -50.369301
  75600/150000: episode: 378, duration: 3.673s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.105 [0.000, 2.000],  loss: 5.545202, mae: 34.100479, mean_q: -50.459663
  75800/150000: episode: 379, duration: 3.556s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.255 [0.000, 2.000],  loss: 5.715410, mae: 34.078579, mean_q: -50.394554
  76000/150000: episode: 380, du

  82000/150000: episode: 410, duration: 3.607s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.800 [0.000, 2.000],  loss: 6.241406, mae: 33.334408, mean_q: -49.200577
  82200/150000: episode: 411, duration: 3.675s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.830 [0.000, 2.000],  loss: 4.373890, mae: 33.469006, mean_q: -49.530399
  82400/150000: episode: 412, duration: 3.525s, episode steps: 200, steps per second:  57, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.920 [0.000, 2.000],  loss: 4.200066, mae: 33.496830, mean_q: -49.534924
  82600/150000: episode: 413, duration: 3.578s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.925 [0.000, 2.000],  loss: 3.851209, mae: 33.540787, mean_q: -49.631336
  82800/150000: episode: 414, du

  88800/150000: episode: 444, duration: 3.685s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.935 [0.000, 2.000],  loss: 6.990866, mae: 35.978809, mean_q: -53.167675
  89000/150000: episode: 445, duration: 3.512s, episode steps: 200, steps per second:  57, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.975 [0.000, 2.000],  loss: 5.539508, mae: 35.889286, mean_q: -53.090069
  89200/150000: episode: 446, duration: 3.448s, episode steps: 200, steps per second:  58, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.830 [0.000, 2.000],  loss: 5.162294, mae: 35.907822, mean_q: -53.121586
  89400/150000: episode: 447, duration: 3.617s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.900 [0.000, 2.000],  loss: 5.252377, mae: 35.923439, mean_q: -53.117115
  89600/150000: episode: 448, du

  95600/150000: episode: 478, duration: 3.587s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.940 [0.000, 2.000],  loss: 3.212555, mae: 32.147484, mean_q: -47.467140
  95800/150000: episode: 479, duration: 3.717s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.910 [0.000, 2.000],  loss: 3.827524, mae: 31.973495, mean_q: -47.160168
  96000/150000: episode: 480, duration: 3.639s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.105 [0.000, 2.000],  loss: 3.786734, mae: 31.925745, mean_q: -47.079746
  96200/150000: episode: 481, duration: 3.631s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.975 [0.000, 2.000],  loss: 5.788225, mae: 31.758984, mean_q: -46.692593
  96400/150000: episode: 482, du

 102359/150000: episode: 512, duration: 3.672s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.045 [0.000, 2.000],  loss: 2.187076, mae: 27.152021, mean_q: -39.785606
 102559/150000: episode: 513, duration: 3.588s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.040 [0.000, 2.000],  loss: 2.542477, mae: 27.159500, mean_q: -39.744766
 102759/150000: episode: 514, duration: 3.565s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.140 [0.000, 2.000],  loss: 2.190366, mae: 26.896635, mean_q: -39.388138
 102959/150000: episode: 515, duration: 3.432s, episode steps: 200, steps per second:  58, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.110 [0.000, 2.000],  loss: 2.182590, mae: 26.759899, mean_q: -39.217407
 103159/150000: episode: 516, du

 109159/150000: episode: 546, duration: 3.545s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.055 [0.000, 2.000],  loss: 2.485134, mae: 23.758570, mean_q: -34.786198
 109359/150000: episode: 547, duration: 3.653s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.040 [0.000, 2.000],  loss: 2.190183, mae: 23.793716, mean_q: -34.826416
 109559/150000: episode: 548, duration: 3.633s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.125 [0.000, 2.000],  loss: 2.448547, mae: 23.751474, mean_q: -34.822277
 109759/150000: episode: 549, duration: 3.650s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.985 [0.000, 2.000],  loss: 1.748687, mae: 23.842510, mean_q: -34.982227
 109959/150000: episode: 550, du

 115959/150000: episode: 580, duration: 3.671s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.850 [0.000, 2.000],  loss: 1.877730, mae: 27.433331, mean_q: -40.456593
 116159/150000: episode: 581, duration: 3.453s, episode steps: 200, steps per second:  58, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.830 [0.000, 2.000],  loss: 3.470891, mae: 27.664062, mean_q: -40.703228
 116359/150000: episode: 582, duration: 3.679s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.920 [0.000, 2.000],  loss: 2.679851, mae: 27.851154, mean_q: -40.961884
 116559/150000: episode: 583, duration: 3.634s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.000 [0.000, 2.000],  loss: 1.887891, mae: 27.807619, mean_q: -41.018238
 116759/150000: episode: 584, du

 122759/150000: episode: 614, duration: 3.747s, episode steps: 200, steps per second:  53, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.170 [0.000, 2.000],  loss: 2.648351, mae: 28.687651, mean_q: -42.217628
 122959/150000: episode: 615, duration: 3.587s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.970 [0.000, 2.000],  loss: 2.494811, mae: 28.654043, mean_q: -42.207588
 123159/150000: episode: 616, duration: 3.738s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.045 [0.000, 2.000],  loss: 3.203196, mae: 28.650459, mean_q: -42.167778
 123359/150000: episode: 617, duration: 3.588s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.005 [0.000, 2.000],  loss: 2.384006, mae: 28.508276, mean_q: -41.989922
 123559/150000: episode: 618, du

 129559/150000: episode: 648, duration: 3.670s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.025 [0.000, 2.000],  loss: 3.206414, mae: 27.745073, mean_q: -40.791969
 129759/150000: episode: 649, duration: 3.605s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.030 [0.000, 2.000],  loss: 2.140985, mae: 27.747509, mean_q: -40.839123
 129959/150000: episode: 650, duration: 3.664s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.990 [0.000, 2.000],  loss: 2.302392, mae: 27.739817, mean_q: -40.787045
 130159/150000: episode: 651, duration: 3.709s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.020 [0.000, 2.000],  loss: 2.929875, mae: 27.671612, mean_q: -40.732922
 130359/150000: episode: 652, du

 136354/150000: episode: 682, duration: 3.671s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.020 [0.000, 2.000],  loss: 2.034496, mae: 26.897520, mean_q: -39.488964
 136554/150000: episode: 683, duration: 3.658s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.020 [0.000, 2.000],  loss: 2.083821, mae: 27.029882, mean_q: -39.710655
 136754/150000: episode: 684, duration: 3.627s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.015 [0.000, 2.000],  loss: 2.043050, mae: 26.975658, mean_q: -39.614975
 136954/150000: episode: 685, duration: 3.693s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.835 [0.000, 2.000],  loss: 1.942770, mae: 26.993130, mean_q: -39.618088
 137154/150000: episode: 686, du

 143131/150000: episode: 716, duration: 3.704s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.085 [0.000, 2.000],  loss: 1.852261, mae: 27.201763, mean_q: -39.980240
 143331/150000: episode: 717, duration: 3.704s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.015 [0.000, 2.000],  loss: 1.855474, mae: 27.209457, mean_q: -40.017262
 143531/150000: episode: 718, duration: 3.678s, episode steps: 200, steps per second:  54, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.015 [0.000, 2.000],  loss: 1.470989, mae: 27.217056, mean_q: -40.029129
 143731/150000: episode: 719, duration: 3.581s, episode steps: 200, steps per second:  56, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 0.950 [0.000, 2.000],  loss: 1.362884, mae: 27.279579, mean_q: -40.099075
 143931/150000: episode: 720, du

 149931/150000: episode: 750, duration: 3.651s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.015 [0.000, 2.000],  loss: 1.326276, mae: 26.500669, mean_q: -38.882103
done, took 2728.478 seconds


<tensorflow.python.keras.callbacks.History at 0x7f89de775790>
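The `verbose=2` log above is easier to inspect once it is parsed into numbers. The following is a small sketch that pulls the step count, episode number, episode reward, and `mean_q` out of each complete line with a regular expression. The pattern is written against the line format shown above and is an assumption about that format only; lines that are cut off simply fail to match and are skipped.

```python
import re

# Matches one complete log line in the format printed above.
LINE_RE = re.compile(
    r"^\s*(?P<step>\d+)/\d+: episode: (?P<episode>\d+),"
    r".*?episode reward: (?P<reward>-?\d+\.\d+),"
    r".*?mean_q: (?P<mean_q>-?\d+\.\d+)"
)


def parse_log(text):
    """Return one dict per complete log line; incomplete lines are ignored."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append({
                "step": int(match.group("step")),
                "episode": int(match.group("episode")),
                "reward": float(match.group("reward")),
                "mean_q": float(match.group("mean_q")),
            })
    return rows


sample = """\
    200/150000: episode: 1, duration: 4.311s, episode steps: 200, steps per second:  46, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.130 [0.000, 2.000],  loss: 0.061807, mae: 0.814149, mean_q: -1.055888
   1000/150000: episode: 5, duration: 3.807s, ep
 149931/150000: episode: 750, duration: 3.651s, episode steps: 200, steps per second:  55, episode reward: -200.000, mean reward: -1.000 [-1.000, -1.000], mean action: 1.015 [0.000, 2.000],  loss: 1.326276, mae: 26.500669, mean_q: -38.882103
"""

rows = parse_log(sample)
print(len(rows), rows[0]["reward"], rows[-1]["mean_q"])  # 2 -200.0 -38.882103
```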

In [9]:
# After training is done, we save the final weights.
dqn.save_weights('results/dqn_mountaincar_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

### Inference

In [11]:
# Finally, evaluate our algorithm for 5 episodes.
dqn.test(env, nb_episodes=5, visualize=True)

Testing for 5 episodes ...
Episode 1: reward: -116.000, steps: 116
Episode 2: reward: -172.000, steps: 172
Episode 3: reward: -162.000, steps: 162
Episode 4: reward: -117.000, steps: 117
Episode 5: reward: -156.000, steps: 156


<tensorflow.python.keras.callbacks.History at 0x7f8abb3d8c50>
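To summarise the five test episodes above, the mean and spread of the rewards can be computed directly. The sketch below uses the rewards exactly as printed; the comparison value of -110 is the reward threshold that Gym's registry is understood to list for MountainCar-v0, which is an assumption worth checking against your installed version.

```python
from statistics import mean, pstdev

# Episode rewards copied from the test output above.
rewards = [-116.0, -172.0, -162.0, -117.0, -156.0]

avg = mean(rewards)
spread = pstdev(rewards)

print("mean reward:", avg)  # -144.6
print("std dev:", round(spread, 1))
print("meets assumed -110 threshold:", avg >= -110.0)  # False
```

Every episode finishes before the 200-step cut-off, so the agent does reach the flag, but the average is still well short of the assumed threshold. Five episodes is also a small sample for drawing conclusions.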