# An Explanation of DQN

![image.png](attachment:image.png)

Image source:  
https://qiita.com/sugulu/items/3c7d6cbe600d455e853b

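### Features of DQN
- DQN replaces the state-action table (Q-table) of Q-learning with a function, here a neural network.
- It handles discrete action spaces.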
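
To make the first point concrete, here is a minimal sketch that applies the same Q-learning update once to a lookup table and once to a linear function of the state. It is not DQN itself, which uses a neural network as the function. The names `q_table`, `weights`, `alpha`, `gamma`, and the toy transitions are assumptions chosen for illustration.

```python
# Assumed hyperparameters for illustration only.
alpha, gamma = 0.5, 0.9   # learning rate and discount factor
n_actions = 2

# Tabular Q-learning: one stored value per (state, action) pair.
q_table = {}

def q_tab(state, action):
    return q_table.get((state, action), 0.0)

def tabular_update(state, action, reward, next_state):
    target = reward + gamma * max(q_tab(next_state, a) for a in range(n_actions))
    q_table[(state, action)] = q_tab(state, action) + alpha * (target - q_tab(state, action))

# Function approximation: Q(s, a) = w_a . s, so unseen or continuous
# states still get a value without needing a table entry.
weights = [[0.0, 0.0] for _ in range(n_actions)]

def q_func(state, action):
    return sum(w * x for w, x in zip(weights[action], state))

def approx_update(state, action, reward, next_state):
    target = reward + gamma * max(q_func(next_state, a) for a in range(n_actions))
    td_error = target - q_func(state, action)
    # Gradient step on the squared TD error with respect to w_action.
    for i, x in enumerate(state):
        weights[action][i] += alpha * td_error * x

tabular_update(state=0, action=1, reward=1.0, next_state=1)
approx_update(state=(1.0, 0.0), action=1, reward=1.0, next_state=(0.0, 1.0))
print(q_table[(0, 1)], q_func((1.0, 0.0), 1))
```

In both cases the target is `reward + gamma * max_a Q(next_state, a)`. Only the way Q is stored and updated differs.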

Reference:  
http://blog.syundo.org/post/20171208-reinforcement-learning-dqn-and-impl/

### Installing OpenAI Gym

Install the `gym` module by following the instructions in the GitHub repository.  
https://github.com/openai/gym
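
After installation (typically `pip install gym`), a quick way to confirm that the module is visible to the current Python interpreter is sketched below. The helper name `is_installed` is hypothetical and is not part of gym.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

print("gym installed:", is_installed("gym"))
```

If this prints `False`, the notebook kernel is probably using a different environment from the one where gym was installed.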

In [7]:
import gym
import numpy as np
import renom as rm
import matplotlib.pyplot as plt
from renom.utility.initializer import Gaussian
from renom.cuda import set_cuda_active
from renom_rl.discrete.dqn import DQN
from renom_rl.environ import BaseEnv
from gym.core import Env
from PIL import Image
from logging import getLogger, StreamHandler, DEBUG, FileHandler

set_cuda_active(True)
env = gym.make('CartPole-v0')

class CustomEnv(BaseEnv):
    
    def __init__(self, env):
        self.env = env
        self.action_shape = (2,)
        self.state_shape = (4,)
    
    def reset(self):
        initial_state=self.env.reset()
        return initial_state
    
    def sample(self):
        return int(self.env.action_space.sample())
    
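    def step(self, action):
        # The agent may pass the action as a NumPy array or scalar; convert it to a plain int.
        if isinstance(action, (np.ndarray, np.generic)):
            action = np.asarray(action).ravel()[0]
        # Step the wrapped environment (self.env), not the global `env`.
        state, reward, terminal, _ = self.env.step(int(action))
        return state, reward, terminal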
    
custom_env = CustomEnv(env)
q_network = rm.Sequential([rm.Dense(30, ignore_bias=True),
                           rm.Relu(),
                           rm.Dense(30, ignore_bias=True),
                           rm.Relu(),
                           rm.Dense(custom_env.action_shape[0], ignore_bias=True)])

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




In [8]:
model = DQN(custom_env, q_network)
print(custom_env.state_shape[0])

4
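
The `q_network` passed to `DQN` above maps the 4-dimensional CartPole state to one Q-value per action through two hidden layers of 30 units with ReLU activations. The sketch below reproduces that shape in plain Python to show what a forward pass and a greedy action choice look like. It is not ReNom's implementation: the Gaussian weight initialisation, the helper names, and the sample state are assumptions, and the layers are bias-free on the assumption that this is what `ignore_bias=True` means.

```python
import random

def dense(x, w):
    """Bias-free dense layer; w has shape (len(x), n_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def relu(x):
    return [max(0.0, v) for v in x]

def init(n_in, n_out, rng):
    # Small Gaussian weights (an assumed initialisation).
    return [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(0)
w1, w2, w3 = init(4, 30, rng), init(30, 30, rng), init(30, 2, rng)

def q_values(state):
    h = relu(dense(state, w1))
    h = relu(dense(h, w2))
    return dense(h, w3)          # one Q-value per action, no output activation

state = [0.01, -0.02, 0.03, 0.04]   # hypothetical CartPole observation
q = q_values(state)
greedy_action = max(range(len(q)), key=lambda a: q[a])
print(q, greedy_action)
```

The greedy policy simply picks the action whose Q-value is largest.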


In [10]:
result = model.fit(render=False, greedy_step=100, random_step=500, update_period=10)




  0%|          | 0/250000 [00:00<?, ?it/s]
epoch 0001 loss 1.7556 rewards in epoch 1.000 episode 0000 rewards in episode 0.000.:   0%|          | 0/250000 [00:00<?, ?it/s]
epoch 0001 loss 5.9689 rewards in epoch 2.000 episode 0000 rewards in episode 0.000.:   0%|          | 1/250000 [00:00<2:12:42, 31.40it/s]
...
Run random 500 step for storing experiences
...
epoch 0001 loss 11.0170 rewards in epoch 58.000 episode 0003 rewards in episode 17.000.:   0%|          | 57/250000 [00:00<51:45, 80.48it/s]
...
epoch 0001 loss 5.9608 rewards in epoch 106.000 episode 0007 rewards in episode 10.000.:   0%|          | 106/250000 [00:01<51:13, 81.31it/s]
...
epoch 0001 loss 2.0264 rewards in epoch 202.000 episode 0017 rewards in episode 10.000.:   0%|          | 202/250000 [00:02<47:01, 88.52it/s]
...
epoch 0001 loss 18.2213 rewards in epoch 298.000 episode 0026 rewards in episode 10.000.:   0%|          | 297/250000 [00:03<54:28, 76.39it/s]
...
epoch 0001 loss 40.7416 rewards in epoch 350.000 episode 0032 rewards in episode 11.000.:   0%|          | 349/250000 [00:04<53:01, 78.46it/s]

KeyboardInterrupt: 
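
The log line "Run random 500 step for storing experiences" suggests that the agent first fills an experience replay buffer with random actions before learning from it. The sketch below illustrates that idea together with epsilon-greedy action selection under linear annealing. It is one plausible reading, not ReNomRL's code: the class and function names are hypothetical, the transitions are synthetic, and mapping `random_step` to the warm-up length or `greedy_step` to the annealing length is an assumption that should be checked against the ReNomRL documentation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, terminal) tuples."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)   # oldest transitions are dropped first

    def store(self, state, action, reward, next_state, terminal):
        self.memory.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size, rng):
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

def epsilon_at(step, anneal_steps, eps_start=1.0, eps_end=0.1):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps."""
    frac = min(step / anneal_steps, 1.0)
    return eps_end + (1.0 - frac) * (eps_start - eps_end)

def select_action(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
buffer = ReplayBuffer(capacity=1000)

# Warm-up: store 500 transitions generated with random actions (synthetic data).
for _ in range(500):
    state = [rng.uniform(-0.05, 0.05) for _ in range(4)]
    next_state = [rng.uniform(-0.05, 0.05) for _ in range(4)]
    buffer.store(state, rng.randrange(2), 1.0, next_state, False)

batch = buffer.sample(32, rng)
print(len(buffer), len(batch), epsilon_at(0, 100), epsilon_at(100, 100))
```

Sampling minibatches at random from the buffer breaks the correlation between consecutive transitions, which is the usual motivation for experience replay in DQN.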

In [None]:
q_network.save("dqn_exp5.h5")
# model = DQN(custom_env, q_network)

In [None]:
model.test(render=True)

In [None]:
import time
start_t = time.time()
a = np.random.permutation(int(1e1))
print(time.time()-start_t)

