# An Explanation of DQN

![image.png](attachment:image.png)

Image source:  
https://qiita.com/sugulu/items/3c7d6cbe600d455e853b

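### Characteristics of DQN
- It replaces the state-action table of Q-learning with a function (a neural network that maps a state to one Q-value per action).
- It handles discrete action spaces.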

Reference:  
http://blog.syundo.org/post/20171208-reinforcement-learning-dqn-and-impl/
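
To make the first characteristic concrete, the sketch below puts a tabular Q-learning update next to the regression targets a DQN-style learner fits its network to. It is a minimal NumPy illustration of the general technique, not renom_rl's implementation; the function names, the learning rate `lr`, and the discount `gamma` are assumptions chosen for the example.

```python
import numpy as np

def tabular_q_update(q_table, s, a, r, s_next, done, lr=0.1, gamma=0.99):
    """One tabular Q-learning update on a (n_states, n_actions) array, in place."""
    target = r if done else r + gamma * np.max(q_table[s_next])
    q_table[s, a] += lr * (target - q_table[s, a])
    return q_table

def dqn_targets(q_next, rewards, dones, gamma=0.99):
    """Regression targets for a batch of transitions.

    q_next  : (batch, n_actions) Q-values predicted for the next states
    rewards : (batch,) immediate rewards
    dones   : (batch,) True where the episode ended on this transition
    """
    q_next = np.asarray(q_next, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    # Terminal transitions keep only the immediate reward.
    return np.asarray(rewards, dtype=float) + gamma * not_done * q_next.max(axis=1)

# Tabular case: the table entry itself is moved toward the target.
q_table = np.zeros((3, 2))
tabular_q_update(q_table, s=0, a=1, r=1.0, s_next=2, done=False)

# Function-approximation case: the network output is regressed onto these targets.
targets = dqn_targets(q_next=[[0.5, 2.0], [1.0, 0.0]],
                      rewards=[0.0, -1.0],
                      dones=[False, True])
print(q_table[0, 1], targets)
```

In the tabular case the table is the model; in the DQN case the same target is computed per transition and the network weights are adjusted so that the predicted Q-value for the taken action approaches it.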

### Installing OpenAI gym

Install the gym module by following the instructions in the GitHub repository:  
https://github.com/openai/gym

In [1]:
import gym
import numpy as np
import renom as rm
import matplotlib.pyplot as plt
from renom.utility.initializer import Gaussian

from renom_rl.discrete.dqn import DQN
from renom_rl.environ.env import BaseEnv
from gym.core import Env
from PIL import Image
from logging import getLogger, StreamHandler, DEBUG, FileHandler
from renom_rl.utility import Animation

env = gym.make('CartPole-v0')


class CustomEnv(BaseEnv):
    
    def __init__(self, env):
        self.action_shape = (2,)
        self.state_shape = (4,)
     
        self.env=env
        self.step_continue=0
        self.successful_episode=0
        self.animation=Animation()
        self.test_mode=False
        self.reward=0
        


    def reset(self):
        return self.env.reset()
        
    
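    def sample(self):
        # Draw a random action from the wrapped environment, not the global `env`.
        return self.env.action_space.sample()
    
    def step(self, action):
        # Discard gym's per-step reward; a sparse reward is assigned below.
        state, _, terminal, _ = self.env.step(int(action))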
        
        self.step_continue+=1
        reward=0
        
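        if terminal:
            # CartPole-v0 caps an episode at 200 steps, so reaching the cap counts as success.
            if self.step_continue >= 200:
                reward=1
                if not self.test_mode:
                    # Print the success count so far, then increment it.
                    print(self.successful_episode)
                    self.successful_episode+=1
            else:
                # The episode ended before the cap (pole fell or cart left the track).
                reward=-1
            self.step_continue=0
        
        if self.test_mode:
            self.animation.store(self.env.render(mode="rgb_array"))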
        
        self.reward=reward
        
        return state, reward, terminal
    
    def terminate(self):
        # Signal the end of training once 10 successful episodes have been counted.
        if self.successful_episode >= 10:
            self.successful_episode=0
            return True
        else:
            return False

    def test_start(self):
        self.animation.reset()
        self.test_mode=True

    def test_step(self):
        self.animation.store(self.env.render(mode="rgb_array"))

    def test_close(self):
        self.env.close()
        self.env.viewer=None
        self.test_mode=False
 
    def reset_anime(self):
        self.animation.reset()
            
custom_env = CustomEnv(env)

q_network = rm.Sequential([rm.Dense(30, ignore_bias=True),
                           rm.Relu(),
                           rm.Dense(30, ignore_bias=True),
                           rm.Relu(),
                           rm.Dense(custom_env.action_shape[0], ignore_bias=True)])

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




In [2]:
model = DQN(custom_env, q_network)
print(custom_env.state_shape[0])

4
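
The printed `4` is the size of the CartPole state vector, which is the input width of `q_network`. As a rough picture of what that network computes, the sketch below reproduces the same layer layout (4 inputs, two bias-free hidden layers of 30 units with ReLU, 2 outputs) in plain NumPy and picks the greedy action by `argmax`. The random weights and the helper names `q_values` and `greedy_action` are assumptions for illustration; they are not part of ReNom.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden, n_actions = 4, 30, 2

# Random weights stand in for trained parameters.
# There are no bias vectors, mirroring ignore_bias=True in the notebook.
w1 = rng.normal(scale=0.1, size=(state_dim, hidden))
w2 = rng.normal(scale=0.1, size=(hidden, hidden))
w3 = rng.normal(scale=0.1, size=(hidden, n_actions))

def q_values(states):
    """Map a (batch, 4) array of states to a (batch, 2) array of Q-values."""
    h = np.maximum(states @ w1, 0.0)   # Dense(30) + ReLU
    h = np.maximum(h @ w2, 0.0)        # Dense(30) + ReLU
    return h @ w3                      # Dense(2), one Q-value per action

def greedy_action(state):
    """Return the index of the action with the largest predicted Q-value."""
    return int(np.argmax(q_values(state[None, :])[0]))

batch = rng.normal(size=(5, state_dim))
q = q_values(batch)
action = greedy_action(batch[0])
print(q.shape, action)
```

One consequence of dropping every bias term is that an all-zero state always maps to all-zero Q-values, whatever the weights are.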


In [3]:
from renom_rl.utility import EpsilonGreedyFilter
obj=EpsilonGreedyFilter(mode="episode_inverse",max=1,alpha=1,test_greedy=0.95, greedy_step=2500)
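
The `greedy` values printed in the training log of this notebook (0.9091 at episode 10, 0.9524 at episode 20, 0.9583 at episode 23) are consistent with a schedule of the form `1 - 1 / (1 + alpha * episode)`. The sketch below reproduces those numbers. The formula is inferred from the log only, not taken from the renom_rl source, and the function name `greedy_ratio` is hypothetical; the roles of `max`, `test_greedy`, and `greedy_step` cannot be determined from the log and are left out.

```python
def greedy_ratio(episode, alpha=1.0):
    """Probability of taking the greedy action at a given episode.

    Hypothetical schedule inferred from the logged values, not from renom_rl.
    """
    return 1.0 - 1.0 / (1.0 + alpha * episode)

for episode in (0, 10, 20, 23, 27, 30):
    print(episode, round(greedy_ratio(episode), 4))
```

Under this reading the agent acts fully at random in episode 0 and becomes greedy quickly, with exploration falling below 5% after 20 episodes.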

In [5]:
result = model.fit( epoch=500,
                    epoch_step=250000,
                    batch_size=32,
                    random_step=32,
                    test_step=None,
                    update_period=2,
                    train_frequency=1,
                    action_filter=obj,
                    test_action_filter=obj,                   
                  )

epoch 0001 greedy0.0000 loss 0.0005 rewards in epoch 0.000 episode 0000 rewards in episode 0.000.:   0%|          | 13/250000 [00:00<1:05:55, 63.19it/s]

Run random 32 step for storing experiences


epoch 0001 greedy0.9091 loss 0.0094 rewards in epoch -8.000 episode 0010 rewards in episode 1.000.:   0%|          | 516/250000 [00:05<44:12, 94.07it/s]  

0


epoch 0001 greedy0.9524 loss 0.0066 rewards in epoch -16.000 episode 0020 rewards in episode 1.000.:   0%|          | 896/250000 [00:09<40:48, 101.73it/s]  

1


epoch 0001 greedy0.9583 loss 0.0049 rewards in epoch -17.000 episode 0023 rewards in episode -1.000.:   0%|          | 1115/250000 [00:11<44:15, 93.71it/s] 

2


epoch 0001 greedy0.9643 loss 0.0090 rewards in epoch -19.000 episode 0027 rewards in episode 1.000.:   1%|          | 1399/250000 [00:14<40:46, 101.60it/s] 

3


epoch 0001 greedy0.9677 loss 0.0097 rewards in epoch -20.000 episode 0030 rewards in episode 1.000.:   1%|          | 1745/250000 [00:18<40:24, 102.37it/s] 

4


epoch 0001 greedy0.9706 loss 0.0032 rewards in epoch -21.000 episode 0033 rewards in episode 1.000.:   1%|          | 2136/250000 [00:22<40:25, 102.19it/s] 

5


epoch 0001 greedy0.9722 loss 0.0002 rewards in epoch -21.000 episode 0035 rewards in episode -1.000.:   1%|          | 2336/250000 [00:24<40:42, 101.39it/s]

6


epoch 0001 greedy0.9730 loss 0.0067 rewards in epoch -20.000 episode 0036 rewards in episode 1.000.:   1%|          | 2545/250000 [00:26<44:14, 93.22it/s]  

7


epoch 0001 greedy0.9737 loss 0.0045 rewards in epoch -19.000 episode 0037 rewards in episode 1.000.:   1%|          | 2746/250000 [00:28<43:02, 95.75it/s]  

8


epoch 0001 greedy0.9737 loss 0.0077 rewards in epoch -18.000 episode 0038 rewards in episode 1.000.:   1%|          | 2928/250000 [00:30<43:05, 95.55it/s] 

9
terminated
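
The bare numbers 0 to 9 in the log come from `print(self.successful_episode)` in `CustomEnv.step`, which prints the counter before incrementing it, so the `9` marks the tenth successful episode and `terminate()` then returns `True`. The sketch below replays that reward and stopping logic on plain Python values. The episode lengths are made up for illustration, the helper names are hypothetical, and checking the stop condition once per episode is an assumption, since the notebook does not show when `DQN.fit` calls `terminate()`.

```python
def shaped_reward(terminal, steps_in_episode, step_cap=200):
    """Sparse reward used by CustomEnv: 0 mid-episode, +1 at the cap, -1 on early failure."""
    if not terminal:
        return 0
    return 1 if steps_in_episode >= step_cap else -1

def episodes_until_stop(episode_lengths, needed=10, step_cap=200):
    """Return the 1-based episode index at which training would stop, or None."""
    successes = 0
    for index, length in enumerate(episode_lengths, start=1):
        if shaped_reward(True, length, step_cap) == 1:
            successes += 1
        if successes >= needed:
            return index
    return None

# Hypothetical episode lengths: two early failures mixed in with ten successes.
lengths = [35, 200, 120] + [200] * 9
stop_at = episodes_until_stop(lengths)
print(stop_at)
```

Because failures only ever contribute -1 and successes +1, the per-episode reward in the log is always exactly 1.000 or -1.000.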





In [14]:
model.test()

1.0

In [13]:
custom_env.animation.run()
custom_env.reset_anime()

In [None]:
q_network.save("dqn_exp5.h5")
# model = DQN(custom_env, q_network)

In [None]:
model.test(render=True)

In [None]:
import time
start_t = time.time()
a = np.random.permutation(int(1e1))
print(time.time()-start_t)

