# An Explanation of DQN

![image.png](attachment:image.png)

Image source:  
https://qiita.com/sugulu/items/3c7d6cbe600d455e853b

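### Features of DQN
- It is Q-learning in which the state-action table (Q-table) is replaced by a function, here a neural network.
- It can handle discrete actions.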

Reference:  
http://blog.syundo.org/post/20171208-reinforcement-learning-dqn-and-impl/
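
The first feature above can be made concrete with a small sketch that contrasts a tabular Q-function with a function approximator. The sizes, the feature vector, and the single linear layer are illustrative assumptions; DQN itself uses a deep network, as in the model built later in this notebook.

```python
import numpy as np

n_states, n_actions = 5, 4          # hypothetical sizes, for illustration only
rng = np.random.default_rng(0)

# Tabular Q-learning: one stored value per (state, action) pair.
q_table = np.zeros((n_states, n_actions))

def q_from_table(state_index):
    return q_table[state_index]          # shape: (n_actions,)

# Function approximation: Q(s, .) is computed from a state feature vector.
# A single linear layer stands in for the deep network here (assumption).
n_features = 8
weights = rng.normal(scale=0.1, size=(n_features, n_actions))

def q_from_function(state_features):
    return state_features @ weights      # shape: (n_actions,)

state_features = rng.normal(size=n_features)
q_values = q_from_function(state_features)

# With discrete actions, the greedy action is an argmax over the outputs.
greedy_action = int(np.argmax(q_values))
```

The table needs one row per state, which is infeasible for image inputs, whereas the function only needs parameters and can be evaluated on states it has never stored.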

### Installing OpenAI Gym

Install the gym module by following the instructions in the GitHub repository.  
https://github.com/openai/gym

In [1]:
import gym
import numpy as np
import renom as rm
import matplotlib.pyplot as plt
from renom.utility.initializer import Gaussian
from renom.cuda import set_cuda_active
from renom_rl.discrete.dqn import DQN
from renom_rl.env import BaseEnv
from gym.core import Env
from PIL import Image
from logging import getLogger, StreamHandler, DEBUG, FileHandler

set_cuda_active(True)
env = gym.make('BreakoutNoFrameskip-v4')

class CustomEnv(BaseEnv):
    
    def __init__(self, env):
        self.env = env
        self.action_shape = 4
        self.state_shape = (4, 84, 84)
        self.previous_frames = []
        self._reset_flag = True
        self._last_live = 5
        super(CustomEnv, self).__init__()
    
    def reset(self):
        if self._reset_flag:
            self._reset_flag = False
            self.env.reset()
        n_step = np.random.randint(4, 32+1)
        for _ in range(n_step):
            state, _, _ = self.step(self.env.action_space.sample())
        return state
    
    def sample(self):
        return self.env.action_space.sample()
    
    def render(self):
        self.env.render()

    def _preprocess(self, state):
        resized_image = Image.fromarray(state).resize((84, 110)).convert('L')
        image_array = np.asarray(resized_image)/255.
        final_image = image_array[26:110]
        # Confirm that the image is processed correctly.
        # Image.fromarray(np.clip(final_image.reshape(84, 84)*255, 0, 255).astype(np.uint8)).save("test.png")
        return final_image
    
    def step(self, action):
        state_list = []
        reward_list = []
        terminal = False
        for _ in range(4):
            # Repeat the action for 4 frames and keep only the last frame; the others are skipped.
            s, r, t, info = self.env.step(action)
            state = self._preprocess(s)
            reward_list.append(r)
            if self._last_live > info["ale.lives"]:
                t = True
                self._last_live = info["ale.lives"]
                if self._last_live > 0:
                    self._reset_flag = False
                else:
                    self._last_live = 5
                    self._reset_flag = True
            if t:
                terminal = True
                
        if len(self.previous_frames) > 3:
            self.previous_frames = self.previous_frames[1:] + [state]
        else:
            self.previous_frames += [state]
        state = np.stack(self.previous_frames)
        return state, np.array(np.sum(reward_list) > 0), terminal
    
custom_env = CustomEnv(env)
q_network = rm.Sequential([rm.Conv2d(32, filter=8, stride=4, ignore_bias=True),
                           rm.Relu(),
                           rm.Conv2d(64, filter=4, stride=2, ignore_bias=True),
                           rm.Relu(),
                           rm.Conv2d(64, filter=3, stride=1, ignore_bias=True),
                           rm.Relu(), 
                           rm.Flatten(), 
                           rm.Dense(512, ignore_bias=True),
                           rm.Relu(),
                           rm.Dense(custom_env.action_shape, ignore_bias=True)])
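
As a sanity check on the network above, the following sketch works out the spatial size of the feature maps for an 84x84 input. It assumes the convolutions use no padding, which is not stated in the cell above; `conv_output_size` is a hypothetical helper written for this note, not part of ReNom.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial size along one dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

size = 84  # height and width of each preprocessed frame
# (filter, stride) pairs of the three Conv2d layers in q_network.
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    print(size)

# The last convolution has 64 channels, so Flatten yields this many values.
flattened = 64 * size * size
print(flattened)
```

Under that assumption the feature maps shrink from 84 to 20, 9, and 7, so the 512-unit Dense layer receives 64 x 7 x 7 = 3136 inputs.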

In [2]:
model = DQN(custom_env, q_network)
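
For orientation, the sketch below computes the regression target that standard DQN trains the Q-network towards. The discount factor, the batch values, and the array names are hypothetical, and the code is not taken from the ReNomRL implementation.

```python
import numpy as np

gamma = 0.99  # assumed discount factor, not taken from this notebook

# Hypothetical batch of 3 transitions with 4 discrete actions.
rewards = np.array([1.0, 0.0, 0.0])
terminals = np.array([False, False, True])
# Q-values of the next states, as a target network might predict them.
next_q = np.array([[0.2, 0.5, 0.1, 0.0],
                   [0.3, 0.1, 0.4, 0.2],
                   [0.9, 0.8, 0.7, 0.6]])

# y = r + gamma * max_a' Q_target(s', a'); no bootstrapping after a terminal step.
targets = rewards + gamma * next_q.max(axis=1) * (~terminals)
print(targets)
```

Because `CustomEnv.step` reports a lost life as terminal, a transition that ends with a lost life would use only its reward as the target under this formula.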

In [3]:
result = model.fit(render=False, greedy_step=1000000, random_step=5000, update_period=10000)

Run random 5000 step for storing experiences


epoch 001 avg_loss: 0.005 total_reward [train:20.000 test:11.000] e-greedy:0.002: 100%|██████████| 2000/2000 [00:30<00:00, 64.95it/s]
epoch 002 avg_loss: 0.004 total_reward [train:21.000 test:16.000] e-greedy:0.004: 100%|██████████| 2000/2000 [00:31<00:00, 64.37it/s]
epoch 003 avg_loss: 0.003 total_reward [train:15.000 test:13.000] e-greedy:0.005: 100%|██████████| 2000/2000 [00:31<00:00, 63.59it/s]
epoch 004 avg_loss: 0.003 total_reward [train:24.000 test:16.000] e-greedy:0.007: 100%|██████████| 2000/2000 [00:31<00:00, 64.08it/s]
epoch 005 avg_loss: 0.002 total_reward [train:18.000 test:9.000] e-greedy:0.009: 100%|██████████| 2000/2000 [00:31<00:00, 63.96it/s]
epoch 006 each step reward:0.000:   1%|          | 14/2000 [00:00<00:29, 68.12it/s]

Update 16.0


epoch 006 avg_loss: 0.004 total_reward [train:23.000 test:1.000] e-greedy:0.011: 100%|██████████| 2000/2000 [00:31<00:00, 63.55it/s]
epoch 007 avg_loss: 0.003 total_reward [train:22.000 test:14.000] e-greedy:0.013: 100%|██████████| 2000/2000 [00:30<00:00, 65.21it/s]
epoch 008 avg_loss: 0.003 total_reward [train:26.000 test:15.000] e-greedy:0.014: 100%|██████████| 2000/2000 [00:30<00:00, 64.89it/s]
epoch 009 avg_loss: 0.003 total_reward [train:17.000 test:10.000] e-greedy:0.016: 100%|██████████| 2000/2000 [00:31<00:00, 69.96it/s]
epoch 010 avg_loss: 0.003 total_reward [train:30.000 test:17.000] e-greedy:0.018: 100%|██████████| 2000/2000 [00:30<00:00, 65.21it/s]
epoch 011 each step reward:0.000:   1%|          | 12/2000 [00:00<00:26, 73.91it/s]

Update 17.0


epoch 011 avg_loss: 0.004 total_reward [train:15.000 test:20.000] e-greedy:0.020: 100%|██████████| 2000/2000 [00:31<00:00, 63.24it/s]
epoch 012 avg_loss: 0.004 total_reward [train:21.000 test:8.000] e-greedy:0.022: 100%|██████████| 2000/2000 [00:32<00:00, 61.83it/s]
epoch 013 avg_loss: 0.003 total_reward [train:22.000 test:14.000] e-greedy:0.023: 100%|██████████| 2000/2000 [00:30<00:00, 64.58it/s]
epoch 014 avg_loss: 0.003 total_reward [train:24.000 test:13.000] e-greedy:0.025: 100%|██████████| 2000/2000 [00:31<00:00, 64.40it/s]
epoch 015 avg_loss: 0.003 total_reward [train:22.000 test:16.000] e-greedy:0.027: 100%|██████████| 2000/2000 [00:30<00:00, 65.63it/s]
epoch 016 each step reward:1.000:   1%|          | 15/2000 [00:00<00:24, 80.88it/s]

Update 20.0


epoch 016 avg_loss: 0.004 total_reward [train:19.000 test:19.000] e-greedy:0.029: 100%|██████████| 2000/2000 [00:30<00:00, 65.76it/s]
epoch 017 avg_loss: 0.003 total_reward [train:17.000 test:12.000] e-greedy:0.031: 100%|██████████| 2000/2000 [00:31<00:00, 64.44it/s]
epoch 018 avg_loss: 0.003 total_reward [train:20.000 test:16.000] e-greedy:0.032: 100%|██████████| 2000/2000 [00:30<00:00, 65.69it/s]
epoch 019 avg_loss: 0.003 total_reward [train:23.000 test:13.000] e-greedy:0.034: 100%|██████████| 2000/2000 [00:29<00:00, 66.80it/s]
epoch 020 avg_loss: 0.003 total_reward [train:20.000 test:17.000] e-greedy:0.036: 100%|██████████| 2000/2000 [00:30<00:00, 65.41it/s]
epoch 021 each step reward:0.000:   1%|          | 14/2000 [00:00<00:26, 73.98it/s]

Update 19.0


epoch 021 avg_loss: 0.005 total_reward [train:26.000 test:11.000] e-greedy:0.038: 100%|██████████| 2000/2000 [00:31<00:00, 63.96it/s]
epoch 022 avg_loss: 0.003 total_reward [train:19.000 test:14.000] e-greedy:0.040: 100%|██████████| 2000/2000 [00:30<00:00, 64.67it/s]
epoch 023 avg_loss: 0.004 total_reward [train:21.000 test:2.000] e-greedy:0.041: 100%|██████████| 2000/2000 [00:31<00:00, 62.92it/s]
epoch 024 avg_loss: 0.004 total_reward [train:23.000 test:8.000] e-greedy:0.043: 100%|██████████| 2000/2000 [00:30<00:00, 65.10it/s]
epoch 025 avg_loss: 0.004 total_reward [train:26.000 test:16.000] e-greedy:0.045: 100%|██████████| 2000/2000 [00:31<00:00, 63.21it/s]
epoch 026 each step reward:1.000:   1%|          | 16/2000 [00:00<00:25, 79.13it/s]

Update 16.0


epoch 026 avg_loss: 0.005 total_reward [train:23.000 test:12.000] e-greedy:0.047: 100%|██████████| 2000/2000 [00:31<00:00, 65.92it/s]
epoch 027 avg_loss: 0.004 total_reward [train:24.000 test:13.000] e-greedy:0.049: 100%|██████████| 2000/2000 [00:30<00:00, 64.81it/s]
epoch 028 avg_loss: 0.004 total_reward [train:21.000 test:10.000] e-greedy:0.050: 100%|██████████| 2000/2000 [00:31<00:00, 62.94it/s]
epoch 029 avg_loss: 0.004 total_reward [train:17.000 test:14.000] e-greedy:0.052: 100%|██████████| 2000/2000 [00:31<00:00, 62.72it/s]
epoch 030 avg_loss: 0.004 total_reward [train:25.000 test:18.000] e-greedy:0.054: 100%|██████████| 2000/2000 [00:31<00:00, 64.36it/s]
epoch 031 each step reward:1.000:   1%|          | 14/2000 [00:00<00:28, 70.02it/s]

Update 18.0


epoch 031 avg_loss: 0.005 total_reward [train:18.000 test:15.000] e-greedy:0.056: 100%|██████████| 2000/2000 [00:31<00:00, 63.64it/s]
epoch 032 avg_loss: 0.004 total_reward [train:25.000 test:14.000] e-greedy:0.058: 100%|██████████| 2000/2000 [00:31<00:00, 64.31it/s]
epoch 033 avg_loss: 0.004 total_reward [train:26.000 test:11.000] e-greedy:0.059: 100%|██████████| 2000/2000 [00:31<00:00, 63.25it/s]
epoch 034 avg_loss: 0.004 total_reward [train:20.000 test:10.000] e-greedy:0.061: 100%|██████████| 2000/2000 [00:31<00:00, 63.78it/s]
epoch 035 avg_loss: 0.004 total_reward [train:17.000 test:3.000] e-greedy:0.063: 100%|██████████| 2000/2000 [00:31<00:00, 63.00it/s]
epoch 036 each step reward:1.000:   1%|          | 14/2000 [00:00<00:26, 74.57it/s]

Update 15.0


epoch 036 avg_loss: 0.005 total_reward [train:19.000 test:12.000] e-greedy:0.065: 100%|██████████| 2000/2000 [00:31<00:00, 63.43it/s]
epoch 037 avg_loss: 0.004 total_reward [train:19.000 test:12.000] e-greedy:0.067: 100%|██████████| 2000/2000 [00:31<00:00, 63.59it/s]
epoch 038 avg_loss: 0.005 total_reward [train:23.000 test:9.000] e-greedy:0.068: 100%|██████████| 2000/2000 [00:30<00:00, 71.26it/s]
epoch 039 avg_loss: 0.004 total_reward [train:18.000 test:16.000] e-greedy:0.070: 100%|██████████| 2000/2000 [00:31<00:00, 63.57it/s]
epoch 040 avg_loss: 0.004 total_reward [train:19.000 test:5.000] e-greedy:0.072: 100%|██████████| 2000/2000 [00:31<00:00, 63.04it/s]
epoch 041 each step reward:0.000:   1%|          | 14/2000 [00:00<00:26, 73.63it/s]

Update 16.0


epoch 041 avg_loss: 0.005 total_reward [train:23.000 test:9.000] e-greedy:0.074: 100%|██████████| 2000/2000 [00:31<00:00, 63.73it/s]
epoch 042 avg_loss: 0.004 total_reward [train:25.000 test:10.000] e-greedy:0.076: 100%|██████████| 2000/2000 [00:31<00:00, 63.87it/s]
epoch 043 avg_loss: 0.004 total_reward [train:26.000 test:7.000] e-greedy:0.077: 100%|██████████| 2000/2000 [00:31<00:00,  6.74it/s]
epoch 044 avg_loss: 0.004 total_reward [train:17.000 test:12.000] e-greedy:0.079: 100%|██████████| 2000/2000 [00:32<00:00, 62.49it/s]
epoch 045 avg_loss: 0.004 total_reward [train:17.000 test:14.000] e-greedy:0.081: 100%|██████████| 2000/2000 [00:31<00:00, 63.26it/s]
epoch 046 each step reward:0.000:   1%|          | 12/2000 [00:00<00:27, 71.14it/s]

Update 14.0


epoch 046 avg_loss: 0.006 total_reward [train:18.000 test:14.000] e-greedy:0.083: 100%|██████████| 2000/2000 [00:31<00:00, 67.71it/s]
epoch 047 avg_loss: 0.005 total_reward [train:19.000 test:8.000] e-greedy:0.085: 100%|██████████| 2000/2000 [00:31<00:00, 63.14it/s]
epoch 048 avg_loss: 0.005 total_reward [train:20.000 test:15.000] e-greedy:0.086: 100%|██████████| 2000/2000 [00:31<00:00, 63.85it/s]
epoch 049 avg_loss: 0.005 total_reward [train:24.000 test:12.000] e-greedy:0.088: 100%|██████████| 2000/2000 [00:31<00:00, 63.59it/s]
epoch 050 avg_loss: 0.005 total_reward [train:23.000 test:13.000] e-greedy:0.090: 100%|██████████| 2000/2000 [00:31<00:00, 62.92it/s]
epoch 051 each step reward:0.000:   1%|          | 11/2000 [00:00<00:31, 63.27it/s]

Update 15.0


epoch 051 avg_loss: 0.006 total_reward [train:22.000 test:15.000] e-greedy:0.092: 100%|██████████| 2000/2000 [00:31<00:00,  6.42it/s]
epoch 052 avg_loss: 0.005 total_reward [train:16.000 test:9.000] e-greedy:0.094: 100%|██████████| 2000/2000 [00:32<00:00, 62.14it/s]
epoch 053 avg_loss: 0.005 total_reward [train:21.000 test:10.000] e-greedy:0.095: 100%|██████████| 2000/2000 [00:31<00:00, 65.46it/s]
epoch 054 avg_loss: 0.005 total_reward [train:24.000 test:13.000] e-greedy:0.097: 100%|██████████| 2000/2000 [00:31<00:00, 63.95it/s]
epoch 055 avg_loss: 0.005 total_reward [train:19.000 test:15.000] e-greedy:0.099: 100%|██████████| 2000/2000 [00:31<00:00, 64.04it/s]
epoch 056 each step reward:0.000:   1%|          | 14/2000 [00:00<00:27, 72.16it/s]

Update 15.0


epoch 056 avg_loss: 0.006 total_reward [train:20.000 test:4.000] e-greedy:0.101: 100%|██████████| 2000/2000 [00:31<00:00, 63.05it/s]
epoch 057 avg_loss: 0.006 total_reward [train:24.000 test:12.000] e-greedy:0.103: 100%|██████████| 2000/2000 [00:31<00:00, 64.04it/s]
epoch 058 avg_loss: 0.005 total_reward [train:25.000 test:15.000] e-greedy:0.104: 100%|██████████| 2000/2000 [00:31<00:00, 63.47it/s]
epoch 059 avg_loss: 0.005 total_reward [train:23.000 test:6.000] e-greedy:0.106: 100%|██████████| 2000/2000 [00:31<00:00, 63.64it/s]
epoch 060 avg_loss: 0.005 total_reward [train:20.000 test:13.000] e-greedy:0.108: 100%|██████████| 2000/2000 [00:31<00:00, 64.19it/s]
epoch 061 each step reward:0.000:   1%|          | 14/2000 [00:00<00:28, 70.82it/s]

Update 15.0


epoch 061 avg_loss: 0.007 total_reward [train:17.000 test:14.000] e-greedy:0.110: 100%|██████████| 2000/2000 [00:31<00:00, 63.33it/s]
epoch 062 avg_loss: 0.006 total_reward [train:24.000 test:16.000] e-greedy:0.112: 100%|██████████| 2000/2000 [00:31<00:00, 64.40it/s]
epoch 063 avg_loss: 0.006 total_reward [train:22.000 test:12.000] e-greedy:0.113: 100%|██████████| 2000/2000 [00:31<00:00, 63.93it/s]
epoch 064 avg_loss: 0.006 total_reward [train:25.000 test:15.000] e-greedy:0.115: 100%|██████████| 2000/2000 [00:30<00:00, 66.37it/s]
epoch 065 avg_loss: 0.005 total_reward [train:27.000 test:12.000] e-greedy:0.117: 100%|██████████| 2000/2000 [00:30<00:00, 65.21it/s]
epoch 066 each step reward:0.000:   1%|          | 16/2000 [00:00<00:25, 76.51it/s]

Update 16.0


epoch 066 avg_loss: 0.007 total_reward [train:26.000 test:16.000] e-greedy:0.119: 100%|██████████| 2000/2000 [00:30<00:00, 64.70it/s]
epoch 067 avg_loss: 0.005 total_reward [train:25.000 test:15.000] e-greedy:0.121: 100%|██████████| 2000/2000 [00:31<00:00, 64.80it/s]
epoch 068 avg_loss: 0.005 total_reward [train:22.000 test:14.000] e-greedy:0.122: 100%|██████████| 2000/2000 [00:31<00:00, 63.55it/s]
epoch 069 avg_loss: 0.006 total_reward [train:25.000 test:15.000] e-greedy:0.124: 100%|██████████| 2000/2000 [00:31<00:00, 63.99it/s]
epoch 070 avg_loss: 0.005 total_reward [train:27.000 test:10.000] e-greedy:0.126: 100%|██████████| 2000/2000 [00:32<00:00, 61.49it/s]
epoch 071 each step reward:0.000:   1%|          | 11/2000 [00:00<00:31, 62.96it/s]

Update 16.0


epoch 071 avg_loss: 0.006 total_reward [train:23.000 test:17.000] e-greedy:0.128: 100%|██████████| 2000/2000 [00:31<00:00, 63.44it/s]
epoch 072 avg_loss: 0.006 total_reward [train:26.000 test:13.000] e-greedy:0.130: 100%|██████████| 2000/2000 [00:30<00:00, 64.90it/s]
epoch 073 avg_loss: 0.006 total_reward [train:25.000 test:13.000] e-greedy:0.131: 100%|██████████| 2000/2000 [00:32<00:00, 61.71it/s]
epoch 074 avg_loss: 0.006 total_reward [train:23.000 test:15.000] e-greedy:0.133: 100%|██████████| 2000/2000 [00:32<00:00, 61.82it/s]
epoch 075 avg_loss: 0.006 total_reward [train:17.000 test:18.000] e-greedy:0.135: 100%|██████████| 2000/2000 [00:31<00:00, 63.23it/s]
epoch 076 each step reward:0.000:   1%|          | 13/2000 [00:00<00:32, 61.24it/s]

Update 18.0


epoch 076 avg_loss: 0.007 total_reward [train:23.000 test:14.000] e-greedy:0.137: 100%|██████████| 2000/2000 [00:31<00:00, 63.38it/s]
epoch 077 avg_loss: 0.006 total_reward [train:28.000 test:14.000] e-greedy:0.139: 100%|██████████| 2000/2000 [00:32<00:00, 62.49it/s]
epoch 078 avg_loss: 0.006 total_reward [train:27.000 test:15.000] e-greedy:0.140: 100%|██████████| 2000/2000 [00:31<00:00, 62.96it/s]
epoch 079 avg_loss: 0.006 total_reward [train:28.000 test:16.000] e-greedy:0.142: 100%|██████████| 2000/2000 [00:31<00:00, 63.56it/s]
epoch 080 avg_loss: 0.006 total_reward [train:24.000 test:18.000] e-greedy:0.144: 100%|██████████| 2000/2000 [00:31<00:00, 64.29it/s]
epoch 081 each step reward:0.000:   0%|          | 10/2000 [00:00<00:33, 60.12it/s]

Update 18.0


epoch 081 avg_loss: 0.007 total_reward [train:26.000 test:14.000] e-greedy:0.146: 100%|██████████| 2000/2000 [00:31<00:00, 63.28it/s]
epoch 082 avg_loss: 0.007 total_reward [train:21.000 test:16.000] e-greedy:0.148: 100%|██████████| 2000/2000 [00:32<00:00, 61.74it/s]
epoch 083 avg_loss: 0.007 total_reward [train:24.000 test:18.000] e-greedy:0.149: 100%|██████████| 2000/2000 [00:30<00:00, 65.74it/s]
epoch 084 avg_loss: 0.006 total_reward [train:24.000 test:15.000] e-greedy:0.151: 100%|██████████| 2000/2000 [00:30<00:00, 66.19it/s]
epoch 085 avg_loss: 0.006 total_reward [train:29.000 test:14.000] e-greedy:0.153: 100%|██████████| 2000/2000 [00:29<00:00, 79.54it/s]
epoch 086 each step reward:0.000:   1%|          | 14/2000 [00:00<00:25, 77.80it/s]

Update 18.0


epoch 086 avg_loss: 0.008 total_reward [train:23.000 test:18.000] e-greedy:0.155: 100%|██████████| 2000/2000 [00:29<00:00, 67.15it/s]
epoch 087 avg_loss: 0.007 total_reward [train:22.000 test:17.000] e-greedy:0.157: 100%|██████████| 2000/2000 [00:29<00:00, 66.87it/s]
epoch 088 avg_loss: 0.007 total_reward [train:28.000 test:18.000] e-greedy:0.158: 100%|██████████| 2000/2000 [00:29<00:00, 67.36it/s]
epoch 089 avg_loss: 0.006 total_reward [train:24.000 test:18.000] e-greedy:0.160: 100%|██████████| 2000/2000 [00:29<00:00, 67.29it/s]
epoch 090 avg_loss: 0.007 total_reward [train:25.000 test:14.000] e-greedy:0.162: 100%|██████████| 2000/2000 [00:30<00:00, 70.86it/s]
epoch 091 each step reward:0.000:   1%|          | 16/2000 [00:00<00:24, 80.75it/s]

Update 18.0


epoch 091 avg_loss: 0.008 total_reward [train:21.000 test:13.000] e-greedy:0.164: 100%|██████████| 2000/2000 [00:29<00:00, 66.94it/s]
epoch 092 avg_loss: 0.007 total_reward [train:20.000 test:14.000] e-greedy:0.166: 100%|██████████| 2000/2000 [00:30<00:00, 66.39it/s]
epoch 093 avg_loss: 0.007 total_reward [train:29.000 test:8.000] e-greedy:0.167: 100%|██████████| 2000/2000 [00:30<00:00, 66.60it/s]
epoch 094 avg_loss: 0.007 total_reward [train:29.000 test:18.000] e-greedy:0.169: 100%|██████████| 2000/2000 [00:29<00:00, 66.72it/s]
epoch 095 avg_loss: 0.007 total_reward [train:31.000 test:18.000] e-greedy:0.171: 100%|██████████| 2000/2000 [00:29<00:00, 66.74it/s]
epoch 096 each step reward:0.000:   1%|          | 17/2000 [00:00<00:24, 80.28it/s]

Update 18.0


epoch 096 avg_loss: 0.008 total_reward [train:28.000 test:11.000] e-greedy:0.173: 100%|██████████| 2000/2000 [00:29<00:00, 76.16it/s]
epoch 097 avg_loss: 0.008 total_reward [train:24.000 test:14.000] e-greedy:0.175: 100%|██████████| 2000/2000 [00:29<00:00, 67.10it/s]
epoch 098 avg_loss: 0.008 total_reward [train:29.000 test:18.000] e-greedy:0.176: 100%|██████████| 2000/2000 [00:29<00:00, 67.81it/s]
epoch 099 avg_loss: 0.007 total_reward [train:27.000 test:16.000] e-greedy:0.178: 100%|██████████| 2000/2000 [00:29<00:00, 67.05it/s]
epoch 100 avg_loss: 0.007 total_reward [train:24.000 test:14.000] e-greedy:0.180: 100%|██████████| 2000/2000 [00:29<00:00, 66.96it/s]
epoch 101 each step reward:0.000:   1%|          | 15/2000 [00:00<00:24, 81.50it/s]

Update 18.0


epoch 101 avg_loss: 0.008 total_reward [train:24.000 test:17.000] e-greedy:0.182: 100%|██████████| 2000/2000 [00:30<00:00, 66.04it/s]
epoch 102 avg_loss: 0.008 total_reward [train:22.000 test:21.000] e-greedy:0.184: 100%|██████████| 2000/2000 [00:29<00:00, 66.97it/s]
epoch 103 avg_loss: 0.008 total_reward [train:25.000 test:17.000] e-greedy:0.185: 100%|██████████| 2000/2000 [00:29<00:00, 66.93it/s]
epoch 104 avg_loss: 0.007 total_reward [train:28.000 test:16.000] e-greedy:0.187: 100%|██████████| 2000/2000 [00:30<00:00, 66.48it/s]
epoch 105 avg_loss: 0.007 total_reward [train:25.000 test:16.000] e-greedy:0.189: 100%|██████████| 2000/2000 [00:29<00:00, 67.41it/s]
epoch 106 each step reward:0.000:   1%|          | 13/2000 [00:00<00:32, 61.95it/s]

Update 21.0


epoch 106 avg_loss: 0.009 total_reward [train:29.000 test:20.000] e-greedy:0.191: 100%|██████████| 2000/2000 [00:29<00:00, 67.53it/s]
epoch 107 avg_loss: 0.009 total_reward [train:30.000 test:17.000] e-greedy:0.193: 100%|██████████| 2000/2000 [00:29<00:00, 67.45it/s]
epoch 108 avg_loss: 0.008 total_reward [train:22.000 test:21.000] e-greedy:0.194: 100%|██████████| 2000/2000 [00:29<00:00, 67.64it/s]
epoch 109 avg_loss: 0.008 total_reward [train:28.000 test:20.000] e-greedy:0.196: 100%|██████████| 2000/2000 [00:29<00:00, 73.02it/s]
epoch 110 avg_loss: 0.008 total_reward [train:28.000 test:19.000] e-greedy:0.198: 100%|██████████| 2000/2000 [00:29<00:00, 67.50it/s]
epoch 111 each step reward:0.000:   1%|          | 15/2000 [00:00<00:26, 73.62it/s]

Update 21.0


epoch 111 avg_loss: 0.010 total_reward [train:28.000 test:15.000] e-greedy:0.200: 100%|██████████| 2000/2000 [00:29<00:00, 67.37it/s]
epoch 112 avg_loss: 0.010 total_reward [train:29.000 test:24.000] e-greedy:0.202: 100%|██████████| 2000/2000 [00:29<00:00, 67.81it/s]
epoch 113 avg_loss: 0.009 total_reward [train:27.000 test:13.000] e-greedy:0.203: 100%|██████████| 2000/2000 [00:29<00:00, 66.85it/s]
epoch 114 avg_loss: 0.009 total_reward [train:26.000 test:19.000] e-greedy:0.205: 100%|██████████| 2000/2000 [00:29<00:00, 67.02it/s]
epoch 115 avg_loss: 0.009 total_reward [train:23.000 test:21.000] e-greedy:0.207: 100%|██████████| 2000/2000 [00:29<00:00, 67.55it/s]
epoch 116 each step reward:0.000:   1%|          | 16/2000 [00:00<00:25, 78.83it/s]

Update 24.0


epoch 116 avg_loss: 0.011 total_reward [train:26.000 test:18.000] e-greedy:0.209: 100%|██████████| 2000/2000 [00:29<00:00, 66.72it/s]
epoch 117 avg_loss: 0.010 total_reward [train:26.000 test:15.000] e-greedy:0.211: 100%|██████████| 2000/2000 [00:29<00:00, 67.18it/s]
epoch 118 avg_loss: 0.010 total_reward [train:33.000 test:24.000] e-greedy:0.212: 100%|██████████| 2000/2000 [00:28<00:00, 69.36it/s]
epoch 119 avg_loss: 0.010 total_reward [train:26.000 test:19.000] e-greedy:0.214: 100%|██████████| 2000/2000 [00:29<00:00, 67.47it/s]
epoch 120 avg_loss: 0.010 total_reward [train:33.000 test:16.000] e-greedy:0.216: 100%|██████████| 2000/2000 [00:29<00:00, 67.35it/s]
epoch 121 each step reward:1.000:   1%|          | 16/2000 [00:00<00:25, 79.04it/s]

Update 24.0


epoch 121 avg_loss: 0.012 total_reward [train:28.000 test:15.000] e-greedy:0.218: 100%|██████████| 2000/2000 [00:29<00:00, 67.72it/s]
epoch 122 avg_loss: 0.011 total_reward [train:26.000 test:20.000] e-greedy:0.220: 100%|██████████| 2000/2000 [00:29<00:00, 67.23it/s]
epoch 123 avg_loss: 0.011 total_reward [train:28.000 test:21.000] e-greedy:0.221: 100%|██████████| 2000/2000 [00:29<00:00, 67.83it/s]
epoch 124 avg_loss: 0.011 total_reward [train:29.000 test:19.000] e-greedy:0.223: 100%|██████████| 2000/2000 [00:29<00:00, 66.85it/s]
epoch 125 avg_loss: 0.010 total_reward [train:30.000 test:19.000] e-greedy:0.225: 100%|██████████| 2000/2000 [00:29<00:00, 67.94it/s]
epoch 126 each step reward:1.000:   1%|          | 16/2000 [00:00<00:25, 78.68it/s]

Update 21.0


epoch 126 avg_loss: 0.013 total_reward [train:34.000 test:20.000] e-greedy:0.227: 100%|██████████| 2000/2000 [00:29<00:00, 68.42it/s]
epoch 127 avg_loss: 0.013 total_reward [train:30.000 test:15.000] e-greedy:0.229: 100%|██████████| 2000/2000 [00:29<00:00, 67.90it/s]
epoch 128 avg_loss: 0.012 total_reward [train:27.000 test:22.000] e-greedy:0.230: 100%|██████████| 2000/2000 [00:29<00:00, 67.09it/s]
epoch 129 avg_loss: 0.011 total_reward [train:26.000 test:23.000] e-greedy:0.232: 100%|██████████| 2000/2000 [00:29<00:00, 68.01it/s]
epoch 130 avg_loss: 0.012 total_reward [train:30.000 test:21.000] e-greedy:0.234: 100%|██████████| 2000/2000 [00:29<00:00, 68.12it/s]
epoch 131 each step reward:1.000:   1%|          | 12/2000 [00:00<00:36, 54.00it/s]

Update 23.0


epoch 131 avg_loss: 0.015 total_reward [train:26.000 test:21.000] e-greedy:0.236: 100%|██████████| 2000/2000 [00:29<00:00, 66.94it/s]
epoch 132 avg_loss: 0.014 total_reward [train:33.000 test:20.000] e-greedy:0.238: 100%|██████████| 2000/2000 [00:29<00:00, 67.87it/s]
epoch 133 avg_loss: 0.013 total_reward [train:28.000 test:18.000] e-greedy:0.239: 100%|██████████| 2000/2000 [00:29<00:00, 68.18it/s]
epoch 134 avg_loss: 0.013 total_reward [train:29.000 test:21.000] e-greedy:0.241: 100%|██████████| 2000/2000 [00:29<00:00, 68.61it/s]
epoch 135 avg_loss: 0.013 total_reward [train:22.000 test:21.000] e-greedy:0.243: 100%|██████████| 2000/2000 [00:29<00:00, 73.22it/s]
epoch 136 each step reward:0.000:   1%|          | 13/2000 [00:00<00:36, 54.34it/s]

Update 21.0


epoch 136 avg_loss: 0.016 total_reward [train:36.000 test:19.000] e-greedy:0.245: 100%|██████████| 2000/2000 [00:29<00:00, 68.08it/s]
epoch 137 avg_loss: 0.015 total_reward [train:24.000 test:21.000] e-greedy:0.247: 100%|██████████| 2000/2000 [00:29<00:00, 67.21it/s]
epoch 138 avg_loss: 0.015 total_reward [train:31.000 test:15.000] e-greedy:0.248: 100%|██████████| 2000/2000 [00:29<00:00, 69.89it/s]
epoch 139 avg_loss: 0.015 total_reward [train:32.000 test:21.000] e-greedy:0.250: 100%|██████████| 2000/2000 [00:29<00:00, 68.24it/s]
epoch 140 avg_loss: 0.014 total_reward [train:28.000 test:19.000] e-greedy:0.252: 100%|██████████| 2000/2000 [00:29<00:00, 67.23it/s]
epoch 141 each step reward:0.000:   1%|          | 16/2000 [00:00<00:25, 78.05it/s]

Update 21.0


epoch 141 avg_loss: 0.018 total_reward [train:29.000 test:22.000] e-greedy:0.254: 100%|██████████| 2000/2000 [00:29<00:00, 67.96it/s]
epoch 142 avg_loss: 0.017 total_reward [train:25.000 test:19.000] e-greedy:0.256: 100%|██████████| 2000/2000 [00:29<00:00, 67.17it/s]
epoch 143 avg_loss: 0.018 total_reward [train:29.000 test:19.000] e-greedy:0.257: 100%|██████████| 2000/2000 [00:29<00:00, 67.28it/s]
epoch 144 avg_loss: 0.017 total_reward [train:38.000 test:15.000] e-greedy:0.259: 100%|██████████| 2000/2000 [00:29<00:00, 68.45it/s]
epoch 145 avg_loss: 0.017 total_reward [train:26.000 test:19.000] e-greedy:0.261: 100%|██████████| 2000/2000 [00:29<00:00, 67.97it/s]
epoch 146 each step reward:1.000:   1%|          | 16/2000 [00:00<00:24, 79.44it/s]

Update 22.0


epoch 146 avg_loss: 0.018 total_reward [train:26.000 test:21.000] e-greedy:0.263: 100%|██████████| 2000/2000 [00:29<00:00, 67.33it/s]
epoch 147 avg_loss: 0.018 total_reward [train:28.000 test:21.000] e-greedy:0.265: 100%|██████████| 2000/2000 [00:29<00:00, 76.52it/s]
epoch 148 avg_loss: 0.018 total_reward [train:22.000 test:20.000] e-greedy:0.266: 100%|██████████| 2000/2000 [00:30<00:00, 66.59it/s]
epoch 149 avg_loss: 0.018 total_reward [train:30.000 test:22.000] e-greedy:0.268: 100%|██████████| 2000/2000 [00:30<00:00, 65.48it/s]
epoch 150 avg_loss: 0.018 total_reward [train:28.000 test:22.000] e-greedy:0.270: 100%|██████████| 2000/2000 [00:31<00:00, 64.51it/s]
epoch 151 each step reward:0.000:   1%|          | 14/2000 [00:00<00:27, 71.33it/s]

Update 22.0


epoch 151 avg_loss: 0.021 total_reward [train:24.000 test:18.000] e-greedy:0.272: 100%|██████████| 2000/2000 [00:30<00:00, 64.67it/s]
epoch 152 avg_loss: 0.020 total_reward [train:34.000 test:20.000] e-greedy:0.274: 100%|██████████| 2000/2000 [00:31<00:00, 63.82it/s]
epoch 153 avg_loss: 0.020 total_reward [train:23.000 test:23.000] e-greedy:0.275: 100%|██████████| 2000/2000 [00:31<00:00, 64.11it/s]
epoch 154 avg_loss: 0.019 total_reward [train:28.000 test:20.000] e-greedy:0.277: 100%|██████████| 2000/2000 [00:31<00:00, 64.29it/s]
epoch 155 avg_loss: 0.019 total_reward [train:29.000 test:19.000] e-greedy:0.279: 100%|██████████| 2000/2000 [00:31<00:00, 64.22it/s]
epoch 156 each step reward:0.000:   1%|          | 13/2000 [00:00<00:28, 69.84it/s]

Update 23.0


epoch 156 avg_loss: 0.023 total_reward [train:31.000 test:23.000] e-greedy:0.281: 100%|██████████| 2000/2000 [00:29<00:00, 71.96it/s]
epoch 157 avg_loss: 0.022 total_reward [train:26.000 test:21.000] e-greedy:0.283: 100%|██████████| 2000/2000 [00:30<00:00, 64.92it/s]
epoch 158 avg_loss: 0.022 total_reward [train:24.000 test:20.000] e-greedy:0.284: 100%|██████████| 2000/2000 [00:31<00:00, 63.61it/s]
epoch 159 avg_loss: 0.021 total_reward [train:31.000 test:12.000] e-greedy:0.286: 100%|██████████| 2000/2000 [00:30<00:00, 65.69it/s]
epoch 160 avg_loss: 0.021 total_reward [train:24.000 test:21.000] e-greedy:0.288: 100%|██████████| 2000/2000 [00:31<00:00, 63.87it/s]
epoch 161 each step reward:0.000:   1%|          | 12/2000 [00:00<00:36, 53.77it/s]

Update 23.0


epoch 161 avg_loss: 0.024 total_reward [train:33.000 test:23.000] e-greedy:0.290: 100%|██████████| 2000/2000 [00:30<00:00, 65.48it/s]
epoch 162 avg_loss: 0.023 total_reward [train:31.000 test:21.000] e-greedy:0.292: 100%|██████████| 2000/2000 [00:30<00:00, 65.39it/s]
epoch 163 avg_loss: 0.023 total_reward [train:27.000 test:22.000] e-greedy:0.293: 100%|██████████| 2000/2000 [00:30<00:00, 68.29it/s]
epoch 164 avg_loss: 0.022 total_reward [train:35.000 test:21.000] e-greedy:0.295: 100%|██████████| 2000/2000 [00:30<00:00, 65.66it/s]
epoch 165 avg_loss: 0.023 total_reward [train:24.000 test:22.000] e-greedy:0.297: 100%|██████████| 2000/2000 [00:30<00:00,  8.48it/s]
epoch 166 each step reward:0.000:   1%|          | 14/2000 [00:00<00:29, 66.94it/s]

Update 23.0


epoch 166 avg_loss: 0.025 total_reward [train:23.000 test:19.000] e-greedy:0.299: 100%|██████████| 2000/2000 [00:30<00:00, 64.79it/s]
epoch 167 avg_loss: 0.024 total_reward [train:25.000 test:22.000] e-greedy:0.301: 100%|██████████| 2000/2000 [00:30<00:00, 65.09it/s]
epoch 168 avg_loss: 0.024 total_reward [train:26.000 test:19.000] e-greedy:0.302: 100%|██████████| 2000/2000 [00:31<00:00, 63.94it/s]
epoch 169 avg_loss: 0.023 total_reward [train:31.000 test:21.000] e-greedy:0.304: 100%|██████████| 2000/2000 [00:32<00:00, 61.93it/s]
epoch 170 avg_loss: 0.023 total_reward [train:31.000 test:16.000] e-greedy:0.306: 100%|██████████| 2000/2000 [00:30<00:00, 65.08it/s]
epoch 171 each step reward:0.000:   1%|          | 13/2000 [00:00<00:29, 68.32it/s]

Update 22.0


epoch 171 avg_loss: 0.025 total_reward [train:24.000 test:25.000] e-greedy:0.308: 100%|██████████| 2000/2000 [00:31<00:00, 64.24it/s]
epoch 172 avg_loss: 0.025 total_reward [train:24.000 test:23.000] e-greedy:0.310: 100%|██████████| 2000/2000 [00:30<00:00, 65.03it/s]
epoch 173 avg_loss: 0.025 total_reward [train:24.000 test:24.000] e-greedy:0.311: 100%|██████████| 2000/2000 [00:30<00:00, 65.06it/s]
epoch 174 avg_loss: 0.025 total_reward [train:31.000 test:23.000] e-greedy:0.313: 100%|██████████| 2000/2000 [00:30<00:00, 65.66it/s]
epoch 175 avg_loss: 0.024 total_reward [train:22.000 test:20.000] e-greedy:0.315: 100%|██████████| 2000/2000 [00:31<00:00, 63.75it/s]
epoch 176 each step reward:0.000:   1%|          | 15/2000 [00:00<00:28, 70.36it/s]

Update 25.0


epoch 176 avg_loss: 0.027 total_reward [train:31.000 test:20.000] e-greedy:0.317: 100%|██████████| 2000/2000 [00:32<00:00, 61.84it/s]
epoch 177 avg_loss: 0.026 total_reward [train:28.000 test:21.000] e-greedy:0.319: 100%|██████████| 2000/2000 [00:31<00:00, 64.27it/s]
epoch 178 avg_loss: 0.026 total_reward [train:25.000 test:20.000] e-greedy:0.320: 100%|██████████| 2000/2000 [00:32<00:00, 61.71it/s]
epoch 179 avg_loss: 0.026 total_reward [train:32.000 test:18.000] e-greedy:0.322: 100%|██████████| 2000/2000 [00:30<00:00, 64.55it/s]
epoch 180 avg_loss: 0.026 total_reward [train:29.000 test:18.000] e-greedy:0.324: 100%|██████████| 2000/2000 [00:31<00:00, 63.90it/s]
epoch 181 each step reward:0.000:   1%|          | 14/2000 [00:00<00:30, 64.75it/s]

Update 21.0


epoch 181 avg_loss: 0.028 total_reward [train:31.000 test:20.000] e-greedy:0.326: 100%|██████████| 2000/2000 [00:31<00:00, 64.13it/s]
epoch 182 avg_loss: 0.027 total_reward [train:32.000 test:20.000] e-greedy:0.328: 100%|██████████| 2000/2000 [00:31<00:00, 64.23it/s]
epoch 183 avg_loss: 0.027 total_reward [train:32.000 test:21.000] e-greedy:0.329: 100%|██████████| 2000/2000 [00:30<00:00, 64.62it/s]
epoch 184 avg_loss: 0.027 total_reward [train:25.000 test:21.000] e-greedy:0.331: 100%|██████████| 2000/2000 [00:30<00:00, 69.23it/s]
epoch 185 avg_loss: 0.027 total_reward [train:26.000 test:21.000] e-greedy:0.333: 100%|██████████| 2000/2000 [00:30<00:00, 64.81it/s]
epoch 186 each step reward:0.000:   1%|          | 13/2000 [00:00<00:27, 73.12it/s]

Update 21.0


epoch 186 avg_loss: 0.029 total_reward [train:33.000 test:21.000] e-greedy:0.335: 100%|██████████| 2000/2000 [00:30<00:00, 64.95it/s]
epoch 187 avg_loss: 0.029 total_reward [train:33.000 test:18.000] e-greedy:0.337: 100%|██████████| 2000/2000 [00:30<00:00, 64.91it/s]
epoch 188 avg_loss: 0.029 total_reward [train:22.000 test:21.000] e-greedy:0.338: 100%|██████████| 2000/2000 [00:31<00:00, 63.12it/s]
epoch 189 avg_loss: 0.029 total_reward [train:30.000 test:20.000] e-greedy:0.340: 100%|██████████| 2000/2000 [00:31<00:00, 72.18it/s]
epoch 190 avg_loss: 0.028 total_reward [train:34.000 test:19.000] e-greedy:0.342: 100%|██████████| 2000/2000 [00:31<00:00, 63.33it/s]
epoch 191 each step reward:0.000:   1%|          | 13/2000 [00:00<00:31, 64.03it/s]

Update 21.0


epoch 191 avg_loss: 0.030 total_reward [train:25.000 test:20.000] e-greedy:0.344: 100%|██████████| 2000/2000 [00:31<00:00, 72.13it/s]
epoch 192 avg_loss: 0.029 total_reward [train:27.000 test:23.000] e-greedy:0.346: 100%|██████████| 2000/2000 [00:31<00:00, 72.70it/s]
epoch 193 avg_loss: 0.028 total_reward [train:28.000 test:22.000] e-greedy:0.347: 100%|██████████| 2000/2000 [00:31<00:00, 62.92it/s]
epoch 194 avg_loss: 0.027 total_reward [train:30.000 test:20.000] e-greedy:0.349: 100%|██████████| 2000/2000 [00:31<00:00, 63.86it/s]
epoch 195 avg_loss: 0.029 total_reward [train:28.000 test:23.000] e-greedy:0.351: 100%|██████████| 2000/2000 [00:32<00:00, 69.52it/s]
epoch 196 each step reward:0.000:   1%|          | 13/2000 [00:00<00:25, 77.08it/s]

Update 23.0


epoch 196 avg_loss: 0.031 total_reward [train:24.000 test:21.000] e-greedy:0.353: 100%|██████████| 2000/2000 [00:31<00:00, 63.08it/s]
epoch 197 avg_loss: 0.029 total_reward [train:27.000 test:21.000] e-greedy:0.355: 100%|██████████| 2000/2000 [00:31<00:00,  8.54it/s]
epoch 198 avg_loss: 0.030 total_reward [train:30.000 test:22.000] e-greedy:0.356: 100%|██████████| 2000/2000 [00:31<00:00, 63.83it/s]
epoch 199 avg_loss: 0.029 total_reward [train:30.000 test:19.000] e-greedy:0.358: 100%|██████████| 2000/2000 [00:31<00:00, 71.35it/s]
epoch 200 avg_loss: 0.029 total_reward [train:24.000 test:19.000] e-greedy:0.360: 100%|██████████| 2000/2000 [00:31<00:00, 63.63it/s]
epoch 201 each step reward:0.000:   1%|          | 12/2000 [00:00<00:32, 60.41it/s]

Update 22.0


epoch 201 avg_loss: 0.033 total_reward [train:28.000 test:20.000] e-greedy:0.362: 100%|██████████| 2000/2000 [00:31<00:00, 64.35it/s]
epoch 202 avg_loss: 0.033 total_reward [train:31.000 test:20.000] e-greedy:0.364: 100%|██████████| 2000/2000 [00:31<00:00, 63.75it/s]
epoch 203 avg_loss: 0.032 total_reward [train:28.000 test:20.000] e-greedy:0.365: 100%|██████████| 2000/2000 [00:31<00:00, 63.66it/s]
epoch 204 avg_loss: 0.032 total_reward [train:29.000 test:21.000] e-greedy:0.367: 100%|██████████| 2000/2000 [00:32<00:00, 62.01it/s]
epoch 205 avg_loss: 0.032 total_reward [train:28.000 test:16.000] e-greedy:0.369: 100%|██████████| 2000/2000 [00:31<00:00, 64.01it/s]
epoch 206 each step reward:0.000:   1%|          | 14/2000 [00:00<00:25, 76.70it/s]

Update 21.0


epoch 206 avg_loss: 0.032 total_reward [train:33.000 test:21.000] e-greedy:0.371: 100%|██████████| 2000/2000 [00:30<00:00, 64.65it/s]
epoch 207 avg_loss: 0.031 total_reward [train:33.000 test:19.000] e-greedy:0.373: 100%|██████████| 2000/2000 [00:31<00:00, 63.23it/s]
epoch 208 avg_loss: 0.031 total_reward [train:34.000 test:19.000] e-greedy:0.374: 100%|██████████| 2000/2000 [00:30<00:00, 64.69it/s]
epoch 209 avg_loss: 0.031 total_reward [train:29.000 test:19.000] e-greedy:0.376: 100%|██████████| 2000/2000 [00:30<00:00, 64.65it/s]
epoch 210 avg_loss: 0.031 total_reward [train:33.000 test:14.000] e-greedy:0.378: 100%|██████████| 2000/2000 [00:30<00:00, 65.13it/s]
epoch 211 each step reward:0.000:   1%|          | 15/2000 [00:00<00:28, 70.11it/s]

Update 21.0


epoch 211 avg_loss: 0.033 total_reward [train:32.000 test:21.000] e-greedy:0.380: 100%|██████████| 2000/2000 [00:30<00:00, 64.70it/s]
epoch 212 avg_loss: 0.034 total_reward [train:28.000 test:19.000] e-greedy:0.382: 100%|██████████| 2000/2000 [00:31<00:00, 63.78it/s]
epoch 213 avg_loss: 0.033 total_reward [train:29.000 test:20.000] e-greedy:0.383: 100%|██████████| 2000/2000 [00:30<00:00, 64.62it/s]
epoch 214 avg_loss: 0.033 total_reward [train:31.000 test:20.000] e-greedy:0.385: 100%|██████████| 2000/2000 [00:31<00:00, 63.48it/s]
epoch 215 avg_loss: 0.032 total_reward [train:29.000 test:22.000] e-greedy:0.387: 100%|██████████| 2000/2000 [00:31<00:00, 62.64it/s]
epoch 216 each step reward:1.000:   1%|          | 14/2000 [00:00<00:29, 66.20it/s]

Update 22.0


epoch 216 avg_loss: 0.035 total_reward [train:26.000 test:21.000] e-greedy:0.389: 100%|██████████| 2000/2000 [00:31<00:00, 63.26it/s]
epoch 217 avg_loss: 0.034 total_reward [train:23.000 test:21.000] e-greedy:0.391: 100%|██████████| 2000/2000 [00:32<00:00, 62.20it/s]
epoch 218 avg_loss: 0.034 total_reward [train:29.000 test:20.000] e-greedy:0.392: 100%|██████████| 2000/2000 [00:31<00:00,  8.50it/s]
epoch 219 avg_loss: 0.034 total_reward [train:28.000 test:22.000] e-greedy:0.394: 100%|██████████| 2000/2000 [00:31<00:00, 62.98it/s]
epoch 220 avg_loss: 0.034 total_reward [train:26.000 test:21.000] e-greedy:0.396: 100%|██████████| 2000/2000 [00:31<00:00, 63.25it/s]
epoch 221 each step reward:0.000:   1%|          | 14/2000 [00:00<00:28, 69.27it/s]

Update 22.0


epoch 221 avg_loss: 0.037 total_reward [train:29.000 test:21.000] e-greedy:0.398: 100%|██████████| 2000/2000 [00:32<00:00,  7.17it/s]
epoch 222 avg_loss: 0.037 total_reward [train:27.000 test:20.000] e-greedy:0.400: 100%|██████████| 2000/2000 [00:31<00:00, 63.71it/s]
epoch 223 avg_loss: 0.036 total_reward [train:24.000 test:19.000] e-greedy:0.401: 100%|██████████| 2000/2000 [00:29<00:00, 67.37it/s]
epoch 224 avg_loss: 0.035 total_reward [train:27.000 test:10.000] e-greedy:0.403: 100%|██████████| 2000/2000 [00:29<00:00, 67.09it/s]
epoch 225 avg_loss: 0.035 total_reward [train:27.000 test:18.000] e-greedy:0.405: 100%|██████████| 2000/2000 [00:29<00:00, 72.49it/s]
epoch 226 each step reward:0.000:   1%|          | 16/2000 [00:00<00:25, 78.28it/s]

Update 21.0


epoch 226 avg_loss: 0.036 total_reward [train:23.000 test:17.000] e-greedy:0.407: 100%|██████████| 2000/2000 [00:31<00:00, 64.26it/s]
epoch 227 avg_loss: 0.037 total_reward [train:33.000 test:21.000] e-greedy:0.409: 100%|██████████| 2000/2000 [00:30<00:00, 64.98it/s]
epoch 228 avg_loss: 0.037 total_reward [train:35.000 test:19.000] e-greedy:0.410: 100%|██████████| 2000/2000 [00:31<00:00, 63.90it/s]
epoch 229 avg_loss: 0.036 total_reward [train:27.000 test:21.000] e-greedy:0.412: 100%|██████████| 2000/2000 [00:30<00:00, 71.97it/s]
epoch 230 avg_loss: 0.036 total_reward [train:22.000 test:20.000] e-greedy:0.414: 100%|██████████| 2000/2000 [00:31<00:00, 63.90it/s]
epoch 231 each step reward:0.000:   1%|          | 16/2000 [00:00<00:25, 78.11it/s]

Update 21.0


epoch 231 avg_loss: 0.036 total_reward [train:25.000 test:20.000] e-greedy:0.416: 100%|██████████| 2000/2000 [00:31<00:00, 72.22it/s]
epoch 232 avg_loss: 0.036 total_reward [train:30.000 test:18.000] e-greedy:0.418: 100%|██████████| 2000/2000 [00:30<00:00, 64.90it/s]
epoch 233 avg_loss: 0.035 total_reward [train:28.000 test:20.000] e-greedy:0.419: 100%|██████████| 2000/2000 [00:32<00:00, 61.77it/s]
epoch 234 avg_loss: 0.035 total_reward [train:37.000 test:22.000] e-greedy:0.421: 100%|██████████| 2000/2000 [00:31<00:00, 63.17it/s]
epoch 235 avg_loss: 0.035 total_reward [train:29.000 test:17.000] e-greedy:0.423: 100%|██████████| 2000/2000 [00:31<00:00, 62.93it/s]
epoch 236 each step reward:0.000:   1%|          | 12/2000 [00:00<00:29, 67.22it/s]

Update 22.0


epoch 236 avg_loss: 0.036 total_reward [train:37.000 test:18.000] e-greedy:0.425: 100%|██████████| 2000/2000 [00:31<00:00, 63.15it/s]
epoch 237 avg_loss: 0.036 total_reward [train:30.000 test:21.000] e-greedy:0.427: 100%|██████████| 2000/2000 [00:30<00:00, 65.09it/s]
epoch 238 avg_loss: 0.035 total_reward [train:27.000 test:22.000] e-greedy:0.428: 100%|██████████| 2000/2000 [00:31<00:00, 63.25it/s]
epoch 239 avg_loss: 0.035 total_reward [train:32.000 test:21.000] e-greedy:0.430: 100%|██████████| 2000/2000 [00:30<00:00, 64.93it/s]
epoch 240 avg_loss: 0.036 total_reward [train:31.000 test:17.000] e-greedy:0.432: 100%|██████████| 2000/2000 [00:30<00:00, 64.82it/s]
epoch 241 each step reward:0.000:   0%|          | 10/2000 [00:00<00:52, 37.94it/s]

Update 22.0


epoch 241 avg_loss: 0.036 total_reward [train:30.000 test:20.000] e-greedy:0.434: 100%|██████████| 2000/2000 [00:31<00:00, 63.60it/s]
epoch 242 avg_loss: 0.034 total_reward [train:31.000 test:23.000] e-greedy:0.436: 100%|██████████| 2000/2000 [00:32<00:00, 60.93it/s]
epoch 243 avg_loss: 0.035 total_reward [train:34.000 test:16.000] e-greedy:0.437: 100%|██████████| 2000/2000 [00:31<00:00, 63.58it/s]
epoch 244 avg_loss: 0.033 total_reward [train:32.000 test:22.000] e-greedy:0.439: 100%|██████████| 2000/2000 [00:31<00:00, 63.95it/s]
epoch 245 avg_loss: 0.034 total_reward [train:29.000 test:20.000] e-greedy:0.441: 100%|██████████| 2000/2000 [00:32<00:00, 61.82it/s]
epoch 246 each step reward:0.000:   1%|          | 12/2000 [00:00<00:31, 62.62it/s]

Update 23.0


epoch 246 avg_loss: 0.037 total_reward [train:29.000 test:20.000] e-greedy:0.443: 100%|██████████| 2000/2000 [00:31<00:00, 62.78it/s]
epoch 247 avg_loss: 0.036 total_reward [train:38.000 test:18.000] e-greedy:0.445: 100%|██████████| 2000/2000 [00:30<00:00, 64.97it/s]
epoch 248 avg_loss: 0.036 total_reward [train:34.000 test:21.000] e-greedy:0.446: 100%|██████████| 2000/2000 [00:31<00:00, 63.54it/s]
epoch 249 avg_loss: 0.036 total_reward [train:34.000 test:19.000] e-greedy:0.448: 100%|██████████| 2000/2000 [00:30<00:00, 66.33it/s]
epoch 250 avg_loss: 0.036 total_reward [train:36.000 test:22.000] e-greedy:0.450: 100%|██████████| 2000/2000 [00:31<00:00, 62.69it/s]
epoch 251 each step reward:0.000:   1%|          | 14/2000 [00:00<00:28, 68.58it/s]

Update 22.0


epoch 251 avg_loss: 0.038 total_reward [train:31.000 test:17.000] e-greedy:0.452: 100%|██████████| 2000/2000 [00:30<00:00, 64.68it/s]
epoch 252 avg_loss: 0.036 total_reward [train:29.000 test:14.000] e-greedy:0.454: 100%|██████████| 2000/2000 [00:32<00:00, 62.40it/s]
epoch 253 avg_loss: 0.037 total_reward [train:32.000 test:22.000] e-greedy:0.455: 100%|██████████| 2000/2000 [00:31<00:00, 64.44it/s]
epoch 254 avg_loss: 0.036 total_reward [train:28.000 test:23.000] e-greedy:0.457: 100%|██████████| 2000/2000 [00:30<00:00, 65.84it/s]
epoch 255 avg_loss: 0.037 total_reward [train:30.000 test:20.000] e-greedy:0.459: 100%|██████████| 2000/2000 [00:31<00:00, 62.85it/s]
epoch 256 each step reward:1.000:   1%|          | 17/2000 [00:00<00:24, 81.25it/s]

Update 23.0


epoch 256 avg_loss: 0.036 total_reward [train:33.000 test:21.000] e-greedy:0.461: 100%|██████████| 2000/2000 [00:29<00:00, 74.48it/s]
epoch 257 avg_loss: 0.036 total_reward [train:29.000 test:21.000] e-greedy:0.463: 100%|██████████| 2000/2000 [00:30<00:00, 76.86it/s]
epoch 258 avg_loss: 0.035 total_reward [train:31.000 test:18.000] e-greedy:0.464: 100%|██████████| 2000/2000 [00:30<00:00, 69.58it/s]
epoch 259 avg_loss: 0.035 total_reward [train:33.000 test:18.000] e-greedy:0.466: 100%|██████████| 2000/2000 [00:30<00:00, 65.59it/s]
epoch 260 avg_loss: 0.034 total_reward [train:29.000 test:20.000] e-greedy:0.468: 100%|██████████| 2000/2000 [00:30<00:00, 66.03it/s]
epoch 261 each step reward:1.000:   1%|          | 14/2000 [00:00<00:29, 67.81it/s]

Update 21.0


epoch 261 avg_loss: 0.037 total_reward [train:34.000 test:21.000] e-greedy:0.470: 100%|██████████| 2000/2000 [00:29<00:00, 66.67it/s]
epoch 262 avg_loss: 0.035 total_reward [train:29.000 test:22.000] e-greedy:0.472: 100%|██████████| 2000/2000 [00:30<00:00, 76.13it/s]
epoch 263 avg_loss: 0.036 total_reward [train:34.000 test:21.000] e-greedy:0.473: 100%|██████████| 2000/2000 [00:30<00:00, 66.58it/s]
epoch 264 avg_loss: 0.035 total_reward [train:32.000 test:19.000] e-greedy:0.475: 100%|██████████| 2000/2000 [00:30<00:00, 66.24it/s]
epoch 265 avg_loss: 0.035 total_reward [train:36.000 test:23.000] e-greedy:0.477: 100%|██████████| 2000/2000 [00:30<00:00, 66.50it/s]
epoch 266 each step reward:0.000:   1%|          | 12/2000 [00:00<00:32, 61.25it/s]

Update 23.0


epoch 266 avg_loss: 0.037 total_reward [train:35.000 test:20.000] e-greedy:0.479: 100%|██████████| 2000/2000 [00:31<00:00, 64.40it/s]
epoch 267 avg_loss: 0.036 total_reward [train:33.000 test:21.000] e-greedy:0.481: 100%|██████████| 2000/2000 [00:29<00:00, 67.33it/s]
epoch 268 avg_loss: 0.034 total_reward [train:33.000 test:16.000] e-greedy:0.482: 100%|██████████| 2000/2000 [00:30<00:00, 65.22it/s]
epoch 269 avg_loss: 0.035 total_reward [train:35.000 test:22.000] e-greedy:0.484: 100%|██████████| 2000/2000 [00:31<00:00, 63.98it/s]
epoch 270 avg_loss: 0.034 total_reward [train:37.000 test:21.000] e-greedy:0.486: 100%|██████████| 2000/2000 [00:29<00:00, 66.88it/s]
epoch 271 each step reward:0.000:   1%|          | 13/2000 [00:00<00:30, 65.95it/s]

Update 22.0


epoch 271 avg_loss: 0.038 total_reward [train:37.000 test:21.000] e-greedy:0.488: 100%|██████████| 2000/2000 [00:29<00:00, 67.59it/s]
epoch 272 avg_loss: 0.036 total_reward [train:37.000 test:20.000] e-greedy:0.490: 100%|██████████| 2000/2000 [00:30<00:00, 66.12it/s]
epoch 273 avg_loss: 0.036 total_reward [train:36.000 test:23.000] e-greedy:0.491: 100%|██████████| 2000/2000 [00:31<00:00, 63.61it/s]
epoch 274 avg_loss: 0.037 total_reward [train:36.000 test:21.000] e-greedy:0.493: 100%|██████████| 2000/2000 [00:31<00:00, 63.40it/s]
epoch 275 avg_loss: 0.036 total_reward [train:36.000 test:16.000] e-greedy:0.495: 100%|██████████| 2000/2000 [00:31<00:00, 63.18it/s]
epoch 276 each step reward:0.000:   1%|          | 11/2000 [00:00<00:41, 48.17it/s]

Update 23.0


epoch 276 avg_loss: 0.038 total_reward [train:37.000 test:20.000] e-greedy:0.497: 100%|██████████| 2000/2000 [00:31<00:00, 63.95it/s]
epoch 277 avg_loss: 0.037 total_reward [train:41.000 test:24.000] e-greedy:0.499: 100%|██████████| 2000/2000 [00:31<00:00, 71.49it/s]
epoch 278 avg_loss: 0.036 total_reward [train:37.000 test:21.000] e-greedy:0.500: 100%|██████████| 2000/2000 [00:31<00:00, 63.04it/s]
epoch 279 avg_loss: 0.037 total_reward [train:36.000 test:22.000] e-greedy:0.502: 100%|██████████| 2000/2000 [00:31<00:00, 62.56it/s]
epoch 280 avg_loss: 0.036 total_reward [train:37.000 test:23.000] e-greedy:0.504: 100%|██████████| 2000/2000 [00:31<00:00, 63.37it/s]
epoch 281 each step reward:0.000:   1%|          | 12/2000 [00:00<00:39, 49.93it/s]

Update 24.0


epoch 281 avg_loss: 0.040 total_reward [train:36.000 test:19.000] e-greedy:0.506: 100%|██████████| 2000/2000 [00:30<00:00, 65.68it/s]
epoch 282 avg_loss: 0.038 total_reward [train:37.000 test:23.000] e-greedy:0.508: 100%|██████████| 2000/2000 [00:31<00:00, 62.66it/s]
epoch 283 avg_loss: 0.036 total_reward [train:41.000 test:21.000] e-greedy:0.509: 100%|██████████| 2000/2000 [00:31<00:00, 63.38it/s]
epoch 284 avg_loss: 0.037 total_reward [train:39.000 test:22.000] e-greedy:0.511: 100%|██████████| 2000/2000 [00:31<00:00, 63.69it/s]
epoch 285 avg_loss: 0.037 total_reward [train:36.000 test:23.000] e-greedy:0.513: 100%|██████████| 2000/2000 [00:31<00:00, 62.58it/s]
epoch 286 each step reward:0.000:   0%|          | 10/2000 [00:00<00:45, 43.99it/s]

Update 23.0


epoch 286 avg_loss: 0.041 total_reward [train:38.000 test:19.000] e-greedy:0.515: 100%|██████████| 2000/2000 [00:31<00:00, 63.32it/s]
epoch 287 avg_loss: 0.039 total_reward [train:37.000 test:19.000] e-greedy:0.517: 100%|██████████| 2000/2000 [00:31<00:00, 62.92it/s]
epoch 288 avg_loss: 0.039 total_reward [train:37.000 test:20.000] e-greedy:0.518: 100%|██████████| 2000/2000 [00:31<00:00, 64.09it/s]
epoch 289 avg_loss: 0.040 total_reward [train:40.000 test:22.000] e-greedy:0.520: 100%|██████████| 2000/2000 [00:31<00:00, 63.56it/s]
epoch 290 avg_loss: 0.038 total_reward [train:38.000 test:20.000] e-greedy:0.522: 100%|██████████| 2000/2000 [00:31<00:00, 64.38it/s]
epoch 291 each step reward:0.000:   1%|          | 16/2000 [00:00<00:25, 77.76it/s]

Update 22.0


epoch 291 avg_loss: 0.039 total_reward [train:36.000 test:22.000] e-greedy:0.524: 100%|██████████| 2000/2000 [00:31<00:00, 64.14it/s]
epoch 292 avg_loss: 0.037 total_reward [train:37.000 test:22.000] e-greedy:0.526: 100%|██████████| 2000/2000 [00:31<00:00, 63.36it/s]
epoch 293 avg_loss: 0.037 total_reward [train:38.000 test:23.000] e-greedy:0.527: 100%|██████████| 2000/2000 [00:30<00:00, 66.01it/s]
epoch 294 avg_loss: 0.037 total_reward [train:34.000 test:22.000] e-greedy:0.529: 100%|██████████| 2000/2000 [00:31<00:00, 63.33it/s]
epoch 295 avg_loss: 0.037 total_reward [train:37.000 test:22.000] e-greedy:0.531: 100%|██████████| 2000/2000 [00:30<00:00, 68.11it/s]
epoch 296 each step reward:0.000:   1%|          | 16/2000 [00:00<00:26, 76.21it/s]

Update 23.0


epoch 296 avg_loss: 0.040 total_reward [train:37.000 test:21.000] e-greedy:0.533: 100%|██████████| 2000/2000 [00:29<00:00,  8.75it/s]
epoch 297 avg_loss: 0.038 total_reward [train:36.000 test:24.000] e-greedy:0.535: 100%|██████████| 2000/2000 [00:29<00:00, 67.37it/s]
epoch 298 avg_loss: 0.038 total_reward [train:34.000 test:21.000] e-greedy:0.536: 100%|██████████| 2000/2000 [00:29<00:00, 67.08it/s]
epoch 299 avg_loss: 0.038 total_reward [train:42.000 test:22.000] e-greedy:0.538: 100%|██████████| 2000/2000 [00:31<00:00, 63.69it/s]
epoch 300 avg_loss: 0.038 total_reward [train:38.000 test:23.000] e-greedy:0.540: 100%|██████████| 2000/2000 [00:31<00:00, 63.98it/s]
epoch 301 each step reward:0.000:   1%|          | 14/2000 [00:00<00:29, 67.22it/s]

Update 24.0


epoch 301 avg_loss: 0.039 total_reward [train:39.000 test:25.000] e-greedy:0.542: 100%|██████████| 2000/2000 [00:31<00:00, 64.29it/s]
epoch 302 avg_loss: 0.039 total_reward [train:39.000 test:21.000] e-greedy:0.544: 100%|██████████| 2000/2000 [00:31<00:00, 63.03it/s]
epoch 303 avg_loss: 0.037 total_reward [train:36.000 test:25.000] e-greedy:0.545: 100%|██████████| 2000/2000 [00:31<00:00, 63.19it/s]
epoch 304 avg_loss: 0.038 total_reward [train:41.000 test:23.000] e-greedy:0.547: 100%|██████████| 2000/2000 [00:31<00:00, 63.44it/s]
epoch 305 avg_loss: 0.039 total_reward [train:37.000 test:22.000] e-greedy:0.549: 100%|██████████| 2000/2000 [00:32<00:00, 62.14it/s]
epoch 306 each step reward:0.000:   1%|          | 14/2000 [00:00<00:30, 65.90it/s]

Update 25.0


epoch 306 avg_loss: 0.042 total_reward [train:42.000 test:26.000] e-greedy:0.551: 100%|██████████| 2000/2000 [00:31<00:00, 63.85it/s]
epoch 307 avg_loss: 0.040 total_reward [train:40.000 test:22.000] e-greedy:0.553: 100%|██████████| 2000/2000 [00:31<00:00, 64.25it/s]
epoch 308 avg_loss: 0.040 total_reward [train:38.000 test:22.000] e-greedy:0.554: 100%|██████████| 2000/2000 [00:31<00:00, 64.03it/s]
epoch 309 avg_loss: 0.039 total_reward [train:40.000 test:22.000] e-greedy:0.556: 100%|██████████| 2000/2000 [00:29<00:00, 72.20it/s]
epoch 310 avg_loss: 0.039 total_reward [train:38.000 test:24.000] e-greedy:0.558: 100%|██████████| 2000/2000 [00:29<00:00, 66.96it/s]
epoch 311 each step reward:0.000:   1%|          | 14/2000 [00:00<00:28, 70.88it/s]

Update 26.0


epoch 311 avg_loss: 0.040 total_reward [train:38.000 test:22.000] e-greedy:0.560: 100%|██████████| 2000/2000 [00:29<00:00, 66.70it/s]
epoch 312 avg_loss: 0.039 total_reward [train:41.000 test:20.000] e-greedy:0.562: 100%|██████████| 2000/2000 [00:30<00:00, 66.51it/s]
epoch 313 avg_loss: 0.037 total_reward [train:41.000 test:24.000] e-greedy:0.563: 100%|██████████| 2000/2000 [00:29<00:00, 67.77it/s]
epoch 314 avg_loss: 0.038 total_reward [train:42.000 test:23.000] e-greedy:0.565: 100%|██████████| 2000/2000 [00:30<00:00, 66.36it/s]
epoch 315 avg_loss: 0.038 total_reward [train:41.000 test:24.000] e-greedy:0.567: 100%|██████████| 2000/2000 [00:31<00:00, 71.81it/s]
epoch 316 each step reward:0.000:   1%|          | 14/2000 [00:00<00:27, 71.96it/s]

Update 24.0


epoch 316 avg_loss: 0.044 total_reward [train:39.000 test:23.000] e-greedy:0.569: 100%|██████████| 2000/2000 [00:31<00:00, 63.97it/s]
epoch 317 avg_loss: 0.044 total_reward [train:45.000 test:23.000] e-greedy:0.571: 100%|██████████| 2000/2000 [00:31<00:00, 63.96it/s]
epoch 318 each step reward:37.000:  90%|█████████ | 1800/2000 [00:30<00:03, 55.75it/s]

KeyboardInterrupt: 

epoch 318 each step reward:37.000:  90%|█████████ | 1801/2000 [00:50<00:03, 55.75it/s]
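
The per-epoch summary lines in the output above follow a fixed pattern, so the reward curve can be recovered from a saved copy of the log. The following is a minimal sketch, assuming the log text is available as a string; `SUMMARY_RE`, `parse_log`, and `LOG_SAMPLE` are hypothetical names introduced here for illustration and are not part of renom_rl. The sample uses the first and last complete summary lines from the log, with the progress bar shortened. It also estimates how fast the logged `e-greedy` value grows per epoch, which for these two lines comes out at roughly 0.0018.

```python
import re

# Matches summary lines such as:
# "epoch 173 avg_loss: 0.025 total_reward [train:24.000 test:24.000] e-greedy:0.311: 100%|..."
# Progress lines ("each step reward") and "Update" lines do not match and are skipped.
SUMMARY_RE = re.compile(
    r"epoch (\d+) avg_loss: ([\d.]+) total_reward "
    r"\[train:([\d.]+) test:([\d.]+)\] e-greedy:([\d.]+)"
)


def parse_log(text):
    """Return one dict per epoch summary line found in `text` (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        m = SUMMARY_RE.search(line)
        if m:
            rows.append({
                "epoch": int(m.group(1)),
                "avg_loss": float(m.group(2)),
                "train": float(m.group(3)),
                "test": float(m.group(4)),
                "greedy": float(m.group(5)),
            })
    return rows


# Shortened sample taken from the log above (progress bars truncated).
LOG_SAMPLE = """\
epoch 173 avg_loss: 0.025 total_reward [train:24.000 test:24.000] e-greedy:0.311: 100%
epoch 176 each step reward:0.000:   1%
Update 25.0
epoch 317 avg_loss: 0.044 total_reward [train:45.000 test:23.000] e-greedy:0.571: 100%
"""

rows = parse_log(LOG_SAMPLE)

# Mean test reward over the parsed epochs.
mean_test = sum(r["test"] for r in rows) / len(rows)

# Average per-epoch change of the logged e-greedy value between first and last row.
slope = (rows[-1]["greedy"] - rows[0]["greedy"]) / (rows[-1]["epoch"] - rows[0]["epoch"])

print(len(rows), mean_test, round(slope, 4))
```

Feeding the full log through `parse_log` gives lists that can be plotted with `matplotlib`, which is already imported at the top of the notebook.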

In [None]:
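# Save the trained Q-network weights to an HDF5 file.
q_network.save("dqn_exp5.h5")
# Uncomment to re-create the DQN agent from the environment and the network.
# model = DQN(custom_env, q_network)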

In [None]:
# Run the agent in test mode while rendering the game screen.
model.test(render=True)

In [None]:
import time

# Time a single call that generates a random permutation of 10 elements.
start_t = time.perf_counter()
a = np.random.permutation(10)
print(time.perf_counter() - start_t)
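
A single measurement of such a short call is mostly timer noise. The sketch below, which assumes only the standard-library `timeit` module and NumPy, repeats the call and reports the average time per call; `n_calls` is an arbitrary value chosen for illustration.

```python
import timeit

import numpy as np

# Repeat the call many times so that timer resolution does not dominate.
n_calls = 10000
total = timeit.timeit(lambda: np.random.permutation(10), number=n_calls)

# Average wall-clock time of one permutation call, in seconds.
per_call = total / n_calls
print("{:.2e} s per call".format(per_call))
```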

