<a href="https://colab.research.google.com/github/argonism/TsukurinagaraRL/blob/master/Zerokara_chap6_Prioritized_Experience_Replay.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prioritized Experience Replay

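Prioritized Experience Replay preferentially learns from the transitions of states s whose Q-values have not yet been learned well.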

## Priority criterion
Transitions are ranked by the absolute error with respect to the Bellman equation:
$$Q(s_t, a_t) = R_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$$

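Taking the greedy next action (the max over a), the absolute difference between the two sides is referred to here as the TD error (strictly speaking, it is apparently not exactly the TD error):
$$\left| R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right|$$

Transitions with a large error (i.e. where learning has not progressed) are replayed preferentially during Experience Replay.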
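
As a small worked example of the formula above, the sketch below evaluates the absolute TD error for a single transition. The Q-values, the reward and the helper name `abs_td_error` are made up for illustration; `gamma = 0.99` matches the `GAMMA` set later in this notebook.

```python
def abs_td_error(reward, gamma, q_next, q_current, done=False):
    """|R_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)| for one transition."""
    # A terminal transition has no next state, so its bootstrap term is 0.
    max_q_next = 0.0 if done else max(q_next)
    return abs(reward + gamma * max_q_next - q_current)

# Hypothetical values: Q(s_{t+1}, .) = [0.2, 0.5], Q(s_t, a_t) = 0.3, reward 0.
err = abs_td_error(reward=0.0, gamma=0.99, q_next=[0.2, 0.5], q_current=0.3)
print(round(err, 3))  # 0.195

# The same transition ending the episode with reward -1 has a much larger error.
err_terminal = abs_td_error(-1.0, 0.99, [0.2, 0.5], 0.3, done=True)
print(round(err_terminal, 3))  # 1.3
```

With these numbers the terminal transition would be ranked well ahead of the non-terminal one.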

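A separate class that stores the TD errors is prepared alongside the memory class.

During training, a mini-batch of transitions is drawn with probabilities proportional to their TD errors.
Indices are chosen by drawing one uniform random number per mini-batch element from the range 0 to the sum of the absolute TD errors.
- The larger the absolute TD error, the more likely the transition is to be selected.
(Because each draw is independent, the same transition can be selected more than once.)

However, to prevent transitions whose TD error has become very small from never being replayed, a small constant TD_ERROR_EPSILON is added to each error.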
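
The sampling rule described above can be sketched with a cumulative sum. This is a minimal illustration, not the class used later in this notebook: the function name `sample_prioritized`, the example TD errors and the use of `np.searchsorted` are assumptions made for the sketch.

```python
import numpy as np

TD_ERROR_EPSILON = 0.0001  # same small constant as in this notebook

def sample_prioritized(td_errors, batch_size, rng):
    """Draw indices with probability proportional to |TD error| + epsilon."""
    priorities = np.abs(np.asarray(td_errors, dtype=float)) + TD_ERROR_EPSILON
    cumulative = np.cumsum(priorities)  # right edge of each index's interval
    # Uniform draws over [0, total); each draw lands in exactly one interval.
    draws = rng.uniform(0.0, cumulative[-1], size=batch_size)
    idx = np.searchsorted(cumulative, draws, side="right")
    # Guard against floating-point rounding at the very top edge.
    return np.minimum(idx, len(priorities) - 1)

rng = np.random.default_rng(0)
picked = sample_prioritized([0.0, 0.1, 2.0, 0.0], batch_size=1000, rng=rng)
counts = np.bincount(picked, minlength=4)
print(counts)  # index 2 dominates; repeated indices are expected
```

Entries with a TD error of 0 still own an interval of width `TD_ERROR_EPSILON`, so they keep a small but non-zero chance of being replayed.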






In [None]:
# Install the required packages
# gym==0.17.2 pyvirtualdisplay==1.3.2
# xvfb=2:1.19.6-1ubuntu4.4 python-opengl=3.1.0+dfsg-1 ffmpeg=7:3.4.8-0ubuntu0.2
# JSAnimation==0.1
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!pip install JSAnimation > /dev/null 2>&1

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import gym
from gym.wrappers import Monitor

In [None]:
# Declare the video rendering functions
import glob
import io
import os
import base64
from JSAnimation.IPython_display import display_animation
from matplotlib import animation
from IPython import display as ipythondisplay
from IPython.display import HTML
from pyvirtualdisplay import Display

display = Display(visible=0, size=(640, 400))
display.start()

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    
def reset_video():
  mp4list = glob.glob('video/*.mp4')
  for mp4 in mp4list:
    os.remove(mp4)

def wrap_env(env):
  env = Monitor(env, './video', force=True, video_callable=(lambda ep: ep % 10 == 0))
  reset_video()
  return env

In [None]:
import random  # used by sample(); Transition is defined in a later cell

class ReplayMemory:
  def __init__(self, CAPACITY):
    self.capacity = CAPACITY
    self.memory = []
    self.index = 0
  
  def push(self, state, action, state_next, reward):
    if len(self.memory) < self.capacity:
      self.memory.append(None)
    
    self.memory[self.index] = Transition(state, action, state_next, reward)

    self.index = (self.index + 1) % self.capacity
  
  def sample(self, batch_size):
    return random.sample(self.memory, batch_size)
  
  def __len__(self):
    return len(self.memory)


In [None]:
from torch import nn
import torch.nn.functional as F

class Net(nn.Module):

  def __init__(self, n_in, n_mid, n_out):
    super(Net, self).__init__()
    self.fc1 = nn.Linear(n_in, n_mid)
    self.fc2 = nn.Linear(n_mid, n_mid)
    self.fc3 = nn.Linear(n_mid, n_out)
  
  def forward(self, x):
    h1 = F.relu(self.fc1(x))
    h2 = F.relu(self.fc2(h1))
    output = self.fc3(h2)
    return output

In [None]:
TD_ERROR_EPSILON = 0.0001

class TDerrorMemory:
  def __init__(self, CAPACITY):
    self.capacity = CAPACITY
    self.memory = []
    self.index = 0

  def push(self, td_error):

    # Grow the list by one slot until capacity is reached so the indexed write below is valid.
    if len(self.memory) < self.capacity:
      self.memory.append(None)

    self.memory[self.index] = td_error
    self.index = (self.index + 1) % self.capacity

  def __len__(self):
    return len(self.memory)
  
  def get_prioritized_indexes(self, batch_size):
    # Compute the sum of the absolute TD errors,
    sum_absolute_td_error = np.sum(np.absolute(self.memory))
    # plus the small constant for every stored entry.
    sum_absolute_td_error += TD_ERROR_EPSILON * len(self.memory)

    # Generate batch_size random numbers in the range 0 to (sum of absolute TD errors).
    rand_list = np.random.uniform(0, sum_absolute_td_error, batch_size)
    rand_list = np.sort(rand_list)

    # Each random number picks one transition index.
    # Accumulate the TD errors starting from index 0 and take the index whose
    # error makes the running sum reach the random number.
    # A larger TD error covers a wider interval, so it is more likely to be picked.
    indexes = []
    idx = 0
    tmp_sum_absolute_td_error = 0
    for rand_num in rand_list:
      while tmp_sum_absolute_td_error < rand_num and idx < len(self.memory):
        tmp_sum_absolute_td_error += (
            abs(self.memory[idx]) + TD_ERROR_EPSILON
        )
        idx += 1
      # idx is now one past the entry that was just added, so step back by one
      # (max() covers a draw of exactly 0, where the loop never runs).
      indexes.append(max(idx - 1, 0))

    return indexes

  def update_td_error(self, update_td_errors):
    self.memory = update_td_errors
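
Both memory classes share the same `push` logic: the list grows until it reaches capacity, after which the oldest slot is overwritten. The sketch below isolates that behaviour; `RingBuffer` is a hypothetical stand-in written for illustration, not a class from this notebook.

```python
class RingBuffer:
    """Hypothetical stand-in with the same push logic as the memory classes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.index = 0

    def push(self, item):
        if len(self.memory) < self.capacity:
            self.memory.append(None)  # reserve a slot so the indexed write is valid
        self.memory[self.index] = item
        self.index = (self.index + 1) % self.capacity  # wrap around at capacity


buf = RingBuffer(3)
for item in "abcde":
    buf.push(item)
print(buf.memory)  # ['d', 'e', 'c']
```

Because `ReplayMemory` and `TDerrorMemory` are pushed to in lockstep, the same index refers to the same transition in both.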
  


In [None]:
import random
import torch
from torch import nn
from torch import optim
import torch.nn.functional as F

BATCH_SIZE = 32
CAPACITY = 10000

class Brain:
  def __init__(self, num_states, num_actions):
    self.num_actions = num_actions

    self.memory = ReplayMemory(CAPACITY)
    
    n_in, n_mid, n_out = num_states, 32, num_actions
    self.main_q_network = Net(n_in, n_mid, n_out)
    self.target_q_network = Net(n_in, n_mid, n_out)
    print(self.main_q_network)

    self.optimizer = optim.Adam(self.main_q_network.parameters(), lr=0.0001)

    self.td_error_memory = TDerrorMemory(CAPACITY)

  def replay(self, episode):
    '''Learn the network weights from the stored states, actions and results'''

    if len(self.memory) < BATCH_SIZE:
      return
    
    # Build the mini-batch
    self.batch, self.state_batch, self.action_batch, self.reward_batch, self.non_final_next_states = self.make_minibatch(episode)

    # Compute the target value for Q(s_t, a_t)
    self.expected_state_action_values = self.get_expected_state_action_values()

    # Train
    self.update_main_q_network()

  def decide_action(self, state, episode):
    epsilon = 0.5 * (1 / (episode + 1))

    if epsilon <= np.random.uniform(0, 1):
      self.main_q_network.eval()
      with torch.no_grad():
        action = self.main_q_network(state).max(1)[1].view(1, 1)
        # view(1, 1) reshapes the [torch.LongTensor of size 1] into size 1x1 (one row, one column),
        # so that actions can later be concatenated into a (batch_size x 1) tensor.

    else:
      action = torch.LongTensor([[random.randrange(self.num_actions)]])
    
    return action

  def make_minibatch(self, episode):
    if episode < 30:
      transitions = self.memory.sample(BATCH_SIZE)
    else:
      indexes = self.td_error_memory.get_prioritized_indexes(BATCH_SIZE)
      transitions = [self.memory.memory[n] for n in indexes]
    
    batch = Transition(*zip(*transitions))
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)
    non_final_next_states = torch.cat([s for s in batch.next_state if s is not None])

    return batch, state_batch, action_batch, reward_batch, non_final_next_states
  
  def get_expected_state_action_values(self):
    self.main_q_network.eval()
    self.target_q_network.eval()
    # self.main_q_network(state_batch) returns a (batch_size x 2) tensor holding the Q-value of each action.
    # action_batch has shape (batch_size x 1) with entries 0 or 1.
    # gather then picks the Q-value of the action that was actually taken.
    self.state_action_values = self.main_q_network(self.state_batch).gather(1, self.action_batch)

    # Compute max{Q(s_t+1, a)}, taking care of whether a next state exists.
    # Build a boolean index mask: indexing with it selects the entries that are True,
    # so values can be assigned to just those entries.
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, self.batch.next_state)), dtype=torch.bool)

    next_state_values = torch.zeros(BATCH_SIZE)

    a_m = torch.zeros(BATCH_SIZE).type(torch.LongTensor)

    a_m[non_final_mask] = self.main_q_network(self.non_final_next_states).detach().max(1)[1]

    a_m_non_final_next_states = a_m[non_final_mask].view(-1, 1)

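    # Feed the next states to the target network and take the Q-value of the action chosen above by the main network.
    # Assign it to the corresponding entries of next_state_values.
    # detach() cuts the result out of the computation graph so that it is treated as a constant and not differentiated.
    # The Q-value of the current state is the quantity being updated, so it is not detached.
    next_state_values[non_final_mask] = self.target_q_network(self.non_final_next_states).gather(1, a_m_non_final_next_states).detach().squeeze()

    # Compute the target Q-value: the right-hand side of the equation in the explanation.
    expected_state_action_values = self.reward_batch + GAMMA * next_state_values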

    return expected_state_action_values

  def update_main_q_network(self):
    self.main_q_network.train()

    # smooth_l1_loss is the Huber loss.
    # expected_state_action_values has shape (batch_size,), so unsqueeze(1) makes it (batch_size x 1) to match state_action_values.
    loss = F.smooth_l1_loss(self.state_action_values, self.expected_state_action_values.unsqueeze(1))
    
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

  def update_target_q_network(self):
    self.target_q_network.load_state_dict(self.main_q_network.state_dict())
  
  def update_td_error_memory(self):
    ''' Update the stored TD errors '''
    self.main_q_network.eval()
    self.target_q_network.eval()

    transitions = self.memory.memory
    batch = Transition(*zip(*transitions))

    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)
    non_final_next_states = torch.cat([s for s in batch.next_state if s is not None])

    state_action_values = self.main_q_network(state_batch).gather(1, action_batch)

    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.next_state)), dtype=torch.bool)

    next_state_values = torch.zeros(len(self.memory))
    a_m = torch.zeros(len(self.memory)).type(torch.LongTensor)

    a_m[non_final_mask] = self.main_q_network(non_final_next_states).detach().max(1)[1]

    # Keep only the entries that have a next state and reshape size N to N x 1.
    a_m_non_final_next_states = a_m[non_final_mask].view(-1, 1)

    # Feed the next states to the target network and take the Q-value of the action chosen above by the main network.
    # Assign it to the corresponding entries of next_state_values.
    # detach() cuts the result out of the computation graph so that it is treated as a constant and not differentiated.
    # The Q-value of the current state is the quantity being updated, so it is not detached.
    next_state_values[non_final_mask] = self.target_q_network(non_final_next_states).gather(1, a_m_non_final_next_states).detach().squeeze()
    td_errors = (reward_batch + GAMMA * next_state_values) - state_action_values.squeeze()

    self.td_error_memory.memory = td_errors.detach().numpy().tolist()
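
`get_expected_state_action_values` above selects the next action with the main network and reads its value from the target network (the Double DQN pattern), leaving 0 for transitions that ended the episode. The sketch below mirrors that indexing on hand-written Q-values. The arrays and the function name `double_dqn_targets` are hypothetical, and NumPy stands in for the torch calls (`np.take_along_axis` in place of `gather`, a boolean array in place of `non_final_mask`).

```python
import numpy as np

GAMMA = 0.99

def double_dqn_targets(rewards, non_final_mask, q_main_next, q_target_next):
    """r + GAMMA * Q_target(s', argmax_a Q_main(s', a)); no bootstrap if terminal."""
    next_values = np.zeros(len(rewards))
    # Action selection uses the main network's Q-values ...
    best_actions = q_main_next.argmax(axis=1).reshape(-1, 1)
    # ... while the value of that action is read from the target network.
    chosen = np.take_along_axis(q_target_next, best_actions, axis=1).squeeze(1)
    next_values[non_final_mask] = chosen
    return rewards + GAMMA * next_values

rewards = np.array([0.0, -1.0, 0.0])
non_final_mask = np.array([True, False, True])    # the 2nd transition ended the episode
q_main_next = np.array([[0.1, 0.9], [0.8, 0.2]])   # rows: the two non-terminal next states
q_target_next = np.array([[0.5, 0.3], [0.4, 0.7]])
targets = double_dqn_targets(rewards, non_final_mask, q_main_next, q_target_next)
print(targets)  # approximately [0.297, -1.0, 0.396]
```

Note that the first row uses 0.3 rather than the target network's own maximum of 0.5, because the action is chosen by the main network.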


In [None]:
class Agent:
  def __init__(self, num_states, num_actions):
    self.brain = Brain(num_states, num_actions)
  
  def update_q_function(self, episode):
    self.brain.replay(episode)
  
  def get_action(self, state, episode):
    action = self.brain.decide_action(state, episode)
    return action

  def memorize(self, state, action, state_next, reward):
    self.brain.memory.push(state, action, state_next, reward)

  def update_target_q_function(self):
    self.brain.update_target_q_network()
  
  def memorize_td_error(self, td_error):
    self.brain.td_error_memory.push(td_error)
  
  def update_td_error_memory(self):
    self.brain.update_td_error_memory()
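
`Agent.get_action` delegates to `Brain.decide_action`, which takes a random action with probability `epsilon = 0.5 * (1 / (episode + 1))`. The short sketch below tabulates how quickly that exploration rate decays; the helper name `epsilon_for` is introduced only for this illustration.

```python
def epsilon_for(episode):
    """Exploration rate used in decide_action: 0.5 / (episode + 1)."""
    return 0.5 * (1 / (episode + 1))

schedule = {ep: epsilon_for(ep) for ep in (0, 1, 9, 99)}
for ep, eps in schedule.items():
    print(f"episode {ep}: epsilon = {eps:.3f}")
```

By episode 9 roughly 1 action in 20 is random, and by episode 99 roughly 1 in 200, so the agent acts almost greedily for most of the run.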


In [None]:
class Environment:
  def __init__(self):
    self.env = wrap_env(gym.make(ENV))
    self.num_states = self.env.observation_space.shape[0]
    self.num_actions = self.env.action_space.n

    self.agent = Agent(self.num_states, self.num_actions)
  
  def run(self):
    # Step counts of the 10 most recent episodes, used to print a running average
    episode_10_list = np.zeros(10)
    # Number of consecutive successful episodes
    complete_episodes = 0
    episode_final = False

    for episode in range(NUM_EPISODES):
      observation = self.env.reset()

      state = observation
      state = torch.from_numpy(state).type(torch.FloatTensor)
      state = torch.unsqueeze(state, 0)

      for step in range(MAX_STEPS):
        
        action = self.agent.get_action(state, episode)

        observation_next, _, done, _ = self.env.step(action.item())

        if done:
          state_next = None
          episode_10_list = np.hstack((episode_10_list[1:], step + 1))

          if step < 195:
            reward = torch.FloatTensor([-1.0])
            complete_episodes = 0
          else:
            reward = torch.FloatTensor([1.0])
            complete_episodes += 1
        else:
          reward = torch.FloatTensor([0.0])
          state_next = observation_next
          state_next = torch.from_numpy(state_next).type(torch.FloatTensor)
          state_next = torch.unsqueeze(state_next, 0)
        
        self.agent.memorize(state, action, state_next, reward)

        self.agent.memorize_td_error(0)

        self.agent.update_q_function(episode)

        state = state_next

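        if done:
          # The Japanese label in the message means "average number of steps over the last 10 trials".
          print(f'{episode} Episode: Finished after {step + 1} time steps: 10思考の平均step数 = {episode_10_list.mean()}')

          # Refresh the stored TD errors with the current networks; otherwise they stay at
          # the 0 pushed by memorize_td_error and prioritized sampling has nothing to rank.
          self.agent.update_td_error_memory()

          if episode % 2 == 0:
            self.agent.update_target_q_function()
          break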
        
      if episode_final is True:
        show_video()
        break
      
      if complete_episodes >= 10:
        print('10 consecutive successes')
        episode_final = True
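
The running average printed by `run` starts from an array of ten zeros, which is why the first logged averages are far below the actual step counts. The sketch below replays the window update for the first three step counts that appear in the training log (18, 11, 14); the variable names `window` and `means` are chosen for this illustration.

```python
import numpy as np

window = np.zeros(10)  # same zero initialisation as episode_10_list
means = []
for steps in (18, 11, 14):  # step counts of episodes 0 to 2 in the training log
    # Drop the oldest entry and append the newest, as run() does with np.hstack.
    window = np.hstack((window[1:], steps))
    means.append(window.mean())
print(means)  # approximately [1.8, 2.9, 4.3]
```

Only from the tenth episode onwards does the printed value reflect a true average over ten episodes.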


In [None]:
from collections import namedtuple
Tr = namedtuple('tr', ('name_a', 'value_b'))
Tr_object = Tr('This is name A', 100)
print(Tr_object)
print(Tr_object.value_b)
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))


ENV = 'CartPole-v0'
GAMMA = 0.99
MAX_STEPS = 200
NUM_EPISODES = 500

tr(name_a='This is name A', value_b=100)
100


In [None]:
cartpole_env = Environment()
cartpole_env.run()

Net(
  (fc1): Linear(in_features=4, out_features=32, bias=True)
  (fc2): Linear(in_features=32, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=2, bias=True)
)
0 Episode: Finished after 18 time steps: 10思考の平均step数 = 1.8
1 Episode: Finished after 11 time steps: 10思考の平均step数 = 2.9
2 Episode: Finished after 14 time steps: 10思考の平均step数 = 4.3




3 Episode: Finished after 13 time steps: 10思考の平均step数 = 5.6
4 Episode: Finished after 20 time steps: 10思考の平均step数 = 7.6
5 Episode: Finished after 17 time steps: 10思考の平均step数 = 9.3
6 Episode: Finished after 22 time steps: 10思考の平均step数 = 11.5
7 Episode: Finished after 18 time steps: 10思考の平均step数 = 13.3
8 Episode: Finished after 15 time steps: 10思考の平均step数 = 14.8
9 Episode: Finished after 15 time steps: 10思考の平均step数 = 16.3




10 Episode: Finished after 22 time steps: 10思考の平均step数 = 16.7
11 Episode: Finished after 39 time steps: 10思考の平均step数 = 19.5
12 Episode: Finished after 24 time steps: 10思考の平均step数 = 20.5
13 Episode: Finished after 21 time steps: 10思考の平均step数 = 21.3
14 Episode: Finished after 88 time steps: 10思考の平均step数 = 28.1
15 Episode: Finished after 9 time steps: 10思考の平均step数 = 27.3
16 Episode: Finished after 13 time steps: 10思考の平均step数 = 26.4
17 Episode: Finished after 25 time steps: 10思考の平均step数 = 27.1
18 Episode: Finished after 22 time steps: 10思考の平均step数 = 27.8
19 Episode: Finished after 9 time steps: 10思考の平均step数 = 27.2




20 Episode: Finished after 10 time steps: 10思考の平均step数 = 26.0
21 Episode: Finished after 8 time steps: 10思考の平均step数 = 22.9
22 Episode: Finished after 10 time steps: 10思考の平均step数 = 21.5
23 Episode: Finished after 10 time steps: 10思考の平均step数 = 20.4
24 Episode: Finished after 34 time steps: 10思考の平均step数 = 15.0
25 Episode: Finished after 31 time steps: 10思考の平均step数 = 17.2
26 Episode: Finished after 29 time steps: 10思考の平均step数 = 18.8
27 Episode: Finished after 19 time steps: 10思考の平均step数 = 18.2
28 Episode: Finished after 48 time steps: 10思考の平均step数 = 20.8
29 Episode: Finished after 38 time steps: 10思考の平均step数 = 23.7




30 Episode: Finished after 64 time steps: 10思考の平均step数 = 29.1
31 Episode: Finished after 25 time steps: 10思考の平均step数 = 30.8
32 Episode: Finished after 49 time steps: 10思考の平均step数 = 34.7
33 Episode: Finished after 27 time steps: 10思考の平均step数 = 36.4
34 Episode: Finished after 89 time steps: 10思考の平均step数 = 41.9
35 Episode: Finished after 27 time steps: 10思考の平均step数 = 41.5
36 Episode: Finished after 20 time steps: 10思考の平均step数 = 40.6
37 Episode: Finished after 22 time steps: 10思考の平均step数 = 40.9
38 Episode: Finished after 23 time steps: 10思考の平均step数 = 38.4
39 Episode: Finished after 20 time steps: 10思考の平均step数 = 36.6




40 Episode: Finished after 31 time steps: 10思考の平均step数 = 33.3
41 Episode: Finished after 30 time steps: 10思考の平均step数 = 33.8
42 Episode: Finished after 29 time steps: 10思考の平均step数 = 31.8
43 Episode: Finished after 40 time steps: 10思考の平均step数 = 33.1
44 Episode: Finished after 29 time steps: 10思考の平均step数 = 27.1
45 Episode: Finished after 31 time steps: 10思考の平均step数 = 27.5
46 Episode: Finished after 25 time steps: 10思考の平均step数 = 28.0
47 Episode: Finished after 48 time steps: 10思考の平均step数 = 30.6
48 Episode: Finished after 26 time steps: 10思考の平均step数 = 30.9
49 Episode: Finished after 60 time steps: 10思考の平均step数 = 34.9




50 Episode: Finished after 64 time steps: 10思考の平均step数 = 38.2
51 Episode: Finished after 57 time steps: 10思考の平均step数 = 40.9
52 Episode: Finished after 60 time steps: 10思考の平均step数 = 44.0
53 Episode: Finished after 57 time steps: 10思考の平均step数 = 45.7
54 Episode: Finished after 33 time steps: 10思考の平均step数 = 46.1
55 Episode: Finished after 48 time steps: 10思考の平均step数 = 47.8
56 Episode: Finished after 24 time steps: 10思考の平均step数 = 47.7
57 Episode: Finished after 36 time steps: 10思考の平均step数 = 46.5
58 Episode: Finished after 62 time steps: 10思考の平均step数 = 50.1
59 Episode: Finished after 31 time steps: 10思考の平均step数 = 47.2




60 Episode: Finished after 66 time steps: 10思考の平均step数 = 47.4
61 Episode: Finished after 33 time steps: 10思考の平均step数 = 45.0
62 Episode: Finished after 38 time steps: 10思考の平均step数 = 42.8
63 Episode: Finished after 99 time steps: 10思考の平均step数 = 47.0
64 Episode: Finished after 34 time steps: 10思考の平均step数 = 47.1
65 Episode: Finished after 44 time steps: 10思考の平均step数 = 46.7
66 Episode: Finished after 50 time steps: 10思考の平均step数 = 49.3
67 Episode: Finished after 49 time steps: 10思考の平均step数 = 50.6
68 Episode: Finished after 31 time steps: 10思考の平均step数 = 47.5
69 Episode: Finished after 122 time steps: 10思考の平均step数 = 56.6




70 Episode: Finished after 35 time steps: 10思考の平均step数 = 53.5
71 Episode: Finished after 87 time steps: 10思考の平均step数 = 58.9
72 Episode: Finished after 57 time steps: 10思考の平均step数 = 60.8
73 Episode: Finished after 56 time steps: 10思考の平均step数 = 56.5
74 Episode: Finished after 47 time steps: 10思考の平均step数 = 57.8
75 Episode: Finished after 61 time steps: 10思考の平均step数 = 59.5
76 Episode: Finished after 200 time steps: 10思考の平均step数 = 74.5
77 Episode: Finished after 37 time steps: 10思考の平均step数 = 73.3
78 Episode: Finished after 63 time steps: 10思考の平均step数 = 76.5
79 Episode: Finished after 39 time steps: 10思考の平均step数 = 68.2




80 Episode: Finished after 84 time steps: 10思考の平均step数 = 73.1
81 Episode: Finished after 53 time steps: 10思考の平均step数 = 69.7
82 Episode: Finished after 66 time steps: 10思考の平均step数 = 70.6
83 Episode: Finished after 82 time steps: 10思考の平均step数 = 73.2
84 Episode: Finished after 63 time steps: 10思考の平均step数 = 74.8
85 Episode: Finished after 38 time steps: 10思考の平均step数 = 72.5
86 Episode: Finished after 54 time steps: 10思考の平均step数 = 57.9
87 Episode: Finished after 68 time steps: 10思考の平均step数 = 61.0
88 Episode: Finished after 96 time steps: 10思考の平均step数 = 64.3
89 Episode: Finished after 56 time steps: 10思考の平均step数 = 66.0




90 Episode: Finished after 66 time steps: 10思考の平均step数 = 64.2
91 Episode: Finished after 51 time steps: 10思考の平均step数 = 64.0
92 Episode: Finished after 54 time steps: 10思考の平均step数 = 62.8
93 Episode: Finished after 61 time steps: 10思考の平均step数 = 60.7
94 Episode: Finished after 63 time steps: 10思考の平均step数 = 60.7
95 Episode: Finished after 58 time steps: 10思考の平均step数 = 62.7
96 Episode: Finished after 57 time steps: 10思考の平均step数 = 63.0
97 Episode: Finished after 38 time steps: 10思考の平均step数 = 60.0
98 Episode: Finished after 52 time steps: 10思考の平均step数 = 55.6
99 Episode: Finished after 42 time steps: 10思考の平均step数 = 54.2




100 Episode: Finished after 40 time steps: 10思考の平均step数 = 51.6
101 Episode: Finished after 57 time steps: 10思考の平均step数 = 52.2
102 Episode: Finished after 43 time steps: 10思考の平均step数 = 51.1
103 Episode: Finished after 200 time steps: 10思考の平均step数 = 65.0
104 Episode: Finished after 61 time steps: 10思考の平均step数 = 64.8
105 Episode: Finished after 57 time steps: 10思考の平均step数 = 64.7
106 Episode: Finished after 159 time steps: 10思考の平均step数 = 74.9
107 Episode: Finished after 73 time steps: 10思考の平均step数 = 78.4
108 Episode: Finished after 154 time steps: 10思考の平均step数 = 88.6
109 Episode: Finished after 162 time steps: 10思考の平均step数 = 100.6




110 Episode: Finished after 169 time steps: 10思考の平均step数 = 113.5
111 Episode: Finished after 98 time steps: 10思考の平均step数 = 117.6
112 Episode: Finished after 122 time steps: 10思考の平均step数 = 125.5
113 Episode: Finished after 138 time steps: 10思考の平均step数 = 119.3
114 Episode: Finished after 65 time steps: 10思考の平均step数 = 119.7
115 Episode: Finished after 146 time steps: 10思考の平均step数 = 128.6
116 Episode: Finished after 130 time steps: 10思考の平均step数 = 125.7
117 Episode: Finished after 69 time steps: 10思考の平均step数 = 125.3
118 Episode: Finished after 91 time steps: 10思考の平均step数 = 119.0
119 Episode: Finished after 103 time steps: 10思考の平均step数 = 113.1




120 Episode: Finished after 103 time steps: 10思考の平均step数 = 106.5
121 Episode: Finished after 79 time steps: 10思考の平均step数 = 104.6
122 Episode: Finished after 147 time steps: 10思考の平均step数 = 107.1
123 Episode: Finished after 200 time steps: 10思考の平均step数 = 113.3
124 Episode: Finished after 110 time steps: 10思考の平均step数 = 117.8
125 Episode: Finished after 86 time steps: 10思考の平均step数 = 111.8
126 Episode: Finished after 185 time steps: 10思考の平均step数 = 117.3
127 Episode: Finished after 200 time steps: 10思考の平均step数 = 130.4
128 Episode: Finished after 80 time steps: 10思考の平均step数 = 129.3
129 Episode: Finished after 200 time steps: 10思考の平均step数 = 139.0




130 Episode: Finished after 177 time steps: 10思考の平均step数 = 146.4
131 Episode: Finished after 176 time steps: 10思考の平均step数 = 156.1
132 Episode: Finished after 148 time steps: 10思考の平均step数 = 156.2
133 Episode: Finished after 179 time steps: 10思考の平均step数 = 154.1
134 Episode: Finished after 97 time steps: 10思考の平均step数 = 152.8
135 Episode: Finished after 147 time steps: 10思考の平均step数 = 158.9
136 Episode: Finished after 176 time steps: 10思考の平均step数 = 158.0
137 Episode: Finished after 185 time steps: 10思考の平均step数 = 156.5
138 Episode: Finished after 129 time steps: 10思考の平均step数 = 161.4
139 Episode: Finished after 194 time steps: 10思考の平均step数 = 160.8




140 Episode: Finished after 103 time steps: 10思考の平均step数 = 153.4
141 Episode: Finished after 200 time steps: 10思考の平均step数 = 155.8
142 Episode: Finished after 200 time steps: 10思考の平均step数 = 161.0
143 Episode: Finished after 200 time steps: 10思考の平均step数 = 163.1
144 Episode: Finished after 200 time steps: 10思考の平均step数 = 173.4
145 Episode: Finished after 172 time steps: 10思考の平均step数 = 175.9
146 Episode: Finished after 88 time steps: 10思考の平均step数 = 167.1
147 Episode: Finished after 146 time steps: 10思考の平均step数 = 163.2
148 Episode: Finished after 200 time steps: 10思考の平均step数 = 170.3
149 Episode: Finished after 179 time steps: 10思考の平均step数 = 168.8




150 Episode: Finished after 200 time steps: 10思考の平均step数 = 178.5
151 Episode: Finished after 200 time steps: 10思考の平均step数 = 178.5
152 Episode: Finished after 175 time steps: 10思考の平均step数 = 176.0
153 Episode: Finished after 135 time steps: 10思考の平均step数 = 169.5
154 Episode: Finished after 87 time steps: 10思考の平均step数 = 158.2
155 Episode: Finished after 106 time steps: 10思考の平均step数 = 151.6
156 Episode: Finished after 200 time steps: 10思考の平均step数 = 162.8
157 Episode: Finished after 89 time steps: 10思考の平均step数 = 157.1
158 Episode: Finished after 183 time steps: 10思考の平均step数 = 155.4
159 Episode: Finished after 200 time steps: 10思考の平均step数 = 157.5




160 Episode: Finished after 142 time steps: 10思考の平均step数 = 151.7
161 Episode: Finished after 157 time steps: 10思考の平均step数 = 147.4
162 Episode: Finished after 80 time steps: 10思考の平均step数 = 137.9
163 Episode: Finished after 136 time steps: 10思考の平均step数 = 138.0
164 Episode: Finished after 146 time steps: 10思考の平均step数 = 143.9
165 Episode: Finished after 200 time steps: 10思考の平均step数 = 153.3
166 Episode: Finished after 185 time steps: 10思考の平均step数 = 151.8
167 Episode: Finished after 169 time steps: 10思考の平均step数 = 159.8
168 Episode: Finished after 87 time steps: 10思考の平均step数 = 150.2
169 Episode: Finished after 200 time steps: 10思考の平均step数 = 150.2




170 Episode: Finished after 200 time steps: 10思考の平均step数 = 156.0
171 Episode: Finished after 163 time steps: 10思考の平均step数 = 156.6
172 Episode: Finished after 200 time steps: 10思考の平均step数 = 168.6
173 Episode: Finished after 87 time steps: 10思考の平均step数 = 163.7
174 Episode: Finished after 169 time steps: 10思考の平均step数 = 166.0
175 Episode: Finished after 165 time steps: 10思考の平均step数 = 162.5
176 Episode: Finished after 160 time steps: 10思考の平均step数 = 160.0
177 Episode: Finished after 200 time steps: 10思考の平均step数 = 163.1
178 Episode: Finished after 200 time steps: 10思考の平均step数 = 174.4
179 Episode: Finished after 174 time steps: 10思考の平均step数 = 171.8




180 Episode: Finished after 172 time steps: 10思考の平均step数 = 169.0
181 Episode: Finished after 129 time steps: 10思考の平均step数 = 165.6
182 Episode: Finished after 141 time steps: 10思考の平均step数 = 159.7
183 Episode: Finished after 200 time steps: 10思考の平均step数 = 171.0
184 Episode: Finished after 200 time steps: 10思考の平均step数 = 174.1
185 Episode: Finished after 152 time steps: 10思考の平均step数 = 172.8
186 Episode: Finished after 144 time steps: 10思考の平均step数 = 171.2
187 Episode: Finished after 200 time steps: 10思考の平均step数 = 171.2
188 Episode: Finished after 200 time steps: 10思考の平均step数 = 171.2
189 Episode: Finished after 173 time steps: 10思考の平均step数 = 171.1




190 Episode: Finished after 141 time steps: 10思考の平均step数 = 168.0
191 Episode: Finished after 200 time steps: 10思考の平均step数 = 175.1
192 Episode: Finished after 94 time steps: 10思考の平均step数 = 170.4
193 Episode: Finished after 200 time steps: 10思考の平均step数 = 170.4
194 Episode: Finished after 152 time steps: 10思考の平均step数 = 165.6
195 Episode: Finished after 200 time steps: 10思考の平均step数 = 170.4
196 Episode: Finished after 97 time steps: 10思考の平均step数 = 165.7
197 Episode: Finished after 200 time steps: 10思考の平均step数 = 165.7
198 Episode: Finished after 150 time steps: 10思考の平均step数 = 160.7
199 Episode: Finished after 200 time steps: 10思考の平均step数 = 163.4




200 Episode: Finished after 122 time steps: 10思考の平均step数 = 161.5
201 Episode: Finished after 172 time steps: 10思考の平均step数 = 158.7
202 Episode: Finished after 146 time steps: 10思考の平均step数 = 163.9
203 Episode: Finished after 141 time steps: 10思考の平均step数 = 158.0
204 Episode: Finished after 140 time steps: 10思考の平均step数 = 156.8
205 Episode: Finished after 200 time steps: 10思考の平均step数 = 156.8
206 Episode: Finished after 187 time steps: 10思考の平均step数 = 165.8
207 Episode: Finished after 134 time steps: 10思考の平均step数 = 159.2
208 Episode: Finished after 164 time steps: 10思考の平均step数 = 160.6
209 Episode: Finished after 130 time steps: 10思考の平均step数 = 153.6




210 Episode: Finished after 167 time steps: 10思考の平均step数 = 158.1
211 Episode: Finished after 127 time steps: 10思考の平均step数 = 153.6
212 Episode: Finished after 200 time steps: 10思考の平均step数 = 159.0
213 Episode: Finished after 163 time steps: 10思考の平均step数 = 161.2
214 Episode: Finished after 130 time steps: 10思考の平均step数 = 160.2
215 Episode: Finished after 200 time steps: 10思考の平均step数 = 160.2
216 Episode: Finished after 175 time steps: 10思考の平均step数 = 159.0
217 Episode: Finished after 150 time steps: 10思考の平均step数 = 160.6
218 Episode: Finished after 186 time steps: 10思考の平均step数 = 162.8
219 Episode: Finished after 200 time steps: 10思考の平均step数 = 169.8




220 Episode: Finished after 142 time steps: 10思考の平均step数 = 167.3
221 Episode: Finished after 195 time steps: 10思考の平均step数 = 174.1
222 Episode: Finished after 200 time steps: 10思考の平均step数 = 174.1
223 Episode: Finished after 137 time steps: 10思考の平均step数 = 171.5
224 Episode: Finished after 145 time steps: 10思考の平均step数 = 173.0
225 Episode: Finished after 129 time steps: 10思考の平均step数 = 165.9
226 Episode: Finished after 195 time steps: 10思考の平均step数 = 167.9
227 Episode: Finished after 136 time steps: 10思考の平均step数 = 166.5
228 Episode: Finished after 125 time steps: 10思考の平均step数 = 160.4
229 Episode: Finished after 180 time steps: 10思考の平均step数 = 158.4




230 Episode: Finished after 149 time steps: 10思考の平均step数 = 159.1
231 Episode: Finished after 200 time steps: 10思考の平均step数 = 159.6
232 Episode: Finished after 141 time steps: 10思考の平均step数 = 153.7
233 Episode: Finished after 200 time steps: 10思考の平均step数 = 160.0
234 Episode: Finished after 199 time steps: 10思考の平均step数 = 165.4
235 Episode: Finished after 129 time steps: 10思考の平均step数 = 165.4
236 Episode: Finished after 139 time steps: 10思考の平均step数 = 159.8
237 Episode: Finished after 143 time steps: 10思考の平均step数 = 160.5
238 Episode: Finished after 147 time steps: 10思考の平均step数 = 162.7
239 Episode: Finished after 178 time steps: 10思考の平均step数 = 162.5




240 Episode: Finished after 160 time steps: 10思考の平均step数 = 163.6
241 Episode: Finished after 123 time steps: 10思考の平均step数 = 155.9
242 Episode: Finished after 200 time steps: 10思考の平均step数 = 161.8
243 Episode: Finished after 200 time steps: 10思考の平均step数 = 161.8
244 Episode: Finished after 190 time steps: 10思考の平均step数 = 160.9
245 Episode: Finished after 153 time steps: 10思考の平均step数 = 163.3
246 Episode: Finished after 145 time steps: 10思考の平均step数 = 163.9
247 Episode: Finished after 200 time steps: 10思考の平均step数 = 169.6
248 Episode: Finished after 200 time steps: 10思考の平均step数 = 174.9
249 Episode: Finished after 142 time steps: 10思考の平均step数 = 171.3




250 Episode: Finished after 143 time steps: 10思考の平均step数 = 169.6
251 Episode: Finished after 182 time steps: 10思考の平均step数 = 175.5
252 Episode: Finished after 200 time steps: 10思考の平均step数 = 175.5
253 Episode: Finished after 200 time steps: 10思考の平均step数 = 175.5
254 Episode: Finished after 146 time steps: 10思考の平均step数 = 171.1
255 Episode: Finished after 200 time steps: 10思考の平均step数 = 175.8
256 Episode: Finished after 200 time steps: 10思考の平均step数 = 181.3
257 Episode: Finished after 175 time steps: 10思考の平均step数 = 178.8
258 Episode: Finished after 150 time steps: 10思考の平均step数 = 173.8
259 Episode: Finished after 187 time steps: 10思考の平均step数 = 178.3




260 Episode: Finished after 139 time steps: 10思考の平均step数 = 177.9
261 Episode: Finished after 152 time steps: 10思考の平均step数 = 174.9
262 Episode: Finished after 180 time steps: 10思考の平均step数 = 172.9
263 Episode: Finished after 146 time steps: 10思考の平均step数 = 167.5
264 Episode: Finished after 140 time steps: 10思考の平均step数 = 166.9
265 Episode: Finished after 153 time steps: 10思考の平均step数 = 162.2
266 Episode: Finished after 153 time steps: 10思考の平均step数 = 157.5
267 Episode: Finished after 142 time steps: 10思考の平均step数 = 154.2
268 Episode: Finished after 178 time steps: 10思考の平均step数 = 157.0
269 Episode: Finished after 152 time steps: 10思考の平均step数 = 153.5




270 Episode: Finished after 126 time steps: 10思考の平均step数 = 152.2
271 Episode: Finished after 147 time steps: 10思考の平均step数 = 151.7
272 Episode: Finished after 110 time steps: 10思考の平均step数 = 144.7
273 Episode: Finished after 141 time steps: 10思考の平均step数 = 144.2
274 Episode: Finished after 133 time steps: 10思考の平均step数 = 143.5
275 Episode: Finished after 135 time steps: 10思考の平均step数 = 141.7
276 Episode: Finished after 119 time steps: 10思考の平均step数 = 138.3
277 Episode: Finished after 127 time steps: 10思考の平均step数 = 136.8
278 Episode: Finished after 191 time steps: 10思考の平均step数 = 138.1
279 Episode: Finished after 137 time steps: 10思考の平均step数 = 136.6




280 Episode: Finished after 149 time steps: 10思考の平均step数 = 138.9
281 Episode: Finished after 200 time steps: 10思考の平均step数 = 144.2
282 Episode: Finished after 165 time steps: 10思考の平均step数 = 149.7
283 Episode: Finished after 137 time steps: 10思考の平均step数 = 149.3
284 Episode: Finished after 135 time steps: 10思考の平均step数 = 149.5
285 Episode: Finished after 200 time steps: 10思考の平均step数 = 156.0
286 Episode: Finished after 168 time steps: 10思考の平均step数 = 160.9
287 Episode: Finished after 160 time steps: 10思考の平均step数 = 164.2
288 Episode: Finished after 158 time steps: 10思考の平均step数 = 160.9
289 Episode: Finished after 160 time steps: 10思考の平均step数 = 163.2




290 Episode: Finished after 180 time steps: mean steps over last 10 trials = 166.3
291 Episode: Finished after 151 time steps: mean steps over last 10 trials = 161.4
292 Episode: Finished after 190 time steps: mean steps over last 10 trials = 163.9
293 Episode: Finished after 149 time steps: mean steps over last 10 trials = 165.1
294 Episode: Finished after 164 time steps: mean steps over last 10 trials = 168.0
295 Episode: Finished after 200 time steps: mean steps over last 10 trials = 168.0
296 Episode: Finished after 163 time steps: mean steps over last 10 trials = 167.5
297 Episode: Finished after 200 time steps: mean steps over last 10 trials = 171.5
298 Episode: Finished after 188 time steps: mean steps over last 10 trials = 174.5
299 Episode: Finished after 170 time steps: mean steps over last 10 trials = 175.5




300 Episode: Finished after 198 time steps: mean steps over last 10 trials = 177.3
301 Episode: Finished after 182 time steps: mean steps over last 10 trials = 180.4
302 Episode: Finished after 193 time steps: mean steps over last 10 trials = 180.7
303 Episode: Finished after 196 time steps: mean steps over last 10 trials = 185.4
304 Episode: Finished after 176 time steps: mean steps over last 10 trials = 186.6
305 Episode: Finished after 181 time steps: mean steps over last 10 trials = 184.7
306 Episode: Finished after 192 time steps: mean steps over last 10 trials = 187.6
307 Episode: Finished after 196 time steps: mean steps over last 10 trials = 187.2
308 Episode: Finished after 150 time steps: mean steps over last 10 trials = 183.4
309 Episode: Finished after 194 time steps: mean steps over last 10 trials = 185.8




310 Episode: Finished after 195 time steps: mean steps over last 10 trials = 185.5
311 Episode: Finished after 161 time steps: mean steps over last 10 trials = 183.4
312 Episode: Finished after 167 time steps: mean steps over last 10 trials = 180.8
313 Episode: Finished after 164 time steps: mean steps over last 10 trials = 177.6
314 Episode: Finished after 172 time steps: mean steps over last 10 trials = 177.2
315 Episode: Finished after 200 time steps: mean steps over last 10 trials = 179.1
316 Episode: Finished after 171 time steps: mean steps over last 10 trials = 177.0
317 Episode: Finished after 200 time steps: mean steps over last 10 trials = 177.4
318 Episode: Finished after 200 time steps: mean steps over last 10 trials = 182.4
319 Episode: Finished after 200 time steps: mean steps over last 10 trials = 183.0




320 Episode: Finished after 199 time steps: mean steps over last 10 trials = 183.4
321 Episode: Finished after 176 time steps: mean steps over last 10 trials = 184.9
322 Episode: Finished after 200 time steps: mean steps over last 10 trials = 188.2
323 Episode: Finished after 178 time steps: mean steps over last 10 trials = 189.6
324 Episode: Finished after 200 time steps: mean steps over last 10 trials = 192.4
325 Episode: Finished after 157 time steps: mean steps over last 10 trials = 188.1
326 Episode: Finished after 183 time steps: mean steps over last 10 trials = 189.3
327 Episode: Finished after 167 time steps: mean steps over last 10 trials = 186.0
328 Episode: Finished after 159 time steps: mean steps over last 10 trials = 181.9
329 Episode: Finished after 200 time steps: mean steps over last 10 trials = 181.9




330 Episode: Finished after 200 time steps: mean steps over last 10 trials = 182.0
331 Episode: Finished after 194 time steps: mean steps over last 10 trials = 183.8
332 Episode: Finished after 197 time steps: mean steps over last 10 trials = 183.5
333 Episode: Finished after 200 time steps: mean steps over last 10 trials = 185.7
334 Episode: Finished after 179 time steps: mean steps over last 10 trials = 183.6
335 Episode: Finished after 200 time steps: mean steps over last 10 trials = 187.9
336 Episode: Finished after 200 time steps: mean steps over last 10 trials = 189.6
337 Episode: Finished after 200 time steps: mean steps over last 10 trials = 192.9
338 Episode: Finished after 177 time steps: mean steps over last 10 trials = 194.7
339 Episode: Finished after 186 time steps: mean steps over last 10 trials = 193.3




340 Episode: Finished after 196 time steps: mean steps over last 10 trials = 192.9
341 Episode: Finished after 185 time steps: mean steps over last 10 trials = 192.0
342 Episode: Finished after 178 time steps: mean steps over last 10 trials = 190.1
343 Episode: Finished after 173 time steps: mean steps over last 10 trials = 187.4
344 Episode: Finished after 176 time steps: mean steps over last 10 trials = 187.1
345 Episode: Finished after 200 time steps: mean steps over last 10 trials = 187.1
346 Episode: Finished after 200 time steps: mean steps over last 10 trials = 187.1
347 Episode: Finished after 200 time steps: mean steps over last 10 trials = 187.1
348 Episode: Finished after 200 time steps: mean steps over last 10 trials = 189.4
349 Episode: Finished after 192 time steps: mean steps over last 10 trials = 190.0




350 Episode: Finished after 200 time steps: mean steps over last 10 trials = 190.4
351 Episode: Finished after 199 time steps: mean steps over last 10 trials = 191.8
352 Episode: Finished after 200 time steps: mean steps over last 10 trials = 194.0
353 Episode: Finished after 193 time steps: mean steps over last 10 trials = 196.0
354 Episode: Finished after 196 time steps: mean steps over last 10 trials = 198.0
355 Episode: Finished after 200 time steps: mean steps over last 10 trials = 198.0
356 Episode: Finished after 200 time steps: mean steps over last 10 trials = 198.0
357 Episode: Finished after 200 time steps: mean steps over last 10 trials = 198.0
358 Episode: Finished after 200 time steps: mean steps over last 10 trials = 198.0
359 Episode: Finished after 200 time steps: mean steps over last 10 trials = 198.8




360 Episode: Finished after 180 time steps: mean steps over last 10 trials = 196.8
361 Episode: Finished after 200 time steps: mean steps over last 10 trials = 196.9
362 Episode: Finished after 200 time steps: mean steps over last 10 trials = 196.9
363 Episode: Finished after 196 time steps: mean steps over last 10 trials = 197.2
364 Episode: Finished after 187 time steps: mean steps over last 10 trials = 196.3
365 Episode: Finished after 200 time steps: mean steps over last 10 trials = 196.3
366 Episode: Finished after 200 time steps: mean steps over last 10 trials = 196.3
367 Episode: Finished after 185 time steps: mean steps over last 10 trials = 194.8
368 Episode: Finished after 200 time steps: mean steps over last 10 trials = 194.8
369 Episode: Finished after 200 time steps: mean steps over last 10 trials = 194.8




370 Episode: Finished after 200 time steps: mean steps over last 10 trials = 196.8
371 Episode: Finished after 200 time steps: mean steps over last 10 trials = 196.8
372 Episode: Finished after 200 time steps: mean steps over last 10 trials = 196.8
373 Episode: Finished after 200 time steps: mean steps over last 10 trials = 197.2
374 Episode: Finished after 200 time steps: mean steps over last 10 trials = 198.5
375 Episode: Finished after 200 time steps: mean steps over last 10 trials = 198.5
376 Episode: Finished after 197 time steps: mean steps over last 10 trials = 198.2
377 Episode: Finished after 200 time steps: mean steps over last 10 trials = 199.7
10 consecutive successes
378 Episode: Finished after 181 time steps: mean steps over last 10 trials = 197.8
