## Monte Carlo Exploring Starts

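The drawback of the MC Basic algorithm is that it must go through every `action` of every `state` and generate episodes for each pair in order to obtain its `action value`.

However, a single `episode` passes through many intermediate state-action pairs, often repeatedly, and each such occurrence is called a `visit`. Instead of discarding them, we may be able to use the `visit`s collected while exploring one `episode` to estimate the `action value` of the state-action pairs it passes through.

This estimate is of course not exact; how to turn a `visit` into an `action value` is covered in the next subsection.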
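
As a rough illustration of how visits can be turned into return estimates, the sketch below walks one episode backward and records the discounted return at every visit. It assumes an episode is a plain list of `(state, action, reward)` tuples; the helper name `returns_per_visit` and the toy episode are hypothetical and are not part of `GridWorld`.

```python
from collections import defaultdict

def returns_per_visit(episode, gamma):
    """Return {(state, action): [G_t for each visit, in time order]}."""
    visits = defaultdict(list)
    g = 0.0
    # Walk backward so each return reuses the next one: G_t = r_t + gamma * G_{t+1}
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        visits[(state, action)].append(g)
    # Reverse each list so that index 0 is the first visit in time order
    return {pair: gs[::-1] for pair, gs in visits.items()}

# Toy episode (hypothetical): the pair (0, 1) is visited twice
episode = [(0, 1, 0.0), (1, 2, 0.0), (0, 1, 0.0), (2, 3, 1.0)]
visits = returns_per_visit(episode, gamma=0.5)

# First visit: keep only the return of the earliest occurrence of each pair
first_visit = {pair: gs[0] for pair, gs in visits.items()}
# Every visit: average the returns of all occurrences of each pair
every_visit = {pair: sum(gs) / len(gs) for pair, gs in visits.items()}
```

With this toy episode the two estimates for the pair `(0, 1)` differ, which is why the choice between first-visit and every-visit matters when a pair recurs within an episode.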

In [1]:
import numpy as np
import random
import os
import sys
sys.path.append(os.path.dirname(os.getcwd()))
from GridWorld import GridWorld



In [2]:
gamma = 0.9 
rows = 5
cols = 5
# Load the grid world
grid_world = GridWorld(rows, cols, forbiddenAreaReward= -10)
grid_world.show()

⬜️⬜️⬜️⬜️⬜️
⬜️⬜️⬜️⬜️⬜️
🚫⬜️⬜️⬜️⬜️
⬜️⬜️⬜️⬜️⬜️
🚫✅⬜️⬜️🚫


In [3]:
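# Define the episode length
episode_length = 100

# State values, initialized to 0: one value per state
value = np.zeros(rows*cols)
# Action values, initialized to 0: one value for each of the 5 actions in every state
qtable = np.zeros((rows*cols, 5))

# Monte Carlo methods start from a random policy, so define one here
# (each row is a one-hot vector selecting a single action for that state)
# np.random.seed(50)
policy = np.eye(5)[np.random.randint(0,5,size=(rows*cols))]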
grid_world.show_policy_matirx(policy)

➡️⬇️🔄⬅️⬇️
➡️🔄⬆️🔄🔄
🔄🔄⬅️⬅️➡️
⬆️⬆️⬇️⬆️⬆️
⏩️✅⬅️⬅️⏪
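
The one-hot policy representation used in the cell above can be looked at in isolation. This is a small sketch with a hypothetical 3-state example; the arrays `actions` and `q` are made-up values, and it assumes, as above, that each row of the policy matrix puts probability 1 on a single action.

```python
import numpy as np

n_actions = 5
# Hypothetical action indices chosen for three states
actions = np.array([2, 0, 4])
# Indexing the identity matrix by action index yields one one-hot row per state
policy = np.eye(n_actions)[actions]

# Greedy improvement: select the arg-max action in each row of a Q table
q = np.array([
    [0.0, 1.0, 0.5, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.1],
])
greedy_policy = np.eye(n_actions)[np.argmax(q, axis=1)]
```

Note that `np.argmax` returns the first index when several actions tie, so an all-zero row of the Q table always maps to action 0.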


In [4]:
# Estimate the action values by sampling
# This differs from policy iteration: there the values are computed by iteration, whereas the Monte Carlo method estimates them from sampled episodes
# In policy iteration, we are given a known, fixed policy

print('Generate Random Policy...')
qtable = np.zeros((rows*cols, 5))
print('done!')
print('Initial Q Table')
pre_qtabel = qtable.copy() + 1
print('done!')
threshold = 0.001
print('Start Q Value Update...')
cut = 0
cut_threshold = 1000
while np.sum((pre_qtabel-qtable)**2) > threshold and cut < cut_threshold:
    print('-----------------------------------')
    print(f'q value update start at {cut}[{np.sum((pre_qtabel-qtable)**2)}]')
    pre_qtabel = qtable.copy()
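    # Estimate the action values by sampling
    # Loop over every state
    for i in range(rows*cols):
        # Loop over every action
        for j in range(5):
            # Initialize qtable_rewards and qtable_counts
            # (they are reset for each episode, so the averages below only cover visits within the current episode)
            qtable_rewards = [[0 for _ in range(5)] for _ in range(rows*cols)]
            qtable_counts = [[0 for _ in range(5)] for _ in range(rows*cols)]
            # The call below returns a list of tuples describing the episode; each tuple holds state, action, reward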
            episode = grid_world.get_episode_score(
                now_state=i,
                action=j,
                policy=policy,
                steps=episode_length,
            )

            reward = episode[episode_length][2]
            for k in range(episode_length-1, -1, -1):
                # Extract this step's information from the episode: state, action, reward
                temp_state = episode[k][0]
                temp_action = episode[k][1]
                temp_reward = episode[k][2]
                # First compute the discounted return from this step onward
                reward = temp_reward + gamma * reward
                # Update qtable_rewards and qtable_counts
                # Add the return to qtable_rewards
                qtable_rewards[temp_state][temp_action] += reward
                # Add this visit to qtable_counts
                qtable_counts[temp_state][temp_action] += 1
                # Update qtable with the mean return over the visits, i.e. every visit
                qtable[temp_state][temp_action] = qtable_rewards[temp_state][temp_action] / qtable_counts[temp_state][temp_action]

                # first visit (alternative to the every-visit update above):
                # the loop runs backward, so the last overwrite for a pair comes from
                # its earliest visit; plain assignment therefore keeps the first-visit return
                # qtable[temp_state][temp_action] = reward
    
    # Use the action with the largest action value in each state as the new policy
    policy = np.eye(5)[np.argmax(qtable, axis=1)]
    print('now policy: ')
    grid_world.show_policy_matirx(policy)
    print(f'q value update end at : {cut}[{np.sum((pre_qtabel-qtable)**2)}]')
    cut += 1
    print('-----------------------------------')
print('Optimal Policy Found!')
print('Final Policy')
grid_world.show_policy_matirx(policy)
print('Final Q Table')
print(qtable)

Generate Random Policy...
done!
Initial Q Table
done!
Start Q Value Update...
q value update start at 0[125.0]
now policy: 
➡️➡️➡️➡️⬇️
⬆️⬆️⬆️⬆️⬆️
⏫️⬆️⬇️⬆️⬆️
➡️⬇️⬇️⬇️⬅️
⏩️✅⬅️⬅️⏪
q value update end at : 0[168171.90793243307]
q value update start at 1[168171.90793243307]
now policy: 
➡️➡️➡️➡️⬇️
⬆️⬆️⬇️⬆️⬆️
⏬⬇️⬇️⬇️⬇️
➡️⬇️⬅️⬅️⬅️
⏩️✅⬅️⬅️⏪
q value update end at : 1[167335.0540029216]
q value update start at 2[167335.0540029216]
now policy: 
➡️➡️⬇️➡️⬇️
⬆️⬇️⬇️⬇️⬇️
⏩️⬇️⬅️⬅️⬅️
➡️⬇️⬇️⬇️⬅️
⏩️✅⬅️⬅️⏪
q value update end at : 2[1198.494657015979]
q value update start at 3[1198.494657015979]
now policy: 
➡️⬇️⬇️⬇️⬇️
➡️⬇️⬅️⬅️⬅️
⏬⬇️⬇️⬇️⬇️
➡️⬇️⬅️⬅️⬅️
⏩️✅⬅️⬅️⏪
q value update end at : 3[1310.7814712809559]
q value update start at 4[1310.7814712809559]
now policy: 
⬇️⬇️⬅️⬅️⬅️
➡️⬇️⬇️⬇️⬇️
⏩️⬇️⬅️⬅️⬅️
➡️⬇️⬇️⬇️⬅️
⏩️✅⬅️⬅️⏪
q value update end at : 4[44.616643355203166]
q value update start at 5[44.616643355203166]
now policy: 
➡️⬇️⬇️⬇️⬇️
➡️⬇️⬅️⬅️⬅️
⏬⬇️⬇️⬇️⬇️
➡️⬇️⬅️⬅️⬅️
⏩️✅⬅️⬅️⏪
q value update end at : 5[9.830201680304