## Monte Carlo Basic

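Monte Carlo methods use random sampling to generate data and then use that data to update the policy. In essence, this is similar to estimating a model's parameters from data.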

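The Monte Carlo algorithm is a modification of the `policy iteration` algorithm: the part that requires a model is replaced by sampling data and estimating the action values from the samples. Here the model means the environment dynamics, that is, the probabilities of the reward and the next `state` when a given `action` is taken in a given `state`.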
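
As a rough illustration of replacing the model with samples, the sketch below averages the discounted returns of sampled episodes. The names `discounted_return`, `estimate_action_value`, and `sample_rewards` are hypothetical and are not part of `GridWorld`; the sampler is a made-up toy whose true expected return is `gamma * 1`.

```python
import random


def discounted_return(rewards, gamma):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... from back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_action_value(sample_rewards, gamma, num_episodes):
    """Average the discounted returns of num_episodes sampled episodes."""
    total = 0.0
    for _ in range(num_episodes):
        total += discounted_return(sample_rewards(), gamma)
    return total / num_episodes


# Hypothetical sampler (an assumption for this sketch): the first reward is
# -1 or +1 with equal probability and the second reward is always 1,
# so the true expected return is gamma * 1.
rng = random.Random(0)


def sample_rewards():
    return [rng.choice([-1.0, 1.0]), 1.0]


print(discounted_return([1.0, 2.0, 3.0], 0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
print(estimate_action_value(sample_rewards, 0.9, 10_000))  # close to 0.9
```

With more sampled episodes the average moves closer to the expectation, which is the quantity a model-based method would compute directly from the dynamics.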

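Concretely, for every `action` of every `state`, generate many `trajectory/episode` samples that start from that pair, average their returns to estimate the `action value`, and then take the `action` with the largest `action value` as the policy for that `state`.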
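
The procedure above can be sketched on a tiny one-dimensional chain instead of the grid world. Everything in this block (the chain, its rewards, `step`, `rollout_return`, `mc_basic`) is an assumption made for illustration and does not reflect how `GridWorld` is implemented. Because this toy environment and the policy are both deterministic, one episode per state-action pair is enough here.

```python
import numpy as np

N_STATES, TARGET = 5, 4          # states 0..4, state 4 is the target
ACTIONS = (-1, 0, +1)            # left, stay, right
GAMMA, EPISODE_LENGTH = 0.9, 50


def step(state, action_idx):
    """Toy dynamics: -1 for hitting a wall, +1 for landing on the target."""
    nxt = state + ACTIONS[action_idx]
    if nxt < 0 or nxt >= N_STATES:
        return state, -1.0
    return nxt, (1.0 if nxt == TARGET else 0.0)


def rollout_return(state, action_idx, policy):
    """Discounted return of one episode that starts with (state, action)
    and follows `policy` afterwards."""
    rewards = []
    for _ in range(EPISODE_LENGTH):
        state, r = step(state, action_idx)
        rewards.append(r)
        action_idx = policy[state]
    g = 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
    return g


def mc_basic(max_iters=100):
    policy = np.zeros(N_STATES, dtype=int)  # start from "always go left"
    for _ in range(max_iters):
        # Policy evaluation by sampling: one return per (state, action)
        q = np.array([[rollout_return(s, a, policy)
                       for a in range(len(ACTIONS))]
                      for s in range(N_STATES)])
        # Policy improvement: greedy with respect to the estimated q
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, q


policy, q = mc_basic()
print(policy)  # action index per state: 0 = left, 1 = stay, 2 = right
```

In a stochastic environment, `rollout_return` would need to be called many times per pair and averaged.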

In [1]:
import numpy as np
import random
import os
import sys
sys.path.append(os.path.dirname(os.getcwd()))
from GridWorld import GridWorld



In [2]:
gamma = 0.9 
rows = 5
cols = 5
# Load the grid world
grid_world = GridWorld(rows, cols, forbiddenAreaReward=-10)
grid_world.show()

⬜️⬜️⬜️⬜️⬜️
⬜️⬜️⬜️⬜️⬜️
🚫⬜️⬜️⬜️⬜️
⬜️⬜️⬜️⬜️⬜️
🚫✅⬜️⬜️🚫


In [3]:
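# Episode length: the number of steps sampled per episode.
# With gamma = 0.9, rewards after step 100 are weighted by less than 0.9**100 ≈ 2.7e-5.
episode_length = 100

# State values, initialized to 0: one value per state
value = np.zeros(rows * cols)
# Action values, initialized to 0: five action values per state
qtable = np.zeros((rows * cols, 5))

# Monte Carlo starts from a random policy, so define one here:
# each row is a one-hot vector that selects one of the 5 actions for that state
# np.random.seed(50)
policy = np.eye(5)[np.random.randint(0, 5, size=rows * cols)]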
grid_world.show_policy_matirx(policy)

⬇️➡️🔄🔄⬇️
🔄⬆️⬅️⬅️⬇️
⏫️🔄🔄⬅️⬆️
⬅️⬅️⬇️⬆️⬅️
⏪✅⬅️⬅️⏩️
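
The random policy above is built by indexing an identity matrix with random action indices. The standalone sketch below shows what that indexing produces; the sizes are chosen only for illustration, and which index maps to which arrow is defined inside `GridWorld` and is not assumed here.

```python
import numpy as np

# Indexing the identity matrix with action indices yields one-hot rows.
fixed = np.eye(5)[[2, 0, 4]]
print(fixed)
# [[0. 0. 1. 0. 0.]
#  [1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1.]]

# Same idea with random indices: one action per state.
num_states, num_actions = 4, 5
rng = np.random.default_rng(0)
action_idx = rng.integers(0, num_actions, size=num_states)
policy = np.eye(num_actions)[action_idx]

# argmax along each row recovers the chosen action index.
recovered = policy.argmax(axis=1)
print(np.array_equal(recovered, action_idx))  # True
```

Each row can also be read as a probability distribution over actions that puts all its mass on one action, which is why the same matrix shape could hold a stochastic policy.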


In [4]:
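# Estimate the action values by sampling.
# The difference from policy iteration: there the values are computed iteratively from the model, whereas Monte Carlo estimates them from sampled episodes.
# In policy iteration the policy being evaluated is a known, fixed policy; the same holds here.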

print('Generate Random Policy...')
qtable = np.zeros((rows*cols, 5))
print('done!')
print('Initial Q Table')
# Offset by 1 so that the first convergence check does not pass trivially
pre_qtabel = qtable.copy() + 1
print('done!')
threshold = 0.001
print('Start Q Value Update...')
cut = 0
cut_threshold = 1000
while np.sum((pre_qtabel-qtable)**2) > threshold and cut < cut_threshold:
    print('-----------------------------------')
    print(f'q value update start at {cut}[{np.sum((pre_qtabel-qtable)**2)}]')
    pre_qtabel = qtable.copy()
    # Estimate the action values by sampling
    # Loop over every state
    for i in range(rows*cols):
        # Loop over every action
        for j in range(5):
            # The call below returns a list of tuples, one per step of the episode; each tuple holds (state, action, reward)
            episode = grid_world.get_episode_score(
                now_state=i,
                action=j,
                policy=policy,
                steps=episode_length,
            )
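            # Next, compute the action value given by this episode.
            # The discounted return is: Gt = Rt+1 + gamma*Rt+2 + gamma^2*Rt+3 + ... + gamma^(T-1)*Rt+T
            # q(s,a) = E[Gt|St=s, At=a]
            # temp_reward starts as the reward of the last step of the episode
            # and is accumulated from back to front, which expands to the formula above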
            temp_reward = episode[episode_length][2]
            for k in range(episode_length-1, -1, -1):
                temp_reward = episode[k][2] + gamma*temp_reward
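            # Update qtable; only one episode is sampled per (state, action), so its return is used directly as the estimate
            qtable[i][j] = temp_reward
    # Take the action with the largest action value as the new policy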
    policy = np.eye(5)[np.argmax(qtable, axis=1)]
    print('now policy: ')
    grid_world.show_policy_matirx(policy)
    print(f'q value update end at : {cut}[{np.sum((pre_qtabel-qtable)**2)}]')
    cut += 1
    print('-----------------------------------')
print('Optimal Policy Found!')
print('Final Policy')
grid_world.show_policy_matirx(policy)
print('Final Q Table')
print(qtable)


Generate Random Policy...
done!
Initial Q Table
done!
Start Q Value Update...
-----------------------------------
q value update start at 0[125.0]
now policy: 
➡️➡️➡️➡️⬇️
⬆️⬆️⬆️⬆️⬆️
⏫️⬆️⬇️⬆️⬆️
⬆️⬇️⬇️⬇️⬆️
⏩️✅⬅️⬅️⏪
q value update end at : 0[178116.11412857752]
-----------------------------------
-----------------------------------
q value update start at 1[178116.11412857752]
now policy: 
➡️➡️➡️➡️⬇️
⬆️⬆️⬇️⬆️⬆️
⏫️⬇️⬇️⬇️⬆️
➡️⬇️⬇️⬇️⬅️
⏩️✅⬅️⬅️⏪
q value update end at : 1[170158.66223275958]
-----------------------------------
-----------------------------------
q value update start at 2[170158.66223275958]
now policy: 
➡️➡️⬇️➡️⬇️
⬆️⬇️⬇️⬇️⬆️
⏩️⬇️⬇️⬇️⬇️
➡️⬇️⬇️⬇️⬅️
⏩️✅⬅️⬅️⏪
q value update end at : 2[2435.6935224546905]
-----------------------------------
-----------------------------------
q value update start at 3[2435.6935224546905]
now policy: 
➡️⬇️⬇️⬇️⬇️
➡️⬇️⬇️⬇️⬇️
⏩️⬇️⬇️⬇️⬇️
➡️⬇️⬇️⬇️⬅️
⏩️✅⬅️⬅️⏪
q value update end at : 3[1402.5967726007984]
-----------------------------------
---------------