## Monte Carlo Epsilon Greedy

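The drawback of the `Exploring Starts` method from the previous section is that it needs many states as starting points: every state has to serve as a start so that enough `episode`s are collected to update the policy.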

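As noted in the previous section, this is wasteful, because many steps are repeated across those episodes. Is there a way to start from a single state and still visit every state and action sufficiently often, so that the `exploring starts` condition can be dropped?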

The `exploring starts` condition requires that every action of every state is visited sufficiently often.

The answer is a `soft policy`:
$$
\pi(a \vert s)= \begin{cases} 1-\frac{\epsilon}{\vert{A(s)}\vert}({\vert{A(s)}\vert} - 1), & \text {for the greedy action} \\\\ \frac{\epsilon}{\vert{A(s)}\vert}, & \text{for the other ${\vert{A(s)}\vert} - 1$  actions} \end{cases}
$$

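That is, when the policy is updated, the `greedy action` is selected with the highest probability (for $\epsilon \in [0, 1]$), but every other `action` still has a nonzero probability of being selected. The smaller the parameter $\epsilon$, the weaker the exploration and the closer the soft policy is to the greedy one, which keeps it close to optimal.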
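The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's own implementation: the helper name `epsilon_greedy_policy` and the small action-value table `q` are hypothetical and chosen only for the example. It builds the probability matrix and then samples actions from one state to show that every action is still visited.

```python
import numpy as np

def epsilon_greedy_policy(qtable, epsilon):
    """Return an (n_states, n_actions) matrix of epsilon-greedy probabilities.

    Hypothetical helper for illustration; not part of the GridWorld module.
    """
    n_states, n_actions = qtable.shape
    # Every action starts with the exploration probability epsilon / |A(s)|
    policy = np.full((n_states, n_actions), epsilon / n_actions)
    # The greedy action receives the remaining mass: 1 - epsilon/|A(s)| * (|A(s)| - 1)
    greedy = np.argmax(qtable, axis=1)
    policy[np.arange(n_states), greedy] = 1 - epsilon / n_actions * (n_actions - 1)
    return policy

# Made-up action values for 2 states and 5 actions (assumption for the example)
q = np.array([[0.0, 1.0, 0.5, -1.0, 0.2],
              [2.0, 0.0, 0.0, 0.0, 0.0]])
soft = epsilon_greedy_policy(q, epsilon=0.1)
print(soft)

# Sample actions in state 0: the greedy action dominates,
# but the other four actions are still selected occasionally
rng = np.random.default_rng(0)
samples = rng.choice(5, size=10_000, p=soft[0])
counts = np.bincount(samples, minlength=5)
print(counts)
```

With 5 actions and $\epsilon = 0.1$, the greedy action gets probability $1 - 0.1 \cdot 4/5 = 0.92$ and each of the other actions gets $0.1/5 = 0.02$, so every row sums to 1.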

In [1]:
import numpy as np
import random
import os
import sys
sys.path.append(os.path.dirname(os.getcwd()))
from GridWorld import GridWorld



In [2]:
gamma = 0.9 
rows = 5
cols = 5
# Load the grid world
grid_world = GridWorld(rows, cols, forbiddenAreaReward= -10)
grid_world.show()

⬜️⬜️⬜️⬜️⬜️
⬜️⬜️⬜️⬜️⬜️
🚫⬜️⬜️⬜️⬜️
⬜️⬜️⬜️⬜️⬜️
🚫✅⬜️⬜️🚫


In [3]:


# state value, initialized to 0: the value of each state
value = np.zeros(rows*cols) 
# action value, initialized to 0: the value of each of the 5 actions in every state
qtable = np.zeros((rows*cols, 5)) 

# The Monte Carlo method starts from a random policy, so we define one here
# np.random.seed(50)
policy = np.eye(5)[np.random.randint(0,5,size=(rows*cols))] 
grid_world.show_policy_matirx(policy)

⬇️🔄🔄🔄🔄
⬆️⬆️🔄⬇️⬅️
⏪⬆️🔄⬅️➡️
⬆️➡️🔄🔄⬇️
⏩️✅➡️⬇️⏬


In [4]:
print('Generate Random Policy...')
qtable = np.zeros((rows*cols, 5))
print('done!')
epsilon = 0.1
print(f'inital epsilon: {epsilon}')

# Define the episode length
episode_length = 10000
# Define the iteration counter and the maximum number of iterations
cut = 0
cut_threshold = 200
# Initialize policy_epsilon
policy_epsilon = policy.copy()
threshold = 0.001
pre_qtabel = qtable.copy() + 1 
while np.sum((pre_qtabel-qtable)**2) > threshold and cut < cut_threshold:
    pre_qtabel = qtable.copy()
    print('------------------------------------')
    print(f'policy_epsilon start:')
    grid_world.show_policy_matirx(policy_epsilon)
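    # Run the Monte Carlo method, starting with the random policy
    i = random.randint(0, (rows*cols)-1)
    # random action index; the upper bound is the number of actions - 1, not cols - 1
    j = random.randint(0, qtable.shape[1]-1)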

    qtable_rewards = [[0 for _ in range(5)] for _ in range(rows*cols)]
    qtable_counts = [[0 for _ in range(5)] for _ in range(rows*cols)]
    # Note: i and j are random, so every episode starts from a random state-action pair
    episode = grid_world.get_episode_score(
        now_state=i,
        action=j,
        policy=policy_epsilon,
        steps=episode_length,
        )
    reward = episode[episode_length][2]
    for k in range(episode_length-1, -1, -1):
        # Extract the information of each step of the episode: state, action, reward
        temp_state = episode[k][0]
        temp_action = episode[k][1]
        temp_reward = episode[k][2]

        reward = temp_reward + gamma * reward
        qtable_rewards[temp_state][temp_action] += reward
        qtable_counts[temp_state][temp_action] += 1
        qtable[temp_state][temp_action] = qtable_rewards[temp_state][temp_action] / qtable_counts[temp_state][temp_action]
    
    # Compute the state value
    # state value = sum(policy[state][action] * qtable[state][action])
    value = np.sum(policy_epsilon * qtable, axis=1).reshape(rows, cols)
    print(f'iter: {cut}, value: {value}')

    # Update the policy (greedy with respect to the current qtable)
    policy = np.eye(5)[np.argmax(qtable,axis=1)]
    # Build the soft policy
    greedy_action_p = 1 - (epsilon * (4/5))
    other_action_p = epsilon * (1/5)

    print(f'soft policy: greedy action prob: {greedy_action_p}, other action prob: {other_action_p}')
    decision = {1:greedy_action_p, 0:other_action_p}
    # Build the new policy: replace each element of the greedy policy with its probability from decision
    policy_epsilon = np.vectorize(decision.get)(policy)
    print(f'policy_epsilon end:')
    grid_world.show_policy_matirx(policy_epsilon)
    print('------------------------------------')
    cut += 1

print('Optimal Policy Found!')
print('Final Policy')
grid_world.show_policy_matirx(policy_epsilon)
print('Final Q Table')
print(qtable)
print('Final Value Table')
print(value)

Generate Random Policy...
done!
inital epsilon: 0.1
------------------------------------
policy_epsilon start:
⬇️🔄🔄🔄🔄
⬆️⬆️🔄⬇️⬅️
⏪⬆️🔄⬅️➡️
⬆️➡️🔄🔄⬇️
⏩️✅➡️⬇️⏬
iter: <built-in function iter>, value: [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
soft policy: greedy action prob: 0.9199999999999999, other action prob: 0.020000000000000004
policy_epsilon end:
➡️⬆️⬆️⬆️⬆️
⬆️⬆️⬆️⬆️⬆️
⏫️⬆️⬆️⬆️⬆️
⬆️⬆️⬆️⬆️⬆️
⏫️✅⬆️⬆️⏫️
------------------------------------
------------------------------------
policy_epsilon start:
➡️⬆️⬆️⬆️⬆️
⬆️⬆️⬆️⬆️⬆️
⏫️⬆️⬆️⬆️⬆️
⬆️⬆️⬆️⬆️⬆️
⏫️✅⬆️⬆️⏫️
iter: <built-in function iter>, value: [[-78.21369729 -89.09973261 -90.51274199 -89.99708047 -91.58767775]
 [-64.62115596 -74.05743152 -77.93088347 -81.96799058 -81.21267033]
 [-65.99861811 -55.64935038 -70.61050251  -1.37565675 -65.95515485]
 [-69.10806299   0.           0.           0.           0.        ]
 [-63.61415954   0.           0.           0.           0.        ]]
soft policy: greed