## 1. Principle

Q-learning differs from SARSA only in the TD target. In SARSA, we collect (now_state, now_action, reward, next_state, next_action) tuples, where next_action is drawn with random.choice. Q-learning instead uses a greedy target: it does not need next_action as input, and directly takes the maximum action_value over all actions in next_state.
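The difference between the two TD targets can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper names `sarsa_td_target` and `q_learning_td_target` and the toy `action_value` table are assumptions made for the example.

```python
import numpy as np


def sarsa_td_target(action_value, reward, next_state, next_action, gamma):
    """SARSA bootstraps from the action that was actually sampled in next_state."""
    return reward + gamma * action_value[next_state, next_action]


def q_learning_td_target(action_value, reward, next_state, gamma):
    """Q-learning bootstraps from the best action in next_state; next_action is not needed."""
    return reward + gamma * np.max(action_value[next_state])


# Toy table (hypothetical values): 2 states, 3 actions.
action_value = np.zeros((2, 3))
action_value[1] = [0.5, 2.0, 1.0]

# Same transition, reward 1.0 into state 1, where the sampled next_action is 0.
sarsa_target = sarsa_td_target(action_value, 1.0, 1, 0, gamma=0.9)    # 1.0 + 0.9 * 0.5
q_target = q_learning_td_target(action_value, 1.0, 1, gamma=0.9)      # 1.0 + 0.9 * 2.0
print(sarsa_target, q_target)
```

The two targets coincide only when the sampled next_action happens to be the greedy one.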

Note also that Q-learning is an off-policy algorithm: its goal is to learn the optimal policy, which differs from the policy we are currently following to collect data.
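To make the off-policy property concrete, here is a small sketch on a toy problem that is not part of this notebook: a 4-state chain (states 0 to 3, state 3 is the target) with two actions, an assumed reward of 1 for reaching the target, and hyperparameters chosen only for illustration. The data is collected by a uniformly random behavior policy, yet the greedy policy read from the learned action values still moves right toward the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2      # chain 0-1-2-3; actions: 0 = left, 1 = right
gamma, alpha = 0.9, 0.5         # illustrative values, not the notebook's settings
q = np.zeros((n_states, n_actions))


def step(state, action):
    """Deterministic chain dynamics; moving left from state 0 stays in place."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward


for _ in range(500):
    state = 0
    while state != n_states - 1:
        # Behavior policy: uniformly random, never greedy.
        action = rng.integers(n_actions)
        next_state, reward = step(state, action)
        # Target policy: greedy, via the max over next-state action values.
        td_target = reward + gamma * np.max(q[next_state])
        q[state, action] -= alpha * (q[state, action] - td_target)
        state = next_state

# Greedy action for each non-terminal state.
greedy_actions = np.argmax(q[:-1], axis=1)
print(greedy_actions)
```

The behavior policy never changes during training, so whatever the greedy policy ends up being is learned purely through the max in the TD target.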


## 2. Implementation

### 2.1 Off-policy (classical version)

In [1]:
import numpy as np
import random
from IPython.display import clear_output
import sys, os
sys.path.append(os.path.dirname(os.getcwd()))
from GridWorld import GridWorld
from tqdm import tqdm
import time



In [4]:
rows = 5
columns = 5

gridworld = GridWorld(forbiddenAreaReward=-10, reward=1, desc=[".....", ".##..", "..#..", ".#T#.", ".#..."])
print('Initial Grid World')
gridworld.show()

# Random deterministic policy: one one-hot row over the 5 actions per state
policy = np.eye(5)[np.random.randint(0,5,size=(rows*columns))]
print('Initial Policy')
gridworld.show_policy_matirx(policy)

state_value = np.zeros((rows * columns))
print(f'Initial State Value: {state_value}')

action_value = np.zeros((rows * columns, 5))
print(f'Initial Action Value: {action_value}')

# Hyperparameters
num_episodes = 1000
alpha = 0.1
gamma = 0.9
epsilon = 0.1

Initial Grid World
⬜️⬜️⬜️⬜️⬜️
⬜️🚫🚫⬜️⬜️
⬜️⬜️🚫⬜️⬜️
⬜️🚫✅🚫⬜️
⬜️🚫⬜️⬜️⬜️
Initial Policy
⬅️⬇️🔄🔄➡️
⬅️⏩️⏪➡️➡️
⬅️⬅️🔄⬇️⬇️
⬅️⏪✅⏫️🔄
➡️⏫️➡️➡️🔄
Initial State Value: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0.]
Initial Action Value: [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]


In [6]:

for episode in range(num_episodes):
    clear_output(wait=True)
    print(f'episode: {episode} \\ {num_episodes}')
    # Define the epsilon-greedy policy (unused here)
    # greedy_action_prob = 1 - epsilon * (4 / 5)
    # non_greedy_action_prob = epsilon / 5
    # action_dict = { 1: greedy_action_prob,
    #                0: non_greedy_action_prob}
    # # This step assigns each state-action pair its probability under the epsilon-greedy policy
    # policy_epsilon_greedy = np.vectorize(action_dict.get)(policy)
    # Count how many times each state is visited
    state_visited = [0 for _ in range(rows * columns)]
    # Pick a random start state and action; per the book's off-policy pseudocode, no epsilon-greedy policy is needed
    now_state = random.choice(range(rows * columns))
    now_action = random.choice(range(5))

    # Following the pseudocode, generate one trajectory per episode with the current policy to collect data
    trajectory = gridworld.get_episode_score(
        now_state=now_state,
        action=now_action,
        policy=policy,
        steps=-1,
        stop_when_reach_target=True
    )
    print(f'episode end, trajectory length: {len(trajectory)}')

    # Use Q-learning to update action_value, iterating backwards over the trajectory from index len(trajectory) - 1 down to 0
    for k in range(len(trajectory) - 1, -1, -1):
        last_state, last_action, reward, next_state, next_action = trajectory[k]
        state_visited[last_state] += 1
        # Q-learning update: Q_{t+1}(s, a) = Q_t(s, a) - alpha * (Q_t(s, a) - (r + gamma * max_a' Q_t(s', a')))
        # Note: the target uses the maximum action value in the next state, not next_action
        TD_target = reward + gamma * np.max(action_value[next_state])
        TD_error = action_value[last_state][last_action] - TD_target
        # Update action_value
        action_value[last_state][last_action] -= alpha * TD_error
        # Update the policy to be greedy with respect to action_value
        policy[last_state] = np.eye(5)[np.argmax(action_value[last_state])]

    # Update the state values
    state_value = np.max(action_value, axis=1)
    print(f'state value updated: \n{state_value}')
    print('policy updated')
    gridworld.show_policy_matirx(policy)
    time.sleep(0.2)

print('Final Policy')
gridworld.show_policy_matirx(policy)
print('Final State Value')
print(state_value)



episode: 999 \ 1000
episode end, trajectory length: 8
state value updated: 
[1.82116797 2.02768054 2.25589071 2.50868309 2.78898621 1.6312208
 1.60034826 1.89171341 1.95175734 3.10038853 1.44321933 1.06824428
 4.8340822  2.81955038 3.44609244 1.13984092 4.86200633 4.73039236
 4.89368464 3.83000398 0.65329565 4.4304419  5.2566311  4.73027429
 4.2565327 ]
policy updated
➡️➡️➡️➡️⬇️
⬆️⏫️⏫️⬆️⬇️
⬆️⬅️⏬➡️⬇️
⬆️⏩️✅⏪⬇️
⬆️⏩️⬆️⬅️⬅️
Final Policy
➡️➡️➡️➡️⬇️
⬆️⏫️⏫️⬆️⬇️
⬆️⬅️⏬➡️⬇️
⬆️⏩️✅⏪⬇️
⬆️⏩️⬆️⬅️⬅️
Final State Value
[1.82116797 2.02768054 2.25589071 2.50868309 2.78898621 1.6312208
 1.60034826 1.89171341 1.95175734 3.10038853 1.44321933 1.06824428
 4.8340822  2.81955038 3.44609244 1.13984092 4.86200633 4.73039236
 4.89368464 3.83000398 0.65329565 4.4304419  5.2566311  4.73027429
 4.2565327 ]
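
The final policy and state values printed above are both read off the learned action values: the policy is the one-hot encoding of the greedy action in each state, and the state value is the largest action value in each state. A minimal sketch of that step, using a small made-up `action_value` table (the numbers are illustrative, not results from this run):

```python
import numpy as np

# Hypothetical action values: 2 states, 3 actions.
action_value = np.array([
    [0.1, 0.7, 0.2],
    [0.9, 0.3, 0.4],
])
n_actions = action_value.shape[1]

# Greedy action per state, then one-hot rows as in the notebook's policy matrix.
greedy_actions = np.argmax(action_value, axis=1)
policy = np.eye(n_actions)[greedy_actions]

# Under a greedy policy, the state value is the largest action value.
state_value = np.max(action_value, axis=1)

print(policy)
print(state_value)
```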
