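- This series restarts from the basic theory of reinforcement learning (RL)
    - based on https://gibberblot.github.io/rl-notes/intro.html
- Later it will revisit deep reinforcement learning (DRL)
    - https://github.com/huggingface/deep-rl-class/tree/main
- Finally it will connect RL with LLMs, where we look at trl:
    - https://github.com/huggingface/trl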

In [1]:
from IPython.display import Image

## basics

$$
\mathcal M=(S, A, P, r, \gamma)
$$

- Some references write it as $(S, s_0, A, P, r, \gamma)$
- $S$: state space; $s_0\in S$: initial state
    - `s0 = mdp.reset()`
- $A$: action space; $A(s)\subseteq A$ is the set of actions applicable in state $s$
- $P_a(s'|s)$: transition probabilities, defined for $s\in S, a \in A(s)$
    - also written $P(s'|s,a)$
    - note that the model is probabilistic (a probabilistic state model), not deterministic;
    - with deterministic transitions it reduces to a classical sequential decision (planning) problem;
- $r(s,a,s')$: reward function, which can be positive or negative (much of designing an MDP env is designing the reward function)
    - $r(s,a)$ suffices in the deterministic case, since $s'$ is then fixed by $(s,a)$;
- $\gamma$: discount factor,
    - $0\leq \gamma < 1$
    - this kind of MDP is usually called a discounted-reward MDP;
- Solving an MDP is a sequential decision-making problem
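
To make the tuple concrete, here is a minimal sketch of an MDP written as plain Python data. The two-state "cool/hot" example, its probabilities, its rewards, and the helper names (`STATES`, `ACTIONS`, `P`, `reward`, `step`) are all made up for illustration and are not taken from the rl-notes code.

```python
import random

# Hypothetical two-state MDP; every name and number here is illustrative.
STATES = ["cool", "hot"]                                # S
ACTIONS = {"cool": ["slow", "fast"], "hot": ["slow"]}   # A(s)
GAMMA = 0.9                                             # discount factor

# P[(s, a)] is a list of (s', probability) pairs, i.e. P_a(s'|s).
P = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
    ("hot", "slow"): [("cool", 0.5), ("hot", 0.5)],
}


def reward(s, a, s_next):
    """r(s, a, s'): penalize ending up hot, pay more for going fast."""
    if s_next == "hot":
        return -1.0
    return 2.0 if a == "fast" else 1.0


def step(s, a, rng):
    """Sample s' from P_a(.|s) and return (s', r(s, a, s'))."""
    next_states = [s_next for s_next, _ in P[(s, a)]]
    probs = [p for _, p in P[(s, a)]]
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, reward(s, a, s_next)
```

Calling `step("cool", "fast", random.Random(0))` returns one sampled next state together with its reward; each row of `P` must sum to 1 for this to be a valid transition model.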

### probabilistic state model

- Tossing a coin: heads ($\frac12$), tails ($\frac12$)
- Sum of two dice: 2 ($\frac{1}{36}$), 3 ($\frac{2}{36}$), 4 ($\frac{3}{36}$), ..., 12 ($\frac1{36}$)
- A robot arm picking up an object: success ($\frac{4}5$), failure ($\frac15$)
- Opening a web page: 404 with probability 1%, 200 with probability 99%;
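
The dice example can be checked by enumerating all 36 equally likely outcomes. The sketch below uses only the standard library; the function name `two_dice_sum_distribution` is chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_sum_distribution():
    """Exact distribution of the sum of two fair six-sided dice."""
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}


dist = two_dice_sum_distribution()
```

`Fraction` reduces automatically, so `dist[3]` is stored as 1/18 and `dist[4]` as 1/12, and the eleven probabilities sum to exactly 1.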

### discounted reward

- If the agent, while interacting with the environment (s, a, s', a', ...), receives a sequence of rewards $r_1, r_2, \cdots$, then its discounted return is

$$
\begin{split}
V&=\gamma^0r_1+\gamma r_2 + \gamma^2 r_3+\gamma^3 r_4+\cdots\\
&=r_1+\gamma (r_2+\gamma(r_3 + \gamma(r_4+\cdots )))
\end{split}
$$


$$
V_t=r_t+\gamma V_{t+1}
$$

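- The definition is recursive and exposes the subproblem structure (the relation between $V_t$ and $V_{t+1}$, as in dynamic programming)
- The value $V_{t+1}$ is discounted back to the present as $\gamma V_{t+1}$;
- Because we maximize the discounted reward, $\gamma < 1$ implicitly favors shorter paths to a positive reward (from another angle, it is as if every action had a cost)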
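
As a small sketch of the two equivalent forms above, the direct sum and the recursion $V_t=r_t+\gamma V_{t+1}$ can be compared on a finite reward list. The function names and the sample rewards are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Direct sum: gamma^0 * r_1 + gamma^1 * r_2 + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def discounted_return_recursive(rewards, gamma):
    """Backward recursion V_t = r_t + gamma * V_{t+1}, with V = 0 after the last step."""
    v = 0.0
    for r in reversed(rewards):
        v = r + gamma * v
    return v


# A goal reward of 1 reached after 3 steps versus after 5 steps.
short_path = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
long_path = discounted_return([0.0, 0.0, 0.0, 0.0, 1.0], gamma=0.9)
```

With $\gamma = 0.9$ the shorter path is worth $0.9^2$ and the longer one $0.9^4$, which is the sense in which discounting prefers reaching the reward sooner.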

## grid world

In [2]:
Image(url='https://gibberblot.github.io/rl-notes/_images/ac08de56caab98830b830b1068f4c5b87881b518b6e0e0ba02e062171bfc21e3.png')

```
from collections import defaultdict


class GridWorld(MDP):

    ...

    def get_transitions(self, state, action):
        transitions = []

        if state == self.TERMINAL:
            if action == self.TERMINATE:
                return [(self.TERMINAL, 1.0)]
            else:
                return []

        # Probability of not slipping left or right
        straight = 1 - (2 * self.noise)

        (x, y) = state
        if state in self.get_goal_states().keys():
            if action == self.TERMINATE:
                transitions += [(self.TERMINAL, 1.0)]

        elif action == self.UP:
            transitions += self.valid_add(state, (x, y + 1), straight)
            transitions += self.valid_add(state, (x - 1, y), self.noise)
            transitions += self.valid_add(state, (x + 1, y), self.noise)

        elif action == self.DOWN:
            transitions += self.valid_add(state, (x, y - 1), straight)
            transitions += self.valid_add(state, (x - 1, y), self.noise)
            transitions += self.valid_add(state, (x + 1, y), self.noise)

        elif action == self.RIGHT:
            transitions += self.valid_add(state, (x + 1, y), straight)
            transitions += self.valid_add(state, (x, y - 1), self.noise)
            transitions += self.valid_add(state, (x, y + 1), self.noise)

        elif action == self.LEFT:
            transitions += self.valid_add(state, (x - 1, y), straight)
            transitions += self.valid_add(state, (x, y - 1), self.noise)
            transitions += self.valid_add(state, (x, y + 1), self.noise)

        # Merge any duplicate outcomes
        merged = defaultdict(lambda: 0.0)
        for (state, probability) in transitions:
            merged[state] = merged[state] + probability

        transitions = []
        for outcome in merged.keys():
            transitions += [(outcome, merged[outcome])]

        return transitions

    def valid_add(self, state, new_state, probability):
        # A zero-probability outcome contributes nothing
        if probability == 0.0:
            return []

        # If the next state is blocked, stay in the same state
        if new_state in self.blocked_states:
            return [(state, probability)]

        # Move to the next space if it is not off the grid
        (x, y) = new_state
        if x >= 0 and x < self.width and y >= 0 and y < self.height:
            return [((x, y), probability)]

        # If off the grid, stay in the same state
        return [(state, probability)]

    def get_reward(self, state, action, new_state):
        reward = 0.0
        if state in self.get_goal_states().keys() and new_state == self.TERMINAL:
            reward = self.get_goal_states().get(state)
        else:
            reward = self.action_cost
        step = len(self.episode_rewards)
        self.episode_rewards += [reward * (self.discount_factor ** step)]
        return reward
```
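
The slip-noise logic in `get_transitions` can be tried out without the `MDP` base class. The following is a standalone sketch of the same idea, not the rl-notes implementation: the function `grid_transitions`, the string action names, and the default 4x3 grid with one blocked cell at (1, 1) are assumptions made for illustration, and terminal and goal handling is left out.

```python
from collections import defaultdict

# Direction vectors, and the two perpendicular "slip" directions per action.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SLIPS = {
    "up": ("left", "right"),
    "down": ("left", "right"),
    "left": ("down", "up"),
    "right": ("down", "up"),
}


def grid_transitions(state, action, noise=0.1, width=4, height=3, blocked=((1, 1),)):
    """Return {next_state: probability} for a noisy move on a small grid."""

    def target(direction):
        # Moving into a wall or a blocked cell leaves the agent where it is.
        dx, dy = MOVES[direction]
        x, y = state[0] + dx, state[1] + dy
        inside = 0 <= x < width and 0 <= y < height
        return (x, y) if inside and (x, y) not in blocked else state

    outcomes = [(target(action), 1 - 2 * noise)]
    outcomes += [(target(slip), noise) for slip in SLIPS[action]]

    # Merge duplicate outcomes and drop zero-probability ones.
    merged = defaultdict(float)
    for next_state, probability in outcomes:
        if probability > 0.0:
            merged[next_state] += probability
    return dict(merged)
```

For example, `grid_transitions((0, 1), "up")` moves to (0, 2) with the straight probability, while both slips (off the grid to the left, into the blocked cell on the right) keep the agent at (0, 1), so their probabilities are merged into one entry.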

## Relation to POMDP and I-POMDP


$$
(S, \Omega, O, A, P, r, \gamma)
$$

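- POMDP: partially observable MDP
    - the state is no longer fully observable, as an MDP assumes by default (in grid world the state is simply the agent's own position, (x, y))
    - $s\in S$: states
    - $A(s)\subseteq A$
    - $P_a(s'|s)$
    - $r(s,a,s')$
    - $\gamma$
    - observation space $\Omega$
    - $O(o|s), o\in \Omega$: observation function
    - initial belief state $b_0$
- I-POMDP: for multi-agent environments; the I stands for interactive;
    - the state model covers not only the physical states but also the belief state of the other agent;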
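
Since a POMDP agent only sees observations, it tracks a belief state and updates it after each action and observation. Below is a minimal sketch of the standard Bayes-filter update $b'(s') \propto O(o|s')\sum_s P_a(s'|s)\,b(s)$, using the notation above with an observation function that depends only on the state. The function `belief_update`, the dictionary layout, and the two-door "listen" example are assumptions made for illustration.

```python
def belief_update(belief, action, observation, P, O):
    """One Bayes-filter step: b'(s') is proportional to O(o|s') * sum_s P_a(s'|s) * b(s).

    belief: {s: probability}
    P:      {(s, a): {s': probability}}   transition model P_a(s'|s)
    O:      {s: {o: probability}}         observation model O(o|s)
    """
    unnormalized = {}
    for s_next in belief:
        predicted = sum(
            P[(s, action)].get(s_next, 0.0) * b for s, b in belief.items()
        )
        unnormalized[s_next] = O[s_next].get(observation, 0.0) * predicted

    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical example: something hides behind the left or right door;
# "listen" does not change the state and reports the correct side 85% of the time.
P = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
O = {
    "left": {"hear-left": 0.85, "hear-right": 0.15},
    "right": {"hear-left": 0.15, "hear-right": 0.85},
}
b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear-left", P, O)
```

Starting from the uniform $b_0$, one "hear-left" observation shifts the belief toward the left door; repeating the update with further observations sharpens or corrects it.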