# Finite Markov Decision Processes

In this chapter we introduce the formal problem of finite Markov decision processes, or finite MDPs, which we try to solve in the rest of the book. This problem involves evaluative feedback, as in bandits, but also an associative aspect: choosing different actions in different situations. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s,a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections. These state-dependent quantities are essential to accurately assigning credit for long-term consequences to individual action selections.\
在本章中，我们将介绍有限马尔可夫决策过程（finite Markov decision processes，简称MDPs）的形式问题，我们将在本书的其余部分尝试解决这个问题。这个问题涉及到评价反馈，就像在老虎机中一样，但也涉及到associative方面——在不同的情况下选择不同的行动。MDPs是序贯决策的经典形式，其中行动不仅影响即时的reward，还影响后续的情况或状态，并通过这些未来的reward。因此，MDPs涉及延迟reward以及权衡即时和延迟奖励的需要。在bandit问题中，我们估计了每个动作$a$的value $q_*(a)$，而在MDPs中，我们估计了每个动作$a$在每个状态$s$下的value $q_*(s,a)$，或者在给定最优动作选择的情况下估计了每个状态$v_*(s)$。这些依赖于状态的数量对于准确地为个人行为选择的长期结果分配信用至关重要。

MDPs are a mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made. We introduce key elements of the problem’s mathematical structure, such as returns, value functions, and Bellman equations. We try to convey the wide range of applications that can be formulated as finite MDPs. As in all of artificial intelligence, there is a tension between breadth of applicability and mathematical tractability. In this chapter we introduce this tension and discuss some of the trade-offs and challenges that it implies. Some ways in which reinforcement learning can be taken beyond MDPs are treated in Chapter 17.\
MDPs是一种数学上理想化的强化学习问题，可以对其进行精确的理论陈述。我们将介绍问题的数学结构的关键元素，如返回值、值函数和贝尔曼方程。我们试图传达广泛的应用，可以制定为有限的MDPs。就像在所有的人工智能中一样，在适用性的广度和数学上的可处理性之间存在着一种张力。在本章中，我们将介绍这种张力，并讨论它所隐含的一些权衡和挑战。强化学习可以在MDPs之外采用的一些方法在第17章中讨论。

## The Agent–Environment Interface

In a finite MDP, the sets of states, actions, and rewards ($\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$) all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well defined discrete probability distributions dependent only on the preceding state and action. That is, for particular values of these random variables, $s' \in \mathcal{S}$ and $r \in \mathcal{R}$, there is a probability of those values occurring at time $t$, given particular values of the preceding state and action:\
在有限MDP中，状态、动作和奖励的集（$\mathcal{S}$，$\mathcal{A}$和$\mathcal{R}$）都有有限数量的元素。在这种情况下，随机变量$R_t$和$S_t$具有定义良好的离散概率分布，仅依赖于前面的状态和动作。也就是说，对于这些随机变量的特定值，即$s' \in \mathcal{S}$和$r \in \mathcal{R}$，在给定前一个状态和动作的特定值的情况下，这些值有可能出现在时间$t$：
$$
p(s',r|s,a) = \Pr\{S_t = s', R_t = r|S_{t-1} = s, A_{t-1} = a\}
\tag{3.2}
$$
for all $s, s' \in \mathcal{S}$, $r \in \mathcal{R}$ and $a \in \mathcal{A}(s)$. 

The function $p$ defines the __dynamics__ of the MDP.
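
As a brief worked note on what this definition implies: since $p$ specifies a probability distribution over next-state and reward pairs for each choice of $s$ and $a$, it must sum to 1 over those pairs. The two derived quantities used later in the transition-graph discussion, the state-transition probability $p(s'|s,a)$ and the expected reward $r(s,a,s')$, can be obtained from $p$ by summing over rewards (the second is defined only when $p(s'|s,a) > 0$):
$$
\begin{align}
\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s',r|s,a) &= 1, \qquad \text{for all } s \in \mathcal{S}, a \in \mathcal{A}(s) \\
p(s'|s,a) &= \sum_{r \in \mathcal{R}} p(s',r|s,a) \\
r(s,a,s') &= \sum_{r \in \mathcal{R}} r \, \frac{p(s',r|s,a)}{p(s'|s,a)}
\end{align}
$$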

**<font color = green>Example 3.3</font> Recycling Robot**\
A mobile robot has the job of collecting empty soda cans in an office environment. It has sensors for detecting cans, and an arm and gripper that can pick them up and place them in an onboard bin; it runs on a rechargeable battery. The robot’s control system has components for interpreting sensory information, for navigating, and for controlling the arm and gripper. High-level decisions about how to search for cans are made by a reinforcement learning agent based on the current charge level of the battery. To make a simple example, we assume that only two charge levels can be distinguished, comprising a small state set $\mathcal{S} = \{high, low\}$. In each state, the agent can decide whether to (1) actively $search$ for a can for a certain period of time, (2) remain stationary and $wait$ for someone to bring it a can, or (3) head back to its home base to $recharge$ its battery. When the energy level is high, recharging would always be foolish, so we do not include it in the action set for this state. The action sets are then $\mathcal{A}(high) = \{search, wait\}$ and $\mathcal{A}(low) = \{search, wait, recharge\}$.\
一个移动机器人的工作是在办公环境中收集空饮料罐。它有探测罐头的传感器，一个手臂和抓手，可以把它们捡起来，并把它们放在机载的箱子里；它使用可充电电池。机器人的控制系统包含了解读感官信息、导航以及控制手臂和抓手的部件。基于电池当前的充电水平，强化学习代理对如何搜索罐进行高级决策。举个简单的例子，我们假设只有两个电量水平可以区分，包括一个小的状态集$\mathcal{S} = \{high, low\}$。在每种状态下，agent可以决定是(1)在一段时间内主动$search$易拉罐，(2)静止不动$wait$别人给它送来易拉罐，还是(3)回到它的基地给它的电池$recharge$。当能量级别很高时，充电总是愚蠢的，所以我们不把它包含在这个状态的动作集中。然后动作集是$\mathcal{A}(high) = \{search, wait\}$和$\mathcal{A}(low) = \{search, wait, recharge\}$。

The rewards are zero most of the time, but become positive when the robot secures an empty can, or large and negative if the battery runs all the way down. The best way to find cans is to actively search for them, but this runs down the robot’s battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In this case the robot must shut down and wait to be rescued (producing a low reward). If the energy level is high, then a period of active search can always be completed without risk of depleting the battery. A period of searching that begins with a high energy level leaves the energy level high with probability $\alpha$ and reduces it to low with probability $1 − \alpha$. On the other hand, a period of searching undertaken when the energy level is low leaves it low with probability $\beta$ and depletes the battery with probability $1 − \beta$. In the latter case, the robot must be rescued, and the battery is then recharged back to high. Each can collected by the robot counts as a unit reward, whereas a reward of −3 results whenever the robot has to be rescued. Let $r_{search}$ and $r_{wait}$, with $r_{search} > r_{wait}$, respectively denote the expected number of cans the robot will collect (and hence the expected reward) while searching and while waiting. Finally, suppose that no cans can be collected during a run home for recharging, and that no cans can be collected on a step in which the battery is depleted. This system is then a finite MDP, and we can write down the transition probabilities and the expected rewards, with dynamics as indicated in the table on the left:\
大多数情况下，奖励为零，但当机器人锁定空罐时，奖励变为正，如果电池电量耗尽，奖励变为大而负。寻找易拉罐的最佳方法是主动寻找，但这将耗尽机器人的电池，而等待则不会。每当机器人进行搜索时，它的电池就有耗尽的可能。在这种情况下，机器人必须关闭并等待救援（产生较低的奖励）。如果能量水平很高，那么总可以在不耗尽电池的情况下完成一段时间的主动搜索。从一个高能级开始的一段时间搜索，将能级以$\alpha$的概率保持在高能级，并以$1−\alpha$的概率将其降低到低能级。另一方面，当能量水平较低时进行一段时间的搜索，有可能使$\beta$处于低水平，并有可能以$1−\beta$耗尽电池。在后一种情况下，机器人必须被拯救，然后电池被重新充电到高。机器人收集到的每一个罐头都被视为一个单位奖励，而当机器人需要被拯救时，奖励结果为−3。设$r_{search}$和$r_{wait}$，其中$r_{search} > r_{wait}$分别表示机器人在搜索和等待时期望收集的罐头数（即期望的奖励）。最后，假设在回家充电的过程中无法收集到易拉罐，在电池耗尽的步骤中也无法收集到易拉罐。这个系统是一个有限的MDP，我们可以写下转移概率和期望回报，dynamics如左表所示：

Note that there is a row in the table for each possible combination of current state, $s$, action, $a \in \mathcal{A}(s)$, and next state, $s'$. Some transitions have zero probability of occurring, so no expected reward is specified for them. Shown on the right is another useful way of summarizing the dynamics of a finite MDP, as a transition graph. There are two kinds of nodes: state nodes and action nodes. There is a state node for each possible state (a large open circle labeled by the name of the state), and an action node for each state–action pair (a small solid circle labeled by the name of the action and connected by a line to the state node). Starting in state $s$ and taking action $a$ moves you along the line from state node $s$ to action node $(s, a)$. Then the environment responds with a transition to the next state’s node via one of the arrows leaving action node $(s, a)$. Each arrow corresponds to a triple $(s, s', a)$, where $s'$ is the next state, and we label the arrow with the transition probability, $p(s'|s, a)$, and the expected reward for that transition, $r(s, a, s')$. Note that the transition probabilities labeling the arrows leaving an action node always sum to 1.\
请注意，表中有一行用于当前状态$s$、动作$a \in \mathcal{A}(s)$和下一个状态$s'$的每个可能组合。有些转变发生的概率为零，所以没有为它们指定预期的奖励。在右边显示的是另一种有用的方法来总结有限MDP的dynamics，作为一个过渡图。有两种节点：状态节点和操作节点。每个可能的状态都有一个状态节点（一个用状态名称标记的大开放圆），每个状态-操作对都有一个操作节点（一个用动作名称标记的小实心圆，用一条线连接到状态节点）。从状态$s$开始并采取操作$a$将使您从状态节点$s$移动到操作节点$(s, a)$。然后环境通过离开操作节点$(s, a)$的一个箭头响应到下一个状态的节点。每个箭头对应一个三重$(s, s', a)$，其中$s'$是下一个状态，我们用转移概率$p(s'|s, a)$和该转移的预期回报$r(s, a, s')$标记箭头。注意，标记离开动作节点箭头的转移概率总和总是为1。
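
The dynamics described in the prose above can be written down as a small lookup structure. The following is only a sketch under stated assumptions: the numeric values of $\alpha$, $\beta$, $r_{search}$ and $r_{wait}$ are hypothetical, the function names are made up for illustration, and each expected reward is treated as if it were a single deterministic reward so that the joint $p(s',r|s,a)$ has one reward per transition.

```python
def recycling_robot_dynamics(alpha, beta, r_search, r_wait):
    """Map each (state, action) pair to a list of (next_state, reward, probability).

    Simplifying assumption: the expected rewards r_search and r_wait are
    used as if they were deterministic rewards.
    """
    return {
        ("high", "search"): [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
        ("high", "wait"): [("high", r_wait, 1.0)],
        # Searching on a low battery either succeeds or ends in a rescue (reward -3).
        ("low", "search"): [("low", r_search, beta), ("high", -3.0, 1 - beta)],
        ("low", "wait"): [("low", r_wait, 1.0)],
        ("low", "recharge"): [("high", 0.0, 1.0)],
    }


def transition_prob(p, s, a, s_next):
    """p(s'|s,a): the joint probability summed over rewards."""
    return sum(prob for (sn, _, prob) in p[(s, a)] if sn == s_next)


def expected_reward(p, s, a, s_next):
    """r(s,a,s'): expected reward for a transition, or None if it cannot occur."""
    mass = transition_prob(p, s, a, s_next)
    if mass == 0:
        return None
    return sum(r * prob for (sn, r, prob) in p[(s, a)] if sn == s_next) / mass


# Hypothetical parameter values, chosen only for illustration.
p = recycling_robot_dynamics(alpha=0.8, beta=0.6, r_search=2.0, r_wait=1.0)

# The probabilities on the arrows leaving each action node sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for (_, _, prob) in outcomes) - 1.0) < 1e-12
```

Transitions that cannot occur, such as `("high", "wait")` leading to `"low"`, are simply absent from the lists, which mirrors the table rows that carry no expected reward.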

## Goals and Rewards

In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special signal, called the **reward**, passing from the environment to the agent. At each time step, the reward is a simple number, $R_t \in \mathbb{R}$. Informally, the agent’s goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run. We can clearly state this informal idea as the reward hypothesis:\
在强化学习中，agent的目的或目标被形式化为一种从环境传递给agent的特殊信号，称为**reward**。在每个time step，reward是一个简单的数字，$R_t \in \mathbb{R}$。非正式地说，agent的目标是使其获得的总reward最大化。这意味着最大化的不是即时reward，而是长期累积reward。我们可以将这个非正式的观点清晰地表述为reward假说：

> That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).\
我们所说的目标和目的都可以很好地理解为对接收到的标量信号（称为reward）的累积和的期望值的最大化。

The use of a reward signal to formalize the idea of a goal is one of the most distinctive features of reinforcement learning.\
使用reward信号来形式化目标的概念是强化学习最显著的特征之一。

___
**<font color = blue>Definition</font> Reward**\
At each time step, the **reward** is a simple number, $R_t \in \mathbb{R}$.\
在每个time step，**reward**是一个简单的数字，$R_t \in \mathbb{R}$。
___
___
**<font color = blue>Definition</font> Goal (目标)**\
Informally, the agent’s **goal** is to maximize the total amount of reward it receives.\
非正式地说，agent的**目标（goal）**是使其获得的总reward最大化。
___

Although formulating goals in terms of reward signals might at first appear limiting, in practice it has proved to be flexible and widely applicable. The best way to see this is to consider examples of how it has been, or could be, used.\
尽管从reward信号的角度制定目标一开始可能显得有限，但在实践中，它被证明是灵活和广泛适用的。要了解这一点，最好的方法是考虑它是如何被使用的，或者可能被使用的例子。

___
**<font color = green>Example</font> Make a robot learn to walk**\
For example, to make a robot learn to walk, researchers have provided reward on each time step proportional to the robot’s forward motion. In making a robot learn how to escape from a maze, the reward is often -1 for every time step that passes prior to escape; this encourages the agent to escape as quickly as possible. To make a robot learn to find and collect empty soda cans for recycling, one might give it a reward of zero most of the time, and then a reward of +1 for each can collected. One might also want to give the robot negative rewards when it bumps into things or when somebody yells at it.\
例如，为了让机器人学会走路，研究人员对机器人前进的每一步都给予相应的奖励。在让机器人学会如何逃离迷宫的过程中，每走一步就会得到-1的奖励;这鼓励代理尽可能快地逃离。为了让机器人学会寻找和收集空汽水罐进行回收利用，人们可以在大多数情况下给它零奖励，然后每收集一个空汽水罐就给予+1奖励。当机器人撞到东西或者有人对它大喊大叫时，你也可以给它一些负面奖励。
___

___
**<font color = green>Example</font> Make a robot learn to play checkers or chess**\
For an agent to learn to play checkers or chess, the natural rewards are +1 for winning, -1 for losing, and 0 for drawing and for all nonterminal positions.\
对于一个学习玩跳棋或象棋的代理来说，自然的奖励是+1赢，-1输，0和棋以及所有非终止的位置。
___

## Returns and Episodes

### Return

So far we have discussed the objective of learning informally. We have said that the agent’s goal is to maximize the cumulative reward it receives in the long run. How might this be defined formally? If the sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3}, \cdots$, then what precise aspect of this sequence do we wish to maximize? In general, we seek to maximize the **expected return**, where the return, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:\
到目前为止，我们已经非正式地讨论了学习的目的。我们已经说过，agent的目标是最大化其在长期内获得的累积reward。这应该如何正式定义？如果在时间步长$t$之后收到的reward序列表示为$R_{t+1}, R_{t+2}, R_{t+3}, \cdots$，那么我们希望最大化这个序列的哪个精确方面呢？一般来说，我们寻求最大化**expected return**，其中return，记为$G_t$，被定义为reward序列的某个特定函数。在最简单的情况下，return是reward的总和:
$$
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_{T} = \sum_{k=t+1}^T R_k
\tag{3.7}
\label{Eq 3.7}
$$
where $T$ is a final time step.\
其中，$T$是最后的time step。

___
**<font color = blue>Definition</font> Return**\
**Return**, denoted $G_t$, is defined as some specific function of the reward sequence.\
**Return**，记为$G_t$，被定义为reward序列的某个特定函数。
___

### Episodic Task

This approach makes sense in applications in which there is a natural notion of final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call **episodes**, such as plays of a game, trips through a maze, or any sort of repeated interaction. Each episode ends in a special state called the **terminal state**, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Even if you think of episodes as ending in different ways, such as winning and losing a game, the next episode begins independently of how the previous one ended. Thus the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes.\
这种方法适用于具有最终time step的自然概念的应用，也就是说，当agent-环境交互作用自然地分解成我们称之为**episode**的子序列时，如玩游戏、穿越迷宫或任何类型的重复交互。每一episode都以一种称为**terminal状态**的特殊状态结束，然后重置到一个标准开始状态或从一个标准开始状态分布的样本结束。即使你认为episode以不同的方式结束，如游戏的胜利和失败，下一episode的开始与前一episode的结束是独立的。因此，这些episode都可以被认为以相同的terminal状态结束，不同的结果会得到不同的reward。

Tasks with episodes of this kind are called **episodic tasks**. In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted $\mathcal{S}$, from the set of all states plus the terminal state, denoted $\mathcal{S}^+$. The time of termination, $T$, is a random variable that normally varies from episode to episode.\
有这种episode的任务叫做**episodic任务**。在episodic任务中，我们有时需要区分所有nonterminal状态的集合（记为$\mathcal{S}$）和所有状态加上terminal状态的集合（记为$\mathcal{S}^+$）。termination时间$T$是一个随机变量，通常因episode而异。

### Continuing Task

On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate an on-going process-control task, or an application to a robot with a long life span. We call these **continuing tasks**. The return formulation $\ref{Eq 3.7}$ is problematic for continuing tasks because the final time step would be $T = \infty$, and the return, which is what we are trying to maximize, could itself easily be infinite. (For example, suppose the agent receives a reward of +1 at each time step.) Thus, in this book we usually use a definition of return that is slightly more complex conceptually but much simpler mathematically.\
另一方面，在许多情况下，agent-环境的交互作用不会自然地分解成可识别的episode，而是无限地持续下去。例如，这将是一种自然的方式来制定一个正在进行的过程控制任务，或一个具有较长寿命机器人的一个应用程序。我们称这些为**continuing任务**。return公式$\ref{Eq 3.7}$对于continuing任务来说是有问题的，因为最后的时间步是$T = \infty$，而我们试图最大化的return本身很容易是无限的。（例如，假设agent在每个时间步长获得+1的奖励。）因此，在这本书中，我们通常使用一个概念上稍微复杂一点，但数学上要简单得多的return定义。

#### Discounting

The additional concept that we need is that of **discounting**. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected **discounted return**:\
我们需要的另一个概念是**discounting**。根据这种方法，agent试图选择行动，使其在未来获得的discounted rewards的总和最大化。特别是，它选择$A_t$来最大化expected **discounted return**:
$$
\begin{align}
G_t 
= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
&= \sum_{k=0}^\infty \gamma^{k} R_{t+k+1} \\
&= \sum_{k=1}^\infty \gamma^{k-1} R_{t+k}
\end{align}
\tag{3.8}
\label{Eq 3.8}
$$
where $\gamma$ is a parameter, $0 \leq \gamma \leq 1$, called the **discount rate**.\
其中$\gamma$是一个参数，$0 \leq \gamma \leq 1$称为**discount rate**。

The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately.\
discount rate决定了未来reward的现值：在未来$k$ time steps收到的reward价值仅为$\gamma^{k-1}$乘以立即收到的reward价值。

___
**<font color = purple>Remark</font> Discount Rate**
* That is, the reward received $k$ time steps in the future, $R_{t+k}$, contributes $\gamma^{k-1} R_{t+k}$ to the return.\
即未来第$k$个time step收到的reward $R_{t+k}$对return的贡献为$\gamma^{k-1} R_{t+k}$。
* If $\gamma < 1$, the infinite sum in $\ref{Eq 3.8}$ has a finite value as long as the reward sequence $\{R_k\}$ is bounded.\
如果$\gamma < 1$，只要reward序列$\{R_k\}$有界，$\ref{Eq 3.8}$中的无限和就有一个有限值。
* If $\gamma = 0$, the agent is “myopic [maɪˈɒpɪk]” in being concerned only with maximizing immediate rewards: its objective in this case is to learn how to choose $A_t$ so as to maximize only $R_{t+1}$.\
如果$\gamma = 0$，那么agent只关心即时reward最大化是“短视的（myopic）”：在这种情况下，它的目标是学习如何选择$A_t$，从而只使$R_{t+1}$最大化。
* As $\gamma$ approaches 1, the return objective takes future rewards into account more strongly; the agent becomes more farsighted.\
当$\gamma$接近1时，return目标更强烈地考虑未来的回报；agent变得更有远见了。
___

If each of the agent’s actions happened to influence only the immediate reward, not future rewards as well, then a myopic agent could maximize $\ref{Eq 3.8}$ by separately maximizing each immediate reward. But in general, acting to maximize immediate reward can reduce access to future rewards so that the return is reduced.\
如果agent的每一个行为都只影响眼前的reward，而不影响未来的reward，那么一个短视的agent可以通过分别最大化每个即时reward来最大化$\ref{Eq 3.8}$。但一般来说，最大化即时reward会减少获得未来reward的机会，从而降低return。
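
To make the effect of the discount rate concrete, here is a minimal sketch that evaluates the discounted return for a finite reward sequence. The function name and the reward values are hypothetical, chosen only so that a large reward arrives late.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * R_{t+k+1} over a finite list [R_{t+1}, R_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical rewards R_{t+1}, ..., R_{t+4}: a small reward now, a large one later.
rewards = [1.0, 0.0, 0.0, 10.0]

g_myopic = discounted_return(rewards, 0.0)        # only R_{t+1} counts
g_half = discounted_return(rewards, 0.5)          # the late reward is weighted by 0.5**3
g_undiscounted = discounted_return(rewards, 1.0)  # plain sum of the rewards
```

With $\gamma = 0$ the late reward is ignored entirely, while with $\gamma = 1$ it counts in full, which is the sense in which larger discount rates make the agent more farsighted.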

#### Properties of Returns at Successive Time Steps

Returns at successive time steps are related to each other in a way that is important for the theory and algorithms of reinforcement learning:\
连续的time steps上的return是相互关联的，这对强化学习的理论和算法很重要:
$$
\begin{align}
G_t 
&= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \\
&= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots) \\
&= R_{t+1} + \gamma G_{t+1}
\end{align}
\tag{3.9}
\label{Eq 3.9}
$$
Note that this works for all time steps $t < T$, even if termination occurs at $t + 1$, if we define $G_T = 0$. This often makes it easy to compute returns from reward sequences.\
注意，如果我们定义$G_T = 0$，那么这适用于所有time steps $t < T$，即使termination发生在$t + 1$。这通常会使计算reward序列的return变得容易。
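
The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ suggests computing all returns of an episode in one backward pass. The sketch below assumes a finite episode stored as a Python list; the function name and the reward values are hypothetical.

```python
def returns_from_rewards(rewards, gamma):
    """Given [R_1, ..., R_T], return [G_0, ..., G_{T-1}] via G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T = 0 by definition
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # rewards[t] holds R_{t+1}
        returns[t] = g
    return returns


# Hypothetical episode with T = 5 and gamma = 0.5.
episode_rewards = [2.0, 6.0, 3.0, 4.0, 2.0]
returns = returns_from_rewards(episode_rewards, 0.5)
```

Working backward from the end of the episode needs one multiplication and one addition per time step, instead of re-summing the tail of the reward sequence for every $t$.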

Note that although the return $\ref{Eq 3.8}$ is a sum of an infinite number of terms, it is still finite when the reward is nonzero and constant, provided that $\gamma < 1$.\
请注意，虽然return $\ref{Eq 3.8}$是一个无限项的和，但当reward为非零常量时，只要$\gamma < 1$，它仍然是有限的。\
For example, if the reward is a constant +1, then the return is\
例如，如果奖励是常量+1，那么return就是
$$
G_t 
= \sum_{k=0}^\infty \gamma^{k} 
= \frac{1}{1 - \gamma}
\tag{3.10}
\label{Eq 3.10}
$$
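
As a worked instance, with a constant reward of +1 and $\gamma = 0.9$ the return is $1 / (1 - 0.9) = 10$. The same geometric series gives a bound whenever the rewards are bounded, which is the reason the discounted return is finite for $\gamma < 1$. Here $R_{\max}$ is an assumed symbol, not used elsewhere in these notes, for any bound satisfying $|R_k| \leq R_{\max}$ for all $k$:
$$
|G_t|
\leq \sum_{k=0}^\infty \gamma^{k} |R_{t+k+1}|
\leq R_{\max} \sum_{k=0}^\infty \gamma^{k}
= \frac{R_{\max}}{1 - \gamma}
$$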

## Unified Notation for Episodic and Continuing Tasks

## Policies and Value Functions

### Value Function

Almost all reinforcement learning algorithms involve estimating **value functions**—functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of “how good” here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.\
几乎所有的强化学习算法都涉及估计**值函数（value functions）**——状态（或状态-动作对）的函数，它估计agent处于给定状态有多好（或在给定状态下执行给定动作有多好）。这里“有多好”的概念是根据可以预期的未来reward来定义的，或者，准确地说，根据expected return来定义的。当然，agent期望在未来获得的reward取决于它将采取什么行动。因此，值函数是根据特定的行为方式（称为policy）定义的。

___
**<font color = blue>Definition</font> Value Function (值函数)**\
**Value functions** are functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).\
**值函数（value function）**是状态（或状态-动作对）的函数，估计agent处于给定状态有多好（或在给定状态下执行给定动作有多好）。
___

### Policy

Formally, a **policy** is a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability that $A_t = a$ if $S_t = s$. Like $p$, $\pi$ is an ordinary function; the “|” in the middle of $\pi(a|s)$ merely reminds us that it defines a probability distribution over $a \in \mathcal{A}(s)$ for each $s \in \mathcal{S}$. Reinforcement learning methods specify how the agent’s policy is changed as a result of its experience.\
形式上，一个**策略（policy）**是从状态到选择每个可能动作的概率的映射。如果agent在时间$t$遵循策略$\pi$，则$\pi(a|s)$为如果$S_t = s$，则$A_t = a$的概率。像$p$一样，$\pi$是一个普通函数；$\pi(a|s)$中间的“|”只是提醒我们，它为每个$s \in \mathcal{S}$定义了在$a \in \mathcal{A}(s)$上的概率分布。强化学习方法指定agent的策略如何因其经验而改变。

___
**<font color = blue>Definition</font> Policy (策略)**\
Formally, a **policy** is a mapping from states to probabilities of selecting each possible action.\
形式上，一个**策略（policy）**是从状态到选择每个可能动作的概率的一个映射。
___
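
One simple way to represent such a mapping in code is a nested dictionary from states to action probabilities. This is a sketch only: the policy below is a hypothetical one for the recycling robot's two states, and the helper names are made up for illustration.

```python
import random

# Hypothetical stochastic policy pi(a|s); each inner dict is a distribution over A(s).
policy = {
    "high": {"search": 0.9, "wait": 0.1},
    "low": {"search": 0.2, "wait": 0.3, "recharge": 0.5},
}


def check_policy(pi):
    """Verify that pi(.|s) is a valid probability distribution for every state s."""
    for dist in pi.values():
        assert all(prob >= 0 for prob in dist.values())
        assert abs(sum(dist.values()) - 1.0) < 1e-12


def sample_action(pi, state, rng):
    """Draw A_t according to pi(.|S_t = state)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


check_policy(policy)
rng = random.Random(0)  # seeded generator so the draw is reproducible
action = sample_action(policy, "low", rng)
```

A deterministic policy is the special case in which each inner dictionary puts probability 1 on a single action.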

### State-Value Function
The value function of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_\pi$ formally by\
在策略$\pi$下状态$s$的值函数，记为$v_\pi(s)$，是从状态$s$开始，其后遵循策略$\pi$的期望return。对于MDPs，我们可以正式定义$v_\pi$
$$
\begin{align}
v_{\pi}(s)
&= \mathbb{E}_\pi[G_t|S_t=s] \\
&= \mathbb{E}_\pi[\sum_{k=0}^\infty \gamma^{k} R_{t+k+1}|S_t=s], \qquad \text{for all } s \in \mathcal{S} \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1}|S_t=s]
\end{align}
\tag{3.12}
\label{Eq 3.12}
$$
where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows policy $\pi$, and $t$ is any time step.\
其中$\mathbb{E}_\pi[\cdot]$表示假设agent遵循策略$\pi$时随机变量的期望值，$t$为任意时间步长。

Note that the value of the terminal state, if any, is always zero. We call the function $v_\pi$ the **state-value function** for policy $\pi$.\
请注意，如果有terminal状态的值，则其始终为零。我们将函数$v_\pi$称为策略$\pi$的**状态值函数（state-value function）**。

**<font color = purple>Remark</font> State-Value Function**
* The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$.\
MDP的state-value函数$v_{\pi}(s)$是从状态$s$开始，然后遵循策略$\pi$的期望return。
* Note that $S_t$ and $R_t$ etc are random variables. Hence the return $G_t$ is also a random variable. We thus consider the expected return $v_{\pi}(s)$.\
注意，$S_t$和$R_t$等是随机变量。因此，return $G_t$也是一个随机变量。因此，我们考虑期望return $v_{\pi}(s)$。

### Action-Value Function (Q-function)

Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$, as the expected return starting from state $s$, taking the action $a$, and thereafter following policy $\pi$:\
同样，我们定义在策略$\pi$下的状态$s$中执行动作$a$的值，记为$q_\pi(s, a)$，作为从状态$s$开始，执行动作$a$，然后遵循策略$\pi$的期望return：
$$
\begin{align}
q_{\pi}(s, a) 
&= \mathbb{E}_{\pi}[G_t|S_t=s, A_t=a] \\
&= \mathbb{E}_{\pi}[\sum_{k=0}^\infty \gamma^{k} R_{t+k+1}|S_t=s, A_t=a]\\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1}|S_t=s, A_t=a]
\end{align}
\tag{3.13}
\label{Eq 3.13}
$$
We call $q_\pi$ the **action-value function** for policy $\pi$. It is often called the **Q-function**.\
我们称$q_\pi$为策略$\pi$的**动作值函数（action-value function）**。通常称为**Q函数**。

### The Bellman Equation

**<font color = purple>Remark</font> The Bellman Equation**
* The value of state $s$ is the expected immediate reward plus the discounted expected value of the next state.\
状态$s$的价值是期望的即时reward加上下一个状态的折现期望值。
$$
\begin{align}
v_{\pi}(s) 
&= \mathbb{E}_\pi [G_t | S_t=s] \\
&= \mathbb{E}_\pi [R_{t+1} + \gamma G_{t+1} | S_t=s], \qquad \text{by (3.9)} \\
&= \mathbb{E}_\pi [R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s]
\end{align}
$$

* The state-value of $s$ is the expected action-value:\
$s$的状态值是预期的行为值。
$$
v_{\pi}(s) = \sum_{a} \pi(a|s) q_{\pi}(s,a)
$$
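
The two relations in this remark can be combined into a single equation in $v_\pi$ alone. Writing the expectation over the next state and reward explicitly with the dynamics $p$ gives $q_\pi$ in terms of $v_\pi$, and substituting that into the sum over actions yields the Bellman equation for $v_\pi$ in expanded form:
$$
\begin{align}
q_{\pi}(s,a) &= \sum_{s',r} p(s',r|s,a) [r + \gamma v_\pi(s')] \\
v_{\pi}(s) &= \sum_{a} \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma v_\pi(s')]
\end{align}
$$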

___
**<font color = green>Example</font> The Bellman Equation**
<img src = "RL Figure - Example: The Bellman Equation.png">
___
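
As a numerical illustration of the Bellman equation, the sketch below repeatedly applies its right-hand side as an update until the values stop changing. This fixed-point scheme (usually called iterative policy evaluation) is not described in this section and is swapped in here only to show the equation at work. The discount rate, the recycling-robot parameters and the equiprobable policy are all hypothetical choices.

```python
GAMMA = 0.9  # hypothetical discount rate

# Recycling-robot dynamics with hypothetical parameters
# (alpha=0.8, beta=0.6, r_search=2, r_wait=1): (s, a) -> [(s', r, probability)].
P = {
    ("high", "search"): [("high", 2.0, 0.8), ("low", 2.0, 0.2)],
    ("high", "wait"): [("high", 1.0, 1.0)],
    ("low", "search"): [("low", 2.0, 0.6), ("high", -3.0, 0.4)],
    ("low", "wait"): [("low", 1.0, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}
STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

# Equiprobable random policy pi(a|s), chosen only for illustration.
PI = {s: {a: 1.0 / len(ACTIONS[s]) for a in ACTIONS[s]} for s in STATES}


def q_from_v(v, s, a):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(prob * (r + GAMMA * v[sn]) for (sn, r, prob) in P[(s, a)])


def evaluate_policy(pi, tol=1e-10):
    """Iterate v(s) <- sum_a pi(a|s) * q(s, a) until successive sweeps agree within tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        new_v = {s: sum(pi[s][a] * q_from_v(v, s, a) for a in ACTIONS[s]) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v


v_pi = evaluate_policy(PI)
q_pi = {(s, a): q_from_v(v_pi, s, a) for s in STATES for a in ACTIONS[s]}
```

At convergence, `v_pi[s]` agrees with the policy-weighted average of `q_pi[(s, a)]` up to the tolerance, which is the second relation in the remark above.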

## Optimal Policies and Optimal Value Functions

### Optimal Policy

Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define a partial ordering over policies.\
解决一个强化学习任务大致意味着，找到一个在长期内获得大量reward的策略。对于有限的MDP，我们可以用以下方法精确地定义最优策略。值函数定义策略的部分排序。

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$.\
如果一个策略$\pi$在所有状态的预期return大于或等于$\pi'$，则该策略$\pi$被定义为优于或等于策略$\pi'$。换句话说，对于所有$s \in \mathcal{S}$，$\pi \geq \pi'$当且仅当$v_\pi(s) \geq v_{\pi'}(s)$。

___
**<font color = blue>Definition</font> Optimal Policy (最优策略)**\
There is always at least one policy that is better than or equal to all other policies. This is an **optimal policy**.\
总有至少一个策略优于或等于所有其他策略。这就是**最优策略（optimal policy）**。
___

___
**<font color = purple>Remark</font> Optimal Policy**
* All optimal policies achieve the optimal state-value function $v_{\pi_*}(s) = v_*(s)$.\
所有最优策略都达到最优状态值函数$v_{\pi_*}(s) = v_*(s)$。
* All optimal policies achieve the optimal action-value function $q_{\pi_*}(s,a) = q_*(s,a)$.\
所有最优策略都实现最优动作值函数$q_{\pi_*}(s,a) = q_*(s,a)$。
___

#### Finding an optimal policy
The policy
$$
\pi_*(s) 
= \underset{a}{\arg\max} q_*(s,a)
= \underset{a}{\arg\max} \sum_{r,s'} p(s',r|s,a) [r + \gamma v_*(s')]
$$
is optimal.

**<font color = purple>Remark</font> Finding an optimal policy**
* If we know $q_*(s,a)$ we don't need the dynamics to find an optimal policy!\
如果我们知道$q_*(s,a)$，我们就不需要dynamics来找到最优策略!
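
The remark above can be illustrated with a few lines of code: once a table of optimal action values is available, acting greedily needs no access to $p$. The values in `q_star` below are made up purely for illustration and are not derived from any particular MDP.

```python
# Hypothetical optimal action values q_*(s, a); the numbers are invented.
q_star = {
    "high": {"search": 10.0, "wait": 9.0},
    "low": {"search": 7.5, "wait": 8.0, "recharge": 9.0},
}


def greedy_policy(q):
    """pi_*(s) = argmax_a q(s, a); ties go to the first maximal action."""
    return {s: max(action_values, key=action_values.get) for s, action_values in q.items()}


pi_star = greedy_policy(q_star)
```

By contrast, choosing actions from $v_*$ alone requires the one-step lookahead through $p(s',r|s,a)$ shown in the second form of the argmax.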

___
**<font color = green>Example</font> Finding an optimal policy**
<img src = "RL Figure - Example: Optimal Policy.png">
___

### Optimal State-Value Function
___
**<font color = blue>Definition</font> Optimal State-Value Function (最优状态值函数)**\
Although there may be more than one, we denote all the optimal policies by $\pi_*$. They share the same state-value function, called the **optimal state-value function**, denoted $v_*$, and defined as\
虽然可能有多个最优策略，但我们用$\pi_*$表示所有最优策略。它们共享相同的状态值函数，称为**最优状态值函数（optimal state-value function）**，记为$v_*$，定义为
$$
v_*(s) = \underset{\pi}{\max} v_\pi(s)
\tag{3.15}
$$
for all $s \in \mathcal{S}$.\
对于所有的$s \in \mathcal{S}$。
___

___
**<font color = purple>Remark</font> Optimal State-Value Function**\
The optimal state value $v_*(s)$ is the maximum of $q_*(s,a)$ over the actions $a \in \mathcal{A}(s)$ available in $s$.\
最优状态值$v_*(s)$是$q_*(s,a)$在状态$s$下所有可用动作$a \in \mathcal{A}(s)$上的最大值。
$$
v_*(s) = \underset{a}{\max} q_*(s,a)
$$
___

### Optimal Action-Value Function

___
**<font color = blue>Definition</font> Optimal Action-Value Function (最优动作值函数)**\
Optimal policies also share the same **optimal action-value function**, denoted $q_*$, and defined as\
最优策略也具有相同的**最优动作值函数（optimal action-value function）**，记为$q_*$，定义为
$$
q_*(s,a) = \underset{\pi}{\max} q_\pi(s,a)
\tag{3.16}
$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$.\
对于所有的$s \in \mathcal{S}$以及$a \in \mathcal{A}(s)$。
___

For the state–action pair $(s,a)$, this function gives the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy. Thus, we can write $q_*$ in terms of $v_*$ as follows:\
对于状态-动作组合$(s,a)$，该函数给出了在状态$s$中采取行为$a$并遵循最优策略的预期return。因此，我们可以将$q_*$写成$v_*$，如下所示:
$$
q_*(s,a) = \mathbb{E} [R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t=a]
\tag{3.17}
$$
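
To see the relations between $v_*$, $q_*$ and an optimal policy on a concrete case, the sketch below computes them for the recycling robot by repeatedly applying $v(s) \leftarrow \max_a \sum_{s',r} p(s',r|s,a)[r + \gamma v(s')]$. This procedure (value iteration) is not covered in this section and is used here only as a convenient way to obtain numbers. The discount rate and the robot's parameters are hypothetical.

```python
GAMMA = 0.9  # hypothetical discount rate

# Recycling-robot dynamics with hypothetical parameters
# (alpha=0.8, beta=0.6, r_search=2, r_wait=1): (s, a) -> [(s', r, probability)].
P = {
    ("high", "search"): [("high", 2.0, 0.8), ("low", 2.0, 0.2)],
    ("high", "wait"): [("high", 1.0, 1.0)],
    ("low", "search"): [("low", 2.0, 0.6), ("high", -3.0, 0.4)],
    ("low", "wait"): [("low", 1.0, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}
STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}


def q_backup(v, s, a):
    """E[R_{t+1} + gamma * v(S_{t+1}) | S_t = s, A_t = a] under the dynamics P."""
    return sum(prob * (r + GAMMA * v[sn]) for (sn, r, prob) in P[(s, a)])


def value_iteration(tol=1e-10):
    """Iterate v(s) <- max_a q_backup(v, s, a) until successive sweeps agree within tol."""
    v = {s: 0.0 for s in STATES}
    while True:
        new_v = {s: max(q_backup(v, s, a) for a in ACTIONS[s]) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v


v_star = value_iteration()
q_star = {(s, a): q_backup(v_star, s, a) for s in STATES for a in ACTIONS[s]}
pi_star = {s: max(ACTIONS[s], key=lambda a: q_star[(s, a)]) for s in STATES}
```

With these particular parameter values the greedy policy searches when the battery is high and recharges when it is low. Other parameter choices can lead to a different optimal policy.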