# Finite Markov Decision Processes

In this chapter we introduce the formal problem of finite Markov decision processes, or finite MDPs, which we try to solve in the rest of the book. This problem involves evaluative feedback, as in bandits, but also an associative aspect: choosing different actions in different situations. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s,a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections. These state-dependent quantities are essential to accurately assigning credit for long-term consequences to individual action selections.

MDPs are a mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made. We introduce key elements of the problem’s mathematical structure, such as returns, value functions, and Bellman equations. We try to convey the wide range of applications that can be formulated as finite MDPs. As in all of artificial intelligence, there is a tension between breadth of applicability and mathematical tractability. In this chapter we introduce this tension and discuss some of the trade-offs and challenges that it implies. Some ways in which reinforcement learning can be taken beyond MDPs are treated in Chapter 17.

## The Agent–Environment Interface

## Goals and Rewards

In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special signal, called the **reward**, passing from the environment to the agent. At each time step, the reward is a simple number, $R_t \in \mathbb{R}$. Informally, the agent’s goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run. We can state this informal idea clearly as the reward hypothesis:

> That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

The use of a reward signal to formalize the idea of a goal is one of the most distinctive features of reinforcement learning.

---

**<font color = blue>Definition</font> Reward**\
At each time step, the **reward** is a simple number, $R_t \in \mathbb{R}$.

---

**<font color = blue>Definition</font> Goal**\
Informally, the agent’s **goal** is to maximize the total amount of reward it receives.

---

Although formulating goals in terms of reward signals might at first appear limiting, in practice it has proved to be flexible and widely applicable. The best way to see this is to consider examples of how it has been, or could be, used.

**<font color = green>Example</font> Make a robot learn to walk**\
For example, to make a robot learn to walk, researchers have provided reward on each time step proportional to the robot’s forward motion. In making a robot learn how to escape from a maze, the reward is often $-1$ for every time step that passes prior to escape; this encourages the agent to escape as quickly as possible. To make a robot learn to find and collect empty soda cans for recycling, one might give it a reward of zero most of the time, and then a reward of $+1$ for each can collected. One might also want to give the robot negative rewards when it bumps into things or when somebody yells at it.

**<font color = green>Example</font> Make a robot learn to play checkers or chess**\
For an agent to learn to play checkers or chess, the natural rewards are $+1$ for winning, $-1$ for losing, and $0$ for drawing and for all nonterminal positions.
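
The reward schemes described in these examples can be written down as very small functions. The following is only an illustrative sketch: the function names and the `outcome` labels are hypothetical and do not come from any particular library or from the text.

```python
def game_reward(outcome):
    """Reward for a checkers- or chess-like game.

    `outcome` is a hypothetical label: "win", "loss", "draw", or None
    for a nonterminal position.
    """
    if outcome == "win":
        return 1.0
    if outcome == "loss":
        return -1.0
    return 0.0  # draws and all nonterminal positions


def maze_reward():
    """Reward for each time step that passes before escaping the maze."""
    return -1.0


# Total reward collected over seven steps spent inside the maze.
total_maze_reward = sum(maze_reward() for _ in range(7))
print(game_reward("win"), game_reward("draw"), game_reward(None))
print(total_maze_reward)
```

Because every step in the maze costs the same amount, the only way to increase the total is to finish in fewer steps, which is exactly the behavior the reward is meant to encourage.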

## Returns and Episodes

### Return

So far we have discussed the objective of learning informally. We have said that the agent’s goal is to maximize the cumulative reward it receives in the long run. How might this be defined formally? If the sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$, then what precise aspect of this sequence do we wish to maximize? In general, we seek to maximize the **expected return**, where the return, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:
$$
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_{T} = \sum_{k=t+1}^T R_k
\tag{3.7}
\label{Eq 3.7}
$$
where $T$ is a final time step.
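
As a small illustration of this sum, the sketch below computes the undiscounted return from a recorded reward sequence. The list layout (`rewards[k]` holding $R_{k+1}$) and the function name are assumptions made for this example only.

```python
def episodic_return(rewards, t=0):
    """Undiscounted return G_t = R_{t+1} + ... + R_T.

    `rewards` is a hypothetical list in which rewards[k] holds R_{k+1},
    so the list has length T and rewards[t:] covers R_{t+1} through R_T.
    """
    return sum(rewards[t:])


# Maze-style episode: -1 on each of five steps before escaping (T = 5).
maze_rewards = [-1, -1, -1, -1, -1]
print(episodic_return(maze_rewards, t=0))  # G_0 = -5
print(episodic_return(maze_rewards, t=3))  # G_3 = R_4 + R_5 = -2
```

With this indexing, calling the function with `t` equal to the episode length sums an empty slice and gives 0.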

**<font color = blue>Definition</font> Return**\
**Return**, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:
$$
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_{T} = \sum_{k=t+1}^T R_k
$$
where $T$ is a final time step.

### Episodic Task

This approach makes sense in applications in which there is a natural notion of a final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call **episodes**, such as plays of a game, trips through a maze, or any sort of repeated interaction. Each episode ends in a special state called the **terminal state**, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Even if you think of episodes as ending in different ways, such as winning and losing a game, the next episode begins independently of how the previous one ended. Thus the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes.

Tasks with episodes of this kind are called **episodic tasks**. In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted $\mathcal{S}$, from the set of all states plus the terminal state, denoted $\mathcal{S}^+$. The time of termination, $T$, is a random variable that normally varies from episode to episode.

### Continuing Task

On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. For example, this would be the natural way to formulate an ongoing process-control task, or an application to a robot with a long life span. We call these **continuing tasks**. The return formulation $\ref{Eq 3.7}$ is problematic for continuing tasks because the final time step would be $T = \infty$, and the return, which is what we are trying to maximize, could itself easily be infinite. (For example, suppose the agent receives a reward of $+1$ at each time step.) Thus, in this book we usually use a definition of return that is slightly more complex conceptually but much simpler mathematically.

#### Discounting

The additional concept that we need is that of **discounting**. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected **discounted return**:
$$
G_t 
= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots 
= \sum_{k=0}^\infty \gamma^{k} R_{t+k+1}
= \sum_{k=1}^\infty \gamma^{k-1} R_{t+k}
\tag{3.8}
\label{Eq 3.8}
$$
where $\gamma$ is a parameter, $0 \leq \gamma \leq 1$, called the **discount rate**.

The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if it were received immediately.
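
To make the weighting concrete, here is a minimal sketch that evaluates the discounted sum for a finite, recorded reward sequence. The function name and the list layout (`rewards[k]` holding $R_{k+1}$) are assumptions for illustration; rewards beyond the end of the list are treated as zero.

```python
def discounted_return(rewards, gamma, t=0):
    """G_t = sum over k >= 0 of gamma**k * R_{t+k+1}.

    `rewards` is a hypothetical finite list with rewards[k] holding R_{k+1};
    any reward past the end of the list is taken to be zero.
    """
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))


# A single reward of +1 that arrives three steps in the future (R_3 = 1).
rewards = [0.0, 0.0, 1.0]
for gamma in (0.0, 0.5, 1.0):
    # The reward is weighted by gamma**2, i.e. gamma**(k-1) with k = 3.
    print(gamma, discounted_return(rewards, gamma))
```

With `gamma = 0.5` the reward three steps away is weighted by $0.5^2 = 0.25$, while `gamma = 1.0` recovers the plain undiscounted sum.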

___
**<font color = purple>Remark</font> Interpretation of Discount Rate**\
That is, a reward received $k$ time steps in the future, $R_{t+k}$, contributes $\gamma^{k-1} R_{t+k}$ to the return $G_t$.
___

* If $\gamma < 1$, the infinite sum in $\ref{Eq 3.8}$ has a finite value as long as the reward sequence $\{R_k\}$ is bounded. As $\gamma$ approaches 1, the return objective takes future rewards into account more strongly; the agent becomes more farsighted.
* If $\gamma = 0$, the agent is “myopic [maɪˈɒpɪk]” in being concerned only with maximizing immediate rewards: its objective in this case is to learn how to choose $A_t$ so as to maximize only $R_{t+1}$.

If each of the agent’s actions happened to influence only the immediate reward, not future rewards as well, then a myopic agent could maximize $\ref{Eq 3.8}$ by separately maximizing each immediate reward. But in general, acting to maximize immediate reward can reduce access to future rewards so that the return is reduced.
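
The effect of the discount rate on this trade-off can be seen in a toy comparison. The two reward sequences below are invented for illustration: one takes a small reward immediately, the other forgoes it for a larger reward two steps later.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward list, with rewards[k] holding R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


# Hypothetical outcomes of two different courses of action.
greedy = [1.0, 0.0, 0.0]   # small reward right away, nothing afterwards
patient = [0.0, 0.0, 4.0]  # nothing now, larger reward two steps later


def preferred(gamma):
    """Name of the course of action with the larger discounted return."""
    g = discounted_return(greedy, gamma)
    p = discounted_return(patient, gamma)
    return "greedy" if g > p else "patient"


for gamma in (0.0, 0.9):
    print(gamma, preferred(gamma))
```

Under these assumed numbers, a myopic agent ($\gamma = 0$) prefers the immediate reward, whereas with $\gamma = 0.9$ the delayed reward is still weighted by $0.81$ and the patient course of action yields the larger return.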

#### Properties of Returns at Successive Time Steps

Returns at successive time steps are related to each other in a way that is important for the theory and algorithms of reinforcement learning:
$$
\begin{align}
G_t 
&= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \\
&= R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots) \\
&= R_{t+1} + \gamma G_{t+1}
\end{align}
\tag{3.9}
\label{Eq 3.9}
$$
Note that this works for all time steps $t < T$, even if termination occurs at $t + 1$, provided that we define $G_T = 0$. This often makes it easy to compute returns from reward sequences.
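
One way to use this relation in practice is to sweep backward through a finished episode, starting from $G_T = 0$. The sketch below assumes a finite reward list with `rewards[k]` holding $R_{k+1}$; the function name is hypothetical.

```python
def returns_from_rewards(rewards, gamma):
    """Return the list [G_0, G_1, ..., G_T] for one finished episode.

    `rewards[k]` holds R_{k+1}, so T = len(rewards). The sweep runs
    backward and applies G_t = R_{t+1} + gamma * G_{t+1} with G_T = 0.
    """
    T = len(rewards)
    returns = [0.0] * (T + 1)  # returns[T] is G_T = 0
    for t in range(T - 1, -1, -1):
        returns[t] = rewards[t] + gamma * returns[t + 1]
    return returns


# G_3 = 0, G_2 = 4, G_1 = 2 + 0.5 * 4 = 4, G_0 = 1 + 0.5 * 4 = 3
print(returns_from_rewards([1.0, 2.0, 4.0], gamma=0.5))  # [3.0, 4.0, 4.0, 0.0]
```

The backward sweep touches each reward once, so all returns of an episode are obtained in time linear in its length instead of re-summing the tail of the sequence for every $t$.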

Note that although the return $\ref{Eq 3.8}$ is a sum of an infinite number of terms, it is still finite when the reward is a nonzero constant, provided $\gamma < 1$. For example, if the reward is a constant $+1$, then the return is
$$
G_t 
= \sum_{k=0}^\infty \gamma^{k} 
= \frac{1}{1 - \gamma}
\tag{3.10}
\label{Eq 3.10}
$$
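
As a quick check on this closed form, the same value follows from the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The derivation below is a sketch that assumes every reward equals $+1$ and $\gamma < 1$, so that the return is finite and identical at every time step; the symbol $G$ for that common value is introduced here only for illustration:

$$
\begin{aligned}
G &= 1 + \gamma G \\
(1 - \gamma)\, G &= 1 \\
G &= \frac{1}{1 - \gamma}
\end{aligned}
$$

With $\gamma = 0.9$, for instance, this gives $G = 10$, and with $\gamma = 0.5$ it gives $G = 2$.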

## Policies and Value Functions

Almost all reinforcement learning algorithms involve estimating **value functions**: functions of states (or of state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of “how good” here is defined in terms of future rewards that can be expected, or, to be precise, in terms of expected return. Of course the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.

Formally, a **policy** is a mapping from states to probabilities of selecting each possible action. If the agent is following policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability that $A_t = a$ if $S_t = s$. Like $p$, $\pi$ is an ordinary function; the “|” in the middle of $\pi(a|s)$ merely reminds us that it defines a probability distribution over $a \in \mathcal{A}(s)$ for each $s \in \mathcal{S}$. Reinforcement learning methods specify how the agent’s policy is changed as a result of its experience.
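
For a small finite problem, a stochastic policy of this kind can be stored as a table of probabilities. The sketch below is purely illustrative: the state names, action names, and probabilities are invented, and `pi` and `sample_action` are hypothetical helper names.

```python
import random

# Hypothetical tabular policy: policy[s][a] plays the role of pi(a|s).
policy = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"search": 0.7, "wait": 0.3},
}


def pi(a, s):
    """Probability of selecting action a in state s; 0 if a is not available in s."""
    return policy[s].get(a, 0.0)


def sample_action(s, rng=random):
    """Draw A_t from the distribution pi(.|s)."""
    actions = list(policy[s])
    weights = [policy[s][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# For every state, the probabilities over available actions must sum to 1.
for s, dist in policy.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12

print(pi("recharge", "low_battery"))
print(sample_action("high_battery"))
```

Storing one distribution per state mirrors the reading of $\pi(a|s)$ as an ordinary function of two arguments that, for each fixed $s$, is a probability distribution over the actions available in that state.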

The value function of a state $s$ under a policy $\pi$, denoted $v_\pi(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_\pi$ formally by