# Finite Markov Decision Processes

In this chapter we introduce the formal problem of finite Markov decision processes, or finite MDPs, which we try to solve in the rest of the book. This problem involves evaluative feedback, as in bandits, but also an associative aspect: choosing different actions in different situations. MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s,a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state given optimal action selections. These state-dependent quantities are essential to accurately assigning credit for long-term consequences to individual action selections.

MDPs are a mathematically idealized form of the reinforcement learning problem for which precise theoretical statements can be made. We introduce key elements of the problem's mathematical structure, such as returns, value functions, and Bellman equations. We try to convey the wide range of applications that can be formulated as finite MDPs. As in all of artificial intelligence, there is a tension between breadth of applicability and mathematical tractability. In this chapter we introduce this tension and discuss some of the trade-offs and challenges that it implies. Some ways in which reinforcement learning can be taken beyond MDPs are treated in Chapter 17.

## Returns and Episodes

So far we have discussed the objective of learning only informally. We have said that the agent's goal is to maximize the cumulative reward it receives in the long run. How might this be defined formally? If the sequence of rewards received after time step $t$ is denoted $R_{t+1}, R_{t+2}, R_{t+3}, \ldots$, then what precise aspect of this sequence do we wish to maximize? In general, we seek to maximize the **expected return**, where the return, denoted $G_t$, is defined as some specific function of the reward sequence. In the simplest case the return is the sum of the rewards:
$$
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_{T} = \sum_{k=t+1}^T R_k
\tag{3.7}
$$
where $T$ is a final time step.
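As a rough illustration of (3.7), the undiscounted return of a finite episode can be computed with a plain sum. This is only a sketch under stated assumptions: the function name `episodic_return`, the list layout (index `k` holds $R_{k+1}$), and the reward values are hypothetical and do not come from the text.

```python
def episodic_return(rewards, t=0):
    """Undiscounted return G_t = R_{t+1} + R_{t+2} + ... + R_T, as in (3.7).

    Assumption: rewards[k] stores R_{k+1}, so the list [R_1, ..., R_T]
    has length T and the rewards received after step t are rewards[t:].
    """
    return sum(rewards[t:])


# Hypothetical episode with T = 5 rewards R_1, ..., R_5.
rewards = [0.0, 0.0, 1.0, -1.0, 5.0]

print(episodic_return(rewards, t=0))  # G_0: sum of all five rewards
print(episodic_return(rewards, t=3))  # G_3: R_4 + R_5 only
```

With this layout, the return at the final step, `episodic_return(rewards, t=len(rewards))`, is an empty sum and evaluates to zero.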

The additional concept that we need is that of **discounting**. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected **discounted return**:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}
\tag{3.8}
$$
where $\gamma$ is a parameter, $0 \leq \gamma \leq 1$, called the **discount rate**.
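The discounted return weights the reward received $k$ steps after $R_{t+1}$ by $\gamma^k$. A minimal sketch of that computation follows. It assumes a finite reward list, with every reward beyond the list treated as zero, and it accumulates backwards using the identity $G_t = R_{t+1} + \gamma G_{t+1}$, which follows from the definition of the discounted return. The function name `discounted_return` and the example values are hypothetical.

```python
def discounted_return(rewards, gamma, t=0):
    """Discounted return G_t = sum over k of gamma**k * R_{t+k+1}.

    Assumptions: rewards[k] stores R_{k+1}, and all rewards beyond the
    end of the list are zero, so the infinite sum reduces to a finite one.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    g = 0.0
    # Work backwards from the last reward: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards[t:]):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards R_1, R_2, R_3

print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25
print(discounted_return(rewards, gamma=1.0))  # reduces to the plain sum
print(discounted_return(rewards, gamma=0.0))  # only R_1 counts
```

The two boundary cases show what the discount rate controls: with $\gamma = 0$ only the immediate reward matters, while with $\gamma = 1$ all rewards count equally and the result matches the undiscounted sum.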