### **Human-in-the-Loop Reinforcement Learning in Continuous-Action Space**
**连续动作空间的人在回路强化学习**

Biao Luo, Senior Member, IEEE, Zhengke Wu, Fei Zhou, and Bing-Chuan Wang

**Abstract** - Human-in-the-loop for reinforcement learning (RL) is usually employed to overcome the challenge of sample inefficiency, in which the human expert provides advice for the agent when necessary. The current human-in-the-loop RL (HRL) results mainly focus on discrete action space. In this article, we propose a Q value-dependent policy (QDP)-based HRL (QDP-HRL) algorithm for continuous action space. Considering the cognitive costs of human monitoring, the human expert only selectively gives advice in the early stage of agent learning, where the agent implements the human-advised action instead. The QDP framework is adapted to the twin delayed deep deterministic policy gradient algorithm (TD3) in this article for the convenience of comparison with the state-of-the-art TD3. **Specifically, the human expert in the QDP-HRL considers giving advice when the difference between the outputs of the twin Q-networks exceeds the maximum difference in the current queue. Moreover, to guide the update of the critic network, the advantage loss function is developed using expert experience and agent policy, which provides the learning direction for the QDP-HRL algorithm to some extent.** To verify the effectiveness of QDP-HRL, experiments are conducted on several continuous action space tasks in the OpenAI gym environment, and the results demonstrate that QDP-HRL greatly improves learning speed and performance.

**摘要** - “人在回路”强化学习（RL）指的是人类专家在必要时为智能体提供建议，通常用于解决样本效率不高的问题。目前在人在回路强化学习（HRL）方向的成果主要集中在离散动作空间的强化学习这一方面。在本文中，我们提出了一种用于连续动作空间的，基于Q值依赖的策略的人在回路强化学习算法（QDP-HRL）。考虑到人类监测的认知成本，人类专家只在智能体学习的早期阶段选择性地向其提供建议，而代理则执行人类建议的动作。本文中的双延迟深度确定性策略梯度算法（TD3）就采用这种QDP框架，以方便与最先进的TD3进行比较。**具体来说就是，QDP-HRL中的人类专家会在双Q网络输出差异超过当前队列的最大差异的情况下提供建议。此外，为了指导评论家网络的更新，使用专家经验和智能体策略来导出优势损失函数，这在一定程度上为QDP-HRL算法提供了学习方向**。为了验证QDP-HRL的有效性，在OpenAI gym环境中对几个连续动作空间任务进行了实验，结果表明QDP-HRL大大提高了学习速度和性能。

**Index Terms** Continuous action space, human-in-the-loop, reinforcement learning (RL).

**关键词** 连续动作空间，人在回路，强化学习（RL）

#### **PRELIMINARIES**

In this section, some necessary preliminaries are presented. To begin with, the Markov decision process (MDP) is given, which is the standard problem formulation for RL algorithms. Subsequently, the twin delayed deep deterministic policy gradient algorithm (TD3) [52] is introduced, which is used in this article as a representative algorithm into which our QDP strategy is integrated. Finally, the basic idea of active learning is discussed, which is usually used to determine when human advice is required.

在本节中，展示了一些必备知识。首先，给出了马尔可夫决策过程（MDP），这是RL算法的标准问题。随后，引入了双延迟深度确定性策略梯度算法（TD3）[52]，本文中该算法被用作融合QDP策略的代表性算法。最后，讨论了主动学习的基本概念，通常用于确定何时需要人类建议。

##### A.Markov Decision Process

RL is usually used to solve sequential decision problems, which can be modeled as an MDP [53]. An MDP is represented by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, T, \gamma)$, where $\mathcal{S}$ denotes the state space, $s \in \mathcal{S}$ is the state, $\mathcal{A}$ is the action space, $a \in \mathcal{A}$ is the action, $\mathcal{R}$ is the reward function space, $r \in \mathcal{R}$ is the instant reward, $T$ is the state transition function of the environment, and $\gamma \in [0, 1]$ is the discount factor. The agent's policy $\pi(s):\mathcal{S} \to \mathcal{A}$ is a mapping from the state space to the action space. The goal of the agent is to find an optimal policy $\pi^{*}(s)$ under which the cumulative reward $R_t=\sum_{k=0}^{\infty } \gamma^{k}r_{t+k}$ is maximized. In model-free RL algorithms, the state transition function $T$ is unknown to the agent. Hence, the agent must interact with the environment to collect samples in the form of tuples $(s, a, r, s')$, which are used to train the agent. For a policy $\pi(s)$, there are two types of value functions, that is, the action-value function $Q_\pi(s, a)$ and the state-value function $V_\pi(s)$. $Q_\pi(s, a)$ is defined as the expected return obtained by taking the action $a$ at the current state $s$ and following the policy $\pi(s)$ thereafter, that is,

强化学习通常被用来解决能被建模为马尔科夫决策过程(MDP)的序列决策问题。马尔科夫决策过程可以简要得由$(\mathcal{S}, \mathcal{A}, \mathcal{R}, T, \gamma)$来说明，其中$\mathcal{S}$代表状态空间，$s \in \mathcal{S}$是状态，$\mathcal{A}$是动作空间，$a \in \mathcal{A}$是动作，$\mathcal{R}$代表奖励函数空间，$r \in \mathcal{R}$是即时奖励，$T$是环境的状态转移函数，$\gamma \in [0, 1]$代表折扣因子。$\pi(s):\mathcal{S} \to \mathcal{A}$是由状态空间到动作空间的映射，即智能体的策略。智能体的最终目标是找到一个最优策略$\pi^{*}(s)$，使得奖励的加和$R_t=\sum_{k=0}^{\infty } \gamma^{k}r_{t+k}$最大。对于无模型的强化学习算法来说，状态转移函数$T$是未知的。因此，智能体应与环境交互，以收集类型如$(s, a, r, s')$的样本用于训练。对于策略$\pi(s)$来说有两种值函数——动作值$Q_\pi(s, a)$函数和状态值函数$V_\pi(s)$。$Q_\pi(s, a)$被定义为在当前状态$s$下，根据策略$\pi(s)$采取动作$a$得到的回报的期望，即：
$$
Q_\pi(s,a)=E\left [\sum_{k=0}^{\infty }\gamma^kr_{t+k+1}|s_t=s,a_t=a\right ] 
$$
$V_\pi(s)$ is defined as the expected return obtained by following policy $\pi(s)$ from state $s$, that is,

$V_\pi(s)$则被定义为根据策略$\pi(s)$在当前状态$s$下得到的的回报的期望。
$$
V_\pi(s)=E\left [\sum_{k=0}^{\infty }\gamma^kr_{t+k+1}|s_t=s\right ]
$$
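
As a small illustration of the return that both value functions take the expectation of, the sketch below computes the discounted sum $\sum_k \gamma^k r_k$ for one finite reward sequence. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the paper; an estimate of $Q_\pi(s,a)$ or $V_\pi(s)$ would average such returns over many sampled trajectories.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is a single update: G = r + gamma * G.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode: 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

Truncating the infinite sum at the end of an episode is the usual practical choice for episodic tasks.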

##### B.TD3 Algorithm

The TD3 algorithm [52] is designed on the actor-critic framework and is well suited for continuous action space. It greatly reduces the overestimation problem that widely exists in RL methods. In TD3, two critic networks are employed to estimate $Q(s,a)$, and the smaller of the two estimates is used as the target value, which reduces the bias of the $Q$ value estimation to some extent. With training, the $Q$ value estimation becomes more accurate, and finally, the two estimates converge to the optimal $Q$ function according to the Bellman equation. The loss function of the critic network is given by:

TD3算法[52]非常适合基于演员-评论家框架设计的连续动作空间，其大大减少了强化学习方法中普遍存在的Q值高估问题。在TD3中，使用两个评论家网络来估计$Q(s,a)$，使用较小的估计值作为目标值，这在一定程度上减少了Q值估计的偏差。通过训练，Q值估计将变得更加准确，最终根据贝尔曼方程，这两个评估会收敛到最优Q值函数。评论家网络的损失函数由下式给出：
$$
y_i = r+\gamma Q_{\theta_i'}(s',\tilde{a})\\
lossQ = N^{-1}\sum\left(\min_{i=1,2}(y_i)-Q_{\theta_i}(s,a)\right)^2
$$
where $Q_{\theta_i'}(s, a)$ represents the $i$th target critic network, $y_i$ is the corresponding update target, $i \in \{1, 2\}$ indexes the two critic networks, and $\tilde{a}$ is the output of the target policy network $\pi_{\phi'}(s')$ with added noise $\varepsilon$. The noise $\varepsilon$ is sampled from a Gaussian distribution. Because similar actions in the same state have similar $Q$ values in a continuous action space, the purpose of adding noise is to make the critic network smoother. The actor network uses delayed updates and updates its parameters by gradient ascent. The update process is given as follows:

其中$Q_{\theta_i'} (s, a)$代表目标评论网络，$y_i$是要更新的目标值，$i$代表两个评论家网络的索引，$\tilde{a}$的意思是目标策略网络$\pi_{\phi'}(s)$加入噪声$\varepsilon$后的输出，噪声$\varepsilon$服从高斯分布。在连续动作空间中，由于相同状态-动作对的Q值也相似，所以加入噪声来使评论网络更平滑。动作网络采用梯度上升延迟更新技术来更新参数。更新过程如下：
$$
\nabla J(\phi)=\frac{\sum \nabla _aQ_{\theta_1}(s,a)|_{a=\pi_{\phi}(s)}\nabla _{\phi}\pi_{\phi}(s)}{N}
$$
where $Q_{\theta_1}(s, a)$ represents the critic network, $a$ is the output of actor-network $\pi_{\phi}(s)$, and $N$ is the batch size. The neural network weights of the target network are obtained by the soft update of the training network, and the soft update process of the target network is given as follows:

其中$Q_{\theta_1}(s, a)$代表评论网络，$a$是动作网络$\pi_{\phi}(s)$的输出，$N$是batch-size（批处理的大小）。目标网络的权重通过软更新获得，其软更新过程如下：
$$
\theta _i'\gets \tau \theta_i+(1-\tau)\theta_i'\\
\phi'\gets \tau \phi+(1-\tau)\phi'
$$
where $\theta_i'$ denotes the parameters of the target critic network, $\phi '$ denotes the parameters of the target policy network, and $\tau$ is the soft-update coefficient. Using target networks makes the agent more stable during training.

其中$\theta_i'$是目标评论网络的参数，$\phi '$是目标策略网络（也叫动作网络）的参数。这种目标网络的更新设定使得训练过程中的智能体表现得更加稳定。
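
The target computation and the soft update described in this subsection can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: states and actions are scalars, the networks are stand-in callables, parameters are flat lists of floats, terminal-state handling is omitted, and every function name is hypothetical. The noise clipping bound `c` follows the clipped Gaussian noise that TD3 uses for target policy smoothing.

```python
import random


def clipped_noise(sigma, c, rng):
    """Sample Gaussian noise with std sigma and clip it to [-c, c]."""
    return max(-c, min(c, rng.gauss(0.0, sigma)))


def td3_target(r, s_next, gamma, target_actor, target_q1, target_q2,
               sigma, c, rng):
    """Smaller of the two targets y_i = r + gamma * Q'_i(s', a~)."""
    # Target policy smoothing: perturb the target actor's action.
    a_tilde = target_actor(s_next) + clipped_noise(sigma, c, rng)
    y1 = r + gamma * target_q1(s_next, a_tilde)
    y2 = r + gamma * target_q2(s_next, a_tilde)
    return min(y1, y2)


def soft_update(target_params, params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', element by element."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(params, target_params)]


# Toy stand-ins for the target networks (assumptions for illustration only).
rng = random.Random(0)
y = td3_target(r=1.0, s_next=2.0, gamma=0.9,
               target_actor=lambda s: 0.5 * s,
               target_q1=lambda s, a: s + a,
               target_q2=lambda s, a: s + a - 1.0,
               sigma=0.2, c=0.5, rng=rng)
print(y)
print(soft_update([0.0, 1.0], [1.0, 3.0], tau=0.5))
```

Taking the minimum over the two targets is what counteracts overestimation, while a small `tau` keeps the target networks slowly moving, which is the stabilizing effect mentioned above.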

##### C.Active Learning

Active learning [54], [55], [56] is a machine-learning method used to reduce the labeling cost of human experts during training. Specifically, the model uses a query strategy to decide whether the current data needs to be labeled by experts, and the query strategy is designed based on the uncertainty of the current model. If the predetermined condition is satisfied, the model actively asks the human for the current data's label, which avoids unnecessary labeling cost. Training data with high uncertainty is given priority for labeling by human experts, and the training data is obtained through repeated interactions with the experts. The active learning method can therefore reduce the workload of human annotation. In this article, the idea of active learning is also adopted in the developed QDP-HRL, where the agent actively asks the human expert which action it should take in the current state according to the QDP query strategy, and thereby obtains the experience of the human expert. In the training process, the experience of human experts is integrated into the policy of the agent, thereby improving the training efficiency of the agent.

主动学习[54]、[55]、[56]是一种机器学习方法，用于降低人类专家在训练期间的标签成本。具体来说，该模型通过查询策略决定是否需要对当前数据进行标记，并且查询策略是根据当前模型的不确定性设计的。如果满足预定条件，模型将主动向人类询问当前数据的标签，从而降低不必要的标签成本。高不确定性的训练数据优先给人类专家进行标注，在与专家的反复互动中获得训练数据。显然，主动学习方法可以减少人类注释的工作量。在本文中，QDP-HRL也采用了主动学习的想法，根据QDP查询策略，智能体主动询问人类专家他们应该在当前状态下采取什么行动，以获得人类专家的经验。在培训过程中，人类专家的经验被整合到智能体的策略中，从而提高智能体的训练效率。

#### **QDP-BASED HRL**

In this section, we propose a novel QDP-HRL algorithm to address the challenge of low sampling efficiency of agents in continuous action space. The architecture of QDP is shown in *Fig. 1*, which is divided into two parts. The first part is the action selection strategy based on active learning. The agent actively seeks advice from human experts according to the query strategy in the early stage of training. Advice from human experts is stored in the experience buffer pool $B_h$, and the agent's interaction data is stored in $B$. The second part uses the experience $(s, a_h)$ of human experts and the agent's interaction data $(s, a, r, s')$ to train the critic network so that the critic network is updated in a guided direction, which improves learning efficiency.

在本节中，我们提出了一种新的QDP-HRL算法，以解决智能体在连续作用空间中采样效率低的问题。QDP的架构如*图1*所示分为两部分：第一部分是基于主动学习的动作选择策略。智能体在训练的早期阶段根据查询策略积极向人类专家寻求建议。来自人类专家的建议存储在经验缓冲池$B_h$中，智能体的交互数据存储在$B$中；第二部分使用人类专家的经验$(s，a_h)$和智能体的交互数据$(s，a，r，s')$来训练评论网络，以便对评论网络进行定向更新来提高学习效率。

##### A.Action Query Strategy

Action query strategy is an important part of HRL, which directly determines when the agent seeks advice from the human expert. In the QDP, we propose a query strategy adapted to continuous action spaces based on active learning. First, the agent computes the critic network difference for the current state–action pair with

动作查询策略是HRL的重要组成部分，它直接决定了智能体何时向人类专家寻求建议。在QDP中，我们提出了一种适用于连续行动空间的基于主动学习的查询策略。首先，智能体计算当前状态-动作对的临界网络差值
$$
I(s)=Q_1(s,a)-Q_2(s,a)
$$
where $Q_{i=1,2}(s, a)$ are the expected returns estimated by the current critic networks for the state–action pair $(s,a)$, and $I(s)$ represents the difference between the two current critic network estimates. Because the distribution of $I(s)$ varies greatly across tasks, it is impractical to set a fixed threshold, so we adopt the sliding window idea and store the most recent values of $I(s)$ in a fixed-length queue $Que_n$. The agent queries the human expert when $I(s)$ satisfies the following condition:

其中$Q_{i=1,2}(s, a)$是当前评论网络对状态-行动对$(s,a)$的回报的期望，$I(s)$表示当前评论网络估计值之间的差异。由于$I(s)$在不同任务中的分布差异很大，不可能设置固定的阈值，因此我们采用滑动窗口的想法将最新的$I(s)$存储在固定长度的队列$Que_n$中。当满足以下条件时：
$$
I(s)>\max(Que_n)
$$
that is, the agent asks for human advice if $I(s)$ exceeds the maximum value of the elements in the current queue. When the queue length exceeds the fixed length $n$, elements leave the queue in first-in, first-out order. The action query strategy is shown in *Fig. 2*. Generally speaking, the value of $I(s)$ reflects the inaccuracy of the critic network estimation. Thus, a larger $I(s)$ implies that the current exploration of the agent may be inappropriate. Human experts' experience may help the agent choose appropriate actions in this situation. To preserve the agent's ability to explore the environment, human experts give advice only in the early stage of training, which is determined by the following condition:

即如果$I(s)$超过当前队列中元素的最大值，智能体会征求人类建议。当队列长度超过固定长度$n$时，第一个进入队列的元素则第一个退出。动作查询策略如*图2*所示。一般来说，$I(s)$的值反映了评论网络估计的不准确。因此，更大的$I(s)$意味着当前智能体的探索可能不太合适。人类专家的经验可能会帮助智能体在这种情况下选择适当的行动作。为了保证智能体对环境的探索能力，人类专家只会使用以下方法在训练早期阶段提出建议：
$$
R_{episode}<I_R
$$
where $R_{episode}$ is the accumulated reward of the episode, $I_R = R_{max}/t_h$ is the threshold, $R_{max}$ is the maximum accumulated reward, which can be set to an appropriate value based on the task and experience, and $t_h$ is an adjustable parameter that represents the degree of human participation in the training of the agent and determines when humans stop giving advice to the agent. The parameter $I_R$ thus controls the tradeoff between using human advice and encouraging exploration. A small $I_R$ gives human advice less chance to participate in learning to improve performance, while exploration is encouraged. In contrast, a large $I_R$ lets human advice play a more important role in the learning process, while the exploration level is reduced. Thus, the parameter $I_R$ gives the HRL algorithm flexibility to match practical requirements. Another merit of (8) is that it reduces human monitoring costs: once the HRL algorithm has converged well, human advice is no longer required, and exploration of the environment is encouraged to find potentially better policies.

其中$R_{episode}$是累积奖励，$I_R = R_{max}/t_h$是阈值，$R_{max}$是最大累积奖励，可以根据任务和经验给予适当的值，$t_h$是一个可调节的参数，代表人类参与训练的程度，并决定人类何时停止向智能体提供建议。因为参数$I_R$在平衡人类建议和鼓励探索之间发挥着重要作用。小的$I_R$意味着人类建议参与学习以提高性能的机会较小，而鼓励探索。相比之下，大的$I_R$意味着人类建议在学习过程中发挥更重要的作用，而探索程度将降低。因此，基于实际要求，参数$I_R$为HRL算法提供了更大的灵活性。式（8）的另一个优点是可以降低人工监控成本。当HRL算法已经具备了良好的收敛性时，则不再需要人类建议，此时鼓励探索环境，以找到潜在更好的实用策略。
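
The two query conditions of this subsection, $I(s) > \max(Que_n)$ and $R_{episode} < I_R$, can be sketched with a fixed-length queue as follows. This is a minimal sketch, not the authors' code. Several details are assumptions that the text does not settle: the class and method names, the choice to compare against the queue before appending the new value, the choice not to query while the queue is empty, and the choice to combine the two conditions with a logical AND.

```python
from collections import deque


class QueryStrategy:
    """Sliding-window rule deciding when to ask the human expert."""

    def __init__(self, window, r_max, t_h):
        # deque(maxlen=n) drops the oldest element first (FIFO) once full.
        self.queue = deque(maxlen=window)
        # Threshold I_R = R_max / t_h from the early-stage condition.
        self.reward_threshold = r_max / t_h

    def should_query(self, q1, q2, episode_reward):
        gap = q1 - q2  # I(s) = Q_1(s, a) - Q_2(s, a)
        # Assumption: an empty queue never triggers a query.
        exceeds = len(self.queue) > 0 and gap > max(self.queue)
        # Assumption: the new value is stored after the comparison.
        self.queue.append(gap)
        early_stage = episode_reward < self.reward_threshold
        return exceeds and early_stage


# Hypothetical numbers: window n = 3, R_max = 100, t_h = 2, so I_R = 50.
strategy = QueryStrategy(window=3, r_max=100.0, t_h=2.0)
for q1, q2, reward in [(1.0, 0.5, 10.0), (2.0, 1.0, 10.0), (3.0, 1.0, 60.0)]:
    print(strategy.should_query(q1, q2, reward))
```

With these numbers the second call asks for advice because the gap grows from 0.5 to 1.0, while the third does not because the episode reward of 60 is no longer below $I_R = 50$.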

##### B.Policy Training

Because TD3 is an actor-critic algorithm with excellent performance in continuous action space, we implement the QDP strategy on top of TD3 as an illustration. It is worth pointing out that the proposed QDP can be combined with other DRL algorithms, provided that the algorithm includes two evaluation networks, for example, DDPG, SAC, and so on. The main reason is that the query strategy in the QDP relies on two evaluation networks. The combination requires adding the query strategy to the DRL algorithm and the advantage loss function to its training process. In the actor-critic framework, the more accurate the critic network's estimate is, the more the actor network benefits. Therefore, we use the advantage loss function defined in QDP to improve the convergence speed of the critic network, thereby improving the performance of the actor network. The experience buffer pool $B_h$ stores the human's advice $(s, a_h)$, where $a_h$ is the human's action, regarded as a suboptimal action at state $s$. When the critic network $Q(s, a)$ converges to its optimum $Q^*(s, a)$, the human's action $a_h$ usually yields a larger Q value than other actions $a$, that is, $Q^*(s,a_h) > Q^*(s,a)$. Based on this idea, we define the new advantage loss function $A_i(s, a)$ using the experience of a human expert, which is given by

考虑到TD3是一种演员-评论家框架算法，在连续动作空间中具有出色的性能，我们就基于TD3来应用QDP策略。值得指出的是，所提出的QDP可以与其他DRL算法相结合，前提是该算法包括两个评论网络，例如DDPG、SAC等。主要原因是QDP中的查询策略依赖于两个评论网络。在组合过程中，它需要在DRL算法中加入查询策略以及训练过程中加入优势损失函数。在演员-评论家框架中，评论家网络估计越准确，演员网络获得的益处就越多。因此，我们使用QDP中定义的优势损失函数来提高评论网络的收敛速度，从而提高动作网络的性能。经验缓冲池$B_h$存储人类的建议$(s, a_h)$，其中$a_h$是人类指导的行为，它被视为状态$s$的次优行为。在评论网络$Q(s, a)$收敛到其最佳$Q^*(s, a)$的情况下，人类的动作$a_h$通常比其他动作$a$产生更大的Q值，即$Q^∗(s,a_h) > Q^∗(s,a)$。基于这一想法，我们利用人类专家的经验来定义新的优势损失函数$A_i(s, a)$，该经验由下式给出：
$$
A_i(s,a)=\frac{\sum [Q_i(s,a_h)-Q_i(s,\pi(s))]}{N} 
$$
where $Q(s,\pi(s))$ is the expected return of the agent’s action in the state $s$ by following the policy $\pi(s)$, $Q(s,a_h)$ represents the expected return of human expert experience and its training data are sampled from the human expert experience buffer pool $B_h$, and $N$ is the batch size. The advantage loss function represents the difference between the expert’s experience and the agent’s current policy. Considering the expert experience is usually better than the current action, we use the gradient ascent to improve the Q value of human expert experience and provide a direction for updating the critic network. On the other hand, the critic network is updated based on the following TD error:

其中$Q(s,\pi(s))$是智能体在状态$s$遵循策略$\pi(s)$做出动作的预期回报，，$Q(s,a_h)$代表人类专家经验的预期回报，其训练数据是从人类专家经验缓冲区$B_h$抽样得到，$N$是batch-size批处理大小。优势损失函数代表了专家的经验和智能体当前策略之间的差异。考虑到专家经验通常比当前行动更好，我们使用梯度上升来提高人类专家经验的Q值，并为更新评论网络提供方向。另一方面，评论网络根据以下TD误差进行更新：
$$
lossQ = \frac{\sum\left(\min_{i=1,2}(y_i)-Q_{\theta_i}(s,a)\right)^2}{N}
$$
where $y_i$ is the target value computed from the $i$th target network, of which the minimum over $i$ is used, given by:

$y_i$代表目标网络的最小值，如下式：
$$
y_i=r + \gamma Q_{\theta_i'}(s',\pi_{\phi'}(s')+\varepsilon)\\
\varepsilon \sim \mathrm{clip}(\mathcal{N}(0,\sigma),-c,c)
$$
The training process of the critic network is shown in *Fig. 3*. The target network mechanism and the minimum target value mechanism effectively reduce the influence of bias in the Q value estimation on the algorithm. The advantage loss function provides a direction for updating the critic network: during training, its purpose is to raise the Q value at $(s,a_h)$, which serves as the criterion guiding the critic update. Because the actor network is constantly updated, the action $a$ taken by the agent in state $s$ moves closer to $a_h$. To adapt the advantage loss function to the current policy, we use the difference $Q_1(s,a_h) - Q_1(s,\pi(s))$. We found that using only the critic network $Q_1$ improves the algorithm's performance and stability, possibly because the actor network is updated with (4), which is itself based on $Q_1$. The TD error makes the critic network estimate the Q value of the current state–action pair more accurately. Because the TD error plays the essential role in the update of the critic network, the advantage loss function is limited during training so that it only fine-tunes the critic network. Each critic network being trained is updated according to the advantage loss function and the TD error, where the training data for calculating $lossQ$ is sampled from the buffer pool $B$. The actor network is updated with

评论网络的训练过程如*图3*所示。使用目标网络机制和最小目标值机制可以有效地减少Q值估计中偏差对算法的影响。优势损失函数为更新评论家网络提供了方向。在训练过程中，优势损失函数的目的是提高$(s,a_h)$的Q值，这是指导评论网络更新以提高Q值的规范。由于演员网络不断更新，智能体在状态$s$采取的行动$a$将更接近$a_h$。为了使优势损失函数适应策略，我们使用$Q_1(s,a_h) − Q_1(s,\pi(s))$的差值。我们发现只有使用评论网络$Q_1$才能提高算法的性能和稳定性，主要原因可能是使用（4）基于$Q_1$更新演员网络。TD误差使评论网络更准确地估计当前状态-动作对的Q值。由于TD误差是更新评论网络中发挥重要作用的部分，因此在训练期间，优势损失函数将受到限制，使其只对评论网络进行微调。每个正在训练的评论网络将根据优势损失函数和TD误差进行更新，其中用于计算损失Q的训练数据是从缓冲池$B$中采样的。演员网络更新为:
$$
\nabla J(\phi)=\frac{\sum \nabla _aQ_{\theta_1}(s,a)|_{a=\pi_{\phi}(s)}\nabla _{\phi}\pi_{\phi}(s)}{N}
$$
The use of human expert experience improves the convergence speed of the critic network, and the more accurate the critic network is, the more favorable it is for the actor network, which indirectly affects the actor network and improves its training speed.

使用人类专家经验可以提高评论网络的收敛速度，评论网络越准确，对动作网络就越有利，这间接影响了动作网络并提高其训练速度。

The advantage loss function (9) defined in QDP captures the Q value of the human expert's action relative to the Q value of the agent's action. During the training process, the agent's policy is constantly changing, and the advantage loss function lets the agent correct the updating direction of the critic network in a timely manner. Note that the QDP-HRL algorithm is suitable for continuous action space since it only uses the actions generated by the policy and does not need the Q values of all state–action pairs. The detailed QDP-HRL algorithm is described in Algorithm 1.

QDP中定义的优势损失函数（9）捕获了人类专家Q值动作的相对效应和智能体动作的Q值。在训练过程中，智能体的策略在不断变化，其可以根据优势损失函数及时纠正评论网络的更新方向。请注意，QDP-HRL算法适用于连续操作空间，因为它只使用策略生成的动作，而不知道所有状态-动作对的Q值。详细的QDP-HRL算法在*算法1*中进行了描述。
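
To make the interplay of the TD error and the advantage loss function (9) concrete, the sketch below evaluates both terms on toy data and combines them into one scalar objective for the critic. It is only an illustration under loud assumptions: the critic and policy are plain Python callables on scalar states and actions, and the combination `td - adv_weight * advantage` with a small hypothetical coefficient `adv_weight` is one possible reading of "gradient ascent on the advantage, limited so that it only fine-tunes the critic". The text above does not specify how the advantage term is limited.

```python
def td_loss(batch, targets, q):
    """Mean squared TD error over agent pairs (s, a) with targets y."""
    errors = [(y - q(s, a)) ** 2 for (s, a), y in zip(batch, targets)]
    return sum(errors) / len(errors)


def advantage(expert_batch, q, policy):
    """Mean of Q(s, a_h) - Q(s, pi(s)) over expert pairs (s, a_h)."""
    gaps = [q(s, a_h) - q(s, policy(s)) for s, a_h in expert_batch]
    return sum(gaps) / len(gaps)


def critic_objective(batch, targets, expert_batch, q, policy, adv_weight):
    """Quantity to minimize: TD loss minus a weighted advantage term.

    Subtracting the advantage means minimizing this objective ascends on
    Q(s, a_h) - Q(s, pi(s)). adv_weight is a hypothetical coefficient.
    """
    return (td_loss(batch, targets, q)
            - adv_weight * advantage(expert_batch, q, policy))


# Toy critic and policy (assumptions for illustration only).
q1 = lambda s, a: s * a
policy = lambda s: 1.0
agent_batch = [(1.0, 2.0), (2.0, 1.0)]   # (s, a) sampled from B
targets = [3.0, 2.0]                     # TD targets for those pairs
expert_batch = [(1.0, 3.0), (2.0, 2.0)]  # (s, a_h) sampled from B_h

print(td_loss(agent_batch, targets, q1))
print(advantage(expert_batch, q1, policy))
print(critic_objective(agent_batch, targets, expert_batch, q1, policy, 0.1))
```

In a neural-network implementation the same objective would be differentiated with respect to the critic parameters. The advantage term only needs the expert action and the policy's own action, which is why no enumeration of the continuous action space is required.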