__Christopher Miller__

__EECS495 - Deep Reinforcement Learning From Scratch__

__Final Project__

__Spring 2019__ 

__VIDEO LINK:__ https://youtu.be/P6JgZoxgOEU

__Introduction__
Throughout a regular day, one’s ability to focus on and perform tasks involving machinery, be it a car, a wheelchair, or a robot, changes. From drivers on Lake Shore Drive to fighter pilots in combat to wheelchair users in a city to manufacturing plant workers, each could benefit from an intelligent system capable of detecting changes in one’s cogency. Measuring a human partner’s cogency, or their ability to clearly reason at any given time, is presently proposed as a means of determining the amount of robotic assistance to provide a user to either optimize their performance or maintain operator safety; this measure is referred to as robotic trust.$^1$ Robotic trust could be used to linearly blend the human-robot control signals or to select a known level of assistance$^2$. 

The purpose of this project is to briefly review present methods of measuring robotic trust, present literature at the interface of robotic trust, shared control, and reinforcement learning, and briefly discuss proposed future work and known issues. 

__Robotic Trust__

Within the field of human-robot control sharing, a subfield of human-robot interaction (HRI), we seek to optimize the amount of assistance a robotic agent provides its human partner.$^2$ This optimization often seeks to maximize user control, safety, and performance. Robotic trust uses measures of cogency to temper the control authority allocation depending on the human’s abilities; in other words, trust quantifies the usability of the human’s control signals to their robotic partner and informs the amount of control authority to grant the human operator.$^{2,3}$
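As a minimal sketch of how a trust value could temper control authority, the snippet below linearly blends a human command and a robot command. The function name `blend_controls`, the convention that trust lies in [0, 1], and the clamping step are assumptions made for illustration; they are not taken from the cited works.

```python
def blend_controls(human_cmd, robot_cmd, trust):
    """Linearly blend two command vectors.

    trust is assumed to lie in [0, 1] and is treated as the human's share
    of control authority; values outside that range are clamped.
    """
    if len(human_cmd) != len(robot_cmd):
        raise ValueError("command vectors must have the same length")
    alpha = min(1.0, max(0.0, trust))  # clamp trust to [0, 1]
    return [alpha * h + (1.0 - alpha) * r for h, r in zip(human_cmd, robot_cmd)]


# Hypothetical 2-D velocity commands: the human pushes forward, the robot turns.
blended = blend_controls([1.0, 0.0], [0.0, 1.0], trust=0.75)
```

With a trust of 0.75, three quarters of the blended command comes from the human; a trust of 0 would hand the robot full authority.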

Robotic trust is presently founded in optimal and model-based control theory: trust is commonly the measure of the divergence between a computed optimal control trajectory and the human’s control inputs.$^{1,4}$ Others have used model-based non-linear control to define a robot’s error function (i.e. to define safety or divergence from a preferred state) to inform a user of their suboptimal commands.$^5$ The common theme among these control-theoretic methods is the measurement of the difference between a human’s control signals and a robot’s preferred control signals. While divergence from optimality may work within well-defined environments, such as manufacturing, these methods may not perform well in unconstrained environments, where a divergence may indicate an optimal solution known only to the human or a change in human preference. The results of [6] show that the mathematically optimal cost functions are often divergent from the human-preferred cost functions in control sharing. Moreover, that paper shows the complexity of modeling human preference within most mathematical frameworks.$^6$

__Reinforcement Learning in Shared Control__

With control-theoretic methods somewhat limited to constrained environments, new ways of measuring robotic trust may be necessary for the real world. To the best of this author’s knowledge, there are no reinforcement learning methods used to define robotic trust. However, there are RL methods used within shared control, which indicates their potential for use in defining a possible trust measure and improving the state-of-the-art in shared control. Unlike other methods in shared control, RL methods do not depend on explicit models of either the human or the robot (except for, perhaps, a model of the robot’s dynamics for basic motion control).$^{7-9}$ Model-free methods are extremely attractive, as it’s exceedingly difficult to model human behavior or robot performance in unknown environments.$^6$ 

The work presented in [7] used a model-free RL approach to learn a shared control policy for dexterous motion, specifically for book page turning in a virtual environment. The work presented a method in which a robotic partner attempts to complete a task and incorporates human actions if they would help complete the task more effectively; in other words, if the human operator’s inputs would result in a higher reward (a quicker and/or better page turn), they are incorporated and the actions are stored. This method of control sharing differs significantly from typical paradigms, as the human helps the robot complete the task instead of the converse.$^7$ A more conventional control sharing problem, solved using RL methods, is presented in [8], where a robotic walking aid (i.e. a walker for the elderly) adapts to different operator habits and motor abilities via another model-free RL approach. Here, control is again shared by choosing the action, the user’s or the robot’s, that maximizes the robot’s reward function; the reward is maximized when the user’s applied weight (i.e. the measured effort by the person to move with the walker) is minimized. In [9], the problem of navigating a robot with complex dynamics is explored both in simulation (OpenAI Lunar Lander) and in reality (a quadrotor). In this work, the goal is neither inferred nor known a priori; instead, the policy decodes the user’s intent to complete the task and generate the desired policy. Here, the robot attempts to match the user’s command as closely as possible while attempting to maximize the reward function, diverging when the user’s command is significantly suboptimal. Again, the action that maximizes the learned reward function is selected to share control between the human and the robot. 
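The arbitration idea shared by these works, keeping the human’s action unless it is significantly suboptimal, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation of [7], [8], or [9]: actions are assumed to be discrete, `q_values` is a hypothetical dictionary of learned value estimates, and `tolerance` is a hypothetical parameter controlling how much value the human is allowed to give up.

```python
def arbitrate(q_values, human_action, tolerance=0.2):
    """Return the human's action unless its value is too far below the best.

    q_values: dict mapping each discrete action to an estimated value
              (assumed to come from some learned model-free RL agent).
    tolerance: fraction of the value range the human may sacrifice
               before the robot overrides (an illustrative assumption).
    """
    best_action = max(q_values, key=q_values.get)
    best = q_values[best_action]
    worst = min(q_values.values())
    threshold = best - tolerance * (best - worst)
    if q_values[human_action] >= threshold:
        return human_action  # close enough to optimal: defer to the human
    return best_action       # significantly suboptimal: the robot overrides


q = {"left": 1.0, "stay": 0.9, "right": 0.0}
chosen = arbitrate(q, human_action="stay")
```

In this example the human’s choice of `"stay"` is retained because its value is within the tolerance of the best action, whereas `"right"` would be replaced by `"left"`.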

All three methods, [7-9], model shared control as a Markov Decision Process (MDP); this diverges from simpler shared control methods, where either a decision tree or a type of linear blending of human and robot control signals is used.$^2$ There has yet to be a comparison of MDP-based shared control and more studied methods in the literature. Further, none of the aforementioned methods considers user safety, and only [8] considers user comfort (i.e. the walker’s smoothness of motion). 

If some form of goal inference is possible, such as is presented in [10], it’s feasible that a series of models could be generated, making model-based reinforcement learning possible. In theory, a different RL model could be used for each task to share control within the human-robot team. [6, 11] indicate the need for individualized models. To generate these individualized models, Learning from Demonstration (LfD) could be used.$^{12}$ In LfD, a model would be constructed from a user’s demonstrations of the robot completing the inferred task, thus pre-optimizing to the user’s preferences. A body of literature has indicated that demonstration-based reinforcement learning performs adequately in simulation;$^{13-14}$ however, there is no literature presenting results using robotic hardware. 

__Reinforcement Learning and Robotic Trust__
As mentioned, there are no existing robotic trust measures relying on reinforcement learning. Further, there is, for the moment, sparse research within RL-based shared control. The following section is a literature-informed set of possible research ideas and issues within shared control and, generally, HRI. (Author Note: The names of the following sections won’t be great. Please, don’t judge me by the quality of my naming but by the quality of my ideas)

_Divergence-Based RL-Trust:_ A simple RL-trust measure would compute the optimal action to maximize the reward at each step and compute the mean difference between the reward of the human’s true action and the reward of the optimal action. This method would adapt the methods in [9], but use only the difference in rewards to update trust, and then use trust to share control between the human and the agent.  
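One possible form of this measure is sketched below. It is an assumption-laden illustration rather than a settled definition: `q_values` stands in for per-action reward estimates from a learned agent, the per-step gap is normalised by the range of those estimates, and trust is taken to be one minus the mean normalised gap so that it falls in [0, 1].

```python
def divergence_trust(steps):
    """Compute trust as 1 minus the mean normalised reward gap.

    steps: list of (q_values, human_action) pairs, where q_values maps each
           discrete action to an estimated reward (hypothetical inputs).
    """
    if not steps:
        raise ValueError("at least one step is required")
    gaps = []
    for q_values, human_action in steps:
        best = max(q_values.values())
        worst = min(q_values.values())
        spread = best - worst
        # Gap is 0 when the human picks the best action, 1 for the worst.
        gap = 0.0 if spread == 0 else (best - q_values[human_action]) / spread
        gaps.append(gap)
    return 1.0 - sum(gaps) / len(gaps)


episode = [
    ({"a": 1.0, "b": 0.0}, "a"),  # human matched the optimal action
    ({"a": 1.0, "b": 0.0}, "b"),  # human chose the worst action
]
trust = divergence_trust(episode)
```

The resulting value could then be used as the blending weight or to select a level of assistance, as discussed elsewhere in this review.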

_Social Influence Divergence-Based RL-Trust:_ In [15], an autonomous agent within a robotic swarm relies on reinforcement learning to generate a model of its counterparts to complete a known task. The robots have no knowledge of the other agents aside from the actions each of them is performing. Here, the agents “imagine” the other actions their partners could have taken and determine how much to weight the behaviors of their partners based upon how much the partners’ actions influenced the global reward function. 

The proposed measure would first enumerate the possible human actions at a given time step using some form of intent inference$^{10}$, which would assign each action a probability. For each of these actions, the reward would be computed$^{7-9}$. The reward for the action actually taken (or for the closest of the inferred actions) would then be recorded. Here, trust would be a function of the action taken, its likelihood of occurrence, and the associated reward. Trust would increase when the human chooses the optimal action even though it was very unlikely, and decrease when the human chooses a likely but extremely suboptimal action. The exact computation is still being explored. 
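Since the exact computation is left open, the following is only one hypothetical scoring rule that reproduces the two qualitative behaviours described above; it should not be read as the intended definition. The names `action_probs` and `rewards`, and the specific formula, are assumptions introduced purely for illustration.

```python
def influence_trust_signal(action_probs, rewards, chosen):
    """Score one human action in [-1, 1] (hypothetical rule).

    action_probs: dict of inferred probabilities for each candidate action.
    rewards: dict of estimated rewards for the same candidate actions.
    chosen: the action the human actually took.
    """
    best = max(rewards.values())
    worst = min(rewards.values())
    spread = best - worst
    # Quality is 1 for the optimal action and 0 for the worst one.
    quality = 1.0 if spread == 0 else (rewards[chosen] - worst) / spread
    p = action_probs[chosen]
    # Reward unlikely-but-good choices; penalise likely-but-bad choices.
    return (1.0 - p) * quality - p * (1.0 - quality)


probs = {"a": 0.25, "b": 0.75}
rews = {"a": 1.0, "b": 0.0}
surprising_good = influence_trust_signal(probs, rews, "a")
expected_bad = influence_trust_signal(probs, rews, "b")
```

Under this rule, the unlikely optimal action `"a"` yields a positive signal and the likely suboptimal action `"b"` yields a negative one; how such per-step signals should be accumulated into a trust value remains an open design choice.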

_Safety-Aware RL-Based Trust:_ In this model, the methods presented in [9] would be mostly reused; however, measures of safety would be appended. Every time a user attempts to perform an unsafe action, it would be blocked using a known action-blocking method, such as the Maxwell’s Demon Algorithm.$^{16}$ Trust here would be a measure of both the number of blocked actions (more blocked actions, lower trust) and the quality of the actions allowed to pass (i.e. if an action is suboptimal but ‘close enough’ to the optimal action, it’s still allowed to pass). 
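A rough sketch of how these two ingredients might be combined is given below. It does not implement the Maxwell’s Demon Algorithm; the blocking decision is assumed to be supplied by some external safety filter, and multiplying the pass rate by the mean quality of the passed actions is an illustrative choice rather than a proposed definition.

```python
def safety_aware_trust(steps):
    """Combine blocking frequency and passed-action quality into one score.

    steps: list of (blocked, quality) pairs, where blocked is a bool from a
           hypothetical external safety filter and quality in [0, 1] rates
           how close the action was to the optimal one.
    """
    if not steps:
        raise ValueError("at least one step is required")
    passed = [quality for blocked, quality in steps if not blocked]
    pass_rate = len(passed) / len(steps)  # fewer blocks -> higher trust
    mean_quality = sum(passed) / len(passed) if passed else 0.0
    return pass_rate * mean_quality


history = [(False, 1.0), (True, 0.0), (False, 0.5), (True, 0.0)]
trust = safety_aware_trust(history)
```

In this example half of the actions were blocked and the passed actions had a mean quality of 0.75, so the combined score is lower than either factor alone.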

For all of the methods mentioned above, trust could be used to determine the level-of-autonomy (LoA) in which to place the robot. For this review, the LoA is a discretized level describing a constant amount of assistance provided to the user$^{17}$. If trust is high, the human has more control with some robotic assistance. If trust is low, the human has little control of the system. 
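To make the mapping from a continuous trust value to a discrete LoA concrete, a small sketch follows. The three level names and the two thresholds are hypothetical placeholders chosen for illustration; they are not drawn from [17].

```python
import bisect

# Hypothetical levels, ordered from least to most human control.
LEVELS = ("robot-led", "shared control", "human-led with assistance")
# Hypothetical trust thresholds separating adjacent levels.
THRESHOLDS = (0.4, 0.7)


def level_of_autonomy(trust):
    """Map a trust value in [0, 1] to one of the discrete levels."""
    return LEVELS[bisect.bisect_right(THRESHOLDS, trust)]


low = level_of_autonomy(0.2)
high = level_of_autonomy(0.9)
```

Here a trust of 0.2 places the robot in the lead, while a trust of 0.9 leaves the human in control with some assistance.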

_Final Thoughts:_ Generally, reinforcement learning is not yet widely used in shared control. Where RL has been used, it’s been used for simpler problems, in simulation, or on prohibitively expensive robotic platforms (e.g. Medicare would never pay for a whole-home Vicon system or a Velodyne LiDAR system for the wheelchair of a person with a disability). Another aim of this research is to determine if RL is useful in shared control, in measuring robotic trust, and if it’s feasible to use on realistic hardware. Implementing these methods on actual hardware, performing proper human studies, and leveraging the imperfections of 'actual hardware' are areas for future research. 


__References__
1.	B. D. Argall, et al. Computable trust in human instruction. In Artificial Intelligence for Human-Robot Interaction - Papers from the AAAI Fall Symposium, Technical Report. 2014.
2.	S. Musić, et al., Control sharing in human-robot team interaction, Annual Reviews in Control, Volume 44, 2017, Pages 342-354, ISSN 1367-5788.
3.	O. Horn, "Smart wheelchairs: Past and current trends," 2012 1st International Conference on Systems and Computer Science (ICSCS), Lille, 2012, pp. 1-6.
4.	A. Broad, et al., "Trust Adaptation Leads to Lower Control Effort in Shared Control of Crane Automation," in IEEE Robotics and Automation Letters, vol. 2, no.1, pp. 239-246, Jan. 2017.
5.	H. Saeidi et al., A Trust-Based Mixed-Initiative Teleoperation Scheme for the Shared Control of Mobile Robotic Systems. 2016. (doi: 10.13140/RG.2.2.10840.90888.)
6.	D. Gopinath, et al., "Human-in-the-Loop Optimization of Shared Autonomy in Assistive Robotics," in IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 247-254, Jan. 2017.
7.	T. Matsubara, T. Hasegawa and K. Sugimoto, "Reinforcement learning of shared control for dexterous telemanipulation: Application to a page turning skill," 2015 24th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Kobe, 2015, pp. 343-348.
8.	W. Xu, J. Huang, Y. Wang and H. Cai, "Study of reinforcement learning based shared control of walking-aid robot," Proceedings of the 2013 IEEE/SICE International Symposium on System Integration, Kobe, 2013, pp. 282-287.
9.	S. Reddy, et al. Shared Autonomy via Deep Reinforcement Learning. In Proceedings of Robotics: Science and Systems (RSS ’18), 2018.
10.	S. Jain and B. Argall. Recursive Bayesian Intent Inference in Shared-Control Robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, Oct. 2018.
11.	M. Young, C. Miller, et al., “Implications of Task Features in Robotic Manipulation for Dynamic Autonomy Allocation,” in IEEE International Conference on Robotics and Automation (ICRA), 2017.
12.	Brenna D. Argall, et al., A survey of robot learning from demonstration, in Robotics and Autonomous Systems, Volume 57, Issue 5, 2009, Pages 469-483.
13.	Y. Gao, et al. “Reinforcement Learning from Imperfect Demonstrations.” 2019.  arXiv:1802.05313 [cs.AI]
14.	Tim Brys, et al. 2015. Reinforcement learning from demonstration through shaping. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI'15), Qiang Yang and Michael Wooldridge (Eds.). AAAI Press 3352-3358.
15.	N. Jaques, et al., “Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning.” 2019. arXiv:1810.08647 [cs.LG]
16.	K. Fitzsimons, et al., "Optimal human-in-the-loop interfaces based on Maxwell's Demon," 2016 American Control Conference (ACC), Boston, MA, 2016, pp. 4397-4402.
17.	M. Chiou, et al., "Experimental analysis of a variable autonomy framework for controlling a remotely operating mobile robot," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.

