### Christopher Miller 
### EECS435: Deep Learning Foundations from Scratch
### Final Project
#### Video:  https://youtu.be/YaCRCO-yEJU

#### Introduction
As a result of the recent explosion in available data and access to affordable computational power, the human-robot interaction community has followed the trend of many fields and embraced machine learning techniques. Human-robot interaction, commonly referred to as HRI, is a field dedicated to the study of robotic systems used by or alongside humans [1]. The vast majority of robots today, from autonomous cars and manufacturing robots to the Mars rovers and bomb disposal robots, interface with humans. Compiling a review of all applications of machine learning within HRI would be extremely cumbersome; therefore, the scope of this review is narrowed. It surveys current (past ~5 years) machine learning methods for human-like interaction in areas such as emotional classification, social interactions between humans and robots, and human-robot control sharing.

#### Literature Review

Most people respond appropriately to others by first observing the other person’s emotional state; people have some culturally built-in model of how to respond to one another [2]. Most researchers agree that emotional expression is categorical and can be broken down into six common emotions (anger, disgust, fear, happiness, sadness, and surprise) [3]; such categorization is advantageous because it frames emotion recognition as a classification problem to which deep learning methods can be applied. In this author’s compilation of literature, two primary families of emotional classification methods arose: those based on facial recognition [4-7] and those based on other cues [8-10].

Most recent advances in facial expression recognition (FER) are the result of significant improvements in computational power and drastic growth in both the size and availability of labeled databases [5]. Further, significant advances in preprocessing (e.g. facial alignment via neural networks, illumination normalization, scaling) have also improved the ability of deep networks to perform classification [5]. Presently, state-of-the-art research seeks to move out of the laboratory environment and perform FER “in-the-wild” [4, 6-7]. In other words, researchers want to perform FER on imperfect images in real time. However, it has been noted that FER methods that perform well on some datasets may perform poorly on others, which limits their use in-the-wild [4]. This appears to be the result of differences in labeling and dataset composition. The authors of [4] proposed a selective joint multi-task (SJMT) approach that uses a selective sigmoid cross-entropy loss function to address this multi-dataset, multi-label problem. The selective loss allows a single CNN to classify multiple emotions, as opposed to typical methods in which different deep CNNs are used to classify different emotions; the accuracy of this new method was comparable to or better than the state of the art at the time.
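To make the idea of a selective loss concrete, the sketch below shows one plausible reading of it: a sigmoid cross-entropy that is only evaluated on the label slots a given sample's source dataset actually annotates. The function name, the `mask` argument, and the normalization by the number of active labels are this author's assumptions for illustration; they are not taken from [4], whose exact formulation may differ.

```python
import numpy as np

def selective_sigmoid_cross_entropy(logits, targets, mask):
    """Mean sigmoid cross-entropy over the label slots where mask == 1.

    logits, targets, mask: arrays of shape (batch, num_labels).
    mask[i, j] = 1 if sample i's source dataset annotates label j, else 0
    (hypothetical convention used only for this sketch).
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    mask = np.asarray(mask, dtype=float)

    # Numerically stable form of -[t*log(s) + (1 - t)*log(1 - s)], s = sigmoid(x):
    #   max(x, 0) - x*t + log(1 + exp(-|x|))
    per_label = (np.maximum(logits, 0.0) - logits * targets
                 + np.log1p(np.exp(-np.abs(logits))))

    n_active = mask.sum()
    if n_active == 0:
        return 0.0  # no annotated labels in this batch
    # Unannotated slots are zeroed out, so they contribute no loss.
    return float((per_label * mask).sum() / n_active)
```

Because masked-out labels contribute nothing, a single network with one output per emotion can in principle be trained on batches drawn from datasets that each annotate a different subset of labels.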

Another advance in classifying emotions in-the-wild is presented in [7], where classification relied on more than facial features alone. The authors of [7] combined audio and visual features using Multiple Kernel Learning to find which features would be optimal for use in a support vector machine (SVM) classifier. By tracking faces, physical features, and audio cues, the authors of [7] were able to show an improvement of 15 percentage points over baseline methods for FER in-the-wild (~37% vs 22% baseline). However, the performance presented in [6] showed an improvement of more than 20 percentage points over similar baseline measures (~56% vs ~36% baseline) for FER classifications in-the-wild; this was, at the time of publication, a state-of-the-art result. In [6], the authors implement a more traditional deep CNN for FER and significantly improve their classifications by applying randomized perturbations to the faces being classified. The authors also presented a multiple-network framework (i.e. running multiple CNNs in parallel) with a voting mechanism in which each network is given an adaptive weight; the procedure for determining each weight is described in [6]. This is unlike prior methods, which simply average the outputs of multiple networks.
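The difference between averaging and weighted voting can be illustrated with a short sketch. The function `ensemble_predict` and the hand-picked weights below are hypothetical; in [6] the weights are obtained by the authors' own procedure, which is not reproduced here.

```python
import numpy as np

def ensemble_predict(probs, weights=None):
    """Combine per-network class probabilities into one prediction.

    probs:   array of shape (n_networks, n_samples, n_classes).
    weights: array of shape (n_networks,), non-negative; None means a
             plain average (every network counts equally).
    """
    probs = np.asarray(probs, dtype=float)
    if weights is None:
        weights = np.ones(probs.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1

    # Weighted sum over the network axis -> shape (n_samples, n_classes).
    combined = np.tensordot(weights, probs, axes=1)
    return combined.argmax(axis=1), combined


# Three networks, one sample, two classes (made-up numbers).
probs = [[[0.9, 0.1]],
         [[0.4, 0.6]],
         [[0.4, 0.6]]]
avg_label, _ = ensemble_predict(probs)
weighted_label, _ = ensemble_predict(probs, weights=[0.1, 0.45, 0.45])
```

With these made-up numbers, the plain average is dominated by the one very confident network and selects class 0, while down-weighting that network flips the decision to class 1. This is the kind of behavior an adaptive weighting scheme is meant to control.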

Unlike FER methods, methods reliant on social dynamics depend on more than facial recognition or weigh facial expressions less heavily. For example, the work presented in [8] depends more heavily on body language than on facial expression. The authors of [8] relied on a feed-forward, biologically inspired neural network for low- and high-level feature processing and a linear SVM for final categorization. This resulted in an emotion classifier with 82% accuracy (compared to the 87% accuracy achieved by humans). The authors of [8] also concluded that only 2D human pose information is necessary for emotional classification, which reduces the need for more complex cameras or processing. The authors of [9], by contrast, moved away from vision methods completely and implemented a deep neural network to classify emotion from speech. After using a deep neural network to extract high-level features (segmenting utterances), the method passed those features to an extreme learning machine, a neural network comprising one hidden layer with many hidden units whose weights are assigned at random [9]. The results of this work indicated the viability of speech-based emotion classification using deep learning. Finally, some researchers have fused EEG signals with eye gaze data and passed them through a support vector machine to classify emotional responses to videos with reasonable success (~68% accuracy) [10]. The results from [8-10] indicate the feasibility of classifying emotion with methods that go beyond FER and suggest that a combination of the aforementioned methods could be viable.
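For readers unfamiliar with extreme learning machines, the following is a minimal sketch of the standard formulation: the hidden-layer weights are drawn at random and left untrained, and only the output weights are fitted, here by least squares. The class name, the `tanh` activation, the hidden-layer size, and the toy XOR data are illustrative assumptions and do not reflect the configuration or features used in [9].

```python
import numpy as np

class ExtremeLearningMachine:
    """Single-hidden-layer network with random, fixed hidden weights."""

    def __init__(self, n_hidden=64, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random projection followed by a nonlinearity.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y, n_classes):
        X = np.asarray(X, dtype=float)
        # Hidden-layer weights are assigned at random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(n_classes)[np.asarray(y)]  # one-hot targets
        # Only the output weights are fitted, via linear least squares.
        self.beta, *_ = np.linalg.lstsq(H, T, rcond=None)
        return self

    def predict(self, X):
        H = self._hidden(np.asarray(X, dtype=float))
        return (H @ self.beta).argmax(axis=1)


# Toy usage: XOR, which a purely linear classifier cannot separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
elm = ExtremeLearningMachine(n_hidden=20).fit(X, y, n_classes=2)
pred = elm.predict(X)
```

Since training reduces to a single linear solve, fitting is fast compared with backpropagating through the hidden layer, which is one reason such a model can be attractive as a final classifier on top of features extracted by a deep network.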

How robots respond to emotions once classified is another area of active research. One research group allowed a robot to interact in-the-wild with humans for 14 days using a multi-modal deep Q-network (MDQN) [11]. In this work, the robot used a “guess and check” method to learn how to properly interact with a human given the perceived state of the human. While the robot had only four possible ways to interact with the human, the resulting MDQN achieved an accuracy exceeding 80%. Others have built entire environments, instrumented with a series of cameras and microphones, to observe how humans interact with the environment’s contents and thereby teach a robot a discrete set of possible interactions with a human counterpart [12]. In that work, the environment was a mock electronics store, and a Naïve Bayesian classifier was used to classify human actions that were associated with desired actions. This technique was notable because it was extremely robust to sensor and environmental noise (~85% classification accuracy) and demonstrated the viability of big-data-driven approaches in human-centric environments.

Teaching robots how to interact with humans in a social environment can be extremely difficult because there are no apparent cost functions for social interactions [13]. This leads researchers to argue that robots will best learn how to interact with humans by learning from human interactions under human guidance. The researchers in [13] developed a system reliant on SPARC (Supervised Progressively Autonomous Robot Competencies), in which a human allows, blocks, or suggests actions by a robotic platform. Other methods rely on robots learning from human demonstration; this is known as Learning from Demonstration (LfD) [14]. In LfD, an “expert” teacher defines a robot’s control policy by demonstrating a task (e.g. painting the hull of a ship). The authors of [15] developed a model based on a deep Q-network that is able to learn from human demonstrations in order to teach a robot how to properly respond to social interactions from a person. Most notably, the authors of [15] claim this is the first implementation of a deep Q-network used for high-level LfD.

Not all robots need to interact with people in a social sense. Many robots need to know how best to help a human complete a task or when to provide more assistance. Operators who are affected by changing health or environmental factors are not consistently reliable when operating complex machines. Some researchers seek to quantify a human’s cogency, or the extent to which a robot should act upon a human’s commands while maximizing human safety and task performance [16-17]. Some use measures of cogency, commonly referred to as trust, to linearly blend the amount of assistance provided by the robot with the human’s control [18], whereas others use trust measures to shift between discretely defined levels of autonomy [19].
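The two ways of using a trust measure described above can be sketched in a few lines. This is an illustrative sketch only: the function names, the convention that trust lies in [0, 1] with 1 meaning full human control, and the threshold values are this author's assumptions and are not taken from [18] or [19].

```python
import numpy as np

def blend_commands(u_human, u_robot, trust):
    """Linear blend of human and robot control commands.

    trust = 1 passes the human command through unchanged;
    trust = 0 hands full control to the robot (assumed convention).
    """
    alpha = float(np.clip(trust, 0.0, 1.0))
    u_human = np.asarray(u_human, dtype=float)
    u_robot = np.asarray(u_robot, dtype=float)
    return alpha * u_human + (1.0 - alpha) * u_robot


def autonomy_level(trust, thresholds=(0.33, 0.66)):
    """Map trust to a discrete level: 0 = robot-led, 1 = shared, 2 = human-led.

    The thresholds are hypothetical placeholders.
    """
    return int(np.searchsorted(thresholds, trust, side="right"))


# Example: the human steers along x, the robot's planner prefers y.
u = blend_commands(u_human=[1.0, 0.0], u_robot=[0.0, 1.0], trust=0.75)
level = autonomy_level(0.75)
```

Linear blending changes the robot's contribution smoothly as trust changes, whereas discrete levels make the current mode explicit to the operator at the cost of abrupt transitions at the thresholds.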

Present measures of robotic trust are founded more heavily in optimal control than in machine learning [20-21]. However, some have begun to apply machine learning methods to measure human ability and inform trust measures. In the work presented in [22], the amount of assistance a human requires is defined by the human’s requests; a human operator is able to directly modulate the amount of assistance provided by their robotic partner. The results of [22] allowed for the development of a cost function specific to each human-robot pair, over which machine learning algorithms could be used to learn a proper assistance blending policy.

It is difficult to classify human intent, emotions, and needs. Human-robot interaction seeks to better understand how humans and robots interact and, as a result, build robots capable of best assisting or working alongside their human operators. The recent explosion of computing power and data has allowed for rapid advances in HRI using numerous machine learning methods, most notably deep learning techniques. However, as noted in [22], all engineers and roboticists today must remember that humans will act based upon their desires and that, sometimes, these desires are not in any way optimal. This fact must be considered in the development of all machine learning algorithms intended for use in HRI.


#### Works Cited
1)	M. A. Goodrich and A. C. Schultz. Human-Robot Interaction: A Survey. Foundations and Trends in Human-Computer Interaction, 1(3), 2007, pp 203-275. 

2)	Konrad Schindler, Luc Van Gool, Beatrice de Gelder, Recognizing emotions expressed by body pose: A biologically inspired neural model, Neural Networks, Volume 21, Issue 9, 2008, Pages 1238-1246, ISSN 0893-6080, https://doi.org/10.1016/j.neunet.2008.05.003.

3)	P. Ekman, Universal facial expressions of emotion, California Mental Health Research Digest 8 (1970) 151–158.

4)	Pons, Gerard and David Masip. “Multi-task, multi-label and multi-domain learning with residual convolutional networks for emotion recognition.” CoRR abs/1802.06664 (2018): n. pag.

5)	Li, Shan and Weihong Deng. “Deep Facial Expression Recognition: A Survey.” CoRR abs/1804.08348 (2018): n. pag.

6)	 Zhiding Yu and Cha Zhang. 2015. Image based Static Facial Expression Recognition with Multiple Deep Network Learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (ICMI '15). ACM, New York, NY, USA, 435-442. DOI: https://doi.org/10.1145/2818346.2830595

7)	Karan Sikka, Karmen Dykstra, Suchitra Sathyanarayana, Gwen Littlewort, and Marian Bartlett. 2013. Multiple kernel learning for emotion recognition in the wild. In Proceedings of the 15th ACM on International conference on multimodal interaction (ICMI '13). ACM, New York, NY, USA, 517-524. DOI: https://doi.org/10.1145/2522848.2531741

8)	Konrad Schindler, Luc Van Gool, Beatrice de Gelder, Recognizing emotions expressed by body pose: A biologically inspired neural model, Neural Networks, Volume 21, Issue 9, 2008, Pages 1238-1246, ISSN 0893-6080, https://doi.org/10.1016/j.neunet.2008.05.003.

9)	Han, Kun, Dong Yu and Ivan Tashev. “Speech emotion recognition using deep neural network and extreme learning machine.” INTERSPEECH (2014).

10)	  M. Soleymani, M. Pantic and T. Pun, "Multimodal Emotion Recognition in Response to Videos," in IEEE Transactions on Affective Computing, vol. 3, no. 2, pp. 211-223, April-June 2012. doi: 10.1109/T-AFFC.2011.37

11)	A. H. Qureshi, Y. Nakamura, Y. Yoshikawa and H. Ishiguro, "Robot gains social intelligence through multimodal deep reinforcement learning," 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), Cancun, 2016, pp. 745-751. doi: 10.1109/HUMANOIDS.2016.7803357

12)	P. Liu, D. F. Glas, T. Kanda, and H. Ishiguro, "Data-Driven HRI: Learning Social Behaviors by Example from Human-Human Interaction," IEEE Transactions on Robotics, vol. 32, pp. 988-1008, 2016. DOI: 10.1109/TRO.2016.2588880

13)	E. Senft, S. Lemaignan, P. Baxter, and T. Belpaeme. Toward Supervised Reinforcement Learning with Partial States for Social HRI. AAAI Fall Symposium Series, North America, Oct. 2017. Date accessed: 20 Mar. 2019.

14)	 Brenna D. Argall, et al., A survey of robot learning from demonstration, in Robotics and Autonomous Systems, Volume 57, Issue 5, 2009, Pages 469-483.

15)	Madison Clark-Turner and Momotaz Begum. 2018. Deep Reinforcement Learning of Abstract Reasoning from Demonstrations. In HRI ’18: 2018 ACM/IEEE International Conference on Human-Robot Interaction, March 5–8, 2018, Chicago, IL, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3171221.3171289

16)	O. Horn, "Smart wheelchairs: Past and current trends," 2012 1st International Conference on Systems and Computer Science (ICSCS), Lille, 2012, pp. 1-6.

17)	S. Musić, et al., Control sharing in human-robot team interaction, Annual Reviews in Control, Volume 44, 2017, Pages 342-354, ISSN 1367-5788.

18)	A. Erdogan, et al., The effect of robotic wheelchair control paradigm and interface on user performance, effort and preference…, in Robotics and Autonomous Systems, 2017.

19)	M. Chiou, et al., "Experimental analysis of a variable autonomy framework for controlling a remotely operating mobile robot," in Proceedings of the IEEE/RSJ International Intelligent Robots and Systems (IROS), 2016.

20)	B. D. Argall, et al. Computable trust in human instruction. In Artificial Intelligence for Human-Robot Interaction - Papers from the AAAI Fall Symposium, Technical Report. 2014.

21)	A. Broad, et al., "Trust Adaptation Leads to Lower Control Effort in Shared Control of Crane Automation," in IEEE Robotics and Automation Letters, vol. 2, no.1, pp. 239-246, Jan. 2017.

22)	D. Gopinath, et al., "Human-in-the-Loop Optimization of Shared Autonomy in Assistive Robotics," in IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 247-254, Jan. 2017.
