### Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets

##### Introduction to the Problem
A key factor in deploying robots in the real world is their ability to learn from data. Robot learning frameworks include:
- Reinforcement Learning: a skill is learned based on the interaction of the robot with its environment
- Imitation Learning: the robot is presented with a demonstration of a skill that it should imitate

Imitation learning has mostly focused on isolated demonstrations of a particular skill, often provided through kinesthetic teaching.
To improve scalability, the framework presented here learns to imitate skills from unstructured and unlabeled demonstrations of various tasks.

A complex activity is composed of a set of simpler skills. Learning from such data requires the ability to:
- map the image stream (video) to state-action pairs
- segment the data into simple skills
- imitate each of the segmented skills

The proposed imitation learning method learns a multi-modal stochastic policy that can imitate a number of automatically segmented tasks from a set of unlabeled, unstructured demonstrations.

Similar works assume a way of communicating a new task through a single demonstration, much like a prompt. The main difference is that here the goal is to separate the demonstrations into single skills and learn a policy that can imitate all of them, eliminating the need for new demonstrations at test time.

We have an MDP $$M = (S, A, P, R, p_0, \gamma, T)$$ where:

- $S$ is the state space
- $A$ is the action space
- $P:S\times A \times S \rightarrow \mathbb{R}^+$ is the state-transition probability function
- $R : S\times A\rightarrow \mathbb{R}$ is the reward function
- $p_0 : S \rightarrow \mathbb{R}^+$ is the initial state distribution
- $\gamma$ is the reward discount factor
- $T$ is the horizon


With $\tau = (s_0,a_0,\ldots,s_T,a_T)$ we denote a trajectory of states and actions and with $$R(\tau) = \sum_{t=0}^T\gamma^tR(s_t,a_t)$$ the trajectory reward.
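As a small illustration of the trajectory reward, the sketch below evaluates $R(\tau)$ for a list of per-step rewards. The function name `discounted_return` and the reward values are assumptions made for this example only.

```python
def discounted_return(rewards, gamma):
    """Return R(tau) = sum_t gamma**t * R(s_t, a_t) for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical per-step rewards R(s_t, a_t) for a trajectory with T = 2.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

With $\gamma = 1$ the same function returns the plain sum of the rewards.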

##### Model Description

The goal of reinforcement learning is to find the parameters $\theta$ of a policy $\pi_\theta(a\:|\:s)$ that maximizes the expected discounted reward over trajectories induced by the policy. In imitation learning the reward function is unknown but can be recovered from the set of demonstrated trajectories: those trajectories originate from some policy $\pi_{E_1}$ that optimizes a reward function $R_{E_1}$, so we can estimate $R_{E_1}$ and optimize $\pi_\theta$ with respect to it, recovering the policy $\pi_{E_1}$. This approach is called Inverse Reinforcement Learning (IRL).

In order to model a variety of behaviors, it's better to find the policy $\pi_\theta$ with the highest entropy $H(\pi_\theta)$ that optimizes $R_{E_1}$. This is Maximum-Entropy IRL, with the following optimization objective:
$$\min_R\left(\max_{\pi_\theta} H(\pi_\theta) + \mathbb{E}_{\pi_\theta}[R(s,a)]\right) - \mathbb{E}_{\pi_{E_1}}[R(s,a)]$$
This aims to find a reward function $R$ that assigns low reward to the maximum-entropy policy $\pi_\theta$ and high reward to the expert policy $\pi_{E_1}$, which yields a reward function from which the expert policy can be recovered.

Given multiple demonstrations sampled from a single expert policy $\pi_{E_1}$, this problem can be recast as the optimization of a GAN:
- The policy $\pi_\theta(a\:|\:s)$ is the generator
- $D_w(s,a)$ is the discriminator, parametrized by $w$, whose goal is to distinguish between imitated samples from $\pi_\theta$ (labeled 0) and demonstrated samples from $\pi_{E_1}$ (labeled 1)

The joint optimization is defined as:
$$\max_\theta \min_w \mathbb{E}_{(s,a)\sim \pi_\theta}[\log(D_w(s,a))] + \mathbb{E}_{(s,a)\sim \pi_{E_1}}[\log(1-D_w(s,a))] + \lambda_H H(\pi_\theta)$$
This means finding a saddle point $(\theta, w)$, i.e. a pair $(\pi_\theta, D_w)$, where the policy $\pi_\theta$ makes it hard for the discriminator network $D_w$ to distinguish between samples drawn from it and samples drawn from the expert policy $\pi_{E_1}$.

The discriminator is trained on a mixed set of expert and generator samples, and outputs the probability that a given sample originated from the expert policy. This output is used as a reward signal for the generator, which tries to behave as similarly as possible to the expert policy.
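To make the two roles concrete, here is a minimal sketch that estimates the value of the joint objective (without the entropy term) from lists of discriminator outputs. The function names and the numeric outputs are assumptions for illustration, not values from the paper.

```python
import math


def gan_objective(d_on_policy, d_on_expert):
    """Monte Carlo estimate of E_pi[log D] + E_expert[log(1 - D)].

    Each argument is a list of discriminator outputs D_w(s, a) in (0, 1),
    read as the probability that the pair came from the expert.
    """
    policy_term = sum(math.log(d) for d in d_on_policy) / len(d_on_policy)
    expert_term = sum(math.log(1.0 - d) for d in d_on_expert) / len(d_on_expert)
    return policy_term + expert_term


def policy_rewards(d_on_policy):
    """Per-sample signal log D_w(s, a) computed on the policy's own samples."""
    return [math.log(d) for d in d_on_policy]


# Hypothetical discriminator outputs, not taken from the paper.
undecided = gan_objective([0.5, 0.5], [0.5, 0.5])  # D cannot tell the sources apart
accurate = gan_objective([0.1, 0.2], [0.9, 0.8])   # D separates the sources well
print(undecided, accurate)
```

An accurate discriminator drives the value down, which is the $\min_w$ side; the policy can only raise it by producing pairs on which $D_w$ is close to 1.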

The generator, which is our policy $\pi_\theta$, is trained with the TRPO algorithm with cost function $\log(D_w(s,a))$, aiming to decrease it with the following gradient step:
$$\mathbb{E}_{(s,a)\sim \pi_\theta}[\nabla_\theta \log\pi_\theta(a\:|\:s)\log(D_w(s,a))]+\lambda_H\nabla_\theta H(\pi_\theta)$$
which corresponds to minimizing the previous equation with respect to the policy $\pi_\theta$.

##### Key Catch of the Model
In this work, the demonstrations come from a mixture $\pi_E$ of expert policies $\pi_{E_1},\ldots,\pi_{E_k}$, where $k$ can be unknown. The policy input is augmented with a latent intention variable $i$ drawn from a categorical or uniform distribution $p(i)$: the goal of this variable is to select a specific mode of the policy corresponding to one of the skills. The augmented policy can be expressed as
$$\pi(a\:|\:s,i) = p(i\:|\:s,a)\frac{\pi(a\:|\:s)}{p(i)}$$
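The relation above is Bayes' rule applied to the joint distribution of intention and action at a fixed state. The sketch below checks it on a toy discrete example; all probabilities and helper names are assumptions chosen for illustration.

```python
# Toy setting for one fixed state s: 2 intentions, 3 discrete actions.
# All numbers are illustrative assumptions.
P_I = [0.5, 0.5]                      # prior p(i)
PI_A_GIVEN_I = [[0.8, 0.1, 0.1],      # pi(a | s, i=0)
                [0.1, 0.1, 0.8]]      # pi(a | s, i=1)


def marginal_policy(pi_a_given_i, p_i):
    """pi(a | s) = sum_i pi(a | s, i) p(i)."""
    n_actions = len(pi_a_given_i[0])
    return [sum(p * row[a] for p, row in zip(p_i, pi_a_given_i))
            for a in range(n_actions)]


def intention_posterior(pi_a_given_i, p_i):
    """p(i | s, a) via Bayes' rule; the result is indexed as [i][a]."""
    pi_a = marginal_policy(pi_a_given_i, p_i)
    return [[row[a] * p / pi_a[a] for a in range(len(pi_a))]
            for p, row in zip(p_i, pi_a_given_i)]


def augmented_policy(posterior, pi_a, p_i):
    """pi(a | s, i) = p(i | s, a) * pi(a | s) / p(i)."""
    return [[posterior[i][a] * pi_a[a] / p_i[i] for a in range(len(pi_a))]
            for i in range(len(p_i))]


pi_a = marginal_policy(PI_A_GIVEN_I, P_I)
posterior = intention_posterior(PI_A_GIVEN_I, P_I)
rebuilt = augmented_policy(posterior, pi_a, P_I)
print(pi_a)
print(rebuilt)  # matches PI_A_GIVEN_I up to floating-point error
```

The check is circular by construction, since the posterior is itself derived from the conditional policy; its purpose is only to show how the three quantities fit together.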
The trajectory is also augmented to include the latent intention, giving $\tau_i=(s_0,a_0,i_0,\ldots,s_T,a_T,i_T)$. The reward becomes an augmented reward $R(s,a,i)$ that depends on the latent intention, and the trajectory reward is $$R(\tau_i) = \sum_{t=0}^T \gamma^t R(s_t,a_t,i_t)$$
The expected discounted reward is $$\mathbb{E}_{\pi_\theta}[R(\tau_i)] = \int R(\tau_i)\pi_\theta(\tau_i)\:d\tau_i$$ where $$\pi_\theta(\tau_i) = p_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}\:|\:s_t,a_t)\pi_\theta(a_t\:|\:s_t,i_t)p(i_t)$$

So we have an augmented policy $\pi(a\:|\:s,i)$ conditioned on the latent intention variable $i$, trained on demonstrations from a set of expert policies $\pi_E$, and we look for the maximum-entropy policy whose behavior is determined by the latent intention $i$. With $$\pi(a\:|\:s) = \sum_i \pi(a\:|\:s,i)p(i)$$ the original IRL problem becomes $$\min_R\left(\max_\pi H(\pi(a\:|\:s)) - H(\pi(a\:|\:s,i)) + \mathbb{E}_\pi[R(s,a,i)]\right) - \mathbb{E}_{\pi_E}[R(s,a,i)]$$
This reflects our goal of a multi-modal policy that has high entropy when no intention is given but collapses to a particular task when the intention is specified. The inner maximization looks for a policy $\pi$ whose entropy marginalized over intentions, $H(\pi(a\:|\:s))$, is high, whose entropy conditioned on the intention, $H(\pi(a\:|\:s,i))$, is low, and whose expected reward is high. The outer minimization looks for a reward function $R$ that gives high reward to the expert policies $\pi_E$ and low reward to that policy.
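The entropy difference $H(\pi(a\:|\:s)) - H(\pi(a\:|\:s,i))$ can be made tangible on a toy discrete policy at a fixed state. In the sketch below (all distributions are assumed values), the difference is positive when each intention selects a distinct mode and zero when all intentions produce the same behavior.

```python
import math


def entropy(dist):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def entropy_gap(pi_a_given_i, p_i):
    """H(pi(a|s)) - H(pi(a|s,i)) for one fixed state s."""
    n_actions = len(pi_a_given_i[0])
    marginal = [sum(p * row[a] for p, row in zip(p_i, pi_a_given_i))
                for a in range(n_actions)]
    # Conditional entropy: average over intentions of the per-intention entropy.
    conditional = sum(p * entropy(row) for p, row in zip(p_i, pi_a_given_i))
    return entropy(marginal) - conditional


p_i = [0.5, 0.5]
distinct_modes = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]   # each intention picks its own action
collapsed_modes = [[0.4, 0.2, 0.4], [0.4, 0.2, 0.4]]  # the intention is ignored
print(entropy_gap(distinct_modes, p_i))
print(entropy_gap(collapsed_modes, p_i))
```

This gap is the mutual information between action and intention at that state, which is why maximizing it discourages a policy that ignores $i$.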

As before, this results in the optimization objective of Generative Adversarial Imitation Learning, with the exception that the expert state-action pairs $(s,a)$ are sampled from the set of expert policies $\pi_E$:
$$\max_\theta\min_w \mathbb{E}_{i\sim p(i), (s,a)\sim \pi_\theta}[\log (D_w(s,a))] + \mathbb{E}_{(s,a)\sim\pi_E}[\log(1-D_w(s,a))] + \lambda_H H(\pi_\theta(a\:|\:s)) - \lambda_I H(\pi_\theta(a\:|\:s,i))$$
This again aims at a saddle point $(\theta, w)$ at which the discriminator network $D_w$ has difficulty telling samples from $\pi_\theta$ apart from samples from $\pi_E$. Samples from $\pi_\theta$ are labeled 0 and samples from $\pi_E$ are labeled 1, so $D_w$ outputs the probability that $(s,a)$ was sampled from $\pi_E$, with 1 meaning that $D_w$ is certain the sample comes from $\pi_E$. The maximization over $\theta$ pushes this probability up on the policy's own samples, i.e. it looks for a policy $\pi_\theta$ that confuses $D_w$, while the minimization over $w$ fits the discriminator to distinguish the two sources. In addition, we want a policy $\pi_\theta$ with maximum entropy when no latent intention is given and minimum entropy when one is given. $\lambda_H$ and $\lambda_I$ are weighting parameters on the respective entropy terms.

The entropy $H(\pi_\theta(a\:|\:s,i))$ can be expressed as:
$$H(\pi_\theta(a\:|\:s,i)) = \mathbb{E}_{i\sim p(i),(s,a)\sim \pi_\theta}[-\log(\pi_\theta(a\:|\:s,i))]=$$
$$=-\mathbb{E}_{i\sim p(i),(s,a)\sim \pi_\theta}\left[\log\left(p(i\:|\:s,a)\frac{\pi_\theta(a\:|\:s)}{p(i)}\right)\right] = $$
$$=-\mathbb{E}_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(p(i\:|\:s,a))] - \underset{=-H(\pi_\theta(a\:|\:s))}{\underbrace{\mathbb{E}_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(\pi_\theta(a\:|\:s))]}}+\underset{=-H(i)}{\underbrace{\mathbb{E}_{i\sim p(i)}[\log(p(i))]}}=$$
$$=-\mathbb{E}_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(p(i\:|\:s,a))]+H(\pi_\theta(a\:|\:s))-H(i)$$

This follows from the definition $H(P(x)) = \mathbb{E}[-\log(P(x))]$. Substituting it into the term $-\lambda_I H(\pi_\theta(a\:|\:s,i))$ gives the final objective:
$$\max_\theta\min_w \mathbb{E}_{i\sim p(i),(s,a)\sim \pi_\theta}[\log (D_w(s,a))] + \mathbb{E}_{(s,a)\sim\pi_E}[\log(1-D_w(s,a))] + (\lambda_H-\lambda_I)H(\pi_\theta(a\:|\:s)) + \lambda_I\mathbb{E}_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(p(i\:|\:s,a))]+\lambda_IH(i)$$
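The entropy decomposition behind this objective is the identity $H(\pi(a\:|\:s,i)) = -\mathbb{E}[\log p(i\:|\:s,a)] + H(\pi(a\:|\:s)) - H(i)$, whose negation multiplied by $\lambda_I$ produces the $+\lambda_I H(i)$ term. As a sanity check, the sketch below verifies the identity numerically on a toy discrete example at a fixed state; the probabilities are assumed values.

```python
import math

# Toy joint over intentions and actions for one fixed state s (assumed values).
p_i = [0.3, 0.7]                                      # prior p(i)
pi_a_given_i = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]     # pi(a | s, i)
n_i, n_a = len(p_i), len(pi_a_given_i[0])

# Marginal policy, joint distribution and intention posterior.
pi_a = [sum(p_i[i] * pi_a_given_i[i][a] for i in range(n_i)) for a in range(n_a)]
joint = [[p_i[i] * pi_a_given_i[i][a] for a in range(n_a)] for i in range(n_i)]
posterior = [[joint[i][a] / pi_a[a] for a in range(n_a)] for i in range(n_i)]

# Left-hand side: H(pi(a|s,i)) = E_{i,a}[-log pi(a|s,i)].
lhs = -sum(joint[i][a] * math.log(pi_a_given_i[i][a])
           for i in range(n_i) for a in range(n_a))

# Right-hand side: -E[log p(i|s,a)] + H(pi(a|s)) - H(i).
neg_log_posterior = -sum(joint[i][a] * math.log(posterior[i][a])
                         for i in range(n_i) for a in range(n_a))
h_marginal = -sum(p * math.log(p) for p in pi_a)
h_intention = -sum(p * math.log(p) for p in p_i)
rhs = neg_log_posterior + h_marginal - h_intention

print(lhs, rhs)  # the two values agree up to floating-point error
```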

To summarize the objective: the main aim of this model is to find a maximum-entropy multi-modal policy $\pi_\theta$ that is as similar as possible to the original mixture of expert policies $\pi_E$. To achieve this, the problem is modeled as an imitation learning GAN: $\pi_\theta$ is the generator, and a network $D_w$ is trained to distinguish between samples from $\pi_\theta$ and samples from $\pi_E$. We look for parameters $\theta$ that let $\pi_\theta$ confuse the discriminator $D_w$, and for parameters $w$ that let $D_w$ distinguish between the two sources. The maximization over $\theta$ also covers the remaining terms: the policy $\pi_\theta$ should have high entropy averaged over all the intentions $i$, and state-action pairs that make the latent intention $i$ easier to infer are rewarded.

$H(i)$ is a constant that doesn't influence the optimization problem, because the prior $p(i)$ is a fixed uniform or categorical distribution that depends on neither $\theta$ nor $w$. Overall, we are looking for a saddle point $(\theta, w)$ at which a maximum-entropy policy $\pi_\theta$ confuses a fairly good discriminator network $D_w$ using state-action pairs from which the latent intention $i$ can be inferred.

The reward function of the generator is 
$$\mathbb{E}_{i\sim p(i),(s,a)\sim \pi_\theta}[\log(D_w(s,a))] + \lambda_I\mathbb{E}_{i\sim p(i),(s,a)\sim\pi_\theta}[\log(p(i\:|\:s,a))]+\lambda_{H'}H(\pi_\theta(a\:|\:s))$$
where $\lambda_{H'} = \lambda_H - \lambda_I$, matching the coefficient of $H(\pi_\theta(a\:|\:s))$ in the final objective. It is maximized when $\log(D_w(s,a)) = 0 \Leftrightarrow D_w(s,a) = 1$, meaning that $D_w$ classifies the sample as coming from $\pi_E$; when $(s,a)$ makes the latent intention $i$ easy to infer; and when $\pi_\theta$ has high entropy averaged over all the intentions.
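A per-sample view of the first two terms of this reward can be sketched as follows. The entropy term is left out because it is a property of the whole policy rather than of a single sample. The function name, the `eps` clipping, and the assumption that an estimate of $p(i\:|\:s,a)$ is available as a plain number are all illustrative choices, not details from the text above.

```python
import math


def generator_reward(d_prob, posterior_prob, lambda_i, eps=1e-8):
    """log D_w(s, a) + lambda_I * log p(i | s, a) for one sample.

    d_prob:         discriminator output D_w(s, a), probability of "expert".
    posterior_prob: estimate of p(i | s, a) for the intention that was used.
    eps:            clipping added here as a numerical safeguard (assumption).
    """
    d_prob = min(max(d_prob, eps), 1.0)
    posterior_prob = min(max(posterior_prob, eps), 1.0)
    return math.log(d_prob) + lambda_i * math.log(posterior_prob)


# Hypothetical values: same discriminator output, different intention clarity.
best = generator_reward(1.0, 1.0, lambda_i=0.5)
ambiguous = generator_reward(0.9, 0.5, lambda_i=0.5)
clear = generator_reward(0.9, 0.95, lambda_i=0.5)
print(best, ambiguous, clear)
```

Both terms are at most zero, so the best attainable per-sample value is 0, reached when the sample looks like expert data and identifies its intention unambiguously.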

##### Key Empirical Result
The main result of this model is its ability to correctly segment the training demonstrations and, when given categorical latent intentions, to learn a multi-modal policy that distinguishes all the different skills.

In one of the experimental setups, a robotic arm with two degrees of freedom is able to reach up to four different targets autonomously from a random initial position. When the latent intention is omitted, the network isn't able to discover different skills and struggles to distinguish all the tasks correctly. With the latent intention, the network is often able to reach the targets without problems. The latent intention also encourages different behaviors for different intentions, resulting in the arm reaching the target with a different trajectory for each intention.

Another experimental setup includes an arm with a gripper at the end, with the double goal of reaching a target and then pushing it to another target. When presented with demonstrations of reaching the target and demonstrations of pushing a grasped target, the model is able to correctly segment the mix of expert policies into separate skills. The two subtasks start from different initial conditions, and the categorical intention is manually changed when the target is grasped. Changing the intention results in a pushing policy that brings the object to the designated target, highlighting the model's capacity to distinguish two different subtasks.

So this model learns the concept of intention and is able to perform different tasks based on an intention input, learning a multi-modal policy that is able to imitate all of the automatically segmented skills.

##### Comments
The main novelty of this model is its ability to include knowledge about different skills and tasks in a single policy, which is surely a requirement for future general-purpose systems and robots. It's also able to distinguish different skills automatically, based on unstructured and unlabeled examples, which makes training such systems much easier and opens access to far more data, possibly resulting in better trained models.

A weakness is highlighted in the paper: in one experimental setup, a humanoid robot is presented with three tasks. The humanoid robot is high-dimensional, with 16 degrees of freedom, so the tasks are much more difficult than the previous ones, and the policy is only able to mimic two of the three tasks with good results, achieving only suboptimal results in the third task. This is still better than a GAN without the latent intention information, which collapses to a unimodal policy that maps all the tasks to a single one out of the three.