### Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets

##### Introduction to the Problem
A key factor in the deployment of robots in the real world is their ability to learn from data. Robot learning frameworks include:
- Reinforcement Learning: a skill is learned based on the interaction of the robot with its environment
- Imitation Learning: the robot is presented with a demonstration of a skill that it should imitate

Imitation learning has mostly focused on isolated demonstrations of a particular skill, often provided through kinesthetic teaching.
To improve scalability, the framework discussed here learns to imitate skills from unstructured and unlabeled demonstrations of various tasks.

A complex activity is composed of a set of simpler skills. Learning from such data requires the ability to:
- map the image stream (video) to state-action pairs
- segment the data into simple skills
- imitate each of the segmented skills

The proposed imitation learning method learns a multi-modal stochastic policy that can imitate a number of automatically segmented tasks from a set of unlabeled, unstructured demonstrations.

Similar works assume a way of communicating a new task through a single demonstration, much like a prompt. The main difference is that here the goal is to separate the demonstrations into single skills and learn a policy that can imitate all of them, eliminating the need for new demonstrations at test time.

We have an MDP $$M = (S, A, P, R, p_0, \gamma, T)$$ where:

- $S$ is the state space
- $A$ is the action space
- $P:S\times A \times S \rightarrow \mathbb{R}^+$ is the state-transition probability function
- $R : S\times A\rightarrow \mathbb{R}$ is the reward function
- $p_0 : S \rightarrow \mathbb{R}^+$ is the initial state distribution
- $\gamma$ is the reward discount factor
- $T$ is the horizon


With $\tau = (s_0,a_0,\ldots,s_T,a_T)$ we denote a trajectory of states and actions and with $$R(\tau) = \sum_{t=0}^T\gamma^tR(s_t,a_t)$$ the trajectory reward.
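As a small illustration of the trajectory reward defined above, the following sketch sums discounted per-step rewards. The function name `trajectory_reward` and the reward values are made up for the example; they do not come from the paper.

```python
def trajectory_reward(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a single trajectory."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical per-step rewards R(s_t, a_t) for t = 0, 1, 2.
rewards = [1.0, 0.0, 2.0]
print(trajectory_reward(rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```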

##### Model Description

The goal of reinforcement learning is to find the parameters $\theta$ of a policy $\pi_\theta(a\:|\:s)$ that maximizes the expected discounted reward over trajectories induced by the policy. In imitation learning the reward function is unknown, but it can be recovered from the set of demonstrated trajectories: those trajectories originate from some policy $\pi_{E_1}$ that optimizes a reward function $R_{E_1}$, so we can estimate $R_{E_1}$ and optimize $\pi_\theta$ with respect to it, recovering the policy $\pi_{E_1}$. This approach is called Inverse Reinforcement Learning (IRL).

To model a variety of behaviors, it is preferable to find the policy $\pi_\theta$ with the highest entropy $H(\pi_\theta)$ among those that optimize $R_{E_1}$. This is Maximum-Entropy IRL, with the following optimization objective:
$$\min_R\left(\max_{\pi_\theta} H(\pi_\theta) + \mathbb{E}_{\pi_\theta}[R(s,a)]\right) - \mathbb{E}_{\pi_{E_1}}[R(s,a)]$$
This aims to find a reward function $R$ that assigns low reward to the maximum-entropy policy $\pi_\theta$ and high reward to the expert policy $\pi_{E_1}$. Optimizing a policy against such a reward function then recovers the expert behavior.

With multiple demonstrations sampled from a single expert policy $\pi_{E_1}$, this problem can be recast as the optimization of a GAN:
- The policy $\pi_\theta(a\:|\:s)$ is the generator
- $D_w(s,a)$ is the discriminator, parametrized by $w$, whose goal is to distinguish between imitated samples from $\pi_\theta$ (labeled 0) and demonstrated samples from $\pi_{E_1}$ (labeled 1)

The joint optimization is defined as:
$$\max_\theta \min_w \underset{(s,a)\sim \pi_\theta}{\mathbb{E}}[\log(D_w(s,a))] + \underset{(s,a)\sim \pi_{E_1}}{\mathbb{E}}[\log(1-D_w(s,a))] + \lambda_H H(\pi_\theta)$$
This means finding a saddle point $(\theta, w) = (\pi_\theta, D_w)$ where the policy $\pi_\theta$ makes it hard for the discriminator network $D_w$ to distinguish between samples drawn from it and samples drawn from the expert policy $\pi_{E_1}$.

The discriminator is trained on a mixed set of expert and generator samples, and outputs the probability that a given sample originated from the expert policy. This output is used as a reward signal for the generator, which tries to behave as similarly as possible to the expert policy.
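To make the roles of the two players concrete, here is a minimal sketch that evaluates the two expectation terms of the joint objective from a handful of discriminator outputs. The entropy term is left out, and the function name `gan_objective` and all sample values are hypothetical.

```python
import math


def gan_objective(d_policy, d_expert):
    """Sample estimate of E_pi[log D] + E_expert[log(1 - D)].

    d_policy: discriminator outputs D_w(s, a) on samples from the policy.
    d_expert: discriminator outputs D_w(s, a) on samples from the expert.
    """
    policy_term = sum(math.log(d) for d in d_policy) / len(d_policy)
    expert_term = sum(math.log(1.0 - d) for d in d_expert) / len(d_expert)
    return policy_term + expert_term


# A confused discriminator outputs 0.5 for every sample.
confused = gan_objective([0.5, 0.5], [0.5, 0.5])
# A sharp discriminator outputs values near 0 on policy samples
# and near 1 on expert samples.
sharp = gan_objective([0.1, 0.2], [0.9, 0.8])

print(confused, sharp)
```

Under these made-up numbers the sharp discriminator reaches a lower objective value than the confused one, which matches the $\min_w$ role of the discriminator, while the policy's $\max_\theta$ role pushes toward the confused case.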

The generator, which is our policy $\pi_\theta$, is trained with the TRPO algorithm, using $\log(D_w(s,a))$ as the learning signal. Each iteration takes the following gradient step:
$$\underset{(s,a)\sim \pi_\theta}{\mathbb{E}}[\nabla_\theta \log\pi_\theta(a\:|\:s)\log(D_w(s,a))]+\lambda_H\nabla_\theta H(\pi_\theta)$$
which corresponds to optimizing the previous objective with respect to the policy $\pi_\theta$.
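The expectation term of this gradient can be illustrated in a toy setting. The sketch below assumes a single state, three discrete actions, and a softmax policy, and it computes the expectation exactly instead of sampling. It drops the entropy term and uses a plain gradient step rather than TRPO's trust-region update, so it is a simplification and not the training procedure of the paper. All names and numbers are hypothetical.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient(theta, log_d):
    """Exact E_{a ~ pi}[grad_theta log pi(a) * log_d[a]] for a softmax policy."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, p_a in enumerate(pi):
        for j in range(len(theta)):
            # d/d theta_j of log pi(a) equals 1[a == j] - pi(j).
            score = (1.0 if a == j else 0.0) - pi[j]
            grad[j] += p_a * score * log_d[a]
    return grad


theta = [0.0, 0.0, 0.0]  # uniform initial policy
# Hypothetical discriminator outputs D(a) per action; action 1 looks most expert-like.
log_d = [math.log(0.2), math.log(0.9), math.log(0.4)]

grad = policy_gradient(theta, log_d)
new_theta = [t + 0.5 * g for t, g in zip(theta, grad)]  # one plain gradient step

print(softmax(theta)[1], softmax(new_theta)[1])
```

In this toy case, stepping along the gradient raises the probability of the action that the discriminator rates as most expert-like.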

##### Key Catch of the Model
In this work, the demonstrations come from a mixture $\pi_E$ of policies $\pi_{E_1},\ldots,\pi_{E_k}$, where $k$ can be unknown. The policy input is augmented with a latent intention variable $i$ drawn from a categorical or uniform distribution $p(i)$: the goal of this variable is to select a specific mode of the policy corresponding to one of the skills. The augmented policy can be expressed as
$$\pi(a\:|\:s,i) = p(i\:|\:s,a)\frac{\pi(a\:|\:s)}{p(i)}$$
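The relation above is Bayes' rule, which can be checked numerically on a small discrete example. The sketch assumes one fixed state, two actions, and two intentions; the probability tables are invented for illustration and chosen so that they are mutually consistent.

```python
# Prior over two intentions, p(i).
p_i = [0.5, 0.5]
# Intention-free policy pi(a | s) for two actions in one fixed state s.
pi_a = [0.5, 0.5]
# Posterior p(i | s, a): rows are actions a, columns are intentions i.
p_i_given_a = [[0.9, 0.1],
               [0.1, 0.9]]


def conditional_policy(i):
    """pi(a | s, i) = p(i | s, a) * pi(a | s) / p(i), as a list over actions."""
    return [p_i_given_a[a][i] * pi_a[a] / p_i[i] for a in range(len(pi_a))]


pi_given_i = [conditional_policy(i) for i in range(len(p_i))]

# Mixing the conditional policies with p(i) should give back pi(a | s).
marginal = [
    sum(pi_given_i[i][a] * p_i[i] for i in range(len(p_i)))
    for a in range(len(pi_a))
]

print(pi_given_i)
print(marginal)
```

Each intention selects a different dominant action, while the mixture over intentions reproduces the original intention-free policy.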
The trajectory is also augmented to include the latent intention, giving $\tau_i=(s_0,a_0,i_0,\ldots,s_T,a_T,i_T)$. This results in an augmented reward $R(s,a,i)$ that depends on the latent intention, and in a discounted trajectory reward $$R(\tau_i) = \sum_{t=0}^T \gamma^t R(s_t,a_t,i_t)$$
The expected discounted reward is $$\mathbb{E}_{\pi_\theta}[R(\tau_i)] = \int R(\tau_i)\pi_\theta(\tau_i)\:d\tau_i$$ where the trajectory distribution is $$\pi_\theta(\tau_i) = p_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1}\:|\:s_t,a_t)\pi_\theta(a_t\:|\:s_t,i_t)p(i_t)$$

So we have an augmented policy $\pi(a\:|\:s,i)$ with the latent intention variable $i$, trained on demonstrations from a mixture of expert policies $\pi_E$, and we look for the maximum-entropy policy that can be selected through the latent intention $i$. With $$\pi(a\:|\:s) = \sum_i \pi(a\:|\:s,i)p(i)$$ we transform the original IRL problem into $$\min_R\left(\max_\pi H(\pi(a\:|\:s)) - H(\pi(a\:|\:s,i)) + \mathbb{E}_\pi[R(s,a,i)]\right) - \mathbb{E}_{\pi_E}[R(s,a,i)]$$
This reflects the goal of a multi-modal policy that has high entropy when no intention is given, but collapses to a particular task when the intention is specified. In the inner maximization, the policy is rewarded for high entropy of the marginal $\pi(a\:|\:s)$, so that it covers all skills, for low entropy of the conditional $\pi(a\:|\:s,i)$, so that it commits to one skill per intention, and for high expected reward. As in the single-expert case, the outer minimization over $R$ looks for a reward function that scores the expert mixture $\pi_E$ higher than the learned policy.
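The entropy difference in this objective can be made tangible with a toy policy. The sketch below assumes one fixed state, two actions, and two equally likely intentions, and it reads $H(\pi(a\:|\:s,i))$ as the conditional entropy averaged over $p(i)$, which is an interpretation made for this example. The probability values are invented.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


p_i = [0.5, 0.5]  # prior over two intentions
# pi(a | s, i): each intention commits almost fully to one action.
pi_given_i = [[0.99, 0.01],
              [0.01, 0.99]]

# Marginal policy pi(a | s) = sum_i pi(a | s, i) p(i).
marginal = [
    sum(p_i[i] * pi_given_i[i][a] for i in range(2))
    for a in range(2)
]

h_marginal = entropy(marginal)  # high: both skills are covered
h_conditional = sum(p_i[i] * entropy(pi_given_i[i]) for i in range(2))  # low
gap = h_marginal - h_conditional

print(h_marginal, h_conditional, gap)
```

With this reading, the gap equals the mutual information between the action and the intention in the given state, so a large gap means the intention strongly determines which skill is executed.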

##### Key Empirical Result

##### Comments