# Notes

They are specifically concerned with imitation learning cases where the model has to learn from expert demonstrations, can’t query the expert after initial data collection, and doesn’t have any other form of reinforcement learning signal.

> Behavioral cloning, while appealingly simple, only tends to succeed with large amounts of data, due to compounding error caused by covariate shift.

IRL (inverse reinforcement learning) avoids this problem by learning a cost function from the expert behavior, which lets the model infer what the expert would do in out-of-distribution states.

> Unfortunately, many IRL algorithms are extremely expensive to run, requiring reinforcement learning in an inner loop. Scaling IRL methods to large environments has thus been the focus of much recent work.

Each IRL iteration runs a full RL training loop against the newly inferred cost function, which is what makes it computationally expensive.

> Our characterization introduces a framework for directly learning policies from data, bypassing any intermediate IRL step.

> We find that [our algorithm] outperforms competing methods by a wide margin in training policies for complex, high-dimensional physics-based control tasks over various amounts of expert data.

### Background

This paper is built on maximum causal entropy IRL with the following optimization problem:

$$
\underset{c \in C}{\textrm{maximize}} \left( \underset{\pi\in\Pi}{\min} -H(\pi) + \mathbb{E}_\pi[c(s,a)]  \right) - \mathbb{E}_{\pi_E}[c(s, a)]
$$

This optimization finds the cost function $c \in C$ that assigns low cost to the expert policy and high cost to other policies. In other words, it maximizes the gap between the cost of the best entropy-regularized policy and the cost of the expert.

The inner minimization finds the policy $\pi \in \Pi$ with the best tradeoff between minimizing cost and maximizing entropy (for the sake of exploration) given a cost function $c$. The outer maximization then picks the cost function under which even this best policy does the worst relative to the expert.
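The inner minimization can be made concrete on a small tabular MDP. The sketch below is not from the paper: it assumes a hypothetical transition tensor `P[s, a, s']` and cost table `c[s, a]`, treats the entropy as discounted causal entropy, and uses soft value iteration, a standard way to solve entropy-regularized RL.

```python
import numpy as np

def _logsumexp(Q):
    # Numerically stable log-sum-exp over the action axis
    m = Q.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True)))[:, 0]

def soft_value_iteration(P, c, gamma=0.9, iters=500):
    """Approximately solve min_pi -H(pi) + E_pi[c(s, a)] on a tabular MDP.

    P: (S, A, S) transition probabilities, c: (S, A) costs (hypothetical inputs).
    Returns a stochastic policy of shape (S, A).
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = -c + gamma * (P @ V)   # soft Q-values, with reward = -cost
        V = _logsumexp(Q)          # soft maximum over actions
    Q = -c + gamma * (P @ V)
    return np.exp(Q - _logsumexp(Q)[:, None])

# Toy example: every action leads to either state with probability 0.5
P = np.full((2, 2, 2), 0.5)
c = np.array([[0.0, 2.0], [1.0, 0.0]])
pi = soft_value_iteration(P, c)
print(pi)  # cheaper actions get more probability, but none gets zero
```

The entropy term is what keeps the resulting policy stochastic; with zero cost everywhere it reduces to the uniform policy.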

### Characterizing the induced optimal policy

Their goal is to find an imitation learning algorithm that doesn’t need the IRL step of inferring a cost function and training an RL policy many times, and that works well in large environments.

Using expressive cost functions is important to make IRL work properly (the cost function needs to be able to express all the necessary information). Neural networks are a common choice here.

They consider the broadest possible class of cost functions, $\mathcal{C} = \mathbb{R}^{\mathcal{S} \times \mathcal{A}} = \{c: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}\}$.

Since a class this large can easily overfit a finite dataset, they add a convex cost regularizer $\psi: \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \rightarrow \overline{\mathbb{R}}$:

$$
\textrm{IRL}_\psi(\pi_E) = \underset{c \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}}{\arg \max} -\psi(c) + \left( \underset{\pi\in\Pi}{\min} -H(\pi) + \mathbb{E}_\pi[c(s,a)]  \right) - \mathbb{E}_{\pi_E}[c(s, a)]
$$

Then we can look at the characteristics of the specific output policy given by the RL algorithm.

For any policy $\pi \in \Pi$, its occupancy measure $\rho_\pi: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is defined by $\rho_\pi(s, a) = \pi(a|s) \sum_{t=0}^\infty \gamma^t P(s_t = s|\pi)$.

This gives the (unnormalized) discounted distribution of state-action pairs that an agent would encounter when using policy $\pi$ to navigate the environment over infinite trajectories.
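A minimal NumPy sketch of this definition for a small tabular MDP. The transition tensor `P[s, a, s']`, policy table `pi[s, a]`, and initial state distribution `mu0` are hypothetical names, not from the paper. Instead of summing over time steps, it solves the equivalent linear system for the discounted state visitation.

```python
import numpy as np

def occupancy_measure(P, pi, mu0, gamma=0.9):
    """rho(s, a) = pi(a|s) * sum_t gamma^t Pr(s_t = s | pi) on a tabular MDP.

    P: (S, A, S) transitions, pi: (S, A) policy, mu0: (S,) initial distribution.
    """
    S = P.shape[0]
    # State-to-state transitions under pi: P_pi[s, t] = sum_a pi(a|s) P(t|s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Discounted state visitation d satisfies d = mu0 + gamma * P_pi^T d
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return pi * d[:, None]

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
rho = occupancy_measure(P, pi, mu0=np.array([1.0, 0.0]))
print(rho)        # discounted visitation of each (state, action) pair
print(rho.sum())  # total mass is 1 / (1 - gamma), so rho is not normalized
```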

Importantly, there’s a one-to-one mapping from an occupancy measure to a policy. Policy $\pi_\rho$ is the only policy that has occupancy measure $\rho$.
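Concretely, the policy is recovered from a valid occupancy measure by normalizing over actions in each state:

$$
\pi_\rho(a|s) = \frac{\rho(s, a)}{\sum_{a'} \rho(s, a')}
$$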

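They then show that the composition of RL and IRL (running RL on the cost function that regularized IRL returns) can be described by the following optimization, where $\psi^*$ is the convex conjugate of $\psi$: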

$$
\arg \min_{\pi \in \Pi} -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E})
$$

This optimization can be viewed as trying to bring the occupancy measure of the learned policy as close as possible to the expert's, as measured by $\psi^*$, while keeping entropy high.

If the convex regularization function were constant (no regularization), the model would have to match the observed expert occupancy measure exactly. With finite samples in a sufficiently complex environment, that observed measure is nowhere close to the real occupancy measure.
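To see why, write out the convex conjugate $\psi^*$. For a constant regularizer $\psi \equiv k$, the conjugate is finite only at zero, so the objective only accepts policies with $\rho_\pi = \rho_{\pi_E}$:

$$
\psi^*(x) = \sup_{y \in \mathbb{R}^{\mathcal{S}\times\mathcal{A}}} x^\top y - \psi(y), \qquad \psi \equiv k \;\Rightarrow\; \psi^*(x) = \begin{cases} -k & \text{if } x = 0 \\ +\infty & \text{otherwise} \end{cases}
$$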

This is why the regularizer is necessary.

Then, we can switch our framing of IRL.

> IRL is traditionally defined as the act of finding a cost function such that the expert policy is uniquely optimal, but now, we can alternatively view IRL as a procedure that tries to induce a policy that matches the expert’s occupancy measure.

### Practical occupancy measure matching

In practice, constant regularizer functions are impractical since the model has to learn from a finite set of expert samples.

This requires a relaxation on the occupancy measure matching:

$$
\textrm{minimize}_\pi d_\psi(\rho_\pi, \rho_E) - H(\pi)
$$

Here $d_\psi$ smoothly penalizes differences between the two occupancy measures.

**1. Entropy-regularized apprenticeship learning**

> It turns out that with certain settings of $\psi$, the above equation takes on the form of regularized variants of existing apprenticeship learning algorithms, which indeed do scale to large environments with parameterized policies.

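Since many IRL approaches restrict the cost function to a linear combination of features, they show that these apprenticeship learning algorithms are equivalent to running the optimization above with the indicator regularizer $\psi = \delta_\mathcal{C}$, where $\delta_\mathcal{C}(c) = 0$ if $c \in \mathcal{C}$ and $+\infty$ otherwise, and $\mathcal{C}$ is the restricted class of linear cost functions.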

This feature matching approach often doesn’t allow expert policies to be accurately recovered.

> We can understand exactly why apprenticeship learning may fail to imitate: it forces $\pi_E$ to be encoded as an element of $\mathcal{C}$. If $\mathcal{C}$ does not include a cost function that explains expert behavior well, then attempting to recover a policy from such an encoding will not succeed.

By forcing the cost function to be a linear combination of the features, feature expectation matching (FEM) limits recovery of the actual occupancy measures.
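A small sketch of that limitation, assuming linear costs $c_w(s,a) = w \cdot f(s,a)$ with $\|w\|_2 \le 1$. In that case the worst-case cost gap reduces to the Euclidean norm of the difference in feature expectations. The feature tensor and the two occupancy tables are hand-made illustrations, not data from the paper.

```python
import numpy as np

def fem_distance(rho_pi, rho_E, features):
    """max over ||w||_2 <= 1 of sum_{s,a} (rho_pi - rho_E)(s, a) * w . f(s, a).

    rho_pi, rho_E: (S, A) occupancy tables; features: (S, A, K) feature vectors.
    The maximum equals the norm of the feature-expectation gap.
    """
    gap = np.einsum("sa,sak->k", rho_pi - rho_E, features)
    return np.linalg.norm(gap)

# Features that identify the state but ignore the action
features = np.zeros((2, 2, 2))
features[0, :, 0] = 1.0
features[1, :, 1] = 1.0

rho_E = np.array([[0.5, 0.0], [0.0, 0.5]])   # expert always takes one action per state
rho_pi = np.array([[0.0, 0.5], [0.5, 0.0]])  # learner always takes the other action

print(fem_distance(rho_pi, rho_E, features))  # 0.0 even though the tables differ
```

With these features, no cost function in the class can tell the two behaviors apart, so matching feature expectations does not recover the expert.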

> Use of these linear cost function classes, however, limits their approach to settings in which expert behavior is well-described by such classes.

### Generative adversarial imitation learning

They select the following regularizer $\psi_{GA}$, whose convex conjugate evaluated at $\rho_\pi - \rho_{\pi_E}$ equals the Jensen-Shannon divergence between the two occupancy measures (up to a constant shift and scaling).

![Screenshot 2024-11-06 at 6.38.42 PM.png](../../images/Screenshot_2024-11-06_at_6.38.42_PM.png)

$$
\underset{\pi}{\textrm{minimize}} \: \psi^*_{GA}(\rho_\pi - \rho_{\pi_E}) - \lambda H(\pi) = D_{JS}(\rho_\pi, \rho_{\pi_E}) - \lambda H(\pi)
$$

> [This optimization] finds a policy whose occupancy measure minimizes Jensen-Shannon divergence to the expert’s. It minimizes a true metric between occupancy measures, so, unlike linear apprenticeship learning algorithms, it can imitate expert policies exactly.

The Jensen-Shannon divergence is a squared metric between distributions.
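For reference, a minimal sketch of the Jensen-Shannon divergence between two occupancy tables after normalizing them to probability distributions. It uses the common convention with 1/2 weights on each KL term; definitions that drop those weights differ only by a constant factor. The example tables are made up.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two nonnegative tables."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    p, q = p / p.sum(), q / q.sum()   # normalize to probability distributions
    m = 0.5 * (p + q)                 # mixture distribution

    def kl(a, b):
        mask = a > 0                  # terms with a = 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rho_pi = np.array([[0.4, 0.1], [0.3, 0.2]])
rho_E = np.array([[0.1, 0.4], [0.2, 0.3]])
print(js_divergence(rho_pi, rho_E))  # positive, and symmetric in its arguments
print(js_divergence(rho_E, rho_E))   # zero when the two measures coincide
```

Unlike the feature matching distance, this value is zero only when the two distributions are identical.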

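This objective resembles a GAN. The learner's occupancy measure $\rho_\pi$ plays the role of the generator's distribution and the expert's occupancy measure $\rho_{\pi_E}$ plays the role of the true data distribution. The $D_{JS}$ term measures the difference between the two, which is the quantity a GAN discriminator is trained to estimate.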

Then, they look for a saddle point $(\pi, D)$ of the following expression, where the discriminator $D$ is trained to tell the policy's state-action pairs apart from the expert's:

$$
\mathbb{E}_\pi[\log(D(s, a))] + \mathbb{E}_{\pi_E}[\log(1-D(s,a))] - \lambda H(\pi)
$$

They use a parameterized policy $\pi_\theta$ and a discriminator network $D_w: \mathcal{S} \times \mathcal{A} \rightarrow (0,1)$. Then they alternate between an Adam gradient step on $w$, which increases the expression above, and a TRPO step on $\theta$, which decreases it, until the model converges.
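The discriminator half of this alternation can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it assumes a logistic discriminator $D_w(x) = \sigma(w \cdot x)$ over hypothetical state-action feature vectors, uses plain gradient ascent in place of Adam, and leaves out the TRPO policy step entirely.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, x_pi, x_E):
    """E_pi[log D(s, a)] + E_piE[log(1 - D(s, a))] with D_w(x) = sigmoid(w . x)."""
    return np.mean(np.log(sigmoid(x_pi @ w))) + np.mean(np.log(1.0 - sigmoid(x_E @ w)))

def discriminator_step(w, x_pi, x_E, lr=0.1):
    """One gradient ascent step on w (stand-in for the Adam step)."""
    d_pi, d_E = sigmoid(x_pi @ w), sigmoid(x_E @ w)
    grad = ((1.0 - d_pi)[:, None] * x_pi).mean(axis=0) - (d_E[:, None] * x_E).mean(axis=0)
    return w + lr * grad

rng = np.random.default_rng(0)
x_pi = rng.normal(loc=+1.0, size=(256, 3))  # made-up features of policy samples
x_E = rng.normal(loc=-1.0, size=(256, 3))   # made-up features of expert samples

w = np.zeros(3)
before = objective(w, x_pi, x_E)
for _ in range(100):
    w = discriminator_step(w, x_pi, x_E)
after = objective(w, x_pi, x_E)
print(before, after)  # the discriminator objective increases

# The policy update would then treat log D_w(s, a) as the cost for a TRPO step (not shown).
```

After these steps $D_w$ outputs values closer to 1 on policy samples and closer to 0 on expert samples, so minimizing $\log D_w(s, a)$ pushes the policy toward state-action pairs that look like the expert's.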

![Screenshot 2024-11-06 at 6.43.00 PM.png](../../images/Screenshot_2024-11-06_at_6.43.00_PM.png)

### Experiments

They test on low-level control tasks from traditional RL and difficult high-dimensional tasks like 3D humanoid locomotion.

![Screenshot 2024-11-06 at 4.55.37 PM.png](../../images/Screenshot_2024-11-06_at_4.55.37_PM.png)

They tested their algorithm against behavioral cloning, feature expectation matching (FEM), and game-theoretic apprenticeship learning (GTAL).

> Our algorithm almost always achieved at least 70% of expert performance for all dataset sizes we tested, nearly always dominating all the baselines.

### Discussion

> As we demonstrated, our method is generally quite sample efficient in terms of expert data.

The method is sample efficient in terms of expert data, but not in terms of environment interaction during training.

> Our approach does not interact with the expert during training

> We believe that we could significantly improve learning speed for our algorithm by initializing policy parameters with behavioral cloning, which requires no environment interaction at all.

Cool concept. They suggest initializing the policy weights with behavioral cloning, which is far less effective but much cheaper to run, to speed up early training of their more effective algorithm.
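A rough sketch of what that initialization could look like, assuming a linear-softmax policy over discrete actions fitted by maximum likelihood on expert state-action pairs. The policy class, the synthetic expert, and the function name `bc_pretrain` are all assumptions for illustration; the paper only proposes the idea.

```python
import numpy as np

def bc_pretrain(states, actions, n_actions, lr=0.5, epochs=200):
    """Fit a linear-softmax policy to expert (state, action) pairs."""
    theta = np.zeros((states.shape[1], n_actions))
    onehot = np.eye(n_actions)[actions]
    for _ in range(epochs):
        logits = states @ theta
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean negative log-likelihood with respect to theta
        grad = states.T @ (probs - onehot) / len(states)
        theta -= lr * grad
    return theta

# Synthetic expert: picks action 1 whenever the first state feature is positive
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 2))
actions = (states[:, 0] > 0).astype(int)

theta0 = bc_pretrain(states, actions, n_actions=2)
accuracy = np.mean((states @ theta0).argmax(axis=1) == actions)
print(accuracy)  # agreement with the expert on the demonstration states
```

The resulting `theta0` would serve as the starting point for the policy parameters before adversarial training begins, and fitting it needs no environment interaction.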