# Lecture 7 - Imitation Learning

provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)

---

<div class="alert alert-block alert-info">
Table of Contents: <br>
    
<ul>
    <li>1. <a href="#1.-Introduction">Introduction</a></li>
    <li>2. <a href="#2.-Problem-Setup">Problem Setup</a></li>
    <li>3. <a href="#3.-Behavioral-Cloning">Behavioral Cloning</a></li>
    <li>4. <a href="#4.-Inverse-RL">Inverse RL</a></li>
    <li>5. <a href="#5.-Apprenticeship-Learning">Apprenticeship Learning</a></li>
    <li>6. <a href="#6.-Resource">Resource</a></li>
</ul>
</div>

# 1. Introduction

Some environments have rewards so sparse that even a DQN struggles to succeed in them. An example is Montezuma's Revenge, a game in which a character navigates a 2D world in order to find a key and open a door.

RL works well when data is simple and cheap to collect and parallelization is easy. However, it is impractical when executing actions is slow, failure is expensive, or safety is a priority.

Problems with RL:
* needs lots of data
* needs a lot of time
* sparse rewards
* hard to learn 
* execution of actions is slow
* very expensive to fail
* not safe 

__Imitation Learning__:
* learn from imitating behavior
* rewards are dense in time to closely guide the agent

# 2. Problem Setup

Input:
* state space, action space
* Transition model $P(s' ~|~ s, a)$
* No reward function $R$
* Set of one or more teacher demonstrations $(s_{0}, a_{0}, s_{1}, a_{1}, \ldots)$, with actions drawn from the teacher's policy $\pi^{*}$

__Behavioral Cloning__:
* Can we directly learn the teacher's policy using supervised learning?

__Inverse RL__:
* Can we recover $R$?

__Apprenticeship Learning via Inverse RL__:
* Can we use the R we find in Inverse RL to generate a good policy?

# 3. Behavioral Cloning

Behavioral Cloning:
* as soon as the learned policy deviates from the teacher's behavior, it lands in states the demonstrations never covered and has no idea what to do, so errors compound
* works fine as long as the data covers all the states the learner will encounter
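A minimal sketch of behavioral cloning as supervised learning, assuming a tiny discrete state space so that the "classifier" can be a lookup table. The demonstration data and the helper `fit_behavioral_cloning` are hypothetical and only illustrate the coverage problem described above.

```python
from collections import Counter, defaultdict

def fit_behavioral_cloning(demonstrations):
    """Fit a tabular policy: in each state, copy the expert's most frequent action."""
    counts = defaultdict(Counter)
    for trajectory in demonstrations:
        for state, action in trajectory:
            counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

# Hypothetical demonstrations on a 1-D chain; the expert always moves right (+1).
demos = [[(0, +1), (1, +1), (2, +1)], [(1, +1), (2, +1), (3, +1)]]
policy = fit_behavioral_cloning(demos)

print(policy[2])      # a state covered by the demonstrations
print(policy.get(7))  # a state the expert never visited: the policy has no answer
```

A function approximator would still output some action in state 7, but nothing in the data constrains what that action is.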

Initialize $D \leftarrow \emptyset$ <br>
Initialize $\hat{\pi}_{1}$ to any policy in $\Pi$ <br>
for $i = 1$ to $N$ do <br>
$\quad$ Let $\pi_{i} = \beta_{i}\pi^{*} + (1 - \beta_{i})\hat{\pi}_{i}$ <br>
$\quad$ Sample $T$-step trajectories using $\pi_{i}$ <br>
$\quad$ Get dataset $D_{i} = \{(s, \pi^{*}(s))\}$ of visited states by $\pi_{i}$ and actions given by expert <br>
$\quad$ Aggregate datasets: $D \leftarrow D \cup D_{i}$ <br>
$\quad$ Train classifier $\hat{\pi}_{i + 1}$ on $D$. <br>

Return best $\hat{\pi}_{i}$ on validation 
<br><br>

_Algorithm 1. DAGGER: Dataset Aggregation._

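The basic principle behind __DAGGER__ for behavioral cloning is that you continually build up the dataset: run the current policy, have the expert label the states it visits, add those pairs to the dataset, retrain, and repeat.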
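Algorithm 1 can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the noisy 1-D chain environment, the hand-written `expert`, the tabular majority-vote "classifier", and the schedule $\beta_{1} = 1$, $\beta_{i} = 0$ afterwards. The sketch also returns the last learner instead of selecting the best one on validation data.

```python
import random
from collections import Counter, defaultdict

STATES = range(-5, 6)

def expert(s):
    """Hypothetical expert pi*: always step toward the origin."""
    return -1 if s > 0 else (1 if s < 0 else 0)

def step(s, a, rng):
    """Noisy chain dynamics: 20% of the time the move is replaced by a random one."""
    if rng.random() < 0.2:
        a = rng.choice([-1, 0, 1])
    return max(-5, min(5, s + a))

def train(dataset):
    """Tabular 'classifier': majority expert label per state, 0 in unseen states."""
    counts = defaultdict(Counter)
    for s, a in dataset:
        counts[s][a] += 1
    table = {s: c.most_common(1)[0][0] for s, c in counts.items()}
    return lambda s: table.get(s, 0)

def dagger(n_iters=5, n_traj=20, horizon=15, seed=0):
    rng = random.Random(seed)
    dataset = []                       # D <- empty set
    learner = lambda s: 0              # pi_hat_1: an arbitrary initial policy
    for i in range(1, n_iters + 1):
        beta = 1.0 if i == 1 else 0.0  # assumed schedule for beta_i
        for _ in range(n_traj):
            s = rng.choice(list(STATES))
            for _ in range(horizon):
                # pi_i acts: expert with probability beta, learner otherwise
                a = expert(s) if rng.random() < beta else learner(s)
                dataset.append((s, expert(s)))  # visited state, expert's label
                s = step(s, a, rng)
        learner = train(dataset)       # train pi_hat_{i+1} on the aggregated D
    return learner, dataset

learner, dataset = dagger()
agreement = sum(learner(s) == expert(s) for s in STATES) / len(STATES)
print(f"agreement with expert over all states: {agreement:.2f}")
```

The key difference from plain behavioral cloning is that the states in `dataset` come from the learner's own rollouts after the first iteration, while the labels always come from the expert.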

# 4. Inverse RL

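We have to estimate the reward function $R$ from the demonstrations. There is no unique $R$ for a given set of data: for example, $R(s) = 0$ for every state makes any policy optimal. We assume the reward is linear in the state features: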

$R(s) = \textbf{w}^{T}x(s)$ where $w \in \mathbb{R}^{n}$, $x : S \rightarrow \mathbb{R}^{n} \hspace{1em} (Eq.~1)$

The value function for a policy $\pi$ is now:

$$
\begin{equation}
	\begin{split}
V^{\pi} & \underset{s \thicksim \pi}{=} \mathbb{E}[\sum_{t = 0}^{\infty}\gamma^{t}R(s_{t}) ~|~ \pi]\\
    & = \mathbb{E}[\sum_{t = 0}^{\infty}\gamma^{t}\textbf{w}^{T}x(s_{t}) ~|~ \pi]\\
    & = \textbf{w}^{T} \mathbb{E}[\sum_{t = 0}^{\infty}\gamma^{t}x(s_{t}) ~|~ \pi]\\
    & = \textbf{w}^{T} \mu(\pi)\\
    \end{split}
\end{equation}
\hspace{1em} (Eq.~2)\\
$$

$\mu(\pi) \in \mathbb{R}^{n}$ is the discounted weighted frequency of the state features $x(s)$ under policy $\pi$.
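Eq. 2 can be checked numerically. The sketch below estimates $\mu(\pi)$ by averaging over sampled state sequences and then evaluates $\textbf{w}^{T}\mu(\pi)$. The three-state example, the one-hot feature map `one_hot`, the weights, and the two trajectories are all made up for illustration.

```python
def feature_expectations(trajectories, features, gamma):
    """Monte Carlo estimate of mu(pi) = E[sum_t gamma^t x(s_t) | pi]."""
    n = len(features(trajectories[0][0]))
    mu = [0.0] * n
    for states in trajectories:
        for t, s in enumerate(states):
            x = features(s)
            for k in range(n):
                mu[k] += (gamma ** t) * x[k]
    return [m / len(trajectories) for m in mu]

def value(w, mu):
    """V^pi = w^T mu(pi) under the linear reward assumption of Eq. 1."""
    return sum(wk * mk for wk, mk in zip(w, mu))

def one_hot(s):
    """Hypothetical feature map x(s) for a 3-state problem."""
    return [1.0 if s == k else 0.0 for k in range(3)]

trajs = [[0, 1, 2], [0, 2, 2]]  # state sequences assumed to be sampled from pi
w = [0.0, 1.0, 2.0]             # assumed reward weights, so R(s) = s here
mu = feature_expectations(trajs, one_hot, gamma=0.5)

print(mu)            # [1.0, 0.25, 0.5]
print(value(w, mu))  # 1.25
```

With one-hot features, each entry of $\mu(\pi)$ is literally the discounted visitation frequency of one state, and 1.25 matches the average discounted return of the two trajectories (1.0 and 1.5).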

# 5. Apprenticeship Learning

$$
V^{\pi} = \textbf{w}^{T} \mu(\pi)
$$

$$
\mathbb{E}[\sum_{t = 0}^{\infty}\gamma^{t}R^{*}(s_{t}) ~|~ \pi^{*}] = V^{*} \ge V^{\pi} = \mathbb{E}[\sum_{t = 0}^{\infty}\gamma^{t}R^{*}(s_{t}) ~|~ \pi] \hspace{1em} (Eq.~3)\\
w^{*^{T}} \mu(\pi^{*}) \ge w^{*^{T}} \mu(\pi), \forall ~ \pi \ne \pi^{*} \hspace{1em} (Eq.~4)\\
$$

If:

$$
||\mu(\pi) - \mu(\pi^{*})||_{1} \le \epsilon \hspace{1em} (Eq.~5)\\
$$

then for all $w$ with $||w||_{\infty} \le 1$: <br><br>
$$
|w^{T}\mu(\pi) - w^{T}\mu(\pi^{*})| \le \epsilon \hspace{1em} (Eq.~6)
$$
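The step from Eq. 5 to Eq. 6 follows from Hölder's inequality. Written out per component, with $k$ indexing the $n$ features (an index introduced here, not used elsewhere in these notes):

$$
|w^{T}\mu(\pi) - w^{T}\mu(\pi^{*})| = \Big|\sum_{k = 1}^{n} w_{k}\big(\mu_{k}(\pi) - \mu_{k}(\pi^{*})\big)\Big| \le \sum_{k = 1}^{n} |w_{k}| ~ |\mu_{k}(\pi) - \mu_{k}(\pi^{*})| \le ||w||_{\infty} ~ ||\mu(\pi) - \mu(\pi^{*})||_{1} \le \epsilon
$$

So matching the expert's feature expectations closely is enough to match the expert's value under any bounded linear reward, including the unknown true $w^{*}$.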

Assumption: $R(s) = w^{T}x(s)$ <br>
Initialize policy $\pi_{0}$ <br>
for $i = 1, 2, ...$ <br>
$\quad$ Find a reward function ($\textbf{w}$) such that the teacher maximally outperforms all previous controllers, where $\gamma$ denotes the margin (not the discount factor):

$$
\underset{\textbf{w}}{\arg\max} ~ \underset{\gamma}{\max} ~ \gamma ~~ \text{s.t.} ~~ w^{T} \mu(\pi^{*}) \ge w^{T}\mu(\pi) + \gamma ~~~ \forall \pi \in \{\pi_{0}, \pi_{1}, ..., \pi_{i - 1}\}, ~~ ||w||_{2} \le 1 \hspace{1em} (Eq.~7)\\
$$

$\quad$ Find optimal control policy $\pi_{i}$ for the current $\textbf{w}$ <br>
$\quad$ Exit if $\gamma \le \frac{\epsilon}{2}$ <br>

_Algorithm 2. Apprenticeship Learning._
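The reward-fitting step of Algorithm 2 can be illustrated in the simplest case. The sketch below computes the margin that a given $\textbf{w}$ achieves, and uses the fact that with a single previous policy $\pi_{0}$ the maximizer of Eq. 7 is the unit vector along $\mu(\pi^{*}) - \mu(\pi_{0})$. With several previous policies the step becomes a small quadratic program that needs a solver, which this sketch does not cover. The function names and the feature expectations are hypothetical.

```python
import math

def margin(w, mu_expert, mu_previous):
    """Smallest gap w^T mu(pi*) - w^T mu(pi_j) over all previous policies."""
    return min(
        sum(wk * (e - p) for wk, e, p in zip(w, mu_expert, mu))
        for mu in mu_previous
    )

def first_iteration_weights(mu_expert, mu_0):
    """With only pi_0 available, the best unit-norm w is d / ||d||_2,
    where d = mu(pi*) - mu(pi_0). Assumes d is not the zero vector."""
    d = [e - p for e, p in zip(mu_expert, mu_0)]
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d]

mu_expert = [4.0, 1.0]  # hypothetical mu(pi*)
mu_0 = [1.0, 5.0]       # hypothetical mu(pi_0)

w = first_iteration_weights(mu_expert, mu_0)
print(w)                                        # direction of mu(pi*) - mu(pi_0)
print(margin(w, mu_expert, [mu_0]))             # approximately 5.0, the distance between them
print(margin([1.0, 0.0], mu_expert, [mu_0]))    # another unit-norm w gives a smaller margin
```

When the best achievable margin drops to $\frac{\epsilon}{2}$ or below, no unit-norm reward separates the teacher from the policies found so far by more than that amount, which is the exit condition of the algorithm.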

# 6. Resource

If you missed the link right below the title, I'm providing the resource here again along with the course website.

- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)
- [Course Website](http://web.stanford.edu/class/cs234/index.html)

This is a series of 15 lectures provided by Stanford.
