from IPython.html.services.config import ConfigManager
from IPython.paths import locate_profile
cm = ConfigManager(profile_dir=locate_profile(get_ipython().profile))
cm.update('livereveal', {
              'theme': 'solarized',
              'transition': 'slide',
              'start_slideshow_at': 'selected',
              'progress': 'true',
})

{'background': '#ff0000',
 'height': 768,
 'progress': 'true',
 'scroll': 'true',
 'start_slideshow_at': 'selected',
 'theme': 'solarized',
 'transition': 'slide',
 'width': 1024}

<center>
<h2>Imitation learning</h2>
</center>

### Imitation learning for part-of-speech tagging

<table style="border-style: hidden; border-collapse: collapse; padding: 50px">
<thead>
<tr>
<th>I</th>
<th>can</th>
<th>fly</th>
</tr>
</thead>
<tbody>
<tr>
<td><span>Pronoun</span></td>
<td><span>Modal</span></td>
<td><span>Verb</span></td>
</tr>
</tbody>
</table>

**Task loss**: <span class="fragment">Hamming loss: number of incorrectly predicted tags</span>

**Transition system**: <span class="fragment">Tag each token left-to-right</span>

**Expert policy**: <span class="fragment">Return the next tag from the gold standard</span>
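
The three ingredients above can be sketched in a few lines of Python. This is only an illustration for the running example; the function names (`hamming_loss`, `expert_policy`, `tag_left_to_right`) are hypothetical and not part of any library.

```python
from typing import Callable, List

def hamming_loss(predicted: List[str], gold: List[str]) -> int:
    """Task loss: count positions where predicted and gold tags differ."""
    return sum(p != g for p, g in zip(predicted, gold))

def expert_policy(gold: List[str], tags_so_far: List[str]) -> str:
    """Expert: return the gold tag of the next untagged token."""
    return gold[len(tags_so_far)]

def tag_left_to_right(tokens: List[str],
                      policy: Callable[[List[str]], str]) -> List[str]:
    """Transition system: take one tagging action per token, left to right."""
    tags: List[str] = []
    for _ in tokens:
        tags.append(policy(tags))
    return tags

tokens = ["I", "can", "fly"]
gold = ["Pronoun", "Modal", "Verb"]

# Following the expert reproduces the gold standard, so the loss is 0.
expert_tags = tag_left_to_right(tokens, lambda tags: expert_policy(gold, tags))
print(expert_tags, hamming_loss(expert_tags, gold))

# One wrong tag ("can" as Verb) gives a Hamming loss of 1.
print(hamming_loss(["Pronoun", "Verb", "Verb"], gold))
```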

<h3>Gold standard in search space</h3>

<img src="images/tikz/posImitGold.png" style="width:75%; float:left;">
<br>

<p style="float:left;">
<ul style="float:left;">
<li>Three actions to complete the output</li>
<li>Expert policy replicates the gold standard</li>
</ul>
</p>

<h3>Training a classifier<span class="fragment" data-fragment-index="1"> with structure features </span></h3>

<img src="images/tikz/posImitClassifierTraining.png" style="width:75%; float:left;">

<table style="font-size:100%; border-style:hidden; border-collapse:collapse; padding:50px; float:left;">
<thead>
<tr>
<th>label</th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Pronoun</b></td>
<td>token=I, ...<span class="fragment" data-fragment-index="1">, prev=<b>NULL</b></span></td>
</tr>
<tr>
<td><b>Modal</b></td>
<td>token=can, ...<span class="fragment" data-fragment-index="1">, prev=<b>Pronoun</b></span></td>
</tr>
<tr>
<td><b>Verb</b></td>
<td>token=fly, ...<span class="fragment" data-fragment-index="1">, prev=<b>Modal</b></span></td>
</tr>
</tbody>
</table>

<p style="float:left; font-size: 100%" class="fragment" data-fragment-index="3">With logistic regression and the $k$ previous tags as features, this amounts to training a $k$th-order Maximum Entropy Markov Model (<a href="http://people.csail.mit.edu/mcollins/6864/slides/memm.pdf">McCallum et al., 2000</a>)</p>

The restriction to $k$ previous tags is needed only to make dynamic programming (i.e. Viterbi) available for efficient joint inference. An incremental model does not need it, so its features can use all previous tags.
<p style="float:left; font-size: 100%" class="fragment" data-fragment-index="2">We learn how to imitate the expert assuming no deviations</p>
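
As a rough sketch of how the training table above could be produced, the code below rolls in with the expert and extracts the token plus the $k$ previous tags at each step. The helper names and the `prev2`, `prev3`, ... feature names for $k > 1$ are assumptions made for illustration; the slides elide the remaining features, and so does this sketch.

```python
def extract_features(tokens, t, previous_tags, k=1):
    """Features for token t: the token itself plus the k previous tags."""
    feats = ["token=" + tokens[t]]
    for i in range(1, k + 1):
        name = "prev" if i == 1 else "prev%d" % i
        idx = len(previous_tags) - i
        # Positions before the start of the sentence get the NULL tag.
        feats.append(name + "=" + (previous_tags[idx] if idx >= 0 else "NULL"))
    return feats

def expert_training_examples(tokens, gold, k=1):
    """Roll in with the expert, so the previous tags are always the gold ones."""
    return [(gold[t], extract_features(tokens, t, gold[:t], k))
            for t in range(len(tokens))]

examples = expert_training_examples(["I", "can", "fly"],
                                    ["Pronoun", "Modal", "Verb"])
for label, feats in examples:
    print(label, feats)
```

Each `(label, features)` pair is one classification instance; any multiclass learner could then be trained on them.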

### Exposure bias

<img src="images/tikz/posImitOffGold.png" style="width:75%; align:left;">
<br>
<p style="float: left">We had seen: &nbsp;&nbsp; 
<table style="float: left; border-style: hidden; border-collapse: collapse;">
<thead>
<tr>
<th>label</th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Verb</b></td>
<td>token=fly,..., prev=<b>Modal</b></td>
</tr>
</tbody>
</table>
</p>

<p style="float: left">but not: &nbsp;&nbsp;
<table style="float: left; border-style: hidden; border-collapse: collapse; padding: 50px;">
<thead>
<tr>
<th>label</th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Verb</b></td>
<td>token=fly,..., prev=<b>Verb</b></td>
</tr>
</tbody>
</table></p>
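
A small check makes the mismatch concrete. It assumes the same minimal feature set as the tables (token and previous tag only) and a hypothetical mistake in which "can" is tagged as Verb.

```python
tokens = ["I", "can", "fly"]
gold = ["Pronoun", "Modal", "Verb"]

def features(tokens, t, previous_tags):
    """Token and previous-tag features, as in the tables above."""
    prev = previous_tags[-1] if previous_tags else "NULL"
    return ("token=" + tokens[t], "prev=" + prev)

# Feature vectors seen when training on expert (gold) prefixes only.
seen = {features(tokens, t, gold[:t]) for t in range(len(tokens))}

# At test time the classifier tags "can" as Verb, so "fly" is reached
# with a previous tag that never occurred in training.
off_gold_prefix = ["Pronoun", "Verb"]
state_after_mistake = features(tokens, 2, off_gold_prefix)

print(state_after_mistake in seen)  # False
```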

### Addressing exposure bias

<p style="float: left;">Allow the classifier to guide the learning<br></p>  <a href="https://www.pinterest.com/explore/affordable-driving-school/"><img src="images/driving_mix.jpg" style="width:35%; float: right;"></a>

- 1st iteration: **roll-in** through the data with the expert
- 2nd iteration onwards: mix expert and classifier to expose the classifier to its own actions
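
One possible way to mix the two policies during roll-in is to flip a biased coin at every step. The sketch below assumes an expert probability `beta` and a stand-in classifier `always_verb`; both names are hypothetical.

```python
import random

def mixed_rollin(tokens, gold, classifier, beta, rng):
    """At each step follow the expert with probability beta, else the classifier."""
    tags = []
    for t in range(len(tokens)):
        if rng.random() < beta:
            tags.append(gold[t])                      # expert action
        else:
            tags.append(classifier(tokens, t, tags))  # classifier's own action
    return tags

tokens = ["I", "can", "fly"]
gold = ["Pronoun", "Modal", "Verb"]

def always_verb(tokens, t, previous_tags):
    """Stand-in for a trained classifier (assumption for illustration)."""
    return "Verb"

rng = random.Random(0)
print(mixed_rollin(tokens, gold, always_verb, 1.0, rng))  # expert only
print(mixed_rollin(tokens, gold, always_verb, 0.0, rng))  # classifier only
print(mixed_rollin(tokens, gold, always_verb, 0.5, rng))  # a mixture
```

With `beta = 1.0` the roll-in is the gold standard (the first iteration); lowering `beta` exposes the classifier to states produced by its own actions.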

### DAgger algorithm

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; expert\; \pi^{\star}, \; loss \; function \; L\\
& \textbf{Output:} \; classifier \; H\\
& training\; examples\; \cal E = \emptyset, \; expert\; probability\; \beta=1\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \text{set} \; rollin \; policy \; \pi^{in} = \beta \pi^{\star} + (1-\beta) H\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \text{rollin to predict} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in}(\mathbf{x},\mathbf{y})\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \text{ask expert for best action}\; \alpha^{\star} = \pi^{\star}(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \text{extract } features=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (features,\alpha^{\star})\\
& \quad \text{learn} \; H\; \text{from}\; \cal E\\
& \quad \text{decrease} \; \beta\\
\end{align}
</p>
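
The pseudocode can be turned into a small runnable sketch. Several pieces here are assumptions rather than part of the algorithm: the count-based `learn` function is a toy stand-in for a real classifier such as logistic regression, the features are only the token and the previous tag, and $\beta$ is decreased by a fixed factor `decay`. Roll-in and example collection are interleaved, which visits the same states as rolling in first and then looping over the actions.

```python
import random
from collections import defaultdict

def features(tokens, t, prev_tags):
    prev = prev_tags[-1] if prev_tags else "NULL"
    return ["token=" + tokens[t], "prev=" + prev]

def learn(examples):
    """Toy learner: score a label by how often it co-occurred with the features."""
    counts = defaultdict(lambda: defaultdict(int))
    labels = set()
    for feats, label in examples:
        labels.add(label)
        for f in feats:
            counts[f][label] += 1
    ordered = sorted(labels)
    def classify(feats):
        return max(ordered, key=lambda l: sum(counts[f][l] for f in feats))
    return classify

def dagger(data, iterations=3, decay=0.5, seed=0):
    rng = random.Random(seed)
    examples, beta, H = [], 1.0, None   # E is empty, expert probability is 1
    for _ in range(iterations):
        for tokens, gold in data:
            prev_tags = []
            for t in range(len(tokens)):
                feats = features(tokens, t, prev_tags)
                # The expert labels every state visited by the roll-in.
                examples.append((feats, gold[t]))
                # Roll-in step: expert with probability beta, else classifier.
                if H is None or rng.random() < beta:
                    prev_tags.append(gold[t])
                else:
                    prev_tags.append(H(feats))
        H = learn(examples)   # train on the aggregated examples
        beta *= decay         # decrease the expert probability
    return H

def tag(H, tokens):
    tags = []
    for t in range(len(tokens)):
        tags.append(H(features(tokens, t, tags)))
    return tags

data = [(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])]
H = dagger(data)
print(tag(H, ["I", "can", "fly"]))
```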

### DAgger algorithm

Proposed by [Ross et al. (2011)](http://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf), motivated by robotics:
- first iteration is standard classification training
- task loss and gold standard are implicitly considered via the expert
- DAgger: the Datasets in each iteration are Aggregated

So far we have looked at how to recover from previous errors. What about anticipating (and avoiding) future ones?

### Training labels as costs

<img src="images/tikz/posImitClassifierTraining.png" style="width:75%; float:left;">

<table style="float: left; border-style: hidden; border-collapse: collapse; padding: 50px; float:left;">
<thead>
<tr>
<th><b>Pronoun</b></th>
<th><b>Modal</b></th>
<th><b>Verb</b></th>
<th><b>Noun</b></th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>token=I, prev=<b>NULL</b>...</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>token=can, prev=<b>Pronoun</b>...</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>token=fly, prev=<b>Modal</b>...</td>
</tr>
</tbody>
</table>

<h3>Cost breakdown</h3>

<img src="images/tikz/posImitActionCosting1.png" style="float: left; width:50%">
<img src="images/tikz/posImitActionCosting2.png" style="float: left;width:50%">


<p style="float:left;">
<ul>
<li><b>roll-in</b> to a point in the sentence</li>
<li>try each possible label and <b>roll out</b> till the end</li>
<li>evaluate the complete output with the task loss</li>
<li>If we <b>roll out</b> with the expert only, the correct action has cost 0 and each incorrect action has cost 1.</li>
</ul>
</p>
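
The costing procedure can be sketched as follows for the running example, assuming Hamming loss as the task loss and a roll-in that followed the gold standard up to "can". The function and variable names are hypothetical.

```python
ACTIONS = ["Pronoun", "Modal", "Verb", "Noun"]

def hamming_loss(predicted, gold):
    return sum(p != g for p, g in zip(predicted, gold))

def action_costs(gold, prefix, rollout_policy):
    """Cost of each action: try it, roll out to the end, apply the task loss."""
    costs = {}
    for action in ACTIONS:
        tags = prefix + [action]
        while len(tags) < len(gold):
            tags.append(rollout_policy(tags))
        costs[action] = hamming_loss(tags, gold)
    return costs

gold = ["Pronoun", "Modal", "Verb"]
expert = lambda tags: gold[len(tags)]   # roll out with the expert only

# Roll in to "can" (the first token is already tagged), then cost every label.
costs = action_costs(gold, ["Pronoun"], expert)
print(costs)  # {'Pronoun': 1, 'Modal': 0, 'Verb': 1, 'Noun': 1}
```

The result matches the second row of the cost table: with expert roll-outs the only error that can occur is the action being tried.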

### Mixed roll-outs

Rolling out with the classifier allows us to see future mistakes

<img src="images/tikz/posImitActionCosting3.png" style="width:75%; float:left;">

<table style="float: left; border-style: hidden; border-collapse: collapse; padding: 50px; float:left;">
<thead>
<tr>
<th><b>Pronoun</b></th>
<th><b>Modal</b></th>
<th><b>Verb</b></th>
<th><b>Noun</b></th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0</td>
<td>2</td>
<td>1</td>
<td>token=can, prev=<b>Pronoun</b>...</td>
</tr>
</tbody>
</table>
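
The costs in this table can be reproduced with a hypothetical classifier that tags the next token as Noun whenever the previous tag is Verb, and as Verb otherwise. This classifier is an assumption chosen to match the numbers above, not necessarily the one depicted in the figure.

```python
ACTIONS = ["Pronoun", "Modal", "Verb", "Noun"]

def hamming_loss(predicted, gold):
    return sum(p != g for p, g in zip(predicted, gold))

def action_costs(gold, prefix, rollout_policy):
    """Cost of each action: try it, roll out to the end, apply the task loss."""
    costs = {}
    for action in ACTIONS:
        tags = prefix + [action]
        while len(tags) < len(gold):
            tags.append(rollout_policy(tags))
        costs[action] = hamming_loss(tags, gold)
    return costs

def classifier(tags):
    """Hypothetical learned policy: predicts Noun after Verb, Verb otherwise."""
    return "Noun" if tags[-1] == "Verb" else "Verb"

gold = ["Pronoun", "Modal", "Verb"]
costs = action_costs(gold, ["Pronoun"], classifier)
print(costs)  # {'Pronoun': 1, 'Modal': 0, 'Verb': 2, 'Noun': 1}
```

Tagging "can" as Verb now costs 2 instead of 1, because the classifier goes on to mistag "fly" as a consequence.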

### DAgger with roll-outs

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; expert\; \pi^{\star}, \; loss \; function \; L\\
& \textbf{Output:} \; classifier \; H\\
& training\; examples\; \cal E = \emptyset, \; expert\; probability\; \beta=1\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \text{set} \; rollin/out \; policy \; \pi^{in/out} = \beta \pi^{\star} + (1-\beta) H\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \text{rollin to predict} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in/out}(\mathbf{x},\mathbf{y})\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \mathbf{for} \; \alpha \in {\cal A} \; \mathbf{do}\\
& \quad \quad \quad \quad \text{rollout} \; S_{final} = \pi^{in/out}(S_{t-1}, \alpha, \mathbf{x})\\
& \quad \quad \quad \quad cost\; c_{\alpha}=L(S_{final}, \mathbf{y})  \\
& \quad \quad \quad \text{extract } features=\phi(\mathbf{x}, S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (features,\mathbf{c})\\
& \quad \text{learn} \; H\; \text{from}\; \cal E\\
& \quad \text{decrease} \; \beta\\
\end{align}
</p>
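
A compact sketch of this loop is given below. It rests on several assumptions that are not part of the algorithm itself: Hamming loss as $L$, token and previous-tag features only, a toy cost-sensitive learner (`learn`) that picks the action with the lowest accumulated cost, and a fixed `decay` factor for $\beta$.

```python
import random
from collections import defaultdict

ACTIONS = ["Pronoun", "Modal", "Verb", "Noun"]

def features(tokens, t, prev_tags):
    prev = prev_tags[-1] if prev_tags else "NULL"
    return ["token=" + tokens[t], "prev=" + prev]

def hamming_loss(predicted, gold):
    return sum(p != g for p, g in zip(predicted, gold))

def learn(examples):
    """Toy cost-sensitive learner: choose the action with the lowest summed cost."""
    total = defaultdict(lambda: defaultdict(float))
    for feats, costs in examples:
        for f in feats:
            for action, cost in costs.items():
                total[f][action] += cost
    def H(tokens, t, prev_tags):
        feats = features(tokens, t, prev_tags)
        return min(ACTIONS, key=lambda a: sum(total[f][a] for f in feats))
    return H

def mix(H, gold, beta, rng):
    """Per-step mixture: expert with probability beta, else the classifier H."""
    def policy(tokens, t, prev_tags):
        if H is None or rng.random() < beta:
            return gold[t]
        return H(tokens, t, prev_tags)
    return policy

def complete(tokens, prefix, policy):
    """Extend a partial tag sequence to the end of the sentence."""
    tags = list(prefix)
    while len(tags) < len(tokens):
        tags.append(policy(tokens, len(tags), tags))
    return tags

def dagger_with_rollouts(data, iterations=3, decay=0.5, seed=0):
    rng = random.Random(seed)
    examples, H, beta = [], None, 1.0
    for _ in range(iterations):
        for tokens, gold in data:
            pi = mix(H, gold, beta, rng)
            rollin = complete(tokens, [], pi)
            for t in range(len(tokens)):
                prefix = rollin[:t]
                # Cost every action by rolling out and applying the task loss.
                costs = {a: hamming_loss(complete(tokens, prefix + [a], pi), gold)
                         for a in ACTIONS}
                examples.append((features(tokens, t, prefix), costs))
        H = learn(examples)
        beta *= decay
    return H

tokens, gold = ["I", "can", "fly"], ["Pronoun", "Modal", "Verb"]
H = dagger_with_rollouts([(tokens, gold)])
print(complete(tokens, [], H))
```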

### Roll-outs

- can learn with non-decomposable losses
- can learn with sub-optimal experts
- expensive when there are many actions and long sequences to complete outputs 


Some history:
- first proposed in SEARN ([Daumé III et al., 2009](http://hunch.net/~jl/projects/reductions/searn/searn.pdf))

- used to hybridise DAgger by [Vlachos and Clark (2014)](http://www.aclweb.org/anthology/Q14-1042), referred to later as V-DAgger ([Goodman et al., 2016](http://aclweb.org/anthology/P16-1001))

- also proposed as look-aheads ([Tsuruoka et al., 2011](http://www.anthology.aclweb.org/W/W11/W11-0328.pdf))

### LoLS

Locally Optimal Learning to Search ([Chang et al., 2015](https://arxiv.org/pdf/1502.02206.pdf))

<img src="images/lols.png" style="width:70%;">


- the roll-in always uses the classifier
- each roll-out uses either the expert or the classifier exclusively
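
The difference from per-step mixing can be sketched as follows: the coin is flipped once per roll-out rather than once per action. The probability `beta`, the stand-in classifier `always_noun` and the helper names are assumptions for illustration.

```python
import random

def lols_rollout_policy(H, expert, beta, rng):
    """Pick ONE policy for the whole roll-out: expert with probability beta, else H."""
    return expert if rng.random() < beta else H

def complete(tokens, prefix, policy):
    """Extend a partial tag sequence to the end of the sentence."""
    tags = list(prefix)
    while len(tags) < len(tokens):
        tags.append(policy(tokens, len(tags), tags))
    return tags

tokens = ["I", "can", "fly"]
gold = ["Pronoun", "Modal", "Verb"]
expert = lambda tokens, t, prev_tags: gold[t]
always_noun = lambda tokens, t, prev_tags: "Noun"   # stand-in for the classifier

rng = random.Random(0)
rollin = complete(tokens, [], always_noun)           # roll-in: classifier only
policy = lols_rollout_policy(always_noun, expert, beta=0.5, rng=rng)
final = complete(tokens, rollin[:1] + ["Modal"], policy)  # try Modal for "can"
print(final)
```

Depending on the coin flip, the rest of the sequence is completed entirely by the expert or entirely by the classifier, never by a blend of the two.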

<h3>Generic imitation learning</h3>

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; expert\; \pi^{\star}, \; loss \; function \; L\\
& \textbf{Output:} \; classifier \; H\\
& training\; examples\; \cal E = \emptyset\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \text{set} \; rollin \; policy \; \pi^{in} = mix(H,\pi^{\star})\\
& \quad \text{set} \; rollout \; policy \; \pi^{out} = mix(H,\pi^{\star})\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \text{rollin to predict} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in}(\mathbf{x},\mathbf{y})\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \text{rollout to obtain costs}\; c \; \text{for all possible actions using}\; L\;  \\
& \quad \quad \quad \text{extract features}\; f=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (f,c)\\
& \quad \text{learn} \; H\; \text{from}\; \cal E\\
\end{align}
</p>

### Comparison

SEARN, DAgger, V-DAgger and LoLS

Goodman style

### Summary so far

- basic intuition behind imitation learning (IL)
- roll-ins and the DAgger algorithm
- roll-outs, V-DAgger and LoLS
- generic imitation learning recipe