In [21]:
%load_ext autoreload
%autoreload 2


In [27]:
import il_tutorial.cost_graphs as cg
import il_tutorial.util as util
from IPython.display import HTML

In [82]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

<center>
<h2>Imitation learning</h2>
</center>

### Imitation learning for part-of-speech tagging

<table style="border-style: hidden; border-collapse: collapse; padding: 50px">
<thead>
<tr>
<th>I</th>
<th>can</th>
<th>fly</th>
</tr>
</thead>
<tbody>
<tr>
<td><span>Pronoun</span></td>
<td><span>Modal</span></td>
<td><span>Verb</span></td>
</tr>
</tbody>
</table>

**Task loss**: <span class="fragment">Hamming loss: number of incorrectly predicted tags</span>

**Transition system**: <span class="fragment">Tag each token left-to-right</span>

**Expert policy**: <span class="fragment">Return the next tag from the gold standard</span>
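
As a minimal sketch of these three ingredients, the task loss, the left-to-right transition system and the expert policy could look as follows in Python. The names `hamming_loss` and `expert_policy` are illustrative assumptions and are not part of `il_tutorial`.

```python
from typing import List, Sequence


def hamming_loss(predicted: Sequence[str], gold: Sequence[str]) -> int:
    """Task loss: number of positions where the predicted tag differs from the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p != g for p, g in zip(predicted, gold))


def expert_policy(tags_so_far: Sequence[str], gold: Sequence[str]) -> str:
    """Expert policy: return the gold tag of the next untagged token."""
    return gold[len(tags_so_far)]


tokens = ["I", "can", "fly"]
gold = ["Pronoun", "Modal", "Verb"]

# Transition system: tag each token left to right, one action per token.
state: List[str] = []
for _ in tokens:
    state.append(expert_policy(state, gold))

print(state, hamming_loss(state, gold))
print(hamming_loss(["Pronoun", "Verb", "Verb"], gold))
```

Following the expert reproduces the gold standard with zero loss, while the second call counts one incorrectly predicted tag.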

<h3>Gold standard in search space</h3>

In [16]:
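# Each cell appears to be a (column index, row index) pair, e.g. (1,3) is token 'I' tagged 'Pronoun'.
# The carousel reveals the gold path one action at a time.
paths = [[],[(0,4),(1,3)],[(0,4),(1,3),(2,2)],[(0,4),(1,3),(2,2),(3,1)]]
rows = ['Noun', 'Verb', 'Modal', 'Pronoun','NULL']
columns = ['NULL','I', 'can', 'fly']
cbs = [cg.draw_cost_breakdown(rows, columns, path) for path in paths]
util.Carousel(cbs)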

<p>
<ul>
<li>Three actions to complete the output</li>
<li>Expert policy replicates the gold standard</li>
</ul>
</p>

<h3>Training a classifier<span class="fragment" data-fragment-index="0"> with structure features </span></h3>

In [17]:
gold_path = [(0,4),(1,3),(2,2),(3,1)]
cb_gold = cg.draw_cost_breakdown(rows, columns, gold_path)
cb_gold

<table style="font-size:100%; border-style:hidden; border-collapse:collapse; padding:50px; float:left;">
<thead>
<tr>
<th>timestep</th>
<th>label ($\alpha_t$)</th>
<th>features ($\phi(\mathbf{x},S_{t-1})$)</th>
</tr>
</thead>
<tbody>
<tr>
<td> $t=1$ </td>
<td><b>Pronoun</b></td>
<td>token=I, ...<span class="fragment" data-fragment-index="1">, prev=<b>NULL</b></span></td>
</tr>
<tr>
<td> $t=2$ </td>
<td><b>Modal</b></td>
<td>token=can, ...<span class="fragment" data-fragment-index="1">, prev=<b>Pronoun</b></span></td>
</tr>
<tr>
<td> $t=3$ </td>
<td><b>Verb</b></td>
<td>token=fly, ...<span class="fragment" data-fragment-index="1">, prev=<b>Modal</b></span></td>
</tr>
</tbody>
</table>

Restricting the features to a fixed window of previous tags is only needed so that dynamic programming (e.g. Viterbi) can perform efficient joint inference. An incremental model has no such requirement, so its features can use all previous tags.
<p style="float:left; font-size: 100%" class="fragment" data-fragment-index="2">We learn how to imitate the expert assuming no deviations</p>
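
To make the feature table concrete, here is a hypothetical feature extractor. The function name `extract_features` and the feature names `prev1`, `prev2`, ... are assumptions made for illustration. Since the model is incremental, `k` can be as large as the number of tags predicted so far.

```python
from typing import Dict, Sequence


def extract_features(tokens: Sequence[str], prev_tags: Sequence[str], k: int = 1) -> Dict[str, str]:
    """Hypothetical phi(x, S_{t-1}): the current token plus the k most recent tags."""
    t = len(prev_tags)  # index of the token to tag next
    feats = {"token": tokens[t]}
    for i in range(1, k + 1):
        # Pad with NULL when fewer than i tags have been predicted so far.
        feats[f"prev{i}"] = prev_tags[-i] if i <= len(prev_tags) else "NULL"
    return feats


tokens = ["I", "can", "fly"]
print(extract_features(tokens, []))                          # t=1 in the table
print(extract_features(tokens, ["Pronoun"]))                 # t=2
print(extract_features(tokens, ["Pronoun", "Modal"], k=2))   # t=3, using two previous tags
```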

### Algorithm

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H\\
& \text{set training examples}\; \cal E = \emptyset\\
& \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \text{generate expert trajectory} \; \alpha_1^{\star}\dots \alpha_T^{\star}  = \pi^{\star}(\mathbf{x},\mathbf{y})\\
& \quad \mathbf{for} \; \alpha^{\star}_t \in \alpha_1^{\star}\dots \alpha_T^{\star} \; \mathbf{do}\\
& \quad \quad \text{extract features}\; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \cal E = \cal E \cup (\mathit{feat},\alpha^{\star}_t)\\
& \text{learn} \; H\; \text{from}\; \cal E\\
\end{align}
</p>


<p style="float:left; font-size: 100%" class="fragment" data-fragment-index="2">With logistic regression as the classifier and the $k$ previous tags as features, this amounts to training a $k$th-order Maximum Entropy Markov Model (<a href="http://people.csail.mit.edu/mcollins/6864/slides/memm.pdf">McCallum et al., 2000</a>)</p>
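
The algorithm above can be sketched in plain Python. Everything here is illustrative: `features`, `collect_expert_examples` and the `MajorityClassifier` stand-in for $H$ are hypothetical names, and a real system would use a proper classifier such as logistic regression.

```python
from collections import Counter, defaultdict


def features(tokens, prev_tags):
    """phi(x, S_{t-1}): current token and previous tag, as a hashable tuple."""
    prev = prev_tags[-1] if prev_tags else "NULL"
    return (("token", tokens[len(prev_tags)]), ("prev", prev))


def collect_expert_examples(dataset):
    """Follow the expert on every sentence and record (features, expert action) pairs."""
    examples = []
    for tokens, gold in dataset:
        state = []
        for t in range(len(tokens)):
            action = gold[t]  # the expert returns the next gold tag
            examples.append((features(tokens, state), action))
            state.append(action)
    return examples


class MajorityClassifier:
    """Toy stand-in for H: memorises the most frequent label per feature vector."""

    def fit(self, examples):
        counts = defaultdict(Counter)
        for feat, label in examples:
            counts[feat][label] += 1
        self.table = {f: c.most_common(1)[0][0] for f, c in counts.items()}
        self.default = Counter(label for _, label in examples).most_common(1)[0][0]
        return self

    def predict(self, feat):
        return self.table.get(feat, self.default)


tokens, gold = ["I", "can", "fly"], ["Pronoun", "Modal", "Verb"]
examples = collect_expert_examples([(tokens, gold)])
classifier = MajorityClassifier().fit(examples)
print(examples)
print(classifier.predict(features(tokens, ["Pronoun", "Modal"])))
```

Only states visited by the expert ever appear in `examples`, so the classifier never sees what happens after one of its own mistakes.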

### Exposure bias

In [18]:
wrong_path = [(0,4),(1,2)]
cb_wrong = cg.draw_cost_breakdown(rows, columns, wrong_path)
util.Carousel([cb_gold, cb_wrong])

<p style="float: left">In training we saw: &nbsp;&nbsp; 
<table style="float: left; border-style: hidden; border-collapse: collapse;">
<thead>
<tr>
<th>label</th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Verb</b></td>
<td>token=fly,..., prev=<b>Modal</b></td>
</tr>
</tbody>
</table>
</p>

<p style="float: left">but not: &nbsp;&nbsp;
<table style="float: left; border-style: hidden; border-collapse: collapse; padding: 50px;">
<thead>
<tr>
<th>label</th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>Verb</b></td>
<td>token=fly,..., prev=<b>Verb</b></td>
</tr>
</tbody>
</table></p>

### Addressing exposure bias with rollins

<p style="float: left;">Allow the classifier to guide the learning<br></p>  <a href="https://www.pinterest.com/explore/affordable-driving-school/"><img src="images/driving_mix.jpg" style="width:35%; float: right;"></a>

Define a **rollin** policy that sometimes uses the expert $\pi^{\star}$ and other times the classifier $H$:

$$\pi^{in} = \beta\pi^{\star} + (1-\beta)H$$
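
One common reading of this mixture is stochastic: at every step, follow the expert with probability $\beta$ and the classifier otherwise. The sketch below assumes that reading. `make_mixture_policy`, `rollin` and the toy policies are hypothetical names.

```python
import random


def make_mixture_policy(expert, classifier, beta, rng):
    """Return a policy that follows `expert` with probability beta, else `classifier`."""
    def policy(tokens, state, gold):
        if rng.random() < beta:
            return expert(tokens, state, gold)
        return classifier(tokens, state)
    return policy


def expert(tokens, state, gold):
    return gold[len(state)]  # next gold tag


def always_noun(tokens, state):
    return "Noun"  # deliberately poor toy classifier


def rollin(policy, tokens, gold):
    """Tag the sentence left to right with the given policy."""
    state = []
    for _ in tokens:
        state.append(policy(tokens, state, gold))
    return state


tokens, gold = ["I", "can", "fly"], ["Pronoun", "Modal", "Verb"]
rng = random.Random(0)
print(rollin(make_mixture_policy(expert, always_noun, 1.0, rng), tokens, gold))  # expert only
print(rollin(make_mixture_policy(expert, always_noun, 0.0, rng), tokens, gold))  # classifier only
print(rollin(make_mixture_policy(expert, always_noun, 0.5, rng), tokens, gold))  # a mixture
```

With $\beta=1$ the rollin is the expert trajectory, and with $\beta=0$ it is entirely the classifier's own prediction.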

### DAgger algorithm

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H\\
& \text{set training examples}\; \cal E = \emptyset ,\; \color{red}{\pi^{\star}\; \mathrm{probability}\; \beta=1}\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \color{red}{\text{set rollin policy} \; \pi^{in} = \beta\pi^{\star} + (1-\beta)H}\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \color{red}{\text{generate trajectory} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in}(\mathbf{x},\mathbf{y})}\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \color{red}{\text{ask expert for best action}\; \alpha^{\star} = \pi^{\star}(\mathbf{x},S_{t-1})} \\
& \quad \quad \quad \text{extract features} \; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (\mathit{feat},\alpha^{\star})\\
& \quad \text{learn}\; H \; \text{from}\; \cal E\\
& \quad \color{red}{\text{decrease} \; \beta}\\
\end{align}
</p>

Proposed by [Ross et al. (2011)](http://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf), motivated by robotics:
- first iteration is standard classification training
- task loss and gold standard are implicitly considered via the expert
- DAgger: the Datasets in each iteration are Aggregated
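
The DAgger loop can be sketched as follows, under several assumptions: the `MajorityClassifier` stand-in for $H$, the `features` function, the multiplicative `decay` of $\beta$ and the fixed number of iterations as termination condition are all illustrative choices. The rollin is interleaved with data collection here, which visits the same states as generating the whole trajectory first.

```python
import random
from collections import Counter, defaultdict


def features(tokens, state):
    """phi(x, S_{t-1}): current token and previous tag (NULL at the start)."""
    return (tokens[len(state)], state[-1] if state else "NULL")


class MajorityClassifier:
    """Toy stand-in for H: most frequent label per feature vector."""

    def fit(self, examples):
        counts = defaultdict(Counter)
        for feat, label in examples:
            counts[feat][label] += 1
        self.table = {f: c.most_common(1)[0][0] for f, c in counts.items()}
        self.default = Counter(label for _, label in examples).most_common(1)[0][0]
        return self

    def predict(self, feat):
        return self.table.get(feat, self.default)


def dagger(dataset, iterations=3, decay=0.5, seed=0):
    rng = random.Random(seed)
    examples, classifier, beta = [], None, 1.0
    for _ in range(iterations):
        for tokens, gold in dataset:
            state = []
            for t in range(len(tokens)):
                feat = features(tokens, state)
                expert_action = gold[t]  # the expert's best action in this state
                examples.append((feat, expert_action))  # the label always comes from the expert
                # Roll in: expert with probability beta, otherwise the classifier.
                if classifier is None or rng.random() < beta:
                    state.append(expert_action)
                else:
                    state.append(classifier.predict(feat))
        classifier = MajorityClassifier().fit(examples)  # train on the aggregated data
        beta *= decay
    return classifier, examples


tokens, gold = ["I", "can", "fly"], ["Pronoun", "Modal", "Verb"]
classifier, examples = dagger([(tokens, gold)])
print(len(examples))
```

Because $\beta=1$ in the first iteration, that iteration collects exactly the expert-trajectory data of standard classification training.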

**rollins** expose the classifier to its previous mistakes. What about future ones?

**rollouts**: expose the classifier to future mistakes!

### Training labels as costs


In [19]:
cb_gold

<table style="float: left; border-style: hidden; border-collapse: collapse; padding: 50px;">
<thead>
<tr>
<th><b>Pronoun</b></th>
<th><b>Modal</b></th>
<th><b>Verb</b></th>
<th><b>Noun</b></th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>token=I, prev=<b>NULL</b>...</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>token=can, prev=<b>Pronoun</b>...</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>token=fly, prev=<b>Modal</b>...</td>
</tr>
</tbody>
</table>

<h3>Cost breakdown</h3>

In [24]:
p = gold_path.copy()
cost = 1
cb_costs = []
cb_costs.append(cg.draw_cost_breakdown(rows, columns, [(0,4),(1,3)], roll_in_cell=p[1]))
cb_costs.append(cg.draw_cost_breakdown(rows, columns, [(0,4),(1,3),(2,0)], roll_in_cell=p[1], explore_cell=(2,0)))
cb_costs.append(cg.draw_cost_breakdown(rows, columns, [(0,4),(1,3),(2,0),(3,1)], roll_in_cell=p[1], explore_cell=(2,0),roll_out_cell=(3,0)))
cb_costs.append(cg.draw_cost_breakdown(rows, columns, [(0,4),(1,3),(2,0),(3,1)], cost, p[3], roll_in_cell=p[1], explore_cell=(2,0),roll_out_cell=(3,0)))
for i in range(1,4):
    p = gold_path.copy()
    p[2] = (gold_path[2][0],i)
    if p == gold_path:
        cost = 0
    else:
        cost = 1
    cb_costs.append(cg.draw_cost_breakdown(rows, columns, p, cost, p[3], roll_in_cell=p[1],roll_out_cell=(3,0), explore_cell=p[2]))
util.Carousel(cb_costs)

<table style="border-style: hidden; border-collapse: collapse; padding: 50px;">
<thead>
<tr>
<th><b>Pronoun</b></th>
<th><b>Modal</b></th>
<th><b>Verb</b></th>
<th><b>Noun</b></th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>token=can, prev=<b>Pronoun</b>...</td>
</tr>
</tbody>
</table>

<p>
<ul>
<li><b>rollin</b> to a point in the sentence</li>
<li><b>explore</b> each action: <b>rollout</b> and cost with task loss</li>
<li>expert-only <b>rollout</b>: the correct action costs 0, each incorrect action costs 1</li>
</ul>
</p>
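
The three steps above can be sketched in a few lines, assuming an expert-only rollout and Hamming loss as the task loss. The names `expert_rollout` and `action_costs` are hypothetical.

```python
TAGS = ["Pronoun", "Modal", "Verb", "Noun"]


def hamming_loss(predicted, gold):
    """Task loss: number of incorrectly predicted tags."""
    return sum(p != g for p, g in zip(predicted, gold))


def expert_rollout(prefix, gold):
    """Complete a partial tag sequence with the gold tags of the remaining tokens."""
    return list(prefix) + list(gold[len(prefix):])


def action_costs(rollin_prefix, gold, tags=TAGS):
    """Explore every action after the rollin prefix and cost it with the task loss."""
    costs = {}
    for action in tags:
        final = expert_rollout(list(rollin_prefix) + [action], gold)
        costs[action] = hamming_loss(final, gold)
    return costs


gold = ["Pronoun", "Modal", "Verb"]
# Roll in with the gold tag for "I", then explore all tags for "can".
print(action_costs(["Pronoun"], gold))
# {'Pronoun': 1, 'Modal': 0, 'Verb': 1, 'Noun': 1}
```

The printed costs match the row for `token=can` in the table above.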

### Mixed rollouts

In [25]:
cb_mix_costs = []
for i in range(4):
    p = gold_path.copy()
    p[2] = (gold_path[2][0],i)
    if p == gold_path:
        cost = 0
    elif i==1:
        # Exploring Verb for "can": the rollout then tags "fly" as Noun, so two tags are wrong.
        cost = 2
        p[3] = (3,0)
    else:
        cost = 1
    cb_mix_costs.append(cg.draw_cost_breakdown(rows, columns, p, cost, p[3], roll_in_cell=p[1],roll_out_cell=(3,0), explore_cell=p[2]))
util.Carousel(cb_mix_costs)

<table style="border-style: hidden; border-collapse: collapse; padding: 50px;">
<thead>
<tr>
<th><b>Pronoun</b></th>
<th><b>Modal</b></th>
<th><b>Verb</b></th>
<th><b>Noun</b></th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0</td>
<td>2</td>
<td>1</td>
<td>token=can, prev=<b>Pronoun</b>...</td>
</tr>
</tbody>
</table>

<p>
Define a <b>rollout</b> policy that sometimes uses the expert $\pi^{\star}$ and other times the classifier $H$:

$$\pi^{out} = \beta\pi^{\star} + (1-\beta)H$$
</p>
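
A mixed rollout can be sketched as below. The `toy_classifier` is a made-up imperfect $H$ that tags "fly" as Noun whenever the previous tag is Verb, chosen only so that the classifier-only rollout reproduces the cost of 2 for Verb in the table above. `mixed_rollout` and `action_costs` are hypothetical names.

```python
import random

TAGS = ["Pronoun", "Modal", "Verb", "Noun"]


def toy_classifier(tokens, state):
    """Hypothetical imperfect H: tags 'fly' as Noun whenever the previous tag is Verb."""
    token = tokens[len(state)]
    prev = state[-1] if state else "NULL"
    lexicon = {"I": "Pronoun", "can": "Modal", "fly": "Verb"}
    if token == "fly" and prev == "Verb":
        return "Noun"
    return lexicon[token]


def mixed_rollout(tokens, prefix, gold, beta, rng):
    """Complete the sequence, using the expert with probability beta at each step."""
    state = list(prefix)
    while len(state) < len(tokens):
        if rng.random() < beta:
            state.append(gold[len(state)])
        else:
            state.append(toy_classifier(tokens, state))
    return state


def action_costs(tokens, rollin_prefix, gold, beta, rng):
    """Cost every action by rolling out with the mixture and applying Hamming loss."""
    costs = {}
    for action in TAGS:
        final = mixed_rollout(tokens, list(rollin_prefix) + [action], gold, beta, rng)
        costs[action] = sum(p != g for p, g in zip(final, gold))
    return costs


tokens, gold = ["I", "can", "fly"], ["Pronoun", "Modal", "Verb"]
rng = random.Random(0)
print(action_costs(tokens, ["Pronoun"], gold, beta=1.0, rng=rng))  # expert-only rollouts
print(action_costs(tokens, ["Pronoun"], gold, beta=0.0, rng=rng))  # classifier-only rollouts
```

With classifier-only rollouts, choosing Verb for "can" is costed at 2 because it triggers a second mistake later in the sentence, which an expert-only rollout would hide.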

### DAgger with rollouts

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 75%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H, \; \text{loss} \; L\\
& \text{set training examples}\; \cal E = \emptyset, \; \pi^{\star}\; \mathrm{probability}\; \beta=1\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \color{red}{\text{set rollin/out policy} \; \pi^{in/out} = \beta\pi^{\star} + (1-\beta)H}\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \text{rollin to predict} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in/out}(\mathbf{x},\mathbf{y})\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \mathbf{for} \; \alpha \in {\cal A} \; \mathbf{do}\\
& \quad \quad \quad \quad \color{red}{\text{rollout} \; S_{final} = \pi^{in/out}(S_{t-1}, \alpha, \mathbf{x})}\\
& \quad \quad \quad \quad \color{red}{\text{cost}\; c_{\alpha}=L(S_{final}, \mathbf{y})}\\
& \quad \quad \quad \text{extract features}\; \mathit{feat}=\phi(\mathbf{x}, S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (\mathit{feat},\mathbf{c})\\
& \quad \text{learn} \;H \; \text{from}\; \cal E\\
& \quad \text{decrease} \; \beta\\
\end{align}
</p>
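
A compact sketch of this loop is given below, with several simplifying assumptions: the `CostSensitiveTable` is a toy stand-in for a cost-sensitive classifier $H$, $\beta$ decays multiplicatively, and a fixed number of iterations serves as the termination condition. None of these names come from `il_tutorial`.

```python
import random
from collections import defaultdict

TAGS = ["Pronoun", "Modal", "Verb", "Noun"]


def hamming_loss(predicted, gold):
    return sum(p != g for p, g in zip(predicted, gold))


def features(tokens, state):
    return (tokens[len(state)], state[-1] if state else "NULL")


class CostSensitiveTable:
    """Toy stand-in for H: predicts the tag with the lowest summed cost per feature vector."""

    def __init__(self):
        self.costs = defaultdict(lambda: defaultdict(float))

    def fit(self, examples):
        self.costs.clear()
        for feat, cost_vector in examples:
            for tag, cost in cost_vector.items():
                self.costs[feat][tag] += cost
        return self

    def predict(self, feat):
        if feat not in self.costs:
            return TAGS[0]  # arbitrary fallback for unseen features
        return min(TAGS, key=lambda tag: self.costs[feat][tag])


def policy_action(tokens, state, gold, classifier, beta, rng):
    """Mixture policy: expert with probability beta, otherwise the classifier."""
    if classifier is None or rng.random() < beta:
        return gold[len(state)]
    return classifier.predict(features(tokens, state))


def train(dataset, loss, iterations=3, decay=0.5, seed=0):
    rng = random.Random(seed)
    examples, classifier, beta = [], None, 1.0
    for _ in range(iterations):
        for tokens, gold in dataset:
            rollin = []
            while len(rollin) < len(tokens):  # roll in with the mixture policy
                rollin.append(policy_action(tokens, rollin, gold, classifier, beta, rng))
            for t in range(len(tokens)):
                state = rollin[:t]
                cost_vector = {}
                for action in TAGS:  # explore each action, then roll out
                    final = state + [action]
                    while len(final) < len(tokens):
                        final.append(policy_action(tokens, final, gold, classifier, beta, rng))
                    cost_vector[action] = loss(final, gold)
                examples.append((features(tokens, state), cost_vector))
        classifier = CostSensitiveTable().fit(examples)
        beta *= decay
    return classifier, examples


tokens, gold = ["I", "can", "fly"], ["Pronoun", "Modal", "Verb"]
classifier, examples = train([(tokens, gold)], hamming_loss)
print(examples[0])
```

Each training example now carries a full cost vector rather than a single expert label, which is what lets the task loss enter training directly.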

### Rollouts

- can learn with non-decomposable losses
- can learn with sub-optimal experts
- expensive when there are many actions and long sequences are needed to complete outputs


- first proposed in SEARN ([Daumé III et al., 2009](http://hunch.net/~jl/projects/reductions/searn/searn.pdf))

- used to hybridise DAgger by [Vlachos and Clark (2014)](http://www.aclweb.org/anthology/Q14-1042), referred to later as V-DAgger ([Goodman et al., 2016](http://aclweb.org/anthology/P16-1001))

- also proposed as look-aheads ([Tsuruoka et al., 2011](http://www.anthology.aclweb.org/W/W11/W11-0328.pdf))

### LoLS

Locally Optimal Learning to Search ([Chang et al., 2015](https://arxiv.org/pdf/1502.02206.pdf))

<img src="images/lols.png" style="width:60%;">


- rollin always with the classifier
- each rollout uses either the expert only or the classifier only
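
One way to read the second bullet is that a single policy is drawn once per rollout and then used for every remaining step. The sketch below assumes that reading and a mixing probability `beta`. The function names and the toy policies are hypothetical.

```python
import random


def choose_rollout_policy(expert, classifier, beta, rng):
    """Pick one policy for an entire rollout: the expert with probability beta, else the classifier."""
    return expert if rng.random() < beta else classifier


def rollout(tokens, prefix, policy):
    """Complete the tag sequence using the same policy at every remaining step."""
    state = list(prefix)
    while len(state) < len(tokens):
        state.append(policy(tokens, state))
    return state


tokens, gold = ["I", "can", "fly"], ["Pronoun", "Modal", "Verb"]


def expert(tokens, state):
    return gold[len(state)]  # next gold tag


def always_noun(tokens, state):
    return "Noun"  # deliberately poor toy classifier


rng = random.Random(0)
for action in ["Modal", "Verb"]:
    policy = choose_rollout_policy(expert, always_noun, beta=0.5, rng=rng)
    print(action, rollout(tokens, ["Pronoun", action], policy))
```

This differs from a step-level mixture, where the expert and the classifier may alternate within the same rollout.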

<h3>Generic imitation learning</h3>

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H, \; \text{loss} \; L\\
& \text{set training examples}\; \cal E = \emptyset\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \color{red}{\text{set rollin policy} \; \pi^{in} = mix(H,\pi^{\star})}\\
& \quad \color{red}{\text{set rollout policy} \; \pi^{out} = mix(H,\pi^{\star})}\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \color{red}{\text{rollin to predict} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in}(\mathbf{x},\mathbf{y})}\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \color{red}{\text{rollout to obtain costs}\; c \; \text{for all possible actions using}\; L}\\
& \quad \quad \quad \text{extract features}\; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (\mathit{feat},c)\\
& \quad \text{learn}\; H \; \text{from}\; \cal E\\
\end{align}
</p>

### Overview

<table style="border-style: hidden">
<thead>
<tr>
<th style="padding: 10px;">Method</th>
<th style="padding: 10px;">rollin</th>
<th style="padding: 10px;">rollout</th>
<th style="padding: 10px;">loss</th>
<th style="padding: 10px;">expert decay</th>
<th style="padding: 10px;">training data</th>
</tr>
</thead>
<tbody>
<tr>
<td style="padding: 10px;">classification</td>
<td style="padding: 10px;">expert</td>
<td style="padding: 10px;">N/A</td>
<td style="padding: 10px;">0/1</td>
<td style="padding: 10px;">N/A</td>
<td style="padding: 10px;">single iteration</td>
</tr>
<tr>
<td style="padding: 10px;">DAgger</td>
<td style="padding: 10px;">mix</td>
<td style="padding: 10px;">N/A</td>
<td style="padding: 10px;">0/1</td>
<td style="padding: 10px;">decrease</td>
<td style="padding: 10px;">all iterations</td>
</tr>
<tr>
<td style="padding: 10px;">V-DAgger</td>
<td style="padding: 10px;">mix</td>
<td style="padding: 10px;">mix</td>
<td style="padding: 10px;">task</td>
<td style="padding: 10px;">exponential</td>
<td style="padding: 10px;">all iterations</td>
</tr>
<tr>
<td style="padding: 10px;">LoLS</td>
<td style="padding: 10px;">classifier</td>
<td style="padding: 10px;">action-level mix</td>
<td style="padding: 10px;">task</td>
<td style="padding: 10px;">no decay</td>
<td style="padding: 10px;">averaged across iterations</td>
</tr>
<tr>
<td style="padding: 10px;">SEARN</td>
<td style="padding: 10px;">mix</td>
<td style="padding: 10px;">mix</td>
<td style="padding: 10px;">task</td>
<td style="padding: 10px;">exponential</td>
<td style="padding: 10px;">weighted average across iterations</td>
</tr>
</tbody>
</table>

### Summary so far

- basic intuition behind imitation learning (IL)
- rollin and the DAgger algorithm 
- rollouts, V-DAgger and LoLS
- generic imitation learning recipe