### First part recap ###

Imitation Learning
- DAgger and LOLS.

- Meta-learning framework, over an existing classifier.

- In practice, generates more (and better) training data, to improve the existing classifier.

For applied Imitation Learning we need to define:
- Transition system
- Loss function
- Expert policy

### Part 2: NLP Applications and practical advice

- Applications:
    - Dependency parsing 
    - Natural language generation
    - Semantic parsing

- Practical advice
    - Expert policy definition
    - Accelerating cost estimation
    - Trouble-shooting

<center>
<h2>Applying Imitation Learning on Dependency Parsing</h2>
</center>

### Dependency parsing ###

<img src="images/toBeAnimated/depParse1.png">

Represents the syntax of a sentence as directed, labeled edges between words.
- Where labels represent dependencies between words.

### What would we like to improve? ###

Transition-based parsing suffers from error propagation:
- Due to the greedy decoding of the incremental model.

- The first error will confuse the classifier, since the resulting state was not encountered during training.

- More errors are likely to follow, as we move into increasingly more foreign states.

### How can Imitation Learning help with that? ### 

Imitation Learning addresses error propagation:
- It considers the interaction between the action being considered and later actions in the sequence.

- Explores the unknown search space, but avoids enumerating all possible outputs.

- It also learns how to recover from errors.

###  Applying Imitation Learning ###

[Goldberg and Nivre 2012](http://www.aclweb.org/anthology/C12-1059) and [Goldberg and Nivre 2013](https://www.aclweb.org/anthology/Q/Q13/Q13-1033.pdf) proposed an Imitation Learning system for dependency parsing.

Very similar to DAgger.
- There may be multiple correct actions at each time-step.

### DAgger Reminder ###

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; expert\; \pi^{\star}, \; loss \; function \; L,\\
& classifier \; H,\; training\; examples\; \cal E = \emptyset, \; expert\; probability\; \beta=1\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \text{set} \; rollin \; policy \; \pi^{in} = \beta\pi^{\star} + (1-\beta)H\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \text{rollin to predict} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in}(\mathbf{x},\mathbf{y})\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \text{ask expert for best action}\; \alpha^{\star} = \pi^{\star}(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \text{extract } features=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (features,\alpha^{\star})\\
& \quad \text{learn classifier} \; \text{from}\; \cal E\\
& \quad \text{decrease} \; \beta\\
\end{align}
</p>
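
The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: `MajorityClassifier`, the callback names (`expert`, `phi`, `step`, `is_terminal`), the multiplicative `decay` of β, and the toy tagging task are all assumptions made for the example. It also samples the expert per time-step with probability β, which is one possible reading of the mixed roll-in policy.

```python
import random
from collections import Counter, defaultdict


class MajorityClassifier:
    """Toy stand-in for the classifier H: most frequent action per feature."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, examples):
        self.counts = defaultdict(Counter)
        for features, action in examples:
            self.counts[features][action] += 1

    def predict(self, features, default):
        seen = self.counts.get(features)
        return seen.most_common(1)[0][0] if seen else default


def dagger(train, expert, phi, step, is_terminal, default_action,
           epochs=3, decay=0.5, seed=0):
    rng = random.Random(seed)
    classifier = MajorityClassifier()
    examples = []   # aggregated training examples E
    beta = 1.0      # expert probability
    for _ in range(epochs):
        for x, y in train:
            state = ()  # assumed initial state: an empty action history
            while not is_terminal(x, state):
                best = expert(x, y, state)         # ask expert for best action
                features = phi(x, state)
                examples.append((features, best))  # E = E ∪ (features, α*)
                # Roll-in: expert with probability beta, else the classifier.
                if rng.random() < beta:
                    action = best
                else:
                    action = classifier.predict(features, default_action)
                state = step(state, action)
        classifier.fit(examples)  # learn classifier from E
        beta *= decay             # decrease beta
    return classifier


# Toy tagging task (hypothetical data): one action per word.
train = [(("the", "dog", "barks"), ("DET", "NOUN", "VERB"))]
H = dagger(
    train,
    expert=lambda x, y, s: y[len(s)],
    phi=lambda x, s: x[len(s)],
    step=lambda s, a: s + (a,),
    is_terminal=lambda x, s: len(s) == len(x),
    default_action="DET",
)
print(H.predict("dog", default="DET"))  # NOUN
```

Note that the expert's action is recorded at every visited state, whichever policy chose the action actually taken during the roll-in.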

### Before applying Imitation Learning ###

For each task, we need to define:
- Transition system
- Loss function
- Expert policy

### Transition system? ###

We can assume any of the proposed transition-based systems (e.g. Arc-Eager).

- The length of the transition sequence is variable.
  - vs. POS tagging where it is fixed to the length of the sentence.

- The state consists of the already inserted arcs, a stack to keep track of the nodes under examination, and a buffer with the unexamined nodes.

- Action space for Arc-Eager consists of Shift, Reduce, Arc-Left, and Arc-Right actions.
  - Arc-Left and Arc-Right are further conditioned by particular labels,
  - but there is a limited number of labels in dependency parsing (#).
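
As a rough sketch of the state and action space described above, the following Python models an Arc-Eager configuration and its four actions. The names `State`, `initial_state` and `apply_action` are hypothetical, the labels `amod`, `nsubj` and `root` are only illustrative, and nodes are identified by their word strings for readability (a real parser would use token indices).

```python
from dataclasses import dataclass, field


@dataclass
class State:
    stack: list
    buffer: list
    arcs: set = field(default_factory=set)  # (head, label, dependent) triples


def initial_state(words):
    # Empty stack, with ROOT at the front of the buffer.
    return State(stack=[], buffer=["ROOT"] + list(words))


def has_head(state, node):
    return any(dep == node for _, _, dep in state.arcs)


def apply_action(state, action, label=None):
    stack, buffer, arcs = list(state.stack), list(state.buffer), set(state.arcs)
    if action == "Shift":        # move the first buffer node onto the stack
        stack.append(buffer.pop(0))
    elif action == "Reduce":     # pop a node that already has a head
        assert has_head(state, stack[-1])
        stack.pop()
    elif action == "Arc-Left":   # first buffer node becomes head of stack top
        dep = stack.pop()
        assert dep != "ROOT" and not has_head(state, dep)
        arcs.add((buffer[0], label, dep))
    elif action == "Arc-Right":  # stack top becomes head of first buffer node
        dep = buffer.pop(0)
        arcs.add((stack[-1], label, dep))
        stack.append(dep)
    else:
        raise ValueError(f"unknown action: {action}")
    return State(stack, buffer, arcs)


s = initial_state(["economic", "news", "had"])
for action, label in [("Shift", None), ("Shift", None), ("Arc-Left", "amod"),
                      ("Shift", None), ("Arc-Left", "nsubj")]:
    s = apply_action(s, action, label)
print(s.stack, s.buffer)  # ['ROOT'] ['had']
```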

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_1.png">

<b>Stack:</b> -
<br>
<b>Buffer:</b> ROOT, 'economic', 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_1.png">

<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> 'economic', 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_1.png">

<b>Stack:</b> ROOT, 'economic'
<br>
<b>Buffer:</b> 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_2.png">

<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_2.png">

<b>Stack:</b> ROOT, 'news'
<br>
<b>Buffer:</b> 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_3.png">

<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/depParse1.png">

<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> -

### Loss function? ###

Hamming loss: given the predicted arcs, how many parents were incorrectly predicted. 

- Directly corresponds to the attachment score metrics used to evaluate dependency parsers.
- Decomposable? Not with this transition system: shift actions cannot be scored independently of the arc actions.
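
To make the loss concrete, here is a small sketch that counts incorrectly predicted parents and converts the count into an unlabeled attachment score. The function names and the dictionary representation (word to parent) are assumptions for illustration, and the example parse is a hand-made fragment of the running sentence.

```python
def hamming_loss(predicted_heads, gold_heads):
    """Number of words whose predicted parent differs from the gold parent."""
    assert predicted_heads.keys() == gold_heads.keys()
    return sum(predicted_heads[w] != gold_heads[w] for w in gold_heads)


def unlabeled_attachment_score(predicted_heads, gold_heads):
    """Fraction of words that were attached to the correct parent."""
    return 1 - hamming_loss(predicted_heads, gold_heads) / len(gold_heads)


gold = {"economic": "news", "news": "had", "had": "ROOT",
        "little": "effect", "effect": "had"}
# One attachment error: 'little' is attached to 'had' instead of 'effect'.
predicted = dict(gold, little="had")

print(hamming_loss(predicted, gold))  # 1
```

The loss is only defined once arcs have been predicted, which is why it cannot be assigned to an individual shift action.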

### Expert policy? ###

<center>
<img src="images/oracle-delphi.jpg">
</center>

By consulting the gold reference graphs:
- We can easily derive a <b> single static canonical </b> sequence of actions from initial to terminal state, to form a static expert policy.


A static expert is only defined on states visited through the single static canonical sequence of actions.

The static policy assumes all the past actions are optimal.
- It is suboptimal for states that are not part of the static sequence of actions, and thus cannot recover from mistakes.
  - In our example, it defaults to shift actions.

Static expert policies (π*) can work well for tasks where we do not care whether the previous actions were optimal or not.
- e.g. for the Part-of-Speech tagging task.
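
The limitation of a static expert can be illustrated with a short sketch. It assumes, for simplicity, that a state is identified by the action history that produced it; `make_static_expert` and the canonical sequence below are hypothetical.

```python
def make_static_expert(canonical_actions, default="Shift"):
    """Static expert that replays a single canonical action sequence."""
    def expert(history):
        t = len(history)
        on_path = list(history) == list(canonical_actions[:t])
        if on_path and t < len(canonical_actions):
            return canonical_actions[t]
        # Off the canonical path the expert has no informed advice to give.
        return default
    return expert


canonical = ["Shift", "Shift", "Arc-Left", "Shift", "Arc-Left"]
expert = make_static_expert(canonical)

print(expert(["Shift", "Shift"]))      # Arc-Left (still on the canonical path)
print(expert(["Shift", "Arc-Right"]))  # Shift (fallback after a mistake)
```

After a single deviation every later query falls through to the default action, so the expert offers no guidance on how to recover.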

### Static policy? But what if there are multiple correct transitions? ###

<img src="images/toBeAnimated/depParse_2.png">

<b>Stack:</b> 'her'
<br>
<b>Buffer:</b> 'a', 'letter', '.'

We can either: <i>reduce 'her'</i> or <i>shift 'a'</i>

How to choose?
- A deterministic policy may arbitrarily choose a transition (e.g. prioritize shifts over other actions).

Why choose?
- Choosing any one action indirectly labels the alternative actions as incorrect!
- Which introduces noise in the training data.
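
One way to avoid this noise is to treat the expert's answer as a set of correct actions rather than a single label. The sketch below assumes the classifier exposes a score per action; the function names and the scores are made up for illustration, and the choice of the highest-scoring correct action as the training target is an assumption rather than something stated on these slides.

```python
def is_error(predicted_action, correct_actions):
    """A prediction counts as an error only if no correct action matches it."""
    return predicted_action not in correct_actions


def training_target(scores, correct_actions):
    """Among the correct actions, pick the one the model already prefers."""
    return max(correct_actions, key=lambda action: scores[action])


scores = {"Shift": 0.2, "Reduce": 0.7, "Arc-Left": 0.9}  # hypothetical scores
correct = {"Shift", "Reduce"}                            # both are acceptable

print(is_error("Reduce", correct))       # False
print(training_target(scores, correct))  # Reduce
```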

### And what if the learned policy makes a mistake during the roll-in? ###

<img src="images/toBeAnimated/depParse_mistake_1.png">

<b>Stack:</b> 'wrote'
<br>
<b>Buffer:</b> 'her', 'a', 'letter', '.'

### And what if a mistake happens? ###

<img src="images/toBeAnimated/depParse_mistake_2.png">

<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'a', 'letter', '.'

### And what if a mistake happens? ###

<img src="images/toBeAnimated/depParse3.png">

Worst-case scenario: the static expert can only suggest actions in states it recognizes.

### How does a dynamic expert policy work then? ###

We need to determine the set of actions that lead from this state to the best reachable terminal state.

- Quite possibly not an optimal terminal state, if we have made an error before.

- Considered "best" according to some loss function, and in relation to the optimal terminal state.

For each possible action:
- perform a roll-out till the terminal state.
- score the state according to the loss function.
- return the set of actions that lead to the best reachable terminal state.
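
The roll-out procedure above can be sketched generically. Everything here is an assumption made for illustration: the callback names, the use of a tagging-style toy task instead of a parser, and the roll-out policy that simply follows the gold standard for the remaining steps.

```python
def dynamic_expert(state, legal_actions, step, is_terminal, rollout_policy, loss):
    """Return the actions leading to the best reachable terminal state."""
    costs = {}
    for action in legal_actions(state):
        s = step(state, action)
        while not is_terminal(s):         # roll out until the terminal state
            s = step(s, rollout_policy(s))
        costs[action] = loss(s)           # score the terminal state
    best = min(costs.values())
    return {a for a, c in costs.items() if c == best}, costs


gold = ("DET", "NOUN", "VERB")
state = ("NOUN",)  # the first action was already a mistake

best_actions, costs = dynamic_expert(
    state,
    legal_actions=lambda s: ["DET", "NOUN", "VERB"],
    step=lambda s, a: s + (a,),
    is_terminal=lambda s: len(s) == len(gold),
    rollout_policy=lambda s: gold[len(s)],
    loss=lambda s: sum(p != g for p, g in zip(s, gold)),
)
print(best_actions, costs["NOUN"])  # {'NOUN'} 1
```

The best reachable terminal state still has a loss of 1 because of the earlier mistake, which is exactly the situation a static expert cannot handle.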

Or we can heuristically infer an action sequence by comparing to the gold standard.


The loss function used by the expert policy may not be the same as the overall task loss function.
- A cheaper loss can save computation time when calculating the expert policy.

### Expert policy in action! ###

<img src="images/toBeAnimated/depParse3.png">


<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'letter', '.'

### Expert policy in action! ###

<img src="images/toBeAnimated/depParse_expActions.png">


<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'letter', '.'

The dynamic expert policy will consider all possible (even erroneous) actions.

### Expert policy in action! ###

<img src="images/toBeAnimated/depParse4.png">

Trying to find the action that leads to the best reachable terminal state. 
- A loss of 1, if we use the labeled attachment score as a loss function.

### Dynamic policy ###

- Allows for different transitions to reach the optimal state.
- Recovers from errors of the learned policy.

Imitation learning assumes a dynamic policy.
- Otherwise, it cannot explore alternative actions!

### Effect of k and p ###
<img src="images/dependHeatMaps.png">

### Results ###
<img src="images/dependResults.png">

### Summary so far ### 

We discussed modifications to the DAgger framework.
- Hard decay schedule after $k$ epochs when determining the roll-in and roll-out policies.
- Using a mix of expert and learned policy during roll-outs.
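
A hard decay schedule of this kind might look as follows. This is a sketch under stated assumptions: `choose_rollin_action` is a hypothetical helper, and $p$ is interpreted here as the probability of following the learned policy once the first $k$ epochs have passed.

```python
import random


def choose_rollin_action(epoch, k, p, expert_action, predicted_action, rng=random):
    """Expert only for the first k epochs, then the learned policy with probability p."""
    if epoch > k and rng.random() < p:
        return predicted_action
    return expert_action


rng = random.Random(0)
# Epochs 1..k always follow the expert, regardless of p.
print(choose_rollin_action(1, k=2, p=0.9, expert_action="Shift",
                           predicted_action="Reduce", rng=rng))  # Shift
```

Setting `p=0.0` recovers pure expert roll-ins, while `p=1.0` switches entirely to the learned policy after epoch `k`.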

We showed that dynamic oracles improve on the results of static oracles.