### First part recap ###

Imitation Learning
- DAgger and LOLS.

- Meta-learning framework, over an existing classifier.

- In practice, generates more (and better) training data, to improve the existing classifier.

For applied Imitation Learning we need to define:
- Transition system
- Loss function
- Expert policy

### Part 2: NLP Applications and practical advice

- Applications:
    - Dependency parsing 
    - Natural language generation
    - Semantic parsing

- Practical advice
    - Expert policy definition
    - Accelerating cost estimation
    - Trouble-shooting

<center>
<h2>Applying Imitation Learning on Dependency Parsing</h2>
</center>

### Dependency parsing ###
###### ([Goldberg and Nivre 2012](http://www.aclweb.org/anthology/C12-1059), [Goldberg and Nivre 2013](https://www.aclweb.org/anthology/Q/Q13/Q13-1033.pdf)) ###### 

<img src="images/toBeAnimated/depParse1.png">

The goal is to represent the syntax of a sentence as directed, labeled edges between words.
- The labels name the type of dependency between the two words.

### What would we like to improve? ###

Transition-based parsing suffers from error propagation:
- Due to the greedy decoding of the incremental model.

- The first error will confuse the classifier, since the resulting state was not encountered during training.

- More errors are likely to follow, as we move into increasingly unfamiliar states.

### How can Imitation Learning help with that? ### 

Imitation Learning addresses error propagation:
- It considers the interaction between the action being considered and later actions in the sequence.

- Explores the unknown search space, but avoids enumerating all possible outputs.

- It also learns how to recover from errors.

### Before applying Imitation Learning ###

For each task, we need to define:
- Transition system
- Loss function
- Expert policy

### Transition system? ###

We can assume any of the proposed transition-based systems (Arc-Eager, Arc-Standard, Easy-First, etc.).

- In essence, each action decides which nodes to consider next, and which arc and label to add.
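
As a rough illustration of what such a transition system looks like, here is a minimal sketch assuming an unlabeled arc-eager system. The names `State`, `shift`, `left_arc`, `right_arc` and `reduce_top` are hypothetical and chosen for this example only; as on the slides that follow, ROOT starts in the buffer.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    stack: list
    buffer: list
    arcs: set = field(default_factory=set)  # (head, dependent) pairs


def shift(s):
    """Move the first buffer word onto the stack."""
    return State(s.stack + [s.buffer[0]], s.buffer[1:], set(s.arcs))


def left_arc(s):
    """Make the stack top a dependent of the first buffer word, then pop it."""
    new_arc = (s.buffer[0], s.stack[-1])
    return State(s.stack[:-1], list(s.buffer), s.arcs | {new_arc})


def right_arc(s):
    """Make the first buffer word a dependent of the stack top, then push it."""
    new_arc = (s.stack[-1], s.buffer[0])
    return State(s.stack + [s.buffer[0]], s.buffer[1:], s.arcs | {new_arc})


def reduce_top(s):
    """Pop the stack top (only sensible once it already has a head)."""
    return State(s.stack[:-1], list(s.buffer), set(s.arcs))


# The first few steps of the example sentence.
state = State([], ["ROOT", "economic", "news", "had"])
state = shift(state)     # Stack: ROOT
state = shift(state)     # Stack: ROOT, economic
state = left_arc(state)  # adds news -> economic and pops 'economic'
```

Each function returns a new `State` rather than mutating the old one, which makes it easy to branch from a state when exploring alternative actions.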

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_1.png">

<b>Stack:</b> -
<br>
<b>Buffer:</b> ROOT, 'economic', 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_1.png">

<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> 'economic', 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_1.png">

<b>Stack:</b> ROOT, 'economic'
<br>
<b>Buffer:</b> 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_2.png">

<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> 'news', 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_2.png">

<b>Stack:</b> ROOT, 'news'
<br>
<b>Buffer:</b> 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/transitionEx_3.png">

<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> 'had', 'little', 'effect', 'on', 'financial', 'markets', '.'

### Transition-based dependency parsing in action! ###

<img src="images/toBeAnimated/depParse1.png">

<b>Stack:</b> ROOT
<br>
<b>Buffer:</b> -

### Loss function? ###

Hamming loss: the number of incorrectly predicted labeled or unlabeled dependency arcs.

- Directly related to the attachment score metrics used to evaluate dependency parsers.
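
The loss above can be sketched in a few lines. This assumes each parse is stored as a dictionary from dependent to `(head, label)`; the function name `hamming_loss` and the dependency labels in the example are illustrative assumptions, not taken from a particular treebank.

```python
def hamming_loss(predicted, gold, labeled=True):
    """Count words whose predicted head (and label, if labeled) differs from gold."""
    errors = 0
    for dependent, (gold_head, gold_label) in gold.items():
        # A word with no predicted arc counts as an error.
        head, label = predicted.get(dependent, (None, None))
        if head != gold_head or (labeled and label != gold_label):
            errors += 1
    return errors


gold = {
    "economic": ("news", "amod"),
    "news": ("had", "nsubj"),
    "had": ("ROOT", "root"),
}
predicted = {
    "economic": ("news", "amod"),
    "news": ("had", "dobj"),  # correct head, wrong label
    "had": ("ROOT", "root"),
}

labeled_errors = hamming_loss(predicted, gold, labeled=True)
unlabeled_errors = hamming_loss(predicted, gold, labeled=False)

# The labeled attachment score is the fraction of words with no labeled error.
las = 1 - labeled_errors / len(gold)
```

Here the wrong label counts as an error only in the labeled variant, which mirrors the difference between labeled and unlabeled attachment scores.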

### Expert policy? ###

<center>
<img src="images/oracle-delphi.jpg">
</center>

By consulting the gold reference graphs:
- We can easily derive a <b> single static canonical </b> sequence of actions from initial to terminal state.

Static expert policies (π*) can work well for tasks where we do not care whether the previous actions were optimal or not.
- e.g. for the Part-of-Speech tagging task.

But they can be quite restrictive in tasks where a suboptimal action can have a negative effect on future actions.
- i.e. tasks with error propagation.

### Deterministic policy? But what if there are multiple correct transitions? ###

<img src="images/toBeAnimated/depParse_2.png">

<b>Stack:</b> 'her'
<br>
<b>Buffer:</b> 'a', 'letter', '.'

We can either: <i>reduce 'her'</i> or <i>shift 'a'</i>

How to choose?
- A deterministic policy may arbitrarily choose a transition (e.g. prioritize shifts over other actions).

Why choose?
- Choosing any one action indirectly labels the alternative actions as incorrect!
- This introduces noise in the training data.

### And what if a mistake happens? ###

<img src="images/toBeAnimated/depParse_mistake_1.png">

<b>Stack:</b> 'wrote'
<br>
<b>Buffer:</b> 'her', 'a', 'letter', '.'

### And what if a mistake happens? ###

<img src="images/toBeAnimated/depParse_mistake_2.png">

<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'a', 'letter', '.'

The static policy assumes all the past actions are optimal.
- It is undefined for states that are not part of the gold sequence of actions,
- and thus cannot recover from mistakes.

### And what if a mistake happens? ###

<img src="images/toBeAnimated/depParse3.png">

Worst case scenario: it will only perform actions in states it recognizes.

### Dynamic policy ###

i.e. a non-deterministic and complete policy.
- Allows for ambiguous transitions.

- Defined for all states.

- Recovers from errors.
  
<img src="images/toBeAnimated/depParse4.png">

### How does a dynamic expert policy work then? ###

Given a particular state, where an error may or may not have already occurred:

We need to determine the set of actions that lead from this state to the best reachable terminal state.

- Quite possibly not an optimal terminal state, if we have made an error before.

- Considered "best" according to some loss function, and in relation to the optimal terminal state.

The loss function used by the expert policy may not be the same as the overall task loss function.
- To save computation time, when calculating the expert policy.
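
To make the idea concrete, the sketch below finds the best reachable loss by exhaustive search over an unlabeled arc-eager system without a ROOT node. This brute-force search is only an illustration for tiny sentences and is not the efficient cost computation used in the cited papers; all function names and the hard-coded `GOLD` tree are assumptions for this example.

```python
from functools import lru_cache

# Gold (head, dependent) arcs for "wrote her a letter ."
GOLD = frozenset({("wrote", "her"), ("wrote", "letter"),
                  ("letter", "a"), ("wrote", ".")})


def legal_actions(stack, buffer, arcs):
    """Arc-eager actions allowed in this state."""
    top_has_head = bool(stack) and any(dep == stack[-1] for _, dep in arcs)
    actions = []
    if buffer:
        actions.append("SHIFT")
        if stack:
            actions.append("RIGHT")
            if not top_has_head:
                actions.append("LEFT")
    if stack and top_has_head:
        actions.append("REDUCE")
    return actions


def apply_action(action, stack, buffer, arcs):
    """Return the (stack, buffer, arcs) triple that results from the action."""
    if action == "SHIFT":
        return stack + buffer[:1], buffer[1:], arcs
    if action == "RIGHT":
        return stack + buffer[:1], buffer[1:], arcs | {(stack[-1], buffer[0])}
    if action == "LEFT":
        return stack[:-1], buffer, arcs | {(buffer[0], stack[-1])}
    return stack[:-1], buffer, arcs  # REDUCE


@lru_cache(maxsize=None)
def best_reachable_loss(stack, buffer, arcs):
    """Smallest number of missing gold arcs over all reachable terminal states."""
    if not buffer:  # terminal state
        return len(GOLD - arcs)
    return min(best_reachable_loss(*apply_action(a, stack, buffer, arcs))
               for a in legal_actions(stack, buffer, arcs))


def dynamic_expert(stack, buffer, arcs):
    """Return the set of best actions and the cost of every legal action."""
    costs = {a: best_reachable_loss(*apply_action(a, stack, buffer, arcs))
             for a in legal_actions(stack, buffer, arcs)}
    best = min(costs.values())
    return {a for a, c in costs.items() if c == best}, costs


# After a mistake: 'her' was shifted without a head, so loss 0 is out of reach.
after_error = (("wrote", "her"), ("letter", "."), frozenset({("letter", "a")}))
recover_actions, recover_costs = dynamic_expert(*after_error)

# No mistake so far: two different actions are both optimal.
ambiguous = (("wrote", "her"), ("a", "letter", "."), frozenset({("wrote", "her")}))
ambiguous_actions, ambiguous_costs = dynamic_expert(*ambiguous)
```

In the first state the expert returns only `LEFT`, with a best reachable loss of 1; in the second it returns both `SHIFT` and `REDUCE`, so neither is labeled as incorrect.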

### Expert policy in action! ###

<img src="images/toBeAnimated/depParse3.png">


<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'letter', '.'

### Expert policy in action! ###

<img src="images/toBeAnimated/depParse_expActions.png">


<b>Stack:</b> 'wrote', 'her'
<br>
<b>Buffer:</b> 'letter', '.'

The dynamic expert policy will consider all possible (even erroneous) actions.

### Expert policy in action! ###

<img src="images/toBeAnimated/depParse4.png">

Trying to find the action that leads to the best reachable terminal state. 
- A loss of 1, if we use the labeled attachment score as a loss function.

###  Applying Imitation Learning ###

[Goldberg and Nivre (2013)](https://www.aclweb.org/anthology/Q/Q13/Q13-1033.pdf) proposed a system that employed dynamic expert policies for dependency parsing.

- As well as an algorithm to learn parameters by exploration.

- Very similar to DAgger.
  - Roll-in is a mix of the learned and expert policies, at the step level.
  - There may be multiple correct actions at each time-step.
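
The step-level mix can be sketched as follows. This is a minimal sketch, assuming that `k` is the number of initial epochs that follow the expert only and that `p` is the probability of following the learned policy afterwards; the function name and the tie-breaking among expert actions are hypothetical.

```python
import random


def choose_roll_in_action(epoch, k, p, predicted_action, expert_actions, rng=random):
    """Pick the action used to advance the parser state during training."""
    # After the first k epochs, follow the learned policy with probability p
    # (assumed meaning of p), even when its prediction is wrong.
    if epoch > k and rng.random() < p:
        return predicted_action
    # Otherwise follow the expert. If the prediction is already one of the
    # correct actions, keep it; else pick a correct action deterministically.
    if predicted_action in expert_actions:
        return predicted_action
    return min(expert_actions)


# During the first k epochs the expert is always followed.
early = choose_roll_in_action(1, k=2, p=0.9, predicted_action="RIGHT",
                              expert_actions={"SHIFT", "REDUCE"})

# With p = 1.0 the learned policy is always followed after epoch k.
late = choose_roll_in_action(3, k=2, p=1.0, predicted_action="RIGHT",
                             expert_actions={"SHIFT", "REDUCE"})
```

Because `expert_actions` is a set, a prediction that matches any of the correct actions is accepted as is, which reflects the point that there may be several correct actions at a time-step.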

### Effect of k and p ###
<img src="images/dependHeatMaps.png">

### Results ###
<img src="images/dependResults.png">

### Summary so far ### 

We discussed modifications to the DAgger framework.
- Hard decay schedule after $k$ epochs when determining the roll-in and roll-out policies.
- Using a mix of expert and learned policy during roll-outs.

We showed that dynamic oracles improve on the results of static oracles.