In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import il_tutorial.cost_graphs as cg
import il_tutorial.util as util
from IPython.display import HTML

In [4]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

<center>
<h2>Natural Language Processing</h2>
<br>
<small>(COM4513/6513)</small>
<br>
<h2>Imitation Learning for Structured Prediction</h2>
<p style="text-align:center">
<a href="http://andreasvlachos.github.io">Andreas Vlachos</a><br>
a.vlachos@sheffield.ac.uk<br>
<small>Department of Computer Science<br>
University of Sheffield
</small>
</p>


<p style="text-align:center">Based on the <a href="http://sheffieldnlp.github.io/ImitationLearningTutorialEACL2017/">EACL2017 tutorial</a><br>with <a href="http://glampouras.github.io">Gerasimos Lampouras</a> and 
<a href="http://www.riedelcastro.org/">Sebastian Riedel</a></p>

</center>

<h3>Some structured prediction tasks we know</h3>

<table  style="border-style: hidden; border-collapse: collapse; padding: 50px">
<thead>
<tr>
<th>I</th> 
<th>studied</th>
<th>in</th>
<th>London</th>
<th>with</th>
<th>Sebastian</th>
<th>Riedel</th>
</tr>
</thead>
<tbody style="font-size:100%">
<tr>
<td><span class="fragment" data-fragment-index="1">PRP</span></td>
<td><span class="fragment" data-fragment-index="1">VBD</span></td>
<td><span class="fragment" data-fragment-index="1">IN</span></td>
<td><span class="fragment" data-fragment-index="1">NNP</span></td>
<td><span class="fragment" data-fragment-index="1">IN</span></td>
<td><span class="fragment" data-fragment-index="1">NNP</span></td>
<td><span class="fragment" data-fragment-index="1">NNP</span></td>
</tr>
<tr>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">B-LOC</span></td>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">B-PER</span></td>
<td><span class="fragment" data-fragment-index="2">I-PER</span></td>
</tr>
</tbody>
</table>

<p>
				<ul>
  			<li class="fragment" data-fragment-index="1">part of speech (PoS) tagging</li>
  			<li class="fragment" data-fragment-index="2">named entity recognition (NER)</li>
				</ul>
			</p>


<p><b>Input:</b> a sentence $\mathbf{x}=[x_1...x_N]$<br> <b>Output:</b> a sequence of labels $\mathbf{y}=[y_{1}\ldots y_{N}] \in {\cal Y}^N$</p>

<h3>More Structured Prediction</h3>

<img src="images/toBeAnimated/depParse1.png" style="width:100%;">

<p>Syntactic parsing, but also semantic parsing, semantic role labeling, question answering over knowledge bases, etc.</p>
<p><b>Input:</b> a sentence $\mathbf{x}=[x_1...x_N]$<br>
<b>Output:</b> a meaning representation graph $\mathbf{G}=(V,E) \in {\cal G_{\mathbf{x}}}$</p> 

<h3>More Structured Prediction</h3>

<img src="images/nlg.png" style="width:90%;">

<p>Natural language generation (NLG), but also summarization, decoding in machine translation, etc.</p>

<p><b>Input:</b> a meaning representation<br>
<b>Output:</b> $\mathbf{w}=[w_1...w_N], w_i\in {\cal V}\cup \{END\}, w_N=END$</p>  

### In this lecture: Imitation Learning

<p style="float: left;">We assume gold standard<br> output for training</p> 
<img src="images/tikz/StucturedPredictionDef.png" style="width:40%; float: right;">

<p style="float: left;">But we train a classifier to predict<br> <b>actions</b> constructing the output.</p> 
<img src="images/tikz/StucturedPrediction.png" style="width:40%; float: right;">

### Originated in robotics


<a href="http://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-Slides.pdf"><img src="images/imitationFromRoss.png" style="width:75%;"></a>


**Meta-learning**: better model (&asymp;policy) by generating better training data from expert demonstrations. 

### Two main paradigms

Joint modeling, a.k.a.:
- global inference
- structured models

Incremental modeling, a.k.a.:
- local 
- greedy
- pipeline
- transition-based
- history-based

### Joint modeling

A model (e.g. conditional random fields) that scores complete outputs (e.g. label sequences):

$$\mathbf{\hat y} =\hat y_{1}\ldots \hat y_{N} = \mathop{\arg \max}_{y_{1}\ldots y_{N} \in {\cal Y}^N} f(y_{1}\ldots y_{N}, \mathbf{x})$$

<ul class="fragment">
					<li>exhaustive exploration of the search space</li>
					<li>large/complex search spaces are challenging</li>
					<li>efficient dynamic programming restricts modelling flexibility
						(i.e. Markov assumptions)</li>
				</ul>


### Incremental modeling

A classifier predicting a label at a time given the previous ones:


\begin{align}
\hat y_1 &=\mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}),\\
\mathbf{\hat y} = \quad \hat y_2 &=\mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}, \hat y_1), \cdots\\
\hat y_N &=\mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}, \hat y_{1} \ldots \hat y_{N-1})
\end{align}

<ul class="fragment">
					<li>use our favourite classifier</li>
					<li>no restrictions on features</li>
					<li>prone to error propagation (i.i.d. assumption broken)</li>
					<li>local model not trained wrt the task-level loss</li>
				</ul>
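
A minimal sketch of incremental decoding and of how one early mistake can derail later predictions. The scoring function `f` and its hand-set `LEXICAL` and `TRANSITION` scores are assumptions made up for illustration, not a trained model.

```python
LABELS = ["Pronoun", "Modal", "Verb", "Noun"]

# Hand-set toy scores (an assumption for illustration, not a trained model).
LEXICAL = {("I", "Pronoun"): 2.0, ("can", "Modal"): 1.0, ("can", "Verb"): 0.8,
           ("fly", "Verb"): 1.0, ("fly", "Noun"): 0.9}
TRANSITION = {("Modal", "Verb"): 1.0, ("Verb", "Noun"): 0.5}


def f(label, sentence, history):
    """Stand-in for f(y, x, y_1..y_{t-1}): lexical score plus previous-tag score."""
    token = sentence[len(history)]
    prev = history[-1] if history else "NULL"
    return LEXICAL.get((token, label), 0.0) + TRANSITION.get((prev, label), 0.0)


def greedy_decode(sentence, prefix=()):
    """Predict one label at a time, each conditioned on the labels chosen so far."""
    history = list(prefix)
    while len(history) < len(sentence):
        history.append(max(LABELS, key=lambda y: f(y, sentence, history)))
    return history


sentence = ["I", "can", "fly"]
clean = greedy_decode(sentence)
# Force a mistake on "can" and let the classifier continue from it.
derailed = greedy_decode(sentence, prefix=["Pronoun", "Verb"])
print(clean)     # ['Pronoun', 'Modal', 'Verb']
print(derailed)  # ['Pronoun', 'Verb', 'Noun']
```

With these toy scores, the forced error on "can" changes the previous-tag feature seen at "fly", so the last prediction goes wrong as well.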


### Imitation learning

Improve incremental modeling to:
- address error-propagation
- train wrt the task-level loss function

**Meta-learning**: use our favourite classifier and features,
but generate better (non-i.i.d.) training data

To apply IL we need:
- transition system (what our classifier can do)
- task loss (what we optimize for)
- expert policy (the teacher to help us)

<h3>Transition system</h3>

<p>The <b>actions</b> $\cal A$ the classifier $f$ can predict and their effect on the <b>state</b> which tracks the prediction: $S_{t+1}=S_1(\alpha_1\ldots\alpha_t)$</p>

<img src="images/tikz/IncrementalStructure.png" style="align:center; width:65%">

<h3>Transition system</h3>

<p style="text-align: left; border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 75%">
\begin{align}
& \textbf{Input:} \; sentence \; \mathbf{x}\\
& state \; S_1=initialize(\mathbf{x}); timestep \; t = 1\\
& \mathbf{while}\; S_t \; \text{not final}\; \mathbf{do}\\
& \quad action \; \alpha_t = \mathop{\arg \max}_{\alpha \in {\cal A}} f(\alpha, \mathbf{x}, S_t)\\
& \quad S_{t+1}=S_t(\alpha_t); t=t+1\\
\end{align}
</p>

<ul>
<li><b>PoS tagging?</b> <span class="fragment">for each word, left-to-right, predict a PoS tag which is added to the output</span></li>
<li class="fragment"><b>Named entity recognition?</b> <span class="fragment">As above, just use NER tags!</span></li>
</ul>
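
The loop above can be sketched for tagging as follows. `TaggingState`, `run` and the lookup-based `toy_policy` are hypothetical names introduced for this illustration; the policy stands in for the classifier $f$.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TaggingState:
    """State S_t: the input sentence plus the tags predicted so far."""
    sentence: Tuple[str, ...]
    tags: List[str] = field(default_factory=list)

    def is_final(self) -> bool:
        # The state is final once every token has received a tag.
        return len(self.tags) == len(self.sentence)

    def apply(self, action: str) -> "TaggingState":
        """S_{t+1} = S_t(alpha_t): the action appends one tag to the output."""
        return TaggingState(self.sentence, self.tags + [action])


def run(sentence, policy):
    """Initialise S_1 and apply the policy's actions until the state is final."""
    state = TaggingState(tuple(sentence))
    while not state.is_final():
        state = state.apply(policy(state))
    return state


def toy_policy(state):
    """Hypothetical stand-in for the classifier: a lookup with NNP as fallback."""
    lookup = {"I": "PRP", "studied": "VBD", "in": "IN", "with": "IN"}
    token = state.sentence[len(state.tags)]
    return lookup.get(token, "NNP")


final = run("I studied in London with Sebastian Riedel".split(), toy_policy)
print(final.tags)  # ['PRP', 'VBD', 'IN', 'NNP', 'IN', 'NNP', 'NNP']
```

Swapping the tag inventory for BIO tags gives the NER transition system without changing the loop.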

### Supervising the classifier

What are good actions in incremental structured prediction?

Those that reach $S_{final} = S_1(\alpha_1\ldots\alpha_T)$ with low **task loss**:

$$loss  = L(S_{final}, \mathbf{y}) \geq 0$$

<ul>
<li><b>PoS tagging?</b> <span class="fragment">Hamming loss: number of incorrect tags</span></li>
<li class="fragment"><b>NER?</b> <span class="fragment">number of false positives and false negatives</span></li>
</ul>
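
Both losses are short to write down. This is a sketch under the assumption that entities are compared as exact (text, type) pairs; the function names and the example predictions are made up for illustration.

```python
def hamming_loss(predicted, gold):
    """Number of positions where the predicted tag differs from the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("sequences must have the same length")
    return sum(p != g for p, g in zip(predicted, gold))


def fp_fn_loss(predicted_entities, gold_entities):
    """False positives plus false negatives over sets of entities."""
    predicted_entities, gold_entities = set(predicted_entities), set(gold_entities)
    false_positives = len(predicted_entities - gold_entities)
    false_negatives = len(gold_entities - predicted_entities)
    return false_positives + false_negatives


gold_pos = ["PRP", "VBD", "IN", "NNP", "IN", "NNP", "NNP"]
pred_pos = ["PRP", "VBD", "IN", "NNP", "IN", "NNP", "VBD"]  # last tag is wrong
print(hamming_loss(pred_pos, gold_pos))  # 1

gold_entities = {("London", "LOC"), ("Sebastian Riedel", "PER")}
pred_entities = {("London", "LOC"), ("Sebastian", "PER")}  # truncated person name
print(fp_fn_loss(pred_entities, gold_entities))  # 2
```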

### Action assessment 

<table style="font-size:80%; border-style: hidden; border-collapse: collapse; padding: 50px">
<thead>
<tr>
<th>I</th> 
<th>studied</th>
<th>in</th>
<th>London</th>
<th>with</th>
<th>Sebastian</th>
<th>Riedel</th>
</tr>
</thead>
<tbody>
<tr>
<td>PRP</td>
<td>VBD</td>
<td>IN</td>
<td>NNP</td>
<td>IN</td>
<td>NNP</td>
<td><span class="fragment" data-fragment-index="1">NNP</span></td>
</tr>
<tr>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">B-LOC</span></td>
<td><span class="fragment" data-fragment-index="2">O</span></td>
<td><span class="fragment" data-fragment-index="2">B-PER</span></td>
<td><span class="fragment" data-fragment-index="3">I-PER</span></td>
</tr>
</tbody>
</table>

<p>How many incorrect PoS tags due to $\alpha_6$  being NNP? <span class="fragment" data-fragment-index="1"><b>0</b></span>
</p>
<p class="fragment" data-fragment-index="2"> How many $FP+FN$ due to $\alpha_6$ being B-PER? <span class="fragment" data-fragment-index="3"><br><b>Depends!</b> If $\alpha_7$ is</span>
<ul class="fragment" data-fragment-index="3">
<li>I-PER:  $0$ (correct)</li> 
<li>O: $2$ (1FP+1FN)</li>
<li>B-*: $3$ (2FP+1FN)</li>
</ul>
</p>
<p class="fragment" data-fragment-index="3">$FP+FN$ loss is <b>non-decomposable</b> wrt the transition system
</p>
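
The three cases can be checked by comparing entity sets, with $FP = |\text{pred} \setminus \text{gold}|$ and $FN = |\text{gold} \setminus \text{pred}|$. This worked version assumes entities must match exactly on span and type, and takes B-LOC as the B-* example:

\begin{align}
\text{gold} &= \{\text{London}_{LOC},\ \text{Sebastian Riedel}_{PER}\}\\
\alpha_7=\text{I-PER}: \quad \text{pred} &= \{\text{London}_{LOC},\ \text{Sebastian Riedel}_{PER}\}, & FP+FN &= 0+0 = 0\\
\alpha_7=\text{O}: \quad \text{pred} &= \{\text{London}_{LOC},\ \text{Sebastian}_{PER}\}, & FP+FN &= 1+1 = 2\\
\alpha_7=\text{B-LOC}: \quad \text{pred} &= \{\text{London}_{LOC},\ \text{Sebastian}_{PER},\ \text{Riedel}_{LOC}\}, & FP+FN &= 2+1 = 3
\end{align}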

### Expert policy

Returns the best action at the current state by looking at the gold standard assuming **future actions are also optimal**:

$$\alpha^{\star}=\pi^{\star}(S_t, \mathbf{y}) = \mathop{\arg \min}_{\alpha \in {\cal A}} L(S_t(\alpha,\pi^{\star}),\mathbf{y})$$

<p style="float: left;">Only available for the training data: an expert<br>demonstrating how to perform the task </p> <a href="http://www.salon.com/2016/10/06/what-makes-a-good-teacher-why-certifications-and-standards-dont-guarantee-quality-educators_partner/"><img src="images/english_teacher.jpg" style="width:20%; float: right;"></a>

### Expert policy

**In pairs**: What action should $\pi^{\star}$ return?

<table style="border-style: hidden; border-collapse: collapse; padding: 50px">
<thead>
<tr>
<th>I</th> 
<th>studied</th>
<th>in</th>
<th>London</th>
<th>with</th>
<th>Sebastian</th>
<th>Riedel</th>
</tr>
</thead>
<tbody>
<tr>
<td>O</td>
<td>O</td>
<td>O</td>
<td>B-LOC</td>
<td>O</td>
<td>B-PER</td>
<td><span class="fragment" data-fragment-index="1">I-PER</span></td>
</tr>
<tr class="fragment" data-fragment-index="2">
<td>O</td>
<td>O</td>
<td>O</td>
<td>B-LOC</td>
<td>O</td>
<td>O</td>
<td><span class="fragment" data-fragment-index="3">O</span></td>
</tr>
</tbody>
</table> 

Takes previous actions into account (**dynamic** vs **static**)

Finding the optimal action can be expensive but we can learn with **sub-optimal** experts.
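
A brute-force sketch of a dynamic expert for the NER example: it scores every action by the lowest $FP+FN$ loss that any completion can still reach. The tag inventory, the helper names and the convention that a stray I- tag is ignored are assumptions for this illustration. The enumeration is exponential in the number of remaining tokens, which is why it is only feasible on toy inputs.

```python
from itertools import product

ACTIONS = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]


def bio_spans(tags):
    """Set of (start, end, type) spans encoded by BIO tags.

    Assumption: an I- tag that does not continue a span of the same type is ignored.
    """
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and kind == tag[2:]
        if not continues:
            if start is not None:
                spans.add((start, i, kind))
            start, kind = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return spans


def fp_fn(predicted, gold):
    """False positives plus false negatives over exact span matches."""
    p, g = bio_spans(predicted), bio_spans(gold)
    return len(p - g) + len(g - p)


def expert(prefix, gold):
    """Dynamic expert: the action whose best possible completion has the lowest loss."""
    remaining = len(gold) - len(prefix) - 1

    def best_loss(action):
        return min(fp_fn(list(prefix) + [action] + list(rest), gold)
                   for rest in product(ACTIONS, repeat=remaining))

    return min(ACTIONS, key=best_loss)  # ties go to the earliest action in ACTIONS


gold = ["O", "O", "O", "B-LOC", "O", "B-PER", "I-PER"]
print(expert(gold[:6], gold))                             # I-PER
print(expert(["O", "O", "O", "B-LOC", "O", "O"], gold))   # O
```

A static expert would return the gold tag I-PER in both cases; the dynamic one returns O once "Sebastian" has been mislabelled, because no action can recover the missed entity any more.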

### Imitation learning for part-of-speech tagging

<table style="border-style: hidden; border-collapse: collapse; padding: 50px">
<thead>
<tr>
<th>I</th>
<th>can</th>
<th>fly</th>
</tr>
</thead>
<tbody>
<tr>
<td><span>Pronoun</span></td>
<td><span>Modal</span></td>
<td><span>Verb</span></td>
</tr>
</tbody>
</table>

 **Task loss**: <span class="fragment">Hamming loss: number of incorrectly predicted tags</span>

**Transition system**: <span class="fragment">Tag each token left-to-right</span>

**Expert policy**: <span class="fragment">Return the next tag from the gold standard</span>

<h3>Gold standard in search space</h3>

In [5]:
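# Each path is a list of (column index, row index) cells into `columns` and `rows`:
# (0,4) is the NULL start cell, (1,3) tags "I" as Pronoun,
# (2,2) tags "can" as Modal and (3,1) tags "fly" as Verb.
paths = [[],[(0,4),(1,3)],[(0,4),(1,3),(2,2)],[(0,4),(1,3),(2,2),(3,1)]]
rows = ['Noun', 'Verb', 'Modal', 'Pronoun','NULL']
columns = ['NULL','I', 'can', 'fly']
# Draw one grid per partial path and show them as a carousel.
cbs = []
for path in paths:
    cbs.append(cg.draw_cost_breakdown(rows, columns, path))
util.Carousel(cbs)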

<p>
<ul>
<li>Three actions to complete the output</li>
<li>Expert policy replicates the gold standard</li>
</ul>
</p>

<h3>Training a classifier<span class="fragment" data-fragment-index="1"> with structure features </span></h3>

In [6]:
gold_path = [(0,4),(1,3),(2,2),(3,1)]
cb_gold = cg.draw_cost_breakdown(rows, columns, gold_path)
cb_gold

<table style="font-size:100%; border-style:hidden; border-collapse:collapse; padding:50px; float:left;">
<thead>
<tr>
<th>timestep</th>
<th>label ($\alpha_t$)</th>
<th>features ($\phi(S_{t-1},\mathbf{x})$)</th>
</tr>
</thead>
<tbody>
<tr>
<td> $t=1$ </td>
<td><b>Pronoun</b></td>
<td>token=I, ...<span class="fragment" data-fragment-index="1">, prev=<b>NULL</b></span></td>
</tr>
<tr>
<td> $t=2$ </td>
<td><b>Modal</b></td>
<td>token=can, ...<span class="fragment" data-fragment-index="1">, prev=<b>Pronoun</b></span></td>
</tr>
<tr>
<td> $t=3$ </td>
<td><b>Verb</b></td>
<td>token=fly, ...<span class="fragment" data-fragment-index="1">, prev=<b>Modal</b></span></td>
</tr>
</tbody>
</table>

### Algorithm

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H\\
& \text{set training examples}\; \cal E = \emptyset\\
& \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \text{generate expert trajectory} \; \alpha_1^{\star}\dots \alpha_T^{\star}  = \pi^{\star}(\mathbf{x},\mathbf{y})\\
& \quad \mathbf{for} \; \alpha^{\star}_t \in \alpha_1^{\star}\dots \alpha_T^{\star} \; \mathbf{do}\\
& \quad \quad \text{extract features}\; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \cal E = \cal E \cup (\mathit{feat},\alpha^{\star}_t)\\
& \text{learn} \; H\; \text{from}\; \cal E\\
\end{align}
</p>
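
The algorithm can be sketched in a few lines for the tagging example. `CountClassifier` is a toy stand-in for $H$ that votes with per-feature label counts; it and the other helper names are assumptions for illustration, and any multiclass classifier could take its place.

```python
from collections import Counter, defaultdict


def features(sentence, tags):
    """phi(x, S_{t-1}): the current token and the previously predicted tag."""
    t = len(tags)
    prev = tags[-1] if tags else "NULL"
    return (f"token={sentence[t]}", f"prev={prev}")


def expert_trajectory(sentence, gold):
    """For tagging with Hamming loss, the expert replays the gold tags."""
    return list(gold)


def collect_examples(train):
    """Build E: one (features, expert action) pair per step of each expert trajectory."""
    examples = []
    for sentence, gold in train:
        tags = []
        for action in expert_trajectory(sentence, gold):
            examples.append((features(sentence, tags), action))
            tags.append(action)  # the next state follows the expert
    return examples


class CountClassifier:
    """Toy stand-in for H: each feature votes with its label counts."""

    def fit(self, examples):
        self.counts = defaultdict(Counter)
        for feats, label in examples:
            for feat in feats:
                self.counts[feat][label] += 1
        return self

    def predict(self, feats):
        votes = Counter()
        for feat in feats:
            votes.update(self.counts.get(feat, {}))
        return votes.most_common(1)[0][0] if votes else "NULL"


train = [(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])]
examples = collect_examples(train)
H = CountClassifier().fit(examples)
for feats, label in examples:
    print(label, feats)
print(H.predict(("token=fly", "prev=Modal")))  # Verb
```

Every training state lies on the expert trajectory, so the `prev` feature always holds a correct tag.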

### Exposure bias

In [7]:
# Same start as the gold path, but "can" is tagged Verb (row 1) instead of Modal (row 2).
wrong_path = [(0,4),(1,3),(2,1)]
cb_wrong = cg.draw_cost_breakdown(rows, columns, wrong_path)
util.Carousel([cb_gold, cb_wrong])

<p style="float: left; font-size: 80%">We had seen: &nbsp;&nbsp; 
<table style="float: left; border-style: hidden; border-collapse: collapse; font-size: 80%">
<thead>
<tr>
<th>timestep</th>
<th>label</th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td>t=3</td>
<td><b>Verb</b></td>
<td>token=fly,..., prev=<b>Modal</b></td>
</tr>
</tbody>
</table>
</p>

<p style="float: left; font-size: 80%">but not: &nbsp;&nbsp;
<table style="float: left; border-style: hidden; border-collapse: collapse; font-size: 80%">
<thead>
<tr>
<th>timestep</th>
<th>label</th>
<th>features</th>
</tr>
</thead>
<tbody>
<tr>
<td>t=3</td>
<td><b>Verb</b></td>
<td>token=fly,..., <span style="color:red">prev=<b>Verb</b></span></td>
</tr>
</tbody>
</table></p>

### Addressing exposure bias with rollins

<p style="float: left;">Allow the classifier to guide the learning<br></p>  <a href="https://www.pinterest.com/explore/affordable-driving-school/"><img src="images/driving_mix.jpg" style="width:35%; float: right;"></a>

Define a **rollin** policy that sometimes uses the expert $\pi^{\star}$ and other times the classifier $H$:

$$\pi^{in} = \beta\pi^{\star} + (1-\beta)H$$
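
One way to read the mixture is as a coin flip at every step. The sketch below assumes that per-action reading; `make_rollin` and the two constant policies are hypothetical names used only to show the mixing.

```python
import random


def make_rollin(expert, classifier, beta, rng=None):
    """Return pi_in: at every step follow the expert with probability beta, else H."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    rng = rng if rng is not None else random.Random(0)  # fixed seed: reproducible sketch

    def rollin(state):
        return expert(state) if rng.random() < beta else classifier(state)

    return rollin


# Constant policies that only report who acted (hypothetical, for illustration).
expert = lambda state: "expert"
classifier = lambda state: "H"

for beta in (1.0, 0.5, 0.0):
    policy = make_rollin(expert, classifier, beta)
    choices = [policy(None) for _ in range(1000)]
    print(beta, choices.count("expert") / len(choices))
```

With $\beta=1$ every action comes from the expert, and with $\beta=0$ every action comes from the classifier.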

### DAgger algorithm

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H\\
& \text{set training examples}\; \cal E = \emptyset ,\; \color{red}{\pi^{\star}\; \mathrm{probability}\; \beta=1}\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \color{red}{\text{set rollin policy} \; \pi^{in} = \beta\pi^{\star} + (1-\beta)H}\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \color{red}{\text{generate trajectory} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in}(\mathbf{x},\mathbf{y})}\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \color{red}{\text{ask expert for best action}\; \alpha^{\star} = \pi^{\star}(S_{t-1},\mathbf{y})} \\
& \quad \quad \quad \text{extract features} \; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (\mathit{feat},\alpha^{\star})\\
& \quad \text{learn}\; H \; \text{from}\; \cal E\\
& \quad \color{red}{\text{decrease} \; \beta}\\
\end{align}
</p>
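
A compact sketch of DAgger for tagging with Hamming loss, where the expert action is the gold tag at the current position. The memorising learner `learn`, the halving schedule for $\beta$ and the fixed number of iterations are assumptions for illustration. The rollin and the example collection are interleaved rather than done in two passes, which yields the same states.

```python
import random
from collections import Counter, defaultdict


def features(sentence, tags):
    """phi(x, S_{t-1}): the current token and the previously predicted tag."""
    return (sentence[len(tags)], tags[-1] if tags else "NULL")


def learn(examples, default="Noun"):
    """Toy H: majority label per exact feature tuple; `default` for unseen features."""
    table = defaultdict(Counter)
    for feats, label in examples:
        table[feats][label] += 1
    lookup = {feats: counts.most_common(1)[0][0] for feats, counts in table.items()}
    return lambda feats: lookup.get(feats, default)


def dagger(train, iterations=3, decay=0.5, seed=0):
    rng = random.Random(seed)
    examples, beta = [], 1.0          # E is empty, expert probability beta = 1
    H = learn(examples)               # untrained H; never consulted while beta = 1
    for _ in range(iterations):
        for sentence, gold in train:
            tags = []                 # state S_{t-1}: tags chosen by the rollin so far
            for t in range(len(sentence)):
                feats = features(sentence, tags)
                expert_action = gold[t]   # Hamming-loss expert: the gold tag
                examples.append((feats, expert_action))
                # Rollin step: expert with probability beta, otherwise the classifier.
                tags.append(expert_action if rng.random() < beta else H(feats))
        H = learn(examples)           # retrain on the aggregated dataset
        beta *= decay                 # decrease beta
    return H, examples


train = [(["I", "can", "fly"], ["Pronoun", "Modal", "Verb"])]
H, examples = dagger(train)
print(len(examples))          # 9: three tokens aggregated over three iterations
print(H(("fly", "Modal")))    # Verb
```

On this single training sentence the toy learner fits the data perfectly, so later rollins never leave the expert trajectory; new states only appear once the classifier makes mistakes on the training data.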

### DAgger algorithm

Proposed by [Ross et al. (2011)](http://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf) motivated by robotics
- first iteration is standard classification training
- task loss and gold standard are implicitly considered via the expert
- DAgger: the Datasets in each iteration are Aggregated

**Rollins** help recover from previous mistakes. How do we learn the future impact of a mistake?

**rollout**: try each action available and see what happens when future actions are taken by mixing the classifier and the expert
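
A minimal sketch of rollouts for the tagging example: each candidate action is applied, the rest of the sequence is completed by a rollout policy, and the task loss of the result becomes that action's cost. `rollout_costs` and the rule-based `toy_classifier` are hypothetical; for brevity the rollout here uses the classifier alone rather than a mixture with the expert.

```python
def rollout_costs(prefix, actions, length, rollout_policy, loss):
    """Cost of each action: task loss after completing the output with the rollout policy."""
    costs = {}
    for action in actions:
        tags = list(prefix) + [action]
        while len(tags) < length:
            tags.append(rollout_policy(tags))
        costs[action] = loss(tags)
    return costs


gold = ["Pronoun", "Modal", "Verb"]


def hamming(tags):
    """Task loss: number of tags that differ from the gold standard."""
    return sum(p != g for p, g in zip(tags, gold))


def toy_classifier(tags):
    """Hypothetical H: tags the next word Noun after a Verb, and Verb otherwise."""
    return "Noun" if tags[-1] == "Verb" else "Verb"


costs = rollout_costs(["Pronoun"], ["Modal", "Verb", "Noun"], len(gold),
                      toy_classifier, hamming)
print(costs)  # {'Modal': 0, 'Verb': 2, 'Noun': 1}
```

Tagging "can" as Verb costs 2 rather than 1 because the mistake also derails the prediction for "fly"; this future impact is what the rollout measures.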

### Rollins and rollouts

<img src="images/lols.png" style="width:60%;">
- first proposed in SEARN ([Daumé III et al., 2009](http://hunch.net/~jl/projects/reductions/searn/searn.pdf))
- used to hybridise DAgger by [Vlachos and Clark (2014)](http://www.aclweb.org/anthology/Q14-1042)
- Locally Optimal Learning to Search ([Chang et al., 2015](https://arxiv.org/pdf/1502.02206.pdf))

### Let's see some applications

Remember this?

<img src="images/toBeAnimated/depParse1.png">

<center>
<img src="images/stateTransitExpert.png">
</center>

### Learning a classifier

In [8]:
dep_rows = ['SHIFT', 'REDUCE', 'ARC-L', 'ARC-R','NULL']
dep_columns = ['S0','S1', 'S2', 'S3']
dep_cbs = []
dep_paths = [[],[(0,4),(1,0)],[(0,4),(1,0),(2,0)],[(0,4),(1,0),(2,0),(3,2)]]
for path in dep_paths:
    dep_cbs.append(cg.draw_cost_breakdown(dep_rows, dep_columns, path))
util.Carousel(dep_cbs)

<table style="font-size:100%; border-style:hidden; border-collapse:collapse; padding:50px; float:left;">
<thead>
<tr>
<th>timestep</th>
<th>label ($\alpha_t$)</th>
<th>features ($\phi(S_{t-1},\mathbf{x})$)</th>
</tr>
</thead>
<tbody>
<tr>
<td> $t=1$ </td>
<td><b>SHIFT</b></td>
<td>??</td>
</tr>
<tr>
<td> $t=2$ </td>
<td><b>SHIFT</b></td>
<td>??</td>
</tr>
<tr>
<td> $t=3$ </td>
<td><b>REDUCE</b></td>
<td>??</td>
</tr>
</tbody>
</table>

<h3>Features</h3>
<img src="images/tikz/depParseArcEager5.png" style="width:1000px; border:none; box-shadow:none;">

<p style="font-size: 100%">Stack = [ROOT, had, effect]</p>

<p style="font-size: 100%">Buffer = [on, financial, markets, .]</p>

<p style="font-size: 100%" class="fragment">Features based on the words/PoS in stack and buffer:
					<br> wordS1=effect, wordB1=on, wordS2=had, posS1=NOUN, etc.
				</p>
<p style="font-size: 100%" class="fragment">Features based on the dependencies so far:
					<br> depS1=dobj, depLeftChildS1=amod, depRightChildS1=NULL, etc.
				</p>

<p style="font-size: 100%" class="fragment">Features based on previous transitions:
					<br> $r_{t-1}=\text{Right-Arc}(dobj)$, etc.
				</p>
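
A sketch of how such features could be read off a parser state. The data layout (token indices, arcs as `(head, label, dependent)` triples) and the function name are assumptions. The full sentence is assumed to be "Economic news had little effect on financial markets ."; only the stack, the buffer and the feature values come from the slide, and only the arcs whose labels appear above are listed.

```python
def extract_features(tokens, pos, stack, buffer, arcs, previous_action):
    """Features from the stack top (S1, S2), the buffer front (B1) and the arcs so far."""
    s1 = stack[-1] if stack else None
    s2 = stack[-2] if len(stack) > 1 else None
    b1 = buffer[0] if buffer else None

    def word(i):
        return tokens[i] if i is not None else "NULL"

    def incoming_label(i):
        # Label of the arc pointing at token i, if it already has a head.
        return next((label for head, label, dep in arcs if dep == i), "NULL")

    def child_label(i, side):
        # Label of the leftmost (side="left") or rightmost (side="right") child of i.
        children = [(dep, label) for head, label, dep in arcs
                    if head == i and (dep < i if side == "left" else dep > i)]
        if not children:
            return "NULL"
        return (min(children) if side == "left" else max(children))[1]

    return {
        "wordS1": word(s1), "wordS2": word(s2), "wordB1": word(b1),
        "posS1": pos.get(s1, "NULL"),
        "depS1": incoming_label(s1),
        "depLeftChildS1": child_label(s1, "left"),
        "depRightChildS1": child_label(s1, "right"),
        "prevAction": previous_action,
    }


tokens = ["ROOT", "Economic", "news", "had", "little", "effect",
          "on", "financial", "markets", "."]      # assumed full sentence
pos = {5: "NOUN"}                                  # only the PoS tag given on the slide
stack, buffer = [0, 3, 5], [6, 7, 8, 9]            # [ROOT, had, effect], [on, ..., .]
arcs = [(5, "amod", 4), (3, "dobj", 5)]            # effect -> little, had -> effect
feats = extract_features(tokens, pos, stack, buffer, arcs, "Right-Arc(dobj)")
for name, value in feats.items():
    print(f"{name}={value}")
```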


### When we fall off the trajectory

<center>
<img src="images/stateTransit.png" width="100%">
</center>

<center>
<img src="images/stateTransitErrorPart.png" width="90%">
</center>

- No suitable training data to teach us what to do.
- Next gold action might not be possible/optimal

### Finding the best action

<center>
<img src="images/determReachStatesScoredBest.png">
</center>

SHIFT would be the best if everything had been correct (Loss=0), but not anymore.

### Imitation learning for dependency parsing

<p style="border:3px; border-radius: 25px; background-color:lightgrey; border-style:solid; border-color:black; padding: 0.3em; font-size: 80%">
\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)...(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{classifier} \; H\\
& \text{set training examples}\; \cal E = \emptyset ,\; \color{red}{\pi^{\star}\; \mathrm{probability}\; \beta=1}\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \color{red}{\text{set rollin policy} \; \pi^{in} = \beta\pi^{\star} + (1-\beta)H}\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \color{red}{\text{generate trajectory} \; \hat \alpha_1\dots\hat \alpha_T  = \pi^{in}(\mathbf{x},\mathbf{y})}\\
& \quad \quad \mathbf{for} \; \hat \alpha_t \in \hat \alpha_1\dots\hat \alpha_T \; \mathbf{do}\\
& \quad \quad \quad \color{red}{\text{ask expert for best action}\; \alpha^{\star} = \pi^{\star}(S_{t-1},\mathbf{y})} \\
& \quad \quad \quad \text{extract features} \; \mathit{feat}=\phi(\mathbf{x},S_{t-1}) \\
& \quad \quad \quad \cal E = \cal E \cup (\mathit{feat},\alpha^{\star})\\
& \quad \text{learn}\; H \; \text{from}\; \cal E\\
& \quad \color{red}{\text{decrease} \; \beta}\\
\end{align}
</p>

### Expert policy

<center>
<img src="images/oracle-delphi.jpg">
</center>

Just like the oracle, but it takes the previous actions into account (dynamic oracle)

### Results<h5>[Goldberg and Nivre 2012](http://www.aclweb.org/anthology/C12-1059)</h5>
<img src="images/dependResultBars.png">

### More applications

<p  style="float: left;">Incremental coreference resolution<br>
(<a href="http://cs.stanford.edu/people/kevclark/resources/clark-manning-acl15-entity.pdf">Clark and Manning, 2015</a>)
</p>
<a href="http://nlp.stanford.edu/projects/coref.shtml"><img src="images/corefexample.png" style="width:40%; float: right;"></a>

<p  style="float: left;">Recurrent Neural Network training<br>
(<a href="https://arxiv.org/pdf/1511.06732.pdf">Ranzato et al., 2016</a>)
</p>
<img src="images/mixer.png" style="width:40%; float: right;">

And some of my own:
- [Biomedical information extraction](http://www.biomedcentral.com/content/pdf/1471-2105-13-S11-S5.pdf)
- [Natural Language Generation](https://aclweb.org/anthology/C/C16/C16-1105.pdf)
- [Semantic Parsing](http://aclweb.org/anthology/P16-1001)

### Summary

Imitation learning:
- better training data for incremental predictors
- addresses error propagation
- many successful applications

### Coming up next

Fake News class project: [the world is watching](https://nytimes.com/2017/05/01/business/europe-election-fake-news.html)!

<img src="https://c1.staticflickr.com/1/133/366958167_939986949c_b.jpg" style="width:50%; border:none; box-shadow:none;">

**Everything can be examined**: lecture slides, bibliography, classroom discussion, guest lectures, lab content, etc.

Prepare following the advice in the booklets ([UG](http://www.dcs.shef.ac.uk/intranet/teaching/public/tutorials/level1/Personal%20Tutorial%20booklet.pdf), [PG](http://www.dcs.shef.ac.uk/intranet/teaching/public/tutorials/MScTutorialSystem1617.pdf))

<img src="https://c1.staticflickr.com/5/4079/4759535950_7bca6684c8_b.jpg" style="width:60%; border:none; box-shadow:none;">

...for your hard work during the module and your feedback!

Please give more feedback on the course in the forms provided by the department!

Future students will be grateful!