In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
# %cd .. 
import sys
sys.path.append("..")

<!---
Latex Macros
-->
$$
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\indi}{\mathbb{I}}
$$

In [27]:
import statnlpbook.util as util
util.execute_notebook('structured_prediction.ipynb')

# Structured Prediction

There is no emerging unified _theory of NLP_; most textbooks and courses explain NLP as 

> a collection of problems, techniques, ideas, frameworks, etc. that really are not tied together in any reasonable way other than the fact that they have to do with NLP.
>
>  -- <cite>[Hal Daume](http://nlpers.blogspot.co.uk/2012/12/teaching-intro-grad-nlp.html)</cite>

but there is a recurring pattern ... the
## Structured Prediction Recipe

## Problem Signature 

* Given some input structure \\(\x \in \Xs \\), such as a word, sentence, or document ...  
* predict an **output structure** \\(\y \in \Ys \\), such as a class label, a sentence, or a syntactic tree.

## Approach

 * Define a parametrized _model_ \\(s_\params(\x,\y)\\) that measures the _match_ of a given \\(\x\\) and \\(\y\\) using _representations_ $\repr(\x)$ and $\repry(\y)$.

 * _Learn_ the parameters \\(\params\\) from the training data \\(\train\\) (a _continuous optimization problem_).

 * Given an input \\(\x\\), find the highest-scoring output structure $$ \y^* = \argmax_{\y\in\Ys} s_\params(\x,\y) $$ (a _discrete optimization problem_).  

**Good NLPers** combine **three skills** in accordance with this recipe: 

* modelling,
* continuous optimization and
* discrete optimization.

## Example
* Difficult to show a meaningful example without going into depth (as we will later)
* Instead, consider a toy example that uses the same ingredients and steps

### Task
"Machine translation" from English into Japanese sentences

In [7]:
train

[('I want coffe', 'コヒがほし'), ('where is the restroom?', 'コヒがほし')]

In [9]:
test

[('As he crossed toward the pharmacy at the corner he involuntarily turned his head because of a burst of light that had ricocheted from his temple',
  'コヒがほし')]

Too difficult! Let's make some simplifying
### Assumptions
* There are only 4 target Japanese sentences we care about.
* The lengths of the source English and target Japanese sentences are sufficient representations of the problem.

Our 
### Output Space
is simply:

In [11]:
y_space

['コヒがほし', 'コヒがほし', 'コヒがほし']

### Representation
* $\repr(\x)=|\x|$ 
* $\repry(\y)=|\y|$ 

### Model
$$
s_\param(\x,\y) = \param |\repr(\x) - \repry(\y)|
$$

Note: if $\param>0$, a bigger difference in length is rewarded; if $\param<0$, it is penalised.
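
A minimal sketch of the representation and model, assuming that sentence length is measured in characters (an assumption; tokens would work equally well). The names `f`, `g` and `s` are illustrative stand-ins and not necessarily how the notebook's own helpers are defined.

```python
def f(x: str) -> int:
    """Input representation f(x) = |x|; counting characters is an assumption."""
    return len(x)


def g(y: str) -> int:
    """Output representation g(y) = |y|, again counted in characters."""
    return len(y)


def s(theta: float, x: str, y: str) -> float:
    """Score s_theta(x, y) = theta * |f(x) - g(y)|."""
    return theta * abs(f(x) - g(y))


# With theta = -1 a length mismatch is penalised:
# |len("Blah") - len("Blub Blub")| = |4 - 9| = 5, so the score is -5.0.
print(s(-1.0, "Blah", "Blub Blub"))
```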

Let us inspect this model: 

In [19]:
s(-1., "Blah", "Blub Blub")

-5.0

How to estimate $\param \in \{-1,1\}$? Let us define a 
### Loss Function
$$
l(\param)=\sum_{(\x,\y) \in \train} \indi(\y \neq \y'_{\param})
$$
where $\y'_{\param} \in \Ys$ is the highest-scoring translation
$$\y'_{\param}=\argmax_{\y \in \Ys} s_\param(\x,\y).$$
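
A zero-one loss of this kind can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_y_space` and `toy_train` are made-up, romanised placeholder data (not the notebook's), length is counted in characters, and `s`, `predict` and `loss` are hypothetical re-implementations. The loss counts the training pairs whose highest-scoring output differs from the gold output, so lower is better.

```python
def s(theta, x, y):
    """Score s_theta(x, y) = theta * | |x| - |y| | (lengths in characters)."""
    return theta * abs(len(x) - len(y))


def predict(theta, x, y_space):
    """Return the highest-scoring output in y_space for input x."""
    return max(y_space, key=lambda y: s(theta, x, y))


def loss(theta, data, y_space):
    """Count training pairs whose prediction differs from the gold output."""
    return float(sum(predict(theta, x, y_space) != y for x, y in data))


# Hypothetical toy data, invented for this sketch only.
toy_y_space = ["kohi ga hoshii", "toire wa doko desu ka"]
toy_train = [
    ("I want coffee", "kohi ga hoshii"),
    ("where is the restroom?", "toire wa doko desu ka"),
]

for theta in (-1.0, 1.0):
    print(theta, loss(theta, toy_train, toy_y_space))
```

Note that `loss` calls `predict` once per training pair, so evaluating the loss already requires solving the prediction problem.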


In [31]:
train

[('I want coffe', 'コヒがほし'),
 ('where is the restroom?', 'This is not Japanese at all')]

In [36]:
loss(1.0, train)

2.0

### Learning
is as simple as choosing the parameter with the lowest loss:

$$
\param^* = \argmin_{\param \in \{-1,1\}} l(\param) 
$$


In [37]:
theta_star = 1.0 if loss(1.0, train) < loss(-1.0, train) else -1.0
theta_star

-1.0

### Prediction
same thing, just in $\Ys$:

$$\y'_{\param}=\argmax_\y s_\param(\x,\y).$$

Seen before? Yes, training often involves prediction in the inner loop.

In [41]:
predict(1, test[0])

'What the hell, Sebastian! Fix this example'

### In Practice
1. Feature representations and scoring functions are **more elaborate**
   1. involve several **non-linear** transformations of both input and output. 

1. Parameter space is usually **multi-dimensional** (millions of dimensions). 
   1. **Impossible to search exhaustively**.
   2. **Numeric optimisation algorithms** (often SGD).

1. Output space is often exponentially large (e.g. *all* Japanese sentences)
   1. **Impossible to search exhaustively**.
   2. **Discrete optimisation algorithms** (dynamic programming, greedy search, integer linear programming)
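
To get a feel for why exhaustive search over the output space is hopeless, the sketch below counts candidate sentences under two illustrative assumptions that are not from the text: a vocabulary of 10,000 words and sentences of exactly 10 words. The helper `num_candidate_sentences` is hypothetical.

```python
def num_candidate_sentences(vocab_size: int, length: int) -> int:
    """Number of distinct word sequences of a fixed length: vocab_size ** length."""
    return vocab_size ** length


# Assumed figures for illustration only: 10,000 words, 10-word sentences.
n = num_candidate_sentences(10_000, 10)
print(f"{n:.1e} candidate sentences")  # 1.0e+40
```

Even at a billion scored candidates per second, enumerating that many outputs is out of reach, which is why structure-exploiting discrete optimisation is needed.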