# Tagging lab

## Implementing the Viterbi algorithm


### Summary
_(See the lecture notes for a detailed explanation of the algorithm)_

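Under the Markov assumption, the joint probability of the words and their labels factorizes as follows, where $l_0$ stands for the start of the sentence:
$$
\mathbb{P}(w_1, w_2, \ldots w_n, \ l_1, l_2 \ldots l_n) =
    \prod_{i=1}^n\mathbb{P}(l_i \ | \ l_{i-1})\cdot\mathbb{P}(w_i \ | \ l_i)
$$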

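The Viterbi function gives the probability of the most likely label sequence; the sequence that attains the maximum is the optimal tagging:
$$
V(w_1, w_2, \ldots w_n) =
{\max}_{l_1, l_2, \ldots l_n\in L}
    \prod_{i=1}^n\mathbb{P}(l_i \ | \ l_{i-1})\cdot\mathbb{P}(w_i \ | \ l_i)
$$
For small $n$ (the start transition $\mathbb{P}(l_1 \ | \ l_0)$ is left out here):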
$$
V(w_1) = {\max}_{l_1\in L} \mathbb{P}(w_1 \ | \ l_1)
$$

$$
V(w_1, w_2) = {\max}_{l_1, l_2\in L} \mathbb{P}(w_1 \ | \ l_1) \cdot \mathbb{P}(l_2 \ | \ l_1) \cdot \mathbb{P}(w_2 \ | \ l_2)
$$

$$
V(w_1, w_2, w_3) = {\max}_{l_1, l_2, l_3\in L} \mathbb{P}(w_1 \ | \ l_1) \cdot \mathbb{P}(l_2 \ | \ l_1) \cdot \mathbb{P}(w_2 \ | \ l_2) \cdot \mathbb{P}(l_3 \ | \ l_2) \cdot \mathbb{P}(w_3 \ | \ l_3)
$$

$$
V(w_1, w_2, w_3, w_4) = {\max}_{l_1, l_2, l_3, l_4\in L} \mathbb{P}(w_1 \ | \ l_1) \cdot \prod_{i=2}^{4} \mathbb{P}(l_i \ | \ l_{i-1}) \cdot \mathbb{P}(w_i \ | \ l_i)
$$
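
The small cases above grow by one transition factor and one emission factor per word, which suggests a recursion. As a sketch (the per-label notation $V_l$ is an assumption made here and may differ from the lecture notes): because the factor $\mathbb{P}(l_n \ | \ l_{n-1})$ couples neighbouring labels, the running maximum has to be kept separately for every possible last label $l$.
$$
V_l(w_1) = \mathbb{P}(w_1 \ | \ l)
$$

$$
V_l(w_1, \ldots w_n) = \mathbb{P}(w_n \ | \ l) \cdot {\max}_{l'\in L} \ V_{l'}(w_1, \ldots w_{n-1}) \cdot \mathbb{P}(l \ | \ l')
$$

$$
V(w_1, \ldots w_n) = {\max}_{l\in L} \ V_l(w_1, \ldots w_n)
$$
Each step compares $|L|^2$ label pairs, so a sentence of $n$ words needs on the order of $n\cdot|L|^2$ operations instead of the $|L|^n$ sequences of a brute-force search.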

### Task 1a
Implement a generator function that reads a tab-separated corpus file and yields tuples of (word, label).
The file contains one token per line, in the following format:

`word TAB pos-tag`

_The file may contain a few malformed lines; use exception handling to print a warning for each one and skip it._
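
One possible shape for such a generator is sketched below. The name `read_corpus` is made up for illustration, and the function takes any iterable of lines (such as an open file object) rather than a path, which keeps it easy to try out. The lab does not say whether blank lines carry meaning, so this sketch assumes they can be skipped silently.

```python
import sys


def read_corpus(lines):
    """Yield (word, label) tuples from tab-separated lines.

    `lines` is any iterable of strings, e.g. an open file object.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are not tokens
        try:
            # Unpacking raises ValueError unless there are exactly two fields.
            word, label = line.split("\t")
            if not word or not label:
                raise ValueError("empty field")
        except ValueError as err:
            print(f"warning: skipping line {line_no}: {err}", file=sys.stderr)
            continue
        yield word, label
```

With a real file this could be used as `with open(path, encoding="utf-8") as f: pairs = list(read_corpus(f))`, where `path` points to the downloaded corpus.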

### Task 1b
Download the file [`umbc.casesensitive.word_pos.1M.txt`](http://sandbox.hlt.bme.hu/~gaebor/ea_anyag/python_nlp)
and use the generator to iterate through sentences, building a vocabulary of words and labels.
You'll have to build:
* a dict of words to indices
* a dict of labels to indices
* two reverse dicts: indices to words and indices to labels
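
The four dicts can be built in a single pass over the (word, label) pairs. The following is a minimal sketch; `build_vocab` and the dict names are illustrative choices, not names required by the lab. Indices are assigned in order of first appearance.

```python
def build_vocab(pairs):
    """Build index dicts for words and labels from (word, label) pairs."""
    word2idx, label2idx = {}, {}
    for word, label in pairs:
        # setdefault stores the next free index only if the key is new.
        word2idx.setdefault(word, len(word2idx))
        label2idx.setdefault(label, len(label2idx))
    # Reverse mappings: index -> word and index -> label.
    idx2word = {i: w for w, i in word2idx.items()}
    idx2label = {i: l for l, i in label2idx.items()}
    return word2idx, label2idx, idx2word, idx2label
```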

### Task 2a
Now read through the data again and build two matrices:
* a $|V|\times |L|$ matrix of integers
  $$M(i,j) = \# \ i^\text{th} \text{ word with pos tag } j$$
* an $|L|\times |L|$ matrix of integers
  $$N(i,j) = \# \ i^\text{th} \text{ pos after } j^\text{th} \text { pos}$$
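
A possible way to fill both count matrices with NumPy is sketched here. The function name `count_matrices` is hypothetical, and `word2idx` and `label2idx` stand for the dicts from Task 1b. The sketch treats the whole corpus as one long sequence, so a label at the start of a sentence is counted as following the last label of the previous sentence; resetting `prev` at sentence boundaries would be a natural refinement if the corpus marks them.

```python
import numpy as np


def count_matrices(pairs, word2idx, label2idx):
    """Return (M, N): word/label counts and label-bigram counts."""
    M = np.zeros((len(word2idx), len(label2idx)), dtype=np.int64)
    N = np.zeros((len(label2idx), len(label2idx)), dtype=np.int64)
    prev = None  # index of the previous label, None before the first token
    for word, label in pairs:
        w, l = word2idx[word], label2idx[label]
        M[w, l] += 1
        if prev is not None:
            N[l, prev] += 1  # N(i, j): label i seen right after label j
        prev = l
    return M, N
```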

### Task 2b

Compute the empirical probabilities
* a $|V|\times |L|$ matrix of floats
  $$P_1(i,j) = \frac{\# \ i^\text{th} \text{ word with pos tag } j}{\# \ \text{pos tag } j}$$
* an $|L|\times |L|$ matrix of floats
  $$P_2(i,j) = \frac{\# \ i^\text{th} \text{ pos after } j^\text{th} \text { pos}}{\# \ \text{pos tag } j}$$
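
Both formulas divide column $j$ by the number of tokens tagged $j$, which is the $j$-th column sum of $M$. A sketch using NumPy broadcasting follows; `empirical_probabilities` is an illustrative name, and leaving columns of never-seen tags at zero is an assumption of this sketch.

```python
import numpy as np


def empirical_probabilities(M, N):
    """Return (P1, P2) from the count matrices M (|V| x |L|) and N (|L| x |L|)."""
    tag_counts = M.sum(axis=0).astype(float)  # tokens per tag j
    # Replace zero counts by 1 so unseen tags give 0/1 = 0 instead of 0/0.
    safe = np.where(tag_counts > 0, tag_counts, 1.0)
    # Broadcasting along the last axis divides column j by safe[j].
    P1 = M / safe
    P2 = N / safe
    return P1, P2
```

Each column of `P1` sums to 1 for every tag that occurs in the data. A column of `P2` can sum to slightly less than 1 when the corresponding tag is the last token of the corpus, because that occurrence has no successor.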

### Task 3a
Implement the Viterbi algorithm (for $k=2$) based on the pseudocode in the lecture notes.
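
The following is a sketch of the forward pass only, not the lecture's pseudocode. It assumes a first-order (bigram) model, which is presumably what $k=2$ refers to, and it follows the small cases in the summary by using no start transition. `viterbi_score` is an illustrative name; `P1[w, l]` is read as $\mathbb{P}(w \ | \ l)$ and `P2[i, j]` as the probability of label $i$ after label $j$. The input must be a non-empty list of word indices.

```python
import numpy as np


def viterbi_score(word_ids, P1, P2):
    """Return the largest path probability V(w_1, ..., w_n)."""
    # best[l]: probability of the best label path that ends in label l
    best = P1[word_ids[0]].astype(float)
    for w in word_ids[1:]:
        # P2 * best scales column j by best[j]; the max over axis 1 then
        # picks the best previous label j for every current label i.
        best = P1[w] * (P2 * best).max(axis=1)
    return best.max()
```

For long sentences the product of many probabilities underflows, so summing log-probabilities is a common alternative to multiplying.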

### Task 3b
Add backtracking using an extra table that stores the argmax instead of the max value.
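
A sketch of how the extra table can be used is given below, under the same assumptions as before: a bigram model, no start transition, `P1[w, l]` for emissions and `P2[i, j]` for label $i$ after label $j$. The function works on word indices and takes the matrices as arguments; these are choices made for this sketch.

```python
import numpy as np


def viterbi(word_ids, P1, P2):
    """Return (best label indices, probability of that label path)."""
    n, n_labels = len(word_ids), P1.shape[1]
    score = np.zeros((n, n_labels))            # max values
    back = np.zeros((n, n_labels), dtype=int)  # argmax: best previous label
    score[0] = P1[word_ids[0]]
    for t in range(1, n):
        cand = P2 * score[t - 1]  # cand[i, j]: path ends in j, then moves to i
        back[t] = cand.argmax(axis=1)
        score[t] = P1[word_ids[t]] * cand.max(axis=1)
    # Follow the stored argmax entries from the last position backwards.
    path = [int(score[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, float(score[-1].max())
```

To tag a list of words, map them to indices with the word dict from Task 1b first and map the returned label indices back with the reverse label dict.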

    # viterbi(["You", "talk", "the", "talk", "."])