# Sequence Labeling with Hidden Markov Models
- Language Understanding Systems
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

This notebook is part of the Laboratory Work for [Language Understanding Systems class](http://disi.unitn.it/~riccardi/page7/page13/page13.html) of [University of Trento](https://www.unitn.it/en).
The laboratory has been ported to Jupyter notebook format for remote teaching during the [COVID-19 pandemic](https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic).

*Recommended Reading*:
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)
- Steven Bird, Ewan Klein, and Edward Loper. [__Natural Language Processing with Python__ (NLTK)](https://www.nltk.org/book/)

*Notebook Covers Material of*:
- [SLP](https://web.stanford.edu/~jurafsky/slp3/8.pdf) Chapter 8: Part-of-Speech Tagging (HMMs)
- [NLTK](https://www.nltk.org/book/ch05.html) Chapter 5: Part of Speech Tagging 

__Requirements__

- [NL2SparQL4NLU](https://github.com/esrel/NL2SparQL4NLU) dataset
- [NLTK](https://www.nltk.org/)
- [`conll.py`](https://github.com/esrel/LUS/) (in `src` folder)

## 1. Sequence Labeling (Tagging)
[Classification](https://en.wikipedia.org/wiki/Statistical_classification) is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

[Sequence Labeling](https://en.wikipedia.org/wiki/Sequence_labeling) is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. It is a sub-class of [structured (output) learning](https://en.wikipedia.org/wiki/Structured_prediction), since we are predicting a *sequence* object rather than a discrete or real value predicted in classification problems.

- Can be treated as a set of independent classification tasks, one per member of the sequence;
- Performance is generally improved by making the optimal label for a given element dependent on the choices of nearby elements;

Due to the complexity of the model and the interrelations of the predicted variables, exact prediction with a trained model and training itself are often computationally infeasible, so [approximate inference](https://en.wikipedia.org/wiki/Approximate_inference) and learning methods are used.

A [Markov Chain](https://en.wikipedia.org/wiki/Markov_chain) is a stochastic model used to describe sequences. It is the simplest [Markov Model](https://en.wikipedia.org/wiki/Markov_model). In order to make inference tractable, the process that generated the sequence is assumed to have the [Markov Property](https://en.wikipedia.org/wiki/Markov_property), i.e. future states depend only on the current state, not on the events that occurred before it. (An [ngram](https://en.wikipedia.org/wiki/N-gram) [language model](https://en.wikipedia.org/wiki/Language_model) is an $(n-1)$-order Markov Model.)

In Statistical Language Modeling, we are modeling *observed sequences* represented as Markov Chains. Since the states of the process are *observable*, we only need to compute __transition probabilities__.

In Sequence Labeling, we assume that *observed sequences* (__sentences__) have been generated by a Markov Process with *unobservable* (i.e. hidden) states (__labels__), i.e. [Hidden Markov Model](https://en.wikipedia.org/wiki/Hidden_Markov_model) (__HMM__). 
Since the states of the process are hidden and the output is observable, each state has a probability distribution over the possible output tokens, i.e. __emission probabilities__. 

Using these two probability distributions (__transition__ and __emission__), in sequence labeling, we are *inferring* the sequence of state transitions, given a sequence of observations.

### 1.1. Natural Language Processing (NLP) Tasks

Below are some examples of NLP tasks that Sequence Labeling is applied to as one of the methods.

The scenario when members of a sequence are mapped to higher order units (i.e. grouped together, `[['a'],['b','c']]`, and assigned a category) is known as __shallow parsing__.

- [Part-of-Speech Tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging)
- [Shallow Parsing](https://en.wikipedia.org/wiki/Shallow_parsing) (Chunking)
    - [Phrase Chunking](https://en.wikipedia.org/wiki/Phrase_chunking)
    - [Named-Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) 
    - [Semantic Role Labeling](https://en.wikipedia.org/wiki/Semantic_role_labeling)
    - Dependency [Parsing](https://en.wikipedia.org/wiki/Parsing) 
    - Discourse Parsing
    - (Natural/Spoken) __Language Understanding__: Concept Tagging/Entity Extraction

### 1.2. The General Setting for Sequence Labeling

- Create __training__ and __testing__ sets by tagging a certain amount of text by hand
    - i.e. map each word in corpus to a tag
- Train tagging model to extract generalizations from the annotated __training__ set
- Evaluate the trained tagging model on the annotated __testing__ set
- Use the trained tagging model to annotate new texts

### 1.3. Hidden Markov Model Tagging
Tagging is one of the tasks [Hidden Markov Models](https://en.wikipedia.org/wiki/Hidden_Markov_model) are used for.

Given a word sequence $w_{1}^{n}$ the goal is to find the most probable tag sequence $t_{1}^{n}$.

$$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} p(t_{1}^{n} | w_{1}^{n})$$

We assume that a tag sequence has generated the given sequence of words. 

Using __Bayes's Rule__ 

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

Consequently, we compute:

$$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}}\frac{p(w_{1}^{n} | t_{1}^{n}) p(t_{1}^{n})}{p(w_{1}^{n})}$$

The probability of the word sequence $p(w_{1}^{n})$ is the same for all candidate tag sequences, so it can be dropped from the $\arg\max$, thus:

$$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} p(w_{1}^{n} | t_{1}^{n})p(t_{1}^{n})$$


#### 1.3.1. Parameter Learning
The parameter learning task in HMMs is to find, given an output sequence or a set of such sequences, the best set of *state transition* and *emission probabilities*. The task is usually to derive the [*maximum likelihood estimate*](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) of the parameters of the HMM given the set of output sequences. 

##### Simplifying Assumptions

- Probability of a word only depends on its own tag, not tags of other words in a sentence, thus:

$$p(w_{1}^{n}|t_{1}^{n}) \approx p(w_1|t_1)p(w_2|t_2) ... p(w_n|t_n)$$

- Probability of a tag depends only on the previous $N-1$ tags, i.e. the Markov assumption (ngram of order $N$), thus:

$$p(t_{1}^{n}) \approx \prod_{i=1}^{n}{p(t_i | t_{i-N+1}^{i-1})}$$

- The (first-order) Markov assumption (bigram):

$$p(t_{1}^{n}) \approx p(t_1|t_0) p(t_2|t_1) ... p(t_n|t_{n-1})$$
- or:
$$p(t_{1}^{n}) \approx \prod_{i=1}^{n}{p(t_i | t_{i-1})}$$

##### 1.3.1.1. Estimating Transition Probabilities from Data

- *Transition probabilities* $p(t_i | t_{i-N+1}^{i-1})$ form an ngram model and are estimated with the same recipe we use for ngram language modeling, but over tag ngrams instead of word ngrams.
- It is assumed that the set of states is *finite* and known (i.e. there is no unknown (or OOV) state).
- The same principles of *smoothing* apply for ngrams of state transitions


*Calculating Probability from Frequencies*

Probabilities of ngrams can be computed *normalizing* frequency counts (*Maximum Likelihood Estimation*): dividing the frequency of an ngram sequence by the frequency of its prefix (*relative frequency*).

N-gram   | Equation                      
:--------|:------------------------------
Unigram  | $$p(t_i) = \frac{c(t_i)}{T}$$ 
Bigram   | $$p(t_i|t_{i-1}) = \frac{c(t_{i-1},t_i)}{c(t_{i-1})}$$ 
Ngram    | $$p(t_i|t_{i-N+1}^{i-1}) = \frac{c(t_{i-N+1}^{i-1}, t_i)}{c(t_{i-N+1}^{i-1})}$$ 

where:
- $T$ is the total number of tags in a corpus
- $c(x)$ is the count of occurrences of $x$ in a corpus ($x$ could be unigram, bigram, etc.)
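
The relative-frequency recipe for tag bigrams can be sketched as follows. This is a minimal illustration on a toy list of tag sequences; the function name `bigram_mle` and the boundary symbols `<s>` and `</s>` are assumptions made for the example, not part of the lab code, and no smoothing is applied.

```python
from collections import Counter

def bigram_mle(tag_seqs, bos="<s>", eos="</s>"):
    """Estimate p(t_i | t_{i-1}) by relative frequency (no smoothing)."""
    histories, bigrams = Counter(), Counter()
    for tags in tag_seqs:
        padded = [bos] + list(tags) + [eos]
        # every symbol except the final eos acts as a history (prefix)
        histories.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    # c(t_{i-1}, t_i) / c(t_{i-1})
    return {(prev, cur): c / histories[prev] for (prev, cur), c in bigrams.items()}

toy = [
    ["O", "O", "B-movie.name"],
    ["O", "B-movie.name", "I-movie.name"],
]
probs = bigram_mle(toy)
print(probs[("O", "B-movie.name")])  # 2 of the 3 occurrences of O are followed by B-movie.name
```

Because each count is divided by the count of its history, the probabilities of all tags following a given history sum to one.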

##### 1.3.1.2. Estimating Emission Probabilities from Data
Similar to *transition probabilities*, *emission probabilities* can be estimated from annotated data counting relative frequencies of observations. Since we assume that probability of a word depends only on its tag, the equation is the following.

$$p(w_i|t_i) = \frac{c(t_i,w_i)}{c(t_i)}$$

*Unknown Words* & *Unknown Word Models* 

Emission probabilities are subject to data sparseness; thus require handling unknown words. 
Consequently, we need to estimate probabilities for $p($ `<unk>` $|t_i)$. 

- We can assume that all tags ($t_i$) have equal probability of emitting `<unk>`; and estimate it as $\frac{1}{V}$, where $V$ is the size of tag vocabulary.
    - i.e. use Additive Smoothing
- We can estimate them from data replacing OOV with `<unk>` and computing the probabilities
- We can build __Unknown Word Model__ (like in Part-of-Speech Tagging), for instance using:
    - word shape (capitalization)
    - word class (word, punctuation, number)
    - part-of-speech tags (generalize)
    - word suffixes (last characters): e.g. suffixes of lengths (1 to 5) (e.g. [Samuelsson (1993)](https://www.aclweb.org/anthology/W93-0420.pdf))
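
The second option above (estimating `<unk>` emissions from data) can be sketched as below. It assumes that words seen fewer than `cutoff` times in training are good stand-ins for unseen words; the helper name `emission_with_unk` and the cut-off value are hypothetical choices for illustration.

```python
from collections import Counter

def emission_with_unk(tagged_sents, cutoff=2, unk="<unk>"):
    """MLE emission p(w|t) after mapping words seen fewer than `cutoff` times to `unk`."""
    word_freq = Counter(w for sent in tagged_sents for w, _ in sent)
    pair_counts, tag_counts = Counter(), Counter()
    for sent in tagged_sents:
        for w, t in sent:
            w = w if word_freq[w] >= cutoff else unk  # rare word -> <unk>
            pair_counts[(t, w)] += 1
            tag_counts[t] += 1
    # c(t_i, w_i) / c(t_i)
    return {(t, w): c / tag_counts[t] for (t, w), c in pair_counts.items()}

toy = [
    [("who", "O"), ("plays", "O"), ("luke", "B-character.name")],
    [("who", "O"), ("directed", "O"), ("thor", "B-movie.name")],
]
emis = emission_with_unk(toy)
print(emis[("O", "<unk>")])
```

At tagging time, any word outside the retained vocabulary would be mapped to `<unk>` before looking up its emission probability.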


#### 1.3.2. Decoding
$$t_{1}^{n} \approx \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-N+1}^{i-1})$$

| __Model__ | __Equation__ |
|:----------|:--------------
| *unigram* | $$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i)$$
| *bigram*  | $$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-1})$$
| *trigram* | $$t_{1}^{n} = \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-2}, t_{i-1})$$

where:
- $p(w_i|t_i)$ -- *emission probability*, i.e. of seeing current word given the current tag
- $p(t_i|t_{i-N+1}^{i-1})$ -- *transition probability*, i.e. of seeing the current tag given the $N-1$ tags we just saw

##### Viterbi Algorithm
The decoding algorithm for HMMs is the [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm) -- an instance of dynamic programming. The bigram version of the algorithm is not difficult to implement (see pseudo-code in [SLP 8.4.5](https://web.stanford.edu/~jurafsky/slp3/8.pdf)); the trigram version, however, is more complex, and practical taggers incorporate other advanced features.

There are numerous implementations available.
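
For orientation, a bigram Viterbi decoder can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation: the probability tables are hand-made toy values, unseen events fall back to a small `floor` constant instead of proper smoothing, and there is no end-of-sentence transition.

```python
import math

def viterbi(words, tags, trans, emis, bos="<s>", floor=1e-12):
    """Return the most probable tag sequence under a bigram HMM (log-space)."""
    if not words:
        return []

    def lp(table, key):
        # missing entries get a tiny probability (crude stand-in for smoothing)
        return math.log(table.get(key, floor))

    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: lp(trans, (bos, t)) + lp(emis, (t, words[0])) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + lp(trans, (p, t)))
            best[i][t] = best[i - 1][prev] + lp(trans, (prev, t)) + lp(emis, (t, words[i]))
            back[i][t] = prev
    # follow back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# toy model (made-up numbers)
tags = ["O", "B-movie.name"]
trans = {("<s>", "O"): 0.9, ("<s>", "B-movie.name"): 0.1,
         ("O", "O"): 0.6, ("O", "B-movie.name"): 0.4,
         ("B-movie.name", "O"): 0.9, ("B-movie.name", "B-movie.name"): 0.1}
emis = {("O", "star"): 0.3, ("O", "of"): 0.6, ("O", "thor"): 0.01,
        ("B-movie.name", "star"): 0.2, ("B-movie.name", "thor"): 0.8}
print(viterbi(["star", "of", "thor"], tags, trans, emis))
```

Working with sums of log-probabilities instead of products avoids numerical underflow on longer sentences.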

### 1.4. Maximum Likelihood Estimation (__MLE__)

Let's compare the estimation of *emission probabilities* to that of *bigram probabilities*. Both:
- are estimated by Maximum Likelihood Estimation (__MLE__) from frequency counts
- suffer from data sparseness, which is addressed by:
    - smoothing (__+1S__ - add-one smoothing, for simplicity)
    - uniform probability estimation for out-of-vocabulary (__OOV__, `<unk>`) words

|         | __bigram *p*__ | __emission *p*__ |
|:--------|:-----------------------|:-------------------------|
| __MLE__ | $$p(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$ | $$p(w_i|t_i) = \frac{c(t_i,w_i)}{c(t_i)}$$
| __+1S__ | $$p(w_i | w_{i-1}) = \frac{c(w_{i-1},w_i)+1}{c(w_{i-1})+V}$$ | $$p(w_i|t_i)=\frac{c(t_i,w_i)+1}{c(t_i)+V}$$
| __OOV__ | $$\frac{1}{V}$$ | $$\frac{1}{V}$$ 

In practice this means that we can estimate emission probabilities as ngram probabilities, i.e. using the same functions for counting and smoothing, treating $c(t_i,w_i)$ as $c(w_{i-1},w_i)$, i.e. as `[t_i, w_i]` ngram.
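
A small sketch of this reuse: the same add-one smoothing function works for any (history, outcome) pair counts, so emission counts are simply passed in as `(t_i, w_i)` pairs. The helper name `add_one_prob` and the toy data are assumptions for illustration; here $V$ is taken to be the size of the word vocabulary including `<unk>`.

```python
from collections import Counter

def add_one_prob(pair_counts, history_counts, vocab_size):
    """Return a function computing add-one smoothed p(x | h) from pair counts."""
    def prob(history, x):
        return (pair_counts[(history, x)] + 1) / (history_counts[history] + vocab_size)
    return prob

tagged = [("who", "O"), ("plays", "O"), ("luke", "B-character.name")]

# emission counts are just counts of [t_i, w_i] "bigrams"
pairs = Counter((t, w) for w, t in tagged)
hist = Counter(t for _, t in tagged)
vocab = {w for w, _ in tagged} | {"<unk>"}

p_emit = add_one_prob(pairs, hist, len(vocab))
print(p_emit("O", "who"))    # seen pair
print(p_emit("O", "<unk>"))  # unseen pair still receives probability mass
```

For word bigrams, the same function would be called with counts of `(w_{i-1}, w_i)` pairs and counts of `w_{i-1}`.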


## 2. Shallow Parsing

As we have already mentioned, [Shallow Parsing](https://en.wikipedia.org/wiki/Shallow_parsing) is a kind of Sequence Labeling. In a plain Sequence Labeling task, such as Part-of-Speech tagging, there is one output label (tag) per token; Shallow Parsing additionally performs __chunking__ -- segmentation of the input sequence into constituents. Chunking is required to identify categories (or types) of *multi-word expressions*.
In other words, we want to capture the information that expressions like `"New York"`, which consist of 2 tokens, constitute a single unit.

What this means in practice is that Shallow Parsing performs *jointly* (or not) 2 tasks:
- __Segmentation__ of input into constituents (__spans__)
- __Classification__ (Categorization, Labeling) of these constituents into predefined set of labels (__types__)


### 2.1. Revisiting Joint Probability Factorization
In the [*generative approach*](https://en.wikipedia.org/wiki/Generative_model) to Sequence Labeling we model the [joint probability distribution](https://en.wikipedia.org/wiki/Joint_probability_distribution).

$$p(w_{1}^{n},t_{1}^{n}) = p(w_1, w_2, ..., w_n, t_1, t_2, ..., t_n)$$ 

To make the inference tractable, we factor the joint distribution using Chain Rule and apply [conditional independence assumption](https://en.wikipedia.org/wiki/Independence_(probability_theory)).

$$P(A,B) = P(B|A)P(A) = P(A|B)P(B)$$

It is common to mistakenly assume that $P(A|B) = P(B|A)$, known as [Confusion of the Inverse](https://en.wikipedia.org/wiki/Confusion_of_the_inverse).

The relation between $P(A|B)$ and $P(B|A)$ is given by the Bayes Rule:

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

Consequently: 

$$p(t_{1}^{n}|w_{1}^{n}) = \frac{p(w_{1}^{n},t_{1}^{n})}{p(w_{1}^{n})} = \frac{p(w_{1}^{n} | t_{1}^{n}) p(t_{1}^{n})}{p(w_{1}^{n})}$$

If events $A$ and $B$ are independent, we have:

$$P(A,B) = P(A)P(B) \rightarrow P(A) = P(A|B); P(B) = P(B|A)$$

Applying the Markov assumption to $p(t_{1}^{n})$ and the conditional independence assumption to $p(w_{1}^{n} | t_{1}^{n})$, we end up with our ngram sequence labeling model.


$$p(t_{1}^{n}|w_{1}^{n}) \propto p(w_{1}^{n} | t_{1}^{n}) p(t_{1}^{n}) \approx \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-N+1}^{i-1})$$

If we did not apply the conditional independence assumption to $p(w_{1}^{n}|t_{1}^{n})$, we would be modeling $p(w_{1}^{n},t_{1}^{n})$ __jointly__.

Applying just the Markov assumption, i.e. modeling it as an ngram (Markov Chain), we will be solving the following equation:

$$p(w_{1}^n,t_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_{i},t_{i}|w_{i-N+1}^{i-1},t_{i-N+1}^{i-1})}$$

Because:

$$p(t_{1}^{n}|w_{1}^{n}) = \frac{p(w_{1}^{n},t_{1}^{n})}{p(w_{1}^{n})} = \frac{p(w_{1}^{n} | t_{1}^{n}) p(t_{1}^{n})}{p(w_{1}^{n})}$$

$$t_{1}^{n} = \arg\max \limits_{t_{1}^{n}} p(t_{1}^{n}|w_{1}^{n}) = \arg\max \limits_{t_{1}^{n}} p(w_{1}^{n},t_{1}^{n}) = \arg\max \limits_{t_{1}^{n}} p(w_{1}^{n} | t_{1}^{n}) p(t_{1}^{n})$$

Most probable sequence can be obtained either way.

$$t_{1}^{n} \approx \arg\max\limits_{t_{1}^{n}} \prod^n_{i=1} p(w_i|t_i)p(t_i|t_{i-N+1}^{i-1})$$ 
$$t_{1}^{n} \approx \arg\max\limits_{t_{1}^{n}} \prod_{i=1}^{n}{p(w_{i},t_{i}|w_{i-N+1}^{i-1},t_{i-N+1}^{i-1})}$$

Factorization and conditional independence assumptions reduce computational complexity and requirements, and *amount of observations* needed to estimate model probabilities.

Both models are applied to Shallow Parsing and Sequence Labeling in general:
e.g. Hidden Markov Model Tagger and Stochastic Conceptual Language Models for Spoken Language Understanding in [Raymond & Riccardi (2007)](https://disi.unitn.it/~riccardi/papers2/IS07-GenerDiscrSLU.pdf).

### 2.2. Joint Segmentation and Classification
In Shallow Parsing, the segmentation and label information is generally modeled *jointly*. 
In practice, it means that our output labels ($t_i$) can be decomposed into ($c_i,s_i$), where $c_i$ is classification label for token $i$, and $s_i$ segmentation label for token $i$.

Consequently, in shallow parsing, we are modeling:

$$p(w_{1}^{n},t_{1}^{n}) \rightarrow p(w_{1}^{n},c_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i|c_i,s_i)p(c_i,s_i|c_{i-N+1}^{i-1},s_{i-N+1}^{i-1})$$

Joint modeling implies that we do not make a conditional independence assumption between segmentation and classification labels. However, assuming that the probability of a word depends on segmentation and classification labels independently, while both depend on their previous $N-1$ labels, we can factorize it as:

$$p(w_{1}^{n},c_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i|c_i)p(c_i|c_{i-N+1}^{i-1})p(w_i|s_i)p(s_i|s_{i-N+1}^{i-1})$$

The *events* could be modeled independently as well: i.e. we can predict either classification labels only, or segmentation labels only.

*Segmentation*:
$$p(w_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i|s_i)p(s_i|s_{i-N+1}^{i-1})$$
*Classification*
$$p(w_{1}^{n},c_{1}^{n}) \approx \prod^n_{i=1} p(w_i|c_i)p(c_i|c_{i-N+1}^{i-1})$$

#### 2.2.1. Joint Modeling for Features
In Shallow Parsing we jointly model *output labels*.
The principles of joint modeling could be applied to introduce additional features for *input tokens* as well. 
For instance, we could model jointly words and part-of-speech tags ($p_i$) for shallow parsing as:

$$p(w_{1}^{n},p_{1}^{n},c_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i,p_i|c_i,s_i)p(c_i,s_i|c_{i-N+1}^{i-1},s_{i-N+1}^{i-1})$$

or predict them jointly with segmentation and classification labels as:

$$p(w_{1}^{n},p_{1}^{n},c_{1}^{n},s_{1}^{n}) \approx \prod^n_{i=1} p(w_i|c_i,s_i,p_i)p(c_i,s_i,p_i|c_{i-N+1}^{i-1},s_{i-N+1}^{i-1},p_{i-N+1}^{i-1})$$

In the first case our input is *word-pos* pairs, we don't make independence assumptions, consequently they are treated as a single unit (i.e. you need to generate *pos* per word some other way for tagging). Same applies to *segmentation-classification* (or *segmentation-classification-pos*) output labels.

- In joint modeling our observations for tokens and ngrams are more sparse: a *word-pos* pair usually appears in data less often than the *word* and the *pos* separately (same applies for their ngrams). 
- In joint modeling of output labels, we will have to estimate more of them, and thus will have fewer observations for each.


#### 2.2.2. [Bayesian Categorization](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

Assuming conditional independence between *word* and *pos* leads to Bayesian Categorization.

$$p(C_k|x) = \frac{p(x|C_k)p(C_k)}{p(x)} \propto p(C_k) \prod^n_{i=1} p(x_i|C_k)$$

$$p(t_{i}|w_{i},p_{i}) \propto p(t_{i})p(w_i|t_i)p(p_i|t_i)$$

## 3. Encoding Segmentation Information: CoNLL Corpus Format

A corpus in CoNLL format consists of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or "grid") of values, where each line corresponds to a single word, and each column corresponds to an annotation type.

The set of columns used by CoNLL-style files can vary from corpus to corpus.

```
who    O
plays  O
luke   B-character.name
on     O
star   B-movie.name
wars   I-movie.name
new    I-movie.name
hope   I-movie.name
```

### 3.1. [IOB Scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging))

- The notation scheme is used to label *multi-word* spans in token-per-line format.
    - *star wars new hope* is a *movie.name* concept that has 4 tokens
- Both prefix and suffix notations are common: 
    - prefix: __B-movie.name__
    - suffix: __movie.name-B__

- Meaning of Prefixes
    - __B__ for (__B__)eginning of span
    - __I__ for (__I__)nside of span
    - __O__ for (__O__)utside of span (no prefix or suffix, just `O`)
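
Reading the scheme back into spans can be sketched as follows. The helper `iob_to_spans` is a hypothetical name; it assumes prefix notation, returns token offsets counted from `0` with an exclusive end, and treats a stray `I-` tag without a preceding `B-` as the start of a new span (one possible convention among several).

```python
def iob_to_spans(tags, otag="O"):
    """Convert prefix-style IOB tags into (type, start, end) spans; end is exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + [otag]):  # sentinel closes a trailing span
        prefix, _, ctype = tag.partition("-")
        inside = prefix == "I" and ctype == label
        if start is not None and not inside:
            # the open span ends before the current token
            spans.append((label, start, i))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, label = i, ctype
    return spans

tags = ["O", "O", "B-character.name", "O",
        "B-movie.name", "I-movie.name", "I-movie.name", "I-movie.name"]
print(iob_to_spans(tags))
```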

#### 3.1.1. Alternative Schemes:
- No prefix or suffix (useful when there are no *multi-word* concepts)
```
        who    O
        plays  O
        luke   character.name
        on     O
        star   movie.name
        wars   movie.name
        new    movie.name
        hope   movie.name
```
- __IOB/IOB2/BIO__

- __IOBE__
    - IOB + 
    - __E__ for (__E__)nd of span (or __L__ for (__L__)ast)
```
        who    O
        plays  O
        luke   B-character.name
        on     O
        star   B-movie.name
        wars   I-movie.name
        new    I-movie.name
        hope   E-movie.name
```
    
- __BILOU/BIOES__
    - IOB + 
    - __L__ for (__L__)ast word of span
    - __U__ for (__U__)nit word (or __S__ for (__S__)ingleton)
```
        who    O
        plays  O
        luke   U-character.name
        on     O
        star   B-movie.name
        wars   I-movie.name
        new    I-movie.name
        hope   L-movie.name
```

#### 3.1.2. Choice of Scheme
- It is possible to convert IOB, IOBE, & BILOU formats to each other
- Each prefix is applied to every concept label; consequently we increase the number of transitions whose probabilities we need to estimate, 
    - increasing data sparseness, as for each label we will have fewer observations
- The choice of scheme depends on the amount of available data:
    - __IOB__ for the smallest amount
    - __BILOU__ for the largest amount
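
As an illustration of such a conversion, the sketch below rewrites IOB tags as BILOU tags. It assumes well-formed prefix-style tags (every span starts with `B-`); the function name `iob_to_bilou` is hypothetical.

```python
def iob_to_bilou(tags, otag="O"):
    """Rewrite prefix-style IOB tags as BILOU tags."""
    out = []
    for i, tag in enumerate(tags):
        if tag == otag:
            out.append(tag)
            continue
        prefix, ctype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else otag
        # the span continues only if the next tag is I- with the same type
        continues = nxt == "I-" + ctype
        if prefix == "B":
            out.append(("B-" if continues else "U-") + ctype)
        else:
            out.append(("I-" if continues else "L-") + ctype)
    return out

iob = ["O", "O", "B-character.name", "O",
       "B-movie.name", "I-movie.name", "I-movie.name", "I-movie.name"]
print(iob_to_bilou(iob))
```

The reverse direction is simpler: `U-` maps back to `B-` and `L-` maps back to `I-`.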

#### 3.1.3. Terminology
There is no strict naming convention regarding schemes (see alternatives) or how each constituent is termed. 
Below is the terminology used in this notebook. 

CoNLL is not the only data format for sequence labeling.

Let's convert this representation into [rasa](https://rasa.com/) [markdown](https://rasa.com/docs/rasa/nlu/training-data-format/#markdown-format) and [json](https://rasa.com/docs/rasa/nlu/training-data-format/#json-format) formats, as a forward looking example, and explain terminology.

__CoNLL__
```
    who    O
    plays  O
    luke   B-character.name
    on     O
    star   B-movie.name
    wars   I-movie.name
    new    I-movie.name
    hope   I-movie.name
```

__Markdown__
```markdown
who plays [luke](character.name) on [star wars new hope](movie.name)
```

__JSON__
```json
      {
        "intent": "actor",
        "entities": [
          {
            "start": 10,
            "end": 14,
            "value": "luke",
            "entity": "character.name"
          },
          {
            "start": 18,
            "end": 36,
            "value": "star wars new hope",
            "entity": "movie.name"
          }
        ],
        "text": "who plays luke on star wars new hope"
      }
```

##### Interpretation
All 3 formats encode the following information (ignoring some):
- in string (sentence) `"who plays luke on star wars new hope"`
- there are 2 __entities__ (a.k.a. concepts or slots, depending on NLP task and perspective), that have __types__ (labels)
    - `character.name`
    - `movie.name`
    
- entity of __type__ `movie.name`: 
    - has __span__:
        - as tokens from `0` for *CoNLL*: `[4:8]` (end exclusive)
        - as string for *rasa md*: `"star wars new hope"`
        - in characters from `0` for *rasa json*: `[18:36]`
    - has __value__: `"star wars new hope"`
        - string *covered by* (*or included in*) the __span__
        - __value__ could be *normalized* to a common value, e.g. `"Star Wars: Episode IV - A New Hope"` (not shown in examples)
 
What the *CoNLL* format additionally encodes is __tokenization__ information, in other words, how the string `"star wars new hope"` is split into tokens. Since most Sequence Labeling algorithms operate on the token level, internally the strings are split into tokens, applying *IOB*-like schemes.
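
The correspondence between token-level IOB labels and character-level spans can be sketched as below. The sketch assumes that tokens are joined with single spaces and that tags use prefix notation; `conll_to_entities` is a hypothetical helper, and the output only mimics the `text`/`entities` part of the JSON example (no `intent`).

```python
def conll_to_entities(tokens, tags, otag="O"):
    """Build entity dicts with character offsets from IOB-tagged tokens."""
    text = " ".join(tokens)
    entities, offset = [], 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        offset = end + 1  # skip the joining space
        if tag == otag:
            continue
        prefix, ctype = tag.split("-", 1)
        last = entities[-1] if entities else None
        if prefix == "I" and last and last["entity"] == ctype and last["end"] == start - 1:
            last["end"] = end  # extend the span that ended at the previous token
        else:
            entities.append({"start": start, "end": end, "entity": ctype})
    for e in entities:
        e["value"] = text[e["start"]:e["end"]]
    return {"text": text, "entities": entities}

tokens = ["who", "plays", "luke", "on", "star", "wars", "new", "hope"]
tags = ["O", "O", "B-character.name", "O",
        "B-movie.name", "I-movie.name", "I-movie.name", "I-movie.name"]
for e in conll_to_entities(tokens, tags)["entities"]:
    print(e)
```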

### 3.2. Working with CoNLL Format Corpora: Exercise

Using provided functions (adding your own), compute:
- number of entity types (concepts) (implemented as an example as `get_chunks`)
- number of output labels (i.e. `iob+type`)
- frequency of each entity type in the NL2SparQL4NLU training set
- frequency of each output label in the NL2SparQL4NLU training set
- minimum, maximum, and average span sizes (in number of words) for each entity type and overall.
- minimum, maximum, and average number of entities per sentence

In [1]:
import re

def read_corpus_conll(corpus_file, fs="\t"):
    """
    read corpus in CoNLL format
    :param corpus_file: corpus in conll format
    :param fs: field separator
    :return: corpus
    """
    featn = None  # number of features for consistency check
    sents = []  # list to hold words list sequences
    words = []  # list to hold feature tuples

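    # the context manager closes the file when reading is done
    with open(corpus_file) as f:
        for line in f:
            line = line.strip()
            if len(line) > 0:
                feats = tuple(line.split(fs))
                if not featn:
                    featn = len(feats)
                elif featn != len(feats):
                    raise ValueError("Unexpected number of columns {} ({})".format(len(feats), featn))
                words.append(feats)
            else:
                # a blank line marks the end of a sentence
                if len(words) > 0:
                    sents.append(words)
                    words = []
    # keep the last sentence when the file does not end with a blank line
    if len(words) > 0:
        sents.append(words)
    return sents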

def parse_iob(t):
    m = re.match(r'^([^-]*)-(.*)$', t)
    return m.groups() if m else (t, None)

def get_chunks(corpus_file, fs="\t", otag="O"):
    sents = read_corpus_conll(corpus_file, fs=fs)
    return set([parse_iob(token[-1])[1] for sent in sents for token in sent if token[-1] != otag])

#### Exercise [Optional: Advanced]
Implement bigram HMM
- using Maximum Likelihood Estimation & smoothing compute:
    - emission probabilities
    - transition probabilities
- implement Viterbi algorithm to work with estimated probabilities    

## 3. Sequence Labeling with NLTK
[NLTK](https://www.nltk.org/api/nltk.tag.html) provides implementations of popular sequence labeling algorithms for Part-of-Speech Tagging (including [HMM](https://www.nltk.org/api/nltk.tag.html#module-nltk.tag.hmm)), that can be used for Sequence Labeling in general. 

- Loading & working with CoNLL format corpora in NLTK
- Tagger training & testing (running)

To have a custom tagger that labels input text with our __custom label set__, we need to __train__ it on a corpus annotated with this __custom label set__.

### 3.1. CoNLL format Corpus Loading

NLTK provides method of working with CoNLL format corpora via [ConllCorpusReader](https://www.nltk.org/_modules/nltk/corpus/reader/conll.html).

> The set of columns used by CoNLL-style files can vary from corpus to corpus; the `ConllCorpusReader` constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus. By default columns are split by consecutive whitespaces, with the separator argument you can set a string to split by (e.g. ' ').

`ConllChunkCorpusReader(root, fileids, chunk_types)` is a subclass of `ConllCorpusReader` for reading corpora consisting of 3 columns (it works with 2 columns (without pos-tags), as well):
- words
- part-of-speech tags
- IOB-chunks (IOB-tagged labels)

To load a corpus from file, the function requires us to provide
- `root` - path to corpus
- `fileids` - pattern to use to read files (filename also works)
- `chunk_types` - set of *IOB-tag stripped labels* excluding out-of-span tag `'O'`
    - output of `get_chunks` above


In [2]:
from nltk.corpus.reader.conll import ConllChunkCorpusReader
import nltk.tag.hmm as hmm
import pandas as pd

trn='NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.conll.txt'

# reading our corpus (nltk requires us to provide chunk labels, i.e. concepts)
concepts = sorted(get_chunks(trn))

# loading training & test sets providing
# - path
# - file name pattern (to allow reading multiple files)
# - chunk label set
trn_data = ConllChunkCorpusReader('NL2SparQL4NLU/dataset/',  r'NL2SparQL4NLU.train.conll.txt', concepts)
tst_data = ConllChunkCorpusReader('NL2SparQL4NLU/dataset/',  r'NL2SparQL4NLU.test.conll.txt', concepts)
   
pd.DataFrame(concepts)

|    | 0                    |
|:---|:---------------------|
| 0  | actor.name           |
| 1  | actor.nationality    |
| 2  | actor.type           |
| 3  | award.category       |
| 4  | award.ceremony       |
| 5  | character.name       |
| 6  | country.name         |
| 7  | director.name        |
| 8  | director.nationality |
| 9  | movie.description    |


##### Common Data Access Methods

If the corpus has a relevant annotation, methods below could be used to access it after loading.

| __Method__                | __Returns__ |
|:--------------------------|:------------|
| `corpus.words()`          | list of str
| `corpus.sents()`          | list of (list of str)
| `corpus.paras()`          | list of (list of (list of str))
| `corpus.tagged_words()`   | list of (str,str) tuple
| `corpus.tagged_sents()`   | list of (list of (str,str))
| `corpus.tagged_paras()`   | list of (list of (list of (str,str)))
| `corpus.chunked_sents()`  | list of (Tree w/ (str,str) leaves)
| `corpus.parsed_sents()`   | list of (Tree with str leaves)
| `corpus.parsed_paras()`   | list of (list of (Tree with str leaves))
| `corpus.xml()`            | A single xml ElementTree
| `corpus.raw()`            | str (unprocessed corpus contents)


### 3.2. Training NLTK Taggers

In [3]:
# training hmm on training data
hmm_model = hmm.HiddenMarkovModelTrainer()
hmm_tagger = hmm_model.train(trn_data.tagged_sents())

# tagging sentences in test set
for s in tst_data.sents():
    print("INPUT: {}".format(s))
    print("TAG  : {}".format(hmm_tagger.tag(s)))
    print("PATH : {}".format(hmm_tagger.best_path(s)))
    break
    
# evaluation
accuracy = hmm_tagger.evaluate(tst_data.tagged_sents())

print("Accuracy: {:6.4f}".format(accuracy))

INPUT: ['star', 'of', 'thor']
TAG  : [('star', 'O'), ('of', 'O'), ('thor', 'B-movie.name')]
PATH : ['O', 'O', 'B-movie.name']
Accuracy: 0.9087


### 3.3. Exercises

#### 3.3.1. Tagging with NLTK
- Experiment with different taggers provided in NLTK (e.g. NgramTagger)
- Explore and experiment with different tagger parameters
    - some of them have *cut-off*
- For each report evaluation accuracy

#### 3.3.2. Segmentation 
- Strip concept information from output labels
- Train tagger to predict segmentation labels
- Evaluate segmentation performance

## 4. Evaluation

### 4.1. Basic Concepts

- *Why do we want to evaluate a system / an algorithm's performance?*
    - To measure one or more of its qualities.
    - Proper evaluation criteria are a way to specify the problem.
- *How do we evaluate a system / an algorithm's performance?*

#### 4.1.1. Automatic vs. Manual Evaluation
- Automatic Evaluation (__OBJECTIVE__)
    - Compare the system's output with the gold standard (reference)
        - *Cons*: An effort to produce the gold standard (manual)
        - *Pros*: Re-usable; no additional cost
- Manual Evaluation (__SUBJECTIVE__)
    - Ask human judges to estimate the quality w.r.t. certain criteria
        - For some tasks the gold standard might be unobtainable
        - No agreed automatic evaluation method

#### 4.1.2. Intrinsic vs. Extrinsic Evaluation
- Intrinsic
    - in isolation
    - w.r.t. gold standard (references)
    - e.g. Tagging performance
- Extrinsic
    - as a part of other system
    - usefulness for some other task
    - e.g. effect of POS-Tagger on parsing performance

#### 4.1.3. Black-Box vs. Glass-Box Evaluation
- Black-Box
    - Evaluation of Performance
        - speed
        - accuracy 
        - etc.
- Glass-Box
    - Evaluation of Design algorithm
        - used resources 
        - etc.        

#### 4.1.4. Gold Standard / References
- *Where does the Gold Standard come from?*
    - Annotation by experts (human judges)
- *How do we know that the Gold Standard is good?*
    - Evaluate agreement between the annotators/judges
    - Simplest agreement measure: % of agreed instances

#### 4.1.5. Lower & Upper Bounds of the Performance
- Lower Bound: __Baseline__ 
    - trivial solution to the problem: 
        - *random*: random decision
        - *chance*: random decision w.r.t. the distribution of categories in the training data
        - *majority*: assign everything to the largest category 
        - etc.
- Upper Bound
    - Inter-rater agreement, i.e. human performance.

__A system is expected to perform within the lower and upper bounds.__
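
As an example of a lower bound, the *majority* baseline can be sketched in a few lines. The toy tag lists and the function name `majority_baseline` are assumptions for illustration only.

```python
from collections import Counter

def majority_baseline(train_tags, test_tags):
    """Token-level accuracy of always predicting the most frequent training tag."""
    majority, _ = Counter(train_tags).most_common(1)[0]
    correct = sum(1 for t in test_tags if t == majority)
    return majority, correct / len(test_tags)

train = ["O", "O", "B-movie.name", "O", "I-movie.name"]
test = ["O", "O", "B-movie.name", "O"]
print(majority_baseline(train, test))
```

A trained tagger that does not beat this number has learned little beyond the label distribution.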

#### 4.1.6. Data Split

| Set         | Purpose                                       |
|:------------|:----------------------------------------------|
| Training    | training model, extracting rules, etc.        |
| Development | tuning, optimization, intermediate evaluation |
| Test        | final evaluation (remains unseen)             |


### 4.2. Evaluation Metrics

#### 4.2.1. Accuracy: Simplest Case

$$ Accuracy = \frac{\text{Num. of Correct Decisions}}{\text{Total Num. of Testing Instances}}$$

- Known number of instances
- Single decision for each instance
- Single correct answer for each instance
- All errors are equal

#### 4.2.2. [Contingency Table](https://en.wikipedia.org/wiki/Contingency_table)
Contingency table represents frequencies of correct and wrong decisions by our system, and is useful to describe evaluation metrics.
Hypotheses (__HYP__) on rows & references (__REF__) on columns.


|         | __POS__ | __NEG__ |
|---------|---------|---------|
| __POS__ | TP      | FP      |
| __NEG__ | FN      | TN      |

Notation:

|        |      |                | *Description*      |          |
|--------|------|:---------------|--------------------|:---------|
| __TP__ | *a*  | True Positive  | `HYP: + && REF: +` | correct
| __FP__ | *b*  | False Positive | `HYP: + && REF: -` | error 
| __FN__ | *c*  | False Negative | `HYP: - && REF: +` | error 
| __TN__ | *d*  | True Negative  | `HYP: - && REF: -` | correct


__Accuracy__:

$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$$

- What if __TN__ is infinite or unknown? (e.g.: Number of irrelevant queries to a search engine)

__Precision & Recall__

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

- Two separate values, which makes it harder to compare systems with a single number

__F-Measure__: precision-recall trade-off

- Harmonic Mean of Precision & Recall 
- Usually evenly weighted 
    - $\beta = 1$ is $F_{1}$-measure
    
$$F_\beta = \frac{(1 +\beta^2)*Precision * Recall}{\beta^2*Precision + Recall}$$

$$F_1= \frac{2*Precision * Recall}{Precision + Recall}$$


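$F_{1}$-measure is better suited than token-level accuracy for evaluating Shallow Parsing: most tokens are usually outside of any span (`O`), so accuracy is dominated by the `O` label.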
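
The formulas above translate directly into code. In this sketch the function name `prf` is hypothetical, and returning `0.0` when a denominator is zero is an assumed convention; evaluation scripts may handle that case differently.

```python
def prf(tp, fp, fn, beta=1.0):
    """Precision, recall and F_beta from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

p, r, f1 = prf(tp=8, fp=2, fn=4)
print("P={:.3f} R={:.3f} F1={:.3f}".format(p, r, f1))
```

Note that true negatives do not appear anywhere, which is why these metrics remain usable when __TN__ is unknown or very large.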

#### 4.2.3. Other Topics in Evaluation
- Edit Distance (Error Rate)
- Cross-Validation
- Significance Tests
- Agreement Measures
- Sampling (random, stratified)
- Binary vs. Multi-class classification
- Multi-label data
- Regression
- Re-ranking
- Ensemble Methods
- etc.

### 4.3. Exercises

#### 4.3.1. Evaluation Metrics
Using references and prediction of the HMM-tagger on `NL2SparQL4NLU` dataset
- compute raw TP, FP, FN, TN (on `iob+type`)
- implement evaluation metrics & report:

    - Accuracy
    - Precision
    - Recall
    - $F_{1}$-Measure
 
- compare accuracy to output of NLTK

You can collect predictions as:
```python
hyps = [hmm_tagger.tag(s) for s in tst_data.sents()]
```

#### 4.3.2. CoNLL Eval: Exercise
The CoNLL community developed a Perl script to evaluate *segmentation* and *labeling* performance jointly using IOB information. Such evaluation provides a more accurate assessment of shallow parsing performance than token-level metrics (e.g. NLTK accuracy).

- import `evaluate` function from `conll.py` (example shown)
- evaluate tagger predictions
- compare performances to token-level accuracies
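
To see why the two levels of evaluation can disagree, consider the toy comparison below. It is only a sketch of exact-match span scoring, not the `conll.py` implementation; spans are assumed to be `(type, start, end)` tuples with token offsets, written out by hand for a single sentence.

```python
def span_prf(ref_spans, hyp_spans):
    """Precision, recall and F1 over exactly matching (type, start, end) spans."""
    ref, hyp = set(ref_spans), set(hyp_spans)
    tp = len(ref & hyp)
    p = tp / len(hyp) if hyp else 0.0
    r = tp / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

ref_tags = ["O", "O", "B-character.name", "O",
            "B-movie.name", "I-movie.name", "I-movie.name", "I-movie.name"]
hyp_tags = ["O", "O", "B-character.name", "O",
            "B-movie.name", "I-movie.name", "O", "O"]

# the same annotations expressed as spans (end exclusive)
ref_spans = [("character.name", 2, 3), ("movie.name", 4, 8)]
hyp_spans = [("character.name", 2, 3), ("movie.name", 4, 6)]

token_acc = sum(r == h for r, h in zip(ref_tags, hyp_tags)) / len(ref_tags)
print(token_acc)                       # 6 of 8 tokens are labeled correctly
print(span_prf(ref_spans, hyp_spans))  # only 1 of 2 spans matches exactly
```

A partially recognized span still earns token-level credit but counts as both a false positive and a false negative at the span level.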

In [4]:
# to import conll
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

from conll import evaluate
# for nice tables
import pandas as pd

# getting references
refs = [s for s in tst_data.tagged_sents()]
# getting hypotheses
hyps = [hmm_tagger.tag(s) for s in tst_data.sents()]

results = evaluate(refs, hyps)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

|                      | p     | r     | f     | s   |
|:---------------------|:------|:------|:------|:----|
| director.name        | 0.584 | 0.556 | 0.57  | 81  |
| movie.star_rating    | 1.0   | 0.0   | 0.0   | 1   |
| movie.subject        | 0.737 | 0.636 | 0.683 | 44  |
| movie.release_date   | 0.741 | 0.69  | 0.714 | 29  |
| movie.name           | 0.853 | 0.748 | 0.797 | 473 |
| award.category       | 1.0   | 0.0   | 0.0   | 2   |
| movie.release_region | 0.5   | 0.25  | 0.333 | 4   |
| movie.type           | 1.0   | 0.0   | 0.0   | 4   |
| character.name       | 0.583 | 0.467 | 0.519 | 15  |
| producer.name        | 0.882 | 0.616 | 0.726 | 73  |


#### 4.3.3. OOV: Exercise

- extend unknown word handling function to CoNLL format
- train and evaluate tagger on OOV-processed training data
- compare performance