# Sequence Labeling with Weighted Finite State Machines

- Language Understanding Systems Lab
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

This notebook is part of the Laboratory Work for [Language Understanding Systems class](http://disi.unitn.it/~riccardi/page7/page13/page13.html) of [University of Trento](https://www.unitn.it/en).

__Requirements__

- [OpenFST](http://www.openfst.org/twiki/bin/view/FST/WebHome)
- [OpenGRM](http://www.opengrm.org/twiki/bin/view/GRM/NGramLibrary)
- [NL2SparQL4NLU](https://github.com/esrel/NL2SparQL4NLU) dataset

## 1. Shallow Parsing with WFSMs: Natural Language Understanding

Language Understanding comprises several tasks, one of which is entity extraction (concept tagging). The task is usually approached as Shallow Parsing, where we segment the input into constituents and label them using IOB schemes.

Using Weighted Finite State Machines for the task provides several benefits:
- Even though we saw how to do sequence labeling using HMMs, in real applications the models can become quite complex to solve.
- The task usually involves several components (e.g. *emission* & *transition* probabilities), and WFSMs provide an efficient way to represent and process these components via intersection and composition operations.
    - WFSTs are good at modeling HMM and solving state machine problems
    - Weights can be associated with edges as costs or probabilities (default: cost = negative log probability)
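To make the cost convention concrete, here is a minimal sketch in plain Python (not OpenFST). It assumes costs are negative natural logs; the helper name `to_cost` and the probabilities are illustrative only. It shows that summing costs along a path corresponds to multiplying the probabilities of its arcs.

```python
import math

def to_cost(p):
    """Convert a probability into a cost (negative natural log)."""
    return -math.log(p)

# hypothetical arc probabilities along one path
path_probs = [0.5, 0.25, 0.1]

# multiplying probabilities along the path ...
path_prob = math.prod(path_probs)
# ... gives the same result as summing the arc costs
path_cost = sum(to_cost(p) for p in path_probs)

print(path_cost, to_cost(path_prob))
```

This is why a *shortest* (lowest cost) path corresponds to the *most probable* path.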

### 1.1. Common Sequence Labeling Pipeline

The common approach to concept tagging (or sequence labeling in general) makes use of 3 components:

| Component         | Description |
|:------------------|:------------|
| $$\lambda_{W}$$   | FSA representation of an input sentence |
| $$\lambda_{W2T}$$ | FST to translate words into output labels (e.g. `iob+type`) |
| $$\lambda_{*LM}$$ | FSA Ngram Language Model to score the sequences of output labels |

Consequently, Sequence Labeling ($\lambda$) is performed by composition of these three components as:

$$\lambda = \lambda_{W} \circ \lambda_{W2T} \circ \lambda_{*LM}$$

- It is common to include other components to perform intermediate operations for:
    - generalization of input ($\lambda_{G}$)
    - cleaning of output 
    - etc.

### 1.2. General Setup
Let's start by preparing our workspace.

- What we have is:
    - training & test sets in utterance-per-line format
    - training & test sets in CoNLL format that contain word-tag observations
    
- To work with this data we need functions to:
    - apply frequency cut-offs to handle OOV
    - read the CoNLL format corpus for processing

- For working with WFSMs we need:
    - input symbol table: words
    - output symbol table: tags
    - data in utterance-per-line & CoNLL formats
    
- Our main mechanism for OOV handling will be a frequency cut-off on the lexicon; the replacement of OOV tokens in training and test data will be done using __command-line tools__.
- We will also be writing our own FSMs *extensively*.



#### 1.2.1. Python Functions for Corpus and Lexicon Preprocessing

##### Corpus Reading: Utterance-per-line Format (from Lab on Ngram Modeling)

In [1]:
def read_corpus(corpus_file):
    """
    read corpus into a list-of-lists, splitting sentences into tokens by space (' ')
    :param corpus_file: corpus file in sentence-per-line format (tokenized)
    """
    # use a context manager so that the file is closed after reading
    with open(corpus_file, 'r') as f:
        return [line.strip().split() for line in f]

##### CoNLL Corpus Reading (from Lab on Sequence Labeling)

In [2]:
def read_corpus_conll(corpus_file, fs="\t"):
    """
    read corpus in CoNLL format
    :param corpus_file: corpus in conll format
    :param fs: field separator
    :return: corpus
    """
    featn = None  # number of features for consistency check
    sents = []  # list to hold words list sequences
    words = []  # list to hold feature tuples

    with open(corpus_file) as f:
        for line in f:
            line = line.strip()
            if len(line) > 0:
                feats = tuple(line.split(fs))
                if not featn:
                    featn = len(feats)
                elif featn != len(feats) and len(feats) != 0:
                    raise ValueError("Unexpected number of columns {} ({})".format(len(feats), featn))

                words.append(feats)
            else:
                if len(words) > 0:
                    sents.append(words)
                    words = []
    # keep the last sentence if the file does not end with an empty line
    if len(words) > 0:
        sents.append(words)
    return sents

In [3]:
import re

# Utility function to get labels stripped of IOB
def parse_iob(t):
    m = re.match(r'^([^-]*)-(.*)$', t)
    return m.groups() if m else (t, None)

def get_chunks(corpus_file, fs="\t", otag="O"):
    sents = read_corpus_conll(corpus_file, fs=fs)
    return set([parse_iob(token[-1])[1] for sent in sents for token in sent if token[-1] != otag])
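As a quick illustration of what these helpers compute, the sketch below repeats `parse_iob` and applies it to a small, made-up tag column (the tag list is an assumption, not taken from the dataset).

```python
import re

def parse_iob(t):
    # split an IOB label such as 'B-movie.name' into (prefix, type)
    m = re.match(r'^([^-]*)-(.*)$', t)
    return m.groups() if m else (t, None)

print(parse_iob('B-movie.name'))  # ('B', 'movie.name')
print(parse_iob('O'))             # ('O', None)

# hypothetical tag column of a tiny corpus
tags = ['O', 'B-movie.name', 'I-movie.name', 'O', 'B-actor.name']

# unique types, ignoring the out-of-span tag (what get_chunks collects)
types = {parse_iob(t)[1] for t in tags if t != 'O'}
print(sorted(types))  # ['actor.name', 'movie.name']
```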

In [4]:
# Let's define a function to simplify working with data
# get column from loaded corpus (tokens are tuples)
def get_column(corpus, column=-1):
    return [[word[column] for word in sent] for sent in corpus]

##### Frequency Cut-Off using Corpus (from Lab on Ngram Modeling)

In [5]:
def compute_frequency_list(corpus):
    """
    create frequency list for a corpus
    :param corpus: corpus as list of lists
    """
    frequencies = {}
    for sent in corpus:
        for token in sent:
            frequencies[token] = frequencies.setdefault(token, 0) + 1
    return frequencies

In [6]:
def cutoff(corpus, tf_min=2):
    """
    apply min cutoffs
    :param corpus: corpus as list of lists
    :param tf_min: minimum token frequency for lexicon elements (below removed); default 2
    :return: lexicon as sorted list
    """
    frequencies = compute_frequency_list(corpus)
    return sorted([token for token, frequency in frequencies.items() if frequency >= tf_min])

##### Evaluation (from Lab on Sequence Labeling)
- For evaluation we are going to use `conll.py`'s `evaluate` (in CoNLL eval style)
- Results will be reported using `pandas` Data Frames

In [7]:
# to import conll
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

from conll import evaluate
# for nice tables
import pandas as pd

#### 1.2.2. Setting Up...

##### Preparing Input Symbol Tables (`isyms.txt`)
- Since we will be using corpus files a lot, let's copy them into current directory with shorter names.

In [8]:
%%bash
dpath='NL2SparQL4NLU/dataset/NL2SparQL4NLU'

cp $dpath.train.utterances.txt trn.txt
cp $dpath.test.utterances.txt tst.txt

cp $dpath.train.conll.txt trn.conll
cp $dpath.test.conll.txt tst.conll

- Let's create symbol tables for our data
    - apply cut-off using our functions
    - create symbol table using `ngramsymbols`

In [9]:
trn_data = read_corpus('trn.txt')
trn_lex = cutoff(trn_data)

with open('isyms.trn.txt', 'w') as f:
    f.write("\n".join(trn_lex) + "\n")

In [10]:
%%bash
ngramsymbols isyms.trn.txt isyms.txt

In [11]:
%%bash
# let's compile both training and test set into far using this symbol table
farcompilestrings \
    --symbols=isyms.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    trn.txt trn.far

farcompilestrings \
    --symbols=isyms.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    tst.txt tst.far

As a result we have:
- Symbol table (`isyms.txt`)
    - contains `['<s>', '</s>', '<epsilon>', '<unk>']` that are added automatically
- Training data as FAR with OOV replaced (`trn.far`)
- Test data as FAR with OOV replaced (`tst.far`)
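In this notebook the OOV replacement itself is left to `farcompilestrings` via `--unknown_symbol`. For intuition only, the same idea can be sketched in plain Python; the helper `replace_oov` and the toy corpus are assumptions made for this example.

```python
from collections import Counter

def replace_oov(sentence, lexicon, unk='<unk>'):
    """Replace every token that is not in the lexicon with the unknown symbol."""
    return [tok if tok in lexicon else unk for tok in sentence]

# toy training corpus (hypothetical)
corpus = [['star', 'of', 'thor'], ['star', 'of', 'titanic']]

# frequency cut-off: keep tokens seen at least twice
counts = Counter(tok for sent in corpus for tok in sent)
lexicon = {tok for tok, c in counts.items() if c >= 2}

replaced = replace_oov(['star', 'of', 'avatar'], lexicon)
print(replaced)
```

Tokens that fall below the cut-off in training and tokens never seen in training are treated the same way: both end up as `<unk>`.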

##### Generating Output Symbol Table (`osyms.txt`)
To do sequence labeling we additionally require an *output symbol table*.

In case we know the concepts (__types__), we can build our output symbol table without looking at data.

Since our output labels are composed of segmentation (`iob`) and classification (`type`) labels, we can make sure that each __type__ has all possible IOB prefixes.

- prefix each __type__ with all possible IOB prefixes (i.e. `I-` and `B-`)
- generate symbol table using `ngramsymbols` (as shown above)

In case we don't know the list of __types__, we have to extract __types__ from training data in CoNLL format (stripping `iob` prefix). 

In [12]:
# create a unique list of types
types = get_chunks('trn.conll')

with open('osyms.u.lst.txt', 'w') as f:
    # let's add 'O'
    f.write("O" + "\n")
    for c in sorted(list(types)):
        # prefix each type with segmentation information
        f.write("B-"+ c + "\n")
        f.write("I-"+ c + "\n")

In [13]:
%%bash
# generate output symbol table with iob-prefixed types
ngramsymbols osyms.u.lst.txt osyms.txt

#### 1.2.3. Applying FSTs
- Our objective is to be able to annotate input sentences (as FSAs in FAR) using the machines we are going to build. 
- Our test data has been compiled into a FAR
- We can use `farextract` to extract these sentence FSAs and apply our FSMs to them
    - `farextract --filename_prefix="<odir>" <FAR>` will extract contents of `<FAR>` into directory `<odir>`
    - we can iterate over files in the directory and apply operations (see below)
- We can also create FAR of the processed FSMs using `farcreate` as
    - `farcreate --file_list_input <list of FST filenames> <output FAR>`


In [14]:
%%bash
wdir='wdir'
mkdir -p $wdir

farextract --filename_prefix="$wdir/" tst.far
cp $wdir/tst.txt-0001 sent.fsa

fstprint sent.fsa

0	1	star	star
1	2	of	of
2	3	<unk>	<unk>
3


- For testing we are going to use this `sent.fsa`
- For evaluation we are going to iterate over whole FAR

### 1.3. Baselines
Let's demonstrate the process by building some simple Shallow Parsing models.

### 1.3.1. Random
The simplest solution is to assign output labels __randomly__. 

To achieve this we need to:
- implement an FST that translates our words into output symbols ($\lambda_{W2T}$) with equal cost, or no cost at all (i.e. unweighted FST)
    - let's call it $\lambda_{W2T_{U}}$ for [universe](https://en.wikipedia.org/wiki/Universe_(mathematics)).
- compose it with our sentence
- choose a random path in FST

The FST $\lambda_{W2T_{U}}$ represents the search space for $p(w_i|t_i)$ without being exposed to any observation. It is built using only our knowledge of the __domain__:
- vocabulary of the language (input symbols)
- concept __types__ in our domain

All input and output symbols are included, and all translations between them are possible.

Since we have no model yet, the whole pipeline is:

$$\lambda_{R} = \lambda_{W} \circ \lambda_{W2T_{U}}$$

- __*random path*__ here is opposed to __*best path*__ or __*shortest path*__

- Let's define a function in python to write FST specification given input and output symbol tables as below
    - we will be using it a lot

In [15]:
def make_w2t(isyms, osyms, out='w2t.tmp'):
    special = {'<epsilon>', '<s>', '</s>'}
    oov = '<unk>'  # unknown symbol
    state = '0'    # wfst specification state
    fs = " "       # wfst specification column separator
    
    ist = sorted(list(set([line.strip().split("\t")[0] for line in open(isyms, 'r')]) - special))
    ost = sorted(list(set([line.strip().split("\t")[0] for line in open(osyms, 'r')]) - special))
    
    with open(out, 'w') as f:
        for i in range(len(ist)):
            for j in range(len(ost)):
                f.write(fs.join([state, state, ist[i], ost[j]]) + "\n")
        f.write(state + "\n")
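The specification written by `make_w2t` has one arc per (input symbol, output symbol) pair, all on a single state. A minimal sketch with toy in-memory symbol lists makes this visible; the helper `w2t_lines` is hypothetical and works on lists instead of symbol table files.

```python
from itertools import product

def w2t_lines(words, tags, state='0'):
    """Return text-format lines of a one-state unweighted transducer."""
    lines = [" ".join([state, state, w, t])
             for w, t in product(sorted(words), sorted(tags))]
    lines.append(state)  # the single state is also final
    return lines

words = ['of', 'star', '<unk>']
tags = ['O', 'B-movie.name', 'I-movie.name']

lines = w2t_lines(words, tags)
print("\n".join(lines))
```

With 3 words and 3 tags the specification has 9 arc lines plus the final state line.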

In [16]:
make_w2t('isyms.txt', 'osyms.txt', out='w2t_u.txt')

In [17]:
%%bash
# Let's compile it
fstcompile \
    --isymbols=isyms.txt \
    --osymbols=osyms.txt \
    --keep_isymbols \
    --keep_osymbols \
    w2t_u.txt w2t_u.bin

fstinfo w2t_u.bin | head -n 8

fst type                                          vector
arc type                                          standard
input symbol table                                isyms.txt
output symbol table                               osyms.txt
# of states                                       1
# of arcs                                         45648
initial state                                     0
# of final states                                 1


#### Exercise
- Compute input and output symbol table sizes
- Compare their product to `# of arcs`

#### Testing

- Note the usage of `fstrandgen` instead of `fstshortestpath` to get __random__ paths of the FST.
- All tokens will be predicted as `O`, if we use `fstshortestpath`.
    - Bonus Question: *Why?* (try uncommenting & running)
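For a one-state unweighted transducer, taking a random path amounts to drawing a label for every token independently. The sketch below mimics that in plain Python, assuming arcs are chosen uniformly at random; `random_labels` and the label list are made up for this example.

```python
import random

def random_labels(tokens, labels, seed=None):
    """Assign a uniformly random label to every token."""
    rng = random.Random(seed)
    return [(tok, rng.choice(labels)) for tok in tokens]

labels = ['O', 'B-movie.name', 'I-movie.name', 'B-country.name']
tokens = ['star', 'of', '<unk>']

tagged = random_labels(tokens, labels, seed=0)
print(tagged)
```

Nothing prevents such a labeling from being ill-formed with respect to the IOB scheme (e.g. an `I-` tag without a preceding `B-` tag), as the output of the random baseline shows.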

In [18]:
%%bash
fstcompose sent.fsa w2t_u.bin | fstrandgen | fstrmepsilon | fsttopsort | fstprint --isymbols=isyms.txt
# fstcompose sent.fsa w2t_u.bin | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	B-actor.type
1	2	of	I-actor.nationality
2	3	<unk>	B-country.name
3


#### Evaluation
- For evaluation we are going to use `conll.py`'s `evaluate` function (provided in the `src` directory)
- We first collect the predictions from our model and store them in a file

In [19]:
%%bash
wdir='wdir'
farr=($(ls $wdir))

for f in ${farr[@]}
do
    fstcompose $wdir/$f w2t_u.bin | fstrandgen | fstrmepsilon | fsttopsort | fstprint --isymbols=isyms.txt
done > w2t_u.out

- Below is the function to process this output and read it for evaluation: `read_fst4conll`
    - additionally, it substitutes `<unk>` in output labels with `O`, for the evaluation function to work (there is no `<unk>` in the IOB scheme)
- It is a modified version of `read_corpus_conll`, which we will use to load references

In [53]:
# modified version to support fst-output
def read_fst4conll(fst_file, fs="\t", oov='<unk>', otag='O', sep='+', split=False):
    """
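    :param fst_file: output of fstprint (one arc or final state per line)
    :param fs: field separator
    :param oov: token to map to otag (we need to get rid of <unk> in labels)
    :param otag: otag symbol
    :param sep: separator between word and tag in joint `word+tag` output symbols
    :param split: if True, keep only the tag part of `word+tag` output symbols
    :return: corpus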
    """
    sents = []  # list to hold words list sequences
    words = []  # list to hold feature tuples

    for line in open(fst_file):
        line = line.strip()
        if len(line.strip()) > 0:
            feats = tuple(line.strip().split(fs))
            # arc has minimum 4 columns (source, destination, input, output), else final state
            if len(feats) >= 4:
                ist = feats[2]  # 3rd column (input)
                ost = feats[3]  # 4th column (output)
                # replace '<unk>' with 'O'
                ost = otag if ost == oov else ost
                # ignore for now
                ost = ost.split(sep)[1] if split and ost != otag else ost
                
                words.append((ist, ost))
            else:
                sents.append(words)
                words = []
        else:
            if len(words) > 0:
                sents.append(words) 
                words = []
    return sents

In [21]:
refs = read_corpus_conll('tst.conll')
hyps = read_fst4conll('w2t_u.out')

results = evaluate(refs, hyps)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

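|                      | p     | r     | f     | s  |
|:---------------------|------:|------:|------:|---:|
| country.name         | 0.007 | 0.032 | 0.011 | 62 |
| movie.location       | 0.0   | 0.0   | 0.0   | 7  |
| award.ceremony       | 0.004 | 0.143 | 0.007 | 7  |
| award.category       | 0.0   | 0.0   | 0.0   | 2  |
| director.nationality | 0.0   | 0.0   | 0.0   | 1  |
| character.name       | 0.003 | 0.067 | 0.006 | 15 |
| movie.description    | 0.0   | 0.0   | 0.0   | 0  |
| movie.gross_revenue  | 0.0   | 0.0   | 0.0   | 5  |
| movie.genre          | 0.004 | 0.028 | 0.007 | 36 |
| actor.nationality    | 0.0   | 0.0   | 0.0   | 1  |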


### 1.3.2. Output Symbol Priors

The simplest form for the third component is to use output label priors, i.e. unigram probabilities of output labels. To model that, we can:
- train a unigram language model using `ngramcount` & `ngrammake` (let's call it $\lambda_{LM_{1}}$)
- compose it with the $\lambda_{W2T_{U}}$ so that the whole pipeline becomes:

$$\lambda = \lambda_{W} \circ \lambda_{W2T_{U}} \circ \lambda_{LM_{1}}$$

- Since the `O` tag is the most frequent, we will have a model that always predicts it.
- We can represent our model as:

$$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(t_i)}$$
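A small sketch shows why a model built only from tag priors always predicts the most frequent tag. It uses plain Python with made-up tag sequences, and for simplicity it ignores the end-of-sentence event that an ngram model also counts.

```python
import math
from collections import Counter

# hypothetical tag sequences from a training set
tag_sents = [['O', 'O', 'B-movie.name'],
             ['O', 'B-actor.name', 'I-actor.name', 'O']]

counts = Counter(t for sent in tag_sents for t in sent)
total = sum(counts.values())

# unigram cost of each tag: -log p(t)
cost = {t: -math.log(c / total) for t, c in counts.items()}

# the cost does not depend on the word, so the cheapest tag wins everywhere
best_tag = min(cost, key=cost.get)
prediction = [(w, best_tag) for w in ['star', 'of', '<unk>']]
print(prediction)
```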

__Considerations__
- In this baseline we model __observations__:
    - our models are based on observations
    - OOV is due to scarcity of observations
- Consequently, we need to change the output symbol table of $\lambda_{W2T_{U}}$ to output only tags present in the data or `<unk>`.
    - it may happen that an `iob+type` combination never appears in our training data
    - output symbol tables of $\lambda_{W2T}$ (FST) and symbol table of $\lambda_{LM_{1}}$ (FSA) have to match

#### "Training" a Model
A new $\lambda_{*LM}$ model is built following these steps:
1. prepare training data for model (in required format)
2. prepare symbol table for that data
    - apply OOV handling (you can use any of the approaches to introduce `<unk>`)
3. compile training data into FAR using this symbol table
4. estimate model probabilities for $\lambda_{*LM}$ (i.e. train ngram model)


A new $\lambda_{W2T}$ is created (updated) each time we change the symbol table of the $\lambda_{*LM}$.

- If we do not plan to estimate $p(w_i|t_i)$ in $\lambda_{W2T}$, we can create the FST as we did for $\lambda_{W2T_{U}}$ (and keeping input symbol table the same)

In [22]:
# create training data in utterance-per-line format for output symbols (t - tags)
trn = read_corpus_conll('trn.conll')
tags = get_column(trn, column=-1)

# write data
with open('trn.t.txt', 'w') as f:
    for s in tags:
        f.write(" ".join(s) + "\n")
        
tlex = cutoff(tags)
with open('osyms.t.lst.txt', 'w') as f:
    f.write("\n".join(tlex) + "\n")

In [23]:
%%bash
# make symbol table
ngramsymbols osyms.t.lst.txt osyms.t.txt
# compile data into FAR again
farcompilestrings \
    --symbols=osyms.t.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    trn.t.txt trn.t.far

- Let's train a unigram language model.

In [24]:
%%bash
ngramcount --order=1 trn.t.far trn.t1.cnt
ngrammake trn.t1.cnt t1.lm
ngraminfo t1.lm

# of states                                       1
# of ngram arcs                                   38
# of backoff arcs                                 0
initial state                                     0
unigram state                                     -1
# of final states                                 1
ngram order                                       1
# of 1-grams                                      39
well-formed                                       y
normalized                                        y


- Let's create a new $\lambda_{W2T}$ (let's call it $\lambda_{W2T_{T}}$ for "tags"):
    - following the same procedure we followed for $\lambda_{W2T_{U}}$, but using:
        - `isyms.txt` as input symbol table
        - `osyms.t.txt` as output symbol table
    - allowing `<unk> <unk>` and *word*-`<unk>` arcs

In [25]:
make_w2t('isyms.txt', 'osyms.t.txt', out='w2t_t.txt')

In [26]:
%%bash
# Let's compile it
fstcompile \
    --isymbols=isyms.txt \
    --osymbols=osyms.t.txt \
    --keep_isymbols \
    --keep_osymbols \
    w2t_t.txt w2t_t.bin

fstinfo w2t_t.bin | head -n 8

fst type                                          vector
arc type                                          standard
input symbol table                                isyms.txt
output symbol table                               osyms.t.txt
# of states                                       1
# of arcs                                         36138
initial state                                     0
# of final states                                 1


In [27]:
%%bash
fstcompose sent.fsa w2t_t.bin | fstcompose - t1.lm | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	O	0.476697922
1	2	of	O	0.476697922
2	3	<unk>	O	0.476697922
3	2.00510979


In [28]:
%%bash
wdir='wdir'
farr=($(ls $wdir))

for f in ${farr[@]}
do
    fstcompose $wdir/$f w2t_t.bin | fstcompose - t1.lm |\
        fstshortestpath | fstrmepsilon | fsttopsort | fstprint --isymbols=isyms.txt
done > w2t_t.t1.out

In [29]:
refs = read_corpus_conll('tst.conll')
hyps = read_fst4conll('w2t_t.t1.out')

results = evaluate(refs, hyps)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

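|                      | p | r   | f   | s  |
|:---------------------|--:|----:|----:|---:|
| country.name         | 1 | 0.0 | 0.0 | 62 |
| movie.location       | 1 | 0.0 | 0.0 | 7  |
| award.ceremony       | 1 | 0.0 | 0.0 | 7  |
| award.category       | 1 | 0.0 | 0.0 | 2  |
| director.nationality | 1 | 0.0 | 0.0 | 1  |
| character.name       | 1 | 0.0 | 0.0 | 15 |
| movie.gross_revenue  | 1 | 0.0 | 0.0 | 5  |
| movie.genre          | 1 | 0.0 | 0.0 | 36 |
| actor.nationality    | 1 | 0.0 | 0.0 | 1  |
| movie.type           | 1 | 0.0 | 0.0 | 4  |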


- The model still has $F_1=0$, since `O` is the tag with highest prior.
- Observe the weights in the output
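The weights printed for the test sentence can be read back as probabilities. The sketch below assumes they are negative natural logs (following the "cost = negative log probability" convention) and that the weight on the final state is the cost of ending the sentence; both readings are assumptions, and the numbers are copied from the output of the unigram pipeline on `sent.fsa`.

```python
import math

arc_cost = 0.476697922    # cost printed on each `O` arc
final_cost = 2.00510979   # cost printed on the final state

p_o = math.exp(-arc_cost)      # implied prior of the `O` tag
p_end = math.exp(-final_cost)  # implied probability of ending the sentence

# cost of the whole path: three `O` arcs plus the final weight
path_cost = 3 * arc_cost + final_cost

print(round(p_o, 3), round(p_end, 3), round(path_cost, 3))
```

Every arc carries the same weight because a unigram model scores a tag independently of the word and of the neighbouring tags.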

### 1.3.4. Exercises
- Compare sizes of $\lambda_{W2T_{U}}$ and $\lambda_{W2T_{T}}$ for `# of arcs`
- Unigram models & $\lambda_{W2T}$
    - Test pipeline: $\lambda = \lambda_{W} \circ \lambda_{W2T_{U}} \circ \lambda_{LM_{1}}$


- Bigram models: train a *tag* bigram model (let's call it $\lambda_{LM_{2}}$)

    - Test pipeline with $\lambda_{W2T_{U}}$: $\lambda = \lambda_{W} \circ \lambda_{W2T_{U}} \circ \lambda_{LM_{2}}$
    - Test pipeline with $\lambda_{W2T_{T}}$: $\lambda = \lambda_{W} \circ \lambda_{W2T_{T}} \circ \lambda_{LM_{2}}$

### 1.3.5. Maximum Likelihood Estimation (Emission Probabilities)
- So far we haven't explored the relation between input and output
- The next thing we can do is to expose our model to observations and estimate $p(w_{i}|t_{i})$ from data.
- We can use `ngramcount` and `ngrammake` to make a smoothed probability model (we are using default, i.e. no parameters). 
- We need to estimate probabilities like we would estimate bigram probabilities, thus:
    - prepare lexicon with *tags* and *words*
    - read CoNLL format corpus into far (token per line, preprocessed)
    - count bigrams
    - make a bigram language model
    - print bigrams with weights (negative log probabilities)
    - select bigrams (the printout also contains unigrams, as well as `<s>` and `</s>` bigrams)
    - convert to FST & compile
    
- Let's call the model $\lambda_{W2T_{MLE}}$
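Before using the command-line tools, it may help to see the estimate itself. The sketch below computes the unsmoothed Maximum Likelihood Estimate $p(w_i|t_i) = c(t_i, w_i) / c(t_i)$ in plain Python on made-up observations; note that the model produced by `ngrammake` is smoothed, so its values are expected to differ somewhat from this raw ratio.

```python
import math
from collections import Counter

# hypothetical (word, tag) observations, as read from a CoNLL corpus
pairs = [('brad', 'B-actor.name'), ('pitt', 'I-actor.name'),
         ('tom', 'B-actor.name'), ('brad', 'B-actor.name'),
         ('movies', 'O')]

pair_counts = Counter(pairs)               # c(t, w)
tag_counts = Counter(t for _, t in pairs)  # c(t)

def p_emit(word, tag):
    """Unsmoothed MLE of p(word | tag); the tag must have been observed."""
    return pair_counts[(word, tag)] / tag_counts[tag]

p = p_emit('brad', 'B-actor.name')
print(p, -math.log(p))  # probability and the corresponding cost
```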

In [30]:
%%bash
# let's use our symbol tables (since the cut-off has been applied to both of them)
cat isyms.txt osyms.t.txt | cut -f 1 | sort | uniq > msyms.m.lst.txt
ngramsymbols msyms.m.lst.txt msyms.t.txt

# let's convert data to ngrams
cat trn.conll | sed '/^$/d' | awk '{print $2,$1}' > trn.w2t.txt

# compile to far
farcompilestrings \
    --symbols=msyms.t.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    trn.w2t.txt trn.w2t.far
    
# count bigrams
ngramcount --order=2 trn.w2t.far trn.w2t.cnt
# make a model
ngrammake trn.w2t.cnt trn.w2t.lm

# print ngram probabilities as negative logs
ngramprint \
    --symbols=msyms.t.txt\
    --negativelogs \
    trn.w2t.lm trn.w2t.probs

- Let's define a python function to convert probabilities printout to W2T FST

In [31]:
def make_w2t_mle(probs, out='w2t_mle.tmp'):
    special = {'<epsilon>', '<s>', '</s>'}
    oov = '<unk>'  # unknown symbol
    state = '0'    # wfst specification state
    fs = " "       # wfst specification column separator
    otag = 'O'
    mcn = 3        # minimum column number
    
    lines = [line.strip().split("\t") for line in open(probs, 'r')]

    with open(out, 'w') as f:
        for line in lines:
            ngram = line[0]
            ngram_words = ngram.split()  # by space
            if len(ngram_words) == 2:
                if set(ngram_words).isdisjoint(set(special)):
                    if ngram_words[0] in [otag, oov]:
                        f.write(fs.join([state, state] + ngram_words + [line[1]]) + "\n")
                    elif ngram_words[0].startswith("B-") or ngram_words[0].startswith("I-"):
                        f.write(fs.join([state, state] + line) + "\n")
        f.write(state + "\n")

In [32]:
make_w2t_mle('trn.w2t.probs', out='trn.w2t_mle.txt')

In [33]:
%%bash
fstcompile \
    --isymbols=osyms.t.txt \
    --osymbols=isyms.txt \
    --keep_isymbols \
    --keep_osymbols \
    trn.w2t_mle.txt w2t_mle.bin
    
# we need to invert it to have words on input
fstinvert w2t_mle.bin w2t_mle.inv.bin

fstinfo w2t_mle.inv.bin | head -n 8

fst type                                          vector
arc type                                          standard
input symbol table                                isyms.txt
output symbol table                               osyms.t.txt
# of states                                       1
# of arcs                                         1513
initial state                                     0
# of final states                                 1


#### Testing
Let's test it:

In [34]:
%%bash
fstcompose sent.fsa w2t_mle.inv.bin | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	B-movie.name	3.22130394
1	2	of	I-movie.name	3.02857399
2	3	<unk>	B-director.nationality	0.694147706
3


- The pipeline above represents 

$$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_i|t_i)}$$

- To extend it to unigram tagging model we need to compose it with  $\lambda_{LM_{1}} = p(t_i)$ 

$$\lambda = \lambda_{W} \circ \lambda_{W2T_{MLE}} \circ \lambda_{LM_{1}}$$

$$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_i|t_i)p(t_i)}$$ 
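In cost space the product $p(w_i|t_i)p(t_i)$ becomes a sum of two costs, and since a unigram model has no context, the best tag can be chosen token by token. A minimal sketch with made-up cost tables (all numbers and the helper `tag_unigram` are illustrative assumptions):

```python
# hypothetical emission costs: -log p(w|t)
emit_cost = {
    'star': {'O': 5.0, 'B-movie.name': 3.2},
    'of':   {'O': 2.0, 'I-movie.name': 3.0},
}
# hypothetical prior costs: -log p(t)
prior_cost = {'O': 0.5, 'B-movie.name': 2.9, 'I-movie.name': 3.1}

def tag_unigram(tokens):
    """Pick, for each token, the tag with the lowest emission + prior cost."""
    tagged = []
    for tok in tokens:
        cands = emit_cost[tok]
        best = min(cands, key=lambda t: cands[t] + prior_cost[t])
        tagged.append((tok, best))
    return tagged

# emission costs alone prefer B-movie.name for 'star' ...
emission_only = min(emit_cost['star'], key=emit_cost['star'].get)
# ... but adding the prior shifts the decision
with_prior = tag_unigram(['star', 'of'])
print(emission_only, with_prior)
```

The prior can override the emission preference, which is the effect to look for when comparing the two pipelines on the test sentence.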

In [35]:
%%bash
fstcompose sent.fsa w2t_mle.inv.bin | fstcompose - t1.lm | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	B-movie.name	6.09388542
1	2	of	O	3.86926198
2	3	<unk>	O	4.46324492
3	2.00510979


#### Exercise 1: Maximum Likelihood Estimation
- using `ngramprint` verify the Maximum Likelihood Estimation method (without `--negativelogs` it prints raw probabilities)
    - print bigram counts from $\lambda_{W2T_{MLE}}$ (output of `ngramcount`)
    - print unigram counts from either $\lambda_{LM_{1}}$ or $\lambda_{W2T_{MLE}}$ (output of `ngramcount`)
    - using these counts compute probability of $p($ `brad|B-actor.name` $)$
    - extract probability of $p($ `brad|B-actor.name` $)$ from $\lambda_{W2T_{MLE}}$ (output of `ngrammake`)
    - compare values
    - repeat the procedure using counts from methods developed for the lab on ngram modeling.

#### Exercise 2: Markov Model Tagger
- Evaluate the MLE pipeline using bigram model on tags, i.e.

$$\lambda = \lambda_{W} \circ \lambda_{W2T_{MLE}} \circ \lambda_{LM_{2}}$$

$$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_i|t_i)p(t_i|t_{i-1})}$$ 

- compare performances to the HMM tagger from previous lab (NLTK)
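For reference, the search that the shortest path performs over such a composed machine can be sketched as a Viterbi recursion in cost space. This is plain Python with made-up costs, not the OpenFST pipeline; it ignores the end-of-sentence cost, and the function name, the `big` penalty for unseen events and all numbers are assumptions.

```python
def viterbi(tokens, tags, emit, trans, start='<s>', big=1e9):
    """Return the min-cost tag sequence under emission and bigram costs."""
    # best[i][t] = (cost of the best path ending in tag t at position i, previous tag)
    best = [{t: (trans.get((start, t), big) + emit.get((tokens[0], t), big), None)
             for t in tags}]
    for i in range(1, len(tokens)):
        column = {}
        for t in tags:
            prev = min(tags, key=lambda p: best[i - 1][p][0] + trans.get((p, t), big))
            cost = (best[i - 1][prev][0] + trans.get((prev, t), big)
                    + emit.get((tokens[i], t), big))
            column[t] = (cost, prev)
        best.append(column)
    # backtrace from the cheapest tag at the last position
    tag = min(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

tags = ['O', 'B-movie.name', 'I-movie.name']
emit = {('star', 'O'): 3.0, ('star', 'B-movie.name'): 3.2,
        ('of', 'O'): 2.0, ('of', 'I-movie.name'): 3.0,
        ('thor', 'O'): 4.0, ('thor', 'I-movie.name'): 1.0}
trans = {('<s>', 'O'): 0.5, ('<s>', 'B-movie.name'): 1.5,
         ('O', 'O'): 0.5, ('O', 'B-movie.name'): 2.0,
         ('B-movie.name', 'I-movie.name'): 0.3, ('B-movie.name', 'O'): 1.5,
         ('I-movie.name', 'I-movie.name'): 0.5, ('I-movie.name', 'O'): 1.0}

path = viterbi(['star', 'of', 'thor'], tags, emit, trans)
print(path)
```

In this toy example `star` on its own would be cheaper as `O`, but the transition costs make the span `B-movie.name I-movie.name I-movie.name` the cheapest full sequence.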

## 2. Joint Distribution Modeling

As we have seen, sequence labeling for Language Understanding can be approached using Hidden Markov Models (similar to Part-of-Speech Tagging) and modeled as in the table below (__HMM__). Stochastic Conceptual Language Models for Spoken Language Understanding in [Raymond & Riccardi (2007)](https://disi.unitn.it/~riccardi/papers2/IS07-GenerDiscrSLU.pdf) (__R&R__), in contrast, model words and tags jointly.


| Model   | Equation |
|:--------|:----------
| __HMM__ | $$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_i|t_i)p(t_i|t_{i-N+1}^{i-1})}$$
| __R&R__ | $$p(w_{1}^n,t_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_{i}t_{i}|w_{i-N+1}^{i-1}t_{i-N+1}^{i-1})}$$


From implementation perspective, joint modeling implies the following:
- we need to train $\lambda_{SCLM}$ on word-tag pairs
    - create corpus in a format for estimating $p(w_i,t_i|w_{i-N+1}^{i-1}t_{i-N+1}^{i-1})$
    - create symbol tables
- we need to change $\lambda_{W2T}$ to output *word-tag* pairs (let's call it $\lambda_{W2WT}$)
    - create FST like above for $\lambda_{W2WT}$ ($\lambda_{W2WT_{WT}}$ - to differentiate from $\lambda_{W2WT_{U}}$ that contains all possible combinations)
- we also need to change our input symbol table to accommodate OOV words due to joint modeling

#### Preparing Symbol Tables
- Let's create the output symbol table the same way we did for $\lambda_{W2T_{T}}$
- Let's create the input symbol table by taking $w$ from each $w,t$ pair

In [36]:
# create training data in utterance-per-line format for output symbols (w+t)
trn = read_corpus_conll('trn.conll')
wt_sents = [["+".join(w) for w in s] for s in trn]
wt_osyms = cutoff(wt_sents)
wt_isyms = [w.split('+')[0] for w in wt_osyms]

with open('trn.wt.txt', 'w') as f:
    for s in wt_sents:
        f.write(" ".join(s) + "\n")
        
with open('osyms.wt.lst.txt', 'w') as f:
    f.write("\n".join(wt_osyms) + "\n")
    
with open('isyms.wt.lst.txt', 'w') as f:
    f.write("\n".join(wt_isyms) + "\n")

In [37]:
%%bash
ngramsymbols osyms.wt.lst.txt osyms.wt.txt
ngramsymbols isyms.wt.lst.txt isyms.wt.txt

- Let's:
    - compile our processed data into FAR
    - train ngram language models on it - $\lambda_{SCLM}$

#### Training Conceptual Language Model

In [38]:
%%bash
# compile data into FAR
farcompilestrings \
    --symbols=osyms.wt.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    trn.wt.txt trn.wt.far

# train ngram model
ngramcount --order=2 trn.wt.far trn.wt.cnt
ngrammake trn.wt.cnt wt2.lm
ngraminfo wt2.lm

# of states                                       1096
# of ngram arcs                                   6179
# of backoff arcs                                 1095
initial state                                     1
unigram state                                     0
# of final states                                 533
ngram order                                       2
# of 1-grams                                      1095
# of 2-grams                                      5617
well-formed                                       y
normalized                                        y


#### Building W2WT FST

- Let's build unweighted $\lambda_{W2WT_{WT}}$, using
    - input symbol table `isyms.wt.txt`
    - output symbol table `osyms.wt.txt`

In [39]:
def make_w2t_wt(isyms, sep='+', out='w2wt.tmp'):
    special = {'<epsilon>', '<s>', '</s>'}
    oov = '<unk>'  # unknown symbol
    state = '0'    # wfst specification state
    fs = " "       # wfst specification column separator
    
    ist = sorted(list(set([line.strip().split("\t")[0] for line in open(isyms, 'r')]) - special))
    
    with open(out, 'w') as f:
        for e in ist:
            # use the `sep` argument instead of a hard-coded '+'
            f.write(fs.join([state, state, e.split(sep)[0], e]) + "\n")
        f.write(state + "\n")

In [40]:
make_w2t_wt('osyms.wt.txt', out='w2wt_wt.txt')

In [41]:
%%bash
# Let's compile it
fstcompile \
    --isymbols=isyms.wt.txt \
    --osymbols=osyms.wt.txt \
    --keep_isymbols \
    --keep_osymbols \
    w2wt_wt.txt w2wt_wt.bin

fstinfo w2wt_wt.bin | head -n 8

fst type                                          vector
arc type                                          standard
input symbol table                                isyms.wt.txt
output symbol table                               osyms.wt.txt
# of states                                       1
# of arcs                                         1094
initial state                                     0
# of final states                                 1


- Let's test the whole pipeline $\lambda_{W} \circ \lambda_{W2WT_{WT}} \circ \lambda_{SCLM_{2}}$

#### Preparing Test Data
- since we have changed input symbol table we need to recompile & extract our test data

In [42]:
%%bash
farcompilestrings \
    --symbols=isyms.wt.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    tst.txt tst.wt.far

wdir='wdir_wt'
mkdir -p $wdir

farextract --filename_prefix="$wdir/" tst.wt.far
cp $wdir/tst.txt-0001 sent.wt.fsa

fstprint sent.wt.fsa

0	1	star	star
1	2	of	of
2	3	<unk>	<unk>
3


In [43]:
%%bash
fstcompose sent.wt.fsa w2wt_wt.bin | fstcompose - wt2.lm | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	star+O	7.93891811
1	2	of	of+O	1.55421352
2	3	<unk>	<unk>	2.84977818
3	1.10391009


#### Evaluation
- Since on the output we have `word+tag`, we need to post-process the output for evaluation
- the function `read_fst4conll` already has that functionality via `split=True` and `sep='+'`
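The post-processing step can be sketched on its own as follows. The helper `split_label` is hypothetical; unlike `read_fst4conll`, which uses `split(sep)[1]`, this variation splits on the last separator, so a word that itself contains `+` still yields the tag.

```python
def split_label(label, sep='+', oov='<unk>', otag='O'):
    """Map a joint `word+tag` output symbol to its tag."""
    if label == oov:
        # there is no <unk> in the IOB scheme, so map it to the out-of-span tag
        return otag
    # split on the last separator, assuming tags never contain `sep`
    return label.rsplit(sep, 1)[1]

outputs = ['star+O', 'of+O', '<unk>']
print([split_label(o) for o in outputs])
```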

In [47]:
%%bash
wdir='wdir_wt'
farr=($(ls $wdir))

for f in ${farr[@]}
do
    fstcompose $wdir/$f w2wt_wt.bin | fstcompose - wt2.lm |\
        fstshortestpath | fstrmepsilon | fsttopsort | fstprint --isymbols=isyms.wt.txt
done > w2wt_wt.wt2.out

In [54]:
refs = read_corpus_conll('tst.conll')
hyps = read_fst4conll('w2wt_wt.wt2.out', split=True)

results = evaluate(refs, hyps)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

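|                      | p     | r     | f     | s  |
|:---------------------|------:|------:|------:|---:|
| country.name         | 0.629 | 0.71  | 0.667 | 62 |
| movie.location       | 0.0   | 0.0   | 0.0   | 7  |
| award.ceremony       | 0.714 | 0.714 | 0.714 | 7  |
| award.category       | 1.0   | 0.0   | 0.0   | 2  |
| director.nationality | 1.0   | 0.0   | 0.0   | 1  |
| character.name       | 0.667 | 0.267 | 0.381 | 15 |
| movie.gross_revenue  | 0.0   | 0.0   | 0.0   | 5  |
| movie.genre          | 0.963 | 0.722 | 0.825 | 36 |
| actor.nationality    | 1.0   | 1.0   | 1.0   | 1  |
| movie.type           | 1.0   | 0.0   | 0.0   | 4  |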


#### Exercise: Full $\lambda_{W2WT_{U}}$
- Implement $\lambda_{W2WT_{U}}$ using 'full' input and output symbol tables (`isyms.txt` and `osyms.txt`)
- Test the pipeline: $\lambda_{W} \circ \lambda_{W2WT_{U}} \circ \lambda_{SCLM_{2}}$
    - Observe the issues

#### Exercise
- Compare each pipeline in terms of:
    - size of input symbol table
    - size of output symbol table
    - size (number of arcs) of $\lambda_{W2T}$
    - size of $\lambda_{*LM}$

## 3. Common Improvements

- Training an ngram language model on data that contains only tags (i.e. $\lambda_{*LM}$) has one __big issue__: the out-of-span tag (`'O'`) is very frequent, so the tag sequences provide too little context to learn a good ngram model.
- Joint modeling of words and tags, i.e. $\lambda_{SCLM}$, on the other hand, has very specific (and therefore less frequent) contexts.
- There are two common enhancements to these models:
    - removing the out-of-span tag `'O'` from $\lambda_{*LM}$ to provide context for the other tags
    - generalizing the input into __classes__, i.e. $\lambda_{G}$, so that the data is less sparse
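
One way to realize the first enhancement is to replace every `'O'` with the word it labels before training the language model, so that in-span tags are conditioned on the surrounding words. This is a sketch under that assumption, not necessarily the approach intended in the lab; `replace_out_of_span` and the example sentence are made up for illustration.

```python
def replace_out_of_span(sentence, out_tag='O'):
    """Build an LM training sequence from (word, tag) pairs.

    ASSUMPTION: every out-of-span tag is replaced by the word itself,
    while in-span tags (B-*/I-*) are kept unchanged.
    """
    return [word if tag == out_tag else tag for word, tag in sentence]


# Illustrative sentence, not taken from the dataset
sentence = [
    ('who', 'O'),
    ('plays', 'O'),
    ('luke', 'B-character.name'),
    ('skywalker', 'I-character.name'),
]
lm_sequence = replace_out_of_span(sentence)
print(' '.join(lm_sequence))
```

Note that this changes the output symbol table: it then contains both tags and words, so $\lambda_{W2T}$ has to be built accordingly.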

### Input Generalization (Normalization)

The Language Understanding pipeline (as presented during the lectures) is:

$$\lambda = \lambda_{W} \circ \lambda_{G} \circ \lambda_{W2T} \circ \lambda_{*LM}$$

The function of $\lambda_{G}$ in this pipeline is to *generalize* the input, reducing sparsity.

#### Normalization
In Natural Language Processing it is common to __normalize__ (pre-process) the input data to reduce sparsity (e.g. [textacy](https://chartbeat-labs.github.io/textacy/build/html/api_reference/text_processing.html)'s pre-processing).

__Normalization__ replaces all members of an __infinite set__ that share a __common pattern__ with a single __unique token__. It is not possible to learn a good model for each possible number, for instance.

- Examples of "entities" that share a common pattern are:
    - numbers
    - emails
    - URLs
    - phone numbers
    - credit card numbers
    - etc.

These "entities" are generally captured using __regular expressions__.
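
As a rough illustration, token-level normalization with regular expressions could look like the sketch below. The patterns are deliberately simple and are assumptions for illustration only; the class tokens `<url>` and `<email>` and the function `normalize` are hypothetical names, and robust patterns for real data need considerably more care.

```python
import re

# Illustrative patterns only (ASSUMPTIONS); checked in order, first match wins
PATTERNS = [
    ('<url>', re.compile(r'^(https?://|www\.)\S+$')),
    ('<email>', re.compile(r'^[\w.+-]+@[\w-]+(\.[\w-]+)+$')),
    ('<num>', re.compile(r'^\d+([.,]\d+)*$')),
]


def normalize(tokens):
    """Replace each token that matches a pattern with its class token."""
    normalized = []
    for token in tokens:
        for class_token, pattern in PATTERNS:
            if pattern.match(token):
                normalized.append(class_token)
                break
        else:
            # No pattern matched: keep the token unchanged
            normalized.append(token)
    return normalized


print(normalize('movies from 1999 listed at www.example.com'.split()))
```

Every class token introduced this way also has to be added to the input symbol table before compiling the data.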

#### Lookup Tables
Lookup tables provide a convenient way to generalize members of a __large__ but __known set__ of entities. Common examples are *cities*, *countries*, *airport codes*, *movie names*, etc.
Even though these sets are potentially infinite, lists of cities and movie names are generally available as external __Knowledge Bases__, which makes it possible to check whether a token is a member.
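
A minimal sketch of lookup-based generalization is shown below. The table entries, the class tokens `<country>` and `<city>`, and the function `generalize` are made up for illustration; a real table would be loaded from a Knowledge Base.

```python
# Toy lookup table (ASSUMPTION: made-up entries and class tokens)
LOOKUP = {
    'italy': '<country>',
    'france': '<country>',
    'trento': '<city>',
}


def generalize(tokens, table=LOOKUP):
    """Map each token found in the lookup table to its class token."""
    return [table.get(token.lower(), token) for token in tokens]


print(generalize('movies shot in Italy'.split()))
```

This sketch handles single-token entries only; multi-word entities such as movie names require matching over token spans.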

#### [(Named) Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)
> Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

__NER__ also covers entities that are handled by *normalization*; the difference is a matter of approach (regex vs. sequence labeling).
There are several NLP tasks that fall under this category, distinguished by the type of entity:

- TIMEX - temporal expressions
- ENAMEX - named entities
- NUMEX - numerical expressions
- etc. (e.g. protein names in the biomedical domain)

The task is similar to Concept Tagging, with an __important__ difference:
- the same entity (from the NER perspective) may belong to different classes in the target domain:
    - e.g. in `NL2SparQL4NLU`, `actor.name`, `producer.name`, and `director.name` are all subclasses of `PERSON`

Consequently, the output of such systems could be used as input for Concept Tagging.

#### Exercise: Lab
- Implement number generalization to map all numerical expressions in input to `<num>`
- Evaluate the pipeline with this step
- Observe the sizes of the input and output symbol tables