# Sequence Labeling with Weighted Finite State Machines

- Language Understanding Systems Lab
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

This notebook is part of the Laboratory Work for [Language Understanding Systems class](http://disi.unitn.it/~riccardi/page7/page13/page13.html) of [University of Trento](https://www.unitn.it/en).
The laboratory has been ported to Jupyter notebook format for remote teaching during the [COVID-19 pandemic](https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic).

__Requirements__

- [OpenFST](http://www.openfst.org/twiki/bin/view/FST/WebHome)
- [OpenGRM](http://www.opengrm.org/twiki/bin/view/GRM/NGramLibrary)
- [NL2SparQL4NLU](https://github.com/esrel/NL2SparQL4NLU) dataset

## 1. Shallow Parsing with WFSMs: Natural Language Understanding

Language Understanding consists of several tasks, one of which is entity extraction (concept tagging). The task is usually approached as Shallow Parsing, where we segment the input into constituents and label them using IOB schemes.
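To make the IOB scheme concrete, here is a minimal sketch that groups a tag sequence into typed segments. The function name `iob_to_chunks` and the example sentence are made up for illustration, and the lenient handling of a stray `I-` tag (it opens a new segment) is an assumption rather than a rule taken from this lab.

```python
def iob_to_chunks(tags):
    """Group IOB tags into (type, start, end) spans; `end` is exclusive."""
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last chunk
        prefix, _, typ = tag.partition("-")
        # an open chunk ends at O, at any B-, or at an I- of a different type
        if ctype is not None and (prefix != "I" or typ != ctype):
            chunks.append((ctype, start, i))
            start, ctype = None, None
        # B- opens a chunk; a stray I- opens one too (lenient assumption)
        if prefix in ("B", "I") and ctype is None:
            start, ctype = i, typ
    return chunks


# hypothetical example sentence and tags
words = ["who", "plays", "in", "star", "wars"]
tags = ["O", "O", "O", "B-movie.name", "I-movie.name"]
print(iob_to_chunks(tags))   # [('movie.name', 3, 5)]
```

Each returned span corresponds to one labeled constituent: `B-` marks its first token, `I-` its continuation, and `O` everything outside.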

Using Weighted Finite State Machines for the task provides several benefits:
- We have seen how to do sequence labeling with HMMs, but in real applications the models can become quite complex to solve directly.
- The task usually involves several components (e.g. *emission* & *transition* probabilities), and WFSMs provide an efficient way to represent and process these components via intersection and composition operations.
    - WFSTs are good at modeling HMM and solving state machine problems
    - Weights can be associated with edges as costs or probabilities (default: cost = negative log probability)
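As a small worked example of the cost convention, the sketch below converts between probabilities and costs. It assumes the natural logarithm; the helper names `to_cost` and `to_prob` are illustrative only.

```python
import math


def to_cost(p):
    """Probability -> cost (negative natural log)."""
    return -math.log(p)


def to_prob(c):
    """Cost -> probability."""
    return math.exp(-c)


# multiplying probabilities along a path corresponds to adding costs
path_probs = [0.5, 0.25]
path_cost = sum(to_cost(p) for p in path_probs)
print(path_cost, to_prob(path_cost))
```

This is why the lowest-cost (shortest) path through a weighted machine is also its most probable path.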

### 1.1. Common Sequence Labeling Pipeline

The common approach to concept tagging (or sequence labeling in general) makes use of 3 components:

| Component         | Description |
|:------------------|:------------|
| $$\lambda_{W}$$   | FSA representation of an input sentence |
| $$\lambda_{W2T}$$ | FST to translate words into output labels (e.g. `iob+type`) |
| $$\lambda_{*LM}$$ | FSA Ngram Language Model to score the sequences of output labels |

Consequently, Sequence Labeling ($\lambda$) is performed by composition of these three components as:

$$\lambda = \lambda_{W} \circ \lambda_{W2T} \circ \lambda_{*LM}$$

- It is common to include other components to perform intermediate operations for:
    - generalization of input ($\lambda_{G}$)
    - cleaning of output
    - etc.

### 1.2. General Setup
Let's start by preparing our workspace.

What we have is:
- training & test sets in utterance-per-line format
- training & test sets in CoNLL format that contain word-tag observations

For working with WFSMs we need:
- input symbol table: words
- output symbol table: tags

#### Corpus Preprocessing:
To handle OOV (unknown) words let's:
- apply frequency cut-off to lexicon
- replace OOV words in both training and test data with `<unk>`

##### OOV with Frequency Cut-off using OpenGRM NGram Library Tools

It is easy to apply frequency cut-off and replace OOV with `<unk>` using other means (e.g. `python`). 
Here we demonstrate how to achieve that using provided tools and some `unix` commands.

__Objective__: map low frequency unigrams to `<unk>`

- generate symbol table using `ngramsymbols` for a corpus
- compile the corpus into FAR using `farcompilestrings`
- count unigrams in FAR using `ngramcount`
- print counts using `ngramprint`
- filter the words externally and save them into a file
- generate a new symbol table using `ngramsymbols`
- recompile the corpus into a new FAR using the new symbol table and `farcompilestrings`

> __Note__: *if you provide an external word list to `ngramsymbols` it will generate a symbol table in the required format.*
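For comparison, the "other means" mentioned above could look like the following Python sketch. The function names and the toy corpus are hypothetical; the threshold `min_freq=2` mirrors a cut-off that keeps words seen more than once.

```python
from collections import Counter


def build_lexicon(sentences, min_freq=2):
    """Keep the words that occur at least `min_freq` times in the corpus."""
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, c in counts.items() if c >= min_freq}


def replace_oov(sentence, lexicon, unk="<unk>"):
    """Replace every word outside the lexicon with the OOV symbol."""
    return " ".join(w if w in lexicon else unk for w in sentence.split())


# toy corpus, made up for illustration
corpus = ["who directed thor", "who plays in thor", "star of thor"]
lexicon = build_lexicon(corpus)
print(sorted(lexicon))
print(replace_oov("star of thor", lexicon))
```

The lexicon must be built on the training data only and then applied unchanged to the test data.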

- Since we will be using corpus files a lot, let's copy them into current directory with shorter names.

In [1]:
%%bash
dpath='NL2SparQL4NLU/dataset/NL2SparQL4NLU'

cp $dpath.train.utterances.txt trn.txt
cp $dpath.test.utterances.txt tst.txt

cp $dpath.train.conll.txt trn.conll
cp $dpath.test.conll.txt tst.conll

- Handling OOV with OpenFST and OpenGRM tools

In [56]:
%%bash
# create full symbol table
ngramsymbols trn.txt trn.isyms.tmp
# compile into FAR
farcompilestrings --symbols=trn.isyms.tmp --keep_symbols trn.txt trn.far.tmp
# count unigrams
ngramcount --order=1 trn.far.tmp trn.cnt.tmp
# print counts as integers
ngramprint --integers trn.cnt.tmp trn.cnt.txt.tmp

# filter with bash (python would work equally well)
while read -r word freq; do
    if (( freq > 1 )); then echo "$word"; fi
done < trn.cnt.txt.tmp > trn.cnt.txt.cutoff.tmp

# final input symbol table
ngramsymbols trn.cnt.txt.cutoff.tmp isyms.txt

# delete temp files
rm -f *.tmp

In [57]:
%%bash
# let's compile both training and test set into far using this symbol table
farcompilestrings \
    --symbols=isyms.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    trn.txt trn.far

farcompilestrings \
    --symbols=isyms.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    tst.txt tst.far

As a result we have:
- Symbol table (`isyms.txt`)
    - contains `['<s>', '</s>', '<epsilon>', '<unk>']` that are added automatically
- Training data as FAR with OOV replaced (`trn.far`)
- Test data as FAR with OOV replaced (`tst.far`)

##### Generating Output Symbol Table
To do sequence labeling we additionally require *output symbol table*.

In case we know the concepts (__types__), we can build our output symbol table without looking at data.

Since our output labels are composed of segmentation (`iob`) and classification (`type`) labels, we can make sure that each __type__ has all possible IOB prefixes.

- prefix each __type__ with all possible IOB prefixes (i.e. `I-` and `B-`)
- generate symbol table using `ngramsymbols` (as shown above)

In case we don't know the list of __types__, we have to extract __types__ from training data in CoNLL format (stripping `iob` prefix). 

In [58]:
%%bash
# create a unique list of types
cat trn.conll | cut -f 2 | cut -d '-' -f 2 | sed '/^ *$/d' | sort | uniq > types.txt

# prefix each type with segmentation information
while read -r word
do
    if [[ $word != 'O' ]]
    then
        echo "B-$word"
        echo "I-$word"
    else
        echo $word
    fi
done < types.txt > osyms.tmp

# generate output symbol table with iob-prefixed types
ngramsymbols osyms.tmp osyms.txt

rm -f *.tmp
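The same label list can be sketched in Python. This is an illustrative alternative, not part of the lab's tooling: the function name `iob_labels` is made up, and unlike the shell version it always emits `O`, whether or not it occurs in the data.

```python
def iob_labels(conll_lines):
    """Return ['O', 'B-x', 'I-x', ...] for every type x seen in the tag column."""
    types = set()
    for line in conll_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue                      # blank line = sentence boundary
        _, hyphen, typ = cols[1].partition("-")
        if hyphen:                        # 'O' has no hyphen and is skipped
            types.add(typ)
    labels = ["O"]
    for typ in sorted(types):
        labels += ["B-" + typ, "I-" + typ]
    return labels


# toy CoNLL lines, made up for illustration
toy = ["star\tB-movie.name", "wars\tI-movie.name", "", "by\tO", "nolan\tB-director.name"]
print(iob_labels(toy))
```

Writing the returned labels one per line gives a word list that `ngramsymbols` can turn into a symbol table.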

#### Applying FSTs
- Our objective is to be able to annotate input sentences (as FSAs in FAR) using the machines we are going to build. 
- Our test data has been compiled into a FAR archive
- We can use `farextract` to extract these sentence FSAs and apply our FSMs to them
    - `farextract --filename_prefix="<odir>" <FAR>` will extract contents of `<FAR>` into directory `<odir>`
    - we can iterate over files in the directory and apply operations (see below)
- We can also create FAR of the processed FSMs using `farcreate` as
    - `farcreate --file_list_input <list of FST filenames> <output FAR>`


In [59]:
%%bash
wdir='wdir'
mkdir -p $wdir

farextract --filename_prefix="$wdir/" tst.far
cp $wdir/tst.txt-0001 sent.fsa

fstprint sent.fsa

0	1	star	star
1	2	of	of
2	3	<unk>	<unk>
3


- For testing we are going to use this `sent.fsa`
- For evaluation we are going to iterate over whole FAR

### 1.3. Baselines
Let's demonstrate the process by building some simple Shallow Parsing models.

### 1.3.1. Random
The simplest solution is to assign output labels __randomly__. 

To achieve this we need to:
- implement an FST that translates our words into output symbols ($\lambda_{W2T}$) with equal cost, or no cost at all (i.e. unweighted FST)
    - let's call it $\lambda_{W2T_{U}}$ for [universe](https://en.wikipedia.org/wiki/Universe_(mathematics)).
- compose it with our sentence
- choose a random path in FST

The FST $\lambda_{W2T_{U}}$ represents the search space for $p(w_i|t_i)$ without being exposed to any observation. It is built using only our knowledge of the __domain__:
- vocabulary of the language (input symbols)
- concept __types__ in our domain (output symbols)

Given these input and output symbols, all translations are possible.

Since we have no model yet, the whole pipeline is:

$$\lambda_{R} = \lambda_{W} \circ \lambda_{W2T_{U}}$$

- __*random path*__ here is opposed to __*best path*__ or __*shortest path*__

- Let's define a function in python to write FST specification given input and output symbol tables as below
    - we will be using it a lot

In [84]:
def make_w2t(isyms, osyms, sep='+', out='w2t.tmp'):
    """Write a one-state FST spec that maps every input symbol to every output symbol."""
    special = {'<epsilon>', '<s>', '</s>'}  # not translated; '<unk>' is kept
    state = '0'
    fs = " "  # column separator for fst

    def read_symbols(path):
        # keep only the first tab-separated column (the symbol itself)
        with open(path, 'r') as f:
            return sorted({line.strip().split("\t")[0] for line in f} - special)

    ist = read_symbols(isyms)
    ost = read_symbols(osyms)

    with open(out, 'w') as f:
        for i in ist:
            for o in ost:
                f.write(fs.join([state, state, i, o]) + "\n")
        f.write(state + "\n")  # final state

In [85]:
make_w2t('isyms.txt', 'osyms.txt', out='w2t_u.txt')

In [86]:
%%bash
# Let's compile it
fstcompile \
    --isymbols=isyms.txt \
    --osymbols=osyms.txt \
    --keep_isymbols \
    --keep_osymbols \
    w2t_u.txt w2t_u.bin

fstinfo w2t_u.bin | head -n 8

fst type                                          vector
arc type                                          standard
input symbol table                                isyms.txt
output symbol table                               osyms.txt
# of states                                       1
# of arcs                                         45648
initial state                                     0
# of final states                                 1


#### Exercise
- Compute input and output symbol table sizes
- Compare their multiplication to `# of arcs`

#### Testing

- Note the usage of `fstrandgen` instead of `fstshortestpath` to get a __random__ path through the FST.
- If we use `fstshortestpath`, all tokens will be predicted as `O`.
    - Bonus Question: *Why?* (try uncommenting & running)

In [87]:
%%bash
fstcompose sent.fsa w2t_u.bin | fstrandgen | fstrmepsilon | fsttopsort | fstprint
# fstcompose sent.fsa w2t_u.bin | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	I-person.name
1	2	of	B-movie.description
2	3	<unk>	I-movie.release_date
3


#### Evaluation
- For evaluation we are going to use the `conlleval.pl` script (provided in the `src` directory)
- We need to convert the output to the appropriate format

- Collect predictions from our model & store them in a file

In [88]:
%%bash
wdir='wdir'
farr=($(ls $wdir))

for f in ${farr[@]}
do
    fstcompose $wdir/$f w2t_u.bin | fstrandgen | fstrmepsilon | fsttopsort | fstprint
done > w2t_u.out

- post-process the output to have at least 3 columns, such that:
    - column #1 (or 0) is words
    - column #2 (last-1) is reference labels
    - column #3 (last) is predicted labels
    - replace tabs (`\t`) with spaces (`' '`) and cleanup
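The post-processing described in the list above can also be sketched in Python. The function name `merge_for_conlleval` is hypothetical, and the reference tags in the toy example are invented. The sketch assumes the two inputs are line-aligned: one arc line per token, and one final-state line per blank sentence separator.

```python
def merge_for_conlleval(conll_lines, fst_lines):
    """Combine reference CoNLL lines with `fstprint` lines into 'word ref pred'."""
    merged = []
    for ref, hyp in zip(conll_lines, fst_lines):
        ref_cols, hyp_cols = ref.split(), hyp.split()
        if not ref_cols:
            merged.append("")             # sentence boundary (final-state line)
        else:
            # column 4 of an arc line is the output label,
            # for both unweighted (4 columns) and weighted (5 columns) arcs
            merged.append(" ".join([ref_cols[0], ref_cols[-1], hyp_cols[3]]))
    return merged


# toy input: invented reference tags, predictions in fstprint format
ref = ["star\tB-movie.name", "of\tI-movie.name", "thor\tI-movie.name", ""]
hyp = ["0\t1\tstar\tI-person.name", "1\t2\tof\tB-movie.description",
       "2\t3\t<unk>\tI-movie.release_date", "3"]
print("\n".join(merge_for_conlleval(ref, hyp)))
```

The word column is taken from the reference file, so the original token is kept even where the model saw `<unk>`.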

In [89]:
%%bash
paste tst.conll w2t_u.out | cut  -f 1,2,6 | tr '\t' ' ' | sed 's/^ .*//g' > w2t_u.out.conll 
perl ../src/conlleval.pl < w2t_u.out.conll                                                             

processed 7117 tokens with 1091 phrases; found: 6822 phrases; correct: 20.
accuracy:   2.15%; precision:   0.29%; recall:   1.83%; FB1:   0.51
                 : precision:   0.00%; recall:   0.00%; FB1:   0.00  121
       actor.name: precision:   0.00%; recall:   0.00%; FB1:   0.00  284
actor.nationality: precision:   0.00%; recall:   0.00%; FB1:   0.00  292
       actor.type: precision:   0.00%; recall:   0.00%; FB1:   0.00  306
   award.category: precision:   0.00%; recall:   0.00%; FB1:   0.00  326
   award.ceremony: precision:   0.36%; recall:  14.29%; FB1:   0.71  274
   character.name: precision:   0.00%; recall:   0.00%; FB1:   0.00  308
     country.name: precision:   1.75%; recall:   8.06%; FB1:   2.88  285
    director.name: precision:   0.00%; recall:   0.00%; FB1:   0.00  303
director.nationality: precision:   0.00%; recall:   0.00%; FB1:   0.00  275
movie.description: precision:   0.00%; recall:   0.00%; FB1:   0.00  296
      movie.genre: precision:   0.32%; recall:   2.

### 1.3.2. Output Symbol Priors

The simplest form for the third component is to use output label priors, i.e. unigram probabilities of output labels. To model that, we can:
- train a unigram language model using `ngramcount` & `ngrammake` (let's call it $\lambda_{LM_{1}}$)
- compose it with the $\lambda_{W2T_{U}}$ so that the whole pipeline becomes:

$$\lambda = \lambda_{W} \circ \lambda_{W2T_{U}} \circ \lambda_{LM_{1}}$$

- Since the `O` tag is the most frequent, we will have a model that always predicts it.
- We can represent our model as:

$$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(t_i)}$$
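A toy illustration of why this model always predicts the same tag: the cost of a tag does not depend on the word, so the minimum-cost tag is identical at every position. The prior values below are invented for the example and are not estimated from the dataset.

```python
import math


def prior_tagger(words, tag_probs):
    """Tag every word with the lowest-cost (most probable) tag under p(t)."""
    costs = {t: -math.log(p) for t, p in tag_probs.items()}
    best = min(costs, key=costs.get)      # independent of the word
    return [(w, best, costs[best]) for w in words]


# hypothetical priors, for illustration only
priors = {"O": 0.62, "B-movie.name": 0.20, "I-movie.name": 0.18}
for word, tag, cost in prior_tagger(["star", "of", "<unk>"], priors):
    print(word, tag, round(cost, 4))
```

Any tagger of this form scores well on token accuracy when one tag dominates, but finds no phrases at all.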

__Considerations__
- In this baseline we model __observations__:
    - our models are based on observations
    - OOV is due to scarcity of observations
- Consequently, we need to change the output symbol table of $\lambda_{W2T_{U}}$ to output only tags present in the data or `<unk>`:
    - it may happen that an `iob+type` combination never appears in our training data
    - the output symbol table of $\lambda_{W2T}$ (FST) and the symbol table of $\lambda_{LM_{1}}$ (FSA) have to match

#### "Training" a Model
A new $\lambda_{*LM}$ model is built following these steps:
1. prepare training data for model (in required format)
2. prepare symbol table for that data
    - apply OOV handling (you can use any of the approaches to introduce `<unk>`)
3. compile training data into FAR using this symbol table
4. estimate model probabilities for $\lambda_{*LM}$ (i.e. train ngram model)


A new $\lambda_{W2T}$ is created (updated) each time we change the symbol table of the $\lambda_{*LM}$.

- If we do not plan to estimate $p(w_i|t_i)$ in $\lambda_{W2T}$, we can create the FST as we did for $\lambda_{W2T_{U}}$ (and keeping input symbol table the same)

In [90]:
%%bash
# create training data in utterance-per-line format for output symbols (t - tags)
cat trn.conll | cut -f 2 |\
    sed 's/^$/~/g' | tr '\n' ' ' | tr '~' '\n' |\
    sed 's/  */ /g;s/^ *//g;s/ *$//g' > trn.t.txt

# apply cut-off & recompile with OOV replaced with `<unk>`
ngramsymbols trn.t.txt trn.t.osyms.tmp
farcompilestrings --symbols=trn.t.osyms.tmp --keep_symbols trn.t.txt trn.t.far.tmp
ngramcount --order=1 trn.t.far.tmp trn.t.cnt.tmp
ngramprint --integers trn.t.cnt.tmp trn.t.cnt.txt.tmp

# filter lexicon
while read -r word freq
do
    if (( freq > 1 ))
    then 
        echo $word
    fi
done < trn.t.cnt.txt.tmp > trn.t.cnt.txt.cutoff.tmp

# final output symbol table
ngramsymbols trn.t.cnt.txt.cutoff.tmp t.osyms.txt

# delete temp files
rm -f *.tmp

# compile data into FAR again
farcompilestrings \
    --symbols=t.osyms.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    trn.t.txt trn.t.far

- Let's train a unigram language model.

In [91]:
%%bash
ngramcount --order=1 trn.t.far trn.t1.cnt
ngrammake trn.t1.cnt t1.lm
ngraminfo t1.lm

# of states                                       1
# of ngram arcs                                   38
# of backoff arcs                                 0
initial state                                     0
unigram state                                     -1
# of final states                                 1
ngram order                                       1
# of 1-grams                                      39
well-formed                                       y
normalized                                        y


- Let's create a new $\lambda_{W2T}$ (let's call it $\lambda_{W2T_{T}}$ for "tags"):
    - following the same procedure we followed for $\lambda_{W2T_{U}}$, but using:
        - `isyms.txt` as input symbol table
        - `t.osyms.txt` as output symbol table
    - allowing `<unk> <unk>` and *word*-`<unk>` arcs

In [None]:
make_w2t('isyms.txt', 't.osyms.txt', out='w2t_t.txt')

In [92]:
%%bash
# Let's compile it
fstcompile \
    --isymbols=isyms.txt \
    --osymbols=t.osyms.txt \
    --keep_isymbols \
    --keep_osymbols \
    w2t_t.txt w2t_t.bin

fstinfo w2t_t.bin | head -n 8

fst type                                          vector
arc type                                          standard
input symbol table                                isyms.txt
output symbol table                               t.osyms.txt
# of states                                       1
# of arcs                                         36138
initial state                                     0
# of final states                                 1


In [93]:
%%bash
fstcompose sent.fsa w2t_t.bin | fstcompose - t1.lm | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	O	0.476737976
1	2	of	O	0.476737976
2	3	<unk>	O	0.476737976
3	2.00485039


In [14]:
%%bash
wdir='wdir'
farr=($(ls $wdir))

for f in ${farr[@]}
do
    fstcompose $wdir/$f w2t_t.bin | fstcompose - t1.lm |\
        fstshortestpath | fstrmepsilon | fsttopsort | fstprint
done > w2t_t.t1.out

paste tst.conll w2t_t.t1.out | cut  -f 1,2,6 | tr '\t' ' ' | sed 's/^ .*//g' > w2t_t.t1.out.conll
perl ../src/conlleval.pl < w2t_t.t1.out.conll

processed 7117 tokens with 1091 phrases; found: 0 phrases; correct: 0.
accuracy:  72.15%; precision:   0.00%; recall:   0.00%; FB1:   0.00
       actor.name: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
actor.nationality: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
       actor.type: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
   award.category: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
   award.ceremony: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
   character.name: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
     country.name: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
    director.name: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
director.nationality: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
      movie.genre: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
movie.gross_revenue: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
   movie.language: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
   m

- The model still has $F_1=0$, since `O` is the tag with highest prior.
- Observe the weights in the output

### 1.3.4. Exercises
- Compare sizes of $\lambda_{W2T_{U}}$ and $\lambda_{W2T_{T}}$ for `# of arcs`
- Unigram models & $\lambda_{W2T}$
    - Test pipeline: $\lambda = \lambda_{W} \circ \lambda_{W2T_{U}} \circ \lambda_{LM_{1}}$


- Bigram models: train a *tag* bigram model (let's call it $\lambda_{LM_{2}}$)

    - Test pipeline with $\lambda_{W2T_{U}}$: $\lambda = \lambda_{W} \circ \lambda_{W2T_{U}} \circ \lambda_{LM_{2}}$
    - Test pipeline with $\lambda_{W2T_{T}}$: $\lambda = \lambda_{W} \circ \lambda_{W2T_{T}} \circ \lambda_{LM_{2}}$

### 1.3.5. Maximum Likelihood Estimation (Emission Probabilities)
- So far we haven't explored the relation between input and output
- The next thing we can do is to expose our model to observations and estimate $p(w_{i}|t_{i})$ from data.
- We can use `ngramcount` and `ngrammake` to make a smoothed probability model (we are using default, i.e. no parameters). 
- We need to estimate probabilities like we would estimate bigram probabilities, thus:
    - prepare a lexicon with *tags* and *words*
    - read the CoNLL format corpus into FAR (token per line, preprocessed)
    - count bigrams
    - make a bigram language model
    - print bigrams with weights (negative log probabilities)
    - select the tag-word bigrams (the printout also contains unigrams, as well as `<s>` and `</s>` bigrams)
    - convert to FST & compile
    
- Let's call the model $\lambda_{W2T_{MLE}}$
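To show what is being estimated, here is a minimal sketch of unsmoothed relative-frequency estimation of $p(w_i|t_i)$ from (tag, word) pairs, expressed as costs. The function name and the toy pairs are made up. Because the `ngrammake` model is smoothed, its values are not expected to match this sketch exactly.

```python
import math
from collections import Counter


def mle_emission_costs(pairs):
    """Map (tag, word) -> -log( count(tag, word) / count(tag) )."""
    pair_counts = Counter(pairs)
    tag_counts = Counter(tag for tag, _ in pairs)
    return {(t, w): -math.log(c / tag_counts[t])
            for (t, w), c in pair_counts.items()}


# toy (tag, word) observations, invented for illustration
pairs = [("O", "of"), ("O", "the"), ("O", "of"), ("B-actor.name", "brad")]
for (tag, word), cost in sorted(mle_emission_costs(pairs).items()):
    print(tag, word, round(cost, 4))
```

Each resulting triple has the shape of one arc of the emission transducer: tag, word, and a cost.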

In [96]:
%%bash
# let's use our symbol tables (cut-off has already been applied to both)
cat isyms.txt t.osyms.txt | cut -f 1 | sort | uniq > msyms.text.tmp
ngramsymbols msyms.text.tmp t.msyms.txt

# let's convert data to ngrams
cat trn.conll | sed '/^$/d' | awk '{print $2,$1}' > trn.w2t.txt

# compile to far
farcompilestrings \
    --symbols=t.msyms.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    trn.w2t.txt trn.w2t.far
    
# count bigrams
ngramcount --order=2 trn.w2t.far trn.w2t.cnt
# make a model
ngrammake trn.w2t.cnt trn.w2t.lm

# print ngram probabilities as negative logs
ngramprint \
    --symbols=t.msyms.txt\
    --negativelogs \
    trn.w2t.lm trn.w2t.probs

- Let's define a python function to convert probabilities printout to W2T FST

In [100]:
def probs_w2t_mle(probs, out='w2t_mle.tmp'):
    """Convert an `ngramprint` printout into a one-state FST spec with tags on input."""
    mcn = 3   # required number of columns
    state = '0'
    fs = " "  # column separator for fst
    otag = 'O'

    with open(probs, 'r') as f:
        lines = [line.strip().split("\t") for line in f]

    with open(out, 'w') as f:
        for line in lines:
            # keep only entries with `mcn` columns whose first column is a tag
            if len(line) == mcn and (line[0].startswith(("B-", "I-")) or line[0] == otag):
                f.write(fs.join([state, state] + line) + "\n")
        f.write(state + "\n")  # final state

In [101]:
probs_w2t_mle('trn.w2t.probs', out='w2t_mle.txt')

In [102]:
%%bash
fstcompile \
    --isymbols=t.osyms.txt \
    --osymbols=isyms.txt \
    --keep_isymbols \
    --keep_osymbols \
    w2t_mle.txt w2t.mle.bin
    
# we need to invert it to have words on input
fstinvert w2t.mle.bin w2t.mle.inv.bin

fstinfo w2t.mle.inv.bin | head -n 8

fst type                                          vector
arc type                                          standard
input symbol table                                isyms.txt
output symbol table                               t.osyms.txt
# of states                                       1
# of arcs                                         1509
initial state                                     0
# of final states                                 1


#### Testing
Let's test it:

In [103]:
%%bash
fstcompose sent.fsa w2t.mle.inv.bin | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	B-movie.name	3.22130394
1	2	of	I-movie.name	3.02857399
2	3	<unk>	B-director.nationality	0.694147706
3


- The pipeline above represents 

$$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_i|t_i)}$$

- To extend it to unigram tagging model we need to compose it with  $\lambda_{LM_{1}} = p(t_i)$ 

$$\lambda = \lambda_{W} \circ \lambda_{W2T_{MLE}} \circ \lambda_{LM_{1}}$$

$$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_i|t_i)p(t_i)}$$ 

In [104]:
%%bash
fstcompose sent.fsa w2t.mle.inv.bin | fstcompose - t1.lm | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	B-movie.name	6.09392548
1	2	of	O	3.86930203
2	3	<unk>	O	4.46328497
3	2.00485039
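The weights in the two printouts can be related with a little arithmetic. The sketch below assumes that composition simply adds arc costs, so each arc above carries an emission cost plus a tag prior cost. All numbers are copied from the outputs shown in this notebook; the variable names are illustrative.

```python
import math

# arc weights copied from the printouts in this notebook
emission_star_B_movie_name = 3.22130394   # emission-only pipeline, star -> B-movie.name
combined = {"star": 6.09392548, "of": 3.86930203, "<unk>": 4.46328497}
prior_cost_O = 0.476737976                # cost of O in the unigram-prior pipeline

# the prior cost of B-movie.name is what the unigram model added to the arc
prior_cost_B_movie_name = combined["star"] - emission_star_B_movie_name
# conversely, subtracting the O prior recovers the emission cost of (of | O)
emission_of_O = combined["of"] - prior_cost_O

print("prior cost of B-movie.name:", round(prior_cost_B_movie_name, 6))
print("emission cost of 'of' given O:", round(emission_of_O, 6))
print("p(O) implied by its cost:", round(math.exp(-prior_cost_O), 4))
```

Under this reading, `of` switches from `I-movie.name` to `O` because the low prior cost of `O` outweighs its higher emission cost.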


#### Exercise 1: Maximum Likelihood Estimation
- using `ngramprint` verify the Maximum Likelihood Estimation method (without `--negativelogs` it prints raw probabilities)
    - print bigram counts from $\lambda_{W2T_{MLE}}$ (output of `ngramcount`)
    - print unigram counts from either $\lambda_{LM_{1}}$ or $\lambda_{W2T_{MLE}}$ (output of `ngramcount`)
    - using these counts compute probability of $p($ `brad|B-actor.name` $)$
    - extract probability of $p($ `brad|B-actor.name` $)$ from $\lambda_{W2T_{MLE}}$ (output of `ngrammake`)
    - compare values
    - repeat the procedure using counts from methods developed for the lab on ngram modeling.

#### Exercise 2: Markov Model Tagger
- Evaluate the MLE pipeline using bigram model on tags, i.e.

$$\lambda = \lambda_{W} \circ \lambda_{W2T_{MLE}} \circ \lambda_{LM_{2}}$$

$$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_i|t_i)p(t_i|t_{i-1})}$$ 

- compare performances to the HMM tagger from previous lab (NLTK)

## 2. Joint Distribution Modeling

As we have seen, sequence labeling for Language Understanding can be approached using Hidden Markov Models (similarly to Part-of-Speech Tagging) and modeled as in the table below (__HMM__). Stochastic Conceptual Language Models for Spoken Language Understanding in [Raymond & Riccardi (2007)](https://disi.unitn.it/~riccardi/papers2/IS07-GenerDiscrSLU.pdf) (__R&R__) model words and tags jointly.


| Model   | Equation |
|:--------|:----------
| __HMM__ | $$p(t_{1}^{n}|w_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_i|t_i)p(t_i|t_{i-N+1}^{i-1})}$$
| __R&R__ | $$p(w_{1}^n,t_{1}^{n}) \approx \prod_{i=1}^{n}{p(w_{i}t_{i}|w_{i-N+1}^{i-1}t_{i-N+1}^{i-1})}$$


From implementation perspective, joint modeling implies the following:
- we need to train $\lambda_{SCLM}$ on word-tag pairs
    - create corpus in a format for estimating $p(w_i,t_i|w_{i-N+1}^{i-1}t_{i-N+1}^{i-1})$
    - create symbol tables
- we need to change $\lambda_{W2T}$ to output *word-tag* pairs (let's call it $\lambda_{W2WT}$)
    - create FST like above for $\lambda_{W2WT}$ ($\lambda_{W2WT_{WT}}$ - to differentiate from $\lambda_{W2WT_{U}}$ that contains all possible combinations)
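The corpus format for joint modeling can be sketched in Python as follows: each token becomes a single `word+tag` symbol and each sentence becomes one line. The function name `conll_to_wt` and the toy input are hypothetical.

```python
def conll_to_wt(conll_lines, sep="+"):
    """Turn CoNLL lines into utterance-per-line strings of word+tag symbols."""
    sentences, current = [], []
    for line in conll_lines:
        cols = line.split()
        if not cols:                      # blank line closes the sentence
            if current:
                sentences.append(" ".join(current))
                current = []
        else:
            current.append(cols[0] + sep + cols[-1])
    if current:                           # file may not end with a blank line
        sentences.append(" ".join(current))
    return sentences


# toy CoNLL input, made up for illustration
toy_conll = ["star\tB-movie.name", "of\tI-movie.name", "", "who\tO"]
print(conll_to_wt(toy_conll))
```

An ngram model trained on such lines treats every `word+tag` pair as one vocabulary item, which is what makes the distribution joint.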

#### Preparing Symbol Tables
- Let's create symbol tables the same way we did for $\lambda_{W2T_{T}}$

In [105]:
%%bash
# create training data in utterance-per-line format for output symbols (wt - word+tag pairs)
# using `+` to join w & t
cat trn.conll | tr '\t' '+' |\
    sed 's/^$/~/g' | tr '\n' ' ' | tr '~' '\n' |\
    sed 's/  */ /g;s/^ *//g;s/ *$//g' > trn.wt.txt

# apply cut-off & recompile with OOV replaced with `<unk>`
ngramsymbols trn.wt.txt trn.wt.osyms.tmp
farcompilestrings --symbols=trn.wt.osyms.tmp --keep_symbols trn.wt.txt trn.wt.far.tmp
ngramcount --order=1 trn.wt.far.tmp trn.wt.cnt.tmp
ngramprint --integers trn.wt.cnt.tmp trn.wt.cnt.txt.tmp

# filter lexicon
while read -r word freq
do
    if (( freq > 1 ))
    then 
        echo $word
    fi
done < trn.wt.cnt.txt.tmp > trn.wt.cnt.txt.cutoff.tmp

# final output symbol table
ngramsymbols trn.wt.cnt.txt.cutoff.tmp wt.osyms.txt

# delete temp files
rm -f *.tmp

- Let's:
    - compile our processed data into FAR
    - train ngram language models on it - $\lambda_{SCLM}$

#### Training Conceptual Language Model

In [106]:
%%bash
# compile data into FAR again
farcompilestrings \
    --symbols=wt.osyms.txt \
    --keep_symbols \
    --unknown_symbol='<unk>' \
    trn.wt.txt trn.wt.far

# train ngram model
ngramcount --order=2 trn.wt.far trn.wt.cnt
ngrammake trn.wt.cnt wt2.lm
ngraminfo wt2.lm

# of states                                       1096
# of ngram arcs                                   6179
# of backoff arcs                                 1095
initial state                                     1
unigram state                                     0
# of final states                                 534
ngram order                                       2
# of 1-grams                                      1095
# of 2-grams                                      5618
well-formed                                       y
normalized                                        y


#### Building W2WT FST

- Let's build unweighted $\lambda_{W2WT_{WT}}$, using
    - input symbol table `isyms.txt`
    - output symbol table `wt.osyms.txt`
- all words that appear in `isyms.txt` and do not appear in *word-tag* pairs (i.e. `wt.osyms.txt`) need to be taken care of
    - in practice this means that we need to do either of the two:
        - update the input symbol table, removing these words
        - construct a $\lambda_{W2T}$ that handles that (mapping them to '`<unk>`', as there are no '`<unk>+iob+type`').

In [119]:
def make_w2t_filter(isyms, osyms, out='w2t.tmp', sep='+'):
    """Write a one-state FST spec mapping each word to its observed word+tag symbols."""
    special = {'<epsilon>', '<s>', '</s>'}
    oov = '<unk>'
    state = '0'
    fs = " "  # column separator for fst

    def read_symbols(path):
        # keep only the first tab-separated column (the symbol itself)
        with open(path, 'r') as f:
            return {line.strip().split("\t")[0] for line in f} - special

    ist = read_symbols(isyms)
    ost = read_symbols(osyms)
    known = set()

    with open(out, 'w') as f:
        for e in sorted(ost):
            if e == oov:
                f.write(fs.join([state, state, e, e]) + "\n")  # <unk> <unk>
            else:
                # split on the last separator, assuming tags never contain it
                w, t = e.rsplit(sep, 1)
                known.add(w)
                f.write(fs.join([state, state, w, e]) + "\n")

        # words without any word+tag symbol are mapped to <unk>
        for e in sorted(ist):
            if e not in known:
                f.write(fs.join([state, state, e, oov]) + "\n")

        f.write(state + "\n")  # final state


In [120]:
make_w2t_filter('isyms.txt', 'wt.osyms.txt', out='w2wt_wt.txt')

In [122]:
%%bash
# Let's compile it
fstcompile \
    --isymbols=isyms.txt \
    --osymbols=wt.osyms.txt \
    --keep_isymbols \
    --keep_osymbols \
    w2wt_wt.txt w2wt_wt.bin

fstinfo w2wt_wt.bin | head -n 8

fst type                                          vector
arc type                                          standard
input symbol table                                isyms.txt
output symbol table                               wt.osyms.txt
# of states                                       1
# of arcs                                         1157
initial state                                     0
# of final states                                 1


- Let's test the whole pipeline $\lambda_{W} \circ \lambda_{W2WT_{WT}} \circ \lambda_{SCLM_{2}}$

In [123]:
%%bash
fstcompose sent.fsa w2wt_wt.bin | fstcompose - wt2.lm | fstshortestpath | fstrmepsilon | fsttopsort | fstprint

0	1	star	star+O	7.93915796
1	2	of	of+O	1.55418813
2	3	<unk>	<unk>	2.84977818
3	1.10391009


#### Evaluation
- Since on the output we have `word+tag`, we need to post-process the output for evaluation
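One possible shape for this post-processing step is sketched below. The function name `wt_to_tag` is hypothetical. Mapping a bare `<unk>` output to `O` is an assumption made for the example, not something prescribed by this lab; another policy may suit the evaluation better.

```python
def wt_to_tag(symbol, sep="+", unk="<unk>", unk_tag="O"):
    """Recover the tag from a word+tag output symbol."""
    if symbol == unk:
        return unk_tag                    # assumption: unknown pairs count as O
    _, found, tag = symbol.rpartition(sep)
    return tag if found else unk_tag      # symbols without a separator also fall back


# output symbols as printed by the joint pipeline
for sym in ["star+O", "of+O", "<unk>"]:
    print(sym, "->", wt_to_tag(sym))
```

Splitting on the last separator keeps the function correct for words that themselves contain `+`.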

#### Exercise: Full $\lambda_{W2WT_{U}}$
- Implement $\lambda_{W2WT_{U}}$ using 'full' input and output symbol tables (`isyms.txt` and `osyms.txt`)
- Test the pipeline: $\lambda_{W} \circ \lambda_{W2WT_{U}} \circ \lambda_{SCLM_{2}}$