# Language Modeling using OpenGRM: Solutions

- Natural Language Understanding
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

This notebook covers the lecture on __Sequence and Language Modeling__ using OpenFST and the OpenGRM NGram Library.

__Requirements__

- [OpenFST](http://www.openfst.org/twiki/bin/view/FST/WebHome)
- [OpenGRM](http://www.opengrm.org/twiki/bin/view/GRM/NGramLibrary)
- [NL2SparQL4NLU](https://github.com/esrel/NL2SparQL4NLU) dataset

## Exercises

- Read the tool manuals to learn the available tools and their options (tool names start with `far`, `ngram`, or `fst`)

- Preprocess training and test sets using [corpus and lexicon preprocessing functions](corpus_pp_python.ipynb) to:
    - add sentence begin & end tags
    - handle unknown words (e.g. frequency cut-off)
    
- Convert corpus and lexicon to OpenGRM format

- Train different language models (LM) on the training set of `NL2SparQL4NLU` using OpenGRM NGram Library

    - vary n-gram order
    - vary smoothing
    - compute LM perplexity on the test set

- Report order & smoothing method with lowest perplexity

- Compute scores for the utterances in the test set
- Compare these scores to the ones computed manually
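
The preprocessing step in the list above (sentence tags plus a frequency cut-off for unknown words) can be sketched as follows. This is a minimal illustration, not the code from `corpus_pp_python.ipynb`: the helper names `add_tags`, `build_vocab`, and `replace_oov`, the cut-off value, and the three toy sentences are all assumptions made for the example.

```python
from collections import Counter


def add_tags(tokens, bos="<s>", eos="</s>"):
    """Wrap one tokenized utterance in sentence begin/end tags."""
    return [bos] + list(tokens) + [eos]


def build_vocab(corpus, min_count=2):
    """Frequency cut-off: keep words seen at least `min_count` times."""
    counts = Counter(word for sent in corpus for word in sent)
    return {word for word, n in counts.items() if n >= min_count}


def replace_oov(tokens, vocab, unk="<unk>"):
    """Map every word outside the vocabulary to the unknown symbol."""
    return [word if word in vocab else unk for word in tokens]


# Toy training data (hypothetical, not taken from NL2SparQL4NLU)
train = [s.split() for s in [
    "who is in twilight",
    "who is in the movie",
    "star of the movie",
]]

vocab = build_vocab(train, min_count=2)
processed = [add_tags(replace_oov(sent, vocab)) for sent in train]

for sent in processed:
    print(" ".join(sent))
```

The vocabulary is built on the training set only and then reused for the test set, so that test-only words become `<unk>` instead of leaking into the lexicon.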


### OpenGRM Corpus Pre-processing & Lexicon Generation

OpenGRM requires the lexicon and the corpus to be in specific formats, and provides tools for:

- Automatic lexicon extraction from a corpus (`ngramsymbols`)
- Compilation of text into a FAR, an FST archive (`farcompilestrings`)
    - option to replace OOV words (`farcompilestrings --unknown_symbol`)
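
To make the two steps above concrete, the sketch below builds a small symbol table and maps a sentence to symbol ids, sending out-of-vocabulary words to `<unk>`. It assumes the lexicon is a plain-text table with one `symbol<TAB>id` pair per line, with `<epsilon>` as id 0 and `<unk>` added last; the exact ordering produced by `ngramsymbols` may differ. The function names and toy sentences are hypothetical.

```python
def make_symbol_table(sentences, epsilon="<epsilon>", unk="<unk>"):
    """Assign consecutive ids to symbols in order of first appearance."""
    symbols = {epsilon: 0}  # assumption: id 0 is reserved for epsilon
    for sent in sentences:
        for word in sent.split():
            symbols.setdefault(word, len(symbols))
    symbols.setdefault(unk, len(symbols))  # assumption: OOV symbol goes last
    return symbols


def to_ids(sentence, symbols, unk="<unk>"):
    """Replace each word by its id; unseen words get the id of `unk`."""
    return [symbols.get(word, symbols[unk]) for word in sentence.split()]


# Toy corpus (hypothetical)
table = make_symbol_table([
    "<s> who is in twilight </s>",
    "<s> who was in lost </s>",
])

# Print the table in `symbol<TAB>id` form
for symbol, idx in table.items():
    print(f"{symbol}\t{idx}")

# "avatar" is not in the table, so it is mapped to the id of <unk>
ids = to_ids("<s> who is in avatar </s>", table)
print(ids)
```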


```bash
%%bash
fname=NL2SparQL4NLU
trn=$fname.trn.data
tst=$fname.tst.data

# create lexicon in the required format
ngramsymbols < $trn > $fname.lex

# compile training set into FAR, replacing OOV with '<unk>'
# (already done externally; rerun this command if you have modified the lexicon)
farcompilestrings --unknown_symbol="<unk>" --symbols=$fname.lex --keep_symbols=1 $trn > $fname.trn.far

# compile test set into FAR, replacing OOV with '<unk>'
farcompilestrings --unknown_symbol="<unk>" --symbols=$fname.lex --keep_symbols=1 $tst > $fname.tst.far
```

### Ngram Counting & Language Model Training

- Modify `--order` (in `ngramcount`) and `--method` (in `ngrammake`) to vary the *ngram order* and the *smoothing*
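
Varying both settings by hand gets tedious, so the combinations can be generated programmatically. The sketch below only builds and prints the command lines; it does not run them. The list of orders, the choice of smoothing methods (four of the values accepted by `ngrammake --method`), and the output file naming scheme are assumptions made for illustration.

```python
from itertools import product

ORDERS = [1, 2, 3, 4]
METHODS = ["witten_bell", "absolute", "katz", "kneser_ney"]


def pipeline(fname, order, method):
    """Return the shell commands for one (order, method) configuration."""
    tag = f"{fname}.{order}.{method}"  # hypothetical naming scheme
    return [
        f"ngramcount --order={order} {fname}.trn.far > {tag}.counts",
        f"ngrammake --method={method} {tag}.counts > {tag}.lm",
        f"ngramperplexity {tag}.lm {fname}.tst.far",
    ]


configs = list(product(ORDERS, METHODS))

for order, method in configs:
    for command in pipeline("NL2SparQL4NLU", order, method):
        print(command)
```

The printed commands can be pasted into a `%%bash` cell, and the configuration with the lowest reported perplexity is the one to report.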

```bash
%%bash
fname=NL2SparQL4NLU

# count ngrams
ngramcount --order=3 $fname.trn.far > $fname.trn.counts
# make LM
ngrammake --method=katz $fname.trn.counts > $fname.lm

# compute perplexity on the test set (ignore warnings)
ngramperplexity $fname.lm $fname.tst.far
```

```
1084 sentences, 9285 words, 0 OOVs
logprob(base 10)= -10556.3;  perplexity = 10.4247
```
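
The reported perplexity can be recomputed from the other numbers in the output. The sketch below assumes that the total base-10 log probability is normalized by the number of words plus the number of sentences (one end-of-sentence event each) minus the number of OOVs. This assumption is consistent with the figures printed above, but it is not taken from the tool's documentation.

```python
def perplexity(logprob10, words, sentences, oovs=0):
    """Perplexity from a total base-10 log probability.

    Assumption: each sentence adds one end-of-sentence event,
    and OOV tokens are excluded from the normalization.
    """
    n_events = words + sentences - oovs
    return 10 ** (-logprob10 / n_events)


# Figures copied from the ngramperplexity output
ppl = perplexity(-10556.3, words=9285, sentences=1084, oovs=0)
print(round(ppl, 4))
```

Under this assumption the result agrees with the reported value up to the rounding of the printed log probability.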





### Scoring Test Set

```bash
%%bash
fname=NL2SparQL4NLU
# apply LM
ngramapply $fname.lm $fname.tst.far $fname.tst.out.far
```

### Printing Scored Output

```bash
%%bash
fname=NL2SparQL4NLU
# print strings with weights
farprintstrings --print_weight $fname.tst.out.far | head -n 10
```

```
<s> star of <unk> </s>	11.3299
<s> who is in the movie the campaign </s>	13.7932
<s> list the cast of the movie the campaign </s>	19.1071
<s> who was in twilight </s>	13.3456
<s> who is in <unk> </s>	10.052
<s> actor from lost </s>	19.9336
<s> who played in the movie rocky </s>	24.6591
<s> who played in the movie captain america </s>	19.2894
<s> cast and crew for in july </s>	11.7449
<s> who is in movie in july </s>	12.5787
```
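
For the comparison with manually computed scores, the printed weights may need converting to another scale. The sketch below parses one output line and converts its weight, assuming the weights are negative natural-log probabilities (lower means more probable). That is an assumption about the weight convention and is worth checking against the tool manuals. The function names are hypothetical.

```python
import math


def parse_scored_line(line):
    """Split a `string<TAB>weight` line into text and a float weight."""
    text, weight = line.rstrip("\n").rsplit("\t", 1)
    return text, float(weight)


def to_log10(weight):
    """Convert a negative natural-log weight to a base-10 log probability."""
    return -weight / math.log(10)


# One line copied from the scored output
text, weight = parse_scored_line("<s> who is in <unk> </s>\t10.052")

print(text)
print("log10 probability:", to_log10(weight))
print("probability:", math.exp(-weight))
```

If the assumption holds, summing `to_log10` over all test utterances should approximately reproduce the total `logprob(base 10)` reported by `ngramperplexity`.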
