# Statistical Language Modeling

- Language Understanding Systems
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

## 0. Setup

- Get the dataset for exercises from https://github.com/esrel/NL2SparQL4NLU .


In [1]:
%%bash

if [ ! -d "NL2SparQL4NLU" ]
then
    git clone https://github.com/esrel/NL2SparQL4NLU.git
else
    echo "NL2SparQL4NLU already exists"
fi

NL2SparQL4NLU already exists


## 1. Corpora and Counting

### 1.1. Corpus

A [corpus](https://en.wikipedia.org/wiki/Text_corpus) is a collection of written or spoken texts that is used for language research.

__Corpus properties__:
- Format
- Language
- Annotation
- Split for Machine Learning

In [2]:
%%bash

trn=NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.utterances.txt
tst=NL2SparQL4NLU/dataset/NL2SparQL4NLU.test.utterances.txt

cat $trn | head -n 10

who plays luke on star wars new hope
show credits for the godfather
who was the main actor in the exorcist
find the female actress from the movie she 's the man
who played dory on finding nemo
who was the female lead in resident evil
who played guido in life is beautiful
who was the co-star in shoot to kill
find the guy who plays charlie on charlie 's angels
cast and crew of movie the campaign


__Format__:

- Utterance (sentence) per line
- Tokenized
- Lowercased

__Language__: English monolingual

__Annotation__: None (for now)

__Split__: training & test sets

### 1.2. Counting

*Corpus* description in terms of:

- total number of words
- total number of utterances


#### Exercise:

##### Objective

Compute the corpus descriptive statistics above (total utterance and word counts) for the __training__ and __test__ sets of the NL2SparQL4NLU dataset. Compare the computed statistics with the reference values below.


| Metric           | Train  | Test   |
|------------------|-------:|-------:|
| Total Words      | 21,453 |  7,117 |
| Total Utterances |  3,338 |  1,084 |


##### Bash Solution

```bash
wc -lw $trn
wc -lw $tst
```
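
The same counts can be sketched in Python. This is a minimal illustration on three utterances copied from the training set preview; the function name `corpus_stats` is an assumption made for this example, and in practice the lines would be read from the dataset files. Unlike `wc -l`, which counts newline characters, the sketch skips empty lines.

```python
# Toy stand-in for the corpus file: one tokenized, lowercased utterance per line.
sample = """who plays luke on star wars new hope
show credits for the godfather
who was the main actor in the exorcist"""


def corpus_stats(lines):
    """Return (total utterances, total words) for an iterable of utterance strings."""
    n_utterances = 0
    n_words = 0
    for line in lines:
        tokens = line.split()       # corpus is already tokenized: split on whitespace
        if not tokens:              # skip empty lines
            continue
        n_utterances += 1
        n_words += len(tokens)
    return n_utterances, n_words


utterances, words = corpus_stats(sample.splitlines())
print(utterances, words)  # 3 21
```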


### 1.3. Lexicon

A [lexicon](https://en.wikipedia.org/wiki/Lexicon) is the *vocabulary* of a language. In linguistics, a lexicon is a language's inventory of lexemes.

Linguistic theories generally regard human languages as consisting of two parts: a lexicon, essentially a catalogue of a language's words; and a grammar, a system of rules which allow for the combination of those words into meaningful sentences. 


#### Exercise: 

##### Objective
Compute *lexicon size* (total number of unique words) of the NL2SparQL4NLU dataset and compare to the reference values.

| Metric       | Value |
|--------------|------:|
| Lexicon Size | 1,729 |

##### Algorithm

1. Tokenize as token-per-line
2. Sort
3. Remove duplicates

##### Bash Solution

```bash
cat $trn |
    tr ' ' '\n' |   # convert to token-per-line (replace spaces with newlines)
    sort |          # sort (alphabetically)
    uniq |          # remove duplicates (unique)
    wc -l           # count lines
```
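
In Python, the tokenize, sort, and deduplicate steps collapse into building a set. A minimal sketch on two utterances from the training set preview; the helper name `lexicon` is an assumption made for this example.

```python
# Two utterances from the training set preview, used as a toy corpus.
sample = [
    "who plays luke on star wars new hope",
    "who played dory on finding nemo",
]


def lexicon(utterances):
    """Return the set of unique tokens over all utterances."""
    return {token for utt in utterances for token in utt.split()}


lex = lexicon(sample)
print(len(lex))  # 12
```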

### 1.4. Frequency Lists

In Natural Language Processing (NLP), [a frequency list](https://en.wikipedia.org/wiki/Word_lists_by_frequency) is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.

What is a "word"?

- case sensitive counts
- case insensitive counts (our corpus is lowercased)

#### Exercise:

##### Objective
Compute frequencies of words in the lexicon (using only training set). Compare the 5 most frequent words to the table below.

| Word   | Frequency |
|--------|----------:|
| the    |     1,337 |
| movies |     1,126 |
| of     |       607 |
| in     |       582 |
| movie  |       564 |

##### Algorithm

1. Tokenize as token-per-line
2. Sort
3. Count occurrences of each word

##### Bash Solution

```bash
cat $trn |
    tr ' ' '\n' |                     # convert to token-per-line (replace spaces with newlines)
    sort |                            # sort (alphabetically)
    uniq -c |                         # remove duplicates (unique) and count occurrences
    sort -gr |                        # reverse sort the list by frequency
    awk '{OFS="\t";print $2, $1}' |   # swap output columns [optional]
    head -n 5                         # get top 5 lines
```
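
A frequency list maps naturally onto `collections.Counter` from the Python standard library. The sketch below runs on three utterances from the training set preview; `frequency_list` is a name chosen for this example.

```python
from collections import Counter

# Three utterances from the training set preview, used as a toy corpus.
sample = [
    "show credits for the godfather",
    "who was the main actor in the exorcist",
    "cast and crew of movie the campaign",
]


def frequency_list(utterances):
    """Count occurrences of every token in the utterances."""
    return Counter(token for utt in utterances for token in utt.split())


freq = frequency_list(sample)
print(freq.most_common(1))  # [('the', 4)]
```

`most_common(k)` returns the `k` most frequent words with their counts, which corresponds to the `sort -gr | head` part of the bash pipeline.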


### 1.5. Lexicon Operations

It is common to process the lexicon according to the task at hand (not every transformation makes sense for all tasks). 

__Frequency Cut-Off__: 

- remove least frequent words (with respect to frequency threshold)
- remove most frequent words (with respect to frequency threshold)
- remove *stop words*

#### Stop Words
In computing, [stop words](https://en.wikipedia.org/wiki/Stop_words) are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Any group of words can be chosen as the stop words for a given purpose.


#### Exercise:

Compare Frequency List to the stop word list in `NL2SparQL4NLU/extras/english.stop.txt`.

##### Objective
Create lexicons applying the following steps and report lexicon sizes. Compare to the reference values in the table.

- Remove words that occur less than 2 times, i.e. remove words with frequency 1 ([hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon))
- Remove words that occur more than 100 times
- Remove stop words from the lexicon

| Operation  | Min | Max | Size |
|------------|----:|----:|-----:|
| cut-off    |   2 | N/A |  950 |
| cut-off    | N/A | 100 | 1694 |
| cut-off    |   2 | 100 |  915 |
| stop words | N/A | N/A | 1529 |

##### Algorithm(s)

__Cut-off__
1. Compute Frequency List
2. Remove words below/above the threshold

__Stop Word Removal__
1. Compute Lexicon
2. Remove words that also appear in stop word list

##### Bash Solution

- SLOW! (To run it, convert the cell to 'code', replace "\`\`\`bash" with "`%%bash`", and delete the final "\`\`\`".)
- Use Python instead.

```bash

trn=NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.utterances.txt
swl=NL2SparQL4NLU/extras/english.stop.txt

# create frequency list
cat $trn | tr ' ' '\n' | sort | uniq -c | awk '{OFS="\t";print $2, $1}' > lexicon.txt

function cutoff() {
    # use -1 for th_min and th_max to ignore    
    freq_lex=$1
    th_min=$2
    th_max=$3
    sw_file=$4
    
    local df_min=1
    local df_max=$((2**32))
    
    th_min=$(( th_min != -1 ? th_min : df_min ))
    th_max=$(( th_max != -1 ? th_max : df_max ))
    
    declare -a stopwords
    declare -a lexicon
    
    if [[ $sw_file != '' ]]
    then
        IFS=$'\n' read -d '' -r -a stopwords < $sw_file        
    fi
    
    swcount=0
    while read -r token count
    do
        if (( count >= th_min && count <= th_max ))
        then
            if printf '%s\n' "${stopwords[@]}" | grep -qxF -- "${token}"
            then
                ((swcount++))
            else
                lexicon+=( $token )
            fi     
        fi
    done < ${freq_lex}
    
    # print size of lexicon
    echo ${#lexicon[@]}
}

cutoff lexicon.txt -1 -1
cutoff lexicon.txt 2 -1
cutoff lexicon.txt -1 100
cutoff lexicon.txt 2 100
cutoff lexicon.txt -1 -1 $swl
cutoff lexicon.txt 2 100 $swl
```
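
Since the bash version is slow, the same operations can be sketched in Python. The function name `cutoff` mirrors the bash function, but its keyword arguments (`th_min`, `th_max`, `stop_words`) and the toy counts are assumptions made for this illustration; the real frequency list and stop word file would be loaded from the dataset.

```python
from collections import Counter


def cutoff(freq, th_min=None, th_max=None, stop_words=()):
    """Return the lexicon (set of words) that survives the frequency
    thresholds (inclusive, None = no limit) and stop word removal."""
    stop = set(stop_words)      # set membership test is O(1)
    return {
        word
        for word, count in freq.items()
        if (th_min is None or count >= th_min)
        and (th_max is None or count <= th_max)
        and word not in stop
    }


# Toy frequency list: the=3, movie=2, actor=1, of=1
freq = Counter("the movie the actor of the movie".split())

print(len(cutoff(freq)))                             # 4
print(sorted(cutoff(freq, th_min=2)))                # ['movie', 'the']
print(sorted(cutoff(freq, th_max=2)))                # ['actor', 'movie', 'of']
print(sorted(cutoff(freq, stop_words=["the", "of"])))  # ['actor', 'movie']
```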


## 2. Ngrams and Ngram Probabilities

An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a contiguous sequence of *n* items from a given sequence of text or speech. An n-gram model models sequences, notably natural languages, using the statistical properties of n-grams.

__Example__:

- character n-grams: cat
- word n-grams: the cat is fat

|                     | 1-gram  | 2-gram  | 3-gram  |
|---------------------|---------|---------|---------|
|                     | unigram | bigram  | trigram |
| *Markov Order*      | 0       | 1       | 2       |
| *Character N-grams* | `['c', 'a', 't']` | `['ca', 'at']` | `['cat']` |
| *Word N-grams*      | `['the', 'cat', 'is' , 'fat']` | `['the cat', 'cat is', ...]` | `['the cat is', ...]` |
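
The table rows can be reproduced with a short sliding-window function. This is a minimal sketch; `ngrams` is a name chosen for this example, and n-grams are returned as tuples rather than joined strings.

```python
def ngrams(items, n):
    """Return the list of contiguous n-grams (as tuples) of a sequence."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Character n-grams of "cat"
print(ngrams(list("cat"), 2))   # [('c', 'a'), ('a', 't')]

# Word n-grams of "the cat is fat"
words = "the cat is fat".split()
print(ngrams(words, 2))         # [('the', 'cat'), ('cat', 'is'), ('is', 'fat')]
print(ngrams(words, 3))         # [('the', 'cat', 'is'), ('cat', 'is', 'fat')]
```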


### 2.1. Counting Bigrams

The *frequency list* of a corpus is essentially a unigram count. A bigram count differs only in the unit of counting: pairs of adjacent words.

#### Exercise:

##### Objective
Compute bigram frequencies in the training set of the NL2SparQL4NLU corpus. Compare the top 5 most frequent bigrams to the values in the table.

 word 1 | word 2 | count 
:-------|:-------|-------:
show    | me     |   377
the     | movie  |   267
of      | the    |   186
me      | the    |   122
is      | the    |   120

##### Algorithm (for bash)
1. tokenize as token-per-line
2. print word#1 and word#2 side by side
3. count

##### Bash Solution

In [4]:
%%bash

trn=NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.utterances.txt

cat $trn | tr ' ' '\n' > w1.txt
tail -n +2 w1.txt > w2.txt
paste w1.txt w2.txt > bigrams.txt
cat bigrams.txt | sort | uniq -c | sort -gr |\
    awk '{OFS="\t";print $2, $3, $1}' |\
    head -n 5
    

show	me	377
the	movie	267
of	the	186
me	the	122
is	the	120


### 2.2. Counting Trigrams

#### Exercise:
Extend bigram counting to trigrams.

### 2.3. Improving N-grams: Sentence beginning & end tags

The bash bigram counting above has an issue of counting n-grams across sentence boundaries. Including sentence boundary markers leads to a better model.

#### Exercise:
Add sentence beginning (`<S>`) and sentence end (`</S>`) tags and recompute bigrams. (Remove `</S> <S>` from counts. Also take care of runover unigrams.)

##### Bash Solution

In [5]:
%%bash

trn=NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.utterances.txt

cat $trn | sed 's/^/<S> /g;s/$/ <\/S>/g' | tr ' ' '\n' > w1.txt
tail -n +2 w1.txt > w2.txt
paste w1.txt w2.txt | sed '/<\/S>	<S>/d' > bigrams.txt
cat bigrams.txt | sort | uniq -c | sort -gr |\
    awk '{OFS="\t";print $2, $3, $1}' |\
    head -n 5

<S>	what	511
<S>	show	450
show	me	377
movies	</S>	333
<S>	find	268
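
Counting per utterance in Python avoids the cross-boundary bigrams by construction, so nothing has to be removed afterwards. A minimal sketch on toy data; `count_ngrams` and its parameters are names chosen for this example. It adds a single `<S>` tag regardless of `n`, which is a simplification for orders above 2.

```python
from collections import Counter


def count_ngrams(utterances, n=2, bos="<S>", eos="</S>"):
    """Count n-grams utterance by utterance, with sentence boundary tags."""
    counts = Counter()
    for utt in utterances:
        tokens = [bos] + utt.split() + [eos]
        # slide a window of size n over the tagged utterance
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


sample = ["show me the movie", "show me the cast"]

bigrams = count_ngrams(sample, n=2)
print(bigrams[("<S>", "show")])     # 2
print(bigrams[("</S>", "<S>")])     # 0: never counted across utterances

trigrams = count_ngrams(sample, n=3)
print(trigrams[("<S>", "show", "me")])  # 2
```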


### 2.4. Computing N-gram Probabilities

#### 2.4.1. Calculating Probability from Frequencies

Probabilities of n-grams can be computed from relative frequency counts (*Maximum Likelihood Estimation*), as follows.

N-gram   | Equation                      |
:--------|:------------------------------|
Unigram  | $$p(w_i) = \frac{c(w_i)}{N}$$ |
Bigram   | $$p(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$ |

where:
- $N$ is the total number of words in a corpus
- $c(x)$ is the count of occurrences of $x$ in a corpus ($x$ could be a unigram, bigram, etc.)
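
The two equations translate directly into divisions over count tables. The sketch below is an illustration on a toy token sequence without sentence boundary tags; `mle_unigram` and `mle_bigram` are names chosen for this example.

```python
from collections import Counter


def mle_unigram(unigram_counts):
    """p(w) = c(w) / N, where N is the total number of tokens."""
    total = sum(unigram_counts.values())
    return {w: c / total for w, c in unigram_counts.items()}


def mle_bigram(bigram_counts, unigram_counts):
    """p(w2 | w1) = c(w1, w2) / c(w1)."""
    return {
        (w1, w2): c / unigram_counts[w1]
        for (w1, w2), c in bigram_counts.items()
    }


tokens = "the cat is fat the cat sat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))   # pairs of adjacent tokens

p_uni = mle_unigram(unigrams)
p_bi = mle_bigram(bigrams, unigrams)

print(p_bi[("the", "cat")])   # 1.0  ("the" is always followed by "cat")
print(p_bi[("cat", "is")])    # 0.5  ("cat" is followed by "is" or "sat")
```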

#### Exercise(s):

- Compute (automatically) probabilities of each token in the lexicon.
- Calculate N-gram probabilities of:
  
  - $p(the | of)$
  - $p(the | is)$
  - $p(play | the)$
  - all n-gram probabilities of words after `italy`, i.e. $p(*|italy)$

## 3. Language Models

A statistical [language model](https://en.wikipedia.org/wiki/Language_model) is a probability distribution over sequences of words. Given such a sequence, say of length $m$, it assigns a probability $P(w_{1},\ldots ,w_{m})$ to the whole sequence (using Chain Rule). Consequently, the unigram and bigram probabilities computed above constitute a language model of our corpus.

It is more useful for Natural Language Processing to have a __probability__ of a sequence being legal, rather than a grammar's __boolean__ decision whether it is legal or not.

### 3.2. Computing Probability of a Sequence (String)

The most common usage of a language model is to compute probability of a sequence.

#### 3.2.1. Probability of a Sequence: [Chain Rule](https://en.wikipedia.org/wiki/Chain_rule_(probability))

Probability of a sequence is computed as a product of conditional probabilities (chain rule). 

$$p(w_1,...,w_T)=p(w_1) \prod_{i=2}^T p(w_i|w_1,...,w_{i-1}) = p(w_1) \prod^T_{i=2} p(w_i|h_i)$$

where $h_i = \{w_1,...,w_{i-1}\}$ is the history (previous context) of $w_i$. The order of the n-gram truncates the history to length $n - 1$, under the simplifying assumption that the probability of the current element depends only on the previous $n - 1$ elements.

$$p(w_i|w_1,...,w_{i-1}) = p(w_i|w_{i-n+1},...,w_{i-1})$$

Consequently we have:

N-gram   | Equation                         |
:--------|:---------------------------------|
unigram  | $$p(w_i)$$                       |
bigram   | $$p(w_i \mid w_{i-1})$$          |
trigram  | $$p(w_i \mid w_{i-2},w_{i-1})$$  |

__Example__: "The cat is fat"

$$p(\langle s \rangle, the, cat, is, fat, \langle /s \rangle) =$$
$$= p(the | \langle s \rangle) * p(cat | the) * p(is | cat) * p(fat | is) * p(\langle /s \rangle | fat) = $$
$$= 0.25 * 0.10 * 0.20 * 0.05 * 0.15 = 0.0000375$$
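
The worked example can be checked with a few lines of Python. The probability table below just copies the illustrative values from the example (they are not estimated from the corpus), and `sequence_probability` is a name chosen for this sketch.

```python
# Bigram probabilities from the example (illustrative values, not estimated from data).
bigram_p = {
    ("<s>", "the"): 0.25,
    ("the", "cat"): 0.10,
    ("cat", "is"): 0.20,
    ("is", "fat"): 0.05,
    ("fat", "</s>"): 0.15,
}


def sequence_probability(tokens, bigram_p, bos="<s>", eos="</s>"):
    """Product of bigram probabilities over the tagged sequence.
    Unseen bigrams get probability 0.0 (no smoothing)."""
    padded = [bos] + tokens + [eos]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= bigram_p.get((prev, cur), 0.0)
    return p


p = sequence_probability("the cat is fat".split(), bigram_p)
print(p)   # approximately 3.75e-05
```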


#### Exercise:
Using computed n-gram probabilities compute the probabilities of utterances in `NL2SparQL4NLU/dataset/NL2SparQL4NLU.test.utterances.txt`

__Expect Errors!__ (we will address them later)

### 3.1. Language Model as Automaton

A language model can be used as an automaton that generates probable sequences, using the algorithm below.

__Algorithm for Bigram LM__

- $w_{i-1} = \langle s \rangle$;
- *while* $w_i \neq \langle /s \rangle$

    - stochastically get new word $w_i$ w.r.t. $p(w_i|w_{i-1})$
    - set $w_{i-1} = w_i$
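
One possible shape of such a generator is sketched below. It assumes the bigram model is a dictionary from `(previous, current)` pairs to probabilities; the `max_len` guard and the `rng` parameter are additions for this example (to avoid endless loops and to allow seeding), not part of the algorithm above.

```python
import random


def generate(bigram_p, bos="<s>", eos="</s>", max_len=20, rng=None):
    """Sample a word sequence from a bigram model given as {(prev, cur): prob}."""
    if rng is None:
        rng = random.Random()

    # Group the possible next words (and their probabilities) by history.
    successors = {}
    for (prev, cur), p in bigram_p.items():
        successors.setdefault(prev, []).append((cur, p))

    output = []
    prev = bos
    while len(output) < max_len:
        if prev not in successors:      # dead end: no known continuation
            break
        words, probs = zip(*successors[prev])
        cur = rng.choices(words, weights=probs, k=1)[0]
        if cur == eos:                  # end-of-sentence tag stops generation
            break
        output.append(cur)
        prev = cur                      # the new word becomes the history
    return output


# Deterministic toy model: every history has exactly one continuation.
toy = {("<s>", "the"): 1.0, ("the", "cat"): 1.0, ("cat", "</s>"): 1.0}
print(generate(toy))   # ['the', 'cat']
```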

#### Exercise
Using computed n-gram probabilities implement a function to generate sequences.

### 3.3. LM Issues

#### 3.3.1. Underflow Problem

Probabilities are at most $1$ and usually much smaller.
Multiplying many of them may cause floating-point *underflow*.

##### Solution
Use the sum of the log probabilities instead of the product of the probabilities.

| Properties     |
|:---------------|
| $$p(a) > p(b) \Leftrightarrow \log(p(a)) > \log(p(b))$$ |
| $$\log(a*b) = \log(a) + \log(b)$$ |
| $$p(a) * p(b) \rightarrow \log(p(a)) + \log(p(b))$$ |
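
A small demonstration of the effect, with made-up values: the product of 100 probabilities of $10^{-5}$ is $10^{-500}$, which is below the smallest positive double-precision number, while the sum of their logs is an ordinary number.

```python
import math

probs = [1e-5] * 100        # 100 small (made-up) probabilities

# Direct product: 1e-500 cannot be represented as a float and underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Sum of natural logs: stays well within floating-point range.
log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # about -1151.29
```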

#### 3.3.2. Data Sparseness

Many rare, but possible combinations are not present in training data. 
Unseen words and n-grams have $0$ probability. 
Thus, whole sequence gets $0$ probability.

__Example__: Unseen Word: "cow"

$$p(\langle s \rangle, the, cow, is, fat, \langle /s \rangle) =$$ 
$$= p(the | \langle s \rangle) * p(cow | the) * p(is | cow) * p(fat | is) * p(\langle /s \rangle | fat) = $$
$$= 0.25 * 0.00 * 0.00 * 0.05 * 0.15 = 0.00$$

Log probabilities do not solve this problem: $\log(0)$ is undefined (it tends to $-\infty$), so zero probabilities have to be removed by smoothing.

##### Solution: Smoothing
- Add some probability to unseen events
- Remove some probability from seen events -- __discounting__
- The probability distribution must still sum to 1!

#### 3.3.3. Out-Of-Vocabulary (OOV) Rate

*OOV Rate*: % of word tokens in test data that are not contained in the lexicon (vocabulary).
Empirically each OOV word results in 1.5 - 2 extra errors (> 1 due to the loss of contextual information).
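
The definition amounts to a ratio of token counts. A minimal sketch on toy data; `oov_rate` is a name chosen for this example, and the lexicon would normally come from the training set.

```python
def oov_rate(test_utterances, lexicon):
    """Percentage of test tokens that are not in the lexicon."""
    tokens = [t for utt in test_utterances for t in utt.split()]
    if not tokens:
        return 0.0
    n_oov = sum(1 for t in tokens if t not in lexicon)
    return 100.0 * n_oov / len(tokens)


lexicon = {"the", "cat", "is", "fat"}
rate = oov_rate(["the cow is fat", "the cat"], lexicon)
print(round(rate, 2))   # 16.67 (1 OOV token, "cow", out of 6)
```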

#### Exercise (Optional):
Compute OOV Rate for `NL2SparQL4NLU/dataset/NL2SparQL4NLU.test.utterances.txt`

### 3.4. Handling Unseen Words

How to handle words (in test set) that were never seen in the training data?
Train a language model with specific token (e.g. `<UNK>`) for unknown words!
How to estimate probabilities of unknown words and n-grams?

The *simplest* approach is to replace all the words that are not in the vocabulary (lexicon) with the `<UNK>` token and treat it as any other word. (For instance, applying a frequency cut-off to the lexicon makes it possible to estimate these probabilities on the training set.)
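
A minimal sketch of this approach on toy data: the vocabulary is built with a frequency cut-off, and every word outside it is mapped to `<UNK>`. The names `build_vocab` and `replace_unknown` and the threshold `min_count=2` are assumptions made for this example.

```python
from collections import Counter


def build_vocab(utterances, min_count=2):
    """Vocabulary = words that occur at least min_count times."""
    counts = Counter(t for utt in utterances for t in utt.split())
    return {w for w, c in counts.items() if c >= min_count}


def replace_unknown(utterance, vocab, unk="<UNK>"):
    """Tokenize an utterance and map out-of-vocabulary tokens to unk."""
    return [t if t in vocab else unk for t in utterance.split()]


train = ["the cat is fat", "the dog is thin"]
vocab = build_vocab(train, min_count=2)      # only "the" and "is" occur twice

print(replace_unknown("the cow is fat", vocab))
# ['the', '<UNK>', 'is', '<UNK>']
```

Applying `replace_unknown` to the training utterances before counting n-grams gives `<UNK>` its own counts, so its probabilities are estimated like those of any other word.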


### 3.5. Smoothing

Available smoothing methods: ([tutorial](https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf))
- [Additive smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) (__simplest__)
- Good-Turing estimate
- Jelinek-Mercer smoothing (interpolation)
- Katz smoothing (backoff)
- Witten-Bell smoothing
- Absolute discounting
- Kneser-Ney smoothing


#### 3.5.1. Add-One Smoothing
A kind of additive smoothing: pretend that the training data also contains every possible n-gram exactly once, i.e. add 1 to every n-gram count.

__Bigrams__

$V$ -- vocabulary size (number of distinct words, i.e. of possible values of $w_i$)

$$p(w_i | w_{i-1}) = \frac{c(w_{i-1},w_i)+1}{c(w_{i-1})+V}$$

__N-grams__

$V$ -- vocabulary size (number of distinct words that can follow the $(N-1)$-gram history)

$$p(w_i | w^{i-1}_{i-N+1}) = \frac{c(w^{i-1}_{i-N+1},w_i)+1}{c(w^{i-1}_{i-N+1})+V}$$

Typically, the vocabulary is assumed to be $\{w : c(w) > 0\} \cup \{\langle UNK \rangle\}$, and $V$ is its size.
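
The bigram formula can be sketched as follows, taking $V$ to be the number of word types in the vocabulary (including `<UNK>`). The toy data has no sentence boundary tags, and `add_one_bigram` is a name chosen for this example.

```python
from collections import Counter


def add_one_bigram(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed p(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)


tokens = "the cat is fat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab = set(tokens) | {"<UNK>"}
V = len(vocab)                                # 5 word types

seen = add_one_bigram("the", "cat", bigrams, unigrams, V)      # (1 + 1) / (1 + 5)
unseen = add_one_bigram("the", "fat", bigrams, unigrams, V)    # (0 + 1) / (1 + 5)
print(seen, unseen)

# The smoothed distribution over all possible next words still sums to 1.
total = sum(add_one_bigram("the", w, bigrams, unigrams, V) for w in vocab)
```

Missing keys in a `Counter` return 0, so unseen bigrams need no special handling.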


### 3.6. Exercise: Putting it all together

Train a Language Model (compute n-gram probabilities) such that:

- case insensitive (by default)
- 2-gram
- log probabilities
- considers sentence boundaries (beginning and end of sentence tags)
- considers unknown words
- Add-One Smoothing

Compute probabilities of utterances in `NL2SparQL4NLU/dataset/NL2SparQL4NLU.test.utterances.txt`

### 3.7. LM Evaluation: [Perplexity](https://en.wikipedia.org/wiki/Perplexity)

- Measures how well model fits test data;
- Based on the probability the model assigns to the test data (inverse probability, normalized by the number of words);
- Weighted average branching factor in predicting the next word (lower is better).
- Computed as:

$$ PPL = \sqrt[N]{\frac{1}{p(w_1,w_2,...,w_N)}} = \sqrt[N]{\frac{1}{\prod_{i=1}^{N}p(w_i|w_{i-n+1},...,w_{i-1})}}$$

where $N$ is the number of words in the test set and $n$ is the order of the n-gram model.
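
In practice perplexity is computed from log probabilities to avoid underflow: the $N$-th root of the inverse product equals the exponential of the negative average log probability. A minimal sketch, assuming the model's conditional probabilities are already available as natural logs; `perplexity` is a name chosen for this example.

```python
import math


def perplexity(log_probs):
    """Perplexity from natural-log conditional probabilities,
    one per predicted token: exp(-(1/N) * sum(log p))."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)


# Toy check: if every token is predicted with probability 0.25,
# the model is as uncertain as a uniform choice among 4 words.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
print(ppl)   # approximately 4.0
```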


#### Exercise (Optional):
Calculate Perplexity for `NL2SparQL4NLU/dataset/NL2SparQL4NLU.test.utterances.txt`