# Traditional language models

<sup>This notebook is part of the Natural Language Processing class at the University of Ljubljana, Faculty of Computer and Information Science. Please contact [slavko.zitnik@fri.uni-lj.si](mailto:slavko.zitnik@fri.uni-lj.si) with any comments.</sup>

A model (from a statistical point of view) is a mathematical representation of a process. A model is usually only an approximation of the process, for two important reasons:

1. We observe the process only a limited number of times.
2. An exact model can be very complex, so we normally simplify it.

As a well-known saying in statistics goes: `All models are wrong, but some are useful.`

## Bag of words

We have already seen some models. The simplest of them is the bag-of-words model, a naive way of modelling human language that is nevertheless useful and popular. About the bag-of-words model we also know:

1. It has an oversimplified view of the language.
2. It takes into account only the frequency of the words in a text, not their order or position.
3. The way we built it, it was useful for tasks such as text classification or sentiment analysis, where we were interested only in individual words and their counts.
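
As a reminder of what the second point means in practice, here is a minimal sketch (not the implementation used earlier in the course). The helper `bag_of_words` and the two example sentences are made up for illustration, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter


def bag_of_words(text):
    """Map a text to its token counts; word order is discarded."""
    return Counter(text.lower().split())


first = bag_of_words("the dog bites the man")
second = bag_of_words("the man bites the dog")

print(first)
# Both sentences contain the same words with the same counts,
# so the bag-of-words model cannot tell them apart.
print(first == second)  # True
```

The two sentences have opposite meanings, yet their representations are identical, which is exactly the limitation that sequence models try to address.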

## n-Grams

Text is always a sequence: a sequence of words, characters, symbols, ... So one idea is to model how text is generated, that is, which token is most likely to come next in a given sequence. We can learn probabilities over two tokens (bigrams), three tokens (trigrams), ... or n tokens (n-grams).
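
Written as a formula, this idea corresponds to the standard n-gram approximation. The symbols $w_i$ (the i-th token) and $m$ (the length of the sequence) are notation introduced here for illustration, not taken from the notebook:

$$
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
$$

The first equality is the chain rule of probability. The approximation assumes that a token depends only on the $n-1$ tokens before it, so a trigram model ($n = 3$) conditions each token on the two preceding tokens.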

"Bigram" is just a fancy name for two consecutive tokens, while an n-gram is an n-tuple of consecutive tokens. Let's show a quick example of using word-based n-grams.

In [1]:
import random
import math
from functools import reduce
import operator
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
 
first_sentence = reuters.sents()[0]
print("First sentence: \n\t{}\n\n".format(first_sentence))
 
print("Bigrams: \n\t{}\n\n".format(list(bigrams(first_sentence))))
print("Padded bigrams: \n\t{}\n\n".format(list(bigrams(first_sentence, pad_left=True, pad_right=True))))
print("Trigrams: \n\t{}\n\n".format(list(trigrams(first_sentence))))
print("Padded trigrams: \n\t{}\n\n".format(list(trigrams(first_sentence, pad_left=True, pad_right=True))))

First sentence: 
	['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']


Bigrams: 
	[('ASIAN', 'EXPORTERS'), ('EXPORTERS', 'FEAR'), ('FEAR', 'DAMAGE'), ('DAMAGE', 'FROM'), ('FROM', 'U'), ('U', '.'), ('.', 'S'), ('S', '.-'), ('.-', 'JAPAN'), ('JAPAN', 'RIFT'), ('RIFT', 'Mounting'), ('Mounting', 'trade'), ('trade', 'friction'), ('friction', 'between'), ('between', 'the'), ('the', 'U'), ('U', '.'), ('.', 'S'), ('S', '.'), ('.', 'And'), ('And', 'Japan'), ('Japan', 'has'), ('has', 'raised'), ('raised', 'fears'), ('fears', 'among'), ('among', 'many'), ('many', 'of'), ('of', 'Asia'), ('Asia', "'"), ("'", 's'), ('s', 'exporting'), ('exporting', 'n

In [2]:
# 1. Get the data and count occurrences
model = defaultdict(lambda: defaultdict(lambda: 0))
 
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

print("'What the economists' trigram occurence number: {}".format(model["what", "the"]["economists"]))
print("'Hell' follows 'What the' in {} cases.".format(model["what", "the"]["hell"]))
print("{} sentences start with 'The'\n\n".format(model[None, None]["The"])) 

# 2. Transform occurrences to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

print("'What the economists' trigram probability in given text: {:.3}".format(model["what", "the"]["economists"]))
print("The probability of 'Hell' following 'What the' is {:.3}.".format(model["what", "the"]["hell"]))
print("The probability of a sentence to start with 'The' is {:.3}.".format(model[None, None]["The"]))

'What the economists' trigram occurence number: 2
'Hell' follows 'What the' in 0 cases.
8839 sentences start with 'The'


'What the economists' trigram probability in given text: 0.0435
The probability of 'Hell' following 'What the' is 0.0.
The probability of a sentence to start with 'The' is 0.162.
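
The two steps above amount to a relative-frequency estimate: the probability of `w3` after `(w1, w2)` is the trigram count divided by the total count of that context. The following is a small self-contained sketch of the same computation on a made-up toy corpus, so that the numbers can be checked by hand. The helper names `padded_trigrams` and `probability` and the corpus itself are assumptions for illustration. The padding imitates what `trigrams(..., pad_left=True, pad_right=True)` produces.

```python
from collections import Counter, defaultdict


def padded_trigrams(tokens):
    """Yield trigrams with two None paddings on each side of the sentence."""
    padded = [None, None] + list(tokens) + [None, None]
    return zip(padded, padded[1:], padded[2:])


# Hypothetical toy corpus of three tokenized sentences.
toy_corpus = [
    ["the", "company", "said"],
    ["the", "company", "grew"],
    ["the", "bank", "said"],
]

# Step 1: count how often w3 follows the context (w1, w2).
counts = defaultdict(Counter)
for sentence in toy_corpus:
    for w1, w2, w3 in padded_trigrams(sentence):
        counts[(w1, w2)][w3] += 1


def probability(w1, w2, w3):
    """Step 2: relative frequency count(w1, w2, w3) / count(w1, w2, anything)."""
    context = counts.get((w1, w2))
    if not context:
        return 0.0  # unseen context: no estimate is available
    return context[w3] / sum(context.values())


print(probability("the", "company", "said"))  # 1 of 2 continuations
print(probability(None, None, "the"))         # every sentence starts with "the"
print(probability("what", "the", "hell"))     # unseen context
```

Using `counts.get(...)` for lookups is a deliberate choice in this sketch: indexing a `defaultdict` with an unseen key silently inserts an empty entry, which can matter later when iterating over the model.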


### Text generation

#### Greedy approach
Algorithm: At each step, select the most probable word given the last n-1 words.

In [4]:
# 3. Use the model (e.g. text generation)
text = [None, None]  # left padding, i.e. the beginning of a sentence
sentence_finished = False
probs = []

while not sentence_finished:
    # Pick the most probable next token given the last two tokens.
    token, prob = max(model[tuple(text[-2:])].items(), key=operator.itemgetter(1))
    text.append(token)
    probs.append(prob)

    # Two consecutive padding tokens mark the end of the sentence.
    if text[-2:] == [None, None]:
        sentence_finished = True

print(f"Probability of text: {reduce(operator.mul, probs, 1)}\n")
print(f"Token probabilities: \n\t'{' '.join(str(prob) for prob in probs)}'")
print(f"Generated sequence: \n\t'{' '.join([token for token in text if token])}'")

Probability of text: 0.0034298705414246475

Token probabilities: 
	'0.16154324146501936 0.13055775540219483 0.6303797468354431 0.2580115036976171 0.9998732251521298 1.0'
Generated sequence: 
	'The company said .'


We can call the example above **greedy decoding**, which always yields the same result for a given input. To generate more useful text, tokens should be selected with some randomness, while also taking the cumulative score of the whole sequence into account. Some options are:

* **Beam search** is also a deterministic decoding method and offers an improvement over greedy decoding. A problem with greedy decoding is that we might miss the most likely sequence, since we pick only the most probable word at each time step. Beam search mitigates this by keeping track of the n most probable sequences at every step and ultimately selecting the most probable one.
* **Top *k* sampling** keeps the k most probable words, redistributes the probability mass among them and samples from this reduced set. The problem is that we must choose a fixed k, which might lead to suboptimal results in some scenarios.
* **Top *p* sampling** addresses this by selecting the smallest set of most probable words whose cumulative probability just exceeds p. The probability mass is then again redistributed among these words.
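
The two sampling strategies from the list can be sketched as filters over a next-token distribution. This is a minimal illustration under stated assumptions, not a reference implementation: `top_k_filter`, `top_p_filter`, `sample` and `toy_dist` are hypothetical names and values, and the top *p* filter stops as soon as the cumulative probability reaches or exceeds p.

```python
import random


def top_k_filter(dist, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(dist.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {word: prob / total for word, prob in top}


def top_p_filter(dist, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for word, prob in sorted(dist.items(), key=lambda item: item[1], reverse=True):
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {word: prob / cumulative for word, prob in kept}


def sample(dist):
    """Draw one token according to the given probabilities."""
    words, weights = zip(*dist.items())
    return random.choices(words, weights=weights, k=1)[0]


# Made-up next-token distribution for some two-word context.
toy_dist = {"said": 0.5, "is": 0.25, "has": 0.125, "will": 0.125}

print(top_k_filter(toy_dist, 2))    # only "said" and "is" remain
print(top_p_filter(toy_dist, 0.8))  # "said", "is" and "has" remain
print(sample(top_p_filter(toy_dist, 0.8)))
```

With the trigram model from this notebook, the same filters could be applied to `model[tuple(text[-2:])]` before drawing the next token. Note that the number of tokens kept by top *p* changes with the shape of the distribution, while top *k* always keeps exactly k.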

The randomized example below samples each next token according to the model's probabilities. It also shows that there are no fixed rules, so you may use your imagination in NLP:

In [5]:
# 3. Use the model (e.g. text generation)
 
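def rand_ngram_generator(initial):
    text = list(initial)  # copy, so the caller's list is not modified
    sentence_finished = False
    prob = 1.0

    while not sentence_finished:
        candidates = model[tuple(text[-2:])]
        if not candidates:
            # Unseen context: the model offers no continuation, so stop here.
            break

        r = random.random()
        accumulator = .0

        # Walk through the cumulative distribution until it passes r.
        for word, word_prob in candidates.items():
            accumulator += word_prob
            if accumulator >= r:
                break
        # If rounding errors keep the accumulator below r, the last word is used.

        prob *= word_prob
        text.append(word)

        # Two consecutive padding tokens mark the end of the sentence.
        if text[-2:] == [None, None]:
            sentence_finished = True

    print(f"Probability of text: {prob}\n")
    print(f"Generated sequence: \n\t'{' '.join([token for token in text if token])}'")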

In [9]:
rand_ngram_generator([None, None])

Probability of text: 1.6389479350819981e-56

Generated sequence: 
	'But he said in its mid - Mississippi River ( ex Chicago ) 100 pct of the year , not the central bank intervention at 143 . 75 dlrs Net profit 686 , 000 Revs 4 , 142 , 095 , 991 vs profit 123 , 500 in the previous week , Schlesinger told Reuters that European governments unfairly subsidise the Airbus project .'


In [10]:
rand_ngram_generator(["I", "am"])

Probability of text: 3.5612535612535607e-06

Generated sequence: 
	'I am sure that something they might have for acquisitions .'


In [11]:
rand_ngram_generator(["They", "were"])

Probability of text: 6.171160366632004e-10

Generated sequence: 
	'They were rather a reduction in output from the December 1986 .'


## Exercises

Collect as many Christmas and New Year wishes (in Slovene) as possible. Then analyse your corpus and train a simple language model that generates a wish for your close ones.

Implement beam search and different sampling techniques, and compare the results. Also try other, larger corpora.
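
As a possible starting point for the beam search part, here is a minimal sketch on a made-up toy trigram model. The function `beam_search`, its parameters `beam_width` and `max_len`, and `toy_model` are assumptions for illustration. The model uses the same layout as in this notebook, mapping a context `(w1, w2)` to the probabilities of the next token, with `None` as padding.

```python
import math

# Hypothetical toy trigram model: (w1, w2) -> {next token: probability}.
toy_model = {
    (None, None): {"the": 0.6, "a": 0.4},
    (None, "the"): {"cat": 0.5, "dog": 0.5},
    (None, "a"): {"cat": 0.9, "dog": 0.1},
    ("the", "cat"): {None: 1.0},
    ("the", "dog"): {None: 1.0},
    ("a", "cat"): {None: 1.0},
    ("a", "dog"): {None: 1.0},
    ("cat", None): {None: 1.0},
    ("dog", None): {None: 1.0},
}


def beam_search(model, beam_width=2, max_len=20):
    """Return the best token sequence found and its probability."""
    # Each hypothesis is (log probability, tokens); start from the left padding.
    beams = [(0.0, [None, None])]
    finished = []

    for _ in range(max_len):
        candidates = []
        for log_prob, tokens in beams:
            for word, prob in model.get(tuple(tokens[-2:]), {}).items():
                if prob > 0:
                    candidates.append((log_prob + math.log(prob), tokens + [word]))
        if not candidates:
            break

        # Keep only the beam_width most probable hypotheses.
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            # Two trailing padding tokens mean the sentence is complete.
            if cand[1][-2:] == [None, None]:
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break

    best_log_prob, best_tokens = max(finished or beams, key=lambda cand: cand[0])
    return [t for t in best_tokens if t is not None], math.exp(best_log_prob)


print(beam_search(toy_model, beam_width=1))  # behaves like greedy decoding
print(beam_search(toy_model, beam_width=2))
```

With these toy numbers, a beam width of 1 commits to "the" (0.6) and ends with a sequence of probability 0.3, while a beam width of 2 keeps "a" alive and finds "a cat" with probability 0.36. Log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on longer sequences.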