Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi, and on the [Infini-gram](https://infini-gram.io/) documentation.

# N-GRAM LANGUAGE MODELS

N-gram language models estimate the probability of each word given the *n-1* words that precede it.

To "train" such models, we will make use of the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totalling 1.3 million words.

In [5]:
from nltk.corpus import reuters

We can check how many sentences the corpus contains. Each sentence is a list of words.

In [6]:
import nltk
nltk.download('reuters')

print(len(reuters.sents()))

print(reuters.sents()[0])
for w in reuters.sents()[0]:
    print(w, end=' ')

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/davideteixeira/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


54716
['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . 

## Unigram model

For starters, let's build a unigram language model.

In [7]:
from collections import defaultdict

# Create a placeholder for the model
uni_model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        uni_model[w] += 1

Now that we have the counts, we need to transform them into probabilities:

In [8]:
total_count = float(sum(uni_model.values()))
for w in uni_model:
    uni_model[w] /= total_count

#### Likely words

How likely is the word 'the'?

In [12]:
# your code here
# uni_model['the'] holds the probability; multiplying by total_count recovers the raw count
print(uni_model['the'] * total_count)

58251.0


What is the most likely word in the corpus?

In [17]:
# your code here
# Pick the token with the highest probability (without shadowing the built-in max)
max_key = max(uni_model, key=uni_model.get)

# Report the winning token, not the last key visited
print(max_key, uni_model[max_key] * total_count)


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [18]:
import random

# number of words to generate
total_words = 100
text = []

for i in range(total_words):
    # select a random probability threshold
    r = random.random()

    # select word above the probability threshold
    accumulator = .0
    for word in uni_model.keys():
        accumulator += uni_model[word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

adjustment greatest futures Finance guesses January vs , 27 466 . mln , of has - month 900 over adverse - Therapies at MOORE default in dlrs & new for noted higher AFTERNOON tube the U said for vs products mln noting two studying SALE February destined 000 Energy 000 Corp of to of the , Santa to of not venture a Liro QTR from said a one / Avg the QTR equalling Dart cts currency than dollar . quarter sells on dlrs Congress of Net each ROWTON from ^ processing . issue candidate party the of difficult to Spie


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input left and right and define our own sequence start and sequence end symbols.

We first need to obtain the counts:

In [19]:
from nltk import bigrams

# Create a placeholder for the model
bi_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bi_model[w1][w2] += 1

As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.

In [None]:
# your code here
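
One possible way to do this normalisation, sketched on a tiny hand-made corpus so that it runs on its own. The names `toy_sents` and `toy_bi` are illustrative assumptions, not part of the notebook, and the padding is done by hand instead of with NLTK's `bigrams`.

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for reuters.sents()
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

# Count bigrams, padding each sentence with <s> and </s> by hand
toy_bi = defaultdict(lambda: defaultdict(int))
for sent in toy_sents:
    padded = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        toy_bi[w1][w2] += 1

# Divide each count by the total count of its first word (the context)
for w1 in toy_bi:
    context_total = float(sum(toy_bi[w1].values()))
    for w2 in toy_bi[w1]:
        toy_bi[w1][w2] /= context_total

print(dict(toy_bi["the"]))
```

After this step, the probabilities stored under each first word sum to 1, which is a quick sanity check to run on the real model as well.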


#### Likely pairs

What are the probabilities of each word following 'today'?

In [None]:
# your code here


What are the probabilities for sentence-starting words? What do most of them have in common? (Hint: check the *left_pad_symbol* defined above for collecting bigrams.)

In [None]:
# your code here


#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [None]:
import random

# sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous word text[-1]
    # your code here
    

print(' '.join(text))
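
The sampling step can reuse the cumulative-threshold idea from the unigram generator, restricted to the words that follow the previous one. The sketch below assumes a small hand-written table `toy_bi` and a hypothetical helper `sample_next`; neither comes from the notebook.

```python
import random

# Hypothetical toy bigram probabilities: toy_bi[w1][w2] = P(w2 | w1)
toy_bi = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def sample_next(dist, rng):
    """Pick a word by walking the cumulative distribution up to a random threshold."""
    r = rng.random()
    accumulator = 0.0
    for word, prob in dist.items():
        accumulator += prob
        if accumulator >= r:
            return word
    # Rounding error can leave the accumulator slightly below r: fall back to the last word
    return word

rng = random.Random(0)  # seeded only to make the sketch reproducible
text = ["<s>"]
while text[-1] != "</s>":
    text.append(sample_next(toy_bi[text[-1]], rng))

print(' '.join(text))
```

The fallback after the loop matters on real data, where thousands of floating-point probabilities may add up to slightly less than 1.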

## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).

In [None]:
# your code here


#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [None]:
# your code here


#### Generating text

Create your text generator based on the trigram model. Does the generated text start to feel a bit more coherent?

In [None]:
# your code here


## N-gram models

For larger *n*, we can use NLTK's [n-grams](https://www.nltk.org/_modules/nltk/util.html#ngrams), which allows us to choose an arbitrary *n*.

Create your own 4-gram model.

In [None]:
# your code here
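
A minimal sketch of a model builder for arbitrary *n*, assuming contexts are stored as tuples of *n-1* words. The functions `train_ngram` and `most_likely` and the toy sentences are hypothetical; the padding mimics by hand what NLTK's `ngrams` does when left and right padding are enabled.

```python
from collections import defaultdict

def train_ngram(sents, n):
    """Return model[context][word] = P(word | context), with context an (n-1)-tuple."""
    model = defaultdict(lambda: defaultdict(int))
    for sent in sents:
        # Pad with n-1 start and end symbols on each side
        padded = ["<s>"] * (n - 1) + list(sent) + ["</s>"] * (n - 1)
        for i in range(len(padded) - n + 1):
            *context, word = padded[i:i + n]
            model[tuple(context)][word] += 1
    # Normalise the counts of each context into probabilities
    for context in model:
        total = float(sum(model[context].values()))
        for word in model[context]:
            model[context][word] /= total
    return model

def most_likely(model, context, k=3):
    """Return up to k (word, probability) pairs, most probable first."""
    # .get avoids inserting unseen contexts into the defaultdict
    dist = model.get(tuple(context), {})
    return sorted(dist.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical toy corpus standing in for reuters.sents()
toy_sents = [
    ["today", "the", "public", "is", "happy"],
    ["today", "the", "public", "is", "angry"],
    ["today", "the", "public", "was", "calm"],
]
four_gram = train_ngram(toy_sents, 4)
top_words = most_likely(four_gram, ["today", "the", "public"])
print(top_words)
```

Keying the model by a context tuple means the same two functions cover bigrams, trigrams and 4-grams; only `n` changes.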


#### Likely tuples

Check the most likely words following "today the public".

In [None]:
# your code here


#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?

In [None]:
# your code here


#### Prompting the model

Ask the user for some input as a prompt, and make your model continue the input text!

In [None]:
# your code here
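
One way to continue a prompt is to use its last *n-1* tokens as the context for the next word. The sketch below is a greedy variant (it always takes the most probable word), assuming a model shaped as `model[context_tuple][word] = probability`. The function `continue_prompt` and the table `toy_tri` are hypothetical; in the notebook the prompt tokens would come from `input()` rather than a fixed list.

```python
def continue_prompt(model, prompt_tokens, n, max_new=20):
    """Greedily extend prompt_tokens using model[context] -> {word: prob}."""
    # Left-pad so that short prompts still yield a full (n-1)-word context
    tokens = ["<s>"] * (n - 1) + list(prompt_tokens)
    for _ in range(max_new):
        context = tuple(tokens[len(tokens) - (n - 1):])
        dist = model.get(context)
        if not dist:
            break  # unseen context: a fuller model would back off to a shorter one
        word = max(dist, key=dist.get)
        if word == "</s>":
            break
        tokens.append(word)
    # Drop the padding before returning
    return tokens[n - 1:]

# Hypothetical toy trigram probabilities
toy_tri = {
    ("<s>", "<s>"): {"the": 1.0},
    ("the", "cat"): {"sat": 0.6, "ran": 0.4},
    ("cat", "sat"): {"down": 1.0},
    ("sat", "down"): {"</s>": 1.0},
}
continued = continue_prompt(toy_tri, ["the", "cat"], 3)
print(' '.join(continued))
```

Greedy decoding is deterministic and can loop on real data, which is why `max_new` caps the length; swapping `max` for a random draw gives the sampling behaviour of the earlier generators.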


## Perplexity

Now that you have several language models based on different n-gram sizes, you can compare them in terms of perplexity regarding some provided text. Before you do that, however, you might want to apply smoothing to your language models, if you haven't done so already. This will ensure that no sequence gets a probability of zero.

For the unigram model, we first get the probability of a sequence by multiplying the probabilities of its tokens, and then apply the perplexity formula.

In [None]:
from nltk import word_tokenize

# word_tokenize relies on NLTK's Punkt tokenizer data, which may need to be downloaded first
text = input("Sentence: ")
tokens = word_tokenize(text)

In [None]:
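# Multiply the token probabilities (an unseen token has probability 0 unless the model is smoothed)
uni_prob = 1
for token in tokens:
    uni_prob *= uni_model[token]

# Perplexity: the inverse probability, normalised by the number of tokens
uni_perp = pow(uni_prob, -1/len(tokens))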

print("Unigram probability: ", uni_prob, "\nUnigram perplexity: ", uni_perp)

Do the same for the larger n language models. Don't forget to pad the sentence for each case.
What do you observe?

In [None]:
# your code here
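
A minimal sketch of bigram perplexity with add-one (Laplace) smoothing, computed in log space. It assumes raw counts are kept instead of normalised probabilities, and that the perplexity is normalised by the number of predictions, including the end symbol. The names `laplace_prob` and `bigram_perplexity` and the toy corpus are hypothetical. Words outside the toy vocabulary still receive a non-zero probability, but handling them properly would need an explicit unknown-word token.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus standing in for reuters.sents()
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Keep raw bigram and context counts so that smoothing can be applied at query time
bigram_counts = defaultdict(int)
context_counts = defaultdict(int)
vocab = {"</s>"}
for sent in toy_sents:
    padded = ["<s>"] + sent + ["</s>"]
    vocab.update(sent)
    for w1, w2 in zip(padded, padded[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1

def laplace_prob(w1, w2):
    """Add-one estimate of P(w2 | w1); never zero, even for unseen pairs."""
    return (bigram_counts.get((w1, w2), 0) + 1) / (context_counts.get(w1, 0) + len(vocab))

def bigram_perplexity(tokens):
    """Perplexity of a token list under the smoothed bigram model."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    # Summing logs avoids the underflow caused by multiplying many small numbers
    log_prob = sum(math.log(laplace_prob(w1, w2)) for w1, w2 in zip(padded, padded[1:]))
    n_predictions = len(padded) - 1  # one prediction per token, plus </s>
    return math.exp(-log_prob / n_predictions)

seen = bigram_perplexity(["the", "cat", "sat"])
shuffled = bigram_perplexity(["sat", "cat", "the"])
print(seen, shuffled)
```

The sentence that follows the corpus word order gets the lower perplexity, which is the behaviour to look for when comparing models of different *n* on the same text.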


## Infini-gram

[∞-gram](https://infini-gram.io/) is a language model with backoff that can compute unbounded n-gram counts very efficiently. Instead of pre-computing n-gram count tables (which would be very expensive), the infini-gram engine can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency.
- You can read more about infini-gram in [Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens](https://arxiv.org/abs/2401.17377).
- And you can try it out directly in the Hugging Face [infini-gram](https://huggingface.co/spaces/liujch1998/infini-gram) web interface.

Infini-gram includes indexes on several corpora, which can be queried for the purposes below.

Infini-gram can be used via its [API endpoint](https://infini-gram.io/api_doc) or via its [Python package](https://infini-gram.io/pkg_doc) (available for Linux distributions).

Infini-gram includes methods for:
- Counting the number of times an n-gram appears in the corpus.
- Computing the n-gram probability of the last token conditioned on the previous tokens.
- Computing the next-token distribution of an (n-1)-gram.
- Computing the ∞-gram probability of the last token conditioned on the previous tokens, with backoff.
- Computing the next-token distribution of an ∞-gram, with backoff.
- Searching for documents containing n-gram(s).

Let's try out each of these via the [API endpoint](https://infini-gram.io/api_doc).

#### Count an n-gram

In [None]:
import requests

index = 'v4_rpj_llama_s4'
ngram = 'University of Porto'

payload = {
    'index': index,
    'query_type': 'count',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print("The ngram", result.get('tokens'), "appears", result.get('count'), "times in", index)

#### Compute the probability of the last token in an n-gram

In [None]:
ngram = 'University of Porto is the'

payload = {
    'index': index,
    'query_type': 'prob',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print("The ngram", result.get('tokens')[0:-1], "appears", result.get('prompt_cnt'), "times, \nfollowed by", result.get('tokens')[-1], "in", result.get('cont_cnt'), "of those, \nfor a probability of", result.get('prob'))

#### Compute the next-token distribution of an (n-1)-gram

In [None]:
payload = {
    'index': index,
    'query_type': 'ntd',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
for t in result.get('result_by_token_id').values():
    print(f"{t.get('token'):15}", "appears", f"{t.get('cont_cnt'):3}", "times, for a probability of", t.get('prob'))

#### Compute the ∞-gram probability of the last token

Comparing n-gram with ∞-gram probabilities:

In [None]:
ngram = 'since University of Porto is the'

payload = {
    'index': index,
    'query_type': 'prob',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print(result.get('prob'))

In [None]:
payload = {
    'index': index,
    'query_type': 'infgram_prob',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print("The longest found suffix is '", result.get('longest_suffix'), "' and appears", result.get('prompt_cnt'), "times, \nfollowed by", result.get('tokens')[-1], "in", result.get('cont_cnt'), "of those, \nfor a probability of", result.get('prob'))

#### Compute the ∞-gram next-token distribution

Comparing n-gram with ∞-gram next-token distributions:

In [None]:
ngram = 'since University of Porto is the'

payload = {
    'index': index,
    'query_type': 'ntd',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
result.get('result_by_token_id')

In [None]:
payload = {
    'index': index,
    'query_type': 'infgram_ntd',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print("The longest found suffix is '", result.get('longest_suffix'), "' and appears", result.get('prompt_cnt'), "times")
for t in result.get('result_by_token_id').values():
    print(f"{t.get('token'):15}", "appears", f"{t.get('cont_cnt'):3}", "times, for a probability of", t.get('prob'))

#### Search for documents containing n-gram(s)

In [None]:
ngram = 'University of Porto'

payload = {
    'index': index,
    'query_type': 'search_docs',
    'query': ngram,
    'maxnum': 3,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
# Show the status message returned with the search results
print(result.get('message'))

In [None]:
for d in result.get('documents'):
    print(d.get('doc_ix'))
    for s in d.get('spans'):
        print(s[0])

#### Generating text

By exploring ∞-gram next-token distributions, we can create a text generator that takes into account these probabilities. Can you do it?

(Note that by relying on the API endpoint we end up needing to make successive calls to the API, which is pretty slow.)

In [None]:
# your code here
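
A minimal sketch of such a generator, assuming the next-token distribution has the `result_by_token_id` shape used above (entries with `token` and `prob` keys). The functions `sample_token` and `generate` are hypothetical, and so is the `query_ntd` callable: in practice it would wrap the `infgram_ntd` request and return `result.get('result_by_token_id')`. Here it is replaced by an offline stand-in so the sketch runs without network access. Joining tokens with a space is also an assumption; the right way to glue tokens back into text depends on the tokenizer of the index.

```python
import random

def sample_token(result_by_token_id, rng=random):
    """Draw one token string from an ntd-style result, weighted by its 'prob' values."""
    entries = list(result_by_token_id.values())
    weights = [entry['prob'] for entry in entries]
    return rng.choices(entries, weights=weights, k=1)[0]['token']

def generate(prompt, query_ntd, max_tokens=10, rng=random):
    """Extend prompt one token at a time; query_ntd(text) returns a result_by_token_id dict."""
    text = prompt
    for _ in range(max_tokens):
        dist = query_ntd(text)
        if not dist:
            break  # nothing is known to follow this text
        # Assumption: tokens can be re-joined with a single space
        text += ' ' + sample_token(dist, rng)
    return text

# Offline stand-in for the API, keyed by the last word of the query text
fake_ntd = {
    'the': {'1': {'token': 'cat', 'prob': 1.0}},
    'cat': {'2': {'token': 'sat', 'prob': 1.0}},
}

def fake_query(text):
    return fake_ntd.get(text.split()[-1], {})

generated = generate('the', fake_query)
print(generated)
```

Passing the query function as a parameter keeps the sampling logic testable offline and makes it easy to cap the number of API calls through `max_tokens`.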
