Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi, and on the [Infini-gram](https://infini-gram.io/) documentation.

# N-GRAM LANGUAGE MODELS

N-gram language models are based on computing probabilities for the occurrence of each word given *n-1* previous words.
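
Written out, the idea is that the probability of a sequence is approximated by a product of per-word conditionals that look back only *n-1* words, and each conditional can be estimated from corpus counts. The symbols $w_i$ (the i-th word), $m$ (sequence length) and $C(\cdot)$ (corpus count) are notation introduced here for illustration; the count ratio is the usual maximum-likelihood estimate, without any smoothing:

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1}),
\qquad
P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1} \dots w_{i})}{C(w_{i-n+1} \dots w_{i-1})}
```

For a unigram model (*n* = 1) the context is empty, so the denominator is simply the total number of tokens.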

To "train" such models, we will make use of the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling about 1.3 million words.

In [16]:
from nltk.corpus import reuters

We can check how many sentences the corpus contains. Each sentence is a list of tokens.

In [17]:
import nltk
nltk.download('reuters')

print(len(reuters.sents()))

print(reuters.sents()[0])
for w in reuters.sents()[0]:
    print(w, end=' ')

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/davideteixeira/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


54716
['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . 

## Unigram model

For starters, let's build a unigram language model.

In [18]:
from collections import defaultdict

# Create a placeholder for the model
uni_model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        uni_model[w] += 1

Now that we have the counts, we need to transform them into probabilities:

In [19]:
total_count = float(sum(uni_model.values()))
for w in uni_model:
    uni_model[w] /= total_count

#### Likely words

How likely is the word 'the'?

In [20]:
# your code here
# uni_model['the'] holds the probability; multiplying by total_count recovers the raw count
print(uni_model['the'] * total_count)

58251.0


What is the most likely word in the corpus?

In [21]:
# your code here
best_prob = 0
best_word = ""
for word in uni_model:
    if uni_model[word] > best_prob:
        best_prob = uni_model[word]
        best_word = word

# Print the most likely word and its raw count
print(best_word, best_prob * total_count)


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [22]:
import random

# number of words to generate
total_words = 100
text = []

for i in range(total_words):
    # select a random probability threshold
    r = random.random()

    # select word above the probability threshold
    accumulator = .0
    for word in uni_model.keys():
        accumulator += uni_model[word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

Portugal resolved bank in added 5 31 Full an of simply . an special SERVICES " TO after without harvesting 15 fell hardline crowns 78 said qtr Bayou are Industrie in prevent FFED dlrs earnings , - 97 probable UNILEVER crucial ( new of ; the said , at 26 for tons of interest when mln John while , with the nearly in though to 15 ratification session vs L Broadcasting State this on , , / vs withdraw destroyed 244 stock PAYOUT two . for farm said 358 unexpected trade today second GY outflow common the get 8 payable
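
The loop above samples by accumulating probabilities until a random threshold is crossed. As a rough sketch, the same kind of draw can be delegated to `random.choices`, which accepts the probabilities as weights. The `toy_model` dictionary and the `sample_words` helper are made up for illustration and stand in for `uni_model`:

```python
import random

# Hypothetical toy distribution standing in for uni_model (values sum to 1)
toy_model = {"the": 0.5, "said": 0.3, "dlrs": 0.2}

def sample_words(model, k, seed=None):
    """Draw k words independently, each with probability model[word]."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    words = list(model.keys())
    weights = list(model.values())
    return rng.choices(words, weights=weights, k=k)

print(' '.join(sample_words(toy_model, 10, seed=0)))
```

Because every word is drawn independently of its neighbours, the output has realistic word frequencies but no word order, which is why unigram text reads as noise.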


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input left and right and define our own sequence start and sequence end symbols.

We first need to obtain the counts:

In [23]:
from nltk import bigrams

# Create a placeholder for the model
bi_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
count = 0
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bi_model[w1][w2] += 1
        count += 1

As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.

In [24]:
# your code here
from collections import defaultdict
from nltk import bigrams
from nltk.corpus import reuters

# Create a placeholder for the model
bi_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bi_model[w1][w2] += 1

# Convert counts to probabilities
for w1 in bi_model:
    total_count = sum(bi_model[w1].values())  # Total occurrences of w1
    for w2 in bi_model[w1]:
        bi_model[w1][w2] /= total_count  # Normalize

# The model now contains P(w2 | w1)
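
To see the normalization at a scale that can be checked by hand, here is a minimal sketch on a made-up two-sentence corpus. It pads manually instead of calling NLTK, and `toy_sents` and `bigram_mle` are names invented for this example:

```python
from collections import Counter, defaultdict

# Hypothetical two-sentence corpus, already tokenized
toy_sents = [["prices", "rose", "today"], ["prices", "fell", "today"]]

def bigram_mle(sentences, start="<s>", end="</s>"):
    """Return {w1: {w2: P(w2 | w1)}} estimated by relative frequency."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = [start] + list(sent) + [end]
        for w1, w2 in zip(padded, padded[1:]):
            counts[w1][w2] += 1
    model = {}
    for w1, followers in counts.items():
        total = sum(followers.values())  # how often w1 occurs as a context
        model[w1] = {w2: c / total for w2, c in followers.items()}
    return model

toy_bi = bigram_mle(toy_sents)
print(toy_bi["prices"])  # "rose" and "fell" each follow "prices" once, so 0.5 each
```

Each inner dictionary sums to 1, since it is a distribution over the words that follow a given context word.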

#### Likely pairs

What are the probabilities of each word following 'today'?

In [25]:
# your code here
word = "today"

if word in bi_model:
    for w2, prob in bi_model[word].items():
        print(f"P({w2} | {word}) = {prob:.4f}")
else:
    print(f"No occurrences of '{word}' in the bigram model.")

P(. | today) = 0.1864
P(to | today) = 0.0659
P(' | today) = 0.1068
P(and | today) = 0.0250
P(as | today) = 0.0136
P(, | today) = 0.1636
P(with | today) = 0.0076
P(by | today) = 0.0205
P(when | today) = 0.0030
P(on | today) = 0.0114
P(recommended | today) = 0.0008
P(he | today) = 0.0053
P(its | today) = 0.0023
P(for | today) = 0.0189
P(De | today) = 0.0008
P(European | today) = 0.0008
P(described | today) = 0.0008
P(the | today) = 0.0136
P(," | today) = 0.0076
P(they | today) = 0.0015
P(issued | today) = 0.0015
P(being | today) = 0.0008
P(that | today) = 0.0333
P(quoted | today) = 0.0045
P(it | today) = 0.0159
P(." | today) = 0.0038
P(show | today) = 0.0015
P(of | today) = 0.0098
P(at | today) = 0.0288
P(through | today) = 0.0015
P(reported | today) = 0.0152
P(( | today) = 0.0008
P(said | today) = 0.0159
P(in | today) = 0.0189
P(well | today) = 0.0008
P(is | today) = 0.0030
P(a | today) = 0.0038
P(because | today) = 0.0023
P(will | today) = 0.0030
P(are | today) = 0.0015
P(subject | tod

What are the probabilities for sentence-starting words? What do most of them have in common? (Hint: check the *left_pad_symbol* defined above for collecting bigrams.)

In [26]:
# your code here
start_symbol = "<s>"

if start_symbol in bi_model:
    for w2, prob in sorted(bi_model[start_symbol].items(), key=lambda x: x[1], reverse=True):
        print(f"P({w2} | {start_symbol}) = {prob:.4f}")
else:
    print("No sentence-starting words found in the bigram model.")


P(The | <s>) = 0.1615
P(" | <s>) = 0.0656
P(It | <s>) = 0.0323
P(He | <s>) = 0.0290
P(In | <s>) = 0.0252
P(But | <s>) = 0.0193
P(U | <s>) = 0.0158
P(A | <s>) = 0.0140
P(This | <s>) = 0.0082
P(They | <s>) = 0.0082
P(However | <s>) = 0.0072
P(& | <s>) = 0.0057
P(Under | <s>) = 0.0046
P(For | <s>) = 0.0038
P(Analysts | <s>) = 0.0037
P(Last | <s>) = 0.0037
P(On | <s>) = 0.0037
P(As | <s>) = 0.0037
P(At | <s>) = 0.0035
P(Some | <s>) = 0.0032
P(If | <s>) = 0.0032
P(There | <s>) = 0.0031
P(JAPAN | <s>) = 0.0030
P(Earlier | <s>) = 0.0029
P(FED | <s>) = 0.0029
P(1986 | <s>) = 0.0028
P(One | <s>) = 0.0027
P(An | <s>) = 0.0026
P(Terms | <s>) = 0.0025
P(USDA | <s>) = 0.0025
P(EC | <s>) = 0.0024
P(Net | <s>) = 0.0024
P(Asked | <s>) = 0.0023
P(Dealers | <s>) = 0.0021
P(Japan | <s>) = 0.0020
P(Total | <s>) = 0.0020
P(Other | <s>) = 0.0020
P(BANK | <s>) = 0.0020
P(That | <s>) = 0.0019
P(While | <s>) = 0.0019
P(AMERICAN | <s>) = 0.0018
P(Although | <s>) = 0.0017
P(No | <s>) = 0.0017
P(CANADA | <s>) = 0

#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [27]:
import random

text = ["<s>"]

while text[-1] != "</s>":
    current_word = text[-1]
    
    if current_word not in bi_model:
        break

    next_words, probabilities = zip(*bi_model[current_word].items())

    next_word = random.choices(next_words, weights=probabilities, k=1)[0]

    text.append(next_word)
    print(next_word)

print(' '.join([t for t in text if t not in ("<s>", "</s>")]))


Brazil
'
S
.
</s>
Brazil ' S .


## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).

In [28]:
# your code here
from collections import defaultdict
from nltk import trigrams
from nltk.corpus import reuters

# Create a placeholder for the trigram model
tri_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each trigram
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        tri_model[(w1, w2)][w3] += 1  # Store count of w3 given (w1, w2)

# Convert counts to probabilities
for w1_w2 in tri_model:
    total_count = sum(tri_model[w1_w2].values())  # Total occurrences of (w1, w2)
    for w3 in tri_model[w1_w2]:
        tri_model[w1_w2][w3] /= total_count  # Normalize to get probabilities

# The model now contains P(w3 | w1, w2)


#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [33]:
def get_most_likely_words(context, model, top_n=5):
    if context in model:
        sorted_words = sorted(model[context].items(), key=lambda x: x[1], reverse=True)
        print(f"Most likely words following {context}:")
        for w3, prob in sorted_words[:top_n]:  # Show top N most likely words
            print(f"P({w3} | {context}) = {prob:.4f}")
    else:
        print(f"No occurrences of {context} in the trigram model.")

# Check for specific contexts
get_most_likely_words(("today", "the"), tri_model)
get_most_likely_words(("England", "has"), tri_model)

Most likely words following ('today', 'the'):
P(company | ('today', 'the')) = 0.1667
P(price | ('today', 'the')) = 0.1111
P(public | ('today', 'the')) = 0.0556
P(European | ('today', 'the')) = 0.0556
P(Bank | ('today', 'the')) = 0.0556
Most likely words following ('England', 'has'):
P(been | ('England', 'has')) = 0.5000
P(carried | ('England', 'has')) = 0.2500
P(recently | ('England', 'has')) = 0.2500


#### Generating text

Create your text generator based on the trigram model. Does the generated text start to feel a bit more sound?

In [42]:
# your code here
import random

# Seed words: the trigram model needs two tokens of context to get started
text = ["I", "am"]

# Generate text until we find the end-of-sequence symbol
while text[-1] != "</s>":
    # Get the previous two words
    context = (text[-2], text[-1])

    if context not in tri_model:
        break  # Stop if the context is unknown in the trigram model

    # Get the list of possible next words and their probabilities
    next_words, probabilities = zip(*tri_model[context].items())

    # Select a word based on the probability distribution
    next_word = random.choices(next_words, weights=probabilities, k=1)[0]

    text.append(next_word)

# Remove the start and end symbols for readability
generated_text = ' '.join([t for t in text if t not in ("<s>", "</s>")])

print(generated_text)


I am confident that substantial development funds will be asked to continue to rise more than made up by 50 mln dlrs .


## N-gram models

For larger *n*, we can use NLTK's [n-grams](https://www.nltk.org/_modules/nltk/util.html#ngrams), which allows us to choose an arbitrary *n*.

Create your own 4-gram model.

In [43]:
# your code here
from collections import defaultdict
from nltk import ngrams
from nltk.corpus import reuters

# Create a placeholder for the 4-gram model
fourgram_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each 4-gram
for sentence in reuters.sents():
    for w1, w2, w3, w4 in ngrams(sentence, 4, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        fourgram_model[(w1, w2, w3)][w4] += 1

# Convert counts to probabilities
for context in fourgram_model:
    total_count = sum(fourgram_model[context].values())  # Total occurrences of (w1, w2, w3)
    for w4 in fourgram_model[context]:
        fourgram_model[context][w4] /= total_count  # Normalize to get probabilities

# The model now contains P(w4 | w1, w2, w3)
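
The bigram, trigram and 4-gram cells differ only in the context length, so they can be folded into one function parameterized by *n*. The following is a sketch under that assumption, not the notebook's reference solution; `build_ngram_model` is a hypothetical helper that pads with *n-1* symbols on each side, which is how NLTK's padding behaves, and it is shown on a toy corpus so that it runs without downloading Reuters:

```python
from collections import Counter, defaultdict

def build_ngram_model(sentences, n, start="<s>", end="</s>"):
    """Return {context_tuple: {word: P(word | context)}} with n-1 words of context."""
    pad_left = [start] * (n - 1)
    pad_right = [end] * (n - 1)
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = pad_left + list(sent) + pad_right
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1
    # Normalize each context's counts into a probability distribution
    return {
        ctx: {w: c / sum(followers.values()) for w, c in followers.items()}
        for ctx, followers in counts.items()
    }

toy = [["a", "b", "c"], ["a", "b", "d"]]
tri = build_ngram_model(toy, 3)
print(tri[("a", "b")])  # "c" and "d" are equally likely after ("a", "b")
```

With the Reuters corpus, `build_ngram_model(reuters.sents(), 4)` would be the analogous call. Larger *n* gives sharper distributions, but many contexts are then seen only once.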


#### Likely tuples

Check the most likely words following "today the public".

In [44]:
# your code here
def get_most_likely_words_4gram(context, model, top_n=5):
    if context in model:
        sorted_words = sorted(model[context].items(), key=lambda x: x[1], reverse=True)
        print(f"Most likely words following {context}:")
        for w4, prob in sorted_words[:top_n]:  # Show top N most likely words
            print(f"P({w4} | {context}) = {prob:.4f}")
    else:
        print(f"No occurrences of {context} in the 4-gram model.")

# Check for the context "today the public"
get_most_likely_words_4gram(("today", "the", "public"), fourgram_model)

Most likely words following ('today', 'the', 'public'):
P(is | ('today', 'the', 'public')) = 1.0000


#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?

In [55]:
# your code here
import random

# Initial context: two start symbols followed by a seed word
text = ["<s>", "<s>", "Portugal"]  # The 4-gram model conditions on the last three tokens

# Generate text until we find the end-of-sequence symbol
while text[-1] != "</s>":
    context = tuple(text[-3:])  # Use the last 3 words as context (4-gram context)

    if context not in fourgram_model:
        break  # Stop if the context is unknown in the 4-gram model

    # Get the list of possible next words and their probabilities
    next_words, probabilities = zip(*fourgram_model[context].items())

    # Select a word based on the probability distribution
    next_word = random.choices(next_words, weights=probabilities, k=1)[0]

    text.append(next_word)

# Remove the start and end symbols for readability
generated_text = ' '.join([t for t in text if t not in ("<s>", "</s>")])

print(generated_text)

Portugal , which joined the Community in January 1986 .


#### Prompting the model

Ask the user for some input as a prompt, and make your model continue the input text!

In [57]:
# your code here
import random

# Function to generate text based on a user-provided prompt
def generate_text_from_prompt(prompt, model, max_length=50):
    # Preprocess the prompt to split into words and add <s> padding if necessary
    prompt_words = prompt.split()
    prompt_words = ["<s>"] * (3 - len(prompt_words)) + prompt_words  # Ensure there are at least 3 starting words
    
    # Generate text based on the prompt until the end-of-sequence symbol is found or max length is reached
    text = prompt_words

    while text[-1] != "</s>" and len(text) < max_length:
        context = tuple(text[-3:])  # Use the last 3 words as context (4-gram context)

        if context not in model:
            break  # Stop if the context is unknown in the 4-gram model

        # Get the list of possible next words and their probabilities
        next_words, probabilities = zip(*model[context].items())

        # Select a word based on the probability distribution
        next_word = random.choices(next_words, weights=probabilities, k=1)[0]

        text.append(next_word)

    # Remove the start and end symbols for readability
    generated_text = ' '.join([t for t in text if t not in ("<s>", "</s>")])
    
    return generated_text

# Example: Ask the user for a prompt
prompt = input("Enter a prompt to continue: ")

# Generate text based on the 4-gram model
generated_text = generate_text_from_prompt(prompt, fourgram_model)

print("Generated text:")
print(generated_text)


Enter a prompt to continue: Portugal
Generated text:
Portugal ' s Agriculture Minister Alvaro Barreto that the commission accepted over 785 , 000 vs 1 , 156 , 828 assets 61 . 04 billion dlrs of certificates redeemed to date have cost the state - owned Turkish Petroleum Corp to explore for them .


## Perplexity

Now that you have several language models based on different n-gram sizes, you can compare them in terms of perplexity regarding some provided text. Before you do that, however, you might want to apply smoothing to your language models, if you haven't done so already. This will ensure that no sequence gets a probability of zero.
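
One simple option is add-one (Laplace) smoothing, sketched below for bigrams. It needs the raw counts rather than the normalized probabilities, so the counts would have to be kept before the normalization step (the cells above overwrite them). The function name, the toy counts and the vocabulary size are assumptions made for this example:

```python
def laplace_bigram_prob(w1, w2, bigram_counts, vocab_size):
    """Add-one estimate: (C(w1 w2) + 1) / (C(w1) + V)."""
    followers = bigram_counts.get(w1, {})  # .get avoids growing a defaultdict
    context_count = sum(followers.values())
    return (followers.get(w2, 0) + 1) / (context_count + vocab_size)

# Hypothetical raw counts and a vocabulary of 4 word types
toy_counts = {"prices": {"rose": 1, "fell": 1}}
V = 4

print(laplace_bigram_prob("prices", "rose", toy_counts, V))   # seen bigram: (1 + 1) / (2 + 4)
print(laplace_bigram_prob("prices", "today", toy_counts, V))  # unseen bigram: (0 + 1) / (2 + 4)
```

Add-one smoothing moves a lot of probability mass to unseen events when the vocabulary is large, so it is best read as a baseline rather than a recommendation.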

For the unigram model, we first get the probability of a sequence by multiplying the probability of each token, and then apply the perplexity formula.

In [None]:
from nltk import word_tokenize

text = input("Sentence: ")
tokens = word_tokenize(text)

In [None]:
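uni_prob = 1
for token in tokens:
    # .get avoids inserting unseen tokens into the defaultdict
    uni_prob *= uni_model.get(token, 0.0)

# Without smoothing, an unseen token makes the probability zero and the perplexity infinite
uni_perp = pow(uni_prob, -1/len(tokens)) if uni_prob > 0 else float('inf')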

print("Unigram probability: ", uni_prob, "\nUnigram perplexity: ", uni_perp)
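
Multiplying many small probabilities underflows on longer texts, so a common alternative is to sum log-probabilities and exponentiate at the end. The sketch below computes the same quantity, the inverse of the sequence probability raised to the power 1/N for N tokens, in log space. `perplexity` is a hypothetical helper that takes the per-token probabilities from any of the models:

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities, computed in log space."""
    if not token_probs:
        raise ValueError("need at least one token")
    if any(p <= 0.0 for p in token_probs):
        return math.inf  # a zero-probability token makes the whole sequence impossible
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))

# Four tokens, each with probability 0.25: the model is as uncertain as a
# uniform choice among 4 words, so the perplexity is approximately 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

For the unigram model, the input could be built as `[uni_model.get(t, 0.0) for t in tokens]`. For larger *n*, each entry would be the conditional probability of a token given its padded context.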

Do the same for the language models with larger *n*. Don't forget to pad the sentence for each case.
What do you observe?

In [None]:
# your code here


## Infini-gram

[∞-gram](https://infini-gram.io/) is a language model with backoff that can compute unbounded n-gram counts very efficiently. Instead of pre-computing n-gram count tables (which would be very expensive), the infini-gram engine computes ∞-gram probabilities (as well as n-gram probabilities for arbitrary n) with millisecond-level latency.
- You can read more about infini-gram in [Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens](https://arxiv.org/abs/2401.17377).
- And you can try it out directly in the Hugging Face [infini-gram](https://huggingface.co/spaces/liujch1998/infini-gram) web interface.
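
To make the backoff idea concrete, here is a toy sketch that scans a small token list: it uses the longest suffix of the context that occurs in the corpus, and shortens the context only when the current suffix has a zero count. This mirrors the rule described in the paper but not the engine itself, which relies on suffix arrays rather than a linear scan. All names and the toy corpus are made up for illustration:

```python
def count_occurrences(corpus, pattern):
    """Number of positions where `pattern` occurs in `corpus` (both token lists)."""
    n = len(pattern)
    if n == 0:
        return len(corpus)  # empty context: every token position counts
    return sum(1 for i in range(len(corpus) - n + 1) if corpus[i:i + n] == pattern)

def infgram_prob(corpus, context, token):
    """Return (suffix used, its count, continuation count, probability)."""
    for start in range(len(context) + 1):
        suffix = context[start:]  # drop one more leading token on each round
        prompt_cnt = count_occurrences(corpus, suffix)
        if prompt_cnt > 0:
            cont_cnt = count_occurrences(corpus, suffix + [token])
            return suffix, prompt_cnt, cont_cnt, cont_cnt / prompt_cnt
    return [], 0, 0, 0.0  # only reached when the corpus is empty

toy_corpus = "the cat sat on the mat the cat ran".split()
print(infgram_prob(toy_corpus, ["a", "dog", "saw", "the", "cat"], "sat"))
```

In this example the full five-token context never occurs, so the context is shortened until `['the', 'cat']` is found, which appears twice and is followed by `sat` once.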

Infini-gram includes indexes on several corpora, which can be queried for the purposes below.

Infini-gram can be used via its [API endpoint](https://infini-gram.io/api_doc) or via its [Python package](https://infini-gram.io/pkg_doc) (available for Linux distributions).

Infini-gram includes methods for:
- Counting the number of times an n-gram appears in the corpus.
- Computing the n-gram probability of the last token conditioned on the previous tokens.
- Computing the next-token distribution of an (n-1)-gram.
- Computing the ∞-gram probability of the last token conditioned on the previous tokens, with backoff.
- Computing the next-token distribution of an ∞-gram, with backoff.
- Searching for documents containing n-gram(s).

Let's try out each of these via the [API endpoint](https://infini-gram.io/api_doc).

#### Count an n-gram

In [None]:
import requests

index = 'v4_rpj_llama_s4'
ngram = 'University of Porto'

payload = {
    'index': index,
    'query_type': 'count',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print("The ngram", result.get('tokens'), "appears", result.get('count'), "times in", index)
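
Since every query in this section posts a similar payload to the same URL, the call can be wrapped in a small helper. This is a sketch with hypothetical function names. It adds a timeout and an HTTP status check, and it assumes that the API reports problems through an `error` field in the JSON body, which should be checked against the API documentation:

```python
API_URL = 'https://api.infini-gram.io/'

def build_payload(index, query_type, query, **extra):
    """Assemble the JSON body used by the queries in this section."""
    payload = {'index': index, 'query_type': query_type, 'query': query}
    payload.update(extra)  # e.g. maxnum=3 for 'search_docs'
    return payload

def query_infini_gram(payload, timeout=30):
    """POST the payload and return the decoded JSON response."""
    import requests  # imported here so build_payload works without the dependency

    response = requests.post(API_URL, json=payload, timeout=timeout)
    response.raise_for_status()  # surface HTTP-level failures
    result = response.json()
    # Assumption: request-level problems come back in an 'error' field
    if 'error' in result:
        raise RuntimeError(result['error'])
    return result

# Usage (performs a network call):
# result = query_infini_gram(build_payload('v4_rpj_llama_s4', 'count', 'University of Porto'))
# print(result.get('count'))
```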

#### Compute the probability of the last token in an n-gram

In [None]:
ngram = 'University of Porto is the'

payload = {
    'index': index,
    'query_type': 'prob',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print("The ngram", result.get('tokens')[0:-1], "appears", result.get('prompt_cnt'), "times, \nfollowed by", result.get('tokens')[-1], "in", result.get('cont_cnt'), "of those, \nfor a probability of", result.get('prob'))

#### Compute the next-token distribution of an (n-1)-gram

In [None]:
payload = {
    'index': index,
    'query_type': 'ntd',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
for t in result.get('result_by_token_id').values():
    print(f"{t.get('token'):15}", "appears", f"{t.get('cont_cnt'):3}", "times, for a probability of", t.get('prob'))

#### Compute the ∞-gram probability of the last token

Comparing n-gram with ∞-gram probabilities:

In [None]:
ngram = 'since University of Porto is the'

payload = {
    'index': index,
    'query_type': 'prob',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print(result.get('prob'))

In [None]:
payload = {
    'index': index,
    'query_type': 'infgram_prob',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print("The longest found suffix is '", result.get('longest_suffix'), "' and appears", result.get('prompt_cnt'), "times, \nfollowed by", result.get('tokens')[-1], "in", result.get('cont_cnt'), "of those, \nfor a probability of", result.get('prob'))

#### Compute the ∞-gram next-token distribution

Comparing n-gram with ∞-gram next-token distributions:

In [None]:
ngram = 'since University of Porto is the'

payload = {
    'index': index,
    'query_type': 'ntd',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
result.get('result_by_token_id')

In [None]:
payload = {
    'index': index,
    'query_type': 'infgram_ntd',
    'query': ngram,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
print("The longest found suffix is '", result.get('longest_suffix'), "' and appears", result.get('prompt_cnt'), "times")
for t in result.get('result_by_token_id').values():
    print(f"{t.get('token'):15}", "appears", f"{t.get('cont_cnt'):3}", "times, for a probability of", t.get('prob'))

#### Search for documents containing n-gram(s)

In [None]:
ngram = 'University of Porto'

payload = {
    'index': index,
    'query_type': 'search_docs',
    'query': ngram,
    'maxnum': 3,
}
result = requests.post('https://api.infini-gram.io/', json=payload).json()

In [None]:
len(result.get('message'))

In [None]:
for d in result.get('documents'):
    print(d.get('doc_ix'))
    for s in d.get('spans'):
        print(s[0])

#### Generating text

By exploring ∞-gram next-token distributions, we can create a text generator that takes into account these probabilities. Can you do it?

(Note that by relying on the API endpoint we end up needing to make successive calls to the API, which is pretty slow.)

In [None]:
# your code here
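
One possible shape for such a generator is sketched below. The function that fetches a next-token distribution is passed in as a parameter, so the sampling loop can be tried offline with a stub. The names `generate_with_ntd`, `fetch_ntd` and `fake_ntd` are invented for this example, and the handling of the `▁` character assumes SentencePiece-style tokens that mark a leading space this way:

```python
import random

def generate_with_ntd(prompt, fetch_ntd, max_tokens=20, seed=None):
    """Extend `prompt` by sampling from successive next-token distributions.

    `fetch_ntd(text)` must return a dict shaped like the API's
    'result_by_token_id': {token_id: {'token': str, 'prob': float, ...}}.
    """
    rng = random.Random(seed)
    text = prompt
    for _ in range(max_tokens):
        candidates = list(fetch_ntd(text).values())
        weights = [c['prob'] for c in candidates]
        if not candidates or sum(weights) <= 0:
            break  # nothing is known to follow this context
        chosen = rng.choices(candidates, weights=weights, k=1)[0]['token']
        # Assumption: tokens mark a leading space with '▁' (SentencePiece style)
        text += chosen.replace('▁', ' ')
    return text

def fake_ntd(text):
    """Offline stub: a fixed next-token table keyed by the last word."""
    table = {
        'the': {'1': {'token': '▁cat', 'prob': 1.0}},
        'cat': {'2': {'token': '▁sat', 'prob': 1.0}},
    }
    return table.get(text.split()[-1], {})

print(generate_with_ntd('the', fake_ntd))  # → the cat sat
```

Against the real endpoint, `fetch_ntd` would post an `infgram_ntd` query with the current text and return the `result_by_token_id` field of the response, at the cost of one API call per generated token.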
