# N-gram language modeling: Trigram model for Spanish

In this notebook we are going to play with one of the Spanish corpora available from NLTK: CESS_ESP, a corpus built from a collection of Spanish news items.

Let's start by loading the corpus. After that, we'll build a trigram model using the corpus.

In [1]:
import warnings
warnings.filterwarnings("ignore")  # Suppress warnings for cleaner output

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize  # Utilities for sentence and word tokenization
from nltk.util import pad_sequence  # Utility to pad sequences for language modeling
from nltk.lm.preprocessing import flatten  # Utility to flatten nested sequences into a single list

# Download the CESS-ESP corpus (a Spanish corpus) if it has not already been downloaded
nltk.download('cess_esp')  # Downloads the Spanish corpus for training or testing
nltk.download('punkt')  # Downloads the tokenizer model for sentence and word tokenization

from nltk.corpus import cess_esp  # Import the CESS-ESP corpus for use

[nltk_data] Downloading package cess_esp to /root/nltk_data...
[nltk_data]   Unzipping corpora/cess_esp.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# Load the CESS-ESP corpus
corpus = cess_esp.sents() # Load the corpus as a list of tokenized sentences

In [None]:
# Let's look at the first sentence
corpus[0]

['El',
 'grupo',
 'estatal',
 'Electricité_de_France',
 '-Fpa-',
 'EDF',
 '-Fpt-',
 'anunció',
 'hoy',
 ',',
 'jueves',
 ',',
 'la',
 'compra',
 'del',
 '51_por_ciento',
 'de',
 'la',
 'empresa',
 'mexicana',
 'Electricidad_Águila_de_Altamira',
 '-Fpa-',
 'EAA',
 '-Fpt-',
 ',',
 'creada',
 'por',
 'el',
 'japonés',
 'Mitsubishi_Corporation',
 'para',
 'poner_en_marcha',
 'una',
 'central',
 'de',
 'gas',
 'de',
 '495',
 'megavatios',
 '.']

In [None]:
# Let's look at some sentences in the corpus
for i in range(101,104):  # Print three
    print(f"Sentence {i + 1}: {' '.join(corpus[i])}")  # Reuse the already loaded corpus

Sentence 102: Se puso como objetivo consolidar las democracias y fortalecer las instituciones y la cultura democrática .
Sentence 103: La declaración propugnó trabajar por la igualdad social y consolidar las bases socioeconómicas para posibilitar una democracia integral , así_como asumir las oportunidades que ofrecen la globalización del comercio .
Sentence 104: La gobernabilidad requiere , según la Declaración , transformaciones sociales , económicas y culturales profundas que conduzcan a disminuir las desigualdades y la exclusión social , y a superar la pobreza , la corrupción , el terrorismo , el narcotráfico , el lavado de dinero y cualquier forma de delincuencia organizada .


In [None]:
# How many sentences in the corpus?
len(corpus)

6030

<p>
  We'll do some simple preprocessing: lowercase the words, drop punctuation and purely numeric tokens,
  and pad each sentence with <code>[BOS]</code> and <code>[EOS]</code> symbols.

  Since we're building a trigram model, each sentence is padded with two <code>[BOS]</code>
  symbols at the beginning and two <code>[EOS]</code> symbols at the end.

  A single <code>[EOS]</code> would suffice, but NLTK's <code>pad_sequence</code> adds
  <code>n - 1</code> symbols on every side where padding is enabled. For simplicity, we'll keep
  the extra <code>[EOS]</code>; it does not change the probabilities estimated for real words.
</p>
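To make the padding concrete, here is a minimal, dependency-free sketch of what it does for `n=3`. The helper `pad_sentence` is a hypothetical stand-in written for illustration, not NLTK's own implementation, and the short sentence is a made-up fragment.

```python
def pad_sentence(tokens, n=3, bos="[BOS]", eos="[EOS]"):
    """Hypothetical helper: add n-1 boundary symbols on each side of a token list."""
    return [bos] * (n - 1) + list(tokens) + [eos] * (n - 1)


padded = pad_sentence(["se", "puso", "como", "objetivo"])
print(padded)

# Slide a window of three tokens over the padded sentence
trigrams = list(zip(padded, padded[1:], padded[2:]))
print(trigrams[0])   # the first real word is predicted from two [BOS] symbols
print(trigrams[-1])  # the extra trigram introduced by the second [EOS]
```

With two `[BOS]` symbols, even the first word of a sentence has a full two-token context, which is why the left padding matters more than the right.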


In [None]:
# (1) Preprocess the text: lowercase, drop punctuation and numbers, and pad each sentence

# Normalize the corpus: lowercase all tokens and keep only alphabetic tokens or those containing underscores (multiword units)
normalized_corpus = [
    [word.lower() for word in sentence if word.isalpha() or "_" in word]
    for sentence in corpus
]

# Pad each sentence with 2 [BOS] (for trigram model) at the beginning and 2 [EOS] at the end
padded_corpus = [
    list(pad_sequence(
        sentence,
        pad_left=True, left_pad_symbol='[BOS]',  # Add 2 [BOS] symbols for initial padding
        pad_right=True, right_pad_symbol='[EOS]',  # Add 2 [EOS] symbols to mark the end of the sentence
        n=3  # Trigram model: n=3 pads n-1 = 2 symbols on each side
    ))
    for sentence in normalized_corpus
]

# Flatten the padded corpus into a single list of tokens for language modeling
flat_corpus = list(flatten(padded_corpus))

Let's see what the preprocessed text looks like.

In [None]:
import random
random.seed(42)

# (2) Explore the corpus
print("Sample sentences:")
for i in range(101,104):
    print(f"Sentence {i + 1}: {' '.join(padded_corpus[i])}")

Sample sentences:
Sentence 102: [BOS] [BOS] se puso como objetivo consolidar las democracias y fortalecer las instituciones y la cultura democrática [EOS] [EOS]
Sentence 103: [BOS] [BOS] la declaración propugnó trabajar por la igualdad social y consolidar las bases socioeconómicas para posibilitar una democracia integral así_como asumir las oportunidades que ofrecen la globalización del comercio [EOS] [EOS]
Sentence 104: [BOS] [BOS] la gobernabilidad requiere según la declaración transformaciones sociales económicas y culturales profundas que conduzcan a disminuir las desigualdades y la exclusión social y a superar la pobreza la corrupción el terrorismo el narcotráfico el lavado de dinero y cualquier forma de delincuencia organizada [EOS] [EOS]


Let's get an idea of the size of the corpus and the vocabulary we're working with.

In [None]:
# Note: both numbers include the [BOS] and [EOS] padding symbols
vocab = set(flat_corpus)
print("Vocabulary size:", len(vocab))
print("Number of words in the corpus:", len(flat_corpus))

Vocabulary size: 23540
Number of words in the corpus: 186781


We'll implement a simple trigram language model. To estimate word probabilities under the model, we need the bigram and trigram counts in the corpus. NLTK's `ngrams` function extracts the bigrams and trigrams from a sentence.

In [None]:
from nltk.util import ngrams
from nltk import FreqDist

# nltk's ngrams function can be used to get bigrams and trigrams
sentence = padded_corpus[101]
print("Sentence:",' '.join(sentence))
# bigrams
print("Bigrams:",list(ngrams(sentence, 2)))
# trigrams
print("Trigrams:", list(ngrams(sentence, 3)))

Sentence: [BOS] [BOS] se puso como objetivo consolidar las democracias y fortalecer las instituciones y la cultura democrática [EOS] [EOS]
Bigrams: [('[BOS]', '[BOS]'), ('[BOS]', 'se'), ('se', 'puso'), ('puso', 'como'), ('como', 'objetivo'), ('objetivo', 'consolidar'), ('consolidar', 'las'), ('las', 'democracias'), ('democracias', 'y'), ('y', 'fortalecer'), ('fortalecer', 'las'), ('las', 'instituciones'), ('instituciones', 'y'), ('y', 'la'), ('la', 'cultura'), ('cultura', 'democrática'), ('democrática', '[EOS]'), ('[EOS]', '[EOS]')]
Trigrams: [('[BOS]', '[BOS]', 'se'), ('[BOS]', 'se', 'puso'), ('se', 'puso', 'como'), ('puso', 'como', 'objetivo'), ('como', 'objetivo', 'consolidar'), ('objetivo', 'consolidar', 'las'), ('consolidar', 'las', 'democracias'), ('las', 'democracias', 'y'), ('democracias', 'y', 'fortalecer'), ('y', 'fortalecer', 'las'), ('fortalecer', 'las', 'instituciones'), ('las', 'instituciones', 'y'), ('instituciones', 'y', 'la'), ('y', 'la', 'cultura'), ('la', 'cultura', 

Now we build two counters, one for bigrams and one for trigrams. We go over the sentences of the corpus, and update bigram and trigram counts as we encounter new occurrences.

In [None]:
from nltk.util import ngrams

# Initialize bigram and trigram counts as dictionaries
bigram_counts = {}  # Dictionary to store the frequency of each bigram
trigram_counts = {}  # Dictionary to store the frequency of each trigram

# Iterate over sentences to get bigram and trigram counts
for sentence in padded_corpus:
    # Generate bigrams and trigrams from the sentence
    bigrams = list(ngrams(sentence, 2))  # Create all bigrams (2-word sequences) in the sentence
    trigrams = list(ngrams(sentence, 3))  # Create all trigrams (3-word sequences) in the sentence

    # Update bigram counts
    for bigram in bigrams:
        # Increment the count for this bigram, initializing it to 0 if not already in the dictionary
        bigram_counts[bigram] = bigram_counts.get(bigram, 0) + 1

    # Update trigram counts
    for trigram in trigrams:
        # Increment the count for this trigram, initializing it to 0 if not already in the dictionary
        trigram_counts[trigram] = trigram_counts.get(trigram, 0) + 1

# Sort and display the bigram counts
print("Sorted bigram counts:")
# Sort bigrams by frequency in descending order and convert to a dictionary for display
sorted_bigram_counts = dict(sorted(bigram_counts.items(), key=lambda item: item[1], reverse=True))
print(sorted_bigram_counts)

# Sort and display the trigram counts
print("\nSorted trigram counts:")
# Sort trigrams by frequency in descending order and convert to a dictionary for display
sorted_trigram_counts = dict(sorted(trigram_counts.items(), key=lambda item: item[1], reverse=True))
print(sorted_trigram_counts)

Sorted bigram counts:
{('[BOS]', '[BOS]'): 6030, ('[EOS]', '[EOS]'): 6030, ('de', 'la'): 1512, ('en', 'el'): 886, ('[BOS]', 'el'): 808, ('de', 'los'): 789, ('en', 'la'): 737, ('a', 'la'): 522, ('[BOS]', 'la'): 502, ('que', 'se'): 418, ('de', 'las'): 412, ('que', 'el'): 379, ('lo', 'que'): 317, ('que', 'la'): 302, ('a', 'los'): 300, ('[BOS]', 'en'): 298, ('en', 'los'): 264, ('y', 'el'): 260, ('por', 'el'): 251, ('con', 'el'): 230, ('y', 'la'): 227, ('[BOS]', 'los'): 223, ('de', 'que'): 219, ('que', 'no'): 214, ('de', 'un'): 214, ('por', 'la'): 206, ('en', 'las'): 199, ('de', 'su'): 191, ('con', 'la'): 185, ('en', 'su'): 181, ('millones', 'de'): 170, ('a', 'las'): 160, ('en', 'un'): 158, ('de', 'una'): 157, ('y', 'que'): 155, ('que', 'los'): 148, ('[BOS]', 'pero'): 147, ('en', 'una'): 144, ('no', 'se'): 143, ('para', 'el'): 140, ('que', 'en'): 136, ('la', 'que'): 132, ('a', 'su'): 131, ('el', 'presidente'): 129, ('el', 'que'): 129, ('y', 'de'): 118, ('[BOS]', 'no'): 118, ('se', 'ha'): 11
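Printing the full dictionaries produces a lot of output. As an alternative, here is a small sketch that uses `collections.Counter` from the standard library, whose `most_common` method returns only the top entries. The toy corpus and the helper `ngrams_of` are made up for illustration, so the block runs without NLTK.

```python
from collections import Counter


def ngrams_of(tokens, n):
    """Hypothetical helper: yield consecutive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))


# Two tiny, invented padded sentences standing in for padded_corpus
toy_corpus = [
    ["[BOS]", "[BOS]", "el", "rey", "habló", "[EOS]", "[EOS]"],
    ["[BOS]", "[BOS]", "el", "rey", "llegó", "[EOS]", "[EOS]"],
]

toy_bigram_counts = Counter()
for sent in toy_corpus:
    # Counter.update adds one to the count of every bigram it receives
    toy_bigram_counts.update(ngrams_of(sent, 2))

# Show only the three most frequent bigrams instead of the whole table
print(toy_bigram_counts.most_common(3))
```

A `Counter` is also a `dict`, so lookups such as `toy_bigram_counts[("el", "rey")]` work the same way as with the plain dictionaries used in this notebook.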

Let's have a look at the count for some specific bigrams and trigrams.

In [None]:
# Example of accessing the count for a specific bigram
example_bigram = ('el', 'rey')
print(f"Count for bigram {example_bigram} is: {bigram_counts[example_bigram]}")
# Example of accessing the count for a specific trigram
example_trigram = ('es', 'lo', 'que')
print(f"Count for trigram {example_trigram} is: {trigram_counts[example_trigram]}")

Count for bigram ('el', 'rey') is: 11
Count for trigram ('es', 'lo', 'que') is: 14
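
These counts are the ingredients of the model. One standard way to turn them into probabilities, and the one the function below appears to follow, is the maximum-likelihood estimate. Here $C(\cdot)$ denotes a count in the padded corpus; the notation is introduced for this note and is not part of the notebook.

```latex
P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3)}{C(w_1, w_2)}
```

For example, with $C(\text{el}, \text{rey}) = 11$ as printed above, a continuation seen twice after "el rey" would get probability $2/11 \approx 0.18$, and a continuation never seen after it would get probability 0. The estimate is undefined when $C(w_1, w_2) = 0$.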


In [None]:
def suggest_next_word(input_, bigram_counts, trigram_counts, vocab):
    """
    Suggests the most probable next words based on a given input sentence using bigram and trigram counts.

    Parameters:
        input_ (list): A tokenized list of words representing the input sentence.
        bigram_counts (dict): Dictionary containing bigram counts.
        trigram_counts (dict): Dictionary containing trigram counts.
        vocab (set): Set of all possible words in the vocabulary.

    Raises:
        ValueError: If the bigram count for the last bigram is zero.

    Returns:
        list: A sorted list of (word, probability) tuples in descending order of probability.
    """
    # Pad the input to ensure it works for trigram generation
    padded_input = list(pad_sequence(input_, pad_left=True, left_pad_symbol='[BOS]', n=3))
    last_bigram = tuple(padded_input[-2:])  # Extract the last bigram from the padded input

    # Get the count of the last bigram once to avoid repeated lookups
    last_bigram_count = bigram_counts.get(last_bigram, 0)
    if last_bigram_count == 0:
        # Raise an error if the bigram count is zero
        raise ValueError(f"Bigram {last_bigram} not found in the corpus. Unable to calculate probabilities.")

    # Compute the probability of every vocabulary word following the last bigram (unseen trigrams get probability 0)
    vocab_probabilities = {
        vocab_word: trigram_counts.get(last_bigram + (vocab_word,), 0) / last_bigram_count
        for vocab_word in vocab
    }

    # Sort and return the top suggestions
    return sorted(vocab_probabilities.items(), key=lambda x: x[1], reverse=True)

In [None]:
suggest_next_word(["las", "bases"], bigram_counts, trigram_counts, vocab)[:3]

[('socioeconómicas', 0.3333333333333333),
 ('de', 0.3333333333333333),
 ('en', 0.3333333333333333)]

In [None]:
suggest_next_word(["[BOS]","[BOS]"], bigram_counts, trigram_counts, vocab)[:5]

[('el', 0.13399668325041458),
 ('la', 0.08325041459369818),
 ('en', 0.0494195688225539),
 ('los', 0.036981757877280266),
 ('pero', 0.02437810945273632)]

In [None]:
suggest_next_word(["el", "rey"], bigram_counts, trigram_counts, vocab)[:5]

[('y', 0.18181818181818182),
 ('felipe_ii', 0.09090909090909091),
 ('tuvo', 0.09090909090909091),
 ('ha', 0.09090909090909091),
 ('juan_carlos', 0.09090909090909091)]

In [None]:
suggest_next_word(["la", "empresa", "de"], bigram_counts, trigram_counts, vocab)[:4]

[('sondeos', 0.25),
 ('santa_bárbara', 0.25),
 ('elaborados', 0.25),
 ('internet', 0.25)]

In [None]:
def generate_text(initial_sentence, bigram_counts, trigram_counts, vocab, n):
    """
    Generate text using a trigram language model with bigram and trigram counts.

    Parameters:
        initial_sentence (list): A tokenized list of words to start the generation.
        bigram_counts (dict): Dictionary containing bigram counts.
        trigram_counts (dict): Dictionary containing trigram counts.
        vocab (set): Set of all possible words in the vocabulary.
        n (int): Maximum length of the generated text in tokens, including the initial sentence.

    Returns:
        list: A list of tokens representing the generated text.
    """
    # Start with the initial sentence as the basis for generation
    generated_text = initial_sentence.copy()

    # Continue generating words until the desired length is reached
    while len(generated_text) < n:
        # Get word suggestions based on the current context
        suggestions = suggest_next_word(generated_text, bigram_counts, trigram_counts, vocab)

        # Select the most probable word (greedy decoding: pick the token with the highest probability)
        next_word, _ = suggestions[0]  # The first item in the sorted list has the highest probability

        # Append the chosen word to the generated text
        generated_text.append(next_word)

        # Stop generation if the end-of-sentence token is generated
        if next_word == "[EOS]":
            break

    return generated_text


In [None]:
initial_sentence = ["la"]
generated_text = generate_text(initial_sentence, bigram_counts, trigram_counts, vocab, 50)
print("Generated Text:", " ".join(generated_text))

Generated Text: la decisión de prohibir el sobrevuelo de aeronaves norteamericanas en las que se ha convertido en una de las elecciones generales celebradas ayer domingo la prensa local [EOS]


In [None]:
# Example
initial_sentence = ["las", "bases", "socioeconómicas"]
generated_text = generate_text(initial_sentence, bigram_counts, trigram_counts, vocab, 30)
print("Generated Text:", " ".join(generated_text))

Generated Text: las bases socioeconómicas para posibilitar una democracia estable abierta y sin glamour nunca se le propongan y a los que se ha convertido en una de las elecciones generales celebradas


Sometimes we still run into **zero-count** issues!

In [None]:
initial_sentence = ["rey", "español"]
generated_text = generate_text(initial_sentence, bigram_counts, trigram_counts, vocab, 20)
print("Generated Text:", " ".join(generated_text))

ValueError: Bigram ('rey', 'español') not found in the corpus. Unable to calculate probabilities.
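
One possible way around this error is to fall back to a shorter context when the full one was never seen. The sketch below is a deliberately simple fallback scheme, not Katz backoff and not the method used in this notebook: it tries the two-word context, then the last word only, then plain unigram frequencies. All names (`count_ngrams`, `backoff_distribution`, the toy sentences) are hypothetical, and the unigram counts are an extra table that the notebook does not build.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count n-token tuples over a list of padded sentences."""
    counts = Counter()
    for sent in sentences:
        counts.update(zip(*(sent[i:] for i in range(n))))
    return counts


def backoff_distribution(context, unigrams, bigrams, trigrams):
    """Return {word: probability} using the longest context seen in training."""
    w1, w2 = context[-2], context[-1]

    # 1) Full two-word context: normalise over the observed continuations
    seen = {t[2]: c for t, c in trigrams.items() if t[:2] == (w1, w2)}
    if not seen:
        # 2) Back off to the last word only
        seen = {b[1]: c for b, c in bigrams.items() if b[0] == w2}
    if not seen:
        # 3) Back off to unigram frequencies, never predicting [BOS]
        seen = {u[0]: c for u, c in unigrams.items() if u[0] != "[BOS]"}

    total = sum(seen.values())
    return {word: c / total for word, c in seen.items()}


# Invented toy data standing in for padded_corpus
toy = [
    ["[BOS]", "[BOS]", "el", "rey", "habló", "[EOS]", "[EOS]"],
    ["[BOS]", "[BOS]", "el", "presidente", "español", "habló", "[EOS]", "[EOS]"],
]
unigrams, bigrams, trigrams = (count_ngrams(toy, n) for n in (1, 2, 3))

# ('rey', 'español') never occurs, so the sketch backs off to 'español' alone
dist = backoff_distribution(["rey", "español"], unigrams, bigrams, trigrams)
print(dist)

# Neither word is known, so the sketch falls back to unigram frequencies
fallback = backoff_distribution(["foo", "bar"], unigrams, bigrams, trigrams)
print(fallback)
```

Scanning every n-gram on each call is slow on a real corpus; indexing the counts by context would be the natural next step. Smoothing (for example, add-k) is a different option that keeps the full context but reserves some probability for unseen words.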

The generated texts above also show problems with **long-range dependencies**: a trigram model conditions each word on only the two preceding words, which is too little context to keep a whole sentence coherent.

Still, it's pretty good compared to random text generation. Some local dependencies and grammatical patterns have been learned. Here's what random text looks like.

In [None]:
random.seed()
print("Generated Text:", " ".join(random.sample(list(vocab), k=15)))

Generated Text: aplauso ariel_grazziani riverside espléndido cimas süddeutsche_zeitung cristo carreteras real remolinos atribuirles sigo muchacha orgánicos suministro


## Possible improvements

* Handling zero counts in the model (smoothing or backoff)
* A better text generation strategy (for example, sampling instead of greedy decoding)
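
As an illustration of the second item, here is a minimal sketch of sampling the next word in proportion to its probability instead of always taking the top suggestion. It assumes a list of `(word, probability)` pairs like the one `suggest_next_word` returns; the function name `sample_next_word` and the toy numbers are hypothetical.

```python
import random


def sample_next_word(suggestions, rng=random):
    """Sample one word from (word, probability) pairs, weighted by probability."""
    # Words with zero probability can never be drawn, so drop them up front
    candidates = [(word, p) for word, p in suggestions if p > 0]
    if not candidates:
        raise ValueError("No candidate has non-zero probability.")
    words, weights = zip(*candidates)
    # random.choices draws with replacement according to the given weights
    return rng.choices(words, weights=weights, k=1)[0]


# Invented suggestions; the probabilities do not need to sum to 1
toy_suggestions = [("y", 2 / 11), ("felipe_ii", 1 / 11), ("tuvo", 1 / 11), ("el", 0.0)]

rng = random.Random(0)  # Fixed seed so the run is repeatable
samples = [sample_next_word(toy_suggestions, rng) for _ in range(1000)]
print({word: samples.count(word) for word in set(samples)})
```

Inside `generate_text`, replacing the greedy choice `suggestions[0]` with a call like this would make repeated runs produce different continuations, at the cost of occasionally picking unlikely words.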