### Explanation of the Trigram Language Model Code

#### 1. Overview

The code implements a **trigram language model** for predicting the next word given a context of two words. In an n-gram model, we assume that the probability of a word depends only on the preceding $n-1$ words. For a trigram model (where $n = 3$), the probability of a word $w_i$ given the two preceding words $w_{i-2}$ and $w_{i-1}$ is estimated as:

$$
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})},
$$

where  
- $\text{Count}(w_{i-2}, w_{i-1}, w_i)$ is the frequency of the trigram (three-word sequence) in the corpus, and  
- $\text{Count}(w_{i-2}, w_{i-1})$ is the number of times the two-word context is followed by any word in the corpus, which is the denominator the code computes.
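As a worked instance of the formula above, the following minimal sketch computes the estimate on a tiny made-up token list. The names `toy_tokens`, `trigram_counts`, `context_counts`, and `trigram_probability` are assumptions for illustration and do not appear in the notebook.

```python
from collections import Counter

# Toy corpus (an assumption for illustration, not the notebook's corpus).
toy_tokens = "the knight rode and the knight slept and the knight rode".split()

# Count every trigram, and every two-word context that begins a trigram.
trigram_counts = Counter(zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]))
context_counts = Counter(zip(toy_tokens, toy_tokens[1:-1]))

def trigram_probability(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for an unseen context."""
    context_total = context_counts[(w1, w2)]
    if context_total == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context_total

p_rode = trigram_probability("the", "knight", "rode")    # 2 of 3 occurrences
p_slept = trigram_probability("the", "knight", "slept")  # 1 of 3 occurrences
print(p_rode, p_slept)
```

Here the context `("the", "knight")` occurs three times and is followed by `rode` twice, so the estimate for `rode` is 2/3.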

This model supports two main tasks:  
1. **Next-Word Prediction:** Given a two-word context, the model predicts the most likely next word.  
2. **Sentence Completion:** Starting with an initial phrase, the model iteratively predicts and appends words until a desired sentence length is reached or no prediction is available.

---

#### 2. Detailed Breakdown

**Step 1: Preprocessing the Corpus**

- **Corpus Definition:**  
  The corpus is a short narrative of ten sentences, defined inline as a multi-line string. It gives the model several distinct contexts to count, although it is far too small for the estimates to generalize beyond the text itself.

- **Tokenization:**  
  The `tokenize` function performs the following:
  - Converts the text to lowercase (to ensure consistency),
  - Removes every character that is not a letter or whitespace, which drops punctuation and digits and also deletes apostrophes, so `knight's` becomes `knights`, and
  - Splits the text into individual tokens (words).

  This preprocessing step prepares the text for building the n-gram counts.
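The three tokenization steps can be sketched on a single sentence. The function below mirrors the notebook's `tokenize` but is named `tokenize_sketch` to mark it as an illustrative copy, and the sample sentence is made up.

```python
import re

def tokenize_sketch(text):
    """Lowercase, strip everything except a-z and whitespace, then split."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # apostrophes and digits vanish here
    return text.split()

sample_tokens = tokenize_sketch("The knight's legacy, 3 years on!")
print(sample_tokens)
```

The possessive `knight's` collapses into `knights`, so the model treats it as a different word from `knight`. That may or may not be desirable depending on the application.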

**Step 2: Building Trigram Counts**

- **Trigram Definition:**  
  A trigram is a sequence of three consecutive words, written as $(w_{i-2}, w_{i-1}, w_i)$.  
  The code iterates over the tokenized corpus and, for each window of three words, treats the first two words as the context and the third word as the target.

- **Counting Occurrences:**  
  A Python dictionary (specifically, a `defaultdict` of `Counter` objects) is used to record how many times each two-word context is followed by a specific next word. These counts form the basis for estimating conditional probabilities.

- **Vocabulary Extraction:**  
  The vocabulary is the set of all unique words in the corpus. Sorting the vocabulary alphabetically simplifies display and analysis.
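The counting step can be sketched on a short made-up token list. This version slides the window with `zip` rather than an index loop, which is a stylistic assumption and not how the notebook writes it. The names `toy_tokens`, `toy_counts`, and `toy_vocabulary` are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up tokens for illustration only.
toy_tokens = ["the", "knight", "rode", "the", "knight", "slept",
              "the", "knight", "rode"]

# Map each two-word context to a Counter of the words that follow it.
toy_counts = defaultdict(Counter)
for first, second, third in zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]):
    toy_counts[(first, second)][third] += 1

# The vocabulary is the sorted set of unique tokens.
toy_vocabulary = sorted(set(toy_tokens))

print(dict(toy_counts[("the", "knight")]))
print(toy_vocabulary)
```

A list of nine tokens yields seven trigram windows. Here they fall into five distinct contexts.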

**Step 3: Defining Prediction Functions**

- **Next-Word Prediction (`predict_next_word`):**  
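  This function takes a context (a tuple of two words) and returns the most frequent word that follows that context. In effect, it chooses the word $w$ that maximizes $\text{Count}(c, w)$, where $c$ is the context. If several words tie, `Counter.most_common` returns the one that was counted first; if the context never occurs in the corpus, the function returns `None`.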

- **Full Probability Distribution (`get_full_next_word_probabilities`):**  
  For a given context $c$, this function computes the probability of each word $w$ in the vocabulary being the next word, using:

  $$
  P(w \mid c) = \begin{cases}
  \displaystyle \frac{\text{Count}(c, w)}{\sum_{w'} \text{Count}(c, w')}, & \text{if } c \text{ is found}, \\
  0, & \text{otherwise}.
  \end{cases}
  $$

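  For a context that occurs in the corpus, the values sum to 1 and show the likelihood of every possible next word. For an unseen context every value is 0, so the result is not a valid probability distribution; it only signals that the model has no information about that context.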
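Both prediction functions can be sketched against a hand-written count table. The counts below are made up for illustration and are not the notebook's actual counts, and `most_likely` and `distribution` are hypothetical stand-ins for `predict_next_word` and `get_full_next_word_probabilities`.

```python
from collections import Counter

# Hand-written counts (assumed values, not taken from the notebook's corpus).
toy_counts = {("the", "knight"): Counter({"was": 2, "would": 1, "journeyed": 1})}
toy_vocabulary = ["journeyed", "knight", "the", "was", "would"]

def most_likely(context):
    """Return the most frequent follower of the context, or None if unseen."""
    followers = toy_counts.get(tuple(context))
    return followers.most_common(1)[0][0] if followers else None

def distribution(context):
    """Return P(w | context) for every vocabulary word; all zeros if unseen."""
    followers = toy_counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return {w: (followers[w] / total if total else 0.0) for w in toy_vocabulary}

seen = distribution(("the", "knight"))
unseen = distribution(("a", "dragon"))
print(most_likely(("the", "knight")), seen)
print(most_likely(("a", "dragon")), sum(unseen.values()))
```

The seen context yields values that sum to 1, while the unseen one yields all zeros. This is why callers should check for `None` or an all-zero result before relying on a prediction.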

**Step 4: Demonstrating the Probability Distributions**

- The code tests multiple sample contexts (for example, `('the', 'knight')`, `('once', 'upon')`, etc.) and prints the complete probability distribution for each.
- The distributions are sorted in descending order of probability, clearly showing which words are most likely to follow each context.
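The sorting step can be sketched as follows, with made-up probabilities. Python's `sorted` is stable, so words with equal probability keep their original (alphabetical) order. The optional filter that hides zero-probability words is an assumption added here for readability and is not part of the notebook.

```python
# Made-up distribution for illustration; keys are in alphabetical order.
toy_distribution = {"a": 0.0, "journeyed": 0.25, "was": 0.5, "would": 0.25}

# Sort by descending probability; ties stay in their original order.
ranked = sorted(toy_distribution.items(), key=lambda item: -item[1])

# Optional: keep only words that were actually observed after the context.
nonzero = [(word, prob) for word, prob in ranked if prob > 0]

for rank, (word, prob) in enumerate(nonzero, 1):
    print(f"{rank}. Word: '{word}' - Probability: {prob:.3f}")
```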

**Step 5: Sentence Completion**

- **Functionality:**  
  The `complete_sentence` function takes a starting sentence and repeatedly predicts the next word using the trigram model until either a maximum word count is reached or no further prediction is available.

- **Process:**  
  1. The input sentence is tokenized.
  2. The last two tokens are used as the current context.
  3. The model predicts the next word, which is then appended to the sentence.
  4. This process repeats until the sentence reaches the desired length or the current context has no recorded continuation.

This function demonstrates how the trigram model can generate extended text by iteratively applying the learned n-gram statistics.
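Because each step always takes the single most frequent follower, the completion is deterministic and can fall into a repeating cycle. The sketch below illustrates both stopping conditions with hand-written counts. `greedy_complete` and `toy_counts` are hypothetical names, and the counts are assumed values.

```python
from collections import Counter

# Hand-written counts (assumed) that happen to form a cycle.
toy_counts = {
    ("the", "knight"): Counter({"rode": 2, "slept": 1}),
    ("knight", "rode"): Counter({"the": 1}),
    ("rode", "the"): Counter({"knight": 1}),
}

def greedy_complete(start_tokens, counts, max_words=8):
    """Append the most frequent follower until max_words or a dead end."""
    tokens = list(start_tokens)
    while len(tokens) < max_words:
        followers = counts.get(tuple(tokens[-2:]))
        if not followers:  # unseen context: nothing to predict
            break
        tokens.append(followers.most_common(1)[0][0])
    return tokens

looped = greedy_complete(["the", "knight"], toy_counts)
stopped = greedy_complete(["knight", "slept"], toy_counts)
print(" ".join(looped))
print(" ".join(stopped))
```

The first call repeats `the knight rode` until the length cap ends it, and the second stops immediately because its context was never counted. The `max_words` limit is therefore what guarantees termination.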

---

#### 3. Summary

- **Preprocessing:**  
  The corpus is standardized by converting to lowercase, cleaning punctuation, and tokenizing into words, ensuring consistent input for the model.

- **Trigram Model:**  
  The model counts occurrences of trigrams and estimates the probability of a word $w_i$ following a two-word context $(w_{i-2}, w_{i-1})$ using:

  $$
  P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})}.
  $$

- **Prediction Functions:**  
  Two functions are defined: one to predict the most likely next word and another to compute the full probability distribution over the vocabulary for a given context.

- **Sentence Generation:**  
  The sentence completion function leverages the trigram model to generate longer text by iteratively predicting and appending words based on the current context.


# Import necessary libraries
import re
from collections import defaultdict, Counter

#################################################################
# Step 1: Define a Larger Corpus and Preprocessing Function
#################################################################

# Here we define a larger corpus with multiple sentences and paragraphs.
# This corpus contains a short narrative with various contexts and styles.
corpus = """
Once upon a time in a land far away, there lived a brave knight.
The knight was known for his valor and courage, and many stories of his adventures were told throughout the kingdom.
The knight journeyed through mysterious forests, fought fierce dragons, and helped those in need.
His story was an inspiration to all who heard it.
In a small village, the people would often speak of the knight's exploits.
The villagers believed that one day, the knight would return to save them from impending doom.
Meanwhile, the kingdom thrived with tales of heroism and hope.
The knight's legacy continued to grow as his adventures were recounted by bards and storytellers.
As time passed, legends and myths were born from his deeds.
Every town had a story, and every child dreamed of a hero like him.
"""

def tokenize(text):
    """
    Preprocess the text by:
      1. Converting to lowercase.
      2. Removing punctuation and non-letter characters.
      3. Splitting the text into individual words (tokens).
    """
    text = text.lower()                              # Convert text to lowercase
    text = re.sub(r'[^a-z\s]', '', text)              # Remove punctuation and non-letter characters
    tokens = text.split()                            # Split text into tokens based on whitespace
    return tokens

# Tokenize the corpus and display the tokens.
tokens = tokenize(corpus)
print("1. Tokens from the corpus:")
print(tokens)

#################################################################
# Step 2: Build N-Gram (Trigram) Counts from the Corpus
#################################################################

# We build a trigram model (n=3) where each context is two words predicting a third word.
n = 3

# Create a dictionary mapping each 2-word context to a Counter of possible next words.
ngrams_counts = defaultdict(Counter)
for i in range(len(tokens) - n + 1):
    context = tuple(tokens[i:i+n-1])  # First two words as context
    next_word = tokens[i+n-1]         # Third word as the predicted word
    ngrams_counts[context][next_word] += 1

# Build the vocabulary: the sorted list of unique tokens in the corpus.
vocabulary = sorted(set(tokens))
print("\n2. Vocabulary (sorted alphabetically):")
for idx, word in enumerate(vocabulary, 1):
    print(f"{idx}. {word}")

print("\n3. Trigram Counts (contexts with their next words and counts):")
for idx, (context, counter) in enumerate(ngrams_counts.items(), 1):
    print(f"{idx}. Context {context}: {dict(counter)}")

#################################################################
# Step 3: Define Functions for Next Word Prediction and Probability Distribution
#################################################################

def predict_next_word(context, ngrams_counts):
    """
    Predict the next word using maximum likelihood estimation.
    Returns the most common next word for the given context.

    Parameters:
      - context: A tuple or list of words (expected length is n-1, i.e., 2 words for a trigram).
      - ngrams_counts: Dictionary mapping contexts to counters of next words.

    Returns:
      - The predicted next word as a string, or None if the context is not found.
    """
    context = tuple(context)
    if context in ngrams_counts:
        next_word = ngrams_counts[context].most_common(1)[0][0]
        return next_word
    else:
        return None

def get_full_next_word_probabilities(context, ngrams_counts, vocabulary):
    """
    For a given context, return a dictionary mapping each word in the entire vocabulary
    to its probability of being the next word.

    If the context is found in the n-gram counts, probabilities are calculated as:
      (Count of word following the context) / (Total counts for that context).
    Words not observed after the context are assigned a probability of 0.

    Parameters:
      - context: A tuple or list of words (context for prediction).
      - ngrams_counts: Dictionary mapping contexts to counters of next words.
      - vocabulary: List of all unique tokens in the corpus.

    Returns:
      - A dictionary where keys are candidate words and values are their probabilities.
    """
    context = tuple(context)
    if context in ngrams_counts:
        total_count = sum(ngrams_counts[context].values())
        # Create a distribution that includes every word from the vocabulary.
        # A Counter returns 0 for words never seen after this context, so they get 0.0.
        dist = {word: ngrams_counts[context][word] / total_count for word in vocabulary}
    else:
        # If the context is not found, assign a probability of 0.0 for every word.
        dist = {word: 0.0 for word in vocabulary}
    return dist

#################################################################
# Step 4: Show the Full Probability Distribution for Multiple Sample Contexts
#################################################################

# List of sample contexts to test.
sample_contexts = [
    ('the', 'knight'),
    ('once', 'upon'),
    ('in', 'a'),
    ('his', 'story'),
    ('every', 'town')
]

print("\n4. Full Probability Distributions for Multiple Sample Contexts:")
for context in sample_contexts:
    print(f"\nContext: {context}")
    prob_dist = get_full_next_word_probabilities(context, ngrams_counts, vocabulary)
    # Sort the probability distribution in descending order of probability.
    sorted_prob = sorted(prob_dist.items(), key=lambda x: -x[1])
    for idx, (word, prob) in enumerate(sorted_prob, 1):
        print(f"  {idx}. Word: '{word}' - Probability: {prob:.3f}")

#################################################################
# Step 5: Sentence Completion Examples for Multiple Starting Phrases
#################################################################

def complete_sentence(sentence, ngrams_counts, max_words=20):
    """
    Complete a sentence by predicting and appending words until:
      - The sentence reaches a specified maximum number of words, or
      - No further prediction is available for the current context.

    The function tokenizes the input sentence and then uses the last two words (the context)
    to predict the next word. This process repeats until the sentence is complete.

    Parameters:
      - sentence: The starting sentence as a string.
      - ngrams_counts: The dictionary containing trigram counts.
      - max_words: The maximum total number of words for the completed sentence.

    Returns:
      - The completed sentence as a string.
    """
    tokens = tokenize(sentence)
    while len(tokens) < max_words:
        # Use the last n-1 words as the context (n is the module-level order set in Step 2).
        # With fewer than n-1 tokens the slice returns them all, and the lookup finds no match.
        context = tokens[-(n-1):]
        next_word = predict_next_word(context, ngrams_counts)
        if next_word is None:
            break
        tokens.append(next_word)
    return ' '.join(tokens)

# List of sample starting phrases.
sample_sentences = [
    "once upon",
    "the knight",
    "in a land",
    "his story",
    "every town"
]

print("\n5. Sentence Completion Examples:")
for idx, sentence in enumerate(sample_sentences, 1):
    completed = complete_sentence(sentence, ngrams_counts, max_words=50)
    print(f"Example {idx}. Starting phrase: '{sentence}'")
    print(f"           Completed sentence: {completed}\n")

#################################################################
# Step 6: Explanation Summary
#################################################################

print("Explanation Summary:")
print("---------------------------------------------------------")
print("1. We defined a larger corpus containing multiple sentences to enrich the model's context.")
print("2. The corpus is preprocessed: it is converted to lowercase, punctuation is removed, and it is tokenized into words.")
print("3. A trigram model is built by mapping each 2-word context to the counts of possible following words.")
print("4. The vocabulary is extracted as the set of all unique tokens, sorted alphabetically.")
print("5. For any given context, we can compute the full probability distribution over the entire vocabulary.")
print("6. We demonstrated this with multiple sample contexts, showing the probability of each word being the next word.")
print("7. Finally, the sentence completion function uses the trigram model to extend multiple starting phrases.")
print("---------------------------------------------------------")


1. Tokens from the corpus:
['once', 'upon', 'a', 'time', 'in', 'a', 'land', 'far', 'away', 'there', 'lived', 'a', 'brave', 'knight', 'the', 'knight', 'was', 'known', 'for', 'his', 'valor', 'and', 'courage', 'and', 'many', 'stories', 'of', 'his', 'adventures', 'were', 'told', 'throughout', 'the', 'kingdom', 'the', 'knight', 'journeyed', 'through', 'mysterious', 'forests', 'fought', 'fierce', 'dragons', 'and', 'helped', 'those', 'in', 'need', 'his', 'story', 'was', 'an', 'inspiration', 'to', 'all', 'who', 'heard', 'it', 'in', 'a', 'small', 'village', 'the', 'people', 'would', 'often', 'speak', 'of', 'the', 'knights', 'exploits', 'the', 'villagers', 'believed', 'that', 'one', 'day', 'the', 'knight', 'would', 'return', 'to', 'save', 'them', 'from', 'impending', 'doom', 'meanwhile', 'the', 'kingdom', 'thrived', 'with', 'tales', 'of', 'heroism', 'and', 'hope', 'the', 'knights', 'legacy', 'continued', 'to', 'grow', 'as', 'his', 'adventures', 'were', 'recounted', 'by', 'bards', 'and', 'storyte