# From N-Grams to Transformer-Based Language Models: Theoretical Foundation

A language model assigns a probability to a sequence of tokens:
$$
p(w_1, w_2, \ldots, w_M), \quad w_m \in V,
$$
where the set $V$ is a discrete vocabulary, for example $V = \{\text{aardvark}, \text{abacus}, \ldots, \text{zither}\}$.

Why would you want to compute the probability of a word sequence? In many applications, the goal is to produce word sequences as output:

- Machine Translation
- Speech Recognition
- Summarization
- Dialogue Systems (chatbots), etc.

Many systems that perform these tasks include a subcomponent that computes the probability of the output text. The purpose of this component is to steer the system toward more fluent text.



## 1. Naive Approach: Unbiased Language Model

Let's start with a very __naive approach to building a language model__: use the __relative frequency estimate__ to compute the probability of a sequence of tokens.

Let's work through an example to see this concept in action. Consider the sentence: "Computers are useless, they can only give you answers." Now let's estimate the probability of this sequence of word tokens using the relative frequency estimate: 

$$
p(\text{Computers are useless, they can only give you answers})
= \frac{\text{count}(\text{Computers are useless, they can only give you answers}) \ }{\text{count}(\text{all sentences ever spoken})}
$$

In the theoretical limit of infinite data, this estimator would give the correct answer, which is why it is said to be __unbiased__. In practice, however, it has a very serious problem: it is impossible to count all the sentences ever spoken, and it is hard to even imagine how large a dataset would have to be to approximate that count accurately.

Another thing to notice is that even grammatical sentences receive probability zero if they do not appear in the set of _all sentences ever spoken_ (in practice, the training data we manage to collect). Clearly, this estimator is very data-hungry and suffers from high variance. (_Side note_: Chomsky famously argued that this is evidence against the very concept of probabilistic language models: no such model could distinguish the grammatical sentence _colorless green ideas sleep furiously_ from the ungrammatical permutation _furiously sleep ideas green colorless_.)
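
To make the sparsity problem concrete, here is a minimal sketch that applies the whole-sentence relative frequency estimate to a tiny hand-made corpus. The corpus and the helper name `sentence_probability` are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

# Toy stand-in for "all sentences ever spoken" (an illustrative assumption).
corpus = [
    "computers are useless",
    "computers are useful",
    "computers are useless",
    "gorillas groom their friends",
]

sentence_counts = Counter(corpus)

def sentence_probability(sentence):
    """Relative frequency of an entire sentence in the corpus."""
    return sentence_counts[sentence] / len(corpus)

# A sentence that was observed gets a nonzero probability...
print(sentence_probability("computers are useless"))
# ...but a perfectly grammatical sentence that was never observed gets zero.
print(sentence_probability("gorillas are useful"))
```

Every word of the second sentence occurs in the corpus, yet the estimate is zero because the sentence as a whole was never seen.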

So what is the solution to this problem? __We need to introduce bias to have a chance of making reliable estimates from finite training data__.

## 2. N-grams Language Models

An n-gram language model, on the other hand, __computes the probability of a sequence of tokens as a product of probabilities of subsequences__. The probability of a sequence $p(\mathbf{w}) = p(w_1, w_2, \ldots, w_M)$ can be refactored using the chain rule:

$$
p(\mathbf{w}) = p(w_1, w_2, \ldots, w_M) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_2, w_1) \times \ldots \times p(w_M \mid w_{M-1}, \ldots, w_1)
$$


Each element in the product is the probability of a word given all its predecessors. We can think of this as a _word prediction task_: given the context _Computers are_, we want to compute a probability over the next token. The relative frequency estimate of the probability of the word _useless_ in this context is,


$$
p(\text{useless} \mid \text{computers are}) = \frac{\text{count(computers are useless } )}{\sum_{x \in V} \text{count(computers are } x)}
$$

$$
= \frac{\text{count(computers are useless)}}{\text{count(computers are)}}
$$
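
The conditional relative frequency estimate above can be sketched in a few lines of Python. The toy token stream and the helper name `cond_prob` are assumptions made for illustration only.

```python
from collections import Counter

# Toy token stream (an illustrative assumption).
tokens = "computers are useless computers are fast computers are useless".split()

# count(w1 w2 w3) for every trigram in the stream
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

# Denominator: sum over x of count(context x), accumulated per two-word context
context_counts = Counter()
for (w1, w2, _w3), count in trigram_counts.items():
    context_counts[(w1, w2)] += count

def cond_prob(word, context):
    """Relative frequency estimate of p(word | context) for a two-word context."""
    if context_counts[context] == 0:
        return 0.0  # the context itself was never observed
    return trigram_counts[context + (word,)] / context_counts[context]

print(cond_prob("useless", ("computers", "are")))
print(cond_prob("fast", ("computers", "are")))
```

In this toy stream the context _computers are_ is followed by _useless_ twice and by _fast_ once, so the two estimates are 2/3 and 1/3.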

If you think carefully about the denominator, you can see that we haven't really made any progress so far. To compute the conditional probability

$$
p(w_M \mid w_{M-1}, w_{M-2}, \ldots, w_1)
$$

we would need to model $V^{M-1}$ contexts.


To solve this problem, n-gram models make a very interesting assumption: __they condition only on the past n-1 words__:

$$
p(w_m \mid w_{m-1}, \ldots, w_1) \approx p(w_m \mid w_{m-1}, \ldots, w_{m-n+1})
$$

This means that the probability of a sentence can be approximated as:

$$
p(w_1, \ldots, w_M) \approx \prod_{m=1}^{M} p(w_m \mid w_{m-1}, \ldots, w_{m-n+1})
$$

This model requires estimating and storing the probability of only $V^n$ events, which is exponential in the order of the n-gram, and not $V^M$, which is exponential in the length of the sentence. The n-gram probabilities can be computed by relative frequency estimation. For a trigram model ($n = 3$), for example:

$$
p(w_m \mid w_{m-1}, w_{m-2}) = \frac{\text{count}(w_{m-2}, w_{m-1}, w_m)}{\sum_{w'} \text{count}(w_{m-2}, w_{m-1}, w')}
$$
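
As a rough illustration of the n-gram approximation, the sketch below scores whole sentences with a bigram model ($n = 2$) estimated by relative frequency. The toy sentences, the `<s>`/`</s>` padding symbols, and the helper names are assumptions made for this example.

```python
from collections import Counter

# Toy training sentences (an illustrative assumption).
sentences = [
    ["computers", "are", "useless"],
    ["computers", "are", "fast"],
    ["gorillas", "are", "fast"],
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in sentences:
    padded = ["<s>"] + sentence + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def bigram_prob(cur, prev):
    """Relative frequency estimate of p(cur | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]

def sentence_prob(sentence):
    """Product of bigram probabilities over the padded sentence."""
    padded = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= bigram_prob(cur, prev)
    return prob

# This exact sentence never occurs in the training data, but all of its bigrams do.
print(sentence_prob(["gorillas", "are", "useless"]))
```

Unlike the whole-sentence estimator, the bigram model gives this unseen sentence a nonzero probability, because it only needs to have seen each adjacent pair of words.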

The hyperparameter  _n_  controls the size of the context used in each conditional probability. If this is misspecified, the language model will perform poorly. Let’s consider the potential problems concretely.


- When n is too small. Consider the following sentences:
  - __Gorillas__ always like to groom __their__ friends.
  - The __computer__ that’s on the 3rd floor of our office building __crashed__.

  In each example, the words written in bold depend on each other: the likelihood of __their__ depends on knowing that __gorillas__ is plural, and the likelihood of __crashed__ depends on knowing that the subject is a __computer__. _If the n-grams are not big enough to capture this context, then the resulting language model would offer probabilities that are too low for these sentences, and too high for sentences that fail basic linguistic tests like number agreement_.
- When n is too big. In this case, it is hard to obtain good estimates of the n-gram parameters from our dataset because of data sparsity. To handle the gorilla example, it is necessary to model 6-grams, which means accounting for $V^6$ events. Under a very small vocabulary of $V = 10^4$, this means estimating the probability of $10^{24}$ distinct events.

These two problems point to another bias-variance tradeoff. A small n-gram size introduces high bias, and a large n-gram size introduces high variance. We can even have both problems at the same time! Language is full of long-range dependencies that we cannot capture because n is too small; at the same time, language datasets are full of rare phenomena, whose probabilities we fail to estimate accurately because n is too large.
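
The growth in the number of events is easy to check numerically. The short sketch below simply evaluates $V^n$ for a vocabulary of ten thousand words; the helper name `num_ngram_events` is an assumption made for illustration.

```python
def num_ngram_events(vocab_size, n):
    """Number of distinct n-grams over a vocabulary of the given size."""
    return vocab_size ** n

V = 10**4  # a very small vocabulary
for n in range(1, 7):
    print(f"n = {n}: {num_ngram_events(V, n):.0e} possible n-grams")
```

Each extra word of context multiplies the number of parameters by the vocabulary size, which is why large values of n quickly outrun any realistic dataset.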

One solution is to try to keep _n_ large while still making low-variance estimates of the underlying parameters. To do this, we will introduce a different sort of bias: __smoothing__.

## Code Implementation:

In [33]:
import nltk
from nltk.corpus import gutenberg

# Download the Gutenberg corpus and load the raw text of Jane Austen's "Emma"
nltk.download('gutenberg')
text = gutenberg.raw('austen-emma.txt')

# Download the tokenizer models and the stop-word list used during preprocessing
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package gutenberg to /Users/admin/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /Users/admin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/admin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Preprocessing, Training, and Generating Text using the n-gram Model

In [64]:
import random
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import everygrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import gutenberg

# Function to preprocess the text
def preprocess(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    #print(f"Raw tokens: {tokens[:20]}")  # Debugging: print the first 20 tokens
    
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove punctuation and non-alphabetic tokens
    tokens = [token for token in tokens if token.isalpha()]
    
    # Remove stop words, but keep a few frequent words that matter for this text
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words or token in ['emma', 'harriet', 'she', 'would', 'could']]
    
    #print(f"Processed tokens: {tokens[:20]}")  # Debugging: print the first 20 processed tokens
    return tokens

# Load and preprocess the training text
text = gutenberg.raw('austen-emma.txt')  # Make sure you're using the right text
train_tokens = preprocess(text)

# Create padded n-grams for training data
padded_tokens = list(pad_both_ends(train_tokens, n=2))
train_data = list(everygrams(padded_tokens, max_len=2))
vocab = set(train_tokens)

# Debugging: Print a sample of n-grams
#print("Sample n-grams in train_data:", train_data[:10])  # Debugging: print the first 10 n-grams

# Initialize the MLE model
lm = MLE(2)

# Train the model
lm.fit([train_data], vocab)

# Debugging: Print the vocabulary size and contents after training
print("Vocabulary size after training:", len(lm.vocab))
print("Vocabulary contents:", list(lm.vocab))  # Print all vocabulary items

# Function to generate text based on the MLE model
def generate_text(lm, seed_word, num_words):
    text = [seed_word]
    for _ in range(num_words):
        # Use the last word in the generated text as context
        context = text[-1:]

        # Get all possible next words
        word_scores = {word: lm.score(word, context) for word in lm.vocab}
        # Filter out words with a score of 0
        word_scores = {word: score for word, score in word_scores.items() if score > 0}
        
        #print(f"Context: {context}")
        #print(f"Word scores: {word_scores}")
        
        if word_scores:
            # Choose the next word based on scores and add randomness
            next_word = random.choices(
                list(word_scores.keys()), 
                weights=list(word_scores.values()), 
                k=1
            )[0]
            text.append(next_word)
            print(f"Next word chosen: {next_word}")
        else:
            break  # Exit if no valid next word is found

    return ' '.join(text)

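# Generate text starting with a given seed word.
# Note: the training tokens are lowercased, so 'I' is not in the vocabulary
# and is looked up as <UNK> when it is used as context.
seed_word = 'I'  # You can change this to a different word if needed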
generated_text = generate_text(lm, seed_word, 50)

print("Generated text:", generated_text)


Vocabulary size after training: 6809
Vocabulary contents: ['placed', 'renewal', 'exercise', 'inexpressible', 'hoped', 'orphan', 'moderation', 'grows', 'admire', 'reminded', 'resolutely', 'faces', 'irritate', 'announce', 'companions', 'recited', 'apprehend', 'conceived', 'protracted', 'illnesses', 'polish', 'uttered', 'extend', 'unavoidable', 'lace', 'punctuality', 'likenesses', 'lawyer', 'james', 'joyous', 'nervous', 'blown', 'town', 'satin', 'rid', 'occurs', 'suspense', 'arisen', 'dead', 'grave', 'viewed', 'underbred', 'nodding', 'ceremonious', 'thankful', 'soliloquy', 'token', 'overlook', 'make', 'riddles', 'admission', 'greatcoat', 'employed', 'ribbon', 'league', 'addressing', 'shrinking', 'sun', 'unmirthful', 'backgammon', 'taken', 'unwholesome', 'naivete', 'papa', 'chosen', 'consulting', 'curve', 'beautiful', 'absolutely', 'act', 'joined', 'papers', 'gayest', 'confessed', 'feelings', 'mortified', 'always', 'failings', 'creditable', 'moreover', 'prose', 'together', 'arranged', 'nin

Next word chosen: part
Next word chosen: doubt
Next word chosen: yet
Next word chosen: bread
Next word chosen: butter
Next word chosen: she
Next word chosen: never
Next word chosen: able
Next word chosen: fix
Next word chosen: declare
Next word chosen: affection
Next word chosen: glad
Next word chosen: come
Next word chosen: said
Next word chosen: knightley
Next word chosen: said
Next word chosen: quite
Next word chosen: horror
Next word chosen: believe
Next word chosen: married
Next word chosen: settled
Next word chosen: go
Next word chosen: next
Next word chosen: morning
Next word chosen: party
Next word chosen: wait
Next word chosen: little
Next word chosen: deserve
Next word chosen: less
Next word chosen: she
Next word chosen: would
Next word chosen: way
Next word chosen: mr
Next word chosen: elton
Next word chosen: would
Next word chosen: hymen
Next word chosen: saffron
Generated text: I emma soon alarmed considerably longer change emma might manners would imprudent settle early p

# Training and Generating Text using NLTK's MLE Model

This notebook demonstrates how to preprocess text data, train an n-gram language model using Maximum Likelihood Estimation (MLE) with NLTK, and generate text based on the trained model. We will use Jane Austen's "Emma" from the NLTK Gutenberg corpus as our dataset.

## Import Libraries

First, we import the necessary libraries.


In [65]:
import random
from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import everygrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import gutenberg

## Preprocess Text

We define a function to preprocess the text by tokenizing it, converting to lowercase, removing punctuation and non-alphabetic tokens, and removing stop words.


In [66]:
# Function to preprocess the text
def preprocess(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    print(f"Raw tokens: {tokens[:20]}")  # Debugging: print the first 20 tokens
    
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove punctuation and non-alphabetic tokens
    tokens = [token for token in tokens if token.isalpha()]
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    print(f"Processed tokens: {tokens[:20]}")  # Debugging: print the first 20 processed tokens
    return tokens


## Load and Preprocess the Training Text

We load the text of Jane Austen's "Emma" and preprocess it using the function defined above.


In [67]:
# Load and preprocess the training text
text = gutenberg.raw('austen-emma.txt')  # Make sure you're using the right text
train_tokens = preprocess(text)


Raw tokens: ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER', 'I', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich']
Processed tokens: ['emma', 'jane', 'austen', 'volume', 'chapter', 'emma', 'woodhouse', 'handsome', 'clever', 'rich', 'comfortable', 'home', 'happy', 'disposition', 'seemed', 'unite', 'best', 'blessings', 'existence', 'lived']


## Create Padded n-grams for Training

We create padded n-grams for the training data and generate the vocabulary.


In [68]:
# Create padded n-grams for training data
padded_tokens = list(pad_both_ends(train_tokens, n=2))
train_data = list(everygrams(padded_tokens, max_len=2))
vocab = set(train_tokens)

# Debugging: Print a sample of n-grams
print("Sample n-grams in train_data:", train_data[:10])  # Debugging: print the first 10 n-grams


Sample n-grams in train_data: [('<s>',), ('<s>', 'emma'), ('emma',), ('emma', 'jane'), ('jane',), ('jane', 'austen'), ('austen',), ('austen', 'volume'), ('volume',), ('volume', 'chapter')]
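
To see what the padding and n-gram extraction steps produce without relying on NLTK, here is a small pure-Python sketch. The helper names `pad_both_ends_sketch` and `ngrams_sketch` are hypothetical stand-ins written for illustration; they are not NLTK functions.

```python
def pad_both_ends_sketch(tokens, n):
    """Add n-1 start symbols and n-1 end symbols around a token list."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)

def ngrams_sketch(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["emma", "jane", "austen"]
padded = pad_both_ends_sketch(tokens, n=2)
print(padded)
print(ngrams_sketch(padded, 1))  # unigrams
print(ngrams_sketch(padded, 2))  # bigrams
```

The training data above interleaves the unigrams and bigrams of the padded sequence, which is the same set of tuples these two calls produce separately.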


## Initialize and Train the MLE Model

We initialize the MLE model and train it on the preprocessed data.


In [69]:
# Initialize the MLE model
lm = MLE(2)

# Train the model
lm.fit([train_data], vocab)

# Debugging: Print the vocabulary size and contents after training
print("Vocabulary size after training:", len(lm.vocab))
print("Vocabulary contents:", list(lm.vocab))  # Print all vocabulary items


Vocabulary size after training: 6808
Vocabulary contents: ['placed', 'renewal', 'exercise', 'inexpressible', 'hoped', 'orphan', 'moderation', 'grows', 'admire', 'reminded', 'resolutely', 'faces', 'irritate', 'announce', 'companions', 'recited', 'apprehend', 'conceived', 'protracted', 'illnesses', 'polish', 'uttered', 'extend', 'unavoidable', 'lace', 'punctuality', 'likenesses', 'lawyer', 'james', 'joyous', 'nervous', 'blown', 'town', 'satin', 'rid', 'occurs', 'suspense', 'arisen', 'dead', 'grave', 'viewed', 'underbred', 'nodding', 'ceremonious', 'thankful', 'soliloquy', 'token', 'overlook', 'make', 'riddles', 'admission', 'greatcoat', 'employed', 'ribbon', 'league', 'addressing', 'shrinking', 'sun', 'unmirthful', 'backgammon', 'taken', 'unwholesome', 'naivete', 'papa', 'chosen', 'consulting', 'curve', 'beautiful', 'absolutely', 'act', 'joined', 'papers', 'gayest', 'confessed', 'feelings', 'mortified', 'always', 'failings', 'creditable', 'moreover', 'prose', 'together', 'arranged', 'nin

## Generate Text Based on the Trained Model

We define a function to generate text based on the trained MLE model.


In [73]:
# Function to generate text based on the MLE model
def generate_text(lm, seed_word, num_words):
    text = [seed_word]
    for _ in range(num_words):
        # Use the last word in the generated text as context
        context = text[-1:]

        # Get all possible next words
        word_scores = {word: lm.score(word, context) for word in lm.vocab}
        # Filter out words with a score of 0
        word_scores = {word: score for word, score in word_scores.items() if score > 0}
        
        #print(f"Context: {context}")
        #print(f"Word scores: {word_scores}")
        
        if word_scores:
            # Choose the next word based on scores and add randomness
            next_word = random.choices(
                list(word_scores.keys()), 
                weights=list(word_scores.values()), 
                k=1
            )[0]
            text.append(next_word)
            #print(f"Next word chosen: {next_word}")
        else:
            break  # Exit if no valid next word is found

    return ' '.join(text)
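
The sampling loop can also be sketched without NLTK, which makes the mechanics easier to inspect. In this sketch the toy token list, the helper names, and the fixed random seed are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(tokens):
    """Map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return counts

def generate(counts, seed, num_words, rng):
    """Sample up to num_words words, each conditioned on the previous word."""
    text = [seed]
    for _ in range(num_words):
        followers = counts.get(text[-1])
        if not followers:
            break  # the last word was never seen as a context
        words = list(followers)
        weights = [followers[w] for w in words]
        text.append(rng.choices(words, weights=weights, k=1)[0])
    return text

tokens = "emma could not forgive her emma could not resist emma was happy".split()
counts = train_bigram_counts(tokens)
sample = generate(counts, "emma", 5, random.Random(0))
print(" ".join(sample))
```

Sampling in proportion to raw follower counts gives the same distribution as scoring every vocabulary word with the MLE model and discarding the zeros, but it only touches the words that were actually observed after the current context.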


## Generate Sample Text

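Finally, we use the `generate_text` function to generate text starting with a given seed word. Note that the training tokens were lowercased, so a capitalized seed such as `'Today'` is not in the vocabulary and is looked up as `<UNK>` when it is used as context. The padding symbol `<s>` is also missing from `vocab`, so it was mapped to `<UNK>` during training as well; the only word ever observed after `<UNK>` is therefore the first token of the text, which is why each sample below continues with _emma_. Choosing a lowercase seed that occurs in the text conditions the first step on an actual word.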


In [75]:
# Generate text starting with a given seed word
seed_word = 'Today'  
generated_text = generate_text(lm, seed_word, 50)

print("Generated text:", generated_text)


Generated text: Today emma harriet could manage extremely glad see nothing drawing music never saw answer thinking worth could touch pedal totally ignorant left written pressingly would interested believe giving fair nothing confidence towards long passed undiscerned heard others overpowering period fortitude proper attention mother says positively must proceed state health emma aware


In [76]:
# Generate text starting with a given seed word
seed_word = 'Love'
generated_text = generate_text(lm, seed_word, 50)

print("Generated text:", generated_text)


Generated text: Love emma smiling prosing undistinguishing unfastidious apt despair however emma every thing tolerable unfortunately could give back nonsense restrain speak made unfit office governess could kept longer obliged practise provided weather made progress without concurrence lord earth sea good manners happy though letter sudden variation might come shall speak yesterday certainly


# Evaluating the MLE Model

We will define functions to calculate the perplexity and cross-entropy loss of the trained MLE model on test data.

## Import Libraries
First, we import any additional necessary libraries.


In [77]:
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Vocabulary


## Preprocess the Test Data
We load and preprocess the test text.


In [78]:
# Load and preprocess the test text
test_text = gutenberg.raw('austen-persuasion.txt')  # You can change this to another text if needed
test_tokens = preprocess(test_text)

# Create padded n-grams for test data
padded_test_tokens = list(pad_both_ends(test_tokens, n=2))
test_data = list(everygrams(padded_test_tokens, max_len=2))


Raw tokens: ['[', 'Persuasion', 'by', 'Jane', 'Austen', '1818', ']', 'Chapter', '1', 'Sir', 'Walter', 'Elliot', ',', 'of', 'Kellynch', 'Hall', ',', 'in', 'Somersetshire', ',']
Processed tokens: ['persuasion', 'jane', 'austen', 'chapter', 'sir', 'walter', 'elliot', 'kellynch', 'hall', 'somersetshire', 'man', 'amusement', 'never', 'took', 'book', 'baronetage', 'found', 'occupation', 'idle', 'hour']


## Calculate Cross-Entropy Loss

We define a function to calculate the cross-entropy loss of the model on the test data.


In [83]:
# Function to calculate cross-entropy loss
def cross_entropy(lm, test_data):
    log_prob_sum = 0
    n = 0

    for ngram in test_data:
        context = ngram[:-1]
        word = ngram[-1]
        log_prob = lm.logscore(word, context)
        
        if log_prob == float('-inf'):
            print(f"Encountered zero probability for context {context} and word {word}")
        
        log_prob_sum += log_prob
        n += 1

    return -log_prob_sum / n


## Calculate Perplexity

We define a function to calculate the perplexity of the model on the test data.


In [84]:
# Function to calculate perplexity
def perplexity(lm, test_data):
    return 2 ** cross_entropy(lm, test_data)
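
The relationship between the two metrics can be checked on hand-picked numbers. The sketch below takes the probabilities a model assigned to each test token and computes cross-entropy in bits and the corresponding perplexity; the function names and the example probabilities are assumptions made for illustration.

```python
import math

def cross_entropy_bits(probs):
    """Average negative log2 probability per token."""
    if any(p == 0 for p in probs):
        return math.inf  # a single zero-probability token makes the loss infinite
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity_from_probs(probs):
    """Perplexity is 2 raised to the cross-entropy measured in bits."""
    return 2 ** cross_entropy_bits(probs)

# A model that always spreads its mass uniformly over 4 words
uniform = [0.25] * 8
print(cross_entropy_bits(uniform))     # 2.0 bits per token
print(perplexity_from_probs(uniform))  # 4.0, the effective number of choices

# One unseen event is enough to make both metrics infinite
print(perplexity_from_probs([0.5, 0.0]))
```

The uniform case shows why perplexity is often read as an effective branching factor, and the last line shows why an unsmoothed model cannot be evaluated meaningfully on text containing unseen n-grams.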


## Evaluate the Model

We evaluate the model by calculating the cross-entropy loss and perplexity on the test data.


In [85]:
# Calculate cross-entropy loss and perplexity
cross_entropy_loss = cross_entropy(lm, test_data)
perplexity_score = perplexity(lm, test_data)

print("Cross-Entropy Loss:", cross_entropy_loss)
print("Perplexity:", perplexity_score)




Encountered zero probability for context ('first',) and word principal
Encountered zero probability for context ('principal',) and word acquaintance
Encountered zero probability for context ('acquaintance',) and word marrying
Encountered zero probability for context ('marrying',) and word cousin
Encountered zero probability for context ('cousin',) and word continually
Encountered zero probability for context ('continually',) and word hearing
Encountered zero probability for context ('hearing',) and word father
Encountered zero probability for context ('sister',) and word described
Encountered zero probability for context ('described',) and word one
Encountered zero probability for context ('miss',) and word elliot
Encountered zero probability for context ('elliot',) and word thought
Encountered zero probability for context ('thought',) and word affectionately
Encountered zero probability for context ('affectionately',) and word perhaps
Encountered zero probability for context ('perhaps

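Because the MLE model assigns zero probability to every bigram that did not occur in the training text, a single unseen bigram makes the cross-entropy and the perplexity on the test data infinite. The next step is to change the model from `MLE(2)` to `Laplace(2)` (add-one smoothing), so that unseen bigrams receive a small nonzero probability.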
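
To show what add-one (Laplace) smoothing does to the estimates, here is a minimal pure-Python sketch for a bigram model. It is a simplified illustration under stated assumptions (a toy token list, no padding, and no unknown-word symbol), not a reproduction of NLTK's implementation.

```python
from collections import Counter

# Toy training tokens (an illustrative assumption).
tokens = "emma was happy emma was clever".split()
vocab = sorted(set(tokens))

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # how often each word appears as a context

def mle_prob(word, prev):
    """Unsmoothed relative frequency estimate of p(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def laplace_prob(word, prev):
    """Add-one estimate: pretend every possible bigram was seen once more."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

# "was emma" never occurs in the training tokens
print(mle_prob("emma", "was"))
print(laplace_prob("emma", "was"))
# The smoothed distribution over the vocabulary still sums to one
print(sum(laplace_prob(w, "was") for w in vocab))
```

The unseen bigram moves from probability zero to a small positive value, at the cost of taking some probability mass away from the bigrams that were observed; this is the bias that smoothing introduces.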