## Understanding Perplexity in Language Models

This notebook explains perplexity, a key evaluation metric for language models, shows how to compute it, and examines how it reflects the quality of a model's text predictions.

**What is Perplexity?**
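Perplexity measures how uncertain a language model is when predicting a sequence of tokens. It is the exponential of the average negative log-probability the model assigns to each token, so it indicates how "perplexed" the model is by the text.

- Low perplexity: The model assigns high probability to the sequence, i.e., it predicts the text with high confidence.
- High perplexity: The model assigns low probability to the sequence, i.e., it struggles to predict the text.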
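
To make the definition concrete, here is a minimal sketch that computes perplexity directly from the probabilities a model assigned to each observed token. The helper name `perplexity_from_probs` and the example probabilities are illustrative assumptions, not part of this notebook's code.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities assigned to four observed tokens
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.05, 0.1, 0.02, 0.08]
uniform_guess = [0.25] * 4  # guessing uniformly among 4 options at every step

print(perplexity_from_probs(confident))
print(perplexity_from_probs(uncertain))
print(perplexity_from_probs(uniform_guess))  # 4.0, up to floating-point rounding
```

A model that guesses uniformly among k options has perplexity k, so a perplexity value can be read as the effective number of equally likely choices the model faces per token.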






In [1]:
import tensorflow as tf
from transformers import TFAutoModelForCausalLM, AutoTokenizer

def calculate_perplexity(text, model_name='gpt2'):
    """
    Calculates the perplexity of the given text using a GPT-2 model in TensorFlow.
    
    Args:
        text (str): Input text.
        model_name (str): Name of the Hugging Face model (default: 'gpt2').
    
    Returns:
        float: Perplexity score.
    """
    
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = TFAutoModelForCausalLM.from_pretrained(model_name)
    
    # Tokenize the input text
    tokens = tokenizer.encode(text, return_tensors='tf')

    # Log the tokenized input
    print(f"\nOriginal Text: {text}")
    print(f"Tokenized Input: {tokens}")

    # Calculate loss and perplexity
    outputs = model(tokens, labels=tokens)
    loss = outputs.loss
    perplexity = tf.exp(loss)

    print(f"Loss: {loss.numpy()}")
    return perplexity.numpy()

# Compare perplexity of different examples
texts = [
    "The quick brown fox jumps over the lazy dog.",  # Grammatically correct and meaningful
    "Quick the brown fox over lazy jumps dog the.",  # Grammatically incorrect and jumbled
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",  # Latin placeholder text
    "Random gibberish xzq mfnweor pasd."  # Completely random gibberish
]

print("\n--- Perplexity Comparison ---")
for text in texts:
    perplexity = calculate_perplexity(text)
    print(f"Perplexity: {perplexity}")



--- Perplexity Comparison ---


All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.



Original Text: The quick brown fox jumps over the lazy dog.
Tokenized Input: [[  464  2068  7586 21831 18045   625   262 16931  3290    13]]
Loss: [5.0905147]
Perplexity: [162.47346]





Original Text: Quick the brown fox over lazy jumps dog the.
Tokenized Input: [[21063   262  7586 21831   625 16931 18045  3290   262    13]]
Loss: [8.565364]
Perplexity: [5246.7485]





Original Text: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Tokenized Input: [[   43 29625   220  2419   388   288 45621  1650   716   316    11   369
   8831   316   333 31659   271  2259  1288   270    13]]
Loss: [0.9613916]
Perplexity: [2.6153336]





Original Text: Random gibberish xzq mfnweor pasd.
Tokenized Input: [[29531 46795   527   680  2124    89    80   285 22184   732   273 38836
     67    13]]
Loss: [6.982953]
Perplexity: [1078.0974]


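The natural sentence (first example, perplexity ≈ 162) scores far lower than its jumbled version (≈ 5,247) or the random gibberish (≈ 1,078), which shows that the model is more confident when word order and vocabulary look like ordinary English.
The Lorem ipsum text scores lowest of all (≈ 2.6), most likely because this widely reproduced placeholder passage appeared many times in GPT-2's training data. Low perplexity therefore reflects how familiar a text is to the model, not how meaningful it is.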
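
The link between the reported loss and perplexity values can be checked by hand. The sketch below copies the four loss values from the output above and exponentiates them; the dictionary and its labels are introduced here purely for illustration.

```python
import math

# Loss values copied from the output above (mean cross-entropy per token, in nats)
reported_losses = {
    "natural sentence": 5.0905147,
    "jumbled sentence": 8.565364,
    "lorem ipsum": 0.9613916,
    "random gibberish": 6.982953,
}

# Perplexity is exp(loss), which is what tf.exp(loss) computes in the cell above
perplexities = {name: math.exp(loss) for name, loss in reported_losses.items()}
for name, ppl in perplexities.items():
    print(f"{name}: {ppl:.2f}")
```

Because of the exponential, a difference of about 3.5 in loss between the natural and jumbled sentences becomes a roughly 30-fold difference in perplexity.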

## Imports and Libraries

In [None]:
pip install --upgrade tensorflow-datasets

In [4]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
from collections import Counter

### Loading dataset

We will use the IMDb movie reviews dataset, which is a collection of movie reviews along with sentiment labels. We will focus on the text data and ignore the labels for this task.

**as_supervised=True** returns each example as a (text, label) tuple instead of a dictionary of features. We won't use the labels in this case, so they are dropped in a later step.


In [5]:
dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)


2024-11-19 23:14:10.233458: W tensorflow/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".


Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /Users/divyahegde/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Generating splits...:   0%|          | 0/3 [00:00<?, ? splits/s]

Generating train examples...:   0%|          | 0/25000 [00:00<?, ? examples/s]

Shuffling /Users/divyahegde/tensorflow_datasets/imdb_reviews/plain_text/incomplete.7NDLW5_1.0.0/imdb_reviews-t…

Generating test examples...:   0%|          | 0/25000 [00:00<?, ? examples/s]

Shuffling /Users/divyahegde/tensorflow_datasets/imdb_reviews/plain_text/incomplete.7NDLW5_1.0.0/imdb_reviews-t…

Generating unsupervised examples...:   0%|          | 0/50000 [00:00<?, ? examples/s]

Shuffling /Users/divyahegde/tensorflow_datasets/imdb_reviews/plain_text/incomplete.7NDLW5_1.0.0/imdb_reviews-u…

Dataset imdb_reviews downloaded and prepared to /Users/divyahegde/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


## Preprocess the Text Data

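The IMDb dataset contains raw text that we need to preprocess. We will tokenize each review (split it into words) and convert the words into bigrams (pairs of consecutive words).

Here, the tokenize() function converts the text from a TensorFlow byte-string tensor to a regular Python string and then splits it on whitespace into individual words. Punctuation stays attached to the neighboring word and case is preserved. The extract_bigrams() function then pairs each word with the word that follows it.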

In [10]:
def tokenize(text):
    return text.numpy().decode('utf-8').split()

def extract_bigrams(text):
    words = tokenize(text)
    bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
    return bigrams
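
The two helpers can be tried without TensorFlow on a plain byte string. The sketch below mirrors their logic; `tokenize_bytes`, `bigrams_from_words`, and the sample review are hypothetical names and data used only for this illustration.

```python
def tokenize_bytes(raw):
    # Same steps as tokenize(), minus the .numpy() call needed for TensorFlow tensors
    return raw.decode('utf-8').split()

def bigrams_from_words(words):
    # Pair each word with the word that follows it
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

sample = b"This movie was great."  # made-up review text
words = tokenize_bytes(sample)
pairs = bigrams_from_words(words)

print(words)  # ['This', 'movie', 'was', 'great.']
print(pairs)  # [('This', 'movie'), ('movie', 'was'), ('was', 'great.')]
```

Note that the final token is 'great.' with the period attached, so 'great' and 'great.' would count as different vocabulary entries under this tokenization.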


### Limit Data Size for Training and Testing
To keep the experiment manageable, we will limit the training and test data to a smaller number of samples (500 for training and 100 for testing):

In [6]:
train_data = dataset['train'].map(lambda x, y: x)
test_data = dataset['test'].map(lambda x, y: x)

train_texts = list(train_data.take(500))  # Limit to 500 training samples
test_texts = list(test_data.take(100))    # Limit to 100 test samples


2024-11-19 23:32:01.452421: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2024-11-19 23:32:01.483479: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


### Build the Vocabulary

In [11]:
train_bigrams = []
for text in train_texts:
    train_bigrams.extend(extract_bigrams(text))

test_bigrams = []
for text in test_texts:
    test_bigrams.extend(extract_bigrams(text))

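# Flatten the bigrams back into a single word list.
# Note: every interior word of a review appears twice in this list,
# because it ends one bigram and starts the next.
train_words = [w for bigram in train_bigrams for w in bigram]
test_words = [w for bigram in test_bigrams for w in bigram]

# The vocabulary is built from the training words only
vocab = list(set(train_words))
vocab_size = len(vocab)
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
word_to_idx["<UNK>"] = vocab_size  # Reserve the last index for unknown (out-of-vocabulary) words
idx_to_word = {idx: word for word, idx in word_to_idx.items()}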


### Convert Words to Indices

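Every word in the training and test sets is replaced by its corresponding index in the vocabulary. Test-set words that never appear in the training data are mapped to the index of the special <UNK> token.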

In [12]:
def word_to_index(word):
    return word_to_idx.get(word, word_to_idx["<UNK>"])  # Use <UNK> for unknown words

train_sequences = [word_to_index(word) for word in train_words]
test_sequences = [word_to_index(word) for word in test_words]
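
A small, self-contained sketch of the same vocabulary lookup shows how an unseen word falls back to <UNK>. The word lists and the `_demo` names are made up for illustration, and the vocabulary is sorted here only so that the indices are reproducible (the notebook itself uses an unsorted set).

```python
train_words_demo = ["the", "movie", "was", "great", "the", "end"]  # made-up training words
test_words_demo = ["the", "film", "was", "great"]                  # "film" is unseen

vocab_demo = sorted(set(train_words_demo))  # sorted only to make the demo reproducible
word_to_idx_demo = {word: idx for idx, word in enumerate(vocab_demo)}
word_to_idx_demo["<UNK>"] = len(vocab_demo)  # extra index reserved for unseen words

def to_index(word):
    # Fall back to the <UNK> index when the word is not in the vocabulary
    return word_to_idx_demo.get(word, word_to_idx_demo["<UNK>"])

print(word_to_idx_demo)
# {'end': 0, 'great': 1, 'movie': 2, 'the': 3, 'was': 4, '<UNK>': 5}
print([to_index(w) for w in test_words_demo])
# [3, 5, 4, 1]
```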


### Prepare Data for the LSTM Model

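From the sequences of word indices, we create input-output pairs for the LSTM model: the input is a window of sequence_length - 1 consecutive word indices, and the output is the index of the word that follows. With the default sequence_length=2, each input is a single word, so the model predicts the next word from the current word alone, as a bigram model does.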

In [14]:
def create_input_output(sequences, sequence_length=2):
    X, y = [], []
    for i in range(len(sequences) - sequence_length):
        X.append(sequences[i:i + sequence_length - 1])
        y.append(sequences[i + sequence_length - 1])
    return np.array(X), np.array(y)

X_train, y_train = create_input_output(train_sequences)
X_test, y_test = create_input_output(test_sequences)
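
To see what these pairs look like, the sketch below traces the same steps on a four-word toy input. `create_pairs` is a list-based copy of the loop above (the name is hypothetical), and strings stand in for word indices to keep the output readable.

```python
def create_pairs(sequences, sequence_length=2):
    # Same loop as create_input_output(), returning plain lists instead of NumPy arrays
    X, y = [], []
    for i in range(len(sequences) - sequence_length):
        X.append(sequences[i:i + sequence_length - 1])
        y.append(sequences[i + sequence_length - 1])
    return X, y

words = ["a", "b", "c", "d"]
bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
flattened = [w for bigram in bigrams for w in bigram]  # how train_words is built
print(flattened)  # ['a', 'b', 'b', 'c', 'c', 'd']

X, y = create_pairs(flattened)
print(list(zip(X, y)))
# [(['a'], 'b'), (['b'], 'b'), (['b'], 'c'), (['c'], 'c')]
```

In this toy trace, half of the pairs map a word to itself, a side effect of flattening the bigrams, and the final pair (c, d) is not produced because the loop stops one window early. If that is not intended, one possible adjustment is to build the sequence from the tokenized words directly and iterate up to len(sequences) - sequence_length + 1; this is a suggestion, not something the notebook does.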


### Build the LSTM Model

The model will have the following layers:

- Embedding Layer: Converts word indices into dense word embeddings.
- LSTM Layer: Processes the sequence of words to capture temporal dependencies.
- Dense Layer: Outputs the probability distribution over all possible next words.

In [15]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size + 1, output_dim=128, input_length=X_train.shape[1]),  # +1 for <UNK>
    tf.keras.layers.LSTM(128, return_sequences=False),
    tf.keras.layers.Dense(vocab_size + 1, activation='softmax')  # +1 for <UNK>
])
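
As a rough guide to the size of this architecture, the sketch below counts trainable parameters per layer. It assumes the default Keras layer settings (biases enabled); `count_parameters` is a hypothetical helper, and the vocabulary size of 10,000 is a made-up value, since the actual vocab_size depends on the sampled reviews.

```python
def count_parameters(vocab_size, embed_dim=128, lstm_units=128):
    num_tokens = vocab_size + 1                    # +1 for <UNK>, as in the model above
    embedding = num_tokens * embed_dim             # one embedding vector per token
    # An LSTM has 4 gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    dense = lstm_units * num_tokens + num_tokens   # weights plus biases
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}

print(count_parameters(vocab_size=10_000))  # 10_000 is a hypothetical vocabulary size
# {'embedding': 1280128, 'lstm': 131584, 'dense': 1290129, 'total': 2701841}
```

The LSTM count is independent of the vocabulary, so for a word-level model most parameters sit in the embedding and output layers.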


### Compile and Train the Model

In [16]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=3, batch_size=64, validation_data=(X_test, y_test))


Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.src.callbacks.History at 0x2cc0d68c0>

### Calculate Perplexity

In [17]:
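def calculate_perplexity(model, X, y):
    """
    Computes the perplexity of a Keras next-word model on (X, y).

    Note: this redefines the GPT-2 calculate_perplexity(text, model_name)
    helper from the start of the notebook.
    """
    predictions = model.predict(X)  # shape: (num_examples, vocab_size + 1)
    log_prob_sum = 0
    N = len(y)

    for i in range(N):
        prob = predictions[i, y[i]]  # probability assigned to the true next word
        log_prob_sum += np.log(prob + 1e-10)  # Small epsilon to avoid log(0)

    # Perplexity is the exponential of the average negative log-probability
    perplexity = np.exp(-log_prob_sum / N)
    return perplexity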

perplexity = calculate_perplexity(model, X_test, y_test)
print(f'Perplexity: {perplexity}')


Perplexity: 424.87818777474286
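
The calculation can be checked by hand on a tiny prediction matrix. The sketch below reimplements the same formula with the standard library only; `perplexity_from_predictions` and the toy probabilities are assumptions made for this illustration.

```python
import math

def perplexity_from_predictions(predictions, targets, eps=1e-10):
    # predictions[i][k] is the probability the model gives to word k for example i
    log_prob_sum = sum(math.log(predictions[i][targets[i]] + eps)
                       for i in range(len(targets)))
    return math.exp(-log_prob_sum / len(targets))

toy_predictions = [
    [0.5, 0.25, 0.25],  # true word is index 0, predicted with probability 0.5
    [0.1, 0.8, 0.1],    # true word is index 1, predicted with probability 0.8
]
toy_targets = [0, 1]

print(perplexity_from_predictions(toy_predictions, toy_targets))  # about 1.58
```

Here the result is 1 / sqrt(0.5 * 0.8), the inverse geometric mean of the probabilities given to the true words. Read the same way, the perplexity of about 425 reported above means the LSTM is, on average, about as uncertain as if it were choosing uniformly among roughly 425 words at each step.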
