# Introduction to Language Models

In this notebook, we introduce language models. We start with a classic ngram (trigram) language model, and then we load up a state-of-the-art pretrained transformer neural language model, and we use it in the same way.

## Building an ngram model

Let's construct an ngram model.  We will build and "train" the ngram model from scratch, from some training text that we'll load from the Gutenberg public-domain text dataset.

In [None]:
# Import necessary libraries
import nltk
import random
from nltk import bigrams, trigrams
from collections import defaultdict, Counter

# Download and load a sample text dataset
nltk.download('punkt')
nltk.download('gutenberg')

Using this API, you could load, e.g., Shakespeare's Hamlet by calling `nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')` (try it!). Below we load Jane Austen's Emma.

Then we print some of the text.

In [None]:
text = nltk.corpus.gutenberg.raw('austen-emma.txt')

print(f'The text is {len(text)} characters long. Here is the beginning:\n{text[:300]}')

## Tokenization

An ngram model can be built on letters or words or other chunks of text.

Here we will split the text into words.

If you'd like to create a character-based ngram model, you can try `tokens = list(text)` instead.

In [None]:
# Preprocess the text
tokens = nltk.word_tokenize(text.lower())
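
To make the difference between word-level and character-level tokens concrete, here is a small sketch that does not need NLTK. The sample sentence and the use of plain whitespace splitting are assumptions for illustration; `nltk.word_tokenize` would additionally split off punctuation such as the commas.

```python
# Toy comparison of word-level and character-level tokens (no NLTK required).
sample = "Emma Woodhouse, handsome, clever, and rich"

# Whitespace splitting is only a rough stand-in for nltk.word_tokenize:
# punctuation stays attached to the neighboring word here.
word_tokens = sample.lower().split()

# Character-level tokens: every character, including each space, is a token.
char_tokens = list(sample.lower())

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens[:10])
```

The character-level sequence is much longer but draws from a much smaller set of distinct tokens, which is the main trade-off between the two choices.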


## Count trigram statistics

The `trigrams` function chops up text into a list of triples of adjacent words, like this:

In [None]:
list(trigrams(tokens))[:10]
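
The same sliding-window idea can be sketched in plain Python. The helper name `toy_trigrams` and the five-word list are hypothetical; this is only meant to show what "triples of adjacent words" means, not how NLTK implements it.

```python
def toy_trigrams(seq):
    """Return every run of three adjacent items (hypothetical helper)."""
    # zip stops at the shortest input, so the last triple ends at the last item.
    return list(zip(seq, seq[1:], seq[2:]))

words = ["she", "saw", "the", "cat", "run"]
triples = toy_trigrams(words)
print(triples)
```

A sequence of length n yields n - 2 triples when no padding is used.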

We can now run through every triple of adjacent tokens.

In [None]:
# Build trigram statistics
trigram_model = defaultdict(Counter)
for w1, w2, w3 in trigrams(tokens, pad_right=True, pad_left=True):
    trigram_model[(w1, w2)][w3] += 1

# Convert counts to probabilities
for w1_w2 in trigram_model:
    total_count = float(sum(trigram_model[w1_w2].values()))
    for w3 in trigram_model[w1_w2]:
        trigram_model[w1_w2][w3] /= total_count
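
To see what the counting and normalization steps produce, it may help to run them on a corpus small enough to check by hand. The three-sentence toy corpus below is an assumption, and unlike the cell above this sketch uses no padding and builds a separate dictionary of probabilities instead of overwriting the counts.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, already split into tokens.
toy_tokens = "she saw the cat . she saw a dog . she saw the dog .".split()

# Count how often each word follows each pair of words.
toy_counts = defaultdict(Counter)
for w1, w2, w3 in zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]):
    toy_counts[(w1, w2)][w3] += 1

# Divide each count by the total for its context to get probabilities.
toy_probs = {
    context: {w: c / sum(counter.values()) for w, c in counter.items()}
    for context, counter in toy_counts.items()
}

print(toy_counts[("she", "saw")])
print(toy_probs[("she", "saw")])
```

In this corpus "she saw" is followed by "the" twice and "a" once, so the model assigns them probabilities of 2/3 and 1/3.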



## Examining the model

We can read the trigram model by supplying any bigram as input; then it gives us the probability distribution of the next words.

We often write this as
$$p(y | x)$$
where $x$ is the sequence of previous words, in this case, $x = ['she', 'saw']$, and $y$ is the next word to be predicted, which will be selected based on the table.
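
Under the counting scheme used to build `trigram_model`, each entry of the table is a relative frequency. As a sketch, writing $x = (w_1, w_2)$ and $y = w_3$, and using $\mathrm{count}$ (notation introduced here) for the raw trigram tallies:

$$p(w_3 \mid w_1, w_2) = \frac{\mathrm{count}(w_1, w_2, w_3)}{\sum_{w} \mathrm{count}(w_1, w_2, w)}$$

The denominator is the `total_count` computed for each context, so the probabilities for any one context sum to 1.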

In [None]:
trigram_model[('she', 'saw')]

## Generate text with a trigram model

We can repeat the process of choosing a next word based on this rule.

In [None]:
# Sampling function to generate text
text = ['she', 'saw']
for _ in range(50):
    next_words = list(trigram_model[tuple(text[-2:])].keys())
    if not next_words:  # Check if there are no possible next words
        break
    next_word = random.choices(next_words, weights=trigram_model[tuple(text[-2:])].values())[0]
    if next_word is None:  # End of sentence
        break
    text.append(next_word)
print(' '.join(text))
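
The weighted draw at the heart of the loop can be checked in isolation. In this sketch the distribution `next_word_probs` is made up, and the seed is fixed only so the run is repeatable; the point is that `random.choices` picks each word roughly in proportion to its weight.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical next-word distribution for a single context.
next_word_probs = {"the": 0.5, "a": 0.3, "her": 0.2}

words = list(next_word_probs)
weights = list(next_word_probs.values())

# Draw many samples and measure how often each word comes up.
samples = random.choices(words, weights=weights, k=10_000)
freqs = {w: c / len(samples) for w, c in Counter(samples).items()}
print(freqs)
```

Because each step is a random draw, rerunning the generation cell without a fixed seed gives different text each time.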


Take a close look at the output.

Try running it a few times.

What do you think of the generated text?

# Neural Language Modeling

Here we load a pretrained transformer called "distilgpt2".

Instead of looking up probabilities in a table, it uses a neural network to estimate probabilities. This neural network was pretrained on a large amount of web text.

The neural network is a classifier, and it needs to have a fixed vocabulary of tokens to choose between for each "word".  So we use a tokenizer algorithm that splits text using a fixed vocabulary of about 50 thousand tokens.

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
torch.set_grad_enabled(False)

# Load pretrained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("distilgpt2")
model.eval()  # Set the model to evaluation mode
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")

## Using the tokenizer object

The GPT2 tokenizer splits a string into word-like tokens and translates each of them to a number. See how the number 2497 corresponds to the token `' saw'`.

In [None]:
# Set the initial prompt
prompt = "She saw"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

print(input_ids)
print([tokenizer.decode(input_ids[0, i]) for i in range(len(input_ids[0]))])
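
To illustrate the idea of mapping text onto a fixed vocabulary, here is a deliberately simplified sketch. GPT-2's real tokenizer uses byte-pair encoding; the greedy longest-match rule, the tiny vocabulary, and the helper names `toy_encode` and `toy_decode` below are all assumptions made for illustration.

```python
# Hypothetical miniature vocabulary: token string -> integer id.
toy_vocab = {"She": 0, " saw": 1, " s": 2, "aw": 3, " ": 4, "a": 5, "w": 6, "s": 7}

def toy_encode(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids

def toy_decode(ids, vocab):
    """Map ids back to token strings and join them."""
    id_to_token = {i: t for t, i in vocab.items()}
    return "".join(id_to_token[i] for i in ids)

ids = toy_encode("She saw", toy_vocab)
print(ids)
print(toy_decode(ids, toy_vocab))
```

As in the real tokenizer output, the leading space belongs to the token `' saw'`, and a string that is not in the vocabulary as a whole is broken into smaller pieces.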

## Examining raw transformer model outputs

A transformer `model` is a neural network that takes a sequence of tokens as inputs and produces a grid of numbers, `logits`, as outputs.



In [None]:
transformer_outputs = model(input_ids).logits
print(transformer_outputs.shape)
print(transformer_outputs[0, :, :6])


## Applying softmax and examining probabilities

Each row of numbers is the raw output of the neural network classifier, a 50257-way classifier with one score per token in the vocabulary. To convert the scores to probabilities, we put them through a softmax.

Each row predicts the token that follows the corresponding input token, given all of the tokens before it (since we provided two tokens, we get two predictions). To generate, we are only interested in the last row.

The following function gets these probabilities; we run it and print out the top 10 probabilities.

In [None]:
def get_probabilities(input_ids):
    transformer_outputs = model(input_ids).logits
    probabilities = torch.nn.functional.softmax(transformer_outputs, dim=-1)
    return probabilities[:, -1, :]

probabilities = get_probabilities(input_ids)

print(f"after \"{tokenizer.decode(input_ids[0])}\":")
for v, k in zip(*probabilities[0].topk(10)):
    print(f"{tokenizer.decode([k])}: {v.item():.3f}")
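
For reference, softmax itself is a short computation. The following is a minimal pure-Python sketch on made-up logits, not the PyTorch implementation; subtracting the maximum first is a common trick for numerical stability and does not change the result.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for a 4-token vocabulary
toy_softmax = softmax(toy_logits)
print([round(p, 3) for p in toy_softmax])
print(sum(toy_softmax))
```

Softmax preserves the ranking of the scores: the largest logit always gets the largest probability, and every token gets a probability above zero.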

## Generating text

We can iteratively sample from the neural language model to generate text.

In [None]:
output_ids = input_ids

# Generation loop
for _ in range(50):
    # Get the next-token probabilities at the last position
    probabilities = get_probabilities(output_ids)
    next_token = torch.multinomial(probabilities, num_samples=1)

    # Append the sampled token to the output_ids
    output_ids = torch.cat([output_ids, next_token], dim=-1)

# Decode and print the generated text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)


Take a close look at the output from the neural language model.

Try running it a few times.

What do you think of the generated text?

How does it compare to the trigram model output?


## Looking inside the transformer model

Let's look inside the neural language model. What is inside a transformer?

In [None]:
model

## Understanding an encoder layer

The first thing to understand is that the transformer uses an *encoder* (also commonly called a token embedding layer) to translate each input token into a vector. The encoder layer is called `model.transformer.wte`.

Let's use the encoder to encode the two tokens `She saw`:

In [None]:
embeddings = model.transformer.wte(input_ids)
print(embeddings)
print(embeddings.shape)

Notice that the embeddings are 768-dimensional vectors. How does this compare to the vocabulary size?
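
Conceptually, an embedding layer of this kind is a lookup table with one row per vocabulary entry. The sketch below uses a made-up table of 5 tokens with 3 dimensions each (the real table has one 768-dimensional row for each of the 50257 tokens), and the helper names are hypothetical. It also shows that the lookup gives the same result as multiplying a one-hot vector by the table.

```python
# Hypothetical tiny embedding table: 5 vocabulary entries, 3 dimensions each.
embedding_table = [
    [0.1, 0.0, -0.2],
    [0.4, 0.3, 0.9],
    [-0.5, 0.2, 0.0],
    [0.7, -0.1, 0.6],
    [0.0, 0.8, -0.3],
]

def embed(token_ids, table):
    """Look up one row of the table per token id."""
    return [table[i] for i in token_ids]

def one_hot_matmul(token_id, table):
    """The same lookup written as a one-hot vector times the table."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][d] for i in range(len(table))) for d in range(dim)]

vectors = embed([3, 1], embedding_table)
print(vectors)
print(one_hot_matmul(3, embedding_table))
```

The table has vocabulary-size rows but only embedding-size columns, so each token is compressed from a one-of-many choice into a much shorter dense vector.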

## Inverting the embedding matrix

We can transpose the embedding matrix to get an approximate inversion that will tell us "which token has the embedding closest to a vector".

Here we will get the embedding vectors for Paris, France, and Tokyo, and invert them back to the original tokens.

In [None]:
test_ids = tokenizer.encode(" Paris France Tokyo", return_tensors="pt")
embeddings = model.transformer.wte(test_ids)
inverse_embeddings = model.transformer.wte.weight.t()
print(embeddings)
print(embeddings.shape)


scores = embeddings @ inverse_embeddings
for v, k in zip(*scores[0].max(dim=-1)):
    print(f"{tokenizer.decode([k])}: {v.item():.3f}")
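
The same "multiply by the transpose, then take the best score" idea can be sketched on a tiny hand-written table. The values and helper names here are assumptions chosen so that every row scores highest against itself. That is not guaranteed in general, since a dot product also favors rows with large norms, which is one reason the inversion is only approximate.

```python
# Hypothetical embedding table: 4 tokens, 3 dimensions.
toy_table = [
    [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
    [0.6, 0.6, 0.0],
]

def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

token_ids = [2, 0, 3]
vectors = [toy_table[i] for i in token_ids]

# One score per (input vector, vocabulary entry) pair.
scores = matmul(vectors, transpose(toy_table))

# The best-scoring vocabulary entry for each input vector.
recovered = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(recovered)
```

Each row of `scores` has one entry per vocabulary item, mirroring the shape of the score matrix computed from the real embedding table.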

## Semantic vector algebra

What should we get if we calculate the vector "Tokyo - Paris + France"?

Try it below.  The code prints the 10 tokens with the closest embeddings.

These types of vector algebra relationships tend to emerge naturally when the embedding vectors are learned using neural network training.

In [None]:
vec = embeddings[0, 2, :] - embeddings[0, 0, :] + embeddings[0, 1, :]
scores = vec[None] @ inverse_embeddings
for v, k in zip(*scores[0].topk(10)):
    print(f"{tokenizer.decode([k])}: {v.item():.3f}")
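
To see why this kind of arithmetic can work at all, consider a hand-built example in which the dimensions have an obvious meaning. Everything here is an assumption for illustration: the three dimensions (is-a-capital, France-related, Japan-related), the four vectors, and the use of cosine similarity in place of the raw dot product used in the cell above. Learned embeddings are far noisier, and the nearest tokens may include the input words themselves.

```python
import math

# Hypothetical vectors; dimensions: [is a capital, France-related, Japan-related]
toy_vectors = {
    "Paris":  [1.0, 1.0, 0.0],
    "France": [0.0, 1.0, 0.0],
    "Tokyo":  [1.0, 0.0, 1.0],
    "Japan":  [0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tokyo - Paris + France: remove the "France" part of Paris, keep Tokyo's country.
query = [
    t - p + f
    for t, p, f in zip(toy_vectors["Tokyo"], toy_vectors["Paris"], toy_vectors["France"])
]

# Rank every word by how closely its vector points in the query's direction.
ranked = sorted(toy_vectors, key=lambda w: cosine(query, toy_vectors[w]), reverse=True)
print(query)
print(ranked)
```

In this toy setup the subtraction cancels the shared "capital" component, leaving a vector that points at the country associated with Tokyo.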