# Logits in Causal Language Models

Task: Ask a language model for how likely each token is to be the next one.

## Setup

We start in the same way as the tokenization notebook:

In [1]:
# If the import fails, uncomment the following line:
# !pip install transformers
import torch
from torch import tensor
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd

One step in this notebook will ask you to write a function. The most common error when function-ifying notebook code is accidentally using a global variable instead of a value computed in the function. This is a quick and dirty little utility to check for that mistake. (For a more polished version, check out [`localscope`](https://localscope.readthedocs.io/en/latest/README.html).)

In [2]:
def check_global_vars(func, allowed_globals):
    import inspect
    used_globals = set(inspect.getclosurevars(func).globals.keys())
    disallowed_globals = used_globals - set(allowed_globals)
    if len(disallowed_globals) > 0:
        raise AssertionError(f"The function {func.__name__} used unexpected global variables: {list(disallowed_globals)}")
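
As a quick illustration of what this check catches, here is a minimal sketch. The names `scale`, `uses_global`, and `uses_argument` are hypothetical and exist only for this example; the utility is repeated so the snippet runs on its own.

```python
import inspect

def check_global_vars(func, allowed_globals):
    # Same logic as the utility above, repeated so this sketch is self-contained.
    used_globals = set(inspect.getclosurevars(func).globals.keys())
    disallowed_globals = used_globals - set(allowed_globals)
    if len(disallowed_globals) > 0:
        raise AssertionError(
            f"The function {func.__name__} used unexpected global variables: "
            f"{list(disallowed_globals)}"
        )

scale = 2  # hypothetical notebook-level variable

def uses_global(x):
    # Bug: silently reads the global `scale`.
    return x * scale

def uses_argument(x, scale):
    # Fixed: `scale` is a parameter, so no globals are touched.
    return x * scale

check_global_vars(uses_argument, allowed_globals=[])  # passes silently

try:
    check_global_vars(uses_global, allowed_globals=[])
    caught = None
except AssertionError as err:
    caught = str(err)

print(caught)
```

The first call returns without complaint; the second raises, and the message names the offending variable.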

Download and load the model.

In [3]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True) # smaller version of GPT-2
# Alternative to add_prefix_space is to use `is_split_into_words=True`
# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("distilgpt2", pad_token_id=tokenizer.eos_token_id)

In [4]:
print(f"The tokenizer has {len(tokenizer.get_vocab())} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")

The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.


## Task

In the tokenization notebook, we simply used the `generate` method to have the model generate some text. Now we'll do it ourselves.

Consider the following phrase:

In [5]:
phrase = "This weekend I plan to"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

1: Call the `tokenizer` on the phrase to get a `batch` that includes `input_ids`.

In [6]:
batch = tokenizer(phrase, return_tensors='pt')
input_ids = batch['input_ids']

2: Call the `model` on the `input_ids`. Examine the shape of the logits.

In [7]:
with torch.no_grad(): # This tells PyTorch we don't need it to compute gradients for us.
    model_output = model(input_ids)
print(f"logits shape: {list(model_output.logits.shape)}")

logits shape: [1, 5, 50257]


3: Pull out the logits corresponding to the *last* token in the input phrase. Hint: Think about what each number in the shape means.

Note: The `model` returns a dictionary-like object. The `logits` are in `model_output.logits`.

In [10]:
last_token_logits = model_output.logits[0, -1, :]
assert last_token_logits.shape == (len(tokenizer.get_vocab()),)

4: Identify the token id and corresponding string of the most likely next token.

To find the most likely token, we need to find the *index* of the *largest value* in the `last_token_logits`. The method that does this is called `argmax`. (It's a common enough operation that it's built into PyTorch.)

Note: The `tokenizer` has a `decode` method that takes a token id, or a list of token ids, and returns the corresponding string.

In [11]:
last_token_probabilities = last_token_logits.softmax(dim=-1)
# dim=-1 means to compute the softmax over the last dimension

# Find the most likely token
most_likely_token_id = last_token_probabilities.argmax().item()
decoded_token = tokenizer.decode(most_likely_token_id)
probability_of_most_likely_token = last_token_probabilities[most_likely_token_id].item()

# Print the results
print("For the phrase:", phrase)
print(f"Most likely next token: {most_likely_token_id}, which corresponds to {repr(decoded_token)}, with probability {probability_of_most_likely_token:.2%}")

For the phrase: This weekend I plan to
Most likely next token: 467, which corresponds to ' go', with probability 5.98%
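
To see what the softmax and argmax steps compute, here is a plain-Python sketch. The four logit values are made up for illustration, and the helpers `softmax` and `argmax` are local to this sketch rather than the PyTorch methods used above. Because softmax preserves ordering, the argmax of the logits and the argmax of the probabilities are the same index.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # Index of the largest value.
    return max(range(len(values)), key=lambda i: values[i])

toy_logits = [-77.9, -74.2, -79.8, -76.7]  # made-up scores for a 4-token vocabulary
toy_probs = softmax(toy_logits)

print([round(p, 4) for p in toy_probs])      # non-negative and sums to 1
print(argmax(toy_logits), argmax(toy_probs)) # same index both times
```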


5: Use the `topk` method to find the top-10 most likely choices for the next token.

See the documentation for [`torch.topk`](https://pytorch.org/docs/stable/generated/torch.topk.html). Calling `topk` on a tensor returns a named tuple with two tensors: `values` and `indices`. The `values` are the top-k values, and the `indices` are the indices of those values in the original tensor. (In this case, the indices are the token ids.)
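
To make that return value concrete, here is a plain-Python sketch of what a top-k operation computes. The function `topk` below is a hypothetical stand-in written for this example (it returns a `namedtuple` with the same two field names); it is not the PyTorch implementation, and the scores are made up.

```python
from collections import namedtuple

TopK = namedtuple("TopK", ["values", "indices"])

def topk(scores, k):
    # Sort positions by score, largest first, and keep the first k.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return TopK(values=[scores[i] for i in order], indices=order)

toy_scores = [0.1, 2.5, -1.0, 1.7, 0.9]  # made-up scores
result = topk(toy_scores, 3)
print(result.values)   # the 3 largest scores, in descending order
print(result.indices)  # their positions in toy_scores
```

The first entry of `indices` is the argmax, which is why the cell below can compare it against the earlier result.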

*Note*: This uses Pandas to make a nicely displayed table, and a *list comprehension* to decode the tokens. You don't *need* to understand how this all works, but I highly encourage thinking about what's going on.

In [14]:
most_likely_tokens = last_token_logits.topk(10)
print(f"most likely token index from topk is {most_likely_tokens.indices[0]}") # this should be the same as argmax
decoded_tokens = [tokenizer.decode(i) for i in most_likely_tokens.indices]
probabilities_of_most_likely_tokens = last_token_probabilities[most_likely_tokens.indices]

# Make a nice table to show the results
most_likely_tokens_df = pd.DataFrame({
    'tokens': decoded_tokens,
    'probabilities': probabilities_of_most_likely_tokens,
})
# Show the table, in a nice formatted way (see https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Builtin-Styles)
# Caution: this "gradient" has *nothing* to do with gradient descent! (It's a color gradient.)
most_likely_tokens_df.style.hide(axis="index").background_gradient()


most likely token index from topk is 467



| tokens | probabilities |
| --- | --- |
| go | 0.059828 |
| take | 0.04388 |
| spend | 0.03157 |
| make | 0.030519 |
| do | 0.029206 |
| be | 0.02796 |
| attend | 0.025885 |
| visit | 0.025827 |
| run | 0.022074 |
| have | 0.020955 |


6: Write a function that is given a phrase and a *k* and returns the top *k* most likely next tokens.

Build this function using only code that you've already filled in above. Clean up the code so that it doesn't do or display anything extraneous. Add comments about what each step does.


In [19]:
def predict_next_tokens(phrase, k):
    # Tokenize the phrase
    input_ids = tokenizer.encode(phrase, return_tensors='pt')

    # Run the model and convert the last position's logits to probabilities
    with torch.no_grad():
        output = model(input_ids)
        last_token_logits = output.logits[0, -1]
        last_token_probabilities = torch.softmax(last_token_logits, dim=0)

    # Get the top k most likely next tokens
    topk_tokens = last_token_logits.topk(k)
    decoded_tokens = [tokenizer.decode(token) for token in topk_tokens.indices]
    probabilities = last_token_probabilities[topk_tokens.indices]

    # Create a pandas dataframe to display the results
    tokens_df = pd.DataFrame({
        'tokens': decoded_tokens,
        'probabilities': probabilities
    })

    return tokens_df



check_global_vars(predict_next_tokens, allowed_globals=["torch", "tokenizer", "pd", "model"])

In [20]:
predict_next_tokens("This weekend I plan to", 5).style.hide(axis="index").background_gradient()


| tokens | probabilities |
| --- | --- |
| go | 0.059828 |
| take | 0.04388 |
| spend | 0.03157 |
| make | 0.030519 |
| do | 0.029206 |


In [21]:
predict_next_tokens("To be or not to", 5).style.hide(axis="index").background_gradient()


| tokens | probabilities |
| --- | --- |
| be | 0.648473 |
| have | 0.021346 |
| the | 0.012962 |
| do | 0.009471 |
| , | 0.007444 |


In [22]:
predict_next_tokens("For God so loved the", 5).style.hide(axis="index").background_gradient()


| tokens | probabilities |
| --- | --- |
| world | 0.120081 |
| Lord | 0.072237 |
| people | 0.050226 |
| earth | 0.04275 |
| children | 0.026753 |


## Analysis


Q1: Explain the shape of `model_output.logits`.

In [23]:
print(model_output.logits)

tensor([[[-32.2573, -30.3204, -32.3731,  ..., -43.4906, -43.1277, -32.2732],
         [-58.8898, -62.1764, -65.4265,  ..., -69.8106, -67.6983, -63.7279],
         [-74.3908, -75.3577, -77.8408,  ..., -84.0086, -74.2405, -76.7100],
         [ -3.6331,  -5.7036,  -9.3678,  ..., -15.0274, -14.9595,  -7.1032],
         [-77.8725, -79.7685, -82.1183,  ..., -88.5235, -86.5616, -79.6716]]])


`model_output.logits` is a tensor of shape `[batch_size, sequence_length, vocabulary_size]`, here `[1, 5, 50257]`: one phrase in the batch, five tokens in the phrase, and one score for each of the 50,257 entries in the tokenizer's vocabulary. Each score is a logit, an unnormalized log-probability that softmax turns into a probability. The row at a given position holds the model's scores for the token that comes *after* that position, so the tensor contains next-token predictions for every position in the sequence, not just the last.
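
A toy version of the same layout, using nested Python lists in place of a tensor, may help. The sizes (1 sequence, 3 positions, 4 vocabulary entries) and the values are made up, and `shape` is a hypothetical helper written for this sketch.

```python
def shape(nested):
    # Walk down the first element of each level to read off the dimensions.
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

toy_logits = [[                  # batch dimension: 1 sequence
    [0.1, 0.2, 0.3, 0.4],        # position 0: one score per vocabulary entry
    [1.1, 1.2, 1.3, 1.4],        # position 1
    [2.1, 2.2, 2.3, 2.4],        # position 2 (last token)
]]

print(shape(toy_logits))
last_position_scores = toy_logits[0][-1]  # analogous to logits[0, -1, :]
print(last_position_scores)
```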

Q2: Change the -1 in `last_token_logits` to -2. What does the variable represent now? What does its argmax represent?

Changing the -1 to -2 selects the logits at the second-to-last position of the input. These are the model's scores for the token that follows the first four tokens, that is, for the position occupied by the last token of the phrase. The argmax is the id of the token the model considers most likely to fill that last position, given only the tokens before it. We can compare it with the token that actually appears there.
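
The alignment between positions and predictions can be sketched without running the model. In a causal language model, the logits at position *i* score candidates for the token at position *i + 1*. The token strings below are an assumed split of the phrase, chosen to match the five positions in the logits shape; the actual pieces come from the tokenizer. `describe_position` is a hypothetical helper.

```python
tokens = ["This", " weekend", " I", " plan", " to"]  # assumed 5-token split

def describe_position(tokens, index):
    # Normalize negative indices the way Python indexing does.
    pos = index % len(tokens)
    # The model at `pos` has seen tokens 0..pos.
    context = "".join(tokens[: pos + 1])
    # The logits at `pos` score candidates for the token at `pos + 1`.
    if pos + 1 < len(tokens):
        target = tokens[pos + 1]
    else:
        target = None  # beyond the input: the genuinely new next token
    return context, target

print(describe_position(tokens, -1))
print(describe_position(tokens, -2))
```

With index -1 there is no known target, so the logits describe a new token; with index -2 the target is the phrase's own last token.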

Q3: Let's think. The method in this notebook only gets the scores for *one* next token at a time. What if we wanted to generate a whole sentence? We’d have to generate a token for each word in that sentence. What are a few different ways we could adapt the approach used in this notebook to generate a complete sentence?

To think about different ways to do this, think about what decision(s) you have to make when generating each token.

Note: you don't have to write any code to answer this question.

There are several ways to adapt the approach in this notebook to generate a complete sentence. The basic loop is the same in each: run the model on the text so far, choose a next token, append it, and run the model again on the longer text. The decision at each step is how to choose. Greedy decoding always takes the most likely token (the argmax). Sampling draws the next token at random according to the predicted probabilities, and temperature scaling divides the logits by a constant before the softmax to make that draw more or less random. Beam search instead keeps several candidate sequences at each step and returns the one with the highest overall probability. We also have to decide when to stop, for example at an end-of-sequence token or a maximum length.
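
As a rough sketch of two of these options, the loop below generates tokens one at a time with either a greedy or a sampling rule. Everything here is hypothetical: `TOY_LOGITS` is a made-up lookup table standing in for the model (and it only looks at the last token, whereas the real model conditions on the whole context), and the function names are chosen for this example. Beam search is left out to keep the sketch short.

```python
import math
import random

# Hypothetical stand-in for the model: made-up logits over a tiny vocabulary.
TOY_LOGITS = {
    "<start>": {"I": 2.0, "we": 1.0, "<end>": -1.0},
    "I": {"plan": 2.0, "go": 1.5, "<end>": -1.0},
    "we": {"plan": 1.0, "go": 2.0, "<end>": -1.0},
    "plan": {"trips": 1.0, "<end>": 2.0},
    "go": {"home": 2.0, "<end>": 1.0},
    "trips": {"<end>": 3.0},
    "home": {"<end>": 3.0},
}

def next_token_logits(context):
    # A real model would use the full context; this toy uses only the last token.
    return TOY_LOGITS[context[-1]]

def softmax(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(choose, max_tokens=10):
    # Shared loop: score candidates, choose one, append it, repeat.
    context = ["<start>"]
    for _ in range(max_tokens):
        token = choose(next_token_logits(context))
        if token == "<end>":
            break
        context.append(token)
    return context[1:]

def greedy(logits):
    # Always take the highest-scoring token.
    return max(logits, key=logits.get)

def make_sampler(temperature, rng):
    def sample(logits):
        # Draw one token at random, weighted by its softmax probability.
        tokens = list(logits)
        probs = softmax([logits[t] for t in tokens], temperature)
        return rng.choices(tokens, weights=probs, k=1)[0]
    return sample

greedy_sentence = generate(greedy)
sampled_sentence = generate(make_sampler(temperature=1.5, rng=random.Random(0)))
print(greedy_sentence)
print(sampled_sentence)
```

Greedy decoding gives the same output on every run, while the sampler's output depends on the random seed and the temperature.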