
**This exercise is part of the [Recurrent network for entailment](https://www.kaggle.com/code/datasniffer/nlp-recurrent-networks-for-entailment/notebook).**

---

# How advanced language models generate text


The BERT model [is _non-causal_](https://huggingface.co/blog/bert-101), meaning that it has access to future words (or rather, _tokens_) when predicting the likelihood of other words in context. GPT2, in contrast, is a purely causal model: it only has access to past words in a sequence when predicting future words (hence the term "causal"). We will use it to demonstrate how neural networks can generate realistic-sounding text.


Let's get started. First, run the next code cell.
    
    


In [None]:
import pandas as pd
import numpy as np

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools_nlp_utility import *

print("\nSetup complete")


# GPT2

GPT stands for Generative Pre-trained Transformer. _Generative_ implies that the model is intended to represent the full joint probability distribution of every possible sequence of words (or rather, _tokens_), instead of only modeling the conditional probability of a single word given a context.

Recurrent neural network layers for contextualizing word embeddings are slow. This is mainly because each recurrent layer needs to process one word embedding after the other, in sequence. This fundamentally prevents parallelization of the computations across multiple processor cores, GPUs, or TPUs. Instead of RNN layers, GPT2 uses a mechanism called "_self attention_", which does allow for parallelization.

"Self attention" is similar but not identical to the gating process that takes place in LSTMs and GRUs: the sequence of input embeddings is turned into a new sequence of embeddings in the form of weighted averages of the input embeddings. The idea is that each weighted average represents a different semantic unit in the input sequence. The weights for computing these averages are computed from the input embeddings using a feed forward layer that is trained simultaneously with the rest of the network. The output from this is called an "_attention head_", and often multiple such attention heads are run in parallel, in which case their outputs are concatenated. If the (concatenated) sequence of attention head outputs is then passed through a feed forward layer, we speak of a (self-attending) "_transformer_" block. GPT2 (as well as BERT) stacks multiple such transformer blocks on top of each other; hence the term Transformer in its name.

<details><summary>Self-attention and transformer in some detail</summary>
Let $u_j, j=0, \ldots, n-1$ be the sequence of word embeddings, and let $U = [u_0, u_1, \ldots, u_{n-1}]$. Then the attention weights are computed from
<br /><br />
<!--
$$W_\text{attention} = \mathrm{softmax}((W_\text{q}U+b_\text{q})'(W_\text{k}U + b_\text{k})
= \mathrm{softmax}((U'W_\text{q}'+1 b_\text{q}')(W_\text{k}U + b_\text{k}1')) \\
= \mathrm{softmax}(U'W_\text{q}'(W_\text{k}U + b_\text{k}1')+1 b_\text{q}'(W_\text{k}U + b_\text{k}1')) \\
= \mathrm{softmax}(U'W_\text{q}'W_\text{k}U + U'W_\text{q}'b_\text{k}1' + 1 b_\text{q}'W_\text{k}U + 1 b_\text{q}'b_\text{k}1') \\
$$
-->

$$
\begin{align*}
Q &= W_\text{q} U + b_\text{q} 1_n' \\
K &= W_\text{k} U + b_\text{k} 1_n' \\
V &= W_\text{v} U + b_\text{v} 1_n' \\
W_\text{attention} &= \mathrm{softmax}(Q'K / \sqrt{d}) 
\end{align*}
$$
    
and the output sequence from the attention head is computed as
    
$$O = V W_\text{attention}',$$

so that column $i$ of $O$ is the average of the columns of $V$, weighted by row $i$ of $W_\text{attention}$.
    
Here $d$ is the number of rows of $Q$ and $K$ (which may or may not equal the number of rows of $U$, i.e. the input word embedding dimension). Furthermore, the $\mathrm{softmax}$ function normalizes _by row_ in the matrix $Q'K/\sqrt{d}$, and _not_ over all the values in the matrix. If there are multiple _attention heads_ with outputs $O_1, O_2, \ldots, O_h$, each with their own set of parameters $W_\text{q}, b_\text{q}, W_\text{k}, b_\text{k}, W_\text{v}, b_\text{v}$, then these outputs are concatenated:

$$ O = \begin{pmatrix} O_1 \\ O_2 \\ \vdots \\ O_h \end{pmatrix}. $$

When $O$ is passed through a dense feed forward layer and normalized, this constitutes a _transformer_ block:

$$ T = W_\text{transformer}\sigma(W_\text{ffw} O + b_\text{ffw}) + b_\text{transformer}, $$

where $\sigma$ can be any activation function, but usually is a ReLU or GeLU activation function.
</details><br />
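
The formulas in the collapsed section can be tried out on toy data. The following is a minimal NumPy sketch of a single attention head, not GPT2's actual implementation: the function names, the toy dimensions, and the `causal` flag are assumptions made for illustration. Row $i$ of the attention weights is used to average the columns of $V$ into output column $i$. With `causal=True`, the scores of future tokens are masked out so that each position only attends to itself and to earlier positions, which is the behaviour a causal model needs.

```python
import numpy as np

def softmax_rows(x):
    """Softmax applied independently to each row of a 2-D array."""
    x = x - x.max(axis=1, keepdims=True)  # subtract row maxima for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def attention_head(U, Wq, bq, Wk, bk, Wv, bv, causal=False):
    """Single self-attention head; the columns of U are the token embeddings."""
    Q = Wq @ U + bq[:, None]        # d x n
    K = Wk @ U + bk[:, None]        # d x n
    V = Wv @ U + bv[:, None]        # d x n
    d = Q.shape[0]
    scores = Q.T @ K / np.sqrt(d)   # n x n, entry (i, j) compares token i with token j
    if causal:
        # hide future tokens (j > i): token i may only attend to tokens 0..i
        n = scores.shape[0]
        keep = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(keep, scores, -np.inf)
    W_attention = softmax_rows(scores)  # each row sums to 1
    O = V @ W_attention.T               # column i = columns of V weighted by row i
    return O, W_attention

# toy data: n = 5 tokens with embedding dimension m = 8, head dimension d = 4
rng = np.random.default_rng(0)
m, d, n = 8, 4, 5
U = rng.normal(size=(m, n))
Wq, Wk, Wv = (rng.normal(size=(d, m)) for _ in range(3))
bq, bk, bv = (rng.normal(size=d) for _ in range(3))

O, W_attention = attention_head(U, Wq, bq, Wk, bk, Wv, bv, causal=True)
print(O.shape)
print(np.round(W_attention, 2))
```

In the printed weight matrix, everything above the diagonal is zero when `causal=True`, and each row still sums to one.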

    
We can import a pretrained version of a GPT2 model from the HuggingFace `transformers` library:
    


In [None]:
import tensorflow as tf
from transformers import TFAutoModelForCausalLM
gpt2_model = TFAutoModelForCausalLM.from_pretrained("gpt2")
gpt2_model.trainable = False


# Step 1: Tokenization and encoding

As is always the case with text, we need to turn the text into a sequence of tokens, and then encode those sequences in a TensorFlow-friendly format: `numpy` arrays (or `tensorflow` tensors). The easiest way is to use the tokenizers that come with the pretrained models from the `transformers` library:
    


In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [None]:
seq = "Machine learning with TensorFlow can do amazing" # unfinished sentence
inpts = tokenizer(seq, return_tensors="tf")
inpts


The tokenizer knows whole words as well as parts of words, so that unfamiliar words can also be represented. In fact, it even encodes single characters, and it may consider a space to be part of a token:
    


In [None]:
inpt_ids = inpts["input_ids"]  # just token IDs, no attention mask

print(f"{'token id': >10} {'token ': >13}\n{'  ':-<10} {'  ':-<13}")
for id in inpt_ids[0].numpy():
    word = tokenizer.decode(id)
    print(f"{id: >10} {'`'+word+'`': >13}")
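
To get a feel for how a subword vocabulary can cover an unfamiliar word such as "TensorFlow", here is a toy sketch. It is a deliberate simplification and not GPT2's real tokenization algorithm: the function `greedy_tokenize` and the tiny `toy_vocab` are hypothetical, and the sketch simply picks the longest vocabulary entry that matches at each position, falling back to a single character when nothing matches.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest matching vocabulary entries, left to right."""
    tokens = []
    pos = 0
    while pos < len(text):
        # try the longest candidate first, down to a single character
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab or end == pos + 1:  # single characters are always allowed
                tokens.append(piece)
                pos = end
                break
    return tokens

# hypothetical mini vocabulary; note that a leading space is part of some tokens
toy_vocab = {"Machine", " learning", " with", " T", "ensor", "Flow", " can"}
text = "Machine learning with TensorFlow can"
tokens = greedy_tokenize(text, toy_vocab)
print(tokens)
# → ['Machine', ' learning', ' with', ' T', 'ensor', 'Flow', ' can']
```

Joining the tokens back together reproduces the original string exactly, because the spaces are carried inside the tokens.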


The tokenizer has attributes regarding the vocabulary, such as the `vocab` attribute. Can you figure out how many tokens the tokenizer's vocabulary contains?
    


In [None]:
# Check your answer (Run this code cell to receive credit!)
part_1.solution()

# Step 2: Predicting the _next word_ probabilities


We can pass the tokenized and encoded string directly into `gpt2_model`. The output is an object that has a `logits` attribute. Logits are the network outputs _before_ applying the _softmax_ activation function. The `logits` attribute is an array with `shape = [batch_size, seq_len, vocab_size]`, and so for a single input string it is essentially a matrix of dimensions `seq_len` by `vocab_size`.
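
As a small illustration of the relation between logits and probabilities, the sketch below applies a softmax over the last axis of a toy array with the same `[batch_size, seq_len, vocab_size]` layout. The array is random and the sizes are made up; no GPT2 model is involved.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Turn logits into probabilities along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
batch_size, seq_len, vocab_size = 1, 3, 6  # toy sizes, far smaller than GPT2's
toy_logits = rng.normal(size=(batch_size, seq_len, vocab_size))

probs = softmax(toy_logits)
print(probs.shape)           # same shape as the logits
print(probs[0].sum(axis=1))  # every row is a probability distribution over the vocabulary
```

Each of the `seq_len` rows sums to one on its own: the softmax is taken per position, not over the whole matrix.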



In [None]:
output = gpt2_model(inpt_ids)
output.logits


The output format is described briefly in the [documentation](https://huggingface.co/docs/transformers/v4.23.1/en/main_classes/output#transformers.modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions).

What is important to understand is that each row in the matrix specifies the _conditional_ probability for the next token _given_ the sequence of tokens so far, for each token in the vocabulary. For example, the second row encodes the probability $P(T\, |\, \mathtt{Machine, learning})$ where $T$ is any token in the vocabulary, and hence, it can be used to compute the probability $P(\mathtt{will}\,|\, \mathtt{Machine, learning})$ for instance, or the probability $P(\mathtt{has}\, |\, \mathtt{Machine, learning})$. The token $T$ for which this _conditional_ probability is the highest, is the most likely token to follow the sequence `[Machine, learning]`.

The last row in the matrix specifies the _conditional_ probability for the next token _given_ the entire sequence so far: 

$$P(T\,|\,\mathtt{Machine, learning, with, T, ensor, Flow, can, do, amazing}),$$

where $T$ is a token in the vocabulary. (If you wonder about the split between `T`, `ensor`, and `Flow`, look back at the encoding above.)

So why does `logits` not just give the last row of this matrix, but all the intermediate probabilities $P(T|\mathtt{Machine}), P(T|\mathtt{Machine, learning}), P(T|\mathtt{Machine, learning, with})$, etc. as well? 

The answer lies in the fact that GPT2 is a _generative language model_, meaning that it wants to give the _joint_ probability $P(\mathtt{Machine, learning, with, T, ensor, Flow, can, do, amazing})$ and not (just) the conditional probability from the last row: The matrix of logits encodes the entire probability of observing the sequence of tokens. Recall from probability theory that 

$$
\begin{align}
    P(&\mathtt{Machine,learning,with,\ldots,amazing}) \\
    &= P(\mathtt{Machine})\cdot P(\mathtt{learning,with,\ldots,do,amazing}|\mathtt{Machine})\, \\
    &= P(\mathtt{Machine})\cdot P(\mathtt{learning}|\mathtt{Machine})\cdot P(\mathtt{with,\ldots,do,amazing}|\mathtt{Machine, learning}) \\
    &= P(\mathtt{Machine})\cdot P(\mathtt{learning}|\mathtt{Machine})\cdot P(\mathtt{with}|\mathtt{Machine,learning})
    \cdots P(\mathtt{amazing}|\mathtt{Machine,learning,with,\ldots,do})
\end{align}
$$

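Notice that the conditional probabilities on the right-hand side of the last line are represented by the rows of the `logits` attribute: row $i$ (counting from 0) holds the distribution of token $i+1$ given tokens $0$ through $i$. The one exception is the first factor, $P(\mathtt{Machine})$, which is not conditional on anything and therefore has no row of its own.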
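
The factorization can be turned into a few lines of code. The sketch below uses random toy logits and a made-up token ID sequence rather than GPT2 output, and the helper names are hypothetical. Row $i$ of the logits is read as the distribution of token $i+1$ given tokens $0, \ldots, i$, so the sum of the selected log-probabilities is the log-probability of the rest of the sequence given its first token. The first token's own probability is not included, and the last row (which concerns a token that has not been observed yet) is not used.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-probabilities computed from logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def conditional_log_probs(logits, token_ids):
    """log P(token_{i+1} | token_0, ..., token_i) for i = 0, ..., n-2."""
    log_probs = log_softmax(logits)          # shape (seq_len, vocab_size)
    rows = np.arange(len(token_ids) - 1)     # rows 0..n-2 predict tokens 1..n-1
    next_ids = np.asarray(token_ids[1:])
    return log_probs[rows, next_ids]

rng = np.random.default_rng(2)
token_ids = [3, 0, 4, 1]                     # made-up sequence over a 5-token vocabulary
toy_logits = rng.normal(size=(len(token_ids), 5))

cond = conditional_log_probs(toy_logits, token_ids)
log_prob_rest_given_first = cond.sum()       # chain rule: a product becomes a sum of logs
print(cond)
print(log_prob_rest_given_first)
```

Working with sums of log-probabilities rather than products of probabilities avoids numerical underflow for long sequences.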

#### The token predicted to follow 'amazing'


    
Let's try to find the token $T$ in the vocabulary that is the most likely to follow the text "_Machine learning with TensorFlow can do amazing_" according to GPT2. To a human this sentence screams that the word 'things' should follow, but does GPT2 think so too?


    
To do so, we can take the last row from `output.logits`, compute the conditional probabilities, and find the token for which the probability is largest. To compute the probabilities we can pass the logits through the `tf.keras.activations.softmax()` function. TensorFlow's `tf.argmax()` function then returns the index at which the probability is highest:
    


In [None]:
last_logits = output.logits[0,-1:,:]
conditional_probs = tf.keras.activations.softmax(last_logits)
token_id = tf.argmax(conditional_probs, axis=1)
tokenizer.decode(token_id)


So indeed GPT2 also thinks 'things' should follow the sentence! (The example was found in [this post](https://jamesmccaffrey.wordpress.com/2021/10/21/a-predict-next-word-example-using-hugging-face-and-gpt-2/).)

    
The above code does one thing that isn't necessary: passing the logits through the _softmax_ function. Instead of searching for the token with the highest conditional probability, we can simply search for the token with the highest logit, because the softmax preserves the ordering of the logits within a row.
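
A quick check on made-up numbers shows why the softmax can be skipped here: the exponential is strictly increasing and every entry in a row is divided by the same positive sum, so the position of the maximum does not change.

```python
import numpy as np

toy_logits = np.array([[2.0, -1.0, 0.5],
                       [0.1,  3.2, 3.1]])

# row-wise softmax
probs = np.exp(toy_logits) / np.exp(toy_logits).sum(axis=1, keepdims=True)

print(toy_logits.argmax(axis=1))  # → [0 1]
print(probs.argmax(axis=1))       # → [0 1]
```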

Let's do this for all the rows in `output.logits` to see the predicted token from the sequence of conditional probabilities that it represents:
    


In [None]:
# find the most likely next token IDs for each row of logits
pred_ids = tf.argmax(output.logits[0,:,:],axis=1)

# print the most likely token together with the sequence of tokens so far
for i, pred_id in enumerate(pred_ids):
    
    # decode the sequence of tokens so far
    past = "'" + tokenizer.decode(inpt_ids.numpy()[0][:i+1]) + "'"
    
    # decode the predicted next token
    pred_token = "'" + tokenizer.decode(pred_id) + "'"
    
    print(f"{i}: {past: <50} → {pred_token: <12}")


Note that the predicted output is always a plausible next word or token, except perhaps for the token predicted after 'Machine' (a period) and after 'Machine learning with TensorFlow' (a newline).

The actual token that comes after 'Machine learning with TensorFlow can' is 'do', and not the predicted 'be'. Compute the _conditional_ probability of the token 'do', and compare it to that of the token 'be'. 
    


In [None]:
# compute the conditional probabilities of the tokens given 'Machine learning with TensorFlow can'
# YOUR CODE (approx. 1 line of code)
probs = tf.keras.activations.softmax(output.logits[0,6:7,:])[0,:].numpy()

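# tokenizer.encode returns a list of IDs; each of these strings is a single token, so take the first entry
token_id_of_do = tokenizer.encode(' do')[0]
token_id_of_be = tokenizer.encode(' be')[0]
ratio = probs[token_id_of_be] / probs[token_id_of_do]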

print(f"The token ' be' is {ratio} times more likely than the token ' do' to follow \"Machine learning with TensorFlow can\"")

In [None]:
# Check your work (Run this to get points!)
part_2.check()

# You can ask for a hint or the solution by uncommenting the following:
#part_2.hint()
#part_2.solution()


# Step 3: Generating text: iteratively predicting the _next word_

Let's use these principles to make GPT2 generate some text: starting from the prompt "_In the future_", we predict the next token as we did above, append that token to the prompt, and use the result as the new prompt. This is set up in the code below, but some code is missing. Complete the code and see what text is generated:
    


In [None]:
start_text = tokenizer.encode("In the future")

num_tokens_generated = 0
while tokenizer.decode(start_text[-1]) != '.': # stop as soon as the last predicted token is a period
    
    # store list of tokens as a numpy array 
    start_text_as_numpy = np.array(start_text, dtype="int32")
    
    ## predict the next token: 
    # compute logits, 
    logits = gpt2_model(start_text_as_numpy).logits
    
    # find the one with the largest logit
    # YOUR CODE (~ 1 line of code)
    most_likely_token = tf.argmax(logits,axis=1)[-1].numpy()
    
    # append the generated token to end of the sequence
    start_text += [most_likely_token]
    print(tokenizer.decode(start_text))
    
    num_tokens_generated += 1
    if num_tokens_generated > 30:
        break


In [None]:
# Check your work (Run this to get points!)
part_3.check()

# You can ask for a hint or the solution by uncommenting the following:
#part_3.hint()
#part_3.solution()


### Using `pipeline` to generate text with GPT2

The [&#129303; HuggingFace documentation](https://huggingface.co/gpt2?text=My+name+is+Thomas+and+my+main) explains that the GPT2 model can also be used for a text generating `pipeline`. 

As you may recall from the [first tutorial](https://www.kaggle.com/datasniffer/nlp-intro) in this course, a `pipeline` takes a string of text and turns it into whatever it is supposed to produce (e.g., a _sentiment classification_, an _answer to a question_, _more text_). A `pipeline` performs all the steps for transforming an input string into an output that we have been doing by hand in the past exercises:

1. tokenizing the string,
2. encoding the tokens,
3. passing the encoded sequence of tokens through a neural language model,
4. transforming the output into the desired result (e.g., a category label).


The "text-generation" `pipeline` essentially does what you have implemented in the above code cell. Let's see it in action:
    


In [None]:
import transformers
from transformers import pipeline
transformers.set_seed(42) # for repeatability

gpt2_generator = pipeline('text-generation', model='gpt2')

In [None]:
gpt2_generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5, pad_token_id=50256)


As you can see, we passed some extra arguments to the text generator. In particular, we passed the argument `num_return_sequences = 5`. As a result, we get 5 generated pieces of text, all starting with our prompt "Hello, I'm a language model," but each ending differently. You may have noticed that your own text generating code always returns the same answer to the same prompt. So how does `gpt2_generator` give different results?

The obvious answer is of course that it introduces randomness. The question then is just "how?"




# Step 4: Introducing random variation

Remember that in our own generating code we used logits to predict the most likely next token. We were able to turn these logits into probabilities by means of the `keras.activations.softmax()` function. Let's consider this in more detail. 
    
Say we have the sequence `["Hello", ",", "I", "'m", "a", "language", "model"]`. If we enter this sequence into the model, we receive as output an object that holds (among other things) the matrix of logits. The _last row_ of this matrix contains the logits $L_i$ for the token in the vocabulary with token ID $i$, _conditional_ on the input sequence, for all $i=0, \ldots, N_\text{vocab}-1$. Then

$$
\displaystyle P(\text{next token ID} = i\,|\,\mathtt{Hello, ,, I, }\text{'}\mathtt{m, a, language, model}) = {e^{L_i} \over \sum_{j=0}^{N_\text{vocab}-1} e^{L_j}}
$$

is the _conditional_ probability that the next token ID is $i$, _given_ the prompt "_Hello, I'm a language model_" as input. 

The sequence of probabilities $P(T\,|\,\mathtt{Hello, ,, I, 'm, a, language, model})$ for all tokens $T$ in the vocabulary forms a discrete probability distribution. To introduce randomness into the generated text without getting unlikely results, we can simply sample from this distribution. Next, we explore this idea further.
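
To make the idea of sampling from such a distribution concrete, here is a small NumPy sketch with four made-up logits standing in for a whole vocabulary. It converts the logits into probabilities, draws many token IDs at random, and compares the observed frequencies with the probabilities. The numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

toy_logits = np.array([2.0, 1.0, 0.0, -1.0])  # pretend vocabulary of 4 tokens
probs = np.exp(toy_logits - toy_logits.max())
probs /= probs.sum()                          # softmax

# draw 10,000 token IDs according to these probabilities
samples = rng.choice(len(probs), size=10_000, p=probs)
freqs = np.bincount(samples, minlength=len(probs)) / len(samples)

print(np.round(probs, 3))
print(np.round(freqs, 3))
```

Likely tokens are drawn often and unlikely tokens rarely, which is why sampling adds variation without producing mostly implausible text.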
    



#### Random sampling in TensorFlow

TensorFlow comes with a function `tf.random.categorical` that can be used to generate random draws from a discrete distribution as just discussed. It accepts as its first argument _not_ probabilities, but logits, arranged as a matrix with one row per distribution. Internally, these are used to sample in accordance with the probabilities obtained by applying the softmax function to each row of logits. A second argument specifies the sample size per row:


In [None]:
# obtain conditional logits for the next word
text = tokenizer.encode("Hello, I'm a language")
text_as_numpy = np.array(text, dtype="int32")
logits = gpt2_model(text_as_numpy).logits

# generate a random token from the conditional probability distribution
next_token = tf.random.categorical(logits, num_samples=1)[-1][0].numpy()
next_token, tokenizer.decode(next_token)


Using this random sampling function, we now introduce some variation into the sentences that are generated by our `while` loop. Complete the code below to do so:
    


In [None]:
random_text = tokenizer.encode("In the future")

num_tokens_generated = 0
while tokenizer.decode(random_text[-1]) != '.': # stop as soon as the last generated token is a period
    
    # store list of tokens as a numpy array 
    random_text_as_numpy = np.array(random_text, dtype="int32")
    
    # predict the next token: compute logits, then sample a token from the distribution they define
    logits = gpt2_model(random_text_as_numpy).logits
    # YOUR CODE (1 line of code)
    random_next_token = tf.random.categorical(logits, num_samples=1)[-1][0].numpy()
    
    # append the generated token to the end of the sequence
    random_text += [random_next_token]
    print(tokenizer.decode(random_text))
    
    num_tokens_generated += 1
    if num_tokens_generated > 30:
        break


It should be said that the way `gpt2_generator` introduces randomness in the generated text is a little different from the simple method used here.
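
Exactly how the generator samples is not spelled out here. Two widely used refinements of plain sampling are _temperature_ scaling (dividing the logits by a constant before the softmax) and _top-k_ sampling (only sampling among the `k` most likely tokens). The sketch below illustrates both on made-up logits. The function `sample_top_k` is hypothetical and not part of the `transformers` library, and whether the pipeline uses these particular settings is an assumption.

```python
import numpy as np

def sample_top_k(logits, k, temperature=1.0, rng=None):
    """Sample one token ID from the k highest logits after temperature scaling."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(logits, dtype=float) / temperature  # < 1 sharpens, > 1 flattens
    top_ids = np.argsort(scaled)[-k:]                       # IDs of the k largest logits
    top_logits = scaled[top_ids]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()                                    # softmax over the top k only
    return int(rng.choice(top_ids, p=probs))

rng = np.random.default_rng(0)
toy_logits = np.array([0.5, 3.0, -2.0, 2.5, 0.0])  # pretend vocabulary of 5 tokens

draws = [sample_top_k(toy_logits, k=2, temperature=0.7, rng=rng) for _ in range(200)]
print(sorted(set(draws)))  # only the two most likely token IDs can appear
```

Restricting the draw to the top `k` tokens removes the long tail of very unlikely tokens, while the temperature controls how strongly the most likely token is favoured.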

In [None]:
# Check your answer (Run this code cell to receive credit!)
part_4.solution()


# Congratulations!

You've finished the NLP course. It's an exciting field that has developed some of the most advanced deep learning models and that will help you make use of vast amounts of data you didn't know how to work with before.

This course should be just your introduction. Try a project with [text](https://www.kaggle.com/datasets?tags=14104-text+data). You'll have fun with it, and your skills will continue growing.
    
