As we would expect from an encoder-decoder architecture, T5 is a sequence-to-sequence or text-to-text model, i.e. its input is a sequence of tokens, and its output is a different sequence of tokens which, in general, does not have the same length as the input. In the training method used for T5, the inputs and labels for different tasks are all converted into pairs of text sequences, so that the same model can be applied to a large variety of tasks. The first part of each pair is fed into the encoder, while the second part (the labels) serves as the target for the decoder, which is then trained on these labels using teacher forcing.

Let us first load a pretrained T5 model and try an example. Specifically, we reproduce the example for a single training step given in the [Huggingface documentation](https://huggingface.co/docs/transformers/model_doc/t5). This reflects the training method used for unsupervised pretraining, which is called **span corruption**. With this method, the model receives an input in which spans of words are replaced by special tokens. The model is then trained to predict the content of these masked spans.

In [1]:
!pip3 install transformers==4.27.*
!pip3 install sentencepiece==0.1.97



In [2]:
#
# Import libraries and load model
#
import transformers
import torch
model_version  = "t5-small"
tokenizer = transformers.T5Tokenizer.from_pretrained(model_version, model_max_length=512)
model = transformers.T5ForConditionalGeneration.from_pretrained(model_version)

In [3]:
#
# Next we will prepare the encoder input and the decoder input. The encoder input is
# the masked sentence
#
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park").input_ids
input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(dim=0)
#
# The labels (the targets used to calculate the loss function) consist of the true values for each mask
# extra_id_0 ---> cute dog
# extra_id_1 ---> the
#
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>").input_ids
labels = torch.tensor(labels, dtype=torch.long).unsqueeze(dim=0)
#
# Run this through the model. The decoder inputs are derived from the labels argument
#
out = model(input_ids= input_ids, labels = labels)
print(out.__dict__.keys())
print(out.loss.item())

dict_keys(['loss', 'logits', 'past_key_values', 'decoder_hidden_states', 'decoder_attentions', 'cross_attentions', 'encoder_last_hidden_state', 'encoder_hidden_states', 'encoder_attentions'])
3.7837319374084473


Let us understand what is going on. First, we create our input, containing a masked version of the sentence "The cute dog walks in the park", with the two spans "cute dog" and "the" replaced by special tokens, so-called **sentinel tokens**. Then we create the corresponding target, which consists of each sentinel token followed by the expected content of the span it replaces, terminated by a final sentinel token. We then call the [forward method](https://github.com/huggingface/transformers/blob/f2cc8ffdaaad4ba43343aab38459c4208d265267/src/transformers/models/t5/modeling_t5.py#L1617) of the model.
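To make the relation between the masked input and the target more tangible, here is a minimal sketch that derives both strings from the original sentence. The helper `corrupt_spans` is hypothetical and works on whitespace-separated words with hand-picked spans, whereas the actual pre-training selects spans at random on the token level.

```python
def corrupt_spans(words, spans):
    """Replace each (start, end) word span by a sentinel token.

    `spans` must be sorted and non-overlapping, `end` is exclusive.
    Returns the masked input and the matching target string.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(words[cursor:start])   # keep the text before the span
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(words[start:end])     # followed by the hidden words
        cursor = end
    inputs.extend(words[cursor:])            # keep the remaining text
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)


words = "The cute dog walks in the park".split()
masked, target = corrupt_spans(words, [(1, 3), (5, 6)])
print(masked)
print(target)
```

Applied to our sentence, this yields exactly the two strings that we passed to the tokenizer above.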

The forward method first sends the input_ids through the encoder to obtain their representation in the internal model dimension, called encoder_outputs. It then takes the labels and shifts them by one position to the right by applying the method `_shift_right`. More precisely, the last token is dropped and the position that becomes free on the left is filled with a special token called the decoder start token. Note that the token that we lose in this way is the end-of-sentence token, while the token that we use to fill up is at the same time the decoder start token and the pad token.
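The shift can be sketched in a few lines of plain Python. This is a simplified stand-in for `_shift_right` operating on a list instead of a tensor, and the token ids in the example are made up, except for the final 1 which is the end-of-sentence id in the T5 vocabulary.

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    """Build decoder inputs from labels, in the spirit of T5's `_shift_right`."""
    # Prepend the start token and drop the last label (the end-of-sentence token)
    shifted = [decoder_start_token_id] + labels[:-1]
    # Positions marked with -100 (ignored by the loss) become ordinary padding
    return [pad_token_id if t == -100 else t for t in shifted]


# Hypothetical token ids, only the trailing 1 (end-of-sentence) is a real T5 id
labels = [901, 902, 903, 904, 1]
print(shift_right(labels))
```

At position i, the decoder therefore sees the labels up to position i - 1 and is trained to predict label i, which is what teacher forcing amounts to.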


During inference, we do not pass the labels but the input ids and the decoder input ids. Let us try this out. We use the same input ids as before. As decoder input, we use a sequence consisting only of the decoder start token. We then retrieve the logits from the model and take the argmax, corresponding to greedy search. Note that we need to add an extra batch dimension as dimension zero for both inputs as before.

In [4]:
decoder_input_ids = torch.tensor([model.config.decoder_start_token_id], dtype=torch.long).unsqueeze(dim = 0)
print(decoder_input_ids)
out = model(input_ids = input_ids, decoder_input_ids = decoder_input_ids)
logits = out.logits
print(logits.shape) # (B, L, V)
sample_idx = torch.argmax(logits[0, -1, :]).item()
print(tokenizer.convert_ids_to_tokens(sample_idx))

tensor([[0]])
torch.Size([1, 1, 32128])
<extra_id_0>


This is actually what we expect, as we want the output to be in the same form as the labels that we used before. Let us now proceed as follows: we concatenate our output to the decoder input and run through the same procedure again. We repeat this process until we have sampled a given maximum number of tokens or reach an end-of-sentence token.

In [5]:
def sample_from_input_string(input_string, model, tokenizer):
    input_ids = tokenizer(input_string).input_ids
    input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(dim = 0)
    #
    # first iteration
    #
    decoder_input_ids = torch.tensor([model.config.decoder_start_token_id], dtype=torch.long).unsqueeze(dim = 0)
    logits = model(input_ids = input_ids, decoder_input_ids = decoder_input_ids).logits
    sample_idx = torch.argmax(logits[0, -1, :]).item()
    #
    # additional iterations
    #
    for i in range(10):
        sample = torch.tensor([sample_idx], dtype=torch.long).unsqueeze(dim=0)
        decoder_input_ids = torch.cat((decoder_input_ids, sample), dim = 1)
        logits = model(input_ids = input_ids, decoder_input_ids = decoder_input_ids).logits
        sample_idx = torch.argmax(logits[0, -1, :]).item()
        if sample_idx == model.config.eos_token_id:
            break
    #
    # Decode
    #
    outputs = decoder_input_ids[0, :].tolist()
    return tokenizer.decode(outputs)

print(sample_from_input_string("The <extra_id_0> walks in <extra_id_1> park", model, tokenizer))

<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2>park.


This is the same result that we also find in the Huggingface documentation. Note that the outcome is deterministic, as we use greedy search here. So we have successfully reproduced the generate method. However, this is still fairly inefficient. In the forward method of the model, we first feed the input ids into the encoder, but these input ids have the same value in each iteration! Fortunately, the model returns the hidden state of the last layer of the encoder (a tensor of shape $B \times L \times D$) along with the outputs, and allows us to feed this as an additional input into subsequent calls of the forward method, so that the encoder does not have to run again. Here is a modified sampling method using this feature.

In [6]:
def sample_from_input_string(input_string, model, tokenizer, remove_padding = False):
    t = tokenizer(input_string)
    input_ids = t.input_ids
    input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(dim = 0)
    #
    # first iteration
    #
    decoder_input_ids = torch.tensor([model.config.decoder_start_token_id], dtype=torch.long).unsqueeze(dim = 0)
    out = model(input_ids = input_ids, decoder_input_ids = decoder_input_ids)
    logits = out.logits
    hidden_state = out.encoder_last_hidden_state
    sample_idx = torch.argmax(logits[0, -1, :]).item()
    #
    # additional iterations
    #
    for i in range(50):
        sample = torch.tensor([sample_idx], dtype=torch.long).unsqueeze(dim=0)
        decoder_input_ids = torch.cat((decoder_input_ids, sample), dim = 1)
        logits = model(input_ids = input_ids, decoder_input_ids = decoder_input_ids, encoder_outputs = [hidden_state]).logits
        sample_idx = torch.argmax(logits[0, -1, :]).item()
        if sample_idx == model.config.eos_token_id:
            break
    #
    # Decode result (and remove padding if requested)
    #
    if remove_padding:
        outputs = decoder_input_ids[0, 1:].tolist()
    else:
        outputs = decoder_input_ids[0, :].tolist()
    return tokenizer.decode(outputs)

print(sample_from_input_string("The <extra_id_0> walks in <extra_id_1> park", model, tokenizer))

<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2>park.
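
To see the effect of reusing the encoder output in isolation, here is a toy sketch without any real model. The functions `toy_encoder` and `toy_decoder_step` are made-up stand-ins that only mimic the call pattern, and the counter shows that the encoder runs once regardless of the number of decoding steps.

```python
calls = {"encoder": 0}


def toy_encoder(input_ids):
    """Stand-in for the encoder, counts how often it runs."""
    calls["encoder"] += 1
    return [2 * t for t in input_ids]  # pretend hidden states


def toy_decoder_step(hidden_state, decoder_input_ids):
    """Stand-in for decoder plus argmax, returns the next token id."""
    return (sum(hidden_state) + len(decoder_input_ids)) % 7


def greedy_decode(input_ids, max_steps=5, eos_token_id=1, start_token_id=0):
    hidden_state = toy_encoder(input_ids)  # computed once, before the loop
    decoder_input_ids = [start_token_id]
    for _ in range(max_steps):
        # Every step reuses the cached hidden state instead of re-encoding
        next_id = toy_decoder_step(hidden_state, decoder_input_ids)
        if next_id == eos_token_id:
            break
        decoder_input_ids.append(next_id)
    return decoder_input_ids


result = greedy_decode([3, 4, 5])
print(result, calls["encoder"])
```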


The examples that we have used so far correspond to the first training phase that was applied to the model - the unsupervised pre-training using span corruption. In addition, the model underwent supervised training on specific tasks. Let us take a closer look at some of them.

The general idea of how to train T5 on these specific tasks is to add a task-specific prefix to the input which tells the model which task it is supposed to carry out. These prefixes are described in the appendix of the original paper - for translation, the prefix is simply "translate English to German" (or German replaced by French or Romanian, the other target languages on which T5 was trained). Apart from that, the processing pattern is exactly as before, so we can simply reuse the function that we have already put together.

In [7]:
print(sample_from_input_string("translate English to German: The house is wonderful.", model, tokenizer, remove_padding = True))

Das Haus ist wunderbar.


This is nice - exactly the same model with the same inference (and training) methods can be used for translation as well. This is the general idea behind models like T5 - instead of fine-tuning via transfer learning, which involves putting together a task-specific model that takes over some of the layers of the pre-trained model, we train exactly the same model using essentially the same code.

Translation is not the only downstream task on which T5 has been trained. As an example of a different task, let us look at natural language inference. T5 has been trained on the MNLI corpus described [in this paper](https://arxiv.org/abs/1704.05426). The general structure of the task is as follows. The model is given two sentences, where the first one is a premise and the second one is a hypothesis. The task is to predict whether the hypothesis follows from the premise ("entailment"), contradicts the premise ("contradiction") or neither of the two applies ("neutral"). Again, the task is specified by using a task-specific prefix, namely "mnli", followed by "premise:", the premise itself, then "hypothesis:" and finally the hypothesis. Here are a few examples, where the first one is taken from the T5 paper, whereas the other ones are made up.

In [8]:
print(sample_from_input_string("“mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity", model, tokenizer, remove_padding = True))
print(sample_from_input_string("“mnli premise: John is Simons father. hypothesis: Simon is Johns son", model, tokenizer, remove_padding = True))
print(sample_from_input_string("“mnli premise: It is dark at night. hypothesis: After 10 pm it usually gets dark", model, tokenizer, remove_padding = True))
print(sample_from_input_string("“mnli premise: Birds have feathers. A pigeon is a bird. hypothesis: A pigeon has feathers", model, tokenizer, remove_padding = True))

contradiction
contradiction
neutral
entailment


A similar task is QNLI, which consists of telling whether a given sentence contains the answer to a specific question. Again, the task is made known to the model using a prefix, this time "qnli".

In [9]:
print(sample_from_input_string("qnli question: Where did Jebe die? sentence: Genghis Khan recalled Subutai back to Mongolia soon afterwards, and Jebe died on the road back to Samarkand", model, tokenizer, remove_padding = True))

entailment


A related task is not only to tell whether the question can be answered based on the information in the context, but also to provide the answer. T5 also has a prefix for that.

In [10]:
print(sample_from_input_string("question: Where did Jebe die? context: Genghis Khan recalled Subutai back to Mongolia soon afterwards, and Jebe died on the road back to Samarkand", model, tokenizer, remove_padding = True))

Samarkand
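
The prefix conventions used in the cells above can be collected in a few small helpers. The function names are hypothetical and not part of any library, they simply assemble the input strings in the format that we have been typing by hand.

```python
def translation_input(text, source="English", target="German"):
    """Input for the translation task."""
    return f"translate {source} to {target}: {text}"


def mnli_input(premise, hypothesis):
    """Input for natural language inference on MNLI."""
    return f"mnli premise: {premise} hypothesis: {hypothesis}"


def qnli_input(question, sentence):
    """Input for QNLI: does the sentence answer the question?"""
    return f"qnli question: {question} sentence: {sentence}"


def qa_input(question, context):
    """Input for question answering with a given context."""
    return f"question: {question} context: {context}"


print(translation_input("The house is wonderful."))
print(mnli_input("I hate pigeons.", "My feelings towards pigeons are filled with animosity"))
```

Each of the resulting strings can be passed to `sample_from_input_string` as before.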


Now let us try something else. We will ask our model an open question, without using any specific prefix. This is called an **instruction** in NLP. So let us ask our model to give us the boiling temperature of water (which is roughly 212 °F or 100 °C).

In [11]:
print(sample_from_input_string("Please answer the following question: what is the boiling temperature of water?", model, tokenizer, remove_padding = True))

Bitte beantworten Sie folgende Frage: Wie hoch ist die Wasserkochertemperatur?


Interesting. Apparently our model is not able to derive the actual instruction from the input, but instead mistakenly falls back to translating the input. This is not unexpected, as T5 has not been trained specifically on deriving the intent hidden in a prompt, but instead relies on the prefixes to determine the type of task. Two years after T5 was published, Google published a [follow-up paper](https://arxiv.org/abs/2210.11416) in which the team presented FLAN-T5, a version of T5 that, in addition, has been trained on a large number of open questions like the one in our example. Here, FLAN stands for "Finetuned Language Net".

How exactly this was done becomes a bit more tangible by looking at [this code snippet](https://github.com/google-research/FLAN/blob/main/flan/templates.py) which is part of the code used by Google to prepare the dataset used for training. As we can see, labeled data is converted into different instructions using templates, so that the model learns a large variety of different instructions that refer to the same task.
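
The idea can be sketched as follows, assuming that a template is a pair of an input pattern and a target pattern with named placeholders. The templates below are written for illustration only and are not copied from the FLAN repository.

```python
# Hypothetical templates for a translation task, for illustration only
TEMPLATES = [
    ("Translate the following sentence to German: {sentence}", "{translation}"),
    ("{sentence}\n\nHow would you say this in German?", "{translation}"),
    ('What is the German translation of "{sentence}"?', "{translation}"),
]


def render_examples(record, templates):
    """Turn one labeled record into several (instruction, target) pairs."""
    return [(inp.format(**record), tgt.format(**record)) for inp, tgt in templates]


record = {
    "sentence": "The house is wonderful.",
    "translation": "Das Haus ist wunderbar.",
}
pairs = render_examples(record, TEMPLATES)
for instruction, target in pairs:
    print(repr(instruction), "->", target)
```

A single labeled example thus turns into several training pairs that differ only in the wording of the instruction.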

In [12]:
model_version  = "google/flan-t5-small"
flan_tokenizer = transformers.T5Tokenizer.from_pretrained(model_version, model_max_length=512)
flan_model = transformers.T5ForConditionalGeneration.from_pretrained(model_version)

Let us first try some of our previous examples to see how the new model reacts to them and then try our question again.

In [13]:
print(sample_from_input_string("question: Where did Jebe die? context: Genghis Khan recalled Subutai back to Mongolia soon afterwards, and Jebe died on the road back to Samarkand", flan_model, flan_tokenizer, remove_padding = True))

Samarkand


In [14]:
print(sample_from_input_string("translate English to German: The house is wonderful.", flan_model, flan_tokenizer, remove_padding = True))

Das Haus ist schön.


In [15]:
print(sample_from_input_string("Please answer the following question: what is the boiling temperature of water?", flan_model, flan_tokenizer, remove_padding = True))

98 °C


FLAN-T5 comes in different sizes. Let us download the large model (the download is a bit more than 3 GB, so this might take some time) and ask it a few questions.

In [16]:
model_version  = "google/flan-t5-large"
flan_tokenizer = transformers.T5Tokenizer.from_pretrained(model_version, model_max_length=512)
flan_model = transformers.T5ForConditionalGeneration.from_pretrained(model_version)
print(sample_from_input_string("Who was the first man in space?", flan_model, flan_tokenizer, remove_padding = True))
print(sample_from_input_string("What is the boiling point of water?", flan_model, flan_tokenizer, remove_padding = True))

john glenn
212 °F


We see that the model correctly infers the intent of the prompt and gives a plausible answer in both cases. Of course, the answer to the first question is factually wrong (the first human in space was Yuri Gagarin), but the answer to the second question is correct. Let us try a few questions which were not part of the templates.

In [17]:
print(sample_from_input_string("True or false: the earth is orbiting around the sun", flan_model, flan_tokenizer, remove_padding = True))
print(sample_from_input_string("Pete is 20 years old. Simon is twice as old as Pete. How old is Simon?", flan_model, flan_tokenizer, remove_padding = True))

yes
20 * 2 = 40 years old.


Not bad! The answer to the second question is especially impressive. Google has also applied the same method to other networks, including decoder-only architectures like LaMDA-PT, and reports that the models obtained in this way outperform models trained solely via unsupervised training, see [this paper](https://arxiv.org/pdf/2109.01652v5.pdf).