# T5: Text-To-Text Transfer Transformer

![](../figs/deep_nlp/t5/t5.gif)

“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” {cite}`raffel2020exploring`

- A unified framework that converts all text-based language problems into a text-to-text format by framing them as conditional text generation tasks.
- T5 is pre-trained with a span-corruption denoising objective (a variant of BERT-style masked language modeling, decoded autoregressively like GPT-2) and then fine-tuned on downstream tasks, so the same model, loss, and decoding procedure cover a wide range of tasks without task-specific architecture modifications.
- C4 (Colossal Clean Crawled Corpus), roughly 750 GB of cleaned English text extracted from Common Crawl, is used as the pre-training corpus.
- Reports state-of-the-art results on 18 of the 24 tasks considered in the paper, which span benchmarks such as GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail summarization.
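
The pre-training objective described in the paper replaces spans of the input with sentinel tokens and asks the model to generate the dropped spans. The sketch below illustrates the idea only: the helper `span_corrupt` is hypothetical, the spans are hand-picked rather than sampled at random, and the `<extra_id_N>` sentinel naming is an assumption borrowed from common T5 tokenizers.

```python
def span_corrupt(tokens, spans):
    """Replace each span with a sentinel and return (input_text, target_text).

    `spans` are sorted, non-overlapping, half-open (start, end) index ranges.
    Sentinel naming (<extra_id_N>) is an assumption made for illustration.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted tokens
        inputs.append(sentinel)              # one sentinel per dropped span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # dropped tokens move to the target
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because both the corrupted input and the target are plain strings, pre-training uses the same text-to-text interface as every downstream task.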

## T5: Text-to-Text Framework

![](../figs/deep_nlp/t5/t5-training.png)

### Unified Input & Output Format

- T5 stands for "`T`ext-`T`o-`T`ext `T`ransfer `T`ransformer" (five Ts).
- Every task considered by T5 is framed as a conditional text generation task with a single input and output sequence.
- Translation: `translate English to German: How old are you?` $\rightarrow$ `Wie alt bist du?`
- Text classification: `classify sentiment: This movie is so bad.` $\rightarrow$ `negative`
- Text summarization: `summarize: The movie was not good. The animation and the graphics were good. This is a good movie.` $\rightarrow$ `The movie was not good.`
- Entailment: `entailment: I like to eat broccoli and bananas. I eat a banana every day.` $\rightarrow$ `neutral`
- MNLI (entailment): `mnli: Premise: A person on a horse jumps over a broken down airplane. Hypothesis: A person is outdoors, on a horse.` $\rightarrow$ `entailment`
- MNLI (neutral): `mnli: Premise: A person on a horse jumps over a broken down airplane. Hypothesis: The person is at the zoo, riding a horse.` $\rightarrow$ `neutral`
- Regression: `stsb sentence1: The cat was playing in the garden. sentence2: The cat was playing in the yard.` $\rightarrow$ `5.0` (the similarity score is generated as a string)
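
As a rough illustration of the format above, every task reduces to a pair of strings: a prefixed input and a textual target. The helpers `format_example` and `sts_score_to_text` below are hypothetical names for this sketch, and the prefixes are the illustrative ones from the list rather than an authoritative set. The rounding of regression scores to the nearest 0.2 follows the paper's description of STS-B.

```python
def format_example(prefix, text, target):
    """Return (input_text, target_text); both sides are plain strings."""
    return f"{prefix}: {text}", str(target)


def sts_score_to_text(score):
    """Round a similarity score to the nearest 0.2 and render it as text."""
    return f"{round(score * 5) / 5:.1f}"


translation = format_example(
    "translate English to German", "How old are you?", "Wie alt bist du?"
)
sentiment = format_example("classify sentiment", "This movie is so bad.", "negative")

print(translation)              # ('translate English to German: How old are you?', 'Wie alt bist du?')
print(sts_score_to_text(2.57))  # 2.6
```

Casting regression as string prediction is what lets a single decoder and a single cross-entropy loss handle classification, generation, and scoring tasks alike.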

### Encoder-Decoder Transformer Model

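- T5 closely follows the original encoder-decoder Transformer architecture of Vaswani et al. (2017); BERT, by contrast, is encoder-only.
- However, a simplified layer normalization is used: the activations are only rescaled, and no additive bias is applied.
- Layer normalization is applied to the input of each subcomponent (self-attention or feed-forward); a residual skip connection, as introduced in ResNet, then adds each subcomponent's input to its output.
- Also, instead of fixed (sinusoidal) position encodings, T5 uses relative position embeddings: a learned scalar, selected by the offset between the query and key positions, is added to the corresponding attention logit.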
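
The three architectural details above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: all function names are hypothetical, the layer normalization is written as root-mean-square rescaling with a learned gain, and the relative position bias simply clips offsets, whereas the paper groups offsets into buckets.

```python
import math


def simplified_layer_norm(x, weight, eps=1e-6):
    """Rescale x by its root mean square; no mean subtraction, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def residual_sublayer(x, sublayer, weight):
    """Normalize the input, apply the sublayer, then add the input back."""
    normed = simplified_layer_norm(x, weight)
    return [xi + yi for xi, yi in zip(x, sublayer(normed))]


def relative_position_bias(query_len, key_len, bias_by_offset, max_offset):
    """Bias matrix whose entry (i, j) depends only on the clipped offset j - i."""
    return [
        [bias_by_offset[max(-max_offset, min(max_offset, j - i))]
         for j in range(key_len)]
        for i in range(query_len)
    ]


x = [3.0, 4.0]
normed = simplified_layer_norm(x, weight=[1.0, 1.0])
out = residual_sublayer(x, lambda h: [2.0 * v for v in h], weight=[1.0, 1.0])

# Hypothetical "learned" scalars, one per clipped offset in [-2, 2].
bias_by_offset = {-2: -0.5, -1: -0.1, 0: 0.3, 1: 0.1, 2: -0.4}
bias = relative_position_bias(4, 4, bias_by_offset, max_offset=2)
```

In an attention layer, `bias[i][j]` would be added to the logit between query position `i` and key position `j` before the softmax, so the model sees how far apart two tokens are rather than where each sits in absolute terms.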

## References

- [T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)
- [A Full Guide to Finetuning T5 for Text2Text and Building a Demo with Streamlit](https://medium.com/nlplanet/a-full-guide-to-finetuning-t5-for-text2text-and-building-a-demo-with-streamlit-c72009631887)