# ByT5: Towards a token-free future with pre-trained byte-to-byte models

![](../figs/deep_nlp/byt5/byt5-intro.png)

The ByT5 model was presented in {cite}`xue2022byt5`.

The abstract from the paper is the following:

> Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.

## What Is A Token In Machine Learning?

- A token is a sequence of characters that is considered as a single entity during processing. 
- Tokens are usually derived from words, but they can also be derived from subwords, characters, or even bytes.
- For example, the sentence "The quick brown fox jumps over the lazy dog" can be tokenized into the following tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
- Some words are split into multiple tokens. For example, "don't" can be tokenized into ["do", "n't"].
- At the byte or character level, the same sentence becomes 43 tokens, one per character (spaces included). Because the text is plain ASCII, that is also 43 UTF-8 bytes.
- In transformer models, each token is represented as a fixed-length vector, for example a 512-dimensional embedding, which keeps the per-token cost of computation bounded.
- Attention mechanisms are expensive: the cost of self-attention grows as $O(N^2)$, where $N$ is the number of tokens in the sequence.
- This explains why tokenization is important: it dramatically reduces the number of tokens in a sequence, and thus the cost of computation.
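
The effect of token granularity on sequence length, and therefore on attention cost, can be sketched in a few lines of Python. This is only an illustration: the whitespace split stands in for a real word or subword tokenizer, and `attention_pairs` is a hypothetical helper that counts query-key pairs rather than measuring actual FLOPs.

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Word-level tokens: a naive whitespace split (an assumption for illustration;
# real tokenizers apply subword rules such as SentencePiece or BPE).
word_tokens = sentence.split()

# Byte-level tokens: one token per byte of the UTF-8 encoding.
byte_tokens = list(sentence.encode("utf-8"))


def attention_pairs(n: int) -> int:
    """Number of query-key pairs in full self-attention over n tokens."""
    return n * n


print(len(word_tokens), attention_pairs(len(word_tokens)))  # 9 81
print(len(byte_tokens), attention_pairs(len(byte_tokens)))  # 43 1849
```

Going from 9 word tokens to 43 byte tokens multiplies the number of attention pairs by roughly 23, which is the trade-off a byte-level model has to absorb.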

## What Is ByT5?

- ByT5 is a token-free model that operates directly on raw text, specifically on its sequence of UTF-8 bytes.
- Therefore, it does not require a tokenizer, and it can process text in any language out of the box.
- One advantage of token-free models is that they are more robust to noise.
- Also, out-of-vocabulary words are not a problem for token-free models. It is `<UNK>`-free.
- `<UNK>` is the token used to represent out-of-vocabulary words in token-based models.
- ByT5 is based on Google’s token-based mT5 (Massively Multilingual Text-to-Text Transfer Transformer), with minimal modifications so that it processes byte sequences instead of subword tokens.
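
To make "token-free" concrete, here is a minimal sketch of a byte-level encoder and decoder in plain Python. It is not the ByT5 implementation itself. The names `encode`, `decode`, `NUM_SPECIAL`, and `EOS_ID` are hypothetical, and the layout of three reserved IDs (padding, end-of-sequence, unknown) placed before the 256 byte values is an assumption modelled on the convention commonly used with ByT5 tokenizers.

```python
# Assumption: IDs 0, 1, 2 are reserved for pad, eos, and unk,
# so byte value b is mapped to ID b + 3.
NUM_SPECIAL = 3
EOS_ID = 1


def encode(text: str) -> list[int]:
    """Map text to byte-level IDs and append an end-of-sequence ID."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Drop special and out-of-range IDs, then decode the remaining bytes."""
    data = bytes(
        i - NUM_SPECIAL for i in ids if NUM_SPECIAL <= i < NUM_SPECIAL + 256
    )
    return data.decode("utf-8", errors="ignore")


ids = encode("héllo 世界")
print(len(ids))     # one ID per UTF-8 byte, plus the end-of-sequence ID
print(decode(ids))  # round-trips to the original string
```

Because every string has a UTF-8 encoding, no input can fall outside the 256-value vocabulary. This is why such a scheme never needs `<UNK>` and works for any language, at the price of longer sequences for non-ASCII text.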

![](../figs/deep_nlp/byt5/byt5-vs-mt5.png)

## References

- [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/pdf/2105.13626v1.pdf)
- [ByT5: What It Might Mean For SEO](https://wandb.ai/onlineinference/byt5/reports/ByT5-What-It-Might-Mean-For-SEO--Vmlldzo4NzY1NzE#what-is-byt5?)
