This repository contains a PyTorch implementation of the seminal paper "Attention is All You Need" by Vaswani et al. It includes the Transformer model, a tokenizer, and training scripts.
The Transformer model, introduced in the paper "Attention is All You Need," is a novel architecture designed to handle sequential data with self-attention mechanisms. This architecture has achieved state-of-the-art performance in various natural language processing tasks.
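At the core of the model is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, which multi-head attention applies in parallel over projected queries, keys, and values. Below is a minimal PyTorch sketch of this operation; the function name and mask handling are illustrative, not this repository's exact API:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    q, k, v: tensors of shape (batch, heads, seq_len, d_k).
    mask:    optional boolean tensor; True marks positions to be ignored.
    """
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```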
- Python 3.11 or higher
- PyTorch 2.3 or higher
- NumPy
- transformers 4.41 or higher
To set up the environment, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/deeplearningcafe/transformer-pytorch.git
  cd transformer-pytorch
  ```

- Create a virtual environment and activate it:

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

  Or, if using conda:

  ```bash
  conda create -n transformer_torch
  conda activate transformer_torch
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
We use a BPE tokenizer, the same as in the original paper. To train the tokenizer from scratch we use the `tokenizers` library from Hugging Face. The `vocabulary_size` is set to 37000.

```bash
python train_tokenizer.py
```
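As a rough sketch of what such a script does with the `tokenizers` library, the snippet below trains a BPE model with a 37,000-token vocabulary; the file paths and special tokens are illustrative assumptions, and `train_tokenizer.py` is the authoritative version:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Byte-pair-encoding model with a whitespace pre-tokenizer.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=37000,  # matches the vocabulary size used in this repository
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],  # assumed special tokens
)

# Hypothetical training files; replace with your own corpus.
tokenizer.train(files=["data/train.en", "data/train.de"], trainer=trainer)
tokenizer.save("tokenizer.json")
```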
If you use a dataset other than the original WMT 2014 English-to-German dataset, the number of warmup steps should be adjusted. By overfitting on a single batch, we can test several warmup values. To run the search, set the `hp_search` variable to `True` in the `config.yaml` file; the tolerance, maximum steps, and search interval can also be changed.

```bash
python training.py
```
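For context, the schedule from the paper ties the learning rate to the warmup value: lrate = d_model⁻⁰·⁵ · min(step⁻⁰·⁵, step · warmup_steps⁻¹·⁵). A small sketch of that formula is shown below; it is illustrative only, and the training script may implement it differently:

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from "Attention is All You Need".

    Increases linearly for the first `warmup_steps` steps, then decays
    proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps:
# noam_lr(4000) ≈ 7.0e-4 for d_model = 512
```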
To train the Transformer model, use the provided script:

```bash
python training.py
```
Here, `config.yaml` is a configuration file specifying the model parameters, training settings, and dataset paths. The resulting parameter count is 56,754,176, while we try to keep the same configuration as in the paper.
```yaml
transformer:
  hidden_dim: 512
  num_heads: 8
  intermediate_dim: 2048
  eps: 1e-06
  num_layers: 6
  dropout: 0.1
  label_smoothing: 0.1
train:
  warmup_steps: 4000
  max_length: 128
  device: cpu
  train_batch: 128
  eval_batch: 128
  steps: 10
  val_steps: 1
  log_steps: 1
  save_steps: 1
  use_bitsandbytes: False
  save_path: weights/transformer_
  dataset_name: "bentrevett/multi30k"
```
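To check the parameter count against your own settings, you can load `config.yaml` (assuming PyYAML is available) and sum the element counts of all trainable parameters; constructing the model itself is left out because it depends on this repository's class names:

```python
import yaml
import torch

# Load the configuration shown above.
with open("config.yaml") as f:
    config = yaml.safe_load(f)
print(config["transformer"]["hidden_dim"])  # 512

def count_parameters(model: torch.nn.Module) -> int:
    """Total number of trainable parameters (56,754,176 is reported for the config above)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```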
We include the `inference.py` file for inference, which calls the `generate` method of the Transformer class, with the option to use greedy sampling or top-k sampling.
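As a rough illustration of the two decoding strategies, the helper below picks the next token either greedily or by sampling from the k most probable tokens; it is a hypothetical sketch, not the actual `generate` implementation:

```python
import torch

def select_next_token(logits: torch.Tensor, top_k: int = 0) -> torch.Tensor:
    """Pick the next token id from logits of shape (batch, vocab_size).

    top_k == 0 -> greedy decoding (argmax over the vocabulary);
    top_k  > 0 -> sample from the k most probable tokens.
    """
    if top_k <= 0:
        return logits.argmax(dim=-1)
    values, indices = torch.topk(logits, k=top_k, dim=-1)
    probs = torch.softmax(values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return indices.gather(-1, choice).squeeze(-1)
```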
To evaluate the model on the test set, we include the `test_model.py` file.
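Translation quality is usually reported as BLEU, as in the original paper. Below is a minimal sketch with the `sacrebleu` package; this is an assumption on our part, and `test_model.py` may compute its metrics differently:

```python
import sacrebleu

# Hypothetical example data: model outputs and one reference stream aligned with them.
hypotheses = ["a man is riding a horse ."]
references = [["a man rides a horse ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```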
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is All You Need. arXiv preprint arXiv:1706.03762.
- Hugging Face Transformers: https://huggingface.co/transformers/
This project is licensed under the MIT license. Details are in the LICENSE file.