
Attention is All You Need

This repository contains a PyTorch implementation of the seminal paper "Attention is All You Need" by Vaswani et al. It includes the Transformer model, a tokenizer, and training scripts.

Table of Contents

  • Overview
  • Requirements
  • Installation
  • Usage
  • Examples
  • References
  • Author
  • License

Overview

The Transformer model, introduced in the paper "Attention is All You Need," is a novel architecture designed to handle sequential data with self-attention mechanisms. This architecture has achieved state-of-the-art performance in various natural language processing tasks.
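
At the core of the architecture is scaled dot-product attention. A minimal PyTorch sketch of that operation is shown below; it is illustrative only, not the code from this repository.

    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    # Illustrative sketch, not this repository's implementation.
    import math
    import torch

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (batch, heads, seq_len, head_dim)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:
            # Positions where mask == 0 are excluded from attention.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return weights @ v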

Requirements

  • Python 3.11 or higher
  • PyTorch 2.3 or higher
  • NumPy
  • transformers 4.41 or higher

Installation

To set up the environment, follow these steps:

  1. Clone the repository:

    git clone https://github.com/deeplearningcafe/transformer-pytorch.git
    cd transformer-pytorch
  2. Create a virtual environment and activate it:

    python3 -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`

    Or if using conda:

    conda create -n transformer_torch
    conda activate transformer_torch
  3. Install the required packages:

    pip install -r requirements.txt

Usage

Train tokenizer

We use a BPE tokenizer, as in the original paper. To train the tokenizer from scratch, we use the tokenizers library from Hugging Face. The vocabulary size is set to 37,000.

python train_tokenizer.py
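
For reference, a minimal sketch of BPE training with the tokenizers library is shown below; the file paths and special tokens are assumptions, and train_tokenizer.py may differ in its details.

    # Sketch of BPE tokenizer training with the Hugging Face tokenizers library.
    # File paths and special tokens are placeholders, not the repository's values.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(
        vocab_size=37000,
        special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
    )

    # Train on raw text files and save the tokenizer to disk.
    tokenizer.train(files=["data/train.en", "data/train.de"], trainer=trainer)
    tokenizer.save("tokenizer.json")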

Hyperparameter search

If you use a dataset other than the original WMT 2014 English-to-German dataset, the warmup steps should be adjusted. By overfitting on a single batch, we can test several warmup values. To run the search, set the hp_search variable to True in config.yaml; the tolerance, maximum steps, and search interval can also be changed.

python training.py
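
The warmup value feeds into the learning-rate schedule from the original paper, sketched below; how training.py implements it is not shown here.

    # Learning-rate schedule from "Attention is All You Need":
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
        step = max(step, 1)  # avoid division by zero at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)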

Train transformer

To train the Transformer model, use the provided script:

python training.py

Here, config.yaml is a configuration file specifying the model parameters, training settings, and dataset paths. With the settings below, which try to match the configuration used in the paper, the parameter count is 56,754,176.

transformer:
  hidden_dim: 512
  num_heads: 8
  intermediate_dim: 2048
  eps: 1e-06
  num_layers: 6
  dropout: 0.1
  label_smoothing: 0.1

train:
  warmup_steps: 4000
  max_length: 128
  device: cpu
  train_batch: 128
  eval_batch: 128
  steps: 10
  val_steps: 1
  log_steps: 1
  save_steps: 1
  use_bitsandbytes: False
  save_path: weights/transformer_
  dataset_name: "bentrevett/multi30k"
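
As a rough sanity check on the reported parameter count, the configuration can be loaded with PyYAML (assumed available) and the model's trainable parameters summed; the constructor call below is a placeholder, since the actual model class name in this repository is not shown here.

    # Load config.yaml and count a model's trainable parameters.
    import yaml
    import torch.nn as nn

    def count_parameters(model: nn.Module) -> int:
        # Sum the elements of every trainable tensor in the model.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)["transformer"]

    # model = Transformer(**cfg)        # hypothetical constructor
    # print(count_parameters(model))    # expected: 56,754,176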

Examples

We include inference.py for inference; it calls the generate method of the Transformer class, with the option to use greedy decoding or top-k sampling. To evaluate the model on the test set, we include test_model.py.
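
For illustration, a minimal top-k sampling step over next-token logits might look like the following; this is the general technique, not necessarily the exact code in inference.py.

    # Top-k sampling: keep the k highest-scoring tokens, renormalize, and sample.
    # Illustrative sketch, not necessarily the repository's implementation.
    import torch

    def sample_top_k(logits: torch.Tensor, k: int = 10) -> torch.Tensor:
        # logits: (batch, vocab_size) scores for the next token.
        top_values, top_indices = torch.topk(logits, k, dim=-1)
        probs = torch.softmax(top_values, dim=-1)
        # Draw one token id per batch element from the top-k distribution.
        choice = torch.multinomial(probs, num_samples=1)
        return top_indices.gather(-1, choice)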

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30.

Author

aipracticecafe

License

This project is licensed under the MIT license. Details are in the LICENSE file.
