
Transformer From Scratch in Vanilla Python

  • Educational Transformer with no autograd. You can train and fine-tune a model on any text file, and it will generate text that sounds like it.
  • The full set of Transformer layers is in layers.py. Each has forward and backprop methods.
  • The Multi-Head Self Attention layer has just 80 lines of code.

Note: See some more samples in the Results section.

1. Project Structure

  • src/ : Folder with the Python source files.

    • src/model.py: File with the Model class.
    • src/layers.py: Every Transformer layer. Each contains a .forward() and .backward() method.
    • src/layers_recurrent.py: RNN and LSTM layers. Can be thrown into the mix to test creative ensembles.
    • src/utils.py: File with helper functions and classes.
  • data/ : Folder to store the text files. Currently holds shakespeare.txt and jules_verne.txt.

  • models/ : Folder which stores the saved models. Further explanation in Section 2.

  • config.py : All model configurations. Edit this file to alter model layers and hyperparameters.  

  • run.py : Script executed to train/fine_tune/test the model.    

2. Running it Yourself

Requirements
  • The required packages are listed in requirements.txt.
  • Torch tensors make computation a little faster, so they are used in the Transformer implementation. However, autograd is NOT used: all backpropagation is implemented manually.
  • The requirements can be installed in a virtual environment with the command:
pip install -r requirements.txt
  • To run, install the requirements and get a text corpus (any text you wish to replicate, in .txt format).
  • Place your text file in the data directory.

Note: Training detects CUDA availability by default, and runs on CUDA if a GPU is found.
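
For reference, this kind of device selection typically follows the standard PyTorch idiom sketched below (the variable name is illustrative, not necessarily the one used in run.py):

import torch

# Standard PyTorch idiom: use the GPU when one is available, otherwise fall back to the CPU.
# Tensors created by the layers are then allocated on this device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")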

Pretraining
  • To pretrain a Transformer on language modeling (predicting the next character), first go into config.py and choose the necessary arguments.
  • In the training_params dictionary, choose:
    • --corpus (name of file in data directory with the text you want to train the model on)
    • --to_path (.json file that will be created to store the model) [OPTIONAL]
  • You can also adjust the hyperparameters, although the defaults work pretty well; they are listed in the Appendix, and a minimal example entry is sketched after this list.
  • Finally, simply run on terminal:
python3 run.py --train --config=config.py
  • You can kill the training at any time. This will NOT corrupt the saved models.
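
As an illustration, a minimal training_params entry in config.py could look like the sketch below. The file names are placeholders, and the exact keys and defaults should be taken from config.py itself; the available hyperparameter keys are listed in the Appendix.

# Sketch of a minimal training_params entry (placeholder file names).
training_params = {
    '--corpus': 'shakespeare.txt',        # text file inside the data directory
    '--to_path': 'models/my_model.json',  # optional: .json file the trained model is saved to
}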

Note: For pretraining deep Transformers (many Blocks in series), a really large text corpus is necessary. I obtained reasonably good results with >1M characters. If you want to alter layers/dimensions, do so in the config.py file, as described in the Build a Custom Model section.

Fine-Tuning
  • To fine-tune a Transformer on a given text file, go to config.py and choose the arguments:
  • In the fine_tuning_params dictionary, choose:
    • --corpus (name of file in data directory with the text you want to train the model on)
    • --from_path (.json file that contains pretrained model)
    • --to_path (.json file that will be created to store the model) [OPTIONAL]
  • You can also adjust the hyperparameters, although the defaults work pretty well (an example fine_tuning_params entry is sketched after the command below).
  • Finally, simply run on terminal:
python3 run.py --fine_tune --config=config.py
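
For example, a fine_tuning_params entry might look roughly like this (the file names are hypothetical, and the from_path model must already exist in models/):

# Sketch of a fine_tuning_params entry (hypothetical file names).
fine_tuning_params = {
    '--corpus': 'bee_gees_songs.txt',          # a smaller corpus works for fine-tuning
    '--from_path': 'models/shakespeare.json',  # pretrained model to start from
    '--to_path': 'models/bee_gees.json',       # optional: where the fine-tuned model is saved
}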

Note: For fine-tuning, you can get adventurous with smaller text files. I obtained good results with a ~10K-character text file of Bee Gees songs.

Testing
  • To test your Transformer, go to config.py and choose the arguments:

  • In the testing_params dictionary, choose (an example entry is sketched after the command below):

    • --from_path: (.json file that contains the pretrained model)
    • --testing_corpus: (optionally, add a text corpus to generate a loss metric)
    • seed: (the starting string your model will "continue" when generating text) [OPTIONAL]
    • evaluation_n_timesteps: (how many characters will be generated, "sounding" like the source text) [OPTIONAL]
  • model_layers will not be accessed during testing, as you will use the layers of the pretrained model.

  • Finally, simply run on terminal:

python3 run.py --test --config=config.py
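
For illustration, a testing_params entry could be set up along these lines (all values are placeholders, and only from_path is strictly required):

# Sketch of a testing_params entry (placeholder values).
testing_params = {
    '--from_path': 'models/shakespeare.json',  # pretrained model to load
    '--testing_corpus': 'shakespeare.txt',     # optional: used to report a loss metric
    'seed': 'LUCIO:',                          # optional: starting string the model continues
    'evaluation_n_timesteps': 500,             # optional: number of characters to generate
}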

Note: The testing script does not access any hyperparameters, because the model is already trained.

Build a Custom Model
  • To customize the model layers, go into config.py and edit the model_layers dictionary.

    Note: Each layer takes as arguments the input and output sizes. The first layer must be an Embedding layer with input size equal to vocab_size. The last layer must be a CrossEntropyLoss layer with the previous layer's output size equal to vocab_size.

    You may choose among the following layers (an example model_layers sketch follows this list):
    • Transformer Layers:
      • Embedding (first layer, turns input indexes into vectors)
      • PositionalEmbedding (second layer, adds position information to every timestep of the input)
      • TemporalDense (simple fully-connected layer)
      • MultiHeadSelfAttention (core of the transformer, calculates weighted sum of inputs)
      • Block (full transformer block - connects MHSA and Dense layers with residuals and LayerNorm)
      • Dropout (can be added after layers to apply dropout)
      • CrossEntropyLoss (last layer, returns probabilities for next generated character)
    • Extra recurrent layers:
      • RNN (Recurrent Neural Network layer)
      • LSTM (Long Short Term Memory layer)
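
    To make the constraints concrete, a small model_layers dictionary might be assembled like the sketch below. The layer names are the ones listed above, but the dictionary keys, constructor arguments, and dimension names (hidden_size, n_timesteps, dropout_prob are assumptions here) are purely illustrative; mirror the default model_layers in config.py and the actual signatures in src/layers.py rather than copying this verbatim.

# Illustrative sketch only -- the real constructor signatures live in src/layers.py,
# and config.py ships with a working default model_layers dictionary.
model_layers = {
    'embedding': Embedding(vocab_size, hidden_size),              # first layer: input size must be vocab_size
    'positional': PositionalEmbedding(n_timesteps, hidden_size),  # adds position information to each timestep
    'block_1': Block(hidden_size, hidden_size),                   # full Transformer block (MHSA + Dense, residuals, LayerNorm)
    'block_2': Block(hidden_size, hidden_size),
    'dropout': Dropout(dropout_prob),                             # optional regularization between layers
    'dense': TemporalDense(hidden_size, vocab_size),              # project back to the vocabulary size
    'loss': CrossEntropyLoss(vocab_size, vocab_size),             # last layer: previous output size equals vocab_size
}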

3. Results

  • The transformer currently implemented in config.py achieved a loss of 1.01 with a vocabulary size of 80 characters.

  • I trained it on Jules Verne's complete works (~13M characters) and on Shakespeare's complete works (~1M characters).

  • The training runs went on for 100,000 timesteps, which took 10h40min on an NVIDIA GTX 1070 GPU.

    Sample from the Shakespeare model:
    LUCIO:
    Nay, now blame me and my fantasy!
    As thou shalt know now I do love,
    Love the blessed strength of our embrace.
    
    DUKE VINCENTIO:
    Dark not is thou will here, poor boy!
    What thou hast is a judgment taint,
    And, as much as thou love is real,
    Thou heart wilt shred apart.
    
    LUCIO:
    Thou rascal! How, my lord, would you rather,
    Conspire on me, betray my friendsip,
    But I shall now bear my own fate.
    I care not, O drunk power: I part with thee,
    I care not, thy firm foe: and he comes not.
    
    Sample from the Jules Verne model:
    Nemo led the frigate by large rocks, the prey which the present
    forest of waves marked. But they planted cries surrounded by waters
    of prayers and tunnels of the large ocean. Besides, they were going
    on to the shore.
    The lowest appliances, with peculiar results, hung patterns and
    frosts to the bottom, accompanied by the dominion of a strange sound,
    was everything that could not be left in this part of the Arctic Circle,
    and manufactured at the end of the Rio Norway Island.
    The western Norwegian crew was unaccustomed, and the heat of hunger had
    their best to remain again. The next danger of twelve miles was from the
    Andara, unable to cross the fierce diamond waves with the hollow.
    

Note: Unlike the recurrent layers, the Multi-Head Self Attention forward and backward passes ran many times faster on the GPU than on my M2 CPU.

4. Appendix

Training hyperparameters (an example training_params dictionary is sketched after this list):
  • n_iter (number of times the model will run a full sequence during training)
  • n_timesteps (number of characters the model can accept as input at once)
  • batch_size (number of parallel iterations the model will run)
  • learning_rate (scalar regulating how quickly model parameters change. Should be smaller for fine-tuning)
  • regularization: (scalar regulating size of weights and overfitting) [OPTIONAL]
  • dropout_prob: (percentage of weights to be zeroed by dropout layer) [OPTIONAL]
  • patience (after how many evaluations without improvement the learning rate should be reduced) [OPTIONAL]
  • evaluation_interval: (interval of iterations between evaluation steps) [OPTIONAL]
  • evaluation_n_timesteps: (number of characters to be generated in the sample every evaluation) [OPTIONAL]
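
Put together, the training hyperparameters might be filled in like the hedged sketch below; every number is a placeholder rather than a repository default, so check config.py for sensible values.

# Placeholder values only -- the actual defaults live in config.py.
training_params = {
    # ... the --corpus / --to_path keys from Section 2 go in this same dictionary ...
    'n_iter': 100000,               # full-sequence iterations
    'n_timesteps': 256,             # characters accepted as input at once
    'batch_size': 16,               # parallel sequences per iteration
    'learning_rate': 1e-3,          # use a smaller value for fine-tuning
    'regularization': 1e-4,         # optional
    'dropout_prob': 0.1,            # optional
    'patience': 5,                  # optional: evaluations without improvement before reducing the learning rate
    'evaluation_interval': 1000,    # optional: iterations between evaluation steps
    'evaluation_n_timesteps': 400,  # optional: characters generated in each evaluation sample
}
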
Fine-tuning hyperparameters:
  • n_iter (number of times the model will run a full sequence during training)
  • n_timesteps (number of characters the model will see/predict on each iteration in n_iter)
  • batch_size (number of parallel iterations the model will run)
  • learning_rate (scalar regulating how quickly model parameters change)
  • regularization: (scalar regulating size of weights and overfitting) [OPTIONAL]
  • patience (after how many iterations without improvement the learning rate should be reduced) [OPTIONAL]
  • dropout_prob: (percentage of weights to be zeroed by dropout layer) [OPTIONAL]
  • evaluation_interval: (interval of iterations between evaluation steps) [OPTIONAL]
  • evaluation_n_timesteps: (number of characters to be generated in the sample every evaluation) [OPTIONAL]

Note: model_layers will not be accessed during fine-tuning, as the layers of the pretrained model will be automatically loaded.
