Makespeare is a GPT-style transformer that I coded from scratch and trained on the tiny-shakespeare dataset. The idea is inspired by Andrej Karpathy's video (https://youtu.be/kCc8FmEb1nY), which I used as a reference only to overcome certain obstacles.
The transformer was trained on the tiny-shakespeare
dataset containing 40,000 lines of text from Shakespeare's plays. Click here for the dataset.
An excerpt from the dataset:
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
All:
Resolved. resolved.
First Citizen:
First, you know Caius Marcius is chief enemy to the people.
All:
We know't, we know't.
...
For the transformer to be able to interpret text, we need to convert the input text into something a computer can understand - ✨Numbers✨. This is done by:
- Splitting up text into multiple parts or tokens
- Giving a unique numerical ID to each unique token
- Thus, every unique word is mapped to a unique numerical ID.
- In practice, a dictionary is used to keep track of the ID of each word. The number of word-ID pairs present in the dictionary is known as its vocabulary size (referred to as `vocab_size` in the code).
| Word | ID |
|---|---|
| Cat | 1 |
| Dog | 2 |
| ... | ... |
- If a word that is not present in the dictionary is encountered, special rules are followed to assign an ID to it (for example, mapping it to a dedicated unknown-token ID).
- Converting each token into a learnable n-dimensional vector
- For example, how similar two words are can be measured by the distance between their corresponding points in n-dimensional space (similarity increases the closer the points are).
- The dimension of each such vector is fixed and corresponds to `embedding_len` in the code. Some sources also refer to this as `d_model` (model dimension). A minimal sketch of the tokenisation and embedding steps follows this list.
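As a rough illustration, here is a minimal PyTorch sketch that builds a word-to-ID dictionary and maps the resulting IDs to learnable vectors. The names `vocab_size` and `embedding_len` mirror identifiers used in the code; the whitespace tokeniser and example sentence are purely illustrative and not necessarily how Makespeare implements this step.

```python
import torch
import torch.nn as nn

text = "Before we proceed any further, hear me speak."
words = text.split()  # naive whitespace tokenisation, for illustration only

# Build the word -> ID dictionary; its length is the vocabulary size
word_to_id = {word: idx for idx, word in enumerate(sorted(set(words)))}
vocab_size = len(word_to_id)

# Encode the text as a sequence of integer token IDs
token_ids = torch.tensor([word_to_id[w] for w in words])

# Map each ID to a learnable embedding_len-dimensional vector
embedding_len = 64  # illustrative value
token_embedding = nn.Embedding(vocab_size, embedding_len)

x = token_embedding(token_ids)  # shape: (number of tokens, embedding_len)
```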
Positional embeddings are a matrix of learnable vectors that represent the position of each token in a sentence.
Such embeddings allow the transformer to learn that words need to appear in a certain order to make sense in a sentence.
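A minimal sketch of such a positional embedding, assuming (as in GPT) one learnable vector per position that is added to the corresponding token embedding; the names `context_length` and `embedding_len` mirror the identifiers mentioned above, everything else is illustrative.

```python
import torch
import torch.nn as nn

context_length = 256  # maximum number of tokens processed at once
embedding_len = 64

# One learnable vector for each position 0 .. context_length - 1
position_embedding = nn.Embedding(context_length, embedding_len)

T = 10                                       # length of the current sequence
positions = torch.arange(T)                  # tensor([0, 1, ..., T - 1])

tok_emb = torch.randn(T, embedding_len)      # stand-in for real token embeddings
x = tok_emb + position_embedding(positions)  # shape: (T, embedding_len)
```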
Attention is the mechanism through which the model learns which words to focus on in a particular sentence. It is computed by:
- Generating 3 matrices, namely, the Query (Q), Key (K) and Value (V) as follows:

  $$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

  where $X$ is the input matrix of shape `(context_length, embedding_len)` and $W_Q$, $W_K$, $W_V$ are learnable weight matrices of shape `(embedding_len, embedding_len)`
- Splitting the Query, Key and Value matrices into `num_head` heads
- Computing attention on each head as follows:

  $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

  where $d_k$ is the dimension of each head (i.e. `embedding_len / num_head`)
- Concatenating the attention values calculated for each head into a single attention matrix (a minimal sketch of these steps follows this list)
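The sketch below walks through these steps for a single example, omitting the causal mask discussed later. The projection and helper names (`W_q`, `split_heads`, etc.) are illustrative, not taken from Makespeare's code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

context_length, embedding_len, num_head = 256, 64, 4
head_dim = embedding_len // num_head          # d_k

# Learnable projections that produce Q, K and V from the input X
W_q = nn.Linear(embedding_len, embedding_len, bias=False)
W_k = nn.Linear(embedding_len, embedding_len, bias=False)
W_v = nn.Linear(embedding_len, embedding_len, bias=False)

X = torch.randn(context_length, embedding_len)    # stand-in input
Q, K, V = W_q(X), W_k(X), W_v(X)                  # each: (context_length, embedding_len)

# Split each matrix into num_head heads of size head_dim
def split_heads(M):
    return M.view(context_length, num_head, head_dim).transpose(0, 1)  # (num_head, context_length, head_dim)

Qh, Kh, Vh = map(split_heads, (Q, K, V))

# Scaled dot-product attention per head: softmax(Q K^T / sqrt(d_k)) V
scores = Qh @ Kh.transpose(-2, -1) / math.sqrt(head_dim)   # (num_head, context_length, context_length)
weights = F.softmax(scores, dim=-1)
per_head = weights @ Vh                                     # (num_head, context_length, head_dim)

# Concatenate the heads back into a single (context_length, embedding_len) matrix
attention = per_head.transpose(0, 1).reshape(context_length, embedding_len)
```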
The computed attention matrix is added to the attention block's input matrix. This is known as a residual connection.
The residual output then undergoes normalisation for better and faster training.
graph BT;
id1(Input) --> id2(Attention)
id1(Input) --> id3((+))
id2(Attention) --> id3((+))
id3((+)) --> id4(Residual)
id4(Residual) --> id5(Normalisation)
style id1 fill:#005f73, stroke:#005f73
style id2 fill:#0a9396, stroke:#0a9396
style id3 fill:#ca6702, stroke:#ca6702
style id4 fill:#bb3e03, stroke:#bb3e03
style id5 fill:#9b2226, stroke:#9b2226
Note: Makespeare makes use of a slightly modified version of this step wherein the attention block's input matrix undergoes normalisation, the attention matrix is computed using this normalised input, and finally, the residual computation is performed. This is known as pre-normalisation and is simply a rearrangement of the aforementioned order of steps as follows:
graph BT;
id1(Input) --> id5(Normalisation)
id1(Input) --> id3((+))
id2(Attention) --> id3((+))
id3((+)) --> id4(Residual)
id5(Normalisation) --> id2(Attention)
style id1 fill:#005f73, stroke:#005f73
style id2 fill:#0a9396, stroke:#0a9396
style id3 fill:#ca6702, stroke:#ca6702
style id4 fill:#bb3e03, stroke:#bb3e03
style id5 fill:#9b2226, stroke:#9b2226
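A minimal sketch of this pre-normalisation pattern, assuming `nn.LayerNorm` for the normalisation and using a placeholder in place of the attention computation:

```python
import torch
import torch.nn as nn

embedding_len = 64
norm = nn.LayerNorm(embedding_len)

def attention(x):
    # Placeholder for the multi-head attention sketched earlier
    return x

x = torch.randn(10, embedding_len)  # the attention block's input

# Pre-normalisation: normalise, attend, then add the residual connection
out = x + attention(norm(x))
```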
The output from the previous step is fed to a feedforward neural network.
graph BT;
id1(Attention_Output) --> id2(Normalisation)
id2(Normalisation) --> id3(Feedforward_NN)
id3(Feedforward_NN) --> id4((+))
id1(Attention_Output) --> id4((+))
id4((+)) --> id5(Residual)
style id1 fill:#005f73, stroke:#005f73
style id2 fill:#0a9396, stroke:#0a9396
style id3 fill:#ca6702, stroke:#ca6702
style id4 fill:#bb3e03, stroke:#bb3e03
style id5 fill:#9b2226, stroke:#9b2226
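A minimal sketch of this feedforward sub-block, assuming the common GPT-style two-layer network with a 4x hidden expansion and a GELU activation (the exact sizes and activation used in Makespeare may differ):

```python
import torch
import torch.nn as nn

embedding_len = 64

feedforward = nn.Sequential(
    nn.Linear(embedding_len, 4 * embedding_len),  # expand
    nn.GELU(),
    nn.Linear(4 * embedding_len, embedding_len),  # project back
)
norm = nn.LayerNorm(embedding_len)

attention_output = torch.randn(10, embedding_len)

# Pre-normalise, apply the feedforward network, then add the residual connection
out = attention_output + feedforward(norm(attention_output))
```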
Similar to GPT, Makespeare is a decoder-only transformer.
Masked attention is similar to dot-product self-attention, with the only difference being that queries are not given access to any succeeding key-value pairs. In other words, no future tokens are accessed by the decoder while predicting the current token.
A `(context_length, context_length)` mask is used to accomplish this. The mask is set to a large negative value (effectively −∞) at the positions corresponding to future tokens, so that the softmax assigns them zero weight.
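A minimal sketch of such a mask, assuming the usual construction from a lower-triangular matrix with an `-inf` fill (illustrative values, not Makespeare's exact code):

```python
import torch

context_length = 8  # small value for illustration

# Lower-triangular mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(context_length, context_length))

# Example raw attention scores (Q K^T / sqrt(d_k)) before the softmax
scores = torch.randn(context_length, context_length)

# Future positions receive -inf, so the softmax gives them zero weight
masked_scores = scores.masked_fill(causal_mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
```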