In [42]:
import numpy as np

import mytransformer


np.set_printoptions(suppress=True)

# NOTE: What is the difference in the decoder block between inference and training, as mentioned in the Random Transformer?
# TODO: Find out whether it is related to masked multi-head attention, how exactly, and whether anything else differs.

# Transformer

Now we are ready to combine the encoder and decoder blocks with the positional encoding to generate an output sequence.

The complete transformer is made of two parts:
- Encoder: takes the input sequence and generates a rich representation. It is composed of a stack of encoder blocks.
- Decoder: takes the encoder's output and the previously generated tokens to generate the output sequence. It is also composed of a stack of decoder blocks.

A final linear layer with a softmax on top of the decoder is necessary for word generation.

The whole algorithm looks like this:
1. Encoder processing: The encoder receives the input sequence (embeddings with added positional encodings) and generates a rich representation, which is fed into the decoder.
2. Decoder initialization: The decoding process begins with the start-of-sequence (SOS) token combined with the encoder's output.
3. Decoder operation: The decoder uses the encoder's output together with all previously generated tokens to produce a new list of embeddings.
4. Linear layer for logits: A linear layer is applied to the last output embedding from the decoder to generate logits, the raw predictions for the next token.
5. Softmax for probabilities: The logits are passed through a softmax layer, which converts them into a probability distribution over potential next tokens.
6. Iterative token generation: This process is repeated, with each step using the embeddings of all previously generated tokens and the **initial** encoder output.
7. Sequence completion: Generation continues through these steps until the end-of-sequence (EOS) token is produced or a predefined sequence length is reached.

### 1. Linear Layer

This is a simple linear transformation that takes the decoder's output and transforms it into a vector of size vocab_size, where vocab_size is the size of our vocabulary.
For our example, the vocabulary is made up of 10 words.

In [35]:
def linear(x, W, b):
    return (x @ W) + b

# We assume our decoder's output is a simple vector [1, 0, 1, 0]
logits = linear(x=np.array([[1,0,1,0]]), W=np.random.randn(4, 10), b=np.random.randn(1, 10))
logits

array([[-0.75454625, -1.72984897, -2.74518706, -5.10065025, -1.06210493,
         0.89496205, -1.94516827, -0.61045523,  2.2844327 ,  1.41214933]])

What do we use as input for the linear layer? The decoder will output one embedding for each token in the sequence. 

The input for the linear layer will be the last generated embedding. The last embedding encapsulates information about the entire sequence up to that point, so it contains all the information needed to generate the next token.


**This means that each output embedding from the decoder contains information about the entire sequence up to that point.**
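
As a minimal sketch of this selection step, assume a hypothetical decoder output holding three token embeddings of dimension 4. The random arrays below only stand in for a real decoder and a trained linear layer; the point is the shapes, not the values. Only the last row is projected to logits.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

d_embedding, vocab_size, seq_len = 4, 10, 3

# Hypothetical decoder output: one embedding per token generated so far
decoder_output = rng.standard_normal((seq_len, d_embedding))

# Random weights standing in for a trained linear layer
W = rng.standard_normal((d_embedding, vocab_size))
b = rng.standard_normal((1, vocab_size))

# Keep only the embedding of the most recent position
last_embedding = decoder_output[-1]   # shape (4,)
logits = last_embedding @ W + b       # broadcasts to shape (1, 10)

print(decoder_output.shape, last_embedding.shape, logits.shape)
```

Projecting the whole decoder output and then keeping the last row gives the same logits; slicing first simply avoids computing rows that are not used for the next-token prediction.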

### 2. Softmax

The output of the linear layer, which is the input to the softmax layer, is called the logits. The softmax is needed to turn the logits into word probabilities.

In [39]:
def softmax(x: np.ndarray) -> np.ndarray:
    return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)

probs = softmax(x=logits)
probs

array([[0.02594799, 0.00978442, 0.0035447 , 0.00033621, 0.019078  ,
        0.13504427, 0.00788902, 0.02996965, 0.54189555, 0.22651018]])
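
The softmax above exponentiates the raw logits directly, which can overflow when the logits are large. A common variant subtracts the row-wise maximum before exponentiating; the result is mathematically identical. The sketch below is one possible way to write it, and the name `stable_softmax` is chosen here only for illustration.

```python
import numpy as np

def stable_softmax(x: np.ndarray) -> np.ndarray:
    # Subtracting the row-wise maximum leaves the result unchanged
    # but keeps np.exp from overflowing on large logits
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = small + 1000.0  # same differences between logits, much larger magnitude

print(stable_softmax(small))
print(stable_softmax(large))        # same distribution as for `small`
print(stable_softmax(small).sum())  # sums to 1 up to floating-point error
```

Because softmax depends only on the differences between logits, shifting every logit by the same constant does not change the probabilities.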

### 3. The Random Encoder-Decoder Transformer

First, we need to define a dictionary that maps the words to their initial embeddings.
Usually, word2vec or GloVe is used, but I am using random initializations.

In [46]:
vocabulary = [
    "hello",
    "mundo",
    "world",
    "how",
    "?",
    "EOS",
    "SOS",
    "a",
    "hola",
    "c",
]
embedding_reps = np.random.randn(10, 4)
vocabulary_embeddings = {
    word: embedding_reps[i] for i, word in enumerate(vocabulary)
}
vocabulary_embeddings

{'hello': array([ 0.77951459,  0.75806863, -1.08660408, -0.33525287]),
 'mundo': array([ 0.09980578,  0.14668742, -0.1784667 , -0.2883775 ]),
 'world': array([-1.63377171, -0.69276119,  1.18119715, -0.17603885]),
 'how': array([-1.33615235, -1.10483817, -0.06466068,  0.39783812]),
 '?': array([ 0.34698074, -0.86457566,  1.39360451, -0.91708825]),
 'EOS': array([-0.65274899, -0.12009902, -0.45978305,  0.76393155]),
 'SOS': array([ 0.60378661,  0.62930697, -0.55687916, -0.15650524]),
 'a': array([ 0.81077324, -0.25395342, -0.30993164, -0.06185999]),
 'hola': array([-1.27392534,  1.36883098, -0.45121209, -0.26708426]),
 'c': array([ 0.29710762, -0.3372509 , -0.65838874,  0.79640235])}
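
Since the embeddings are random, rerunning the cell gives different values each time. If reproducible values are wanted, one option is a seeded NumPy generator. The sketch below also stores the embeddings as a single matrix with a token-to-index lookup; the seed value and the names `embedding_matrix`, `token_to_index`, and `embed` are assumptions made for this example, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is an arbitrary choice

vocabulary = ["hello", "mundo", "world", "how", "?", "EOS", "SOS", "a", "hola", "c"]
d_embedding = 4

# One row per word, in the same order as the vocabulary list
embedding_matrix = rng.standard_normal((len(vocabulary), d_embedding))
token_to_index = {word: i for i, word in enumerate(vocabulary)}

def embed(tokens):
    # Row lookup: returns one embedding vector per token
    return embedding_matrix[[token_to_index[t] for t in tokens]]

print(embed(["hello", "world"]).shape)  # (2, 4)
```

Keeping a token-to-index mapping next to the matrix also makes it straightforward to go back from an index, such as the result of an argmax over the vocabulary, to the corresponding word.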

Let's write a generate function that takes in the input sequence and generates tokens autoregressively.

In [92]:
def generate(input_sequence, max_iters=3):
    # Firstly, we encode the inputs into embeddings 
    embedded_inputs = [
        vocabulary_embeddings[token] for token in input_sequence
    ]

    print(f"Embedded representations (encoder input):\n{embedded_inputs}")

    # NOTE: (Apparently not) Next, we need to positionally encode each embedding
    encoder_output = mytransformer.encoder(x=embedded_inputs)
    print(f"Embedding generated by encoder (encoder output):\n{encoder_output}")

    # We initialize the decoder output with the embedding of the start token
    sequence_embeddings = [vocabulary_embeddings["SOS"]]
    output = "SOS"

    # Random matrices for the linear layer
    d_vocab = len(vocabulary_embeddings)
    W = np.random.randn(mytransformer.d_embedding, d_vocab)
    b = np.random.randn(1, d_vocab)
    # logits = linear(x=sequence_embeddings, )

    # We limit the number of decoding steps to avoid overly long sequences when no EOS is generated
    for i in range(max_iters):
        # Decoder step
        decoder_output = mytransformer.decoder(x=sequence_embeddings, decoder_embedding=encoder_output)

        # Only the last output is for prediction (as that token contains all the necessary information of the previously generated tokens)
        logits = linear(decoder_output[-1], W, b)        

        # Pass it through the softmax layer
        probs = softmax(logits)

        # We then generate the most likely next token
        next_token = vocabulary[np.argmax(probs)]
        sequence_embeddings.append(vocabulary_embeddings[next_token])
        output += " " + next_token

        print(f"""
            Iteration: {i}
            Generated token: {next_token}
            Token probability: {np.max(probs)}
        """)

        # If the end-of-sequence token is generated, we stop generating
        if next_token == "EOS":
            break

    # Return the same (text, embeddings) pair whether or not EOS was reached
    return output, sequence_embeddings
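
To make the greedy selection step concrete, the sketch below applies it to the probabilities printed in the softmax section and also lists the three most probable candidates. The pairing of those probabilities with this vocabulary is purely illustrative, since the linear layer that produced them was random.

```python
import numpy as np

vocabulary = ["hello", "mundo", "world", "how", "?", "EOS", "SOS", "a", "hola", "c"]

# Probabilities copied from the softmax output shown earlier
probs = np.array([[0.02594799, 0.00978442, 0.0035447, 0.00033621, 0.019078,
                   0.13504427, 0.00788902, 0.02996965, 0.54189555, 0.22651018]])

# Greedy choice: the single most probable token
greedy_token = vocabulary[int(np.argmax(probs))]

# Indices of the three most probable tokens, highest probability first
top3 = np.argsort(probs[0])[::-1][:3]
ranked = [(vocabulary[i], float(probs[0, i])) for i in top3]

print(greedy_token)  # hola
print(ranked)
```

Printing the ranked candidates next to the chosen token is a simple way to see how confident each decoding step is.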

In [96]:
generate(["hello", "world"])

Embedded representations (encoder input):
[array([ 0.77951459,  0.75806863, -1.08660408, -0.33525287]), array([-1.63377171, -0.69276119,  1.18119715, -0.17603885])]
Embedding generated by encoder (encoder output):
[[ 0.41465919 -1.2092834   1.40973063 -0.61510241]
 [ 0.41304152 -1.20857394  1.41076047 -0.61522406]]

            Iteration: 0
            Generated token: a
            Token probability: 0.5183529832584921
        

            Iteration: 1
            Generated token: ?
            Token probability: 0.9852352948654279
        

            Iteration: 2
            Generated token: ?
            Token probability: 0.6350116052193634
        


('SOS a ? ?',
 [array([ 0.60378661,  0.62930697, -0.55687916, -0.15650524]),
  array([ 0.81077324, -0.25395342, -0.30993164, -0.06185999]),
  array([ 0.34698074, -0.86457566,  1.39360451, -0.91708825]),
  array([ 0.34698074, -0.86457566,  1.39360451, -0.91708825])])