# Carlo Merola - Deep Learning Exam 14_06_24
# Student ID: 0001112544



# Sentence Reconstruction

The purpose of this project is to take as input a sequence of words corresponding to a random permutation of a given English sentence, and to reconstruct the original sentence.

The output can be produced either in a single shot or through an iterative (autoregressive) loop that generates one token at a time.
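
To make the task concrete, here is a minimal sketch of how one input/target pair could be built from a sentence. The helper name `make_example` and the fixed seed are illustrative assumptions and are not part of the assignment.

```python
import random

def make_example(sentence, seed=0):
    """Return (shuffled_words, original_words) for one sentence."""
    words = sentence.split()
    shuffled = words[:]                    # copy, so the target stays ordered
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    return shuffled, words

shuffled, target = make_example("some earwigs have stripes on the thorax and abdomen")
print("input :", " ".join(shuffled))
print("target:", " ".join(target))
```

The model receives the shuffled words and has to output them in the order of the target.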


CONSTRAINTS:
* No pretrained model can be used.
* The neural network models should have fewer than 20M parameters.
* No postprocessing should be done (e.g. no beam search)
* You cannot use additional training data.


BONUS PARAMETERS:

A bonus of 0-2 points will be awarded to encourage the adoption of models with a low number of parameters.

# Dataset

The dataset is composed of sentences taken from the generics_kb dataset on Hugging Face. We restricted the vocabulary to the 10K most frequent words, and only took sentences that use this vocabulary exclusively.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.19.2-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests>=2.32.1 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

Download the dataset

In [3]:
from datasets import load_dataset
from keras.layers import TextVectorization
import tensorflow as tf
import numpy as np
np.random.seed(42)
ds = load_dataset('generics_kb',trust_remote_code=True)['train']

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script:   0%|          | 0.00/8.64k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/11.9k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/27.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1020868 [00:00<?, ? examples/s]

### Unprocessed Data Visualization

In [4]:
for textbatch in ds.take(1):
    print('Type of textbatch:',type(textbatch))
    print('Keys in textbatch:',textbatch.keys())
    print('\n')

i = 0
for textbatch in ds.take(3):
    i += 1
    print('Text {}:'.format(i), textbatch['generic_sentence'])

Type of textbatch: <class 'dict'>
Keys in textbatch: dict_keys(['source', 'term', 'quantifier_frequency', 'quantifier_number', 'generic_sentence', 'score'])


Text 1: AA batteries maintain the settings if the power ever goes off.
Text 2: Aardvark females appear to come into season once per year.
Text 3: Aardvark holes are used by small buck as a resting place to escape the midday sun.


Keep only the sentences with more than 8 words.


In [5]:
ds = ds.filter(lambda row: len(row["generic_sentence"].split(" "))>8 )
corpus = [ '<start> ' + row['generic_sentence'].replace(","," <comma>") + ' <end>' for row in ds ]  # add start and end tokens and replace commas with <comma>
corpus = np.array(corpus)


Filter:   0%|          | 0/1020868 [00:00<?, ? examples/s]

### Data visualization in Unprocessed Corpus

In [6]:
print('First sentence in corpus:',corpus[0])

First sentence in corpus: <start> AA batteries maintain the settings if the power ever goes off. <end>


Create a tokenizer and a detokenizer

In [7]:
# the tokenizer maps text to integers to be fed into the model, and pads all sequences to the same length
# max_tokens keeps only the most frequent tokens; the vocabulary is sorted from most to least frequent
tokenizer = TextVectorization(max_tokens=10000, standardize="lower_and_strip_punctuation", encoding="utf-8")

# learn the vocabulary from the corpus
tokenizer.adapt(corpus)

class TextDetokenizer:
    def __init__(self, vectorize_layer):
        self.vectorize_layer = vectorize_layer
        vocab = self.vectorize_layer.get_vocabulary()
        self.index_to_word = {index: word for index, word in enumerate(vocab)}

    def __detokenize_tokens(self, tokens):
        def check_token(t):
          if t == 3:                                                              # 3 is the index for the <start> token
            s="<start>"
          elif t ==2:                                                             # 2 is the index for the <end> token
            s="<end>"
          elif t ==7:                                                             # 7 is the index for the <comma> token
            s="<comma>"
          else:
            s=self.index_to_word.get(t, '[UNK]')                                  # if key found in dict it returns the value, else it returns '[UNK]'
          return s                                                                # 1 is the index of the [UNK] token in the vocabulary

        return ' '.join([ check_token(token) for token in tokens if token != 0])  # 0 is the index for padding

    def __call__(self, batch_tokens):
       return [self.__detokenize_tokens(tokens) for tokens in batch_tokens]



detokenizer = TextDetokenizer( tokenizer )
sentences = tokenizer( corpus ).numpy()


### Visualizing the length of the vocabulary together with its first 20 entries

In [8]:
print('Vocabulary:',tokenizer.get_vocabulary()[:10],tokenizer.get_vocabulary()[10:20])
print('Vocabulary length:',tokenizer.vocabulary_size())

Vocabulary: ['', '[UNK]', 'end', 'start', 'the', 'of', 'and', 'comma', 'is', 'to'] ['a', 'in', 'are', 'that', 'can', 'for', 'or', 'as', 'have', 'with']
Vocabulary length: 10000
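
The vocabulary lists `start`, `end` and `comma` without angle brackets because the `lower_and_strip_punctuation` standardization removes punctuation before the vocabulary is built. This is also why `TextDetokenizer` maps indices 3, 2 and 7 back to the bracketed forms by hand. The sketch below only approximates that standardization with `string.punctuation`; the function name `standardize` is an assumption made for illustration and is not the Keras implementation.

```python
import string

def standardize(text):
    """Rough stand-in for standardize="lower_and_strip_punctuation"."""
    table = str.maketrans("", "", string.punctuation)  # drop ASCII punctuation
    return text.lower().translate(table)

raw = "<start> Healthy wetlands means cleaner water <comma> reduced flooding. <end>"
tokens = standardize(raw).split()
print(tokens)
```

After this step the special markers are ordinary lowercase words, so they receive regular vocabulary indices like any other token.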


Remove from the corpus the sentences in which any unknown word appears

In [9]:
mask = np.sum( (sentences==1) , axis=1) >= 1                # sentences == 1 returns a boolean array with True where the token is the [UNK] token
                                                            # creating a mask with True where the sentence has at least one [UNK] token
original_data = np.delete( sentences, mask , axis=0)

In [10]:
original_data.shape

(241236, 28)

Each sentence is padded to get a fixed vector size of 28 tokens

Shuffle the sentences

In [11]:
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, data, batch_size=32, shuffle=True, seed=42):
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.seed = seed
        self.on_epoch_end()


    def __len__(self):
        return int(np.floor(len(self.data) / self.batch_size))

    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        data_batch = np.array([self.data[k] for k in indexes])
        #copy of ordered sequences
        result = np.copy(data_batch)
        #shuffle only the relevant positions for each batch
        for i in range(data_batch.shape[0]):
          np.random.shuffle(data_batch[i,1:data_batch[i].argmin() - 1])

        return data_batch , result

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.data))
        if self.shuffle:
            if self.seed is not None:
                np.random.seed(self.seed)
            np.random.shuffle(self.indexes)
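
The slice used in `__getitem__` deserves a closer look: `argmin()` returns the index of the first padding token (0), so `1:argmin() - 1` covers only the words between `<start>` and `<end>`. The following sketch reproduces that step on a single toy row; the use of a local `default_rng` generator is an assumption of this example, whereas the class above relies on the global NumPy random state.

```python
import numpy as np

rng = np.random.default_rng(0)   # local generator, an assumption of this sketch
row = np.array([3, 22, 9853, 1157, 49, 2, 0, 0, 0])   # <start> w w w w <end> pad pad pad
original = row.copy()

first_pad = row.argmin()           # index of the first 0 (padding)
rng.shuffle(row[1:first_pad - 1])  # shuffles in place through the slice view
print(original)
print(row)
```

The start token, the end token and the padding keep their positions, and only the inner words are permuted. If a row had no padding at all, `argmin` would return the position of its smallest token instead, so the slice bounds would need a separate check.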

In [12]:
# Make a random permutation of training and test set
np.random.seed(42)
# Shuffle all the data
shuffled_indices = np.random.permutation(len(original_data))
shuffled_data = original_data[shuffled_indices]

In [13]:
#split the dataset
train_generator = DataGenerator(shuffled_data[:220000])
test_generator = DataGenerator(shuffled_data[220000:])

In [14]:
#train_generator = DataGenerator(original_data[:220000])
#test_generator = DataGenerator(original_data[220000:])
x, y = test_generator.__getitem__(1)  # get one batch of data (index 1, i.e. the second batch)

### Visualizing random tokenized sentence

In [15]:
i = np.random.randint(0, x.shape[0])                        # get a random index from the batch
print("Tokenized original: ", y[i])
print("\nTokenized shuffled: ", x[i])
#print('\nBatch Size:{}. Lenght:{}'.format(x.shape[0],x.shape[1]))

Tokenized original:  [   3   22 9853 1157   49  300   16  101  214   39    4   58    5  728
    2    0    0    0    0    0    0    0    0    0    0    0    0    0]

Tokenized shuffled:  [   3  300   39    4   16    5  214   58 1157  101   22 9853  728   49
    2    0    0    0    0    0    0    0    0    0    0    0    0    0]


In [16]:
detok_x = detokenizer(x)
detok_y = detokenizer(y)

for i in range(7):
  print("original - target sequence: ", detok_y[i])
  print("shuffled - input sequence: ", detok_x[i])
  print("\n")

original - target sequence:  <start> ranchers clear large areas of rainforest to become pastures for their cattle <end>
shuffled - input sequence:  <start> large their areas for cattle ranchers rainforest clear pastures become to of <end>


original - target sequence:  <start> some earwigs have stripes on the thorax and abdomen <end>
shuffled - input sequence:  <start> stripes thorax some and the earwigs on abdomen have <end>


original - target sequence:  <start> magnetic manipulation can turn molecules in a liquid into computing such devices <end>
shuffled - input sequence:  <start> into in magnetic such a liquid molecules can manipulation computing turn devices <end>


original - target sequence:  <start> healthy wetlands means cleaner water <comma> reduced flooding and more places for recreation <end>
shuffled - input sequence:  <start> reduced wetlands and recreation for water places healthy cleaner flooding <comma> means more <end>


original - target sequence:  <start> market sh

In [17]:
detokenizer(y)[0:7]

['<start> ranchers clear large areas of rainforest to become pastures for their cattle <end>',
 '<start> some earwigs have stripes on the thorax and abdomen <end>',
 '<start> magnetic manipulation can turn molecules in a liquid into computing such devices <end>',
 '<start> healthy wetlands means cleaner water <comma> reduced flooding and more places for recreation <end>',
 '<start> market share is the percent share in sales one company controls in a particular market <end>',
 '<start> face flies spend only a small amount of time on the animal <end>',
 '<start> organic foods are extremely important in prevention and management of cancer <end>']

# Metrics

Let s be the source string and p your prediction. The quality of the results will be measured according to the following metric:

1.  look for the longest common substring w of s and p
2.  compute |w|/max(|s|,|p|)

If the match is exact, the score is 1.

When computing the score, you should NOT consider the start and end tokens.



The longest common substring can be computed with the SequenceMatcher class of difflib, which allows a simple definition of our metric.

In [18]:
from difflib import SequenceMatcher

def score(s,p):
  match = SequenceMatcher(None, s, p).find_longest_match()
  #print(match.size)
  return (match.size/max(len(p),len(s)))

### Printing Max Length of Sequence

Let's do an example.

In [19]:
original = "at first henry wanted to be friends with the king of france"
generated = "henry wanted to be friends with king of france at the first"

print("your score is ",score(original,generated))

your score is  0.5423728813559322
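
Since the metric must not consider the start and end tokens, detokenized sentences need to be cleaned before scoring. Below is a minimal sketch of one way to do that; `strip_special` is a hypothetical helper introduced for this example, and `score` is redefined here with explicit bounds only so that the block is self-contained.

```python
from difflib import SequenceMatcher

def strip_special(sentence, specials=("<start>", "<end>")):
    """Drop the start/end markers before scoring."""
    return " ".join(w for w in sentence.split() if w not in specials)

def score(s, p):
    # explicit bounds keep this working on Python versions before 3.9
    match = SequenceMatcher(None, s, p).find_longest_match(0, len(s), 0, len(p))
    return match.size / max(len(s), len(p))

target = "<start> some earwigs have stripes on the thorax and abdomen <end>"
pred = "<start> some earwigs have stripes on the abdomen and thorax <end>"
print(score(strip_special(target), strip_special(pred)))
```

Leaving the markers in would add matching characters at both ends of every pair, so stripping them keeps the score tied to the sentence content only.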


### Computing scores, mean and standard deviation for randomly shuffled sequences

In [20]:
scores = []
for i in range(len(detok_x)):
    orig = detok_y[i]
    gen = detok_x[i]
    scores.append(score(orig,gen))

print('Mean score of Not-Reordered Sequences:',np.mean(scores))
print('Std of Not-Reordered Sequences:',np.std(scores))
print('Max Score:',np.max(scores))
print('Min Score:',np.min(scores))

Mean score of Not-Reordered Sequences: 0.1585000396657057
Std of Not-Reordered Sequences: 0.03408054917696136
Max Score: 0.25842696629213485
Min Score: 0.09782608695652174


The score must be computed as an average over at least 3K random examples taken from the test set.

# What to deliver

You are supposed to deliver a single notebook, suitably commented.
The notebook should describe a single model, although you may briefly discuss additional attempts you did.

The notebook should contain a full trace of the training.
Weights should be made available on request.

You must also give a clear assessment of the performance of the model, computed with the metric that has been given to you.

# Good work!

# Model Description:
For this project I chose a Transformer model to process the token sequences.
The choice was motivated by the advantages of the Transformer architecture over a classical LSTM network for sequence-to-sequence problems:

1. Transformer models allow more efficient parallelization during training than LSTMs.
2. In an LSTM, information has to propagate through every time step, whereas a Transformer can capture dependencies between tokens regardless of their distance in the sequence, thanks to the attention mechanism.
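
The second point can be illustrated with a single-head scaled dot-product attention written in NumPy. This is a simplified sketch (no masking, no learned projections, toy dimensions chosen for the example) and not the Keras layer used later in the notebook.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(d_k)) v for a single head, no masking."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                  # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))        # 5 tokens, 8-dimensional embeddings
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)
```

Every row of `weights` relates one token to all the others in a single matrix product, which is what lets distant tokens interact without stepping through the sequence.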

## Encoder-Decoder structure of a Transformer
##### Encoder Layer:
* Multi-Head Self-Attention Layer
* Position-wise fully connected Feed-Forward Neural Network Layer

##### Decoder Layer:
* Multi-Head Self-Attention Layer
* Encoder-Decoder Cross-Attention Layer to focus on relevant parts of the input sequence
* Position-wise fully connected Feed-Forward Neural Network Layer

In [21]:
import keras
from keras import layers, Model
import os
os.environ["KERAS_BACKEND"] = "tensorflow"

## Defining Token and Position Embedding layer
This layer transforms the integer-encoded input tokens into dense, continuous vector representations of a specified dimensionality.

The embedding layer is placed outside the Encoder and Decoder so that both *share* the same learned embedding space, which *reduces* the number of parameters of the network.

The token embedding layer masks zero tokens so that padding is not processed. This mask is propagated to subsequent layers that support it.
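
As a rough picture of what this layer computes, the sketch below performs the two table lookups and builds the padding mask with plain NumPy. The table sizes and the random initial values are assumptions chosen for the example; in the notebook both tables are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, maxlen, d_model = 50, 6, 4           # toy sizes, chosen for the example
token_table = rng.normal(size=(vocab_size, d_model))
pos_table = rng.normal(size=(maxlen, d_model))

tokens = np.array([[3, 12, 7, 2, 0, 0]])         # one padded sequence
embedded = token_table[tokens] + pos_table[np.arange(tokens.shape[1])]
padding_mask = tokens != 0                       # True for real tokens

print(embedded.shape)     # (batch, seq_len, d_model)
print(padding_mask)
```

The same position vector is added to whichever token sits at that position, and the mask marks which positions later layers should attend to.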

In [44]:
# output dim. (num_seq, seq_len, d_model)
# d_model = dimensionality of the embedding space
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, d_model):
        super().__init__()
        self.maxlen = maxlen
        self.d_model = d_model
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=d_model, mask_zero=True) # mask_zero=True to avoid processing padding tokens. Mask is propagated to subsequent layers - like attention layer

        # position embedding learned during training. Unlike the fixed sinusoidal encoding, it can adapt to positional patterns specific to the data
        # the embedding table has one entry per position, up to the maximum sequence length
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=d_model)

    def call(self, x, pos=True):
        length = tf.shape(x)[-1]
        x = self.token_emb(x)
        #x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))  # scale
        if pos:
            positions = tf.range(start=0, limit=length, delta=1)                            # holder for the positions of the tokens
            positions = self.pos_emb(positions)
            #positions = self.positional_encoding(self.d_model)[:, :length,:]
            x += positions                                                                  # returns one single output with positions embeddings added to the token embeddings
        return x                                                                            # without positional embedding will treat the sentence as a bag of words


    # override the compute_mask method so that the mask computed by the token embedding layer is propagated to the next layers
    def compute_mask(self, inputs, mask=None):
        return self.token_emb.compute_mask(inputs, mask)


    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
        return pos * angle_rates

    def positional_encoding(self, d_model, position=2048):
        angle_rads = self.get_angles(np.arange(position)[:, np.newaxis],
                                np.arange(d_model)[np.newaxis, :],
                                d_model)

        # apply sin to even indices in the array; 2i
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

        # apply cos to odd indices in the array; 2i+1
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

        pos_encoding = angle_rads[np.newaxis, ...]

        return tf.cast(pos_encoding, dtype=tf.float32)


## Defining Self-Attention Mechanism for the Transformer
##### This class includes self-attention, causal self-attention and cross-attention layers.
##### * For the model to achieve its best results it was ESSENTIAL to include a padding mask, so that attention covers only meaningful tokens. The padding mask is computed and propagated by the embedding layer.

##### * For the decoder's causal self-attention, the padding mask is combined with the causal mask. The causal mask ensures that, during the self-attention computation, each position can attend only to the current position and the ones preceding it.

##### * The cross-attention layers are used so that the decoder can focus on the relevant parts of the encoder output.
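
The combination of the two masks can be checked on a toy sequence. The NumPy sketch below mirrors the broadcasting used in the class that follows, under the assumption that 1 means "may attend" and 0 means "blocked".

```python
import numpy as np

tokens = np.array([[3, 12, 7, 2, 0]])                 # last position is padding
seq_len = tokens.shape[1]

causal = np.tril(np.ones((seq_len, seq_len), dtype=np.int32))   # 1 where key <= query
padding = (tokens != 0).astype(np.int32)[:, np.newaxis, :]      # (batch, 1, seq_len)
combined = padding * causal                                      # (batch, seq_len, seq_len)
print(combined[0])
```

The column of the padded position is zero in every row, so no query can attend to padding, while the lower-triangular pattern blocks attention to future tokens.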


In [45]:
class AttentionMechanism(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dropout_rate=0.1):
        super().__init__()
        assert d_model % num_heads == 0                                                 # verify that the model dimensions are divisible by the number of heads

        self.self_mha = layers.MultiHeadAttention(num_heads=num_heads,
                                                  key_dim=d_model//num_heads)
        self.causal_mha = layers.MultiHeadAttention(num_heads=num_heads,
                                                    key_dim=d_model//num_heads)
        self.cross_mha = layers.MultiHeadAttention(num_heads=num_heads,
                                                   key_dim=d_model//num_heads)

        self.layernorm = layers.LayerNormalization(epsilon=1e-6)                        # normalize the output of the multihead attention
        #self.dropout = layers.Dropout(dropout_rate)                                    # dropout layer to set random values of attention weights to 0
        self.add = layers.Add()                                                         # define the addition layer to add residual connections

    def call(self, x, enc_output=None, causal_mask=None, mask=None):
        # causal self-attention (decoder)
        if causal_mask is not None:
            if mask is not None:
                # combine the padding mask with the causal mask
                mask = tf.cast(mask, dtype=tf.int32)
                padding_mask = tf.expand_dims(mask, axis=1)  # shape: (batch_size, 1, seq_length)
                combined_mask = padding_mask * causal_mask   # shape: (batch_size, seq_length, seq_length)
                attn_output = self.causal_mha(query=x, key=x, value=x, attention_mask=combined_mask)
            else:
                attn_output = self.causal_mha(query=x, key=x, value=x, attention_mask=causal_mask)

        # cross-attention: the padding mask is propagated by the embedding layer
        elif enc_output is not None:
            attn_output = self.cross_mha(query=x, key=enc_output, value=enc_output)

        # self-attention: the padding mask is propagated by the embedding layer
        else:
            attn_output = self.self_mha(query=x, key=x, value=x)

        #attn_output = self.dropout(attn_output)   # dropout on the attention output (disabled)
        x = self.add([x, attn_output])             # residual connection: add the attention output to the input
        x = self.layernorm(x)
        return x

## Defining the Position Wise Feed-Forward Layer
The weights of the dense layers are initialized with GlorotUniform. This initialization helps mitigate vanishing or exploding gradients.

The FFN output is added to its input (the attention output) through a skip connection, and the sum is then layer-normalized.
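
To show the order of operations, here is a small NumPy sketch of the block: Glorot-uniform weights, a ReLU expansion, a projection back, the residual sum and a layer normalization. Biases, dropout and the learned scale and shift of the normalization are left out, and all sizes are toy values assumed for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)     # no learned scale/shift in this sketch

def glorot_uniform(rng, fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))  # Glorot/Xavier uniform bound
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
d_model, dff = 8, 32
w1, w2 = glorot_uniform(rng, d_model, dff), glorot_uniform(rng, dff, d_model)

x = rng.normal(size=(5, d_model))
ffn_out = np.maximum(x @ w1, 0.0) @ w2         # Dense(relu) then Dense, biases omitted
y = layer_norm(x + ffn_out)                    # residual connection, then normalization
print(y.shape)
```

The output keeps the shape of the input, which is what allows these blocks to be stacked.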

In [46]:
class PositionWiseFeedForward(tf.keras.layers.Layer):
    def __init__(self, d_model, dff, dropout_rate=0.1):
        initializer = tf.keras.initializers.GlorotUniform(seed=42)               # initializer for the weights of the FFN layers

        # first layer: projects from d_model to the higher-dimensional dff space to capture more complex patterns
        # second layer: projects back to d_model, the dimension of each token embedding
        super().__init__()
        self.ffn = tf.keras.Sequential([
            layers.Dense(dff, activation='relu', kernel_initializer=initializer),   # first layer of the FFN to increase the dimensionality of the input and capture more complex patterns
            layers.Dense(d_model, kernel_initializer=initializer)                   # second layer of the FFN to return to the original dimension
            #,layers.Dropout(dropout_rate)                                          
        ])
        self.dropout = layers.Dropout(dropout_rate)                                 # dropout layer after the FFN

        self.add = layers.Add()
        self.layernorm = layers.LayerNormalization(epsilon=1e-6)                    # normalize the output of the FFN

    # forward computation performed when the layer is called - overrides Layer.call
    def call(self, x, training=False):
        ffnout = self.ffn(x)                                                        # feed the input to the FFN
        ffnout = self.dropout(ffnout, training=training)                            # apply dropout to the output of the FFN
        x = self.add([x, ffnout])                                                   # including the residual connection
        x = self.layernorm(x)
        return x

## Defining an Encoder with multiple Encoder Layers
The input to the Encoder is a shuffled sequence.
We want the model to learn the relationships between the positions of the tokens in the shuffled sequence and their positions in the original sequence.
Position embedding is therefore crucial. Each encoder layer consists of a self-attention block followed by a feed-forward network.

In [47]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, dropout_rate =0.1):#maxlen, vocab_size, dropout_rate=0.1):
        super().__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        #self.embedding = TokenAndPositionEmbedding(maxlen, vocab_size, d_model)
        self.dropout = layers.Dropout(dropout_rate)                                                     # dropout layer to prevent overfitting

        self.enc_layers = [ tf.keras.Sequential([
                            AttentionMechanism(d_model, num_heads, dropout_rate=dropout_rate),
                            PositionWiseFeedForward(d_model, dff) ])                                    # dropout is applied inside PositionWiseFeedForward (default rate 0.1)
                           for _ in range(num_layers) ]                                                 # repeat the sequential stack - encoder layers for num_layers times

    def call(self, x, training=False):
        #x = self.embedding(x)                                                                          # the encoder receives already-embedded inputs. The embedding layer is defined outside to share weights with the decoder
        x = self.dropout(x, training=training)                                                          # apply dropout layer to the Embedding layer output
                                                                                                        # to regularize input representations before feeding them to the encoder layers
        for enc_layer in self.enc_layers:                                                               # adding num_layers of encoder layers iterating over list
            x = enc_layer(x)

        return x

## Defining Decoder Layer
##### Each decoder layer computes a causal mask and passes it to the self-attention mechanism, so that each position attends only to past tokens in the sequence and cannot peek at future ones.

In [48]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super().__init__()

        self.attention = AttentionMechanism(d_model, num_heads, dropout_rate)
        self.ffn = PositionWiseFeedForward(d_model, dff)

    def call(self, x, enc_output, training=False):
        causal_mask = self.get_causal_attention_mask(x)                         # get the causal mask for the decoder
        x = self.attention(x=x, causal_mask=causal_mask)                        # causal self-attention. Causal mask is used to prevent the model from peeking at the future tokens
        x = self.attention(x=x, enc_output=enc_output)                          # cross-attention
        x = self.ffn(x)                                                         # position-wise feed forward network adds attention outputs to ffn outputs and normalizes the output
        return x

    # Create lower triangular matrix to be used as a mask for the decoder.
    # This mask will be used to prevent the decoder from peeking at the future tokens.
    # For sequence of length 4, the mask will look like this:
    # Index 1 - [1,0,0,0], Index 2 - [1,1,0,0], Index 3 - [1,1,1,0], Index 4 - [1,1,1,1]
    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]

        i = tf.range(sequence_length)[:, None]  # create a column vector of shape (sequence_length, 1)
        j = tf.range(sequence_length)

        mask = tf.cast(i >= j, dtype=tf.int32)  # create a matrix of shape (sequence_length, sequence_length)
        mask = tf.reshape(mask, (1, sequence_length, sequence_length))

        mask = tf.tile(mask, tf.concat([[batch_size], [1], [1]], axis=0))  # tile the mask to match the shape of the inputs - shape (batch_size, seqlen, seqlen)

        return mask

## Defining Decoder as Multiple Stacks of Decoder Layers

In [49]:
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, maxlen, dropout_rate=0.1):#vocab_size, dropout_rate=0.1):
        super().__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        #self.embedding = TokenAndPositionEmbedding(maxlen, vocab_size, d_model)
        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]
        self.dropout = layers.Dropout(dropout_rate)

    def call(self, target_seq, enc_output, training=False):
        x = self.dropout(target_seq, training=training)                     # apply dropout to the embedding output given as input to the decoder
                                                                            # The embeddings are computed in Transformer before this block, to share weights with the encoder
        # iterating over list of decoder layers
        for i in range(self.num_layers):
            x = self.dec_layers[i](x, enc_output, training=training)        # passing target sequence and encoder output to the decoder layer. Encoder output is used in cross-attention

        return x

## Putting everything together in Transformer Model
##### The token embedding space is shared between Encoder and Decoder so that the network uses fewer parameters.

In [50]:
# subclassing the Model class to define the Transformer model
class Transformer(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, maxlen, vocab_size, dropout_rate=0.1):
    super().__init__()
    self.start_token = 3
    self.end_token = 2
    self.maxlen = maxlen

    self.embedding = TokenAndPositionEmbedding(maxlen,
                                               vocab_size,
                                               d_model)

    self.encoder = Encoder(num_layers=num_layers,
                           d_model=d_model,
                           num_heads=num_heads,
                           dff=dff,
                           #vocab_size=vocab_size,
                           dropout_rate=dropout_rate)

    self.decoder = Decoder(num_layers=num_layers,
                           d_model=d_model,
                           num_heads=num_heads,
                           dff=dff,
                           maxlen=maxlen,
                           #vocab_size=vocab_size,
                           dropout_rate=dropout_rate)

    self.final = tf.keras.layers.Dense(vocab_size, activation='softmax')

  def call(self, inputs, training=False):
    input_seq, target_seq = inputs
    enc_input = self.embedding(input_seq, pos=True)                       # embedding the input sequences
    dec_input = self.embedding(target_seq, pos=True)                      # embedding the target sequences
    enc_output = self.encoder(enc_input, training=training)               # passing the embedded input sequences to the encoder
    dec_output = self.decoder(dec_input, enc_output, training=training)   # passing the embedded target sequences and encoder output to the decoder
    out = self.final(dec_output)                                          # adding a final layer to the decoder to output the token probabilities

    return out

# Setting the parameters and visualizing the model summary

In [51]:
num_layers = 4
d_model = 256
dff = 1024
num_heads = 8
vocab_size = tokenizer.vocabulary_size()
maxlen = x.shape[1]  # 28
dropout_rate = 0.2

tblock = Transformer(num_layers=num_layers,
                    d_model=d_model,
                    num_heads=num_heads,
                    dff=dff,
                    maxlen=maxlen,
                    vocab_size=vocab_size,
                    dropout_rate=dropout_rate)

# initialize the model with a batch of data to build the model and print the summary
x_batch, y_batch = train_generator.__getitem__(1)
batch_inputs = (x_batch, y_batch)
_  = tblock(batch_inputs)
tblock.summary()


Model: "transformer_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 token_and_position_embeddi  multiple                  2567168   
 ng_1 (TokenAndPositionEmbe                                      
 dding)                                                          
                                                                 
 encoder_1 (Encoder)         multiple                  3159040   
                                                                 
 decoder_1 (Decoder)         multiple                  4211712   
                                                                 
 dense_33 (Dense)            multiple                  2570000   
                                                                 
Total params: 12507920 (47.71 MB)
Trainable params: 12507920 (47.71 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


### The model has around 12.5M parameters.
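
The figures in the summary can be cross-checked by hand. The breakdown below is my reading of the layer definitions above, assuming that a `MultiHeadAttention` sub-layer only creates weights once it is actually called and that each `AttentionMechanism` owns a single `LayerNormalization`. Under those assumptions the arithmetic reproduces every row of the summary.

```python
d_model, dff, vocab_size, maxlen, num_layers = 256, 1024, 10000, 28, 4

embedding = vocab_size * d_model + maxlen * d_model      # token + position tables
mha = 4 * (d_model * d_model + d_model)                  # query, key, value, output projections
layernorm = 2 * d_model                                  # gamma and beta
ffn = (d_model * dff + dff) + (dff * d_model + d_model)  # two dense layers with biases

encoder_layer = mha + layernorm + ffn + layernorm
decoder_layer = 2 * mha + layernorm + ffn + layernorm    # causal + cross attention share one LayerNorm
final_dense = d_model * vocab_size + vocab_size

total = embedding + num_layers * (encoder_layer + decoder_layer) + final_dense
print(embedding, num_layers * encoder_layer, num_layers * decoder_layer, final_dense, total)
```

About 41% of the parameters sit in the embedding table and the final projection (2,567,168 + 2,570,000 out of 12,507,920), so the vocabulary size weighs on the budget almost as much as the depth of the network.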

##### Defining a masked loss so that only meaningful tokens contribute (no padding, no start token)

In [30]:
def masked_loss(y_true, y_pred):
    # Build a mask that is 1 for real tokens and 0 for padding (id 0) and start (id 3) tokens
    sos = 3
    padding = 0

    mask = tf.cast(
        tf.math.logical_and(
            tf.math.not_equal(y_true, padding),
            tf.math.not_equal(y_true, sos)),
        dtype=tf.float32
    )

    loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)

    # Apply the mask to the loss
    loss *= mask

    # Compute the average loss, ignoring the padding tokens and start tokens
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)
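
To make the masking concrete, here is a framework-free sketch of the same idea: the per-token cross-entropy is averaged only over positions whose label is neither padding nor start. The function name `masked_cross_entropy` and the toy probabilities are illustrative assumptions, not part of the notebook. Returning 0.0 for a fully masked input is also a choice of this sketch.

```python
import math

PAD, SOS = 0, 3  # token ids used in the notebook

def masked_cross_entropy(y_true, y_prob):
    """Mean negative log-likelihood over tokens that are neither PAD nor SOS.

    y_true: list of token ids
    y_prob: list of probability lists, one per position
    """
    total, count = 0.0, 0
    for token, probs in zip(y_true, y_prob):
        if token in (PAD, SOS):
            continue  # masked position: contributes neither to the sum nor to the count
        total += -math.log(probs[token])
        count += 1
    return total / count if count else 0.0

# Toy example with a 5-token vocabulary: two real tokens followed by two padding positions
y_true = [4, 2, PAD, PAD]
y_prob = [
    [0.1, 0.1, 0.1, 0.1, 0.6],
    [0.1, 0.1, 0.5, 0.2, 0.1],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2, 0.2],
]
loss = masked_cross_entropy(y_true, y_prob)
print(loss)
```

Only the first two positions enter the average, so the padded tail of a short sentence does not dilute the loss.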

In [31]:
# accuracy computed only on real tokens: padding and start tokens are ignored
def masked_accuracy(label, pred):
  sos,padding=3,0
  pred = tf.argmax(pred, axis=-1)
  label = tf.cast(label, pred.dtype)
  same = label == pred

  mask = (label!=padding)&(label!=sos)
  same = same & mask

  same = tf.cast(same, dtype=tf.float64)
  mask = tf.cast(mask, dtype=tf.float64)
  return tf.reduce_sum(same)/tf.reduce_sum(mask)

## During training the decoder is fed the ground-truth token from the previous time step (teacher forcing). The decoder input therefore begins with the < start > token, placed before the first word of the target sequence.

##### The dataset has been saved locally so that the target sequence, shifted by one position with respect to the decoder input, can be built once and passed as the decoder target.

##### This is needed because the decoder has to predict the next token in the sequence given the tokens up to the current one.

Saving the dataset locally also saved a good amount of resources when training on Colab.
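
The one-position shift can be illustrated on a single sentence. The token ids 11, 12 and 13 are made up for the example; the special ids (start = 3, end = 2, padding = 0) are the ones used throughout the notebook.

```python
SOS, EOS, PAD = 3, 2, 0

# Hypothetical tokenized and padded sentence: <start> w1 w2 w3 <end> <pad> <pad>
ordered = [SOS, 11, 12, 13, EOS, PAD, PAD]

decoder_input = ordered[:-1]  # what the decoder sees: starts with <start>
target = ordered[1:]          # what the decoder must predict: shifted left by one

# At every step the decoder sees the previous ground-truth token and predicts the next one
for step, (seen, expected) in enumerate(zip(decoder_input, target)):
    print(step, seen, "->", expected)
```

The two sequences have the same length, and the target no longer contains the start token.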

In [32]:
# Saving dataset locally
def save_ds(train_generator, test_generator):
    train_shuff = []  
    train_ord = []  

    sos = 3
    eos = 2
    for i in range(len(train_generator)):
        data_batch, res = train_generator[i] 
        train_shuff.extend(data_batch)
        train_ord.extend(res)

    # Concatenate all batches into a single array
    train_ord = tf.convert_to_tensor(np.array(train_ord), dtype=tf.int32)
    train_shuff = tf.convert_to_tensor(np.array(train_shuff), dtype=tf.int32)
    train_target = train_ord[:, 1:]  # drop the <start> token: the target is the sequence shifted left by one, since the model predicts the next token given the previous ones
    dec_in = train_ord[:, :-1]       # drop the last position so that decoder input and target have the same length

    test_shuff = []  
    test_ord = []   
    for i in range(len(test_generator)):
        data_batch, res = test_generator[i]
        test_shuff.extend(data_batch)
        test_ord.extend(res)

    test_shuff = tf.convert_to_tensor(np.array(test_shuff))
    test_ord = tf.convert_to_tensor(np.array(test_ord))

    return train_shuff, dec_in, train_target, test_shuff, test_ord

train_shuff, dec_in, train_target, test_shuff, test_ord = save_ds(train_generator, test_generator)

# Training

Using the learning rate schedule commonly adopted for Transformers: the learning rate increases linearly during the warmup period and then decays proportionally to the inverse square root of the step number.

In [33]:
from keras.callbacks import ModelCheckpoint, EarlyStopping, LearningRateScheduler

In [34]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super().__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    step = tf.cast(step, dtype=tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

  def get_config(self):
      return {
          'd_model': self.d_model.numpy(),
          'warmup_steps': self.warmup_steps
      }
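
The formula implemented by `CustomSchedule` can be evaluated without TensorFlow to see its shape. This is a plain-Python sketch of the same expression; the function name `transformer_lr` is an assumption made for the example.

```python
def transformer_lr(step, d_model=256, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""
    if step <= 0:
        return 0.0  # the warmup branch is zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear growth up to warmup_steps, then inverse-square-root decay
for step in (1, 1000, 2000, 4000, 8000, 16000):
    print(step, transformer_lr(step))
```

With `d_model=256` and `warmup_steps=4000` the learning rate peaks at step 4000, at roughly 1e-3.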

### Since the final score is based on the longest matching substring, during training the accuracy of the model has been taken into account more than the loss alone: checkpointing and early stopping both monitor the validation masked accuracy.

In [55]:
w_path = '/content/drive/MyDrive/DLExam/Weights/'
model_name = 'transformer_teachforce_4_256_1024_8_02.h5'

# mode='max' makes it explicit that a higher validation accuracy is better
callback1 = ModelCheckpoint(filepath=w_path+model_name, save_best_only=True, save_weights_only=True, monitor='val_masked_accuracy', mode='max')
callback2 = EarlyStopping(monitor='val_masked_accuracy', mode='max', patience=3, restore_best_weights=True)
callbacks = [callback1, callback2]
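
The effect of `patience=3` on a maximized metric can be sketched in a few lines. This is a simplified illustration of the patience rule, not Keras's implementation, and the accuracy values are invented for the example.

```python
def early_stopping_epoch(values, patience=3):
    """Return (stop_epoch, best_epoch) for a metric that should be maximized.

    Training stops after `patience` consecutive epochs without a strict improvement.
    """
    best = float("-inf")
    best_epoch = -1
    wait = 0
    for epoch, value in enumerate(values):
        if value > best:
            best, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(values) - 1, best_epoch

# Invented validation accuracies, one per epoch
val_accuracy = [0.50, 0.55, 0.58, 0.57, 0.58, 0.56, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_accuracy, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

In this toy run training stops at epoch index 5, and with `restore_best_weights=True` the weights of epoch index 2 would be the ones kept.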

In [53]:
EPOCHS=16
learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

In [None]:
tblock.compile(optimizer=optimizer, loss=masked_loss, metrics=masked_accuracy)
history = tblock.fit( x=[train_shuff, dec_in], y=train_target, epochs=EPOCHS, batch_size=256, validation_split=0.1, callbacks=callbacks)

Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


### To achieve a better accuracy I have chosen to train for 10 more epochs, despite the validation loss increasing. A higher accuracy means a higher probability that consecutive words end up in the correct position.

In [None]:
EPOCHS_2=10
history = tblock.fit( x=[train_shuff, dec_in], y=train_target, epochs=EPOCHS_2, batch_size=256, validation_split=0.1, callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Seeing that the results kept improving, I opted to push val_masked_accuracy even further, letting the early stopping callback end the training.

In [38]:
EPOCHS_3 = 30
tblock.compile(optimizer=optimizer, loss=masked_loss, metrics=masked_accuracy)
history = tblock.fit( x=[train_shuff, dec_in], y=train_target, epochs=EPOCHS_3, batch_size=1024, validation_split=0.1, callbacks=callbacks)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30


# Just a Test


In [58]:
def predict(input_sequence, maxlen=28, model=tblock):
    # Greedy autoregressive decoding for a single sentence (batch of size 1)
    sos = 3
    eos = 2

    generated = tf.constant([[sos]], dtype=tf.int64)   # shape (1, 1): start token with a batch dimension
    for _ in range(maxlen):
        predictions = model([input_sequence, generated], training=False)   # pass through the model
        predicted_id = tf.argmax(predictions[:, -1, :], axis=-1)           # most probable token at the last time step, shape (1,)

        # Append the predicted token to the decoder input for the next iteration
        generated = tf.concat([generated, predicted_id[:, tf.newaxis]], axis=-1)
        if int(predicted_id[0]) == eos:  # stop as soon as the end token is predicted
            break

    return np.array(generated)

for _ in range(3):
    i = np.random.randint(0, test_shuff.shape[0])                                   # get a random index from the test set
    pred = predict(test_shuff[i:i+1], maxlen=maxlen, model=tblock)                  # get the model's prediction for the input sequence
    print('Input sequence:', detokenizer(np.array(test_shuff[i:i+1])))              # print the input sequence
    print('Target sequence:', detokenizer(np.array(test_ord[i:i+1])))               # print the target sequence
    print('Predicted sequence:', detokenizer((pred[0:1])))                          # print the predicted sequence
    print('\n')

Input sequence: ['<start> at many of history levels interactions through emerges a form hierarchical <end>']
Target sequence: ['<start> form emerges through a history of interactions at many hierarchical levels <end>']
Predicted sequence: ['<start> history emerges at many levels of interactions through a hierarchical form <end>']


Input sequence: ['<start> blood synthesis is of necessary acid red of formation folic acids and the cells the for nucleic <end>']
Target sequence: ['<start> folic acid is necessary for the synthesis of nucleic acids and the formation of red blood cells <end>']
Predicted sequence: ['<start> folic acid is necessary for the formation of the red blood cells and the synthesis of nucleic <end>']


Input sequence: ['<start> <comma> combinations <comma> pictures and <comma> sounds have of a websites graphics words <end>']
Target sequence: ['<start> websites have a combinations of words <comma> graphics <comma> pictures <comma> and sounds <end>']
Predicted sequence: 

# Results

### To calculate the scores, each sentence is taken from the < start > token to the < end > token, both excluded.
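
The trimming rule can be shown on plain lists of token ids. The helper name `strip_delimiters` and the ids 11 and 12 are assumptions made for the example; as in the scoring code of this section, the first occurrence of each delimiter is used and a sequence missing either one is discarded.

```python
SOS, EOS, PAD = 3, 2, 0

def strip_delimiters(tokens):
    """Return the tokens strictly between the first SOS and the first EOS.

    Returns None when either delimiter is missing.
    """
    if SOS not in tokens or EOS not in tokens:
        return None
    start = tokens.index(SOS)
    end = tokens.index(EOS)
    return tokens[start + 1:end]

print(strip_delimiters([SOS, 11, 12, EOS, PAD, PAD]))  # [11, 12]
print(strip_delimiters([SOS, 11, 12]))                 # None
```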

In [39]:
def predict_batches(input_sequence, maxlen=28, model=None):
    # Greedy autoregressive decoding for a whole batch of sentences
    sos = 3
    eos = 2
    batch_size = input_sequence.shape[0]

    generated = tf.constant([[sos]] * batch_size, dtype=tf.int64)          # initialize with the start token for each batch element
    finished = np.zeros(batch_size, dtype=bool)                             # True once a sequence has produced the end token
    for _ in range(maxlen):
        predictions = model([input_sequence, generated], training=False)    # pass through the model
        predicted_ids = tf.argmax(predictions[:, -1, :], axis=-1).numpy()   # most probable token at the last time step
        new_tokens = tf.expand_dims(predicted_ids, axis=1)                  # reshape to (batch_size, 1)

        generated = tf.concat([generated, new_tokens], axis=-1)             # append the new tokens

        finished |= (predicted_ids == eos)
        if finished.all():                                                  # stop once every sequence has emitted the end token
            break

    return np.array(generated)
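
The batched loop can stop early once every sequence has produced the end token at some step. Below is a framework-free sketch of that stopping rule; `toy_next_token` is a hypothetical stand-in for the model's greedy prediction, and the token ids 10, 11 and 20 are invented.

```python
SOS, EOS = 3, 2

def greedy_decode(next_token, batch_size, maxlen=28):
    """next_token(b, prefix) returns the most probable next token id for sequence b."""
    generated = [[SOS] for _ in range(batch_size)]
    finished = [False] * batch_size
    for _ in range(maxlen):
        for b in range(batch_size):
            token = next_token(b, generated[b])
            generated[b].append(token)  # every row grows, so the batch stays rectangular
            finished[b] = finished[b] or token == EOS
        if all(finished):               # every sequence has emitted EOS at least once
            break
    return generated

def toy_next_token(b, prefix):
    """Hypothetical model: replays a fixed answer per sequence, then keeps emitting EOS."""
    answers = [[10, 11, EOS], [20, EOS]]
    position = len(prefix) - 1
    return answers[b][position] if position < len(answers[b]) else EOS

print(greedy_decode(toy_next_token, batch_size=2))
# [[3, 10, 11, 2], [3, 20, 2, 2]]
```

The shorter sequence keeps being extended after its first end token, which is harmless because scoring only reads the tokens up to the first end token.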


scores = []
def calculate_scores(model, predict, test_shuff, test_ord, detokenizer, scoring, n=None, batch_size=32, prints=-1):
    if n is None:
        n = test_shuff.shape[0]

    sos, eos = 3, 2
    #assert n>=batch_size
    for i in range(0, n, batch_size):
        batch_end = min(i + batch_size, n)
        preds = predict(test_shuff[i:batch_end], model=model)
        for j in range(preds.shape[0]):
            if preds.shape[1] == 0:
                continue

            pred_sindx = np.where(preds[j] == sos)[0]
            pred_eindx = np.where(preds[j] == eos)[0]

            # predictions without a start or end token are skipped, so they do not enter the mean
            if len(pred_sindx) == 0 or len(pred_eindx) == 0:
                continue

            pred_sindx = pred_sindx[0]
            pred_eindx = pred_eindx[0]

            ordered = test_ord[i + j:i + j + 1].numpy()
            ord_sindx = np.where(ordered[0] == sos)[0]
            ord_eindx = np.where(ordered[0] == eos)[0]

            if len(ord_sindx) == 0 or len(ord_eindx) == 0:
                continue

            ord_sindx = ord_sindx[0]
            ord_eindx = ord_eindx[0]

            # keep only the tokens strictly between <start> and <end>
            orig = detokenizer(ordered[0:1, ord_sindx+1:ord_eindx])
            gen = detokenizer(preds[j:j + 1, pred_sindx+1:pred_eindx])
            scores.append(scoring(orig[0], gen[0]))

            # print inside the scored branch and read the last appended score,
            # so that a skipped sample cannot shift the index or leave orig/gen undefined
            if i + j < prints:
                print(f'Original_{i + j + 1}:', orig[0])
                print(f'Generated_{i + j + 1}:', gen[0])
                print(f'Score_{i + j + 1}:', scores[-1])
                print('\n')

    mean = np.mean(scores)
    std = np.std(scores)

    return scores,mean, std
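
The `scoring` argument is the metric function defined earlier in the notebook. For reference, here is a minimal sketch of a longest-common-substring score, assuming it is the length of the longest matching block divided by the length of the longer string. This assumption reproduces `Score_1` in the output below, but the notebook's own `score` function remains the authoritative definition; the name `lcs_score` is introduced only for this example.

```python
from difflib import SequenceMatcher

def lcs_score(original, generated):
    """Length of the longest common substring over the length of the longer sentence."""
    matcher = SequenceMatcher(None, original, generated, autojunk=False)
    match = matcher.find_longest_match(0, len(original), 0, len(generated))
    return match.size / max(len(original), len(generated))

original = "recycling prevents pollution and helps conserve precious natural resources"
generated = "recycling helps conserve precious natural resources and prevents pollution"
print(lcs_score(original, generated))
print(lcs_score(original, original))  # 1.0
```

Because the metric rewards one long contiguous run, a reconstruction that swaps two clauses is penalized heavily even when every clause is internally correct.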

In [60]:
scores=[]                   # reset the scores
N_EL = test_shuff.shape[0]  # number of elements to test
PRINT_PREDS = 5             # print the first n predictions
PRED_BATCH = 1024

scores, mean, std = calculate_scores(tblock, predict_batches, test_shuff, test_ord, detokenizer, score, n=N_EL, batch_size=PRED_BATCH, prints=PRINT_PREDS)

print('Model tested on {} elements of test-set:'.format(N_EL))
print('\nMean value of scorings:', mean)
#print('Std of scorings:', std)
#print('Max. score:', np.max(scores))
#print('Min. score:', np.min(scores))

Original_1: recycling prevents pollution and helps conserve precious natural resources
Generated_1: recycling helps conserve precious natural resources and prevents pollution
Score_1: 0.5675675675675675


Original_2: predators like to eat banana slugs at all stages of their lives
Generated_2: predators like to eat banana slugs at all stages of their lives
Score_2: 1.0


Original_3: corporate profits drive stock prices and corporations have been getting serious about business
Generated_3: corporations have serious profits about getting stock and corporate profits before business prices
Score_3: 0.1836734693877551


Original_4: snails are hermaphrodites but they have to mate before laying some days later
Generated_4: some snails have hermaphrodites but they are laying to mate later days before
Score_4: 0.33766233766233766


Original_5: abuse can affect any family regardless of income <comma> profession <comma> religion <comma> or education
Generated_5: abuse can affect any profession <co

# The Model Achieves an Average Score of 0.548848

## Different Models and Approaches:
### Other models have been tested to reduce the number of parameters.
### For example, a good model has been obtained by reducing the width (d_model=128 and dff=512) and increasing the depth to 10 layers. This model has around 7M parameters in total and achieves a score of 0.5125.
### Another model that achieved good results keeps the width proposed here but increases the number of layers. To avoid increasing the number of parameters too much, though, I have chosen to stick to 4 layers for the encoder and decoder.