<a href="https://colab.research.google.com/github/UtpalMattoo/NLP-Attention-Transformers/blob/master/Transformer_Data_Prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2019 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Transformer model for language understanding

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/text/transformer">
    <img src="https://www.tensorflow.org/images/tf_logo_32px.png" />
    View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/transformer.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/transformer.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/text/transformer.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

This tutorial trains a <a href="https://arxiv.org/abs/1706.03762" class="external">Transformer model</a> to translate Portuguese to English. This is an advanced example that assumes knowledge of [text generation](text_generation.ipynb) and [attention](nmt_with_attention.ipynb).

The core idea behind the Transformer model is *self-attention*—the ability to attend to different positions of the input sequence to compute a representation of that sequence. Transformer creates stacks of self-attention layers and is explained below in the sections *Scaled dot product attention* and *Multi-head attention*.

A transformer model handles variable-sized input using stacks of self-attention layers instead of [RNNs](text_classification_rnn.ipynb) or [CNNs](../images/intro_to_cnns.ipynb). This general architecture has a number of advantages:

* It makes no assumptions about the temporal/spatial relationships across the data. This is ideal for processing a set of objects (for example, [StarCraft units](https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/#block-8)).
* Layer outputs can be calculated in parallel, instead of in series as in an RNN.
* Distant items can affect each other's output without passing through many RNN-steps, or convolution layers (see [Scene Memory Transformer](https://arxiv.org/pdf/1903.03878.pdf) for example).
* It can learn long-range dependencies. This is a challenge in many sequence tasks.

The downsides of this architecture are:

* For a time-series, the output for a time-step is calculated from the *entire history* instead of only the inputs and current hidden-state. This _may_ be less efficient.
* If the input *does* have a temporal/spatial relationship, like text, some positional encoding must be added or the model will effectively see a bag of words.
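
As a rough illustration of the last point, one common choice is the sinusoidal positional encoding from the Transformer paper linked above. The sketch below is not part of this notebook's pipeline; the function name `positional_encoding` and the sizes passed to it are assumptions chosen for illustration.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Return a (num_positions, d_model) array of sinusoidal encodings."""
    positions = np.arange(num_positions)[:, np.newaxis]  # (num_positions, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return encoding

# Hypothetical sizes: 50 positions, 16-dimensional embeddings.
pos_encoding = positional_encoding(num_positions=50, d_model=16)
print(pos_encoding.shape)
```

Each row is added to the token embedding at that position, so two identical tokens at different positions no longer look the same to the attention layers.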

After training the model in this notebook, you will be able to input a Portuguese sentence and return the English translation.

<img src="https://www.tensorflow.org/images/tutorials/transformer/attention_map_portuguese.png" width="800" alt="Attention heatmap">

In [None]:
!pip install -q tfds-nightly

# Pin matplotlib version to 3.2.2 since in the latest version
# transformer.ipynb fails with the following error:
# https://stackoverflow.com/questions/62953704/valueerror-the-number-of-fixedlocator-locations-5-usually-from-a-call-to-set
!pip install matplotlib==3.2.2



In [None]:
# tf.data: Build TensorFlow input pipelines
# https://www.tensorflow.org/guide/data
# https://www.tensorflow.org/api_docs/python/tf/data/experimental


In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf

import time
import numpy as np
import matplotlib.pyplot as plt

## Setup input pipeline

Use [TFDS](https://www.tensorflow.org/datasets) to load the [Portuguese-English translation dataset](https://github.com/neulab/word-embeddings-for-nmt) from the [TED Talks Open Translation Project](https://www.ted.com/participate/translate).

This dataset contains approximately 50000 training examples, 1100 validation examples, and 2000 test examples.

In [None]:
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

Downloading and preparing dataset 124.94 MiB (download: 124.94 MiB, generated: Unknown size, total: 124.94 MiB) to /root/tensorflow_datasets/ted_hrlr_translate/pt_to_en/1.0.0...



Dataset ted_hrlr_translate downloaded and prepared to /root/tensorflow_datasets/ted_hrlr_translate/pt_to_en/1.0.0. Subsequent calls will reuse this data.


Create a custom subwords tokenizer from the training dataset. 

In [None]:
# https://www.tensorflow.org/datasets/api_docs/python/tfds/deprecated/text/SubwordTextEncoder#build_from_corpus
# https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l10c01_nlp_lstms_with_reviews_subwords_dataset.ipynb#scrollTo=33AthPiALFZK
# SubwordTextEncoder.build_from_corpus() will create a tokenizer for us. You could also use this 
# functionality to get subwords from a much larger corpus of text as well

tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)

tokenizer_pt = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

print (type(tokenizer_en))

<class 'tensorflow_datasets.core.deprecated.text.subword_text_encoder.SubwordTextEncoder'>


In [None]:
# i=11
# j=0
# for data in train_examples:
#   if (j < i):
#     a, b = data
#     print (f"a is: {a} and \n b is: {b}\n") #a and b are byte strings
#     j+=1

In [None]:
"""
e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
[8214, 6, 40, 4092, 57, 3, 1687, 1, 6155, 12, 3, 461, 6770, 19, 5227, 1088, 97, 1, 5, 8, 3, 4213, 3408, 7256, 1670, 2, 8215]
mas e se estes fatores fossem ativos ?
[8214, 25, 6, 16, 138, 7800, 2004, 2445, 8073, 29, 8215]
mas eles não tinham a curiosidade de me testar .
[8214, 25, 66, 13, 342, 3, 5278, 7990, 4, 38, 3625, 8072, 2, 8215]
e esta rebeldia consciente é a razão pela qual eu , como agnóstica , posso ainda ter fé .
[8214, 6, 54, 3906, 2682, 156, 2646, 7990, 8, 3, 496, 139, 216, 354, 1, 21, 1712, 243, 4206, 1, 375, 130, 75, 1960, 2, 8215]
`` `` '' podem usar tudo sobre a mesa no meu corpo . ''
[8214, 149, 74, 258, 123, 60, 3, 4088, 22, 73, 806, 37, 8215]
"""

sample_string = 'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .'

tokenized_string = tokenizer_en.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer_en.decode(tokenized_string)
print ('The original string: {}'.format(original_string))

assert original_string == sample_string

Tokenized string is [4, 59, 15, 1792, 6561, 3060, 7952, 1, 15, 103, 134, 378, 3, 47, 6122, 6, 5311, 1, 91, 13, 1849, 559, 1609, 894, 2]
The original string: and when you improve searchability , you actually take away the one advantage of print , which is serendipity .


The tokenizer encodes the string by breaking it into subwords if the word is not in its dictionary.

In [None]:
for ts in tokenized_string:
  print ('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))

4 ----> and 
59 ----> when 
15 ----> you 
1792 ----> improve 
6561 ----> search
3060 ----> abilit
7952 ----> y
1 ---->  , 
15 ----> you 
103 ----> actually 
134 ----> take 
378 ----> away 
3 ----> the 
47 ----> one 
6122 ----> advantage 
6 ----> of 
5311 ----> print
1 ---->  , 
91 ----> which 
13 ----> is 
1849 ----> ser
559 ----> end
1609 ----> ip
894 ----> ity
2 ---->  .


In [None]:
BUFFER_SIZE = 20000
BATCH_SIZE = 64

Add a start and end token to the input and target. 

In [None]:
def encode(lang1, lang2):
  # Wrap each tokenized sentence in a start token (vocab_size) and an end
  # token (vocab_size + 1). Both ids lie just outside the subword vocabulary.

  # Portuguese source sentence
  lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(
      lang1) + [tokenizer_pt.vocab_size+1]

  # English target sentence
  lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(
      lang2) + [tokenizer_en.vocab_size+1]

  return lang1, lang2


You want to use `Dataset.map` to apply this function to each element of the dataset.  `Dataset.map` runs in graph mode.

* Graph tensors do not have a value. 
* In graph mode you can only use TensorFlow Ops and functions. 

So you can't `.map` this function directly: You need to wrap it in a `tf.py_function`. The `tf.py_function` will pass regular tensors (with a value and a `.numpy()` method to access it), to the wrapped python function.

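Note: To keep this example small and relatively fast, the cells below look only at the first 100 training examples and keep a pair only if both the Portuguese and the English string are shorter than `MAX_LEN` (30) characters.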
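
The cells below encode examples eagerly in a Python loop. For comparison, the `Dataset.map` route described above might look like the following sketch. To keep it runnable without the trained tokenizers, `toy_encode` is a hypothetical stand-in for `encode` that maps each string to its UTF-8 byte values. `MAX_TOKENS`, `tf_toy_encode`, and `filter_max_length` are likewise names assumed here for illustration.

```python
import tensorflow as tf

MAX_TOKENS = 8  # hypothetical length limit for this toy data

def toy_encode(lang1, lang2):
    # Inside tf.py_function the arguments are eager tensors, so .numpy()
    # works. Iterating over a bytes object yields integer byte values.
    return list(lang1.numpy()), list(lang2.numpy())

def tf_toy_encode(pt, en):
    # Run the Python function from graph mode and declare the output dtypes.
    result_pt, result_en = tf.py_function(
        toy_encode, [pt, en], [tf.int64, tf.int64])
    # py_function drops static shape information, so restore the rank.
    result_pt.set_shape([None])
    result_en.set_shape([None])
    return result_pt, result_en

def filter_max_length(x, y, max_length=MAX_TOKENS):
    # Keep a pair only if both encoded sequences are short enough.
    return tf.logical_and(tf.size(x) <= max_length,
                          tf.size(y) <= max_length)

pairs = tf.data.Dataset.from_tensor_slices(
    (["ola", "bom dia", "uma frase muito longa"],
     ["hi", "good day", "a very long sentence"]))

dataset = (pairs
           .map(tf_toy_encode)
           .filter(filter_max_length)
           .padded_batch(2))  # zero-pads each batch to its longest sequence

batches = [(p.numpy().tolist(), e.numpy().tolist()) for p, e in dataset]
print(batches)
```

Unlike the notebook's loop, the filter here counts encoded ids rather than characters, and padding is applied per batch rather than over the whole list.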

In [None]:
MAX_LEN = 30

In [None]:
# https://www.tensorflow.org/api_docs/python/tf/data/Dataset
# https://www.tensorflow.org/tutorials/load_data/numpy
# https://stackoverflow.com/questions/62436302/accessing-data-in-tensorflow-prefetchdataset

print (f"number of training examples: {len(train_examples)}")
print (f"type of training examples: {type(train_examples)}\n") # <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
lst = list(train_examples) # convert to list and then to tf.data.Dataset.from_tensor_slices
tf_slices_train_examples = tf.data.Dataset.from_tensor_slices(lst)

source_pt = []
target_en = []
result_pt = []
result_en = []

#Finally, access tensor elements in tensor of type tf.data.Dataset.from_tensor_slices
for example_no, source_target in enumerate(tf_slices_train_examples.take(100)):

  source = source_target[0].numpy().decode('UTF-8')
  target = source_target[1].numpy().decode('UTF-8')
  if (len(source) < MAX_LEN) and (len(target) < MAX_LEN):
    source_pt.append(source)
    target_en.append(target)
    pt, en = encode(source, target)
    result_pt.append(pt)
    result_en.append(en)
  else: 
    continue

number of training examples: 51785
type of training examples: <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>



In [None]:
# Right-pad every sequence with 0 up to the length of the longest sequence
# in its list, so that each language becomes one rectangular array.
def pad_sequence(result_pt, result_en):
  padded_inputs_pt = tf.keras.preprocessing.sequence.pad_sequences(result_pt, padding="post")
  padded_inputs_en = tf.keras.preprocessing.sequence.pad_sequences(result_en, padding="post")
  return padded_inputs_pt, padded_inputs_en

In [None]:
padded_inputs_pt, padded_inputs_en = pad_sequence(result_pt, result_en)
print ("Padded input sequences:\n")
print (padded_inputs_pt)
print ("\nPadded output sequences:\n")
print (padded_inputs_en)

Padded input sequences:

[[8214    3 4675   53    9 1537    2 8215]
 [8214   27   56 5569 8055    2 8215    0]
 [8214   42   13  170 2610    2 8215    0]
 [8214  212    2 8215    0    0    0    0]
 [8214  273    1  179 7240 8055    2 8215]
 [8214   67  107  173 8215    0    0    0]
 [8214    3  313   95  410    2 8215    0]
 [8214   54   53    3   69  486    2 8215]
 [8214  372    2 8215    0    0    0    0]]

Padded output sequences:

[[8087    3 3978   20 2981    2 8088    0    0    0    0]
 [8087   12   20  952 7931    2 8088    0    0    0    0]
 [8087   16   97   36 1537    2 8088    0    0    0    0]
 [8087  153   51    2 8088    0    0    0    0    0    0]
 [8087  307    1  204    1 7936    8   78 5054    2 8088]
 [8087   94  136  192 8088    0    0    0    0    0    0]
 [8087    3  338   20   79    2 8088    0    0    0    0]
 [8087   18   10   20   32  422    2 8088    0    0    0]
 [8087  153   51    2 8088    0    0    0    0    0    0]]


## Masking

Mask all the pad tokens in the batch of sequences. This ensures that the model does not treat padding as input. The mask indicates where the pad value `0` is present: it outputs a `1` at those locations, and a `0` otherwise.

In [None]:
def create_padding_mask(seq):
  print ("Padded input:\n")
  print (seq)
  print ()
  print ("Masked paddings:\n")
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
  # A full model would add two axes so that the mask broadcasts against the
  # attention logits:
  # return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)
  # Here the 2-D mask is returned so that it is easy to inspect.
  return seq  # (batch_size, seq_len)

In [None]:
x = tf.constant([[8087, 307, 1, 204,    1, 7936,    8,   78, 5054,    2, 8088],
                 [8087,   18,   10,   20,   32,  422,    2, 8088,    0,    0,    0], 
                 [8087,  153,   51,    2, 8088,    0,    0,    0,    0,    0,    0]])
create_padding_mask(x)

Padded input:

tf.Tensor(
[[8087  307    1  204    1 7936    8   78 5054    2 8088]
 [8087   18   10   20   32  422    2 8088    0    0    0]
 [8087  153   51    2 8088    0    0    0    0    0    0]], shape=(3, 11), dtype=int32)

Masked paddings:



<tf.Tensor: shape=(3, 11), dtype=float32, numpy=
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.],
       [0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1.]], dtype=float32)>
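
The commented-out return in `create_padding_mask` adds two axes to the mask. The sketch below suggests why, using NumPy as a stand-in because its broadcasting rules match TensorFlow's. It assumes attention logits of shape `(batch_size, num_heads, seq_len_q, seq_len_k)` and the convention of adding `mask * -1e9` before the softmax. The all-zero `logits` and the helper `softmax` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq = np.array([[7, 6, 0, 0],
                [1, 2, 3, 0]])
mask = (seq == 0).astype(np.float32)          # (batch_size, seq_len)
mask_4d = mask[:, np.newaxis, np.newaxis, :]  # (batch_size, 1, 1, seq_len)

num_heads = 2
# Hypothetical attention scores: every key scores the same before masking.
logits = np.zeros((2, num_heads, 4, 4), dtype=np.float32)

# The two size-1 axes broadcast over heads and query positions.
masked_logits = logits + mask_4d * -1e9
weights = softmax(masked_logits)

print(weights[0, 0, 0])  # attention of query 0 in sequence 0: [0.5 0.5 0. 0.]
```

Padded key positions receive a weight of zero for every head and every query, which is what keeps the model from attending to padding.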

In [262]:
def create_look_ahead_mask(size):
  print ("************************* In create_look_ahead_mask ****************************")
  # tf.linalg.band_part(input, -1, 0) keeps the lower triangular part.
  # https://www.tensorflow.org/api_docs/python/tf/linalg/band_part
  lower_triangle = tf.linalg.band_part(tf.ones((size, size)), -1, 0)
  print (lower_triangle)

  # Subtract from 1 to switch to the strictly upper triangular part:
  # a 1 marks a future position that must not be attended to.
  mask = 1 - lower_triangle
  print ("\n******************After subtract******************************\n")
  print (mask)
  return tf.cast(mask, tf.int32)  # (seq_len, seq_len)

In [263]:
x = tf.random.uniform((10, 4))
print (x)
temp = create_look_ahead_mask(x.shape[1])
temp

tf.Tensor(
[[0.65239656 0.38838315 0.53767145 0.55196345]
 [0.26812708 0.19646943 0.9061558  0.39706135]
 [0.33801997 0.52262604 0.48510373 0.24971282]
 [0.6662196  0.14866138 0.36983204 0.5765791 ]
 [0.3571918  0.459682   0.271657   0.09301698]
 [0.48169446 0.55015886 0.16322911 0.65544367]
 [0.58037543 0.8639753  0.5869429  0.092152  ]
 [0.72468245 0.97446585 0.9399307  0.31246507]
 [0.18736744 0.11667609 0.0176121  0.7439172 ]
 [0.9946773  0.5199362  0.6276696  0.3641461 ]], shape=(10, 4), dtype=float32)
************************* In create_look_ahead_mask ****************************
tf.Tensor(
[[1. 0. 0. 0.]
 [1. 1. 0. 0.]
 [1. 1. 1. 0.]
 [1. 1. 1. 1.]], shape=(4, 4), dtype=float32)

******************After subtract******************************

tf.Tensor(
[[0. 1. 1. 1.]
 [0. 0. 1. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]], shape=(4, 4), dtype=float32)


<tf.Tensor: shape=(4, 4), dtype=int32, numpy=
array([[0, 1, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 0]], dtype=int32)>

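## Create masks

Since the source and target sequences are padded, padding masks are needed both in attention and when calculating the loss. `create_masks` builds the three attention masks: an encoder padding mask, a decoder padding mask over the encoder outputs, and a combined look-ahead and padding mask for the decoder input.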

In [279]:
def create_masks(inp, tar):
  print ("*************************In create_masks ****************************")

  # Encoder padding mask
  print ("calling create_padding_mask(inp) for encoder ")
  enc_padding_mask = create_padding_mask(inp)
  print (enc_padding_mask)
  print (enc_padding_mask.shape)

  # Used in the 2nd attention block in the decoder.
  # This padding mask is used to mask the encoder outputs.
  print ("calling create_padding_mask(inp) for decoder ")
  dec_padding_mask = create_padding_mask(inp)
  print (dec_padding_mask)
  print (dec_padding_mask.shape)


  # Used in the 1st attention block in the decoder.
  # It is used to pad and mask future tokens in the input received by 
  # the decoder.

  print (tf.shape(tar))  
  print (tf.shape(tar)[1])

  print ("Calling create_look_ahead_mask(tf.shape(tar)[1])")
  look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
  print (tf.shape(look_ahead_mask))

  print ("calling create_padding_mask(tar)")
  dec_target_padding_mask = create_padding_mask(tar)
  
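  print ("Calling tf.maximum(dec_target_padding_mask, look_ahead_mask)")

  print (f"\n{tf.shape(dec_target_padding_mask)} and matrix is {dec_target_padding_mask}")
  print (f"\n{tf.shape(look_ahead_mask)} and matrix is {look_ahead_mask}")

  # Add a query axis so the (batch_size, seq_len) padding mask broadcasts
  # against the (seq_len, seq_len) look-ahead mask. Slicing a row off the
  # look-ahead mask instead only lines up when batch_size == seq_len - 1.
  dec_target_padding_mask = tf.cast(dec_target_padding_mask, tf.int32)[:, tf.newaxis, :]
  look_ahead_mask = tf.cast(look_ahead_mask, tf.int32)

  # A position is masked if it is padding or lies in the future.
  combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)  # (batch_size, seq_len, seq_len)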
  
  return enc_padding_mask, combined_mask, dec_padding_mask
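
A combined mask is usually built per example, so that each sequence in the batch gets its own `(seq_len, seq_len)` matrix. The sketch below shows one way to do that with broadcasting. It uses NumPy as a stand-in (`np.maximum` broadcasts like `tf.maximum`, and `np.tril` plays the role of `tf.linalg.band_part(..., -1, 0)`), and the tiny `tar` batch is made up for illustration.

```python
import numpy as np

tar = np.array([[5, 6, 7],
                [5, 6, 0]])  # the second sequence ends in one pad token
seq_len = tar.shape[1]

padding_mask = (tar == 0).astype(np.int32)                       # (batch, seq_len)
look_ahead = 1 - np.tril(np.ones((seq_len, seq_len), np.int32))  # (seq_len, seq_len)

# (batch, 1, seq_len) against (seq_len, seq_len) -> (batch, seq_len, seq_len)
combined = np.maximum(padding_mask[:, np.newaxis, :], look_ahead)

print(combined[0])  # no padding, so this is just the look-ahead mask
print(combined[1])  # the last column is also masked because it is padding
```

Because the batch axis and the query axis are kept separate, this works for any batch size, not only when the batch size happens to match the sequence length.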

In [280]:
EPOCHS = 20

In [281]:
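def train_step(inp, tar_inp_parm):
  # print (f"Shape passed in:\n Input/Portuguese shape: {inp.shape} \n Output/English shape: {tar_inp_parm.shape}" )
  # Decoder input: the target without its last token.
  tar_inp = tar_inp_parm[:, :-1]
  # Expected output: the target shifted left by one token. It is unused here
  # because this notebook only builds the masks.
  tar_real = tar_inp_parm[:, 1:]
  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)
  return enc_padding_mask, combined_mask, dec_padding_mask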
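
The two slices in `train_step` can be seen on a small array. The rows below are shortened versions of the padded English sequences printed earlier, with one pad token each. NumPy is used because `pad_sequences` returns NumPy arrays.

```python
import numpy as np

tar = np.array([[8087, 153, 51, 2, 8088, 0],
                [8087, 94, 136, 192, 8088, 0]])

tar_inp = tar[:, :-1]   # what the decoder is fed: starts with the start token
tar_real = tar[:, 1:]   # what it should predict: the same tokens, one step ahead

print(tar_inp[0])   # [8087  153   51    2 8088]
print(tar_real[0])  # [ 153   51    2 8088    0]
```

At position `i` the decoder sees `tar_inp[:, :i+1]` and is expected to produce `tar_real[:, i]`, which is why the look-ahead mask is built from the length of `tar_inp` rather than the full target.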

Portuguese is used as the input language and English is the target language.

In [282]:
# https://www.tensorflow.org/api_docs/python/tf/data/Dataset
# https://www.tensorflow.org/tutorials/load_data/numpy
# https://stackoverflow.com/questions/62436302/accessing-data-in-tensorflow-prefetchdataset

import numpy as np

print (f"number of training examples: {len(train_examples)}")
print (f"type of training examples: {type(train_examples)}\n") # <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
lst = list(train_examples) # convert to list and then to tf.data.Dataset.from_tensor_slices
tf_slices_train_examples = tf.data.Dataset.from_tensor_slices(lst)

source_pt = []
target_en = []
result_pt = []
result_en = []

#Finally, access tensor elements in tensor of type tf.data.Dataset.from_tensor_slices
for example_no, source_target in enumerate(tf_slices_train_examples.take(100)):

  source = source_target[0].numpy().decode('UTF-8')
  target = source_target[1].numpy().decode('UTF-8')
  if (len(source) < MAX_LEN) and (len(target) < MAX_LEN):
    source_pt.append(source)
    target_en.append(target)
    pt, en = encode(source, target)
    result_pt.append(pt)
    result_en.append(en)
  else: 
    continue

print ("\nSource without padding or masked padded tokens:\n")
tf.print (result_pt)
print ("\nTarget without padding, masking padded tokens or look ahead masks:\n")
tf.print (result_en)

padded_inputs_pt, padded_inputs_en = pad_sequence(result_pt, result_en)

number of training examples: 51785
type of training examples: <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>


Source without padding or masked padded tokens:

[[8214, 3, 4675, 53, 9, 1537, 2, 8215],
 [8214, 27, 56, 5569, 8055, 2, 8215],
 [8214, 42, 13, 170, 2610, 2, 8215],
 [8214, 212, 2, 8215],
 [8214, 273, 1, 179, 7240, 8055, 2, 8215],
 [8214, 67, 107, 173, 8215],
 [8214, 3, 313, 95, 410, 2, 8215],
 [8214, 54, 53, 3, 69, 486, 2, 8215],
 [8214, 372, 2, 8215]]

Target without padding, masking padded tokens or look ahead masks:

[[8087, 3, 3978, 20, 2981, 2, 8088],
 [8087, 12, 20, 952, 7931, 2, 8088],
 [8087, 16, 97, 36, 1537, 2, 8088],
 [8087, 153, 51, 2, 8088],
 [8087, 307, 1, 204, 1, 7936, 8, 78, 5054, 2, 8088],
 [8087, 94, 136, 192, 8088],
 [8087, 3, 338, 20, 79, 2, 8088],
 [8087, 18, 10, 20, 32, 422, 2, 8088],
 [8087, 153, 51, 2, 8088]]


In [285]:
print ("\nSource with padding. No masked padded tokens:\n")
print (tf.constant(padded_inputs_pt))

print ("\nTarget with padding. No masked padded tokens or Look ahead masks:\n")
print (tf.constant(padded_inputs_en))

print ("calling train step\n")
enc_padding_mask, combined_mask, dec_padding_mask = train_step(padded_inputs_pt, padded_inputs_en)

print ("*******************************************")

print (enc_padding_mask)
print (combined_mask)
print (dec_padding_mask)


Source with padding. No masked padded tokens:

tf.Tensor(
[[8214    3 4675   53    9 1537    2 8215]
 [8214   27   56 5569 8055    2 8215    0]
 [8214   42   13  170 2610    2 8215    0]
 [8214  212    2 8215    0    0    0    0]
 [8214  273    1  179 7240 8055    2 8215]
 [8214   67  107  173 8215    0    0    0]
 [8214    3  313   95  410    2 8215    0]
 [8214   54   53    3   69  486    2 8215]
 [8214  372    2 8215    0    0    0    0]], shape=(9, 8), dtype=int32)

Target with padding. No masked padded tokens or Look ahead masks:

tf.Tensor(
[[8087    3 3978   20 2981    2 8088    0    0    0    0]
 [8087   12   20  952 7931    2 8088    0    0    0    0]
 [8087   16   97   36 1537    2 8088    0    0    0    0]
 [8087  153   51    2 8088    0    0    0    0    0    0]
 [8087  307    1  204    1 7936    8   78 5054    2 8088]
 [8087   94  136  192 8088    0    0    0    0    0    0]
 [8087    3  338   20   79    2 8088    0    0    0    0]
 [8087   18   10   20   32  422    2 808

In [None]:
# Look-ahead mask for a sequence length of 10, used by the cells below.
x = 1 - tf.linalg.band_part(tf.ones((10, 10)), -1, 0)
x = tf.cast(x, tf.int32)

In [243]:
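# Drop the last row so the mask has the same (9, 10) shape as the padding mask
# below. This pairs batch row i with look-ahead row i, so it is only a shape
# experiment, not a per-example mask.
y = tf.constant(x[:-1, :], dtype=tf.int32)
print (y)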

tf.Tensor(
[[0 1 1 1 1 1 1 1 1 1]
 [0 0 1 1 1 1 1 1 1 1]
 [0 0 0 1 1 1 1 1 1 1]
 [0 0 0 0 1 1 1 1 1 1]
 [0 0 0 0 0 1 1 1 1 1]
 [0 0 0 0 0 0 1 1 1 1]
 [0 0 0 0 0 0 0 1 1 1]
 [0 0 0 0 0 0 0 0 1 1]
 [0 0 0 0 0 0 0 0 0 1]], shape=(9, 10), dtype=int32)


In [244]:
dec_target_padding_mask = tf.constant([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], dtype=tf.int32)

In [245]:
combined_mask = tf.maximum(dec_target_padding_mask, y)
print (combined_mask)

tf.Tensor(
[[0 1 1 1 1 1 1 1 1 1]
 [0 0 1 1 1 1 1 1 1 1]
 [0 0 0 1 1 1 1 1 1 1]
 [0 0 0 0 1 1 1 1 1 1]
 [0 0 0 0 0 1 1 1 1 1]
 [0 0 0 0 0 1 1 1 1 1]
 [0 0 0 0 0 0 0 1 1 1]
 [0 0 0 0 0 0 0 0 1 1]
 [0 0 0 0 0 1 1 1 1 1]], shape=(9, 10), dtype=int32)
