# Masking & Padding in Keras

**Resources**

[TF Docs](https://www.tensorflow.org/guide/keras/masking_and_padding)

[StackOverflow](https://stackoverflow.com/questions/47057361/how-do-i-mask-a-loss-function-in-keras-with-the-tensorflow-backend)

## How to avoid using the padded values for calculating encoder hidden states?

*Masking is the mechanism that informs the model that some part of the data is actually padding and should be ignored.*

There are three ways to introduce a mask in a Keras model:

* Add a `keras.layers.Masking` layer.
* Configure a `keras.layers.Embedding` layer with `mask_zero=True`.
* Pass a `mask` argument manually when calling layers that support this argument (e.g. RNN layers).
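
The three options above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the token ids, layer sizes, and variable names are made up for the example, and it assumes eager execution with `tf.keras`. `compute_mask` is used here only to make the generated masks visible. The last line checks whether a padded sequence, run with its mask, ends in the same LSTM output as the same sequence without padding.

```python
import numpy as np
import tensorflow as tf

# Two token-id sequences, post-padded with 0 to a common length of 4 (made-up data).
padded_ids = np.array([[3, 5, 0, 0],
                       [7, 2, 9, 4]])

# Embedding with mask_zero=True treats id 0 as padding.
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=4, mask_zero=True)
embedded = embedding(padded_ids)
id_mask = embedding.compute_mask(padded_ids)  # boolean, shape (batch, timesteps)

# A Masking layer flags timesteps whose features all equal mask_value.
features = np.zeros((1, 3, 2), dtype="float32")
features[0, 0] = [1.0, 0.0]  # only the first timestep carries data
masking = tf.keras.layers.Masking(mask_value=0.0)
feature_mask = masking.compute_mask(features)

# Passing the mask by hand to a layer that accepts a `mask` argument.
lstm = tf.keras.layers.LSTM(8)
out_padded = lstm(embedded, mask=id_mask)

# The first sequence again, this time without any padding.
out_short = lstm(embedding(np.array([[3, 5]])))

print(np.array(id_mask))
print(np.array(feature_mask))
print(np.allclose(out_padded[0], out_short[0], atol=1e-4))
```

If masking behaves as described, the padded timesteps are skipped and the final check prints `True`.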

<hr>

When using the `Functional API` or the `Sequential API`, a mask generated by an `Embedding` or `Masking` layer is propagated through the network to any layer that is capable of using it (for example, RNN layers). ***Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it.*** This is why the following works, provided the `Embedding` layer is created with `mask_zero=True`:
```python
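import tensorflow as tf

model = tf.keras.Sequential()
# mask_zero=True makes the Embedding layer emit a mask that treats index 0 as padding
model.add(tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True))
model.add(tf.keras.layers.LSTM(32))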
```

The Functional API behaves the same way: with `mask_zero=True` set on the *Embedding* layer, the mask reaches the LSTM without being passed explicitly.
```python
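from tensorflow.keras.layers import Input, Embedding, LSTM

# ENCODER (num_encoder_tokens and latent_dim are assumed to be defined elsewhere)
encoder_inputs = Input(shape=(None,))
# mask_zero=True: index 0 is padding; the resulting mask is forwarded to the LSTM
enc_emb = Embedding(num_encoder_tokens, latent_dim, mask_zero=True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)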
```

<hr>

Inside a subclassed layer or a subclassed model, however, masks aren't automatically propagated, so you need to accept a `mask` argument in the subclassed model's `call` and pass it to the layer that needs it. For example, the **MultiHead Attention layer in the Encoder & Decoder of a Transformer uses a mask**, so we pass the `mask` argument to its `call` method.

```python
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, num_heads, d_model):
        super().__init__()
        # initializations

    def call(self, query, key, value, mask):
        ...
        # Apply ScaledDotProduct (SingleHeadAttention) on each of the heads,
        # forwarding the mask so that padded positions are ignored
        attention_weights, context_vector = SingleHeadAttention(query, key, value, mask)
        ...
```
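
To show what "using the mask" can mean inside the attention step, here is a NumPy-only sketch of masked scaled dot-product attention. The function `single_head_attention` is a hypothetical stand-in for the `SingleHeadAttention` call above, not the actual implementation. The mask convention is also an assumption: `True` marks a real key position and `False` marks padding.

```python
import numpy as np

def single_head_attention(query, key, value, mask=None):
    """Scaled dot-product attention (hypothetical stand-in for SingleHeadAttention).

    mask: boolean array broadcastable to (..., query_len, key_len);
          True where the key position holds real data (assumed convention).
    """
    d_k = query.shape[-1]
    logits = query @ np.swapaxes(key, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Push padded key positions to a large negative value so softmax gives ~0.
        logits = np.where(mask, logits, -1e9)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, weights @ value

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 3, 4))  # (batch, query_len, depth)
k = rng.normal(size=(1, 5, 4))  # (batch, key_len, depth)
v = rng.normal(size=(1, 5, 4))
key_mask = np.array([[True, True, True, False, False]])  # last two keys are padding

weights, context = single_head_attention(q, k, v, mask=key_mask[:, None, :])
print(weights.round(3))
```

The two padded key positions receive zero attention weight, so the context vectors match what attention over the first three keys alone would give.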

## How to avoid using padded terms while calculating the loss in decoder?

If there's a mask in your model, it is propagated layer by layer and eventually applied to the loss. So if the sequences are padded and masked correctly, the loss on the padding placeholders should be ignored. The notebook cells below test this claim.

```python
import numpy as np
from tensorflow.keras.layers import Input, Masking, LSTM
from tensorflow.keras.models import Model
```

```python
max_sentence_length = 5
character_number = 2

input_tensor = Input(shape=(max_sentence_length, character_number))
masked_input = Masking(mask_value=0)(input_tensor)
output = LSTM(3, return_sequences=True)(masked_input)
model = Model(input_tensor, output)
model.compile(loss='mae', optimizer='adam')

X = np.array([[[0, 0], [0, 0], [1, 0], [0, 1], [0, 1]],
              [[0, 0], [0, 1], [1, 0], [0, 1], [0, 1]]])
y_true = np.ones((2, max_sentence_length, 3))
y_pred = model.predict(X)
```

```python
X.shape, y_true.shape, y_pred.shape
```

```
((2, 5, 2), (2, 5, 3), (2, 5, 3))
```

```python
y_pred
```

```
array([[[ 0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ],
        [-0.0793161 , -0.17551334,  0.12153674],
        [-0.2666166 , -0.01823441, -0.04468612],
        [-0.38669798,  0.03491515, -0.15501674]],

       [[ 0.        ,  0.        ,  0.        ],
        [-0.19989397,  0.04159923, -0.12086871],
        [-0.22334047, -0.11435185,  0.00689123],
        [-0.36727932,  0.00774298, -0.12397961],
        [-0.45215282,  0.05321965, -0.20078692]]], dtype=float32)
```

The LSTM skipped the timesteps where `X` is an all-zero vector: the corresponding rows of `y_pred` are zero.

```python
# See if the loss computed by model.evaluate() is equal to the masked loss
unmasked_loss = np.abs(1 - y_pred).mean()
masked_loss = np.abs(1 - y_pred[y_pred != 0]).mean()
```

```python
model.evaluate(X, y_true)
```

```
0.7887610197067261
```

```python
masked_loss
```

```
1.1268014
```

```python
unmasked_loss
```

```
1.088761
```

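**According to the StackOverflow answer, the value from `model.evaluate()` should equal `masked_loss`, but it doesn't: 0.7888 matches neither `masked_loss` (1.1268) nor `unmasked_loss` (1.0888). It does equal `masked_loss` × 7/10, which suggests that the three masked timesteps contribute zero loss while the mean is still taken over all ten timesteps. Still need to find a better solution; procrastinating for now.**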
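
The three numbers can be reconciled with plain NumPy. The sketch below copies `y_pred` and `X` from the cells above and rebuilds the mask with the same rule `Masking(mask_value=0)` applies (a timestep is kept if any feature is non-zero). The names `mean_over_kept` and `mean_over_all` are made up for this illustration. That `model.evaluate()` divides by the total number of timesteps is an inference from these numbers, not documented behaviour, and it may differ between Keras versions.

```python
import numpy as np

# y_pred copied from the notebook output above; y_true is all ones.
y_pred = np.array([
    [[0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [-0.0793161, -0.17551334, 0.12153674],
     [-0.2666166, -0.01823441, -0.04468612],
     [-0.38669798, 0.03491515, -0.15501674]],
    [[0.0, 0.0, 0.0],
     [-0.19989397, 0.04159923, -0.12086871],
     [-0.22334047, -0.11435185, 0.00689123],
     [-0.36727932, 0.00774298, -0.12397961],
     [-0.45215282, 0.05321965, -0.20078692]],
])
y_true = np.ones_like(y_pred)

X = np.array([[[0, 0], [0, 0], [1, 0], [0, 1], [0, 1]],
              [[0, 0], [0, 1], [1, 0], [0, 1], [0, 1]]])

# Keep a timestep if any of its features is non-zero: 7 of the 10 are kept.
mask = np.any(X != 0, axis=-1)                         # shape (2, 5)

per_step_mae = np.abs(y_true - y_pred).mean(axis=-1)   # shape (2, 5)

# Average over the kept timesteps only (divide by 7).
mean_over_kept = per_step_mae[mask].mean()
# Zero out masked timesteps but still divide by all 10 timesteps.
mean_over_all = (per_step_mae * mask).sum() / mask.size

print(mean_over_kept)  # close to masked_loss (1.1268014)
print(mean_over_all)   # close to the model.evaluate() value (0.7887610)
```

If the goal is the mean over unmasked timesteps only, `mean_over_kept` is the quantity to target; rescaling the reported loss by `mask.size / mask.sum()` would recover it for this batch.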