
How does Masking work? #3086

Closed
poyuwu opened this issue Jun 27, 2016 · 32 comments

@poyuwu

poyuwu commented Jun 27, 2016

I'm wondering how the Masking layer works.
I tried to write a simple model to test Masking with an Activation layer:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input
a = np.array([[3.,1.,2.,2.,0.,0.]])

inputs = Input(shape=(6,))
mask = Masking(mask_value=0.0)(inputs)
softmax = Activation('softmax')(mask)
model = Model(input=inputs,output=softmax)
model.predict(a)

and the result of prediction is

array([[ 0.50744212,  0.06867483,  0.18667753,  0.18667753,  0.02526405,
         0.02526405]])

Is this the correct behavior?
My Keras version is 1.0.5.

@ipoletaev

ipoletaev commented Jun 28, 2016

I'm also interested in this question. It seems to me you expected to get something like the following:
[[ 0.53444666, 0.07232948, 0.19661194, 0.19661194]]?
On the other hand, according to the explanation in core.py (class Masking(Layer)), masking doesn't work with 1D input data. So if you try this, for example:

from keras.models import Model
import numpy as np
from keras.layers import Masking,Input,TimeDistributed,Dense
a = np.array([[[3,1,2,2,0,0],[0,0,0,0,0,0],[2,1,1,2,0,0]]])

input = Input(shape=(3,6))
mask = Masking(mask_value=0)(input)
out = TimeDistributed(Dense(1,activation='linear'))(mask)
model = Model(input=input,output=out)
q = model.predict(a)
print (q[0])

...you will get [[-0.20101213],[ 0. ],[-0.51546627]] as expected.
But I think that, most likely, there's something wrong in my understanding.

@poyuwu
Author

poyuwu commented Jun 28, 2016

@ipoletaev
yes, sure. However, Keras could also return [[ 0.53444666, 0.07232948, 0.19661194, 0.19661194, 0.0, 0.0 ]] by padding with zeros to keep the shape.
Here is another example of Masking with a bi-directional LSTM whose two directions are summed:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input, LSTM, merge
a = np.array([[[.3,.1,.2,.2,.1,.1],[.2,.3,.3,.3,.3,.1],[0,0,0,0,0,0]]])

inputs = Input(shape=(3,6))
mask = Masking(mask_value=0.0)(inputs)
fw = LSTM(1,return_sequences=True)(mask)
bw = LSTM(1,return_sequences=True,go_backwards=True)(mask)
merged = merge([fw,bw],mode='sum')
model = Model(input=inputs,output=fw)
model2 = Model(input=inputs,output=bw)
model3 = Model(input=inputs,output=merged)

the fw output is
array([[[-0.07041532], [-0.12203699], [-0.12203699]]])
the bw output is
array([[[ 0. ], [-0.03112165], [ 0.02271803]]])
the merged output is
array([[[-0.07041532], [-0.15315863], [-0.09931896]]])
but I think it should be (here it could also pad with 0 to keep the shape):
array([[[-0.10153697], [-0.09931896]]])
where -0.10153697 = (-0.07041532) + (-0.03112165) and -0.09931896 = (-0.12203699) + 0.02271803.
Is there anything wrong with Keras?

@ipoletaev

ipoletaev commented Jun 28, 2016

@poyuwu

However, Keras could also return [[ 0.53444666, 0.07232948, 0.19661194, 0.19661194, 0.0, 0.0 ]] by padding with zeros to keep the shape.

Hmm... I don't know how to produce such an output using Keras alone.

About your example: I think it's similar to the aforementioned one, so you should get array([[[-0.07041532], [-0.15315863], [0.02271803]]]). And it's really strange that bw works right but fw doesn't: its third output is not zero, but it should be...

@lomizandtyd

lomizandtyd commented Jun 28, 2016

Hi guys, I have this question too, especially for LSTMs (BRNNs).

The Masking layer produces a masked vector, but that only applies to the inputs, not to the inner states.
So in @poyuwu's example, the fw output still has a value at step 3.

This might be correct because the inputs are certainly masked.
Still, I want to find a way to skip the computation entirely when a masked value (like a special padding tag) comes in.

However, I think using a Masking layer in a bidirectional RNN for sequences of different lengths may be totally wrong.
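(For reference, here is a minimal sketch of what I would try with the Bidirectional wrapper, assuming a Keras version that ships keras.layers.Bidirectional; it is supposed to reverse the sequence and its mask together, though I have not verified that it fixes this case.)

import numpy as np
from keras.models import Model
from keras.layers import Input, Masking, LSTM, Bidirectional

a = np.array([[[.3, .1, .2, .2, .1, .1],
               [.2, .3, .3, .3, .3, .1],
               [0., 0., 0., 0., 0., 0.]]])  # last timestep is padding

inputs = Input(shape=(3, 6))
masked = Masking(mask_value=0.0)(inputs)
summed = Bidirectional(LSTM(1, return_sequences=True), merge_mode='sum')(masked)
model = Model(inputs, summed)
print(model.predict(a))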

@ipoletaev

ipoletaev commented Jun 28, 2016

@lomizandtyd

So in @poyuwu's example, the fw output still has a value at step 3.

Yes, that's logical, but in any case we want to get zero in the third place, don't we?

Still, I want to find a way to skip the computation entirely when a masked value comes in.

I think it doesn't matter because, as I understand it, you should specify output sample_weights in fit() in order to skip the necessary timesteps (those whose feature vector is all zeros); I have already asked about this in #3023. But if that is so, then it is not clear why we need masking at all if we can specify in fit() which steps in the examples are used for training and which are not. I mean, it is not important whether the network processes these "empty" vectors; what matters is that the network is trained without backpropagating errors computed on such vectors.
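For concreteness, here is a rough sketch of that sample-weight route (my own illustration, assuming the Keras 2-style fit() signature; the shapes are arbitrary):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

model = Sequential()
model.add(LSTM(8, return_sequences=True, input_shape=(3, 6)))
model.add(TimeDistributed(Dense(1)))
# 'temporal' makes fit() expect one weight per timestep instead of one per sample
model.compile(optimizer='adam', loss='mse', sample_weight_mode='temporal')

x = np.random.rand(4, 3, 6)
y = np.random.rand(4, 3, 1)
w = np.ones((4, 3))
w[:, -1] = 0.0  # zero weight: the last timestep contributes nothing to the loss
model.fit(x, y, sample_weight=w, batch_size=2, epochs=1, verbose=0)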

Maybe there is some way to use batch_size=1 and not bother with padding at all?

@lomizandtyd

@ipoletaev Wow, thanks a lot for this!

Yes, we want to get zero at the masked position.
The problem is we also want to keep the inner states across the masked step.

Maybe we can pass another sample_weights argument to the predict() function?
Even if we do so, the BRNN is still wrong...

@ipoletaev

ipoletaev commented Jun 28, 2016

@lomizandtyd

...keep the inner states across the masked step.

I think it's not necessary, because the network shouldn't learn what responses it needs to give for empty vectors...

Maybe we can pass another sample_weights argument to the predict() function?

I don't understand what task you want to use that for. After all, you always know in advance what data you are processing, and accordingly you know which outputs of the network correspond to the empty vectors, so you can just skip those positions in the output, I guess.

Even if we do so, the BRNN is still wrong...

As far as I understand, Keras has been "fighting" with the RNN masking task for about a year :)

@poyuwu
Author

poyuwu commented Jun 28, 2016

from keras.models import Model
import numpy as np
from keras.layers import Masking,Input,TimeDistributed,Dense
a = np.array([[[3,1,2,2,0,0],[0,0,0,0,0,0],[2,1,1,2,0,0]]])
input = Input(shape=(3,6))
mask = Masking(mask_value=0)(input)
out = TimeDistributed(Dense(1,activation='linear'))(mask)
model = Model(input=input,output=out)
q = model.predict(a)
print (q[0])

@ipoletaev I think it's just that the Dense layer has zero inputs, so its output is 0. If you change the activation function to softmax, you will get the wrong answer.
Besides, leaving the time-step dimension as None will raise other errors in some cases (especially with the merge layer).

In Lasagne, a mask matrix seems to be used to deal with padding. (I have not tested its accuracy.)

@ipoletaev

ipoletaev commented Jun 28, 2016

@poyuwu: yes, I checked it, and you are right. It means, as I understand it, that even a simple Dense layer doesn't handle masked values in the way we want...

I'll write down again what does not match expectations:

  • The forward RNN doesn't keep masked values, but the backward one does. That's strange.
  • Can this task be solved with batch_size = 1?
  • How do we correctly specify which timesteps the network should skip?
  • It's also not clear at which moment the BiLSTM resets its state: only at the end of the timesteps of the current sample, or whenever the network meets an empty vector?

@poyuwu
Author

poyuwu commented Jun 29, 2016

@ipoletaev I don't think

  • The forward RNN doesn't keep masked values, but the backward one does. That's strange.

this statement is true.
That's because the padding is 'post', not 'pre'. Hence, the reason is the same as for the Dense layer I mentioned.

  • How do we correctly specify which timesteps the network should skip?

As I said, in Lasagne we provide a mask numpy.array (the same shape as the input) to deal with it. If go_backwards=True, the padding argument needs to stay the same.

Besides, the Embedding layer's mask_zero option seems to work the same way; a sketch follows below.
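For example, a minimal sketch of the Embedding route (my own illustration, not tested against this exact issue): with mask_zero=True, index 0 is reserved for padding and the downstream LSTM receives the mask.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=100, output_dim=8, mask_zero=True, input_length=5))
model.add(LSTM(4))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')

x = np.array([[7, 3, 9, 0, 0]])  # trailing zeros are treated as padding
print(model.predict(x))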

@ipoletaev

ipoletaev commented Jun 30, 2016

@poyuwu so you want to say that, for now, there's no way to solve this issue with Keras?
I mean, is it necessary to use masking if we use sample weights?

@xuewei4d

Same here. It seems the masking mechanism in Keras is not fully supported.

@fferroni

I don't think Masking masks individual input values (during either forward or back-propagation). It just skips a time step where all features are equal to the mask value (i.e. when you pad a sequence). You can confirm this as follows:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input, TimeDistributed, Dense

if __name__ == "__main__":
    a = np.array([[[3, 1, 2, 2, 0.1, 0.1], [0, 0, 0, 0, 0, 0], [2, 1, 1, 2, 0.1, 0.1]]])
    print('Input array:')
    print(a)
    print('')

    input = Input(shape=(3, 6))
    mask = Masking(mask_value=0.1)(input)
    out = TimeDistributed(Dense(1, activation='linear'))(mask)
    model = Model(input=input, output=out)

    model.set_weights([np.array([[1.], [1.], [1.], [1.], [1.], [1.]], dtype=np.float32),
                       np.array([0.], dtype=np.float32)])

    print('Weights')
    print(model.get_weights())
    q = model.predict(a)
    print(q)

The answer is:

Input array:
[[[ 3.   1.   2.   2.   0.1  0.1]
  [ 0.   0.   0.   0.   0.   0. ]
  [ 2.   1.   1.   2.   0.1  0.1]]]

Weights
[array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]], dtype=float32), array([ 0.], dtype=float32)]
[[[ 8.20000076]
  [ 0.        ]
  [ 6.19999981]]]

If it masked the individual inputs with value 0.1, you would expect the result to be

[[[ 8.       ]
  [ 0.        ]
  [ 6.        ]]]
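A quick way to see which timesteps the layer actually flags (a sketch of my own using the layer's compute_mask; only a step whose features all equal mask_value is marked as masked):

import numpy as np
from keras import backend as K
from keras.layers import Input, Masking

a = np.array([[[3, 1, 2, 2, 0.1, 0.1],
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
               [2, 1, 1, 2, 0.1, 0.1]]])

inp = Input(shape=(3, 6))
masking_layer = Masking(mask_value=0.1)
masking_layer(inp)                               # build the layer on this input
mask_tensor = masking_layer.compute_mask(inp)    # boolean tensor of shape (batch, timesteps)
print(K.function([inp], [mask_tensor])([a])[0])  # only the second step (all 0.1) should come back False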

@GPaolo

GPaolo commented Jan 17, 2017

Actually, Masking works exactly as expected.
The problem is that you are looking at the wrong dimension: with input shape (3, 6), the 3 is the time dimension, and the Masking layer masks only along that dimension, making the net ignore a time step when all of that step's elements equal the mask value.

import numpy as np
from keras.layers import *
from keras.models import Model

net_input = Input(shape = ( 3, 10))
mask = Masking(mask_value = 0.5)(net_input)
conv = TimeDistributed(Dense(1, activation = 'linear', init='one'))(mask)
out = LSTM(1, init='one', inner_init='one',activation='tanh', inner_activation='tanh',)(conv)
model = Model(net_input, out)

print('W: ' + str(model.get_weights()))

net_in = np.ones((1,3, 10))
val = 0.5
net_in[0, 2, :] = val
out = model.predict(net_in)
print('Input: ' + str(net_in))
print('Output: ' + str(out))

In this case the answers are:

mask = 0.5, val = 0.0 : 0.73566443
mask = 0.0, val = 0.0 : 0.96402758
mask = 0.0, val = 0.5 : 0.99504161
mask = 0.5, val = 0.5 : 0.96402758

So from here you can see that when the mask value matches val we get the same result in both cases, while when we mask something else, even when val = 0, we get a different result.

@GPaolo

GPaolo commented Jan 17, 2017

Moreover, I just tested this: if you have a multi-input net (with multiple input branches) and you have a Masking layer on each branch, it is enough for just one of the inputs at time step t to equal the mask value for the whole time step to be skipped.

I guess that if one wants to skip the time step only when all the inputs equal the mask value, the branches need to be merged first, right?

@irrationalagent

irrationalagent commented Jan 19, 2017

Hi Fragore, I have a similar question about masking with multiple inputs. I have two input branches, and all I want to do is mask 0 from both. Am I right in thinking that adding a mask at the end of each branch is equivalent to adding a single mask AFTER the inputs are merged? Here's my example:

input1 = Sequential()
input1.add(TimeDistributed(Dense(50), input_shape=(MAX_SEQUENCE_LENGTH,48)))
input2 = Sequential()
input2.add(Embedding(nb_words+2,EMBEDDING_DIM,weights=[embedding_matrix],trainable=False,input_length=MAX_SEQUENCE_LENGTH))

model = Sequential()
model.add(keras.engine.topology.Merge([input1,input2],mode='concat',concat_axis=-1))
model.add(keras.layers.core.Masking(mask_value=0.0))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(512,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(TimeDistributed(Dense(nb_words + 1)))
model.add(Activation('softmax'))

or the version with a mask at the end of each branch prior to merging:

input1 = Sequential()
input1.add(TimeDistributed(Dense(50), input_shape=(MAX_SEQUENCE_LENGTH,48)))
input1.add(keras.layers.core.Masking(mask_value=0.0))
input2 = Sequential()
input2.add(Embedding(nb_words+2,EMBEDDING_DIM,weights=[embedding_matrix],trainable=False,input_length=MAX_SEQUENCE_LENGTH,mask_zero=True))

model = Sequential()
model.add(keras.engine.topology.Merge([input1,input2],mode='concat',concat_axis=-1))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(512,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(TimeDistributed(Dense(nb_words + 1)))
model.add(Activation('softmax'))


@GPaolo

GPaolo commented Jan 24, 2017

Wait, you want to mask the branch outputs that are 0? In that case both of your approaches should give you the same result. But usually you mask inputs, which means putting the Masking layer right after the net's input.
PS: it may also be more convenient to use the functional API :)
PPS: the last Dense layer doesn't need TimeDistributed any more, because the LSTM removes the time dimension.
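For illustration, a rough functional-API sketch of the merge-then-mask idea (my own sketch: it assumes Keras 2's concatenate and a Keras version in which TimeDistributed forwards masks; the dimensions are placeholders, not values from this thread):

from keras.models import Model
from keras.layers import Input, Dense, TimeDistributed, Embedding, Masking, LSTM, concatenate

MAX_SEQUENCE_LENGTH = 30   # placeholder
EMBEDDING_DIM = 100        # placeholder
nb_words = 5000            # placeholder

feat_in = Input(shape=(MAX_SEQUENCE_LENGTH, 48))
feat = TimeDistributed(Dense(50))(feat_in)

tok_in = Input(shape=(MAX_SEQUENCE_LENGTH,))
tok = Embedding(nb_words + 2, EMBEDDING_DIM, trainable=False)(tok_in)

merged = concatenate([feat, tok], axis=-1)
masked = Masking(mask_value=0.0)(merged)        # a single mask after the merge
seq = LSTM(128, return_sequences=True)(masked)
out = TimeDistributed(Dense(nb_words + 1, activation='softmax'))(seq)

model = Model([feat_in, tok_in], out)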

@slaterb1
Contributor

slaterb1 commented Mar 17, 2017

I've been experimenting with and without masking for a little while now, and I have finally figured out what the Masking layer actually does. It doesn't actually "skip" a timepoint whose values all equal the mask value; it just forces all the values for that timepoint to 0... So effectively Masking(mask_value=0.) does nothing. That is why, in the example provided by @GPaolo above, the results for mask_value=0 and mask_value=0.5 are the same when val matches them.

Here is some easy code to demonstrate what I mean.

Model:

import numpy as np
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Masking, Dense

input1 = Input(batch_shape=(1, 1, 10))
mask1 = Masking(mask_value=2)(input1)
dense_layer1 = Dense(1, activation='sigmoid')
setattr(dense_layer1, 'supports_masking', True)  # let the Dense layer accept an incoming mask
output1 = dense_layer1(mask1)

model = Model(input1, output1)
model.compile(optimizer='adam', loss='binary_crossentropy')
Data:

data = np.ones((10, 1, 10), dtype='float32')
# set half of the data equal to the mask value
for index in range(5, 10):
    data[index, 0, :] = 2

# set one element of the first data point equal to the mask value to show that this line is unaffected
data[0, 0, 0] = 2

print outputs:

get_mask_output = K.function([model.layers[0].input], [model.layers[1].output])
mask_output = get_mask_output([data])[0]

print(data)
print(mask_output)

data:

[[[ 2. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]

[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]

[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]

[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]

[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]]

mask_output:

[[[ 2. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

  [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]

Predictions:

test_data = np.ones((5,1,10))
test_data[1,0,:] = 2
test_data[2,0,:] = 0
predictions = model.predict(test_data, batch_size=1)

print(test_data)
print(predictions)

Results:

test_data:
[[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]]

predictions:
[[[ 0.09200736]]

[[ 0.5 ]]

[[ 0.5 ]]

[[ 0.09200736]]

[[ 0.09200736]]]

As you can imagine, "masking" values by setting them to 0 and still computing the results for those lines in later layers causes some mistakes during backpropagation (unknown values get treated as real results) as well as unneeded computation time. I'm going to try to rework how masking is done in Keras a bit...

Edit: I did a bit of digging into the training.py code and found that the "masking" information (even with mask_value=0.) does get incorporated into the training of the weights. The masked lines effectively get ignored after the calculation is done (which is good!). The problem I am encountering in my actual network is that although "masked" lines are ignored during weight training, they are still evaluated by the network going forward, which affects the outputs of later layers based on false information. To build a network that handles variably sized inputs (where not all samples have the maximum number of timepoints), I want to ignore the masked lines entirely... I'm going to try to work that out.
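A small check of the first half of that edit (a sketch of my own, not part of the comment above): if the output mask reaches the loss, changing the target only at a masked timestep should not change the reported loss.

import numpy as np
from keras.models import Model
from keras.layers import Input, Masking, LSTM

x = np.array([[[1., 2.], [3., 4.], [0., 0.]]])  # last timestep is all-zero padding

inp = Input(shape=(3, 2))
out = LSTM(1, return_sequences=True)(Masking(mask_value=0.0)(inp))
model = Model(inp, out)
model.compile(optimizer='adam', loss='mse')

y1 = np.zeros((1, 3, 1))
y2 = y1.copy()
y2[0, 2, 0] = 100.0                             # change the target only at the masked step
print(model.evaluate(x, y1, verbose=0))
print(model.evaluate(x, y2, verbose=0))         # expected to match the first loss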

@ragulpr

ragulpr commented Mar 20, 2017

Building on @slaterb1 and @GPaolo 's snippets I tried digging around to see the benefits of masking but haven't found it yet. It feels like I'm missing something.

  • It does not seem to propagate numerically sound values through time.
  • It propagates np.nan; see the gist.
  • It feels (TODO: test) numerically unstable to propagate possibly absurd values down the network; the masked output of 0 may not always be appropriate.
  • It has to test each input.
  • Quick testing (see the gist) seems to show that there are no immediate performance gains.

Does anyone have an idea of if/when it gives performance gains? I didn't have time to run long/deep/wide models, and I'm not confident about how Python/Keras/TensorFlow/Theano compiles things.

Is the mask an intricate way of doing what I think sample weights should be doing, i.e. multiplying the loss and dividing by the sum of the weights in the batch?
That's literally what seems to be done here anyway:
https://github.com/fchollet/keras/blob/master/keras/engine/training.py#L453

Does it actually halt any execution (yet)?

@carlthome
Contributor

carlthome commented Mar 21, 2017

@ragulpr, it's my understanding that masking does more than just loss scaling. If a timestep has been masked, the previous output and state will be reused. See here and here.
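A tiny sketch of my own that appears to confirm this: with Masking in front of an LSTM and return_sequences=True, the padded timestep's output simply repeats the previous step's output instead of being recomputed.

import numpy as np
from keras.models import Model
from keras.layers import Input, Masking, LSTM

x = np.array([[[1., 2.], [3., 4.], [0., 0.]]])  # last timestep is all-zero padding

inp = Input(shape=(3, 2))
seq = LSTM(1, return_sequences=True)(Masking(mask_value=0.0)(inp))
model = Model(inp, seq)
print(model.predict(x)[0])                      # rows 2 and 3 should be identical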

@slaterb1
Contributor

slaterb1 commented Mar 22, 2017

@ragulpr, I'm not sure about performance gains but Theano is pretty smart about knowing what it needs to hang on to and what it doesn't (based on the API doc: http://deeplearning.net/software/theano/library/scan.html)

More specifically this line: "Note that there is an optimization, that at compile time will detect that you are using just the last value of the result and ensure that scan does not store all the intermediate values that are used. So do not worry if A and k are large."

So after compiling the model it might pass over the masked values (or at least not hold them in memory as long), but that is pure speculation based on similarities in the underlying code.

@carlthome, I came across the mask snippet in "theano_backend.py" as well, and you are right that the masking has a direct effect on how the states are evaluated and passed on (T.switch). Maybe this is too general a question, but how does a layer receive the mask? Just to give an example, if I have a model with multiple layers, defined like so:

model = Model(input1, output1)

I understand that Theano wraps this up as a mathematical equation to calculate:

output1 = input1 -> [ layers[0] -> layers[1] -> ... layers[N] ]

but if I have somewhere in the middle:

prev_layer -> Masking_layer -> RNN_layer

The output from the Masking_layer gets put into the RNN_layer as input ("x"). Does the "supports_masking" attribute tell the RNN_layer to figure out the mask? I could not find anywhere in the code where the mask is evaluated or interpreted by the RNN_layer, except that I can pass in a "mask" variable via the call() method of the Recurrent(Layer) object.

I tried calling RNN_layer(prev_layer, mask=Masking_layer), but it didn't do anything different. The last comment in thread #176 suggests that it has to be called with a mask, but I'm not sure how to do that... Any thoughts?

@carlthome
Contributor

carlthome commented Mar 22, 2017

I could not find anywhere in the code where the mask is evaluated or interpreted by the RNN_layer

Each Keras layer declares if it supports masking. Each layer is also responsible for using the mask in a sensible way (which I believe is the primary source of confusion: that the masking functionality is implemented across a bunch of different classes). For RNN layers in particular, they rely on the fact that the underlying K.rnn operation has mask support so if you're looking for where precisely the logic is, you'll note that the RNN layers simply pass the mask argument into the backend, where the magic happens.
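To make that concrete, here is a minimal sketch (my own illustration, not code from the Keras source) of what declaring and forwarding mask support looks like in a custom layer:

from keras import backend as K
from keras.engine.topology import Layer

class ZeroOutMasked(Layer):
    """Illustrative layer: declares mask support, forwards the mask, zeroes masked steps."""
    def __init__(self, **kwargs):
        super(ZeroOutMasked, self).__init__(**kwargs)
        self.supports_masking = True          # without this, Keras will complain if the layer receives a mask

    def compute_mask(self, inputs, mask=None):
        return mask                           # pass the incoming mask on to the next layer

    def call(self, inputs, mask=None):
        if mask is None:
            return inputs
        # zero out masked timesteps explicitly, similar to what Masking itself does
        return inputs * K.cast(K.expand_dims(mask, -1), K.floatx())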

@slaterb1
Contributor

slaterb1 commented Mar 22, 2017

@carlthome, I saw that in the code but was not able to get the mask to work in my RNN network. For clarity, I was trying to rework things in RecurrentShop to set up an encoder-decoder network that adjusts the next input based on a prediction made from the previous state of both the encoder and the decoder (a custom RNN that uses a .single_step_rnn() instead of the regular .rnn()).

But based on your advice, I tried to just build a basic LSTM network to act as a NOT gate (pointless but simple), and it does interpret the mask correctly when it is passed a mask mid-network! I'm including the gist. It shows that masking works for both return_sequences=True and return_sequences=False. It also shows that if you train the network with data that does not have 'masked' input, 'masked' lines in the test data will still get masked appropriately. Hope that helps people understand the masking stuff better!

This is the gist

@Seanny123

Seanny123 commented May 30, 2017

@fferroni @GPaolo apparently the TimeDistributed layer didn't support masking at the time, since that feature was only added in PR #6401?

@mehrdadscomputer

mehrdadscomputer commented Jun 5, 2017

Hey Guys, there is a seq2seq example which it's input is a string (sequence) like '5+9' and output is another string '14'.
The author used pre padding to have sequences with same lengths at input but he didn't use masking.
I add a simple line to add masking to his model and there is about 8 percent improvement in accuracy.
Is my case a correct use of masking?

this is main code:

from random import seed
from random import randint
from numpy import array
from math import ceil
from math import log10
from math import sqrt
from numpy import argmax
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import TimeDistributed
from keras.layers import RepeatVector

def random_sum_pairs(n_examples, n_numbers, largest):
    X, y = list(), list()
    for i in range(n_examples):
	    in_pattern = [randint(1,largest) for _ in range(n_numbers)]
	    out_pattern = sum(in_pattern)
	    X.append(in_pattern)
	    y.append(out_pattern)
    return X, y

def to_string(X, y, n_numbers, largest):
    max_length = n_numbers * ceil(log10(largest+1)) + n_numbers - 1
    Xstr = list()
    for pattern in X:
	    strp = '+'.join([str(n) for n in pattern])
	    strp = ''.join([' ' for _ in range(max_length-len(strp))]) + strp
	    Xstr.append(strp)
    max_length = ceil(log10(n_numbers * (largest+1)))
    ystr = list()
    for pattern in y:
	    strp = str(pattern)
	    strp = ''.join([' ' for _ in range(max_length-len(strp))]) + strp
	    ystr.append(strp)
    return Xstr, ystr

def integer_encode(X, y, alphabet):
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    Xenc = list()
    for pattern in X:
	    integer_encoded = [char_to_int[char] for char in pattern]
	    Xenc.append(integer_encoded)
    yenc = list()
    for pattern in y:
	    integer_encoded = [char_to_int[char] for char in pattern]
	    yenc.append(integer_encoded)
    return Xenc, yenc

def one_hot_encode(X, y, max_int):
    Xenc = list()
    for seq in X:
	    pattern = list()
	    for index in seq:
		    vector = [0 for _ in range(max_int)]
		    vector[index] = 1
		    pattern.append(vector)
	    Xenc.append(pattern)
    yenc = list()
    for seq in y:
	    pattern = list()
	    for index in seq:
		    vector = [0 for _ in range(max_int)]
		    vector[index] = 1
		    pattern.append(vector)
	    yenc.append(pattern)
    return Xenc, yenc

def generate_data(n_samples, n_numbers, largest, alphabet):
    X, y = random_sum_pairs(n_samples, n_numbers, largest)
    X, y = to_string(X, y, n_numbers, largest)
    X, y = integer_encode(X, y, alphabet)
    X, y = one_hot_encode(X, y, len(alphabet))
    X, y = array(X), array(y)
    return X, y

def invert(seq, alphabet):
    int_to_char = dict((i, c) for i, c in enumerate(alphabet))
    strings = list()
    for pattern in seq:
	    string = int_to_char[argmax(pattern)]
	    strings.append(string)
    return ''.join(strings)

seed(1)
n_samples = 1000
n_numbers = 2
largest = 10
alphabet = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']
n_chars = len(alphabet)
n_in_seq_length = n_numbers * ceil(log10(largest+1)) + n_numbers - 1
n_out_seq_length = ceil(log10(n_numbers * (largest+1)))
n_batch = 10
n_epoch = 10
model = Sequential()
model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))
model.add(RepeatVector(n_out_seq_length))
model.add(LSTM(50, return_sequences=True))
model.add(TimeDistributed(Dense(n_chars, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

for i in range(n_epoch):
    X, y = generate_data(n_samples, n_numbers, largest, alphabet)
    print(i)
    model.fit(X, y, epochs=1, batch_size=n_batch)

X, y = generate_data(n_samples, n_numbers, largest, alphabet)
result = model.predict(X, batch_size=n_batch, verbose=0)
expected = [invert(x, alphabet) for x in y]
predicted = [invert(x, alphabet) for x in result]
for i in range(20):
    print('Expected=%s, Predicted=%s' % (expected[i], predicted[i]))

and I just changed this part:

model = Sequential()
model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))

to this part:

from keras.layers import Masking
model = Sequential()
# the mask_value vector is the one-hot encoding of the ' ' (padding) character
model.add(Masking(mask_value=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], input_shape=(n_in_seq_length, n_chars)))
model.add(LSTM(100))

sources:
http://machinelearningmastery.com/learn-add-numbers-seq2seq-recurrent-neural-networks/#comment-400854

@stale stale bot added the stale label Sep 3, 2017
@stale

stale bot commented Sep 3, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot closed this as completed Oct 3, 2017
@MeloMing

I don't think Masking masks individual input values (during either forward or back-propagation). It just skips a time step where all features are equal to the mask value (i.e. when you pad a sequence). [full example quoted above]

The Masking layer only takes effect when all the features of a timestep equal the mask value. In your case, the input a is a 3D array with shape (1, 3, 6): 1 is the batch size, 3 is the number of timesteps, and 6 is the number of features per timestep. Masking kicks in when all the features of a timestep equal 0.1. If you change a to:
a = np.array([[[3,1,2,2,0.1,0.1],[0,0,0,0,0,0],[0.1,0.1,0.1,0.1,0.1,0.1]]])

you will get output like:

[[[8.200001] [0. ] [0. ]]]

@hoangcuong2011

Hi,

I struggled a lot with this recently, and here is some of the experience I gained. I hope it will be useful for people.

  1. Masking is extremely powerful. I found it perhaps the only way to deal with several "hard" problems that involve sequences with missing inputs or missing outputs, as follows.
    [image]

  2. Masking is not that complicated once we understand how the loss is computed with masking. For instance, assume we have a sequence of length 256, and a mask in which only 4 elements are 1 (the others are 0). I thought the loss would be computed as the average over these 4 elements. Guess what: it is not! The summed loss is divided by 256 instead. For this reason the loss can be extremely small (0.0-something) when we have only a few 1 elements in a long sequence.
    Does it matter? I guess not, since what we need is the gradient of the loss rather than the loss itself.

  3. When we use softmax as the last layer, the denominator is the sum of the exponentials of all elements, regardless of whether their mask is 1 or 0.

  4. I thought the output at masked inputs was always zero in an LSTM, but that is not the case. Assume we have a mask:

0 0 0 1 1 0 0 0

In this case the first three elements with mask zero have an output of 0. However, the last three zeros have an output equal to the output of the last element with mask 1.

  5. Meanwhile, Keras is very convenient in that the loss it computes is based only on elements with a mask of 1. I found this a big plus of using Keras, something almost too good to be true, as I guess implementing it is not easy.

  6. However, the accuracy in Keras is not computed that way. It is thus not trivial in Keras to write a custom metric (for fit). Something here is still mysterious to me: I am pretty sure my code for the custom metric is correct, but somehow it does not give me accurate results. Because of this, I think it is much easier to write such an accuracy function with a custom callback class; see the sketch below.
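To make that last point concrete, here is a rough sketch of one way to write such a metric (my own assumption-laden example: it treats a timestep whose one-hot target row is all zeros as padding, which is not how every setup encodes padding):

from keras import backend as K

def masked_accuracy(y_true, y_pred):
    # assume padded timesteps have an all-zero target row; adapt this line to your padding scheme
    mask = K.cast(K.any(y_true, axis=-1), K.floatx())
    correct = K.cast(K.equal(K.argmax(y_true, axis=-1),
                             K.argmax(y_pred, axis=-1)), K.floatx())
    return K.sum(correct * mask) / K.maximum(K.sum(mask), 1.0)

# usage: model.compile(loss='categorical_crossentropy', optimizer='adam',
#                      metrics=[masked_accuracy])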

That is it, I hope it is helpful!

@zhanjiezhu


Hi @hoangcuong2011, thanks for your explanations. I've validated your second point and indeed it's exactly what you said. I'm currently trying to implement an LSTM autoencoder model to encode a sequence into a sequence; it involves an LSTM layer with return_sequences=False followed by a RepeatVector layer to copy the encoding back across the timestep dimension. However, the mask gets lost right after the LSTM because return_sequences=False (if it were True, the input mask would be returned), so I'm wondering how I can get the mask back so that the loss will also ignore the padded timesteps? Thanks!
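For reference, the setup in question looks roughly like this (a sketch with illustrative shapes, not my actual code):

from keras.models import Model
from keras.layers import Input, Masking, LSTM, RepeatVector, TimeDistributed, Dense

timesteps, features = 10, 4                        # illustrative shapes
inp = Input(shape=(timesteps, features))
masked = Masking(mask_value=0.0)(inp)
encoded = LSTM(8, return_sequences=False)(masked)  # the mask is dropped here
decoded = RepeatVector(timesteps)(encoded)         # so no mask reaches this point or the loss
decoded = LSTM(8, return_sequences=True)(decoded)
out = TimeDistributed(Dense(features))(decoded)
autoencoder = Model(inp, out)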

@hoangcuong2011

hoangcuong2011 commented Dec 8, 2019

@zhangwj618 I am not really sure what your question is about. I guess you would like to write a custom masking layer. If you explain the question in more detail, I think I can help. Thx!

