
Using Masking Layer for Sequence to Sequence Learning #957

Closed

NickShahML opened this issue Nov 6, 2015 · 16 comments

Comments

@NickShahML
Contributor

Hey Guys,

I know @EderSantana asked this question a few months ago in #395, and I have read several Keras threads that deal with masking.

The question was: we can mask the zeros in the input, but won't it bias the cost function if we are not masking the zeros on the output?

Later, the Masking layer was added, and I believe this output problem was resolved. However, in order to use masking without an embedding layer, my model structure is:

from keras.models import Sequential
from keras.layers.core import Masking, Dense, Activation, RepeatVector, Dropout, TimeDistributedDense
from keras.layers.recurrent import LSTM

model = Sequential()
# Mask timesteps whose feature vector is all zeros; pass input_shape directly
# instead of setting the private _input_shape attribute.
model.add(Masking(mask_value=0., input_shape=(x_maxlen, word2vec_dimension)))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))  # encoder
model.add(Dense(hidden_variables_encoding))
model.add(Activation('relu'))
model.add(RepeatVector(y_maxlen))  # repeat the encoding for each output timestep
for z in range(number_of_decoding_layers):
    model.add(LSTM(hidden_variables_decoding, return_sequences=True))  # decoder
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(y_matrix_axis, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')

I have two questions:

  • My input is in vector form, so I use '0.' (notice the float) as my masking value. My output, however, is an integer numpy matrix. Does the '0.' mask value also cover the integer '0' in that matrix? Or should I use mask_value = 0 (no float)?
  • Given the model I've presented, does the masking layer mask the zeros that come out of the TimeDistributedDense layer? If not, this would lead to the cost function being biased by zeros!

I apologize for the redundancy of the issue, but it is a very important one. Thank you!

@EderSantana
Contributor

My input is in vector form, so I use '0.' (notice the float) as my masking value. My output, however, is an integer numpy matrix. Does the '0.' mask value also cover the integer '0' in that matrix? Or should I use mask_value = 0 (no float)?

Keras uses allow_input_downcast=True, so everybody gets democratically converted to float32.

Given the model I've presented, does the masking layer mask the zeros that come out of the TimeDistributedDense layer? If not, this would lead to the cost function being biased by zeros!

There will be no cost function masking. Note that when you use return_sequences=False you make your model a regular Layer (instead of MaskedLayer), thus no masking at all after that first LSTM. To confirm that: you have a Dense in the middle of your model, and if there were a mask going through, Keras would throw an error. Thus, if you have things in your final output that should not contribute to the cost function, you have to use sample_weight. This is a parameter you can pass to fit or train_on_batch. Think of it as a hand-defined mask.
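
A minimal sketch of how that hand-defined mask gets wired in, assuming the model above, a Keras version that supports sample_weight_mode='temporal', and hypothetical arrays X (padded inputs), Y (one-hot targets), and W (per-timestep weights of shape (nb_samples, y_maxlen)):

# Per-timestep weighting of the loss requires sample_weight_mode='temporal'.
model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')

# W is 1.0 where the target timestep is real and 0.0 where it is padding,
# so padded positions contribute nothing to the cost.
model.fit(X, Y, sample_weight=W, batch_size=32, nb_epoch=10)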

BTW, I'm not following you guys' discussions about Sequence to Sequence Learning. I figured out a way to do it both the Sutskever way (sequence to sequence with a single RNN, Fig 1) and with a recurrent encoder-decoder like you are doing right now. Is that still an open problem for this community?

CREDIT: The following figure was modified from here.
Fig 1: [screenshot of a Sutskever-style sequence-to-sequence diagram: a single RNN reads the input sequence and then emits the output sequence]

@NickShahML
Contributor Author

@EderSantana, again, a much appreciated thanks.

There will be no cost function masking. Note that when you use return_sequences=False you make your model a regular Layer (instead of MaskedLayer),

I suspected that this might be the case. Many of my predictions with 'masking' didn't seem right (they always were the same length!). So thank you for explaining why that was happening.

Big congrats on using the Sutskever way. I don't want to speak for the others (@simonhughes22 and @sergeyf), but this type of Sutskever RNN would be a major help to us. The main reason is that we simply cannot mask our output sequences, meaning the cost function is biased!

Is that still an open problem for this community?

Maybe I'm not investigating all the resources, but everyone on thread #395 seems to have questions and to be struggling to improve results. Having an RNN layer that does sequence to sequence and allows masking for the output would be incredibly useful. Any light you can shed on this matter would be very appreciated.

@EderSantana
Contributor

Are you familiar with sample_weight? This is the thing you may be missing.
Say the last layer of your encoder-decoder generates sentences of length 100 for all batches, but assume that for the first sentence the desired output is only 10 "seconds" long. Your sample weight for that first sentence would be np.concatenate([np.ones(10), np.zeros(90)]).

Did you get the idea? My solution is all about using sample_weight. For the single RNN in Fig 1 there is not even masking, because you start the sentence with the actual values and pad zeros on the right. Later you just use sample_weights to take out the values that should not be compared to the desired output. The last important detail is that the desired output is padded with zeros on the left (while the input is being presented) AND on the right (in case of variable-length desired outputs). I found this trick pretty recently (something like last week or so) and it made me super happy :D
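
A minimal sketch of building such a weight matrix, assuming a hypothetical list true_lengths holding the real (unpadded) length of each desired output sequence and a padded length of 100:

import numpy as np

max_len = 100                     # padded output length, as in the example above
true_lengths = [10, 37, 100, 62]  # hypothetical true lengths per sample

# One row per sample: ones over the real timesteps, zeros over the padding.
# The first row is equivalent to np.concatenate([np.ones(10), np.zeros(90)]).
sample_weight = np.zeros((len(true_lengths), max_len))
for i, n in enumerate(true_lengths):
    sample_weight[i, :n] = 1.0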

@simonhughes22

@EderSantana I have 2 questions:

I figured out a way to do it both in the Sutskever way (sequence to sequence with single RNN, Fig 1)

Can you illustrate that with some sample Keras code? An alternative approach would be very useful, but the picture you posted seems like how we are doing it, unless the only difference is that you have a single RNN in the middle, which might help as that's fewer layers in between. I was able to get seq 2 seq learning working, but not well: it learned something, but accuracy was low.

The last important detail is that the desired output is padded with zeros on the left (while the input is being presented) AND on the right (in case of variable-length desired outputs). I found this trick pretty recently (something like last week or so) and it made me super happy 💃

I am not sure I understand that sentence. Thanks for your advice, it would be cool if we could get the Sutskever method working really well. I have a lot of ideas of how to use that sort of model, but I don't fancy writing from scratch in theano.

@NickShahML
Contributor Author

The last important detail is that the desired output is padded with zeros on the left (while the input is being presented) AND on the right (in case of variable-length desired outputs)

I'm also confused by this. Let's say you have 100 timesteps, yet for sample number 5 you only have 15 timesteps. For your input, wouldn't you pad 85 zeros on the left and put the 15 real values on the right?

And let us suppose that for sample number 5, your output is supposed to be 20 numbers long. Are you saying you would place the 20 numbers on the left and then 80 zeros on the right?

In summary (see the sketch after this table):

Input                  | Output
Pad zeros on the left  | Pad zeros on the right
Content on the right   | Content on the left
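
A sketch of that padding scheme with keras.preprocessing.sequence.pad_sequences, using hypothetical toy integer-encoded sequences (padding='pre' left-pads, padding='post' right-pads):

from keras.preprocessing.sequence import pad_sequences

inputs  = [[5, 2, 9], [7, 1]]      # hypothetical toy input sequences
targets = [[3, 8, 8, 4], [6, 2]]   # hypothetical toy target sequences

X = pad_sequences(inputs,  maxlen=100, padding='pre')   # zeros on the left, content on the right
Y = pad_sequences(targets, maxlen=100, padding='post')  # content on the left, zeros on the right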

There are a lot of ideas you're mentioning (like sample_weight) that I really wish I could implement. I don't know what would be the best use of your time, but could we potentially Skype call you? My username is leavesbreathe. I wouldn't mind giving some money for your time, as I know you probably have many things to do!

@EderSantana
Contributor

I'm preparing a tutorial. I need you guys to understand sequence to sequence learning to be able to use a new recurrent layer that is coming up... so much mystery... but I believe you will like it.

OFF TOPIC:
@LeavesBreathe I'll tell you the truth: it takes me so long to reply and write this code because I'm writing my thesis proposal and I have carpal tunnel... I can't work for too long before it starts hurting. This is why I'm also not editing my comments carefully. But I'll do my best to put this code out for you guys.

@NickShahML
Contributor Author

I'm preparing a tutorial. I need you guys to understand sequence to sequence learning to be able to use a new recurrent layer that is coming up... so much mystery... but I believe you will like it.

Never been more excited for a tutorial in my life.

I'm writing my thesis proposal and I have carpal tunnel...

I actually went to med school for a while, so if you ever want to chat, I wouldn't mind giving you some helpful suggestions for carpal tunnel. I actually know a ton about it as I started to get it a while back.

@viksit

viksit commented Dec 30, 2015

@EderSantana is that tutorial on sequence-to-sequence available anywhere? If not, maybe just a 5-6 line overview on this thread would be useful too :)

@EderSantana
Contributor

@viksit
There were some changes in the API; my approach only works with the version of Sequential that is on Seya. You can see this code:
https://github.com/EderSantana/seya/blob/master/examples/NTM.ipynb

I'm waiting for some fixes on mask: #1206

@ghost

ghost commented Jun 3, 2016

@EderSantana:
Hi, I have a question regarding the sample_weight approach. (I did test it on a simple small toy dataset of 10 samples with a vocabulary of 10 and varying time steps of maximum length 5, and it seems to work well.) However, don't you have to create a target Y matrix for your model of shape N_samples * N_max_steps * N_vocabulary when you use a softmax cross-entropy loss for maximum log likelihood? For a large dataset this is going to be huge: say I have 90K samples, 94 steps maximum, and 55K words. That is terribly large (and I am not even sure you can pass sparse matrices as the target value).

Is there a way to calculate this Y matrix only for a batch (it would still be big, but at least approximately O(10^4) less memory)? Or is there a more efficient approach that just uses the indices directly?

I do know that you could handle the batch sampling yourself and store the arrays dynamically for these batches.... But that sounds like a hard way to do validation then...

Also, Keras allows sample_weights only for training and not for validation...

I guess that should be changed!

@EderSantana
Contributor

You can write your own training loop and update the model with train_on_batch. This way you can create the labels and weights as you need them.
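
A minimal sketch of that kind of loop, with hypothetical names: X for the padded inputs, y_int for integer-coded targets of shape (n_samples, max_steps), sample_weight for the per-timestep weight matrix, and a model compiled with sample_weight_mode='temporal':

import numpy as np

def to_one_hot(batch_int, vocab_size):
    # Expand an integer-coded batch (batch, steps) to one-hot (batch, steps, vocab),
    # so the full one-hot target never has to live in memory at once.
    out = np.zeros(batch_int.shape + (vocab_size,), dtype=np.float32)
    for i, row in enumerate(batch_int):
        out[i, np.arange(len(row)), row] = 1.0
    return out

batch_size = 32
for epoch in range(nb_epoch):
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        wb = sample_weight[start:start + batch_size]                   # weights for this batch
        yb = to_one_hot(y_int[start:start + batch_size], vocab_size)   # one-hot for this batch only
        loss = model.train_on_batch(xb, yb, sample_weight=wb)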

@jkravanja

jkravanja commented Jun 26, 2016

I am dealing with a similar problem, where I want the cost function to skip some samples at the beginning of the sequence. I am familiar with sample_weight used with fit or train_on_batch, but does anybody know if it is possible to use it with fit_generator?
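
In the Keras versions I'm aware of, fit_generator accepts a generator that yields (inputs, targets, sample_weights) 3-tuples, so something like this sketch may work (X, Y, W and the argument values are hypothetical, and the model still needs sample_weight_mode='temporal' for per-timestep weights):

import numpy as np

def batch_generator(X, Y, W, batch_size=32):
    # Yield (inputs, targets, per-timestep weights) forever; fit_generator accepts 3-tuples.
    while True:
        idx = np.random.randint(0, len(X), batch_size)
        yield X[idx], Y[idx], W[idx]

model.fit_generator(batch_generator(X, Y, W),
                    samples_per_epoch=len(X), nb_epoch=10)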

@henchc

henchc commented Jun 28, 2016

@EderSantana thank you for your clarification of sample_weight above. I have a task similar to POS tagging where I pad both the X and y, and at first my model acted as if there were no masking and tried to predict the 0s in the X with 0s in the y. I've implemented your sample_weight method, but am now running into a problem: the model is still trying to predict the masked values in the X, but won't predict them as the masked value in the y. So it no longer considers the mask a label from the y, but it is still trying to predict the padding in the sentence (X). I feel like I'm almost there; any idea? (I have my Stack Overflow question here.)

@Sri-Harsha

@EderSantana May I know how to use masking at the input if the padded values are not zeros but some other value? For example, with images, even the real samples contain zeros as values, so we need to pad with a different value. How do we perform masking in this case?
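
For what it's worth, mask_value is configurable, so one option is to pad with a value that never occurs in the real data and mask on that. A minimal sketch, assuming -1.0 never appears in the data and hypothetical dimensions max_timesteps and n_features:

from keras.models import Sequential
from keras.layers import Masking, LSTM

model = Sequential()
# Pad the sequences with -1.0 instead of 0.0 and mask on that value,
# so genuine zeros in the data are not dropped.
model.add(Masking(mask_value=-1.0, input_shape=(max_timesteps, n_features)))
model.add(LSTM(128))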

@HariKrishna-Vydana

@fchollet I am using fit_generator to train; different batches in my data have different sizes, like (4,100,40), (4,120,40), ... (4,1200,40), (4,1500,40).
When I use this to train a Keras model, all the data gets accumulated on the GPU and the GPU overflows. Is there a workaround for this?

@stale stale bot added the stale label Jun 16, 2017
@stale

stale bot commented Jun 16, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot closed this as completed Jul 16, 2017