Two pooling issues #51
Comments
Hi,
Hi Alexis, I also corrected the max-pooling with a simple fix: setting the padding value to a negative value lower than -1.
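That exact snippet isn't reproduced here, but such a fix typically looks something like this (a rough sketch with my own variable names, assuming zero-padded BiLSTM states of shape (seq_len, batch, dim) and a tensor of true lengths):

```python
import torch

def masked_max_pool(sent_output, sent_len, pad_value=-1e9):
    # sent_output: (seq_len, batch, dim) zero-padded BiLSTM hidden states
    # sent_len:    (batch,) true sentence lengths as a LongTensor
    seq_len = sent_output.size(0)
    positions = torch.arange(seq_len, device=sent_output.device).unsqueeze(1)   # (seq_len, 1)
    pad_mask = (positions >= sent_len.unsqueeze(0)).unsqueeze(2)                # (seq_len, batch, 1)
    # Padded positions get a very negative value so they can never win the max.
    return sent_output.masked_fill(pad_mask, pad_value).max(0)[0]               # (batch, dim)
```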
I have also tried adding a ReLU layer to mimic the bug behavior, but that didn't improve the results. I don't know why this bug is acting like a feature; it could be a sort of "dropout effect" on the longer sentences (wrt the others in the batch), or it could just be luck :) Best,
@OanaMariaCamburu are you training this model on SNLI only, as ihsgnef did? Note that the ReLU does not mimic the issue with the max-pad.
Yes, I trained on SNLI only. ReLU would mimic it for all except the longest sequence in the batch, which I checked: during training, most of the time there is only one longest sentence in the batch, rarely 2 or 3. Do you have any better clues for mimicking the bug in a deterministic/correct way? I also tried adding dropout on the representations of each timestep, but that didn't improve results either.
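For reference, written out explicitly, the effect being discussed is a per-dimension clamp at zero for every sentence except the longest one in its batch (a sketch with my own names, not the repo's code):

```python
import torch

def buggy_max_pool(sent_output):
    # Original behaviour: zero-padded timesteps take part in the max, so any
    # sentence shorter than the batch maximum gets max(true_max, 0) per dimension.
    return sent_output.max(0)[0]

def relu_mimic(sent_output, sent_len, pad_value=-1e9):
    # Batch-independent approximation: masked (correct) max, then clamp at 0.
    # This matches the buggy pooling for every sentence except the longest one
    # in the batch, which has no padding and therefore no clamp.
    seq_len = sent_output.size(0)
    positions = torch.arange(seq_len, device=sent_output.device).unsqueeze(1)
    pad_mask = (positions >= sent_len.unsqueeze(0)).unsqueeze(2)
    true_max = sent_output.masked_fill(pad_mask, pad_value).max(0)[0]
    return torch.relu(true_max)
```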
Ok, so just to be clear, your numbers should not be compared to column 3 of ihsgnef's table, but to the numbers you get without the zero-padding, on the same dataset. If you want to compare to the open-source results, you should train on AllNLI, not only on SNLI. That said, there does indeed seem to be some loss in performance for some tasks.

The padding with zeros could be seen as some sort of regularization, yes, though more like batch-norm (which also leads to different embeddings depending on the batch the sample belongs to) than dropout, I'd say. The sentences are also encoded in batches drawn from a length-sorted dataset, which has an effect on the padding too (sketched below).

The comparison made by ihsgnef between column 3 (InferSent) and column 4 (Fork) does not seem correct, as the models were trained on different data (column 3 on AllNLI and column 4 on SNLI), so it's hard to conclude anything there. Please look at the comparison I made at the bottom of the README, where I trained the models in the same conditions (infersent1 vs infersent2). There is indeed some loss in performance as mentioned, in particular on MR/CR.

Btw, switching from GloVe to fastText did not bring an improvement (results with GloVe were almost identical); the switch was meant for better handling of OOV words with character n-grams, which is a feature we wanted to add eventually.
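The length-sorted batching mentioned above looks roughly like this (a sketch, not the repo's actual data loader); sentences of similar length end up in the same batch, so the amount of zero-padding a given sentence receives depends on its batch:

```python
def length_sorted_batches(sentences, batch_size):
    # Group sentences of similar length together; the padding added to each
    # sentence then depends on the longest sentence of its batch.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    for start in range(0, len(order), batch_size):
        yield [sentences[i] for i in order[start:start + batch_size]]
```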
Yes, in your paper you also report results when training just on SNLI, so I compared with those when I noticed the decrease in performance. OK, so do you get better results when mimicking with batch-norm? Btw, for training on AllNLI, in which order did you concatenate the SNLI and MultiNLI train sets, first SNLI and then MultiNLI? And I assume model selection was also done on the concatenation of the dev sets, right?
Hi Alexis,
Closing as results of both versions of InferSent have been included in the README. |
Hi, thanks for sharing this code!
I noticed two issues with the current implementation of mean-/max-pooling over the BiLSTM.

1. `sent_len` is not unsorted before being used for normalization. At Line 46 `sent_len` is sorted from biggest to smallest, and the input embeddings are adjusted accordingly. At Line 61 the hidden states are rearranged into the original order, while `sent_len` is not. This might lead to incorrect normalization in mean-pooling.
2. Padding is not handled before pooling. As a result, the encoded sentence, and thus the prediction, depends on the number of padding tokens. I'm not sure if this is by design or a mistake. I ran into a case where running in a batch vs. running on each example separately gives me different predictions, as shown below. Note that this result might not be directly reproducible, as only the trained encoder is released and this example is generated from an SNLI classifier I trained on top of the released encoder.
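The example output itself isn't reproduced here, but the kind of check that exposes the batch dependence is roughly the following (hypothetical sentences; `model` is assumed to be a loaded InferSent encoder with its vocabulary already built):

```python
import numpy as np

short = "A man is playing a guitar."          # hypothetical example sentence
longer = "A man wearing a bright red jacket is playing an old guitar on a crowded street corner."

alone = model.encode([short])[0]              # no padding when encoded alone
in_batch = model.encode([short, longer])[0]   # padded up to the longer sentence

# With padding included in the pooling, the two vectors (and hence any
# prediction built on top of them) can differ.
print(np.abs(alone - in_batch).max())
```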
The second issue might be related to Issue #48.
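Concretely, addressing both points amounts to something like this in the pooling code (a rough sketch with names following the description above, not the actual patch):

```python
import torch

def pooled_embedding(sent_output, sent_len, idx_unsort, pool_type):
    # sent_output: (seq_len, batch, dim) hidden states already rearranged back
    #              into the original order; sent_len: lengths still in sorted
    #              order (assumed to be a LongTensor here).
    # Issue 1: un-sort the lengths the same way the hidden states were un-sorted,
    # so mean-pooling is normalised by the right length for each sentence.
    sent_len = sent_len[idx_unsort]

    # Issue 2: mask out padded positions so they do not contribute to the pooling.
    positions = torch.arange(sent_output.size(0), device=sent_output.device).unsqueeze(1)
    pad_mask = (positions >= sent_len.unsqueeze(0)).unsqueeze(2)   # (seq_len, batch, 1)

    if pool_type == "mean":
        return sent_output.masked_fill(pad_mask, 0.0).sum(0) / sent_len.unsqueeze(1).float()
    else:  # "max"
        return sent_output.masked_fill(pad_mask, -1e9).max(0)[0]
```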
I made an attempt to fix these two issues in my pull request. With the pooling issues fixed, I trained an SNLI classifier from scratch. Performance increased a little on SNLI (dev 84.56, test 84.7), but decreased on almost all transfer tasks. Here are the numbers I got (Fork column):