
Two pooling issues #51 (Closed)

@ihsgnef opened this issue Mar 5, 2018 · 8 comments

@ihsgnef commented Mar 5, 2018

Hi, thanks for sharing this code!
I noticed two issues with the current implementation of mean-/max-pooling over the BiLSTM.

1. `sent_len` is not restored to the original order before it is used for normalization. At line 46, `sent_len` is sorted from largest to smallest and the input embeddings are reordered accordingly. At line 61 the hidden states are rearranged back into the original order, but `sent_len` is not, so mean-pooling can be normalized by the wrong lengths. (A sketch of a possible fix for both issues follows the example below.)

2. Padding is not masked out before pooling. As a result, the encoded sentence, and therefore the prediction, depends on how much padding the batch adds. I'm not sure whether this is by design or a mistake. I ran into a case where running in a batch vs. running on each example separately gives different predictions, as shown below. Note that this result may not be directly reproducible: only the trained encoder has been released, and this example comes from an SNLI classifier I trained on top of it.

```python
ss1 = [
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>']
]
ss2 = [
    ['<s>', 'A', 'man', 'is', 'repainting', 'a', '</s>'],
    ['<s>', 'man', 'is', 'repainting', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'man', 'is', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'is', 'repainting', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'man', 'repainting', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'man', 'is', 'wearing', 'a', 'shirt', '</s>'],
    ['<s>', 'A', 'man', 'is', 'wearing', 'a', 'blue', '</s>'],
    ['<s>', 'A', 'is', 'wearing', 'a', 'blue', 'shirt', '</s>'],
    ['<s>', 'A', 'man', 'wearing', 'a', 'blue', 'shirt', '</s>'],
    ['<s>', 'man', 'is', 'wearing', 'a', 'blue', 'shirt', '</s>']
]

k = 6
ss1 = ss1[:k]
ss2 = ss2[:k]
model.eval()

# Batched predictions over all k pairs at once.
s1, s1_len = get_batch(ss1, word_vec)
s2, s2_len = get_batch(ss2, word_vec)
s1 = Variable(s1.cuda())
s2 = Variable(s2.cuda())
p = torch.max(model((s1, s1_len), (s2, s2_len)), 1)[1].data.cpu().numpy()

# Per-example predictions (batch size 1) for comparison.
for i in range(len(ss1)):
    t1, t1_len = get_batch([ss1[i]], word_vec)
    t2, t2_len = get_batch([ss2[i]], word_vec)
    t1, t2 = Variable(t1.cuda()), Variable(t2.cuda())
    q = torch.max(model((t1, t1_len), (t2, t2_len)), 1)[1].data.cpu().numpy()[0]
    print(p[i], q)
```

Output (batched prediction vs. per-example prediction):

```
1 0
1 1
1 1
1 1
1 1
0 0
```

The second issue might be related to Issue #48.
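
For concreteness, here is a minimal sketch of what a fix for both issues might look like. This is a hypothetical helper, untested against the repo: names like `sent_output`, `sent_len`, and `idx_unsort` are modeled on the sort/pack/unsort pattern in `models.py`.

```python
import torch

def pool_bilstm_states(sent_output, sent_len, idx_unsort, pool_type='mean'):
    """Pool BiLSTM states with padding masked out (hypothetical sketch).

    sent_output: (seq_len, bsize, 2*h) states from pad_packed_sequence,
                 still in the sorted-by-length order used for packing
    sent_len:    LongTensor of true lengths, in the same sorted order
    idx_unsort:  LongTensor restoring the original batch order
    """
    sent_output = sent_output.index_select(1, idx_unsort)
    sent_len = sent_len.index_select(0, idx_unsort)  # issue 1: unsort lengths too

    # issue 2: mask padded positions so they cannot leak into the pooling
    positions = torch.arange(sent_output.size(0), device=sent_len.device)
    mask = positions.unsqueeze(1) < sent_len.unsqueeze(0)  # (seq_len, bsize)
    mask = mask.unsqueeze(2).to(sent_output)               # float, 1.0 on real tokens

    if pool_type == 'mean':
        return (sent_output * mask).sum(0) / mask.sum(0)   # divide by true lengths
    return sent_output.masked_fill(mask == 0, -1e9).max(0)[0]
```

With something like this, the pooled embedding no longer depends on how much padding the batch adds.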

I made an attempt to fix both issues in my pull request. With the pooling fixed, I trained an SNLI classifier from scratch. Performance increased a little on SNLI (dev 84.56, test 84.7) but decreased on almost all transfer tasks. Here are the numbers I got (Fork column):

| Task | SkipThought | InferSent | Fork |
| --- | --- | --- | --- |
| MR | 79.4 | 81.1 | 79.86 |
| CR | 83.1 | 86.3 | 83.16 |
| SUBJ | 93.7 | 92.4 | 92.45 |
| MPQA | 89.3 | 90.2 | 90.01 |
| STS14 | .44/.45 | .68/.65 | .65/.62 |
| SICK Relatedness | 0.858 | 0.884 | 0.877 |
| SICK Entailment | 79.5 | 86.1 | 85.45 |
| SST2 | 82.9 | 84.6 | 81.77 |
| SST5 | - | - | 44.03 |
| TREC | 88.4 | 88.2 | 85.4 |
| MRPC | - | 76.2/83.1 | 74.2/81.8 |

@aconneau (Contributor) commented:

Hi,
I got different results from yours when re-training the model without the max-pool issue, using the fix added in 14b7ebb. I added a model trained this way and reported its results in the README (infersent2). The results were indeed a bit worse, especially on MR/CR, but also sometimes a bit better on other tasks; I'm still investigating why this change affects the results, as it's a bit unclear to me. The infersent2 model won't have the different-batch → different-embeddings problem. It's also trained with fastText vectors.
Best,
Alexis

@OanaMariaCamburu commented Jun 27, 2018

Hi Alexis,

I also corrected the max-pooling, with the simple fix of setting the padding value to something lower than -1 (LSTM outputs are bounded in (-1, 1)): `sent_output = nn.utils.rnn.pad_packed_sequence(sent_output, False, padding_value=-10)[0]`. Like ihsgnef, I got worse results. Below are my results averaged over 5 seeds (including your default one), with standard deviations on the second row. This is with just the bug fixed, still using GloVe, for a fair comparison.

| | MR | CR | SUBJ | MPQA | SST2 | TREC | MRPC-acc | MRPC-F1 | SICK-E | SICK-R (Pearson) | STS14 (Pearson) | STS14 (Spearman) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mean | 78.18 | 81.28 | 92.46 | 88.46 | 82.12 | 89.32 | 74.82 | 82.74 | 85.96 | 0.887 | 0.65 | 0.63 |
| std | 0.25 | 0.15 | 0.15 | 0.21 | 0.22 | 0.5 | 0.66 | 0.27 | 0.32 | 0.002 | 0 | 0 |

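In context, the fix might sit in the encoder's forward pass roughly as follows. This is a fragment sketch, untested: `sent_output`, `idx_unsort`, and `pool_type` are assumed to follow the repo's `models.py`, and -10 is safe precisely because the LSTM outputs can never reach -1.

```python
import torch
import torch.nn as nn

# ... after running the packed BiLSTM:
sent_output = nn.utils.rnn.pad_packed_sequence(
    sent_output, batch_first=False, padding_value=-10)[0]
sent_output = sent_output.index_select(1, idx_unsort)  # restore original order
if pool_type == 'max':
    emb = torch.max(sent_output, 0)[0]  # padding at -10 can never win the max
```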

I have also tried adding a ReLU layer to mimic the bug's behavior, but that didn't improve the results. I don't know why this bug acts like a feature; it could be a sort of "dropout effect" on the longer sentences (relative to the others in the batch), or it could just be luck :)

Best,
Oana

@aconneau (Contributor) commented:

@OanaMariaCamburu are you training this model on SNLI only, as ihsgnef did? Note that the ReLU does not mimic the issue with the max padding.

@OanaMariaCamburu commented:

Yes, I trained on SNLI only.

ReLU would mimic it for every sentence except the longest one in the batch: since zero padding contributes 0 to the max, the buggy max-pool returns max(true max, 0), i.e. a ReLU of the correct max-pooled value (see the quick check below). I checked, and during training there is usually only one longest sentence per batch, rarely 2 or 3. Do you have any better ideas for mimicking the bug in a deterministic/correct way? I also tried adding dropout on the representation of each timestep, but that didn't improve results either.
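
A minimal standalone check of that equivalence, with random tensors standing in for the hidden states:

```python
import torch

torch.manual_seed(0)
h = torch.randn(5, 1, 8)                # "true" hidden states: (len, bsize, dim)
pad = torch.zeros(3, 1, 8)              # zero padding added by a longer batch
buggy = torch.cat([h, pad]).max(0)[0]   # buggy max-pool over states + padding
mimic = torch.relu(h.max(0)[0])         # ReLU of the correct max-pool
print(torch.allclose(buggy, mimic))     # True
```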

@aconneau (Contributor) commented Jun 27, 2018

OK, so just to be clear: your numbers should not be compared to ihsgnef's column 3, but to the numbers you get without the zero-padding change, on the same dataset. If you want to compare to the open-source results, you should train on AllNLI, not only on SNLI. That said, there does indeed seem to be some loss in performance on some tasks. The zero padding could be seen as a kind of regularization, yes, though I'd guess more like batch-norm (which also yields different embeddings depending on the batch a sample belongs to) than dropout. The sentences are also encoded in batches drawn from a sorted dataset, which itself shapes how much padding each batch gets.

The comparison ihsgnef made between column 3 (InferSent) and column 4 (Fork) does not seem valid, as the models were trained on different data (column 3 on AllNLI, column 4 on SNLI), so it's hard to conclude anything there. Please look instead at the comparison at the bottom of the README, where I trained the models under the same conditions (infersent1 vs infersent2). There is indeed some loss in performance, as mentioned, in particular on MR/CR. By the way, fastText did not bring an improvement over GloVe (results with GloVe were almost identical); switching to fastText was meant to better handle OOV words via character n-grams, a feature we wanted to add eventually.

@OanaMariaCamburu commented:

Yes, but in your paper you also report results when training just on SNLI, so I compared against those when I noticed the decrease in performance.

OK, so do you get better results when mimicking the bug with batch-norm?

Btw, for training on AllNLI, in which order did you concatenate the SNLI and MultiNLI train sets: first SNLI and then MultiNLI? And I assume model selection was also done on the concatenation of the dev sets, right?

@ihsgnef (Author) commented Jun 28, 2018

Hi Alexis,
Thanks for looking into this issue. I haven't been able to run the whole pipeline with your updates yet. With the v1 code, one small change, which in theory shouldn't affect the pooling, slightly improves performance on almost all tasks: use `batch_first` in `pad_packed_sequence` and adjust the dimensions afterwards accordingly (a sketch follows the table below). It does not fix the issue discussed above and is likely another bug that happens to work. I'll try to reproduce these results on the new InferSent and SentEval codebases, but here are the numbers I get from training only on SNLI, with GloVe (row Fork):

| Model | MR | CR | SUBJ | MPQA | STS14 | STS Benchmark | SICK Relatedness | SICK Entailment | SST | TREC | MRPC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InferSent1 | 81.1 | 86.3 | 92.4 | 90.2 | .68/.65 | 75.8/75.5 | 0.884 | 86.1 | 84.6 | 88.2 | 76.2/83.1 |
| InferSent2 | 79.7 | 84.2 | 92.7 | 89.4 | .68/.66 | 78.4/78.4 | 0.888 | 86.3 | 84.3 | 90.8 | 76.0/83.8 |
| Fork | 81.67 | 86.68 | 92.46 | 90.52 | .69/.66 | - | 0.883 | 86.4 | 85.1 | 88.2 | 76.5 |

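For reference, the `batch_first` change mentioned above might look roughly like this (a fragment sketch against the v1 `models.py`, untested; `sent_output` here is the packed BiLSTM output and `idx_unsort` the unsorting index):

```python
import torch
import torch.nn as nn

# Unpack with batch_first=True, then adjust every dimension index accordingly:
sent_output = nn.utils.rnn.pad_packed_sequence(sent_output, batch_first=True)[0]
sent_output = sent_output.index_select(0, idx_unsort)  # batch is now dim 0
emb = torch.max(sent_output, 1)[0]                     # pool over time, now dim 1
```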

@aconneau (Contributor) commented:

Closing, as results for both versions of InferSent have been included in the README.
