
Two pooling issues #51 (Closed)

@ihsgnef opened this issue Mar 5, 2018 · 8 comments

@ihsgnef commented Mar 5, 2018

Hi, thanks for sharing this code!
I noticed two issues with the current implementation of mean-/max-pooling over the BiLSTM.

1. `sent_len` is not restored to the original order before it is used for normalization. At line 46, `sent_len` is sorted from largest to smallest and the input embeddings are reordered accordingly. At line 61 the hidden states are rearranged back into the original order, but `sent_len` is not, so mean-pooling can be normalized by the wrong lengths. (A sketch of a possible fix for both issues follows the example below.)

2. Padding is not masked out before pooling. As a result, the encoded sentence, and therefore the prediction, depends on how much padding the batch adds. I'm not sure whether this is by design or a mistake. I ran into a case where running in a batch vs. running on each example separately gives different predictions, as shown below. Note that this result may not be directly reproducible: only the trained encoder has been released, and this example comes from an SNLI classifier I trained on top of it.

```python
ss1 = [
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>'],
    ['<s>', 'A', 'man', 'in', 'a', 'blue', 'shirt', 'standing', 'in', 'front', 'of', 'a', 'garage-like', 'structure', 'painted', 'with', 'geometric', 'designs', '.', '</s>']
]
ss2 = [
    ['<s>', 'A', 'man', 'is', 'repainting', 'a', '</s>'],
    ['<s>', 'man', 'is', 'repainting', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'man', 'is', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'is', 'repainting', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'man', 'repainting', 'a', 'garage', '</s>'],
    ['<s>', 'A', 'man', 'is', 'wearing', 'a', 'shirt', '</s>'],
    ['<s>', 'A', 'man', 'is', 'wearing', 'a', 'blue', '</s>'],
    ['<s>', 'A', 'is', 'wearing', 'a', 'blue', 'shirt', '</s>'],
    ['<s>', 'A', 'man', 'wearing', 'a', 'blue', 'shirt', '</s>'],
    ['<s>', 'man', 'is', 'wearing', 'a', 'blue', 'shirt', '</s>']
]

k = 6
ss1 = ss1[:k]
ss2 = ss2[:k]
model.eval()

# Batched predictions over all k pairs at once.
s1, s1_len = get_batch(ss1, word_vec)
s2, s2_len = get_batch(ss2, word_vec)
s1 = Variable(s1.cuda())
s2 = Variable(s2.cuda())
p = torch.max(model((s1, s1_len), (s2, s2_len)), 1)[1].data.cpu().numpy()

# Per-example predictions (batch size 1) for comparison.
for i in range(len(ss1)):
    t1, t1_len = get_batch([ss1[i]], word_vec)
    t2, t2_len = get_batch([ss2[i]], word_vec)
    t1, t2 = Variable(t1.cuda()), Variable(t2.cuda())
    q = torch.max(model((t1, t1_len), (t2, t2_len)), 1)[1].data.cpu().numpy()[0]
    print(p[i], q)
```

Output (batched prediction vs. per-example prediction):

```
1 0
1 1
1 1
1 1
1 1
0 0
```

The second issue might be related to Issue #48.
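
For concreteness, here is a minimal sketch of what a fix for both issues might look like. This is a hypothetical helper, untested against the repo: names like `sent_output`, `sent_len`, and `idx_unsort` are modeled on the sort/pack/unsort pattern in `models.py`.

```python
import torch

def pool_bilstm_states(sent_output, sent_len, idx_unsort, pool_type='mean'):
    """Pool BiLSTM states with padding masked out (hypothetical sketch).

    sent_output: (seq_len, bsize, 2*h) states from pad_packed_sequence,
                 still in the sorted-by-length order used for packing
    sent_len:    LongTensor of true lengths, in the same sorted order
    idx_unsort:  LongTensor restoring the original batch order
    """
    sent_output = sent_output.index_select(1, idx_unsort)
    sent_len = sent_len.index_select(0, idx_unsort)  # issue 1: unsort lengths too

    # issue 2: mask padded positions so they cannot leak into the pooling
    positions = torch.arange(sent_output.size(0), device=sent_len.device)
    mask = positions.unsqueeze(1) < sent_len.unsqueeze(0)  # (seq_len, bsize)
    mask = mask.unsqueeze(2).to(sent_output)               # float, 1.0 on real tokens

    if pool_type == 'mean':
        return (sent_output * mask).sum(0) / mask.sum(0)   # divide by true lengths
    return sent_output.masked_fill(mask == 0, -1e9).max(0)[0]
```

With something like this, the pooled embedding no longer depends on how much padding the batch adds.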

I made an attempt to fix both issues in my pull request. With the pooling fixed, I trained an SNLI classifier from scratch. Performance increased a little on SNLI (dev 84.56, test 84.7) but decreased on almost all transfer tasks. Here are the numbers I got (Fork column):

| Task | SkipThought | InferSent | Fork |
| --- | --- | --- | --- |
| MR | 79.4 | 81.1 | 79.86 |
| CR | 83.1 | 86.3 | 83.16 |
| SUBJ | 93.7 | 92.4 | 92.45 |
| MPQA | 89.3 | 90.2 | 90.01 |
| STS14 | .44/.45 | .68/.65 | .65/.62 |
| SICK Relatedness | 0.858 | 0.884 | 0.877 |
| SICK Entailment | 79.5 | 86.1 | 85.45 |
| SST2 | 82.9 | 84.6 | 81.77 |
| SST5 | - | - | 44.03 |
| TREC | 88.4 | 88.2 | 85.4 |
| MRPC | - | 76.2/83.1 | 74.2/81.8 |

@aconneau (Contributor) commented:

Hi,
I got different results from yours when re-training the model without the max-pool issue, using the fix added in 14b7ebb. I added a model trained this way and reported its results in the README (infersent2). The results were indeed a bit worse, especially on MR/CR, but also sometimes a bit better on other tasks; I'm still investigating why this change affects the results, as it's a bit unclear to me. The infersent2 model won't have the different-batch → different-embeddings problem. It's also trained with fastText vectors.
Best,
Alexis

@OanaMariaCamburu commented Jun 27, 2018

Hi Alexis,

I also corrected the max-pooling, with the simple fix of setting the padding value to something lower than -1 (LSTM outputs are bounded in (-1, 1)): `sent_output = nn.utils.rnn.pad_packed_sequence(sent_output, False, padding_value=-10)[0]`. Like ihsgnef, I got worse results. Below are my results averaged over 5 seeds (including your default one), with standard deviations on the second row. This is with just the bug fixed, still using GloVe, for a fair comparison.

| | MR | CR | SUBJ | MPQA | SST2 | TREC | MRPC-acc | MRPC-F1 | SICK-E | SICK-R (Pearson) | STS14 (Pearson) | STS14 (Spearman) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mean | 78.18 | 81.28 | 92.46 | 88.46 | 82.12 | 89.32 | 74.82 | 82.74 | 85.96 | 0.887 | 0.65 | 0.63 |
| std | 0.25 | 0.15 | 0.15 | 0.21 | 0.22 | 0.5 | 0.66 | 0.27 | 0.32 | 0.002 | 0 | 0 |

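In context, the fix might sit in the encoder's forward pass roughly as follows. This is a fragment sketch, untested: `sent_output`, `idx_unsort`, and `pool_type` are assumed to follow the repo's `models.py`, and -10 is safe precisely because the LSTM outputs can never reach -1.

```python
import torch
import torch.nn as nn

# ... after running the packed BiLSTM:
sent_output = nn.utils.rnn.pad_packed_sequence(
    sent_output, batch_first=False, padding_value=-10)[0]
sent_output = sent_output.index_select(1, idx_unsort)  # restore original order
if pool_type == 'max':
    emb = torch.max(sent_output, 0)[0]  # padding at -10 can never win the max
```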

I have also tried adding a ReLU layer to mimic the bug's behavior, but that didn't improve the results. I don't know why this bug acts like a feature; it could be a sort of "dropout effect" on the longer sentences (relative to the others in the batch), or it could just be luck :)

Best,
Oana

@aconneau (Contributor) commented:

@OanaMariaCamburu are you training this model on SNLI only, as ihsgnef did? Note that the ReLU does not mimic the issue with the max padding.

@OanaMariaCamburu commented:

Yes, I trained on SNLI only.

ReLU would mimic it for every sentence except the longest one in the batch: since zero padding contributes 0 to the max, the buggy max-pool returns max(true max, 0), i.e. a ReLU of the correct max-pooled value (see the quick check below). I checked, and during training there is usually only one longest sentence per batch, rarely 2 or 3. Do you have any better ideas for mimicking the bug in a deterministic/correct way? I also tried adding dropout on the representation of each timestep, but that didn't improve results either.
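
A minimal standalone check of that equivalence, with random tensors standing in for the hidden states:

```python
import torch

torch.manual_seed(0)
h = torch.randn(5, 1, 8)                # "true" hidden states: (len, bsize, dim)
pad = torch.zeros(3, 1, 8)              # zero padding added by a longer batch
buggy = torch.cat([h, pad]).max(0)[0]   # buggy max-pool over states + padding
mimic = torch.relu(h.max(0)[0])         # ReLU of the correct max-pool
print(torch.allclose(buggy, mimic))     # True
```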

@aconneau (Contributor) commented Jun 27, 2018

OK, so just to be clear: your numbers should not be compared to ihsgnef's column 3, but to the numbers you get without the zero-padding change, on the same dataset. If you want to compare to the open-source results, you should train on AllNLI, not only on SNLI. That said, there does indeed seem to be some loss in performance on some tasks. The zero padding could be seen as a kind of regularization, yes, though I'd guess more like batch-norm (which also yields different embeddings depending on the batch a sample belongs to) than dropout. The sentences are also encoded in batches drawn from a sorted dataset, which itself shapes how much padding each batch gets.

The comparison ihsgnef made between column 3 (InferSent) and column 4 (Fork) does not seem valid, as the models were trained on different data (column 3 on AllNLI, column 4 on SNLI), so it's hard to conclude anything there. Please look instead at the comparison at the bottom of the README, where I trained the models under the same conditions (infersent1 vs infersent2). There is indeed some loss in performance, as mentioned, in particular on MR/CR. By the way, fastText did not bring an improvement over GloVe (results with GloVe were almost identical); switching to fastText was meant to better handle OOV words via character n-grams, a feature we wanted to add eventually.

@OanaMariaCamburu commented:

Yes, but in your paper you also report results when training just on SNLI, so I compared against those when I noticed the decrease in performance.

OK, so do you get better results when mimicking the bug with batch-norm?

Btw, for training on AllNLI, in which order did you concatenate the SNLI and MultiNLI train sets: first SNLI and then MultiNLI? And I assume model selection was also done on the concatenation of the dev sets, right?

@ihsgnef (Author) commented Jun 28, 2018

Hi Alexis,
Thanks for looking into this issue. I haven't been able to run the whole pipeline with your updates yet. With the v1 code, one small change, which in theory shouldn't affect the pooling, slightly improves performance on almost all tasks: use `batch_first` in `pad_packed_sequence` and adjust the dimensions afterwards accordingly (a sketch follows the table below). It does not fix the issue discussed above and is likely another bug that happens to work. I'll try to reproduce these results on the new InferSent and SentEval codebases, but here are the numbers I get from training only on SNLI, with GloVe (row Fork):

| Model | MR | CR | SUBJ | MPQA | STS14 | STS Benchmark | SICK Relatedness | SICK Entailment | SST | TREC | MRPC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InferSent1 | 81.1 | 86.3 | 92.4 | 90.2 | .68/.65 | 75.8/75.5 | 0.884 | 86.1 | 84.6 | 88.2 | 76.2/83.1 |
| InferSent2 | 79.7 | 84.2 | 92.7 | 89.4 | .68/.66 | 78.4/78.4 | 0.888 | 86.3 | 84.3 | 90.8 | 76.0/83.8 |
| Fork | 81.67 | 86.68 | 92.46 | 90.52 | .69/.66 | - | 0.883 | 86.4 | 85.1 | 88.2 | 76.5 |

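For reference, the `batch_first` change mentioned above might look roughly like this (a fragment sketch against the v1 `models.py`, untested; `sent_output` here is the packed BiLSTM output and `idx_unsort` the unsorting index):

```python
import torch
import torch.nn as nn

# Unpack with batch_first=True, then adjust every dimension index accordingly:
sent_output = nn.utils.rnn.pad_packed_sequence(sent_output, batch_first=True)[0]
sent_output = sent_output.index_select(0, idx_unsort)  # batch is now dim 0
emb = torch.max(sent_output, 1)[0]                     # pool over time, now dim 1
```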

@aconneau (Contributor) commented:

Closing, as results for both versions of InferSent have been included in the README.
