Contrastive loss implementation discrepancy between the paper and codebase #8

Closed
Gyat opened this issue May 12, 2021 · 1 comment

Gyat commented May 12, 2021

Hello,

This is in relation to the losses described in the paper and implemented in the codebase. I need your help understanding the following:

  1. Page 4 of the paper states: "the contrastive alignment loss enforces alignment between the embedded representations of the object at the output of the decoder, and the text representation at the output of the cross encoder." However, in transformer.py, the following snippet is used for the loss calculation:

"text_pooled_op": encoded_text.pooler_output if self.CLS is not None else None,

"img_pooled_op": img_memory[0] if self.CLS is not None else None, # Return the CLS token

which means that the embedded representation of the text is taken from the classification (pooler) token of the BERT-based text backbone encoder, and the embedded representation of the image is taken from the output of the transformer encoder, not the decoder. Is this genuinely a discrepancy? If not, could you kindly point me to the snippet for these loss calculations where you tap into the decoder output?

  2. Also, is the following understanding correct: the 'Soft token prediction' loss from the paper is actually called 'contrastive_align_loss' in the codebase, and the 'Contrastive alignment' loss from the paper is actually named 'contrastive_loss' in the codebase?

Thank you.

@ashkamath (Owner) commented

Hi,
It looks like you're confusing the contrastive_align_loss with the contrastive_loss.
In our paper and published results, we do not use the contrastive loss (which is akin to the image-text matching loss used in other vision+language pre-training papers). We left it in the code for completeness, since it is something we tried at some point and thought might be useful for other users of our codebase who want to experiment with it. For the two losses that we do use, see the following:
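
To make that distinction concrete, here is a minimal sketch of what such a batch-level image-text contrastive loss typically looks like, assuming the pooled CLS outputs from the question ('img_pooled_op' and 'text_pooled_op') are (batch, dim) tensors. This is an illustration of the general idea, not the exact implementation in this repository:

    import torch
    import torch.nn.functional as F

    def contrastive_loss_sketch(img_pooled, text_pooled, temperature=0.07):
        # Illustrative batch-level image-text contrastive (InfoNCE) loss.
        # Assumes img_pooled and text_pooled are (batch, dim) pooled CLS features;
        # this is a hypothetical helper, not the repository's implementation.
        img = F.normalize(img_pooled, dim=-1)
        txt = F.normalize(text_pooled, dim=-1)
        logits = img @ txt.t() / temperature  # (batch, batch) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # Image i should match caption i; every other caption in the batch is a negative.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2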

  1. Contrastive align loss, which is calculated between the object embeddings at the output of the decoder and the text embeddings at the output of the cross encoder. Relevant lines in the code (an illustrative sketch follows after this list):

    if contrastive_align_loss:
    if self.contrastive_align_loss:
    def loss_contrastive_align(self, outputs, targets, positive_map, indices, num_boxes):
  2. Contrastive alignment -> loss_contrastive_align, which we just discussed above. Soft token prediction is loss_labels:

    def loss_labels(self, outputs, targets, positive_map, indices, num_boxes):

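To illustrate the distinction between these two losses in code, here is a minimal, non-batched sketch. The shapes, names, temperature, and the symmetric InfoNCE formulation are assumptions made for exposition; the actual batched implementations are in loss_contrastive_align and loss_labels referenced above:

    import torch
    import torch.nn.functional as F

    def contrastive_align_loss_sketch(query_embed, token_embed, positive_map, temperature=0.07):
        # Illustrative token-level alignment loss. Assumed shapes:
        #   query_embed:  (num_queries, dim) projected decoder outputs (object queries)
        #   token_embed:  (num_tokens, dim) projected text tokens from the cross encoder
        #   positive_map: (num_queries, num_tokens) 0/1 mask marking, for each matched
        #                 query, the text tokens it should align with
        q = F.normalize(query_embed, dim=-1)
        t = F.normalize(token_embed, dim=-1)
        logits = q @ t.t() / temperature  # (num_queries, num_tokens)
        pos = positive_map.float()

        # Query -> tokens: pull each query towards its positive tokens,
        # relative to all the other tokens in the sentence.
        log_p_tokens = logits.log_softmax(dim=-1)
        loss_q2t = -(log_p_tokens * pos).sum(-1) / pos.sum(-1).clamp(min=1)

        # Tokens -> queries: the symmetric direction.
        log_p_queries = logits.log_softmax(dim=0)
        loss_t2q = -(log_p_queries * pos).sum(0) / pos.sum(0).clamp(min=1)

        return (loss_q2t.mean() + loss_t2q.mean()) / 2

    def soft_token_prediction_loss_sketch(token_logits, positive_map):
        # Illustrative soft token prediction loss: each query predicts a distribution
        # over text token positions (token_logits: (num_queries, num_token_positions)),
        # trained with a soft cross-entropy whose target is the normalized span of
        # tokens that refers to the matched object.
        target = positive_map.float()
        target = target / target.sum(-1, keepdim=True).clamp(min=1e-6)
        return -(target * token_logits.log_softmax(dim=-1)).sum(-1).mean()
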
Hope this makes it more clear! :)
