
How to use in-batch negative and gold when training? #110

Closed
hongyuntw opened this issue Mar 5, 2021 · 11 comments

Comments

@hongyuntw

As the title says: the paper uses in-batch negatives and gold passages during training,
but how can I set this up?

Can somebody help me? Thanks a lot!

@vlad-karpukhin
Contributor

Hi @hongyuntw ,
Can you elaborate on your question, please?

@hongyuntw
Author

@vlad-karpukhin
[screenshot of the paper's description of in-batch negative training]
The paper mentions using in-batch negatives during training.
I wonder whether the code already uses this trick or not, because I can't find the relevant code in this repo.
Thanks a lot!

@vlad-karpukhin
Contributor

Yes, this code uses the in-batch negative training trick.

@hongyuntw
Author

@vlad-karpukhin
Thanks!
Could you tell me which code file I should modify? I want to compare performance with and without in-batch negatives.
Thank you

@vlad-karpukhin
Contributor

The training pipeline code, including the distributed loss calculation, is in train_dense_encoder.py.
The model code is in dpr/models/biencoder.py.

@robinsongh381

robinsongh381 commented Mar 16, 2021

Hi @vlad-karpukhin and @hongyuntw

From my understanding, in-batch negative sampling and the corresponding loss are implemented as follows:

  • Let's assume that batch_size=4 and hard_negatives=1

  • This means that for every iteration we have 4 questions, with 1 positive context and 1 hard negative context for each question, i.e. 8 contexts in total.

  • Then, local_q_vector and local_ctx_vectors from model_out are of shape [4, dim] and [8, dim], respectively, where dim=768 (here).

  • This indicates that the positive context embedding and hard negative embedding for the first question (i.e. local_q_vector[0]) are local_ctx_vectors[0] and local_ctx_vectors[1], respectively.

  • Likewise, the positive context embedding and hard negative embedding for the second question (i.e. local_q_vector[1]) are local_ctx_vectors[2] and local_ctx_vectors[3], respectively.

  • The same relation applies to the third and fourth questions. Please see the illustrated picture below.
    [illustration: each question paired with its positive and hard-negative context columns in the batch]

  • Now the loss is computed here. Please note that in this case:

    • input.is_positive=[0,2,4,6]
    • input.hard_negatives=[1,3,5,7]
  • A dot product is performed (here) between local_q_vector and local_ctx_vectors, resulting in scores of shape [4,8], after which F.log_softmax is applied (here).

# similarity between every question and every context: shape [4, 8]
scores = self.get_scores(q_vectors, ctx_vectors)

if len(q_vectors.size()) > 1:
    q_num = q_vectors.size(0)
    scores = scores.view(q_num, -1)

# log-softmax over all 8 contexts for each question
softmax_scores = F.log_softmax(scores, dim=1)
  • Finally, F.nll_loss is computed:
# negative log-likelihood of the positive context for each question
loss = F.nll_loss(
    softmax_scores,
    torch.tensor(positive_idx_per_question).to(softmax_scores.device),
    reduction="mean",
)
  • Recall that softmax_scores is of shape [4,8] and positive_idx_per_question=[0,2,4,6].
  • I believe that, from the perspective of the first question, this pushes the model to make softmax_scores[0][0] (its positive) big while making softmax_scores[0][1:] (its hard negative plus the in-batch negatives) small.
  • Similarly, from the perspective of the second question, this pushes the model to make softmax_scores[1][2] big while making softmax_scores[1][0:2] and softmax_scores[1][3:] small.
  • Likewise for the third and fourth questions.
  • Hence in-batch negative sampling. (A small self-contained toy example of this loss is included just below.)

Please DO correct me if I'm wrong
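
To sanity-check the shapes and indices above, here is a small self-contained toy example (not DPR code; just random vectors following the batch_size=4, one-hard-negative layout described above) that reproduces the [4, 8] score matrix and the NLL loss with positive_idx_per_question = [0, 2, 4, 6]:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 768

# 4 questions and 8 contexts: [pos_0, hard_neg_0, pos_1, hard_neg_1, ...]
q_vectors = torch.randn(4, dim)
ctx_vectors = torch.randn(8, dim)

# dot-product similarity, shape [4, 8]
scores = torch.matmul(q_vectors, ctx_vectors.transpose(0, 1))

# log-softmax over all 8 contexts for every question
softmax_scores = F.log_softmax(scores, dim=1)

# the positive context of question i sits at column 2 * i
positive_idx_per_question = [0, 2, 4, 6]
loss = F.nll_loss(
    softmax_scores,
    torch.tensor(positive_idx_per_question),
    reduction="mean",
)
# minimizing this loss pushes up scores[i, 2 * i] and pushes down the other 7 columns
print(loss)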

  • Meanwhile, I have a further question.
  • The above implementation pushes a given question further apart from both its hard negative and the in-batch negatives.
  • What should I do if I want to push a given question further apart from ONLY its hard negative? (Of course, while still pulling its positive closer.)
  • How about this approach?
scores = self.get_scores(q_vectors, ctx_vectors)

if len(q_vectors.size()) > 1:
    q_num = q_vectors.size(0)
    scores = scores.view(q_num, -1)

softmax_scores = F.log_softmax(scores, dim=1)
# keep only the positive and hard-negative columns for each question
softmax_scores_sliced = self.slice(softmax_scores)

# after slicing, the positive sits in column 0 for every question
loss = F.nll_loss(
    softmax_scores_sliced,
    torch.tensor([0,0,0,0]).to(softmax_scores_sliced.device),
    reduction="mean",
)
  • where the slice function would do something like the sketch right below
    [illustration: selecting the [positive, hard negative] column pair for each question]
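
Not part of DPR, but here is a minimal sketch (under the same batch_size=4, one-hard-negative-per-question layout as above; the function name is made up) of what such a slice helper could look like: for question i it keeps columns [2*i, 2*i+1], so the positive ends up in column 0 and the hard negative in column 1, which is why torch.tensor([0,0,0,0]) works as the nll_loss target. One caveat: if the slicing is applied after the full-row log_softmax, the in-batch negatives still take part in the normalization; applying it to the raw scores before log_softmax would restrict the softmax to only the positive and the hard negative.

import torch

def slice_positive_and_hard_negative(matrix: torch.Tensor) -> torch.Tensor:
    # matrix: [n_questions, 2 * n_questions], laid out as
    # [pos_0, hard_neg_0, pos_1, hard_neg_1, ...] along dim=1
    n_questions = matrix.size(0)
    cols = torch.tensor(
        [[2 * i, 2 * i + 1] for i in range(n_questions)],
        device=matrix.device,
    )
    # result: [n_questions, 2], column 0 = positive, column 1 = hard negative
    return torch.gather(matrix, dim=1, index=cols)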

@hongyuntw
Author

@robinsongh381
Thank you so much!
After your clear explanation, I totally understand how in-batch negatives work.

@vlad-karpukhin
Contributor

vlad-karpukhin commented Mar 17, 2021

Hi @robinsongh381 ,

Your understanding is almost correct.
The blue lines on the graph you provided should be labeled hard negatives, not just 'negatives'. Every question has 6 regular negatives in that scheme.
Your idea with slicing seems legit, but please remove the representation-vector sharing between nodes.

@robinsongh381

Hi @vlad-karpukhin
Thank you for your reply!

You are absolutely right, the blue lines should have been named "hard" negatives.

I am not quite sure

  1. what you mean by "please remove the representation-vector sharing between nodes"
  2. and why that step is required?

Regards

@vlad-karpukhin
Contributor

  1. I mean the call to gather all representations https://github.com/facebookresearch/DPR/blob/master/train_dense_encoder.py#L668 - it is just redundant for your needs.

  2. It is required in the in-batch negatives scheme; otherwise each node will calculate the loss based only on its own 'small' slice of the entire 'global' batch.
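
For readers following along, here is a rough sketch (not the actual DPR implementation; the function name and shapes are illustrative) of what that sharing does: each GPU gathers the question and context vectors from all the other GPUs, keeps its own local tensors (which carry gradients) in their slot, and then computes the loss over the resulting global batch, so the contexts from every GPU act as extra in-batch negatives for every question.

import torch
import torch.distributed as dist

def build_global_batch(local_q_vectors, local_ctx_vectors):
    # assumes torch.distributed has already been initialized
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # all_gather does not propagate gradients, so the gathered copies from
    # other GPUs effectively act as extra (detached) in-batch negatives
    q_list = [torch.zeros_like(local_q_vectors) for _ in range(world_size)]
    ctx_list = [torch.zeros_like(local_ctx_vectors) for _ in range(world_size)]
    dist.all_gather(q_list, local_q_vectors)
    dist.all_gather(ctx_list, local_ctx_vectors)

    # put the local, gradient-carrying tensors back into their own slot so
    # this GPU's questions and contexts still receive gradients via the loss
    q_list[rank] = local_q_vectors
    ctx_list[rank] = local_ctx_vectors

    global_q = torch.cat(q_list, dim=0)      # e.g. [world_size * 4, dim]
    global_ctx = torch.cat(ctx_list, dim=0)  # e.g. [world_size * 8, dim]
    return global_q, global_ctx

Gathering the questions as well means every node scores the same global [world_size * 4, world_size * 8] matrix and therefore computes the loss over the whole global batch rather than only over its local slice, which is the point made in item 2 above.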

@Hannibal046

Hi @vlad-karpukhin, thank you for providing the code! And @robinsongh381, I appreciate the clear illustration.

I have a question regarding in-batch negatives across devices, as demonstrated here:

if distributed_world_size > 1:

From my understanding, the gather operation aims to supply more negative samples for a single question (please correct me if I'm mistaken). With that said, could you please explain why it's necessary to gather questions across GPUs as well?
