
How to use in-batch negative and gold when training? #110

Closed
hongyuntw opened this issue Mar 5, 2021 · 11 comments

Comments

@hongyuntw

As the title says: the paper uses in-batch negatives and gold passages during training,
but how can I set this up?

Can somebody help me? Thanks a lot!

@vlad-karpukhin
Contributor

Hi @hongyuntw ,
Can you elaborate on your question, please?

@hongyuntw
Author

@vlad-karpukhin
[screenshot of the paper's description of in-batch negative training]
The paper mentions using in-batch negatives during training.
I wonder whether the code already uses this trick or not, because I can't find the relevant code in this repo.
Thanks a lot!

@vlad-karpukhin
Contributor

Yes, this code uses the in-batch negative training trick.

@hongyuntw
Author

@vlad-karpukhin
Thanks!
Could you tell me which code file I should modify? I want to compare performance with and without in-batch negatives.
Thank you

@vlad-karpukhin
Contributor

The training pipeline code, including the distributed loss calculation, is in train_dense_encoder.py.
The model code is in dpr/models/biencoder.py.

@robinsongh381

robinsongh381 commented Mar 16, 2021

Hi @vlad-karpukhin and @hongyuntw

From my understanding, in-batch negative sampling and the corresponding loss are implemented as follows:

  • Let's assume that batch_size=4 and hard_negatives=1

  • This means that for every iteration we have 4 questions, with 1 positive context and 1 hard negative context for each question, i.e. 8 contexts in total.

  • Then, local_q_vector and local_ctx_vectors from model_out are of shape [4, dim] and [8, dim], respectively, where dim=768 (here).

  • This indicates that the positive context embedding and hard negative embedding for the first question (i.e. local_q_vector[0]) are local_ctx_vectors[0] and local_ctx_vectors[1], respectively.

  • Likewise, the positive context embedding and hard negative embedding for the second question (i.e. local_q_vector[1]) are local_ctx_vectors[2] and local_ctx_vectors[3], respectively.

  • The same relation applies to the third and fourth questions. Please see the illustrated picture below.
    [illustration: each question paired with its positive and hard-negative context columns in the batch]

  • Now the loss is computed here. Please note that in this case:

    • input.is_positive=[0,2,4,6]
    • input.hard_negatives=[1,3,5,7]
  • A dot product is performed (here) between local_q_vector and local_ctx_vectors, resulting in scores of shape [4,8], after which F.log_softmax is applied (here).

# similarity between every question and every context: shape [4, 8]
scores = self.get_scores(q_vectors, ctx_vectors)

if len(q_vectors.size()) > 1:
    q_num = q_vectors.size(0)
    scores = scores.view(q_num, -1)

# log-softmax over all 8 contexts for each question
softmax_scores = F.log_softmax(scores, dim=1)
  • Finally, F.nll_loss is computed:
# negative log-likelihood of the positive context for each question
loss = F.nll_loss(
    softmax_scores,
    torch.tensor(positive_idx_per_question).to(softmax_scores.device),
    reduction="mean",
)
  • Recall that softmax_scores is of shape [4,8] and positive_idx_per_question=[0,2,4,6].
  • I believe that, from the perspective of the first question, this pushes the model to make softmax_scores[0][0] (its positive) big while making softmax_scores[0][1:] (its hard negative plus the in-batch negatives) small.
  • Similarly, from the perspective of the second question, this pushes the model to make softmax_scores[1][2] big while making softmax_scores[1][0:2] and softmax_scores[1][3:] small.
  • Likewise for the third and fourth questions.
  • Hence in-batch negative sampling. (A small self-contained toy example of this loss is included just below.)

Please DO correct me if I'm wrong
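
To sanity-check the shapes and indices above, here is a small self-contained toy example (not DPR code; just random vectors following the batch_size=4, one-hard-negative layout described above) that reproduces the [4, 8] score matrix and the NLL loss with positive_idx_per_question = [0, 2, 4, 6]:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 768

# 4 questions and 8 contexts: [pos_0, hard_neg_0, pos_1, hard_neg_1, ...]
q_vectors = torch.randn(4, dim)
ctx_vectors = torch.randn(8, dim)

# dot-product similarity, shape [4, 8]
scores = torch.matmul(q_vectors, ctx_vectors.transpose(0, 1))

# log-softmax over all 8 contexts for every question
softmax_scores = F.log_softmax(scores, dim=1)

# the positive context of question i sits at column 2 * i
positive_idx_per_question = [0, 2, 4, 6]
loss = F.nll_loss(
    softmax_scores,
    torch.tensor(positive_idx_per_question),
    reduction="mean",
)
# minimizing this loss pushes up scores[i, 2 * i] and pushes down the other 7 columns
print(loss)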

  • Meanwhile, I have a further question.
  • The above implementation pushes a given question further apart from both its hard negative and the in-batch negatives.
  • What should I do if I want to push a given question further apart from ONLY its hard negative? (Of course, while still pulling its positive closer.)
  • How about this approach?
scores = self.get_scores(q_vectors, ctx_vectors)

if len(q_vectors.size()) > 1:
    q_num = q_vectors.size(0)
    scores = scores.view(q_num, -1)

softmax_scores = F.log_softmax(scores, dim=1)
# keep only the positive and hard-negative columns for each question
softmax_scores_sliced = self.slice(softmax_scores)

# after slicing, the positive sits in column 0 for every question
loss = F.nll_loss(
    softmax_scores_sliced,
    torch.tensor([0,0,0,0]).to(softmax_scores_sliced.device),
    reduction="mean",
)
  • where the slice function would do something like the sketch right below
    [illustration: selecting the [positive, hard negative] column pair for each question]
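
Not part of DPR, but here is a minimal sketch (under the same batch_size=4, one-hard-negative-per-question layout as above; the function name is made up) of what such a slice helper could look like: for question i it keeps columns [2*i, 2*i+1], so the positive ends up in column 0 and the hard negative in column 1, which is why torch.tensor([0,0,0,0]) works as the nll_loss target. One caveat: if the slicing is applied after the full-row log_softmax, the in-batch negatives still take part in the normalization; applying it to the raw scores before log_softmax would restrict the softmax to only the positive and the hard negative.

import torch

def slice_positive_and_hard_negative(matrix: torch.Tensor) -> torch.Tensor:
    # matrix: [n_questions, 2 * n_questions], laid out as
    # [pos_0, hard_neg_0, pos_1, hard_neg_1, ...] along dim=1
    n_questions = matrix.size(0)
    cols = torch.tensor(
        [[2 * i, 2 * i + 1] for i in range(n_questions)],
        device=matrix.device,
    )
    # result: [n_questions, 2], column 0 = positive, column 1 = hard negative
    return torch.gather(matrix, dim=1, index=cols)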

@hongyuntw
Author

@robinsongh381
Thank you so much!
After your clear explanation, I totally understand how in-batch negatives work.

@vlad-karpukhin
Contributor

vlad-karpukhin commented Mar 17, 2021

Hi @robinsongh381 ,

Your understanding is almost correct.
The blue lines on the graph you provided should be labeled hard negatives, not just 'negatives'. Every question has 6 regular negatives in that scheme.
Your idea with slicing seems legit, but please remove the representation-vector sharing between nodes.

@robinsongh381

Hi @vlad-karpukhin
Thank you for your reply!

You are absolutely right, the blue lines should have been named "hard" negatives.

I am not quite sure

  1. what you mean by "please remove the representation-vector sharing between nodes"
  2. and why that step is required?

Regards

@vlad-karpukhin
Contributor

  1. I mean the call to gather all representations https://github.com/facebookresearch/DPR/blob/master/train_dense_encoder.py#L668 - it is just redundant for your needs.

  2. It is required in the in-batch negatives scheme; otherwise each node will calculate the loss based only on its own 'small' slice of the entire 'global' batch.
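
For readers following along, here is a rough sketch (not the actual DPR implementation; the function name and shapes are illustrative) of what that sharing does: each GPU gathers the question and context vectors from all the other GPUs, keeps its own local tensors (which carry gradients) in their slot, and then computes the loss over the resulting global batch, so the contexts from every GPU act as extra in-batch negatives for every question.

import torch
import torch.distributed as dist

def build_global_batch(local_q_vectors, local_ctx_vectors):
    # assumes torch.distributed has already been initialized
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # all_gather does not propagate gradients, so the gathered copies from
    # other GPUs effectively act as extra (detached) in-batch negatives
    q_list = [torch.zeros_like(local_q_vectors) for _ in range(world_size)]
    ctx_list = [torch.zeros_like(local_ctx_vectors) for _ in range(world_size)]
    dist.all_gather(q_list, local_q_vectors)
    dist.all_gather(ctx_list, local_ctx_vectors)

    # put the local, gradient-carrying tensors back into their own slot so
    # this GPU's questions and contexts still receive gradients via the loss
    q_list[rank] = local_q_vectors
    ctx_list[rank] = local_ctx_vectors

    global_q = torch.cat(q_list, dim=0)      # e.g. [world_size * 4, dim]
    global_ctx = torch.cat(ctx_list, dim=0)  # e.g. [world_size * 8, dim]
    return global_q, global_ctx

Gathering the questions as well means every node scores the same global [world_size * 4, world_size * 8] matrix and therefore computes the loss over the whole global batch rather than only over its local slice, which is the point made in item 2 above.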

@Hannibal046

Hi @vlad-karpukhin, thank you for providing the code! And @robinsongh381, I appreciate the clear illustration.

I have a question regarding in-batch negatives across devices, as demonstrated here:

if distributed_world_size > 1:

From my understanding, the gather operation aims to supply more negative samples for a single question (please correct me if I'm mistaken). With that said, could you please explain why it's necessary to gather questions across GPUs as well?
