
Bug in pseudo-labelling code? #3

Open
linzhiqiu opened this issue Mar 1, 2022 · 4 comments

Comments

@linzhiqiu

I am confused by the implementation of pseudo-labelling in this library (lib/algs/pseudo_label.py). In particular, forward() contains:

y_probs = y.softmax(1)
onehot_label = self.__make_one_hot(y_probs.max(1)[1]).float()
gt_mask = (y_probs > self.th).float()
gt_mask = gt_mask.max(1)[0] # reduce_any
lt_mask = 1 - gt_mask # logical not
p_target = gt_mask[:,None] * 10 * onehot_label + lt_mask[:,None] * y_probs

output = model(x)
loss = (-(p_target.detach() * F.log_softmax(output, 1)).sum(1)*mask).mean()
return loss

I am confused about why gt_mask is multiplied by 10 when computing p_target. What is the meaning of the 10 here?

Also, I believe lt_mask marks the examples whose max probability is below the threshold, which should therefore be ignored when computing the loss. However, p_target still includes the + lt_mask[:,None] * y_probs term.

This seems different from what is described in the paper. If you are implementing a variant of the pseudo-labelling loss function, could you point me to that paper?

@linzhiqiu (Author)

I am also confused by the coef in training_hierarchy.py:

coef = args.consis_coef * math.exp(-5 * (1 - min(iteration/args.warmup, 1))**2)

This coefficient does not appear in the original paper.
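
For reference, here is a self-contained sketch of how this expression ramps up; the consis_coef and warmup values below are only illustrative, not the repo's defaults:

import math

def consistency_coef(iteration, warmup, consis_coef=1.0):
    # Same ramp as above: starts near consis_coef * exp(-5) at iteration 0
    # and rises smoothly to consis_coef once iteration >= warmup.
    return consis_coef * math.exp(-5 * (1 - min(iteration / warmup, 1)) ** 2)

for it in [0, 50_000, 100_000, 200_000, 400_000]:
    print(it, round(consistency_coef(it, warmup=200_000), 4))
    # prints approximately 0.0067, 0.0601, 0.2865, 1.0, 1.0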

@linzhiqiu (Author)

One more question: for self-training, it seems that both labeled and unlabeled data are used for the KL divergence between the teacher and the student? The original paper says only the unlabeled data is used to compute the KLD.
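
To make the distinction concrete, here is a minimal sketch with hypothetical tensors (not the actual training loop from this repo), contrasting a KLD computed on the unlabeled batch only, as the paper describes, with one computed on both batches:

import torch
import torch.nn.functional as F

def teacher_student_kld(student_logits, teacher_logits):
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(F.log_softmax(student_logits, 1),
                    F.softmax(teacher_logits, 1),
                    reduction='batchmean')

# Hypothetical logits for a labeled batch of 8 and an unlabeled batch of 32.
lab_s, lab_t = torch.randn(8, 10), torch.randn(8, 10)
unlab_s, unlab_t = torch.randn(32, 10), torch.randn(32, 10)

kld_unlabeled_only = teacher_student_kld(unlab_s, unlab_t.detach())  # paper's description
kld_both = teacher_student_kld(torch.cat([lab_s, unlab_s]),          # what the code seems to do
                               torch.cat([lab_t, unlab_t]).detach())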

@jongchyisu (Collaborator)

Hello, for the first question: the pseudo-label code is from this PyTorch repo, which is a re-implementation of this official TensorFlow implementation from Google. From their comment: "Multiplying the one-hot pseudo_labels by 10 makes them look like logits."
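
As a quick standalone illustration (not code from this repo): if you interpret the scaled one-hot vector as a logit vector, taking a softmax recovers an almost-hard distribution, which is what that comment is getting at.

import torch
import torch.nn.functional as F

num_classes = 10
onehot = F.one_hot(torch.tensor([3]), num_classes).float()

# Treating 10 * onehot as logits: softmax puts ~0.9996 on the chosen class,
# so the result is nearly indistinguishable from the hard one-hot label.
print(F.softmax(10 * onehot, dim=1))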

As for the lt_mask, all of the papers from Google (Oliver et al., FixMatch, etc.) use the same codebase, but they did not specify the loss functions. You are right that the lt_mask contribution is an extra term for pseudo-labeling that I should include in the paper. Since output_teacher and output_student are the same when not using pseudo-labels, the extra term becomes the entropy of the predictions.
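
Here is a small sketch of that last point (standalone code, assuming the student and teacher logits coincide as stated above): the low-confidence branch of the loss then reduces to the Shannon entropy of the predictions.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)   # pretend teacher logits == student logits
y_probs = logits.softmax(1)

# Low-confidence branch of the loss: -(y_probs * log_softmax(output)).sum(1)
extra_term = -(y_probs * F.log_softmax(logits, 1)).sum(1)
entropy = -(y_probs * y_probs.log()).sum(1)

print(torch.allclose(extra_term, entropy))  # True: the term equals the prediction entropy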

As for the coef, this is for warmup scheduling following Oliver et al.

About self-training: thanks for pointing this out. It is a typo in the paper; I did indeed use both labeled and unlabeled data for self-training.

@linzhiqiu (Author)

Thanks for the helpful response! Could you point me to any paper that uses this specific variant of the pseudo-labelling loss?
