Isn't loss only supposed to be calculated on masked tokens? #14

Open · EmaadKhwaja opened this issue Nov 8, 2022 · 6 comments

@EmaadKhwaja

In the training loop we have:

imgs = imgs.to(device=args.device)
logits, target = self.model(imgs)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))  # cross-entropy over every position, masked or not
loss.backward()

However, the output of the transformer is:

  _, z_indices = self.encode_to_z(x)
.
.
.
  a_indices = mask * z_indices + (~mask) * masked_indices  # keep the original token where mask is True, insert the mask token elsewhere

  a_indices = torch.cat((sos_tokens, a_indices), dim=1)

  target = torch.cat((sos_tokens, z_indices), dim=1)  # target is the full, unmasked token sequence (plus the sos token)

  logits = self.transformer(a_indices)

  return logits, target

which means the returned target consists of the original, unmasked image tokens at every position.
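
For comparison, a minimal sketch of what a masked-only loss could look like in the training loop above. This is hypothetical: it assumes the forward pass also returns the boolean mask it sampled (with True marking positions replaced by the mask token) and that logits and target still carry the sos token at position 0:

imgs = imgs.to(device=args.device)
logits, target, mask = self.model(imgs)             # hypothetical: forward additionally returns the sampled bool mask
logits, target = logits[:, 1:], target[:, 1:]       # drop the sos position
loss = F.cross_entropy(logits[mask], target[mask])  # cross-entropy on the masked positions only
loss.backward()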

The MaskGIT paper seems to suggest that the loss was only calculated on the masked tokens:

[screenshot from the MaskGIT paper]
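
For readers without the screenshot: paraphrasing the paper's masked loss from memory (notation may differ slightly), the negative log-likelihood is summed only over the positions where m_i = 1, i.e. the masked ones:

\mathcal{L}_{\text{mask}} = -\,\mathbb{E}_{Y \in \mathcal{D}} \Big[ \sum_{\forall i \in [1,N],\, m_i = 1} \log p\big(y_i \mid Y_{\bar{M}}\big) \Big]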

@darius-lam

I've attempted both strategies for a simple MaskGIT on CIFAR-10, but the generation quality still seems to be poor. There are tricks in their training scheme that the authors are not telling us in the paper.

@xuesongnie

I have the same issue. Why was the loss calculated on all tokens?

@EmaadKhwaja (Author)

EmaadKhwaja commented Sep 3, 2023

@Lamikins I believe the training issues come from an error in the masking formula. I've amended the error: #16.

@xuesongnie

@xuesongnie

@EmaadKhwaja return logits[~mask], target[~mask] seems a bit problematic; we should compute the loss on the masked tokens, i.e. return logits[mask], target[mask].

@EmaadKhwaja (Author)

@xuesongnie it's because the computed mask is applied to the wrong values. The other option would be to do r = math.floor((1 - self.gamma(np.random.uniform())) * z_indices.shape[1]), but I don't like that because it's different from how the formula appears in the paper.
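
To make the polarity question concrete, here is a toy sketch (not the repository's code; the cosine schedule is only an example) of the two conventions being discussed:

import math
import numpy as np
import torch

N = 16                                    # toy number of image tokens
u = np.random.uniform()

def gamma(t):                             # example mask-ratio schedule (cosine)
    return np.cos(t * np.pi / 2)

# Convention A: r counts the tokens to MASK, so True in mask means "masked".
r = math.ceil(gamma(u) * N)
mask = torch.zeros(N, dtype=torch.bool)
mask[torch.rand(N).topk(r).indices] = True
# a masked-only loss would then index logits[mask], target[mask]

# Convention B: r counts the tokens to KEEP, so True in mask means "kept".
r = math.floor((1 - gamma(u)) * N)
mask = torch.zeros(N, dtype=torch.bool)
mask[torch.rand(N).topk(r).indices] = True
# a masked-only loss would then index logits[~mask], target[~mask]

Either convention is self-consistent; problems only arise when the token count and the loss indexing follow different conventions.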

@xuesongnie

> @xuesongnie it's because the computed mask is applied to the wrong values. The other option would be to do r = math.floor((1 - self.gamma(np.random.uniform())) * z_indices.shape[1]), but I don't like that because it's different from how the formula appears in the paper.

Hi, bro. I find that performance is poor after modifying it to return logits[mask], target[mask]. It is weird. I guess the embedding layer also needs to be trained on the corresponding unmasked tokens.
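
One way to probe that guess (a sketch only, not anything from this repository or the paper) would be to keep every position in the loss but down-weight the visible tokens instead of dropping them, so their embeddings still receive some gradient:

import torch
import torch.nn.functional as F

def weighted_masked_loss(logits, target, mask, visible_weight=0.1):
    # Sketch only: masked positions get weight 1.0, visible (unmasked) positions
    # a small weight. Assumes mask is True at masked positions and that
    # logits/target no longer include the sos token.
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                target.reshape(-1), reduction='none')
    weights = mask.reshape(-1).float() * (1.0 - visible_weight) + visible_weight
    return (per_token * weights).sum() / weights.sum()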
