Hi all,

Consider the following (placed in `scripts/nmt/`):

```python
import mxnet as mx

import loss

if __name__ == '__main__':
    BATCH_SIZE = 3
    MAX_SEQ_LEN = 100
    VOCAB_SIZE = 26

    # Each time step has a uniform distribution over the vocabulary
    uniform_tensor = 1. / VOCAB_SIZE * mx.nd.ones(shape=(BATCH_SIZE, MAX_SEQ_LEN, VOCAB_SIZE))

    # Sequences have different lengths
    valid_lens = mx.nd.array((1, 10, 100))

    loss = loss.SoftmaxCEMaskedLoss(sparse_label=False, from_logits=False)
    ce_loss = loss(uniform_tensor, uniform_tensor, valid_lens)
    print(ce_loss)
```

This outputs a masked CE loss for each of the three sequences. However, these should all be 3.2580965 (which is ln(26)), as this is the (average) CE across all valid timesteps, per sequence.
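As a quick sanity check on that expected value (the cross-entropy of a uniform distribution over 26 classes against itself):

```python
import math

# CE of a uniform distribution over VOCAB_SIZE classes against itself = ln(VOCAB_SIZE)
VOCAB_SIZE = 26
ce = -sum((1. / VOCAB_SIZE) * math.log(1. / VOCAB_SIZE) for _ in range(VOCAB_SIZE))
print(ce)  # ~3.2580965
```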
The problem is that the averaging step of the CE is not aware of `valid_length`, which leads to the "masking zeros" being included in the mean. Compare with, e.g., Sockeye's implementation.

One workaround is to increase the sample weights so their weighted sum compensates for the zeros, for example by adding the following line before calling `super()` in `SoftmaxCEMaskedLoss`:
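The specific one-liner isn't reproduced above; purely as an illustration of the reweighting idea (standalone NDArrays with hypothetical shapes, not the actual `hybrid_forward` arguments):

```python
import mxnet as mx

# Illustration only, not the proposed patch: scale each sequence's 0/1 mask so
# its weights sum to the padded length, compensating for the zeros that the
# downstream mean averages over.
max_len = 100
valid_length = mx.nd.array((1, 10, 100))
mask = mx.nd.SequenceMask(mx.nd.ones((3, max_len)),
                          sequence_length=valid_length,
                          use_sequence_length=True,
                          axis=1)
sample_weight = mx.nd.broadcast_div(mask * max_len, valid_length.reshape((-1, 1)))
print(sample_weight.sum(axis=1))  # each row sums to max_len = 100
```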
I have not evaluated how this affects current MT models. This causes short sequences to have smaller per-token loss, weighting long sequences more.
However, perhaps we should edit `SoftmaxCEMaskedLoss` for correctness (or someone can point out if I'm mistaken); I can submit a PR. The CE loss of a sequence should not depend on the batch it is in.
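For instance (my own back-of-the-envelope numbers, not from the report above), with the current averaging a 10-token sequence whose per-token CE is ln(26) gets a different loss depending on the padded length of whatever batch it lands in:

```python
import math

# The same 10-token sequence, averaged over three different padded lengths
per_token_ce = math.log(26)
for padded_len in (10, 50, 100):
    print(padded_len, 10 * per_token_ce / padded_len)
# -> ~3.258, ~0.652, ~0.326: one sequence, three different losses
```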
@JulianSlzr, in the current nmt scripts, the token loss is averaged over the total number of tokens in the mini-batch. You can picture it as concatenating all the sequences in the mini-batch into a single sequence and taking the average. You can find it in the following code: `loss = loss * (tgt_seq.shape[1] - 1) / (tgt_valid_length - 1).mean()`.
This ensures that all tokens are equally weighted. In fact, we are minimizing the following objective: -\sum_{i=1}^{N}\sum_{t=1}^{T_i} \log P(y_t^i \mid y_1^i, \ldots, y_{t-1}^i, X^i, \Theta)
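As a rough illustration of why that rescaling yields a per-token average (reusing the uniform-distribution setup from the issue and ignoring the off-by-one from target shifting in the actual script):

```python
import math
import mxnet as mx

MAX_SEQ_LEN, VOCAB_SIZE = 100, 26
valid_lens = mx.nd.array((1, 10, 100))

# Per-sequence masked CE as currently computed: each valid step contributes
# ln(VOCAB_SIZE), and the sum is divided by the full padded length.
per_seq_loss = valid_lens * math.log(VOCAB_SIZE) / MAX_SEQ_LEN

# The script's rescaling: batch mean times max_len / mean(valid_len),
# which equals total CE divided by the total number of valid tokens.
rescaled = per_seq_loss.mean() * MAX_SEQ_LEN / valid_lens.mean()
print(rescaled.asscalar())  # ~3.2580965 == ln(26), the per-token average
```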
But sure, what you propose can be useful in some scenarios. A PR is always welcome.