Currently, at each stage of fitting, the `Learner` class invokes `_with_events` with the appropriate callback events. IMO, `self(f'after_{event_type}')` should not be invoked outside of the try-except block. The function should probably look like this:
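A sketch that keeps the current signature (including the `noop` default for `final`) and only moves the `after_{event_type}` call inside the try block:

```python
def _with_events(self, f, event_type, ex, final=noop):
    try:
        self(f'before_{event_type}')
        f()
        # Now skipped when `ex` is raised, symmetric with `before_{event_type}`
        self(f'after_{event_type}')
    except ex:
        self(f'after_cancel_{event_type}')
    final()
```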
This not only makes sense in terms of the symmetry of the code (if `before_{event_type}` is invoked in the try block, so should `after_{event_type}` be), but it also resolves certain bugs.

An example of a bug caused by the current structure:
When I write a custom callback that raises `CancelBatchException` (e.g. for resumable training, when I want to resume from a particular iteration within an epoch), the exception is caught in this line: `self._with_events(self._do_one_batch, 'batch', CancelBatchException)`.
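For reference, such a callback might look like the following minimal sketch (the `SkipToIter` name and `target_iter` parameter are made up for illustration, not fastai API):

```python
from fastai.callback.core import Callback, CancelBatchException

class SkipToIter(Callback):
    "Skip batches until `target_iter` is reached, e.g. when resuming mid-epoch."
    def __init__(self, target_iter): self.target_iter = target_iter
    def before_batch(self):
        # `self.iter` delegates to `learn.iter`, the batch index within the epoch
        if self.iter < self.target_iter: raise CancelBatchException
```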
In this case, a batch is not executed, hence predictions/losses are not calculated for that particular iteration.
However, the `Recorder` callback still executes loss accumulation in the `after_batch` stage, which means the previous batch's loss values are accumulated for the current iteration. This is logically incorrect for single-GPU training, and outright error-inducing in distributed training (because the losses don't belong to the correct device).
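To make the stale-value problem concrete, here is a minimal framework-free illustration of the current control flow (all names are made up; the real code lives in `Learner._with_events` and `Recorder.after_batch`):

```python
class CancelBatchException(Exception): pass

loss = 0.5                      # value left over from the *previous* batch

def do_one_batch():
    raise CancelBatchException  # batch cancelled: `loss` is never updated

recorded = []
def after_batch(): recorded.append(loss)  # Recorder-style accumulation

# Current structure: after_batch runs even when the batch was cancelled
try:
    do_one_batch()
except CancelBatchException:
    pass
after_batch()

print(recorded)                 # [0.5] -- the stale loss was accumulated anyway
```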