Need to restructure _with_events, the main function in Learner. #4029

Closed
Naitik1502 opened this issue Apr 25, 2024 · 0 comments

Currently, at each stage of fitting, the Learner class calls _with_events, which invokes the appropriate callback events around the wrapped function.

def _with_events(self, f, event_type, ex, final=noop):
    try: self(f'before_{event_type}');  f()
    except ex: self(f'after_cancel_{event_type}')
    self(f'after_{event_type}');  final()

IMO, self(f'after_{event_type}') should not be invoked outside of the try-except block. The function should probably look like:

def _with_events(self, f, event_type, ex, final=noop):
    try: self(f'before_{event_type}');  f(); self(f'after_{event_type}')
    except ex: self(f'after_cancel_{event_type}')
    final()

This not only makes the code symmetric (if before_{event_type} is invoked inside the try block, after_{event_type} should be too), but also resolves certain bugs.

Here is an example of a bug caused by the current structure: I write a custom callback that raises CancelBatchException (e.g. for resumable training, when I want to resume from a particular iteration within an epoch). The exception is caught by this call: self._with_events(self._do_one_batch, 'batch', CancelBatchException)
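
For concreteness, such a callback might look like the sketch below (the class name SkipToIter and its start_iter argument are hypothetical, and the import path may differ between fastai versions):

# Hypothetical sketch of a resume-from-iteration callback; names are illustrative.
from fastai.callback.core import Callback, CancelBatchException  # import path assumed

class SkipToIter(Callback):
    "Skip batches before `start_iter` so training can resume mid-epoch."
    def __init__(self, start_iter=0): self.start_iter = start_iter
    def before_batch(self):
        # `self.iter` is the batch index within the current epoch.
        if self.iter < self.start_iter: raise CancelBatchException()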

In this case, the batch is not executed, so predictions and losses are not computed for that iteration. However, the Recorder callback still runs its loss accumulation in the after_batch stage, which means the loss value from a previous batch is accumulated again for the current iteration. This is logically incorrect for single-GPU training, and outright error-inducing in distributed training (because the losses don't live on the correct device).
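
To see the effect without fastai, here is a standalone sketch (toy names only, not fastai's actual internals) that mirrors the current control flow: the cancelled batch never computes a loss, yet after_batch still fires and a Recorder-style accumulator records the previous batch's loss a second time.

class CancelBatchException(Exception): pass

class ToyLearner:
    def __init__(self):
        self.loss, self.losses, self.skip = None, [], {1}   # cancel batch 1

    def __call__(self, event):
        # Stand-in for running callbacks at each event.
        if event == 'before_batch' and self.iter in self.skip:
            raise CancelBatchException()
        if event == 'after_batch':
            self.losses.append(self.loss)    # Recorder-style loss accumulation

    def _with_events(self, f, event_type, ex, final=lambda: None):
        # Same structure as the current implementation.
        try: self(f'before_{event_type}'); f()
        except ex: self(f'after_cancel_{event_type}')
        self(f'after_{event_type}'); final()

    def _do_one_batch(self):
        self.loss = float(self.iter)         # pretend loss for this batch

    def fit(self, n_batches):
        for self.iter in range(n_batches):
            self._with_events(self._do_one_batch, 'batch', CancelBatchException)

learn = ToyLearner()
learn.fit(3)
print(learn.losses)   # [0.0, 0.0, 2.0] -- cancelled batch 1 records batch 0's stale loss

With the proposed structure, after_batch sits inside the try block, so nothing is recorded for the cancelled batch and the printed list would be [0.0, 2.0].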
