Need to restructure _with_events, the main function in Learner. #4029

Closed
Naitik1502 opened this issue Apr 25, 2024 · 0 comments

Currently, at each stage of fitting, the Learner class calls _with_events, which invokes the appropriate callback events around the wrapped function.

def _with_events(self, f, event_type, ex, final=noop):
    try: self(f'before_{event_type}');  f()
    except ex: self(f'after_cancel_{event_type}')
    self(f'after_{event_type}');  final()

IMO, self(f'after_{event_type}') should not be invoked outside of the try-except block. The function should probably look like:

def _with_events(self, f, event_type, ex, final=noop):
    try: self(f'before_{event_type}');  f(); self(f'after_{event_type}')
    except ex: self(f'after_cancel_{event_type}')
    final()

This not only makes the code symmetric (if before_{event_type} is invoked inside the try block, after_{event_type} should be too), but also resolves certain bugs.

Here is an example of a bug caused by the current structure: I write a custom callback that raises CancelBatchException (e.g. for resumable training, when I want to resume from a particular iteration within an epoch). The exception is caught by this call: self._with_events(self._do_one_batch, 'batch', CancelBatchException)
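
For concreteness, such a callback might look like the sketch below (the class name SkipToIter and its start_iter argument are hypothetical, and the import path may differ between fastai versions):

# Hypothetical sketch of a resume-from-iteration callback; names are illustrative.
from fastai.callback.core import Callback, CancelBatchException  # import path assumed

class SkipToIter(Callback):
    "Skip batches before `start_iter` so training can resume mid-epoch."
    def __init__(self, start_iter=0): self.start_iter = start_iter
    def before_batch(self):
        # `self.iter` is the batch index within the current epoch.
        if self.iter < self.start_iter: raise CancelBatchException()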

In this case, the batch is not executed, so predictions and losses are not computed for that iteration. However, the Recorder callback still runs its loss accumulation in the after_batch stage, which means the loss value from a previous batch is accumulated again for the current iteration. This is logically incorrect for single-GPU training, and outright error-inducing in distributed training (because the losses don't live on the correct device).
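
To see the effect without fastai, here is a standalone sketch (toy names only, not fastai's actual internals) that mirrors the current control flow: the cancelled batch never computes a loss, yet after_batch still fires and a Recorder-style accumulator records the previous batch's loss a second time.

class CancelBatchException(Exception): pass

class ToyLearner:
    def __init__(self):
        self.loss, self.losses, self.skip = None, [], {1}   # cancel batch 1

    def __call__(self, event):
        # Stand-in for running callbacks at each event.
        if event == 'before_batch' and self.iter in self.skip:
            raise CancelBatchException()
        if event == 'after_batch':
            self.losses.append(self.loss)    # Recorder-style loss accumulation

    def _with_events(self, f, event_type, ex, final=lambda: None):
        # Same structure as the current implementation.
        try: self(f'before_{event_type}'); f()
        except ex: self(f'after_cancel_{event_type}')
        self(f'after_{event_type}'); final()

    def _do_one_batch(self):
        self.loss = float(self.iter)         # pretend loss for this batch

    def fit(self, n_batches):
        for self.iter in range(n_batches):
            self._with_events(self._do_one_batch, 'batch', CancelBatchException)

learn = ToyLearner()
learn.fit(3)
print(learn.losses)   # [0.0, 0.0, 2.0] -- cancelled batch 1 records batch 0's stale loss

With the proposed structure, after_batch sits inside the try block, so nothing is recorded for the cancelled batch and the printed list would be [0.0, 2.0].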
