
Add Callback Support #339

Merged · 41 commits · Aug 16, 2022
Conversation

@mattdeitke (Member) commented Mar 12, 2022

Background

Adds initial support for Callbacks, inspired by PyTorch Lightning.

The immediate use case is to enable logging during training with Weights and Biases.

Motivation

The motivation is to make it easier to log, debug, and inspect the training setup without having to manually modify runner.py.

Down the line, I suspect callbacks will also be the best place to write tests, with the tests living in callback functions such as on_checkpoint_load(model).

Example

An example usage might be to define a Callback class in the file training/callbacks/wandb_logging.py:

from typing import Any, Dict, Optional

import wandb
from allenact.base_abstractions.callbacks import Callback


class WandbLogging(Callback):
    def setup(self, name: str, **kwargs) -> None:
        # Start the wandb run, using the experiment's name and passing the
        # remaining keyword arguments through as the run config.
        wandb.init(
            project="test-project",
            entity="prior-ai2",
            name=name,
            config=kwargs,
        )

    def on_train_log(self, metric_means: Dict[str, float], step: int, **kwargs) -> None:
        # Log the mean training metrics at the current step.
        wandb.log({**metric_means, "step": step})

    def on_valid_log(
        self,
        metrics: Optional[Dict[str, Any]],
        metric_means: Dict[str, float],
        step: int,
        **kwargs
    ) -> None:
        # Log the mean validation metrics at the current step.
        wandb.log({**metric_means, "step": step})

    def on_test_log(
        self,
        checkpoint: str,
        metrics: Dict[str, Any],
        metric_means: Dict[str, float],
        step: int,
        **kwargs
    ) -> None:
        # Log the mean test metrics at the current step.
        wandb.log({**metric_means, "step": step})

and to use it, one would pass the file to the --callbacks flag of the allenact command:

allenact <...> --callbacks training/callbacks/wandb_logging.py

Note that this doesn't require modifying the experiment configs at all, and hence is fully opt-in functionality.

Notes

I'm still thinking about what callbacks would be best, and what should be passed into each of them.

Right now, the best approach I have for logging videos, images, or other more complex information is to save it to disk and then process, log, and delete it inside on_train_log(), but perhaps there's a cleaner solution.
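
For concreteness, a rough sketch of that disk-based approach might look like the following. The Callback base class and the on_train_log signature are taken from the example above; the video directory, file handling, and the assumption that some other component writes .mp4 files during rollouts are purely illustrative, and wandb.Video is the standard wandb helper for logging video files.

import os
from typing import Dict

import wandb

from allenact.base_abstractions.callbacks import Callback


class WandbVideoLogging(Callback):
    # Hypothetical directory that some other component fills with .mp4 files during rollouts.
    video_dir = "saved_videos"

    def on_train_log(self, metric_means: Dict[str, float], step: int, **kwargs) -> None:
        wandb.log({**metric_means, "step": step})

        # Upload and then delete any videos written to disk since the last call.
        if os.path.isdir(self.video_dir):
            for fname in sorted(os.listdir(self.video_dir)):
                if fname.endswith(".mp4"):
                    path = os.path.join(self.video_dir, fname)
                    wandb.log({"rollout_video": wandb.Video(path), "step": step})
                    os.remove(path)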

@mattdeitke mattdeitke marked this pull request as draft March 12, 2022 21:46
@lgtm-com (bot) commented Mar 13, 2022

This pull request introduces 2 alerts when merging bc49d47 into cc0d123 - view on LGTM.com

new alerts:

  • 2 for Variable defined multiple times
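
For reference, that alert typically flags dead assignments along these lines (a purely illustrative snippet, not code from this PR):

def example(config=None):
    # The first assignment is never read before being overwritten, which is
    # the pattern LGTM's "Variable defined multiple times" check reports.
    value = 0  # dead assignment
    value = config["value"] if config else 1
    return value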

@mattdeitke mattdeitke marked this pull request as ready for review March 13, 2022 01:41
@jordis-ai2 (Collaborator) left a comment

I like the idea a lot. Just wondering if we should actually include it in the experiment config but, other than that and some picky details, LGTM.


import torch
import torch.distributed as dist # type: ignore
import torch.distributions # type: ignore
import torch.multiprocessing as mp # type: ignore
import torch.nn as nn
import torch.optim as optim
from allenact.algorithms.onpolicy_sync.misc import TrackingInfo, TrackingInfoType
Collaborator:

Throughout the codebase we try to keep third-party imports before allenact ones, so I would move this a few lines below.
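
For example, the grouping being asked for here is roughly the following (a sketch of the ordering only, reusing imports from the diff above):

# Third-party packages first...
import torch
import torch.distributed as dist
import torch.nn as nn

# ...then allenact imports afterwards.
from allenact.algorithms.onpolicy_sync.misc import TrackingInfo, TrackingInfoType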

Collaborator:

If you're using PyCharm, I think this would be handled by running Code -> Optimize Imports.

Member Author (mattdeitke):

Ah, I will revert these import updates.

I am using isort, which works well with Black and is pretty popular for sorting and organizing imports. It autoformats on save in VS Code, so it ended up changing the import order automatically.

@@ -1736,7 +1734,8 @@ def run_eval(
lengths: List[int]
if self.num_active_samplers > 0:
lengths = self.vector_tasks.command(
"sampler_attr", ["length"] * self.num_active_samplers,
"sampler_attr",
Collaborator:

I see many formatting changes; are you also using Black to ensure consistency?

Collaborator:

Yes, note that we're using version 19.10b0 of black.

Member Author (mattdeitke):

Yes, I'm using Black, but I must be using different settings (such as the line-length limit). I will try with 19.10b0 :)

Additional resolved review threads: allenact/algorithms/onpolicy_sync/runner.py, allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py (×2), allenact/base_abstractions/callbacks.py, allenact/main.py
@Lucaweihs (Collaborator) left a comment

Really like this, but I do have a few suggestions here and there that I think could improve usability. I think the main question for us is whether we see people wanting to mix and match callbacks across repeated runs of the same experiment. If not, then maybe it's sufficient to define callbacks at the experiment-config level. One important thing to remember for this discussion is that it need not be the same person running the experiment, so if we "hard code" certain callbacks (e.g. wandb logging), this will make things a bit more annoying for people who don't have all the appropriate permissions set up.


@@ -926,6 +926,12 @@ def _task_sampling_loop_generator_fn(
step_result = step_result.clone({"info": {}})
step_result.info[COMPLETE_TASK_METRICS_KEY] = metrics

task_callback_data = current_task.task_callback_data()
Collaborator:

I would prefer if tasks didn't know about callbacks, as this creates a bidirectional dependency between the tasks and callbacks. Could we instead pass a function (or callable object) to the VectorSampledTask and then do something like task_callback_data = task_callback_data_fn(current_task)? This task_callback_data_fn could be returned by the experiment config, similarly to how make_sampler_fn is.

Collaborator:

Actually, even better would be if each Callback were required to define this function on itself as a static method and we just passed in a list of these methods (that way the experiment config wouldn't need to know which callbacks were going to be used with it, making it easier to mix and match from the command line).

Member Author (mattdeitke):

That’s an interesting idea! Like an on_task_end(task: Task) callback.
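
A minimal sketch of how those two ideas could fit together is below; the names collect_callback_data and on_task_end, and the use of task.task_info, are illustrative assumptions rather than anything defined in this PR.

from typing import Any, Dict, Optional

from allenact.base_abstractions.callbacks import Callback
from allenact.base_abstractions.task import Task


class WandbLogging(Callback):
    @staticmethod
    def collect_callback_data(task: Task) -> Optional[Dict[str, Any]]:
        # Called by the vector sampled tasks when a task finishes, so the task
        # itself never needs to know that callbacks exist.
        return {"task_info": task.task_info}

    def on_task_end(self, data: Optional[Dict[str, Any]], **kwargs) -> None:
        # Receives whatever collect_callback_data gathered for the finished task.
        if data is not None:
            ...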

Additional resolved review threads: allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py, allenact/base_abstractions/callbacks.py
Comment on lines 232 to 234
def task_callback_data(self) -> Optional[Any]:
"""Returns any data that should be passed to the log callback function."""
return None
Collaborator:

If we go with my suggestion above, this would be removed.

Additional resolved review thread: allenact/main.py
@jordis-ai2 (Collaborator) left a comment

Just a minor comment and a short question about usage, but other than that, LGTM.

@@ -86,14 +89,18 @@ def __init__(
disable_tensorboard: bool = False,
disable_config_saving: bool = False,
distributed_ip_and_port: str = "127.0.0.1:0",
distributed_preemption_threshold: float = 0.7,
Collaborator:

This is nice; not sure why it was never added as an argument, but good that you did. I guess it would also make sense to make it an arg in main?

Member Author (mattdeitke):

Sorry, I have just been periodically adding to this branch for any issues I've come across. I think I'm just about ready for it, other than addressing the existing comments above. I've been waiting a bit since it's nice to sometimes add arguments to the callbacks, but if it's merged into main, it'd be hard to add anything without breaking backwards compatibility.

Collaborator:

I meant main.py :)
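
For reference, exposing the flag in main.py could look roughly like the following self-contained sketch; the flag name and default mirror the runner parameter shown in the diff above, while the help text and the exact parser wiring are assumptions.

import argparse

# Hypothetical sketch of exposing the flag in allenact/main.py; the parsed value
# would then be forwarded to the runner's distributed_preemption_threshold parameter.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--distributed_preemption_threshold",
    type=float,
    default=0.7,
    help="Forwarded to the runner's distributed_preemption_threshold argument (default 0.7).",
)
args = parser.parse_args([])  # empty list here just shows the default being picked up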


message.append(f"tasks {num_tasks} checkpoint {checkpoint_file_name[0]}")
get_logger().info(" ".join(message))

for callback in self.callbacks:
Collaborator:

I guess it's also possible to make the callback start a thread and return immediately, to allow the logger to keep showing training stats in "real time". Is that right?

Member Author (mattdeitke):

This would be possible, but one of the issues is that if wandb is initialized in one process, you cannot log from a thread, which is why logging from a thread and then destroying it wouldn't work as one would expect. And if you try to log from all threads and processes, logging becomes prohibitively expensive.

See their notes on distributed training: https://docs.wandb.ai/guides/track/advanced/distributed-training

Collaborator:

I think I didn't express what I meant correctly. Currently, if the callback function takes a long time to process, packages in the runner queue for logging wait for a long time; when they're finally read, we quickly flush that queue, resulting in e.g. spikes in FPS (note that there's no timestamp in the logging packages sent from the trainers). I thought that passing the data for the callback to a thread (in the same process) could do the job, but if using an open wandb session from a thread in the same process is a no-go, then there's no discussion. 👍

@Lucaweihs (Collaborator) left a comment

Having merged the callback sensor API, LGTM.

@Lucaweihs merged commit b5c7192 into main Aug 16, 2022