
Continuous evaluations init commit #325

Closed

Conversation

@iseessel (Contributor) commented May 25, 2021

Create a script that continuously evaluates benchmarks as they become available from a pretraining.

[Four screenshots of the scheduler's output, taken 2021-06-02 at 10:22 AM; the first two uploads never completed in the original description.]

Next Steps:

  1. Deal with sharded checkpoints and their conversion
  2. Improve max_iteration logic
  3. Extend to FB infra.
  4. Write unit tests
  5. Think about how to handle these tricky evaluation tests: #325 (comment)
  6. Try not to replicate so much logic in the class (e.g., getting path names from vissl code requires some refactoring).
  7. Look into email notifications.

Testing:

  1. Ran an 8-node SwAV pretraining for 10 epochs with 3 benchmark evaluations that have different resource requirements. SUCCESS.

json config:

```
{
    "params": {
           "training_checkpoint_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints",
           "benchmarks": [
               {
                   "evaluation_name": "clevr_count_linear",
                   "config_files": [
                       "config=config_local/eval_resnet_8gpu_transfer_clevr_count_linear_benchmark_suite_scheduler_test.yaml"
                   ]
               },
               {
                   "evaluation_name": "clevr_dist_linear",
                   "config_files": [
                       "config=config_local/eval_resnet_8gpu_transfer_clevr_dist_linear_benchmark_suite_scheduler_test.yaml"
                   ]
               },
               {
                   "evaluation_name": "in1k_linear",
                   "config_files": [
                       "config=config_local/eval_resnet_8gpu_transfer_in1k_linear_benchmark_suite_scheduler_test.yaml"
                   ]
               }
           ],
           "evaluation_iter_freq": 600,
           "evaluation_phase_freq": 2,
           "evaluate_final_phase": true,
           "autoload_slurm_evaluator_checkpoint": false,
           "slurm_evaluator_checkpoint": null,
           "auto_retry_evaluations": true,
           "retry_evaluation_job_ids": [],
           "max_retries": 3,
           "pytorch_ports": [40050, 40051, 40052, 40053, 40054, 40055, 40056, 40057, 40058, 40059, 40060, 40061, 40062, 40063]
       },
       "slurm_options": {
           "PARTITION": "learnfair"
       }
}
```
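For context, this JSON is what gets fed to the scheduler. A rough sketch of the driver flow, pieced together from snippets quoted later in this conversation (the constructor signature and file name are assumptions):

```
import json

# Hypothetical driver; BenchmarkSuiteScheduler is the class under review,
# and passing params as keyword arguments is an assumption.
with open("benchmark_suite_scheduler.json") as f:
    config = json.load(f)

scheduler = BenchmarkSuiteScheduler(**config["params"])
scheduler.evaluate()  # benchmark_suite_scheduler.evaluate() is quoted below
```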

Example snippet from evaluation_metrics.json:

```
{
    "model_final_checkpoint_phase9": [
        {
            "checkpoint_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear/checkpoints",
            "config_files": [
                "config=config_local/eval_resnet_8gpu_transfer_clevr_count_linear_benchmark_suite_scheduler_test.yaml",
                "hydra.run.dir='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear'",
                "config.CHECKPOINT.DIR='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear/checkpoints'",
                "config.SLURM.LOG_FOLDER='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear'",
                "config.SLURM.LOG_FOLDER='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear'",
                "config.SLURM.USE_SLURM=true",
                "config.MODEL.WEIGHTS_INIT.PARAMS_FILE='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/model_final_checkpoint_phase9.torch'"
            ],
            "evaluation_name": "clevr_count_linear",
            "job_id": "42410489",
            "metrics": {
                "test_accuracy_list_meter_top_1_res5": {
                    "iteration": 822,
                    "metric": 34.62,
                    "train_phase_idx": 2
                },
                "train_accuracy_list_meter_top_1_res5": {
                    "iteration": 822,
                    "metric": 33.8514,
                    "train_phase_idx": 2
                }
            },
            "num_retries": 1,
            "slurm_checkpoint_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear/checkpoints",
            "slurm_log_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear",
            "slurm_state": "COMPLETED",
            "weights_init_params_file": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/model_final_checkpoint_phase9.torch"
        }, ...
```

The following hold:

  1. Training completes appropriately, w/o errors.
  2. Able to resume checkpoints.
  3. Evaluation folder structure is as expected above.
  4. Best Metrics are extracted.

@facebook-github-bot added the CLA Signed label on May 25, 2021
config_files.insert(1, f"config.SLURM.LOG_FOLDER='{log_dir}'")
config_files.insert(1, f"config.CHECKPOINT.DIR='{checkpoint_dir}'")
config_files.insert(1, f"hydra.run.dir='{ log_dir }'")
config_files.insert(1, "hydra.verbose=true")
Contributor:
You could skip that one (verbose=true)

@iseessel changed the title from "[WIP] Continuous evaluations init commit" to "Continuous evaluations init commit" on Jun 1, 2021
assert (
self.evaluation_iter_freq
% self.training_config.CHECKPOINT.CHECKPOINT_ITER_FREQUENCY
) == 0, "Evaulation iter frequency must evenly divide the checkpoint iter frequency" # NOQA
Contributor:

Suggested change
) == 0, "Evaulation iter frequency must evenly divide the checkpoint iter frequency" # NOQA
) == 0, "Evaluation iter frequency must evenly divide the checkpoint iter frequency" # NOQA

if self.evaluation_iter_freq > -1 and not self.max_training_iterations:
assert (
self.training_config.DATA.TRAIN.DATA_LIMIT != -1
), "When evaluataing iterations, please either set the DATA_LIMIT of the training config, or the max_training_iterations" # NOQA
Contributor:

Suggested change
), "When evaluataing iterations, please either set the DATA_LIMIT of the training config, or the max_training_iterations" # NOQA
), "When evaluating iterations, please either set the DATA_LIMIT of the training config, or the max_training_iterations" # NOQA

)

# Add benchmark result information
benchmark_result = {}
Contributor:

Suggested change
benchmark_result = {}
benchmark_result = {
"evaluation_name": evaluation_name,
"job_id": None,
...
}

self.training_config.CHECKPOINT.DIR, f"{ training_checkpoint }.torch"
)
config_files = benchmark_result["config_files"]
config_files.insert(
Contributor:

Suggested change
config_files.insert(
for option in ["config.SLURM.USE_SLURM=true", f"config.SLURM.LOG_FOLDER='{log_dir}'", ...]:
    config_files.insert(1, option)
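Expanded with the overrides from the original snippet, the suggested loop would read roughly as follows (note that inserting at index 1 each time means the last option ends up first):

```
for option in [
    "config.SLURM.USE_SLURM=true",
    f"config.SLURM.LOG_FOLDER='{log_dir}'",
    f"config.CHECKPOINT.DIR='{checkpoint_dir}'",
    f"hydra.run.dir='{log_dir}'",
]:
    config_files.insert(1, option)
```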

return wrapper


def flatten_dict(d, parent_key="", sep="_"):
Contributor:

👍


final_metrics = collections.defaultdict(lambda: {"metric": -1})

# Get the largest metrics over all recorded metrics.
Contributor:

nit: extract this as a function get_largest_metrics
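A possible shape for that extraction, reusing the defaultdict from the snippet above; the function body is an assumption, with the metric-entry layout taken from the evaluation_metrics.json example:

```
import collections
from typing import Dict, Iterable

def get_largest_metrics(all_metrics: Iterable[Dict[str, Dict]]) -> Dict[str, Dict]:
    """For each metric name, keep the recorded entry with the largest value."""
    final_metrics = collections.defaultdict(lambda: {"metric": -1})
    for metrics in all_metrics:
        for name, entry in metrics.items():
            if entry["metric"] > final_metrics[name]["metric"]:
                final_metrics[name] = entry
    return dict(final_metrics)
```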

# Create SlurmJob object.
job_id = str(benchmark["job_id"])
folder = Path(benchmark["slurm_log_dir"])
job = submitit.SlurmJob(job_id=job_id, folder=folder, tasks=[0])
Contributor:

👍
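For illustration, the SlurmJob above can then be polled until it reaches a terminal state; a minimal sketch, with the terminal-state set assumed from standard SLURM states (the PR's actual list is in _SLURM_JOB_TERMINAL_STATES, quoted further down):

```
import time

_TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}  # assumed contents

while job.state not in _TERMINAL_STATES:
    time.sleep(15)  # the PR uses _SLEEP_TIME_SECONDS = 15

print(f"Job {job.job_id} finished in state: {job.state}")
```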

"COMMENT": "vissl evaluation job",
"PARTITION": "learnfair",
"CONSTRAINT": "",
"TIMEOUT_MIN": 4320, # Timeout in minutes.
Contributor:

Suggested change
"TIMEOUT_MIN": 4320, # Timeout in minutes.
"TIMEOUT_MIN": 72 * 60, # Timeout in 72 hours

_DEFAULT_SLURM_OPTIONS = {
"NAME": "vissl",
"COMMENT": "vissl evaluation job",
"PARTITION": "learnfair",
Contributor:

Suggestion: "learnlab" is faster

@QuentinDuval (Contributor) commented Jun 2, 2021

I had some suggestions for tests for the benchmark suite evaluator, which I think will allow us to exercise the most tricky cases:

  • configs/config/benchmark/linear_image_classification/voc07 because it is using SVM evaluation
  • configs/config/benchmark/nearest_neighbor/eval_resnet_8gpu_in1k_kNN.yaml because it is using KNN evaluation
  • configs/config/benchmark/object_detection because the evaluation is not implemented in VISSL

These use cases do not have to be supported in this PR, but we need to think about them:

  • Either to make the benchmark suite evaluator support them
  • Or to refactor these benchmarks so that they fit more naturally (my preferred option)

@prigoyal (Contributor) left a comment:

thank you so much @iseessel :) this is a really awesome and very impactful feature addition to VISSL.

1st round review! :)

@@ -0,0 +1,22 @@
{
Contributor:

question: do we have the requirement that the json file should be in the "configs" folder, or can it be anywhere?

If the latter, I'd recommend moving the scripts at least outside of "benchmark" (we should retain only the yaml or reproducibility-related files within the benchmark folder).

How about something like "dev/benchmark_suite/.....json"?

@@ -0,0 +1,22 @@
{
"params": {
"training_checkpoint_dir": "(str) Training checkpoint directory. That is the CHECKPOINT.DIR of the training config",
Contributor:

it's slightly counter-intuitive to have the descriptions like this in json. If we create a dev/benchmark_suite/, we have two options:

  1. rename this file to "template.json"
  2. create a README.md, capture the template there, and create an example.json with the "actual" parameters, i.e. a filled-out template.

Contributor Author (@iseessel):

Agreed, we should add something to the README -- but since the approach will change shortly with the added fbinfra support, I'd like to wait on that.

return wrapper


def flatten_dict(d, parent_key="", sep="_"):
Contributor:

nit: let's add the type hints to the inputs.

return wrapper


def flatten_dict(d, parent_key="", sep="_"):
Contributor:

nit: can we add a docstring to the function on what it does.
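A minimal sketch addressing both nits (type hints plus a docstring); the recursive body is the common implementation and an assumption about what the PR's flatten_dict does:

```
from typing import Any, Dict

def flatten_dict(d: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten a nested dict, joining nested keys with `sep`.

    Example: {"test_accuracy": {"top_1": 34.62}} -> {"test_accuracy_top_1": 34.62}
    """
    items: Dict[str, Any] = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            items.update(flatten_dict(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items
```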

@prigoyal (Contributor) left a comment:

2nd round :)

@@ -0,0 +1,545 @@
# Copyright (c) Facebook, Inc. and its affiliates.
Contributor:

question: is this scheduler supposed to run on "every gpu"?

typically in vissl/engines/, we have included engines that run on every gpu. the distributed_launcher takes care of launching the engine on each gpu worker.

If this doesn't run on all gpus, then move it either to the vissl/utils/ folder or within tools/, merging it with the newly added .py, as this can also facilitate better code readability (avoid many hops) :)


# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

Contributor:

nit: recommend adding a docstring that explains what this does

_SLEEP_TIME_SECONDS = 15
# Slurm states marked as terminal. SlurmEvaluator#evaluate will finish
# once all jobs are in a terminal state.
_SLURM_JOB_TERMINAL_STATES = [
Contributor:

nit: remove the SLURM from the name

Contributor Author (@iseessel):

This is SLURM-specific; this will all be refactored shortly to account for fbinfra.

Comment on lines 110 to 112
self.training_checkpoint_file = os.path.join(
self.training_checkpoint_dir, "train_config.yaml"
)
Contributor:

nit: add an assert for it. If train_config.yaml is not found, that means the user's training didn't work, so we should exit instantly.

Contributor Author (@iseessel):

I have set it up so that this can be launched at the same time as the training. Since we don't know when each job will be executed by SLURM (and the low-resource job will likely be executed first), I wait for the training config to become available for a set amount of time, then fail.
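A sketch of that wait-then-fail behaviour (the timeout constant and method name are assumptions; PathManager and _SLEEP_TIME_SECONDS appear elsewhere in this PR):

```
import time

_MAX_CONFIG_WAIT_SECONDS = 3600  # assumed; the PR waits "a set amount of time"

def _wait_for_training_config(self):
    """Block until train_config.yaml exists, failing after the timeout."""
    waited = 0
    while not PathManager.exists(self.training_checkpoint_file):
        assert waited < _MAX_CONFIG_WAIT_SECONDS, (
            f"train_config.yaml never appeared at {self.training_checkpoint_file}"
        )
        time.sleep(_SLEEP_TIME_SECONDS)
        waited += _SLEEP_TIME_SECONDS
```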

"""
self.evaluation_jobs_finished = set()

# Required Arguments
Contributor:

For all the required arguments, we must perform some kind of validation. In the __init__, you can for instance call a self.validate() that takes care of this.
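A sketch of what that could look like (the method and messages are hypothetical; the attribute names come from the JSON config in the PR description):

```
def validate(self):
    """Fail fast on missing or inconsistent required arguments."""
    assert self.training_checkpoint_dir, "training_checkpoint_dir is required"
    assert self.benchmarks, "at least one benchmark must be provided"
    assert self.max_retries >= 0, "max_retries must be non-negative"
```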

Comment on lines 200 to 201
self.training_config = load_file(self.training_checkpoint_file)
self.training_config = AttrDict(self.training_config)
Contributor Author (@iseessel), Jun 4, 2021:

Yeah agreed, this is a mistake -- we should use hydra_config, since config options can change.

I take this back: I think we should just call AttrDict. The hydra_config function does too much:

  1. We don't want to save this yaml config file.
  2. We don't want to infer the config. We want the config file to be exactly as it is seen in training_config.yaml (that being said, in the link above, I think convert_fsdp_dtypes should be called before save_attrdict_to_disk).

f"Loaded training checkpoint config from: { self.training_checkpoint_file }"
)
# Build main training task in order to extract iteration info.
self.training_task = build_task(self.training_config)
Contributor:

just noting here: we discussed getting rid of the training iterations. :)

Contributor Author (@iseessel):

So we still need the training iterations; we just wanted to find them in a different way.

I synced up with @QuentinDuval on this. I went down the path of introducing them dynamically, but this complicates the logic too much. I much prefer scaffolding them at the beginning; it's much easier to follow and debug.

state_prev: None, state_current: { job.state }
"""

print(log)
Contributor:

nit: logger
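Applying the nit would look something like this (logger setup assumed):

```
import logging

logger = logging.getLogger(__name__)
logger.info(log)  # instead of print(log)
```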

is_submitit_available()
), "Please 'pip install submitit' to schedule jobs on SLURM"

def _validate_training_cfg(self):
Contributor:

nit: let's rename it to _validate_evaluation_setup ?

self.evaluation_results = self._generate_initial_evaluation_results()
self._validate_training_cfg()

def _validate_class_instance(self):
Contributor:

question: it sounds like we are simply checking the availability of libraries like hydra and submitit, which has nothing to do with the class itself. Could we find a better place for these functions and possibly just call them directly, instead of putting them under _validate_class_instance? :)

@iseessel force-pushed the continuous-evaluations branch 3 times, most recently from 52df67b to 936bd99, on June 4, 2021 at 16:31
@facebook-github-bot:
@iseessel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot:
@iseessel has updated the pull request. You must reimport the pull request before landing.

@iseessel linked an issue on Jun 4, 2021 that may be closed by this pull request
@prigoyal (Contributor) left a comment:
looks great! thank you so much. Some inline comments; also take a look at some open conversations and resolve them.

Remaining steps:

@@ -0,0 +1,582 @@
# Copyright (c) Facebook, Inc. and its affiliates.
Contributor:

nit: as discussed, let's move this to under vissl/utils/

Comment on lines 26 to 33
This class is designed to be used to run multiple evaluations on a single (pre)training.
Using the #evaluate method we continuously monitor training checkpoints, launch evaluations
dynamically as they become available, and amalgamate the evaluation results as they become
available.

For SLURM usage, you should create a JSON configuration file (see benchmark_suite_scheduler_template.json)
and use the launch_benchmark_suite_scheduler_slurm.sh for convenience.
"""
Contributor:

nit: no indent



# Default slurm options to pass to the executor.
_DEFAULT_SLURM_OPTIONS = {
Contributor:

as discussed, let's move this to dedicated settings in defaults.yaml :)

@@ -0,0 +1,109 @@
#!/bin/bash
Contributor:

should this be removed?

def _launch_slurm_job(self, args, config):
return launch_distributed_on_slurm(engine_name=args.engine_name, cfg=config)

def _write_json_file(self, data, file_name):
Contributor:

nit: can we use the function save_file https://github.com/facebookresearch/vissl/blob/master/vissl/utils/io.py#L68 which does exactly this?

Contributor Author (@iseessel):

I responded to this earlier, but because of the file changes, the comment got buried.

They do slightly different things.

  1. #save_file appends, whereas this overwrites.
  2. #save_file adds a newline.
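Given that distinction, the overwriting variant presumably looks roughly like this (a sketch; the pretty-printed output matches the evaluation_metrics.json snippet above, and PathManager is used as elsewhere in this PR):

```
import json

def _write_json_file(self, data, file_name):
    # Overwrite (rather than append) and pretty-print the JSON.
    with PathManager.open(file_name, "w") as f:
        json.dump(data, f, sort_keys=True, indent=4)
```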

Comment on lines 206 to 207
if not PathManager.exists(evaluation_dir):
PathManager.mkdirs(evaluation_dir)
Contributor Author (@iseessel):

Nice. Thanks, I missed this one!

Comment on lines 218 to 219
if not PathManager.exists(child_metrics_dir):
PathManager.mkdirs(child_metrics_dir)

@@ -0,0 +1,25 @@
{
Contributor Author (@iseessel):

Revert this, this is for testing.

), "slurm_options.PARTITION is a required field to launch the benchmark suite on slurm"

slurm_options = AttrDict(config["slurm_options"])
benchmark_suite_scheduler.evaluate()
Contributor Author (@iseessel):

This is for testing purposes, revert once finished testing.


@prigoyal (Contributor):
looks great. Ready to merge. Thank you so much @iseessel, and thank you @QuentinDuval for the amazing mentoring on this.

@iseessel, this PR needs a rebase + reimport and then it's ready to merge.


facebook-github-bot pushed a commit that referenced this pull request Jun 14, 2021
Summary: Create a script that continuously evaluates benchmarks as they become available from a pretraining.
Pull Request resolved: #325

Reviewed By: prigoyal

Differential Revision: D28901750

Pulled By: iseessel

fbshipit-source-id: 732074043200ac51f3e709d5e67e686f26d36835
Labels: CLA Signed

Successfully merging this pull request may close these issues: Continuous evaluations for checkpoints for SLURM

4 participants