
Continuous evaluations init commit #325

Closed

Conversation

@iseessel (Contributor) commented May 25, 2021

Create a script that continuously evaluates benchmarks as they become available from a pretraining.

[Four screenshots of the scheduler's output, taken 2021-06-02 at 10:22 AM; the first two uploads never completed in the original description.]

Next Steps:

  1. Deal with sharded checkpoints and their conversion
  2. Improve max_iteration logic
  3. Extend to FB infra.
  4. Write unit tests
  5. Think about how to handle these tricky evaluation tests: #325 (comment)
  6. Try not to replicate so much logic in the class (e.g., getting path names from vissl code requires some refactoring).
  7. Look into email notifications.

Testing:

  1. Ran an 8-node SwAV pretraining for 10 epochs with 3 benchmark evaluations that have different resource requirements. SUCCESS.

json config:

```
{
    "params": {
           "training_checkpoint_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints",
           "benchmarks": [
               {
                   "evaluation_name": "clevr_count_linear",
                   "config_files": [
                       "config=config_local/eval_resnet_8gpu_transfer_clevr_count_linear_benchmark_suite_scheduler_test.yaml"
                   ]
               },
               {
                   "evaluation_name": "clevr_dist_linear",
                   "config_files": [
                       "config=config_local/eval_resnet_8gpu_transfer_clevr_dist_linear_benchmark_suite_scheduler_test.yaml"
                   ]
               },
               {
                   "evaluation_name": "in1k_linear",
                   "config_files": [
                       "config=config_local/eval_resnet_8gpu_transfer_in1k_linear_benchmark_suite_scheduler_test.yaml"
                   ]
               }
           ],
           "evaluation_iter_freq": 600,
           "evaluation_phase_freq": 2,
           "evaluate_final_phase": true,
           "autoload_slurm_evaluator_checkpoint": false,
           "slurm_evaluator_checkpoint": null,
           "auto_retry_evaluations": true,
           "retry_evaluation_job_ids": [],
           "max_retries": 3,
           "pytorch_ports": [40050, 40051, 40052, 40053, 40054, 40055, 40056, 40057, 40058, 40059, 40060, 40061, 40062, 40063]
       },
       "slurm_options": {
           "PARTITION": "learnfair"
       }
}
```
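For context, this JSON is what gets fed to the scheduler. A rough sketch of the driver flow, pieced together from snippets quoted later in this conversation (the constructor signature and file name are assumptions):

```
import json

# Hypothetical driver; BenchmarkSuiteScheduler is the class under review,
# and passing params as keyword arguments is an assumption.
with open("benchmark_suite_scheduler.json") as f:
    config = json.load(f)

scheduler = BenchmarkSuiteScheduler(**config["params"])
scheduler.evaluate()  # benchmark_suite_scheduler.evaluate() is quoted below
```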

Example snippet from evaluation_metrics.json:

```
{
    "model_final_checkpoint_phase9": [
        {
            "checkpoint_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear/checkpoints",
            "config_files": [
                "config=config_local/eval_resnet_8gpu_transfer_clevr_count_linear_benchmark_suite_scheduler_test.yaml",
                "hydra.run.dir='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear'",
                "config.CHECKPOINT.DIR='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear/checkpoints'",
                "config.SLURM.LOG_FOLDER='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear'",
                "config.SLURM.LOG_FOLDER='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear'",
                "config.SLURM.USE_SLURM=true",
                "config.MODEL.WEIGHTS_INIT.PARAMS_FILE='/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/model_final_checkpoint_phase9.torch'"
            ],
            "evaluation_name": "clevr_count_linear",
            "job_id": "42410489",
            "metrics": {
                "test_accuracy_list_meter_top_1_res5": {
                    "iteration": 822,
                    "metric": 34.62,
                    "train_phase_idx": 2
                },
                "train_accuracy_list_meter_top_1_res5": {
                    "iteration": 822,
                    "metric": 33.8514,
                    "train_phase_idx": 2
                }
            },
            "num_retries": 1,
            "slurm_checkpoint_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear/checkpoints",
            "slurm_log_dir": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/evaluations/model_final_checkpoint_phase9/clevr_count_linear",
            "slurm_state": "COMPLETED",
            "weights_init_params_file": "/checkpoint/iseessel/vissl/2021-06-09-11-19-12/checkpoints/model_final_checkpoint_phase9.torch"
        }, ...
```

The following hold:

  1. Training completes appropriately, w/o errors.
  2. Able to resume checkpoints.
  3. Evaluation folder structure is as expected above.
  4. Best Metrics are extracted.

@facebook-github-bot added the CLA Signed label on May 25, 2021
config_files.insert(1, f"config.SLURM.LOG_FOLDER='{log_dir}'")
config_files.insert(1, f"config.CHECKPOINT.DIR='{checkpoint_dir}'")
config_files.insert(1, f"hydra.run.dir='{ log_dir }'")
config_files.insert(1, "hydra.verbose=true")
Contributor:
You could skip that one (verbose=true)

@iseessel changed the title from "[WIP] Continuous evaluations init commit" to "Continuous evaluations init commit" on Jun 1, 2021
assert (
self.evaluation_iter_freq
% self.training_config.CHECKPOINT.CHECKPOINT_ITER_FREQUENCY
) == 0, "Evaulation iter frequency must evenly divide the checkpoint iter frequency" # NOQA
Contributor:

Suggested change
) == 0, "Evaulation iter frequency must evenly divide the checkpoint iter frequency" # NOQA
) == 0, "Evaluation iter frequency must evenly divide the checkpoint iter frequency" # NOQA

if self.evaluation_iter_freq > -1 and not self.max_training_iterations:
assert (
self.training_config.DATA.TRAIN.DATA_LIMIT != -1
), "When evaluataing iterations, please either set the DATA_LIMIT of the training config, or the max_training_iterations" # NOQA
Contributor:

Suggested change
), "When evaluataing iterations, please either set the DATA_LIMIT of the training config, or the max_training_iterations" # NOQA
), "When evaluating iterations, please either set the DATA_LIMIT of the training config, or the max_training_iterations" # NOQA

)

# Add benchmark result information
benchmark_result = {}
Contributor:

Suggested change
benchmark_result = {}
benchmark_result = {
"evaluation_name": evaluation_name,
"job_id": None,
...
}

self.training_config.CHECKPOINT.DIR, f"{ training_checkpoint }.torch"
)
config_files = benchmark_result["config_files"]
config_files.insert(
Contributor:

Suggested change
config_files.insert(
for option in ["config.SLURM.USE_SLURM=true", f"config.SLURM.LOG_FOLDER='{log_dir}'", ...]:
    config_files.insert(1, option)
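Expanded with the overrides from the original snippet, the suggested loop would read roughly as follows (note that inserting at index 1 each time means the last option ends up first):

```
for option in [
    "config.SLURM.USE_SLURM=true",
    f"config.SLURM.LOG_FOLDER='{log_dir}'",
    f"config.CHECKPOINT.DIR='{checkpoint_dir}'",
    f"hydra.run.dir='{log_dir}'",
]:
    config_files.insert(1, option)
```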

return wrapper


def flatten_dict(d, parent_key="", sep="_"):
Contributor:

👍


final_metrics = collections.defaultdict(lambda: {"metric": -1})

# Get the largest metrics over all recorded metrics.
Contributor:

nit: extract this as a function get_largest_metrics
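A possible shape for that extraction, reusing the defaultdict from the snippet above; the function body is an assumption, with the metric-entry layout taken from the evaluation_metrics.json example:

```
import collections
from typing import Dict, Iterable

def get_largest_metrics(all_metrics: Iterable[Dict[str, Dict]]) -> Dict[str, Dict]:
    """For each metric name, keep the recorded entry with the largest value."""
    final_metrics = collections.defaultdict(lambda: {"metric": -1})
    for metrics in all_metrics:
        for name, entry in metrics.items():
            if entry["metric"] > final_metrics[name]["metric"]:
                final_metrics[name] = entry
    return dict(final_metrics)
```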

# Create SlurmJob object.
job_id = str(benchmark["job_id"])
folder = Path(benchmark["slurm_log_dir"])
job = submitit.SlurmJob(job_id=job_id, folder=folder, tasks=[0])
Contributor:

👍
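For illustration, the SlurmJob above can then be polled until it reaches a terminal state; a minimal sketch, with the terminal-state set assumed from standard SLURM states (the PR's actual list is in _SLURM_JOB_TERMINAL_STATES, quoted further down):

```
import time

_TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}  # assumed contents

while job.state not in _TERMINAL_STATES:
    time.sleep(15)  # the PR uses _SLEEP_TIME_SECONDS = 15

print(f"Job {job.job_id} finished in state: {job.state}")
```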

"COMMENT": "vissl evaluation job",
"PARTITION": "learnfair",
"CONSTRAINT": "",
"TIMEOUT_MIN": 4320, # Timeout in minutes.
Contributor:

Suggested change
"TIMEOUT_MIN": 4320, # Timeout in minutes.
"TIMEOUT_MIN": 72 * 60, # Timeout in 72 hours

_DEFAULT_SLURM_OPTIONS = {
"NAME": "vissl",
"COMMENT": "vissl evaluation job",
"PARTITION": "learnfair",
Contributor:

Suggestion: "learnlab" is faster

@QuentinDuval (Contributor) commented Jun 2, 2021

I had some suggestions for tests for the benchmark suite evaluator, which I think will allow us to exercise the most tricky cases:

  • configs/config/benchmark/linear_image_classification/voc07 because it is using SVM evaluation
  • configs/config/benchmark/nearest_neighbor/eval_resnet_8gpu_in1k_kNN.yaml because it is using KNN evaluation
  • configs/config/benchmark/object_detection because the evaluation is not implemented in VISSL

These use cases do not have to be supported in this PR, but we need to think about them:

  • Either to make the benchmark suite evaluator support them
  • Or to refactor these benchmarks so that they fit more naturally (my preferred option)

@prigoyal (Contributor) left a comment:

thank you so much @iseessel :) this is a really awesome and very impactful feature addition to VISSL.

1st round review! :)

@@ -0,0 +1,22 @@
{
Contributor:

question: do we have the requirement that the json file should be in the "configs" folder, or can it be anywhere?

If the latter, I'd recommend moving the scripts at least outside of "benchmark" (we should retain only the yaml or reproducibility-related files within the benchmark folder).

How about something like "dev/benchmark_suite/.....json"?

@@ -0,0 +1,22 @@
{
"params": {
"training_checkpoint_dir": "(str) Training checkpoint directory. That is the CHECKPOINT.DIR of the training config",
Contributor:

it's slightly counter-intuitive to have the descriptions like this in json. If we create a dev/benchmark_suite/, we have two options:

  1. rename this file to "template.json"
  2. create a README.md, capture the template there, and create an example.json with the "actual" parameters, i.e. a filled-out template.

Contributor Author (@iseessel):

Agreed, we should add something to the README -- but since the approach will change shortly with the added fbinfra support, I'd like to wait on that.

return wrapper


def flatten_dict(d, parent_key="", sep="_"):
Contributor:

nit: let's add the type hints to the inputs.

return wrapper


def flatten_dict(d, parent_key="", sep="_"):
Contributor:

nit: can we add a docstring to the function on what it does.
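A minimal sketch addressing both nits (type hints plus a docstring); the recursive body is the common implementation and an assumption about what the PR's flatten_dict does:

```
from typing import Any, Dict

def flatten_dict(d: Dict[str, Any], parent_key: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten a nested dict, joining nested keys with `sep`.

    Example: {"test_accuracy": {"top_1": 34.62}} -> {"test_accuracy_top_1": 34.62}
    """
    items: Dict[str, Any] = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            items.update(flatten_dict(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items
```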

@prigoyal (Contributor) left a comment:

2nd round :)

@@ -0,0 +1,545 @@
# Copyright (c) Facebook, Inc. and its affiliates.
Contributor:

question: is this scheduler supposed to run on "every gpu"?

typically in vissl/engines/, we have included engines that run on every gpu. the distributed_launcher takes care of launching the engine on each gpu worker.

If this doesn't run on all gpus, then move it either to the vissl/utils/ folder or within tools/, merging it with the newly added .py, as this can also facilitate better code readability (avoid many hops) :)


# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

Contributor:

nit: recommend adding a docstring that explains what this does

_SLEEP_TIME_SECONDS = 15
# Slurm states marked as terminal. SlurmEvaluator#evaluate will finish
# once all jobs are in a terminal state.
_SLURM_JOB_TERMINAL_STATES = [
Contributor:

nit: remove the SLURM from the name

Contributor Author (@iseessel):

This is SLURM-specific; this will all be refactored shortly to account for fbinfra.

Comment on lines 110 to 112
self.training_checkpoint_file = os.path.join(
self.training_checkpoint_dir, "train_config.yaml"
)
Contributor:

nit: add an assert for it. If train_config.yaml is not found, that means the user's training didn't work, so we should exit instantly.

Contributor Author (@iseessel):

I have set it up so that this can be launched at the same time as the training. Since we don't know when each job will be executed by SLURM (and the low-resource job will likely be executed first), I wait for the training config to become available for a set amount of time, then fail.
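A sketch of that wait-then-fail behaviour (the timeout constant and method name are assumptions; PathManager and _SLEEP_TIME_SECONDS appear elsewhere in this PR):

```
import time

_MAX_CONFIG_WAIT_SECONDS = 3600  # assumed; the PR waits "a set amount of time"

def _wait_for_training_config(self):
    """Block until train_config.yaml exists, failing after the timeout."""
    waited = 0
    while not PathManager.exists(self.training_checkpoint_file):
        assert waited < _MAX_CONFIG_WAIT_SECONDS, (
            f"train_config.yaml never appeared at {self.training_checkpoint_file}"
        )
        time.sleep(_SLEEP_TIME_SECONDS)
        waited += _SLEEP_TIME_SECONDS
```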

"""
self.evaluation_jobs_finished = set()

# Required Arguments
Contributor:

For all the required arguments, we must perform some kind of validation. In the __init__, you can for instance call a self.validate() that takes care of this.
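A sketch of what that could look like (the method and messages are hypothetical; the attribute names come from the JSON config in the PR description):

```
def validate(self):
    """Fail fast on missing or inconsistent required arguments."""
    assert self.training_checkpoint_dir, "training_checkpoint_dir is required"
    assert self.benchmarks, "at least one benchmark must be provided"
    assert self.max_retries >= 0, "max_retries must be non-negative"
```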

Comment on lines 200 to 201
self.training_config = load_file(self.training_checkpoint_file)
self.training_config = AttrDict(self.training_config)
Contributor Author (@iseessel), Jun 4, 2021:

Yeah agreed, this is a mistake -- we should use hydra_config, since config options can change.

I take this back: I think we should just call AttrDict. The hydra_config function does too much:

  1. We don't want to save this yaml config file.
  2. We don't want to infer the config. We want the config file to be exactly as it is seen in training_config.yaml (that being said, in the link above, I think convert_fsdp_dtypes should be called before save_attrdict_to_disk).

f"Loaded training checkpoint config from: { self.training_checkpoint_file }"
)
# Build main training task in order to extract iteration info.
self.training_task = build_task(self.training_config)
Contributor:

just noting here: we discussed getting rid of the training iterations. :)

Contributor Author (@iseessel):

So we still need the training iterations; we just wanted to find them in a different way.

I synced up with @QuentinDuval on this. I went down the path of introducing them dynamically, but this complicates the logic too much. I much prefer scaffolding them at the beginning; it's much easier to follow and debug.

state_prev: None, state_current: { job.state }
"""

print(log)
Contributor:

nit: logger
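Applying the nit would look something like this (logger setup assumed):

```
import logging

logger = logging.getLogger(__name__)
logger.info(log)  # instead of print(log)
```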

is_submitit_available()
), "Please 'pip install submitit' to schedule jobs on SLURM"

def _validate_training_cfg(self):
Contributor:

nit: let's rename it to _validate_evaluation_setup ?

self.evaluation_results = self._generate_initial_evaluation_results()
self._validate_training_cfg()

def _validate_class_instance(self):
Contributor:

question: it sounds like we are simply checking the availability of libraries like hydra and submitit, which has nothing to do with the class itself. Could we find a better place for these functions and possibly just call them directly, instead of putting them under _validate_class_instance? :)

@iseessel force-pushed the continuous-evaluations branch 3 times, most recently from 52df67b to 936bd99, on June 4, 2021 at 16:31
@facebook-github-bot:
@iseessel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot:
@iseessel has updated the pull request. You must reimport the pull request before landing.

@iseessel linked an issue on Jun 4, 2021 that may be closed by this pull request
@prigoyal (Contributor) left a comment:
looks great! thank you so much. Some inline comments; also take a look at some open conversations and resolve them.

Remaining steps:

@@ -0,0 +1,582 @@
# Copyright (c) Facebook, Inc. and its affiliates.
Contributor:

nit: as discussed, let's move this to under vissl/utils/

Comment on lines 26 to 33
This class is designed to be used to run multiple evaluations on a single (pre)training.
Using the #evaluate method we continuously monitor training checkpoints, launch evaluations
dynamically as they become available, and amalgamate the evaluation results as they become
available.

For SLURM usage, you should create a JSON configuration file (see benchmark_suite_scheduler_template.json)
and use the launch_benchmark_suite_scheduler_slurm.sh for convenience.
"""
Contributor:

nit: no indent



# Default slurm options to pass to the executor.
_DEFAULT_SLURM_OPTIONS = {
Contributor:

as discussed, let's move this to dedicated settings in defaults.yaml :)

@@ -0,0 +1,109 @@
#!/bin/bash
Contributor:

should this be removed?

def _launch_slurm_job(self, args, config):
return launch_distributed_on_slurm(engine_name=args.engine_name, cfg=config)

def _write_json_file(self, data, file_name):
Contributor:

nit: can we use the function save_file https://github.com/facebookresearch/vissl/blob/master/vissl/utils/io.py#L68 which does exactly this?

Contributor Author (@iseessel):

I responded to this earlier, but because of the file changes, the comment got buried.

They do slightly different things.

  1. #save_file appends, whereas this overwrites.
  2. #save_file adds a newline.
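Given that distinction, the overwriting variant presumably looks roughly like this (a sketch; the pretty-printed output matches the evaluation_metrics.json snippet above, and PathManager is used as elsewhere in this PR):

```
import json

def _write_json_file(self, data, file_name):
    # Overwrite (rather than append) and pretty-print the JSON.
    with PathManager.open(file_name, "w") as f:
        json.dump(data, f, sort_keys=True, indent=4)
```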

Comment on lines 206 to 207
if not PathManager.exists(evaluation_dir):
PathManager.mkdirs(evaluation_dir)
Contributor Author (@iseessel):

Nice. Thanks, I missed this one!

Comment on lines 218 to 219
if not PathManager.exists(child_metrics_dir):
PathManager.mkdirs(child_metrics_dir)

@@ -0,0 +1,25 @@
{
Contributor Author (@iseessel):

Revert this, this is for testing.

), "slurm_options.PARTITION is a required field to launch the benchmark suite on slurm"

slurm_options = AttrDict(config["slurm_options"])
benchmark_suite_scheduler.evaluate()
Contributor Author (@iseessel):

This is for testing purposes, revert once finished testing.


@prigoyal (Contributor):
looks great. Ready to merge. Thank you so much @iseessel, and thank you @QuentinDuval for the amazing mentoring on this.

@iseessel, this PR needs a rebase + reimport and then it's ready to merge.


facebook-github-bot pushed a commit that referenced this pull request Jun 14, 2021
Summary: Create a script that continuously evaluates benchmarks as they become available from a pretraining.
Pull Request resolved: #325

Reviewed By: prigoyal

Differential Revision: D28901750

Pulled By: iseessel

fbshipit-source-id: 732074043200ac51f3e709d5e67e686f26d36835
Labels: CLA Signed

Successfully merging this pull request may close these issues: Continuous evaluations for checkpoints for SLURM

4 participants