
Hydra configuration #2

Closed
wants to merge 6 commits into from

Conversation

anthonytec2
Owner

Hydra Configuration

@anthonytec2
Owner Author

Any other ideas to add to the example?

@omry

omry commented Jun 27, 2020

One idea is to separate the dataclasses configuring core PL objects into a different module, to make it clear that those should be reused and not copied by every single user.

pl_examples/hydra_examples/cpu_template.py

@dataclass
class LightningTrainerConf:
    # callbacks: Optional[List[Callback]] = None

I think we are ready to start tackling left-over todos.

Owner Author

For this one, I feel we would need some help from the user. For example, how do we support a list of callback objects? 1) The user would need to define a structured config for their callback, 2) they would need to create a YAML list of these configs, 3) we would need to instantiate each object and populate the list, and 4) pass this list into the Trainer.


Taking the callback from the example:

from pytorch_lightning.callbacks import Callback

class MyPrintingCallback(Callback):

    def on_init_start(self, trainer):
        print('Starting to init trainer!')

    def on_init_end(self, trainer):
        print('trainer is init now')

    def on_train_end(self, trainer, pl_module):
        print('do something when training ends')

This style lets you reuse the callback configs more easily:

callbacks:
  print:
    cls: MyPrintingCallback
  s3_checkpoint:
    cls: S3Checkpoint
    params:
      bucket_name: ???

callbacks_list:
  - ${callbacks.print}
  - ${callbacks.s3_checkpoint}

The code can instantiate the callbacks like:

callbacks = [hydra.utils.instantiate(callback) for callback in cfg.callbacks_list]

Currently, config groups are mutually exclusive: you can only load one config from each config group.
Once facebookresearch/hydra#499 is done we will be able to do something better.
For now, the example can put all the callback configs in the primary config file, or break them into a callbacks.yaml that is added to the defaults list as - callbacks.

PS: I did not try this so there may be unforeseen problems.

Owner Author

A question about interpolation: when I define callbacks: ${callbacks.print}, it resolves to {'cls': 'MyPrintingCallback'}. But when I put the callbacks into a list format, it resolves to {'callbacks': ['${callbacks.print}']}. Is there anything special you need to do for variable interpolation inside a list?

Owner Author

Okay, I figured out the above issue. The solution you mentioned did not work; I was getting an error about an incorrect type when trying to use the object cfg.callbacks_list[0]. I put up a working solution in the most recent commit by just listing out the names of these fields.


What error? Can you be more specific?

Owner Author

Never mind, I fixed the example and it now works like yours.

pl_examples/hydra_examples/pl_hydra/trainer_conf.py
pl_examples/models/hydra_config_model.py
@anthonytec2
Owner Author

One idea would be to place it in pytorch-lightning/pytorch_lightning/trainer/, where we can then reuse the Hydra configuration.

@anthonytec2
Owner Author

Also, a quick aside: do you know much about the submitit plugin, in terms of how environment setup works when running with shared environments?

@omry

omry commented Jun 29, 2020

Yes, but let's not use this important issue for unrelated discussions.
Ask your question in the Hydra chat.

@anthonytec2
Owner Author

The leftover todos relate to Union types, which have been declared as Any or as the more generic of the two types.

@anthonytec2
Owner Author

How would you like to handle your comment regarding separating this example into a part that is included and versioned with PL? Maybe I misunderstood the comment previously, but my intention here was just to put the trainer conf into the core folder for merging into PL.

@omry

omry commented Jun 30, 2020

I think all (most?) of the dataclasses belong in the core, specifically the content of:

  • pytorch_lightning/trainer/trainer_conf.py
  • pl_examples/hydra_examples/conf/scheduler.py
  • pl_examples/hydra_examples/conf/optimizer.py

We can condition the registration with Hydra's ConfigStore on the presence of Hydra.

One thing I don't like about this soft dependency is that it makes it hard to require a specific version of Hydra. I guess you could check Hydra's version at runtime, but it's not great.
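
A minimal sketch (not from this PR) of what that conditional registration could look like, assuming Hydra 1.0's ConfigStore API; the group name and fields below are illustrative only:

# Register the trainer dataclass with Hydra's ConfigStore only if Hydra is installed.
from dataclasses import dataclass

try:
    from hydra.core.config_store import ConfigStore
    _HYDRA_AVAILABLE = True
except ImportError:
    _HYDRA_AVAILABLE = False

@dataclass
class LightningTrainerConf:
    # Illustrative subset of Trainer arguments.
    max_epochs: int = 1000
    gpus: int = 0

if _HYDRA_AVAILABLE:
    cs = ConfigStore.instance()
    cs.store(group="trainer", name="trainer", node=LightningTrainerConf)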

@anthonytec2
Owner Author

I agree with most of your comments regarding moving things into core; my only hesitation is that users can define multiple optimizers, each using a different scheduler. At the moment, given the ConfigStore API, I am unsure how to easily reuse this optimizer config in a different group.

@omry

omry commented Jul 1, 2020

I am not sure I follow your association between optimizer and (LR) scheduler.
Those seem like orthogonal concepts that should each have their own config group.

@anthonytec2
Owner Author

I did not explain that too well; basically I wanted to highlight that you could have multiple optimizers/schedulers in a single configuration file. Since multiple configurations are not currently supported for a single group, this would be an issue. For most cases, though, users should be fine with one optimizer for now, and they can extend the given template for more advanced cases.
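
One possible (untested) way to support that would be to register the same optimizer dataclass under more than one config group, so a single config file can pick one optimizer per group; the group and class names below are hypothetical:

# Reuse one optimizer dataclass across two config groups, e.g. for a generator and a
# discriminator that each need their own optimizer.
from dataclasses import dataclass
from hydra.core.config_store import ConfigStore

@dataclass
class AdamConf:
    lr: float = 1e-3
    weight_decay: float = 0.0

cs = ConfigStore.instance()
cs.store(group="opt_gen", name="adam", node=AdamConf)
cs.store(group="opt_disc", name="adam", node=AdamConf)
# The defaults list can then include both "- opt_gen: adam" and "- opt_disc: adam".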

@omry

omry commented Jul 5, 2020

Can you add a second example that reuses the code here to do something useful (train MNIST, for example)?
Just to be sure we are indeed covering everything we should.

@romesco
Collaborator

romesco commented Jul 5, 2020

For the sake of testing, is there a way for me to use Hydra's compositional config feature without having Hydra take control of the logger, output directory, etc.?

For example, one current inconvenience is that each time I run pl_template.py, it re-downloads MNIST into outputs/Date/Time, which is definitely not expected behavior. Also, it is not playing well with manual exit in a tmux/screen session.

@omry

omry commented Jul 5, 2020

For the sake of testing, is there a way for me to use Hydra's compositional config feature without having Hydra take control of the logger, output directory, etc.?

For example, one current inconvenience is that each time I run pl_template.py, it re-downloads MNIST into outputs/Date/Time, which is definitely not expected behavior. Also, it is not playing well with manual exit in a tmux/screen session.

You can use the compose API, but you would be forfeiting many important features; this is definitely not the recommended approach for something like this.
A better solution is to fix the bug in the example so that pl_template does not re-download MNIST every time you run it.
In fact, I think this is already fixed in this example, so I am not sure what you are seeing.
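
For reference, a minimal sketch of the compose API as it lives in the Hydra 1.0 line (under hydra.experimental); the conf directory, config name, and override below are assumptions for illustration:

# Compose a config in a test without Hydra managing logging or the output directory.
from hydra.experimental import compose, initialize
from omegaconf import OmegaConf

with initialize(config_path="conf"):
    cfg = compose(config_name="config", overrides=["trainer.gpus=0"])
    print(OmegaConf.to_yaml(cfg))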

@anthonytec2
Owner Author

@romesc-CMU I thought I fixed the re-download issue in a recent commit. Can you pull and try again?

@romesco
Collaborator

romesco commented Jul 5, 2020

Ok, let me re-pull and check it out again. Maybe I was configuring it incorrectly, since I did modify a few things to simplify testing.

@anthonytec2
Owner Author

Regarding creating a second example: this example already trains MNIST.

@romesco
Collaborator

romesco commented Jul 6, 2020

I'm on the most up to date commit and this fixed the dataset I/O issue 👍 .

So far while testing pl_template.py as is (not my simplified version):

  1. I was able to pass different GPU configurations.
  2. I was able to successfully save and load a checkpoint.
    (Although I'm wondering how to get the directory Hydra is currently writing to in outputs.)
    I saw hydra.utils.get_original_cwd(), but that only gets me to the entry point. Edit: os.getcwd() works. I didn't realize Hydra changes the working directory. That makes sense.
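
A small sketch of that working-directory behavior (the calls assume they run inside a @hydra.main application):

import os
from hydra.utils import get_original_cwd, to_absolute_path

run_dir = os.getcwd()                    # Hydra's per-run output dir, e.g. outputs/<date>/<time>
launch_dir = get_original_cwd()          # the directory the script was launched from
data_dir = to_absolute_path("datasets")  # resolves a relative path against launch_dir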

I read up on the structured configs and the Config Stores are making a lot more sense. I'm also understanding the purpose of init_trainer() now. Out of curiosity, when I was reading the Structured Configs docs, I noticed they lay out the two benefits as:

  • Runtime type checking as you compose or mutate your config
  • Static type checking when using static type checkers (mypy, PyCharm, etc.)

Does that imply this entire example is achievable with the earlier version of Hydra that doesn't include Structured Configs, with the main advantage of having them being the type checking? [which I do like]

@omry

omry commented Jul 6, 2020

Without Structured Configs you achieve that with YAML config files.
You get very limited type safety, though. (In practice there will likely be some issues in 0.11 that are addressed in the RC this example relies on.)

Please go through the basic Hydra tutorial to get a handle on the basic functionality.

@omry

omry commented Jul 6, 2020

@romesc-CMU, by the way - did you try making some command line, config, and runtime config access errors to see what happens? :)

@anthonytec2 force-pushed the hydra_conf branch 2 times, most recently from 31eaf06 to 50232e0 on July 17, 2020 12:26
@anthonytec2
Owner Author

anthonytec2 commented Jul 17, 2020

One problem still to resolve is that PL flattens all the hparams passed into the model and tries to log them to the configured logger. A new issue is that when I pass these parameters some are missing, and this results in an exception. Specifically, since I defined target instead of cls, I get a missing-cls error when the logger tries to flatten all the config settings.

@anthonytec2
Owner Author

Also, after rebasing we are having a problem with the TensorBoard logger. This relates to Lightning-AI#2519. The problem is that TensorBoard tries to serialize objects, but the branch that checks whether the object is a DictConfig is never hit. The reason is that the hparams object stores a dict of the parameters passed into the model, so the check that hparams is a Container type is never true. https://github.com/PyTorchLightning/pytorch-lightning/blob/9759491940c4108ac8ef01e0b53b31f03a69b4d6/pytorch_lightning/core/saving.py#L364
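
A hypothetical workaround sketch (untested): convert the composed config to plain Python containers before it reaches the model, so the flattened hparams never contain an OmegaConf Container. The variable name cfg follows this example; how the result is passed into the model is left out, since that depends on the model's constructor:

from omegaconf import OmegaConf

# DictConfig -> nested dicts/lists, with interpolations resolved.
plain_hparams = OmegaConf.to_container(cfg, resolve=True)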

@anthonytec2
Owner Author

The final issue is with None objects: facebookresearch/hydra#785. I am also experiencing this, and the suggested workaround does not work.

@omry

omry commented Jul 17, 2020

The final issue is with None objects: facebookresearch/hydra#785. I am also experiencing this, and the suggested workaround does not work.

I am going to look at this one today.
The workaround seems fishy to me; in any case I will fix it properly.

@anthonytec2
Owner Author

None bug fixed in example!

pl_examples/hydra_examples/pl_template.py
pl_examples/hydra_examples/user_config.py
pl_examples/hydra_examples/pl_template.py
@anthonytec2
Owner Author

Okay, I removed the double definition of cls after the removal of the cls field in the recent Hydra MR. The only issue left is the parameter saving for TensorBoard, which already has an issue and an MR fix in the works. I am going to send this MR over to the PL team for initial discussion.

@omry

omry commented Jul 18, 2020

MR?

@romesco
Collaborator

romesco commented Jul 18, 2020

MR?

Merge request, I guess?

fix job name template

change to model

create hydra examples folder

fix error with none values

optimizers and lr schedules

clean up model structure

model has data included

dont configure outputs

document hydra example

update readme

rename trainer conf

scheduler example

schedulers update

change out structure for opt and sched

flatten config dirs

reduce number of classes

scheduler and opt configs

spelling

change group

config store location change

import and store

structured conf remaining classes

fix for date

change location of trainer config

fix package name

trainer instantiation

clean up init trainer

type fixes

clean up imports

update readme

add in seed

Update pl_examples/hydra_examples/README.md

Co-authored-by: Omry Yadan <omry@fb.com>

Update pl_examples/hydra_examples/README.md

Co-authored-by: Omry Yadan <omry@fb.com>

change to model

clean up hydra example

data to absolute path

update file name

fix path

isort run

name change

hydra logging

change config dir

use name as logging group

load configs in init py

callout

callbacks

fix callbacks

empty list

example param data

params

example with two other data classes

fix saving params

dataset path correction

comments in trainer conf

logic in user app

better config

clean up arguments

multiprocessing handled by PL settings

cleaner callback list

callback clean up

top level config

wip user config

add in callbacks

fix callbacks in user config

fix group names

name config

fix user config

instantiation without +

change type

split for readability

user config move

master config yaml

hydra from master changes

remove init py

clean up model configuration

add comments

add to readme

function doc

need hydra for instantiate

defaults defined in config yaml

remove to do lines

issue note

remove imports unused

cfg init removal

double define

instantiate changes

change back to full config

Update pl_examples/hydra_examples/pl_template.py

Co-authored-by: Omry Yadan <omry@fb.com>

Revert "double define"

This reverts commit 4a9a962.

fix data configuration

remove bug comment, fixed already

fix callbacks instantiate
This directory consists of an example of configuring Pytorch Lightning with [Hydra](https://hydra.cc/). Hydra is a tool that allows for the easy configuration of complex applications.
The core of this directory is a set of structured configs used for PyTorch Lightning, imported via `from pytorch_lightning.trainer.trainer_conf import PLConfig`. Within the PL config there are 5 configurations: 1) Trainer Configuration, 2) Profiler Configuration, 3) Early Stopping Configuration, 4) Logger Configuration and 5) Checkpoint Configuration. All of these essentially mirror the arguments that make up these objects. These configurations are used to instantiate the objects using Hydra's instantiation utility.

Please capitalize Structured Configs too to be consistent with how it's used in the documentation of Hydra.

@@ -1,13 +1,13 @@
## Hydra Pytorch Lightning Example

This directory consists of an example of configuring Pytorch Lightning with [Hydra](https://hydra.cc/). Hydra is a tool that allows for the easy configuration of complex applications.

Suggested change
This directory consists of an example of configuring Pytorch Lightning with [Hydra](https://hydra.cc/). Hydra is a tool that allows for the easy configuration of complex applications.
This directory consists of an example of configuring Pytorch Lightning with [Hydra](https://hydra.cc/). Hydra is a framework that allows for the easy configuration of complex applications.


Aside from the PyTorch Lightning configuration we have included a few other important configurations. Optimizer and Scheduler are easy off-the-shelf configurations for configuring your optimizer and learning rate scheduler. You can add them to your config defaults list as needed and use them to configure these objects. Additionally, we provide the arch and data configurations for changing model and data hyperparameters.
omry commented Jul 19, 2020

If these configurations are adopted into the core of PL, this should move into the PR description and out of the README.

@rakhimovv

Hi! Thanks for a wonderful example! I ran into a problem: when I try to run python pl_template.py trainer.gpus=4 trainer.distributed_backend=ddp, it fails.

@omry

omry commented Jul 21, 2020

@rakhimovv, it would help if you show how it fails.

@rakhimovv

@omry @anthonytec2

HYDRA_FULL_ERROR=1 python pl_template.py trainer.gpus=4 trainer.distributed_backend=ddp

produces

Starting to init trainer!
GPU available: True, used: True
[2020-07-21 16:50:34,181][lightning][INFO] - GPU available: True, used: True
TPU available: False, using: 0 TPU cores
[2020-07-21 16:50:34,181][lightning][INFO] - TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
[2020-07-21 16:50:34,181][lightning][INFO] - CUDA_VISIBLE_DEVICES: [0,1,2,3]
trainer is init now
Starting to init trainer!
trainer is init now
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
[2020-07-21 16:50:36,017][lightning][INFO] - initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Starting to init trainer!
trainer is init now
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
[2020-07-21 16:50:39,876][lightning][INFO] - initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
Starting to init trainer!
trainer is init now
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
[2020-07-21 16:50:41,958][lightning][INFO] - initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
[2020-07-21 16:50:42,120][lightning][INFO] - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
----------------------------------------------------------------------------------------------------
[2020-07-21 16:50:43,026][lightning][INFO] - ----------------------------------------------------------------------------------------------------
distributed_backend=ddp
[2020-07-21 16:50:43,027][lightning][INFO] - distributed_backend=ddp
All DDP processes registered. Starting ddp with 4 processes
[2020-07-21 16:50:43,027][lightning][INFO] - All DDP processes registered. Starting ddp with 4 processes
----------------------------------------------------------------------------------------------------
[2020-07-21 16:50:43,027][lightning][INFO] - ----------------------------------------------------------------------------------------------------

  | Name      | Type        | Params | In sizes  | Out sizes
------------------------------------------------------------------
0 | c_d1      | Linear      | 785 K  | [2, 784]  | [2, 1000]
1 | c_d1_bn   | BatchNorm1d | 2 K    | [2, 1000] | [2, 1000]
2 | c_d1_drop | Dropout     | 0      | [2, 1000] | [2, 1000]
3 | c_d2      | Linear      | 10 K   | [2, 1000] | [2, 10]  
[2020-07-21 16:50:54,947][lightning][INFO] - 
  | Name      | Type        | Params | In sizes  | Out sizes
------------------------------------------------------------------
0 | c_d1      | Linear      | 785 K  | [2, 784]  | [2, 1000]
1 | c_d1_bn   | BatchNorm1d | 2 K    | [2, 1000] | [2, 1000]
2 | c_d1_drop | Dropout     | 0      | [2, 1000] | [2, 1000]
3 | c_d2      | Linear      | 10 K   | [2, 1000] | [2, 10]  
[2020-07-21 16:50:54,950][pl_examples.models.hydra_config_model][INFO] - Validation data loader called.
[2020-07-21 16:50:54,952][pl_examples.models.hydra_config_model][INFO] - Validation data loader called.
[2020-07-21 16:50:54,952][pl_examples.models.hydra_config_model][INFO] - Validation data loader called.
[2020-07-21 16:50:54,952][pl_examples.models.hydra_config_model][INFO] - Validation data loader called.
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/rakhimov/hydra/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/rakhimov/hydra/hydra/_internal/utils.py", line 336, in <lambda>
    overrides=args.overrides,
  File "/home/rakhimov/hydra/hydra/_internal/hydra.py", line 109, in run
    job_subdir_key=None,
  File "/home/rakhimov/hydra/hydra/core/utils.py", line 123, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/pl_template.py", line 44, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 986, in fit
    self.ddp_train(process_idx=task, q=None, model=model)
  File "/home/rakhimov/hydra/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/distrib_data_parallel.py", line 560, in ddp_train
    results = self.run_pretrain_routine(model)
  File "/home/rakhimov/hydra/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/rakhimov/hydra/hydra/_internal/utils.py", line 336, in <lambda>
    overrides=args.overrides,
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 1188, in run_pretrain_routine
    self._run_sanity_check(ref_model, model)
  File "/home/rakhimov/hydra/hydra/_internal/utils.py", line 336, in <lambda>
    overrides=args.overrides,
  File "/home/rakhimov/hydra/hydra/_internal/hydra.py", line 109, in run
    job_subdir_key=None,
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 1204, in _run_sanity_check
    self.reset_val_dataloader(ref_model)
  File "/home/rakhimov/hydra/hydra/_internal/hydra.py", line 109, in run
    job_subdir_key=None,
  File "/home/rakhimov/hydra/hydra/core/utils.py", line 123, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/data_loading.py", line 343, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/home/rakhimov/hydra/hydra/core/utils.py", line 123, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/pl_template.py", line 44, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/data_loading.py", line 270, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, f'{mode}_dataloader'))
  File "/home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/pl_template.py", line 44, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 986, in fit
    self.ddp_train(process_idx=task, q=None, model=model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/data_loading.py", line 364, in request_dataloader
    dataloader = dataloader_fx()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 986, in fit
    self.ddp_train(process_idx=task, q=None, model=model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/distrib_data_parallel.py", line 560, in ddp_train
    results = self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pl_examples/models/hydra_config_model.py", line 117, in val_dataloader
    return hydra.utils.instantiate(self.data.dl, dataset=self.test_set)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/distrib_data_parallel.py", line 560, in ddp_train
    results = self.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 1188, in run_pretrain_routine
    self._run_sanity_check(ref_model, model)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 1188, in run_pretrain_routine
    self._run_sanity_check(ref_model, model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 1204, in _run_sanity_check
    self.reset_val_dataloader(ref_model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 1204, in _run_sanity_check
    self.reset_val_dataloader(ref_model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/data_loading.py", line 343, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
AttributeError: 'LightningTemplateModel' object has no attribute 'test_set'
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/data_loading.py", line 343, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._reset_eval_dataloader(model, 'val')
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/data_loading.py", line 270, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, f'{mode}_dataloader'))
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/data_loading.py", line 270, in _reset_eval_dataloader
    dataloaders = self.request_dataloader(getattr(model, f'{mode}_dataloader'))
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/data_loading.py", line 364, in request_dataloader
    dataloader = dataloader_fx()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/data_loading.py", line 364, in request_dataloader
    dataloader = dataloader_fx()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pl_examples/models/hydra_config_model.py", line 117, in val_dataloader
    return hydra.utils.instantiate(self.data.dl, dataset=self.test_set)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pl_examples/models/hydra_config_model.py", line 117, in val_dataloader
    return hydra.utils.instantiate(self.data.dl, dataset=self.test_set)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
    type(self).__name__, name))
AttributeError: 'LightningTemplateModel' object has no attribute 'test_set'
AttributeError: 'LightningTemplateModel' object has no attribute 'test_set'

If I understand correctly, the problem is that data split initialization should happen in def setup(self, stage), not in def prepare_data(self). But even if I copy the code from def prepare_data(self) into def setup(self, stage) in hydra_config_model.py, sometimes it works and sometimes it fails with the error attached below. The possible reason, I suppose, is that several processes try to download and write data to the same folder. I assume this is because value interpolation does not work in the ddp regime, since the datasets are saved into experiment_folder/datasets, not into ${hydra:runtime.cwd}/datasets.

Starting to init trainer!
GPU available: True, used: True
[2020-07-21 17:11:52,484][lightning][INFO] - GPU available: True, used: True
TPU available: False, using: 0 TPU cores
[2020-07-21 17:11:52,484][lightning][INFO] - TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0,1,2,3]
[2020-07-21 17:11:52,484][lightning][INFO] - CUDA_VISIBLE_DEVICES: [0,1,2,3]
trainer is init now
Starting to init trainer!
trainer is init now
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
[2020-07-21 17:11:54,322][lightning][INFO] - initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
Starting to init trainer!
trainer is init now
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
[2020-07-21 17:11:58,095][lightning][INFO] - initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
Starting to init trainer!
trainer is init now
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
[2020-07-21 17:12:00,256][lightning][INFO] - initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
[2020-07-21 17:12:00,429][lightning][INFO] - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-images-idx3-ubyte.gz
0it [00:00, ?it/s]Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-images-idx3-ubyte.gz
0it [00:00, ?it/s]Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-images-idx3-ubyte.gz
0it [00:00, ?it/s]----------------------------------------------------------------------------------------------------
[2020-07-21 17:12:01,446][lightning][INFO] - ----------------------------------------------------------------------------------------------------
distributed_backend=ddp
[2020-07-21 17:12:01,446][lightning][INFO] - distributed_backend=ddp
All DDP processes registered. Starting ddp with 4 processes
[2020-07-21 17:12:01,446][lightning][INFO] - All DDP processes registered. Starting ddp with 4 processes
----------------------------------------------------------------------------------------------------
[2020-07-21 17:12:01,446][lightning][INFO] - ----------------------------------------------------------------------------------------------------
9920512it [00:06, 1496351.07it/s]                                                                                                                                                                                                             
 15%|███████████████████████████▉                                                                                                                                                               | 1482752/9912422 [00:06<00:08, 989320.57it/s]Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw
9920512it [00:06, 1499278.68it/s]                                                                                                                                                                                                             
Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw
 22%|████████████████████████████████████████▉                                                                                                                                                 | 2179072/9912422 [00:06<00:05, 1479882.48it/s]Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-labels-idx1-ubyte.gz
 29%|██████████████████████████████████████████████████████▍                                                                                                                                   | 2899968/9912422 [00:07<00:04, 1526329.32it/s]Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-labels-idx1-ubyte.gz
 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊   | 9740288/9912422 [00:10<00:00, 2441871.66it/s]Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-labels-idx1-ubyte.gz
32768it [00:05, 6061.31it/s]                                                                                                                                                                                                                  
Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/t10k-images-idx3-ubyte.gz
32768it [00:05, 6072.13it/s]                                                                                                                                                                                                                  
Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/t10k-images-idx3-ubyte.gz
32768it [00:05, 6082.34it/s]                                                                                                                                                                                                                  
Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/train-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/t10k-images-idx3-ubyte.gz
1654784it [00:05, 279562.79it/s]                                                                                                                                                                                                              
Traceback (most recent call last):
  File "/home/rakhimov/hydra/hydra/utils.py", line 35, in call
    return _instantiate_class(type_or_callable, config, *args, **kwargs)
  File "/home/rakhimov/hydra/hydra/_internal/utils.py", line 478, in _instantiate_class
    return clazz(*args, **final_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 70, in __init__
    self.download()
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/mnist.py", line 137, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 249, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/opt/conda/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 86, in download_url
    raise RuntimeError("File not found or corrupted.")
RuntimeError: File not found or corrupted.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rakhimov/hydra/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/rakhimov/hydra/hydra/_internal/utils.py", line 336, in <lambda>
    overrides=args.overrides,
  File "/home/rakhimov/hydra/hydra/_internal/hydra.py", line 109, in run
    job_subdir_key=None,
  File "/home/rakhimov/hydra/hydra/core/utils.py", line 123, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/pl_template.py", line 44, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 986, in fit
    self.ddp_train(process_idx=task, q=None, model=model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pytorch_lightning/trainer/distrib_data_parallel.py", line 511, in ddp_train
    model.setup('fit')
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning-0.8.6.dev0-py3.7.egg/pl_examples/models/hydra_config_model.py", line 113, in setup
    self.train_set = hydra.utils.instantiate(self.data.ds, transform=transform, train=True)
  File "/home/rakhimov/hydra/hydra/utils.py", line 40, in call
    raise HydraException(f"Error calling '{cls}' : {e}") from e
hydra.errors.HydraException: Error calling 'torchvision.datasets.MNIST' : File not found or corrupted.
1654784it [00:06, 273807.53it/s]                                                                                                                                                                                                              
Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/t10k-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz
0it [00:00, ?it/s]                                                                                                                                                                                                                           Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/t10k-images-idx3-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz
8192it [00:05, 1553.94it/s]                                                                                                                                                                                                                   
Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/s]
Processing...
/pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
Done!
8192it [00:05, 1551.38it/s]                                                                                                                                                                                                                   
Extracting /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/rakhimov/pytorch-lightning/pl_examples/hydra_examples/outputs/2020-07-21/17-11-52/datasets/MNIST/raw                                                                                                                                                                                                      | 0/4542 [00:05<?, ?it/s]
Processing...
/pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
Done!
9920512it [00:30, 2441871.66it/s]

p.s.

python pl_template.py trainer.gpus=4

works fine, but it uses the ddp_spawn regime, not ddp.

@omry

omry commented Jul 21, 2020

Thanks for reporting.
At the moment the ownership of this example and the level of support PL has for Hydra are still being determined.
This kind of problem should be handled by PL, in my opinion, and not as a part of the example.

You can follow along here; feel free to point out the issue you encountered there for visibility.

@anthonytec2
Owner Author

@rakhimovv I have noticed the above issue before, and as you mentioned, changing to use setup is important. Additionally, I would just hard-code the data path for now until we work through a fix that properly sorts all this out.
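
A hypothetical sketch of that "hard-code the data path" suggestion: point every process at one absolute dataset directory so the re-spawned ddp workers do not download MNIST into their own Hydra run directories. The path and dataset call below are illustrative, not taken from the example:

from torchvision import transforms
from torchvision.datasets import MNIST

DATA_ROOT = "/tmp/mnist_data"  # absolute path shared by all ddp processes

train_set = MNIST(DATA_ROOT, train=True, download=True,
                  transform=transforms.ToTensor())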

@tkornuta-nvidia

tkornuta-nvidia commented Jul 24, 2020

@anthonytec2 so what is the desired solution?

We faced the same issue, and in the end we overrode the hydra.main decorator with one that actually enforces hydra.run.dir=....

@omry

omry commented Jul 25, 2020

@anthonytec2 so what is the desired solution?

We faced the same issue, and in the end we overrode the hydra.main decorator with one that actually enforces hydra.run.dir=....

Terrible hack :)

If you are using PL directly, you should wait for a fix there, because it is the one spawning the process.
If you are spawning it directly, you can set its cwd to the Hydra original working directory.
Please join the Hydra chat; we can chat about it.
