
Accumulate gradient #507

Closed
wmmxk opened this issue Nov 7, 2019 · 14 comments
Labels
bug Something isn't working

Comments

@wmmxk

wmmxk commented Nov 7, 2019

I was trying to use the gradient accumulation feature but ran into an error. Training works fine without OptimizerCallback(accumulation_steps=2).

runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    callbacks=[DiceCallback(), EarlyStoppingCallback(patience=5, min_delta=0.001), 
                            OptimizerCallback(accumulation_steps=2)],
    logdir=logdir,
    num_epochs=num_epochs,
    verbose=True
)

FYI, the error message:

0/60 * Epoch (train): 0% 0/624 [00:00<?, ?it/s]

TypeError Traceback (most recent call last)
in
9 logdir=logdir,
10 num_epochs=num_epochs,
---> 11 verbose=True
12 )

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/runner/supervised.py in train(self, model, criterion, optimizer, loaders, logdir, callbacks, scheduler, resume, num_epochs, valid_loader, main_metric, minimize_metric, verbose, state_kwargs, checkpoint_data, fp16, monitoring_params, check)
195 monitoring_params=monitoring_params
196 )
--> 197 self.run_experiment(experiment, check=check)
198
199 def infer(

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in run_experiment(self, experiment, check)
229 except (Exception, KeyboardInterrupt) as ex:
230 self.state.exception = ex
--> 231 self._run_event("exception")
232
233 return self

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_event(self, event)
100
101 if self.state is not None and hasattr(self.state, f"on_{event}_post"):
--> 102 getattr(self.state, f"on_{event}_post")()
103
104 @AbstractMethod

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/state.py in on_exception_post(self)
183 def on_exception_post(self):
184 for logger in self.loggers.values():
--> 185 logger.on_exception(self)
186
187

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/callbacks/logging.py in on_exception(self, state)
194
195 if state.need_reraise_exception:
--> 196 raise exception
197
198

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in run_experiment(self, experiment, check)
226 try:
227 for stage in self.experiment.stages:
--> 228 self._run_stage(stage)
229 except (Exception, KeyboardInterrupt) as ex:
230 self.state.exception = ex

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_stage(self, stage)
199
200 self._run_event("epoch_start")
--> 201 self._run_epoch(loaders)
202 self._run_event("epoch_end")
203

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_epoch(self, loaders)
186 self._run_event("loader_start")
187 with torch.set_grad_enabled(self.state.need_backward):
--> 188 self._run_loader(loader)
189 self._run_event("loader_end")
190

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_loader(self, loader)
148
149 for i, batch in enumerate(loader):
--> 150 self._run_batch(batch)
151
152 self.state.timer.reset()

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_batch(self, batch)
130 self.state.timer.stop("_timers/model_time")
131 self.state.timer.stop("_timers/batch_time")
--> 132 self._run_event("batch_end")
133
134 def _run_loader(self, loader):

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_event(self, event)
97 if self.callbacks is not None:
98 for callback in self.callbacks.values():
---> 99 getattr(callback, f"on_{event}")(self.state)
100
101 if self.state is not None and hasattr(self.state, f"on_{event}_post"):

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/callbacks/optimizer.py in on_batch_end(self, state)
117 return
118
--> 119 loss = self._get_loss(state)
120
121 self._accumulation_counter += 1

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/callbacks/optimizer.py in _get_loss(self, state)
91
92 def _get_loss(self, state) -> torch.Tensor:
---> 93 loss = state.get_key(key="loss", inner_key=self.loss_key)
94
95 if isinstance(loss, list):

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/state.py in get_key(self, key, inner_key)
114 return getattr(self, key)
115 else:
--> 116 return getattr(self, key)[inner_key]
117
118 def set_key(self, value, key, inner_key=None):

TypeError: 'NoneType' object is not subscriptable

@wmmxk wmmxk added the bug Something isn't working label Nov 7, 2019
@TezRomacH
Contributor

Hi! Is there an error if you set accumulation_steps to 1?
OptimizerCallback(accumulation_steps=1)

@Yorko
Contributor

Yorko commented Nov 10, 2019

I had the same issue. I'm not sure what exactly fixed it, but try running the same code with catalyst 19.11.1.

@wmmxk
Author

wmmxk commented Nov 10, 2019

Hi! Thanks for your reply.

When I set the step to 1 (OptimizerCallback(accumulation_steps=1)), I run into exactly the same error.

@TezRomacH
Contributor

Okay, this is interesting. What version of Catalyst do you use?

@wmmxk
Author

wmmxk commented Nov 11, 2019

I am using version 19.10.2, installed via pip install catalyst.

@wmmxk
Author

wmmxk commented Nov 11, 2019

Did this work when you tested the module?

@wmmxk
Author

wmmxk commented Nov 11, 2019

I looked into the line of code causing the error, getattr(self, key)[inner_key]. The key is 'loss', but self.loss is None (see the code above).

Hope this information is helpful. If you want me to check anything else, please feel free to let me know.
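The failure mode described above can be reproduced in isolation with a minimal stand-in for state.get_key (a sketch, not Catalyst's real State class; the inner_key value "main" is hypothetical, chosen only for illustration):

```python
class StateSketch:
    """Minimal stand-in for the relevant part of Catalyst's State (not the real class)."""
    def __init__(self):
        self.loss = None  # no criterion callback has set the loss yet

    def get_key(self, key, inner_key=None):
        if inner_key is None:
            return getattr(self, key)
        # When self.loss is still None, None[inner_key] raises TypeError
        return getattr(self, key)[inner_key]

state = StateSketch()
try:
    state.get_key(key="loss", inner_key="main")
    error_message = None
except TypeError as exc:
    error_message = str(exc)  # "'NoneType' object is not subscriptable"
```

This reproduces exactly the TypeError in the traceback above: the optimizer callback asks for the loss before anything has stored one.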

@wmmxk
Author

wmmxk commented Nov 11, 2019

Hi! I think I found the reason. When I pass an OptimizerCallback explicitly instead of relying on the default one, the OptimizerCallback ends up first among all callbacks. I checked the line for callback in self.callbacks.values(): because the OptimizerCallback runs first, runner.loss is still None at that point.

I will keep digging into why OptimizerCallback ends up first.
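The ordering problem can be sketched with a toy version of order-based callback sorting. The order values 40 (Optimizer) and 60 (Scheduler) are the ones quoted in this thread; the metric callbacks' values are hypothetical, chosen only to illustrate why OptimizerCallback can sort to the front:

```python
# Toy sketch of order-based callback sorting (not Catalyst's actual code).
CALLBACK_ORDER = {
    "OptimizerCallback": 40,       # value from this thread
    "SchedulerCallback": 60,       # value from this thread
    "DiceCallback": 100,           # hypothetical
    "EarlyStoppingCallback": 100,  # hypothetical
}

user_callbacks = ["DiceCallback", "EarlyStoppingCallback", "OptimizerCallback"]
execution_order = sorted(user_callbacks, key=CALLBACK_ORDER.get)
# OptimizerCallback sorts to the front, so on batch_end it runs
# before anything has set state.loss
```

With no explicit criterion callback in the list, nothing with a lower order value runs first to populate the loss.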

@wmmxk
Author

wmmxk commented Nov 11, 2019

I found the reason. OptimizerCallback has a low order value, as you can see in Optimizer = 40, Scheduler = 60, so the callbacks end up in an unexpected order after sorting.

@wmmxk
Author

wmmxk commented Nov 11, 2019

Maybe this is a bug. When I set self.accumulation_steps = 2, then because of self._accumulation_counter += 1, the statement

if (self._accumulation_counter + 1) % self.accumulation_steps == 0:

is always true.
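For reference, here is a plain-Python sketch of how gradient accumulation stepping is usually counted (not Catalyst's implementation): the optimizer should step once every accumulation_steps batches, with the counter reset after each step.

```python
# Sketch of gradient accumulation counting over 6 batches with steps=2.
accumulation_steps = 2
counter = 0
step_batches = []  # batch indices at which optimizer.step() would run

for batch_idx in range(6):
    # loss.backward() would accumulate gradients here
    counter += 1
    if counter % accumulation_steps == 0:
        # optimizer.step() and optimizer.zero_grad() would run here
        step_batches.append(batch_idx)
        counter = 0
```

With this counting, the step fires on every second batch (indices 1, 3, 5) rather than on every batch.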

@bamps53

bamps53 commented Nov 12, 2019

Me too, but when I tried to reproduce this issue with the segmentation tutorial, I couldn't. So there might be something that doesn't work together with OptimizerCallback? I just added OptimizerCallback(accumulation_steps=2):

from catalyst.dl.callbacks import DiceCallback, IouCallback, \
  CriterionCallback, CriterionAggregatorCallback, OptimizerCallback

runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    
    # our dataloaders
    loaders=loaders,
    
    callbacks=[
        # Each criterion is calculated separately.
        CriterionCallback(
            input_key="mask",
            prefix="loss_dice",
            criterion_key="dice"
        ),
        CriterionCallback(
            input_key="mask",
            prefix="loss_iou",
            criterion_key="iou"
        ),
        CriterionCallback(
            input_key="mask",
            prefix="loss_bce",
            criterion_key="bce",
            multiplier=0.8
        ),
        
        # And only then we aggregate everything into one loss.
        CriterionAggregatorCallback(
            prefix="loss",
            loss_keys=["loss_dice", "loss_iou", "loss_bce"],
            loss_aggregate_fn="sum" # or "mean"
        ),
        
        # metrics
        DiceCallback(input_key="mask"),
        IouCallback(input_key="mask"),
        OptimizerCallback(accumulation_steps=2)
    ],
    # path to save logs
    logdir=logdir,
    
    num_epochs=num_epochs,
    
    # save our best checkpoint by IoU metric
    main_metric="iou",
    # IoU needs to be maximized.
    minimize_metric=False,
    
    # for FP16. It uses the variable from the very first cell
    #fp16=fp16_params,
    
    # prints train logs
    verbose=True
)

@IliaLarchenko

@wmmxk try adding CriterionCallback() to the callbacks.
I'm not sure, but it seems that you need to use CriterionCallback explicitly whenever you use OptimizerCallback explicitly.
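The suggestion above can be illustrated with a self-contained sketch (stub classes standing in for Catalyst's callbacks, not the real API): when a loss-producing callback runs before the optimizer callback, state.loss is populated in time.

```python
class State:
    """Stand-in for the runner state shared between callbacks."""
    def __init__(self):
        self.loss = None

class CriterionStub:
    """Stand-in for a criterion callback: computes and stores the loss."""
    def on_batch_end(self, state):
        state.loss = 0.5  # dummy loss value for illustration

class OptimizerStub:
    """Stand-in for an optimizer callback: reads the loss set earlier."""
    def on_batch_end(self, state):
        if state.loss is None:
            raise RuntimeError("loss was never set before the optimizer ran")
        # backward pass and optimizer.step() would happen here

state = State()
for callback in [CriterionStub(), OptimizerStub()]:  # criterion first
    callback.on_batch_end(state)
```

Reversing the list order reproduces the failure discussed in this issue: the optimizer stub sees state.loss still None.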

@Scitator
Member

Could you reproduce the issue with version 20.03.3?

@Scitator Scitator reopened this Mar 24, 2020
@Scitator
Member

should be already fixed :)
