
Accumulate gradient #507

Closed
wmmxk opened this issue Nov 7, 2019 · 14 comments
Labels
bug Something isn't working

Comments

@wmmxk

wmmxk commented Nov 7, 2019

I was trying to use the gradient accumulation feature but ran into an error. Training works fine without OptimizerCallback(accumulation_steps=2).

runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    callbacks=[DiceCallback(), EarlyStoppingCallback(patience=5, min_delta=0.001), 
                            OptimizerCallback(accumulation_steps=2)],
    logdir=logdir,
    num_epochs=num_epochs,
    verbose=True
)

FYI, the error message:

0/60 * Epoch (train): 0% 0/624 [00:00<?, ?it/s]

TypeError Traceback (most recent call last)
in
9 logdir=logdir,
10 num_epochs=num_epochs,
---> 11 verbose=True
12 )

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/runner/supervised.py in train(self, model, criterion, optimizer, loaders, logdir, callbacks, scheduler, resume, num_epochs, valid_loader, main_metric, minimize_metric, verbose, state_kwargs, checkpoint_data, fp16, monitoring_params, check)
195 monitoring_params=monitoring_params
196 )
--> 197 self.run_experiment(experiment, check=check)
198
199 def infer(

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in run_experiment(self, experiment, check)
229 except (Exception, KeyboardInterrupt) as ex:
230 self.state.exception = ex
--> 231 self._run_event("exception")
232
233 return self

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_event(self, event)
100
101 if self.state is not None and hasattr(self.state, f"on_{event}_post"):
--> 102 getattr(self.state, f"on_{event}_post")()
103
104 @AbstractMethod

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/state.py in on_exception_post(self)
183 def on_exception_post(self):
184 for logger in self.loggers.values():
--> 185 logger.on_exception(self)
186
187

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/callbacks/logging.py in on_exception(self, state)
194
195 if state.need_reraise_exception:
--> 196 raise exception
197
198

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in run_experiment(self, experiment, check)
226 try:
227 for stage in self.experiment.stages:
--> 228 self._run_stage(stage)
229 except (Exception, KeyboardInterrupt) as ex:
230 self.state.exception = ex

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_stage(self, stage)
199
200 self._run_event("epoch_start")
--> 201 self._run_epoch(loaders)
202 self._run_event("epoch_end")
203

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_epoch(self, loaders)
186 self._run_event("loader_start")
187 with torch.set_grad_enabled(self.state.need_backward):
--> 188 self._run_loader(loader)
189 self._run_event("loader_end")
190

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_loader(self, loader)
148
149 for i, batch in enumerate(loader):
--> 150 self._run_batch(batch)
151
152 self.state.timer.reset()

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_batch(self, batch)
130 self.state.timer.stop("_timers/model_time")
131 self.state.timer.stop("_timers/batch_time")
--> 132 self._run_event("batch_end")
133
134 def _run_loader(self, loader):

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/runner.py in _run_event(self, event)
97 if self.callbacks is not None:
98 for callback in self.callbacks.values():
---> 99 getattr(callback, f"on_{event}")(self.state)
100
101 if self.state is not None and hasattr(self.state, f"on_{event}_post"):

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/callbacks/optimizer.py in on_batch_end(self, state)
117 return
118
--> 119 loss = self._get_loss(state)
120
121 self._accumulation_counter += 1

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/callbacks/optimizer.py in _get_loss(self, state)
91
92 def _get_loss(self, state) -> torch.Tensor:
---> 93 loss = state.get_key(key="loss", inner_key=self.loss_key)
94
95 if isinstance(loss, list):

~/.conda/envs/mmdet_cloud/lib/python3.6/site-packages/catalyst/dl/core/state.py in get_key(self, key, inner_key)
114 return getattr(self, key)
115 else:
--> 116 return getattr(self, key)[inner_key]
117
118 def set_key(self, value, key, inner_key=None):

TypeError: 'NoneType' object is not subscriptable

@wmmxk wmmxk added the bug Something isn't working label Nov 7, 2019
@TezRomacH
Contributor

Hi! Is there an error if you set accumulation_steps to 1?
OptimizerCallback(accumulation_steps=1)

@Yorko
Contributor

Yorko commented Nov 10, 2019

I had the same issue. I'm not sure what exactly fixed it, but try running the same code with catalyst 19.11.1.

@wmmxk
Author

wmmxk commented Nov 10, 2019

Hi! Thanks for your reply.

When I set the step to 1 (OptimizerCallback(accumulation_steps=1)), I run into exactly the same error.

@TezRomacH
Contributor

Okay, this is interesting. What version of Catalyst do you use?

@wmmxk
Author

wmmxk commented Nov 11, 2019

I am using version 19.10.2, installed via pip install catalyst.

@wmmxk
Author

wmmxk commented Nov 11, 2019

Did this work when you tested the module?

@wmmxk
Author

wmmxk commented Nov 11, 2019

I looked into the line of code causing the error, getattr(self, key)[inner_key]. The key is 'loss', but self.loss is None (see the code above).

Hope this information is helpful. If you want me to check anything else, please feel free to let me know.
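The failure mode described above can be reproduced in isolation with a minimal stand-in for state.get_key (a sketch, not Catalyst's real State class; the inner_key value "main" is hypothetical, chosen only for illustration):

```python
class StateSketch:
    """Minimal stand-in for the relevant part of Catalyst's State (not the real class)."""
    def __init__(self):
        self.loss = None  # no criterion callback has set the loss yet

    def get_key(self, key, inner_key=None):
        if inner_key is None:
            return getattr(self, key)
        # When self.loss is still None, None[inner_key] raises TypeError
        return getattr(self, key)[inner_key]

state = StateSketch()
try:
    state.get_key(key="loss", inner_key="main")
    error_message = None
except TypeError as exc:
    error_message = str(exc)  # "'NoneType' object is not subscriptable"
```

This reproduces exactly the TypeError in the traceback above: the optimizer callback asks for the loss before anything has stored one.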

@wmmxk
Author

wmmxk commented Nov 11, 2019

Hi! I think I found the reason. When I pass an OptimizerCallback explicitly instead of relying on the default one, the OptimizerCallback ends up first among all callbacks. I checked the line for callback in self.callbacks.values(): because the OptimizerCallback runs first, runner.loss is still None at that point.

I will keep digging into why OptimizerCallback ends up first.
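The ordering problem can be sketched with a toy version of order-based callback sorting. The order values 40 (Optimizer) and 60 (Scheduler) are the ones quoted in this thread; the metric callbacks' values are hypothetical, chosen only to illustrate why OptimizerCallback can sort to the front:

```python
# Toy sketch of order-based callback sorting (not Catalyst's actual code).
CALLBACK_ORDER = {
    "OptimizerCallback": 40,       # value from this thread
    "SchedulerCallback": 60,       # value from this thread
    "DiceCallback": 100,           # hypothetical
    "EarlyStoppingCallback": 100,  # hypothetical
}

user_callbacks = ["DiceCallback", "EarlyStoppingCallback", "OptimizerCallback"]
execution_order = sorted(user_callbacks, key=CALLBACK_ORDER.get)
# OptimizerCallback sorts to the front, so on batch_end it runs
# before anything has set state.loss
```

With no explicit criterion callback in the list, nothing with a lower order value runs first to populate the loss.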

@wmmxk
Author

wmmxk commented Nov 11, 2019

I found the reason. OptimizerCallback has a low order value, as you can see in Optimizer = 40, Scheduler = 60, so the callbacks end up in an unexpected order after sorting.

@wmmxk
Author

wmmxk commented Nov 11, 2019

Maybe this is a bug. When I set self.accumulation_steps = 2, then because of self._accumulation_counter += 1, the statement

if (self._accumulation_counter + 1) % self.accumulation_steps == 0:

is always true.
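For reference, here is a plain-Python sketch of how gradient accumulation stepping is usually counted (not Catalyst's implementation): the optimizer should step once every accumulation_steps batches, with the counter reset after each step.

```python
# Sketch of gradient accumulation counting over 6 batches with steps=2.
accumulation_steps = 2
counter = 0
step_batches = []  # batch indices at which optimizer.step() would run

for batch_idx in range(6):
    # loss.backward() would accumulate gradients here
    counter += 1
    if counter % accumulation_steps == 0:
        # optimizer.step() and optimizer.zero_grad() would run here
        step_batches.append(batch_idx)
        counter = 0
```

With this counting, the step fires on every second batch (indices 1, 3, 5) rather than on every batch.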

@bamps53

bamps53 commented Nov 12, 2019

Me too, but when I tried to reproduce this issue with the segmentation tutorial, I couldn't. So there might be something that doesn't work together with OptimizerCallback? I just added OptimizerCallback(accumulation_steps=2):

from catalyst.dl.callbacks import DiceCallback, IouCallback, \
  CriterionCallback, CriterionAggregatorCallback, OptimizerCallback

runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    
    # our dataloaders
    loaders=loaders,
    
    callbacks=[
        # Each criterion is calculated separately.
        CriterionCallback(
            input_key="mask",
            prefix="loss_dice",
            criterion_key="dice"
        ),
        CriterionCallback(
            input_key="mask",
            prefix="loss_iou",
            criterion_key="iou"
        ),
        CriterionCallback(
            input_key="mask",
            prefix="loss_bce",
            criterion_key="bce",
            multiplier=0.8
        ),
        
        # And only then we aggregate everything into one loss.
        CriterionAggregatorCallback(
            prefix="loss",
            loss_keys=["loss_dice", "loss_iou", "loss_bce"],
            loss_aggregate_fn="sum" # or "mean"
        ),
        
        # metrics
        DiceCallback(input_key="mask"),
        IouCallback(input_key="mask"),
        OptimizerCallback(accumulation_steps=2)
    ],
    # path to save logs
    logdir=logdir,
    
    num_epochs=num_epochs,
    
    # save our best checkpoint by IoU metric
    main_metric="iou",
    # IoU needs to be maximized.
    minimize_metric=False,
    
    # for FP16. It uses the variable from the very first cell
    #fp16=fp16_params,
    
    # prints train logs
    verbose=True
)

@IliaLarchenko

@wmmxk try adding CriterionCallback() to the callbacks.
I'm not sure, but it seems that you need to use CriterionCallback explicitly whenever you use OptimizerCallback explicitly.
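The suggestion above can be illustrated with a self-contained sketch (stub classes standing in for Catalyst's callbacks, not the real API): when a loss-producing callback runs before the optimizer callback, state.loss is populated in time.

```python
class State:
    """Stand-in for the runner state shared between callbacks."""
    def __init__(self):
        self.loss = None

class CriterionStub:
    """Stand-in for a criterion callback: computes and stores the loss."""
    def on_batch_end(self, state):
        state.loss = 0.5  # dummy loss value for illustration

class OptimizerStub:
    """Stand-in for an optimizer callback: reads the loss set earlier."""
    def on_batch_end(self, state):
        if state.loss is None:
            raise RuntimeError("loss was never set before the optimizer ran")
        # backward pass and optimizer.step() would happen here

state = State()
for callback in [CriterionStub(), OptimizerStub()]:  # criterion first
    callback.on_batch_end(state)
```

Reversing the list order reproduces the failure discussed in this issue: the optimizer stub sees state.loss still None.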

@Scitator
Member

Could you reproduce the issue with version 20.03.3?

@Scitator Scitator reopened this Mar 24, 2020
@Scitator
Member

should be already fixed :)
