Big error saving checkpoint #76

Closed
dillfrescott opened this issue Apr 23, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@dillfrescott

138.519   Total estimated model params size (MB)
Epoch 105:  25%|███████████▊                                   | 5/20 [00:00<00:02,  6.81it/s, loss=0.0487, v_num=owzc]C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\plugins\io\torch_io.py:61: UserWarning: Warning, `hyper_parameters` dropped from checkpoint. An attribute is not picklable: Can't pickle local object 'EvaluationLoop.advance.<locals>.batch_to_device'
  rank_zero_warn(f"Warning, `{key}` dropped from checkpoint. An attribute is not picklable: {err}")
Traceback (most recent call last):
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\plugins\io\torch_io.py", line 54, in save_checkpoint
    _atomic_save(checkpoint, path)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\utilities\cloud_io.py", line 67, in _atomic_save
    torch.save(checkpoint, bytesbuffer)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 653, in _save
    pickler.dump(obj)
AttributeError: Can't pickle local object 'EvaluationLoop.advance.<locals>.batch_to_device'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 668, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\micro\Downloads\fish-diffusion\tools\diffusion\train.py", line 98, in <module>
    trainer.fit(model, train_loader, valid_loader, ckpt_path=args.resume)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1112, in _run
    results = self._run_stage()
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1191, in _run_stage
    self._run_train()
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 229, in advance
    self.trainer._call_callback_hooks("on_train_batch_end", batch_end_outputs, batch, batch_idx)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1394, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 296, in on_train_batch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 363, in _save_topk_checkpoint
    self._save_none_monitor_checkpoint(trainer, monitor_candidates)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 669, in _save_none_monitor_checkpoint
    self._save_checkpoint(trainer, filepath)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 366, in _save_checkpoint
    trainer.save_checkpoint(filepath, self.save_weights_only)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1939, in save_checkpoint
    self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\trainer\connectors\checkpoint_connector.py", line 511, in save_checkpoint
    self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 466, in save_checkpoint
    self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\plugins\io\torch_io.py", line 62, in save_checkpoint
    _atomic_save(checkpoint, path)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\lightning_fabric\utilities\cloud_io.py", line 67, in _atomic_save
    torch.save(checkpoint, bytesbuffer)
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 440, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "C:\Users\micro\miniconda3\envs\fish\lib\site-packages\torch\serialization.py", line 305, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at ..\caffe2\serialize\inline_container.cc:337] . unexpected pos 645876672 vs 645876560
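For context, the failing call is Lightning's `_atomic_save` (see `cloud_io.py` in the traceback), which serializes the entire checkpoint into an in-memory buffer before writing it to disk. A rough sketch of that path, assuming the size of the checkpoint dict is what exhausts host RAM:

```python
import io
import torch

def atomic_save_sketch(checkpoint: dict, path: str) -> None:
    # Rough equivalent of lightning_fabric.utilities.cloud_io._atomic_save:
    # the whole checkpoint is serialized into an in-memory buffer first, so a
    # large checkpoint can raise MemoryError here even when the disk has
    # plenty of free space. The follow-up "unexpected pos" RuntimeError comes
    # from closing the zip writer after serialization was cut short.
    buffer = io.BytesIO()
    torch.save(checkpoint, buffer)
    with open(path, "wb") as f:
        f.write(buffer.getvalue())
```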
@Majboor

Majboor commented Apr 23, 2023

lol, this is the classic "buy a better PC" error. Either upgrade your machine or use a cloud GPU. You could also look into Edge TPU stuff; it's pretty niche, but it doesn't require a good PC.

@dillfrescott
Author

I have a 4090, what are you talking about?

@Majboor

Majboor commented Apr 23, 2023

'MemoryError' means you are trying to use more memory than you have; you may not have enough RAM. Try loading the data in small chunks instead of all at once.
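A minimal sketch of that suggestion, assuming a standard PyTorch `DataLoader` setup (the dataset and parameter values here are hypothetical, not taken from fish-diffusion):

```python
from torch.utils.data import DataLoader

# Hypothetical loader settings: smaller batches and fewer worker processes
# keep less data resident in host RAM at any one time.
train_loader = DataLoader(
    train_dataset,     # assumed to be defined elsewhere in the training script
    batch_size=4,      # smaller batches instead of loading large chunks at once
    num_workers=2,     # each worker holds its own copies of prefetched batches
    pin_memory=False,  # pinned buffers add to host-memory pressure
)
```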

@leng-yue
Member

leng-yue commented Apr 24, 2023

It appears that there is an issue with your disk (or torch) which caused the saving process to fail. Unfortunately, we are unable to resolve this issue from our end.
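One possible mitigation, offered as a sketch rather than an official fix: a weights-only checkpoint is much smaller, which reduces the RAM that `torch.save` needs inside `_atomic_save`. Assuming the training script uses Lightning's standard `ModelCheckpoint` callback:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical callback configuration: dropping optimizer/loop state from the
# checkpoint shrinks what torch.save has to buffer in memory.
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints",   # hypothetical output directory
    save_weights_only=True,  # skip optimizer and loop state
)
```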

@leng-yue leng-yue added the bug Something isn't working label Apr 24, 2023
@dillfrescott
Author

Ah, okay
