This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
In Caffe, one can save the model weights at any time, for example at iteration 32768: just press 'Ctrl + C', the training process stops, and the model weights are saved automatically.
In MXNet, however, I can't find a comparable way to save weights at an arbitrary point. It seems that MXNet can only save weights at whole-epoch boundaries, such as epoch 1, 2, 3, ...
This is a problem because, on large-scale datasets, a single epoch can take several days to train. So if something goes wrong, a lot of time is wasted.
Are there any solutions for this? Please describe them in detail.
Thanks.
Try adding the following code after the batch_end_callback call in 'base_module.py'. It fetches the current parameters and fires the epoch-end callbacks once per batch, so a checkpoint callback can run mid-epoch:

    # Grab the current parameters and invoke the epoch-end callbacks
    # after every batch, passing the batch index instead of the epoch.
    arg_params, aux_params = self.get_params()
    if epoch_end_callback is not None:
        for callback in _as_list(epoch_end_callback):
            callback(nbatch, self.symbol, arg_params, aux_params)
When defining 'mx.callback.do_checkpoint', you can pass 'period' to control how many iterations elapse between saved checkpoints.
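The period logic above can be sketched in plain Python. This is a minimal stand-in, not MXNet's actual implementation: `save_checkpoint` below is a placeholder for whatever actually writes the weights (in MXNet that would be 'mx.model.save_checkpoint'), and with the patch above the callback is invoked once per batch, so 'period' counts batches rather than epochs.

```python
def make_periodic_checkpointer(save_checkpoint, period):
    """Return a callback that saves every `period` invocations.

    `save_checkpoint` is a stand-in for the real weight-saving
    routine; `period` mirrors the `period` argument of
    mx.callback.do_checkpoint.
    """
    def _callback(nbatch, params):
        # Batches are counted from 0, so saves happen at batch
        # period-1, 2*period-1, ... (i.e. every `period` batches).
        if (nbatch + 1) % period == 0:
            save_checkpoint(nbatch, params)
    return _callback

saved = []
cb = make_periodic_checkpointer(lambda n, p: saved.append(n), period=4)
for nbatch in range(10):
    cb(nbatch, params={})
# saved == [3, 7]
```

With the one-batch patch to base_module.py applied, passing such a callback as epoch_end_callback would give checkpoints every `period` batches instead of every `period` epochs.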
@solin319 Thank you, I will give it a try. Do you also know of a specific way to just press 'Ctrl + C' so that the training process stops and the model weights are saved automatically?
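The Ctrl + C behavior asked about here can be approximated in plain Python with a SIGINT handler that defers the save to a safe point in the training loop. This is a hedged sketch, not an MXNet feature: `save` is a placeholder for the real weight-saving routine (e.g. 'mx.model.save_checkpoint'), and the loop body stands in for the forward/backward/update step.

```python
import signal

class GracefulSaver:
    """Catch Ctrl + C (SIGINT) and save weights at the next safe point.

    Saving directly inside the signal handler would risk writing
    half-updated weights mid-batch, so the handler only sets a flag.
    """
    def __init__(self, save):
        self.save = save          # stand-in for the real checkpoint writer
        self.stop = False
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self.stop = True          # defer the save to the loop boundary

    def maybe_stop(self, nbatch, params):
        # Called once per batch, after the update has completed.
        if self.stop:
            self.save(nbatch, params)
        return self.stop

saved = []
saver = GracefulSaver(lambda n, p: saved.append(n))
for nbatch in range(100):
    # ... forward/backward/update would run here ...
    if nbatch == 5:
        signal.raise_signal(signal.SIGINT)  # simulate pressing Ctrl + C
    if saver.maybe_stop(nbatch, params={}):
        break

signal.signal(signal.SIGINT, signal.SIG_DFL)  # restore default handler
# saved == [5]
```

In a real training script the simulated `raise_signal` call would of course be absent; pressing Ctrl + C in the terminal delivers the same SIGINT, and the loop saves once and exits cleanly at the end of the current batch.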
Hi @Yochengliu, I would like to follow up on this topic and find someone who can possibly help you. Have you found any way to meet your needs? @nswamy, can you label this as a 'feature request'?