Memory out on SM-MNIST #8

fanshuhuangjia · 2020-12-17T06:19:35Z

I trained the model easily by following your instructions, but I got "OSError: [Errno 12] Cannot allocate memory" at iteration 319999/1100000. I tried setting n_worker=0 and pin_memory=False, but it didn't help. How much CPU memory do I need to train the model on SM-MNIST? (My machine has 80 GB of CPU memory.)

White-Link · 2020-12-18T20:57:54Z

Hi, thank you for your interest in our work!

We are currently investigating the issue and trying to reproduce your error. In the meantime and to help us understand your problem, could you please provide additional details such as the exact command that you executed to launch training, the line at which the error occurs (if available) or any other relevant information on your experimental setup (such as package versions different than those given in the requirements.txt file)?

fanshuhuangjia · 2020-12-19T01:31:59Z

Hi,

thank you for the answer.

My environment matches the requirements file except for torch: I use torch==1.5.0 to be compatible with CUDA 10.2.
The command I use:
OMP_NUM_THREADS=4 python -m torch.distributed.launch --nproc_per_node=2 train.py --device 0 1 --apex_amp --ny 20 --nz 20 --beta_z 2 --nt_cond 5 --nt_inf 5 --dataset smmnist --nc 1 --seq_len 15 --data_dir data --save_path logs

I used two GeForce RTX 2080 Ti GPUs to train the model.
The error is shown as follows:

So I set n_workers=0 and pin_memory=False, but training still used about 53 GB of CPU memory at 46% (501884/1100000).
Thank you!

White-Link · 2020-12-21T20:58:12Z

Unfortunately, we were not able to reproduce this error: on our setup, the program runs on one GPU without exceeding 30 GB of RAM for hundreds of thousands of iterations. However, here are some possible workarounds that you might want to try.

1. If you cannot use PyTorch 1.4.0 for CUDA-related reasons, you should probably use version 1.5.1 instead of 1.5.0, since 1.5.1 fixes many bugs of 1.5.0.
2. Our model for MNIST should be trainable on your hardware with only one GPU. Reducing the number of GPUs significantly reduces memory usage, so this might help you complete the training.
3. Validation steps in our code use additional memory; you can space them out using the --val_interval option to reduce memory consumption.
4. Besides configuration-dependent matters, this issue could originate from the way that PyTorch handles data loading together with Python multiprocessing, as described in this PyTorch issue. We just pushed a new version of our code that uses this workaround, which may help to solve the issue.
5. If the above solution does not work, it could be difficult for us to solve the issue without substantially changing our code architecture, which we would like to preserve for reproducibility purposes. However, you could try other workarounds from this issue, such as implementing shared memory as suggested in this message. In our case, this should be applied to the self.data attribute, since it contains the sequence of MNIST digits that is used in every data loading process.
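As a rough illustration of the shared-memory idea, here is a minimal sketch that uses only Python's standard `multiprocessing.shared_memory` module (Python 3.8+). The class name `SharedDigits` and its layout (fixed-size images packed back to back) are assumptions made for this example and are not taken from the repository; with PyTorch, keeping the data in a tensor and calling `share_memory_()` on it would serve a similar purpose.

```python
from multiprocessing import shared_memory


class SharedDigits:
    """Hypothetical stand-in for a dataset's `self.data`:
    fixed-size digit images packed into one shared memory block."""

    def __init__(self, images, image_size):
        # images: list of bytes objects, each exactly image_size bytes long.
        for img in images:
            if len(img) != image_size:
                raise ValueError("every image must have image_size bytes")
        self.image_size = image_size
        self.count = len(images)
        # SharedMemory requires a strictly positive size.
        self.shm = shared_memory.SharedMemory(
            create=True, size=max(1, self.count * image_size)
        )
        for i, img in enumerate(images):
            self.shm.buf[i * image_size:(i + 1) * image_size] = img

    def __len__(self):
        return self.count

    def __getitem__(self, index):
        if not 0 <= index < self.count:
            raise IndexError(index)
        start = index * self.image_size
        # Copy out only the requested image; the backing block is not duplicated.
        return bytes(self.shm.buf[start:start + self.image_size])

    def close(self):
        # Release the mapping and free the block (call once, in the creating process).
        self.shm.close()
        self.shm.unlink()
```

The point of the design is that the digits live in a single buffer rather than in many separate Python objects, so each data loading process reads from the same block instead of gradually building its own copy.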

Please let us know whether any of these suggestions solves your problem!

fanshuhuangjia · 2020-12-25T01:33:53Z

Thank you for the suggestions! I'm sorry I didn't reply in time.
I tried suggestions 1 to 4, but they didn't work. By running on only one GPU, I finished the whole training, at a large cost in CPU memory. (I will post the results soon, after evaluation.)

As you can see, it needs about 100 GB to train the model.
I will try to train the model on other computers to check whether something is wrong with mine.
Thanks again for your time and effort! It helps me a lot!

fanshuhuangjia · 2020-12-25T02:13:47Z

Though the problem hasn't been solved, I am happy that the results match the paper (PSNR 16.93 ± 0.07, SSIM 0.7799 ± 0.0020).

Thanks again for your detailed instruction!

White-Link · 2020-12-25T17:10:42Z

No worries, thank you for the update!

There might be another explanation: this Apex issue reports that Apex usage in specific configurations can lead to a CPU memory leak such as the one you encountered. If possible, could you please try to train our model without the --apex_amp option to check whether your issue still occurs? Training at full precision is significantly slower, but you should be able to quickly confirm the absence of a memory leak with the setup that you first used when reporting this problem.
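One way to check quickly whether memory keeps growing during a short run is to sample the process's resident memory every few thousand iterations. The sketch below is only an illustration: the helper names, the 80% threshold, and the 1024 MiB default are assumptions chosen for this example, and `current_rss_mib` reads `/proc`, so it works on Linux only and measures only the calling process (not data loading workers).

```python
import os


def current_rss_mib():
    """Resident memory of this process in MiB, read from /proc (Linux only)."""
    with open("/proc/self/statm") as f:
        # The second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)


def looks_like_leak(samples, min_growth_mib=1024.0):
    """Heuristic: most consecutive samples rise and the total growth is large.

    samples: memory readings in MiB, taken at regular intervals.
    min_growth_mib: hypothetical threshold; tune it to the run length.
    """
    if len(samples) < 3:
        return False
    steps = len(samples) - 1
    rising = sum(later > earlier for earlier, later in zip(samples, samples[1:]))
    mostly_rising = rising >= 0.8 * steps
    return mostly_rising and (samples[-1] - samples[0]) >= min_growth_mib
```

In a training loop, one would append `current_rss_mib()` to a list at a fixed iteration interval and call `looks_like_leak` on that list, once with --apex_amp and once without, to compare the two runs.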

If this is indeed the cause of the memory leak, there is unfortunately not much we can do on our side except reporting it in our instructions. You then have two options:

- try the workarounds suggested in the corresponding issue, such as installing different versions of PyTorch or compiling it yourself;
- train the model using the --torch_amp option with the latest version of PyTorch (1.7.1), which integrates Apex's mixed-precision training directly into PyTorch (note that this feature is experimental as we have not reproduced our results with this option yet).

Please let us know whether this helps solve your issue.

fanshuhuangjia · 2020-12-28T12:23:09Z

Hi, following your suggestion, I trained the model without --apex_amp and the CPU memory leak did not occur anymore. This supports your explanation that Apex usage in specific configurations leads to a CPU memory leak.

As FabianIsensee said in this Apex issue, the problem goes away if you compile PyTorch yourself with a more recent version of cuDNN. So I checked my version of cuDNN and found that cuDNN was not installed on my computer. However, the issue still occurred after I installed it.

Now I'd like to train the model using PyTorch 1.7.1 and check my Apex installation.
Thanks again for your time and effort!!!

Mentioned Apex-related memory leak issue (#8)

White-Link · 2020-12-28T17:21:51Z

Nice, thank you for your help! We are closing this issue since the source of the problem was found. Please let us know if you have any other questions!

fanshuhuangjia · 2021-01-02T01:56:27Z

Hi, I trained the model using PyTorch 1.7.1 with --torch_amp, and the issue did not occur anymore. I got the following results:
psnr 16.743256 +/- 0.06728019263638955
ssim 0.77448565 +/- 0.0019244341182183178
lpips 0.11437263 +/- 0.001075665273820588
The results are similar, so you can use them as a reference.
Thank you!
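To put a number on "similar", the short sketch below compares the PSNR and SSIM figures quoted earlier in this thread (16.93 and 0.7799) with the ones from this --torch_amp run. The helper `relative_gap` is a hypothetical name introduced for this example; LPIPS is left out because no earlier LPIPS figure appears in the thread.

```python
def relative_gap(reference, value):
    """Relative difference of value with respect to reference, as a fraction."""
    return abs(reference - value) / abs(reference)


# Figures quoted earlier in this thread and in the comment above (--torch_amp run).
earlier_figures = {"psnr": 16.93, "ssim": 0.7799}
torch_amp_run = {"psnr": 16.743256, "ssim": 0.77448565}

gaps = {
    name: relative_gap(earlier_figures[name], torch_amp_run[name])
    for name in earlier_figures
}
for name, gap in gaps.items():
    print(f"{name}: {gap:.2%} relative gap")
```

The relative gaps come out at about 1.1% for PSNR and about 0.7% for SSIM.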

White-Link added a commit that referenced this issue Dec 21, 2020

Reduced memory usage by data loading processes (#8)

06c1138

White-Link added a commit that referenced this issue Dec 28, 2020

Added a troubleshooting section in the README

62f5c2e

Mentioned Apex-related memory leak issue (#8)

White-Link closed this as completed Dec 28, 2020
