Memory out on SM-MNIST #8
Comments
Hi, thank you for your interest in our work! We are currently investigating the issue and trying to reproduce your error. In the meantime and to help us understand your problem, could you please provide additional details such as the exact command that you executed to launch training, the line at which the error occurs (if available) or any other relevant information on your experimental setup (such as package versions different than those given in the |
Unfortunately, we were not able to reproduce this error: on our architecture with one GPU, our program runs for hundreds of thousands of iterations without exceeding 30GB of RAM. However, there are some possible workarounds that you might want to try.
Please let us know whether any of these suggestions solves your problem!
No worries, thank you for the update! There might be another explanation: this Apex issue reports that using Apex in specific configurations can lead to a CPU memory leak such as the one you are encountering. If possible, could you please try to train our model without the If this is indeed the cause of the memory leak, there is unfortunately not much we can do on our side except report it in our instructions. You then have two solutions:
Please let us know whether this helps solve your issue.
Hi, following your suggestions, I trained the model without As FabianIsensee said in this Apex issue, the problem goes away if you compile PyTorch yourself with a more recent version of cuDNN. So I checked my cuDNN version, but found that cuDNN was not installed on my computer. However, the issue still occurred after installation. Now I'd like to train the model using PyTorch (1.7.1) and check my Apex installation environment.
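For anyone checking the same thing: prebuilt PyTorch binaries generally bundle their own cuDNN, so a system-wide cuDNN installation is not necessarily the one PyTorch uses. Below is a minimal sketch for printing the versions PyTorch itself reports. The helper name `cudnn_report` is hypothetical and not part of this repository; it returns `None` when PyTorch is not importable.

```python
def cudnn_report():
    """Return the versions PyTorch reports, or None if PyTorch is missing.

    Hypothetical helper for illustration; not part of this repository.
    """
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was built with (None for CPU-only builds).
        "cuda": torch.version.cuda,
        # cuDNN version seen by PyTorch (None if cuDNN is unavailable).
        "cudnn": torch.backends.cudnn.version(),
        "cudnn_available": torch.backends.cudnn.is_available(),
    }


if __name__ == "__main__":
    print(cudnn_report())
```

Comparing this output before and after a reinstall shows whether the cuDNN version seen by PyTorch actually changed.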
Mentioned Apex-related memory leak issue (#8)
Nice, thank you for your help! We are closing this issue since the source of the problem was found. Please let us know if you have any other questions!
Hi, I trained the model using PyTorch(1.7.1) with |
I trained the model easily by following your instructions, but I got "OSError: [Errno 12] Cannot allocate memory" at iteration 319999/1100000. I have tried setting n_worker=0 and pin_memory=False, but it didn't help. So I wonder how much CPU memory I need to train the model on SM-MNIST? (My CPU memory: 80G)
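To tell a slow host-memory leak apart from a one-off allocation spike, it can help to log the process's memory use at regular intervals during training. Here is a minimal sketch using only the Python standard library on Linux or macOS. The names `peak_rss_mib` and `log_memory_growth` are hypothetical, and `step_fn` stands in for one training iteration; none of this comes from the repository.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory_growth(step_fn, n_steps, every=100):
    """Call step_fn(i) n_steps times, recording peak RSS every `every` steps.

    step_fn is a placeholder for one training iteration (an assumption).
    Returns a list of (step, peak_rss_in_mib) pairs.
    """
    history = []
    for i in range(n_steps):
        step_fn(i)
        if i % every == 0:
            history.append((i, peak_rss_mib()))
    return history


if __name__ == "__main__":
    leaked = []
    # Simulate a leak by keeping a reference to 1 MiB per step.
    for step, mib in log_memory_growth(
        lambda i: leaked.append(bytearray(1024 * 1024)), 300
    ):
        print(f"step {step}: peak RSS {mib:.1f} MiB")
```

Peak RSS never decreases, so this shows growth but not memory that was later released. A value that keeps climbing steadily over many thousands of iterations would point to a leak rather than a batch that is simply too large.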