Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RuntimeError: CUDA error: an illegal memory access was encountered #139

Closed
amartin211 opened this issue Jun 10, 2020 · 5 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@amartin211
Copy link

Describe the bug
I'm having this CUDA error when fitting the classifier. I googled it and find out that this is a common PyTorch error so I have tried to solve this by explicitly setting the gpu device (I have only one GPU Tesla T4) but it didn't work. Although when setting the classifier with parameter : device_name: 'auto' it recognises my GPU devise.
I also tried different batch sizes but without success.

It runs nicely with CPUs though and I'm really not sure on how to make it work with GPU. Would appreciate any help if you have encountered this issue already.

Also, have check my dataset multiple times to ensure they were no NaNs or Inf values in it.

What is the current behavior?

If the current behavior is a bug, please provide the steps to reproduce.

Expected behavior

Screenshots

Other relevant information:
poetry version:
python version:
Operating System:
Additional tools:

Additional context

The details of the error:


RuntimeError Traceback (most recent call last)
in
7 batch_size=16384, virtual_batch_size=1024,
8 num_workers=0,
----> 9 drop_last=False
10 )

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in fit(self, X_train, y_train, X_valid, y_valid, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last)
133 virtual_batch_size=self.virtual_batch_size,
134 momentum=self.momentum,
--> 135 device_name=self.device_name).to(self.device)
136
137 self.reducing_matrix = create_explain_matrix(self.network.input_dim,

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in init(self, input_dim, output_dim, n_d, n_a, n_steps, gamma, cat_idxs, cat_dims, cat_emb_dim, n_independent, n_shared, epsilon, virtual_batch_size, momentum, device_name)
250 device_name = 'cpu'
251 self.device = torch.device(device_name)
--> 252 self.to(self.device)
253
254 def forward(self, x):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
--> 425 return self._apply(convert)
426
427 def register_backward_hook(self, hook):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
199 def _apply(self, fn):
200 for module in self.children():
--> 201 module._apply(fn)
202
203 def compute_should_use_set_data(tensor, tensor_applied):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
199 def _apply(self, fn):
200 for module in self.children():
--> 201 module._apply(fn)
202
203 def compute_should_use_set_data(tensor, tensor_applied):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
221 # with torch.no_grad():
222 with torch.no_grad():
--> 223 param_applied = fn(param)
224 should_use_set_data = compute_should_use_set_data(param, param_applied)
225 if should_use_set_data:

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in convert(t)
421
422 def convert(t):
--> 423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
425 return self._apply(convert)

RuntimeError: CUDA error: an illegal memory access was encountered

@amartin211 amartin211 added the bug Something isn't working label Jun 10, 2020
@Optimox
Copy link
Collaborator

Optimox commented Jun 10, 2020

This looks like a setup problem with your GPU, are you able to run other models using your GPU?

Or maybe try to reduce your batch size.

It does not seem directly related to tabnet but to pytorch/your gpu.

@eduardocarvp
Copy link
Collaborator

I have never seem this problem either... Maybe something related to the CUDA version/driver version? I'm running on CUDA 10.1.

Perhaps this can help: https://docs.nvidia.com/deploy/cuda-compatibility/index.html.

@tanulsingh
Copy link

@eduardocarvp I experienced the same issue , your gpu is running out of memory , restart the process and it will work fine .It worked for me

@amartin211
Copy link
Author

It worked after restarted my kernel indeed. I have this issue because of the large size of my dataset (about 25GB). What I'm still unable to do right now is to run my model multiple times as I'm trying different hyper parameters without restarting my kernel each time. I'm on a google cloud instance and it takes about 20 min just to read this large dataset. Basically I would need to find a way to free up gpu memory from previous models as if I was fitting for the first time every time. Have tried gc.collect() but didn't work so far.

@Optimox
Copy link
Collaborator

Optimox commented Jun 13, 2020

@amartin211 you could have a look here : https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530
torch.cuda.empty_cache() would be more effevtive to clean the GPU I guess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants