RuntimeError: CUDA error: an illegal memory access was encountered #139

amartin211 · 2020-06-10T09:27:06Z

Describe the bug
I'm having this CUDA error when fitting the classifier. I googled it and find out that this is a common PyTorch error so I have tried to solve this by explicitly setting the gpu device (I have only one GPU Tesla T4) but it didn't work. Although when setting the classifier with parameter : device_name: 'auto' it recognises my GPU devise.
I also tried different batch sizes but without success.

It runs nicely with CPUs though and I'm really not sure on how to make it work with GPU. Would appreciate any help if you have encountered this issue already.

Also, have check my dataset multiple times to ensure they were no NaNs or Inf values in it.

What is the current behavior?

If the current behavior is a bug, please provide the steps to reproduce.

Expected behavior

Screenshots

Other relevant information:
poetry version:
python version:
Operating System:
Additional tools:

Additional context

The details of the error:

RuntimeError Traceback (most recent call last)
in
7 batch_size=16384, virtual_batch_size=1024,
8 num_workers=0,
----> 9 drop_last=False
10 )

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in fit(self, X_train, y_train, X_valid, y_valid, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last)
133 virtual_batch_size=self.virtual_batch_size,
134 momentum=self.momentum,
--> 135 device_name=self.device_name).to(self.device)
136
137 self.reducing_matrix = create_explain_matrix(self.network.input_dim,

/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in init(self, input_dim, output_dim, n_d, n_a, n_steps, gamma, cat_idxs, cat_dims, cat_emb_dim, n_independent, n_shared, epsilon, virtual_batch_size, momentum, device_name)
250 device_name = 'cpu'
251 self.device = torch.device(device_name)
--> 252 self.to(self.device)
253
254 def forward(self, x):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
--> 425 return self._apply(convert)
426
427 def register_backward_hook(self, hook):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
199 def _apply(self, fn):
200 for module in self.children():
--> 201 module._apply(fn)
202
203 def compute_should_use_set_data(tensor, tensor_applied):

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
221 # with torch.no_grad():
222 with torch.no_grad():
--> 223 param_applied = fn(param)
224 should_use_set_data = compute_should_use_set_data(param, param_applied)
225 if should_use_set_data:

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in convert(t)
421
422 def convert(t):
--> 423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
425 return self._apply(convert)

RuntimeError: CUDA error: an illegal memory access was encountered

The text was updated successfully, but these errors were encountered:

Optimox · 2020-06-10T09:34:40Z

This looks like a setup problem with your GPU, are you able to run other models using your GPU?

Or maybe try to reduce your batch size.

It does not seem directly related to tabnet but to pytorch/your gpu.

eduardocarvp · 2020-06-10T09:40:34Z

I have never seem this problem either... Maybe something related to the CUDA version/driver version? I'm running on CUDA 10.1.

Perhaps this can help: https://docs.nvidia.com/deploy/cuda-compatibility/index.html.

tanulsingh · 2020-06-12T20:41:27Z

@eduardocarvp I experienced the same issue , your gpu is running out of memory , restart the process and it will work fine .It worked for me

amartin211 · 2020-06-13T08:56:02Z

It worked after restarted my kernel indeed. I have this issue because of the large size of my dataset (about 25GB). What I'm still unable to do right now is to run my model multiple times as I'm trying different hyper parameters without restarting my kernel each time. I'm on a google cloud instance and it takes about 20 min just to read this large dataset. Basically I would need to find a way to free up gpu memory from previous models as if I was fitting for the first time every time. Have tried gc.collect() but didn't work so far.

Optimox · 2020-06-13T11:25:15Z

@amartin211 you could have a look here : https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530
torch.cuda.empty_cache() would be more effevtive to clean the GPU I guess

amartin211 added the bug Something isn't working label Jun 10, 2020

amartin211 assigned eduardocarvp, Hartorn, j-abi and Optimox Jun 10, 2020

amartin211 closed this as completed Jun 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RuntimeError: CUDA error: an illegal memory access was encountered #139

RuntimeError: CUDA error: an illegal memory access was encountered #139

amartin211 commented Jun 10, 2020

Optimox commented Jun 10, 2020

eduardocarvp commented Jun 10, 2020

tanulsingh commented Jun 12, 2020

amartin211 commented Jun 13, 2020

Optimox commented Jun 13, 2020

RuntimeError: CUDA error: an illegal memory access was encountered #139

RuntimeError: CUDA error: an illegal memory access was encountered #139

Comments

amartin211 commented Jun 10, 2020

Optimox commented Jun 10, 2020

eduardocarvp commented Jun 10, 2020

tanulsingh commented Jun 12, 2020

amartin211 commented Jun 13, 2020

Optimox commented Jun 13, 2020