Describe the bug
I'm getting this CUDA error when fitting the classifier. I googled it and found that it is a common PyTorch error, so I tried to solve it by explicitly setting the GPU device (I have only one GPU, a Tesla T4), but it didn't work. When I set the classifier parameter device_name: 'auto', it does recognise my GPU device.
I also tried different batch sizes, without success.
It runs fine on CPU, though, and I'm really not sure how to make it work on the GPU. I would appreciate any help if you have already encountered this issue.
I have also checked my dataset multiple times to ensure there were no NaN or Inf values in it.
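For reference, this is roughly how I check the data (the assert_finite helper is my own name, not part of pytorch_tabnet):

```python
import numpy as np

def assert_finite(X, name="X"):
    """Raise if the array contains any NaN or +/-Inf values."""
    bad = ~np.isfinite(X)          # True wherever a value is NaN or infinite
    if bad.any():
        rows = np.unique(np.argwhere(bad)[:, 0])
        raise ValueError(f"{name} has non-finite values in rows {rows[:10]}")

assert_finite(np.array([[1.0, 2.0], [3.0, 4.0]]))  # clean data passes silently
```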
What is the current behavior?
If the current behavior is a bug, please provide the steps to reproduce.
It did work after restarting my kernel. I hit this issue because of the large size of my dataset (about 25 GB). What I'm still unable to do is run my model multiple times with different hyperparameters without restarting my kernel each time. I'm on a Google Cloud instance, and it takes about 20 minutes just to read this large dataset. Basically, I need a way to free up GPU memory from previous models so that each fit behaves as if it were the first. I have tried gc.collect(), but it hasn't worked so far.
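The cleanup pattern I've been experimenting with between fits looks like this (release_gpu_memory is my own helper name; it assumes PyTorch's caching allocator is what holds on to the memory):

```python
import gc

def release_gpu_memory():
    """Reclaim GPU memory after the caller has dropped its model references."""
    gc.collect()  # break reference cycles so dead tensors are actually freed
    try:
        import torch
        if torch.cuda.is_available():
            # PyTorch's caching allocator keeps freed blocks around; hand them
            # back to the driver so the next fit starts from a clean pool.
            torch.cuda.empty_cache()
    except ImportError:  # lets this sketch run where torch isn't installed
        pass

# usage between hyperparameter runs:
# clf = TabNetClassifier(...); clf.fit(...)
# del clf               # drop the reference first
# release_gpu_memory()  # then reclaim the cached blocks
```

Note that empty_cache() only releases memory PyTorch has already freed, so deleting the model reference (and gc.collect()) has to come first.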
Additional context
The details of the error:
RuntimeError Traceback (most recent call last)
in
7 batch_size=16384, virtual_batch_size=1024,
8 num_workers=0,
----> 9 drop_last=False
10 )
/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_model.py in fit(self, X_train, y_train, X_valid, y_valid, loss_fn, weights, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last)
133 virtual_batch_size=self.virtual_batch_size,
134 momentum=self.momentum,
--> 135 device_name=self.device_name).to(self.device)
136
137 self.reducing_matrix = create_explain_matrix(self.network.input_dim,
/opt/conda/lib/python3.7/site-packages/pytorch_tabnet/tab_network.py in init(self, input_dim, output_dim, n_d, n_a, n_steps, gamma, cat_idxs, cat_dims, cat_emb_dim, n_independent, n_shared, epsilon, virtual_batch_size, momentum, device_name)
250 device_name = 'cpu'
251 self.device = torch.device(device_name)
--> 252 self.to(self.device)
253
254 def forward(self, x):
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
--> 425 return self._apply(convert)
426
427 def register_backward_hook(self, hook):
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
199 def _apply(self, fn):
200 for module in self.children():
--> 201 module._apply(fn)
202
203 def compute_should_use_set_data(tensor, tensor_applied):
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
199 def _apply(self, fn):
200 for module in self.children():
--> 201 module._apply(fn)
202
203 def compute_should_use_set_data(tensor, tensor_applied):
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    221 #
    222 with torch.no_grad():
--> 223 param_applied = fn(param)
224 should_use_set_data = compute_should_use_set_data(param, param_applied)
225 if should_use_set_data:
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in convert(t)
421
422 def convert(t):
--> 423 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
424
425 return self._apply(convert)
RuntimeError: CUDA error: an illegal memory access was encountered
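One thing worth noting about this error: CUDA reports "illegal memory access" asynchronously, so the traceback above may not point at the kernel that actually failed. A standard PyTorch debugging step is to force synchronous kernel launches before the first CUDA call, so the error surfaces at the real call site:

```python
import os

# Must be set before torch initialises CUDA. Synchronous launches are slow,
# so this is for debugging only; remove it once the failing call is found.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```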