scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device)) RuntimeError: CUDA error: device-side assert triggered #240
Comments
Hello, TabNet is not compatible with every sklearn pipeline, and I guess that's the problem. Could you share the code you are running with TabNet?
Here it is:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
inner_cv = TimeSeriesSplit(n_splits=5)  # .split(featureMatrix)
# set the training parameters for Random Forest
tab_model = TabNetRegressor(device_name="cuda")
tab_model = Pipeline(steps=[('i', imputer), ('m', tab_model)])
tab_search = RandomizedSearchCV(tab_model, scoring=scorer, param_distributions=paramsTab, random_state=42, cv=inner_cv, verbose=5, n_jobs=1, return_train_score=True)
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values), np.array(values), m__batch_size=10)
I also tried without batch_size, and now I get:
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))
File "", line 1, in
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 151, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 375, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 409, in forward
RuntimeError: CUDA error: device-side assert triggered
I also did a grid search using the CPU. For most iterations it works, but some give an error:
[CV] m__gamma=1.3745401188473625, m__mask_type=sparsemax, m__momentum=0.019673133446482256, m__n_independent=4, m__n_shared=1, m__n_steps=1, score=(train=nan, test=nan), total= 38.5s
C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection\_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
FitFailedWarning)
It also crashes when I do not use grid search but just fit the model. Here is a runnable code example, and the example data as a pickle:
import numpy as np
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
scorer = make_scorer(mean_squared_error, greater_is_better=False)
inner_cv = TimeSeriesSplit(n_splits=5)
outer_cv = PredefinedHoldoutSplit(valid_indices=[range(0, 100, 1)])
# set the training parameters for Random Forest
tab_model = TabNetRegressor(device_name="cuda")
tab_model = Pipeline(steps=[('i', imputer), ('m', tab_model)])
tab_search = RandomizedSearchCV(tab_model, scoring=scorer, param_distributions=paramsTab, random_state=42, cv=inner_cv, verbose=5, n_jobs=1, return_train_score=True)
# run with basic settings:
# run with feature search
Here is the error without grid search on the GPU:
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))
File "", line 1, in
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 151, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 375, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 409, in forward
RuntimeError: CUDA error: device-side assert triggered
And here without gridsearch on the cpu:
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))
File "", line 1, in
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 144, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 322, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 91, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 43, in forward
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 76, in _threshold_and_support
RuntimeError: index -1 is out of bounds for dimension 1 with size 2423
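One possible reading of the `index -1` error, offered as an assumption rather than a confirmed diagnosis: if the scores reaching sparsemax are all NaN (for example after exploding gradients), every comparison used to count the support is false, the support size becomes 0, and a lookup at `support_size - 1` turns into an index of -1. The sketch below is a simplified, single-row, pure-Python version of the sparsemax support count, not the code in pytorch_tabnet's sparsemax.py; the name `sparsemax_support_size` is hypothetical.

```python
import math


def sparsemax_support_size(scores):
    """Return how many entries of one row sparsemax would keep (simplified sketch)."""
    # Sort descending, as sparsemax does before thresholding.
    ordered = sorted(scores, reverse=True)
    running_sum = 0.0
    support_size = 0
    for k, value in enumerate(ordered, start=1):
        running_sum += value
        # Entry k is in the support when 1 + k * z_k > cumulative sum.
        # Any comparison involving NaN is False, so an all-NaN row counts nothing.
        if 1 + k * value > running_sum:
            support_size += 1
    return support_size


healthy = [0.5, 0.2, 0.1]
broken = [math.nan, math.nan, math.nan]

print(sparsemax_support_size(healthy))
print(sparsemax_support_size(broken))
# The threshold lookup uses position support_size - 1, which is -1 for the NaN row.
print(sparsemax_support_size(broken) - 1)
```

If this is what happens here, the fix lies upstream of sparsemax: find out where the NaN values first appear (inputs, targets, or weights after an unstable update).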
When I run grid search on the CPU, it actually works on some combinations and fails on others:
[CV] m__gamma=1.3745401188473625, m__mask_type=sparsemax, m__momentum=0.019673133446482256, m__n_a=15, m__n_d=28, m__n_independent=3, m__n_shared=2, m__n_steps=1, score=(train=-17302.082, test=-26498.246), total= 53.9s
Regarding this error: it can also be due to gradient fading or explosion. You can add a clip_value, which may help, or change the learning rate.
"It can also be due to gradient fading or explosion" I checked again, and the target values I gave the model to fit the featureMatrix against were off, so the model was probably not able to fit anything sensible. I fixed this and the error still occurs. I will have a look at the other suggestions, and I have attached the updated data.
I checked for infs, made sure that the data is float32 compatible, and NaNs are being imputed by the pipeline. Since some of the grid searches work and some don't, there are two possibilities. Either it is a problem with the settings, as the standard settings don't work and some of the tried ones do, or some cross-validation folds work and some do not.
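The checks described above (no NaN, no inf, everything within float32 range) can be made explicit with a small helper. This is a minimal sketch in pure Python, assuming the data is available as a 2-D nested list (for a NumPy array, `.tolist()` gives that form); the name `find_bad_values` and the sample `features` are hypothetical and are not the data from this issue.

```python
import math

FLOAT32_MAX = 3.4028234663852886e38  # largest finite float32 value


def find_bad_values(rows):
    """Return (row, column, value) for every NaN, inf, or float32-overflowing entry."""
    problems = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # isfinite is False for NaN and +/-inf; the second test catches
            # finite float64 values that would overflow when cast to float32.
            if not math.isfinite(value) or abs(value) > FLOAT32_MAX:
                problems.append((i, j, value))
    return problems


features = [[1.0, 2.5], [math.nan, 4.0], [1e39, math.inf]]
for problem in find_bad_values(features):
    print(problem)
```

It is worth running such a check on the matrix after imputation, per cross-validation fold, and on the targets as well (a 1-D target list can be wrapped as a single row), since a problem confined to one fold would match the pattern of some fits working and others failing.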
Maybe entmax is working but sparsemax is creating this error; did you check that?
I ran the grid search on the CPU using only entmax, and for the CPU this fixed the problem, even when varying the other parameters. I will also check for the GPU (CUDA).
While it now works for the CPU with entmax, for the GPU I still get the following error:
[Parallel(n_jobs=1)]: Done 50 out of 50 | elapsed: 21.3s finished
File "", line 1, in
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection\_search.py", line 765, in fit
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 159, in fit
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 474, in _set_network
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 272, in __init__
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 612, in to
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 359, in _apply
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 359, in _apply
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 381, in _apply
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 610, in convert
RuntimeError: CUDA error: device-side assert triggered
I also ran on the CPU again, and even with entmax the error is not completely gone; it just seems to happen less often.
@ThomasWolf0701 Could you share some examples of parameters which trigger the error and some that don't?
Alright, I'm closing this as I could not reproduce it easily. Please feel free to reopen with more information when you have some to share.
Are you using an RTX 30x0 GPU? I'm encountering the same problem. I moved the scale = *** line into the __init__ method of GLU and got cuDNN error: CUDNN_STATUS_EXECUTION_FAILED. Maybe it's caused by cuDNN.
OK, I will try to run with different settings and can check which ones crash.
I am using a GTX 1070 and CUDA 10.2.
I'm facing the same issue. Were you able to resolve this?
I'm facing the same issue. After reducing gradient explosion, it no longer seems to occur; I increased the weight decay.
@Optimox I am facing the same issue when training a model on tabular data. From the call stack, it's coming from the line tau_star = tau.gather(dim, support_size - 1). I am trying to build a simple model with the hyperparameters below. The data is tabular and most of the columns are float64.
model = TabNetClassifier(n_d=8, n_a=8, n_steps=1, gamma=1.3,
I am unsure what these params are. Could you please suggest/assist here?
@bhavana3 try setting a gradient clipping limit
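To illustrate what a gradient clipping limit does, here is a minimal pure-Python sketch of clipping by global L2 norm, the scheme used by `torch.nn.utils.clip_grad_norm_`. Whether pytorch_tabnet applies its `clip_value` in exactly this way is an assumption here, and the function name `clip_by_global_norm` is hypothetical.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    # Small epsilon avoids division by zero for an all-zero gradient.
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:
        # Norm exceeds the limit: shrink every component by the same factor,
        # which keeps the gradient direction unchanged.
        return [g * scale for g in gradients]
    return list(gradients)


large = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
small = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(large)
print(small)
```

Clipping bounds the size of a single update, which limits how far one bad batch can push the weights. It does not repair inputs or targets that already contain NaN or inf.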
Describe the bug
When running on GPU Tabnet crashes with scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device))
RuntimeError: CUDA error: device-side assert triggered
What is the current behavior?
It works when the matrix I use contains only integers but fails with floats.
I also made sure that NaN values are imputed and that there are no Inf values. The largest value also fits into float32.
I also set the batch size to a very low value.
If the current behavior is a bug, please provide the steps to reproduce.
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10)
Expected behavior
Screenshots
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),batch_size = 10)
Traceback (most recent call last):
File "", line 1, in
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),batch_size = 10)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 329, in fit
fit_params_steps = self._check_fit_params(**fit_params)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 248, in _check_fit_params
"=sample_weight)`.".format(pname))
ValueError: Pipeline.fit does not accept the batch_size parameter. You can pass parameters to specific steps of your pipeline using the stepname__parameter format, e.g. `Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight)`.
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10)
No early stopping will be performed, last training weights will be used.
Traceback (most recent call last):
File "", line 1, in
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
self._train_epoch(train_dataloader)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
batch_logs = self._train_batch(X, y)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
output, M_loss = self.network(X)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
return self.tabnet(x)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 151, in forward
out = self.feat_transformers[step](masked_x)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 375, in forward
x = self.shared(x)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 409, in forward
scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device))
RuntimeError: CUDA error: device-side assert triggered
Other relevant information:
poetry version:
python version:
Operating System:
Additional tools:
Additional context