Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device)) RuntimeError: CUDA error: device-side assert triggered #240

Closed
ThomasWolf0701 opened this issue Nov 23, 2020 · 25 comments
Assignees
Labels
bug Something isn't working

Comments

@ThomasWolf0701
Copy link

Describe the bug
When running on GPU Tabnet crashes with scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device))
RuntimeError: CUDA error: device-side assert triggered

What is the current behavior?
It works when the matrix I use contains only integers but fails with floats.
I also made sure that NaN values are imputed and there are no Inf. Also the largest value fits into float32,
Also set the batch size to a very low level.

If the current behavior is a bug, please provide the steps to reproduce.
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10)

Expected behavior

Screenshots

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),batch_size = 10)
Traceback (most recent call last):

File "", line 1, in
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),batch_size = 10)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 329, in fit
fit_params_steps = self._check_fit_params(**fit_params)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 248, in _check_fit_params
"=sample_weight)`.".format(pname))

ValueError: Pipeline.fit does not accept the batch_size parameter. You can pass parameters to specific steps of your pipeline using the stepname__parameter format, e.g. Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight).

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10)
No early stopping will be performed, last training weights will be used.
Traceback (most recent call last):

File "", line 1, in
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
self._train_epoch(train_dataloader)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
batch_logs = self._train_batch(X, y)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
output, M_loss = self.network(X)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
return self.tabnet(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 151, in forward
out = self.feat_transformersstep

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 375, in forward
x = self.shared(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 409, in forward
scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device))

RuntimeError: CUDA error: device-side assert triggered

Other relevant information:
poetry version:
python version:
Operating System:
Additional tools:

Additional context

@ThomasWolf0701 ThomasWolf0701 added the bug Something isn't working label Nov 23, 2020
@Optimox
Copy link
Collaborator

Optimox commented Nov 23, 2020

Hello,

This line ValueError: Pipeline.fit does not accept the batch_size parameter. You can pass parameters to specific steps of your pipeline using the stepname__parameter format, e.g. Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight) makes me think that you are using tabnet inside a sklearn pipeline.

Tabnet is not compatible with all sklearn pipeline, I guess that's the problem. Could you share the code you are running with TabNet?

@ThomasWolf0701
Copy link
Author

ThomasWolf0701 commented Nov 23, 2020

Here it is:

imputer = SimpleImputer(missing_values=np.nan,strategy='mean')
scorer = make_scorer(mean_squared_error, greater_is_better= False)

inner_cv = TimeSeriesSplit(n_splits=5)#.split(featureMatrix)
outer_cv = PredefinedHoldoutSplit(valid_indices=[range(0,100,1)]

#set the training parameters for Random Forest
paramsTab = {
'm__n_steps': randint(1,3),
'm__n_a': randint(8,64),
'm__n_d': randint(8,64),
'm__gamma': uniform(1, 1),
'm__n_shared': randint(1, 5),
'm__n_independent': randint(1, 5),
'm__momentum': loguniform(0.01, 0.4),
"m__mask_type":["sparsemax", "entmax"]
}

tab_model = TabNetRegressor(device_name = "cuda")

tab_model = Pipeline(steps=[('i', imputer),('m', tab_model)])

tab_search = RandomizedSearchCV(tab_model,scoring = scorer ,param_distributions=paramsTab, random_state=42, cv=inner_cv, verbose=5, n_jobs=1, return_train_score=True)

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values),m__batch_size = 10)
tab_search.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

@ThomasWolf0701
Copy link
Author

ThomasWolf0701 commented Nov 23, 2020

Also tried without batch_size and now i get:
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))
No early stopping will be performed, last training weights will be used.
Traceback (most recent call last):

File "", line 1, in
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
self._train_epoch(train_dataloader)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
batch_logs = self._train_batch(X, y)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
output, M_loss = self.network(X)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
return self.tabnet(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 151, in forward
out = self.feat_transformersstep

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 375, in forward
x = self.shared(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 409, in forward
scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device))

RuntimeError: CUDA error: device-side assert triggered

@ThomasWolf0701
Copy link
Author

I also did a grid search using the cpu, for most iterations it works but some give an error:

[CV] m__gamma=1.3745401188473625, m__mask_type=sparsemax, m__momentum=0.019673133446482256, m__n_independent=4, m__n_shared=1, m__n_steps=1, score=(train=nan, test=nan), total= 38.5sC:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_validation.py", line 531, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
self._train_epoch(train_dataloader)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
batch_logs = self._train_batch(X, y)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
output, M_loss = self.network(X)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
return self.tabnet(x)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 144, in forward
M = self.att_transformers[step](prior, att)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 322, in forward
x = self.selector(x)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 91, in forward
return sparsemax(input, self.dim)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 43, in forward
tau, supp_size = SparsemaxFunction._threshold_and_support(input, dim=dim)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 76, in _threshold_and_support
tau = input_cumsum.gather(dim, support_size - 1)
RuntimeError: index -1 is out of bounds for dimension 1 with size 2423

FitFailedWarning)
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 2.9min remaining: 0.0s

@Optimox
Copy link
Collaborator

Optimox commented Nov 23, 2020

Well I think this is related to GridSearchCV and TabNet you have to wrap tabnet into a bigger class. This is not included in the library yet.

You can have a look at #238 and #80.
Maybe @nclibz can help you.

@ThomasWolf0701
Copy link
Author

I also crashes when i do not use grid search but just fit the model:
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

Here is a runnable code example, and the example data as a pickle:
bugfixData.zip

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.impute import SimpleImputer
from sklearn.model_selection import TimeSeriesSplit
from mlxtend.evaluate import PredefinedHoldoutSplit
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
from scipy.stats import uniform
from scipy.stats import loguniform
from scipy.stats import randint
from pytorch_tabnet.tab_model import TabNetRegressor
from sklearn.pipeline import Pipeline

imputer = SimpleImputer(missing_values=np.nan,strategy='mean')

scorer = make_scorer(mean_squared_error, greater_is_better= False)

inner_cv = TimeSeriesSplit(n_splits=5)

outer_cv = PredefinedHoldoutSplit(valid_indices=[range(0,100,1)])

#set the training parameters for Random Forest
paramsTab = {
'm__n_steps': randint(1,3),
'm__n_a': randint(8,64),
'm__n_d': randint(8,64),
'm__gamma': uniform(1, 1),
'm__n_shared': randint(1, 5),
'm__n_independent': randint(1, 5),
'm__momentum': loguniform(0.01, 0.4),
"m__mask_type":["sparsemax", "entmax"]
}

tab_model = TabNetRegressor(device_name = "cuda")

tab_model = Pipeline(steps=[('i', imputer),('m', tab_model)])

tab_search = RandomizedSearchCV(tab_model,scoring = scorer ,param_distributions=paramsTab, random_state=42, cv=inner_cv, verbose=5, n_jobs=1, return_train_score=True)

#run with basic settings:
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

#run with feature search
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

@ThomasWolf0701
Copy link
Author

Here the error without grid search on the GPU:
Device used : cuda

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))
No early stopping will be performed, last training weights will be used.
Traceback (most recent call last):

File "", line 1, in
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
self._train_epoch(train_dataloader)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
batch_logs = self._train_batch(X, y)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
output, M_loss = self.network(X)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
return self.tabnet(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 151, in forward
out = self.feat_transformersstep

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 375, in forward
x = self.shared(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 409, in forward
scale = torch.sqrt(torch.FloatTensor([0.5]).to(x.device))

RuntimeError: CUDA error: device-side assert triggered

@ThomasWolf0701
Copy link
Author

And here without gridsearch on the cpu:

tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))
No early stopping will be performed, last training weights will be used.
epoch 0 | loss: 20287.99239| 0:00:05s
epoch 1 | loss: 20290.36925| 0:00:09s
Traceback (most recent call last):

File "", line 1, in
tab_model.fit(np.array(featureMatrix.sparse.to_dense().values),np.array(values))

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
self._train_epoch(train_dataloader)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
batch_logs = self._train_batch(X, y)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
output, M_loss = self.network(X)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
return self.tabnet(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 144, in forward
M = self.att_transformers[step](prior, att)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 322, in forward
x = self.selector(x)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 91, in forward
return sparsemax(input, self.dim)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 43, in forward
tau, supp_size = SparsemaxFunction._threshold_and_support(input, dim=dim)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 76, in _threshold_and_support
tau = input_cumsum.gather(dim, support_size - 1)

RuntimeError: index -1 is out of bounds for dimension 1 with size 2423

@ThomasWolf0701
Copy link
Author

When i run grid search on the cpu it actually works on some combinations and fails on others:

[CV] m__gamma=1.3745401188473625, m__mask_type=sparsemax, m__momentum=0.019673133446482256, m__n_a=15, m__n_d=28, m__n_independent=3, m__n_shared=2, m__n_steps=1, score=(train=-17302.082, test=-26498.246), total= 53.9s
Device used : cpu
[CV] m__gamma=1.3745401188473625, m__mask_type=sparsemax, m__momentum=0.019673133446482256, m__n_a=15, m__n_d=28, m__n_independent=3, m__n_shared=2, m__n_steps=1
No early stopping will be performed, last training weights will be used.
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 1.4min remaining: 0.0s
epoch 0 | loss: 21073.06641| 0:00:00s
epoch 1 | loss: 21039.7832| 0:00:01s
epoch 2 | loss: 21034.89062| 0:00:02s
epoch 3 | loss: 21004.1582| 0:00:02s
epoch 4 | loss: 20963.84766| 0:00:03s
epoch 5 | loss: 20944.4707| 0:00:04s
epoch 6 | loss: 20882.54297| 0:00:04s
epoch 7 | loss: 20821.25977| 0:00:05s
epoch 8 | loss: 20774.79102| 0:00:06s
epoch 9 | loss: 20731.32812| 0:00:07s
epoch 10 | loss: 20618.95117| 0:00:07s
epoch 11 | loss: 20547.69727| 0:00:08s
epoch 12 | loss: 20460.16602| 0:00:09s
epoch 13 | loss: 20259.37695| 0:00:09s
epoch 14 | loss: 20088.0 | 0:00:10s
epoch 15 | loss: 19996.99023| 0:00:11s
epoch 16 | loss: 19853.4375| 0:00:12s
epoch 17 | loss: 19550.07812| 0:00:12s
epoch 18 | loss: 19418.98438| 0:00:13s
epoch 19 | loss: 19086.25781| 0:00:14s
epoch 20 | loss: 18964.19531| 0:00:14s
epoch 21 | loss: 18776.58594| 0:00:15s
epoch 22 | loss: 18383.12109| 0:00:16s
epoch 23 | loss: 18161.46094| 0:00:16s
epoch 24 | loss: 17829.95117| 0:00:17s
epoch 25 | loss: 17601.00781| 0:00:18s
epoch 26 | loss: 17238.53516| 0:00:18s
epoch 27 | loss: 17147.02734| 0:00:19s
epoch 28 | loss: 16768.46484| 0:00:20s
epoch 29 | loss: 16510.31641| 0:00:20s
epoch 30 | loss: 16261.33984| 0:00:21s
epoch 31 | loss: 16023.36523| 0:00:22s
epoch 32 | loss: 15777.2998| 0:00:23s
epoch 33 | loss: 15456.74023| 0:00:23s
epoch 34 | loss: 15194.16406| 0:00:24s
epoch 35 | loss: 14878.4541| 0:00:25s
epoch 36 | loss: 14469.09863| 0:00:26s
epoch 37 | loss: 14036.95996| 0:00:26s
epoch 38 | loss: 13652.70703| 0:00:27s
epoch 39 | loss: 13418.50098| 0:00:28s
epoch 40 | loss: 13139.40527| 0:00:28s
epoch 41 | loss: 12592.125| 0:00:29s
epoch 42 | loss: 12505.69727| 0:00:30s
epoch 43 | loss: 12072.10449| 0:00:30s
epoch 44 | loss: 11610.03027| 0:00:31s
epoch 45 | loss: 11290.94141| 0:00:32s
epoch 46 | loss: 10918.68359| 0:00:32s
epoch 47 | loss: 10666.70801| 0:00:33s
epoch 48 | loss: 10278.82324| 0:00:34s
epoch 49 | loss: 9967.96875| 0:00:35s
epoch 50 | loss: 9642.63379| 0:00:35s
epoch 51 | loss: 9285.58594| 0:00:36s
epoch 52 | loss: 8993.60645| 0:00:37s
epoch 53 | loss: 8890.46582| 0:00:37s
epoch 54 | loss: 8505.64355| 0:00:38s
epoch 55 | loss: 8310.10352| 0:00:39s
epoch 56 | loss: 7661.33545| 0:00:40s
epoch 57 | loss: 7571.51367| 0:00:40s
epoch 58 | loss: 7224.49414| 0:00:41s
epoch 59 | loss: 7024.98682| 0:00:42s
epoch 60 | loss: 6703.88184| 0:00:42s
epoch 61 | loss: 6481.61816| 0:00:43s
epoch 62 | loss: 6211.16309| 0:00:44s
epoch 63 | loss: 5893.41602| 0:00:44s
epoch 64 | loss: 5768.60596| 0:00:45s
epoch 65 | loss: 5505.07666| 0:00:46s
epoch 66 | loss: 5340.00146| 0:00:46s
epoch 67 | loss: 5207.09424| 0:00:47s
epoch 68 | loss: 4787.2002| 0:00:48s
epoch 69 | loss: 4758.07373| 0:00:48s
epoch 70 | loss: 4562.87109| 0:00:49s
epoch 71 | loss: 4998.98828| 0:00:50s
[CV] m__gamma=1.3745401188473625, m__mask_type=sparsemax, m__momentum=0.019673133446482256, m__n_a=15, m__n_d=28, m__n_independent=3, m__n_shared=2, m__n_steps=1, score=(train=nan, test=nan), total= 50.7sC:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_validation.py:552: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_validation.py", line 531, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 173, in fit
self._train_epoch(train_dataloader)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 349, in _train_epoch
batch_logs = self._train_batch(X, y)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 384, in _train_batch
output, M_loss = self.network(X)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 276, in forward
return self.tabnet(x)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 144, in forward
M = self.att_transformers[step](prior, att)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 322, in forward
x = self.selector(x)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 91, in forward
return sparsemax(input, self.dim)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 43, in forward
tau, supp_size = SparsemaxFunction._threshold_and_support(input, dim=dim)
File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\sparsemax.py", line 76, in _threshold_and_support
tau = input_cumsum.gather(dim, support_size - 1)
RuntimeError: index -1 is out of bounds for dimension 1 with size 2423

@Optimox
Copy link
Collaborator

Optimox commented Nov 23, 2020

this error tau = input_cumsum.gather(dim, support_size - 1) RuntimeError: index -1 is out of bounds for dimension 1 with size 2423 always appeared when you have nan or inf values in your training data. Please check the numpy array you are giving after sparse to dense.

It can also be due to gradient fading or explosion. You can add a clip_value which may help or change the learning rate.

@ThomasWolf0701
Copy link
Author

ThomasWolf0701 commented Nov 23, 2020

"It can also be due to gradient fading or explosion"

Checked again and the values i gave the model to fit the featureMatrix against were off. So probably the model might not able to fit anything sensible. Fixed this and the error still occurs. Will have a look at the other suggestions.

And attached the updated data.
bugFix2.zip

@ThomasWolf0701
Copy link
Author

I checked for infs, made sure that it´s float32 compatible and NaN are being imputed by the pipeline.
Also i made sure that the values are correct. So this is not the probblem.

As some of the GridSearches work and some don´t it is either a problem of the settings, as the standard settings don´t work and some of the tried ones do. Or some cross validation folds are working an some do not.

@Optimox
Copy link
Collaborator

Optimox commented Nov 23, 2020

maybe entmax is working but sparsemax is creating this error, did you check that?

@ThomasWolf0701
Copy link
Author

ThomasWolf0701 commented Nov 23, 2020

maybe entmax is working but sparsemax is creating this error, did you check that?

I ran the gridsearch on the cpu using only entmax and for the cpu this fixed the problem, even when varying the other parameters. Will also check for the gpu (cuda).

@ThomasWolf0701
Copy link
Author

While it now works for the cpu with entmax, for GPU I still get the following error:

[Parallel(n_jobs=1)]: Done 50 out of 50 | elapsed: 21.3s finished
Traceback (most recent call last):

File "", line 1, in
tab_search.fit(np.array(featureMatrix.iloc[:,:].sparse.to_dense().values),np.array(values))

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
return f(**kwargs)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\model_selection_search.py", line 765, in fit
self.best_estimator_.fit(X, y, **fit_params)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 335, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 159, in fit
self._set_network()

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\abstract_model.py", line 474, in _set_network
mask_type=self.mask_type,

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\pytorch_tabnet\tab_network.py", line 272, in init
self.to(self.device)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 612, in to
return self._apply(convert)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 359, in _apply
module._apply(fn)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 359, in _apply
module._apply(fn)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 381, in _apply
param_applied = fn(param)

File "C:\Users\Thomas Wolf\anaconda3\envs\my-rdkit-env\lib\site-packages\torch\nn\modules\module.py", line 610, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)

RuntimeError: CUDA error: device-side assert triggered

@ThomasWolf0701
Copy link
Author

Also ran the cpu again and even with entmax the erro is not completely gone just seems to happen less often.

@Optimox
Copy link
Collaborator

Optimox commented Dec 6, 2020

@ThomasWolf0701 Could you share some examples of parameters which trigger the error and some that don't?

@Optimox
Copy link
Collaborator

Optimox commented Dec 18, 2020

Alright, I'm closing this as I could not reproduce it easily. Please feel free to reopen with more information when you have some to share.

@Optimox Optimox closed this as completed Dec 18, 2020
@Promisery
Copy link

Are you using a RTX 30x0 GPU? I'm encountering the same problem. I changed the scale = *** line to init method of GLU, and got cuDNN error: CUDNN_STATUS_EXECUTION_FAILED. Maybe it's caused by CUDNN.
I'm currently using CUDA 11.2 and cuDNN 8.0.5

@ThomasWolf0701
Copy link
Author

Alright, I'm closing this as I could not reproduce it easily. Please feel free to reopen with more information when you have some to share.

Ok will try to run with different settings and can check which ones crash.

@ThomasWolf0701
Copy link
Author

Are you using a RTX 30x0 GPU? I'm encountering the same problem. I changed the scale = *** line to init method of GLU, and got cuDNN error: CUDNN_STATUS_EXECUTION_FAILED. Maybe it's caused by CUDNN.
I'm currently using CUDA 11.2 and cuDNN 8.0.5

Im am using a GTX 1070 and Cuda 10.2

@Vishal-ib97
Copy link

I'm facing the same issue. Were you able to resolve this?

@soonwoojung
Copy link

soonwoojung commented Jun 7, 2021

I'm facing the same issue. // after reducing gradien explosion, it seems not occuring. i changed weight decay higher

@bhavana3
Copy link

bhavana3 commented Feb 10, 2022

@Optimox I am facing the same issue to train a model on tabular data, from the call stack, its coming from the line tau_star = tau.gather(dim, support_size - 1)
RuntimeError: Invalid index in gather at /opt/conda/conda-bld/pytorch_1579022034529/work/aten/src/TH/generic/THTensorEvenMoreMath.cpp:657

I am trying to build a simple model with below hyper parameters. Data is tabular and most of them are float64.

model = TabNetClassifier(n_d=8, n_a=8, n_steps=1, gamma=1.3,
lambda_sparse=0,optimizer_fn=torch.optim.Adam,
optimizer_params=dict(lr=2e-2),
mask_type='entmax', device_name='cpu', output_dim=1,
scheduler_params=dict(milestones=[100,150], gamma=0.9),
scheduler_fn=torch.optim.lr_scheduler.MultiStepLR)
#'sparsemax'

        model.fit(X_train=X_train, y_train=y_train, eval_set=[(X_train, y_train), (X_val, y_val)], max_epochs=EPOCHS, patience=20, batch_size=1024, virtual_batch_size=128, eval_name=['train', 'valid'],eval_metric=['auc'], num_workers=0, weights=1, drop_last=False)

I am unsure what these params are, Could you please suggest/assist here

@soonwoojung

@Optimox
Copy link
Collaborator

Optimox commented Feb 14, 2022

@bhavana3 try setting a gradient clipping limit clip_value = 5 for example

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

9 participants