Getting Error in run_miscnn #16

Closed
anirban1513 opened this issue Dec 7, 2020 · 2 comments

@anirban1513

File "scripts/run_miscnn.py", line 120, in <module>
save_models=False)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/miscnn/evaluation/cross_validation.py", line 166, in run_fold
iterations=iterations, callbacks=cb_list)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/miscnn/neural_network/model.py", line 201, in evaluate
max_queue_size=self.batch_queue_size)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
tmp_logs = train_function(iterator)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
cancellation_manager=cancellation_manager)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
ctx=ctx)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node functional_1/conv3d/Conv3D (defined at /home/ubuntu/anaconda3/lib/python3.7/site-packages/miscnn/neural_network/model.py:201) ]] [Op:__inference_train_function_9673]

Function call stack:
train_function

2020-12-07 06:04:13.638644: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated

@anirban1513
Author

I am getting this error while using CUDA 11.0 on AWS EC2.

W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcupti.so.10.1'; dlerror: libcupti.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/usr/local/cuda/lib:/opt/amazon/efa/lib:/usr/local/mpi/lib:/usr/lib64/openmpi/lib/:/usr/local/cuda/lib64:/usr/local/lib:/usr/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/mpi/lib:/lib/:
2020-12-07 10:21:30.923374: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1408] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2020-12-07 10:21:30.923392: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1447] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI could not be loaded or symbol could not be found.
2020-12-07 10:21:30.923423: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1430] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI could not be loaded or symbol could not be found.
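
One thing that can be read directly off the warning above: TensorFlow is trying to load `libcupti.so.10.1`, while the reported toolkit is CUDA 11.0. Whether that mismatch is the actual cause of the cuDNN failure is not confirmed anywhere in this thread, so treat the following only as a rough sketch for checking it. The helper names `requested_cuda_version` and `versions_match` are made up for illustration, and the log line is a shortened copy of the warning (the `LD_LIBRARY_PATH` part is dropped).

```python
import re


def requested_cuda_version(log_line):
    """Return the version suffix of the library named in a
    'Could not load dynamic library' warning, or None if absent."""
    match = re.search(
        r"Could not load dynamic library '(lib\w+)\.so\.([\d.]+)'", log_line
    )
    if match is None:
        return None
    return match.group(2)


def versions_match(requested, installed):
    """Compare only the major.minor parts of two version strings."""
    return requested.split(".")[:2] == installed.split(".")[:2]


# Shortened excerpt of the dso_loader warning from the comment above.
log_line = (
    "W tensorflow/stream_executor/platform/default/dso_loader.cc:55] "
    "Could not load dynamic library 'libcupti.so.10.1'; dlerror: "
    "libcupti.so.10.1: cannot open shared object file: No such file or directory"
)

installed_cuda = "11.0"  # as reported by the issue author

requested = requested_cuda_version(log_line)
print("TensorFlow asked for CUDA library version:", requested)  # 10.1
print("Matches installed toolkit:", versions_match(requested, installed_cuda))  # False
```

If the two versions differ, a likely next step would be to compare the installed TensorFlow release against the CUDA and cuDNN versions it was built for.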

@muellerdo
Member

Hey @anirban1513,

This looks like a TensorFlow exception caused by a corrupt or missing driver or by faulty hardware.

I would recommend running a simple TensorFlow script and checking the TensorFlow GitHub repository for help:
https://github.com/tensorflow/tensorflow/issues

Cheers,
Dominik
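
For anyone landing here with the same error, a simple TensorFlow script of the kind suggested above might look like the sketch below. It is not part of MIScnn and was not posted by the maintainer. It assumes TensorFlow 2.1 or newer (for `tf.config.list_physical_devices`) and runs a single tiny 3D convolution, the same kind of op (`Conv3D`) that fails in the traceback. The function name `gpu_smoke_test` is made up for illustration.

```python
# Minimal GPU/cuDNN smoke test, independent of MIScnn.
try:
    import tensorflow as tf
except ImportError:  # report a missing install instead of crashing
    tf = None


def gpu_smoke_test():
    """Run one tiny 3D convolution and return a short status string."""
    if tf is None:
        return "TensorFlow is not installed in this environment"

    print("TensorFlow version:", tf.__version__)
    print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

    # Input layout is NDHWC: batch, depth, height, width, channels.
    x = tf.random.normal([1, 8, 8, 8, 1])
    # Filter layout: depth, height, width, in_channels, out_channels.
    kernel = tf.random.normal([3, 3, 3, 1, 2])

    try:
        y = tf.nn.conv3d(x, kernel, strides=[1, 1, 1, 1, 1], padding="SAME")
    except tf.errors.UnknownError as err:
        # The same error class that appears in the traceback above.
        return "Conv3D failed: " + str(err)
    return "Conv3D succeeded, output shape " + str(tuple(y.shape))


if __name__ == "__main__":
    print(gpu_smoke_test())
```

If this script fails with the same `Failed to get convolution algorithm` message, the problem most likely lies in the TensorFlow, CUDA, or cuDNN setup rather than in MIScnn. If it succeeds, the MIScnn configuration (for example patch size and GPU memory use) would be the next place to look.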

@muellerdo self-assigned this on Dec 7, 2020
@muellerdo added the wontfix (This will not be worked on) label on Dec 7, 2020