Attempting to register factory for plugin cuDNN/cuFFT/cuBLAS on Linux install #2263

dr460neye · 2024-04-12T13:07:07Z

Hi there,

I have now tried every feasible way to install the WebUI on a Linux server with multiple GPUs.

There are some smaller issues identified:

The Accelerate configuration is not set up during first-time setup.
The tensorflow version in use (2.15) has issues that prevent the GPU from being used.

When I run the usual commands to check the CUDA version and verify the installation, only the tensorflow and torch checks fail.
This problem was fixed for Ubuntu in version 2.16.

It seems to have been detected for WSL users, but it still appears on other Ubuntu installations.
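For anyone trying to reproduce the check, here is a minimal sketch of the kind of verification described above. It is not the exact set of commands used in this report; the function name `gpu_report` is made up for illustration. It asks the driver (`nvidia-smi -L`), PyTorch, and TensorFlow in turn what GPUs they see, and tolerates either framework being absent or broken.

```python
import importlib
import shutil
import subprocess


def gpu_report():
    """Collect what the driver, PyTorch, and TensorFlow each report about GPUs."""
    report = {}

    # Driver level: nvidia-smi is normally only present when the NVIDIA driver is installed.
    smi = shutil.which("nvidia-smi")
    if smi is None:
        report["driver"] = "nvidia-smi not found"
    else:
        result = subprocess.run([smi, "-L"], capture_output=True, text=True)
        report["driver"] = result.stdout.strip() or result.stderr.strip()

    # PyTorch: this is the framework the training scripts rely on.
    try:
        torch = importlib.import_module("torch")
        report["torch"] = {
            "version": torch.__version__,
            "cuda_build": torch.version.cuda,
            "cuda_available": torch.cuda.is_available(),
            "device_count": torch.cuda.device_count(),
        }
    except Exception as exc:  # missing package or broken native libraries
        report["torch"] = f"unavailable: {exc!r}"

    # TensorFlow: an empty GPU list means TensorFlow itself found no usable device.
    try:
        tf = importlib.import_module("tensorflow")
        report["tensorflow"] = {
            "version": tf.__version__,
            "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
        }
    except Exception as exc:
        report["tensorflow"] = f"unavailable: {exc!r}"

    return report


if __name__ == "__main__":
    for name, info in gpu_report().items():
        print(f"{name}: {info}")
```

Comparing the three entries helps separate a driver problem (nothing sees the GPUs) from a packaging problem (the driver and one framework see them, the other does not).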

Server:

4 Intel CPUs with 8 cores each
4 NVIDIA Tesla P100
Python 3.10.12
Ubuntu Server 22.04
Non-Gui installation / No X-Server

Errors:

accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=2 "/home/excel/kohya_ss/sd-scripts/train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --huber_c="0.1" --huber_schedule="snr" --learning_rate="0.0001" --logging_dir="/home/excel/kohya_ss/logs" --loss_type="l2" --lr_scheduler="cosine" --lr_scheduler_num_cycles="1" --lr_warmup_steps="57" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="512,512" --max_train_steps="570" --min_timestep=0 --mixed_precision="fp16" --network_alpha="1" --network_dim=8 --network_module=networks.lora --optimizer_type="AdamW8bit" --output_dir="/home/excel/kohya_ss/outputs/hedgeforest" --output_name="hedgeforest" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="fp16" --text_encoder_lr=0.0001 --train_batch_size="1" --train_data_dir="/home/excel/bildersets/hedgehogs/images" --unet_lr=0.0001 --xformers
2024-04-11 16:08:53.335832: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-11 16:08:53.376158: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-11 16:08:53.376187: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-11 16:08:53.377607: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-11 16:08:53.384410: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-11 16:08:53.384622: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-11 16:08:54.317341: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

So I kindly request an upgrade to tensorflow 2.16, and that the "cuda" option for the pip package be included by default in a requirements_nvidia.txt.

bmaltais · 2024-04-12T14:34:47Z

Not sure I totally get what changes you are asking for... there is no requirements_nvidia.txt file... so what actual requirements do you want to see in it?

bmaltais · 2024-04-12T14:35:29Z

Which requirements file needs the upgrade to tensorflow 2.16? How do you install the GUI? Do you use special parameters to specify the requirements file? I don't use Linux, so I am not familiar with how the current solution needs to be changed to work properly on your platform...

dr460neye · 2024-04-12T15:03:05Z

I switched to: tensorboard==2.16.2 tensorflow[and-cuda]==2.16.1
The tensorflow[and-cuda] package ensures that cuDNN etc. are installed.
2.16 is used to ensure that the error on Ubuntu WSL and Server is fixed.

As not every GPU is an NVIDIA one, I suggested adding a requirements_nvidia.txt that contains the CUDA package by default, while other setups keep using the normal requirements_linux.txt.
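One way the proposed split could be wired up is sketched below. This is only an illustration under stated assumptions: `requirements_nvidia.txt` is the hypothetical file suggested in this thread and does not exist in the repository, `pick_requirements` is a made-up helper, and the presence of `nvidia-smi` on the PATH is used as a rough stand-in for "this machine has an NVIDIA GPU".

```python
import shutil
from pathlib import Path


def pick_requirements(repo_dir=".", has_nvidia=None):
    """Return the requirements file to install for this machine.

    requirements_nvidia.txt is hypothetical (proposed in this thread);
    requirements_linux.txt is the existing default.
    """
    if has_nvidia is None:
        # Assumption: nvidia-smi on PATH implies an NVIDIA driver is installed.
        has_nvidia = shutil.which("nvidia-smi") is not None

    nvidia_file = Path(repo_dir) / "requirements_nvidia.txt"
    default_file = Path(repo_dir) / "requirements_linux.txt"

    # Fall back to the default file when the NVIDIA-specific one is not present.
    if has_nvidia and nvidia_file.exists():
        return nvidia_file
    return default_file


if __name__ == "__main__":
    print(pick_requirements())
```

The fallback keeps non-NVIDIA setups, and checkouts without the extra file, on the existing install path.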

bmaltais · 2024-04-12T16:29:26Z

But how will you call this? Is setup.sh going to handle this as is? I think it would be best if you created a pull request proposing all the code changes needed to make this work properly. That way I can merge it and others will be able to use it…

ja1496 · 2024-04-13T15:07:29Z

I run it by pulling kohya_ss on an Ubuntu system. Before running setup.sh, modify the requirements in both requirements_linux.txt and requirements_linux_docker.txt:
tensorboard==2.16.2 tensorflow==2.16.1 and torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
That solved the issues above.
I also suspect that having Secure Boot enabled in the BIOS kept the NVIDIA driver from working on Ubuntu, which may have caused the problem. But now I can train normally on multiple GPUs, or run distributed training with DeepSpeed.

ja1496 · 2024-04-13T15:11:43Z

Python 3.10.11
Ubuntu 22.04
nvidia driver 545

sirius422 · 2024-06-02T18:41:03Z

> I switched to: tensorboard==2.16.2 tensorflow[and-cuda]==2.16.1 The tensorflow[and-cuda] package ensures that cuDNN etc. are installed. 2.16 is used to ensure that the error on Ubuntu WSL and Server is fixed.
>
> As not every GPU is an NVIDIA one, I suggested adding a requirements_nvidia.txt that contains the CUDA package by default, while other setups keep using the normal requirements_linux.txt.

Installing tensorflow[and-cuda] and adding this little script from the TensorFlow issue to gui.sh does solve the problem.

Now I get [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')] when running python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))", and no errors occur.

As for TensorRT, you may check this: download and extract the tar file, set LD_LIBRARY_PATH in gui.sh, and use a symlink if needed.

b-fission · 2024-06-02T19:30:20Z

kohya uses pytorch for GPU training, so any messages from tensorflow saying "unable to register ____ factory" or "could not find cuda drivers" can be ignored.

There's no practical use for installing a cuda-enabled build of tensorflow. It's only brought in as a dependency for tensorboard.
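If the messages are only noise, one option is to turn down TensorFlow's native logging rather than change the installed packages. The sketch below builds an environment with `TF_CPP_MIN_LOG_LEVEL` set, which TensorFlow reads at import time (0 shows everything, 3 filters INFO, WARNING, and ERROR). The helper name `quiet_tf_env` is made up for illustration, and it is an assumption that this silences every line in the log above; some early-startup messages may still be printed depending on the TensorFlow version.

```python
import os


def quiet_tf_env(base_env=None, level="3"):
    """Return a copy of the environment with TensorFlow's native log level raised."""
    env = dict(os.environ if base_env is None else base_env)
    # Must be set before TensorFlow is imported in the target process.
    env["TF_CPP_MIN_LOG_LEVEL"] = level
    return env


if __name__ == "__main__":
    print(quiet_tf_env()["TF_CPP_MIN_LOG_LEVEL"])
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching the training command, so the setting applies only to that process and leaves the shell untouched.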
