Attempting to register factory for plugin cuDNN/cuFFT/cuBLAS on Linux install #2263

Open
dr460neye opened this issue Apr 12, 2024 · 8 comments

Comments

@dr460neye

Hi there,

I have now tried every feasible way to install the WebUI on a Linux server with multiple GPUs.

I identified a few smaller issues:

  • Accelerate is not configured during first-time setup.
  • The TensorFlow version in use (2.15) has issues that cause the GPU not to be used.

When I run the usual commands to check the CUDA version and verify the installation, only the TensorFlow and torch checks fail.
This problem was fixed for Ubuntu in TensorFlow 2.16.

It seems the problem was detected for WSL users, but it still appears on other Ubuntu installations.
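For anyone who wants to reproduce the check, here is a minimal sketch of the kind of verification meant above. The function name `gpu_report` is made up for illustration, and it only assumes that torch and/or tensorflow may be installed in the active venv; it is not the exact set of commands that was run.

```python
import importlib


def gpu_report():
    """Collect what each framework says about GPUs, or note that it is missing."""
    report = {}
    try:
        torch = importlib.import_module("torch")
        report["torch"] = {
            "cuda_available": torch.cuda.is_available(),
            "device_count": torch.cuda.device_count(),
        }
    except ImportError:
        report["torch"] = "not installed"
    try:
        tf = importlib.import_module("tensorflow")
        # An empty list means TensorFlow sees no GPU at all.
        report["tensorflow"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    except ImportError:
        report["tensorflow"] = "not installed"
    return report


if __name__ == "__main__":
    for name, result in gpu_report().items():
        print(f"{name}: {result}")
```

If torch reports the devices but tensorflow returns an empty list, only the TensorFlow side of the install is affected.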

Server:

  • 4 Intel CPUs with 8 cores each
  • 4 NVIDIA Tesla P100 GPUs
  • Python 3.10.12
  • Ubuntu Server 22.04
  • Non-GUI installation / no X server

Errors:

accelerate launch --mixed_precision="fp16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=2 "/home/excel/kohya_ss/sd-scripts/train_network.py" --bucket_no_upscale --bucket_reso_steps=64 --cache_latents --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --huber_c="0.1" --huber_schedule="snr" --learning_rate="0.0001" --logging_dir="/home/excel/kohya_ss/logs" --loss_type="l2" --lr_scheduler="cosine" --lr_scheduler_num_cycles="1" --lr_warmup_steps="57" --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="512,512" --max_train_steps="570" --min_timestep=0 --mixed_precision="fp16" --network_alpha="1" --network_dim=8 --network_module=networks.lora --optimizer_type="AdamW8bit" --output_dir="/home/excel/kohya_ss/outputs/hedgeforest" --output_name="hedgeforest" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --save_every_n_epochs="1" --save_model_as=safetensors --save_precision="fp16" --text_encoder_lr=0.0001 --train_batch_size="1" --train_data_dir="/home/excel/bildersets/hedgehogs/images" --unet_lr=0.0001 --xformers
2024-04-11 16:08:53.335832: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-11 16:08:53.376158: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-11 16:08:53.376187: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-11 16:08:53.377607: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-11 16:08:53.384410: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-11 16:08:53.384622: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-11 16:08:54.317341: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

So I kindly request an upgrade to tensorflow 2.16, and that the "cuda" options for the pip package installation be added as the default in a new requirements_nvidia.txt.

@bmaltais
Owner

Not sure I totally get what changes you are asking for... there is no requirements_nvidia.txt file... so what actual requirements do you want to see in it?

@bmaltais
Owner

bmaltais commented Apr 12, 2024

Which requirements file needs the upgrade to tensorflow 2.16? How do you install the GUI? Do you use special parameters to specify the requirements file? I don't use Linux, so I am not familiar with how the current solution needs to be changed to work properly on your platform...

@dr460neye
Author

I switched to: tensorboard==2.16.2 tensorflow[and-cuda]==2.16.1
Specifying the tensorflow package this way ensures that cuDNN etc. are installed.
Version 2.16 is used to ensure that the error on Ubuntu WSL and Server is fixed.

As not every GPU is an NVIDIA GPU, I suggested adding a requirements_nvidia.txt that contains the CUDA package by default, while other setups keep using the normal requirements_linux.txt.
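A rough sketch of how the choice between the two files could be made. Both function names are hypothetical, requirements_nvidia.txt is only the file proposed here and does not exist yet, and detecting an NVIDIA setup by looking for `nvidia-smi` on the PATH is an assumption, not what setup.sh does today.

```python
import shutil


def has_nvidia_driver():
    """Assumption: an NVIDIA driver install puts nvidia-smi on the PATH."""
    return shutil.which("nvidia-smi") is not None


def pick_requirements(nvidia=None):
    """Return the requirements file name to install from."""
    if nvidia is None:
        nvidia = has_nvidia_driver()
    # requirements_nvidia.txt is the proposed file, not an existing one.
    return "requirements_nvidia.txt" if nvidia else "requirements_linux.txt"


if __name__ == "__main__":
    print(pick_requirements())
```

The printed name could then be passed to `pip install -r` by whatever script drives the setup.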

@bmaltais
Owner

But how will you call this? Is the setup.sh going to handle this as is? I think the best would be if you create a pull request to propose all the needed code change to make this work properly. That way I can merge it and others will be able to use it…

@ja1496

ja1496 commented Apr 13, 2024

I run it by pulling kohya_ss on an Ubuntu system. Before running setup.sh, modify the requirements in both requirements_linux.txt and requirements_linux_docker.txt:
tensorboard==2.16.2 tensorflow==2.16.1 and torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
This solved the above issues.
But I also believe the problem may have been caused by Secure Boot being enabled in the BIOS, which kept the NVIDIA driver on Ubuntu from working. Either way, I can now run normally with multiple GPUs or do distributed training with DeepSpeed.
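To confirm the venv actually ended up with these versions after setup.sh, something like the following sketch may help. It reads the pins above as torch, torchvision and torchaudio (an assumption about the intended package names), and `check_pins` is a made-up helper name.

```python
from importlib import metadata

# Pins taken from the comment above; package names are an interpretation.
PINS = {
    "tensorboard": "2.16.2",
    "tensorflow": "2.16.1",
    "torch": "2.2.1",
    "torchvision": "0.17.1",
    "torchaudio": "2.2.1",
}


def check_pins(pins=PINS):
    """Map each package to (installed version or None, whether it matches the pin)."""
    result = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Wheels from the cu121 index report versions like "2.2.1+cu121",
        # so only the part before "+" is compared.
        ok = installed is not None and installed.split("+")[0] == wanted
        result[name] = (installed, ok)
    return result


if __name__ == "__main__":
    for name, (installed, ok) in check_pins().items():
        print(f"{name}: installed={installed} matches_pin={ok}")
```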

@ja1496

ja1496 commented Apr 13, 2024

Python 3.10.11
Ubuntu 22.04
nvidia driver 545

@sirius422

sirius422 commented Jun 2, 2024

I switched to: tensorboard==2.16.2 tensorflow[and-cuda]==2.16.1 Specifying the tensorflow package this way ensures that cuDNN etc. are installed. Version 2.16 is used to ensure that the error on Ubuntu WSL and Server is fixed.

As not every GPU is an NVIDIA GPU, I suggested adding a requirements_nvidia.txt that contains the CUDA package by default, while other setups keep using the normal requirements_linux.txt.

Installing tensorflow[and-cuda] and adding this little script from the TensorFlow issue to gui.sh does solve the problem.

Now I get [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')] when running python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))", and no errors occur.

As for TensorRT, you may check this: download and extract the tar file, set up LD_LIBRARY_PATH in gui.sh, and use a symlink if needed.
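For the LD_LIBRARY_PATH part, here is a small sketch of the general idea. It is not the script from the TensorFlow issue. It assumes the CUDA libraries pulled in by pip live under `nvidia/<component>/lib` inside site-packages, and both function names are made up.

```python
import os
import site
from pathlib import Path


def nvidia_lib_dirs(site_packages):
    """Return the nvidia/*/lib directories under one site-packages path, sorted."""
    root = Path(site_packages) / "nvidia"
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.glob("*/lib") if p.is_dir())


def build_ld_library_path(dirs, current=""):
    """Prepend dirs to an existing LD_LIBRARY_PATH value."""
    parts = list(dirs) + ([current] if current else [])
    return os.pathsep.join(parts)


if __name__ == "__main__":
    found = []
    for sp in site.getsitepackages():
        found.extend(nvidia_lib_dirs(sp))
    print(build_ld_library_path(found, os.environ.get("LD_LIBRARY_PATH", "")))
```

In gui.sh the printed value could be exported before launching. A directory holding the extracted TensorRT libraries could be appended to the list in the same way.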

@b-fission
Contributor

b-fission commented Jun 2, 2024

kohya uses pytorch for GPU training, so any messages from tensorflow saying "unable to register ____ factory" or "could not find cuda drivers" can be ignored.

There's no practical use for installing a cuda-enabled build of tensorflow. It's only brought in as a dependency for tensorboard.
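If the noise makes it hard to spot real problems in a training log, a filter along these lines can separate it out. This is only a sketch: the patterns are copied from the two kinds of messages mentioned above, and `split_log` is a made-up helper, not part of kohya or tensorflow.

```python
import re

# The two message families described above as safe to ignore.
IGNORABLE = re.compile(
    r"Could not find cuda drivers"
    r"|Unable to register cu(?:DNN|FFT|BLAS) factory"
)


def split_log(lines):
    """Split log lines into (ignorable TensorFlow noise, everything else)."""
    noise, rest = [], []
    for line in lines:
        (noise if IGNORABLE.search(line) else rest).append(line)
    return noise, rest
```

Anything left in the second list is worth reading; the first list only reflects tensorflow being imported for tensorboard.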
