Subfolders name issue in Linux pod #406

Closed
Norian11 opened this issue Mar 20, 2023 · 13 comments

Comments

@Norian11

I'm in a Linux pod and launched the UI, but when I click Train it always raises an error saying the dataset folder name is wrong, even though everything looks normal. Does anyone know why this happens? My dataset subfolder is 15_sks, but it produces the following error:

  File "/workspace/kohya_ss/lora_gui.py", line 407, in train_model
    repeats = int(folder.split('_')[0])
ValueError: invalid literal for int() with base 10: '.ipynb'

I think it should be okay to remove that line and write "repeats = 5" instead. I'm not sure whether that would cause other errors, but I will try.

By the way, I was looking in the code for the part where it defines the instance-token and class-token based on the subfolder name, and I didn't find it. Does the name not matter anymore, or does someone know where I can find that part of the code?

@bmaltais
Owner

Can you paste the full command being run? This looks like the subfolder name is not being read properly...

Try adding print(folder) right before line 407 to print out what it is trying to parse... might help troubleshoot...
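
For anyone following along, here is a minimal sketch of that debugging step. It assumes the GUI loops over the entries of the training data directory; the helper name `debug_subfolders` and the loop itself are illustrative, not the actual code in lora_gui.py.

```python
import os

def debug_subfolders(train_data_dir):
    """Print every entry and record whether its repeat prefix parses as an int."""
    results = {}
    for folder in sorted(os.listdir(train_data_dir)):
        # repr() makes hidden or oddly named entries easy to spot
        print(repr(folder))
        prefix = folder.split('_')[0]
        try:
            results[folder] = int(prefix)
        except ValueError:
            # This entry is the kind that makes int(folder.split('_')[0]) fail
            results[folder] = None
    return results
```

Any entry mapped to None is one the repeat parsing cannot handle.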

@Norian11
Author

OK, I will do that. I just need two hours to get to my computer, then I will send the full error.

@bmaltais
Owner

bmaltais commented Mar 20, 2023

Hopefully that print(folder) line will shed some light on why it is failing to extract the repeat value from the subfolder name.

@Norian11
Author

Norian11 commented Mar 20, 2023

Hi, a question: with print(folder), do you mean that I have to put the path of a specific folder there, or just leave it as written?

Here is the longer traceback for that issue:

Folder 15_sks: 30 steps
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/routes.py", line 384, in run_predict
    output = await app.get_blocks().process_api(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1024, in process_api
    result = await self.call_function(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/gradio/blocks.py", line 836, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/workspace/kohya_ss/dreambooth_gui.py", line 336, in train_model
    repeats = int(folder.split('_')[0])
ValueError: invalid literal for int() with base 10: '.ipynb'

@Norian11
Author

Norian11 commented Mar 20, 2023

Well, I tried just writing "repeats = 5", but it generates a lot of other errors. I think this can't be run on services like vast.ai that use a Jupyter UI. I'm going to try again in a Linux Desktop Template that has a normal desktop interface.

Folder 15_sks: 2 images found
Folder 15_sks: 10 steps
Folder .ipynb_checkpoints: 0 images found
Folder .ipynb_checkpoints: 0 steps
max_train_steps = 10
stop_text_encoder_training = 0
lr_warmup_steps = 1
accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket --pretrained_model_name_or_path="/workspace/Lora/pretrained/Deliberate.safetensors" --train_data_dir="/workspace/Lora/data" --resolution=512,512 --output_dir="/workspace/Lora/output" --logging_dir="/workspace/Lora/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=5e-5 --unet_lr=0.0001 --network_dim=8 --output_name="last" --lr_scheduler_num_cycles="1" --learning_rate="0.0001" --lr_scheduler="cosine" --lr_warmup_steps="1" --train_batch_size="1" --max_train_steps="10" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --optimizer_type="AdamW" --bucket_reso_steps=64 --xformers --bucket_no_upscale
2023-03-20 15:58:05.138133: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-20 15:58:05.294123: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-20 15:58:05.895761: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-03-20 15:58:05.895830: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-03-20 15:58:05.895841: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-20 15:58:07.883114: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-20 15:58:08.044271: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-20 15:58:08.627422: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-03-20 15:58:08.627489: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-03-20 15:58:08.627501: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /workspace/kohya_ss/venv/lib/python3.10/site-packages/xformers/_C.so)
WARNING:root:WARNING: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /workspace/kohya_ss/venv/lib/python3.10/site-packages/xformers/_C.so)
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
Traceback (most recent call last):
  File "/workspace/kohya_ss/train_network.py", line 16, in <module>
    import library.train_util as train_util
  File "/workspace/kohya_ss/library/train_util.py", line 39, in <module>
    import albumentations as albu
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/albumentations/__init__.py", line 5, in <module>
    from .augmentations import *
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/albumentations/augmentations/__init__.py", line 2, in <module>
    from .blur.functional import *
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/albumentations/augmentations/blur/__init__.py", line 1, in <module>
    from .functional import *
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/albumentations/augmentations/blur/functional.py", line 5, in <module>
    import cv2
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
  File "/opt/conda/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/workspace/kohya_ss/venv/bin/python3', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=/workspace/Lora/pretrained/Deliberate.safetensors', '--train_data_dir=/workspace/Lora/data', '--resolution=512,512', '--output_dir=/workspace/Lora/output', '--logging_dir=/workspace/Lora/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=8', '--output_name=last', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=1', '--train_batch_size=1', '--max_train_steps=10', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.

@adrianlungu
Contributor

Just ran into this issue as well. As it turns out, there was a folder called .ipynb_checkpoints in the image folder.

Using ls -la you can check if there are any hidden folders.

This is what was making repeats = int(folder.split('_')[0]) fail: for .ipynb_checkpoints the part before the first underscore is '.ipynb', which is not a number.

Maybe it's a good idea to ignore hidden folders? Some systems generate hidden folders for various reasons, and those should probably be skipped when deducing the number of repeats.
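
A minimal sketch of what skipping hidden folders could look like. It assumes subfolders are named "<repeats>_<name>" as in 15_sks; the function name `read_repeats` is hypothetical, and this is not the change that was later pushed to the repository.

```python
import os

def read_repeats(train_data_dir):
    """Map each dataset subfolder to its repeat count, skipping hidden entries."""
    repeats = {}
    for folder in sorted(os.listdir(train_data_dir)):
        path = os.path.join(train_data_dir, folder)
        # Skip hidden entries such as .ipynb_checkpoints, and plain files
        if folder.startswith('.') or not os.path.isdir(path):
            continue
        try:
            repeats[folder] = int(folder.split('_')[0])
        except ValueError:
            # Name is not in "<repeats>_<name>" form; skip it instead of crashing
            continue
    return repeats
```

With a data directory that contains 15_sks and .ipynb_checkpoints, only 15_sks ends up in the result.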

@adrianlungu
Contributor

@Norian11 as for your second error, which I also ran into afterwards: it seems some dependencies are missing in some Docker containers, as suggested by the line ImportError: libGL.so.1: cannot open shared object file: No such file or directory.

I ran apt-get update && apt-get install ffmpeg libsm6 libxext6 -y inside the container via ssh and then got past that error.
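
To check for this before launching a training run, something like the sketch below may help. It uses the standard-library `ctypes.util.find_library`; the helper name `find_missing_libraries` is made up for illustration. One caveat: in very minimal containers `find_library` can return None even for a library that is installed, because it relies on tools such as ldconfig, so treat the result as a hint only.

```python
import ctypes.util

def find_missing_libraries(names=("GL",)):
    """Return the library names that find_library cannot locate."""
    return [name for name in names if ctypes.util.find_library(name) is None]

missing = find_missing_libraries()
if missing:
    # "GL" corresponds to the libGL.so.1 named in the ImportError above
    print("Missing shared libraries:", ", ".join(missing))
```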

@bmaltais
Owner

This is a good idea... I will add a check for hidden folders and ignore them... I bet someone will eventually complain, but I think more users will benefit from it ;-)

@bmaltais
Owner

I have pushed the fix to the dev branch.

@adrianlungu
Contributor

There is still an issue on my instance that I haven't been able to get past yet, but I'm going to open another issue since it's unrelated to this one.

bmaltais mentioned this issue Mar 22, 2023

@Norian11
Author

I haven't tried it yet. I will launch it today, but judging by the comments it seems solved. Thank you so much for the solution; now we can finally run it without paying that much for Colab. Thanks for your work!

@ohminy

ohminy commented Mar 24, 2023

How can I ignore hidden folders? Did you find the right way?

@adrianlungu
Contributor

@ohminy this was already updated by @bmaltais over here: #424

I personally just deleted the hidden folders in the meantime.
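
If you are on an older version and want to do the same cleanup from Python, a rough sketch could look like this. The function name `remove_checkpoint_dirs` is hypothetical. It deletes directories permanently, so point it only at your dataset folder.

```python
import os
import shutil

def remove_checkpoint_dirs(root, name=".ipynb_checkpoints"):
    """Delete every directory called `name` under root and return the removed paths."""
    removed = []
    for current, dirnames, _ in os.walk(root):
        if name in dirnames:
            target = os.path.join(current, name)
            shutil.rmtree(target)
            # Remove it from the walk list so os.walk does not descend into it
            dirnames.remove(name)
            removed.append(target)
    return removed
```

For the paths in this thread the call would be remove_checkpoint_dirs("/workspace/Lora/data").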
