This repository has been archived by the owner on Aug 15, 2019. It is now read-only.

Error when train a model "tensorflow.python.eager.profiler.ProfilerAlreadyRunningError: Another profiler is running" #262

Open
VicGrygorchyk opened this issue Mar 3, 2019 · 9 comments

Comments

@VicGrygorchyk

VicGrygorchyk commented Mar 3, 2019

Hi! When I run the command python3 faceswap.py train -A ./photo/fst -B ./photo/snd -m ./photo/models/ I get the error below (log attached).
It might be worth mentioning that I compiled tensorflow myself, because after pip install tensorflow I got a tensorflow not found error.
The extract command works without problems.
I also had the error can't find module named 'numpy.core._multiarray_umath', as mentioned in issue #261, so I updated numpy via pip install (but there is a comment that numpy 1.16 is broken, and I have 1.16.2).
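
With a self-compiled tensorflow sitting next to pip-installed packages, it can help to confirm which build of each package Python actually imports. A minimal sketch, assuming nothing beyond the standard library; the helper name describe_module is made up for illustration and is not part of faceswap:

```python
import importlib


def describe_module(name):
    """Return (version, path) for an importable module, or None if the import fails."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Not every module defines these attributes, so fall back to placeholders.
    version = getattr(module, "__version__", "unknown")
    path = getattr(module, "__file__", "built-in")
    return version, path


# Example use inside the same virtualenv that runs faceswap.py:
#   for name in ("numpy", "tensorflow"):
#       print(name, describe_module(name))
```

If the reported path points somewhere other than the expected site-packages or dist-packages directory, a different install is shadowing the one you meant to use.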

crash_report.2019.03.03.222044908664.log 
03/03/2019 22:20:25 MainProcess     training_0      _base           load_generator            DEBUG    Loading generator: b
03/03/2019 22:20:25 MainProcess     training_0      _base           load_generator            DEBUG    input_size: 64, output_size: 64
03/03/2019 22:20:25 MainProcess     training_0      training_data   __init__                  DEBUG    Initializing TrainingDataGenerator: (model_input_size: 64, model_output_shape: 64, training_opts: {'alignments': {'a': '/home/faceswap/photo/fst/alignments.json', 'b': '/home/faceswap/photo/snd/alignments.json'}, 'preview_scaling': 1.0, 'no_flip': False, 'preview_images': 14, 'training_size': 256, 'coverage_ratio': 0.625, 'mask_type': None, 'warp_to_landmarks': False, 'no_logs': False}, landmarks: False)
03/03/2019 22:20:25 MainProcess     training_0      training_data   set_mask_function         DEBUG    Mask function: None
03/03/2019 22:20:25 MainProcess     training_0      training_data   __init__                  DEBUG    Initializing ImageManipulation: (input_size: 64, output_size: 64, coverage_ratio: 0.625)
03/03/2019 22:20:25 MainProcess     training_0      training_data   __init__                  DEBUG    Initialized ImageManipulation
03/03/2019 22:20:25 MainProcess     training_0      training_data   __init__                  DEBUG    Initialized TrainingDataGenerator
03/03/2019 22:20:25 MainProcess     training_0      training_data   minibatch_ab              DEBUG    Queue batches: (image_count: 960, batchsize: 64, side: 'b', do_shuffle: True, is_timelapse: False)
03/03/2019 22:20:25 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager adding: (name: 'train_b', maxsize: 512)
03/03/2019 22:20:25 MainProcess     training_0      queue_manager   add_queue                 DEBUG    QueueManager added: (name: 'train_b')
03/03/2019 22:20:25 MainProcess     training_0      multithreading  __init__                  DEBUG    Initializing MultiThread: (target: 'load_batches', thread_count: 1)
03/03/2019 22:20:25 MainProcess     training_0      multithreading  __init__                  DEBUG    Initialized MultiThread: 'load_batches'
03/03/2019 22:20:25 MainProcess     training_0      multithreading  start                     DEBUG    Starting thread(s): 'load_batches'
03/03/2019 22:20:25 MainProcess     training_0      multithreading  start                     DEBUG    Starting th
  File "/home/faceswap/scripts/train.py", line 97, in process
    self.end_thread(thread, err)
  File "/home/faceswap/scripts/train.py", line 122, in end_thread
    thread.join()
  File "/home/faceswap/lib/multithreading.py", line 179, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "/home/faceswap/lib/multithreading.py", line 117, in run
    self._target(*self._args, **self._kwargs)
  File "/home/faceswap/scripts/train.py", line 148, in training
    raise err
  File "/home/faceswap/scripts/train.py", line 138, in training
    self.run_training_cycle(model, trainer)
  File "/home/faceswap/scripts/train.py", line 210, in run_training_cycle
    trainer.train_one_step(viewer, timelapse)
  File "/home/faceswap/plugins/train/trainer/_base.py", line 149, in train_one_step
    self.log_tensorboard(side, side_loss)
  File "/home/faceswap/plugins/train/trainer/_base.py", line 172, in log_tensorboard
    self.tensorboard[side].on_batch_end(self.model.state.iterations, logs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/callbacks_v1.py", line 362, in on_batch_end
    profiler.start()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/eager/profiler.py", line 70, in start
    raise ProfilerAlreadyRunningError('Another profiler is running.')
tensorflow.python.eager.profiler.ProfilerAlreadyRunningError: Another profiler is running.
PyWavelets==1.0.2
pyxdg==0.25
PyYAML==3.11
pyzmq==18.0.0
qtconsole==4.4.3
requests==2.9.1
scikit-image==0.14.2
scikit-learn==0.20.3
scipy==1.2.1
screen-resolution-extra==0.0.0
Send2Trash==1.5.0
six==1.12.0
ssh-import-id==5.5
sympy==1.3
system-service==0.3
tensorboard==1.13.0
tensorflow==1.13.1
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
termcolor==1.1.0
terminado==0.8.1
testpath==0.4.2
toolz==0.9.0
tornado==5.1.1
tqdm==4.31.1
traitlets==4.3.2
ufw==0.35
unattended-upgrades==0.1
urllib3==1.13.1
virtualenv==16.4.3
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.14.1
widgetsnbextension==3.4.2
xkit==0.0.0

Please point out what I'm doing wrong with my setup. Should I use a different tensorflow version?
In case it's useful, these are the logs I see before the error:

03/03/2019 22:20:20 INFO     Log level set to: INFO
Using TensorFlow backend.
03/03/2019 22:20:22 INFO     Model A Directory: /home/faceswap/photo/solo
03/03/2019 22:20:22 INFO     Model B Directory: /home/faceswap/photo/ford
03/03/2019 22:20:22 INFO     Training data directory: /home/faceswap/photo/models
03/03/2019 22:20:22 INFO     ===============================================
03/03/2019 22:20:22 INFO     - Starting                                    -
03/03/2019 22:20:22 INFO     - Press 'ENTER' to save and quit              -
03/03/2019 22:20:22 INFO     - Press 'S' to save model weights immediately -
03/03/2019 22:20:22 INFO     ===============================================
03/03/2019 22:20:23 INFO     Loading data, this may take a while...
03/03/2019 22:20:23 INFO     Loading Model from Original plugin...
03/03/2019 22:20:24 INFO     Loading config: '/home/faceswap/config/train.ini'
03/03/2019 22:20:24 WARNING  No existing state file found. Generating.
03/03/2019 22:20:25 WARNING  Failed loading existing training data. Generating new models
03/03/2019 22:20:25 INFO     Loading Trainer from Original plugin...
03/03/2019 22:20:25 INFO     Enabled TensorBoard Logging
03/03/2019 22:20:44 CRITICAL Error caught! Exiting...
03/03/2019 22:20:44 ERROR    Caught exception in thread: 'training_0'
03/03/2019 22:20:46 ERROR    Got Exception on main handler:
Traceback (most recent call last):

Thanks to anyone trying to help!

@Kirin-kun

Kirin-kun commented Mar 4, 2019

Did you stop bazel before starting to train?

# bazel shutdown

I did the same thing as you, compiling tensorflow from source, and I got this message. It didn't come back once I stopped the bazel server.

@leinine

leinine commented Mar 18, 2019

Same issue, any solutions to solve this error? Thanks!

@VicGrygorchyk
Author

VicGrygorchyk commented Mar 18, 2019

Same issue, any solutions to solve this error? Thanks!

I have not found a solution. bazel shutdown didn't help me.
I gave up and switched to the Docker image of this project.

@torzdf
Collaborator

torzdf commented Mar 18, 2019

Without a full crash_report we cannot help diagnose these issues.

@seven110

I have the same problem. Extract works. Train doesn't. Logs have been attached.
Thanks for your help.
crash_report.2019.06.17.081214571325.log

@torzdf
Collaborator

torzdf commented Jun 17, 2019

Tensorflow 1.14 is neither tested nor supported. Downgrade.

@seven110

Also here's the faceswap.log.
Thanks.
faceswap.log

@torzdf
Collaborator

torzdf commented Jun 17, 2019

Eager Profiler is a TF 1.14 feature. It is not supported. Downgrade.
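
A quick way to check an environment against that cutoff before starting a training run is sketched below. It assumes 1.14 is the first unsupported release, as stated in this thread; the names parse_version, is_supported and FIRST_UNSUPPORTED are illustrative and do not exist in faceswap:

```python
def parse_version(version):
    """Turn a string such as '1.13.1' into a tuple of ints, ignoring suffixes like 'rc0'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


# Assumption taken from this thread: 1.14 is the first release that is not supported.
FIRST_UNSUPPORTED = (1, 14)


def is_supported(version):
    """True when the major.minor part of the version is below the unsupported cutoff."""
    return parse_version(version)[:2] < FIRST_UNSUPPORTED


# Example use with an installed TensorFlow:
#   import tensorflow as tf
#   print(tf.__version__, is_supported(tf.__version__))
```

Only the major and minor numbers are compared, so patch releases such as 1.13.1 and 1.13.2 are treated the same way.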

@seven110

Downgraded TF to 1.13.0. Works now.
Thanks for the fast reply!

5 participants