Error in training #33

Open
lchunleo opened this issue Apr 27, 2020 · 4 comments

@lchunleo

Hi,

I am trying to run the training but encountered the following issue:

UnboundLocalError: local variable 'discriminator_loss' referenced before assignment
It is raised by the line print("discriminator_loss : %f" % discriminator_loss)

@deepak112
Owner

Are you still facing this issue?

@lchunleo
Author

lchunleo commented May 5, 2020

You still facing this issue?

Thanks for checking. I managed to resolve the above issue; it was caused by some path issues.
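For readers who hit the same UnboundLocalError: one plausible mechanism, consistent with the path fix mentioned above, is that the training loop never ran (for example because no images were found), so discriminator_loss was never assigned before the print. The sketch below shows a guard for that situation. It is not the repository's code; list_training_images, report_loss and IMAGE_EXTENSIONS are hypothetical names introduced for illustration.

```python
import os

# Hypothetical set of accepted image extensions.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp")


def list_training_images(input_dir):
    """Return sorted image paths, failing early if the directory is wrong or empty."""
    if not os.path.isdir(input_dir):
        raise FileNotFoundError("input directory does not exist: %s" % input_dir)
    files = sorted(
        os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
    if not files:
        raise ValueError("no images found in: %s" % input_dir)
    return files


def report_loss(batches):
    """Average the last batch; a stand-in for a loop that assigns a loss per batch."""
    discriminator_loss = None  # defined up front, so the name always exists
    for batch in batches:
        discriminator_loss = sum(batch) / len(batch)
    if discriminator_loss is None:
        # Without this guard, an empty loop would surface as UnboundLocalError.
        raise RuntimeError("no batches were processed; check the input path")
    return discriminator_loss


print("discriminator_loss : %f" % report_loss([[1.0, 3.0]]))
```

Failing early with a message that names the directory tends to be easier to diagnose than an error raised later at the print statement.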

I tried to train on my own dataset on Google Colab, but I still cannot get it running. Is training very resource intensive? I reduced the batch size to almost the minimum, but it still fails. I resized my images to 384x384 and ran with these arguments:

--batch_size=2 --epochs=3 --number_of_images=805 --train_test_ratio=0.8

2020-05-05 02:12:47.508295: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-05 02:12:50.039906: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300000000 Hz
2020-05-05 02:12:50.041986: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2b5d480 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-05 02:12:50.042028: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-05-05 02:12:50.044879: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-05 02:12:50.046654: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-05-05 02:12:50.046687: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (d1ec168119ba): /proc/driver/nvidia/version does not exist
tcmalloc: large alloc 1139539968 bytes == 0x42f50000 @ 0x7fe4800d11e7 0x7fe47dc395e1 0x7fe47dc9e8e0 0x7fe47dd2c447 0x50ac25 0x50c5b9 0x509d48 0x50aa7d 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50c5b9 0x509d48 0x50aa7d 0x50c5b9 0x508245 0x50b403 0x635222 0x6352d7 0x638a8f 0x639631 0x4b0f40 0x7fe47fcceb97 0x5b2fda
tcmalloc: large alloc 1207959552 bytes == 0x97eaa000 @ 0x7fe4800b3b6b 0x7fe4800d3379 0x7fe3ee4b01f7 0x7fe3e260be4f 0x7fe3e2692e6b 0x7fe3e2501996 0x7fe3e250237b 0x7fe3e25024a7 0x7fe3ec966113 0x7fe3ec969fe7 0x7fe3e6d11544 0x7fe3e6d11d7f 0x7fe3e6ce4a9b 0x7fe3e6ce5670 0x7fe3e6d0d1c4 0x7fe3e6cdf64c 0x7fe3e6ce2c42 0x7fe3e68b616b 0x7fe3e68a4e11 0x7fe3e6556b71 0x7fe4703ba817 0x7fe4703dc4f4 0x50ac25 0x50c5b9 0x508245 0x50a080 0x50aa7d 0x50d390 0x508245 0x50a080 0x50aa7d
--------------- Epoch 1 ---------------
0% 0/322 [00:00<?, ?it/s]2020-05-05 02:13:34.929150: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 12230590464 exceeds 10% of free system memory.
2020-05-05 02:13:34.929150: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 12230590464 exceeds 10% of free system memory.
tcmalloc: large alloc 12230590464 bytes == 0x35285c000 @ 0x7fe4800b3b6b 0x7fe4800d3379 0x7fe3ee4b01f7 0x7fe3e260be4f 0x7fe3e2692e6b 0x7fe3e2501996 0x7fe3e2504a7d 0x7fe3ebdbd116 0x7fe3e271db42 0x7fe3e270fe85 0x7fe3e280c4e1 0x7fe3e28091d3 0x7fe47e9b36df 0x7fe47fa956db 0x7fe47fdce88f
tcmalloc: large alloc 12230590464 bytes == 0x62c05c000 @ 0x7fe4800b3b6b 0x7fe4800d3379 0x7fe3ee4b01f7 0x7fe3e260be4f 0x7fe3e2692e6b 0x7fe3e2501996 0x7fe3e2504a7d 0x7fe3ebf21847 0x7fe3e271db42 0x7fe3e270fe85 0x7fe3e280c4e1 0x7fe3e28091d3 0x7fe47e9b36df 0x7fe47fa956db 0x7fe47fdce88f
0% 1/322 [00:42<3:47:39, 42.55s/it]Traceback (most recent call last):
File "train.py", line 138, in <module>
train(values.epochs, values.batch_size, values.input_dir, values.output_dir, values.model_save_dir, values.number_of_images, values.train_test_ratio)
File "train.py", line 83, in train
d_loss_real = discriminator.train_on_batch(image_batch_hr, real_data_Y)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1514, in train_on_batch
outputs = self.train_function(ins)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 3792, in call
outputs = self._graph_fn(*converted_inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1605, in call
return self._call_impl(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1645, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 598, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable _AnonymousVar348 from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/_AnonymousVar348/N10tensorflow3VarE does not exist.
[[node mul_21/ReadVariableOp (defined at /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_18053]

Function call stack:
keras_scratch_graph
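A note on the memory figures in the log above: it reports that no CUDA device was detected, so every allocation lands in host RAM. As a rough worked estimate, and assuming the discriminator follows the SRGAN-paper layout (four stride-2 convolutions ending at 512 channels, then Flatten and a 1024-unit dense layer with float32 weights), the dense layer's weight matrix alone comes to the same size as the 1207959552-byte tcmalloc allocation shown. That layout is an assumption not confirmed in this thread, and dense_weight_bytes is a hypothetical helper.

```python
def dense_weight_bytes(image_size, num_stride2_convs=4, channels=512,
                       dense_units=1024, bytes_per_value=4):
    """Bytes for the weight matrix of a Dense layer that follows Flatten.

    All defaults are assumptions about the discriminator, not values
    taken from the repository.
    """
    side = image_size // (2 ** num_stride2_convs)  # spatial size after downsampling
    flattened = side * side * channels             # features entering the Dense layer
    return flattened * dense_units * bytes_per_value


# 384x384 inputs: 24 * 24 * 512 * 1024 * 4 bytes
print(dense_weight_bytes(384))  # 1207959552

# Halving the image side quarters this particular allocation.
print(dense_weight_bytes(192))  # 301989888
```

Under these assumptions the cost scales with the square of the image side, which would explain why lowering the batch size alone has little effect on this allocation.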

@RAKSHIT0406

Hello,
I can run the network and utils files without any error, but when I try to run the training script I get an error. I also do not understand how to specify the input directory, or what modifications are needed to use a dataset of images as training input.
Can you please help me run this code?
Error:
usage: ipykernel_launcher.py [-h] [-i INPUT_DIR] [-o OUTPUT_DIR]
[-m MODEL_SAVE_DIR] [-b BATCH_SIZE] [-e EPOCHS]
[-n NUMBER_OF_IMAGES] [-r TRAIN_TEST_RATIO]
ipykernel_launcher.py: error: unrecognized arguments: -f /root/.local/share/jupyter/runtime/kernel-e1242cbc-2f55-4f63-aae5-b18b2fbfa737.json
An exception has occurred, use %tb to see the full traceback.

SystemExit: 2
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2890: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
This is the error I encountered while running the training code.
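The "unrecognized arguments: -f ...json" message is typical of running an argparse script inside a notebook: the kernel is launched with its own -f argument, and parse_args() reads it from sys.argv and rejects it. A minimal sketch of two common workarounds follows. The option names are taken from the usage message and the command line earlier in this thread; the types, the defaults, the directory paths, and the build_parser and parse_cli helpers are assumptions for illustration, not the repository's code.

```python
import argparse


def build_parser():
    """Parser mirroring the options in the usage message (defaults are placeholders)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input_dir", default="./data/")
    parser.add_argument("-o", "--output_dir", default="./output/")
    parser.add_argument("-m", "--model_save_dir", default="./model/")
    parser.add_argument("-b", "--batch_size", type=int, default=2)
    parser.add_argument("-e", "--epochs", type=int, default=3)
    parser.add_argument("-n", "--number_of_images", type=int, default=805)
    parser.add_argument("-r", "--train_test_ratio", type=float, default=0.8)
    return parser


def parse_cli(argv=None):
    """Parse argv (or sys.argv when argv is None), tolerating unknown arguments."""
    # parse_known_args returns leftovers instead of exiting with an error.
    values, unknown = build_parser().parse_known_args(argv)
    return values, unknown


# In a notebook, pass the arguments explicitly instead of relying on sys.argv.
# "/content/my_images" is a placeholder path.
values, unknown = parse_cli(
    ["-f", "kernel.json", "--input_dir", "/content/my_images", "--batch_size", "2"]
)
print(values.input_dir, values.batch_size, unknown)
```

Another option in Colab is to leave the script unchanged and run it as a shell command from a cell, for example `!python train.py --input_dir /content/my_images`, so that the notebook's own arguments never reach the parser.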

@BassantTolba1234

Hello all,
I need code that implements the following part:

"The SRResNet networks were trained with a learning rate of 10^-4 and 10^6 update iterations. We employed the trained MSE-based SRResNet network as initialization for the generator when training the actual GAN to avoid undesired local optima."
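The quoted passage describes two stages: train the generator alone on pixel-wise MSE, then reuse its weights as the starting point for the GAN's generator. The sketch below illustrates that flow with tf.keras. It is not the repository's implementation: build_toy_generator is a tiny hypothetical stand-in for the real SRResNet, the Adam optimizer is an assumption, random arrays replace real image pairs, and only 3 iterations are run instead of the 10^6 mentioned in the quote.

```python
import numpy as np
import tensorflow as tf


def build_toy_generator(lr_size=24, scale=4):
    """Tiny stand-in for the real generator (hypothetical, for illustration only)."""
    inputs = tf.keras.Input(shape=(lr_size, lr_size, 3))
    x = tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.UpSampling2D(size=scale)(x)
    outputs = tf.keras.layers.Conv2D(3, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, outputs)


def pretrain_with_mse(generator, lr_images, hr_images, iterations, batch_size=2):
    """Stage 1: train the generator alone with MSE loss and learning rate 1e-4."""
    generator.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mse"
    )
    for _ in range(iterations):
        idx = np.random.randint(0, len(lr_images), size=batch_size)
        generator.train_on_batch(lr_images[idx], hr_images[idx])
    return generator


# Random arrays stand in for low-/high-resolution pairs scaled to [-1, 1].
lr_images = np.random.uniform(-1, 1, size=(4, 24, 24, 3)).astype("float32")
hr_images = np.random.uniform(-1, 1, size=(4, 96, 96, 3)).astype("float32")

# 3 iterations keep the sketch fast; the quoted passage uses 10^6.
srresnet = pretrain_with_mse(build_toy_generator(), lr_images, hr_images, iterations=3)

# Stage 2: initialise the GAN's generator from the pre-trained weights,
# then continue with the usual adversarial training loop.
gan_generator = build_toy_generator()
gan_generator.set_weights(srresnet.get_weights())
```

In practice the pre-trained weights would usually be saved to disk after stage 1 and loaded before adversarial training, so the long MSE run does not have to be repeated.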
