Facing issue while starting the Training #1

kalai2033 · 2020-04-09T12:34:14Z

Hi,
I want to use your model for my research. I tried to train the model as described in the README file, but the training does not seem to start at all.

Namespace(L2=0.0, batchSize=25, beta1=0.5, cuda=True, dataset='cityscapes', disc_iter=2, epoch_iter=5000, epochs=40, eval=False, experiment_name='25_samples_factorGAN', factorGAN=1, generator_channels=32, lipschitz_p=1, lipschitz_q=1, loadSize=128, lr=0.0001, num_joint_samples=25, nz=50, objective='JSD', out_path='out', seed=1337, use_real_dep_disc=1, workers=1)
Random Seed: 1337
dataset [AlignedDataset] was created
START TRAINING! Writing logs to out/Image2Image_cityscapes/25_samples_factorGAN/logs
2020-04-09 12:26:30.619713: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
................................

Nothing more appears in the console. Please help me out.

f90 · 2020-04-09T13:55:47Z

Hey! I suspect this is a problem with your CUDA installation, not something specific to my code. Are you able to run other PyTorch code successfully in your environment?
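One quick way to check this is a small script like the following. It is only a sketch: it assumes PyTorch is installed under the usual `torch` module name, and the helper name `cuda_report` is made up for illustration, not part of this repository.

```python
import importlib.util


def cuda_report():
    """Summarize whether PyTorch is importable and can see a CUDA device."""
    report = {"torch_installed": False, "cuda_available": False, "device_count": 0}
    # find_spec returns None when the package cannot be found in this environment.
    if importlib.util.find_spec("torch") is None:
        return report
    import torch  # imported lazily so the check works even without PyTorch

    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    report["device_count"] = int(torch.cuda.device_count())
    return report


if __name__ == "__main__":
    print(cuda_report())
```

If `cuda_available` comes back `False` on a machine with a GPU, the problem is most likely in the driver/CUDA setup rather than in the training script.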

kalai2033 · 2020-04-09T14:08:28Z

Yes, I have been able to run other code. When I ran this on my local system, I got the error below:

2020-01-25 11:20:08.541504: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-01-25 11:20:08.541639: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-01-25 11:20:08.541689: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly

So I ran this code in the Google Colab environment instead. I am sharing my Colab notebook link:

https://colab.research.google.com/drive/1GdEO9zgHLPnXTZ6a00lXyj0B2pHfyOVa

Please check and let me know.

f90 · 2020-04-09T14:13:23Z

It's a bit confusing that an error is occurring in TensorFlow, since this is a PyTorch project; the only TensorFlow code involved might be what TensorBoard uses. Did you try running the code in a pip virtualenv where you install only the packages listed in requirements.txt, to make sure TensorFlow does not interfere? Please pull the latest version of the code before you do that: I just changed requirements.txt so that it no longer includes tensorflow, along with a few other things.

Also see these posts, where very similar issues are reported; maybe they help:
tensorflow/tensorflow#38100
tensorflow/tensor2tensor#1643
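To see whether TensorFlow is still visible in the environment you are running from, something like this might help. It is a rough sketch using only the standard library; the function names `installed` and `in_virtualenv` are hypothetical helpers, and the virtualenv check assumes `venv` or a recent `virtualenv` release (both set `sys.base_prefix`).

```python
import importlib.util
import sys


def installed(package_names):
    """Map each top-level package name to whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in package_names}


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print(installed(["torch", "tensorflow", "tensorboard"]))
```

In a clean environment built from requirements.txt, `tensorflow` would be expected to show up as `False`.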

f90 · 2020-04-09T14:18:02Z

Also, it might be that the code is already running normally and just doesn't print anything during training! Check the output logs via TensorBoard. Also pull the latest version of my code: I added a training progress bar, so you should now see text output at each training step. And definitely use virtualenv to create a clean environment, then install the required packages listed in requirements.txt into it.
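If you want to confirm that logs are being written without opening TensorBoard, a sketch like the one below could be used. It assumes the logs are standard TensorBoard event files (named `events.out.tfevents.*`) and uses the log directory printed in the console output above; `event_files` is a made-up helper name.

```python
from pathlib import Path


def event_files(log_dir):
    """Return (name, size_in_bytes) for each TensorBoard event file under log_dir."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    # TensorBoard writers name their files "events.out.tfevents.<timestamp>.<host>...".
    files = sorted(root.rglob("events.out.tfevents.*"))
    return [(f.name, f.stat().st_size) for f in files]


if __name__ == "__main__":
    # Log directory taken from the "START TRAINING!" line in the console output.
    log_dir = "out/Image2Image_cityscapes/25_samples_factorGAN/logs"
    for name, size in event_files(log_dir):
        print(f"{size:>10d}  {name}")
```

Running it twice a few minutes apart and seeing the file size grow would suggest that training is progressing even when the console is silent.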

kalai2033 · 2020-04-09T15:13:05Z

@f90 Thanks :) It works now, and I can see the epoch progress bar. I have a few more questions:

Can I use a rectangular image without resizing? My input image size is 600x400.
I don't see any checkpoints being created, although 3 epochs have completed so far.
How do I test the final model? I'm using the following command to train:

!python Image2Image.py --cuda --batchSize=10 --loadSize 256 --dataset "diff" --num_joint_samples 300 --factorGAN 1 --experiment_name "diff"

f90 · 2020-04-09T15:43:07Z

Glad that it works now! Since you are raising several new points, I am going to create separate issues for them so we can handle each one. I'm closing this issue now; please post in the other issues from now on.

f90 added the bug label Apr 9, 2020

f90 closed this as completed Apr 9, 2020

This was referenced Apr 9, 2020

From #1: Can I use a rectangular image without resizing? #2

Open

From #1: Checkpoint creation #3

Closed

From issue #1: Testing the final model #4

Closed

rachellim mentioned this issue Aug 10, 2020

Stuck after printing 'Successfully opened dynamic library libcublas.so.10.0' tensorflow/tensor2tensor#1643

Open
