Facing issue while starting the Training #1

Closed
kalai2033 opened this issue Apr 9, 2020 · 6 comments
Labels
bug Something isn't working

Comments

@kalai2033

Hi,
I want to use your model for my research. I tried to train the model as described in the README file, but the training does not seem to start at all.

Namespace(L2=0.0, batchSize=25, beta1=0.5, cuda=True, dataset='cityscapes', disc_iter=2, epoch_iter=5000, epochs=40, eval=False, experiment_name='25_samples_factorGAN', factorGAN=1, generator_channels=32, lipschitz_p=1, lipschitz_q=1, loadSize=128, lr=0.0001, num_joint_samples=25, nz=50, objective='JSD', out_path='out', seed=1337, use_real_dep_disc=1, workers=1)
Random Seed: 1337
dataset [AlignedDataset] was created
START TRAINING! Writing logs to out/Image2Image_cityscapes/25_samples_factorGAN/logs
2020-04-09 12:26:30.619713: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
................................

Nothing more appears in the console after this. Please help me out.

@f90
Owner

f90 commented Apr 9, 2020

Hey! I suspect this is a problem with your CUDA installation, not something specific to my code. Are you able to run other PyTorch code successfully in your environment?
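
One way to answer that question is a short check that PyTorch sees the GPU and can actually run a computation on it. The following is a minimal sketch and not part of this repository; the function name `cuda_report` is made up for illustration.

```python
def cuda_report():
    """Collect basic facts about the local PyTorch/CUDA setup."""
    report = {"torch_installed": False, "cuda_available": False}
    try:
        import torch
    except ImportError:
        return report  # PyTorch is not installed in this environment

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()

    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # Run a tiny computation on the GPU to confirm kernels really launch.
        x = torch.ones(4, device="cuda")
        report["gpu_sum"] = float(x.sum().item())
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back as `False` even though a GPU is present, the problem is more likely in the driver or CUDA setup than in the training script.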

@f90 f90 added the bug Something isn't working label Apr 9, 2020
@kalai2033
Author

kalai2033 commented Apr 9, 2020

Yes, I have been able to run other code. When I ran this project on my local system, I got the error below:

2020-01-25 11:20:08.541504: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-01-25 11:20:08.541639: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-01-25 11:20:08.541689: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly

So I ran the code in a Google Colab environment instead. Here is the link to my Colab notebook:

https://colab.research.google.com/drive/1GdEO9zgHLPnXTZ6a00lXyj0B2pHfyOVa

Please check and let me know.

@f90
Owner

f90 commented Apr 9, 2020

It's a bit confusing that an error occurs in TensorFlow, since this is a PyTorch project; the only TensorFlow code involved would be whatever TensorBoard uses. Did you try running the code in a virtualenv where you install only the packages listed in requirements.txt, to make sure TensorFlow does not interfere? Please pull the latest version of the code before you do that: I just changed requirements.txt so that it no longer includes tensorflow, along with a few other things.

Also see these issues, where very similar problems are reported; they may help:
tensorflow/tensorflow#38100
tensorflow/tensor2tensor#1643
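
To confirm that an environment is really clean, it can help to check from inside Python whether you are in a virtual environment and whether tensorflow is still importable there. This is a rough sketch using only the standard library; the helper names `in_virtualenv` and `is_installed` and the list of packages checked are assumptions for illustration, not something taken from requirements.txt.

```python
import importlib.util
import sys


def in_virtualenv():
    """True when running inside a venv/virtualenv rather than the system Python."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.prefix differ from sys.base_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def is_installed(package):
    """True if the top-level `package` can be imported from this environment."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("inside a virtual environment:", in_virtualenv())
    # Illustrative package names only; adjust to whatever you want to check.
    for name in ("torch", "tensorboard", "tensorflow"):
        state = "installed" if is_installed(name) else "not installed"
        print(f"{name}: {state}")
```

In a fresh environment built from the updated requirements.txt, one would expect tensorflow to be reported as not installed.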

@f90
Owner

f90 commented Apr 9, 2020

It might also be that the code is already running normally and just doesn't print anything during training! Check the output logs via TensorBoard. Also pull the latest version of my code: I added a training progress bar, so you should now see text output at each training step. And definitely use virtualenv to create a clean environment, then install the required packages listed in requirements.txt into it.
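
A quick way to tell whether a silent run is making progress is to look for TensorBoard event files in the log directory and see how recently they were written. The sketch below assumes the log path printed in the console output earlier in this thread and relies on the usual `events.out.tfevents.*` file naming; `find_event_files` is a made-up helper name.

```python
import glob
import os
import time


def find_event_files(log_dir):
    """Return (path, size_bytes, seconds_since_modified) for each event file."""
    # "**" with recursive=True also matches files directly inside log_dir.
    pattern = os.path.join(log_dir, "**", "events.out.tfevents.*")
    now = time.time()
    results = []
    for path in sorted(glob.glob(pattern, recursive=True)):
        stat = os.stat(path)
        results.append((path, stat.st_size, now - stat.st_mtime))
    return results


if __name__ == "__main__":
    # Path copied from the "Writing logs to ..." line in the original report.
    log_dir = "out/Image2Image_cityscapes/25_samples_factorGAN/logs"
    files = find_event_files(log_dir)
    if not files:
        print("no event files found in", log_dir)
    for path, size, age in files:
        print(f"{path}: {size} bytes, last written {age:.0f}s ago")
```

If the files keep growing between two runs of this check, training is most likely running; the same directory can then be opened with `tensorboard --logdir <log_dir>`.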

@kalai2033
Author

@f90 Thanks :) It works now, and I can see the epoch progress bar. I have a few more questions.

  1. Can I use a rectangular image without resizing? My input image size is 600×400.
  2. I don't see any checkpoints being created, although 3 epochs have completed so far.
  3. How do I test the final model? I'm using the following command to train:

!python Image2Image.py --cuda --batchSize=10 --loadSize 256 --dataset "diff" --num_joint_samples 300 --factorGAN 1 --experiment_name "diff"

@f90
Owner

f90 commented Apr 9, 2020

Glad that it works now! Since you are raising several new points, I am going to create separate issues for them so we can handle each one. I am closing this issue now, so please post in the others from now on.

2 participants