train stop at"-STARTING TRAINING-------------" #28

Closed
qbzhu2020 opened this issue Jun 10, 2020 · 8 comments

Comments

@qbzhu2020

I am sorry for bothering you again, but please allow me to describe my issue one last time. After I prepared the whole environment, including the Python packages and TFRecords, my training always stops at the string "-STARTING TRAINING-------------". After that it shows no further information; it just hangs there and never finishes. I don't know why. Here is my training command:

@qbzhu2020
Author

python train.py ae_configs/cvpr/low pc_configs/cvpr/res_shallow --restore
"/public/home/xqqstu/fab/code/ckpts/0515_1103 cvpr@low cvpr@res_shallow/ckpts"

@fab-jul
Owner

fab-jul commented Jun 11, 2020

maybe an issue with GPU? do you have one in the system?

@qbzhu2020
Author

I checked again and found that I do have one! We can see it in the log:
2020-06-12 13:40:33.870168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
totalMemory: 31.75GiB freeMemory: 31.03GiB

image
The picture above shows where the code stops.

@fab-jul
Owner

fab-jul commented Jun 12, 2020

ok. what’s the training data?

@fab-jul
Owner

fab-jul commented Jun 12, 2020

and are you running this in your local machine or in some cloud / cluster

@qbzhu2020
Author

Thank you for your reply. The dataset is ImageNet, and I am running the code on CentOS 7 on a cluster. This morning I tried running the code on Windows on my laptop, and unexpectedly it succeeded. So I am very confused about why it didn't work on the cluster. Maybe some configuration on the cluster is wrong.

@fab-jul
Owner

fab-jul commented Jun 12, 2020

hm one issue could be that you don’t have enough RAM on the cluster. did you check this? you probably want at least 40GB

@qbzhu2020
Author

Following your advice, I searched on Google and found suggestions such as reducing the batch size. I changed the batch_size of your model from 30 to 16 and reran the training command, and it finally works! Thank you very much!
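
For anyone hitting the same hang, here is a minimal sketch of the RAM check suggested above. It assumes a Linux node that exposes /proc/meminfo (as CentOS 7 does). The helper names (parse_meminfo, available_gib, enough_ram) are hypothetical and not part of this repository, and the 40 GiB default only mirrors the rough figure mentioned in this thread.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Most fields are reported in kB (kibibytes); the unit suffix is ignored.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(text):
    """Return available memory in GiB, falling back to MemFree if needed."""
    info = parse_meminfo(text)
    kib = info.get("MemAvailable", info.get("MemFree", 0))
    return kib / (1024 ** 2)


def enough_ram(text, required_gib=40.0):
    """True if available memory meets the threshold (40 GiB is an assumption)."""
    return available_gib(text) >= required_gib


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory report; only works on Linux."""
    with open(path) as f:
        return f.read()
```

On the cluster node, calling enough_ram(read_meminfo()) before launching train.py gives a quick yes/no answer. If it returns False, lowering batch_size (as done above, from 30 to 16) is one thing to try.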

@fab-jul fab-jul closed this as completed Jun 12, 2020