train stop at"-STARTING TRAINING-------------" #28

qbzhu2020 · 2020-06-10T10:51:30Z

I am sorry for bothering you again, but please allow me to show my issue one last time. After I prepared the whole environment, including the Python packages and TFRecords, my training always stops at the string "-STARTING TRAINING-------------". It doesn't show any further information; it just hangs there and never finishes. I don't know why. Here is my training command:

qbzhu2020 · 2020-06-10T10:52:19Z

python train.py ae_configs/cvpr/low pc_configs/cvpr/res_shallow --restore "/public/home/xqqstu/fab/code/ckpts/0515_1103 cvpr@low cvpr@res_shallow/ckpts"

fab-jul · 2020-06-11T10:18:17Z

maybe an issue with GPU? do you have one in the system?

qbzhu2020 · 2020-06-12T05:57:47Z

I checked it again and found that I do have one! We can see it in the log:

2020-06-12 13:40:33.870168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties
name: Tesla V100-PCIE-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
totalMemory: 31.75GiB freeMemory: 31.03GiB

The picture above shows where the code stops.

fab-jul · 2020-06-12T07:41:19Z

ok. what’s the training data?

fab-jul · 2020-06-12T07:41:37Z

and are you running this on your local machine or in some cloud / cluster?

qbzhu2020 · 2020-06-12T08:23:11Z

Thank you for your reply. The dataset is ImageNet, and I am running the code on CentOS 7 on a cluster. This morning I tried running the code on Windows on my laptop, and unexpectedly it succeeded. So I am very confused about why it didn't work on the cluster. Maybe some configuration on the cluster is wrong.

fab-jul · 2020-06-12T08:24:33Z

hm one issue could be that you don’t have enough RAM on the cluster. did you check this? you want probably at least 40GB
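
For anyone else hitting this hang, here is a rough sketch of one way to check the RAM on a Linux node before starting training. It assumes `/proc/meminfo` is readable; the helper names `parse_meminfo` and `enough_ram` are made up for illustration and are not part of this repo. The 40 GiB default only mirrors the rough figure mentioned above. On a cluster the job scheduler may also impose a lower per-job memory limit, which this check does not see.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: size in GiB}.

    Hypothetical helper for illustration; only fields reported in kB are kept.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        # Sizes are reported in kB (1024 bytes); convert to GiB.
        if len(parts) > 1 and parts[1] == "kB":
            info[name.strip()] = int(parts[0]) / (1024 ** 2)
    return info


def enough_ram(info, required_gib=40.0):
    """Return True if the available memory is at least required_gib."""
    # Fall back to MemFree if the kernel does not report MemAvailable.
    available = info.get("MemAvailable", info.get("MemFree", 0.0))
    return available >= required_gib


if __name__ == "__main__":
    path = "/proc/meminfo"  # Linux only
    if os.path.exists(path):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print("MemTotal (GiB): %.1f" % info.get("MemTotal", 0.0))
        print("Enough RAM for training (assumed 40 GiB): %s" % enough_ram(info))
    else:
        print("No /proc/meminfo on this system")
```

If this reports much less than the suggested amount, reducing the batch size is one option to try.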

qbzhu2020 · 2020-06-12T08:57:12Z

Following your advice, I searched on Google and found some suggestions, such as reducing the batch size. I changed the batch_size of your model from 30 to 16 and reran the training command, and it finally works! Thank you very much!

fab-jul closed this as completed Jun 12, 2020
