train stop at "-STARTING TRAINING-------------" #28
Comments
python train.py ae_configs/cvpr/low pc_configs/cvpr/res_shallow --restore
Maybe an issue with the GPU? Do you have one in the system?
OK. What's the training data?
And are you running this on your local machine or in some cloud / cluster?
Thank you for your reply. The dataset is ImageNet, and I am running the code on CentOS 7 on a cluster. This morning I also tried running the code on Windows on my laptop, and unexpectedly it succeeded. So I am very confused about why it didn't work on the cluster. Maybe some configuration on the cluster is wrong.
Hm, one issue could be that you don't have enough RAM on the cluster. Did you check this? You probably want at least 40 GB.
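One quick way to check this on a Linux node is to read MemAvailable from /proc/meminfo. The snippet below is only a sketch: the helper names are made up for illustration, the 40 GB threshold is the figure suggested above, and on a cluster the job scheduler may impose a lower per-job limit than what /proc/meminfo reports.

```python
import os


def mem_available_gb(meminfo_text):
    """Return MemAvailable from /proc/meminfo-style text, in GB (1 GB = 1024**2 kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])  # the value is reported in kB
            return kb / 1024 ** 2
    raise ValueError("MemAvailable not found in meminfo text")


def has_enough_ram(meminfo_text, required_gb=40.0):
    """True if the available memory meets the (assumed) requirement."""
    return mem_available_gb(meminfo_text) >= required_gb


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        text = f.read()
    print("available GB: %.1f" % mem_available_gb(text))
    print("enough for training (>= 40 GB)?", has_enough_ram(text))
```

If the number is well below 40 GB, requesting more memory for the job or shrinking the input pipeline buffers would be the first things to try.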
Following your advice, I searched on Google and found some suggestions, such as reducing the batch size. I changed the batch_size of your model from 30 to 16 and reran the training command, and it finally works! Thank you very much!
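For anyone hitting the same problem, a rough way to see why a smaller batch_size helps: per-batch buffers grow linearly with the batch size. The sketch below is only back-of-the-envelope arithmetic; the 160x160x3 crop shape and the function name are hypothetical and not taken from the repo's configs, and real memory use also depends on the model and the input pipeline.

```python
def batch_bytes(batch_size, height, width, channels, bytes_per_value=4):
    """Bytes needed to hold one batch of float32 image crops."""
    return batch_size * height * width * channels * bytes_per_value


# Hypothetical crop shape, chosen only for illustration.
H, W, C = 160, 160, 3

before = batch_bytes(30, H, W, C)  # batch_size from the original config
after = batch_bytes(16, H, W, C)   # reduced batch_size that worked here

print("batch of 30: %d bytes" % before)
print("batch of 16: %d bytes" % after)
print("ratio: %.2f" % (after / before))
```

Every buffer that holds one batch (inputs, activations, prefetch queues) shrinks by about the same factor, which is why going from 30 to 16 can be enough to fit on a memory-constrained node.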
I am sorry for bothering you again, but please allow me to describe my issue one last time. After I prepared the whole environment, including the Python packages and TFRecords, my training always stops at the string "-STARTING TRAINING-------------". It shows no further information at all; it just hangs there and never finishes. I don't know why. Here is my training command: