The total training time #20

Closed
Runsong123 opened this issue Mar 31, 2021 · 8 comments

Comments

@Runsong123

Hi:
thanks for making this great work available here!
I want to know the total training time for 3000 epochs for the model in this paper (or if you use a single GPU ? ). I find it's so long for me to train the model on a single 2080Ti.

@pengsongyou
Member

Hi @csuzhuzhuxia,

If you consider the 1- or 3-plane model on the ShapeNet dataset, you should be able to run at least 300K iterations in a day. It is strange that training takes so long for you. Maybe you can check where the overhead is; I suspect it might be in the data loading.
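
If you want a quick check, something along these lines (just a generic sketch; train_loader stands for whatever DataLoader you are using) shows whether the loader alone is already slow:

```python
import time

# Generic sketch: iterate over the DataLoader without running the model,
# to see whether data loading alone is already the bottleneck.
n_batches = 200
start = time.time()
for i, batch in enumerate(train_loader):
    if i + 1 == n_batches:
        break
print('%.1f batches/s from the loader alone' % (n_batches / (time.time() - start)))
```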

Best,
Songyou

@Runsong123
Author

Thanks for the quick reply!
I will check it and try to find the overhead.

@pengsongyou
Member

@csuzhuzhuxia you have closed the issue, but I am still curious where the overhead was. Thanks!

Runsong123 reopened this Apr 11, 2021
@Runsong123
Author

Sorry for the late reply. I have been busy with other things these days, so I just closed the issue even though I had not found the reason. Thank you for following up.
I think you mean 300K iterations in a day, not 300K epochs (for ShapeNet, 1 epoch = 959 iterations). However, when I run the experiment with the ShapeNet 1-plane config, I only get roughly 22K iterations a day. I guess this is because the dataset contains many small files and is stored on an HDD, so the data loader is very slow.
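
Just as a back-of-the-envelope calculation on my side (using 1 epoch = 959 iterations as above, so these are rough numbers, not from the paper):

```python
# Rough estimate of the total training time for the 3000-epoch schedule,
# assuming 1 epoch = 959 iterations.
total_iters = 3000 * 959         # ~2.88M iterations in total
print(total_iters / 300_000)     # ~9.6 days at ~300K iterations/day
print(total_iters / 22_000)      # ~131 days at ~22K iterations/day (my current HDD speed)
```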

Thank you!

@pengsongyou
Member

Still strange, because I was also keeping the dataset on an HDD (or was it actually an SSD?) and it worked well. But yes, thanks for letting me know!

@Runsong123
Author

Hello, to verify the cause I trained the model on another machine that has an SSD. With the SSD, the speed reaches nearly 300K iterations a day using the ShapeNet 1-plane setting.

zhu

@pengsongyou
Member

I just checked the machine I used to run the model, and I also kept the ShapeNet dataset on an HDD, so this should not be the problem. For the 1-plane model on ShapeNet, using a GTX 1080ti with an HDD, I just checked my output files again and it really took only 8 hours to reach 300K iterations.

I suggest checking the following:

  • The GPU you are using. I was using a 2080ti and a 1080ti (20-30% slower than the 2080ti, but not too much).
  • The num_workers that you set in the config file.
  • The Volatile GPU-Util in nvidia-smi. If it is often at 0%, the GPU is waiting for data loading from the CPU. This sometimes happened to me as well, but disappeared after simply running the model again.
  • time.time() calls in the code, to check exactly what slows it down, for example the runtime of train_step versus the train_loader; a minimal sketch follows this list.
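
A rough sketch of that timing check (assuming a trainer object with train_step and a train_loader, as in the comment above; adapt the names to the actual code):

```python
import time

# Rough sketch: measure how long each iteration waits for data vs. runs the step.
# trainer.train_step and train_loader are assumed from the comment above.
t_end = time.time()
for it, batch in enumerate(train_loader):
    t_data = time.time() - t_end           # time spent waiting for the data loader
    loss = trainer.train_step(batch)       # time spent in the actual training step
    t_step = time.time() - t_end - t_data
    print('iter %d: data %.3fs, step %.3fs' % (it, t_data, t_step))
    t_end = time.time()
    if it == 100:
        break
```

If most of the per-iteration time is in the data part, the loader (disk or num_workers) is the bottleneck rather than the GPU.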

Best,
Songyou

@Runsong123
Author

Thanks for your quick reply.
I also compared the running time on different disks (SSD vs. HDD) while keeping the same config provided by this repo, and the difference is very large. The training speed is now fine for me after switching disks. So I guess there is some problem with my old HDD, even though, judging from your test, the difference between a normal SSD and a normal HDD should not be that large. Finally, thank you again for your patient replies!

-Zhu
