
Regarding training duration #12

Open
Kadakol opened this issue Aug 9, 2021 · 3 comments

Comments

Kadakol commented Aug 9, 2021

Thank you so much for releasing this code!

I had a question regarding the amount of time it takes to complete training. From the paper, I found the following information:

All models are built on the PyTorch framework and trained with NVIDIA 2080Ti GPU. When the patch size of input is set to 256 x 256, the total training time is about 5 days.

I have started training on my system with NVIDIA RTX 3090 GPU. However, after training for 6 days, I notice that only 410k iterations are completed and ~600k iterations are pending. I would expect training to be faster using the 3090 GPU card compared to the 2080Ti GPU card, but this seems to be extremely slow. I am using the code from this repository as is with no modifications apart from the training data paths.

Could you please let me know if I am missing something? Is some additional step required in order to accelerate the training?

Thank you.

chxy95 (Owner) commented Aug 9, 2021

@Kadakol Two 2080Ti GPUs are used when training with a patch size of 256 x 256. It takes about 45 s for every 100 iterations, plus about 125 s for validation every 5000 iterations.
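For what it's worth, these numbers can be sanity-checked against the paper's "about 5 days" with quick arithmetic. This is a rough sketch; the 1,000,000-iteration total is an assumption inferred from the question above (410k done + ~600k pending):

```python
# Back-of-the-envelope training-time estimate from the numbers above.
# ASSUMPTION: 1,000,000 total iterations (not stated explicitly in the repo).
TOTAL_ITERS = 1_000_000
SEC_PER_100_ITERS = 45   # ~45 s per 100 iterations on 2x 2080Ti
VAL_SEC = 125            # ~125 s of validation ...
VAL_EVERY = 5_000        # ... every 5000 iterations

train_sec = TOTAL_ITERS / 100 * SEC_PER_100_ITERS
val_sec = TOTAL_ITERS / VAL_EVERY * VAL_SEC
total_days = (train_sec + val_sec) / 86_400

print(f"training:   {train_sec / 86_400:.2f} days")
print(f"validation: {val_sec / 86_400:.2f} days")
print(f"total:      {total_days:.2f} days")  # ~5.5 days, consistent with the paper
```

So the paper's figure is consistent with the owner's per-iteration timings, and a machine taking 6+ days for 410k iterations is running well below that pace.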

Kadakol (Author) commented Aug 10, 2021

Thank you for your response! I understand.

I've included a small section of the training logs. It looks like the time per 100 iterations varies from 39 seconds to 138 seconds, and the validation phase takes about 8 minutes.

21-08-10 11:12:37.463 - INFO: <epoch:193, iter: 505,000, lr:5.000e-05, , time:51.837> l_pix: 7.5744e-03
21-08-10 11:30:17.502 - INFO: # Validation # PSNR: 2.0776e+01, norm_PSNR: 4.0822e+01, mu_PSNR: 3.3844e+01
21-08-10 11:30:17.503 - INFO: <epoch:193, iter: 505,000> psnr: 2.0776e+01 norm_PSNR: 4.0822e+01 mu_PSNR: 3.3844e+01
21-08-10 11:30:17.523 - INFO: Saving models and training states.
21-08-10 11:31:42.571 - INFO: <epoch:193, iter: 505,100, lr:5.000e-05, , time:1145.108> l_pix: 7.7541e-03
21-08-10 11:33:07.015 - INFO: <epoch:193, iter: 505,200, lr:5.000e-05, , time:84.443> l_pix: 8.7734e-03
21-08-10 11:34:32.932 - INFO: <epoch:193, iter: 505,300, lr:5.000e-05, , time:85.917> l_pix: 7.4574e-03
21-08-10 11:36:03.741 - INFO: <epoch:193, iter: 505,400, lr:5.000e-05, , time:90.808> l_pix: 7.7654e-03
21-08-10 11:37:49.982 - INFO: <epoch:193, iter: 505,500, lr:5.000e-05, , time:106.241> l_pix: 9.4102e-03
21-08-10 11:39:35.881 - INFO: <epoch:193, iter: 505,600, lr:5.000e-05, , time:105.898> l_pix: 8.2314e-03
21-08-10 11:41:29.546 - INFO: <epoch:193, iter: 505,700, lr:5.000e-05, , time:113.664> l_pix: 7.8461e-03
21-08-10 11:43:29.389 - INFO: <epoch:193, iter: 505,800, lr:5.000e-05, , time:119.013> l_pix: 8.7112e-03
21-08-10 11:45:26.957 - INFO: <epoch:193, iter: 505,900, lr:5.000e-05, , time:117.568> l_pix: 6.8919e-03
21-08-10 11:47:31.645 - INFO: <epoch:193, iter: 506,000, lr:5.000e-05, , time:124.688> l_pix: 8.2699e-03
21-08-10 11:49:33.480 - INFO: <epoch:193, iter: 506,100, lr:5.000e-05, , time:121.835> l_pix: 8.0161e-03
21-08-10 11:51:37.496 - INFO: <epoch:193, iter: 506,200, lr:5.000e-05, , time:124.015> l_pix: 9.0084e-03
21-08-10 11:53:41.639 - INFO: <epoch:193, iter: 506,300, lr:5.000e-05, , time:124.143> l_pix: 7.3761e-03
21-08-10 11:55:55.294 - INFO: <epoch:193, iter: 506,400, lr:5.000e-05, , time:133.655> l_pix: 8.3294e-03
21-08-10 11:58:03.631 - INFO: <epoch:193, iter: 506,500, lr:5.000e-05, , time:128.336> l_pix: 7.9146e-03
21-08-10 12:00:14.385 - INFO: <epoch:193, iter: 506,600, lr:5.000e-05, , time:130.754> l_pix: 7.6176e-03
21-08-10 12:02:27.380 - INFO: <epoch:193, iter: 506,700, lr:5.000e-05, , time:132.995> l_pix: 6.9532e-03
21-08-10 12:04:38.487 - INFO: <epoch:193, iter: 506,800, lr:5.000e-05, , time:131.106> l_pix: 7.9572e-03
21-08-10 12:06:50.470 - INFO: <epoch:193, iter: 506,900, lr:5.000e-05, , time:131.983> l_pix: 7.9140e-03
21-08-10 12:09:04.569 - INFO: <epoch:193, iter: 507,000, lr:5.000e-05, , time:134.098> l_pix: 9.0580e-03
21-08-10 12:11:14.352 - INFO: <epoch:193, iter: 507,100, lr:5.000e-05, , time:129.783> l_pix: 8.0160e-03
21-08-10 12:12:03.875 - INFO: <epoch:194, iter: 507,200, lr:5.000e-05, , time:49.522> l_pix: 8.9110e-03
21-08-10 12:12:43.286 - INFO: <epoch:194, iter: 507,300, lr:5.000e-05, , time:39.410> l_pix: 7.1136e-03
21-08-10 12:13:28.775 - INFO: <epoch:194, iter: 507,400, lr:5.000e-05, , time:45.488> l_pix: 7.1421e-03
21-08-10 12:14:11.765 - INFO: <epoch:194, iter: 507,500, lr:5.000e-05, , time:42.990> l_pix: 7.0376e-03
21-08-10 12:15:01.733 - INFO: <epoch:194, iter: 507,600, lr:5.000e-05, , time:49.968> l_pix: 7.8753e-03
21-08-10 12:15:50.791 - INFO: <epoch:194, iter: 507,700, lr:5.000e-05, , time:49.056> l_pix: 7.3979e-03
21-08-10 12:16:39.123 - INFO: <epoch:194, iter: 507,800, lr:5.000e-05, , time:48.332> l_pix: 7.9646e-03
21-08-10 12:17:33.752 - INFO: <epoch:194, iter: 507,900, lr:5.000e-05, , time:54.628> l_pix: 8.1874e-03
21-08-10 12:18:33.663 - INFO: <epoch:194, iter: 508,000, lr:5.000e-05, , time:59.911> l_pix: 8.5119e-03
21-08-10 12:19:37.124 - INFO: <epoch:194, iter: 508,100, lr:5.000e-05, , time:63.460> l_pix: 7.5296e-03
21-08-10 12:20:43.546 - INFO: <epoch:194, iter: 508,200, lr:5.000e-05, , time:66.422> l_pix: 8.6460e-03
21-08-10 12:21:49.488 - INFO: <epoch:194, iter: 508,300, lr:5.000e-05, , time:65.942> l_pix: 7.7607e-03
21-08-10 12:23:05.766 - INFO: <epoch:194, iter: 508,400, lr:5.000e-05, , time:76.278> l_pix: 9.1717e-03
21-08-10 12:24:27.715 - INFO: <epoch:194, iter: 508,500, lr:5.000e-05, , time:81.949> l_pix: 8.2576e-03
21-08-10 12:25:50.046 - INFO: <epoch:194, iter: 508,600, lr:5.000e-05, , time:82.330> l_pix: 7.7228e-03
21-08-10 12:27:28.734 - INFO: <epoch:194, iter: 508,700, lr:5.000e-05, , time:98.687> l_pix: 7.5902e-03
21-08-10 12:29:08.406 - INFO: <epoch:194, iter: 508,800, lr:5.000e-05, , time:99.673> l_pix: 1.0056e-02
21-08-10 12:31:06.663 - INFO: <epoch:194, iter: 508,900, lr:5.000e-05, , time:118.257> l_pix: 8.6049e-03
21-08-10 12:33:13.858 - INFO: <epoch:194, iter: 509,000, lr:5.000e-05, , time:127.194> l_pix: 7.9943e-03
21-08-10 12:35:30.189 - INFO: <epoch:194, iter: 509,100, lr:5.000e-05, , time:136.331> l_pix: 7.9761e-03
21-08-10 12:37:34.060 - INFO: <epoch:194, iter: 509,200, lr:5.000e-05, , time:123.870> l_pix: 8.8205e-03
21-08-10 12:39:52.601 - INFO: <epoch:194, iter: 509,300, lr:5.000e-05, , time:138.540> l_pix: 9.6689e-03
21-08-10 12:41:59.272 - INFO: <epoch:194, iter: 509,400, lr:5.000e-05, , time:126.671> l_pix: 7.7093e-03
21-08-10 12:44:15.614 - INFO: <epoch:194, iter: 509,500, lr:5.000e-05, , time:135.527> l_pix: 6.8039e-03
21-08-10 12:46:29.271 - INFO: <epoch:194, iter: 509,600, lr:5.000e-05, , time:133.657> l_pix: 7.6752e-03
21-08-10 12:48:38.473 - INFO: <epoch:194, iter: 509,700, lr:5.000e-05, , time:129.201> l_pix: 7.0189e-03
21-08-10 12:49:48.657 - INFO: <epoch:195, iter: 509,800, lr:5.000e-05, , time:70.185> l_pix: 8.7850e-03
21-08-10 12:50:35.257 - INFO: <epoch:195, iter: 509,900, lr:5.000e-05, , time:46.599> l_pix: 7.9587e-03
21-08-10 12:51:21.009 - INFO: <epoch:195, iter: 510,000, lr:5.000e-05, , time:45.751> l_pix: 7.6621e-03
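As an aside, the spread in the `time:` field can be summarized with a small parser. This is just a sketch, not part of the repository; the regex assumes the exact log format shown above:

```python
import re

# Two sample lines copied from the log excerpt above (the full log
# would be read from a file instead).
log = """\
21-08-10 12:12:43.286 - INFO: <epoch:194, iter: 507,300, lr:5.000e-05, , time:39.410> l_pix: 7.1136e-03
21-08-10 12:35:30.189 - INFO: <epoch:194, iter: 509,100, lr:5.000e-05, , time:136.331> l_pix: 7.9761e-03
"""

# Extract every "time:<seconds>" field as a float.
times = [float(t) for t in re.findall(r"time:([0-9.]+)", log)]

print(f"min {min(times):.1f}s  max {max(times):.1f}s  "
      f"mean {sum(times) / len(times):.1f}s per 100 iterations")
```

Running this over the whole log would make it easy to see whether the slowdowns correlate with specific phases (e.g. the long intervals right after validation/checkpointing in the excerpt above).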

andre20000131 commented
Hi, I don't think the time itself is the problem. If your computer is running other processes, the GPU can't use all of its power; I ran into this issue too. But I'd like to know which loss function you used. Why does the loss on my machine always stay between 1e-2 and 1e-1? Looking forward to your response.
