Got NAN value in training process by the same command in this code #21

Closed

VictorZoo opened this issue Nov 26, 2021 · 1 comment

@VictorZoo

Sorry to bother you. I downloaded the Undistorted_megadepth dataset and ran the command "python -m pixloc.pixlib.train pixloc_megadepth_reproduce --conf pixloc/pixlib/configs/train_pixloc_megadepth.yaml". During training, I got NaN values as follows:

Evaluation: 97%|#########7| 99/102 [00:21<00:00, 5.18it/s]

Evaluation: 98%|#########8| 100/102 [00:21<00:00, 5.18it/s]

Evaluation: 99%|#########9| 101/102 [00:22<00:00, 4.92it/s]

Evaluation: 100%|##########| 102/102 [00:22<00:00, 4.60it/s]

[11/26/2021 01:22:48 pixloc INFO] [Validation] {R_error/0 3.900E+00, t_error/0 1.549E-01, R_error/1 4.264E+00, t_error/1 1.560E-01, R_error/2 3.748E+00, t_error/2 1.468E-01, R_error 3.748E+00, R_error_median 4.480E+00, t_error 1.468E-01, t_error_median 1.756E-01, R_error/init 1.718E+00, t_error/init 1.331E-01, loss/total 1.939E+01, loss/reprojection_error/0 3.426E+01, loss/reprojection_error/1 3.524E+01, loss/reprojection_error/2 2.786E+01, loss/reprojection_error 2.786E+01, loss/reprojection_error_median 3.192E+01, loss/reprojection_error/init 1.634E+01, loss/reprojection_error/init_median 1.069E+01}

[11/26/2021 01:23:24 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('0035', 472, 741)]

[11/26/2021 01:23:31 pixloc INFO] [E 3 | it 550] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 1.343E+01}

[11/26/2021 01:24:15 pixloc INFO] [E 3 | it 600] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 1.996E+01}

[11/26/2021 01:24:28 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('1017', 962, 1173)]

[11/26/2021 01:24:58 pixloc INFO] [E 3 | it 650] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 2.845E+01}

[11/26/2021 01:25:41 pixloc INFO] [E 3 | it 700] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 1.092E+01}

[11/26/2021 01:26:13 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('0348', 167, 168)]

[11/26/2021 01:26:24 pixloc INFO] [E 3 | it 750] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 8.301E+00}

[11/26/2021 01:27:08 pixloc INFO] [E 3 | it 800] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 2.569E+01}

[11/26/2021 01:27:19 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('5009', 168, 165)]

[11/26/2021 01:27:51 pixloc INFO] [E 3 | it 850] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 2.809E+01}

[11/26/2021 01:28:23 pixloc.pixlib.models.two_view_refiner WARNING] Few points in batch [('5005', 140, 136)]

[11/26/2021 01:28:34 pixloc INFO] [E 3 | it 900] loss {total NAN, reprojection_error/0 NAN, reprojection_error/1 NAN, reprojection_error/2 NAN, reprojection_error NAN, reprojection_error/init 8.891E+00}

The NaNs occurred in epoch 3, but when I re-ran the same command, they appeared in epoch 1 at iteration 1180.

I just want to reproduce your results and haven't changed any parameters; they are exactly as you posted. Is there something I missed, or is something going wrong? My PyTorch version is 1.7.1, NumPy 1.21.2.

Thank you so much.

@sarlinpe
Member

That is certainly an issue.

  1. Are you training with the latest commit 0ab0e79?
  2. Can you run with anomaly detection enabled to figure out where this comes from? (See the sketch below.)
  3. Let's discuss this in #10 ("NAN appear during training") instead. So far, with the latest fixes, I haven't managed to reproduce the NaNs.
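
For reference, anomaly detection can be switched on globally through PyTorch's autograd API. A minimal sketch, assuming it is placed near the top of the training entry point; the exact location within pixloc is not prescribed here:

    import torch

    # Make autograd raise an error, with a traceback to the operation that
    # produced NaN/Inf gradients, instead of silently propagating them.
    torch.autograd.set_detect_anomaly(True)

    # ... the rest of the training script runs unchanged. Expect a noticeable
    # slowdown, so this is best reserved for debugging runs.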
