Failure during optimization #5

Closed

ecmjohnson opened this issue Apr 19, 2022 · 4 comments

Comments

@ecmjohnson

Hello, I'm trying to run ViSER on some of my own datasets. Out of my 5 datasets, 2 succeed and 3 fail, all with the same failure:

```
> /HPS/articulated_nerf/work/viser/nnutils/mesh_net.py(809)forward()
-> self.match_loss = (csm_pred - csm_gt).norm(2,1)[mask].mean() * 0.1
(Pdb) 
Traceback (most recent call last):
  File "optimize.py", line 59, in <module>
    app.run(main)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "optimize.py", line 56, in main
    trainer.train()
  File "/HPS/articulated_nerf/work/viser/nnutils/train_utils.py", line 339, in train
    total_loss,aux_output = self.model(input_batch)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/HPS/articulated_nerf/work/viser/nnutils/mesh_net.py", line 809, in forward
    self.match_loss = (csm_pred - csm_gt).norm(2,1)[mask].mean() * 0.1
  File "/HPS/articulated_nerf/work/viser/nnutils/mesh_net.py", line 809, in forward
    self.match_loss = (csm_pred - csm_gt).norm(2,1)[mask].mean() * 0.1
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit
Traceback (most recent call last):
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/HPS/articulated_nerf/work/miniconda3/envs/viser/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/HPS/articulated_nerf/work/miniconda3/envs/viser/bin/python', '-u', 'optimize.py', '--local_rank=0', '--name=cactus_full-1003-0', '--checkpoint_dir', 'log', '--n_bones', '21', '--num_epochs', '20', '--dataname', 'cactus_full-init', '--ngpu', '1', '--batch_size', '4', '--seed', '1003']' returned non-zero exit status 1.
Killing subprocess 6097
```

Full error log 1
Full error log 2
Full error log 3

I would tend to assume this is a division by zero on the identified line. Have you encountered this issue before?

I have tried multiple values of init_frame and end_frame for the initial optimization on a subset of frames (which is where the failure occurs), as well as different seed values, but I haven't found any choice of these parameters that lets these datasets avoid the failure.

Any help or insight you can provide would be appreciated.
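
For reference, here is a minimal sketch of what I suspect happens on that line when `mask` selects no pixels (hypothetical shapes and values, not my actual data): the mean over an empty selection is 0/0, which gives NaN.

```python
import torch

# Hypothetical per-pixel matches; in the real code these come from the CSM predictions.
csm_pred = torch.rand(8, 3)
csm_gt = torch.rand(8, 3)
# An all-False mask, e.g. if the rendered and observed silhouettes never overlap.
mask = torch.zeros(8, dtype=torch.bool)

per_pixel = (csm_pred - csm_gt).norm(2, 1)  # shape (8,)
match_loss = per_pixel[mask].mean() * 0.1   # mean over an empty tensor -> nan
print(match_loss)                           # tensor(nan)
```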

@gengshan-y (Owner)

Hello, this seems to be an initialization issue. The rendered mask might not overlap with the observed mask when the principal point is not initialized properly.

Does this solve the problem? #4 (comment)
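
As a quick sanity check (a rough sketch with made-up silhouettes, independent of the ViSER code), you can verify whether the rendered and observed masks share any pixels at all:

```python
import numpy as np

# Hypothetical boolean silhouettes at the same resolution.
rendered_sil = np.zeros((256, 256), dtype=bool)
observed_sil = np.zeros((256, 256), dtype=bool)
rendered_sil[40:120, 40:120] = True    # where the mesh projects under the current camera
observed_sil[150:230, 150:230] = True  # the ground-truth mask

overlap = np.logical_and(rendered_sil, observed_sil).sum()
print(overlap)  # 0 here: no shared pixels, so the matching loss has nothing to average over
```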

@ecmjohnson (Author)

Ah, let me clarify my understanding: the ppx and ppy pixel coordinates are not necessarily the principal point of the camera projection (i.e., typically half the width and half the height, respectively); instead, I should set them to be centered on the object in the start_idx frame of the init optimization. Is that correct?

I had already set ppx and ppy to half the width and half the height, respectively, for my datasets, but it is possible that this point did not overlap the masks in the datasets that failed.
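
For the failing datasets I could instead initialize them from the object mask of the start_idx frame, something like this rough sketch (the mask path and loader here are made up, not taken from the ViSER codebase):

```python
import numpy as np
import imageio.v2 as imageio

# Hypothetical path to the silhouette of the first frame used in the init optimization.
mask = imageio.imread("database/DAVIS/Annotations/Full-Resolution/cactus_full-init/00000.png")
if mask.ndim == 3:
    mask = mask[..., 0]   # keep a single channel if the mask was saved as RGB
mask = mask > 0           # binarize

ys, xs = np.nonzero(mask)
ppx, ppy = xs.mean(), ys.mean()  # object centroid, used as the initial principal point
print(ppx, ppy)
```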

@gengshan-y (Owner)

Your understanding is correct. Let me give more explanation if you are interested -- ppx and ppy are supposed to be the principal point of the camera, but if we initialize them to the correct values, the renderings may not overlap the observed masks because the initial root translation estimate is incorrect, which causes the problem.

I would suggest using the following to avoid tedious manual initialization of ppx, ppy:

Besides passing the principal points in the config file, another option is to pass --cnnpp to optimize.py, which optimizes an image CNN to predict the principal points. In this case, we have a mechanism here to ensure the silhouette rendering and the ground truth overlap.
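
For example, reusing the arguments from the command in your log (only --cnnpp is new; run it through whatever launch script you already use, which adds --local_rank itself):

```
python optimize.py --cnnpp --name=cactus_full-1003-0 --checkpoint_dir log \
    --n_bones 21 --num_epochs 20 --dataname cactus_full-init --ngpu 1 \
    --batch_size 4 --seed 1003
```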

@ecmjohnson (Author)

Ah, excellent! That solves the issue of failing during optimization.

Thanks!
