Cameras with multiple overlapping regions: Will it work? #81

Open
samhodge-aiml opened this issue Aug 28, 2023 · 24 comments

Comments

@samhodge-aiml

samhodge-aiml commented Aug 28, 2023

I have a series of photos:
https://drive.google.com/drive/folders/1ZZgZUrFrnP47rx8bN5K6yvYnSC50a-9G?usp=drive_link

They were taken with an iPhone 13 Pro Max.

I have used this dataset with Instant NGP from NVIDIA and with Gaussian Splatting to produce a good radiance field.

Do you think this dataset will work with the code in this repository?

My changes are recorded here

diff --git a/data/iphone.py b/data/iphone.py
index 05cf1d5..e34bcc8 100644
--- a/data/iphone.py
+++ b/data/iphone.py
@@ -17,7 +17,7 @@ from util import log,debug
 class Dataset(base.Dataset):
 
     def __init__(self,opt,split="train",subset=None):
-        self.raw_H,self.raw_W = 1080,1920
+        self.raw_H,self.raw_W = 3024,4032
         super().__init__(opt,split)
         self.root = opt.data.root or "data/iphone"
         self.path = "{}/{}".format(self.root,opt.data.scene)
@@ -62,7 +62,7 @@ class Dataset(base.Dataset):
         return image
 
     def get_camera(self,opt,idx):
-        self.focal = self.raw_W*4.2/(12.8/2.55)
+        self.focal = self.raw_W*1.6*35.0
         intr = torch.tensor([[self.focal,0,self.raw_W/2],
                              [0,self.focal,self.raw_H/2],
                              [0,0,1]]).float()
diff --git a/options/barf_iphone.yaml b/options/barf_iphone.yaml
index f344c7b..fbfcc38 100644
--- a/options/barf_iphone.yaml
+++ b/options/barf_iphone.yaml
@@ -1,6 +1,19 @@
-_parent_: options/barf_llff.yaml
+_parent_: options/nerf_iphone.yaml
 
-data:                                                       # data options
-    dataset: iphone                                         # dataset name
-    scene: IMG_0239                                         # scene name
-    image_size: [480,640]                                   # input image sizes [height,width]
+barf_c2f:                                                   # coarse-to-fine scheduling on positional encoding
+
+camera:                                                     # camera options
+    noise:                                                  # synthetic perturbations on the camera poses (Blender only)
+
+optim:                                                      # optimization options
+    lr_pose: 3.e-3                                          # learning rate of camera poses
+    lr_pose_end: 1.e-5                                      # terminal learning rate of camera poses (only used with sched_pose.type=ExponentialLR)
+    sched_pose:                                             # learning rate scheduling options
+        type: ExponentialLR                                 # scheduler (see PyTorch doc)
+        gamma:                                              # decay rate (can be empty if lr_pose_end were specified)
+    warmup_pose:                                            # linear warmup of the pose learning rate (N iterations)
+    test_photo: true                                        # test-time photometric optimization for evaluation
+    test_iter: 100                                          # number of iterations for test-time optimization
+
+visdom:                                                     # Visdom options
+    cam_depth: 0.2                                          # size of visualized cameras
diff --git a/requirements.yaml b/requirements.yaml
index 0baf8b0..2865db4 100644
--- a/requirements.yaml
+++ b/requirements.yaml
@@ -2,6 +2,7 @@ name: barf-env
 channels:
   - conda-forge
   - pytorch
+  - nvidia
 dependencies:
   - numpy
   - scipy
@@ -10,7 +11,8 @@ dependencies:
   - easydict
   - imageio
   - ipdb
-  - pytorch>=1.9.0
+  - pytorch
+  - pytorch-cuda=11.8
   - torchvision
   - tensorboard
   - visdom

and I removed "IMG_" from the file names.

I am training the model now.

Do you have an estimate of how long this might take on an RTX 3090?

What viewer can I use to make renders from the radiance field produced from this training run?

Example image below, EXIF information should be intact:
[image: 6063]

Sam

@samhodge-aiml
Author

Options attached.
options.zip

@samhodge

After 2 hours it had not completed 10 iterations. What am I doing wrong?

@chenhsuanlin
Owner

@samhodge @samhodge-aiml I'm not super confident whether BARF would work well on your data, as the viewpoint coverage is not as dense as what we had been experimenting with before. My estimate of the runtime on a 3090 would be 8-10 hours, but I don't have one to benchmark with so I cannot say for sure (also it has been quite a while since I developed this project). The training shouldn't get stuck at 10 iterations though -- could you share the training log?

@samhodge-aiml
Author

samhodge-aiml commented Aug 30, 2023

That is the thing: the GPU was loaded up (RTX 3090, 24 GB, 39% GPU compute, <5% GPU memory).

But nothing was really being logged at all.

I will try running it again and see if I can get something to share with you.

There was no error and no TensorBoard logs to speak of, but there was a file in the output directory, so write permission was OK. I turned off visdom.

Let me give you everything I have so far and we can get to the bottom of it.

Thanks a million for the response.

@samhodge-aiml
Author

samhodge-aiml commented Aug 30, 2023

Here is the stdout:

python3 train.py --group=samh --model=barf --yaml=barf_iphone --name=bakerst006 --data.scene=bakerst --barf_c2f=[0.1,0.5] --visdom!
Process ID: 18377
[train.py] (PyTorch code for training NeRF/BARF)
setting configurations...
loading options/base.yaml...
loading options/nerf_llff.yaml...
loading options/barf_llff.yaml...
loading options/barf_iphone.yaml...
* H: 480
* W: 640
* arch:
   * density_activ: softplus
   * layers_feat: [None, 256, 256, 256, 256, 256, 256, 256, 256]
   * layers_rgb: [None, 128, 3]
   * posenc:
      * L_3D: 10
      * L_view: 4
   * skip: [4]
   * tf_init: True
* barf_c2f: [0.1, 0.5]
* batch_size: None
* camera:
   * model: perspective
   * ndc: False
   * noise: None
* cpu: False
* data:
   * augment:
   * center_crop: None
   * dataset: iphone
   * image_size: [480, 640]
   * num_workers: 4
   * preload: True
   * root: None
   * scene: bakerst
   * train_sub: None
   * val_on_test: False
   * val_ratio: 0.1
   * val_sub: None
* device: cuda:0
* freq:
   * ckpt: 5000
   * scalar: 200
   * val: 2000
   * vis: 1000
* gpu: 0
* group: samh
* load: None
* loss_weight:
   * render: 0
   * render_fine: None
* max_epoch: None
* max_iter: 10
* model: barf
* name: bakerst006
* nerf:
   * density_noise_reg: None
   * depth:
      * param: inverse
      * range: [1, 0]
   * fine_sampling: False
   * rand_rays: 2048
   * sample_intvs: 128
   * sample_intvs_fine: None
   * sample_stratified: True
   * setbg_opaque: None
   * view_dep: True
* optim:
   * algo: Adam
   * lr: 0.001
   * lr_end: 0.0001
   * lr_pose: 0.003
   * lr_pose_end: 1e-05
   * sched:
      * gamma: None
      * type: ExponentialLR
   * sched_pose:
      * gamma: None
      * type: ExponentialLR
   * test_iter: 100
   * test_photo: True
   * warmup_pose: None
* output_path: output/samh/bakerst006
* output_root: output
* resume: False
* seed: 0
* tb:
   * num_images: [4, 8]
* visdom: False
* yaml: barf_iphone
(creating new options file...)
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Loading model from: /media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/lpips/weights/v0.1/alex.pth
loading training data...
number of samples: 75                                                                                                            
loading test data...
number of samples: 8                                                                                                             
building networks...
setting up optimizers...
initializing weights from scratch...
setting up visualizers...
TRAINING START
validating:   0%|                                                                                          | 0/8 [00:00<?, ?it/s]/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680557665316/work/aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

It is sitting at this point.

nvidia-smi

Wed Aug 30 18:37:26 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:06:00.0  On |                  N/A |
| 66%   68C    P2             243W / 350W |   1047MiB / 24576MiB |     39%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:0A:00.0 Off |                  N/A |
| 32%   43C    P8              23W / 350W |     15MiB / 24576MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2400      G   /usr/lib/xorg/Xorg                          134MiB |
|    0   N/A  N/A      4361      G   /usr/bin/gnome-shell                         95MiB |
|    0   N/A  N/A     14240      G   ...sion,SpareRendererForSitePerProcess       53MiB |
|    0   N/A  N/A     18377      C   python3                                     666MiB |
|    0   N/A  N/A     18794      G   ...4151621,13186568319809438527,262144       77MiB |
|    1   N/A  N/A      2400      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

@samhodge-aiml
Author

One hour later, no progress. I will leave it running overnight and see if anything happens.

@samhodge-aiml
Author

It has been running for over 10 hours now with no progress, so I am going to stop it and save the electricity.

@chenhsuanlin
Owner

This shouldn't happen. Could you help pinpoint which line it hangs at?

@samhodge

I can certainly keyboard-interrupt the job and give you the stack trace.
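A keyboard interrupt ends the run. As an alternative for locating a hang while the process keeps going, here is a minimal sketch using Python's standard `faulthandler` module. It assumes a Unix-like system (signal registration is not available on Windows), and `enable_stack_dump` is a hypothetical helper name, not something that exists in this repository.

```python
import faulthandler
import signal
import sys


def enable_stack_dump(signum=signal.SIGUSR1, file=sys.stderr):
    """Dump the Python stack of every thread to `file` when `signum` arrives.

    Hypothetical helper: call it once near the top of the training script.
    The process keeps running after the dump, so it can be repeated to see
    whether the stack changes between dumps.
    """
    faulthandler.register(signum, file=file, all_threads=True)


# Usage from a second terminal, with the process ID that the script prints:
#   kill -USR1 <pid>
```

If two dumps taken a few minutes apart show different lines, the job is slow rather than stuck on a single call.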

@samhodge-aiml
Author

Traceback (most recent call last):                                              
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/train.py", line 32, in <module>
    main()
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/train.py", line 29, in main
    m.train(opt)
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/nerf.py", line 54, in train
    if self.iter_start==0: self.validate(opt,0)
                           ^^^^^^^^^^^^^^^^^^^^
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/barf.py", line 66, in validate
    super().validate(opt,ep=ep)
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/base.py", line 152, in validate
    var = self.graph.forward(opt,var,mode="val")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/nerf.py", line 210, in forward
    ret = self.render_by_slices(opt,pose,intr=var.intr,mode=mode) if opt.nerf.rand_rays else \
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/nerf.py", line 267, in render_by_slices
    ret = self.render(opt,pose,intr=intr,ray_idx=ray_idx,mode=mode) # [B,R,3],[B,R,1]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/nerf.py", line 236, in render
    center,ray = camera.get_center_and_ray(opt,pose,intr=intr) # [B,HW,3]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/camera.py", line 241, in get_center_and_ray
    grid_3D = cam2world(grid_3D,pose) # [B,HW,3]
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/camera.py", line 213, in cam2world
    return X_hom@pose_inv.transpose(-1,-2)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

@samhodge-aiml
Author

Could it be that the focal length for the camera is causing an unsolvable matrix?
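One way to sanity-check a focal length is to convert it to the field of view it implies. The sketch below does that for the two expressions in the diff at the top of this thread. The helper names are hypothetical, and it assumes a plain pinhole model in which the focal length in pixels is the image width times the physical focal length divided by the sensor width. It says nothing about whether the focal length is what makes training hang.

```python
import math


def focal_length_pixels(image_width_px, focal_mm, sensor_width_mm):
    """Pinhole focal length in pixels from physical lens and sensor sizes."""
    return image_width_px * focal_mm / sensor_width_mm


def horizontal_fov_degrees(image_width_px, focal_px):
    """Horizontal field of view implied by a focal length in pixels."""
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * focal_px)))


# Expression in the repository before the change (1920-pixel-wide frames).
focal_before = focal_length_pixels(1920, 4.2, 12.8 / 2.55)
# Expression from the diff in this thread (4032-pixel-wide photos).
focal_after = 4032 * 1.6 * 35.0

fov_before = horizontal_fov_degrees(1920, focal_before)  # about 62 degrees
fov_after = horizontal_fov_degrees(4032, focal_after)    # about 1 degree
```

A field of view of about one degree would describe a strong telephoto lens rather than a phone's main camera, so the new expression may be worth rechecking against the focal length and sensor size recorded for the photos.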

@samhodge-aiml
Author

related? #76 (comment)

@samhodge

Might try here tomorrow https://camp-nerf.github.io/

@chenhsuanlin
Owner

Yes, it is likely stuck in the loop as in #76. If you use batch size 1 the issue will likely go away -- I have not been able to figure out exactly where the bug was. CamP should be a quite decent improvement over BARF in joint camera optimization. I would definitely encourage you to try it out if they have the code released.

@samhodge-aiml
Author

no code yet, batch size of one it is

@samhodge-aiml
Author

Batch size of one didn't seem to work for me either.

@SwirtaB

SwirtaB commented Sep 1, 2023

Hi,

while I was working with this codebase I faced a similar issue (training stuck in an endless loop). It turned out that during sampling along the ray there was (roughly) exponential growth in depth for the last few samples, with the last ones as large as a few thousand (or even 10000 on one occasion). This caused gradients to explode during backpropagation and some of the parameters became NaN, hence the calculated rays had NaN values in them. I wasn't able to pinpoint a specific error in the implementation. Bear in mind that I was experimenting on a heavily modified architecture, so I encourage you to check for abnormal values; details in doc.

There are a number of strategies to deal with this problem (assuming that gradient explosion is what is causing it); the simplest is to clip abnormal samples, which is a very fast workaround. This can affect the results, but erroneous samples make up a very small proportion of the total training data, so it shouldn't be too bad.
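To illustrate how depths of this size can arise and what clipping does to them, here is a minimal plain-Python sketch for a single ray. It assumes that `depth.param: inverse` with `range: [1, 0]` in the options dump above means stratified sampling that is uniform in inverse depth; that is a guess at the behaviour, not a copy of the repository code. The helper names, the `eps` value and the `max_depth` of 1e3 are all hypothetical. On tensors, `torch.isfinite` and `torch.clamp` play the same roles.

```python
import math
import random


def sample_depths_inverse(num_intervals, inv_near=1.0, inv_far=0.0,
                          eps=1e-8, rng=None):
    """Stratified samples that are uniform in inverse depth (assumed scheme)."""
    rng = rng or random.Random(0)
    depths = []
    for i in range(num_intervals):
        t = (i + rng.random()) / num_intervals  # position inside bin i
        inv_depth = inv_near + t * (inv_far - inv_near)
        depths.append(1.0 / (inv_depth + eps))  # eps avoids division by zero
    return depths


def clip_and_check(depths, max_depth=1e3):
    """Reject non-finite samples, then clip the rest to max_depth."""
    if not all(math.isfinite(d) for d in depths):
        raise ValueError("non-finite depth sample")
    return [min(d, max_depth) for d in depths]


depths = sample_depths_inverse(128)   # 128 matches sample_intvs in the log
clipped = clip_and_check(depths)
```

Under this assumption the last bin covers inverse depths below 1/128, so its sample is at a depth of roughly 128 or more and can be far larger, while the first samples stay close to 1.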

@samhodge

samhodge commented Sep 1, 2023

Thanks a million. Maybe tomorrow I can eke out a little time to see if I can make this into a PR.

The information is very generous, but I am not sure my skills are ready right now to debug and patch the issue. But why die wondering, right? I will see what I can do.

@chenhsuanlin
Owner

@SwirtaB thanks for the feedback! I hadn't been able to deterministically reproduce this issue, and did not realize it had to do with the sampled coordinates. In this case, this line is likely the culprit, where the depth of the last sample is set to a very large number (1e10). @samhodge if you find that tweaking the code to lower it to e.g. 1e3 would help, please let me know and I'm happy to make a hotfix.
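To see why the size of that last interval matters, the sketch below evaluates the opacity of the final sample only, using alpha = 1 - exp(-density * interval * ray_length) as in the `composite` method. `last_sample_alpha` is a hypothetical helper and the density value is made up for illustration. It shows how the two constants differ; it does not show that the constant is the cause of the hang.

```python
import math


def last_sample_alpha(density, last_interval, ray_length=1.0):
    """Opacity of the final sample for a given padded interval length."""
    return 1.0 - math.exp(-density * last_interval * ray_length)


tiny_density = 1e-6  # illustrative value, not taken from a real run

alpha_1e10 = last_sample_alpha(tiny_density, 1e10)  # saturates at 1.0
alpha_1e3 = last_sample_alpha(tiny_density, 1e3)    # stays near 0.001
```

With 1e10 even a negligible density makes the last sample fully opaque, which is the usual way of letting the final sample absorb whatever transmittance is left; with 1e3 the same density leaves it almost transparent.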

@samhodge

samhodge commented Sep 1, 2023

Yeah I can certainly write a smoothstep function to roll it off to a limit.

https://en.wikipedia.org/wiki/Smoothstep
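A minimal sketch of what such a roll-off could look like, in plain Python for a single value. `smoothstep` follows the standard cubic definition from the linked page; `soft_limit` and its `knee_fraction` parameter are hypothetical, one possible way to blend from the identity into a hard limit, and it assumes a positive `limit`.

```python
def smoothstep(edge0, edge1, x):
    """Cubic Hermite interpolation from 0 to 1 between edge0 and edge1."""
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def soft_limit(value, limit, knee_fraction=0.5):
    """Leave small values untouched and roll larger ones off towards limit."""
    knee = knee_fraction * limit
    if value <= knee:
        return value      # below the knee: identity
    if value >= limit:
        return limit      # at or above the limit: hard cap
    s = smoothstep(knee, limit, value)
    return (1.0 - s) * value + s * limit  # blend identity into the cap
```

For example, `soft_limit(750.0, 1000.0)` gives 875.0, while anything at or above 1000 maps to 1000. Compared with a hard clip, the transition region keeps a non-zero slope.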

@samhodge-aiml
Author

Trying this

diff --git a/model/nerf.py b/model/nerf.py
index b0dcb2c..eefef60 100644
--- a/model/nerf.py
+++ b/model/nerf.py
@@ -393,7 +393,7 @@ class NeRF(torch.nn.Module):
         ray_length = ray.norm(dim=-1,keepdim=True) # [B,HW,1]
         # volume rendering: compute probability (using quadrature)
         depth_intv_samples = depth_samples[...,1:,0]-depth_samples[...,:-1,0] # [B,HW,N-1]
-        depth_intv_samples = torch.cat([depth_intv_samples,torch.empty_like(depth_intv_samples[...,:1]).fill_(1e10)],dim=2) # [B,HW,N]
+        depth_intv_samples = torch.cat([depth_intv_samples,torch.empty_like(depth_intv_samples[...,:1]).fill_(1e3)],dim=2) # [B,HW,N]
         dist_samples = depth_intv_samples*ray_length # [B,HW,N]
         sigma_delta = density_samples*dist_samples # [B,HW,N]
         alpha = 1-(-sigma_delta).exp_() # [B,HW,N]

@samhodge-aiml
Author

That one didn't work. I have another idea:

https://numpy.org/doc/stable/reference/generated/numpy.heaviside.html

@samhodge

samhodge commented Sep 1, 2023

Other things that do not work:

diff --git a/data/iphone.py b/data/iphone.py
index 05cf1d5..e34bcc8 100644
--- a/data/iphone.py
+++ b/data/iphone.py
@@ -17,7 +17,7 @@ from util import log,debug
 class Dataset(base.Dataset):
 
     def __init__(self,opt,split="train",subset=None):
-        self.raw_H,self.raw_W = 1080,1920
+        self.raw_H,self.raw_W = 3024,4032
         super().__init__(opt,split)
         self.root = opt.data.root or "data/iphone"
         self.path = "{}/{}".format(self.root,opt.data.scene)
@@ -62,7 +62,7 @@ class Dataset(base.Dataset):
         return image
 
     def get_camera(self,opt,idx):
-        self.focal = self.raw_W*4.2/(12.8/2.55)
+        self.focal = self.raw_W*1.6*35.0
         intr = torch.tensor([[self.focal,0,self.raw_W/2],
                              [0,self.focal,self.raw_H/2],
                              [0,0,1]]).float()
diff --git a/model/nerf.py b/model/nerf.py
index b0dcb2c..9a02e77 100644
--- a/model/nerf.py
+++ b/model/nerf.py
@@ -391,9 +391,11 @@ class NeRF(torch.nn.Module):
 
     def composite(self,opt,ray,rgb_samples,density_samples,depth_samples):
         ray_length = ray.norm(dim=-1,keepdim=True) # [B,HW,1]
+        ray_length = numpy.clip(ray_length, 0, 1e3)
+        
         # volume rendering: compute probability (using quadrature)
         depth_intv_samples = depth_samples[...,1:,0]-depth_samples[...,:-1,0] # [B,HW,N-1]
-        depth_intv_samples = torch.cat([depth_intv_samples,torch.empty_like(depth_intv_samples[...,:1]).fill_(1e10)],dim=2) # [B,HW,N]
+        depth_intv_samples = torch.cat([depth_intv_samples,torch.empty_like(depth_intv_samples[...,:1]).fill_(1e3)],dim=2) # [B,HW,N]
         dist_samples = depth_intv_samples*ray_length # [B,HW,N]
         sigma_delta = density_samples*dist_samples # [B,HW,N]
         alpha = 1-(-sigma_delta).exp_() # [B,HW,N]
diff --git a/options/barf_iphone.yaml b/options/barf_iphone.yaml
index f344c7b..d58794b 100644
--- a/options/barf_iphone.yaml
+++ b/options/barf_iphone.yaml
@@ -2,5 +2,7 @@ _parent_: options/barf_llff.yaml
 
 data:                                                       # data options
     dataset: iphone                                         # dataset name
-    scene: IMG_0239                                         # scene name
+    scene: bakerst                                         # scene name
     image_size: [480,640]                                   # input image sizes [height,width]
+max_iter: 10
+batch_size: 1
diff --git a/requirements.yaml b/requirements.yaml
index 0baf8b0..2865db4 100644
--- a/requirements.yaml
+++ b/requirements.yaml
@@ -2,6 +2,7 @@ name: barf-env
 channels:
   - conda-forge
   - pytorch
+  - nvidia
 dependencies:
   - numpy
   - scipy
@@ -10,7 +11,8 @@ dependencies:
   - easydict
   - imageio
   - ipdb
-  - pytorch>=1.9.0
+  - pytorch
+  - pytorch-cuda=11.8
   - torchvision
   - tensorboard
   - visdom

@SwirtaB

SwirtaB commented Sep 2, 2023

@chenhsuanlin no problem. I gave your suggestion a try and it only delayed the problem for me; training hung much later. Then I cross-checked your implementation of composite against the NeRF paper and their official implementation. By my understanding, the whole of equation 3 from the paper reduces to alpha compositing. In their implementation they calculate it slightly differently (original impl), so I gave it a try. I commented out the T calculation and calculated prob as:

prob = (alpha * torch.cumprod(1.0 - alpha + 1e-10, dim=2))[..., None]

Unfortunately that didn't solve the problem, only delayed it again. That being said, any workaround that ensures proper sample values (either by clipping or something else) works quite well. Maybe that is the proper solution, since NeRFs are still neural networks and improper inputs could lead to all sorts of problems.

EDIT:
The T calculation here and the one from the original implementation are identical (in the mathematical sense) and differ only in numerical approach; I hadn't noticed that at the beginning.
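The equivalence can be checked on a single ray with a short plain-Python sketch. Both helper names are hypothetical. The first form accumulates density times interval and exponentiates; the second multiplies (1 - alpha) over the preceding samples. In both, the transmittance of sample i covers only samples before i (an exclusive product). `torch.cumprod` is inclusive, so a tensor version of the second form needs a shift by one sample to match.

```python
import math


def weights_exp_cumsum(sigma_delta):
    """T_i = exp(-sum_{j<i} sigma_j * delta_j); weight_i = T_i * alpha_i."""
    weights, accumulated = [], 0.0
    for sd in sigma_delta:
        alpha = 1.0 - math.exp(-sd)
        weights.append(math.exp(-accumulated) * alpha)
        accumulated += sd  # updated after use, so sample i is excluded
    return weights


def weights_exclusive_cumprod(sigma_delta, eps=1e-10):
    """T_i = prod_{j<i} (1 - alpha_j + eps); weight_i = T_i * alpha_i."""
    weights, transmittance = [], 1.0
    for sd in sigma_delta:
        alpha = 1.0 - math.exp(-sd)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha + eps  # updated after use as well
    return weights


sigma_delta = [0.1, 0.5, 2.0, 1e3]  # illustrative density * interval values
w_sum = weights_exp_cumsum(sigma_delta)
w_prod = weights_exclusive_cumprod(sigma_delta)
```

The two lists agree up to the small `eps` term, which is consistent with the observation that the formulations differ only numerically.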
