NaN or Inf found in input tensor. #331

Open
vr-devil opened this issue Jul 17, 2023 · 1 comment
Labels
bug Something isn't working

@vr-devil

Description

(venv) kai@ns-staging:~/workspace/stable-dreamfusion$ python main.py --text "A red dinosaur in boots." --workspace /var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k -O --iters 30000
Namespace(file=None, text='A red dinosaur in boots.', negative='', O=True, O2=False, test=False, six_views=False, eval_interval=1, test_interval=100, workspace='/var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k', seed=None, image=None, image_config=None, known_view_interval=4, IF=False, guidance=['SD'], guidance_scale=100, save_mesh=False, mcubes_resolution=256, decimate_target=50000.0, dmtet=False, tet_grid_size=128, init_with='', lock_geo=False, perpneg=False, negative_w=-2, front_decay_factor=2, side_decay_factor=10, iters=30000, lr=0.001, ckpt='latest', cuda_ray=True, taichi_ray=False, max_steps=1024, num_steps=64, upsample_steps=32, update_extra_interval=16, max_ray_batch=4096, latent_iter_ratio=0.2, albedo_iter_ratio=0, min_ambient_ratio=0.1, textureless_ratio=0.2, jitter_pose=False, jitter_center=0.2, jitter_target=0.2, jitter_up=0.02, uniform_sphere_rate=0, grad_clip=-1, grad_clip_rgb=-1, bg_radius=1.4, density_activation='exp', density_thresh=10, blob_density=5, blob_radius=0.2, backbone='grid', optim='adan', sd_version='2.1', hf_key=None, fp16=True, vram_O=False, w=64, h=64, known_view_scale=1.5, known_view_noise_scale=0.002, dmtet_reso_scale=8, batch_size=1, bound=1, dt_gamma=0, min_near=0.01, radius_range=[3.0, 3.5], theta_range=[45, 105], phi_range=[-180, 180], fovy_range=[10, 30], default_radius=3.2, default_polar=90, default_azimuth=0, default_fovy=20, progressive_view=False, progressive_view_init_ratio=0.2, progressive_level=False, angle_overhead=30, angle_front=60, t_range=[0.02, 0.98], dont_override_stuff=False, lambda_entropy=0.001, lambda_opacity=0, lambda_orient=0.01, lambda_tv=0, lambda_wd=0, lambda_mesh_normal=0.5, lambda_mesh_laplacian=0.5, lambda_guidance=1, lambda_rgb=1000, lambda_mask=500, lambda_normal=0, lambda_depth=10, lambda_2d_normal_smooth=0, lambda_3d_normal_smooth=0, save_guidance=False, save_guidance_interval=10, gui=False, W=800, H=800, radius=5, fovy=20, light_theta=60, light_phi=0, max_spp=1, 
zero123_config='./pretrained/zero123/sd-objaverse-finetune-c_concat-256.yaml', zero123_ckpt='./pretrained/zero123/105000.ckpt', zero123_grad_scale='angle', dataset_size_train=100, dataset_size_valid=8, dataset_size_test=100, exp_start_iter=0, exp_end_iter=30000, images=None, ref_radii=[], ref_polars=[], ref_azimuths=[], zero123_ws=[], default_zero123_w=1)
NeRFNetwork(
  (encoder): GridEncoder: input_dim=3 num_levels=16 level_dim=2 resolution=16 -> 2048 per_level_scale=1.3819 params=(6098120, 2) gridtype=hash align_corners=False interpolation=smoothstep
  (sigma_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=32, out_features=64, bias=True)
      (1): Linear(in_features=64, out_features=64, bias=True)
      (2): Linear(in_features=64, out_features=4, bias=True)
    )
  )
  (encoder_bg): FreqEncoder: input_dim=3 degree=6 output_dim=39
  (bg_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=39, out_features=32, bias=True)
      (1): Linear(in_features=32, out_features=3, bias=True)
    )
  )
)
[INFO] loading stable diffusion...
[INFO] loaded stable diffusion!
[INFO] Cmdline: main.py --text A red dinosaur in boots. --workspace /var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k -O --iters 30000
[INFO] opt: Namespace(file=None, text='A red dinosaur in boots.', negative='', O=True, O2=False, test=False, six_views=False, eval_interval=1, test_interval=100,
workspace='/var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k', seed=None, image=None, image_config=None, known_view_interval=4, IF=False, guidance=['SD'], guidance_scale=100, save_mesh=False,
mcubes_resolution=256, decimate_target=50000.0, dmtet=False, tet_grid_size=128, init_with='', lock_geo=False, perpneg=False, negative_w=-2, front_decay_factor=2, side_decay_factor=10, iters=30000, lr=0.001,
ckpt='latest', cuda_ray=True, taichi_ray=False, max_steps=1024, num_steps=64, upsample_steps=32, update_extra_interval=16, max_ray_batch=4096, latent_iter_ratio=0.2, albedo_iter_ratio=0, min_ambient_ratio=0.1,
textureless_ratio=0.2, jitter_pose=False, jitter_center=0.2, jitter_target=0.2, jitter_up=0.02, uniform_sphere_rate=0, grad_clip=-1, grad_clip_rgb=-1, bg_radius=1.4, density_activation='exp', density_thresh=10,
blob_density=5, blob_radius=0.2, backbone='grid', optim='adan', sd_version='2.1', hf_key=None, fp16=True, vram_O=False, w=64, h=64, known_view_scale=1.5, known_view_noise_scale=0.002, dmtet_reso_scale=8,
batch_size=1, bound=1, dt_gamma=0, min_near=0.01, radius_range=[3.0, 3.5], theta_range=[45, 105], phi_range=[-180, 180], fovy_range=[10, 30], default_radius=3.2, default_polar=90, default_azimuth=0,
default_fovy=20, progressive_view=False, progressive_view_init_ratio=0.2, progressive_level=False, angle_overhead=30, angle_front=60, t_range=[0.02, 0.98], dont_override_stuff=False, lambda_entropy=0.001,
lambda_opacity=0, lambda_orient=0.01, lambda_tv=0, lambda_wd=0, lambda_mesh_normal=0.5, lambda_mesh_laplacian=0.5, lambda_guidance=1, lambda_rgb=1000, lambda_mask=500, lambda_normal=0, lambda_depth=10,
lambda_2d_normal_smooth=0, lambda_3d_normal_smooth=0, save_guidance=False, save_guidance_interval=10, gui=False, W=800, H=800, radius=5, fovy=20, light_theta=60, light_phi=0, max_spp=1,
zero123_config='./pretrained/zero123/sd-objaverse-finetune-c_concat-256.yaml', zero123_ckpt='./pretrained/zero123/105000.ckpt', zero123_grad_scale='angle', dataset_size_train=100, dataset_size_valid=8,
dataset_size_test=100, exp_start_iter=0, exp_end_iter=30000, images=None, ref_radii=[], ref_polars=[], ref_azimuths=[], zero123_ws=[], default_zero123_w=1)
[INFO] Trainer: df | 2023-07-17_21-08-20 | cuda | fp16 | /var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k
[INFO] #parameters: 12204151
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.

......

==> [2023-07-17_21-23-40] Start Training /var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k Epoch 81/300, lr=0.050000 ...
loss=1.0000 (1.0000), lr=0.050000: : 100% 100/100 [00:18<00:00,  5.36it/s]
==> [2023-07-17_21-23-59] Finished Epoch 81/300. CPU=3.9GB, GPU=8.0GB.
++> Evaluate /var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k at epoch 81 ...
loss=0.0000 (0.0000): : 100% 8/8 [00:00<00:00, 53.78it/s]
++> Evaluate epoch 81 Finished.
==> [2023-07-17_21-23-59] Start Training /var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k Epoch 82/300, lr=0.050000 ...
loss=1.0000 (1.0000), lr=0.050000: :  50% 50/100 [00:09<00:09,  5.39it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  51% 51/100 [00:09<00:09,  5.36it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  52% 52/100 [00:09<00:08,  5.35it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  53% 53/100 [00:09<00:08,  5.35it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  54% 54/100 [00:10<00:08,  5.34it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  55% 55/100 [00:10<00:08,  5.36it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  56% 56/100 [00:10<00:08,  5.33it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  57% 57/100 [00:10<00:08,  5.33it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  58% 58/100 [00:10<00:07,  5.36it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  59% 59/100 [00:10<00:07,  5.34it/s]NaN or Inf found in input tensor.
loss=nan (nan), lr=0.050000: :  60% 60/100 [00:11<00:07,  5.35it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/kai/workspace/stable-dreamfusion/main.py:410 in <module>                                   │
│                                                                                                  │
│   407 │   │   │   test_loader = NeRFDataset(opt, device=device, type='test', H=opt.H, W=opt.W,   │
│   408 │   │   │                                                                                  │
│   409 │   │   │   max_epoch = np.ceil(opt.iters / len(train_loader)).astype(np.int32)            │
│ ❱ 410 │   │   │   trainer.train(train_loader, valid_loader, test_loader, max_epoch)              │
│   411 │   │   │                                                                                  │
│   412 │   │   │   if opt.save_mesh:                                                              │
│   413 │   │   │   │   trainer.save_mesh()                                                        │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/nerf/utils.py:812 in train                                │
│                                                                                                  │
│    809 │   │   for epoch in range(self.epoch + 1, max_epochs + 1):                               │
│    810 │   │   │   self.epoch = epoch                                                            │
│    811 │   │   │                                                                                 │
│ ❱  812 │   │   │   self.train_one_epoch(train_loader, max_epochs)                                │
│    813 │   │   │                                                                                 │
│    814 │   │   │   if self.workspace is not None and self.local_rank == 0:                       │
│    815 │   │   │   │   self.save_checkpoint(full=True, best=False)                               │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/nerf/utils.py:1049 in train_one_epoch                     │
│                                                                                                  │
│   1046 │   │   │   │   │   save_guidance_path = save_guidance_folder / f'step_{self.global_step  │
│   1047 │   │   │   │   else:                                                                     │
│   1048 │   │   │   │   │   save_guidance_path = None                                             │
│ ❱ 1049 │   │   │   │   pred_rgbs, pred_depths, loss = self.train_step(data, save_guidance_path=  │
│   1050 │   │   │                                                                                 │
│   1051 │   │   │   # hooked grad clipping for RGB space                                          │
│   1052 │   │   │   if self.opt.grad_clip_rgb >= 0:                                               │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/nerf/utils.py:537 in train_step                           │
│                                                                                                  │
│    534 │   │   │   else:                                                                         │
│    535 │   │   │   │   bg_color = torch.rand(3).to(self.device) # single color random bg         │
│    536 │   │                                                                                     │
│ ❱  537 │   │   outputs = self.model.render(rays_o, rays_d, mvp, H, W, staged=False, perturb=Tru  │
│    538 │   │   pred_depth = outputs['depth'].reshape(B, 1, H, W)                                 │
│    539 │   │   pred_mask = outputs['weights_sum'].reshape(B, 1, H, W)                            │
│    540 │   │   if 'normal_image' in outputs:                                                     │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/nerf/renderer.py:1163 in render                           │
│                                                                                                  │
│   1160 │   │   if self.dmtet:                                                                    │
│   1161 │   │   │   results = self.run_dmtet(rays_o, rays_d, mvp, h, w, **kwargs)                 │
│   1162 │   │   elif self.cuda_ray:                                                               │
│ ❱ 1163 │   │   │   results = self.run_cuda(rays_o, rays_d, **kwargs)                             │
│   1164 │   │   elif self.taichi_ray:                                                             │
│   1165 │   │   │   results = self.run_taichi(rays_o, rays_d, **kwargs)                           │
│   1166 │   │   else:                                                                             │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/nerf/renderer.py:739 in run_cuda                          │
│                                                                                                  │
│    736 │   │   │   │   flatten_rays = raymarching.flatten_rays(rays, xyzs.shape[0]).long()       │
│    737 │   │   │   │   light_d = light_d[flatten_rays]                                           │
│    738 │   │   │                                                                                 │
│ ❱  739 │   │   │   sigmas, rgbs, normals = self(xyzs, dirs, light_d, ratio=ambient_ratio, shadi  │
│    740 │   │   │   weights, weights_sum, depth, image = raymarching.composite_rays_train(sigmas  │
│    741 │   │   │                                                                                 │
│    742 │   │   │   # normals related regularizations                                             │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/venv/lib/python3.10/site-packages/torch/nn/modules/module │
│ .py:1501 in _call_impl                                                                           │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/nerf/network_grid.py:110 in forward                       │
│                                                                                                  │
│   107 │   │   # l: [3], plane light direction, nomalized in [-1, 1]                              │
│   108 │   │   # ratio: scalar, ambient ratio, 1 == no shading (albedo only), 0 == only shading   │
│   109 │   │                                                                                      │
│ ❱ 110 │   │   sigma, albedo = self.common_forward(x)                                             │
│   111 │   │                                                                                      │
│   112 │   │   if shading == 'albedo':                                                            │
│   113 │   │   │   normal = None                                                                  │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/nerf/network_grid.py:73 in common_forward                 │
│                                                                                                  │
│    70 │   │   # sigma                                                                            │
│    71 │   │   enc = self.encoder(x, bound=self.bound, max_level=self.max_level)                  │
│    72 │   │                                                                                      │
│ ❱  73 │   │   h = self.sigma_net(enc)                                                            │
│    74 │   │                                                                                      │
│    75 │   │   sigma = self.density_activation(h[..., 0] + self.density_blob(x))                  │
│    76 │   │   albedo = torch.sigmoid(h[..., 1:])                                                 │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/venv/lib/python3.10/site-packages/torch/nn/modules/module │
│ .py:1501 in _call_impl                                                                           │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/nerf/network_grid.py:29 in forward                        │
│                                                                                                  │
│    26 │                                                                                          │
│    27 │   def forward(self, x):                                                                  │
│    28 │   │   for l in range(self.num_layers):                                                   │
│ ❱  29 │   │   │   x = self.net[l](x)                                                             │
│    30 │   │   │   if l != self.num_layers - 1:                                                   │
│    31 │   │   │   │   x = F.relu(x, inplace=True)                                                │
│    32 │   │   return x                                                                           │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/venv/lib/python3.10/site-packages/torch/nn/modules/module │
│ .py:1501 in _call_impl                                                                           │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /home/kai/workspace/stable-dreamfusion/venv/lib/python3.10/site-packages/torch/nn/modules/linear │
│ .py:114 in forward                                                                               │
│                                                                                                  │
│   111 │   │   │   init.uniform_(self.bias, -bound, bound)                                        │
│   112 │                                                                                          │
│   113 │   def forward(self, input: Tensor) -> Tensor:                                            │
│ ❱ 114 │   │   return F.linear(input, self.weight, self.bias)                                     │
│   115 │                                                                                          │
│   116 │   def extra_repr(self) -> str:                                                           │
│   117 │   │   return 'in_features={}, out_features={}, bias={}'.format(                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

loss=nan (nan), lr=0.050000: :  60% 60/100 [00:11<00:07,  5.19it/s]
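The log shows the loss turning into nan at step 51 of epoch 82, after which training carries on for about ten more steps before failing with a CUDA error whose traceback may not point at the real culprit. A fail-fast check on the loss would make the first bad step explicit. The sketch below uses a hypothetical helper, `check_loss`, which is not part of stable-dreamfusion; in a PyTorch loop it would be called with `loss.item()` right after the loss is computed.

```python
import math


class NonFiniteLossError(RuntimeError):
    """Raised as soon as a training loss becomes NaN or +/-inf."""


def check_loss(loss_value: float, epoch: int, step: int) -> float:
    """Return loss_value unchanged if it is finite, otherwise raise.

    Hypothetical helper: pass it a plain Python float (e.g. loss.item()).
    """
    if not math.isfinite(loss_value):
        raise NonFiniteLossError(
            f"non-finite loss {loss_value!r} at epoch {epoch}, step {step}"
        )
    return loss_value


# Simulated losses shaped like the log: finite up to step 50, nan from step 51.
losses = [1.0] * 50 + [float("nan")] * 10
try:
    for step, value in enumerate(losses, start=1):
        check_loss(value, epoch=82, step=step)
except NonFiniteLossError as err:
    print(err)  # non-finite loss nan at epoch 82, step 51
```

Raising instead of silently continuing is a design choice: it surfaces the first bad step and stops before the per-epoch `save_checkpoint` call visible in the traceback can write the affected weights to disk.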

Steps to Reproduce

python main.py --text "A red dinosaur in boots." --workspace /var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k -O --iters 30000
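PyTorch's error message suggests `CUDA_LAUNCH_BLOCKING=1` because CUDA kernel errors can be reported asynchronously at a later API call, so the `F.linear` frame in the traceback may not be where the failure started. A minimal sketch of rerunning the command above with that variable set, using only the standard library (`build_repro` and `run_repro` are hypothetical helpers, assumed to be run from the stable-dreamfusion checkout where `main.py` lives):

```python
import os
import subprocess
import sys

# Arguments copied from the reproduction command above.
REPRO_ARGS = [
    "main.py",
    "--text", "A red dinosaur in boots.",
    "--workspace", "/var/lib/aigc/stable-dreamfusion/trial_dinosaur_iter30k",
    "-O",
    "--iters", "30000",
]


def build_repro(extra_env=None):
    """Return (command, environment) for rerunning the job with synchronous
    CUDA kernel launches, so errors surface at the call that caused them."""
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    if extra_env:
        env.update(extra_env)
    return [sys.executable, *REPRO_ARGS], env


def run_repro():
    # Not called here: launching the full training job is left to the reader.
    cmd, env = build_repro()
    return subprocess.run(cmd, env=env, check=False)
```

From a shell, prefixing the command with `CUDA_LAUNCH_BLOCKING=1` does the same thing. Expect training to be slower while the variable is set.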

Expected Behavior

Training finishes without a crash.

Environment

Ubuntu 22.02 / PyTorch 2.0.1 / CUDA 11.7

@vr-devil added the bug (Something isn't working) label on Jul 17, 2023
@lang-dye

Try disabling `--cuda_ray`; that solved this issue for me. My guess is that it happens when `--cuda_ray` and `--fp16` are enabled together: the CUDA raymarching computes bad tensors, while the pure PyTorch path works fine.
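The fp16 part of this guess is plausible given how narrow float16's range is (the logged Namespace shows both `cuda_ray=True` and `fp16=True` for this run). The largest finite half-precision value is 65504, so an exponential activation (this run uses `density_activation='exp'`) overflows once its input exceeds about 11.09, and an inf then becomes NaN under ordinary arithmetic such as `inf * 0` or `inf - inf`. The standard-library sketch below only illustrates that numeric effect. Whether stable-dreamfusion's raymarching path actually stores densities in half precision is an assumption this thread does not confirm, and `exp_activation_fp16` is a hypothetical helper.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision and back to a Python float.

    struct's 'e' format raises OverflowError for finite values outside the
    float16 range; a GPU cast to half would produce +/-inf instead, so that
    case is mapped to an infinity here.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def exp_activation_fp16(h: float) -> float:
    """Hypothetical stand-in for an 'exp' density activation whose result is
    stored in float16. The exp itself is computed in double precision."""
    return to_fp16(math.exp(h))


# ln(65504) is about 11.09, so pre-activations above that overflow.
for h in (8.0, 11.0, 11.2, 12.0):
    print(f"h={h:5.1f}  exp(h)={math.exp(h):12.1f}  as fp16={exp_activation_fp16(h)}")

# Once an inf appears, ordinary arithmetic turns it into NaN.
sigma = exp_activation_fp16(12.0)
print(sigma * 0.0)    # inf * 0 -> nan
print(sigma - sigma)  # inf - inf -> nan
```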
