
NaN when training #8

Closed · siny1998 opened this issue Jan 17, 2024 · 11 comments
Labels: enhancement (New feature or request)

@siny1998

Hi,

Excellent work on the U-Mamba model! I have been attempting to train U-Mamba using my own datasets, which include both 2D and 3D data. However, I've faced an issue where the training process sometimes results in a 'nan' training loss despite trying different datasets. Have you experienced this issue during your training of U-Mamba? I used the nnUNetTrainerUMambaEnc trainer for this process.

Looking forward to your insights.

Best regards,
Leuan

@JunMa11 (Collaborator) commented Jan 17, 2024

Hi @siny1998 ,

Thanks for your interest.

Does nnUNetTrainerUMambaBot have this issue?

@siny1998 (Author)

I haven't used the nnUNetTrainerUMambaBot yet. I'll give it a try.

@JiarunLiu

Hi @JunMa11,

I have a similar issue when training on the Dataset704_Endovis17 dataset with nnUNetTrainerUMambaEnc, but nnUNetTrainerUMambaBot does not have this issue. These are my training logs for the command nnUNetv2_train 704 2d all -tr nnUNetTrainerUMambaEnc:

This is the configuration used by this training:
Configuration name: 2d
 {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 13, 'patch_size': [384, 640], 'median_image_size_in_voxels': [1080.0, 1920.0], 'spacing': [1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [False, False, False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True} 

These are the global plan.json settings:
 {'dataset_name': 'Dataset704_Endovis17', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [999.0, 1.0, 1.0], 'original_median_shape_after_transp': [1, 1080, 1920], 'image_reader_writer': 'NaturalImage2DIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 255.0, 'mean': 100.00444773514101, 'median': 92.0, 'min': 1.0, 'percentile_00_5': 24.0, 'percentile_99_5': 238.0, 'std': 51.584682233895585}, '1': {'max': 249.0, 'mean': 86.51525510193234, 'median': 72.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 233.0, 'std': 52.24999179949625}, '2': {'max': 255.0, 'mean': 93.29896387602795, 'median': 79.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 244.0, 'std': 56.38243456877845}}} 

2024-01-20 11:56:11.330918: unpacking dataset...
2024-01-20 11:56:11.683550: unpacking done...
2024-01-20 11:56:11.684135: do_dummy_2d_data_aug: False
/mnt/disk2/jiarunliu/Documents/mamba/U-Mamba/umamba/nnunetv2/nets/UMambaEnc.py:41: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert C == self.dim
2024-01-20 11:56:27.301467: Unable to plot network architecture:
2024-01-20 11:56:27.302820: module 'torch.onnx' has no attribute '_optimize_trace'
2024-01-20 11:56:27.350771: 
2024-01-20 11:56:27.350951: Epoch 0
2024-01-20 11:56:27.351222: Current learning rate: 0.01
using pin_memory on device 0
using pin_memory on device 0
2024-01-20 12:01:32.749162: train_loss nan
2024-01-20 12:01:32.749484: val_loss nan
2024-01-20 12:01:32.749587: Pseudo dice [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2024-01-20 12:01:32.749661: Epoch time: 305.4 s
2024-01-20 12:01:32.749717: Yayy! New best EMA pseudo Dice: 0.0

So far, this problem only occurs with Dataset704_Endovis17 + nnUNetTrainerUMambaEnc. I'll try other settings/datasets later.
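
As a side note for debugging, a small forward-hook helper (hypothetical, plain PyTorch; not part of U-Mamba or nnU-Net) can flag the first module whose output goes non-finite, which helps tell whether the NaN starts inside the Mamba blocks or elsewhere:

```python
# Hypothetical debugging helper, not part of U-Mamba: register a forward hook on every
# module so training stops at the first layer that produces a NaN/Inf output.
import torch


def add_nan_hooks(model: torch.nn.Module) -> None:
    def make_hook(name: str):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outs:
                if torch.is_tensor(o) and o.is_floating_point() and not torch.isfinite(o).all():
                    raise RuntimeError(f"Non-finite output detected in module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))


# Usage sketch (assumes `trainer` is an initialized nnU-Net trainer instance):
# add_nan_hooks(trainer.network)
```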

@eclipse0922 commented Jan 21, 2024

Other cases:
Dataset701_AbdomenCT with nnUNetTrainerUMambaEnc: I ran nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaEnc and got the same problem as the others.

I also tried nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaBot; it works without NaN.

Other datasets:
702 (3D): Bot and Enc both seem OK.
703 (2D): Bot causes a negative loss from epoch 1 (screenshot attached); Enc seems OK.
704 (2D): Bot and Enc both seem OK.

I didn't train all the way to convergence, so while there was no problem at the beginning, the results could have changed after several more epochs.

The default optimizer for the nnUNetTrainerUMambaEnc and nnUNetTrainerUMambaBot classes is SGD. I switched to Adam (as in nnUNetTrainerAdam) instead; this worked without NaN at the beginning, but NaN appeared again after several epochs.
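
For anyone who wants to reproduce this Adam experiment, here is a minimal custom-trainer sketch. It assumes the nnU-Net v2 configure_optimizers API (returning an optimizer and an LR scheduler) and these U-Mamba import paths; the class name nnUNetTrainerUMambaEncAdam is made up for illustration:

```python
# Hypothetical trainer variant (illustration only): swap the default SGD for Adam.
# Assumes the nnU-Net v2 trainer API and these U-Mamba import paths.
import torch
from nnunetv2.training.nnUNetTrainer.nnUNetTrainerUMambaEnc import nnUNetTrainerUMambaEnc
from nnunetv2.training.lr_scheduler.polylr import PolyLRScheduler


class nnUNetTrainerUMambaEncAdam(nnUNetTrainerUMambaEnc):
    def configure_optimizers(self):
        # Adam instead of SGD with Nesterov momentum; keep nnU-Net's polynomial LR decay.
        optimizer = torch.optim.Adam(self.network.parameters(),
                                     lr=self.initial_lr,
                                     weight_decay=self.weight_decay)
        lr_scheduler = PolyLRScheduler(optimizer, self.initial_lr, self.num_epochs)
        return optimizer, lr_scheduler
```

If the file is placed somewhere under nnunetv2/training/nnUNetTrainer/, it should be selectable with -tr nnUNetTrainerUMambaEncAdam, the same way the other trainers in this thread are invoked.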

@eclipse0922 commented Jan 22, 2024

As with any model training, the combination of optimizer, trainer, and learning rate seems to matter.
I tried SGD with a learning rate of 1e-3, which is much more stable than the default 1e-2.
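
A minimal sketch of such a lower-learning-rate variant, assuming the nnU-Net v2 trainer constructor signature and the U-Mamba import path (the class name nnUNetTrainerUMambaEncLowLR is made up here):

```python
# Hypothetical trainer variant (illustration only): keep SGD but lower the initial
# learning rate from the default 1e-2 to 1e-3, as suggested above.
import torch
from nnunetv2.training.nnUNetTrainer.nnUNetTrainerUMambaEnc import nnUNetTrainerUMambaEnc


class nnUNetTrainerUMambaEncLowLR(nnUNetTrainerUMambaEnc):
    def __init__(self, plans: dict, configuration: str, fold: int, dataset_json: dict,
                 unpack_dataset: bool = True, device: torch.device = torch.device('cuda')):
        super().__init__(plans, configuration, fold, dataset_json, unpack_dataset, device)
        self.initial_lr = 1e-3  # default is 1e-2
```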

@JunMa11 (Collaborator) commented Jan 29, 2024

Hi all,

Thanks for your valuable feedback!
We are testing the model on more datasets. You're welcome to subscribe to updates on this issue.

JunMa11 added the enhancement (New feature or request) label on Jan 29, 2024
@innocence0206

I got the NaN loss too with nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaEnc. I tried re-running nnUNetv2_plan_and_preprocess -d 701 --verify_dataset_integrity, but it didn't help.

@gumayusi3

I also have this problem when training on an OCTA dataset.

@wyjzll commented Mar 18, 2024

Excellent work! I also faced the same problem on several datasets. Dice, train loss, and val loss collapsed randomly after 40+ or 200+ epochs. UMambaBot worked but Enc didn't.

@Missyfirst

@wyjzll I have the same problem as you. UMambaBot worked but Enc didn't.

@JunMa11 (Collaborator) commented Apr 1, 2024

Hi all,

Thanks for your valuable feedback. After diving into the implementation, we found that AMP (automatic mixed precision) leads to NaN.

We have provided new Enc trainers without AMP:

https://github.com/bowang-lab/U-Mamba/blob/main/umamba/nnunetv2/training/nnUNetTrainer/nnUNetTrainerUMambaEncNoAMP.py
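
The gist of the NoAMP trainers is to run the forward/backward pass in full fp32, without the autocast context and GradScaler that nnU-Net normally uses on CUDA, so the Mamba/SSM computations never see half-precision values. A simplified sketch of that idea (not a verbatim copy of the linked trainer; the import path and details are assumptions):

```python
# Simplified sketch of a no-AMP training step (illustration only, not the actual
# nnUNetTrainerUMambaEncNoAMP implementation): no autocast, no GradScaler.
import torch
from nnunetv2.training.nnUNetTrainer.nnUNetTrainerUMambaEnc import nnUNetTrainerUMambaEnc


class UMambaEncNoAMPSketch(nnUNetTrainerUMambaEnc):
    def train_step(self, batch: dict) -> dict:
        data = batch['data'].to(self.device, non_blocking=True)
        target = batch['target']
        target = [t.to(self.device, non_blocking=True) for t in target] \
            if isinstance(target, list) else target.to(self.device, non_blocking=True)

        self.optimizer.zero_grad(set_to_none=True)
        # Full fp32 forward/backward: no torch autocast context here.
        output = self.network(data)
        loss = self.loss(output, target)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.network.parameters(), 12)
        self.optimizer.step()
        return {'loss': loss.detach().cpu().numpy()}
```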

JunMa11 closed this as completed on Apr 3, 2024