
NaN when training #8

Closed · siny1998 opened this issue Jan 17, 2024 · 11 comments
Labels: enhancement (New feature or request)

@siny1998

Hi,

Excellent work on the U-Mamba model! I have been attempting to train U-Mamba using my own datasets, which include both 2D and 3D data. However, I've faced an issue where the training process sometimes results in a 'nan' training loss despite trying different datasets. Have you experienced this issue during your training of U-Mamba? I used the nnUNetTrainerUMambaEnc trainer for this process.

Looking forward to your insights.

Best regards,
Leuan

@JunMa11 (Collaborator) commented Jan 17, 2024

Hi @siny1998 ,

Thanks for your interest.

Does nnUNetTrainerUMambaBot have this issue?

@siny1998 (Author)

I haven't used the nnUNetTrainerUMambaBot yet. I'll give it a try.

@JiarunLiu

Hi @JunMa11,

I have a similar issue when training on the Dataset704_Endovis17 dataset with nnUNetTrainerUMambaEnc, but nnUNetTrainerUMambaBot does not have this issue. These are my training logs for the command nnUNetv2_train 704 2d all -tr nnUNetTrainerUMambaEnc:

This is the configuration used by this training:
Configuration name: 2d
 {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 13, 'patch_size': [384, 640], 'median_image_size_in_voxels': [1080.0, 1920.0], 'spacing': [1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [False, False, False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True} 

These are the global plan.json settings:
 {'dataset_name': 'Dataset704_Endovis17', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [999.0, 1.0, 1.0], 'original_median_shape_after_transp': [1, 1080, 1920], 'image_reader_writer': 'NaturalImage2DIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 255.0, 'mean': 100.00444773514101, 'median': 92.0, 'min': 1.0, 'percentile_00_5': 24.0, 'percentile_99_5': 238.0, 'std': 51.584682233895585}, '1': {'max': 249.0, 'mean': 86.51525510193234, 'median': 72.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 233.0, 'std': 52.24999179949625}, '2': {'max': 255.0, 'mean': 93.29896387602795, 'median': 79.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 244.0, 'std': 56.38243456877845}}} 

2024-01-20 11:56:11.330918: unpacking dataset...
2024-01-20 11:56:11.683550: unpacking done...
2024-01-20 11:56:11.684135: do_dummy_2d_data_aug: False
/mnt/disk2/jiarunliu/Documents/mamba/U-Mamba/umamba/nnunetv2/nets/UMambaEnc.py:41: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert C == self.dim
2024-01-20 11:56:27.301467: Unable to plot network architecture:
2024-01-20 11:56:27.302820: module 'torch.onnx' has no attribute '_optimize_trace'
2024-01-20 11:56:27.350771: 
2024-01-20 11:56:27.350951: Epoch 0
2024-01-20 11:56:27.351222: Current learning rate: 0.01
using pin_memory on device 0
using pin_memory on device 0
2024-01-20 12:01:32.749162: train_loss nan
2024-01-20 12:01:32.749484: val_loss nan
2024-01-20 12:01:32.749587: Pseudo dice [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2024-01-20 12:01:32.749661: Epoch time: 305.4 s
2024-01-20 12:01:32.749717: Yayy! New best EMA pseudo Dice: 0.0

So far, this problem only occurs with Dataset704_Endovis17 + nnUNetTrainerUMambaEnc. I'll try other settings/datasets later.
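
As a side note for debugging, a small forward-hook helper (hypothetical, plain PyTorch; not part of U-Mamba or nnU-Net) can flag the first module whose output goes non-finite, which helps tell whether the NaN starts inside the Mamba blocks or elsewhere:

```python
# Hypothetical debugging helper, not part of U-Mamba: register a forward hook on every
# module so training stops at the first layer that produces a NaN/Inf output.
import torch


def add_nan_hooks(model: torch.nn.Module) -> None:
    def make_hook(name: str):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outs:
                if torch.is_tensor(o) and o.is_floating_point() and not torch.isfinite(o).all():
                    raise RuntimeError(f"Non-finite output detected in module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))


# Usage sketch (assumes `trainer` is an initialized nnU-Net trainer instance):
# add_nan_hooks(trainer.network)
```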

@eclipse0922 commented Jan 21, 2024

Other cases:
Dataset701_AbdomenCT with nnUNetTrainerUMambaEnc: I ran nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaEnc and got the same problem as the others.

I also tried nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaBot; it works without NaN.

Other datasets:
702 (3D): Bot and Enc both seem OK.
703 (2D): Bot causes a negative loss from epoch 1 (screenshot attached); Enc seems OK.
704 (2D): Bot and Enc both seem OK.

I didn't train all the way to convergence, so while there was no problem at the beginning, the results could have changed after several more epochs.

The default optimizer for the nnUNetTrainerUMambaEnc and nnUNetTrainerUMambaBot classes is SGD. I switched to Adam (as in nnUNetTrainerAdam) instead; this worked without NaN at the beginning, but NaN appeared again after several epochs.
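
For anyone who wants to reproduce this Adam experiment, here is a minimal custom-trainer sketch. It assumes the nnU-Net v2 configure_optimizers API (returning an optimizer and an LR scheduler) and these U-Mamba import paths; the class name nnUNetTrainerUMambaEncAdam is made up for illustration:

```python
# Hypothetical trainer variant (illustration only): swap the default SGD for Adam.
# Assumes the nnU-Net v2 trainer API and these U-Mamba import paths.
import torch
from nnunetv2.training.nnUNetTrainer.nnUNetTrainerUMambaEnc import nnUNetTrainerUMambaEnc
from nnunetv2.training.lr_scheduler.polylr import PolyLRScheduler


class nnUNetTrainerUMambaEncAdam(nnUNetTrainerUMambaEnc):
    def configure_optimizers(self):
        # Adam instead of SGD with Nesterov momentum; keep nnU-Net's polynomial LR decay.
        optimizer = torch.optim.Adam(self.network.parameters(),
                                     lr=self.initial_lr,
                                     weight_decay=self.weight_decay)
        lr_scheduler = PolyLRScheduler(optimizer, self.initial_lr, self.num_epochs)
        return optimizer, lr_scheduler
```

If the file is placed somewhere under nnunetv2/training/nnUNetTrainer/, it should be selectable with -tr nnUNetTrainerUMambaEncAdam, the same way the other trainers in this thread are invoked.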

@eclipse0922 commented Jan 22, 2024

As with any model training, the combination of optimizer, trainer, and learning rate seems to matter.
I tried SGD with a learning rate of 1e-3, which is much more stable than the default 1e-2.
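
A minimal sketch of such a lower-learning-rate variant, assuming the nnU-Net v2 trainer constructor signature and the U-Mamba import path (the class name nnUNetTrainerUMambaEncLowLR is made up here):

```python
# Hypothetical trainer variant (illustration only): keep SGD but lower the initial
# learning rate from the default 1e-2 to 1e-3, as suggested above.
import torch
from nnunetv2.training.nnUNetTrainer.nnUNetTrainerUMambaEnc import nnUNetTrainerUMambaEnc


class nnUNetTrainerUMambaEncLowLR(nnUNetTrainerUMambaEnc):
    def __init__(self, plans: dict, configuration: str, fold: int, dataset_json: dict,
                 unpack_dataset: bool = True, device: torch.device = torch.device('cuda')):
        super().__init__(plans, configuration, fold, dataset_json, unpack_dataset, device)
        self.initial_lr = 1e-3  # default is 1e-2
```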

@JunMa11 (Collaborator) commented Jan 29, 2024

Hi all,

Thanks for your valuable feedback!
We are testing the model on more datasets. You're welcome to subscribe to updates on this issue.

JunMa11 added the enhancement (New feature or request) label on Jan 29, 2024
@innocence0206

I got the NaN loss too with nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaEnc. I tried re-running nnUNetv2_plan_and_preprocess -d 701 --verify_dataset_integrity, but it didn't help.

@gumayusi3

I also have this problem when training on an OCTA dataset.

@wyjzll commented Mar 18, 2024

Excellent work! I also faced the same problem on several datasets. Dice, train loss, and val loss collapsed randomly after 40+ or 200+ epochs. UMambaBot worked but Enc didn't.

@Missyfirst

@wyjzll I have the same problem as you. UMambaBot worked but Enc didn't.

@JunMa11 (Collaborator) commented Apr 1, 2024

Hi all,

Thanks for your valuable feedback. After diving into the implementation, we found that AMP (automatic mixed precision) leads to NaN.

We have provided new Enc trainers without AMP:

https://github.com/bowang-lab/U-Mamba/blob/main/umamba/nnunetv2/training/nnUNetTrainer/nnUNetTrainerUMambaEncNoAMP.py
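
The gist of the NoAMP trainers is to run the forward/backward pass in full fp32, without the autocast context and GradScaler that nnU-Net normally uses on CUDA, so the Mamba/SSM computations never see half-precision values. A simplified sketch of that idea (not a verbatim copy of the linked trainer; the import path and details are assumptions):

```python
# Simplified sketch of a no-AMP training step (illustration only, not the actual
# nnUNetTrainerUMambaEncNoAMP implementation): no autocast, no GradScaler.
import torch
from nnunetv2.training.nnUNetTrainer.nnUNetTrainerUMambaEnc import nnUNetTrainerUMambaEnc


class UMambaEncNoAMPSketch(nnUNetTrainerUMambaEnc):
    def train_step(self, batch: dict) -> dict:
        data = batch['data'].to(self.device, non_blocking=True)
        target = batch['target']
        target = [t.to(self.device, non_blocking=True) for t in target] \
            if isinstance(target, list) else target.to(self.device, non_blocking=True)

        self.optimizer.zero_grad(set_to_none=True)
        # Full fp32 forward/backward: no torch autocast context here.
        output = self.network(data)
        loss = self.loss(output, target)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.network.parameters(), 12)
        self.optimizer.step()
        return {'loss': loss.detach().cpu().numpy()}
```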

JunMa11 closed this as completed on Apr 3, 2024