NaN when training #8
Hi @siny1998, thanks for your interest. Does the …
I haven't used the nnUNetTrainerUMambaBot yet. I'll give it a try.
Hi @JunMa11, I have a similar issue when training on Dataset704_Endovis17. This is the configuration used by this training:
Configuration name: 2d
{'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 13, 'patch_size': [384, 640], 'median_image_size_in_voxels': [1080.0, 1920.0], 'spacing': [1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [False, False, False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True}
These are the global plan.json settings:
{'dataset_name': 'Dataset704_Endovis17', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [999.0, 1.0, 1.0], 'original_median_shape_after_transp': [1, 1080, 1920], 'image_reader_writer': 'NaturalImage2DIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 255.0, 'mean': 100.00444773514101, 'median': 92.0, 'min': 1.0, 'percentile_00_5': 24.0, 'percentile_99_5': 238.0, 'std': 51.584682233895585}, '1': {'max': 249.0, 'mean': 86.51525510193234, 'median': 72.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 233.0, 'std': 52.24999179949625}, '2': {'max': 255.0, 'mean': 93.29896387602795, 'median': 79.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 244.0, 'std': 56.38243456877845}}}
2024-01-20 11:56:11.330918: unpacking dataset...
2024-01-20 11:56:11.683550: unpacking done...
2024-01-20 11:56:11.684135: do_dummy_2d_data_aug: False
/mnt/disk2/jiarunliu/Documents/mamba/U-Mamba/umamba/nnunetv2/nets/UMambaEnc.py:41: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert C == self.dim
2024-01-20 11:56:27.301467: Unable to plot network architecture:
2024-01-20 11:56:27.302820: module 'torch.onnx' has no attribute '_optimize_trace'
2024-01-20 11:56:27.350771:
2024-01-20 11:56:27.350951: Epoch 0
2024-01-20 11:56:27.351222: Current learning rate: 0.01
using pin_memory on device 0
using pin_memory on device 0
2024-01-20 12:01:32.749162: train_loss nan
2024-01-20 12:01:32.749484: val_loss nan
2024-01-20 12:01:32.749587: Pseudo dice [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2024-01-20 12:01:32.749661: Epoch time: 305.4 s
2024-01-20 12:01:32.749717: Yayy! New best EMA pseudo Dice: 0.0
So far, this problem only occurs with nnUNetTrainerUMambaEnc.
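For anyone hitting the same wall, a generic way to catch the step where the loss first goes non-finite, instead of discovering it from a whole epoch of `train_loss nan`, is sketched below. This is plain PyTorch, not nnU-Net code; `train_step`, the batch keys, and the surrounding objects are illustrative placeholders.

```python
import torch

# Slow, but makes backward() raise at the exact op that produced the NaN/Inf.
torch.autograd.set_detect_anomaly(True)

def train_step(model, batch, criterion, optimizer):
    optimizer.zero_grad(set_to_none=True)
    output = model(batch["data"])
    loss = criterion(output, batch["target"])
    # Fail fast on the first bad step rather than logging NaN for an epoch.
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss encountered: {loss.item()}")
    loss.backward()
    optimizer.step()
    return loss.detach()
```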
As with any model training, the combination of optimiser, trainer, and learning rate seems to matter.
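To test the learning-rate angle concretely: nnU-Net v2 discovers custom trainers by class name, so a one-attribute subclass is enough. A hedged sketch follows — the `initial_lr` attribute and the constructor signature match `nnUNetTrainer` in nnU-Net v2, but the class name and the 1e-3 value are my own choices, not anything from the U-Mamba repo.

```python
import torch
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer

class nnUNetTrainerLowLR(nnUNetTrainer):
    """Identical to the default trainer except for a 10x smaller initial LR."""

    def __init__(self, plans: dict, configuration: str, fold: int,
                 dataset_json: dict, unpack_dataset: bool = True,
                 device: torch.device = torch.device("cuda")):
        super().__init__(plans, configuration, fold, dataset_json,
                         unpack_dataset, device)
        self.initial_lr = 1e-3  # nnU-Net's default is 1e-2
```

If the file lives anywhere under nnunetv2/training/nnUNetTrainer/, it should be picked up by passing -tr nnUNetTrainerLowLR to nnUNetv2_train.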
Hi all, Thanks for your valuable feedback!
I got the 'NaN' loss too with nnUNetTrainerUMambaEnc.
I also have this problem when training on an OCTA dataset.
Excellent work! I also faced the same problem on several datasets. Dice, train_loss, and val_loss collapsed after 40+ or 200+ epochs at random. UMambaBot worked but UMambaEnc didn't.
@wyjzll I have the same problem as you: UMambaBot worked but UMambaEnc didn't.
Hi all, Thanks for your valuable feedback. After diving into the implementation, we found that AMP (automatic mixed precision) leads to NaN. We have provided new Enc trainers without AMP.
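For readers who want to see what "without AMP" means in plain PyTorch terms, the generic pattern is below. This is a sketch, not the repository's actual trainer code; `model`, `criterion`, `optimizer`, and `dataloader` are assumed to exist.

```python
import torch

use_amp = False  # fp16 autocast was the culprit here; fp32 avoids the overflow-to-NaN

# With enabled=False, GradScaler's scale/step/update become no-ops,
# so the same loop runs entirely in full precision.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", enabled=use_amp):
        output = model(batch["data"])
        loss = criterion(output, batch["target"])
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

The trade-off is higher memory use and slower steps, but per the fix above, full precision is what restores stable training for the Enc trainers.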
Hi,
Excellent work on the U-Mamba model! I have been trying to train U-Mamba on my own datasets, both 2D and 3D. However, I keep running into an issue where the training loss sometimes becomes NaN, regardless of which dataset I use. Have you experienced this during your training of U-Mamba? I used the nnUNetTrainerUMambaEnc trainer for these runs.
Looking forward to your insights.
Best regards,
Leuan