Loss is NaN when training on a single GPU. What is the cause? #5

Open
hezheyuan opened this issue Dec 14, 2022 · 10 comments

Comments

@hezheyuan

No description provided.

@Zzh-tju
Owner

Zzh-tju commented Dec 14, 2022

Please post your training log.

@hezheyuan
Author

2022-12-15 15:19:25,166 - mmrotate - INFO - Environment info:

sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18) [GCC 10.3.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-11.6
NVCC: Cuda compilation tools, release 11.6, V11.6.55
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.6
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.3.2 (built against CUDA 11.5)
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.13.1
OpenCV: 4.6.0
MMCV: 1.6.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.6
MMRotate: 0.1.0+5fe611f

2022-12-15 15:19:25,678 - mmrotate - INFO - Distributed training: False
2022-12-15 15:19:26,160 - mmrotate - INFO - Config:
dataset_type = 'DOTADataset'
data_root = '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='RResize', img_scale=(1024, 1024)),
dict(
type='RRandomFlip',
flip_ratio=[0.25, 0.25, 0.25],
direction=['horizontal', 'vertical', 'diagonal']),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1024, 1024),
flip=False,
transforms=[
dict(type='RResize'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=1,
workers_per_gpu=2,
train=dict(
type='DOTADataset',
ann_file=
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/',
img_prefix=
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='RResize', img_scale=(1024, 1024)),
dict(
type='RRandomFlip',
flip_ratio=[0.25, 0.25, 0.25],
direction=['horizontal', 'vertical', 'diagonal']),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
],
version='oc'),
val=dict(
type='DOTADataset',
ann_file=
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/',
img_prefix=
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1024, 1024),
flip=False,
transforms=[
dict(type='RResize'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img'])
])
],
version='oc'),
test=dict(
type='DOTADataset',
ann_file=
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/',
img_prefix=
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1024, 1024),
flip=False,
transforms=[
dict(type='RResize'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img'])
])
],
version='oc'))
evaluation = dict(interval=12, metric='mAP')
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=0.3333333333333333,
step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=12)
log_config = dict(
interval=50,
hooks=[dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
angle_version = 'oc'
model = dict(
type='KnowledgeDistillationRotatedSingleStageDetector',
backbone=dict(
type='ResNet',
depth=18,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
zero_init_residual=False,
norm_cfg=dict(type='BN', requires_grad=True),
norm_eval=True,
style='pytorch',
init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet18')),
neck=dict(
type='FPN',
in_channels=[64, 128, 256, 512],
out_channels=256,
start_level=1,
add_extra_convs='on_input',
num_outs=5),
bbox_head=dict(
type='LDRotatedRetinaHead',
num_classes=15,
in_channels=256,
stacked_convs=4,
feat_channels=256,
assign_by_circumhbbox='oc',
anchor_generator=dict(
type='RotatedAnchorGenerator',
octave_base_scale=4,
scales_per_octave=3,
ratios=[1.0, 0.5, 2.0],
strides=[8, 16, 32, 64, 128]),
bbox_coder=dict(
type='DeltaXYWHAOBBoxCoder',
angle_range='oc',
norm_factor=None,
edge_swap=False,
proj_xy=False,
target_means=(0.0, 0.0, 0.0, 0.0, 0.0),
target_stds=(1.0, 1.0, 1.0, 1.0, 1.0)),
loss_cls=dict(
type='FocalLoss',
use_sigmoid=True,
gamma=2.0,
alpha=0.25,
loss_weight=1.0),
loss_bbox=dict(type='GDLoss', loss_weight=5.0, loss_type='gwd'),
reg_max=8,
reg_decoded_bbox=True,
loss_ld=dict(type='GDLoss', loss_type='gwd', loss_weight=5.0),
loss_kd=dict(
type='KnowledgeDistillationKLDivLoss', loss_weight=30, T=5),
loss_im=dict(type='IMLoss', loss_weight=2.0),
imitation_method='finegrained'),
train_cfg=dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.5,
neg_iou_thr=0.4,
min_pos_iou=0,
ignore_iof_thr=-1,
iou_calculator=dict(type='RBboxOverlaps2D')),
allowed_border=-1,
pos_weight=-1,
debug=False),
test_cfg=dict(
nms_pre=2000,
min_bbox_size=0,
score_thr=0.05,
nms=dict(iou_thr=0.1),
max_per_img=2000),
teacher_config=
'./configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc.py',
teacher_ckpt=
'/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth',
output_feature=True)
teacher_ckpt = '/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth'
work_dir = './work_dirs/rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc'
auto_resume = False
gpu_ids = range(0, 1)

2022-12-15 15:19:26,208 - mmrotate - INFO - Set random seed to 942796273, deterministic: False
2022-12-15 15:19:32,558 - mmrotate - INFO - initialize ResNet with init_cfg {'type': 'Pretrained', 'checkpoint': 'torchvision://resnet18'}
2022-12-15 15:19:32,622 - mmrotate - INFO - initialize FPN with init_cfg {'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'}
Name of parameter - Initialization information

@hezheyuan
Author

{"env_info": "sys.platform: linux\nPython: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18) [GCC 10.3.0]\nCUDA available: True\nGPU 0: NVIDIA GeForce RTX 3090\nCUDA_HOME: /usr/local/cuda-11.6\nNVCC: Cuda compilation tools, release 11.6, V11.6.55\nGCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0\nPyTorch: 1.12.1\nPyTorch compiling details: PyTorch built with:\n - GCC 9.3\n - C++ Version: 201402\n - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.6\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.3.2 (built against CUDA 11.5)\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls 
-Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n\nTorchVision: 0.13.1\nOpenCV: 4.6.0\nMMCV: 1.6.0\nMMCV Compiler: GCC 9.3\nMMCV CUDA Compiler: 11.6\nMMRotate: 0.1.0+5fe611f", "config": "dataset_type = 'DOTADataset'\ndata_root = '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/'\nimg_norm_cfg = dict(\n mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)\ntrain_pipeline = [\n dict(type='LoadImageFromFile'),\n dict(type='LoadAnnotations', with_bbox=True),\n dict(type='RResize', img_scale=(1024, 1024)),\n dict(\n type='RRandomFlip',\n flip_ratio=[0.25, 0.25, 0.25],\n direction=['horizontal', 'vertical', 'diagonal']),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])\n]\ntest_pipeline = [\n dict(type='LoadImageFromFile'),\n dict(\n type='MultiScaleFlipAug',\n img_scale=(1024, 1024),\n flip=False,\n transforms=[\n dict(type='RResize'),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img'])\n ])\n]\ndata = dict(\n samples_per_gpu=1,\n workers_per_gpu=2,\n train=dict(\n type='DOTADataset',\n ann_file=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/',\n img_prefix=\n 
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/',\n pipeline=[\n dict(type='LoadImageFromFile'),\n dict(type='LoadAnnotations', with_bbox=True),\n dict(type='RResize', img_scale=(1024, 1024)),\n dict(\n type='RRandomFlip',\n flip_ratio=[0.25, 0.25, 0.25],\n direction=['horizontal', 'vertical', 'diagonal']),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])\n ],\n version='oc'),\n val=dict(\n type='DOTADataset',\n ann_file=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/',\n img_prefix=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/',\n pipeline=[\n dict(type='LoadImageFromFile'),\n dict(\n type='MultiScaleFlipAug',\n img_scale=(1024, 1024),\n flip=False,\n transforms=[\n dict(type='RResize'),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img'])\n ])\n ],\n version='oc'),\n test=dict(\n type='DOTADataset',\n ann_file=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/',\n img_prefix=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/',\n pipeline=[\n dict(type='LoadImageFromFile'),\n dict(\n type='MultiScaleFlipAug',\n img_scale=(1024, 1024),\n flip=False,\n transforms=[\n dict(type='RResize'),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img'])\n ])\n ],\n version='oc'))\nevaluation = dict(interval=12, metric='mAP')\noptimizer = dict(type='SGD', lr=0.0025, momentum=0.9, 
weight_decay=0.0001)\noptimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))\nlr_config = dict(\n policy='step',\n warmup='linear',\n warmup_iters=500,\n warmup_ratio=0.3333333333333333,\n step=[8, 11])\nrunner = dict(type='EpochBasedRunner', max_epochs=12)\ncheckpoint_config = dict(interval=12)\nlog_config = dict(\n interval=50,\n hooks=[dict(type='TextLoggerHook'),\n dict(type='TensorboardLoggerHook')])\ndist_params = dict(backend='nccl')\nlog_level = 'INFO'\nload_from = None\nresume_from = None\nworkflow = [('train', 1)]\nopencv_num_threads = 0\nmp_start_method = 'fork'\nangle_version = 'oc'\nmodel = dict(\n type='KnowledgeDistillationRotatedSingleStageDetector',\n backbone=dict(\n type='ResNet',\n depth=18,\n num_stages=4,\n out_indices=(0, 1, 2, 3),\n frozen_stages=1,\n zero_init_residual=False,\n norm_cfg=dict(type='BN', requires_grad=True),\n norm_eval=True,\n style='pytorch',\n init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet18')),\n neck=dict(\n type='FPN',\n in_channels=[64, 128, 256, 512],\n out_channels=256,\n start_level=1,\n add_extra_convs='on_input',\n num_outs=5),\n bbox_head=dict(\n type='LDRotatedRetinaHead',\n num_classes=15,\n in_channels=256,\n stacked_convs=4,\n feat_channels=256,\n assign_by_circumhbbox='oc',\n anchor_generator=dict(\n type='RotatedAnchorGenerator',\n octave_base_scale=4,\n scales_per_octave=3,\n ratios=[1.0, 0.5, 2.0],\n strides=[8, 16, 32, 64, 128]),\n bbox_coder=dict(\n type='DeltaXYWHAOBBoxCoder',\n angle_range='oc',\n norm_factor=None,\n edge_swap=False,\n proj_xy=False,\n target_means=(0.0, 0.0, 0.0, 0.0, 0.0),\n target_stds=(1.0, 1.0, 1.0, 1.0, 1.0)),\n loss_cls=dict(\n type='FocalLoss',\n use_sigmoid=True,\n gamma=2.0,\n alpha=0.25,\n loss_weight=1.0),\n loss_bbox=dict(type='GDLoss', loss_weight=5.0, loss_type='gwd'),\n reg_max=8,\n reg_decoded_bbox=True,\n loss_ld=dict(type='GDLoss', loss_type='gwd', loss_weight=5.0),\n loss_kd=dict(\n type='KnowledgeDistillationKLDivLoss', 
loss_weight=30, T=5),\n loss_im=dict(type='IMLoss', loss_weight=2.0),\n imitation_method='finegrained'),\n train_cfg=dict(\n assigner=dict(\n type='MaxIoUAssigner',\n pos_iou_thr=0.5,\n neg_iou_thr=0.4,\n min_pos_iou=0,\n ignore_iof_thr=-1,\n iou_calculator=dict(type='RBboxOverlaps2D')),\n allowed_border=-1,\n pos_weight=-1,\n debug=False),\n test_cfg=dict(\n nms_pre=2000,\n min_bbox_size=0,\n score_thr=0.05,\n nms=dict(iou_thr=0.1),\n max_per_img=2000),\n teacher_config=\n './configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc.py',\n teacher_ckpt=\n '/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth',\n output_feature=True)\nteacher_ckpt = '/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth'\nwork_dir = './work_dirs/rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc'\nauto_resume = False\ngpu_ids = range(0, 1)\n", "seed": 884529645, "exp_name": "rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc.py"}
{"mode": "train", "epoch": 1, "iter": 50, "lr": 0.001, "memory": 4366, "data_time": 0.05191, "loss_cls": NaN, "loss_bbox": NaN, "loss_ld": NaN, "loss_kd": NaN, "loss_im": NaN, "loss": NaN, "grad_norm": NaN, "time": 0.2035}
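For readers triaging a similar log, here is a minimal sketch, using only the Python standard library, that lists which fields of one JSON log record are NaN. The function name `nan_fields` is hypothetical, and the sample record is the line posted above. It relies on `json.loads` accepting the non-standard `NaN` literal, which is its default behavior.

```python
import json
import math


def nan_fields(log_line):
    """Return the sorted names of all fields whose value is NaN."""
    record = json.loads(log_line)  # the default parser accepts the NaN literal
    return sorted(
        key for key, value in record.items()
        if isinstance(value, float) and math.isnan(value)
    )


# The training record posted in this thread.
line = (
    '{"mode": "train", "epoch": 1, "iter": 50, "lr": 0.001, "memory": 4366, '
    '"data_time": 0.05191, "loss_cls": NaN, "loss_bbox": NaN, "loss_ld": NaN, '
    '"loss_kd": NaN, "loss_im": NaN, "loss": NaN, "grad_norm": NaN, '
    '"time": 0.2035}'
)

print(nan_fields(line))
```

On this record every loss term and `grad_norm` are reported, so the log alone does not show which term went bad first.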

@hezheyuan
Author

> Please post your training log.

Could it be because I am training on a single GPU? I still have not found the cause.

@Zzh-tju
Owner

Zzh-tju commented Dec 16, 2022

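This is most likely a dataset problem, either in the preprocessing steps or in the labels. Please follow MMRotate's instructions for downloading and preprocessing the dataset.

As a first step, you can turn off the distillation losses: in the config file, set the loss weights for KD, LD, and feature imitation to 0 and check whether NaN still appears.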
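A minimal sketch of that suggestion, assuming the loss keys from the config posted above (`loss_ld`, `loss_kd`, `loss_im`). The helper `disable_distillation` is hypothetical and not part of the repository; in practice you would edit the `loss_weight` values directly in the config file.

```python
import copy

# Excerpt of the bbox_head loss settings from the config posted in this thread.
bbox_head = dict(
    loss_bbox=dict(type='GDLoss', loss_weight=5.0, loss_type='gwd'),
    loss_ld=dict(type='GDLoss', loss_type='gwd', loss_weight=5.0),
    loss_kd=dict(type='KnowledgeDistillationKLDivLoss', loss_weight=30, T=5),
    loss_im=dict(type='IMLoss', loss_weight=2.0),
)


def disable_distillation(head_cfg, keys=('loss_ld', 'loss_kd', 'loss_im')):
    """Return a copy of head_cfg with the distillation loss weights set to 0."""
    cfg = copy.deepcopy(head_cfg)  # leave the original config untouched
    for key in keys:
        cfg[key]['loss_weight'] = 0.0
    return cfg


debug_head = disable_distillation(bbox_head)
for name, loss_cfg in debug_head.items():
    print(name, loss_cfg['loss_weight'])
```

The detection loss `loss_bbox` keeps its weight, so a run with this head checks whether the student alone trains without NaN.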

@hezheyuan
Author

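> This is most likely a dataset problem, either in the preprocessing steps or in the labels. Please follow MMRotate's instructions for downloading and preprocessing the dataset.
>
> As a first step, you can turn off the distillation losses: in the config file, set the loss weights for KD, LD, and feature imitation to 0 and check whether NaN still appears.

Thanks! The dataset should not be the problem: I have used it for a long time, and training on it in MMRotate works fine. I will try turning off the feature distillation losses and see what happens. A feature imitation method I wrote myself also runs into NaN loss.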

@hezheyuan
Author

After turning off the distillation losses, NaN no longer appears during training. Why is that?

@Zzh-tju
Owner

Zzh-tju commented Dec 19, 2022

Which teacher model are you using, and which config file?

@hezheyuan
Author

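> Which teacher model are you using, and which config file?

1. The teacher model is rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth, downloaded from the official MMRotate release.
2. The config file is ./configs/ld/rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc.py
3. The teacher's config file is ./configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc.py

I have now found where the NaN comes from: the network's bounding box predictions are NaN, so the loss cannot be computed. Adjusting the learning rate or the warmup ratio does not help.
However, if I train configs/gwd/rotated_retinanet_distribution_hbb_gwd_r50_fpn_2x_dota_oc.py, there is no error.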

@Zzh-tju
Owner

Zzh-tju commented Dec 20, 2022

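You definitely cannot use the official MMRotate weights as the teacher. Its box representation is 4 numbers, not probability distributions made up of 4n numbers.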
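To illustrate the difference between the two representations, here is a minimal pure-Python sketch. It is not the repository's code. It assumes n = `reg_max` + 1 bins per box side (with `reg_max=8` taken from the posted config) and decodes each side as the expected bin index under a softmax; the names `split_sides` and `decode` are hypothetical.

```python
import math

REG_MAX = 8           # reg_max=8 in the posted student config
N_BINS = REG_MAX + 1  # assumption: n = reg_max + 1 bins per box side


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(x / temperature for x in logits)
    exps = [math.exp(x / temperature - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_sides(reg_output, n_bins=N_BINS):
    """Split a flat regression output into 4 per-side logit vectors."""
    if len(reg_output) != 4 * n_bins:
        raise ValueError(
            f"expected {4 * n_bins} values (4 sides x {n_bins} bins), "
            f"got {len(reg_output)}")
    return [reg_output[i * n_bins:(i + 1) * n_bins] for i in range(4)]


def decode(reg_output, n_bins=N_BINS):
    """Decode each side as the expected bin index under its distribution."""
    return [sum(i * p for i, p in enumerate(softmax(side)))
            for side in split_sides(reg_output, n_bins)]


student_like = [0.0] * (4 * N_BINS)  # 4n logits: one distribution per side
teacher_like = [1.5, 2.0, 3.0, 0.5]  # 4 plain numbers: no distribution

print(decode(student_like))
try:
    decode(teacher_like)
except ValueError as err:
    print("incompatible teacher output:", err)
```

A teacher that emits only 4 numbers gives the distillation losses no distribution to compare against, which is why a teacher trained with the distribution head is needed here.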
