train all nan? #20

l1uw3n · 2020-11-18T11:20:26Z

I use: python train.py --batch-size 16 --img 512 512 --data person.yaml --cfg yolov4-p5.yaml --weights yolov4-p5.pt --sync-bn --device 1 --name yolov4-p5-tune --hyp 'data/hyp.finetune.yaml' --epochs 450 --resume

result:

WongKinYiu · 2020-11-18T12:49:56Z

SyncBN does not work with a single GPU.

could you show train_batch0.jpg in your runs/expxx folder?
could you show the training snapshot for training from scratch?
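On the SyncBN point, a minimal sketch of a guard that only enables synchronized batch norm when more than one process and more than one GPU take part in training. The helper name `should_use_sync_bn` is hypothetical and not part of train.py; its arguments are assumed to mirror the `--sync-bn` and `--device` flags and the world size of the run.

```python
def should_use_sync_bn(requested: bool, device: str, world_size: int) -> bool:
    """Return True only when SyncBN was requested and training is multi-GPU.

    Hypothetical helper for illustration; not part of train.py.
    """
    # Count the CUDA devices named in a --device string such as "1" or "0,1,2,3".
    n_devices = len([d for d in device.split(",") if d.strip().isdigit()])
    # SyncBN synchronizes batch statistics across processes, so it only makes
    # sense with more than one process and more than one device.
    return requested and world_size > 1 and n_devices > 1


print(should_use_sync_bn(True, "1", 1))        # False: single GPU, as in the command above
print(should_use_sync_bn(True, "0,1,2,3", 4))  # True: four GPUs under a distributed launch
```

With a check like this, passing `--sync-bn` together with `--device 1` would simply fall back to ordinary batch norm instead of failing.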

l1uw3n · 2020-11-18T14:20:03Z

1. train_batch0.jpg displays correctly.
2. The training log is:
python train.py --batch-size 16 --img 512 512 --data xx.yaml --cfg yolov4-p5.yaml --weights yolov4-p5.pt --device 1 --name yolov4-p5-tune --hyp 'data/hyp.finetune.yaml' --epochs 450 --resume
Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='./models/yolov4-p5.yaml', data='./data/xx.yaml', device='1', epochs=450, evolve=False, global_rank=-1, hyp='data/hyp.finetune.yaml', img_size=[512, 512], local_rank=-1, logdir='runs/', multi_scale=False, name='yolov4-p5-tune', noautoanchor=False, nosave=False, notest=False, rect=False, resume='get_last', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov4-p5.pt', world_size=1)
Start Tensorboard with "tensorboard --logdir runs/", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.5, 'scale': 0.8, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mixup': 0.2}
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'models.yolo.Model' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.container.Sequential' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.conv.Conv2d' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.batchnorm.SyncBatchNorm' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.container.ModuleList' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.pooling.MaxPool2d' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.upsampling.Upsample' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)

                 from  n    params  module                                  arguments
  0                -1  1       928  models.common.Conv                      [3, 32, 3, 1]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     19904  models.common.BottleneckCSP             [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  1   2614016  models.common.BottleneckCSP             [256, 256, 15]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1  10438144  models.common.BottleneckCSP             [512, 512, 15]
  9                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]
 10                -1  1  20728832  models.common.BottleneckCSP             [1024, 1024, 7]
 11                -1  1   7610368  models.common.SPPCSP                    [1024, 512, 1]
 12                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14                 8  1    131584  models.common.Conv                      [512, 256, 1, 1]
 15          [-1, -2]  1         0  models.common.Concat                    [1]
 16                -1  1   2298880  models.common.BottleneckCSP2            [512, 256, 3]
 17                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 18                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 19                 6  1     33024  models.common.Conv                      [256, 128, 1, 1]
 20          [-1, -2]  1         0  models.common.Concat                    [1]
 21                -1  1    576000  models.common.BottleneckCSP2            [256, 128, 3]
 22                -1  1    295424  models.common.Conv                      [128, 256, 3, 1]
 23                -2  1    295424  models.common.Conv                      [128, 256, 3, 2]
 24          [-1, 16]  1         0  models.common.Concat                    [1]
 25                -1  1   2298880  models.common.BottleneckCSP2            [512, 256, 3]
 26                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]
 27                -2  1   1180672  models.common.Conv                      [256, 512, 3, 2]
 28          [-1, 11]  1         0  models.common.Concat                    [1]
 29                -1  1   9185280  models.common.BottleneckCSP2            [1024, 512, 3]
 30                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 1]
 31      [22, 26, 30]  1     78980  models.yolo.Detect                      [6, [[13, 17, 31, 25, 24, 51, 61, 45], [48, 102, 119, 96, 97, 189, 217, 184], [171, 384, 324, 451, 616, 618, 800, 800]], [256, 512, 1024]]
Model Summary: 476 layers, 7.03027e+07 parameters, 7.03027e+07 gradients

Transferred 935/943 items from yolov4-p5.pt
Optimizer groups: 158 .bias, 163 conv.weight, 155 other
Scanning labels ***train.cache (8989 found, 0 missing, 0 empty, 0 duplicate
Scanning labels ***test_A.cache (499 found, 0 missing, 1 empty, 0 duplicate

Analyzing anchors... anchors/target = 4.77, Best Possible Recall (BPR) = 0.9877
Image sizes 512 train, 512 test
Using 8 dataloader workers
Starting training for 450 epochs...

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
 0/449     9.97G       nan       nan       nan       nan        63       512:   0%| | 1/562 [00:07<1:03:3
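The progress row above already shows every loss column as nan on the first iteration. As a rough sketch, a row like this can be checked automatically so a run is stopped early instead of continuing for 450 epochs. The function name `nonfinite_losses` is made up for illustration, and the column order is assumed to match the header row printed by train.py.

```python
import math

# Column names copied from the header row of the training log.
COLUMNS = ["Epoch", "gpu_mem", "GIoU", "obj", "cls", "total", "targets", "img_size"]
LOSS_COLUMNS = ("GIoU", "obj", "cls", "total")


def nonfinite_losses(progress_line: str) -> list:
    """Return the names of loss columns whose value is NaN or infinite."""
    # Everything after the first colon is the tqdm progress bar; drop it.
    fields = progress_line.split(":", 1)[0].split()
    row = dict(zip(COLUMNS, fields))
    return [name for name in LOSS_COLUMNS if not math.isfinite(float(row[name]))]


line = " 0/449     9.97G       nan       nan       nan       nan        63       512:   0%| | 1/562"
print(nonfinite_losses(line))  # ['GIoU', 'obj', 'cls', 'total']
```

An empty list means all four losses are finite for that row.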

WongKinYiu · 2020-11-18T14:38:33Z

please provide the log file for training from scratch, not for fine-tuning.

l1uw3n · 2020-11-19T02:47:01Z

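I trained from scratch with the following command and got the same result:
python train.py --batch-size 1 --img 896 896 --data ship.yaml --cfg yolov4-p5.yaml --weights '' --device 1 --name yolov4-p5-tune --epochs 450
The training data and label format have already been used to train yolov5, so the data should not be the problem. Thank you for your patient help; I don't know what is going wrong either.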
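Even when a dataset has trained elsewhere, the labels can be ruled in or out quickly by checking each label line directly. The sketch below assumes the usual YOLO text label layout, `class x_center y_center width height`, with coordinates normalized to [0, 1]. The function name `check_label_line` and the sample lines are made up for illustration, and whether bad labels are the cause of the nan losses here is not established in this thread.

```python
import math


def check_label_line(line: str, num_classes: int) -> list:
    """Return a list of problems found in one YOLO-format label line."""
    parts = line.split()
    if len(parts) != 5:
        return ["expected 5 fields, got %d" % len(parts)]
    try:
        cls = int(parts[0])
        box = [float(v) for v in parts[1:]]
    except ValueError:
        return ["non-numeric field"]
    problems = []
    if not 0 <= cls < num_classes:
        problems.append("class index %d outside [0, %d)" % (cls, num_classes))
    if not all(math.isfinite(v) for v in box):
        problems.append("non-finite coordinate")
    elif not all(0.0 <= v <= 1.0 for v in box):
        problems.append("coordinate outside [0, 1]")
    elif box[2] == 0.0 or box[3] == 0.0:
        problems.append("zero width or height")
    return problems


# Made-up sample lines; 6 is the class count shown in the Detect layer of the
# fine-tuning log, the class count for ship.yaml is not given in the thread.
print(check_label_line("0 0.5 0.5 0.2 0.3", 6))  # []
print(check_label_line("6 0.5 0.5 0.2 0.3", 6))  # class index out of range
print(check_label_line("1 0.5 0.5 nan 0.3", 6))  # non-finite coordinate
```

Running this over every line of every label file and printing only the non-empty results gives a short list of files worth opening by hand.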

WongKinYiu · 2020-11-19T03:35:16Z

Could you provide the .data and .yaml files you used on both sides?

l1uw3n · 2020-11-19T08:08:19Z

Hello, the .yaml I used on both sides is the same, as shown below:

jorgegaticav · 2021-08-21T17:21:30Z

@l1uw3n I'm having the same problem. Were you able to fix it?

jorgegaticav · 2021-08-22T18:40:45Z

I'm running it like this:
python -m torch.distributed.launch --nproc_per_node 4 train.py --batch-size 8 --img 2048 2048 --data mydata.yaml --cfg yolov4-p7.yaml --weights '' --device 0,1,2,3 --name yolov4-p7

wuzuiyuzui mentioned this issue Nov 18, 2020

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR #21

Open
