train all nan? #20

Open
l1uw3n opened this issue Nov 18, 2020 · 8 comments
Comments

@l1uw3n

l1uw3n commented Nov 18, 2020

I run: python train.py --batch-size 16 --img 512 512 --data person.yaml --cfg yolov4-p5.yaml --weights yolov4-p5.pt --sync-bn --device 1 --name yolov4-p5-tune --hyp 'data/hyp.finetune.yaml' --epochs 450 --resume

result:
image

@WongKinYiu
Owner

Sync BN cannot work with a single GPU.
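
One way to guard against this, as a minimal sketch: only convert to SyncBatchNorm when more than one process is participating. The helper name `wants_sync_bn` is hypothetical; `world_size` and `rank` mirror the `world_size` and `global_rank` fields printed in the training Namespace, and the commented call to `torch.nn.SyncBatchNorm.convert_sync_batchnorm` is the standard PyTorch conversion, not necessarily what train.py does.

```python
def wants_sync_bn(sync_bn_flag: bool, world_size: int, rank: int) -> bool:
    """Return True only when SyncBatchNorm has other processes to sync with.

    Hypothetical helper: rank == -1 is assumed to mean "not launched in
    distributed mode", as in a single-GPU run with global_rank=-1.
    """
    return sync_bn_flag and world_size > 1 and rank != -1


# Assumed usage inside a training script (needs torch and an initialised
# process group, so it is left as a comment here):
#   if wants_sync_bn(opt.sync_bn, opt.world_size, opt.global_rank):
#       model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

if __name__ == "__main__":
    print(wants_sync_bn(True, world_size=1, rank=-1))  # single-GPU run
    print(wants_sync_bn(True, world_size=4, rank=0))   # 4-process launch
```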

Could you show train_batch0.jpg from your runs/expxx folder?
Could you also show the training snapshot for training from scratch?
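
Besides looking at train_batch0.jpg, the label files themselves can be checked for values that might produce nan losses. The following is a minimal sketch, assuming YOLO-style text labels with one `class x_center y_center width height` line per box and coordinates normalized to [0, 1]; the function names and the `num_classes` parameter are hypothetical and not part of this repository.

```python
import math
from pathlib import Path
from typing import Dict, List


def check_label_line(line: str, num_classes: int) -> List[str]:
    """Return a list of problems found in one label line (empty if OK)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        cls = int(parts[0])
        box = [float(v) for v in parts[1:]]
    except ValueError:
        return ["unparseable field"]
    problems = []
    if not 0 <= cls < num_classes:
        problems.append(f"class {cls} outside [0, {num_classes})")
    if not all(math.isfinite(v) for v in box):
        problems.append("non-finite coordinate")
    elif not all(0.0 <= v <= 1.0 for v in box):
        problems.append("coordinate outside [0, 1]")
    elif box[2] <= 0.0 or box[3] <= 0.0:
        problems.append("zero width or height")
    return problems


def check_label_dir(label_dir: str, num_classes: int) -> Dict[str, List[str]]:
    """Map 'file:line' to its problems for every bad line under label_dir."""
    bad = {}
    for path in sorted(Path(label_dir).glob("*.txt")):
        for i, line in enumerate(path.read_text().splitlines(), start=1):
            if line.strip():
                problems = check_label_line(line, num_classes)
                if problems:
                    bad[f"{path.name}:{i}"] = problems
    return bad
```

Calling `check_label_dir` on the training label folder with the dataset's class count returns an empty dict when every line passes these checks; any entry points at a specific file and line to inspect.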

@l1uw3n
Author

l1uw3n commented Nov 18, 2020

1. train_batch0.jpg displays correctly.
2. The training log is:
python train.py --batch-size 16 --img 512 512 --data xx.yaml --cfg yolov4-p5.yaml --weights yolov4-p5.pt --device 1 --name yolov4-p5-tune --hyp 'data/hyp.finetune.yaml' --epochs 450 --resume
Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=11019MB)

Namespace(adam=False, batch_size=16, bucket='', cache_images=False, cfg='./models/yolov4-p5.yaml', data='./data/xx.yaml', device='1', epochs=450, evolve=False, global_rank=-1, hyp='data/hyp.finetune.yaml', img_size=[512, 512], local_rank=-1, logdir='runs/', multi_scale=False, name='yolov4-p5-tune', noautoanchor=False, nosave=False, notest=False, rect=False, resume='get_last', single_cls=False, sync_bn=False, total_batch_size=16, weights='yolov4-p5.pt', world_size=1)
Start Tensorboard with "tensorboard --logdir runs/", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'momentum': 0.937, 'weight_decay': 0.0005, 'giou': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.5, 'scale': 0.8, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mixup': 0.2}
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'models.yolo.Model' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.container.Sequential' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.conv.Conv2d' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.batchnorm.SyncBatchNorm' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.container.ModuleList' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.pooling.MaxPool2d' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)
/home/gpu1/anaconda3/envs/ptlw/lib/python3.6/site-packages/torch/serialization.py:649: SourceChangeWarning: source code of class 'torch.nn.modules.upsampling.Upsample' has changed. you can retrieve the original source code by accessing the object's source attribute or set torch.nn.Module.dump_patches = True and use the patch tool to revert the changes.
warnings.warn(msg, SourceChangeWarning)

                 from  n    params  module                                  arguments
  0                -1  1       928  models.common.Conv                      [3, 32, 3, 1]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     19904  models.common.BottleneckCSP             [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  1    161152  models.common.BottleneckCSP             [128, 128, 3]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  1   2614016  models.common.BottleneckCSP             [256, 256, 15]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1  10438144  models.common.BottleneckCSP             [512, 512, 15]
  9                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]
 10                -1  1  20728832  models.common.BottleneckCSP             [1024, 1024, 7]
 11                -1  1   7610368  models.common.SPPCSP                    [1024, 512, 1]
 12                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14                 8  1    131584  models.common.Conv                      [512, 256, 1, 1]
 15          [-1, -2]  1         0  models.common.Concat                    [1]
 16                -1  1   2298880  models.common.BottleneckCSP2            [512, 256, 3]
 17                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 18                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 19                 6  1     33024  models.common.Conv                      [256, 128, 1, 1]
 20          [-1, -2]  1         0  models.common.Concat                    [1]
 21                -1  1    576000  models.common.BottleneckCSP2            [256, 128, 3]
 22                -1  1    295424  models.common.Conv                      [128, 256, 3, 1]
 23                -2  1    295424  models.common.Conv                      [128, 256, 3, 2]
 24          [-1, 16]  1         0  models.common.Concat                    [1]
 25                -1  1   2298880  models.common.BottleneckCSP2            [512, 256, 3]
 26                -1  1   1180672  models.common.Conv                      [256, 512, 3, 1]
 27                -2  1   1180672  models.common.Conv                      [256, 512, 3, 2]
 28          [-1, 11]  1         0  models.common.Concat                    [1]
 29                -1  1   9185280  models.common.BottleneckCSP2            [1024, 512, 3]
 30                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 1]
 31      [22, 26, 30]  1     78980  models.yolo.Detect                      [6, [[13, 17, 31, 25, 24, 51, 61, 45], [48, 102, 119, 96, 97, 189, 217, 184], [171, 384, 324, 451, 616, 618, 800, 800]], [256, 512, 1024]]
Model Summary: 476 layers, 7.03027e+07 parameters, 7.03027e+07 gradients

Transferred 935/943 items from yolov4-p5.pt
Optimizer groups: 158 .bias, 163 conv.weight, 155 other
Scanning labels ***train.cache (8989 found, 0 missing, 0 empty, 0 duplicate
Scanning labels ***test_A.cache (499 found, 0 missing, 1 empty, 0 duplicate

Analyzing anchors... anchors/target = 4.77, Best Possible Recall (BPR) = 0.9877
Image sizes 512 train, 512 test
Using 8 dataloader workers
Starting training for 450 epochs...

 Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
 0/449     9.97G       nan       nan       nan       nan        63       512:   0%| | 1/562 [00:07<1:03:3
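
Since the losses are already nan on the first iteration of epoch 0, it can help to stop at the first non-finite value and inspect that batch rather than let training continue. A minimal sketch, assuming the loss components are available as plain Python floats (for example via `.item()` on each loss tensor); the function names are hypothetical and not part of train.py.

```python
import math
from typing import Dict, List


def non_finite_terms(losses: Dict[str, float]) -> List[str]:
    """Return the names of loss components that are NaN or infinite."""
    return [name for name, value in losses.items() if not math.isfinite(value)]


def check_step(step: int, losses: Dict[str, float]) -> None:
    """Raise as soon as any loss component stops being finite."""
    bad = non_finite_terms(losses)
    if bad:
        raise FloatingPointError(f"step {step}: non-finite loss terms: {bad}")


if __name__ == "__main__":
    # Illustrative values only, not taken from the log above.
    check_step(0, {"GIoU": 0.05, "obj": 0.10, "cls": 0.02})  # passes silently
    try:
        check_step(1, {"GIoU": float("nan"), "obj": 0.10, "cls": 0.02})
    except FloatingPointError as err:
        print(err)
```

Knowing which component goes non-finite first (GIoU, obj, or cls) narrows down whether the boxes, the objectness targets, or the class targets deserve a closer look.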

@WongKinYiu
Owner

Please provide the log file for training from scratch, not for fine-tuning.

@l1uw3n
Author

l1uw3n commented Nov 19, 2020

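I trained from scratch with the following command and got the same result:
python train.py --batch-size 1 --img 896 896 --data ship.yaml --cfg yolov4-p5.yaml --weights '' --device 1 --name yolov4-p5-tune --epochs 450
The same data, in the same format, has been used to train yolov5, so the data should not be the problem. Thank you for your patient answers; I do not know what went wrong either.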

@WongKinYiu
Owner

Could you provide the .data and .yaml files you used on both sides?

@l1uw3n
Author

l1uw3n commented Nov 19, 2020

Hello, the .yaml I used on both sides is shown below; it is identical in both:

image

@jorgegaticav

jorgegaticav commented Aug 21, 2021

@l1uw3n I'm having the same problem. Were you able to fix it?

@jorgegaticav

I'm running it like this:
python -m torch.distributed.launch --nproc_per_node 4 train.py --batch-size 8 --img 2048 2048 --data mydata.yaml --cfg yolov4-p7.yaml --weights '' --device 0,1,2,3 --name yolov4-p7
