terminate called after throwing an instance of 'c10::CUDAError' #1161

Open
TalalAhmed311 opened this issue Nov 24, 2022 · 1 comment

@TalalAhmed311

I was training YOLOv7 on my custom data, but after the first epoch it produces the error below. I can't find any helpful resources and would appreciate it if someone could look into it.

YOLOR 🚀 v0.1-115-g072f76c torch 1.12.1+cu113 CUDA:0 (Tesla T4, 15109.75MB)

Namespace(adam=False, artifact_alias='latest', batch_size=4, bbox_interval=-1, bucket='', cache_images=False, cfg='cfg/training/yolov7.yaml', data='data/data.yaml', device='', entity=None, epochs=2, evolve=False, exist_ok=False, freeze=[0], global_rank=-1, hyp='data/hyp.scratch.p5.yaml', image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=-1, multi_scale=False, name='yolov7', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/yolov73', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=4, upload_dataset=False, v5_metric=False, weights='yolov7.pt', workers=0, world_size=1)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.3, cls_pw=1.0, obj=0.7, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.2, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.0, paste_in=0.15, loss_ota=1
wandb: Install Weights & Biases for YOLOR logging with 'pip install wandb' (recommended)
Overriding model.yaml nc=80 with nc=2

             from  n    params  module                                  arguments                     

0 -1 1 928 models.common.Conv [3, 32, 3, 1]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 36992 models.common.Conv [64, 64, 3, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 1 8320 models.common.Conv [128, 64, 1, 1]
5 -2 1 8320 models.common.Conv [128, 64, 1, 1]
6 -1 1 36992 models.common.Conv [64, 64, 3, 1]
7 -1 1 36992 models.common.Conv [64, 64, 3, 1]
8 -1 1 36992 models.common.Conv [64, 64, 3, 1]
9 -1 1 36992 models.common.Conv [64, 64, 3, 1]
10 [-1, -3, -5, -6] 1 0 models.common.Concat [1]
11 -1 1 66048 models.common.Conv [256, 256, 1, 1]
12 -1 1 0 models.common.MP []
13 -1 1 33024 models.common.Conv [256, 128, 1, 1]
14 -3 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 147712 models.common.Conv [128, 128, 3, 2]
16 [-1, -3] 1 0 models.common.Concat [1]
17 -1 1 33024 models.common.Conv [256, 128, 1, 1]
18 -2 1 33024 models.common.Conv [256, 128, 1, 1]
19 -1 1 147712 models.common.Conv [128, 128, 3, 1]
20 -1 1 147712 models.common.Conv [128, 128, 3, 1]
21 -1 1 147712 models.common.Conv [128, 128, 3, 1]
22 -1 1 147712 models.common.Conv [128, 128, 3, 1]
23 [-1, -3, -5, -6] 1 0 models.common.Concat [1]
24 -1 1 263168 models.common.Conv [512, 512, 1, 1]
25 -1 1 0 models.common.MP []
26 -1 1 131584 models.common.Conv [512, 256, 1, 1]
27 -3 1 131584 models.common.Conv [512, 256, 1, 1]
28 -1 1 590336 models.common.Conv [256, 256, 3, 2]
29 [-1, -3] 1 0 models.common.Concat [1]
30 -1 1 131584 models.common.Conv [512, 256, 1, 1]
31 -2 1 131584 models.common.Conv [512, 256, 1, 1]
32 -1 1 590336 models.common.Conv [256, 256, 3, 1]
33 -1 1 590336 models.common.Conv [256, 256, 3, 1]
34 -1 1 590336 models.common.Conv [256, 256, 3, 1]
35 -1 1 590336 models.common.Conv [256, 256, 3, 1]
36 [-1, -3, -5, -6] 1 0 models.common.Concat [1]
37 -1 1 1050624 models.common.Conv [1024, 1024, 1, 1]
38 -1 1 0 models.common.MP []
39 -1 1 525312 models.common.Conv [1024, 512, 1, 1]
40 -3 1 525312 models.common.Conv [1024, 512, 1, 1]
41 -1 1 2360320 models.common.Conv [512, 512, 3, 2]
42 [-1, -3] 1 0 models.common.Concat [1]
43 -1 1 262656 models.common.Conv [1024, 256, 1, 1]
44 -2 1 262656 models.common.Conv [1024, 256, 1, 1]
45 -1 1 590336 models.common.Conv [256, 256, 3, 1]
46 -1 1 590336 models.common.Conv [256, 256, 3, 1]
47 -1 1 590336 models.common.Conv [256, 256, 3, 1]
48 -1 1 590336 models.common.Conv [256, 256, 3, 1]
49 [-1, -3, -5, -6] 1 0 models.common.Concat [1]
50 -1 1 1050624 models.common.Conv [1024, 1024, 1, 1]
51 -1 1 7609344 models.common.SPPCSPC [1024, 512, 1]
52 -1 1 131584 models.common.Conv [512, 256, 1, 1]
53 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
54 37 1 262656 models.common.Conv [1024, 256, 1, 1]
55 [-1, -2] 1 0 models.common.Concat [1]
56 -1 1 131584 models.common.Conv [512, 256, 1, 1]
57 -2 1 131584 models.common.Conv [512, 256, 1, 1]
58 -1 1 295168 models.common.Conv [256, 128, 3, 1]
59 -1 1 147712 models.common.Conv [128, 128, 3, 1]
60 -1 1 147712 models.common.Conv [128, 128, 3, 1]
61 -1 1 147712 models.common.Conv [128, 128, 3, 1]
62[-1, -2, -3, -4, -5, -6] 1 0 models.common.Concat [1]
63 -1 1 262656 models.common.Conv [1024, 256, 1, 1]
64 -1 1 33024 models.common.Conv [256, 128, 1, 1]
65 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
66 24 1 65792 models.common.Conv [512, 128, 1, 1]
67 [-1, -2] 1 0 models.common.Concat [1]
68 -1 1 33024 models.common.Conv [256, 128, 1, 1]
69 -2 1 33024 models.common.Conv [256, 128, 1, 1]
70 -1 1 73856 models.common.Conv [128, 64, 3, 1]
71 -1 1 36992 models.common.Conv [64, 64, 3, 1]
72 -1 1 36992 models.common.Conv [64, 64, 3, 1]
73 -1 1 36992 models.common.Conv [64, 64, 3, 1]
74[-1, -2, -3, -4, -5, -6] 1 0 models.common.Concat [1]
75 -1 1 65792 models.common.Conv [512, 128, 1, 1]
76 -1 1 0 models.common.MP []
77 -1 1 16640 models.common.Conv [128, 128, 1, 1]
78 -3 1 16640 models.common.Conv [128, 128, 1, 1]
79 -1 1 147712 models.common.Conv [128, 128, 3, 2]
80 [-1, -3, 63] 1 0 models.common.Concat [1]
81 -1 1 131584 models.common.Conv [512, 256, 1, 1]
82 -2 1 131584 models.common.Conv [512, 256, 1, 1]
83 -1 1 295168 models.common.Conv [256, 128, 3, 1]
84 -1 1 147712 models.common.Conv [128, 128, 3, 1]
85 -1 1 147712 models.common.Conv [128, 128, 3, 1]
86 -1 1 147712 models.common.Conv [128, 128, 3, 1]
87[-1, -2, -3, -4, -5, -6] 1 0 models.common.Concat [1]
88 -1 1 262656 models.common.Conv [1024, 256, 1, 1]
89 -1 1 0 models.common.MP []
90 -1 1 66048 models.common.Conv [256, 256, 1, 1]
91 -3 1 66048 models.common.Conv [256, 256, 1, 1]
92 -1 1 590336 models.common.Conv [256, 256, 3, 2]
93 [-1, -3, 51] 1 0 models.common.Concat [1]
94 -1 1 525312 models.common.Conv [1024, 512, 1, 1]
95 -2 1 525312 models.common.Conv [1024, 512, 1, 1]
96 -1 1 1180160 models.common.Conv [512, 256, 3, 1]
97 -1 1 590336 models.common.Conv [256, 256, 3, 1]
98 -1 1 590336 models.common.Conv [256, 256, 3, 1]
99 -1 1 590336 models.common.Conv [256, 256, 3, 1]
100[-1, -2, -3, -4, -5, -6] 1 0 models.common.Concat [1]
101 -1 1 1049600 models.common.Conv [2048, 512, 1, 1]
102 75 1 328704 models.common.RepConv [128, 256, 3, 1]
103 88 1 1312768 models.common.RepConv [256, 512, 3, 1]
104 101 1 5246976 models.common.RepConv [512, 1024, 3, 1]
105 [102, 103, 104] 1 39550 models.yolo.IDetect [2, [[12, 16, 19, 36, 40, 28], [36, 75, 76, 55, 72, 146], [142, 110, 192, 243, 459, 401]], [256, 512, 1024]]
/usr/local/lib/python3.7/dist-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Model Summary: 415 layers, 37201950 parameters, 37201950 gradients, 105.1 GFLOPS

Transferred 552/566 items from yolov7.pt
Scaled weight_decay = 0.0005
Optimizer groups: 95 .bias, 95 conv.weight, 98 other
train: Scanning '../datasets/labels/train.cache' images and labels... 448 found, 0 missing, 0 empty, 0 corrupted: 100% 448/448 [00:00<?, ?it/s]
val: Scanning '../datasets/labels/val.cache' images and labels... 113 found, 0 missing, 0 empty, 0 corrupted: 100% 113/113 [00:00<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 4.45, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 test
Using 0 dataloader workers
Logging results to runs/train/yolov73
Starting training for 2 epochs...

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
   0/1     11.5G   0.06093   0.01211   0.01035   0.08339        17       640: 100% 112/112 [01:45<00:00,  1.06it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95:   7% 1/15 [00:02<00:32,  2.32s/it]

terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from record at ../aten/src/ATen/cuda/CUDAEvent.h:115 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fcd3bc9a20e in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: + 0xf3a88 (0x7fcd7e55ca88 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: + 0xf6ffe (0x7fcd7e55fffe in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: + 0x478fd8 (0x7fcd8d8b8fd8 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fcd3bc817a5 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #5: + 0x372545 (0x7fcd8d7b2545 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #6: + 0x6a4c70 (0x7fcd8dae4c70 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x308 (0x7fcd8dae5068 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #8: python3() [0x5a29b4]
frame #9: python3() [0x53c75b]
frame #10: python3() [0x42282d]

frame #20: python3() [0x607796]
frame #23: python3() [0x64db82]
frame #25: __libc_start_main + 0xe7 (0x7fcdb2a92c87 in /lib/x86_64-linux-gnu/libc.so.6)

@Kannan665

I have come across this kind of CUDA-related error after the first few epochs. Are you using Docker, as advised by the authors?
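
One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: `device-side assert triggered` is often raised by an out-of-range index, and in detection training a common source is a label file whose class id falls outside `[0, nc)`. The log above shows `nc=2`, so the valid ids would be 0 and 1. The helper names below (`find_bad_class_ids`, `scan_label_dir`) and the directory `../datasets/labels/val` are assumptions for illustration and not part of the YOLOv7 code base. The script also assumes the usual YOLO text label layout of `class x y w h` per line.

```python
from pathlib import Path


def find_bad_class_ids(label_lines, num_classes):
    """Return (line_number, class_id) pairs whose class id is outside [0, num_classes)."""
    bad = []
    for lineno, line in enumerate(label_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Some annotation tools write the class id as "1.0", so parse via float first.
        class_id = int(float(fields[0]))
        if not 0 <= class_id < num_classes:
            bad.append((lineno, class_id))
    return bad


def scan_label_dir(label_dir, num_classes):
    """Map each *.txt label file under label_dir to its out-of-range entries, if any."""
    report = {}
    for path in sorted(Path(label_dir).glob("*.txt")):
        bad = find_bad_class_ids(path.read_text().splitlines(), num_classes)
        if bad:
            report[str(path)] = bad
    return report


if __name__ == "__main__":
    # Hypothetical path, guessed from the 'val.cache' location in the log; adjust as needed.
    for name, entries in scan_label_dir("../datasets/labels/val", num_classes=2).items():
        print(name, entries)
```

If this prints nothing, the labels are probably not the cause. Rerunning with `CUDA_LAUNCH_BLOCKING=1`, as the error message itself suggests, should then make the stack trace point closer to the kernel that actually failed.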
