terminate called after throwing an instance of 'c10::CUDAError' #1161

TalalAhmed311 · 2022-11-24T13:13:02Z

I was training YOLOv7 on my custom data, but after the first epoch it produces this error. I can't find any helpful resources and would appreciate it if someone could look into it.

YOLOR 🚀 v0.1-115-g072f76c torch 1.12.1+cu113 CUDA:0 (Tesla T4, 15109.75MB)

Namespace(adam=False, artifact_alias='latest', batch_size=4, bbox_interval=-1, bucket='', cache_images=False, cfg='cfg/training/yolov7.yaml', data='data/data.yaml', device='', entity=None, epochs=2, evolve=False, exist_ok=False, freeze=[0], global_rank=-1, hyp='data/hyp.scratch.p5.yaml', image_weights=False, img_size=[640, 640], label_smoothing=0.0, linear_lr=False, local_rank=-1, multi_scale=False, name='yolov7', noautoanchor=False, nosave=False, notest=False, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/yolov73', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=4, upload_dataset=False, v5_metric=False, weights='yolov7.pt', workers=0, world_size=1)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.1, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.3, cls_pw=1.0, obj=0.7, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.2, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.0, paste_in=0.15, loss_ota=1
wandb: Install Weights & Biases for YOLOR logging with 'pip install wandb' (recommended)
Overriding model.yaml nc=80 with nc=2

     from                      n    params  module                                arguments
  0  -1                        1       928  models.common.Conv                    [3, 32, 3, 1]
  1  -1                        1     18560  models.common.Conv                    [32, 64, 3, 2]
  2  -1                        1     36992  models.common.Conv                    [64, 64, 3, 1]
  3  -1                        1     73984  models.common.Conv                    [64, 128, 3, 2]
  4  -1                        1      8320  models.common.Conv                    [128, 64, 1, 1]
  5  -2                        1      8320  models.common.Conv                    [128, 64, 1, 1]
  6  -1                        1     36992  models.common.Conv                    [64, 64, 3, 1]
  7  -1                        1     36992  models.common.Conv                    [64, 64, 3, 1]
  8  -1                        1     36992  models.common.Conv                    [64, 64, 3, 1]
  9  -1                        1     36992  models.common.Conv                    [64, 64, 3, 1]
 10  [-1, -3, -5, -6]          1         0  models.common.Concat                  [1]
 11  -1                        1     66048  models.common.Conv                    [256, 256, 1, 1]
 12  -1                        1         0  models.common.MP                      []
 13  -1                        1     33024  models.common.Conv                    [256, 128, 1, 1]
 14  -3                        1     33024  models.common.Conv                    [256, 128, 1, 1]
 15  -1                        1    147712  models.common.Conv                    [128, 128, 3, 2]
 16  [-1, -3]                  1         0  models.common.Concat                  [1]
 17  -1                        1     33024  models.common.Conv                    [256, 128, 1, 1]
 18  -2                        1     33024  models.common.Conv                    [256, 128, 1, 1]
 19  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 20  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 21  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 22  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 23  [-1, -3, -5, -6]          1         0  models.common.Concat                  [1]
 24  -1                        1    263168  models.common.Conv                    [512, 512, 1, 1]
 25  -1                        1         0  models.common.MP                      []
 26  -1                        1    131584  models.common.Conv                    [512, 256, 1, 1]
 27  -3                        1    131584  models.common.Conv                    [512, 256, 1, 1]
 28  -1                        1    590336  models.common.Conv                    [256, 256, 3, 2]
 29  [-1, -3]                  1         0  models.common.Concat                  [1]
 30  -1                        1    131584  models.common.Conv                    [512, 256, 1, 1]
 31  -2                        1    131584  models.common.Conv                    [512, 256, 1, 1]
 32  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 33  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 34  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 35  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 36  [-1, -3, -5, -6]          1         0  models.common.Concat                  [1]
 37  -1                        1   1050624  models.common.Conv                    [1024, 1024, 1, 1]
 38  -1                        1         0  models.common.MP                      []
 39  -1                        1    525312  models.common.Conv                    [1024, 512, 1, 1]
 40  -3                        1    525312  models.common.Conv                    [1024, 512, 1, 1]
 41  -1                        1   2360320  models.common.Conv                    [512, 512, 3, 2]
 42  [-1, -3]                  1         0  models.common.Concat                  [1]
 43  -1                        1    262656  models.common.Conv                    [1024, 256, 1, 1]
 44  -2                        1    262656  models.common.Conv                    [1024, 256, 1, 1]
 45  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 46  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 47  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 48  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 49  [-1, -3, -5, -6]          1         0  models.common.Concat                  [1]
 50  -1                        1   1050624  models.common.Conv                    [1024, 1024, 1, 1]
 51  -1                        1   7609344  models.common.SPPCSPC                 [1024, 512, 1]
 52  -1                        1    131584  models.common.Conv                    [512, 256, 1, 1]
 53  -1                        1         0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
 54  37                        1    262656  models.common.Conv                    [1024, 256, 1, 1]
 55  [-1, -2]                  1         0  models.common.Concat                  [1]
 56  -1                        1    131584  models.common.Conv                    [512, 256, 1, 1]
 57  -2                        1    131584  models.common.Conv                    [512, 256, 1, 1]
 58  -1                        1    295168  models.common.Conv                    [256, 128, 3, 1]
 59  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 60  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 61  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 62  [-1, -2, -3, -4, -5, -6]  1         0  models.common.Concat                  [1]
 63  -1                        1    262656  models.common.Conv                    [1024, 256, 1, 1]
 64  -1                        1     33024  models.common.Conv                    [256, 128, 1, 1]
 65  -1                        1         0  torch.nn.modules.upsampling.Upsample  [None, 2, 'nearest']
 66  24                        1     65792  models.common.Conv                    [512, 128, 1, 1]
 67  [-1, -2]                  1         0  models.common.Concat                  [1]
 68  -1                        1     33024  models.common.Conv                    [256, 128, 1, 1]
 69  -2                        1     33024  models.common.Conv                    [256, 128, 1, 1]
 70  -1                        1     73856  models.common.Conv                    [128, 64, 3, 1]
 71  -1                        1     36992  models.common.Conv                    [64, 64, 3, 1]
 72  -1                        1     36992  models.common.Conv                    [64, 64, 3, 1]
 73  -1                        1     36992  models.common.Conv                    [64, 64, 3, 1]
 74  [-1, -2, -3, -4, -5, -6]  1         0  models.common.Concat                  [1]
 75  -1                        1     65792  models.common.Conv                    [512, 128, 1, 1]
 76  -1                        1         0  models.common.MP                      []
 77  -1                        1     16640  models.common.Conv                    [128, 128, 1, 1]
 78  -3                        1     16640  models.common.Conv                    [128, 128, 1, 1]
 79  -1                        1    147712  models.common.Conv                    [128, 128, 3, 2]
 80  [-1, -3, 63]              1         0  models.common.Concat                  [1]
 81  -1                        1    131584  models.common.Conv                    [512, 256, 1, 1]
 82  -2                        1    131584  models.common.Conv                    [512, 256, 1, 1]
 83  -1                        1    295168  models.common.Conv                    [256, 128, 3, 1]
 84  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 85  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 86  -1                        1    147712  models.common.Conv                    [128, 128, 3, 1]
 87  [-1, -2, -3, -4, -5, -6]  1         0  models.common.Concat                  [1]
 88  -1                        1    262656  models.common.Conv                    [1024, 256, 1, 1]
 89  -1                        1         0  models.common.MP                      []
 90  -1                        1     66048  models.common.Conv                    [256, 256, 1, 1]
 91  -3                        1     66048  models.common.Conv                    [256, 256, 1, 1]
 92  -1                        1    590336  models.common.Conv                    [256, 256, 3, 2]
 93  [-1, -3, 51]              1         0  models.common.Concat                  [1]
 94  -1                        1    525312  models.common.Conv                    [1024, 512, 1, 1]
 95  -2                        1    525312  models.common.Conv                    [1024, 512, 1, 1]
 96  -1                        1   1180160  models.common.Conv                    [512, 256, 3, 1]
 97  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 98  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
 99  -1                        1    590336  models.common.Conv                    [256, 256, 3, 1]
100  [-1, -2, -3, -4, -5, -6]  1         0  models.common.Concat                  [1]
101  -1                        1   1049600  models.common.Conv                    [2048, 512, 1, 1]
102  75                        1    328704  models.common.RepConv                 [128, 256, 3, 1]
103  88                        1   1312768  models.common.RepConv                 [256, 512, 3, 1]
104  101                       1   5246976  models.common.RepConv                 [512, 1024, 3, 1]
105  [102, 103, 104]           1     39550  models.yolo.IDetect                   [2, [[12, 16, 19, 36, 40, 28], [36, 75, 76, 55, 72, 146], [142, 110, 192, 243, 459, 401]], [256, 512, 1024]]
/usr/local/lib/python3.7/dist-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Model Summary: 415 layers, 37201950 parameters, 37201950 gradients, 105.1 GFLOPS

Transferred 552/566 items from yolov7.pt
Scaled weight_decay = 0.0005
Optimizer groups: 95 .bias, 95 conv.weight, 98 other
train: Scanning '../datasets/labels/train.cache' images and labels... 448 found, 0 missing, 0 empty, 0 corrupted: 100% 448/448 [00:00<?, ?it/s]
val: Scanning '../datasets/labels/val.cache' images and labels... 113 found, 0 missing, 0 empty, 0 corrupted: 100% 113/113 [00:00<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 4.45, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 test
Using 0 dataloader workers
Logging results to runs/train/yolov73
Starting training for 2 epochs...

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size
   0/1     11.5G   0.06093   0.01211   0.01035   0.08339        17       640: 100% 112/112 [01:45<00:00,  1.06it/s]
           Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95:   7% 1/15 [00:02<00:32,  2.32s/it]

terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from record at ../aten/src/ATen/cuda/CUDAEvent.h:115 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fcd3bc9a20e in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: + 0xf3a88 (0x7fcd7e55ca88 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: + 0xf6ffe (0x7fcd7e55fffe in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: + 0x478fd8 (0x7fcd8d8b8fd8 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fcd3bc817a5 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #5: + 0x372545 (0x7fcd8d7b2545 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #6: + 0x6a4c70 (0x7fcd8dae4c70 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x308 (0x7fcd8dae5068 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #8: python3() [0x5a29b4]
frame #9: python3() [0x53c75b]
frame #10: python3() [0x42282d]

frame #20: python3() [0x607796]
frame #23: python3() [0x64db82]
frame #25: __libc_start_main + 0xe7 (0x7fcdb2a92c87 in /lib/x86_64-linux-gnu/libc.so.6)
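
The message above recommends CUDA_LAUNCH_BLOCKING=1 because CUDA kernels are launched asynchronously, so the frames in the trace may not point at the call that actually failed. Below is a minimal Python sketch of rerunning a command with that variable set. The helper names `blocking_env` and `run_blocking` are hypothetical, and the train.py arguments in the closing comment are copied from the Namespace line in the log as an illustration, not a verified command.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 added."""
    env = dict(os.environ if base is None else base)
    # With this set, each CUDA kernel launch is synchronous, so an error
    # should surface at the call that caused it instead of a later one.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(cmd):
    """Run cmd (a list of strings) with CUDA_LAUNCH_BLOCKING=1; return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode


# Possible usage from the repository root (arguments assumed from the log above):
# run_blocking([sys.executable, "train.py", "--weights", "yolov7.pt",
#               "--cfg", "cfg/training/yolov7.yaml", "--data", "data/data.yaml",
#               "--hyp", "data/hyp.scratch.p5.yaml", "--epochs", "2",
#               "--batch-size", "4", "--workers", "0"])
```

Training will probably run slower this way, so it is only meant for reproducing the crash and getting a more trustworthy traceback.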

Kannan665 · 2022-11-29T03:33:31Z

I have come across these kinds of CUDA-related errors after the first set of epochs. Are you using Docker, as advised by the authors?
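
Separately from the Docker question, one common cause of "device-side assert triggered" in general is an index that is out of range on the GPU, for example a class id in a label file that is not below the number of classes (the log reports nc=2). Nothing in this thread confirms that this is the cause here, but it is cheap to rule out. A rough sketch, assuming YOLO-format text labels (class id first on each line); `find_bad_class_ids` is a hypothetical helper, and the label directories in the closing comment are guessed from the `.cache` paths in the log.

```python
from pathlib import Path


def find_bad_class_ids(label_dir, num_classes):
    """Return (file name, line number, class id) for labels outside [0, num_classes)."""
    bad = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Assumed YOLO label layout: class x_center y_center width height
            class_id = int(float(fields[0]))
            if not 0 <= class_id < num_classes:
                bad.append((path.name, lineno, class_id))
    return bad


# Possible usage (directory names are assumptions based on the log):
# for split in ("train", "val"):
#     print(split, find_bad_class_ids(f"../datasets/labels/{split}", num_classes=2))
```

Since the crash shows up once validation starts, the validation labels may be the more interesting ones to check first. An empty list for both splits would point away from the labels and toward the environment.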
