
An illegal memory access was encountered #45

Open
PkuRainBow opened this issue Oct 27, 2018 · 10 comments
Labels: question (Further information is requested)

Comments

PkuRainBow commented Oct 27, 2018

🐛 Bug

I ran the script below on 4 x P100 GPUs.

PYTHON="/root/miniconda3/bin/python"
CONFIG="./configs/e2e_mask_rcnn_R_50_FPN_1x.yaml"

export NGPUS=4
${PYTHON} -m torch.distributed.launch --nproc_per_node=$NGPUS \
	./tools/train_net.py --config-file $CONFIG

Expected behavior

Here is the error information:
(screenshot of the error trace)

The first few iterations seem fine (iter 0 and 20).

Then at iteration 40 the numbers in brackets become nan, and after that I get the error telling me that an illegal memory access was encountered.

Environment

I installed everything following the instructions.

  • PyTorch version: 1.0
  • OS: Linux 16.04
  • Python version: 3.6
  • CUDA/cuDNN version: 8.0
  • GPU models and configuration: 4 x P100
fmassa (Contributor) commented Oct 27, 2018

Could you give more information?

I suspect it happens because you used too high a learning rate and training diverged, producing large indices.

PkuRainBow (Author) commented Oct 27, 2018

> Could you give more information?
>
> I suspect it happens because you used too high a learning rate and training diverged, producing large indices.

@fmassa, thanks for your quick reply.
Here is the default yaml file I am using:

MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "catalog://ImageNetPretrained/MSRA/R-50"
  BACKBONE:
    CONV_BODY: "R-50-FPN"
    OUT_CHANNELS: 256
  RPN:
    USE_FPN: True
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)
    PRE_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TEST: 1000
    FPN_POST_NMS_TOP_N_TEST: 1000
  ROI_HEADS:
    USE_FPN: True
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "FPN2MLPFeatureExtractor"
    PREDICTOR: "FPNPredictor"
  ROI_MASK_HEAD:
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    FEATURE_EXTRACTOR: "MaskRCNNFPNFeatureExtractor"
    PREDICTOR: "MaskRCNNC4Predictor"
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 2
    RESOLUTION: 28
    SHARE_BOX_FEATURE_EXTRACTOR: False
  MASK_ON: True
DATASETS:
  TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
  TEST: ("coco_2014_minival",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.02
  # BASE_LR: 0.0025
  WEIGHT_DECAY: 0.0001
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  # IMS_PER_BATCH: 2


fmassa (Contributor) commented Oct 27, 2018

So, you have changed IMS_PER_BATCH to 2, and the learning rate as well?

fmassa (Contributor) commented Oct 27, 2018

Try following the learning rate adaptation rules that I mentioned in the README; they are necessary for training not to diverge.
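
For concreteness, here is a minimal sketch of what the linearly scaled solver settings could look like for 4 GPUs with 2 images each (a global batch of 8). These numbers are an assumption obtained by halving the 8-GPU batch size and learning rate and doubling the schedule, so double-check them against the README's single-GPU example:

SOLVER:
  # assumed 4-GPU values, scaled linearly from the 8-GPU defaults
  IMS_PER_BATCH: 8          # global batch size across all GPUs, not per GPU
  BASE_LR: 0.01             # half of the 8-GPU default of 0.02
  WEIGHT_DECAY: 0.0001
  STEPS: (120000, 160000)   # decay steps stretched by 2x
  MAX_ITER: 180000          # total iterations stretched by 2x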

PkuRainBow (Author) commented:

@fmassa I still cannot figure out the problem.

fmassa (Contributor) commented Oct 29, 2018

So, to double check:

  • you are using 4 GPUs
  • you set IMS_PER_BATCH to 2

Is that right?

Note that the meaning of IMS_PER_BATCH is different in maskrcnn-benchmark than it is in Detectron.
If you use fewer than 8 GPUs, you might need to change a few hyperparameters for training to behave the same.
Have a look at https://github.com/facebookresearch/maskrcnn-benchmark#single-gpu-training for the differences and what to do.

Let me know if you still have problems.
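
As an alternative to editing the yaml, the same adjustments can be passed as command-line overrides to train_net.py, in the style of the single-GPU example in the README. The concrete values below are the same assumed linear scaling for 4 GPUs as sketched above, not something verified in this thread:

# assumed 4-GPU launch with solver options overridden on the command line
export NGPUS=4
${PYTHON} -m torch.distributed.launch --nproc_per_node=$NGPUS \
	./tools/train_net.py --config-file $CONFIG \
	SOLVER.IMS_PER_BATCH 8 SOLVER.BASE_LR 0.01 \
	SOLVER.MAX_ITER 180000 SOLVER.STEPS "(120000, 160000)"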

fmassa added the "question: Further information is requested" label on Oct 29, 2018

PkuRainBow (Author) commented:

@fmassa Thanks for your kind help.

I will post an update if I make progress.

zimenglan-sysu-512 (Contributor) commented Dec 4, 2018

Hi @fmassa,
after several thousand (or tens of thousands of) iterations, the loss becomes NaN:

2018-12-04 07:02:12,736 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:11:04  iter: 38300  loss: 0.4934 (0.6051)  loss_classifier: 0.2030 (0.2690)  loss_box_reg: 0.1892 (0.2426)  loss_objectness: 0.0369 (0.0527)  loss_rpn_box_reg: 0.0336 (0.0409)  time: 1.0707 (1.0797)  data: 0.0126 (0.0124)  lr: 0.010000  max mem: 3778
2018-12-04 07:02:34,353 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:10:43  iter: 38320  loss: 0.5649 (0.6050)  loss_classifier: 0.2554 (0.2689)  loss_box_reg: 0.2274 (0.2426)  loss_objectness: 0.0426 (0.0527)  loss_rpn_box_reg: 0.0374 (0.0409)  time: 1.0791 (1.0797)  data: 0.0115 (0.0124)  lr: 0.010000  max mem: 3778
2018-12-04 07:02:54,637 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:10:15  iter: 38340  loss: nan (nan)  loss_classifier: 0.2202 (nan)  loss_box_reg: nan (nan)  loss_objectness: nan (nan)  loss_rpn_box_reg: nan (nan)  time: 1.0134 (1.0797)  data: 0.0101 (0.0124)  lr: 0.010000  max mem: 3778
2018-12-04 07:03:13,254 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:09:39  iter: 38360  loss: nan (nan)  loss_classifier: nan (nan)  loss_box_reg: nan (nan)  loss_objectness: nan (nan)  loss_rpn_box_reg: nan (nan)  time: 0.9273 (1.0796)  data: 0.0099 (0.0124)  lr: 0.010000  max mem: 3778
2018-12-04 07:03:31,830 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:09:04  iter: 38380  loss: nan (nan)  loss_classifier: nan (nan)  loss_box_reg: nan (nan)  loss_objectness: nan (nan)  loss_rpn_box_reg: nan (nan)  time: 0.9140 (1.0795)  data: 0.0100 (0.0124)  lr: 0.010000  max mem: 3778

Do you have any ideas on how to solve it?

fmassa (Contributor) commented Dec 4, 2018

@zimenglan-sysu-512 difficult to say without more context. Is this COCO? Are you using a standard model or have you adapted one of the models? It might require some digging to understand where the problem might come from.

zimenglan-sysu-512 (Contributor) commented Dec 4, 2018

Hi @fmassa,
I want to add Light-Head R-CNN to train R-50-C4 on the COCO dataset, so maybe something is wrong in my implementation; I need to check my code.
Thanks.
