
runtime error #1

Closed
pillar02 opened this issue Mar 29, 2017 · 20 comments

@pillar02

Hi,

Have you successfully run train.py?
I encountered a runtime error saying "div_ only supports scalar multiplication" from the line "x /= norm.expand_as(x)" in modules/l2norm.py.
I then changed this line to "x = x.div(norm.expand_as(x))" but got another CUDA runtime error, "device-side assert triggered", from the line "return torch.cat([g_cxcy, g_wh], 1)" in box_utils.py.

BTW, I am using Python 2.7 instead of Python 3.
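
For reference, a minimal sketch of the two forms being compared (hypothetical shapes, using the old explicit-Variable API; the in-place form is the line that raised the error above):

    import torch
    from torch.autograd import Variable

    x = Variable(torch.randn(1, 512, 38, 38))                   # illustrative feature-map shape
    norm = Variable(torch.randn(1, 512, 38, 38).abs() + 1e-10)  # stand-in for the channel-wise norms

    x /= norm.expand_as(x)        # original in-place line in modules/l2norm.py
    x = x.div(norm.expand_as(x))  # out-of-place workaround tried above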

@amdegroot
Owner

Yes, I've successfully trained several models. For some reason I cannot reproduce this error on my machine. Did you make sure your repo is up to date with the current master branch?

@amdegroot
Owner

I am using Python 3 and have not tested it with 2.7, so that is the only thing I can think of at the moment if your local repo is up to date. I will add the lack of 2.7 support to the README if that turns out to be the issue.

@pillar02
Author

I didn't build from source but installed PyTorch from pip. I have also made some changes to adapt your code to Python 2.7 (e.g., the star expressions).

I checked the latest master branch and found that https://github.com/pytorch/pytorch/blob/master/torch/autograd/variable.py#L317-L320 still only supports scalar division. In the case of your line "x /= norm.expand_as(x)", it is clearly an element-wise division. But I don't understand how the Python version could affect this.

@pillar02
Author

BTW, could you please give me a rough time estimate for running one epoch (along with your machine specs)?

@amdegroot
Owner

amdegroot commented Mar 29, 2017

Yeah, I agree; if that's the case, I don't understand how it is working on my machine either. I'll look into it more after my classes today; sorry I don't have an answer right this second. As for the time estimate: it takes ~1.4 seconds to run a batch of size 32 forward and backward, but I'm not in my lab right now, so I can't recall the exact time per epoch. I'll get back to you on all of this right after class.

@amdegroot
Owner

And that's on a single Tesla K80 ^

@amdegroot
Owner

I think if you update to the latest version of PyTorch, you will see that element-wise division with .div_() is supported. I do remember that it was originally not supported, but they added it not too long ago. When I run something as simple as:

    x = torch.Tensor([1, 2, 3, 4, 5, 6])
    y = torch.Tensor([2, 2, 2, 2, 2, 2])
    x /= y

the correct result is returned. With a batch size of 32, on 1 Tesla K80, it takes me ~ 109 sec. per epoch.

@pillar02
Author

As I mentioned in the previous post, the latest GitHub PyTorch source code (master branch) still shows:

def div_(self, other):
    if not isinstance(other, Variable) and not torch.is_tensor(other):
        return DivConstant(other, inplace=True)(self)
    raise RuntimeError("div_ only supports scalar multiplication")

I still don't understand how it works in your case, but I will try to update my PyTorch to the latest version.
Thanks a lot.

@amdegroot
Owner

Yeah, I apologize for the lack of a better answer, but since I cannot reproduce this I am closing the issue for now. Let me know if updating PyTorch fixes it; I will try to dig up more info myself in the meantime.

@amdegroot
Owner

Ah, figured it out. That line in the source code is referring to Variables, so it is just saying Variables cannot be divided by Tensors, but Variables can be divided by other Variables of the same size (which is the case here) and Tensors can be divided by other Tensors of the same size.

torch/csrc/generic/methods/TensorMath.cwrap line 1038 looks like the place that bridges the Python and C sides for the tensor div_ definition, and it is what torch/tensor.py line 378 ("return self.div_(other)") ends up calling, even though self.div_ doesn't appear to be defined in the Python source.

So again, not sure what the exact source of the problem is in your case, but my best bet is your version of PyTorch. Hopefully that helps.
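
For what it's worth, a minimal sketch of that tensor-level path (hypothetical values); plain tensors of the same size divide element-wise in place via div_:

    import torch

    a = torch.Tensor([1., 2., 3., 4.])
    b = torch.Tensor([2., 2., 2., 2.])
    a.div_(b)   # element-wise in-place division: a is now [0.5, 1.0, 1.5, 2.0]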

@amdegroot
Owner

amdegroot commented Mar 30, 2017

Also, an update on training time: it takes approx. 37.5 sec. per epoch with a GTX 1060 and batch size of 16, which is what I am currently using (ran out of money for the K80 EC2 instance :P).

@pillar02
Author

Thanks a lot. I will definitely update my PyTorch.

Regarding the training time: it only takes 37.5 sec for one epoch? (I suppose you were training on VOC2007 with about 10,000 images, right?) I have tried training an MXNet SSD implementation, which takes about 270 sec per epoch using both the VOC2007 and VOC2012 data on my Titan X GPU. Does this mean this PyTorch SSD is even faster than the MXNet implementation? That doesn't seem right.

@amdegroot
Owner

amdegroot commented Mar 30, 2017

Yeah, that's my bad. Disregard that number, it's late here. Training on just the training set (2,501 images) from VOC07, it takes on average ~140 sec. per epoch on a single GTX 1060... so yeah, the previous number was off by a lot. I would be curious to see how it compares on a Titan X though.

@pillar02
Author

One more question ;-)

I am wondering how you got the fc-reduced VGG-16 weights.

@amdegroot
Owner

Hahah of course... I converted them to Chainer and then from Chainer to PyTorch. I also was able to convert them to Torch and then from Torch to PyTorch, but the specific weight file I supply was one that took the Chainer route.

@pillar02
Author

pillar02 commented Apr 6, 2017

Hi, I just updated PyTorch to the latest version (0.1.11_5) and ran train.py.
Luckily, I didn't get the div_ error.
But this time I got "RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/T1" at "conf = labels[best_truth_idx] + 1" in box_utils.py.
Any idea about this? It seems like something to do with tensor addition.

@amdegroot
Owner

Is this on the first feed forward, or were you able to get through some iterations? The only time that line has ever been an issue was a while back when I had an explicit 'background' label in the VOC labelmap, and it just became an index-out-of-range issue for softmax. But I'm currently training as I type this and can't think of what could be causing that. Have you pulled the most recent update of master? Or maybe you're on a different branch?
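
For illustration, a minimal sketch of that failure mode (hypothetical values, written against a current PyTorch API); an out-of-range class index in the classification loss is what surfaces as the device-side assert on the GPU:

    import torch
    import torch.nn.functional as F

    num_classes = 21                          # VOC: 20 object classes + background
    logits = torch.randn(4, num_classes)
    labels = torch.tensor([20, 5, 3, 1])      # labelmap that already contains 'background'
    targets = labels + 1                      # the +1 pushes 20 -> 21, outside [0, 20]

    loss = F.cross_entropy(logits, targets)   # raises; on the GPU this shows up as error 59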

@meetps

meetps commented Jul 4, 2017

I faced this issue as well, with PyTorch version 0.1.12_4, which is very recent.

I fixed it by changing the forward() function in L2Norm.py as follows:

def forward(self, x):
    # per-location L2 norm across the channel dim, with eps for numerical stability
    norm = x.pow(2).sum(1).sqrt() + self.eps
    norm_stretch = norm.expand_as(x)
    # out-of-place division instead of the in-place x /= norm.expand_as(x)
    x = x / norm_stretch
    # learnable per-channel scale, reshaped to (1, C, 1, 1) and broadcast over the map
    out = self.weight.unsqueeze(0).unsqueeze(2).unsqueeze(3).expand_as(x) * x
    return out

I am now facing an issue in box_utils.py:

THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu line=226 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train_cars.py", line 232, in <module>
    train()
  File "train_cars.py", line 184, in train
    loss_l, loss_c = criterion(out, targets)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mshah/code/ssd.pytorch/layers/modules/multibox_loss.py", line 70, in forward
    match(self.threshold,truths,defaults,self.variance,labels,loc_t,conf_t,idx)
  File "/home/mshah/code/ssd.pytorch/layers/box_utils.py", line 107, in match
    loc = encode(matches, priors, variances)
  File "/home/mshah/code/ssd.pytorch/layers/box_utils.py", line 133, in encode
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:226

@superhans

superhans commented Jul 7, 2017

Edit: I believe there are basic Python 2.7 vs. Python 3 compatibility issues causing the problem, since this code was written for Python 3 and not Python 2.7.

Adding the line from __future__ import division to box_utils.py, prior_box.py, and detection.py gets rid of the above error, and some other errors as well.
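
A minimal illustration of the underlying pitfall, assuming Python 2: without the future import, / between two ints floors the result, which can silently corrupt the box arithmetic:

    from __future__ import division

    print(1 / 2)    # 0.5 with the import (plain Python 2 would print 0)
    print(1 // 2)   # 0, explicit floor division in both Python 2 and 3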

@acrosson

acrosson commented Sep 9, 2017

Great suggestion @superhans. Adding from __future__ import division to most of the files gets rid of the nan/inf values in the loss for Python 2.7.
