
runtime error #1

Closed
pillar02 opened this issue Mar 29, 2017 · 20 comments

@pillar02

Hi,

Have you successfully run train.py?
I encountered a runtime error saying "div_ only supports scalar multiplication" from the line "x /= norm.expand_as(x)" in modules/l2norm.py.
I then changed this line to "x = x.div(norm.expand_as(x))" but got another CUDA runtime error, "device-side assert triggered", from the line "return torch.cat([g_cxcy, g_wh], 1)" in box_utils.py.

BTW, I am using Python 2.7 instead of Python 3.
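
For reference, a minimal sketch of the two forms being compared (hypothetical shapes, using the old explicit-Variable API; the in-place form is the line that raised the error above):

    import torch
    from torch.autograd import Variable

    x = Variable(torch.randn(1, 512, 38, 38))                   # illustrative feature-map shape
    norm = Variable(torch.randn(1, 512, 38, 38).abs() + 1e-10)  # stand-in for the channel-wise norms

    x /= norm.expand_as(x)        # original in-place line in modules/l2norm.py
    x = x.div(norm.expand_as(x))  # out-of-place workaround tried above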

@amdegroot
Owner

Yes, I've successfully trained several models. For some reason I cannot reproduce this error on my machine. Did you make sure your repo is up to date with the current master branch?

@amdegroot
Owner

I am using Python 3 and have not tested it with 2.7, so that is the only thing I can think of at the moment if your local repo is up to date. I will add the lack of 2.7 support to the README if that turns out to be the issue.

@pillar02
Author

I didn't build from source but installed PyTorch from pip. I have also made some changes to adapt your code to Python 2.7 (e.g., the star expressions).

I checked the latest master branch and found that https://github.com/pytorch/pytorch/blob/master/torch/autograd/variable.py#L317-L320 still only supports scalar division. In the case of your line "x /= norm.expand_as(x)", it is clearly an element-wise division. But I don't understand how the Python version could affect this.

@pillar02
Author

BTW, could you please give me a rough time estimate for running one epoch (along with your machine specs)?

@amdegroot
Owner

amdegroot commented Mar 29, 2017

Yeah, I agree; if that's the case, I don't understand how it is working on my machine either. I'll look into it more after my classes today; sorry I don't have an answer right this second. As for the time estimate: it takes ~1.4 seconds to run a batch of size 32 forward and backward, but I'm not in my lab right now, so I can't recall the exact time per epoch. I'll get back to you on all of this right after class.

@amdegroot
Owner

And that's on a single Tesla K80 ^

@amdegroot
Owner

I think if you update to the latest version of PyTorch, you will see that element-wise division with .div_() is supported. I do remember that it was originally not supported, but they added it not too long ago. When I run something as simple as:

    x = torch.Tensor([1, 2, 3, 4, 5, 6])
    y = torch.Tensor([2, 2, 2, 2, 2, 2])
    x /= y

the correct result is returned. With a batch size of 32, on 1 Tesla K80, it takes me ~ 109 sec. per epoch.

@pillar02
Author

As I mentioned in the previous post, the latest GitHub PyTorch source code (master branch) still shows:

def div_(self, other):
    if not isinstance(other, Variable) and not torch.is_tensor(other):
        return DivConstant(other, inplace=True)(self)
    raise RuntimeError("div_ only supports scalar multiplication")

I still don't understand how it works in your case, but I will try to update my PyTorch to the latest version.
Thanks a lot.

@amdegroot
Owner

Yeah, I apologize for the lack of a better answer, but since I cannot reproduce this I am closing the issue for now. Let me know if updating PyTorch fixes it; I will try to dig up more info myself in the meantime.

@amdegroot
Owner

Ah, figured it out. That line in the source code is referring to Variables, so it is just saying Variables cannot be divided by Tensors, but Variables can be divided by other Variables of the same size (which is the case here) and Tensors can be divided by other Tensors of the same size.

torch/csrc/generic/methods/TensorMath.cwrap line 1038 looks like the place that bridges the Python and C sides for the tensor div_ definition, and it is what torch/tensor.py line 378 ("return self.div_(other)") ends up calling, even though self.div_ doesn't appear to be defined in the Python source.

So again, not sure what the exact source of the problem is in your case, but my best bet is your version of PyTorch. Hopefully that helps.
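
For what it's worth, a minimal sketch of that tensor-level path (hypothetical values); plain tensors of the same size divide element-wise in place via div_:

    import torch

    a = torch.Tensor([1., 2., 3., 4.])
    b = torch.Tensor([2., 2., 2., 2.])
    a.div_(b)   # element-wise in-place division: a is now [0.5, 1.0, 1.5, 2.0]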

@amdegroot
Owner

amdegroot commented Mar 30, 2017

Also, an update on training time: it takes approx. 37.5 sec. per epoch with a GTX 1060 and batch size of 16, which is what I am currently using (ran out of money for the K80 EC2 instance :P).

@pillar02
Author

Thanks a lot. I will definitely update my PyTorch.

Regarding the training time: it only takes 37.5 sec for one epoch? (I suppose you were training on VOC2007 with about 10,000 images, right?) I have tried training an MXNet SSD implementation, which takes about 270 sec per epoch using both the VOC2007 and VOC2012 data on my Titan X GPU. Does this mean this PyTorch SSD is even faster than the MXNet implementation? That doesn't seem right.

@amdegroot
Owner

amdegroot commented Mar 30, 2017

Yeah, that's my bad. Disregard that number, it's late here. Training on just the training set (2,501 images) from VOC07, it takes on average ~140 sec. per epoch on a single GTX 1060... so yeah, the previous number was off by a lot. I would be curious to see how it compares on a Titan X though.

@pillar02
Author

One more question ;-)

I am wondering how you got the fc-reduced VGG-16 weights.

@amdegroot
Owner

Hahah of course... I converted them to Chainer and then from Chainer to PyTorch. I also was able to convert them to Torch and then from Torch to PyTorch, but the specific weight file I supply was one that took the Chainer route.

@pillar02
Author

pillar02 commented Apr 6, 2017

Hi, I just updated PyTorch to the latest version (0.1.11_5) and ran train.py.
Luckily, I didn't get the div_ error.
But this time I got "RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/T1" at "conf = labels[best_truth_idx] + 1" in box_utils.py.
Any idea about this? It seems like something to do with tensor addition.

@amdegroot
Owner

Is this on the first feed forward, or were you able to get through some iterations? The only time that line has ever been an issue was a while back when I had an explicit 'background' label in the VOC labelmap, and it just became an index-out-of-range issue for softmax. But I'm currently training as I type this and can't think of what could be causing that. Have you pulled the most recent update of master? Or maybe you're on a different branch?
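
For illustration, a minimal sketch of that failure mode (hypothetical values, written against a current PyTorch API); an out-of-range class index in the classification loss is what surfaces as the device-side assert on the GPU:

    import torch
    import torch.nn.functional as F

    num_classes = 21                          # VOC: 20 object classes + background
    logits = torch.randn(4, num_classes)
    labels = torch.tensor([20, 5, 3, 1])      # labelmap that already contains 'background'
    targets = labels + 1                      # the +1 pushes 20 -> 21, outside [0, 20]

    loss = F.cross_entropy(logits, targets)   # raises; on the GPU this shows up as error 59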

@meetps

meetps commented Jul 4, 2017

I faced this issue as well, with PyTorch version 0.1.12_4, which is very recent.

I fixed it by changing the forward() function in L2Norm.py as follows:

def forward(self, x):
    # per-location L2 norm across the channel dim, with eps for numerical stability
    norm = x.pow(2).sum(1).sqrt() + self.eps
    norm_stretch = norm.expand_as(x)
    # out-of-place division instead of the in-place x /= norm.expand_as(x)
    x = x / norm_stretch
    # learnable per-channel scale, reshaped to (1, C, 1, 1) and broadcast over the map
    out = self.weight.unsqueeze(0).unsqueeze(2).unsqueeze(3).expand_as(x) * x
    return out

I am now facing an issue in box_utils.py:

THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu line=226 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "train_cars.py", line 232, in <module>
    train()
  File "train_cars.py", line 184, in train
    loss_l, loss_c = criterion(out, targets)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mshah/code/ssd.pytorch/layers/modules/multibox_loss.py", line 70, in forward
    match(self.threshold,truths,defaults,self.variance,labels,loc_t,conf_t,idx)
  File "/home/mshah/code/ssd.pytorch/layers/box_utils.py", line 107, in match
    loc = encode(matches, priors, variances)
  File "/home/mshah/code/ssd.pytorch/layers/box_utils.py", line 133, in encode
    return torch.cat([g_cxcy, g_wh], 1)  # [num_priors,4]
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:226

@superhans

superhans commented Jul 7, 2017

Edit: I believe there are basic Python 2.7 vs. Python 3 compatibility issues causing the problem, since this code was written for Python 3 and not Python 2.7.

Adding the line from __future__ import division to box_utils.py, prior_box.py, and detection.py gets rid of the above error, and some other errors as well.
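
A minimal illustration of the underlying pitfall, assuming Python 2: without the future import, / between two ints floors the result, which can silently corrupt the box arithmetic:

    from __future__ import division

    print(1 / 2)    # 0.5 with the import (plain Python 2 would print 0)
    print(1 // 2)   # 0, explicit floor division in both Python 2 and 3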

@acrosson

acrosson commented Sep 9, 2017

Great suggestion @superhans. Adding from __future__ import division to most of the files gets rid of the nan/inf values in the loss for Python 2.7.
