
index out of bounds #2

Closed
lu-ming-lei opened this issue Nov 20, 2019 · 12 comments

@lu-ming-lei

I got an "index out of bounds" error when I ran the code.

@lu-ming-lei
Author

The error looks like this:
/opt/conda/conda-bld/pytorch_1570910687650/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda ->auto::operator()(int)->auto: block: [88,0,0], thread: [105,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

@PointsCoder

I also ran into this error:

/opt/conda/conda-bld/pytorch_1549630534704/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [134,0,0], thread: [127,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

Traceback (most recent call last):
H = torch.matmul(src_centered, src_corr_centered.transpose(2, 1).contiguous()).cpu()
RuntimeError: CUDA error: device-side assert triggered

@cvchanghao

> I also ran into this error: [...] RuntimeError: CUDA error: device-side assert triggered

Hi, did you solve this problem?

@PointsCoder

The 'index out of bounds' error happens because torch.topk sometimes misbehaves on ill-conditioned inputs. The error actually occurs in the function get_graph_feature:
feature = x.view(batch_size * num_points, -1)[idx, :]
The variable idx, computed in the function knn:
idx = distance.topk(k=k, dim=-1)[1]
sometimes contains out-of-bound values. To verify this, you can add an assertion or raise an exception after that line.

Replacing topk with sort seems to be stable.
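
A minimal sketch of that check, assuming idx holds the per-cloud neighbor indices from knn, which must lie in [0, num_points) before any batch offset is added (num_points here is x.size(2)):

idx = distance.topk(k=k, dim=-1)[1]  # (batch_size, num_points, k)

# Sanity check: every index must address a valid point.
if idx.min() < 0 or idx.max() >= num_points:
    raise RuntimeError(
        f"knn produced out-of-bound indices: min={idx.min().item()}, "
        f"max={idx.max().item()}, num_points={num_points}")

# More stable alternative: full sort, then keep the first k columns.
idx = distance.sort(dim=-1, descending=True)[1][:, :, :k]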

@WangYueFt
Owner

> The 'index out of bounds' error happens because torch.topk sometimes misbehaves on ill-conditioned inputs. [...] Replacing topk with sort seems to be stable.

That's a great observation! Thanks for the suggestions!

@wangyujiewj

> Replacing topk with sort seems to be stable.

When I replaced topk with sort, I ran into a new problem:

Traceback (most recent call last):
u, s, v = torch.svd(H[i])
RuntimeError: Lapack Error gesdd : 2 superdiagonals failed to converge. at /opt/conda/conda-bld/pytorch_1549635019666/work/aten/src/TH/generic/THTensorLapack.cpp:493

I still don't know how to solve it.

@jl626

jl626 commented Feb 21, 2020

That LAPACK error means that your matrix H is ill-conditioned. A simple fix is to add a scaled identity matrix (e.g. torch.eye(n) * 1e-7) to H before the SVD.
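
A minimal sketch of that fix, assuming H is the batch of covariance matrices from the traceback above (the epsilon value may need tuning):

eps = 1e-7  # small ridge term; increase if gesdd still fails to converge
H_i = H[i] + torch.eye(H[i].shape[-1], device=H.device) * eps
u, s, v = torch.svd(H_i)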

@pebroe

pebroe commented Feb 26, 2020

After replacing topk with sort, I now get this error:

/opt/conda/conda-bld/pytorch_1549630534704/work/aten/src/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 3]: block: [0,0,0], thread: [479,0,0] Assertion indexValue >= 0 && indexValue < src.sizes[dim] failed.

The traceback points to:

/PRNet/model.py", line 453, in forward H = torch.matmul(src_centered, src_corr_centered.transpose(2, 1).contiguous()).cpu()

@PointsCoder

> After replacing topk with sort, I now get this error: [...] Assertion indexValue >= 0 && indexValue < src.sizes[dim] failed.

The traceback of a CUDA error can sometimes be inaccurate. A general approach to debugging this type of error is to first set CUDA_LAUNCH_BLOCKING=1 when running your program; this forces kernels to launch synchronously, so you get a more accurate traceback location. Then you can raise an exception at that location to catch the invalid index and figure out what happened during training. Hope this helps.
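
For example, the variable can be set from the shell (CUDA_LAUNCH_BLOCKING=1 python main.py, where main.py stands in for your entry script) or in Python; a minimal sketch:

import os
# Must be set before the first CUDA call, i.e. before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
# Kernel launches are now synchronous, so the Python traceback points at
# the op that actually triggered the device-side assert.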

@ShengyuH

@lu-ming-lei Hi, I think it's due to your torch version; some torch versions seem to have unstable torch.svd() behavior, especially when the covariance matrix is ill-conditioned. You can install pytorch 1.0.1; this works for me.

@WangYueFt
Owner

> You can install pytorch 1.0.1; this works for me.

This is to confirm that pytorch 1.0.1 works for me, so I'll close this issue. Feel free to reopen it if it doesn't work for you.

@yangninghua

@wangyujiewj
Did you implement the "sort" replacement this way?

def knn(x, k):
    # x: (batch_size, dims, num_points)
    inner = -2 * torch.matmul(x.transpose(2, 1).contiguous(), x)
    xx = torch.sum(x ** 2, dim=1, keepdim=True)
    # negated squared pairwise distances, so larger = closer
    distance = -xx - inner - xx.transpose(2, 1).contiguous()

    # idx = distance.topk(k=k, dim=-1)[1]  # (batch_size, num_points, k)

    # full sort instead of topk, then keep the first k columns
    # (was hardcoded as [:, :, :20]; use the k parameter instead)
    d_sorted, d_index = torch.sort(distance, dim=-1, descending=True)
    d_k = d_index.shape[-1]
    d_index_top = d_index[:, :, :min(k, d_k)]
    return d_index_top
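
For reference, a quick shape check of that function (hypothetical sizes: batch_size=2, 3 coordinate dims, 1024 points, k=20):

import torch

x = torch.rand(2, 3, 1024)   # (batch_size, dims, num_points)
idx = knn(x, k=20)
assert idx.shape == (2, 1024, 20)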
