[BugFix][TOPI] Fix the integer overflow problem of the scatter_nd op. #8415

zhuwenxi · 2021-07-07T08:11:30Z

Problem Statement

scatter_nd crashes on the CUDA backend when the input data shape is moderately large, for example the (32, 128, 128, 256) tensor used in the reproduction below.

Code to reproduce

import tvm
import numpy as np
import tvm.relay as relay

dev = tvm.cuda()
target = tvm.target.Target("cuda")

# input data:
data_np = np.zeros((32, 128, 128, 256)).astype(np.float32)
indices_np = np.random.uniform(1,5,(32, 600, 3)).astype(np.int64)
updates_np = np.random.rand(32, 600, 256).astype(np.float32)

# Construct relay input nodes:
data = relay.var("data", shape=data_np.shape, dtype=str(data_np.dtype))
indices = relay.var("indices", shape=indices_np.shape, dtype=str(indices_np.dtype))
updates = relay.var("updates", shape=updates_np.shape, dtype=str(updates_np.dtype))

# Compute indices:
indices_dim = len(indices_np.shape)
axes = list(range(indices_dim))
indices_t = relay.transpose(indices, axes[-1:] + axes[:-1])

# Construct relay scatter_nd op:
out = relay.op.scatter_nd(data, indices_t, updates, "update")
func = relay.Function([data, indices, updates], out)

# Execute scatter_nd:
intrp = relay.create_executor("debug", device=dev, target=target)
op_res = intrp.evaluate(func)(data_np, indices_np, updates_np)
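
For reference, here is a minimal NumPy sketch of what the call above is expected to compute in "update" mode, assuming the documented layout where the leading axis of the transposed indices selects the indexed dimensions of data. The helper name scatter_nd_update_ref is hypothetical and is not part of TVM; it only illustrates the semantics on a tiny tensor.

```python
import numpy as np


def scatter_nd_update_ref(data, indices_t, updates):
    """Hypothetical NumPy reference for scatter_nd in "update" mode.

    indices_t has shape (M, Y...), where M is the number of leading
    dimensions of `data` being indexed. updates has shape (Y..., X...).
    """
    out = data.copy()
    # tuple(indices_t) yields M index arrays of shape (Y...), so NumPy
    # advanced indexing writes one slice of shape (X...) per position.
    out[tuple(indices_t)] = updates
    return out


# Tiny example: index the first two dimensions of a (2, 3, 4) tensor.
data = np.zeros((2, 3, 4), dtype=np.float32)
indices = np.array([[0, 2], [1, 0]], dtype=np.int64)  # two positions, depth 2
indices_t = indices.T                                  # depth axis first
updates = np.arange(8, dtype=np.float32).reshape(2, 4)

out = scatter_nd_update_ref(data, indices_t, updates)
```

The positions are chosen without duplicates; with repeated indices the winner of an "update" is not something this sketch tries to define.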

Error Message

Root Cause

The problem is easier to see in the generated CUDA code. The TIR implementation of scatter_nd causes an int32 overflow when "i" is large, so the if condition always evaluates to true and the kernel performs an invalid memory access.
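
To illustrate the failure mode, here is a minimal Python sketch of how a bounds check can pass incorrectly once the index expression wraps around in 32-bit signed arithmetic. The stride of 256, the guard form `index < limit`, and all function names are assumptions made for illustration; this is not the generated kernel.

```python
def wrap_int32(value):
    """Return the value a 32-bit signed integer would hold after wrap-around."""
    return (value + 2**31) % 2**32 - 2**31


# Element count of the (32, 128, 128, 256) tensor from the reproduction.
num_elements = 32 * 128 * 128 * 256  # 2**27

# Hypothetical index expression: index = i * stride + offset.
stride = 256


def guard_exact(i, offset, limit):
    """Bounds check with unbounded integers (the intended behaviour)."""
    return i * stride + offset < limit


def guard_int32(i, offset, limit):
    """Same check, but the index is computed in wrapping int32 arithmetic."""
    return wrap_int32(i * stride + offset) < limit


i = 2**23                                  # large loop variable
wrapped_index = wrap_int32(i * stride)     # 2**31 wraps to -2**31
exact_result = guard_exact(i, 0, num_elements)
int32_result = guard_int32(i, 0, num_elements)
```

In this sketch the exact check rejects the iteration, while the int32 check accepts it with a negative index, which is the kind of out-of-bounds access described above.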

zhuwenxi · 2021-07-07T08:13:55Z

@tqchen @masahi @icemelon9 Could you help review this fix, or point me to someone who can?

tkonolige · 2021-07-07T17:53:57Z

I've implemented an alternative fix in #8419

zhuwenxi · 2021-07-08T02:49:04Z

> I've implemented an alternative fix in #8419

@tkonolige Thank you for reviewing this PR.

Just curious, is there a specific reason you created another PR? I understand the alternative fix you proposed may be better than the existing one in some ways. But why create a separate PR? We can discuss and improve the code here; I think that's exactly what code review is for, right?

tkonolige · 2021-07-08T17:03:38Z

@zhuwenxi Sorry, I definitely shouldn't have opened a new PR. I just got a little hasty and did the fix myself. Feel free to copy the code from that PR into this one. I'll end up closing the other one.

zhuwenxi · 2021-07-09T04:58:54Z

@tkonolige That's OK, it happens.

mbrookhart · 2021-07-09T16:43:03Z

You hit a flaky test, I found a fix here: #8431

1. The existing scatter_nd CUDA implementation has a very large bound, which can overflow the int32 range when the input tensor shape is large enough.
2. The overflow can cause the if condition to always evaluate to true, which leads to invalid memory accesses.
3. This commit fixes the problem by reducing the bound. The original large bound was not only unnecessary but also degraded performance; with this fix, scatter_nd performance improves by 100x in some cases.
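
As a rough illustration of why a tighter bound helps on both counts, the sketch below compares the largest flat index produced by a loop of the form `i * step + t` under a tight trip count and under an oversized one. The step of 256 and the oversized bound (one iteration per element) are hypothetical values chosen for the example; they are not taken from the TOPI source or from this commit.

```python
INT32_MAX = 2**31 - 1


def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def max_flat_index(trip_count, step):
    """Largest value of i * step + t for 0 <= i < trip_count, 0 <= t < step."""
    return (trip_count - 1) * step + (step - 1)


total = 32 * 128 * 128 * 256   # elements in the reproduction's data tensor
step = 256                     # hypothetical elements covered per iteration

tight_bound = ceil_div(total, step)  # just enough iterations to cover the tensor
loose_bound = total                  # hypothetical oversized bound

tight_max = max_flat_index(tight_bound, step)
loose_max = max_flat_index(loose_bound, step)
extra_work = loose_bound // tight_bound
```

Under these assumptions the tight bound keeps every index inside the tensor and well within int32, while the oversized bound both exceeds INT32_MAX and runs 256 times as many iterations, most of which do no useful work.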

zhuwenxi · 2021-07-13T02:15:12Z

@mbrookhart I've fixed the UT failure.

mbrookhart · 2021-07-13T02:57:39Z

Thanks @zhuwenxi @tkonolige

…apache#8415)

* Fix the integer overflow problem of the scatter_nd op.
* Fix scatter_nd's crash problem:
  1. The existing scatter_nd CUDA implementation has a very large bound, which can overflow the int32 range when the input tensor shape is large enough.
  2. The overflow can cause the if condition to always evaluate to true, which leads to invalid memory accesses.
  3. This commit fixes the problem by reducing the bound. The original large bound was not only unnecessary but also degraded performance; with this fix, scatter_nd performance improves by 100x in some cases.

Co-authored-by: wenxizhu <wenxizhu@tencent.com>

zhuwenxi mentioned this pull request Jul 9, 2021

[FIX] Correct initial data copy of scatter_nd #8419

Closed

mbrookhart approved these changes Jul 9, 2021

wenxizhu added 2 commits July 12, 2021 11:26

Fix the integer overflow problem of the scatter_nd op.

6487e4b

zhuwenxi force-pushed the bugfix/wenxizhu/fix-scatter-integer-overflow branch from 842243b to cd30a24 Compare July 12, 2021 03:39

mbrookhart merged commit d043cb9 into apache:main Jul 13, 2021
