[TVM][Bugfix] fix storage_rewrite bug when input is big #2580

Merged
merged 6 commits into from Feb 14, 2019

zhiics
Member

@zhiics zhiics commented Feb 10, 2019

This PR fixes a bug that occurs when the input size is large. For example, the size of a tensor in bits may exceed the int32 maximum, which causes a core dump at runtime. This bug was identified with help from @yzhliu.

@yzhliu @tqchen @wweic please review
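
To make the overflow described above concrete, here is a minimal arithmetic sketch. The 16384 x 16384 shape matches the extent mentioned later in this thread; the 32-bit element width is an assumption for illustration, and `tensor_bits` and `wrap_int32` are hypothetical helpers, not TVM functions.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Truncate to 32 bits and reinterpret as a signed two's-complement int."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def tensor_bits(shape, bits_per_element: int) -> int:
    """Exact size of a tensor in bits (Python ints do not overflow)."""
    total = bits_per_element
    for extent in shape:
        total *= extent
    return total


# Assumed example: 16384 x 16384 elements, 32 bits per element.
bits = tensor_bits((16384, 16384), 32)
overflows = bits > INT32_MAX   # the exact size does not fit in int32
wrapped = wrap_int32(bits)     # what a 32-bit counter would hold instead
```

Under these assumptions the exact size is 2^33 bits, which a 32-bit counter cannot represent.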

@wweic
Contributor

wweic commented Feb 10, 2019

vta/tests/python/unittest/test_vta_insn.py is failing at line https://github.com/dmlc/tvm/blob/919bea8c79de5de9996cb4714fdb92b2149a023b/vta/python/vta/ir_pass.py#L716. This is because your change wraps the index expression inside a Cast expression like
(int64((((i0*64) + (i1*16)) + i3)) + (int64(cthread.s)*(int64)128)).

https://github.com/dmlc/tvm/blob/919bea8c79de5de9996cb4714fdb92b2149a023b/src/arithmetic/detect_linear_equation.cc#L29 needs to support the Cast node.
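
The idea of a linear-equation detector that looks through Cast nodes can be sketched with a toy expression tree. This is not TVM's implementation: the `Var`, `Const`, `Add`, `Mul`, and `Cast` classes and `detect_linear` are hypothetical stand-ins, and the sketch assumes the cast is a widening one that does not change the value.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Add:
    a: object
    b: object


@dataclass(frozen=True)
class Mul:
    a: object
    b: object


@dataclass(frozen=True)
class Cast:
    dtype: str
    value: object


def detect_linear(expr) -> Optional[Dict[str, int]]:
    """Return {var_name: coeff} (constant term under key "") or None."""
    if isinstance(expr, Cast):
        # Assumption: a widening cast keeps the linear structure intact.
        return detect_linear(expr.value)
    if isinstance(expr, Const):
        return {"": expr.value}
    if isinstance(expr, Var):
        return {expr.name: 1}
    if isinstance(expr, (Add, Mul)):
        lhs, rhs = detect_linear(expr.a), detect_linear(expr.b)
        if lhs is None or rhs is None:
            return None
        if isinstance(expr, Add):
            out = dict(lhs)
            for key, coeff in rhs.items():
                out[key] = out.get(key, 0) + coeff
            return out
        # Mul stays linear only if one side is a pure constant.
        if set(lhs) <= {""}:
            scale, terms = lhs.get("", 0), rhs
        elif set(rhs) <= {""}:
            scale, terms = rhs.get("", 0), lhs
        else:
            return None
        return {key: coeff * scale for key, coeff in terms.items()}
    return None


# The expression quoted above, rebuilt with the toy node classes.
inner = Add(Add(Mul(Var("i0"), Const(64)), Mul(Var("i1"), Const(16))), Var("i3"))
expr = Add(Cast("int64", inner),
           Mul(Cast("int64", Var("cthread.s")), Cast("int64", Const(128))))
coeffs = detect_linear(expr)
```

Without the `Cast` branch, the detector would fall through to `return None` for this expression, which mirrors the failure reported here.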

@zhiics
Member Author

zhiics commented Feb 12, 2019

@tqchen Could you please take a look?
Should we also do ComputeReduce for e->allocs[0]->extents at line 554? The expression in the test is evaluated to 16384 * 16384 = 268435456 in the CPU test in CI, but is kept as 16384 * 16384 in the ci_i386 test. Any advice? Thanks.

@tqchen
Member

tqchen commented Feb 12, 2019

Let us make the behavior so that for now, we can keep arithmetic in int32 if there is no overflow, and use int64 only when there is a chance of overflow
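
A minimal sketch of this policy, assuming all extents are non-negative constants known ahead of time (symbolic extents are not handled here). `choose_size_dtype` is a hypothetical helper for illustration, not a function from the PR.

```python
INT32_MAX = 2**31 - 1


def choose_size_dtype(extents, bits_per_element: int) -> str:
    """Pick int32 when the total bit count fits, int64 otherwise."""
    total = bits_per_element
    for extent in extents:
        total *= extent
        if total > INT32_MAX:
            # Widen only when the running product can no longer fit in int32.
            return "int64"
    return "int32"


small = choose_size_dtype((64, 64), 32)        # 131072 bits
large = choose_size_dtype((16384, 16384), 32)  # 2**33 bits
```

The trade-off raised later in the thread applies to this sketch as well: the result type now depends on the input size, so downstream code has to cope with both types.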

@zhiics
Member Author

zhiics commented Feb 12, 2019

@tqchen Thanks. I will add it and update the PR.

@tqchen
Member

tqchen commented Feb 13, 2019

Contributor

@wweic wweic left a comment

lgtm.

@Anthony-Mai
Contributor

I have reservations about this code change. It seems to me that it adds too much risk and code complexity for something of little practical value, e.g., the number of bits exceeding max int32. You end up with code that uses int32 in one case and int64 in another? Super confusing! If the number of bits exceeds max int32, you likely have other problems to worry about, like being unable to allocate that much memory.

I suggest either leaving this unfixed, or fixing it in a more conservative way, like using unsigned int32 instead of signed, so you can have up to 4 billion bits. Or, if you feel strongly about supporting more than 4 billion bits, use int64 uniformly. Just don't have a mixed case of int32/int64, which is too risky.

@tqchen
Member

tqchen commented Feb 13, 2019

@Anthony-Mai has a point here. Perhaps a simple way forward is to first use a CHECK to make sure we do not go OOM, and use int32 for now. As discussed in #2588, we want to transition to int64 once we have a proper narrowing pass that detects the best data types, and we can move on from there.

@zhiics
Member Author

zhiics commented Feb 13, 2019

@Anthony-Mai Thanks for your comment. I think allocating more than 2G or even 4G of memory on a 64-bit system should be okay. We have a use case that requires a large amount of memory. We are also probably going to test very large input images on some networks, where uint32 might not be enough.

I am not sure we need to use int64 everywhere else, because VTA at least seems to use only int32, so it would result in many casts. In addition, as per @yzhliu's comment, we are warning the users here. I understand that using one consistent type is ideal, but casting as needed in this case is more like a small optimization. Anyway, I am happy to hear other voices. @tqchen @yzhliu @wweic

@zhiics
Member Author

zhiics commented Feb 13, 2019

@tqchen Sorry, I didn't see your comment above. It seems we posted at around the same time... Please take another look. But it seems we need more than max int32 to support the use case.

@tqchen
Member

tqchen commented Feb 13, 2019

I think we all agree on fixing the problem. The question is should we fix it now or wait and come back to fix it after we transition most defaults from int32->int64 and apply optimal narrowing.

@yzhliu
Member

yzhliu commented Feb 13, 2019

Does uint32 work? If so, this can be a valid fix that satisfies the special use case @zhiics and the team have met so far, and avoids the type mismatch.

@zhiics
Member Author

zhiics commented Feb 13, 2019

@yzhliu uint32 should be enough for that special case. NVM, for even larger tests, I think we might be able to wait till everything is fixed. Let me change it to uint32.

@tqchen
Member

tqchen commented Feb 13, 2019

I would advise against uint32. int64 is a better choice, mainly because the additional 1 bit won't really bring much benefit.
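
To put numbers on the "additional 1 bit", the sketch below compares the largest buffer each type can describe, assuming the counter holds a size in bits (as the PR description states). `LIMITS` and `max_bytes` are hypothetical names used only for this illustration.

```python
# Largest value representable by each candidate counter type.
LIMITS = {
    "int32": 2**31 - 1,
    "uint32": 2**32 - 1,
    "int64": 2**63 - 1,
}


def max_bytes(dtype: str) -> int:
    """Largest whole number of bytes whose bit count fits in dtype."""
    return LIMITS[dtype] // 8


MIB = 2**20
int32_mib = max_bytes("int32") / MIB    # just under 256 MiB
uint32_mib = max_bytes("uint32") / MIB  # just under 512 MiB
int64_bytes = max_bytes("int64")        # just under 2**60 bytes
```

Under that assumption, uint32 only doubles the ceiling from about 256 MiB to about 512 MiB, while int64 removes it for practical purposes.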

@zhiics
Member Author

zhiics commented Feb 13, 2019

To be honest, I personally also think int64 is better. uint32 doesn't change anything in this context because they are still different types.

@zhiics
Member Author

zhiics commented Feb 13, 2019

okay, if nobody disagrees, I will change it back to int64.

@yzhliu
Member

yzhliu commented Feb 13, 2019

oh I thought with uint32 we can avoid mismatch. If not I'm good with the i64 fix. Let's also track it in #2588

@yzhliu yzhliu merged commit 326fff5 into apache:master Feb 14, 2019
@yzhliu
Member

yzhliu commented Feb 14, 2019

Thanks @zhiics @wweic @tqchen @Anthony-Mai Let’s merge it for now, and come back when i32->i64 transition has been finished.

@zhiics zhiics deleted the fix_storage_rewrite branch February 14, 2019 17:08
libing4752 pushed a commit to libing4752/tvm that referenced this pull request Feb 18, 2019
* fix storage_rewrite bug when input is big

* cast when necessary

* simplification

* simplification

* int64->uint32

* revert uint32->int64
wweic pushed a commit to neo-ai/tvm that referenced this pull request Feb 20, 2019
* fix storage_rewrite bug when input is big

* cast when necessary

* simplification

* simplification

* int64->uint32

* revert uint32->int64
@yzhliu yzhliu mentioned this pull request Mar 2, 2019
28 tasks
5 participants