[Fix] fix multibox_transform race condition #7188
hsuanguo wants to merge 3 commits into apache:main from hsuanguo:bug/initial-invalid-boxes
Laurawly left a comment
Sorry for the late reply. It seems that this line does not actually fix the race issue:
tvm/python/tvm/topi/cuda/ssd/multibox.py
Line 234 in 5bd2967
    variances[2],
    variances[3],
)
out_base_idx = i * num_anchors * 6 + tid * 6
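For reference, the flat offset on the quoted line can be checked with a small NumPy sketch. It assumes a contiguous, row-major output of shape (batch, num_anchors, 6); the helper name `out_base_index` is hypothetical and not part of TVM.

```python
import numpy as np

def out_base_index(i, tid, num_anchors, box_width=6):
    """Flat offset of anchor `tid` in batch `i` (hypothetical helper)."""
    return i * num_anchors * box_width + tid * box_width

batch, num_anchors = 2, 4
# Assumed layout: (batch, num_anchors, 6), stored contiguously in row-major order.
out = np.arange(batch * num_anchors * 6, dtype="float32").reshape(batch, num_anchors, 6)
flat = out.ravel()

base = out_base_index(1, 2, num_anchors)
row = flat[base:base + 6]  # the six values belonging to out[1, 2]
```

Under that assumption each `tid` owns its own six-element slot, so writes at this offset do not overlap between threads, but the output is also not compacted the way a sequential counter would compact it.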
To be consistent with the CPU implementation:
tvm/python/tvm/topi/vision/ssd/multibox.py
Line 229 in e561007
When cls_id is not larger than 0, we don't put its value in out_loc. Could you elaborate on how you dealt with that case?
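The behavior described here can be sketched in plain NumPy. This is an illustration only, not the TVM implementation: the function name `compact_valid_boxes` and the input layout (per-anchor class id, score, and four box coordinates) are assumptions, and the class id is stored unchanged for simplicity.

```python
import numpy as np

def compact_valid_boxes(cls_id, score, boxes):
    """Sequentially pack anchors with cls_id > 0 to the front of the output.

    cls_id: (num_anchors,) int array, 0 meaning background (assumed layout)
    score:  (num_anchors,) float array
    boxes:  (num_anchors, 4) float array
    """
    num_anchors = cls_id.shape[0]
    # Every slot starts out as an invalid box (-1 in all six fields).
    out_loc = np.full((num_anchors, 6), -1.0, dtype="float32")
    valid_count = 0
    for j in range(num_anchors):
        if cls_id[j] > 0:
            # Only foreground anchors are written; background is skipped.
            out_loc[valid_count, 0] = cls_id[j]
            out_loc[valid_count, 1] = score[j]
            out_loc[valid_count, 2:] = boxes[j]
            valid_count += 1
    return out_loc, valid_count

cls_id = np.array([0, 2, 0, 1])
score = np.array([0.9, 0.8, 0.7, 0.6], dtype="float32")
boxes = np.arange(16, dtype="float32").reshape(4, 4)
out_loc, valid_count = compact_valid_boxes(cls_id, score, boxes)
```

The shared `valid_count` counter is what makes this loop inherently sequential: a parallel version in which every thread increments the same counter would need atomics or a prefix sum to avoid a race.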
You are right, this doesn't actually solve the race condition. It only reduced the chance of hitting it in my test environment, where I can no longer reproduce the issue I had.
Sorry for the inaccuracy here. This PR was initially only for the initialization part.
To be consistent with the CPU implementation:
tvm/python/tvm/topi/vision/ssd/multibox.py
Line 229 in e561007
When cls_id is not larger than 0, we don't put its value in out_loc. Could you elaborate on how you dealt with that case?
The implementation is not exactly the same as the CPU version: the order of the output will differ, but the boxes themselves would be the same. As for removing the background class, that was my mistake; I missed it.
I think it might be better to close this PR and come up with a better solution, or to keep only the initialization part if that makes sense to you (reverting to the initial title). @Laurawly, what do you think? In any case, I think the op needs some rework to be more robust.
@hsuanguo I agree the PR needs more work. A new PR with the issues fixed would be great.
valid_count[i] += 1

for j in range(valid_count[i], num_anchors):
    out_loc[i, j, 0] = -1.0
For the initialization, out_loc[i, j, 1~5] are also not initialized, right? cc @kevinthesun
The class id is the most important part. Usually, if we set it to -1 (background), the values in out_loc[i, j, 1~5] will be ignored, so as a minimal fix this should be enough.
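The argument can be illustrated with a small sketch of a consumer that keys only on the class id. The function `keep_valid` is hypothetical; whether every downstream op in TVM filters this way is an assumption here, not something this thread confirms.

```python
import numpy as np

def keep_valid(out_loc):
    """Return rows whose class id (column 0) is non-negative (hypothetical consumer)."""
    return out_loc[out_loc[:, 0] >= 0]

# Two real boxes followed by two padded rows; columns 1..5 of the padded
# rows hold arbitrary leftover values to mimic uninitialized memory.
out_loc = np.array(
    [
        [1.0, 0.9, 0.1, 0.1, 0.5, 0.5],
        [0.0, 0.8, 0.2, 0.2, 0.6, 0.6],
        [-1.0, 123.0, 7.0, 7.0, 7.0, 7.0],
        [-1.0, -42.0, 3.0, 3.0, 3.0, 3.0],
    ],
    dtype="float32",
)
valid = keep_valid(out_loc)
```

A consumer that reads the remaining columns without first checking the class id would still see the leftover values, which is the concern raised about out_loc[i, j, 1~5].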
Closing for now; I will check if there is a better solution.
This fix was proposed by @alec-anyvision, and this PR should solve the race condition mentioned in #6631.
On rare occasions I have seen the outputs differ from run to run. This is unlikely to be caught because the memory is usually initialized to 0, and such zero-filled entries are invalid boxes.
I could usually reproduce it when multibox_transform_loc is called after some unrelated code or tests that could have used the same memory.
cc @Laurawly @zhiics
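The failure mode described above can be mimicked on the CPU with NumPy. This is only a simulation under stated assumptions: the buffer is filled with explicit leftover values to stand in for reused device memory, and `write_valid_rows` is a hypothetical helper, not the TVM op.

```python
import numpy as np

def write_valid_rows(buf, rows, pad_class_id=False):
    """Write `rows` to the front of `buf`; optionally mark the rest invalid."""
    n = len(rows)
    buf[:n] = rows
    if pad_class_id:
        buf[n:, 0] = -1.0  # mark unused slots as invalid boxes
    return buf

rows = np.array([[1.0, 0.9, 0.1, 0.1, 0.5, 0.5]], dtype="float32")

# Fresh memory that happens to be zero, versus memory left over from earlier work.
zeroed = np.zeros((3, 6), dtype="float32")
stale = np.full((3, 6), 5.0, dtype="float32")

a = write_valid_rows(zeroed.copy(), rows)                     # looks fine by luck
b = write_valid_rows(stale.copy(), rows)                      # unused slots keep stale data
c = write_valid_rows(stale.copy(), rows, pad_class_id=True)   # unused slots flagged as -1
```

In this sketch `a` and `b` agree on the valid row but differ in the unused slots, while `c` marks those slots explicitly so the result no longer depends on what the buffer held before.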