Skip to content

[Fix] fix multibox_transform race condition#7188

Closed
hsuanguo wants to merge 3 commits intoapache:mainfrom
hsuanguo:bug/initial-invalid-boxes
Closed

[Fix] fix multibox_transform race condition#7188
hsuanguo wants to merge 3 commits intoapache:mainfrom
hsuanguo:bug/initial-invalid-boxes

Conversation

@hsuanguo
Copy link
Contributor

@hsuanguo hsuanguo commented Jan 1, 2021

This fix is proposed by @alec-anyvision, and this PR should solve the race condition mentioned here #6631.

In a very rare condition I have seen a couple of times that the outputs were inconsistent from time to time, this was unlikely to be caught as the memory is usually initialized as 0 and those are invalid boxes.

Usually I could reproduce it when multibox_transform_loc is called after some unrelated code/tests which could potentially used the same memory.

cc @Laurawly @zhiics

@hsuanguo hsuanguo changed the title [Fix] Initializes all invalid box classes [Fix] fix multibox_transform race condition Jan 4, 2021
@tqchen
Copy link
Member

tqchen commented Jan 13, 2021

@Laurawly @zhiics it would be great if you can followup on this thread

Copy link
Contributor

@Laurawly Laurawly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply. It seems that we didn't actually fix the race issue this line

temp_valid_count[tid * num_anchors + k] += temp_valid_count[
right?

variances[2],
variances[3],
)
out_base_idx = i * num_anchors * 6 + tid * 6
Copy link
Contributor

@Laurawly Laurawly Jan 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with the CPU implementation:


When cls_id is not larger than 0, we don't put its value in out_loc, could you elaborate on how did you deal with that case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it doesn't actually solve the race condition with this, it just however reduced the chance on my test env as I can no longer re-produce the issue I had with this.
Sorry for the inaccuracy here. This PR was initially only for the initialization part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with the CPU implementation:

When cls_id is not larger than 0, we don't put its value in out_loc, could you elaborate on how did you deal with that case?

The implementation is not exact the same as the CPU version, the order of the output will not be the same as the CPU version, but the boxes would be, regarding removing the background that was my mistake, I missed it.

I think it might be better to close this PR and come with a better solution, or keep the initialization part if it makes sense to you(revert to the initial title)? @Laurawly what do you think? In any case I think the op needs some rework to be more robust.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hsuanguo I agree the pr needs more work. A new PR with the issues fixed would be great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hsaputra What do you think?

valid_count[i] += 1

for j in range(valid_count[i], num_anchors):
out_loc[i, j, 0] = -1.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the initialization, out_loc[i, j, 1~5] are also not initialized right? cc @kevinthesun

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class id is the most important part, usually if we set it to -1 as background, values on out_loc[i, j, 1~5] then will be ignored, so for a minimum change to fix, this should be enough.

@hsuanguo
Copy link
Contributor Author

hsuanguo commented Feb 9, 2021

Close for now and will check if there is a better solution

@hsuanguo hsuanguo closed this Feb 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants