Fix refresh manager race#10727
Conversation
…time Signed-off-by: Henry Yang <hyang@lyft.com>
Signed-off-by: Henry Yang <hyang@lyft.com>
mattklein123
left a comment
There was a problem hiding this comment.
Thanks for looking into this and fixing. This is quite hard to reason about. I'm wondering if we should just use a lock here? Would this hit for every request? Or just during redirection? If the latter maybe it's OK?
If we do keep this approach, I think we should add some specific tests using https://github.com/envoyproxy/envoy/blob/master/source/common/common/thread_synchronizer.h. See other examples that use CAS loops that use this for testing.
Alternatively, if we believe that this eventual consistency is OK, is there a way to fix the tests so they don't flake? Test only change?
/wait
| } | ||
| }); | ||
| return true; | ||
| } else if (info->last_callback_time_ms_.load() != last_callback_time_ms) { |
There was a problem hiding this comment.
If we stick with this approach I think it would be more clear if this was a basic else statement paired with the CAS, inside a large if statement with the counter increment.
There was a problem hiding this comment.
I think if we have the if counter set to 0 is in a if statement, it would also race. because if thread 1 set the counter to 0, the incr in thread 2 might not trigger the threshold. Let me rewrite this logic to make it clearer.
| // If someone else updated the last callback time, then they will trigger the callback. | ||
| // During this time we don't want to continue to increment the count, so we enforce the | ||
| // count is 0 | ||
| *count = 0; |
There was a problem hiding this comment.
Based on the comment above, should this be a decrement to undo the increment? Or is that not safe since it might also race?
There was a problem hiding this comment.
decrement can race as well. If the decrement is called after count got set to 0 in another thread, this would set the counter to -1.
Signed-off-by: Henry Yang <hyang@lyft.com>
Signed-off-by: Henry Yang <hyang@lyft.com>
mattklein123
left a comment
There was a problem hiding this comment.
Thanks, few more question/comment.
/wait
| postCallBack = true; | ||
| } | ||
|
|
||
| // If a callback should be triggered(in this or some other threads) signaled by the changed |
|
|
||
| // If a callback should be triggered(in this or some other threads) signaled by the changed | ||
| // last callback time, we reset the count to 0 | ||
| if (postCallBack || info->last_callback_time_ms_.load() != last_callback_time_ms) { |
There was a problem hiding this comment.
I'm still a little confused here by this logic so I think it needs some more comments. What is the exact sequence that leads to post_callback being false but the times being not equal?
There was a problem hiding this comment.
ok, let me add those to the comments
Signed-off-by: Henry Yang <hyang@lyft.com>
mattklein123
left a comment
There was a problem hiding this comment.
Thanks, this makes sense. Great comments!
redis: fix refresh manager race (envoyproxy#10727)
Description: Enforce the count is 0 if one of the other threads have reset the callback time.
Risk Level: Low
Testing:
Docs Changes:
Release Notes:
Fixes #10384