-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Fix a potential race in iterating counters #28112
Conversation
@sywhang - its best if the text here describes what the user needs to do to hit the issue and some estimate of how likely the situation is to occur. The resulting behavior when the bug is triggered is important as well (crashing as you mentioned). The total impact is then a combination of both: likelihood_to_hit * effect_when_hit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
cc @davmason for another pair of eyes (despite being simple I was also the one who missed catching it during the port so I'd appreciate getting a fresh eyes on it) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved. Can you elaborate on the customer impact and how often it will occur? We will consider this for 3.1.
I am affected by this, trying to add Counters to the EventSource dynamically, the process crashes because of this System.InvalidOperationException: Collection was modified; enumeration operation may not execute. Any workaround you suggest until 3.1.11 is out? |
Summary
There was a mistake that was made in backporting dotnet/runtime#40259 to #28089 where
_counters
was used to iterate instead of the snapshotted value ofcounters
._counters
is a list that needs to be lock-protected, but this access was happening outside of the lock, which is what the snapshot is for. This caused a race-condition on the read/write on this list, causing a crash. The problem does not exist in the 5.0 fix, only for the 3.1 backport.Customer Impact
Medium/High. When hit with this issue, a customer may experience a crash. A customer may run into this issue when they turn on the runtime counters (System.Runtime) or any other set of counter providers, and then modify the list of counters while a counter callback is happening by either creating or disposing instances of EventCounters. For customers that don't consume/create their own EventCounters, this shouldn't be a problem. For those who do, it's pretty easy to hit with the likelihood increasing the shorter they set the counter polling interval.
Regression
Yes. From 3.1.8 -> 3.1.9 servicing release.
Testing
We do not have a local repro of this failure, but we tested the fix by providing a custom build of the runtime with this fix to the internal partner test that initially reported this problem. They ran it in production that hit this issue and reported that this addresses the issue.
Risk
The root cause/fix is well-understood and the fix was verified through partner testing - I would say it is relatively low.