-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Conversation
@adiaaida |
ping |
@adiaaida @brianrob ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay on this. Some comments and questions.
private Hashtable _categoryTable; | ||
private Hashtable _nameTable; | ||
private Hashtable _helpTable; | ||
private readonly object _customCategoryTableLock = new Object(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend only using InternalSyncObject. Introducing a new lock here can cause deadlocks if we don't get it right. I've seen that sort of thing in this area before and we've removed locks for this reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, but InternalSyncObject is static and protect race on PerformanceCounterLib static methods(create/destroy machine/lcid instances), "_customCategoryTableLock", _categoryTableLock, _nameTableLock, _helpTableLock protect race on machineName/lcid PerformanceCounterLib internal instance(GetCounters etc...), is correct mix "static locks" with "instance locks"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing fundamentally wrong with this. For me the question is whether or not the number of instances and actions upon them in a multi-threaded fashion will result in significant performance degradation if we use a single lock. My suspicion is that this won't be an issue because this is all about creation of counters and not use of counters. I would be much more concerned if we did something like this when reading counters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -320,7 +320,9 @@ internal void CloseTables() | |||
_nameTable = null; | |||
_helpTable = null; | |||
_categoryTable = null; | |||
_customCategoryTable = null; | |||
//race with FindCustomCategory | |||
lock (_customCategoryTableLock) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please add enclosing braces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, done
Interlocked.CompareExchange(ref _customCategoryTable, new Hashtable(StringComparer.OrdinalIgnoreCase), null); | ||
} | ||
if (_customCategoryTable == null) | ||
_customCategoryTable = new Hashtable(StringComparer.OrdinalIgnoreCase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please add enclosing braces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, done
DateTime now = DateTime.UtcNow; | ||
while ((DateTime.UtcNow.Subtract(now).TotalSeconds < 10)) | ||
{ | ||
CloseAllLibraries(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like CloseAllTables got removed. Is that intentional? I'd expect that you'd want completely clean state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CloseAllLibraries() call library.Close() that call CloseTables(), CloseAllTables() call CloseTable() for every library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thank you.
*/ | ||
bool isPublished = false; | ||
DateTime now = DateTime.UtcNow; | ||
while ((DateTime.UtcNow.Subtract(now).TotalSeconds < 10)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel reasonably good about adding locks here, but I don't think that we should add a 10 second retry here. If you want to add a couple of retries, that might be OK. I also don't think that I would throw an exception either. The only real thing that can be done in the case of the thrown exception is to wait longer, and so you'd expect the user of performance counters to do that when attempting to get the counter - users already have to do this, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As i have written above the idea here is "wait for publication" not to protect access to shared resources, let me understand why lock is useful?If i understood well the issue is not race, but this is a "random" registry publication delay(read on guide) so after some time(10 seconds or less?or maybe after n retry is better?) if performance counter does not exists something went wrong or someone removed same counter. What i mean is that is not sure "wait longer" is the only solution, do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point - It's just not clear to me what the right number of retries or the correct amount of time is here. That's why I would guess that the desktop implementation does not have this type of logic and instead depends upon the caller to decide what to do (not necessarily better, but more flexible).
@weshaggard, do you know if there is any precedent in managed APIs that wrap Windows OS APIs where delay is an issue (such as this registry publication delay)? Do we generally take the approach that callers need to do retries, or do we have a pattern that we've used before to attempt to solve this problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephentoub could answer this. I thought there are some examples (maybe in IO?) but I can't find one specifically so that may not be the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @danmosemsft. @stephentoub do you have any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, this is not retries, this is 'wait for publication'. Previously the API was asynchronous (you set things in motion, but state update may not have happened before return), and you have made the synchronous (which gets you into the business of deciding how long to wait, and what to do when that fails.
We ARE changing the contract (it can throw an exception now where it definitely did NOT before), so this is a compatibility break (how much do we care?)
This is also a 'compatibility' library (it is not part of the cross platform core), and my expectation is that we are not trying to innovate here (just make existing things work).
This suggests that we should not change things here. What are the disadvantages of simply leaving the old behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vancem i agree with you this is 'wait for publication' and the idea is to mitigate the issue. The scope of this PR is for first resolve some race issues and now seems ok. The second is try to remove [assembly:CollectionBehavior(DisableTestParallelization = true)] to run tests concurrently, but maybe this is not fundamental. @danmosemsft added this attribute after some tests fail due to "wait for publication" side effect and internal cleanup/load cached tables. If we don't want to change too much on this namespace we can keep the attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My recommendation is to pull out the wait loop and do that separately. If the change was only motivated by tests, it is easy enough to simply do that waiting in the test rather than here. Which we choose really depends on whether our customers would be happier with or without it (but we are biased to doing nothing if we don't know). I am suspicious that we don't know, so I would keep the semantics the same (less delta between us an .NET Desktop).
So my recommendation is to pull it (we can put it back if we care in the future...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok @vancem i'll remove 'wait for publication strategy' and fix only race, thank's! @danmosemsft i think we need to keep attribute(or expose IsPublished()
to category and update all tests if it's so important).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MarcoRossignoli!
using Xunit; | ||
|
||
namespace System.Diagnostics.Tests | ||
{ | ||
[SkipOnTargetFramework(TargetFrameworkMonikers.Uap)] // In appcontainer, cannot write to perf counters | ||
public static class PerformanceCounterCategoryTests | ||
{ | ||
#if MyTrait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like MyTrait was left in inadvertently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As i have written above this is test/debug code, this test will be removed, or do you think we need an outerloop test of some seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I do like the idea of your multi-threaded test making making sure that we haven't introduced any deadlocks with this (or future) changes. So, if you were to modify your test just to make sure that it actually completes within a reasonable time period, I think that would be a good addition to the test suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok if the idea of some sort of "wait" will be accepted i'll refactor an [OuterLoop()] test(for instance create/destroy n performance counter concurrently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
If you are successful, we can include a change to reverse this? Then repeatedly run CI to see whether we can still hit a problem. |
@brianrob should be ok now, you can review, thank's! |
@dotnet-bot test NETFX x86 Release Build |
1 similar comment
@dotnet-bot test NETFX x86 Release Build |
@brianrob, is this PR good to merge? |
s_libraryTable = null; | ||
} | ||
} | ||
//race with GetPerformanceCounterLib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this locking actually make a material difference in race conditions with GetPerformanceCounterLib? GetPerformanceCounterLib doesn't take the lock on InternalSyncObj if s_libraryTable is already initialized. And even if it did, it would only take the lock long enough to initialize the table, but it would be handing out a reference to a PerformanceCounterLib that we're closing here, so code could be using a PerformanceCounterLib concurrently with it being Close'd. Is that ok? Or maybe the locking we're adding here is purely about protecting the mutation of s_libraryTable, and we're not concerned with the other issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe the locking we're adding here is purely about protecting the mutation of s_libraryTable, and we're not concerned with the other issues?
Yes @stephentoub testing with parallel Register/Unregister(CloseAllLibraries) and GetCounterXXX i see a lot of new PerformanceCounterLib(machineName, lcidString);
on GetPerformanceCounterLib for s_libraryTable = null;
this lock is only to protect mutation s_libraryTable and avoid a lot of benign instances of PerformanceCounterLib in parallel scenario
but it would be handing out a reference to a PerformanceCounterLib that we're closing here
doesn't seem to be a problem because the "registry tables" will be reloaded, if counter is no more present a not found exception will raise.
Again is a borderline scenario(parallel register/unregister).
@@ -638,83 +642,87 @@ internal bool FindCustomCategory(string category, out PerformanceCounterCategory | |||
RegistryKey baseKey = null; | |||
categoryType = PerformanceCounterCategoryType.Unknown; | |||
|
|||
if (_customCategoryTable == null) | |||
//race with CloseTables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of code to put under a lock; this is not only synchronizing with CloseTables, but with other calls to FindCustomCategory. Might that be a performance/scalability problem?
I would think a better way to deal with this would be to instead just change this method to not refer to _customCategoryTable other than for the initial checking/setting of the field, e.g.
Hashtable table =
_customCategoryTable ??
Interlocked.CompareExchange(ref _customCategoryTable, new Hashtable(StringComparer.OrdinalIgnoreCase), null) ??
_customCategoryTable;
and then use table
for the rest of the method rather than using _customCategoryTable
, only taking a lock (table)
when we actually modify table
, since Hashtable is thread-safe for any number of readers with one writer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might that be a performance/scalability problem?
i followed internal FindCustomCategory() and is used only in init path today, no hot path.
However your solution is better, so i agree with you. Thank's a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@dotnet-bot test OSX x64 Debug Build |
@dotnet-bot test Windows x86 Release Build |
Looks like this one is ready to merge. @stephentoub do you have any other feedback? |
lock (_customCategoryTableLock) | ||
{ | ||
_customCategoryTable = null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this lock still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i understand, i'll remove it. Thanks!
@@ -674,7 +678,10 @@ internal bool FindCustomCategory(string category, out PerformanceCounterCategory | |||
// In this case we return an 'Unknown' category type and 'false' to indicate the category is *not* custom. | |||
// | |||
categoryType = PerformanceCounterCategoryType.Unknown; | |||
_customCategoryTable[category] = categoryType; | |||
lock (_customCategoryTableLock) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shouldn't need a separate _customCategoryTableLock... just lock on table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you'are right. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stephentoub should be ok now
@@ -1237,7 +1244,6 @@ internal static void UnregisterCategory(string categoryName) | |||
{ | |||
RegisterFiles(categoryName, true); | |||
DeleteRegistryEntry(categoryName); | |||
CloseAllTables(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is CloseAllTables no longer needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CloseAllLibraries() call Close() on every PerformanceCounterLib instance that call CloseTables(). Isn't this redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. I'm not familiar with the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brianrob Can you confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm inclined to say that we should leave CloseAllTables in place. I agree that the work is done implicitly by calling CloseAllLibraries, but some of this work is done outside of locks and I'd hate to change the characteristics of a race condition and break user code. Given that this functionality is here for compatibility, I'd leave it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brianrob ok
@dotnet-bot test Linux x64 Release Build please |
@dotnet-bot test OSX x64 Debug Build please |
1 similar comment
@dotnet-bot test OSX x64 Debug Build please |
@brianrob do you have further feedback? I'd like to get this committed to see whether it helps tests stop being flaky. |
@danmosemsft, I just replied back to the last comment. Once the requested changes are made as long as @stephentoub is OK with the changes, I am too. |
@danmosemsft @brianrob @stephentoub restored CloseAllTables(). |
@dotnet-bot test NETFX x86 Release Build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
* sync PerformanceCounterLib * add enclosing braces * address PR feedback * address PR feedback * address PR feedback * nit: trim spaces * address PR feedback * address stephentoub PR feedback * nit: tab * nit: tab * address PR feedback Commit migrated from dotnet/corefx@41c8ba2
Closes #25403
I tried to understand where is the problem and after tests:
the new reloaded CategoryTable sometimets doesn't contains new counter as if all modification to registry are not yet published to all threads
(by design read guide note https://msdn.microsoft.com/en-us/library/cs38wsc4(v=vs.110).aspx "If the list has not been refreshed, the attempt to use the category will fail").
So i added a "retry" strategy to wait new counter publication IsPublished() to mitigate the issue.
(the other table are re-loaded from registry)
race with CloseAllLibraries() on s_libraryTable
My idea is that this namespace wasn't built to handle special "concurrent" scenarios where Create()/Delete()/GetXXXX() are used heavily at same time,
and it make sense, usually i "install a counter" and after i use it, so concurrent modification are rare.
Parallel unit testing is borderline.
Maybe if we want to have a fully concurrent api we've to change more code and have more "coarse" locks also to protect concurrent registry access, or we can
only sync internal structure of PerformanceCounterLib and keep attribute.
I think that today [assembly: CollectionBehavior(DisableTestParallelization = true)] is necessary with this implementation to mitigate the issue.
After these changes sometimes(after heavy loaded minutes with 64 parallel thread in loop Exists/Delete/Create/GetCounter/ReadValue) i get corrupted registry error like(no more null reference for race):
This is WIP code for discussion(tests/debug code).
cc: @joperezr @danmosemsft @adiaaida