Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Sync PerformanceCounterLib #26475

Merged
merged 11 commits into from
Mar 23, 2018
Merged

Sync PerformanceCounterLib #26475

merged 11 commits into from
Mar 23, 2018

Conversation

MarcoRossignoli
Copy link
Member

@MarcoRossignoli MarcoRossignoli commented Jan 20, 2018

Closes #25403

I tried to understand where is the problem and after tests:

  • at the end of RegisterCategory() CloseAllLibraries() clear PerformanceCounterLib instances but if at the same time we call PerformanceCounterLib.GetXXXX()
    the new reloaded CategoryTable sometimets doesn't contains new counter as if all modification to registry are not yet published to all threads
    (by design read guide note https://msdn.microsoft.com/en-us/library/cs38wsc4(v=vs.110).aspx "If the list has not been refreshed, the attempt to use the category will fail").
    So i added a "retry" strategy to wait new counter publication IsPublished() to mitigate the issue.
  • In DeleteCategory() the call PerformanceCounterLib.CloseAllLibraries() is redundant because is called in PerformanceCounterLib.UnregisterCategory().
  • On CloseAllLibraries() i add lock because concurrent GetPerformanceCounterLib() raise NullReferenceException() on s_libraryTable
  • Add lock to protect _customCategoryTable, also on CloseTables() because raise NullReferenceException() on FindCustomCategory()
    (the other table are re-loaded from registry)
  • Add more "coarse" lock on GetPerformanceCounterLib() for two reason, possible multiple instancing of PerformanceCounterLib and
    race with CloseAllLibraries() on s_libraryTable

My idea is that this namespace wasn't built to handle special "concurrent" scenarios where Create()/Delete()/GetXXXX() are used heavily at same time,
and it make sense, usually i "install a counter" and after i use it, so concurrent modification are rare.
Parallel unit testing is borderline.
Maybe if we want to have a fully concurrent api we've to change more code and have more "coarse" locks also to protect concurrent registry access, or we can
only sync internal structure of PerformanceCounterLib and keep attribute.
I think that today [assembly: CollectionBehavior(DisableTestParallelization = true)] is necessary with this implementation to mitigate the issue.
After these changes sometimes(after heavy loaded minutes with 64 parallel thread in loop Exists/Delete/Create/GetCounter/ReadValue) i get corrupted registry error like(no more null reference for race):

The Counter layout for the Category specified is invalid, a counter of the type:  AverageCount64, AverageTimer32, CounterMultiTimer, CounterMultiTimerInverse, CounterMultiTimer100Ns, CounterMultiTimer100NsInverse, RawFraction, or SampleFraction has to be immediately followed by any of the base counter types: AverageBase, CounterMultiBase, RawBase or SampleBase.
Cannot load Counter Name data because an invalid index '' was read from the registry.

This is WIP code for discussion(tests/debug code).

cc: @joperezr @danmosemsft @adiaaida

@MarcoRossignoli MarcoRossignoli changed the title Sync PerformanceCounterLib WIP Sync PerformanceCounterLib Jan 20, 2018
@danmoseley
Copy link
Member

@adiaaida

@MarcoRossignoli MarcoRossignoli changed the title WIP Sync PerformanceCounterLib [WIP] Sync PerformanceCounterLib Jan 25, 2018
@MarcoRossignoli MarcoRossignoli changed the title [WIP] Sync PerformanceCounterLib Sync PerformanceCounterLib Jan 31, 2018
@MarcoRossignoli
Copy link
Member Author

ping

@danmoseley
Copy link
Member

@adiaaida @brianrob ?

Copy link
Member

@brianrob brianrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay on this. Some comments and questions.

private Hashtable _categoryTable;
private Hashtable _nameTable;
private Hashtable _helpTable;
private readonly object _customCategoryTableLock = new Object();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend only using InternalSyncObject. Introducing a new lock here can cause deadlocks if we don't get it right. I've seen that sort of thing in this area before and we've removed locks for this reason.

Copy link
Member Author

@MarcoRossignoli MarcoRossignoli Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, but InternalSyncObject is static and protect race on PerformanceCounterLib static methods(create/destroy machine/lcid instances), "_customCategoryTableLock", _categoryTableLock, _nameTableLock, _helpTableLock protect race on machineName/lcid PerformanceCounterLib internal instance(GetCounters etc...), is correct mix "static locks" with "instance locks"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing fundamentally wrong with this. For me the question is whether or not the number of instances and actions upon them in a multi-threaded fashion will result in significant performance degradation if we use a single lock. My suspicion is that this won't be an issue because this is all about creation of counters and not use of counters. I would be much more concerned if we did something like this when reading counters.

Copy link
Member Author

@MarcoRossignoli MarcoRossignoli Feb 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree, seems ok, checked with call hierarchy tool, called only in init paths(_initialized).
callhierarchy

@@ -320,7 +320,9 @@ internal void CloseTables()
_nameTable = null;
_helpTable = null;
_categoryTable = null;
_customCategoryTable = null;
//race with FindCustomCategory
lock (_customCategoryTableLock)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please add enclosing braces.

Copy link
Member Author

@MarcoRossignoli MarcoRossignoli Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, done

Interlocked.CompareExchange(ref _customCategoryTable, new Hashtable(StringComparer.OrdinalIgnoreCase), null);
}
if (_customCategoryTable == null)
_customCategoryTable = new Hashtable(StringComparer.OrdinalIgnoreCase);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please add enclosing braces.

Copy link
Member Author

@MarcoRossignoli MarcoRossignoli Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, done

DateTime now = DateTime.UtcNow;
while ((DateTime.UtcNow.Subtract(now).TotalSeconds < 10))
{
CloseAllLibraries();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like CloseAllTables got removed. Is that intentional? I'd expect that you'd want completely clean state.

Copy link
Member Author

@MarcoRossignoli MarcoRossignoli Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CloseAllLibraries() call library.Close() that call CloseTables(), CloseAllTables() call CloseTable() for every library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thank you.

*/
bool isPublished = false;
DateTime now = DateTime.UtcNow;
while ((DateTime.UtcNow.Subtract(now).TotalSeconds < 10))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel reasonably good about adding locks here, but I don't think that we should add a 10 second retry here. If you want to add a couple of retries, that might be OK. I also don't think that I would throw an exception either. The only real thing that can be done in the case of the thrown exception is to wait longer, and so you'd expect the user of performance counters to do that when attempting to get the counter - users already have to do this, right?

Copy link
Member Author

@MarcoRossignoli MarcoRossignoli Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i have written above the idea here is "wait for publication" not to protect access to shared resources, let me understand why lock is useful?If i understood well the issue is not race, but this is a "random" registry publication delay(read on guide) so after some time(10 seconds or less?or maybe after n retry is better?) if performance counter does not exists something went wrong or someone removed same counter. What i mean is that is not sure "wait longer" is the only solution, do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point - It's just not clear to me what the right number of retries or the correct amount of time is here. That's why I would guess that the desktop implementation does not have this type of logic and instead depends upon the caller to decide what to do (not necessarily better, but more flexible).

@weshaggard, do you know if there is any precedent in managed APIs that wrap Windows OS APIs where delay is an issue (such as this registry publication delay)? Do we generally take the approach that callers need to do retries, or do we have a pattern that we've used before to attempt to solve this problem?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephentoub could answer this. I thought there are some examples (maybe in IO?) but I can't find one specifically so that may not be the case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @danmosemsft. @stephentoub do you have any thoughts on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, this is not retries, this is 'wait for publication'. Previously the API was asynchronous (you set things in motion, but state update may not have happened before return), and you have made the synchronous (which gets you into the business of deciding how long to wait, and what to do when that fails.

We ARE changing the contract (it can throw an exception now where it definitely did NOT before), so this is a compatibility break (how much do we care?)

This is also a 'compatibility' library (it is not part of the cross platform core), and my expectation is that we are not trying to innovate here (just make existing things work).

This suggests that we should not change things here. What are the disadvantages of simply leaving the old behavior?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vancem i agree with you this is 'wait for publication' and the idea is to mitigate the issue. The scope of this PR is for first resolve some race issues and now seems ok. The second is try to remove [assembly:CollectionBehavior(DisableTestParallelization = true)] to run tests concurrently, but maybe this is not fundamental. @danmosemsft added this attribute after some tests fail due to "wait for publication" side effect and internal cleanup/load cached tables. If we don't want to change too much on this namespace we can keep the attribute.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My recommendation is to pull out the wait loop and do that separately. If the change was only motivated by tests, it is easy enough to simply do that waiting in the test rather than here. Which we choose really depends on whether our customers would be happier with or without it (but we are biased to doing nothing if we don't know). I am suspicious that we don't know, so I would keep the semantics the same (less delta between us an .NET Desktop).

So my recommendation is to pull it (we can put it back if we care in the future...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok @vancem i'll remove 'wait for publication strategy' and fix only race, thank's! @danmosemsft i think we need to keep attribute(or expose IsPublished() to category and update all tests if it's so important).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using Xunit;

namespace System.Diagnostics.Tests
{
[SkipOnTargetFramework(TargetFrameworkMonikers.Uap)] // In appcontainer, cannot write to perf counters
public static class PerformanceCounterCategoryTests
{
#if MyTrait
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like MyTrait was left in inadvertently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i have written above this is test/debug code, this test will be removed, or do you think we need an outerloop test of some seconds?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I do like the idea of your multi-threaded test making making sure that we haven't introduced any deadlocks with this (or future) changes. So, if you were to modify your test just to make sure that it actually completes within a reasonable time period, I think that would be a good addition to the test suite.

Copy link
Member Author

@MarcoRossignoli MarcoRossignoli Feb 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok if the idea of some sort of "wait" will be accepted i'll refactor an [OuterLoop()] test(for instance create/destroy n performance counter concurrently).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

@danmoseley
Copy link
Member

If you are successful, we can include a change to reverse this?
https://github.com/dotnet/corefx/pull/25401/files

Then repeatedly run CI to see whether we can still hit a problem.

@MarcoRossignoli
Copy link
Member Author

@brianrob should be ok now, you can review, thank's!

@MarcoRossignoli
Copy link
Member Author

@dotnet-bot test NETFX x86 Release Build

1 similar comment
@MarcoRossignoli
Copy link
Member Author

@dotnet-bot test NETFX x86 Release Build

@ahsonkhan
Copy link
Member

@brianrob, is this PR good to merge?

s_libraryTable = null;
}
}
//race with GetPerformanceCounterLib
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this locking actually make a material difference in race conditions with GetPerformanceCounterLib? GetPerformanceCounterLib doesn't take the lock on InternalSyncObj if s_libraryTable is already initialized. And even if it did, it would only take the lock long enough to initialize the table, but it would be handing out a reference to a PerformanceCounterLib that we're closing here, so code could be using a PerformanceCounterLib concurrently with it being Close'd. Is that ok? Or maybe the locking we're adding here is purely about protecting the mutation of s_libraryTable, and we're not concerned with the other issues?

Copy link
Member Author

@MarcoRossignoli MarcoRossignoli Mar 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe the locking we're adding here is purely about protecting the mutation of s_libraryTable, and we're not concerned with the other issues?

Yes @stephentoub testing with parallel Register/Unregister(CloseAllLibraries) and GetCounterXXX i see a lot of new PerformanceCounterLib(machineName, lcidString); on GetPerformanceCounterLib for s_libraryTable = null; this lock is only to protect mutation s_libraryTable and avoid a lot of benign instances of PerformanceCounterLib in parallel scenario

but it would be handing out a reference to a PerformanceCounterLib that we're closing here

doesn't seem to be a problem because the "registry tables" will be reloaded, if counter is no more present a not found exception will raise.
Again is a borderline scenario(parallel register/unregister).

@@ -638,83 +642,87 @@ internal bool FindCustomCategory(string category, out PerformanceCounterCategory
RegistryKey baseKey = null;
categoryType = PerformanceCounterCategoryType.Unknown;

if (_customCategoryTable == null)
//race with CloseTables
Copy link
Member

@stephentoub stephentoub Mar 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of code to put under a lock; this is not only synchronizing with CloseTables, but with other calls to FindCustomCategory. Might that be a performance/scalability problem?

I would think a better way to deal with this would be to instead just change this method to not refer to _customCategoryTable other than for the initial checking/setting of the field, e.g.

Hashtable table =
    _customCategoryTable ??
    Interlocked.CompareExchange(ref _customCategoryTable, new Hashtable(StringComparer.OrdinalIgnoreCase), null) ??
    _customCategoryTable;

and then use table for the rest of the method rather than using _customCategoryTable, only taking a lock (table) when we actually modify table, since Hashtable is thread-safe for any number of readers with one writer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might that be a performance/scalability problem?

i followed internal FindCustomCategory() and is used only in init path today, no hot path.
However your solution is better, so i agree with you. Thank's a lot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@MarcoRossignoli MarcoRossignoli changed the title Sync PerformanceCounterLib [WIP]Sync PerformanceCounterLib Mar 14, 2018
@MarcoRossignoli MarcoRossignoli changed the title [WIP]Sync PerformanceCounterLib Sync PerformanceCounterLib Mar 14, 2018
@MarcoRossignoli
Copy link
Member Author

@dotnet-bot test OSX x64 Debug Build

@MarcoRossignoli
Copy link
Member Author

@dotnet-bot test Windows x86 Release Build

@MarcoRossignoli MarcoRossignoli changed the title Sync PerformanceCounterLib WIP: Sync PerformanceCounterLib Mar 14, 2018
@MarcoRossignoli MarcoRossignoli changed the title WIP: Sync PerformanceCounterLib Sync PerformanceCounterLib Mar 14, 2018
@brianrob
Copy link
Member

Looks like this one is ready to merge. @stephentoub do you have any other feedback?

lock (_customCategoryTableLock)
{
_customCategoryTable = null;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this lock still needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understand, i'll remove it. Thanks!

@@ -674,7 +678,10 @@ internal bool FindCustomCategory(string category, out PerformanceCounterCategory
// In this case we return an 'Unknown' category type and 'false' to indicate the category is *not* custom.
//
categoryType = PerformanceCounterCategoryType.Unknown;
_customCategoryTable[category] = categoryType;
lock (_customCategoryTableLock)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need a separate _customCategoryTableLock... just lock on table.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'are right. Thanks!

Copy link
Member Author

@MarcoRossignoli MarcoRossignoli Mar 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephentoub should be ok now

@@ -1237,7 +1244,6 @@ internal static void UnregisterCategory(string categoryName)
{
RegisterFiles(categoryName, true);
DeleteRegistryEntry(categoryName);
CloseAllTables();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is CloseAllTables no longer needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CloseAllLibraries() call Close() on every PerformanceCounterLib instance that call CloseTables(). Isn't this redundant?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I'm not familiar with the codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brianrob Can you confirm?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to say that we should leave CloseAllTables in place. I agree that the work is done implicitly by calling CloseAllLibraries, but some of this work is done outside of locks and I'd hate to change the characteristics of a race condition and break user code. Given that this functionality is here for compatibility, I'd leave it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoRossignoli
Copy link
Member Author

@dotnet-bot test Linux x64 Release Build please
@dotnet-bot test OSX x64 Debug Build please

@MarcoRossignoli
Copy link
Member Author

@dotnet-bot test OSX x64 Debug Build please

1 similar comment
@MarcoRossignoli
Copy link
Member Author

@dotnet-bot test OSX x64 Debug Build please

@danmoseley
Copy link
Member

danmoseley commented Mar 22, 2018

@brianrob do you have further feedback? I'd like to get this committed to see whether it helps tests stop being flaky.

@brianrob
Copy link
Member

@danmosemsft, I just replied back to the last comment. Once the requested changes are made as long as @stephentoub is OK with the changes, I am too.

@MarcoRossignoli
Copy link
Member Author

MarcoRossignoli commented Mar 22, 2018

@danmosemsft @brianrob @stephentoub restored CloseAllTables().

@brianrob
Copy link
Member

@dotnet-bot test NETFX x86 Release Build
@dotnet-bot test UWP CoreCLR x64 Debug Build

Copy link
Member

@brianrob brianrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@stephentoub stephentoub merged commit 41c8ba2 into dotnet:master Mar 23, 2018
@MarcoRossignoli MarcoRossignoli deleted the perflib-test branch March 23, 2018 07:57
@karelz karelz added this to the 2.1.0 milestone Mar 27, 2018
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* sync PerformanceCounterLib

* add enclosing braces

* address PR feedback

* address PR feedback

* address PR feedback

* nit: trim spaces

* address PR feedback

* address stephentoub PR feedback

* nit: tab

* nit: tab

* address PR feedback


Commit migrated from dotnet/corefx@41c8ba2
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants