
What is __psynch_cvwait and why does it use so much CPU? #35394

Closed
scheglov opened this issue Dec 12, 2018 · 21 comments

@scheglov (Contributor) commented Dec 12, 2018

Reproduction is not easy; it requires code and configuration that is only on my machine.
But maybe someone knows what it is.

[screenshot]

@mraleph (Contributor) commented Dec 12, 2018

It's an implementation detail of pthread_cond_wait (which waits on the condition variable).

I think in reality it is not using any CPU, but the Observatory profiler is sending signals to the waiting thread (because it does not know that the waiting thread is not doing any actual work).
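
A minimal way to see this effect (hypothetical sketch, not the reproduction from this issue): a Dart program that mostly sits idle between short bursts of work blocks inside the VM's condition-variable waits, and on macOS the sampling profiler may still attribute ticks to __psynch_cvwait during that idle time.

```dart
// Hypothetical idle-heavy program; what matters is the profiler's view of it.
import 'dart:async';

void main() {
  Timer.periodic(const Duration(seconds: 5), (_) {
    // A short burst of real work...
    var sum = 0;
    for (var i = 0; i < 10 * 1000 * 1000; i++) {
      sum += i;
    }
    print(sum);
    // ...followed by ~5 seconds of idle waiting in the event loop.
  });
}
```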

/cc @rmacnak-google

@rmacnak-google (Member) commented Dec 12, 2018

In that profile, slightly less time is spent in the wait than in old space GC. Old space GC does most of its work on other threads, with the main thread waiting on the helper threads (for the non-concurrent parts). This should have been visible in the profile as GCOldSpace appearing as a caller of __psynch_cvwait. GCOldSpace isn't a function but rather a "tag" serving as a marker, so I suspect the table view of the profile doesn't include tags as callees or callers. If you look at this same profile in the tree view, you should see __psynch_cvwait as a child of GCOldSpace.

This profile is worrying because GCOldSpace should only appear significant in the profile if most marking isn't running concurrently. What are you running?

@scheglov (Contributor, Author) commented Dec 12, 2018

I don't see GCOldSpace in the tree view at all.
[screenshot]

I'm running a tiny experimental app that uses package:watcher to wait for changes in two folders, and then does some sync work (parse, build library scopes, walk the library graph, etc.; about 600 ms of work), and some more async work (resolve affected libraries; about 150 ms of work).
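
For context, a stripped-down sketch of that setup (hypothetical paths and placeholder work, not the actual incremental_dependencies.dart):

```dart
// Watch two folders with package:watcher and react to changes.
import 'package:watcher/watcher.dart';

void main() {
  for (final dir in ['/path/to/folder1', '/path/to/folder2']) {
    DirectoryWatcher(dir).events.listen((WatchEvent event) {
      print('${event.type} ${event.path}');
      // In the real program: ~600 ms of synchronous work (parse, build
      // library scopes, walk the library graph) and ~150 ms of async work
      // (resolve affected libraries) happen here.
    });
  }
}
```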

The command line to run, just in case it triggers some unexpected VM flag:
/Users/scheglov/Applications/dart-sdk/bin/dart --observe:5000 --code_comments=false --source_lines=false --pause_isolates_on_unhandled_exceptions=false --pause_isolates_on_exit=false --enable-asserts /Users/scheglov/Source/Dart/sdk.git/sdk/pkg/analyzer/tool/analysis_driver/incremental_dependencies.dart

[screenshot]

@mraleph (Contributor) commented Dec 14, 2018

@rmacnak-google I have a related question: why don't we use setitimer() with ITIMER_PROF to implement the Observatory profiler? This would avoid this sort of confusing profile, because ITIMER_PROF would not fire while a thread is sitting inside __psynch_cvwait.

@rmacnak-google (Member) commented Dec 14, 2018

If we didn't take samples during a wait, that would hide the cost of GC and loading. I think the issue here is that the wait isn't attributed to a meaningful caller.

@rmacnak-google (Member) commented Dec 14, 2018

@scheglov Can you try with https://dart-review.googlesource.com/c/sdk/+/87380 patched in, and say if you find it more understandable?

@scheglov (Contributor, Author) commented Dec 14, 2018

@rmacnak-google It is now attributed to GCOldSpace. I guess this is an improvement, although I still don't understand why it does so much old GC. Are there any usual reasons why this happens? Too large a heap? Too many objects? Too frequent new-space GC?

[screenshot]

Here is also the allocation profile.
[screenshot]

@rmacnak-google (Member) commented Dec 17, 2018

It looks like your program is allocating a lot of long-lived lists and maps, based on large old accumulated values for _List, _GrowableList and _Uint32List (_Uint32Lists are involved in the default map implementation). The time in OldSpaceGC should go down dramatically after https://dart-review.googlesource.com/c/sdk/+/87121.

To see where these lists and maps are coming from, you could start your program paused, navigate to the _GrowableList and _InternalLinkedHashMap classes and click 'enable' under trace allocations, resume the program, then visit the classes again and click 'refresh allocation profile'.

Your program is also allocating a lot of short-lived Uint8Lists (large new accumulated value but low old accumulated value); possibly these are buffers from I/O. Short-lived objects are handled very quickly by the GC, so this is probably fine unless you don't expect the program to involve many byte buffers.
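
To make the distinction concrete, here is a hedged illustration (not the analyzer's actual code) of the two patterns described above: long-lived collections whose backing storage gets promoted to old space, versus short-lived byte buffers that die young in new space.

```dart
import 'dart:typed_data';

// Long-lived: entries accumulate here for the lifetime of the program, so the
// backing list and map storage is what would show up as old-space _List,
// _GrowableList and _Uint32List in the allocation profile.
final Map<String, List<int>> index = {};

void record(String path, Uint8List buffer) {
  // `buffer` is short-lived (e.g. an I/O read buffer); it is typically
  // reclaimed by a cheap new-space GC shortly after this function returns.
  index[path] = List<int>.of(buffer);
}

void main() {
  for (var i = 0; i < 1000; i++) {
    record('file_$i.dart', Uint8List(64 * 1024));
  }
  print(index.length);
}
```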

dart-bot pushed a commit that referenced this issue Dec 17, 2018

[vm, observatory] Respect requested tag order even when --profile-vm=false.

Blame time waiting for service or kernel isolate startup against LoadWait instead of VM.

In the table profile view, include VM tags. Because of the flattening nature of this view, it is less intrusive than in the tree view.

Bug: #35394
Change-Id: If13c57daa96cb69c5e5a009dfd973b491b694d04
Reviewed-on: https://dart-review.googlesource.com/c/87380
Reviewed-by: Alexander Aprelev <aam@google.com>
Commit-Queue: Ryan Macnak <rmacnak@google.com>
@a-siva (Contributor) commented Dec 20, 2018

Is there any more actionable item in this issue? If not, I would close the issue.

@scheglov (Contributor, Author) commented Jan 1, 2019

Well, I cannot decide whether to close this issue or not.
On one hand, my original question was answered, and I have some idea why this symbol is referenced.
On the other hand, I'm still perplexed why there is so much GCOldSpace.

I'm using Dart VM version: 2.2.0-edge.737209b85c340067898996ae88039a72f5f646a5 (Mon Dec 31 16:23:28 2018 +0000) on "macos_x64", and I see that https://dart-review.googlesource.com/c/sdk/+/87121 has landed. I was told that it would bring down the time in OldSpaceGC dramatically, but I still see a lot.

[screenshot]

It seems that during the operation that I run, old GC takes about 350ms, which seems like a lot for a roughly 350MB heap. Here are 3 states of the allocation profile screen: before the operation, after the first operation, and after the same operation run a second time. The whole operation is about 950ms.

Is old GC time a function of the number of old objects, and not so much of the heap size? Even so, 350ms for 350MB seems too much. Or not? There might be 3-4 million objects in the heap.

[screenshots: allocation profile before the operation, after the first run, and after the second run]

My understanding is that the problem is not with allocating objects, but with keeping these objects alive for a long time. Is this right? So, I don't need an allocation profile, I need a "reference profile". Unfortunately "heap snapshot" is not very helpful. The fact that _List is often referenced from _GrowableList does not tell me anything. OK, but where are all these _GrowableList instances referenced from? And their incoming references, and so on?

[screenshot]

What I would always like to have in Observatory is something like https://www.yourkit.com/docs/java/help/merged_paths.jsp

@rmacnak-google (Member) commented Jan 2, 2019

Based on that description, I think 'merged paths' is the same thing as Observatory's 'dominator tree (merged siblings by class)'.

In any case, new heap analysis tools are a separate issue.

@scheglov (Contributor, Author) commented Jan 2, 2019

@rmacnak-google I agree that new heap analysis tools are a separate issue.

But there are several questions about old GC performance.

  1. Is it expected to spend 350ms on a 350MB heap with 4.2 million objects?

  2. Is old GC time a function of the number of objects, and not the heap size per se? Or of the number of references?

  3. Is it expected that old GC uses so much time even after https://dart-review.googlesource.com/c/sdk/+/87121? I really don't know; I'm just referring to the fact that it was advertised to make things better. Or is this already a better state that I just don't appreciate enough? :-)

It is OK if you don't want to or cannot answer these questions; it just seems that the issue was closed on the formal grounds that something requiring a new issue, and not directly related to this issue, was mentioned.

@scheglov (Contributor, Author) commented Jan 2, 2019

Feel free to close it if you think that answering these questions would take more time than an issue warrants.

scheglov reopened this Jan 2, 2019

@mraleph (Contributor) commented Jan 2, 2019

@scheglov could you confirm the reproduction? It does not seem that pkg/analyzer/tool/analysis_driver/incremental_dependencies.dart is available in a normal SDK checkout; is it part of some special branch?

@scheglov (Contributor, Author) commented Jan 2, 2019

I'm sorry, I see that I mentioned in the beginning that reproduction is not easy, but of course we all forgot about this over the long holidays / vacations. I will prepare a CL with these files and report back in this issue. I'm just a bit busy right now.

@scheglov (Contributor, Author) commented Jan 2, 2019

https://dart-review.googlesource.com/c/sdk/+/88228

  1. There is a hardcoded path to the SDK in pkg/analyzer/tool/analysis_driver/incremental_dependencies.dart; change it.

  2. Run it, e.g. dart --observe:5000 --code_comments=false --source_lines=false --pause_isolates_on_unhandled_exceptions=false --pause_isolates_on_exit=false --enable-asserts /Users/scheglov/Source/Dart/sdk.git/sdk/pkg/analyzer/tool/analysis_driver/incremental_dependencies.dart

  3. Wait for analysis to finish.

  4. Open canonical_name.dart from kernel. Change CanonicalName.getChild to getChild2, save, and observe change processing, re-analysis, etc. Change back to getChild and save.

  5. Open Observatory at 127.0.0.1:5000, go to cpu profile (table), clear statistics.

  6. Do (4) several times. Refresh cpu profile (table) statistics. Notice that GCOldSpace is about 30%.

I realize that this code generates a lot of (new?) garbage. It is just surprising how expensive GCOldSpace is relative to code that does change processing.

@rmacnak-google (Member) commented Jan 3, 2019

I understood this issue to be an Observatory / profiler usability question. I think you're considering it to be a performance question.

  1. 350ms on a 350MB heap strikes me as roughly 2x too slow for a desktop machine with concurrent marking off.

  2. The time for an old space GC is related both to the number of objects and to the number of pointers. For two heaps of equal size, the heap with more small objects will take longer to GC than a heap with fewer large objects. For two heaps of equal size, the heap with more object arrays will take longer to GC than a heap with more typed data arrays or strings (see the sketch after this list).

  3. 350ms is not expected with concurrent marking on. It should be closer to 30ms or lower.
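
A rough illustration of point 2 (hypothetical sizes, not a benchmark): two allocations of broadly similar byte size but very different cost for the old-space GC, because one is millions of small objects full of pointers and the other contains no pointers at all.

```dart
import 'dart:typed_data';

class Node {
  final int value;
  Node(this.value);
}

void main() {
  // Millions of small objects: the marker must visit every Node and follow
  // every pointer in the backing list.
  final pointerHeavy = List<Node>.generate(5 * 1000 * 1000, (i) => Node(i));

  // A comparable amount of memory in a single object with no pointers inside:
  // the marker only visits the array header.
  final pointerFree = Uint8List(80 * 1000 * 1000);

  print('${pointerHeavy.length} nodes, ${pointerFree.length} bytes');
}
```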

When I run your reproduction, in the profiler I see 4.29% GCOldSpace, 8.53% GCNewSpace. In your instructions you say to run dart instead of out/ReleaseX64/dart (or xcodebuild/ReleaseX64/dart for Macs). Perhaps you are profiling with an older prebuilt VM where concurrent marking is still disabled?

@scheglov (Contributor, Author) commented Jan 3, 2019

Thank you for your answers!

Yes, it started as a usability issue. But then it became a performance question for me, because __psynch_cvwait raises questions when it uses lots of CPU. I'm sorry about the confusion.

I often download prebuilt bleeding-edge builds, so it was a fresh build. I updated the SDK and re-built it now, and with 2.2.0-edge.30b808dc6a2b9187bd430ff720c9139d1e6ca8f9 (Wed Jan 2 16:55:45 2019 -0800) on "macos_x64" GCOldSpace is still slow :-(

[screenshot]

When you look at GCOldSpace and GCNewSpace, do you clear the data on cpu profile (table) after an initial warm-up in the form of changing CanonicalName.getChild to getChild2, saving, and changing back? I do, because I'm interested in the performance of processing changes, not the combined performance profile of initial analysis plus processing changes.

With initial analysis in the mix, the results are not interesting. See below, after initial analysis and then after change/save, undo/save: GCOldSpace is still higher than what you get, already 8+%.
[screenshot]
[screenshot]

Hm... Maybe something is wrong with my Mac? I don't know... Fixes for Meltdown and Spectre that ruin performance? I will try on my Linux desktop tomorrow, when I get back to the office.

@rmacnak-google (Member) commented Jan 3, 2019

Oh, I missed the profile clearing. I see in Observatory's timeline that the VM is filling the pauses between processing the file changes with idle-time GCs. In an idle GC, both phases are not concurrent, which explains the 2x. I imagine for your purposes it would be better if the profiler did not tick during idle GC, because you're interested in where the Dart code is spending time in these incremental updates. For other purposes, like figuring out why a long-lived Dart process handling a low rate of small messages has high CPU time, it is better for the profiler not to leave it out. I need to think more about how to address this.

As a quick hack, idle GC can be disabled with --idle_timeout_micros=0.
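
For example, added to the command line used earlier in this thread (the flag placement is illustrative):

dart --observe:5000 --enable-asserts --idle_timeout_micros=0 pkg/analyzer/tool/analysis_driver/incremental_dependencies.dart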

@scheglov (Contributor, Author) commented Jan 3, 2019

Thank you!

With --idle_timeout_micros=0 I see GCOldSpace at just about 0.5%, so I can see the performance of the things that matter.

So, it was probably a usability issue all along :-)

scheglov closed this Jan 3, 2019

@mraleph (Contributor) commented Jan 3, 2019

@rmacnak-google did you happen to notice whether those idle GCs are actually "improving" heap usage (e.g. considerably reducing fragmentation)? Maybe we are churning CPU for no good reason on these idle GCs.

dart-bot pushed a commit that referenced this issue Jan 8, 2019

[vm, profiler] Place idle GC into a separate VM tag.
Bug: #35394
Change-Id: If95ac1cf140f0a37a161cd9d0bd846c62ce29eb2
Reviewed-on: https://dart-review.googlesource.com/c/88382
Reviewed-by: Siva Annamalai <asiva@google.com>
Commit-Queue: Ryan Macnak <rmacnak@google.com>