
Optimisations for GarbageCleaner #31

Open
eclipsewebmaster opened this issue May 8, 2024 · 22 comments
Comments

@eclipsewebmaster

| Bugzilla Link | 570670 |
| --- | --- |
| Status | ASSIGNED |
| Importance | P3 minor |
| Reported | Jan 26, 2021 11:54 EDT |
| Modified | Jan 19, 2024 21:38 EDT |
| Version | 1.11 |
| Reporter | Jason Koch |

Description

I have a series of patches that improve performance in the GarbageCleaner codebase. After a brief discussion on the mailing list, I am proposing these patches here.

All existing test cases pass. The patches give a 4x speedup on a 16-core machine and significantly lower overall ('real') CPU usage when parsing a large (70+ GB) hprof file with ~1 billion objects.

@eclipsewebmaster
Author

By Jason Koch on Jan 26, 2021 12:22

Created attachment 285390
1_simplify_objectmarker_marking_code.patch


@eclipsewebmaster
Author

By Jason Koch on Jan 26, 2021 12:23

Created attachment 285391
2_getPage_perform_read_outside_sync_block.patch


@eclipsewebmaster
Author

By Jason Koch on Jan 26, 2021 12:23

Created attachment 285392
3_remove_unneeded_sync_in_ArrayIntCompressed.patch


@eclipsewebmaster
Author

By Jason Koch on Jan 26, 2021 12:23

Created attachment 285393 (attachment deleted)
4_threadlocal_cache_for_indexwriter_page.patch

@eclipsewebmaster
Author

By Jason Koch on Jan 26, 2021 12:23

Created attachment 285394
5_use_fj_for_GarbageCleaner.patch


@eclipsewebmaster
Author

By Jason Koch on Jan 26, 2021 12:23

Created attachment 285395
6_patch_feedback.patch


@eclipsewebmaster
Author

By Jason Koch on Jan 26, 2021 15:21

I've performed some before/after comparisons of this patchset, and on the hprof files I have, the GC phase uses less memory than Phase1/Phase2. In practice this means that if there is enough heap space allocated to complete Phase2, parsing will complete through to the end; if there is not enough heap space to get through Phase2, we never reach the point of running GC.

In summary, we can be confident that this patchset does not degrade the heap behaviour of MAT parsing, and it also gives some clues about future directions for optimization.

@eclipsewebmaster
Author

By Jason Koch on Mar 25, 2021 11:03

Hi Andrew / Krum - were you able to look this over and provide any feedback?
I have assessed the memory footprint and it sits below the pass 1/2 requirements.

@eclipsewebmaster
Author

By Jason Koch on Mar 30, 2021 23:42

I just realised none of these patches apply cleanly; I tried to break it up into smaller patches and made a mess of it. I'll get that fixed.

I'll clean this all up and raise a separate ticket.

@eclipsewebmaster
Author

By Andrew Johnson on Mar 31, 2021 05:11

I'm looking at the patches. I think the first one worked for me, but if you want to keep this bug then you can mark old patches as superseded and add new ones.

@eclipsewebmaster
Author

By Jason Koch on Apr 01, 2021 00:20

Created attachment 286012
2b_fix_ensure_readdirect_multithreaded.patch

This fixes issues with patch 2.

@eclipsewebmaster
Author

By Jason Koch on Apr 01, 2021 00:23

Created attachment 286013
4_introduce_threadlocal_cache_for_indexwriter_getpage.patch

@eclipsewebmaster
Author

By Jason Koch on Apr 01, 2021 00:26

Created attachment 286014
7_fix_handle_garbagecleaner_overflow_on_large_dumps.patch

@eclipsewebmaster
Author

By Jason Koch on Apr 01, 2021 00:28

I have amended patch 2 with a 2b, replaced patch 4, and added a patch 7. I believe these should now apply cleanly in order and pass tests properly: 1, 2+2b, 3, 4, 5, 6, 7.

@eclipsewebmaster
Author

By Andrew Johnson on Apr 26, 2021 02:50

I've applied the patches to my workspace and they run. It's not a large set of patches (<400 added lines), so it isn't a 'large' contribution and no CQ is needed.

The fork/join object marker is simpler than the old way, but the old way had some performance tuning,
even though threads were sometimes idle.

The old way:

  • A depth-first search per thread, with locality.
  • A local stack for objects close to the current object.
  • A local queue for the remaining objects.
  • The global stack for excess objects, or for when the local stack and queue are empty.

So one question is how well the new marking works with limited memory (insufficient to hold most of the outbound references in memory). For example, the CI builds are memory constrained because of the VMs used for Jenkins; fortunately the test dumps are small. The new thread-local cache could help, but was the old system of locality in marking useful?
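
A minimal sketch of how that old per-thread locality scheme might look (the class and names below are illustrative only, not the actual ObjectMarker code):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntFunction;

// Illustrative sketch of the old locality-based marking loop; not the real ObjectMarker.
final class LocalityMarker implements Runnable
{
    private static final int LOCAL_LIMIT = 1024;                  // assumed tuning constant

    private final Deque<Integer> localStack = new ArrayDeque<>(); // objects close to the current one
    private final Deque<Integer> localQueue = new ArrayDeque<>(); // remaining local work
    private final Deque<Integer> globalStack;                     // shared overflow stack
    private final boolean[] visited;                              // racy writes are benign: worst case a re-mark
    private final IntFunction<int[]> outbound;                    // object id -> outbound reference ids

    LocalityMarker(Deque<Integer> globalStack, boolean[] visited, IntFunction<int[]> outbound)
    {
        this.globalStack = globalStack;
        this.visited = visited;
        this.outbound = outbound;
    }

    public void run()
    {
        Integer obj;
        while ((obj = next()) != null)
        {
            for (int ref : outbound.apply(obj))
            {
                if (visited[ref])
                    continue;
                visited[ref] = true;
                if (localStack.size() < LOCAL_LIMIT)
                    localStack.push(ref);                          // depth-first, keeps nearby objects local
                else if (localQueue.size() < LOCAL_LIMIT)
                    localQueue.add(ref);
                else
                    synchronized (globalStack) { globalStack.push(ref); } // spill excess work to the global stack
            }
        }
    }

    private Integer next()
    {
        if (!localStack.isEmpty())
            return localStack.pop();
        if (!localQueue.isEmpty())
            return localQueue.poll();
        // Simplification: a real implementation would coordinate with other threads before terminating.
        synchronized (globalStack) { return globalStack.poll(); } // null when no work remains
    }
}
```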

A minor improvement for fork/join might be to split very large object arrays - for example, if there is an object array with 1,000,000,000 elements, can only one thread process it, or can the work be split?

I think the IndexWriter getPage SoftReference also has a possible problem where the SoftReference is cleared between two gets.
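
To illustrate the kind of hazard meant here (a generic sketch, not the actual IndexWriter code): if the soft reference is dereferenced twice, the GC may clear it between the null check and the second get(), so the page has to be read into a local variable once and used from there.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Generic sketch of the SoftReference double-get hazard; not the actual IndexWriter code.
final class SoftPageCache<P>
{
    private final Map<Integer, SoftReference<P>> pages = new HashMap<>();

    // Racy: the GC may clear the reference between the two get() calls,
    // so the second dereference can return null even though the check passed.
    P getPageRacy(int n)
    {
        SoftReference<P> ref = pages.get(n);
        if (ref != null && ref.get() != null)
            return ref.get();          // may be null by now
        return null;
    }

    // Safe: dereference once into a local and return that value.
    P getPageSafe(int n)
    {
        SoftReference<P> ref = pages.get(n);
        if (ref != null)
        {
            P page = ref.get();
            if (page != null)
                return page;           // guaranteed non-null here
        }
        return null;                   // caller must re-read the page and re-cache it
    }
}
```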

To get this into Eclipse 2021-06 we would need to finish a few days before 2021-06 M3+3 on Wednesday, May 26, 2021.

@eclipsewebmaster
Author

By Jason Koch on May 06, 2021 22:55

Thanks for the feedback. Quick update: I have some ideas. We can certainly defer / lazy-load the array for FjObjectMarker into the next compute phase - I have a draft of that working locally, and it definitely reduces the heap footprint during marking.

I am still thinking about the locality question, and it is a good point. There are a few architectures I have in mind that might give superb or terrible throughput; I have yet to really test any of them out.

On the one hand, locality is beneficial for cache purposes.

In practice, however, looking at a heap I have here: 2bn objects, an outbound index 20GB in size, and current CPUs with O(10s of MB) of L3 cache, so the chance of an object ID already being cached is quite small; this would favour simpler code paths to help the CPU. In addition, monitoring jmap -histo on a heap of this size indicates that at peak only <300,000 FjObjectMarkers are enqueued (after the lazy load mentioned above), which is only a few MB of data.

But there are certainly areas where locality is likely to be more significant, and I'd need to measure and get a bit more detailed to be certain. I think the only way to be sure is to build a few alternative implementations and generate some more ideas on what would be a better solution.

Ideas bubbling around that may or may not work:

  • N threads: split the heap into 1/N regions, have all bits[] read/write updates go via an owner thread, and have threads forward any non-local requests they hit during outbound.get() via queues to the owning threads.
  • Revisit the work-stealing algorithm; some CPUs were underused in the past, but maybe some heuristics or assumptions can be revisited.
  • Use a bitset or similar, backed by a CAS [or, in the case of split threads, no CAS required], to track visited/not-visited state (see the sketch after this list).
  • Within the existing FJ pool, do more work between recursions or use iteration to simplify the code paths for the CPU [I haven't yet looked at the generated ASM].
  • Split the loops so that fetch/mark/fork happen as distinct stages in processing.
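
For the CAS-backed bitset idea, a minimal sketch (hypothetical, not existing MAT code) could pack one visited bit per object id into an AtomicLongArray:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical CAS-backed visited bitset: one bit per object id.
final class VisitedBits
{
    private final AtomicLongArray words;

    VisitedBits(int objectCount)
    {
        words = new AtomicLongArray((objectCount + 63) >>> 6); // 64 ids per long word
    }

    /** Returns true only for the thread that sets the bit first. */
    boolean markIfUnvisited(int id)
    {
        int word = id >>> 6;
        long mask = 1L << (id & 63);
        while (true)
        {
            long old = words.get(word);
            if ((old & mask) != 0)
                return false;                                  // already visited
            if (words.compareAndSet(word, old, old | mask))
                return true;                                   // we claimed it
        }
    }
}
```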

I want to take some measurements to find out exactly why the code is stalling, too; from a simple perf run, the majority of the time seems to be spent around the outbound object get, so perhaps we can look at optimising this by batching reads to the IO layer. Much to think about.

@eclipsewebmaster
Author

By Andrew Johnson on May 08, 2021 04:13

The locality I was considering was also reader caching for the outbound references.
Looking at the code for outbound refs:

IntIndex1NReader:

  • header - PositionIndexReader: pages of ArrayIntLongCompressed held in a soft reference cache, used to index into the body
  • body - IntIndexReader: pages of ArrayIntCompressed held in a soft reference cache

and those all read from a SimpleBufferedRandomAccessInputStream, which also has a soft reference cache of 512-byte pages.
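
Patches 2/2b target this kind of soft-reference page cache. A rough sketch of the "perform the read outside the sync block" pattern, with hypothetical names rather than the real SimpleBufferedRandomAccessInputStream code:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "read outside the synchronized block" pattern.
final class PagedReader
{
    private final Map<Long, byte[]> cache = new HashMap<>();

    byte[] getPage(long pageStart) throws IOException
    {
        synchronized (cache)
        {
            byte[] page = cache.get(pageStart);
            if (page != null)
                return page;                       // cache hit under the lock
        }
        byte[] page = readFromDisk(pageStart);     // slow I/O done with no lock held
        synchronized (cache)
        {
            byte[] existing = cache.get(pageStart);
            if (existing != null)
                return existing;                   // another thread loaded it first; reuse its copy
            cache.put(pageStart, page);
            return page;
        }
    }

    private byte[] readFromDisk(long pageStart) throws IOException
    {
        return new byte[512];                      // placeholder for the real 512-byte page read
    }
}
```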

[I suppose now we have CompressedRandomAccessFile in hprof we could gzip compress the index files, but it would be slower, and most of the space saving is from compressing hprof:

| file name | uncompressed | compressed | Saving | Compression |
| --- | --- | --- | --- | --- |
| java_pid32284.0001.hprof | 5407868827 | 1306076258 | 4101792569 | 75.85% |
| java_pid32284.0001.inbound.index | 262560438 | 191519550 | 71040888 | 27.06% |
| java_pid32284.0001.outbound.index | 262560438 | 126402088 | 136158350 | 51.86% |
| java_pid32284.0001.domOut.index | 213915350 | 128975789 | 84939561 | 39.71% |
| java_pid32284.0001.o2hprof.index | 181198308 | 67032436 | 114165872 | 63.01% |
| java_pid32284.0001.o2ret.index | 148956613 | 1724083 | 147232530 | 98.84% |
| java_pid32284.0001.domIn.index | 71806040 | 45299834 | 26506206 | 36.91% |
| java_pid32284.0001.o2c.index | 68556040 | 570041 | 67985999 | 99.17% |
| java_pid32284.0001.a2s.index | 52808808 | 628021 | 52180787 | 98.81% |
| Totals | 6670230862 | 1868228100 | 4802002762 | 71.99% |

chunked compressed java_pid32284.0001.hprof.gz 1555306556

]

@eclipsewebmaster
Author

By Jason Koch on May 11, 2021 09:01

Created attachment 286359 (attachment deleted)
8_enqueue_only_pointers.patch

@eclipsewebmaster
Author

By Jason Koch on May 11, 2021 09:02

I've added a draft patch that swaps the int[] enqueue to an int enqueue, which reduces the memory footprint during marking.
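
A minimal sketch of the idea (hypothetical names; the real FjObjectMarker differs): each fork/join task carries only an int object id and materialises the outbound int[] lazily inside compute(), so the queued tasks stay small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.function.IntFunction;

// Hypothetical sketch: enqueue only the object id, resolve outbound refs lazily in compute().
final class MarkTask extends RecursiveAction
{
    private final int objectId;
    private final IntFunction<int[]> outbound;    // object id -> outbound reference ids
    private final AtomicIntegerArray visited;     // 0 = unvisited, 1 = visited

    MarkTask(int objectId, IntFunction<int[]> outbound, AtomicIntegerArray visited)
    {
        this.objectId = objectId;
        this.outbound = outbound;
        this.visited = visited;
    }

    @Override
    protected void compute()
    {
        // The int[] of outbound references is only read here, not when the task was queued,
        // so the pending-task queue holds small tasks rather than whole reference arrays.
        List<MarkTask> subtasks = new ArrayList<>();
        for (int ref : outbound.apply(objectId))
        {
            if (visited.compareAndSet(ref, 0, 1)) // claim each object exactly once
                subtasks.add(new MarkTask(ref, outbound, visited));
        }
        invokeAll(subtasks);                      // fork the children and wait for them
    }
}
```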

@eclipsewebmaster
Author

By Jason Koch on Jun 02, 2021 18:40

Created attachment 286512
8_enqueue_only_pointers_proper.patch

Addresses comments (via email?) regarding enqueuing entire arrays instead of only pointers.

:notepad_spiral: file_570670.txt

@eclipsewebmaster
Author

By Yi Yang on Jun 08, 2023 03:17

  • 2_getPage_perform_read_outside_sync_block.patch
  • 2b_fix_ensure_readdirect_multithreaded.patch

These two patches reduce the synchronization overhead of IntIndexReader. I think they are correct in principle, but I have not seen any significant performance improvement.

  • 3_remove_unneeded_sync_in_ArrayIntCompressed.patch

This looks good to me

  • 4_introduce_threadlocal_cache_for_indexwriter_getpage
  • 6_patch_feedback.patch

These two patches reduced the synchronization overhead of IntIndexCollector.set, but did not completely eliminate it. Given that, do you think that, if a completely lock-free ArrayIntUncompressed were provided, we could eliminate the synchronization overhead of IntIndexCollector.set/get and remove the need for these patches altogether?
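
A rough sketch of what such a lock-free uncompressed page could look like (hypothetical; the real ArrayIntCompressed bit-packs values into fewer bits, which is why its concurrent writers currently need synchronization):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical lock-free page: one int slot per value, so set/get need no synchronized blocks.
final class ArrayIntUncompressed
{
    private final AtomicIntegerArray data;

    ArrayIntUncompressed(int size)
    {
        data = new AtomicIntegerArray(size);
    }

    void set(int index, int value)
    {
        data.set(index, value);     // plain volatile write, no lock required
    }

    int get(int index)
    {
        return data.get(index);     // volatile read
    }
}
```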

@eclipsewebmaster
Author

By Jason Koch on Jan 19, 2024 21:38

The only change still hanging around in here is the GarbageCleaner patch(es). I'll retest these and see whether it's worth another PR.
