Incremental GC Barriers #3831

luc-blaeser · 2023-02-23T16:53:24Z

Incremental GC PR Stack

The Incremental GC is structured in three PRs to ease review:

Incremental GC #3837
Incremental GC Barriers #3831 <-- this PR
Incremental GC Forwarding Pointers #3829

Incremental GC Barriers

Preparation support for write and allocation barriers for the incremental moving (evacuating compacting) GC.

Write barrier: All potential pointer writes are passed to the write barrier which performs the write with additional steps to be implemented in the incremental GC:
* Incremental mark phase: Catch the overwritten pointers to realize incremental snapshot-at-the-beginning marking.
* Incremental update phase: If written pointers refer to old evacuated object locations, adjust them to point to the corresponding new forwarded locations.
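
The two write barrier cases above can be illustrated with a small toy model. This is a minimal sketch under stated assumptions, not the runtime system's implementation: `BarrierHeap`, `BarrierPhase`, and `write_with_barrier` are hypothetical names, objects are modeled as indices, and the mark state is a plain vector.

```rust
// Illustrative toy model only: the type and field names below are assumptions,
// not the Motoko RTS API.

/// GC phases relevant to the write barrier in this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BarrierPhase {
    Idle,
    Mark,
    Update,
}

/// Objects are identified by their index; `slots` are pointer fields.
pub struct BarrierHeap {
    pub phase: BarrierPhase,
    /// Forwarding target per object (an object that was not moved forwards to itself).
    pub forward: Vec<usize>,
    /// Mark state per object.
    pub marked: Vec<bool>,
    /// Pointer slots that the mutator may overwrite.
    pub slots: Vec<Option<usize>>,
}

impl BarrierHeap {
    /// Performs the pointer write together with the phase-specific barrier step.
    pub fn write_with_barrier(&mut self, slot: usize, new_value: Option<usize>) {
        match self.phase {
            BarrierPhase::Mark => {
                // Snapshot-at-the-beginning: mark the target of the overwritten pointer.
                if let Some(old_target) = self.slots[slot] {
                    self.marked[old_target] = true;
                }
                self.slots[slot] = new_value;
            }
            BarrierPhase::Update => {
                // Resolve a pointer to an old location to its forwarded location.
                let resolved = new_value.map(|target| self.forward[target]);
                self.slots[slot] = resolved;
            }
            // Outside of a GC run, the barrier degenerates to a plain write.
            BarrierPhase::Idle => self.slots[slot] = new_value,
        }
    }
}
```

In this model, overwriting a slot during `Mark` keeps the previously referenced object alive, while a write during `Update` never stores a stale pre-evacuation address.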

Allocation barrier: The allocation barrier catches all newly created objects once they are completely initialized (except the content of blobs). It serves the following purposes in the incremental GC:
* Incremental mark and evacuation phase: Mark newly allocated objects to retain them during a GC run that performs snapshot-at-the-beginning marking.
* Incremental update phase: Update all pointers in the new object to refer to the new forwarded locations.
* Additional GC increment: To limit memory reclamation latency at a high allocation rate during garbage collection, the barrier performs an additional small GC increment.
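
A rough sketch of the three allocation barrier duties listed above, again as a toy model. All names (`AllocHeap`, `AllocPhase`, `ALLOCATION_INCREMENT_STEPS`, `pending_gc_steps`) are assumptions made for illustration, and the "additional small GC increment" is modeled only as a counter of extra GC work.

```rust
// Illustrative toy model only; none of these names come from the Motoko RTS.

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum AllocPhase {
    Idle,
    Mark,
    Evacuate,
    Update,
}

/// Hypothetical amount of extra GC work accounted per allocation during a GC run.
pub const ALLOCATION_INCREMENT_STEPS: usize = 20;

pub struct AllocObject {
    /// Pointer fields, stored as object indices.
    pub fields: Vec<usize>,
    pub marked: bool,
}

pub struct AllocHeap {
    pub phase: AllocPhase,
    pub objects: Vec<AllocObject>,
    /// Forwarding target per object (self if not moved).
    pub forward: Vec<usize>,
    /// Extra GC work requested by allocations during the current GC run.
    pub pending_gc_steps: usize,
}

impl AllocHeap {
    /// Registers a fully initialized new object and applies the allocation barrier.
    pub fn allocate_with_barrier(&mut self, mut fields: Vec<usize>) -> usize {
        let id = self.objects.len();
        let mut marked = false;
        match self.phase {
            AllocPhase::Idle => {}
            // Retain objects allocated while snapshot marking or evacuation is running.
            AllocPhase::Mark | AllocPhase::Evacuate => marked = true,
            // Initial field values may still refer to old locations: resolve them.
            AllocPhase::Update => {
                for field in fields.iter_mut() {
                    *field = self.forward[*field];
                }
            }
        }
        if self.phase != AllocPhase::Idle {
            // Limit reclamation latency under a high allocation rate.
            self.pending_gc_steps += ALLOCATION_INCREMENT_STEPS;
        }
        self.objects.push(AllocObject { fields, marked });
        self.forward.push(id); // A new object forwards to itself.
        id
    }
}
```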

Static optimization: Compile-time barrier elimination based on a simple conservative analysis of the type of the modified field/array element. Specifically, the barrier is skipped if the type of the written location does not allow pointers (Bool, ?Bool, Char, ?Char, Nat8, ?Nat8, Nat16, ?Nat16, Int8, ?Int8, Int16, ?Int16, (), ?()). (The barrier cannot be omitted for multi-level optional types, as ??null, ???null, etc. refer to heap objects. Scalar types with >=32 bits can be indirected due to pointer tagging.)
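
The type rule above amounts to a small predicate. The following is a sketch only: the `Ty` enum is a simplified stand-in for the compiler's type representation, and `can_skip_write_barrier` is a hypothetical name; the actual analysis lives in the compiler and may differ in detail.

```rust
// Simplified stand-in for a type representation; names are assumptions.
#[derive(Clone, Debug, PartialEq)]
pub enum Ty {
    Bool,
    Char,
    Nat8,
    Nat16,
    Int8,
    Int16,
    Unit,
    Nat32,
    Int32,
    Nat64,
    Int64,
    Text,
    Opt(Box<Ty>),
    Array(Box<Ty>),
}

/// Small scalar types whose values never require a heap pointer.
fn is_small_scalar(ty: &Ty) -> bool {
    matches!(
        ty,
        Ty::Bool | Ty::Char | Ty::Nat8 | Ty::Nat16 | Ty::Int8 | Ty::Int16 | Ty::Unit
    )
}

/// Conservative check: the barrier is skipped only for a small scalar type
/// or a single-level optional of a small scalar type.
pub fn can_skip_write_barrier(written_location: &Ty) -> bool {
    match written_location {
        // Only one optional level is allowed: `??T` values can be heap objects.
        Ty::Opt(inner) => is_small_scalar(inner),
        other => is_small_scalar(other),
    }
}
```

Anything not explicitly listed, including scalars with 32 bits or more, conservatively keeps the barrier.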

Runtime Costs

GC benchmark measurements, comparing the number of mutator instructions, average across all benchmark cases, release mode (no sanity checks):

The incremental GC barrier contains the logic of the full GC (#3837) including object forwarding, but without the allocation GC increment.
The measurements without barriers include object forwarding, so that the difference isolates the barrier overhead.

Configuration	Avg. Mutator Instructions
Incremental GC barriers	2.06e10
No barriers with forwarding	1.80e10

14% runtime overhead on top of forwarding pointers.
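
For reference, the overhead figure is the relative increase in mutator instructions between the two configurations. A minimal sketch of that arithmetic (the helper name `overhead_percent` is an assumption for illustration):

```rust
/// Relative overhead in percent of `with_feature` over `baseline`.
pub fn overhead_percent(with_feature: f64, baseline: f64) -> f64 {
    (with_feature - baseline) / baseline * 100.0
}
```

With the measured values, `overhead_percent(2.06e10, 1.80e10)` is roughly 14.4, which is reported as 14%.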

Testing

Write barrier coverage has been extensively tested by the generational GC and in a separate barrier preparation PR for the incremental GC (#3502).
The allocation barrier is tested as part of the incremental GC PR (#3837).


crusso

Looks like mostly white space changes since last reviewed. LGTM.

### Incremental GC PR Stack

The Incremental GC is structured in three PRs to ease review:
1. #3837
2. #3831
3. #3829 **<-- this PR**

# Incremental GC Forwarding Pointers

Support for forwarding pointers (Brooks pointer) to enable incremental moving (evacuating compacting) GC.

Each object stores a forwarding pointer with the following properties:
* **Self-reference**: If the object resides at a valid location (i.e. it has not been relocated to another address), the forwarding pointer stores a reference to the object itself.
* **Single-level redirection**: If an object has been moved, the original object stores a pointer to the new object location. This implies that the data at the original object location is no longer valid.

Forwarding is only used during the evacuation phase and the updating phase of the incremental GC.

Indirection is at most one level, i.e. the relocation target cannot forward again to another location. The GC would need to update all incoming pointers before moving an object again in the next GC run.

Invariant: `object.forward().get_ptr() == object.get_ptr() || object.forward().forward().get_ptr() == object.forward().get_ptr()`.

## Changes

Changes made to the compiler and runtime system:
* **Header extension**: Additional object header space has been inserted for the forwarding pointer. Allocated objects forward to themselves.
* **Access indirection**: Every load, store, and object reference comparison incurs an additional indirection through the forwarding pointer.

## Runtime Costs

Measuring the performance overhead of forwarding pointers in the GC benchmark using release mode (no sanity checks).

Total number of mutator instructions, average across all benchmark cases:

Configuration     | Avg. mutator instructions
------------------|---------------------------
No forwarding     | 1.61e10
With forwarding   | 1.80e10

**Runtime overhead of 12% on average.**

## Memory Costs

Allocated memory size, average across all GC benchmark cases, copying GCs:

Configuration     | Avg. final heap size
------------------|----------------------
No forwarding     | 306 MB
With forwarding   | 325 MB

**Memory overhead of 6% on average.**

## Testing

Extensive sanity checks for forwarding pointers were implemented and run in the separate PR (#3546) containing the following sanity check code:
* **Indirection check**: Every dereferencing of a forwarding pointer checks that the pointer is valid and that the invariant above holds.
* **Memory scan**: At GC time points of the existing copying or compacting GC, at regular intervals, the entire memory is scanned and all objects and pointers are verified to be valid (valid forwarding pointer and plausible object tag).
* **Artificial forwarding**: For every created object, an artificial dummy object is returned that forwards to the real object. The dummy object stores zeroed content and has an invalid tag. This helps to verify that all heap accesses are correctly forwarded. Artificial forwarding disables the existing garbage collectors (because the dummy objects are not handled by the GCs) and performs memory scans at a defined frequency instead.

## Design Alternative

* **Combining tag and forwarding pointer**: #3904. This seems to be less efficient than the Brooks pointer technique, with a runtime performance degradation of 27.5% while only offering a small memory saving of around 2%.

## Reference

[1] R. A. Brooks. Trading Data Space for Reduced Time and Code Space in Real-Time Garbage Collection on Stock Hardware. ACM Symposium on LISP and Functional Programming, LFP'84, New York, NY, USA, 1984.
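
The forwarding pointer properties and the invariant stated in the forwarding pointers description can be sketched as a toy model. This is an illustration under assumptions, not the runtime system's object layout: `ForwardingHeap` and its methods are hypothetical, and addresses are modeled as vector indices.

```rust
// Toy model of Brooks-style forwarding; names are assumptions for illustration.

pub struct ForwardingHeap {
    /// `forward[addr]` is the forwarding pointer stored in the object at `addr`.
    forward: Vec<usize>,
}

impl ForwardingHeap {
    pub fn new() -> Self {
        ForwardingHeap { forward: Vec::new() }
    }

    /// Newly allocated objects forward to themselves (self-reference).
    pub fn allocate(&mut self) -> usize {
        let addr = self.forward.len();
        self.forward.push(addr);
        addr
    }

    /// Every access is indirected through the forwarding pointer.
    pub fn resolve(&self, addr: usize) -> usize {
        self.forward[addr]
    }

    /// Moves an object that has not been moved yet; the old location forwards to the copy.
    pub fn relocate(&mut self, old_addr: usize) -> usize {
        assert_eq!(self.forward[old_addr], old_addr, "object was already moved");
        let new_addr = self.allocate();
        self.forward[old_addr] = new_addr;
        new_addr
    }

    /// The invariant: an object forwards to itself, or its target forwards to itself.
    pub fn invariant_holds(&self, addr: usize) -> bool {
        let target = self.forward[addr];
        target == addr || self.forward[target] == target
    }

    /// Reference equality must compare the resolved locations.
    pub fn same_object(&self, a: usize, b: usize) -> bool {
        self.resolve(a) == self.resolve(b)
    }
}
```

The assertion in `relocate` reflects the single-level rule: in this model, an object can only be moved from a location that still forwards to itself.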

### Incremental GC PR Stack

The Incremental GC is structured in three PRs to ease review:
1. #3837 **<-- this PR**
2. #3831
3. #3829

# Incremental GC

Incremental evacuating-compacting garbage collector.

**Objective**: Scalable memory management that allows full heap usage.

**Properties**:
* All GC pauses are short and bounded.
* Full-heap snapshot-at-the-beginning marking.
* Focus on reclaiming high-garbage partitions.
* Compacting heap space with partition evacuations.
* Incremental copying enabled by forwarding pointers.
* Using **mark bitmaps** instead of a mark bit in the object headers.
* Limiting number of evacuations on memory shortage.

## Design

The incremental GC distributes its workload across multiple steps, called increments, that each pause the mutator (user's program) for only a limited amount of time. As a result, the GC appears to run concurrently (although not in parallel) to the mutator and thus allows scalable heap usage, where the GC work fits within the instruction-limited IC messages.

Similar to the recent Java Shenandoah GC [1], the incremental GC organizes the heap in equally-sized partitions and selects high-garbage partitions for compaction by using incremental evacuation and the Brooks forwarding pointer technique [2].

The GC runs in three phases:
1. **Incremental Mark**: The GC performs full heap incremental tri-color-marking with snapshot-at-the-beginning consistency. For this purpose, write barriers intercept mutator pointer overwrites between GC mark increments. The target object of an overwritten pointer is thereby marked. Concurrent new object allocations are also conservatively marked. To remember the mark state per object, the GC uses partition-associated mark bitmaps that are temporarily allocated during a GC run. The phase additionally needs a mark stack that is a growable linked table list in the heap that can be recycled as garbage during the active GC run. Full heap marking has the advantage that it can also deal with arbitrarily large cyclic garbage, even if spread across multiple partitions. As a side activity, the mark phase also maintains the bookkeeping of the amount of live data per partition. Conservative snapshot-at-the-beginning marking and retaining new allocations is necessary because the WASM call stack cannot be inspected for the root set collection. Therefore, the mark phase must also only start on an empty call stack.
2. **Incremental Evacuation**: The GC prioritizes partitions with a larger amount of garbage for evacuation based on the available free space. It also requires a defined minimum amount of garbage for a partition to be evacuated. Subsequently, marked objects inside the selected partitions are evacuated to free partitions and thereby compacted. To allow incremental object moving and incremental updating of pointers, each object carries redirection information in its header, which is a forwarding pointer, also called Brooks pointer. For non-moved objects, the forwarding pointer reflexively points back to the object itself, while for moved objects, the forwarding pointer refers to the new object location. Each object access and equality check has to be redirected via this forwarding pointer. During this phase, evacuated partitions are still retained and the original locations of evacuated objects are forwarded to their corresponding new object locations. Therefore, the mutator can continue to use old incoming pointers to evacuated objects.
3. **Incremental Updates**: All pointers to moved objects have to be updated before free space can be reclaimed. For this purpose, the GC performs a full-heap scan and updates all pointers in alive objects to their forwarded address. As the mutator may perform concurrent pointer writes behind the update scan line, a write barrier catches such pointer writes and resolves them to the forwarded locations. The same applies to new object allocations that may have old pointer values in their initialized state (e.g. originating from the call stack). Once this phase is completed, all evacuated partitions are freed and can later be reused for new object allocations. At the same time, the GC also frees the mark bitmaps stored in temporary partitions. The update phase can only be completed when the call stack is empty, since the GC does not access the WASM stack.

No remembered sets are maintained for tracking incoming pointers to partitions.

**Humongous objects**:
* Objects with a size larger than a partition require special handling: A sufficient amount of contiguous free partitions is searched and reserved for a large object. Large objects are not moved by the GC. Once they have become garbage (not marked by the GC), their hosting partitions are immediately freed. Both external and internal fragmentation can only occur for huge objects. Partitions storing large objects do not require a mark bitmap during the GC.

**Increment limit**:
* The GC maintains a synthetic deterministic clock by counting work steps, such as marking an object, copying a word, or updating a pointer. The clock serves for limiting the duration of a GC increment. The GC increment is stopped whenever the limit is reached, such that the GC later resumes its work in a new increment. To also keep the limit on large objects, large arrays are marked and updated in incremental slices. Moreover, huge objects are never moved. For simplicity, the GC increment is only triggered at the compiler-instrumented scheduling points when the call stack is empty. The increment limit is increased depending on the amount of concurrent allocations, to reduce the reclamation latency on a high allocation rate during garbage collection.

**Memory shortage**:
* If memory is scarce during garbage collection, the GC limits the amount of evacuations to the available free space of free partitions. This is to prevent the GC from running out of memory while copying alive objects to new partitions.

## Configuration

* **Partition size**: 32 MB.
* **Increment limit**: Regular increment bounded to 3,500,000 steps (approximately 600 million instructions). Each allocation during GC increases the next scheduled GC increment by 20 additional steps.
* **Survival threshold**: If 85% of a partition space is alive (marked), the partition is not evacuated.
* **GC start**: Scheduled when the growth (new allocations since the last GC run) accounts for more than 65% of the heap size. When passing the critical limit of 3.25GB (on the 4GB heap size), the GC is already started when the growth exceeds 1% of the heap size.

The configuration can be adjusted to tune the GC.

## Measurement

The following results have been measured on the GC benchmark with `dfx` 0.13.1. The `Copying`, `Compacting`, and `Generational` GC are based on the original runtime system ***without*** the forwarding pointer header extension. `No` denotes the disabled GC based on the runtime system ***with*** the forwarding pointer header extension.

### Scalability

**Summary**: The incremental GC allows full 4GB heap usage without exceeding the message instruction limit. It therefore scales much higher than the existing stop-and-go GCs and naturally also higher than without GC.

Average amount of allocations for the benchmark limit cases, until reaching a limit (instruction limit, heap limit, `dfx` cycles limit). Rounded to two significant figures.

| GC                | Avg. Allocation Limit   |
| ----------------- | ----------------------- |
| **Incremental**   | **150e6**               |
| No                | 47e6                    |
| Generational      | 33e6                    |
| Compacting        | 37e6                    |
| Copying           | 47e6                    |

3x higher than the other GCs and also than no GC.

Currently, the following limit benchmark cases do not reach the 4GB heap maximum due to GC-independent reasons:
* `buffer` applies exponential array list growth where the copying to the larger array exceeds the instruction limit.
* `rb-tree`, `trie-map`, and `btree-map` are so garbage-intensive that they run out of `dfx` cycles or suffer from a sudden `dfx` network connection interruption.

### GC Pauses

Longest GC pause, maximum of all benchmark cases:

| GC                | Longest GC Pause          |
| ----------------- | ------------------------- |
| **Incremental**   | **0.712e9**               |
| Generational      | 1.19e9                    |
| Compacting        | 8.41e9                    |
| Copying           | 5.90e9                    |

Shorter than all the other GCs.

### Performance

Total number of instructions (mutator + GC), average across all benchmark cases:

| GC                | Avg. Total Instructions |
| ----------------- | ----------------------- |
| **Incremental**   | **1.85e10**             |
| Generational      | 1.91e10                 |
| Compacting        | 2.20e10                 |
| Copying           | 2.05e10                 |

Faster than all the other GCs.

Mutator utilization on average:

| GC                | Avg. Mutator Utilization |
| ----------------- | ------------------------ |
| **Incremental**   | **94.6%**                |
| Generational      | 85.4%                    |
| Compacting        | 75.8%                    |
| Copying           | 78.7%                    |

Higher than the other GCs.

### Memory Size

Occupied heap size at the end of each benchmark case, average across all cases:

| GC                | Avg. Final Heap Occupation |
| ----------------- | -------------------------- |
| **Incremental**   | **176 MB**                 |
| No                | 497 MB                     |
| Generational      | 156 MB                     |
| Compacting        | 144 MB                     |
| Copying           | 144 MB                     |

Up to 22% higher than the other GCs.

Allocated WASM memory space, benchmark average:

| GC                | Avg. Memory Size        |
| ----------------- | ----------------------- |
| **Incremental**   | **296 MB**              |
| No                | 499 MB                  |
| Generational      | 191 MB                  |
| Compacting        | 188 MB                  |
| Copying           | 271 MB                  |

9% higher than the copying GC. Up to 57% higher (worse) than the generational and the compacting GC.

## Overheads

Additional mutator costs implied by the incremental GC:
* **Write barrier**:
    - During the mark and evacuation phase: Marking the target of overwritten pointers.
    - During the update phase: Resolving forwarding of written pointers.
* **Allocation barrier**:
    - During the mark and evacuation phase: Marking newly allocated objects.
    - During the update phase: Resolving pointer forwarding in initialized objects.
* **Pointer forwarding**:
    - Indirecting each object access and equality check via the forwarding pointer.

Runtime costs for the barrier are reported in #3831.
Runtime costs for the forwarding pointers are reported in #3829.

## Testing

1. RTS unit tests

    In Motoko repo folder `rts`:
    ```
    make test
    ```

2. Motoko test cases

    In Motoko repo folder `test/run` and `test/run-drun`:
    ```
    export EXTRA_MOC_ARGS="--sanity-checks --incremental-gc"
    make
    ```

3. GC Benchmark cases

    In `gcbench` repo:
    ```
    ./measure-all.sh
    ```

4. Extensive memory sanity checks

    Adjust `Cargo.toml` in `rts/motoko-rts` folder:
    ```
    default = ["ic", "memory-check"]
    ```

    Run selected benchmark and test cases. Some of the tests will exceed the instruction limit due to the expensive checks.

## Extension to 64-Bit Heaps

The partition information design would need to be adjusted to store the partition information dynamically instead of in a static allocation. For example, the information could be stored in a reserved space at the beginning of a partition (except if the partition has static data or serves as an extension for hosting a huge object). Apart from that, the GC should be portable and scalable without significant design changes on 64-bit memory.

## Design Alternatives

* **Free list**: See the prototype in #3678. The free-list-based incremental GC shows higher reclamation latency, slower performance (free list selection), and potentially higher external fragmentation (no compaction, just free neighbor merging).
* **Mark bit in object header**: See implementation in #3756. Storing the mark bit in the object header instead of using a mark bitmap saves memory space, but is more expensive for scanning sparsely marked partitions. Moreover, it increases the amount of dirty pages.
* **Remembered set**: Inter-partition pointers could be stored in a remembered set to allow more selective and faster pointer updates. However, the write barrier would become more expensive, as it would have to detect and store relevant pointers in the remembered set. Also, the remembered set would occupy additional memory.
* **Allocation increments**: On a high allocation rate, the GC could also perform a short GC increment during an allocation. This design is however more complicated, as it prevents the compiler from storing low-level pointers on the stack while performing an allocation (e.g. during assignments or array tabulate). It is also slower than the current solution, where allocation increments are postponed to the next regularly scheduled GC increment, running when the call stack is empty.
* **Special incremental GC**: Analyzed in PR #3894. An incremental GC based on a central object table that allows easy object movement and incremental compaction. Compared to this PR, the special GC has 35% worse runtime performance.
* **Combining tag and forwarding pointer**: #3904. This seems to be less efficient than the Brooks pointer technique, with a runtime performance degradation of 27.5% while only offering a small memory saving of around 2%.

## References

[1] C. H. Flood, R. Kennke, A. Dinn, A. Haley, and R. Westrelin. Shenandoah. An Open-Source Concurrent Compacting Garbage Collector for OpenJDK. Intl. Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools, PPPJ'16, Lugano, Switzerland, August 2016.

[2] R. A. Brooks. Trading Data Space for Reduced Time and Code Space in Real-Time Garbage Collection on Stock Hardware. ACM Symposium on LISP and Functional Programming, LFP'84, New York, NY, USA, 1984.
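
The configuration values stated in the incremental GC description (32 MB partitions, 85% survival threshold, 3,500,000 steps per increment plus 20 per allocation, GC start at 65% growth or 1% above 3.25GB) can be read as three small decision functions. This is a sketch with hypothetical constant and function names; the exact comparisons, rounding, and units in the runtime system are assumptions here.

```rust
// Hypothetical names; values are taken from the configuration described in the PR text.
pub const MB: u64 = 1024 * 1024;
pub const PARTITION_SIZE: u64 = 32 * MB;
pub const SURVIVAL_THRESHOLD_PERCENT: u64 = 85;
pub const INCREMENT_LIMIT: u64 = 3_500_000;
pub const STEPS_PER_ALLOCATION: u64 = 20;
pub const CRITICAL_HEAP_LIMIT: u64 = 3328 * MB; // 3.25 GB
pub const REGULAR_GROWTH_PERCENT: u64 = 65;
pub const CRITICAL_GROWTH_PERCENT: u64 = 1;

/// A partition is an evacuation candidate only if less than 85% of it is alive.
pub fn is_evacuation_candidate(live_bytes: u64) -> bool {
    live_bytes * 100 < PARTITION_SIZE * SURVIVAL_THRESHOLD_PERCENT
}

/// Step budget of the next GC increment, extended by allocations made during the GC run.
pub fn next_increment_limit(allocations_during_gc: u64) -> u64 {
    INCREMENT_LIMIT + STEPS_PER_ALLOCATION * allocations_during_gc
}

/// Decides whether a new GC run is scheduled, given the heap size and
/// the bytes allocated since the last GC run.
pub fn should_start_gc(heap_size: u64, growth_since_last_gc: u64) -> bool {
    let threshold_percent = if heap_size >= CRITICAL_HEAP_LIMIT {
        CRITICAL_GROWTH_PERCENT
    } else {
        REGULAR_GROWTH_PERCENT
    };
    growth_since_last_gc * 100 > heap_size * threshold_percent
}
```

For example, under these assumptions a 100 MB heap starts a GC run after more than 65 MB of new allocations, while a 3.5 GB heap starts one after about 36 MB.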

luc-blaeser added 30 commits January 26, 2023 13:59

Adjust test

f7b3dd5

Update benchmark results

ff0bea6

Revert test run

0510e5a

Bug fix

f1ca154

Adjust test

fd3e13a

Extended object header

Adjust tests

994a9ac

For extensive memory checks in incremental GC debug mode

Code refactoring

60f60fd

Selective memory checks

0e6c263

Revert "Selective memory checks"

6891ed0

This reverts commit 0e6c263.

Make timer tests less time-dependent

26872cb

Revert test run

fe2880f

Remove allocation increment again

16d1022

GC tuning

a507eff

Merge branch 'master' into luc/incremental-gc

58e58ca

Optimization

008f2b8

Bug fix

a146ffa

Revert "Bug fix"

6f4dd99

This reverts commit a146ffa.

GC tuning

9009fbb

Merge branch 'master' into luc/incremental-gc

9d4fa63

GC tuning

36db146

Reintroduce allocation increment

a4c1341

GC tuning, phase changes in allocation increments

49bde61

GC tuning

5bb7d60

GC tuning

7cc5592

GC tuning

0ec6edf

GC tuning

ba7a146

Merge branch 'master' into luc/incremental-gc

fed433d

Split too large test case

aae861b

Downscale test for extensive memory sanity checks

f0246e1

GC tuning

6709ff0

luc-blaeser and others added 19 commits April 12, 2023 15:29

Merge branch 'master' into luc/forwarding_pointer

59449d5

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

dff7277

Reformat

20a2dba

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

a347d5d

Reformat

9fe83ae

Revert unnecessary changes

72af011

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

033c263

Merge branch 'master' into luc/forwarding_pointer

0513c82

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

47d20ee

Merge branch 'master' into luc/forwarding_pointer

2159789

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

910eb1a

Updating nix hashes

01c0f8f

Adjust tests for different stack size

442eb15

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

98dce2c

Refactoring: Stack size test cases

c7ba335

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

b680ecc

Revert "Updating nix hashes"

fcb4e94

This reverts commit 01c0f8f.

Merge branch 'master' into luc/forwarding_pointer

09510f7

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

ed82364

crusso approved these changes May 10, 2023


luc-blaeser added 2 commits May 10, 2023 13:21

Merge branch 'master' into luc/forwarding_pointer

647d517

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

0bba417

Base automatically changed from luc/forwarding_pointer to master May 12, 2023 07:43

Merge branch 'master' into luc/incremental-preparation

d1d204c

luc-blaeser added the automerge-squash When ready, merge (using squash) label May 12, 2023

mergify bot merged commit 55e3a33 into master May 12, 2023
9 checks passed

mergify bot removed the automerge-squash When ready, merge (using squash) label May 12, 2023

luc-blaeser deleted the luc/incremental-preparation branch May 12, 2023 07:57
