Incremental GC (Intermediate Version) #3678


This is only an intermediate result of work in progress. Not for merging.

Intermediate Version of the Incremental GC

Free-list-based incremental garbage collector. Intermediate version on the implementation roadmap towards an efficient evacuation-compacting incremental GC.

  • Incremental mark-and-sweep collection.
  • Free space managed by a segregated free list.

Note: This version is non-compacting, i.e. it may suffer from external fragmentation. It also has other specific downsides that are expected to be resolved by the next implementation step, the targeted incremental evacuation-compacting GC; cf. the section on "Limitations".

Motivation for this intermediate version:

  • Testing the incremental mark phase component that will be integrated into the targeted compacting incremental GC.
  • Collecting performance results for comparison with the existing GCs and with future incremental GC versions.

Design

Incremental garbage collection distributes the free space reclamation work across multiple small increments that each pause the mutator (the normal program) only for a short amount of time. As a result, the mutator is never blocked for a long, indefinite time, which provides the experience of a smooth, non-disruptive GC. Moreover, in the context of the IC, the GC increments are short enough not to exceed the message instruction limit (with or without DTS). This allows large-scale heap usage up to the full heap size (currently 4 GB).

Collector

  • Bounded GC increments: The GC is composed of two phases, mark and sweep, that are each split into increments of work with a configurable maximum number of operation steps (such as marking objects or array slices, traversing heap blocks, or merging free blocks). Between increments, the mutator can continue to run, with potentially arbitrary changes to the memory. A new GC run is initiated whenever heap usage exceeds a specified growth limit. The GC increments are triggered at the usual GC scheduling points instrumented by the compiler.

  • Snapshot-at-the-beginning marking: The GC incrementally marks all objects that were transitively reachable from the root set according to a (conceptual) memory snapshot taken at the start of the GC run. This snapshot view is realized by a write barrier that catches all pointers that are overwritten by the mutator during the mark phase. The mark phase additionally traverses these barrier-recorded pointers. New allocations occurring during the mark phase are conservatively marked, i.e. new objects are retained at least until the next GC run. Since the GC cannot access the call stack (due to a WASM security restriction), it has to start on an empty call stack and also has to conservatively mark new allocations. A minimal sketch of this scheme is shown after this list.

  • Lazy sweep: After the mark phase has completed, the GC sweeps the heap in multiple increments. Unmarked objects are thereby inserted into the free list, while neighboring free blocks are merged. When a marked object is encountered, its mark bit is cleared. Concurrent allocations during the sweep phase are retained by marking all newly allocated objects that lie behind the sweep line. Free blocks are recorded in a segregated free list to guarantee constant-time allocation, free, and merge operations, except for huge blocks that are managed in an overflow list. A sweep-increment sketch also follows this list.
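
A minimal sketch of the snapshot-at-the-beginning scheme, assuming simplified stand-in types (`MarkState`, `ObjId`, and the method names are illustrative, not the actual motoko-rts API):

```rust
use std::collections::{HashSet, VecDeque};

/// Stand-in for a heap pointer.
type ObjId = usize;

struct MarkState {
    marked: HashSet<ObjId>,     // mark bits (per-object flags in the real RTS)
    work_list: VecDeque<ObjId>, // marked objects whose fields still need scanning
}

impl MarkState {
    /// Write barrier: called before the mutator overwrites a pointer field
    /// during the mark phase. The old target belongs to the snapshot taken
    /// at the start of the GC run, so it must still be traced.
    fn write_barrier(&mut self, old_target: Option<ObjId>) {
        if let Some(obj) = old_target {
            self.mark(obj);
        }
    }

    /// Allocations during the mark phase are conservatively marked, so new
    /// objects survive at least until the next GC run.
    fn allocation_barrier(&mut self, new_obj: ObjId) {
        self.marked.insert(new_obj);
    }

    fn mark(&mut self, obj: ObjId) {
        if self.marked.insert(obj) {
            self.work_list.push_back(obj); // newly marked: scan its fields later
        }
    }

    /// One bounded mark increment: perform at most `limit` steps, then return
    /// control to the mutator. `fields_of` stands for reading an object's
    /// pointer fields from the heap.
    fn mark_increment(&mut self, limit: usize, fields_of: impl Fn(ObjId) -> Vec<ObjId>) {
        for _ in 0..limit {
            match self.work_list.pop_front() {
                Some(obj) => {
                    for field in fields_of(obj) {
                        self.mark(field);
                    }
                }
                None => break, // mark phase complete
            }
        }
    }
}
```

Analogously, a bounded sweep increment can be pictured as follows; the linear block layout and the merge handling are heavily simplified, and `Sweeper` is again a hypothetical name:

```rust
struct Block {
    size: usize,
    marked: bool,
    free: bool,
}

struct Sweeper {
    heap: Vec<Block>,  // contiguous blocks, in address order
    sweep_line: usize, // index of the next block to visit
}

impl Sweeper {
    /// Process at most `limit` blocks, then yield back to the mutator.
    /// Returns true when the sweep phase is finished.
    fn sweep_increment(&mut self, limit: usize) -> bool {
        let mut steps = 0;
        while steps < limit && self.sweep_line < self.heap.len() {
            let i = self.sweep_line;
            if self.heap[i].marked {
                // Alive: clear the mark bit for the next GC run.
                self.heap[i].marked = false;
            } else {
                // Garbage: turn the block into free space and merge it with a
                // free left neighbor to reduce external fragmentation. The
                // real collector also inserts/merges entries in the
                // segregated free list here.
                let freed_size = self.heap[i].size;
                self.heap[i].free = true;
                if i > 0 && self.heap[i - 1].free {
                    self.heap[i - 1].size += freed_size;
                    self.heap[i].size = 0; // coalesced; stub kept for simplicity
                }
            }
            self.sweep_line += 1;
            steps += 1;
        }
        self.sweep_line >= self.heap.len()
    }
}
```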

Free List

  • Segregated free list: The mutator primarily allocates free space from the segregated free list. Only if the free list cannot provide sufficient free space does the runtime system grow the heap. The segregated free list consists of a fixed number of free lists with increasing size classes. In constant time, the allocator can determine the matching free list in which all blocks have at least the requested allocation size. The free remainder of an allocated block is put back into the corresponding free list to reduce internal fragmentation (i.e. unused space inside a heap block). The last list serves as an overflow storage for huge free blocks and involves a linear first-fit allocation search. The free lists are doubly linked to allow O(1) incremental merging during the sweep phase, without having to clear and rebuild the entire free list. A sketch of the size-class selection follows.
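
The following sketch illustrates the constant-time size-class selection, using the class boundaries from the configuration below; `free_list_index` and `alloc_list_index` are illustrative names, not the actual `freelist` module API:

```rust
const KB: usize = 1024;
const MB: usize = 1024 * KB;

/// Lower bounds of the size classes: [12, 32), [32, 1 KB), [1 KB, 32 MB), [32 MB, 4 GB).
const SIZE_CLASSES: [usize; 4] = [12, 32, KB, 32 * MB];
const OVERFLOW: usize = SIZE_CLASSES.len() - 1;

/// List in which a block of `size` is filed when it is freed:
/// the largest class whose lower bound does not exceed the block size.
/// The loop runs over a small fixed constant, so this is effectively O(1).
fn free_list_index(size: usize) -> usize {
    (0..SIZE_CLASSES.len())
        .rev()
        .find(|&i| SIZE_CLASSES[i] <= size)
        .unwrap_or(0)
}

/// First list in which *every* block is guaranteed to satisfy an allocation
/// of `size`. Requests that only the overflow list can serve fall back to a
/// linear first-fit search there.
fn alloc_list_index(size: usize) -> usize {
    let i = free_list_index(size);
    if SIZE_CLASSES[i] >= size { i } else { (i + 1).min(OVERFLOW) }
}

fn main() {
    assert_eq!(free_list_index(100), 1);  // a freed 100-byte block goes to [32, 1 KB)
    assert_eq!(alloc_list_index(100), 2); // but only [1 KB, 32 MB) guarantees a fit
    assert_eq!(alloc_list_index(32), 1);  // exact lower bound: same class suffices
}
```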

Configuration

  • GC start: A new GC run is initiated when (1) the free space in the heap is less than 25% on a heap larger than 32 MB, or (2) less than 512 MB of free space is available in the entire heap. This heuristic can easily be adjusted in incremental::should_start(); a sketch of it is shown after this list.

  • GC increment: A GC increment is bounded to 500,000 steps, counting each of the following operations as one step: (1) marking an object, (2) marking an array slice of 128 elements, (3) traversing a heap block in the sweep phase, and (4) merging free blocks. The bound is defined by the constant INCREMENT_LIMIT.

  • Free list: The segregated free list uses 4 size classes: [12, 32), [32, 1 KB), [1 KB, 32 MB), and [32 MB, 4 GB). The last list (>= 32 MB) constitutes the overflow list with a first-fit allocation strategy. This configuration can be adjusted in freelist::SIZE_CLASSES.
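
A hedged sketch of the start heuristic described under "GC start" above; how the remaining 4 GB address space is accounted for in criterion (2) is an assumption, and the real incremental::should_start() may differ in detail:

```rust
const MB: u64 = 1024 * 1024;
const GB: u64 = 1024 * MB;

const CRITICAL_HEAP_SIZE: u64 = 32 * MB; // relative criterion only above this size
const FREE_FRACTION: f64 = 0.25;         // start below 25% free space in the heap
const ABSOLUTE_FREE_LIMIT: u64 = 512 * MB;
const MAX_HEAP_SIZE: u64 = 4 * GB;       // 32-bit Wasm memory

/// `heap_size`: currently allocated heap; `free_space`: total size of free blocks.
fn should_start(heap_size: u64, free_space: u64) -> bool {
    // (1) On a heap larger than 32 MB, start when less than 25% of it is free.
    let low_relative_free = heap_size > CRITICAL_HEAP_SIZE
        && (free_space as f64) < FREE_FRACTION * heap_size as f64;
    // (2) Start when less than 512 MB remains available in the entire 4 GB heap
    //     (free blocks plus the not yet allocated part of the address space).
    let low_absolute_free = MAX_HEAP_SIZE - heap_size + free_space < ABSOLUTE_FREE_LIMIT;
    low_relative_free || low_absolute_free
}
```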

Performance

Note: The performance is not yet as good as desired; see also the section on "Limitations". Significant improvements are expected with the next GC implementation step, the incremental evacuation-compacting GC.

Nevertheless, according to the GC benchmark, the incremental GC is typically more efficient than the current copying and compacting GC. As expected, it scales to a larger allocatable heap space and also entails substantially shorter pauses than all three other GCs.

The GC benchmark measurements have been performed on the release build (disabled --sanity-checks option) under dfx 0.12.0. Results rounded to two significant figures. The No configuration denotes a runtime system without garbage collection, i.e. no memory reclamation.

Scalability

Average amount of allocations for the benchmark limit cases, until reaching a limit (instruction limit or heap limit).

| GC              | Avg. Allocation Limit |
| --------------- | --------------------- |
| **Incremental** | **42e6**              |
| No              | 40e6                  |
| Generational    | 16e6                  |
| Compacting      | 14e6                  |
| Copying         | 15e6                  |

2.6x to 3x higher than the existing GCs. Similar to a "no-GC" execution.
Issue: A degradation of around 40% for rb-tree, trie-map, and btree-map, which create a lot of short-lived objects. Cf. the "Limitations" section.

Average size of allocatable heap space for the benchmark limit cases.

| GC              | Avg. Allocatable Heap Space |
| --------------- | --------------------------- |
| **Incremental** | **3.6 GB**                  |
| No              | 3.7 GB                      |
| Generational    | 0.92 GB                     |
| Compacting      | 0.70 GB                     |
| Copying         | 0.84 GB                     |

4-5x higher than the existing GCs. Similar to "no-GC" execution.
Same issue of high reclamation latency for programs with a large number of short-lived objects.

GC Pauses

Longest GC pause, benchmark average:

| GC              | Longest GC Pause on Avg. |
| --------------- | ------------------------ |
| **Incremental** | **0.45e8**               |
| Generational    | 2.1e8                    |
| Compacting      | 9.9e8                    |
| Copying         | 9.1e8                    |

4.7x shorter than the generational GC, 22x shorter than the compacting GC, and 20x shorter than the copying GC.

Average GC pause, benchmark average:

| GC              | Avg. GC Pause |
| --------------- | ------------- |
| **Incremental** | **0.16e8**    |
| Generational    | 0.30e8        |
| Compacting      | 1.1e8         |
| Copying         | 0.92e8        |

1.9x shorter than the generational GC, 6.9x shorter than the compacting GC, and 5.8x shorter than the copying GC.

Performance

Total number of instructions (mutator + GC), average across all benchmark cases:

| GC              | Avg. Total Instructions |
| --------------- | ----------------------- |
| **Incremental** | **2.1e10**              |
| Generational    | 1.8e10                  |
| Compacting      | 2.4e10                  |
| Copying         | 2.2e10                  |

14% slower than the generational GC, 14% faster than the compacting GC, and 5% faster than the copying GC.
Note: The total runtime with the incremental GC is not expected to be faster than with the other GCs, due to the extra overheads (write barrier, marking of new allocations, free list).

Mutator utilization on average:

| GC              | Avg. Mutator Utilization |
| --------------- | ------------------------ |
| **Incremental** | **90%**                  |
| Generational    | 80%                      |
| Compacting      | 62%                      |
| Copying         | 65%                      |

10 percentage points higher (better) than the generational GC, and at least 25 percentage points higher than the compacting and the copying GC.

Memory Size

Allocated WASM memory space, benchmark average:

| GC              | Avg. Memory Size |
| --------------- | ---------------- |
| **Incremental** | **410 MB**       |
| Generational    | 220 MB           |
| Compacting      | 220 MB           |
| Copying         | 290 MB           |

41% higher (worse) than the copying GC, 86% higher than the compacting and the generational GC.
Note: This degradation is due to the three benchmark cases that produce a large amount of short-lived objects (rb-tree, trie-map, and btree-map). Cf. the "Limitations" section.

Occupied heap size (without free blocks) at the end of each benchmark case, average across all cases:

| GC              | Avg. Final Heap Occupation |
| --------------- | -------------------------- |
| **Incremental** | **410 MB**                 |
| Generational    | 150 MB                     |
| Compacting      | 150 MB                     |
| Copying         | 150 MB                     |

Around 2.7x higher (worse) than the other GCs.
The same reason applies: high reclamation latency for young objects, caused by the three benchmark cases rb-tree, trie-map, and btree-map.

Testing

  1. RTS unit tests

    In Motoko repo folder rts:

    make test
    
  2. Motoko test cases

    In Motoko repo folder test/run and test/run-drun:

    export EXTRA_MOC_ARGS="--sanity-checks --incremental-gc --force-gc"
    make
    
  3. GC Benchmark cases

    In gcbench repo:

    ./measure-all.sh
    

Limitations

  • Young object reclamation latency: Short-lived objects are only reclaimed after a relatively long delay, namely at the next GC run, due to the restricted call stack access. Moreover, this intermediate version of the GC always performs a full-heap sweep phase, which takes several increments to complete (usually many more than the mark phase). Therefore, the reclamation latency is rather high, especially for short-lived objects that could otherwise be reclaimed quickly. More specifically, the benchmark cases rb-tree, trie-map, and btree-map create gigabytes of garbage even during the traversal of the data structure.
    • Outlook: With the planned evacuation-compacting incremental GC, a significant improvement is expected, as the GC can focus on high-garbage partitions for freeing and thus shorten the reclamation latency. This will not only benefit young object collection, but also other object lifetime patterns. However, the incremental GC will probably never be as optimized for young object reclamation as the generational GC.
  • Slow allocation with free list: Allocation from a free list is inherently slower than bump allocation, which only advances a free pointer within a contiguous free space (see the sketch after this list). The total runtime overhead of the free list is estimated at around 26%, by comparing the number of mutator instructions between the incremental and the generational GC (both containing a write barrier).
    • Outlook: The envisioned evacuation-compacting incremental GC will support bump allocation on free partitions, thus significantly speeding up allocation.
  • External fragmentation: A GC based on a free list can inherently suffer from external fragmentation, where in the worst case, the free space can be split into many small non-contiguous blocks, such that a larger allocation cannot be accommodated although the sum of free space would be sufficient.
    • Outlook: This is inherently solved by the targeted evacuation-compacting incremental GC, which defragments the heap. Fragmentation problems could then only occur for humongous objects larger than the partition size.
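
For contrast with the free-list allocation described above, a bump allocator over a contiguous free range (as the planned evacuation-compacting GC could use on free partitions) reduces allocation to a pointer increment and a bounds check. Minimal sketch with illustrative names:

```rust
struct BumpAllocator {
    next: usize, // address of the next free byte
    end: usize,  // exclusive end of the contiguous free space
}

impl BumpAllocator {
    /// Constant-time allocation: advance the free pointer; no list search,
    /// no block splitting, no remainder handling.
    fn allocate(&mut self, size: usize) -> Option<usize> {
        let addr = self.next;
        let new_next = addr.checked_add(size)?;
        if new_next <= self.end {
            self.next = new_next;
            Some(addr)
        } else {
            None // contiguous space exhausted; the GC must provide a new range
        }
    }
}
```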

Design Alternatives

  • More size classes in the segregated free list: Using more fine-grained size classes in the segregated free list proves to be less efficient, even with particular optimizations such as using binary search for finding a matching free list and storing fast-forward pointers to the next non-empty free list. Increasing the number of size classes from 4 to 16 reduces total performance by around 28%. See the branch with a more fine-grained free list.
  • Triggering GC increments on allocation: To cope with a high allocation rate, especially for programs allocating a lot of temporary objects (functional style, computation-bound objects), an additional smaller GC increment could be combined with an allocation (although without starting a GC phase, because the call stack root set is not available). While this economizes memory usage by 29%, performance drops by around 28%. See the branch on allocation increments.


github-actions bot commented Jan 5, 2023

Comparing from 44ecd26 to 30dbc5e:
In terms of gas, 4 tests regressed and the mean change is +3.6%.
In terms of size, 4 tests regressed and the mean change is +2.1%.

luc-blaeser closed this pull request on Feb 1, 2023.
mergify bot pushed a commit that referenced this pull request May 12, 2023
### Incremental GC PR Stack
The Incremental GC is structured in three PRs to ease review:
1. #3837 **<-- this PR**
2. #3831
3. #3829

# Incremental GC

Incremental evacuating-compacting garbage collector.

**Objective**: Scalable memory management that allows full heap usage.

**Properties**:
* All GC pauses have bounded short time.
* Full-heap snapshot-at-the-beginning marking.
* Focus on reclaiming high-garbage partitions.
* Compacting heap space with partition evacuations.
* Incremental copying enabled by forwarding pointers.
* Using **mark bitmaps** instead of a mark bit in the object headers.
* Limiting number of evacuations on memory shortage.

## Design

The incremental GC distributes its workload across multiple steps, called increments, that each pause the mutator (the user's program) only for a limited amount of time. As a result, the GC appears to run concurrently (although not in parallel) with the mutator and thus allows scalable heap usage, where the GC work fits within the instruction-limited IC messages.

Similar to the recent Java Shenandoah GC [1], the incremental GC organizes the heap in equally-sized partitions and selects high-garbage partitions for compaction by using incremental evacuation and the Brooks forwarding pointer technique [2].

The GC runs in three phases:
1. **Incremental Mark**: The GC performs full-heap incremental tri-color marking with snapshot-at-the-beginning consistency. For this purpose, write barriers intercept mutator pointer overwrites between GC mark increments; the target object of an overwritten pointer is thereby marked. Concurrent new object allocations are also conservatively marked. To remember the mark state per object, the GC uses partition-associated mark bitmaps that are temporarily allocated during a GC run (see the sketch after this list). The phase additionally needs a mark stack, a growable, linked list of tables in the heap that can be recycled as garbage during the active GC run. Full-heap marking has the advantage that it can also deal with arbitrarily large cyclic garbage, even if spread across multiple partitions. As a side activity, the mark phase also maintains the bookkeeping of the amount of live data per partition. Conservative snapshot-at-the-beginning marking and retaining new allocations are necessary because the WASM call stack cannot be inspected for root set collection. Therefore, the mark phase must also only start on an empty call stack.

2. **Incremental Evacuation**: The GC prioritizes partitions with a larger amount of garbage for evacuation, based on the available free space. It also requires a defined minimum amount of garbage for a partition to be evacuated. Subsequently, marked objects inside the selected partitions are evacuated to free partitions and thereby compacted. To allow incremental object moving and incremental updating of pointers, each object carries redirection information in its header, a forwarding pointer, also called a Brooks pointer. For non-moved objects, the forwarding pointer reflexively points back to the object itself, while for moved objects, the forwarding pointer refers to the new object location. Each object access and equality check has to be redirected via this forwarding pointer. During this phase, evacuated partitions are still retained and the original locations of evacuated objects forward to their corresponding new locations. Therefore, the mutator can continue to use old incoming pointers to evacuated objects.

3. **Incremental Updates**: All pointers to moved objects have to be updated before free space can be reclaimed. For this purpose, the GC performs a full-heap scan and updates all pointers in live objects to their forwarded addresses. As the mutator may perform concurrent pointer writes behind the update scan line, a write barrier catches such writes and resolves them to the forwarded locations. The same applies to new object allocations that may have old pointer values in their initialized state (e.g. originating from the call stack). Once this phase is completed, all evacuated partitions are freed and can later be reused for new object allocations. At the same time, the GC also frees the mark bitmaps stored in temporary partitions. The update phase can only be completed when the call stack is empty, since the GC does not access the WASM stack. No remembered sets are maintained for tracking incoming pointers to partitions.
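
As a rough illustration of the per-partition mark bitmaps used in the mark phase, the following sketch reserves one mark bit per 4-byte heap word of a 32 MB partition; the exact bitmap layout and placement in the runtime system may differ, and `MarkBitmap` is an illustrative name:

```rust
const PARTITION_SIZE: usize = 32 * 1024 * 1024; // 32 MB, see Configuration
const WORD_SIZE: usize = 4;
const BITMAP_SIZE: usize = PARTITION_SIZE / WORD_SIZE / 8; // one bit per word

struct MarkBitmap {
    partition_start: usize,
    bits: Box<[u8; BITMAP_SIZE]>, // allocated only while a GC run is active
}

impl MarkBitmap {
    /// Mark the object at `address`; returns true if it was not marked before,
    /// in which case the caller pushes it on the mark stack for scanning.
    fn mark(&mut self, address: usize) -> bool {
        let word = (address - self.partition_start) / WORD_SIZE;
        let (byte, bit) = (word / 8, word % 8);
        let was_marked = self.bits[byte] & (1 << bit) != 0;
        self.bits[byte] |= 1 << bit;
        !was_marked
    }

    fn is_marked(&self, address: usize) -> bool {
        let word = (address - self.partition_start) / WORD_SIZE;
        self.bits[word / 8] & (1 << (word % 8)) != 0
    }
}
```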

**Humongous objects**:
* Objects with a size larger than a partition require special handling: a sufficient number of contiguous free partitions is searched for and reserved for the large object. Large objects are not moved by the GC. Once they have become garbage (not marked by the GC), their hosting partitions are immediately freed. Both external and internal fragmentation can only occur for huge objects. Partitions storing large objects do not require a mark bitmap during the GC. A sketch of the partition reservation is shown below.
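
A small sketch of the contiguous-partition search mentioned above; `find_contiguous_free` is a hypothetical helper, and the real partition-table bookkeeping is more involved:

```rust
const PARTITION_SIZE: usize = 32 * 1024 * 1024;

/// Find the start index of the first run of contiguous free partitions that
/// can host an object of `object_size` bytes.
fn find_contiguous_free(free: &[bool], object_size: usize) -> Option<usize> {
    let needed = (object_size + PARTITION_SIZE - 1) / PARTITION_SIZE; // round up
    let mut run_start = 0;
    let mut run_len = 0;
    for (i, &is_free) in free.iter().enumerate() {
        if is_free {
            if run_len == 0 {
                run_start = i;
            }
            run_len += 1;
            if run_len == needed {
                return Some(run_start); // reserve partitions run_start..run_start+needed
            }
        } else {
            run_len = 0;
        }
    }
    None // no sufficient contiguous free space (external fragmentation)
}
```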

**Increment limit**:
* The GC maintains a synthetic, deterministic clock by counting work steps, such as marking an object, copying a word, or updating a pointer. The clock serves to limit the duration of a GC increment: the increment is stopped whenever the limit is reached, and the GC later resumes its work in a new increment. To keep the limit also for large objects, large arrays are marked and updated in incremental slices. Moreover, huge objects are never moved.
For simplicity, a GC increment is only triggered at the compiler-instrumented scheduling points when the call stack is empty. The increment limit is increased depending on the amount of concurrent allocations, to reduce the reclamation latency under a high allocation rate during garbage collection. A sketch of this clock follows.
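
The synthetic clock can be pictured roughly as follows; the constants anticipate the "Configuration" section below, and the struct and method names are illustrative rather than the actual RTS time-accounting API:

```rust
const BASE_INCREMENT_LIMIT: usize = 3_500_000;
const STEPS_PER_CONCURRENT_ALLOCATION: usize = 20;

struct GcClock {
    steps: usize,
    limit: usize,
}

impl GcClock {
    /// Start a new increment. The limit grows with the number of allocations
    /// since the last increment, to keep reclamation latency low under a high
    /// allocation rate during the GC run.
    fn new_increment(concurrent_allocations: usize) -> Self {
        GcClock {
            steps: 0,
            limit: BASE_INCREMENT_LIMIT
                + concurrent_allocations * STEPS_PER_CONCURRENT_ALLOCATION,
        }
    }

    /// Account one unit of GC work (marking an object, copying a word,
    /// updating a pointer, ...).
    fn tick(&mut self) {
        self.steps += 1;
    }

    /// The current increment stops when this returns true; the GC resumes
    /// its work in a later increment.
    fn is_exhausted(&self) -> bool {
        self.steps >= self.limit
    }
}
```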

**Memory shortage**
* If memory is scarce during garbage collection, the GC limits the amount of evacuations to the available space of free partitions. This prevents the GC from running out of memory while copying live objects to new partitions.

## Configuration

* **Partition size**: 32 MB.

* **Increment limit**: Regular increment bounded to 3,500,000 steps (approximately 600 million instructions). Each allocation during GC increases the next scheduled GC increment by 20 additional steps.

* **Survival threshold**: If 85% of a partition space is alive (marked), the partition is not evacuated.

* **GC start**: Scheduled when the growth (new allocations since the last GC run) accounts for more than 65% of the heap size. When passing the critical limit of 3.25 GB (on the 4 GB heap), the GC is already started when the growth exceeds 1% of the heap size. A sketch of this heuristic is shown below.

The configuration can be adjusted to tune the GC.
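
A hedged sketch of the GC start rule with the thresholds from this configuration (the exact accounting of heap size and growth in the implementation may differ):

```rust
const GB: u64 = 1024 * 1024 * 1024;

const GROWTH_THRESHOLD: f64 = 0.65;          // regular trigger
const CRITICAL_HEAP_SIZE: u64 = 13 * GB / 4; // 3.25 GB of the 4 GB heap
const CRITICAL_GROWTH_THRESHOLD: f64 = 0.01; // near the limit, start much earlier

/// `heap_size`: current heap size; `growth`: new allocations since the last GC run.
fn should_schedule_gc(heap_size: u64, growth: u64) -> bool {
    let threshold = if heap_size > CRITICAL_HEAP_SIZE {
        CRITICAL_GROWTH_THRESHOLD
    } else {
        GROWTH_THRESHOLD
    };
    growth as f64 > threshold * heap_size as f64
}
```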

## Measurement

The following results have been measured on the GC benchmark with `dfx` 0.13.1. The `Copying`, `Compacting`, and `Generational` GCs are based on the original runtime system ***without*** the forwarding pointer header extension. `No` denotes the disabled GC based on the runtime system ***with*** the forwarding pointer header extension.

### Scalability

**Summary**: The incremental GC allows full 4 GB heap usage without exceeding the message instruction limit. It therefore scales much higher than the existing stop-and-go GCs and naturally also higher than running without GC.

Average amount of allocations for the benchmark limit cases, until reaching a limit (instruction limit, heap limit, `dfx` cycles limit). Rounded to two significant figures.

| GC                | Avg. Allocation Limit   |
| ----------------- | ----------------------- |
| **Incremental**   | **150e6**               |
| No                | 47e6                    |
| Generational      | 33e6                    |
| Compacting        | 37e6                    |
| Copying           | 47e6                    |

3x higher than the other GCs and also than no GC.

Currently, the following limit benchmark cases do not reach the 4GB heap maximum due to GC-independent reasons:
* `buffer` applies exponential array list growth where the copying to the larger array exceeds the instruction limit.
* `rb-tree`, `trie-map`, and `btree-map` are so garbage-intensive that they run out of `dfx` cycles or suffer from a sudden `dfx` network connection interruption.

### GC Pauses

Longest GC pause, maximum of all benchmark cases:

| GC                | Longest GC Pause          |
| ----------------- | ------------------------- |
| **Incremental**   | **0.712e9**               |
| Generational      | 1.19e9                    |
| Compacting        | 8.41e9                    |
| Copying           | 5.90e9                    |

Shorter than all the other GCs.

### Performance

Total number of instructions (mutator + GC), average across all benchmark cases:

| GC                | Avg. Total Instructions | 
| ----------------- | ----------------------- | 
| **Incremental**   | **1.85e10**             | 
| Generational      | 1.91e10                 | 
| Compacting        | 2.20e10                 | 
| Copying           | 2.05e10                 | 

Faster than all the other GCs.

Mutator utilization on average:

| GC                | Avg. Mutator Utilization |
| ----------------- | ------------------------ |
| **Incremental**   | **94.6%**                |
| Generational      | 85.4%                    |
| Compacting        | 75.8%                    |
| Copying           | 78.7%                    |

Higher than the other GCs.

### Memory Size

Occupied heap size at the end of each benchmark case, average across all cases:

| GC                | Avg. Final Heap Occupation |
| ----------------- | -------------------------- |
| **Incremental**   | **176 MB**                 |
| No                | 497 MB                     |
| Generational      | 156 MB                     |
| Compacting        | 144 MB                     |
| Copying           | 144 MB                     |

Up to 22% higher than the other GCs.

Allocated WASM memory space, benchmark average:

| GC                | Avg. Memory Size        |
| ----------------- | ----------------------- |
| **Incremental**   | **296 MB**              |
| No                | 499 MB                  |
| Generational      | 191 MB                  |
| Compacting        | 188 MB                  |
| Copying           | 271 MB                  |

9% higher than the copying GC. 57% higher (worse) than the generational and the compacting GC.

## Overheads

Additional mutator costs implied by the incremental GC:
* **Write barrier**: 
    - During the mark and evacuation phases: Marking the target of overwritten pointers.
    - During the update phase: Resolving the forwarding of written pointers.
* **Allocation barrier**:
    - During the mark and evacuation phases: Marking newly allocated objects.
    - During the update phase: Resolving pointer forwarding in initialized objects.
* **Pointer forwarding**:
    - Indirecting each object access and equality check via the forwarding pointer (see the sketch below).
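
To make the forwarding cost concrete, here is a simplified model of the Brooks-pointer indirection; a `HashMap` stands in for raw heap memory, and the names and layout are illustrative, not the actual RTS object header encoding:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ObjPtr(usize);

/// The forwarding word of every object: points to the object itself while it
/// has not moved, and to the new copy after evacuation.
struct Header {
    forward: ObjPtr,
}

struct Heap {
    headers: HashMap<ObjPtr, Header>, // stand-in for raw object memory
}

impl Heap {
    /// Every object access is redirected through the forwarding pointer:
    /// one extra memory read per use.
    fn resolve(&self, obj: ObjPtr) -> ObjPtr {
        self.headers[&obj].forward
    }

    /// Pointer equality compares the forwarded addresses, because one side
    /// may still be an old pointer to a moved object.
    fn ptr_eq(&self, a: ObjPtr, b: ObjPtr) -> bool {
        self.resolve(a) == self.resolve(b)
    }

    /// Evacuation installs the new address in the old copy's header, so stale
    /// incoming pointers keep working until the update phase rewrites them.
    fn evacuate(&mut self, old: ObjPtr, new: ObjPtr) {
        self.headers.get_mut(&old).unwrap().forward = new;
        self.headers.insert(new, Header { forward: new });
    }
}
```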

Runtime costs for the barrier are reported in #3831.
Runtime costs for the forwarding pointers are reported in #3829.

## Testing

1. RTS unit tests

    In Motoko repo folder `rts`:
    ```
    make test
    ```

2. Motoko test cases

    In Motoko repo folder `test/run` and `test/run-drun`:
    ```
    export EXTRA_MOC_ARGS="--sanity-checks --incremental-gc"
    make
    ```

3. GC Benchmark cases

    In `gcbench` repo: 
    ```
    ./measure-all.sh
    ```

4. Extensive memory sanity checks

    Adjust `Cargo.toml` in `rts/motoko-rts` folder:
    ```
    default = ["ic", "memory-check"]
    ```

    Run selected benchmark and test cases. Some of the tests will exceed the instruction limit due to the expensive checks.

## Extension to 64-Bit Heaps

The partition bookkeeping would need to be adjusted to store the partition information dynamically instead of in a static allocation. For example, the information could be stored in a reserved space at the beginning of each partition (except if the partition contains static data or serves as an extension hosting a huge object). Apart from that, the GC should be portable and scalable to 64-bit memory without significant design changes.

## Design Alternatives

* **Free list**: See the prototype in #3678. The free-list-based incremental GC shows higher reclamation latency, slower performance (free list selection), and potentially higher external fragmentation (no compaction, only merging of free neighbors).
* **Mark bit in object header**: See implementation in #3756. Storing the mark bit in the object header instead of using a mark bitmap saves memory space, but is more expensive for scanning sparsely marked partitions. Moreover, it increases the amount of dirty pages.
* **Remembered set**: Inter-partition pointers could be stored in a remembered set to allow more selective and faster pointer updates. However, the write barrier would become more expensive, as it would have to detect and store the relevant pointers in the remembered set. Also, the remembered set would occupy additional memory.
* **Allocation increments**: On a high allocation rate, the GC could also perform a short GC increment during an allocation. This design is however more complicated, as it forbids the compiler from keeping low-level pointers on the stack while performing an allocation (e.g. during assignments or array tabulation). It is also slower than the current solution, where allocation-triggered increment work is postponed to the next regularly scheduled GC increment, running when the call stack is empty.
* **Special incremental GC**: Analyzed in PR #3894. An incremental GC based on a central object table that allows easy object movement and incremental compaction. Compared to this PR, that GC has 35% worse runtime performance.
* **Combining tag and forwarding pointer**: #3904. This seems to be less efficient than the Brooks pointer technique, with a runtime performance degradation of 27.5%, while only offering a small memory saving of around 2%.

## References

[1] C. H. Flood, R. Kennke, A. Dinn, A. Haley, and R. Westrelin. Shenandoah. An Open-Source Concurrent Compacting Garbage Collector for OpenJDK. Intl. Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools, PPPJ'16, Lugano, Switzerland, August 2016.

[2] R. A. Brooks. Trading Data Space for Reduced Time and Code Space in Real-Time Garbage Collection on Stock Hardware. ACM Symposium on LISP and Functional Programming, LFP'84, New York, NY, USA, 1984.