Incremental GC #3837

luc-blaeser · 2023-02-24T12:28:35Z

Incremental GC PR Stack

The Incremental GC is structured in three PRs to ease review:

Incremental GC #3837 <-- this PR
Incremental GC Barriers #3831
Incremental GC Forwarding Pointers #3829

Incremental GC

Incremental evacuating-compacting garbage collector.

Objective: Scalable memory management that allows full heap usage.

Properties:

All GC pauses have bounded short time.
Full-heap snapshot-at-the-beginning marking.
Focus on reclaiming high-garbage partitions.
Compacting heap space with partition evacuations.
Incremental copying enabled by forwarding pointers.
Using mark bitmaps instead of a mark bit in the object headers.
Limiting number of evacuations on memory shortage.

Design

The incremental GC distributes its workload across multiple steps, called increments, that each pause the mutator (the user's program) for only a limited amount of time. As a result, the GC appears to run concurrently with the mutator (although not in parallel) and thus allows scalable heap usage, where the GC work fits within the instruction-limited IC messages.

Similar to the recent Java Shenandoah GC [1], the incremental GC organizes the heap in equally-sized partitions and selects high-garbage partitions for compaction by using incremental evacuation and the Brooks forwarding pointer technique [2].

The GC runs in three phases:

Incremental Mark: The GC performs full-heap incremental tri-color marking with snapshot-at-the-beginning consistency. For this purpose, write barriers intercept mutator pointer overwrites between GC mark increments. The target object of an overwritten pointer is thereby marked. Concurrent new object allocations are also conservatively marked. To remember the mark state per object, the GC uses partition-associated mark bitmaps that are temporarily allocated during a GC run. The phase additionally needs a mark stack, which is a growable linked list of tables in the heap that can be recycled as garbage during the active GC run. Full-heap marking has the advantage that it can also deal with arbitrarily large cyclic garbage, even if spread across multiple partitions. As a side activity, the mark phase also maintains the bookkeeping of the amount of live data per partition. Conservative snapshot-at-the-beginning marking and retaining new allocations is necessary because the WASM call stack cannot be inspected for the root set collection. Therefore, the mark phase must also only start on an empty call stack.
Incremental Evacuation: The GC prioritizes partitions with a larger amount of garbage for evacuation based on the available free space. It also requires a defined minimum amount of garbage for a partition to be evacuated. Subsequently, marked objects inside the selected partitions are evacuated to free partitions and thereby compacted. To allow incremental object moving and incremental updating of pointers, each object carries redirection information in its header: a forwarding pointer, also called a Brooks pointer. For non-moved objects, the forwarding pointer reflexively points back to the object itself, while for moved objects, the forwarding pointer refers to the new object location. Each object access and equality check has to be redirected via this forwarding pointer. During this phase, evacuated partitions are still retained and the original locations of evacuated objects are forwarded to their corresponding new object locations. Therefore, the mutator can continue to use old incoming pointers to evacuated objects.
Incremental Updates: All pointers to moved objects have to be updated before free space can be reclaimed. For this purpose, the GC performs a full-heap scan and updates all pointers in live objects to their forwarded address. As the mutator may perform concurrent pointer writes behind the update scan line, a write barrier catches such pointer writes and resolves them to the forwarded locations. The same applies to new object allocations that may have old pointer values in their initialized state (e.g. originating from the call stack). Once this phase is completed, all evacuated partitions are freed and can later be reused for new object allocations. At the same time, the GC also frees the mark bitmaps stored in temporary partitions. The update phase can only be completed when the call stack is empty, since the GC does not access the WASM stack. No remembered sets are maintained for tracking incoming pointers to partitions.
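
The phase cycle described above, together with the write barrier behavior listed under Overheads, can be sketched as a small state machine. This is only an illustrative model: the names `Phase`, `WriteBarrierAction`, and the `Pause` state (standing for the idle time between GC runs) are assumptions and not the identifiers used in the runtime system.

```rust
/// Hypothetical phase names; `Pause` models the idle time between two GC runs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Phase {
    Pause,
    Mark,
    Evacuate,
    Update,
}

/// What the write barrier has to do for a pointer write in a given phase.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum WriteBarrierAction {
    /// No GC run is active: plain write.
    None,
    /// Mark and evacuation phase: mark the target of the overwritten pointer.
    MarkOverwrittenTarget,
    /// Update phase: resolve the written pointer to its forwarded location.
    ResolveForwarding,
}

impl Phase {
    /// The phases run in a fixed order and then return to the idle state.
    pub fn next(self) -> Phase {
        match self {
            Phase::Pause => Phase::Mark,
            Phase::Mark => Phase::Evacuate,
            Phase::Evacuate => Phase::Update,
            Phase::Update => Phase::Pause,
        }
    }

    pub fn write_barrier_action(self) -> WriteBarrierAction {
        match self {
            Phase::Pause => WriteBarrierAction::None,
            Phase::Mark | Phase::Evacuate => WriteBarrierAction::MarkOverwrittenTarget,
            Phase::Update => WriteBarrierAction::ResolveForwarding,
        }
    }

    /// Starting the mark phase and completing the update phase both require
    /// an empty call stack, because the GC cannot inspect the WASM stack.
    pub fn leaving_requires_empty_call_stack(self) -> bool {
        matches!(self, Phase::Pause | Phase::Update)
    }
}
```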

Humongous objects:

Objects with a size larger than a partition require special handling: A sufficient amount of contiguous free partitions is searched and reserved for a large object. Large objects are not moved by the GC. Once they have become garbage (not marked by the GC), their hosting partitions are immediately freed. Both external and internal fragmentation can only occur for huge objects. Partitions storing large objects do not require a mark bitmap during the GC.
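
As a rough illustration of the reservation step, the following sketch computes how many partitions a large object occupies and searches for a run of contiguous free partitions. The first-fit strategy and the names `partitions_needed` and `find_contiguous_free` are assumptions for this example; the source only states that a sufficient amount of contiguous free partitions is searched and reserved.

```rust
/// Partition size from the Configuration section (32 MB, assumed here to be 32 MiB).
pub const PARTITION_SIZE: usize = 32 * 1024 * 1024;

/// Number of partitions that an object of `object_size` bytes spans (rounded up).
pub fn partitions_needed(object_size: usize) -> usize {
    (object_size + PARTITION_SIZE - 1) / PARTITION_SIZE
}

/// First-fit search (an assumed strategy) for `count` contiguous free partitions.
/// `free[i]` tells whether partition `i` is currently free.
/// Returns the index of the first partition of the run, if any.
pub fn find_contiguous_free(free: &[bool], count: usize) -> Option<usize> {
    if count == 0 {
        return None;
    }
    let mut run_start = 0;
    let mut run_length = 0;
    for (index, &is_free) in free.iter().enumerate() {
        if is_free {
            if run_length == 0 {
                run_start = index;
            }
            run_length += 1;
            if run_length == count {
                return Some(run_start);
            }
        } else {
            // An occupied partition interrupts the run.
            run_length = 0;
        }
    }
    None
}
```

The unused tail of the last reserved partition is the internal fragmentation mentioned above, and a failed search despite enough free partitions in total corresponds to external fragmentation.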

Increment limit:

The GC maintains a synthetic deterministic clock by counting work steps, such as marking an object, copying a word, or updating a pointer. The clock serves for limiting the duration of a GC increment. The GC increment is stopped whenever the limit is reached, such that the GC later resumes its work in a new increment. To also keep the limit on large objects, large arrays are marked and updated in incremental slices. Moreover, huge objects are never moved.
For simplicity, the GC increment is only triggered at the compiler-instrumented scheduling points when the call stack is empty. The increment limit is increased depending on the amount of concurrent allocations, to reduce the reclamation latency on a high allocation rate during garbage collection.
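
A minimal sketch of such a synthetic clock, and of marking a large array in slices that respect the clock, could look as follows. `IncrementClock` and `mark_array_slice` are hypothetical names for this example, and one array element is assumed to cost one work step.

```rust
/// Deterministic work counter that bounds the duration of one GC increment.
pub struct IncrementClock {
    steps: usize,
    limit: usize,
}

impl IncrementClock {
    pub fn new(limit: usize) -> Self {
        IncrementClock { steps: 0, limit }
    }

    /// Account for `work` units, e.g. one marked object or one copied word.
    pub fn tick(&mut self, work: usize) {
        self.steps = self.steps.saturating_add(work);
    }

    pub fn limit_reached(&self) -> bool {
        self.steps >= self.limit
    }
}

/// Processes the elements of a large array starting at `start` until the array
/// ends or the increment limit is reached. Returns the index at which the next
/// increment has to resume.
pub fn mark_array_slice(array_length: usize, start: usize, clock: &mut IncrementClock) -> usize {
    let mut index = start;
    while index < array_length && !clock.limit_reached() {
        // The actual marking of element `index` would happen here.
        clock.tick(1);
        index += 1;
    }
    index
}
```

Each new increment starts with a fresh clock and continues from the returned index, so even a very large array never exceeds the per-increment limit.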

Memory shortage

If memory is scarce during garbage collection, the GC limits the amount of evacuations to the available space in free partitions. This prevents the GC from running out of memory while copying live objects to new partitions.
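
The selection of evacuation candidates under a free-space budget can be sketched as follows. This is a simplified model under stated assumptions: `PartitionInfo`, `select_evacuation_candidates`, and the greedy "most garbage first" loop are illustrative and not taken from the runtime system. The survival threshold refers to the value in the Configuration section.

```rust
/// Per-partition bookkeeping collected during the mark phase (hypothetical type).
pub struct PartitionInfo {
    pub index: usize,
    pub live_bytes: usize,
}

/// Selects partitions for evacuation:
/// - partitions whose live share reaches `survival_percent` are skipped,
/// - partitions with more garbage (less live data) are preferred,
/// - the total live data to be copied never exceeds the space of the free partitions.
pub fn select_evacuation_candidates(
    partitions: &[PartitionInfo],
    partition_size: usize,
    free_partitions: usize,
    survival_percent: usize,
) -> Vec<usize> {
    let mut candidates: Vec<&PartitionInfo> = partitions
        .iter()
        .filter(|p| p.live_bytes * 100 < partition_size * survival_percent)
        .collect();
    // Less live data means more garbage, so sort ascending by live bytes.
    candidates.sort_by_key(|p| p.live_bytes);

    let mut budget = free_partitions * partition_size;
    let mut selected = Vec::new();
    for partition in candidates {
        if partition.live_bytes > budget {
            // Memory shortage: stop before the copy targets would run out.
            break;
        }
        budget -= partition.live_bytes;
        selected.push(partition.index);
    }
    selected
}
```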

Configuration

Partition size: 32 MB.
Increment limit: Regular increment bounded to 3,500,000 steps (approximately 600 million instructions). Each allocation during GC increases the next scheduled GC increment by 20 additional steps.
Survival threshold: If 85% of a partition space is alive (marked), the partition is not evacuated.
GC start: Scheduled when the growth (new allocations since the last GC run) accounts for more than 65% of the heap size. When passing the critical limit of 3.25GB (on the 4GB heap size), the GC is already started when the growth exceeds 1% of the heap size.

The configuration can be adjusted to tune the GC.
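
The increment limit and the GC start condition above amount to simple arithmetic, sketched below. The constant and function names are hypothetical. The sketch also assumes that "heap size" means the current heap size and that 3.25GB is meant in binary units; neither is confirmed by the text.

```rust
pub const INCREMENT_BASE_LIMIT: usize = 3_500_000;
pub const INCREMENT_STEPS_PER_ALLOCATION: usize = 20;

const MIB: u64 = 1024 * 1024;
/// 3.25 GB, assumed here to be 3.25 GiB = 3328 MiB.
pub const CRITICAL_HEAP_LIMIT: u64 = 3328 * MIB;

/// Step limit of the next GC increment, given the allocations made during the GC run.
pub fn increment_limit(allocations_during_gc: usize) -> usize {
    INCREMENT_BASE_LIMIT + INCREMENT_STEPS_PER_ALLOCATION * allocations_during_gc
}

/// Decides whether a new GC run should be scheduled.
/// `growth` is the amount of new allocations since the last GC run, in bytes.
pub fn should_start_gc(heap_size: u64, growth: u64) -> bool {
    let threshold_percent = if heap_size >= CRITICAL_HEAP_LIMIT { 1 } else { 65 };
    growth * 100 > heap_size * threshold_percent
}
```

For example, with 1,000 allocations during a GC run, the next increment is bounded by 3,500,000 + 20 × 1,000 = 3,520,000 steps.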

Measurement

The following results have been measured on the GC benchmark with dfx 0.13.1. The Copying, Compacting, and Generational GC are based on the original runtime system without the forwarding pointer header extension. No denotes the disabled GC based on the runtime system with the forwarding pointer header extension.

Scalability

Summary: The incremental GC allows full 4GB heap usage without exceeding the message instruction limit. It therefore scales much higher than the existing stop-and-go GCs and naturally also higher than without GC.

Average amount of allocations for the benchmark limit cases, until reaching a limit (instruction limit, heap limit, dfx cycles limit). Rounded to two significant figures.

GC	Avg. Allocation Limit
Incremental	150e6
No	47e6
Generational	33e6
Compacting	37e6
Copying	47e6

3x higher than the other GCs and also than no GC.

Currently, the following limit benchmark cases do not reach the 4GB heap maximum due to GC-independent reasons:

buffer applies exponential array list growth where the copying to the larger array exceeds the instruction limit.
rb-tree, trie-map, and btree-map are so garbage-intensive that they run out of dfx cycles or suffer from a sudden dfx network connection interruption.

GC Pauses

Longest GC pause, maximum of all benchmark cases:

GC	Longest GC Pause
Incremental	0.712e9
Generational	1.19e9
Compacting	8.41e9
Copying	5.90e9

Shorter than all the other GCs.

Performance

Total number of instructions (mutator + GC), average across all benchmark cases:

GC	Avg. Total Instructions
Incremental	1.85e10
Generational	1.91e10
Compacting	2.20e10
Copying	2.05e10

Faster than all the other GCs.

Mutator utilization on average:

GC	Avg. Mutator Utilization
Incremental	94.6%
Generational	85.4%
Compacting	75.8%
Copying	78.7%

Higher than the other GCs.

Memory Size

Occupied heap size at the end of each benchmark case, average across all cases:

GC	Avg. Final Heap Occupation
Incremental	176 MB
No	497 MB
Generational	156 MB
Compacting	144 MB
Copying	144 MB

Up to 22% higher than the other GCs.

Allocated WASM memory space, benchmark average:

GC	Avg. Memory Size
Incremental	296 MB
No	499 MB
Generational	191 MB
Compacting	188 MB
Copying	271 MB

9% higher than the copying GC. 57% higher (worse) than the generational and the compacting GC.

Overheads

Additional mutator costs implied by the incremental GC:

Write barrier:
- During the mark and evacuation phase: Marking the target of overwritten pointers.
- During the update phase: Resolving forwarding of written pointers.
Allocation barrier:
- During the mark and evacuation phase: Marking new allocated objects.
- During the update phase: Resolve pointer forwarding in initialized objects.
Pointer forwarding:
- Indirect each object access and equality check via the forwarding pointer.

Runtime costs for the barrier are reported in #3831.
Runtime costs for the forwarding pointers are reported in #3829.

Testing

RTS unit tests

In Motoko repo folder rts:
```
make test
```

Motoko test cases

In Motoko repo folder test/run and test/run-drun:

export EXTRA_MOC_ARGS="--sanity-checks --incremental-gc"
make

GC Benchmark cases

In gcbench repo:
```
./measure-all.sh
```
Extensive memory sanity checks

Adjust Cargo.toml in rts/motoko-rts folder:
```
default = ["ic", "memory-check"]
```
Run selected benchmark and test cases. Some of the tests will exceed the instruction limit due to the expensive checks.

Extension to 64-Bit Heaps

The design of the partition information would need to be adjusted to store this information dynamically instead of in a static allocation. For example, the information could be stored in a reserved space at the beginning of a partition (except if the partition has static data or serves as an extension for hosting a huge object). Apart from that, the GC should be portable and scalable without significant design changes on 64-bit memory.

Design Alternatives

Free list: See the prototype in Incremental GC (Intermediate Version) #3678. The free-list-based incremental GC shows higher reclamation latency, slower performance (free list selection), and potentially higher external fragmentation (no compaction, just free neighbor merging).
Mark bit in object header: See implementation in Incremental Garbage Collector (Mark Bit in Header) #3756. Storing the mark bit in the object header instead of using a mark bitmap saves memory space, but is more expensive for scanning sparsely marked partitions. Moreover, it increases the amount of dirty pages.
Remembered set: Inter-partition pointers could be stored in a remembered set to allow more selective and faster pointer updates. However, the write barrier would become more expensive, since it would have to detect and store relevant pointers in the remembered set. Also, the remembered set would occupy additional memory.
Allocation increments: On a high allocation rate, the GC could also perform a short GC increment during an allocation. This design is however more complicated, as it forbids the compiler from storing low-level pointers on the stack while performing an allocation (e.g. during assignments or array tabulate). It is also slower than the current solution, where allocation increments are postponed to the next regularly scheduled GC increment, running when the call stack is empty.
Special incremental GC: Analyzed in PR Special GC (Incremental GC with Central Object Table) #3894. An incremental GC based on a central object table that allows easy object movement and incremental compaction. Compared to this PR, the special GC has 35% worse runtime performance.
Combining tag and forwarding pointer: Combining Tag and Forwarding Pointer (Incremental GC) #3904. This seems to be less efficient than the Brooks pointer technique, with a runtime performance degradation of 27.5%, while only offering a small memory saving of around 2%.

References

[1] C. H. Flood, R. Kennke, A. Dinn, A. Haley, and R. Westrelin. Shenandoah. An Open-Source Concurrent Compacting Garbage Collector for OpenJDK. Intl. Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools, PPPJ'16, Lugano, Switzerland, August 2016.

[2] R. A. Brooks. Trading Data Space for Reduced Time and Code Space in Real-Time Garbage Collection on Stock Hardware. ACM Symposium on LISP and Functional Programming, LFP'84, New York, NY, USA, 1984.

ulan

The new GC lgtm as far as I can see. Great work and good luck with launching it!

luc-blaeser · 2023-05-10T12:20:10Z

> The new GC lgtm as far as I can see. Great work and good luck with launching it!

Thank you very much Ulan for the review of these complex PRs and the many very valuable design discussions.

crusso

Let's ship this puppy!

### Incremental GC PR Stack

The Incremental GC is structured in three PRs to ease review:

1. #3837
2. #3831
3. #3829 **<-- this PR**

# Incremental GC Forwarding Pointers

Support for forwarding pointers (Brooks pointer) to enable incremental moving (evacuating compacting) GC.

Each object stores a forwarding pointer with the following properties:

* **Self-reference**: If the object resides at a valid location (i.e. it has not been relocated to another address), the forwarding pointer stores a reference to the object itself.
* **Single-level redirection**: If an object has been moved, the original object stores a pointer to the new object location. This implies that the data at the original object location is no longer valid.

Forwarding is only used during the evacuation phase and the updating phase of the incremental GC.

Indirection is at most one level, i.e. the relocation target cannot forward again to another location. The GC would need to update all incoming pointers before moving an object again in the next GC run.

Invariant: `object.forward().get_ptr() == object.get_ptr() || object.forward().forward().get_ptr() == object.forward().get_ptr()`.

## Changes

Changes made to the compiler and runtime system:

* **Header extension**: Additional object header space has been inserted for the forwarding pointer. Allocated objects forward to themselves.
* **Access indirection**: Every load, store, and object reference comparison incurs an additional indirection through the forwarding pointer.

## Runtime Costs

Measuring the performance overhead of forwarding pointers in the GC benchmark using release mode (no sanity checks).

Total number of mutator instructions, average across all benchmark cases:

Configuration     | Avg. mutator instructions
------------------|---------------------------
No forwarding     | 1.61e10
With forwarding   | 1.80e10

**Runtime overhead of 12% on average.**

## Memory Costs

Allocated memory size, average across all GC benchmark cases, copying GCs:

Configuration     | Avg. final heap size
------------------|----------------------
No forwarding     | 306 MB
With forwarding   | 325 MB

**Memory overhead of 6% on average.**

## Testing

Extensive sanity checks for forwarding pointers were implemented and run in the separate PR (#3546) containing the following sanity check code:

* **Indirection check**: Every dereferencing of a forwarding pointer is checked for whether the pointer is valid and the invariant above holds.
* **Memory scan**: At GC time points of the existing copying or compacting GC, in regular intervals, the entire memory is scanned and all objects and pointers are verified to be valid (valid forwarding pointer and plausible object tag).
* **Artificial forwarding**: For every created object, an artificial dummy object is returned that forwards to the real object. The dummy object stores zeroed content and has an invalid tag. This helps to verify that all heap accesses are correctly forwarded. Artificial forwarding disables the existing garbage collectors (due to the dummy objects not being handled by the GCs) and performs memory scans at a defined frequency instead.

## Design Alternative

* **Combining tag and forwarding pointer**: #3904. This seems to be less efficient than the Brooks pointer technique, with a runtime performance degradation of 27.5%, while only offering a small memory saving of around 2%.

## Reference

[1] R. A. Brooks. Trading Data Space for Reduced Time and Code Space in Real-Time Garbage Collection on Stock Hardware. ACM Symposium on LISP and Functional Programming, LFP'84, New York, NY, USA, 1984.
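
To make the self-reference and single-level redirection properties concrete, here is a toy model of a heap with Brooks forwarding pointers. It uses vector indices instead of real addresses, and all names (`Heap`, `Object`, `resolve`, `evacuate`, ...) are assumptions for this sketch rather than the runtime system's API.

```rust
/// Toy object: a forwarding "pointer" (vector index) plus one payload word.
pub struct Object {
    forward: usize,
    payload: u32,
}

#[derive(Default)]
pub struct Heap {
    objects: Vec<Object>,
}

impl Heap {
    /// Newly allocated objects forward to themselves.
    pub fn allocate(&mut self, payload: u32) -> usize {
        let address = self.objects.len();
        self.objects.push(Object { forward: address, payload });
        address
    }

    /// Follows the forwarding pointer (at most one level of indirection).
    pub fn resolve(&self, address: usize) -> usize {
        self.objects[address].forward
    }

    /// Copies the object to a new location and forwards the old location to it.
    pub fn evacuate(&mut self, address: usize) -> usize {
        // An object must not be moved again before incoming pointers are updated.
        assert_eq!(self.resolve(address), address);
        let payload = self.objects[address].payload;
        let new_address = self.allocate(payload);
        self.objects[address].forward = new_address;
        new_address
    }

    /// Every object access is redirected via the forwarding pointer.
    pub fn read(&self, address: usize) -> u32 {
        self.objects[self.resolve(address)].payload
    }

    /// Reference equality compares the resolved locations.
    pub fn same_object(&self, a: usize, b: usize) -> bool {
        self.resolve(a) == self.resolve(b)
    }

    /// The invariant stated above, expressed on the toy model.
    pub fn invariant_holds(&self, address: usize) -> bool {
        let forwarded = self.resolve(address);
        forwarded == address || self.resolve(forwarded) == forwarded
    }
}
```

After an evacuation, an old pointer and a new pointer to the same object still compare equal and read the same payload, which is why the mutator can keep using old incoming pointers until the update phase has finished.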

### Incremental GC PR Stack

The Incremental GC is structured in three PRs to ease review:

1. #3837
2. #3831 **<-- this PR**
3. #3829

# Incremental GC Barriers

Preparation support for write and allocation barriers for the incremental moving (evacuating compacting) GC.

**Write barrier**: All potential pointer writes are passed to the write barrier, which performs the write with additional steps to be implemented in the incremental GC:

* Incremental mark phase: Catch the overwritten pointers to realize incremental snapshot-at-the-beginning marking.
* Incremental update phase: If written pointers refer to old evacuated object locations, adjust them to point to the corresponding new forwarded locations.

**Allocation barrier**: The allocation barrier catches all newly created objects that are completely initialized (except the content of blobs). This will serve the following purposes in the incremental GC:

* Incremental mark and evacuation phase: Mark the newly allocated objects to retain them during the GC that performs snapshot-at-the-beginning marking.
* Incremental update phase: Update all pointers in the new object to refer to the new forwarded locations.
* Additional GC increment: To limit memory reclamation latency at a high allocation rate during garbage collection, the barrier performs an additional small GC increment.

***Static optimization***: Compile-time barrier elimination based on a simple conservative analysis of the type of the modified field/array element. Specifically, the barrier is skipped if the type of the written location does not allow pointers (`Bool`, `?Bool`, `Char`, `?Char`, `Nat8`, `?Nat8`, `Nat16`, `?Nat16`, `Int8`, `?Int8`, `Int16`, `?Int16`, `()`, `?()`). (The barrier cannot be omitted for multi-level optional types, as `??null`, `???null`, etc. refer to heap objects. Scalar types with >=32 bits can be indirected due to pointer tagging.)

## Runtime Costs

GC benchmark measurements, comparing the number of mutator instructions, average across all benchmark cases, release mode (no sanity checks):

The incremental GC barrier contains the logic of the full GC (#3837) including object forwarding, however without the allocation GC increment. The measurements without barriers include object forwarding to determine the barrier overheads.

Configuration               | Avg. Mutator Instructions
----------------------------|--------------------------
Incremental GC barriers     | 2.06e10
No barriers with forwarding | 1.80e10

**14% runtime overhead on top of forwarding pointers.**

## Testing

Write barrier coverage has been extensively tested by the generational GC and in a separate barrier preparation PR for the incremental GC (#3502). The allocation barrier is tested as part of the incremental GC PR (#3837).
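
The static barrier elimination rule can be modeled as a small predicate over types. The following Rust sketch is only an illustration of the rule as stated above: the `Type` enum is a drastically simplified, hypothetical stand-in for the compiler's type representation (the compiler itself is not written in Rust), and `Nat32` and `Other` merely represent "everything else".

```rust
/// Simplified, hypothetical model of Motoko types relevant to the rule.
pub enum Type {
    Bool,
    Char,
    Nat8,
    Nat16,
    Int8,
    Int16,
    Unit,
    /// Example of a scalar with >= 32 bits that can be indirected.
    Nat32,
    /// Optional type `?T`.
    Opt(Box<Type>),
    /// Any other type (objects, arrays, variants, ...).
    Other,
}

/// Scalars that never hold a pointer.
fn is_small_scalar(ty: &Type) -> bool {
    matches!(
        ty,
        Type::Bool | Type::Char | Type::Nat8 | Type::Nat16 | Type::Int8 | Type::Int16 | Type::Unit
    )
}

/// Conservative analysis: the barrier is only skipped for small scalars and
/// for a single option level around them. Multi-level options such as
/// `??Bool` keep the barrier, since `??null` refers to a heap object.
pub fn needs_write_barrier(ty: &Type) -> bool {
    match ty {
        Type::Opt(inner) => !is_small_scalar(inner),
        other => !is_small_scalar(other),
    }
}
```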

# Separate RTS Builds (Incremental and Non-Incremental GC)

Using different memory layouts determined at compile time:

* Incremental GC:
  - Extended header with forwarding pointer field.
  - Partitioned heap.
* Non-incremental GC (copying, compacting, and generational GC):
  - Small header only comprising the object tag.
  - Linear heap.

## Runtime System Changes

Separate RTS builds by introducing the feature `"incremental_gc"` for the incremental GC memory layout.

Helper macros:

* `#[incremental_gc]`: macro attribute, equivalent to `#[cfg(feature = "incremental_gc")]`
* `#[non_incremental_gc]`: macro attribute, equivalent to `#[cfg(not(feature = "incremental_gc"))]`
* `is_incremental_gc!()`: procedural macro, equivalent to `cfg!(feature = "incremental_gc")`

Different builds:

* `rts.wasm`: Release build with non-incremental GCs (containing the copying, compacting, and generational GC).
* `rts-debug.wasm`: Debug build with non-incremental GCs (containing the copying, compacting, and generational GC).
* `rts-incremental.wasm`: Release build with only the incremental GC (no other GCs).
* `rts-incremental-debug.wasm`: Debug build with only the incremental GC (no other GCs).

## Compiler Changes

GC-dependent compilation:

* Conditional header layout with or without forwarding pointer.
* Linking the corresponding matching RTS build, with build-specific imports.

Switch based on the condition `!Flags.gc_strategy == Flags.Incremental`.

## Performance

Comparing the following designs:

* **Non-Incremental RTS**: Original RTS in `master branch` without incremental GC changes.
* **Combined RTS**: Combining incremental and non-incremental GC in one RTS build, PR: #3837
* **Separate RTS**: This PR.

### Binary Size

Size of the release RTS binary files.

| GC              | Non-Incremental RTS | Combined RTS | Separate RTS |
| --------------- | ------------------- | ------------ | ------------ |
| non-incremental | 174 KB              | 194 KB       | 174 KB       |
| incremental     | -                   | 194 KB       | 175 KB       |

**11%** reduction.

### Total Allocations

GC benchmark results with `dfx 0.13.1`. Total amount of allocated memory (heap size + reclaimed memory) at runtime, average across benchmark cases.

| GC           | Non-Incremental RTS | Combined RTS | Separate RTS |
| ------------ | ------------------- | ------------ | ------------ |
| copying      | 459 MB              | 496 MB       | 459 MB       |
| compacting   | 459 MB              | 496 MB       | 459 MB       |
| generational | 480 MB              | 517 MB       | 480 MB       |
| incremental  | -                   | 502 MB       | 502 MB       |

For non-incremental GCs: **8%** reduction compared to the combined RTS, same as the non-incremental RTS.

### Memory Size

GC benchmark results with `dfx 0.13.1`. Allocated WASM memory size at runtime, average across benchmark cases.

| GC           | Non-Incremental RTS | Combined RTS | Separate RTS |
| ------------ | ------------------- | ------------ | ------------ |
| copying      | 271 MB              | 281 MB       | 271 MB       |
| compacting   | 188 MB              | 195 MB       | 188 MB       |
| generational | 191 MB              | 201 MB       | 194 MB       |
| incremental  | -                   | 294 MB       | 294 MB       |

For non-incremental GCs: **4%** reduction compared to the combined RTS, same as the non-incremental RTS.

### Total Instructions

GC benchmark results with `dfx 0.13.1`. Number of executed instructions, average across benchmark cases.

| GC           | Non-Incremental RTS | Combined RTS | Separate RTS |
| ------------ | ------------------- | ------------ | ------------ |
| copying      | 2.05e10             | 2.12e10      | 2.07e10      |
| compacting   | 2.20e10             | 2.24e10      | 2.21e10      |
| generational | 1.91e10             | 1.93e10      | 1.92e10      |
| incremental  | -                   | 1.95e10      | 1.95e10      |

For non-incremental GCs: Around **2%** reduction compared to the combined RTS, around **1%** overhead compared to the non-incremental RTS.

# Conclusion

Advantages of this PR:

* Avoiding performance degradation for the classical GCs caused by introducing the incremental GC support.
* Smaller binary sizes.
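
The compile-time switch between the two memory layouts relies on standard Rust conditional compilation. The sketch below uses the plain `cfg` forms that the helper macros are stated to be equivalent to. `HEADER_WORDS` and the function `is_incremental_gc` are hypothetical names for this example, and the header sizes (tag only versus tag plus forwarding pointer) are an assumption derived from the layout description above.

```rust
/// Header size in words with the incremental GC layout:
/// object tag plus forwarding pointer (assumed to be one word each).
#[cfg(feature = "incremental_gc")]
pub const HEADER_WORDS: usize = 2;

/// Header size in words with the non-incremental layout: object tag only.
#[cfg(not(feature = "incremental_gc"))]
pub const HEADER_WORDS: usize = 1;

/// Expression-level check, evaluated at compile time.
pub fn is_incremental_gc() -> bool {
    cfg!(feature = "incremental_gc")
}
```

Because the selection happens at compile time, each build only contains the code and layout of its own GC family, which is consistent with the smaller binary sizes reported above.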

iclighthouse · 2023-05-13T12:36:22Z

This is the best news I heard this weekend! Great job!

ByronBecker · 2023-05-14T16:02:12Z

🥳 🎉 🎉 🎉

luc-blaeser · 2023-05-15T08:05:25Z

Thanks a lot, iclighthouse!
Thanks a lot, Byron!

# Experiment: Simplified Graph-Copy-Based Stabilization

**Simplified version of #4286, without stable memory buffering and without memory flipping on deserialization.**

Using graph copying instead of Candid-based serialization for stabilization, to save stable variables across upgrades.

## Goals

* **Stop-gap solution until enhanced orthogonal persistence**: More scalable stabilization than the current Candid(ish) serialization.
* **With enhanced orthogonal persistence**: Upgrades in the presence of memory layout changes introduced by future compiler versions.

## Design

Graph copy of the sub-graph of stable objects from main memory to stable memory and vice versa on upgrades.

## Properties

* Preserve sharing for all objects like in the heap.
* Allow the serialization format to be independent of the main memory layout.
* Limit the additional main memory needed during serialization and deserialization.
* Avoid deep call stack recursion (stack overflow).

## Memory Compatibility Check

Apply a memory compatibility check analogous to the enhanced orthogonal persistence, since the upgrade compatibility of the graph copy is not identical to the Candid subtype relation.

## Algorithm

Applying Cheney’s algorithm [1, 2] for both serialization and deserialization:

### Serialization

* Cheney’s algorithm using main memory as from-space and stable memory as to-space:
  * Focusing on stable variables as root (sub-graph of stable objects).
  * The target pointers and Cheney’s forwarding pointers denote the (skewed) offsets in stable memory.
  * Using streaming reads for the `scan`-pointer and streaming writes for the `free`-pointer in stable memory.

### Deserialization

* Cheney’s algorithm using stable memory as from-space and main memory as to-space:
  * Starting with the stable root created during the serialization process.
  * Objects are allocated in main memory using the default allocator.
  * Using random read/write access on the stable memory.

## Stable Format

For a long-term perspective, the object layout of the serialized data in the stable memory is fixed and independent of the main memory layout.

* Pointers support 64-bit representations, even if only 32-bit pointers are used in the current main memory address space.
* The Brooks forwarding pointer is omitted (used by the incremental GC).
* The pointers encode skewed stable memory offsets to the corresponding target objects.
* References to the null objects are encoded by a sentinel value.

## Specific Aspects

* The null object is handled specifically to guarantee the singleton property. For this purpose, null references are encoded as sentinel values that are decoded back to the static singleton of the new program version.
* Field hashes in objects are serialized in a blob. On deserialization, the hash blob is allocated in the dynamic heap. Same-typed objects that have been created by the same program version share the same hash blob.
* Stable records can dynamically contain non-stable fields due to structural sub-typing. A dummy value can be serialized for such fields, as a new program version can no longer access this field through the stable types.
* For backwards compatibility, old Candid destabilization is still supported when upgrading from a program that used an older compiler version.
* Incremental GC: Serialization needs to consider Brooks forwarding pointers (not to be confused with Cheney's forwarding information), while deserialization can deal with a partitioned heap that can have internal fragmentation (free space at partition ends).

## Complexity

Specific aspects that entail complexity:

* For each object type, not only serialization and deserialization need to be implemented, but also the pointer scanning logic of its serialized and deserialized format. Since the deserialization also targets stable memory, the existing pointer visitor logic cannot be used for scanning pointers in its deserialized format.
* The deserialization requires scanning the heap, which is more complicated for the partitioned heap. The allocator must yield monotonically growing addresses during deserialization. Free space gaps are allowed to complete partitions.

## Open Aspects

* Unused fields in stable records that are no longer declared in a new program version should be removed. This could be done during garbage collection, when objects are moved/evacuated.
* The binary serialization and deserialization of `BigInt` entails dynamic allocations (cf. `mp_to_sbin` and `mp_from_sbin` of Tom's math library).

## Related PRs

* Motoko Enhanced Orthogonal Persistence: #4225
* Motoko Incremental Garbage Collector: #3837

## References

[1] C. J. Cheney. A Non-Recursive List Compacting Algorithm. Communications of the ACM, 13(11):677-8, November 1970.

[2] R. Jones and R. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley 2003. Algorithm 6.1: Cheney's algorithm, page 123.
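
For readers unfamiliar with Cheney's algorithm, the following toy version copies the sub-graph reachable from a root into a fresh to-space, using a `scan` position and the end of the to-space as the `free` position. It is a sketch under simplifying assumptions: both spaces are in-memory vectors, pointers are plain indices, and the forwarding information is kept in a side table instead of inside the from-space objects. The names `Object`, `evacuate`, and `cheney_copy` are hypothetical and unrelated to the stabilization code.

```rust
/// Toy object with a payload and pointer fields (indices into its space).
#[derive(Clone, Debug, PartialEq)]
pub struct Object {
    pub payload: u32,
    pub fields: Vec<usize>,
}

/// Copies the object at `address` to the to-space unless it was copied before.
/// Returns its to-space address in both cases, which preserves sharing.
fn evacuate(
    address: usize,
    from_space: &[Object],
    to_space: &mut Vec<Object>,
    forwarding: &mut [Option<usize>],
) -> usize {
    if let Some(target) = forwarding[address] {
        return target;
    }
    // The end of the to-space acts as the `free` pointer.
    let target = to_space.len();
    // The copied fields still hold from-space addresses until they are scanned.
    to_space.push(from_space[address].clone());
    forwarding[address] = Some(target);
    target
}

/// Non-recursive graph copy of everything reachable from `root`.
pub fn cheney_copy(from_space: &[Object], root: usize) -> Vec<Object> {
    let mut forwarding: Vec<Option<usize>> = vec![None; from_space.len()];
    let mut to_space: Vec<Object> = Vec::new();
    evacuate(root, from_space, &mut to_space, &mut forwarding);

    // Objects between `scan` and the end of the to-space are copied but unscanned.
    let mut scan = 0;
    while scan < to_space.len() {
        for field in 0..to_space[scan].fields.len() {
            let old_target = to_space[scan].fields[field];
            let new_target = evacuate(old_target, from_space, &mut to_space, &mut forwarding);
            to_space[scan].fields[field] = new_target;
        }
        scan += 1;
    }
    to_space
}
```

The loop needs no recursion and no separate work list, which matches the property of avoiding deep call stack recursion; cycles and shared objects are handled through the forwarding table.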

# Officializing the Incremental GC

After a longer beta testing phase and some fine-tuning, we could officialize the incremental garbage collector. This would be the next intermediate step before making the incremental GC the default GC in Motoko.

Some advice may be helpful to give to developers, ideally also in a forum post, once we have declared the GC production ready.

## Note

The incremental GC is enabled by the moc flag `--incremental-gc` (#3837) and is designed to scale for large program heap sizes.

While resolving scalability issues with regard to the instruction limit of the GC work, it is now possible to hit other scalability limits:

- _Out of memory_: A program can run out of memory if it fills the entire memory space with live objects.
- _Upgrade limits_: When using stable variables, the current mechanism of serialization and deserialization to and from stable memory can exceed the instruction limit or run out of memory.

## Recommendations

- _Test the upgrade_: Thoroughly test the upgrade mechanism for different data volumes and heap sizes and conservatively determine the amount of stable data that is supported when upgrading the program.
- _Monitor the heap size_: Monitor the memory and heap size (`Prim.rts_memory_size()` and `Prim.rts_heap_size()`) of the application in production.
- _Limit the heap size_: Implement a custom limit in the application to keep the heap size and data volume below the scalability limit that has been determined during testing, in particular for the upgrade mechanism.
- _Avoid large allocations per message_: Avoid large allocations of 100 MB or more per message, but rather distribute larger allocations across multiple messages. Large allocations per message extend the duration of the GC increment. Moreover, memory pressure may occur because the GC has a higher reclamation latency than a classical stop-the-world collector.
- _Consider a backup query function_: Depending on the application case, it can be beneficial to offer a privileged _query_ function to extract the critical canister state in several chunks. The runtime system maintains an extra memory reserve for query functions. Of course, such a function has to be implemented with a check that restricts it to authorized callers only. It is also important to test this function well.
- _Last resort if memory would be full_: Assuming the memory is full of objects that became garbage shortly before the memory space was exhausted, the canister owner or controllers can call the system-level function `__motoko_gc_trigger()` multiple times to run extra GC increments and complete a GC run, for collecting the latest garbage in a full heap. Up to 100 calls of this function may be needed to complete a GC run in a 4GB memory space. The GC keeps a specific memory reserve to be able to perform its work even if the application has exhausted the memory. Usually, this functionality is not needed in practice but is only useful in such exceptional cases.

* Graph copy: Work in progress * Implement stable memory reader writer * Add skip function * Code refactoring * Continue stabilization function * Support update at scan position * Code refactoring * Code refactoring * Extend unit test * Continue implementation * Adjust test * Prepare memory compatibility check * Variable stable to-space offset * Deserialize with partitioned heap * Prepare metadata stabilization * Adjust stable memory size * Stabilization version management * Remove code redundancies * Fix version upgrade * Put object field hashes in a blob * Support object type * Code refactoring * Support blob, fix bug * Renaming variable * Adjust deserialization heap start * Handle null singleton * Fix version upgrade * Support regions * Backup first word in stable memory * Support additional fields in upgraded actor * Make unit tests runnable again * Dummy null singleton in unit test * Add test cases * Support boxed 32-bit and 64-bit numbers * Support more object types * Support more object types * Handle `true` bool constant * Grow main memory on bulk copy * Update benchmark results * Support bigint * Clear deserialized data in stable memory * Update test results * Add documentation * Reformat * Add missing file * Update design/GraphCopyStabilization.md Co-authored-by: Claudio Russo <claudio@dfinity.org> * Update rts/motoko-rts/src/stabilization.rs Co-authored-by: Claudio Russo <claudio@dfinity.org> * Update rts/motoko-rts/src/stabilization.rs Co-authored-by: Claudio Russo <claudio@dfinity.org> * Graph Copy: Explicit Stable Data Layout (#4293) Refinement of Graph-Copy-Based Stabilization (#4286): Serialize/deserialize in an explicitly defined and fixed stable layout for a long-term perspective. * Supporting 64-bit pointer representations in stable format, even if main memory currently only uses 32-bit addresses. Open aspect: * Make `BigInt` stable format independent of Tom's math library. 
* Update rts/motoko-rts/src/stabilization.rs Co-authored-by: Claudio Russo <claudio@dfinity.org>
* Update rts/motoko-rts/src/stabilization/layout.rs Co-authored-by: Claudio Russo <claudio@dfinity.org>
* Handle non-stable fields in stable records
* Add object type `Some`
* Add test case
* Adjust stabilization to incremental GC
* Update benchmark results
* Distinguish assertions
* Fix RTS unit test
* Update benchmark results
* Adjust test
* Adjust test
* Fix: Handle all non-stable types during serialization
* Fix typos and complete comment
* Experiment: Simplified Graph-Copy-Based Stabilization (#4313)

# Experiment: Simplified Graph-Copy-Based Stabilization

**Simplified version of #4286, without stable memory buffering and without memory flipping on deserialization.**

Using graph copying instead of Candid-based serialization for stabilization, to save stable variables across upgrades.

## Goals

* **Stop-gap solution until enhanced orthogonal persistence**: More scalable stabilization than the current Candid(ish) serialization.
* **With enhanced orthogonal persistence**: Upgrades in the presence of memory layout changes introduced by future compiler versions.

## Design

Graph copy of sub-graph of stable objects from main memory to stable memory and vice versa on upgrades.

## Properties

* Preserve sharing for all objects like in the heap.
* Allow the serialization format to be independent of the main memory layout.
* Limit the additional main memory needed during serialization and deserialization.
* Avoid deep call stack recursion (stack overflow).

## Memory Compatibility Check

Apply a memory compatibility check analogous to the enhanced orthogonal persistence, since the upgrade compatibility of the graph copy is not identical to the Candid subtype relation.

## Algorithm

Applying Cheney’s algorithm [1, 2] for both serialization and deserialization:

### Serialization

* Cheney’s algorithm using main memory as from-space and stable memory as to-space:
  * Focusing on stable variables as root (sub-graph of stable objects).
  * The target pointers and Cheney’s forwarding pointers denote the (skewed) offsets in stable memory.
  * Using streaming reads for the `scan`-pointer and streaming writes for the `free`-pointer in stable memory.

### Deserialization

* Cheney’s algorithm using stable memory as from-space and main memory as to-space:
  * Starting with the stable root created during the serialization process.
  * Objects are allocated in main memory using the default allocator.
  * Using random read/write access on the stable memory.

## Stable Format

For a long-term perspective, the object layout of the serialized data in the stable memory is fixed and independent of the main memory layout.

* Pointers support 64-bit representations, even if only 32-bit pointers are used in current main memory address space.
* The Brooks forwarding pointer is omitted (used by the incremental GC).
* The pointers encode skewed stable memory offsets to the corresponding target objects.
* References to the null objects are encoded by a sentinel value.

## Specific Aspects

* The null object is handled specifically to guarantee the singleton property. For this purpose, null references are encoded as sentinel values that are decoded back to the static singleton of the new program version.
* Field hashes in objects are serialized in a blob. On deserialization, the hash blob is allocated in the dynamic heap. Same-typed objects that have been created by the same program version share the same hash blob.
* Stable records can dynamically contain non-stable fields due to structural sub-typing. A dummy value can be serialized for such fields as a new program version can no longer access this field through the stable types.
* For backwards compatibility, old Candid destabilization is still supported when upgrading from a program that used an older compiler version.
* Incremental GC: Serialization needs to consider Brooks forwarding pointers (not to be confused with Cheney's forwarding information), while deserialization can deal with a partitioned heap that can have internal fragmentation (free space at partition ends).

## Complexity

Specific aspects that entail complexity:

* For each object type, not only serialization and deserialization need to be implemented but also the pointer scanning logic of its serialized and deserialized format. Since the deserialization also targets stable memory the existing pointer visitor logic cannot be used for scanning pointers in its deserialized format.
* The deserialization requires scanning the heap which is more complicated for the partitioned heap. The allocator must yield monotonically growing addresses during deserialization. Free space gaps are allowed to complete partitions.

## Open Aspects

* Unused fields in stable records that are no longer declared in a new program version should be removed. This could be done during garbage collection, when objects are moved/evacuated.
* The binary serialization and deserialization of `BigInt` entails dynamic allocations (cf. `mp_to_sbin` and `mp_from_sbin` of Tom's math library).

## Related PRs

* Motoko Enhanced Orthogonal Persistence: #4225
* Motoko Incremental Garbage Collector: #3837

## References

[1] C. J. Cheney. A Non-Recursive List Compacting Algorithm. Communications of the ACM, 13(11):677-8, November 1970.

[2] R. Jones and R. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley 2003. Algorithm 6.1: Cheney's algorithm, page 123.

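The Cheney-style graph copy described in the algorithm section above can be illustrated with a minimal Rust sketch. It is not the RTS implementation: the names `Object`, `cheney_copy` and `evacuate` are hypothetical, both spaces are plain in-memory vectors rather than main and stable memory, pointers are vector indices rather than skewed offsets, and the forwarding information is kept in a side table instead of inside the from-space objects.

```rust
/// Hypothetical object model: a payload plus pointer fields.
/// A pointer is an index into the space that currently holds the object.
#[derive(Clone, Debug, PartialEq)]
pub struct Object {
    pub payload: u64,
    pub fields: Vec<Option<usize>>,
}

/// Copies the sub-graph reachable from `root` into a fresh to-space.
/// Sharing and cycles are preserved; unreachable objects are not copied.
pub fn cheney_copy(from_space: &[Object], root: usize) -> Vec<Object> {
    // Side table standing in for Cheney's forwarding pointers (assumption).
    let mut forwarding: Vec<Option<usize>> = vec![None; from_space.len()];
    let mut to_space: Vec<Object> = Vec::new();

    // The root is the first object in to-space.
    evacuate(from_space, &mut forwarding, &mut to_space, root);

    // `scan` walks over copied objects; the end of `to_space` is the `free` pointer.
    let mut scan = 0;
    while scan < to_space.len() {
        for index in 0..to_space[scan].fields.len() {
            let field = to_space[scan].fields[index];
            if let Some(old_target) = field {
                let new_target = evacuate(from_space, &mut forwarding, &mut to_space, old_target);
                to_space[scan].fields[index] = Some(new_target);
            }
        }
        scan += 1;
    }
    to_space
}

/// Copies one object to the `free` position unless it was already copied,
/// and returns its to-space address.
fn evacuate(
    from_space: &[Object],
    forwarding: &mut [Option<usize>],
    to_space: &mut Vec<Object>,
    address: usize,
) -> usize {
    if let Some(new_address) = forwarding[address] {
        return new_address;
    }
    let new_address = to_space.len();
    to_space.push(from_space[address].clone());
    forwarding[address] = Some(new_address);
    new_address
}
```

The loop needs no recursion and no explicit work stack, which matches the stated property of avoiding deep call stack recursion: the region between `scan` and the end of the to-space acts as the work queue.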
* Bug fix: Allocations are not monotonically growing in partitioned heap for large objects
* Update benchmark results
* Update benchmark results
* Drop content of destabilized `Any`-typed actor field
* Refactor `is_primitive_type` in Candid parser and subtype check
* Do not use the cache for the main actor type compatibility check
* Update benchmark results
* Increase chunk size for stable memory clearing
* Custom bigint serialization
* Update benchmark results
* Update documentation
* Update documentation
* Optimize array deserialization
* Update benchmark results
* Code refactoring of upgrade version checks
* Remove redundant math functions
* Eliminate size redundancy in the `Object` header
* Also adjust the `Object` header in the compiler
* Revert "Also adjust the `Object` header in the compiler" This reverts commit f75bb76.
* Revert "Eliminate size redundancy in the `Object` header" This reverts commit 0fe3926.
* Record the upgrade instruction costs
* Update tests for new `Prim.rts_upgrade_instructions()` function
* Make test more ergonomic
* Incremental Graph-Copy-Based Upgrades (#4361)

# Incremental Graph-Copy-Based Upgrades

Refinement of #4286

Supporting arbitrarily large graph-copy-based upgrades beyond the instruction limit:

* Splitting the stabilization/destabilization in multiple asynchronous messages.
* Limiting the stabilization work units to fit the update or upgrade messages.
* Blocking other messages during the explicit incremental stabilization.
* Restricting the upgrade functionality to the canister owner and controllers.
* Stopping the GC during the explicit incremental upgrade process.

## Usage

For large upgrades:

1. Initiate the explicit stabilization before the upgrade:
   ```
   dfx canister call CANISTER_ID __motoko_stabilize_before_upgrade "()"
   ```
   * An assertion first checks that the caller is the canister owner or a canister controller.
   * All other messages to the canister will be blocked until the upgrade has been successfully completed.
   * The GC is stopped.
   * If defined, the actor's pre-upgrade function is called before the explicit stabilization.
   * The stabilization runs in possibly multiple asynchronous messages, each with a limited number of instructions.

2. Run the actual upgrade:
   ```
   dfx deploy CANISTER_ID
   ```
   * Run and complete the stabilization if not yet done in advance.
   * Perform the actual upgrade of the canister on the IC.
   * Start the destabilization with a limited number of steps to fit into the upgrade message.
   * If destabilization cannot be completed, the canister does not start the GC and does not accept messages except step 3.

3. Complete the explicit destabilization after the upgrade:
   ```
   dfx canister call CANISTER_ID __motoko_destabilize_after_upgrade "()"
   ```
   * An assertion checks that the caller is the canister owner or a canister controller.
   * All other messages remain blocked until the successful completion of the destabilization.
   * The destabilization runs in possibly multiple asynchronous messages, each with a limited number of instructions.
   * If defined, the actor's post-upgrade function is called at the end of the explicit destabilization.
   * The GC is restarted.

## Remarks

* Steps 1 (explicit stabilization) and/or 3 (explicit destabilization) may not be needed if the corresponding operation fits into the upgrade message.
* Stabilization and destabilization steps are limited to the increment limits:

  Operation | Message Type | IC Instruction Limit | **Increment Limit**
  ----------|--------------|----------------------|--------------------
  **Explicit (de)stabilization step** | Update | 20e9 | **16e9**
  **Actual upgrade** | Upgrade | 200e9 | **160e9**

* The stabilization code in the RTS has been restructured to be less monolithic.

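The increment limits in the table above bound how much work one message may perform. The following Rust sketch shows the general shape of such a budgeted increment loop. It is an illustration under assumptions, not the RTS code: `IncrementalWork`, `run_increment`, `messages_needed` and the fixed `cost_per_unit` are hypothetical, and the instruction count is simulated instead of being read from the IC.

```rust
/// Increment limits taken from the table above (instructions per message).
pub const UPDATE_INCREMENT_LIMIT: u64 = 16_000_000_000;
pub const UPGRADE_INCREMENT_LIMIT: u64 = 160_000_000_000;

/// Hypothetical incremental task consisting of equally expensive work units.
pub struct IncrementalWork {
    remaining_units: u64,
}

impl IncrementalWork {
    pub fn new(total_units: u64) -> Self {
        IncrementalWork { remaining_units: total_units }
    }

    /// Runs one increment: processes work units until the next unit would
    /// exceed `increment_limit`. Returns `true` once all work is completed.
    pub fn run_increment(&mut self, increment_limit: u64, cost_per_unit: u64) -> bool {
        // Simulated instruction counter (assumption: constant cost per unit).
        let mut consumed: u64 = 0;
        while self.remaining_units > 0 && consumed + cost_per_unit <= increment_limit {
            self.remaining_units -= 1;
            consumed += cost_per_unit;
        }
        self.remaining_units == 0
    }
}

/// Counts how many messages (increments) are needed to finish the work.
pub fn messages_needed(total_units: u64, increment_limit: u64, cost_per_unit: u64) -> u64 {
    // Each increment must be able to make progress, otherwise it never finishes.
    assert!(cost_per_unit > 0 && cost_per_unit <= increment_limit);
    let mut work = IncrementalWork::new(total_units);
    let mut messages = 0;
    loop {
        messages += 1;
        if work.run_increment(increment_limit, cost_per_unit) {
            return messages;
        }
    }
}
```

With a simulated cost of 1e9 instructions per unit, 100 units need several update messages at the 16e9 limit but fit into a single upgrade message at the 160e9 limit, which mirrors the remark that the explicit steps may not be needed when the operation fits into the upgrade message.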
* Manual merge conflict resolution (work in progress) * Adjust tests, resolve some merge bugs * Adjust RTS test case * Make RTS tests run again * Add missing function export * Adjust imports, manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Adjust persistence initialization * Adjust persistence version management * Adjust stable memory metadata for enhanced orthogonal persistence Distinguish enhanced orthogonal persistence from Candid legacy stabilization * Add comment * Adjust graph stabilization initialization * Adjust GC mode during destabilization * Adjust object visitor for graph destabilization * Adjust incremental graph destabilization * Adjust error message * Adjust tests * Adjust tests * Update benchmark results * Adjust test * Upgrade stable memory version after graph destabilization * Adjust memory sanity check * Clear memory on graph destabilization as first step * Adjust big int serialization for 64-bit * Fix: Clear memory on graph destabilization * Add test case for graph stabilization * Add test case for incremental graph stabilization * Add tests for graph stabilization * Add more tests for graph stabilization * Add more test cases for graph stabilization * Add more test cases for graph stabilization * More conservative persistence version check * Adjust expected test results * Adjust test * Adjust tests * Adjust tests * Adjust RTS test for stabilization * Adjust tests * Adjust test results * Remove unwanted binary files * Adjust comment * Code refactoring * Fix merge mistake * Manual merge conflict resolution * Add test cases * Manual merge conflict resolution * Fix typo in documentation Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in documentation Co-authored-by: Claudio Russo <claudio@dfinity.org> * Bug fix: Allow stabilization beyond compiler-specified stable memory limit * Adjustment to RTS unit tests * Add comments * Code refactoring * Fix difference between debug and 
release test execution * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Delete unused file * Code refactoring * Use correct trap for an unreachable case * Remove dead code * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in function identifier * Fix indendation Co-authored-by: Claudio Russo <claudio@dfinity.org> * Removing unused code * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix RTS compile error * Bug fix: Object size lookup during stabilization * experiment: refactoring of ir extensions in graph-copy PR (#4543) * refactoring of ir * fix arrange_ir.ml --------- Co-authored-by: luc-blaeser <luc.blaeser@dfinity.org> * Manual merge conflict resolution * Adjust test case, remove file check * Manual merge conflict resolution * Manual merge conflict resolution * test graph copy of text and blob iterators (#4562) * Optimize instruction limit checks * Bug fix graph copy limit on destabilization * Incremental stable memory clearing after graph copy * Parameter tuning for graph copy * Manual merge conflict resolution * Manual merge conflict resolution * Remove redundant code * Manual merge conflict resolution: Remove `ObjInd` from graph-copy stabilization * Manual merge conflict resolution * Merge Preparation: Latest IC with Graph Copy (#4630) * Adjust to new system API * Port to latest IC 64-bit system API * Update to new IC with Wasm64 * Updating nix hashes * Update IC dependency (Wasm64 enabled) * Update expected test results * Fix migration test * Use latest `drun` * Adjust expected test results * Updating nix hashes * Update expected test results * Fix `drun` nix build for Linux * 
Disable DTS in `drun`, refactor `drun` patches * Update expected test results for new `drun` * Limiting amount of stable memory accessed per graph copy increment * Reformat * Adjust expected test result --------- Co-authored-by: Nix hash updater <41898282+github-actions[bot]@users.noreply.github.com> * Message-dependent stable memory access limit * Graph copy: Fix accessed memory limit during stabilization * Enhanced Orthogonal Persistence (Complete Integration) (#4488) * Prepare two compilation targets * Combined RTS Makefile * Port classical compiler backend to combined solution * Adjust nix config file * Start combined RTS * Reduce classical compiler backend changes * Continue combined RTS * Make RTS compilable for enhanced orthogonal persistence * Make RTS tests runnable again for enhanced orthogonal persistence * Adjust compiler backend of enhanced orthogonal persistence * Unify Tom's math library binding * Make classical non-incremental RTS compile again * Make classical incremental GC version compilable again * Make all RTS versions compile again * Adjust memory sanity check for combined RTS modes * Prepare RTS tests for combined modes * Continue RTS test merge * Continue RTS tests combined modes * Continue RTS tests support for combined modes * Adjust LEB128 encoding for combined mode * Adjust RTS test for classical incremental GC * Adjust RTS GC tests * Different heap layouts in RTS tests * Continue RTS GC test multi-mode support * Make all RTS run again * Adjust linker to support combined modes * Adjust libc import in RTS for combined mode * Adjust RTS test dependencies * Bugfix in Makefile * Adjust compiler backend import for combined mode * Adjust RTS import for combined mode * Adjust region management to combined modes * Adjust classical compiler backend to fit combined modes * Reorder object tags to match combined RTS * Adjust test * Adjust linker for multi memory during Wasi mode with regions * Adjust tests * Adjust bigint LEB encoding for combined 
modes * Adjust bigint LEB128 encoding for combined modes * Adjust test * Adjust tests * Adjust test * Code refactoring: SLEB128 for BigInt * Adjust tests * Adjust test * Reformat * Adjust tests * Adjust benchmark results * Adjust RTS for unit tests * Reintroduce compiler flags in classical mode * Support classical incremental GC * Add missing export for classical incremental GC * Adjust tests * Adjust test * Adjust test * Adjust test * Adjust test * Adjust test * Adjust test * Pass `keep_main_memory` upgrade option only for enhanced orthogonal persistence * Adjust test * Update nix hash * Adjust Motoko base dependency * Adjust tests * Extend documentation * Adjust test * Update documentation * Update documentation * Manual merge conflict resolution * Manual merge refinement * Manual merge conflict resolution * Manual merge conflict resolution * Refactor migration test from classical to new persistence * Adjust migration test * Manual merge conflict resolution * Manual merge conflict resolution * Adjust compiler reference documentation * Test CI build * Test CI build * Adjust performance comparison in CI build * Manual merge conflict resolution * Add test for migration paths * Adjust test for integrated PR * Adjust test case * Manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Code refactoring * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Manual merge conflict resolution * Add static assertions, code formatting * Manual merge conflict resolution * Add test case * Refine comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Manual merge conflict resolution * Manual merge conflict resolution * Code refactoring * Manual merge conflict resolution * Adjust test run script messages * Manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Merge Preparation: Dynamic 
Memory Capacity for Integrated EOP (#4586) * Tune for unknown memory capacity in 64-bit * Adjust benchmark results * Fix debug assertion, code refactoring * Manual merge conflict resolution * Manual merge conflict resolution * Code refactoring: Improve comments * Reformat * Fix debug assertion * Re-enable memory reserve for upgrade and queries See PR #4158 * Manual merge conflict resolution * Manual merge conflict resolution * Update benchmark results * Manual merge conflict resolution * Manual merge conflict resolution * Merge Preparation: Latest IC with Integrated EOP (#4638) * Adjust to new system API * Port to latest IC 64-bit system API * Update to new IC with Wasm64 * Updating nix hashes * Update IC dependency (Wasm64 enabled) * Update expected test results * Fix migration test * Use latest `drun` * Adjust expected test results * Updating nix hashes * Update expected test results * Fix `drun` nix build for Linux * Disable DTS in `drun`, refactor `drun` patches * Update expected test results for new `drun` * Limiting amount of stable memory accessed per graph copy increment * Reformat * Manual merge conflict resolution * Manual merge conflict resolution * Adjust expected test result --------- Co-authored-by: Nix hash updater <41898282+github-actions[bot]@users.noreply.github.com> * Manual merge conflict resolution * Documentation Update for Enhanced Orthogonal Persistence (#4670) --------- Co-authored-by: Claudio Russo <claudio@dfinity.org> Co-authored-by: Nix hash updater <41898282+github-actions[bot]@users.noreply.github.com> --------- Co-authored-by: Claudio Russo <claudio@dfinity.org> Co-authored-by: Nix hash updater <41898282+github-actions[bot]@users.noreply.github.com>

* Adjust emscripten dependency for nix * Use latest emscripten from nix unstable channel * Adjust CI build * Adjust CI build * Adjust CI build * Adjust CI build * Add latest emscripten via nix `sources.json` * Adjust emscripten dependency in `sources.json` * Update sources.json * Update sources.json * Disable base library tests * Adjust build * Adjust tests, disable benchmark * Enable random tests on 64-bit * Bug fix * Exclude inter-actor quickcheck tests * Downscale test for CI * Remove unnecessary clean-up function * Adjust `is_controller` system call * Manual merge from master * Fix direct numeric conversions * Use `drun` with 64-bit main memory * Adjust callback signatures * Adjust ignore callback sentinel value * Bug fix * Remove memory reserve feature * Adjust CI build * Adjust serialization * Bug fix * Bug fix * Bug fix * Adjust IC system calls * Adjust IC system calls * Bug fix * Create Cargo.lock * Adjust region and stable memory accesses * Fix float format length * Update nix setup * Adjust tests * Adjust nix config * Adjust stabilization * Bug fix * Adjust stable memory and region accesses * Adjust region RTS calls * Manual merge RTS tests * Manual merge of compiler * Adjust IC call * Update benchmark * Adjust test * Adjust test script * Adjust tests * Bug fix * Adjust tests * Adjust linker tests * Minor refactoring * Adjust test * Adjust CI build * Update IC dependency * Wasm profiler does not support 64-bit * Test case beyond 4GB * Update CI test configuration * Increase partitioned heap to 64GB * Update IC dependency * Manual merge, to be continued * Adjust BigInt literals * Bug fix * Adjust tests * Manual merge conflict resolution * Code refactoring * Update IC dependency * Increase data segment limit * Adjust test case * Update migration test case * Revert "Code refactoring" This reverts commit 8063f8b. 
* Adjust test case * Update benchmark results * Update documentation * Update fingerprint to 64-bit * Manual merge Rust allocator * Remove memory reserve * Test CI build * Refine memory compatibility check * Add test case * Distinguish blob and Nat8 arrays * Bug fix * Reformat code * Update benchmark results * Distinguish tuple type in memory compatibility check * Update IC dependency * Revert "Test CI build" This reverts commit d4889f9. * Use 64-bit IC API * Update IC dependency * Update benchmark results * Adjust sanity checks * Reformat * Upgrade IC dependency, use persistence flag * Update IC dependency * Update IC dependency * Manual resolution of undetected merge conflicts * Manual merge conflict resolution * Resolve merge conflicts * Manual merge conflict resolution * Manual merge: Adjust test * Merge branch 'luc/stable-heap' into luc/stable-heap64 * Update base library dependency * Manual merge conflict resolution * Updating nix hashes * Limit array length because of optimized array iterator * Code refactoring * Update benchmark results * Update motoko base dependency * Enhanced Orthogonal Persistence: Use Passive Data Segments (64-Bit) (#4411) Only passive Wasm data segments are used by the compiler and runtime system. In contrast to ordinary active data segments, passive segments can be explicitly loaded to a dynamic address. This simplifies two aspects: * The generated Motoko code can contain arbitrarily large data segments which can be loaded to the dynamic heap when needed. * The IC can simply retain the main memory on an upgrade without needing to patch the active data segments of the new program version to the persistent memory. However, more specific handling is required for the Rust-implemented runtime system: The Rust-generated active data segments of the runtime system are changed to passive and loaded to the expected static address at the program start (canister initialization and upgrade). 
The location and size of the RTS data segments are therefore limited to a defined reserve, see above. This is acceptable because the RTS only uses a small data segment that is independent of the compiled Motoko program. * Update IC dependency * Merge Preparation: Precise Tagging + Enhanced Orthogonal Persistence (64-Bit) (#4392) Preparing merging #4369 in #4225 * Manual merge conflict resolution * Update Motoko base dependency * Manual merge conflict resolution * Manual merge conflict resolution * Optimization: Object Pooling for Enhanced Orthogonal Persistence (#4465) * Object pooling * Update benchmark results * Optimize further (BigNum pooling) * Update benchmark results * Adjust tests * Optimize static blobs * Adjust test and benchmark results * Update documentation * Manual merge conflict resolution * Update .gitignore * Enhanced Orthogonal Persistence: Refactor 64-bit Port of SLEB128 for BigInt (#4486) * Refactor 64-bit port of SLEB128 for BigInt * Remove redundant test file * Adjust data segment loading To avoid allocation of trap text blob during object pool creation. * Manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Update benchmark results * Manual merge conflict resolution * Update Motoko base dependency * Manual merge conflict resolution * Apply the expected shift distance * Remove redundant code * Code refactoring: Move constant * Add a debug assertion * Code refactoring: Reduce code difference * Update comment * Represent function indices as `i32` * Use pointer compression on Candid destabilization Candid destabilization remembers aliases as 32-bit pointers in deserialized data. However, the deserialized pointers can be larger than 32-bit due to the 64-bit representation. Therefore, use pointer compression (by 3 bits) to store the 64-bit addresses in the 32-bit alias memo section. 
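The pointer compression by 3 bits mentioned in the last commit note can be sketched in Rust as follows. This is only an illustration: the names `compress_pointer` and `decompress_pointer` are hypothetical, and the sketch assumes that all compressed addresses are 8-byte aligned, so that the three lowest bits carry no information.

```rust
/// Number of address bits dropped by the compression (from the commit note).
const COMPRESSION_SHIFT: u32 = 3;

/// Compresses a 64-bit address into 32 bits by dropping the three lowest bits.
/// Returns `None` if the address is not 8-byte aligned (assumption of this
/// sketch) or if the shifted value does not fit into 32 bits.
pub fn compress_pointer(address: u64) -> Option<u32> {
    if address % (1u64 << COMPRESSION_SHIFT) != 0 {
        return None;
    }
    u32::try_from(address >> COMPRESSION_SHIFT).ok()
}

/// Restores the original 64-bit address from its compressed form.
pub fn decompress_pointer(compressed: u32) -> u64 {
    (compressed as u64) << COMPRESSION_SHIFT
}
```

Under these assumptions, a 32-bit slot can refer to addresses below 2^35 bytes (32 GiB), which is why the scheme only works while the referenced data stays within that range.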
* Manual merge conflict resolution * Fix test case * Add comment * Add TODO comment * Code refactoring: Arithmetics * Fix boundary check in small `Int` `pow` function * Code refactoring: `Nat` conversions * Code refactoring: Remove redundant blank. Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix tagging for `hashBlob` * Remove redundant shifts for signed bit count operations * Reenable randomized tests * Update quickcheck documentation * Revert unwanted modification in test case This partially reverts commit ea20c6c. * Adjust test case to original configuration * Try to run original `map-upgrades` test * Tests for wasi stable memory beyond 4GB * Update expected test result * Code refactoring: Linker * Optimizations * Manual merge conflict resolution * Use 64-bit version of Tom's math library * Add benchmark case * Optimize float to int conversion for 64-bit * Manual merge conflict resolution * Experiment Remove `musl`/`libc` dependency from RTS (#4577) * Remove MUSL/LIBC dependency from RTS * Update benchmark result * Manual merge conflict resolution * Unbounded Number of Heap Partitions for 64-Bit (#4556) * EOP: Support Unknown Main Memory Capacity in 64-Bit (#4585) * Tune for unknown memory capacity in 64-bit * Adjust benchmark results * Fix debug assertion, code refactoring * Code refactoring: Improve comments * Reformat * Re-enable memory reserve for upgrade and queries See PR #4158 * Adjust comment * Fix build * EOP: Integrating Latest IC with Memory 64 (#4610) * Adjust to new system API * Port to latest IC 64-bit system API * Update to new IC with Wasm64 * Updating nix hashes * Update IC dependency (Wasm64 enabled) * Update expected test results * Fix migration test * Use latest `drun` * Adjust expected test results * Updating nix hashes * Update expected test results * Fix `drun` nix build for Linux * Disable DTS in `drun`, refactor `drun` patches * Adjust expected test results --------- Co-authored-by: Nix hash updater 
<41898282+github-actions[bot]@users.noreply.github.com> * Enhanced Orthogonal Persistence (64-Bit with Graph Copy) (#4475) * Graph copy: Work in progress * Implement stable memory reader writer * Add skip function * Code refactoring * Continue stabilization function * Support update at scan position * Code refactoring * Code refactoring * Extend unit test * Continue implementation * Adjust test * Prepare memory compatibility check * Variable stable to-space offset * Deserialize with partitioned heap * Prepare metadata stabilization * Adjust stable memory size * Stabilization version management * Remove code redundancies * Fix version upgrade * Put object field hashes in a blob * Support object type * Code refactoring * Support blob, fix bug * Renaming variable * Adjust deserialization heap start * Handle null singleton * Fix version upgrade * Support regions * Backup first word in stable memory * Support additional fields in upgraded actor * Make unit tests runnable again * Dummy null singleton in unit test * Add test cases * Support boxed 32-bit and 64-bit numbers * Support more object types * Support more object types * Handle `true` bool constant * Grow main memory on bulk copy * Update benchmark results * Support bigint * Clear deserialized data in stable memory * Update test results * Add documentation * Reformat * Add missing file * Update design/GraphCopyStabilization.md Co-authored-by: Claudio Russo <claudio@dfinity.org> * Update rts/motoko-rts/src/stabilization.rs Co-authored-by: Claudio Russo <claudio@dfinity.org> * Update rts/motoko-rts/src/stabilization.rs Co-authored-by: Claudio Russo <claudio@dfinity.org> * Graph Copy: Explicit Stable Data Layout (#4293) Refinement of Graph-Copy-Based Stabilization (#4286): Serialize/deserialize in an explicitly defined and fixed stable layout for a long-term perspective. * Supporting 64-bit pointer representations in stable format, even if main memory currently only uses 32-bit addresses. 
Open aspect:
* Make `BigInt` stable format independent of Tom's math library.
* Update rts/motoko-rts/src/stabilization.rs Co-authored-by: Claudio Russo <claudio@dfinity.org>
* Update rts/motoko-rts/src/stabilization/layout.rs Co-authored-by: Claudio Russo <claudio@dfinity.org>
* Handle non-stable fields in stable records
* Add object type `Some`
* Add test case
* Adjust stabilization to incremental GC
* Update benchmark results
* Distinguish assertions
* Fix RTS unit test
* Update benchmark results
* Adjust test
* Adjust test
* Fix: Handle all non-stable types during serialization
* Fix typos and complete comment
* Experiment: Simplified Graph-Copy-Based Stabilization (#4313)

# Experiment: Simplified Graph-Copy-Based Stabilization

**Simplified version of #4286, without stable memory buffering and without memory flipping on deserialization.**

Using graph copying instead of Candid-based serialization for stabilization, to save stable variables across upgrades.

## Goals

* **Stop-gap solution until enhanced orthogonal persistence**: More scalable stabilization than the current Candid(ish) serialization.
* **With enhanced orthogonal persistence**: Upgrades in the presence of memory layout changes introduced by future compiler versions.

## Design

Graph copy of sub-graph of stable objects from main memory to stable memory and vice versa on upgrades.

## Properties

* Preserve sharing for all objects like in the heap.
* Allow the serialization format to be independent of the main memory layout.
* Limit the additional main memory needed during serialization and deserialization.
* Avoid deep call stack recursion (stack overflow).

## Memory Compatibility Check

Apply a memory compatibility check analogous to the enhanced orthogonal persistence, since the upgrade compatibility of the graph copy is not identical to the Candid subtype relation.

## Algorithm

Applying Cheney’s algorithm [1, 2] for both serialization and deserialization:

### Serialization

* Cheney’s algorithm using main memory as from-space and stable memory as to-space:
  * Focusing on stable variables as root (sub-graph of stable objects).
  * The target pointers and Cheney’s forwarding pointers denote the (skewed) offsets in stable memory.
  * Using streaming reads for the `scan`-pointer and streaming writes for the `free`-pointer in stable memory.

### Deserialization

* Cheney’s algorithm using stable memory as from-space and main memory as to-space:
  * Starting with the stable root created during the serialization process.
  * Objects are allocated in main memory using the default allocator.
  * Using random read/write access on the stable memory.

## Stable Format

For a long-term perspective, the object layout of the serialized data in the stable memory is fixed and independent of the main memory layout.

* Pointers support 64-bit representations, even if only 32-bit pointers are used in current main memory address space.
* The Brooks forwarding pointer is omitted (used by the incremental GC).
* The pointers encode skewed stable memory offsets to the corresponding target objects.
* References to the null objects are encoded by a sentinel value.

## Specific Aspects

* The null object is handled specifically to guarantee the singleton property. For this purpose, null references are encoded as sentinel values that are decoded back to the static singleton of the new program version.
* Field hashes in objects are serialized in a blob. On deserialization, the hash blob is allocated in the dynamic heap. Same-typed objects that have been created by the same program version share the same hash blob.
* Stable records can dynamically contain non-stable fields due to structural sub-typing. A dummy value can be serialized for such fields as a new program version can no longer access this field through the stable types.
* For backwards compatibility, old Candid destabilzation is still supported when upgrading from a program that used older compiler version. * Incremental GC: Serialization needs to consider Brooks forwarding pointers (not to be confused with the Cheney's forwarding information), while deserialization can deal with partitioned heap that can have internal fragmentation (free space at partition ends). ## Complexity Specific aspects that entail complexity: * For each object type, not only serialization and deserialization needs to be implemeneted but also the pointer scanning logic of its serialized and deserialized format. Since the deserialization also targets stable memory the existing pointer visitor logic cannot be used for scanning pointers in its deserialized format. * The deserialization requires scanning the heap which is more complicated for the partitioned heap. The allocator must yield monotonously growing addresses during deserialization. Free space gaps are allowed to complete partitions. ## Open Aspects * Unused fields in stable records that are no longer declared in a new program versions should be removed. This could be done during garbage collection, when objects are moved/evacuated. * The binary serialization and deserialization of `BigInt` entails dynamic allocations (cf. `mp_to_sbin` and `mp_from_sbin` of Tom's math library). ## Related PRs * Motoko Enhanced Orthogonal Persistence: #4225 * Motoko Incremental Garbage Collector: #3837 ## References [1] C. J. Cheney. A Non-Recursive List Compacting Algorithm. Communications of the ACM, 13(11):677-8, November 1970. [2] R. Jones and R. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley 2003. Algorithm 6.1: Cheney's algorithm, page 123. 
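The serialization pass described under "Algorithm" can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the RTS implementation: main memory and stable memory are modelled as vectors, stable offsets are plain vector indices rather than skewed byte offsets, and the names `Object`, `StableObject`, `NULL_SENTINEL`, `evacuate`, and `serialize` are hypothetical.

```rust
/// Hypothetical sentinel that encodes a reference to the null singleton.
const NULL_SENTINEL: u64 = u64::MAX;

/// Object in the simulated main memory (from-space).
struct Object {
    payload: u32,
    /// From-space indices of the referenced objects, `None` for null.
    fields: Vec<Option<usize>>,
}

/// Object in the simulated stable memory (to-space).
#[derive(Debug, PartialEq)]
struct StableObject {
    payload: u32,
    /// To-space offsets after scanning, or `NULL_SENTINEL`.
    fields: Vec<u64>,
}

/// Copies `heap[index]` to the to-space unless it was already copied,
/// and returns its to-space offset (Cheney's forwarding information).
fn evacuate(
    heap: &[Object],
    forwarding: &mut [Option<u64>],
    to_space: &mut Vec<StableObject>,
    index: usize,
) -> u64 {
    if let Some(offset) = forwarding[index] {
        return offset; // already copied: sharing is preserved
    }
    let offset = to_space.len() as u64; // current `free` position
    let source = &heap[index];
    to_space.push(StableObject {
        payload: source.payload,
        // Not yet scanned: fields still hold from-space indices.
        fields: source
            .fields
            .iter()
            .map(|field| field.map_or(NULL_SENTINEL, |target| target as u64))
            .collect(),
    });
    forwarding[index] = Some(offset);
    offset
}

/// Serializes the sub-graph reachable from `root` without recursion.
fn serialize(heap: &[Object], root: usize) -> Vec<StableObject> {
    let mut forwarding: Vec<Option<u64>> = vec![None; heap.len()];
    let mut to_space = Vec::new();
    evacuate(heap, &mut forwarding, &mut to_space, root);
    let mut scan = 0; // `scan` position, always behind or at `free`
    while scan < to_space.len() {
        for slot in 0..to_space[scan].fields.len() {
            let value = to_space[scan].fields[slot];
            if value != NULL_SENTINEL {
                // Replace the from-space index by the to-space offset.
                let target = evacuate(heap, &mut forwarding, &mut to_space, value as usize);
                to_space[scan].fields[slot] = target;
            }
        }
        scan += 1;
    }
    to_space
}
```

Calling `serialize(&heap, root)` returns only the objects reachable from the root. An object referenced from several places is copied once, cycles terminate because of the forwarding table, and the explicit `scan` loop replaces call stack recursion.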
* Bug fix: Allocations are not monotonically growing in partitioned heap for large objects
* Update benchmark results
* Update benchmark results
* Drop content of destabilized `Any`-typed actor field
* Refactor `is_primitive_type` in Candid parser and subtype check
* Do not use the cache for the main actor type compatibility check
* Update benchmark results
* Increase chunk size for stable memory clearing
* Custom bigint serialization
* Update benchmark results
* Update documentation
* Update documentation
* Optimize array deserialization
* Update benchmark results
* Code refactoring of upgrade version checks
* Remove redundant math functions
* Eliminate size redundancy in the `Object` header
* Also adjust the `Object` header in the compiler
* Revert "Also adjust the `Object` header in the compiler"
  This reverts commit f75bb76.
* Revert "Eliminate size redundancy in the `Object` header"
  This reverts commit 0fe3926.
* Record the upgrade instruction costs
* Update tests for new `Prim.rts_upgrade_instructions()` function
* Make test more ergonomic
* Incremental Graph-Copy-Based Upgrades (#4361)

# Incremental Graph-Copy-Based Upgrades

Refinement of #4286

Supporting arbitrarily large graph-copy-based upgrades beyond the instruction limit:
* Splitting the stabilization/destabilization in multiple asynchronous messages.
* Limiting the stabilization work units to fit the update or upgrade messages.
* Blocking other messages during the explicit incremental stabilization.
* Restricting the upgrade functionality to the canister owner and controllers.
* Stopping the GC during the explicit incremental upgrade process.

## Usage
For large upgrades:
1. Initiate the explicit stabilization before the upgrade:

```
dfx canister call CANISTER_ID __motoko_stabilize_before_upgrade "()"
```

* An assertion first checks that the caller is the canister owner or a canister controller.
* All other messages to the canister will be blocked until the upgrade has been successfully completed.
* The GC is stopped.
* If defined, the actor's pre-upgrade function is called before the explicit stabilization.
* The stabilization runs in possibly multiple asynchronous messages, each with a limited number of instructions.

2. Run the actual upgrade:

```
dfx deploy CANISTER_ID
```

* Run and complete the stabilization if not yet done in advance.
* Perform the actual upgrade of the canister on the IC.
* Start the destabilization with a limited number of steps to fit into the upgrade message.
* If destabilization cannot be completed, the canister does not start the GC and does not accept messages except step 3.

3. Complete the explicit destabilization after the upgrade:

```
dfx canister call CANISTER_ID __motoko_destabilze_after_upgrade "()"
```

* An assertion checks that the caller is the canister owner or a canister controller.
* All other messages remain blocked until the successful completion of the destabilization.
* The destabilization runs in possibly multiple asynchronous messages, each with a limited number of instructions.
* If defined, the actor's post-upgrade function is called at the end of the explicit destabilization.
* The GC is restarted.

## Remarks
* Steps 1 (explicit stabilization) and/or 3 (explicit destabilization) may not be needed if the corresponding operation fits into the upgrade message.
* Stabilization and destabilization steps are limited to the increment limits:

Operation | Message Type | IC Instruction Limit | **Increment Limit**
----------|--------------|----------------------|--------------------
**Explicit (de)stabilization step** | Update | 20e9 | **16e9**
**Actual upgrade** | Upgrade | 200e9 | **160e9**

* The stabilization code in the RTS has been restructured to be less monolithic.
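The increment limits in the table above are 80% of the corresponding IC instruction limits (16e9 of 20e9, 160e9 of 200e9). The following sketch illustrates how work can be split across messages under such a limit. It is an assumption-laden model, not the RTS code: the type `IncrementalCopy`, the function `increments_needed`, and the fixed `cost_per_unit` are hypothetical stand-ins for the measured instruction consumption.

```rust
/// Increment limits taken from the table above.
const UPDATE_INCREMENT_LIMIT: u64 = 16_000_000_000;
const UPGRADE_INCREMENT_LIMIT: u64 = 160_000_000_000;

/// Hypothetical model of a (de)stabilization run that is split into work units.
struct IncrementalCopy {
    /// Work units (for example objects) that still have to be copied.
    remaining_units: u64,
    /// Assumed fixed instruction cost per work unit.
    cost_per_unit: u64,
}

impl IncrementalCopy {
    /// Performs as many work units as fit into `limit` instructions.
    /// Returns `true` once the whole copy has been completed.
    fn run_increment(&mut self, limit: u64) -> bool {
        let mut used = 0;
        while self.remaining_units > 0 && used + self.cost_per_unit <= limit {
            self.remaining_units -= 1;
            used += self.cost_per_unit;
        }
        self.remaining_units == 0
    }
}

/// Counts the messages needed to finish the copy under a given limit.
fn increments_needed(mut copy: IncrementalCopy, limit: u64) -> u32 {
    // A single unit must fit into one increment, otherwise no progress is possible.
    assert!(copy.cost_per_unit <= limit);
    let mut increments = 0;
    loop {
        increments += 1;
        if copy.run_increment(limit) {
            return increments;
        }
    }
}
```

In this model, the same amount of work that needs several explicit update messages may fit into a single upgrade message, which matches the remark that the explicit steps are only needed for large upgrades.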
* Manual merge conflict resolution (work in progress) * Adjust tests, resolve some merge bugs * Adjust RTS test case * Make RTS tests run again * Add missing function export * Adjust imports, manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Adjust persistence initialization * Adjust persistence version management * Adjust stable memory metadata for enhanced orthogonal persistence Distinguish enhanced orthogonal persistence from Candid legacy stabilization * Add comment * Adjust graph stabilization initialization * Adjust GC mode during destabilization * Adjust object visitor for graph destabilization * Adjust incremental graph destabilization * Adjust error message * Adjust tests * Adjust tests * Update benchmark results * Adjust test * Upgrade stable memory version after graph destabilization * Adjust memory sanity check * Clear memory on graph destabilization as first step * Adjust big int serialization for 64-bit * Fix: Clear memory on graph destabilization * Add test case for graph stabilization * Add test case for incremental graph stabilization * Add tests for graph stabilization * Add more tests for graph stabilization * Add more test cases for graph stabilization * Add more test cases for graph stabilization * More conservative persistence version check * Adjust expected test results * Adjust test * Adjust tests * Adjust tests * Adjust RTS test for stabilization * Adjust tests * Adjust test results * Remove unwanted binary files * Adjust comment * Code refactoring * Fix merge mistake * Manual merge conflict resolution * Add test cases * Manual merge conflict resolution * Fix typo in documentation Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in documentation Co-authored-by: Claudio Russo <claudio@dfinity.org> * Bug fix: Allow stabilization beyond compiler-specified stable memory limit * Adjustment to RTS unit tests * Add comments * Code refactoring * Fix difference between debug and 
release test execution * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Delete unused file * Code refactoring * Use correct trap for an unreachable case * Remove dead code * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in function identifier * Fix indendation Co-authored-by: Claudio Russo <claudio@dfinity.org> * Removing unused code * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Fix RTS compile error * Bug fix: Object size lookup during stabilization * experiment: refactoring of ir extensions in graph-copy PR (#4543) * refactoring of ir * fix arrange_ir.ml --------- Co-authored-by: luc-blaeser <luc.blaeser@dfinity.org> * Manual merge conflict resolution * Adjust test case, remove file check * Manual merge conflict resolution * Manual merge conflict resolution * test graph copy of text and blob iterators (#4562) * Optimize instruction limit checks * Bug fix graph copy limit on destabilization * Incremental stable memory clearing after graph copy * Parameter tuning for graph copy * Manual merge conflict resolution * Manual merge conflict resolution * Remove redundant code * Manual merge conflict resolution: Remove `ObjInd` from graph-copy stabilization * Manual merge conflict resolution * Merge Preparation: Latest IC with Graph Copy (#4630) * Adjust to new system API * Port to latest IC 64-bit system API * Update to new IC with Wasm64 * Updating nix hashes * Update IC dependency (Wasm64 enabled) * Update expected test results * Fix migration test * Use latest `drun` * Adjust expected test results * Updating nix hashes * Update expected test results * Fix `drun` nix build for Linux * 
Disable DTS in `drun`, refactor `drun` patches * Update expected test results for new `drun` * Limiting amount of stable memory accessed per graph copy increment * Reformat * Adjust expected test result --------- Co-authored-by: Nix hash updater <41898282+github-actions[bot]@users.noreply.github.com> * Message-dependent stable memory access limit * Graph copy: Fix accessed memory limit during stabilization * Enhanced Orthogonal Persistence (Complete Integration) (#4488) * Prepare two compilation targets * Combined RTS Makefile * Port classical compiler backend to combined solution * Adjust nix config file * Start combined RTS * Reduce classical compiler backend changes * Continue combined RTS * Make RTS compilable for enhanced orthogonal persistence * Make RTS tests runnable again for enhanced orthogonal persistence * Adjust compiler backend of enhanced orthogonal persistence * Unify Tom's math library binding * Make classical non-incremental RTS compile again * Make classical incremental GC version compilable again * Make all RTS versions compile again * Adjust memory sanity check for combined RTS modes * Prepare RTS tests for combined modes * Continue RTS test merge * Continue RTS tests combined modes * Continue RTS tests support for combined modes * Adjust LEB128 encoding for combined mode * Adjust RTS test for classical incremental GC * Adjust RTS GC tests * Different heap layouts in RTS tests * Continue RTS GC test multi-mode support * Make all RTS run again * Adjust linker to support combined modes * Adjust libc import in RTS for combined mode * Adjust RTS test dependencies * Bugfix in Makefile * Adjust compiler backend import for combined mode * Adjust RTS import for combined mode * Adjust region management to combined modes * Adjust classical compiler backend to fit combined modes * Reorder object tags to match combined RTS * Adjust test * Adjust linker for multi memory during Wasi mode with regions * Adjust tests * Adjust bigint LEB encoding for combined 
modes * Adjust bigint LEB128 encoding for combined modes * Adjust test * Adjust tests * Adjust test * Code refactoring: SLEB128 for BigInt * Adjust tests * Adjust test * Reformat * Adjust tests * Adjust benchmark results * Adjust RTS for unit tests * Reintroduce compiler flags in classical mode * Support classical incremental GC * Add missing export for classical incremental GC * Adjust tests * Adjust test * Adjust test * Adjust test * Adjust test * Adjust test * Adjust test * Pass `keep_main_memory` upgrade option only for enhanced orthogonal persistence * Adjust test * Update nix hash * Adjust Motoko base dependency * Adjust tests * Extend documentation * Adjust test * Update documentation * Update documentation * Manual merge conflict resolution * Manual merge refinement * Manual merge conflict resolution * Manual merge conflict resolution * Refactor migration test from classical to new persistence * Adjust migration test * Manual merge conflict resolution * Manual merge conflict resolution * Adjust compiler reference documentation * Test CI build * Test CI build * Adjust performance comparison in CI build * Manual merge conflict resolution * Add test for migration paths * Adjust test for integrated PR * Adjust test case * Manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Code refactoring * Fix typo in comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Manual merge conflict resolution * Add static assertions, code formatting * Manual merge conflict resolution * Add test case * Refine comment Co-authored-by: Claudio Russo <claudio@dfinity.org> * Manual merge conflict resolution * Manual merge conflict resolution * Code refactoring * Manual merge conflict resolution * Adjust test run script messages * Manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Manual merge conflict resolution * Merge Preparation: Dynamic 
Memory Capacity for Integrated EOP (#4586) * Tune for unknown memory capacity in 64-bit * Adjust benchmark results * Fix debug assertion, code refactoring * Manual merge conflict resolution * Manual merge conflict resolution * Code refactoring: Improve comments * Reformat * Fix debug assertion * Re-enable memory reserve for upgrade and queries See PR #4158 * Manual merge conflict resolution * Manual merge conflict resolution * Update benchmark results * Manual merge conflict resolution * Manual merge conflict resolution * Merge Preparation: Latest IC with Integrated EOP (#4638) * Adjust to new system API * Port to latest IC 64-bit system API * Update to new IC with Wasm64 * Updating nix hashes * Update IC dependency (Wasm64 enabled) * Update expected test results * Fix migration test * Use latest `drun` * Adjust expected test results * Updating nix hashes * Update expected test results * Fix `drun` nix build for Linux * Disable DTS in `drun`, refactor `drun` patches * Update expected test results for new `drun` * Limiting amount of stable memory accessed per graph copy increment * Reformat * Manual merge conflict resolution * Manual merge conflict resolution * Adjust expected test result --------- Co-authored-by: Nix hash updater <41898282+github-actions[bot]@users.noreply.github.com> * Manual merge conflict resolution * Documentation Update for Enhanced Orthogonal Persistence (#4670) --------- Co-authored-by: Claudio Russo <claudio@dfinity.org> Co-authored-by: Nix hash updater <41898282+github-actions[bot]@users.noreply.github.com> --------- Co-authored-by: Claudio Russo <claudio@dfinity.org> Co-authored-by: Nix hash updater <41898282+github-actions[bot]@users.noreply.github.com> --------- Co-authored-by: Nix hash updater <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Claudio Russo <claudio@dfinity.org>

luc-blaeser added 30 commits February 22, 2023 11:49

Adjust comment

4c89aee

Code refactoring

a22018a

Barrier optimization

54d0334

Bug fix

075e579

Pause tuning

a1dfb7e

GC tuning

6dbaf33

Adjust comment

5f9b48a

Update benchmark

fef39be

Adjust test for different stack frame sizes

fb4fcc0

Due to performance-optimized barrier code

Merge branch 'master' into luc/forwarding_pointer

96b1ecb

Remove forwarding pointer test logic

a5eaaf6

Adjust merge

c7caef6

Adjust merge with generational GC

77b9277

Remove forwarding pointer sanity check logic

2d4af15

Merge branch 'luc/forwarding_pointer' into luc/incremental-mark-bitmap

b5207df

Add more forwarding pointer changes

7e4c852

Merge branch 'luc/forwarding_pointer' into luc/incremental-mark-bitmap

7e09745

More forwarding pointer changes

ca39a4b

Merge branch 'luc/forwarding_pointer' into luc/incremental-mark-bitmap

fac4442

Adjust format and tests

ec820cc

Merge branch 'luc/forwarding_pointer' into luc/incremental-mark-bitmap

37ceebd

Adjust test result

2e53539

Revert unnecessary changes

9c86583

Adjust format

09c81e4

Reformat

19fe816

Merge branch 'luc/forwarding_pointer' into luc/incremental-mark-bitmap

f6e31da

Also use memory check mode for generational GC

a16f449

Revert change in run

a2bb7c4

Adjust comments

d2c18fe

Merge branch 'luc/forwarding_pointer' into luc/incremental-mark-bitmap

15102e1

Merge branch 'master' into luc/forwarding_pointer

647d517

crusso reviewed May 10, 2023

rts/motoko-rts/src/gc/incremental/sort.rs (resolved)

luc-blaeser added 2 commits May 10, 2023 13:22

Merge branch 'luc/forwarding_pointer' into luc/incremental-preparation

0bba417

Merge branch 'luc/incremental-preparation' into luc/incremental-gc

97afe73

ulan approved these changes May 10, 2023

crusso approved these changes May 10, 2023

luc-blaeser added 4 commits May 11, 2023 10:08

Use structural equality in OCaml

bb76ecf

Add GC random test

0457f54

Prepare Changelog.md

ce3d393

Document the incremental GC option as beta testing

cf2f0e0

Base automatically changed from luc/incremental-preparation to master May 12, 2023 07:57

Merge branch 'master' into luc/incremental-gc

b1b1e7f

luc-blaeser added the automerge-squash label ("When ready, merge (using squash)") May 12, 2023

mergify bot merged commit 8cab778 into master May 12, 2023

mergify bot removed the automerge-squash label ("When ready, merge (using squash)") May 12, 2023

luc-blaeser deleted the luc/incremental-gc branch May 12, 2023 09:55

luc-blaeser mentioned this pull request Nov 16, 2023

Incremental Graph-Copy-Based Stabilization #4286

Closed

luc-blaeser mentioned this pull request Dec 4, 2023

Experiment: Simplified Graph-Copy-Based Stabilization #4313

Merged

luc-blaeser mentioned this pull request Dec 21, 2023

Officializing the Incremental GC #4340

Merged

luc-blaeser commented Feb 24, 2023 (edited)

ulan left a comment

luc-blaeser commented May 10, 2023

crusso left a comment

iclighthouse commented May 13, 2023

ByronBecker commented May 14, 2023

luc-blaeser commented May 15, 2023 (edited)
