
Dirty tracking performance improvements #210

Merged: 56 commits into master from userfault, Jan 7, 2022
Conversation

@Shillaker (Collaborator) commented Dec 29, 2021

This PR is the third (and final) in a series of overhauls of the threading, snapshotting and dirty tracking model.

This PR addresses the underlying performance issue with the dirty tracking, which is currently based on soft-dirty PTEs. While this works, it has a couple of shortcomings:

  • Soft dirty PTEs can only be reset globally, creating a synchronisation point and preventing us from performing dirty tracking on more than one application at a time on each host.
  • Resetting and reading soft dirty PTEs requires writing to /proc/self/clear_refs and reading from /proc/self/pagemap respectively, which can be a drain on performance when done repeatedly in a tight loop.

Ideally we would do this tracking with userfaultfd write-protected pages; however, this is only available in kernels 5.7+, i.e. Ubuntu 22+, which isn't yet stable. In the interim we can use an mprotect + SIGSEGV approach: we make pages read-only using mprotect, catch the resulting segfault when they're written to, and mark them as dirty.

Although in our current use-cases the soft PTEs are less performant than the segfault approach, this may not be the case for all workloads, so I'd like to keep the soft-PTE approach around.

PTEs, segfaults and userfaultfd all result in a different approach to aggregating diffs for a batch of threads. PTEs are system-wide, while segfaults are handled by individual threads (so we have to introduce thread-local tracking), and userfaultfd would use a single background tracker thread per application. To abstract the boilerplate for doing this and support switching more easily, I've introduced a single DirtyTracker class that abstracts all the details, and a configuration parameter DIRTY_TRACKING_MODE which can be set to softpte or segfault for now, and uffd in future.

This results in quite a few changes to the code:

  • Abstract all dirty tracking logic to an interface encapsulated in the DirtyTracker class.
  • Support both global and thread-local dirty tracking.
  • Move all logic related to handling thread snapshots, invoking threads and dirty tracking into the Executor class (previously this was scattered across Faabric and Faasm).
  • Introduce the concept of a "main thread snapshot", used by threaded applications and controlled by Faabric. Previously this snapshot had to be created and updated outside of Faabric, which was confusing and error-prone.
  • When repeatedly executing batches of threads from the same application, cache each scheduling decision and pass it as a hint to the next batch of the same size.
  • Stop using dirty page tracking to track changes made to snapshots. We can do this manually with higher precision as all changes to each snapshot now go through methods on the SnapshotData class.
  • Improve the performance of inner loops related to dirty tracking and diffing.

@Shillaker Shillaker self-assigned this Dec 29, 2021
@Shillaker Shillaker mentioned this pull request Dec 29, 2021
@Shillaker Shillaker changed the title Dirty tracking with userfaultfd experiment Dirty tracking performance improvements Dec 30, 2021
@Shillaker Shillaker marked this pull request as draft January 4, 2022 14:26
@Shillaker Shillaker marked this pull request as ready for review January 4, 2022 17:27
@Shillaker Shillaker marked this pull request as draft January 4, 2022 18:01
@Shillaker Shillaker marked this pull request as ready for review January 5, 2022 15:36
@csegarragonz (Collaborator) left a comment:

LGTM, very minor changes but will approve so that you don't have to re-request a review.

/*
* Returns a list of flags marking which bytes differ between the two arrays.
*/
std::vector<bool> diffArrays(std::span<const uint8_t> a,
Collaborator:

Isn't this a bit-wise XOR?

@Shillaker (Collaborator, author):

It's currently byte-wise, not bit-wise, but this may change. I noticed that this function isn't actually used, so I should probably just delete it.

src/scheduler/Executor.cpp: 6 resolved review threads
src/util/dirty.cpp: resolved review thread

uint32_t diffPageStart = 0;
bool diffInProgress = false;
for (int i = 0; i < nPages; i++) {
Collaborator:

Same, isn't this essentially the same as faabric::util::getDiffRegions()?

@eigenraven (Collaborator) left a comment:

Just had a quick look at this, because it sounds very similar to what I'm doing to reduce the size of incremental snapshots for offloading in my project. I'd suggest not creating byte-wise diffs, as that's pretty costly performance-wise; you can play around with variants in the quick-bench link I put in a comment. The code in faabric/util/delta uses a configurable "page" size (any difference in a page = emit a diff for the whole page); in my testing, a size of around 64 yielded the best performance without making the diffs larger than a few percent, but you might want to test this for your applications. The code could be relatively easily modified to take in an array of modified OS pages and skip known-unmodified ones. I'm happy to do this if you're interested, as it's on my to-do list anyway, and then you could use the generate/applyDelta functions.


std::vector<bool> diffs(a.size(), false);
for (int i = 0; i < a.size(); i++) {
diffs[i] = a.data()[i] != b.data()[i];
Collaborator:

This might be very slow due to the vector<bool> specialization, just swapping the type to uint8_t (at the cost of 8x more memory) makes it 11x faster: https://quick-bench.com/q/d8INA76hIDM3m4gfC4xJKI4HcQM

src/util/dirty.cpp: resolved review thread
@Shillaker (Collaborator, author):

I'm going to merge this and take discussions of improvements and tweaks offline.

@Shillaker Shillaker merged commit b3229aa into master Jan 7, 2022
@Shillaker Shillaker deleted the userfault branch January 7, 2022 12:02
3 participants