Database-centric architecture for communication, persistence, and autoscaling #3

Open
Zac-HD opened this issue Oct 26, 2022 · 2 comments

Comments

@Zac-HD
Owner

Zac-HD commented Oct 26, 2022

The Status Quo

This is going to be a substantial architecture overhaul, so let's start with how things currently work. A HypoFuzz run has three basic parts:

  1. The Hypothesis database, a key-value store of failing and covering examples we've seen in previous runs (or other workers in this run)
  2. The worker processes, which spend their time executing test cases for a variety of test functions, plus some 'meta' scheduling
  3. The dashboard process, which serves a webpage showing how things are going, based on information sent by the workers over HTTP.

In the current design, this is fundamentally a run-it-on-one-box kind of system: the tests are divided up between workers at startup time (or perhaps run on every worker concurrently; the workers are fine with this, though the dashboard isn't), and while the workers can reload previous examples, everything else behaves as if it were the first run ever, with some cost to efficiency and the clarity of statistics.

Goal: support a system where workers can come and go, for example to soak up idle CPU time as a low-priority autoscaling group on a cluster, and the fuzzing system overall keeps humming along.

Solution: lean on the database

If our problem is that information is neither persisted nor well distributed, let's solve that with the Hypothesis database! This is a very simple key-value store where keys are bytestrings and values are sets of bytestrings, with create/read/delete operations. The most common implementation is on the user's local filesystem, but there's also a Redis backend and it's trivial to write more.
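For concreteness, the whole interface is just those three operations. Here's a minimal sketch using the directory-based backend that ships with Hypothesis; the class and method names are the real hypothesis.database API, while the key and values are made up:

```python
# Minimal sketch of the key-value interface, using the directory-based
# backend that ships with Hypothesis; the key and values are illustrative.
from hypothesis.database import DirectoryBasedExampleDatabase

db = DirectoryBasedExampleDatabase(".hypothesis/examples")
key = b"some-test-digest"

db.save(key, b"covering-example-1")    # create
db.save(key, b"covering-example-2")
print(set(db.fetch(key)))              # read: a set of bytestrings
db.delete(key, b"covering-example-1")  # delete
```

A custom backend only has to implement those three methods, which is why it's trivial to write more of them.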

What problems does this solve, and create?

  • ✨ Workers could write metadata to the database (in some disjoint keyspace), meaning that the dashboard could show information regardless of whether a worker is currently running - it'd just be a view over the database (a generally good design principle!), and update by polling at whatever frequency we want.
    • ✨ We no longer need any HTTP traffic between components of the system; subtracting parts is underrated.
    • ✨ If we end up with two separate systems ("partition tolerance"), we can just merge the databases (via e.g. MultiplexedDatabase) and keep going.
    • 🚧 We have to handle stale data, including from runs that diverged or never even had a common prefix. This is basically fine; "keep the examples from everything and discard all the metadata" is totally valid, and anything fancier is a bonus. We'll probably try to construct 'best guess' metadata though, e.g. by keeping the longest history.
      • For recently-diverged workers, which is a common case when two are fuzzing the same target, we can just sum the effort spent fuzzing in the same most-recent state. More complicated schemes run up against the question "to what degree should we reset state estimation when we discover new behavior", which is to my knowledge an open research problem (see Estimating Residual Risk in Greybox Fuzzing).
    • 🚧 Worse, we have to handle data from different code: database keys are derived from a hash of the test function, so they stay the same even when the code under test changes. Suppose we restart the fuzzer on a new bug-containing commit: the coverage information we have saved is likely to be wrong, and we might even have deprioritized testing that area!
      • Again, "keep the examples and ditch the metadata" would be OK here, though we'd need to track the commit that we're fuzzing. I'll continue assuming the presence of git; other VCS systems can be supported as demand arises.
      • Upside: if we're using VCS metadata, we could prioritize fuzzing recently-changed code...
      • What about library versions though? Or operating systems? Or Python versions? To what degree should we distinguish these at the worker level, and/or in the dashboard?
    • 🤔 Tracking provenance information about how we found each covering or failing example (e.g. blackbox/greybox/whitebox; for the fuzzer, which mutations from which seed, how many test cases it took to discover, etc.) can be really helpful in visualizing and understanding how the process is going. There are lots of interesting experiments, and some literature, exploiting this.
  • 🚧 The dashboard process does need a local worker in order to replay failing examples etc.; in not-that-rare pathological cases, this can produce more data than we'd want to persist for every test. Replaying live in the dashboard-worker also ensures that every test failure is reproducible.
    • What if a test only fails on Windows, but the dashboard is on Linux? We do not want to delete that "fixed" failing example! Idea: give each test function an 'environment suffix', plus the ability to read from all other suffixes of the same test (sketched below). That way we can fail to replay without risking deleting the case before it's reproduced in the environment it fails in.
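To make the 'environment suffix' idea concrete, here's a hypothetical sketch; the suffix format and the .failures / .failure-environments key names are invented for illustration, and the point is just write-to-your-own-suffix, read-from-all-of-them:

```python
# Hypothetical sketch of environment-suffixed keys.  A worker only *writes*
# failing examples under its own environment's key, but *reads* from every
# known environment, so it can display (and never delete) a failure it
# can't reproduce locally.  All key names here are illustrative.
import platform
import sys

def environment_suffix() -> bytes:
    # e.g. b"linux-CPython-3.11.4"; exactly what belongs in here is open
    tag = f"{sys.platform}-{platform.python_implementation()}-{platform.python_version()}"
    return tag.encode()

def failure_key(test_digest: bytes) -> bytes:
    # where *this* worker saves failures it finds
    return test_digest + b".failures." + environment_suffix()

def all_failure_keys(db, test_digest: bytes):
    # assumes each worker also registers its suffix under a shared key
    for suffix in db.fetch(test_digest + b".failure-environments"):
        yield test_digest + b".failures." + suffix
```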

Action Items

The MVP is to ditch HTTP and communicate all state through the database.

  • Metadata is just what we need to get the dashboard working, see that code for details. It's saved per-test by each worker (sketched after this list).
  • Display whichever history is the longest; we really are going for MVP here. Handle the simple case: each test has a single worker.
  • Support starting a dashboard without associated workers, beyond the minimal local worker needed to replay and display failing examples etc.
  • ?? does this actually work at all without the fancier stuff ??
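A rough sketch of that MVP data flow, with the worker and dashboard sharing nothing but the database; the .hypofuzz key suffix matches the discussion further down, but the JSON payload and field names are purely illustrative:

```python
# Worker and dashboard communicate only via the database: the worker saves a
# JSON metadata snapshot per test, the dashboard polls for it.  Payload shape
# is illustrative only.
import json
from hypothesis.database import DirectoryBasedExampleDatabase

db = DirectoryBasedExampleDatabase(".hypothesis/examples")
key = b"<function-digest>" + b".hypofuzz"  # one metadata key per test function

def write_metadata(report: dict) -> None:
    # Worker side: replace any previous snapshot with the latest one.
    payload = json.dumps(report).encode()
    for stale in list(db.fetch(key)):
        if stale != payload:
            db.delete(key, stale)
    db.save(key, payload)

def poll_metadata() -> dict | None:
    # Dashboard side: read whatever is there; if several workers raced,
    # keep the longest history (here: most inputs executed).
    reports = [json.loads(v) for v in db.fetch(key)]
    return max(reports, key=lambda r: r.get("ninputs", 0), default=None)
```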

A better dashboard means we can get a little fancier about what we're displaying (mostly to keep these ideas out of the MVP):

  • Metadata includes:
    • metadata-version-number
    • git commit hash, maybe other environment metadata (package versions? OS? etc.)
    • I have a marvelous design for an append-only log from which we can usually recover a linearizable tree (see the sketch after this list). Entries include (worker UUID, hypothesis phase, start state, number of test cases, optional new state [, provenance etc. tbd]); states are hashes of interesting-origin or reason-to-keep-seed.
  • I'm pretty sure that if we emit to the log every time we switch tests, find something new, or notice someone else found something new, this is sufficient to recover a tree; and linearizing it is usually lossless.
  • We can probably synchronize a lot of worker state from this log, in addition to using it for the dashboard.
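Roughly the shape I have in mind for those log entries, with the tree-recovery step included; field names here are illustrative, not a spec:

```python
# Illustrative shape of the append-only log.  Each worker appends an entry
# whenever it switches test, finds something new, or notices that someone
# else did; a reader groups entries by start state to recover the tree.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LogEntry:
    worker_uuid: str
    phase: str                 # hypothesis phase, e.g. "reuse" / "generate" / "shrink"
    start_state: bytes         # hash of the reasons-to-keep-seeds known so far
    ninputs: int               # number of test cases executed in this span
    new_state: Optional[bytes] = None  # set iff this span discovered something new

def build_tree(entries: list[LogEntry]) -> dict[bytes, list[LogEntry]]:
    # Group edges by the state they started from; when every state has a
    # single successor, the "tree" is just a chain and linearizes losslessly.
    tree: dict[bytes, list[LogEntry]] = {}
    for e in entries:
        tree.setdefault(e.start_state, []).append(e)
    return tree
```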

The full version is going to be an ongoing project. Once we get here, I'll aim to close this and split out more specific issues.

@tybug
Contributor

tybug commented Nov 15, 2024

I'm taking a look at this. What are the details of the db interactions?

(1) Each worker stores metadata in the database specified by the test settings, with key function_digest(f) + b".hypofuzz". Different tests may use different dbs via @settings(database=...), so the dashboard multiplexes over all test dbs when polling.

(2) Each worker stores metadata in a completely separate hypofuzz db, for now hardcoded at DirectoryBasedExampleDatabase(".hypothesis/hypofuzz"), with key function_digest(f). The dashboard polls against just this db.

Or is it (3) a secret third thing?

Either way, the dashboard will have to be told all the function_digest(f) keys (probably by _fuzz_several?).

@Zac-HD
Owner Author

Zac-HD commented Nov 15, 2024

We want to be able to host the dashboard on a separate server from the fuzzing workers, so it'll need to be the database specified by the test settings. No multiplexing though; we can have a --profile argument to specify which settings profile to use if it would otherwise be ambiguous. This will work out of the box on a single server with the default directory-based DB, but really shines with Redis or similar.

As you say, we'll use function_digest(f) + b".hypofuzz", or maybe finer-grained keys to distinguish dashboard state from worker state like execution counts for each coverage fingerprint. There's a cute trick though; we can save each function-digest under a well-known key b"hypofuzz-test-digests", and the associated values are our keys for each test 🙂
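Roughly, that trick looks like the following; the b"hypofuzz-test-digests" key is as above, and everything else is illustrative:

```python
# The set of per-test keys is itself stored in the database, so a dashboard
# with no attached workers can still enumerate every test.  Only the
# well-known key name below is from the discussion above; the rest is a sketch.
from hypothesis.database import DirectoryBasedExampleDatabase

DIGESTS_KEY = b"hypofuzz-test-digests"
db = DirectoryBasedExampleDatabase(".hypothesis/examples")

def register_test(digest: bytes) -> None:
    # Worker: record this test's digest once, then write metadata under it.
    db.save(DIGESTS_KEY, digest)
    db.save(digest + b".hypofuzz", b"{}")  # per-test metadata lives here

def known_metadata_keys() -> list[bytes]:
    # Dashboard: discover every test without being told about any of them.
    return [digest + b".hypofuzz" for digest in db.fetch(DIGESTS_KEY)]
```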
