
Drop support for per-task HPKE configs #2147

Open
divergentdave opened this issue Oct 18, 2023 · 7 comments

@divergentdave
Contributor

We are considering dropping support for per-task HPKE configs, to simplify the codebase. Note that this will have an impact on a number of unit tests. Global HPKE configs can currently only be set up via the aggregator API, so we may need to make the aggregator API required instead of optional, provide another means to set up a global keypair, or do so automatically on first boot.

@inahga
Contributor

inahga commented Oct 18, 2023

Dupe of #1641; I'll let this issue supersede that one.

@inahga
Contributor

inahga commented May 8, 2024

We have decided to move forward with obsoleting per-task HPKE keys and fully adopting global HPKE keys.

@inahga removed the "question (Further information is requested)" label May 8, 2024
@inahga
Contributor

inahga commented May 9, 2024

Some notes: we already have support for global keys for the taskprov use case, so the functionality for using these keys should already be present. Janus also refreshes its internal cache of these keys every 30 minutes.
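
For reference, that refresh behavior amounts to a periodic reload loop along these lines (a simplified sketch with hypothetical `Datastore` and keypair types; the real cache in Janus is structured differently):

```rust
use std::{sync::Arc, time::Duration};
use tokio::{sync::RwLock, time::interval};

// Hypothetical stand-ins for Janus' datastore and keypair types.
struct Datastore;
struct GlobalHpkeKeypair;

impl Datastore {
    async fn get_global_hpke_keypairs(&self) -> Vec<GlobalHpkeKeypair> {
        unimplemented!("query the global_hpke_keys table")
    }
}

/// Reload the cached global HPKE keypairs from the datastore on a fixed
/// interval (e.g. every 30 minutes).
async fn refresh_task(ds: Arc<Datastore>, cache: Arc<RwLock<Vec<GlobalHpkeKeypair>>>) {
    let mut ticker = interval(Duration::from_secs(30 * 60));
    loop {
        ticker.tick().await;
        *cache.write().await = ds.get_global_hpke_keypairs().await;
    }
}
```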

We'll need some tweaks to remove the `if taskprov_enabled` checks, and to change the behavior of `/hpke_configs` to always present global keys. We will need to retain the ability for Janus to trial-decrypt using a task-specific key until the next breaking changes window. Task provisioning will need to be modified to stop creating task-specific keys, and several tests will need to be changed to stop using them.

The main missing functionality is that keys currently need to be manually provisioned. This is a non-starter for normal deployments of Janus, of which we run several. We also need to consider that BYOH deployments probably don't want to have to worry about key management.

We need a subprocess key-rotator (to borrow terminology from ENPA) that ensures an active HPKE key is present and fully handles the lifecycle of keys. This can run independently for Janus leader use cases, and in the same binary for BYOH use cases. There is a first-start edge case that requires consideration, where an aggregator process starts running before the key-rotator does. A few options here; perhaps the aggregator's cache is always invalid until it has at least one active key.

@inahga
Contributor

inahga commented May 9, 2024

> There is a first-start edge case that requires consideration, where an aggregator process starts running before the key-rotator does. A few options here; perhaps the aggregator's cache is always invalid until it has at least one active key.

This can be accomplished by having `GlobalHpkeKeypairCache::new()` retry fetching keypairs with backoff until it finds at least one active keypair. This will block the process from fully starting until the key rotator service inserts an active keypair.
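
A minimal sketch of that bootstrapping wait, assuming a hypothetical `fetch_active_keypairs` datastore method and a simple fixed delay rather than whatever backoff policy we actually land on:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical stand-ins; Janus' real datastore API differs.
struct Datastore;
struct GlobalHpkeKeypair;

impl Datastore {
    async fn fetch_active_keypairs(&self) -> Vec<GlobalHpkeKeypair> {
        unimplemented!("query for keypairs in the ACTIVE state")
    }
}

/// Block until at least one active global HPKE keypair exists. The process
/// can't finish starting until the key rotator inserts an active keypair.
async fn wait_for_active_keypair(ds: &Datastore) -> Vec<GlobalHpkeKeypair> {
    loop {
        let keypairs = ds.fetch_active_keypairs().await;
        if !keypairs.is_empty() {
            return keypairs;
        }
        sleep(Duration::from_secs(1)).await;
    }
}
```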

For BYOH use cases where we're supplying an all-in-one binary, the key rotator task should be started before the HPKE cache task. The key rotator should run a sweep immediately on creation.

I didn't really like the "cache is always invalid unless at least one key is active" strategy, because it makes the cache logic more complicated.

@inahga
Contributor

inahga commented May 9, 2024

I've sketched out this logic for the key rotator (it may have errors; let me know if anything stands out as wrong).

Configuration:

```rust
struct Options {
    /// How often to run the key rotator, in seconds.
    pub frequency_s: u64,
    pub hpke: HpkeOptions,
}

struct HpkeOptions {
    // (These fields could be named better.)
    /// How long to wait before promoting a keypair from PENDING to ACTIVE. This should correlate
    /// with how often Janus updates its global HPKE cache, i.e. 30 minutes.
    pub pending_to_active_s: u64,
    /// How long until an active key should be considered expired, i.e. the maximum age of an
    /// HPKE key.
    pub active_to_expired_s: u64,
    /// How long an expired key should be kept around. This should correlate with how long
    /// clients are expected to cache an HPKE key.
    pub expired_to_deleted_s: u64,

    /// A list of ciphersuites to ensure have an active HPKE keypair. Can be multiple, if we want
    /// to provide support for multiple ciphersuites; otherwise it can just be whatever we want.
    pub ciphers: Vec<HpkeCiphersuite>,
}

struct HpkeCiphersuite(KEM, KDF, AEAD);
```
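
As a concrete illustration, a deployment might fill this in along these lines (the values and the `KEM`/`KDF`/`AEAD` enum variants are made up for the example):

```rust
let options = Options {
    // Run the key rotator hourly.
    frequency_s: 60 * 60,
    hpke: HpkeOptions {
        // Longer than the 30-minute cache refresh, so every replica sees a
        // pending key before it becomes active.
        pending_to_active_s: 60 * 60,
        // Rotate active keys every four weeks.
        active_to_expired_s: 4 * 7 * 24 * 60 * 60,
        // Keep expired keys for a week, for clients holding stale configs.
        expired_to_deleted_s: 7 * 24 * 60 * 60,
        // A single ciphersuite is enough unless we want to offer choices.
        ciphers: vec![HpkeCiphersuite(
            KEM::X25519HkdfSha256,
            KDF::HkdfSha256,
            AEAD::Aes128Gcm,
        )],
    },
};
```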

Key rotator logic:

```text
In database transaction:
    Acquire ExclusiveLock on global_hpke_keys (n.b. this does not block readers)

    Get current keys
    For each HPKE ciphersuite:
        If there is no active key fitting the ciphersuite (bootstrapping logic):
            Insert an active key
        Else:
            If the active key is too old:
                If there is a pending key fitting the ciphersuite:
                    If the pending key has been pending long enough:
                        Make the pending key active
                        Expire the active key
                    Else do nothing (we'll have to wait longer before taking action)
                Else insert a pending key
            Else do nothing (the active key is fine)

    For each expired key:
        If the expired key is old enough, delete it
```
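
Translated to Rust, the per-ciphersuite branch might look roughly like this. This is a sketch over in-memory stand-ins using the `HpkeOptions` from the configuration sketch above; the real implementation would operate on datastore rows, and `age` here means time since the key's last state change (cf. `updated_at`):

```rust
use std::time::Duration;

#[derive(PartialEq)]
enum State {
    Pending,
    Active,
    Expired,
}

// Hypothetical in-memory stand-in for a global_hpke_keys row.
struct Key {
    state: State,
    // Time since the key's last state change (cf. updated_at).
    age: Duration,
}

fn rotate_ciphersuite(keys: &mut Vec<Key>, opts: &HpkeOptions) {
    match keys.iter().position(|k| k.state == State::Active) {
        // Bootstrapping: no active key fits this ciphersuite yet.
        None => keys.push(Key { state: State::Active, age: Duration::ZERO }),
        Some(active) if keys[active].age > Duration::from_secs(opts.active_to_expired_s) => {
            match keys.iter().position(|k| k.state == State::Pending) {
                Some(pending)
                    if keys[pending].age > Duration::from_secs(opts.pending_to_active_s) =>
                {
                    // Promote the pending key and expire the old active key.
                    keys[pending].state = State::Active;
                    keys[pending].age = Duration::ZERO;
                    keys[active].state = State::Expired;
                    keys[active].age = Duration::ZERO;
                }
                // A pending key exists but hasn't aged enough; wait longer.
                Some(_) => {}
                // Stage a replacement; it becomes active on a later run.
                None => keys.push(Key { state: State::Pending, age: Duration::ZERO }),
            }
        }
        // The active key is still fresh; nothing to do.
        Some(_) => {}
    }
    // Expired keys that are old enough are deleted.
    keys.retain(|k| {
        k.state != State::Expired || k.age <= Duration::from_secs(opts.expired_to_deleted_s)
    });
}
```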

The HPKE config ID is chosen at random, non-conflicting with any IDs currently in the table. It's possible to paint ourselves into a corner, e.g. with too many ciphersuites or too long an expiration period relative to the HPKE keypair age, so it's important to choose parameters that avoid running out of `u8` space.
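
Picking a fresh config ID is simple rejection sampling over the `u8` space, e.g. (a sketch using the `rand` crate):

```rust
use std::collections::HashSet;
use rand::Rng;

/// Choose a random HPKE config ID that doesn't conflict with any ID already
/// in the table. Returns None once all 256 IDs are in use.
fn choose_config_id(existing: &HashSet<u8>) -> Option<u8> {
    if existing.len() > u8::MAX as usize {
        return None;
    }
    let mut rng = rand::thread_rng();
    loop {
        let candidate: u8 = rng.gen();
        if !existing.contains(&candidate) {
            return Some(candidate);
        }
    }
}
```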

The age of a key and the times of its state changes are tracked by the extant `updated_at` and `created_at` columns on `global_hpke_keys`.

Note we take an `ExclusiveLock` on the table such that only one key rotator replica can read and write to the table at a time. This shouldn't block any cache updates from aggregator replicas. We need to do this because we have no uniqueness constraint on the HPKE ciphersuite; multiple concurrent key rotators would otherwise be able to insert multiple pending keys, for instance.
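
Assuming a tokio-postgres transaction handle, taking that lock at the top of the transaction is a single statement (a sketch, not the actual datastore code):

```rust
use tokio_postgres::{Error, Transaction};

/// EXCLUSIVE mode conflicts with other EXCLUSIVE holders (i.e. concurrent
/// key rotator replicas) but still permits plain SELECTs, so aggregator
/// cache refreshes are not blocked.
async fn lock_global_hpke_keys(tx: &Transaction<'_>) -> Result<(), Error> {
    tx.execute("LOCK TABLE global_hpke_keys IN EXCLUSIVE MODE", &[])
        .await?;
    Ok(())
}
```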

@divergentdave
Contributor Author

divergentdave commented May 9, 2024

~~We should either list the global keys again just before finishing the database transaction and check that only our new key was added, or run the database transaction at the SERIALIZABLE level, to prevent two processes from creating keys at the same time.~~ Never mind, locking the table should prevent that.

inahga mentioned this issue May 15, 2024
inahga added a commit that referenced this issue Jun 20, 2024
Supports #2147.

Introduces the `key_rotator`, which is a utility for managing the lifecycle of global HPKE keys. It will bootstrap keys, and rotate them according to the rotation policy in the config.

Keys are unique by their HPKE ciphersuite. For each ciphersuite, a key is run through the `pending -> active -> expired -> deleted` lifecycle. It is permissive of some manual operator changes, e.g. if a manual key rotation needs to be executed through `janus_cli` or the aggregator API.

In this iteration, it runs as a standalone binary for use in a cronjob. It is suitable for deployment in our environments, including taskprov ones. A future PR will add support for BYOH deployments by letting the `aggregator` process run the key rotator.
@inahga
Contributor

inahga commented Jul 15, 2024

Janus changes are complete.

The remaining work on this issue is a breaking change to remove per-task HPKE keys, in the next breaking changes window.

I've unassigned this for now to keep my queue clear; we can pick it up again whenever we're ready to take some breaking changes.
