feat: Direct lifecycle control of instances #51

Merged
joshlreese merged 1 commit into main from feature/instance-lifecycle-control on Jun 18, 2025

@joshlreese joshlreese commented May 24, 2025

Instance control

Introduced an instance control library, with a stateful implementation inspired by Kubernetes StatefulSets. The implementation does not consider adoption of instances, which are always owned by a WorkloadDeployment. The library returns a set of actions for the caller to consider and execute. Some returned actions may never be executed, but they can still help with troubleshooting or give visibility into the progress of a rollout. This design also opens the door to bursting or parallel application of instance changes.

Instance control is managed in the context of a WorkloadDeployment: the chosen instance control strategy builds up the actions, and the reconciler executes them.
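To make the "strategy returns actions, reconciler executes them" split concrete, here is a minimal sketch. The `Action` type, `planOrdered` function, `<deployment>-<ordinal>` naming, and the one-action-at-a-time policy are all hypothetical and chosen for illustration; they are not the library's actual API.

```go
package main

import "fmt"

// ActionType is a hypothetical enum for the kinds of actions a strategy returns.
type ActionType string

const (
	ActionCreate ActionType = "Create"
	ActionDelete ActionType = "Delete"
)

// Action describes one proposed change. Skipped actions are returned for
// visibility only and are not meant to be executed in this pass.
type Action struct {
	Type     ActionType
	Instance string
	Skipped  bool
}

// planOrdered is a simplified, StatefulSet-like strategy (an assumption, not
// the real implementation): missing ordinals are created lowest first, surplus
// ordinals are deleted highest first, and only the first action is executable.
func planOrdered(deployment string, desired int, existing map[int]bool) []Action {
	var actions []Action

	// Scale up: create missing ordinals below the desired count.
	for ord := 0; ord < desired; ord++ {
		if !existing[ord] {
			name := fmt.Sprintf("%s-%d", deployment, ord)
			actions = append(actions, Action{Type: ActionCreate, Instance: name})
		}
	}

	// Scale down: delete ordinals at or above the desired count, highest first.
	maxOrdinal := -1
	for ord := range existing {
		if ord > maxOrdinal {
			maxOrdinal = ord
		}
	}
	for ord := maxOrdinal; ord >= desired; ord-- {
		if existing[ord] {
			name := fmt.Sprintf("%s-%d", deployment, ord)
			actions = append(actions, Action{Type: ActionDelete, Instance: name})
		}
	}

	// Everything after the first action is reported but marked as skipped.
	for i := 1; i < len(actions); i++ {
		actions[i].Skipped = true
	}
	return actions
}

func main() {
	// The caller (the reconciler, in the PR's terms) decides what to execute.
	for _, a := range planOrdered("web", 3, map[int]bool{0: true}) {
		fmt.Printf("%s %s skipped=%t\n", a.Type, a.Instance, a.Skipped)
	}
}
```

Returning skipped actions rather than dropping them is what would let a caller surface rollout progress, and relaxing the final loop is where bursting or parallel application could plug in.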

Once an instance has no scheduling gates, the provider responsible for the location it is attached to provisions the instance and updates the Programmed and Running conditions with success or failure information. Once an Instance is both Programmed and Running, it is considered Ready.
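The readiness rule above can be sketched as a small predicate over conditions. `Condition` here is a pared-down stand-in for a Kubernetes-style condition, and `isReady` and `countReady` are hypothetical helper names; treating a ready-replica count as the number of Ready instances is an assumption, not something this description spells out.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes-style status condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// conditionTrue reports whether the named condition is present with status "True".
func conditionTrue(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// isReady applies the rule from the description: Programmed and Running
// must both be true for an instance to count as Ready.
func isReady(conds []Condition) bool {
	return conditionTrue(conds, "Programmed") && conditionTrue(conds, "Running")
}

// countReady tallies Ready instances (assumed basis for a ready-replica count).
func countReady(instances [][]Condition) int {
	ready := 0
	for _, conds := range instances {
		if isReady(conds) {
			ready++
		}
	}
	return ready
}

func main() {
	instances := [][]Condition{
		{{Type: "Programmed", Status: "True"}, {Type: "Running", Status: "True"}},
		{{Type: "Programmed", Status: "True"}, {Type: "Running", Status: "False"}},
	}
	fmt.Println("ready instances:", countReady(instances))
}
```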

Scheduling gates

An instance can have one or more scheduling gates added to its spec, which indicate to other systems that the instance may not yet be in a state to be processed. The initial scheduling gate, Network, keeps instances from being processed until the network they are attached to is ready. Scheduling gates for quota claims, volumes, or other dependencies should be expected in future revisions.
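As a rough illustration of how gates hold an instance back, the sketch below models a gate list loosely on the shape of Kubernetes Pod scheduling gates (a list of named entries). The `Instance` struct, its field names, and the `withoutGate` and `isSchedulable` helpers are hypothetical; the actual Instance spec fields are not shown in this description.

```go
package main

import "fmt"

// SchedulingGate is a named gate, mirroring the general shape of Pod
// scheduling gates. The Instance type's real fields are assumed here.
type SchedulingGate struct {
	Name string
}

// Instance is a hypothetical, trimmed-down instance carrying only its gates.
type Instance struct {
	Name            string
	SchedulingGates []SchedulingGate
}

// NetworkGate is the gate name mentioned in the description.
const NetworkGate = "Network"

// withoutGate returns a copy of gates with the named gate removed.
func withoutGate(gates []SchedulingGate, name string) []SchedulingGate {
	kept := make([]SchedulingGate, 0, len(gates))
	for _, g := range gates {
		if g.Name != name {
			kept = append(kept, g)
		}
	}
	return kept
}

// isSchedulable reports whether nothing is holding the instance back.
func isSchedulable(inst Instance) bool {
	return len(inst.SchedulingGates) == 0
}

func main() {
	inst := Instance{
		Name:            "web-0",
		SchedulingGates: []SchedulingGate{{Name: NetworkGate}},
	}
	fmt.Println("schedulable before network is ready:", isSchedulable(inst))

	// Once the network is ready, the controller would drop the Network gate.
	inst.SchedulingGates = withoutGate(inst.SchedulingGates, NetworkGate)
	fmt.Println("schedulable after gate removal:", isSchedulable(inst))
}
```

Because each dependency owns its own gate, future gates such as quota or volumes could be cleared independently, with the instance becoming eligible only once the list is empty.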

CRD improvements

  • Refined default conditions and printer columns for Workloads, WorkloadDeployments, and Instances.
  • Added many more condition reasons to communicate the state of resources.
  • Added ReadyReplicas to Workloads and WorkloadDeployments.

Network readiness

The operator now ensures that the networks instances attach to are ready for use, and that at least one subnet has been issued to any network context with an attached instance. Subnet issuance occurs by creating a SubnetClaim and waiting for controllers in network-services-operator to issue it a prefix.
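The readiness flow described above can be sketched as a small decision function. The `SubnetClaim` struct here is a hypothetical simplification (an empty `Prefix` stands for "not yet issued"), the step names are invented for illustration, and the order in which the checks run is an assumption rather than the operator's documented behavior.

```go
package main

import "fmt"

// SubnetClaim is a simplified stand-in: Prefix stays empty until a
// controller issues one.
type SubnetClaim struct {
	Name   string
	Prefix string
}

// Step names the next thing a reconciler would do or wait for (hypothetical).
type Step string

const (
	StepCreateSubnetClaim Step = "CreateSubnetClaim"
	StepWaitForPrefix     Step = "WaitForPrefix"
	StepWaitForNetwork    Step = "WaitForNetwork"
	StepReady             Step = "Ready"
)

// nextStep decides what is still missing before instances attached to a
// network context can proceed. The ordering of checks is an assumption.
func nextStep(networkReady bool, claims []SubnetClaim) Step {
	if len(claims) == 0 {
		return StepCreateSubnetClaim
	}
	issued := false
	for _, c := range claims {
		if c.Prefix != "" {
			issued = true
			break
		}
	}
	if !issued {
		return StepWaitForPrefix
	}
	if !networkReady {
		return StepWaitForNetwork
	}
	return StepReady
}

func main() {
	fmt.Println(nextStep(true, nil))
	fmt.Println(nextStep(true, []SubnetClaim{{Name: "claim-0"}}))
	fmt.Println(nextStep(true, []SubnetClaim{{Name: "claim-0", Prefix: "10.0.0.0/24"}}))
}
```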

General improvements

  • Improved garbage collection logic.
  • Propagated errors from related resources, such as network creation failures, to the instance.

Enhancement: datum-cloud/enhancements#28

Related work:

@joshlreese joshlreese force-pushed the feature/instance-lifecycle-control branch from 83dde68 to 69c65e1 Compare June 10, 2025 20:42
@joshlreese joshlreese changed the title Feature/instance lifecycle control feat: Direct lifecycle control of instances Jun 10, 2025
@joshlreese joshlreese requested a review from scotwells June 10, 2025 20:44
@joshlreese joshlreese marked this pull request as ready for review June 10, 2025 20:45

@scotwells scotwells left a comment

LGTM

@joshlreese joshlreese merged commit 680632c into main Jun 18, 2025
8 checks passed
@joshlreese joshlreese deleted the feature/instance-lifecycle-control branch June 18, 2025 19:14