feat: AP 17 Fleet shard status #82

shawkins · 2022-10-17T14:34:01Z

Resolves #81: this covers the pattern for fleet shard status that was discussed in an architecture call and several follow-up working group meetings.

The only major pending change is that @lburgazzoli would like to expand on the current vs desired state section to include a recommendation on using a generation field to know more precisely when the condition was observed. That can be added after an initial merge if needed.

cc @tombentley

lburgazzoli · 2022-10-17T14:47:07Z

@shawkins I will get back to this this week, hopefully.

tombentley · 2023-01-30T15:34:32Z

@shawkins sorry to have dropped the ball so badly on this. Does it still need reviewing?

shawkins · 2023-01-30T22:52:43Z

> @shawkins sorry to have dropped the ball so badly on this. Does it still need reviewing?

Yes, it still needs to be reviewed. There are some pending changes from @lburgazzoli that I'll make part of this PR as well.

shawkins · 2023-02-01T14:34:13Z

@lburgazzoli I incorporated the changes from the Google doc, but omitted the part about polling based upon generation. That seems out of scope for this doc, and covering it would probably require mentioning a watching paradigm in addition to polling.

@tombentley should be ready for review now.

tombentley

I left a few suggestions, but overall this LGTM. Thanks!

tombentley · 2023-02-17T09:28:20Z

_ap/17/index.adoc

+For example the ManagedKafka status Ready condition refers to the operator’s view of the resource.  A ManagedKafka is only ready when all of the dependent resources are observed to be in their desired / ready state.  This will not always be the same notion of readiness from an end user’s observation of managed kafka service.  In particular, the restart of a broker pods, a temporary networking issue, etc. may not be reflected in ManagedKafka status.  It follows that a ManagedKafka status Ready condition alone is insufficient to show the user the state of their service.  For ManagedKafka canary and other kafka metrics provide a more exact representation of cluster functioning.
+Error States
+
+An operator in general may not consider any problem with a valid resource as terminal - one from which the resource can never recover.  The operator’s job is to implement the desired state whenever possible.  If at a given time additional resources are needed, other necessary system parts aren’t installed, etc.  - it doesn’t mean the conditions won’t be correct at a later time.  From the perspective of a managed service though we do have more rigid expectations.  For example a ManagedKafka should only be created with a Strimzi version that has an operator installed to manage.  For a fleet shard operator if such conditions are violated it is acceptable for the operator and the status handling to assume the resource is in a terminal state.


I understand what this is trying to say, but I think the way this is worded is a little vague. Do you think the following is better?

An operator in a fleetshard usually lacks sufficient context to determine whether a valid resource is in a terminal state. The operator’s job is simply to implement the desired state whenever possible. For example, if at a given time additional resources are needed, other necessary system parts aren’t installed, etc. - it doesn’t mean the conditions won’t be correct at a later time as a result of some action outside the context of the operator.

In cases where the operator does have sufficient context to determine terminality, for example violation of preconditions in installed versions of APIs, then it is acceptable for the operator and the status handling to assume the resource is in a terminal state.

Incorporated the above, please see if that matches what you are thinking.

tombentley · 2023-02-17T09:30:45Z

_ap/17/index.adoc

+
+### Current vs. Desired State
+
+Especially in instances where spec changes are rolled into dependent resources the desired state will not be fully realized until some point in the future.  It is appropriate for the status to provide additional information about this transition.  As per the Kubernetes guidelines that could be represented via a condition, or similar to the standard Deployment resource additional status fields such as ready or available replicas, allow for the control plane or a user to infer the progress of the transition.


Can we give a concrete example here?

Elaborated more based upon a kafka version upgrade.

Show concrete YAML example to anchor the reader

Added the yaml that matches the description.
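
As a minimal sketch of the point in the hunk above, a control plane or user could infer transition progress from Deployment-style status counters. The field names `updatedReplicas` and `readyReplicas` follow the standard Deployment status; the `rollout_progress` helper and the sample values are hypothetical and are not taken from the AP text.

```python
def rollout_progress(spec_replicas, status):
    """Summarize rollout progress from Deployment-style status counters.

    Counters that are absent from the status are treated as 0.
    """
    updated = status.get("updatedReplicas", 0)
    ready = status.get("readyReplicas", 0)
    # The transition is complete only when every desired replica
    # is both on the new revision and ready.
    complete = updated >= spec_replicas and ready >= spec_replicas
    return {
        "desired": spec_replicas,
        "updated": updated,
        "ready": ready,
        "complete": complete,
    }


# Hypothetical sample: 3 replicas desired, 2 already updated and ready.
in_progress = rollout_progress(3, {"updatedReplicas": 2, "readyReplicas": 2})
finished = rollout_progress(3, {"updatedReplicas": 3, "readyReplicas": 3})
```

Here `in_progress["complete"]` is false while the rollout is still under way, so a consumer can report "upgrading" rather than treating the resource as failed.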

_ap/17/index.adoc

Co-authored-by: Tom Bentley <tombentley@users.noreply.github.com>

lburgazzoli

LGTM (I'm among the authors so I should not be among the reviewers)

shawkins · 2023-03-02T15:33:22Z

@emmanuelbernard @danielezonca please review or delegate so that this can move forward

danielezonca

LGTM, just a minor comment

_ap/17/index.adoc

tombentley · 2023-03-08T17:20:47Z

@emmanuelbernard still waiting for your review on this one.

emmanuelbernard

This is generally useful, thank you for the write-up.
I felt the description assumed a good knowledge of fleet shards, and connecting the dots required some leaps of imagination on my part.
That's why I've been heavily proposing to add lots of examples.

Generally I think all recommendations should try to succinctly explain:

- why it's useful (value)
- what for (context)
- how it is done (with examples)

The current structure of the document reflects the CR areas and I wonder whether a reorganisation around concrete problems could highlight its usefulness.

emmanuelbernard · 2023-03-10T14:46:47Z

_ap/17/index.adoc

+
+Define a pattern for fleet shard status representation and interpretation.
+
+## Motivation


A small drawing of the fleet manager, fleet shard operator, and operand could help set the context.

After starting on a drawing it made me wonder if that was making this seem too complicated. So I just updated with some language about custom resources in general - I really am trying to convey this is the kubernetes custom resource paradigm with an intermediary, so it has some additional considerations on top of the base recommendations.

emmanuelbernard · 2023-03-10T14:48:57Z

_ap/17/index.adoc

+
+More specific to fleet shards:
+Since the fleet shard will typically send a full status object back to the control plane periodically or with every change, it is best to include only a minimally necessary set of conditions.  However that does not mean you can simply omit conditions.  If a condition is not present, it should be interpreted as having status Unknown.
+It is fine to have a control plane reason over status conditions - in particular the type, reason, and status fields.   However the parsing of a condition message should be avoided, and rather additional status fields should be used to provide specialized information.


Show examples in the form of yaml with a context of the status if necessary.

Are there specific examples you had in mind?

Things like showing non-omitted conditions even though they are not used for that specific effective status report.
And this might not belong here, but you say that the shard in most cases cannot know whether a case is transient or terminal, which makes me wonder how the control plane makes that decision from the status it receives from the shard. Here, some examples of how you achieved it, or how a status is interpreted for feedback to the end user, would be interesting for context on how the global machinery is orchestrated.

> Things like showing non-omitted conditions even though they are not used for that specific effective status report.

The guidance about minimizing is fleetshard specific. The interpretation of missing as unknown comes straight from the Kubernetes recommendations. From existing non-fleetshard operators, we have counterexamples like:

Strimzi uses both Ready and NotReady conditions. The NotReady condition is only populated when there's a problem. By their contract you have to know that a missing NotReady means to look instead for the Ready condition, not that NotReady is currently unknown.

But I'm honestly not sure how instructive that is.

> which makes me wonder how the control plane makes that decision from the status it receives from the shard

That's the gist of the recommendations around error handling: neither the control plane nor a user can assume that immediate action is needed just because something is currently in error, regardless of whether we're installing.

The fleetshard operator interprets the strimzi status as follows:
https://github.com/bf2fc6cc711aee1a0c2a/kas-fleetshard/blob/main/operator/src/main/java/org/bf2/operator/operands/AbstractKafkaCluster.java#L169 That is then aggregated with the other operands. The fleet manager is further able to aggregate reasoning over the cluster itself and from things like route53.

We don't expect the fleet manager to make specific decisions based upon errors that appear in our status - we are instead calling out very specific service level cases - the wrong profile or version of something is in the cr, the data plane lacks capacity for the given instance - that the control plane may use to do things like select a different strimzi version or use a different cluster. Otherwise about the best that they can infer is whether we think we're still Installing.
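
A minimal sketch of the interpretation rules discussed in this thread: an absent condition reads as Unknown, and a consumer reasons over `type`, `status`, and `reason` without parsing `message`. The helper names and the sample status are hypothetical; only the condition field names come from the Kubernetes conventions referenced above.

```python
def condition_status(conditions, condition_type):
    """Return the status of the named condition, or "Unknown" when it is absent."""
    for condition in conditions:
        if condition.get("type") == condition_type:
            return condition.get("status", "Unknown")
    return "Unknown"


def condition_reason(conditions, condition_type):
    """Return the machine-readable reason of the named condition, or None."""
    for condition in conditions:
        if condition.get("type") == condition_type:
            return condition.get("reason")
    return None


# Hypothetical status as a control plane might receive it from a fleet shard.
sample_conditions = [
    {
        "type": "Ready",
        "status": "False",
        "reason": "Installing",
        "message": "free-form text for humans; never parsed here",
    },
]

ready = condition_status(sample_conditions, "Ready")
# "Degraded" is a hypothetical condition type that this status does not report.
degraded = condition_status(sample_conditions, "Degraded")
```

With this sample, `ready` is `"False"` and `degraded` is `"Unknown"`, so the missing condition is never silently read as healthy or as failed.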

emmanuelbernard · 2023-03-10T14:52:26Z

_ap/17/index.adoc

+
+An operator in a fleetshard _usually_ lacks sufficient context to determine whether a valid resource is in a terminal state. The operator’s job is simply to implement the desired state whenever possible. For example, if at a given time additional resources are needed, other necessary system parts aren’t installed, etc. - it doesn’t mean the conditions won’t be correct at a later time as a result of some action outside the context of the operator.
+
+In cases where the operator does have sufficient context to determine terminality, for example violation of preconditions in installed versions of APIs, then it is acceptable for the operator and the status handling to assume the resource is in a terminal state.


why acceptable? Why not recommended? What is the intent of that guideline?

Show an example of a terminal state in YAML, and maybe the context.

Made some further refinements to the text and added an example.

emmanuelbernard · 2023-03-10T14:55:48Z

_ap/17/index.adoc

+
+In cases where the operator does have sufficient context to determine terminality, for example violation of preconditions in installed versions of APIs, then it is acceptable for the operator and the status handling to assume the resource is in a terminal state.
+
+When status is conveying an error that is not known to be terminal, it follows that the error may resolve on its own with additional time.  Control planes should assume such errors are ephemeral and not immediately react as if a hard failure has occurred.  If there are further actions that the control plane may take, the condition reason or other status fields should make that easy to determine.  This allows the data plane logic to remain straightforward and not have to fully understand every possible error condition.


Maybe a state diagram with some context could help set up the context of the conversation here.

- transient vs terminal
- data plane reconciliation vs fleet manager expected state display

I'm not sure it will work, but it could help, as I feel I have to think about the problem to figure out what is recommended and for which use case.

There really isn't a state diagram, only very special exceptions to what qualifies as a terminal error.

Maybe restating more succinctly might help:

- Many, if not most, errors will be ephemeral.
- It's impractical, or impossible, for the data plane to know a priori how to categorize whether an error it's seeing is terminal or not.
- Requiring a data plane fix every time a new ephemeral error is encountered is cumbersome; instead, only errors that lead to SLO breach need high priority remediation.
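
The three points above can be sketched as a control plane rule that treats an error as terminal only when the data plane says so through an explicit reason, and assumes everything else is ephemeral. The reason names and the `classify_condition` helper are hypothetical illustrations, loosely modeled on the wrong-version and lack-of-capacity cases mentioned earlier in this review.

```python
# Hypothetical reasons that the data plane explicitly reports as terminal.
TERMINAL_REASONS = {"UnsupportedVersion", "InsufficientCapacity"}


def classify_condition(condition):
    """Classify a Ready-style condition as "ok", "terminal", or "ephemeral"."""
    if condition.get("status") == "True":
        return "ok"
    if condition.get("reason") in TERMINAL_REASONS:
        return "terminal"
    # Anything else, including reasons never seen before, is assumed to be
    # ephemeral so that new error kinds do not require a data plane change.
    return "ephemeral"


wrong_version = classify_condition(
    {"type": "Ready", "status": "False", "reason": "UnsupportedVersion"}
)
unknown_error = classify_condition(
    {"type": "Ready", "status": "False", "reason": "SomethingNew"}
)
```

The default branch is the design choice that matters: an unrecognized reason falls through to "ephemeral", so the control plane waits rather than reacting as if a hard failure had occurred.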

_ap/17/index.adoc

emmanuelbernard · 2023-03-10T14:58:35Z

_ap/17/index.adoc

+
+### Current vs. Desired State
+
+Especially in instances where spec changes are rolled into dependent resources the desired state will not be fully realized until some point in the future.  It is appropriate for the status to provide additional information about this transition.  As per the Kubernetes guidelines that could be represented via a condition, or similar to the standard Deployment resource additional status fields such as ready or available replicas, allow for the control plane or a user to infer the progress of the transition.


Show concrete YAML example to anchor the reader

emmanuelbernard · 2023-03-10T14:59:22Z

_ap/17/index.adoc

+
+### generation and observedGeneration
+
+In kubernetes, each object should have a `metadata.generation` field which is defined as a sequence number - set by the system and monotonically increased per resource on a spec change. Some resources include a field named `status.observedGeneration`, which is the generation most recently observed by the component responsible for acting upon changes to the desired state of the resource. This can be used, for instance, to ensure that the reported status reflects the most recent desired state. The observedGeneration field is also part of metav1.Conditions and represents the spec generation that the condition was set based upon.


I am missing a link to how I would use that generation field. Compare it to what? Maybe an example?

Added an example.
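
As a minimal sketch of the comparison asked about here, a consumer can check whether the reported status reflects the latest spec by comparing `status.observedGeneration` with `metadata.generation`. The helper names and the sample resource are hypothetical; the two field names come from the hunk above.

```python
def status_is_current(resource):
    """True when status.observedGeneration has caught up with metadata.generation."""
    generation = resource.get("metadata", {}).get("generation")
    observed = resource.get("status", {}).get("observedGeneration")
    if generation is None or observed is None:
        # Without both fields no conclusion can be drawn, so be conservative.
        return False
    return observed >= generation


def condition_is_current(resource, condition):
    """True when a condition was set based on the latest spec generation."""
    generation = resource.get("metadata", {}).get("generation")
    observed = condition.get("observedGeneration")
    if generation is None or observed is None:
        return False
    return observed >= generation


# Hypothetical resource: the spec was changed (generation 4), but the
# operator has only acted on generation 3 so far.
stale = {
    "metadata": {"generation": 4},
    "status": {
        "observedGeneration": 3,
        "conditions": [
            {"type": "Ready", "status": "True", "observedGeneration": 3},
        ],
    },
}
stale_status = status_is_current(stale)
stale_ready = condition_is_current(stale, stale["status"]["conditions"][0])
```

In this sample the Ready condition is `"True"`, yet both checks return false, which tells the reader that the Ready value describes the previous spec and should not be taken as confirmation of the latest change.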

Co-authored-by: Emmanuel Bernard <github@mel.emmanuelbernard.com>

feat: AP 17 Fleet shard status

1096aa3

tombentley added the state: needs-reviewers label Jan 30, 2023

andreaTP mentioned this pull request Jan 30, 2023

OpenAPI clients code generation - Kiota #87

Closed

shawkins added 2 commits February 1, 2023 09:25

Merge branch 'main' of github.com:bf2fc6cc711aee1a0c2a/architecture

597c1cf

AP 17 adding Luca's updates on generation / observedGeneration

034da17

tombentley requested review from maleck13, lburgazzoli, tombentley, danielezonca, emmanuelbernard and EricWittmann February 17, 2023 09:08

tombentley approved these changes Feb 17, 2023

tombentley reviewed Feb 17, 2023

maleck13 reviewed Feb 21, 2023

maleck13 approved these changes Feb 21, 2023

shawkins and others added 3 commits February 22, 2023 11:01

Update _ap/17/index.adoc

e5cfb93

Co-authored-by: Tom Bentley <tombentley@users.noreply.github.com>

Update _ap/17/index.adoc

95f6570

Co-authored-by: Tom Bentley <tombentley@users.noreply.github.com>

addressing review comments

eb8e943

EricWittmann approved these changes Feb 22, 2023

tombentley added state: being-reviewed and removed state: needs-reviewers labels Mar 1, 2023

lburgazzoli reviewed Mar 2, 2023

danielezonca approved these changes Mar 3, 2023

emmanuelbernard requested changes Mar 10, 2023

shawkins and others added 3 commits March 10, 2023 10:11

Update _ap/17/index.adoc

9869245

Co-authored-by: Emmanuel Bernard <github@mel.emmanuelbernard.com>

Update _ap/17/index.adoc

d8536f6

Co-authored-by: Emmanuel Bernard <github@mel.emmanuelbernard.com>

updating for review comments

85d0693

shawkins force-pushed the main branch from a29c68e to 85d0693 on March 14, 2023 14:26



