-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite GitRepositoryReconciler
to new standards
#411
Conversation
This commit introduces new Condition types to the v1beta1 API, facilitating easier observation of (potentially) problematic state for end-users. - `ArtifactUnavailableCondition`: indicates there is no artifact available for the resource. This Condition should be set by the reconciler as soon as it observes the absence of an artifact for a source. - `CheckoutFailedCondition`: indicates a transient or persistent checkout failure. This Condition should be set by the reconciler as soon as it observes a Git checkout failure, including any prerequisites like the unavailability of the referenced Secret used for authentication. It should be deleted as soon as a successful checkout has been observed again. - `SourceVerifiedCondition`: indicates the integrity of the source has been verified. The Condition should be set to True or False by the reconciler based on the result of the integrity check. If there is no verification mode and/or secret configured, the Condition should be removed. - `IncludeUnavailableCondition`: indicates one of the referenced includes is not available. This Condition should for example be set by the reconciler when the include does not exist, or does not have an artifact. If the includes become available, it should be deleted. - `ArtifactOutdatedCondition`: indicates the current artifact of the source is outdated. This Condition should for example be set by the reconciler when it notices there is a newer revision for an artifact, or the previously included artifacts differ from the current available ones. The Condition should be removed after writing a new artifact to the storage. Signed-off-by: Hidde Beydals <hello@hidde.co>
This commit ensures all API objects implement the interfaces used by the runtime package to work with conditions, etc., and prepares the test suite to work with the `pkg/runtime/testenv` wrapper. Changes are made in a backwards compatible way (that being: the existing code can still be build and works as expected), but without proper dependency boundaries. The result of this is that the API package temporary depends on the runtime package, which is resolved when all reconcilers have been refactored and the API package does no longer contain condition modifying functions. Signed-off-by: Hidde Beydals <hello@hidde.co>
At present it only implements a fake commit, which for example can be used to test commit verification logic. Signed-off-by: Hidde Beydals <hello@hidde.co>
Signed-off-by: Hidde Beydals <hello@hidde.co>
49c01c7
to
3bdbc55
Compare
a78183f
to
a3ecd0a
Compare
This commit rewrites the `GitRepositoryReconciler` to new standards, while implementing the newly introduced Condition types, and trying to adhere better to Kubernetes API conventions. More specifically it introduces: - Implementation of more explicit Condition types to highlight abnormalities. - Extensive usage of the `conditions` subpackage from `runtime`. - Better and more conflict-resilient (status)patching of reconciled objects using the `patch` subpackage from runtime. - Proper implementation of kstatus' `Reconciling` and `Stalled` conditions. - First (integration) tests that solely rely on `testenv` and do not use Ginkgo. There are a couple of TODOs marked in-code, these are suggestions for the future and should be non-blocking. In addition to the TODOs, more complex and/or edge-case test scenarios may be added as well. Signed-off-by: Hidde Beydals <hello@hidde.co>
Signed-off-by: Hidde Beydals <hello@hidde.co>
Signed-off-by: Hidde Beydals <hello@hidde.co>
This ensures the checksum is predictable, and not influenced by e.g. different runtime configuration settings, or FS specific data. Signed-off-by: Hidde Beydals <hello@hidde.co>
a3ecd0a
to
e34f792
Compare
|
||
// Record that we do not have an artifact | ||
if obj.GetArtifact() == nil { | ||
conditions.MarkTrue(obj, sourcev1.ArtifactUnavailableCondition, "NoArtifact", "No artifact for resource in storage") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe extract the common reasons to the API? e.g. NoArtifact
, NewRevision
, IllegalPath
, NotFound
, GarbageCollectionSucceeded
, GarbageCollectionFailed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, already had this as a follow up in my head. I want to do this once all the reconcilers have seen a PR like this, to see if there is any overlap between them.
Officially reasons are arbitrary by the way, and consumers should not rely on consistency and/or type checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made that list thinking of Bucket
, but maybe other reasons are common to Git/Helm repos. Good point on doing this after all the controllers are done.
repository.Status.SetLastHandledReconcileRequest(v) | ||
// Always record readiness and duration metrics | ||
r.Metrics.RecordReadiness(ctx, obj) | ||
r.Metrics.RecordDuration(ctx, obj, start) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we're missing:
log.Info(fmt.Sprintf("Reconciliation finished in %s, next run in %s",
time.Now().Sub(start).String(),
repository.GetInterval().Duration.String(),
))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering what value this adds, and if it wouldn't be better to log separate actions. Reason is that if an artifact is up-to-date, this is already logged. In addition, for the other things we record metrics or emit events, which means they are already observable.
I would feel more for adding a lot more debug level logs, which means the user has an option to switch from "important errors and little information to show it is doing things" to "show me all the steps".
objRef, err := reference.GetReference(r.Scheme, &repository) | ||
if err != nil { | ||
return ctrl.Result{}, err | ||
// Determine if the resource is still being reconciled, or if it has stalled, and record this observation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment confuses me, what does still being reconciled
, does it mean not marked for deletion
? Can you please also explain when result.IsZero()
returns true if retErr
is nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do no longer always return a RequeueAfter
, as we know that in some cases there is a finite state (e.g. because there is a wrong path configured, or some other type of error that is not solved automatically).
The possible returns are:
Result.RequeueAfter && err == nil
: we did not run into an issue, and there is a delayed requeue. This means we are no longer activelyReconciling
, and if the resource is not in aReady
state, haveStalled
.Result.IsZero() && err == nil
: we did not return a requeue instruction and no error either. This can for example happen when we know an error is persistent, and can only be resolved by a new generation of the resource, the resource should thus be marked asStalled
.err != nil
: so we are expecting to try again in a relatively short time frame. In this case theReconciling
state can be left untouched.
// between init and delete | ||
if !controllerutil.ContainsFinalizer(obj, sourcev1.SourceFinalizer) { | ||
controllerutil.AddFinalizer(obj, sourcev1.SourceFinalizer) | ||
return ctrl.Result{Requeue: true}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create a dedicated patcher for adding the finalizers instead of reconciling twice all new objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think our finalizers will need to be more domain specific so we can do kind-topological deletes in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please get rid of %q
from logs/events, it bork the output of flux logs
and kubectl get events
e.g. unknown field "memory"
and I guess Slack too.
conditions.MarkTrue(obj, sourcev1.CheckoutFailedCondition, sourcev1.AuthenticationFailedReason, | ||
"Failed to get auth strategy for Git implementation %q: %s", obj.Spec.GitImplementation, err) | ||
// Do not return error as recovery without changes is impossible | ||
return ctrl.Result{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue I’m seeing with this approach (not returning errors) is that flux logs
is no longer useful. You can’t determine what’s going on by tailing the logs anymore, as some critical errors only end up in conditions or/and events. While conditions make it easy for machines to determine state, the lack of logging hurts our end-users, as they have to look at logs, conditions and events to determine what’s going on, adding up to the cognitive load when debugging multiple controllers/resources at once. I guess from now on we should tell people to not really on logs, that means CloudWatch, Stackdriver, etc can’t be used for debugging, they can only determine state by connecting to Kubernetes API and run flux get all
in a loop. This also means that platform admins must give tenants vpn access to Kubernetes API, having access to a log aggregator UI is not sufficient to determine for example a miss configuration of Git auth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to discuss options that add minimal logging so flux logs
has value again, this may actually be combined with the debug suggestion I made in #411 (comment).
From an API conventions / Kubernetes observation perspective, logs generally should be your last resort, and you are normally expected to forward the Kubernetes Event to an aggregator for persistent storage.
Lets have a chat on Monday to see if we can agree on a solution we are both happy with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From an API conventions / Kubernetes observation perspective, logs generally should be your last resort, and you are normally expected to forward the Kubernetes Event to an aggregator for persistent storage.
We can't impose this view of the world on our users, no managed Kubernetes offering does this, only logs are aggregated by default. Not to mention that not all Flux users can afford such an expensive feature, as most event aggregators makes you pay at ingress, and Kubernetes only keeps the events for a short period of time in etcd, making the events unreliable for incident analysis. Please consider: #411 (comment)
Fixes error returned from target path validation check and adds more test cases for TestGitRepositoryReconciler_reconcileArtifact. Signed-off-by: Sunny <darkowlzz@protonmail.com>
I propose we freeze the merge on Proposals:
|
Signed-off-by: Hidde Beydals <hello@hidde.co>
The problem with `GetInterval()` was that the returned type was of `metav1.Duration`, while almost anywhere it was used, a type of `time.Duration` was requested. The result of this was that we had to call `GetInterval().Duration` all the time, which would become a bit cumbersome after awhile. To prevent this, we introduce a new `GetRequeueAfter() time.Duration` method, which both results the right type, and bears a name that is easier to remember where the value is used most; while setting the `Result.RequeueAfter` during reconcile operations. The introduced of this method deprecates `GetInterval()`, which should be removed in a future MINOR release. Signed-off-by: Hidde Beydals <hello@hidde.co>
68c8fd5
to
a1ed1fc
Compare
- Mention the current revision in the up-to-date log message. - Ensure any error that is "swallowed" (not returned) is logged to ensure they are visible within the logs, and not just by inspecting the object. Signed-off-by: Hidde Beydals <hello@hidde.co>
controllers: Add more tests for reconcileArtifact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supersedes #361
This PR rewrites the
GitRepositoryReconciler
to new standards,while implementing the newly introduced Condition types, and trying
to adhere better to Kubernetes API conventions.
More specifically it introduces:
abnormalities.
conditions
subpackage fromruntime
.objects using the
patch
subpackage from runtime.Reconciling
andStalled
conditions.
testenv
and do notuse Ginkgo.
There are a couple of TODOs marked in-code, these are suggestions for
the future and should be non-blocking.
In addition to the TODOs, more complex and/or edge-case test scenarios
may be added as well.