Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite GitRepositoryReconciler to new standards #411

Merged
merged 13 commits into from
Aug 3, 2021

Conversation

hiddeco
Copy link
Member

@hiddeco hiddeco commented Jul 30, 2021

Supersedes #361

This PR rewrites the GitRepositoryReconciler to new standards,
while implementing the newly introduced Condition types, and trying
to adhere better to Kubernetes API conventions.

More specifically it introduces:

  • Implementation of more explicit Condition types to highlight
    abnormalities.
  • Extensive usage of the conditions subpackage from runtime.
  • Better and more conflict-resilient (status)patching of reconciled
    objects using the patch subpackage from runtime.
  • Proper implementation of kstatus' Reconciling and Stalled
    conditions.
  • First (integration) tests that solely rely on testenv and do not
    use Ginkgo.

There are a couple of TODOs marked in-code, these are suggestions for
the future and should be non-blocking.
In addition to the TODOs, more complex and/or edge-case test scenarios
may be added as well.

⚠️ target branch of the PR is a newly-introduced (protected) development branch.

This commit introduces new Condition types to the v1beta1 API,
facilitating easier observation of (potentially) problematic state for
end-users.

- `ArtifactUnavailableCondition`: indicates there is no artifact
  available for the resource. This Condition should be set by the
  reconciler as soon as it observes the absence of an artifact for a
  source.
- `CheckoutFailedCondition`: indicates a transient or persistent
  checkout failure. This Condition should be set by the reconciler as
  soon as it observes a Git checkout failure, including any
  prerequisites like the unavailability of the referenced Secret used
  for authentication. It should be deleted as soon as a successful
  checkout has been observed again.
- `SourceVerifiedCondition`: indicates the integrity of the source has
  been verified. The Condition should be set to True or False by the
  reconciler based on the result of the integrity check.
  If there is no verification mode and/or secret configured, the
  Condition should be removed.
- `IncludeUnavailableCondition`: indicates one of the referenced
  includes is not available. This Condition should for example be set
  by the reconciler when the include does not exist, or does not have
  an artifact. If the includes become available, it should be deleted.
- `ArtifactOutdatedCondition`: indicates the current artifact of the
  source is outdated. This Condition should for example be set by the
  reconciler when it notices there is a newer revision for an artifact,
  or the previously included artifacts differ from the current available
  ones. The Condition should be removed after writing a new artifact
  to the storage.

Signed-off-by: Hidde Beydals <hello@hidde.co>
This commit ensures all API objects implement the interfaces used by
the runtime package to work with conditions, etc., and prepares the
test suite to work with the `pkg/runtime/testenv` wrapper.

Changes are made in a backwards compatible way (that being: the
existing code can still be build and works as expected), but without
proper dependency boundaries. The result of this is that the API
package temporary depends on the runtime package, which is resolved
when all reconcilers have been refactored and the API package does
no longer contain condition modifying functions.

Signed-off-by: Hidde Beydals <hello@hidde.co>
At present it only implements a fake commit, which for example can be
used to test commit verification logic.

Signed-off-by: Hidde Beydals <hello@hidde.co>
Signed-off-by: Hidde Beydals <hello@hidde.co>
@hiddeco hiddeco marked this pull request as ready for review July 30, 2021 13:03
@hiddeco hiddeco added area/git Git related issues and pull requests enhancement New feature or request labels Jul 30, 2021
@hiddeco hiddeco force-pushed the gitrepository-reconciler branch 3 times, most recently from a78183f to a3ecd0a Compare July 30, 2021 17:20
This commit rewrites the `GitRepositoryReconciler` to new standards,
while implementing the newly introduced Condition types, and trying
to adhere better to Kubernetes API conventions.

More specifically it introduces:

- Implementation of more explicit Condition types to highlight
  abnormalities.
- Extensive usage of the `conditions` subpackage from `runtime`.
- Better and more conflict-resilient (status)patching of reconciled
  objects using the `patch` subpackage from runtime.
- Proper implementation of kstatus' `Reconciling` and `Stalled`
  conditions.
- First (integration) tests that solely rely on `testenv` and do not
  use Ginkgo.

There are a couple of TODOs marked in-code, these are suggestions for
the future and should be non-blocking.
In addition to the TODOs, more complex and/or edge-case test scenarios
may be added as well.

Signed-off-by: Hidde Beydals <hello@hidde.co>
Signed-off-by: Hidde Beydals <hello@hidde.co>
Signed-off-by: Hidde Beydals <hello@hidde.co>
This ensures the checksum is predictable, and not influenced by e.g.
different runtime configuration settings, or FS specific data.

Signed-off-by: Hidde Beydals <hello@hidde.co>
controllers/gitrepository_controller.go Outdated Show resolved Hide resolved

// Record that we do not have an artifact
if obj.GetArtifact() == nil {
conditions.MarkTrue(obj, sourcev1.ArtifactUnavailableCondition, "NoArtifact", "No artifact for resource in storage")
Copy link
Member

@stefanprodan stefanprodan Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extract the common reasons to the API? e.g. NoArtifact, NewRevision, IllegalPath, NotFound, GarbageCollectionSucceeded, GarbageCollectionFailed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, already had this as a follow up in my head. I want to do this once all the reconcilers have seen a PR like this, to see if there is any overlap between them.

Officially reasons are arbitrary by the way, and consumers should not rely on consistency and/or type checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made that list thinking of Bucket, but maybe other reasons are common to Git/Helm repos. Good point on doing this after all the controllers are done.

repository.Status.SetLastHandledReconcileRequest(v)
// Always record readiness and duration metrics
r.Metrics.RecordReadiness(ctx, obj)
r.Metrics.RecordDuration(ctx, obj, start)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we're missing:

	log.Info(fmt.Sprintf("Reconciliation finished in %s, next run in %s",
		time.Now().Sub(start).String(),
		repository.GetInterval().Duration.String(),
	))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering what value this adds, and if it wouldn't be better to log separate actions. Reason is that if an artifact is up-to-date, this is already logged. In addition, for the other things we record metrics or emit events, which means they are already observable.

I would feel more for adding a lot more debug level logs, which means the user has an option to switch from "important errors and little information to show it is doing things" to "show me all the steps".

objRef, err := reference.GetReference(r.Scheme, &repository)
if err != nil {
return ctrl.Result{}, err
// Determine if the resource is still being reconciled, or if it has stalled, and record this observation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment confuses me, what does still being reconciled, does it mean not marked for deletion? Can you please also explain when result.IsZero() returns true if retErr is nil.

Copy link
Member Author

@hiddeco hiddeco Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do no longer always return a RequeueAfter, as we know that in some cases there is a finite state (e.g. because there is a wrong path configured, or some other type of error that is not solved automatically).

The possible returns are:

  1. Result.RequeueAfter && err == nil: we did not run into an issue, and there is a delayed requeue. This means we are no longer actively Reconciling, and if the resource is not in a Ready state, have Stalled.
  2. Result.IsZero() && err == nil: we did not return a requeue instruction and no error either. This can for example happen when we know an error is persistent, and can only be resolved by a new generation of the resource, the resource should thus be marked as Stalled.
  3. err != nil: so we are expecting to try again in a relatively short time frame. In this case the Reconciling state can be left untouched.

// between init and delete
if !controllerutil.ContainsFinalizer(obj, sourcev1.SourceFinalizer) {
controllerutil.AddFinalizer(obj, sourcev1.SourceFinalizer)
return ctrl.Result{Requeue: true}, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create a dedicated patcher for adding the finalizers instead of reconciling twice all new objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our finalizers will need to be more domain specific so we can do kind-topological deletes in the future.

Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please get rid of %q from logs/events, it bork the output of flux logs and kubectl get events e.g. unknown field &#34;memory&#34; and I guess Slack too.

conditions.MarkTrue(obj, sourcev1.CheckoutFailedCondition, sourcev1.AuthenticationFailedReason,
"Failed to get auth strategy for Git implementation %q: %s", obj.Spec.GitImplementation, err)
// Do not return error as recovery without changes is impossible
return ctrl.Result{}, nil
Copy link
Member

@stefanprodan stefanprodan Jul 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue I’m seeing with this approach (not returning errors) is that flux logs is no longer useful. You can’t determine what’s going on by tailing the logs anymore, as some critical errors only end up in conditions or/and events. While conditions make it easy for machines to determine state, the lack of logging hurts our end-users, as they have to look at logs, conditions and events to determine what’s going on, adding up to the cognitive load when debugging multiple controllers/resources at once. I guess from now on we should tell people to not really on logs, that means CloudWatch, Stackdriver, etc can’t be used for debugging, they can only determine state by connecting to Kubernetes API and run flux get all in a loop. This also means that platform admins must give tenants vpn access to Kubernetes API, having access to a log aggregator UI is not sufficient to determine for example a miss configuration of Git auth.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to discuss options that add minimal logging so flux logs has value again, this may actually be combined with the debug suggestion I made in #411 (comment).

From an API conventions / Kubernetes observation perspective, logs generally should be your last resort, and you are normally expected to forward the Kubernetes Event to an aggregator for persistent storage.

Lets have a chat on Monday to see if we can agree on a solution we are both happy with.

Copy link
Member

@stefanprodan stefanprodan Aug 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an API conventions / Kubernetes observation perspective, logs generally should be your last resort, and you are normally expected to forward the Kubernetes Event to an aggregator for persistent storage.

We can't impose this view of the world on our users, no managed Kubernetes offering does this, only logs are aggregated by default. Not to mention that not all Flux users can afford such an expensive feature, as most event aggregators makes you pay at ingress, and Kubernetes only keeps the events for a short period of time in etcd, making the events unreliable for incident analysis. Please consider: #411 (comment)

@hiddeco
Copy link
Member Author

hiddeco commented Jul 31, 2021

Having worked more on #412 and #413, it looks like they both require a DownloadFailed condition. I am wondering if we can find a term that would cover both DownloadFailed and CheckoutFailed. As this would cut back on the number of conditions.

Fixes error returned from target path validation check and adds more
test cases for TestGitRepositoryReconciler_reconcileArtifact.

Signed-off-by: Sunny <darkowlzz@protonmail.com>
@stefanprodan
Copy link
Member

stefanprodan commented Aug 2, 2021

I propose we freeze the merge on main for reconcilers-dev until we agree and implement the observability improvements.

Proposals:

  • To improve observability through events, without spamming the alerting system, I propose we introduce a new eventSeverity type called trace. The trace events are emitted as Kubernetes events under Normal, these events are dropped when forwarding to notification controller. Every source reconciliation should issue a single info event that contains the source URL and revision (later on we should consider adding the commit message too for GitRepositories).
  • To improve observability through logs, every time we swallow errors we should log them. At the end of the reconciliation, if the revision is the same, we should clearly log the revision e.g. Already up to date, current revision '%s'.

Signed-off-by: Hidde Beydals <hello@hidde.co>
The problem with `GetInterval()` was that the returned type was of
`metav1.Duration`, while almost anywhere it was used, a type of
`time.Duration` was requested. The result of this was that we had to
call `GetInterval().Duration` all the time, which would become a bit
cumbersome after awhile.

To prevent this, we introduce a new `GetRequeueAfter() time.Duration`
method, which both results the right type, and bears a name that is
easier to remember where the value is used most; while setting the
`Result.RequeueAfter` during reconcile operations.

The introduced of this method deprecates `GetInterval()`, which should
be removed in a future MINOR release.

Signed-off-by: Hidde Beydals <hello@hidde.co>
- Mention the current revision in the up-to-date log message.
- Ensure any error that is "swallowed" (not returned) is logged to
  ensure they are visible within the logs, and not just by inspecting
  the object.

Signed-off-by: Hidde Beydals <hello@hidde.co>
controllers: Add more tests for reconcileArtifact
Copy link
Member

@stefanprodan stefanprodan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @hiddeco 🏅

PS. I've created an issue for the trace events here fluxcd/pkg#139

@hiddeco hiddeco merged commit becd5f8 into reconcilers-dev Aug 3, 2021
@hiddeco hiddeco deleted the gitrepository-reconciler branch August 3, 2021 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/git Git related issues and pull requests enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants