fix(integration) Integration not marked as Failed when Camel is unabl… #2292
Conversation
/restest
Thanks a lot for the PR. Here are a couple of points:
- Could the Integration status be reconciled from the delegating controller, i.e., Deployment, KnativeService or CronJob, rather than going down to the Pods level?
- If getting the Pods statuses is the only option, I don't think the Integration controller currently watches for Pods changes, so updates may be missed;
- I'd suggest calling the status update logic from the existing `monitor` action, as it's already responsible for reconciling the Integration status when it's running, and generally actions are bound to phases, not resources;
- Generally, the `Error` state is unrecoverable, while `CrashLoopBackOff` may recover (which explains why the Pod is still in the running phase): should the `CrashLoopBackOff` container state and the `PodFailed` phase be handled differently?
- Ideally, it'd be great to have an e2e test.
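The distinction between a recoverable `CrashLoopBackOff` and a final `Error` state could be sketched roughly as follows. This is a minimal, self-contained sketch with hypothetical simplified types, not Camel K's actual code or the real `k8s.io/api` types:

```go
package main

import "fmt"

// ContainerState is a hypothetical simplification of the relevant
// Kubernetes container status fields for this sketch.
type ContainerState struct {
	WaitingReason    string // e.g. "CrashLoopBackOff"
	TerminatedReason string // e.g. "Error"
}

// recoverable reports whether the container might still come back up:
// CrashLoopBackOff means the kubelet will keep restarting it, while a
// terminated "Error" state is treated as final here.
func recoverable(s ContainerState) bool {
	if s.WaitingReason == "CrashLoopBackOff" {
		return true
	}
	return s.TerminatedReason != "Error"
}

func main() {
	fmt.Println(recoverable(ContainerState{WaitingReason: "CrashLoopBackOff"})) // true
	fmt.Println(recoverable(ContainerState{TerminatedReason: "Error"}))         // false
}
```

Under this split, only the terminal case would move the Integration to the error phase, while `CrashLoopBackOff` would keep it eligible to return to running.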
From my test cases (routes in Java DSL that succeed at build but fail at runtime), see below:
I don't think it is. Is there an example of this delegating controller so I can take a look at it?
I found that the running Pod's status is the way to inspect the reason why the Pod failed to start.
At least looking at how
While testing I noticed the pod can be either
Sure, the tests are coming.
To turn an Integration into Pods, Camel K creates either a Deployment, a KnativeService, or a CronJob, depending on the deployment strategy. These are the controllers that manage the Integration Pods. They already take care of aggregating the Pod statuses into the Deployment (resp. KnativeService or CronJob) status. I would like to make sure we don't reinvent the wheel of reconciling Pod statuses if the Deployment (or the other primitives) already provides an aggregated status.
The action is called by the controller. So the
The main difference with
I understand you made the decision to map both
Great!
@astefanutti I added an e2e test, so you can have a look while I work on the other issues you commented on. Thanks for reviewing.
@astefanutti I had to put this work on hold, but came back to it today. The Deployment status doesn't show the Pod status, which is why monitor_pod.go inspects the Pod status.
Definitely, yes.
What about the current ReplicaSet? Also Conditions may provide some useful info.
There are already the
Force-pushed from 94a6321 to 57f1a13
@astefanutti can you review?
Thanks. The main difficulty that remains is to aggregate the Pods statuses into the Integration status / phase. I would suggest researching relying on the Pods controller status (Deployment, CronJob or KnativeService), as:
- They control the Pods and are responsible for aggregating their statuses;
- It would avoid watching for Pods changes, which has scalability challenges, especially when the operator is deployed globally, meaning it would watch for all Pods events across the entire cluster.
pkg/controller/integration/error.go
Outdated
@@ -56,6 +57,27 @@ func (action *errorAction) Handle(ctx context.Context, integration *v1.Integrati
    return integration, nil
}

// the integration in error state may have recovered and the running pod may be ok
// at this point we need to check, if the pod is ok set the integration to running.
podList, _ := action.client.CoreV1().Pods(integration.Namespace).List(ctx, metav1.ListOptions{
Generally, it's preferable to use the client from controller-runtime, as it relies on cached informers.
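The benefit of the cached client can be illustrated with a toy sketch, in plain Go with hypothetical interfaces rather than the real controller-runtime API: a cache-backed reader fills its cache once (akin to an informer's initial sync) and serves repeated reads from memory instead of hitting the API server each time.

```go
package main

import "fmt"

// PodReader abstracts "list pods in a namespace" (hypothetical interface).
type PodReader interface {
	List(namespace string) []string
}

// apiServer simulates direct clientset calls: every List is a round-trip.
type apiServer struct{ calls int }

func (a *apiServer) List(ns string) []string {
	a.calls++
	return []string{ns + "/pod-a", ns + "/pod-b"}
}

// cachedReader simulates a controller-runtime style cached client:
// it populates its cache on the first read and serves subsequent
// reads from memory, never touching the upstream again.
type cachedReader struct {
	upstream PodReader
	cache    map[string][]string
}

func (c *cachedReader) List(ns string) []string {
	if pods, ok := c.cache[ns]; ok {
		return pods
	}
	pods := c.upstream.List(ns)
	c.cache[ns] = pods
	return pods
}

func main() {
	api := &apiServer{}
	cached := &cachedReader{upstream: api, cache: map[string][]string{}}
	for i := 0; i < 5; i++ {
		cached.List("camel-k")
	}
	fmt.Println(api.calls) // 1: only the initial sync hit the API server
}
```

A real informer additionally watches for changes to keep the cache fresh; the sketch only captures the read path.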
pkg/controller/integration/error.go
Outdated
if len(podList.Items) > 0 {
    // when pod respin, the podlist may contain two pods (the terminating pod and the new starting)
    // we want the last one as it is the newest
    pod := podList.Items[len(podList.Items)-1]
The Integration can have multiple replicas, when it's scaled manually, or automatically with Knative or HPA. Relying on the last Pod from the list seems wrong in that case.
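With multiple replicas, the status would need to be aggregated over all Pods rather than taken from the last list entry. A rough sketch, using a hypothetical simplified `Pod` type rather than `corev1.Pod`:

```go
package main

import "fmt"

// Pod is a hypothetical simplification of corev1.Pod for this sketch.
type Pod struct {
	Name  string
	Ready bool
}

// allReady aggregates readiness across every replica: a single failing
// Pod is enough to keep the Integration out of the Running phase.
func allReady(pods []Pod) bool {
	for _, p := range pods {
		if !p.Ready {
			return false
		}
	}
	return len(pods) > 0
}

func main() {
	pods := []Pod{{"it-1", true}, {"it-2", false}, {"it-3", true}}
	// Picking only the last Pod would wrongly report true here.
	fmt.Println(allReady(pods)) // false
}
```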
pkg/controller/integration/error.go
Outdated
// we look at the container status, because the pod may be in running phase
// but the container may be in error or running state
if running := pod.Status.ContainerStatuses[0].State.Running; running != nil {
The Integration Pod(s) can have multiple containers, e.g., with Knative sidecars, or even when using the Pod trait. Relying on the first container seems wrong in that case.
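With sidecars present, the integration container would need to be looked up by name instead of taken at index 0. A sketch with hypothetical stand-in types; the container name "integration" and the sidecar name "queue-proxy" are illustrative assumptions here:

```go
package main

import "fmt"

// ContainerStatus is a hypothetical stand-in for corev1.ContainerStatus.
type ContainerStatus struct {
	Name    string
	Running bool
}

// findContainer returns the status of the named container, if present,
// so sidecars are not inspected by mistake.
func findContainer(statuses []ContainerStatus, name string) (ContainerStatus, bool) {
	for _, s := range statuses {
		if s.Name == name {
			return s, true
		}
	}
	return ContainerStatus{}, false
}

func main() {
	statuses := []ContainerStatus{
		{Name: "queue-proxy", Running: true}, // sidecar, happens to be first
		{Name: "integration", Running: false},
	}
	if s, ok := findContainer(statuses, "integration"); ok {
		// Indexing [0] would have reported the sidecar's state instead.
		fmt.Println(s.Running) // false
	}
}
```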
The reconcile loop won't always be called, as Pods are not being watched by the Integration reconciler.
When Pods are in an error state, looking at the Deployment object's status.conditions, the "Available" condition type is False; when the Pod is running OK, Available is "True". I think this is a good indicator to reconcile the Integration status; this way there is no need to inspect the Pod status anymore, WDYT?
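The check described here could be sketched like this, with a hypothetical type mirroring `appsv1.DeploymentCondition` rather than the real API:

```go
package main

import "fmt"

// DeploymentCondition is a hypothetical simplification of
// appsv1.DeploymentCondition for this sketch.
type DeploymentCondition struct {
	Type   string // e.g. "Available", "Progressing"
	Status string // "True", "False" or "Unknown"
}

// isAvailable reports whether the Deployment's Available condition is True,
// the proposed signal for marking the Integration as Running.
func isAvailable(conds []DeploymentCondition) bool {
	for _, c := range conds {
		if c.Type == "Available" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	conds := []DeploymentCondition{
		{Type: "Progressing", Status: "True"},
		{Type: "Available", Status: "False"},
	}
	fmt.Println(isAvailable(conds)) // false: the Integration would not be marked Running
}
```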
I agree, it may be needed also to rely on the

Once we're good for Deployment, we also have to handle the cases where the Integration is deployed as a KnativeService or a CronJob, using the specific conditions that these other resources expose.
Force-pushed from 57f1a13 to b45846a
@astefanutti I pushed the changes to check Deployments, can you review if the general idea is valid?
@claudio4j thanks for the update. The approach is valid. To converge on the Deployment strategy, I'd suggest the following improvements:
- Implement the error state, and take the complement to compute the running state, so that it's easier to reason about, and make sure the Integration leaf states map to a partition;
- As to implementing that error state, I'd suggest mimicking the output of `kubectl rollout status deployment`, that is, the `Progressing` condition is falsy and its reason is `ProgressDeadlineExceeded` (see https://github.com/kubernetes/kubectl/blob/652881798563c00c1895ded6ced819030bfaa4d7/pkg/polymorphichelpers/rollout_status.go#L59-L92);
- To report the reason(s) of the error:
  - A condition seems more appropriate than using the `failure` field, which I think should be removed;
  - The `ReplicaFailure` condition should be checked;
  - If not present, the reason of the error condition should be a summary of the Integration Pod(s) availability / readiness.
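Mimicking `kubectl rollout status` would roughly mean treating a falsy `Progressing` condition with reason `ProgressDeadlineExceeded` as the error state. A sketch with a hypothetical condition type; the actual logic lives in kubectl's rollout_status.go, linked above:

```go
package main

import "fmt"

// DeploymentCondition is a hypothetical stand-in for appsv1.DeploymentCondition.
type DeploymentCondition struct {
	Type   string
	Status string
	Reason string
}

// rolloutFailed mirrors the kubectl rollout status check: the rollout is
// considered failed when Progressing is False with reason
// ProgressDeadlineExceeded.
func rolloutFailed(conds []DeploymentCondition) bool {
	for _, c := range conds {
		if c.Type == "Progressing" {
			return c.Status == "False" && c.Reason == "ProgressDeadlineExceeded"
		}
	}
	return false
}

func main() {
	conds := []DeploymentCondition{
		{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"},
	}
	fmt.Println(rolloutFailed(conds)) // true: the Integration would transition to Error
}
```

Computing the running state as the complement of this error state keeps the two Integration leaf phases a clean partition, as suggested.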
…e to start apache#2291
* Added unversioned maven directories to .gitignore
Force-pushed from b45846a to dcd2150
@astefanutti can you have a look?
@claudio4j thanks. Let's merge this and iterate further on the remaining points in subsequent PRs.