
fix(integration) Integration not marked as Failed when Camel is unable to start #2292

Merged (1 commit, Jun 24, 2021)

Conversation

@claudio4j (Contributor, Author)

#2291

  • Regenerated resources from a previous unrelated change to traits.yaml
  • Added unversioned maven directories to .gitignore

Release Note

NONE

@claudio4j (Contributor, Author)

/restest

@astefanutti (Member) left a comment

Thanks a lot for the PR. Here are a couple of points:

  • Could the Integration status be reconciled from the delegating controller, i.e., Deployment, KnativeService or CronJob, rather than going down to the Pods level?
  • If getting the Pods statuses is the only option, I don't think the Integration controller currently watches for Pods changes, so updates may be missed;
  • I'd suggest calling the status update logic from the existing monitor action, as it's already responsible for reconciling the Integration status when it's running, and generally actions are bound to phases, not resources;
  • Generally, the Error state is unrecoverable, while CrashLoopBackOff may recover (which explains why the Pod is still in the Running phase): should the CrashLoopBackOff container state and the PodFailed phase be handled differently? (A short sketch of the difference follows this list.)
  • Ideally, it'd be great to have an e2e test.
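
A minimal sketch (not Camel K code) of the difference raised in the fourth point: PodFailed is a terminal Pod phase, while CrashLoopBackOff shows up as a container waiting reason on a Pod that usually stays in the Running phase.

```go
// Illustration only: how the two failure signals differ on a Pod object.
package sketch

import corev1 "k8s.io/api/core/v1"

// podFailureKind is a hypothetical helper used only for illustration.
func podFailureKind(pod *corev1.Pod) string {
	// Terminal: all containers have terminated and won't be restarted.
	if pod.Status.Phase == corev1.PodFailed {
		return "failed"
	}
	// Recoverable: the kubelet restarts the container with back-off,
	// so the Pod phase remains Running.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return "crash-looping"
		}
	}
	return "ok"
}
```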

@claudio4j (Contributor, Author)

claudio4j commented May 18, 2021

From my test cases (routes in Java DSL that succeed at build but fail at runtime), see below:

  • Could the Integration status be reconciled from the delegating controller, i.e., Deployment, KnativeService or CronJob, rather than going down to the Pods level?

I don't think it is. Is there an example of this delegating controller so I can take a look at it?

  • If getting the Pods statuses is the only option, I don't think the Integration controller currently watches for Pods changes, so updates may be missed;

I found that the running Pod's status is the way to inspect the reason why the Pod failed to start.
The integration_controller.go doesn't watch Pods, but monitor_pod.go checks whether the Pod is properly running before setting the Integration's running status.

  • I'd suggest calling the status update logic from the existing monitor action, as it's already responsible for reconciling the Integration status when it's running, and generally actions are bound to phases, not resources;

At least looking at how build/monitor_pod works, my instinct was to have a similar dedicated and specific monitor_pod.go action. The CanHandle in monitor_pod.go also runs when the Integration is in the error state, so I think it should be its own action.

  • Generally, the Error state is unrecoverable, while CrashLoopBackOff may recover (which explains why the Pod is still in the Running phase): should the CrashLoopBackOff container state and the PodFailed phase be handled differently?

While testing I noticed the Pod can be either in CrashLoopBackOff or Error, that's why monitor_pod checks both.

  • Ideally, it'd be great to have an e2e test.

Sure, the tests are coming.

@astefanutti (Member)

> From my test cases (routes in Java DSL that succeed at build but fail at runtime), see below:
>
>   • Could the Integration status be reconciled from the delegating controller, i.e., Deployment, KnativeService or CronJob, rather than going down to the Pods level?
>
> I don't think it is. Is there an example of this delegating controller so I can take a look at it?

To turn an Integration into Pods, Camel K creates either a Deployment, a KnativeService, or a CronJob, depending on the deployment strategy. These are the controllers that manage the Integration Pods. They already take care of aggregating the Pods' statuses into the Deployment (resp. KnativeService or CronJob) status. I would like to make sure we don't reinvent the wheel of reconciling Pod statuses if the Deployment (or the other primitives) already provides an aggregated status.
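
For reference, a minimal sketch (not Camel K code) of the aggregated view a Deployment already exposes; the other primitives expose similar summaries.

```go
// Sketch only: the Deployment controller already rolls up its Pods' health
// into these status fields, so a reconciler can read them instead of listing Pods.
package sketch

import appsv1 "k8s.io/api/apps/v1"

func deploymentSummary(d *appsv1.Deployment) (desired, ready, unavailable int32) {
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	return desired, d.Status.ReadyReplicas, d.Status.UnavailableReplicas
}
```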

>   • If getting the Pods statuses is the only option, I don't think the Integration controller currently watches for Pods changes, so updates may be missed;
>
> I found that the running Pod's status is the way to inspect the reason why the Pod failed to start.
> The integration_controller.go doesn't watch Pods, but monitor_pod.go checks whether the Pod is properly running before setting the Integration's running status.

The action is called by the controller, so monitor_pod.go is called for events that are related to other resources being watched by integration_controller.go.

>   • I'd suggest calling the status update logic from the existing monitor action, as it's already responsible for reconciling the Integration status when it's running, and generally actions are bound to phases, not resources;
>
> At least looking at how build/monitor_pod works, my instinct was to have a similar dedicated and specific monitor_pod.go action. The CanHandle in monitor_pod.go also runs when the Integration is in the error state, so I think it should be its own action.

The main difference with build/monitor_pod is that the Build controller creates the Pod, while the Integration controller creates Deployment, KnativeService, or CronJob, and the monitor.go action is responsible for reconciling their statuses.

>   • Generally, the Error state is unrecoverable, while CrashLoopBackOff may recover (which explains why the Pod is still in the Running phase): should the CrashLoopBackOff container state and the PodFailed phase be handled differently?
>
> While testing I noticed the Pod can be either in CrashLoopBackOff or Error, that's why monitor_pod checks both.

I understand you made the decision to map both the CrashLoopBackOff and Error Pod statuses into the Integration Error phase. In the Pod state machine, CrashLoopBackOff is more of a container condition, and the Pod stays in the Running phase because retries occur. Should we make a closer mapping between Pods and Integration phases?

>   • Ideally, it'd be great to have an e2e test.
>
> Sure, the tests are coming.

Great!

@astefanutti added the area/core label (Core features of the integration platform) on May 18, 2021
@claudio4j (Contributor, Author)

@astefanutti I added an e2e test, so you can have a look while I work on the other issues you commented on. Thanks for reviewing.

@claudio4j (Contributor, Author)

@astefanutti I had to put this work on hold, but came back to it today. The Deployment status doesn't show the Pod status, that's why monitor_pod.go inspects the Pod status.
I added a specific monitor_pod.go action because its CanHandle function accepts running when the Integration phase is either running or error.

> Should we make a closer mapping between Pods and Integration phases?

Definitely, yes.

@astefanutti (Member)

> @astefanutti I had to put this work on hold, but came back to it today. The Deployment status doesn't show the Pod status, that's why monitor_pod.go inspects the Pod status.

What about the current ReplicaSet? Also, Conditions may provide some useful info; see the sketch below.
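
A rough sketch (not the actual Camel K code) of locating the Deployment's current ReplicaSet through the deployment.kubernetes.io/revision annotation, whose conditions (e.g. ReplicaFailure) can then be inspected; a controller-runtime client is assumed to be available.

```go
// Illustration only: find the ReplicaSet that corresponds to the Deployment's
// latest rollout, then its Status.Conditions can be examined.
package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func currentReplicaSet(ctx context.Context, c client.Client, d *appsv1.Deployment) (*appsv1.ReplicaSet, error) {
	list := &appsv1.ReplicaSetList{}
	if err := c.List(ctx, list,
		client.InNamespace(d.Namespace),
		client.MatchingLabels(d.Spec.Selector.MatchLabels)); err != nil {
		return nil, err
	}
	revision := d.Annotations["deployment.kubernetes.io/revision"]
	for i := range list.Items {
		// the ReplicaSet created for the latest rollout carries the same revision
		if list.Items[i].Annotations["deployment.kubernetes.io/revision"] == revision {
			return &list.Items[i], nil
		}
	}
	return nil, nil
}
```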

> I added a specific monitor_pod.go action because its CanHandle function accepts running when the Integration phase is either running or error.

There are already the monitor and error actions. I understand it may be simpler, from this specific issue's point of view, to create a new action, but it seems a bit of an anti-pattern to have multiple actions bound to the same phases. The general approach is to have a single action for a given resource phase.
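
For context, a hedged sketch of the action pattern described here; the interface below is a simplified approximation for illustration, not the exact Camel K definition.

```go
// Each action declares, via CanHandle, which Integration phase(s) it is bound
// to, and the controller invokes the single matching action per reconcile.
package sketch

import "context"

type IntegrationPhase string

const (
	PhaseRunning IntegrationPhase = "Running"
	PhaseError   IntegrationPhase = "Error"
)

type Integration struct {
	Phase IntegrationPhase
}

type Action interface {
	CanHandle(it *Integration) bool
	Handle(ctx context.Context, it *Integration) (*Integration, error)
}

// monitorAction is bound to the Running phase only; an errorAction would be
// bound to the Error phase, so the two phases stay a clean partition.
type monitorAction struct{}

func (a *monitorAction) CanHandle(it *Integration) bool {
	return it.Phase == PhaseRunning
}

func (a *monitorAction) Handle(ctx context.Context, it *Integration) (*Integration, error) {
	// reconcile the Integration status from its owned resources here
	return it, nil
}
```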

@claudio4j (Contributor, Author)

@astefanutti can you review?

@astefanutti (Member) left a comment

Thanks. The main difficulty that remains is aggregating the Pods' statuses into the Integration status / phase. I would suggest looking into relying on the Pods' controller status (Deployment, CronJob or KnativeService), as:

  • They control the Pods and are responsible for aggregating their statuses
  • It would avoid watching for Pods changes, which has scalability challenges, especially when the operator is deployed globally, meaning it would watch all Pod events for the entire cluster (see the sketch after this list).
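
As mentioned in the last point, a sketch of the watch setup this implies; the controller-runtime wiring is shown for illustration and the Integration import path is an assumption.

```go
// Illustration of the scalability point, not the actual Camel K wiring: the
// reconciler receives events for its primary resource and the resources it
// owns (the delegating controllers), whereas watching Pods directly would mean
// caching and handling events for every Pod visible to the operator.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	batchv1beta1 "k8s.io/api/batch/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	camelv1 "github.com/apache/camel-k/pkg/apis/camel/v1" // import path assumed
)

func setupIntegrationController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&camelv1.Integration{}).   // primary resource
		Owns(&appsv1.Deployment{}).    // delegating controllers that already
		Owns(&batchv1beta1.CronJob{}). // aggregate their Pods' statuses
		Complete(r)
	// a Knative Service watch would be registered similarly when Knative is available
}
```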

@@ -56,6 +57,27 @@ func (action *errorAction) Handle(ctx context.Context, integration *v1.Integrati
return integration, nil
}

// the integration in error state may have recovered and the running pod may be OK;
// at this point we need to check: if the pod is OK, set the integration to running
podList, _ := action.client.CoreV1().Pods(integration.Namespace).List(ctx, metav1.ListOptions{
@astefanutti (Member):

Generally, it's preferable to use the client from controller-runtime, as it relies on cached informers.
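
A sketch of this suggestion, assuming the action has access to a controller-runtime client; the label used to select the Integration's Pods is shown as an assumption for illustration.

```go
// Illustration only: list Pods through the controller-runtime client, which is
// backed by cached informers, instead of the typed clientset.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func listIntegrationPods(ctx context.Context, c client.Client, namespace, name string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := c.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"camel.apache.org/integration": name}, // label assumed
	)
	return pods, err
}
```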

if len(podList.Items) > 0 {
// when the pod respins, the pod list may contain two pods (the terminating pod and the new starting one);
// we want the last one, as it is the newest
pod := podList.Items[len(podList.Items)-1]
@astefanutti (Member):

The Integration can have multiple replicas, when it's scaled manually, or automatically with Knative or HPA. Relying on the last Pod from the list seems wrong in that case.
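
A minimal sketch of aggregating over every Pod in the list instead of picking the last element (illustration only).

```go
// Sketch only: with several replicas, count how many Pods are Ready rather
// than relying on the last element of the list.
package sketch

import corev1 "k8s.io/api/core/v1"

func readyPodCount(pods []corev1.Pod) (ready, total int) {
	total = len(pods)
	for _, p := range pods {
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready++
				break
			}
		}
	}
	return ready, total
}
```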


// we look at the container status, because the pod may be in the running phase
// but the container may be in an error or running state
if running := pod.Status.ContainerStatuses[0].State.Running; running != nil {
@astefanutti (Member):

The Integration Pod(s) can have multiple containers, e.g., with Knative sidecars, or even when using the Pod trait. Relying on the first container seems wrong in that case.
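
A small sketch of selecting the container status by name instead of by index; the name "integration" is assumed here for illustration.

```go
// Sketch only: look up the container status by name, since sidecars may be present.
package sketch

import corev1 "k8s.io/api/core/v1"

func containerStatus(pod *corev1.Pod, name string) *corev1.ContainerStatus {
	for i := range pod.Status.ContainerStatuses {
		if pod.Status.ContainerStatuses[i].Name == name {
			return &pod.Status.ContainerStatuses[i]
		}
	}
	return nil
}

// usage (container name assumed): status := containerStatus(&pod, "integration")
```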

@@ -56,6 +57,27 @@ func (action *errorAction) Handle(ctx context.Context, integration *v1.Integrati
return integration, nil
}

// the integration in error state may have recovered and the running pod may be OK;
// at this point we need to check: if the pod is OK, set the integration to running
podList, _ := action.client.CoreV1().Pods(integration.Namespace).List(ctx, metav1.ListOptions{
@astefanutti (Member):

The reconcile loop won't always be called, as Pods are not being watched by the Integration reconciler.

@claudio4j (Contributor, Author)

> I would suggest looking into relying on the Pods' controller status

When Pods are in an error state, the Deployment object's status.conditions show the "Available" condition type as False. When the Pod is running OK, Available is "True". I think this is a good indicator to reconcile the Integration status; this way there is no need to inspect the Pod status anymore. WDYT?

@astefanutti (Member)

> I would suggest looking into relying on the Pods' controller status
>
> When Pods are in an error state, the Deployment object's status.conditions show the "Available" condition type as False. When the Pod is running OK, Available is "True". I think this is a good indicator to reconcile the Integration status; this way there is no need to inspect the Pod status anymore. WDYT?

I agree. It may also be needed to rely on the appsv1.DeploymentProgressing condition and check that its reason equals NewReplicaSetAvailable, in order to cover the case where the latest ReplicaSet failed and the Deployment controller rolled back to the previous ReplicaSet, which succeeds but does not correspond to the latest version of the Integration.

Once we're good with Deployment, we also have to handle the cases where the Integration is deployed as a KnativeService or a CronJob, using the specific conditions that these other resources expose.
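
For illustration, a sketch under the assumptions discussed above (not the merged code) of deriving a "running" verdict from the Deployment's Available and Progressing conditions instead of inspecting Pods.

```go
// Sketch only: combine the Available and Progressing conditions.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func deploymentCondition(d *appsv1.Deployment, t appsv1.DeploymentConditionType) *appsv1.DeploymentCondition {
	for i := range d.Status.Conditions {
		if d.Status.Conditions[i].Type == t {
			return &d.Status.Conditions[i]
		}
	}
	return nil
}

// integrationLooksRunning requires the Deployment to be Available and the
// latest ReplicaSet to be fully rolled out, so a rollback to a previous
// ReplicaSet is not mistaken for a healthy latest Integration version.
func integrationLooksRunning(d *appsv1.Deployment) bool {
	available := deploymentCondition(d, appsv1.DeploymentAvailable)
	progressing := deploymentCondition(d, appsv1.DeploymentProgressing)
	return available != nil && available.Status == corev1.ConditionTrue &&
		progressing != nil && progressing.Status == corev1.ConditionTrue &&
		progressing.Reason == "NewReplicaSetAvailable"
}
```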

@claudio4j (Contributor, Author)

@astefanutti I pushed the changes to check Deployments, can you review whether the general idea is valid?
I am going to work on the KnativeService and CronJob cases.

@astefanutti (Member) left a comment

@claudio4j thanks for the update. The approach is valid. To converge on the Deployment strategy, I'd suggest the following improvements:

  • Implement the error state, and take its complement to compute the running state, so that it's easier to reason about and ensures the Integration leaf states form a partition;
  • As for implementing that error state, I'd suggest mimicking the output of kubectl rollout status deployment, that is, the Progressing condition is false and its reason is ProgressDeadlineExceeded (see https://github.com/kubernetes/kubectl/blob/652881798563c00c1895ded6ced819030bfaa4d7/pkg/polymorphichelpers/rollout_status.go#L59-L92); a rough sketch follows this list;
  • To report the reason(s) of the error:
    • A condition seems more appropriate than using the failure field, which I think should be removed,
    • The ReplicaFailure condition should be checked,
    • If not present, the reason of the error condition should be a summary of the Integration pod(s) availability / readiness.
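
As referenced in the second point above, a rough sketch of what that rollout-status-style check might look like; the helper and message format are illustrative assumptions, not the merged implementation.

```go
// Sketch only, loosely mirroring kubectl's rollout status logic: report an
// error when the Progressing condition has reason ProgressDeadlineExceeded,
// surface ReplicaFailure if present, otherwise summarise replica availability.
package sketch

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func condition(d *appsv1.Deployment, t appsv1.DeploymentConditionType) *appsv1.DeploymentCondition {
	for i := range d.Status.Conditions {
		if d.Status.Conditions[i].Type == t {
			return &d.Status.Conditions[i]
		}
	}
	return nil
}

// integrationErrorReason returns a non-empty reason when the Integration
// should be considered in error, following the rollout-status heuristic.
func integrationErrorReason(d *appsv1.Deployment) string {
	if p := condition(d, appsv1.DeploymentProgressing); p != nil &&
		p.Status == corev1.ConditionFalse && p.Reason == "ProgressDeadlineExceeded" {
		if rf := condition(d, appsv1.DeploymentReplicaFailure); rf != nil && rf.Status == corev1.ConditionTrue {
			return rf.Message
		}
		return fmt.Sprintf("deployment %q exceeded its progress deadline: %d/%d replicas available",
			d.Name, d.Status.AvailableReplicas, d.Status.Replicas)
	}
	return ""
}
```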

@claudio4j marked this pull request as draft on June 23, 2021
Commit: fix(integration) Integration not marked as Failed when Camel is unable to start (apache#2291)

  • Added unversioned maven directories to .gitignore
@claudio4j (Contributor, Author)

@astefanutti can you have a look?

@astefanutti marked this pull request as ready for review on June 24, 2021
@astefanutti (Member)

@claudio4j thanks. Let's merge this and iterate further on the remaining points in subsequent PRs.
