
fix(integration) Integration not marked as Failed when Camel is unable to start #2292

Merged (1 commit, Jun 24, 2021)

Conversation

@claudio4j (Contributor, Author)

#2291

  • Regenerated resources from a previous unrelated change to traits.yaml
  • Added unversioned maven directories to .gitignore

Release Note

NONE

@claudio4j (Contributor, Author)

/restest

@astefanutti (Member) left a comment

Thanks a lot for the PR. Here are a couple of points:

  • Could the Integration status be reconciled from the delegating controller, i.e., Deployment, KnativeService or CronJob, rather than going down to the Pods level?
  • If getting the Pods statuses is the only option, I don't think the Integration controller currently watches for Pods changes, so updates may be missed;
  • I'd suggest calling the status update logic from the existing monitor action, as it's already responsible for reconciling the Integration status when it's running, and generally actions are bound to phases, not resources;
  • Generally, the Error state is unrecoverable, while CrashLoopBackOff may recover (which explains why the Pod is still in the Running phase): should the CrashLoopBackOff container state and the PodFailed phase be handled differently? (A short sketch of the difference follows this list.)
  • Ideally, it'd be great to have an e2e test.
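
A minimal sketch (not Camel K code) of the difference raised in the fourth point: PodFailed is a terminal Pod phase, while CrashLoopBackOff shows up as a container waiting reason on a Pod that usually stays in the Running phase.

```go
// Illustration only: how the two failure signals differ on a Pod object.
package sketch

import corev1 "k8s.io/api/core/v1"

// podFailureKind is a hypothetical helper used only for illustration.
func podFailureKind(pod *corev1.Pod) string {
	// Terminal: all containers have terminated and won't be restarted.
	if pod.Status.Phase == corev1.PodFailed {
		return "failed"
	}
	// Recoverable: the kubelet restarts the container with back-off,
	// so the Pod phase remains Running.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return "crash-looping"
		}
	}
	return "ok"
}
```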

@claudio4j (Contributor, Author)

claudio4j commented May 18, 2021

From my test cases (routes in Java DSL that succeed at build but fail at runtime), see below:

  • Could the Integration status be reconciled from the delegating controller, i.e., Deployment, KnativeService or CronJob, rather than going down to the Pods level?

I don't think it is. Is there an example of this delegating controller so I can take a look at it?

  • If getting the Pods statuses is the only option, I don't think the Integration controller currently watches for Pods changes, so updates may be missed;

I found that the running Pod's status is the way to inspect the reason why the Pod failed to start.
The integration_controller.go doesn't watch Pods, but monitor_pod.go checks whether the Pod is properly running before setting the Integration's running status.

  • I'd suggest calling the status update logic from the existing monitor action, as it's already responsible for reconciling the Integration status when it's running, and generally actions are bound to phases, not resources;

At least looking at how build/monitor_pod works, my instinct was to have a similar dedicated and specific monitor_pod.go action. The CanHandle in monitor_pod.go also runs when the Integration is in the error state, so I think it should be its own action.

  • Generally, the Error state is unrecoverable, while CrashLoopBackOff may recover (which explains why the Pod is still in the Running phase): should the CrashLoopBackOff container state and the PodFailed phase be handled differently?

While testing I noticed the Pod can be either in CrashLoopBackOff or Error, that's why monitor_pod checks both.

  • Ideally, it'd be great to have an e2e test.

Sure, the tests are coming.

@astefanutti (Member)

> From my test cases (routes in Java DSL that succeed at build but fail at runtime), see below:
>
>   • Could the Integration status be reconciled from the delegating controller, i.e., Deployment, KnativeService or CronJob, rather than going down to the Pods level?
>
> I don't think it is. Is there an example of this delegating controller so I can take a look at it?

To turn an Integration into Pods, Camel K creates either a Deployment, a KnativeService, or a CronJob, depending on the deployment strategy. These are the controllers that manage the Integration Pods. They already take care of aggregating the Pods' statuses into the Deployment (resp. KnativeService or CronJob) status. I would like to make sure we don't reinvent the wheel of reconciling Pod statuses if the Deployment (or the other primitives) already provides an aggregated status.
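
For reference, a minimal sketch (not Camel K code) of the aggregated view a Deployment already exposes; the other primitives expose similar summaries.

```go
// Sketch only: the Deployment controller already rolls up its Pods' health
// into these status fields, so a reconciler can read them instead of listing Pods.
package sketch

import appsv1 "k8s.io/api/apps/v1"

func deploymentSummary(d *appsv1.Deployment) (desired, ready, unavailable int32) {
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	return desired, d.Status.ReadyReplicas, d.Status.UnavailableReplicas
}
```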

>   • If getting the Pods statuses is the only option, I don't think the Integration controller currently watches for Pods changes, so updates may be missed;
>
> I found that the running Pod's status is the way to inspect the reason why the Pod failed to start.
> The integration_controller.go doesn't watch Pods, but monitor_pod.go checks whether the Pod is properly running before setting the Integration's running status.

The action is called by the controller, so monitor_pod.go is called for events that are related to other resources being watched by integration_controller.go.

>   • I'd suggest calling the status update logic from the existing monitor action, as it's already responsible for reconciling the Integration status when it's running, and generally actions are bound to phases, not resources;
>
> At least looking at how build/monitor_pod works, my instinct was to have a similar dedicated and specific monitor_pod.go action. The CanHandle in monitor_pod.go also runs when the Integration is in the error state, so I think it should be its own action.

The main difference with build/monitor_pod is that the Build controller creates the Pod, while the Integration controller creates Deployment, KnativeService, or CronJob, and the monitor.go action is responsible for reconciling their statuses.

>   • Generally, the Error state is unrecoverable, while CrashLoopBackOff may recover (which explains why the Pod is still in the Running phase): should the CrashLoopBackOff container state and the PodFailed phase be handled differently?
>
> While testing I noticed the Pod can be either in CrashLoopBackOff or Error, that's why monitor_pod checks both.

I understand you made the decision to map both the CrashLoopBackOff and Error Pod statuses into the Integration Error phase. In the Pod state machine, CrashLoopBackOff is more of a container condition, and the Pod stays in the Running phase because retries occur. Should we make a closer mapping between Pods and Integration phases?

>   • Ideally, it'd be great to have an e2e test.
>
> Sure, the tests are coming.

Great!

@astefanutti added the area/core label (Core features of the integration platform) on May 18, 2021
@claudio4j (Contributor, Author)

@astefanutti I added an e2e test, so you can have a look while I work on the other issues you commented on. Thanks for reviewing.

@claudio4j (Contributor, Author)

@astefanutti I had to put this work on hold, but came back to it today. The Deployment status doesn't show the Pod status, that's why monitor_pod.go inspects the Pod status.
I added a specific monitor_pod.go action because its CanHandle function accepts running when the Integration phase is either running or error.

> Should we make a closer mapping between Pods and Integration phases?

Definitely, yes.

@astefanutti (Member)

> @astefanutti I had to put this work on hold, but came back to it today. The Deployment status doesn't show the Pod status, that's why monitor_pod.go inspects the Pod status.

What about the current ReplicaSet? Also, Conditions may provide some useful info; see the sketch below.
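
A rough sketch (not the actual Camel K code) of locating the Deployment's current ReplicaSet through the deployment.kubernetes.io/revision annotation, whose conditions (e.g. ReplicaFailure) can then be inspected; a controller-runtime client is assumed to be available.

```go
// Illustration only: find the ReplicaSet that corresponds to the Deployment's
// latest rollout, then its Status.Conditions can be examined.
package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func currentReplicaSet(ctx context.Context, c client.Client, d *appsv1.Deployment) (*appsv1.ReplicaSet, error) {
	list := &appsv1.ReplicaSetList{}
	if err := c.List(ctx, list,
		client.InNamespace(d.Namespace),
		client.MatchingLabels(d.Spec.Selector.MatchLabels)); err != nil {
		return nil, err
	}
	revision := d.Annotations["deployment.kubernetes.io/revision"]
	for i := range list.Items {
		// the ReplicaSet created for the latest rollout carries the same revision
		if list.Items[i].Annotations["deployment.kubernetes.io/revision"] == revision {
			return &list.Items[i], nil
		}
	}
	return nil, nil
}
```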

> I added a specific monitor_pod.go action because its CanHandle function accepts running when the Integration phase is either running or error.

There are already the monitor and error actions. I understand it may be simpler, from this specific issue's point of view, to create a new action, but it seems a bit of an anti-pattern to have multiple actions bound to the same phases. The general approach is to have a single action for a given resource phase.
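
For context, a hedged sketch of the action pattern described here; the interface below is a simplified approximation for illustration, not the exact Camel K definition.

```go
// Each action declares, via CanHandle, which Integration phase(s) it is bound
// to, and the controller invokes the single matching action per reconcile.
package sketch

import "context"

type IntegrationPhase string

const (
	PhaseRunning IntegrationPhase = "Running"
	PhaseError   IntegrationPhase = "Error"
)

type Integration struct {
	Phase IntegrationPhase
}

type Action interface {
	CanHandle(it *Integration) bool
	Handle(ctx context.Context, it *Integration) (*Integration, error)
}

// monitorAction is bound to the Running phase only; an errorAction would be
// bound to the Error phase, so the two phases stay a clean partition.
type monitorAction struct{}

func (a *monitorAction) CanHandle(it *Integration) bool {
	return it.Phase == PhaseRunning
}

func (a *monitorAction) Handle(ctx context.Context, it *Integration) (*Integration, error) {
	// reconcile the Integration status from its owned resources here
	return it, nil
}
```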

@claudio4j (Contributor, Author)

@astefanutti can you review?

@astefanutti (Member) left a comment

Thanks. The main difficulty that remains is aggregating the Pods' statuses into the Integration status / phase. I would suggest looking into relying on the Pods' controller status (Deployment, CronJob or KnativeService), as:

  • They control the Pods and are responsible for aggregating their statuses
  • It would avoid watching for Pods changes, which has scalability challenges, especially when the operator is deployed globally, meaning it would watch all Pod events for the entire cluster (see the sketch after this list).
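
As mentioned in the last point, a sketch of the watch setup this implies; the controller-runtime wiring is shown for illustration and the Integration import path is an assumption.

```go
// Illustration of the scalability point, not the actual Camel K wiring: the
// reconciler receives events for its primary resource and the resources it
// owns (the delegating controllers), whereas watching Pods directly would mean
// caching and handling events for every Pod visible to the operator.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	batchv1beta1 "k8s.io/api/batch/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	camelv1 "github.com/apache/camel-k/pkg/apis/camel/v1" // import path assumed
)

func setupIntegrationController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&camelv1.Integration{}).   // primary resource
		Owns(&appsv1.Deployment{}).    // delegating controllers that already
		Owns(&batchv1beta1.CronJob{}). // aggregate their Pods' statuses
		Complete(r)
	// a Knative Service watch would be registered similarly when Knative is available
}
```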

@@ -56,6 +57,27 @@ func (action *errorAction) Handle(ctx context.Context, integration *v1.Integrati
return integration, nil
}

// the integration in error state may have recovered and the running pod may be OK;
// at this point we need to check: if the pod is OK, set the integration to running
podList, _ := action.client.CoreV1().Pods(integration.Namespace).List(ctx, metav1.ListOptions{
@astefanutti (Member):

Generally, it's preferable to use the client from controller-runtime, as it relies on cached informers.
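
A sketch of this suggestion, assuming the action has access to a controller-runtime client; the label used to select the Integration's Pods is shown as an assumption for illustration.

```go
// Illustration only: list Pods through the controller-runtime client, which is
// backed by cached informers, instead of the typed clientset.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func listIntegrationPods(ctx context.Context, c client.Client, namespace, name string) (*corev1.PodList, error) {
	pods := &corev1.PodList{}
	err := c.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"camel.apache.org/integration": name}, // label assumed
	)
	return pods, err
}
```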

if len(podList.Items) > 0 {
// when the pod respins, the pod list may contain two pods (the terminating pod and the new starting one);
// we want the last one, as it is the newest
pod := podList.Items[len(podList.Items)-1]
@astefanutti (Member):

The Integration can have multiple replicas, when it's scaled manually, or automatically with Knative or HPA. Relying on the last Pod from the list seems wrong in that case.
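
A minimal sketch of aggregating over every Pod in the list instead of picking the last element (illustration only).

```go
// Sketch only: with several replicas, count how many Pods are Ready rather
// than relying on the last element of the list.
package sketch

import corev1 "k8s.io/api/core/v1"

func readyPodCount(pods []corev1.Pod) (ready, total int) {
	total = len(pods)
	for _, p := range pods {
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready++
				break
			}
		}
	}
	return ready, total
}
```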


// we look at the container status, because the pod may be in the running phase
// but the container may be in an error or running state
if running := pod.Status.ContainerStatuses[0].State.Running; running != nil {
@astefanutti (Member):

The Integration Pod(s) can have multiple containers, e.g., with Knative sidecars, or even when using the Pod trait. Relying on the first container seems wrong in that case.
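
A small sketch of selecting the container status by name instead of by index; the name "integration" is assumed here for illustration.

```go
// Sketch only: look up the container status by name, since sidecars may be present.
package sketch

import corev1 "k8s.io/api/core/v1"

func containerStatus(pod *corev1.Pod, name string) *corev1.ContainerStatus {
	for i := range pod.Status.ContainerStatuses {
		if pod.Status.ContainerStatuses[i].Name == name {
			return &pod.Status.ContainerStatuses[i]
		}
	}
	return nil
}

// usage (container name assumed): status := containerStatus(&pod, "integration")
```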

@@ -56,6 +57,27 @@ func (action *errorAction) Handle(ctx context.Context, integration *v1.Integrati
return integration, nil
}

// the integration in error state may have recovered and the running pod may be OK;
// at this point we need to check: if the pod is OK, set the integration to running
podList, _ := action.client.CoreV1().Pods(integration.Namespace).List(ctx, metav1.ListOptions{
@astefanutti (Member):

The reconcile loop won't always be called, as Pods are not being watched by the Integration reconciler.

@claudio4j (Contributor, Author)

> I would suggest looking into relying on the Pods' controller status

When Pods are in an error state, the Deployment object's status.conditions show the "Available" condition type as False. When the Pod is running OK, Available is "True". I think this is a good indicator to reconcile the Integration status; this way there is no need to inspect the Pod status anymore. WDYT?

@astefanutti (Member)

> I would suggest looking into relying on the Pods' controller status
>
> When Pods are in an error state, the Deployment object's status.conditions show the "Available" condition type as False. When the Pod is running OK, Available is "True". I think this is a good indicator to reconcile the Integration status; this way there is no need to inspect the Pod status anymore. WDYT?

I agree. It may also be needed to rely on the appsv1.DeploymentProgressing condition and check that its reason equals NewReplicaSetAvailable, in order to cover the case where the latest ReplicaSet failed and the Deployment controller rolled back to the previous ReplicaSet, which succeeds but does not correspond to the latest version of the Integration.

Once we're good with Deployment, we also have to handle the cases where the Integration is deployed as a KnativeService or a CronJob, using the specific conditions that these other resources expose.
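
For illustration, a sketch under the assumptions discussed above (not the merged code) of deriving a "running" verdict from the Deployment's Available and Progressing conditions instead of inspecting Pods.

```go
// Sketch only: combine the Available and Progressing conditions.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func deploymentCondition(d *appsv1.Deployment, t appsv1.DeploymentConditionType) *appsv1.DeploymentCondition {
	for i := range d.Status.Conditions {
		if d.Status.Conditions[i].Type == t {
			return &d.Status.Conditions[i]
		}
	}
	return nil
}

// integrationLooksRunning requires the Deployment to be Available and the
// latest ReplicaSet to be fully rolled out, so a rollback to a previous
// ReplicaSet is not mistaken for a healthy latest Integration version.
func integrationLooksRunning(d *appsv1.Deployment) bool {
	available := deploymentCondition(d, appsv1.DeploymentAvailable)
	progressing := deploymentCondition(d, appsv1.DeploymentProgressing)
	return available != nil && available.Status == corev1.ConditionTrue &&
		progressing != nil && progressing.Status == corev1.ConditionTrue &&
		progressing.Reason == "NewReplicaSetAvailable"
}
```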

@claudio4j (Contributor, Author)

@astefanutti I pushed the changes to check Deployments, can you review whether the general idea is valid?
I am going to work on the KnativeService and CronJob cases.

@astefanutti (Member) left a comment

@claudio4j thanks for the update. The approach is valid. To converge on the Deployment strategy, I'd suggest the following improvements:

  • Implement the error state, and take its complement to compute the running state, so that it's easier to reason about and ensures the Integration leaf states form a partition;
  • As for implementing that error state, I'd suggest mimicking the output of kubectl rollout status deployment, that is, the Progressing condition is false and its reason is ProgressDeadlineExceeded (see https://github.com/kubernetes/kubectl/blob/652881798563c00c1895ded6ced819030bfaa4d7/pkg/polymorphichelpers/rollout_status.go#L59-L92); a rough sketch follows this list;
  • To report the reason(s) of the error:
    • A condition seems more appropriate than using the failure field, which I think should be removed,
    • The ReplicaFailure condition should be checked,
    • If not present, the reason of the error condition should be a summary of the Integration pod(s) availability / readiness.
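
As referenced in the second point above, a rough sketch of what that rollout-status-style check might look like; the helper and message format are illustrative assumptions, not the merged implementation.

```go
// Sketch only, loosely mirroring kubectl's rollout status logic: report an
// error when the Progressing condition has reason ProgressDeadlineExceeded,
// surface ReplicaFailure if present, otherwise summarise replica availability.
package sketch

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func condition(d *appsv1.Deployment, t appsv1.DeploymentConditionType) *appsv1.DeploymentCondition {
	for i := range d.Status.Conditions {
		if d.Status.Conditions[i].Type == t {
			return &d.Status.Conditions[i]
		}
	}
	return nil
}

// integrationErrorReason returns a non-empty reason when the Integration
// should be considered in error, following the rollout-status heuristic.
func integrationErrorReason(d *appsv1.Deployment) string {
	if p := condition(d, appsv1.DeploymentProgressing); p != nil &&
		p.Status == corev1.ConditionFalse && p.Reason == "ProgressDeadlineExceeded" {
		if rf := condition(d, appsv1.DeploymentReplicaFailure); rf != nil && rf.Status == corev1.ConditionTrue {
			return rf.Message
		}
		return fmt.Sprintf("deployment %q exceeded its progress deadline: %d/%d replicas available",
			d.Name, d.Status.AvailableReplicas, d.Status.Replicas)
	}
	return ""
}
```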

@claudio4j marked this pull request as draft on June 23, 2021
Commit: fix(integration) Integration not marked as Failed when Camel is unable to start (apache#2291)

  • Added unversioned maven directories to .gitignore
@claudio4j (Contributor, Author)

@astefanutti can you have a look?

@astefanutti marked this pull request as ready for review on June 24, 2021
@astefanutti (Member)

@claudio4j thanks. Let's merge this and iterate further on the remaining points in subsequent PRs.
