feat: Report runtime health checks into Integration readiness condition #2719

Merged
astefanutti merged 5 commits into apache:main from pr-321 on Nov 17, 2021
Conversation

@astefanutti (Member) commented Oct 26, 2021

This PR probes the readiness checks exposed by the Camel runtime (Camel Quarkus / MicroProfile Health / SmallRye Health) to reconcile the Integration phase and readiness condition. It aims at surfacing the response from the Camel readiness checks and exposing useful information, like error messages, so that the Integration status can serve as a single interface for higher-level controllers.

As details from the readiness probe responses are not accessible from the Pod(s) status, the readiness probes are called directly by the operator (via the API server proxy) each time the Integration Pod(s) readiness condition changes.

This follows up on #2682, so that readiness / error status reconciliation now covers the entire lifecycle of an Integration.
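
For illustration, the runtime readiness endpoint returns a MicroProfile Health JSON document. Below is a minimal Go sketch of how such a payload could be decoded and the DOWN checks condensed into a message for the readiness condition; the type and function names are illustrative, not the actual operator code:

```go
package health

import (
	"encoding/json"
	"fmt"
	"strings"
)

// HealthResponse mirrors the MicroProfile Health JSON payload, e.g.:
//   {"status":"DOWN","checks":[{"name":"camel-routes","status":"DOWN","data":{"route.id":"route1"}}]}
type HealthResponse struct {
	Status string  `json:"status"`
	Checks []Check `json:"checks"`
}

// Check is a single health check entry reported by SmallRye Health.
type Check struct {
	Name   string                 `json:"name"`
	Status string                 `json:"status"`
	Data   map[string]interface{} `json:"data,omitempty"`
}

// ReadinessMessage condenses the DOWN checks into a single message that can
// be surfaced in the Integration readiness condition.
func ReadinessMessage(body []byte) (string, error) {
	var health HealthResponse
	if err := json.Unmarshal(body, &health); err != nil {
		return "", err
	}
	if health.Status == "UP" {
		return "", nil
	}
	failures := make([]string, 0, len(health.Checks))
	for _, check := range health.Checks {
		if check.Status == "DOWN" {
			failures = append(failures, fmt.Sprintf("%s: %v", check.Name, check.Data))
		}
	}
	return strings.Join(failures, "; "), nil
}
```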

TODO:

Release Note

feat: Report runtime health checks into Integration readiness condition

@astefanutti added the kind/feature (New feature or request) and status/wip (Work in progress) labels on Oct 26, 2021
@astefanutti (Member, author) commented Oct 26, 2021

While this PR proposes to use the Camel readiness checks as a channel / interface to peek into the runtime status, I've identified the following open points / gaps:

@lburgazzoli @nicolaferraro @jamesnetherton @davsclaus would you be so kind as to share your expertise / opinion on this?

@jamesnetherton (Contributor)

> The error details are not propagated during the conversion from the Camel health check result to the MP Health response

Yeah, it'd be nice to implement that. Should be a simple enough enhancement.

@lburgazzoli (Contributor)

> * `camel-health`: It seems the default `RouteHealthCheck` generally associates the check status to the route status, that is `UP` (resp. `DOWN`) when the route is started (resp. stopped). So, by default, with a consumer like `from("telegram?authorizationToken=WRONG_TOKEN")`, the route starts, so health is `UP`, even though the connection keeps failing.

This is a long-standing issue and I think we need to start adding health checks at the component level.

> Also the error details do not seem to be propagated

This needs to be fixed too

Mind opening some issues on Camel?

@davsclaus (Contributor)

Yes, whether a route can start up vs. whether the consumer is connected is component specific.

Some components have built-in recovery, so they start up the route and then automatically self-heal / fail over, etc., such as JMS, SQL, etc.

And some components do not, and fail to start the route in that case.

As Luca says, very likely the best solution is to add component-level health checks, so we can add the logic needed per component (we can have a default readiness that is based on whether the consumer is started).

@davsclaus (Contributor)

There are some tickets already for component-level health checks:
https://issues.apache.org/jira/browse/CAMEL-16975
https://issues.apache.org/jira/browse/CAMEL-16976
https://issues.apache.org/jira/browse/CAMEL-15133

And also some way of checking the error raised in the consumer, whether it's a connectivity error or a business error:
https://issues.apache.org/jira/browse/CAMEL-16977

@davsclaus (Contributor) commented Oct 27, 2021

@astefanutti for the 1st bullet we need a JIRA ticket about this. Then maybe @jamesnetherton can take a look; it seems like we can copy the message/error over to MP Health.

There are also some other details, such as:

        builder.detail("route.id", route.getId());
        builder.detail("route.status", status.name());
        builder.detail("route.context.name", context.getName());

And then some general information for counters, e.g. the number of checks and failures in a row, etc. See the base class source code.

@astefanutti (Member, author)

@davsclaus, thanks a lot, these tickets capture exactly what would be needed 👍🏼.

I've created CAMEL-17138 to track the propagation of the Camel health check result details into the MP Health responses.

@nicolaferraro (Member)

I'm thinking of a multi-tenant cluster with strict network policies that disallow cross-namespace connections, and a Camel K operator deployed globally. It seems from the code that this would result in a kind of "health unavailable" (if the connection error is caught).

Maybe tunneling the request, e.g. via the apiserver proxy, could make it work in any configuration. Wdyt @astefanutti?

@astefanutti (Member, author) commented Oct 27, 2021

> Maybe tunneling the request, e.g. via the apiserver proxy, could make it work in any configuration. Wdyt @astefanutti?

Ah right, that's a very good point. I took the shortest path and mimicked what the kubelet does, but the operator Pod is indeed subject to network policies. Let me rework it based on the API server proxy. Ultimately, it would be possible to have the reason reported into the Pod readiness condition directly, similar to the termination message, but in the interim, relying on the API server proxy should do it.

@davsclaus (Contributor)

Okay, I have a prototype for a Camel health check in camel-telegram that reports it as DOWN or UP.

You can set the threshold in the standard way today, either globally with a * or by route id (pattern):

camel.health.config[*].failure-threshold = 10

or via route id:

camel.health.config[myRoute].failure-threshold = 10

There is no threshold by default, so we may consider something special for this, or let camel-k auto-assign a default value or something.

@astefanutti (Member, author)

> Okay, I have a prototype for a Camel health check in camel-telegram that reports it as DOWN or UP.

Wow, that was fast! There is also @Croway, who has a PoC for propagating the error details in camel-microprofile-health.

> You can set the threshold in the standard way today, either globally with a * or by route id (pattern):
>
> camel.health.config[*].failure-threshold = 10
>
> or via route id:
>
> camel.health.config[myRoute].failure-threshold = 10
>
> There is no threshold by default, so we may consider something special for this, or let camel-k auto-assign a default value or something.

Yes, I think we could have a health trait that would give users the ability to configure these, and auto-assign sensible defaults there. I think encapsulating the health configuration into a dedicated trait would also help disentangle the container trait.

It could also be useful to have a success-threshold parameter, akin to the Kubernetes health probes.

@davsclaus (Contributor) commented Oct 28, 2021

Good idea about a trait for health checks. That would make configuring this easier for end users.

The success threshold is a nice touch, as today a route/consumer is UP again after a single successful attempt. We can add a threshold similar to the one we have for failures.
https://issues.apache.org/jira/browse/CAMEL-17143

@astefanutti (Member, author)

I've updated the logic to call the health probe via the API server proxy.
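
As a reference for the approach, a Pod HTTP endpoint can be reached through the API server pods/proxy subresource with client-go roughly as follows. This is a minimal sketch, assuming a pre-built clientset and the default Quarkus readiness path q/health/ready; the function name and signature are illustrative, not the actual operator code:

```go
package health

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
)

// probeReadiness calls the runtime readiness endpoint of a Pod through the
// API server proxy, so the request works even when network policies prevent
// the operator from reaching the Pod network directly.
func probeReadiness(ctx context.Context, client kubernetes.Interface, namespace, pod string, port int) ([]byte, error) {
	// GET /api/v1/namespaces/{namespace}/pods/{pod}:{port}/proxy/q/health/ready
	body, err := client.CoreV1().RESTClient().
		Get().
		Namespace(namespace).
		Resource("pods").
		Name(fmt.Sprintf("%s:%d", pod, port)).
		SubResource("proxy").
		Suffix("q/health/ready").
		DoRaw(ctx)
	// SmallRye Health typically replies 503 when a check is DOWN, in which
	// case err is non-nil but body may still carry the JSON payload.
	return body, err
}
```

Note that going through the proxy subresource means the request is authenticated by the API server, so the operator needs get permission on the pods/proxy subresource.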

@davsclaus (Contributor)

Another ticket, to allow controlling Camel to automatically stop unhealthy routes:
https://issues.apache.org/jira/browse/CAMEL-17148

@astefanutti (Member, author)

The new health trait and the fix for #1610 are addressed via #2740.

@astefanutti changed the title from "feat: Integrate Camel readiness checks into Integration readiness condition" to "feat: Report runtime health checks into Integration readiness condition" on Nov 15, 2021
@astefanutti removed the status/wip (Work in progress) label on Nov 15, 2021
@astefanutti (Member, author)

I think it's ready. We'll be able to iterate as soon as we upgrade to Camel 3.13+, to leverage the new features developed by @davsclaus, add more test cases, and fix bugs 😇.

@astefanutti merged commit 87b3a94 into apache:main on Nov 17, 2021
@astefanutti deleted the pr-321 branch on November 17, 2021 at 09:32