
Readiness API for k8s #184503

Open
jloleysens opened this issue May 30, 2024 · 13 comments
Labels
Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@jloleysens
Contributor

Today we rely on /api/status to drive overall status of Kibana for all consumers, including orchestration layers -- specifically k8s readiness probes.

The /api/status endpoint is intended as a "current view of the world". This view of the world includes our current connection/connect-ability to Elasticsearch. When used as a k8s readiness probe, this means Kibana is removed from servicing requests, eagerly creating downtime before end users may have actually observed any issue.

It would be better if the API we use for k8s readiness probes focused on a simpler, narrower "readiness":

  1. Node.js is running (a given, otherwise no response)
  2. Have we reached the state where we can do work? E.g., are we listening for incoming requests? (maybe different for BT only role)
  3. Once we have reached "ready", we accept that we remain ready. This allows other metrics, like our HTTP response codes/latency, to reflect end user experience without prematurely removing Kibana from service.

In this way, downtime becomes a function only of whether a "ready" Node.js process is live.
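A minimal sketch of the narrow readiness endpoint described above, assuming a plain Node.js HTTP server (the endpoint path and helper names here are illustrative, not Kibana's actual API): once `markReady()` has been called, the probe returns 200 and never reverts to 503, matching point 3 above.

```typescript
import { createServer } from "node:http";

// Hypothetical sketch: "ready" flips to true once startup completes and
// never flips back, per the proposal above.
let ready = false;

export function markReady(): void {
  ready = true;
}

// Pure helper so the probe logic is easy to reason about:
// 503 before startup completes, 200 forever after.
export function readinessStatusCode(): number {
  return ready ? 200 : 503;
}

const server = createServer((req, res) => {
  if (req.url === "/api/readiness") {
    res.writeHead(readinessStatusCode()).end();
    return;
  }
  res.writeHead(404).end();
});

// Simulated startup: begin listening, then flip to ready.
server.listen(0, () => markReady());
server.unref(); // sketch only: don't keep the process alive for the server
```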

@jloleysens jloleysens added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label May 30, 2024
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@kobelb
Contributor

kobelb commented May 30, 2024

While I fully support us doing something more standard with regard to our readiness probes, I think we're going to have to overcome a bit of "inertia" in the way that things have been architected thus far before we can have these simple readiness probes, similar to all other web applications.

If we did all "startup" logic prior to Kibana's HTTP server starting to listen for traffic, this change would be great. However, we've historically told teams that they should defer some work outside of the plugin's setup/start lifecycle methods to allow the Kibana server to start up faster, because it doesn't make sense to block all HTTP requests while plugin-specific initialization logic is happening. As a result, if we were to switch to relying on these simple readiness HTTP APIs, there's the risk that this out-of-band initialization logic could fail, and we'd end up with Kibana pods that wouldn't be restarted, but with major features being inoperative. Whereas today, these Kibana pods would be restarted, and the out-of-band initialization logic would be retried.

Prior to us implementing this, I think we're going to have to either move all initialization logic to the setup/start lifecycle methods to prevent the HTTP server from responding to HTTP requests until it's complete, or figure out a way for plugins to trigger a process.exit if their initialization logic has failed for an adequately long duration.
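The second option above (a process.exit after prolonged initialization failure) could be sketched as a watchdog around the deferred init work. All names here are hypothetical; the `onFatal` parameter exists only to make the sketch testable in place of a real `process.exit`:

```typescript
// Hypothetical sketch: give out-of-band plugin initialization a deadline,
// and kill the process on failure or timeout so the orchestrator restarts
// the pod. Names are illustrative, not Kibana's actual API.
export async function runDeferredInit(
  pluginId: string,
  init: () => Promise<void>,
  deadlineMs: number,
  onFatal: (code: number) => void = (code) => process.exit(code)
): Promise<void> {
  // Watchdog: if init never settles within the deadline, bail out.
  const timer = setTimeout(() => {
    console.error(`[${pluginId}] deferred init exceeded ${deadlineMs}ms; exiting`);
    onFatal(1);
  }, deadlineMs);
  try {
    await init();
  } catch (err) {
    console.error(`[${pluginId}] deferred init failed:`, err);
    onFatal(1);
  } finally {
    clearTimeout(timer);
  }
}
```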

@jloleysens
Contributor Author

jloleysens commented Jun 5, 2024

If we did all "startup" logic prior to Kibana's HTTP server starting to listen for traffic, this change would be great...

I believe our current approach suffers from this same problem, so switching to a simpler API would not worsen the situation in that regard. However, I totally agree that it is worth addressing. Especially your point about "out of band failures". In those cases it'd be best, IMO, if Kibana were to just crash, as that sort of thing seems pretty catastrophic (per your point about process.exit).

In our convo you brought up that usually the same endpoint is used for liveness, readiness and startup probes. Those are also things to think about when designing this API as we may want to have a way to "proactively" restart a stuck Kibana with a liveness probe. Similar for startup probes - they will also restart the Kibana container if we exceed "worst case" startup times.

  1. Once we have reached "ready", we accept that we remain ready.

In that case we may need to reconsider this point. However I'd argue if we have a purpose built, internal endpoint for probes we can evolve and improve it over time so starting with this behaviour given our current knowledge would be a good improvement IMO.

Introducing liveness and startup probes should be done with some care as I'm not sure we know when Kibana is "stuck" today... Mostly the process will crash or fail to start. Therefore I'm hesitant to employ those mechanisms just yet. But I may be wrong.
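The three probe types discussed above differ mainly in what Kubernetes does on failure. As a rough illustration (the endpoint path, port, and timings here are assumptions, not Kibana's actual configuration), a pod spec could wire one purpose-built endpoint into all three probes:

```yaml
# Illustrative only: path, port, and thresholds are assumptions.
containers:
  - name: kibana
    readinessProbe:    # gates traffic: pod is removed from the Service on failure
      httpGet: { path: /api/readiness, port: 5601 }
      periodSeconds: 5
      failureThreshold: 3
    livenessProbe:     # restarts the container when it is "stuck"
      httpGet: { path: /api/readiness, port: 5601 }
      periodSeconds: 10
      failureThreshold: 6
    startupProbe:      # restarts the container if "worst case" startup time is exceeded
      httpGet: { path: /api/readiness, port: 5601 }
      periodSeconds: 10
      failureThreshold: 30   # 30 checks at 10s intervals = 5 min startup budget
```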

@afharo
Member

afharo commented Jun 6, 2024

TBH, I don't fully understand why GET /api/status is not a valid ready API.

If I understand correctly, we want to differentiate between "The Node.js instance is up" and "The connection to ES is down". Can you elaborate on what the benefit is of that when used in a readiness probe?

Let's say we've got a scenario with 1 load balancer with 2 instances of Kibana behind. One of them loses connection to ES.

  • If the load balancer uses today's GET /api/status as a health API, the disconnected Kibana will return 503, and the load balancer will not forward the traffic to that instance. All the traffic goes to the instance of Kibana that's still connected to ES.
  • If the load balancer uses a new GET /is-node.js-running API, the disconnected Kibana will reply 200 OK, and the load balancer will forward requests to that instance. And they will fail with 500 Internal Error because most actions in Kibana heavily depend on ES.

IMO, the first option is preferable because it doesn't have any impact on the end-user's UX.

@jloleysens
Contributor Author

jloleysens commented Jun 10, 2024

The primary intention is to avoid eagerly reporting downtime for Kibana to k8s. In this way we avoid eagerly removing Kibana from servicing requests. There is also likely a timing penalty before Kibana is put back into service once ES is available again.

Rather it seems highly likely there will be times when ES (or some other component) is down but there are no users currently using Kibana. IMO it is better to allow these issues to surface in other ways: failed requests or BT issues. So we'd likely hear about it anyway. I'm not sure I see the value in removing Kibana from service during that time (end UX is negatively affected either way).

Let's say we've got a scenario with 1 load balancer with 2 instances of Kibana behind. One of them loses connection to ES...

Losing connection to another backing service sounds like a serious issue, and I'm not sure we should be optimising for it in our readiness checks. Rather, these should be focused on Kibana itself. But I think you're probing at something important: what is "ready" in Kibana post startup phase if we don't couple this to ES?

@afharo
Member

afharo commented Jun 22, 2024

what is "ready" in Kibana post startup phase if we don't couple this to ES?

++ We should start with this question. IMO, the definition of "ready" is different for each node role:

| Node role | Readiness question | Key readiness drivers |
| --- | --- | --- |
| `migrator` | I don't think we run any readiness checks here since it runs as a job. If we had to check its readiness, we'd ask: is it capable of running the migrations? | ES connection is a big limitation here. However, there may be others, like plugins failing to register their migrations (or not having registered them yet). |
| `ui` | Is it capable of serving HTTP traffic? | The main driver is whether Kibana is listening on its HTTP port. However, authenticating every request relies on license + ES auth. So, effectively, we are not capable of serving HTTP requests if the ES connection is not working, because the auth middleware will fail with 503. |
| `background-tasks` | Is it capable of running background tasks? | The main indicator is the Task Manager's status. It is calculated by merging core (ES + SO) + TM metrics. On top of that, TM halts all task polling when ES or SO are down. |

To be clear: I agree that a network failure/ES restart shouldn't blame Kibana, and for that reason, our current GET /api/status may seem to track the wrong knobs. However, when you look closer, a stable ES connection and a healthy SO service are key to every Kibana feature (that's why every plugin's derived status takes into account Core's status). For this reason, I don't think that investing time in creating a new Is Ready? API would solve what we're trying to accomplish here (at least, not the way things work today).

All this said, at the moment, when ES is available but Task Manager is struggling and reporting unavailable, our GET /api/status returns 200 OK (the API only returns 503 when any of the core services are down; it doesn't care about plugins). I think it's worth considering an improvement to our GET /api/status API that promotes some plugins to the status-code evaluation, so that BT nodes can return 503 when TM, Alerting, Connectors(?) are in bad shape. It could be a query parameter that allows the requestor to specify the services they care about the most: a customer may only care about Fleet or Reporting being available for their use case, and they could provide those in a query parameter (GET /api/status?q=fleet,reporting) or as part of the path (GET /api/status/fleet,reporting).

WDYT?
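The query-parameter idea above could be sketched as a small pure function mapping the requested plugin list to an HTTP status code. This is a hypothetical illustration; the names and types don't match Kibana's real status service, and in this sketch unknown plugin ids are simply ignored:

```typescript
type StatusLevel = "available" | "degraded" | "unavailable";

// Hypothetical sketch of the proposal above: let the caller pick which
// plugin statuses should influence the HTTP status code.
export function statusCodeFor(
  pluginStatuses: Record<string, StatusLevel>,
  q?: string
): number {
  // No filter: today's behaviour, where only core services decide
  // (assumed healthy here, so 200).
  if (!q) return 200;
  const selected = q.split(",").map((s) => s.trim()).filter(Boolean);
  // 503 only if a service the caller cares about is unavailable.
  const anyDown = selected.some((id) => pluginStatuses[id] === "unavailable");
  return anyDown ? 503 : 200;
}
```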


The primary intention is to avoid eagerly reporting downtime for Kibana to k8s. In this way we avoid eagerly removing Kibana from servicing requests. There is also likely a timing penalty that will occur until ES is available and Kibana is servicing requests again.

Our current readiness probe waits for multiple failures in a row before flagging a node as unavailable, while it recovers as soon as the first successful check happens (probe interval is 5s). This means that our "availability alerts" should only trigger if the status is 503 for an extended period of time (6 failed attempts at 5s intervals = 30s). IMO, we shouldn't consider it "noise" and, instead, use these to raise awareness to infra and/or ES folks.

As you point out, it only affects the HTTP load balancer. AFAIK, BT nodes are not affected if flagged as unavailable, since they don't serve requests (they'd only impact our alerts), right?

Rather it seems highly likely there will be times when ES (or some other component) is down but there are no users currently using Kibana. IMO it is better to allow these issues to surface in other ways: failed requests or BT issues.

While I agree that an on-call person focusing on harmless alerts due to those projects being idle may seem like time better spent on something else, I tend to differ on the conclusion of replacing this with other "actual impact" alerts: I'd rather proactively work towards having an issue solved before the customer notices vs. receiving a critical "the pain is now" alert. IMO, it all comes down to providing the on-call operator with the appropriate tools to prioritize the more critical alerts over the status availability.

👆 Or is it really a lower priority?

  • For ui nodes, if there is more than one node, actual HTTP traffic shouldn't be affected. If HTTP traffic is affected, we already have higher-priority alerts for that. => We can flag the status-based alerts as lower priority ✅ (and I'd even reduce our tolerance to flag a node as unavailable sooner to avoid forwarding any requests to that node).
  • For background-tasks, not having a connection to ES means the TM metrics can't be collected either. Do we have any way to alert us if no tasks are being executed (remember, task polling stops when ES is down), and we don't know how many are queued (since we can't connect to ES, we can't query how many pending actions there are)? ❓

@pgayvallet
Contributor

pgayvallet commented Jun 24, 2024

So initially, this discussion was opened to discuss implementing different readiness and liveness checks for serverless Kibana.

I'm not sure this is really what this thread has been discussing until now, so just to make sure we're on the same page, I'll use my googling powers to reveal the truth.

What is the difference between liveness and readiness?

Liveness Probe: If the liveness probe fails, Kubernetes considers the container to be in a failed state and restarts it. This is useful when an application gets into an unrecoverable state or becomes unresponsive. Readiness Probe: It determines whether the application inside the container is ready to accept traffic or requests.

So, just to make sure, are we discussing introducing distinct liveness and readiness checks here, or are we instead discussing changing the behavior of our single mixed check?

@jloleysens
Contributor Author

So, just to make sure, are we discussing here to introduce distinct liveness and readiness checks

@pgayvallet The intention is to talk about how we would design a readiness probe API for Kibana on Kubernetes.

Liveness provides different behaviour and I'm not convinced we need it yet. It's possible we can use a readiness probe's endpoint as a liveness probe's endpoint but that's a different discussion imo.

As I understand it, @afharo is arguing the current /api/status endpoint works as the readiness probe precisely because it is tied to ES status. I'd argue we should more narrowly define Kibana's readiness as something like "I've started and am ready to handle requests" rather than including ES status.

@jloleysens
Contributor Author

jloleysens commented Jun 26, 2024

By more narrow I mean: we can consider an initial connection to ES as needed for startup, but it's not clear this should be maintained throughout Kibana's lifetime as part of its own readiness status.

@afharo
Member

afharo commented Jun 26, 2024

"I've started and am ready to handle requests"

Is there any additional validation we need? Or all we need is to check if Kibana is capable of serving HTTP traffic?

IIRC, ESS used to check Kibana's readiness via GET /. After noticing false positives (the, now cloud-disabled, interactive interface setup plugin served that during the preboot phase), they switched to GET /api/status.

Of course, we can register a simple API that returns a static 200 OK. If the http service is up, it works; otherwise, it just fails.

However, that only means that Kibana is capable of serving basic HTTP traffic. I wouldn't use it in my infra to confirm that it's capable of handling actual requests. As mentioned in the table of my previous comment: no ES => no auth => no request handling (at least for 90%+ of the routes).

Maybe we need to register the http service's status to our status service? I don't think it'd make any difference when unavailable, but it'd be interesting when reporting degraded state (overloaded/applying circuit breakers).

@rudolf
Contributor

rudolf commented Jul 17, 2024

In a recent discussion we agreed that something we lack today is the ability to distinguish between "Kibana is the problem" or "Elasticsearch is the problem" when we're unavailable.

We ingest the /api/status endpoint into our metrics indices, but these documents don't have anything useful in 8.x :( I also need to check the behaviour when our status is critical: when /api/status returns 503, will we still ingest the results?

We have other signals, like proxy logs for failed requests to the nodes.info API, or perhaps when the metrics show 0 active and 0 idle ES sockets. But it feels like using the actual Elasticsearch service status would be more accurate and potentially less noisy.

@sophiec20
Contributor

I like the conceptual differentiation between liveness and readiness.
And I'm not sure I see the benefit in re-spinning Kibana if Elasticsearch is unable to service client requests reliably. It feels to me that a better UX would be to have a "Loading..." or "Waiting for data..." message... similar to "Maintenance mode" (which we might in fact need anyway).

@afharo
Member

afharo commented Jul 24, 2024

It feels to me that a better UX would be to have a "Loading..." or "Waiting for data..." message... similar to "Maintenance mode" (which we might in fact need anyway).

I like that. My question is: do we need to implement this in Kibana or can it be a Custom 404 page at the proxy level?

At the moment, requesting a non-available Kibana returns an ugly JSON with the message "Resource not available". This JSON is returned by the proxy. I wonder if it could return a better-designed UI with a more readable error. It could keep the JSON response for API requests (when the Accept header is application/json). This way, we provide a nice UX both when Kibana is up but having issues and when Kibana is down.
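The Accept-header negotiation described above could be sketched roughly as follows. This is a hypothetical illustration of the proxy-level idea, not an actual proxy configuration; the function name and HTML body are made up:

```typescript
// Hypothetical sketch: pick an error representation based on the request's
// Accept header, so API clients keep JSON while browsers get a friendly page.
export function errorBodyFor(acceptHeader: string | undefined): {
  contentType: string;
  body: string;
} {
  const wantsJson = acceptHeader?.includes("application/json") ?? false;
  if (wantsJson) {
    return {
      contentType: "application/json",
      body: JSON.stringify({ message: "Resource not available" }),
    };
  }
  // Default to a human-readable maintenance page for browsers.
  return {
    contentType: "text/html",
    body: "<html><body><h1>Kibana is temporarily unavailable</h1><p>Please try again shortly.</p></body></html>",
  };
}
```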

WDYT?


7 participants