Readiness API for k8s #184503
Pinging @elastic/kibana-core (Team:Core)
While I fully support us doing something more standard with regard to our readiness probes, I think we're going to have to overcome a bit of "inertia" in the way that things have been architected thus far before we can have these simple readiness probes, similar to all other web applications. If we did all "startup" logic prior to Kibana's HTTP server starting to listen for traffic, this change would be great.

However, we've historically told teams that they should defer some work outside of the plugin's setup/start lifecycle methods to allow the Kibana server to start up faster, because it doesn't make sense to block all HTTP requests while plugin-specific initialization logic is happening. As a result, if we were to switch to relying on these simple readiness HTTP APIs, there's the risk that this out-of-band initialization logic could fail, and we'd end up with Kibana pods that wouldn't be restarted, but with major features being inoperative. Whereas today, these Kibana pods would be restarted, and the out-of-band initialization logic would be retried.

Prior to us implementing this, I think we're going to have to either move all initialization logic to the setup/start lifecycle methods to prevent the HTTP server from responding to HTTP requests until it's complete, or figure out a way for plugins to trigger a …
I believe our current approach suffers from this same problem, so switching to a simpler API would not worsen the situation in that regard. However, I totally agree that it is worth addressing. Especially your point about "out of band failures": in those cases it'd be best, IMO, if Kibana were to just crash, as that sort of thing seems pretty catastrophic (per your point about …).

In our convo you brought up that usually the same endpoint is used for liveness, readiness and startup probes. Those are also things to think about when designing this API, as we may want a way to "proactively" restart a stuck Kibana with a liveness probe. Similarly for startup probes: they will also restart the Kibana container if we exceed "worst case" startup times.
In that case we may need to reconsider this point. However, I'd argue that if we have a purpose-built, internal endpoint for probes, we can evolve and improve it over time, so starting with this behaviour given our current knowledge would be a good improvement IMO.

Introducing liveness and startup probes should be done with some care, as I'm not sure we know when Kibana is "stuck" today... mostly the process will crash or fail to start. Therefore I'm hesitant to employ those mechanisms just yet. But I may be wrong.
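For reference, the three probe types discussed here have distinct semantics in Kubernetes. A sketch of how they differ (the paths, port, and thresholds below are illustrative assumptions, not Kibana's actual deployment configuration):

```yaml
# Illustrative sketch only — endpoint paths and numbers are assumptions.
containers:
  - name: kibana
    readinessProbe:        # controls traffic: failing => removed from Service endpoints
      httpGet:
        path: /api/status  # today's mixed check; a narrower endpoint is what's being discussed
        port: 5601
      periodSeconds: 5
      failureThreshold: 6  # 6 x 5s = 30s of failures before the pod is marked unready
    livenessProbe:         # controls restarts: failing => container is killed and restarted
      httpGet:
        path: /api/status
        port: 5601
      periodSeconds: 10
      failureThreshold: 3
    startupProbe:          # gates the other probes until startup succeeds once;
      httpGet:             # exceeding its total budget also restarts the container
        path: /api/status
        port: 5601
      periodSeconds: 10
      failureThreshold: 30 # worst-case startup budget: 30 x 10s = 300s
```

This is why readiness and liveness deserve separate consideration: a failing readiness probe only drains traffic, while a failing liveness or startup probe restarts the container.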
TBH, I don't fully understand why. If I understand correctly, we want to differentiate between "The Node.js instance is up" and "The connection to ES is down". Can you elaborate on what the benefit of that is when used in a readiness probe? Let's say we've got a scenario with 1 load balancer and 2 instances of Kibana behind it. One of them loses connection to ES.
IMO, the first option is preferable because it doesn't have any impact on the end-user's UX.
The primary intention is to avoid eagerly reporting downtime for Kibana to k8s. In this way we avoid eagerly removing Kibana from servicing requests. There is also likely a timing penalty that will occur until ES is available again and Kibana is back to servicing requests. It seems highly likely there will be times when ES (or some other component) is down but there are no users currently using Kibana. IMO, it is better to allow these issues to surface in other ways — failed requests or BT issues — so we'd likely hear about it anyway. I'm not sure I see the value in removing Kibana from service during that time (end UX is negatively affected either way).
Losing connection to another backing service sounds like a serious issue, and I'm not sure we should be optimising for it in our readiness checks. Rather, these should be focused on Kibana. But I think you're probing at something important: what is "ready" in Kibana, post-startup, if we don't couple this to ES?
++ We should start with this question. IMO, the definition of "ready" is different for each node role:
To be clear: I agree that a network failure/ES restart shouldn't blame Kibana, and for that reason, our current …

All this said, at the moment, when ES is available but Task Manager is struggling and reporting …

WDYT?
Our current readiness probe waits for multiple failures in a row before flagging a node as unavailable, while it recovers as soon as the first successful check happens (the probe interval is 5s). This means that our "availability alerts" should only trigger if the status is 503 for an extended period of time (6 failed attempts at 5s intervals = 30s). IMO, we shouldn't consider these "noise" and should, instead, use them to raise awareness among infra and/or ES folks. As you point out, it only affects the HTTP load balancer. AFAIK, BT nodes are not affected if flagged as unavailable, since they don't serve requests (they'd only impact our alerts), right?
While I agree that an on-call person focusing on harmless alerts due to those projects being idle may seem like time better spent on something else, I tend to differ on the conclusion of replacing this with other "actual impact" alerts: I'd rather proactively work towards having an issue solved before the customer notices vs. receiving a critical "the pain is now" alert. IMO, it all comes down to providing the on-call operator with the appropriate tools to prioritize the more critical alerts over status availability. 👆 Or is it really a lower priority?
So initially, we opened this discussion to talk about implementing different readiness and liveness checks for serverless Kibana. I'm not sure this is really what this thread has been discussing until now, so just to make sure we're on the same page, I'll use my googling powers to reveal the truth.
So, just to make sure: are we discussing introducing distinct liveness and readiness checks here, or are we instead discussing changing the behavior of our single mixed check?
@pgayvallet The intention is to talk about how we would design a readiness probe API for Kibana on Kubernetes. Liveness provides different behaviour, and I'm not convinced we need it yet. It's possible we can use a readiness probe's endpoint as a liveness probe's endpoint, but that's a different discussion imo. As I understand it, @afharo is arguing the current /api/status endpoint works as the readiness probe precisely because it is tied to ES status. I'd argue we should more narrowly define Kibana's readiness as something like "I've started and am ready to handle requests" rather than including ES status.
By more narrow I mean: we can consider an initial connection to ES as needed for startup, but it's not clear this should be maintained throughout the lifetime of Kibana's own readiness status.
Is there any additional validation we need? Or is all we need to check whether Kibana is capable of serving HTTP traffic? IIRC, ESS used to check Kibana's readiness via ….

Of course, we can register a simple API that returns a static …. However, that only means that Kibana is capable of serving basic HTTP traffic. I wouldn't use it in my infra to confirm that it's capable of handling actual requests. As mentioned in the table of my previous comment: no ES => no auth => no request handling (at least for 90%+ of the routes).

Maybe we need to register the http service's status with our status service? I don't think it'd make any difference when unavailable, but it'd be interesting when reporting a degraded state (overloaded/applying circuit breakers).
In a recent discussion we agreed that something we lack today is the ability to distinguish between "Kibana is the problem" and "Elasticsearch is the problem" when we're unavailable. We ingest the ….

We have other signals, like proxy logs for failed requests to the …
I like the conceptual differentiation between liveness and readiness.
I like that. My question is: do we need to implement this in Kibana, or can it be a custom 404 page at the proxy level? At the moment, requesting a non-available Kibana returns an ugly JSON with the message "Resource not available". This JSON is returned by the proxy. I wonder if it could return a better-designed UI with a more readable error. It could keep the JSON response for API requests (when the Accept header is application/json). This way we provide a nice UX both when Kibana is up but having issues and when Kibana is down. WDYT?
Today we rely on /api/status to drive the overall status of Kibana for all consumers, including orchestration layers — specifically k8s readiness probes.

The /api/status endpoint is intended as a "current view of the world". This view of the world includes our current connection/connect-ability to Elasticsearch. For k8s readiness this means Kibana will be removed from servicing requests, and thus downtime is eagerly created before any issues may have actually been observed by end users.

It would be better if the API we use for k8s readiness probes focused on a simpler, narrower "readiness":

In this way downtime will be a function of only a "ready" Node.js process being live.