Conversation
pulsar-broker/src/main/java/org/apache/pulsar/PulsarHealthcheck.java
There's a compilation error:
We should consider backwards compatibility here: when the health check is sent to an old broker, there is no heartbeat namespace.
Healthcheck should only run against the local broker, so the version of the broker should be the same version as the command.
But we need to handle the case where people roll back the broker without tearing down the health check, no? From a code-writing perspective, we have to keep these things in mind, because we can't assume that people running this code will follow exactly the scenario you are thinking of.
The command to do the healthcheck lives within the same code tree as the broker binary itself; when it gets called, it uses the same jars.
If someone rolls back, they will have to disable their livenessProbe anyhow, as it will give them a 127 exit code, because bin/pulsar healthcheck will not exist.
Now, if we're talking about someone running a healthcheck remotely against the broker, that's a different story. For that we should have an HTTP endpoint which runs a similar liveness check.
Why not just use a Reader? Using subscribe and unsubscribe can potentially cause zombie subscriptions.
Because I want to subscribe to ensure that that path is working (that the managed ledger can create cursors).
Hmm, a reader also creates a cursor, and it exercises the same publish and consume code path, so I'm not sure it is worth subscribing/unsubscribing.
But if you insist on subscribing and unsubscribing, we should have logic to handle these zombie subscriptions; otherwise these health checks will potentially cause operational pain.
Readers create non-durable cursors, which don't write anything to ZK. If we don't check that we have a working connection to ZK, we may report an unhealthy broker as healthy.
I can change this to use the same subscription name each time. That way, if unsubscribe fails, there'll only ever be one leftover subscription per broker, and the validation that we can get the message back should still work.
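For reference, a minimal sketch of the fixed-subscription-name approach described above, assuming an illustrative topic, subscription name and a 10-second receive timeout; this is not the PR's actual code:

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.TimeUnit;

public class SubscriptionHealthCheckSketch {
    public static void check(PulsarClient client, String topic) throws Exception {
        String subName = "healthcheck"; // constant name, so a failed unsubscribe leaves at most one leftover sub
        try (Producer<String> producer = client.newProducer(Schema.STRING)
                     .topic(topic).create();
             Consumer<String> consumer = client.newConsumer(Schema.STRING)
                     .topic(topic).subscriptionName(subName).subscribe()) {
            String payload = "ping-" + System.nanoTime();
            producer.send(payload);
            // if a leftover subscription has a backlog, the first message may not be ours
            // (see the loop-on-receive discussion further down)
            Message<String> msg = consumer.receive(10, TimeUnit.SECONDS);
            if (msg == null || !payload.equals(msg.getValue())) {
                throw new IllegalStateException("Healthcheck message not received");
            }
            consumer.acknowledge(msg);
            consumer.unsubscribe(); // durable subscription cleanup; this is the part that touches ZK
        }
    }
}
```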
"If we don't check that we have a working connection to ZK, we may report an unhealthy broker as healthy."
Well, I have a different opinion here. If we report a broker as unhealthy when it fails to talk to ZooKeeper, the health check system can potentially kill the whole cluster of brokers when ZooKeeper has problems.
Ideally we should not put ZooKeeper in any critical checking path: Pulsar is able to continue publishing and consuming even when ZooKeeper is unavailable, until Pulsar decides to roll a new ledger. If you add this check to the health check, you are basically tying the broker's health to ZooKeeper availability, which can have pretty bad side effects when running this check in production.
From this perspective, I would also suggest leaving subscribing and unsubscribing out of the health check.
I was actually thinking overnight of removing the subscribe for other reasons. A subscribe is a couple of ZK writes, and an unsubscribe likewise (ledger creation/deletion, metadata updates). If healthchecking does 4x writes per probe and you have a lot of brokers, that's a significant bump in ZK writes.
Will change to a reader.
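A minimal sketch of the reader-based variant: the same produce/read-back flow as the subscription sketch above, but a Reader uses a non-durable cursor, so nothing is written to ZK and nothing is left behind if the probe dies. Topic name and timeout are illustrative.

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.TimeUnit;

public class ReaderHealthCheckSketch {
    public static void check(PulsarClient client, String topic) throws Exception {
        try (Reader<String> reader = client.newReader(Schema.STRING)
                     .topic(topic)
                     .startMessageId(MessageId.latest) // only care about the message we are about to publish
                     .create();
             Producer<String> producer = client.newProducer(Schema.STRING)
                     .topic(topic).create()) {
            String payload = "ping-" + System.nanoTime();
            producer.send(payload);
            Message<String> msg = reader.readNext(10, TimeUnit.SECONDS);
            if (msg == null || !payload.equals(msg.getValue())) {
                throw new IllegalStateException("Healthcheck message not read back");
            }
        }
    }
}
```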
Are we sure we need this check here? If so, the health check cannot be used concurrently.
This currently can't be used concurrently. I want this check to ensure the publish is actually going through. I can change this to loop on receive until we find the message we want.
I would suggest changing this to loop on receive to improve the health check logic.
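A sketch of the loop-on-receive idea, reusing the reader and payload names from the sketches above (both assumptions): keep reading until this probe's own message shows up or the deadline passes, so a concurrent probe's message doesn't fail the check.

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.TimeUnit;

public class LoopReceiveSketch {
    static void awaitOwnMessage(Reader<String> reader, String expectedPayload) throws Exception {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10); // illustrative overall budget
        while (true) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                throw new IllegalStateException("Healthcheck timed out waiting for its own message");
            }
            Message<String> msg = reader.readNext((int) remainingMs, TimeUnit.MILLISECONDS);
            if (msg == null) {
                throw new IllegalStateException("Healthcheck timed out waiting for its own message");
            }
            if (expectedPayload.equals(msg.getValue())) {
                return; // found the message this probe published
            }
            // otherwise it was another probe's message; keep reading until the deadline
        }
    }
}
```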
A bit confused here: if we exit straight after, why do we need to start an executor? Can't we just invoke the callable directly?
To have a timeout on it.
Hmm, that doesn't sound right to me. Because both publish and receive have timeouts, we should just rely on those. If you run the callable directly here, you will have a clear stack trace when a timeout happens. Putting this in another thread and putting a timeout on the runnable will hide a lot of useful information when a timeout occurs.
The admin client does not; it hangs forever.
The admin client is essentially a Jersey client, so we should be able to configure a timeout via readTimeout in the Jersey client builder. I feel the right approach is to:
- add a timeout to pulsar admin, since other applications that use pulsar admin directly can also benefit from this setting (a sketch follows below).
- run the command directly in the main thread, rather than submitting it to another thread, which will make debugging easier.
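As a sketch of what the first point could look like at the Jersey/JAX-RS layer: the builder calls below are standard JAX-RS 2.1 ClientBuilder methods, not existing PulsarAdmin configuration (exposing that is what the referenced issue asks for), and the timeout values are illustrative.

```java
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import java.util.concurrent.TimeUnit;

public class AdminClientTimeoutSketch {
    public static Client buildHttpClient() {
        return ClientBuilder.newBuilder()
                .connectTimeout(10, TimeUnit.SECONDS) // fail fast when the broker is unreachable
                .readTimeout(30, TimeUnit.SECONDS)    // don't hang forever waiting on a response
                .build();
    }
}
```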
I agree that admin should have read and connect timeouts exposed, but I didn't go down that route because I thought we needed this sooner. I've created #2891 for this.
Regarding running in the main thread: I added the timeout so that the call to the healthcheck would return in a predictable time. I looked further into the k8s docs, and they do provide a timeout option, which defaults to 1 second; that is a problem in itself, as the JVM takes multiple seconds to boot. As such, I think we should move the check into an HTTP endpoint and recommend that people use an HTTP probe.
One issue to solve around the HTTP probe is permissions.
We probably need super-user access to trigger the health check, but then the credentials need to be passed by the checker (e.g. the Kubernetes livenessProbe config).
The only advantage of using a script on the local machine is that the credentials will already be there.
A script on the local machine can also call the HTTP probe. That said, it's not a given that admin credentials are available on the broker; in a securely configured cluster they should not be.
For the probe I'll also add a healthcheckRole configuration option, which will have access to run the probe. There will also need to be configuration options for the client the endpoint uses for the probe.
Triggering the endpoint causes a client to try to produce and consume messages on the local broker.
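A hypothetical sketch of how the endpoint could gate access on the healthcheckRole mentioned above; clientRole(), isSuperUser(), getHealthcheckRole() and runProbe() are placeholders standing in for the broker's auth and configuration plumbing, not existing Pulsar APIs.

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.Response.Status;

@Path("/brokers/health")
public class HealthcheckResourceSketch {

    @GET
    public void healthcheck() {
        String role = clientRole(); // role resolved from the request's auth data (placeholder)
        if (!isSuperUser(role) && !role.equals(getHealthcheckRole())) {
            throw new WebApplicationException(Status.UNAUTHORIZED);
        }
        runProbe(); // publish and read back a message on the local broker (placeholder)
    }

    // Placeholders for broker configuration and authorization plumbing.
    private String clientRole() { return "healthcheck-client"; }
    private boolean isSuperUser(String role) { return false; }
    private String getHealthcheckRole() { return "healthcheck-client"; }
    private void runProbe() { /* see the client sketches above */ }
}
```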
rerun integration tests
```java
try {
    PulsarClient client = pulsar().getClient();

    try (Producer<String> producer = client.newProducer(Schema.STRING).topic(topic).create();
```
Can we rewrite this logic in an async way? I know writing sync code is easy, but running blocking operations on the Pulsar executor is not good practice.
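One possible async shape, composing createAsync/sendAsync/readNextAsync with CompletableFuture instead of blocking on the Pulsar executor; the names and overall structure are illustrative, not the PR's final implementation.

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.CompletableFuture;

public class AsyncHealthCheckSketch {
    public static CompletableFuture<Void> check(PulsarClient client, String topic, String payload) {
        CompletableFuture<Reader<String>> readerFuture = client.newReader(Schema.STRING)
                .topic(topic)
                .startMessageId(MessageId.latest)
                .createAsync();
        CompletableFuture<Producer<String>> producerFuture = client.newProducer(Schema.STRING)
                .topic(topic)
                .createAsync();

        return readerFuture.thenCombine(producerFuture, (reader, producer) ->
                        producer.sendAsync(payload)
                                .thenCompose(msgId -> reader.readNextAsync())
                                .thenAccept(msg -> {
                                    if (!payload.equals(msg.getValue())) {
                                        throw new IllegalStateException("unexpected healthcheck message");
                                    }
                                })
                                .whenComplete((ignore, ex) -> {
                                    producer.closeAsync();
                                    reader.closeAsync();
                                }))
                .thenCompose(f -> f); // flatten CompletableFuture<CompletableFuture<Void>>
    }
}
```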
rerun integration tests
```java
    }
});
// timeout read after 10 seconds
ScheduledFuture<?> timeout = pulsar().getExecutor().schedule(() -> {
```
Isn't it better to just time out at the outer level, so the timeout runnable is not scheduled for each read?
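A sketch of that suggestion: schedule a single timeout task for the whole probe and cancel it once the probe completes, rather than scheduling a new timeout per read. The helper name, executor and duration are illustrative.

```java
import java.util.concurrent.*;

public class OuterTimeoutSketch {
    public static CompletableFuture<Void> withTimeout(CompletableFuture<Void> probe,
                                                      ScheduledExecutorService executor,
                                                      long timeout, TimeUnit unit) {
        // One timeout for the entire health check.
        ScheduledFuture<?> timeoutTask = executor.schedule(
                () -> probe.completeExceptionally(new TimeoutException("healthcheck timed out")),
                timeout, unit);
        // Cancel the timeout as soon as the probe finishes, successfully or not.
        probe.whenComplete((v, t) -> timeoutTask.cancel(false));
        return probe;
    }
}
```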
run integration tests
run integration tests