Conversation
pulsar-broker/src/main/java/org/apache/pulsar/PulsarHealthcheck.java
There's a compilation error:
We should consider backwards compatibility here: when the health check is sent to an old broker, there is no heartbeat namespace.
Healthcheck should only run against the local broker, so the version of the broker should be the same version as the command.
But we need to handle the case where people roll back the broker without tearing down the health check, no? From a code-writing perspective, we have to keep these things in mind, because we can't assume that people running this code will follow exactly the scenario you are thinking of.
The command to do the healthcheck lives within the same code tree as the broker binary itself; when it gets called, it uses the same jars.
If someone rolls back, they will have to disable their livenessProbe anyhow, as it will give them a 127 exit code, because bin/pulsar healthcheck will not exist.
Now, if we're talking about someone running a healthcheck remotely against the broker, that's a different story. For that we should have an HTTP endpoint which runs a similar liveness check.
Why not just use a Reader? Using subscribe and unsubscribe can potentially cause zombie subscriptions.
Because I want to subscribe to ensure that that path is working (that the managed ledger can create cursors).
Hmm, a reader also creates a cursor, and it exercises the same publish and consume code path, so I'm not sure it is worth subscribing/unsubscribing.
But if you insist on subscribing and unsubscribing, we should have logic to handle these zombie subscriptions; otherwise these health checks will potentially cause operational pain.
Readers create non-durable cursors, which don't write anything to ZK. If we don't check that we have a working connection to ZK, we may report an unhealthy broker as healthy.
I can change this to use the same subscription name each time. That way, if unsubscribe fails, there'll only ever be one leftover subscription per broker, and the validation that we can get the message back should still work.
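For reference, a minimal sketch of the fixed-subscription-name approach described above, assuming an illustrative topic, subscription name and a 10-second receive timeout; this is not the PR's actual code:

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.TimeUnit;

public class SubscriptionHealthCheckSketch {
    public static void check(PulsarClient client, String topic) throws Exception {
        String subName = "healthcheck"; // constant name, so a failed unsubscribe leaves at most one leftover sub
        try (Producer<String> producer = client.newProducer(Schema.STRING)
                     .topic(topic).create();
             Consumer<String> consumer = client.newConsumer(Schema.STRING)
                     .topic(topic).subscriptionName(subName).subscribe()) {
            String payload = "ping-" + System.nanoTime();
            producer.send(payload);
            // if a leftover subscription has a backlog, the first message may not be ours
            // (see the loop-on-receive discussion further down)
            Message<String> msg = consumer.receive(10, TimeUnit.SECONDS);
            if (msg == null || !payload.equals(msg.getValue())) {
                throw new IllegalStateException("Healthcheck message not received");
            }
            consumer.acknowledge(msg);
            consumer.unsubscribe(); // durable subscription cleanup; this is the part that touches ZK
        }
    }
}
```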
"If we don't check that we have a working connection to ZK, we may report an unhealthy broker as healthy."
Well, I have a different opinion here. If we report a broker as unhealthy when it fails to talk to ZooKeeper, the health check system can potentially kill the whole cluster of brokers when ZooKeeper has problems.
Ideally we should not put ZooKeeper in any critical checking path: Pulsar is able to continue publishing and consuming even when ZooKeeper is unavailable, until Pulsar decides to roll a new ledger. If you add this check to the health check, you are basically tying the broker's health to ZooKeeper availability, which can have pretty bad side effects when running this check in production.
From this perspective, I would also suggest leaving subscribing and unsubscribing out of the health check.
I was actually thinking overnight of removing the subscribe for other reasons. A subscribe is a couple of ZK writes, and an unsubscribe likewise (ledger creation/deletion, metadata updates). If healthchecking does 4x writes per probe and you have a lot of brokers, that's a significant bump in ZK writes.
Will change to a reader.
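A minimal sketch of the reader-based variant: the same produce/read-back flow as the subscription sketch above, but a Reader uses a non-durable cursor, so nothing is written to ZK and nothing is left behind if the probe dies. Topic name and timeout are illustrative.

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.TimeUnit;

public class ReaderHealthCheckSketch {
    public static void check(PulsarClient client, String topic) throws Exception {
        try (Reader<String> reader = client.newReader(Schema.STRING)
                     .topic(topic)
                     .startMessageId(MessageId.latest) // only care about the message we are about to publish
                     .create();
             Producer<String> producer = client.newProducer(Schema.STRING)
                     .topic(topic).create()) {
            String payload = "ping-" + System.nanoTime();
            producer.send(payload);
            Message<String> msg = reader.readNext(10, TimeUnit.SECONDS);
            if (msg == null || !payload.equals(msg.getValue())) {
                throw new IllegalStateException("Healthcheck message not read back");
            }
        }
    }
}
```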
Are we sure we need this check here? If so, the health check cannot be used concurrently.
This currently can't be used concurrently. I want this check to ensure the publish is actually going through. I can change this to loop on receive until we find the message we want.
I would suggest changing this to loop on receive to improve the health check logic.
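A sketch of the loop-on-receive idea, reusing the reader and payload names from the sketches above (both assumptions): keep reading until this probe's own message shows up or the deadline passes, so a concurrent probe's message doesn't fail the check.

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.TimeUnit;

public class LoopReceiveSketch {
    static void awaitOwnMessage(Reader<String> reader, String expectedPayload) throws Exception {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10); // illustrative overall budget
        while (true) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                throw new IllegalStateException("Healthcheck timed out waiting for its own message");
            }
            Message<String> msg = reader.readNext((int) remainingMs, TimeUnit.MILLISECONDS);
            if (msg == null) {
                throw new IllegalStateException("Healthcheck timed out waiting for its own message");
            }
            if (expectedPayload.equals(msg.getValue())) {
                return; // found the message this probe published
            }
            // otherwise it was another probe's message; keep reading until the deadline
        }
    }
}
```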
A bit confused here: if we exit straight after, why do we need to start an executor? Can't we just invoke the callable directly?
To have a timeout on it.
Hmm, that doesn't sound right to me. Because both publish and receive have timeouts, we should just rely on those. If you run the callable directly here, you will have a clear stack trace when a timeout happens. Putting this in another thread and putting a timeout on the runnable will hide a lot of useful information when a timeout occurs.
The admin client does not; it hangs forever.
The admin client is essentially a Jersey client, so we should be able to configure a timeout via readTimeout in the Jersey client builder. I feel the right approach is to:
- add a timeout to pulsar admin, since other applications that use pulsar admin directly can also benefit from this setting (a sketch follows below).
- run the command directly in the main thread, rather than submitting it to another thread, which will make debugging easier.
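As a sketch of what the first point could look like at the Jersey/JAX-RS layer: the builder calls below are standard JAX-RS 2.1 ClientBuilder methods, not existing PulsarAdmin configuration (exposing that is what the referenced issue asks for), and the timeout values are illustrative.

```java
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import java.util.concurrent.TimeUnit;

public class AdminClientTimeoutSketch {
    public static Client buildHttpClient() {
        return ClientBuilder.newBuilder()
                .connectTimeout(10, TimeUnit.SECONDS) // fail fast when the broker is unreachable
                .readTimeout(30, TimeUnit.SECONDS)    // don't hang forever waiting on a response
                .build();
    }
}
```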
I agree that admin should have read and connect timeouts exposed, but I didn't go down that route because I thought we needed this sooner. I've created #2891 for this.
Regarding running in the main thread: I added the timeout so that the call to the healthcheck would return in a predictable time. I looked further into the k8s docs, and they do provide a timeout option, which defaults to 1 second; that is a problem in itself, as the JVM takes multiple seconds to boot. As such, I think we should move the check into an HTTP endpoint and recommend that people use an HTTP probe.
One issue to solve around the HTTP probe is permissions.
We probably need super-user access to trigger the health check, but then the credentials need to be passed by the checker (e.g. the Kubernetes livenessProbe config).
The only advantage of using a script on the local machine is that the credentials will already be there.
A script on the local machine can also call the HTTP probe. That said, it's not a given that admin credentials are available on the broker; in a securely configured cluster they should not be.
For the probe I'll also add a healthcheckRole configuration option, which will have access to run the probe. There will also need to be configuration options for the client the endpoint uses for the probe.
Triggering the endpoint causes a client to try to produce and consume messages on the local broker.
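A hypothetical sketch of how the endpoint could gate access on the healthcheckRole mentioned above; clientRole(), isSuperUser(), getHealthcheckRole() and runProbe() are placeholders standing in for the broker's auth and configuration plumbing, not existing Pulsar APIs.

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.WebApplicationException;
import javax.ws.rs.core.Response.Status;

@Path("/brokers/health")
public class HealthcheckResourceSketch {

    @GET
    public void healthcheck() {
        String role = clientRole(); // role resolved from the request's auth data (placeholder)
        if (!isSuperUser(role) && !role.equals(getHealthcheckRole())) {
            throw new WebApplicationException(Status.UNAUTHORIZED);
        }
        runProbe(); // publish and read back a message on the local broker (placeholder)
    }

    // Placeholders for broker configuration and authorization plumbing.
    private String clientRole() { return "healthcheck-client"; }
    private boolean isSuperUser(String role) { return false; }
    private String getHealthcheckRole() { return "healthcheck-client"; }
    private void runProbe() { /* see the client sketches above */ }
}
```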
rerun integration tests
```java
try {
    PulsarClient client = pulsar().getClient();

    try (Producer<String> producer = client.newProducer(Schema.STRING).topic(topic).create();
```
Can we rewrite this logic in an async way? I know writing sync code is easy, but running blocking operations on the Pulsar executor is not good practice.
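One possible async shape, composing createAsync/sendAsync/readNextAsync with CompletableFuture instead of blocking on the Pulsar executor; the names and overall structure are illustrative, not the PR's final implementation.

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.CompletableFuture;

public class AsyncHealthCheckSketch {
    public static CompletableFuture<Void> check(PulsarClient client, String topic, String payload) {
        CompletableFuture<Reader<String>> readerFuture = client.newReader(Schema.STRING)
                .topic(topic)
                .startMessageId(MessageId.latest)
                .createAsync();
        CompletableFuture<Producer<String>> producerFuture = client.newProducer(Schema.STRING)
                .topic(topic)
                .createAsync();

        return readerFuture.thenCombine(producerFuture, (reader, producer) ->
                        producer.sendAsync(payload)
                                .thenCompose(msgId -> reader.readNextAsync())
                                .thenAccept(msg -> {
                                    if (!payload.equals(msg.getValue())) {
                                        throw new IllegalStateException("unexpected healthcheck message");
                                    }
                                })
                                .whenComplete((ignore, ex) -> {
                                    producer.closeAsync();
                                    reader.closeAsync();
                                }))
                .thenCompose(f -> f); // flatten CompletableFuture<CompletableFuture<Void>>
    }
}
```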
rerun integration tests
```java
    }
});
// timeout read after 10 seconds
ScheduledFuture<?> timeout = pulsar().getExecutor().schedule(() -> {
```
Isn't it better to just time out at the outer level, so the timeout runnable is not scheduled for each read?
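A sketch of that suggestion: schedule a single timeout task for the whole probe and cancel it once the probe completes, rather than scheduling a new timeout per read. The helper name, executor and duration are illustrative.

```java
import java.util.concurrent.*;

public class OuterTimeoutSketch {
    public static CompletableFuture<Void> withTimeout(CompletableFuture<Void> probe,
                                                      ScheduledExecutorService executor,
                                                      long timeout, TimeUnit unit) {
        // One timeout for the entire health check.
        ScheduledFuture<?> timeoutTask = executor.schedule(
                () -> probe.completeExceptionally(new TimeoutException("healthcheck timed out")),
                timeout, unit);
        // Cancel the timeout as soon as the probe finishes, successfully or not.
        probe.whenComplete((v, t) -> timeoutTask.cancel(false));
        return probe;
    }
}
```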
run integration tests
run integration tests