Add healthcheck HTTP endpoint #2883

Merged: merlimat merged 5 commits into apache:master from ivankelly:healthcheck on Nov 6, 2018
Conversation

@ivankelly (Contributor) commented on Oct 29, 2018:

Triggering the endpoint causes a client to try to produce and consume
messages on the local broker.

ivankelly self-assigned this on Oct 29, 2018.
@ivankelly (Contributor, Author) commented:

cc @cckellogg

ivankelly added the type/enhancement and area/broker labels on Oct 29, 2018.
ivankelly added this to the 2.3.0 milestone on Oct 29, 2018.
@merlimat (Contributor) commented:

There's a compilation error:

2018-10-29\T\15:20:27.181 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.7.0:testCompile (default-testCompile) on project pulsar-broker: Compilation failure
2018-10-29\T\15:20:27.181 [ERROR] /home/jenkins/jenkins-slave/workspace/pulsar_precommit_cpp/pulsar-broker/src/test/java/org/apache/pulsar/broker/admin/AdminTest.java:[209,50] no suitable constructor found for InternalConfigurationData(java.lang.String,java.lang.String,java.lang.String)
2018-10-29\T\15:20:27.181 [ERROR]     constructor org.apache.pulsar.common.conf.InternalConfigurationData.InternalConfigurationData() is not applicable
2018-10-29\T\15:20:27.181 [ERROR]       (actual and formal argument lists differ in length)
2018-10-29\T\15:20:27.181 [ERROR]     constructor org.apache.pulsar.common.conf.InternalConfigurationData.InternalConfigurationData(java.lang.String,java.lang.String,java.lang.String,java.lang.String) is not applicable

Member:

We should consider backwards compatibility here: when the health check is sent to an old broker, there is no heartbeat namespace.

ivankelly (Author):

The healthcheck should only run against the local broker, so the broker version should be the same as the version of the command.

Member:

But we need to handle the case where people roll back the broker without tearing down the health check, no? From a code-writing perspective, we have to keep these things in mind, because you can't assume that people running this code will follow exactly the scenario you have in mind.

ivankelly (Author), Oct 29, 2018:

The command to do the healthcheck lives within the same binary code tree as the binary for the broker itself. When it gets called, it's using the same jars.

If someone rolls back, they will have to disable their livenessProbe anyhow, as it will return exit code 127 because bin/pulsar healthcheck will not exist.

Now, if we're talking about someone running a healthcheck remotely against the broker, that's a different story. For that we should have an HTTP endpoint, which runs a similar liveness check.

Member:

Why not just use a Reader? Using subscribe and unsubscribe can potentially cause zombie subscriptions.

ivankelly (Author):

Because I want to subscribe to ensure that that path is working (that managed ledger can create cursors).

Member:

Hmm. A reader also creates a cursor; it exercises the same code path for publishing and consuming. I'm not sure it's worth subscribing/unsubscribing.

But if you insist on subscribing and unsubscribing, we should have logic to handle these zombie subscriptions; otherwise these health checks will potentially cause operational pain.

ivankelly (Author):

Readers create non-durable cursors, which don't write anything to ZK. If we don't check that we have a working connection to ZK, we may report an unhealthy broker as healthy.

I can change this to use the same subscription name each time. That way, if unsubscribe fails, there will only ever be one subscription per broker, and the validation that we can receive the message should still work.

Member:

> If we don't check that we have a working connection to ZK, we may report an unhealthy broker as healthy.

Well, I have a different opinion here. If we report the broker as unhealthy when it fails to talk to ZooKeeper, the health check system can potentially kill the whole cluster of brokers when ZooKeeper has problems.

Ideally we should not put ZooKeeper in any critical checking path. The reason is that Pulsar is able to continue publishing and consuming even when ZooKeeper is unavailable, until Pulsar decides to roll a new ledger. If you add ZooKeeper to the health check, you are basically tying the broker's health to ZooKeeper availability, which can have pretty bad side effects when running this check in production.

From this perspective, I would also suggest leaving subscribing and unsubscribing out of the health check.

ivankelly (Author):

I was actually thinking overnight of removing the subscribe for other reasons. A subscribe is a couple of ZK writes, and unsubscribe likewise (ledger creation/deletion, metadata updates). If health checking does around four ZK writes per probe and you have a lot of brokers, that's a significant bump in ZK writes.

Will change to a reader.
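
For reference, a minimal sketch of the Reader-based variant agreed on here, assuming the Pulsar 2.x client API; the heartbeat topic name, the probe method, and the class are illustrative, not the implementation in this PR:

```java
import java.util.UUID;
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;
import org.apache.pulsar.client.api.Schema;

public class HealthProbeSketch {
    // Probe publish + dispatch through a Reader; its non-durable cursor avoids
    // the ZooKeeper writes that subscribe/unsubscribe would incur.
    static void probe(PulsarClient client, String heartbeatTopic) throws Exception {
        try (Producer<String> producer = client.newProducer(Schema.STRING)
                     .topic(heartbeatTopic)
                     .create();
             Reader<String> reader = client.newReader(Schema.STRING)
                     .topic(heartbeatTopic)
                     .startMessageId(MessageId.latest) // only care about messages published after this point
                     .create()) {
            String payload = UUID.randomUUID().toString();
            producer.send(payload);                    // exercises the publish path

            Message<String> msg = reader.readNext(10, TimeUnit.SECONDS);
            if (msg == null || !payload.equals(msg.getValue())) {
                throw new IllegalStateException("health check message not received");
            }
        }
    }
}
```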

Member:

Are we sure we need this check here? If so, the health check cannot be used concurrently.

ivankelly (Author):

This currently can't be used concurrently. I want this check to ensure the publish is actually going through. I can change this to loop on receive until we find the message we want.

Member:

I would suggest changing this to loop on receive to improve the health check logic.
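
A sketch of the loop-on-receive refinement being suggested, which lets overlapping probes coexist by skipping messages that belong to other probes (same assumptions and imports as the sketch above; awaitOwnMessage is a hypothetical helper):

```java
// Keep reading until we see our own payload, discarding messages published by
// concurrent health checks, bounded by an overall deadline.
static void awaitOwnMessage(Reader<String> reader, String payload, long timeoutMs) throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (true) {
        long remainingMs = deadline - System.currentTimeMillis();
        if (remainingMs <= 0) {
            throw new java.util.concurrent.TimeoutException("health check message not observed in time");
        }
        Message<String> msg = reader.readNext((int) remainingMs, TimeUnit.MILLISECONDS);
        if (msg == null) {
            throw new java.util.concurrent.TimeoutException("health check message not observed in time");
        }
        if (payload.equals(msg.getValue())) {
            return; // our probe message made the round trip
        }
        // else: a message from another concurrent probe; keep reading
    }
}
```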

Member:

A bit confused here: if we exit straight after, why do we need to start an executor? Can't we just trigger the callable directly?

ivankelly (Author):

To have a timeout on it.

Member:

Hmm, that doesn't sound right to me. Because both publish and receive have timeouts, we should just rely on those. If you run the callable directly here, you get a clear stack trace when a timeout happens. Putting this in another thread and having a timeout on the runnable will hide a lot of useful information when a timeout occurs.

ivankelly (Author):

Admin does not. Admin hangs forever.

Member:

Admin is essentially a Jersey client. We should be able to configure a timeout via readTimeout in the Jersey client builder. I feel the right approach is to:

  • add a timeout to pulsar-admin, since other applications that use pulsar-admin directly would also benefit from the setting (see the sketch below);
  • run the command directly in the main thread rather than submitting it to another thread, which will make debugging easier.
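
For reference, a sketch of what a read timeout on the underlying Jersey/JAX-RS client could look like; this is plain JAX-RS 2.1 ClientBuilder usage (older Jersey versions expose the same settings via ClientProperties.CONNECT_TIMEOUT / READ_TIMEOUT), not the PulsarAdmin API added in the follow-up issue:

```java
import java.util.concurrent.TimeUnit;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;

public class AdminClientTimeoutSketch {
    // Bound both connect and read time so an admin call cannot hang forever
    // on an unresponsive broker.
    static Client buildHttpClient() {
        return ClientBuilder.newBuilder()
                .connectTimeout(5, TimeUnit.SECONDS)
                .readTimeout(30, TimeUnit.SECONDS)
                .build();
    }
}
```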

ivankelly (Author):

I agree that admin should have read and connect timeouts exposed, but I didn't go down that route because I thought we needed this sooner. I've created #2891 for that.

Regarding running in the main thread: I added the timeout so that the call to the healthcheck would return in a predictable time. I looked further into the k8s docs, and they do provide a timeout option, which defaults to 1 second; that is a problem in itself, as the JVM takes multiple seconds to boot. As such, I think we should move the check into an HTTP endpoint and recommend that people use an HTTP probe.

Member:

+1 for http probe

Contributor:

One issue to solve around the HTTP probe is permissions. We probably need super-user access to trigger the health check, but then the credentials need to be passed by the checker (e.g. the Kubernetes livenessProbe config).

The only advantage of using a script on the local machine is that the credentials will already be there.

ivankelly (Author):

A script on the local machine can also call the HTTP probe. That said, it's not a given that admin credentials are available on the broker; in a securely configured cluster they should not be.

For the probe I'll also add a healthcheckRole configuration option, which will have access to run the probe. There will also need to be configuration options for the client that the endpoint uses to run the probe.

ivankelly changed the title from "Add healthcheck for broker" to "Add healthcheck HTTP endpoint" on Nov 1, 2018.
@ivankelly (Author) commented:

rerun integration tests
rerun cpp tests

try {
    PulsarClient client = pulsar().getClient();

    try (Producer<String> producer = client.newProducer(Schema.STRING).topic(topic).create();
Member:

Can we rewrite this logic in an async way? I know writing sync code is easy, but running blocking operations on the Pulsar executor is not good practice.
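
A sketch of the non-blocking shape being asked for, chaining the client's async methods so nothing blocks on the executor thread (the topic name, helper names, and close handling are illustrative; the actual PR code may differ). A timeout would be applied once at the outer level, as discussed further below.

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;
import org.apache.pulsar.client.api.Schema;

public class AsyncHealthCheckSketch {
    static CompletableFuture<Void> healthCheckAsync(PulsarClient client, String topic) {
        String payload = UUID.randomUUID().toString();
        CompletableFuture<Producer<String>> producerFuture =
                client.newProducer(Schema.STRING).topic(topic).createAsync();
        CompletableFuture<Reader<String>> readerFuture =
                client.newReader(Schema.STRING).topic(topic)
                      .startMessageId(MessageId.latest).createAsync();

        return producerFuture
                .thenCombine(readerFuture, (producer, reader) ->
                        producer.sendAsync(payload)
                                .thenCompose(id -> readUntil(reader, payload)))
                .thenCompose(f -> f) // flatten the nested future
                .whenComplete((r, t) -> {
                    // Best-effort async cleanup; ignores close failures.
                    producerFuture.thenAccept(Producer::closeAsync);
                    readerFuture.thenAccept(Reader::closeAsync);
                });
    }

    // Read asynchronously until our own payload shows up, skipping messages
    // from other concurrent probes.
    static CompletableFuture<Void> readUntil(Reader<String> reader, String payload) {
        return reader.readNextAsync().thenCompose(msg ->
                payload.equals(msg.getValue())
                        ? CompletableFuture.<Void>completedFuture(null)
                        : readUntil(reader, payload));
    }
}
```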

…integration tests not being run); also, disable the one test that depends on admin client timeouts
@ivankelly (Author) commented:

rerun integration tests

}
});
// timeout read after 10 seconds
ScheduledFuture<?> timeout = pulsar().getExecutor().schedule(() -> {
Member:

Isn't it better to just time out at the outer level, so that the timeout runnable is not scheduled for each read?
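
A sketch of the single outer-level timeout being suggested: schedule one task per health check that fails the overall future, rather than scheduling a timeout for every read (assumes the async shape from the earlier sketch; on Java 9+, CompletableFuture.orTimeout would do the same thing):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class OuterTimeoutSketch {
    // One timeout for the whole check, not one per read.
    static CompletableFuture<Void> withTimeout(CompletableFuture<Void> check,
                                               ScheduledExecutorService executor,
                                               long timeoutSeconds) {
        ScheduledFuture<?> timeoutTask = executor.schedule(
                () -> check.completeExceptionally(
                        new TimeoutException("health check timed out")),
                timeoutSeconds, TimeUnit.SECONDS);
        // Cancel the timer as soon as the check completes, successfully or not.
        check.whenComplete((r, t) -> timeoutTask.cancel(false));
        return check;
    }
}
```

Usage would then be something like withTimeout(healthCheckAsync(client, topic), pulsar().getExecutor(), 10), with pulsar().getExecutor() taken from the code under review.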

@sijie (Member) commented on Nov 6, 2018:

run integration tests

@merlimat (Contributor) commented on Nov 6, 2018:

run integration tests

merlimat merged commit 4fb60ee into apache:master on Nov 6, 2018.

Labels

area/broker, type/enhancement
