
Kafka: add support for active brokers in Producers #169

Closed
gm42 opened this issue Mar 4, 2020 · 16 comments · Fixed by #192
Labels
enhancement New feature or request

Comments

@gm42

gm42 commented Mar 4, 2020

Is your feature request related to a problem? Please describe

It is necessary to check the health of Kafka consumers/producers; in IBM/sarama#1341 it is mentioned that this can be done by checking that there are active brokers (the client checks them periodically anyway).

Describe the solution

The minimally invasive solution is to use https://godoc.org/github.com/Shopify/sarama#NewAsyncProducerFromClient in Patron and then expose the Brokers() method.

This solution should work for both consumers and producers if we expose functionality of the Client.
We will focus on the producers only.

Additional context

This is related to #148, in the sense that more functionality of the Kafka client library is necessary.

@mantzas

mantzas commented Mar 4, 2020

@gm42 does sarama not handle this automatically? Could KeepAlive in sarama.Config do the trick, or not?

@gm42
Author

gm42 commented Mar 5, 2020

@mantzas this is about checking that the sarama.Client is still healthy, not about making it always healthy (which is not possible).

Example use case: when implementing a service's health check, one should ping each subsystem's client for connectivity/health status.

The mentioned issue (IBM/sarama#1341) suggests checking that the set of brokers is not empty, which seems like a good way to measure the health of the cluster since sarama.Client already maintains the active broker set internally (and exposes a method for it).

@mantzas

mantzas commented Mar 8, 2020

@gm42 Do you propose to make this change in producers, consumers or both?

@gm42
Author

gm42 commented Mar 10, 2020

@mantzas the change necessarily affects all of them: there is a sarama.Client behind both producers and consumers, and the method to expose lives on sarama.Client.

There are in my opinion two approaches to this:

  • allow constructing producers and consumers from a sarama.Client; the caller can then call Brokers() directly on the client used to construct the producer/consumer (this is more akin to how http.Client works)
  • expose a new method ActiveBrokers() on Patron's *Producer and *Consumer structs (internally one still needs the piping through a sarama.Client for this)

@mantzas

mantzas commented Mar 11, 2020

@gm42 I would go for the second option. But let me ask a question to understand better what you are proposing:

  1. Let's assume you want to send a message. If you cannot because no broker is available, it will probably fail with an error, either sync or async (on a channel).
  2. Let's assume you consume a message. The consumer (simple or group) will fail because there is no broker to connect to, and the async component will catch this error and start the error-handling strategy.
    What exactly will this feature bring to the table?

@gm42
Author

gm42 commented Mar 12, 2020

2. What exactly will this feature bring to the table?

The feature is useful/necessary for implementing service health checks (think of them in the docker-compose or Kubernetes way): it is necessary to know whether a service is "in business" at all times, not just on demand (e.g. when using the consumer or producer). This makes it possible to implement SLOs/SLAs at large and to properly monitor unavailability issues as they arise.

Ref: https://docs.docker.com/compose/compose-file/#healthcheck and https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request

@mantzas

mantzas commented Mar 12, 2020

@gm42 I am aware of such checks as patron supports Kubernetes liveness and readiness checks.
I still fail to understand why a publisher that cannot send a message to Kafka should render the service's readiness or liveness check invalid. Usually, in these cases, you want to employ circuit breakers or retries, so it should be up to the caller of the publisher to decide what to do.
In the case of the consumer, as I mentioned in the previous comment, you can employ some strategies there like trying to reconnect for some time until you decide that Kafka is down. These retries are of course configurable.
You can monitor your SLA/SLO with Prometheus metrics and alerting, so I do not see why this is only possible with the proposal here.
Given that this proposed change seems fairly extensive we should be cautious that the benefit outweighs the effort.

@gm42
Author

gm42 commented Mar 13, 2020

I still fail to understand why a publisher that cannot send a message to Kafka should render the service's readiness or liveness check invalid. Usually, in these cases, you want to employ circuit breakers or retries, so it should be up to the caller of the publisher to decide what to do.
In the case of the consumer, as I mentioned in the previous comment, you can employ some strategies there like trying to reconnect for some time until you decide that Kafka is down. These retries are of course configurable.
You can monitor your SLA/SLO with Prometheus metrics and alerting, so I do not see why this is only possible with the proposal here.

@mantzas the health check should not correspond to "HTTP port is open and listening" but to the status of the underlying components. This is a strategy discussion and I am going to involve more people to see how we want to proceed.

@oss92

oss92 commented Mar 13, 2020

@mantzas @gm42

I still fail to understand why a publisher that cannot send a message to Kafka should render the services readiness or liveness check invalid.

That's absolutely valid and can be another way to set the state to unhealthy, but checking for connectivity is also very important.

The reason is that some services can go a long time without consuming or producing a message. Some services load data periodically, so there will be spikes of messages and periods of nothing being produced. It is more convenient for the teams maintaining those services to know beforehand that the service cannot connect to the broker, rather than once the errors start to hit.

@gm42
Author

gm42 commented Mar 13, 2020

Yes, we might eventually want both, as they cover different aspects. The health checks are part of the "fail early and consistently" approach, which keeps us aware of problems in load-testing and QA situations as well.

@gm42
Author

gm42 commented Mar 13, 2020

@mantzas do we already have in Patron a concept of "dependable components"? E.g. Patron computes the service's health check from all of its components, by asking each component whether it is healthy.

We could frame the discussion in those terms; I am going to read the Patron documentation, code, and issues/PRs about this.

@mantzas

mantzas commented Mar 13, 2020

My issue is not about liveness and readiness, but about how you can achieve this with the Kafka consumer and producer, which I did not get.
You actually have the freedom in Patron to define liveness and readiness yourself by implementing the interfaces.
So you are free to do whatever you want with them. Check out the documentation/code; it is really easy to do.

@gm42
Author

gm42 commented Mar 13, 2020

@mantzas it is explained here: #169 (comment)

To use that freedom we would need to be able to "ping" the consumer's or producer's client; that is what the health check requires. "Ping" is an apt analogy because with Kafka there are both network concerns and cluster/partitioning concerns to cover (even if Kafka is network-accessible, it might not be available to do message brokerage).

The sarama.Client already does the ping and internal health check, and exposing the broker set in Patron would tell us which brokers are still pingable and in good health.

I can make a PR for the Patron changes, including an example of how to use this for the health/readiness checks.

@mantzas

mantzas commented Mar 13, 2020

@gm42 you mention the how, but I am interested in the why as per comment #169 (comment).
Do the two cases not cover what you want?

@gm42
Author

gm42 commented Mar 13, 2020

  • What more will you do if the publisher cannot send a message? Stop/kill your pod with readiness and liveness?

readiness != liveness. This check should only be part of the readiness probe.

  • Do you believe that stopping/killing a pod is a good idea for temporary things that might be resolved just by retrying?

Readiness probes do not affect the running service; let's start from that distinction.

See the discussion in mozmeao/infra#660 (comment) as an example; I think they make a sane choice there.

They check dependencies only for the readiness probe.

@gm42 gm42 changed the title Kafka: add support for health check of the brokers Kafka: add support for active brokers in AsyncProducer Mar 16, 2020
@gm42
Author

gm42 commented Mar 16, 2020

I will reduce the scope of this to simply the producers and make a small PR to implement this functionality; thanks to all parties for the interesting and open discussion 🎉 I have learnt a thing or two about best practices along the way!

@mantzas mantzas added the enhancement New feature or request label Mar 16, 2020
@mantzas mantzas added this to the v1.0.0 Release milestone Mar 16, 2020
@mantzas mantzas changed the title Kafka: add support for active brokers in AsyncProducer Kafka: add support for active brokers in Producer Mar 16, 2020
@mantzas mantzas changed the title Kafka: add support for active brokers in Producer Kafka: add support for active brokers in Producers Mar 16, 2020