
Maybe add linkedin/cruise-control? #100

Closed

DanielFerreiraJorge opened this issue Nov 24, 2017 · 9 comments
@DanielFerreiraJorge

Just an idea... cruise control is great

@solsson
Contributor

solsson commented Nov 25, 2017

Could you elaborate a bit on what's great with it? We're definitely on the lookout for tooling right now, in particular topic management.

@DanielFerreiraJorge
Author

DanielFerreiraJorge commented Nov 25, 2017

It is particularly great at rebalancing the cluster, adding and removing brokers, and anomaly detection. Take a look at the project readme at https://github.com/linkedin/cruise-control and you will understand. Also see the slides of the LinkedIn presentation at https://www.slideshare.net/JiangjieQin/introduction-to-kafka-cruise-control-68180931, and if you have a little bit of time to spare, watch the presentation at https://www.youtube.com/watch?v=lf31udm9cYY&t

Also take a look at https://github.com/krallistic/kafka-operator, which already has it integrated... maybe if you decide to add it, that project will jump-start your work.

Edit:
From the wiki:

Cruise Control provides the following features out of the box:

  • Continually balance a Kafka cluster with respect to disk, network and CPU utilization.
  • When a broker fails, automatically reassign replicas that were on that broker to other brokers in the cluster and restore the original replication factor.
  • Identify the topic-partitions that consume the most resources in the cluster.
  • Support one-click cluster expansions and broker decommissions.
  • Support heterogeneous Kafka clusters and multiple brokers per machine.

Cheers!

@solsson
Contributor

solsson commented Dec 19, 2017

I took a quick look at cruise-control specs. It could certainly help for situations like #108 and #105. However the installation is quite intrusive. It requires a custom kafka docker image, or some quite intricate mount tricks. This repo has taken a stand for using stock kafka. Because of that I will not prioritize this issue. A fork that maintains a cruise-control setup would be welcome.

@DanielFerreiraJorge
Author

Hi, that is fine! Just something I want to share now: for the past couple of months my team and I have been evaluating a Kafka alternative (we have used Kafka in production for 3 years and we are happy with it... mostly). It is an Apache incubating project called Pulsar (https://pulsar.apache.org/). It was open sourced by Yahoo, where it is used internally at the level of hundreds of billions of messages per day, so it is proven in production. We already have a Pulsar cluster running at my company taking some load off Kafka, and we plan to decommission our Kafka cluster in 6 months.

It is really worth understanding why Pulsar is actually better than Kafka. It solves almost all of the annoying problems that Kafka has, like topic management, disruptive rebalances, etc., and offers the same amount of "raw" power (throughput, low latency, etc.). One of the most attractive "features", IMHO, is that it uses Apache BookKeeper as the storage layer, making the brokers stateless, with disconnected write and read paths. It can function EXACTLY like Kafka, but it can ALSO function as a queue. It has geo-replication at the storage level (not a mirror cluster like Kafka: BookKeeper itself can be deployed in a geo-replicated manner), etc.

Pulsar is backed by a funded startup called streaml.io (I have absolutely nothing to do with them; they don't even know me). Their blog is very enlightening in explaining Pulsar's architecture.

The only downside of the project is not in the technology itself but in the ecosystem. Since it is a "new" project, it has a MUCH, MUCH smaller community than Kafka. But I really hope that changes and the project takes off!

I posted this here because you mentioned you were looking for tooling. Well... this is not a Kafka tool, but a tool to replace Kafka. Really. I was really skeptical at first, but it paid off once I had consumed all the documentation from Pulsar and BookKeeper.

Take care!

@solsson
Contributor

solsson commented Dec 31, 2017

@DanielFerreiraJorge I'm guessing, because this thread started with CruiseControl, that you aim to reduce toil on the ops side? Or are there client use cases that you can solve better with Pulsar? Or, since you refer to BookKeeper, is it the concept of a distributed append-only log that didn't fit your use cases?

I always felt that Kafka alternatives would pop up that are lighter (more Docker-friendly from the perspective of #112) and have official client libraries for the major languages, but now that I'm possibly looking at one I don't really know how to evaluate it.

To us, in a way Kafka is like Git - a great data model with a clunky interface. Analogous to Kafka I've imagined a Git rewrite will come along, with humane CLI syntax :). In a way Kafka delivers value the way Kubernetes does, as abstraction for a distributed system. I like when the abstraction lets us understand the system we're building, while hiding the low-level stuff.

We value the distributed append-only log model. We use it for all our "immutable facts", and we happily subscribe to "events" as well as read from offset 0 into table-like caches. Let's say all of this could scale well without the partitions concept; then our developers wouldn't have to understand it, which would save time. But for now it's good that we have to understand the relation between keys, partitioning and ordering guarantees.
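That relation between keys, partitions and ordering can be shown with a minimal sketch in plain Python (no Kafka client; the partitioner is a stand-in for Kafka's murmur2-based default, and all names are made up): order is only guaranteed within a partition, and because a key always hashes to the same partition, all records for one key stay in send order.

```python
from collections import defaultdict

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Stand-in for Kafka's default key-hash partitioner; any
    # deterministic hash gives the same per-key ordering guarantee.
    return hash(key) % NUM_PARTITIONS

def produce(log: dict, key: str, value: str) -> None:
    # Append to the partition chosen by the key, like a producer send.
    log[partition_for(key)].append((key, value))

log = defaultdict(list)
for i in range(6):
    produce(log, "order-42", f"event-{i}")  # one key, six events

# All six events land in the same partition, in send order.
p = partition_for("order-42")
assert [v for _, v in log[p]] == [f"event-{i}" for i in range(6)]
```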

It feels a bit outdated that every (dev)ops team has to build their own tooling on the level of #95. Kafka arguably still evolves in a way that assumes there are consultants for that :) However, it's quite compatible with ideas in the Google SRE book: you start with manual maintenance, and when the amount of toil gets too high you automate it. Doing so, we get to understand the design. I feel that with Kubernetes jobs etc., log aggregation (#91), monitoring (#96 + #31 etc.) and communities like the one around this repo, such automation could be quite low-hanging fruit.

@DanielFerreiraJorge
Author

@solsson we made the decision to switch to Pulsar both to reduce toil on the ops side and to better support client use cases. We have been using Kafka in production for years with a fairly high load (~25K msg/sec), which is nowhere near the absurd load that Kafka can achieve, but it is a lot. We have a very stable Kafka infrastructure and the decision to change wasn't easy, but Pulsar offers too much to be ignored.

Let's first clarify what appears to be a misconception: Pulsar is indeed a distributed append-only log that can be used to store immutable facts and replay them at will. This is our primary use case for Kafka, and Pulsar does it exactly like Kafka. Pulsar has the concepts of cursors and individual message acknowledgement per subscription, which allow you to do much more than restart from a particular offset. Also, you can have partitioned topics, exactly like Kafka, meaning you can, let's say, partition a topic into 100 partitions and have 100 consumers reading order-guaranteed partitions, exactly like Kafka. From our evaluation, absolutely 100% of the Kafka use cases can be implemented with Pulsar in the exact same way they are implemented with Kafka. That is one of the things that facilitated the decision to switch to Pulsar.
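The cursor/individual-ack model can be illustrated with a toy sketch in plain Python (this is not the Pulsar client API; class and subscription names are made up): each subscription keeps its own acknowledgement state over the shared log, acks do not have to be contiguous like a Kafka offset, and acking in one subscription never affects another.

```python
class Subscription:
    """Toy per-subscription cursor: tracks individually acked messages."""

    def __init__(self):
        self.acked = set()

    def ack(self, msg_id: int) -> None:
        # Individual ack: any message, in any order.
        self.acked.add(msg_id)

    def unacked(self, topic_len: int) -> list:
        # Messages this subscription still has to process.
        return [i for i in range(topic_len) if i not in self.acked]

topic = ["m0", "m1", "m2", "m3"]  # one shared, append-only log
subs = {"billing": Subscription(), "analytics": Subscription()}

subs["billing"].ack(0)
subs["billing"].ack(2)  # out-of-order ack: impossible with a single offset

assert subs["billing"].unacked(len(topic)) == [1, 3]
assert subs["analytics"].unacked(len(topic)) == [0, 1, 2, 3]  # unaffected
```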

The thing is that there is a lot that Pulsar does that is either impossible in Kafka or much easier done in Pulsar, both on the ops side and in client features. Let's look at a client example:

  • Pulsar can act as a queue if you do not need ordering for a particular subscription (notice that I said subscription, not topic). The beauty is that you do not have to choose to have a topic in "queue" mode; you decide at the subscription level. So let's say you have a topic partitioned into 10 partitions and you are consuming ordered messages with 10 consumers from one of your microservices (exactly like you would in Kafka). But now you develop a new microservice, let's say some analytics job that does not need ordering, and you need to replay all the billions (maybe trillions?) of messages from the beginning of time and send them to Hadoop. With Kafka, you would either be limited to 10 consumers, sacrificing throughput and taking forever to load data into Hadoop, or you would have to re-partition your topic into, let's say, 100 partitions and sacrifice the ordering you need for your old, in-production microservice (which is not really an option). With Pulsar, you just create a new shared subscription to the same topic, attach 1000 consumers to the subscription, and Pulsar will fan out the messages to Hadoop 100x faster, in an unordered manner, without messing with the topic partitions. To us, the fact that you do not have to think ahead about possible usages for a particular topic is awesome.

  • Another use case: our company has ~2000 customers (businesses) and each of those has around 50 "sub-customers" (also businesses). We want to have one exclusive topic per sub-customer, which means that we need around 100K topics, without accounting for growth. This functionality would power an external-facing API that will allow our customers to consume their messages directly from our messaging system, enabling them to build custom software for their enterprise needs. Try doing 100K topics in Kafka. Pulsar's architecture (BookKeeper) handles millions of topics easily. This use case was, in fact, what drove us to start looking around for alternatives, but we never thought we would end up replacing Kafka at all...

  • Another use case is that Pulsar can have non-persistent topics, allowing very high throughput for topics that do not need persistence at all. Like an in-memory queue, which is great for real-time apps.

  • On the ops side, what stands out is the decoupling of broker (stateless) and storage (BookKeeper), and also the fact that, in BookKeeper, reads and writes take different paths. If we have, for instance, a spike in reads (like 1000 consumers loading data from one topic into Hadoop), the writes will not be impacted at all, and vice versa.

  • With BookKeeper, you will never have to worry whether a partition will outgrow the physical disk. In Kafka, a partition HAS to fit on disk. BookKeeper is something like a distributed append-only log composed of even MORE distributed append-only "mini-logs" (ledgers).

  • Also, Pulsar already comes pretty well instrumented for Prometheus out of the box.
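The shared-subscription fan-out from the first bullet above can be sketched in plain Python (a simulation, not Pulsar or Kafka code; counts are made up): with partition-bound consumption, useful parallelism is capped at the partition count, while a shared subscription simply round-robins messages across however many consumers attach, at the cost of ordering.

```python
from itertools import cycle

messages = list(range(100))
partitions = 10

# Kafka-style: at most one consumer per partition -> parallelism of 10,
# but per-partition order is preserved.
by_partition = [[m for m in messages if m % partitions == p]
                for p in range(partitions)]
assert len(by_partition) == 10

# Shared-subscription style: 25 consumers on the same topic, messages
# dispatched round-robin with no ordering guarantee.
consumers = [[] for _ in range(25)]
for msg, consumer in zip(messages, cycle(consumers)):
    consumer.append(msg)

assert all(len(c) == 4 for c in consumers)  # 100 msgs / 25 consumers
```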

Well, I sound like a salesperson trying to sell my product. The truth is that I have absolutely nothing to do with Pulsar and couldn't care less whether people use it or not. As I said, I just mentioned it because Kafka does come with some toil. Kafka has evolved greatly since its initial development at LinkedIn, but it is software that wasn't initially built for the cloud-native world we are living in, and because of that we need to spend a lot of time making it work on top of things like Kubernetes, instrumenting it with newer tech like Prometheus, etc. Pulsar is newer software that was built at Yahoo EXACTLY to overcome some of the shortcomings of Kafka, and I'm really glad I stumbled upon it!

@solsson
Contributor

solsson commented Jan 2, 2018

Thanks for sharing this. I hadn't read as far as https://bookkeeper.apache.org/docs/latest/getting-started/concepts/#ledgers. I've lost count of all the "scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads" we've evaluated, like RethinkDB.

@solsson solsson closed this as completed in c805f9b Jan 6, 2019
@maaziPractice

disconnected write and read paths

what does this mean? thanks in advance @DanielFerreiraJorge

@DanielFerreiraJorge
Author

@maaziPractice
"Bookies are designed to handle thousands of ledgers with concurrent reads and writes. By using multiple disk devices (one for journal and another for general storage), bookies are able to isolate the effects of read operations from the latency of ongoing write operations."
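The design the quote describes can be reduced to a toy sketch in plain Python (hypothetical names, not BookKeeper code): every write goes to a journal (sequential and fsync'd on its own device in the real system) and to a general entry store, while reads consult only the store, so read traffic never sits in the write path.

```python
class ToyBookie:
    """Toy model of a bookie with separated write and read paths."""

    def __init__(self):
        self.journal = []  # write-ahead log, on its own disk device
        self.storage = {}  # indexed entry store, on another device

    def add_entry(self, ledger_id: int, entry_id: int, payload: bytes) -> None:
        # Durability first: append to the journal, then index into storage.
        self.journal.append((ledger_id, entry_id, payload))
        self.storage[(ledger_id, entry_id)] = payload

    def read_entry(self, ledger_id: int, entry_id: int) -> bytes:
        # Reads are served from storage only; the journal stays write-only.
        return self.storage[(ledger_id, entry_id)]

bookie = ToyBookie()
bookie.add_entry(1, 0, b"hello")
assert bookie.read_entry(1, 0) == b"hello"
assert len(bookie.journal) == 1  # every write also hit the journal
```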
