
Enable compression in netty #8486

Closed
deepthidevaki opened this issue Dec 28, 2021 · 9 comments · Fixed by #8502

@deepthidevaki (Contributor)

Is your feature request related to a problem? Please describe.

More and more use cases are coming up where Zeebe is deployed across multiple data centers. That means the communication latency between two Zeebe brokers can be high.

When there is high latency between nodes, we have previously observed frequent leader changes.

While we expect that higher latency will impact the commit latency, and thus the overall process execution time and throughput, frequent leader changes are not acceptable. One way to prevent leader changes is to increase the election timeout. This means that failover time is also higher, as it takes longer to start an election when the current leader dies.

One of the causes for the frequent leader changes is the bandwidth limitation of TCP when the RTT is high.
https://accedian.com/blog/measuring-network-performance-latency-throughput-packet-loss/

Since we are sending a lot of data over the network, the latency has a big impact on replication throughput. This causes a lot of requests to time out, which in turn triggers leader elections.

Describe the solution you'd like

  • Compress messages that are sent over the network. It is easy to plug a compression algorithm into Netty.
    We can enable compression by adding the following lines when creating a channel (BasicServer/ClientChannelInitializer#initChannel); a fuller sketch follows below:
        channel.pipeline().addLast(ZlibCodecFactory.newZlibEncoder(ZlibWrapper.GZIP));
        channel.pipeline().addLast(ZlibCodecFactory.newZlibDecoder(ZlibWrapper.GZIP));

There are also other compression algorithms available.
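
As a minimal sketch of what that could look like (assuming Netty 4.x; the class name here is hypothetical, and in Zeebe the handlers would be registered in the existing BasicServer/ClientChannelInitializer#initChannel):

    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.handler.codec.compression.ZlibCodecFactory;
    import io.netty.handler.codec.compression.ZlibWrapper;

    // Hypothetical initializer that enables GZIP compression on a channel.
    final class GzipChannelInitializer extends ChannelInitializer<SocketChannel> {

      @Override
      protected void initChannel(final SocketChannel channel) {
        // The encoder compresses outbound bytes, the decoder decompresses inbound bytes.
        // Both ends of the connection must install the same codec pair.
        channel.pipeline().addLast(ZlibCodecFactory.newZlibEncoder(ZlibWrapper.GZIP));
        channel.pipeline().addLast(ZlibCodecFactory.newZlibDecoder(ZlibWrapper.GZIP));
        // ...followed by the existing protocol handlers (framing, message codecs, etc.).
      }
    }

Because the compression codecs are added ahead of the protocol handlers, they sit closest to the socket and transparently (de)compress the whole byte stream.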

Describe alternatives you've considered

  • To prevent frequent leader changes, we can increase the election timeout. But this would mean failover time is higher, as it takes longer to detect a leader failure.
  • Compress records before writing them to disk, so that Raft replicates already-compressed records. This may have the additional benefit of reducing disk I/O.

Additional context
See the comment below for benchmark results from a prototype.

@deepthidevaki deepthidevaki added the kind/feature Categorizes an issue or PR as a feature, i.e. new behavior label Dec 28, 2021
@deepthidevaki (Contributor, Author)

Here are benchmark results from a prototype where we introduced high latency between brokers on top of the default benchmark setup. We introduced the latency by running the following command on all brokers:

delay=30ms
kubectl --namespace $namespace exec $namespace-zeebe-0 -- tc qdisc replace dev eth0 root netem delay $delay

This introduces an RTT of 60ms.

  • First, we ran with a base image, high latency, and no compression enabled.
    [benchmark screenshots]
    Although this configuration did not introduce a leader change, we observed heartbeat misses:
    [heartbeat metrics screenshot]

  • Second, we ran with compression enabled in Netty and the same RTT of 60ms.
    [benchmark screenshots]

    With compression, the throughput is almost 3 times higher.

  • Third, we ran with compression, the default benchmark setup, and no additional latency.
    [benchmark screenshots]

    Compare this with the latest weekly benchmark:
    [benchmark screenshots]

    There is no big impact on performance.

Also see the effect of compression:
[metrics screenshots]

@deepthidevaki deepthidevaki added the scope/broker Marks an issue or PR to appear in the broker section of the changelog label Dec 28, 2021
@npepinpe npepinpe added this to Ready in Zeebe Dec 28, 2021
@npepinpe (Member) commented Dec 28, 2021

Go for it, it seems like easy, low-hanging fruit. We would need to pick the right compression algo. Common wisdom is gzip, as it has a very low memory impact, but it doesn't have the best compression/decompression times and/or ratio. Snappy is the one I usually see mentioned for streaming use cases, since it has low memory and CPU impact (at the cost of worse compression). But I guess GZip is usually pretty balanced.

I'm not sure how these compare to Zlib, Zstd, or Lzx tbh.

I would propose GZip, if only because it's balanced and AFAIK the most used for TCP/HTTP compression, but I'm happy to allocate time for benchmarking it against other algos which have different concerns (e.g. lz4 or snappy, which are optimized for fast compression/decompression at the cost of ratio).

@deepthidevaki (Contributor, Author)

We can also make it configurable and make more than one algorithm available. I think, for us, optimizing for time/CPU/memory is more important than the compression ratio.

@npepinpe (Member)

Then I would propose offering GZip and Snappy. Neither requires much configuration (as opposed to, say, Brotli); one is focused on compression ratio and the other on low resource usage. Of course we can discuss other alternatives (e.g. I'm not familiar with the difference between lz4 and Snappy, both have similar focuses).
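
As a rough sketch of how such a configurable choice could map onto Netty handlers (assuming Netty 4.x; the enum and class names here are hypothetical and not necessarily what the actual change uses):

    import io.netty.channel.socket.SocketChannel;
    import io.netty.handler.codec.compression.SnappyFrameDecoder;
    import io.netty.handler.codec.compression.SnappyFrameEncoder;
    import io.netty.handler.codec.compression.ZlibCodecFactory;
    import io.netty.handler.codec.compression.ZlibWrapper;

    // Hypothetical configuration value for the compression algorithm.
    enum CompressionAlgorithm {
      NONE,
      GZIP,
      SNAPPY
    }

    final class CompressionConfigurer {

      // Adds the codec pair matching the configured algorithm to the channel pipeline.
      static void configure(final SocketChannel channel, final CompressionAlgorithm algorithm) {
        switch (algorithm) {
          case GZIP:
            // Better compression ratio, somewhat higher CPU cost.
            channel.pipeline().addLast(ZlibCodecFactory.newZlibEncoder(ZlibWrapper.GZIP));
            channel.pipeline().addLast(ZlibCodecFactory.newZlibDecoder(ZlibWrapper.GZIP));
            break;
          case SNAPPY:
            // Lower compression ratio, but low CPU and memory overhead.
            channel.pipeline().addLast(new SnappyFrameEncoder());
            channel.pipeline().addLast(new SnappyFrameDecoder());
            break;
          case NONE:
          default:
            // No compression handlers added; the pipeline stays unchanged.
            break;
        }
      }
    }

Both ends of a connection would need to be configured with the same algorithm, since the decoder on one side must match the encoder on the other.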

@deepthidevaki deepthidevaki self-assigned this Jan 3, 2022
@lenaschoenburg (Member)

I think zstd would be a strong contender as well, maybe even instead of gzip. It should be faster than gzip with the same compression ratio.

@deepthidevaki (Contributor, Author)

I have added Gzip and Snappy in #8502. We can add zstd as a follow-up. Right now, I don't see any performance impact when using Gzip, so I don't know if we will see any difference when using zstd. Besides, I had already started testing with Gzip before your comment, @oleschoenburg.

@falko (Member) commented Feb 12, 2022

Did you measure how the CPU usage changes with and without compression? Are we penalizing users with low network latency, i.e. running all brokers in the same data center, if we enable this by default?

@falko (Member) commented Feb 12, 2022

Okay, I saw that it's optional. From your benchmarks, do you have a feeling for the network latency at which it becomes worth the CPU investment?

@npepinpe (Member)

IIRC, Deepthi/Ole didn't notice any performance impact with either gzip or snappy. That said, Snappy is specifically designed for the use case where you want a bit of compression for very little overhead. So users worried about overhead but who may face a bit of latency could use Snappy. I would recommend starting with GZIP in general, and if you see no difference, stick with it, as it gives the best compression. If there is some impact, switch to Snappy, and if you're sure latency is not a concern, just keep compression disabled.

This issue was closed.