
Disable TCP compression when shipping Lucene segments during peer recovery #33844

Closed
ywelsch opened this issue Sep 19, 2018 · 20 comments
Labels
:Distributed/Network Http and internode communication implementations :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement Team:Distributed Meta label for distributed team

Comments

@ywelsch
Contributor

ywelsch commented Sep 19, 2018

Shipping Lucene segments during peer recovery is configured to set the transport compress option to false:

.withCompress(false) // lucene files are already compressed and therefore compressing this won't really help much so
// we are saving the cpu for other things

Unfortunately this is overridden when transport.tcp.compress is enabled, discarding the request-level option:

if (compress) {
options = TransportRequestOptions.builder(options).withCompress(true).build();
}

This means that when transport.tcp.compress is enabled, the (already compressed) Lucene files are unnecessarily compressed a second time, slowing peer recovery.

My preference would be that transport options that were explicitly set (withCompress(true) or withCompress(false)) take precedence over transport.tcp.compress.

Relates to https://discuss.elastic.co/t/elasticsearch-6-3-0-shard-recovery-is-slow/140940/
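
For illustration, here is a minimal, self-contained sketch of the proposed precedence rule (illustrative only; the tri-state enum and resolveCompress helper are not the actual Elasticsearch code): an option explicitly set on the request wins, and transport.tcp.compress only applies when the request leaves compression unspecified.

```java
// Illustrative sketch only -- not the Elasticsearch implementation.
// Models the request-level compression option as a tri-state so that an
// explicit withCompress(true/false) can take precedence over transport.tcp.compress.
public class CompressionPrecedence {

    enum Compress { UNSET, ENABLED, DISABLED }

    /** The request-level option wins; the global setting is only a fallback. */
    static boolean resolveCompress(Compress requestOption, boolean tcpCompressSetting) {
        switch (requestOption) {
            case ENABLED:  return true;                // explicit withCompress(true)
            case DISABLED: return false;               // explicit withCompress(false), e.g. file chunks
            default:       return tcpCompressSetting;  // fall back to transport.tcp.compress
        }
    }

    public static void main(String[] args) {
        // File-chunk request during peer recovery: compression explicitly disabled.
        System.out.println(resolveCompress(Compress.DISABLED, true));  // false
        // Ordinary request with no explicit choice: the global setting applies.
        System.out.println(resolveCompress(Compress.UNSET, true));     // true
    }
}
```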

@ywelsch ywelsch added >enhancement discuss :Distributed/Network Http and internode communication implementations :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Sep 19, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@jpountz
Contributor

jpountz commented Sep 19, 2018

+1 to making transport options take precedence over the setting.

It's not completely obvious to me that Lucene files don't need further compression: Lucene files are indeed compressed, but in a way that still provides quick (often random) access, which limits the effectiveness of the compression techniques that can be used. Re-compressing on top of Lucene might bring further space savings. It might still be the wrong trade-off, especially on a local network, but I wanted to clarify this point.
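
As a quick way to sanity-check this point (a rough standalone sketch, not part of Elasticsearch; it takes the path of a Lucene file as a command-line argument), one can run DEFLATE over an already-written segment file and compare sizes, which is essentially what the benchmark later in this thread measures:

```java
// Rough standalone check of how much DEFLATE shrinks an already-compressed
// Lucene file. Pass the path of a segment file (e.g. a .fdt or .doc file) as args[0].
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Deflater;

public class SegmentCompressionCheck {
    public static void main(String[] args) throws IOException {
        byte[] input = Files.readAllBytes(Path.of(args[0]));

        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();

        byte[] buffer = new byte[64 * 1024];
        long compressed = 0;
        while (!deflater.finished()) {
            compressed += deflater.deflate(buffer);   // count compressed output bytes
        }
        deflater.end();

        System.out.printf("original=%d bytes, deflated=%d bytes, saved=%.1f%%%n",
            input.length, compressed, 100.0 * (input.length - compressed) / input.length);
    }
}
```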

@jasontedor
Member

I agree that transport options set explicitly on the request should override the transport.tcp.compress setting. Like @jpountz, I am unsure that disabling compression on file-based recovery transfers is the right trade-off. I would like to see some experiments here. So we might end up with two changes:

  • request options override transport.tcp.compress
  • stop disabling compression on recovery

@jasontedor
Member

cc @original-brownbear @tbrooks8

@original-brownbear
Member

original-brownbear commented Sep 20, 2018

@ywelsch @jasontedor I experimented with compression performance today, and there is hard data from @DaveCTurner in a linked issue (more details there). Both my experiments and that data suggest that compression severely limits network throughput to around 20-30 MB/s (~160-240 Mbit/s) while saving only trivial amounts of network traffic (this turned out to be the result of some miscommunication, see the next comment) => it seems reasonable to just fix the request options to override transport.tcp.compress and keep disabling compression on recovery.
There is also anecdotal evidence in https://discuss.elastic.co/t/elasticsearch-6-3-0-shard-recovery-is-slow/140940/14 suggesting that compression is an issue here in practice.

@original-brownbear
Member

@ywelsch thanks for pointing out the issue with the data I worked from :)
There's definitely a traffic saving from compression here. I measured this with the example log dataset we provide in the docs, and it's about a 50% reduction in traffic for the whole recovery process end-to-end.

This makes the trade-off a little closer, I guess :)
Still, given the data we gathered on the runtime of recoveries and the reported issues, I think turning off compression wins overall (easier on the CPU and much faster on a 1 Gbit link). Enabling compression really only helps when the available bandwidth for recovery is limited.

@ywelsch
Contributor Author

ywelsch commented Sep 21, 2018

We discussed this during FixitFriday today. While re-compression on top of Lucene can bring further space savings (this needs more benchmarks, but numbers between 10% and 50% reduction in size were mentioned), it comes at the cost of additional CPU usage, which 1) can have a negative impact on indexing / search performance, as recovery is supposed to be a background task, and 2) due to the way we sequentially send files during recovery, can limit the throughput of single-shard recoveries. While we're less enthusiastic about adding a new configuration option just for recoveries, we want to explore always disabling compression for phase 1 of recoveries, even if transport.tcp.compress is enabled. This also appears to be what the code currently intends to do, even though it does not actually work that way. The first step will therefore be to figure out whether this behavior changed in a previous ES version.

@original-brownbear
Member

original-brownbear commented Sep 21, 2018

@ywelsch I looked into 2.4 and it seems the same issue pointed out here is present there as well:

The global setting overrides the per-request option exactly like it still does; see https://github.com/elastic/elasticsearch/blob/2.4/core/src/main/java/org/elasticsearch/transport/netty/NettyTransport.java#L832.

So while we even had an option (removed in #15235) to turn compression off for recovery, it had no effect in practice when compression was enabled, as far as I can tell.

=> seems like this issue isn't new at all.


That said, in the real world I'd still vote for turning off compression when shipping the segments, even when compression is globally enabled:

  • Keeping compression on here at best saves ~50% of the traffic, or speeds up recovery by ~50% when performance is gated by bandwidth.
  • Turning it off, on the other hand, gives you a ~10x speedup when bandwidth is not the bottleneck.
    • And there have been a number of complaints now about the performance of recovery with compression enabled.

=> IMO it's consistent to turn off compression for data that gets little benefit out of it in this special case.
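
For a rough sense of scale (illustrative arithmetic only, using the ~25 MB/s compressed throughput and ~50% traffic saving discussed above): recovering a 100 GB shard over a 1 Gbit/s link (~110 MB/s) takes on the order of 15 minutes uncompressed, whereas at ~25 MB/s even the halved ~50 GB on the wire takes over half an hour. Compression only starts to pay off once the link bandwidth drops to roughly the compressed throughput, i.e. a few hundred Mbit/s or less.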

@original-brownbear
Member

Also:

request options override transport.tcp.compress

Regardless of what we decide about how we want to transfer the segment data, shouldn't we fix this so that the request options actually override the global setting (or remove the option)?

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 29, 2018
* Individual setting should only be able to negate the global compress setting since `org.elasticsearch.transport.TcpTransport#canCompress` ensures that compression only ever happens if the global compression is enabled regardless of the `TransportRequestOptions`
* Disables compression of segment files during recovery to bring code and comment in line with each other
* Fixes elastic#33844
@original-brownbear
Member

Created a PR with the above suggestion #34959

@dnhatn
Member

dnhatn commented Jan 13, 2019

I ran a compression test with different datasets and chunk sizes. These indices use the default codec without force merge. Below are the results.

| Dataset       | chunk_size | Before  | After   | Reduction | Took  |
| ------------- | ---------- | ------- | ------- | --------- | ----- |
| nyc_taxis     | 512kb      | 38.7gb  | 27.8gb  | 28%       | 32.4m |
| nyc_taxis     | 1mb        | 38.7gb  | 27.7gb  | 28%       | 32.5m |
| nyc_taxis     | 2mb        | 38.7gb  | 27.7gb  | 28%       | 32.7m |
| nyc_taxis     | 4mb        | 38.7gb  | 27.7gb  | 28%       | 32.7m |
| pmc           | 512kb      | 20.2gb  | 16.9gb  | 16%       | 19.2m |
| pmc           | 1mb        | 20.2gb  | 16.9gb  | 16%       | 19.4m |
| pmc           | 2mb        | 20.2gb  | 16.9gb  | 16%       | 19.4m |
| pmc           | 4mb        | 20.2gb  | 16.9gb  | 16%       | 19.1m |
| geonames      | 512kb      | 3.1gb   | 2.1gb   | 30%       | 2.2m  |
| geonames      | 1mb        | 3.1gb   | 2.1gb   | 30%       | 2.3m  |
| geonames      | 2mb        | 3.1gb   | 2.1gb   | 30%       | 2.2m  |
| geonames      | 4mb        | 3.1gb   | 2.1gb   | 30%       | 2.2m  |
| osmgeopoints  | 512kb      | 3.3gb   | 2.6gb   | 22%       | 2.6m  |
| osmgeopoints  | 1mb        | 3.3gb   | 2.6gb   | 22%       | 2.5m  |
| osmgeopoints  | 2mb        | 3.3gb   | 2.6gb   | 22%       | 2.7m  |
| osmgeopoints  | 4mb        | 3.3gb   | 2.6gb   | 22%       | 2.7m  |
| logs-191998   | 512kb      | 677.4mb | 461mb   | 31%       | 30.9s |
| logs-191998   | 1mb        | 677.4mb | 460.8mb | 31%       | 29.6s |
| logs-191998   | 2mb        | 677.4mb | 460.7mb | 31%       | 29.5s |
| logs-191998   | 4mb        | 677.4mb | 460.6mb | 32%       | 29.9s |
| logs-201998   | 512kb      | 909.6mb | 622mb   | 31%       | 41.4s |
| logs-201998   | 1mb        | 909.6mb | 621.7mb | 31%       | 39.7s |
| logs-201998   | 2mb        | 909.6mb | 621.5mb | 31%       | 38.4s |
| logs-201998   | 4mb        | 909.6mb | 621.4mb | 31%       | 40.5s |
| logs-231998   | 512kb      | 890.1mb | 609.9mb | 31%       | 39.2s |
| logs-231998   | 1mb        | 890.1mb | 609.6mb | 31%       | 39s   |
| logs-231998   | 2mb        | 890.1mb | 609.4mb | 31%       | 39.6s |
| logs-231998   | 4mb        | 890.1mb | 609.3mb | 31%       | 39.3s |
| logs-211998   | 512kb      | 1.1gb   | 811.2mb | 32%       | 52.8s |
| logs-211998   | 1mb        | 1.1gb   | 810.8mb | 32%       | 52.2s |
| logs-211998   | 2mb        | 1.1gb   | 810.6mb | 32%       | 50.8s |
| logs-211998   | 4mb        | 1.1gb   | 810.5mb | 32%       | 52.4s |
| logs-181998   | 512kb      | 162.2mb | 106.3mb | 34%       | 7s    |
| logs-181998   | 1mb        | 162.2mb | 106.3mb | 34%       | 6.8s  |
| logs-181998   | 2mb        | 162.2mb | 106.3mb | 34%       | 6.7s  |
| logs-181998   | 4mb        | 162.2mb | 106.3mb | 34%       | 6.9s  |
| logs-241998   | 512kb      | 9.3gb   | 5.8gb   | 37%       | 6.9m  |
| logs-241998   | 1mb        | 9.3gb   | 5.8gb   | 37%       | 6.9m  |
| logs-241998   | 2mb        | 9.3gb   | 5.8gb   | 37%       | 6.9m  |
| logs-241998   | 4mb        | 9.3gb   | 5.8gb   | 37%       | 6.9m  |
| logs-221998   | 512kb      | 740.6mb | 501.4mb | 32%       | 33.5s |
| logs-221998   | 1mb        | 740.6mb | 501.2mb | 32%       | 31.2s |
| logs-221998   | 2mb        | 740.6mb | 501mb   | 32%       | 32.2s |
| logs-221998   | 4mb        | 740.6mb | 501mb   | 32%       | 32.6s |

The details can be found here.

/cc @ywelsch

@ywelsch
Contributor Author

ywelsch commented Jan 14, 2019

@dnhatn thanks for running this test. A few observations:

  • Increasing chunk_size does not appear to improve compression.
  • There is substantial CPU overhead with compression enabled relative to the network bandwidth saved (around 30%) when sending file chunk requests. On a fast network connection, this can significantly slow down peer recoveries.

@jpountz
Contributor

jpountz commented Jan 14, 2019

Another observation is that the compression ratio varies a lot depending on the file extension.

Out of curiosity, would it be possible to decide whether to enable compression dynamically, e.g. by starting out compressing half the chunks and then increasing or decreasing this fraction depending on which performs faster?
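
A rough sketch of what such an adaptive scheme could look like (purely illustrative; nothing like this exists in Elasticsearch as of this discussion): track the observed throughput of compressed and uncompressed chunks separately and nudge the fraction of compressed chunks toward whichever path currently moves data faster.

```java
// Illustrative sketch of an adaptive "compress a fraction of the chunks" controller,
// as floated above. Not an existing Elasticsearch feature.
import java.util.concurrent.ThreadLocalRandom;

public class AdaptiveCompression {
    private double compressFraction = 0.5;        // start by compressing half the chunks
    private double compressedBytesPerSec = 0;     // moving average of observed throughput
    private double uncompressedBytesPerSec = 0;
    private static final double ALPHA = 0.2;      // smoothing factor for the moving averages
    private static final double STEP = 0.05;      // how quickly the fraction is adjusted

    /** Decide whether the next chunk should be compressed. */
    public boolean shouldCompressNextChunk() {
        return ThreadLocalRandom.current().nextDouble() < compressFraction;
    }

    /** Record how long a chunk took to ship, then adjust the compression fraction. */
    public void onChunkSent(boolean compressed, long originalBytes, long nanos) {
        double bytesPerSec = originalBytes / (nanos / 1_000_000_000.0);
        if (compressed) {
            compressedBytesPerSec = ewma(compressedBytesPerSec, bytesPerSec);
        } else {
            uncompressedBytesPerSec = ewma(uncompressedBytesPerSec, bytesPerSec);
        }
        // Shift the fraction toward whichever path currently moves data faster.
        if (compressedBytesPerSec > uncompressedBytesPerSec) {
            compressFraction = Math.min(1.0, compressFraction + STEP);
        } else {
            compressFraction = Math.max(0.0, compressFraction - STEP);
        }
    }

    private static double ewma(double current, double sample) {
        return current == 0 ? sample : (1 - ALPHA) * current + ALPHA * sample;
    }
}
```

On a fast link the fraction would drift toward 0, on a constrained link toward 1, without needing a static setting.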

dnhatn added a commit that referenced this issue Jan 14, 2019
Today file chunks are sent sequentially, one by one, in peer recovery. This is a
correct choice since the implementation is straightforward and recovery is
network-bound most of the time. However, if the connection is encrypted, we
might not be able to saturate the network pipe because encrypting/decrypting
is CPU-bound rather than network-bound.

With this commit, a source node can send multiple file chunks (default: 2)
without waiting for the acknowledgments from the target.

Below are the benchmark results for PMC and NYC_taxis.

- PMC (20.2 GB)

| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| ----------| ---------| -------- | -------- | -------- | -------- |
| Plain     | 184s     | 137s     | 106s     | 105s     | 106s     |
| TLS       | 346s     | 294s     | 176s     | 153s     | 117s     |
| Compress  | 1556s    | 1407s    | 1193s    | 1183s    | 1211s    |

- NYC_Taxis (38.6GB)

| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| ----------| ---------| ---------| ---------| ---------| -------- |
| Plain     | 321s     | 249s     | 191s     |  *       | *        |
| TLS       | 618s     | 539s     | 323s     | 290s     | 213s     |
| Compress  | 2622s    | 2421s    | 2018s    | 2029s    | n/a      |

Relates #33844
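
The core idea of the change, as a simplified standalone sketch (the ChunkPipeliner class and sendChunkAsync callback are stand-ins, not the actual recovery code): a semaphore bounds how many chunks may be unacknowledged at once, so the source keeps reading and sending instead of stalling on every round trip.

```java
// Simplified standalone sketch of bounded chunk pipelining; the real recovery
// source handler is considerably more involved (error handling, resends, etc.).
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.Semaphore;
import java.util.function.BiConsumer;

public class ChunkPipeliner {
    private final int chunkSizeBytes;
    private final Semaphore inFlight;   // permits = max chunks in flight (default 2 in the commit above)

    public ChunkPipeliner(int chunkSizeBytes, int maxConcurrentChunks) {
        this.chunkSizeBytes = chunkSizeBytes;
        this.inFlight = new Semaphore(maxConcurrentChunks);
    }

    /**
     * Reads the file sequentially but allows several chunks to be on the wire at once.
     * {@code sendChunkAsync} is a stand-in for the transport call; it must invoke the
     * supplied Runnable once the target acknowledges the chunk.
     */
    public void send(InputStream file, BiConsumer<byte[], Runnable> sendChunkAsync)
            throws IOException, InterruptedException {
        byte[] buffer = new byte[chunkSizeBytes];
        int read;
        while ((read = file.read(buffer)) != -1) {
            inFlight.acquire();                               // blocks once maxConcurrentChunks are unacked
            byte[] chunk = Arrays.copyOf(buffer, read);
            sendChunkAsync.accept(chunk, inFlight::release);  // release the permit on acknowledgment
        }
        // A real implementation would also wait here for the remaining acknowledgments.
    }
}
```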
@jasontedor
Member

@dnhatn What was the network bandwidth between the nodes in your benchmark? Or was it localhost?

@dnhatn
Member

dnhatn commented Jan 14, 2019

@jasontedor It's a 12Gbit/s connection between two GCP instances.

@jasontedor
Member

Have you done benchmarks on slower networks or networks with high latency?

dnhatn added a commit that referenced this issue Jan 15, 2019
@dnhatn
Member

dnhatn commented Jan 15, 2019

@jasontedor Not yet. The above result is a pure compression test.

@rjernst rjernst added the Team:Distributed Meta label for distributed team label May 4, 2020
@obogobo

obogobo commented Oct 6, 2020

I think an option to disable transport.compress for peer recovery ONLY would be very useful, especially now that ILM uses peer recovery to handle rollover.

Some metrics: we index about 500K docs per second, 24/7, and rely on transport.compress in order not to go bankrupt on EC2 network transfer costs during online replication. This saves us nearly two-thirds on transfer!! But during rollover, transport.compress brings half the cluster to a standstill due to DEFLATE pegging CPU at ~90%, while not reducing the overall network transfer by any more than Lucene compression already does.

Our use case would benefit greatly from separate options for compression during online/offline replication and recovery.

@fcofdez
Contributor

fcofdez commented Sep 8, 2021

I'm closing this issue since there's been quite a bit of related work to optimize recoveries since it was opened.

@fcofdez fcofdez closed this as completed Sep 8, 2021