
Disable TCP compression when shipping Lucene segments during peer recovery #33844

Closed
ywelsch opened this issue Sep 19, 2018 · 20 comments
Labels
:Distributed/Network Http and internode communication implementations :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement Team:Distributed Meta label for distributed team

Comments

@ywelsch
Contributor

ywelsch commented Sep 19, 2018

Shipping Lucene segments during peer recovery is configured to set the transport compress option to false:

.withCompress(false) // lucene files are already compressed and therefore compressing this won't really help much so
// we are saving the cpu for other things

Unfortunately this is overridden when transport.tcp.compress is enabled, discarding the request-level option:

if (compress) {
options = TransportRequestOptions.builder(options).withCompress(true).build();
}

This means that when transport.tcp.compress is enabled, the (already compressed) Lucene files are unnecessarily compressed a second time, slowing peer recovery.

My preference would be that transport options that were explicitly set (withCompress(true) or withCompress(false)) take precedence over transport.tcp.compress.

Relates to https://discuss.elastic.co/t/elasticsearch-6-3-0-shard-recovery-is-slow/140940/
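
For illustration, here is a minimal, self-contained sketch of the proposed precedence rule (illustrative only; the tri-state enum and resolveCompress helper are not the actual Elasticsearch code): an option explicitly set on the request wins, and transport.tcp.compress only applies when the request leaves compression unspecified.

```java
// Illustrative sketch only -- not the Elasticsearch implementation.
// Models the request-level compression option as a tri-state so that an
// explicit withCompress(true/false) can take precedence over transport.tcp.compress.
public class CompressionPrecedence {

    enum Compress { UNSET, ENABLED, DISABLED }

    /** The request-level option wins; the global setting is only a fallback. */
    static boolean resolveCompress(Compress requestOption, boolean tcpCompressSetting) {
        switch (requestOption) {
            case ENABLED:  return true;                // explicit withCompress(true)
            case DISABLED: return false;               // explicit withCompress(false), e.g. file chunks
            default:       return tcpCompressSetting;  // fall back to transport.tcp.compress
        }
    }

    public static void main(String[] args) {
        // File-chunk request during peer recovery: compression explicitly disabled.
        System.out.println(resolveCompress(Compress.DISABLED, true));  // false
        // Ordinary request with no explicit choice: the global setting applies.
        System.out.println(resolveCompress(Compress.UNSET, true));     // true
    }
}
```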

@ywelsch ywelsch added >enhancement discuss :Distributed/Network Http and internode communication implementations :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Sep 19, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@jpountz
Contributor

jpountz commented Sep 19, 2018

+1 to making transport options take precedence over the setting.

It's not completely obvious to me that Lucene files don't need further compression: Lucene files are indeed compressed, but in a way that still provides quick (often random) access, which limits the effectiveness of the compression techniques that can be used. Re-compressing on top of Lucene might bring further space savings. It might still be the wrong trade-off, especially on a local network, but I wanted to clarify this point.
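
As a quick way to sanity-check this point (a rough standalone sketch, not part of Elasticsearch; it takes the path of a Lucene file as a command-line argument), one can run DEFLATE over an already-written segment file and compare sizes, which is essentially what the benchmark later in this thread measures:

```java
// Rough standalone check of how much DEFLATE shrinks an already-compressed
// Lucene file. Pass the path of a segment file (e.g. a .fdt or .doc file) as args[0].
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Deflater;

public class SegmentCompressionCheck {
    public static void main(String[] args) throws IOException {
        byte[] input = Files.readAllBytes(Path.of(args[0]));

        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();

        byte[] buffer = new byte[64 * 1024];
        long compressed = 0;
        while (!deflater.finished()) {
            compressed += deflater.deflate(buffer);   // count compressed output bytes
        }
        deflater.end();

        System.out.printf("original=%d bytes, deflated=%d bytes, saved=%.1f%%%n",
            input.length, compressed, 100.0 * (input.length - compressed) / input.length);
    }
}
```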

@jasontedor
Member

I agree that transport options set explicitly on the request should override the transport.tcp.compress setting. Like @jpountz, I am unsure that disabling compression on file-based recovery transfers is the right trade-off. I would like to see some experiments here. So we might end up with two changes:

  • request options override transport.tcp.compress
  • stop disabling compression on recovery

@jasontedor
Member

cc @original-brownbear @tbrooks8

@original-brownbear
Member

original-brownbear commented Sep 20, 2018

@ywelsch @jasontedor I experimented with compression performance today, and there is hard data from @DaveCTurner in a linked issue (more details there). Both my experiments and that data suggest that compression severely limits network throughput to around 20-30 MB/s (~160-240 Mbit/s) while saving only trivial amounts of network traffic (this turned out to be the result of some miscommunication, see the next comment) => it seems reasonable to just fix the request options to override transport.tcp.compress and keep disabling compression on recovery.
There is also anecdotal evidence in https://discuss.elastic.co/t/elasticsearch-6-3-0-shard-recovery-is-slow/140940/14 suggesting that compression is an issue here in practice.

@original-brownbear
Member

@ywelsch thanks for pointing out the issue with the data I worked from :)
There's definitely a traffic saving from compression here. I measured this with the example log dataset we provide in the docs, and it's about a 50% reduction in traffic for the whole recovery process end-to-end.

This makes the trade-off a little closer, I guess :)
Still, given the data we gathered on the runtime of recoveries and the reported issues, I think turning off compression wins overall (easier on the CPU and much faster on a 1 Gbit link). Enabling compression really only helps when the available bandwidth for recovery is limited.

@ywelsch
Contributor Author

ywelsch commented Sep 21, 2018

We discussed this during FixitFriday today. While re-compression on top of Lucene can bring further space savings (this needs more benchmarks, but numbers between 10% and 50% reduction in size were mentioned), it comes at the cost of additional CPU usage, which 1) can have a negative impact on indexing / search performance, as recovery is supposed to be a background task, and 2) due to the way we sequentially send files during recovery, can limit the throughput of single-shard recoveries. While we're less enthusiastic about adding a new configuration option just for recoveries, we want to explore always disabling compression for phase 1 of recoveries, even if transport.tcp.compress is enabled. This also appears to be what the code currently intends to do, even though it does not actually work that way. The first step will therefore be to figure out whether this behavior changed in a previous ES version.

@original-brownbear
Member

original-brownbear commented Sep 21, 2018

@ywelsch I looked into 2.4 and it seems the same issue pointed out here is present there as well:

The global setting overrides the per-request option exactly like it still does; see https://github.com/elastic/elasticsearch/blob/2.4/core/src/main/java/org/elasticsearch/transport/netty/NettyTransport.java#L832.

So while we even had an option (removed in #15235) to turn compression off for recovery, it had no effect in practice when compression was enabled, as far as I can tell.

=> seems like this issue isn't new at all.


That said, in the real world I'd still vote for turning off compression when shipping the segments, even when compression is globally enabled:

  • Keeping compression on here at best saves ~50% of the traffic, or speeds up recovery by ~50% when performance is gated by bandwidth.
  • Turning it off, on the other hand, gives you a ~10x speedup when bandwidth is not the bottleneck.
    • And there have been a number of complaints now about the performance of recovery with compression enabled.

=> IMO it's consistent to turn off compression for data that gets little benefit out of it in this special case.
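
For a rough sense of scale (illustrative arithmetic only, using the ~25 MB/s compressed throughput and ~50% traffic saving discussed above): recovering a 100 GB shard over a 1 Gbit/s link (~110 MB/s) takes on the order of 15 minutes uncompressed, whereas at ~25 MB/s even the halved ~50 GB on the wire takes over half an hour. Compression only starts to pay off once the link bandwidth drops to roughly the compressed throughput, i.e. a few hundred Mbit/s or less.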

@original-brownbear
Member

Also:

request options override transport.tcp.compress

Regardless of what we decide about how we want to transfer the segment data, shouldn't we fix this so that the request options actually override the global setting (or remove the option)?

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 29, 2018
* Individual setting should only be able to negate the global compress setting since `org.elasticsearch.transport.TcpTransport#canCompress` ensures that compression only ever happens if the global compression is enabled regardless of the `TransportRequestOptions`
* Disables compression of segment files during recovery to bring code and comment in line with each other
* Fixes elastic#33844
@original-brownbear
Member

Created a PR with the above suggestion #34959

@dnhatn
Member

dnhatn commented Jan 13, 2019

I ran a compression test with different datasets and chunk sizes. These indices use the default codec without force merge. Below are the results.

| Dataset       | chunk_size | Before  | After   | Reduction | Took  |
| ------------- | ---------- | ------- | ------- | --------- | ----- |
| nyc_taxis     | 512kb      | 38.7gb  | 27.8gb  | 28%       | 32.4m |
| nyc_taxis     | 1mb        | 38.7gb  | 27.7gb  | 28%       | 32.5m |
| nyc_taxis     | 2mb        | 38.7gb  | 27.7gb  | 28%       | 32.7m |
| nyc_taxis     | 4mb        | 38.7gb  | 27.7gb  | 28%       | 32.7m |
| pmc           | 512kb      | 20.2gb  | 16.9gb  | 16%       | 19.2m |
| pmc           | 1mb        | 20.2gb  | 16.9gb  | 16%       | 19.4m |
| pmc           | 2mb        | 20.2gb  | 16.9gb  | 16%       | 19.4m |
| pmc           | 4mb        | 20.2gb  | 16.9gb  | 16%       | 19.1m |
| geonames      | 512kb      | 3.1gb   | 2.1gb   | 30%       | 2.2m  |
| geonames      | 1mb        | 3.1gb   | 2.1gb   | 30%       | 2.3m  |
| geonames      | 2mb        | 3.1gb   | 2.1gb   | 30%       | 2.2m  |
| geonames      | 4mb        | 3.1gb   | 2.1gb   | 30%       | 2.2m  |
| osmgeopoints  | 512kb      | 3.3gb   | 2.6gb   | 22%       | 2.6m  |
| osmgeopoints  | 1mb        | 3.3gb   | 2.6gb   | 22%       | 2.5m  |
| osmgeopoints  | 2mb        | 3.3gb   | 2.6gb   | 22%       | 2.7m  |
| osmgeopoints  | 4mb        | 3.3gb   | 2.6gb   | 22%       | 2.7m  |
| logs-191998   | 512kb      | 677.4mb | 461mb   | 31%       | 30.9s |
| logs-191998   | 1mb        | 677.4mb | 460.8mb | 31%       | 29.6s |
| logs-191998   | 2mb        | 677.4mb | 460.7mb | 31%       | 29.5s |
| logs-191998   | 4mb        | 677.4mb | 460.6mb | 32%       | 29.9s |
| logs-201998   | 512kb      | 909.6mb | 622mb   | 31%       | 41.4s |
| logs-201998   | 1mb        | 909.6mb | 621.7mb | 31%       | 39.7s |
| logs-201998   | 2mb        | 909.6mb | 621.5mb | 31%       | 38.4s |
| logs-201998   | 4mb        | 909.6mb | 621.4mb | 31%       | 40.5s |
| logs-231998   | 512kb      | 890.1mb | 609.9mb | 31%       | 39.2s |
| logs-231998   | 1mb        | 890.1mb | 609.6mb | 31%       | 39s   |
| logs-231998   | 2mb        | 890.1mb | 609.4mb | 31%       | 39.6s |
| logs-231998   | 4mb        | 890.1mb | 609.3mb | 31%       | 39.3s |
| logs-211998   | 512kb      | 1.1gb   | 811.2mb | 32%       | 52.8s |
| logs-211998   | 1mb        | 1.1gb   | 810.8mb | 32%       | 52.2s |
| logs-211998   | 2mb        | 1.1gb   | 810.6mb | 32%       | 50.8s |
| logs-211998   | 4mb        | 1.1gb   | 810.5mb | 32%       | 52.4s |
| logs-181998   | 512kb      | 162.2mb | 106.3mb | 34%       | 7s    |
| logs-181998   | 1mb        | 162.2mb | 106.3mb | 34%       | 6.8s  |
| logs-181998   | 2mb        | 162.2mb | 106.3mb | 34%       | 6.7s  |
| logs-181998   | 4mb        | 162.2mb | 106.3mb | 34%       | 6.9s  |
| logs-241998   | 512kb      | 9.3gb   | 5.8gb   | 37%       | 6.9m  |
| logs-241998   | 1mb        | 9.3gb   | 5.8gb   | 37%       | 6.9m  |
| logs-241998   | 2mb        | 9.3gb   | 5.8gb   | 37%       | 6.9m  |
| logs-241998   | 4mb        | 9.3gb   | 5.8gb   | 37%       | 6.9m  |
| logs-221998   | 512kb      | 740.6mb | 501.4mb | 32%       | 33.5s |
| logs-221998   | 1mb        | 740.6mb | 501.2mb | 32%       | 31.2s |
| logs-221998   | 2mb        | 740.6mb | 501mb   | 32%       | 32.2s |
| logs-221998   | 4mb        | 740.6mb | 501mb   | 32%       | 32.6s |

The details can be found here.

/cc @ywelsch

@ywelsch
Contributor Author

ywelsch commented Jan 14, 2019

@dnhatn thanks for running this test. A few observations:

  • Increasing chunk_size does not appear to improve compression.
  • There is substantial CPU overhead with compression enabled relative to the network bandwidth saved (around 30%) when sending file chunk requests. On a fast network connection, this can significantly slow down peer recoveries.

@jpountz
Contributor

jpountz commented Jan 14, 2019

Another observation is that the compression ratio varies a lot depending on the file extension.

Out of curiosity, would it be possible to decide whether to enable compression dynamically, e.g. by starting out compressing half the chunks and then increasing or decreasing this fraction depending on which performs faster?
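
A rough sketch of what such an adaptive scheme could look like (purely illustrative; nothing like this exists in Elasticsearch as of this discussion): track the observed throughput of compressed and uncompressed chunks separately and nudge the fraction of compressed chunks toward whichever path currently moves data faster.

```java
// Illustrative sketch of an adaptive "compress a fraction of the chunks" controller,
// as floated above. Not an existing Elasticsearch feature.
import java.util.concurrent.ThreadLocalRandom;

public class AdaptiveCompression {
    private double compressFraction = 0.5;        // start by compressing half the chunks
    private double compressedBytesPerSec = 0;     // moving average of observed throughput
    private double uncompressedBytesPerSec = 0;
    private static final double ALPHA = 0.2;      // smoothing factor for the moving averages
    private static final double STEP = 0.05;      // how quickly the fraction is adjusted

    /** Decide whether the next chunk should be compressed. */
    public boolean shouldCompressNextChunk() {
        return ThreadLocalRandom.current().nextDouble() < compressFraction;
    }

    /** Record how long a chunk took to ship, then adjust the compression fraction. */
    public void onChunkSent(boolean compressed, long originalBytes, long nanos) {
        double bytesPerSec = originalBytes / (nanos / 1_000_000_000.0);
        if (compressed) {
            compressedBytesPerSec = ewma(compressedBytesPerSec, bytesPerSec);
        } else {
            uncompressedBytesPerSec = ewma(uncompressedBytesPerSec, bytesPerSec);
        }
        // Shift the fraction toward whichever path currently moves data faster.
        if (compressedBytesPerSec > uncompressedBytesPerSec) {
            compressFraction = Math.min(1.0, compressFraction + STEP);
        } else {
            compressFraction = Math.max(0.0, compressFraction - STEP);
        }
    }

    private static double ewma(double current, double sample) {
        return current == 0 ? sample : (1 - ALPHA) * current + ALPHA * sample;
    }
}
```

On a fast link the fraction would drift toward 0, on a constrained link toward 1, without needing a static setting.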

dnhatn added a commit that referenced this issue Jan 14, 2019
Today file chunks are sent sequentially, one by one, in peer recovery. This is a
correct choice since the implementation is straightforward and recovery is
network-bound most of the time. However, if the connection is encrypted, we
might not be able to saturate the network pipe because encrypting/decrypting
is CPU-bound rather than network-bound.

With this commit, a source node can send multiple file chunks (default: 2)
without waiting for the acknowledgments from the target.

Below are the benchmark results for PMC and NYC_taxis.

- PMC (20.2 GB)

| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| ----------| ---------| -------- | -------- | -------- | -------- |
| Plain     | 184s     | 137s     | 106s     | 105s     | 106s     |
| TLS       | 346s     | 294s     | 176s     | 153s     | 117s     |
| Compress  | 1556s    | 1407s    | 1193s    | 1183s    | 1211s    |

- NYC_Taxis (38.6GB)

| Transport | Baseline | chunks=1 | chunks=2 | chunks=3 | chunks=4 |
| ----------| ---------| ---------| ---------| ---------| -------- |
| Plain     | 321s     | 249s     | 191s     |  *       | *        |
| TLS       | 618s     | 539s     | 323s     | 290s     | 213s     |
| Compress  | 2622s    | 2421s    | 2018s    | 2029s    | n/a      |

Relates #33844
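
The core idea of the change, as a simplified standalone sketch (the ChunkPipeliner class and sendChunkAsync callback are stand-ins, not the actual recovery code): a semaphore bounds how many chunks may be unacknowledged at once, so the source keeps reading and sending instead of stalling on every round trip.

```java
// Simplified standalone sketch of bounded chunk pipelining; the real recovery
// source handler is considerably more involved (error handling, resends, etc.).
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.Semaphore;
import java.util.function.BiConsumer;

public class ChunkPipeliner {
    private final int chunkSizeBytes;
    private final Semaphore inFlight;   // permits = max chunks in flight (default 2 in the commit above)

    public ChunkPipeliner(int chunkSizeBytes, int maxConcurrentChunks) {
        this.chunkSizeBytes = chunkSizeBytes;
        this.inFlight = new Semaphore(maxConcurrentChunks);
    }

    /**
     * Reads the file sequentially but allows several chunks to be on the wire at once.
     * {@code sendChunkAsync} is a stand-in for the transport call; it must invoke the
     * supplied Runnable once the target acknowledges the chunk.
     */
    public void send(InputStream file, BiConsumer<byte[], Runnable> sendChunkAsync)
            throws IOException, InterruptedException {
        byte[] buffer = new byte[chunkSizeBytes];
        int read;
        while ((read = file.read(buffer)) != -1) {
            inFlight.acquire();                               // blocks once maxConcurrentChunks are unacked
            byte[] chunk = Arrays.copyOf(buffer, read);
            sendChunkAsync.accept(chunk, inFlight::release);  // release the permit on acknowledgment
        }
        // A real implementation would also wait here for the remaining acknowledgments.
    }
}
```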
@jasontedor
Member

@dnhatn What was the network bandwidth between the nodes in your benchmark? Or was it localhost?

@dnhatn
Member

dnhatn commented Jan 14, 2019

@jasontedor It's a 12Gbit/s connection between two GCP instances.

@jasontedor
Member

Have you done benchmarks on slower networks or networks with high latency?

dnhatn added a commit that referenced this issue Jan 15, 2019
@dnhatn
Member

dnhatn commented Jan 15, 2019

@jasontedor Not yet. The above result is a pure compression test.

@rjernst rjernst added the Team:Distributed Meta label for distributed team label May 4, 2020
@obogobo

obogobo commented Oct 6, 2020

I think an option to disable transport.compress for peer recovery ONLY would be very useful, especially now that ILM uses peer recovery to handle rollover.

Some metrics: we index about 500K docs per second, 24/7, and rely on transport.compress in order not to go bankrupt on EC2 network transfer costs during online replication. This saves us nearly two-thirds on transfer!! But during rollover, transport.compress brings half the cluster to a standstill due to DEFLATE pegging CPU at ~90%, while not reducing the overall network transfer by any more than Lucene compression already does.

Our use case would benefit greatly from separate options for compression during online/offline replication and recovery.

@fcofdez
Contributor

fcofdez commented Sep 8, 2021

I'm closing this issue since there's been quite a bit of related work to optimize recoveries since it was opened.

@fcofdez fcofdez closed this as completed Sep 8, 2021