Indexing performance degradation in 6.0.0-beta1 #26339

Closed
muralikpbhat opened this Issue Aug 23, 2017 · 19 comments

@muralikpbhat

muralikpbhat commented Aug 23, 2017

Describe the feature: Indexing performance

Elasticsearch version (bin/elasticsearch --version): 6.0.0-beta1

Plugins installed: none

JVM version (java -version): openjdk version "1.8.0_131"

OS version (uname -a if on a Unix-like system): Linux 4.9.32-15.41.amzn1.x86_64 # 1 SMP Thu Jun 22 06:20:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Host: i3.16xl EC2 instance with 31GB heap configured, indexing buffer size of 50%, and data.path spread across 8 disks.

Description of the problem including expected versus actual behavior: Indexing throughput is degraded by up to 50% compared to 5.5.2. We are seeing hotspots in LocalCheckpointTracker.markSeqNoAsCompleted(), generateSeqNo(), etc. The following shows the results for different tracks:

| Track | 5.5.2 | 6.0.0-beta1 |
|---|---|---|
| genomes | 178940 docs/s | 88310 docs/s |
| nyc_taxis | 208638 docs/s | 150481 docs/s |
| logging | 363691 docs/s | 289247 docs/s |

Steps to reproduce:

  1. Restrict the rally challenge of each track to the following:
     {
       "name": "append-no-conflicts-just-index",
       "description": "benchmark",
       "index-settings": {
         "index.number_of_replicas": 0,
         "index.number_of_shards": 8,
         "index.refresh_interval": "100s",
         "index.translog.flush_threshold_size": "4gb",
         "index.translog.durability": "async",
         "index.translog.sync_interval": "240s"
       },
       "schedule": [
         {
           "operation": "index",
           "warmup-time-period": 240,
           "clients": 48
         }
       ]
     }

  2. Run the challenge against 5.5.2 and 6.0.0-beta1 versions.
    esrally --pipeline benchmark-only --track=nyc_taxis --challenge append-no-conflicts-just-index --target-hosts=XX.XX.XX.XX:9200

@bleskes

Member

bleskes commented Aug 23, 2017

Thanks for trying 6.0.0; I added you to the pioneer program.

Seeing hotspots in LocalCheckpointTracker.markSeqNoAsCompleted(), generateSeqNo()

Those methods are indeed synchronization points; we assumed they would be much lighter than indexing. Can you share the output that makes you conclude this?

    "index.translog.flush_threshold_size": "4gb",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "240s"

This indeed removes a common bottleneck, shifting it to another place. Can you try without these settings? I want to confirm that the results then become closer, as our own benchmarks suggest.

@coderplay

coderplay commented Aug 23, 2017

The degradation is caused by this line, introduced by this commit. The commit's original purpose was refactoring, but it added a lock in the critical path of indexing.

@danielmitterdorfer

Member

danielmitterdorfer commented Aug 24, 2017

Thanks for reporting this! We've reproduced the issue on our nightly benchmarking hardware with a few modifications:

  • 16GB heap configured (our machines have "only" 32GB)
  • 32 clients

I have used the nyc_taxis track and I've seen the following median throughput:

  • 6.0.0-beta1: 92261 docs/s
  • 5.5.2: 98220 docs/s

While the effect is not as pronounced as in your case, it is still significant. For 6.0.0-beta1, we are seeing lock contention in LocalCheckpointTracker where locks were held for roughly 4s in total during the benchmark (for comparison: locks on TranslogWriter, which is the lock with the highest contention, were held for a total of 49 seconds).

In 80 out of 94 captured traces, the lock was held via LocalCheckpointTracker#generateSeqNo(). I'll implement this method in a lock-free manner.

Locks in LocalCheckpointTracker.markSeqNoAsCompleted(long) were held 14 times for a total of 580ms. As this method is (a) a lot trickier to implement lock-free and (b) contributes less to contention (in our tests), I'll only reduce the scope of the lock a bit.
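To illustrate the direction (a hypothetical sketch, not Elasticsearch's actual code), a lock-free `generateSeqNo()` can be built on `AtomicLong`; the class name `SeqNoSketch` is invented for this example:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: replaces a synchronized counter with an atomic one.
// AtomicLong#getAndIncrement compiles down to a CAS/fetch-and-add, so many
// indexing threads can draw sequence numbers without blocking each other.
class SeqNoSketch {
    private final AtomicLong nextSeqNo = new AtomicLong(0);

    long generateSeqNo() {
        return nextSeqNo.getAndIncrement();
    }
}
```

Each caller still gets a unique, monotonically increasing number, but contention is handled by the CPU's atomic instructions instead of a monitor lock.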

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Aug 24, 2017

Implement local sequence number counting lock-free
With this commit we use CAS instead of locking to guard the local
sequence number counter. We do this to reduce lock contention when a
large number of clients indexes data in Elasticsearch.

This commit only targets `LocalCheckpointTracker#generateSeqNo()` and
the related methods but leaves
`LocalCheckpointTracker#markSeqNoAsCompleted()` more or less in its
current state because we have seen it to cause less lock contention than
the other method and it is also harder to implement lock-free.

Closes elastic#26339
@danielmitterdorfer

Member

danielmitterdorfer commented Aug 25, 2017

Repeating my comment on the related PR #26362 here:

After running benchmarks in our nightly benchmarking environment, we see the following median throughput for the nyc_taxis track (24 clients):

  • 5.5.2: 91773 docs/s
  • With #26362: 82125 docs/s
  • Without #26362: 82258 docs/s

In summary: This did not improve the situation.

We do, however, see that with this PR the monitor lock of org.elasticsearch.index.seqno.LocalCheckpointTracker is now blocked for 10.3 seconds during the benchmark (previously ~4 seconds). In the majority of cases, the affected method is #markSeqNoAsCompleted(long). Hence, we'll look into improving #markSeqNoAsCompleted(long) too.

@danielmitterdorfer danielmitterdorfer self-assigned this Aug 25, 2017

@coderplay

coderplay commented Aug 26, 2017

@danielmitterdorfer Try a more powerful machine. A more powerful machine gains better parallelism if there is no lock contention, so the throughput difference will be larger.

@danielmitterdorfer

Member

danielmitterdorfer commented Aug 28, 2017

Try a more powerful machine. A more powerful machine gains better parallelism if there is no lock contention, so the throughput difference will be larger.

I totally agree with this statement, but for the time being it's just important that we see a measurable difference due to the same hotspots that you are describing. Once we're confident that the problem is fixed, I think it makes sense to run another benchmark on a more powerful machine.

@danielmitterdorfer

Member

danielmitterdorfer commented Sep 1, 2017

To summarize the work so far:

I've reimplemented LocalCheckpointTracker so that it is completely lock-free. In the default case, it even avoids notification if it is certain that there are no waiting threads anyway (see LocalCheckpointTracker#waitForOpsToComplete(long)).
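The notification-avoidance idea can be sketched roughly as follows (a hypothetical `NotifySketch` class, not the actual LocalCheckpointTracker code): waiters register in an atomic counter, so the completing thread only takes the monitor and notifies when at least one thread is actually waiting.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: skip notifyAll() when nobody is waiting. Both fields
// use sequentially consistent operations (volatile / atomic), so a waiter
// either sees the new checkpoint or is seen by the advancing thread.
class NotifySketch {
    private final AtomicInteger waiters = new AtomicInteger();
    private volatile long checkpoint = -1;

    void waitForOpsToComplete(long seqNo) throws InterruptedException {
        waiters.incrementAndGet();      // announce ourselves before checking
        try {
            synchronized (this) {
                while (checkpoint < seqNo) {
                    wait();
                }
            }
        } finally {
            waiters.decrementAndGet();
        }
    }

    void advanceTo(long seqNo) {
        checkpoint = seqNo;
        if (waiters.get() > 0) {        // fast path: no waiters, no monitor
            synchronized (this) {
                notifyAll();
            }
        }
    }

    long checkpoint() {
        return checkpoint;
    }
}
```

In the common indexing path nobody calls `waitForOpsToComplete`, so `advanceTo` never touches the monitor at all.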

I've run the nyc_taxis track with the following challenge:

{
  "name": "index-contention",
  "description": "Indexes the whole document corpus with more clients",
  "index-settings": {
    "index.number_of_replicas": 0,
    "index.number_of_shards": 8,
    "index.refresh_interval": "100s",
    "index.translog.flush_threshold_size": "4gb",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "240s"
  },
  "schedule": [
    {
      "operation": "index",
      "warmup-time-period": 240,
      "clients": 24
    }
  ]
}

Note: So far I could never finish any benchmark with the originally stated number of 48 clients without running into bulk rejections and I consider a benchmark with an error rate > 0 invalid. Hence, I've reduced the number of clients to 24.

On our nightly benchmarking hardware (4 cores, 32GB) with a heap size of 16GB and indices.memory.index_buffer_size: 50%, we see the following results (median indexing throughput):

  • 5.5.2: 90844 docs/s
  • 6.0.0-beta2-SNAPSHOT (commit hash 41675c1): 83611 docs/s
  • 6.0.0-beta2-SNAPSHOT + lock-free changes: 83208 docs/s

So for this configuration we do not see any noticeable difference between different implementations on 6.0.0-beta2 that cannot be attributed to run-to-run variance.

One significant contributor is contention in TranslogWriter due to frequent flushing of checkpoints. Therefore I also ran a benchmark with the index setting index.translog.generation_threshold_size: "2gb" (the default is 64mb) and we do indeed see an improvement. I only tested the configuration "6.0.0-beta2-SNAPSHOT + lock-free changes", which resulted in a median indexing throughput of 86937 docs/s. Still not there, but much closer. We also see less contention in TranslogWriter as checkpoints are flushed less often.

Finally, I also ran a benchmark on the same hardware as stated originally, i.e. i3.16xl with 8 data paths (formatted with ext4), a heap size of 31GB and indices.memory.index_buffer_size: 50%. The load generator (Rally) was on a c3.8xlarge in the same availability zone. I've removed all data and trimmed the disks before each run.

  • 5.5.2: 240115 docs/s
  • 6.0.0-beta2-SNAPSHOT (commit hash 41675c1): 144277 docs/s
  • 6.0.0-beta2-SNAPSHOT + lock-free changes: 156933 docs/s

In this case, we can see a noticeable improvement due to the lock-free implementation of LocalCheckpointTracker. An additional run with index.translog.generation_threshold_size: "2gb" did not show any improvement, though (median indexing throughput: 153709 docs/s).

As far as I can tell at the moment, after resolving the problems in LocalCheckpointTracker, the main bottleneck seems to be heavy contention in TranslogWriter, but this is subject to further investigation.

@danielmitterdorfer

Member

danielmitterdorfer commented Sep 4, 2017

Further experiments have shown that a major contributor to the higher contention in 6.0.0 is a higher number of threads in the system due to #20874. An i3.16xl has 64 cores, which results in 64 bulk indexing threads on 6.0.0, whereas this was bounded to 32 bulk indexing threads on 5.x. With processors: 32 set in config/elasticsearch.yml, we now see much saner behavior w.r.t. lock contention.

With the same setup as before (same instance types but different instances) we get the following median indexing throughput:

  • 5.5.2: 251121 docs/s
  • 6.0.0-beta2-SNAPSHOT + lock-free changes with default processor settings: 152846 docs/s
  • 6.0.0-beta2-SNAPSHOT + lock-free changes with processors: 32: 136893 docs/s

The top-contended lock in 6.0.0 is the monitor lock on TranslogWriter. Across the three benchmark candidates, this lock was contended the following number of times during the benchmark:

  • 5.5.2: 1,050
  • 6.0.0-beta2-SNAPSHOT + lock-free changes with default processor settings: 143,030
  • 6.0.0-beta2-SNAPSHOT + lock-free changes with processors: 32: 13,708

While this measure has drastically reduced the number of times this lock is contended, we still see a significant difference in indexing throughput for these settings and hardware.

@danielmitterdorfer

Member

danielmitterdorfer commented Sep 6, 2017

After further benchmarks have hinted that the cause may be Lucene, I've started bisecting. Indeed, we see a major drop in median indexing throughput after upgrading to Lucene 7 (measured on an i3.16xlarge):

  • last revision with Lucene 6.x (revision f217eb8): 241561 docs/s
  • first revision with Lucene 7.0 (revision 4632661): 128667 docs/s

Here is the achieved median indexing throughput on a 4 core machine for comparison:

  • last revision with Lucene 6.x: 83898 docs/s
  • first revision with Lucene 7.0: 82558 docs/s

So the drop we see is significantly smaller on machines with a lower number of cores.

The cause is not yet clear; two possibilities are JIT compiler behavior or lock contention.

@danielmitterdorfer

Member

danielmitterdorfer commented Sep 8, 2017

The performance drop is caused by a write to an unused variable in Lucene. We've seen that a significant portion of time is spent in related code. Here is a snippet of a JMH microbenchmark with the perfasm profiler:

....[Hottest Region 1]..............................................................................
C2, level 4, org.apache.lucene.index.DefaultIndexingChain$PerField::invert, version 1231 (792 bytes)

                          0x00007f52d56b026a: xor    %r10d,%r10d        ;*invokevirtual isAssignableFrom
                                                                        ; - org.apache.lucene.util.AttributeSource::addAttribute@28 (line 207)
                                                                        ; - org.apache.lucene.document.Field$BinaryTokenStream::<init>@8 (line 512)
                                                                        ; - org.apache.lucene.document.Field::tokenStream@82 (line 491)
                                                                        ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@99 (line 729)
                          0x00007f52d56b026d: mov    $0x8,%r11d         ;*invokeinterface iterator
                                                                        ; - org.apache.lucene.util.AttributeSource::getCurrentState@46 (line 254)
                                                                        ; - org.apache.lucene.util.AttributeSource::clearAttributes@1 (line 269)
                                                                        ; - org.apache.lucene.document.Field$BinaryTokenStream::incrementToken@10 (line 532)
                                                                        ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@153 (line 736)
  0.00%    0.02%       ↗  0x00007f52d56b0273: test   %r10,%r10
                  ╭    │  0x00007f52d56b0276: je     0x00007f52d56b0292  ;*getfield fieldsData
                  │    │                                                ; - org.apache.lucene.document.Field::binaryValue@1 (line 441)
                  │    │                                                ; - org.apache.lucene.document.Field::tokenStream@65 (line 487)
                  │    │                                                ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@99 (line 729)
  0.00%    0.00%  │    │  0x00007f52d56b0278: mov    (%r11),%rsi        ;*getfield next
                  │    │                                                ; - java.util.HashMap::getNode@137 (line 580)
                  │    │                                                ; - java.util.LinkedHashMap::get@6 (line 440)
                  │    │                                                ; - org.apache.lucene.util.AttributeSource::getAttribute@6 (line 245)
                  │    │                                                ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@143 (line 734)
  0.09%    0.51%  │    │  0x00007f52d56b027b: mov    0x18(%rsi),%r8
 23.70%    3.54%  │    │  0x00007f52d56b027f: mov    $0x7f4926a81d88,%rcx  ;   {metadata('org/apache/lucene/analysis/tokenattributes/CharTermAttribute')}

I've created a custom version of Lucene (and Elasticsearch) where the only difference is that I've eliminated this variable. Here are the results of the nyc_taxis benchmark (again same benchmark setup as always):

  • Baseline: last revision with Lucene 6.x (revision f217eb8): 218361 docs/s
  • first revision with Lucene 7.0 (revision 4632661): 125418 docs/s
  • first revision with Lucene 7.0 (revision 4632661) + unused variable removed: 221237 docs/s

I've raised LUCENE-7963 together with a patch. As soon as we have a new Lucene snapshot build which contains my patch, we can iterate here. The next step is then to see whether we still need a lock-free implementation of LocalCheckpointTracker (i.e. #26362).

@nitsanw

nitsanw commented Sep 8, 2017

@danielmitterdorfer If I read this correctly (and there's always a chance of misreading assembly), the blamed line here is falsely accused; at least the assembly line is (I'm not arguing about the unused local var). The line:
0x00007f52d56b027f: mov $0x7f4926a81d88,%rcx
sets the RCX register to a constant, so it costs nothing and cannot be the issue. It's likely that the issue is in fact one of the lines preceding it:

  0.00%    0.00%  │    │  0x00007f52d56b0278: mov    (%r11),%rsi        ;*getfield next
                  │    │                                                ; - java.util.HashMap::getNode@137 (line 580)
                  │    │                                                ; - java.util.LinkedHashMap::get@6 (line 440)
                  │    │                                                ; - org.apache.lucene.util.AttributeSource::getAttribute@6 (line 245)
                  │    │                                                ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@143 (line 734)
  0.09%    0.51%  │    │  0x00007f52d56b027b: mov    0x18(%rsi),%r8
 23.70%    3.54%  │    │  0x00007f52d56b027f: mov    $0x7f4926a81d88,%rcx  ;   {metadata('org/apache/lucene/analysis/tokenattributes/CharTermAttribute')}
  0.00%    0.00%  │    │  0x00007f52d56b0289: cmp    %rcx,%r8
                  │    │  0x00007f52d56b028c: jne    0x00007f52d56b0949  ;*instanceof
                  │    │                                                ; - org.apache.lucene.document.Field::binaryValue@4 (line 441)
                  │    │                                                ; - org.apache.lucene.document.Field::tokenStream@65 (line 487)
                  │    │                                                ; - org.apache.lucene.index.DefaultIndexingChain$PerField::invert@99 (line 729)

The cost would be the code here: https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/core/src/java/org/apache/lucene/util/AttributeSource.java#L244, which is a HashMap lookup and probably a costly cache miss. The cache-miss theory is supported by the poor CPI (blamed on the wrong instruction) of roughly 6.5 cycles per instruction. I would hazard a guess that the hash lookup is a loop (see here), so this is not one cache miss but many, all blamed on the wrong instruction.

The line of code, while initialising an unused local variable, cannot easily be eliminated by the JVM because the lookup code may have a side effect (e.g. return the wrong type and trigger a ClassCastException), and it is hard to prove that a complex piece of code such as the one triggered to initialise the variable has no side effects.

In any case, removing the unused variable is (as backed by your results) the right solution to the issue :-)
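A toy example (invented for illustration, unrelated to Lucene's actual code) of why such a dead store survives dead-code elimination: the variable is never read, but its initializer contains a cast that can throw, which is an observable side effect the JIT must preserve.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration: 'unused' is never read, but the JIT cannot delete the
// assignment because the lookup plus cast on the right-hand side can throw
// ClassCastException, an observable side effect.
class DeadStoreSketch {
    static final Map<String, Object> attributes = new HashMap<>();

    static int invert(String key) {
        CharSequence unused = (CharSequence) attributes.get(key); // may throw
        return 42; // the "real" work of the method
    }
}
```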

@danielmitterdorfer

Member

danielmitterdorfer commented Sep 11, 2017

The cost would be the code in here: https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/core/src/java/org/apache/lucene/util/AttributeSource.java#L244 [...] which is a HashMap lookup, which is probably a costly cache miss.

I have to admit I misread this instruction (reading assembly is not something I do every day ;)). Your explanation is a much more sound account of why we see such a huge cost.

The line of code, while initialising an unused local variable, cannot be easily eliminated by the JVM as the lookup code may have a side effect (e.g. return the wrong type and trigger a ClassCastException), and it is hard to prove that a complex piece of code such as the one triggered to init the variable has no side effects.

That makes perfect sense to me and also solves the second part of the puzzle. Thanks a lot for your explanations @nitsanw. I really appreciate it.

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Sep 11, 2017

Advance local checkpoint lock-free
With this commit we advance the local checkpoint lock-free. This
improves throughput in the contended case. We also add a test case that
exercises the affected code paths from many threads to make concurrency
bugs more likely to surface.

Relates elastic#26339
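The commit's approach can be sketched roughly as follows (a hypothetical `CheckpointSketch`, not the actual implementation): threads record completion in a concurrent set and then race via CAS to advance the checkpoint over every contiguous completed sequence number.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of lock-free checkpoint advancement. Losing a CAS race
// is harmless: the winning thread observes the same completed set and will
// advance the checkpoint on our behalf.
class CheckpointSketch {
    private final AtomicLong checkpoint = new AtomicLong(-1);
    private final Set<Long> completed = ConcurrentHashMap.newKeySet();

    void markSeqNoAsCompleted(long seqNo) {
        completed.add(seqNo);
        advance();
    }

    private void advance() {
        while (true) {
            long current = checkpoint.get();
            if (!completed.contains(current + 1)) {
                return; // gap: a lower seqNo is still in flight
            }
            if (checkpoint.compareAndSet(current, current + 1)) {
                completed.remove(current + 1); // drop the processed marker
            }
            // on CAS failure, loop and re-read the new checkpoint
        }
    }

    long getCheckpoint() {
        return checkpoint.get();
    }
}
```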
@danielmitterdorfer

Member

danielmitterdorfer commented Sep 11, 2017

I've run final benchmarks now. It's the same benchmarking scenario as always but I'll summarize it here.

Benchmark Setup and Methodology

  • Load driver machine: c3.8xlarge, Ubuntu 16.04
  • Target machine: i3.16xlarge, Ubuntu 16.04

Elasticsearch is configured to use 8 data paths (ext4-formatted) and 31 GB heap size. Before each trial run, we wipe the disks and run TRIM.

config/elasticsearch.yml:

network.host: 0.0.0.0
path.data: [/mnt/data0, /mnt/data1, /mnt/data2, /mnt/data3, /mnt/data4, /mnt/data5, /mnt/data6, /mnt/data7]
indices.memory.index_buffer_size: 50%

config/jvm.options: out of the box configuration, except that we set:

-Xms31G
-Xmx31G

We use Rally and the nyc_taxis track with the following challenge:

{
  "name": "index-contention",
  "description": "Indexes the whole document corpus with more clients",
  "index-settings": {
    "index.number_of_replicas": 0,
    "index.number_of_shards": 8,
    "index.refresh_interval": "100s",
    "index.translog.flush_threshold_size": "4gb",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "240s"
  },
  "schedule": [
    {
      "operation": "index",
      "warmup-time-period": 240,
      "clients": 24
    }
  ]
}

Results

| Version | processors setting | Min Indexing Throughput | Median Indexing Throughput | Max Indexing Throughput |
|---|---|---|---|---|
| 5.5.2 | "default" (was bounded to 32) | 229286 docs/s | 239590 docs/s | 242889 docs/s |
| 6.0.0-beta2* | 32 | 230524 docs/s | 236368 docs/s | 242602 docs/s |
| 6.0.0-beta2* | "default" (unbounded, i.e. 64) | 227845 docs/s | 238880 docs/s | 246613 docs/s |
| 6.0.0-beta2* + #26362 | 32 | 228910 docs/s | 234151 docs/s | 240606 docs/s |
| 6.0.0-beta2* + #26362 | "default" (unbounded, i.e. 64) | 221387 docs/s | 237433 docs/s | 242800 docs/s |

*) includes the patch from LUCENE-7963

Analysis

We now see comparable performance between 5.5.2 and 6.0.0-beta2. In JFR (Java Flight Recorder) we can see that the new unbounded default for processors causes significantly more contention than with 32 processors. This is also why we are not seeing much improvement between setting processors to 32 and keeping the default value. While this is an unfortunate situation, I think fixing these contention issues is out of scope for this ticket.

Comparing the lock-free implementation of LocalCheckpointTracker (see #26362) with the current implementation does not yield any benefits, which is why I will close that PR without merging it. Flight recordings show that we just put more pressure on the monitor lock on TranslogWriter. So while it may be beneficial in the future to implement LocalCheckpointTracker lock-free for machines with a high core count, I think we first need to address other issues before we can see a benefit.

I'll leave this ticket open until we've actually upgraded to a Lucene version that includes the patch from LUCENE-7963.

Thanks @muralikpbhat for bringing this up.

@muralikpbhat

muralikpbhat commented Sep 11, 2017

Thanks @danielmitterdorfer for taking this to a nice conclusion. Very interesting find of the getAttribute() issue in LUCENE-7963.

In JFR (Java flight recorder) we can see that the new unbounded default for processors causes significantly more contention than with 32 processors. This is also the reason we are not seeing much improvement between setting processors to 32 and keeping the default value.

I think we first need to address other issues before we can see a benefit.

Surprising that your lock-free implementation is not giving better performance with 64 cores. Are you saying the TranslogWriter monitor lock needs to be fixed before we make this lock-free, or are there more issues than TranslogWriter?

@coderplay

coderplay commented Sep 11, 2017

My recent test on nyc_taxis with 5.5.2 reached about 288K docs/s, but with a different configuration. I will make another comparison later.

@danielmitterdorfer

Member

danielmitterdorfer commented Sep 12, 2017

Surprising that your lock-free implementation is not giving better performance with 64cores. Are you saying Translogwriter monitor lock needs to be fixed before we make this lock free or are there more issues than the Translogwriter?

As far as I can see from the most recent flight recordings, the reason is that LocalCheckpointTracker is not the bottleneck. Although there is measurable contention in the current implementation, making it lock-free just increases contention in TranslogWriter.

IMHO, Elasticsearch's approach to scalability is to scale horizontally rather than vertically (i.e. many small nodes instead of a few large nodes). As a consequence of that design decision, we see contention on machines with a high number of cores. Or more specifically: The root cause is that multiple writers attempt to write concurrently to the translog and guarding it with a lock does not scale to machines with high core counts. A more scalable solution could be to apply the single-writer principle.
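The single-writer principle mentioned above could be sketched like this (a hypothetical `SingleWriterSketch`, not a proposal for the actual translog): indexing threads only enqueue operations, while one dedicated thread is the sole writer, so the file itself needs no lock and writes batch up naturally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the single-writer principle for a translog-like WAL.
class SingleWriterSketch {
    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private final List<byte[]> log = new ArrayList<>(); // stands in for the WAL file

    // Called concurrently by many indexing threads; no lock on the "file".
    void add(byte[] operation) {
        queue.add(operation);
    }

    // Runs on the single writer thread only: drain whatever has accumulated
    // and append it in one uncontended batch.
    int drainOnce() {
        List<byte[]> batch = new ArrayList<>();
        queue.drainTo(batch);
        log.addAll(batch);
        return batch.size();
    }
}
```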

@s1monw

Contributor

s1monw commented Sep 12, 2017

IMHO, Elasticsearch's approach to scalability is to scale horizontally rather than vertically (i.e. many small nodes instead of a few large nodes). As a consequence of that design decision, we see contention on machines with a high number of cores. Or more specifically: The root cause is that multiple writers attempt to write concurrently to the translog and guarding it with a lock does not scale to machines with high core counts. A more scalable solution could be to apply the single-writer principle.

I made multiple attempts at this in the past that didn't yield speed improvements massive enough to justify the complexity they added. Let's say you used a ring buffer to do this: you'd need to copy the payload for every document, which in turn adds considerable complexity in terms of memory consumption and GC. Today the implementation basically writes to a buffered stream that reuses its buffer, so we only pay the price of copying the bytes. It's all hidden behind the stream interface, which is nice. IMO there needs to be a significant speed improvement to justify a single-writer refactoring.

@coderplay

coderplay commented Sep 20, 2017

Configurations

  • Load driver machine: i3.16xlarge, Amazon Linux
  • Target machine: i3.16xlarge, Amazon Linux, 128GB heap configured for ES, G1 GC, MaxGCPauseMillis=1000, indexing buffer size of 50%, data.path spread across 8 disks, 32 netty workers, 64 bulk workers.

Challenge settings:

      "index-settings": {
        "index.number_of_shards": 24,
        "index.codec": "best_compression",
        "index.number_of_replicas": 0,
        "index.refresh_interval": "100s",
        "index.translog.flush_threshold_size": "4g",
        "index.translog.durability": "async",
        "index.translog.sync_interval": "240s"
      },
      "schedule": [
        {
          "operation": "index",
          "warmup-time-period": 240,
          "clients": 48
        }
      ]

Results

| Version | Min Indexing Throughput | Median Indexing Throughput | Max Indexing Throughput |
|---|---|---|---|
| 5.5.2 | 268327 docs/s | 281017 docs/s | 288709 docs/s |
| 6.0.0-beta2 | 166966 docs/s | 168839 docs/s | 171388 docs/s |
| 6.0.0-beta2* | 271438 docs/s | 277353 docs/s | 285558 docs/s |

Note: the 5.5.2 run had the max thread count limit relaxed.

@danielmitterdorfer Still, nice work!

I believe the TranslogWriter lock contention problem has been around for a long time; 5.5.2 also has it. Writes on the same index shard share the same WAL writer. The single-writer principle is the ideal solution; in practice, group commit is commonly used to solve this problem.

HBase achieves group commit using the Disruptor ring buffer. We can probably learn from it.

@s1monw
I don't think we need to copy the payload, since Translog already creates a new byte array in ReleasableBytesStreamOutput:

    public Location add(final Operation operation) throws IOException {
        final ReleasableBytesStreamOutput out = new ReleasableBytesStreamOutput(bigArrays);
        ...
        final ReleasablePagedBytesReference bytes = out.bytes();
        ...

Just keeping the reference objects in a buffer and dumping them to the WAL file in a batch should work.

Regarding GC, Elasticsearch already has a serious young-generation GC issue. The JVM allocates ~9.11GB of objects per second in the nyc_taxis benchmark, most of them allocated by org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(Mapping, ParseContext$InternalParseContext, XContentParser).
The nyc_taxis indexing takes ~630 secs in 5.5.2; times 9.11 GB/s, that's ~5.74TB of memory allocated for a dataset of only 74.3 GB. GC could hardly be worse even if we copied payloads.

@jpountz

This comment has been minimized.

Contributor

jpountz commented Sep 25, 2017

Fixed via #26744

@jpountz jpountz closed this Sep 25, 2017

coderplay pushed a commit to coderplay/elasticsearch that referenced this issue Sep 30, 2017

Implement local sequence number counting lock-free
With this commit we use CAS instead of locking to guard the local
sequence number counter. We do this to reduce lock contention when a
large number of clients indexes data in Elasticsearch.

This commit only targets `LocalCheckpointTracker#generateSeqNo()` and
the related methods but leaves
`LocalCheckpointTracker#markSeqNoAsCompleted()` more or less in its
current state because we have seen it to cause less lock contention than
the other method and it is also harder to implement lock-free.

Closes elastic#26339

coderplay pushed a commit to coderplay/elasticsearch that referenced this issue Sep 30, 2017

Advance local checkpoint lock-free
With this commit we advance the local checkpoint lock-free. This
improves throughput in the contended case. We also add a test case that
exercises the affected code paths from many threads to make concurrency
bugs more likely to surface.

Relates elastic#26339