
Optimize indexing (translog.durability=async) by using LinkedBlockingQueue. Closes #45371 #45450

Closed

Conversation

@dengweisysu
Contributor

commented Aug 12, 2019

Using a LinkedBlockingQueue to decouple translog writing from translog flushing.
This pull request contains two commits:

  1. Extract an ITranslog interface from Translog
  2. Add AsyncTranslog to replace Translog when in async translog durability mode
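
The decoupling described above can be sketched as a producer/consumer pair (hypothetical names; this is not the PR's actual AsyncTranslog implementation): indexing threads enqueue serialized operations into a bounded LinkedBlockingQueue and return immediately, while a single background thread drains the queue and performs the actual write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: decouple translog producers from the single flushing thread. */
class AsyncTranslogSketch {
    private final BlockingQueue<byte[]> queue;
    private final List<byte[]> disk = new ArrayList<>(); // stands in for the file channel
    private final Thread flusher;
    private volatile boolean closed;

    AsyncTranslogSketch(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.flusher = new Thread(() -> {
            try {
                while (!closed || !queue.isEmpty()) {
                    byte[] op = queue.poll(10, TimeUnit.MILLISECONDS);
                    if (op != null) {
                        disk.add(op); // real impl: buffered write plus periodic fsync
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        this.flusher.start();
    }

    /** Indexing threads return immediately; no translog Location is available. */
    void add(byte[] operation) throws InterruptedException {
        queue.put(operation); // blocks only when the buffer is full (back-pressure)
    }

    /** Drains remaining operations, then reports how many reached "disk". */
    int close() throws InterruptedException {
        closed = true;
        flusher.join();
        return disk.size();
    }
}
```

A bounded queue gives natural back-pressure: add() blocks only when the consumer falls behind by more than the buffer capacity.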
@elasticcla


commented Aug 12, 2019

Hi @dengweisysu, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@dengweisysu

Contributor Author

commented Aug 12, 2019

Hi @dengweisysu, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

I have added the e-mails to my GitHub profile. Is it OK now?


@dnhatn
Contributor

left a comment

Thank you for working on the proposal. I don't think we should change the translog this way (see my comments). However, we can work together to improve the performance of the translog. Can you share with us more information on your benchmark? How does the translog take 6 minutes out of 18 minutes in your benchmark? Thank you!

//produce using ring buffer
dispatcher.produce(new OperationEvent(operation, this.translog, this.unConsumeCounter));
//async return no location
return null;

@dnhatn

dnhatn Aug 14, 2019

Contributor

This breaks wait_for_refresh and other refresh features.

@dengweisysu

dengweisysu Aug 14, 2019

Author Contributor

This breaks wait_for_refresh and other refresh features.

As far as I know, wait_for_refresh is not suitable for scenarios that require high indexing performance, because it makes writing threads wait and produces smaller segments (which in turn increases the time spent looking up doc versions while indexing).

So, is it really necessary to return the low-level info (the offset in the translog file) under async durability?

@@ -69,6 +70,9 @@
public static final Setting<Translog.Durability> INDEX_TRANSLOG_DURABILITY_SETTING =
new Setting<>("index.translog.durability", Translog.Durability.REQUEST.name(),
(value) -> Translog.Durability.valueOf(value.toUpperCase(Locale.ROOT)), Property.Dynamic, Property.IndexScope);
public static final Setting<Integer> TRANSLOG_ASYNC_EVENT_BUFFER_SIZE =
Setting.intSetting("indices.translog.async_event_buffer_size", 256 * 1026, Property.NodeScope);

@dnhatn

dnhatn Aug 14, 2019

Contributor

This would consume a significant amount of memory.

@dengweisysu

dengweisysu Aug 14, 2019

Author Contributor

This can be set to a smaller size.
But using memory to buffer translog operations while writing is unavailable is the key point of this proposal; it's a trade-off between indexing speed and memory.
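
As a back-of-envelope check on the memory concern above: the proposed default of 256 * 1026 slots, at roughly 3 KB per operation (the figure the author quotes from his benchmark later in this thread), would buffer on the order of 770 MiB in the worst case. The per-op size is an assumption, so this is only an order-of-magnitude sketch:

```java
/** Back-of-envelope estimate of the proposed event buffer's worst-case memory use. */
class BufferMemoryEstimate {
    static long worstCaseBytes(long slots, long bytesPerOp) {
        return slots * bytesPerOp;
    }

    static long defaultWorstCaseMiB() {
        long slots = 256 * 1026;    // value from the proposed setting
        long bytesPerOp = 3 * 1024; // ~3 KB per operation, per the author's benchmark (assumption)
        return worstCaseBytes(slots, bytesPerOp) / (1024 * 1024);
    }
}
```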

IndexSettings indexSettings = engineConfig.getIndexSettings();
if (translogConfig.isDurabilityAsync()) {
//using ayncTranslog to improve write performance
ExecutorService executor = engineConfig.getThreadPool().executor(Names.ASYNC_WRITE_TRANSLOG);

@dnhatn

dnhatn Aug 14, 2019

Contributor

Changing the translog durability setting would no longer work.

@dengweisysu

Contributor Author

commented Aug 14, 2019

Thank you for working on the proposal. I don't think we should change the translog this way (see my comments). However, we can work together to improve the performance of the translog. Can you share with us more information on your benchmark? How does the translog take 6 minutes out of 18 minutes in your benchmark? Thank you!

As I mentioned in issue #45371, writing is easily blocked waiting for the lock, like this:
(screenshot: 20190814141129)

In my benchmark, one translog entry (for one index operation) takes about 3 KB, and TranslogConfig.DEFAULT_BUFFER_SIZE is 8 KB; that means roughly every 3 operations trigger a flush of BufferedChannelOutputStream (see BufferedOutputStream#write), which then blocks other threads from adding translog entries.
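
The arithmetic above can be demonstrated with a plain java.io.BufferedOutputStream as a stand-in for BufferedChannelOutputStream (an assumption; the real class adds channel handling): with an 8 KB buffer and 3 KB writes, every third write overflows the buffer and pushes data to the underlying stream.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Counts how often an 8 KB BufferedOutputStream pushes 3 KB writes to the underlying stream. */
class FlushFrequencyDemo {
    static class CountingStream extends OutputStream {
        int underlyingWrites = 0;
        @Override public void write(int b) { underlyingWrites++; }
        @Override public void write(byte[] b, int off, int len) { underlyingWrites++; }
    }

    static int underlyingWritesFor(int ops, int opSize, int bufferSize) throws IOException {
        CountingStream sink = new CountingStream();
        try (BufferedOutputStream out = new BufferedOutputStream(sink, bufferSize)) {
            byte[] op = new byte[opSize];
            for (int i = 0; i < ops; i++) {
                out.write(op); // buffered until the next op no longer fits
            }
            out.flush(); // push the tail, as the translog does on sync
            return sink.underlyingWrites;
        }
    }
}
```

With 3 KB ops and an 8 KB buffer, two ops fit (6 KB); the third overflows and forces a write of the buffered data, so the underlying stream is hit about once per three operations.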

@ywelsch

Contributor

commented Aug 14, 2019

Is there only contention on a rollover?
What if you set index.translog.flush_threshold_size: 50G, index.translog.sync_interval: 1h, and index.translog.generation_threshold_size: 50G (i.e. no rollover)?

@dengweisysu

Contributor Author

commented Aug 14, 2019

Is there only contention on a rollover?
What if you set index.translog.flush_threshold_size: 50G, index.translog.sync_interval: 1h, and index.translog.generation_threshold_size: 50G (i.e. no rollover)?

This config is really effective! Indexing performance improves significantly!

  • translog open with async durability mode: 18 minutes
  • translog closed (source changed): 12 minutes
  • translog written async using disruptor (source changed): 12 minutes
  • translog written async using LinkedBlockingQueue (source changed): about 13 minutes
  • translog using the config above: about 13 minutes

But is there any negative impact from using this config?

@ywelsch

Contributor

commented Aug 14, 2019

I wouldn't recommend running with this config (as you will lose 50G worth of data in case of a crash), but the above experiment allows us to look at much simpler solutions (e.g. parallelizing the writing to a new generation while fsyncing a previous generation).

@dengweisysu

Contributor Author

commented Aug 15, 2019

I printed the elapsed time of rollGeneration and found:

rollGeneration: StopWatch '': running time = 760.9ms

ms    %    Task name
00644 085% closeIntoReader
00062 008% copyCheckpointTo
00054 007% createWriter

This test was done with the default config:
index.translog.generation_threshold_size=64mb

That means rollGeneration happens about every 2 seconds and blocks all other writing threads.
This test was done on v6.6.2.
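
Taken together, the numbers above imply a rough overhead fraction: a ~760 ms roll every ~2 seconds means the write path is blocked about 38% of the wall-clock time. A tiny helper to make the arithmetic explicit (the 2-second interval is the author's estimate, not a measured value):

```java
/** Rough arithmetic from the quoted StopWatch numbers (all values in milliseconds). */
class RollOverheadEstimate {
    /** Fraction of wall-clock time spent inside rollGeneration, as a rounded percentage. */
    static int percentOfWallClock(double rollMillis, double rollIntervalMillis) {
        return (int) Math.round(100 * rollMillis / rollIntervalMillis);
    }
}
```

This is also why closeIntoReader (85% of the roll) is the natural target: shaving it moves most of those blocked milliseconds off the write path.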

@dnhatn

Contributor

commented Aug 15, 2019

Thanks @dengweisysu.

@s1monw

Contributor

commented Aug 15, 2019

@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change:

diff --git a/server/src/main/java/org/elasticsearch/index/translog/Translog.java b/server/src/main/java/org/elasticsearch/index/translog/Translog.java
index b98f0f2b643..8335503001f 100644
--- a/server/src/main/java/org/elasticsearch/index/translog/Translog.java
+++ b/server/src/main/java/org/elasticsearch/index/translog/Translog.java
@@ -1641,6 +1641,7 @@ public class Translog extends AbstractIndexShardComponent implements IndexShardC
      * @throws IOException if an I/O exception occurred during any file operations
      */
     public void rollGeneration() throws IOException {
+        current.sync(); // make sure we move most of the data to disk outside of the lock
         try (Releasable ignored = writeLock.acquire()) {
             try {
                 final TranslogReader reader = current.closeIntoReader();

This will make sure that we sync most of the data outside of the lock. If this saves 85% of the time, that should reduce the time the lock is held significantly.
Would be great to see the impact.
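
The suggested patch is an instance of a general pattern: pay for the expensive fsync before taking the exclusive lock, so only the small delta written in the meantime must be synced while the lock is held. A simplified, hypothetical model (counters stand in for a real FileChannel and its fsync):

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical model of the "pre-sync outside the lock" pattern from the suggested patch. */
class PreSyncPattern {
    private final ReentrantLock writeLock = new ReentrantLock();
    private int pendingBytes;      // bytes written but not yet fsynced
    int bytesSyncedUnderLock;      // the cost the exclusive section had to pay

    void append(int bytes) {
        synchronized (this) { pendingBytes += bytes; }
    }

    private void sync() {
        synchronized (this) { pendingBytes = 0; } // stands in for channel.force()
    }

    void rollGeneration(boolean preSync) {
        if (preSync) {
            sync(); // move most of the data to disk outside of the write lock
        }
        writeLock.lock();
        try {
            synchronized (this) { bytesSyncedUnderLock += pendingBytes; }
            sync(); // closeIntoReader() must still sync whatever remains
        } finally {
            writeLock.unlock();
        }
    }
}
```

With the pre-sync, only data appended between the pre-sync and the lock acquisition is synced under the lock; without it, the whole generation's tail is.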

@elasticmachine

Collaborator

commented Aug 15, 2019

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

7 similar comments

@dengweisysu

Contributor Author

commented Aug 16, 2019

@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change: […]

I will run the benchmark with this change today, but I don't think it will have an effect, because current.sync() holds the TranslogWriter lock and so will block all other writing threads too.

@dengweisysu

Contributor Author

commented Aug 16, 2019

@dengweisysu this is quite interesting. I do wonder what happens if you run your benchmarks with this change: […]

This change works!

  • translog open with async durability mode: 18 minutes
  • translog closed (source changed): 12 minutes
  • translog written async using disruptor (source changed): 12 minutes
  • translog written async using LinkedBlockingQueue (source changed): about 13 minutes
  • translog using a big generation size: about 13 minutes
  • translog synced before rollGeneration: about 14 minutes 40 seconds, much better than 18 minutes with the default config

I added more logging in sync() and found that most of the time is spent in channel.force:

running time = 592ms
ms    %    Task name
00000 000% get TranslogWriter lock
00000 000% outputStream.flush
00578 098% channel.force
00013 002% writeCheckpoint in sync

The sync() method only holds the TranslogWriter lock during flush, and releases it before channel.force:
(screenshot: 20190816202329)

Calling sync() before rollGeneration reduces the time the writeLock is held during rollGeneration.

@dengweisysu

Contributor Author

commented Aug 16, 2019

@s1monw Using sync() may be better than current.sync() in rollGeneration.

@dengweisysu

Contributor Author

commented Aug 18, 2019

For rolling a generation, the order is currently:

  1. Close the previous generation.
  2. Open the new one.

If this order could be changed to:

  1. Open the new generation.
  2. Replace the current reference with the new one.
  3. Close the previous generation.

then closing the older generation would not block other writing threads.
But this is impossible with the current design of the checkpoint file, because the checkpoint file must be renamed to the target generation before creating a new one.
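
As the author notes, the current checkpoint-file design rules this out; purely as an illustration of the proposed ordering (hypothetical names, no checkpoint handling), the open-swap-close sequence could look like:

```java
import java.util.concurrent.atomic.AtomicReference;

/** Illustration only: the proposed open-swap-close ordering for a generation roll. */
class GenerationSwapSketch {
    static final class Generation {
        final long id;
        volatile boolean closed;
        Generation(long id) { this.id = id; }
        void close() { closed = true; } // the slow part (fsync + closeIntoReader) would happen here
    }

    private final AtomicReference<Generation> current = new AtomicReference<>(new Generation(0));

    /** Writers always see an open generation; only step 3 touches the old one. */
    long roll() {
        Generation next = new Generation(current.get().id + 1); // 1. open the new generation
        Generation previous = current.getAndSet(next);          // 2. swap the current reference
        previous.close();                                       // 3. close the old one, off the write path
        return next.id;
    }

    Generation current() { return current.get(); }
}
```

Writers racing with roll() always observe either the old open generation or the new one, never a closed current generation; the checkpoint-rename constraint is exactly what this sketch omits.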

@dengweisysu

Contributor Author

commented Aug 18, 2019

Why is the generation_threshold_size setting not open for change in the guide ("https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html")?
The more writing threads that concurrently write to the same shard, the more seriously performance is impacted, because generation rolling happens more quickly.

@s1monw

Contributor

commented Aug 20, 2019

For rolling generation, the order now is: 1. close previous one; 2. open new one. If this order can be changed to: 1. open new one; 2. replace current reference to new one; 3. close previous one […]

I am afraid this would violate many of the assumptions we make in the transaction log. The problem here is mostly that certain files must not exist if we fail during a sync or while rolling a generation. In order to change that, I think we would need a fundamental change in how the translog writes its files, which I'd personally like to avoid. I think using the sync method outside of the lock would give us a good speedup and is low risk. Would you mind opening a PR for this change?

@s1monw

Contributor

commented Aug 20, 2019

In your case, moving to something like index.translog.generation_threshold_size: 512M would likely help too. Also, given that in the future we won't use the translog for recovery anymore, most of the downsides would go away here. @dnhatn @ywelsch should we consider changing the default generation size here? 64M might be quite small? But on the other hand we don't want to keep too much data.

@dnhatn

Contributor

commented Aug 20, 2019

If we increase index.translog.generation_threshold_size, then we can keep more translog if a Lucene commit contains gaps in sequence numbers; then recovering from the store would take a bit longer. I don't think these are big issues. I am wondering if we should increase the default threshold from 64MB to 256MB in 7.x and remove it entirely in 8.x? /cc @jasontedor.

@s1monw

Contributor

commented Aug 21, 2019

I am wondering if we should increase the default threshold from 64MB to 256MB in 7.x and remove it entirely in 8.x?

I am not sure I am following @dnhatn are you suggesting we make the size fixed and remove the setting?

@dengweisysu

Contributor Author

commented Aug 22, 2019

I am wondering if we should increase the default threshold from 64MB to 256MB in 7.x and remove it entirely in 8.x?

I am not sure I am following @dnhatn are you suggesting we make the size fixed and remove the setting?

@s1monw
Actually, with the default config, the size of a generated translog file is usually much bigger than 64M (in my benchmark, about 200M~500M).

That is because the flush operation (triggered by "flush_threshold_size") and rollGeneration use the same lock, flushOrRollRunning (in IndexShard).
With the default flush_threshold_size config:
[flush frequently] ---> [too many segments] ---> [slow down resolveDocVersion] ---> [slow down writing]
To solve this problem, I tuned up flush_threshold_size, which resulted in:
[flush infrequently] ---> [roll generation frequently] ---> [slow down writing]

So in my benchmark:
with the default config: 130 translog files, took about 70s
with a larger flush_threshold_size config: 424 translog files, took about 242s

So it may not be enough to simply change the default threshold from 64MB to 256MB.

@dnhatn

Contributor

commented Aug 22, 2019

I am not sure I am following @dnhatn are you suggesting we make the size fixed and remove the setting?

@s1monw Sorry I wasn't clear. I meant to increase the default setting from 64MB to 256MB in 7.x. If there's no significant issue with the new default setting, we can remove the setting and the rolling logic entirely in 8.x.

@dengweisysu Can you please run a benchmark with the default flush_threshold_size and generation_threshold_size= 256MB or 1024MB (to disable the auto rolling)? Thank you.
