
[FLINK-8790][State] Improve performance for recovery from incremental checkpoint #5582

Conversation

sihuazhou
Contributor

What is the purpose of the change

This PR fixes FLINK-8790. When there are multiple state handles to be restored, we can improve the performance as follows:

    1. Choose the best state handle to initialize the target db.
    2. Use the other state handles to create tmp dbs, and clip each tmp db according to the target key group range (via rocksdb.deleteRange()). This lets us get rid of the key group check in the
      data insertion loop and also avoids traversing useless records.
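
The clipping in step 2 can be sketched with plain Java collections (a toy model, not the RocksDB API: int keys stand in for serialized key-group prefixes and a TreeMap stands in for the tmp db; all names are illustrative):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class ClipSketch {

    // Toy model of the clip: drop everything below the target start and above
    // the target end, mirroring the two rocksdb.deleteRange() calls on a tmp db.
    public static void clip(NavigableMap<Integer, String> db, int targetStart, int targetEnd) {
        db.headMap(targetStart, false).clear(); // removes keys < targetStart
        db.tailMap(targetEnd, false).clear();   // removes keys > targetEnd
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> db = new TreeMap<>();
        for (int k = 0; k < 10; k++) {
            db.put(k, "value-" + k);
        }
        clip(db, 3, 6);
        System.out.println(db.keySet()); // [3, 4, 5, 6]
    }
}
```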

Brief change log

  • Improve the performance when restoring from multiple state handles

Verifying this change

The changes can be verified by the existing tests, and the unit test below also helps to verify them.

  • RocksDBIncrementalCheckpointUtilsTest.java

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)

Documentation

  • Does this pull request introduce a new feature? (no)

@sihuazhou
Contributor Author

@StefanRRichter Could you please have a look at this?

@StefanRRichter
Contributor

Thanks for the contribution! We are currently busy with the 1.5 release. I will have a closer look at this PR and your other pending JIRAs after the release is out.

@sihuazhou
Contributor Author

Thanks, looking forward.

@sihuazhou sihuazhou force-pushed the improve_recovery_from_increment_checkpoint branch 8 times, most recently from f18eb80 to 112bd74 Compare March 20, 2018 03:04
@sihuazhou
Contributor Author

Unfortunately, after confirming with RocksDB, deleteRange() is still an experimental feature and may currently impact read performance (even though we could use the ReadOptions to reduce the impact).

In practice, I tested the read-performance impact of deleteRange() in our case (we delete 2 ranges at most) and did not find any impact. TiKV already uses it to delete entire shards. But to be on the safe side, I think the current PR should be frozen. Still, I think the implementation based on deleteRange() in this PR would be the better one once deleteRange() is no longer experimental (especially when the user scales up the job, in which case we only need to clip the RocksDB instance without iterating any records, which is super fast).

Anyway, even though we can't use deleteRange() currently, we can still improve the performance of recovery from incremental checkpoints somewhat, as follows: if one state handle's key-group range is a sub-range of the target key-group range, we can open it directly and avoid the overhead of iterating it. @StefanRRichter What do you think? If you don't object, I will update the PR following this approach.

@sihuazhou
Contributor Author

Hi @StefanRRichter could you please have a look at this?

@@ -138,4 +138,12 @@ private static void writeVariableIntBytes(
value >>>= 8;
} while (value != 0);
}

public static byte[] serializeKeyGroup(int keyGroup, int keyGroupPrefixBytes) {
byte[] startKeyGroupPrefixBytes = new byte[keyGroupPrefixBytes];
Contributor

Maybe we can rather pass the startKeyGroupPrefixBytes array directly instead of creating it in every invocation from keyGroupPrefixBytes. Like that, the caller can reuse the same array.
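
A minimal sketch of that suggestion (hypothetical signature; big-endian byte order is assumed so that the serialized prefixes compare in key-group order):

```java
public class KeyGroupPrefix {

    // The caller provides the prefix array and can reuse it across invocations;
    // the key group is written big-endian into the prefix bytes.
    public static void serializeKeyGroup(int keyGroup, byte[] keyGroupPrefixBytes) {
        for (int i = keyGroupPrefixBytes.length - 1; i >= 0; --i) {
            keyGroupPrefixBytes[i] = (byte) keyGroup;
            keyGroup >>>= 8;
        }
    }

    public static void main(String[] args) {
        byte[] prefix = new byte[2];
        serializeKeyGroup(258, prefix); // 258 == 0x0102
        System.out.println(prefix[0] + ", " + prefix[1]); // 1, 2
    }
}
```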

Contributor Author

Good point! There is one problem with this PR: after confirming with RocksDB, deleteRange() is still an experimental feature of RocksDB... Even though I did some experiments with deleteRange() in our case and didn't find any downside, I'm not sure whether we should still use it. But deleteRange() should definitely be used for recovery from incremental checkpoints once it's stable. What do you think?

Contributor Author

Some info about the experiment I did:

  • I set ReadOptions::ignore_range_deletions = true to speed up reads, because we won't read any records that belong to the key-groups we have deleted.
  • I only called deleteRange() twice, because we will call it at most twice during recovery from an incremental checkpoint.

Contributor

This is a very good question. But I think, for as long as we consider rescaling incremental checkpoints itself experimental, we can try to use deleteRange and change it in case we experience any problems. Would that be ok?

Contributor Author

@StefanRRichter I think that makes a lot of sense! I will rebase the PR and ping you again, it's a bit outdated now...

Contributor

Cool, no problem, I am also completing this review soon. One general thing I was wondering about: did you ever look at the sstable ingestion feature? It would be super nice for rescaling incremental checkpoints if we could simply ingest the sstables from multiple checkpoints into one database and then just clip the range boundaries. Unfortunately, from what I have seen, this only works for external sstables written by the sstable writer (see here: https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files). I wonder if there is any way to modify the sstables of incremental checkpoints to make them usable for ingestion, but maybe it is just completely impossible. I also found this interesting discussion that outlines another potential approach: facebook/rocksdb#499. Any thoughts?

Contributor Author

Yes, I did notice the sstable ingestion feature and also ran some experiments on it. You are right that currently the ingestion feature only works for sstables written by the sstable writer. I tried to use the sstable writer to generate external sstables in parallel and ingest them into the target db, but unfortunately the performance of the sstable writer is quite poor in RocksJava... I left the conclusions of the experiment in FLINK-8845 (which is why I took a step back and used WriteBatch to speed up recovery for full checkpoints); I pasted the comments below:

Unfortunately, even though according to the RocksDB wiki the best way to load data into RocksDB is to "Generate SST files (using SstFileWriter) with non-overlapping ranges in parallel and bulk load the SST files", after implementing this and testing it with a simple benchmark I found that the performance is not as good as expected; it's almost the same as, or worse than, using RocksDB.put(). After a bit of analysis I found that building the SST spends a lot of time creating DirectSlice instances, and currently we can't reuse a DirectSlice in the Java API. Even though in C++ this approach yields outstanding results, I don't think we can use it to improve performance in Java currently (maybe someday RocksDB will improve this so the Java performance approaches the C++ performance)...

And regarding facebook/rocksdb#499: if I'm not mistaken, I think we also can't use repairDB() because we have many column families, and the other suggestions in that thread are quite similar to the approach I already tried of building the sstables in parallel, which turned out not to work well with the Java API.

Contributor

Ok, that was also how I understood the discussions and docs. In that case, let's proceed with this approach and I will finalize the review now.

", but found " + rawStateHandle.getClass());
}
if (!hasExtraKeys) {
restoreFromSingleHandle(restoreStateHandles.iterator().next());
Contributor

This new code (and also the old code before it) looks like it could have a potential bug: if restoreStateHandles.size() > 1 is false, how can we be sure that restoreStateHandles.iterator().next() exists? Even if it works due to some hidden assumption, it does not look clean.

Contributor Author

You are right, I think this should be improved.

}
}

public static int evaluateGroupRange(KeyGroupRange range1, KeyGroupRange range2) {
Contributor

This method name is not so helpful. It does not tell us anything about what is evaluated. Maybe a javadoc on the public methods would be helpful.

Contributor

I think this code could also be based on something like KeyGroupRange.getIntersection(KeyGroupRange).getNumberOfKeyGroups().
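
The suggested computation can be modeled with plain ints (a sketch only; KeyGroupRange is Flink's real class, the helper below is illustrative):

```java
public class KeyGroupRangeSketch {

    // Models KeyGroupRange.getIntersection(other).getNumberOfKeyGroups():
    // ranges are inclusive [start, end]; an empty intersection yields 0.
    public static int intersectionSize(int start1, int end1, int start2, int end2) {
        int start = Math.max(start1, start2);
        int end = Math.min(end1, end2);
        return Math.max(end - start + 1, 0);
    }

    public static void main(String[] args) {
        System.out.println(intersectionSize(0, 9, 5, 14)); // 5 (key groups 5..9 overlap)
        System.out.println(intersectionSize(0, 4, 5, 9));  // 0 (disjoint ranges)
    }
}
```

The handle whose range scores the largest intersection with the target range is then the natural candidate for initializing the target db.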

Contributor Author

Yes, will change it. 👍

Contributor

One more thought: do you think it could make sense to also include the state size of the handle in the evaluation score? The only problem here: is a higher or a lower size better? A higher size could also just mean that the initial database was not in a well-compacted state.

Contributor Author

I think it makes sense to also take the state size into account. If we do, the score may look like: "handle's state size" * "numberOfKeyGroups" / "handle's total key groups". But you are right, I don't know whether a higher or a lower size is better either, which makes me unsure whether we should take it into account now...

Contributor Author

I think I'm a bit torn here.

Contributor

Then let's just keep it simple for now, and we can still improve it if we later find that the size can also be an indicator of the better initial db state.

Contributor Author

Ok

}

private class RestoredDBInfo implements AutoCloseable {
private RocksDB db;
Contributor

All fields could be final and NonNull annotated.

Contributor Author

👍

}
}

private class RestoredDBInfo implements AutoCloseable {
Contributor

This name is not optimal because the class is more than pure info. It holds the temporary DB and is used to manage its lifecycle. I would suggest RestoreDBInstance or RestoreDBHandle.

Contributor Author

👍


try (RocksIterator iterator = tmpRestoreDBInfo.db.newIterator(tmpColumnFamilyHandle)) {

iterator.seek(targetStartKeyGroupPrefixBytes);
Contributor

If the DB is clipped, do we even need to seek, or will the iterator already begin at a useful key-group anyway?

Contributor Author

I think this is a not-so-nice API of RocksIterator: a newly created iterator doesn't point to the first element by default; users need to perform a seek() to make it valid.

if (currentGroupRange.getStartKeyGroup() < targetGroupRange.getStartKeyGroup()) {
byte[] beginKey = RocksDBKeySerializationUtils.serializeKeyGroup(
currentGroupRange.getStartKeyGroup(), keyGroupPrefixBytes);
byte[] endKye = RocksDBKeySerializationUtils.serializeKeyGroup(
Contributor

typo: endKye

Contributor Author

👍

*/
public class RocksDBIncrementalCheckpointUtils {

public static void clipDBWithKeyGroupRange(
Contributor

I wonder whether clipping the database to avoid the prefix check is really an optimization. If we don't clip, we must seek the iterator and apply a single if to every key. This if is very predictable for the CPU because it always passes except when we terminate the loop. That sounds rather cheap. What are your thoughts on why deleting ranges is the better approach?

Contributor Author

Ah, I don't see a clear benefit either, but I think it makes the loop code look cleaner. If you think we don't need to clip the database to avoid the prefix check, that's also fine with me and I will change it.

Contributor

I think the code will still look ok; it is just one more if (and we even only need the if in the cases where we would clip something). If this allows us to eliminate some code and tests, move away from experimental features, and may even be faster, then I think it is a good idea.
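
The alternative discussed here can be sketched with a sorted set standing in for the RocksDB iterator (illustrative types and names; int keys stand in for serialized keys):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

public class PrefixCheckSketch {

    // Seek to the target start, then apply one highly predictable if per record;
    // the branch only fails once, when the loop leaves the target range, so no
    // deleteRange-based clipping is needed.
    public static List<Integer> restoreInRange(SortedSet<Integer> handleKeys, int targetStart, int targetEnd) {
        List<Integer> restored = new ArrayList<>();
        for (int key : handleKeys.tailSet(targetStart)) { // the iterator "seek"
            if (key > targetEnd) {
                break; // terminate instead of clipping the db up front
            }
            restored.add(key);
        }
        return restored;
    }
}
```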

Contributor Author

That makes sense, will change it.

/**
* Tests to guard {@link RocksDBIncrementalCheckpointUtils}.
*/
public class RocksDBIncrementalCheckpointUtilsTest {
Contributor

Please add extends TestLogger

Contributor Author

👍

@sihuazhou sihuazhou force-pushed the improve_recovery_from_increment_checkpoint branch 6 times, most recently from 1f750b8 to 60f5b5f Compare June 1, 2018 01:04
@sihuazhou
Contributor Author

@StefanRRichter Thanks for your nice review; I addressed your comments, could you please have a look again?

chooseTheBestStateHandleToInit(restoreStateHandles, targetKeyGroupRange);

int targetStartKeyGroup = stateBackend.getKeyGroupRange().getStartKeyGroup();
byte[] targetStartKeyGroupPrefixBytes = new byte[stateBackend.keyGroupPrefixBytes];
Contributor

I think this array is not used anymore after we write to it.

Contributor Author

Oh yes, I wanted to pull this array creation out of the for (KeyedStateHandle rawStateHandle : restoreStateHandles) { loop but forgot to remove the array creation inside the loop and to replace startKeyGroupPrefixBytes with targetStartKeyGroupPrefixBytes. Nice catch! 👍

return registeredStateMetaInfoEntry.f0;
}

private void chooseTheBestStateHandleToInit(
Contributor

I think the name of this method is no longer accurate: it does not only choose the best handle, it already restores a db instance. Maybe we can still break this up into two methods, so that each method only does one thing. It is not so nice that creating the db is a side effect of a method whose name claims it only finds something.

Contributor Author

Yes, I will split it into two methods.

stateMetaInfoSnapshots);
columnFamilyHandles);

if (needClip) {
Contributor

Instead of using this clipping flag, would it not be better to have a method that clips a RestoredDBInstance, which is simply called after restoreDBFromStateHandle in the one case that needs it? This would avoid mixing up those two things in one method.

Contributor

In that case, maybe clip can also become a method of the RestoredDBInstance?

Contributor Author

Yes, you are right, will follow your suggestion.

AbstractKeyedStateBackend<K> keyedStateBackend = super.createKeyedStateBackend(
env, jobID, operatorIdentifier, keySerializer, numberOfKeyGroups, keyGroupRange, kvStateRegistry);

// We ignore the range deletions on production, but when we are running the tests we shouldn't ignore it.
Contributor

Can you briefly explain why this whole ReadOptions change is required and what the comment about ignoring range deletions relates to? This seems to introduce some implicit complexity, so I just want to double-check that it is really required.

Contributor Author

I introduced this because we may call deleteRange() when rescaling from an incremental checkpoint. According to RocksDB's comments in its C++ sources, to avoid the read-performance downside we should set readOptions.setIgnoreRangeDeletions(true);.

  • In production, that is fine because we won't query any record that belongs to the key-groups we have deleted.

  • But when running tests, we may need to verify that after restoring from a checkpoint we didn't take into the backend any external key-group that doesn't belong to the target key-group range (e.g. StateBackendTestBase#testKeyGroupSnapshotRestore()). The reason we need readOptions.setIgnoreRangeDeletions(false); in this case can be explained as below:

db.deleteRange(range1);
readOptions.setIgnoreRangeDeletions(true);

db.get(readOptions, key in range1); // this may not be null, because we have ignored the range deletions

readOptions.setIgnoreRangeDeletions(false);

db.get(readOptions, key in range1); // this will be null
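
The snippet above can be modeled in plain Java to make the semantics concrete (a toy model with a single tombstone over int keys; this is not the RocksDB API):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class RangeDeletionModel {

    // deleteRange only records a tombstone over [begin, end); a read sees the
    // old value again if it opts out via ignoreRangeDeletions.
    public static String get(NavigableMap<Integer, String> data, int tombstoneBegin, int tombstoneEnd,
                             int key, boolean ignoreRangeDeletions) {
        if (!ignoreRangeDeletions && key >= tombstoneBegin && key < tombstoneEnd) {
            return null; // hidden by the range tombstone
        }
        return data.get(key); // still physically present until compaction
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> db = new TreeMap<>();
        db.put(5, "v");
        // tombstone covers [0, 10), as if db.deleteRange(0, 10) had been called
        System.out.println(get(db, 0, 10, 5, true));  // v (deletion ignored)
        System.out.println(get(db, 0, 10, 5, false)); // null
    }
}
```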

Contributor

As I see, this is only happening in the case where there is only one handle and we are only interested in a subset of the key-groups. Unfortunately, that should be the common case of scaling out. I am wondering if we should not prefer to apply normal deletes over range delete, because what will happen if we take again a snapshot from a database that was using range deletes? Will the keys all be gone in cases of full and incremental snapshots? If the performance of normal deletes is not terrible, that might be cleaner for as long as range deletes are not working properly or have potential negative side-effects. What is your opinion about this?

Contributor Author

Sorry... I think I may not have understood "I am wondering if we should not prefer to apply normal deletes over range delete" properly; do you mean "I am wondering if we should prefer to apply normal deletes over range delete"? As far as I know the keys are all gone only when compaction occurs; deleteRange() only writes a special record to the db, something like Deleted Range(beginKey, endKey], and doesn't actually remove any records from the db.

And yes, concerning the negative side effects of deleteRange(), I still have the same concerns, even though rescaling from a checkpoint is itself still an experimental feature. I think a safer way to improve the performance of recovery from incremental checkpoints is to not clip at all, and to only choose an instance as the initial db when its key-group range is a subset of the target key-group range. What do you think?

Contributor Author

Does it make sense to let chooseTheBestStateHandleToInit return a non-null instance only when one instance's key-group range is fully covered by the target key-group range? This way we won't clip anything, and the clip-related code can be removed (or should we retain it in case we use it in the future?).

Contributor Author

Sorry for my poor English... I think basing the decision on a fraction is a better choice, e.g. if "number of invalid key-groups of the handle" / "number of all key-groups of the handle" <= 1/4, we prefer "restore + single deletes".
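
That fraction-based rule can be sketched in a few lines (the method name is illustrative, and the 1/4 threshold is the hypothetical value proposed above, not a settled constant):

```java
public class InitDbHeuristic {

    // Prefer "restore + single deletes" only if at most 1/4 of the handle's
    // key-groups fall outside the target range.
    public static boolean useAsInitialDb(int invalidKeyGroups, int totalKeyGroups) {
        return invalidKeyGroups * 4 <= totalKeyGroups;
    }

    public static void main(String[] args) {
        System.out.println(useAsInitialDb(1, 4)); // true: exactly 1/4 invalid
        System.out.println(useAsInitialDb(3, 8)); // false: 3/8 > 1/4
    }
}
```

Using integer arithmetic avoids floating-point comparison for the threshold check.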

Contributor

That sounds good 👍 This also means we do not need any tricks with the read options :)

Contributor Author

Yes, I will change it according to the approach we discussed above.

Contributor

On a different note, this has become more complex now; I wonder if we should also add a test for incremental rescaling. I think that could be done at the level of using the KeyedOneInputStreamOperatorTestHarness for different ranges, choosing incremental RocksDB, triggering two checkpoints, and scaling in and out into a new harness with a different key-group range. Ideally, this test could cover all the different corner cases (about the fraction).

Contributor Author

I also think it needs a test now, I will add it.

}

@Override
public void close() throws Exception {
Contributor

can remove throws Exception

Contributor Author

👍

@sihuazhou
Contributor Author

@StefanRRichter Thanks for your nice review and for keeping this PR from going down the wrong path. I will change the code according to your comments and ping you again when I'm done.

@sihuazhou sihuazhou force-pushed the improve_recovery_from_increment_checkpoint branch from 60f5b5f to 4cb3d92 Compare June 5, 2018 06:04
@sihuazhou
Contributor Author

Hi @StefanRRichter I updated the PR according to the previous discussions; could you please have a look when you have time? The Travis failure is unrelated; it's a checkstyle error introduced by previous PRs.

@StefanRRichter
Contributor

LGTM 👍 Very nice work. I will merge it with some very minor touchups.
