@xuanyuanking
Member

What changes were proposed in this pull request?

Add the functionality of cleaning up files of old versions for the RocksDB instance and RocksDBFileManager.

Why are the changes needed?

Part of the implementation of RocksDB state store.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New UT added.

SparkQA commented Jun 16, 2021

Test build #139876 has finished for PR 32933 at commit f8f9d20.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44406/

SparkQA commented Jun 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44406/

@viirya
Member

viirya commented Jun 23, 2021

retest this please

Member

Is this the change for review in this PR?

Member Author

That's right.

Member

Do you mean that if SST file F was last used in version V, it won't be used in version V+2 or later?

In other words, an SST file F can only be used in consecutive versions: it can't be used in V, skipped in V+1, and then used again in V+2.

Contributor

nit: won't

Member Author

That's right. A file cannot be shared across non-consecutive versions. If a file is used in V but not in V+1, the checkpoint of V+1 must already have created new files for all the KVs in the original file.
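
To make the invariant above concrete, here is a minimal sketch (not the code from this PR) of how it lets cleanup decide which SST files are safe to delete. The object name `SstCleanupSketch`, the method `deletableFiles`, and the `filesByVersion` map are hypothetical names chosen for illustration; only the idea of tracking a per-file "max used version" comes from the discussion.

```scala
object SstCleanupSketch {
  /** Hypothetical helper: given version -> SST file names referenced by that version,
    * return the files that are safe to delete once every version below
    * minVersionToRetain is dropped. */
  def deletableFiles(
      filesByVersion: Map[Long, Set[String]],
      minVersionToRetain: Long): Set[String] = {
    // Record the highest version that references each file.
    val maxUsedVersion: Map[String, Long] =
      filesByVersion.toSeq
        .flatMap { case (version, files) => files.toSeq.map(f => f -> version) }
        .groupBy(_._1)
        .map { case (file, pairs) => file -> pairs.map(_._2).max }

    // Because a file is only used by consecutive versions, a file whose last use
    // is older than the retained range cannot be needed by any retained version.
    maxUsedVersion.collect { case (file, v) if v < minVersionToRetain => file }.toSet
  }
}
```

With versions 1 to 3 referencing `{a, b}`, `{b, c}`, and `{c, d}` and a minimum retained version of 3, only `a` and `b` qualify, since `c` is still referenced by version 3.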

Comment on lines +222 to +223
Member

What is the second case for? When will it happen?

Contributor

If I understand correctly, it would occur on a reattempt of the same micro-batch.

Member Author

Yep, that's right.

Member

maxVersionPresent -> maxUsedVersion?

Member Author

Ah thanks, done in the next commit.

Member

version -> versionFile.

Actually, maybe s"Error deleting version file $versionFile for version $version".

Member Author

Makes sense, done in the next commit.

Member

viirya commented Jun 23, 2021

Maybe we need to exclude the files that failed to delete from the count? Some files may fail to delete, but it seems we ignore such failures and continue.

Contributor

Yeah, we may need to count successful ones and failed ones separately.

Member Author

Agree, done in the next commit.
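
As an illustration of the suggestion above, the following is a minimal sketch (not the PR's implementation) of counting successful and failed deletions separately. `DeleteCountSketch`, `deleteAll`, and the `delete` callback are hypothetical names; in the real code the callback would presumably wrap a file-system delete.

```scala
import scala.util.Try

object DeleteCountSketch {
  /** Hypothetical helper: attempt to delete every file and return
    * (deleted, failed) counts instead of a single total. */
  def deleteAll(files: Seq[String], delete: String => Unit): (Int, Int) = {
    // Try captures non-fatal exceptions, so one failure does not stop the loop.
    val results = files.map(f => Try(delete(f)))
    val failed = results.count(_.isFailure)
    (results.size - failed, failed)
  }
}
```

Reporting the two counts separately keeps the log message honest when some deletions are skipped due to errors.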

SparkQA commented Jun 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44698/

Member

viirya left a comment

If deleteOldVersions is the only change needed for review, then it looks okay. Just a few minor comments.

SparkQA commented Jun 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44698/

SparkQA commented Jun 23, 2021

Test build #140171 has finished for PR 32933 at commit f8f9d20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

This will fail to pass scalastyle now.

Contributor

HeartSaVioR left a comment

First round. Need to look into test cases a bit more.

Contributor

nit: It's

Member Author

Thanks, done in the next commit.

Contributor

nit: space before }

Member Author

Thanks, done in the next commit.

Contributor

Is localTempDir used only here? Just to make sure deleting the directory won't break anything.

Member Author

Yes, the localTempDir is only used for unzipping files.

Contributor

This is interesting - we don't remove any keys but expect some SST files to be invalid. Would compaction chime in and compact several SSTs into a bigger one?

Member Author

Right. We do a RocksDB checkpoint for each commit operation; each checkpoint is a full snapshot that includes all data. In this UT we have 50 versions but retain only 10, so the SST files for the deleted versions (1 to 40) will be deleted.
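
The arithmetic in the reply above can be checked with a small sketch. This is only an illustration under the stated numbers (50 versions, 10 retained); `RetentionSketch` and `versionsToDelete` are hypothetical names, not part of the PR.

```scala
object RetentionSketch {
  /** Hypothetical helper: the versions that fall out of the retention window
    * when only the newest numVersionsToRetain versions are kept. */
  def versionsToDelete(versions: Seq[Long], numVersionsToRetain: Int): Seq[Long] = {
    // Sort ascending, then drop the newest versions that must be kept.
    versions.sorted.dropRight(numVersionsToRetain)
  }
}
```

For versions 1 to 50 and a retention of 10, the result is versions 1 to 40, matching the range mentioned in the comment.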

Contributor

nit: double empty lines

Member Author

Thanks, done in the next commit.

SparkQA commented Jun 29, 2021

Test build #140373 has finished for PR 32933 at commit 321c8b0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 30, 2021

Test build #140438 has finished for PR 32933 at commit 56439b9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44952/

xuanyuanking changed the title from "[WIP][SPARK-35785][SS] Cleanup support for RocksDB instance" to "[SPARK-35785][SS] Cleanup support for RocksDB instance" on Jun 30, 2021
@xuanyuanking
Member Author

Rebased and addressed all the comments. Thanks for your heads-up! cc @viirya @HeartSaVioR

SparkQA commented Jun 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44958/

SparkQA commented Jun 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44958/

@xuanyuanking
Member Author

Thanks again for the help of @HeartSaVioR and @viirya.
The test failure is not related to the changes. Let me retrigger.

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45009/

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45009/

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45024/

SparkQA commented Jul 1, 2021

Test build #140511 has finished for PR 32933 at commit ea94983.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 1, 2021

Test build #140498 has finished for PR 32933 at commit 9c99cdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45024/

@HeartSaVioR
Contributor

retest this, please

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45034/

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45034/

SparkQA commented Jul 1, 2021

Test build #140521 has finished for PR 32933 at commit ea94983.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

logInfo(s"Rolled back to $loadedVersion")
}

def cleanup(): Unit = {
Member

When will we call this method?

Member Author

It will be called in RocksDBStateStoreProvider.doMaintenance. I'll submit the state store provider PR (the last one) today.

Member

okay.

}
}

test("disallow concurrent updates to the same RocksDB instance") {
Member

This test doesn't seem related to the cleanup change here? It looks more related to the RocksDB instance PR.

Member Author

Ah yea, this is the test for rollback.
Actually, the original plan was to expose rollback and cleanup in this PR. It was a mistake in the last PR: I introduced rollback without tests.

Member

okay

override def run(): Unit = {
try {
for (version <- 0 to numUpdatesInEachThread) {
withDB(
Member

Hm, what is this test used for? Each RocksDB in each thread uses the same remote root dir; won't they conflict?

Member Author

This is to simulate the multi-threaded scenario of updating and cleaning old versions. It will not conflict since we call commit in each update thread and the version gets updated on each commit.

Member

Hmm, will it happen? I think RocksDB is not thread-safe, and each state task has only one RocksDB instance. They should update and clean old versions individually, as they are for different state stores.

Contributor

This looks more like simulating the case where multiple streaming queries with the same checkpoint run concurrently.

SST files shouldn't conflict since we make the file names unique, and for metadata files we use overwriteIfPossible = true, so no error is thrown if the file already exists.

Member Author

Yea, I think the purpose of this test is to make sure no error is thrown and the result is correct in the end.
After taking a further look, there's a small issue: the exception is never used. I'll confirm it separately.

Member

@xuanyuanking and I discussed this test offline. There seems to be something wrong with the exception usage; it doesn't look completely correct. @xuanyuanking will address it in a follow-up, either by fixing the test or by deleting it.
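
The thread does not show what exactly was wrong with the exception handling, so the following is only a generic sketch of one common way to make failures in worker threads visible to a test: record the first exception and check it after joining. `ThreadFailureSketch` and `runAll` are hypothetical names and this is not the fix that was eventually applied.

```scala
import java.util.concurrent.atomic.AtomicReference

object ThreadFailureSketch {
  /** Hypothetical helper: run each task on its own thread and return the
    * first exception thrown by any of them, if there was one. */
  def runAll(tasks: Seq[() => Unit]): Option[Throwable] = {
    val firstError = new AtomicReference[Throwable](null)
    val threads = tasks.map { task =>
      new Thread(new Runnable {
        override def run(): Unit =
          try task() catch {
            case t: Throwable =>
              // Record the failure; otherwise it is lost when the thread ends.
              firstError.compareAndSet(null, t)
              ()
          }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    Option(firstError.get())
  }
}
```

A test could then assert that the returned option is empty, so that an exception captured in a worker thread actually fails the test instead of being silently dropped.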

Member

viirya left a comment

lgtm

@viirya
Member

viirya commented Jul 2, 2021

Thanks @xuanyuanking for working on this and @HeartSaVioR for the review! Merging to master.

@xuanyuanking
Member Author

Great thanks for the help! @viirya @HeartSaVioR
I'll update the remaining PRs and submit the last one, for RocksDBStateStoreProvider, today.

viirya closed this in ca6acf0 on Jul 2, 2021
xuanyuanking deleted the SPARK-35785 branch on July 2, 2021 07:49
@viirya
Member

viirya commented Jul 2, 2021

Ah, sorry, I forgot branch-3.2 was cut and this should be in branch-3.2 too. @xuanyuanking Can you submit a PR for 3.2?

@xuanyuanking
Member Author

Sure. Let me do it now.

xuanyuanking added a commit to xuanyuanking/spark that referenced this pull request Jul 2, 2021
### What changes were proposed in this pull request?
Add the functionality of cleaning up files of old versions for the RocksDB instance and RocksDBFileManager.

### Why are the changes needed?
Part of the implementation of RocksDB state store.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT added.

Closes apache#32933 from xuanyuanking/SPARK-35785.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@dongjoon-hyun
Member

@dongjoon-hyun
Member

It seems that https://issues.apache.org/jira/browse/SPARK-35993 is filed already by @attilapiros

@viirya
Member

viirya commented Jul 2, 2021

Thanks @dongjoon-hyun! Let me ignore the test first to unblock others. @xuanyuanking will address (fix or delete) the test later.

dongjoon-hyun pushed a commit that referenced this pull request Jul 2, 2021
### What changes were proposed in this pull request?

This patch ignores the test "ensure that concurrent update and cleanup consistent versions" in #32933. The test is currently flaky and we will address it later.

### Why are the changes needed?

Unblock other developments.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33195 from viirya/ignore-rocksdb-test.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Jul 2, 2021
### What changes were proposed in this pull request?

This patch ignores the test "ensure that concurrent update and cleanup consistent versions" in #32933. The test is currently flaky and we will address it later.

### Why are the changes needed?

Unblock other developments.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33195 from viirya/ignore-rocksdb-test.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a6e00ee)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>