[SPARK-33202][CORE] Fix BlockManagerDecommissioner to return the correct migration status #30116

Closed
wants to merge 3 commits into from

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Oct 21, 2020

What changes were proposed in this pull request?

This PR changes < to > in the following return expression of refreshOffloadingShuffleBlocks to fix data loss during storage migrations.

// If we found any new shuffles to migrate or otherwise have not migrated everything.
- newShufflesToMigrate.nonEmpty || migratingShuffles.size < numMigratedShuffles.get()
+ newShufflesToMigrate.nonEmpty || migratingShuffles.size > numMigratedShuffles.get()

Why are the changes needed?

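refreshOffloadingShuffleBlocks should return true while the migration is still ongoing.

Since migratingShuffles is defined as follows, migratingShuffles.size > numMigratedShuffles.get() means the migration is not finished. The set holds both queued and already migrated shuffles, so its size should never be smaller than the migrated count, and the old < comparison could not report an ongoing migration.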

// Shuffles which are either in queue for migrations or migrated
protected[storage] val migratingShuffles = mutable.HashSet[ShuffleBlockInfo]()
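
To make the effect of the comparison concrete, here is a minimal sketch under stated assumptions. It is not the actual BlockManagerDecommissioner code: MigrationState, stillMigratingBuggy and stillMigratingFixed are hypothetical names, and ShuffleBlockInfo is a reduced stand-in for the real class. Only the two comparisons are taken from this PR.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

// Hypothetical stand-in for Spark's ShuffleBlockInfo; only identity matters here.
final case class ShuffleBlockInfo(shuffleId: Int, mapId: Long)

// Simplified model (an assumption) of the state the return expression reads.
final class MigrationState {
  // Shuffles which are either in queue for migration or already migrated.
  val migratingShuffles = mutable.HashSet[ShuffleBlockInfo]()
  // Number of shuffles whose migration has completed.
  val numMigratedShuffles = new AtomicInteger(0)

  // Old comparison: true only if more shuffles were migrated than were tracked.
  def stillMigratingBuggy(newShuffles: Seq[ShuffleBlockInfo]): Boolean =
    newShuffles.nonEmpty || migratingShuffles.size < numMigratedShuffles.get()

  // Fixed comparison: true while some tracked shuffle has not been migrated yet.
  def stillMigratingFixed(newShuffles: Seq[ShuffleBlockInfo]): Boolean =
    newShuffles.nonEmpty || migratingShuffles.size > numMigratedShuffles.get()
}

object MigrationStatusSketch {
  def main(args: Array[String]): Unit = {
    val state = new MigrationState
    state.migratingShuffles += ShuffleBlockInfo(10, 10)

    // One shuffle queued, none migrated yet.
    println(state.stillMigratingBuggy(Nil)) // false: wrongly looks finished
    println(state.stillMigratingFixed(Nil)) // true: migration still ongoing

    // The queued shuffle finishes migrating.
    state.numMigratedShuffles.incrementAndGet()
    println(state.stillMigratingFixed(Nil)) // false: everything migrated
  }
}
```

In this model, the old comparison reports the migration as finished while a shuffle is still queued, which is how shuffle data could be lost if the caller stops waiting at that point.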

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CI with the updated test cases.

@dongjoon-hyun
Member Author

cc @holdenk

@@ -268,7 +268,7 @@ private[storage] class BlockManagerDecommissioner(
stoppedShuffle = true
}
// If we found any new shuffles to migrate or otherwise have not migrated everything.
- newShufflesToMigrate.nonEmpty || migratingShuffles.size < numMigratedShuffles.get()
+ newShufflesToMigrate.nonEmpty || migratingShuffles.size > numMigratedShuffles.get()
Member Author

This is the fix.

@dongjoon-hyun
Member Author

cc @HyukjinKwon since you are interested in this area, too.

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34680/

@HyukjinKwon
Member

Thanks for cc'ing me.

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34680/

@SparkQA

SparkQA commented Oct 21, 2020

Test build #130071 has finished for PR 30116 at commit 421c10c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34686/

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34686/

@SparkQA

SparkQA commented Oct 21, 2020

Test build #130077 has finished for PR 30116 at commit 421c10c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34707/

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34707/

@agrawaldevesh
Contributor

cc: @Ngone51

@holdenk
Contributor

holdenk commented Oct 21, 2020

LGTM pending integration testing.

@@ -183,7 +183,7 @@ class BlockManagerDecommissionUnitSuite extends SparkFunSuite with Matchers {
val bmDecomManager = new BlockManagerDecommissioner(sparkConf, bm)
bmDecomManager.migratingShuffles += ShuffleBlockInfo(10, 10)

- validateDecommissionTimestampsOnManager(bmDecomManager)
+ validateDecommissionTimestampsOnManager(bmDecomManager, fail = false, assertDone = false)
Contributor

We could also increment numMigratedShuffles, and then we would expect it to finish.
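
The two test scenarios discussed here can be sketched as follows. This is only an illustration under stated assumptions: FakeDecommissioner, migrationOngoing, queuedOnly and queuedThenMigrated are hypothetical names, the set stores plain tuples instead of ShuffleBlockInfo, and the real suite goes through validateDecommissionTimestampsOnManager rather than a direct check like this.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

object DecommissionTestSketch {
  // Hypothetical stand-in for the tracked state; not the real BlockManagerDecommissioner.
  final class FakeDecommissioner {
    // (shuffleId, mapId) pairs that are queued for migration or already migrated.
    val migratingShuffles = mutable.HashSet[(Int, Long)]()
    // Number of shuffles whose migration has completed.
    val numMigratedShuffles = new AtomicInteger(0)

    // Mirrors the fixed comparison, assuming no newly discovered shuffles.
    def migrationOngoing: Boolean =
      migratingShuffles.size > numMigratedShuffles.get()
  }

  // Scenario of the updated test: one shuffle queued, none migrated.
  def queuedOnly(): Boolean = {
    val d = new FakeDecommissioner
    d.migratingShuffles += ((10, 10L))
    d.migrationOngoing
  }

  // Suggested variant: the queued shuffle is also counted as migrated.
  def queuedThenMigrated(): Boolean = {
    val d = new FakeDecommissioner
    d.migratingShuffles += ((10, 10L))
    d.numMigratedShuffles.incrementAndGet()
    d.migrationOngoing
  }
}
```

Under these assumptions, queuedOnly() corresponds to assertDone = false, while queuedThenMigrated() is the case in which the migration would be expected to finish.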

@SparkQA

SparkQA commented Oct 21, 2020

Test build #130098 has finished for PR 30116 at commit fadcf55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34711/

@SparkQA

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34711/

@SparkQA

SparkQA commented Oct 21, 2020

Test build #130102 has finished for PR 30116 at commit fa10f3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Oct 21, 2020

Since this is a bug fix, if no one has any concerns I'll merge this tonight.

@dongjoon-hyun
Member Author

Thank you for the review and approval, @holdenk. I'll merge this with your LGTM. :)
This is an obvious fix that swaps < for >.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-33202 branch October 21, 2020 22:24
@HyukjinKwon
Member

Nice, +1. It seems to have been merged to master properly.

@Ngone51
Member

Ngone51 commented Oct 22, 2020

Good catch! Late LGTM.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…ect migration status

Closes apache#30116 from dongjoon-hyun/SPARK-33202.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>