[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the return value #18799

Closed
wants to merge 5 commits into from

@zsxwing (Member) commented Aug 1, 2017

What changes were proposed in this pull request?

While investigating a flaky test, I realized that many places don't check the return value of HDFSMetadataLog.get(batchId: Long): Option[T]. When a batch is supposed to exist, the caller silently ignores None instead of throwing an error. If a bug prevents a query from generating a batch metadata file, this behavior hides the bug: the query keeps running and eventually deletes the metadata logs, which makes the problem hard to debug.

This PR ensures that places calling HDFSMetadataLog.get always check the return value.
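
The difference between the two call patterns can be sketched in isolation. This is a minimal illustration, not code from the patch: `FakeLog` is a hypothetical in-memory stand-in for `HDFSMetadataLog`, and `MissingBatchDemo`, `lenient`, and `strict` are names invented for the example.

```scala
// Hypothetical stand-in for HDFSMetadataLog: get returns None for a missing batch.
final class FakeLog[T](entries: Map[Long, T]) {
  def get(batchId: Long): Option[T] = entries.get(batchId)
}

object MissingBatchDemo {
  // Batch 1 was never written, simulating the kind of bug described above.
  val log = new FakeLog[String](Map(0L -> "a", 2L -> "c"))
  val expected: Seq[Long] = Seq(0L, 1L, 2L)

  // Unchecked pattern: flatMap silently drops the missing batch.
  def lenient: Seq[String] = expected.flatMap(id => log.get(id))

  // Checked pattern: a missing batch fails loudly.
  def strict: Seq[String] = expected.map { id =>
    log.get(id).getOrElse {
      throw new IllegalStateException(s"batch $id doesn't exist")
    }
  }
}
```

In this sketch `lenient` returns only two entries with no sign that anything is wrong, while `strict` throws as soon as it reaches the missing batch.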

How was this patch tested?

Jenkins

@SparkQA commented Aug 1, 2017

Test build #80129 has finished for PR 18799 at commit 7090fc0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 2, 2017

Test build #80138 has finished for PR 18799 at commit 91efeb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -123,7 +123,7 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path:
 serialize(metadata, output)
 return Some(tempPath)
 } finally {
-IOUtils.closeQuietly(output)
+output.close()

@zsxwing (Author, Member) commented Aug 7, 2017

The output stream may fail to close (e.g., it may fail to flush its internal buffer). If that happens, we should fail the query rather than ignore the error.
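
To make the trade-off concrete, here is a small self-contained sketch. It does not use Spark or Hadoop classes: `FailingStream` is a hypothetical stream whose `close()` always throws, and `writeQuietly` only approximates what a `closeQuietly`-style helper does.

```scala
import java.io.{IOException, OutputStream}

// Hypothetical stream whose close() fails, e.g. because the final flush fails.
final class FailingStream extends OutputStream {
  override def write(b: Int): Unit = ()
  override def close(): Unit = throw new IOException("flush failed on close")
}

object CloseDemo {
  // Swallows the close failure, so the caller believes the write succeeded.
  def writeQuietly(out: OutputStream): Boolean = {
    try {
      out.write(1)
      true
    } finally {
      try out.close() catch { case _: IOException => () }
    }
  }

  // Lets the close failure propagate, so the caller can fail the write.
  def writeStrict(out: OutputStream): Boolean = {
    try {
      out.write(1)
      true
    } finally {
      out.close()
    }
  }
}
```

With the failing stream, `writeQuietly` reports success even though the data may never have been flushed, whereas `writeStrict` surfaces the `IOException`. One caveat of a bare `close()` in `finally` is that, if the body has already thrown, the exception from `close()` replaces the original one.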

-val allLogs = validBatches.flatMap(batchId => super.get(batchId)).flatten ++ logs
+val allLogs = validBatches.map { batchId =>
+  super.get(batchId).getOrElse {
+    throw new IllegalStateException(s"batch $batchId doesn't exist")

@tdas (Contributor) commented Aug 8, 2017

What does it mean when this get returns None? Is the batch file somehow missing? Is there any way to improve the message further?

-getAllValidBatches(latestId, compactInterval).flatMap(id => super.get(id)).flatten
+getAllValidBatches(latestId, compactInterval).map { id =>
+  super.get(id).getOrElse {
+    throw new IllegalStateException(s"batch $id doesn't exist")

@tdas (Contributor) commented Aug 8, 2017

It would be good to give more context in the message, such as latestId.
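
One way to act on this suggestion is sketched below. It is an illustration only and not necessarily what the patch ended up doing: `BatchErrors`, `missingBatch`, and `getOrFail` are hypothetical names, and the lookup is passed in as a plain function so that the example does not depend on Spark.

```scala
object BatchErrors {
  // Hypothetical helper: the message names the missing batch and the values
  // that determined which batches were expected.
  def missingBatch(id: Long, latestId: Long, compactInterval: Int): IllegalStateException =
    new IllegalStateException(
      s"batch $id doesn't exist (latestId: $latestId, compactInterval: $compactInterval)")

  // Looks up a batch and fails with the contextual message when it is absent.
  def getOrFail[T](lookup: Long => Option[T], id: Long, latestId: Long, compactInterval: Int): T =
    lookup(id).getOrElse(throw missingBatch(id, latestId, compactInterval))
}
```

Including the inputs that drove the lookup lets a reader of the stack trace see which batches were expected without re-running the query.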

@@ -396,6 +400,7 @@ object HDFSMetadataLog {

/**
* Rename a path. Note that this implementation is not atomic.
*

@tdas (Contributor) commented Aug 8, 2017

Is this needed?

@SparkQA commented Aug 8, 2017

Test build #80370 has finished for PR 18799 at commit 5170950.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
// Verify that we can get all batches between `startId` and `endId`.
if (startId.isDefined || endId.isDefined) {
if (batchIds.isEmpty) {
throw new IllegalStateException(s"batch ${startId.orElse(endId).get} doesn't exist")

@tdas (Contributor) commented Aug 8, 2017

It would be good to print the range that was asked for. Otherwise it's hard to see what was expected while debugging.

test("verifyBatchIds") {
import HDFSMetadataLog.verifyBatchIds
verifyBatchIds(Seq(1L, 2L, 3L), Some(1L), Some(3L))
verifyBatchIds(Seq(1L), Some(1L), Some(1L))

@tdas (Contributor) commented Aug 8, 2017

You didn't test the valid cases where one of the start or end is None.
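
For readers without the full diff, the following sketch shows a range check of this kind, together with the valid one-sided cases. It is a simplified reconstruction under stated assumptions, not the implementation in the patch: it assumes `batchIds` is sorted in ascending order, treats a `None` bound as unbounded on that side, and lives in a hypothetical `BatchCheck` object.

```scala
object BatchCheck {
  // Sketch: verifies that sorted `batchIds` contain every id in [startId, endId].
  // A None bound falls back to the first or last id that is actually present.
  def verifyBatchIds(batchIds: Seq[Long], startId: Option[Long], endId: Option[Long]): Unit = {
    if (startId.isDefined || endId.isDefined) {
      if (batchIds.isEmpty) {
        throw new IllegalStateException(
          s"batch ${startId.orElse(endId).get} doesn't exist " +
            s"(startId: $startId, endId: $endId)")
      }
      val lo = startId.getOrElse(batchIds.head)
      val hi = endId.getOrElse(batchIds.last)
      val present: Set[Long] = batchIds.toSet
      val missing = (lo to hi).filterNot(present.contains)
      if (missing.nonEmpty) {
        throw new IllegalStateException(
          s"batches (${missing.mkString(", ")}) don't exist " +
            s"(startId: $startId, endId: $endId)")
      }
    }
  }
}

object BatchCheckUsage {
  // Valid cases, including those where only one bound is given.
  def validCases(): Unit = {
    BatchCheck.verifyBatchIds(Seq(1L, 2L, 3L), Some(1L), Some(3L))
    BatchCheck.verifyBatchIds(Seq(1L, 2L, 3L), None, Some(3L))
    BatchCheck.verifyBatchIds(Seq(1L, 2L, 3L), Some(1L), None)
    BatchCheck.verifyBatchIds(Seq.empty, None, None)
  }
}
```

The error messages in this sketch include the requested range, so a failure shows both what is missing and what was asked for.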

@@ -1314,6 +1314,7 @@ class FileStreamSourceSuite extends FileStreamSourceTest {
val metadataLog =
new FileStreamSourceLog(FileStreamSourceLog.VERSION, spark, dir.getAbsolutePath)
assert(metadataLog.add(0, Array(FileEntry(s"$scheme:///file1", 100L, 0))))
assert(metadataLog.add(1, Array(FileEntry(s"$scheme:///file2", 200L, 0))))

@tdas (Contributor) commented Aug 8, 2017

What was this change for?

@zsxwing (Author, Member) commented Aug 9, 2017

newSource.getBatch(None, FileStreamSourceOffset(1)) will fail without this because batch 1 doesn't exist.

@SparkQA commented Aug 8, 2017

Test build #80418 has finished for PR 18799 at commit 7f85b63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member, Author) commented Aug 9, 2017

@tdas I have addressed your comments.

@tdas (Contributor) commented Aug 9, 2017

LGTM.

@SparkQA commented Aug 9, 2017

Test build #80429 has finished for PR 18799 at commit 16b02da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in 6edfff0 on Aug 9, 2017

@tdas (Contributor) commented Aug 9, 2017

Merged to master, but there were conflicts with 2.2. Can you make another PR for 2.2?

tdas added a commit to tdas/spark that referenced this pull request Aug 9, 2017
[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the return value

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#18799 from zsxwing/SPARK-21596.

@tdas (Contributor) commented Aug 9, 2017

Never mind, @zsxwing. I already opened PR #18890.

@zsxwing deleted the zsxwing:SPARK-21596 branch on Aug 9, 2017

asfgit pushed a commit that referenced this pull request Aug 9, 2017
[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the return value

Same PR as #18799 but for branch 2.2. The main discussion is in the other PR.
Author: Shixiong Zhu <shixiong@databricks.com>

Closes #18890 from tdas/SPARK-21596-2.2.
MatthewRBruce added a commit to Shopify/spark that referenced this pull request Jul 31, 2018
[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the return value

Same PR as apache#18799 but for branch 2.2. The main discussion is in the other PR.
Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#18890 from tdas/SPARK-21596-2.2.