[SPARK-19779][SS]Delete needless tmp file after restart structured streaming job #17124

Closed
wants to merge 1 commit into from

Conversation

gf53520
Contributor

@gf53520 gf53520 commented Mar 1, 2017

What changes were proposed in this pull request?

SPARK-19779

PR #17012 fixed restarting a Structured Streaming application that uses HDFS as its file system, but one problem remains: a temp file for the delta file is left behind on HDFS, and Structured Streaming never deletes the temp file generated when the streaming job restarts.

How was this patch tested?

unit tests

@gf53520 gf53520 changed the title from "Delete needless tmp file after restart structured streaming job" to "[SPARK-19779][SS]Delete needless tmp file after restart structured streaming job" Mar 1, 2017
- if (!fs.exists(finalDeltaFile) && !fs.rename(tempDeltaFile, finalDeltaFile)) {
+ if (fs.exists(finalDeltaFile)) {
+   fs.delete(tempDeltaFile, true)
+ } else if (!fs.rename(tempDeltaFile, finalDeltaFile)) {
Member

If the file exists, it is deleted, but no new file is renamed to it -- is that right?

Contributor Author

@gf53520 gf53520 Mar 1, 2017

When a streaming job restarts, the finalDeltaFile generated by the first batch is the same as the finalDeltaFile generated by the last batch before the restart. So there is no need to rename the temp file here to create an identical file.

Member

I guess my point is, after this change, the file may not exist after this executes. Before, it always existed after this block. I wasn't sure that was the intended behavior change because the purpose seems to be to delete the temp file.

Contributor Author

Yes. This PR only deletes the needless temp file; the delta file still needs to exist.
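The behavior discussed in this thread can be sketched outside Spark. The following is only an illustration, assuming the local filesystem via `java.nio.file` as a stand-in for the Hadoop `FileSystem` calls in the diff; `DeltaCommitSketch` and `commitTempFile` are hypothetical names, not part of the patch. It shows that the final file exists after the commit step in both branches, and that the temp file is gone either way.

```scala
import java.nio.file.{Files, Path}

object DeltaCommitSketch {
  // Hypothetical helper (not Spark's code): publish tempFile as finalFile,
  // or discard tempFile when finalFile was already written by an earlier run.
  def commitTempFile(tempFile: Path, finalFile: Path): Unit = {
    if (Files.exists(finalFile)) {
      // A rerun of the same batch produced the same content; drop the redundant temp file.
      Files.deleteIfExists(tempFile)
    } else {
      // Normal path: move the temp file to its final name.
      Files.move(tempFile, finalFile)
    }
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("state-sketch")
    val finalFile = dir.resolve("1.delta")

    // Commit the same version twice, as happens when a batch is rerun after a restart.
    for (_ <- 1 to 2) {
      val temp = Files.createTempFile(dir, "temp-", ".tmp")
      Files.write(temp, "delta".getBytes("UTF-8"))
      commitTempFile(temp, finalFile)
    }

    // Only the final delta file should remain in the directory.
    println(dir.toFile.list().sorted.mkString(","))
  }
}
```

Note that the exists-then-move sequence in this sketch is not atomic, so it says nothing about concurrent writers; it only mirrors the branch structure of the change.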

@SparkQA

SparkQA commented Mar 2, 2017

Test build #3589 has finished for PR 17124 at commit 5600776.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@zsxwing zsxwing left a comment

Thanks for doing this. Just some minor issues.

@@ -282,8 +282,12 @@ private[state] class HDFSBackedStateStoreProvider(
// target file will break speculation, skipping the rename step is the only choice. It's still
// semantically correct because Structured Streaming requires rerunning a batch should
// generate the same output. (SPARK-19677)
// Also, a tmp file of delta file that generated by the first batch after restart
Member

This comment is not 100% correct; this may also happen in a speculation task.

This PR is just a follow-up to delete the temp file, which #17012 forgot to do. IMO there is no need to add a comment for it.

@@ -295,6 +295,28 @@ class StateStoreSuite extends SparkFunSuite with BeforeAndAfter with PrivateMeth
provider.getStore(0).commit()
}

test("SPARK-19779: A tmp file of delta file should not be reserved on HDFS " +
Member

Instead of adding a new test, I prefer to just add several lines to the test above, "SPARK-19677: Committing a delta file atop an existing one should not fail on HDFS". E.g.

  test("SPARK-19677: Committing a delta file atop an existing one should not fail on HDFS") {
    val conf = new Configuration()
    conf.set("fs.fake.impl", classOf[RenameLikeHDFSFileSystem].getName)
    conf.set("fs.default.name", "fake:///")

    val provider = newStoreProvider(hadoopConf = conf)
    provider.getStore(0).commit()
    provider.getStore(0).commit()

    // Verify we don't leak temp files
    val tempFiles = FileUtils.listFiles(new File(provider.id.checkpointLocation), null, true)
      .asScala.filter(_.getName.contains("temp-"))
    assert(tempFiles.isEmpty)
  }
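
For readers who want to try the leak check from the suggestion above without the Spark test harness or commons-io, here is a rough standalone sketch. It assumes the checkpoint directory is on the local filesystem and that temp files can be recognized by "temp-" in their name, as in the suggested assertion; `TempFileLeakCheck` and `findTempFiles` are hypothetical names.

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable.ArrayBuffer

object TempFileLeakCheck {
  // Hypothetical helper: recursively collect regular files whose name contains "temp-".
  def findTempFiles(root: Path): Seq[String] = {
    val found = ArrayBuffer.empty[String]
    val stream = Files.walk(root)
    try {
      val it = stream.iterator()
      while (it.hasNext) {
        val p = it.next()
        if (Files.isRegularFile(p) && p.getFileName.toString.contains("temp-")) {
          found += p.toString
        }
      }
    } finally {
      // Files.walk holds open directory handles until the stream is closed.
      stream.close()
    }
    found.toSeq
  }
}
```

A test could then assert `TempFileLeakCheck.findTempFiles(checkpointDir).isEmpty` after the second commit, where `checkpointDir` is whatever directory the store provider writes to.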

@gf53520
Contributor Author

gf53520 commented Mar 2, 2017

@zsxwing I have rewritten the test case.

@gf53520 gf53520 force-pushed the SPARK-19779 branch 2 times, most recently from 1a0b232 to db3f4db March 2, 2017 10:38
@zsxwing
Member

zsxwing commented Mar 2, 2017

retest this please

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73786 has finished for PR 17124 at commit db3f4db.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gf53520
Contributor Author

gf53520 commented Mar 3, 2017

retest this please

@zsxwing
Member

zsxwing commented Mar 3, 2017

ok to test

@SparkQA

SparkQA commented Mar 3, 2017

Test build #73803 has finished for PR 17124 at commit c5895e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member

zsxwing commented Mar 3, 2017

LGTM. Merging to master, 2.1 and 2.0. Thanks!

asfgit pushed a commit that referenced this pull request Mar 3, 2017
…treaming job

## What changes were proposed in this pull request?

[SPARK-19779](https://issues.apache.org/jira/browse/SPARK-19779)

PR #17012 fixed restarting a Structured Streaming application that uses HDFS as its file system, but one problem remains: a temp file for the delta file is left behind on HDFS, and Structured Streaming never deletes the temp file generated when the streaming job restarts.

## How was this patch tested?
 unit tests

Author: guifeng <guifengleaf@gmail.com>

Closes #17124 from gf53520/SPARK-19779.

(cherry picked from commit e24f21b)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
@asfgit asfgit closed this in e24f21b Mar 3, 2017