
[SPARK-11324][STREAMING] Flag for closing Write Ahead Logs after a write #9285

Closed
wants to merge 6 commits into master from brkyvz:caw-wal

Conversation

@brkyvz (Contributor) commented Oct 26, 2015

Currently, the Write Ahead Log in Spark Streaming flushes data to the underlying filesystem as writes are made. S3 does not support flushing; data is only persisted once the stream is actually closed. In case of failure, the data for the last minute (the default rolling interval) will not be properly written. Therefore, we need a flag to close the stream after each write, so that we achieve read-after-write consistency.

cc @tdas @zsxwing
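
A minimal usage sketch with the flag as initially proposed in this patch (the key is renamed to per-role driver/receiver variants during the review below):

```scala
import org.apache.spark.SparkConf

// Usage sketch: opt in to closing the WAL file after every write, trading
// write throughput for read-after-write consistency on stores like S3.
// This is the key as initially proposed in this patch; the review below
// renames it to spark.streaming.driver.writeAheadLog.closeFileAfterWrite
// (and the ...receiver... variant). It is off by default.
val conf = new SparkConf()
  .setAppName("streaming-wal-s3")
  .set("spark.streaming.writeAheadLog.closeAfterWrite", "true")
```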

@harishreedharan (Contributor)

Wouldn't this be really expensive?

@brkyvz (Contributor, Author) commented Oct 26, 2015

@harishreedharan There is a related patch that allows batching here. Unfortunately it is expensive, but it is necessary. It is also turned off by default.

@brkyvz brkyvz changed the title [SPARK-11324] Flag for closing Write Ahead Logs after a write [SPARK-11324][STREAMING] Flag for closing Write Ahead Logs after a write Oct 26, 2015
@harishreedharan (Contributor)

If there is no other option, this LGTM

@SparkQA commented Oct 26, 2015

Test build #44375 has finished for PR 9285 at commit a1636c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -39,6 +39,7 @@ private[streaming] object WriteAheadLogUtils extends Logging {

   val DEFAULT_ROLLING_INTERVAL_SECS = 60
   val DEFAULT_MAX_FAILURES = 3
+  val WAL_CLOSE_AFTER_WRITE = "spark.streaming.writeAheadLog.closeAfterWrite"
Contributor

We should stick to existing naming conventions for this configuration. So let's have spark.streaming.driver.wal.closeFileAfterWrite and the same for ...receiver...

So this should be a configuration like rollingIntervalSecs.

Contributor Author

Do receivers by default use local disk for WAL?

Contributor

Wouldn't the config in that case be something like spark.streaming.driver.writeAheadLog.closeFileAfterWrite ?

Contributor

Yeah, I shortened it to make the point... sorry if that wasn't clear.
@brkyvz receivers by default do not use the WAL.

Contributor Author

@tdas I noticed that this is going to be super hard. How are we going to differentiate between a receiver WAL and a driver WAL in the write method of FileBasedWAL?

Contributor Author

Never mind, I worked around it.

Contributor

Take a look at how rollingIntervalSecs is implemented. That also has separate configurations for the driver and receiver.
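
For reference, a sketch of that per-role pattern as it might look for the new flag — the key names follow the convention settled on in this review, while the object and helper names are illustrative rather than the exact merged code:

```scala
import org.apache.spark.SparkConf

// Illustrative sketch: separate driver and receiver configuration keys,
// resolved by a helper that knows which role is creating the log.
object WalCloseAfterWriteConf {
  val DRIVER_WAL_CLOSE_AFTER_WRITE_CONF_KEY =
    "spark.streaming.driver.writeAheadLog.closeFileAfterWrite"
  val RECEIVER_WAL_CLOSE_AFTER_WRITE_CONF_KEY =
    "spark.streaming.receiver.writeAheadLog.closeFileAfterWrite"

  // Closing after every write is expensive, so it defaults to off.
  def shouldCloseFileAfterWrite(conf: SparkConf, isDriver: Boolean): Boolean = {
    val key =
      if (isDriver) DRIVER_WAL_CLOSE_AFTER_WRITE_CONF_KEY
      else RECEIVER_WAL_CLOSE_AFTER_WRITE_CONF_KEY
    conf.getBoolean(key, defaultValue = false)
  }
}
```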


Contributor Author

Done

@@ -47,7 +47,8 @@ private[streaming] class FileBasedWriteAheadLog(
     logDirectory: String,
     hadoopConf: Configuration,
     rollingIntervalSecs: Int,
-    maxFailures: Int
+    maxFailures: Int,
+    closeAfterWrite: Boolean = false
Contributor Author

I didn't want to break existing code. Apparently FileBasedWriteAheadLog is instantiated in only 4 spots, so I can manually add the closeAfterWrite = false parameter to all of them. It could also be okay to keep the default, though, since the class is private[streaming] and we don't need to preserve Java compatibility for something that isn't exposed.

Contributor

Yeah, it is not exposed. So it's better to keep it as a non-default parameter and add the parameter judiciously in all 4 spots.

Contributor Author

Done
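
For context, a minimal sketch of how such a constructor flag might gate the write path — a fragment written against FileBasedWriteAheadLog's private helpers (getLogWriter, resetWriter), simplified from the patch with retry and error handling omitted:

```scala
import java.nio.ByteBuffer

// Fragment sketch (illustrative, not the merged code): after a successful
// write, close the current log file so the bytes are actually persisted on
// stores like S3 that only materialize data on close. The next write lazily
// opens a fresh log file via getLogWriter.
def write(byteBuffer: ByteBuffer, time: Long): FileBasedWriteAheadLogSegment =
  synchronized {
    val segment = getLogWriter(time).write(byteBuffer)
    if (closeAfterWrite) {
      resetWriter() // close the current writer now, not at the rolling interval
    }
    segment
  }
```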

@SparkQA commented Oct 27, 2015

Test build #44394 has finished for PR 9285 at commit 8a1efea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 27, 2015

Test build #44439 has finished for PR 9285 at commit 5dd0ac8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 27, 2015

Test build #44444 has finished for PR 9285 at commit a8dcd87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas (Contributor) commented Oct 27, 2015

LGTM. Merging this to master. Thanks @brkyvz

@asfgit asfgit closed this in 4f030b9 Oct 27, 2015
@brkyvz brkyvz deleted the caw-wal branch February 3, 2019 20:55