[SPARK-13791][SQL]Add MetadataLog and HDFSMetadataLog #11625
Conversation
```diff
@@ -360,59 +357,6 @@ class FileStreamSourceSuite extends FileStreamSourceTest with SharedSQLContext {
     Utils.deleteRecursively(tmp)
   }
-
-  test("fault tolerance with corrupted metadata file") {
```
Removed these tests as they were testing the old metadata file, which has been removed in this PR.
Test build #52801 has finished for PR 11625 at commit
```scala
private val serializer = new JavaSerializer(sqlContext.sparkContext.conf).newInstance()

private def tryAcquireLock(): Unit = {
```
What happens if you die while you are holding the lock? It seems like your streaming job will be unable to restart without human intervention. Is there a reason that we can't detect problems when writing a new log entry?
The semantics of `add` is "if batchId's metadata has already been stored, this method does nothing". So if writer A sees a file created by another writer B, it won't write the file; A won't fail and will simply use the metadata written by B.
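A minimal sketch of the idempotent `add` semantics described above. The class and file layout here are illustrative stand-ins, not the actual HDFSMetadataLog code:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Illustrative stand-in: one file per batchId under a log directory.
class SimpleMetadataLog(dir: Path) {
  // Returns true if this call stored the metadata; false if another
  // writer already created the file for this batchId, in which case
  // we do nothing and later reads use what that writer stored.
  def add(batchId: Long, metadata: String): Boolean = {
    val file = dir.resolve(batchId.toString)
    if (Files.exists(file)) {
      false // writer B got there first; writer A no-ops
    } else {
      Files.write(file, metadata.getBytes(StandardCharsets.UTF_8))
      true
    }
  }

  def get(batchId: Long): Option[String] = {
    val file = dir.resolve(batchId.toString)
    if (Files.exists(file)) {
      Some(new String(Files.readAllBytes(file), StandardCharsets.UTF_8))
    } else None
  }
}
```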
Seems like this abstraction would be more powerful if we threw a `ConcurrentUpdateException` and let the user decide whether that is okay. If all you are trying to get is idempotence (the file sink), you can ignore it. If you are trying to do mutual exclusion (stream execution trying to define the offsets in each batch id), you can terminate the stream.
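The two policies suggested here could look roughly like this. Note that `ConcurrentUpdateException` is the reviewer's proposed name, not an existing Spark class, and both helper functions are hypothetical:

```scala
// Hypothetical exception signalling that another writer committed
// the same batch first.
class ConcurrentUpdateException(msg: String) extends RuntimeException(msg)

object CommitStrategies {
  // File sink: only idempotence is needed, so a concurrent write of
  // the same batch is harmless and the exception can be swallowed.
  def idempotentCommit(write: () => Unit): Unit =
    try write() catch { case _: ConcurrentUpdateException => () }

  // Stream execution: a concurrent writer means two drivers are
  // defining offsets for the same batch id, so fail the stream.
  def exclusiveCommit(write: () => Unit): Unit =
    try write() catch {
      case e: ConcurrentUpdateException =>
        throw new IllegalStateException("concurrent batch writer detected", e)
    }
}
```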
Basically, I don't believe that stale locks will only occur in extreme circumstances: the JVM can hang due to an OOM, the container Spark is running in can be killed, or someone can `kill -9` the process. All of these leave a stale lock behind and make the stream unrecoverable without human intervention.
You could also return a boolean instead of throwing an exception.
If we don't use a global `.lock` file, there are two cases in which writing a log entry fails with `FileAlreadyExistsException`:

- There is another HDFSMetadataLog using the same path.
- The file is corrupted: we just restarted from a failure and tried to rerun a batch.

For case 1 we want to throw `ConcurrentUpdateException`; for case 2 we need to overwrite the file. To figure out which situation we are in, we would try to read the file and see if it's complete. But if we find the file is corrupted, there are again two possibilities:

- Another HDFSMetadataLog is writing the file.
- Nobody is using the path and the file is corrupted.

So how can we know which case is right?
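One common way to sidestep this ambiguity is to write the entry to a temporary file and then rename it into place, so a visible log file is either absent or complete and "corrupted" stops being a reachable state. The sketch below assumes a filesystem with atomic same-directory rename (as local filesystems and HDFS provide) and ignores the small check-then-act race for brevity; it is an illustration, not the merged HDFSMetadataLog code:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object AtomicLogWrite {
  // Returns false if the batch file already exists (case 1: another
  // writer committed it first); otherwise writes via temp file plus
  // rename, so readers never observe a partial file (eliminating
  // case 2 entirely).
  def write(dir: Path, batchId: Long, metadata: String): Boolean = {
    val target = dir.resolve(batchId.toString)
    if (Files.exists(target)) return false
    val tmp = Files.createTempFile(dir, s".$batchId-", ".tmp")
    Files.write(tmp, metadata.getBytes(StandardCharsets.UTF_8))
    // Same-directory rename is atomic: the batch file appears fully
    // written or not at all.
    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
    true
  }
}
```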
Test build #52835 has finished for PR 11625 at commit
```scala
  }
  try {
    output.writeInt(buffer.remaining())
    Utils.writeByteBuffer(buffer, output: DataOutput)
```
Why the type ascription?
There are two methods in `Utils`:

```scala
def writeByteBuffer(bb: ByteBuffer, out: DataOutput): Unit
def writeByteBuffer(bb: ByteBuffer, out: OutputStream): Unit
```

`FSDataOutputStream` is both a `DataOutput` and an `OutputStream`, so the compiler doesn't know which one to call. That's why I need to add the type ascription here.
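The ambiguity can be reproduced without Hadoop: `java.io.DataOutputStream` implements `DataOutput` and extends `OutputStream`, just like `FSDataOutputStream`, and neither parameter type is more specific than the other. The overloads below mirror the shape of `Utils.writeByteBuffer` but are otherwise made up for illustration:

```scala
import java.io.{ByteArrayOutputStream, DataOutput, DataOutputStream, OutputStream}

object OverloadDemo {
  // Two overloads with the same shape as Utils.writeByteBuffer:
  def target(out: DataOutput): String = "DataOutput"
  def target(out: OutputStream): String = "OutputStream"

  def main(args: Array[String]): Unit = {
    val out = new DataOutputStream(new ByteArrayOutputStream())
    // `target(out)` alone does not compile: the call is ambiguous.
    // A type ascription fixes the argument's static type and thereby
    // selects the overload explicitly:
    println(target(out: DataOutput))   // prints DataOutput
    println(target(out: OutputStream)) // prints OutputStream
  }
}
```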
Test build #53083 has finished for PR 11625 at commit
Test build #53103 has finished for PR 11625 at commit
retest this please
Test build #53112 has finished for PR 11625 at commit
```scala
    None
  }

  override def stop(): Unit = {
```
Get rid of stop? Seems like we would like to avoid relying on this method for correctness (since we need to handle abnormal termination). So I would just leave it out of the interface entirely.
Test build #53137 has finished for PR 11625 at commit
Thanks! Merging to master.
## What changes were proposed in this pull request?

- Add a MetadataLog interface for reliable metadata storage.
- Add HDFSMetadataLog as a MetadataLog implementation based on HDFS.
- Update FileStreamSource to use HDFSMetadataLog instead of managing metadata by itself.

## How was this patch tested?

unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#11625 from zsxwing/metadata-log.
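The `MetadataLog` interface described in the summary can be sketched roughly as follows. The signatures are reconstructed from the discussion in this thread, not copied from the merged code, and the in-memory implementation exists only to make the contract concrete (the PR's real implementation, HDFSMetadataLog, persists entries to HDFS):

```scala
import scala.collection.mutable

// Sketch of the MetadataLog contract: a reliable, per-batch store.
trait MetadataLog[T] {
  // Store metadata for batchId. If batchId's metadata has already
  // been stored, this method does nothing and returns false.
  def add(batchId: Long, metadata: T): Boolean
  def get(batchId: Long): Option[T]
  def getLatest(): Option[(Long, T)]
}

// Minimal in-memory implementation, for illustration only.
class InMemoryMetadataLog[T] extends MetadataLog[T] {
  private val entries = mutable.SortedMap.empty[Long, T]

  def add(batchId: Long, metadata: T): Boolean =
    if (entries.contains(batchId)) false
    else { entries(batchId) = metadata; true }

  def get(batchId: Long): Option[T] = entries.get(batchId)

  def getLatest(): Option[(Long, T)] = entries.lastOption
}
```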