
[SPARK-13791][SQL]Add MetadataLog and HDFSMetadataLog #11625

Closed
wants to merge 7 commits into from

Conversation

zsxwing (Member) commented Mar 10, 2016

What changes were proposed in this pull request?

  • Add a MetadataLog interface for reliably storing metadata (see the sketch below).
  • Add HDFSMetadataLog as a MetadataLog implementation based on HDFS.
  • Update FileStreamSource to use HDFSMetadataLog instead of managing metadata by itself.
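
A minimal sketch of what the proposed interface might look like; the method names below are assumptions for illustration, not necessarily the exact API added in this PR:

trait MetadataLog[T] {
  /** Store metadata for batchId; if batchId's metadata has already been stored, do nothing. */
  def add(batchId: Long, metadata: T): Unit

  /** Return the metadata stored for batchId, if any. */
  def get(batchId: Long): Option[T]

  /** Return the most recently stored (batchId, metadata) pair, if any. */
  def getLatest(): Option[(Long, T)]
}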

How was this patch tested?

unit tests

@@ -360,59 +357,6 @@ class FileStreamSourceSuite extends FileStreamSourceTest with SharedSQLContext {
Utils.deleteRecursively(tmp)
}

test("fault tolerance with corrupted metadata file") {
zsxwing (Member Author):

Removed these tests since they were testing the old metadata file, which has been removed in this PR.

zsxwing (Member Author) commented Mar 10, 2016

cc @marmbrus @tdas

SparkQA commented Mar 10, 2016

Test build #52801 has finished for PR 11625 at commit c32cd96.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HDFSMetadataLog[T: ClassTag](sqlContext: SQLContext, path: String) extends MetadataLog[T]
    • trait MetadataLog[T]


private val serializer = new JavaSerializer(sqlContext.sparkContext.conf).newInstance()

private def tryAcquireLock(): Unit = {
Contributor:

What happens if you die while you are holding the lock? It seems like your streaming job will be unable to restart without human intervention. Is there a reason that we can't detect problems when writing a new log entry?

zsxwing (Member Author):

Since the semantics of add is "If batchId's metadata has already been stored, this method does nothing", if writer A sees a file created by another writer B, it won't write the file. So A won't fail and will use the metadata written by B.
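
A minimal sketch of that check-then-write behavior (a hypothetical helper, not this PR's actual implementation; it assumes an HDFS-style rename that fails if the destination already exists):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object MetadataWriteSketch {
  // Idempotent add: if the batch file already exists (written by another writer),
  // skip the write and rely on that metadata instead of failing.
  def add(metadataPath: Path, batchId: Long, bytes: Array[Byte]): Unit = {
    val fs: FileSystem = metadataPath.getFileSystem(new Configuration())
    val batchFile = new Path(metadataPath, batchId.toString)
    if (!fs.exists(batchFile)) {
      val tempFile = new Path(metadataPath, s".$batchId.tmp")
      val out = fs.create(tempFile)
      try out.write(bytes) finally out.close()
      // If another writer created batchFile in the meantime, rename returns false
      // and we simply discard our temp file; the other writer's metadata is used.
      if (!fs.rename(tempFile, batchFile)) {
        fs.delete(tempFile, false)
      }
    }
  }
}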

Contributor:

Seems like this abstraction would be more powerful if we threw a ConcurrentUpdateException and then the user could decide if that is okay. If all you are trying to get is idempotence (the file sink) then you can ignore it. If you are trying to do mutual exclusion (stream execution trying to define the offsets in each batch id) you can terminate the stream.

Contributor:

Basically, I don't believe that stale locks are only going to occur in extreme circumstances: the JVM can lock up due to an OOM, the container Spark is running in can be killed, or someone can kill -9 the process. All of these become unrecoverable.

Contributor:

You could also return a boolean instead of throwing an exception.
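
A minimal sketch of how callers could use a boolean-returning add for the two cases mentioned above (the trait and helper names here are illustrative only):

object BooleanAddSketch {
  trait MetadataLog[T] {
    /** Returns true if this call stored the metadata, false if batchId was already stored. */
    def add(batchId: Long, metadata: T): Boolean
  }

  // File sink: only idempotence is needed, so losing the race to another writer is fine.
  def commitSinkBatch[T](log: MetadataLog[T], batchId: Long, metadata: T): Unit = {
    log.add(batchId, metadata) // ignore the result; someone stored it, which is all we need
  }

  // Stream execution: only one writer may define the offsets for a batch, so losing
  // the race means another driver is active and this stream should terminate.
  def commitOffsets[T](log: MetadataLog[T], batchId: Long, offsets: T): Unit = {
    if (!log.add(batchId, offsets)) {
      throw new IllegalStateException(s"Batch $batchId was already written by another writer")
    }
  }
}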

zsxwing (Member Author):

If we don't use a global .lock file, there are two cases in which writing a log entry fails with FileAlreadyExistsException:

  1. There is another HDFSMetadataLog using the same path.
  2. The file is corrupted because we just restarted from a failure and are rerunning a batch.

For case 1, we want to throw ConcurrentUpdateException; for case 2, we need to overwrite the file.

Since we need to figure out which situation we are in, we will try to read the file to see if it's complete. However, if we find that the file is corrupted, there are again two possibilities:

  1. Another HDFSMetadataLog is writing the file.
  2. Nobody else is using the path and the file is simply corrupted.

So how can we tell which case we are in?

SparkQA commented Mar 10, 2016

Test build #52835 has finished for PR 11625 at commit d77bf39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
try {
output.writeInt(buffer.remaining())
Utils.writeByteBuffer(buffer, output: DataOutput)
Contributor:

Why the type ascription?

zsxwing (Member Author):

There are two methods in Utils:

def writeByteBuffer(bb: ByteBuffer, out: DataOutput): Unit
def writeByteBuffer(bb: ByteBuffer, out: OutputStream): Unit

FSDataOutputStream is both a DataOutput and an OutputStream, so the compiler can't decide which overload to call. That's why I need the type ascription here.
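
A self-contained reproduction of the ambiguity (it uses java.io.DataOutputStream instead of Hadoop's FSDataOutputStream, but like FSDataOutputStream it is both a DataOutput and an OutputStream):

import java.io.{ByteArrayOutputStream, DataOutput, DataOutputStream, OutputStream}
import java.nio.ByteBuffer

object TypeAscriptionSketch {
  def writeByteBuffer(bb: ByteBuffer, out: DataOutput): Unit = {
    val bytes = new Array[Byte](bb.remaining())
    bb.get(bytes)
    out.write(bytes)
  }

  def writeByteBuffer(bb: ByteBuffer, out: OutputStream): Unit = {
    val bytes = new Array[Byte](bb.remaining())
    bb.get(bytes)
    out.write(bytes)
  }

  def main(args: Array[String]): Unit = {
    val buffer = ByteBuffer.wrap("metadata".getBytes("UTF-8"))
    val output = new DataOutputStream(new ByteArrayOutputStream())
    // writeByteBuffer(buffer, output)          // does not compile: ambiguous overloaded reference
    writeByteBuffer(buffer, output: DataOutput) // the ascription selects the DataOutput overload
  }
}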

SparkQA commented Mar 14, 2016

Test build #53083 has finished for PR 11625 at commit 7a52adc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 14, 2016

Test build #53103 has finished for PR 11625 at commit 4c27c6e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

zsxwing (Member Author) commented Mar 14, 2016

retest this please

SparkQA commented Mar 14, 2016

Test build #53112 has finished for PR 11625 at commit 4c27c6e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

None
}

override def stop(): Unit = {
Contributor:

Get rid of stop? Seems like we would like to avoid relying on this method for correctness (since we need to handle abnormal termination). So I would just leave it out of the interface entirely.

SparkQA commented Mar 15, 2016

Test build #53137 has finished for PR 11625 at commit 1c82e56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

marmbrus (Contributor):

Thanks! Merging to master.

asfgit closed this in b5e3bd8 on Mar 15, 2016
zsxwing deleted the metadata-log branch on March 15, 2016, 04:00
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#11625 from zsxwing/metadata-log.