[SPARK-14474][SQL]Move FileSource offset log into checkpointLocation #12247

zsxwing · 2016-04-07T23:31:38Z

What changes were proposed in this pull request?

Now that we have a single location for storing checkpointed state, this PR propagates the checkpoint location into FileStreamSource so that its offset log no longer lives off on its own.

How was this patch tested?

test("metadataPath should be in checkpointLocation")

SparkQA · 2016-04-08T00:52:42Z

Test build #55270 has finished for PR 12247 at commit d161f3a.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-04-08T03:33:48Z

retest this please

SparkQA · 2016-04-08T05:04:49Z

Test build #55308 has finished for PR 12247 at commit d161f3a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-04-08T17:01:26Z

cc @marmbrus

marmbrus · 2016-04-11T18:15:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala

+   */
+  def createSource(
+      sourceId: Option[Long] = None,
+      checkpointLocation: Option[String] = None): Source = {


Why are these optional?

sourceId and checkpointLocation are set via DataFrameWriter. When this one is called in DataFrameReader, we don't know them.

Yeah, and we also don't really need to create a source there (we only need to know the schema). Perhaps getting the schema should be separated from getting the source (like we do in FileFormat).
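
The separation suggested above can be sketched roughly as follows. This is a minimal illustration, not Spark's actual code: `Schema`, `Source`, `FileSourceSketch`, and `DataSourceSketch` are hypothetical stand-ins, and the single-column schema is invented for the example. The idea is that the read side asks only for the schema, while source creation takes a mandatory location because it is known by the time the query starts.

```scala
// Hypothetical, simplified types for illustration; not Spark's real classes.
final case class Schema(fieldNames: Seq[String])

trait Source {
  def schema: Schema
  def metadataPath: String
}

final class FileSourceSketch(val schema: Schema, val metadataPath: String) extends Source

final class DataSourceSketch(path: String) {
  // Read side: only the schema is needed, so no checkpoint information is required.
  def sourceSchema(): Schema = Schema(Seq("value"))

  // Query start: the metadata location is known here, so it is a plain
  // (non-optional) parameter.
  def createSource(metadataPath: String): Source =
    new FileSourceSketch(sourceSchema(), metadataPath)
}
```

With this split, neither parameter needs to be an `Option`, because the caller that lacks the checkpoint information never calls `createSource`.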

zsxwing · 2016-04-11T21:02:27Z

Updated

SparkQA · 2016-04-11T22:22:00Z

Test build #55536 has finished for PR 12247 at commit 61fe406.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-11T22:28:31Z

Test build #55537 has finished for PR 12247 at commit 7a818a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2016-04-11T22:57:04Z

sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

  def createSource(
      sqlContext: SQLContext,
+      sourceId: Long,


Why are we passing the sourceId instead of the location?

I think some Sources may not need a location. Instead, they just need an id to distinguish themselves.

I thought the goal was to have all the data in the same location. With this API everyone needs to duplicate the checkpoint location resolution logic.

Note that if you want a unique identifier the path also qualifies.

Makes sense. I will update it.

SparkQA · 2016-04-11T23:10:55Z

Test build #55539 has finished for PR 12247 at commit a761692.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-12T01:00:23Z

Test build #55548 has finished for PR 12247 at commit 4cb1608.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-04-12T17:39:03Z

sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

  def createSource(
      sqlContext: SQLContext,
+      metadataPath: String,


This is called metadataPath to avoid confusion with checkpointLocation, since they are not the same path.
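
To illustrate the distinction, here is a minimal sketch in which the caller resolves a per-source path under the checkpoint location once and hands only that path to the provider, so providers do not repeat the resolution logic. The names `CheckpointPaths`, `StreamSourceProviderSketch`, and `FileProviderSketch`, as well as the `sources/<index>` layout, are assumptions made for this example rather than details confirmed in this thread.

```scala
// Hypothetical helper; the "sources/<index>" layout is an assumption for illustration.
object CheckpointPaths {
  def metadataPathFor(checkpointLocation: String, sourceIndex: Int): String = {
    // Drop a trailing slash so the result does not contain "//".
    val root = checkpointLocation.stripSuffix("/")
    s"$root/sources/$sourceIndex"
  }
}

// Simplified stand-in for a provider interface that receives the resolved path.
trait StreamSourceProviderSketch {
  def createSource(metadataPath: String): String
}

object FileProviderSketch extends StreamSourceProviderSketch {
  // The file source would keep its offset log under the path it is handed.
  def createSource(metadataPath: String): String =
    s"FileStreamSource(log at $metadataPath)"
}
```

Because each source receives a distinct path, the path also serves as a unique identifier, which covers the use case originally intended for a numeric source id.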

marmbrus · 2016-04-12T17:45:44Z

Thanks, merging to master.

Move FileSource offset log into checkpointLocation

d161f3a

marmbrus reviewed Apr 11, 2016

zsxwing added 2 commits April 11, 2016 13:52

Add DataSource.sourceSchema

61fe406

Remove duplicated codes and add comments

7a818a9

Make FileStreamSource.metadataPath public

a761692

marmbrus reviewed Apr 11, 2016

Add metadataPath to StreamSourceProvider

4cb1608

zsxwing reviewed Apr 12, 2016

asfgit closed this in 6bf6921 Apr 12, 2016

zsxwing deleted the file-source-log-location branch April 12, 2016 17:50
