[SPARK-14474][SQL]Move FileSource offset log into checkpointLocation #12247

Closed
wants to merge 5 commits into from

Member

zsxwing commented Apr 7, 2016

What changes were proposed in this pull request?

Now that we have a single location for storing checkpointed state, this PR propagates the checkpoint location into FileStreamSource so that we don't have one random log off on its own.

How was this patch tested?

test("metadataPath should be in checkpointLocation")
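
A minimal sketch of the kind of containment check such a test might perform, using only `java.nio.file`. The helper name `isUnder` and the paths in the usage note are hypothetical and are not taken from the patch itself.

```scala
import java.nio.file.Paths

object MetadataPathCheck {
  // Hypothetical helper: true when `child` lies strictly inside `parent`
  // once both paths are made absolute and normalized.
  def isUnder(child: String, parent: String): Boolean = {
    val p = Paths.get(parent).toAbsolutePath.normalize()
    val c = Paths.get(child).toAbsolutePath.normalize()
    // Path.startsWith compares whole path components, not raw string prefixes.
    c.startsWith(p) && c != p
  }
}
```

For example, `MetadataPathCheck.isUnder("/tmp/ckpt/sources/0", "/tmp/ckpt")` holds, while a path that escapes the directory through `..` does not.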

SparkQA commented Apr 8, 2016

Test build #55270 has finished for PR 12247 at commit d161f3a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

zsxwing commented Apr 8, 2016

retest this please

SparkQA commented Apr 8, 2016

Test build #55308 has finished for PR 12247 at commit d161f3a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

zsxwing commented Apr 8, 2016

cc @marmbrus

 */
def createSource(
    sourceId: Option[Long] = None,
    checkpointLocation: Option[String] = None): Source = {

Contributor

Why are these optional?

Member Author

sourceId and checkpointLocation are set via DataFrameWriter. When createSource is called from DataFrameReader, they are not known yet.

Contributor

Yeah, and we also don't really need to create a source there (we only need to know the schema). Perhaps getting the schema should be separated from getting the source (like we do in FileFormat).
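
A rough sketch of what that separation could look like. Every type and name here (`Field`, `Schema`, `SourceProvider`, `TextProvider`, the `location` parameter) is hypothetical and self-contained; none of them is Spark's actual API. The point is only that the schema can be requested without creating a source, so the creation method no longer needs optional arguments.

```scala
// Hypothetical types for illustration only; not Spark's actual classes.
final case class Field(name: String, dataType: String)
final case class Schema(fields: Seq[Field])

trait Source {
  def schema: Schema
}

trait SourceProvider {
  // Cheap call: usable on the reader side, where no checkpoint information exists yet.
  def sourceSchema(options: Map[String, String]): Schema

  // Called later, once the location is known, so it does not have to be an Option.
  def createSource(location: String, options: Map[String, String]): Source
}

object TextProvider extends SourceProvider {
  override def sourceSchema(options: Map[String, String]): Schema =
    Schema(Seq(Field("value", "string")))

  override def createSource(location: String, options: Map[String, String]): Source = {
    require(location.nonEmpty, "location must be provided")
    val resolved = sourceSchema(options)
    new Source { override val schema: Schema = resolved }
  }
}
```

With this shape the reader would call only `sourceSchema`, and the writer would call `createSource` after it has resolved where state should live.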

Member Author

zsxwing commented Apr 11, 2016

Updated

SparkQA commented Apr 11, 2016

Test build #55536 has finished for PR 12247 at commit 61fe406.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 11, 2016

Test build #55537 has finished for PR 12247 at commit 7a818a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def createSource(
    sqlContext: SQLContext,
    sourceId: Long,

Contributor

Why are we passing the sourceId instead of the location?

Member Author

I think some Sources may not need a location. Instead, they just need an id to distinguish themselves.

Contributor

I thought the goal was to have all the data in the same location. With this API everyone needs to duplicate the checkpoint location resolution logic.

Note that if you want a unique identifier the path also qualifies.
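
To illustrate the point, here is a small sketch in which the caller resolves one path per source, once, and hands each source its own path. The `<checkpointLocation>/sources/<index>` layout and the names `SourcePaths` and `metadataPaths` are assumptions made for this example, not something confirmed in this thread.

```scala
object SourcePaths {
  // Assumed layout for illustration: <checkpointLocation>/sources/<index>.
  // Resolution happens in one place, so individual sources never repeat it.
  def metadataPaths(checkpointLocation: String, numSources: Int): Seq[String] = {
    val root = checkpointLocation.stripSuffix("/")
    (0 until numSources).map(i => s"$root/sources/$i")
  }
}
```

Because each index is used exactly once, the resulting paths are distinct, so a source that only needs an identifier can use its path for that purpose.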

Member Author

Makes sense. I will update it.

SparkQA commented Apr 11, 2016

Test build #55539 has finished for PR 12247 at commit a761692.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 12, 2016

Test build #55548 has finished for PR 12247 at commit 4cb1608.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def createSource(
    sqlContext: SQLContext,
    metadataPath: String,

Member Author

This is called metadataPath to avoid confusion with checkpointLocation, since they are not the same path.

@marmbrus
Contributor

Thanks, merging to master.

@asfgit asfgit closed this in 6bf6921 Apr 12, 2016
@zsxwing zsxwing deleted the file-source-log-location branch April 12, 2016 17:50