[SPARK-19715][Structured Streaming] Option to Strip Paths in FileSource #17120

lw-lin · 2017-03-01T13:48:56Z

What changes were proposed in this pull request?

Today, we compare the whole path when deciding whether a file is new in the FileSource for structured streaming. However, this causes false negatives when the path changes in a purely cosmetic way (e.g. switching from s3n to s3a).

This patch adds an option, fileNameOnly, that bases the new-file check on the filename only (the whole path is still stored in the log).
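For readers skimming the thread, here is a minimal sketch of the idea, not the code in this patch. The class name `SeenFilesSketch` and its methods are hypothetical, and it uses `java.net.URI` rather than Hadoop's `Path` so that it runs without extra dependencies.

```scala
import java.net.URI
import scala.collection.mutable

// Hypothetical stand-in for the source's seen-files bookkeeping (illustration only).
class SeenFilesSketch(fileNameOnly: Boolean) {
  private val seen = mutable.HashMap.empty[String, Long]

  // Key used for the "is this file new?" check.
  private def key(path: String): String =
    if (fileNameOnly) {
      val p = new URI(path).getPath        // path part, without scheme or bucket
      p.substring(p.lastIndexOf('/') + 1)  // keep the last path segment only
    } else {
      path                                 // compare the whole string
    }

  def add(path: String, timestamp: Long): Unit = seen(key(path)) = timestamp

  def isNewFile(path: String): Boolean = !seen.contains(key(path))
}

object SeenFilesSketchDemo {
  def main(args: Array[String]): Unit = {
    val byName = new SeenFilesSketch(fileNameOnly = true)
    byName.add("s3n://bucket/dir1/dir2/part-0001.txt", 5L)
    // Same file name under a different scheme: treated as already seen.
    println(byName.isNewFile("s3a://bucket/dir1/dir2/part-0001.txt"))

    val byPath = new SeenFilesSketch(fileNameOnly = false)
    byPath.add("s3n://bucket/dir1/dir2/part-0001.txt", 5L)
    // Whole-path comparison: the scheme change makes the file look new.
    println(byPath.isNewFile("s3a://bucket/dir1/dir2/part-0001.txt"))
  }
}
```

With the flag on, the s3n-to-s3a switch no longer causes a reprocess; with it off, the same file is picked up again.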

Usage

spark
  .readStream
  .option("fileNameOnly", true)
  .text("s3n://bucket/dir1/dir2")
  .writeStream
  ...

How was this patch tested?

Added a test case

steveloughran · 2017-03-01T16:15:10Z

-1, non binding

I understand the rationale for this, to aid migration from s3/s3n to s3a, but given that the need is scheme independence, you should be using the full path name from `Path.toUri().getPath()` instead of `Path.getName()`, which means only the filename (the last entry in the path element list) is checked.

Match only on name, and the two files

s3a://bucket/incoming/dataset.avro
s3a://bucket/2015/12/dataset.avro

will be mistaken for the same file, even when they aren't. If this scenario arises then someone will end up fielding support calls about missing data, or worse, incorrect query results.

If you use the full path, that problem goes away, and the only parts ignored in the comparison are the scheme and the filesystem/bucket name.
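To make the trade-off concrete, here is a small sketch comparing the three possible keys. The object and function names are hypothetical, and `java.net.URI` stands in for Hadoop's `Path` so that the snippet is self-contained.

```scala
import java.net.URI

// Hypothetical helpers, one per comparison strategy discussed above.
object PathKeys {
  // Whole URI string: scheme, bucket and path all take part in the comparison.
  def fullUri(p: String): String = p

  // URI path only: the scheme and the authority (bucket) are dropped.
  def pathOnly(p: String): String = new URI(p).getPath

  // Last path segment only.
  def nameOnly(p: String): String = {
    val path = new URI(p).getPath
    path.substring(path.lastIndexOf('/') + 1)
  }
}

object PathKeysDemo {
  def main(args: Array[String]): Unit = {
    import PathKeys._
    val a = "s3a://bucket/incoming/dataset.avro"
    val b = "s3a://bucket/2015/12/dataset.avro"
    val aOld = "s3n://bucket/incoming/dataset.avro"

    println(nameOnly(a) == nameOnly(b))    // the two different files collide on name
    println(pathOnly(a) == pathOnly(b))    // but stay distinct on the URI path
    println(pathOnly(a) == pathOnly(aOld)) // and a scheme change alone still matches
    println(fullUri(a) == fullUri(aOld))   // whole-URI comparison sees a new file
  }
}
```

Keying on the URI path keeps the s3n-to-s3a migration working while keeping files in different directories apart; keying on the name additionally tolerates directory or bucket moves, at the cost of the collision shown here.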

SparkQA · 2017-03-01T18:11:43Z

Test build #73691 has finished for PR 17120 at commit aeb10d1.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class SeenFilesMap(maxAgeMs: Long, fileNameOnly: Boolean)

lw-lin · 2017-03-01T23:42:18Z

@steveloughran thanks for the comments; your concern looks reasonable to me. I'm open to both approaches.

@marmbrus @zsxwing it'd be great if you could share some thoughts!

marmbrus · 2017-03-03T19:05:33Z

The use case here is when you have truly unique filenames (e.g. they contain a GUID). This is actually pretty common in my experience. We definitely shouldn't turn this on by default, but as implemented I think the semantics are pretty clear and this option is useful.
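As an illustration of what "truly unique filenames" could look like on the producer side, here is a minimal sketch. The helper `uniqueFileName` and the naming pattern are assumptions made for the example, not something this PR adds.

```scala
import java.util.UUID

object UniqueNames {
  // Hypothetical helper: builds a name such as "events-<uuid>.json".
  // A random UUID in the name makes accidental collisions across directories very unlikely.
  def uniqueFileName(prefix: String, extension: String): String =
    s"$prefix-${UUID.randomUUID()}.$extension"
}

object UniqueNamesDemo {
  def main(args: Array[String]): Unit = {
    // Two calls yield two different names, even with the same prefix and extension.
    println(UniqueNames.uniqueFileName("events", "json"))
    println(UniqueNames.uniqueFileName("events", "json"))
  }
}
```

Files named this way can move between schemes, buckets or directories without two distinct files ever sharing a name.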

steveloughran · 2017-03-03T21:52:27Z

I know that's the current use case, but I'm thinking about future confusion, especially as the use case you espoused, "move from s3n to s3a within the same window" isn't likely to be that common in a running app, is it?
At the very least, the documentation needs to be explicit about what works and what doesn't here.

marmbrus · 2017-03-03T22:00:57Z

Note that streams can be very long running, so this isn't about some short window. It could even be that I'm moving to a different bucket (but don't want to lose the exactly-once guarantees of a very long-running stream).

I agree the documentation should be explicit about the expectations of the filename for this parameter.

lw-lin · 2017-03-04T02:38:59Z

docs/structured-streaming-programming-guide.md

+        · "file:///dataset.txt"<br/>
+        · "s3://a/dataset.txt"<br/>
+        · "s3n://a/b/dataset.txt"<br/>
+        · "s3a://a/b/c/dataset.txt"<br/>


The indentation of a <li> does not look pretty, so I'm using a dot here.

lw-lin · 2017-03-04T02:41:19Z

Thank you @marmbrus @steveloughran for the feedback. Added some explicit docs. Here's a screenshot of the affected section from the programming guide:

Please take a look again.

SparkQA · 2017-03-04T04:25:07Z

Test build #73884 has finished for PR 17120 at commit c59f35f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-04T04:30:09Z

Test build #73885 has finished for PR 17120 at commit 2354ae6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

steveloughran · 2017-03-04T11:48:53Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala

+
+    map.add("file:///a/b/c/d", 5)
+    map.add("file:///a/b/c/e", 5)
+    assert(map.size == 2)


recommend === for better error reporting

sure, thanks!

SparkQA · 2017-03-04T15:03:30Z

Test build #73900 has finished for PR 17120 at commit f9e525e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

lw-lin · 2017-03-07T02:44:23Z

Jenkins retest this please

SparkQA · 2017-03-07T04:41:21Z

Test build #74062 has finished for PR 17120 at commit f9e525e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing

Looks good overall. Left some minor comments.

zsxwing · 2017-03-08T23:03:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala

@@ -75,7 +77,7 @@ class FileStreamSource(

  /** A mapping from a file that we have processed to some timestamp it was last modified. */
  // Visible for testing and debugging in production.
-  val seenFiles = new SeenFilesMap(sourceOptions.maxFileAgeMs)
+  val seenFiles = new SeenFilesMap(sourceOptions.maxFileAgeMs, sourceOptions.fileNameOnly)


It's better to add a warning when fileNameOnly is true. How about

logWarning("fileNameOnly is enabled. Make sure your file names are unique (e.g., using UUID), otherwise, files using the same name will be considered as the same file and causes data lost")

added. thanks!

zsxwing · 2017-03-08T23:09:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala

+    /**
+     * Note when `fileNameOnly` is true, each entry would be (file name, timestamp) rather than
+     * (full path, timestamp).
+     */
    def allEntries: Seq[(String, Timestamp)] = {


This method is not used. Could you just delete it?

SparkQA · 2017-03-09T05:34:14Z

Test build #74236 has finished for PR 17120 at commit 7da2a9c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-03-09T06:32:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala

+  if (fileNameOnly) {
+    logWarning("'fileNameOnly' is enabled. Make sure your file names are unique (e.g. using " +
+      "UUID), otherwise, files using the same name will be considered as the same file and causes" +
+      " data lost")


nit: if I may, this message sounds a bit odd.

files using the same name will be considered as the same file and causes data lost

could we say
files with the same name but under different paths will be considered the same and causes data lost

updated -- thank you!

SparkQA · 2017-03-09T07:02:33Z

Test build #74245 has started for PR 17120 at commit aab7554.

lw-lin · 2017-03-09T09:19:14Z

Jenkins retest this please

SparkQA · 2017-03-09T11:23:31Z

Test build #74260 has finished for PR 17120 at commit aab7554.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2017-03-09T19:02:17Z

LGTM. Merging to master.

Add support for fileNameOnly

aeb10d1

Explicit docs about the expectations of the filename

2354ae6

lw-lin force-pushed the filename-only branch from c59f35f to 2354ae6 on March 4, 2017 02:38

steveloughran reviewed Mar 4, 2017

Address comments from Steve

f9e525e

zsxwing requested changes Mar 8, 2017

lw-lin added 2 commits March 9, 2017 11:16

Merge remote-tracking branch 'apache/master' into filename-only

fd131f5

# Conflicts:
#   sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala

Address @zsxwing's comments

7da2a9c

felixcheung reviewed Mar 9, 2017

Adjust wording as per Felix's comments

aab7554

asfgit closed this in 40da4d1 Mar 9, 2017

lw-lin deleted the filename-only branch March 10, 2017 02:30
