[SPARK-19715][Structured Streaming] Option to Strip Paths in FileSource #17120
-1, non binding. I understand the rationale for this, to aid migration from s3/s3n to s3a, but given the need is scheme independence, you should be using the full path name. Match only on name, and two files with the same name under different directories will be mistaken for the same file, even when they aren't. If this scenario arises then someone will end up fielding support calls about missing data, or worse, incorrect query results. If you use the full path, that problem goes away and the only parts ignored are the scheme and the filesystem/bucket name.
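The trade-off raised in this comment can be sketched in plain Scala. This is only an illustration under stated assumptions: `KeyingSketch`, `pathKey`, and `nameKey` are hypothetical helpers (not Spark APIs), the bucket and file names are made up, and `java.net.URI` stands in for whatever path handling the patch actually uses.

```scala
import java.net.URI

// Hypothetical helpers for illustration only; these are not Spark APIs.
object KeyingSketch {
  // Key on the path component of the URI: ignores scheme and bucket/authority.
  def pathKey(p: String): String = new URI(p).getPath

  // Key on the last path segment only: ignores scheme, bucket, and directories.
  def nameKey(p: String): String = {
    val path = new URI(p).getPath
    path.substring(path.lastIndexOf('/') + 1)
  }

  // Made-up example paths.
  val oldScheme = "s3n://bucket/a/b/file1"   // same file, old scheme
  val newScheme = "s3a://bucket/a/b/file1"   // same file, new scheme
  val otherDir  = "s3a://bucket/a/c/file1"   // different file, same name
}
```

With these definitions, `pathKey` treats `oldScheme` and `newScheme` as the same file while keeping `otherDir` distinct, whereas `nameKey` maps all three to `file1`, which is the collision the comment warns about.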
Test build #73691 has finished for PR 17120 at commit
@steveloughran thanks for the comments; your concern also looks reasonable to me. I'm open to both approaches. @marmbrus @zsxwing it'd be great if you could share some thoughts!
The use case here is when you have truly unique filenames (i.e. they contain a GUID). This is actually pretty common in my experience. We definitely shouldn't turn this on by default, but as implemented I think the semantics are pretty clear and this option is useful.
I know that's the current use case, but I'm thinking about future confusion, especially as the use case you espoused, "move from s3n to s3a within the same window", isn't likely to be that common in a running app, is it?
Note streams can be very long running, so this isn't about some short window. It could even be that I'm moving to a different bucket (but don't want to lose the exactly-once guarantees of a very long running stream). I agree the documentation should be explicit about the expectations of the filename for this parameter.
· "file:///dataset.txt"<br/>
· "s3://a/dataset.txt"<br/>
· "s3n://a/b/dataset.txt"<br/>
· "s3a://a/b/c/dataset.txt"<br/>
the indentation of a `<li>` does not look pretty, so I'm using a dot here
Thank you @marmbrus @steveloughran for the feedback. Added some explicit docs. Here's a screenshot of the affected section from the programming guide. Please take a look again.
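As a quick check of the documented behaviour, the four example paths from the docs snippet can be reduced to their file names. This is a minimal sketch, assuming the file name is simply the last segment of the URI path; `DocExampleSketch` and `fileName` are hypothetical names, not part of the patch.

```scala
import java.net.URI

// Hypothetical illustration; the patch itself may extract names differently.
object DocExampleSketch {
  // Last segment of the URI path component.
  def fileName(path: String): String = {
    val p = new URI(path).getPath
    p.substring(p.lastIndexOf('/') + 1)
  }

  // The four paths listed in the documentation snippet.
  val docPaths: Seq[String] = Seq(
    "file:///dataset.txt",
    "s3://a/dataset.txt",
    "s3n://a/b/dataset.txt",
    "s3a://a/b/c/dataset.txt"
  )

  // Distinct keys a name-only check would see for these paths.
  val distinctKeys: Set[String] = docPaths.map(fileName).toSet
}
```

Under that assumption all four paths collapse to the single key `dataset.txt`, which is why the docs need to spell out that file names must be unique when the option is on.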
Test build #73884 has finished for PR 17120 at commit
Test build #73885 has finished for PR 17120 at commit
map.add("file:///a/b/c/d", 5)
map.add("file:///a/b/c/e", 5)
assert(map.size == 2)
recommend `===` for better error reporting
sure, thanks!
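For readers without the full diff, the behaviour this test exercises can be sketched with a simplified stand-in. `SeenFilesSketch` below is hypothetical: it is not the `SeenFilesMap` under review, it omits age-based purging (`maxFileAgeMs`), and it assumes the name is the last segment of the URI path.

```scala
import java.net.URI
import scala.collection.mutable

// Simplified, hypothetical stand-in for the reviewed map; no purging by age.
class SeenFilesSketch(fileNameOnly: Boolean) {
  private val seen = mutable.HashMap.empty[String, Long]

  // Reduce a path to its map key: file name only, or the full path string.
  private def key(path: String): String =
    if (fileNameOnly) {
      val p = new URI(path).getPath
      p.substring(p.lastIndexOf('/') + 1)
    } else {
      path
    }

  def add(path: String, timestamp: Long): Unit = seen.update(key(path), timestamp)

  def isNewFile(path: String): Boolean = !seen.contains(key(path))

  def size: Int = seen.size
}
```

With `fileNameOnly = true`, adding `file:///a/b/c/d` and `file:///a/b/c/e` yields two entries, as in the test above, while a further path ending in `/d` under a different directory would not be treated as new.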
Test build #73900 has finished for PR 17120 at commit
Jenkins retest this please
Test build #74062 has finished for PR 17120 at commit
Looks good overall. Left some minor comments.
@@ -75,7 +77,7 @@ class FileStreamSource(

   /** A mapping from a file that we have processed to some timestamp it was last modified. */
   // Visible for testing and debugging in production.
-  val seenFiles = new SeenFilesMap(sourceOptions.maxFileAgeMs)
+  val seenFiles = new SeenFilesMap(sourceOptions.maxFileAgeMs, sourceOptions.fileNameOnly)
It's better to add a warning when `fileNameOnly` is true. How about
logWarning("fileNameOnly is enabled. Make sure your file names are unique (e.g., using UUID), otherwise, files using the same name will be considered as the same file and causes data lost")
added. thanks!
/**
 * Note when `fileNameOnly` is true, each entry would be (file name, timestamp) rather than
 * (full path, timestamp).
 */
def allEntries: Seq[(String, Timestamp)] = {
This method is not used. Could you just delete it?
deleted :)
# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
Test build #74236 has finished for PR 17120 at commit
if (fileNameOnly) {
  logWarning("'fileNameOnly' is enabled. Make sure your file names are unique (e.g. using " +
    "UUID), otherwise, files using the same name will be considered as the same file and causes" +
    " data lost")
nit: if I may, this message sounds a bit odd.
"files using the same name will be considered as the same file and causes data lost"
could we say
"files with the same name but under different paths will be considered the same and causes data lost"
updated -- thank you!
Test build #74245 has started for PR 17120 at commit
Jenkins retest this please
Test build #74260 has finished for PR 17120 at commit
LGTM. Merging to master.
What changes were proposed in this pull request?
Today, we compare the whole path when deciding if a file is new in the FileSource for structured streaming. However, this would cause false negatives in the case where the path has changed in a cosmetic way (e.g. changing `s3n` to `s3a`).

This patch adds an option `fileNameOnly` that causes the new file check to be based only on the filename (but still store the whole path in the log).

Usage
How was this patch tested?
Added a test case