[SPARK-17159] [streaming]: optimise check for new files in FileInputDStream #14731
Commits on Mar 21, 2017
-
SPARK-17159: move filtering of directories and files out of glob/list filters and into filtering of the FileStatus instances returned in the results, so avoiding the need to create FileStatus instances for
  - This doesn't add overhead to the filtering process; that's done as post-processing in FileSystem anyway. At worst it may result in larger lists being built up and returned.
  - For every glob match, the code saves 2 RPC calls to the HDFS NN.
  - The code saves 1-3 HTTP calls to S3 for the directory check (including a slow List call whenever the directory has children, as it doesn't exist as a blob any more).
  - For the modtime check of every file, it saves an HTTP GET.
The whole modtime cache can be eliminated; it's a performance optimisation to avoid the overhead of the file checks, one that is no longer needed.
-
[SPARK-17159] Remove the fileModTime cache. Now that the modification time costs 0 to evaluate, caching it actually consumes memory and the time for a lookup.
-
[SPARK-17159] inline FileStatus.getModificationTime; address style issues. Also note that 1s granularity is the resolution from HDFS; other filesystems may have a different resolution. The only one I know that is worse is FAT16/FAT32, which is accurate to 2s, but nobody should be using that except on SD cards and USB sticks.
-
[SPARK-17159] updates as discussed on PR: skip wildcards for non wildcarded listing; handle FNFE specially, add the docs
-
[SPARK-17159] move glob operation into SparkHadoopUtils, alongside an existing/similar method. Add tests for the behaviour. Update docs with suggested fixes, and review/edit.
-
[SPARK-17159] add directory rename test (taken from SPARK-7481 examples and made more robust)
-
[SPARK-17159] method nested inside a sparktest test closure being mistaken for a public method and so needing to declare a return type.
-
[SPARK-17159] File input dstream: revert to directory list operation which doesn't shortcut on a non-wildcard operation
-
[SPARK-17159] round out the file streaming text with the dirty details of how HDFS doesn't update file length or modtime until close or a block boundary is reached.
-
SPARK-17159 Chris Nauroth of HDFS team clarified which operations update the mtime field; this is covered in the streaming section to emphasise why write + rename is the strategy for streaming in files in HDFS. That strategy does also work in object stores, though the rename operation is O(data).
-
[SPARK-17159] rebase to master; verify new test still works; review & tighten documentation
-
[SPARK-17159] address comments, move to withTempDir for tests with a temp dir. Docs now refer reader to the Hadoop FS spec for any details about what object stores do
-
SPARK-17159 review of docs: quote paths to clearly show what is code and what is just punctuation
-
SPARK-17159 address Sean's review comments, and read over the object store text and update slightly to make things a bit clearer. The more I learn about object stores, the less they resemble file systems.
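
Two ideas recur in the commits above: publish a file by writing it somewhere the scanner ignores and then renaming it into place, and pick up new files by listing a directory once and filtering on the status returned with each entry rather than probing every path again. The sketch below illustrates both on a local filesystem in Python. It is not the Spark or Hadoop code from this PR: `publish`, `new_files`, the dot-prefixed temporary name, and the `min_mtime` parameter are hypothetical, and `os.scandir` is only a loose analogy for a listing that returns `FileStatus` objects.

```python
import os
import tempfile


def publish(directory, name, data):
    """Write data under a hidden temporary name, then rename it into place.

    A scanner that skips dot-files never sees the file half-written.
    On a local filesystem the rename is cheap; the commit log notes that
    on object stores the same step is O(data).
    """
    tmp_path = os.path.join(directory, "." + name + ".tmp")
    final_path = os.path.join(directory, name)
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, final_path)
    return final_path


def new_files(directory, min_mtime):
    """List the directory once and filter on each returned entry.

    Hidden files and subdirectories are dropped after the listing,
    using the entry objects instead of a separate lookup by path.
    """
    selected = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name.startswith("."):
                continue  # in-progress or hidden file
            if not entry.is_file():
                continue  # skip directories
            if entry.stat().st_mtime >= min_mtime:
                selected.append(entry.name)
    return sorted(selected)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        os.mkdir(os.path.join(d, "subdir"))
        publish(d, "part-0001.txt", b"hello\n")
        print(new_files(d, 0))
```

Filtering after the listing may build a larger intermediate list, which is the trade-off the first commit message accepts in exchange for fewer round trips.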