[SPARK-17159] [streaming]: optimise check for new files in FileInputDStream #14731

… filters and into filtering of the FileStatus instances returned in the results, so avoiding the need to create FileStatus instances for
- This doesn't add overhead to the filtering process; that's done as post-processing in FileSystem anyway. At worst it may result in larger lists being built up and returned.
- For every glob match, the code saves 2 RPC calls to the HDFS NN.
- The code saves 1-3 HTTP calls to S3 for the directory check (including a slow List call whenever the directory has children, as it doesn't exist as a blob any more).
- For the modtime check of every file, it saves an HTTP GET.

The whole modtime cache can be eliminated; it's a performance optimisation to avoid the overhead of the file checks, one that is no longer needed.
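The saving described above can be illustrated with a toy model. This is a minimal sketch, not the Spark or Hadoop code: `FileStatus`, `CountingFS`, `list_status` and `get_status` are hypothetical stand-ins invented for the example, and each counted call stands for one metadata round trip (an RPC to the NameNode, or an HTTP request to an object store).

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a filesystem status record."""
    path: str
    is_dir: bool
    mtime: int  # modification time in seconds


class CountingFS:
    """Toy filesystem that counts every metadata call made against it."""

    def __init__(self, entries):
        self._entries = {e.path: e for e in entries}
        self.calls = 0

    def list_status(self, pattern):
        # One call returns full status records for every match.
        self.calls += 1
        return [e for p, e in sorted(self._entries.items()) if fnmatch(p, pattern)]

    def get_status(self, path):
        # One call per path probed.
        self.calls += 1
        return self._entries[path]


def new_files_per_path(fs, pattern, threshold):
    """Filter on paths: every directory and modtime check is a fresh call."""
    result = []
    for path in [s.path for s in fs.list_status(pattern)]:
        if fs.get_status(path).is_dir:            # directory check
            continue
        if fs.get_status(path).mtime > threshold:  # modtime check
            result.append(path)
    return result


def new_files_from_listing(fs, pattern, threshold):
    """Filter on the status records the listing already returned."""
    return [s.path for s in fs.list_status(pattern)
            if not s.is_dir and s.mtime > threshold]
```

In this model the second function issues a single call regardless of how many entries match, which is also why a separate modtime cache has nothing left to save.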

… time costs 0 to evaluate, caching it actually consumes memory and the time for a lookup.

…sues. Also note that 1s granularity is the resolution from HDFS; other filesystems may have a different resolution. The only one I know that is worse is FAT16/FAT32, which is accurate to 2s, but nobody should be using that except on SD cards and USB sticks.

…carded listing; handle FNFE specially, add the docs

… existing/similar method. Add tests for the behaviour. Update docs with suggested fixes, and review/edit.

…es and made more robust)

…taken for a public method and so needing to declare a return type.

…which doesn't shortcut on a non-wildcard operation

…s of how HDFS doesn't update file length or modtime until close or a block boundary is reached.

…ate the mtime field; this is covered in the streaming section to emphasise why write + rename is the strategy for streaming in files in HDFS. That strategy does also work in object stores, though the rename operation is O(data)
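The write + rename strategy mentioned here can be sketched against a local filesystem. This is an illustration under stated assumptions, not HDFS client code: the `publish` helper and its parameters are hypothetical, the staging and monitored directories are assumed to be on the same filesystem, and `os.replace` plays the role of the rename.

```python
import os
import tempfile


def publish(data: bytes, dest_dir: str, name: str, staging_dir: str) -> str:
    """Write data outside the monitored directory, then rename it into place.

    A reader scanning dest_dir never sees a partially written file: the file
    only appears there once it is complete.
    """
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(dest_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        final_path = os.path.join(dest_dir, name)
        # Rename into the monitored directory (assumes same filesystem).
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the staging file if the write or the rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

On an object store the same pattern still hides partial writes, but as noted above the rename step is a copy whose cost grows with the size of the data.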

… tighten documentation

…a temp dir. Docs now refer reader to the Hadoop FS spec for any details about what object stores do

…and what is just punctuation

…store text and update slightly to make things a bit clearer. The more I learn about object stores, the less they resemble file systems.

Commits on Mar 21, 2017