[SPARK-17159] [streaming]: optimise check for new files in FileInputDStream #14731

Closed

Commits on Mar 21, 2017

  1. SPARK-17159: move filtering of directories and files out of glob/list…

    … filters and into filtering of the FileStatus instances returned in the results, so avoiding the need to create FileStatus instances for
    
    - This doesn't add overhead to the filtering process; that's done as post-processing in FileSystem anyway. At worst it may result in larger lists being built up and returned.
    - For every glob match, the code saves 2 RPC calls to the HDFS NN.
    - The code saves 1-3 HTTP calls to S3 for the directory check (including a slow List call whenever the directory has children, as it doesn't exist as a blob any more).
    - For the modtime check of every file, it saves an HTTP GET.
    
    The whole modtime cache can be eliminated; it's a performance optimisation to avoid the overhead of the file checks, one that is no longer needed.
    steveloughran committed Mar 21, 2017
    19d47cd
  2. [SPARK-17159] Remove the fileModTime cache. Now that the modification…

    … time costs 0 to evaluate, caching it actually consumes memory and the time for a lookup.
    steveloughran committed Mar 21, 2017
    6f1ea36
  3. [SPARK-17159] inline FileStatus.getModificationTime; address style is…

    …sues. Also note that 1s granularity is the resolution from HDFS; other filesystems may have a different resolution. The only one I know that is worse is FAT16/FAT32, which is accurate to 2s, but nobody should be using that except on SD cards and USB sticks.
    steveloughran committed Mar 21, 2017
    c2a4382
  4. [SPARK-17159] updates as discussed on PR: skip wildcards for non wild…

    …carded listing; handle FNFE specially, add the docs
    steveloughran committed Mar 21, 2017
    0c88093
  5. [SPARK-17159] move glob operation into SparkHadoopUtils, alongside an…

    … existing/similar method. Add tests for the behaviour. Update docs with suggested fixes, and review/edit.
    steveloughran committed Mar 21, 2017
    36c5881
  6. a69d1b6
  7. f8c9521
  8. [SPARK-17159] method nested inside a sparktest test closure being mis…

    …taken for a public method and so needing to declare a return type.
    steveloughran committed Mar 21, 2017
    4f01721
  9. 1b2027c
  10. [SPARK-17159] File input dstream: revert to directory list operation …

    …which doesn't shortcut on a non-wildcard operation
    steveloughran committed Mar 21, 2017
    921c5c2
  11. [SPARK-17159] round out the file streaming text with the dirty detail…

    …s of how HDFS doesn't update file length or modtime until close or a block boundary is reached.
    steveloughran committed Mar 21, 2017
    bff0d13
  12. SPARK-17159 Chris Nauroth of HDFS team clarified which operations upd…

    …ate the mtime field; this is covered in the streaming section to emphasise why write + rename is the strategy for streaming in files in HDFS. That strategy does also work in object stores, though the rename operation is O(data)
    steveloughran committed Mar 21, 2017
    a67902b
  13. 9df7ff4
  14. ac47d42
  15. [SPARK-17159] address comments, move to withTempDir for tests with …

    …a temp dir. Docs now refer the reader to the Hadoop FS spec for any details about what object stores do
    steveloughran committed Mar 21, 2017
    f38a985
  16. 49519cc
  17. SPARK-17159 review of docs: quote paths to clearly show what is code …

    …and what is just punctuation
    steveloughran committed Mar 21, 2017
    e03d189
  18. SPARK-17159 address Sean's review comments, and read over the object …

    …store text and update slightly to make things a bit clearer. The more I learn about object stores, the less they resemble file systems.
    steveloughran committed Mar 21, 2017
    a3aaf26
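
The idea behind the first two commits (filter the status records that a listing already returns, rather than probing each path again for its type and modification time) can be illustrated with a small sketch. This is not the code from the PR, which is Scala against the Hadoop FileSystem API; it is a Python stand-in under stated assumptions. The `FileStatus` dataclass, `select_new_files`, the `threshold_ms` parameter, and the "at or after the threshold" comparison are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a status record returned by a listing call."""
    path: str
    is_directory: bool
    modification_time: int  # milliseconds since the epoch


def select_new_files(statuses: Iterable[FileStatus],
                     threshold_ms: int,
                     path_filter: Callable[[str], bool]) -> List[FileStatus]:
    """Pick new files using only fields already present in each record.

    No further filesystem calls are made here: the directory check and the
    modification-time check both read data the listing already supplied,
    which is why no separate modification-time cache is needed.
    """
    selected = []
    for status in statuses:
        if status.is_directory:
            continue  # directory check costs nothing: the flag is in the record
        if status.modification_time < threshold_ms:
            continue  # too old; the comparison is a field read, not a remote call
        if not path_filter(status.path):
            continue  # user-supplied filter applied after the listing
        selected.append(status)
    return selected


# Example: one listing result, filtered in memory.
listing = [
    FileStatus("/in/a.txt", False, 2000),
    FileStatus("/in/sub", True, 3000),
    FileStatus("/in/old.txt", False, 500),
    FileStatus("/in/.tmp", False, 2500),
]
fresh = select_new_files(
    listing,
    threshold_ms=1000,
    path_filter=lambda p: not p.rsplit("/", 1)[-1].startswith("."),
)
```

The trade-off matches the one noted in the first commit message: the listing may return a larger list than a filtered glob would, but every later decision is made on data already in memory.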