
[SPARK-17159] [streaming]: optimise check for new files in FileInputDStream #14731

Closed · wants to merge 18 commits into base: master

Conversation

6 participants
steveloughran (Contributor) commented Aug 20, 2016

What changes were proposed in this pull request?

This PR optimises the filesystem metadata reads in FileInputDStream by moving the filters previously passed to FileSystem.globStatus and FileSystem.listStatus into post-filtering of the FileStatus instances returned in the results, avoiding the need to create extra FileStatus instances within the FileSystem operations. (A before/after sketch follows the list below.)

  • This doesn't add overhead to the filtering process; that's done as post-processing in the FileSystem glob/list operations anyway.
  • At worst it may result in larger lists being built up and returned.
  • For every glob match of a file, the code saves 1 RPC call to the HDFS NN, or 1 GET against S3.
  • For every glob match of a directory, the code saves 1 RPC call and 2-3 HTTP calls to S3 for the directory check (including a slow LIST call whenever the directory has children, as it no longer exists as a blob).
  • For the modtime check of every file, it saves a Hadoop RPC call and, against all object stores which don't implement any client-side cache, an HTTP GET.
  • By entirely eliminating all getFileStatus() calls on the listed files, it should reduce the risk of AWS S3 throttling the HTTP requests, as it does when too many requests are made to parts of a single S3 bucket.
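
A minimal Scala sketch of the before/after pattern, for illustration only; `pathMatches` and `modTimeInWindow` are hypothetical stand-ins for the real FileInputDStream filter logic, not the PR's actual code:

  import org.apache.hadoop.fs.{FileStatus, FileSystem, Path, PathFilter}

  // Before: filter on Path inside the glob, then call getFileStatus()
  // per match to read its modification time (one RPC / HTTP GET each).
  def scanBefore(fs: FileSystem, pattern: Path,
                 pathMatches: Path => Boolean,
                 modTimeInWindow: Long => Boolean): Seq[Path] = {
    val filter = new PathFilter {
      override def accept(p: Path): Boolean = pathMatches(p)
    }
    Option(fs.globStatus(pattern, filter)).getOrElse(Array.empty[FileStatus])
      .toSeq
      .map(_.getPath)
      .filter(p => modTimeInWindow(fs.getFileStatus(p).getModificationTime))
  }

  // After: glob unfiltered, then filter the FileStatus entries the listing
  // already returned; no extra metadata call per file.
  def scanAfter(fs: FileSystem, pattern: Path,
                pathMatches: Path => Boolean,
                modTimeInWindow: Long => Boolean): Seq[Path] = {
    Option(fs.globStatus(pattern)).getOrElse(Array.empty[FileStatus])
      .toSeq
      .filter(st => pathMatches(st.getPath) &&
                    modTimeInWindow(st.getModificationTime))
      .map(_.getPath)
  }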

How was this patch tested?

Running the Spark streaming tests as a regression suite. In the SPARK-7481 cloud code, I could add a test against S3 which prints to stdout the exact number of HTTP requests made to S3 before and after the patch, so as to validate the speedup. (The S3A metrics in Hadoop 2.8+ are accessible at the API level, but only via a new API added in 2.8; using it would stop that proposed module building against Hadoop 2.7. Logging and manual assessment is the only cross-version strategy.)

SparkQA commented Aug 20, 2016

Test build #64140 has finished for PR 14731 at commit 738c51b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
srowen (Member) commented Aug 20, 2016

LGTM. Does this sort of change make sense elsewhere where PathFilter is used? I glanced at the others and it looked like a wash in other cases.

steveloughran (Contributor) commented Aug 20, 2016

I'm going to scan through and tune them elsewhere; really I'm going by uses of the listFiles calls.

There's actually no significant use elsewhere that I can see; just a couple of uses which filter on filename, so there is no cost penalty.

  • SparkHadoopUtil.listLeafStatuses() does implement its own directory recursion to find files; FileSystem.listFiles(path, true) does that already, and on S3A will do a flat scan that is O(files/5000), with no directory overhead at all (a sketch follows this list).
  • Otherwise, globStatus() can be pretty slow against object stores, but the fix there isn't in the client code; it means someone needs to implement HADOOP-13371, an S3A globber using a bulk listObjects call rather than a recursive directory scan; more specifically, an implementation scalable to production datasets.
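
A minimal sketch of that listFiles() pattern (listFiles and its RemoteIterator are real Hadoop APIs; the helper name is made up):

  import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}

  // FileSystem.listFiles(path, recursive = true) returns a RemoteIterator;
  // on S3A it is served by flat paged LIST calls (roughly one per 5000
  // objects) rather than a per-directory recursive treewalk.
  def listAllFiles(fs: FileSystem, root: Path): Seq[LocatedFileStatus] = {
    val it = fs.listFiles(root, true)
    val buf = scala.collection.mutable.ArrayBuffer[LocatedFileStatus]()
    while (it.hasNext) buf += it.next()
    buf.toSeq
  }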

Returning to this patch, should I cut out the caching? I think it is superfluous.

  // Read-through cache of file mod times, used to speed up mod time lookups
  @transient private var fileToModTime = new TimeStampedHashMap[String, Long](true)
srowen (Member) commented Aug 20, 2016

Why is the caching superfluous -- because no file is evaluated more than once here?

steveloughran (Contributor) commented Aug 20, 2016

To be precise: the caching of file modification times is superfluous. It's there to avoid the cost of executing getFileStatus() on previously scanned files. Once you use the FileStatus returned in a listing, you aren't calling getFileStatus(), hence there is no need to cache.
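
A sketch of the point, with a plain mutable.Map standing in for Spark's TimeStampedHashMap:

  import scala.collection.mutable
  import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

  // Before: a cache made sense, because each miss cost a getFileStatus()
  // call: a NameNode RPC, or an HTTP GET against an object store.
  class CachedModTimes(fs: FileSystem) {
    private val fileToModTime = mutable.Map[String, Long]()
    def modTime(path: Path): Long =
      fileToModTime.getOrElseUpdate(path.toString,
        fs.getFileStatus(path).getModificationTime)
  }

  // After: the FileStatus from the listing already carries the mod time,
  // so reading it is a field access and there is nothing worth caching.
  def modTime(status: FileStatus): Long = status.getModificationTime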

srowen (Member) commented Aug 20, 2016

Ah right, you already have the modification time for free. Sounds good, remove the caching.

SparkQA commented Aug 20, 2016

Test build #64142 has finished for PR 14731 at commit 6e8ace0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Aug 20, 2016

Test build #64156 has finished for PR 14731 at commit b08e3c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
srowen (Member) commented Aug 23, 2016

This is ready to go, right @steveloughran? LGTM

steveloughran (Contributor) commented Aug 23, 2016

LGTM. I was trying to see if there was a way to create a good test here by triggering the takes-too-long codepath and having a counter, but there's no obvious way to do that deterministically. I am testing this against S3 in the spark-cloud module I'm writing; I can look at the printed counts of getFileStatus before/after the patch to see the difference, but the actual (testable) metrics are only accessible via a new API in the forthcoming Hadoop 2.8 release.

TL;DR: no easy test, so there's nothing left to do

steveloughran (Contributor) commented Aug 23, 2016

Actually, I've just noticed that DStream behaviour isn't in sync with the streaming programming guide, which says "(files written in nested directories not supported)". That is: SPARK-14796 didn't patch the docs.

It may as well be fixed in this patch. How about, in the bullet points underneath:

  • Wildcards may be used to specify a set of directories to scan for new files, for example hdfs://nn1:8050/users/alice/logs/2016-*/*.gz
  • New directories and their contents will be discovered as they arrive

Special points for object stores:

  • Wildcard lookup may be very slow with some object stores.
  • Directory rename is not atomic; if a directory is renamed into the streaming source, then the files within may only be discovered and processed across multiple streaming windows.

There's another optimisation: use the SparkHadoopUtil.isGlobPath() predicate to recognise when the dir path isn't a wildcard, in which case just do a simple listFiles() (a sketch follows). Until that shortcutting is done automatically in the Hadoop FS implementation, Spark can do it on its side. As the listFiles() call was what was used before SPARK-14796, it has to be compatible, else SPARK-14796 has broken things.
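
A sketch of that shortcut; the wildcard check mirrors the one in SparkHadoopUtil, and the method name is illustrative:

  import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

  // Skip glob expansion entirely when the path contains no wildcard
  // characters; a plain listing is a single metadata operation.
  def isGlobPath(path: Path): Boolean =
    path.toString.exists("{}[]*?\\".toSet.contains)

  def listMatches(fs: FileSystem, dirPath: Path): Array[FileStatus] =
    if (isGlobPath(dirPath)) {
      Option(fs.globStatus(dirPath)).getOrElse(Array.empty[FileStatus])
    } else {
      fs.listStatus(dirPath)
    }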

Finally, any exception in the scan is caught and triggers a log at warning level and a reset... It looks to me that this would include the FNFE raised by the directory not existing. I think a better message can be displayed there and the reset() operation skipped; that's not going to solve the problem in the filesystem.
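
As a sketch of that suggestion, with hypothetical `scan` and `reset` hooks standing in for the DStream internals:

  import java.io.FileNotFoundException
  import org.apache.hadoop.fs.{FileStatus, Path}

  def safeScan(dir: Path)(scan: Path => Array[FileStatus],
                          reset: () => Unit): Array[FileStatus] =
    try {
      scan(dir)
    } catch {
      case _: FileNotFoundException =>
        // Concise message, no stack trace, and no reset: resetting the
        // stream state won't make the directory exist.
        System.err.println(s"Streaming source does not exist: $dir")
        Array.empty[FileStatus]
      case e: Exception =>
        // Everything else keeps the existing warn-and-reset behaviour.
        System.err.println(s"Error scanning $dir: $e")
        reset()
        Array.empty[FileStatus]
    }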

steveloughran (Contributor) commented Aug 23, 2016

I've now done the s3a streaming test/example.

This uses a pattern of s3a/path/sub* as the directory path; it then creates a file in a directory and renames the dir to match the pattern, verifying that the file was found in the allocated time period.

https://gist.github.com/steveloughran/c8b39a7b87a9bd63d7a383bda8687e7e

Notably, the scan of the empty dir took 150ms; the time jumps up to 500ms once there are two entries under the tree, one dir and one file.

Summary stats show 72 getFileStatus calls at the FS API level, mapping to 140 HEAD calls and 88 LIST operations, on Hadoop branch-2:

 S3AFileSystem{uri=s3a://stevel-ireland-new, workingDir=s3a://steve-ireland-new/user/stevel, inputPolicy=sequential, partSize=104857600, enableMultiObjectsDelete=true, maxKeys=5000, readAhead=65536, blockSize=1048576, multiPartThreshold=2147483647, statistics {292 bytes read, 292 bytes written, 101 read ops, 0 large read ops, 11 write ops}, 
 metrics {{Context=S3AFileSystem}
{FileSystemId=343b706a-c238-4d71-9ed8-8083601ac28a-hwdev-steve-ireland-new}
{fsURI=s3a://hwdev-steve-ireland-new}
{files_created=1}
{files_copied=1}
{files_copied_bytes=292}
{files_deleted=1}
{directories_created=3}
{directories_deleted=0}
{ignored_errors=2}
{op_copy_from_local_file=0}
{op_exists=1}
{op_get_file_status=72}
{op_glob_status=16}
{op_is_directory=0}
{op_is_file=0}
{op_list_files=0}
{op_list_located_status=0}
{op_list_status=27}
{op_mkdirs=2}
{op_rename=1}
{object_copy_requests=0}
{object_delete_requests=3}
{object_list_requests=88}
{object_continue_list_requests=0}
{object_metadata_requests=140}
{object_multipart_aborted=0}
{object_put_bytes=292}
{object_put_requests=4}
{stream_read_fully_operations=0}
{stream_bytes_skipped_on_seek=0}
{stream_bytes_backwards_on_seek=0}
{stream_bytes_read=292}
{streamOpened=1}
{stream_backward_seek_pperations=0}
{stream_read_operations_incomplete=0}
{stream_bytes_discarded_in_abort=0}
{stream_close_operations=1}
{stream_read_operations=1}
{stream_aborted=0}
{stream_forward_seek_operations=0}
{streamClosed=1}
{stream_seek_operations=0}
{stream_bytes_read_in_close=0}
{stream_read_exceptions=0} }}

I'm going to do a test run with the modification here and see what it does to listing and status

steveloughran (Contributor) commented Aug 23, 2016

  1. Updated the code to bypass the glob routine when there is no wildcard; this bypasses something fairly inefficient.
  2. Reporting FNFE on that base dir differently; skip the stack trace (maybe log at a lower level?).
  3. Updated the docs with a special list of blobstore best practices.

It's a bit hard to get the phrasing of what the wildcard does right; it needs careful review.

Tested using my S3 streaming test, which did use a * in the wildcard. All works, but there is no improvement in speed on what is a fairly unrealistic structure. The time to recursively list object stores remotely is tangibly slow. Maybe that should go in the text too: "it can take seconds to scan object stores for new data, with the time proportional to directory depth and the number of files in a directory; shallow and wide directory trees are faster".

SparkQA commented Aug 23, 2016

Test build #64296 has finished for PR 14731 at commit 79b57a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
steveloughran (Contributor) commented Aug 23, 2016

The logic has got complex enough that it merits unit tests. Pulling it into SparkHadoopUtil itself and writing tests for the possible cases: simple path, glob matching one file, glob matching more than one, glob matching nothing, file not found.

steveloughran (Contributor) commented Aug 24, 2016

Having looked at the source code, FileSystem.globStatus() uses glob patterns, which are not the same as POSIX regexps; org.apache.hadoop.fs.GlobPattern does the conversion.

For the docs, I'll just use a wildcard * in the example, rather than try anything more sophisticated.
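
A quick demonstration of the distinction; GlobPattern is the real Hadoop class, the inputs are made up:

  import org.apache.hadoop.fs.GlobPattern

  // Hadoop glob syntax: '*' matches a run of characters, '?' a single
  // character, {a,b} is alternation, [abc] a character class. It is not
  // a POSIX regexp: '.' and '+' have no special meaning.
  val g = new GlobPattern("2016-*")
  assert(g.matches("2016-08-20"))
  assert(!g.matches("2017-01-01"))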

SparkQA commented Aug 24, 2016

Test build #64368 has finished for PR 14731 at commit b63abfe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Aug 26, 2016

Test build #64486 has finished for PR 14731 at commit 9bc0ea9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Aug 26, 2016

Test build #64488 has finished for PR 14731 at commit fe40bd2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Aug 27, 2016

Test build #64534 has finished for PR 14731 at commit 4134620.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
SparkQA commented Aug 30, 2016

Test build #64662 has finished for PR 14731 at commit b60f175.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
steveloughran (Contributor) commented Sep 1, 2016

The latest patch pulls out the shortcutting of the globStatus call when there are no wildcard characters in the path; closer to the original patch.

uncleGen (Contributor) commented Mar 3, 2017

@srowen Waiting for your final OK

steveloughran (Contributor) commented Mar 10, 2017

The Hadoop FS spec has now been updated to declare exactly what HDFS does w.r.t. timestamps, and to warn that what other filesystems and object stores do is an implementation- and installation-specific feature: filesystem.md

That is the documentation update associated with this one; some of the content there was originally here, but it moved over to the Hadoop docs, for the HDFS team to take the blame for when it changes.

steveloughran (Contributor) commented Mar 20, 2017

Any more comments?

@srowen

I have to start from scratch every time I review this ... so this looks like it does more than just optimize a check for new files. It adds docs too. I don't know if the examples are essential. The extra info about how streaming works could be useful but isn't that separate? It's easier to get in small directed changes. This one has been going on for months and I know that's not in anyone's interest.

* (originally from `SqlTestUtils`.)
* @todo Probably this method should be moved to a more general place
*/
protected def withTempDir(f: File => Unit): Unit = {

srowen (Member) commented Mar 21, 2017

We don't already have this defined and available elsewhere?

steveloughran (Contributor) commented Mar 21, 2017

Yes, but in a module that isn't the one where these tests were, so it'd need more dependency logic or pulling it up into a common module, which, if done properly, makes for a big diff.

srowen (Member) commented Mar 21, 2017

I see, it's only otherwise defined in the SQL test utils class. Well, something we could unify one day; maybe not such a big deal here now.

@@ -27,7 +27,8 @@ import scala.collection.JavaConverters._
import scala.collection.mutable
import com.google.common.io.Files
import org.apache.hadoop.fs.Path
import org.apache.commons.io.IOUtils

srowen (Member) commented Mar 21, 2017

Just use Files.write?

steveloughran (Contributor) commented Mar 21, 2017

I'll look at that; I think I went with IOUtils as it was on the classpath and I'd never had bad experiences with it. Guava, well...

steveloughran (Contributor) commented Mar 21, 2017

Actually, there's a straightforward reason: the test is using the Hadoop FS APIs, opening an output stream on a Path and writing to it; Files.write works on a local file. It doesn't work with the Hadoop FileSystem and Path classes, so it could only be used by abusing knowledge of path URLs. Going through FileSystem/Path uses the same API as you'd use in production, so it is the more rigorous test.
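
For illustration, a minimal sketch of writing through the Hadoop FileSystem API (the path and text are made up); Guava's Files.write takes a java.io.File, so it can only reach the local filesystem:

  import java.nio.charset.StandardCharsets
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Works against any Hadoop filesystem: local paths, HDFS, s3a://, ...
  def writeViaHadoopFs(pathStr: String, text: String): Unit = {
    val path = new Path(pathStr)
    val fs = path.getFileSystem(new Configuration())
    val out = fs.create(path, true) // overwrite = true
    try out.write(text.getBytes(StandardCharsets.UTF_8))
    finally out.close()
  }

  writeViaHadoopFs("target/streaming-input/file1.txt", "hello")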

steveloughran added some commits Aug 20, 2016

  • SPARK-17159: move filtering of directories and files out of glob/list filters and into filtering of the FileStatus instances returned in the results, so avoiding the need to create FileStatus instances for… This doesn't add overhead to the filtering process; that's done as post-processing in FileSystem anyway. At worst it may result in larger lists being built up and returned. For every glob match, the code saves 2 RPC calls to the HDFS NN; 1-3 HTTP calls to S3 for the directory check (including a slow List call whenever the directory has children, as it doesn't exist as a blob any more); for the modtime check of every file, it saves an HTTP GET. The whole modtime cache can be eliminated; it's a performance optimisation to avoid the overhead of the file checks, one that is no longer needed.
  • [SPARK-17159] Remove the fileModTime cache. Now that the modification time costs nothing to evaluate, caching it actually consumes memory and the time for a lookup.
  • [SPARK-17159] Inline FileStatus.getModificationTime; address style issues. Also note that 1s granularity is the resolution from HDFS; other filesystems may have a different resolution. The only one I know that is worse is FAT16/FAT32, which is accurate to 2s, but nobody should be using that except on SD cards and USB sticks.
  • [SPARK-17159] Updates as discussed on PR: skip wildcards for non-wildcarded listing; handle FNFE specially; add the docs.
  • [SPARK-17159] Move glob operation into SparkHadoopUtils, alongside an existing/similar method. Add tests for the behaviour. Update docs with suggested fixes, and review/edit.
  • [SPARK-17159] Method nested inside a sparktest test closure was being mistaken for a public method and so needed to declare a return type.
  • [SPARK-17159] File input dstream: revert to the directory list operation which doesn't shortcut on a non-wildcard operation.
  • [SPARK-17159] Round out the file streaming text with the dirty details of how HDFS doesn't update file length or modtime until close or a block boundary is reached.
  • SPARK-17159 Chris Nauroth of the HDFS team clarified which operations update the mtime field; this is covered in the streaming section to emphasise why write + rename is the strategy for streaming files into HDFS. That strategy does also work in object stores, though the rename operation is O(data).
  • [SPARK-17159] Address comments; move to withTempDir for tests with a temp dir. Docs now refer the reader to the Hadoop FS spec for any details about what object stores do.
  • SPARK-17159 Address Sean's review comments, and read over the object store text, updating slightly to make things a bit clearer. The more I learn about object stores, the less they resemble file systems.
SparkQA commented Mar 21, 2017

Test build #74990 has finished for PR 14731 at commit a3aaf26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
steveloughran (Contributor) commented Mar 29, 2017

Is there anything else I need to do here?

steveloughran (Contributor) commented Apr 8, 2017

@srowen anything else I need to do here?

@srowen

I've timed out on this change. It changes every time and still doesn't match the title. I don't think this is a great way to pursue changes like this.

steveloughran (Contributor) commented Apr 24, 2017

OK, I shall start again with a whole new PR of the current state.

rxin (Contributor) commented Apr 24, 2017

Steve, I think the main point is you should also respect the time of reviewers. The way most of your pull requests manifest has been suboptimal: they often start with a very early WIP (which is not necessarily a problem), and once in a while (e.g. a month or two) you update it to almost completely change it. The time itself is a problem: it requires a lot of context switching to review your pull requests. In addition, every time you update it, it looks like a completely new giant pull request.

steveloughran (Contributor) commented Apr 24, 2017

Reynold, I know very much about the time of reviewers; I put an hour or more a day into the Hadoop codebase reviewing stuff, generally trying to review the work of non-colleagues, so as to pull in the broad set of contributions which are needed.

I have been trying to get some object-store-related patches into Spark alongside the foundational work of fundamentally transforming how we work with object storage, especially S3, in Hadoop. Without the Spark-side changes, a lot gets lost: here the cost is approximately 100-300 ms/file when scanning an object store.

Here I've split things in two, docs and diff. Both are independent, and both are reasonably tractable. If they can be reviewed fast and added, there are no problems of patches ageing and everyone having to resync.

We can get this out of the way, and you'll have fewer reasons to be unhappy with me.
