
[SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible #18979

Conversation

steveloughran
Contributor

What changes were proposed in this pull request?

BasicWriteTaskStatsTracker.getFileSize() now catches FileNotFoundException, logs it at info level, and returns 0 as the file size.

This ensures that if a newly created file isn't visible because the store doesn't always offer create consistency, the metric collection doesn't cause the task to fail.
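Roughly, the change looks like the sketch below. This is a sketch only: FileSizeProbe is a stand-in class (the real code is a private method of BasicWriteTaskStatsTracker that logs via Spark's Logging trait), but the catch block mirrors the code quoted in the review comments below.

```scala
import java.io.FileNotFoundException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.slf4j.LoggerFactory

// Sketch only: mirrors the shape of the changed getFileSize(); the real code
// lives inside BasicWriteTaskStatsTracker and uses Spark's Logging mixin.
class FileSizeProbe(hadoopConf: Configuration) {
  private val log = LoggerFactory.getLogger(getClass)

  def getFileSize(filePath: String): Long = {
    val path = new Path(filePath)
    val fs = path.getFileSystem(hadoopConf)
    try {
      fs.getFileStatus(path).getLen
    } catch {
      case e: FileNotFoundException =>
        // may arise against eventually consistent object stores
        log.info(s"File $path is not yet visible", e)
        0L
    }
  }
}
```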

How was this patch tested?

New test suite included, BasicWriteTaskStatsTrackerSuite. This not only checks the resilience to missing files, but also verifies the existing logic for how file statistics are gathered.

Note that in the current implementation:

  1. if you call Tracker.getFinalStats() more than once, the file size count will increase by the size of the last file. This could be fixed by clearing the filename field inside getFinalStats() itself.

  2. If you pass in an empty or null string to Tracker.newFile(path), then IllegalArgumentException is raised, but only in getFinalStats(), rather than in newFile() (sketched below). There's a test for this behaviour in the new suite, as it verifies that only FNFEs get swallowed.
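A hedged sketch of that second behaviour; only the constructor call matches test code quoted later in this thread, and `conf` is assumed to be a Hadoop Configuration for the local filesystem.

```scala
// Sketch only: newFile("") is accepted without complaint, but the empty path
// is rejected when the final statistics are computed.
val tracker = new BasicWriteTaskStatsTracker(conf)
tracker.newFile("")
try {
  tracker.getFinalStats()
} catch {
  case e: IllegalArgumentException =>
    println(s"rejected as expected: ${e.getMessage}")
}
```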


// file 2 is noted, but not visible
tracker.newFile(file2.toString)
touch(file3)
Member

file3 ?

Contributor Author

Yeah, you're right. Spurious; harmless but wrong. I was playing around with different sequences to see if I could confuse things. Will cut.

@hvanhovell
Contributor

cc @adrian-ionescu

@gatorsmile
Member

To mimic S3-like behavior, you can override the file system via spark.hadoop.fs.$scheme.impl.
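For instance, something like the sketch below. This is illustrative only: MockInconsistentFileSystem and the mock:// scheme are hypothetical, standing in for any Hadoop FileSystem implementation on the classpath that simulates S3-like inconsistency.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: route a scheme to a custom FileSystem implementation so tests
// can simulate create/list inconsistency. The class name is hypothetical.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("inconsistent-fs-demo")
  .config("spark.hadoop.fs.mock.impl", "org.example.MockInconsistentFileSystem")
  .getOrCreate()

spark.range(10).write.parquet("mock://bucket/output")
```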

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80803 has finished for PR 18979 at commit 2a113fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-ionescu
Contributor

Thanks for the fix and tests, @steveloughran!
Re 1: is there a need for calling getFinalStats() more than once? The function doc clearly states that it's not supported and may lead to undefined behaviour. It could be fixed, of course, but depending on the implementation of the stats tracker, that can come at the expense of additional memory or code complexity.

case e: FileNotFoundException =>
// may arise against eventually consistent object stores
logInfo(s"File $path is not yet visible", e)
0
Contributor

Hm, I feel this would be dangerous; nowhere did we document that this would return an incorrect size...

Contributor Author

The problem is: what can be picked up if the file isn't yet reported as present by the endpoint? Adding a bool to say "results are unreliable" could be used as a warning.

One thing to consider long term: if Hadoop FS output streams added a simple <String, Long> map of statistics, could they be picked up by committers and then aggregated into job reports? Hadoop filesystems already have statistics (simple ones in Hadoop <= 2.7, an arbitrary String -> Long map with standard key names across filesystems in 2.8), and today S3ABlockOutputStream certainly collects stats on individual streams. If that was made visible and collected, you could get a lot more detail on what is going on. Thoughts?
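To make the suggestion concrete, a rough sketch of the aggregation step, assuming the hypothetical per-stream Map[String, Long] of statistics were exposed; no such API exists on today's output stream interface, and mergeStreamStats is an illustrative name.

```scala
// Sketch only: merge hypothetical per-stream statistics maps into one
// job-level map by summing values under the same key.
def mergeStreamStats(perStream: Seq[Map[String, Long]]): Map[String, Long] =
  perStream.foldLeft(Map.empty[String, Long]) { (acc, stats) =>
    stats.foldLeft(acc) { case (merged, (key, value)) =>
      merged.updated(key, merged.getOrElse(key, 0L) + value)
    }
  }

// Example: two tasks each report bytes written and retries.
// mergeStreamStats(Seq(Map("bytesWritten" -> 100L, "retries" -> 1L),
//                      Map("bytesWritten" -> 50L)))
// == Map("bytesWritten" -> 150L, "retries" -> 1L)
```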

Contributor Author

I could add a comment in the docs somewhere to state that metrics in the cloud may not be consistent.

val tracker = new BasicWriteTaskStatsTracker(conf)
tracker.newFile(file.toString)
touch(file)
assertStats(tracker, 1, 0)
Member

We may not be able to differentiate between a 0-byte file and a missing file in the final metrics.

Contributor Author

I'm assuming that the file will eventually come into existence, and that its absence straight after creation is simply a transient create inconsistency of the endpoint, like a brief caching of negative HEAD/GET requests (which AWS S3 does as part of its DoS defences). The files will be there later.

One option: count the number of missing files and include that in the report. It shouldn't show up most of the time though: never on a "real" FS or a consistent object store, and only rarely on an inconsistent one.

@viirya
Member

viirya commented Aug 18, 2017

Btw, as the file path passed to the stats tracker should be a task temp file, is it common to use S3 directly as the temp file output destination?

@steveloughran
Contributor Author

Currently nobody should be using s3a:// as the temp file destination, which is the same as saying "nobody should be using s3a:// as the direct destination of work", not without a special committer (Netflix, IBM's stocator, ...) or without something to give S3 list consistency. Because today, task commit relies on a list & rename of all files in the task attempt dir, and if you don't get list consistency, you can miss out on files. If you ever hear anyone complaining "it takes too long to commit to s3", then they are using it this way. Tell them to use a consistency layer or to stop it :)

@steveloughran
Contributor Author

"To mimic S3-like behavior, you can override the file system via spark.hadoop.fs.$scheme.impl"

@gatorsmile: you will be able to do something better soon, as S3A is adding an inconsistent AWS client to the hadoop-aws JAR, which you can enable to guarantee consistency delays and to inject intermittent faults (throttling, transient network events) into the system. All it will take is a config option to switch to this client, plus the chaos-monkey-esque probabilities and delays. This is what I'm already using; you will be able to as well. That is, no need to switch clients, just set spark.hadoop.fs.s3a.s3.client.factory.impl=org.apache.hadoop.fs.s3a.InconsistentS3ClientFactory and wait for the stack traces.
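In Spark terms that would look roughly like the sketch below, assuming a hadoop-aws version that ships InconsistentS3ClientFactory on the classpath; everything other than the factory option is illustrative session setup.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: switch the S3A client factory to the fault-injecting one,
// so consistency delays and throttling can be simulated during tests.
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.hadoop.fs.s3a.s3.client.factory.impl",
    "org.apache.hadoop.fs.s3a.InconsistentS3ClientFactory")
  .getOrCreate()
```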

The S3A FS itself needs to do more to handle throttling & failures (retry, and add failure metrics so throttling & error rates can be measured). Knowing throttling rates is important, as it will help identify perf problems due to bad distribution of work across a bucket, excess use of KMS key lookup..., things that surface in support calls.

This patch restores Spark 2.3 to the behaviour it had in Spark 2.2: a brief delay between object creation and visibility does not cause the task to fail.

@steveloughran
Contributor Author

@adrian-ionescu wrote

is there a need for calling getFinalStats() more than once?

No. As long as everyone is aware of it, it won't be an issue.

@SparkQA

SparkQA commented Aug 18, 2017

Test build #80841 has finished for PR 18979 at commit f778213.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor Author

Related to this: an updated spec on Hadoop output streams, Syncable and StreamCapabilities.

As the doc notes, object stores != filesystems, and while a lot can be done to preserve the metaphor on input, it's on output where CRUD inconsistencies surface, along with questions like "does a 0-byte file get created in create()?", "when is data written?", etc.

@steveloughran
Contributor Author

Has anyone had a look at this recently?

The problem still exists, and while downstream filesystems can address it if they recognise the use case and lie about values, they will be returning invalid values to the caller: Spark will be reporting the wrong values. At least with this PR, Spark gets to decide how to react itself.

@adrian-ionescu
Contributor

To me, this looks good.

@gatorsmile
Member

Will review it tomorrow

@viirya
Member

viirya commented Oct 11, 2017

I don't have a strong opinion against this. Incorrect size is an issue, but I can't think of a better solution for now...

@steveloughran
Contributor Author

@viirya: the new data writer API will allow a broader set of stats to be propagated back from workers. When you are working with object stores, a useful stat to get back is throttle count & retry count, as they can be the cause of why things are slow ... and if it is due to throttling, throwing more workers at the job will actually slow things down. They'd be the ones to look at first.

} catch {
case e: FileNotFoundException =>
// may arise against eventually consistent object stores
logInfo(s"File $path is not yet visible", e)
Member

Could you update the log message to indicate that the size of zero might be wrong? For example, due to negative caching in S3.

Contributor Author

say "Reported file size in job summary may be invalid"?

@gatorsmile
Member

LGTM except a minor comment.

@dongjoon-hyun
Member

+1. This solves the regression on writing empty datasets with the ORC format, too!

@gatorsmile
Member

Could you also include the test cases in InsertSuite.scala?

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 13, 2017

Gentle ping, @steveloughran ! :)

@steveloughran
Contributor Author

Noted :)
@dongjoon-hyun: is the issue with ORC that if there's nothing to write, it doesn't generate a file (so avoiding the issue where you sometimes get 0-byte ORC files and things downstream fail)?

If so, the warning message which @gatorsmile has proposed is potentially going to mislead people into worrying about a problem which isn't there, and the numFiles metric is going to mislead.

I'm starting to worry about how noisy the log would be, both there and when working with S3 when it's exhibiting delayed visibility (rarer).

  1. What if this patch just logged at debug: less noise, but still something there if people are trying to debug a mismatch?
  2. If there's no file found, numFiles doesn't get incremented.
  3. Count the number of files actually submitted.
  4. In getFinalStats(), log at info if there is a mismatch.

This would line things up in future for actually returning the list of expected vs. actual files as a metric that could be reported; a rough sketch of the counting approach follows.
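As an illustration only: the class, field, and method names below are hypothetical, though a later comment in this thread refers to the real counters as submittedFiles and numFiles.

```scala
import org.slf4j.LoggerFactory

// Sketch only: count files submitted vs. files whose size could actually be
// read, and log a single info-level message about any mismatch at the end.
class MissingFileAwareTracker {
  private val log = LoggerFactory.getLogger(getClass)
  private var submittedFiles = 0
  private var numFiles = 0
  private var numBytes = 0L

  def fileSubmitted(): Unit = submittedFiles += 1

  def fileSizeFound(length: Long): Unit = {
    numFiles += 1
    numBytes += length
  }

  def finalStats(): (Int, Long) = {
    if (submittedFiles != numFiles) {
      log.info(s"${submittedFiles - numFiles} of $submittedFiles files were not visible " +
        "when their sizes were collected; reported totals may be underestimates")
    }
    (numFiles, numBytes)
  }
}
```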

@steveloughran force-pushed the cloud/SPARK-21762-missing-files-in-metrics branch from 649f8da to d3f96f6 on October 13, 2017 12:09
@steveloughran
Contributor Author

The latest PR update pulls in @dongjoon-hyun's new test; to avoid merge conflict in the Insert suite I've rebased against master.

  1. Everything handles missing files on output
  2. There's only one logInfo at the end of the execute call, so if many empty files are created, the logs aren't too noisy.
  3. There is now some implicit counting of how many files were missing = submittedFiles - numFiles; this isn't aggregated and reported though.

@SparkQA

SparkQA commented Oct 13, 2017

Test build #82730 has finished for PR 18979 at commit adab985.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @steveloughran .

is the issue with ORC that if there's nothing to write, it doesn't generate a file (so avoiding that issue with sometimes you get 0-byte ORC files & things downstream fail)?

Yes. So far, Spark leaves an empty directory in the case of ORC.

@SparkQA

SparkQA commented Oct 13, 2017

Test build #82732 has finished for PR 18979 at commit d3f96f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2017

Test build #82731 has finished for PR 18979 at commit 649f8da.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

} catch {
case e: FileNotFoundException =>
// may arise against eventually consistent object stores
logDebug(s"File $path is not yet visible", e)
Member

For the error message, it looks okay to me. First, it's a debug message. Second, the ORC writer bug will be fixed in Spark 2.3 anyway.

@dongjoon-hyun
Member

Could you resolve the conflicts again?

… of tests for various file states.

Change-Id: I3269cb901a38b33e399ebef10b2dbcd51ccf9b75
Change-Id: I38ac11c808849e2fd91f4931f4cb5cdfad43e2af
Change-Id: I6d101ece0cccbd8403dff10004575a24109e6f1b
* Use Option to track whether or not the current file is set; guarantees once-only invocation, amongst other things
* Separate counting of submitted files from the number of files actually seen
* Log at debug when an FNFE is caught
* Log at info only at the end of a sequence of writes

Change-Id: Id242c11338be1f41a3f9a5b8b30c796ac5b002a2
This is going to create a merge conflict with this branch until I rebase it, which I'm about to do.

Change-Id: Ie2309066ad7892cb20155d9de8248c1682bba526
@steveloughran force-pushed the cloud/SPARK-21762-missing-files-in-metrics branch from d3f96f6 to c0e81a1 on October 13, 2017 18:56
@steveloughran
Contributor Author

Done. Not writing 0-byte files will offer a significant speedup against object stores, where a call to getFileStatus() can take hundreds of milliseconds. I look forward to it.

@SparkQA

SparkQA commented Oct 13, 2017

Test build #82745 has finished for PR 18979 at commit c0e81a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

+1, LGTM, too.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit closed this in e353640 on Oct 14, 2017
@steveloughran
Contributor Author

Thanks for the review, everyone!

@steveloughran deleted the cloud/SPARK-21762-missing-files-in-metrics branch on March 18, 2019 17:57