[SPARK-33739] [SQL] Jobs committed through the S3A Magic committer don't track bytes #30714

Conversation

@steveloughran (Contributor) commented Dec 10, 2020

Change BasicWriteStatsTracker to probe for a custom XAttr if the size of
the generated file is 0 bytes; if the attribute is found and parseable,
use that as the declared length of the output.

The matching Hadoop patch, HADOOP-17414:

  • Returns all S3 object headers as XAttr attributes prefixed "header."
  • Sets the custom header x-hadoop-s3a-magic-data-length to the length of
    the data in the marker file.

As a result, Spark job tracking will correctly report the amount of data uploaded
and yet to materialize.
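In outline, the probe behaves like this (a minimal sketch of the approach rather than the patch itself; the real constant is BasicWriteJobStatsTracker.FILE_LENGTH_XATTR, whose value follows from the header list above):

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.fs.{FileSystem, Path}

// HADOOP-17414 surfaces S3 object headers as XAttrs prefixed "header.".
val FILE_LENGTH_XATTR = "header.x-hadoop-s3a-magic-data-length"

/** Resolve the length of a written file, consulting the XAttr when the
 *  file itself is a zero-byte magic-committer marker. */
def getFileSize(fs: FileSystem, path: Path): Option[Long] = {
  val len = fs.getFileStatus(path).getLen
  if (len > 0) {
    Some(len)
  } else {
    // Zero bytes: possibly a marker file whose real length is declared
    // in the custom header, surfaced through the XAttr API.
    val declared = try {
      Option(fs.getXAttr(path, FILE_LENGTH_XATTR))
        .filter(_.nonEmpty)
        .map(bytes => new String(bytes, StandardCharsets.UTF_8).toLong)
        .filter(_ > 0)
    } catch {
      // No XAttr support, attribute missing, or value unparseable:
      // fall back to the observed size of zero.
      case _: Exception => None
    }
    declared.orElse(Some(0L))
  }
}
```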

Why are the changes needed?

Now that S3 is consistent, it's a lot easier to use the S3A "magic" committer,
which redirects a file written to dest/__magic/job_0011/task_1245/__base/year=2020/output.avro
to its final destination dest/year=2020/output.avro, adding a zero-byte marker file at
the end and a JSON file dest/__magic/job_0011/task_1245/__base/year=2020/output.avro.pending
containing all the information the job committer needs to complete the upload.
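For context, wiring a Spark job up to this committer typically looks something like the following (a hedged sketch based on the Spark cloud-integration documentation; the two commit-protocol classes ship in the separate spark-hadoop-cloud module and are not part of this patch):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("magic-committer-example")
  // Select the S3A "magic" committer and enable its __magic path rewriting.
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  // Bind Spark SQL's commit protocol to Hadoop's PathOutputCommitter
  // factories (classes from the spark-hadoop-cloud module).
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
```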

But: the write tracker statistics don't show progress, as they measure the length of the
created file, find the zero-byte marker file, and report 0 bytes.
By probing for a specific HTTP header in the marker file and parsing it if
retrieved, the real progress can be reported.

There's a matching change in Hadoop apache/hadoop#2530
which adds getXAttr API support to the S3A connector and returns the headers; the magic
committer adds the relevant attributes.
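Since every object header comes back as an XAttr, the attributes can be inspected directly through the FileSystem API. An illustration (the bucket and path are placeholders, and this assumes a Hadoop build containing HADOOP-17414):

```scala
import java.nio.charset.StandardCharsets

import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Placeholder path to a file written through the magic committer.
val marker = new Path("s3a://bucket/dest/year=2020/output.avro")
val fs = marker.getFileSystem(new Configuration())

// Dump every S3 object header surfaced as an XAttr (the "header." prefix).
fs.getXAttrs(marker).asScala.foreach { case (name, value) =>
  println(s"$name = ${new String(value, StandardCharsets.UTF_8)}")
}
```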

If the FS being probed doesn't support the XAttr API, the header is missing,
or the value is not a positive long, then a size of 0 is returned.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests in BasicWriteTaskStatsTrackerSuite use a filter FS to
implement getXAttr on top of LocalFS (sketched after this list); this is
used to explore the set of options:

  • no XAttr API implementation (existing tests; what callers would see with
    most filesystems)
  • no attribute found (HDFS, ABFS without the attribute)
  • invalid data of different forms

All of these return Some(0) as the file length.
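The fixture's shape is roughly the following (a sketch with hypothetical names; the real suite lives in BasicWriteTaskStatsTrackerSuite):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FilterFileSystem, Path}

/** Wraps the local filesystem and serves one canned XAttr value, so the
 *  tracker's probe can be exercised without an S3 store. */
class XAttrTestFileSystem(canned: Option[Array[Byte]])
  extends FilterFileSystem(FileSystem.getLocal(new Configuration())) {

  // None models "attribute not found"; a real store would throw or return null.
  override def getXAttr(path: Path, name: String): Array[Byte] = canned.orNull
}

// The cases explored above, as fixtures:
val noAttr  = new XAttrTestFileSystem(None)
val valid   = new XAttrTestFileSystem(Some("12345".getBytes("UTF-8")))
val invalid = new XAttrTestFileSystem(Some("not-a-long".getBytes("UTF-8")))
```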

The Hadoop PR verifies the XAttr implementation in S3A and that
the commit protocol attaches the header to the files.

External downstream testing has exercised the full Hadoop + Spark operation
end to end, with manual review of logs to verify that the data was
successfully collected from the attribute.

@github-actions github-actions bot added the SQL label Dec 10, 2020
@steveloughran steveloughran changed the title [SPARK-33739] [SQL] Jobs committed through the S3A Magic committer do… [SPARK-33739] [SQL] Jobs committed through the S3A Magic committer don't track bytes Dec 10, 2020
@SparkQA commented Dec 10, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37193/

@SparkQA commented Dec 10, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37193/

@SparkQA commented Dec 11, 2020

Test build #132588 has finished for PR 30714 at commit 132a2e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran steveloughran force-pushed the cdpd/SPARK-33739-magic-commit-tracking-master branch from 132a2e9 to c680138 on January 22, 2021 13:55
@SparkQA commented Jan 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38958/

@SparkQA commented Jan 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38958/

@SparkQA commented Jan 22, 2021

Test build #134371 has finished for PR 30714 at commit c680138.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Thank you for updating. I'm looking forward to using Hadoop 3.3.1!

@steveloughran (Contributor, Author)

OK, the Hadoop-side PR is in trunk; once I've verified the backport compiles and the new tests work, it'll be in branch-3.3.

This patch is ready for final review/merge

@HyukjinKwon (Member) left a comment

Makes sense to me.

@HyukjinKwon (Member)

cc @tgravescs @mridulm FYI

```scala
val attr = fs.getXAttr(path, BasicWriteJobStatsTracker.FILE_LENGTH_XATTR)
if (attr != null && attr.nonEmpty) {
  val str = new String(attr, StandardCharsets.UTF_8)
  logInfo(s"File Length statistics for $path retrieved from XAttr: $str")
```
Contributor:

do we want this at info or just debug? seems like I would only care if stats didn't come out, but maybe it's more useful...

Contributor (Author):

can put to debug.

```scala
} catch {
  case e: NumberFormatException =>
    // warn but don't dump the whole stack
    logInfo(s"Failed to parse" +
```
Contributor:

do we want this at warn instead, or perhaps just extend the message saying the file stats may be wrong?

Contributor (Author):

OK. I'd say at info, as warn seems overkill for progress-report issues.

@tgravescs (Contributor) left a comment

Overall looks fine to me. Do we need to document which Hadoop versions this works with, and note that the stats will be wrong with the others?

@steveloughran (Contributor, Author)

I'll update the docs. Also going to cut back on the warnings now that AWS S3 is consistent (and all the third-party ones were consistent out of the box too). This is a good thing for apps like Spark (much easier to do workflows across applications), but you still can't safely use the rename committer there, as dir rename is non-atomic.
Fortunately, commit-by-rename is so slow people have a reason to switch to a better one, either the S3A ones or that from EMR :)

@github-actions github-actions bot added the DOCS label Jan 29, 2021
```
@@ -49,7 +49,6 @@ They cannot be used as a direct replacement for a cluster filesystem such as HDF

Key differences are:

* Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
```
Member:

Shall we remove line 60 together?

  1. The output of work may not be immediately visible to a follow-on query.

Contributor (Author):

I worry about OpenStack Swift. The original version was woefully inconsistent, and I've not been near it long enough to know what's changed. With Swift I could consistently replicate an inconsistency in a few lines:

  1. write a file of length, say, 1KB
  2. overwrite with a shorter dataset, e.g. 512 bytes
  3. open file
  4. read from 0-128: get the new data
  5. read from 768-1024: get the old data

No S3 implementation that I know of (i.e. the open-source or commercial ones) is inconsistent, nor have they ever had the 404-caching issue. But people should still be aware of the risk.

Oh, and I haven't played with any of the Chinese cloud stores for which connectors now exist, so I can't make statements there. All I can say is that the "big three outside China" are consistent.

Contributor:

The line deletion is based on the fact that S3 is now strongly consistent, but that also means we only consider these three vendors. (Probably you're considering S3-compatible implementations more broadly, but you've also mentioned you don't consider the Chinese cloud stores, so it can't be exhaustive.) Why not say this explicitly and tie the description to these vendors? We would never be able to consider all possible implementations, and for some minority the description may be wrong. Let's just make it clear.

Contributor:

Or, what about leaving this line as it is (so that the description enumerates all "possible" issues on object stores beyond the big three), and elaborating on which key differences are no longer in effect with strong consistency in the consistency section?

Contributor (Author):

(Probably you're considering S3-compatible implementations more broadly, but you've also mentioned you don't consider the Chinese cloud stores, so it can't be exhaustive.)

I don't test them; I don't know their consistency

Contributor (Author):

Reviewed the docs; the Huawei and Tencent object stores both seem consistent. Swift is now the outlier. Moving the issue into a paragraph in the "consistency" section.

Contributor:

@steveloughran Thanks for the update!
@dongjoon-hyun Could you please check whether the comments and new change make sense to you? Thanks in advance!

@steveloughran (Contributor, Author)

I've looked at the stores for which hadoop-trunk has connectors; it looks like only OpenStack Swift is inconsistent. Moved the details on consistency down, called out Swift, and said "consult the docs". After all, alternative implementations of the Swift API (I'm thinking of IBM's work) are probably consistent.

@SparkQA commented Feb 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39737/

@SparkQA commented Feb 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39737/

@SparkQA commented Feb 15, 2021

Test build #135156 has finished for PR 30714 at commit 4b08f49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor)

The doc change looks good to me, but I don't feel qualified to review the code. As @tgravescs already reviewed and approved, it seems more natural for him to sign off.

@tgravescs Kind reminder. Thanks!


```
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
```
Member:

We can remove this repetition (line 194 ~ 196) because we already have this at the beginning of this section (line 149)

Contributor (Author):

done

logDebug(s"XAttr not supported on path $path", e);
case e: Exception =>
// Something else. Log at debug and continue.
logDebug(s"Xattr processing failure on $path", e);
Member:

Xattr -> XAttr?

```scala
/**
 * Does a length specified in the XAttr header get picked up?
 */
test("Xattr sourced length") {
```
Member:

Xattr -> XAttr?

@dongjoon-hyun (Member) left a comment

+1, LGTM, too.

@SparkQA commented Feb 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39777/

@SparkQA commented Feb 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39777/

@SparkQA commented Feb 17, 2021

Test build #135196 has finished for PR 30714 at commit 81c0a52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor, Author)

@tgravescs can you hit the merge button before some other change breaks this PR? thanks

@asfgit asfgit closed this in ff5115c Feb 18, 2021
@tgravescs (Contributor)

merged to master, thanks @steveloughran @dongjoon-hyun @HeartSaVioR

@steveloughran (Contributor, Author)

thanks!

@HyukjinKwon (Member)

I was here too! 😃
