
MAPREDUCE-7403. manifest-committer dynamic partitioning support. #4728

Conversation

@steveloughran (Contributor) commented Aug 10, 2022

Description of PR

Declares its compatibility with the stream capability
"mapreduce.job.committer.dynamic.partitioning".

Spark will need to cast to StreamCapabilities and then probe.
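A minimal sketch of that cast-and-probe, based on the capability string this PR adds; the class and method names here are illustrative, not Spark's actual code:

```java
import org.apache.hadoop.fs.StreamCapabilities;
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter;

public final class DynamicPartitioningProbe {

  /** Capability string added by MAPREDUCE-7403. */
  public static final String CAPABILITY_DYNAMIC_PARTITIONING =
      "mapreduce.job.committer.dynamic.partitioning";

  /**
   * Probe a committer for dynamic partition overwrite support:
   * cast to StreamCapabilities if implemented, then ask for the capability.
   */
  public static boolean supportsDynamicPartitioning(PathOutputCommitter committer) {
    return committer instanceof StreamCapabilities
        && ((StreamCapabilities) committer)
            .hasCapability(CAPABILITY_DYNAMIC_PARTITIONING);
  }
}
```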

How was this patch tested?

SPARK-40034 has a PR with matching changes in the Spark code, plus unit tests to verify that it is not an error to ask for dynamic partitioning if the committer's hasCapability probe holds.
apache/spark#37468

Testing

Tested through a Spark build with the matching patch against S3 London and Azure Cardiff; the new integration tests are listed in the Spark PR description below.
The GCS test setup is failing with OAuth problems in a way it was not on Friday; assuming that is unrelated.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue ID (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

Declares its compatibility with the stream capability
"mapreduce.job.committer.dynamic.partitioning"

Spark will need to cast to StreamCapabilities and then probe.

Change-Id: Iafcacc6d2491bb1e7fc2fc033c6d17d5b63b5b4f
…through

Change-Id: Icc30bf6251977cfb76211bffcfc5796b1a44989b
* Spark-side requirements
* why there is risk if you use it at scale.

That risk is low because Spark currently seems to rename
sequentially. If/when it does parallel file renames,
throttling may be triggered, with the consequent
failure events.

Change-Id: I6e442bbdcaa007a3cd2e04ddf8b41d14c51057ff
@steveloughran force-pushed the mr/MAPREDUCE-7403-manifest-committer-partitioning branch from f62db61 to 82372d0 on August 15, 2022 at 19:20
@apache apache deleted a comment from hadoop-yetus Aug 15, 2022
Change-Id: I423f052ca48915502f182cb4f1c67cdf04838a99
@hadoop-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:---------:|:-------:|:-------:|:--------|
| +0 🆗 | reexec | 0m 55s | | Docker mode activated. |
| | _ Prechecks _ | | | |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 1s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 🆗 | markdownlint | 0m 1s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | _ trunk Compile Tests _ | | | |
| +1 💚 | mvninstall | 40m 58s | | trunk passed |
| +1 💚 | compile | 0m 56s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | compile | 0m 49s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | checkstyle | 0m 50s | | trunk passed |
| +1 💚 | mvnsite | 0m 56s | | trunk passed |
| +1 💚 | javadoc | 0m 44s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javadoc | 0m 35s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 1m 45s | | trunk passed |
| +1 💚 | shadedclient | 24m 16s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +1 💚 | mvninstall | 0m 39s | | the patch passed |
| +1 💚 | compile | 0m 44s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javac | 0m 44s | | the patch passed |
| +1 💚 | compile | 0m 37s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | javac | 0m 37s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 31s | | the patch passed |
| +1 💚 | mvnsite | 0m 42s | | the patch passed |
| +1 💚 | javadoc | 0m 23s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javadoc | 0m 23s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 1m 34s | | the patch passed |
| +1 💚 | shadedclient | 23m 44s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 💚 | unit | 7m 12s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 💚 | asflicense | 0m 42s | | The patch does not generate ASF License warnings. |
| | | 110m 8s | | |

| Subsystem | Report/Notes |
|:---------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4728/4/artifact/out/Dockerfile |
| GITHUB PR | #4728 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint |
| uname | Linux 776b57f21b6a 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 25db5da |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4728/4/testReport/ |
| Max. process+thread count | 1110 (vs. ulimit of 5500) |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4728/4/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@apache apache deleted a comment from hadoop-yetus Aug 18, 2022
@steveloughran (Contributor, Author) commented:

It would be good to get some reviews here from @mukund-thakur, @mehakmeet, and ideally @sunchao and @dongjoon-hyun, both of whom will be able to review the matching Spark-side change, which is simply one of "don't reject attempts to use a PathOutputCommitter for dynamic partition overwrite if the instance created says it is OK".

@attilapiros left a comment:
I think the new assertions in TestManifestCommitProtocol.java are just defined but not executed. Otherwise LGTM.

```java
Assertions.assertThat(committer.hasCapability(
    ManifestCommitterConstants.CAPABILITY_DYNAMIC_PARTITIONING))
    .describedAs("dynamic partitioning capability in committer %s",
        committer);
```


Suggested change:
```diff
-        committer);
+        committer).isTrue();
```

@steveloughran (Contributor, Author) replied:

Did I just get my asserts wrong? That was bad. Thanks!

```java
Assertions.assertThat(bindingCommitter.hasCapability(
    ManifestCommitterConstants.CAPABILITY_DYNAMIC_PARTITIONING))
    .describedAs("dynamic partitioning capability in committer %s",
        bindingCommitter);
```


Suggested change:
```diff
-        bindingCommitter);
+        bindingCommitter).isTrue();
```
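For context on why the original form was a no-op: AssertJ's `assertThat(...).describedAs(...)` only builds the assertion object and sets its failure message; nothing is checked until a terminal method such as `isTrue()` runs. A minimal illustration (hypothetical standalone class, not part of the patch):

```java
import org.assertj.core.api.Assertions;

public class AssertJPitfall {
  public static void main(String[] args) {
    // No-op: describedAs() only sets the failure message;
    // no verification happens without a terminal call.
    Assertions.assertThat(false).describedAs("silently skipped");

    // Real assertion: isTrue() evaluates the check and throws
    // AssertionError here, printing the description above.
    Assertions.assertThat(false).describedAs("fails as expected").isTrue();
  }
}
```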

@mukund-thakur (Contributor) left a comment:

LGTM +1, pending Attila's comments.

Change-Id: I29e98cf4ac607913d59e15babe6180434f665714
@steveloughran (Contributor, Author) commented:

Thanks. Fixed the tests, ran them locally, and ran the ABFS ITest subclass. All good.

```diff
@@ -29,7 +29,7 @@
  * <li>Nothing else got through either.</li>
  * </ol>
  */
-public class AWSStatus500Exception extends AWSServiceIOException {
+public class jAWSStatus500Exception extends AWSServiceIOException {
```
A contributor commented:
I think this is a typo from IntelliJ.


And I think Yetus failed because of this alone.

@steveloughran (Contributor, Author) replied:
aah

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:---------:|:-------:|:-------:|:--------|
| +0 🆗 | reexec | 0m 48s | | Docker mode activated. |
| | _ Prechecks _ | | | |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 1s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 🆗 | markdownlint | 0m 1s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | _ trunk Compile Tests _ | | | |
| +0 🆗 | mvndep | 15m 8s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 28m 39s | | trunk passed |
| +1 💚 | compile | 25m 5s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | compile | 22m 1s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | checkstyle | 4m 30s | | trunk passed |
| +1 💚 | mvnsite | 2m 28s | | trunk passed |
| +1 💚 | javadoc | 1m 56s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javadoc | 1m 53s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 3m 38s | | trunk passed |
| +1 💚 | shadedclient | 24m 30s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +0 🆗 | mvndep | 0m 27s | | Maven dependency ordering for patch |
| -1 ❌ | mvninstall | 0m 20s | /patch-mvninstall-hadoop-tools_hadoop-aws.txt | hadoop-aws in the patch failed. |
| -1 ❌ | compile | 22m 55s | /patch-compile-root-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt | root in the patch failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1. |
| -1 ❌ | javac | 22m 55s | /patch-compile-root-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt | root in the patch failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1. |
| -1 ❌ | compile | 20m 48s | /patch-compile-root-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07.txt | root in the patch failed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07. |
| -1 ❌ | javac | 20m 48s | /patch-compile-root-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07.txt | root in the patch failed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07. |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 4m 19s | /results-checkstyle-root.txt | root: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| -1 ❌ | mvnsite | 0m 52s | /patch-mvnsite-hadoop-tools_hadoop-aws.txt | hadoop-aws in the patch failed. |
| +1 💚 | javadoc | 1m 46s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| -1 ❌ | javadoc | 0m 53s | /patch-javadoc-hadoop-tools_hadoop-aws-jdkPrivateBuild-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07.txt | hadoop-aws in the patch failed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07. |
| -1 ❌ | spotbugs | 0m 50s | /patch-spotbugs-hadoop-tools_hadoop-aws.txt | hadoop-aws in the patch failed. |
| +1 💚 | shadedclient | 25m 19s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 💚 | unit | 7m 26s | | hadoop-mapreduce-client-core in the patch passed. |
| -1 ❌ | unit | 0m 50s | /patch-unit-hadoop-tools_hadoop-aws.txt | hadoop-aws in the patch failed. |
| +1 💚 | asflicense | 1m 14s | | The patch does not generate ASF License warnings. |
| | | 227m 14s | | |

| Subsystem | Report/Notes |
|:---------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4728/5/artifact/out/Dockerfile |
| GITHUB PR | #4728 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint |
| uname | Linux a84745bf1c6e 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 649b902 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4728/5/testReport/ |
| Max. process+thread count | 1069 (vs. ulimit of 5500) |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-aws U: . |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4728/5/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Change-Id: Ifbe2d1012cbdf2e7467ce84a7d8d93a78e91dcf6
@hadoop-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:---------:|:-------:|:-------:|:--------|
| +0 🆗 | reexec | 0m 47s | | Docker mode activated. |
| | _ Prechecks _ | | | |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 🆗 | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | _ trunk Compile Tests _ | | | |
| +0 🆗 | mvndep | 14m 45s | | Maven dependency ordering for branch |
| +1 💚 | mvninstall | 28m 31s | | trunk passed |
| +1 💚 | compile | 25m 15s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | compile | 21m 52s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | checkstyle | 4m 30s | | trunk passed |
| +1 💚 | mvnsite | 2m 29s | | trunk passed |
| +1 💚 | javadoc | 1m 56s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javadoc | 1m 57s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 3m 42s | | trunk passed |
| +1 💚 | shadedclient | 24m 28s | | branch has no errors when building and testing our client artifacts. |
| | _ Patch Compile Tests _ | | | |
| +0 🆗 | mvndep | 0m 27s | | Maven dependency ordering for patch |
| +1 💚 | mvninstall | 1m 15s | | the patch passed |
| +1 💚 | compile | 24m 35s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javac | 24m 35s | | the patch passed |
| +1 💚 | compile | 22m 4s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | javac | 22m 4s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 4m 25s | | the patch passed |
| +1 💚 | mvnsite | 2m 27s | | the patch passed |
| +1 💚 | javadoc | 1m 49s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 💚 | javadoc | 1m 57s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 💚 | spotbugs | 3m 50s | | the patch passed |
| +1 💚 | shadedclient | 24m 47s | | patch has no errors when building and testing our client artifacts. |
| | _ Other Tests _ | | | |
| +1 💚 | unit | 7m 30s | | hadoop-mapreduce-client-core in the patch passed. |
| +1 💚 | unit | 3m 5s | | hadoop-aws in the patch passed. |
| +1 💚 | asflicense | 1m 17s | | The patch does not generate ASF License warnings. |
| | | 233m 58s | | |

| Subsystem | Report/Notes |
|:---------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4728/6/artifact/out/Dockerfile |
| GITHUB PR | #4728 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets markdownlint |
| uname | Linux 2b93cafc2124 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / bc9dfc9 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4728/6/testReport/ |
| Max. process+thread count | 1082 (vs. ulimit of 5500) |
| modules | C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-tools/hadoop-aws U: . |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4728/6/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@mukund-thakur (Contributor) left a comment:

+1

@steveloughran steveloughran merged commit de37fd3 into apache:trunk Aug 24, 2022
asfgit pushed a commit that referenced this pull request Aug 24, 2022
Declares its compatibility with Spark's dynamic
output partitioning by having the stream capability
"mapreduce.job.committer.dynamic.partitioning"

Requires a Spark release with SPARK-40034, which
does the probing before deciding whether to
accept or reject instantiation with
dynamic partition overwrite set.

This feature can be declared as supported by
any other PathOutputCommitter implementations
whose algorithm and destination filesystem
are compatible.

None of the S3A committers are compatible.

The classic FileOutputCommitter is, but it
does not declare itself as such out of our fear
of changing that code. The Spark-side code
will automatically infer compatibility if
the created committer is of that class or
a subclass.

Contributed by Steve Loughran.
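As a sketch of how another PathOutputCommitter implementation might declare the capability described above (a hypothetical, minimal committer; the commit-protocol methods are stubbed out only so the example compiles):

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.StreamCapabilities;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter;

/** Hypothetical committer declaring dynamic partitioning support. */
public class ExamplePartitioningCommitter extends PathOutputCommitter
    implements StreamCapabilities {

  public static final String CAPABILITY_DYNAMIC_PARTITIONING =
      "mapreduce.job.committer.dynamic.partitioning";

  private final Path outputPath;

  public ExamplePartitioningCommitter(Path outputPath,
      TaskAttemptContext context) throws IOException {
    super(outputPath, context);
    this.outputPath = outputPath;
  }

  /**
   * Spark's probe calls this to decide whether to permit
   * dynamic partition overwrite with this committer.
   */
  @Override
  public boolean hasCapability(String capability) {
    return CAPABILITY_DYNAMIC_PARTITIONING.equals(capability);
  }

  @Override
  public Path getOutputPath() {
    return outputPath;
  }

  // A real committer must implement the commit protocol;
  // these stubs exist only to make the sketch compile.
  @Override public void setupJob(JobContext context) throws IOException {}
  @Override public void setupTask(TaskAttemptContext context) throws IOException {}
  @Override public boolean needsTaskCommit(TaskAttemptContext context) { return false; }
  @Override public void commitTask(TaskAttemptContext context) throws IOException {}
  @Override public void abortTask(TaskAttemptContext context) throws IOException {}
}
```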
dongjoon-hyun pushed a commit to apache/spark that referenced this pull request Sep 9, 2022
### What changes were proposed in this pull request?

Uses the StreamCapabilities probe in MAPREDUCE-7403 to identify when a
PathOutputCommitter is compatible with dynamic partition overwrite.

This patch has unit tests but not integration tests; it really needs
to test the SQL commands through the manifest committer into GCS/ABFS,
or at least the local FS. That would be possible once Hadoop 3.3.5 is out...

### Why are the changes needed?

Hadoop 3.3.5 adds a new committer in mapreduce-core which works fast and correctly on Azure and GCS. (It would also work on HDFS, but it is optimised for the cloud stores.)

The stores and the committer meet the requirements of Spark SQL dynamic partition overwrite, so it is OK for Spark to work through it.

Spark does not know this; MAPREDUCE-7403 adds a way for any PathOutputCommitter to declare that it is compatible, and the IntermediateManifestCommitter will do so.
(apache/hadoop#4728)

### Does this PR introduce _any_ user-facing change?

No.

There is documentation on the feature in the hadoop [manifest committer](https://github.com/apache/hadoop/blob/82372d0d22e696643ad97490bc902fb6d17a6382/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/manifest_committer.md) docs.

### How was this patch tested?

1. Unit tests in hadoop-cloud which work with Hadoop versions with/without the matching change.
2. New integration tests in https://github.com/hortonworks-spark/cloud-integration which require spark to be built against hadoop with the manifest committer declaring compatibility

Those new integration tests include

* spark sql test derived from spark's own [CloudRelationBasicSuite.scala#L212](https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/test/scala/org/apache/spark/sql/sources/CloudRelationBasicSuite.scala#L212)
* Dataset tests extended to verify support for/rejection of dynamic partition overwrite [AbstractCommitDataframeSuite.scala#L151](https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/test/scala/com/cloudera/spark/cloud/committers/AbstractCommitDataframeSuite.scala#L151)

Tested against Azure Cardiff with the manifest committer, and S3 London (the S3A committers reject dynamic partition overwrite).

Closes #37468 from steveloughran/SPARK-40034-MAPREDUCE-7403-manifest-committer-partitioning.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
a0x8o added a commit to a0x8o/spark that referenced this pull request Sep 9, 2022
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022
a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 30, 2022
a0x8o added a commit to a0x8o/spark that referenced this pull request Dec 30, 2022