
[SPARK-33605][BUILD] Add gcs-connector to hadoop-cloud module #37745

Closed
wants to merge 1 commit into from

Conversation

@dongjoon-hyun (Member) commented Aug 31, 2022

What changes were proposed in this pull request?

This PR aims to add the gcs-connector shaded jar to the hadoop-cloud module.

Why are the changes needed?

To support Google Cloud Storage more easily.

Does this PR introduce any user-facing change?

Only one shaded jar file is added when the distribution is built with -Phadoop-cloud.

$ ls -alh gcs*
-rw-r--r--@ 1 dongjoon  staff    32M Aug 31 11:14 gcs-connector-hadoop3-2.2.7-shaded.jar

How was this patch tested?

BUILD

$ dev/make-distribution.sh -Phadoop-cloud
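(For reference, a fuller invocation that also produces a tarball might look like the following; --name and --tgz are standard make-distribution.sh flags, and the distribution name here is illustrative.)

$ dev/make-distribution.sh --name gcs-test --tgz -Phadoop-cloud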

RUN

$ export KEYFILE=YOUR-credentials.json
$ export EMAIL=$(jq -r '.client_email' < $KEYFILE)
$ export PRIVATE_KEY_ID=$(jq -r '.private_key_id' < $KEYFILE)
$ export PRIVATE_KEY="$(jq -r '.private_key' < $KEYFILE)"
$ bin/spark-shell \
-c spark.hadoop.fs.gs.auth.service.account.email=$EMAIL \
-c spark.hadoop.fs.gs.auth.service.account.private.key.id=$PRIVATE_KEY_ID \
-c spark.hadoop.fs.gs.auth.service.account.private.key="$PRIVATE_KEY"
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/31 11:56:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1661972165062).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0-SNAPSHOT
      /_/

Using Scala version 2.12.16 (OpenJDK 64-Bit Server VM, Java 17.0.4)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.text("gs://apache-spark-bucket/README.md").count()
res0: Long = 124

scala> spark.read.orc("examples/src/main/resources/users.orc").write.orc("gs://apache-spark-bucket/users.orc")

scala> spark.read.orc("gs://apache-spark-bucket/users.orc").show()
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

@dongjoon-hyun dongjoon-hyun marked this pull request as draft August 31, 2022 17:49
@github-actions github-actions bot added the BUILD label Aug 31, 2022
@dongjoon-hyun dongjoon-hyun changed the title to [SPARK-33605][BUILD] Add gcs-connector to hadoop-cloud module Aug 31, 2022
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review August 31, 2022 18:58
@dongjoon-hyun (Member, Author) commented Aug 31, 2022

cc @sunchao, @steveloughran, @srowen

@srowen (Member) commented Aug 31, 2022

Seems OK, but what does it buy us? GCS storage support? Only downside is increasing the sea of JARs in the project, I guess.

@dongjoon-hyun (Member, Author) commented Aug 31, 2022

  • Yes, only for better GCS support for users who build with -Phadoop-cloud.
  • Apache Spark distribution tar files don't use -Phadoop-cloud during our release process, so the published distribution tar files are not affected. The cost of the additional 32 MB falls only on the user side.

@dongjoon-hyun (Member, Author) commented

The failure is unrelated to this PR. It seems that the base image publishing is broken again. cc @Yikun

#33 ERROR: failed commit on ref "manifest-sha256:d7fdbdf2cdb51876ca0c22c2e3b1865b11e2058fd9a07e59f281f7596cb00956": unexpected status: 403 Forbidden
 > exporting to image:
ERROR: failed to solve: failed commit on ref "manifest-sha256:d7fdbdf2cdb51876ca0c22c2e3b1865b11e2058fd9a07e59f281f7596cb00956": unexpected status: 403 Forbidden
Error: buildx failed with: ERROR: failed to solve: failed commit on ref "manifest-sha256:d7fdbdf2cdb51876ca0c22c2e3b1865b11e2058fd9a07e59f281f7596cb00956": unexpected status: 403 Forbidden

@Yikun (Member) commented Sep 1, 2022

@dongjoon-hyun Thanks for pinging me. This is due to the GitHub Actions ghcr registry being unstable; you could retry to make it work.

By default, write permission is already included in your GITHUB_TOKEN, but if you set permissions manually (see also [1], [2]), it will fail. The current CI first builds the infra image and pushes it to your ghcr, so write permission is required (see also [3]); a minimal sketch follows the links below.

[1] https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/
[2] https://github.com/users/dongjoon-hyun/packages/container/apache-spark-ci-image/settings
[3] https://docs.google.com/document/d/1_uiId-U1DODYyYZejAZeyz2OAjxcnA-xfwjynDF6vd0
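(For illustration only, a minimal sketch of such a grant in a workflow file; the essential part is the packages: write permission for GITHUB_TOKEN.)

    # Hypothetical workflow excerpt: GITHUB_TOKEN needs "packages: write"
    # to push the built CI image to ghcr.io.
    permissions:
      contents: read
      packages: write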

@dongjoon-hyun (Member, Author) commented Sep 1, 2022

Thank you, but the Base Image Build phase has already failed three times. I'm double-checking the configurations.
[Screenshot: 2022-09-01 3:05 AM]

@Yikun (Member) commented Sep 1, 2022

@dongjoon-hyun I just saw that you recreated the spark repo, so maybe the default GitHub Actions permissions have changed?

You could first set the permissions for your dongjoon-hyun/spark repo: https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/#setting-the-default-permissions-for-the-organization-or-repository

and we might need a separate PR to set the Spark workflow permissions for the newly created repo: https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/#setting-permissions-in-the-workflow


@dongjoon-hyun (Member, Author) commented

I'm already allowing all of them.
[Screenshot: 2022-09-01 3:19 AM]

@dongjoon-hyun (Member, Author) commented

It's weird. IIRC, I didn't change anything in my previous repo either when your PR applied this change.

@Yikun (Member) commented Sep 1, 2022

Could you check this link for the permissions and the repo they belong to?

https://github.com/users/dongjoon-hyun/packages/container/package/apache-spark-ci-image/settings

or ghcr.io/dongjoon-hyun/apache-spark-ci-image


The potential issue might be that the old repo was removed but the related images were not deleted; then, when the new repo was created, the write permission on the old image was not configured for the new repo.

If it still doesn't work, you might need to delete the apache-spark-ci-image under your ghcr first:

curl -X DELETE -H "Accept: application/vnd.github+json" -H "Authorization: token $REPLACE_ME" https://api.github.com/user/packages/container/apache-spark-ci-image

@dongjoon-hyun (Member, Author) commented

  1. I checked that mine is the same as yours.

[Screenshot: 2022-09-01 7:47 AM]

  2. Let me try to clean it up:
curl -X DELETE -H "Accept: application/vnd.github+json" -H "Authorization: token $REPLACE_ME" https://api.github.com/user/packages/container/apache-spark-ci-image

@Yikun (Member) commented Sep 1, 2022

https://github.com/users/dongjoon-hyun/packages/container/apache-spark-ci-image/settings

You can also remove it on the page above; please let me know if it's still not working...

@sunchao (Member) left a comment

The change makes sense to me, especially since we are already bundling jars from other cloud vendors.

    <classifier>shaded</classifier>
    <exclusions>
      <exclusion>
        <groupId>*</groupId>
@sunchao (Member) commented on the diff:

Curious: why do we exclude everything from the shaded jar?

@dongjoon-hyun (Member, Author) commented:

Thank you for the review, @sunchao. According to the shading pattern,

https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/8453ce7ce7510e983bae7470909fbd02704c0539/gcs/pom.xml#L208-L363

we have everything we need for both Hadoop 3 and Hadoop 2.

I intentionally excluded everything. We can add Spark's own version of any transitive dependency that turns out to be missing.
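(For illustration, a minimal sketch of such a dependency block in a Maven pom; the coordinates and version are assumptions matching the hadoop3-2.2.7 shaded jar shown in the test output above, not the exact diff.)

    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>gcs-connector</artifactId>
      <version>hadoop3-2.2.7</version>
      <classifier>shaded</classifier>
      <exclusions>
        <!-- The shaded classifier already bundles relocated copies of its
             dependencies, so every declared transitive dependency is dropped. -->
        <exclusion>
          <groupId>*</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>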

@steveloughran (Contributor) commented:

The issue is that there's a history of the shaded connector still declaring dependencies on things which are now shaded, thus breaking convergence.

@dongjoon-hyun (Member, Author) commented

Thank you so much, @Yikun. It now seems to work on my three PRs.

@steveloughran (Contributor) left a comment

Looks fine.
Note that the gcs connector (at least the builds off their master) is Java 11 only; not sure where that stands w.r.t. older releases.

Actually, it might be nice to have the option of excluding the gcs connector. Why? Because I will end up commenting this code out in our internal builds, as it will come from Hadoop instead.


@dongjoon-hyun (Member, Author) commented Sep 1, 2022

Thank you for review, @steveloughran .

Note that the gcs connector (at least the builds off their master) is Java 11 only; not sure where that stands w.r.t. older releases.

I didn't notice this because I've been using Java 11+. In that case, I had better close this PR and the JIRA officially.

Thank you, @srowen , @sunchao and @steveloughran !

@dongjoon-hyun (Member, Author) commented

Er, wait, @steveloughran. It's Java 8, isn't it?

https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/8453ce7ce7510e983bae7470909fbd02704c0539/pom.xml#L76-L77

    <build.java.source.version>8</build.java.source.version>
    <build.java.target.version>8</build.java.target.version>

@dongjoon-hyun dongjoon-hyun reopened this Sep 1, 2022

@steveloughran (Contributor) commented:

3.0.0 is Java 11:

    <build.java.source.version>11</build.java.source.version>
    <build.java.target.version>11</build.java.target.version>

That is hadoop-3.3.1, which simplifies my life a lot too... the streams even support IOStatistics. I've been testing my manifest committer through it and had to bump the test shell up to Java 11 for everything to work.

@steveloughran (Contributor) commented:

Anyway, the version you are looking at is probably safe; it switched in Feb 2022 (PR 726).

@dongjoon-hyun (Member, Author) commented Sep 2, 2022

Yeah, and there is no 3.0.0 release yet. :)
[Screenshot: 2022-09-02 10:59 AM]

@dongjoon-hyun (Member, Author) commented

Thank you for your reviews, comments, and help, @srowen, @Yikun, @sunchao, @steveloughran. Since there seem to be no other concerns, I'll merge this.
