
[SPARK-49243][INFRA][K8S] Add OpenContainers Annotations in docker images#47766

Closed
dongjoon-hyun wants to merge 2 commits into apache:master from dongjoon-hyun:SPARK-49243

Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Aug 15, 2024

What changes were proposed in this pull request?

This PR aims to add OpenContainers annotations to the Docker images in order to provide metadata such as the Apache License.

Why are the changes needed?

BEFORE

$ docker inspect apache/spark:3.5.2 | jq '.[0].Config.Labels'
{
  "org.opencontainers.image.ref.name": "ubuntu",
  "org.opencontainers.image.version": "20.04"
}

AFTER

$ NO_MANUAL=1 ./dev/make-distribution.sh -Pkubernetes
$ bin/docker-image-tool.sh -t SPARK-49243 -n build
$ docker inspect spark:SPARK-49243 | jq '.[0].Config.Labels'
{
  "org.opencontainers.image.authors": "Apache Spark project <dev@spark.apache.org>",
  "org.opencontainers.image.licenses": "Apache-2.0",
  "org.opencontainers.image.ref.name": "Apache Spark Scala/Java Image",
  "org.opencontainers.image.version": ""
}
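
The labels above are produced by `LABEL` instructions in the image Dockerfiles. A minimal sketch of the kind of change involved (the base image tag and exact placement are assumptions, not the verbatim diff):

```dockerfile
# Sketch: OpenContainers annotations near the top of a Spark image Dockerfile.
FROM ubuntu:22.04
LABEL org.opencontainers.image.authors="Apache Spark project <dev@spark.apache.org>"
LABEL org.opencontainers.image.licenses="Apache-2.0"
LABEL org.opencontainers.image.ref.name="Apache Spark Scala/Java Image"
# Intentionally blank: masks the version label inherited from the base image.
LABEL org.opencontainers.image.version=""
```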

Does this PR introduce any user-facing change?

No, it's only a metadata change.

How was this patch tested?

Manual review.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft August 15, 2024 04:13
@dongjoon-hyun dongjoon-hyun force-pushed the SPARK-49243 branch 2 times, most recently from 35f93f2 to e34203c on August 15, 2024 04:19
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review August 15, 2024 04:26
@dongjoon-hyun
Member Author

Could you review this PR, @viirya ?

Member

@viirya viirya left a comment

Looks good to me. Thanks @dongjoon-hyun

@dongjoon-hyun
Member Author

Thank you, @viirya !


# Image for building Spark releases. Based on Ubuntu 22.04.
FROM ubuntu:jammy-20240227
LABEL org.opencontainers.image.authors="Apache Spark project <dev@spark.apache.org>"
Member

Apache Spark Community?

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2024

Apache Spark Community?

Isn't it too broad, because it includes read-only users?

Member

I believe it's acceptable to include read-only users because non-code contributions, such as testing and sharing use cases, should also be acknowledged.
The User also plays a role in contributing according to 'The Apache Way'.

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2024

I believe it's acceptable to include read-only users because non-code contributions, such as testing and sharing use cases, should also be acknowledged. The User also plays a role in contributing according to 'The Apache Way'.

I got your point. It seems there was a slight misunderstanding between us.

  • When I say read-only users, I include even users who didn't contribute in any way as part of the Apache Spark Community.

The point is that my concept of the Community is broader than yours.

The next question is:

  • Are contributors authors?

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2024

One more thing. I used the term project as the author in this sense:

An open source project includes all aspects of creating, maintaining, and distributing open source software including community building and mentoring, communication, the release process, and everything in between.

Member

Shall we capitalize 'project' to make it consistent with the others?

Member Author

Did you perhaps mean 'Apache Spark' by 'the others'? It's possible.

However, technically, Apache Spark™ is a trademark and the unique name of our project, while project is a common noun.



Here is a sample sentence according to the ASF guidelines. Unlike Apache Spark, a common noun like software is written in lower case. You can also see the phrase 'the Apache Foo project' in the same document as an example.
https://www.apache.org/foundation/marks/#guidelines

"Free copies of Apache ProjectName software under the Apache License and support services for Apache ProjectName are available at my own company website."

Member

Oh, I meant capitalizing 'Project' in 'Apache Spark Project', like the 'S' in Scala, the 'I' in Image, etc. But it's just a nit.

LABEL org.opencontainers.image.authors="Apache Spark project <dev@spark.apache.org>"
LABEL org.opencontainers.image.licenses="Apache-2.0"
LABEL org.opencontainers.image.ref.name="Apache Spark Release Manager Image"
LABEL org.opencontainers.image.version=""
Member

Can it be removed if it's blank?

Member Author

Ya, I tried, but we need a blank value here to override the previous OS layer's version, as described in the PR description.

$ docker inspect apache/spark:3.5.2 | jq '.[0].Config.Labels'
{
  "org.opencontainers.image.ref.name": "ubuntu",
  "org.opencontainers.image.version": "20.04"
}

Member Author

@dongjoon-hyun dongjoon-hyun Aug 15, 2024

Without this blank, the final information could be interpreted as Apache Spark 20.04.
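
The inheritance behavior at play can be sketched as follows (a minimal illustration, not the actual Dockerfile):

```dockerfile
# Docker child images inherit the parent's labels, so without an explicit
# override, the ubuntu base's labels would survive into the Spark image:
#   org.opencontainers.image.ref.name=ubuntu
#   org.opencontainers.image.version=20.04
FROM ubuntu:20.04
LABEL org.opencontainers.image.ref.name="Apache Spark Scala/Java Image"
# Re-declaring the key with an empty value masks the inherited "20.04",
# so the labels can no longer be misread as "Apache Spark 20.04".
LABEL org.opencontainers.image.version=""
```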

Member

Got it. Thanks for the explanation.
Maybe we can add some comment here.

Member Author

Sure. I will add comments, @yaooqinn .

Ya, I agree that the AS-IS is obscure.

So, I'm currently trying to improve this area, including bin/docker-image-tool.sh, via version detection from RELEASE. For the K8s images, I think I can deliver a better way to fill this field. But for this spark-rm/Dockerfile, I don't think it's worth doing.
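
A hypothetical sketch of what such version detection could look like (the RELEASE file name and its first-line format are assumptions based on Spark distribution tarballs, not this PR's implementation):

```shell
# Hypothetical sketch: derive the version label from the first line of the
# RELEASE file that make-distribution.sh writes into the distribution root,
# assumed to look like: "Spark 4.0.0-SNAPSHOT built for Hadoop 3.4.0"
RELEASE_LINE="Spark 4.0.0-SNAPSHOT built for Hadoop 3.4.0"
SPARK_VERSION=$(printf '%s\n' "$RELEASE_LINE" | sed -n 's/^Spark \([^ ]*\) .*/\1/p')
echo "$SPARK_VERSION"   # prints "4.0.0-SNAPSHOT"

# The detected value could then be passed to the image build, e.g.:
#   docker build --label org.opencontainers.image.version="$SPARK_VERSION" ...
```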

Member

via version detection from RELEASE

+1

FROM ${base_img:-spark}
LABEL org.opencontainers.image.authors="Apache Spark project <dev@spark.apache.org>"
LABEL org.opencontainers.image.licenses="Apache-2.0"
LABEL org.opencontainers.image.ref.name="Apache Spark Python Image"
Member

PySpark?

Member Author

Ya, I also considered Apache Spark PySpark Image and Apache Spark SparkR Image.

But, I believe Python would be better because we have Apache Spark Scala/Java Image as a base image.

In addition, I'm not sure, but an Apache Spark Go Image may come later. So, for consistency, language names could extend well.

Member Author

Also, please see our Docker Hub. The naming convention there uses language names like scala, r, and python3 instead of SparkR or PySpark.


Member

I was considering whether it would conflict with https://pypi.org/project/pyspark-connect/ or not.

But it looks fine to me now.

@dongjoon-hyun
Member Author

Thank you for the thorough review, @yaooqinn.

I addressed your comments by adding a comment line, and replied to the discussion points here.

Please let me know your opinion.

Member

@yaooqinn yaooqinn left a comment

LGTM

@dongjoon-hyun
Member Author

Thank you! Merged to master.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-49243 branch August 15, 2024 15:37