Skip to content

Conversation

@mengCareers
Copy link
Contributor

@mengCareers mengCareers commented Jan 12, 2020

What changes were proposed in this pull request?

Enable dependency audit files to tell the value of artifact id, version, and classifier of a dependency.

For example, avro-mapred-1.8.2-hadoop2.jar should be expanded to avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar where avro-mapred is the artifact id, 1.8.2 is the version, and haddop2 is the classifier.

Why are the changes needed?

Dependency audit files are expected to be consumed by automated tests or downstream tools.

However, current dependency audit files under dev/deps only show jar names. And there isn't a simple rule on how to parse the jar name to get the values of different fields. For example, hadoop2 is the classifier of avro-mapred-1.8.2-hadoop2.jar, in contrast, incubating is the version of htrace-core-3.1.0-incubating.jar.

Reference: There is a good example of the downstream tool that would be enabled as @yhuai suggested,

Say we have a Spark application that depends on a third-party dependency foo, which pulls in jackson as a transient dependency. Unfortunately, foo depends on a different version of jackson than Spark. So, in the pom of this Spark application, we use the dependency management section to pin the version of jackson. By doing this, we are lifting jackson to the top-level dependency of my application and I want to have a way to keep tracking what Spark uses. What we can do is to cross-check my Spark application's classpath with what Spark uses. Then, with a test written in my code base, whenever my application bumps Spark version, this test will check what we define in the application and what Spark has, and then remind us to change our application's pom if needed. In my case, I am fine to directly access git to get these audit files.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Code changes are verified by generated dependency audit files naturally. Thus, there are no tests added.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30491] Enable dependency audit files to tell dependency classifier [SPARK-30491][2.4] Enable dependency audit files to tell dependency classifier Jan 12, 2020
@dongjoon-hyun
Copy link
Member

Thank you for your first contribution, @mengCareers .

@dongjoon-hyun
Copy link
Member

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-30491][2.4] Enable dependency audit files to tell dependency classifier [SPARK-30491][INFRA][2.4] Enable dependency audit files to tell dependency classifier Jan 12, 2020
@SparkQA
Copy link

SparkQA commented Jan 13, 2020

Test build #116571 has finished for PR 27178 at commit 2f7b573.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengCareers
Copy link
Contributor Author

@dongjoon-hyun @gengliangwang May I adjust the comments for this PR first? If this PR is finalized, I would apply the same changes for #27177, and then ask you for a final review.
Thank you for the review!

@SparkQA
Copy link

SparkQA commented Jan 15, 2020

Test build #116735 has finished for PR 27178 at commit 4738b80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 15, 2020

Test build #116743 has finished for PR 27178 at commit 8fbb565.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 15, 2020

Test build #116794 has finished for PR 27178 at commit fe08fd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @mengCareers . It looks good to me. Could you update your PR on master branch now?
We need to finalize and merge on master branch first. Also, we are in the middle of vote on Apache Spark 2.4.5. So, branch-2.4 is currently locked for this type of patch backporting.

@yhuai
Copy link
Contributor

yhuai commented Jan 15, 2020

@dongjoon-hyun how does voting for 2.4.5 affect this since it will not be part of the artifact?

@srowen
Copy link
Member

srowen commented Jan 15, 2020

It could in theory break the build, so matters w.r.t. managing an RC process. But indeed this won't affect any functionality of Spark, so would be easy enough to merge for a next RC if necessary, or after 2.4.5 is finalized.

@mengCareers
Copy link
Contributor Author

My PR against the master branch has been updated.

@dongjoon-hyun
Copy link
Member

Since RC1 vote is cancelled with several feedbacks, I'll merge this for next RC. Thank you all.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request Jan 16, 2020
…dency classifier

### What changes were proposed in this pull request?
Enable dependency audit files to tell the value of artifact id, version, and classifier of a dependency.

For example, `avro-mapred-1.8.2-hadoop2.jar` should be expanded to `avro-mapred/1.8.2/hadoop2/avro-mapred-1.8.2-hadoop2.jar` where `avro-mapred` is the artifact id, `1.8.2` is the version, and `haddop2` is the classifier.

### Why are the changes needed?
Dependency audit files are expected to be consumed by automated tests or downstream tools.

However, current dependency audit files under `dev/deps` only show jar names. And there isn't a simple rule on how to parse the jar name to get the values of different fields. For example, `hadoop2` is the classifier of `avro-mapred-1.8.2-hadoop2.jar`, in contrast, `incubating` is the version of `htrace-core-3.1.0-incubating.jar`.

Reference: There is a good example of the downstream tool that would be enabled as yhuai suggested,

> Say we have a Spark application that depends on a third-party dependency `foo`, which pulls in `jackson` as a transient dependency. Unfortunately, `foo` depends on a different version of `jackson` than Spark. So, in the pom of this Spark application, we use the dependency management section to pin the version of `jackson`. By doing this, we are lifting `jackson` to the top-level dependency of my application and I want to have a way to keep tracking what Spark uses. What we can do is to cross-check my Spark application's classpath with what Spark uses. Then, with a test written in my code base, whenever my application bumps Spark version, this test will check what we define in the application and what Spark has, and then remind us to change our application's pom if needed. In my case, I am fine to directly access git to get these audit files.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Code changes are verified by generated dependency audit files naturally. Thus, there are no tests added.

Closes #27178 from mengCareers/depsOptimize2.4.

Lead-authored-by: Xinrong Meng <meng.careers@gmail.com>
Co-authored-by: mengCareers <meng.careers@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants