Skip to content

[HUDI-2440] Add dependency change diff script for dependency governace#3674

Merged
yanghua merged 4 commits intoapache:masterfrom
yanghua:HUDI-2440
Sep 30, 2021
Merged

[HUDI-2440] Add dependency change diff script for dependency governace#3674
yanghua merged 4 commits intoapache:masterfrom
yanghua:HUDI-2440

Conversation

@yanghua
Copy link
Contributor

@yanghua yanghua commented Sep 16, 2021

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@yanghua yanghua marked this pull request as draft September 16, 2021 08:14
@yanghua yanghua changed the title [HUDI-2440] Add dependency change diff script for dependency governace [WIP][HUDI-2440] Add dependency change diff script for dependency governace Sep 16, 2021
@hudi-bot
Copy link
Collaborator

hudi-bot commented Sep 16, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run travis re-run the last Travis build
  • @hudi-bot run azure re-run the last Azure build

@yanghua yanghua changed the title [WIP][HUDI-2440] Add dependency change diff script for dependency governace [HUDI-2440] Add dependency change diff script for dependency governace Sep 23, 2021
@yanghua yanghua marked this pull request as ready for review September 23, 2021 08:27
@yanghua
Copy link
Contributor Author

yanghua commented Sep 23, 2021

I received many reports about dependency conflict around Hudi, e.g. here and Chinese WeChat group. Considering Hudi depends on so many Hadoop ecosystem components. IMO, it's time to do the dependency governance.

In this PR, I provided a dependency utility script that can be used to search/diff dependencies of bundles when the contributors change the dependencies. In addition, I have pre-generated a dependency list for some bundles.

Usage:

# use -r option to replace the old file when we introduce new dependencies
./scripts/dependency.sh -p hudi-utilities-bundle_2.11 -r

I have two suggestions:

  • build a Github workflow to check the dependency diff, and reviewers should consider any new dependency seriously before merging.
  • based on the pre-generated dependency file to do the dependency governance, you can see there are some dependencies that exist in multiple versions for one bundle.

WDYT? @vinothchandar @xushiyan

@xushiyan xushiyan self-assigned this Sep 26, 2021
Copy link
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanghua i see the point here is to allow PR reviewer easily identify dep changes. One feedback i have for the script is: it turns out to have quite some logic and not friendly to maintainers who are not familiar with bash. Can we leverage existing commands from maven? e.g.

mvn dependency:tree -pl packaging/hudi-spark-bundle -DoutputFile=/tmp/hudi-spark-bundle.deptree.txt

classifier_start_index=length(artifact_id"-"version"-") + 1;
classifier_end_index=index(jar_name, ".jar") - 1;
classifier=substr(jar_name, classifier_start_index, classifier_end_index - classifier_start_index + 1);
print artifact_id"/"version"/"classifier"/"jar_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seeing double / before jar_name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we follow some existing convention to show the dependences? for example

https://search.maven.org/artifact/org.apache.hudi/hudi-flink-bundle_2.12/0.9.0/jar

the identifier pattern is group:artifact:version

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seeing double / before jar_name.

It is used to extract the classifier value of the maven dependency. If not configured, it's an empty string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we follow some existing convention to show the dependences? for example

https://search.maven.org/artifact/org.apache.hudi/hudi-flink-bundle_2.12/0.9.0/jar

the identifier pattern is group:artifact:version

You can find the dependency is ordered by the artifact, right? IMO, it's easier to distinguish the different versions of one artifact. And view based on order artifacts is more suitable for human sense. Generally, People pay more attention to artifacts than group, right?

But, yes, I agree that it's better to add group information.

Copy link
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i get the intention right, we can also make the task (diff the dependencies) part of GitHub actions job, which is about building projects anyway. If GA job detects new dependencies in PR, we can make it report back on the PR, in some way. This job deserves automation in essence.

edits:

ok just read your suggestions so we're aligned on running this in GA. that sounds good. need to explore a bit on the implementation though.

To the 2nd point, yea agree we should clean up dependency tree for once and screen the changes with a CI process.

@yanghua
Copy link
Contributor Author

yanghua commented Sep 27, 2021

@xushiyan Thanks for sharing your thoughts. Let's discuss some points.

i see the point here is to allow PR reviewer easily identify dep changes.

Yes, that's one of the purposes. Another one is to let the contributors or developers/users have a way to know and view the dependencies of those bundles if that way(meet conflict problems). So I output them into the codebase. Just like the Kyuubi has done.

it turns out to have quite some logic and not friendly to maintainers who are not familiar with bash

IMO, It's a tool just like other tools in the codebase. We do not need to spend much time to change or maintain it. And we can add more comments and a better usage guide. It's the first version in order to receive other inputs from you guys.

why dependency:build-classpath not dependency:tree?

Because it's easier and more readable.

build-classpath looks like ls behavior and flatted to a dependencies list. However, tree contains nested | \ + - right? It's harder to deal with.

@yanghua
Copy link
Contributor Author

yanghua commented Sep 30, 2021

@xushiyan I have addresses some suggestions. Any new inputs?

Copy link
Member

@xushiyan xushiyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanghua Looks good. As you suggested, we need a follow-up JIRA for adding GA task, and another for cleaning up the current dependency lists?

@yanghua yanghua merged commit 47ed917 into apache:master Sep 30, 2021
prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Jan 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants