[HUDI-2440] Add dependency change diff script for dependency governace#3674
yanghua merged 4 commits into apache:master
I have received many reports of dependency conflicts around Hudi, e.g. here and in the Chinese WeChat group. Hudi depends on many Hadoop ecosystem components, so IMO it's time to do dependency governance. In this PR, I provide a dependency utility script that can be used to search/diff the dependencies of bundles when contributors change them. In addition, I have pre-generated a dependency list for some bundles.
Usage:
# use the -r option to replace the old file when we introduce new dependencies
./scripts/dependency.sh -p hudi-utilities-bundle_2.11 -r
I have two suggestions:
WDYT? @vinothchandar @xushiyan
xushiyan
left a comment
@yanghua I see the point here is to let PR reviewers easily identify dependency changes. One piece of feedback on the script: it turns out to contain quite a lot of logic and is not friendly to maintainers who are unfamiliar with bash. Can we leverage existing Maven commands? e.g.
mvn dependency:tree -pl packaging/hudi-spark-bundle -DoutputFile=/tmp/hudi-spark-bundle.deptree.txt
scripts/dependency.sh
Outdated
classifier_start_index=length(artifact_id"-"version"-") + 1;
classifier_end_index=index(jar_name, ".jar") - 1;
classifier=substr(jar_name, classifier_start_index, classifier_end_index - classifier_start_index + 1);
print artifact_id"/"version"/"classifier"/"jar_name
seeing double / before jar_name.
Shall we follow an existing convention to show the dependencies? For example:
https://search.maven.org/artifact/org.apache.hudi/hudi-flink-bundle_2.12/0.9.0/jar
the identifier pattern is group:artifact:version
> seeing double / before jar_name.
The field between the two slashes holds the classifier of the Maven dependency. If no classifier is configured, it's an empty string, which is why a double / appears.
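To make the empty-classifier case concrete, here is a minimal sketch that reuses the awk expressions from the diff above. The wrapper function `format_dep` and the sample jar names are assumptions for illustration only; they are not taken from `scripts/dependency.sh`.

```shell
# Hypothetical wrapper around the awk expressions shown in the diff.
# Usage: format_dep ARTIFACT_ID VERSION JAR_NAME
format_dep() {
  awk -v artifact_id="$1" -v version="$2" -v jar_name="$3" 'BEGIN {
    # First character after "<artifact_id>-<version>-"
    classifier_start_index = length(artifact_id "-" version "-") + 1
    # Last character before the ".jar" suffix
    classifier_end_index = index(jar_name, ".jar") - 1
    # With no classifier the requested length is negative and substr() yields ""
    classifier = substr(jar_name, classifier_start_index,
                        classifier_end_index - classifier_start_index + 1)
    print artifact_id "/" version "/" classifier "/" jar_name
  }'
}

# Sample input without a classifier: the third field is empty, hence "//".
format_dep guava 30.1-jre guava-30.1-jre.jar
# prints: guava/30.1-jre//guava-30.1-jre.jar

# Sample input with a classifier.
format_dep netty-transport-native-epoll 4.1.50.Final \
  netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar
# prints: netty-transport-native-epoll/4.1.50.Final/linux-x86_64/netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar
```

Keeping the empty field means every line has the same number of `/`-separated columns, which can make the list easier to parse later.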
> shall we follow some existing convention to show the dependences? for example
> https://search.maven.org/artifact/org.apache.hudi/hudi-flink-bundle_2.12/0.9.0/jar
> the identifier pattern is group:artifact:version
You can see that the dependencies are ordered by artifact, right? IMO, that makes it easier to distinguish different versions of the same artifact, and a view ordered by artifact reads more naturally. Generally, people pay more attention to the artifact than to the group, right?
But yes, I agree that it's better to add the group information.
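One way to combine both views (group information included, but still ordered by artifact) could look like the sketch below. The function name `sort_by_artifact` and the sample coordinates are made up for illustration, and the version column is compared as plain text, not as a semantic version.

```shell
# Hypothetical helper: read group:artifact:version lines on stdin and
# order them by artifact (field 2), then by version (field 3).
sort_by_artifact() {
  LC_ALL=C sort -t: -k2,2 -k3,3
}

# Sample coordinates, not taken from any real bundle.
printf '%s\n' \
  'org.example.zeta:beta-client:1.2.3' \
  'com.example.alpha:gamma-core:2.0.0' \
  'org.example.zeta:alpha-utils:0.9.0' \
  'org.example.other:beta-client:1.1.0' | sort_by_artifact
# prints:
# org.example.zeta:alpha-utils:0.9.0
# org.example.other:beta-client:1.1.0
# org.example.zeta:beta-client:1.2.3
# com.example.alpha:gamma-core:2.0.0
```

With this ordering, two versions of the same artifact end up on adjacent lines even when they come from different groups.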
If I get the intention right, we can also make the task (diffing the dependencies) part of a GitHub Actions job, which builds the project anyway. If the GA job detects new dependencies in a PR, we can make it report back on the PR in some way. This job deserves automation in essence.
edits:
OK, just read your suggestions, so we're aligned on running this in GA. That sounds good. Need to explore the implementation a bit, though.
To the 2nd point, yeah, agree we should clean up the dependency tree once and then screen changes with a CI process.
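As a rough sketch of what such a CI check might do, the snippet below lists entries that appear in a freshly generated dependency list but not in the committed one. The function `new_deps`, the file names, and the sample entries are all assumptions; how the lists are produced and how the result is reported back on the PR is left open.

```shell
# Hypothetical helper: print lines present in NEW_LIST but not in OLD_LIST.
# Usage: new_deps OLD_LIST NEW_LIST
new_deps() {
  old_sorted="$(mktemp)"
  new_sorted="$(mktemp)"
  # comm requires both inputs to be sorted in the same collation order.
  LC_ALL=C sort "$1" > "$old_sorted"
  LC_ALL=C sort "$2" > "$new_sorted"
  # -1 drops lines unique to the old list, -3 drops lines common to both.
  LC_ALL=C comm -13 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}

# Sample lists with made-up entries.
workdir="$(mktemp -d)"
printf '%s\n' \
  'lib-a/1.0.0//lib-a-1.0.0.jar' \
  'lib-b/2.1.0//lib-b-2.1.0.jar' > "$workdir/committed.txt"
printf '%s\n' \
  'lib-a/1.0.0//lib-a-1.0.0.jar' \
  'lib-b/2.1.0//lib-b-2.1.0.jar' \
  'lib-c/0.3.0//lib-c-0.3.0.jar' > "$workdir/generated.txt"

added="$(new_deps "$workdir/committed.txt" "$workdir/generated.txt")"
if [ -n "$added" ]; then
  echo "New dependencies detected:"
  echo "$added"
fi
```

A CI job could fail or post a comment when `added` is non-empty; removed or upgraded dependencies would need a similar check in the other direction (`comm -23`).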
@xushiyan Thanks for sharing your thoughts. Let's discuss some points.
Yes, that's one of the purposes. Another is to give contributors, developers, and users a way to view the dependencies of those bundles when they run into conflict problems. So I output them into the codebase, just like Kyuubi has done.
IMO, it's a tool just like the other tools in the codebase. We do not need to spend much time changing or maintaining it, and we can add more comments and a better usage guide. It's a first version, meant to gather input from you guys.
Because it's easier and more readable.
@xushiyan I have addressed some of the suggestions. Any new input?