[SPARK-57130][BUILD] make-distribution.sh copies only git-tracked files for python#56186
[SPARK-57130][BUILD] make-distribution.sh copies only git-tracked files for python#56186pan3793 wants to merge 2 commits into
make-distribution.sh copies only git-tracked files for python#56186Conversation
| cp "$SPARK_HOME/README.md" "$DISTDIR" | ||
| cp -r "$SPARK_HOME/bin" "$DISTDIR" | ||
| cp -r "$SPARK_HOME/python" "$DISTDIR" | ||
| if command -v git && command -v cpio && git rev-parse --git-dir 2>/dev/null; then |
There was a problem hiding this comment.
To keep the console clean, how about redirecting the stdout/stderr of command -v to /dev/null?
There was a problem hiding this comment.
it follows the existing command -v pattern, e.g., in line 128 if [ $(command -v git) ]; then
There was a problem hiding this comment.
Got it. Let's fix all the occurrences of such pattern in the separate PR if we should do it.
| cp -r "$SPARK_HOME/bin" "$DISTDIR" | ||
| cp -r "$SPARK_HOME/python" "$DISTDIR" | ||
| if command -v git && command -v cpio && git rev-parse --git-dir 2>/dev/null; then | ||
| git ls-files -z "$SPARK_HOME/python" | cpio -pdm "$DISTDIR" |
There was a problem hiding this comment.
I'm concerned about whether cpio behaves the same way in GNU and BSD. Did you confirm this script works on macOS and Linux?
There was a problem hiding this comment.
my testing (also confirmed by asking LLM) shows the BSD cpio has different behavior when involving symlinks, but I think it does not affect the make-distribution.sh case
There was a problem hiding this comment.
I tried your change and I noticed:
- On Linux environment,
cpioonly copies.coveragercto$DISTDIR/python.-0option seems required, which should work withcpioon macOS. - If we run
make-distribution.shon a directory other than$SPARK_HOME, files underpythonare't copied to$DISTDIR.
| git ls-files -z "$SPARK_HOME/python" | cpio -pdm "$DISTDIR" | |
| (cd "$SPARK_HOME" && git ls-files -z python | cpio -0pdm "$DISTDIR") |
There was a problem hiding this comment.
@sarutak thanks for checking, yes, -0 should be used
Operation modifiers valid in copy-out and copy-pass modes:
-0, --null Filenames in the list are delimited by null
but I think cd is not required as it runs cd "$SPARK_HOME" in lien 169, I tested it by running the make-distribution.sh under $SPARK_HOME/common, it works fine, am I missing something?
There was a problem hiding this comment.
Sorry. It was a misunderstanding on my part about cd $SPARK_HOME. Adding -0 option to cpio is enough.
…iles for python ### What changes were proposed in this pull request? `make-distribution.sh` copies only git-tracked files for `python` folder, when `git` and `cpio` commands are available and under a git repo, instead of raw `cp`. ### Why are the changes needed? I find that sometimes `make-distribution.sh` produces an unreasonably large tarball because it copies the entire `python` folder to the `dist` directory, which may contain generated files, e.g., compiled PySpark docs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Run `dev/make-distribution.sh` manually. Also tested the performance of the new command, on macOS, `cpio` is slightly slower than raw `cp`, but good enough. ``` $ time git ls-files -z "$PWD/python" | cpio -0pdm "target" 42452 blocks git ls-files -z "$PWD/python" 0.01s user 0.01s system 76% cpu 0.027 total cpio -0pdm "target" 0.05s user 1.10s system 77% cpu 1.480 total $ rm -rf target/python $ time cp -r "$PWD/python" "target" cp -r "$PWD/python" "target" 0.02s user 0.56s system 78% cpu 0.731 total ``` on Linux, `cpio` is faster ``` $ time git ls-files -z "$PWD/python" | cpio -0pdm "target" 46385 blocks git ls-files -z "$PWD/python" 0.01s user 0.01s system 81% cpu 0.022 total cpio -0pdm "target" 0.05s user 1.02s system 84% cpu 1.260 total $ rm -rf target/python $ time cp -r "$PWD/python" "target" cp -r "$PWD/python" "target" 0.02s user 0.57s system 73% cpu 0.807 total ``` ### Was this patch authored or co-authored using generative AI tooling? Generated-by: DeepSeek V4 Pro. Closes #56186 from pan3793/SPARK-57130. Authored-by: Cheng Pan <pan3793@gmail.com> Signed-off-by: Cheng Pan <chengpan@apache.org> (cherry picked from commit 319dc6e) Signed-off-by: Cheng Pan <chengpan@apache.org>
|
thanks for review, merged to master/4.x |
What changes were proposed in this pull request?
make-distribution.shcopies only git-tracked files forpythonfolder, whengitandcpiocommands are available and under a git repo, instead of rawcp.Why are the changes needed?
I find that sometimes
make-distribution.shproduces an unreasonably large tarball because it copies the entirepythonfolder to thedistdirectory, which may contain generated files, e.g., compiled PySpark docs.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Run
dev/make-distribution.shmanually.Also tested the performance of the new command, on macOS,
cpiois slightly slower than rawcp, but good enough.on Linux,
cpiois fasterWas this patch authored or co-authored using generative AI tooling?
Generated-by: DeepSeek V4 Pro.