[improvement](iceberg) remove extra spark-shell bootstrap #61649
morningman merged 1 commit into apache:master
Pull request overview
This PR improves the Iceberg docker bootstrap by removing the extra spark-shell startup path and consolidating deletion-vector seed data into the existing aggregated spark-sql bootstrap flow to reduce environment startup cost.
Changes:
- Removed the `spark-shell` bootstrap execution of Scala scripts from the Iceberg container entrypoint.
- Moved deletion-vector seed data generation into a new aggregated Spark SQL script (`run28.sql`).
- Deleted the unused Scala bootstrap script (`run01.scala`).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl | Removes spark-shell bootstrap block; continues using aggregated spark-sql execution. |
| docker/thirdparties/docker-compose/iceberg/scripts/create_preinstalled_scripts/iceberg/run28.sql | Adds deletion-vector seed data creation/deletes into the aggregated SQL bootstrap. |
| docker/thirdparties/docker-compose/iceberg/scripts/create_preinstalled_scripts/iceberg_scala/run01.scala | Removes the now-unused Scala bootstrap script. |
START_TIME3=$(date +%s)
find /mnt/scripts/create_preinstalled_scripts/iceberg_load -name '*.sql' | sed 's|^|source |' | sed 's|$|;|' > iceberg_load_total.sql
spark-sql --master spark://doris--spark-iceberg:7077 --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions -f iceberg_load_total.sql
`find ... -name '*.sql'` does not guarantee a stable ordering. Several existing `runXX.sql` scripts appear to rely on earlier scripts having already run (e.g. `run27.sql` only runs `use demo.test_db;` and doesn't create the DB), so a non-deterministic execution order can cause intermittent bootstrap failures or objects being created in the wrong database. Please sort the file list before generating the aggregated `*_total.sql` (and apply the same fix to the other `find ... > *_total.sql` pipelines in this script).
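One way the suggested fix could look is sketched below. The helper name `build_total_sql` and its two arguments are hypothetical and do not come from the PR; the sketch only adds a `sort` step to the existing `find | sed | sed` pipeline. It assumes the script names are zero-padded (`run01.sql` ... `run28.sql`), so a plain byte-wise sort matches the intended numeric order.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR): build an aggregated
# "source <file>;" script with a deterministic file order.
#   $1 = directory containing the runXX.sql scripts
#   $2 = path of the aggregated output file
build_total_sql() {
    local script_dir="$1"
    local out_file="$2"

    # LC_ALL=C makes the sort byte-wise and locale-independent.
    # Assumption: names are zero-padded, so lexical order == run order.
    find "$script_dir" -name '*.sql' \
        | LC_ALL=C sort \
        | sed 's|^|source |' \
        | sed 's|$|;|' > "$out_file"
}
```

Under these assumptions the call site in the entrypoint might become `build_total_sql /mnt/scripts/create_preinstalled_scripts/iceberg_load iceberg_load_total.sql`. If the scripts were not zero-padded, a version-aware sort would be needed instead.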
run buildall
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
## What
- Remove the extra spark-shell bootstrap from the spark-iceberg container startup
- Move deletion-vector seed data into a new aggregated spark-sql script
- Delete the unused scala bootstrap script

## Why
- External regression pays this spark-iceberg startup cost on every environment bootstrap
- Keeping the deletion-vector seed data inside the existing spark-sql batch avoids one extra heavyweight Spark session

## Testing
- `bash -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl`
- Not run docker-based regression tests in this environment

(cherry picked from commit 400cc6b)
…61649 (#61748)

### What problem does this PR solve?

Issue Number: None

Related PR: #61649

Problem Summary: Cherry-pick #61649 to branch-4.1. This removes the extra spark-shell bootstrap from the Iceberg docker setup, switches initialization to a preinstalled SQL script, and drops the redundant Scala bootstrap script.

### Release note

None

### Check List (For Author)

- Test: No need to test (docker bootstrap script backport); verified `bash -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl` and `git diff --check HEAD^ HEAD`
- Behavior changed: Yes (Iceberg docker bootstrap no longer starts an extra spark-shell session)
- Does this need documentation: No
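The `bash -n` check mentioned above parses a script without executing it, so it catches syntax errors but says nothing about runtime behavior. A minimal sketch of running that check over several files is shown here; the function name `check_shell_syntax` is hypothetical and not part of the PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the PR): run `bash -n` on each file
# given as an argument and report the ones that fail to parse.
# Returns 0 if every file parses, 1 otherwise.
check_shell_syntax() {
    local status=0
    local f
    for f in "$@"; do
        # -n reads and parses commands without executing them.
        if ! bash -n "$f" 2>/dev/null; then
            echo "syntax error: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_shell_syntax docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl` would mirror the single-file check listed in the commit message.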