[improvement](iceberg) remove extra spark-shell bootstrap #61649

Merged
morningman merged 1 commit into apache:master from xylaaaaa:codex_iceberg_external_startup_pr
Mar 25, 2026
Conversation

@xylaaaaa
Contributor

@xylaaaaa xylaaaaa commented Mar 24, 2026

What

  • Remove the extra spark-shell bootstrap from the spark-iceberg container startup
  • Move deletion-vector seed data into a new aggregated spark-sql script
  • Delete the unused scala bootstrap script

Why

  • External regression pays this spark-iceberg startup cost on every environment bootstrap
  • Keeping the deletion-vector seed data inside the existing spark-sql batch avoids one extra heavyweight Spark session

Testing

  • bash -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl
  • Docker-based regression tests were not run in this environment
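
For context on what the first check covers: bash -n only parses a script and does not execute it, so it catches syntax errors but not failures that appear at run time. A minimal sketch of that distinction follows. The two throwaway scripts and the command name no_such_command_for_demo are invented for illustration and are not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch: bash -n parses a script without executing any of it.
# good.sh and bad.sh are hypothetical demo files, not files from this PR.
tmp_dir="$(mktemp -d)"
good="${tmp_dir}/good.sh"
bad="${tmp_dir}/bad.sh"

# Syntactically valid, but would fail at run time (the command does not exist).
printf '%s\n' 'no_such_command_for_demo --flag' > "${good}"

# Syntactically invalid: an "if" block that is never closed with "fi".
printf '%s\n' 'if true; then' '  echo unterminated' > "${bad}"

# Capture the exit status of each syntax check.
if bash -n "${good}"; then good_status=0; else good_status=$?; fi
if bash -n "${bad}" 2>/dev/null; then bad_status=0; else bad_status=$?; fi

echo "valid-syntax script: exit ${good_status}"
echo "broken-syntax script: exit ${bad_status}"
```

The first script passes the check even though running it would fail, which is why a syntax check alone does not replace the docker-based regression run.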

Copilot AI review requested due to automatic review settings March 24, 2026 03:47
@Thearas
Contributor

Thearas commented Mar 24, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

This PR improves the Iceberg docker bootstrap by removing the extra spark-shell startup path and consolidating deletion-vector seed data into the existing aggregated spark-sql bootstrap flow to reduce environment startup cost.

Changes:

  • Removed the spark-shell bootstrap execution of Scala scripts from the Iceberg container entrypoint.
  • Moved deletion-vector seed data generation into a new aggregated Spark SQL script (run28.sql).
  • Deleted the unused Scala bootstrap script (run01.scala).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl | Removes spark-shell bootstrap block; continues using aggregated spark-sql execution. |
| docker/thirdparties/docker-compose/iceberg/scripts/create_preinstalled_scripts/iceberg/run28.sql | Adds deletion-vector seed data creation/deletes into the aggregated SQL bootstrap. |
| docker/thirdparties/docker-compose/iceberg/scripts/create_preinstalled_scripts/iceberg_scala/run01.scala | Removes the now-unused Scala bootstrap script. |

Comment on lines 58 to 60
START_TIME3=$(date +%s)
find /mnt/scripts/create_preinstalled_scripts/iceberg_load -name '*.sql' | sed 's|^|source |' | sed 's|$|;|'> iceberg_load_total.sql
spark-sql --master spark://doris--spark-iceberg:7077 --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions -f iceberg_load_total.sql
Copilot AI Mar 24, 2026

find ... -name '*.sql' does not guarantee a stable ordering. Several existing runXX.sql scripts appear to rely on earlier scripts having already run (e.g. run27.sql only does use demo.test_db; and doesn’t create the DB), so a non-deterministic execution order can cause intermittent bootstrap failures or objects being created in the wrong database. Please sort the file list before generating the aggregated *_total.sql (and apply the same fix to the other find ... > *_total.sql pipelines in this script).
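
One way the suggested fix could look is sketched below. build_total_sql is a hypothetical helper name and the demo directory is a throwaway stand-in for /mnt/scripts/create_preinstalled_scripts/iceberg_load; neither appears in entrypoint.sh.tpl. The sketch assumes script names are zero-padded (run01, run27, run28), so a plain byte-order sort matches the intended run order.

```shell
#!/usr/bin/env bash
set -euo pipefail

# build_total_sql is a hypothetical helper, not part of entrypoint.sh.tpl.
# It lists every *.sql file under $1, sorts the paths in byte order
# (LC_ALL=C keeps the result independent of the container locale),
# and writes one "source <path>;" line per file to $2.
build_total_sql() {
    local script_dir="$1"
    local out_file="$2"
    find "${script_dir}" -name '*.sql' \
        | LC_ALL=C sort \
        | sed -e 's|^|source |' -e 's|$|;|' > "${out_file}"
}

# Demo on a throwaway directory; files are created out of order on purpose.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/run27.sql" "${demo_dir}/run01.sql" "${demo_dir}/run28.sql"

# The output name does not end in .sql, so find does not pick it up.
build_total_sql "${demo_dir}" "${demo_dir}/total.out"
cat "${demo_dir}/total.out"
```

If script names were not zero-padded (for example run2.sql next to run10.sql), a byte-order sort would put run10 first; GNU sort's -V option is one possible alternative in that case, but it is a GNU extension and may not be available in every image.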

@xylaaaaa
Contributor Author

run buildall

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 25, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@morningman morningman merged commit 400cc6b into apache:master Mar 25, 2026
36 of 39 checks passed
xylaaaaa added a commit to xylaaaaa/doris that referenced this pull request Mar 26, 2026
## What
- Remove the extra spark-shell bootstrap from the spark-iceberg
container startup
- Move deletion-vector seed data into a new aggregated spark-sql script
- Delete the unused scala bootstrap script

## Why
- External regression pays this spark-iceberg startup cost on every
environment bootstrap
- Keeping the deletion-vector seed data inside the existing spark-sql
batch avoids one extra heavyweight Spark session

## Testing
- `bash -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl`
- Docker-based regression tests were not run in this environment

(cherry picked from commit 400cc6b)
yiguolei pushed a commit that referenced this pull request Mar 26, 2026
…61649 (#61748)

### What problem does this PR solve?

Issue Number: None

Related PR: #61649

Problem Summary: Cherry-pick #61649 to branch-4.1. This removes the
extra spark-shell bootstrap from the Iceberg docker setup, switches
initialization to a preinstalled SQL script, and drops the redundant
Scala bootstrap script.

### Release note

None

### Check List (For Author)

- Test: No need to test (docker bootstrap script backport); verified
`bash -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl`
and `git diff --check HEAD^ HEAD`
- Behavior changed: Yes (Iceberg docker bootstrap no longer starts an
extra spark-shell session)
- Does this need documentation: No
Labels

approved (Indicates a PR has been approved by one committer.), dev/4.1.0-merged, reviewed

6 participants