[improvement](build) Reduce Hive startup bootstrap overhead #62102
Closed
suxiaogang223 wants to merge 2 commits into apache:master from
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Reduce Hive bootstrap overhead by merging preinstalled HQL execution into a single hive invocation, skipping redundant MSCK REPAIR TABLE statements on non-partitioned tables, and reusing cached Hive auxiliary jars instead of downloading them on every startup.
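The jar-caching part of the summary can be illustrated with a minimal sketch. It is not taken from the PR's scripts: the function name `fetch_jar_cached`, the cache directory layout, and the pluggable downloader argument are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): download a jar only when no
# non-empty copy exists in the cache directory yet.
#   $1 = jar URL, $2 = cache directory,
#   $3 = download command that takes "<target> <url>" (assumed default: curl)
fetch_jar_cached() {
    local url="$1" cache_dir="$2" downloader="${3:-curl -fsSL -o}"
    local name="${url##*/}"
    local target="${cache_dir}/${name}"

    mkdir -p "${cache_dir}"
    if [ -s "${target}" ]; then
        echo "cache hit: ${name}" >&2
        return 0
    fi

    # Download to a temporary file first so that an interrupted transfer
    # never leaves a truncated jar that later looks like a valid cache entry.
    local tmp="${target}.tmp.$$"
    if ${downloader} "${tmp}" "${url}"; then
        mv "${tmp}" "${target}"
    else
        rm -f "${tmp}"
        return 1
    fi
}
```

A second call with the same URL and cache directory returns immediately without touching the network, which is the behavior the summary describes for repeated startups.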
### Release note
None
### Check List (For Author)
- Test: Script validation only
- Manual test / No need to test (with reason)
- Behavior changed: No
- Does this need documentation: No
Contributor
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
Contributor Author
run external
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request on Apr 3, 2026
### What problem does this PR solve?
Issue Number: None
Related PR: apache#62102
Problem Summary: Introduce structured Hive startup control for third-party Docker scripts by decoupling metastore readiness from data loading, adding hive startup modes, and persisting baseline/module state so repeated startup can reuse existing Hive state while keeping the legacy run-thirdparties-docker.sh -c hive3 flow compatible.
### Release note
None
### Check List (For Author)
- Test: Script validation only
- No need to test (with reason): validated modified shell scripts with bash -n, but end-to-end Docker runtime validation is still pending and the PR will stay draft
- Behavior changed: Yes (Hive startup flow now supports fast/refresh/rebuild modes while preserving the default refresh semantics for existing pipeline entrypoints)
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: apache#62102
Problem Summary: Fix Hive startup slowdown in external pipelines by executing merged preinstalled HQL files in parallel shards instead of a single serialized Hive invocation. This preserves the reduced HQL file count while keeping the original startup parallelism characteristics.
### Release note
None
### Check List (For Author)
- Test: Script validation only
- No need to test (with reason): validated the updated shell script with bash -n; waiting for pipeline rerun to confirm runtime improvement
- Behavior changed: No
- Does this need documentation: No
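The sharded execution described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the commit's actual code: `run_in_shards` is a hypothetical name, the round-robin assignment is one possible sharding policy, and the runner is passed in so that a real run would use something like `hive -f`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR).
# run_in_shards SHARDS RUNNER FILE...
# Distributes the FILEs round-robin over SHARDS background workers. Each
# worker runs its share sequentially through RUNNER and stops at its first
# failure. Returns non-zero if any worker failed.
run_in_shards() {
    local shards="$1" runner="$2"
    shift 2
    local pids="" pid shard=0 i f rc=0

    while [ "${shard}" -lt "${shards}" ]; do
        (
            i=0
            for f in "$@"; do
                if [ $((i % shards)) -eq "${shard}" ]; then
                    ${runner} "${f}" || exit 1
                fi
                i=$((i + 1))
            done
        ) &
        pids="${pids} $!"
        shard=$((shard + 1))
    done

    # Wait for every worker so that one failure does not orphan the others.
    for pid in ${pids}; do
        wait "${pid}" || rc=1
    done
    return "${rc}"
}

# Example (assumes the hive CLI is on PATH and the files exist):
#   run_in_shards 4 "hive -f" create_preinstalled_scripts/*.hql
```

Each shard still pays one JVM startup per file it runs in this sketch; combining it with merged per-shard files would reduce that further, at the cost of coarser failure reporting.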
Contributor Author
run external
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request on Apr 8, 2026
…kflow
### What problem does this PR solve?
Related Issue: apache#62101
Related PR: apache#62102
Problem Summary: This PR consolidates Hive thirdparty startup improvements to make refresh/rebuild behavior more predictable, reduce startup overhead, and improve operational observability. Key updates include:
- introduce structured hive startup modes () and module-selective refresh ()
- persist and reuse hive state, with SHA-based incremental refresh for modules and preinstalled HQL files
- reduce refresh/startup log noise (xtrace gating, obsolete compose version cleanup, cleaner refresh stage logs)
- make Hive bootstrap scripts/HQL idempotent with drop-then-create style and repeatable reruns
- optimize healthy refresh path by skipping unnecessary compose-up steps
- switch JuiceFS default metadata backend for Hive to metastore PostgreSQL and remove auto-MySQL dependency
- add Hive README documenting component segmentation, startup modes, module refresh, and troubleshooting
### Release note
None
### Check List (For Author)
- Test: Manual test
  - Ran hive3 refresh and module-scoped refresh via run-thirdparties-docker.sh
- Behavior changed: Yes (Hive startup and refresh behavior is now mode/module driven and defaults to PostgreSQL-backed JuiceFS metadata)
- Does this need documentation: No
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request on Apr 8, 2026
…, and metadata backend
### What problem does this PR solve?
Related Issue: apache#62101
Related PR: apache#62102
Problem Summary: This PR overhauls Hive thirdparty startup in docker/thirdparties to make startup and refresh predictable, faster, and repeatable in local and CI workflows. Main changes:
- add structured Hive startup modes: --hive-mode fast|refresh|rebuild
- add module-scoped refresh: --hive-modules
- persist and reuse Hive state (HDFS/PostgreSQL/state dirs) and introduce baseline/module SHA tracking for incremental refresh
- optimize healthy refresh path to skip unnecessary compose rebuild/up steps
- reduce startup log noise (xtrace gated by HIVE_DEBUG=1, cleaner staged refresh logs, obsolete compose version removal)
- refactor Hive bootstrap scripts and HQL to be idempotent (drop-then-create style for repeated reruns)
- remove redundant startup-heavy operations in refresh path
- switch Hive JuiceFS default metadata backend to Hive metastore PostgreSQL and remove auto-MySQL dependency from Hive startup
- add Hive README documenting component segmentation, startup modes/modules, idempotency expectations, and troubleshooting
### Release note
None
### Check List (For Author)
- Test: Manual test
  - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
  - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql
  - Verified healthy refresh path, module refresh behavior, and JuiceFS metadata initialization with PostgreSQL backend
- Behavior changed: Yes
  - Hive startup now follows mode/module-based refresh semantics
  - Default Hive JuiceFS metadata backend is PostgreSQL (still overrideable by JFS_CLUSTER_META)
- Does this need documentation: No
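The "baseline/module SHA tracking for incremental refresh" item can be sketched as follows. This is a guess at the general technique, not the commit's implementation: the function names and the one-fingerprint-per-state-file layout are assumptions, and the sketch relies on `sha256sum` from GNU coreutils and on file names without whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the PR).

# module_fingerprint DIR: print one SHA-256 that covers the name and content
# of every file under DIR, independent of directory traversal order.
module_fingerprint() {
    ( cd "$1" && find . -type f -print | LC_ALL=C sort \
        | xargs sha256sum | sha256sum | cut -d' ' -f1 )
}

# needs_refresh MODULE_DIR STATE_FILE: succeed (exit 0) when the module's
# current fingerprint differs from the one recorded in STATE_FILE, or when
# nothing has been recorded yet.
needs_refresh() {
    local current stored=""
    current="$(module_fingerprint "$1")" || return 0
    [ -f "$2" ] && stored="$(cat "$2")"
    [ "${current}" != "${stored}" ]
}

# record_refresh MODULE_DIR STATE_FILE: store the fingerprint after a
# successful refresh. STATE_FILE must live outside MODULE_DIR, otherwise
# writing it would change the fingerprint.
record_refresh() {
    module_fingerprint "$1" > "$2"
}
```

A refresh loop would then call `needs_refresh` per module, reload only the modules for which it succeeds, and call `record_refresh` once the reload has completed without error.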
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request on Apr 8, 2026
…-side JuiceFS init
### What problem does this PR solve?
Related Issue: apache#62101
Related PR: apache#62102
Problem Summary: Revert Hive JuiceFS metadata default back to mysql_57 and stop running JuiceFS initialization/format flow during hive2/hive3 startup to avoid unexpected dependency and startup failures in pipeline environments.
### Release note
None
### Check List (For Author)
- Test: Manual test
  - Script syntax check: bash -n docker/thirdparties/run-thirdparties-docker.sh
- Behavior changed: Yes (hive startup no longer runs JuiceFS init flow; Hive JuiceFS metadata default points to mysql_57 again)
- Does this need documentation: No
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request on Apr 8, 2026
### What problem does this PR solve?
Related Issue: apache#62101
Related PR: apache#62102
Problem Summary: Align JuiceFS startup behavior with the pre-refactor flow by restoring Hive-triggered JuiceFS jar sync and metadata initialization after hive2/hive3 startup while keeping MySQL as the default metadata backend.
### Release note
None
### Check List (For Author)
- Test: Manual test
  - Script syntax check: bash -n docker/thirdparties/run-thirdparties-docker.sh
- Behavior changed: Yes (hive startup again runs JuiceFS compatibility init path as before refactor)
- Does this need documentation: No
What problem does this PR solve?
Related Issue: #62101
Problem Summary:
This PR reduces Hive third-party bootstrap overhead in the current startup flow without changing the broader startup model yet.
The main optimizations are:
- merge `create_preinstalled_scripts/*.hql` into a single Hive invocation to avoid repeated JVM startup cost
- skip redundant `MSCK REPAIR TABLE` statements on non-partitioned tables

These changes are intended as an incremental optimization PR before the larger structured startup-script redesign tracked in #62101.
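The two optimizations listed in the summary above can be combined in one small sketch. It is not the PR's script: `merge_hql` is a hypothetical name, and the rule "a file without `PARTITIONED BY` defines only non-partitioned tables" is a simplifying assumption that holds only when each HQL file sets up a single table.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR).
# merge_hql OUT FILE...
# Concatenates the FILEs into OUT so that Hive can be started once for all
# of them. In files that never mention PARTITIONED BY, lines that begin with
# MSCK REPAIR TABLE are dropped, since there are no partitions to repair.
merge_hql() {
    local out="$1" f
    shift
    : > "${out}"
    for f in "$@"; do
        printf '%s\n' "-- source: ${f}" >> "${out}"
        if grep -qiE 'PARTITIONED[[:space:]]+BY' "${f}"; then
            cat "${f}" >> "${out}"
        else
            # grep -v exits 1 when every line was filtered out; that is
            # not an error here.
            grep -viE '^[[:space:]]*MSCK[[:space:]]+REPAIR[[:space:]]+TABLE' \
                "${f}" >> "${out}" || true
        fi
        printf '\n' >> "${out}"
    done
}

# Example (assumes the hive CLI is on PATH):
#   merge_hql /tmp/merged.hql create_preinstalled_scripts/*.hql
#   hive -f /tmp/merged.hql
```

The `-- source:` comment lines keep the merged file traceable back to its inputs, which helps when a statement in the single invocation fails.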
Release note
None
Check List (For Author)