[improvement](build) Improve Hive docker startup refresh and idempotency #62103
Open
suxiaogang223 wants to merge 10 commits into apache:master
Contributor
Thank you for your contribution to Apache Doris. Please clearly describe your PR:

Contributor, Author
run external
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request on Apr 7, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#62103

Problem Summary: Fix Hive startup timeout in external pipelines by replacing netcat-based Hive metastore port probes with bash /dev/tcp checks, because the Hive container image does not reliably provide nc and the previous implementation could block metastore readiness forever.

### Release note

None

### Check List (For Author)

- Test: Script validation only
- No need to test (with reason): validated updated shell scripts with bash -n; waiting for external pipeline rerun to verify runtime behavior
- Behavior changed: No
- Does this need documentation: No
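The probe swap described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names `port_open` and `wait_for_port` are hypothetical, and it assumes the image ships coreutils `timeout` alongside bash.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; the actual scripts in the PR may differ.

# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
port_open() {
    local host="$1" port="$2"
    # The redirection runs in a child bash, so the descriptor is closed when
    # it exits; `timeout` keeps a silently dropped connection from hanging.
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Poll once per second until the port opens or roughly max_wait seconds pass.
wait_for_port() {
    local host="$1" port="$2" max_wait="${3:-60}"
    local waited=0
    until port_open "${host}" "${port}"; do
        if (( waited >= max_wait )); then
            echo "timeout waiting for ${host}:${port}" >&2
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
}
```

A call such as `wait_for_port 127.0.0.1 9083 120` would then bound the wait instead of blocking forever; the host, port, and limit here are placeholder values.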
Force-pushed from bec5cc8 to aca9035
…, and metadata backend

### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary: This PR overhauls Hive thirdparty startup in docker/thirdparties to make startup and refresh predictable, faster, and repeatable in local and CI workflows.

Main changes:

- add structured Hive startup modes: --hive-mode fast|refresh|rebuild
- add module-scoped refresh: --hive-modules
- persist and reuse Hive state (HDFS/PostgreSQL/state dirs) and introduce baseline/module SHA tracking for incremental refresh
- optimize healthy refresh path to skip unnecessary compose rebuild/up steps
- reduce startup log noise (xtrace gated by HIVE_DEBUG=1, cleaner staged refresh logs, obsolete compose version removal)
- refactor Hive bootstrap scripts and HQL to be idempotent (drop-then-create style for repeated reruns)
- remove redundant startup-heavy operations in refresh path
- switch Hive JuiceFS default metadata backend to Hive metastore PostgreSQL and remove auto-MySQL dependency from Hive startup
- add Hive README documenting component segmentation, startup modes/modules, idempotency expectations, and troubleshooting

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql
    - Verified healthy refresh path, module refresh behavior, and JuiceFS metadata initialization with PostgreSQL backend
- Behavior changed: Yes
    - Hive startup now follows mode/module-based refresh semantics
    - Default Hive JuiceFS metadata backend is PostgreSQL (still overrideable by JFS_CLUSTER_META)
- Does this need documentation: No
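The baseline/module SHA tracking mentioned in this commit could look something like the sketch below. All names (`STATE_DIR`, `module_sha`, `module_changed`, `record_module`) are hypothetical stand-ins, and the state layout is an assumption. The idea is only that a module is refreshed when its recorded digest is missing or differs from the current one.

```shell
#!/usr/bin/env bash
# Hypothetical names throughout: STATE_DIR, module_sha, module_changed and
# record_module are illustrative, not the PR's actual helpers.

STATE_DIR="${STATE_DIR:-/tmp/hive-state-demo}"

# Digest over the sorted per-file digests (and relative paths) of a directory.
module_sha() {
    local dir="$1"
    ( cd "${dir}" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) \
        | sha256sum | awk '{print $1}'
}

# Return 0 when the module must be refreshed: no recorded SHA, or it differs.
module_changed() {
    local name="$1" dir="$2"
    local sha_file="${STATE_DIR}/${name}.sha"
    [[ ! -f "${sha_file}" ]] || [[ "$(cat "${sha_file}")" != "$(module_sha "${dir}")" ]]
}

# Record the current SHA; intended to be called after a successful refresh.
record_module() {
    local name="$1" dir="$2"
    mkdir -p "${STATE_DIR}"
    module_sha "${dir}" > "${STATE_DIR}/${name}.sha"
}
```

Recording the digest only after the refresh succeeds means a failed run is retried on the next start rather than silently skipped.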
Force-pushed from aca9035 to 3943667
Contributor, Author
run external
…-side JuiceFS init

### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary: Revert Hive JuiceFS metadata default back to mysql_57 and stop running JuiceFS initialization/format flow during hive2/hive3 startup to avoid unexpected dependency and startup failures in pipeline environments.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Script syntax check: bash -n docker/thirdparties/run-thirdparties-docker.sh
- Behavior changed: Yes (hive startup no longer runs JuiceFS init flow; Hive JuiceFS metadata default points to mysql_57 again)
- Does this need documentation: No
### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary: Align JuiceFS startup behavior with the pre-refactor flow by restoring Hive-triggered JuiceFS jar sync and metadata initialization after hive2/hive3 startup while keeping MySQL as the default metadata backend.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Script syntax check: bash -n docker/thirdparties/run-thirdparties-docker.sh
- Behavior changed: Yes (hive startup again runs JuiceFS compatibility init path as before refactor)
- Does this need documentation: No
Contributor, Author
run external
…docker

- add module-level timing in refresh_module (hive-module-lib.sh) and phase timing in start_hive_stack / maybe_refresh_hive_data (run-thirdparties-docker.sh) to establish a quantitative startup baseline for future volume optimisation work
- update test_hdfs_tvf_compression/run.sh to skip the HDFS upload when /test_data is already populated (idempotent, aligns with existing module-SHA refresh semantics)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
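Phase timing of the kind this commit adds can be approximated with a small wrapper. The sketch below is an assumption about the shape of such a helper: `timed_phase` and the `[timing]` log format are hypothetical, not taken from hive-module-lib.sh or run-thirdparties-docker.sh.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; the PR's actual log format and helper names may differ.

# Run a command, then log its wall-clock duration in whole seconds to stderr.
timed_phase() {
    local name="$1"; shift
    local start end rc=0
    start="$(date +%s)"
    # Capture the exit code without tripping `set -e` in the caller.
    "$@" || rc=$?
    end="$(date +%s)"
    echo "[timing] ${name}: $(( end - start ))s (exit ${rc})" >&2
    return "${rc}"
}
```

Wrapping each phase, for example `timed_phase refresh_module some_function arg`, keeps the timing logic in one place and preserves the wrapped command's exit code.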
Force-pushed from 88c6ce3 to d256249
…h bootstrap group
Parallelization:
- split refresh_preinstalled_hql_module into two phases: serial SHA
check (phase 1) then parallel xargs -P ${LOAD_PARALLEL} execution
(phase 2), mirroring the existing refresh_run_scripts_in_dir pattern
- each subprocess sources hive-module-lib.sh for SHA/state helpers and
the beeline shim PATH; all required env vars (HIVE_BOOTSTRAP_GROUPS,
HIVE_STATE_DIR, LOAD_PARALLEL) are inherited via export
Dependency slimming:
- add tpch bootstrap group covering create_tpch1_orc.hql and
create_tpch1_parquet.hql; these are excluded from common so
BOOTSTRAP_GROUPS=common,hive3_only skips them automatically
- users needing TPCH data pass HIVE3_BOOTSTRAP_GROUPS=common,hive3_only,tpch
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
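The two-phase pattern in this commit (serial check, then parallel execution through `xargs -P`) might be sketched as below. `needs_refresh` and `run_one` are hypothetical placeholders for the SHA check and the beeline call, and the sketch passes the worker to subprocesses with `export -f` rather than sourcing hive-module-lib.sh as the commit describes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: needs_refresh and run_one stand in for the PR's SHA
# check and beeline invocation; marker files replace the real state store.

LOAD_PARALLEL="${LOAD_PARALLEL:-4}"

needs_refresh() { [[ ! -f "$1.done" ]]; }          # placeholder SHA check
run_one() { echo "running $1"; touch "$1.done"; }  # placeholder for beeline -f
export -f run_one

refresh_parallel() {
    local pending=() f
    # Phase 1: serial check, so state is read by a single process.
    for f in "$@"; do
        if needs_refresh "${f}"; then pending+=("${f}"); fi
    done
    (( ${#pending[@]} > 0 )) || return 0
    # Phase 2: run pending files with up to LOAD_PARALLEL concurrent workers.
    printf '%s\0' "${pending[@]}" \
        | xargs -0 -n 1 -P "${LOAD_PARALLEL}" bash -c 'run_one "$1"' _
}
```

Keeping the check serial avoids concurrent reads and writes of shared state, while the expensive execution step is the part that benefits from parallelism.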
…d developer guide

- rewrite architecture section: service layer table with ports, module layer table, bootstrap group table (including new tpch group)
- expand usage section: startup modes table, scoped module refresh examples, TPCH include example
- add developer guide:
    - Pattern A (run.sh + HDFS upload) with idempotent template
    - Pattern B (create_preinstalled_scripts HQL) with naming rules
    - Access HiveServer2 via beeline inside/outside container
    - Ad-hoc HDFS inspection via hadoop fs
    - Metastore PostgreSQL direct query
- add startup timing log examples showing new phase timestamps
- update troubleshooting with HiveServer2 and state corruption cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
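An idempotent run.sh in the spirit of Pattern A might be shaped like the sketch below. It is not the README's template: `upload_once` is a hypothetical name, and it treats "target path exists" as the signal to skip, which is coarser than checking that the directory is fully populated.

```shell
#!/usr/bin/env bash
# Hypothetical template; the function name and paths are illustrative only.

# Upload local_dir to hdfs_dir unless the target path already exists.
upload_once() {
    local local_dir="$1" hdfs_dir="$2"
    if hadoop fs -test -e "${hdfs_dir}"; then
        echo "skip upload: ${hdfs_dir} already exists"
        return 0
    fi
    # Create the parent directory, then copy the local directory into place.
    hadoop fs -mkdir -p "$(dirname "${hdfs_dir}")"
    hadoop fs -put "${local_dir}" "${hdfs_dir}"
    echo "uploaded ${local_dir} to ${hdfs_dir}"
}
```

One limitation of this check: an upload interrupted partway leaves the target in place, so a later run would skip it. A stricter variant could compare file counts or write a marker file after the copy completes.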
Contributor, Author
run buildall
create_tpch1_orc.hql and create_tpch1_parquet.hql were incorrectly moved to the opt-in tpch group, causing CI pipelines to fail with "Database [tpch1_orc] does not exist" because the default HIVE3_BOOTSTRAP_GROUPS=common,hive3_only does not include tpch.

Keep tpch group as an empty placeholder for future opt-in TPCH datasets; the two existing TPCH files remain in common so they are always loaded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
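The failure described here comes down to group membership in a comma-separated list. A minimal sketch of such a check follows; `group_enabled` is a hypothetical helper, and the PR's real matching logic may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the PR's actual group matching may differ.

# Return 0 if group appears as a whole entry in the comma-separated list.
group_enabled() {
    local group="$1" list="$2"
    # Padding both sides with commas prevents partial matches such as
    # "hive3" matching "hive3_only".
    [[ ",${list}," == *",${group},"* ]]
}

HIVE3_BOOTSTRAP_GROUPS="${HIVE3_BOOTSTRAP_GROUPS:-common,hive3_only}"
for g in common hive3_only tpch; do
    if group_enabled "${g}" "${HIVE3_BOOTSTRAP_GROUPS}"; then
        echo "${g}: loaded"
    else
        echo "${g}: skipped"
    fi
done
```

With the default list, files assigned only to tpch are never loaded, which is why a database created by those files would be missing in CI.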
The tpch group had no useful purpose: TPCH HQL files must stay in common for CI compatibility, so the group was always empty. Remove it entirely to keep the bootstrap group model clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What problem does this PR solve?

Related Issue: #62101

Problem Summary:

This PR overhauls Hive thirdparty startup in docker/thirdparties to make startup/refresh predictable, faster, and repeatable in local and CI workflows.

Main changes:

- add structured Hive startup modes: --hive-mode fast|refresh|rebuild
- add module-scoped refresh: --hive-modules
- reduce startup log noise (xtrace gated by HIVE_DEBUG=1, cleaner staged refresh logs, obsolete compose version removal)

Release note

None

Check List (For Author)

- Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
- Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql
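For readers unfamiliar with the new flags, the sketch below shows one way a script could parse and validate them. Only the flag names and the three mode values come from the PR text. The function name `parse_hive_args` and the `fast` default are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical parser; only the flag names and mode values come from the PR.

parse_hive_args() {
    HIVE_MODE="fast"     # assumed default, not confirmed by the PR text
    HIVE_MODULES=""
    while (( $# > 0 )); do
        case "$1" in
            --hive-mode|--hive-modules)
                if (( $# < 2 )); then
                    echo "missing value for $1" >&2
                    return 1
                fi
                if [[ "$1" == "--hive-mode" ]]; then
                    HIVE_MODE="$2"
                else
                    HIVE_MODULES="$2"
                fi
                shift 2 ;;
            *)
                shift ;;   # other options (such as -c) are ignored here
        esac
    done
    # Reject anything outside the three documented modes.
    case "${HIVE_MODE}" in
        fast|refresh|rebuild) ;;
        *) echo "invalid --hive-mode: ${HIVE_MODE}" >&2; return 1 ;;
    esac
}
```

Validating the mode up front turns a typo such as `--hive-mode refesh` into an immediate error instead of a silent fallback.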