
[improvement](build) Improve Hive docker startup refresh and idempotency#62103

Open
suxiaogang223 wants to merge 10 commits into apache:master from suxiaogang223:codex/hive-startup-structured-modes

Conversation

@suxiaogang223
Contributor

@suxiaogang223 suxiaogang223 commented Apr 3, 2026

What problem does this PR solve?

Related Issue: #62101

Problem Summary:
This PR overhauls Hive third-party startup in docker/thirdparties to make startup and refresh predictable, faster, and repeatable in local and CI workflows.

Main changes:

  • add structured Hive startup modes: --hive-mode fast|refresh|rebuild
  • add module-scoped refresh: --hive-modules
  • persist/reuse Hive state (HDFS/PostgreSQL/state dirs) and introduce baseline/module SHA tracking for incremental refresh
  • optimize healthy refresh path to skip unnecessary compose rebuild/up steps
  • reduce startup log noise (xtrace gated by HIVE_DEBUG=1, cleaner staged refresh logs, obsolete compose version removal)
  • refactor Hive bootstrap scripts and HQL to be idempotent (drop-then-create style for repeated reruns)
  • remove redundant startup-heavy operations in refresh path
  • add Hive README documenting component segmentation, startup modes/modules, idempotency expectations, and troubleshooting
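
The baseline/module SHA tracking in the list above could work roughly as follows. This is a minimal sketch under stated assumptions, not the PR's actual implementation: the function names `module_sha`, `module_needs_refresh`, and `mark_module_refreshed`, and the `<state_dir>/<module>.sha` file layout, are hypothetical. It also assumes GNU `find`, `sort`, `xargs`, and `sha256sum`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of per-module SHA tracking for incremental refresh.
# Function names and the state-file layout are illustrative assumptions.

# Print a single digest covering the path and content of every file in a module directory.
module_sha() {
    local dir="$1"
    (cd "${dir}" && find . -type f -print0 | LC_ALL=C sort -z |
        xargs -0 -r sha256sum | sha256sum | cut -d' ' -f1)
}

# Exit 0 when the module has no recorded digest or its content has changed.
module_needs_refresh() {
    local state_dir="$1" module="$2" dir="$3"
    local state_file="${state_dir}/${module}.sha"
    [[ ! -f "${state_file}" ]] && return 0
    [[ "$(cat "${state_file}")" != "$(module_sha "${dir}")" ]]
}

# Record the current digest after a successful refresh.
mark_module_refreshed() {
    local state_dir="$1" module="$2" dir="$3"
    mkdir -p "${state_dir}"
    module_sha "${dir}" > "${state_dir}/${module}.sha"
}
```

A refresh pass would call `module_needs_refresh` for each module, rerun only the modules that report a change, and call `mark_module_refreshed` once each rerun succeeds, so a second run with unchanged inputs does no work.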

Release note

None

Check List (For Author)

  • Test: Manual test
    • Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
    • Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql
  • Behavior changed: Yes
    • Hive startup now follows mode/module-based refresh semantics
  • Does this need documentation: No

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run external

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 7, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#62103

Problem Summary: Fix a Hive startup timeout in external pipelines by replacing netcat-based Hive metastore port probes with bash /dev/tcp checks. The Hive container image does not reliably provide nc, so the previous probe could block metastore readiness forever.
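
A port probe built on bash's `/dev/tcp` redirection might look like the sketch below. The function names `port_is_open` and `wait_for_port` and the 60-second default are hypothetical and not taken from the PR; only the `/dev/tcp/<host>/<port>` redirection itself is a real bash feature.

```shell
#!/usr/bin/env bash
# Sketch of a TCP port probe that needs no nc binary.
# Function names and the timeout default are illustrative assumptions.

# Return 0 if a TCP connection to host:port can be opened.
port_is_open() {
    local host="$1" port="$2"
    # The subshell opens fd 3 on bash's /dev/tcp pseudo-device; the
    # descriptor is closed automatically when the subshell exits.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Poll once per second until the port opens or the deadline passes,
# so a service that never comes up cannot block the caller forever.
wait_for_port() {
    local host="$1" port="$2" timeout_secs="${3:-60}"
    local deadline=$(( SECONDS + timeout_secs ))
    while (( SECONDS < deadline )); do
        port_is_open "${host}" "${port}" && return 0
        sleep 1
    done
    return 1
}
```

One caveat with this approach: a connection attempt to a filtered (silently dropped) port can itself stall for a long time, so a stricter variant might wrap each probe in coreutils `timeout`.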

### Release note

None

### Check List (For Author)

- Test: Script validation only
    - No need to test (with reason): validated updated shell scripts with bash -n; waiting for external pipeline rerun to verify runtime behavior
- Behavior changed: No
- Does this need documentation: No
@suxiaogang223 suxiaogang223 force-pushed the codex/hive-startup-structured-modes branch from bec5cc8 to aca9035 Compare April 8, 2026 03:20
@suxiaogang223 suxiaogang223 changed the title [improvement](build) Add structured Hive startup modes and state reuse [improvement](build) Improve Hive docker startup refresh, idempotency, and metadata backend Apr 8, 2026
…, and metadata backend

### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary:
This PR overhauls Hive thirdparty startup in docker/thirdparties to make startup and refresh predictable, faster, and repeatable in local and CI workflows.

Main changes:
- add structured Hive startup modes: --hive-mode fast|refresh|rebuild
- add module-scoped refresh: --hive-modules
- persist and reuse Hive state (HDFS/PostgreSQL/state dirs) and introduce baseline/module SHA tracking for incremental refresh
- optimize healthy refresh path to skip unnecessary compose rebuild/up steps
- reduce startup log noise (xtrace gated by HIVE_DEBUG=1, cleaner staged refresh logs, obsolete compose version removal)
- refactor Hive bootstrap scripts and HQL to be idempotent (drop-then-create style for repeated reruns)
- remove redundant startup-heavy operations in refresh path
- switch Hive JuiceFS default metadata backend to Hive metastore PostgreSQL and remove auto-MySQL dependency from Hive startup
- add Hive README documenting component segmentation, startup modes/modules, idempotency expectations, and troubleshooting

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql
    - Verified healthy refresh path, module refresh behavior, and JuiceFS metadata initialization with PostgreSQL backend
- Behavior changed: Yes
    - Hive startup now follows mode/module-based refresh semantics
    - Default Hive JuiceFS metadata backend is PostgreSQL (still overridable via JFS_CLUSTER_META)
- Does this need documentation: No
@suxiaogang223 suxiaogang223 force-pushed the codex/hive-startup-structured-modes branch from aca9035 to 3943667 Compare April 8, 2026 03:25
@suxiaogang223
Contributor Author

run external

@suxiaogang223 suxiaogang223 marked this pull request as ready for review April 8, 2026 03:33
…-side JuiceFS init

### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary: Revert the Hive JuiceFS metadata default to mysql_57 and stop running the JuiceFS initialization/format flow during hive2/hive3 startup. This avoids an unexpected dependency and startup failures in pipeline environments.

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Script syntax check: bash -n docker/thirdparties/run-thirdparties-docker.sh

- Behavior changed: Yes (hive startup no longer runs JuiceFS init flow; Hive JuiceFS metadata default points to mysql_57 again)

- Does this need documentation: No
### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary: Align JuiceFS startup behavior with the pre-refactor flow by restoring Hive-triggered JuiceFS jar sync and metadata initialization after hive2/hive3 startup while keeping MySQL as the default metadata backend.

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Script syntax check: bash -n docker/thirdparties/run-thirdparties-docker.sh

- Behavior changed: Yes (hive startup again runs JuiceFS compatibility init path as before refactor)

- Does this need documentation: No
@suxiaogang223 suxiaogang223 changed the title [improvement](build) Improve Hive docker startup refresh, idempotency, and metadata backend [improvement](build) Improve Hive docker startup refresh and idempotency Apr 8, 2026
@suxiaogang223
Contributor Author

run external

…docker

- add module-level timing in refresh_module (hive-module-lib.sh) and
  phase timing in start_hive_stack / maybe_refresh_hive_data
  (run-thirdparties-docker.sh) to establish a quantitative startup
  baseline for future volume optimisation work
- update test_hdfs_tvf_compression/run.sh to skip the HDFS upload when
  /test_data is already populated (idempotent, aligns with existing
  module-SHA refresh semantics)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
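
The module-level and phase timing described in this commit could be collected with a small wrapper like the one below. This is a sketch only: the helper name `timed` and the log format are hypothetical, and the PR's actual timing code may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical timing wrapper; the name and log format are assumptions.

# Run a command, then log its wall-clock duration under the given label.
# The command's exit status is passed through unchanged.
timed() {
    local label="$1"; shift
    local start="${SECONDS}" status=0
    "$@" || status=$?
    echo "[timing] ${label} took $(( SECONDS - start ))s (exit ${status})" >&2
    return "${status}"
}
```

A call such as `timed "module preinstalled_hql" refresh_module preinstalled_hql` would then emit one timing line per module on stderr, which is enough to build a per-module startup baseline from the logs.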
@suxiaogang223 suxiaogang223 force-pushed the codex/hive-startup-structured-modes branch from 88c6ce3 to d256249 Compare April 9, 2026 06:46
suxiaogang223 and others added 2 commits April 9, 2026 15:24
…h bootstrap group

Parallelization:
- split refresh_preinstalled_hql_module into two phases: serial SHA
  check (phase 1) then parallel xargs -P ${LOAD_PARALLEL} execution
  (phase 2), mirroring the existing refresh_run_scripts_in_dir pattern
- each subprocess sources hive-module-lib.sh for SHA/state helpers and
  the beeline shim PATH; all required env vars (HIVE_BOOTSTRAP_GROUPS,
  HIVE_STATE_DIR, LOAD_PARALLEL) are inherited via export

Dependency slimming:
- add tpch bootstrap group covering create_tpch1_orc.hql and
  create_tpch1_parquet.hql; these are excluded from common so
  BOOTSTRAP_GROUPS=common,hive3_only skips them automatically
- users needing TPCH data pass HIVE3_BOOTSTRAP_GROUPS=common,hive3_only,tpch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
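
The two-phase split described in this commit (serial check, then parallel execution through `xargs -P`) can be sketched as follows. `needs_run` and `run_one` are hypothetical stand-ins for the PR's SHA check and beeline invocation, and the sketch shares the worker via `export -f` rather than sourcing hive-module-lib.sh in each subprocess as the commit describes.

```shell
#!/usr/bin/env bash
# Sketch of a serial-check / parallel-run refresh.
# needs_run and run_one are hypothetical stand-ins, not the PR's helpers.

LOAD_PARALLEL="${LOAD_PARALLEL:-4}"

# Stand-in check: a file is stale when it has no ".done" marker beside it.
needs_run() { [[ ! -e "$1.done" ]]; }

# Stand-in worker: a real one would feed the file to beeline before marking it.
run_one() { touch "$1.done"; }
export -f run_one

refresh_files() {
    local f pending=()
    # Phase 1: decide serially which files are stale.
    for f in "$@"; do
        if needs_run "${f}"; then
            pending+=("${f}")
        fi
    done
    (( ${#pending[@]} == 0 )) && return 0
    # Phase 2: run the stale files with up to LOAD_PARALLEL workers.
    printf '%s\0' "${pending[@]}" |
        xargs -0 -n 1 -P "${LOAD_PARALLEL}" bash -c 'run_one "$1"' _
}
```

Keeping the check serial avoids concurrent reads and writes of shared state files, while the expensive per-file work still runs in parallel. `xargs` exits non-zero if any worker fails, so the caller can detect a partial refresh.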
…d developer guide

- rewrite architecture section: service layer table with ports, module
  layer table, bootstrap group table (including new tpch group)
- expand usage section: startup modes table, scoped module refresh
  examples, TPCH include example
- add developer guide:
  - Pattern A (run.sh + HDFS upload) with idempotent template
  - Pattern B (create_preinstalled_scripts HQL) with naming rules
  - Access HiveServer2 via beeline inside/outside container
  - Ad-hoc HDFS inspection via hadoop fs
  - Metastore PostgreSQL direct query
- add startup timing log examples showing new phase timestamps
- update troubleshooting with HiveServer2 and state corruption cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
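
An idempotent upload step in the spirit of Pattern A (run.sh + HDFS upload) might look like the sketch below. The helper name `upload_once` is hypothetical, and the sketch treats the mere existence of the target directory as "already populated", which is a simplifying assumption. The `hadoop fs -test`, `-mkdir -p`, and `-put` subcommands are standard Hadoop FileSystem shell commands.

```shell
#!/usr/bin/env bash
# Hypothetical idempotent HDFS upload helper; the name and the
# "exists means populated" rule are illustrative assumptions.

upload_once() {
    local local_dir="$1" hdfs_dir="$2"
    # `hadoop fs -test -d` exits 0 when the HDFS directory already exists.
    if hadoop fs -test -d "${hdfs_dir}"; then
        echo "skip upload: ${hdfs_dir} already exists"
        return 0
    fi
    # Create the parent, then copy the local directory to the target path.
    hadoop fs -mkdir -p "$(dirname "${hdfs_dir}")"
    hadoop fs -put "${local_dir}" "${hdfs_dir}"
    echo "uploaded ${local_dir} to ${hdfs_dir}"
}
```

A run.sh could call `upload_once ./data /test_data` on every startup; only the first call transfers data, so reruns stay cheap. A stricter variant could write a marker file after a successful upload to guard against a partially uploaded directory.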
@suxiaogang223
Contributor Author

run buildall

suxiaogang223 and others added 4 commits April 10, 2026 14:14
create_tpch1_orc.hql and create_tpch1_parquet.hql were incorrectly
moved to the opt-in tpch group, causing CI pipelines to fail with
"Database [tpch1_orc] does not exist" because the default
HIVE3_BOOTSTRAP_GROUPS=common,hive3_only does not include tpch.

Keep tpch group as an empty placeholder for future opt-in TPCH datasets;
the two existing TPCH files remain in common so they are always loaded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The tpch group had no useful purpose — TPCH HQL files must stay in
common for CI compatibility, so the group was always empty. Remove it
entirely to keep the bootstrap group model clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2 participants