[improvement](build) Reduce Hive startup bootstrap overhead #62102

Closed
suxiaogang223 wants to merge 2 commits into apache:master from suxiaogang223:codex/hive-startup-hotpath-optimizations

@suxiaogang223
Contributor

What problem does this PR solve?

Related Issue: #62101

Problem Summary:
This PR reduces Hive third-party bootstrap overhead in the current startup flow without changing the broader startup model yet.

The main optimizations are:

  • merge selected create_preinstalled_scripts/*.hql into a single Hive invocation to avoid repeated JVM startup cost
  • skip redundant MSCK REPAIR TABLE statements on non-partitioned tables
  • reuse cached Hive auxiliary jars instead of downloading them on every startup

These changes are intended as an incremental optimization PR before the larger structured startup-script redesign tracked in #62101.
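
The first optimization above (one Hive invocation instead of one per file) can be sketched as follows. This is a minimal illustration, not the PR's actual script: `merge_hql_files` and the paths in the usage comment are hypothetical names, and it assumes the `hive` CLI is available where the merged file is executed.

```shell
# Hypothetical helper: concatenate every *.hql file in a directory
# (in glob order, i.e. sorted by name) into one merged script, so that
# a single `hive -f` call replaces one JVM launch per file.
# The merged file should live outside the source directory, otherwise
# the glob would pick it up as an input.
merge_hql_files() {
    local hql_dir merged f
    hql_dir="$1"
    merged="$2"
    : > "${merged}"
    for f in "${hql_dir}"/*.hql; do
        [ -e "${f}" ] || continue    # the directory contained no .hql files
        printf '%s\n' "-- source: $(basename "${f}")" >> "${merged}"
        cat "${f}" >> "${merged}"
        printf '\n' >> "${merged}"   # keep files apart even without a trailing newline
    done
}

# Usage (assumed paths):
#   merge_hql_files ./create_preinstalled_scripts /tmp/merged.hql
#   hive -f /tmp/merged.hql
```

The `-- source:` comment lines are only there to make a failure in the merged script traceable to the original file.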

Release note

None

Check List (For Author)

  • Test: Script validation only
    • No end-to-end startup or regression test has been run yet; keep this PR as draft until runtime validation is completed
  • Behavior changed: No
  • Does this need documentation: No

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Reduce Hive bootstrap overhead by merging preinstalled HQL execution into a single hive invocation, skipping redundant MSCK REPAIR TABLE statements on non-partitioned tables, and reusing cached Hive auxiliary jars instead of downloading them on every startup.
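
The jar-caching idea in this summary could look roughly like the sketch below. `fetch_cached`, the cache directory, and the URL in the usage comment are illustrative assumptions rather than names from the PR; the only external tool used is `curl`.

```shell
# Hypothetical helper: download a file into a cache directory only when it
# is not already there, then print the cached path.
fetch_cached() {
    local url cache_dir target
    url="$1"
    cache_dir="$2"
    target="${cache_dir}/$(basename "${url}")"
    mkdir -p "${cache_dir}"
    if [ ! -s "${target}" ]; then
        # Download under a temporary name first so an interrupted transfer
        # never leaves a truncated jar that later runs would treat as cached.
        curl -fsSL -o "${target}.part" "${url}" || return 1
        mv "${target}.part" "${target}" || return 1
    fi
    printf '%s\n' "${target}"
}

# Usage (assumed URL and directory):
#   jar_path="$(fetch_cached "https://example.org/libs/some-aux.jar" ./cache/auxlib)"
```

A cache like this trades freshness for speed: if the upstream jar changes under the same file name, the cached copy has to be removed by hand or validated with a checksum.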

### Release note

None

### Check List (For Author)

- Test: Script validation only
    - Manual test / No need to test (with reason)
- Behavior changed: No
- Does this need documentation: No
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run external

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 3, 2026
### What problem does this PR solve?

Issue Number: None

Related PR: apache#62102

Problem Summary: Introduce structured Hive startup control for the third-party Docker scripts. The change decouples metastore readiness from data loading, adds Hive startup modes, and persists baseline/module state so that repeated startups can reuse existing Hive state. The legacy run-thirdparties-docker.sh -c hive3 flow remains compatible.

### Release note

None

### Check List (For Author)

- Test: Script validation only
    - No need to test (with reason): validated modified shell scripts with bash -n, but end-to-end Docker runtime validation is still pending and the PR will stay draft
- Behavior changed: Yes (Hive startup flow now supports fast/refresh/rebuild modes while preserving the default refresh semantics for existing pipeline entrypoints)
- Does this need documentation: No
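
For reference, the `bash -n` validation mentioned in the checklist only parses a script without executing it. A small wrapper for checking several scripts at once might look like this; `check_shell_syntax` is a hypothetical name, not part of the repository.

```shell
# Hypothetical helper: run `bash -n` (parse only, execute nothing) on each
# script and report every file that fails, instead of stopping at the first.
check_shell_syntax() {
    local rc script
    rc=0
    for script in "$@"; do
        if ! bash -n "${script}"; then
            echo "syntax error: ${script}" >&2
            rc=1
        fi
    done
    return "${rc}"
}

# Usage:
#   check_shell_syntax docker/thirdparties/run-thirdparties-docker.sh
```

A clean `bash -n` run only shows that the script parses. It says nothing about runtime behavior, which is why the checklist still lists end-to-end validation as pending.
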
### What problem does this PR solve?

Issue Number: None

Related PR: apache#62102

Problem Summary: Fix Hive startup slowdown in external pipelines by executing merged preinstalled HQL files in parallel shards instead of a single serialized Hive invocation. This preserves the reduced HQL file count while keeping the original startup parallelism characteristics.
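
One way to picture the sharded execution described here is the sketch below: files are assigned round-robin to a fixed number of background workers, and the caller waits for all of them. `run_in_shards` and `run_hql` are hypothetical names, and the actual sharding strategy in the commit may differ.

```shell
# Hypothetical helper: run_in_shards <shard_count> <runner> <file>...
# Worker s handles every file whose position modulo shard_count equals s.
# Returns non-zero if any worker fails.
run_in_shards() {
    local shards runner pids s pid rc
    shards="$1"
    runner="$2"
    shift 2
    pids=""
    s=0
    while [ "${s}" -lt "${shards}" ]; do
        (
            i=0
            for f in "$@"; do
                if [ $(( i % shards )) -eq "${s}" ]; then
                    # Detach stdin so the runner cannot consume input
                    # that belongs to the calling script.
                    "${runner}" "${f}" < /dev/null || exit 1
                fi
                i=$(( i + 1 ))
            done
        ) &
        pids="${pids} $!"
        s=$(( s + 1 ))
    done
    rc=0
    for pid in ${pids}; do
        wait "${pid}" || rc=1
    done
    return "${rc}"
}

# Usage (assumes the hive CLI is on PATH):
#   run_hql() { hive -f "$1"; }
#   run_in_shards 4 run_hql /tmp/shard_inputs/*.hql
```

Within a shard the files still run sequentially and the shard stops at its first failure, so files that depend on each other would need to land in the same shard in the right order.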

### Release note

None

### Check List (For Author)

- Test: Script validation only
    - No need to test (with reason): validated the updated shell script with bash -n; waiting for pipeline rerun to confirm runtime improvement
- Behavior changed: No
- Does this need documentation: No
@suxiaogang223
Contributor Author

run external

suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 8, 2026
…kflow

### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary:

This PR consolidates Hive thirdparty startup improvements to make refresh/rebuild behavior more predictable, reduce startup overhead, and improve operational observability.

Key updates include:

- introduce structured Hive startup modes and module-selective refresh

- persist and reuse hive state, with SHA-based incremental refresh for modules and preinstalled HQL files

- reduce refresh/startup log noise (xtrace gating, obsolete compose version cleanup, cleaner refresh stage logs)

- make Hive bootstrap scripts/HQL idempotent with drop-then-create style and repeatable reruns

- optimize healthy refresh path by skipping unnecessary compose-up steps

- switch JuiceFS default metadata backend for Hive to metastore PostgreSQL and remove auto-MySQL dependency

- add Hive README documenting component segmentation, startup modes, module refresh, and troubleshooting
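
The SHA-based incremental refresh mentioned in the list above can be illustrated with a small hash-tracking sketch. The helper names, the state-directory layout, and the choice of SHA-256 via `sha256sum` (GNU coreutils) are all assumptions; the commit message does not say which hash or storage format is used.

```shell
# Hypothetical helpers for hash-based change detection.
file_sha() {
    sha256sum "$1" | awk '{print $1}'
}

# Succeeds (exit 0) when the file has no recorded hash yet or its content
# no longer matches the recorded hash, i.e. when it should be re-applied.
needs_refresh() {
    local file state_dir recorded
    file="$1"
    state_dir="$2"
    recorded="${state_dir}/$(basename "${file}").sha256"
    [ -f "${recorded}" ] || return 0
    [ "$(file_sha "${file}")" != "$(cat "${recorded}")" ]
}

# Record the current hash after the file has been applied successfully.
mark_refreshed() {
    local file state_dir
    file="$1"
    state_dir="$2"
    mkdir -p "${state_dir}"
    file_sha "${file}" > "${state_dir}/$(basename "${file}").sha256"
}

# Usage (assumed paths):
#   if needs_refresh ./scripts/t.hql ./state; then
#       hive -f ./scripts/t.hql && mark_refreshed ./scripts/t.hql ./state
#   fi
```

Recording the hash only after a successful run means a failed refresh is retried on the next startup instead of being silently skipped.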

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Ran hive3 refresh and module-scoped refresh via run-thirdparties-docker.sh

- Behavior changed: Yes (Hive startup and refresh behavior is now mode/module driven and defaults to PostgreSQL-backed JuiceFS metadata)

- Does this need documentation: No
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 8, 2026
…, and metadata backend

### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary:
This PR overhauls Hive thirdparty startup in docker/thirdparties to make startup and refresh predictable, faster, and repeatable in local and CI workflows.

Main changes:
- add structured Hive startup modes: --hive-mode fast|refresh|rebuild
- add module-scoped refresh: --hive-modules
- persist and reuse Hive state (HDFS/PostgreSQL/state dirs) and introduce baseline/module SHA tracking for incremental refresh
- optimize healthy refresh path to skip unnecessary compose rebuild/up steps
- reduce startup log noise (xtrace gated by HIVE_DEBUG=1, cleaner staged refresh logs, obsolete compose version removal)
- refactor Hive bootstrap scripts and HQL to be idempotent (drop-then-create style for repeated reruns)
- remove redundant startup-heavy operations in refresh path
- switch Hive JuiceFS default metadata backend to Hive metastore PostgreSQL and remove auto-MySQL dependency from Hive startup
- add Hive README documenting component segmentation, startup modes/modules, idempotency expectations, and troubleshooting
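
As an illustration of the `--hive-mode fast|refresh|rebuild` option, the value could be validated along these lines. `resolve_hive_mode` is a hypothetical name, and the fallback to `refresh` is an assumption based on the earlier commit note about preserving default refresh semantics. What each mode does internally is not shown here.

```shell
# Hypothetical helper: validate a --hive-mode value and print the mode to use.
# An empty or missing value falls back to "refresh" (assumed default).
resolve_hive_mode() {
    local mode
    mode="${1:-refresh}"
    case "${mode}" in
        fast|refresh|rebuild)
            printf '%s\n' "${mode}"
            ;;
        *)
            echo "unknown hive mode: ${mode} (expected fast, refresh, or rebuild)" >&2
            return 1
            ;;
    esac
}

# Usage:
#   HIVE_MODE="$(resolve_hive_mode "${requested_mode}")" || exit 1
```

Rejecting unknown values up front keeps a typo such as `--hive-mode refesh` from silently running the default path.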

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh
    - Ran ./docker/thirdparties/run-thirdparties-docker.sh -c hive3 --hive-mode refresh --hive-modules preinstalled_hql
    - Verified healthy refresh path, module refresh behavior, and JuiceFS metadata initialization with PostgreSQL backend
- Behavior changed: Yes
    - Hive startup now follows mode/module-based refresh semantics
    - Default Hive JuiceFS metadata backend is PostgreSQL (still overridable via JFS_CLUSTER_META)
- Does this need documentation: No
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 8, 2026
…-side JuiceFS init

### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary: Revert Hive JuiceFS metadata default back to mysql_57 and stop running JuiceFS initialization/format flow during hive2/hive3 startup to avoid unexpected dependency and startup failures in pipeline environments.

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Script syntax check: bash -n docker/thirdparties/run-thirdparties-docker.sh

- Behavior changed: Yes (hive startup no longer runs JuiceFS init flow; Hive JuiceFS metadata default points to mysql_57 again)

- Does this need documentation: No
suxiaogang223 added a commit to suxiaogang223/doris that referenced this pull request Apr 8, 2026
### What problem does this PR solve?

Related Issue: apache#62101

Related PR: apache#62102

Problem Summary: Align JuiceFS startup behavior with the pre-refactor flow by restoring Hive-triggered JuiceFS jar sync and metadata initialization after hive2/hive3 startup while keeping MySQL as the default metadata backend.

### Release note

None

### Check List (For Author)

- Test: Manual test

    - Script syntax check: bash -n docker/thirdparties/run-thirdparties-docker.sh

- Behavior changed: Yes (hive startup again runs JuiceFS compatibility init path as before refactor)

- Does this need documentation: No
2 participants