branch-4.0: [improvement](build) Simplify and speed up Hive docker startup workflow #63385

Open
suxiaogang223 wants to merge 1 commit into apache:branch-4.0 from suxiaogang223:codex/pick-62103-branch-4.0

Member

@suxiaogang223 commented on May 19, 2026

cherry-pick: #62103

…ow (apache#62103)

Related Issue: apache#62101

Problem Summary:

Hive thirdparty startup in `docker/thirdparties` had three main
problems:

1. Startup was slow, especially when `refresh` repeatedly restored a
large baseline tarball and re-ran initialization paths that were hard to
reason about.
2. The runtime model exposed too many user-facing configuration concepts
(`bootstrap groups`, shared-id indirection, baseline version state
file), which made the workflow harder to understand and easier to
misconfigure.
3. Incremental refresh behavior was not observable enough, so it was
hard to tell what was actually re-executed and where startup time was
going.

This PR restructures the Hive startup flow around a simpler model:

- `rebuild`: ignore baseline and build everything from scratch.
- `refresh` (default): reset volumes, restore the published baseline,
then incrementally refresh only changed modules.
- `fast`: reuse existing volumes and skip refresh entirely.

At the same time, this PR simplifies the baseline restore path, removes
obsolete startup code, and improves logging/documentation so the
behavior is easier to maintain.

1. **Clarify startup modes and align implementation with them.**
   - `rebuild` now means full local bootstrap from scratch, without
     baseline restore.
   - `refresh` now means reset volumes, restore the published baseline,
     then reconcile changed modules.
   - `fast` now means reuse existing volumes and skip refresh entirely; if
     volumes are empty and no baseline is available, startup fails instead
     of silently creating an empty environment.
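The mode semantics above can be sketched as a small dispatcher. This is an illustrative sketch only, not code from the PR: the function name `plan_hive_startup`, its arguments, and the step labels it prints are hypothetical. The description does not say what `fast` does when volumes are empty but a baseline exists, so the sketch assumes a baseline restore in that case.

```shell
#!/bin/sh
# Hypothetical helper: maps a startup mode to the steps described above.
# The printed step labels are illustrative, not script names from the PR.
#   $1 = mode (rebuild | refresh | fast), defaults to refresh
#   $2 = "yes" if the Hive volumes are empty (default: no)
#   $3 = "yes" if a published baseline is available (default: yes)
plan_hive_startup() {
    mode="${1:-refresh}"
    volumes_empty="${2:-no}"
    baseline_available="${3:-yes}"

    case "$mode" in
        rebuild)
            # Full local bootstrap; the baseline is ignored.
            echo "bootstrap-from-scratch"
            ;;
        refresh)
            echo "reset-volumes restore-baseline refresh-changed-modules"
            ;;
        fast)
            if [ "$volumes_empty" = "yes" ]; then
                if [ "$baseline_available" != "yes" ]; then
                    # Fail loudly instead of creating an empty environment.
                    echo "fast: volumes are empty and no baseline is available" >&2
                    return 1
                fi
                # ASSUMPTION: not stated in the PR description.
                echo "restore-baseline"
            else
                echo "reuse-volumes"
            fi
            ;;
        *)
            echo "unknown startup mode: $mode" >&2
            return 2
            ;;
    esac
}
```

For example, `plan_hive_startup fast yes no` returns a non-zero status and prints an error to stderr, which mirrors the "fail instead of silently creating an empty environment" rule.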

2. **Simplify the Hive startup code path.**
   - Remove obsolete monolithic scripts that were no longer on the main
     startup path.
   - Keep the startup chain centered on:
     - `start-hive-metastore.sh`
     - `init-hive-baseline.sh`
     - `refresh-hive-modules.sh`
   - Fix Hive health checks so stack reuse logic matches the actual
     `CONTAINER_UID`-prefixed container names.

3. **Reduce unnecessary configuration surface.**
   - Pin `HIVE_SHARED_ID` as an internal constant instead of exposing it
     as user config.
   - Remove user-facing `bootstrap groups` from the external interface;
     Hive2/Hive3 now automatically select the correct shared and
     version-specific bootstrap files internally.

4. **Simplify baseline packaging and restore semantics.**
   - Standardize baseline filenames so local cache and OSS naming both use:
     - `<hive_version>-baseline-<version>.tar.gz`
   - Remove runtime dependence on `/mnt/state/baseline.version`.
   - Newly exported baseline tarballs no longer retain `baseline.version`.
   - Incremental refresh decisions now depend on module SHA state under
     `/mnt/state/modules/`, not on a separate version marker file.
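As a rough illustration of the naming scheme and the SHA-based refresh decision, the following sketch compares a module's current content hash with a recorded one. The function names, the one-file-per-module `<module>.sha` layout, and the way the hash is computed are assumptions made for this example; the PR only states that the state lives under `/mnt/state/modules/`.

```shell
#!/bin/sh
# Builds the standardized baseline filename.
#   $1 = Hive version label, $2 = baseline version
baseline_tarball_name() {
    printf '%s-baseline-%s.tar.gz\n' "$1" "$2"
}

# Portable SHA-256 wrapper: uses sha256sum if present, else shasum.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$@"
    else
        shasum -a 256 "$@"
    fi
}

# Hypothetical content hash of a module directory: hashes every file in a
# stable order, then hashes the resulting list of per-file hashes.
module_sha() {
    (
        cd "$1" || exit 1
        find . -type f | LC_ALL=C sort |
            while IFS= read -r f; do sha256_of "$f"; done |
            sha256_of | cut -d' ' -f1
    )
}

# Succeeds (exit 0) when the module has no recorded SHA or its content changed.
#   $1 = state dir, $2 = module name, $3 = module source dir
module_needs_refresh() {
    state_file="$1/$2.sha"   # ASSUMPTION: one "<module>.sha" file per module
    current_sha="$(module_sha "$3")"
    [ ! -f "$state_file" ] || [ "$(cat "$state_file")" != "$current_sha" ]
}

# Records the current SHA after a module has been (re-)executed.
record_module_sha() {
    mkdir -p "$1"
    module_sha "$3" > "$1/$2.sha"
}
```

With this shape, no separate version marker is needed: a module is re-executed only when its recorded hash is missing or differs from the hash of its current files.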

5. **Speed up repeated baseline restores.**
   - Keep the downloaded tarball under `HIVE_BASELINE_TARBALL_CACHE`.
   - Add a second cache layer for the extracted baseline directory under
     the same cache root.
   - Subsequent restores reuse the extracted baseline tree directly,
     avoiding repeated `tar.gz` decompression on every `refresh`.
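The two cache layers can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: the function name `restore_baseline`, the `extracted/` subdirectory, and the copy-based restore are hypothetical. Only the idea of keeping both the tarball and an extracted tree under one cache root comes from the description.

```shell
#!/bin/sh
# Hypothetical two-layer restore.
#   $1 = cache root (for example, the value of HIVE_BASELINE_TARBALL_CACHE)
#   $2 = baseline tarball filename inside the cache root
#   $3 = target directory to populate
restore_baseline() {
    cache_root="$1"
    tarball="$cache_root/$2"
    target="$3"
    # ASSUMPTION: extracted trees live under "<cache root>/extracted/<name>".
    extracted="$cache_root/extracted/${2%.tar.gz}"

    if [ -d "$extracted" ]; then
        echo "baseline: reusing extracted tree $extracted"
    else
        if [ ! -f "$tarball" ]; then
            echo "baseline: tarball not cached: $tarball" >&2
            return 1
        fi
        # Extract into a temporary directory and rename it afterwards, so an
        # interrupted extraction never leaves a half-filled cache entry.
        tmp_dir="$extracted.tmp.$$"
        rm -rf "$tmp_dir"
        mkdir -p "$tmp_dir"
        if ! tar -xzf "$tarball" -C "$tmp_dir"; then
            rm -rf "$tmp_dir"
            return 1
        fi
        mv "$tmp_dir" "$extracted"
        echo "baseline: extracted $tarball"
    fi

    # Second layer hit or miss, the restore itself is a plain copy.
    mkdir -p "$target"
    cp -Rp "$extracted/." "$target/"
}
```

After the first call, later restores skip `tar` entirely and only pay for the copy, which is where the repeated decompression cost would otherwise go.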

6. **Improve refresh observability.**
   - At the end of each refresh, print a summary of what was actually
     re-executed.
   - The summary includes refreshed modules and a compact detail preview,
     for example changed `preinstalled_hql` files.
   - This makes refresh regressions easier to spot when startup gets
     slower.
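One possible shape for such a summary is sketched below. The record format (tab-separated `module<TAB>detail` lines on stdin), the function name `print_refresh_summary`, and the output wording are all assumptions for illustration; the actual log format in the PR may differ.

```shell
#!/bin/sh
# Hypothetical summary printer. Reads "module<TAB>detail" records on stdin,
# one per re-executed item, and prints each module with a short preview.
#   $1 = maximum number of details to preview per module (default: 3)
print_refresh_summary() {
    awk -F '\t' -v max="${1:-3}" '
        {
            # Remember first-seen order so output is stable.
            if (!($1 in count)) order[++n] = $1
            count[$1]++
            if (count[$1] <= max + 0)
                preview[$1] = preview[$1] (count[$1] > 1 ? ", " : "") $2
        }
        END {
            if (n == 0) {
                print "refresh summary: nothing re-executed"
                exit
            }
            print "refresh summary: " n " module(s) re-executed"
            for (i = 1; i <= n; i++) {
                m = order[i]
                extra = (count[m] > max + 0) ? (" (+" (count[m] - max) " more)") : ""
                print "  " m ": " preview[m] extra
            }
        }'
}
```

A refresh script could append one record per re-executed file to a temporary log and pipe it through this function at the end, so an unexpectedly long list of modules stands out immediately.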

7. **Refresh and simplify documentation.**
   - Rewrite the Hive docker README in both English and Chinese.
   - Document the architecture, startup modes, baseline workflow,
     incremental refresh model, typical developer workflows, and
     troubleshooting.

(cherry picked from commit 0f1a7ba)
@suxiaogang223 changed the title from "[improvement](build) Simplify and speed up Hive docker startup workflow" to "branch-4.0: [improvement](build) Simplify and speed up Hive docker startup workflow" on May 19, 2026
@suxiaogang223 marked this pull request as ready for review on May 19, 2026 04:44
@suxiaogang223
Member Author

run buildall
