branch-4.1: [fix](hive table) Fill Hive meta cache when loading row count for queries #63470 #63800

Merged
yiguolei merged 1 commit into apache:branch-4.1 from yujun777:pick-63470-branch-4.1
May 29, 2026

Conversation

@yujun777
Contributor

cherry-pick: #63470

…ries (apache#63470)

related issue: close apache#63694

Hive external table row count estimation can read Hive Metastore (HMS)
metadata without filling Doris' Hive external metadata cache. As a result,
a normal query reads the same HMS metadata twice within a single planning
flow.

The problematic path is:

1. A normal query asks the external table for row count through
`ExternalTable.getRowCount()`.
2. `ExternalRowCountCache` misses and calls
`HMSExternalTable.fetchRowCount()`.
3. If HMS table parameters do not contain row count and
`enable_get_row_count_from_file_list` is enabled, `HMSExternalTable`
estimates row count from file list.
4. Before this PR, that estimation path always used
`getAllPartitionsWithoutCache()` and `getFilesByPartitions(...,
withCache=false, ...)`.
5. Later in the same normal query, scan planning still needs partition
and file metadata, so it reads the same HMS/file metadata again through
the normal cached scan path.
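
The duplicate access described in the steps above can be illustrated with a minimal, self-contained sketch. It is not Doris code: `CountingMetastore`, `PartitionCache`, and `DuplicateAccessSketch` are hypothetical stand-ins, and only partition listing is modeled. The sketch counts how many remote metastore calls one query makes when row-count estimation bypasses the cache versus when it goes through it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an HMS client; it only counts remote calls. */
class CountingMetastore {
    int partitionListCalls = 0;

    List<String> listPartitions(String table) {
        partitionListCalls++;
        List<String> partitions = new ArrayList<>();
        partitions.add(table + "/dt=2026-05-01");
        partitions.add(table + "/dt=2026-05-02");
        return partitions;
    }
}

/** Hypothetical stand-in for the Hive external metadata cache. */
class PartitionCache {
    private final CountingMetastore metastore;
    private final Map<String, List<String>> cached = new HashMap<>();

    PartitionCache(CountingMetastore metastore) {
        this.metastore = metastore;
    }

    /** Reads through the cache and fills it on a miss. */
    List<String> getWithCache(String table) {
        return cached.computeIfAbsent(table, metastore::listPartitions);
    }

    /** Always goes to the metastore and leaves the cache untouched. */
    List<String> getWithoutCache(String table) {
        return metastore.listPartitions(table);
    }
}

class DuplicateAccessSketch {
    /** Runs row-count estimation, then scan planning; returns the remote call count. */
    static int planQuery(boolean estimateThroughCache) {
        CountingMetastore metastore = new CountingMetastore();
        PartitionCache cache = new PartitionCache(metastore);
        // Steps 3-4: row-count estimation from the file list.
        if (estimateThroughCache) {
            cache.getWithCache("t");
        } else {
            cache.getWithoutCache("t");
        }
        // Step 5: scan planning always uses the cached path.
        cache.getWithCache("t");
        return metastore.partitionListCalls;
    }

    public static void main(String[] args) {
        System.out.println("non-cached estimation: " + planQuery(false) + " remote calls");
        System.out.println("cached estimation: " + planQuery(true) + " remote calls");
    }
}
```

In this sketch the non-cached estimation path costs two remote calls per query and the cached path costs one, which is the saving the PR targets for normal queries.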

This behavior was originally useful for non-query metadata display
requests such as `show table status`, `show stats`, and
`information_schema.tables`: those requests should not fill heavy Hive
metadata caches just because they display cached row count. However,
normal query planning is different. If row count estimation has already
fetched partition and file metadata, filling the metadata cache avoids
duplicated HMS reads in the following scan planning step.

This PR separates the two row-count loading modes with an explicit
`fillMetaCache` flag:

- Normal query row-count loading uses `fillMetaCache=true`.
- Cached row-count display paths such as `getCachedRowCount()` still use
`fillMetaCache=false`.
- The default async row-count cache loader keeps the existing
non-filling behavior unless the caller explicitly requests cache
filling.
- `HMSExternalTable` routes row-count file-list estimation through
cached or non-cached Hive metadata APIs based on `fillMetaCache`.
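
As a rough illustration of the routing in the last bullet, the sketch below selects a cached or non-cached metadata read based on `fillMetaCache`. Everything except the method names `fetchRowCount` and `fetchRowCountWithMetaCache` is an assumption made for the example: the classes `HiveMetaCacheSketch` and `HmsTableSketch`, the fixed 1024-byte partition size, and the bytes-per-row estimate are hypothetical and do not reflect how Doris actually estimates row count.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical metadata cache that counts remote reads. */
class HiveMetaCacheSketch {
    int remoteReads = 0;
    private final Map<String, Long> fileSizeByPartition = new HashMap<>();

    // Pretend remote call: every partition holds 1024 bytes of files (assumed value).
    private long readRemote(String partition) {
        remoteReads++;
        return 1024L;
    }

    /** Cached API: fills the cache on a miss. */
    long fileSizeWithCache(String partition) {
        Long size = fileSizeByPartition.get(partition);
        if (size == null) {
            size = readRemote(partition);
            fileSizeByPartition.put(partition, size);
        }
        return size;
    }

    /** Non-cached API: never touches the cache. */
    long fileSizeWithoutCache(String partition) {
        return readRemote(partition);
    }

    int cachedEntries() {
        return fileSizeByPartition.size();
    }
}

/** Hypothetical table that routes file-list estimation by the fillMetaCache flag. */
class HmsTableSketch {
    static final long ASSUMED_BYTES_PER_ROW = 64L; // illustrative estimate only
    final HiveMetaCacheSketch metaCache = new HiveMetaCacheSketch();
    final String[] partitions = {"dt=1", "dt=2"};

    /** Lightweight path: never fills the metadata cache. */
    long fetchRowCount() {
        return fetchRowCountWithMetaCache(false);
    }

    long fetchRowCountWithMetaCache(boolean fillMetaCache) {
        long totalBytes = 0;
        for (String partition : partitions) {
            totalBytes += fillMetaCache
                    ? metaCache.fileSizeWithCache(partition)
                    : metaCache.fileSizeWithoutCache(partition);
        }
        return totalBytes / ASSUMED_BYTES_PER_ROW;
    }

    public static void main(String[] args) {
        HmsTableSketch table = new HmsTableSketch();
        table.fetchRowCount();
        System.out.println("after display path, cached entries: " + table.metaCache.cachedEntries());
        table.fetchRowCountWithMetaCache(true);
        System.out.println("after query path, cached entries: " + table.metaCache.cachedEntries());
    }
}
```

Both paths return the same estimate in this sketch. Only the side effect on the metadata cache differs, so a later cached read after the query path needs no further remote access.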

Concretely:

- `ExternalTable.getRowCount()` now requests
`ExternalRowCountCache.getCachedRowCount(..., true)`.
- `ExternalTable.getCachedRowCount()` and
`PluginDrivenExternalTable.getCachedRowCount()` request `false`.
- `ExternalRowCountCache` loads row count through
`ExternalTable.fetchRowCountWithMetaCache(fillMetaCache)`.
- `HMSExternalTable.fetchRowCount()` remains the lightweight non-filling
path.
- `HMSExternalTable.fetchRowCountWithMetaCache(true)` fills Hive
partition/file metadata cache while estimating row count from file list.
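
The caller side of the bullets above might look roughly like the following sketch. It is deliberately simplified and rests on several assumptions: the row-count cache here is synchronous, whereas the description mentions an async loader; `RowCountCacheSketch` and `TableSketch` are hypothetical names; and the fixed row count of 100 is a placeholder. The point is only to show which flag each entry point passes down to the loader.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, synchronous stand-in for the row-count cache. */
class RowCountCacheSketch {
    private final Map<Long, Long> rowCounts = new HashMap<>();

    long getCachedRowCount(TableSketch table, boolean fillMetaCache) {
        Long cached = rowCounts.get(table.id);
        if (cached == null) {
            // Cache miss: load through the table, forwarding the caller's flag.
            cached = table.fetchRowCountWithMetaCache(fillMetaCache);
            rowCounts.put(table.id, cached);
        }
        return cached;
    }
}

/** Hypothetical table exposing the two entry points named in the description. */
class TableSketch {
    final long id;
    final RowCountCacheSketch cache;
    // Records the flag seen by each load, for illustration only.
    final List<Boolean> loaderFlags = new ArrayList<>();

    TableSketch(long id, RowCountCacheSketch cache) {
        this.id = id;
        this.cache = cache;
    }

    long fetchRowCountWithMetaCache(boolean fillMetaCache) {
        loaderFlags.add(fillMetaCache);
        return 100L; // placeholder row count
    }

    /** Normal query path: asks the loader to fill the metadata cache. */
    long getRowCount() {
        return cache.getCachedRowCount(this, true);
    }

    /** Display path (show/stat requests): keeps the non-filling behavior. */
    long getCachedRowCount() {
        return cache.getCachedRowCount(this, false);
    }

    public static void main(String[] args) {
        RowCountCacheSketch cache = new RowCountCacheSketch();
        TableSketch queried = new TableSketch(1L, cache);
        TableSketch displayed = new TableSketch(2L, cache);
        queried.getRowCount();
        displayed.getCachedRowCount();
        System.out.println("query path flags: " + queried.loaderFlags);
        System.out.println("display path flags: " + displayed.loaderFlags);
    }
}
```

Once a row count is cached in this sketch, later calls from either entry point return it without invoking the loader again, so the flag only matters on a cache miss.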

This keeps the previous optimization for show/stat display paths while
allowing normal queries to reuse metadata fetched during row-count
estimation.

- Test:
  - Unit Test: `ExternalRowCountCacheTest`
  - Unit Test: `HMSExternalTableTest`
- Behavior changed: Yes. Normal query row-count estimation for HMS
external tables may now fill Hive metadata cache when it has to estimate
row count from file list. Non-query cached row-count display paths still
avoid filling heavy metadata caches.
- Does this need documentation: No.

(cherry picked from commit df200d8)
@yujun777 yujun777 requested a review from yiguolei as a code owner May 28, 2026 06:54
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yujun777
Contributor Author

run buildall

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 23.68% (9/38) 🎉
Increment coverage report
Complete coverage report

@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions Bot added the approved and reviewed labels May 29, 2026
@github-actions
Contributor

PR approved by anyone and no changes requested.

@yiguolei yiguolei merged commit a794109 into apache:branch-4.1 May 29, 2026
29 of 32 checks passed

Labels

approved, reviewed

3 participants