branch-4.1: [fix](hive table) Fill Hive meta cache when loading row count for queries #63470 #63800
Merged
…ries (apache#63470)

related issue: close apache#63694

Hive external table row count estimation can read Hive Metastore metadata without filling Doris' Hive external metadata cache. This makes a normal query pay duplicate HMS metadata access in the same planning flow.

The problematic path is:

1. A normal query asks the external table for row count through `ExternalTable.getRowCount()`.
2. `ExternalRowCountCache` misses and calls `HMSExternalTable.fetchRowCount()`.
3. If HMS table parameters do not contain row count and `enable_get_row_count_from_file_list` is enabled, `HMSExternalTable` estimates row count from file list.
4. Before this PR, that estimation path always used `getAllPartitionsWithoutCache()` and `getFilesByPartitions(..., withCache=false, ...)`.
5. Later in the same normal query, scan planning still needs partition and file metadata, so it reads the same HMS/file metadata again through the normal cached scan path.

This behavior was originally useful for non-query metadata display requests such as `show table status`, `show stats`, and `information_schema.tables`: those requests should not fill heavy Hive metadata caches just because they display cached row count.

However, normal query planning is different. If row count estimation has already fetched partition and file metadata, filling the metadata cache avoids duplicated HMS reads in the following scan planning step.

This PR separates the two row-count loading modes with an explicit `fillMetaCache` flag:

- Normal query row-count loading uses `fillMetaCache=true`.
- Cached row-count display paths such as `getCachedRowCount()` still use `fillMetaCache=false`.
- The default async row-count cache loader keeps the existing non-filling behavior unless the caller explicitly requests cache filling.
- `HMSExternalTable` routes row-count file-list estimation through cached or non-cached Hive metadata APIs based on `fillMetaCache`.
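The duplicate-read path described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `FakeMetastore`, `PartitionMetaCache`, and `planQuery` are hypothetical stand-ins invented for illustration and are not Doris classes; the real code paths go through `HMSExternalTable` and the Hive metadata cache. The sketch counts how many times the simulated metastore is read during one planning flow, with and without cache filling.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these are Doris APIs.
class DuplicateReadSketch {
    /** Simulated Hive Metastore that counts partition-listing requests. */
    static class FakeMetastore {
        int partitionListCalls = 0;

        List<String> listPartitions(String table) {
            partitionListCalls++;
            return List.of("dt=2024-01-01", "dt=2024-01-02");
        }
    }

    /** Simulated metadata cache with a cached and a non-cached read path. */
    static class PartitionMetaCache {
        private final FakeMetastore hms;
        private final Map<String, List<String>> cache = new HashMap<>();

        PartitionMetaCache(FakeMetastore hms) {
            this.hms = hms;
        }

        // Cached path: reads the metastore only on a miss and keeps the result.
        List<String> getPartitions(String table) {
            return cache.computeIfAbsent(table, hms::listPartitions);
        }

        // Non-cached path: always reads the metastore and stores nothing.
        List<String> getPartitionsWithoutCache(String table) {
            return hms.listPartitions(table);
        }
    }

    /** Runs row-count estimation followed by scan planning; returns metastore reads. */
    static int planQuery(boolean fillMetaCache) {
        FakeMetastore hms = new FakeMetastore();
        PartitionMetaCache cache = new PartitionMetaCache(hms);

        // Row-count estimation from the file list needs the partition list.
        if (fillMetaCache) {
            cache.getPartitions("t");
        } else {
            cache.getPartitionsWithoutCache("t");
        }

        // Scan planning always goes through the cached path.
        cache.getPartitions("t");
        return hms.partitionListCalls;
    }

    public static void main(String[] args) {
        System.out.println("fillMetaCache=false reads: " + planQuery(false));
        System.out.println("fillMetaCache=true  reads: " + planQuery(true));
    }
}
```

In this simplified model the non-filling mode reads the metastore twice per query and the filling mode reads it once, which is the saving the PR targets for normal queries.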
Concretely:

- `ExternalTable.getRowCount()` now requests `ExternalRowCountCache.getCachedRowCount(..., true)`.
- `ExternalTable.getCachedRowCount()` and `PluginDrivenExternalTable.getCachedRowCount()` request `false`.
- `ExternalRowCountCache` loads row count through `ExternalTable.fetchRowCountWithMetaCache(fillMetaCache)`.
- `HMSExternalTable.fetchRowCount()` remains the lightweight non-filling path.
- `HMSExternalTable.fetchRowCountWithMetaCache(true)` fills Hive partition/file metadata cache while estimating row count from file list.

This keeps the previous optimization for show/stat display paths while allowing normal queries to reuse metadata fetched during row-count estimation.

- Test:
    - Unit Test: `ExternalRowCountCacheTest`
    - Unit Test: `HMSExternalTableTest`
- Behavior changed: Yes. Normal query row-count estimation for HMS external tables may now fill Hive metadata cache when it has to estimate row count from file list. Non-query cached row-count display paths still avoid filling heavy metadata caches.
- Does this need documentation: No.

(cherry picked from commit df200d8)
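The routing listed above can be sketched as follows. This is a simplified illustration, not the actual patch: the `Sketch*` classes are hypothetical, the cache here loads synchronously (the real `ExternalRowCountCache` loads asynchronously), and the returned row count is a placeholder. Only the method names `getRowCount`, `getCachedRowCount`, `fetchRowCount`, and `fetchRowCountWithMetaCache` are taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: method names follow the PR description, bodies are illustrative.
class RowCountRoutingSketch {
    /** Stand-in for ExternalRowCountCache; assumed synchronous for simplicity. */
    static class SketchRowCountCache {
        long getCachedRowCount(SketchTable table, boolean fillMetaCache) {
            // The loader forwards the caller's choice to the table.
            return table.fetchRowCountWithMetaCache(fillMetaCache);
        }
    }

    /** Stand-in for ExternalTable. */
    static class SketchTable {
        final SketchRowCountCache cache = new SketchRowCountCache();
        final List<String> trace = new ArrayList<>(); // records which path ran

        long fetchRowCount() {
            trace.add("default");
            return -1L; // placeholder for "unknown"
        }

        // By default the flag is ignored and the plain loader is used.
        long fetchRowCountWithMetaCache(boolean fillMetaCache) {
            return fetchRowCount();
        }

        // Normal query path: asks for cache filling.
        long getRowCount() {
            return cache.getCachedRowCount(this, true);
        }

        // Display path (show/stat style requests): never fills the metadata cache.
        long getCachedRowCount() {
            return cache.getCachedRowCount(this, false);
        }
    }

    /** Stand-in for HMSExternalTable. */
    static class SketchHmsTable extends SketchTable {
        @Override
        long fetchRowCount() {
            return fetchRowCountWithMetaCache(false); // lightweight non-filling path
        }

        @Override
        long fetchRowCountWithMetaCache(boolean fillMetaCache) {
            // Choose cached or non-cached Hive metadata access based on the flag.
            trace.add(fillMetaCache ? "fill" : "no-fill");
            return 100L; // placeholder estimate
        }
    }

    public static void main(String[] args) {
        SketchHmsTable table = new SketchHmsTable();
        table.getRowCount();
        table.getCachedRowCount();
        table.fetchRowCount();
        System.out.println(table.trace);
    }
}
```

Passing the flag down as a parameter, rather than storing it on the table, keeps a display request and a query on the same table from affecting each other.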
Contributor, Author: run buildall
yiguolei approved these changes on May 29, 2026
Contributor: PR approved by at least one committer and no changes requested.
Contributor: PR approved by anyone and no changes requested.
cherry-pick: #63470