[chore](regression-test) make test_point_query_ck nonConcurrent#63587
Conversation
|
run buildall |
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
Fixes incorrect point-query row accounting when a row is served from the storage row cache, ensuring downstream logic (notably delete-sign filtering and result block padding) sees the correct “hit row” count and doesn’t return stale deleted rows.
Changes:
- Increment
_row_hitswhen a row-cache lookup hits inPointQueryExecutor::_lookup_row_key().
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/review |
There was a problem hiding this comment.
I found one data-correctness issue in the row-cache hit path. The change fixes row-hit accounting for cached rows, but the cached-row path still bypasses column-store backfill when the row-store column does not contain all requested columns, so cache hits can return default values for real columns.
Critical checkpoint conclusions:
- Goal/test: the PR targets point-query row-cache hit accounting for delete-sign filtering. The one-line change addresses full-row cached results, but misses partial row-store projections. The PR description reports manual/regression coverage; I did not run tests in this review.
- Scope: the change is small, but the affected row-cache path has a parallel missing-column backfill path that also needs to be handled.
- Concurrency/lifecycle: no new shared state or lifecycle hazard introduced by the line itself; existing cache handles remain RAII-managed.
- Compatibility/configuration/persistence: no protocol, storage format, config, or EditLog change.
- Data correctness: blocking issue below for partial row-store cached point queries returning defaults instead of stored column values.
- Performance/observability: no new material concern beyond the correctness issue.
User focus: no additional user-provided review focus was supplied.
| if (hit_cache) { | ||
| _row_read_ctxs[i]._cached_row_data = std::move(cache_handle); | ||
| ++_profile_metrics.row_cache_hits; | ||
| _row_hits++; |
There was a problem hiding this comment.
This cache-hit path still skips the column-store backfill needed when the row-store column is partial. RowCache stores only ROW_STORE_COL from BaseTablet::lookup_row_data() and is keyed by tablet id + primary key, not by the requested projection. For a table using row_store_columns, a first point query can populate the cache with only the row-store columns; a later point query for the same key that requests a column outside that row store hits here, increments _row_hits, and _lookup_row_data() deserializes the cached row then continues before the _reusable->missing_col_uids() column-store read. The padding block at the end then inserts defaults for the missing result columns, so the query can return incorrect values. Please either bypass/disable the row cache when _reusable->missing_col_uids() is non-empty, or still perform the row-location lookup on cache hits so the cached row can be supplemented from column store.
4b9923e to
52c40c9
Compare
|
/review |
|
run buildall |
There was a problem hiding this comment.
I completed the PR review and found no additional distinct issues beyond the already-open inline thread. That existing thread is still blocking: the row-cache hit path increments _row_hits and then bypasses the column-store backfill path, so point queries on tables using partial row_store_columns can return default values for requested columns that are not present in the cached row-store payload.
Critical checkpoint conclusions:
- Goal/test: the PR aims to count storage row-cache hits so delete-sign filtering runs for cached rows. The added regression covers the full row-store delete/compaction scenario, but it does not cover partial row-store projections where cache hits still need missing-column backfill.
- Scope/focus: the code change is minimal, but the affected cache-hit path is not functionally complete because it skips an existing parallel path for
_reusable->missing_col_uids(). - Concurrency/lifecycle: no new shared mutable state, thread entry, lock-order, or cache-handle lifecycle issue was introduced by this one-line change;
RowCache::CacheHandleremains RAII-managed. - Configuration/compatibility/persistence: no new config, protocol/storage format, FE-BE variable passing, or EditLog compatibility concern.
- Data correctness: blocking issue remains in the existing inline comment for partial row-store cached point queries returning incorrect defaults.
- Transaction/write correctness: no transaction or persistence path is directly changed, but the read-side visibility result can still be incorrect for the partial row-store cache-hit scenario.
- Tests: regression coverage was added for delete-sign filtering after full-row storage row-cache hits. Missing coverage for partial
row_store_columnswith a cache warm-up followed by a projection outside the row store. - Performance/observability: no additional material issue found; no new observability appears necessary for this small fix.
User focus: no additional user-provided review focus was supplied.
TPC-H: Total hot run time: 31181 ms |
test_point_query and test_point_query_ck should be serial with nonConcurrent
TPC-H: Total hot run time: 31929 ms |
TPC-DS: Total hot run time: 173331 ms |
52c40c9 to
50db82a
Compare
|
run buildall |
TPC-DS: Total hot run time: 172699 ms |
|
PR approved by at least one committer and no changes requested. |
|
/review |
There was a problem hiding this comment.
Review result: request changes.
Findings: one blocking test-scheduling issue was found. The existing point_query_executor row-cache comment is already-known review context and was not duplicated.
Critical checkpoint conclusions:
- Goal/test proof: The PR aims to make point_query tests serial, but it only marks test_point_query_ck as nonConcurrent; the sibling test_point_query suite remains concurrent, so the stated scheduling goal is incomplete.
- Scope/clarity: The change is small, but incomplete for the stated problem.
- Concurrency: This is directly about regression concurrency. nonConcurrent suites run in the SINGLE phase after NORMAL suites, while test_point_query still runs as NORMAL and mutates global FE config, so it can still interfere with other normal suites.
- Lifecycle/static initialization: Not applicable.
- Configuration items: No new config is added; existing global FE config mutation remains a concurrency hazard in test_point_query.
- Compatibility/storage/transaction/data correctness: Not applicable for this test-only PR.
- Parallel code paths: test_point_query is the parallel sibling path and should be updated as well.
- Special checks: Not applicable.
- Test coverage/results: No new test output is involved; this is regression scheduler metadata only.
- Observability/performance: Not applicable.
User focus: No additional user-provided review focus was specified.
| import java.math.BigDecimal; | ||
|
|
||
| suite("test_point_query_ck") { | ||
| suite("test_point_query_ck", "nonConcurrent") { |
There was a problem hiding this comment.
This only moves test_point_query_ck into the nonConcurrent phase, but the sibling regression-test/suites/point_query_p0/test_point_query.groovy suite is still declared as suite("test_point_query"). That suite also mutates process-wide FE state (ADMIN SET FRONTEND CONFIG ("enable_lightweight_lookup_request" = ...)), and the regression framework runs NORMAL suites concurrently before SINGLE suites (RegressionTest.runScripts waits for NORMAL futures before running GroupExecType.SINGLE). So this PR prevents test_point_query_ck from overlapping with normal suites, but test_point_query can still overlap with other normal suites and cause the same class of global-config interference that the PR title says should be fixed. Please mark test_point_query as nonConcurrent too, or otherwise explain why only the CK variant needs serialization.
test_point_query and test_point_query_ck should be serial with nonConcurrent
test_point_query and test_point_query_ck should be serial with nonConcurrent
test_point_query and test_point_query_ck should be serial with nonConcurrent