
[fix](regression-test) stabilize 2 muted external_table_p0 tests #63646

Open
morningman wants to merge 1 commit into apache:master from morningman:fix-external-p0-20260525

@morningman
Contributor

Summary

Both tests have been muted on the External Regression pipeline due to long-standing flakiness (analysis based on TeamCity build #92687 / id 953050). Neither is a real product bug — both are test-side robustness issues.

test_file_cache_query_limit (~50% pass rate)

After POST /api/file_cache?op=clear&sync=true the test waited exactly one file_cache_background_monitor_interval_ms window and then asserted normal_queue_curr_size == 0 once. The counters surfaced by information_schema.file_cache_statistics are republished by the background monitor on its own cadence, so a single fixed-time wait races the refresh and the assert fails roughly half the time even when the cache really is empty.

  • Replace the four wait-then-assert blocks (size == 0 after clear, size > 0 after a query) with Awaitility-based polling (already imported) on the relevant metric until the predicate holds, with a max(30s, 6 × monitor_interval) timeout.
  • The original assertFalse(...) calls with their metric-specific messages are kept as the final guard, so real failures still surface a precise reason.
  • The two waits for BE config propagation (enable_file_cache_query_limit flip) are left untouched, since they are not in the failure path.
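The poll-until-predicate pattern described above can be sketched in plain Java. This is a minimal illustration, not the test's actual code: the real change uses Awaitility inside a Groovy regression test, while `PollingSketch`, `pollUntil`, `timeoutMs`, and `demo` are hypothetical names introduced here, and the background "monitor" is simulated with a thread that republishes the metric after a delay.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

class PollingSketch {
    /** Timeout rule from the description: max(30s, 6 x monitor interval), in milliseconds. */
    static long timeoutMs(long monitorIntervalMs) {
        return Math.max(30_000L, 6L * monitorIntervalMs);
    }

    /**
     * Hypothetical stand-in for Awaitility: re-check the predicate until it
     * holds or the timeout elapses. Returns true if the predicate held in time.
     */
    static boolean pollUntil(BooleanSupplier predicate, long timeoutMs, long pollIntervalMs)
            throws InterruptedException {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMs * 1_000_000L;
        while (true) {
            if (predicate.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            Thread.sleep(pollIntervalMs);
        }
    }

    /**
     * Simulates a metric that a background thread republishes some time after
     * the underlying state has already changed (assumed behaviour, for illustration).
     */
    static boolean demo(long publishDelayMs) throws InterruptedException {
        // Stale value still visible right after the "clear" call.
        AtomicLong publishedSize = new AtomicLong(4096);
        Thread monitor = new Thread(() -> {
            try {
                Thread.sleep(publishDelayMs);
                publishedSize.set(0); // the monitor catches up on its own cadence
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        monitor.setDaemon(true);
        monitor.start();
        // Poll instead of sleeping once and asserting once.
        return pollUntil(() -> publishedSize.get() == 0, 2_000L, 10L);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timeout for a 5s monitor interval: " + timeoutMs(5_000L) + " ms");
        System.out.println("metric reached 0 within timeout: " + demo(50L));
    }
}
```

A single fixed sleep only passes if the refresh happens to land inside that window. Polling passes as soon as the refreshed value is visible and fails only if it never appears before the timeout, at which point the original assertion still reports the metric-specific reason.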

test_hive_query_cache (~20–25% fail rate)

The test { sql ...; time 20000 } block at L122 ran TPC-H Q9 against containerized hive parquet with enable_sql_cache=false set above, so the 20s upper bound was timing a cold 6-table join, not a cache hit. The query routinely exceeds 20s under cluster load.

  • Drop the time guard; the qt_tpch_1sf_q09 above already validates correctness, and the cache behavior is exercised in the subsequent blocks that explicitly enable sql cache.
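To illustrate why the time guard measured the wrong thing, here is a toy Java model of a query path with an optional result cache. It is a sketch under loud assumptions: `CacheTimingSketch` and its methods are hypothetical and do not reflect how Doris implements its SQL cache. The sketch counts backend executions instead of measuring wall-clock time, so it stays deterministic.

```java
import java.util.HashMap;
import java.util.Map;

class CacheTimingSketch {
    private final Map<String, String> cache = new HashMap<>();
    private boolean cacheEnabled;
    private int backendExecutions; // how many times the expensive path actually ran

    void setCacheEnabled(boolean enabled) {
        this.cacheEnabled = enabled;
    }

    int backendExecutions() {
        return backendExecutions;
    }

    /** Runs a query, consulting the cache only when it is enabled. */
    String run(String sql) {
        if (cacheEnabled) {
            String hit = cache.get(sql);
            if (hit != null) {
                return hit; // cheap path: no backend work
            }
        }
        backendExecutions++; // expensive path: stands in for the cold multi-table join
        String result = "result-of:" + sql;
        if (cacheEnabled) {
            cache.put(sql, result);
        }
        return result;
    }

    public static void main(String[] args) {
        CacheTimingSketch engine = new CacheTimingSketch();

        engine.setCacheEnabled(false);
        engine.run("Q9");
        engine.run("Q9");
        System.out.println("cache off, backend executions: " + engine.backendExecutions());

        engine.setCacheEnabled(true);
        engine.run("Q9"); // populates the cache
        engine.run("Q9"); // served from the cache
        System.out.println("cache on, total backend executions: " + engine.backendExecutions());
    }
}
```

With the cache disabled, every run takes the expensive path, so an upper bound on elapsed time tests cluster load rather than caching. A latency or hit check is only meaningful in the blocks that enable the cache first.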

Test plan

  • Run External Regression pipeline on this PR and confirm both cases pass.
  • After 5+ consecutive green runs, follow up to unmute these cases in TeamCity.

🤖 Generated with Claude Code

test_file_cache_query_limit:
After `POST /api/file_cache?op=clear&sync=true` the test waited one
file_cache_background_monitor_interval_ms window and then asserted
normal_queue_curr_size == 0 once. The counters surfaced by
information_schema.file_cache_statistics are republished by the
background monitor on its own cadence, so a single fixed-time wait
races the refresh and the assert fails roughly half the time even when
the cache really is empty.

Replace the four wait-then-assert blocks with Awaitility-based polling
(already imported) on the relevant metric until the predicate holds,
with a max(30s, 6 x monitor_interval) timeout. The original assertFalse
calls with their metric-specific messages are kept as the final guard,
so real failures still surface a precise reason. The two waits for BE
config propagation are left untouched.

test_hive_query_cache:
The `test { sql ...; time 20000 }` block ran TPC-H Q9 against
containerized hive parquet with enable_sql_cache=false set above, so
the 20s upper bound was timing a cold join, not a cache hit. The query
routinely exceeds 20s under cluster load, which explains the ~20-25%
flake rate. Drop the time guard; the qt_ above already validates
correctness, and the cache behavior is exercised in the subsequent
blocks that explicitly enable sql cache.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman
Contributor Author

run buildall
