@github-actions
Contributor

Cherry-picked from #57856

…external table queries (#57856)

**Symptom:** External table queries hang indefinitely and the FE process
is frozen.

**User-facing impact:** Query threads are blocked waiting on the schema cache:

  ```
  "mysql-nio-pool-14981" TIMED_WAITING
     at LinkedBlockingQueue.offer(LinkedBlockingQueue.java:385)
     at ThreadPoolManager$BlockedPolicy.rejectedExecution(ThreadPoolManager.java:347)
     at ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
     at CompletableFuture.asyncSupplyStage(CompletableFuture.java:1618)
     at CacheLoader.asyncReload(CacheLoader.java:188)
     at BoundedLocalCache.refreshIfNeeded(BoundedLocalCache.java:1214)
     at LocalLoadingCache.get(LocalLoadingCache.java:56)
     at ExternalSchemaCache.getSchemaValue(ExternalSchemaCache.java:86)
     at ExternalTable.getSchemaCacheValue(ExternalTable.java:371)
     at HMSExternalTable.getPartitionColumns(HMSExternalTable.java:288)
     at PruneFileScanPartition.pruneHivePartitions(PruneFileScanPartition.java:84)

  "CommonRefreshExecutor-63" TIMED_WAITING
     at LinkedBlockingQueue.offer(LinkedBlockingQueue.java:385)
     at ThreadPoolManager$BlockedPolicy.rejectedExecution(ThreadPoolManager.java:347)
     at ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
     at CompletableFuture.asyncSupplyStage(CompletableFuture.java:1618)
     at CacheLoader.asyncReload(CacheLoader.java:188)
     at BoundedLocalCache.refreshIfNeeded(BoundedLocalCache.java:1214)

  "CommonRefreshExecutor-62" TIMED_WAITING
     at LinkedBlockingQueue.offer(LinkedBlockingQueue.java:385)
     at ThreadPoolManager$BlockedPolicy.rejectedExecution(ThreadPoolManager.java:347)
     at ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
     at BoundedLocalCache.notifyRemoval(BoundedLocalCache.java:333)
     at BoundedLocalCache.removeNode(BoundedLocalCache.java:1882)
     at LocalManualCache.invalidateAll(LocalManualCache.java:150)
     at MetaCache.invalidateAll(MetaCache.java:121)
     at ExternalDatabase.setUnInitialized(ExternalDatabase.java:123)
```

 **Root cause**: Caffeine cache deadlock when:
  1. MetaCache uses a bounded executor (CommonRefreshExecutor: 64 threads + 640K queue) for both async operations and removal listeners
  2. The database cache removal listener calls tableCache.invalidateAll()
  3. The executor is full (all threads busy + queue full)
  4. Both async reload and the removal listener try to submit tasks to the full executor
  5. Deadlock: executor threads block while submitting new tasks to the full queue, and the queue cannot drain because the threads that would drain it are the ones blocked
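
The steps above can be reproduced in miniature with only the JDK. The sketch below is an illustration under assumptions, not the Doris code: `BlockedPolicySketch`, `boundedPool`, and `workerStallsOnFullPool` are hypothetical names, the pool is shrunk to 1 thread and 1 queue slot, and the rejection handler is a stand-in for the `BlockedPolicy` frame in the traces (it waits on `offer` with a timeout, then gives up so the demo terminates).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BlockedPolicySketch {

    /** Builds a pool whose rejection handler waits up to timeoutMs for queue space. */
    static ThreadPoolExecutor boundedPool(int threads, int queueSize, long timeoutMs) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(queueSize),
                (task, pool) -> {
                    try {
                        // Hypothetical stand-in for a blocking rejection policy.
                        if (!pool.getQueue().offer(task, timeoutMs, TimeUnit.MILLISECONDS)) {
                            throw new RejectedExecutionException("queue still full");
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new RejectedExecutionException(e);
                    }
                });
    }

    /** Returns true if the only worker failed to hand a follow-up task to its own full pool. */
    static boolean workerStallsOnFullPool() throws InterruptedException {
        ThreadPoolExecutor pool = boundedPool(1, 1, 200);
        CountDownLatch queueFilled = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        boolean[] stalled = {false};

        // Task A occupies the single worker thread.
        pool.execute(() -> {
            try {
                queueFilled.await();
                // Like an async reload or removal listener submitted from a worker:
                // the queue is full and no other worker exists to drain it.
                pool.execute(() -> { });
            } catch (RejectedExecutionException e) {
                stalled[0] = true; // with an unbounded wait this thread would hang forever
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.countDown();
            }
        });
        pool.execute(() -> { }); // Task B fills the single queue slot.
        queueFilled.countDown();
        done.await();
        pool.shutdownNow();
        return stalled[0];
    }
}
```

Calling `BlockedPolicySketch.workerStallsOnFullPool()` shows the worker timing out on its own pool. With every worker in that state and a policy that waits without giving up, no thread is left to take tasks off the queue.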

  Jstack evidence: 82 CommonRefreshExecutor threads blocked on LinkedBlockingQueue.offer(), as in the traces above.

**Solution**

  - Add CacheFactory.buildCacheWithSyncRemovalListener() using Runnable::run executor
  - MetaCache.metaObjCache uses sync removal listener to avoid executor contention
  - Removal listener runs inline on calling thread instead of submitting to executor
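
To illustrate why a `Runnable::run` executor sidesteps the contention, here is a JDK-only sketch. It does not use Caffeine or the Doris `CacheFactory`; `SyncListenerSketch` and `runsOnCallingThread` are hypothetical names, and the callback stands in for a removal listener. The assumption is that the cache hands listener callbacks to whatever `Executor` it was configured with.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

public class SyncListenerSketch {

    /**
     * Submits a callback (a stand-in for a removal listener) to the executor
     * and reports whether it ran on the thread that submitted it.
     */
    static boolean runsOnCallingThread(Executor executor) throws InterruptedException {
        Thread caller = Thread.currentThread();
        AtomicReference<Thread> ranOn = new AtomicReference<>();
        CountDownLatch finished = new CountDownLatch(1);

        executor.execute(() -> {
            ranOn.set(Thread.currentThread());
            finished.countDown();
        });

        finished.await(); // returns immediately for an inline executor
        return ranOn.get() == caller;
    }

    /** An executor that never queues: execute(r) simply calls r.run(). */
    static Executor inlineExecutor() {
        return Runnable::run;
    }
}
```

`runsOnCallingThread(SyncListenerSketch.inlineExecutor())` returns true, while passing a thread pool returns false. Because the inline executor never enqueues anything, the callback cannot block on a full queue; the trade-off is that the calling thread pays for the listener's work.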

**Changes**

  - CacheFactory: Add buildCacheWithSyncRemovalListener() and buildCacheWithAsyncRemovalListener()
  - MetaCache: Use buildCacheWithSyncRemovalListener() for metaObjCache
  - Add MetaCacheDeadlockTest to verify fix

**Test**

  Unit test reproduces deadlock with async removal listener and verifies fix with sync removal listener.
@github-actions github-actions bot requested a review from morrySnow as a code owner November 17, 2025 01:38
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hello-stephen
Contributor

run buildall

@morrySnow
Contributor

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 65.00% (13/20) 🎉
Increment coverage report
Complete coverage report

@morningman morningman closed this Nov 20, 2025
@morningman morningman reopened this Nov 20, 2025
@morrySnow morrySnow merged commit de1e170 into branch-3.1 Nov 25, 2025
23 checks passed
@github-actions github-actions bot deleted the auto-pick-57856-branch-3.1 branch November 25, 2025 03:30