[Enhancement](ms) Add sharded LRU cache for tablet index metadata to reduce FDB IO #61666

Draft
wyxxxcat wants to merge 1 commit into apache:master from wyxxxcat:ms_lru_cache

@wyxxxcat
Collaborator

Summary

This PR implements a sharded LRU cache for meta_tablet_idx_key lookups in both MetaService and Recycler, reducing repeated FDB reads of metadata that is nearly immutable after creation.

Background

MS-side operations like commit_rowset, finish_tablet_job, and commit_txn frequently read the same tablet index metadata (TabletIndexPB), which is nearly immutable after creation. This causes unnecessary FDB IO overhead.

Implementation

Core Components

  1. KvCache Template (cloud/src/common/kv_cache.h)

    • Generic sharded LRU cache with 16 shards by default
    • Reduces lock contention in high-concurrency scenarios
    • Supports any KeyTuple and ValuePB types
    • TTL support: entries expire after configurable time
  2. KvCacheManager (cloud/src/common/kv_cache_manager.h)

    • Manages cache instances with configurable capacity and TTL
    • Extensible for future cache types (e.g., SchemaCache)
  3. Configuration (cloud/src/common/config.h)

    • ms_tablet_index_cache_capacity: MS cache capacity (default: 500000)
    • recycler_tablet_index_cache_capacity: Recycler cache capacity (default: 500000)
    • tablet_index_cache_ttl_seconds: TTL in seconds (default: 0, which disables TTL expiry)
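
To make the component list above concrete, here is a minimal sketch of a sharded LRU cache with an optional TTL. It is not the code in kv_cache.h: the class name `ShardedLruCache`, its constructor parameters, and the choice to split capacity evenly across shards are assumptions made for illustration, and a plain copyable value type stands in for the protobuf value.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch, not the PR's KvCache. Key must be hashable with std::hash.
template <typename Key, typename Value>
class ShardedLruCache {
public:
    using Clock = std::chrono::steady_clock;

    // capacity is the total entry budget, split evenly across shards (assumption).
    // ttl_seconds == 0 disables expiry, mirroring the documented default.
    ShardedLruCache(std::size_t capacity, int64_t ttl_seconds = 0, std::size_t num_shards = 16)
            : shards_(num_shards), ttl_(ttl_seconds) {
        assert(num_shards > 0);
        const std::size_t per_shard = std::max<std::size_t>(1, capacity / num_shards);
        for (auto& s : shards_) s.capacity = per_shard;
    }

    std::optional<Value> get(const Key& key) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lock(s.mu);
        auto it = s.index.find(key);
        if (it == s.index.end()) return std::nullopt;
        if (expired(*it->second)) {  // lazily drop entries past their TTL
            s.lru.erase(it->second);
            s.index.erase(it);
            return std::nullopt;
        }
        s.lru.splice(s.lru.begin(), s.lru, it->second);  // mark as most recently used
        return it->second->value;
    }

    void put(const Key& key, Value value) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lock(s.mu);
        auto it = s.index.find(key);
        if (it != s.index.end()) {  // overwrite in place and refresh recency
            it->second->value = std::move(value);
            it->second->inserted_at = Clock::now();
            s.lru.splice(s.lru.begin(), s.lru, it->second);
            return;
        }
        if (s.lru.size() >= s.capacity) {  // evict the least recently used entry
            s.index.erase(s.lru.back().key);
            s.lru.pop_back();
        }
        s.lru.push_front(Entry{key, std::move(value), Clock::now()});
        s.index[key] = s.lru.begin();
    }

    // Returns true if an entry was present and removed.
    bool invalidate(const Key& key) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lock(s.mu);
        auto it = s.index.find(key);
        if (it == s.index.end()) return false;
        s.lru.erase(it->second);
        s.index.erase(it);
        return true;
    }

private:
    struct Entry {
        Key key;
        Value value;
        Clock::time_point inserted_at;
    };
    struct Shard {
        std::mutex mu;  // one lock per shard keeps contention local
        std::size_t capacity = 0;
        std::list<Entry> lru;  // front = most recently used
        std::unordered_map<Key, typename std::list<Entry>::iterator> index;
    };

    bool expired(const Entry& e) const {
        return ttl_.count() > 0 && Clock::now() - e.inserted_at >= ttl_;
    }
    Shard& shard_for(const Key& key) {
        return shards_[std::hash<Key>{}(key) % shards_.size()];
    }

    std::vector<Shard> shards_;
    std::chrono::seconds ttl_;
};
```

With the documented defaults this would be constructed as `ShardedLruCache<std::string, int>(500000, 0)`. Because eviction is decided per shard in this sketch, the global eviction order is only approximately LRU; that is the usual trade-off for reduced lock contention.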

Integration Points

MetaService (cloud/src/meta-service/meta_service.cpp):

  • Initialize global g_ms_cache_manager in constructor
  • Add cache lookup/put in get_tablet_idx() function
  • Transparent to callers - no API changes required

Recycler (cloud/src/recycler/util.cpp, recycler.cpp):

  • Initialize global g_recycler_cache_manager in Recycler::start()
  • Add cache lookup/put in recycler's get_tablet_idx() function
  • Invalidate cache when deleting tablet_idx_key in recycle_tablets()

Cache Invalidation Strategy

  • MS: Can invalidate on drop_tablet/drop_index/drop_partition if needed
  • Recycler: Actively invalidates cache when deleting tablet_idx_key in recycle_tablets()
  • TTL: Entries automatically expire after configured TTL (if enabled)
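
The lookup and invalidation flow described above follows the familiar cache-aside pattern. The sketch below illustrates it under stated assumptions: `TabletIndex`, `FakeKvStore`, `SimpleCache`, `tablet_idx_cache_key`, `get_tablet_idx_cached`, and `recycle_tablet_idx` are all hypothetical stand-ins, not the PR's actual types or signatures, and an in-memory map replaces FDB.

```cpp
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Stand-in for TabletIndexPB; the real type is a protobuf message.
struct TabletIndex {
    int64_t table_id = 0;
    int64_t index_id = 0;
    int64_t partition_id = 0;
    int64_t tablet_id = 0;
};

// Hypothetical in-memory replacement for the FDB-backed store; counts reads.
class FakeKvStore {
public:
    void write(const std::string& key, const TabletIndex& v) { data_[key] = v; }
    void remove(const std::string& key) { data_.erase(key); }
    std::optional<TabletIndex> read(const std::string& key) {
        ++reads;
        auto it = data_.find(key);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }
    int reads = 0;

private:
    std::unordered_map<std::string, TabletIndex> data_;
};

// Hypothetical single-lock cache; a sharded LRU cache would take its place.
class SimpleCache {
public:
    std::optional<TabletIndex> get(const std::string& key) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = map_.find(key);
        if (it == map_.end()) return std::nullopt;
        return it->second;
    }
    void put(const std::string& key, const TabletIndex& v) {
        std::lock_guard<std::mutex> lock(mu_);
        map_[key] = v;
    }
    void invalidate(const std::string& key) {
        std::lock_guard<std::mutex> lock(mu_);
        map_.erase(key);
    }

private:
    std::mutex mu_;
    std::unordered_map<std::string, TabletIndex> map_;
};

// Illustrative key layout only; the real meta_tablet_idx_key encoding differs.
std::string tablet_idx_cache_key(const std::string& instance_id, int64_t tablet_id) {
    return instance_id + "/" + std::to_string(tablet_id);
}

// Cache-aside read: try the cache, fall back to the store, then populate the cache.
std::optional<TabletIndex> get_tablet_idx_cached(SimpleCache& cache, FakeKvStore& store,
                                                 const std::string& instance_id,
                                                 int64_t tablet_id) {
    const std::string key = tablet_idx_cache_key(instance_id, tablet_id);
    if (auto hit = cache.get(key)) return hit;
    auto loaded = store.read(key);
    if (loaded) cache.put(key, *loaded);  // misses in the store are not cached
    return loaded;
}

// Recycler-style delete: remove the key from the store, then drop the cached copy.
void recycle_tablet_idx(SimpleCache& cache, FakeKvStore& store,
                        const std::string& instance_id, int64_t tablet_id) {
    const std::string key = tablet_idx_cache_key(instance_id, tablet_id);
    store.remove(key);
    cache.invalidate(key);
}
```

In this sketch a second lookup for the same tablet does not touch the store, and a lookup after `recycle_tablet_idx` goes back to the store and finds nothing. Note that the invalidation only affects the cache instance in the same process; a separate process holding its own cache would rely on TTL or its own invalidation.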

Testing

Added comprehensive unit tests in cloud/test/kv_cache_test.cpp:

  • Basic get/put operations
  • LRU eviction behavior
  • Cache invalidation
  • Concurrent access (8 threads)
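
As an illustration of what a concurrent-access check could look like, the sketch below runs several threads that each write and read back a disjoint key range. It is not taken from kv_cache_test.cpp: `LockedCache` and `run_concurrent_check` are hypothetical names, and the cache under test is an unbounded locked map so that a read-after-write is always expected to hit.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <optional>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical minimal cache under test: unbounded, guarded by a single mutex.
class LockedCache {
public:
    void put(int64_t key, int64_t value) {
        std::lock_guard<std::mutex> lock(mu_);
        map_[key] = value;
    }
    std::optional<int64_t> get(int64_t key) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = map_.find(key);
        if (it == map_.end()) return std::nullopt;
        return it->second;
    }

private:
    std::mutex mu_;
    std::unordered_map<int64_t, int64_t> map_;
};

// Each thread writes its own disjoint key range and immediately reads it back.
// Returns the number of reads that were missing or returned the wrong value.
template <typename Cache>
int64_t run_concurrent_check(Cache& cache, int num_threads, int keys_per_thread) {
    std::atomic<int64_t> mismatches{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&cache, &mismatches, t, keys_per_thread] {
            for (int i = 0; i < keys_per_thread; ++i) {
                const int64_t key = static_cast<int64_t>(t) * keys_per_thread + i;
                cache.put(key, key * 2);
                auto got = cache.get(key);
                if (!got || *got != key * 2) mismatches.fetch_add(1);
            }
        });
    }
    for (auto& w : workers) w.join();
    return mismatches.load();
}
```

Calling `run_concurrent_check(cache, 8, 1000)` matches the 8-thread setup mentioned above. Against a capacity-bounded LRU cache, a read-after-write can legitimately miss when another thread triggers an eviction in between, so such a test would need either a capacity larger than the working set or a weaker assertion (for example, "a hit never returns a wrong value").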

Performance Benefits

  • Reduces FDB read operations for frequently accessed tablet metadata
  • 16-way sharding minimizes lock contention under high concurrency
  • Transparent integration: existing callers of get_tablet_idx() need no changes
  • Dual eviction: LRU + TTL for flexible cache management

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@wyxxxcat
Collaborator Author

run buildall

Contributor

There is already an LRU cache in BE; we should use it.

3 participants