[Bug]: Concurrency issue around RocksDBHandlerImpl::db_. #105

@MrGuin

Description

See details in eloqdata/eloqkv#224.

Crash Diagnosis

  • The core dump happens while ScanCommand is tearing down its scanner. When the std::unique_ptr<txservice::store::DataStoreScanner> in
    ExecuteCommand goes out of scope (src/redis_service.cpp:6388), RocksDBScanner is destroyed and releases its rocksdb::Iterator.
  • During iterator cleanup RocksDB runs CleanupSuperVersionHandle, which tries to lock the DB’s internal mutex. Because the mutex has already been
    destroyed, pthread_mutex_lock returns EINVAL, triggering rocksdb::port::PthreadCall("lock") to call abort() (frames 3–7 in the stack trace).
  • That means the underlying rocksdb::DBImpl was shut down before the iterator finished. RocksDBHandlerImpl::Shutdown() deletes db_ and destroys its
    mutex (store_handler/rocksdb_handler.cpp:3115), while ScanForward simply hands out the raw db_ pointer to RocksDBScanner without holding db_mux_
    or any lifetime guard (store_handler/rocksdb_handler.cpp:1223).

Why It Happens
If another thread runs Shutdown() (for example during node failover) while a scan is still active, db_ is deleted out from under the scan. The
iterator held by the scan then crashes when it is destroyed, because its cleanup path touches the DB mutex that no longer exists.

Fix Ideas

  1. Keep db_mux_ locked (or otherwise reference-count the DB) for the full lifetime of each RocksDBScanner, preventing shutdown until scanners
    finish.
  2. Or have Shutdown() wait until all scanners/iterators complete before deleting db_.
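  3. At minimum, check whether GetDBPtr() is still valid before creating the scanner and abort the scan early if the DB is closing. On its own
    this only narrows the window: Shutdown() can still run after the check, while the scanner is alive.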
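
A minimal sketch of the reference-counting variant of idea 1, assuming db_ could be held in a std::shared_ptr. FakeDb, Handler, Scanner and AcquireDb() are hypothetical stand-ins for illustration, not the actual RocksDBHandlerImpl or RocksDBScanner code, and no RocksDB calls are used. The point is only that Shutdown() drops the handler's reference while an in-flight scanner keeps the DB alive until it is destroyed.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for rocksdb::DB: records whether it is still alive.
struct FakeDb {
    explicit FakeDb(bool* alive) : alive_(alive) { *alive_ = true; }
    ~FakeDb() { *alive_ = false; }
    bool* alive_;
};

// Hypothetical handler that shares ownership of the DB with its scanners.
class Handler {
public:
    explicit Handler(bool* alive) : db_(std::make_shared<FakeDb>(alive)) {}

    // Returns a co-owning pointer, or nullptr once Shutdown() has run.
    std::shared_ptr<FakeDb> AcquireDb() {
        std::lock_guard<std::mutex> lk(db_mux_);
        return db_;
    }

    // Drops only the handler's reference. The DB is destroyed when the
    // last scanner releases its own reference.
    void Shutdown() {
        std::shared_ptr<FakeDb> doomed;
        {
            std::lock_guard<std::mutex> lk(db_mux_);
            doomed = std::move(db_);  // db_ is now empty
        }
        // doomed is released here, outside the lock.
    }

private:
    std::mutex db_mux_;
    std::shared_ptr<FakeDb> db_;
};

// Hypothetical scanner. In real code the DB reference would be declared
// before the iterator member so that the iterator is destroyed first.
class Scanner {
public:
    explicit Scanner(std::shared_ptr<FakeDb> db) : db_(std::move(db)) {}
    bool DbAlive() const { return db_ != nullptr && *db_->alive_; }

private:
    std::shared_ptr<FakeDb> db_;
};

// Simulates Shutdown() arriving while a scanner is still in flight.
bool DbSurvivesUntilScannerDone() {
    bool alive = false;
    Handler handler(&alive);
    bool alive_during_scan = false;
    {
        Scanner scanner(handler.AcquireDb());
        handler.Shutdown();
        alive_during_scan = scanner.DbAlive();
    }  // scanner destroyed: last reference released, FakeDb destroyed
    return alive_during_scan && !alive && handler.AcquireDb() == nullptr;
}
```

One consequence of this design is that Shutdown() no longer guarantees the DB is closed when it returns; the close is deferred to whichever scanner finishes last. If the caller needs the DB closed on return (for example before reopening it), waiting for scanners as in idea 2 is probably the better fit.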

Logs around shutdown or DB restarts will likely confirm this sequence.
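
Ideas 2 and 3 can be combined. The following is a rough sketch under the assumption that every scanner registers with a small gate object for its whole lifetime; ScanGate, ScanGuard and CloseAndWait() are hypothetical names, not existing code in the store handler. Shutdown first marks the gate as closing so new scans are rejected, then blocks until the in-flight scans have left, and only then would it delete db_.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical gate that tracks in-flight scans.
class ScanGate {
public:
    // Registers a scan. Returns false if shutdown has already begun.
    bool TryEnter() {
        std::lock_guard<std::mutex> lk(mu_);
        if (closing_) return false;
        ++active_scans_;
        return true;
    }

    // Called once the scanner and its iterator have been destroyed.
    void Leave() {
        std::lock_guard<std::mutex> lk(mu_);
        if (--active_scans_ == 0) cv_.notify_all();
    }

    // Rejects new scans and blocks until in-flight scans have drained.
    // After this returns it would be safe to delete the DB.
    void CloseAndWait() {
        std::unique_lock<std::mutex> lk(mu_);
        closing_ = true;
        cv_.wait(lk, [this] { return active_scans_ == 0; });
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    int active_scans_ = 0;
    bool closing_ = false;
};

// RAII guard a scanner would hold for its entire lifetime.
class ScanGuard {
public:
    explicit ScanGuard(ScanGate& gate) : gate_(gate), entered_(gate.TryEnter()) {}
    ~ScanGuard() { if (entered_) gate_.Leave(); }
    ScanGuard(const ScanGuard&) = delete;
    ScanGuard& operator=(const ScanGuard&) = delete;
    bool ok() const { return entered_; }

private:
    ScanGate& gate_;
    bool entered_;
};

// Runs a shutdown concurrently with an active scan and reports whether
// shutdown waited for the scan and whether later scans were rejected.
bool ShutdownWaitsForScan() {
    ScanGate gate;
    std::atomic<bool> scan_done{false};
    bool scan_done_at_shutdown = false;
    std::thread shutdown_thread;
    {
        ScanGuard scan(gate);
        if (!scan.ok()) return false;
        shutdown_thread = std::thread([&] {
            gate.CloseAndWait();
            // The real Shutdown() would delete db_ at this point.
            scan_done_at_shutdown = scan_done.load();
        });
        // Give the shutdown thread a chance to block; not needed for correctness.
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        scan_done.store(true);
    }  // guard released: CloseAndWait() can now return
    shutdown_thread.join();
    ScanGuard late(gate);  // arrives after shutdown, must be rejected
    return scan_done_at_shutdown && !late.ok();
}
```

A scan that fails to enter the gate would return an error to the client instead of creating an iterator. The trade-off is that a long-running scan delays shutdown, so a real implementation might also want a way to cancel scans or a timeout on the wait.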
