Skip to content

fix: some minor bugs#370

Merged
zhourrr merged 3 commits into
alibaba:mainfrom
zhourrr:fix/alot
Apr 23, 2026
Merged

fix: some minor bugs#370
zhourrr merged 3 commits into
alibaba:mainfrom
zhourrr:fix/alot

Conversation

@zhourrr
Copy link
Copy Markdown
Collaborator

@zhourrr zhourrr commented Apr 22, 2026

Asked qoder to do a full round of review before quota expired.

Bug Report: zvec Codebase Audit

Date: April 22, 2026
Scope: Full source audit of the zvec in-process vector database
Files changed: 9 source files edited, 1 file renamed


Bug 1 — Race condition in DropIndex() on destroyed_ flag

Severity: High
File: src/db/collection.cc

Description

CollectionImpl::DropIndex() checks the destroyed_ flag before acquiring the schema_handle_mtx_ lock. Every other mutating method in the class (Destroy(), Flush(), CreateIndex(), Optimize(), AddColumn(), DropColumn(), AlterColumn()) acquires the lock first, then checks destroyed_. This creates a TOCTOU (time-of-check-to-time-of-use) race: another thread could call Destroy() between the check and the lock acquisition, allowing DropIndex() to proceed on a destroyed collection, potentially causing use-after-free or null pointer dereferences on released resources.

Before

Status CollectionImpl::DropIndex(const std::string &column_name) {
  CHECK_COLLECTION_READONLY_RETURN_STATUS;
  CHECK_DESTROY_RETURN_STATUS(destroyed_, false);   // check before lock

  std::lock_guard lock(schema_handle_mtx_);          // lock after check

Fix

Swapped the ordering to match every other method — acquire the lock first, then check:

Status CollectionImpl::DropIndex(const std::string &column_name) {
  CHECK_COLLECTION_READONLY_RETURN_STATUS;

  std::lock_guard lock(schema_handle_mtx_);

  CHECK_DESTROY_RETURN_STATUS(destroyed_, false);

Bug 2 — Integer underflow in ConcurrentRoaringBitmap32::range_cardinality()

Severity: High
File: src/db/common/concurrent_roaring_bitmap.h

Description

range_cardinality(uint32_t min_doc_id, uint32_t max_doc_id) computes min_doc_id - 1 to calculate the rank at the position just before min_doc_id. When min_doc_id == 0, this arithmetic wraps to UINT32_MAX (4,294,967,295) because min_doc_id is an unsigned uint32_t. This causes roaring_bitmap_rank() to return the total cardinality, making the subtraction max_rank - min_rank return 0 (or a bogus value) instead of the correct range count.

The 64-bit counterpart (ConcurrentRoaringBitmap64::range_cardinality) already had a guard for this case (min_doc_id <= 0 ? 0 : ...), but the 32-bit version was missing it.

Before

max_rank = roaring_bitmap_rank(bitmap_, max_doc_id);
min_rank = roaring_bitmap_rank(bitmap_, min_doc_id - 1);  // underflows when min_doc_id == 0

Fix

max_rank = roaring_bitmap_rank(bitmap_, max_doc_id);
min_rank = min_doc_id == 0 ? 0 : roaring_bitmap_rank(bitmap_, min_doc_id - 1);

Bug 3 — Uninitialized member variables in Doc

Severity: High
File: src/include/zvec/db/doc.h

Description

Doc::doc_id_ (uint64_t) and Doc::op_ (Operator enum) have no default member initializers and are not initialized in the default constructor (Doc() = default). Reading these members before explicit assignment is undefined behavior per the C++ standard. In practice, this means a default-constructed Doc could report a garbage doc_id or operation type, which could silently corrupt data during writes or produce incorrect log output.

All other member variables in the class (pk_, score_, fields_) are properly initialized.

Before

uint64_t doc_id_;
Operator op_;

Fix

uint64_t doc_id_{0};
Operator op_{Operator::INSERT};

Bug 4 — Source filename typo: rocbsdb_context.cc

Severity: Medium
File: src/db/common/rocbsdb_context.cc

Description

The RocksDB wrapper implementation file was named rocbsdb_context.cc (missing the letter 'k' — "rocbs" instead of "rocks"). While the build system (SRCS *.cc glob) compiled it without issues, the typo creates confusion and makes it hard to find the file by searching for "rocksdb". The corresponding header is correctly named rocksdb_context.h.

Fix

Renamed rocbsdb_context.cc to rocksdb_context.cc via git mv.


Bug 5 — Wrong error descriptions for InvalidChannelCount and InvalidReplicaCount

Severity: Medium
File: src/db/common/error_code.cc

Description

Error codes 2045 (InvalidChannelCount) and 2046 (InvalidReplicaCount) both had their description string set to "Invalid field name", which is a copy-paste error from the preceding line (error code 2044, InvalidFieldName). This means any error message surfaced to users or logs for channel count or replica count validation failures would misleadingly say "Invalid field name".

Before

PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidFieldName, 2044, "Invalid field name");
PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidChannelCount, 2045, "Invalid field name");   // wrong
PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidReplicaCount, 2046, "Invalid field name");   // wrong

Fix

PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidFieldName, 2044, "Invalid field name");
PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidChannelCount, 2045, "Invalid channel count");
PROXIMA_ZVEC_ERROR_CODE_DEFINE(InvalidReplicaCount, 2046, "Invalid replica count");

Bug 6 — DirectoryAlreadyExists and DirectoryNotExists declared but never defined

Severity: Medium
File: src/db/common/error_code.h

Description

error_code.h declares two error codes via PROXIMA_ZVEC_ERROR_CODE_DECLARE:

PROXIMA_ZVEC_ERROR_CODE_DECLARE(DirectoryAlreadyExists);
PROXIMA_ZVEC_ERROR_CODE_DECLARE(DirectoryNotExists);

However, error_code.cc contains no corresponding PROXIMA_ZVEC_ERROR_CODE_DEFINE for either. These are extern declarations of global const objects that have no definition in any translation unit. If any code ever references them, it would cause a linker error. A codebase search confirmed they are not referenced anywhere.

Fix

Removed both dead declarations from error_code.h.


Bug 7 — Incorrect comment: 64 * 1024 * 1024 labeled as "128M"

Severity: Low
File: src/include/zvec/db/options.h

Description

The constant DEFAULT_MAX_BUFFER_SIZE is defined as 64 * 1024 * 1024 (= 67,108,864 bytes = 64 MB), but the comment says // 128M. This is misleading for anyone reading the code to understand the default buffer size.

Before

const uint32_t DEFAULT_MAX_BUFFER_SIZE = 64 * 1024 * 1024;  // 128M

Fix

const uint32_t DEFAULT_MAX_BUFFER_SIZE = 64 * 1024 * 1024;  // 64M

Bug 8 — Write lock used for read-only storage_size_in_bytes()

Severity: Low
File: src/db/common/concurrent_roaring_bitmap.h

Description

ConcurrentRoaringBitmap32::storage_size_in_bytes() is a read-only operation (it calls roaring_bitmap_portable_size_in_bytes which does not mutate the bitmap), but it acquires a std::unique_lock (exclusive/write lock). This unnecessarily blocks all concurrent readers, reducing throughput. Every other read-only method in the class correctly uses std::shared_lock.

Before

size_t storage_size_in_bytes() const {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    return roaring_bitmap_portable_size_in_bytes(bitmap_);
}

Fix

size_t storage_size_in_bytes() const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return roaring_bitmap_portable_size_in_bytes(bitmap_);
}

Bug 9 — static function definition in header causes per-TU duplication

Severity: Low
File: src/db/common/file_helper.h

Description

GetFileName(FileID) is declared static in a header file. In C++, static at file/namespace scope gives each translation unit (.cc file) its own private copy of the function. Since this header is included across many .cc files, the linker includes duplicate copies of the function in the final binary, increasing code size. The function should be inline instead, which tells the linker to deduplicate identical definitions while still allowing the definition in the header.

Before

static const char *GetFileName(FileID t) {

Fix

inline const char *GetFileName(FileID t) {

Bug 10 — Missing #pragma once include guard in utils.h

Severity: Low
File: src/db/common/utils.h

Description

utils.h is the only header in src/db/common/ that does not have a #pragma once include guard. Without it, if the header is included multiple times (directly or transitively) in the same translation unit, it causes redefinition errors for std::string indent(int level). While the current include graph may not trigger this, it is fragile and inconsistent with the rest of the codebase.

Fix

Added #pragma once at the top of the file.


Bug 11 — Comment indices in is_valid_type_v do not match Value variant ordering

Severity: Info
File: src/include/zvec/db/doc.h

Description

The Value variant and the is_valid_type_v type trait both list the same types, but the comment indices diverge in two places:

  1. In the Value variant, index 13 is std::vector<int64_t> and index 14 is std::vector<uint32_t>. In is_valid_type_v, the comments had these swapped (13 for uint32_t, 14 for int64_t).

  2. In the Value variant, index 20 is std::pair<..., std::vector<float>> and index 21 is std::pair<..., std::vector<float16_t>>. In is_valid_type_v, the comments had these swapped.

While is_valid_type_v is a simple OR-chain (so the ordering does not affect runtime behavior), mismatched index comments are dangerous for anyone writing serialization or deserialization code that relies on variant indices. The wrong comment could lead someone to use the wrong index in a std::get<N>() call.

Fix

Reordered the entries in is_valid_type_v and corrected the comments to match the actual Value variant ordering exactly.

Verification note: Serialization in doc.cc uses std::visit with if constexpr type matching and a separate ValueType enum — not variant indices — so reordering is_valid_type_v entries has zero impact on existing serialized data. The memory_usage() and operator== functions do use .index() on the variant, but those already have the correct index-to-type mapping.


Bug 12 — DeleteByFilter() computes get_all_segments() twice (one result unused)

Severity: Medium
File: src/db/collection.cc

Description

CollectionImpl::DeleteByFilter() calls get_all_segments() at line 1552 and stores the result in a local variable segments, but never uses it. Then at line 1560, get_all_segments() is called a second time as an argument to sql_engine_->execute(). This is a copy-paste artifact that wastes work — get_all_segments() iterates and copies all segments each time it is called.

Before

auto segments = get_all_segments();   // result never used

VectorQuery query;
query.filter_ = filter;
query.topk_ = INT32_MAX;
query.output_fields_ = std::vector<std::string>{};
query.include_doc_id_ = true;

auto ret = sql_engine_->execute(schema_, query, get_all_segments());  // called again

Fix

Removed the unused segments variable. The single remaining get_all_segments() call inside execute() is sufficient:

VectorQuery query;
query.filter_ = filter;
query.topk_ = INT32_MAX;
query.output_fields_ = std::vector<std::string>{};
query.include_doc_id_ = true;

auto ret = sql_engine_->execute(schema_, query, get_all_segments());

Bug 13 — Open() allocates sql_engine_ even after create()/recovery() failure

Severity: Medium
File: src/db/collection.cc

Description

CollectionImpl::Open() calls create() or recovery() and stores the status in s, but then unconditionally proceeds to allocate a Profiler and create the sql_engine_ before returning s. If create() or recovery() failed, these resources are allocated for no reason — the callers (Collection::CreateAndOpen() and Collection::Open()) check the status and discard the CollectionImpl, so the engine is immediately destroyed.

Before

Status s;
if (schema_ == nullptr) {
  s = recovery();
} else {
  s = create();
}

auto profiler = std::make_shared<Profiler>();          // runs even if s is error
sql_engine_ = sqlengine::SQLEngine::create(profiler);  // wasted allocation

return s;

Fix

Added an early return via CHECK_RETURN_STATUS(s) before the sql_engine_ allocation:

Status s;
if (schema_ == nullptr) {
  s = recovery();
} else {
  s = create();
}
CHECK_RETURN_STATUS(s);

auto profiler = std::make_shared<Profiler>();
sql_engine_ = sqlengine::SQLEngine::create(profiler);

return s;

Bug 14 — Typo "ignnored" in options.h comment

Severity: Low
File: src/include/zvec/db/options.h

Description

The comment on enable_mmap_ has a double-n typo: "ignnored" instead of "ignored".

Before

bool enable_mmap_{true};  // ignnored when load collection

Fix

bool enable_mmap_{true};  // ignored when load collection

Verification Notes

All 14 fixes were reviewed for correctness and compatibility:

  • No data structure changes: No serialization formats, on-disk layouts, or public API signatures were modified.
  • No behavioral changes to serialization: The Doc serialization uses an explicit ValueType enum with std::visit / if constexpr type matching, not variant indices. Reordering is_valid_type_v entries (Bug 11) and initializing doc_id_/op_ (Bug 3) have no impact on existing persisted data.
  • Thread safety preserved: Lock reordering (Bug 1) and lock type change (Bug 8) both align with the established patterns used by all other methods in their respective classes.
  • doc_id_{0} initialization (Bug 3): The value 0 is consistent with Doc::clear() which already resets doc_id_ to 0. In all code paths, doc_id is reassigned by Segment::internal_insert() or Segment::Update() before any database operation, so the default value is never persisted.
  • Backward compatibility: All fixes are internal implementation corrections. No public headers changed their API contract. Existing client code requires no modifications.

Copy link
Copy Markdown
Collaborator

@chinaux chinaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@zhourrr zhourrr merged commit 4e7fa0f into alibaba:main Apr 23, 2026
13 checks passed
@zhourrr zhourrr deleted the fix/alot branch April 23, 2026 11:02
ihb2032 pushed a commit to ihb2032/zvec that referenced this pull request Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants