refactor: Try to simplify StringColumn's CompactionOnDump, write to file in sequential#187

Closed
zhanglei1949 wants to merge 1 commit into alibaba:main from zhanglei1949:zl/ref-str-col-comp-2

Conversation

@zhanglei1949
Member

An alternative solution that differs from #181.
Fixes #177


@qodo-code-review

Review Summary by Qodo

Simplify StringColumn compaction with direct sequential file writing

✨ Enhancement 🐞 Bug fix


Walkthroughs

Description
• Simplify StringColumn compaction logic by removing complex offset mapping
• Directly write live strings sequentially to new data file during dump
• Eliminate temporary buffer and reduce memory allocations during compaction
• Remove unused unordered_map include and CompactionPlan struct
Diagram
flowchart LR
  A["StringColumn dump"] --> B{"Check unique bytes<br/>vs total bytes"}
  B -->|No compaction needed| C["Write buffers as-is"]
  B -->|Compaction needed| D["Stream live strings<br/>to new file"]
  D --> E["Update item offsets<br/>and compute MD5"]
  E --> F["Write compacted data"]
  C --> G["Close buffers"]
  F --> G


File Changes

1. include/neug/utils/property/column.h ✨ Enhancement +53/-153

Refactor dump method with direct sequential compaction

• Removed #include <unordered_map> dependency
• Replaced complex CompactionPlan and stream_compact_and_dump() with simplified inline logic
• Rewrote dump() method to directly stream live strings sequentially to file
• Added early exit path when no compaction is needed (unique_bytes == pos_val)
• Simplified offset tracking by writing strings in order without deduplication mapping


@qodo-code-review

qodo-code-review bot commented Apr 8, 2026

Code Review by Qodo

🐞 Bugs (2)   📘 Rule violations (0)   📎 Requirement gaps (1)   🎨 UX Issues (0)
🐞 ≡ Correctness (1) · ☼ Reliability (1)
📎 ⚙ Maintainability (1)



Action required

1. Compaction logic still in header 📎
Description
TypedColumn<std::string_view>::dump() now contains substantial compaction + file I/O
implementation (directory creation, fopen/fwrite/fseek, MD5) directly in column.h, which
keeps complex non-interface logic in the header. This violates the refactor requirement to move
string compaction implementation out of column.h into appropriate implementation files.
Code

include/neug/utils/property/column.h[R288-342]

+    size_t unique_bytes = 0;
+    {
+      std::unordered_set<uint64_t> seen;
+      for (size_t i = 0; i < size_; ++i) {
+        const auto item = get_string_item(i);
+        if (item.length > 0 && seen.insert(item.offset).second)
+          unique_bytes += item.length;
      }
    }
-    pos_val = pos_.load();
-    // No-compaction path: dump containers as-is.
+    size_t pos_val = pos_.load();
+    if (unique_bytes == pos_val) {
+      write_file(filename + ".pos", &pos_val, sizeof(pos_val), 1);
+      items_buffer_->Dump(filename + ".items");
+      data_buffer_->Dump(filename + ".data");
+      items_buffer_->Close();
+      data_buffer_->Close();
+      return;
+    }
+
+    // Slow path: stream each live slot to a new .data file, skipping stale
+    // bytes.  Reads from data_buffer_ and writes to a different file — no
+    // aliasing, so plain slot order is safe (no sort needed).
+    const auto data_path = filename + ".data";
+    auto parent = std::filesystem::path(data_path).parent_path();
+    if (!parent.empty())
+      std::filesystem::create_directories(parent);
+    std::unique_ptr<FILE, decltype(&fclose)> fout(
+        fopen(data_path.c_str(), "wb"), &fclose);
+    if (!fout)
+      THROW_IO_EXCEPTION("Failed to open for compaction: " + data_path);
+
+    FileHeader header{};
+    fwrite(&header, sizeof(header), 1, fout.get());  // placeholder
+
+    const auto* raw = reinterpret_cast<const char*>(data_buffer_->GetData());
+    size_t write_offset = 0;
+    MD5_CTX md5_ctx;
+    MD5_Init(&md5_ctx);
+
+    for (size_t i = 0; i < size_; ++i) {
+      const auto item = get_string_item(i);
+      if (item.length == 0)
+        continue;
+      fwrite(raw + item.offset, 1, item.length, fout.get());
+      MD5_Update(&md5_ctx, raw + item.offset, item.length);
+      set_string_item(i, {write_offset, item.length});
+      write_offset += item.length;
+    }
+
+    MD5_Final(header.data_md5, &md5_ctx);
+    fseek(fout.get(), 0, SEEK_SET);
+    fwrite(&header, sizeof(header), 1, fout.get());
+    fout.reset();
+
+    pos_val = write_offset;
Evidence
PR Compliance ID 1 requires refactoring string column stream compaction so column.h contains
minimal declarations/interfaces and not large implementation blocks. The added dump() code in
column.h implements compaction scanning, filesystem directory creation, raw file writing, and MD5
computation inline in the header.

Refactor to reduce code in column.h for string column stream compaction
include/neug/utils/property/column.h[288-342]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`include/neug/utils/property/column.h` still contains a large, non-interface implementation of string column compaction and sequential file dumping inside `TypedColumn<std::string_view>::dump()`, contrary to the requirement to reduce header logic.

## Issue Context
The compliance objective is to keep `column.h` focused on declarations/interfaces and move substantial compaction logic (filesystem operations, file handles, streaming writes, hashing, offset rewriting) into appropriate `.cc/.cpp` implementation files or a dedicated module.

## Fix Focus Areas
- include/neug/utils/property/column.h[288-342]
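As a sketch of the split this prompt asks for (the names and the non-void return below are illustrative, not the repo's real signatures): since `TypedColumn<std::string_view>` is a full specialization, its `dump()` member can be declared in the header and defined out-of-line in a `.cc` file, keeping column.h declaration-only.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// --- column.h: declarations only -----------------------------------------
template <typename T>
class TypedColumn {
 public:
  size_t dump(const std::string& filename);  // toy: returns bytes written
};

// Declare the specialization's member here; its body lives in a .cc file.
template <>
size_t TypedColumn<std::string_view>::dump(const std::string& filename);

// --- column_string.cc: the heavy implementation ---------------------------
template <>
size_t TypedColumn<std::string_view>::dump(const std::string& filename) {
  // The compaction scan, create_directories, fopen/fwrite/fseek, MD5, and
  // offset rewriting would all move here, out of the header.
  return filename.size();  // stand-in for "bytes written"
}
```

Only the translation unit that defines the specialization needs the filesystem, stdio, and MD5 headers; column.h stays free of them.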



2. Compaction inflates shared strings 🐞
Description
TypedColumn<std::string_view>::dump() slow-path rewrites bytes for every live slot even when many
items share the same backing offset (e.g., from resize(default)), turning a single shared string
into N copies when any stale bytes exist. This can massively increase the dumped .data size and
makes compaction counterproductive.
Code

include/neug/utils/property/column.h[R288-335]

+    size_t unique_bytes = 0;
+    {
+      std::unordered_set<uint64_t> seen;
+      for (size_t i = 0; i < size_; ++i) {
+        const auto item = get_string_item(i);
+        if (item.length > 0 && seen.insert(item.offset).second)
+          unique_bytes += item.length;
      }
    }
-    pos_val = pos_.load();
-    // No-compaction path: dump containers as-is.
+    size_t pos_val = pos_.load();
+    if (unique_bytes == pos_val) {
+      write_file(filename + ".pos", &pos_val, sizeof(pos_val), 1);
+      items_buffer_->Dump(filename + ".items");
+      data_buffer_->Dump(filename + ".data");
+      items_buffer_->Close();
+      data_buffer_->Close();
+      return;
+    }
+
+    // Slow path: stream each live slot to a new .data file, skipping stale
+    // bytes.  Reads from data_buffer_ and writes to a different file — no
+    // aliasing, so plain slot order is safe (no sort needed).
+    const auto data_path = filename + ".data";
+    auto parent = std::filesystem::path(data_path).parent_path();
+    if (!parent.empty())
+      std::filesystem::create_directories(parent);
+    std::unique_ptr<FILE, decltype(&fclose)> fout(
+        fopen(data_path.c_str(), "wb"), &fclose);
+    if (!fout)
+      THROW_IO_EXCEPTION("Failed to open for compaction: " + data_path);
+
+    FileHeader header{};
+    fwrite(&header, sizeof(header), 1, fout.get());  // placeholder
+
+    const auto* raw = reinterpret_cast<const char*>(data_buffer_->GetData());
+    size_t write_offset = 0;
+    MD5_CTX md5_ctx;
+    MD5_Init(&md5_ctx);
+
+    for (size_t i = 0; i < size_; ++i) {
+      const auto item = get_string_item(i);
+      if (item.length == 0)
+        continue;
+      fwrite(raw + item.offset, 1, item.length, fout.get());
+      MD5_Update(&md5_ctx, raw + item.offset, item.length);
+      set_string_item(i, {write_offset, item.length});
+      write_offset += item.length;
+    }
Evidence
StringColumn explicitly creates shared offsets when resizing with a non-empty default: it appends
the default once and then copies the same string_item (same offset/length) to all new rows. The new
dump slow-path does not deduplicate repeated offsets; it unconditionally fwrite()s each row’s bytes
and advances write_offset, so shared defaults become duplicated per-row when compaction is triggered
by any stale (unreferenced) bytes.

include/neug/utils/property/column.h[287-347]
include/neug/utils/property/column.h[364-388]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`TypedColumn<std::string_view>::dump()`'s slow compaction path writes data for every row even when multiple rows reference the same `string_item.offset` (shared backing bytes, created by `resize(size, default_value)`). When compaction triggers due to any stale bytes, this expands shared strings into N copies and can drastically inflate snapshot size.

## Issue Context
The previous implementation avoided this by mapping `old_offset -> new_offset` and only writing each unique backing segment once.

## Fix Focus Areas
- include/neug/utils/property/column.h[287-347]
- include/neug/utils/property/column.h[364-388]

## Implementation notes
- Track seen offsets in the slow path (e.g., `std::unordered_map<uint64_t, uint64_t> old_to_new`).
- If an offset was already written, only update the item to the previously assigned new offset; do not `fwrite()` again and do not advance `write_offset`.
- Ensure the resulting `pos_val` equals the compacted payload size (unique bytes actually written).
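The remapping these notes describe can be sketched on a toy model (the `Item` struct and in-memory `out` buffer below stand in for `string_item` and the snapshot file; this is not the repo's code): each unique old offset is written once, and later rows that share it are only remapped.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy row descriptor: {offset, length} into the backing data.
struct Item { uint64_t offset; uint64_t length; };

// Compacts `data` into `out`, rewriting items in place. Returns the
// compacted payload size (what pos_val should become).
uint64_t compact(const std::string& data, std::vector<Item>& items,
                 std::string& out) {
  std::unordered_map<uint64_t, uint64_t> old_to_new;
  uint64_t write_offset = 0;
  for (auto& item : items) {
    if (item.length == 0) continue;
    auto [it, inserted] = old_to_new.try_emplace(item.offset, write_offset);
    if (inserted) {                                // first time seen:
      out.append(data, item.offset, item.length);  // write the bytes once
      write_offset += item.length;
    }
    item.offset = it->second;  // remap to the already-written location
  }
  return write_offset;
}
```

With `data = "aaaXXbbb"` (stale `"XX"`) and rows `{0,3}, {0,3}, {5,3}`, the shared `"aaa"` is written once and the result is 6 bytes, not 9.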



3. Unchecked file write errors 🐞
Description
The new dump compaction path ignores return values from fwrite()/fseek() and does not check fclose()
results, so I/O failures can silently produce corrupted snapshot .data files. Other dump
implementations in the repo check these operations and throw on failure.
Code

include/neug/utils/property/column.h[R314-340]

+    std::unique_ptr<FILE, decltype(&fclose)> fout(
+        fopen(data_path.c_str(), "wb"), &fclose);
+    if (!fout)
+      THROW_IO_EXCEPTION("Failed to open for compaction: " + data_path);
+
+    FileHeader header{};
+    fwrite(&header, sizeof(header), 1, fout.get());  // placeholder
+
+    const auto* raw = reinterpret_cast<const char*>(data_buffer_->GetData());
+    size_t write_offset = 0;
+    MD5_CTX md5_ctx;
+    MD5_Init(&md5_ctx);
+
+    for (size_t i = 0; i < size_; ++i) {
+      const auto item = get_string_item(i);
+      if (item.length == 0)
+        continue;
+      fwrite(raw + item.offset, 1, item.length, fout.get());
+      MD5_Update(&md5_ctx, raw + item.offset, item.length);
+      set_string_item(i, {write_offset, item.length});
+      write_offset += item.length;
+    }
+
+    MD5_Final(header.data_md5, &md5_ctx);
+    fseek(fout.get(), 0, SEEK_SET);
+    fwrite(&header, sizeof(header), 1, fout.get());
+    fout.reset();
Evidence
In the new manual streaming path, the code writes the placeholder header, payload segments, and the
final header without validating fwrite counts or fseek return values. In contrast,
MMapContainer::Dump checks fwrite results for both header and payload and throws on failure;
similarly, MutableCsr’s dump path checks fwrite/fseek/fclose and throws on errors.

include/neug/utils/property/column.h[314-340]
src/storages/container/mmap_container.cc[132-151]
src/storages/csr/mutable_csr.cc[163-196]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The StringColumn dump compaction path performs multiple file I/O operations (`fwrite`, `fseek`, `fclose`) without checking return values. On short writes, seek failures, or close/flush errors, the snapshot file can be silently corrupted.

## Issue Context
Other codepaths already enforce strict I/O checking (e.g., `MMapContainer::Dump`, `MutableCsr` dump).

## Fix Focus Areas
- include/neug/utils/property/column.h[314-340]
- src/storages/container/mmap_container.cc[132-151]
- src/storages/csr/mutable_csr.cc[163-196]

## Implementation notes
- Check `fwrite(...) == expected` for header and each data segment; throw `THROW_IO_EXCEPTION(...)` on mismatch.
- Check `fseek(...) == 0` before rewriting the header.
- Prefer `if (fclose(fp.release()) != 0) THROW_IO_EXCEPTION(...)` to ensure close/flush errors are surfaced.
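A minimal sketch of such wrappers, with `std::runtime_error` standing in for `THROW_IO_EXCEPTION` (an assumption; the repo macro is not shown here):

```cpp
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

// Throws on short writes instead of silently continuing.
static void checked_fwrite(const void* buf, size_t size, size_t count,
                           FILE* fp, const std::string& what) {
  if (fwrite(buf, size, count, fp) != count)
    throw std::runtime_error("short write: " + what);
}

// release() so we never double-close; fclose also flushes, so its return
// value is the last chance to observe a deferred write error.
static void checked_close(std::unique_ptr<FILE, decltype(&fclose)>& fp,
                          const std::string& what) {
  if (fclose(fp.release()) != 0)
    throw std::runtime_error("close failed: " + what);
}
```

The header rewrite would be guarded the same way: check `fseek(fp, 0, SEEK_SET) == 0` before calling `checked_fwrite` on the final header.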





@zhanglei1949 zhanglei1949 changed the title refactor: Try to simplify StringColumn's CompactionOnDump, write to file sequentially refactor: Try to simplify StringColumn's CompactionOnDump, write to file in sequential Apr 8, 2026
@zhanglei1949 zhanglei1949 requested a review from Copilot April 8, 2026 02:31
Contributor

Copilot AI left a comment


Pull request overview

This PR refactors StringColumn (TypedColumn<std::string_view>) dump-time compaction logic to simplify CompactionOnDump by writing the compacted .data file sequentially, removing the previous compaction plan/mapping implementation (Fix #177).

Changes:

  • Replaces the previous compaction-plan + offset-remap implementation with a two-path dump: fast path (no stale bytes) vs slow path (manual streaming write).
  • Removes the CompactionPlan/stream_compact_and_dump() helper code and associated unordered_map usage.
  • Introduces a unique_bytes scan to decide whether compaction is needed before dumping.


Comment on lines +327 to +335
    for (size_t i = 0; i < size_; ++i) {
      const auto item = get_string_item(i);
      if (item.length == 0)
        continue;
      fwrite(raw + item.offset, 1, item.length, fout.get());
      MD5_Update(&md5_ctx, raw + item.offset, item.length);
      set_string_item(i, {write_offset, item.length});
      write_offset += item.length;
    }

Copilot AI Apr 8, 2026


Slow-path compaction currently fwrite’s every non-empty slot in index order and updates each slot’s offset to the new write position. This duplicates bytes when multiple slots share the same old offset (e.g., resize(default) makes many rows share one backing string), which can massively increase the dumped .data size when any stale bytes exist. Track a mapping from old_offset -> new_offset and only write each unique offset once, then remap subsequent slots to the already-written location (like the previous implementation did).

Comment on lines +328 to +334
      const auto item = get_string_item(i);
      if (item.length == 0)
        continue;
      fwrite(raw + item.offset, 1, item.length, fout.get());
      MD5_Update(&md5_ctx, raw + item.offset, item.length);
      set_string_item(i, {write_offset, item.length});
      write_offset += item.length;

Copilot AI Apr 8, 2026


The slow-path loop skips item.length == 0 entries via continue and therefore never rewrites their offsets. Empty-string values can legitimately have a non-zero offset (set_value appends 0 bytes but records the current pos), and after compaction the new .data size may be smaller than that old offset, causing get_view()'s offset + length <= data_size assert to fail after reopen. Even for zero-length items, update the stored offset to a valid location within the new compacted buffer (e.g., current write_offset) before continuing.
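The suggested fix can be sketched on a toy item table (hypothetical `Item` struct, not the repo's `string_item`): zero-length entries get their offset moved to a position guaranteed to lie inside the compacted payload, so the `offset + length <= data_size` invariant holds after reopen.

```cpp
#include <cstdint>
#include <vector>

struct Item { uint64_t offset; uint64_t length; };

// Rewrite zero-length items so their offsets stay inside the compacted
// payload. `write_offset` is the compaction loop's current position; any
// offset <= new data size satisfies offset + 0 <= data_size.
void normalize_empty(std::vector<Item>& items, uint64_t write_offset) {
  for (auto& item : items)
    if (item.length == 0)
      item.offset = write_offset;
}
```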

Comment on lines 287 to +305
  void dump(const std::string& filename) override {
    // Compact before dumping. StringColumn uses an append-only strategy for
    // updates, leaving stale copies in data_buffer_. When there is reused
    // data we stream the compacted bytes directly to the output file in a
    // single forward pass, computing MD5 on-the-fly, which avoids:
    // 1. A temporary buffer allocation (effective_size bytes).
    // 2. The memcpy from temp_buf → data_buffer_.
    // 3. The subsequent container Dump() copy.
    // When there is nothing to compact we fall through and let the container
    // handle the write as usual (e.g. reflink / copy_file_range via
    // FileSharedMMap::Dump, or a single fwrite via MMapContainer::Dump).
    if (!items_buffer_ || !data_buffer_) {
      THROW_RUNTIME_ERROR("Buffers not initialized for dumping");
    }
-    resize(size_);  // Resize the string column with avg size to shrink or
-                    // expand data buffer
-    size_t pos_val;
-    if (size_ > 0) {
-      auto plan = prepare_compaction_plan();
-      if (plan.reused_size > 0) {
-        // Stream path: source (data_buffer_) and destination (snapshot file)
-        // are always different files, so there is no aliasing hazard.
-        pos_val = stream_compact_and_dump(plan, filename + ".data");
-        write_file(filename + ".pos", &pos_val, sizeof(pos_val), 1);
-        items_buffer_->Dump(filename + ".items");
-        items_buffer_->Close();
-        data_buffer_->Close();
-        return;
+    size_t unique_bytes = 0;
+    {
+      std::unordered_set<uint64_t> seen;
+      for (size_t i = 0; i < size_; ++i) {
+        const auto item = get_string_item(i);
+        if (item.length > 0 && seen.insert(item.offset).second)
+          unique_bytes += item.length;
      }
    }
-    pos_val = pos_.load();
-    // No-compaction path: dump containers as-is.
+    size_t pos_val = pos_.load();
+    if (unique_bytes == pos_val) {
+      write_file(filename + ".pos", &pos_val, sizeof(pos_val), 1);
+      items_buffer_->Dump(filename + ".items");
+      data_buffer_->Dump(filename + ".data");
+      items_buffer_->Close();
+      data_buffer_->Close();
+      return;
+    }

Copilot AI Apr 8, 2026


dump() no longer calls resize(size_) before dumping. In the no-compaction path (unique_bytes == pos_), this means data_buffer_->Dump() will persist the entire allocated buffer size (which may have grown via set_any() resizes) rather than a right-sized buffer, potentially bloating snapshot files and checkpoint time even when there are no stale bytes. Consider reintroducing the resize(size_) call (or otherwise shrinking data_buffer_ to pos_) before the fast-path dump.
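The concern can be illustrated with a toy buffer (a hypothetical stand-in for `data_buffer_`; the repo's container API is not shown here): right-sizing to the live payload before a whole-buffer dump keeps the snapshot at payload size rather than allocation size.

```cpp
#include <cstddef>
#include <string>

// Toy buffer whose allocated storage may have grown past the live payload.
struct GrowableBuffer {
  std::string bytes;  // allocated storage
  size_t pos = 0;     // live payload size (analogous to pos_)

  void shrink_to_payload() { bytes.resize(pos); }
  // A whole-buffer Dump() would persist this many bytes.
  size_t dump_size() const { return bytes.size(); }
};
```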

Comment on lines +319 to +340
    FileHeader header{};
    fwrite(&header, sizeof(header), 1, fout.get());  // placeholder

    const auto* raw = reinterpret_cast<const char*>(data_buffer_->GetData());
    size_t write_offset = 0;
    MD5_CTX md5_ctx;
    MD5_Init(&md5_ctx);

    for (size_t i = 0; i < size_; ++i) {
      const auto item = get_string_item(i);
      if (item.length == 0)
        continue;
      fwrite(raw + item.offset, 1, item.length, fout.get());
      MD5_Update(&md5_ctx, raw + item.offset, item.length);
      set_string_item(i, {write_offset, item.length});
      write_offset += item.length;
    }

    MD5_Final(header.data_md5, &md5_ctx);
    fseek(fout.get(), 0, SEEK_SET);
    fwrite(&header, sizeof(header), 1, fout.get());
    fout.reset();

Copilot AI Apr 8, 2026


The manual .data writer doesn’t check return values from fwrite()/fseek() (placeholder header write, per-item writes, final header overwrite), and it doesn’t check/propagate close/flush errors. Other dump paths (e.g., MMapContainer::Dump) throw on short writes/seek failures; this path should similarly validate all I/O calls and raise THROW_IO_EXCEPTION to avoid silently producing corrupted snapshots on disk-full or I/O errors.

Comment on lines +306 to +314

    // Slow path: stream each live slot to a new .data file, skipping stale
    // bytes.  Reads from data_buffer_ and writes to a different file — no
    // aliasing, so plain slot order is safe (no sort needed).
    const auto data_path = filename + ".data";
    auto parent = std::filesystem::path(data_path).parent_path();
    if (!parent.empty())
      std::filesystem::create_directories(parent);
    std::unique_ptr<FILE, decltype(&fclose)> fout(

Copilot AI Apr 8, 2026


There are existing table/column dump+reopen tests, but they don’t appear to exercise this new slow-path compaction branch (unique_bytes != pos_). Adding a regression test that (1) creates many rows sharing a default string, (2) updates a row multiple times to create stale bytes, and (3) includes at least one empty-string value, would catch both size blow-ups and offset validity issues in the compaction writer.


Development

Successfully merging this pull request may close these issues.

Refactor string column stream compaction
