Enhance ParquetWriter with atomic writes and improved file naming#2

Merged
dwerner merged 1 commit into main from phaser-parquet-writer
Oct 1, 2025

dwerner (Collaborator) commented Oct 1, 2025

Previously, parquet files were written directly to their final names, which meant incomplete files could exist if writes were interrupted. Additionally, the naming scheme only showed segment IDs, making it difficult to determine actual block ranges without reading file metadata.

Changes:

  • Write to .parquet.tmp files during active writing
  • Atomically rename to .parquet only on successful finalization
  • Enhanced naming: {topic}_{segment_start}-{segment_end}_from_{actual_start}_to_{actual_end}.parquet
  • Track both theoretical segment boundaries and actual data ranges
  • Remove unused Path import
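
The write-then-rename step listed above can be sketched as follows. This is a minimal illustration in Python, not the PR's actual code: the function name `write_atomically` and the use of a raw byte payload in place of a real Parquet writer are assumptions made for the example. It relies on `os.replace`, which renames atomically when the source and destination are on the same filesystem.

```python
import os
from pathlib import Path


def write_atomically(final_path: Path, payload: bytes) -> Path:
    """Write payload to '<final_path>.tmp', then rename it to final_path.

    Hypothetical helper: a real writer would stream Parquet row groups
    into the temporary file instead of writing one byte string.
    """
    # e.g. blocks_0-9999_from_0_to_9900.parquet.tmp
    tmp_path = final_path.with_name(final_path.name + ".tmp")

    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        # Ask the OS to push the data to disk before the rename.
        os.fsync(f.fileno())

    # The final name only ever appears once the file is complete.
    os.replace(tmp_path, final_path)
    return final_path
```

If the process dies before `os.replace` runs, only the `.parquet.tmp` file is left behind, so readers that look for `.parquet` never see a partial file.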

Benefits:

  • Incomplete files are clearly identifiable (.tmp extension)
  • File names show exact block range without reading metadata
  • Easy to spot files split due to size limits
  • Atomic completion prevents corrupted final files
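
Because unfinished output keeps the `.tmp` suffix, leftovers from an interrupted run can be found with a simple glob. The sketch below assumes all files sit in one flat output directory; `find_incomplete` is a hypothetical helper, not part of the PR.

```python
from pathlib import Path


def find_incomplete(directory: Path) -> list[Path]:
    """Return leftover '*.parquet.tmp' files in directory, sorted by name."""
    return sorted(directory.glob("*.parquet.tmp"))
```

A startup check could log or delete whatever this returns before workers resume writing.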

Example filenames:

  • blocks_0-9999_from_0_to_9900.parquet
  • blocks_500000-509999_from_500002_to_509902.parquet (split due to size)
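
To illustrate how the naming scheme exposes block ranges without opening the file, here is a small sketch that builds and parses names of this shape. The `SegmentFile` type and both function names are assumptions for the example; only the filename pattern itself comes from the PR description.

```python
import re
from typing import NamedTuple, Optional


class SegmentFile(NamedTuple):
    topic: str
    segment_start: int  # theoretical segment boundary
    segment_end: int
    actual_start: int   # first block actually present in the file
    actual_end: int     # last block actually present in the file


_NAME_RE = re.compile(
    r"^(?P<topic>.+)_(?P<segment_start>\d+)-(?P<segment_end>\d+)"
    r"_from_(?P<actual_start>\d+)_to_(?P<actual_end>\d+)\.parquet$"
)


def format_name(f: SegmentFile) -> str:
    """Build a filename following the pattern from the PR description."""
    return (
        f"{f.topic}_{f.segment_start}-{f.segment_end}"
        f"_from_{f.actual_start}_to_{f.actual_end}.parquet"
    )


def parse_name(name: str) -> Optional[SegmentFile]:
    """Parse a finalized filename; return None for anything else (e.g. .tmp)."""
    m = _NAME_RE.match(name)
    if m is None:
        return None
    return SegmentFile(
        m.group("topic"),
        int(m.group("segment_start")),
        int(m.group("segment_end")),
        int(m.group("actual_start")),
        int(m.group("actual_end")),
    )
```

Since the pattern is anchored on `.parquet`, names that still end in `.parquet.tmp` do not parse, which keeps in-progress files out of any listing built on `parse_name`.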

Tested against an Erigon instance with a 1M-block sync across 4 parallel workers.

@dwerner dwerner merged commit 3e6b1b4 into main Oct 1, 2025
@dwerner dwerner deleted the phaser-parquet-writer branch October 2, 2025 18:24