Enhance ParquetWriter with atomic writes and improved file naming by dwerner · Pull Request #2 · edgeandnode/phaser-bridge

dwerner · 2025-10-01T19:14:53Z

Previously, parquet files were written directly to their final names, which meant incomplete files could exist if writes were interrupted. Additionally, the naming scheme only showed segment IDs, making it difficult to determine actual block ranges without reading file metadata.

Changes:

Write to .parquet.tmp files during active writing
Atomically rename to .parquet only on successful finalization
Enhanced naming: {topic}_{segment_start}-{segment_end}_from_{actual_start}_to_{actual_end}.parquet
Track both theoretical segment boundaries and actual data ranges
Remove unused Path import
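
The write-then-rename step above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `write_atomically` is a hypothetical helper, and a plain byte slice stands in for the parquet data that the real writer would stream. It also assumes the temporary and final paths live on the same filesystem, since `std::fs::rename` can only be expected to replace the destination atomically in that case (on POSIX systems).

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper (not from the PR): write `bytes` to `<final_path>.tmp`,
/// then rename the temporary file to `final_path` once the write succeeded.
fn write_atomically(final_path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Build the temporary name by appending ".tmp", e.g. "x.parquet" -> "x.parquet.tmp".
    let mut tmp_name = final_path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    // All writing happens against the temporary file. If the process is
    // interrupted here, only the ".tmp" file is left behind.
    let mut file = File::create(&tmp_path)?;
    file.write_all(bytes)?;
    file.sync_all()?; // flush data to disk before publishing the file
    drop(file);

    // Publish the finished file under its final name.
    fs::rename(&tmp_path, final_path)
}
```

With this shape, readers that only look for `*.parquet` never observe a partially written file; leftover `*.parquet.tmp` files mark writes that did not finish.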

Benefits:

Incomplete files are clearly identifiable (.tmp extension)
File names show exact block range without reading metadata
Easy to spot files split due to size limits
Atomic completion prevents corrupted final files

Example filenames:

blocks_0-9999_from_0_to_9900.parquet
blocks_500000-509999_from_500002_to_509902.parquet (split due to size)
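
As a rough sketch of how names like these can be produced and read back without opening the file, assuming the pattern `{topic}_{segment_start}-{segment_end}_from_{actual_start}_to_{actual_end}.parquet` and unsigned 64-bit block numbers. `FileRange`, `parquet_file_name`, and `parse_file_name` are hypothetical names for illustration and are not taken from the PR.

```rust
/// Hypothetical view of the ranges encoded in a file name.
#[derive(Debug, PartialEq)]
struct FileRange {
    topic: String,
    segment_start: u64,
    segment_end: u64,
    actual_start: u64,
    actual_end: u64,
}

/// Format a final file name from the segment boundaries and the actual data range.
fn parquet_file_name(r: &FileRange) -> String {
    format!(
        "{}_{}-{}_from_{}_to_{}.parquet",
        r.topic, r.segment_start, r.segment_end, r.actual_start, r.actual_end
    )
}

/// Recover the ranges from a file name. Returns `None` for names that do not
/// match the pattern, including in-progress `.parquet.tmp` files.
fn parse_file_name(name: &str) -> Option<FileRange> {
    let stem = name.strip_suffix(".parquet")?;
    // Split from the right so a topic containing underscores still parses.
    let (rest, actual_end) = stem.rsplit_once("_to_")?;
    let (rest, actual_start) = rest.rsplit_once("_from_")?;
    let (topic, segment) = rest.rsplit_once('_')?;
    let (segment_start, segment_end) = segment.split_once('-')?;
    Some(FileRange {
        topic: topic.to_string(),
        segment_start: segment_start.parse().ok()?,
        segment_end: segment_end.parse().ok()?,
        actual_start: actual_start.parse().ok()?,
        actual_end: actual_end.parse().ok()?,
    })
}
```

Parsing from the right is a design choice made here so that only the numeric fields need a fixed shape; whether topics can contain underscores in practice is not stated in the PR.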

Tested against an Erigon instance with a 1M-block sync across 4 parallel workers.


dwerner merged commit 3e6b1b4 into main Oct 1, 2025

dwerner deleted the phaser-parquet-writer branch October 2, 2025 18:24
