More sync fixes #14

Merged
dwerner merged 8 commits into main from more-sync
Oct 27, 2025

@dwerner dwerner commented Oct 27, 2025

Sync to parquet should work as of this PR.

Previously, parquet files were named with the actual data range, which could
create confusing gaps when blocks had no data. For example:
- transactions_from_2105678_to_2287399.parquet (gap before 2105678)
- transactions_from_2287400_to_2499998.parquet (gap after 2499998)

This made it difficult to determine if a segment was complete without reading
file metadata.

Changes:

1. ParquetWriter:
   - Files now named: {type}_from_{segment_start}_to_{segment_end}_{seq}.parquet
   - Added sequence_number field that increments on each rotation
   - Removed finalize_with_requested_range() - simplified to finalize_current_file()
   - Updated metadata to store both:
     - phaser.segment_start/end - The full segment range this file belongs to
     - phaser.file_start/end - The contiguous block range this file covers

2. Data Scanner:
   - Reads phaser.file_start/end metadata to determine coverage
   - Falls back to statistics if metadata not present

3. Worker:
   - Simplified to call finalize_current_file() instead of finalize_with_requested_range()
   - Already passes segment range via set_block_range()

Benefits:
- No filename gaps - all files clearly belong to their segment
- Simple rotation - just increment sequence counter
- Clear segment ownership - easy to identify which files belong to which segment
- Metadata preserves actual coverage - scanner reads footer to know exact ranges
- No intermediate file complexity - every file follows same pattern

Example new naming:
  logs_from_2000000_to_2499999_0.parquet (segment 4, sequence 0)
  logs_from_2000000_to_2499999_1.parquet (segment 4, sequence 1)
  logs_from_2000000_to_2499999_2.parquet (segment 4, sequence 2)
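
The naming pattern can be sketched in a few lines of Rust. This is only an illustration of the `{type}_from_{segment_start}_to_{segment_end}_{seq}.parquet` scheme described here; the function names `segment_file_name` and `parse_segment_file_name` are hypothetical and are not taken from ParquetWriter.

```rust
/// Build a file name of the form
/// `{type}_from_{segment_start}_to_{segment_end}_{seq}.parquet`.
/// (Hypothetical helper; the real ParquetWriter code is not shown in this PR text.)
fn segment_file_name(data_type: &str, segment_start: u64, segment_end: u64, seq: u32) -> String {
    format!("{data_type}_from_{segment_start}_to_{segment_end}_{seq}.parquet")
}

/// Parse a name produced by `segment_file_name`.
/// Returns `None` for anything else, including old-style names
/// that have no trailing sequence number.
fn parse_segment_file_name(name: &str) -> Option<(String, u64, u64, u32)> {
    let stem = name.strip_suffix(".parquet")?;
    let (data_type, rest) = stem.split_once("_from_")?;
    let (start, rest) = rest.split_once("_to_")?;
    let (end, seq) = rest.split_once('_')?;
    Some((
        data_type.to_string(),
        start.parse().ok()?,
        end.parse().ok()?,
        seq.parse().ok()?,
    ))
}

fn main() {
    // Three rotations inside segment 4 share the same segment range.
    for seq in 0..3 {
        println!("{}", segment_file_name("logs", 2_000_000, 2_499_999, seq));
    }
    println!(
        "{:?}",
        parse_segment_file_name("logs_from_2000000_to_2499999_1.parquet")
    );
}
```

Because the segment range is fixed for every file in a segment, rotation only changes the trailing sequence number.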

Each file's metadata contains the actual block range it covers for gap detection.

The existing metadata only tracked segment boundaries and actual data ranges,
which caused gap detection to incorrectly identify missing data when blocks
had no transactions/logs at segment boundaries.

For example, if transactions start at block 46147 in segment 0 (0-499999),
gap detection would think blocks 0-46146 were missing, when in reality those
blocks simply had no transactions.

Changes:

1. Add phaser-parquet-metadata crate:
   - PhaserMetadata struct with three range types:
     * segment_start/end: The 500K block segment boundary
     * responsibility_start/end: Blocks this file is responsible for covering
     * data_start/end: Actual blocks that have data
   - In-place footer mutation using parquet-rs footer rewriting
   - Serialization to JSON in 'phaser.meta' key-value metadata

2. Add parquet-meta CLI tool:
   - show: Display metadata from Parquet files
   - fix-meta: Update metadata on existing files
   - Supports --infer flag to read data ranges from statistics
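
A rough sketch of the three range types, assuming plain `u64` block numbers. The struct name `PhaserMetadataSketch`, the containment check, and the JSON field names are assumptions made for illustration; the actual `PhaserMetadata` layout and its serialized form are not shown in this PR text.

```rust
/// Stand-in for the three-range metadata described above (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct PhaserMetadataSketch {
    segment_start: u64,
    segment_end: u64,
    responsibility_start: u64,
    responsibility_end: u64,
    data_start: u64,
    data_end: u64,
}

impl PhaserMetadataSketch {
    /// Assumed invariant: data range within responsibility range within segment range.
    fn is_consistent(&self) -> bool {
        self.segment_start <= self.responsibility_start
            && self.responsibility_start <= self.data_start
            && self.data_start <= self.data_end
            && self.data_end <= self.responsibility_end
            && self.responsibility_end <= self.segment_end
    }

    /// Hand-rolled JSON so the sketch needs no external crates.
    /// The field names are illustrative, not the crate's real schema.
    fn to_json(&self) -> String {
        format!(
            "{{\"segment_start\":{},\"segment_end\":{},\"responsibility_start\":{},\"responsibility_end\":{},\"data_start\":{},\"data_end\":{}}}",
            self.segment_start,
            self.segment_end,
            self.responsibility_start,
            self.responsibility_end,
            self.data_start,
            self.data_end
        )
    }
}

fn main() {
    let meta = PhaserMetadataSketch {
        segment_start: 0,
        segment_end: 499_999,
        responsibility_start: 0,
        responsibility_end: 499_999,
        data_start: 46_147,
        data_end: 499_998,
    };
    println!("consistent: {}", meta.is_consistent());
    // A string like this would be stored under the 'phaser.meta' key.
    println!("{}", meta.to_json());
}
```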

Benefits:
- Accurate gap detection even with sparse data
- Files indicate responsibility ranges for complete coverage
- Backward compatible - falls back to statistics if metadata missing
- Can update metadata on existing files without rewriting data

Example metadata:
  Segment: 0-499999 (full segment boundary)
  Responsibility: 0-499999 (covers entire segment including gaps)
  Data range: 46147-499998 (actual data present)

When files rotate within a segment, the responsibility ranges must remain
contiguous to ensure complete coverage. Previously, files only tracked
their actual data ranges, causing gaps in responsibility when rotations
occurred.

Changes:

1. Add responsibility_start field to CurrentFile:
   - Tracks first block this file is responsible for
   - May differ from start_block (first block with data)
   - Set based on sequence_number at file creation

2. Responsibility logic:
   - First file (seq=0): responsibility_start = segment_start
   - Subsequent files: responsibility_start = start_block
   - Last file: responsibility_end = segment_end if the file's last data block is within one block of segment_end
   - Handles off-by-one at segment boundaries (e.g., data ends at 499998, segment ends at 499999)

3. Write metadata on finalization:
   - Call PhaserMetadata::update_file_metadata() after renaming
   - Store all three range types in file footer
   - Warn if metadata update fails but don't fail the write
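
The responsibility rules above can be sketched as a small pure function. This is an interpretation under stated assumptions: `BlockRange` and `responsibility_for` are hypothetical names, "within last block" is read as "data ends within one block of segment_end", and non-final files are assumed to end their responsibility at their last data block.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct BlockRange {
    start: u64,
    end: u64,
}

/// Derive a file's responsibility range from its sequence number,
/// its data range, and the segment it belongs to (illustrative only).
fn responsibility_for(seq: u32, data: BlockRange, segment: BlockRange) -> BlockRange {
    // The first file in a segment also covers any empty blocks before its first data block.
    let start = if seq == 0 { segment.start } else { data.start };
    // A file whose data ends within one block of the segment end
    // takes responsibility up to the segment end.
    let end = if data.end >= segment.end.saturating_sub(1) {
        segment.end
    } else {
        data.end
    };
    BlockRange { start, end }
}

fn main() {
    let segment = BlockRange { start: 0, end: 499_999 };
    // Made-up rotation points for a segment whose first data block is 46147.
    let files = [
        BlockRange { start: 46_147, end: 100_000 },
        BlockRange { start: 100_001, end: 250_000 },
        BlockRange { start: 250_001, end: 499_998 },
    ];
    for (seq, data) in files.iter().enumerate() {
        let r = responsibility_for(seq as u32, *data, segment);
        println!(
            "seq={seq}: responsibility {}-{}, data {}-{}",
            r.start, r.end, data.start, data.end
        );
    }
}
```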

Example with file rotation in segment 0 (0-499999):
  seq=0: responsibility 0-100000, data 46147-100000
  seq=1: responsibility 100001-250000, data 100001-250000
  seq=2: responsibility 250001-499999, data 250001-499998

Note: responsibility_end extends to 499999 even though data ends at 499998,
ensuring complete segment coverage.

Workers were passing gap ranges (from_block, to_block) to set_block_range(),
which caused metadata to incorrectly reflect the gap being synced rather than
the full segment boundary.

For example, when syncing gap 46147-499999 in segment 0, metadata showed:
  Segment: 46147-499999 (wrong - should be 0-499999)
  Responsibility: 46147-499998 (wrong - should be 0-499999)

This broke gap detection because files didn't indicate they covered the full
segment including blocks with no data.

Changes:

1. sync_blocks_range: Use self.from_block/to_block instead of from_block/to_block
2. sync_transactions_range: Use self.from_block/to_block instead of from_block/to_block
3. sync_logs_range: Use self.from_block/to_block instead of from_block/to_block

The worker's from_block/to_block fields contain the segment boundaries (e.g., 0-499999),
while the function parameters contain the gap ranges (e.g., 46147-499999).
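
A minimal sketch of the distinction, with stand-in types. `RecordingWriter`, `SegmentWorker`, and `sync_range` are hypothetical; only the names `set_block_range`, `from_block`, and `to_block` come from the description above, and the real signatures may differ.

```rust
/// Stand-in writer that only records the range passed to set_block_range().
#[derive(Default)]
struct RecordingWriter {
    segment_range: Option<(u64, u64)>,
}

impl RecordingWriter {
    fn set_block_range(&mut self, start: u64, end: u64) {
        self.segment_range = Some((start, end));
    }
}

/// Stand-in worker: the fields hold the segment boundaries.
struct SegmentWorker {
    from_block: u64,
    to_block: u64,
}

impl SegmentWorker {
    /// The parameters are the gap being synced and shadow nothing on `self`;
    /// the metadata range must come from the worker's own fields.
    fn sync_range(&self, writer: &mut RecordingWriter, from_block: u64, to_block: u64) -> u64 {
        // Segment boundary for metadata, not the gap range.
        writer.set_block_range(self.from_block, self.to_block);
        // Number of blocks actually requested for this gap.
        to_block - from_block + 1
    }
}

fn main() {
    let worker = SegmentWorker { from_block: 0, to_block: 499_999 };
    let mut writer = RecordingWriter::default();
    let requested = worker.sync_range(&mut writer, 46_147, 499_999);
    println!("metadata range: {:?}, blocks requested: {requested}", writer.segment_range);
}
```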

After fix, metadata correctly shows:
  Segment: 0-499999 (correct segment boundary)
  Responsibility: 0-499999 (covers entire segment)
  Data range: 46147-499998 (actual data)

The existing gap detection used data statistics (min/max block numbers) to
determine coverage, which incorrectly identified gaps when blocks had no
transactions/logs.

For example, if segment 0 (0-499999) only had transactions starting at block
46147, the scanner would think blocks 0-46146 were missing.

Changes:

1. Read PhaserMetadata from file footers:
   - Use PhaserMetadata::read_from_file() to get responsibility ranges
   - Fall back to statistics if metadata not present
   - Handle both new and old file formats gracefully

2. Use responsibility ranges for gap detection:
   - Build coverage map from responsibility ranges, not data ranges
   - Properly handle file sequences within segments
   - Detect gaps by checking continuity of responsibility ranges

3. Improved segment analysis:
   - Files with responsibility covering full segment mark segment complete
   - Multiple files with contiguous responsibilities mark segment complete
   - Only report actual gaps in responsibility coverage
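
The continuity check can be sketched as a sweep over sorted ranges. This is one plausible way to do it, not necessarily what the scanner does; `coverage_range` and `find_gaps` are hypothetical names, and ranges are inclusive `(start, end)` pairs.

```rust
/// Prefer the responsibility range from metadata; fall back to the
/// statistics-derived data range for old files without metadata.
fn coverage_range(responsibility: Option<(u64, u64)>, stats: (u64, u64)) -> (u64, u64) {
    responsibility.unwrap_or(stats)
}

/// Return the inclusive block ranges inside `segment` that no file covers.
fn find_gaps(segment: (u64, u64), mut covered: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    covered.sort_unstable();
    let mut gaps = Vec::new();
    let mut next = segment.0; // first block not yet known to be covered
    for (start, end) in covered {
        if start > end || end < next {
            continue; // malformed, or already covered by an earlier range
        }
        if start > segment.1 {
            break; // everything from here on lies past the segment
        }
        if start > next {
            gaps.push((next, start - 1));
        }
        if end >= segment.1 {
            return gaps; // covered through the end of the segment
        }
        next = end + 1;
    }
    gaps.push((next, segment.1));
    gaps
}

fn main() {
    let segment = (0, 499_999);
    // Old file without metadata: falls back to statistics and reports gaps.
    let old = coverage_range(None, (46_147, 499_998));
    println!("old file: {:?}", find_gaps(segment, vec![old]));
    // New file with a responsibility range: no gaps.
    let new = coverage_range(Some((0, 499_999)), (46_147, 499_998));
    println!("new file: {:?}", find_gaps(segment, vec![new]));
}
```

With this approach a segment is complete exactly when `find_gaps` returns an empty list, whether coverage comes from one file or from several contiguous ones.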

Benefits:
- No false positives for blocks with no data
- Accurate gap detection across file rotations
- Handles sparse data correctly (early blocks, genesis, etc.)
- Backward compatible with old files

Example:
Old behavior: Segment 0 incomplete, missing blocks 0-46146
New behavior: Segment 0 complete (file has responsibility 0-499999)

Previously, the only way to check for gaps was to start a sync job, which
would perform analysis as a side effect. This wasn't ideal for inspecting
data coverage without actually syncing.

Changes:

1. Add analyze subcommand:
   - Performs gap analysis without starting a sync job
   - Same parameters as sync (chain-id, bridge, from, to)
   - Prints detailed gap report

2. Display format:
   - Shows complete vs incomplete segments
   - Lists specific gap ranges for incomplete segments
   - Provides segment list to sync
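
The segment arithmetic behind such a report can be sketched as follows, assuming 500K-block segments and an inclusive `--from`/`--to` range. `segments_in` and `completion_percent` are hypothetical helpers, not part of phaser-cli.

```rust
/// Segment size used throughout this PR (500K blocks).
const SEGMENT_SIZE: u64 = 500_000;

/// List (index, first_block, last_block) for every segment touched
/// by the inclusive block range `from..=to`.
fn segments_in(from: u64, to: u64) -> Vec<(u64, u64, u64)> {
    ((from / SEGMENT_SIZE)..=(to / SEGMENT_SIZE))
        .map(|i| (i, i * SEGMENT_SIZE, (i + 1) * SEGMENT_SIZE - 1))
        .collect()
}

/// Share of complete segments, as a percentage.
fn completion_percent(complete: usize, total: usize) -> f64 {
    if total == 0 {
        return 0.0;
    }
    100.0 * complete as f64 / total as f64
}

fn main() {
    let segments = segments_in(0, 10_000_000);
    let complete = 15; // made-up count for illustration
    println!("Total segments: {}", segments.len());
    println!(
        "Complete: {} ({:.1}%)",
        complete,
        completion_percent(complete, segments.len())
    );
    let (index, start, end) = segments[5];
    println!("Segment {index} (blocks {start}-{end})");
}
```

Block 10000000 falls into its own segment, which is why an inclusive range of 0 to 10000000 touches 21 segments rather than 20.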

Example usage:
  phaser-cli analyze --chain-id 1 --bridge erigon --from 0 --to 10000000

Output:
  Gap Analysis:
    Total segments: 21
    Complete: 15 (71.4%)
    Missing: 6

    Incomplete segments:
      Segment 5 (blocks 2500000-2999999): missing transactions

Add comprehensive documentation explaining the three-tier metadata system
and how it enables accurate gap detection.

Covers:
- Detailed explanation of segment, responsibility, and data ranges
- Gap detection algorithm description
- File rotation handling
- Examples showing metadata in action

- Regenerate gRPC proto files with latest prost/tonic tooling
- Add segment_start/segment_end to segment worker debug logs
  for easier debugging of segment boundary issues
@dwerner dwerner merged commit e5ba300 into main Oct 27, 2025
@dwerner dwerner deleted the more-sync branch October 27, 2025 18:22