Copilot AI commented Sep 15, 2025

This PR implements periodic diff logging as data streams are consumed, addressing the need for real-time monitoring of data comparison progress and early detection of differences in large datasets.

Key Features Added

🔧 Enhanced Configuration System

  • New RunConfig structure supporting dual source configuration
  • Flexible periodic reporting settings via runConfig.yaml or command-line parameters
  • Dual trigger system: time-based (seconds) and record-count-based intervals
  • Automatic defaults: 30 seconds and 1000 records when not specified
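The default handling described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the `PeriodicConfig` type, its field names, and the `withDefaults` helper are assumptions made for illustration, and YAML loading is left out so the example needs only the standard library.

```go
package main

import "fmt"

// PeriodicConfig mirrors the periodic reporting settings.
// The type and field names are hypothetical, chosen for this sketch.
type PeriodicConfig struct {
	Enabled             bool
	TimeIntervalSeconds int
	RecordInterval      int
}

// Defaults stated in the PR description: 30 seconds and 1000 records.
const (
	defaultTimeIntervalSeconds = 30
	defaultRecordInterval      = 1000
)

// withDefaults returns a copy of p in which unset (zero or negative)
// intervals are replaced by the defaults. Explicit values are kept.
func withDefaults(p PeriodicConfig) PeriodicConfig {
	if p.TimeIntervalSeconds <= 0 {
		p.TimeIntervalSeconds = defaultTimeIntervalSeconds
	}
	if p.RecordInterval <= 0 {
		p.RecordInterval = defaultRecordInterval
	}
	return p
}

func main() {
	// Only the record interval is given; the time interval falls back to 30.
	p := withDefaults(PeriodicConfig{Enabled: true, RecordInterval: 500})
	fmt.Printf("%+v\n", p)
}
```

Treating zero as "not specified" is a simplification; a real implementation might need pointer fields to distinguish an explicit zero from a missing value.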

📊 Stream Comparison Engine

  • Real-time diff detection as records are processed from both sources
  • Progressive reporting showing comparison statistics at configurable intervals
  • Key-based record matching with field-level difference analysis
  • Structured YAML output with timestamps and detailed diff categorization
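To make the field-level difference analysis concrete, here is a small sketch of comparing two records that share a key. The `FieldDiff` type and `diffRecords` function are hypothetical names for this example, and treating a field that is missing on one side as an empty string is an assumption, not something the PR specifies.

```go
package main

import (
	"fmt"
	"sort"
)

// FieldDiff describes one field whose value differs between the sources.
// The shape follows the YAML report fields (field, source1_value,
// source2_value); the Go names are assumptions.
type FieldDiff struct {
	Field        string
	Source1Value string
	Source2Value string
}

// diffRecords compares two records with the same key and returns the
// differing fields, sorted by field name for stable output.
// Assumption: a field absent from one record compares as "".
func diffRecords(r1, r2 map[string]string) []FieldDiff {
	fields := make(map[string]struct{})
	for f := range r1 {
		fields[f] = struct{}{}
	}
	for f := range r2 {
		fields[f] = struct{}{}
	}
	names := make([]string, 0, len(fields))
	for f := range fields {
		names = append(names, f)
	}
	sort.Strings(names)

	var diffs []FieldDiff
	for _, f := range names {
		if r1[f] != r2[f] {
			diffs = append(diffs, FieldDiff{Field: f, Source1Value: r1[f], Source2Value: r2[f]})
		}
	}
	return diffs
}

func main() {
	s1 := map[string]string{"user_id": "123", "age": "30"}
	s2 := map[string]string{"user_id": "123", "age": "31"}
	for _, d := range diffRecords(s1, s2) {
		fmt.Printf("%s: %q vs %q\n", d.Field, d.Source1Value, d.Source2Value)
	}
}
```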

🖥️ Command Line Interface

The tool now supports both configuration file and direct parameter approaches:

Using config file:

./stream-diff -config runConfig.yaml -key user_id

Using CLI parameters:

./stream-diff \
  -source1 data/source1.csv \
  -source2 data/source2.csv \
  -key user_id \
  -enable-periodic \
  -time-interval 30 \
  -record-interval 1000
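For readers curious how these flags might be wired up, the sketch below parses them with Go's standard `flag` package. The flag names come from the commands above; the `cliOptions` struct, the `parseArgs` helper, and the help strings are assumptions for illustration and may differ from the actual implementation.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// cliOptions holds the parsed command-line values (hypothetical struct).
type cliOptions struct {
	ConfigPath     string
	Source1        string
	Source2        string
	Key            string
	EnablePeriodic bool
	TimeInterval   int
	RecordInterval int
}

// parseArgs parses the flags shown in the usage examples. The interval
// defaults match the documented 30 seconds and 1000 records.
func parseArgs(args []string) (cliOptions, error) {
	var o cliOptions
	fs := flag.NewFlagSet("stream-diff", flag.ContinueOnError)
	fs.StringVar(&o.ConfigPath, "config", "", "path to runConfig.yaml")
	fs.StringVar(&o.Source1, "source1", "", "path to the first source")
	fs.StringVar(&o.Source2, "source2", "", "path to the second source")
	fs.StringVar(&o.Key, "key", "", "field used to match records")
	fs.BoolVar(&o.EnablePeriodic, "enable-periodic", false, "enable periodic reports")
	fs.IntVar(&o.TimeInterval, "time-interval", 30, "seconds between reports")
	fs.IntVar(&o.RecordInterval, "record-interval", 1000, "records between reports")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	return o, nil
}

func main() {
	opts, err := parseArgs(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", opts)
}
```

How values from the config file and from flags are combined when both are given is not covered by this sketch.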

📈 Example Output

Console shows real-time progress:

[PERIODIC] 15:04:05 - Records: 1000, Matching: 950, Identical: 900, Diffs: 50
[PERIODIC] 15:04:35 - Records: 2000, Matching: 1900, Identical: 1800, Diffs: 100

Reports are saved as timestamped YAML files with detailed breakdowns:

timestamp: 2025-09-15T15:04:05Z
records_processed: 1000
matching_keys: 950
identical_rows: 900
value_diffs_by_key:
  "123":
    - field: age
      source1_value: "30"
      source2_value: "31"

Configuration Example

source1:
  type: csv
  path: data/source1.csv
source2:
  type: csv
  path: data/source2.csv
output:
  final_report: final_report.yaml
  periodic_reports: periodic_reports
periodic:
  enabled: true
  time_interval_seconds: 30
  record_interval: 1000

Implementation Details

  • Minimal changes: Extended existing config and datareader packages without breaking compatibility
  • Comprehensive testing: Added test coverage for both time-based and record-based triggers
  • Error handling: Graceful handling of I/O errors and malformed data
  • Flexible triggers: A report is generated as soon as either the time threshold or the record threshold is met, whichever comes first
  • Clean separation: New comparator package isolates comparison logic from data reading

This implementation provides the requested periodic diff logging while preserving the existing architecture. All settings can be supplied through runConfig.yaml, through command-line parameters, or both.



Copilot AI and others added 2 commits September 15, 2025 01:25
…iggers

Co-authored-by: clickstefan <5954967+clickstefan@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add an option to log periodic diffs as the stream is consumed. Ideally configured by amount of time passed & minimum amout of records. All configuration should be part of a runConfig.yaml and / or run params." to "Add periodic diff logging functionality with configurable time and record intervals" Sep 15, 2025
Copilot AI requested a review from clickstefan September 15, 2025 01:28
Copilot finished work on behalf of clickstefan September 15, 2025 01:28