systems comparison engine

Douwe de Vries edited this page Jul 1, 2026 · 2 revisions

Comparison engine

Active contributors: Douwe de Vries

Purpose

The comparison engine turns two loaded CSV datasets and a validated comparison configuration into row-level outcomes and summary counts. It owns exact row-key matching, optional flexible key matching, duplicate and missing row classification, value normalization, and the result summaries consumed by backend responses and the frontend results table.

Directory layout

| Area | Paths | Responsibility |
| --- | --- | --- |
| Engine orchestration | src/comparison/engine.rs | Resolves selected columns, groups rows by keys, chooses exact or flexible matching, creates row results, and generates summaries. |
| Row helpers | src/comparison/rows.rs | Resolves column selections, splits usable and unkeyed rows, tokenizes flexible keys, and classifies flexible matches. |
| Value comparison | src/comparison/value_compare.rs | Applies cleanup rules, compares values, handles JSON equality, and builds value differences. |
| Mapping suggestions | src/comparison/mapping.rs | Suggests exact and fuzzy column mappings from headers and optional cell profiles. |
| Data types | src/data/types.rs | Defines ComparisonConfig, ComparisonNormalizationConfig, RowComparisonResult, ValueDifference, and ComparisonSummary. |
| Response serialization | src/presentation/responses.rs | Converts domain results and summary values into frontend DTOs. |
| Request validation | src/backend/validation.rs | Builds a safe ComparisonConfig before the engine runs on user input. |

Key abstractions

| Abstraction | Defined in | Notes |
| --- | --- | --- |
| ComparisonConfig | src/data/types.rs | Validated row keys, comparison columns, mappings, and normalization rules. |
| ComparisonNormalizationConfig | src/data/types.rs | Missing-value, flexible-key, text, numeric, decimal, and date cleanup options. |
| KeyedRows | src/comparison/rows.rs | A normalized key, display key, matching row indices, and source order. |
| FlexibleKeyMatch | src/comparison/rows.rs | Exact, wildcard, boundary, anchored-token, and weak shared-token match classes. |
| RowComparisonResult | src/data/types.rs | Match, mismatch, missing-left, missing-right, unkeyed-left, unkeyed-right, and duplicate outcomes. |
| ValueDifference | src/data/types.rs | Per-column mismatch detail with File A and File B values. |
| ComparisonSummary | src/data/types.rs | Counts used by the UI summary and saved snapshots. |

How it works

flowchart TD
    Request[CompareRequest] --> Validation[src/backend/validation.rs]
    Validation --> Config[ComparisonConfig]
    Config --> Selections[Resolve physical and virtual columns]
    Selections --> SplitA[Split File A by usable normalized keys]
    Selections --> SplitB[Split File B by usable normalized keys]
    SplitA --> UnkeyedA[Unkeyed File A rows]
    SplitB --> UnkeyedB[Unkeyed File B rows]
    SplitA --> Mode{Flexible key matching?}
    SplitB --> Mode
    Mode -->|No| Exact[Exact key lookup]
    Mode -->|Yes| Flexible[Flexible candidate selection]
    Exact --> Results[RowComparisonResult list]
    Flexible --> Results
    UnkeyedA --> Results
    UnkeyedB --> Results
    Results --> Values[Value normalization and differences]
    Values --> Summary[ComparisonSummary]
  1. src/backend/validation.rs validates the incoming CompareRequest and builds a ComparisonConfig.
  2. src/comparison/engine.rs resolves selected key and comparison labels through src/comparison/rows.rs, including virtual JSON labels from src/data/json_fields.rs.
  3. Each side is split by split_rows_by_key_usable. A row is usable only when every selected key component normalizes to a non-missing value.
  4. Rows with unusable keys are emitted as unkeyed results before normal matching.
  5. Exact mode pairs only equal normalized keys. Flexible mode evaluates possible pairs and selects a preferred one-to-one set.
  6. Matched key groups with more than one row on either side become duplicate results instead of normal value comparisons.
  7. Single-row matched groups are compared by find_differences in src/comparison/value_compare.rs.
  8. generate_summary counts the resulting categories and preserves duplicate counts by side.
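Step 3 can be illustrated with a minimal sketch. It is not the real split_rows_by_key_usable: the function names, the row representation (a Vec of String cells), and the simplified normalization are all assumptions made for this example.

```rust
/// Hypothetical, simplified key normalization: trim, then treat empty strings
/// and a few null tokens as missing. The real rules live in value_compare.rs.
fn normalize_key_part(raw: &str) -> Option<String> {
    let trimmed = raw.trim();
    let lowered = trimmed.to_lowercase();
    if trimmed.is_empty() || matches!(lowered.as_str(), "null" | "na" | "n/a" | "none") {
        None
    } else {
        Some(trimmed.to_string())
    }
}

/// Splits rows into (usable, unkeyed). Usable entries carry the source row
/// index and the normalized key; unkeyed entries carry only the row index.
fn split_rows_sketch(
    rows: &[Vec<String>],
    key_columns: &[usize],
) -> (Vec<(usize, Vec<String>)>, Vec<usize>) {
    let mut usable = Vec::new();
    let mut unkeyed = Vec::new();
    for (index, row) in rows.iter().enumerate() {
        // Collecting into Option<Vec<_>> yields None if any key component is missing.
        let key: Option<Vec<String>> = key_columns
            .iter()
            .map(|&col| row.get(col).and_then(|cell| normalize_key_part(cell)))
            .collect();
        match key {
            Some(key) => usable.push((index, key)),
            None => unkeyed.push(index),
        }
    }
    (usable, unkeyed)
}
```

Collecting into an Option keeps the "every component must be present" rule in one place, so a composite key with a single missing part sends the whole row to the unkeyed list.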

Exact matching

Exact matching is the default. src/comparison/engine.rs builds maps from normalized key vectors to KeyedRows for File A and File B.

  • A key present once on both sides becomes either Match or Mismatch.
  • A key present only in File A becomes MissingRight, shown as only in File A.
  • A key present only in File B becomes MissingLeft, shown as only in File B.
  • A key with multiple File A rows, multiple File B rows, or both becomes Duplicate.
  • Output is stable because key groups are processed by their first source-row index.
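The exact-mode rules above can be sketched as follows. This is an illustration under assumptions, not the code in src/comparison/engine.rs: the Outcome enum, the function names, and the choice to emit File A key groups first and File B-only groups afterwards are all hypothetical.

```rust
use std::collections::{HashMap, HashSet};

type Key = Vec<String>;

#[derive(Debug, PartialEq)]
enum Outcome {
    Paired,       // one row on each side; becomes Match or Mismatch after value comparison
    MissingRight, // key only in File A
    MissingLeft,  // key only in File B
    Duplicate,    // more than one row for the key on either side
}

/// Counts how many rows carry each normalized key.
fn count_keys(keys: &[Key]) -> HashMap<&Key, usize> {
    let mut counts = HashMap::new();
    for key in keys {
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

/// Classifies every distinct key exactly once, in first-seen source order.
fn exact_match_sketch<'a>(keys_a: &'a [Key], keys_b: &'a [Key]) -> Vec<(&'a Key, Outcome)> {
    let counts_a = count_keys(keys_a);
    let counts_b = count_keys(keys_b);
    let mut seen: HashSet<&Key> = HashSet::new();
    let mut results = Vec::new();
    // Walking the rows in source order (A, then B) keeps the output deterministic
    // even though the lookups use hash maps.
    for key in keys_a.iter().chain(keys_b.iter()) {
        if !seen.insert(key) {
            continue; // this key group was already emitted at its first row
        }
        let in_a = counts_a.get(key).copied().unwrap_or(0);
        let in_b = counts_b.get(key).copied().unwrap_or(0);
        let outcome = match (in_a, in_b) {
            (a, b) if a > 1 || b > 1 => Outcome::Duplicate,
            (1, 1) => Outcome::Paired,
            (1, 0) => Outcome::MissingRight,
            _ => Outcome::MissingLeft, // only (0, 1) can reach this arm
        };
        results.push((key, outcome));
    }
    results
}
```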

Flexible key matching overview

Flexible key matching is enabled by normalization.flexible_key_matching. Its user-facing behavior is documented in flexible row-key matching.

The engine supports these match classes in src/comparison/rows.rs, ordered by preference in src/comparison/engine.rs:

  1. Exact normalized keys.
  2. Component wildcard matches using **.
  3. Boundary wildcard matches where wildcard patterns can cross key component boundaries.
  4. Shared anchored tokens, such as shared alpha tokens with an exact component or shared number.
  5. Weak shared text tokens, kept only when they are unique on both sides and no other candidate exists for either key.
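One way to express this preference order is an enum whose variant order doubles as its ranking. The sketch below is an assumption-heavy illustration: MatchClass, Candidate, and the greedy selection strategy are hypothetical, and the real engine may choose its one-to-one set differently.

```rust
use std::collections::{HashMap, HashSet};

/// Variant order is the preference order; deriving Ord makes earlier variants sort first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum MatchClass {
    Exact,
    ComponentWildcard,
    BoundaryWildcard,
    AnchoredToken,
    WeakSharedToken,
}

/// A possible pairing between a File A key index and a File B key index.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    left: usize,
    right: usize,
    class: MatchClass,
}

/// Keeps a weak candidate only when it is the sole candidate for both of its keys.
fn drop_ambiguous_weak(candidates: Vec<Candidate>) -> Vec<Candidate> {
    let mut per_left: HashMap<usize, usize> = HashMap::new();
    let mut per_right: HashMap<usize, usize> = HashMap::new();
    for c in &candidates {
        *per_left.entry(c.left).or_insert(0) += 1;
        *per_right.entry(c.right).or_insert(0) += 1;
    }
    candidates
        .into_iter()
        .filter(|c| {
            c.class != MatchClass::WeakSharedToken
                || (per_left[&c.left] == 1 && per_right[&c.right] == 1)
        })
        .collect()
}

/// Greedy one-to-one selection (an assumption): best class first, then source order.
fn select_one_to_one(candidates: Vec<Candidate>) -> Vec<Candidate> {
    let mut ranked = drop_ambiguous_weak(candidates);
    ranked.sort_by_key(|c| (c.class, c.left, c.right));
    let mut used_left = HashSet::new();
    let mut used_right = HashSet::new();
    let mut chosen = Vec::new();
    for c in ranked {
        if used_left.contains(&c.left) || used_right.contains(&c.right) {
            continue; // one of the keys is already paired with a better candidate
        }
        used_left.insert(c.left);
        used_right.insert(c.right);
        chosen.push(c);
    }
    chosen
}
```

Sorting by (class, left, right) also makes the selection deterministic, which matters when results are compared across runs or saved in snapshots.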

Before running flexible matching, src/backend/workflow.rs asks the engine for bounded estimates:

| Guard | Limit | Defined in |
| --- | --- | --- |
| Candidate count | 10,000 | src/comparison/engine.rs |
| Key-pair comparison count | 1,000,000 | src/comparison/engine.rs |

If either estimate exceeds the limit, validation fails before the expensive comparison runs.
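A guard of this kind might look like the sketch below. The constant names, the function signature, and the idea that the key-pair estimate is the product of the two key counts are assumptions for illustration; only the two limit values come from the table above.

```rust
// Limit values from the table above; the constant names are hypothetical.
const MAX_FLEXIBLE_CANDIDATES: usize = 10_000;
const MAX_FLEXIBLE_KEY_PAIR_COMPARISONS: usize = 1_000_000;

/// Rejects a flexible comparison before any expensive work starts.
/// Assumes the key-pair estimate is keys_a * keys_b (worst-case pairwise checks).
fn check_flexible_guards(
    estimated_candidates: usize,
    keys_a: usize,
    keys_b: usize,
) -> Result<(), String> {
    // saturating_mul avoids overflow on very large inputs.
    let pair_comparisons = keys_a.saturating_mul(keys_b);
    if pair_comparisons > MAX_FLEXIBLE_KEY_PAIR_COMPARISONS {
        return Err(format!(
            "flexible matching would need {pair_comparisons} key-pair comparisons (limit {MAX_FLEXIBLE_KEY_PAIR_COMPARISONS})"
        ));
    }
    if estimated_candidates > MAX_FLEXIBLE_CANDIDATES {
        return Err(format!(
            "flexible matching would produce {estimated_candidates} candidates (limit {MAX_FLEXIBLE_CANDIDATES})"
        ));
    }
    Ok(())
}
```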

Duplicate, missing, and unkeyed rows

| Result | Source condition | Display meaning |
| --- | --- | --- |
| match | Single matched row on both sides and no value differences. | The selected values match after cleanup. |
| mismatch | Single matched row on both sides with value differences. | One or more selected values differ. |
| missing_left | Usable key appears only in File B. | Only in File B. |
| missing_right | Usable key appears only in File A. | Only in File A. |
| unkeyed_left | File B row has a missing selected key after cleanup. | Ignored in File B. |
| unkeyed_right | File A row has a missing selected key after cleanup. | Ignored in File A. |
| duplicate_file_a | A key has multiple File A rows only. | Duplicate key in File A. |
| duplicate_file_b | A key has multiple File B rows only. | Duplicate key in File B. |
| duplicate_both | A key has duplicate rows on both sides. | Duplicate key in both files. |
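The three duplicate outcomes depend only on how many rows each side holds for a key. A minimal sketch, with a hypothetical DuplicateKind enum and function name:

```rust
#[derive(Debug, PartialEq)]
enum DuplicateKind {
    FileA, // duplicate_file_a
    FileB, // duplicate_file_b
    Both,  // duplicate_both
}

/// Returns the duplicate kind for a key group, or None when neither side repeats the key.
fn classify_duplicate(rows_in_a: usize, rows_in_b: usize) -> Option<DuplicateKind> {
    match (rows_in_a > 1, rows_in_b > 1) {
        (true, true) => Some(DuplicateKind::Both),
        (true, false) => Some(DuplicateKind::FileA),
        (false, true) => Some(DuplicateKind::FileB),
        (false, false) => None,
    }
}
```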

Normalization

src/comparison/value_compare.rs applies the same normalization rules to key values and comparison values, except that missing normalized key values remove the row from keyed matching.

| Rule | Behavior |
| --- | --- |
| Empty as missing | Empty values become missing when treat_empty_as_null is true. |
| Null tokens | Tokens such as null, na, n/a, and none become missing by default. |
| Whitespace | Values are trimmed before comparison when trim_whitespace is true. |
| Date normalization | Parsed dates are converted to canonical date or date-time strings when enabled. |
| Numeric equivalence | Numeric strings are normalized so equivalent integer and decimal forms can match. |
| Decimal rounding | Numeric values are rounded before comparison and display when enabled. |
| Case-insensitive text | Normalized text is lowercased when case_insensitive is true. |
| JSON values | If both normalized comparison values parse as JSON, structural JSON equality is used. |

normalize_display_value preserves original display values with two exceptions: the trimming applied during decimal rounding, and the rounded numeric output itself. That is why result rows can show rounded values when decimal rounding is enabled.
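A few of these rules can be combined in a short sketch. It covers only trimming, missing values, numeric equivalence, decimal rounding, and case folding; date and JSON handling are left out because they need external crates. The struct, the decimal_places field, and the function name are hypothetical, and parsing through f64 is a simplification that would, for example, turn an identifier like 007 into 7.

```rust
/// Null tokens listed in the table above.
const NULL_TOKENS: [&str; 4] = ["null", "na", "n/a", "none"];

/// Hypothetical subset of the normalization options.
#[derive(Debug, Clone, Copy)]
struct NormalizationSketch {
    treat_empty_as_null: bool,
    trim_whitespace: bool,
    case_insensitive: bool,
    decimal_places: Option<usize>, // hypothetical field: None disables rounding
}

/// Returns None for missing values, otherwise a canonical string to compare.
fn normalize_value(raw: &str, cfg: &NormalizationSketch) -> Option<String> {
    let value = if cfg.trim_whitespace { raw.trim() } else { raw };
    if value.is_empty() && cfg.treat_empty_as_null {
        return None;
    }
    if NULL_TOKENS.iter().any(|token| token.eq_ignore_ascii_case(value)) {
        return None;
    }
    // Numeric equivalence: "1.50" and "1.5" both render as "1.5".
    if let Ok(number) = value.trim().parse::<f64>() {
        // f64 parsing also accepts "inf" and "nan"; treat those as plain text.
        if number.is_finite() {
            return Some(match cfg.decimal_places {
                Some(places) => format!("{:.*}", places, number),
                None => format!("{}", number),
            });
        }
    }
    Some(if cfg.case_insensitive {
        value.to_lowercase()
    } else {
        value.to_string()
    })
}
```

Two cells are then considered equal when their normalized forms are equal, and a None on a key column is what moves a row into the unkeyed results.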

Summaries

generate_summary in src/comparison/engine.rs counts the final RowComparisonResult list.

  • matches, mismatches, missing_left, missing_right, unkeyed_left, and unkeyed_right count result rows of that type.
  • duplicates_a and duplicates_b count duplicate result rows by duplicate source, so a duplicate in both files increments both counters.
  • total_rows_a and total_rows_b are the original input row counts.
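The counting rules can be sketched as a single pass over the results. The field names follow the bullets above; the ResultKind and DuplicateSource enums and the function name are simplified stand-ins, not the actual types in src/data/types.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DuplicateSource {
    FileA,
    FileB,
    Both,
}

/// Simplified stand-in for RowComparisonResult: only the outcome kind is kept.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ResultKind {
    Match,
    Mismatch,
    MissingLeft,
    MissingRight,
    UnkeyedLeft,
    UnkeyedRight,
    Duplicate(DuplicateSource),
}

#[derive(Debug, Default, PartialEq)]
struct SummarySketch {
    matches: usize,
    mismatches: usize,
    missing_left: usize,
    missing_right: usize,
    unkeyed_left: usize,
    unkeyed_right: usize,
    duplicates_a: usize,
    duplicates_b: usize,
    total_rows_a: usize,
    total_rows_b: usize,
}

fn summarize(results: &[ResultKind], total_rows_a: usize, total_rows_b: usize) -> SummarySketch {
    let mut summary = SummarySketch { total_rows_a, total_rows_b, ..SummarySketch::default() };
    for result in results {
        match result {
            ResultKind::Match => summary.matches += 1,
            ResultKind::Mismatch => summary.mismatches += 1,
            ResultKind::MissingLeft => summary.missing_left += 1,
            ResultKind::MissingRight => summary.missing_right += 1,
            ResultKind::UnkeyedLeft => summary.unkeyed_left += 1,
            ResultKind::UnkeyedRight => summary.unkeyed_right += 1,
            ResultKind::Duplicate(source) => {
                // A duplicate in both files increments both counters.
                if matches!(source, DuplicateSource::FileA | DuplicateSource::Both) {
                    summary.duplicates_a += 1;
                }
                if matches!(source, DuplicateSource::FileB | DuplicateSource::Both) {
                    summary.duplicates_b += 1;
                }
            }
        }
    }
    summary
}
```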

These counts feed SummaryResponse in src/presentation/responses.rs, buildSummaryOverview in frontend/src/features/results/presentation.ts, and saved snapshot validation in src/backend/persistence/v1/mod.rs.

Integration points

  • Backend workflow explains how requests are validated before calling the engine and how results are written back to a session.
  • Frontend workflow explains how results are filtered, searched, sorted, and exported.
  • CSV loading explains how physical and virtual columns become selectable inputs.
  • API describes the serialized response contract that carries the engine output.

Entry points for modification

| Need | Change first | Also check |
| --- | --- | --- |
| Change exact row matching | src/comparison/engine.rs | Summary tests and result ordering expectations. |
| Change flexible key behavior | src/comparison/rows.rs, src/comparison/engine.rs | Flexible row-key matching, guard limits in src/backend/workflow.rs. |
| Change missing or unkeyed semantics | src/comparison/rows.rs, src/comparison/engine.rs | frontend/src/features/results/presentation.ts labels and summary banners. |
| Change value cleanup | src/comparison/value_compare.rs | frontend/src/config/normalization.ts, frontend/src/components/mapping-config/NormalizationPanel.tsx. |
| Change result DTOs | src/data/types.rs, src/presentation/responses.rs | frontend/src/types/api.ts, snapshot persistence in src/backend/persistence/v1/mod.rs. |
| Change summaries | src/comparison/engine.rs | frontend/src/components/SummaryStats.tsx, src/backend/persistence/v1/mod.rs. |

Key source files

| File | Purpose |
| --- | --- |
| src/comparison/engine.rs | Main comparison flow, exact and flexible matching, duplicate handling, missing row handling, summary generation. |
| src/comparison/rows.rs | Key extraction, virtual column resolution, normalized key grouping, flexible key classification. |
| src/comparison/value_compare.rs | Normalization and per-value difference detection. |
| src/comparison/mapping.rs | Header and instance-based mapping suggestions. |
| src/data/types.rs | Comparison config, normalization config, result enum, difference, and summary types. |
| src/backend/validation.rs | Safe construction of ComparisonConfig from user requests. |
| src/presentation/responses.rs | Conversion from domain results to API responses. |
| frontend/src/features/results/presentation.ts | Frontend labels, filter buckets, row view models, and summary view models. |