systems comparison engine
Active contributors: Douwe de Vries
The comparison engine turns two loaded CSV datasets and a validated comparison configuration into row-level outcomes and summary counts. It owns exact row-key matching, optional flexible key matching, duplicate and missing row classification, value normalization, and result summaries consumed by the backend responses and frontend results table.
| Area | Paths | Responsibility |
|---|---|---|
| Engine orchestration | src/comparison/engine.rs | Resolves selected columns, groups rows by keys, chooses exact or flexible matching, creates row results, and generates summaries. |
| Row helpers | src/comparison/rows.rs | Resolves column selections, splits usable and unkeyed rows, tokenizes flexible keys, and classifies flexible matches. |
| Value comparison | src/comparison/value_compare.rs | Applies cleanup rules, compares values, handles JSON equality, and builds value differences. |
| Mapping suggestions | src/comparison/mapping.rs | Suggests exact and fuzzy column mappings from headers and optional cell profiles. |
| Data types | src/data/types.rs | Defines ComparisonConfig, ComparisonNormalizationConfig, RowComparisonResult, ValueDifference, and ComparisonSummary. |
| Response serialization | src/presentation/responses.rs | Converts domain results and summary values into frontend DTOs. |
| Request validation | src/backend/validation.rs | Builds a safe ComparisonConfig before the engine runs on user input. |
| Abstraction | Defined in | Notes |
|---|---|---|
| ComparisonConfig | src/data/types.rs | Validated row keys, comparison columns, mappings, and normalization rules. |
| ComparisonNormalizationConfig | src/data/types.rs | Missing-value, flexible-key, text, numeric, decimal, and date cleanup options. |
| KeyedRows | src/comparison/rows.rs | A normalized key, display key, matching row indices, and source order. |
| FlexibleKeyMatch | src/comparison/rows.rs | Exact, wildcard, boundary, anchored-token, and weak shared-token match classes. |
| RowComparisonResult | src/data/types.rs | Match, mismatch, missing-left, missing-right, unkeyed-left, unkeyed-right, and duplicate outcomes. |
| ValueDifference | src/data/types.rs | Per-column mismatch detail with File A and File B values. |
| ComparisonSummary | src/data/types.rs | Counts used by the UI summary and saved snapshots. |
flowchart TD
Request[CompareRequest] --> Validation[src/backend/validation.rs]
Validation --> Config[ComparisonConfig]
Config --> Selections[Resolve physical and virtual columns]
Selections --> SplitA[Split File A by usable normalized keys]
Selections --> SplitB[Split File B by usable normalized keys]
SplitA --> UnkeyedA[Unkeyed File A rows]
SplitB --> UnkeyedB[Unkeyed File B rows]
SplitA --> Mode{Flexible key matching?}
SplitB --> Mode
Mode -->|No| Exact[Exact key lookup]
Mode -->|Yes| Flexible[Flexible candidate selection]
Exact --> Results[RowComparisonResult list]
Flexible --> Results
UnkeyedA --> Results
UnkeyedB --> Results
Results --> Values[Value normalization and differences]
Values --> Summary[ComparisonSummary]
- src/backend/validation.rs validates the incoming CompareRequest and builds a ComparisonConfig.
- src/comparison/engine.rs resolves selected key and comparison labels through src/comparison/rows.rs, including virtual JSON labels from src/data/json_fields.rs.
- Each side is split by split_rows_by_key_usable. A row is usable only when every selected key component normalizes to a non-missing value.
- Rows with unusable keys are emitted as unkeyed results before normal matching.
- Exact mode pairs only equal normalized keys. Flexible mode evaluates possible pairs and selects a preferred one-to-one set.
- Matched key groups with more than one row on either side become duplicate results instead of normal value comparisons.
- Single-row matched groups are compared by find_differences in src/comparison/value_compare.rs.
- generate_summary counts the resulting categories and preserves duplicate counts by side.
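The usable/unkeyed split in the steps above can be sketched as follows. This is a minimal illustration, not the signature of split_rows_by_key_usable: the `Row` alias, `normalize_key_cell`, and `split_rows_sketch` are hypothetical names, and only trimming and empty-as-missing are modelled as key cleanup.

```rust
/// Hypothetical stand-in for a loaded CSV row: one cell per column.
type Row = Vec<String>;

/// Normalizes one key cell and returns `None` when the value counts as missing.
/// Assumption: only trimming and empty-as-missing are modelled here.
fn normalize_key_cell(raw: &str) -> Option<String> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        None
    } else {
        Some(trimmed.to_string())
    }
}

/// Splits rows into (usable, unkeyed).
/// `usable` holds (source row index, normalized key); `unkeyed` holds source row indices.
/// A row is usable only when every selected key column normalizes to a non-missing value.
fn split_rows_sketch(
    rows: &[Row],
    key_columns: &[usize],
) -> (Vec<(usize, Vec<String>)>, Vec<usize>) {
    let mut usable = Vec::new();
    let mut unkeyed = Vec::new();
    for (index, row) in rows.iter().enumerate() {
        // Collecting into Option<Vec<_>> yields None as soon as one component is missing.
        let key: Option<Vec<String>> = key_columns
            .iter()
            .map(|&col| row.get(col).and_then(|cell| normalize_key_cell(cell)))
            .collect();
        match key {
            Some(parts) => usable.push((index, parts)),
            None => unkeyed.push(index),
        }
    }
    (usable, unkeyed)
}
```

Collecting into `Option<Vec<String>>` keeps the "every component must be present" rule in one place: a single missing component turns the whole key into `None`, and the row lands in the unkeyed bucket.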
Exact matching is the default. src/comparison/engine.rs builds maps from normalized key vectors to KeyedRows for File A and File B.
- A key present once on both sides becomes either Match or Mismatch.
- A key present only in File A becomes MissingRight, shown as only in File A.
- A key present only in File B becomes MissingLeft, shown as only in File B.
- A key with multiple File A rows, multiple File B rows, or both becomes Duplicate.
- Output is stable because key groups are processed by their first source-row index.
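The classification rules above can be illustrated with a small sketch. The `Outcome` enum, the `Key` alias, and `classify_exact` are hypothetical and do not mirror the types in src/comparison/engine.rs. The output order (File A keys by first appearance, then keys that occur only in File B) is one assumed reading of "first source-row index".

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical alias for a normalized key: one string per key component.
type Key = Vec<String>;

#[derive(Debug, PartialEq)]
enum Outcome {
    /// Present once on both sides; becomes Match or Mismatch after value comparison.
    Paired,
    /// Present only in File A.
    MissingRight,
    /// Present only in File B.
    MissingLeft,
    /// More than one row on at least one side.
    Duplicate,
}

/// Counts how many rows carry each key.
fn count_keys(keys: &[Key]) -> HashMap<&Key, usize> {
    let mut counts = HashMap::new();
    for key in keys {
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

/// `keys_a` and `keys_b` hold the normalized key of every usable row, in source order.
fn classify_exact(keys_a: &[Key], keys_b: &[Key]) -> Vec<(Key, Outcome)> {
    let counts_a = count_keys(keys_a);
    let counts_b = count_keys(keys_b);
    let mut seen: HashSet<&Key> = HashSet::new();
    let mut results = Vec::new();

    // Walking the source order (not the hash map) keeps the output stable.
    for key in keys_a.iter().chain(keys_b.iter()) {
        if !seen.insert(key) {
            continue; // this key group was already emitted
        }
        let in_a = counts_a.get(key).copied().unwrap_or(0);
        let in_b = counts_b.get(key).copied().unwrap_or(0);
        let outcome = if in_a > 1 || in_b > 1 {
            Outcome::Duplicate
        } else if in_b == 0 {
            Outcome::MissingRight
        } else if in_a == 0 {
            Outcome::MissingLeft
        } else {
            Outcome::Paired
        };
        results.push((key.clone(), outcome));
    }
    results
}
```

Iterating over a `HashMap` directly would give a different order on each run, which is why the sketch drives the loop from the input slices and uses the maps only for lookups.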
Flexible key matching is enabled by normalization.flexible_key_matching. It is documented from the user-facing side in flexible row-key matching.
The engine supports these match classes in src/comparison/rows.rs, ordered by preference in src/comparison/engine.rs:
- Exact normalized keys.
- Component wildcard matches using `**`.
- Boundary wildcard matches where wildcard patterns can cross key component boundaries.
- Shared anchored tokens, such as shared alpha tokens with an exact component or shared number.
- Weak shared text tokens, kept only when they are unique on both sides and no other candidate exists for either key.
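One way to express a preference order like the one listed above is an enum whose variant order encodes preference, combined with a selection pass that keeps each key in at most one pair. The sketch below is an assumption-laden illustration: `MatchClass`, `Candidate`, and `select_one_to_one` are hypothetical names, the greedy strategy is a stand-in rather than the engine's actual selection logic, and the extra uniqueness rule for weak shared tokens is not modelled.

```rust
use std::collections::HashSet;

/// Match classes from most to least preferred. Deriving `Ord` on an enum
/// orders variants by declaration order, so `Exact` compares as smallest.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum MatchClass {
    Exact,
    ComponentWildcard,
    BoundaryWildcard,
    AnchoredToken,
    WeakSharedToken,
}

/// A possible pairing between key group `a` (File A) and key group `b` (File B).
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    a: usize,
    b: usize,
    class: MatchClass,
}

/// Greedy one-to-one selection (assumed strategy): visit candidates from the
/// most preferred class down and keep a pair only if both sides are still free.
fn select_one_to_one(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    // Ties inside a class are broken by index so the result is deterministic.
    candidates.sort_by_key(|c| (c.class, c.a, c.b));
    let mut used_a = HashSet::new();
    let mut used_b = HashSet::new();
    let mut chosen = Vec::new();
    for candidate in candidates {
        if used_a.contains(&candidate.a) || used_b.contains(&candidate.b) {
            continue;
        }
        used_a.insert(candidate.a);
        used_b.insert(candidate.b);
        chosen.push(candidate);
    }
    chosen
}
```

A greedy pass is simple and deterministic but is not guaranteed to maximize the number of pairs; whether that trade-off matches the real engine is not stated in this page.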
Before running flexible matching, src/backend/workflow.rs asks the engine for bounded estimates:
| Guard | Limit | Defined in |
|---|---|---|
| Candidate count | 10,000 | src/comparison/engine.rs |
| Key-pair comparison count | 1,000,000 | src/comparison/engine.rs |
If either estimate exceeds the limit, validation fails before the expensive comparison runs.
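A guard of this kind might look like the sketch below. The two limits come from the table above; the constant names, the `check_flexible_guards` function, and the choice to estimate key-pair comparisons as the product of the two key counts are assumptions made for illustration.

```rust
/// Limits taken from the guard table; the constant names are hypothetical.
const MAX_CANDIDATES: usize = 10_000;
const MAX_KEY_PAIR_COMPARISONS: usize = 1_000_000;

/// Rejects a flexible comparison whose estimated cost exceeds either limit.
/// Assumption: the pair-comparison estimate is keys_a * keys_b.
fn check_flexible_guards(
    keys_a: usize,
    keys_b: usize,
    estimated_candidates: usize,
) -> Result<(), String> {
    // saturating_mul avoids overflow when both inputs are very large.
    let pair_comparisons = keys_a.saturating_mul(keys_b);
    if estimated_candidates > MAX_CANDIDATES {
        return Err(format!(
            "estimated candidates {estimated_candidates} exceed limit {MAX_CANDIDATES}"
        ));
    }
    if pair_comparisons > MAX_KEY_PAIR_COMPARISONS {
        return Err(format!(
            "estimated key-pair comparisons {pair_comparisons} exceed limit {MAX_KEY_PAIR_COMPARISONS}"
        ));
    }
    Ok(())
}
```

Because the check only fails when an estimate exceeds its limit, a request that lands exactly on 10,000 candidates or 1,000,000 pair comparisons still passes in this sketch.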
| Result | Source condition | Display meaning |
|---|---|---|
| match | Single matched row on both sides and no value differences. | The selected values match after cleanup. |
| mismatch | Single matched row on both sides with value differences. | One or more selected values differ. |
| missing_left | Usable key appears only in File B. | Only in File B. |
| missing_right | Usable key appears only in File A. | Only in File A. |
| unkeyed_left | File B row has a missing selected key after cleanup. | Ignored in File B. |
| unkeyed_right | File A row has a missing selected key after cleanup. | Ignored in File A. |
| duplicate_file_a | A key has multiple File A rows only. | Duplicate key in File A. |
| duplicate_file_b | A key has multiple File B rows only. | Duplicate key in File B. |
| duplicate_both | A key has duplicate rows on both sides. | Duplicate key in both files. |
src/comparison/value_compare.rs applies the same normalization rules to key values and comparison values, except that missing normalized key values remove the row from keyed matching.
| Rule | Behavior |
|---|---|
| Empty as missing | Empty values become missing when treat_empty_as_null is true. |
| Null tokens | Tokens such as null, na, n/a, and none become missing by default. |
| Whitespace | Values are trimmed before comparison when trim_whitespace is true. |
| Date normalization | Parsed dates are converted to canonical date or date-time strings when enabled. |
| Numeric equivalence | Numeric strings are normalized so equivalent integer and decimal forms can match. |
| Decimal rounding | Numeric values are rounded before comparison and display when enabled. |
| Case-insensitive text | Normalized text is lowercased when case_insensitive is true. |
| JSON values | If both normalized comparison values parse as JSON, structural JSON equality is used. |
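Several of the rules in the table can be combined into one small function, sketched below under stated assumptions. The field names treat_empty_as_null, trim_whitespace, and case_insensitive come from the table; the `NormalizationSketch` struct, the order in which rules are applied, case-insensitive null-token matching, and the use of `f64` for numeric equivalence are assumptions. Date normalization, decimal rounding, and JSON equality are left out.

```rust
/// Hypothetical subset of the normalization options.
struct NormalizationSketch {
    treat_empty_as_null: bool,
    trim_whitespace: bool,
    case_insensitive: bool,
}

/// Null tokens named in the rules table.
const NULL_TOKENS: [&str; 4] = ["null", "na", "n/a", "none"];

/// Returns `None` for a missing value, otherwise the normalized text.
fn normalize(raw: &str, rules: &NormalizationSketch) -> Option<String> {
    let value = if rules.trim_whitespace { raw.trim() } else { raw };

    if value.is_empty() && rules.treat_empty_as_null {
        return None;
    }
    // Assumption: null tokens are matched without regard to ASCII case.
    if NULL_TOKENS.iter().any(|token| value.eq_ignore_ascii_case(token)) {
        return None;
    }
    // Numeric equivalence: "1", "1.0", and "01.00" all parse to the same f64,
    // so they share one canonical string. f64 limits precision; a real
    // implementation may well use a decimal type instead.
    if let Ok(number) = value.parse::<f64>() {
        if number.is_finite() {
            return Some(number.to_string());
        }
    }
    if rules.case_insensitive {
        Some(value.to_lowercase())
    } else {
        Some(value.to_string())
    }
}
```

Returning `Option<String>` makes the missing case explicit, which matches how a missing normalized key value removes a row from keyed matching while a missing comparison value is still compared.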
normalize_display_value preserves the original display value, with two exceptions that both come from decimal rounding: the value is trimmed as part of rounding, and numeric output is shown in its rounded form. That is why result rows can show rounded values when decimal rounding is enabled.
generate_summary in src/comparison/engine.rs counts the final RowComparisonResult list.
- matches, mismatches, missing_left, missing_right, unkeyed_left, and unkeyed_right count result rows of that type.
- duplicates_a and duplicates_b count duplicate result rows by duplicate source, so a duplicate in both files increments both counters.
- total_rows_a and total_rows_b are the original input row counts.
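The counting rules above, including the double increment for duplicates on both sides, can be sketched as follows. The counter names match the list; `ResultKind`, `SummarySketch`, and `summarize` are hypothetical simplifications and not the definitions in src/data/types.rs or src/comparison/engine.rs.

```rust
/// Hypothetical, payload-free version of the row outcome.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ResultKind {
    Match,
    Mismatch,
    MissingLeft,
    MissingRight,
    UnkeyedLeft,
    UnkeyedRight,
    DuplicateFileA,
    DuplicateFileB,
    DuplicateBoth,
}

#[derive(Debug, Default, PartialEq)]
struct SummarySketch {
    matches: usize,
    mismatches: usize,
    missing_left: usize,
    missing_right: usize,
    unkeyed_left: usize,
    unkeyed_right: usize,
    duplicates_a: usize,
    duplicates_b: usize,
    total_rows_a: usize,
    total_rows_b: usize,
}

/// Counts result rows by kind; the totals are passed through unchanged.
fn summarize(results: &[ResultKind], total_rows_a: usize, total_rows_b: usize) -> SummarySketch {
    let mut summary = SummarySketch {
        total_rows_a,
        total_rows_b,
        ..Default::default()
    };
    for kind in results {
        match kind {
            ResultKind::Match => summary.matches += 1,
            ResultKind::Mismatch => summary.mismatches += 1,
            ResultKind::MissingLeft => summary.missing_left += 1,
            ResultKind::MissingRight => summary.missing_right += 1,
            ResultKind::UnkeyedLeft => summary.unkeyed_left += 1,
            ResultKind::UnkeyedRight => summary.unkeyed_right += 1,
            ResultKind::DuplicateFileA => summary.duplicates_a += 1,
            ResultKind::DuplicateFileB => summary.duplicates_b += 1,
            // A duplicate on both sides counts once for each side.
            ResultKind::DuplicateBoth => {
                summary.duplicates_a += 1;
                summary.duplicates_b += 1;
            }
        }
    }
    summary
}
```

Because a duplicate in both files increments two counters, the per-category counts do not necessarily add up to the number of result rows.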
These counts feed SummaryResponse in src/presentation/responses.rs, buildSummaryOverview in frontend/src/features/results/presentation.ts, and saved snapshot validation in src/backend/persistence/v1/mod.rs.
- Backend workflow explains how requests are validated before calling the engine and how results are written back to a session.
- Frontend workflow explains how results are filtered, searched, sorted, and exported.
- CSV loading explains how physical and virtual columns become selectable inputs.
- API is the serialized response contract that receives the engine output.
| Need | Change first | Also check |
|---|---|---|
| Change exact row matching | src/comparison/engine.rs | Summary tests and result ordering expectations. |
| Change flexible key behavior | src/comparison/rows.rs, src/comparison/engine.rs | Flexible row-key matching, guard limits in src/backend/workflow.rs. |
| Change missing or unkeyed semantics | src/comparison/rows.rs, src/comparison/engine.rs | frontend/src/features/results/presentation.ts labels and summary banners. |
| Change value cleanup | src/comparison/value_compare.rs | frontend/src/config/normalization.ts, frontend/src/components/mapping-config/NormalizationPanel.tsx. |
| Change result DTOs | src/data/types.rs, src/presentation/responses.rs | frontend/src/types/api.ts, snapshot persistence in src/backend/persistence/v1/mod.rs. |
| Change summaries | src/comparison/engine.rs | frontend/src/components/SummaryStats.tsx, src/backend/persistence/v1/mod.rs. |
| File | Purpose |
|---|---|
| src/comparison/engine.rs | Main comparison flow, exact and flexible matching, duplicate handling, missing row handling, summary generation. |
| src/comparison/rows.rs | Key extraction, virtual column resolution, normalized key grouping, flexible key classification. |
| src/comparison/value_compare.rs | Normalization and per-value difference detection. |
| src/comparison/mapping.rs | Header and instance-based mapping suggestions. |
| src/data/types.rs | Comparison config, normalization config, result enum, difference, and summary types. |
| src/backend/validation.rs | Safe construction of ComparisonConfig from user requests. |
| src/presentation/responses.rs | Conversion from domain results to API responses. |
| frontend/src/features/results/presentation.ts | Frontend labels, filter buckets, row view models, and summary view models. |