This repository was archived by the owner on Jun 13, 2026. It is now read-only.

Architecture

benzsevern edited this page Mar 29, 2026 · 1 revision

Pipeline Overview

Source (file/DB/DataFrame)
  |
  v
SchemaProvider.extract() --> SchemaInfo (fields + metadata)
  |
Target (file/DB/DataFrame)
  |
  v
SchemaProvider.extract() --> SchemaInfo
  |
  v
MapEngine
  |-- For each (source_field, target_field) pair:
  |     |-- Run all scorers --> ScorerResult | None
  |     |-- Weighted average (min 2 contributors)
  |     \-- --> Score matrix (M x N)
  |
  |-- scipy.optimize.linear_sum_assignment (Hungarian algorithm)
  |-- Filter by min_confidence
  |-- Generate warnings for unmapped required fields
  |
  v
MapResult
  |-- .report()     --> structured dict with per-scorer breakdown
  |-- .apply(df)    --> remapped DataFrame
  |-- .to_config()  --> saveable YAML
  \-- .to_json()    --> JSON string

Four Layers

  1. Consumer Layer — CLI, Python API, TypeScript SDK (v1.1)
  2. Orchestrator — MapEngine coordinates everything
  3. Scorer Pipeline — Independent scorers, each returns (score, reasoning)
  4. Schema Providers — Normalize any source into SchemaInfo
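The scorer layer's contract (each scorer returns a score plus reasoning, or abstains) could look roughly like the sketch below. Everything here is illustrative: `FieldInfo`, the `name` and `weight` attributes, the `score()` signature, and `ToyExactScorer` are assumptions made for this example, not the actual definitions in `scorers/base.py` or `types.py`.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class FieldInfo:
    # Hypothetical stand-in for one field entry inside SchemaInfo.
    name: str
    dtype: str = "string"


@dataclass
class ScorerResult:
    score: float    # 0.0 is a real negative, 1.0 a certain match
    reasoning: str  # human-readable explanation for the report


class Scorer(Protocol):
    # Assumed shape of the scorer protocol: one method, no shared state.
    name: str
    weight: float

    def score(self, source: FieldInfo, target: FieldInfo) -> Optional[ScorerResult]:
        ...


class ToyExactScorer:
    """Toy scorer: case-insensitive name equality, otherwise abstains."""

    name = "toy_exact"
    weight = 1.0

    def score(self, source: FieldInfo, target: FieldInfo) -> Optional[ScorerResult]:
        if source.name.lower() == target.name.lower():
            return ScorerResult(1.0, f"names match: {source.name!r}")
        return None  # abstain rather than vote against the pair


scorer: Scorer = ToyExactScorer()
hit = scorer.score(FieldInfo("Email"), FieldInfo("email"))
miss = scorer.score(FieldInfo("Email"), FieldInfo("phone"))
```

Because each scorer only sees one (source, target) pair and holds no shared state, the orchestrator can run them in any order and combine the results afterwards.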

Key Design Decisions

Weighted average, not staged filtering: All scorers run on all pairs. No early locking that could steal targets from better matches.

Optimal assignment, not greedy: The Hungarian algorithm finds the globally optimal 1:1 mapping, not the locally best one. This matters when two source fields compete for the same target.
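A small worked example makes the difference concrete. The sketch below uses a made-up 2 x 2 score matrix and a brute-force search over permutations as a stand-in for the Hungarian algorithm (fine for tiny inputs only). At real sizes, `scipy.optimize.linear_sum_assignment(scores, maximize=True)` solves the same problem in polynomial time. The function names and the scores are hypothetical.

```python
from itertools import permutations

# Rows are source fields, columns are target fields (illustrative values).
scores = [
    [0.90, 0.85],  # source 0 scores well against both targets
    [0.88, 0.20],  # source 1 only fits target 0
]


def greedy_assign(matrix):
    """Repeatedly take the best remaining pair, locking its row and column."""
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(matrix) for j, s in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, result = set(), set(), {}
    for _, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            result[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return result


def optimal_assign(matrix):
    """Try every 1:1 assignment of a square matrix and keep the best total."""
    n = len(matrix)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(matrix[i][perm[i]] for i in range(n)),
    )
    return dict(enumerate(best))


def total(matrix, assignment):
    return sum(matrix[i][j] for i, j in assignment.items())


greedy = greedy_assign(scores)
optimal = optimal_assign(scores)
```

Greedy locks source 0 to target 0 because 0.90 is the single highest score, which leaves source 1 with its 0.20 match. The optimal assignment gives target 0 to source 1 and target 1 to source 0, for a much higher combined score.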

None vs 0.0: Scorers return None to abstain (excluded from calculation) or ScorerResult(0.0) to signal a real negative (included in denominator). This prevents a single weak scorer from inflating a match.

Minimum 2 contributors: A pair needs at least 2 non-None scorer results to receive a score. Single-scorer matches are too unreliable.
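The two rules above (abstentions are dropped, real negatives count, and at least two scorers must vote) can be sketched as a single combining function. This is a minimal illustration, not the engine's code: the `combine` name, the `(weight, result)` input shape, and returning 0.0 for under-supported pairs are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ScorerResult:
    score: float
    reasoning: str = ""


def combine(
    results: List[Tuple[float, Optional[ScorerResult]]],
    min_contributors: int = 2,
) -> float:
    """Weighted average over the scorers that did not abstain."""
    # None means "no opinion": drop it from numerator and denominator.
    voted = [(w, r.score) for w, r in results if r is not None]
    if len(voted) < min_contributors:
        return 0.0  # assumption: under-supported pairs get no score
    total_weight = sum(w for w, _ in voted)
    if total_weight == 0:
        return 0.0
    return sum(w * s for w, s in voted) / total_weight


# One abstention, one real negative: the 0.0 stays in the denominator.
with_negative = combine([
    (1.0, ScorerResult(1.0)),
    (1.0, None),
    (1.0, ScorerResult(0.0)),
])

# Only one scorer voted, so the pair is not scored at all.
single_vote = combine([
    (1.0, ScorerResult(1.0)),
    (1.0, None),
    (1.0, None),
])

# Unequal weights: (2.0 * 0.9 + 1.0 * 0.3) / 3.0
weighted = combine([
    (2.0, ScorerResult(0.9)),
    (1.0, ScorerResult(0.3)),
    (1.0, None),
])
```

In the first case the pair scores 0.5 rather than 1.0, because the scorer that returned 0.0 is counted as evidence against the match while the abstaining scorer is ignored.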

Project Structure

infermap/
├── __init__.py          # Public API
├── engine.py            # MapEngine orchestrator
├── types.py             # Core dataclasses
├── errors.py            # Exception hierarchy
├── assignment.py        # Hungarian algorithm
├── config.py            # YAML config + from_config()
├── cli.py               # Typer CLI (4 commands)
├── scorers/
│   ├── base.py          # Scorer protocol
│   ├── exact.py         # ExactScorer
│   ├── alias.py         # AliasScorer + registry
│   ├── pattern_type.py  # PatternTypeScorer + regex
│   ├── profile.py       # ProfileScorer
│   ├── fuzzy_name.py    # FuzzyNameScorer
│   └── llm.py           # LLMScorer (v1.1 stub)
└── providers/
    ├── base.py          # Provider protocol
    ├── file.py          # CSV/Parquet/Excel
    ├── db.py            # SQLite/Postgres/DuckDB/MySQL
    ├── schema_file.py   # YAML/JSON definitions
    └── memory.py        # DataFrame/dict
