Douwe de Vries edited this page Jul 1, 2026 · 1 revision

Multilingual detection

CSV Anonymizer's multilingual detection work is documented in docs/multilingual-detection-phased-plan.md and docs/multilingual-header-detection-investigation.md. The current implementation keeps the default detector local, deterministic, and explainable.

Active contributors: Douwe de Vries

Product boundary

Multilingual detection improves CSV header and value detection. It does not localize the app UI.

"Supported language" should mean deterministic header taxonomy coverage, fixture-backed expectations, detector evidence labels, and value detectors that still run independently of header language. It should not mean full parity for every jurisdiction or every free-text PII shape.

Implemented layers

| Layer | Current behavior | Key paths |
| --- | --- | --- |
| Unicode-safe normalization | Header parsing preserves non-ASCII text, applies Unicode normalization, supports segmentation, accent folding for Latin terms, camelCase splitting, and compact aliases | crates/csv-anonymizer-core/src/detection/header.rs |
| Header taxonomy | Maintained taxonomy covers English, Dutch, German, French, Spanish, Portuguese, Italian, and a small Japanese pilot for unambiguous concepts | crates/csv-anonymizer-core/src/detection/header_taxonomy.json, crates/csv-anonymizer-core/src/detection/header_rules.rs |
| Scored evidence | Detector output includes evidence summaries, detector labels, reasons, scores, and multi-detector source lists | crates/csv-anonymizer-core/src/detection, crates/csv-anonymizer-core/src/metadata.rs |
| Value validators | Structured validators detect emails, URLs, UUIDs, IPs, MAC addresses, IBANs, payment cards, prefixed VAT IDs, Dutch BTW under Dutch header context, US SSN/EIN, and formatted phone numbers | crates/csv-anonymizer-core/src/detection/validators.rs |
| Fixture coverage | Table-driven tests cover taxonomy, validators, privacy evidence, and multilingual matrices | crates/csv-anonymizer-core/src/detection/tests |
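
To illustrate the normalization layer, here is a minimal sketch of header tokenization using only the Rust standard library. It is not the code in header.rs: the names `fold_accent` and `tokenize_header` are hypothetical, the accent map is a small hand-picked subset rather than full Unicode normalization, and compact aliases are left out.

```rust
/// Fold a small, illustrative set of accented Latin letters to ASCII.
/// Assumption: a real implementation would rely on full Unicode
/// normalization rather than this hand-written map.
fn fold_accent(c: char) -> char {
    match c {
        'á' | 'à' | 'â' | 'ä' | 'ã' => 'a',
        'é' | 'è' | 'ê' | 'ë' => 'e',
        'í' | 'ì' | 'î' | 'ï' => 'i',
        'ó' | 'ò' | 'ô' | 'ö' | 'õ' => 'o',
        'ú' | 'ù' | 'û' | 'ü' => 'u',
        'ç' => 'c',
        'ñ' => 'n',
        other => other,
    }
}

/// Split a header into lowercase, accent-folded tokens.
/// Splits on non-alphanumeric separators and on lower-to-upper
/// camelCase boundaries; non-Latin text is preserved as-is.
fn tokenize_header(header: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;

    for c in header.chars() {
        if !c.is_alphanumeric() {
            // Separator: close the current token, if any.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        // camelCase boundary: lowercase letter followed by uppercase.
        if c.is_uppercase() && prev_lower && !current.is_empty() {
            tokens.push(std::mem::take(&mut current));
        }
        prev_lower = c.is_lowercase();
        for lower in c.to_lowercase() {
            current.push(fold_accent(lower));
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

With this sketch, `tokenize_header("dateOfBirth")` yields the tokens `date`, `of`, `birth`, and `tokenize_header("Téléphone_Mobile")` yields `telephone`, `mobile`. A Japanese header such as `電話番号` passes through as a single unchanged token.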

Detection precedence

The intended detector hierarchy is:

  1. Strong value validators and exact structured patterns.
  2. Scored header taxonomy evidence.
  3. Conservative fuzzy header matching for longer taxonomy terms with sample-value confirmation.
  4. Generic fallbacks for names, numbers, enums, strings, and unknowns.

Value evidence should outrank header wording when the two conflict. Short ambiguous headers such as id, nr, code, or naam need sample evidence before they drive a high-confidence classification.
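
The hierarchy can be sketched as an ordered tier plus a score. This is only an illustration under stated assumptions: `Tier`, `Evidence`, `resolve`, and `header_counts` are hypothetical names, and the detector's actual evidence types and scoring may differ.

```rust
/// Evidence tiers, declared from weakest to strongest so that the derived
/// ordering matches the hierarchy (later variants compare as greater).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    GenericFallback,
    FuzzyHeader,
    HeaderTaxonomy,
    ValueValidator,
}

/// Hypothetical evidence record: which tier produced it, what it claims,
/// and how strongly.
#[derive(Debug, Clone, PartialEq)]
struct Evidence {
    tier: Tier,
    label: &'static str,
    score: f32,
}

/// Pick the winning evidence: higher tier first, then higher score.
fn resolve(candidates: &[Evidence]) -> Option<&Evidence> {
    candidates
        .iter()
        .max_by(|a, b| a.tier.cmp(&b.tier).then(a.score.total_cmp(&b.score)))
}

/// Short, ambiguous headers only count when sample values confirm them.
fn header_counts(header: &str, sample_confirmed: bool) -> bool {
    const AMBIGUOUS: [&str; 4] = ["id", "nr", "code", "naam"];
    !AMBIGUOUS.contains(&header.to_lowercase().as_str()) || sample_confirmed
}
```

Because the tier is compared before the score, a validator hit with a modest score still beats a confident header match, which is how value evidence outranks header wording in this sketch.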

Taxonomy scope

The taxonomy is data, not scattered Rust conditionals. It is designed to be reviewed, fixture-backed, and versioned with the detector. Current concepts include common contact, person, address, date, account, tax, network, URL, and identifier fields.
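
As a rough illustration of "data, not conditionals", the sketch below models taxonomy rows as a static table with a single lookup function. The row shape, the field names, and the sample terms are assumptions for illustration only; the actual schema and entries live in header_taxonomy.json.

```rust
/// One hypothetical taxonomy row: a normalized header term, its language,
/// and the concept it maps to.
struct Term {
    term: &'static str,
    lang: &'static str,
    concept: &'static str,
}

/// Illustrative excerpt only, not copied from the real taxonomy file.
const TAXONOMY: &[Term] = &[
    Term { term: "email", lang: "en", concept: "email" },
    Term { term: "e-mailadres", lang: "nl", concept: "email" },
    Term { term: "geburtsdatum", lang: "de", concept: "birth_date" },
    Term { term: "date de naissance", lang: "fr", concept: "birth_date" },
    Term { term: "メールアドレス", lang: "ja", concept: "email" },
];

/// Exact lookup of an already-normalized header against the table.
fn lookup(normalized_header: &str) -> Option<&'static Term> {
    TAXONOMY.iter().find(|t| t.term == normalized_header)
}
```

In this shape, covering another language means adding rows and fixtures rather than new branches, so a reviewer can diff the table directly.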

Important guardrails:

  • Private and event date headers remain exact-only where fuzzy substring matching would create false positives. For example, candidateOfBirth contains dateOfBirth as a substring but need not be a birth-date field.
  • Bare Dutch BTW or omzetbelastingnummer values are detected only under Dutch BTW header context.
  • Non-Latin pilot coverage proves the tokenizer path, but it is intentionally narrower than the Latin-language taxonomy.
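
The exact-only guardrail can be illustrated with a per-term match policy. This is a simplified sketch: `MatchPolicy` and `header_matches` are hypothetical names, and plain substring matching stands in for whatever fuzzy matching the detector actually uses.

```rust
/// How a taxonomy term is allowed to match a header.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MatchPolicy {
    /// The whole header must equal the term (case-insensitive).
    ExactOnly,
    /// The term may appear anywhere inside the header.
    AllowSubstring,
}

/// Case-insensitive match of a header against one taxonomy term.
fn header_matches(header: &str, term: &str, policy: MatchPolicy) -> bool {
    let header = header.to_lowercase();
    let term = term.to_lowercase();
    match policy {
        MatchPolicy::ExactOnly => header == term,
        MatchPolicy::AllowSubstring => header.contains(&term),
    }
}
```

Under `AllowSubstring`, the header candidateOfBirth matches the term dateOfBirth, which is the false positive the guardrail exists to prevent. Under `ExactOnly`, only a header equal to dateOfBirth (ignoring case) matches.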

Optional future experiments

The investigation evaluated semantic embeddings, GLiNER/NER, cloud DLP APIs, and Local AI classifier assistance. These are not part of the default detector path.

If revisited, semantic or Local AI detection should stay optional and be:

  • opt-in or clearly labeled as assisted evidence,
  • measured against fixtures before changing default selection behavior,
  • local-first unless the user explicitly configures a connector,
  • constrained so exact validators and deterministic evidence remain explainable,
  • reviewed for model size, packaging complexity, latency, and false positive risk.

Cloud PII APIs are useful as benchmarks, but they do not fit the default privacy boundary because CSV schema or values would leave the device.

Related pages: Security, Data models, and Testing.
