Background: multilingual detection
CSV Anonymizer's multilingual detection work is documented in docs/multilingual-detection-phased-plan.md and docs/multilingual-header-detection-investigation.md. The current implementation keeps the default detector local, deterministic, and explainable.
Active contributors: Douwe de Vries
Multilingual detection improves CSV header and value detection. It does not localize the app UI.
"Supported language" should mean deterministic header taxonomy coverage, fixture-backed expectations, detector evidence labels, and value detectors that still run independently of header language. It should not mean full parity for every jurisdiction or every free-text PII shape.
| Layer | Current behavior | Key paths |
|---|---|---|
| Unicode-safe normalization | Header parsing preserves non-ASCII text, applies Unicode normalization, supports segmentation, accent folding for Latin terms, camelCase splitting, and compact aliases | crates/csv-anonymizer-core/src/detection/header.rs |
| Header taxonomy | Maintained taxonomy covers English, Dutch, German, French, Spanish, Portuguese, Italian, and a small Japanese pilot for unambiguous concepts | crates/csv-anonymizer-core/src/detection/header_taxonomy.json, crates/csv-anonymizer-core/src/detection/header_rules.rs |
| Scored evidence | Detector output includes evidence summaries, detector labels, reasons, scores, and multi-detector source lists | crates/csv-anonymizer-core/src/detection, crates/csv-anonymizer-core/src/metadata.rs |
| Value validators | Structured validators detect emails, URLs, UUIDs, IPs, MAC addresses, IBANs, payment cards, prefixed VAT IDs, Dutch BTW under Dutch header context, US SSN/EIN, and formatted phone numbers | crates/csv-anonymizer-core/src/detection/validators.rs |
| Fixture coverage | Table-driven tests cover taxonomy, validators, privacy evidence, and multilingual matrices | crates/csv-anonymizer-core/src/detection/tests |
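The normalization layer in the table can be illustrated with a minimal sketch. It is not the code in header.rs: the names `fold_accent`, `tokenize_header`, and `compact_alias` are hypothetical, the accent map covers only a handful of Latin letters, and the sketch assumes precomposed (NFC) input instead of performing real Unicode normalization, which would typically need a dedicated crate.

```rust
/// Fold a small, illustrative set of accented Latin letters to ASCII.
/// Assumption: input is precomposed (NFC); a real detector would normalize first.
fn fold_accent(c: char) -> char {
    match c {
        'á' | 'à' | 'â' | 'ä' | 'ã' => 'a',
        'é' | 'è' | 'ê' | 'ë' => 'e',
        'í' | 'ì' | 'î' | 'ï' => 'i',
        'ó' | 'ò' | 'ô' | 'ö' | 'õ' => 'o',
        'ú' | 'ù' | 'û' | 'ü' => 'u',
        'ç' => 'c',
        'ñ' => 'n',
        other => other, // non-Latin text (for example Japanese) passes through unchanged
    }
}

/// Split a header into lowercase, accent-folded tokens.
/// Separators are any non-alphanumeric characters; camelCase boundaries
/// (lowercase or digit followed by uppercase) also start a new token.
fn tokenize_header(header: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;

    for c in header.chars() {
        if !c.is_alphanumeric() {
            // Separator: close the current token, if any.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        if c.is_uppercase() && prev_lower && !current.is_empty() {
            // camelCase boundary, e.g. "date|Of|Birth".
            tokens.push(std::mem::take(&mut current));
        }
        prev_lower = c.is_lowercase() || c.is_numeric();
        for lower in c.to_lowercase() {
            current.push(fold_accent(lower));
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

/// Join tokens into a compact alias, so "E-mail adres" and "emailadres" compare equal.
fn compact_alias(header: &str) -> String {
    tokenize_header(header).concat()
}
```

Under these assumptions, `tokenize_header("dateOfBirth")` yields `["date", "of", "birth"]`, and a header such as `氏名` is kept as a single, unmodified token, which is the property the non-ASCII preservation row describes.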
The intended detector hierarchy, from strongest to weakest evidence, is:
- Strong value validators and exact structured patterns.
- Scored header taxonomy evidence.
- Conservative fuzzy header matching for longer taxonomy terms with sample-value confirmation.
- Generic fallbacks for names, numbers, enums, strings, and unknowns.
Value evidence should outrank header wording when the two conflict. Short ambiguous headers such as id, nr, code, or naam need sample evidence before they should drive high-confidence classification.
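One way to picture the hierarchy and the short-header rule is the sketch below. The `Tier`, `Evidence`, and `select` names, the score field, and the exact filtering rule are assumptions made for illustration; the real detector's types and scoring may differ.

```rust
/// Evidence tiers, declared from weakest to strongest so the derived
/// ordering matches the hierarchy described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    GenericFallback,
    FuzzyHeader,
    HeaderTaxonomy,
    ValueValidator,
}

/// Hypothetical evidence record: which tier produced it, what it claims, how strongly.
#[derive(Debug, Clone, PartialEq)]
struct Evidence {
    tier: Tier,
    label: &'static str,
    score: f32,
}

/// Short headers that are too ambiguous to trust on wording alone.
fn is_short_ambiguous(header: &str) -> bool {
    matches!(header.to_lowercase().as_str(), "id" | "nr" | "code" | "naam")
}

/// Pick the winning evidence: higher tier first, then higher score.
/// Header-based evidence for a short ambiguous header is ignored unless
/// some value validator also matched the sampled values.
fn select<'a>(header: &str, evidence: &'a [Evidence]) -> Option<&'a Evidence> {
    let has_value_evidence = evidence.iter().any(|e| e.tier == Tier::ValueValidator);
    evidence
        .iter()
        .filter(|e| {
            let header_based = matches!(e.tier, Tier::HeaderTaxonomy | Tier::FuzzyHeader);
            !(header_based && is_short_ambiguous(header) && !has_value_evidence)
        })
        .max_by(|a, b| a.tier.cmp(&b.tier).then(a.score.total_cmp(&b.score)))
}
```

In this sketch a column headed `telefoon` whose values validate as emails is labeled from the value evidence, even if the header score is higher, while a bare `naam` header with no validator match falls through to the generic fallback.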
The taxonomy is data, not scattered Rust conditionals. It is designed to be reviewed, fixture-backed, and versioned with the detector. Current concepts include common contact, person, address, date, account, tax, network, URL, and identifier fields.
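To make "data, not scattered conditionals" concrete, here is a minimal sketch of a table-driven lookup. The `TaxonomyEntry` shape, the concept names, and the listed terms are illustrative assumptions and are not copied from header_taxonomy.json, whose actual schema is not shown on this page.

```rust
/// Hypothetical taxonomy row: one concept, one language, its accepted terms.
struct TaxonomyEntry {
    concept: &'static str,
    language: &'static str,
    terms: &'static [&'static str],
}

/// Illustrative entries only; a real taxonomy would be loaded from the JSON file.
const TAXONOMY: &[TaxonomyEntry] = &[
    TaxonomyEntry { concept: "date_of_birth", language: "en", terms: &["dateofbirth", "birthdate"] },
    TaxonomyEntry { concept: "date_of_birth", language: "nl", terms: &["geboortedatum"] },
    TaxonomyEntry { concept: "date_of_birth", language: "de", terms: &["geburtsdatum"] },
    TaxonomyEntry { concept: "email", language: "nl", terms: &["emailadres"] },
];

/// Look up a normalized, compact header and return (concept, language) on an exact term match.
fn lookup(normalized_header: &str) -> Option<(&'static str, &'static str)> {
    TAXONOMY
        .iter()
        .find(|entry| entry.terms.iter().any(|term| *term == normalized_header))
        .map(|entry| (entry.concept, entry.language))
}
```

Adding a language or concept then means adding a row and a fixture, not editing control flow, which is what keeps the taxonomy reviewable and versionable alongside the detector.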
Important guardrails:
- Private and event date headers remain exact-only where fuzzy substring matching would create false positives, for example candidateOfBirth.
- Bare Dutch BTW or omzetbelastingnummer values are detected only under Dutch BTW header context.
- Non-Latin pilot coverage proves the tokenizer path, but it is intentionally narrower than the Latin-language taxonomy.
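The first guardrail is easiest to see with the candidateOfBirth example: its compact form contains "dateofbirth" as a substring. The sketch below contrasts substring matching with exact matching; the helper names are hypothetical and the comparison is deliberately simplified.

```rust
/// Lowercase a header and drop everything that is not a letter or digit.
fn compact(header: &str) -> String {
    header
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Naive substring matching: the kind of rule the guardrail avoids for date headers.
fn substring_match(header: &str, term: &str) -> bool {
    compact(header).contains(compact(term).as_str())
}

/// Exact-only matching: the whole compact header must equal the compact term.
fn exact_match(header: &str, term: &str) -> bool {
    compact(header) == compact(term)
}
```

With these helpers, `substring_match("candidateOfBirth", "date of birth")` is true, a false positive, while `exact_match("candidateOfBirth", "date of birth")` is false and `exact_match("date_of_birth", "date of birth")` is true.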
The investigation evaluated semantic embeddings, GLiNER/NER, cloud DLP APIs, and Local AI classifier assistance. These are not part of the default detector path.
If revisited, semantic or Local AI detection should stay optional and be:
- opt-in or clearly labeled as assisted evidence,
- measured against fixtures before changing default selection behavior,
- local-first unless the user explicitly configures a connector,
- constrained so exact validators and deterministic evidence remain explainable,
- reviewed for model size, packaging complexity, latency, and false positive risk.
Cloud PII APIs are useful as benchmarks, but they do not fit the default privacy boundary because CSV schema or values would leave the device.
Related pages: Security, Data models, and Testing.