Multilingual detection

CSV Anonymizer's multilingual detection work is documented in `docs/multilingual-detection-phased-plan.md` and `docs/multilingual-header-detection-investigation.md`. The current implementation keeps the default detector local, deterministic, and explainable.

Active contributors: Douwe de Vries

Product boundary

Multilingual detection improves CSV header and value detection. It does not localize the app UI.

"Supported language" should mean deterministic header taxonomy coverage, fixture-backed expectations, detector evidence labels, and value detectors that still run independently of header language. It should not mean full parity for every jurisdiction or every free-text PII shape.

Implemented layers

| Layer | Current behavior | Key paths |
| --- | --- | --- |
| Unicode-safe normalization | Header parsing preserves non-ASCII text, applies Unicode normalization, and supports segmentation, accent folding for Latin terms, camelCase splitting, and compact aliases | `crates/csv-anonymizer-core/src/detection/header.rs` |
| Header taxonomy | Maintained taxonomy covers English, Dutch, German, French, Spanish, Portuguese, Italian, and a small Japanese pilot for unambiguous concepts | `crates/csv-anonymizer-core/src/detection/header_taxonomy.json`, `crates/csv-anonymizer-core/src/detection/header_rules.rs` |
| Scored evidence | Detector output includes evidence summaries, detector labels, reasons, scores, and multi-detector source lists | `crates/csv-anonymizer-core/src/detection`, `crates/csv-anonymizer-core/src/metadata.rs` |
| Value validators | Structured validators detect emails, URLs, UUIDs, IPs, MAC addresses, IBANs, payment cards, prefixed VAT IDs, Dutch BTW under Dutch header context, US SSN/EIN, and formatted phone numbers | `crates/csv-anonymizer-core/src/detection/validators.rs` |
| Fixture coverage | Table-driven tests cover taxonomy, validators, privacy evidence, and multilingual matrices | `crates/csv-anonymizer-core/src/detection/tests` |
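
To make the normalization layer concrete, here is a minimal sketch of header tokenization with camelCase splitting and accent folding. It is not the code in `header.rs`: the function names `fold_accent` and `tokenize_header` are hypothetical, the folding table is a tiny illustrative subset, and the sketch assumes precomposed (NFC) input instead of performing real Unicode normalization.

```rust
/// Fold a small, illustrative set of accented Latin letters to ASCII.
/// Assumption: input is precomposed (NFC); a real detector would normalize first.
fn fold_accent(c: char) -> char {
    match c {
        'á' | 'à' | 'â' | 'ä' | 'ã' | 'å' => 'a',
        'é' | 'è' | 'ê' | 'ë' => 'e',
        'í' | 'ì' | 'î' | 'ï' => 'i',
        'ó' | 'ò' | 'ô' | 'ö' | 'õ' => 'o',
        'ú' | 'ù' | 'û' | 'ü' => 'u',
        'ç' => 'c',
        'ñ' => 'n',
        other => other,
    }
}

/// Split a header into lowercase, accent-folded tokens (hypothetical helper).
/// Non-alphanumeric characters act as separators, and a lowercase-to-uppercase
/// transition starts a new token (camelCase splitting). Non-Latin text such as
/// Japanese passes through unchanged.
pub fn tokenize_header(header: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;

    for c in header.chars() {
        if !c.is_alphanumeric() {
            // Separator: close the current token, if any.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        if c.is_uppercase() && prev_lower && !current.is_empty() {
            // camelCase boundary, e.g. "date|Of|Birth".
            tokens.push(std::mem::take(&mut current));
        }
        prev_lower = c.is_lowercase();
        for lower in c.to_lowercase() {
            current.push(fold_accent(lower));
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

Under these assumptions, `tokenize_header("dateOfBirth")` yields the tokens `date`, `of`, `birth`, and `tokenize_header("Téléphone_Numéro")` yields `telephone`, `numero`. Joining the tokens gives a compact form that can be compared against taxonomy terms.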

Detection precedence

The intended detector hierarchy is:

1. Strong value validators and exact structured patterns.
2. Scored header taxonomy evidence.
3. Conservative fuzzy header matching for longer taxonomy terms with sample-value confirmation.
4. Generic fallbacks for names, numbers, enums, strings, and unknowns.
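
As an illustration of what a "strong value validator" in the first tier can look like, the sketch below implements the Luhn checksum, the standard check digit scheme for payment card numbers. Whether `validators.rs` uses exactly this check is an assumption; the function name `luhn_valid` and the 12 to 19 digit length bounds are also assumptions made for the example.

```rust
/// Return true if `input` passes the Luhn checksum.
/// Spaces and hyphens are ignored; any other non-digit rejects the value.
/// Assumption: plausible card numbers have between 12 and 19 digits.
pub fn luhn_valid(input: &str) -> bool {
    let mut digits: Vec<u32> = Vec::new();
    for c in input.chars() {
        match c {
            ' ' | '-' => continue,
            _ => match c.to_digit(10) {
                Some(d) => digits.push(d),
                None => return false,
            },
        }
    }
    if digits.len() < 12 || digits.len() > 19 {
        return false;
    }

    // Walk from the rightmost digit; double every second digit and
    // subtract 9 when the doubled value exceeds 9.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();

    sum % 10 == 0
}
```

A checksum like this is what makes value evidence "strong": a random 16-digit number passes only about one time in ten, so a column whose samples all pass is unlikely to be a coincidence.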

Value evidence should outrank header wording when the two conflict. Short, ambiguous headers such as `id`, `nr`, `code`, or `naam` need sample evidence before they drive a high-confidence classification.
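
The precedence rules can be sketched as a tier-then-score comparison. This is a minimal illustration, not the detector's actual types: `EvidenceTier`, `Evidence`, `resolve`, and `header_may_drive_high_confidence` are hypothetical names, and the scores are arbitrary.

```rust
/// Hypothetical evidence tiers, declared weakest-first so that the derived
/// `Ord` ranks later variants higher.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum EvidenceTier {
    GenericFallback,
    FuzzyHeader,
    HeaderTaxonomy,
    ValueValidator,
}

/// One piece of detector evidence (illustrative shape only).
#[derive(Debug, Clone, PartialEq)]
pub struct Evidence {
    pub tier: EvidenceTier,
    pub label: &'static str,
    pub score: f32,
}

/// Pick the winning evidence: the highest tier wins, and the score only
/// breaks ties within a tier. Returns `None` for an empty slice.
pub fn resolve(candidates: &[Evidence]) -> Option<&Evidence> {
    candidates
        .iter()
        .max_by(|a, b| a.tier.cmp(&b.tier).then(a.score.total_cmp(&b.score)))
}

/// Short headers that are too ambiguous to trust on their own.
const AMBIGUOUS_HEADERS: [&str; 4] = ["id", "nr", "code", "naam"];

/// An ambiguous header may drive a high-confidence classification only
/// when sample values back it up.
pub fn header_may_drive_high_confidence(header: &str, has_sample_evidence: bool) -> bool {
    let ambiguous = AMBIGUOUS_HEADERS.iter().any(|h| *h == header);
    !ambiguous || has_sample_evidence
}
```

With this ordering, a column headed "phone" whose values validate as email addresses resolves to the email label even if the header evidence carries a higher score, which is the conflict rule described above.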

Taxonomy scope

The taxonomy is data, not scattered Rust conditionals. It is designed to be reviewed, fixture-backed, and versioned with the detector. Current concepts include common contact, person, address, date, account, tax, network, URL, and identifier fields.
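
A small sketch of what "taxonomy as data" means in practice: terms live in a table and one generic lookup serves every language, so adding a term is a data change that reviewers and fixtures can see. The struct fields, concept names, and sample rows below are hypothetical and do not reflect the actual schema of `header_taxonomy.json`.

```rust
/// One taxonomy row. Field names are assumptions for illustration.
#[derive(Debug, PartialEq, Eq)]
pub struct TaxonomyEntry {
    pub concept: &'static str,
    pub language: &'static str,
    /// Term in normalized form (lowercase, accent-folded, compact).
    pub term: &'static str,
}

/// Illustrative rows only; the maintained taxonomy is far larger.
pub const TAXONOMY: &[TaxonomyEntry] = &[
    TaxonomyEntry { concept: "email", language: "en", term: "email" },
    TaxonomyEntry { concept: "email", language: "nl", term: "emailadres" },
    TaxonomyEntry { concept: "birth_date", language: "de", term: "geburtsdatum" },
    TaxonomyEntry { concept: "birth_date", language: "nl", term: "geboortedatum" },
    TaxonomyEntry { concept: "phone", language: "es", term: "telefono" },
    TaxonomyEntry { concept: "full_name", language: "ja", term: "氏名" },
];

/// Exact lookup of a normalized header against the table.
/// No language-specific branching is needed in code.
pub fn lookup(normalized_header: &str) -> Option<&'static TaxonomyEntry> {
    TAXONOMY.iter().find(|entry| entry.term == normalized_header)
}
```

The design point is that the lookup function never changes when a language is added; only the table grows, and each new row can be paired with a fixture.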

Important guardrails:

- Private and event date headers remain exact-only where fuzzy substring matching would create false positives. For example, `candidateOfBirth` contains `dateOfBirth` as a substring but is not a birth date field.
- Bare Dutch BTW or omzetbelastingnummer values are detected only under Dutch BTW header context.
- Non-Latin pilot coverage proves the tokenizer path, but it is intentionally narrower than the Latin-language taxonomy.
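
The first two guardrails can be sketched as follows. All names here (`MatchMode`, `header_matches`, `detect_btw`) are hypothetical, and the BTW check is a shape check only (nine digits, `B`, two digits, optionally prefixed with `NL`); it does not verify any checksum and is not taken from the project's validators.

```rust
/// Hypothetical per-term matching mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatchMode {
    ExactOnly,
    AllowSubstring,
}

/// Compare a normalized header against a taxonomy term.
pub fn header_matches(normalized_header: &str, term: &str, mode: MatchMode) -> bool {
    match mode {
        MatchMode::ExactOnly => normalized_header == term,
        MatchMode::AllowSubstring => normalized_header.contains(term),
    }
}

/// Shape check for a bare Dutch BTW number: 9 digits, 'B', 2 digits.
fn looks_like_bare_btw(value: &str) -> bool {
    let bytes = value.as_bytes();
    bytes.len() == 12
        && bytes[..9].iter().all(u8::is_ascii_digit)
        && bytes[9] == b'B'
        && bytes[10..].iter().all(u8::is_ascii_digit)
}

/// Prefixed values ("NL...") are accepted on shape alone; bare values are
/// accepted only when the header already indicates a Dutch BTW column.
pub fn detect_btw(value: &str, header_is_dutch_btw: bool) -> bool {
    let trimmed = value.trim();
    if let Some(rest) = trimmed.strip_prefix("NL") {
        return looks_like_bare_btw(rest);
    }
    header_is_dutch_btw && looks_like_bare_btw(trimmed)
}
```

In this sketch, `header_matches("candidateofbirth", "dateofbirth", MatchMode::AllowSubstring)` is true, which is exactly the false positive that `ExactOnly` prevents. Similarly, a bare `123456789B01` is too generic to flag on its own, so it counts only when the header supplies the Dutch context.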

Optional future experiments

The investigation evaluated semantic embeddings, GLiNER/NER, cloud DLP APIs, and Local AI classifier assistance. These are not part of the default detector path.

If revisited, semantic or Local AI detection should stay optional and be:

- opt-in or clearly labeled as assisted evidence,
- measured against fixtures before changing default selection behavior,
- local-first unless the user explicitly configures a connector,
- constrained so exact validators and deterministic evidence remain explainable,
- reviewed for model size, packaging complexity, latency, and false positive risk.
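
One way to read the opt-in and labeling constraints is sketched below. Nothing like this exists in the default detector path; `Source`, `Suggestion`, `DetectorOptions`, and `choose` are hypothetical names, and the rule that assisted evidence never overrides a deterministic result is an assumption drawn from the constraints above.

```rust
/// Where a suggestion came from; kept on the result so it stays labeled.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Source {
    Deterministic,
    Assisted,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Suggestion {
    pub label: String,
    pub source: Source,
}

/// Hypothetical options; `Default` leaves assisted detection off (opt-in).
#[derive(Debug, Clone, Default)]
pub struct DetectorOptions {
    pub assisted_enabled: bool,
}

/// Deterministic evidence always wins. Assisted evidence is considered only
/// when the user opted in and no deterministic result exists.
pub fn choose(
    deterministic: Option<Suggestion>,
    assisted: Option<Suggestion>,
    options: &DetectorOptions,
) -> Option<Suggestion> {
    match deterministic {
        Some(result) => Some(result),
        None if options.assisted_enabled => assisted,
        None => None,
    }
}
```

Keeping the `source` field on every suggestion means a reviewer can always tell whether a classification came from a validator or from a model.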

Cloud PII APIs are useful as benchmarks, but they do not fit the default privacy boundary because CSV schema or values would leave the device.

Related pages: Security, Data models, and Testing.
