High-performance French medical text anonymization library. Rust core with Python bindings.
Designed to process millions of documents efficiently. Inspired by Micropot/incognito.
pip install unpii
# With Polars support
pip install unpii[polars]
import unpii
text = "Dr Martin au 06 12 34 56 78, email: martin@chu-brest.fr"
# Anonymize with placeholders (default)
unpii.anonymize(text)
# → "Dr <PERSON> au <PHONE>, email: <EMAIL>"
# Anonymize with stars
unpii.anonymize(text, style="stars")
# → "Dr ***** au *****, email: *****"
Two detection levels: standard (reliable patterns) and paranoid (aggressive).
# Standard: titles, known patterns, blacklisted names
unpii.anonymize("Dr Martin est ici")
# → "Dr <PERSON> est ici"
unpii.anonymize("DUPONT Jean est ici")
# → "DUPONT Jean est ici" (not detected in standard)
# Paranoid: also catches UPPERCASE Titlecase patterns, 5+ digit sequences, loose emails
unpii.anonymize("DUPONT Jean est ici", mode="paranoid")
# → "<PERSON> est ici"
Pass additional words to mask per call. Useful when you know the patient's name:
unpii.anonymize("bob dylan est ici", mask=["bob", "dylan"])
# → "<PII> <PII> est ici"
Matching is case-insensitive and respects word boundaries:
unpii.anonymize("Bonjour Bob", mask=["bob"])
# → "Bonjour <PII>"
Skip specific categories:
unpii.anonymize("Dr Martin au 06 12 34 56 78", ignore_groups=["PHONE"])
# → "Dr <PERSON> au 06 12 34 56 78"
Dry-run mode to see what would be masked:
for span in unpii.find_spans("Dr Martin au 06 12 34 56 78"):
    print(span)
# Span(start=3, end=9, category="PERSON")
# Span(start=13, end=27, category="PHONE")
anonymize_dataframe anonymizes a column in a Polars DataFrame:
import polars as pl
import unpii
df = pl.DataFrame({"text": [
    "Dr Martin au 06 12 34 56 78",
    "Email: joe@chu-brest.fr",
    "Maladie de Parkinson",
]})
# Anonymize in place (overwrites the column)
df = unpii.anonymize_dataframe(df, "text")
# ┌─────────────────────────┐
# │ text                    │
# ╞═════════════════════════╡
# │ Dr <PERSON> au <PHONE>  │
# │ Email: <EMAIL>          │
# │ Maladie de Parkinson    │  ← protected by whitelist
# └─────────────────────────┘
# Write to a new column
df = unpii.anonymize_dataframe(df, "text", new_column="text_anonymized")
# With options
df = unpii.anonymize_dataframe(df, "text", style="stars", mode="paranoid", ignore_groups=["PHONE"])
mask_from_columns takes column names whose values are masked in the corresponding row. Useful when the patient's name or city is stored in structured columns:
df = pl.DataFrame({
    "text": ["bob est ici", "alice va bien"],
    "nom": ["bob", "alice"],
})
df = unpii.anonymize_dataframe(df, "text", mask_from_columns=["nom"])
# ┌─────────────────┬───────┐
# │ text            ┆ nom   │
# ╞═════════════════╪═══════╡
# │ <PII> est ici   ┆ bob   │
# │ <PII> va bien   ┆ alice │
# └─────────────────┴───────┘
Words to mask on every row (e.g. the doctor who wrote all reports):
df = unpii.anonymize_dataframe(df, "text", mask=["Dupont", "Cabinet Santé Plus"])
Per-row and global masks can be combined with the other options:
df = unpii.anonymize_dataframe(df, "text",
    mask_from_columns=["nom", "ville"],
    mask=["Dupont"],
    style="stars",
)
anonymize_series operates on a Polars Series directly:
masked = unpii.anonymize_series(
    df["text"],
    mask_from_columns=[df["nom"], df["ville"]],
    mask=["Dupont"],
)
df = df.with_columns(masked.alias("text_anonymized"))
anonymize_batch operates on plain Python lists (no Polars dependency):
results = unpii.anonymize_batch(["Dr Martin ici", "Email: a@b.fr"])
# → ["Dr <PERSON> ici", "Email: <EMAIL>"]
Control the number of threads used by anonymize_batch, anonymize_series, and anonymize_dataframe:
unpii.set_max_threads(4) # Use 4 threads
unpii.get_max_threads() # → 4
unpii.set_max_threads(0) # Use all available cores (default)

| Group | Placeholder | Standard | Paranoid |
|---|---|---|---|
| PERSON | `<PERSON>` | Titles + name, blacklist | UPPERCASE/Titlecase patterns, initials |
| PHONE | `<PHONE>` | French phone numbers | International numbers |
| EMAIL | `<EMAIL>` | Valid emails | Anything with @ |
| DATE | `<DATE>` | DD/MM/YYYY, literal months, ISO | — |
| BIRTH_DATE | `<BIRTH_DATE>` | né(e) le + date, date de naissance + date | — |
| LOCATION | `<LOCATION>` | Street number + type + name, blacklist (cities, regions) | Street type + name (no number) |
| ZIP_CODE | `<ZIP_CODE>` | 5 digits + city name | — |
| SSN | `<SSN>` | French social security number (NIR) | — |
| IBAN | `<IBAN>` | French IBAN | — |
| URL | `<URL>` | — | http(s) URLs |
| NUMBER | `<NUMBER>` | — | 5+ consecutive digits |
| PII | `<PII>` | Custom words passed via `mask=` | — |
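As a rough illustration of how the Standard and Paranoid columns differ, the sketch below implements two rows of the table (EMAIL and NUMBER) with Python's `re` module. The patterns, the `RULES` dict, and `detect` are hypothetical and much simpler than the rules embedded in the library; they only show the idea of paranoid mode adding looser patterns on top of the standard ones.

```python
import re

# Hypothetical, simplified patterns (not the library's actual rules).
STRICT_EMAIL = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)
LOOSE_EMAIL = re.compile(r"\S*@\S*")   # "anything with @"
LONG_NUMBER = re.compile(r"\d{5,}")    # 5+ consecutive digits

RULES = {
    "standard": [("EMAIL", STRICT_EMAIL)],
    "paranoid": [
        ("EMAIL", STRICT_EMAIL),
        ("EMAIL", LOOSE_EMAIL),
        ("NUMBER", LONG_NUMBER),
    ],
}

def detect(text, mode="standard"):
    """Return non-overlapping (start, end, category) tuples for the given mode."""
    found = []
    for category, pattern in RULES[mode]:
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), category))
    # Earliest match first; at the same start, prefer the longest match.
    found.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept, last_end = [], 0
    for start, end, category in found:
        if start >= last_end:  # skip anything overlapping a kept match
            kept.append((start, end, category))
            last_end = end
    return kept

print(detect("user@localhost", mode="standard"))  # prints: []
print(detect("user@localhost", mode="paranoid"))  # prints: [(0, 14, 'EMAIL')]
```

The trade-off is the usual one: the loose patterns catch malformed identifiers that the strict ones miss, at the cost of more false positives.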
Medical eponyms (Parkinson, Alzheimer, Verneuil...) are protected from masking by a global whitelist, regardless of which group detected them.
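One way to picture the whitelist is as a filter applied to detected spans before they are replaced. The sketch below is an assumption about the general approach, not the library's actual code: `Span` mirrors the documented `.start`, `.end`, and `.category` attributes, while `WHITELIST`, `drop_whitelisted`, and `render` are hypothetical names.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int
    end: int
    category: str

# Hypothetical whitelist; the real one is embedded in the library.
WHITELIST = {"parkinson", "alzheimer", "verneuil"}

def drop_whitelisted(text, spans):
    """Remove spans whose text is a whitelisted term, whatever their category."""
    return [s for s in spans if text[s.start:s.end].lower() not in WHITELIST]

def render(text, spans, style="placeholder"):
    """Replace each span with <CATEGORY>, or with a fixed run of stars."""
    out, cursor = [], 0
    for s in sorted(spans):
        out.append(text[cursor:s.start])  # untouched text before the span
        out.append(f"<{s.category}>" if style == "placeholder" else "*****")
        cursor = s.end
    out.append(text[cursor:])             # remaining tail
    return "".join(out)

text = "Dr Martin, maladie de Parkinson"
# Pretend a name detector flagged both "Martin" and "Parkinson".
spans = [Span(3, 9, "PERSON"), Span(22, 31, "PERSON")]
print(render(text, drop_whitelisted(text, spans)))
# prints: Dr <PERSON>, maladie de Parkinson
```

Filtering after detection, rather than inside each detector, is what makes the protection independent of the group that produced the span.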
# Single text
def anonymize(text, *, style="placeholder", mode="standard", ignore_groups=None, mask=None) -> str
def find_spans(text, *, mode="standard", ignore_groups=None, mask=None) -> list[Span]
# Batch (plain Python lists, no Polars needed)
def anonymize_batch(texts, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> list[str | None]
# DataFrame (requires polars)
def anonymize_dataframe(df, column, *, mask_from_columns=None, mask=None, new_column=None, style="placeholder", mode="standard", ignore_groups=None) -> DataFrame
def anonymize_series(series, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> Series
# Threading
def set_max_threads(n: int) -> None # 0 = all cores (default)
def get_max_threads() -> int
# Span attributes: .start, .end, .category
Rust core with compiled regex and Aho-Corasick automata. All rules and dictionaries are embedded in the binary, so there is zero I/O at runtime.
anonymize_batch, anonymize_series, and anonymize_dataframe use rayon for automatic parallelization across all cores.
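For a mental model of the batch functions, here is a pure-Python analogue using a thread pool. It is only a sketch: the library parallelizes in Rust with rayon, whereas `parallel_map` and `mask_digits` below are hypothetical stand-ins, and passing `None` through unchanged is an assumption based on the `list[str | None]` return type.

```python
import os
import re
from concurrent.futures import ThreadPoolExecutor

def mask_digits(text):
    """Stand-in for a per-document anonymizer: masks runs of 5+ digits."""
    return re.sub(r"\d{5,}", "<NUMBER>", text)

def parallel_map(texts, func, max_threads=0):
    """Apply func to each text in parallel, keeping the input order.

    max_threads=0 means "use all available cores", as with set_max_threads(0).
    None entries are passed through unchanged (assumption).
    """
    workers = max_threads or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(lambda t: None if t is None else func(t), texts))

print(parallel_map(["dossier 1234567", None, "rien"], mask_digits, max_threads=2))
# prints: ['dossier <NUMBER>', None, 'rien']
```

In CPython the global interpreter lock limits what such a pool gains on CPU-bound work, which is one reason to do the parallel work in native code instead.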
License: MIT
https://github.com/micropot/incognito https://github.com/microsoft/presidio