unpii

High-performance French medical text anonymization library. Rust core with Python bindings.

Designed to process millions of documents efficiently. Inspired by Micropot/incognito.

Installation

pip install unpii

# With Polars support
pip install unpii[polars]

Quick Start

import unpii

text = "Dr Martin au 06 12 34 56 78, email: martin@chu-brest.fr"

# Anonymize with placeholders (default)
unpii.anonymize(text)
# → "Dr <PERSON> au <PHONE>, email: <EMAIL>"

# Anonymize with stars
unpii.anonymize(text, style="stars")
# → "Dr ***** au *****, email: *****"

Detection Modes

Two detection levels: standard (reliable patterns) and paranoid (aggressive).

# Standard: titles, known patterns, blacklisted names
unpii.anonymize("Dr Martin est ici")
# → "Dr <PERSON> est ici"

unpii.anonymize("DUPONT Jean est ici")
# → "DUPONT Jean est ici"  (not detected in standard)

# Paranoid: also catches UPPERCASE Titlecase patterns, 5+ digit sequences, loose emails
unpii.anonymize("DUPONT Jean est ici", mode="paranoid")
# → "<PERSON> est ici"
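
The aggressiveness of paranoid mode can be pictured with a single regular expression. The sketch below is a hypothetical approximation of the "UPPERCASE Titlecase" rule written with Python's `re` module; the pattern and the `mask_upper_title` helper are assumptions made for illustration, not unpii's actual rules.

```python
import re

# Assumed approximation: an all-caps word (2+ letters) followed by a
# capitalized word. Accented Latin-1 letters are included in both classes.
UPPER_TITLE = re.compile(r"\b[A-ZÀ-ÖØ-Þ]{2,}\s+[A-ZÀ-ÖØ-Þ][a-zß-öø-ÿ]+\b")

def mask_upper_title(text: str) -> str:
    """Replace every UPPERCASE Titlecase pair with a <PERSON> placeholder."""
    return UPPER_TITLE.sub("<PERSON>", text)

print(mask_upper_title("DUPONT Jean est ici"))   # <PERSON> est ici
print(mask_upper_title("Dr Martin est ici"))     # Dr Martin est ici (no all-caps word)
print(mask_upper_title("IRM Lombaire normale"))  # <PERSON> normale (false positive)
```

The last call shows the trade-off: a purely shape-based pattern like this one also fires on non-names, which is the kind of over-masking an aggressive mode accepts in exchange for fewer missed names.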

Custom Words to Mask

Pass additional words to mask per call. Useful when you know the patient's name:

unpii.anonymize("bob dylan est ici", mask=["bob", "dylan"])
# → "<PII> <PII> est ici"

Case-insensitive with word boundary checks:

unpii.anonymize("Bonjour Bob", mask=["bob"])
# → "Bonjour <PII>"
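
The matching behaviour described above (case-insensitive, whole words only) can be approximated in plain Python with the standard `re` module. This is only an illustrative sketch, not unpii's Rust implementation; `mask_words` is a hypothetical helper.

```python
import re

def mask_words(text: str, words: list[str], placeholder: str = "<PII>") -> str:
    """Replace whole-word, case-insensitive occurrences of `words`."""
    words = [w for w in words if w]  # ignore empty entries
    if not words:
        return text
    # Longest first so multi-word entries win over their own prefixes.
    alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    # (?<!\w) and (?!\w) act as word boundaries on both sides of the match.
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
    return pattern.sub(placeholder, text)

print(mask_words("Bonjour Bob", ["bob"]))    # Bonjour <PII>
print(mask_words("Bobine de fil", ["bob"]))  # Bobine de fil (no match inside a longer word)
```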

Ignore Groups

Skip specific categories:

unpii.anonymize("Dr Martin au 06 12 34 56 78", ignore_groups=["PHONE"])
# → "Dr <PERSON> au 06 12 34 56 78"

Inspect Detected Spans

Dry-run mode to see what would be masked:

for span in unpii.find_spans("Dr Martin au 06 12 34 56 78"):
    print(span)
# Span(start=3, end=9, category="PERSON")
# Span(start=13, end=27, category="PHONE")
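
Because each span carries `start`, `end`, and `category`, the masked text can be rebuilt from the spans alone. A minimal sketch, assuming offsets are end-exclusive character indices (consistent with the output above, where 3..9 covers "Martin") and that spans do not overlap; the `Span` dataclass is a stand-in for the library's type and `apply_spans` is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """Stand-in for unpii's Span, exposing the same three attributes."""
    start: int
    end: int
    category: str

def apply_spans(text: str, spans: list[Span]) -> str:
    """Replace each [start, end) range with a <CATEGORY> placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.category}>" + text[span.end:]
    return text

text = "Dr Martin au 06 12 34 56 78"
spans = [Span(3, 9, "PERSON"), Span(13, 27, "PHONE")]
print(apply_spans(text, spans))  # Dr <PERSON> au <PHONE>
```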

DataFrame Integration

anonymize_dataframe anonymizes a column in a Polars DataFrame:

import polars as pl
import unpii

df = pl.DataFrame({"text": [
    "Dr Martin au 06 12 34 56 78",
    "Email: joe@chu-brest.fr",
    "Maladie de Parkinson",
]})

# Anonymize in place (overwrites the column)
df = unpii.anonymize_dataframe(df, "text")
# ┌─────────────────────────┐
# │ text                    │
# ╞═════════════════════════╡
# │ Dr <PERSON> au <PHONE>  │
# │ Email: <EMAIL>          │
# │ Maladie de Parkinson    │  ← protected by whitelist
# └─────────────────────────┘

# Write to a new column
df = unpii.anonymize_dataframe(df, "text", new_column="text_anonymized")

# With options
df = unpii.anonymize_dataframe(df, "text", style="stars", mode="paranoid", ignore_groups=["PHONE"])

Per-row words to mask (`mask_from_columns`)

Pass column names whose values are added as words to mask, per row. Useful when patient name/city are in structured columns:

df = pl.DataFrame({
    "text": ["bob est ici", "alice va bien"],
    "nom": ["bob", "alice"],
})

df = unpii.anonymize_dataframe(df, "text", mask_from_columns=["nom"])
# ┌─────────────────┬───────┐
# │ text            ┆ nom   │
# ╞═════════════════╪═══════╡
# │ <PII> est ici   ┆ bob   │
# │ <PII> va bien   ┆ alice │
# └─────────────────┴───────┘

Global words to mask (`mask`)

Words to mask on every row (e.g. the doctor who wrote all reports):

df = unpii.anonymize_dataframe(df, "text", mask=["Dupont", "Cabinet Santé Plus"])

Both combined

# Assumes df also has a "ville" column
df = unpii.anonymize_dataframe(df, "text",
    mask_from_columns=["nom", "ville"],
    mask=["Dupont"],
    style="stars",
)
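
To make the difference between per-row and global words concrete, here is a pure-Python sketch of the semantics described above: each row is masked with its own column values plus the shared list. It is not how unpii works internally; `mask_rows` and the dict-based rows are assumptions chosen so the example runs without Polars.

```python
import re

def mask_rows(rows, text_key, word_keys, global_words=()):
    """Mask each row's text with that row's column values plus the global words."""
    masked = []
    for row in rows:
        # Per-row words come from the named columns; missing or empty values are skipped.
        words = [row[k] for k in word_keys if row.get(k)]
        words += [w for w in global_words if w]
        if not words:
            masked.append(row[text_key])
            continue
        alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
        pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
        masked.append(pattern.sub("<PII>", row[text_key]))
    return masked

rows = [
    {"text": "bob est ici avec Dupont", "nom": "bob"},
    {"text": "alice va bien, bob aussi", "nom": "alice"},
]
print(mask_rows(rows, "text", ["nom"], global_words=["Dupont"]))
# ['<PII> est ici avec <PII>', '<PII> va bien, bob aussi']
```

In the second row "bob" is left untouched: it is a per-row word of the first row only, whereas "Dupont" would be masked wherever it appears.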

Low-level: `anonymize_series`

Operates on a Polars Series directly. Here mask_from_columns takes Series objects rather than column names:

masked = unpii.anonymize_series(
    df["text"],
    mask_from_columns=[df["nom"], df["ville"]],
    mask=["Dupont"],
)
df = df.with_columns(masked.alias("text_anonymized"))

Batch processing: `anonymize_batch`

Operates on plain Python lists (no Polars dependency):

results = unpii.anonymize_batch(["Dr Martin ici", "Email: a@b.fr"])
# → ["Dr <PERSON> ici", "Email: <EMAIL>"]
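
For inputs too large to hold in a single list, one option is to feed the batch function in fixed-size chunks. The helper below is a hypothetical sketch, not part of unpii: it takes the batch function as a parameter, and `fake_batch` is a stand-in used only so the example runs without the library installed.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

def anonymize_in_chunks(
    texts: Iterable[str],
    batch_fn: Callable[[list[str]], list[str]],
    chunk_size: int = 10_000,
) -> Iterator[str]:
    """Lazily run `batch_fn` over `texts`, one list of `chunk_size` items at a time."""
    it = iter(texts)
    # islice pulls at most chunk_size items; an empty list ends the loop.
    while chunk := list(islice(it, chunk_size)):
        yield from batch_fn(chunk)

def fake_batch(chunk: list[str]) -> list[str]:
    """Stand-in for a real batch function such as unpii.anonymize_batch."""
    return [t.upper() for t in chunk]

print(list(anonymize_in_chunks(["a", "b", "c"], fake_batch, chunk_size=2)))
# ['A', 'B', 'C']
```

With the library installed, `unpii.anonymize_batch` would be passed in place of `fake_batch`.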

Threading

Control the number of threads used by anonymize_batch, anonymize_series, and anonymize_dataframe:

unpii.set_max_threads(4)     # Use 4 threads
unpii.get_max_threads()      # → 4
unpii.set_max_threads(0)     # Use all available cores (default)

Whitelist

Medical eponyms (Parkinson, Alzheimer, Verneuil...) are protected from masking by a global whitelist, regardless of which group detected them.
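
A whitelist that applies "regardless of which group detected them" can be read as a filter that runs after detection, over spans of every category. The sketch below illustrates that reading under stated assumptions: the `Span` dataclass is a stand-in, `span_of` and `drop_whitelisted` are hypothetical helpers, and the three-entry set only mirrors the names quoted above, while the library ships its own list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """Stand-in for unpii's Span (start, end, category)."""
    start: int
    end: int
    category: str

# Sample entries taken from the sentence above, lowercased for comparison.
WHITELIST = {"parkinson", "alzheimer", "verneuil"}

def span_of(text: str, word: str, category: str) -> Span:
    """Build a span covering the first occurrence of `word` in `text`."""
    start = text.index(word)
    return Span(start, start + len(word), category)

def drop_whitelisted(text: str, spans: list[Span]) -> list[Span]:
    """Keep only spans whose covered text is not a whitelisted term."""
    return [s for s in spans if text[s.start:s.end].lower() not in WHITELIST]

text = "Maladie de Parkinson suivie par Dr Martin"
candidates = [span_of(text, "Parkinson", "PERSON"), span_of(text, "Martin", "PERSON")]
kept = drop_whitelisted(text, candidates)
print([text[s.start:s.end] for s in kept])  # ['Martin']
```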

API Reference

# Single text
def anonymize(text, *, style="placeholder", mode="standard", ignore_groups=None, mask=None) -> str
def find_spans(text, *, mode="standard", ignore_groups=None, mask=None) -> list[Span]

# Batch (plain Python lists, no Polars needed)
def anonymize_batch(texts, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> list[str | None]

# DataFrame (requires polars)
def anonymize_dataframe(df, column, *, mask_from_columns=None, mask=None, new_column=None, style="placeholder", mode="standard", ignore_groups=None) -> DataFrame
def anonymize_series(series, *, mask_from_columns=None, mask=None, style="placeholder", mode="standard", ignore_groups=None) -> Series

# Threading
def set_max_threads(n: int) -> None   # 0 = all cores (default)
def get_max_threads() -> int

# Span attributes: .start, .end, .category

Performance

Rust core with compiled regex and Aho-Corasick automata. All rules and dictionaries are embedded in the binary — zero I/O at runtime.

anonymize_batch, anonymize_series, and anonymize_dataframe use rayon for automatic parallelization across all cores.

License

MIT

Categories

| Group | Placeholder | Standard | Paranoid |
|---|---|---|---|
| PERSON | `<PERSON>` | Titles + name, blacklist | UPPERCASE/Titlecase patterns, initials |
| PHONE | `<PHONE>` | French phone numbers | International numbers |
| EMAIL | `<EMAIL>` | Valid emails | Anything with `@` |
| DATE | `<DATE>` | DD/MM/YYYY, literal months, ISO | — |
| BIRTH_DATE | `<BIRTH_DATE>` | né(e) le + date, date de naissance + date | — |
| LOCATION | `<LOCATION>` | Street number + type + name, blacklist (cities, regions) | Street type + name (no number) |
| ZIP_CODE | `<ZIP_CODE>` | 5 digits + city name | — |
| SSN | `<SSN>` | French social security number (NIR) | — |
| IBAN | `<IBAN>` | French IBAN | — |
| URL | `<URL>` | — | http(s) URLs |
| NUMBER | `<NUMBER>` | — | 5+ consecutive digits |
| PII | `<PII>` | Custom words passed via `mask=` | — |
