The art of the mask - hide the identity, keep the meaning.
Carnaval is an open-source Python framework for reversible PII anonymization. It masks sensitive entities in text documents before they are sent to a cloud LLM, then restores the original values in the structured response the LLM returns.
You want to use a cloud LLM (Claude, GPT, Mistral, Gemini...) to process text documents - order acknowledgements, invoices, business emails, contracts - but those documents contain personal or confidential data that must never leave your infrastructure in clear text.
RAW DOCUMENT ──▶ [ Carnaval ] ──▶ MASKED DOCUMENT ──▶ Cloud LLM
                                                           │
FINAL DOCUMENT ◀── [ Carnaval ] ◀── JSON / XML response ◀──┘
- Before sending - sensitive entities are replaced with placeholders such as [PERSON_1], [EMAIL_2], [ORG]. The placeholder ↔ real-value mapping is stored in an encrypted local vault.
- After the response - the original values are re-injected into the JSON or XML structure returned by the LLM.
Sensitive values never leave your machine in clear text, and the LLM still receives a coherent, structured document it can reason about.
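The round trip described above can be sketched in a few lines. This is a minimal illustration, not Carnaval's implementation: it detects only e-mail addresses with a simplified regex, the function names mask_emails and unmask are hypothetical, and the mapping lives in a plain dict where Carnaval uses an encrypted vault.

```python
import re

# Simplified e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct e-mail address with a numbered placeholder."""
    mapping: dict[str, str] = {}  # placeholder -> original value
    seen: dict[str, str] = {}     # original value -> placeholder

    def substitute(match: re.Match) -> str:
        value = match.group(0)
        if value not in seen:
            placeholder = f"[EMAIL_{len(seen) + 1}]"
            seen[value] = placeholder
            mapping[placeholder] = value
        # A repeated value reuses its placeholder, so cross-references survive.
        return seen[value]

    return EMAIL_RE.sub(substitute, text), mapping


def unmask(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back in place of the placeholders."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


original = "Write to jane.doe@example.com; jane.doe@example.com replies fast."
masked, mapping = mask_emails(original)
print(masked)  # Write to [EMAIL_1]; [EMAIL_1] replies fast.
```

Calling unmask(masked, mapping) returns the original sentence, which is the property the re-injection step relies on.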
- Reversible - every masked entity maps to a unique placeholder; the mapping lives in an AES-256-GCM encrypted vault.
- Coherent - the same value always receives the same placeholder within a run, so the LLM can reason about cross-references.
- Local-first - no network calls to anonymize. The optional neural model runs on your own machine.
- 10 entity types - PERSON, ORGANIZATION, LOCATION, EMAIL, PHONE, IBAN, BIC, VAT, SIREN/SIRET, URL.
- Layered detection - regex recognizers, deny lists, bundled dictionaries (GeoNames cities, first names), and an optional zero-shot neural recognizer (GLiNER).
- Multilingual - 6 languages: French, English, German, Spanish, Italian, Portuguese.
- Business profiles - acknowledge, invoice, email, plus private per-client profiles kept out of version control.
- 8 output formats - TXT, JSON, JSONL, XML, CoNLL, HTML, encrypted vault, audit metadata - all produced in a single pass.
- CLI and library - use the carnaval-anonymize / carnaval-reinject commands, or import carnaval directly into your Python code.
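To make the idea of layered detection concrete, here is a small sketch that combines two of the layers listed above, regex recognizers and a deny list. Everything in it is an assumption made for illustration: the Span class, the detect function, the patterns and the deny-list entries are hypothetical and do not reflect Carnaval's internal API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    text: str


# Hypothetical, deliberately simplified patterns.
REGEX_RECOGNIZERS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://[^\s,;]+"),
}

# Hypothetical deny list: literal strings that must always be masked.
DENY_LIST = {"ORGANIZATION": ["Acme Corp", "Globex"]}


def detect(text: str) -> list[Span]:
    """Run every layer and return all candidate spans in document order."""
    spans = []
    for label, pattern in REGEX_RECOGNIZERS.items():
        for m in pattern.finditer(text):
            spans.append(Span(m.start(), m.end(), label, m.group(0)))
    for label, terms in DENY_LIST.items():
        for term in terms:
            for m in re.finditer(re.escape(term), text):
                spans.append(Span(m.start(), m.end(), label, term))
    return sorted(spans, key=lambda s: s.start)


spans = detect("Acme Corp invoice: see https://example.com or mail billing@example.com")
for s in spans:
    print(s.label, repr(s.text))
```

Each layer only proposes candidates; deciding between overlapping candidates is a separate concern.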
Carnaval is built as 7 self-contained stages, each with a clear input → output contract:
TXT ──▶ S1 Intake ──▶ S2 Preprocess ──▶ S3 Detect ──▶ S4 Resolve ──▶ S5 Mask ──▶ S6 Output
        (read)        (language,        (recognizers) (dedup,        (placeholders (8 formats)
                       normalize)                      arbitration)   + vault)
JSON / XML ──▶ S7 Reinject ──▶ JSON / XML with original values restored
See Architecture for details on each stage.
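As an illustration of what a resolve step such as S4 has to do, the sketch below removes duplicate candidates and arbitrates between overlapping ones. The rule used here (the longest span wins) is an assumption chosen for simplicity, not a description of Carnaval's actual arbitration logic, and the resolve function is hypothetical.

```python
def resolve(spans: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Keep a non-overlapping subset of (start, end, label), preferring longer spans."""
    unique = set(spans)  # dedup: drop exact duplicates
    # Longest first; ties broken by start offset, then label, for determinism.
    ranked = sorted(unique, key=lambda s: (-(s[1] - s[0]), s[0], s[2]))
    kept: list[tuple[int, int, str]] = []
    for start, end, label in ranked:
        # Accept a span only if it does not overlap anything already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


candidates = [
    (0, 8, "PERSON"),
    (0, 8, "PERSON"),     # exact duplicate from a second recognizer
    (5, 8, "LOCATION"),   # nested inside the PERSON span
    (20, 41, "EMAIL"),
    (25, 41, "URL"),      # overlaps the tail of the EMAIL span
]
print(resolve(candidates))  # [(0, 8, 'PERSON'), (20, 41, 'EMAIL')]
```

A production resolver would probably also weigh recognizer confidence and entity priority; length alone is just the simplest rule that yields a consistent result.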
Requires Python 3.11+ (tested on 3.13).
pip install carnaval

This installs the core library and the carnaval-anonymize and carnaval-reinject command-line tools.
The optional zero-shot neural recognizer (GLiNER) is not installed by
default - it pulls in PyTorch. Enable it with the ai extra:
pip install "carnaval[ai]"

The GLiNER model (~500 MB) is then downloaded automatically on first use; afterwards Carnaval works fully offline. See the Installation guide for an offline / air-gapped setup.
To work on Carnaval itself:
git clone https://github.com/carnaval-ai/carnaval.git
cd carnaval
python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# Linux / macOS
source .venv/bin/activate
pip install -r requirements.txt

Carnaval reads the vault password from a .env file in your working directory. Create one and set a strong secret (16 characters minimum, 32+ recommended):
CARNAVAL_VAULT_PASSWORD=a-strong-randomly-generated-secret
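If you call the library from your own script, you may want to load and validate the same secret yourself. The helper below is a standard-library sketch under stated assumptions: read_vault_password is a hypothetical name, the precedence (environment variable first, then the .env file) is a choice made for this example, and only the 16-character minimum comes from the text above.

```python
import os
from pathlib import Path

MIN_LENGTH = 16  # minimum length stated above


def read_vault_password(env_file: Path = Path(".env")) -> str:
    """Return CARNAVAL_VAULT_PASSWORD from the environment or a .env file."""
    value = os.environ.get("CARNAVAL_VAULT_PASSWORD")
    if value is None and env_file.is_file():
        for line in env_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line.startswith("#") or "=" not in line:
                continue  # skip comments and malformed lines
            key, _, raw = line.partition("=")
            if key.strip() == "CARNAVAL_VAULT_PASSWORD":
                value = raw.strip().strip('"').strip("'")
                break
    if value is None:
        raise RuntimeError("CARNAVAL_VAULT_PASSWORD is not set")
    if len(value) < MIN_LENGTH:
        raise ValueError(f"vault password must be at least {MIN_LENGTH} characters")
    return value
```

Failing early on a short or missing secret avoids producing a vault that is protected by a weak password.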
# 1. Anonymize a document
carnaval-anonymize inbox/order.txt --profile acknowledge
# 2. Send outbox/txt/order_anonymise.txt to your LLM, collect a JSON response
# 3. Re-inject the real values into the LLM response
carnaval-reinject response.json --vault outbox/vault/order_vault.enc

carnaval-anonymize produces, in one pass, all 8 output files under outbox/ (txt/, json/, jsonl/, xml/, conll/, html/, vault/, meta/).
Useful flags: --no-gliner (regex + deny lists only, faster),
--gliner-threshold 0.6, --profile invoice, --private my_client,
--console (human-readable logs).
from pathlib import Path
from carnaval.pipeline import run_anonymization

masked, written, config = run_anonymization(
    input_path=Path("inbox/order.txt"),
    outbox_dir=Path("outbox"),
    vault_password="a-strong-randomly-generated-secret",
    profile="acknowledge",
    use_gliner=True,
)
print(masked.anonymized_text)  # text with placeholders
print(masked.by_category)      # {'PERSON': 2, 'ORGANIZATION': 1, ...}
print(written.json_path)       # path to the JSON output

Re-injecting an LLM response:
from carnaval.core.vault import Vault
from carnaval.stages.s7_reinject import reinject_json_data
vault = Vault(password="a-strong-randomly-generated-secret",
              path="outbox/vault/order_vault.enc")
vault.load()
llm_response = {"supplier": "[ORG_1]", "contact": "[PERSON_1]"}
restored = reinject_json_data(llm_response, vault)
# {"supplier": "Globex Inc.", "contact": "Jane Doe"}

See the Quickstart and Reinjection wiki pages for more.
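Conceptually, re-injecting into a JSON response is a recursive walk over the parsed structure. The sketch below shows that idea with a plain dict in place of the vault. It is not the code behind reinject_json_data: the reinject function is hypothetical, and it assumes placeholders always have the form [LABEL_N].

```python
import re
from typing import Any

# Assumed placeholder shape: upper-case label, underscore, number.
PLACEHOLDER_RE = re.compile(r"\[[A-Z]+_\d+\]")


def reinject(data: Any, mapping: dict[str, str]) -> Any:
    """Return a copy of `data` with every known placeholder replaced."""
    if isinstance(data, str):
        # Placeholders missing from the mapping are left untouched.
        return PLACEHOLDER_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), data)
    if isinstance(data, list):
        return [reinject(item, mapping) for item in data]
    if isinstance(data, dict):
        return {key: reinject(value, mapping) for key, value in data.items()}
    return data  # numbers, booleans and None pass through unchanged


mapping = {"[ORG_1]": "Globex Inc.", "[PERSON_1]": "Jane Doe"}
response = {"supplier": "[ORG_1]", "contacts": ["[PERSON_1]", "[PERSON_2]"], "lines": 3}
print(reinject(response, mapping))
# {'supplier': 'Globex Inc.', 'contacts': ['Jane Doe', '[PERSON_2]'], 'lines': 3}
```

Leaving unknown placeholders in place, rather than raising, makes it easy to spot values the LLM invented or altered.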
The placeholder ↔ value mapping is stored in an encrypted vault:
| Property | Value |
|---|---|
| Symmetric cipher | AES-256-GCM (authenticated encryption) |
| Key derivation | PBKDF2-HMAC-SHA256, 600,000 iterations |
| Salt | 16 random bytes per file |
| Nonce | 16 random bytes per file |
| Integrity tag | 16 bytes - any tampering is detected on read |
Without the password, the vault is unreadable. Carnaval makes no outbound network calls once the GLiNER model has been downloaded, and its structured logger redacts sensitive keys by default. It supports GDPR-style pseudonymization (Article 4(5)). See Vault and Security.
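The key-derivation row of the table can be reproduced with the Python standard library alone. The sketch below derives a 256-bit key from a password using the listed parameters. It is an illustration rather than Carnaval's vault code: derive_key is a hypothetical name, and the AES-256-GCM encryption step is not shown because it requires a third-party cryptography library.

```python
import hashlib
import secrets

ITERATIONS = 600_000  # PBKDF2 iteration count from the table
SALT_BYTES = 16       # salt size from the table
KEY_BYTES = 32        # 256-bit key, as required by AES-256


def derive_key(password: str, salt: bytes) -> bytes:
    """Derive an AES-256 key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=KEY_BYTES
    )


salt = secrets.token_bytes(SALT_BYTES)  # fresh random salt for each vault file
key = derive_key("a-strong-randomly-generated-secret", salt)
print(len(key))  # 32
```

Because the salt is random per file, the same password yields a different key for each vault, and the salt must be stored alongside the ciphertext so the key can be derived again on read.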
French (FR), English (EN), German (DE), Spanish (ES), Italian (IT) and Portuguese (PT). The language is auto-detected; mixed-language documents are handled via in-text linguistic markers. See Multilingual.
Carnaval is a functional proof of concept. Core anonymization, re-injection, the encrypted vault and the 8 output formats are implemented and covered by an extensive automated test suite.
pytest # full suite (skips slow neural tests)
pytest -m slow # real GLiNER tests (downloads the model)
pytest --cov=src/carnaval # with coverage

The complete reference lives in the project wiki:
- Home - overview and table of contents
- Installation
- Quickstart
- Architecture
- Vault and Security
- Profiles
- Recognizers
- Multilingual
- Output Formats
- Reinjection
- Troubleshooting
- Contributing
The original design notes are kept under docs/.
Contributions are welcome - see CONTRIBUTING.md and our Code of Conduct. Please use only fictitious entities (Acme Corp, Globex, Jane Doe, Springfield...) in public fixtures and examples.
- General questions, conduct reports: carnaval.oss@gmail.com
- Bug reports and feature requests: GitHub issues
- Security vulnerabilities: please do not open a public issue - see SECURITY.md for responsible disclosure.
If you use Carnaval in your work, please cite it via its archived DOI:
Patrice AUBERT. Carnaval: a reversible PII anonymization framework. 2026. DOI: 10.5281/zenodo.20219603
A machine-readable CITATION.cff is included - GitHub turns it
into a "Cite this repository" button.
Carnaval is released under the Apache License 2.0. See LICENSE.