didmar/sentinel

Sentinel

Sentinel is a pseudonymization engine tailored for legal work. It is designed to keep sensitive information out of the text you send to AI assistants.

Sentinel detects personally identifiable information (PII) and sensitive entities in text, replaces them with standardized tags (e.g., [PERSON_NAME_1], [EMAIL_ADDRESS_1]), and can reverse the process (deanonymization) to restore original values.
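To make the tag/restore round trip concrete, here is a minimal standard-library sketch. It is not Sentinel's implementation: it only handles email addresses, uses a deliberately simplified regular expression, and the function names `pseudonymize` and `deanonymize` are hypothetical.

```python
import re

# Simplified email pattern, for illustration only (assumption, not Sentinel's recognizer).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Matches placeholders such as [EMAIL_ADDRESS_1] or [PERSON_NAME_12].
TAG_RE = re.compile(r"\[[A-Z_]+_\d+\]")


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct email with a numbered tag.

    Returns the rewritten text and a tag -> original value mapping.
    """
    tag_for_value: dict[str, str] = {}

    def replace(match: re.Match[str]) -> str:
        value = match.group(0)
        # The same value always gets the same tag.
        if value not in tag_for_value:
            tag_for_value[value] = f"[EMAIL_ADDRESS_{len(tag_for_value) + 1}]"
        return tag_for_value[value]

    anonymized = EMAIL_RE.sub(replace, text)
    return anonymized, {tag: value for value, tag in tag_for_value.items()}


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore original values; tags missing from the mapping are left untouched."""
    return TAG_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)
```

Keeping the mapping outside the anonymized text is what makes the process reversible: the text sent to an assistant carries only tags, and the mapping stays local.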

Features

  • Hybrid NER pipeline — combines a fine-tuned GLiNER-based spaCy model with 14 rule-based Presidio recognizers for high recall
  • 24 entity types across 6 groups (names, locations, electronic addresses, numbers, dates, other), designed specifically for legal work
  • Multi-language support — English and French
  • Long-document support — splits text into configurable chunks before analysis
  • Entity aliases — link variations of the same entity (e.g., "John Doe" / "John" / "J. Doe") to a single ID
  • Differential anonymization — per-occurrence rules to anonymize or keep specific entities
  • Deanonymization — reverse placeholders back to original values (useful for LLM output post-processing)
  • GPU acceleration — optional GPU support for the NER model, with automatic CPU fallback

Architecture

Text input
  │
  ├─ Chunk splitting (RecursiveCharacterTextSplitter)
  │
  ├─ NER detection (parallel per chunk)
  │   ├─ GLiNER spaCy model ─── names, orgs, locations, ...
  │   └─ Rule-based recognizers ─── emails, dates, IPs, URLs, ...
  │        └─ Presidio AnalyzerEngine (merges & deduplicates)
  │
  ├─ Post-processing
  │   ├─ Remove excluded terms & false positives
  │   ├─ Correct entity boundaries
  │   └─ Add acronyms for organizations
  │
  ├─ Apply user rules (enable/disable types, per-occurrence overrides, aliases)
  │
  └─ Output: anonymized text + entity mappings + review items
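Two steps of the diagram above, chunk splitting and merging of detections, can be sketched with the standard library. This is a simplified illustration under stated assumptions: it uses naive fixed-size chunks instead of RecursiveCharacterTextSplitter, and a greedy "highest score wins" rule that is not necessarily how Presidio resolves overlaps. `Span` and the function names are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A detected entity: character offsets, type label, confidence score."""

    start: int
    end: int
    entity_type: str
    score: float


def split_with_offsets(text: str, chunk_size: int) -> list[tuple[int, str]]:
    """Split text into fixed-size chunks, keeping each chunk's start offset."""
    return [(i, text[i : i + chunk_size]) for i in range(0, len(text), chunk_size)]


def to_document_offsets(chunk_start: int, spans: list[Span]) -> list[Span]:
    """Shift chunk-relative spans so they index into the full document."""
    return [s._replace(start=s.start + chunk_start, end=s.end + chunk_start) for s in spans]


def resolve_overlaps(spans: list[Span]) -> list[Span]:
    """Greedy deduplication: higher score wins, ties go to the longer span."""
    kept: list[Span] = []
    ranked = sorted(spans, key=lambda s: (-s.score, -(s.end - s.start), s.start))
    for span in ranked:
        # Keep the span only if it does not overlap anything already kept.
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)
```

Fixed-size chunks can cut an entity in half at a boundary, which is one reason a separator-aware splitter is the better choice in practice.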

Supported Entity Types

| Group | Entity type | Description | Anonymized by default |
|---|---|---|---|
| Names | PERSON_NAME | people's names (including nicknames) or initials | on |
| | ORGANIZATION_NAME | companies, agencies, institutions (incl. their legal form) | on |
| | AUTHORITY_NAME | any public entity issuing or enforcing binding norms | off |
| | OBJECT_NAME | project, product, or software names | on |
| | NORM_NAME | references to public norms of a legal nature (laws, standards, regulations) | off |
| | USERNAME | social media / service account usernames | on |
| Location | POSTAL_ADDRESS | postal addresses containing at least a zip code and a city name | on |
| | LOCATION_CITY | city names (incl. any article if part of the name) | on |
| | LOCATION_STATE | state / province names | on |
| | LOCATION_COUNTRY | country names | on |
| | LOCATION_REGION | geographic regions (e.g., Middle East) | on |
| | LOCATION_COORDINATE | GPS coordinates | on |
| | LOCATION_NAME | other named locations (politically or geographically defined) | on |
| | ORIGIN | nationality / ethnicity / provenance | off |
| | LANGUAGE | named languages | off |
| Electronic | EMAIL_ADDRESS | | on |
| | IP_ADDRESS | | on |
| | URL | | on |
| | FILENAME | | on |
| Numbers | ID_NUMBER | alphanumeric IDs (bank account, fiscal, ...) | on |
| | QUANTITY | quantities / measurements (excl. unit) | on |
| | AMOUNT | currency amounts (excl. currency name or symbol) | on |
| | PHONE_NUMBER | | on |
| Date / Time | DATE | | off |
| | DURATION | number and unit that correspond to a time duration (incl. qualifiers, e.g., "working days") | off |
| Other | PERSONAL_ATTRIBUTE | sensitive information such as racial / ethnic origin, political affiliation, age, gender, etc. | on |
| | MISC | miscellaneous | on |

Note: Entity types that are not anonymized by default are still defined, both to avoid mislabelling them as another type (e.g., confusing a date with an ID number) and to let users customize what they consider sensitive.

Installation

Requires Python 3.12+. Uses uv for dependency management.

# Install the lightweight base package (pydantic, httpx, typing-extensions)
# — enough for sentinel.store, sentinel.types, sentinel.anonymization_settings
uv pip install .

# Install with the full NER/anonymization server stack (torch, spacy, presidio, fastapi, …)
uv pip install ".[server]"

# Install with dev tools
uv sync

The base install is intentionally lightweight so integrators who only need the store and HTTP client can avoid pulling in torch/spacy/presidio.

Model Setup

Sentinel requires a GLiNER-based spaCy NER model for entity detection. Models are downloaded automatically from HuggingFace on first startup — no manual setup needed.

To pre-download models (e.g., during a Docker build or CI step):

sentinel-download-models

For private HuggingFace repos, set the HF_TOKEN environment variable.

To use a custom model, set SENTINEL_MODEL_REPO=org/model-name.

The model directory defaults to models/ and can be overridden with the SPACY_DIR environment variable. When using Docker Compose, models are stored in a named volume (sentinel-models) so they persist across container restarts.

Note: Without the model, you can still run the API server in mock mode (SENTINEL_USE_MOCK=true) and run the test suite, which uses mock analyzers.

Usage

Python API

from sentinel.presidio_anonymizer import PresidioAnalyzer
from sentinel.anonymization_settings import AnonymizationSettings
from sentinel.anonymizer import anonymize_documents

analyzer = PresidioAnalyzer()
settings = AnonymizationSettings.default_settings(
    fullname="Jane Doe",
    email="jane@example.com",
    organization_name="Acme Corp",
)

result = anonymize_documents(
    analyzer,
    ["My name is John Smith and I work at Microsoft."],
    settings,
    msg_anonymization_rules=[],
    msg_alias_rules=[],
    existing_entities=[],
)

print(result.anon_texts[0])
# My name is [PERSON_NAME_1] and I work at [ORGANIZATION_NAME_1].

CLI

# Anonymize a string
python -m sentinel.cli -t "My email is john@example.com"

# Anonymize a file
python -m sentinel.cli -f document.txt

# Interactive mode (prompts for input)
python -m sentinel.cli

API Server

Start the FastAPI server:

SENTINEL_HOST=0.0.0.0 python -m sentinel.api

Or with Docker Compose (development):

docker compose up

Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| POST | /check-sensitive | Check if texts contain sensitive information |
| POST | /anonymize | Detect and anonymize entities in texts |
| POST | /deanonymize | Reverse anonymization in text (e.g., LLM output) |
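As a quick way to confirm the server is reachable before sending documents, the health endpoint can be polled with the standard library. This sketch assumes the server listens on http://localhost:8010 (the default port) and that a healthy server answers GET /health with HTTP 200; the response body format is not documented here, so it is ignored. `health_url` and `is_healthy` are hypothetical helper names.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health-check URL from the server's base URL."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed success signal)."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False
```

For example, `is_healthy("http://localhost:8010")` can gate a startup script so that requests are only sent once the model has finished loading.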

Store & Client

The sentinel.store submodule provides persistence for anonymization settings and session entities, so integrators don't have to build that boilerplate themselves. It ships with a SQLite backend and a high-level SentinelClient that combines the store with the Sentinel HTTP API.

from sentinel.store import SentinelClient, SqliteStore

store = SqliteStore("my_app.db")
client = SentinelClient(store, sentinel_url="http://localhost:8010")

# Anonymize — auto-creates settings & session on first call
result = client.anonymize("user-1", "session-1", ["My name is John Smith."])
print(result.anon_texts[0])
# My name is [PERSON_NAME_1].

# Deanonymize an LLM response
plain = client.deanonymize("session-1", "[PERSON_NAME_1] said hello.")
print(plain)
# John Smith said hello.

# Check if text is sensitive
is_sensitive = client.check_sensitive("user-1", ["Call me at 555-1234"])

# Update a user's anonymization rules
from sentinel.types import AnonRule
client.update_user_settings(
    "user-1",
    anon_rules=[AnonRule(cleartext="Acme Corp", entity_type="ORGANIZATION_NAME", anonymize=True)],
)

The SentinelStore protocol can be implemented against any backend (PostgreSQL, Redis, etc.) — SqliteStore is the included default.
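The pluggable-backend idea relies on structural typing: any class with the right methods satisfies the protocol. The sketch below shows that pattern with `typing.Protocol`. The real SentinelStore methods are not listed in this README, so `EntityStoreSketch`, `save_entities`, and `load_entities` are invented for illustration and should not be read as the actual interface.

```python
from typing import Protocol


class EntityStoreSketch(Protocol):
    """Hypothetical stand-in for a store protocol (not the real SentinelStore)."""

    def save_entities(self, session_id: str, entities: dict[str, str]) -> None: ...

    def load_entities(self, session_id: str) -> dict[str, str]: ...


class InMemoryStore:
    """Dictionary-backed backend; satisfies the protocol without inheriting from it."""

    def __init__(self) -> None:
        self._sessions: dict[str, dict[str, str]] = {}

    def save_entities(self, session_id: str, entities: dict[str, str]) -> None:
        self._sessions.setdefault(session_id, {}).update(entities)

    def load_entities(self, session_id: str) -> dict[str, str]:
        # Return a copy so callers cannot mutate the stored mapping.
        return dict(self._sessions.get(session_id, {}))


def restore(store: EntityStoreSketch, session_id: str, text: str) -> str:
    """Replace every known placeholder of a session with its original value."""
    for tag, value in store.load_entities(session_id).items():
        text = text.replace(tag, value)
    return text
```

Because `restore` depends only on the protocol, swapping the in-memory class for a PostgreSQL or Redis implementation would not require changes to calling code.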

Environment Variables

| Variable | Description | Default |
|---|---|---|
| SENTINEL_HOST | API server bind host | (required for API) |
| SENTINEL_PORT | API server bind port | 8010 |
| SENTINEL_USE_GPU | Enable GPU for NER model | True |
| SENTINEL_GPU_ID | GPU device ID | 0 |
| SENTINEL_USE_MOCK | Use mock analyzer (testing) | False |
| SENTINEL_RELOAD | Enable hot-reload | False |
| SENTINEL_MODEL_REPO | HuggingFace repo ID for the NER model | copilex/sentiner_ner_gliner_2024-12-06T16-41-46 |
| HF_TOKEN | HuggingFace token (for private repos) | (unset) |
| SENTINEL_CHUNK_SIZE | Text chunk size for analyzer | 1000 |
| SENTINEL_GLINER_CHUNK_SIZE | GLiNER model chunk size | 250 |
| SENTINEL_LOG_FILENAME | Log file path | logs_<timestamp>.txt |
| SENTINEL_LOG_NO_COLORS | Disable colored log output | False |
| LOG | Log level | INFO |

Development

# Lint (auto-fixes issues)
./lint.sh

# Run tests
pytest

# Pre-commit hooks (installed via pre-commit)
pre-commit install

Project Structure

sentinel/
├── models/                  # NER model weights (git-ignored, see Model Setup)
├── sentinel/                # Main package
│   ├── api.py               # FastAPI endpoints
│   ├── cli.py               # Command-line interface
│   ├── anonymizer.py        # Core anonymization logic & rule application
│   ├── presidio_anonymizer.py  # Presidio + GLiNER NER pipeline
│   ├── deanonymize.py       # Reverse anonymization
│   ├── types.py             # Entity types, Pydantic models
│   ├── anonymization_settings.py  # Settings & rule management
│   ├── recognizers/         # 14 rule-based Presidio recognizers
│   ├── spacy_utils/         # spaCy + GLiNER integration
│   ├── utils/               # Logger, language detection, text normalization
│   ├── store/               # Persistence layer (settings & session entities)
│   │   ├── protocol.py      # SentinelStore abstract protocol
│   │   ├── client.py        # SentinelClient (store + HTTP API wrapper)
│   │   ├── exceptions.py    # StoreError, SessionNotFoundError
│   │   └── sqlite/          # SQLite backend implementation
│   └── data/                # Reference data (countries, regions, authorities, exclusions)
├── tests/                   # pytest test suite
├── pyproject.toml
├── uv.lock
├── lint.sh
└── docker-compose.yml       # Development-only Docker setup
