Sentinel

Sentinel is a pseudonymization engine tailored for legal work. It is designed to keep sensitive information out of the text you send to AI assistants.

Sentinel detects personally identifiable information (PII) and sensitive entities in text, replaces them with standardized tags (e.g., [PERSON_NAME_1], [EMAIL_ADDRESS_1]), and can reverse the process (deanonymization) to restore original values.
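To make the tag-and-restore round trip concrete, here is a minimal, self-contained sketch. It is not Sentinel's implementation: the regex-based e-mail detector and the `pseudonymize` / `deanonymize` helper names are assumptions chosen for illustration, and only the tag format (`[EMAIL_ADDRESS_1]`) comes from the description above.

```python
import re

# Assumption: a toy detector that only finds e-mail addresses with a regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches tags of the form [ENTITY_TYPE_<n>].
TAG_RE = re.compile(r"\[[A-Z_]+_\d+\]")


def pseudonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct e-mail with a numbered tag.

    Returns the rewritten text and a tag -> original value mapping.
    """
    tag_by_value: dict[str, str] = {}

    def to_tag(match: re.Match) -> str:
        value = match.group(0)
        # The same value always gets the same tag.
        if value not in tag_by_value:
            tag_by_value[value] = f"[EMAIL_ADDRESS_{len(tag_by_value) + 1}]"
        return tag_by_value[value]

    anonymized = EMAIL_RE.sub(to_tag, text)
    return anonymized, {tag: value for value, tag in tag_by_value.items()}


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore original values; tags missing from the mapping are left as-is."""
    return TAG_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


if __name__ == "__main__":
    anon, mapping = pseudonymize("Write to john@example.com today.")
    print(anon)                        # Write to [EMAIL_ADDRESS_1] today.
    print(deanonymize(anon, mapping))  # Write to john@example.com today.
```

Keeping the mapping outside the anonymized text is what allows an LLM response that mentions `[EMAIL_ADDRESS_1]` to be restored afterwards.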

Features

Hybrid NER pipeline — combines a fine-tuned GLiNER-based spaCy model with 14 rule-based Presidio recognizers for high recall
24 entity types across 6 groups (names, locations, electronic addresses, numbers, dates, other), specifically designed for legal work.
Multi-language support — English and French
Long-document support — splits text into configurable chunks before analysis
Entity aliases — link variations of the same entity (e.g., "John Doe" / "John" / "J. Doe") to a single ID
Differential anonymization — per-occurrence rules to anonymize or keep specific entities
Deanonymization — reverse placeholders back to original values (useful for LLM output post-processing)
GPU acceleration — optional GPU support for the NER model, with automatic CPU fallback
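The entity-alias feature can be pictured with a small sketch. This is an illustration under assumptions, not Sentinel's code: `AliasRegistry`, `register`, and `replace` are hypothetical names, and matching is done with a plain regular expression rather than an NER model.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AliasRegistry:
    """Hypothetical helper: maps every surface form of an entity to one tag."""

    entity_type: str
    _tag_by_form: dict[str, str] = field(default_factory=dict)
    _next_id: int = 1

    def register(self, canonical: str, *aliases: str) -> str:
        """Assign a single tag to the canonical form and all of its aliases."""
        tag = self._tag_by_form.get(canonical)
        if tag is None:
            tag = f"[{self.entity_type}_{self._next_id}]"
            self._next_id += 1
        for form in (canonical, *aliases):
            self._tag_by_form[form] = tag
        return tag

    def replace(self, text: str) -> str:
        """Replace known forms in one pass, trying longer forms first."""
        if not self._tag_by_form:
            return text
        # Longest first, so "John Doe" is tried before "John".
        forms = sorted(self._tag_by_form, key=len, reverse=True)
        alternation = "|".join(re.escape(form) for form in forms)
        # Lookarounds keep "John" from matching inside "Johnson".
        pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
        return pattern.sub(lambda m: self._tag_by_form[m.group(0)], text)


if __name__ == "__main__":
    people = AliasRegistry("PERSON_NAME")
    people.register("John Doe", "John", "J. Doe")
    print(people.replace("John Doe signed. Later John called."))
    # [PERSON_NAME_1] signed. Later [PERSON_NAME_1] called.
```

Because all three forms share one tag, deanonymization has to pick a single form to restore; which one Sentinel chooses is not stated here.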

Architecture

Text input
  │
  ├─ Chunk splitting (RecursiveCharacterTextSplitter)
  │
  ├─ NER detection (parallel per chunk)
  │   ├─ GLiNER spaCy model ─── names, orgs, locations, ...
  │   └─ Rule-based recognizers ─── emails, dates, IPs, URLs, ...
  │        └─ Presidio AnalyzerEngine (merges & deduplicates)
  │
  ├─ Post-processing
  │   ├─ Remove excluded terms & false positives
  │   ├─ Correct entity boundaries
  │   └─ Add acronyms for organizations
  │
  ├─ Apply user rules (enable/disable types, per-occurrence overrides, aliases)
  │
  └─ Output: anonymized text + entity mappings + review items
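The first two stages of the diagram (chunk splitting, then per-chunk detection in parallel) can be sketched with the standard library alone. This is a simplified stand-in, not the actual pipeline: a whitespace-aware splitter replaces `RecursiveCharacterTextSplitter`, a regex e-mail detector replaces the GLiNER and Presidio recognizers, and the names `Span`, `split_chunks`, and `detect` are assumptions.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    entity_type: str


Detector = Callable[[str], list[Span]]


def split_chunks(text: str, chunk_size: int) -> list[tuple[int, str]]:
    """Return (offset, chunk) pairs, cutting at whitespace when possible."""
    chunks: list[tuple[int, str]] = []
    pos = 0
    while pos < len(text):
        end = min(pos + chunk_size, len(text))
        if end < len(text):
            # Prefer the last space inside the window so words stay whole.
            cut = text.rfind(" ", pos, end)
            if cut > pos:
                end = cut + 1
        chunks.append((pos, text[pos:end]))
        pos = end
    return chunks


def detect(text: str, detector: Detector, chunk_size: int = 1000) -> list[Span]:
    """Run the detector on every chunk in parallel and merge the results."""
    chunks = split_chunks(text, chunk_size)
    with ThreadPoolExecutor() as pool:
        per_chunk = list(pool.map(lambda item: detector(item[1]), chunks))
    # Shift chunk-relative positions back to positions in the full text.
    spans = [
        Span(offset + span.start, offset + span.end, span.entity_type)
        for (offset, _), found in zip(chunks, per_chunk)
        for span in found
    ]
    return sorted(spans)


# Assumption: a toy rule-based recognizer used only for this example.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def email_detector(chunk: str) -> list[Span]:
    return [Span(m.start(), m.end(), "EMAIL_ADDRESS") for m in EMAIL_RE.finditer(chunk)]
```

The offset bookkeeping is the important part: each detector only sees its own chunk, so its positions must be shifted before entities can be replaced in the original text. A token longer than the chunk size is still cut in the middle in this sketch.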

Supported Entity Types

Group	Entity Type	Anonymized by default
Names	`PERSON_NAME` — people's names (including nicknames) or initials	on
	`ORGANIZATION_NAME` — companies, agencies, institutions (incl. their legal form)	on
	`AUTHORITY_NAME` — any public entity issuing or enforcing binding norms	off
	`OBJECT_NAME` — project, product, software name	on
	`NORM_NAME` — reference to a public norm of a legal nature (laws, standards, regulations)	off
	`USERNAME` — social media / service account usernames	on
Location	`POSTAL_ADDRESS` — postal address containing at least a zip code and city name	on
	`LOCATION_CITY` — city names (incl. any article if part of the name)	on
	`LOCATION_STATE` — state / province names	on
	`LOCATION_COUNTRY` — country names	on
	`LOCATION_REGION` — geographic regions (e.g., Middle East)	on
	`LOCATION_COORDINATE` — GPS coordinates	on
	`LOCATION_NAME` — other named locations (politically or geographically defined)	on
	`ORIGIN` — nationality / ethnicity / provenance	off
	`LANGUAGE` — named languages	off
Electronic	`EMAIL_ADDRESS`	on
	`IP_ADDRESS`	on
	`URL`	on
	`FILENAME`	on
Numbers	`ID_NUMBER` — alphanumeric IDs (bank account, fiscal, ...)	on
	`QUANTITY` — quantity / measurements (excl. unit)	on
	`AMOUNT` — currency amounts (excl. currency name or symbol)	on
	`PHONE_NUMBER`	on
Date / Time	`DATE`	off
	`DURATION` — number and unit corresponding to a time duration (incl. qualifiers, e.g., "working days")	off
Other	`PERSONAL_ATTRIBUTE` — sensitive information such as racial / ethnic origin, political affiliation, age, gender, etc.	on
	`MISC` — miscellaneous	on

Note: Entity types that are not anonymized by default exist for two reasons: to avoid mislabelling them as other types (e.g., confusing a date with an ID number), and to let users customize what they consider sensitive.

Installation

Requires Python 3.12+. Uses uv for dependency management.

# Install the lightweight base package (pydantic, httpx, typing-extensions)
# — enough for sentinel.store, sentinel.types, sentinel.anonymization_settings
uv pip install .

# Install with the full NER/anonymization server stack (torch, spacy, presidio, fastapi, …)
uv pip install ".[server]"

# Install with dev tools
uv sync

The base install is intentionally lightweight so integrators who only need the store and HTTP client can avoid pulling in torch/spacy/presidio.

Model Setup

Sentinel requires a GLiNER-based spaCy NER model for entity detection. Models are downloaded automatically from HuggingFace on first startup — no manual setup needed.

To pre-download models (e.g., during a Docker build or CI step):

sentinel-download-models

For private HuggingFace repos, set the HF_TOKEN environment variable.

To use a custom model, set SENTINEL_MODEL_REPO=org/model-name.

The model directory defaults to models/ and can be overridden with the SPACY_DIR environment variable. When using Docker Compose, models are stored in a named volume (sentinel-models) so they persist across container restarts.

Note: Without the model, you can still run the API server in mock mode (SENTINEL_USE_MOCK=true) and run the test suite, which uses mock analyzers.

Usage

Python API

from sentinel.presidio_anonymizer import PresidioAnalyzer
from sentinel.anonymization_settings import AnonymizationSettings
from sentinel.anonymizer import anonymize_documents

analyzer = PresidioAnalyzer()
settings = AnonymizationSettings.default_settings(
    fullname="Jane Doe",
    email="jane@example.com",
    organization_name="Acme Corp",
)

result = anonymize_documents(
    analyzer,
    ["My name is John Smith and I work at Microsoft."],
    settings,
    msg_anonymization_rules=[],
    msg_alias_rules=[],
    existing_entities=[],
)

print(result.anon_texts[0])
# My name is [PERSON_NAME_1] and I work at [ORGANIZATION_NAME_1].

CLI

# Anonymize a string
python -m sentinel.cli -t "My email is john@example.com"

# Anonymize a file
python -m sentinel.cli -f document.txt

# Interactive mode (prompts for input)
python -m sentinel.cli

API Server

Start the FastAPI server:

SENTINEL_HOST=0.0.0.0 python -m sentinel.api

Or with Docker Compose (development):

docker compose up

Endpoints

Method	Path	Description
`GET`	`/health`	Health check
`POST`	`/check-sensitive`	Check if texts contain sensitive information
`POST`	`/anonymize`	Detect and anonymize entities in texts
`POST`	`/deanonymize`	Reverse anonymization in text (e.g., LLM output)
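A quick way to confirm the server is reachable is to probe `/health`. The sketch below uses only the standard library and treats any HTTP 200 response as healthy; the response body format is not documented here, so it is ignored. The helper names `health_url` and `is_healthy` are assumptions, and request payloads for the `POST` endpoints are not shown because their schemas are not described in this README.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health-check URL from the server's base URL."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


if __name__ == "__main__":
    # 8010 is the default SENTINEL_PORT; adjust to your deployment.
    print(is_healthy("http://localhost:8010"))
```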

Store & Client

The sentinel.store submodule provides persistence for anonymization settings and session entities, so integrators don't have to build that boilerplate themselves. It ships with a SQLite backend and a high-level SentinelClient that combines the store with the Sentinel HTTP API.

from sentinel.store import SentinelClient, SqliteStore

store = SqliteStore("my_app.db")
client = SentinelClient(store, sentinel_url="http://localhost:8010")

# Anonymize — auto-creates settings & session on first call
result = client.anonymize("user-1", "session-1", ["My name is John Smith."])
print(result.anon_texts[0])
# My name is [PERSON_NAME_1].

# Deanonymize an LLM response
plain = client.deanonymize("session-1", "[PERSON_NAME_1] said hello.")
print(plain)
# John Smith said hello.

# Check if text is sensitive
is_sensitive = client.check_sensitive("user-1", ["Call me at 555-1234"])

# Update a user's anonymization rules
from sentinel.types import AnonRule
client.update_user_settings(
    "user-1",
    anon_rules=[AnonRule(cleartext="Acme Corp", entity_type="ORGANIZATION_NAME", anonymize=True)],
)

The SentinelStore protocol can be implemented against any backend (PostgreSQL, Redis, etc.) — SqliteStore is the included default.
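The idea of implementing a protocol against another backend can be sketched with `typing.Protocol`. The real `SentinelStore` method names are not listed in this README, so `SessionEntityStore`, `load_entities`, and `save_entities` below are hypothetical stand-ins; check `sentinel/store/protocol.py` for the actual interface before writing a backend.

```python
from typing import Protocol


class SessionEntityStore(Protocol):
    """Hypothetical stand-in for a store protocol (not the real SentinelStore)."""

    def load_entities(self, session_id: str) -> dict[str, str]: ...

    def save_entities(self, session_id: str, entities: dict[str, str]) -> None: ...


class InMemoryStore:
    """Satisfies the protocol structurally; no inheritance is required."""

    def __init__(self) -> None:
        self._sessions: dict[str, dict[str, str]] = {}

    def load_entities(self, session_id: str) -> dict[str, str]:
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._sessions.get(session_id, {}))

    def save_entities(self, session_id: str, entities: dict[str, str]) -> None:
        self._sessions.setdefault(session_id, {}).update(entities)


def restore(store: SessionEntityStore, session_id: str, text: str) -> str:
    """Replace every known tag of the session with its original value."""
    for tag, value in store.load_entities(session_id).items():
        text = text.replace(tag, value)
    return text


if __name__ == "__main__":
    store = InMemoryStore()
    store.save_entities("session-1", {"[PERSON_NAME_1]": "John Smith"})
    print(restore(store, "session-1", "[PERSON_NAME_1] said hello."))
    # John Smith said hello.
```

Code written against the protocol, such as `restore`, does not need to change when the in-memory class is swapped for a PostgreSQL or Redis implementation with the same methods.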

Environment Variables

Variable	Description	Default
`SENTINEL_HOST`	API server bind host	(required for API)
`SENTINEL_PORT`	API server bind port	`8010`
`SENTINEL_USE_GPU`	Enable GPU for NER model	`True`
`SENTINEL_GPU_ID`	GPU device ID	`0`
`SENTINEL_USE_MOCK`	Use mock analyzer (testing)	`False`
`SENTINEL_RELOAD`	Enable hot-reload	`False`
`SENTINEL_MODEL_REPO`	HuggingFace repo ID for the NER model	`copilex/sentiner_ner_gliner_2024-12-06T16-41-46`
`HF_TOKEN`	HuggingFace token (for private repos)	(unset)
`SENTINEL_CHUNK_SIZE`	Text chunk size for analyzer	`1000`
`SENTINEL_GLINER_CHUNK_SIZE`	GLiNER model chunk size	`250`
`SENTINEL_LOG_FILENAME`	Log file path	`logs_<timestamp>.txt`
`SENTINEL_LOG_NO_COLORS`	Disable colored log output	`False`
`LOG`	Log level	`INFO`
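For integrators who want to mirror these settings in their own tooling, the sketch below reads a subset of the variables with the defaults from the table. It is an illustration, not Sentinel's configuration code: `ServerConfig`, `load_config`, and the accepted boolean spellings are assumptions.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean variable; accepted spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ServerConfig:
    host: str
    port: int
    use_gpu: bool
    use_mock: bool
    chunk_size: int
    gliner_chunk_size: int


def load_config(env: Mapping[str, str] = os.environ) -> ServerConfig:
    """Build a config from environment variables, using the table's defaults."""
    return ServerConfig(
        host=env["SENTINEL_HOST"],  # no default: required for the API server
        port=int(env.get("SENTINEL_PORT", "8010")),
        use_gpu=_env_bool(env, "SENTINEL_USE_GPU", True),
        use_mock=_env_bool(env, "SENTINEL_USE_MOCK", False),
        chunk_size=int(env.get("SENTINEL_CHUNK_SIZE", "1000")),
        gliner_chunk_size=int(env.get("SENTINEL_GLINER_CHUNK_SIZE", "250")),
    )


if __name__ == "__main__":
    print(load_config({"SENTINEL_HOST": "0.0.0.0", "SENTINEL_USE_MOCK": "true"}))
```

Passing the environment as a mapping keeps the function easy to test without touching the real process environment.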

Development

# Lint (auto-fixes issues)
./lint.sh

# Run tests
pytest

# Pre-commit hooks (installed via pre-commit)
pre-commit install

Project Structure

sentinel/
├── models/                  # NER model weights (git-ignored, see Model Setup)
├── sentinel/                # Main package
│   ├── api.py               # FastAPI endpoints
│   ├── cli.py               # Command-line interface
│   ├── anonymizer.py        # Core anonymization logic & rule application
│   ├── presidio_anonymizer.py  # Presidio + GLiNER NER pipeline
│   ├── deanonymize.py       # Reverse anonymization
│   ├── types.py             # Entity types, Pydantic models
│   ├── anonymization_settings.py  # Settings & rule management
│   ├── recognizers/         # 14 rule-based Presidio recognizers
│   ├── spacy_utils/         # spaCy + GLiNER integration
│   ├── utils/               # Logger, language detection, text normalization
│   ├── store/               # Persistence layer (settings & session entities)
│   │   ├── protocol.py      # SentinelStore abstract protocol
│   │   ├── client.py        # SentinelClient (store + HTTP API wrapper)
│   │   ├── exceptions.py    # StoreError, SessionNotFoundError
│   │   └── sqlite/          # SQLite backend implementation
│   └── data/                # Reference data (countries, regions, authorities, exclusions)
├── tests/                   # pytest test suite
├── pyproject.toml
├── uv.lock
├── lint.sh
└── docker-compose.yml       # Development-only Docker setup
