Portiere

AI-Powered Clinical and Health Data Mapping Tool

Documentation · Quick Start · Examples · Issues

What is Portiere?

Mapping clinical data to standard models like OMOP CDM, FHIR R4, HL7 v2, and OpenEHR is one of the most time-consuming and error-prone tasks in health informatics. It typically requires domain experts to manually map hundreds of source fields and thousands of clinical codes — a process that can take weeks or months.

Portiere automates this with an AI-powered 5-stage pipeline that handles schema mapping, concept mapping, ETL generation, and data quality validation — all running locally on your machine with no cloud dependency required.

flowchart LR
    A["Source Data"] --> B["Ingest & Profile (Stage 1)"]
    B --> C["Schema Mapping (Stage 2)"]
    C --> D["Concept Mapping (Stage 3)"]
    D --> E["ETL Generation (Stage 4)"]
    E --> F["Validation (Stage 5)"]

Portiere combines clinical-domain embeddings (SapBERT as default model), lexical search (BM25s), cross-encoder reranking, and optional LLM verification to achieve high-accuracy mappings with confidence routing — automatically accepting high-confidence results while flagging uncertain ones for human review.

Key Features

Multi-Standard Support — OMOP CDM v5.4, FHIR R4, HL7 v2.5.1, OpenEHR 1.0.4 (extensible via YAML)
AI-Powered Mapping — SapBERT embeddings + cross-encoder reranking + optional LLM verification
9 Knowledge Backends — BM25s, FAISS, Elasticsearch, ChromaDB, PGVector, MongoDB, Qdrant, Milvus, Hybrid (RRF fusion)
BYO-LLM — Bring your own LLM: OpenAI, Anthropic Claude, AWS Bedrock, Ollama (local)
Pluggable Engines — Polars (default), PySpark / Databricks, Pandas, DuckDB
Standalone ETL Artifacts — Generated ETL scripts run without the SDK
Data Quality Validation — Great Expectations integration for post-ETL checks
Confidence Routing — Auto-accept, needs-review, and manual tiers with human-in-the-loop
Cross-Standard Mapping — Transform between standards (OMOP ↔ FHIR, HL7v2 → FHIR, OMOP → OpenEHR)
Local-First — All processing runs on your machine; no cloud dependency

Quick Start

Install

pip install portiere-health

# With a compute engine (pick one)
pip install "portiere-health[polars]"    # Lightweight (recommended)
pip install "portiere-health[spark]"     # Large-scale / Databricks
pip install "portiere-health[pandas]"    # Prototyping

Map Clinical Data to OMOP CDM

import portiere
from portiere.engines import PolarsEngine

# Initialize a project
project = portiere.init(
    name="Hospital OMOP Migration",
    engine=PolarsEngine(),
    target_model="omop_cdm_v5.4",
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
)

# Add and profile a data source
source = project.add_source("patients.csv")
profile = project.profile(source)

# AI-powered schema mapping (source columns → OMOP tables)
schema_map = project.map_schema(source)

# AI-powered concept mapping (clinical codes → standard concepts)
concept_map = project.map_concepts(codes=["E11.9", "I10", "R73.03"])

# Review mappings
schema_map.summary()
concept_map.summary()

# Generate and run ETL
result = project.run_etl(source, schema_map, concept_map)

Cross-Standard Mapping (OMOP → FHIR)

project = portiere.init(
    name="FHIR Export",
    engine=PolarsEngine(),
    task="cross_map",
    source_standard="omop_cdm_v5.4",
    target_model="fhir_r4",
)

Installation

Core Package

pip install portiere-health

Optional Extras

Install only what you need:

Category	Extra	Command
Engines	Polars	`pip install "portiere[polars]"`
	PySpark	`pip install "portiere[spark]"`
	Pandas	`pip install "portiere[pandas]"`
	DuckDB	`pip install "portiere[duckdb]"`
LLM Providers	OpenAI	`pip install "portiere[openai]"`
	Anthropic	`pip install "portiere[anthropic]"`
	AWS Bedrock	`pip install "portiere[bedrock]"`
	Ollama	`pip install "portiere[ollama]"`
Knowledge Backends	FAISS	`pip install "portiere[faiss]"`
	Elasticsearch	`pip install "portiere[elasticsearch]"`
	ChromaDB	`pip install "portiere[chromadb]"`
	PGVector	`pip install "portiere[pgvector]"`
	MongoDB	`pip install "portiere[mongodb]"`
	Qdrant	`pip install "portiere[qdrant]"`
	Milvus	`pip install "portiere[milvus]"`
Quality	Great Expectations	`pip install "portiere[quality]"`
Everything	All extras	`pip install "portiere[all]"`

Requirements: Python 3.10+

How It Works

Portiere implements a 5-stage AI pipeline for clinical data transformation:

Stage 1: Ingest & Profile

Connects to your data source (CSV, Parquet, databases) and extracts schema metadata — column names, types, cardinality, detected code columns, and PHI indicators.

Stage 2: Schema Mapping

Maps source columns to target standard entities using a fusion of:

Pattern matching — Regex patterns defined in YAML standard files
Embedding similarity — SapBERT clinical embeddings for semantic matching
Cross-encoder reranking — Precision reranking of top candidates

Stage 3: Concept Mapping

Maps clinical codes (ICD-10, CPT, local codes) to standard vocabularies (SNOMED CT, LOINC, RxNorm) through:

Direct code lookup — Exact match in knowledge base
Knowledge layer search — BM25s lexical / FAISS vector / Hybrid search
Cross-encoder reranking — Rerank top-k candidates for precision
LLM verification — Optional AI verification for medium-confidence mappings
Confidence routing — Auto-accept (>0.95), needs-review (0.70–0.95), manual (<0.70)

Stage 4: ETL Generation

Generates standalone ETL scripts (Spark, Polars, or Pandas) and lookup tables (CSV) that run without the Portiere SDK — no vendor lock-in.

Stage 5: Validation

Post-ETL data quality checks using Great Expectations, with standards-aware conformance for all supported models (OMOP, FHIR, HL7, OpenEHR, custom YAML):

Completeness — Non-null percentages for required fields
Conformance — Type and constraint compliance derived from YAML field metadata
Plausibility — Domain-specific clinical rules

Supported Standards

Standard	Version	Use Case
OMOP CDM	v5.4	Observational research, population health
FHIR R4	R4	Interoperability, health information exchange
HL7 v2	2.5.1	Legacy hospital system integration
OpenEHR	1.0.4	European clinical data, archetype-based EHRs

Standards are defined as YAML files and are fully extensible — you can define custom hospital CDMs or registry schemas.

Cross-Standard Mapping

Built-in crossmaps for transforming between standards:

Source	Target	File
FHIR R4	OMOP CDM	`fhir_r4_to_omop.yaml`
OMOP CDM	FHIR R4	`omop_to_fhir_r4.yaml`
HL7 v2	FHIR R4	`hl7v2_to_fhir_r4.yaml`
OMOP CDM	OpenEHR	`omop_to_openehr.yaml`
FHIR R4	OpenEHR	`fhir_r4_to_openehr.yaml`

Custom Standards

Portiere is not limited to built-in standards. You can define any clinical data model — a hospital CDM, a disease registry schema, a research database, a legacy warehouse — as a YAML file and use it identically to built-in standards.

Define a Custom Standard (YAML)

Create a .yaml file with the following structure:

name: "hospital_cdm_v1"
version: "1.0"
standard_type: "relational"
organization: "General Hospital Research"
description: "Internal clinical data model for General Hospital"

entities:
  patients:
    description: "Core patient demographics"
    fields:
      patient_id:
        type: integer
        required: true
        description: "Unique patient identifier"
        ddl: "INTEGER PRIMARY KEY"
      date_of_birth:
        type: date
        description: "Patient date of birth"
        ddl: "DATE NOT NULL"
      sex:
        type: string
        description: "Biological sex (M/F/U)"
        ddl: "VARCHAR(1)"

    # Fast pattern matching: source column name → target field
    source_patterns:
      patient_id: "patient_id"
      subject_id: "patient_id"
      dob: "date_of_birth"
      birth_date: "date_of_birth"
      gender: "sex"
      sex: "sex"

    # Embedding descriptions: optimized text for AI semantic matching
    # Write what a clinician would search for, not just the field name
    embedding_descriptions:
      patient_id: "unique patient identifier number"
      date_of_birth: "patient birth date birthday date of birth"
      sex: "biological sex gender male female M F"

  encounters:
    description: "Hospital visits and admissions"
    fields:
      encounter_id:
        type: integer
        required: true
        description: "Unique encounter identifier"
        ddl: "INTEGER PRIMARY KEY"
      admit_date:
        type: datetime
        description: "Admission date and time"
        ddl: "TIMESTAMP NOT NULL"
      encounter_type:
        type: string
        description: "Type of encounter (inpatient, outpatient, ED)"
        ddl: "VARCHAR(20)"

    source_patterns:
      encounter_id: "encounter_id"
      visit_id: "encounter_id"
      hadm_id: "encounter_id"
      admit_date: "admit_date"
      admittime: "admit_date"
      visit_type: "encounter_type"

    embedding_descriptions:
      encounter_id: "hospital encounter visit admission identifier"
      admit_date: "admission date time when patient was admitted"
      encounter_type: "visit type inpatient outpatient emergency department"

Use Your Custom Standard

import portiere
from portiere.engines import PolarsEngine

# Reference via "custom:" prefix — works anywhere target_model is accepted
project = portiere.init(
    name="Hospital Migration",
    engine=PolarsEngine(),
    target_model="custom:/path/to/hospital_cdm_v1.yaml",
)

source = project.add_source("patients.csv")
schema_map = project.map_schema(source)
concept_map = project.map_concepts(codes=["E11.9", "I10"])
result = project.run_etl(source, schema_map, concept_map)

Or load directly for inspection:

from portiere.standards import YAMLTargetModel

model = YAMLTargetModel("/path/to/hospital_cdm_v1.yaml")
print(model.get_schema())      # entity → [fields]
print(model.get_source_patterns())  # source column hints

You can also ship your custom standard as a built-in by placing the YAML in src/portiere/standards/ — it will then be loadable by name:

model = YAMLTargetModel.from_name("hospital_cdm_v1")

Column Naming Guide

Portiere's schema mapper uses two strategies in sequence: exact pattern matching (fast, zero-cost) then embedding similarity (AI-powered). Understanding both helps you get higher auto-accept rates.

Strategy 1 — Source Patterns (rule-based, highest priority)

Each entity in a standard YAML defines source_patterns — a dictionary mapping source column names to target fields. Matches here are always accepted, regardless of confidence score.

Built-in OMOP patterns include common aliases:

Your column name	Maps to
`patient_id`, `subject_id`, `mrn`	`person.person_id`
`dob`, `birth_date`, `date_of_birth`	`person.birth_datetime`
`gender`, `sex`	`person.gender_concept_id`
`icd_code`, `diagnosis_code`, `dx_code`	`condition_occurrence.condition_source_value`
`admit_date`, `admittime`	`visit_occurrence.visit_start_date`
`drug_code`, `ndc`, `medication_code`	`drug_exposure.drug_source_value`

To maximize pattern hits in your own standard, add all known aliases to source_patterns in your YAML:

source_patterns:
  patient_id: "person_id"     # exact name
  pid: "person_id"            # short alias
  subject_id: "person_id"     # research alias
  pt_id: "person_id"          # abbreviated
  medical_record_number: "person_id"  # verbose

Strategy 2 — Embedding Similarity (semantic, AI-powered)

When no pattern matches, the mapper encodes both the source column name and the embedding_descriptions into vectors using SapBERT, then finds the closest target field by cosine similarity.

What to write in embedding_descriptions:

Write natural-language phrases a clinician would use to describe what that column contains — not just a rephrasing of the field name.

# ❌ Too literal — just re-states the name
embedding_descriptions:
  admit_date: "admission date"
  dx_code: "diagnosis code"

# ✅ Rich synonyms and clinical context — maximizes semantic recall
embedding_descriptions:
  admit_date: "hospital admission date time when patient was admitted inpatient start"
  dx_code: "ICD diagnosis code ICD-10-CM ICD-9 disease condition clinical code"

Naming your source columns well also helps. The source column name itself is encoded alongside the description. Prefer descriptive names over cryptic abbreviations:

Less matchable	More matchable
`col_32`	`diagnosis_code`
`dt1`	`admission_date`
`flg_act`	`is_active`
`cd_race`	`race_code`
`proc_nm`	`procedure_name`

Confidence Tiers

After matching, every column receives a confidence score:

Score	Tier	Action
≥ 0.95	Auto-accepted	Written to output immediately
0.70 – 0.95	Needs review	Flagged for human inspection
< 0.70	Manual	Requires explicit override

Tune these thresholds to match your project's risk tolerance:

from portiere import PortiereConfig, ThresholdsConfig
from portiere.config import SchemaMappingThresholds

config = PortiereConfig(
    thresholds=ThresholdsConfig(
        schema_mapping=SchemaMappingThresholds(
            auto_accept=0.90,   # lower → more auto-accepts
            needs_review=0.60,  # lower → fewer manual items
        )
    )
)

Full Workflow with Review

schema_map = project.map_schema(source)

# Inspect what needs review
for item in schema_map.needs_review():
    print(f"{item.source_column} → {item.target_table}.{item.target_column} "
          f"(confidence={item.confidence:.2f})")
    for c in item.candidates[:3]:
        print(f"  candidate: {c['target_table']}.{c['target_column']} ({c['confidence']:.2f})")

# Approve, override, or reject
schema_map.approve("patient_name")
schema_map.override("pt_zip", target_table="location", target_column="zip")
schema_map.reject("internal_audit_flag")

# Approve all remaining items
schema_map.approve_all()
schema_map.finalize()

Knowledge Layer Backends

Backend	Type	Dependencies	Best For
BM25s	Lexical	None (built-in)	Quick start, no infra needed
FAISS	Vector	`faiss-cpu`, `sentence-transformers`	High-accuracy local search
Elasticsearch	Hybrid	`elasticsearch`	Production deployments
ChromaDB	Vector	`chromadb`	Lightweight vector store
PGVector	Vector	`psycopg`, `pgvector`	PostgreSQL environments
MongoDB	Vector	`pymongo`	Atlas Vector Search users
Qdrant	Vector	`qdrant-client`	Dedicated vector DB
Milvus	Vector	`pymilvus`	Large-scale vector search
Hybrid	Fusion	Varies	Combine backends with RRF

Hybrid Search Example

from portiere import PortiereConfig, KnowledgeLayerConfig

config = PortiereConfig(
    knowledge_layer=KnowledgeLayerConfig(
        backend="hybrid",
        hybrid_backends=["bm25s", "faiss"],
        hybrid_fusion="rrf",  # Reciprocal Rank Fusion
    )
)

LLM Providers

Portiere supports Bring-Your-Own-LLM for concept verification:

Provider	Extra	Model Examples
OpenAI	`portiere[openai]`	GPT-4o, GPT-4o-mini
Anthropic	`portiere[anthropic]`	Claude Sonnet, Claude Haiku
AWS Bedrock	`portiere[bedrock]`	Claude, Titan, Llama
Ollama	`portiere[ollama]`	Llama 3, Mistral, Gemma (local)

from portiere import PortiereConfig, LLMConfig

config = PortiereConfig(
    llm=LLMConfig(
        provider="openai",
        model="gpt-4o-mini",
        api_key="sk-...",
    )
)

Configuration

Portiere auto-discovers configuration from multiple sources (in priority order):

1. Python Objects

from portiere import PortiereConfig, EmbeddingConfig, KnowledgeLayerConfig

config = PortiereConfig(
    target_model="omop_cdm_v5.4",
    embedding=EmbeddingConfig(
        provider="huggingface",
        model="cambridgeltl/SapBERT-from-PubMedBERT-fulltext",
    ),
    knowledge_layer=KnowledgeLayerConfig(backend="bm25s"),
)

2. YAML File (`portiere.yaml`)

target_model: omop_cdm_v5.4
storage: local

embedding:
  provider: huggingface
  model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext

knowledge_layer:
  backend: bm25s

llm:
  provider: openai
  model: gpt-4o-mini

thresholds:
  auto_accept: 0.95
  needs_review: 0.70

3. Environment Variables

export PORTIERE_TARGET_MODEL=omop_cdm_v5.4
export PORTIERE_LLM__PROVIDER=openai
export PORTIERE_LLM__API_KEY=sk-...
export PORTIERE_KNOWLEDGE_LAYER__BACKEND=faiss

Building the Knowledge Layer

Before concept mapping, build a searchable index from standard vocabularies (e.g., OHDSI Athena):

from portiere import build_knowledge_layer, PortiereConfig

config = PortiereConfig()

stats = build_knowledge_layer(
    vocabulary_dir="./data/athena/",
    config=config,
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
)

print(f"Indexed {stats['total_concepts']:,} concepts")

Documentation

Resource	Description
Quick Start Guide	Get started in 5 minutes
API Reference	Full SDK API documentation
Configuration Guide	YAML, Python, and env var config
Knowledge Layer Guide	All 9 backends explained
LLM Integration	BYO-LLM setup
Pipeline Architecture	5-stage pipeline deep dive
Multi-Standard Support	Standards and custom schemas
Cross-Standard Mapping	OMOP ↔ FHIR, HL7v2 → FHIR
Example Notebooks	19 Jupyter notebooks with walkthroughs

Project Structure

portiere/
├── src/portiere/
│   ├── __init__.py          # Public API: init(), PortiereProject, configs
│   ├── config.py            # Configuration with auto-discovery
│   ├── project.py           # Unified project interface
│   ├── exceptions.py        # Error hierarchy
│   ├── stages/              # 5-stage pipeline implementation
│   ├── engines/             # Compute engines (Polars, Spark, Pandas, DuckDB)
│   ├── knowledge/           # Knowledge layer backends (9 backends)
│   ├── embedding/           # Embedding providers & gateway
│   ├── llm/                 # LLM providers & gateway
│   ├── local/               # Local AI components (schema mapper, concept mapper)
│   ├── artifacts/           # ETL code generation (Jinja2 templates)
│   ├── runner/              # ETL execution engine
│   ├── quality/             # Data quality validation (Great Expectations)
│   ├── standards/           # Clinical standard YAML definitions & crossmaps
│   ├── storage/             # Storage backends (local filesystem)
│   └── models/              # Pydantic data models
├── tests/                   # 36 test modules, 689 tests
├── docs/
│   ├── documentations/      # 22 guides and references
│   └── notebooks_examples/  # 19 Jupyter notebook examples
├── pyproject.toml           # Package configuration (hatchling)
└── LICENSE                  # Apache 2.0

Star History

Contributing

We welcome contributions! Here's how to get started:

# Clone the repository
git clone https://github.com/Cuspal/portiere.git
cd portiere

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install in development mode
pip install -e ".[dev,docs,polars,quality]"

# Run tests
pytest

# Run linter
ruff check src/ tests/

# Run type checker
mypy src/portiere/

Please read our contributing guidelines before submitting a pull request.

License

Portiere is licensed under the Apache License 2.0.

Copyright 2026 Cuspal Co. Ltd.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Citation

If you use Portiere in your research, please cite:

@software{portiere2024,
  title     = {Portiere: AI-Powered Clinical Data Mapping SDK},
  author    = {Cuspal Co.,Ltd.},
  year      = {2026},
  url       = {https://github.com/Cuspal/portiere},
  license   = {Apache-2.0},
}

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.github/workflows		.github/workflows
docs		docs
src/portiere		src/portiere
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
coverage.json		coverage.json
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Portiere

What is Portiere?

Key Features

Quick Start

Install

Map Clinical Data to OMOP CDM

Cross-Standard Mapping (OMOP → FHIR)

Installation

Core Package

Optional Extras

How It Works

Stage 1: Ingest & Profile

Stage 2: Schema Mapping

Stage 3: Concept Mapping

Stage 4: ETL Generation

Stage 5: Validation

Supported Standards

Cross-Standard Mapping

Custom Standards

Define a Custom Standard (YAML)

Use Your Custom Standard

Column Naming Guide

Strategy 1 — Source Patterns (rule-based, highest priority)

Strategy 2 — Embedding Similarity (semantic, AI-powered)

Confidence Tiers

Full Workflow with Review

Knowledge Layer Backends

Hybrid Search Example

LLM Providers

Configuration

1. Python Objects

2. YAML File (portiere.yaml)

3. Environment Variables

Building the Knowledge Layer

Documentation

Project Structure

Star History

Contributing

License

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors 1

Languages

2. YAML File (`portiere.yaml`)

Packages