AI-Powered Clinical and Health Data Mapping Tool
Mapping clinical data to standard models like OMOP CDM, FHIR R4, HL7 v2, and OpenEHR is one of the most time-consuming and error-prone tasks in health informatics. It typically requires domain experts to manually map hundreds of source fields and thousands of clinical codes — a process that can take weeks or months.
Portiere automates this with an AI-powered 5-stage pipeline that handles schema mapping, concept mapping, ETL generation, and data quality validation — all running locally on your machine with no cloud dependency required.
flowchart LR
A["Source Data"] --> B["Ingest & Profile (Stage 1)"]
B --> C["Schema Mapping (Stage 2)"]
C --> D["Concept Mapping (Stage 3)"]
D --> E["ETL Generation (Stage 4)"]
E --> F["Validation (Stage 5)"]
Portiere combines clinical-domain embeddings (SapBERT as default model), lexical search (BM25s), cross-encoder reranking, and optional LLM verification to achieve high-accuracy mappings with confidence routing — automatically accepting high-confidence results while flagging uncertain ones for human review.
- Multi-Standard Support — OMOP CDM v5.4, FHIR R4, HL7 v2.5.1, OpenEHR 1.0.4 (extensible via YAML)
- AI-Powered Mapping — SapBERT embeddings + cross-encoder reranking + optional LLM verification
- 9 Knowledge Backends — BM25s, FAISS, Elasticsearch, ChromaDB, PGVector, MongoDB, Qdrant, Milvus, Hybrid (RRF fusion)
- BYO-LLM — Bring your own LLM: OpenAI, Anthropic Claude, AWS Bedrock, Ollama (local)
- Pluggable Engines — Polars (default), PySpark / Databricks, Pandas, DuckDB
- Standalone ETL Artifacts — Generated ETL scripts run without the SDK
- Data Quality Validation — Great Expectations integration for post-ETL checks
- Confidence Routing — Auto-accept, needs-review, and manual tiers with human-in-the-loop
- Cross-Standard Mapping — Transform between standards (OMOP ↔ FHIR, HL7v2 → FHIR, OMOP → OpenEHR)
- Local-First — All processing runs on your machine; no cloud dependency
pip install portiere-health
# With a compute engine (pick one)
pip install "portiere-health[polars]" # Lightweight (recommended)
pip install "portiere-health[spark]" # Large-scale / Databricks
pip install "portiere-health[pandas]" # Prototyping
import portiere
from portiere.engines import PolarsEngine
# Initialize a project
project = portiere.init(
    name="Hospital OMOP Migration",
    engine=PolarsEngine(),
    target_model="omop_cdm_v5.4",
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
)
# Add and profile a data source
source = project.add_source("patients.csv")
profile = project.profile(source)
# AI-powered schema mapping (source columns → OMOP tables)
schema_map = project.map_schema(source)
# AI-powered concept mapping (clinical codes → standard concepts)
concept_map = project.map_concepts(codes=["E11.9", "I10", "R73.03"])
# Review mappings
schema_map.summary()
concept_map.summary()
# Generate and run ETL
result = project.run_etl(source, schema_map, concept_map)
# Cross-standard mapping: transform OMOP CDM v5.4 data into FHIR R4
project = portiere.init(
    name="FHIR Export",
    engine=PolarsEngine(),
    task="cross_map",
    source_standard="omop_cdm_v5.4",
    target_model="fhir_r4",
)
pip install portiere-health
Install only what you need:
| Category | Extra | Command |
|---|---|---|
| Engines | Polars | pip install "portiere[polars]" |
| Engines | PySpark | pip install "portiere[spark]" |
| Engines | Pandas | pip install "portiere[pandas]" |
| Engines | DuckDB | pip install "portiere[duckdb]" |
| LLM Providers | OpenAI | pip install "portiere[openai]" |
| LLM Providers | Anthropic | pip install "portiere[anthropic]" |
| LLM Providers | AWS Bedrock | pip install "portiere[bedrock]" |
| LLM Providers | Ollama | pip install "portiere[ollama]" |
| Knowledge Backends | FAISS | pip install "portiere[faiss]" |
| Knowledge Backends | Elasticsearch | pip install "portiere[elasticsearch]" |
| Knowledge Backends | ChromaDB | pip install "portiere[chromadb]" |
| Knowledge Backends | PGVector | pip install "portiere[pgvector]" |
| Knowledge Backends | MongoDB | pip install "portiere[mongodb]" |
| Knowledge Backends | Qdrant | pip install "portiere[qdrant]" |
| Knowledge Backends | Milvus | pip install "portiere[milvus]" |
| Quality | Great Expectations | pip install "portiere[quality]" |
| Everything | All extras | pip install "portiere[all]" |
Requirements: Python 3.10+
Portiere implements a 5-stage AI pipeline for clinical data transformation:
Stage 1 (Ingest & Profile) connects to your data source (CSV, Parquet, databases) and extracts schema metadata: column names, types, cardinality, detected code columns, and PHI indicators.
Stage 2 (Schema Mapping) maps source columns to target standard entities using a fusion of:
- Pattern matching — Regex patterns defined in YAML standard files
- Embedding similarity — SapBERT clinical embeddings for semantic matching
- Cross-encoder reranking — Precision reranking of top candidates
Stage 3 (Concept Mapping) maps clinical codes (ICD-10, CPT, local codes) to standard vocabularies (SNOMED CT, LOINC, RxNorm) through:
- Direct code lookup — Exact match in knowledge base
- Knowledge layer search — BM25s lexical / FAISS vector / Hybrid search
- Cross-encoder reranking — Rerank top-k candidates for precision
- LLM verification — Optional AI verification for medium-confidence mappings
- Confidence routing — Auto-accept (>0.95), needs-review (0.70–0.95), manual (<0.70)
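The first two steps of this cascade can be sketched in plain Python. This is a toy illustration under stated assumptions, not Portiere's implementation: the `CONCEPTS` and `EXACT` tables, the placeholder concept IDs, the `token_overlap` scorer (a crude stand-in for BM25s or FAISS retrieval), and the `map_code` function are all hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    concept_id: str
    concept_name: str
    score: float


# Hypothetical miniature knowledge base (placeholder IDs, not real concept IDs).
CONCEPTS = {
    "C-001": "type 2 diabetes mellitus",
    "C-002": "essential hypertension",
    "C-003": "prediabetes",
}
# Hypothetical exact-match table: source code -> concept ID.
EXACT = {"E11.9": "C-001"}


def token_overlap(query: str, text: str) -> float:
    """Jaccard overlap of lowercase tokens; a stand-in for real retrieval."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def map_code(code: str, description: str, top_k: int = 3) -> list[Candidate]:
    # Step 1: a direct code lookup short-circuits the search entirely.
    if code in EXACT:
        cid = EXACT[code]
        return [Candidate(cid, CONCEPTS[cid], 1.0)]
    # Step 2: otherwise score every concept and keep the best candidates.
    scored = [
        Candidate(cid, name, token_overlap(description, name))
        for cid, name in CONCEPTS.items()
    ]
    scored.sort(key=lambda c: c.score, reverse=True)
    return scored[:top_k]


direct = map_code("E11.9", "type 2 diabetes mellitus without complications")
searched = map_code("I10", "essential primary hypertension")
```

In a real pipeline the candidates from the search step would then go through reranking, optional LLM verification, and confidence routing before anything is accepted.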
Stage 4 (ETL Generation) generates standalone ETL scripts (Spark, Polars, or Pandas) and lookup tables (CSV) that run without the Portiere SDK, so there is no vendor lock-in.
Stage 5 (Validation) runs post-ETL data quality checks using Great Expectations, with standards-aware conformance for all supported models (OMOP, FHIR, HL7, OpenEHR, custom YAML):
- Completeness — Non-null percentages for required fields
- Conformance — Type and constraint compliance derived from YAML field metadata
- Plausibility — Domain-specific clinical rules
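To make the completeness check concrete, here is a minimal sketch in plain Python. It does not use the Great Expectations API and is not how Portiere computes the metric; the `completeness` helper and the sample rows are hypothetical and only show what "non-null percentage for required fields" means.

```python
def completeness(rows: list[dict], required: list[str]) -> dict[str, float]:
    """Return the percentage of non-null values for each required field."""
    total = len(rows)
    result = {}
    for field in required:
        # A missing key and an explicit None both count as null.
        non_null = sum(1 for row in rows if row.get(field) is not None)
        result[field] = 100.0 * non_null / total if total else 0.0
    return result


# Hypothetical post-ETL rows for an OMOP-style person table.
rows = [
    {"person_id": 1, "birth_datetime": "1980-01-01"},
    {"person_id": 2, "birth_datetime": None},
    {"person_id": 3},
    {"person_id": 4, "birth_datetime": "1975-06-30"},
]
report = completeness(rows, ["person_id", "birth_datetime"])
failing = [field for field, pct in report.items() if pct < 100.0]
```

Here `failing` lists the required fields that contain at least one null, which is the kind of result a completeness rule would surface for review.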
| Standard | Version | Use Case |
|---|---|---|
| OMOP CDM | v5.4 | Observational research, population health |
| FHIR R4 | R4 | Interoperability, health information exchange |
| HL7 v2 | 2.5.1 | Legacy hospital system integration |
| OpenEHR | 1.0.4 | European clinical data, archetype-based EHRs |
Standards are defined as YAML files and are fully extensible — you can define custom hospital CDMs or registry schemas.
Built-in crossmaps for transforming between standards:
| Source | Target | File |
|---|---|---|
| FHIR R4 | OMOP CDM | fhir_r4_to_omop.yaml |
| OMOP CDM | FHIR R4 | omop_to_fhir_r4.yaml |
| HL7 v2 | FHIR R4 | hl7v2_to_fhir_r4.yaml |
| OMOP CDM | OpenEHR | omop_to_openehr.yaml |
| FHIR R4 | OpenEHR | fhir_r4_to_openehr.yaml |
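Note that crossmaps are directional: each file covers one source-to-target pair. The sketch below restates the table as a lookup to make that explicit. The `CROSSMAPS` dictionary and `find_crossmap` helper are hypothetical and use the display names from the table as keys; they are not part of the Portiere API.

```python
from typing import Optional

# The built-in crossmap table, keyed by (source, target).
CROSSMAPS = {
    ("FHIR R4", "OMOP CDM"): "fhir_r4_to_omop.yaml",
    ("OMOP CDM", "FHIR R4"): "omop_to_fhir_r4.yaml",
    ("HL7 v2", "FHIR R4"): "hl7v2_to_fhir_r4.yaml",
    ("OMOP CDM", "OpenEHR"): "omop_to_openehr.yaml",
    ("FHIR R4", "OpenEHR"): "fhir_r4_to_openehr.yaml",
}


def find_crossmap(source: str, target: str) -> Optional[str]:
    """Return the crossmap file for a direction, or None if none is listed."""
    return CROSSMAPS.get((source, target))


forward = find_crossmap("OMOP CDM", "OpenEHR")
reverse = find_crossmap("OpenEHR", "OMOP CDM")
```

`forward` resolves to a file while `reverse` is `None`, because the table lists OMOP CDM to OpenEHR but not the opposite direction.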
Portiere is not limited to built-in standards. You can define any clinical data model — a hospital CDM, a disease registry schema, a research database, a legacy warehouse — as a YAML file and use it identically to built-in standards.
Create a .yaml file with the following structure:
name: "hospital_cdm_v1"
version: "1.0"
standard_type: "relational"
organization: "General Hospital Research"
description: "Internal clinical data model for General Hospital"
entities:
  patients:
    description: "Core patient demographics"
    fields:
      patient_id:
        type: integer
        required: true
        description: "Unique patient identifier"
        ddl: "INTEGER PRIMARY KEY"
      date_of_birth:
        type: date
        description: "Patient date of birth"
        ddl: "DATE NOT NULL"
      sex:
        type: string
        description: "Biological sex (M/F/U)"
        ddl: "VARCHAR(1)"
    # Fast pattern matching: source column name → target field
    source_patterns:
      patient_id: "patient_id"
      subject_id: "patient_id"
      dob: "date_of_birth"
      birth_date: "date_of_birth"
      gender: "sex"
      sex: "sex"
    # Embedding descriptions: optimized text for AI semantic matching
    # Write what a clinician would search for, not just the field name
    embedding_descriptions:
      patient_id: "unique patient identifier number"
      date_of_birth: "patient birth date birthday date of birth"
      sex: "biological sex gender male female M F"
  encounters:
    description: "Hospital visits and admissions"
    fields:
      encounter_id:
        type: integer
        required: true
        description: "Unique encounter identifier"
        ddl: "INTEGER PRIMARY KEY"
      admit_date:
        type: datetime
        description: "Admission date and time"
        ddl: "TIMESTAMP NOT NULL"
      encounter_type:
        type: string
        description: "Type of encounter (inpatient, outpatient, ED)"
        ddl: "VARCHAR(20)"
    source_patterns:
      encounter_id: "encounter_id"
      visit_id: "encounter_id"
      hadm_id: "encounter_id"
      admit_date: "admit_date"
      admittime: "admit_date"
      visit_type: "encounter_type"
    embedding_descriptions:
      encounter_id: "hospital encounter visit admission identifier"
      admit_date: "admission date time when patient was admitted"
      encounter_type: "visit type inpatient outpatient emergency department"
import portiere
from portiere.engines import PolarsEngine
# Reference via "custom:" prefix — works anywhere target_model is accepted
project = portiere.init(
    name="Hospital Migration",
    engine=PolarsEngine(),
    target_model="custom:/path/to/hospital_cdm_v1.yaml",
)
source = project.add_source("patients.csv")
schema_map = project.map_schema(source)
concept_map = project.map_concepts(codes=["E11.9", "I10"])
result = project.run_etl(source, schema_map, concept_map)
Or load directly for inspection:
from portiere.standards import YAMLTargetModel
model = YAMLTargetModel("/path/to/hospital_cdm_v1.yaml")
print(model.get_schema()) # entity → [fields]
print(model.get_source_patterns()) # source column hints
You can also ship your custom standard as a built-in by placing the YAML in src/portiere/standards/; it will then be loadable by name:
model = YAMLTargetModel.from_name("hospital_cdm_v1")
Portiere's schema mapper uses two strategies in sequence: exact pattern matching (fast, zero-cost) then embedding similarity (AI-powered). Understanding both helps you get higher auto-accept rates.
Each entity in a standard YAML defines source_patterns — a dictionary mapping source column names to target fields. Matches here are always accepted, regardless of confidence score.
Built-in OMOP patterns include common aliases:
| Your column name | Maps to |
|---|---|
| patient_id, subject_id, mrn | person.person_id |
| dob, birth_date, date_of_birth | person.birth_datetime |
| gender, sex | person.gender_concept_id |
| icd_code, diagnosis_code, dx_code | condition_occurrence.condition_source_value |
| admit_date, admittime | visit_occurrence.visit_start_date |
| drug_code, ndc, medication_code | drug_exposure.drug_source_value |
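Conceptually, the pattern stage is a dictionary lookup on the source column name. The sketch below illustrates that idea with a few of the aliases from the table. The `normalize` step (lowercasing and unifying separators) is an assumption added for illustration; the source does not say whether Portiere normalizes names before matching, and `match_pattern` is a hypothetical helper.

```python
import re
from typing import Optional

# A few of the built-in OMOP aliases listed in the table above.
SOURCE_PATTERNS = {
    "patient_id": "person.person_id",
    "subject_id": "person.person_id",
    "mrn": "person.person_id",
    "dob": "person.birth_datetime",
    "gender": "person.gender_concept_id",
    "icd_code": "condition_occurrence.condition_source_value",
}


def normalize(column: str) -> str:
    """Lowercase and turn runs of spaces or hyphens into underscores (assumed step)."""
    return re.sub(r"[\s\-]+", "_", column.strip().lower())


def match_pattern(column: str) -> Optional[str]:
    """Return the target field for a known alias, or None to fall through."""
    return SOURCE_PATTERNS.get(normalize(column))


hit = match_pattern("Subject ID")
miss = match_pattern("col_32")
```

A `None` result is the signal to fall back to the slower embedding-based strategy.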
To maximize pattern hits in your own standard, add all known aliases to source_patterns in your YAML:
source_patterns:
  patient_id: "person_id" # exact name
  pid: "person_id" # short alias
  subject_id: "person_id" # research alias
  pt_id: "person_id" # abbreviated
  medical_record_number: "person_id" # verbose
When no pattern matches, the mapper encodes both the source column name and the embedding_descriptions into vectors using SapBERT, then finds the closest target field by cosine similarity.
What to write in embedding_descriptions:
Write natural-language phrases a clinician would use to describe what that column contains — not just a rephrasing of the field name.
# ❌ Too literal: just re-states the name
embedding_descriptions:
  admit_date: "admission date"
  dx_code: "diagnosis code"
# ✅ Rich synonyms and clinical context: maximizes semantic recall
embedding_descriptions:
  admit_date: "hospital admission date time when patient was admitted inpatient start"
  dx_code: "ICD diagnosis code ICD-10-CM ICD-9 disease condition clinical code"
Naming your source columns well also helps. The source column name itself is encoded alongside the description. Prefer descriptive names over cryptic abbreviations:
| Less matchable | More matchable |
|---|---|
| col_32 | diagnosis_code |
| dt1 | admission_date |
| flg_act | is_active |
| cd_race | race_code |
| proc_nm | procedure_name |
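The embedding fallback described earlier picks the target field whose vector is closest to the source column's vector by cosine similarity. The sketch below shows only that selection step. The three-dimensional vectors are made-up stand-ins for real SapBERT embeddings, and `closest_field` is a hypothetical helper, not a Portiere function.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors standing in for embedded target-field descriptions.
TARGET_VECTORS = {
    "admit_date": [0.9, 0.1, 0.0],
    "dx_code": [0.0, 0.2, 0.9],
}


def closest_field(source_vector: list[float]) -> tuple[str, float]:
    """Return the (field, similarity) pair with the highest cosine similarity."""
    return max(
        ((name, cosine(source_vector, vec)) for name, vec in TARGET_VECTORS.items()),
        key=lambda pair: pair[1],
    )


# Toy vector standing in for an embedded source column such as "admission_date".
field, score = closest_field([0.8, 0.3, 0.1])
```

A descriptive column name moves the source vector closer to the right target, which is why `admission_date` tends to match better than `dt1`.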
After matching, every column receives a confidence score:
| Score | Tier | Action |
|---|---|---|
| ≥ 0.95 | Auto-accepted | Written to output immediately |
| 0.70 – 0.95 | Needs review | Flagged for human inspection |
| < 0.70 | Manual | Requires explicit override |
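The tiers in this table amount to two threshold comparisons. The sketch below is a minimal illustration, not Portiere's code: the `route` function and its tier labels are hypothetical, and treating a score of exactly 0.70 as "needs review" is an assumption, since the table does not pin down that boundary.

```python
def route(confidence: float, auto_accept: float = 0.95, needs_review: float = 0.70) -> str:
    """Assign a mapping to a tier based on its confidence score."""
    if confidence >= auto_accept:
        return "auto_accepted"
    if confidence >= needs_review:
        return "needs_review"
    return "manual"


tiers = [route(score) for score in (0.97, 0.95, 0.82, 0.70, 0.41)]
# Lowering auto_accept moves borderline scores into the auto-accepted tier.
relaxed = route(0.92, auto_accept=0.90, needs_review=0.60)
```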
Tune these thresholds to match your project's risk tolerance:
from portiere import PortiereConfig, ThresholdsConfig
from portiere.config import SchemaMappingThresholds
config = PortiereConfig(
    thresholds=ThresholdsConfig(
        schema_mapping=SchemaMappingThresholds(
            auto_accept=0.90, # lower → more auto-accepts
            needs_review=0.60, # lower → fewer manual items
        )
    )
)
schema_map = project.map_schema(source)
# Inspect what needs review
for item in schema_map.needs_review():
    print(f"{item.source_column} → {item.target_table}.{item.target_column} "
          f"(confidence={item.confidence:.2f})")
    for c in item.candidates[:3]:
        print(f" candidate: {c['target_table']}.{c['target_column']} ({c['confidence']:.2f})")
# Approve, override, or reject
schema_map.approve("patient_name")
schema_map.override("pt_zip", target_table="location", target_column="zip")
schema_map.reject("internal_audit_flag")
# Approve all remaining items
schema_map.approve_all()
schema_map.finalize()
| Backend | Type | Dependencies | Best For |
|---|---|---|---|
| BM25s | Lexical | None (built-in) | Quick start, no infra needed |
| FAISS | Vector | faiss-cpu, sentence-transformers | High-accuracy local search |
| Elasticsearch | Hybrid | elasticsearch | Production deployments |
| ChromaDB | Vector | chromadb | Lightweight vector store |
| PGVector | Vector | psycopg, pgvector | PostgreSQL environments |
| MongoDB | Vector | pymongo | Atlas Vector Search users |
| Qdrant | Vector | qdrant-client | Dedicated vector DB |
| Milvus | Vector | pymilvus | Large-scale vector search |
| Hybrid | Fusion | Varies | Combine backends with RRF |
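For readers unfamiliar with Reciprocal Rank Fusion (RRF), the sketch below shows the general technique: each backend contributes 1 / (k + rank) for every result it returns, and results are re-sorted by the summed score. This is a generic illustration, not Portiere's implementation. The `rrf_fuse` function, the result IDs, and the choice of k = 60 (the constant commonly used in the RRF literature) are assumptions.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one using Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Items ranked highly by any backend receive a larger contribution.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result IDs from a lexical and a vector backend.
bm25_hits = ["A", "B", "C"]
faiss_hits = ["B", "D", "A"]
fused = rrf_fuse([bm25_hits, faiss_hits])
```

Because RRF works on ranks rather than raw scores, it can combine backends whose scores are not on comparable scales, such as BM25 scores and vector distances.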
from portiere import PortiereConfig, KnowledgeLayerConfig
config = PortiereConfig(
    knowledge_layer=KnowledgeLayerConfig(
        backend="hybrid",
        hybrid_backends=["bm25s", "faiss"],
        hybrid_fusion="rrf", # Reciprocal Rank Fusion
    )
)
Portiere supports Bring-Your-Own-LLM for concept verification:
| Provider | Extra | Model Examples |
|---|---|---|
| OpenAI | portiere[openai] | GPT-4o, GPT-4o-mini |
| Anthropic | portiere[anthropic] | Claude Sonnet, Claude Haiku |
| AWS Bedrock | portiere[bedrock] | Claude, Titan, Llama |
| Ollama | portiere[ollama] | Llama 3, Mistral, Gemma (local) |
from portiere import PortiereConfig, LLMConfig
config = PortiereConfig(
    llm=LLMConfig(
        provider="openai",
        model="gpt-4o-mini",
        api_key="sk-...",
    )
)
Portiere auto-discovers configuration from multiple sources (in priority order):
from portiere import PortiereConfig, EmbeddingConfig, KnowledgeLayerConfig
config = PortiereConfig(
    target_model="omop_cdm_v5.4",
    embedding=EmbeddingConfig(
        provider="huggingface",
        model="cambridgeltl/SapBERT-from-PubMedBERT-fulltext",
    ),
    knowledge_layer=KnowledgeLayerConfig(backend="bm25s"),
)
target_model: omop_cdm_v5.4
storage: local
embedding:
  provider: huggingface
  model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
knowledge_layer:
  backend: bm25s
llm:
  provider: openai
  model: gpt-4o-mini
thresholds:
  auto_accept: 0.95
  needs_review: 0.70
export PORTIERE_TARGET_MODEL=omop_cdm_v5.4
export PORTIERE_LLM__PROVIDER=openai
export PORTIERE_LLM__API_KEY=sk-...
export PORTIERE_KNOWLEDGE_LAYER__BACKEND=faiss
Before concept mapping, build a searchable index from standard vocabularies (e.g., OHDSI Athena):
from portiere import build_knowledge_layer, PortiereConfig
config = PortiereConfig()
stats = build_knowledge_layer(
    vocabulary_dir="./data/athena/",
    config=config,
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
)
print(f"Indexed {stats['total_concepts']:,} concepts")
| Resource | Description |
|---|---|
| Quick Start Guide | Get started in 5 minutes |
| API Reference | Full SDK API documentation |
| Configuration Guide | YAML, Python, and env var config |
| Knowledge Layer Guide | All 9 backends explained |
| LLM Integration | BYO-LLM setup |
| Pipeline Architecture | 5-stage pipeline deep dive |
| Multi-Standard Support | Standards and custom schemas |
| Cross-Standard Mapping | OMOP ↔ FHIR, HL7v2 → FHIR |
| Example Notebooks | 19 Jupyter notebooks with walkthroughs |
portiere/
├── src/portiere/
│ ├── __init__.py # Public API: init(), PortiereProject, configs
│ ├── config.py # Configuration with auto-discovery
│ ├── project.py # Unified project interface
│ ├── exceptions.py # Error hierarchy
│ ├── stages/ # 5-stage pipeline implementation
│ ├── engines/ # Compute engines (Polars, Spark, Pandas, DuckDB)
│ ├── knowledge/ # Knowledge layer backends (9 backends)
│ ├── embedding/ # Embedding providers & gateway
│ ├── llm/ # LLM providers & gateway
│ ├── local/ # Local AI components (schema mapper, concept mapper)
│ ├── artifacts/ # ETL code generation (Jinja2 templates)
│ ├── runner/ # ETL execution engine
│ ├── quality/ # Data quality validation (Great Expectations)
│ ├── standards/ # Clinical standard YAML definitions & crossmaps
│ ├── storage/ # Storage backends (local filesystem)
│ └── models/ # Pydantic data models
├── tests/ # 36 test modules, 689 tests
├── docs/
│ ├── documentations/ # 22 guides and references
│ └── notebooks_examples/ # 19 Jupyter notebook examples
├── pyproject.toml # Package configuration (hatchling)
└── LICENSE # Apache 2.0
We welcome contributions! Here's how to get started:
# Clone the repository
git clone https://github.com/Cuspal/portiere.git
cd portiere
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install in development mode
pip install -e ".[dev,docs,polars,quality]"
# Run tests
pytest
# Run linter
ruff check src/ tests/
# Run type checker
mypy src/portiere/
Please read our contributing guidelines before submitting a pull request.
Portiere is licensed under the Apache License 2.0.
Copyright 2026 Cuspal Co. Ltd.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
If you use Portiere in your research, please cite:
@software{portiere2024,
title = {Portiere: AI-Powered Clinical Data Mapping SDK},
author = {Cuspal Co.,Ltd.},
year = {2026},
url = {https://github.com/Cuspal/portiere},
license = {Apache-2.0},
}