azure-edm-reference-data-platform

License: MIT | Python 3.12

An open-source EDM-style reference data platform simulator for Azure. It gives developers, data engineers, and architects a runnable reference for common enterprise data management capabilities: ingestion, staging, validation, survivorship, golden source, audit, lineage, monitoring, and downstream distribution.

This is not Markit EDM and does not include proprietary vendor code, SDKs, schemas, or real bank data. It is a generic, educational, production-inspired simulator built with synthetic data.

Who This Is For

  • Data engineers learning reference data platform patterns.
  • SQL Server developers modernizing batch-oriented data platforms.
  • Azure architects designing ADF, Azure SQL, ADLS, Key Vault, and monitoring patterns.
  • Teams that need a non-proprietary sandbox for validation, survivorship, audit, and lineage discussions.

What You Can Do With It

  • Validate fake vendor securities, prices, and ratings files.
  • Generate markdown data quality summaries.
  • Load CSV files into SQL Server staging tables.
  • Study Azure SQL-compatible staging, core, audit, lineage, and golden source scripts.
  • Review ADF sample pipeline JSON for ingestion and distribution.
  • Use Bicep templates as a starting point for Azure resource design.
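
As an illustration of the first capability, here is a minimal sketch of a single validation rule. It is not the project's validator: the column names (`security_id`, `price`, `price_date`), the three-day threshold, and the function name are assumptions made for this example. Only the rule code `STALE_PRICE` comes from the report output shown later in this README.

```python
import csv
import io
from datetime import date, timedelta

MAX_PRICE_AGE = timedelta(days=3)  # hypothetical staleness threshold


def find_stale_prices(csv_text: str, as_of: date) -> list[str]:
    """Return a STALE_PRICE warning for each row older than MAX_PRICE_AGE."""
    warnings = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        price_date = date.fromisoformat(row["price_date"])
        if as_of - price_date > MAX_PRICE_AGE:
            warnings.append(f"line {line_no}: STALE_PRICE for {row['security_id']}")
    return warnings


# Synthetic rows; the real sample files may use different columns.
SAMPLE = (
    "security_id,price,price_date\n"
    "SEC001,101.25,2024-01-10\n"
    "SEC002,55.10,2024-01-02\n"
)
warnings = find_stale_prices(SAMPLE, as_of=date(2024, 1, 11))
print(warnings)
```

Passing `as_of` explicitly, rather than reading the system clock, keeps the rule deterministic and easy to test against fixed sample files.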

Architecture

flowchart LR
    VendorFiles[Fake vendor CSV files] --> Landing[ADLS-style landing zone]
    Landing --> ADF[Azure Data Factory pipelines]
    ADF --> Staging[SQL staging tables]
    Staging --> Validation[Validation rules]
    Validation --> Audit[Audit and DQ issue tables]
    Validation --> Survivorship[Survivorship rules]
    Survivorship --> Core[Core master tables]
    Core --> Golden[Golden source views]
    Golden --> Distribution[Downstream distribution extracts]
    ADF --> Lineage[Lineage events]
    KeyVault[Azure Key Vault] --> ADF
    Monitor[Azure Monitor] --> ADF
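
To make the lineage part of the diagram concrete, the sketch below emits one lineage event per hop between stages. It is an assumption-laden illustration: the `LineageEvent` fields and the `build_lineage_events` helper are hypothetical, and the stage list flattens the diagram into a single path, ignoring the audit branch. The batch ID and record count mirror the CLI example later in this README.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Stage names follow the architecture diagram; the linear order is a simplification.
STAGES = ["landing", "staging", "validation", "survivorship", "core", "golden", "distribution"]


@dataclass(frozen=True)
class LineageEvent:
    """Hypothetical event shape; the project's lineage tables may differ."""
    batch_id: str
    source_stage: str
    target_stage: str
    record_count: int
    recorded_at: str


def build_lineage_events(batch_id: str, record_count: int) -> list[LineageEvent]:
    """Emit one event for each hop between consecutive stages."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        LineageEvent(batch_id, src, dst, record_count, now)
        for src, dst in zip(STAGES, STAGES[1:])
    ]


events = build_lineage_events("demo-001", 4)
for event in events:
    print(f"{event.batch_id}: {event.source_stage} -> {event.target_stage} ({event.record_count} records)")
```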

Quick Start

git clone https://github.com/corzosoft/azure-edm-reference-data-platform.git
cd azure-edm-reference-data-platform
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
python -m pytest
python -m ruff check .

The activation command above is for Windows PowerShell. On macOS/Linux, activate the environment with source .venv/bin/activate instead.

Run The Local CLI

edm-ref validate-file sample-data/vendor_a_securities.csv
edm-ref validate-file sample-data/vendor_b_prices.csv --file-type price
edm-ref quality-report sample-data/vendor_a_securities.csv sample-data/vendor_b_prices.csv
edm-ref generate-lineage-report --batch-id demo-001 --records 4

Expected quality report shape:

# Data Quality Summary

| File | Type | Errors | Warnings | Rule Counts |
| --- | --- | ---: | ---: | --- |
| sample-data/vendor_b_prices.csv | price | 0 | 1 | STALE_PRICE=1 |
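
A report of this shape could be assembled roughly as follows. This is a sketch, not the project's reporting code: the `Issue` class, both function names, and the comma separator used when a file triggers several rules are assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical validation finding."""
    rule: str      # e.g. "STALE_PRICE"
    severity: str  # "error" or "warning"


def summary_row(path: str, file_type: str, issues: list[Issue]) -> str:
    """Format one table row: error count, warning count, and per-rule counts."""
    errors = sum(1 for issue in issues if issue.severity == "error")
    warnings = sum(1 for issue in issues if issue.severity == "warning")
    counts = Counter(issue.rule for issue in issues)
    rule_counts = ", ".join(f"{rule}={n}" for rule, n in sorted(counts.items()))
    return f"| {path} | {file_type} | {errors} | {warnings} | {rule_counts} |"


def render_summary(rows: list[str]) -> str:
    """Prepend the title and table header to the formatted rows."""
    header = [
        "# Data Quality Summary",
        "",
        "| File | Type | Errors | Warnings | Rule Counts |",
        "| --- | --- | ---: | ---: | --- |",
    ]
    return "\n".join(header + rows)


report = render_summary([
    summary_row("sample-data/vendor_b_prices.csv", "price", [Issue("STALE_PRICE", "warning")]),
])
print(report)
```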

Run SQL Server Locally

docker compose up -d

SQL Server listens on localhost,1433 (SQL Server writes host and port separated by a comma) with user sa and password YourStrong!Passw0rd.

Apply scripts from sql/ in numeric order using Azure Data Studio, SQL Server Management Studio, or sqlcmd.

To load staging tables from the CLI, first install the Microsoft ODBC Driver for SQL Server on your machine, then install the optional SQL Server dependency:

python -m pip install -e ".[mssql]"
edm-ref load-staging sample-data/vendor_a_securities.csv
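
The general idea behind a staging load can be sketched with the standard library alone. This is not the `load-staging` implementation: the helper names, the table name `stg_securities`, and the sample columns are hypothetical, and an in-memory SQLite database stands in for SQL Server so the example runs without a driver. The `?` placeholder style matches what pyodbc expects.

```python
import csv
import io
import sqlite3


def build_insert(table: str, columns: list[str]) -> str:
    """Build a parameterized INSERT using ? placeholders."""
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"


def load_csv(connection, table: str, csv_text: str) -> int:
    """Insert every CSV row into `table` and return the number of rows loaded."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])
    rows = [tuple(row[column] for column in columns) for row in reader]
    cursor = connection.cursor()
    cursor.executemany(build_insert(table, columns), rows)
    connection.commit()
    return len(rows)


# Synthetic two-row file; the real sample files may use different columns.
SAMPLE = (
    "security_id,isin,name\n"
    "SEC001,US0000000001,Example Corp\n"
    "SEC002,US0000000002,Sample Inc\n"
)

conn = sqlite3.connect(":memory:")  # stand-in for a SQL Server connection
conn.execute("CREATE TABLE stg_securities (security_id TEXT, isin TEXT, name TEXT)")
loaded = load_csv(conn, "stg_securities", SAMPLE)
print(loaded)
```

Row values travel as parameters, but the table and column names are interpolated into the SQL text, so in this sketch they should come from trusted configuration rather than from the file being loaded.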

Core Concepts Included

| Concept | Where To Look |
| --- | --- |
| Staging model | sql/002_create_staging_tables.sql |
| Core master model | sql/003_create_core_tables.sql |
| Audit and lineage | sql/004_create_audit_lineage_tables.sql |
| Validation rules | sql/005_validation_rules.sql, src/edm_reference_platform/file_validator.py |
| Survivorship | sql/006_survivorship_rules.sql, src/edm_reference_platform/reconciliation.py |
| Golden source views | sql/007_golden_source_views.sql |
| ADF samples | adf/pipelines/ |
| Azure infrastructure | infra/bicep/ |

Azure Deployment Concept

The Bicep templates are reference templates for:

  • ADLS Gen2-style storage account.
  • Azure SQL logical server and database.
  • Azure Key Vault secret pattern.
  • Log Analytics and Application Insights.

They are intentionally simple so teams can adapt them to their own network, identity, private endpoint, policy, and naming standards.

Production Boundaries

This project is suitable as a learning tool, architecture accelerator, and local simulator. Before production use, add:

  • Enterprise authentication and managed identity.
  • Private networking and Key Vault-backed secrets.
  • Production-grade orchestration and retry policy.
  • Data contracts, schema evolution, and source file reconciliation.
  • Formal access controls, PII controls, and audit retention.
  • Performance testing with representative volumes.

Roadmap

  • Add dbt-style documentation for SQL entities.
  • Add optional Azure SQL deployment scripts.
  • Add more asset classes and corporate action examples.
  • Add sample data quality dashboard output.
  • Add containerized SQL script bootstrap.

Contributing

Contributions are welcome. See CONTRIBUTING.md.

Use fake or synthetic data only. Do not submit proprietary vendor code, real bank data, or confidential schemas.
