Feat/snowflake importer #16

Merged
dmaresma merged 112 commits into release/mckesson_datahub202606 from feat/snowflake_importer
Jun 4, 2026

@dmaresma dmaresma commented Jun 4, 2026

  • Tests pass
  • ruff format
  • README.md updated (if relevant)
  • CHANGELOG.md entry added

dmaresma and others added 30 commits January 24, 2026 19:32
…attributes add sort by schema.name and properties by ordinal_position
simonharrer and others added 29 commits May 21, 2026 22:07
Split the image into builder and runtime stages so uv and build-time
artifacts no longer ship in the final image. Add a .dockerignore so the
build context only contains the files actually needed.
…atacontract#1240)

SparkExporter.export() was annotated as returning dict[str, StructType]
but actually returns a str, consistent with all other exporters and with
DataContract.export()'s own str | bytes signature.

Note: to_spark_dict() exists as a separate public utility for Python
users who need live StructType objects. It is tested directly and
intentionally bypasses export(). Whether this pattern should be
documented is an open question for maintainers.

Co-authored-by: = <=>
…cks (datacontract#1219, datacontract#1245) (datacontract#1253)

* fix: skip schema type check for varchar(n) and map columns on Databricks (issues datacontract#1219, datacontract#1245)

* trim tests

---------

Co-authored-by: Alexandra Studer <alexandra.studer@mobi.ch>
Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com>
…inline-references (datacontract#1261)

* feat: resolve authoritativeDefinitions[type=definition] and add --no-inline-references

Re-introduces the definition inlining that became a no-op in 0.11.x when the
internal model moved from DCS to ODCS. For each schema property,
inline_definition_into_property now resolves its first
authoritativeDefinitions[type=definition] reference, parses the response as an
ODCS SchemaProperty (the entropy-data endpoint already serves this shape under
application/vnd.entropydata.odcs+json), and fills in the fields the property
leaves unset. Inline values always win (model_fields_set);
authoritativeDefinitions, name, properties and items are never overwritten.
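
The "inline values always win" rule can be illustrated with a minimal sketch. It assumes plain dicts in place of the ODCS SchemaProperty models, so "key present in the dict" stands in for membership in `model_fields_set`; `merge_definition` and `NON_MERGEABLE_FIELDS` are hypothetical names, not the resolver's actual code.

```python
# Hypothetical sketch: plain dicts stand in for ODCS SchemaProperty models.
# Fields that identify or structure the property and are never overwritten.
NON_MERGEABLE_FIELDS = {"authoritativeDefinitions", "name", "properties", "items"}


def merge_definition(prop: dict, definition: dict) -> dict:
    """Return a copy of prop with its unset fields filled from definition."""
    merged = dict(prop)
    for key, value in definition.items():
        if key in NON_MERGEABLE_FIELDS:
            continue  # identity/structure fields always stay as written inline
        if key in prop:
            continue  # explicitly set inline (even to None): inline wins
        merged[key] = value
    return merged


prop = {"name": "order_id", "description": None}
definition = {"name": "id", "description": "Unique id", "logicalType": "string"}
result = merge_definition(prop, definition)
```

Here `result` keeps the inline `name` and the explicitly set `description`, and only gains `logicalType` from the definition.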

Resolution rules:

  - Relative `url:` is joined onto ENTROPY_DATA_HOST. Absolute is used as-is.
  - The x-api-key header is sent only when the resolved URL's host matches
    ENTROPY_DATA_HOST -- a third-party `url:` cannot receive the user's key.
  - Any failure (host mismatch, HTTP error, network error, malformed body)
    raises a DataContractException. No silent skip.
  - Per-process success-only cache so a definition shared by many properties
    is fetched once; failures are not cached so a transient blip doesn't
    poison later runs.

Exposes the new behaviour via --no-inline-references / --inline-references on
lint, test, ci, changelog, and all export subcommands. Doubles as an escape
hatch when a definition is briefly unreachable or invalid: the contract was
already validated against the ODCS schema before resolution, so skipping the
fetch still produces a syntactically valid run from the contract as written.

Naming chosen generic ("references", not "definitions") so future
authoritativeDefinitions[type=...] handlers can plug into the same flag and
field. Existing DataContract(inline_definitions=...) parameter renamed to
inline_references.

Fixtures using type:"definition" against demo URLs were switched to
type:"businessDefinition" so the existing tests don't trigger HTTP --
shipments-odcs.xlsx regenerated to match.

Covered by 17 new tests in tests/test_resolve_definitions.py: URL shapes,
api-key scoping, inline-wins precedence, identity-field protection, nested
properties, array items, caching, every failure mode, and a CLI smoke test
proving --no-inline-references makes zero HTTP calls.

* style: apply ruff format to tests/test_lint.py

CI's ruff format --check failed. After renaming inline_definitions ->
inline_references the DataContract(...) call in test_lint_with_ref fit
on one line, but the formatter left it wrapped. Reformat.

* docs: compact function docs in the resolver

Trim the comments and docstrings I added: drop what is already
conveyed by function names and types, keep only the non-obvious why
(non-mergeable-field rationale, success-only cache, host-scoped API key,
model_fields_set semantics).
  - _NON_MERGEABLE_FIELDS gains `id`: a property's `id` identifies the
    property itself, like `name`, and should not be overwritten by the
    definition's id (jschoedl).

  - Escape `\[type=definition]` in the inline-references help text on
    `lint`, `test`, `ci`, `export`, and `changelog`. Typer/rich was
    parsing the bare bracket as markup, so the type-name didn't render
    in `--help` (jschoedl).

  - CHANGELOG entry split into three bullets so the breaking nature of
    "any resolution failure rejects the contract" is its own line
    instead of buried at the end of a paragraph (jschoedl).

  - README regenerated via `python3 update_help.py` so the help section
    matches the new strings.
Updates the requirements on [uvicorn](https://github.com/Kludex/uvicorn) to permit the latest version.
- [Release notes](https://github.com/Kludex/uvicorn/releases)
- [Changelog](https://github.com/Kludex/uvicorn/blob/main/docs/release-notes.md)
- [Commits](Kludex/uvicorn@0.44.0...0.48.0)

---
updated-dependencies:
- dependency-name: uvicorn
  dependency-version: 0.48.0
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Updates the requirements on [databricks-sdk](https://github.com/databricks/databricks-sdk-py) to permit the latest version.
- [Release notes](https://github.com/databricks/databricks-sdk-py/releases)
- [Changelog](https://github.com/databricks/databricks-sdk-py/blob/main/CHANGELOG.md)
- [Commits](databricks/databricks-sdk-py@v0.0.1...v0.111.0)

---
updated-dependencies:
- dependency-name: databricks-sdk
  dependency-version: 0.111.0
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.15.13 to 0.15.14.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.15.13...0.15.14)

---
updated-dependencies:
- dependency-name: ruff
  dependency-version: 0.15.14
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…RI) (datacontract#1262)

Broadens the resolver from "definition" only to a precedence-ordered list
("semantics", "semantic", "definition") and adds an IRI lookup path so
semantic URLs that aren't directly fetchable (because they encode the
namespace as a stable identifier on a non-entropy-data host) still resolve.

Routing:
  - URL on the configured entropy-data host (relative or matching absolute):
    fetched directly, x-api-key sent when configured.
  - Off-host `definition` URL: fetched directly and anonymously -- a contract
    may legitimately reference a third-party REST URL, and the API key must
    never leak across hosts (existing behaviour, unchanged).
  - Off-host `semantics`/`semantic` URL: treated as an IRI and routed through
    `GET /api/semantics?iri=...` on the configured entropy-data host (a new
    endpoint added in the entropy-data side of this change). Requires an API
    key because that endpoint is API-key only.

Precedence: when a property has both a semantics and a definition reference,
the semantic wins -- matches useInheritedDefinition in datacontract-editor.
The legacy singular `"semantic"` type is also accepted for contracts written
before the entropy-data type migration.
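
The precedence rule could be sketched as follows. The type order comes from this commit message; `pick_reference` and the dict-shaped references are hypothetical stand-ins for the resolver's internals.

```python
from typing import List, Optional

# Order taken from the commit message: earlier entries win.
REFERENCE_TYPE_PRECEDENCE = ("semantics", "semantic", "definition")


def pick_reference(authoritative_definitions: List[dict]) -> Optional[dict]:
    """Return the first reference of the highest-precedence supported type."""
    for ref_type in REFERENCE_TYPE_PRECEDENCE:
        for ref in authoritative_definitions:
            if ref.get("type") == ref_type:
                return ref
    return None  # no resolvable reference, e.g. only businessDefinition entries


refs = [
    {"type": "definition", "url": "/definitions/order-id"},
    {"type": "semantics", "url": "https://example.org/ns/order-id"},
]
chosen = pick_reference(refs)
```

With both reference types present, `chosen` is the `semantics` entry even though the `definition` entry appears first in the list.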

Five new tests in test_resolve_definitions.py (17 -> 22): REST URL with
api-key, IRI routed through /api/semantics, IRI without api-key raises with
clear guidance, legacy "semantic" type resolves, semantics-wins-over-definition.
…ract#1249)

* test: add failing tests for versioned dbt model filter

Add a versioned dbt manifest fixture (manifest_versioned_models.json)
with two versions of 'mart_orders' (v1, v2) and one unversioned model
to cover the following cases:

- Plain name (mart_orders) imports all versions — already works
- mart_orders.v1 imports only v1 — FAILS (bug)
- mart_orders.v2 imports only v2 — FAILS (bug)
- CLI --model mart_orders.v1 produces non-empty output — FAILS (bug)

Relates to: datacontract import dbt --model <name>.vN silent empty output

* fix: support versioned dbt model filter (name.vN)

The node filter in import_dbt_manifest compared the raw user-supplied
string against node["name"], which is always the base model name in
manifest.json.  Supplying "mart_orders.v1" therefore never matched
"mart_orders" and produced a silent empty contract.

Add _matches_dbt_node_filter() which understands dbt's name.vN
convention:
- "mart_orders"    → matches all versions of mart_orders (unchanged behaviour)
- "mart_orders.v1" → matches only the node whose name=="mart_orders"
                      and version==1

Replace the inline not-in check with a call to the new helper.

* fix: handle dotted plain model names in dbt node filter

Using lstrip('v') to extract the version number was fragile — any
dotted plain name (e.g. 'schema.orders') would be misread as a
versioned filter with base='schema' and version='orders', causing
the node to be silently skipped.

Replace the lstrip approach with a _VERSION_SUFFIX_RE regex
(^v?(\d+)$) that only treats the suffix as a version when it
actually looks like one. If the suffix does not match, execution
falls through to the plain-name comparison so 'schema.orders'
still resolves correctly.
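
A minimal sketch of how such a filter might look, assuming manifest nodes are dicts with a `name` and an optional `version`. The regex is the one quoted above; the function body, the string comparison of versions, and the treatment of an empty filter list as "no filtering" are assumptions, not the project's implementation.

```python
import re
from typing import List

# Regex quoted in the commit message: an optional "v" followed by digits only.
_VERSION_SUFFIX_RE = re.compile(r"^v?(\d+)$")


def matches_dbt_node_filter(node: dict, filters: List[str]) -> bool:
    """Return True when node is selected by any plain or versioned filter."""
    if not filters:
        return True  # assumption: an empty filter list selects every node
    name = node.get("name")
    version = node.get("version")
    for item in filters:
        if item == name:
            return True  # plain name (dotted or not) matches all versions
        base, sep, suffix = item.rpartition(".")
        match = _VERSION_SUFFIX_RE.match(suffix) if sep else None
        if match and base == name and version is not None:
            if str(version) == match.group(1):
                return True
    return False
```

Checking the plain-name case first means a filter like `schema.orders` is only interpreted as versioned when its suffix actually looks like a version.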

Add unit tests for _matches_dbt_node_filter covering:
- plain name match
- versioned match with and without 'v' prefix
- wrong version does not match
- plain name matches all versions
- dotted plain name is not misread as versioned
- empty filter list

* docs: update README and CHANGELOG for versioned dbt model filter fix

* trim tests and README

---------

Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com>
…tacontract#1263)

The `[api]` optional extra hard-pinned `fastapi==0.136.1`. For a published
library this forces downstream apps that depend on FastAPI to that exact
version, frequently causing pip resolution conflicts.

Switch to `fastapi>=0.115.0,<0.137.0`:
- Wide floor lets downstream projects co-resolve.
- Ceiling caps at the next minor because FastAPI ships breaking changes in
  0.x MINOR releases (per FastAPI's versioning policy); Dependabot keeps
  advancing the ceiling with CI validating each minor.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
datacontract#1264)

* fix: surface WARNING/ERROR logs by default instead of suppressing them

Every command called enable_debug_logging(debug), which added a
logging.NullHandler to the root logger when --debug was off. That
disabled Python's lastResort handler, silencing all logging.warning/
logging.error output across the codebase unless --debug was passed.

Default the root level to WARNING (on stderr) so diagnostics like the
`import sql` placeholder-connection warning are visible. test/ci/lint
render run.logs to the console themselves and would otherwise print the
mirrored Run.log_* records twice, so they pass otherwise_disable_stderr
to keep the old NullHandler behavior. Noisy third-party loggers are
pinned to ERROR to preserve the original intent of muting library chatter.
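
The behaviour described above might be sketched like this with the standard `logging` module. `configure_logging_sketch` is a hypothetical stand-in for `enable_debug_logging`, and it takes the logger as a parameter (the commit applies the change to the root logger) so the sketch can be tried without touching global state.

```python
import logging
import sys


def configure_logging_sketch(
    logger: logging.Logger,
    debug: bool = False,
    otherwise_disable_stderr: bool = False,
) -> None:
    """Default to WARNING on stderr; optionally keep the old silent behaviour."""
    for handler in list(logger.handlers):
        logger.removeHandler(handler)  # start from a clean slate
    if not debug and otherwise_disable_stderr:
        # Old behaviour: any attached handler, even a NullHandler, prevents
        # Python's lastResort handler from printing WARNING/ERROR records.
        logger.addHandler(logging.NullHandler())
        return
    logger.addHandler(logging.StreamHandler(sys.stderr))
    logger.setLevel(logging.DEBUG if debug else logging.WARNING)
```

Commands that render their own run logs would pass `otherwise_disable_stderr=True`, while every other command gets warnings and errors on stderr by default.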

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* shorten Changelog entry

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tacontract#1260)

* fix(exporters): correct return type annotations on export() methods

All exporters returning str were incorrectly annotated as -> dict.
Fixed 15 exporters to -> str (or -> str | None for MermaidExporter)
and added the missing -> str annotation to IcebergExporter.

This was identified during review of PR datacontract#1240

* style: Fix formatting

---------

Co-authored-by: = <=>
…ntract#1266)

* feat: accept extra top-level fields when --json-schema is set

When a custom JSON schema is supplied via `--json-schema`, treat it as the
source of truth and parse the contract with an `extra='allow'` ODCS subclass
so the Pydantic step no longer re-rejects extras the schema already accepted.
Lets teams extend ODCS with their own root-level fields (e.g.
`change_management`) and still use `lint`, `test`, `export`, etc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update CHANGELOG

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…contract#1265)

* fix: bundle DuckDB cloud extensions to support air-gapped tests (datacontract#1191)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: harden duckdb extension loading for read-only and incomplete-extra envs

Extract bundled extensions to a tempfile rather than writing into the
wheel directory (which fails on system installs and read-only Docker
layers). Pull [duckdb] transitively in [gcs] and [azure] extras, and
emit a server-type-aware hint when soda-core is missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: skip duckdb extension wheels on linux/aarch64 (no 1.0.x wheels available)

Without this, [s3]/[gcs]/[azure] don't install on linux/arm64 (Graviton,
ARM K8s nodes, Apple-Silicon Linux containers). ARM Linux falls back to
DuckDB's online auto-install — same as before the bundling work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: tighten soda-missing error message and trim verbose comments

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update pyproject.toml

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* failing test

* improve error navigation by resolving schema name and property name in check message

* improve schema parsing

* improve the python code

* refactor move nested try catch into the function

* improve error message

---------

Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com>
Add a pull_request_target/synchronize-triggered job that removes the
'stale' label as soon as commits are pushed, instead of waiting for the
weekly scheduled run. Guard the existing scan job to non-PR events.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ct#1163)

* fix: detect missing fields in CSV/Parquet files correctly (datacontract#1065)

field_is_present check always passed for CSV/Parquet because
create_view_with_schema_union creates a table with ALL contract columns
(missing ones filled with NULLs), making SodaCL schema check see the
column as present.

Fix: Create a _raw view that exposes actual data columns (without
contract-only columns), and use it for field_is_present checks on
CSV/Parquet data.

- duckdb_connection.py: Create {model}_raw view alongside unioned table
- data_contract_checks.py: Use _raw view for field_is_present on csv/parquet
- tests: Update schema evolution tests to expect correct behavior

* fix: extend missing-field detection to JSON and harden raw view (datacontract#1065)

Apply review follow-ups to the field_is_present fix: also create a {model}__raw__
view for JSON (read_json_auto without the columns= projection), create it
unconditionally so the fallback branch is covered, rename the sentinel from _raw
to __raw__ to avoid colliding with user models, and let check_property_is_present
derive the target view from the server. Adds JSON schema-evolution regression tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: infer test output format from file extension

* fix: add accepted values to ci output-format help; tidy infer error message

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* add CHANGELOG

---------

Co-authored-by: dallylee <dallylee@users.noreply.github.com>
Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: use Docker Hardened Image for container builds

Switch the Dockerfile from `python:3.11-slim-trixie` to
`dhi.io/python:3.11-debian13-sfw-dev` for both builder and runtime
stages. The DHI image is signed, ships SBOM/VEX metadata, has tighter
CVE patch SLAs than upstream Debian, and routes `pip` / `uv` installs
through Socket Firewall Free to block malicious dependencies at build
time.

Build pulls require `docker login dhi.io` with a Docker Hub account
that has DHI access enabled (the free Socket Firewall tier is enough).

Closes datacontract#1275

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: add dhi.io login step to docker build jobs

The Dockerfile now bases off `dhi.io/python:3.11-debian13-sfw-dev`,
which requires authentication on pull. Add a `docker/login-action`
step targeting `dhi.io` to both the CI snapshot job and the release
publish job. Reuses the existing DOCKERHUB_USERNAME / DOCKERHUB_TOKEN
secrets — DHI uses Docker Hub auth, so the same token works provided
the Docker Hub account has DHI access (free Socket Firewall tier is
sufficient).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: switch runtime to minimal DHI image (nonroot, no shell)

Move the runtime stage from `dhi.io/python:3.11-debian13-sfw-dev`
to the minimal `dhi.io/python:3.11-debian13` image. The minimal
runtime is also DHI Free, runs as nonroot (uid 65532), ships no
shell, no apt, and only the libraries Python itself needs — strictly
smaller attack surface than carrying the dev variant + Socket Firewall
at runtime, where neither is used.

Changes to make the minimal runtime work:

- Install `protobuf-compiler` in the builder stage (it can't be
  apt-installed in the runtime anymore) and copy `protoc` plus its
  shared libs (`libprotoc.so.32`, `libprotobuf.so.32`) into
  `/opt/protoc` in the runtime. Add `/opt/protoc/bin` to PATH and
  `/opt/protoc/lib` to LD_LIBRARY_PATH so `datacontract import
  protobuf` continues to work. The library glob handles both
  linux/amd64 and linux/arm64.
- `COPY --chown=65532:65532` the venv so the nonroot user owns it.
- Drop the `RUN mkdir /home/datacontract` (no coreutils in the
  minimal runtime); `WORKDIR` creates the directory and `USER
  65532:65532` explicitly pins runtime identity.

Image size: 1.88 GB (was 2.12 GB on sfw-dev, was 2.01 GB on the
previous python:3.11-slim-trixie baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: changelog entry for Docker Hardened Images switch

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: use DHI's nonroot user by name instead of uid

Cosmetic; same uid 65532. Easier to grep and read.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: switch runtime to DHI dev variant + bundle Java 17

Move the runtime stage from the minimal DHI image
(`dhi.io/python:3.11-debian13`) to the dev variant
(`dhi.io/python:3.11-debian13-dev`) and bundle Eclipse Temurin
JRE 17 via multi-stage COPY from `dhi.io/eclipse-temurin:17-debian13`.

Why dev instead of minimal:
- The minimal runtime has no shell, so PySpark's `spark-submit`
  (a bash script) can't even resolve its shebang. Without bash,
  Kafka and Spark engines fail with a misleading "spark-submit
  not found" error even when Java is present.
- The dev variant adds bash, apt, coreutils — same packages the
  previous slim-trixie image carried — without changing the
  hardening posture: still signed, still SBOM/VEX, still on DHI
  patch SLAs, still nonroot via the USER directive below.
- Image size: 2.13 GB vs. 1.97 GB on slim-trixie (+8%), the delta
  is the bundled JRE.

Why bundle Java:
- PySpark-backed engines (`datacontract test` against a kafka or
  spark server, `datacontract import/export spark` with a live
  session) have never worked in the published image: the previous
  slim-trixie base shipped no JVM either. Adding the JRE here means
  those paths now actually run — `SparkSession` starts, queries
  reach the engine. Previously a user had to extend the image with
  a JVM themselves.
- Eclipse Temurin 17 is DHI-published (signed, SBOM/VEX) and is in
  the version range PySpark 3.5.x supports.

Other simplifications vs. the previous minimal-runtime version of
this PR:
- `apt-get install protobuf-compiler` moves back to the runtime
  stage (dev has apt); drops the manual `/opt/protoc` shared-lib
  copy.
- `--chown=65532:65532` on the venv copy lets the nonroot user own
  it; `USER nonroot:nonroot` switches identity at the end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: install protoc in builder, copy artifacts into runtime

Idiomatic multi-stage pattern: build tools live in the builder, only
artifacts cross the stage boundary. Drops the `apt-get update + install
+ rm` from the runtime layer in favor of two `COPY --from=builder`
lines for the protoc binary and its shared libs. Size is roughly
unchanged; the runtime layer is more reproducible (no live apt fetch
at image-build time on the runtime side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: explain dev vs minimal runtime choice in Dockerfile comment

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: flag that DHI -dev runtime is intentional, not a mistake

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: drop stale reference to /opt/protoc copy from runtime comment

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: replace soda-core with ibis as the quality/test engine

Remove soda-core entirely as the data quality execution engine and replace
it with ibis (https://ibis-project.org/), which compiles one expression API
to many SQL dialects via sqlglot and reads local/remote files through DuckDB.

Motivation: soda-core v3 was an unmaintained, string-templated per-dialect SQL
generator that forced a `setuptools`/`distutils.strtobool` shim, a
`mysql-connector-python` override, and brittle version pinning across ~13
`soda-core-*` extras.

What changed:
- New engine-neutral check IR (`datacontract/engines/checks/`): `CheckSpec` +
  structured `Threshold`, and `create_checks` that enumerates an ODCS contract
  into specs, preserving every legacy check key/type/name.
- New ibis engine (`datacontract/engines/ibis/`): batches row/missing/invalid
  counts into one aggregation per model; runs dedicated queries for duplicates,
  schema/type, freshness/retention and user SQL; evaluates thresholds in Python.
  Reproduces soda's invalid_count semantics
  (NOT missing AND (NOT valid OR in invalid_values)). Counts use
  `CASE WHEN ... THEN 1 ELSE 0 END` for dialect portability (e.g. Oracle).
- Per-source ibis connection builders reusing the existing DuckDB view builder
  (files) and Spark/Kafka helpers; Spark sources run via the ibis pyspark
  backend.
- SodaCL kept but isolated: all SodaCL generation moved into
  `datacontract/export/sodacl_check_builder.py`, used only by `SodaExporter`.
  `export sodacl` is unchanged and no longer shares code with the test path.
- Removed `engines/soda/`, the old `data_contract_checks.py`, the soda
  config-builder tests, the setuptools shim, and all `soda-core-*` deps;
  pyproject extras now map to `ibis-framework[<backend>]`.
- Raw SodaCL custom checks (quality.engine: soda) now surface a migration
  warning instead of executing.
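
To make the batched-count idea concrete, here is a small sketch that uses the standard-library `sqlite3` module in place of ibis. The table, the valid and invalid value lists, and the threshold are invented for illustration; only the `CASE WHEN ... THEN 1 ELSE 0 END` counting style and the invalid_count predicate are taken from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [("open",), ("closed",), (None,), ("bogus",), ("deleted",)],
)

# Hypothetical rules: valid values are open/closed/deleted, and "deleted" is
# additionally listed under invalid_values.
# invalid = NOT missing AND (NOT valid OR in invalid_values)
query = """
SELECT
  COUNT(*) AS row_count,
  SUM(CASE WHEN status IS NULL THEN 1 ELSE 0 END) AS missing_count,
  SUM(CASE WHEN status IS NOT NULL
            AND (status NOT IN ('open', 'closed', 'deleted')
                 OR status IN ('deleted'))
           THEN 1 ELSE 0 END) AS invalid_count
FROM orders
"""
# One aggregation yields all three metrics for the model.
row_count, missing_count, invalid_count = conn.execute(query).fetchone()

# Thresholds are evaluated in Python, outside the SQL.
must_be_less_than = 3  # hypothetical threshold
invalid_check_passed = invalid_count < must_be_less_than
conn.close()
```

The NULL row counts as missing but not invalid, while `bogus` (not valid) and `deleted` (valid but explicitly invalid) both count as invalid.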

Verified end-to-end against testcontainers/local data for DuckDB (parquet/csv/
json/s3), Postgres (full quality fixture), Trino, and Oracle; full non-DB suite
passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ibis): don't tear down caller-owned connections; migrate kafka fixture

- execute_ibis_checks only disconnects connections the engine created. The
  pyspark backend wraps a caller-owned SparkSession, and an externally supplied
  DuckDB connection is owned by the caller; disconnecting either broke
  subsequent runs (e.g. the session-scoped Spark fixture shared by the two
  dataframe tests). Skip disposal for the pyspark backend and for
  caller-provided duckdb/spark resources.
- Migrate tests/fixtures/kafka to the native rowCount quality metric, replacing
  the removed raw SodaCL custom check, so the kafka/Spark path is exercised
  end-to-end.

Verified with Java 21: test_test_dataframe (x2), test_import_spark (x3) and
test_test_kafka all pass via the ibis pyspark backend.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump version to 0.13.0 and document soda-to-ibis migration

Minor bump (0.x SemVer) for the breaking quality-engine replacement.
Document the soda-core -> ibis migration, the dropped raw-SodaCL execution,
and the soda-core dependency removal in the CHANGELOG Unreleased section.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(mysql): route MySQL through DuckDB extension; pin duckdb to 1.0.x

- ibis's native MySQL backend requires `mysqlclient`, a C extension with no
  macOS/Linux wheels that fails to build without pkg-config + MySQL client
  libraries (broke `pip install -e .[dev]`). Connect to MySQL via DuckDB's
  `mysql` extension instead: ATTACH the database and materialize each contract
  model into a local DuckDB table, then run checks locally. Keeps the `mysql`
  extra pure-pip. Materializing avoids DuckDB MySQL-scanner pushdown errors
  (e.g. the grouped duplicate-count query hit a DuckDB binder assertion).
- Pin `duckdb` to `<1.1.0` to match the bundled `duckdb-extension-*` wheels
  (httpfs/aws/azure, pinned `<1.1.0`). Without a lockfile, fresh installs
  resolved duckdb 1.5.3, which mismatched those wheels (S3 "Secret Validation
  Failure") and changed CSV/JSON/secret behavior and mysql-extension port
  handling — breaking s3, csv-import, nested-json, and mysql tests.

Full suite (Java 21, duckdb 1.0.0): 744 passed, 14 skipped. Remaining 5
failures are environmental: 4 protobuf (no `protoc`), 1 kafka (pre-existing
Spark-session conflict with the dataframe test in non-xdist runs; passes in
isolation, skipped under xdist in CI).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(deps): move mysql-connector-python to the dev extra

The runtime MySQL path goes through DuckDB's mysql extension, so no Python
MySQL driver is needed at install time. mysql-connector-python is only used by
the MySQL test fixture to seed data, so it belongs in `dev`. The `mysql` extra
is now just `datacontract-cli[duckdb]`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(duckdb): upgrade to 1.5.x (from 1.0.x) with lockstep extension wheels

Loosen the duckdb pin off the 1.0.x line. duckdb and the bundled
duckdb-extension-* wheels (httpfs/aws/azure) are bumped together to 1.5.x; the
1.5.x extension wheels ship arm64 Linux builds, so the platform skip markers are
dropped and air-gapped installs on arm64 Linux now work.

Fixes for DuckDB >=1.5 behavior changes:
- S3 secret: explicit KEY_ID/SECRET now use the default `config` provider;
  `PROVIDER CREDENTIAL_CHAIN` with explicit credentials is rejected in 1.5.x
  ("Secret Validation Failure").
- csv import: the uniqueness probe uses `count(DISTINCT ...)` via SQL instead of
  the relational `.count('DISTINCT ...')` form, which 1.5.x's binder rejects.
- test_duckdb_json: assert on the stable DuckDBPyType `.id` (number->bigint,
  dict->struct) instead of the old DBAPI type-code strings.

Full suite (Java 21, duckdb 1.5.3): 744 passed, 14 skipped; remaining 5 failures
are environmental (4 protobuf: no protoc; 1 kafka: Spark-session conflict in
non-xdist runs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(spark): use tmp_dir.name for spark.sql.warehouse.dir

Use the TemporaryDirectory path instead of its repr for the Spark warehouse dir
in create_spark_session() and import_spark().
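
The difference between the two values is easy to see in isolation. This sketch uses only `tempfile`; the prefix and the config dict are illustrative and no Spark session is created.

```python
import tempfile

tmp_dir = tempfile.TemporaryDirectory(prefix="datacontract-cli-spark")

# Formatting the object itself yields its repr, which is not a usable path.
as_repr = str(tmp_dir)
# The .name attribute is the actual directory path.
as_path = tmp_dir.name

# Only the .name form is suitable as a warehouse directory setting.
spark_conf = {"spark.sql.warehouse.dir": as_path}

tmp_dir.cleanup()
```

Passing `as_repr` as the warehouse directory would hand Spark a string beginning with `<TemporaryDirectory`, which is why the fix switches to `tmp_dir.name`.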

* fix(pyspark): cap to <4.0 to match the Spark 3.5 connector jars

Without a lockfile, `pyspark>=3.5.0,<5.0.0` resolved to 4.0.x on fresh installs,
but the Kafka/Avro paths load `spark-sql-kafka-0-10_2.12:3.5.5` /
`spark-avro_2.12:3.5.5` (Scala 2.12, Spark 3.5) jars, which fail to load on a
Spark 4.x (Scala 2.13) runtime — breaking `datacontract test` against Kafka.
Cap pyspark to the 3.5.x line in the kafka and databricks extras.

Full suite (Java 21, duckdb 1.5.3, pyspark 3.5.8): 745 passed, 14 skipped, 4
failed; the 4 failures are the protobuf importer tests, which require the
`protoc` system binary (documented manual install).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(protobuf): import .proto with pure-Python parser (drop protoc)

Replace the protoc-based importer with proto-schema-parser (pure Python, only
depends on antlr4-python3-runtime). `import protobuf` no longer needs the
`protoc` system binary or the protobuf runtime: imports are resolved transitively
by reading `import` statements and parsing each file, and message/enum type
references are linked across files by simple name (handling package-qualified
and subdirectory imports).

Output is preserved byte-for-byte vs the protoc-based importer (scalar
physicalType is still the protobuf type number; repeated scalars stay scalar;
only repeated messages become arrays). The `protobuf` extra now declares
`proto-schema-parser` instead of `protobuf`.

Dockerfile: drop the protobuf-compiler install and the protoc binary/lib copy
into the runtime image — no longer needed.

tests/test_import_protobuf.py (incl. nested-imports and subdirs): 4 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(snowflake): accept DATACONTRACT_SNOWFLAKE_USERNAME as alias for user

The ibis Snowflake backend forwards connection kwargs to
snowflake-connector-python, which expects `user` (not soda's `username`).
Map the documented DATACONTRACT_SNOWFLAKE_USERNAME env var to `user` so it
keeps working after the soda -> ibis migration.
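
One way such an alias could work, as a hedged sketch: the `DATACONTRACT_SNOWFLAKE_USERNAME` variable and the `user` keyword come from the message above, while the generic prefix-stripping loop and the function name are assumptions about how the mapping might be written.

```python
from typing import Dict, Mapping

_PREFIX = "DATACONTRACT_SNOWFLAKE_"
# soda-era parameter name -> snowflake-connector-python keyword
_ALIASES = {"username": "user"}


def snowflake_connect_kwargs(env: Mapping[str, str]) -> Dict[str, str]:
    """Build connector kwargs from prefixed env vars (hypothetical helper)."""
    kwargs: Dict[str, str] = {}
    for key, value in env.items():
        if not key.startswith(_PREFIX):
            continue  # unrelated environment variable
        name = key[len(_PREFIX):].lower()
        kwargs[_ALIASES.get(name, name)] = value
    return kwargs


example_env = {
    "DATACONTRACT_SNOWFLAKE_USERNAME": "alice",
    "DATACONTRACT_SNOWFLAKE_PASSWORD": "s3cret",
    "HOME": "/home/alice",
}
connect_kwargs = snowflake_connect_kwargs(example_env)
```

In a real run the mapping would be fed from `os.environ`; a plain dict is used here so the sketch stays deterministic.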

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: update README for the soda -> ibis migration

- protobuf import: drop the protoc system-dependency instructions (now a
  pure-Python parser).
- engine description: ibis + fastjsonschema instead of soda-core.
- bigquery: ADC/WIF fallback no longer described via soda's use_context_auth.
- snowflake: document env vars as snowflake-connector-python params (USERNAME
  kept as an alias for `user`).
- redshift: username/password via the Postgres backend; note IAM auth is not
  currently supported.
- impala: drop 'Soda' wording.

The `export sodacl` format is unchanged and remains documented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(spark): derive Kafka/Avro connector jars from the runtime PySpark

Replace the hardcoded spark-sql-kafka / spark-avro coordinates (_2.12:3.5.5)
with spark_connector_packages(), which reads the Spark version and Scala binary
version (2.12 vs 2.13) from the installed PySpark jars. This lets the kafka
extra allow PySpark 4.x (Scala 2.13) without the connector JARs mismatching the
runtime. Tests use the same helper.
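
A rough sketch of how the connector coordinates could be derived from jar file names. It is not the body of spark_connector_packages(); the function name `connector_packages_from_jars` and the assumption that the PySpark jars directory contains a file named like `spark-sql_<scala>-<spark>.jar` are illustrative.

```python
import re

# Matches jar names such as "spark-sql_2.13-4.0.2.jar" (assumed naming scheme).
_SPARK_SQL_JAR = re.compile(r"^spark-sql_(\d+\.\d+)-(\d+\.\d+\.\d+)\.jar$")


def connector_packages_from_jars(jar_names):
    """Return Kafka/Avro Maven coordinates matching the bundled Spark jars."""
    for name in jar_names:
        match = _SPARK_SQL_JAR.match(name)
        if match:
            scala, spark = match.groups()
            return [
                f"org.apache.spark:spark-sql-kafka-0-10_{scala}:{spark}",
                f"org.apache.spark:spark-avro_{scala}:{spark}",
            ]
    raise ValueError("no spark-sql jar found; cannot infer Spark/Scala versions")
```

In practice the jar names would come from listing the `jars` directory of the installed PySpark package; passing them in as a list keeps the parsing step testable without a Spark install.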

* chore(databricks): lift the pyspark <4.0 cap to <5.0

The databricks path connects via a caller-provided Spark session (or the
databricks SQL connector) and never calls create_spark_session(), so it loads
no Kafka/Avro connector jars and isn't tied to a Scala/Spark line. Align its
pyspark range with the kafka extra (<5.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(changelog): reflect dynamic Spark connector jars (not a pyspark cap)

The kafka/databricks extras allow pyspark<5.0; the Kafka/Avro connector jars are
derived from the runtime PySpark (Scala 2.12/2.13), so the earlier "capped to
<4.0" note no longer describes the final state.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ibis): record real compiled SQL in check implementation/language

Previously every ibis check carried a SodaCL-looking pseudo-string in
`implementation` with `language: null`, a leftover from the soda model. Now each
check records what it actually runs:

- count-style metrics (row_count, missing_count, invalid_count), duplicates,
  freshness/retention and custom-SQL checks store the backend-dialect SQL
  (compiled via ibis.to_sql) with `language: "sql"`.
- schema checks (field_is_present, field_type) use schema introspection, so they
  record a short note with `language: "introspection"`.

The batched count metrics still execute as a single aggregation; each check's
recorded SQL is the representative single-metric query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(databricks): support OAuth M2M, config profile, and U2M auth

The direct (non-Spark) Databricks test connection only accepted a personal
access token. Now that the engine connects via ibis.databricks.connect, the
full connector auth surface is available, so resolve the auth method from env
vars in priority order:

1. DATACONTRACT_DATABRICKS_TOKEN — personal access token (unchanged default)
2. DATACONTRACT_DATABRICKS_CLIENT_ID + _CLIENT_SECRET — OAuth service principal
   (M2M), the usual choice for CI/CD
3. DATACONTRACT_DATABRICKS_PROFILE — a local config profile via the Databricks
   SDK unified auth (parity with the Unity Catalog importer; also Azure CLI/MSI)
4. DATACONTRACT_DATABRICKS_AUTH_TYPE — explicit connector auth_type, e.g.
   databricks-oauth for the interactive U2M browser flow

The OAuth credential providers build their SDK Config lazily so token exchange
happens at connect time, not while reading env.
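
The priority order above can be sketched as a small resolver. This is an illustration only: the function name `resolve_databricks_auth`, the returned labels, and returning `None` when nothing is configured are assumptions, and the real code goes on to build connector arguments rather than a label.

```python
PREFIX = "DATACONTRACT_DATABRICKS_"


def resolve_databricks_auth(env):
    """Pick an auth method from env vars in the documented priority order."""

    def get(name):
        return env.get(PREFIX + name)

    if get("TOKEN"):
        return "token"  # personal access token, the unchanged default
    if get("CLIENT_ID") and get("CLIENT_SECRET"):
        return "oauth-m2m"  # OAuth service principal
    if get("PROFILE"):
        return "profile"  # local config profile via SDK unified auth
    if get("AUTH_TYPE"):
        return get("AUTH_TYPE")  # explicit connector auth_type, e.g. databricks-oauth
    return None  # assumption: the caller decides how to report missing credentials
```

Because the token check comes first, setting a token alongside a client ID and secret still selects the token, which keeps existing setups unchanged.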

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sqlserver): restore full authentication set dropped in the ibis migration

The soda->ibis migration reduced the SQL Server connection to username/password
+ driver, silently dropping the documented auth and SSL env vars. With ODBC
Driver 18 (encrypt + verify by default) this broke connections to servers with
a self-signed certificate, including the test container (test_test_sqlserver).

Restore the documented behavior in _sqlserver_connection_kwargs, selected by
DATACONTRACT_SQLSERVER_AUTHENTICATION:

- sql (default): USERNAME/PASSWORD
- windows: Trusted_Connection (Kerberos/NTLM)
- ActiveDirectoryPassword: Entra ID USERNAME/PASSWORD
- ActiveDirectoryServicePrincipal: Entra ID CLIENT_ID/CLIENT_SECRET
- ActiveDirectoryInteractive: Entra ID browser login
- cli: az login session via ActiveDirectoryDefault (ODBC Driver 18.1+)

Also restored: the legacy TRUSTED_CONNECTION switch (equivalent to windows;
takes precedence), ENCRYPTED_CONNECTION (Encrypt=yes/no, default yes), and
TRUST_SERVER_CERTIFICATE (TrustServerCertificate=yes). Auth modes that pass no
credentials set Trusted_Connection explicitly so that ibis's integrated-auth
default does not leak into Entra ID / cli connections.
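
A simplified sketch of the selection logic, not the actual _sqlserver_connection_kwargs. Assumptions made for illustration: the full env var names behind the `DATACONTRACT_SQLSERVER_` prefix, the set of strings treated as true, the `user`/`password` keys, and `Trusted_Connection=no` as the explicit value for credential-less modes.

```python
PREFIX = "DATACONTRACT_SQLSERVER_"


def _flag(env, name, default):
    """Read a boolean switch; the accepted spellings are an assumption."""
    raw = env.get(PREFIX + name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}


def sqlserver_connection_options(env):
    """Map the documented env vars to connection options (hypothetical helper)."""
    auth = env.get(PREFIX + "AUTHENTICATION", "sql")
    if _flag(env, "TRUSTED_CONNECTION", False):
        auth = "windows"  # the legacy switch takes precedence

    options = {"Encrypt": "yes" if _flag(env, "ENCRYPTED_CONNECTION", True) else "no"}
    if _flag(env, "TRUST_SERVER_CERTIFICATE", False):
        options["TrustServerCertificate"] = "yes"

    if auth == "windows":
        options["Trusted_Connection"] = "yes"
    elif auth in ("sql", "ActiveDirectoryPassword"):
        options["user"] = env.get(PREFIX + "USERNAME")
        options["password"] = env.get(PREFIX + "PASSWORD")
    elif auth == "ActiveDirectoryServicePrincipal":
        options["user"] = env.get(PREFIX + "CLIENT_ID")
        options["password"] = env.get(PREFIX + "CLIENT_SECRET")
    elif auth in ("ActiveDirectoryInteractive", "cli"):
        # No credentials are passed, so switch integrated auth off explicitly.
        options["Trusted_Connection"] = "no"
    else:
        raise ValueError(f"unsupported authentication mode: {auth}")

    if auth not in ("sql", "windows"):
        options["Authentication"] = "ActiveDirectoryDefault" if auth == "cli" else auth
    return options
```

For the self-signed test container case, setting `TRUST_SERVER_CERTIFICATE` adds `TrustServerCertificate=yes` while leaving encryption on.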

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(changelog): note the new Databricks authentication methods

SQL Server auth gets no entry: it restores parity with the last release
(0.12.5), so there is no user-visible change relative to a shipped version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: support Python 3.13

Raise requires-python to >=3.10,<3.14. The core and all non-Spark extras work
on 3.13; the Spark extras resolve to PySpark 4.0 there (Spark 3.5 has no 3.13
build), and the connector jars already adapt to the runtime Spark/Scala version.

create_spark_session now sets PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON to
sys.executable so Spark's Python workers use the same interpreter as the driver
(otherwise PySpark fails with PYTHON_VERSION_MISMATCH when PATH's python3
differs, which is common on 3.13).
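
The interpreter pinning amounts to roughly the following; the helper name `pin_pyspark_python` is hypothetical, and in the project the assignment lives inside create_spark_session.

```python
import os
import sys


def pin_pyspark_python(environ=None):
    """Point Spark's Python workers and driver at the running interpreter."""
    environ = os.environ if environ is None else environ
    environ["PYSPARK_PYTHON"] = sys.executable
    environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return environ
```

This has to run before the Spark session is created, since the variables are read when the workers are launched.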

Full suite passes on Python 3.13.12 / PySpark 4.0.2 and on 3.11 / PySpark 3.5.8
(769 passed, 14 skipped).

* docs(changelog): surface breaking changes in a section above Added

* ci: add Python 3.13 to test matrix, update changelog with breaking changes

* feat: support Python 3.14

Raise requires-python to >=3.10,<3.15 and add 3.14 to the CI test matrix.
The full dependency graph resolves and installs with native 3.14 wheels
within the existing version caps (no cap changes needed); notably duckdb
1.5.x, ibis-framework 12, pyspark 4.0, pydantic-core, pyarrow, numpy, and
cryptography all ship 3.14 wheels.

No code changes required: on 3.14 the Spark-backed extras resolve to
PySpark 4.0 (Spark 3.5 has no 3.14 build), same as 3.13, and the
PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON pinning added for 3.13 already keeps
Spark's Python workers matched to the driver.

Full suite passes on Python 3.14.3 / PySpark 4.0.2 (768 passed, 15 skipped).

* feat(test): structured diagnostics, percent thresholds, severity, and failed-row samples

Extend the ibis quality engine with four ODCS-aware capabilities:

- diagnostics: each check records structured diagnostics (metric, measured
  value, threshold, row count, failed fraction, and the enforced validity
  rule) on Check.diagnostics, surfaced in JSON and JUnit output. Removes the
  unused, never-populated Check.details field.

- percent thresholds: honor ODCS quality.unit: percent for the
  count-of-bad-rows metrics (nullValues, missingValues, invalidValues),
  comparing the failed fraction (0-100) of the row count against the
  threshold. Percent on metrics with no row-fraction meaning (rowCount,
  duplicateValues) warns and falls back to an absolute count.

- severity: honor ODCS quality.severity; a non-blocking severity (info,
  warning, low, minor, trivial) downgrades a failing quality check to a
  warning so it no longer fails the run. Any other severity still fails.

- failed-row samples: new `datacontract test --include-failed-samples`
  collects a capped (5-row) sample of offending rows for missing/invalid/
  duplicate checks, restricted to identifier (unique/primaryKey) plus the
  offending column, omitting columns whose ODCS classification marks them
  sensitive. Stored on Check.failed_samples and surfaced in JSON and the
  JUnit failure text. Local-only; needs no Soda Cloud.
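
The percent-threshold and severity rules above can be sketched as follows. This is a simplification, not the engine's code: it handles only a "must be less than" threshold, the function names are hypothetical, and treating an empty table as 0 percent failed is an assumption.

```python
# Severities that downgrade a failing quality check to a warning.
NON_BLOCKING = {"info", "warning", "low", "minor", "trivial"}


def failed_percent(failed_rows, row_count):
    """Failed fraction of the row count on a 0-100 scale."""
    if row_count == 0:
        return 0.0  # assumption: no rows means nothing failed
    return 100.0 * failed_rows / row_count


def evaluate(failed_rows, row_count, must_be_less_than, unit=None, severity=None):
    """Return 'passed', 'warning', or 'failed' for a count-of-bad-rows metric."""
    if unit == "percent":
        measured = failed_percent(failed_rows, row_count)
    else:
        measured = failed_rows
    if measured < must_be_less_than:
        return "passed"
    if severity is not None and severity.lower() in NON_BLOCKING:
        return "warning"
    return "failed"
```

With 5 bad rows out of 200 and a threshold of 5, the check passes when the unit is percent (2.5 is below 5) and fails as an absolute count (5 is not below 5).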

Add in-process duckdb tests for diagnostics, percent/severity, and failed
samples.

* ci: add Java setup for Spark-based tests, update README

* chore(deps): remove aiobotocore dependency and update README with Java installation instructions

* chore(release): 1.0.0

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dmaresma dmaresma merged commit 6f858a2 into release/mckesson_datahub202606 Jun 4, 2026