Feat/snowflake importer#16
Merged
dmaresma merged 112 commits intoJun 4, 2026
dmaresma (Owner) commented on Jun 4, 2026
- Tests pass
- ruff format
- README.md updated (if relevant)
- CHANGELOG.md entry added
…attributes add sort by schema.name and properties by ordinal_position
…/datacontract-cli into feat/snowflake_importer
…, defaultValue, Roles in server
…/datacontract-cli into feat/snowflake_importer_bhn
added tests and key authentication
Split the image into builder and runtime stages so uv and build-time artifacts no longer ship in the final image. Add a .dockerignore so the build context only contains the files actually needed.
…atacontract#1240) SparkExporter.export() was annotated as returning dict[str, StructType] but actually returns a str, consistent with all other exporters and with DataContract.export()'s own str | bytes signature. Note: to_spark_dict() exists as a separate public utility for Python users who need live StructType objects. It is tested directly and intentionally bypasses export(). Whether this pattern should be documented is an open question for maintainers. Co-authored-by: = <=>
…cks (datacontract#1219, datacontract#1245) (datacontract#1253) * fix: skip schema type check for varchar(n) and map columns on Databricks (issues datacontract#1219, datacontract#1245) * trim tests --------- Co-authored-by: Alexandra Studer <alexandra.studer@mobi.ch> Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com>
…inline-references (datacontract#1261) * feat: resolve authoritativeDefinitions[type=definition] and add --no-inline-references Re-introduces the definition inlining that became a no-op in 0.11.x when the internal model moved from DCS to ODCS. For each schema property, inline_definition_into_property now resolves its first authoritativeDefinitions[type=definition] reference, parses the response as an ODCS SchemaProperty (the entropy-data endpoint already serves this shape under application/vnd.entropydata.odcs+json), and fills in the fields the property leaves unset. Inline values always win (model_fields_set); authoritativeDefinitions, name, properties and items are never overwritten. Resolution rules: - Relative `url:` is joined onto ENTROPY_DATA_HOST. Absolute is used as-is. - The x-api-key header is sent only when the resolved URL's host matches ENTROPY_DATA_HOST -- a third-party `url:` cannot receive the user's key. - Any failure (host mismatch, HTTP error, network error, malformed body) raises a DataContractException. No silent skip. - Per-process success-only cache so a definition shared by many properties is fetched once; failures are not cached so a transient blip doesn't poison later runs. Exposes the new behaviour via --no-inline-references / --inline-references on lint, test, ci, changelog, and all export subcommands. Doubles as an escape hatch when a definition is briefly unreachable or invalid: the contract was already validated against the ODCS schema before resolution, so skipping the fetch still produces a syntactically valid run from the contract as written. Naming chosen generic ("references", not "definitions") so future authoritativeDefinitions[type=...] handlers can plug into the same flag and field. Existing DataContract(inline_definitions=...) parameter renamed to inline_references. 
Fixtures using type:"definition" against demo URLs were switched to type:"businessDefinition" so the existing tests don't trigger HTTP -- shipments-odcs.xlsx regenerated to match. Covered by 17 new tests in tests/test_resolve_definitions.py: URL shapes, api-key scoping, inline-wins precedence, identity-field protection, nested properties, array items, caching, every failure mode, and a CLI smoke test proving --no-inline-references makes zero HTTP calls.
* style: apply ruff format to tests/test_lint.py
CI's ruff format --check failed. After renaming inline_definitions -> inline_references, the DataContract(...) call in test_lint_with_ref fit on one line, but it had been left wrapped. Reformat.
* docs: compact function docs in the resolver
Trim the comments and docstrings I added: drop what is already conveyed by function names and types, keep only the non-obvious why (non-mergeable-field rationale, success-only cache, host-scoped API key, model_fields_set semantics).
- _NON_MERGEABLE_FIELDS gains `id`: a property's `id` identifies the
property itself, like `name`, and should not be overwritten by the
definition's id (jschoedl).
- Escape `\[type=definition]` in the inline-references help text on
`lint`, `test`, `ci`, `export`, and `changelog`. Typer/rich was
parsing the bare bracket as markup, so the type-name didn't render
in `--help` (jschoedl).
- CHANGELOG entry split into three bullets so the breaking nature of
"any resolution failure rejects the contract" is its own line
instead of buried at the end of a paragraph (jschoedl).
- README regenerated via `python3 update_help.py` so the help section
matches the new strings.
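The inline-wins merge described in these commits can be sketched with plain dictionaries. This is only an illustration, not the project's code: the real resolver works on ODCS SchemaProperty models and relies on `model_fields_set`, whereas here `merge_definition`, `NON_MERGEABLE_FIELDS` and the dict shapes are hypothetical stand-ins, and "key present in the dict" stands in for "field explicitly set".

```python
# Fields that identify or structure the property itself. Per the commit
# messages they are never taken from the fetched definition.
NON_MERGEABLE_FIELDS = {"authoritativeDefinitions", "name", "id", "properties", "items"}


def merge_definition(inline: dict, definition: dict) -> dict:
    """Return a copy of `inline` with unset fields filled in from `definition`."""
    merged = dict(inline)
    for field, value in definition.items():
        if field in NON_MERGEABLE_FIELDS:
            continue  # identity/structure fields are never overwritten
        if field in inline:
            continue  # inline values always win
        merged[field] = value
    return merged


# Hypothetical property and definition payloads, for illustration only.
prop = {"name": "order_id", "description": "Inline description"}
definition = {
    "name": "OrderIdentifier",
    "description": "From definition",
    "logicalType": "string",
}
result = merge_definition(prop, definition)
```

Here `result` keeps the inline `name` and `description` and only gains `logicalType` from the definition.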
Updates the requirements on [uvicorn](https://github.com/Kludex/uvicorn) to permit the latest version. - [Release notes](https://github.com/Kludex/uvicorn/releases) - [Changelog](https://github.com/Kludex/uvicorn/blob/main/docs/release-notes.md) - [Commits](Kludex/uvicorn@0.44.0...0.48.0) --- updated-dependencies: - dependency-name: uvicorn dependency-version: 0.48.0 dependency-type: direct:development ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Updates the requirements on [databricks-sdk](https://github.com/databricks/databricks-sdk-py) to permit the latest version. - [Release notes](https://github.com/databricks/databricks-sdk-py/releases) - [Changelog](https://github.com/databricks/databricks-sdk-py/blob/main/CHANGELOG.md) - [Commits](databricks/databricks-sdk-py@v0.0.1...v0.111.0) --- updated-dependencies: - dependency-name: databricks-sdk dependency-version: 0.111.0 dependency-type: direct:development ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.15.13 to 0.15.14. - [Release notes](https://github.com/astral-sh/ruff/releases) - [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md) - [Commits](astral-sh/ruff@0.15.13...0.15.14) --- updated-dependencies: - dependency-name: ruff dependency-version: 0.15.14 dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…RI) (datacontract#1262)
Broadens the resolver from "definition" only to a precedence-ordered list ("semantics", "semantic", "definition") and adds an IRI lookup path so semantic URLs that aren't directly fetchable (because they encode the namespace as a stable identifier on a non-entropy-data host) still resolve.
Routing:
- URL on the configured entropy-data host (relative or matching absolute): fetched directly, x-api-key sent when configured.
- Off-host `definition` URL: fetched directly and anonymously -- a contract may legitimately reference a third-party REST URL, and the API key must never leak across hosts (existing behaviour, unchanged).
- Off-host `semantics`/`semantic` URL: treated as an IRI and routed through `GET /api/semantics?iri=...` on the configured entropy-data host (a new endpoint added in the entropy-data side of this change). Requires an API key because that endpoint is API-key only.
Precedence: when a property has both a semantics and a definition reference, the semantic wins -- matches useInheritedDefinition in datacontract-editor. The legacy singular `"semantic"` type is also accepted for contracts written before the entropy-data type migration.
Five new tests in test_resolve_definitions.py (17 -> 22): REST URL with api-key, IRI routed through /api/semantics, IRI without api-key raises with clear guidance, legacy "semantic" type resolves, semantics-wins-over-definition.
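The routing and precedence rules above can be sketched as two small functions. This is a hedged illustration under stated assumptions: `pick_reference`, `plan_request`, the example hosts and the use of `ValueError` are hypothetical (the commit says failures raise a DataContractException), and no HTTP call is made; the sketch only decides which URL and headers a request would use.

```python
from typing import Optional
from urllib.parse import urlencode, urljoin, urlsplit

# Precedence order named in the commit message.
TYPE_PRECEDENCE = ("semantics", "semantic", "definition")


def pick_reference(authoritative_definitions: list) -> Optional[dict]:
    """Pick the first reference by type precedence (semantics beats definition)."""
    for wanted in TYPE_PRECEDENCE:
        for ref in authoritative_definitions:
            if ref.get("type") == wanted:
                return ref
    return None


def plan_request(ref: dict, host: str, api_key: Optional[str]) -> tuple:
    """Return (url, headers) for one reference without performing the request."""
    base = host.rstrip("/") + "/"
    url = urljoin(base, ref["url"])  # relative URLs are joined onto the host
    if urlsplit(url).netloc == urlsplit(host).netloc:
        # On-host: fetch directly, send the key only when one is configured.
        return url, ({"x-api-key": api_key} if api_key else {})
    if ref["type"] == "definition":
        # Off-host definition: fetch anonymously so the key never leaves the host.
        return url, {}
    # Off-host semantics: treat the URL as an IRI and look it up on the host.
    if not api_key:
        raise ValueError("IRI lookup requires an API key")  # stand-in exception
    lookup = urljoin(base, "api/semantics") + "?" + urlencode({"iri": url})
    return lookup, {"x-api-key": api_key}
```

Keeping the decision separate from the fetch makes the host-scoping rule easy to test without network access.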
…ract#1249)
* test: add failing tests for versioned dbt model filter
Add a versioned dbt manifest fixture (manifest_versioned_models.json) with two versions of 'mart_orders' (v1, v2) and one unversioned model to cover the following cases:
- Plain name (mart_orders) imports all versions: already works
- mart_orders.v1 imports only v1: FAILS (bug)
- mart_orders.v2 imports only v2: FAILS (bug)
- CLI --model mart_orders.v1 produces non-empty output: FAILS (bug)
Relates to: datacontract import dbt --model <name>.vN silent empty output
* fix: support versioned dbt model filter (name.vN)
The node filter in import_dbt_manifest compared the raw user-supplied string against node["name"], which is always the base model name in manifest.json. Supplying "mart_orders.v1" therefore never matched "mart_orders" and produced a silent empty contract. Add _matches_dbt_node_filter() which understands dbt's name.vN convention:
- "mart_orders" → matches all versions of mart_orders (unchanged behaviour)
- "mart_orders.v1" → matches only the node whose name=="mart_orders" and version==1
Replace the inline not-in check with a call to the new helper.
* fix: handle dotted plain model names in dbt node filter
Using lstrip('v') to extract the version number was fragile: any dotted plain name (e.g. 'schema.orders') would be misread as a versioned filter with base='schema' and version='orders', causing the node to be silently skipped. Replace the lstrip approach with a _VERSION_SUFFIX_RE regex (^v?(\d+)$) that only treats the suffix as a version when it actually looks like one. If the suffix does not match, execution falls through to the plain-name comparison so 'schema.orders' still resolves correctly.
Add unit tests for _matches_dbt_node_filter covering:
- plain name match
- versioned match with and without 'v' prefix
- wrong version does not match
- plain name matches all versions
- dotted plain name is not misread as versioned
- empty filter list
* docs: update README and CHANGELOG for versioned dbt model filter fix
* trim tests and README
---------
Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com>
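A minimal sketch of the name.vN matching described above, assuming manifest nodes are dicts with a `name` and an optional `version`. The function name mirrors the helper named in the commit, but the body is a reconstruction from the description rather than the merged code; in particular, returning False for an empty filter list is an assumption of this sketch.

```python
import re

# Only a suffix that looks like "v1" or "1" counts as a version.
_VERSION_SUFFIX_RE = re.compile(r"^v?(\d+)$")


def matches_dbt_node_filter(node: dict, filters: list) -> bool:
    """Return True if the manifest node matches any user-supplied filter."""
    name = node.get("name")
    version = node.get("version")
    for item in filters:
        if item == name:
            return True  # plain name matches every version of the model
        base, sep, suffix = item.rpartition(".")
        if not sep:
            continue
        match = _VERSION_SUFFIX_RE.match(suffix)
        if match is None:
            continue  # dotted plain name such as "schema.orders", not a version
        if base == name and version is not None and str(version) == match.group(1):
            return True
    return False
```

Because the plain-name comparison runs first, a model literally named `schema.orders` still matches the filter `schema.orders`, while `mart_orders.v1` only matches the node with version 1.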
…tacontract#1263) The `[api]` optional extra hard-pinned `fastapi==0.136.1`. For a published library this forces downstream apps that depend on FastAPI to that exact version, frequently causing pip resolution conflicts. Switch to `fastapi>=0.115.0,<0.137.0`: - Wide floor lets downstream projects co-resolve. - Ceiling caps at the next minor because FastAPI ships breaking changes in 0.x MINOR releases (per FastAPI's versioning policy); Dependabot keeps advancing the ceiling with CI validating each minor. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
datacontract#1264) * fix: surface WARNING/ERROR logs by default instead of suppressing them Every command called enable_debug_logging(debug), which added a logging.NullHandler to the root logger when --debug was off. That disabled Python's lastResort handler, silencing all logging.warning/ logging.error output across the codebase unless --debug was passed. Default the root level to WARNING (on stderr) so diagnostics like the `import sql` placeholder-connection warning are visible. test/ci/lint render run.logs to the console themselves and would otherwise print the mirrored Run.log_* records twice, so they pass otherwise_disable_stderr to keep the old NullHandler behavior. Noisy third-party loggers are pinned to ERROR to preserve the original intent of muting library chatter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * shorten Changelog entry --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
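The logging change can be illustrated with the standard library alone. In the sketch below, `configure_logging` and its `quiet` and `stream` parameters are hypothetical names, not the project's `enable_debug_logging` signature. It shows the two behaviours the commit contrasts: a root `NullHandler` swallows warnings (because the root logger then has a handler, Python's `lastResort` handler is never consulted), while a stream handler at WARNING level surfaces them.

```python
import logging
import sys


def configure_logging(debug: bool = False, quiet: bool = False, stream=None) -> None:
    """Configure the root logger; `quiet` reproduces the old NullHandler behaviour."""
    root = logging.getLogger()
    for handler in list(root.handlers):
        root.removeHandler(handler)
    if quiet and not debug:
        # Any handler on the root logger, even a NullHandler, prevents the
        # lastResort handler from printing WARNING/ERROR records.
        root.addHandler(logging.NullHandler())
        return
    handler = logging.StreamHandler(stream or sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    root.addHandler(handler)
    root.setLevel(logging.DEBUG if debug else logging.WARNING)
```

Commands that render their own run logs would call this with `quiet=True`, matching the "keep the old NullHandler behavior" case in the commit message.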
…tacontract#1260) * fix(exporters): correct return type annotations on export() methods All exporters returning str were incorrectly annotated as -> dict. Fixed 15 exporters to -> str (or -> str | None for MermaidExporter) and added the missing -> str annotation to IcebergExporter. This was identified during review of PR datacontract#1240 * style: Fix formatting --------- Co-authored-by: = <=>
…ntract#1266) * feat: accept extra top-level fields when --json-schema is set When a custom JSON schema is supplied via `--json-schema`, treat it as the source of truth and parse the contract with an `extra='allow'` ODCS subclass so the Pydantic step no longer re-rejects extras the schema already accepted. Lets teams extend ODCS with their own root-level fields (e.g. `change_management`) and still use `lint`, `test`, `export`, etc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Update CHANGELOG --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…contract#1265) * fix: bundle DuckDB cloud extensions to support air-gapped tests (datacontract#1191) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: harden duckdb extension loading for read-only and incomplete-extra envs Extract bundled extensions to a tempfile rather than writing into the wheel directory (which fails on system installs and read-only Docker layers). Pull [duckdb] transitively in [gcs] and [azure] extras, and emit a server-type-aware hint when soda-core is missing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: skip duckdb extension wheels on linux/aarch64 (no 1.0.x wheels available) Without this, [s3]/[gcs]/[azure] don't install on linux/arm64 (Graviton, ARM K8s nodes, Apple-Silicon Linux containers). ARM Linux falls back to DuckDB's online auto-install — same as before the bundling work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: tighten soda-missing error message and trim verbose comments Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Update pyproject.toml --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* failing test * improve error navigation by solving schema name and property name in check message * improve schema parsing * improve the python code * refactor move nested try catch into the function * improve error message --------- Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com>
Add a pull_request_target/synchronize-triggered job that removes the 'stale' label as soon as commits are pushed, instead of waiting for the weekly scheduled run. Guard the existing scan job to non-PR events. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ct#1163) * fix: detect missing fields in CSV/Parquet files correctly (datacontract#1065) field_is_present check always passed for CSV/Parquet because create_view_with_schema_union creates a table with ALL contract columns (missing ones filled with NULLs), making SodaCL schema check see the column as present. Fix: Create a _raw view that exposes actual data columns (without contract-only columns), and use it for field_is_present checks on CSV/Parquet data. - duckdb_connection.py: Create {model}_raw view alongside unioned table - data_contract_checks.py: Use _raw view for field_is_present on csv/parquet - tests: Update schema evolution tests to expect correct behavior * fix: extend missing-field detection to JSON and harden raw view (datacontract#1065) Apply review follow-ups to the field_is_present fix: also create a {model}__raw__ view for JSON (read_json_auto without the columns= projection), create it unconditionally so the fallback branch is covered, rename the sentinel from _raw to __raw__ to avoid colliding with user models, and let check_property_is_present derive the target view from the server. Adds JSON schema-evolution regression tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: infer test output format from file extension * fix: add accepted values to ci output-format help; tidy infer error message Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * add CHANGELOG --------- Co-authored-by: dallylee <dallylee@users.noreply.github.com> Co-authored-by: Jakob Schödl <jakob.schoedl@entropy-data.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: use Docker Hardened Image for container builds Switch the Dockerfile from `python:3.11-slim-trixie` to `dhi.io/python:3.11-debian13-sfw-dev` for both builder and runtime stages. The DHI image is signed, ships SBOM/VEX metadata, has tighter CVE patch SLAs than upstream Debian, and routes `pip` / `uv` installs through Socket Firewall Free to block malicious dependencies at build time. Build pulls require `docker login dhi.io` with a Docker Hub account that has DHI access enabled (the free Socket Firewall tier is enough). Closes datacontract#1275 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: add dhi.io login step to docker build jobs The Dockerfile now bases off `dhi.io/python:3.11-debian13-sfw-dev`, which requires authentication on pull. Add a `docker/login-action` step targeting `dhi.io` to both the CI snapshot job and the release publish job. Reuses the existing DOCKERHUB_USERNAME / DOCKERHUB_TOKEN secrets — DHI uses Docker Hub auth, so the same token works provided the Docker Hub account has DHI access (free Socket Firewall tier is sufficient). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: switch runtime to minimal DHI image (nonroot, no shell) Move the runtime stage from `dhi.io/python:3.11-debian13-sfw-dev` to the minimal `dhi.io/python:3.11-debian13` image. The minimal runtime is also DHI Free, runs as nonroot (uid 65532), ships no shell, no apt, and only the libraries Python itself needs — strictly smaller attack surface than carrying the dev variant + Socket Firewall at runtime, where neither is used. Changes to make the minimal runtime work: - Install `protobuf-compiler` in the builder stage (it can't be apt-installed in the runtime anymore) and copy `protoc` plus its shared libs (`libprotoc.so.32`, `libprotobuf.so.32`) into `/opt/protoc` in the runtime. Add `/opt/protoc/bin` to PATH and `/opt/protoc/lib` to LD_LIBRARY_PATH so `datacontract import protobuf` continues to work. 
The library glob handles both linux/amd64 and linux/arm64. - `COPY --chown=65532:65532` the venv so the nonroot user owns it. - Drop the `RUN mkdir /home/datacontract` (no coreutils in the minimal runtime); `WORKDIR` creates the directory and `USER 65532:65532` explicitly pins runtime identity. Image size: 1.88 GB (was 2.12 GB on sfw-dev, was 2.01 GB on the previous python:3.11-slim-trixie baseline). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: changelog entry for Docker Hardened Images switch Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: use DHI's nonroot user by name instead of uid Cosmetic; same uid 65532. Easier to grep and read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: switch runtime to DHI dev variant + bundle Java 17 Move the runtime stage from the minimal DHI image (`dhi.io/python:3.11-debian13`) to the dev variant (`dhi.io/python:3.11-debian13-dev`) and bundle Eclipse Temurin JRE 17 via multi-stage COPY from `dhi.io/eclipse-temurin:17-debian13`. Why dev instead of minimal: - The minimal runtime has no shell, so PySpark's `spark-submit` (a bash script) can't even resolve its shebang. Without bash, Kafka and Spark engines fail with a misleading "spark-submit not found" error even when Java is present. - The dev variant adds bash, apt, coreutils — same packages the previous slim-trixie image carried — without changing the hardening posture: still signed, still SBOM/VEX, still on DHI patch SLAs, still nonroot via the USER directive below. - Image size: 2.13 GB vs. 1.97 GB on slim-trixie (+8%), the delta is the bundled JRE. Why bundle Java: - PySpark-backed engines (`datacontract test` against a kafka or spark server, `datacontract import/export spark` with a live session) have never worked in the published image: the previous slim-trixie base shipped no JVM either. 
Adding the JRE here means those paths now actually run — `SparkSession` starts, queries reach the engine. Previously a user had to extend the image with a JVM themselves. - Eclipse Temurin 17 is DHI-published (signed, SBOM/VEX) and is in the version range PySpark 3.5.x supports. Other simplifications vs. the previous minimal-runtime version of this PR: - `apt-get install protobuf-compiler` moves back to the runtime stage (dev has apt); drops the manual `/opt/protoc` shared-lib copy. - `--chown=65532:65532` on the venv copy lets the nonroot user own it; `USER nonroot:nonroot` switches identity at the end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: install protoc in builder, copy artifacts into runtime Idiomatic multi-stage pattern: build tools live in the builder, only artifacts cross the stage boundary. Drops the `apt-get update + install + rm` from the runtime layer in favor of two `COPY --from=builder` lines for the protoc binary and its shared libs. Size is roughly unchanged; the runtime layer is more reproducible (no live apt fetch at image-build time on the runtime side). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: explain dev vs minimal runtime choice in Dockerfile comment Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: flag that DHI -dev runtime is intentional, not a mistake Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: drop stale reference to /opt/protoc copy from runtime comment Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: replace soda-core with ibis as the quality/test engine Remove soda-core entirely as the data quality execution engine and replace it with ibis (https://ibis-project.org/), which compiles one expression API to many SQL dialects via sqlglot and reads local/remote files through DuckDB. Motivation: soda-core v3 was an unmaintained, string-templated per-dialect SQL generator that forced a `setuptools`/`distutils.strtobool` shim, a `mysql-connector-python` override, and brittle version pinning across ~13 `soda-core-*` extras. What changed: - New engine-neutral check IR (`datacontract/engines/checks/`): `CheckSpec` + structured `Threshold`, and `create_checks` that enumerates an ODCS contract into specs, preserving every legacy check key/type/name. - New ibis engine (`datacontract/engines/ibis/`): batches row/missing/invalid counts into one aggregation per model; runs dedicated queries for duplicates, schema/type, freshness/retention and user SQL; evaluates thresholds in Python. Reproduces soda's invalid_count semantics (NOT missing AND (NOT valid OR in invalid_values)). Counts use `CASE WHEN ... THEN 1 ELSE 0 END` for dialect portability (e.g. Oracle). - Per-source ibis connection builders reusing the existing DuckDB view builder (files) and Spark/Kafka helpers; Spark sources run via the ibis pyspark backend. - SodaCL kept but isolated: all SodaCL generation moved into `datacontract/export/sodacl_check_builder.py`, used only by `SodaExporter`. `export sodacl` is unchanged and no longer shares code with the test path. - Removed `engines/soda/`, the old `data_contract_checks.py`, the soda config-builder tests, the setuptools shim, and all `soda-core-*` deps; pyproject extras now map to `ibis-framework[<backend>]`. - Raw SodaCL custom checks (quality.engine: soda) now surface a migration warning instead of executing. 
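The batched count metrics described in the list above can be sketched with the standard-library `sqlite3` module standing in for an ibis backend. The function name, table and column are hypothetical, and the sketch assumes non-empty value lists and trusted identifiers; it only illustrates the single-aggregation shape and the invalid_count rule (NOT missing AND (NOT valid OR in invalid_values)) using portable `CASE WHEN ... THEN 1 ELSE 0 END` counts.

```python
import sqlite3


def count_metrics(conn, table, column, valid_values, invalid_values):
    """Compute row, missing and invalid counts in one aggregation query."""
    valid_marks = ", ".join("?" for _ in valid_values)
    invalid_marks = ", ".join("?" for _ in invalid_values)
    # Identifiers are interpolated for brevity; values go through placeholders.
    sql = f"""
        SELECT
            COUNT(*) AS row_count,
            SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) AS missing_count,
            SUM(CASE WHEN "{column}" IS NOT NULL
                      AND ("{column}" NOT IN ({valid_marks})
                           OR "{column}" IN ({invalid_marks}))
                     THEN 1 ELSE 0 END) AS invalid_count
        FROM "{table}"
    """
    row = conn.execute(sql, [*valid_values, *invalid_values]).fetchone()
    return dict(zip(("row_count", "missing_count", "invalid_count"), row))


# Hypothetical sample data: one NULL, one unknown value, one explicitly invalid value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [("open",), ("closed",), (None,), ("bogus",), ("legacy",)],
)
metrics = count_metrics(conn, "orders", "status", ["open", "closed", "legacy"], ["legacy"])
```

Thresholds would then be evaluated in Python against `metrics`, which is the split the commit describes.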
Verified end-to-end against testcontainers/local data for DuckDB (parquet/csv/ json/s3), Postgres (full quality fixture), Trino, and Oracle; full non-DB suite passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(ibis): don't tear down caller-owned connections; migrate kafka fixture - execute_ibis_checks only disconnects connections the engine created. The pyspark backend wraps a caller-owned SparkSession, and an externally supplied DuckDB connection is owned by the caller; disconnecting either broke subsequent runs (e.g. the session-scoped Spark fixture shared by the two dataframe tests). Skip disposal for the pyspark backend and for caller-provided duckdb/spark resources. - Migrate tests/fixtures/kafka to the native rowCount quality metric, replacing the removed raw SodaCL custom check, so the kafka/Spark path is exercised end-to-end. Verified with Java 21: test_test_dataframe (x2), test_import_spark (x3) and test_test_kafka all pass via the ibis pyspark backend. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump version to 0.13.0 and document soda-to-ibis migration Minor bump (0.x SemVer) for the breaking quality-engine replacement. Document the soda-core -> ibis migration, the dropped raw-SodaCL execution, and the soda-core dependency removal in the CHANGELOG Unreleased section. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(mysql): route MySQL through DuckDB extension; pin duckdb to 1.0.x - ibis's native MySQL backend requires `mysqlclient`, a C extension with no macOS/Linux wheels that fails to build without pkg-config + MySQL client libraries (broke `pip install -e .[dev]`). Connect to MySQL via DuckDB's `mysql` extension instead: ATTACH the database and materialize each contract model into a local DuckDB table, then run checks locally. Keeps the `mysql` extra pure-pip. Materializing avoids DuckDB MySQL-scanner pushdown errors (e.g. 
the grouped duplicate-count query hit a DuckDB binder assertion). - Pin `duckdb` to `<1.1.0` to match the bundled `duckdb-extension-*` wheels (httpfs/aws/azure, pinned `<1.1.0`). Without a lockfile, fresh installs resolved duckdb 1.5.3, which mismatched those wheels (S3 "Secret Validation Failure") and changed CSV/JSON/secret behavior and mysql-extension port handling — breaking s3, csv-import, nested-json, and mysql tests. Full suite (Java 21, duckdb 1.0.0): 744 passed, 14 skipped. Remaining 5 failures are environmental: 4 protobuf (no `protoc`), 1 kafka (pre-existing Spark-session conflict with the dataframe test in non-xdist runs; passes in isolation, skipped under xdist in CI). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(deps): move mysql-connector-python to the dev extra The runtime MySQL path goes through DuckDB's mysql extension, so no Python MySQL driver is needed at install time. mysql-connector-python is only used by the MySQL test fixture to seed data, so it belongs in `dev`. The `mysql` extra is now just `datacontract-cli[duckdb]`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(duckdb): upgrade to 1.5.x (from 1.0.x) with lockstep extension wheels Loosen the duckdb pin off the 1.0.x line. duckdb and the bundled duckdb-extension-* wheels (httpfs/aws/azure) are bumped together to 1.5.x; the 1.5.x extension wheels ship arm64 Linux builds, so the platform skip markers are dropped and air-gapped installs on arm64 Linux now work. Fixes for DuckDB >=1.5 behavior changes: - S3 secret: explicit KEY_ID/SECRET now use the default `config` provider; `PROVIDER CREDENTIAL_CHAIN` with explicit credentials is rejected in 1.5.x ("Secret Validation Failure"). - csv import: the uniqueness probe uses `count(DISTINCT ...)` via SQL instead of the relational `.count('DISTINCT ...')` form, which 1.5.x's binder rejects. 
- test_duckdb_json: assert on the stable DuckDBPyType `.id` (number->bigint, dict->struct) instead of the old DBAPI type-code strings. Full suite (Java 21, duckdb 1.5.3): 744 passed, 14 skipped; remaining 5 failures are environmental (4 protobuf: no protoc; 1 kafka: Spark-session conflict in non-xdist runs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(spark): use tmp_dir.name for spark.sql.warehouse.dir Use the TemporaryDirectory path instead of its repr for the Spark warehouse dir in create_spark_session() and import_spark(). * fix(pyspark): cap to <4.0 to match the Spark 3.5 connector jars Without a lockfile, `pyspark>=3.5.0,<5.0.0` resolved to 4.0.x on fresh installs, but the Kafka/Avro paths load `spark-sql-kafka-0-10_2.12:3.5.5` / `spark-avro_2.12:3.5.5` (Scala 2.12, Spark 3.5) jars, which fail to load on a Spark 4.x (Scala 2.13) runtime — breaking `datacontract test` against Kafka. Cap pyspark to the 3.5.x line in the kafka and databricks extras. Full suite (Java 21, duckdb 1.5.3, pyspark 3.5.8): 745 passed, 14 skipped, 4 failed; the 4 failures are the protobuf importer tests, which require the `protoc` system binary (documented manual install). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(protobuf): import .proto with pure-Python parser (drop protoc) Replace the protoc-based importer with proto-schema-parser (pure Python, only depends on antlr4-python3-runtime). `import protobuf` no longer needs the `protoc` system binary or the protobuf runtime: imports are resolved transitively by reading `import` statements and parsing each file, and message/enum type references are linked across files by simple name (handling package-qualified and subdirectory imports). Output is preserved byte-for-byte vs the protoc-based importer (scalar physicalType is still the protobuf type number; repeated scalars stay scalar; only repeated messages become arrays). 
The `protobuf` extra now declares `proto-schema-parser` instead of `protobuf`. Dockerfile: drop the protobuf-compiler install and the protoc binary/lib copy into the runtime image — no longer needed. tests/test_import_protobuf.py (incl. nested-imports and subdirs): 4 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(snowflake): accept DATACONTRACT_SNOWFLAKE_USERNAME as alias for user The ibis Snowflake backend forwards connection kwargs to snowflake-connector-python, which expects `user` (not soda's `username`). Map the documented DATACONTRACT_SNOWFLAKE_USERNAME env var to `user` so it keeps working after the soda -> ibis migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: update README for the soda -> ibis migration - protobuf import: drop the protoc system-dependency instructions (now a pure-Python parser). - engine description: ibis + fastjsonschema instead of soda-core. - bigquery: ADC/WIF fallback no longer described via soda's use_context_auth. - snowflake: document env vars as snowflake-connector-python params (USERNAME kept as an alias for `user`). - redshift: username/password via the Postgres backend; note IAM auth is not currently supported. - impala: drop 'Soda' wording. The `export sodacl` format is unchanged and remains documented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(spark): derive Kafka/Avro connector jars from the runtime PySpark Replace the hardcoded spark-sql-kafka / spark-avro coordinates (_2.12:3.5.5) with spark_connector_packages(), which reads the Spark version and Scala binary version (2.12 vs 2.13) from the installed PySpark jars. This lets the kafka extra allow PySpark 4.x (Scala 2.13) without the connector JARs mismatching the runtime. Tests use the same helper. 
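A rough sketch of deriving connector coordinates from the runtime, assuming the Scala and Spark versions can be read from a `spark-core_<scala>-<spark>.jar` file name in PySpark's bundled jars. The function below is a hypothetical stand-in for `spark_connector_packages()`: it takes the jar file names as input instead of locating the PySpark installation, so the parsing can be shown in isolation.

```python
import re

# Matches names such as "spark-core_2.12-3.5.5.jar".
_SPARK_CORE_JAR_RE = re.compile(
    r"^spark-core_(?P<scala>\d+\.\d+)-(?P<spark>\d+(?:\.\d+)+)\.jar$"
)


def connector_packages(jar_names):
    """Build Kafka/Avro Maven coordinates matching the given Spark jars."""
    for jar in jar_names:
        match = _SPARK_CORE_JAR_RE.match(jar)
        if match:
            scala, spark = match["scala"], match["spark"]
            return [
                f"org.apache.spark:spark-sql-kafka-0-10_{scala}:{spark}",
                f"org.apache.spark:spark-avro_{scala}:{spark}",
            ]
    raise ValueError("could not determine Spark/Scala version from jar names")
```

With this shape, a PySpark 3.5 (Scala 2.12) runtime and a PySpark 4.x (Scala 2.13) runtime each get connector coordinates that match their own jars.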
* chore(databricks): lift the pyspark <4.0 cap to <5.0

The databricks path connects via a caller-provided Spark session (or the databricks SQL connector) and never calls create_spark_session(), so it loads no Kafka/Avro connector jars and isn't tied to a Scala/Spark line. Align its pyspark range with the kafka extra (<5.0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(changelog): reflect dynamic Spark connector jars (not a pyspark cap)

The kafka/databricks extras allow pyspark<5.0; the Kafka/Avro connector jars are derived from the runtime PySpark (Scala 2.12/2.13), so the earlier "capped to <4.0" note no longer describes the final state.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(ibis): record real compiled SQL in check implementation/language

Previously every ibis check carried a SodaCL-looking pseudo-string in `implementation` with `language: null`, a leftover from the soda model. Now each check records what it actually runs:

- count-style metrics (row_count, missing_count, invalid_count), duplicates, freshness/retention and custom-SQL checks store the backend-dialect SQL (compiled via ibis.to_sql) with `language: "sql"`.
- schema checks (field_is_present, field_type) use schema introspection, so they record a short note with `language: "introspection"`.

The batched count metrics still execute as a single aggregation; each check's recorded SQL is the representative single-metric query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(databricks): support OAuth M2M, config profile, and U2M auth

The direct (non-Spark) Databricks test connection only accepted a personal access token. Now that the engine connects via ibis.databricks.connect, the full connector auth surface is available, so resolve the auth method from env vars in priority order:

1. DATACONTRACT_DATABRICKS_TOKEN: personal access token (unchanged default)
2. DATACONTRACT_DATABRICKS_CLIENT_ID + _CLIENT_SECRET: OAuth service principal (M2M), the usual choice for CI/CD
3. DATACONTRACT_DATABRICKS_PROFILE: a local config profile via the Databricks SDK unified auth (parity with the Unity Catalog importer; also Azure CLI/MSI)
4. DATACONTRACT_DATABRICKS_AUTH_TYPE: explicit connector auth_type, e.g. databricks-oauth for the interactive U2M browser flow

The OAuth credential providers build their SDK Config lazily so token exchange happens at connect time, not while reading env.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(sqlserver): restore full authentication set dropped in the ibis migration

The soda->ibis migration reduced the SQL Server connection to username/password + driver, silently dropping the documented auth and SSL env vars. With ODBC Driver 18 (encrypt + verify by default) this broke connections to servers with a self-signed certificate, including the test container (test_test_sqlserver).

Restore the documented behavior in _sqlserver_connection_kwargs, selected by DATACONTRACT_SQLSERVER_AUTHENTICATION:

- sql (default): USERNAME/PASSWORD
- windows: Trusted_Connection (Kerberos/NTLM)
- ActiveDirectoryPassword: Entra ID USERNAME/PASSWORD
- ActiveDirectoryServicePrincipal: Entra ID CLIENT_ID/CLIENT_SECRET
- ActiveDirectoryInteractive: Entra ID browser login
- cli: az login session via ActiveDirectoryDefault (ODBC Driver 18.1+)

Plus the legacy TRUSTED_CONNECTION switch (== windows, takes precedence), ENCRYPTED_CONNECTION (Encrypt=yes/no, default yes), and TRUST_SERVER_CERTIFICATE (TrustServerCertificate=yes). The auth modes that pass no credentials explicitly set Trusted_Connection to avoid ibis's integrated-auth default leaking into Entra ID / cli connections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(changelog): note the new Databricks authentication methods

SQL Server auth gets no entry: it restores parity with the last release (0.12.5), so there is no user-visible change relative to a shipped version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: support Python 3.13

Raise requires-python to >=3.10,<3.14. The core and all non-Spark extras work on 3.13; the Spark extras resolve to PySpark 4.0 there (Spark 3.5 has no 3.13 build), and the connector jars already adapt to the runtime Spark/Scala version.

create_spark_session now sets PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON to sys.executable so Spark's Python workers use the same interpreter as the driver (otherwise PySpark fails with PYTHON_VERSION_MISMATCH when PATH's python3 differs, which is common on 3.13).

Full suite passes on Python 3.13.12 / PySpark 4.0.2 and on 3.11 / PySpark 3.5.8 (769 passed, 14 skipped).

* docs(changelog): surface breaking changes in a section above Added

* ci: add Python 3.13 to test matrix, update changelog with breaking changes

* ci: add Python 3.13 to test matrix, update changelog with breaking changes

* feat: support Python 3.14

Raise requires-python to >=3.10,<3.15 and add 3.14 to the CI test matrix. The full dependency graph resolves and installs with native 3.14 wheels within the existing version caps (no cap changes needed); notably duckdb 1.5.x, ibis-framework 12, pyspark 4.0, pydantic-core, pyarrow, numpy, and cryptography all ship 3.14 wheels.

No code changes required: on 3.14 the Spark-backed extras resolve to PySpark 4.0 (Spark 3.5 has no 3.14 build), same as 3.13, and the PYSPARK_PYTHON/PYSPARK_DRIVER_PYTHON pinning added for 3.13 already keeps Spark's Python workers matched to the driver.

Full suite passes on Python 3.14.3 / PySpark 4.0.2 (768 passed, 15 skipped).
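
The interpreter pinning mentioned in the Python 3.13 and 3.14 commits above can be sketched roughly as follows. `pin_pyspark_python` is a hypothetical name chosen for this example (the commit only says that `create_spark_session` sets the two variables), and the optional `env` parameter is added here purely so the behavior can be checked without touching the real process environment.

```python
import os
import sys


def pin_pyspark_python(env=None):
    """Hypothetical helper: point Spark's workers and driver at this interpreter.

    PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the environment variables
    PySpark reads to choose the worker and driver Python executables.
    """
    target = os.environ if env is None else env
    target["PYSPARK_PYTHON"] = sys.executable
    target["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return target
```

For the pinning to take effect it would have to run before the Spark session is created, since the variables are read when the workers are launched.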

* feat(test): structured diagnostics, percent thresholds, severity, and failed-row samples

Extend the ibis quality engine with four ODCS-aware capabilities:

- diagnostics: each check records structured diagnostics (metric, measured value, threshold, row count, failed fraction, and the enforced validity rule) on Check.diagnostics, surfaced in JSON and JUnit output. Removes the unused, never-populated Check.details field.
- percent thresholds: honor ODCS quality.unit: percent for the count-of-bad-rows metrics (nullValues, missingValues, invalidValues), comparing the failed fraction (0-100) of the row count against the threshold. Percent on metrics with no row-fraction meaning (rowCount, duplicateValues) warns and falls back to an absolute count.
- severity: honor ODCS quality.severity; a non-blocking severity (info, warning, low, minor, trivial) downgrades a failing quality check to a warning so it no longer fails the run. Any other severity still fails.
- failed-row samples: new `datacontract test --include-failed-samples` collects a capped (5-row) sample of offending rows for missing/invalid/duplicate checks, restricted to identifier (unique/primaryKey) plus the offending column, omitting columns whose ODCS classification marks them sensitive. Stored on Check.failed_samples and surfaced in JSON and the JUnit failure text. Local-only; needs no Soda Cloud.

Add in-process duckdb tests for diagnostics, percent/severity, and failed samples.

* ci: add Java setup for Spark-based tests, update README

* chore(deps): remove aiobotocore dependency and update README with Java installation instructions

* chore(release): 1.0.0

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
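
To make the percent-threshold and severity rules from the feat(test) commit above concrete, here is a minimal sketch. All names (`evaluate_quality_check`, `NON_BLOCKING_SEVERITIES`, the `must_be_less_than` parameter) are hypothetical and not taken from the project; only the "less than" comparison is modeled, and treating an empty table as 0 percent failed is an assumption of this example.

```python
# Severities the commit message lists as non-blocking.
NON_BLOCKING_SEVERITIES = {"info", "warning", "low", "minor", "trivial"}


def failed_percent(failed_rows, row_count):
    """Failed fraction on a 0-100 scale; an empty table counts as 0 (assumption)."""
    if row_count == 0:
        return 0.0
    return 100.0 * failed_rows / row_count


def evaluate_quality_check(failed_rows, row_count, must_be_less_than,
                           unit=None, severity=None):
    """Hypothetical evaluation of one count-of-bad-rows check.

    Returns "passed", "warning", or "failed".
    """
    # With unit "percent", compare the failed fraction instead of the raw count.
    if unit == "percent":
        measured = failed_percent(failed_rows, row_count)
    else:
        measured = failed_rows

    if measured < must_be_less_than:
        return "passed"
    # A failing check with a non-blocking severity is downgraded to a warning.
    if severity is not None and severity.lower() in NON_BLOCKING_SEVERITIES:
        return "warning"
    return "failed"
```

For example, 20 bad rows out of 100 against a threshold of 10 percent fails by default, but is reported as a warning when the check's severity is `warning`.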