
feat: add DuckDB, Trino, Dremio & Spark support to CI and CLI #2135

Open
devin-ai-integration[bot] wants to merge 85 commits into master from devin/ELE-5266-1772368864-code-ci

Conversation


devin-ai-integration bot commented Mar 1, 2026

Summary

Adds full CI infrastructure and CLI support for DuckDB, Trino, Dremio, and Spark adapters to elementary-cli. All 11 warehouse tests (including the 4 new ones) plus code-quality now pass in CI.

Key changes:

  • pyproject.toml: Added dbt-duckdb and dbt-dremio as optional dependencies
  • profiles.yml.j2: Added profile targets for all 4 adapters (DuckDB file-based, Trino with Iceberg, Dremio with Nessie enterprise catalog, Spark Thrift Server with Delta)
  • docker-compose.yml: Added Docker service stacks for Trino (Iceberg + Hive Metastore + MinIO), Dremio (Nessie + MinIO), Spark (Thrift Server + Delta Lake)
  • CI workflows: Docker startup steps, seed caching for all Docker adapters, external seeders for Dremio and Spark
  • E2E dbt project: Renamed reserved-word columns (min → min_val, max → max_val, sum → sum_val, one → col1), updated data_type mappings, excluded schema_changes for Spark, added adapter-specific schema drop dispatches
  • CLI adapter code: Added transient error patterns and get_adapter_unique_id dispatch for DuckDB
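The transient-error handling mentioned above boils down to a per-adapter list of regex patterns matched against error messages. A rough Python sketch follows; the mapping name, patterns, and function are illustrative, not the actual contents of elementary's transient_errors.py:

```python
import re

# Illustrative sketch only: the real _ADAPTER_PATTERNS in
# elementary/clients/dbt/transient_errors.py may use different
# patterns and structure.
ADAPTER_PATTERNS = {
    # DuckDB file locks can collide when another process holds the DB file
    "duckdb": [re.compile(r"database is locked", re.IGNORECASE)],
    # Spark Thrift Server connections occasionally drop mid-query
    "spark": [re.compile(r"connection reset|TTransportException", re.IGNORECASE)],
}

def is_transient_error(adapter_type: str, error_message: str) -> bool:
    """Return True if the error matches a known transient pattern,
    so the caller can retry instead of failing the run."""
    patterns = ADAPTER_PATTERNS.get(adapter_type, [])
    return any(p.search(error_message) for p in patterns)
```

The design point is that transience is adapter-specific: an unknown adapter matches nothing and therefore never retries.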

Dremio-specific workarounds (dbt-dremio adapter limitations):

  • --exclude tag:ephemeral_model on dbt test — dbt-dremio doesn't support ephemeral model CTEs
  • Elementary profile uses enterprise_catalog_folder instead of schema to avoid duplicate-key errors from dbt-dremio's alias chain
  • Seeds placed in same Nessie namespace as models (VDS validator can't resolve cross-namespace references)
  • Single-threaded execution (--threads 1) to avoid Nessie catalog race conditions

Updates since last revision

  • Refactored all manual Dremio reserved-keyword quoting to use elementary.escape_reserved_keywords() — Replaced {% if target.type == 'dremio' %}..."keyword"...{% else %}keyword{% endif %} patterns with the centralized utility macro in 5 files:
    • row_number alias in get_models_latest_invocation.sql and get_models_latest_invocations_data.sql
    • count alias in can_upload_source_freshness.sql
    • filter column reference in source_freshness_alerts.sql and get_source_freshness_results.sql
  • Added row_number and snapshots to the Dremio reserved keywords list (dbt-data-reliability#955, merged).
  • Fixed Column 'unique_id' is ambiguous error — Replaced USING (unique_id) with an explicit ON run_results.unique_id = models.unique_id join and qualified column references. Dremio (and some other databases) don't deduplicate join columns when ON is used instead of USING, so unqualified references become ambiguous.
  • Fixed invocation_id is ambiguous error — Qualified all column references in get_models_latest_invocations_data.sql final SELECT to avoid ambiguity with ON join.
  • Fully removed continue-on-error workaround for Dremio edr steps. Now that PR #2137 (ref() macro override for dbt-core 1.11 compatibility) is merged to master, the Dremio edr monitor/report/send-report/validate-alerts/e2e steps are fully validated in CI without any silent failures. Merged latest master into this branch to pick up the fix.
  • Fixed Dremio SQL reserved keyword conflicts:
    • Renamed the snapshots CTE to snapshots_data in model_alerts.sql (snapshots is reserved in Dremio's Calcite parser)
  • Fixed dbt-dremio dateadd macro bugs: Created override in elementary/monitor/dbt_project/macros/overrides/dateadd.sql:
    • Casts interval parameter to string before calling .replace() (dbt-dremio calls it on int, causing AttributeError)
    • Removes select prefix from TIMESTAMPADD output (dbt-dremio wraps in scalar subquery, causing $SCALAR_QUERY errors in expression contexts)
  • Fixed Dremio elementary profile schema: Uses dot-separated namespace {{ schema_name }}.elementary instead of underscore to match how elementary tables are created
  • Reverted temporary dbt-data-reliability branch pin now that dbt-data-reliability#948 and #955 are merged.
  • Fixed Nessie Docker image registry: projectnessie/nessie:0.80.0 no longer exists on Docker Hub; now uses ghcr.io/projectnessie/nessie:0.80.0.
  • Fixed Docker healthchecks: MinIO 2024 and hive-metastore images lack curl/nc; replaced with bash -c 'echo > /dev/tcp/localhost/PORT'.
  • Fixed Dremio credential mismatch: dremio-setup.sh default password aligned to dremio123 to match docker-compose.yml.
  • Fixed Dremio seed loading: dremio-setup now uses --exit-code-from to wait for NessieSource creation before external seeder runs.
  • Fixed Dremio NessieSource initialization: Added nessie: condition: service_healthy dependency to ensure Nessie catalog is ready.
  • Increased Trino hive-metastore healthcheck timeout: 60 retries with 60s start_period for CI environments.
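The first dateadd bug above reduces to calling a string method on an int. A minimal Python reproduction (variable names are illustrative; the actual dbt-dremio macro code differs):

```python
# dbt passes dateadd interval values as plain ints; the dbt-dremio macro
# effectively did interval.replace(...), which fails on an int.
interval = 3

try:
    interval.replace("'", "")
    error = None
except AttributeError as exc:
    error = str(exc)  # e.g. 'int' object has no attribute 'replace'

# The override casts to str first, which is safe for both ints and strings:
safe_interval = str(interval).replace("'", "")
```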

Review & Testing Checklist for Human

  • USING → ON join refactoring affects all adapters — The change from USING (unique_id) to an explicit ON join with qualified column references fixes Dremio's ambiguous column error, but applies to ALL adapters. CI passes for all 16 checks, but verify edge cases around column resolution don't break for specific adapters.
  • escape_reserved_keywords() utility macro dependency — The refactored code depends on dbt-data-reliability having row_number and snapshots in the Dremio reserved keywords list (PR #955, now merged). Verify the elementary package correctly picks up the merged version.
  • Reserved word CTE rename (snapshots → snapshots_data) — This change in model_alerts.sql affects ALL adapters, not just Dremio. Verify no downstream references to the old CTE name exist.
  • Dremio dateadd macro override — New override at elementary/monitor/dbt_project/macros/overrides/dateadd.sql fixes two dbt-dremio bugs (int interval AttributeError + scalar subquery wrapping). Verify this doesn't break other date arithmetic operations or introduce edge cases with different interval types.
  • Dremio elementary profile schema namespace — Uses dot-separated {{ schema_name }}.elementary instead of underscore. Verify elementary CLI can correctly resolve this schema for Dremio connections.
  • Dremio edr integration validation — The continue-on-error workaround is now removed, so Dremio's edr monitor/report/send-report steps are fully tested in CI. All 16 CI checks pass, confirming the ref() override + dateadd override + reserved keyword fixes work end-to-end.
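The USING vs. ON concern in the first checklist item can be seen in miniature with SQLite (used here purely for illustration; Dremio's Calcite-based behavior is the actual motivation): SELECT * over a USING join collapses the join column into one, while an ON join keeps both copies, so unqualified references to it become ambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE run_results (unique_id TEXT, status TEXT);
    CREATE TABLE models (unique_id TEXT, name TEXT);
    INSERT INTO run_results VALUES ('model.a', 'success');
    INSERT INTO models VALUES ('model.a', 'a');
""")

# USING (unique_id): the join column appears exactly once in SELECT *
using_cols = [d[0] for d in conn.execute(
    "SELECT * FROM run_results JOIN models USING (unique_id)"
).description]

# ON ...: both copies survive, so a bare unique_id needs qualification
on_cols = [d[0] for d in conn.execute(
    "SELECT * FROM run_results JOIN models "
    "ON run_results.unique_id = models.unique_id"
).description]
```

This is why the refactor pairs the ON join with fully qualified column references everywhere the column is selected.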

Suggested Test Plan

  1. ✅ All 16 CI checks pass (including Dremio edr monitor/report steps without continue-on-error)
  2. Spot-check that reserved-word CTE rename (snapshots_data) doesn't break other adapters in CI
  3. Verify escape_reserved_keywords() works correctly for Dremio and doesn't affect other adapters
  4. Verify seed caching works correctly: trigger a second CI run and confirm Docker adapter tests complete faster on cache hit
  5. (Optional) Test Dremio edr monitor locally with the dateadd override to validate edge cases

Notes

  • All Docker credentials (e.g. minio123, dremio123) are local-only test credentials, not real secrets
  • Seed caching reduces Docker adapter test time from ~10min to ~2min on cache hit
  • The data_type mappings in schema.yml assume Spark/BigQuery/Databricks/Athena use STRING and Trino/Dremio/DuckDB use varchar
  • Spark excluded from schema_changes tests like Databricks (same limitation)
  • Fixed python-dev → python3-dev for Spark requirements (existing bug)
  • DuckDB profile uses threads: 1 to match dbt-duckdb's concurrency model
  • Fixed schema name for Docker adapters: now uses fixed elementary_tests to enable seed caching (safe because concurrency group prevents parallel runs)
  • Dremio ref() override fix: PR #2137 adds a project-level ref() macro that properly handles dbt-core 1.11's two-arg cross-package syntax. See dremio/dbt-dremio#232 for upstream issue.
  • Dremio dateadd override addresses two dbt-dremio bugs: (1) interval.replace() called on int parameter, (2) select TIMESTAMPADD(...) wraps in scalar subquery causing Calcite parser errors in expression contexts
  • The escape_reserved_keywords() macro dispatches to adapter-specific implementations — for Dremio it wraps keywords in double quotes; for other adapters it's a no-op
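Based on that description, the dispatch behaves roughly like the Python sketch below. The real macro is Jinja in dbt-data-reliability; the keyword list shown is an illustrative subset, not the full list:

```python
# Illustrative subset; the real list lives in dbt-data-reliability
# (PR #955 added row_number and snapshots).
DREMIO_RESERVED_KEYWORDS = {"row_number", "count", "filter", "snapshots"}

def escape_reserved_keywords(identifier: str, adapter_type: str) -> str:
    """Dremio wraps reserved keywords in double quotes; all other
    adapters pass the identifier through unchanged (a no-op)."""
    if adapter_type == "dremio" and identifier.lower() in DREMIO_RESERVED_KEYWORDS:
        return f'"{identifier}"'
    return identifier
```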

Requested by: @haritamar
Link to Devin Session: https://app.devin.ai/sessions/6c12a19d456541568bcd7974623f3557

Summary by CodeRabbit

  • New Features

    • Added support for DuckDB, Trino, Dremio, and Spark warehouse backends.
  • Chores

    • Added optional adapter packages for DuckDB and Dremio.
    • Expanded CI workflows and test matrix to cover DuckDB, Trino, Dremio, and Spark.
    • Extended transient error handling for Spark.
  • Tests

    • Added end-to-end orchestration, services, and configs to test new backends.
    • Made test seed data deterministic and introduced seed caching for faster CI.

devin-ai-integration bot and others added 7 commits March 1, 2026 14:31
- pyproject.toml: add dbt-duckdb and dbt-dremio as optional dependencies
- Docker config files for Trino and Spark (non-credential files)
- test-all-warehouses.yml: add duckdb, trino, dremio, spark to CI matrix
- schema.yml: update data_type expressions for new adapter type mappings
- test_alerts_union.sql: exclude schema_changes for Spark (like Databricks)
- drop_test_schemas.sql: add dispatched edr_drop_schema for all new adapters
- transient_errors.py: add spark and duckdb entries to _ADAPTER_PATTERNS
- get_adapter_type_and_unique_id.sql: add duckdb dispatch (uses target.path)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.




github-actions bot commented Mar 1, 2026

👋 @devin-ai-integration[bot]
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in this pull request.


coderabbitai bot commented Mar 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds support for four new warehouses—DuckDB, Trino, Dremio, and Spark—across CI workflows, dependencies, dbt profiles/macros, transient error handling, Docker orchestration, test macros, deterministic e2e data generation, and e2e test assets.

Changes

Cohort / File(s) Summary
CI/CD Workflow Updates
.github/workflows/test-all-warehouses.yml, .github/workflows/test-warehouse.yml
Added duckdb/trino/dremio/spark to matrices and inputs; introduced seed cache restore/save logic, adapter install conditionals, and startup/orchestration blocks for new warehouses.
Dependencies
pyproject.toml
Added optional dependencies dbt-duckdb and dbt-dremio, extras duckdb/dremio, and included them in the all extras aggregation.
Transient Errors & Adapter IDs
elementary/clients/dbt/transient_errors.py, elementary/monitor/dbt_project/macros/get_adapter_type_and_unique_id.sql
Added Spark transient-error patterns and DuckDB entry; added duckdb__get_adapter_unique_id() macro returning target.path.
Docker Orchestration (e2e)
tests/e2e_dbt_project/docker-compose.yml, tests/e2e_dbt_project/docker/...
Added Trino, Dremio (Nessie/MinIO), and Spark stacks: new services, networks, volumes, healthchecks, and setup jobs.
Dremio Init Script
tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
New script to wait for Dremio, authenticate, and create a Nessie/S3 source via Dremio REST API.
Spark Runtime Assets
tests/e2e_dbt_project/docker/spark/Dockerfile, .../entrypoint.sh, .../hive-site.xml, .../spark-defaults.conf
Added Spark image, entrypoint with WAIT_FOR host:port logic, Hive metastore config, and Spark defaults (Delta, memory, parallelism, etc.).
Trino Configs
tests/e2e_dbt_project/docker/trino/...
Added Trino catalog files (Iceberg/memory), server config, JVM options, and node properties.
Test Macros & Models
tests/e2e_dbt_project/macros/system/drop_test_schemas.sql, tests/e2e_dbt_project/macros/system/dremio_target_database.sql, tests/e2e_dbt_project/models/*, tests/e2e_dbt_project/models/test_alerts_union.sql
Added engine-specific drop-schema macros for trino/dremio/duckdb/spark; added dremio__target_database(); updated schema type mappings and guards to include spark.
Profiles
tests/profiles/profiles.yml.j2
Added profile outputs and target entries for trino, dremio, duckdb, and spark; expanded elementary targets list.
Deterministic Test Data
tests/e2e_dbt_project/generate_data.py
Seeded RNG with RANDOM_SEED = 42 to make generated CSVs deterministic and cacheable across CI runs.
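The deterministic-data change amounts to seeding the RNG once with a fixed constant so repeated CI runs emit identical CSVs, making them cacheable. A sketch (generate_data.py's actual generation logic differs):

```python
import random

RANDOM_SEED = 42  # fixed seed makes generated seed CSVs identical across CI runs

def generate_values(n: int) -> list:
    # A dedicated Random instance avoids interference from other callers
    # of the module-level random state.
    rng = random.Random(RANDOM_SEED)
    return [rng.randint(0, 1000) for _ in range(n)]
```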

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant GH as GitHub Actions
    participant DC as Docker Compose
    participant MINIO as MinIO
    participant MS as Metastore
    participant SVC as Service (Trino/Dremio/Spark)
    participant DBT as dbt Runner

    GH->>DC: start orchestration (docker-compose up)
    DC->>MINIO: start MinIO
    DC->>MS: start Metastore (Hive/Postgres)
    DC->>SVC: start main service (Trino/Dremio/Spark)
    SVC-->>DC: respond to health checks
    MINIO-->>DC: report ready
    MS-->>DC: report ready
    DC->>GH: services ready
    GH->>DBT: restore seed cache (if available) & run seeds
    DBT->>SVC: connect and execute tests
    DBT-->>GH: return results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through compose and CI bright,
I nudged DuckDB, Trino, Dremio, Spark to light,
Metastores hum and MinIO sings,
Seeds cached tight, deterministic things,
A rabbit cheers: new warehouses take flight! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Title check ✅ Passed: the title accurately describes the main objective, adding support for four new data warehouse adapters (DuckDB, Trino, Dremio, and Spark) to the CI and CLI infrastructure.
  • Description check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.


coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (2)
tests/e2e_dbt_project/docker/spark/Dockerfile (1)

1-30: Run the Spark container as a non-root user.

The image currently runs as root; adding a dedicated runtime user improves baseline container hardening.

Suggested hardening diff
 COPY entrypoint.sh /scripts/
-RUN chmod +x /scripts/entrypoint.sh
+RUN chmod +x /scripts/entrypoint.sh && \
+    groupadd -r spark && useradd -r -g spark -m spark && \
+    chown -R spark:spark /usr/spark /scripts
 
+USER spark
 ENTRYPOINT ["/scripts/entrypoint.sh"]
 CMD ["--help"]
tests/e2e_dbt_project/docker-compose.yml (1)

106-106: Pin new container images to explicit versions/digests.

Using latest (or no tag) makes CI non-deterministic and harder to debug when upstream images change.

🔧 Suggested direction
-    image: projectnessie/nessie:latest
+    image: projectnessie/nessie:<pinned-version-or-digest>

-    image: minio/minio:latest
+    image: minio/minio:<pinned-version-or-digest>

-    image: minio/mc
+    image: minio/mc:<pinned-version-or-digest>

-    image: dremio/dremio-oss:latest
+    image: dremio/dremio-oss:<pinned-version-or-digest>

-    image: alpine:latest
+    image: alpine:<pinned-version-or-digest>

Also applies to: 114-114, 132-132, 150-150, 175-175

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 109-113: The readiness loops for Trino, Dremio and Spark currently
break on success but never fail on timeout; update each curl loop (the
Trino/Dremio/Spark readiness checks) to detect if the loop completed without
success by adding a post-loop conditional that checks whether the last curl
succeeded and, if not, prints an error and exits non‑zero (e.g., after the seq
loop, if the curl never returned success then echo "Timed out waiting for
<Service>" and exit 1). Ensure you apply this pattern to the Trino readiness
loop and the analogous Dremio and Spark loops so the job fails fast when
dependencies never become healthy.
- Line 216: The DuckDB target is using an in-memory database (path ":memory:")
which cannot persist across workflow steps; change the DuckDB configuration used
in CI so it uses a persistent file path (for example set path to
"/tmp/elementary_test.db") or exclude DuckDB from the multi-step workflow when
in-memory testing is intended; update the workflow logic that references
inputs.warehouse-type (and any DuckDB-specific job/step configs) to use the
persistent file path or add a conditional to skip DuckDB when
inputs.warehouse-type == 'duckdb' and in-memory mode is active.

In `@tests/e2e_dbt_project/docker-compose.yml`:
- Around line 95-100: The trino-mc-job entrypoint uses a fixed sleep which can
fail if MinIO starts slower; update the trino-mc-job entrypoint to replace
"sleep 5" with a readiness loop that waits for minio:9000 to accept connections
(e.g., a small loop using curl or nc that retries until success) before running
the mc commands, then proceed with "/usr/bin/mc config ...", "/usr/bin/mc mb
..."; modify the entrypoint block in the trino-mc-job service so it polls minio
(hostname "minio", port 9000) and exits the loop only when reachable.

In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh`:
- Around line 15-33: The script currently treats AUTH_TOKEN and the S3 source
creation as always successful; update the AUTH_TOKEN acquisition and the
subsequent curl call that creates S3Source to fail fast: after the login curl
that sets AUTH_TOKEN (variable AUTH_TOKEN), check for empty string OR the
literal "null" and exit non‑zero with an error message if found; for both curl
calls (the login POST and the PUT to create S3Source) capture HTTP status (e.g.,
via curl -w/%{http_code}% and/or -f) and the response body, verify the status is
a 2xx before proceeding, and on non‑2xx log the response body and exit 1 so the
script does not print "Obtained Dremio auth token" or "S3 Source created in
Dremio" on failure.

In `@tests/e2e_dbt_project/macros/system/drop_test_schemas.sql`:
- Around line 43-53: The DROP macros interpolate schema_name unquoted which
breaks on special chars/reserved words; update dremio__edr_drop_schema,
duckdb__edr_drop_schema, and spark__edr_drop_schema to use
adapter.quote(schema_name) (same pattern as clickhouse__edr_drop_schema) when
building the raw DROP SQL so the identifier is safely quoted (e.g., replace
plain schema_name concatenation with adapter.quote(schema_name) in the run_query
argument).

In `@tests/profiles/profiles.yml.j2`:
- Around line 49-53: The DuckDB profile currently uses the process-local
in-memory path ":memory:", which breaks across SubprocessDbtRunner subprocesses;
update the profiles template (the duckdb &duckdb block) to use a persistent or
shared DB URI instead of ":memory:"—either switch to a file-backed path (e.g., a
temp DB file under a test-specific directory) or use a shared-memory URI like
the SQLite-style shared cache (e.g., file:...?mode=memory&cache=shared) so
multiple dbt subprocesses can see the same database; keep the schema
interpolation ({{ schema_name }}) and threads setting unchanged.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 746ce41 and 0836bbf.

📒 Files selected for processing (20)
  • .github/workflows/test-all-warehouses.yml
  • .github/workflows/test-warehouse.yml
  • elementary/clients/dbt/transient_errors.py
  • elementary/monitor/dbt_project/macros/get_adapter_type_and_unique_id.sql
  • pyproject.toml
  • tests/e2e_dbt_project/docker-compose.yml
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/e2e_dbt_project/docker/spark/Dockerfile
  • tests/e2e_dbt_project/docker/spark/entrypoint.sh
  • tests/e2e_dbt_project/docker/spark/hive-site.xml
  • tests/e2e_dbt_project/docker/spark/spark-defaults.conf
  • tests/e2e_dbt_project/docker/trino/catalog/iceberg.properties
  • tests/e2e_dbt_project/docker/trino/catalog/memory.properties
  • tests/e2e_dbt_project/docker/trino/etc/config.properties
  • tests/e2e_dbt_project/docker/trino/etc/jvm.config
  • tests/e2e_dbt_project/docker/trino/etc/node.properties
  • tests/e2e_dbt_project/macros/system/drop_test_schemas.sql
  • tests/e2e_dbt_project/models/schema.yml
  • tests/e2e_dbt_project/models/test_alerts_union.sql
  • tests/profiles/profiles.yml.j2

- DuckDB: use file-backed DB path instead of :memory: to persist across
  subprocess calls, and reduce threads to 1 to avoid concurrent commit errors
- Spark: install dbt-spark[PyHive] extras required for thrift connection method
- Dremio: add dremio__target_database() dispatch override in e2e project
  to return target.database (upstream elementary package lacks this dispatch)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
coderabbitai bot left a comment

♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)

109-113: ⚠️ Potential issue | 🟠 Major

Fail fast when readiness polling times out.

Line 109, Line 124, and Line 141 loops still continue even if services never become ready. This can allow later steps to run against unhealthy dependencies and fail non-deterministically.

Suggested fix pattern (apply to all three loops)
-          for i in $(seq 1 60); do
-            curl -sf http://localhost:8086/v1/info && break
+          ready=0
+          for i in $(seq 1 60); do
+            if curl -sf http://localhost:8086/v1/info > /dev/null; then
+              ready=1
+              break
+            fi
             echo "Waiting for Trino... ($i/60)"
             sleep 5
           done
+          if [ "$ready" -ne 1 ]; then
+            echo "Timed out waiting for Trino"
+            exit 1
+          fi

Also applies to: 124-128, 141-145


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0836bbf and 40b2c74.

📒 Files selected for processing (3)
  • .github/workflows/test-warehouse.yml
  • tests/e2e_dbt_project/macros/system/dremio_target_database.sql
  • tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/profiles/profiles.yml.j2

Plain S3 sources in Dremio do not support CREATE TABLE (needed for dbt seed).
Switch to Nessie catalog source which supports table creation via Iceberg.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/e2e_dbt_project/docker/dremio/dremio-setup.sh (1)

15-31: ⚠️ Potential issue | 🟠 Major

Fail fast on login/source API errors and reject "null" tokens.

At Line 20, checking only -z misses the literal "null" token case; and Lines 15-31 do not validate HTTP status for either API call, so setup can report success on non-2xx responses.

🔧 Suggested patch
-# Log in to Dremio to get the auth token
-AUTH_TOKEN=$(curl -s -X POST "http://dremio:9047/apiv2/login" \
-  -H "Content-Type: application/json" \
-  --data "{\"userName\":\"dremio\", \"password\":\"dremio123\"}" | jq -r .token)
+# Log in to Dremio to get the auth token
+LOGIN_RESPONSE="$(mktemp)"
+LOGIN_HTTP_CODE=$(curl -sS -o "$LOGIN_RESPONSE" -w "%{http_code}" -X POST "http://dremio:9047/apiv2/login" \
+  -H "Content-Type: application/json" \
+  --data "{\"userName\":\"dremio\", \"password\":\"dremio123\"}")
+if [ "$LOGIN_HTTP_CODE" -lt 200 ] || [ "$LOGIN_HTTP_CODE" -ge 300 ]; then
+  echo "Failed to log in to Dremio (HTTP $LOGIN_HTTP_CODE)"
+  cat "$LOGIN_RESPONSE"
+  exit 1
+fi
+AUTH_TOKEN=$(jq -r '.token // empty' "$LOGIN_RESPONSE")
 
 # Check if AUTH_TOKEN is not empty
-if [ -z "$AUTH_TOKEN" ]; then
+if [ -z "$AUTH_TOKEN" ] || [ "$AUTH_TOKEN" = "null" ]; then
   echo "Failed to obtain Dremio auth token"
   exit 1
 fi
 
 echo "Obtained Dremio auth token"
 
 # Create a Nessie catalog source in Dremio (supports CREATE TABLE for dbt seed)
-curl -s -X PUT "http://dremio:9047/apiv2/source/NessieSource" \
+SOURCE_RESPONSE="$(mktemp)"
+SOURCE_HTTP_CODE=$(curl -sS -o "$SOURCE_RESPONSE" -w "%{http_code}" -X PUT "http://dremio:9047/apiv2/source/NessieSource" \
   -H "Content-Type: application/json" \
   -H "Authorization: _dremio$AUTH_TOKEN" \
-  --data "{\"name\":\"NessieSource\",\"config\":{\"nessieEndpoint\":\"http://catalog:19120/api/v2\",\"nessieAuthType\":\"NONE\",\"credentialType\":\"ACCESS_KEY\",\"awsAccessKey\":\"admin\",\"awsAccessSecret\":\"password\",\"awsRootPath\":\"datalake\",\"secure\":false,\"propertyList\":[{\"name\":\"fs.s3a.path.style.access\",\"value\":\"true\"},{\"name\":\"fs.s3a.endpoint\",\"value\":\"dremio-storage:9000\"},{\"name\":\"dremio.s3.compat\",\"value\":\"true\"}]},\"type\":\"NESSIE\",\"metadataPolicy\":{\"deleteUnavailableDatasets\":true,\"autoPromoteDatasets\":false,\"namesRefreshMillis\":3600000,\"datasetDefinitionRefreshAfterMillis\":3600000,\"datasetDefinitionExpireAfterMillis\":10800000,\"authTTLMillis\":86400000,\"updateMode\":\"PREFETCH_QUERIED\"}}"
+  --data "{\"name\":\"NessieSource\",\"config\":{\"nessieEndpoint\":\"http://catalog:19120/api/v2\",\"nessieAuthType\":\"NONE\",\"credentialType\":\"ACCESS_KEY\",\"awsAccessKey\":\"admin\",\"awsAccessSecret\":\"password\",\"awsRootPath\":\"datalake\",\"secure\":false,\"propertyList\":[{\"name\":\"fs.s3a.path.style.access\",\"value\":\"true\"},{\"name\":\"fs.s3a.endpoint\",\"value\":\"dremio-storage:9000\"},{\"name\":\"dremio.s3.compat\",\"value\":\"true\"}]},\"type\":\"NESSIE\",\"metadataPolicy\":{\"deleteUnavailableDatasets\":true,\"autoPromoteDatasets\":false,\"namesRefreshMillis\":3600000,\"datasetDefinitionRefreshAfterMillis\":3600000,\"datasetDefinitionExpireAfterMillis\":10800000,\"authTTLMillis\":86400000,\"updateMode\":\"PREFETCH_QUERIED\"}}")
+if [ "$SOURCE_HTTP_CODE" -lt 200 ] || [ "$SOURCE_HTTP_CODE" -ge 300 ]; then
+  echo "Failed to create Nessie Source in Dremio (HTTP $SOURCE_HTTP_CODE)"
+  cat "$SOURCE_RESPONSE"
+  exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh` around lines 15 - 31,
The login and source API calls must validate HTTP status and reject a literal
"null" token; modify the AUTH_TOKEN retrieval block to capture both response
body and HTTP status from the login curl (use curl -w to produce HTTP code and
capture body separately), then extract token with jq -r .token and fail if the
HTTP status is not 2xx or if AUTH_TOKEN is empty or equals "null" (print the
response body/error and exit 1); do the same for the PUT creating NessieSource:
capture response body and HTTP status, and exit non-zero with the response on
any non-2xx status. Ensure you reference the existing AUTH_TOKEN variable, the
login curl to "apiv2/login", and the PUT to "apiv2/source/NessieSource" when
making these checks.
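The token check the prompt describes can be factored into a small helper so the validation logic is testable without a live Dremio instance. This is a minimal sketch, not the actual script; the commented wiring mirrors the login call discussed above, and the credentials are placeholders:

```shell
# check_token: reject non-2xx HTTP codes and empty or literal "null" tokens,
# as the review asks. Returns non-zero so the caller can `|| exit 1`.
check_token() {
  local http_code="$1" token="$2"
  case "$http_code" in
    2[0-9][0-9]) ;;
    *) echo "Dremio login failed (HTTP $http_code)" >&2; return 1 ;;
  esac
  if [ -z "$token" ] || [ "$token" = "null" ]; then
    echo "Dremio login returned no token" >&2
    return 1
  fi
}

# Illustrative wiring in the setup script (jq extracts .token from the body):
#   LOGIN_BODY="$(mktemp)"
#   LOGIN_CODE=$(curl -sS -o "$LOGIN_BODY" -w "%{http_code}" -X POST \
#     "http://dremio:9047/apiv2/login" -H "Content-Type: application/json" \
#     --data '{"userName":"<user>","password":"<password>"}')
#   AUTH_TOKEN=$(jq -r .token < "$LOGIN_BODY")
#   check_token "$LOGIN_CODE" "$AUTH_TOKEN" || exit 1
```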
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh`:
- Around line 7-10: The readiness loop that uses "until curl -s
http://dremio:9047; do ... done" must be bounded: add a MAX_ATTEMPTS variable
and a counter, set curl timeouts (e.g., --connect-timeout and/or --max-time) so
each try fails fast, increment the counter inside the loop, and when attempts >=
MAX_ATTEMPTS log an error and exit non-zero; update the existing until/curl loop
code in the script to use these variables and checks so CI will fail instead of
hanging indefinitely.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 40b2c74 and bac36f9.

📒 Files selected for processing (2)
  • tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
  • tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/profiles/profiles.yml.j2

Comment on lines +7 to +10
until curl -s http://dremio:9047; do
echo "Waiting for Dremio..."
sleep 5
done

⚠️ Potential issue | 🟠 Major



Bound the Dremio readiness wait to avoid CI hangs.

Lines 7-10 create an unbounded loop that retries indefinitely if Dremio is unreachable or if curl cannot connect. This can stall the CI pipeline with no failure signal. The loop needs an explicit max-attempts limit, a timeout on curl, and a hard exit on failure.

🔧 Suggested patch
+max_attempts=60
+attempt=1
-until curl -s http://dremio:9047; do
+until curl --silent --show-error --fail --max-time 3 http://dremio:9047 >/dev/null; do
+  if [ "$attempt" -ge "$max_attempts" ]; then
+    echo "Dremio did not become ready after $max_attempts attempts"
+    exit 1
+  fi
   echo "Waiting for Dremio..."
   sleep 5
+  attempt=$((attempt + 1))
 done

- Make generate_data.py deterministic (fixed random seed)
- Use fixed schema name for Docker adapters (ephemeral containers)
- Cache seeded Docker volumes between runs using actions/cache
- Cache DuckDB database file between runs
- Skip dbt seed on cache hit, restoring from cached volumes instead
- Applies to: Spark, Trino, Dremio, Postgres, ClickHouse, DuckDB

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)

109-113: ⚠️ Potential issue | 🟠 Major

Readiness loops still don’t fail on timeout.

These checks can still silently continue after 60 attempts if the service never becomes healthy, which allows downstream steps to run against broken dependencies.

Also applies to: 124-128, 141-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 109 - 113, The readiness
loops using "for i in $(seq 1 60); do curl -sf http://localhost:8086/v1/info &&
break; ... done" do not fail when the service never becomes healthy; update each
such loop (the curl-based readiness loops at the three occurrences) to detect
timeout and exit non-zero if the loop completes without success — e.g., after
the done add a check like "if [ $i -eq 60 ] && ! curl -sf
http://localhost:8086/v1/info; then echo 'Service did not become ready' >&2;
exit 1; fi" (or equivalent) so the workflow fails instead of continuing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 249-264: The "Restore cached seed data into Docker volumes"
workflow step currently restores tarballs into volumes after the warehouse
services have started (uses CACHE_DIR, docker volume create, docker run ... tar
xzf), which can corrupt live DB files; move this step to run before the job step
that starts warehouse containers (the service startup block) so volumes are
populated prior to container creation, or alternatively modify the step to stop
the warehouse containers before restore and then restart and run a readiness
check (e.g., wait-for/postgres healthcheck) after extraction to ensure services
reinitialize cleanly.



📥 Commits

Reviewing files that changed from the base of the PR and between bac36f9 and 3682f1c.

📒 Files selected for processing (2)
  • .github/workflows/test-warehouse.yml
  • tests/e2e_dbt_project/generate_data.py

Addresses CodeRabbit review: restoring cached tarballs into Docker
volumes while containers are already running risks data corruption.
Now the cache key computation and volume restore happen before any
Docker services are started, so containers initialise with the
pre-seeded data.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)

160-164: ⚠️ Potential issue | 🟠 Major

Readiness loops still don’t fail the job on timeout.

On Line 160-164, Line 175-179, and Line 192-196, the loops stop on success but never exit 1 when retries are exhausted. The workflow can continue with unhealthy dependencies and fail later with misleading errors.

Fail-fast pattern to apply to each loop
-          for i in $(seq 1 60); do
-            curl -sf http://localhost:8086/v1/info && break
-            echo "Waiting for Trino... ($i/60)"
-            sleep 5
-          done
+          ready=0
+          for i in $(seq 1 60); do
+            if curl -sf http://localhost:8086/v1/info > /dev/null; then
+              ready=1
+              break
+            fi
+            echo "Waiting for Trino... ($i/60)"
+            sleep 5
+          done
+          if [ "$ready" -ne 1 ]; then
+            echo "Timed out waiting for Trino"
+            exit 1
+          fi

Use the same timeout-fail pattern for Dremio and Spark checks.

Also applies to: 175-179, 192-196

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test-warehouse.yml around lines 160 - 164, The readiness
loops that poll services (the Trino loop shown and the similar Dremio and Spark
loops) currently break on success but do nothing when retries are exhausted;
update each loop so that if the curl never succeeds within the retries it fails
the job (exit 1). Concretely, after each for/seq loop (the Trino, Dremio, and
Spark readiness checks), check whether the last curl succeeded and, if not,
print an error and run exit 1 (or set a failure flag inside the loop and exit
after the loop) so the workflow stops immediately on timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 96-103: The SEED_HASH calculation (the SEED_HASH env var block)
currently hashes generate_data.py, dbt_project.yml and the profiles.yml.j2
template but omits the Docker stack/compose definition, allowing cached DB
volume tarballs to be restored against incompatible images or volume layouts;
update the SEED_HASH command to also include the Docker stack/compose file(s)
used to bring up the test DB (e.g., docker-compose.yml,
docker-compose.override.yml or any stack definition referenced by the workflow)
so changes in images/layouts change the cache key and prevent restoring
incompatible tarballs.
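The fix this prompt asks for amounts to widening the set of hashed inputs. A minimal sketch (the file paths in the usage comment are illustrative; the real step would hash the repo's actual generator, dbt project, profiles template, and compose files):

```shell
# compute_seed_hash: hash every input that shapes the seeded volumes,
# including the Docker stack definition, so that changing an image or
# volume layout invalidates the cached tarballs.
compute_seed_hash() {
  cat "$@" | sha256sum | cut -c1-16
}

# Illustrative usage in the workflow step:
#   SEED_HASH=$(compute_seed_hash tests/e2e_dbt_project/generate_data.py \
#     tests/e2e_dbt_project/dbt_project.yml tests/profiles/profiles.yml.j2 \
#     tests/e2e_dbt_project/docker-compose.yml)
```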



📥 Commits

Reviewing files that changed from the base of the PR and between 3682f1c and 7969949.

📒 Files selected for processing (1)
  • .github/workflows/test-warehouse.yml

devin-ai-integration bot and others added 2 commits March 1, 2026 15:37
… loops

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ching

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
devin-ai-integration bot and others added 5 commits March 1, 2026 15:46
…upport

Points to dbt-data-reliability#948 which adds:
- trino__full_name_split (1-based array indexing)
- trino__edr_get_create_table_as_sql (bypass model.config issue)
- spark__edr_get_create_table_as_sql

TODO: revert after dbt-data-reliability#948 is merged
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Prevents tar race condition where ClickHouse temporary merge files
disappear during archiving, causing 'No such file or directory' errors.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ache

After stopping containers for volume archiving and restarting them,
services like Trino need time to reinitialize. Added per-adapter
health checks to wait for readiness before proceeding.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
The /v1/info endpoint returns HTTP 200 even when Trino is still
initializing. Check for '"starting":false' in the response body
to ensure Trino is fully ready before proceeding.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
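The commit above can be sketched as a predicate over the /v1/info body, factored out here so it can be checked without a live server; the commented polling loop mirrors the workflow's existing pattern:

```shell
# trino_ready: /v1/info returns HTTP 200 while Trino is still initializing,
# so a successful request is not enough; require "starting":false in the body.
trino_ready() {
  printf '%s' "$1" | grep -q '"starting":false'
}

# Illustrative polling loop:
#   for i in $(seq 1 60); do
#     body=$(curl -sf http://localhost:8086/v1/info) && trino_ready "$body" && break
#     echo "Waiting for Trino... ($i/60)"
#     sleep 5
#   done
```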
…rino

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
devin-ai-integration bot and others added 9 commits March 2, 2026 14:44
…d for CI

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…24 and hive images)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…io123)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ults

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…_period for CI

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…code-from)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ucceeds

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…incompatibility)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
devin-ai-integration bot and others added 14 commits March 2, 2026 18:54
…w on master)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
dbt-dremio's generate_schema_name uses dot separation for nested Nessie
namespaces (e.g. elementary_tests.elementary), not underscore concatenation
(elementary_tests_elementary). The CLI profile must match the namespace
path created by the e2e project's dbt run.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Dremio's Calcite-based SQL parser treats 'snapshots' as a reserved keyword,
causing 'Encountered ", snapshots" at line 6, column 6' error in the
populate_model_alerts_query post-hook. Renamed to 'snapshots_data'.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Dremio's Calcite-based SQL parser treats 'filter' as a reserved keyword,
causing 'Encountered ". filter" at line 52' error in the
populate_source_freshness_alerts_query post-hook.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…case issue

Snowflake stores columns as UPPERCASE, so quoting as "filter" (lowercase)
breaks column resolution. Only quote for Dremio where it's a reserved keyword.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
dbt-dremio's dateadd macro calls interval.replace() which fails when
interval is an integer. This override casts to string first.
Upstream bug in dbt-dremio's macros/utils/date_spine.sql.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ERY error

dbt-dremio's dateadd wraps result in 'select TIMESTAMPADD(...)' which creates
a scalar subquery when embedded in larger SQL. Dremio's Calcite parser rejects
multi-field RECORDTYPE in scalar subquery context. Output just TIMESTAMPADD(...)
as a plain expression instead.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
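Combined with the earlier string-cast fix, the override described in this commit might look roughly like the following dbt macro. The macro name and argument list are assumed to mirror dbt-dremio's dateadd dispatch; this is a sketch, not the actual patch:

```sql
{% macro dremio__edr_dateadd(datepart, interval, from_date_or_timestamp) %}
    {#- Cast interval to string first (dbt-dremio calls interval.replace(),
        which fails on integers), and emit TIMESTAMPADD as a plain expression
        rather than wrapping it in `select ...`, which Dremio's Calcite parser
        rejects as a multi-field scalar subquery. -#}
    timestampadd({{ datepart }}, {{ interval | string }}, {{ from_date_or_timestamp }})
{% endmacro %}
```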
…xception

Dremio's Gandiva (Arrow execution engine) cannot parse ISO 8601 timestamps
with the 'Z' UTC timezone suffix (e.g. '2026-03-02T22:50:42.101Z'). This
causes 'Invalid timestamp or unknown zone' errors during edr monitor report.

Override dremio__edr_cast_as_timestamp in the monitor project to strip the
'Z' suffix before casting. Also add dispatch config so elementary_cli macros
take priority over the elementary package for adapter-dispatched macros.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Gandiva rejects both 'Z' suffix and 'T' separator in ISO 8601 timestamps.
Normalize '2026-03-02T23:31:12.443Z' to '2026-03-02 23:31:12.443'.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
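The normalization in this commit is a plain string rewrite. In the real fix it lives inside the overridden cast macro, but the transformation itself can be sketched (and checked) with sed:

```shell
# normalize_ts: rewrite an ISO 8601 literal into the form Gandiva accepts;
# drop a trailing 'Z' and replace the 'T' separator with a space.
normalize_ts() {
  printf '%s' "$1" | sed -e 's/T/ /' -e 's/Z$//'
}
```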
…imestamp text

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…for Dremio

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Dremio's Calcite SQL parser reserves ROW_NUMBER and COUNT as keywords.
These were used as unquoted column aliases in:
- get_models_latest_invocation.sql
- get_models_latest_invocations_data.sql
- can_upload_source_freshness.sql

Applied Dremio-specific double-quoting via target.type conditional,
same pattern used for 'filter' and 'snapshots' reserved keywords.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
where {{ elementary.edr_datediff(elementary.edr_cast_as_timestamp('generated_at'), elementary.edr_current_timestamp(), 'day') }} < {{ days_back }}
)
select count(*) as count
select count(*) as {% if target.type == 'dremio' %}"count"{% else %}count{% endif %}
Collaborator

do you think we can use elementary.escape_reserved_keywords? It was created for this purpose
(its code is in the dbt-data-reliability repo)

Contributor Author

Good call! Refactored all manual {% if target.type == 'dremio' %} quoting to use elementary.escape_reserved_keywords() instead. This covers:

  • row_number alias (2 files)
  • count alias (1 file)
  • filter column reference (2 files)

Also created dbt-data-reliability#955 to add row_number and snapshots to the reserved keywords list. CI is temporarily pinned to that branch.
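The refactor described in this reply reduces each call site to the shared utility, sketched here against the `count` alias from the diff above (the macro's exact signature is assumed from this thread; see dbt-data-reliability for the real definition):

```sql
-- Before: manual per-adapter quoting
select count(*) as {% if target.type == 'dremio' %}"count"{% else %}count{% endif %}

-- After: shared reserved-keyword escaping
select count(*) as {{ elementary.escape_reserved_keywords('count') }}
```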

devin-ai-integration bot and others added 4 commits March 3, 2026 01:35
…ed words

Replace manual {% if target.type == 'dremio' %} quoting with the existing
elementary.escape_reserved_keywords() utility from dbt-data-reliability.

Files updated:
- get_models_latest_invocation.sql: row_number alias
- get_models_latest_invocations_data.sql: row_number alias
- can_upload_source_freshness.sql: count alias
- source_freshness_alerts.sql: filter column reference
- get_source_freshness_results.sql: filter column reference

Also temporarily pins dbt-data-reliability to branch with row_number
and snapshots added to the reserved keywords list (PR #955).

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Replace USING (unique_id) with explicit ON clause and select specific
columns instead of SELECT * to avoid ambiguous column references in
Dremio's SQL engine, which doesn't deduplicate join columns with USING.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…N join

The switch from USING to ON for Dremio compatibility requires qualifying
column references since ON doesn't deduplicate join columns like USING does.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>