feat: add DuckDB, Trino, Dremio & Spark support to CI and CLI #2135
devin-ai-integration[bot] wants to merge 85 commits into master from
Conversation
- pyproject.toml: add dbt-duckdb and dbt-dremio as optional dependencies
- Docker config files for Trino and Spark (non-credential files)
- test-all-warehouses.yml: add duckdb, trino, dremio, spark to CI matrix
- schema.yml: update data_type expressions for new adapter type mappings
- test_alerts_union.sql: exclude schema_changes for Spark (like Databricks)
- drop_test_schemas.sql: add dispatched edr_drop_schema for all new adapters
- transient_errors.py: add spark and duckdb entries to _ADAPTER_PATTERNS
- get_adapter_type_and_unique_id.sql: add duckdb dispatch (uses target.path)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
📝 Walkthrough

Adds support for four new warehouses—DuckDB, Trino, Dremio, and Spark—across CI workflows, dependencies, dbt profiles/macros, transient error handling, Docker orchestration, test macros, deterministic e2e data generation, and e2e test assets.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant GH as GitHub Actions
    participant DC as Docker Compose
    participant MINIO as MinIO
    participant MS as Metastore
    participant SVC as Service (Trino/Dremio/Spark)
    participant DBT as dbt Runner
    GH->>DC: start orchestration (docker-compose up)
    DC->>MINIO: start MinIO
    DC->>MS: start Metastore (Hive/Postgres)
    DC->>SVC: start main service (Trino/Dremio/Spark)
    SVC-->>DC: respond to health checks
    MINIO-->>DC: report ready
    MS-->>DC: report ready
    DC->>GH: services ready
    GH->>DBT: restore seed cache (if available) & run seeds
    DBT->>SVC: connect and execute tests
    DBT-->>GH: return results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
🧹 Nitpick comments (2)
tests/e2e_dbt_project/docker/spark/Dockerfile (1)
1-30: Run the Spark container as a non-root user. The image currently runs as root; adding a dedicated runtime user improves baseline container hardening.
Suggested hardening diff
```diff
 COPY entrypoint.sh /scripts/
-RUN chmod +x /scripts/entrypoint.sh
+RUN chmod +x /scripts/entrypoint.sh && \
+    groupadd -r spark && useradd -r -g spark -m spark && \
+    chown -R spark:spark /usr/spark /scripts
+USER spark
 ENTRYPOINT ["/scripts/entrypoint.sh"]
 CMD ["--help"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e_dbt_project/docker/spark/Dockerfile` around lines 1 - 30, The Dockerfile runs the Spark image as root; create and switch to a non-root runtime user to improve container hardening by adding a dedicated user/group (e.g., spark:spark) after installing packages and before COPY/ENTRYPOINT, ensure ownership of SPARK_HOME (/usr/spark) and /scripts is changed to that user (chown -R spark:spark /usr/spark /scripts) and then add a USER spark line so the ENTRYPOINT ("/scripts/entrypoint.sh") and CMD run unprivileged; make sure any later runtime actions in entrypoint.sh do not require root or handle privilege escalation explicitly.

tests/e2e_dbt_project/docker-compose.yml (1)
106-106: Pin new container images to explicit versions/digests. Using `latest` (or no tag) makes CI non-deterministic and harder to debug when upstream images change.

🔧 Suggested direction

```diff
-  image: projectnessie/nessie:latest
+  image: projectnessie/nessie:<pinned-version-or-digest>
-  image: minio/minio:latest
+  image: minio/minio:<pinned-version-or-digest>
-  image: minio/mc
+  image: minio/mc:<pinned-version-or-digest>
-  image: dremio/dremio-oss:latest
+  image: dremio/dremio-oss:<pinned-version-or-digest>
-  image: alpine:latest
+  image: alpine:<pinned-version-or-digest>
```

Also applies to: 114-114, 132-132, 150-150, 175-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e_dbt_project/docker-compose.yml` at line 106, Replace unconstrained image tags using "latest" with explicit version tags or digests to ensure deterministic CI; update the image entry "image: projectnessie/nessie:latest" in the docker-compose.yml to a specific version or sha256 digest, and apply the same change for the other image entries referenced (lines 114, 132, 150, 175) so all container images are pinned to fixed tags/digests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 109-113: The readiness loops for Trino, Dremio and Spark currently
break on success but never fail on timeout; update each curl loop (the
Trino/Dremio/Spark readiness checks) to detect if the loop completed without
success by adding a post-loop conditional that checks whether the last curl
succeeded and, if not, prints an error and exits non‑zero (e.g., after the seq
loop, if the curl never returned success then echo "Timed out waiting for
<Service>" and exit 1). Ensure you apply this pattern to the Trino readiness
loop and the analogous Dremio and Spark loops so the job fails fast when
dependencies never become healthy.
- Line 216: The DuckDB target is using an in-memory database (path ":memory:")
which cannot persist across workflow steps; change the DuckDB configuration used
in CI so it uses a persistent file path (for example set path to
"/tmp/elementary_test.db") or exclude DuckDB from the multi-step workflow when
in-memory testing is intended; update the workflow logic that references
inputs.warehouse-type (and any DuckDB-specific job/step configs) to use the
persistent file path or add a conditional to skip DuckDB when
inputs.warehouse-type == 'duckdb' and in-memory mode is active.
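The timeout-fail pattern the readiness-loop comment above describes can be sketched as a small helper. This is a minimal sketch, not the workflow's actual code: the function name `wait_for_service` and its defaults are illustrative.

```shell
# wait_for_service: run a check command repeatedly until it succeeds,
# or fail with a non-zero exit once the attempts are exhausted.
# Usage: wait_for_service <max_attempts> <sleep_seconds> <command...>
# Helper name and argument layout are illustrative, not from the workflow.
wait_for_service() {
  max_attempts="$1"; shift
  sleep_seconds="$1"; shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    echo "Waiting... ($attempt/$max_attempts)" >&2
    sleep "$sleep_seconds"
    attempt=$((attempt + 1))
  done
  echo "Timed out after $max_attempts attempts" >&2
  return 1
}

# In the workflow this could replace the bare curl loop, e.g.:
# wait_for_service 60 5 curl -sf http://localhost:8086/v1/info || exit 1
```

Because the helper's exit status reflects the timeout, a `|| exit 1` at the call site gives the fail-fast behavior the review asks for.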
In `@tests/e2e_dbt_project/docker-compose.yml`:
- Around line 95-100: The trino-mc-job entrypoint uses a fixed sleep which can
fail if MinIO starts slower; update the trino-mc-job entrypoint to replace
"sleep 5" with a readiness loop that waits for minio:9000 to accept connections
(e.g., a small loop using curl or nc that retries until success) before running
the mc commands, then proceed with "/usr/bin/mc config ...", "/usr/bin/mc mb
..."; modify the entrypoint block in the trino-mc-job service so it polls minio
(hostname "minio", port 9000) and exits the loop only when reachable.
In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh`:
- Around line 15-33: The script currently treats AUTH_TOKEN and the S3 source
creation as always successful; update the AUTH_TOKEN acquisition and the
subsequent curl call that creates S3Source to fail fast: after the login curl
that sets AUTH_TOKEN (variable AUTH_TOKEN), check for empty string OR the
literal "null" and exit non‑zero with an error message if found; for both curl
calls (the login POST and the PUT to create S3Source) capture HTTP status (e.g.,
via curl -w/%{http_code}% and/or -f) and the response body, verify the status is
a 2xx before proceeding, and on non‑2xx log the response body and exit 1 so the
script does not print "Obtained Dremio auth token" or "S3 Source created in
Dremio" on failure.
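The validation steps that prompt describes can be sketched as two small helpers. The function names `require_2xx` and `require_token` are illustrative; the actual script would inline these checks around its existing curl calls.

```shell
# require_2xx: fail unless an HTTP status code is in the 2xx range.
# require_token: fail on an empty or literal "null" token.
# Both helper names are illustrative sketches, not part of dremio-setup.sh.
require_2xx() {
  code="$1"; what="$2"
  if [ "$code" -lt 200 ] || [ "$code" -ge 300 ]; then
    echo "$what failed (HTTP $code)" >&2
    return 1
  fi
}

require_token() {
  token="$1"
  if [ -z "$token" ] || [ "$token" = "null" ]; then
    echo "Failed to obtain Dremio auth token" >&2
    return 1
  fi
}

# Sketch of how the login call (endpoint from the script above) could use them;
# the temp-file path is an assumption:
# HTTP_CODE=$(curl -sS -o /tmp/login.json -w "%{http_code}" \
#   -X POST "http://dremio:9047/apiv2/login" ...)
# require_2xx "$HTTP_CODE" "Dremio login" || exit 1
# AUTH_TOKEN=$(jq -r '.token // empty' /tmp/login.json)
# require_token "$AUTH_TOKEN" || exit 1
```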
In `@tests/e2e_dbt_project/macros/system/drop_test_schemas.sql`:
- Around line 43-53: The DROP macros interpolate schema_name unquoted which
breaks on special chars/reserved words; update dremio__edr_drop_schema,
duckdb__edr_drop_schema, and spark__edr_drop_schema to use
adapter.quote(schema_name) (same pattern as clickhouse__edr_drop_schema) when
building the raw DROP SQL so the identifier is safely quoted (e.g., replace
plain schema_name concatenation with adapter.quote(schema_name) in the run_query
argument).
In `@tests/profiles/profiles.yml.j2`:
- Around line 49-53: The DuckDB profile currently uses the process-local
in-memory path ":memory:", which breaks across SubprocessDbtRunner subprocesses;
update the profiles template (the duckdb &duckdb block) to use a persistent or
shared DB URI instead of ":memory:"—either switch to a file-backed path (e.g., a
temp DB file under a test-specific directory) or use a shared-memory URI like
the SQLite-style shared cache (e.g., file:...?mode=memory&cache=shared) so
multiple dbt subprocesses can see the same database; keep the schema
interpolation ({{ schema_name }}) and threads setting unchanged.
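A file-backed variant of that DuckDB profile block could look roughly like the following. The path is an assumption for illustration; the real template keeps the `{{ schema_name }}` interpolation, and `threads: 1` follows the concurrent-commit note elsewhere in this PR.

```yaml
duckdb: &duckdb
  type: duckdb
  # Illustrative file-backed path so every dbt subprocess sees the same DB;
  # the actual template may use a test-specific temp directory instead.
  path: /tmp/elementary_e2e.duckdb
  schema: "{{ schema_name }}"
  threads: 1
```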
---
Nitpick comments:
In `@tests/e2e_dbt_project/docker-compose.yml`:
- Line 106: Replace unconstrained image tags using "latest" with explicit
version tags or digests to ensure deterministic CI; update the image entry
"image: projectnessie/nessie:latest" in the docker-compose.yml to a specific
version or sha256 digest, and apply the same change for the other image entries
referenced (lines 114, 132, 150, 175) so all container images are pinned to
fixed tags/digests.
In `@tests/e2e_dbt_project/docker/spark/Dockerfile`:
- Around line 1-30: The Dockerfile runs the Spark image as root; create and
switch to a non-root runtime user to improve container hardening by adding a
dedicated user/group (e.g., spark:spark) after installing packages and before
COPY/ENTRYPOINT, ensure ownership of SPARK_HOME (/usr/spark) and /scripts is
changed to that user (chown -R spark:spark /usr/spark /scripts) and then add a
USER spark line so the ENTRYPOINT ("/scripts/entrypoint.sh") and CMD run
unprivileged; make sure any later runtime actions in entrypoint.sh do not
require root or handle privilege escalation explicitly.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (20)
- .github/workflows/test-all-warehouses.yml
- .github/workflows/test-warehouse.yml
- elementary/clients/dbt/transient_errors.py
- elementary/monitor/dbt_project/macros/get_adapter_type_and_unique_id.sql
- pyproject.toml
- tests/e2e_dbt_project/docker-compose.yml
- tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
- tests/e2e_dbt_project/docker/spark/Dockerfile
- tests/e2e_dbt_project/docker/spark/entrypoint.sh
- tests/e2e_dbt_project/docker/spark/hive-site.xml
- tests/e2e_dbt_project/docker/spark/spark-defaults.conf
- tests/e2e_dbt_project/docker/trino/catalog/iceberg.properties
- tests/e2e_dbt_project/docker/trino/catalog/memory.properties
- tests/e2e_dbt_project/docker/trino/etc/config.properties
- tests/e2e_dbt_project/docker/trino/etc/jvm.config
- tests/e2e_dbt_project/docker/trino/etc/node.properties
- tests/e2e_dbt_project/macros/system/drop_test_schemas.sql
- tests/e2e_dbt_project/models/schema.yml
- tests/e2e_dbt_project/models/test_alerts_union.sql
- tests/profiles/profiles.yml.j2
- DuckDB: use file-backed DB path instead of :memory: to persist across subprocess calls, and reduce threads to 1 to avoid concurrent commit errors
- Spark: install dbt-spark[PyHive] extras required for thrift connection method
- Dremio: add dremio__target_database() dispatch override in e2e project to return target.database (upstream elementary package lacks this dispatch)

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)
109-113: ⚠️ Potential issue | 🟠 Major: Fail fast when readiness polling times out.
Line 109, Line 124, and Line 141 loops still continue even if services never become ready. This can allow later steps to run against unhealthy dependencies and fail non-deterministically.
Suggested fix pattern (apply to all three loops)
```diff
-for i in $(seq 1 60); do
-  curl -sf http://localhost:8086/v1/info && break
+ready=0
+for i in $(seq 1 60); do
+  if curl -sf http://localhost:8086/v1/info > /dev/null; then
+    ready=1
+    break
+  fi
   echo "Waiting for Trino... ($i/60)"
   sleep 5
 done
+if [ "$ready" -ne 1 ]; then
+  echo "Timed out waiting for Trino"
+  exit 1
+fi
```

Also applies to: 124-128, 141-145
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test-warehouse.yml around lines 109 - 113, The readiness polling loops that use "for i in $(seq 1 60); do curl -sf ... && break; ... done" should fail fast when the max attempts are reached; update each such loop (the Trino/other service readiness loops) to check after sleeping whether i equals the max (e.g., 60) and, if so, print an error to stderr and exit 1 so the workflow stops rather than continuing with unhealthy dependencies. Ensure the same pattern is applied to all three polling loops to provide deterministic failure on timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/test-warehouse.yml:
- Around line 109-113: The readiness polling loops that use "for i in $(seq 1
60); do curl -sf ... && break; ... done" should fail fast when the max attempts
are reached; update each such loop (the Trino/other service readiness loops) to
check after sleeping whether i equals the max (e.g., 60) and, if so, print an
error to stderr and exit 1 so the workflow stops rather than continuing with
unhealthy dependencies. Ensure the same pattern is applied to all three polling
loops to provide deterministic failure on timeout.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- .github/workflows/test-warehouse.yml
- tests/e2e_dbt_project/macros/system/dremio_target_database.sql
- tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/profiles/profiles.yml.j2
Plain S3 sources in Dremio do not support CREATE TABLE (needed for dbt seed). Switch to Nessie catalog source which supports table creation via Iceberg. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/e2e_dbt_project/docker/dremio/dremio-setup.sh (1)
15-31: ⚠️ Potential issue | 🟠 Major: Fail fast on login/source API errors and reject "null" tokens.

At Line 20, checking only -z misses the literal "null" token case; and Lines 15-31 do not validate HTTP status for either API call, so setup can report success on non-2xx responses.

🔧 Suggested patch
```diff
-# Log in to Dremio to get the auth token
-AUTH_TOKEN=$(curl -s -X POST "http://dremio:9047/apiv2/login" \
-  -H "Content-Type: application/json" \
-  --data "{\"userName\":\"dremio\", \"password\":\"dremio123\"}" | jq -r .token)
+# Log in to Dremio to get the auth token
+LOGIN_RESPONSE="$(mktemp)"
+LOGIN_HTTP_CODE=$(curl -sS -o "$LOGIN_RESPONSE" -w "%{http_code}" -X POST "http://dremio:9047/apiv2/login" \
+  -H "Content-Type: application/json" \
+  --data "{\"userName\":\"dremio\", \"password\":\"dremio123\"}")
+if [ "$LOGIN_HTTP_CODE" -lt 200 ] || [ "$LOGIN_HTTP_CODE" -ge 300 ]; then
+  echo "Failed to log in to Dremio (HTTP $LOGIN_HTTP_CODE)"
+  cat "$LOGIN_RESPONSE"
+  exit 1
+fi
+AUTH_TOKEN=$(jq -r '.token // empty' "$LOGIN_RESPONSE")

 # Check if AUTH_TOKEN is not empty
-if [ -z "$AUTH_TOKEN" ]; then
+if [ -z "$AUTH_TOKEN" ] || [ "$AUTH_TOKEN" = "null" ]; then
   echo "Failed to obtain Dremio auth token"
   exit 1
 fi
 echo "Obtained Dremio auth token"

 # Create a Nessie catalog source in Dremio (supports CREATE TABLE for dbt seed)
-curl -s -X PUT "http://dremio:9047/apiv2/source/NessieSource" \
+SOURCE_RESPONSE="$(mktemp)"
+SOURCE_HTTP_CODE=$(curl -sS -o "$SOURCE_RESPONSE" -w "%{http_code}" -X PUT "http://dremio:9047/apiv2/source/NessieSource" \
   -H "Content-Type: application/json" \
   -H "Authorization: _dremio$AUTH_TOKEN" \
-  --data "{\"name\":\"NessieSource\",\"config\":{\"nessieEndpoint\":\"http://catalog:19120/api/v2\",\"nessieAuthType\":\"NONE\",\"credentialType\":\"ACCESS_KEY\",\"awsAccessKey\":\"admin\",\"awsAccessSecret\":\"password\",\"awsRootPath\":\"datalake\",\"secure\":false,\"propertyList\":[{\"name\":\"fs.s3a.path.style.access\",\"value\":\"true\"},{\"name\":\"fs.s3a.endpoint\",\"value\":\"dremio-storage:9000\"},{\"name\":\"dremio.s3.compat\",\"value\":\"true\"}]},\"type\":\"NESSIE\",\"metadataPolicy\":{\"deleteUnavailableDatasets\":true,\"autoPromoteDatasets\":false,\"namesRefreshMillis\":3600000,\"datasetDefinitionRefreshAfterMillis\":3600000,\"datasetDefinitionExpireAfterMillis\":10800000,\"authTTLMillis\":86400000,\"updateMode\":\"PREFETCH_QUERIED\"}}"
+  --data "{\"name\":\"NessieSource\",\"config\":{\"nessieEndpoint\":\"http://catalog:19120/api/v2\",\"nessieAuthType\":\"NONE\",\"credentialType\":\"ACCESS_KEY\",\"awsAccessKey\":\"admin\",\"awsAccessSecret\":\"password\",\"awsRootPath\":\"datalake\",\"secure\":false,\"propertyList\":[{\"name\":\"fs.s3a.path.style.access\",\"value\":\"true\"},{\"name\":\"fs.s3a.endpoint\",\"value\":\"dremio-storage:9000\"},{\"name\":\"dremio.s3.compat\",\"value\":\"true\"}]},\"type\":\"NESSIE\",\"metadataPolicy\":{\"deleteUnavailableDatasets\":true,\"autoPromoteDatasets\":false,\"namesRefreshMillis\":3600000,\"datasetDefinitionRefreshAfterMillis\":3600000,\"datasetDefinitionExpireAfterMillis\":10800000,\"authTTLMillis\":86400000,\"updateMode\":\"PREFETCH_QUERIED\"}}")
+if [ "$SOURCE_HTTP_CODE" -lt 200 ] || [ "$SOURCE_HTTP_CODE" -ge 300 ]; then
+  echo "Failed to create Nessie Source in Dremio (HTTP $SOURCE_HTTP_CODE)"
+  cat "$SOURCE_RESPONSE"
+  exit 1
+fi
```

```shell
#!/bin/bash
# Verify login/source calls capture and validate HTTP status and token nullability.
FILE="$(fd 'dremio-setup.sh' | head -n1)"
nl -ba "$FILE" | sed -n '12,90p'
rg -n 'apiv2/login|apiv2/source/NessieSource|http_code|AUTH_TOKEN|jq -r| -z ' "$FILE"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh` around lines 15 - 31, The login and source API calls must validate HTTP status and reject a literal "null" token; modify the AUTH_TOKEN retrieval block to capture both response body and HTTP status from the login curl (use curl -w to produce HTTP code and capture body separately), then extract token with jq -r .token and fail if the HTTP status is not 2xx or if AUTH_TOKEN is empty or equals "null" (print the response body/error and exit 1); do the same for the PUT creating NessieSource: capture response body and HTTP status, and exit non-zero with the response on any non-2xx status. Ensure you reference the existing AUTH_TOKEN variable, the login curl to "apiv2/login", and the PUT to "apiv2/source/NessieSource" when making these checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh`:
- Around line 7-10: The readiness loop that uses "until curl -s
http://dremio:9047; do ... done" must be bounded: add a MAX_ATTEMPTS variable
and a counter, set curl timeouts (e.g., --connect-timeout and/or --max-time) so
each try fails fast, increment the counter inside the loop, and when attempts >=
MAX_ATTEMPTS log an error and exit non-zero; update the existing until/curl loop
code in the script to use these variables and checks so CI will fail instead of
hanging indefinitely.
---
Duplicate comments:
In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh`:
- Around line 15-31: The login and source API calls must validate HTTP status
and reject a literal "null" token; modify the AUTH_TOKEN retrieval block to
capture both response body and HTTP status from the login curl (use curl -w to
produce HTTP code and capture body separately), then extract token with jq -r
.token and fail if the HTTP status is not 2xx or if AUTH_TOKEN is empty or
equals "null" (print the response body/error and exit 1); do the same for the
PUT creating NessieSource: capture response body and HTTP status, and exit
non-zero with the response on any non-2xx status. Ensure you reference the
existing AUTH_TOKEN variable, the login curl to "apiv2/login", and the PUT to
"apiv2/source/NessieSource" when making these checks.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tests/e2e_dbt_project/docker/dremio/dremio-setup.sh
- tests/profiles/profiles.yml.j2
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/profiles/profiles.yml.j2
```shell
until curl -s http://dremio:9047; do
  echo "Waiting for Dremio..."
  sleep 5
done
```
🧩 Analysis chain
🏁 Script executed:
fd 'dremio-setup.sh' --type f

Repository: elementary-data/elementary
Length of output: 121
🏁 Script executed:
cat -n tests/e2e_dbt_project/docker/dremio/dremio-setup.sh

Repository: elementary-data/elementary
Length of output: 1945
Bound the Dremio readiness wait to avoid CI hangs.
Lines 7-10 create an unbounded loop that retries indefinitely if Dremio is unreachable or if curl cannot connect. This can stall the CI pipeline with no failure signal. The loop needs an explicit max-attempts limit, a timeout on curl, and a hard exit on failure.
🔧 Suggested patch
+max_attempts=60
+attempt=1
-until curl -s http://dremio:9047; do
+until curl --silent --show-error --fail --max-time 3 http://dremio:9047 >/dev/null; do
+ if [ "$attempt" -ge "$max_attempts" ]; then
+ echo "Dremio did not become ready after $max_attempts attempts"
+ exit 1
+ fi
echo "Waiting for Dremio..."
sleep 5
+ attempt=$((attempt + 1))
done

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Original:

```shell
until curl -s http://dremio:9047; do
  echo "Waiting for Dremio..."
  sleep 5
done
```

Suggested:

```shell
max_attempts=60
attempt=1
until curl --silent --show-error --fail --max-time 3 http://dremio:9047 >/dev/null; do
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "Dremio did not become ready after $max_attempts attempts"
    exit 1
  fi
  echo "Waiting for Dremio..."
  sleep 5
  attempt=$((attempt + 1))
done
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/e2e_dbt_project/docker/dremio/dremio-setup.sh` around lines 7 - 10, The
readiness loop that uses "until curl -s http://dremio:9047; do ... done" must be
bounded: add a MAX_ATTEMPTS variable and a counter, set curl timeouts (e.g.,
--connect-timeout and/or --max-time) so each try fails fast, increment the
counter inside the loop, and when attempts >= MAX_ATTEMPTS log an error and exit
non-zero; update the existing until/curl loop code in the script to use these
variables and checks so CI will fail instead of hanging indefinitely.
- Make generate_data.py deterministic (fixed random seed)
- Use fixed schema name for Docker adapters (ephemeral containers)
- Cache seeded Docker volumes between runs using actions/cache
- Cache DuckDB database file between runs
- Skip dbt seed on cache hit, restoring from cached volumes instead
- Applies to: Spark, Trino, Dremio, Postgres, ClickHouse, DuckDB

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)
109-113: ⚠️ Potential issue | 🟠 Major: Readiness loops still don’t fail on timeout.
These checks can still silently continue after 60 attempts if the service never becomes healthy, which allows downstream steps to run against broken dependencies.
Also applies to: 124-128, 141-145
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test-warehouse.yml around lines 109 - 113, The readiness loops using "for i in $(seq 1 60); do curl -sf http://localhost:8086/v1/info && break; ... done" do not fail when the service never becomes healthy; update each such loop (the curl-based readiness loops at the three occurrences) to detect timeout and exit non-zero if the loop completes without success — e.g., after the done add a check like "if [ $i -eq 60 ] && ! curl -sf http://localhost:8086/v1/info; then echo 'Service did not become ready' >&2; exit 1; fi" (or equivalent) so the workflow fails instead of continuing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 249-264: The "Restore cached seed data into Docker volumes"
workflow step currently restores tarballs into volumes after the warehouse
services have started (uses CACHE_DIR, docker volume create, docker run ... tar
xzf), which can corrupt live DB files; move this step to run before the job step
that starts warehouse containers (the service startup block) so volumes are
populated prior to container creation, or alternatively modify the step to stop
the warehouse containers before restore and then restart and run a readiness
check (e.g., wait-for/postgres healthcheck) after extraction to ensure services
reinitialize cleanly.
---
Duplicate comments:
In @.github/workflows/test-warehouse.yml:
- Around line 109-113: The readiness loops using "for i in $(seq 1 60); do curl
-sf http://localhost:8086/v1/info && break; ... done" do not fail when the
service never becomes healthy; update each such loop (the curl-based readiness
loops at the three occurrences) to detect timeout and exit non-zero if the loop
completes without success — e.g., after the done add a check like "if [ $i -eq
60 ] && ! curl -sf http://localhost:8086/v1/info; then echo 'Service did not
become ready' >&2; exit 1; fi" (or equivalent) so the workflow fails instead of
continuing.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- .github/workflows/test-warehouse.yml
- tests/e2e_dbt_project/generate_data.py
Addresses CodeRabbit review: restoring cached tarballs into Docker volumes while containers are already running risks data corruption. Now the cache key computation and volume restore happen before any Docker services are started, so containers initialise with the pre-seeded data. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/test-warehouse.yml (1)
160-164: ⚠️ Potential issue | 🟠 Major: Readiness loops still don’t fail the job on timeout.

On Line 160-164, Line 175-179, and Line 192-196, the loops stop on success but never exit 1 when retries are exhausted. The workflow can continue with unhealthy dependencies and fail later with misleading errors.

Fail-fast pattern to apply to each loop

```diff
-for i in $(seq 1 60); do
-  curl -sf http://localhost:8086/v1/info && break
-  echo "Waiting for Trino... ($i/60)"
-  sleep 5
-done
+ready=0
+for i in $(seq 1 60); do
+  if curl -sf http://localhost:8086/v1/info > /dev/null; then
+    ready=1
+    break
+  fi
+  echo "Waiting for Trino... ($i/60)"
+  sleep 5
+done
+if [ "$ready" -ne 1 ]; then
+  echo "Timed out waiting for Trino"
+  exit 1
+fi
```

Use the same timeout-fail pattern for Dremio and Spark checks.
Also applies to: 175-179, 192-196
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/test-warehouse.yml around lines 160 - 164, The readiness loops that poll services (the Trino loop shown and the similar Dremio and Spark loops) currently break on success but do nothing when retries are exhausted; update each loop so that if the curl never succeeds within the retries it fails the job (exit 1). Concretely, after each for/seq loop (the Trino, Dremio, and Spark readiness checks), check whether the last curl succeeded and, if not, print an error and run exit 1 (or set a failure flag inside the loop and exit after the loop) so the workflow stops immediately on timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/test-warehouse.yml:
- Around line 96-103: The SEED_HASH calculation (the SEED_HASH env var block)
currently hashes generate_data.py, dbt_project.yml and the profiles.yml.j2
template but omits the Docker stack/compose definition, allowing cached DB
volume tarballs to be restored against incompatible images or volume layouts;
update the SEED_HASH command to also include the Docker stack/compose file(s)
used to bring up the test DB (e.g., docker-compose.yml,
docker-compose.override.yml or any stack definition referenced by the workflow)
so changes in images/layouts change the cache key and prevent restoring
incompatible tarballs.
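A minimal sketch of a cache key that also covers the compose file, as that comment suggests. The function name and file list are illustrative assumptions; the real workflow step may hash different paths or inline the command:

```shell
# compute_seed_hash: derive one cache key from every file that affects the
# seeded data, including the Docker stack definition, so a change to any of
# them invalidates cached volume tarballs. Helper name is illustrative.
compute_seed_hash() {
  cat "$@" | sha256sum | cut -d' ' -f1
}

# Example call mirroring the files the review mentions (paths assumed):
# SEED_HASH=$(compute_seed_hash \
#   tests/e2e_dbt_project/generate_data.py \
#   tests/e2e_dbt_project/dbt_project.yml \
#   tests/profiles/profiles.yml.j2 \
#   tests/e2e_dbt_project/docker-compose.yml)
```

Because the hash is over concatenated file contents, editing the compose file (new image tags, volume layout changes) produces a different key and prevents restoring an incompatible tarball.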
---
Duplicate comments:
In @.github/workflows/test-warehouse.yml:
- Around line 160-164: The readiness loops that poll services (the Trino loop
shown and the similar Dremio and Spark loops) currently break on success but do
nothing when retries are exhausted; update each loop so that if the curl never
succeeds within the retries it fails the job (exit 1). Concretely, after each
for/seq loop (the Trino, Dremio, and Spark readiness checks), check whether the
last curl succeeded and, if not, print an error and run exit 1 (or set a failure
flag inside the loop and exit after the loop) so the workflow stops immediately
on timeout.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
.github/workflows/test-warehouse.yml
… loops Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ching Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…upport

Points to dbt-data-reliability#948 which adds:
- trino__full_name_split (1-based array indexing)
- trino__edr_get_create_table_as_sql (bypass model.config issue)
- spark__edr_get_create_table_as_sql

TODO: revert after dbt-data-reliability#948 is merged

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Prevents tar race condition where ClickHouse temporary merge files disappear during archiving, causing 'No such file or directory' errors. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ache After stopping containers for volume archiving and restarting them, services like Trino need time to reinitialize. Added per-adapter health checks to wait for readiness before proceeding. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
The /v1/info endpoint returns HTTP 200 even when Trino is still initializing. Check for '"starting":false' in the response body to ensure Trino is fully ready before proceeding. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…rino Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…d for CI Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…24 and hive images) Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…io123) Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ults Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…_period for CI Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…code-from) Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ucceeds Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…incompatibility) Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…w on master) Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
dbt-dremio's generate_schema_name uses dot separation for nested Nessie namespaces (e.g. elementary_tests.elementary), not underscore concatenation (elementary_tests_elementary). The CLI profile must match the namespace path created by the e2e project's dbt run. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Dremio's Calcite-based SQL parser treats 'snapshots' as a reserved keyword, causing 'Encountered ", snapshots" at line 6, column 6' error in the populate_model_alerts_query post-hook. Renamed to 'snapshots_data'. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Dremio's Calcite-based SQL parser treats 'filter' as a reserved keyword, causing 'Encountered ". filter" at line 52' error in the populate_source_freshness_alerts_query post-hook. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…case issue Snowflake stores columns as UPPERCASE, so quoting as "filter" (lowercase) breaks column resolution. Only quote for Dremio where it's a reserved keyword. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
dbt-dremio's dateadd macro calls interval.replace() which fails when interval is an integer. This override casts to string first. Upstream bug in dbt-dremio's macros/utils/date_spine.sql. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…ERY error dbt-dremio's dateadd wraps result in 'select TIMESTAMPADD(...)' which creates a scalar subquery when embedded in larger SQL. Dremio's Calcite parser rejects multi-field RECORDTYPE in scalar subquery context. Output just TIMESTAMPADD(...) as a plain expression instead. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…xception Dremio's Gandiva (Arrow execution engine) cannot parse ISO 8601 timestamps with the 'Z' UTC timezone suffix (e.g. '2026-03-02T22:50:42.101Z'). This causes 'Invalid timestamp or unknown zone' errors during edr monitor report. Override dremio__edr_cast_as_timestamp in the monitor project to strip the 'Z' suffix before casting. Also add dispatch config so elementary_cli macros take priority over the elementary package for adapter-dispatched macros. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Gandiva rejects both 'Z' suffix and 'T' separator in ISO 8601 timestamps. Normalize '2026-03-02T23:31:12.443Z' to '2026-03-02 23:31:12.443'. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
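A minimal Python sketch of that normalization (the actual fix lives in the `dremio__edr_cast_as_timestamp` macro override, not in Python):

```python
import re

def normalize_for_gandiva(ts: str) -> str:
    """Normalize an ISO 8601 timestamp for Dremio's Gandiva engine.

    Gandiva rejects both the 'Z' UTC suffix and the 'T' date/time
    separator, so '2026-03-02T23:31:12.443Z' must become
    '2026-03-02 23:31:12.443' before casting to timestamp.
    """
    ts = re.sub(r"Z$", "", ts)      # drop the UTC suffix
    return ts.replace("T", " ", 1)  # ISO separator -> space

print(normalize_for_gandiva("2026-03-02T23:31:12.443Z"))
# → 2026-03-02 23:31:12.443
```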
…imestamp text Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…for Dremio Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Dremio's Calcite SQL parser reserves ROW_NUMBER and COUNT as keywords. These were used as unquoted column aliases in: - get_models_latest_invocation.sql - get_models_latest_invocations_data.sql - can_upload_source_freshness.sql Applied Dremio-specific double-quoting via target.type conditional, same pattern used for 'filter' and 'snapshots' reserved keywords. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
  where {{ elementary.edr_datediff(elementary.edr_cast_as_timestamp('generated_at'), elementary.edr_current_timestamp(), 'day') }} < {{ days_back }}
  )
- select count(*) as count
+ select count(*) as {% if target.type == 'dremio' %}"count"{% else %}count{% endif %}
do you think we can use elementary.escape_reserved_keywords? It was created for this purpose
(its code is in the dbt-data-reliability repo)
Good call! Refactored all manual {% if target.type == 'dremio' %} quoting to use elementary.escape_reserved_keywords() instead. This covers:
- `row_number` alias (2 files)
- `count` alias (1 file)
- `filter` column reference (2 files)
Also created dbt-data-reliability#955 to add row_number and snapshots to the reserved keywords list. CI is temporarily pinned to that branch.
…ed words
Replace manual {% if target.type == 'dremio' %} quoting with the existing
elementary.escape_reserved_keywords() utility from dbt-data-reliability.
Files updated:
- get_models_latest_invocation.sql: row_number alias
- get_models_latest_invocations_data.sql: row_number alias
- can_upload_source_freshness.sql: count alias
- source_freshness_alerts.sql: filter column reference
- get_source_freshness_results.sql: filter column reference
Also temporarily pins dbt-data-reliability to branch with row_number
and snapshots added to the reserved keywords list (PR #955).
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Replace USING (unique_id) with explicit ON clause and select specific columns instead of SELECT * to avoid ambiguous column references in Dremio's SQL engine, which doesn't deduplicate join columns with USING. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…N join The switch from USING to ON for Dremio compatibility requires qualifying column references since ON doesn't deduplicate join columns like USING does. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
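The ambiguity this commit fixes is easy to reproduce in any engine that, like Dremio, does not coalesce join columns for `ON` joins — here with stdlib sqlite3 and toy versions of the two tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE run_results (unique_id TEXT, status TEXT);
    CREATE TABLE models      (unique_id TEXT, name   TEXT);
    INSERT INTO run_results VALUES ('m1', 'success');
    INSERT INTO models      VALUES ('m1', 'orders');
""")

# USING deduplicates the join column, so the bare name resolves fine.
rows = con.execute(
    "SELECT unique_id, status, name FROM run_results JOIN models USING (unique_id)"
).fetchall()
print(rows)  # → [('m1', 'success', 'orders')]

# With ON, the unqualified column is ambiguous -- the error Dremio raised.
try:
    con.execute(
        "SELECT unique_id FROM run_results JOIN models "
        "ON run_results.unique_id = models.unique_id"
    )
except sqlite3.OperationalError as e:
    print(e)  # ambiguous column name: unique_id

# The fix: qualify every reference when joining with ON.
rows = con.execute(
    "SELECT run_results.unique_id, status, name FROM run_results "
    "JOIN models ON run_results.unique_id = models.unique_id"
).fetchall()
print(rows)  # → [('m1', 'success', 'orders')]
```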
Summary
Adds full CI infrastructure and CLI support for DuckDB, Trino, Dremio, and Spark adapters to elementary-cli. All 11 warehouse tests (including the 4 new ones) plus code-quality now pass in CI.
Key changes:
- `pyproject.toml`: Added `dbt-duckdb` and `dbt-dremio` as optional dependencies
- `profiles.yml.j2`: Added profile targets for all 4 adapters (DuckDB file-based, Trino with Iceberg, Dremio with Nessie enterprise catalog, Spark Thrift Server with Delta)
- `docker-compose.yml`: Added Docker service stacks for Trino (Iceberg + Hive Metastore + MinIO), Dremio (Nessie + MinIO), Spark (Thrift Server + Delta Lake)
- Test assets: renamed reserved-word aliases (`min` → `min_val`, `max` → `max_val`, `sum` → `sum_val`, `one` → `col1`), updated `data_type` mappings, excluded `schema_changes` for Spark, added adapter-specific schema drop dispatches
- Added a `get_adapter_unique_id` dispatch for DuckDB

Dremio-specific workarounds (dbt-dremio adapter limitations):
- `--exclude tag:ephemeral_model` on `dbt test` — dbt-dremio doesn't support ephemeral model CTEs
- `enterprise_catalog_folder` instead of `schema` to avoid duplicate-key errors from dbt-dremio's alias chain
- Single-threaded execution (`--threads 1`) to avoid Nessie catalog race conditions

Updates since last revision
- Adopted `elementary.escape_reserved_keywords()` — replaced `{% if target.type == 'dremio' %}..."keyword"...{% else %}keyword{% endif %}` patterns with the centralized utility macro in 5 files:
  - `row_number` alias in `get_models_latest_invocation.sql` and `get_models_latest_invocations_data.sql`
  - `count` alias in `can_upload_source_freshness.sql`
  - `filter` column reference in `source_freshness_alerts.sql` and `get_source_freshness_results.sql`
- Added `row_number` and `snapshots` to the Dremio reserved keywords list — dbt-data-reliability#955 (merged).
- Fixed the `Column 'unique_id' is ambiguous` error — replaced `USING (unique_id)` with an explicit `ON run_results.unique_id = models.unique_id` join and qualified column references. Dremio (and some other databases) don't deduplicate join columns when using `ON` instead of `USING`, so unqualified references became ambiguous.
- Fixed the `invocation_id is ambiguous` error — qualified all column references in the `get_models_latest_invocations_data.sql` final SELECT to avoid ambiguity with the `ON` join.
- Removed the `continue-on-error` workaround for Dremio edr steps. Now that PR #2137 (ref() macro override for dbt-core 1.11 compatibility) is merged to master, the Dremio edr monitor/report/send-report/validate-alerts/e2e steps are fully validated in CI without any silent failures.
- Merged latest master into this branch to pick up the fix.
- Renamed the `snapshots` CTE to `snapshots_data` in `model_alerts.sql` — `snapshots` is reserved in Dremio's Calcite parser
- `elementary/monitor/dbt_project/macros/overrides/dateadd.sql`: cast the interval to string before `.replace()` (dbt-dremio calls it on an int, causing an AttributeError) and drop the `select` prefix from the TIMESTAMPADD output (dbt-dremio wraps it in a scalar subquery, causing `$SCALAR_QUERY` errors in expression contexts)
- The CLI profile uses `{{ schema_name }}.elementary` instead of an underscore to match how the elementary tables are created
- Removed the `dbt-data-reliability` branch pin now that dbt-data-reliability#948 and #955 are merged.
- `projectnessie/nessie:0.80.0` no longer exists on Docker Hub; now uses `ghcr.io/projectnessie/nessie:0.80.0`.
- Healthchecks no longer depend on `curl`/`nc`; replaced with `bash -c 'echo > /dev/tcp/localhost/PORT'`.
- `dremio-setup.sh` default password aligned to `dremio123` to match `docker-compose.yml`.
- `dremio-setup` now uses `--exit-code-from` to wait for NessieSource creation before the external seeder runs.
- Added a `nessie: condition: service_healthy` dependency to ensure the Nessie catalog is ready.
- Extended the `hive-metastore` healthcheck timeout: 60 retries with a 60s start_period for CI environments.

Review & Testing Checklist for Human
- `USING` → `ON` join refactoring affects all adapters — the change from `USING (unique_id)` to an explicit `ON` join with qualified column references fixes Dremio's ambiguous column error, but applies to ALL adapters. CI passes for all 16 checks, but verify edge cases around column resolution don't break for specific adapters.
- `escape_reserved_keywords()` utility macro dependency — the refactored code depends on dbt-data-reliability having `row_number` and `snapshots` in the Dremio reserved keywords list (PR #955, now merged). Verify the elementary package correctly picks up the merged version.
- CTE rename (`snapshots` → `snapshots_data`) — this change in `model_alerts.sql` affects ALL adapters, not just Dremio. Verify no downstream references to the old CTE name exist.
- `elementary/monitor/dbt_project/macros/overrides/dateadd.sql` fixes two dbt-dremio bugs (int interval AttributeError + scalar subquery wrapping). Verify this doesn't break other date arithmetic operations or introduce edge cases with different interval types.
- The CLI profile uses `{{ schema_name }}.elementary` instead of an underscore. Verify elementary CLI can correctly resolve this schema for Dremio connections.
- The `continue-on-error` workaround is now removed, so Dremio's edr monitor/report/send-report steps are fully tested in CI. All 16 CI checks pass, confirming the ref() override + dateadd override + reserved keyword fixes work end-to-end.

Suggested Test Plan
- Verify the Dremio edr steps pass in CI without `continue-on-error`
- Verify the CTE rename (`snapshots_data`) doesn't break other adapters in CI
- Verify `escape_reserved_keywords()` works correctly for Dremio and doesn't affect other adapters

Notes
- All passwords used in the Docker configs (`minio123`, `dremio123`) are local-only test credentials, not real secrets
- The `data_type` mappings in `schema.yml` assume Spark/BigQuery/Databricks/Athena use `STRING` and Trino/Dremio/DuckDB use `varchar`
- Spark excludes `schema_changes` tests like Databricks (same limitation)
- Fixed `python-dev` → `python3-dev` for Spark requirements (existing bug)
- DuckDB uses `threads: 1` to match dbt-duckdb's concurrency model
- Uses a fixed schema `elementary_tests` to enable seed caching (safe because the concurrency group prevents parallel runs)
- The dateadd override fixes two dbt-dremio bugs: (1) `interval.replace()` called on an int parameter, (2) `select TIMESTAMPADD(...)` wraps in a scalar subquery causing Calcite parser errors in expression contexts
- The `escape_reserved_keywords()` macro dispatches to adapter-specific implementations — for Dremio it wraps keywords in double quotes; for other adapters it's a no-op

Requested by: @haritamar
Link to Devin Session: https://app.devin.ai/sessions/6c12a19d456541568bcd7974623f3557