feat: add daily cleanup workflow for stale CI schemas #2128
Reuses the `elementary.drop_stale_ci_schemas` macro from `dbt-data-reliability` (checked out at workflow time) to drop `py_`-prefixed CI schemas older than 24 hours from cloud warehouses. Runs weekly on Sunday 03:00 UTC.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
📝 Walkthrough

Adds a GitHub Actions workflow that runs daily (03:00 UTC) and on-demand to remove stale CI schemas across five warehouse types by running a `dbt run-operation`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as "GitHub Actions"
    participant Runner as "CI Runner"
    participant Repo as "dbt-data-reliability repo"
    participant Env as "Runner env (Python/dbt)"
    participant DBT as "dbt run-operation"
    participant Warehouse as "Warehouse (snowflake / bigquery / redshift / databricks / athena)"
    GH->>Runner: trigger (cron daily 03:00 UTC or manual)
    Runner->>Runner: validate MAX_AGE_HOURS input (non-negative integer)
    Runner->>Runner: verify CI_WAREHOUSE_SECRETS present
    Runner->>Repo: checkout repo
    Runner->>Env: setup Python 3.10 & pip cache
    Runner->>Env: install warehouse-specific dbt package
    Runner->>Env: generate profiles.yml and install test deps
    Runner->>Env: symlink local elementary package into tests
    Runner->>DBT: run-operation drop_stale_ci_schemas (prefixes:["py_"], max_age_hours)
    DBT->>Warehouse: connect to target and drop stale schemas
    Warehouse-->>DBT: operation result
    DBT-->>Runner: return result
    Runner-->>GH: job status (matrix per warehouse)
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🧹 Nitpick comments (1)
.github/workflows/cleanup-stale-schemas.yml (1)
**52-60: Fail fast if `CI_WAREHOUSE_SECRETS` is not configured.**

Line 53 silently falls back to an empty string. Add an explicit guard so failures are immediate and actionable instead of surfacing later as opaque dbt/profile errors.

✅ Suggested fix

```diff
  - name: Write dbt profiles
    env:
      CI_WAREHOUSE_SECRETS: ${{ secrets.CI_WAREHOUSE_SECRETS || '' }}
    run: |
+     if [ -z "${CI_WAREHOUSE_SECRETS}" ]; then
+       echo "::error::Missing required secret: CI_WAREHOUSE_SECRETS"
+       exit 1
+     fi
      # The cleanup job doesn't create schemas, but generate_profiles.py
      # requires --schema-name. Use a dummy value.
      python "${{ github.workspace }}/dbt-data-reliability/integration_tests/profiles/generate_profiles.py" \
        --template "${{ github.workspace }}/dbt-data-reliability/integration_tests/profiles/profiles.yml.j2" \
        --output ~/.dbt/profiles.yml \
        --schema-name "cleanup_placeholder"
```
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments, in `.github/workflows/cleanup-stale-schemas.yml`:

- **Lines 71-74:** Validate and sanitize the `inputs.max-age-hours` value before interpolation: ensure it contains only digits (and optionally enforce a min/max, e.g. 1-168), coerce or fall back to `'24'` if invalid, and pass the sanitized value as an environment variable (or separate shell argument) to the `dbt run-operation elementary.drop_stale_ci_schemas` call instead of embedding the raw expression inside the quoted `--args` string. Also keep the `matrix.warehouse-type` token intact when passing `-t "${{ matrix.warehouse-type }}"`.
- **Lines 33-38:** The checkout step that uses `actions/checkout@v4` to grab repository `elementary-data/dbt-data-reliability` should be pinned to an immutable ref instead of the moving default branch. Update the "Checkout dbt package" step to include a `ref` set to a commit SHA or permanent tag (e.g. `ref: '<commit-sha-or-tag>'`) so the workflow always runs a known immutable revision before executing destructive macros that drop schemas.
- **Lines 45-49:** The "Install dbt" step currently installs unpinned packages (`dbt-core` and `dbt-${{ ... }}`), which can cause breakage. Pin both `dbt-core` and the chosen adapter to the tested dbt 1.8 series (e.g. `dbt-core==1.8.0` and `dbt-${{ (matrix.warehouse-type == 'databricks_catalog' && 'databricks') || (matrix.warehouse-type == 'athena' && 'athena-community') || matrix.warehouse-type }}==1.8.0`) so the matrix variable expansion still selects the correct adapter but with a fixed version.
📒 Files selected for processing (1)
.github/workflows/cleanup-stale-schemas.yml
The `drop_stale_ci_schemas` macro moved from the main `elementary` package to the `integration_tests` project in `dbt-data-reliability`.

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
♻️ Duplicate comments (3)
.github/workflows/cleanup-stale-schemas.yml (3)
**33-38: ⚠️ Potential issue | 🔴 Critical: Pin external checkout to an immutable ref before running schema-drop logic.**

Line 36 currently tracks a moving branch; this can change runtime behavior unexpectedly for a destructive cleanup job.

🔒 Suggested fix

```diff
  - name: Checkout dbt package
    uses: actions/checkout@v4
    with:
      repository: elementary-data/dbt-data-reliability
+     ref: <immutable-tag-or-commit-sha>
      path: dbt-data-reliability
```
**69-74: ⚠️ Potential issue | 🟠 Major: Validate `max-age-hours` before building `--args`.**

Line 73 interpolates unsanitized user input directly into the command string.

🛡️ Suggested fix

```diff
  - name: Drop stale CI schemas
    working-directory: ${{ env.TESTS_DIR }}/dbt_project
-   run: >
-     dbt run-operation drop_stale_ci_schemas
-     --args '{prefixes: ["py_"], max_age_hours: ${{ inputs.max-age-hours || '24' }}}'
-     -t "${{ matrix.warehouse-type }}"
+   env:
+     MAX_AGE_HOURS: ${{ inputs.max-age-hours || '24' }}
+   run: |
+     if ! [[ "$MAX_AGE_HOURS" =~ ^[0-9]+$ ]]; then
+       echo "::error::max-age-hours must be a non-negative integer"
+       exit 1
+     fi
+     ARGS=$(printf '{"prefixes":["py_"],"max_age_hours":%s}' "$MAX_AGE_HOURS")
+     dbt run-operation drop_stale_ci_schemas \
+       --args "$ARGS" \
+       -t "${{ matrix.warehouse-type }}"
```
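As a standalone illustration of the guard in the suggested fix, a POSIX-portable digit check behaves like this. `is_valid_max_age` is a hypothetical helper (the workflow itself uses an inline bash `[[ ... =~ ^[0-9]+$ ]]` test); the sample values are illustrative:

```shell
#!/bin/sh
# Hypothetical helper mirroring the bash regex guard
# [[ "$MAX_AGE_HOURS" =~ ^[0-9]+$ ]] from the suggested fix,
# written with a POSIX `case` so it also works under plain sh.
is_valid_max_age() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty string, or contains a non-digit
    *) return 0 ;;
  esac
}

for v in 24 0 "-5" "24; drop schema x" ""; do
  if is_valid_max_age "$v"; then
    printf 'accept: %s\n' "$v"
  else
    printf 'reject: %s\n' "$v"
  fi
done
# accepts 24 and 0; rejects -5, the injection attempt, and the empty string
```

Only values that pass this check would ever reach the `--args` string, which closes the injection path the review flags.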
**45-49: ⚠️ Potential issue | 🟠 Major: Pin dbt-core and adapter versions for workflow stability.**

Lines 47-49 install floating versions, which can silently break this scheduled job when upstream releases land.

📦 Suggested fix

```diff
  - name: Install dbt
    run: >
      pip install
-     "dbt-core"
-     "dbt-${{ (matrix.warehouse-type == 'databricks_catalog' && 'databricks') || (matrix.warehouse-type == 'athena' && 'athena-community') || matrix.warehouse-type }}"
+     "dbt-core>=1.8,<1.9"
+     "dbt-${{ (matrix.warehouse-type == 'databricks_catalog' && 'databricks') || (matrix.warehouse-type == 'athena' && 'athena-community') || matrix.warehouse-type }}>=1.8,<1.9"
```
🧹 Nitpick comments (1)
.github/workflows/cleanup-stale-schemas.yml (1)
**52-54: Fail fast when `CI_WAREHOUSE_SECRETS` is missing.**

Using `|| ''` hides misconfiguration until later steps fail less clearly.

✅ Suggested improvement

```diff
  - name: Write dbt profiles
    env:
      CI_WAREHOUSE_SECRETS: ${{ secrets.CI_WAREHOUSE_SECRETS || '' }}
    run: |
+     if [ -z "$CI_WAREHOUSE_SECRETS" ]; then
+       echo "::error::Missing required secret: CI_WAREHOUSE_SECRETS"
+       exit 1
+     fi
      # The cleanup job doesn't create schemas, but generate_profiles.py
      # requires --schema-name. Use a dummy value.
      python "${{ github.workspace }}/dbt-data-reliability/integration_tests/profiles/generate_profiles.py" \
        --template "${{ github.workspace }}/dbt-data-reliability/integration_tests/profiles/profiles.yml.j2" \
        --output ~/.dbt/profiles.yml \
        --schema-name "cleanup_placeholder"
```
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
- Pin dbt-core and adapter versions to >=1.8,<1.10
- Validate max-age-hours input is a non-negative integer
- Fail fast when CI_WAREHOUSE_SECRETS secret is missing

Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments, in `.github/workflows/cleanup-stale-schemas.yml`:

- **Lines 45-49:** The "Install dbt" step currently installs unpinned packages (`dbt-core` and `dbt-${{ ... }}`); change these to pinned versions to match other workflows (e.g. `dbt-core==1.8.*`, with the adapter pinned similarly, such as `dbt-databricks==1.8.*` or `dbt-athena-community==1.8.*` depending on `matrix.warehouse-type`). Update the install command so it injects the appropriate pinned adapter package for each matrix value, mirroring the versioning pattern used by `inputs.dbt-version` in test-warehouse.yml/test-github-action.yml.
feat: add daily cleanup workflow for stale CI schemas
Summary
Adds a scheduled GitHub Actions workflow (`cleanup-stale-schemas.yml`) that runs daily at 03:00 UTC to drop stale CI schemas (`py_`-prefixed) from cloud warehouses. It can also be triggered manually with a configurable `max-age-hours` (default: 24h).

The workflow does not duplicate any cleanup logic. Instead, it checks out `dbt-data-reliability` at runtime and invokes the `drop_stale_ci_schemas` dbt macro added in the companion PR: elementary-data/dbt-data-reliability#943 (already merged).

How it works:

1. Checks out `dbt-data-reliability` (default branch, no pin)
2. Generates profiles from `dbt-data-reliability`'s template using `CI_WAREHOUSE_SECRETS`
3. Runs `dbt run-operation drop_stale_ci_schemas --args '{prefixes: ["py_"], max_age_hours: 24}'` per warehouse

Targets: snowflake, bigquery, redshift, databricks_catalog, athena (docker-only targets like postgres/clickhouse are ephemeral and don't need cleanup).
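The steps above correspond roughly to this job shape, assembled from the fragments quoted in the review comments (step names and matrix values follow the review excerpts; the adapter-name mapping and other details are simplified, so treat this as an illustrative sketch rather than the exact file):

```yaml
jobs:
  cleanup:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        warehouse-type: [snowflake, bigquery, redshift, databricks_catalog, athena]
    steps:
      - name: Checkout dbt package
        uses: actions/checkout@v4
        with:
          repository: elementary-data/dbt-data-reliability
          path: dbt-data-reliability
      - name: Install dbt
        # The real step maps databricks_catalog -> dbt-databricks and
        # athena -> dbt-athena-community; simplified here.
        run: pip install "dbt-core" "dbt-${{ matrix.warehouse-type }}"
      - name: Write dbt profiles
        env:
          CI_WAREHOUSE_SECRETS: ${{ secrets.CI_WAREHOUSE_SECRETS }}
        run: |
          # The cleanup job doesn't create schemas, but generate_profiles.py
          # requires --schema-name, so a dummy value is passed.
          python dbt-data-reliability/integration_tests/profiles/generate_profiles.py \
            --template dbt-data-reliability/integration_tests/profiles/profiles.yml.j2 \
            --output ~/.dbt/profiles.yml \
            --schema-name "cleanup_placeholder"
      - name: Drop stale CI schemas
        run: >
          dbt run-operation drop_stale_ci_schemas
          --args '{prefixes: ["py_"], max_age_hours: ${{ inputs.max-age-hours || '24' }}}'
          -t "${{ matrix.warehouse-type }}"
```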
Updates since last revision

- Changed the schedule from `0 3 * * 0` (Sundays) to `0 3 * * *` (daily).

Review & Testing Checklist for Human
- Verify `CI_WAREHOUSE_SECRETS` is available as a repo secret in `elementary` (it should be — existing test workflows use it).
- `--args`: Confirm the `--args '{prefixes: ["py_"], max_age_hours: ...}'` interpolation works correctly in GitHub Actions, especially the `${{ inputs.max-age-hours || '24' }}` expression nested inside single quotes.
- Trigger a `workflow_dispatch` against one warehouse (e.g. snowflake) with a small `max-age-hours` value. Check the job logs to confirm schemas are listed and stale ones are dropped.

Notes
- The `dbt-data-reliability` checkout is unpinned (tracks default branch HEAD). This is intentional for a daily maintenance job but means macro changes in that repo will be picked up automatically.

Requested by: @haritamar
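For reference, the triggers described above (daily cron plus manual dispatch with a configurable `max-age-hours`) take roughly this shape; only the cron expressions and the input name come from this PR, the remaining fields are illustrative:

```yaml
on:
  schedule:
    - cron: "0 3 * * *"   # daily at 03:00 UTC (was "0 3 * * 0", Sundays only)
  workflow_dispatch:
    inputs:
      max-age-hours:
        description: "Drop CI schemas older than this many hours"  # illustrative wording
        required: false
        default: "24"
```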