Conversation

@cyclux (Contributor) commented Nov 28, 2025

This pull request introduces a new Snowflake integration module for the Jaffle Shop dataset. It provides a robust, idempotent workflow for bootstrapping Snowflake infrastructure, ingesting CSV data from S3, and preparing weekly sales forecasting data for use with getML. The implementation features modular Python scripts, externalized SQL queries, and comprehensive logging and error handling. The workflow is automated via a new GitHub Actions CI pipeline.

Infrastructure and Workflow Automation

  • Added a new GitHub Actions workflow (.github/workflows/snowflake-test.yml) to automate Python linting, formatting, type checking, and testing for the Snowflake integration, including coverage reporting and support for multiple Python versions.

Snowflake Infrastructure Bootstrapping

  • Implemented bootstrap.py to create Snowflake warehouses and databases if they do not exist, using idempotent SQL and externalized queries (create_warehouse.sql, create_database.sql). [1] [2] [3]
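A minimal sketch of what such an idempotent bootstrap step can look like. This is not the PR's actual implementation: the statement text, warehouse/database names, and the `execute` callable (standing in for a Snowflake session's query execution) are all illustrative. The key point is that every statement uses `IF NOT EXISTS`, so re-running the bootstrap is a no-op.

```python
# Hypothetical sketch of an idempotent bootstrap step. The `execute`
# callable stands in for a real Snowflake session's query execution;
# names and warehouse settings are illustrative assumptions.
from typing import Callable

CREATE_WAREHOUSE = (
    "CREATE WAREHOUSE IF NOT EXISTS {name} "
    "WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60"
)
CREATE_DATABASE = "CREATE DATABASE IF NOT EXISTS {name}"


def bootstrap(
    execute: Callable[[str], None],
    warehouse: str = "JAFFLE_WH",
    database: str = "JAFFLE_SHOP",
) -> list[str]:
    """Issue the idempotent DDL and return the statements that were run."""
    statements = [
        CREATE_WAREHOUSE.format(name=warehouse),
        CREATE_DATABASE.format(name=database),
    ]
    for stmt in statements:
        execute(stmt)
    return statements
```

Because the DDL is idempotent, the function can safely run on every CI invocation without checking current state first.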

Data Ingestion Pipeline

  • Developed data/ingestion.py to ingest Jaffle Shop CSV data from S3 into Snowflake's RAW schema using external stages and native COPY INTO commands, with transaction management and error handling. [1] [2] [3]
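The transaction management around `COPY INTO` can be sketched as follows. This is an assumption-laden illustration, not the PR's code: table names, the stage name, and the file-format options are invented, and `execute` again stands in for a real session. The point is the shape of the error handling: all loads run inside one explicit transaction, and any failure rolls the whole batch back.

```python
# Hedged sketch of transactional CSV ingestion via COPY INTO.
# Table/stage names and file-format options are illustrative assumptions.
from typing import Callable

COPY_TEMPLATE = (
    "COPY INTO RAW.{table} FROM @RAW.{stage}/{prefix} "
    "FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1) "
    "ON_ERROR = 'ABORT_STATEMENT'"
)


def ingest_tables(
    execute: Callable[[str], None],
    tables: dict[str, str],
    stage: str = "jaffle_stage",
) -> None:
    """Load each table from its S3 prefix inside a single transaction."""
    execute("BEGIN")
    try:
        for table, prefix in tables.items():
            execute(COPY_TEMPLATE.format(table=table, stage=stage, prefix=prefix))
    except Exception:
        # Any failed load aborts the whole batch.
        execute("ROLLBACK")
        raise
    else:
        execute("COMMIT")
```

Wrapping the loads in one transaction means a partially ingested batch never becomes visible to downstream preparation steps.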

Data Preparation for Feature Store

  • Added data/preparation.py to create weekly sales population tables per store, calculate forecasting targets, perform schema validation, and run data quality checks, leveraging externalized SQL for maintainability. [1] [2]
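A schema-validation check in this spirit can be as simple as comparing the columns a query returned against an expected contract. The column set below is assumed from the weekly-stores query in this PR, not taken from the actual `preparation.py`:

```python
# Illustrative schema-validation helper; the expected column set is an
# assumption based on the weekly-stores query, not the PR's actual contract.
EXPECTED_WEEKLY_COLUMNS = {
    "STORE_ID", "STORE_NAME", "REFERENCE_DATE",
    "YEAR", "MONTH", "WEEK_NUMBER",
}


def validate_schema(
    actual_columns: set[str],
    expected: set[str] = EXPECTED_WEEKLY_COLUMNS,
) -> list[str]:
    """Return human-readable problems; an empty list means the schema passed."""
    problems = []
    missing = expected - actual_columns
    extra = actual_columns - expected
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    return problems
```

Returning a list of problems rather than raising immediately lets the caller report all violations in one pass, which is friendlier in a CI log.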

Modular Utilities and API

  • Introduced data/__init__.py and _sql_loader.py to provide a clean API and internal SQL file loading utilities for the integration package. [1] [2]
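An SQL-file loader in the spirit of `_sql_loader.py` might look like the sketch below. The directory layout (a `sql/` folder beside the module) and the caching are assumptions; the actual utility may differ:

```python
# Minimal sketch of an externalized-SQL loader: read <name>.sql from a
# sql/ directory, cached so repeated lookups skip the filesystem.
# The directory layout is an assumption about the package structure.
from functools import lru_cache
from pathlib import Path
from typing import Optional


@lru_cache(maxsize=None)
def load_sql(name: str, sql_dir: Optional[Path] = None) -> str:
    """Return the text of <name>.sql from sql_dir (default: sql/ beside this file)."""
    base = sql_dir if sql_dir is not None else Path(__file__).parent / "sql"
    return (base / f"{name}.sql").read_text(encoding="utf-8")
```

Keeping queries in standalone `.sql` files (as this PR does with `create_warehouse.sql` and `create_database.sql`) lets them be linted and reviewed as SQL rather than as Python string literals.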

@cyclux cyclux self-assigned this Nov 28, 2025
@cyclux cyclux linked an issue Nov 28, 2025 that may be closed by this pull request
@@ -0,0 +1,56 @@
-- Create weekly snapshots per store
CREATE TABLE PREPARED.weekly_stores AS
WITH store_activity AS (
    SELECT 
        s.id as store_id,
        s.name as store_name,
        s.opened_at,
        DATE_TRUNC('week', s.opened_at) + INTERVAL '7 days' as first_full_week,
        MIN(o.ordered_at) as first_order_date,
        MAX(o.ordered_at) as last_order_date,
        DATE_TRUNC('week', MIN(o.ordered_at)) as first_order_week,
        DATE_TRUNC('week', MAX(o.ordered_at)) as last_order_week
    FROM RAW.raw_stores s
    LEFT JOIN RAW.raw_orders o ON o.store_id = s.id
    GROUP BY s.id, s.name, s.opened_at
),

all_weeks AS (
    SELECT DISTINCT 
        DATE_TRUNC('week', ordered_at) as reference_date
    FROM RAW.raw_orders
    WHERE ordered_at IS NOT NULL
),

store_weeks AS (
    SELECT 
        sa.store_id,
        sa.store_name,
        w.reference_date,
        sa.opened_at,
        sa.first_full_week,
        sa.first_order_week,
        sa.last_order_week
    FROM store_activity sa
    CROSS JOIN all_weeks w
    WHERE w.reference_date >= sa.opened_at
      AND w.reference_date < sa.last_order_week
)

SELECT 
    store_id,
    store_name,
    reference_date,
    EXTRACT(year FROM reference_date) as year,
    EXTRACT(month FROM reference_date) as month,
    EXTRACT(week FROM reference_date) as week_number,
    DATEDIFF('day', opened_at, reference_date) as days_since_open,
    reference_date >= first_full_week as is_full_week_after_opening,
    first_order_week IS NOT NULL 
        AND reference_date >= first_order_week
        AND reference_date < last_order_week as has_order_activity,
    DATEDIFF('day', opened_at, reference_date) >= 7 as has_min_history
FROM store_weeks
ORDER BY reference_date, store_id;

…sion management

- Added bootstrap functionality to ensure the existence of Snowflake warehouse and database.
- Created SnowflakeSettings class for managing authentication and connection settings.
- Implemented session management for Snowflake using Snowpark.
- Developed SQL loading utilities for dynamic SQL execution.
- Added SQL scripts for creating schemas, tables, and stages, as well as data ingestion processes.
- Prepared a script to load and prepare Jaffle Shop data for integration with getML Feature Store.
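The `SnowflakeSettings` class mentioned above could plausibly be sketched as a frozen dataclass built from environment variables. The variable names, defaults, and the dataclass approach are all assumptions for illustration; the real class may use a different mechanism entirely:

```python
# Hedged sketch of a connection-settings holder in the spirit of
# SnowflakeSettings. Env var names and defaults are illustrative assumptions.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SnowflakeSettings:
    account: str
    user: str
    password: str
    warehouse: str = "JAFFLE_WH"
    database: str = "JAFFLE_SHOP"

    @classmethod
    def from_env(cls) -> "SnowflakeSettings":
        """Build settings from the environment, failing loudly on missing values."""
        def require(name: str) -> str:
            value = os.environ.get(name)
            if not value:
                raise RuntimeError(f"missing required env var: {name}")
            return value

        return cls(
            account=require("SNOWFLAKE_ACCOUNT"),
            user=require("SNOWFLAKE_USER"),
            password=require("SNOWFLAKE_PASSWORD"),
            warehouse=os.environ.get("SNOWFLAKE_WAREHOUSE", "JAFFLE_WH"),
            database=os.environ.get("SNOWFLAKE_DATABASE", "JAFFLE_SHOP"),
        )
```

Failing at construction time, rather than at first query, surfaces misconfigured CI secrets immediately.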


Development

Successfully merging this pull request may close these issues.

Build data preparation infrastructure for feature store notebooks

3 participants