Overview
Build the data ingestion and preparation infrastructure for feature store integration notebooks. This involves hosting jaffle shop parquet files on GCS, creating an ingestion module for GCS→Snowflake, refactoring the data preparation script into a proper orchestration module with external SQL files, and writing integration tests.
User Story
As a developer building feature store integration notebooks,
I want reliable data infrastructure that loads and prepares jaffle shop data,
So that I can focus on demonstrating getml's feature engineering capabilities without worrying about data pipeline issues.
Acceptance Criteria
- Parquet files hosted on `gs://static.getml.com/datasets/jaffle-shop/`
- Ingestion module loads data from GCS/S3 to Snowflake
- SQL queries externalized to `.sql` files
- All SQL is Snowflake-compliant
- Off-by-one bug in the target look-ahead window fixed
- Integration test passes
- `population_weekly_by_store_with_target` table created correctly
Example Usage
```python
from integration.snowflake.data import ingestion, preparation

# Load raw data from GCS to Snowflake
ingestion.load_from_gcs(
    bucket="gs://static.getml.com/datasets/jaffle-shop/",
    destination_schema="RAW",
)

# Prepare population table with target
preparation.create_population_with_target(
    source_schema="RAW",
    target_schema="PREPARED",
)
```
Implementation Breakdown
This feature will be decomposed into the following tasks (to be created as sub-issues):
- Upload jaffle shop parquet files to GCS
- Implement GCS → Snowflake ingestion module (sketched below)
- Refactor SQL queries into external `.sql` files
- Implement data preparation orchestration module
- Fix off-by-one bug in target look-ahead window (illustrated below)
- Create integration tests
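As a rough illustration of the ingestion task, here is a minimal sketch assuming the `snowflake-connector-python` client and a pre-created Snowflake storage integration for the bucket. The integration name `GCS_INT`, the table list, and the single-VARIANT-column loading are all illustrative assumptions, not decisions made in this issue:

```python
# Illustrative sketch of data/ingestion.py -- not the final implementation.
import snowflake.connector

# Assumption: table names mirror the parquet file names in the bucket.
JAFFLE_SHOP_TABLES = ["customers", "orders", "items", "products", "stores", "supplies"]


def load_from_gcs(bucket: str, destination_schema: str, **connect_kwargs) -> None:
    """Copy jaffle shop parquet files from GCS into Snowflake tables."""
    # Snowflake external stages use the gcs:// scheme rather than gs://.
    url = bucket.replace("gs://", "gcs://", 1)
    conn = snowflake.connector.connect(**connect_kwargs)  # credentials via env/config
    try:
        cur = conn.cursor()
        cur.execute(f"CREATE SCHEMA IF NOT EXISTS {destination_schema}")
        cur.execute(
            f"CREATE OR REPLACE STAGE {destination_schema}.jaffle_shop_stage "
            f"URL = '{url}' STORAGE_INTEGRATION = GCS_INT "
            "FILE_FORMAT = (TYPE = PARQUET)"
        )
        for table in JAFFLE_SHOP_TABLES:
            # Simplification: land each parquet record in a single VARIANT
            # column; the real module would declare or infer proper columns.
            cur.execute(
                f"CREATE TABLE IF NOT EXISTS {destination_schema}.{table} (v VARIANT)"
            )
            cur.execute(
                f"COPY INTO {destination_schema}.{table} "
                f"FROM @{destination_schema}.jaffle_shop_stage/{table}.parquet"
            )
    finally:
        conn.close()
```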
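The off-by-one fix depends on the actual query, which isn't reproduced in this issue. The snippet below only illustrates the class of bug, under the assumption that the target aggregates the following week's sales: an inclusive BETWEEN-style upper bound spans eight days, while a half-open window spans exactly seven.

```python
from datetime import date, timedelta


def in_target_window(ts: date, week_start: date) -> bool:
    """Half-open look-ahead window covering the seven days of the next week."""
    lo = week_start + timedelta(days=7)
    hi = week_start + timedelta(days=14)
    # Using `ts <= hi` (SQL: BETWEEN lo AND hi) would admit an eighth day.
    return lo <= ts < hi


week_start = date(2024, 1, 1)
assert in_target_window(week_start + timedelta(days=7), week_start)       # day 1: in
assert in_target_window(week_start + timedelta(days=13), week_start)      # day 7: in
assert not in_target_window(week_start + timedelta(days=14), week_start)  # day 8: out
```

In Snowflake SQL the equivalent fix reads `ts >= lo AND ts < hi` rather than `ts BETWEEN lo AND hi`.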
Technical Context
Input: Parquet files from jaffle shop example
Output: `population_weekly_by_store_with_target` table in Snowflake
File Structure:
```
getml-demo/integration/snowflake/
├── data/
│   ├── ingestion.py           # GCS → Snowflake loader
│   ├── preparation.py         # Orchestration module
│   └── sql/
│       ├── create_population.sql
│       ├── calculate_target.sql
│       └── ...
└── tests/
    └── test_data_pipeline.py  # Integration test
```
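Under that layout, the orchestration module might look like the sketch below, assuming each `.sql` file is one self-contained statement and that schema names are injected through `{source}`/`{target}` placeholders (a hypothetical convention, not something specified here):

```python
# Illustrative sketch of data/preparation.py -- not the final implementation.
from pathlib import Path

import snowflake.connector

SQL_DIR = Path(__file__).parent / "sql"

# Explicit run order; globbing the directory would make ordering accidental.
PIPELINE = ["create_population.sql", "calculate_target.sql"]


def create_population_with_target(
    source_schema: str, target_schema: str, **connect_kwargs
) -> None:
    """Run the externalized SQL files in order to build
    population_weekly_by_store_with_target in the target schema."""
    conn = snowflake.connector.connect(**connect_kwargs)
    try:
        cur = conn.cursor()
        for name in PIPELINE:
            sql = (SQL_DIR / name).read_text()
            cur.execute(sql.format(source=source_schema, target=target_schema))
    finally:
        conn.close()
```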
Related Issues
- Parent: Initiative - Feature Store Integration
- Blocks: #43 (Build Snowflake Feature Store integration notebook), #44 (Build Databricks Feature Store integration notebook)
Documentation
- `prepare_getml_weekly_sales_by_store.py`: original script to refactor
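For the integration test, one plausible shape for `tests/test_data_pipeline.py` is sketched below; the fixture scope, schema names, and the bare non-emptiness check are placeholders rather than agreed requirements:

```python
# Illustrative sketch of tests/test_data_pipeline.py.
import pytest
import snowflake.connector

from integration.snowflake.data import ingestion, preparation


@pytest.fixture(scope="module")
def prepared_schema() -> str:
    """Run the full pipeline once per test module."""
    ingestion.load_from_gcs(
        bucket="gs://static.getml.com/datasets/jaffle-shop/",
        destination_schema="RAW",
    )
    preparation.create_population_with_target(
        source_schema="RAW",
        target_schema="PREPARED",
    )
    return "PREPARED"


def test_population_with_target_is_created(prepared_schema):
    conn = snowflake.connector.connect()  # credentials from the environment
    try:
        cur = conn.cursor()
        cur.execute(
            f"SELECT COUNT(*) FROM {prepared_schema}.population_weekly_by_store_with_target"
        )
        (count,) = cur.fetchone()
        assert count > 0  # table exists and was populated
    finally:
        conn.close()
```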