Build data preparation infrastructure for feature store notebooks #42

@srnnkls

Description

Overview

Build the data ingestion and preparation infrastructure for the feature store integration notebooks. This involves hosting the jaffle shop parquet files on GCS, creating an ingestion module that loads from GCS into Snowflake, refactoring the data preparation script into a proper orchestration module backed by external SQL files, and writing integration tests.

User Story

As a developer building feature store integration notebooks,
I want reliable data infrastructure that loads and prepares jaffle shop data,
So that I can focus on demonstrating getml's feature engineering capabilities without worrying about data pipeline issues.

Acceptance Criteria

  • Parquet files hosted on gs://static.getml.com/datasets/jaffle-shop/
  • Ingestion module loads data from GCS/S3 to Snowflake
  • SQL queries externalized to .sql files
  • All SQL is Snowflake-compliant
  • Off-by-one bug in target window fixed
  • Integration test passes
  • population_weekly_by_store_with_target table created correctly

Example Usage

from integration.snowflake.data import ingestion, preparation

# Load raw data from GCS to Snowflake
ingestion.load_from_gcs(
    bucket="gs://static.getml.com/datasets/jaffle-shop/",
    destination_schema="RAW"
)

# Prepare population table with target
preparation.create_population_with_target(
    source_schema="RAW",
    target_schema="PREPARED"
)
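Internally, the ingestion module will likely reduce to Snowflake COPY INTO statements over an external stage pointing at the GCS bucket. A minimal sketch of a statement builder (the function name, stage name, and file-naming convention are assumptions for illustration, not part of this issue):

```python
def build_copy_statement(
    table: str,
    destination_schema: str = "RAW",
    stage: str = "jaffle_shop_stage",  # hypothetical named external stage
) -> str:
    """Build a COPY INTO statement loading one parquet file from a stage.

    Assumes the external stage has already been created against
    gs://static.getml.com/datasets/jaffle-shop/ and that each table's
    data lives in a file named <table>.parquet. Purely illustrative.
    """
    return (
        f"COPY INTO {destination_schema}.{table.upper()} "
        f"FROM @{stage}/{table}.parquet "
        "FILE_FORMAT = (TYPE = PARQUET) "
        "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
    )
```

load_from_gcs could then iterate over the known jaffle shop tables and execute one such statement per file through a Snowflake cursor.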

Implementation Breakdown

This feature will be decomposed into the following tasks (to be created as sub-issues):

  1. Upload jaffle shop parquet files to GCS
  2. Implement GCS → Snowflake ingestion module
  3. Refactor SQL queries into external .sql files
  4. Implement data preparation orchestration module
  5. Fix off-by-one bug in target look-ahead window
  6. Create integration tests
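The issue does not spell out where the off-by-one in task 5 lies, but look-ahead windows usually break at a boundary: the reference week leaking into its own target, or an inclusive end date counting one week too many. The actual fix belongs in calculate_target.sql; this Python sketch only illustrates the half-open convention the fixed query should implement:

```python
from datetime import date, timedelta

def target_window(week_start: date, weeks_ahead: int = 1) -> tuple[date, date]:
    """Half-open look-ahead window [start, end) for a weekly target.

    The window begins with the *next* week, so the reference week never
    leaks into its own target, and spans exactly `weeks_ahead` weeks.
    Using an inclusive end here would be the classic off-by-one.
    """
    start = week_start + timedelta(weeks=1)
    end = start + timedelta(weeks=weeks_ahead)
    return start, end

def in_target_window(d: date, week_start: date, weeks_ahead: int = 1) -> bool:
    """Check whether a date contributes to the target of `week_start`."""
    start, end = target_window(week_start, weeks_ahead)
    return start <= d < end
```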

Technical Context

Input: Parquet files from jaffle shop example
Output: population_weekly_by_store_with_target table in Snowflake

File Structure:

getml-demo/integration/snowflake/
├── data/
│   ├── ingestion.py          # GCS → Snowflake loader
│   ├── preparation.py        # Orchestration module
│   └── sql/
│       ├── create_population.sql
│       ├── calculate_target.sql
│       └── ...
└── tests/
    └── test_data_pipeline.py  # Integration test
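With the queries externalized, preparation.py can stay thin: read each .sql file and execute the statements in order against a DB-API-style cursor. A sketch of that orchestration (helper names and step order are assumptions for illustration):

```python
from pathlib import Path

def load_sql(name: str, sql_dir: Path) -> str:
    """Read one externalized query, e.g. load_sql("create_population", dir)."""
    return (sql_dir / f"{name}.sql").read_text()

def run_steps(
    cursor,
    sql_dir: Path,
    steps=("create_population", "calculate_target"),  # hypothetical order
) -> None:
    """Execute the preparation queries in sequence against a cursor."""
    for step in steps:
        cursor.execute(load_sql(step, sql_dir))
```

create_population_with_target would then amount to opening a Snowflake connection, setting the source and target schemas, and calling run_steps; the integration test can exercise the same path end to end.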

Documentation

  • prepare_getml_weekly_sales_by_store.py - Original script to refactor
