37 changes: 37 additions & 0 deletions integration/jaffle-shop-data/GENERATE_JAFFLE_SHOP_PARQUET.md
@@ -0,0 +1,37 @@
# Generate Parquet files from Jaffle Shop CSV data

## Prerequisites

- pipx
- gcloud CLI
> **Copilot AI** commented on lines +5 to +6 (Dec 5, 2025):
>
> Missing backticks around pipx for consistency with other command-line tool references. Should be ``- `pipx` `` to match the formatting of other tool references in the document.
>
> Suggested change:
>
> ```diff
> -- pipx
> -- gcloud CLI
> +- `pipx`
> +- `gcloud CLI`
> ```
> **Copilot AI** commented on line +6 (Dec 5, 2025):
>
> Missing backticks around gcloud for consistency with other command-line tool references. Should be ``- `gcloud` CLI`` to match the formatting style used for other tools.
>
> Suggested change:
>
> ```diff
> -- gcloud CLI
> +- `gcloud` CLI
> ```

This script reads the Jaffle Shop CSV files and converts them to Parquet format for more efficient storage and querying in Snowflake.

## Generate Jaffle Shop Data (CSV)

To generate the Jaffle Shop CSV data, run the following command:

```bash
pipx run jafgen 6
```

This will create the necessary CSV files in the `jaffle-data` directory.
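As a quick post-generation check (not part of this PR; a stdlib-only sketch that assumes jafgen's default output layout), you can verify that all expected CSV files were produced:

```python
from pathlib import Path

# Table names jafgen is expected to emit (mirrors the NAMES list in the
# conversion script in this PR).
EXPECTED = [
    "raw_customers", "raw_items", "raw_orders", "raw_products",
    "raw_stores", "raw_supplies", "raw_tweets",
]


def missing_csvs(data_dir: Path, names: list[str]) -> list[str]:
    """Return the table names whose CSV file is absent from data_dir."""
    return [n for n in names if not (data_dir / f"{n}.csv").exists()]


if missing := missing_csvs(Path("jaffle-data"), EXPECTED):
    print("Missing CSVs:", ", ".join(missing))
else:
    print("All expected CSVs present.")
```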

## Convert CSV to Parquet

To convert the generated CSV files to Parquet format, run the following script:

```bash
python convert_jaffle_csv_to_parquet.py
```

This will read each CSV file from the `jaffle-data` directory and save the corresponding Parquet files in the `jaffle-data/parquet` directory.

## Upload Parquet Files to GCP

To upload the Parquet files to your GCP bucket, use the following commands:

```bash
gcloud config set project getml-infra
gcloud storage cp jaffle-data/parquet/*.parquet gs://static.getml.com/datasets/jaffle_shop/
```
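After the upload, each file is addressable under the bucket prefix. A tiny helper (illustrative only, not part of the PR) mapping a local Parquet path to the object URL it should have after `gcloud storage cp`:

```python
from pathlib import PurePosixPath

# Destination prefix used in the upload command above.
BUCKET_PREFIX = "gs://static.getml.com/datasets/jaffle_shop/"


def destination_url(local_path: str) -> str:
    """Expected GCS object URL for a local Parquet file after upload."""
    return BUCKET_PREFIX + PurePosixPath(local_path).name


print(destination_url("jaffle-data/parquet/raw_orders.parquet"))
# gs://static.getml.com/datasets/jaffle_shop/raw_orders.parquet
```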
39 changes: 39 additions & 0 deletions integration/jaffle-shop-data/convert_jaffle_csv_to_parquet.py
@@ -0,0 +1,39 @@
from pathlib import Path

import pandas as pd

NAMES: list[str] = [
    "raw_customers",
    "raw_items",
    "raw_orders",
    "raw_products",
    "raw_stores",
    "raw_supplies",
    "raw_tweets",
]

JAFFLE_CSV_DATA_PATH = Path("jaffle-data")

if not JAFFLE_CSV_DATA_PATH.exists():
    raise FileNotFoundError(
        f"Jaffle CSV data path {JAFFLE_CSV_DATA_PATH} does not exist."
        " Please run `jafgen` to generate CSVs."
> **Copilot AI** commented (Dec 5, 2025):
>
> The error message references `jafgen` to generate CSVs, but the actual command shown in the documentation (GENERATE_JAFFLE_SHOP_PARQUET.md) is `pipx run jafgen 6`. The error message should provide the complete command to help users resolve the issue more easily.
>
> Suggested change:
>
> ```diff
> -        " Please run `jafgen` to generate CSVs."
> +        " Please run `pipx run jafgen 6` to generate CSVs."
> ```
    )

JAFFLE_PARQUET_DATA_PATH = JAFFLE_CSV_DATA_PATH / "parquet"
Path.mkdir(JAFFLE_PARQUET_DATA_PATH, exist_ok=True)
> **Copilot AI** commented (Dec 5, 2025):
>
> `Path.mkdir()` is called on the class with the path instance passed as the first argument. This does work (it is an unbound-method call with the instance as `self`), but it is unidiomatic; call the method on the instance instead, and pass `parents=True` so parent directories are also created if needed.
>
> Suggested change:
>
> ```diff
> -Path.mkdir(JAFFLE_PARQUET_DATA_PATH, exist_ok=True)
> +JAFFLE_PARQUET_DATA_PATH.mkdir(parents=True, exist_ok=True)
> ```


for name in NAMES:
    csv_filepath = JAFFLE_CSV_DATA_PATH / f"{name}.csv"
    parquet_filepath = JAFFLE_PARQUET_DATA_PATH / f"{name}.parquet"
    print(f"Loading {csv_filepath}...")

    # 1. Read CSV into memory
    df: pd.DataFrame = pd.read_csv(csv_filepath)

    # 2. Write DataFrame to Parquet
    # 'index=False' prevents pandas from adding an extra index column
    df.to_parquet(parquet_filepath, index=False)

    print(f"Converted {name} to parquet format at {parquet_filepath}.")
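A lightweight way to sanity-check the files the script above produced, without loading them (an illustrative sketch, relying only on the Parquet format's magic-number convention: a Parquet file begins and ends with the 4-byte marker `PAR1`):

```python
from pathlib import Path


def looks_like_parquet(path: Path) -> bool:
    """Cheap structural check: Parquet files start and end with b"PAR1"."""
    data = path.read_bytes()
    # Smallest structurally valid file: magic + footer length + magic = 12 bytes.
    return len(data) >= 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"


parquet_dir = Path("jaffle-data/parquet")
for f in sorted(parquet_dir.glob("*.parquet")):
    status = "ok" if looks_like_parquet(f) else "not a Parquet file?"
    print(f"{f.name}: {status}")
```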
137 changes: 137 additions & 0 deletions integration/pyproject.toml
@@ -0,0 +1,137 @@
[project]
name = "getml-featurestore-integrations"
version = "0.1.0"
description = "Integrations and Data Preparation for getML Feature Stores"
authors = [
    { name = "Code17 GmbH", email = "hello@code17.io" },
    { name = "getML", email = "hello@getml.com" },
]
maintainers = [
    { name = "Code17 GmbH", email = "hello@code17.io" },
    { name = "getML", email = "hello@getml.com" },
]
license = { text = "Proprietary" }
classifiers = [
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Operating System :: OS Independent",
    "Private :: Do Not Upload",
    "Intended Audience :: Developers",
    "Intended Audience :: Science/Research",
    "Topic :: Scientific/Engineering",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Topic :: Software Development :: Libraries",
    "Topic :: Software Development :: Libraries :: Python Modules",
]
readme = "README.md"
> **Copilot AI** commented (Dec 5, 2025):
>
> The pyproject.toml references `readme = "README.md"`, but the README.md file does not exist in the integration directory. This will cause build failures when attempting to package the project. Either create the README.md file or remove this line from the configuration.
>
> Suggested change:
>
> ```diff
> -readme = "README.md"
> ```
requires-python = ">=3.12"

dependencies = [
    "fastparquet>=2024.11.0",
    "httpx>=0.27.0",
    "ipykernel>=7.1.0",
    "pandas>=2.3.3",
    "pyarrow>=18.0.0",
    "pydantic>=2.12.5",
    "pydantic-settings>=2.12.0",
    "snowflake-connector-python>=3.17.3",
    "snowflake-snowpark-python>=1.42.0",
]

[dependency-groups]
dev = [
    "ruff~=0.12.2",
    "basedpyright~=1.28.4",
    "pytest~=8.0.0",
    "pytest-cov>=6.2.1",
    "pytest-dependency>=0.6.0",
]

[tool.uv]
package = false

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = ["test_*.py"]
python_classes = ["Test*"]
python_functions = ["test_*"]
markers = [
    "integration: marks tests as integration tests (require Snowflake credentials)",
]
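For reference, a hypothetical test using the `integration` marker declared above (illustrative only; the real tests live under `tests/` and require Snowflake credentials):

```python
import pytest


@pytest.mark.integration
def test_snowflake_connection():
    # Would open a real Snowflake session here; requires credentials.
    assert True
```

Tests marked this way can then be deselected locally with `pytest -m "not integration"`.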

[project.urls]
"Homepage" = "https://github.com/getml/getml-demo"
"Bug Tracker" = "https://github.com/getml/getml-demo/issues"
"getML" = "https://getml.com"
"Code17 GmbH" = "https://www.code17.io/"

[tool.pyright]
venvPath = "."
venv = ".venv"
reportMissingTypeStubs = false
reportImplicitStringConcatenation = false

[[tool.pyright.executionEnvironments]]
root = "tests"
extraPaths = ["."]
reportUnusedParameter = false

[build-system]
requires = ["uv_build>=0.7.21,<0.8.0"]
build-backend = "uv_build"

[tool.ruff]
line-length = 88
target-version = "py312"

[tool.ruff.format]
preview = false
quote-style = "double"
line-ending = "auto"
docstring-code-format = true

[tool.ruff.lint]
select = ["ALL"]
ignore = [
    # Allow for string literals in exceptions
    "EM",
    # Allow missing copyright notice at top of files
    "CPY001",
    # Allow missing docstrings in public modules
    "D100",
    # Allow missing docstrings in public classes
    "D101",
    # Allow missing docstrings in public packages
    "D104",
    # Allow docstrings without blank line before class docstring
    "D203",
    # Allow multi-line docstring summary to start at second line
    "D213",
    # Allow first-party imports outside type-checking blocks
    "TC001",
    # Allow third-party imports outside type-checking blocks
    "TC002",
    # Allow standard library imports outside type-checking blocks
    "TC003",
    # Allow TODO comments
> **Copilot AI** commented (Dec 5, 2025):
>
> Trailing whitespace found at the end of this comment line. Remove the extra spaces.
>
> Suggested change (whitespace-only):
>
> ```diff
> -# Allow TODO comments 
> +# Allow TODO comments
> ```
    "FIX002",
    # Allow TODO comments without author
    "TD002",
    # Allow TODO comments without link to issue
    "TD003",
    # Allow specifying long messages outside the exception class
    "TRY003",
    # Conflicts with formatter - trailing commas are handled by ruff format
    "COM812",
]

fixable = ["ALL"]

[tool.ruff.lint.pydocstyle]
convention = "google"

[tool.ruff.lint.per-file-ignores]
# S101: Allow for use of the assert keyword
# PLR2004: Allow "magic value" used in comparison
"test_*.py" = ["S101", "PLR2004"]