@cyclux (Contributor) commented Dec 5, 2025

This pull request introduces the initial setup for the Jaffle Shop data integration, focusing on enabling CSV-to-Parquet conversion and preparing the environment for efficient data processing and integration with Snowflake and GCP. Key changes include a new conversion script, a comprehensive project configuration, and usage documentation.

Data conversion and workflow setup

  • Added convert_jaffle_csv_to_parquet.py script to automate conversion of Jaffle Shop CSV files into Parquet format, improving data storage and query efficiency for downstream use in Snowflake.
  • Added GENERATE_JAFFLE_SHOP_PARQUET.md documentation to guide users through generating CSV data, converting it to Parquet, and uploading Parquet files to GCP, including prerequisites and step-by-step instructions.
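The conversion step described above could look roughly like the following. This is a minimal sketch, not the script from this pull request: the helper names `parquet_path_for` and `convert_csv_dir` are hypothetical, and it assumes pandas with the pyarrow engine is installed, as the dependency list in this PR suggests.

```python
from pathlib import Path


def parquet_path_for(csv_path: Path, out_dir: Path) -> Path:
    """Map e.g. orders.csv to <out_dir>/orders.parquet (file name is illustrative)."""
    return out_dir / csv_path.with_suffix(".parquet").name


def convert_csv_dir(csv_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every CSV file in csv_dir to a Parquet file in out_dir."""
    # Imported lazily so the path helper above works without pandas installed.
    import pandas as pd

    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(csv_dir.glob("*.csv")):
        target = parquet_path_for(csv_path, out_dir)
        # pandas infers column types from the CSV; pyarrow writes the file.
        pd.read_csv(csv_path).to_parquet(target, engine="pyarrow", index=False)
        written.append(target)
    return written
```

Keeping the path mapping separate from the I/O makes the naming rule easy to check without touching any data files.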

Project configuration and dependencies

  • Introduced pyproject.toml for project metadata, dependency management (including pandas, pyarrow, fastparquet, Snowflake connectors), development tools, and configuration for code quality and testing.

…rmat and instructions to upload to GCS bucket
@cyclux self-assigned this Dec 5, 2025
@cyclux linked an issue Dec 5, 2025 that may be closed by this pull request
@cyclux requested review from Copilot and srnnkls December 5, 2025 17:55
Copilot finished reviewing on behalf of cyclux December 5, 2025 17:58

Copilot AI left a comment

Pull request overview

This pull request establishes the initial infrastructure for Jaffle Shop data integration, introducing a Python script to convert CSV files to Parquet format, comprehensive project configuration, and step-by-step documentation. The changes prepare the environment for efficient data processing and integration with Snowflake and GCP.

Key Changes:

  • Added CSV-to-Parquet conversion script with automated processing for seven Jaffle Shop data tables
  • Configured project dependencies and development tooling via pyproject.toml with Python 3.12+ support
  • Documented complete workflow from CSV generation through GCP upload

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| integration/pyproject.toml | Establishes project metadata, dependencies (pandas, pyarrow, Snowflake connectors), and development tool configurations for the integration package |
| integration/jaffle-shop-data/convert_jaffle_csv_to_parquet.py | Implements automated CSV-to-Parquet conversion for seven Jaffle Shop datasets with basic error handling |
| integration/jaffle-shop-data/GENERATE_JAFFLE_SHOP_PARQUET.md | Provides user documentation covering prerequisites, CSV generation, conversion steps, and GCP upload instructions |

"TC002",
# Allow standard library imports outside type-checking blocks
"TC003",
# Allow TODO comments

Copilot AI Dec 5, 2025

Trailing whitespace found at the end of this comment line. Remove the extra spaces.

Suggested change
# Allow TODO comments
# Allow TODO comments
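The two lines of this suggestion look identical here because the only difference is invisible trailing whitespace. As a rough aid, the following sketch lists the lines of a text that end in spaces or tabs. The function name and the sample text are made up for illustration and are not part of this PR.

```python
def lines_with_trailing_whitespace(text: str) -> list[int]:
    """Return 1-based numbers of lines that end in spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


# Made-up sample: only the second line ends in two spaces.
sample = '"TC003",\n# Allow TODO comments  \n"OTHER",\n'
print(lines_with_trailing_whitespace(sample))  # [2]
```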

Copilot uses AI. Check for mistakes.
if not JAFFLE_CSV_DATA_PATH.exists():
    raise FileNotFoundError(
        f"Jaffle CSV data path {JAFFLE_CSV_DATA_PATH} does not exist."
        " Please run `jafgen` to generate CSVs."

Copilot AI Dec 5, 2025

The error message references `jafgen` to generate CSVs, but the actual command shown in the documentation (GENERATE_JAFFLE_SHOP_PARQUET.md) is `pipx run jafgen 6`. The error message should provide the complete command to help users resolve the issue more easily: "Please run `pipx run jafgen 6` to generate CSVs."
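A minimal sketch of such a check, with the full command kept in one place so the message and the documentation cannot drift apart. The function name `require_csv_dir` and the constant `GENERATE_COMMAND` are hypothetical; only the command string and the message wording come from this review.

```python
from pathlib import Path

# Command as quoted in the review comment; the constant name is hypothetical.
GENERATE_COMMAND = "pipx run jafgen 6"


def require_csv_dir(csv_dir: Path) -> Path:
    """Return csv_dir if it exists, otherwise raise with a copy-pasteable hint."""
    if not csv_dir.exists():
        raise FileNotFoundError(
            f"Jaffle CSV data path {csv_dir} does not exist."
            f" Please run `{GENERATE_COMMAND}` to generate CSVs."
        )
    return csv_dir
```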

Suggested change
" Please run `jafgen` to generate CSVs."
" Please run `pipx run jafgen 6` to generate CSVs."

Comment on lines +5 to +6
- pipx
- gcloud CLI

Copilot AI Dec 5, 2025

Missing backticks around pipx for consistency with other command-line tool references. Should be - `pipx` to match the formatting of gcloud CLI and other tool references in the document.

Suggested change
- pipx
- gcloud CLI
- `pipx`
- `gcloud CLI`

"Topic :: Software Development :: Libraries",
"Topic :: Software Development :: Libraries :: Python Modules",
]
readme = "README.md"

Copilot AI Dec 5, 2025

The pyproject.toml references `readme = "README.md"`, but the README.md file does not exist in the integration directory. This will cause build failures when attempting to package the project. Either create the README.md file or remove this line from the configuration.

Suggested change
readme = "README.md"

)

JAFFLE_PARQUET_DATA_PATH = JAFFLE_CSV_DATA_PATH / "parquet"
Path.mkdir(JAFFLE_PARQUET_DATA_PATH, exist_ok=True)

Copilot AI Dec 5, 2025

The call to `Path.mkdir()` is unidiomatic. Passing the instance to the method through the `Path` class does work, but the method is normally called on the path instance itself. Change `Path.mkdir(JAFFLE_PARQUET_DATA_PATH, exist_ok=True)` to `JAFFLE_PARQUET_DATA_PATH.mkdir(exist_ok=True)`, or to `JAFFLE_PARQUET_DATA_PATH.mkdir(parents=True, exist_ok=True)` so that missing parent directories are also created.

Suggested change
Path.mkdir(JAFFLE_PARQUET_DATA_PATH, exist_ok=True)
JAFFLE_PARQUET_DATA_PATH.mkdir(parents=True, exist_ok=True)
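The behavior of the suggested call can be tried in isolation. This is a small sketch under assumed names: `ensure_dir` and the `jaffle-data` directory are hypothetical, and a temporary directory stands in for the real data path.

```python
import tempfile
from pathlib import Path


def ensure_dir(path: Path) -> Path:
    """Create path (and any missing parents) and return it; safe to call twice."""
    path.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: the "jaffle-data" parent does not exist yet.
    target = Path(tmp) / "jaffle-data" / "parquet"
    ensure_dir(target)
    ensure_dir(target)  # second call is a no-op thanks to exist_ok=True
    print(target.is_dir())  # True
```

Without `parents=True`, `mkdir` raises `FileNotFoundError` when a parent directory is missing. In the script under review the parent is checked for existence first, so `parents=True` mainly acts as a safeguard.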

## Prerequisites

- pipx
- gcloud CLI

Copilot AI Dec 5, 2025

Missing backticks around gcloud for consistency with other command-line tool references. Should be - `gcloud` CLI to match the formatting style used for other tools.

Suggested change
- gcloud CLI
- `gcloud` CLI

Development

Successfully merging this pull request may close these issues.

Host jaffle-shop parquet files on GCS

2 participants