Generate and Convert Jaffle Shop CSVs to Parquet Format for GCS #57
base: master
…rmat and instructions to upload to GCS bucket
Pull request overview
This pull request establishes the initial infrastructure for Jaffle Shop data integration, introducing a Python script to convert CSV files to Parquet format, comprehensive project configuration, and step-by-step documentation. The changes prepare the environment for efficient data processing and integration with Snowflake and GCP.
Key Changes:
- Added CSV-to-Parquet conversion script with automated processing for seven Jaffle Shop data tables
- Configured project dependencies and development tooling via pyproject.toml with Python 3.12+ support
- Documented complete workflow from CSV generation through GCP upload
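The conversion step described above can be sketched roughly as follows. This is not the script from the pull request: the function names `plan_conversions` and `convert_all` are hypothetical, and the sketch assumes the generated CSVs sit directly in one directory and that `pandas` and `pyarrow` are installed.

```python
from pathlib import Path


def plan_conversions(csv_dir: Path, parquet_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every CSV in csv_dir with the Parquet path it would be written to."""
    return [
        (csv_path, parquet_dir / f"{csv_path.stem}.parquet")
        for csv_path in sorted(csv_dir.glob("*.csv"))
    ]


def convert_all(csv_dir: Path, parquet_dir: Path) -> list[Path]:
    """Convert each CSV in csv_dir to Parquet and return the written paths."""
    # Imported lazily so plan_conversions stays usable without pandas installed.
    import pandas as pd

    parquet_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path, parquet_path in plan_conversions(csv_dir, parquet_dir):
        frame = pd.read_csv(csv_path)
        # index=False keeps the pandas row index out of the Parquet file.
        frame.to_parquet(parquet_path, engine="pyarrow", index=False)
        written.append(parquet_path)
    return written
```

Globbing for `*.csv` avoids hard-coding the table names, at the cost of converting any unrelated CSV that happens to be in the same directory.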
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| integration/pyproject.toml | Establishes project metadata, dependencies (pandas, pyarrow, Snowflake connectors), and development tool configurations for the integration package |
| integration/jaffle-shop-data/convert_jaffle_csv_to_parquet.py | Implements automated CSV-to-Parquet conversion for seven Jaffle Shop datasets with basic error handling |
| integration/jaffle-shop-data/GENERATE_JAFFLE_SHOP_PARQUET.md | Provides user documentation covering prerequisites, CSV generation, conversion steps, and GCP upload instructions |
| "TC002", | ||
| # Allow standard library imports outside type-checking blocks | ||
| "TC003", | ||
| # Allow TODO comments |
Copilot
AI
Dec 5, 2025
Trailing whitespace found at the end of this comment line. Remove the extra spaces.
| # Allow TODO comments | |
| # Allow TODO comments |
| if not JAFFLE_CSV_DATA_PATH.exists(): | ||
| raise FileNotFoundError( | ||
| f"Jaffle CSV data path {JAFFLE_CSV_DATA_PATH} does not exist." | ||
| " Please run `jafgen` to generate CSVs." |
Copilot
AI
Dec 5, 2025
The error message tells users to run `jafgen` to generate CSVs, but the command shown in the documentation (GENERATE_JAFFLE_SHOP_PARQUET.md) is `pipx run jafgen 6`. The error message should give the complete command so users can resolve the issue more easily: "Please run `pipx run jafgen 6` to generate CSVs."
| " Please run `jafgen` to generate CSVs." | |
| " Please run `pipx run jafgen 6` to generate CSVs." |
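As a minimal sketch of how the suggested message might look in a reusable guard, the helper below raises with the full command. The function name `require_csv_dir` is hypothetical and does not appear in the pull request; only the message text comes from the suggestion above.

```python
from pathlib import Path


def require_csv_dir(csv_dir: Path) -> Path:
    """Return csv_dir if it exists, otherwise raise with the command that creates it."""
    if not csv_dir.exists():
        raise FileNotFoundError(
            f"Jaffle CSV data path {csv_dir} does not exist."
            " Please run `pipx run jafgen 6` to generate CSVs."
        )
    return csv_dir
```

Returning the path lets the caller write `csv_dir = require_csv_dir(csv_dir)` in one line before any conversion work starts.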
| - pipx | ||
| - gcloud CLI |
Copilot
AI
Dec 5, 2025
Missing backticks around pipx for consistency with other command-line tool references. Should be - `pipx` to match the formatting of gcloud CLI and other tool references in the document.
| - pipx | |
| - gcloud CLI | |
| - `pipx` | |
| - `gcloud CLI` |
| "Topic :: Software Development :: Libraries", | ||
| "Topic :: Software Development :: Libraries :: Python Modules", | ||
| ] | ||
| readme = "README.md" |
Copilot
AI
Dec 5, 2025
The `pyproject.toml` sets `readme = "README.md"`, but no `README.md` file exists in the integration directory. This will likely cause build failures when packaging the project. Either create the `README.md` file or remove this line from the configuration.
| readme = "README.md" |
| ) | ||
|
|
||
| JAFFLE_PARQUET_DATA_PATH = JAFFLE_CSV_DATA_PATH / "parquet" | ||
| Path.mkdir(JAFFLE_PARQUET_DATA_PATH, exist_ok=True) |
Copilot
AI
Dec 5, 2025
The call `Path.mkdir(JAFFLE_PARQUET_DATA_PATH, exist_ok=True)` is unidiomatic. It runs, because invoking the method through the `Path` class with the instance as the first argument is equivalent to calling it on the instance, but the conventional form is `JAFFLE_PARQUET_DATA_PATH.mkdir(exist_ok=True)`. Consider `JAFFLE_PARQUET_DATA_PATH.mkdir(parents=True, exist_ok=True)` so that missing parent directories are also created.
| Path.mkdir(JAFFLE_PARQUET_DATA_PATH, exist_ok=True) | |
| JAFFLE_PARQUET_DATA_PATH.mkdir(parents=True, exist_ok=True) |
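For reference, the behaviour of the two call styles and of `parents=True` can be checked with a small standalone sketch. The names `ensure_dir` and `demo` are illustrative only and are not part of the pull request; the sketch runs inside a temporary directory so it touches no project files.

```python
import tempfile
from pathlib import Path


def ensure_dir(path: Path) -> Path:
    """Create path (and any missing parents) if needed, then return it."""
    path.mkdir(parents=True, exist_ok=True)
    return path


def demo() -> dict[str, bool]:
    """Exercise both mkdir call styles inside a throwaway directory."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        # Class-style call: same effect as (root / "one").mkdir(exist_ok=True).
        Path.mkdir(root / "one", exist_ok=True)
        results["class_style_works"] = (root / "one").is_dir()

        # Without parents=True, a missing parent raises FileNotFoundError.
        try:
            (root / "missing" / "parquet").mkdir(exist_ok=True)
            results["missing_parent_raises"] = False
        except FileNotFoundError:
            results["missing_parent_raises"] = True

        # Instance-style call with parents=True creates the whole chain.
        results["ensure_dir_works"] = ensure_dir(root / "missing" / "parquet").is_dir()
    return results
```

Calling `demo()` returns a dictionary of booleans, one per case, which makes it easy to confirm the behaviour on a given Python version.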
| ## Prerequisites | ||
|
|
||
| - pipx | ||
| - gcloud CLI |
Copilot
AI
Dec 5, 2025
Missing backticks around gcloud for consistency with other command-line tool references. Should be - `gcloud` CLI to match the formatting style used for other tools.
| - gcloud CLI | |
| - `gcloud` CLI |
This pull request introduces the initial setup for the Jaffle Shop data integration, focusing on enabling CSV-to-Parquet conversion and preparing the environment for efficient data processing and integration with Snowflake and GCP. Key changes include a new conversion script, a comprehensive project configuration, and usage documentation.
Data conversion and workflow setup
- Added `convert_jaffle_csv_to_parquet.py` script to automate conversion of Jaffle Shop CSV files into Parquet format, improving data storage and query efficiency for downstream use in Snowflake.
- Added `GENERATE_JAFFLE_SHOP_PARQUET.md` documentation to guide users through generating CSV data, converting it to Parquet, and uploading Parquet files to GCP, including prerequisites and step-by-step instructions.
Project configuration and dependencies
- Added `pyproject.toml` for project metadata, dependency management (including pandas, pyarrow, fastparquet, Snowflake connectors), development tools, and configuration for code quality and testing.