# Build data ingestion infrastructure for Databricks notebook #51
base: master
## Conversation
```python
def _download_parquet(url: str) -> pd.DataFrame:
    """
```
**Critical: fundamental architecture flaw.** This downloads parquet to local memory via pandas, defeating Spark's purpose. Use `spark.read.parquet(url)` directly.
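A minimal sketch of the suggested direction, assuming a live Spark session in the notebook and a source path Spark can read natively; the function and variable names here are illustrative, not taken from the PR:

```python
def ingest_table(spark, source_url: str, target_table: str) -> None:
    """Let Spark read the parquet source directly; no local pandas download."""
    (
        spark.read.parquet(source_url)      # distributed read by the cluster
        .write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable(target_table)          # e.g. "workspace.jaffle_shop.orders"
    )
```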
```python
DEFAULT_BUCKET: Final[str] = "https://static.getml.com/datasets/jaffle_shop"
DEFAULT_CATALOG: Final[str] = "workspace"
DEFAULT_SCHEMA: Final[str] = "jaffle_shop"
DEFAULT_PROFILE: Final[str] = "Code17"
```
**Hardcoded personal profile.** `"Code17"` is a personal profile name. Use the standard Databricks CLI default:

```python
DEFAULT_PROFILE: Final[str] = "DEFAULT"
```

The README should document the `DATABRICKS_CONFIG_PROFILE` env var. Contributors set it locally via `.mise.local.toml` (gitignored):

```toml
# .mise.local.toml
[env]
DATABRICKS_CONFIG_PROFILE = "Code17"
```
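A sketch of how the module-level constant could then pick that up, assuming the env var is the source of truth; this is illustrative, not code from the PR:

```python
import os
from typing import Final

# Resolve the profile from the standard Databricks CLI env var,
# falling back to the CLI's default profile name.
DEFAULT_PROFILE: Final[str] = os.environ.get("DATABRICKS_CONFIG_PROFILE", "DEFAULT")
```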
```python
import logging
import sys
```
|
|
Relative import not allowed. Use `from integration.databricks.data import ingestion`.
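For illustration, assuming the current code imports the sibling module relatively:

```python
# Disallowed here (relative import):
# from .data import ingestion

# Requested absolute form:
from integration.databricks.data import ingestion
```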
```text
@@ -0,0 +1,7 @@
# Databricks Integration Dependencies
```
Use `pyproject.toml` instead: this project uses uv for dependency management. Also missing a newline at EOF.
```markdown
### Python API (Recommended)

Use the modules directly in notebooks or scripts:
```
References a non-existent module: `preparation` doesn't exist in this PR.
```markdown
### Python Version Issues

Databricks serverless requires Python 3.12:
```
Irrelevant troubleshooting: the Python version requirement belongs in `pyproject.toml`, not the README.
```python
sdf.write.format("delta").mode("overwrite").option(
    "overwriteSchema", "true"
).saveAsTable(config.full_table_name)
```
SQL injection potential: validate catalog and schema names as identifiers before interpolating them into SQL.
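A minimal sketch of such a check, assuming catalog and schema arrive as plain strings before being formatted into statements like `CREATE SCHEMA`; the helper name is illustrative:

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def validate_identifier(name: str) -> str:
    """Reject anything that is not a plain, unquoted SQL identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"not a valid SQL identifier: {name!r}")
    return name

catalog = validate_identifier("workspace")
schema = validate_identifier("jaffle_shop")
sql = f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}"  # safe to interpolate now
```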
## Review Summary

Style Reference: Python Style Guide (getml/code17-northstar#18)

### Context: Requirements from #42

This PR implements Databricks ingestion, mirroring the pattern established in #42 (Build data preparation infrastructure for feature store notebooks). Issue #42 defines the expected architecture.

Expected usage pattern from #42:

```python
from integration.{platform}.data import ingestion, preparation

ingestion.load_from_gcs(
    bucket="gs://static.getml.com/datasets/jaffle-shop/",
    destination_schema="RAW"
)
```

The key expectation: data warehouses/platforms should use their native capabilities to ingest from GCS, not download through Python.

### Critical Deviation: Architecture

The current implementation downloads parquet files to local memory via pandas (`_download_parquet()`) rather than letting Spark read the source directly.

Correct approach:

```python
# Spark reads parquet directly from URL - no local memory needed
spark.read.parquet(source_url).write.format("delta").saveAsTable(target_table)
```

### Deviations from #42 Structure

- The `preparation` module expected alongside `ingestion` is not part of this PR.
- Ingestion goes through a local pandas download instead of the platform's native read from the bucket.
- The source is addressed as an HTTPS URL rather than the `gs://` bucket path used in #42.

### Issues Summary

Critical (blocking):

- Parquet is downloaded into local memory via pandas instead of being read natively by Spark.

High priority:

- Hardcoded personal profile `"Code17"` instead of the Databricks CLI default.
- Relative imports instead of `from integration.databricks.data import ingestion`.
- SQL injection potential from unvalidated catalog/schema identifiers.
- Dependencies declared in a separate requirements file instead of `pyproject.toml` (the project uses uv).

Medium priority:

- README references the non-existent `preparation` module.
- Python version troubleshooting belongs in `pyproject.toml`, not the README.
- Missing newline at EOF in the dependencies file.

### Recommended Changes

- Replace the pandas download with `spark.read.parquet(...)` writing straight to Delta tables.
- Default the profile to `DEFAULT` and document the `DATABRICKS_CONFIG_PROFILE` env var (set locally via `.mise.local.toml`).
- Use absolute imports and move dependencies into `pyproject.toml`.
- Validate catalog and schema names as identifiers before interpolating them into SQL.
- Drop README references to modules and troubleshooting steps that are not part of this PR.
Closes: #50 & #52