
Conversation

@awaismirza92 (Collaborator) commented Dec 2, 2025

Closes: #50 & #52

@awaismirza92 linked an issue Dec 2, 2025 that may be closed by this pull request
@awaismirza92 self-assigned this Dec 2, 2025
@awaismirza92 marked this pull request as ready for review December 2, 2025 15:43
@awaismirza92 requested a review from srnnkls December 2, 2025 15:43
@srnnkls

This comment was marked as outdated.

def _download_parquet(url: str) -> pd.DataFrame:
"""

Collaborator:

Critical: Fundamental Architecture Flaw

This downloads parquet to local memory via pandas, defeating Spark's purpose. Use spark.read.parquet(url) directly.
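
A minimal sketch of that suggestion, reusing the write options already in this PR. It assumes the dataset is reachable at a gs:// URI the cluster can access (as in the #42 pattern), since spark.read.parquet expects a filesystem URI rather than the plain https:// URL hardcoded below:

from pyspark.sql import SparkSession

def load_table(spark: SparkSession, source_url: str, target_table: str) -> int:
    # Distributed read straight from object storage; nothing is pulled into driver memory
    df = spark.read.parquet(source_url)
    df.write.format("delta").mode("overwrite").option(
        "overwriteSchema", "true"
    ).saveAsTable(target_table)
    return df.count()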

DEFAULT_BUCKET: Final[str] = "https://static.getml.com/datasets/jaffle_shop"
DEFAULT_CATALOG: Final[str] = "workspace"
DEFAULT_SCHEMA: Final[str] = "jaffle_shop"
DEFAULT_PROFILE: Final[str] = "Code17"

@srnnkls (Collaborator) commented Dec 3, 2025:

Hardcoded personal profile

"Code17" is a personal profile name. Use the standard Databricks CLI default:

DEFAULT_PROFILE: Final[str] = "DEFAULT"

The README should document the DATABRICKS_CONFIG_PROFILE env var. Contributors set it locally via .mise.local.toml (gitignored):

# .mise.local.toml
[env]
DATABRICKS_CONFIG_PROFILE = "Code17"
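
Inside the module, one way to honour that variable while keeping the CLI default would be (a sketch, not the PR's current code):

import os
from typing import Final

# Standard CLI default, overridable locally via DATABRICKS_CONFIG_PROFILE
DEFAULT_PROFILE: Final[str] = os.environ.get("DATABRICKS_CONFIG_PROFILE", "DEFAULT")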


import logging
import sys

Collaborator:

Relative import not allowed - use from integration.databricks.data import ingestion
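
Concretely (the current form is quoted in the review summary below):

# current form; only resolves when the script's own directory is on sys.path
# from data import ingestion

# suggested absolute import
from integration.databricks.data import ingestion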

@@ -0,0 +1,7 @@
# Databricks Integration Dependencies

Collaborator:

Use pyproject.toml instead - this project uses uv for dependency management. Also missing newline at EOF.

@awaismirza92 (Author):

Done: 6464559

Originally I wanted this to be a separate task (it took me a few hours): #52

### Python API (Recommended)

Use the modules directly in notebooks or scripts:

Collaborator:

References non-existent module - preparation doesn't exist in this PR.

### Python Version Issues

Databricks serverless requires Python 3.12:

Collaborator:

Irrelevant troubleshooting - Python version belongs in pyproject.toml, not README.

sdf.write.format("delta").mode("overwrite").option(
    "overwriteSchema", "true"
).saveAsTable(config.full_table_name)

Collaborator:

SQL injection potential - validate catalog/schema as identifiers before interpolating into SQL.
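
A minimal sketch of such a check, assuming plain unquoted identifiers are enough here (the helper name is illustrative):

import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def _validate_identifier(name: str) -> str:
    # Reject anything that is not a plain unquoted SQL identifier
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Not a valid SQL identifier: {name!r}")
    return name

Catalog and schema values would pass through this before being interpolated into any SQL or into config.full_table_name.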

@srnnkls (Collaborator) commented Dec 3, 2025

Review Summary

Style Reference: Python Style Guide (getml/code17-northstar#18)


Context: Requirements from #42

This PR implements Databricks ingestion, mirroring the pattern established in #42 (Build data preparation infrastructure for feature store notebooks). Issue #42 defines the expected architecture:

integration/{platform}/
├── data/
│   ├── ingestion.py      # GCS → Platform loader
│   ├── preparation.py    # Orchestration module  
│   └── sql/              # Externalized SQL queries
└── tests/

Expected usage pattern from #42:

from integration.{platform}.data import ingestion, preparation

ingestion.load_from_gcs(
    bucket="gs://static.getml.com/datasets/jaffle-shop/",
    destination_schema="RAW"
)

The key expectation: data warehouses/platforms should use their native capabilities to ingest from GCS - not download through Python.


Critical Deviation: Architecture

The current implementation downloads parquet files to local memory via requests.get() + pandas, then converts to Spark DataFrame. This fundamentally misunderstands how Spark/Databricks works.

| Aspect | Expected (per #42 pattern) | Actual Implementation |
| --- | --- | --- |
| Data flow | GCS → Spark → Delta (direct) | GCS → Python memory → pandas → Spark → Delta |
| Scalability | Distributed across cluster | Limited by local memory |
| Dependencies | pyspark, databricks-connect | + pandas, requests, pyarrow |
| Performance | Native Spark parallelism | Single-threaded download |

Correct approach:

# Spark reads parquet directly from URL - no local memory needed
spark.read.parquet(source_url).write.format("delta").saveAsTable(target_table)

Deviations from #42 Structure

| Requirement from #42 | Status in PR |
| --- | --- |
| ingestion.py module | ✓ Present (but wrong approach) |
| preparation.py module | ✗ Missing (but referenced in README) |
| sql/ directory | ✗ Missing |
| Integration tests | ✗ Missing |
| pyproject.toml | ✗ Uses requirements.txt instead |

Issues Summary

Critical (blocking):

  • Architecture fundamentally wrong - must use Spark's native parquet reading

High priority:

  • DEFAULT_PROFILE = "Code17" - hardcoded personal config
  • Relative import in CLI script (from data import ingestion)
  • requirements.txt instead of pyproject.toml (project uses uv)
  • SQL injection potential in schema/catalog interpolation

Medium priority:

  • Cryptic variable names (pdf, sdf)
  • README references non-existent preparation module
  • README includes irrelevant Python version troubleshooting
  • Empty __init__.py without __all__ exports (see the sketch after this list)
  • Broad except Exception handling
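
On the __init__.py point above, a minimal sketch of an explicit export list (ingestion is the only module in this PR; purely illustrative):

# integration/databricks/data/__init__.py
from integration.databricks.data import ingestion

__all__ = ["ingestion"]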

Recommended Changes

  1. Rewrite ingestion to use Spark native reading:

    def load_table(spark: SparkSession, source_url: str, target_table: str) -> int:
        df = spark.read.parquet(source_url)
        df.write.format("delta").mode("overwrite").saveAsTable(target_table)
        return df.count()
  2. Remove pandas/requests dependencies - they're not needed

  3. Add pyproject.toml with uv-compatible structure

  4. Either add preparation.py or remove references from README

  5. Validate SQL identifiers before interpolation

  6. Use absolute imports throughout


Development

Successfully merging this pull request may close these issues.

Build data ingestion infrastructure for Databricks notebook
