
Conversation

@awaismirza92 (Collaborator) commented Dec 2, 2025

Closes: #50 & #52

@awaismirza92 linked an issue Dec 2, 2025 that may be closed by this pull request
@awaismirza92 self-assigned this Dec 2, 2025
@awaismirza92 marked this pull request as ready for review December 2, 2025 15:43
@awaismirza92 requested a review from srnnkls December 2, 2025 15:43
@srnnkls

This comment was marked as outdated.

def _download_parquet(url: str) -> pd.DataFrame:
"""

Collaborator:

Critical: Fundamental Architecture Flaw

This downloads parquet to local memory via pandas, defeating Spark's purpose. Use spark.read.parquet(url) directly.
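
A minimal sketch of that suggestion, reusing the write options already in this PR. It assumes the dataset is reachable at a gs:// URI the cluster can access (as in the #42 pattern), since spark.read.parquet expects a filesystem URI rather than the plain https:// URL hardcoded below:

from pyspark.sql import SparkSession

def load_table(spark: SparkSession, source_url: str, target_table: str) -> int:
    # Distributed read straight from object storage; nothing is pulled into driver memory
    df = spark.read.parquet(source_url)
    df.write.format("delta").mode("overwrite").option(
        "overwriteSchema", "true"
    ).saveAsTable(target_table)
    return df.count()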

DEFAULT_BUCKET: Final[str] = "https://static.getml.com/datasets/jaffle_shop"
DEFAULT_CATALOG: Final[str] = "workspace"
DEFAULT_SCHEMA: Final[str] = "jaffle_shop"
DEFAULT_PROFILE: Final[str] = "Code17"

@srnnkls (Collaborator) commented Dec 3, 2025:

Hardcoded personal profile

"Code17" is a personal profile name. Use the standard Databricks CLI default:

DEFAULT_PROFILE: Final[str] = "DEFAULT"

The README should document the DATABRICKS_CONFIG_PROFILE env var. Contributors set it locally via .mise.local.toml (gitignored):

# .mise.local.toml
[env]
DATABRICKS_CONFIG_PROFILE = "Code17"
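
Inside the module, one way to honour that variable while keeping the CLI default would be (a sketch, not the PR's current code):

import os
from typing import Final

# Standard CLI default, overridable locally via DATABRICKS_CONFIG_PROFILE
DEFAULT_PROFILE: Final[str] = os.environ.get("DATABRICKS_CONFIG_PROFILE", "DEFAULT")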


import logging
import sys

Collaborator:

Relative import not allowed - use from integration.databricks.data import ingestion
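
Concretely (the current form is quoted in the review summary below):

# current form; only resolves when the script's own directory is on sys.path
# from data import ingestion

# suggested absolute import
from integration.databricks.data import ingestion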

@@ -0,0 +1,7 @@
# Databricks Integration Dependencies

Collaborator:

Use pyproject.toml instead - this project uses uv for dependency management. Also missing newline at EOF.

@awaismirza92 (Author):

Done: 6464559

Originally I wanted this to be a separate task (it took me a few hours): #52

### Python API (Recommended)

Use the modules directly in notebooks or scripts:

Collaborator:

References non-existent module - preparation doesn't exist in this PR.

### Python Version Issues

Databricks serverless requires Python 3.12:

Collaborator:

Irrelevant troubleshooting - Python version belongs in pyproject.toml, not README.

sdf.write.format("delta").mode("overwrite").option(
    "overwriteSchema", "true"
).saveAsTable(config.full_table_name)

Collaborator:

SQL injection potential - validate catalog/schema as identifiers before interpolating into SQL.
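
A minimal sketch of such a check, assuming plain unquoted identifiers are enough here (the helper name is illustrative):

import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def _validate_identifier(name: str) -> str:
    # Reject anything that is not a plain unquoted SQL identifier
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"Not a valid SQL identifier: {name!r}")
    return name

Catalog and schema values would pass through this before being interpolated into any SQL or into config.full_table_name.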

@srnnkls (Collaborator) commented Dec 3, 2025

Review Summary

Style Reference: Python Style Guide (getml/code17-northstar#18)


Context: Requirements from #42

This PR implements Databricks ingestion, mirroring the pattern established in #42 (Build data preparation infrastructure for feature store notebooks). Issue #42 defines the expected architecture:

integration/{platform}/
├── data/
│   ├── ingestion.py      # GCS → Platform loader
│   ├── preparation.py    # Orchestration module  
│   └── sql/              # Externalized SQL queries
└── tests/

Expected usage pattern from #42:

from integration.{platform}.data import ingestion, preparation

ingestion.load_from_gcs(
    bucket="gs://static.getml.com/datasets/jaffle-shop/",
    destination_schema="RAW"
)

The key expectation: data warehouses/platforms should use their native capabilities to ingest from GCS - not download through Python.


Critical Deviation: Architecture

The current implementation downloads parquet files to local memory via requests.get() + pandas, then converts to Spark DataFrame. This fundamentally misunderstands how Spark/Databricks works.

| Aspect | Expected (per #42 pattern) | Actual Implementation |
| --- | --- | --- |
| Data flow | GCS → Spark → Delta (direct) | GCS → Python memory → pandas → Spark → Delta |
| Scalability | Distributed across cluster | Limited by local memory |
| Dependencies | pyspark, databricks-connect | + pandas, requests, pyarrow |
| Performance | Native Spark parallelism | Single-threaded download |

Correct approach:

# Spark reads parquet directly from URL - no local memory needed
spark.read.parquet(source_url).write.format("delta").saveAsTable(target_table)

Deviations from #42 Structure

| Requirement from #42 | Status in PR |
| --- | --- |
| ingestion.py module | ✓ Present (but wrong approach) |
| preparation.py module | ✗ Missing (but referenced in README) |
| sql/ directory | ✗ Missing |
| Integration tests | ✗ Missing |
| pyproject.toml | ✗ Uses requirements.txt instead |

Issues Summary

Critical (blocking):

  • Architecture fundamentally wrong - must use Spark's native parquet reading

High priority:

  • DEFAULT_PROFILE = "Code17" - hardcoded personal config
  • Relative import in CLI script (from data import ingestion)
  • requirements.txt instead of pyproject.toml (project uses uv)
  • SQL injection potential in schema/catalog interpolation

Medium priority:

  • Cryptic variable names (pdf, sdf)
  • README references non-existent preparation module
  • README includes irrelevant Python version troubleshooting
  • Empty __init__.py without __all__ exports (see the sketch after this list)
  • Broad except Exception handling
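
On the __init__.py point above, a minimal sketch of an explicit export list (ingestion is the only module in this PR; purely illustrative):

# integration/databricks/data/__init__.py
from integration.databricks.data import ingestion

__all__ = ["ingestion"]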

Recommended Changes

  1. Rewrite ingestion to use Spark native reading:

    def load_table(spark: SparkSession, source_url: str, target_table: str) -> int:
        df = spark.read.parquet(source_url)
        df.write.format("delta").mode("overwrite").saveAsTable(target_table)
        return df.count()
  2. Remove pandas/requests dependencies - they're not needed

  3. Add pyproject.toml with uv-compatible structure

  4. Either add preparation.py or remove references from README

  5. Validate SQL identifiers before interpolation

  6. Use absolute imports throughout


Development

Successfully merging this pull request may close these issues.

Build data ingestion infrastructure for Databricks notebook
