[SYSTEMDS-3949] Add Demo for Databricks Workspace Execution #2520

Open: Baunsgaard wants to merge 2 commits into apache:main from Baunsgaard:databricks-dogfood-scripts
Conversation

@Baunsgaard Baunsgaard commented Jun 26, 2026

Contributor

Summary

Adds scripts/databricks/, a self-contained kit for deploying and running
SystemDS on a Databricks cluster. Tested on DBR 16.4 LTS (Spark 3.5.2,
Scala 2.12), where the SystemDS jar runs unchanged.

  • deploy.sh — one script for the whole loop: create a UC volume, upload
    SystemDS.jar, create a single-user cluster with the required Vector API /
    --add-opens JVM flags, install the Delta Kernel Maven libraries, and import
    the demo notebooks. All settings come from a .env file (.env.example
    template); local state (.env, .cluster_id) is git-ignored.
  • Demo notebooks covering the MLContext (Scala) API, the Python API, a
    Spark-ML head-to-head benchmark, and an end-to-end Delta pipeline.
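
The cluster-creation step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of deploy.sh: the .env key names (`CLUSTER_NAME`, `SPARK_VERSION`, `NODE_TYPE`, `NUM_WORKERS`, `DATABRICKS_USER`), the example values, and the specific `--add-opens` target are all assumptions. The payload field names follow the Databricks Clusters API, and `--add-modules=jdk.incubator.vector` is the standard JVM switch for the incubating Vector API.

```python
import json

# Assumed JVM flags; the exact set used by deploy.sh is not shown in this PR text.
JVM_FLAGS = " ".join([
    "--add-modules=jdk.incubator.vector",          # enable the incubating Vector API
    "--add-opens=java.base/java.nio=ALL-UNNAMED",  # example --add-opens target (assumption)
])


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def cluster_spec(env):
    """Build a single-user cluster payload from .env-style settings (hypothetical keys)."""
    return {
        "cluster_name": env.get("CLUSTER_NAME", "systemds-demo"),
        "spark_version": env.get("SPARK_VERSION", "16.4.x-scala2.12"),
        "node_type_id": env.get("NODE_TYPE", "i3.xlarge"),
        "num_workers": int(env.get("NUM_WORKERS", "1")),
        "data_security_mode": "SINGLE_USER",
        "single_user_name": env["DATABRICKS_USER"],
        "spark_conf": {
            "spark.driver.extraJavaOptions": JVM_FLAGS,
            "spark.executor.extraJavaOptions": JVM_FLAGS,
        },
    }


example_env = """
# hypothetical .env contents
CLUSTER_NAME=systemds-demo
DATABRICKS_USER="someone@example.com"
"""

spec = cluster_spec(parse_env(example_env))
print(json.dumps(spec, indent=2))
```

Keeping the payload construction separate from the .env parsing makes it easy to inspect the JSON before handing it to whatever client actually creates the cluster.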

Notebooks

| Notebook | What it shows |
| --- | --- |
| SystemDS_MLContext_Demo.scala | UC table round-trip via MLContext; configurable DML + execution mode. |
| SystemDS_vs_SparkML_LinReg.scala | transformencode + lm vs Spark ML OneHotEncoder + LinearRegression. |
| SystemDS_Delta_E2E.scala | Read a Delta table natively as a frame → transformencode → lm, vs the equivalent Spark ML pipeline. |
| SystemDS_Python_Demo.py / demo.dml | Minimal Python API and DML smoke tests. |

Benchmark numbers

Single-node i3.xlarge (4 vCPU / 30.5 GiB), 1M rows, single cold run:

| Workload | Spark ML | SystemDS | Speedup |
| --- | --- | --- | --- |
| LinReg, 130 features (encode + train) | 20.6 s | 11.0 s | ~1.9× |
| Delta E2E, 700 features (read + encode + train) | 116.6 s | 55.4 s | ~2.1× |
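
As a quick check on how the speedup column follows from the timings, the ratio is simply the Spark ML wall-clock time divided by the SystemDS wall-clock time. The snippet below only re-derives the figures reported above; the dictionary and function names are illustrative.

```python
# Wall-clock timings in seconds, as reported above: (Spark ML, SystemDS).
timings = {
    "LinReg, 130 features": (20.6, 11.0),
    "Delta E2E, 700 features": (116.6, 55.4),
}


def speedup(baseline_s, candidate_s):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


for name, (spark_ml_s, systemds_s) in timings.items():
    print(f"{name}: {speedup(spark_ml_s, systemds_s):.1f}x")
```

Since each figure comes from a single cold run, the ratios are indicative rather than statistically robust.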

Add scripts/databricks: a self-contained kit for deploying and running
SystemDS on a Databricks cluster (tested on DBR 16.4 LTS, Spark 3.5.2,
Scala 2.12, where the SystemDS jar runs unchanged).

- deploy.sh: create a UC volume, upload SystemDS.jar, create a single-user
  cluster with the required Vector API / --add-opens JVM flags, install the
  Delta Kernel Maven libraries, and import the demo notebooks. All settings
  come from a .env file (template in .env.example); local state (.env,
  .cluster_id) is git-ignored.
- SystemDS_MLContext_Demo.scala: Unity Catalog table round-trip via the
  MLContext (Scala) API with a configurable DML script and execution mode.
- SystemDS_vs_SparkML_LinReg.scala: linear regression with categorical
  encoding (transformencode + lm) vs a Spark ML OneHotEncoder +
  LinearRegression pipeline, timing encode + train.
- SystemDS_Delta_E2E.scala: end-to-end Delta -> transformencode -> lm,
  reading a Delta table natively as a frame, compared against the equivalent
  Spark ML pipeline; prints a per-instruction breakdown.
- SystemDS_Python_Demo.py and demo.dml: minimal Python API and DML smoke
  tests.
- README.md: setup, configuration, node-type guidance, Delta Kernel library
  requirement, and indicative benchmark numbers.
…orage

- Remove SystemDS_vs_SparkML_LinReg.scala: the Delta E2E notebook already
  benchmarks transformencode + lm against the Spark ML pipeline on a stored
  Delta table, so the in-memory-only benchmark is redundant.
- Remove SystemDS_Python_Demo.py to keep the kit focused on the Delta story.
- Change demo.dml to read a matrix from storage (parameterized in/fmt/out)
  instead of generating random data, so the smoke test exercises the read path.
- Update deploy.sh, .env.example, and README to the reduced two-notebook set.
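
For illustration, the in/fmt/out parameterization of demo.dml could be exercised from a standalone command line roughly as sketched below. This is an assumption about usage, not something taken from the PR: the notebooks may pass these values through MLContext instead, and the jar location and volume paths are hypothetical. The `-f` and `-nvargs` options are SystemDS command-line options; named arguments are referenced inside DML as `$in`, `$fmt`, and `$out`.

```python
def systemds_command(jar, script, params):
    """Assemble an argument list that runs a DML script with named arguments."""
    # A dict is used because "in" is a reserved word in Python and cannot be a keyword argument.
    named = [f"{key}={value}" for key, value in params.items()]
    return ["java", "-jar", jar, "-f", script, "-nvargs"] + named


cmd = systemds_command(
    "SystemDS.jar",  # hypothetical jar location
    "demo.dml",
    {
        "in": "/Volumes/main/default/systemds/X.csv",         # hypothetical input path
        "fmt": "csv",
        "out": "/Volumes/main/default/systemds/X_out.csv",    # hypothetical output path
    },
)
print(" ".join(cmd))
```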

codecov Bot commented Jun 26, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.60%. Comparing base (0637bf1) to head (c1b0ce6).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2520      +/-   ##
============================================
+ Coverage     71.55%   71.60%   +0.05%     
- Complexity    49062    49154      +92     
============================================
  Files          1574     1575       +1     
  Lines        189570   189793     +223     
  Branches      37188    37235      +47     
============================================
+ Hits         135650   135908     +258     
+ Misses        43442    43406      -36     
- Partials      10478    10479       +1     

