[SYSTEMDS-3949] Add Demo for Databricks Workspace Execution by Baunsgaard · Pull Request #2520 · apache/systemds

Baunsgaard · 2026-06-26T16:52:02Z

Summary

Adds scripts/databricks/, a self-contained kit for deploying and running
SystemDS on a Databricks cluster. Tested on DBR 16.4 LTS (Spark 3.5.2,
Scala 2.12), where the SystemDS jar runs unchanged.

- deploy.sh: one script for the whole loop. It creates a UC volume, uploads
  SystemDS.jar, creates a single-user cluster with the required Vector API /
  --add-opens JVM flags, installs the Delta Kernel Maven libraries, and imports
  the demo notebooks. All settings come from a .env file (.env.example
  template); local state (.env, .cluster_id) is git-ignored.
- Demo notebooks covering the MLContext (Scala) API, the Python API, a
  Spark-ML head-to-head benchmark, and an end-to-end Delta pipeline.
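
To make the configuration flow above concrete, here is a minimal sketch of how settings from a .env file could be turned into a single-user cluster specification. It is not taken from deploy.sh: the key names (`CLUSTER_NAME`, `SPARK_VERSION`, `NODE_TYPE`, `NUM_WORKERS`, `JVM_FLAGS`), the example values, and the exact JVM flags are all assumptions chosen for illustration. Only the field names in the returned dictionary follow the Databricks Clusters API.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def cluster_spec(settings: Dict[str, str]) -> Dict[str, object]:
    """Build a cluster-create payload from the parsed settings."""
    jvm_flags = settings["JVM_FLAGS"]
    return {
        "cluster_name": settings["CLUSTER_NAME"],
        "spark_version": settings["SPARK_VERSION"],
        "node_type_id": settings["NODE_TYPE"],
        "num_workers": int(settings.get("NUM_WORKERS", "1")),
        "data_security_mode": "SINGLE_USER",
        "spark_conf": {
            # The same flags go to driver and executors in this sketch.
            "spark.driver.extraJavaOptions": jvm_flags,
            "spark.executor.extraJavaOptions": jvm_flags,
        },
    }


# Hypothetical .env content; the real keys live in .env.example.
EXAMPLE_ENV = """
# illustrative values only
CLUSTER_NAME=systemds-demo
SPARK_VERSION=16.4.x-scala2.12
NODE_TYPE=i3.xlarge
NUM_WORKERS=1
JVM_FLAGS="--add-modules=jdk.incubator.vector --add-opens=java.base/java.nio=ALL-UNNAMED"
"""

spec = cluster_spec(parse_env(EXAMPLE_ENV))
print(spec["spark_conf"]["spark.driver.extraJavaOptions"])
```

Keeping every setting in one parsed dictionary means the same values can feed the volume, cluster, and notebook-import steps without being repeated on the command line.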

Notebooks

| Notebook | What it shows |
| --- | --- |
| `SystemDS_MLContext_Demo.scala` | UC table round-trip via MLContext; configurable DML + execution mode. |
| `SystemDS_vs_SparkML_LinReg.scala` | `transformencode` + `lm` vs Spark ML `OneHotEncoder` + `LinearRegression`. |
| `SystemDS_Delta_E2E.scala` | Read a Delta table natively as a frame → `transformencode` → `lm`, vs the equivalent Spark ML pipeline. |
| `SystemDS_Python_Demo.py` / `demo.dml` | Minimal Python API and DML smoke tests. |
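
Both benchmark notebooks turn categorical columns into numeric features before fitting the linear model. The sketch below illustrates that encoding step in plain Python, assuming a one-hot (dummy-coding) scheme. It is a conceptual illustration only: the helper names `fit_categories` and `one_hot` are hypothetical, and neither `transformencode` nor Spark ML's `OneHotEncoder` is claimed to order or lay out columns this way.

```python
from typing import Dict, List, Sequence


def fit_categories(column: Sequence[str]) -> Dict[str, int]:
    """Assign each distinct value a column index, in order of first appearance."""
    mapping: Dict[str, int] = {}
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def one_hot(column: Sequence[str], mapping: Dict[str, int]) -> List[List[float]]:
    """Expand one categorical column into len(mapping) indicator columns."""
    rows: List[List[float]] = []
    for value in column:
        row = [0.0] * len(mapping)
        row[mapping[value]] = 1.0
        rows.append(row)
    return rows


colors = ["red", "green", "red", "blue"]
mapping = fit_categories(colors)
encoded = one_hot(colors, mapping)

for value, row in zip(colors, encoded):
    print(value, row)
```

One categorical column with k distinct values becomes k numeric columns here, which is why encoded feature counts can be far larger than the number of raw input columns.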

Benchmark numbers

Single-node i3.xlarge (4 vCPU / 30.5 GiB), 1M rows, single cold run:

| Workload | Spark ML | SystemDS | Speedup |
| --- | --- | --- | --- |
| LinReg, 130 features (encode + train) | 20.6 s | 11.0 s | ~1.9× |
| Delta E2E, 700 features (read + encode + train) | 116.6 s | 55.4 s | ~2.1× |
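
The speedup column is the Spark ML time divided by the SystemDS time. The short sketch below recomputes it from the timings reported above; the dictionary layout and the `speedup` helper are illustrative, not part of the PR.

```python
from typing import Dict, Tuple

# (Spark ML seconds, SystemDS seconds), copied from the benchmark table.
timings: Dict[str, Tuple[float, float]] = {
    "LinReg, 130 features": (20.6, 11.0),
    "Delta E2E, 700 features": (116.6, 55.4),
}


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


speedups = {name: round(speedup(*pair), 1) for name, pair in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: ~{factor}x")
```

Because each figure comes from a single cold run, the ratios are indicative rather than statistically robust.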

Add scripts/databricks: a self-contained kit for deploying and running SystemDS on a Databricks cluster (tested on DBR 16.4 LTS, Spark 3.5.2, Scala 2.12, where the SystemDS jar runs unchanged).

- deploy.sh: create a UC volume, upload SystemDS.jar, create a single-user cluster with the required Vector API / --add-opens JVM flags, install the Delta Kernel Maven libraries, and import the demo notebooks. All settings come from a .env file (template in .env.example); local state (.env, .cluster_id) is git-ignored.
- SystemDS_MLContext_Demo.scala: Unity Catalog table round-trip via the MLContext (Scala) API with a configurable DML script and execution mode.
- SystemDS_vs_SparkML_LinReg.scala: linear regression with categorical encoding (transformencode + lm) vs a Spark ML OneHotEncoder + LinearRegression pipeline, timing encode + train.
- SystemDS_Delta_E2E.scala: end-to-end Delta -> transformencode -> lm, reading a Delta table natively as a frame, compared against the equivalent Spark ML pipeline; prints a per-instruction breakdown.
- SystemDS_Python_Demo.py and demo.dml: minimal Python API and DML smoke tests.
- README.md: setup, configuration, node-type guidance, Delta Kernel library requirement, and indicative benchmark numbers.

…orage

- Remove SystemDS_vs_SparkML_LinReg.scala: the Delta E2E notebook already benchmarks transformencode + lm against the Spark ML pipeline on a stored Delta table, so the in-memory-only benchmark is redundant.
- Remove SystemDS_Python_Demo.py to keep the kit focused on the Delta story.
- Change demo.dml to read a matrix from storage (parameterized in/fmt/out) instead of generating random data, so the smoke test exercises the read path.
- Update deploy.sh, .env.example, and README to the reduced two-notebook set.

codecov · 2026-06-26T17:32:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.60%. Comparing base (0637bf1) to head (c1b0ce6).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2520      +/-   ##
============================================
+ Coverage     71.55%   71.60%   +0.05%     
- Complexity    49062    49154      +92     
============================================
  Files          1574     1575       +1     
  Lines        189570   189793     +223     
  Branches      37188    37235      +47     
============================================
+ Hits         135650   135908     +258     
+ Misses        43442    43406      -36     
- Partials      10478    10479       +1
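
The coverage percentages in the diff can be recomputed from the raw counts, as sketched below. Two things are observations from the numbers shown rather than documented Codecov behavior: hits, misses, and partials sum to the line count, and the reported percentages match hits / lines truncated (not rounded) to two decimals. The helper name `coverage_bp` is hypothetical.

```python
from typing import Dict

# Counts copied from the coverage diff (base = main, head = #2520).
base: Dict[str, int] = {"lines": 189570, "hits": 135650, "misses": 43442, "partials": 10478}
head: Dict[str, int] = {"lines": 189793, "hits": 135908, "misses": 43406, "partials": 10479}


def coverage_bp(report: Dict[str, int]) -> int:
    """Coverage in hundredths of a percent, truncated via integer division."""
    return report["hits"] * 10_000 // report["lines"]


base_bp = coverage_bp(base)
head_bp = coverage_bp(head)

print(f"base  {base_bp / 100:.2f}%")
print(f"head  {head_bp / 100:.2f}%")
print(f"delta {(head_bp - base_bp) / 100:+.2f}%")
```

Integer arithmetic is used so the truncation step is not affected by floating-point rounding.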


github-project-automation Bot added this to SystemDS PR Queue Jun 26, 2026

github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jun 26, 2026

Baunsgaard wants to merge 2 commits into apache:main from Baunsgaard:databricks-dogfood-scripts
