[SYSTEMDS-3949] Add Demo for Databricks Workspace Execution #2520
Open
Baunsgaard wants to merge 2 commits into
Add scripts/databricks: a self-contained kit for deploying and running SystemDS on a Databricks cluster (tested on DBR 16.4 LTS, Spark 3.5.2, Scala 2.12, where the SystemDS jar runs unchanged).
- deploy.sh: create a UC volume, upload SystemDS.jar, create a single-user cluster with the required Vector API / --add-opens JVM flags, install the Delta Kernel Maven libraries, and import the demo notebooks. All settings come from a .env file (template in .env.example); local state (.env, .cluster_id) is git-ignored.
- SystemDS_MLContext_Demo.scala: Unity Catalog table round-trip via the MLContext (Scala) API with a configurable DML script and execution mode.
- SystemDS_vs_SparkML_LinReg.scala: linear regression with categorical encoding (transformencode + lm) vs a Spark ML OneHotEncoder + LinearRegression pipeline, timing encode + train.
- SystemDS_Delta_E2E.scala: end-to-end Delta -> transformencode -> lm, reading a Delta table natively as a frame, compared against the equivalent Spark ML pipeline; prints a per-instruction breakdown.
- SystemDS_Python_Demo.py and demo.dml: minimal Python API and DML smoke tests.
- README.md: setup, configuration, node-type guidance, Delta Kernel library requirement, and indicative benchmark numbers.
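The cluster-creation step described above can be sketched as a small Python helper that assembles a create-cluster request body. This is only an illustration, not the contents of deploy.sh: the function name `build_cluster_spec`, the `spark_version` key, and the sample `--add-opens` value are assumptions, and the exact flag list the script uses is not shown in this PR text. The field names follow the Databricks Clusters API as commonly documented.

```python
import json

# The Vector API is an incubator module that is enabled with this JVM flag.
VECTOR_API_FLAG = "--add-modules=jdk.incubator.vector"


def build_cluster_spec(cluster_name, node_type_id, user_name,
                       num_workers=1, add_opens=()):
    """Build a dict shaped like a Databricks create-cluster request.

    Hypothetical helper: the real deploy.sh may set more or different fields.
    """
    # Driver and executors need the same JVM options so SystemDS behaves
    # identically on both sides.
    java_opts = " ".join([VECTOR_API_FLAG, *add_opens])
    return {
        "cluster_name": cluster_name,
        # Assumed runtime key for DBR 16.4 LTS with Scala 2.12.
        "spark_version": "16.4.x-scala2.12",
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Single-user access mode, tied to one workspace user.
        "data_security_mode": "SINGLE_USER",
        "single_user_name": user_name,
        "spark_conf": {
            "spark.driver.extraJavaOptions": java_opts,
            "spark.executor.extraJavaOptions": java_opts,
        },
    }


if __name__ == "__main__":
    spec = build_cluster_spec(
        "systemds-demo",
        "i3.xlarge",
        "user@example.com",
        # Illustrative value only; the PR does not list the actual flags.
        add_opens=["--add-opens=java.base/java.nio=ALL-UNNAMED"],
    )
    print(json.dumps(spec, indent=2))
```

Keeping the `--add-opens` list as a parameter avoids hard-coding flags that the PR text does not spell out.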
…orage
- Remove SystemDS_vs_SparkML_LinReg.scala: the Delta E2E notebook already benchmarks transformencode + lm against the Spark ML pipeline on a stored Delta table, so the in-memory-only benchmark is redundant.
- Remove SystemDS_Python_Demo.py to keep the kit focused on the Delta story.
- Change demo.dml to read a matrix from storage (parameterized in/fmt/out) instead of generating random data, so the smoke test exercises the read path.
- Update deploy.sh, .env.example, and README to the reduced two-notebook set.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #2520 +/- ##
============================================
+ Coverage 71.55% 71.60% +0.05%
- Complexity 49062 49154 +92
============================================
Files 1574 1575 +1
Lines 189570 189793 +223
Branches 37188 37235 +47
============================================
+ Hits 135650 135908 +258
+ Misses 43442 43406 -36
- Partials 10478 10479 +1
Summary
Adds scripts/databricks/, a self-contained kit for deploying and running SystemDS on a Databricks cluster. Tested on DBR 16.4 LTS (Spark 3.5.2, Scala 2.12), where the SystemDS jar runs unchanged.
- deploy.sh: one script for the whole loop. It creates a UC volume, uploads SystemDS.jar, creates a single-user cluster with the required Vector API / --add-opens JVM flags, installs the Delta Kernel Maven libraries, and imports the demo notebooks. All settings come from a .env file (.env.example template); local state (.env, .cluster_id) is git-ignored.
- … Spark-ML head-to-head benchmark, and an end-to-end Delta pipeline.
Notebooks
- SystemDS_MLContext_Demo.scala
- SystemDS_vs_SparkML_LinReg.scala: transformencode + lm vs Spark ML OneHotEncoder + LinearRegression.
- SystemDS_Delta_E2E.scala: transformencode → lm, vs the equivalent Spark ML pipeline.
- SystemDS_Python_Demo.py / demo.dml
Benchmark numbers
Single-node i3.xlarge (4 vCPU / 30.5 GiB), 1M rows, single cold run: