From 50786b9e69a6fc5143b7cd52fe90ee20ec20b64f Mon Sep 17 00:00:00 2001
From: ronanstokes-db
Date: Thu, 31 Jul 2025 15:48:55 -0700
Subject: [PATCH 1/4] Updated build environment to use Databricks Runtime 13.3 LTS as baseline

---
 .github/workflows/push.yml    |  4 ++--
 .github/workflows/release.yml |  4 ++--
 CHANGELOG.md                  |  3 ++-
 CONTRIBUTING.md               | 14 +++++++-------
 README.md                     | 17 ++++++++++-------
 makefile                      |  4 ++--
 python/dev_require.txt        | 16 ++++++++--------
 python/require.txt            | 16 ++++++++--------
 setup.py                      |  2 +-
 9 files changed, 42 insertions(+), 38 deletions(-)

diff --git a/.github/workflows/push.yml b/.github/workflows/push.yml
index 936414f0..610e5a23 100644
--- a/.github/workflows/push.yml
+++ b/.github/workflows/push.yml
@@ -31,10 +31,10 @@ jobs:
         sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
         java -version
 
-      - name: Set up Python 3.9.21
+      - name: Set up Python 3.10.12
         uses: actions/setup-python@v5
         with:
-          python-version: '3.9.21'
+          python-version: '3.10.12'
           cache: 'pipenv'
 
       - name: Check Python version
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index 26761f9c..0b4680cd 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -24,10 +24,10 @@ jobs:
         sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
         java -version
 
-      - name: Set up Python 3.9.21
+      - name: Set up Python 3.10.12
        uses: actions/setup-python@v5
         with:
-          python-version: '3.9.21'
+          python-version: '3.10.12'
           cache: 'pipenv'
 
       - name: Check Python version
diff --git a/CHANGELOG.md b/CHANGELOG.md
index f90e8fff..c02aa497 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,7 +9,8 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 * Updated build scripts to use Ubuntu 22.04 to correspond to environment in Databricks runtime
 
 #### Changed
-* Changed base Databricks runtime version to DBR 11.3 LTS (based on Apache Spark 3.3.0)
+* Changed base Databricks runtime version to DBR 13.3 LTS (based on Apache Spark 3.4.1) - minimum supported version
+  of Python is now 3.12
 
 #### Added
 * Added support for serialization to/from JSON format
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index a8bdcc8f..843aecba 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -43,7 +43,7 @@ Our recommended mechanism for building the code is to use a `conda` or `pipenv`
 But it can be built with any Python virtualization environment.
 
 ### Spark dependencies
-The builds have been tested against Spark 3.3.0. This requires the OpenJDK 1.8.56 or later version of Java 8.
+The builds have been tested against Apache Spark 3.4.1.
 The Databricks runtimes use the Azul Zulu version of OpenJDK 8 and we have used these in local testing.
 These are not installed automatically by the build process, so you will need to install them separately.
 
@@ -72,7 +72,7 @@ To build with `pipenv`, perform the following commands:
 - Run `make dist` from the main project directory
 - The resulting wheel file will be placed in the `dist` subdirectory
 
-The resulting build has been tested against Spark 3.3.0
+The resulting build has been tested against Spark 3.4.1
 
 ## Creating the HTML documentation
 
@@ -158,19 +158,19 @@ See https://legacy.python.org/dev/peps/pep-0008/
 
 # Github expectations
 When running the unit tests on Github, the environment should use the same environment as the latest Databricks
-runtime latest LTS release. While compatibility is preserved on LTS releases from Databricks runtime 11.3 onwards,
+runtime LTS release. While compatibility is preserved on LTS releases from Databricks runtime 13.3 LTS onwards,
 unit tests will be run on the environment corresponding to the latest LTS release.
 
-Libraries will use the same versions as the earliest supported LTS release - currently 11.3 LTS
+Libraries will use the same versions as the earliest supported LTS release - currently 13.3 LTS
 
 This means for the current build:
-- Use of Ubuntu 22.04 for the test runner
+- Use of Ubuntu 22.04.2 LTS for the test runner
 - Use of Java 8
-- Use of Python 3.9.21 when testing / building the image
+- Use of Python 3.10.12 when testing / building the image
 
 See the following resources for more information:
 
 - https://docs.databricks.com/en/release-notes/runtime/15.4lts.html
-- https://docs.databricks.com/en/release-notes/runtime/11.3lts.html
+- https://docs.databricks.com/aws/en/release-notes/runtime/13.3lts
 - https://github.com/actions/runner-images/issues/10636
diff --git a/README.md b/README.md
index 3c000b7d..ccda9660 100644
--- a/README.md
+++ b/README.md
@@ -83,23 +83,26 @@ The documentation [installation notes](https://databrickslabs.github.io/dbldatag
 contains details of installation using alternative mechanisms.
 
 ## Compatibility
-The Databricks Labs Data Generator framework can be used with Pyspark 3.3.0 and Python 3.9.21 or later. These are
-compatible with the Databricks runtime 11.3 LTS and later releases. For full Unity Catalog support,
-we recommend using Databricks runtime 13.2 or later (Databricks 13.3 LTS or above preferred)
+The Databricks Labs Data Generator framework can be used with Pyspark 3.4.1 and Python 3.10.12 or later. These are
+compatible with the Databricks runtime 13.3 LTS and later releases. This version also provides Unity Catalog
+compatibility.
 
 For full library compatibility for a specific Databricks Spark release, see the Databricks
 release notes for library compatibility - https://docs.databricks.com/release-notes/runtime/releases.html
 
-When using the Databricks Labs Data Generator on "Unity Catalog" enabled Databricks environments,
+In older releases, when using the Databricks Labs Data Generator on "Unity Catalog" enabled Databricks environments,
 the Data Generator requires the use of `Single User` or `No Isolation Shared` access modes when using Databricks
 runtimes prior to release 13.2. This is because some needed features are not available in `Shared`
 mode (for example, use of 3rd party libraries, use of Python UDFs) in these releases.
-Depending on settings, the `Custom` access mode may be supported.
+Depending on settings, the `Custom` access mode may be supported for those releases.
 
 The use of Unity Catalog `Shared` access mode is supported in Databricks runtimes from Databricks runtime release 13.2
-onwards.
+onwards.
+
+*This version of the data generator uses Databricks runtime 13.3 LTS as the minimum supported
+version, which avoids these issues.*
 
 See the following documentation for more information:
 
@@ -155,7 +158,7 @@ The GitHub repository also contains further examples in the examples directory.
 
 ## Spark and Databricks Runtime Compatibility
 The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
-older LTS versions at least from 11.3 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
+older LTS versions at least from 13.3 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
 including `current` and `preview`.
 While we don't specifically drop support for older runtimes, changes in Pyspark APIs or
diff --git a/makefile b/makefile
index 6e597efc..152e996d 100644
--- a/makefile
+++ b/makefile
@@ -27,11 +27,11 @@ prepare: clean
 
 create-dev-env:
 	@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
-	conda create -n $(ENV_NAME) python=3.9.21
+	conda create -n $(ENV_NAME) python=3.10.12
 
 create-github-build-env:
 	@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
-	conda create -n pip_$(ENV_NAME) python=3.9.21
+	conda create -n pip_$(ENV_NAME) python=3.10.12
 
 install-dev-dependencies:
 	@echo "$(OK_COLOR)=> installing dev environment requirements$(NO_COLOR)"
diff --git a/python/dev_require.txt b/python/dev_require.txt
index 3eabad8f..1f2798d7 100644
--- a/python/dev_require.txt
+++ b/python/dev_require.txt
@@ -1,19 +1,19 @@
 # The following packages are used in building the test data generator framework.
 # All packages used are already installed in the Databricks runtime environment for version 6.5 or later
 numpy==1.22.0
-pandas==1.3.4
+pandas==1.4.4
 pickleshare==0.7.5
 py4j>=0.10.9.3
-pyarrow==7.0.0
-pyspark==3.3.0
+pyarrow==8.0.0
+pyspark==3.4.1
 python-dateutil==2.8.2
 six==1.16.0
-pyparsing==3.0.4
+pyparsing==3.0.9
 jmespath==0.10.0
 
 # The following packages are required for development only
-wheel==0.37.0
-setuptools==58.0.4
+wheel==0.37.1
+setuptools==63.4.1
 bumpversion
 pytest
 pytest-cov
@@ -28,9 +28,9 @@ sphinx_rtd_theme
 nbsphinx
 numpydoc==0.8
 pypandoc
-ipython==7.32.0
+ipython==8.10.0
 recommonmark
 sphinx-markdown-builder
-Jinja2 < 3.1
+Jinja2 < 3.1, >= 2.11.3
 sphinx-copybutton
 
diff --git a/python/require.txt b/python/require.txt
index bad13fa2..150a1570 100644
--- a/python/require.txt
+++ b/python/require.txt
@@ -1,19 +1,19 @@
 # The following packages are used in building the test data generator framework.
 # All packages used are already installed in the Databricks runtime environment for version 6.5 or later
 numpy==1.22.0
-pandas==1.3.4
+pandas==1.4.4
 pickleshare==0.7.5
 py4j==0.10.9
-pyarrow==7.0.0
-pyspark==3.3.0
+pyarrow==8.0.0
+pyspark==3.4.1
 python-dateutil==2.8.2
 six==1.16.0
-pyparsing==3.0.4
+pyparsing==3.0.9
 jmespath==0.10.0
 
 # The following packages are required for development only
-wheel==0.37.0
-setuptools==58.0.4
+wheel==0.37.1
+setuptools==63.4.1
 bumpversion
 pytest
 pytest-cov
@@ -27,9 +27,9 @@ sphinx_rtd_theme
 nbsphinx
 numpydoc==0.8
 pypandoc
-ipython==7.32.0
+ipython==8.10.0
 recommonmark
 sphinx-markdown-builder
-Jinja2 < 3.1
+Jinja2 < 3.1, >= 2.11.3
 sphinx-copybutton
 
diff --git a/setup.py b/setup.py
index 5baa5541..7ee9eeff 100644
--- a/setup.py
+++ b/setup.py
@@ -55,5 +55,5 @@
         "Intended Audience :: Developers",
         "Intended Audience :: System Administrators"
     ],
-    python_requires='>=3.9.21',
+    python_requires='>=3.10.12',
 )

From f35c791c1bf9ac008a5ddc6ad938075b2dcffa21 Mon Sep 17 00:00:00 2001
From: ronanstokes-db
Date: Thu, 31 Jul 2025 16:10:28 -0700
Subject: [PATCH 2/4] removed duplicate column addition from test (previously undetected by earlier version of PySpark)

---
 tests/test_generation_from_data.py | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tests/test_generation_from_data.py b/tests/test_generation_from_data.py
index dde7e9f3..f5793b8a 100644
--- a/tests/test_generation_from_data.py
+++ b/tests/test_generation_from_data.py
@@ -49,7 +49,6 @@ def generation_spec(self):
                 .withColumn("short_value", "short", max=32767, percentNulls=0.1)
                 .withColumn("byte_value", "tinyint", max=127)
                 .withColumn("decimal_value", "decimal(10,2)", max=1000000)
-                .withColumn("decimal_value", "decimal(10,2)", max=1000000)
                 .withColumn("date_value", "date", expr="current_date()", random=True)
                 .withColumn("binary_value", "binary", expr="cast('spark' as binary)", random=True)
 

From dc90d179ddb01961b3e3d322db59113500950607 Mon Sep 17 00:00:00 2001
From: ronanstokes-db
Date: Thu, 31 Jul 2025 20:06:11 -0700
Subject: [PATCH 3/4] updated pipfile dependencies

---
 Pipfile | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/Pipfile b/Pipfile
index 6ab8fcaa..9f41b6b6 100644
--- a/Pipfile
+++ b/Pipfile
@@ -10,7 +10,7 @@ sphinx = ">=2.0.0,<3.1.0"
 nbsphinx = "*"
 numpydoc = "==0.8"
 pypandoc = "*"
-ipython = "==7.32.0"
+ipython = "==8.10.0"
 pydata-sphinx-theme = "*"
 recommonmark = "*"
 sphinx-markdown-builder = "*"
@@ -19,13 +19,13 @@ prospector = "*"
 
 [packages]
 numpy = "==1.22.0"
-pyspark = "==3.3.0"
-pyarrow = "==7.0.0"
-wheel = "==0.37.0"
-pandas = "==1.3.4"
-setuptools = "==58.0.4"
-pyparsing = "==3.0.4"
+pyspark = "==3.4.1"
+pyarrow = "==8.0.0"
+wheel = "==0.37.1"
+pandas = "==1.4.4"
+setuptools = "==63.4.1"
+pyparsing = "==3.0.9"
 jmespath = "==0.10.0"
 
 [requires]
-python_version = "3.9.21"
+python_version = "3.10.12"

From 8c3b86787dcca94bdd1d4624b9dba9ed2d727833 Mon Sep 17 00:00:00 2001
From: ronanstokes-db
Date: Mon, 4 Aug 2025 13:06:01 -0700
Subject: [PATCH 4/4] changed comment in changelog to reflect correct Python dependencies

---
 CHANGELOG.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index c02aa497..998c6269 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,7 +10,7 @@ All notable changes to the Databricks Labs Data Generator will be documented in
 
 #### Changed
 * Changed base Databricks runtime version to DBR 13.3 LTS (based on Apache Spark 3.4.1) - minimum supported version
-  of Python is now 3.12
+  of Python is now 3.10.12
 
 #### Added
 * Added support for serialization to/from JSON format
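
A note on patch 2: per its commit message, the duplicated "decimal_value" column
spec was previously undetected by PySpark, and the move to PySpark 3.4.1 (the
DBR 13.3 LTS baseline) surfaced it. Below is a minimal, hypothetical sketch of
the corrected pattern, with each column specified exactly once; the generator
name, row count, and Spark session setup are illustrative and not part of the
patches:

    import dbldatagen as dg
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Build a generation spec; each output column is declared exactly once.
    # The line removed by patch 2 had declared "decimal_value" a second time.
    spec = (dg.DataGenerator(sparkSession=spark, name="example_data", rows=1000)
            .withColumn("short_value", "short", max=32767, percentNulls=0.1)
            .withColumn("byte_value", "tinyint", max=127)
            .withColumn("decimal_value", "decimal(10,2)", max=1000000)
            .withColumn("date_value", "date", expr="current_date()", random=True))

    df = spec.build()   # materializes the spec as a PySpark DataFrame
    df.show(5)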