Merged
4 changes: 2 additions & 2 deletions .github/workflows/push.yml
Original file line number Diff line number Diff line change
Expand Up @@ -31,10 +31,10 @@ jobs:
sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
java -version

- name: Set up Python 3.9.21
- name: Set up Python 3.10.12
uses: actions/setup-python@v5
with:
python-version: '3.9.21'
python-version: '3.10.12'
cache: 'pipenv'

- name: Check Python version
4 changes: 2 additions & 2 deletions .github/workflows/release.yml
@@ -24,10 +24,10 @@ jobs:
sudo update-alternatives --set java /usr/lib/jvm/temurin-8-jdk-amd64/bin/java
java -version

- name: Set up Python 3.9.21
- name: Set up Python 3.10.12
uses: actions/setup-python@v5
with:
python-version: '3.9.21'
python-version: '3.10.12'
cache: 'pipenv'

- name: Check Python version
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -9,7 +9,8 @@ All notable changes to the Databricks Labs Data Generator will be documented in
* Updated build scripts to use Ubuntu 22.04 to correspond to environment in Databricks runtime

#### Changed
* Changed base Databricks runtime version to DBR 11.3 LTS (based on Apache Spark 3.3.0)
* Changed base Databricks runtime version to DBR 13.3 LTS (based on Apache Spark 3.4.1); the minimum supported
Python version is now 3.10.12

#### Added
* Added support for serialization to/from JSON format
14 changes: 7 additions & 7 deletions CONTRIBUTING.md
@@ -43,7 +43,7 @@ Our recommended mechanism for building the code is to use a `conda` or `pipenv`
But it can be built with any Python virtualization environment.

### Spark dependencies
The builds have been tested against Spark 3.3.0. This requires the OpenJDK 1.8.56 or later version of Java 8.
The builds have been tested against Apache Spark 3.4.1.
The Databricks runtimes use the Azul Zulu version of OpenJDK 8 and we have used these in local testing.
These are not installed automatically by the build process, so you will need to install them separately.

@@ -72,7 +72,7 @@ To build with `pipenv`, perform the following commands:
- Run `make dist` from the main project directory
- The resulting wheel file will be placed in the `dist` subdirectory

The resulting build has been tested against Spark 3.3.0
The resulting build has been tested against Spark 3.4.1

## Creating the HTML documentation

@@ -158,19 +158,19 @@ See https://legacy.python.org/dev/peps/pep-0008/

# GitHub expectations
When running the unit tests on GitHub, the environment should match that of the latest Databricks
runtime LTS release. While compatibility is preserved on LTS releases from Databricks runtime 11.3 onwards,
runtime LTS release. While compatibility is preserved on LTS releases from Databricks runtime 13.3 LTS onwards,
unit tests will be run on the environment corresponding to the latest LTS release.

Libraries will use the same versions as the earliest supported LTS release - currently 11.3 LTS
Libraries will use the same versions as the earliest supported LTS release - currently 13.3 LTS

This means for the current build:

- Use of Ubuntu 22.04 for the test runner
- Use of Ubuntu 22.04.2 LTS for the test runner
- Use of Java 8
- Use of Python 3.9.21 when testing / building the image
- Use of Python 3.10.12 when testing / building the image
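
The Python requirement above (and the `python_requires='>=3.10.12'` gate in `setup.py`) amounts to a simple version-tuple comparison. A minimal sketch, using a hypothetical helper `meets_min_python` that is not part of the build scripts:

```python
import sys

def meets_min_python(version_info, minimum=(3, 10, 12)):
    """Return True when the interpreter is at least the required minimum version."""
    return tuple(version_info[:3]) >= minimum

print(meets_min_python((3, 10, 12)))       # True: the current minimum
print(meets_min_python((3, 9, 21)))        # False: the previously targeted version
print(meets_min_python(sys.version_info))  # depends on the running interpreter
```

Note that tuple comparison is element-wise, so `(3, 9, 21)` is below `(3, 10, 12)` even though 21 > 12.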

See the following resources for more information
- https://docs.databricks.com/en/release-notes/runtime/15.4lts.html
- https://docs.databricks.com/en/release-notes/runtime/11.3lts.html
- https://docs.databricks.com/aws/en/release-notes/runtime/13.3lts
- https://github.com/actions/runner-images/issues/10636

16 changes: 8 additions & 8 deletions Pipfile
@@ -10,7 +10,7 @@ sphinx = ">=2.0.0,<3.1.0"
nbsphinx = "*"
numpydoc = "==0.8"
pypandoc = "*"
ipython = "==7.32.0"
ipython = "==8.10.0"
pydata-sphinx-theme = "*"
recommonmark = "*"
sphinx-markdown-builder = "*"
@@ -19,13 +19,13 @@ prospector = "*"

[packages]
numpy = "==1.22.0"
pyspark = "==3.3.0"
pyarrow = "==7.0.0"
wheel = "==0.37.0"
pandas = "==1.3.4"
setuptools = "==58.0.4"
pyparsing = "==3.0.4"
pyspark = "==3.4.1"
pyarrow = "==8.0.0"
wheel = "==0.37.1"
pandas = "==1.4.4"
setuptools = "==63.4.1"
pyparsing = "==3.0.9"
jmespath = "==0.10.0"

[requires]
python_version = "3.9.21"
python_version = "3.10.12"
17 changes: 10 additions & 7 deletions README.md
@@ -83,23 +83,26 @@ The documentation [installation notes](https://databrickslabs.github.io/dbldatag
contains details of installation using alternative mechanisms.

## Compatibility
The Databricks Labs Data Generator framework can be used with Pyspark 3.3.0 and Python 3.9.21 or later. These are
compatible with the Databricks runtime 11.3 LTS and later releases. For full Unity Catalog support,
we recommend using Databricks runtime 13.2 or later (Databricks 13.3 LTS or above preferred)
The Databricks Labs Data Generator framework can be used with Pyspark 3.4.1 and Python 3.10.12 or later. These are
compatible with the Databricks runtime 13.3 LTS and later releases. This version also provides Unity Catalog
compatibility.

For full library compatibility for a specific Databricks Spark release, see the Databricks
release notes for library compatibility

- https://docs.databricks.com/release-notes/runtime/releases.html

When using the Databricks Labs Data Generator on "Unity Catalog" enabled Databricks environments,
In older releases, when using the Databricks Labs Data Generator on "Unity Catalog" enabled Databricks environments,
the Data Generator requires the use of `Single User` or `No Isolation Shared` access modes when using Databricks
runtimes prior to release 13.2. This is because some needed features are not available in `Shared`
mode (for example, use of 3rd party libraries, use of Python UDFs) in these releases.
Depending on settings, the `Custom` access mode may be supported.
Depending on settings, the `Custom` access mode may be supported for those releases.

The use of Unity Catalog `Shared` access mode is supported in Databricks runtimes from Databricks runtime release 13.2
onwards.
onwards.

*This version of the data generator uses the Databricks runtime 13.3 LTS as the minimum supported
version and alleviates these issues.*

See the following documentation for more information:

@@ -155,7 +158,7 @@ The GitHub repository also contains further examples in the examples directory.

## Spark and Databricks Runtime Compatibility
The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
older LTS versions at least from 11.3 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
older LTS versions at least from 13.3 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
including `current` and `preview`.

While we don't specifically drop support for older runtimes, changes in Pyspark APIs or
4 changes: 2 additions & 2 deletions makefile
@@ -27,11 +27,11 @@ prepare: clean

create-dev-env:
@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
conda create -n $(ENV_NAME) python=3.9.21
conda create -n $(ENV_NAME) python=3.10.12

create-github-build-env:
@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
conda create -n pip_$(ENV_NAME) python=3.9.21
conda create -n pip_$(ENV_NAME) python=3.10.12

install-dev-dependencies:
@echo "$(OK_COLOR)=> installing dev environment requirements$(NO_COLOR)"
16 changes: 8 additions & 8 deletions python/dev_require.txt
@@ -1,19 +1,19 @@
# The following packages are used in building the test data generator framework.
# All packages used are already installed in the Databricks runtime environment for version 6.5 or later
numpy==1.22.0
pandas==1.3.4
pandas==1.4.4
pickleshare==0.7.5
py4j>=0.10.9.3
pyarrow==7.0.0
pyspark==3.3.0
pyarrow==8.0.0
pyspark==3.4.1
python-dateutil==2.8.2
six==1.16.0
pyparsing==3.0.4
pyparsing==3.0.9
jmespath==0.10.0

# The following packages are required for development only
wheel==0.37.0
setuptools==58.0.4
wheel==0.37.1
setuptools==63.4.1
bumpversion
pytest
pytest-cov
@@ -28,9 +28,9 @@ sphinx_rtd_theme
nbsphinx
numpydoc==0.8
pypandoc
ipython==7.32.0
ipython==8.10.0
recommonmark
sphinx-markdown-builder
Jinja2 < 3.1
Jinja2 < 3.1, >= 2.11.3
sphinx-copybutton

16 changes: 8 additions & 8 deletions python/require.txt
@@ -1,19 +1,19 @@
# The following packages are used in building the test data generator framework.
# All packages used are already installed in the Databricks runtime environment for version 6.5 or later
numpy==1.22.0
pandas==1.3.4
pandas==1.4.4
pickleshare==0.7.5
py4j==0.10.9
pyarrow==7.0.0
pyspark==3.3.0
pyarrow==8.0.0
pyspark==3.4.1
python-dateutil==2.8.2
six==1.16.0
pyparsing==3.0.4
pyparsing==3.0.9
jmespath==0.10.0

# The following packages are required for development only
wheel==0.37.0
setuptools==58.0.4
wheel==0.37.1
setuptools==63.4.1
bumpversion
pytest
pytest-cov
@@ -27,9 +27,9 @@ sphinx_rtd_theme
nbsphinx
numpydoc==0.8
pypandoc
ipython==7.32.0
ipython==8.10.0
recommonmark
sphinx-markdown-builder
Jinja2 < 3.1
Jinja2 < 3.1, >= 2.11.3
sphinx-copybutton

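The pinned requirements above can be checked mechanically. A sketch under the assumption that exact pins use `==` and that ranged constraints (such as the `Jinja2` line) should be skipped; `parse_pins` is illustrative, not part of the project:

```python
def parse_pins(requirements_text):
    """Extract 'package==version' pins, skipping comments and ranged constraints."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        # Skip blanks and anything with a range operator rather than an exact pin
        if not line or "<" in line or ">" in line:
            continue
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins

sample = """\
numpy==1.22.0
pandas==1.4.4
pyspark==3.4.1
Jinja2 < 3.1, >= 2.11.3
"""
print(parse_pins(sample))  # {'numpy': '1.22.0', 'pandas': '1.4.4', 'pyspark': '3.4.1'}
```

A check like this can confirm that `python/require.txt` and `python/dev_require.txt` stay in sync on shared pins.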
2 changes: 1 addition & 1 deletion setup.py
@@ -55,5 +55,5 @@
"Intended Audience :: Developers",
"Intended Audience :: System Administrators"
],
python_requires='>=3.9.21',
python_requires='>=3.10.12',
)
1 change: 0 additions & 1 deletion tests/test_generation_from_data.py
@@ -49,7 +49,6 @@ def generation_spec(self):
.withColumn("short_value", "short", max=32767, percentNulls=0.1)
.withColumn("byte_value", "tinyint", max=127)
.withColumn("decimal_value", "decimal(10,2)", max=1000000)
.withColumn("decimal_value", "decimal(10,2)", max=1000000)
.withColumn("date_value", "date", expr="current_date()", random=True)
.withColumn("binary_value", "binary", expr="cast('spark' as binary)", random=True)

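The deletion above removes a duplicated `decimal_value` column from the test's generation spec. Duplicate column names in a long builder chain are easy to miss; a sketch of a hygiene check that could catch them (the `find_duplicates` helper is hypothetical, not a dbldatagen API):

```python
def find_duplicates(column_names):
    """Return column names that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for name in column_names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates

spec_columns = ["short_value", "byte_value", "decimal_value",
                "decimal_value", "date_value", "binary_value"]
print(find_duplicates(spec_columns))  # ['decimal_value']
```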