Feature standard datasets - part 1 #258

Merged
81 commits merged into master from feature_standard_datasets on Jun 7, 2024
Changes from all commits
Commits (81)
ece0726
work in progress
ronanstokes-db Mar 22, 2024
be90276
work in progress
ronanstokes-db Mar 22, 2024
38f66a3
work in progress
ronanstokes-db Mar 22, 2024
00a71da
wip
ronanstokes-db Mar 22, 2024
fdbcef2
wip
ronanstokes-db Mar 24, 2024
81b9680
added implementations for Datasets describe and listing
ronanstokes-db Mar 28, 2024
8cf15d9
bumpedBuild
ronanstokes-db Mar 28, 2024
0b12ad4
bumpedBuild
ronanstokes-db Mar 28, 2024
0d73fcc
bumpedBuild
ronanstokes-db Mar 28, 2024
471ee44
bumpedBuild
ronanstokes-db Mar 28, 2024
b79d611
fixed dataset provider imports
ronanstokes-db Mar 28, 2024
40ff743
fixed dataset provider imports
ronanstokes-db Mar 28, 2024
c262363
fixed dataset provider imports
ronanstokes-db Mar 28, 2024
76a09f0
fixed dataset provider imports
ronanstokes-db Mar 28, 2024
9cca76f
fixed dataset provider imports
ronanstokes-db Mar 28, 2024
26decec
fixed dataset provider imports
ronanstokes-db Mar 28, 2024
19b77df
wip
ronanstokes-db Mar 28, 2024
ac16b50
wip
ronanstokes-db Mar 28, 2024
3a9c5e6
initial working version
ronanstokes-db Mar 28, 2024
4d930ec
initial working version
ronanstokes-db Mar 28, 2024
dbda36e
initial working version
ronanstokes-db Mar 28, 2024
17b1e0d
initial working version
ronanstokes-db Mar 28, 2024
eaa57a3
initial working version
ronanstokes-db Mar 28, 2024
a02d150
initial working version
ronanstokes-db Mar 28, 2024
eba1724
initial working version
ronanstokes-db Mar 28, 2024
b0bc9ad
initial working version
ronanstokes-db Mar 28, 2024
a99f362
initial working version
ronanstokes-db Mar 28, 2024
c3f8ab8
added telephony plans
ronanstokes-db Mar 28, 2024
a6aa7d7
added telephony plans
ronanstokes-db Mar 28, 2024
6a7a6b8
Merge branch 'master' into feature_standard_datasets
ronanstokes-db Mar 28, 2024
32ecd13
added telephony plans
ronanstokes-db Mar 28, 2024
6167b27
initial working version
ronanstokes-db Mar 28, 2024
0c316e9
Added tokei.rs badge (#253)
nfx Feb 28, 2024
f966845
Prep for release 036 (#251)
ronanstokes-db Mar 28, 2024
4410b3d
added telephony plans
ronanstokes-db Mar 28, 2024
f702c8f
initial implementation
ronanstokes-db Mar 28, 2024
0fba14b
added basic/iot dataset
ronanstokes-db Mar 28, 2024
fc28735
wip
ronanstokes-db Mar 29, 2024
17af167
wip
ronanstokes-db Apr 1, 2024
679c473
work in progress
ronanstokes-db Apr 4, 2024
9609cda
wip
ronanstokes-db Apr 4, 2024
ecdd31b
wip
ronanstokes-db Apr 4, 2024
813ce9a
wip
ronanstokes-db Apr 6, 2024
e7abd60
wip
ronanstokes-db Apr 6, 2024
ecb888a
wip
ronanstokes-db Apr 6, 2024
035c29e
wip
ronanstokes-db Apr 8, 2024
afc0788
work in progress
ronanstokes-db Apr 12, 2024
28d9afd
wip
ronanstokes-db Apr 16, 2024
b8601d3
Merge branch 'master' into feature_standard_datasets
ronanstokes-db May 23, 2024
c12ff2f
Merge branch 'master' into feature_standard_datasets
ronanstokes-db May 23, 2024
b1e6ff4
Merge branch 'master' into feature_standard_datasets
ronanstokes-db May 23, 2024
c6352dc
wip
ronanstokes-db May 23, 2024
1bddbd9
wip
ronanstokes-db May 24, 2024
5171eff
wip
ronanstokes-db May 24, 2024
5fd9810
wip
ronanstokes-db May 25, 2024
6073049
wip
ronanstokes-db May 28, 2024
9f50b75
Merge branch 'master' into feature_standard_datasets
ronanstokes-db May 28, 2024
1ffa812
wip
ronanstokes-db May 28, 2024
2441e73
wip
ronanstokes-db May 29, 2024
f3a68ad
wip
ronanstokes-db May 29, 2024
4cfbb35
wip
ronanstokes-db May 29, 2024
a620072
wip
ronanstokes-db May 30, 2024
cdf61bd
wip
ronanstokes-db May 31, 2024
c3be5fc
wip
ronanstokes-db Jun 1, 2024
341f9c3
Merge branch 'master' into feature_standard_datasets
ronanstokes-db Jun 1, 2024
0b9ffa5
wip
ronanstokes-db Jun 1, 2024
809f7d1
wip
ronanstokes-db Jun 1, 2024
005b744
wip
ronanstokes-db Jun 1, 2024
1d0d77a
wip
ronanstokes-db Jun 1, 2024
8fd915d
wip
ronanstokes-db Jun 1, 2024
02d9634
wip
ronanstokes-db Jun 4, 2024
9fea524
wip
ronanstokes-db Jun 5, 2024
6f95d6b
wip
ronanstokes-db Jun 5, 2024
64de31d
wip
ronanstokes-db Jun 5, 2024
8dcf623
wip
ronanstokes-db Jun 5, 2024
d336fdc
additional coverage tests
ronanstokes-db Jun 5, 2024
1d2bc8e
additional coverage
ronanstokes-db Jun 5, 2024
548cb75
additional coverage
ronanstokes-db Jun 5, 2024
17a37bc
additional coverage
ronanstokes-db Jun 5, 2024
acb835c
additional coverage
ronanstokes-db Jun 5, 2024
639347b
Merge branch 'master' into feature_standard_datasets
ronanstokes-db Jun 7, 2024
8 changes: 4 additions & 4 deletions CHANGELOG.md
@@ -1,18 +1,19 @@
# Databricks Labs Data Generator Release Notes
# Databricks Labs Synthetic Data Generator Release Notes

## Change History
All notable changes to the Databricks Labs Data Generator will be documented in this file.

### Unreleased

### Changed
#### Changed
* Modified data generator to allow specification of constraints to the data generation process
* Updated documentation for generating text data.
* Modified data distributions to use abstract base classes
* Migrated data distribution tests to use `pytest`

### Added
#### Added
* Added classes for constraints on the data generation via new package `dbldatagen.constraints`
* Added support for standard data sets via the new package `dbldatagen.datasets`


### Version 0.3.6 Post 1
@@ -24,7 +25,6 @@ All notable changes to the Databricks Labs Data Generator will be documented in
#### Fixed
* Fixed scenario where `DataAnalyzer` is used on dataframe containing a column named `summary`


### Version 0.3.6

#### Changed
1,739 changes: 1,176 additions & 563 deletions Pipfile.lock

Large diffs are not rendered by default.

12 changes: 12 additions & 0 deletions README.md
@@ -53,6 +53,7 @@ used in other computations
* plugin mechanism to allow use of 3rd party libraries such as Faker
* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source
* Generate synthetic data generation code from existing schema or data (experimental)
* Use of standard datasets for quick generation of synthetic data

Details of these features can be found in the
[online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
@@ -110,6 +111,17 @@ in your environment.

Once the library has been installed, you can use it to generate a data frame composed of synthetic data.

The easiest way to use the data generator is to start from one of the standard datasets, which can be further
customized for your use case.

```python
import dbldatagen as dg

df = dg.Datasets(spark, "basic/user").get(rows=1000_000).build()
num_rows = df.count()
```

You can also define fully custom data sets using the `DataGenerator` class.

For example, a fully custom specification can be built up column by column, as in the sketch below.
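The following is a minimal sketch based on the `DataGenerator` API used elsewhere in this pull request; the column names, templates, and row count are illustrative and are not taken from the original README example.

```python
import dbldatagen as dg

# Illustrative custom specification; assumes an existing SparkSession named `spark`
df_spec = (
    dg.DataGenerator(sparkSession=spark, rows=100000, partitions=4,
                     randomSeedMethod="hash_fieldname")
    .withColumn("customer_id", "long", minValue=1000000, maxValue=9999999, random=True)
    .withColumn("email", "string", template=r'\w.\w@\w.com', random=True)
)

# Materialize the specification as a Spark dataframe
df = df_spec.build()
```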
4 changes: 3 additions & 1 deletion dbldatagen/__init__.py
@@ -30,6 +30,7 @@
from .utils import ensure, topologicalSort, mkBoundsList, coalesce_values, \
deprecated, parse_time_interval, DataGenError, split_list_matching_condition, strip_margins, \
json_value_from_path, system_time_millis

from ._version import __version__
from .column_generation_spec import ColumnGenerationSpec
from .column_spec_options import ColumnSpecOptions
@@ -43,11 +44,12 @@
from .text_generators import TemplateGenerator, ILText, TextGenerator
from .text_generator_plugins import PyfuncText, PyfuncTextFactory, FakerTextFactory, fakerText
from .html_utils import HtmlUtils
from .datasets_object import Datasets

__all__ = ["data_generator", "data_analyzer", "schema_parser", "daterange", "nrange",
"column_generation_spec", "utils", "function_builder",
"spark_singleton", "text_generators", "datarange", "datagen_constants",
"text_generator_plugins", "html_utils"
"text_generator_plugins", "html_utils", "datasets_object"
]


8 changes: 8 additions & 0 deletions dbldatagen/datasets/__init__.py
@@ -0,0 +1,8 @@
from .dataset_provider import DatasetProvider, dataset_definition
from .basic_user import BasicUserProvider
from .multi_table_telephony_provider import MultiTableTelephonyProvider

__all__ = ["dataset_provider",
"basic_user",
"multi_table_telephony_provider"
]
64 changes: 64 additions & 0 deletions dbldatagen/datasets/basic_user.py
@@ -0,0 +1,64 @@
from .dataset_provider import DatasetProvider, dataset_definition


@dataset_definition(name="basic/user", summary="Basic User Data Set", autoRegister=True, supportsStreaming=True)
class BasicUserProvider(DatasetProvider.NoAssociatedDatasetsMixin, DatasetProvider):
"""
Basic User Data Set
===================

This is a basic user data set with customer id, name, email, ip address, and phone number.

It takes the following options when retrieving the table:
- random: if True, generates random data
- dummyValues: number of additional dummy value columns to generate (to widen row size if necessary)
- rows: number of rows to generate. Default is 100000
- partitions: number of partitions to use. If -1, it will be computed based on the number of rows

As the data specification is a DataGenerator object, you can add further columns to the data set and
add constraints (when the feature is available).

Note that this dataset does not use any features that would prevent it from being used as a source for a
streaming dataframe, and so the flag `supportsStreaming` is set to True.

"""
MAX_LONG = 9223372036854775807
COLUMN_COUNT = 5

@DatasetProvider.allowed_options(options=["random", "dummyValues"])
def getTableGenerator(self, sparkSession, *, tableName=None, rows=-1, partitions=-1,
**options):
import dbldatagen as dg

generateRandom = options.get("random", False)
dummyValues = options.get("dummyValues", 0)

if rows is None or rows < 0:
rows = DatasetProvider.DEFAULT_ROWS

if partitions is None or partitions < 0:
partitions = self.autoComputePartitions(rows, self.COLUMN_COUNT + dummyValues)

assert tableName is None or tableName == DatasetProvider.DEFAULT_TABLE_NAME, "Invalid table name"
df_spec = (
dg.DataGenerator(sparkSession=sparkSession, rows=rows,
partitions=partitions,
randomSeedMethod="hash_fieldname")
.withColumn("customer_id", "long", minValue=1000000, maxValue=self.MAX_LONG, random=generateRandom)
.withColumn("name", "string",
template=r'\w \w|\w \w \w', random=generateRandom)
.withColumn("email", "string",
template=r'\w.\w@\w.com|\w@\w.co.u\k', random=generateRandom)
.withColumn("ip_addr", "string",
template=r'\n.\n.\n.\n', random=generateRandom)
.withColumn("phone", "string",
template=r'(ddd)-ddd-dddd|1(ddd) ddd-dddd|ddd ddddddd',
random=generateRandom)
)

if dummyValues > 0:
df_spec = df_spec.withColumn("dummy", "long", random=True, numColumns=dummyValues,
minValue=1, maxValue=self.MAX_LONG)

return df_spec
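
As a usage sketch, the provider above would normally be reached through the `Datasets` entry point shown in the README; forwarding the `random` and `dummyValues` options through `get()` is an assumption based on the `**options` parameter of `getTableGenerator`, not a documented call.

```python
import dbldatagen as dg

# Assumed usage: the option names ("random", "dummyValues") come from the
# allowed_options decorator above; passing them through get() is an assumption.
df_spec = dg.Datasets(spark, "basic/user").get(rows=100000, random=True, dummyValues=2)

# Build the synthetic dataframe from the returned specification
df = df_spec.build()
```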