Add next-gen synthetic data engine alongside existing API, following Pydantic's v1/v2 migration pattern. New code lives under dbldatagen.v1 with partition-independent determinism, FK integrity, CDC, and Pydantic specs.

Usage: `from dbldatagen.v1 import DataGenPlan, generate, pk_auto, fk, integer`

Source (40 files):
- dbldatagen/v1/ -- schema, DSL, engine, connectors (CSV, JDBC, SQL)

Tests (40 files + fixtures):
- tests/v1/ -- unit and integration tests with session-scoped Spark fixture

Integration tests:
- integration_tests/v1/ -- DABs bundle with 6 Databricks notebooks

pyproject.toml:
- Add pydantic>=2.0 to dependencies
- Add v1-faker, v1-jdbc, v1-csv, v1-sql, v1-dev optional dep groups
- Add v1 dev deps to hatch env
- Add test-v1 and test-all scripts; test now ignores tests/v1/
- Coverage omit for v1 (separate coverage tracking)

All imports rewritten from synth_data -> dbldatagen.v1. Fully compliant with ruff, pylint (10/10), mypy, and black. Existing tests unaffected (932 passed).
Complete the integration of synth_data as dbldatagen.v1 with the following changes:

API refinements:
- Rename parameter `fmt` to `format` across CDC generation APIs
- Rename `min_val`/`max_val` to `min`/`max` in DSL column constructors
- Remove typing_extensions.Self dependency in favor of explicit types
- Clean up unused imports and apply linting fixes throughout

Compatibility layer (dbldatagen/v1/compat.py):
- Add from_data_generator() to convert v0 DataGenerator specs to v1 DataGenPlan, enabling gradual migration
- Map v0 Spark types to v1 DataType enum
- Handle column strategies (range, values, weighted, expression, pattern)
- Emit warnings for unsupported v0 features (constraints, TextGenerator, Beta/Gamma distributions)

Engine improvements:
- Refactor CDC chunk generation for efficiency with Faker columns
- Improve generator.py with better batch handling and type safety
- Enhance ingest_generator.py with expanded seed and streaming support
- Add ceiling division idiom for batch calculations

Build and CI:
- Separate v0 and v1 coverage configs (.coveragerc, .coveragerc-v1)
- Add v1 test step to GitHub Actions workflow
- Add makefile targets: test-v1, test-v1-cov, test-all
- Move pydantic to v1-specific extras in pyproject.toml
- Add mypy overrides and ruff exclusions for v1 module

Tests:
- Update all 42 test files with new parameter names and imports
- Add test_compat.py (435 lines) for v0-to-v1 conversion coverage
- Fix conftest.py to avoid premature SparkSession shutdown

Documentation:
- Add V1_ANNOUNCEMENT.md with feature overview and examples
- Add docs/MIGRATION_V0_TO_V1.md with detailed migration guide
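The ceiling-division idiom mentioned under "Engine improvements" is a one-liner; a minimal sketch of how a batch count might be computed (the function name here is illustrative, not the actual generator.py internals):

```python
def num_batches(total_rows: int, batch_size: int) -> int:
    """Ceiling division: number of batches needed to cover total_rows.

    Equivalent to math.ceil(total_rows / batch_size), but stays in
    integer arithmetic, avoiding float rounding for very large counts.
    """
    return -(-total_rows // batch_size)

print(num_batches(10, 3))   # 4 batches: 3 + 3 + 3 + 1
print(num_batches(9, 3))    # 3 batches, exact fit
```

The double negation works because Python's `//` floors toward negative infinity, so negating before and after the division turns floor into ceiling.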
The integration test notebooks use Databricks-injected globals (spark, dbutils) and place imports after restartPython(), which ruff flags as F821 and E402. Exclude integration_tests/ from ruff linting via per-file-ignores and apply black formatting to pass the fmt check.
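A per-file-ignores entry of the kind described could look like this in pyproject.toml (a sketch; the exact table path depends on whether the project configures ruff under `lint` or at the top level):

```toml
[tool.ruff.lint.per-file-ignores]
# Databricks notebooks: `spark`/`dbutils` are injected at runtime (F821),
# and imports legitimately follow dbutils.library.restartPython() (E402).
"integration_tests/*" = ["F821", "E402"]
```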
Mypy was checking tests/v1/ and integration_tests/v1/, which use Databricks globals (spark, dbutils) and dynamic pydantic union types that produce false positives. Add exclusions matching the existing ignore_errors override for dbldatagen.v1.*.
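The override could mirror the existing dbldatagen.v1.* one; a sketch of the pyproject.toml shape (module patterns illustrative):

```toml
[[tool.mypy.overrides]]
# Same treatment as the existing dbldatagen.v1.* override: Databricks
# globals and dynamic pydantic unions trigger false positives here.
module = ["tests.v1.*", "integration_tests.v1.*"]
ignore_errors = true
```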
The makefile installed hatch unpinned, which could override the hatch==1.13.0 pinned in the CI workflow and cause virtualenv compatibility errors (propose_interpreters).
virtualenv 21.1.0 removed the propose_interpreters API that hatch 1.13.0 depends on. Pin virtualenv<21 in both the makefile and CI workflow to resolve the incompatibility.
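One way to express the pins described above in a makefile (a sketch; the real target name and install step may differ):

```make
dev-env:
	# Match the CI pin: hatch 1.13.0 still calls the propose_interpreters
	# API, which virtualenv removed in 21.x, so cap virtualenv below 21.
	pip install 'hatch==1.13.0' 'virtualenv<21'
```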
Summary
Introduces `dbldatagen.v1`, a next-generation data generation engine. This adds a pydantic-based schema DSL, CDC generation, streaming support, multi-table plans with foreign keys, and connectors (SQL, CSV, JDBC), all under a new `v1` subpackage that coexists with the existing v0 API.

Key additions:
- Compatibility layer (`dbldatagen.v1.compat`): `from_data_generator()` converts v0 DataGenerator specs to v1 DataGenPlan for gradual migration
- Migration guide: `docs/MIGRATION_V0_TO_V1.md` with side-by-side v0/v1 examples

Build & CI:
- New makefile targets: `test-v1`, `test-v1-cov`, `test-all`

Test plan
- `make test`: v0 tests pass (no regressions)
- `make test-v1`: v1 tests pass
- `make test-all`: both suites pass end-to-end

Resolves #..
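To illustrate the kind of Spark-type-to-enum mapping `from_data_generator()` performs, here is a toy, self-contained sketch. The enum members, mapping table, and function are hypothetical stand-ins, not the real compat API:

```python
import warnings
from enum import Enum


class DataType(Enum):
    # Hypothetical stand-in for the v1 DataType enum.
    INTEGER = "integer"
    LONG = "long"
    DOUBLE = "double"
    STRING = "string"


# Hypothetical mapping from v0 Spark type names to v1 enum members.
_SPARK_TO_V1 = {
    "IntegerType": DataType.INTEGER,
    "LongType": DataType.LONG,
    "DoubleType": DataType.DOUBLE,
    "StringType": DataType.STRING,
}


def map_spark_type(spark_type_name: str) -> DataType:
    """Map a v0 Spark type name to the v1 enum, warning on unsupported types."""
    try:
        return _SPARK_TO_V1[spark_type_name]
    except KeyError:
        # Mirrors the PR's approach of warning (not failing) on
        # unsupported v0 features, falling back to a safe default.
        warnings.warn(f"Unsupported v0 type {spark_type_name!r}; using STRING")
        return DataType.STRING


print(map_spark_type("LongType"))  # DataType.LONG
```

The warn-and-fall-back behavior matches the compat layer's stated design of emitting warnings for unsupported v0 features rather than raising.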
Requirements