
dbldatagen.v1#381

Open
anupkalburgi wants to merge 7 commits into master from ak/synth-next

Conversation

@anupkalburgi (Collaborator) commented Mar 5, 2026

Summary

Introduces dbldatagen.v1, a next-generation data generation engine. This adds a pydantic-based schema DSL, CDC generation, streaming support, multi-table plans with foreign keys, and
connectors (SQL, CSV, JDBC) — all under a new v1 subpackage that coexists with the existing v0 API.

Key additions:

  • Schema DSL — Pydantic models for defining tables, columns, distributions, and nested types
  • CDC generation — Stateful and stateless change-data-capture with SCD Type 2 support
  • Compatibility layer (dbldatagen.v1.compat) — from_data_generator() converts v0 DataGenerator specs to v1 DataGenPlan for gradual migration
  • Connectors — SQL inference, CSV, and JDBC schema import
  • Streaming & ingest — Micro-batch streaming and bulk ingest simulation
  • Migration guide — docs/MIGRATION_V0_TO_V1.md with side-by-side v0/v1 examples
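The CDC generation above includes SCD Type 2 support. As a minimal sketch of the SCD Type 2 pattern itself (not the v1 engine's actual implementation, whose row schema and API may differ), an update closes the current row version and appends a new current row rather than overwriting history:

```python
from dataclasses import dataclass, replace
from typing import List, Optional

# Illustrative SCD Type 2 row; the real v1 CDC schema may differ.
@dataclass
class Scd2Row:
    key: int
    value: str
    valid_from: int           # event time, e.g. epoch seconds
    valid_to: Optional[int]   # None while the row is current
    is_current: bool


def apply_scd2_update(history: List[Scd2Row], key: int, new_value: str, ts: int) -> List[Scd2Row]:
    """Close the current row for `key` and append a new current row."""
    out = []
    for row in history:
        if row.key == key and row.is_current:
            # Close out the previous version instead of overwriting it.
            out.append(replace(row, valid_to=ts, is_current=False))
        else:
            out.append(row)
    out.append(Scd2Row(key=key, value=new_value, valid_from=ts, valid_to=None, is_current=True))
    return out


history = [Scd2Row(key=1, value="a", valid_from=0, valid_to=None, is_current=True)]
history = apply_scd2_update(history, key=1, new_value="b", ts=100)
```

The point of the pattern is that every historical version of a key remains queryable via its `valid_from`/`valid_to` window.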

Build & CI:

  • Separate v0/v1 test runs and coverage configs
  • pydantic moved to v1-specific extras (v0 unaffected)
  • New makefile targets: test-v1, test-v1-cov, test-all

Test plan

  • make test — v0 tests pass (no regressions)
  • make test-v1 — v1 tests pass
  • make test-all — both suites pass end-to-end

Resolves #..

Requirements

  • manually tested - Yes
  • updated documentation - Will follow in a separate PR
  • updated demos - Planned
  • updated tests - Updated

Add next-gen synthetic data engine alongside existing API following
Pydantic's v1/v2 migration pattern. New code lives under dbldatagen.v1
with partition-independent determinism, FK integrity, CDC, and Pydantic specs.
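"Partition-independent determinism" means a generated cell depends only on the seed and the logical row id, never on how Spark happens to split rows across tasks. A sketch of one way to achieve that (the v1 engine's actual scheme is not shown in this PR description):

```python
import hashlib

def cell_value(seed: int, row_id: int, modulus: int = 2**32) -> int:
    """Derive a value purely from (seed, row_id), so the result is identical
    regardless of how rows are partitioned across Spark tasks or reruns."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus
```

Because no per-partition or per-task state is involved, repartitioning the data or rerunning a failed task reproduces exactly the same values.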

Usage: from dbldatagen.v1 import DataGenPlan, generate, pk_auto, fk, integer
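The exact signatures of `DataGenPlan`, `pk_auto`, `fk`, and `integer` are not shown in this PR description. The stand-ins below are hypothetical and only illustrate the declarative, multi-table shape such a plan API might take, including FK dependencies between tables:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical stand-ins for the dbldatagen.v1 DSL; real signatures may differ.
@dataclass
class Column:
    name: str
    kind: str                    # "pk_auto", "fk", "integer", ...
    ref: Optional[str] = None    # "table.column" target for foreign keys
    min: Optional[int] = None
    max: Optional[int] = None

def pk_auto(name: str) -> Column:
    return Column(name, "pk_auto")

def fk(name: str, ref: str) -> Column:
    return Column(name, "fk", ref=ref)

def integer(name: str, min: int = 0, max: int = 100) -> Column:
    return Column(name, "integer", min=min, max=max)

@dataclass
class Table:
    name: str
    columns: List[Column]

@dataclass
class DataGenPlan:
    tables: List[Table] = field(default_factory=list)

    def dependencies(self, table: str) -> Set[str]:
        """Tables that `table` references via foreign keys."""
        t = next(t for t in self.tables if t.name == table)
        return {c.ref.split(".")[0] for c in t.columns if c.kind == "fk"}

plan = DataGenPlan(tables=[
    Table("customers", [pk_auto("id"), integer("age", min=18, max=90)]),
    Table("orders", [pk_auto("id"), fk("customer_id", ref="customers.id")]),
])
```

A plan that knows its FK graph can topologically order table generation so parent keys exist before children reference them.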

Source (40 files):
- dbldatagen/v1/ -- schema, DSL, engine, connectors (CSV, JDBC, SQL)

Tests (40 files + fixtures):
- tests/v1/ -- unit and integration tests with session-scoped Spark fixture

Integration tests:
- integration_tests/v1/ -- DABs bundle with 6 Databricks notebooks

pyproject.toml:
- Add pydantic>=2.0 to dependencies
- Add v1-faker, v1-jdbc, v1-csv, v1-sql, v1-dev optional dep groups
- Add v1 dev deps to hatch env
- Add test-v1 and test-all scripts; test now ignores tests/v1/
- Coverage omit for v1 (separate coverage tracking)

All imports rewritten from synth_data -> dbldatagen.v1.
Fully compliant with ruff, pylint (10/10), mypy, and black.
Existing tests unaffected (932 passed).

Complete the integration of synth_data as dbldatagen.v1 with the
following changes:

API refinements:
- Rename parameter `fmt` to `format` across CDC generation APIs
- Rename `min_val`/`max_val` to `min`/`max` in DSL column constructors
- Remove typing_extensions.Self dependency in favor of explicit types
- Clean up unused imports and apply linting fixes throughout

Compatibility layer (dbldatagen/v1/compat.py):
- Add from_data_generator() to convert v0 DataGenerator specs to v1
  DataGenPlan, enabling gradual migration
- Map v0 Spark types to v1 DataType enum
- Handle column strategies (range, values, weighted, expression, pattern)
- Emit warnings for unsupported v0 features (constraints, TextGenerator,
  Beta/Gamma distributions)
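The compat layer's two behaviors described above (type mapping, and warn-don't-fail for unsupported features) can be sketched as follows. The enum members and dict are illustrative assumptions; the real names in dbldatagen/v1/compat.py may differ:

```python
import warnings
from enum import Enum

# Hypothetical v1 DataType enum and Spark-type map for illustration only.
class DataType(Enum):
    INT = "int"
    LONG = "long"
    STRING = "string"
    DOUBLE = "double"

_SPARK_TO_V1 = {
    "IntegerType": DataType.INT,
    "LongType": DataType.LONG,
    "StringType": DataType.STRING,
    "DoubleType": DataType.DOUBLE,
}

# Distributions the converter cannot translate (per the PR: Beta/Gamma).
_UNSUPPORTED_DISTRIBUTIONS = {"beta", "gamma"}

def map_spark_type(spark_type: str) -> DataType:
    """Map a v0 Spark type name to the v1 DataType enum."""
    try:
        return _SPARK_TO_V1[spark_type]
    except KeyError:
        raise ValueError(f"No v1 mapping for Spark type {spark_type!r}")

def check_distribution(name: str) -> None:
    """Warn, rather than fail, so a v0 spec still converts with gaps flagged."""
    if name.lower() in _UNSUPPORTED_DISTRIBUTIONS:
        warnings.warn(f"v0 distribution {name!r} is not supported in v1; ignoring")
```

Warning instead of raising keeps `from_data_generator()` usable for gradual migration: the bulk of a v0 spec converts, and the gaps are surfaced explicitly.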

Engine improvements:
- Refactor CDC chunk generation for efficiency with Faker columns
- Improve generator.py with better batch handling and type safety
- Enhance ingest_generator.py with expanded seed and streaming support
- Add ceiling division idiom for batch calculations
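The ceiling-division idiom mentioned above computes a batch count in pure integer arithmetic, avoiding `math.ceil` and float rounding:

```python
def num_batches(total_rows: int, batch_size: int) -> int:
    """Ceiling division: how many batches of `batch_size` cover `total_rows`."""
    # Equivalent to math.ceil(total_rows / batch_size), but stays in integers.
    return (total_rows + batch_size - 1) // batch_size

# e.g. 1000 rows in batches of 300 -> 4 batches (300, 300, 300, 100)
```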

Build and CI:
- Separate v0 and v1 coverage configs (.coveragerc, .coveragerc-v1)
- Add v1 test step to GitHub Actions workflow
- Add makefile targets: test-v1, test-v1-cov, test-all
- Move pydantic to v1-specific extras in pyproject.toml
- Add mypy overrides and ruff exclusions for v1 module

Tests:
- Update all 42 test files with new parameter names and imports
- Add test_compat.py (435 lines) for v0-to-v1 conversion coverage
- Fix conftest.py to avoid premature SparkSession shutdown

Documentation:
- Add V1_ANNOUNCEMENT.md with feature overview and examples
- Add docs/MIGRATION_V0_TO_V1.md with detailed migration guide
@anupkalburgi anupkalburgi requested review from a team as code owners March 5, 2026 17:26
@anupkalburgi anupkalburgi requested review from nfx and suryasaitura-db and removed request for a team March 5, 2026 17:26
@anupkalburgi anupkalburgi mentioned this pull request Mar 5, 2026

The integration test notebooks use Databricks-injected globals (spark,
dbutils) and imports after restartPython(), which ruff flags as F821 and
E402. Exclude integration_tests/ from ruff linting via per-file-ignores
and apply black formatting to pass the fmt check.
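The exact configuration added in the PR is not shown; a per-file-ignores entry of the kind described would look roughly like this in pyproject.toml (the glob and rule list here are assumptions):

```toml
# Sketch only -- the actual pyproject.toml entries in this PR may differ.
[tool.ruff.lint.per-file-ignores]
# Databricks notebooks rely on injected globals (spark, dbutils) -> F821,
# and place imports after dbutils.library.restartPython() -> E402.
"integration_tests/**" = ["F821", "E402"]
```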
Mypy was checking tests/v1/ and integration_tests/v1/ which use
Databricks globals (spark, dbutils) and dynamic pydantic union types
that produce false positives. Add exclusions matching the existing
ignore_errors override for dbldatagen.v1.*.
@alexott alexott requested review from Copilot and ghanse March 5, 2026 19:09

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

The makefile installed hatch unpinned, which could override the
hatch==1.13.0 pin in the CI workflow and cause virtualenv
compatibility errors (propose_interpreters).

virtualenv 21.1.0 removed the propose_interpreters API that
hatch 1.13.0 depends on. Pin virtualenv<21 in both the makefile
and CI workflow to resolve the incompatibility.
