
dbldatagen.v1#381

Open
anupkalburgi wants to merge 7 commits into master from ak/synth-next

Conversation

@anupkalburgi (Collaborator) commented Mar 5, 2026

Summary

Introduces dbldatagen.v1, a next-generation data generation engine. This adds a pydantic-based schema DSL, CDC generation, streaming support, multi-table plans with foreign keys, and
connectors (SQL, CSV, JDBC) — all under a new v1 subpackage that coexists with the existing v0 API.

Key additions:

  • Schema DSL — Pydantic models for defining tables, columns, distributions, and nested types
  • CDC generation — Stateful and stateless change-data-capture with SCD Type 2 support
  • Compatibility layer (dbldatagen.v1.compat) — from_data_generator() converts v0 DataGenerator specs to v1 DataGenPlan for gradual migration
  • Connectors — SQL inference, CSV, and JDBC schema import
  • Streaming & ingest — Micro-batch streaming and bulk ingest simulation
  • Migration guide — docs/MIGRATION_V0_TO_V1.md with side-by-side v0/v1 examples
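The CDC generation above includes SCD Type 2 support. As a minimal sketch of the SCD Type 2 pattern itself (not the v1 engine's actual implementation, whose row schema and API may differ), an update closes the current row version and appends a new current row rather than overwriting history:

```python
from dataclasses import dataclass, replace
from typing import List, Optional

# Illustrative SCD Type 2 row; the real v1 CDC schema may differ.
@dataclass
class Scd2Row:
    key: int
    value: str
    valid_from: int           # event time, e.g. epoch seconds
    valid_to: Optional[int]   # None while the row is current
    is_current: bool


def apply_scd2_update(history: List[Scd2Row], key: int, new_value: str, ts: int) -> List[Scd2Row]:
    """Close the current row for `key` and append a new current row."""
    out = []
    for row in history:
        if row.key == key and row.is_current:
            # Close out the previous version instead of overwriting it.
            out.append(replace(row, valid_to=ts, is_current=False))
        else:
            out.append(row)
    out.append(Scd2Row(key=key, value=new_value, valid_from=ts, valid_to=None, is_current=True))
    return out


history = [Scd2Row(key=1, value="a", valid_from=0, valid_to=None, is_current=True)]
history = apply_scd2_update(history, key=1, new_value="b", ts=100)
```

The point of the pattern is that every historical version of a key remains queryable via its `valid_from`/`valid_to` window.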

Build & CI:

  • Separate v0/v1 test runs and coverage configs
  • pydantic moved to v1-specific extras (v0 unaffected)
  • New makefile targets: test-v1, test-v1-cov, test-all

Test plan

  • make test — v0 tests pass (no regressions)
  • make test-v1 — v1 tests pass
  • make test-all — both suites pass end-to-end

Resolves #..

Requirements

  • manually tested - Yes
  • updated documentation - Will follow in a separate PR
  • updated demos - Planned
  • updated tests - Updated

Add next-gen synthetic data engine alongside existing API following
Pydantic's v1/v2 migration pattern. New code lives under dbldatagen.v1
with partition-independent determinism, FK integrity, CDC, and Pydantic specs.
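"Partition-independent determinism" means a generated cell depends only on the seed and the logical row id, never on how Spark happens to split rows across tasks. A sketch of one way to achieve that (the v1 engine's actual scheme is not shown in this PR description):

```python
import hashlib

def cell_value(seed: int, row_id: int, modulus: int = 2**32) -> int:
    """Derive a value purely from (seed, row_id), so the result is identical
    regardless of how rows are partitioned across Spark tasks or reruns."""
    digest = hashlib.sha256(f"{seed}:{row_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus
```

Because no per-partition or per-task state is involved, repartitioning the data or rerunning a failed task reproduces exactly the same values.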

Usage: from dbldatagen.v1 import DataGenPlan, generate, pk_auto, fk, integer
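The exact signatures of `DataGenPlan`, `pk_auto`, `fk`, and `integer` are not shown in this PR description. The stand-ins below are hypothetical and only illustrate the declarative, multi-table shape such a plan API might take, including FK dependencies between tables:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical stand-ins for the dbldatagen.v1 DSL; real signatures may differ.
@dataclass
class Column:
    name: str
    kind: str                    # "pk_auto", "fk", "integer", ...
    ref: Optional[str] = None    # "table.column" target for foreign keys
    min: Optional[int] = None
    max: Optional[int] = None

def pk_auto(name: str) -> Column:
    return Column(name, "pk_auto")

def fk(name: str, ref: str) -> Column:
    return Column(name, "fk", ref=ref)

def integer(name: str, min: int = 0, max: int = 100) -> Column:
    return Column(name, "integer", min=min, max=max)

@dataclass
class Table:
    name: str
    columns: List[Column]

@dataclass
class DataGenPlan:
    tables: List[Table] = field(default_factory=list)

    def dependencies(self, table: str) -> Set[str]:
        """Tables that `table` references via foreign keys."""
        t = next(t for t in self.tables if t.name == table)
        return {c.ref.split(".")[0] for c in t.columns if c.kind == "fk"}

plan = DataGenPlan(tables=[
    Table("customers", [pk_auto("id"), integer("age", min=18, max=90)]),
    Table("orders", [pk_auto("id"), fk("customer_id", ref="customers.id")]),
])
```

A plan that knows its FK graph can topologically order table generation so parent keys exist before children reference them.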

Source (40 files):
- dbldatagen/v1/ -- schema, DSL, engine, connectors (CSV, JDBC, SQL)

Tests (40 files + fixtures):
- tests/v1/ -- unit and integration tests with session-scoped Spark fixture

Integration tests:
- integration_tests/v1/ -- DABs bundle with 6 Databricks notebooks

pyproject.toml:
- Add pydantic>=2.0 to dependencies
- Add v1-faker, v1-jdbc, v1-csv, v1-sql, v1-dev optional dep groups
- Add v1 dev deps to hatch env
- Add test-v1 and test-all scripts; test now ignores tests/v1/
- Coverage omit for v1 (separate coverage tracking)

All imports rewritten from synth_data -> dbldatagen.v1.
Fully compliant with ruff, pylint (10/10), mypy, and black.
Existing tests unaffected (932 passed).

Complete the integration of synth_data as dbldatagen.v1 with the
following changes:

API refinements:
- Rename parameter `fmt` to `format` across CDC generation APIs
- Rename `min_val`/`max_val` to `min`/`max` in DSL column constructors
- Remove typing_extensions.Self dependency in favor of explicit types
- Clean up unused imports and apply linting fixes throughout

Compatibility layer (dbldatagen/v1/compat.py):
- Add from_data_generator() to convert v0 DataGenerator specs to v1
  DataGenPlan, enabling gradual migration
- Map v0 Spark types to v1 DataType enum
- Handle column strategies (range, values, weighted, expression, pattern)
- Emit warnings for unsupported v0 features (constraints, TextGenerator,
  Beta/Gamma distributions)
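The compat layer's two behaviors described above (type mapping, and warn-don't-fail for unsupported features) can be sketched as follows. The enum members and dict are illustrative assumptions; the real names in dbldatagen/v1/compat.py may differ:

```python
import warnings
from enum import Enum

# Hypothetical v1 DataType enum and Spark-type map for illustration only.
class DataType(Enum):
    INT = "int"
    LONG = "long"
    STRING = "string"
    DOUBLE = "double"

_SPARK_TO_V1 = {
    "IntegerType": DataType.INT,
    "LongType": DataType.LONG,
    "StringType": DataType.STRING,
    "DoubleType": DataType.DOUBLE,
}

# Distributions the converter cannot translate (per the PR: Beta/Gamma).
_UNSUPPORTED_DISTRIBUTIONS = {"beta", "gamma"}

def map_spark_type(spark_type: str) -> DataType:
    """Map a v0 Spark type name to the v1 DataType enum."""
    try:
        return _SPARK_TO_V1[spark_type]
    except KeyError:
        raise ValueError(f"No v1 mapping for Spark type {spark_type!r}")

def check_distribution(name: str) -> None:
    """Warn, rather than fail, so a v0 spec still converts with gaps flagged."""
    if name.lower() in _UNSUPPORTED_DISTRIBUTIONS:
        warnings.warn(f"v0 distribution {name!r} is not supported in v1; ignoring")
```

Warning instead of raising keeps `from_data_generator()` usable for gradual migration: the bulk of a v0 spec converts, and the gaps are surfaced explicitly.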

Engine improvements:
- Refactor CDC chunk generation for efficiency with Faker columns
- Improve generator.py with better batch handling and type safety
- Enhance ingest_generator.py with expanded seed and streaming support
- Add ceiling division idiom for batch calculations
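The ceiling-division idiom mentioned above computes a batch count in pure integer arithmetic, avoiding `math.ceil` and float rounding:

```python
def num_batches(total_rows: int, batch_size: int) -> int:
    """Ceiling division: how many batches of `batch_size` cover `total_rows`."""
    # Equivalent to math.ceil(total_rows / batch_size), but stays in integers.
    return (total_rows + batch_size - 1) // batch_size

# e.g. 1000 rows in batches of 300 -> 4 batches (300, 300, 300, 100)
```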

Build and CI:
- Separate v0 and v1 coverage configs (.coveragerc, .coveragerc-v1)
- Add v1 test step to GitHub Actions workflow
- Add makefile targets: test-v1, test-v1-cov, test-all
- Move pydantic to v1-specific extras in pyproject.toml
- Add mypy overrides and ruff exclusions for v1 module

Tests:
- Update all 42 test files with new parameter names and imports
- Add test_compat.py (435 lines) for v0-to-v1 conversion coverage
- Fix conftest.py to avoid premature SparkSession shutdown

Documentation:
- Add V1_ANNOUNCEMENT.md with feature overview and examples
- Add docs/MIGRATION_V0_TO_V1.md with detailed migration guide
@anupkalburgi anupkalburgi requested review from a team as code owners March 5, 2026 17:26
@anupkalburgi anupkalburgi requested review from nfx and suryasaitura-db and removed request for a team March 5, 2026 17:26
@anupkalburgi anupkalburgi mentioned this pull request Mar 5, 2026

The integration test notebooks use Databricks-injected globals (spark,
dbutils) and imports after restartPython(), which ruff flags as F821 and
E402. Exclude integration_tests/ from ruff linting via per-file-ignores
and apply black formatting to pass the fmt check.
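The exact configuration added in the PR is not shown; a per-file-ignores entry of the kind described would look roughly like this in pyproject.toml (the glob and rule list here are assumptions):

```toml
# Sketch only -- the actual pyproject.toml entries in this PR may differ.
[tool.ruff.lint.per-file-ignores]
# Databricks notebooks rely on injected globals (spark, dbutils) -> F821,
# and place imports after dbutils.library.restartPython() -> E402.
"integration_tests/**" = ["F821", "E402"]
```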
Mypy was checking tests/v1/ and integration_tests/v1/ which use
Databricks globals (spark, dbutils) and dynamic pydantic union types
that produce false positives. Add exclusions matching the existing
ignore_errors override for dbldatagen.v1.*.
@alexott alexott requested review from Copilot and ghanse March 5, 2026 19:09

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

The makefile installed hatch unpinned, which could override the
hatch==1.13.0 pin in the CI workflow and cause virtualenv
compatibility errors (propose_interpreters).

virtualenv 21.1.0 removed the propose_interpreters API that
hatch 1.13.0 depends on. Pin virtualenv<21 in both the makefile
and CI workflow to resolve the incompatibility.
