feat(synthesize): use string-length stats for unstructured text columns #3864
When `qsv stats` provides `min_length`/`max_length`/`avg_length`/`stddev_length` for a String column AND the column is routed to an unstructured text generator (`lorem_*`, `free_text`, or the no-faker fallback), synthesized values are now truncated so their character lengths follow `Normal(avg_length, stddev_length)` clamped to `[min_length, max_length]`. Sampling falls back to a uniform distribution over that range when stddev is absent or zero.

Structured semantic fakers (email, name, uuid, phone, address parts, …) keep their natural lengths, because truncating them would corrupt their format. Frequency-enumerated values are reproduced verbatim and are never touched. Box-Muller sampling is used so no new dependency is required.

End-to-end verified: source `min=11, max=30, avg=20.75, stddev=7.05` reproduces to `min=11, max=30, avg=20.26, stddev=6.01` at n=200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
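The sampling rule described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `LengthStats` field names, the `XorShift` generator (a std-only stand-in for whatever RNG qsv actually uses), and the exact rounding are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the per-column length statistics from `qsv stats`.
#[derive(Clone, Copy, Debug)]
struct LengthStats {
    min: usize,
    max: usize,
    avg: f64,
    stddev: Option<f64>,
}

/// Tiny xorshift64* generator so the sketch needs no external crate.
/// Illustrative only; the seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    /// Uniform f64 strictly inside (0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let bits = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D) >> 12; // top 52 bits
        (bits as f64 + 0.5) / (1u64 << 52) as f64
    }
}

/// Draw a target character length: Normal(avg, stddev) clamped to [min, max],
/// or uniform over min..=max when stddev is missing or zero.
fn sample_target_length(stats: &LengthStats, rng: &mut XorShift) -> usize {
    let (lo, hi) = (stats.min, stats.max);
    if lo >= hi {
        return lo; // degenerate window: nothing to sample (simplified guard)
    }
    match stats.stddev {
        Some(sd) if sd > 0.0 => {
            // Box-Muller: two uniforms in (0, 1) give one standard normal draw.
            let u1 = rng.next_f64();
            let u2 = rng.next_f64();
            let z = (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos();
            let raw = (stats.avg + sd * z).round();
            raw.max(lo as f64).min(hi as f64) as usize
        }
        _ => {
            // Uniform fallback over the inclusive range.
            let span = (hi - lo + 1) as f64;
            (lo + (rng.next_f64() * span) as usize).min(hi)
        }
    }
}
```

Calling `sample_target_length` with the stats quoted above (`min: 11, max: 30, avg: 20.75, stddev: Some(7.05)`) always yields a length in `11..=30`; setting `stddev: None` switches to the uniform branch.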
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 43 |
Pull request overview
This PR teaches qsv synthesize to honor per-column character-length statistics emitted by qsv stats (min_length / max_length / avg_length / stddev_length). For unstructured text columns (lorem_*, free_text, no-faker fallback), generated values are now truncated to a target length sampled from Normal(avg, stddev) clamped to [min, max] (uniform fallback when stddev is missing/zero). Structured fakers (email/uuid/phone/etc.) are intentionally excluded so their semantic format is preserved.
Changes:
- Add `LengthStats` parsed from `StatsRecord::addl_cols` and an `is_unstructured`/`UNSTRUCTURED_CONTENT_TYPES` gate to decide where length stats apply.
- Wire `length_stats` into `ColumnGenerator::{Faker, LoremFallback}` and add `sample_target_length` (inline Box-Muller) + `truncate_to_chars` UTF-8-safe truncation invoked in `next()`.
- New unit tests for parsing, sampling, and generator integration (including unstructured pooled vs. structured pooled), plus a new end-to-end integration test for free-text length stats.
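For readers unfamiliar with why truncation has to be "UTF-8-safe", here is a minimal sketch of character-based truncation. The function name matches the one mentioned in the list above, but the signature and body are assumptions for illustration, not the PR's actual implementation.

```rust
/// Keep at most `max_chars` Unicode scalar values of `s`.
/// Slicing happens on a char boundary, so a multi-byte code point is never split.
/// Shorter inputs are returned unchanged (truncate only, never pad).
fn truncate_to_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        // `byte_idx` is the byte offset of the first char to drop.
        Some((byte_idx, _)) => &s[..byte_idx],
        // Fewer than or exactly `max_chars` chars: nothing to cut.
        None => s,
    }
}
```

Slicing by byte count instead (for example `&s[..4]` on `"héllo"`) can panic when the index lands inside a multi-byte character, which is why the cut point is derived from `char_indices`.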
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/cmd/synthesize/generator.rs | Core implementation: LengthStats, sampler, truncation, and integration into the Faker and LoremFallback variants; extensive unit tests. |
| src/cmd/synthesize/mod.rs | USAGE doc paragraph describing the new length-stats behavior. |
| docs/help/synthesize.md | Generated help mirroring the USAGE doc paragraph. |
| tests/test_synthesize.rs | Adds an end-to-end test that stats → synthesize produces blurbs within the source length window. |
Three Copilot comments on #3864:

1. USAGE in mod.rs claimed "bounded-cardinality pool values are reproduced verbatim and are never truncated", which is true only for STRUCTURED faker pools. Unstructured pooled values (lorem_*, free_text, unknown) are truncated to the length distribution per the new design. Rewrote to split structured vs. unstructured pool behavior explicitly.
2. docs/help/synthesize.md carried the same wording; regenerated via `qsv --generate-help-md` after the source fix.
3. sample_target_length: documented that clamping (not rejection sampling) is intentional. Rejection could loop unboundedly when stats avg/stddev are inconsistent with min/max, and the small bias toward the bounds is acceptable for synthetic data. Also documented the lo>=hi guard's max(1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
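The clamping-versus-rejection trade-off from the third comment can be illustrated with a small sketch. Both helper names are hypothetical, the raw draws are passed in explicitly to keep the example deterministic, and the `lo >= hi` guard is simplified (it does not reproduce the `max(1)` detail mentioned above).

```rust
/// Clamp one raw normal draw into [lo, hi]. Constant time, always terminates,
/// at the cost of piling a little extra probability mass onto the bounds.
fn clamp_length(raw: f64, lo: usize, hi: usize) -> usize {
    if lo >= hi {
        return lo; // simplified guard for a degenerate window
    }
    raw.round().max(lo as f64).min(hi as f64) as usize
}

/// Bounded rejection sampling: accept the first draw that lands in [lo, hi],
/// giving up after `max_tries`. Without the cap this loop might never end
/// when avg/stddev place nearly all mass outside the window.
fn reject_length(
    draws: impl IntoIterator<Item = f64>,
    lo: usize,
    hi: usize,
    max_tries: usize,
) -> Option<usize> {
    let (lo_f, hi_f) = (lo as f64, hi as f64);
    draws
        .into_iter()
        .take(max_tries)
        .map(f64::round)
        .find(|d| *d >= lo_f && *d <= hi_f)
        .map(|d| d as usize)
}
```

With inconsistent stats such as `avg ≈ 100` but a window of `[1, 5]`, every draw near 100 is rejected, so `reject_length` runs out of tries, while `clamp_length` returns the upper bound immediately.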
Summary
- `synthesize` now reads `min_length`/`max_length`/`avg_length`/`stddev_length` from `qsv stats` and uses them to shape generated string lengths for unstructured text columns (`lorem_*`, `free_text`, no-faker fallback).
- Target lengths are sampled from `Normal(avg, stddev)` clamped to `[min, max]`, falling back to `Uniform(min, max)` when `stddev_length` is absent or zero. Generated text is then truncated (UTF-8-safe, never padded) to the target.
- Structured fakers (email, uuid, phone, etc.) keep `length_stats = None` so their semantic format is preserved; truncating an email or UUID would corrupt it.
- Box-Muller sampling is done inline, so there is no new `rand_distr` dependency.

End-to-end check on a source column with `min=11, max=30, avg=20.75, stddev=7.05` produced synthesized output with `min=11, max=30, avg=20.26, stddev=6.01` at n=200.

Test plan

- `cargo +nightly fmt`
- `cargo build --locked --bin qsv -F all_features`
- `cargo test -F all_features synthesize`: 35 unit + 13 integration tests pass (new: 9 unit + 1 integration)
- `cargo clippy -F all_features --bin qsv -- -D warnings` clean
- `qsv --generate-help-md` regenerated `docs/help/synthesize.md`

🤖 Generated with Claude Code