feat(synthesize): use string-length stats for unstructured text columns #3864

Merged
jqnatividad merged 3 commits into master from synthesize-string-len-stats
May 17, 2026

@jqnatividad
Collaborator

Summary

  • synthesize now reads min_length / max_length / avg_length / stddev_length from qsv stats and uses them to shape generated string lengths for unstructured text columns (lorem_*, free_text, no-faker fallback).
  • Target length per row is drawn from Normal(avg, stddev) clamped to [min, max], falling back to Uniform(min, max) when stddev_length is absent or zero. Generated text is then truncated (UTF-8-safe, never padded) to the target.
  • Structured fakers (email, name, uuid, phone, address parts, …) keep length_stats = None so their semantic format is preserved — truncating an email or UUID would corrupt it.
  • Frequency-enumerated and pooled real-value paths are unaffected.
  • Box-Muller is used inline to avoid adding a rand_distr dependency.
  • Behavior is automatic when length stats are present — no new CLI flag.
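
The sampling rule in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the `LengthStats` field names, the `SplitMix64` helper (used only so the sketch needs no external crate), and the exact rounding are all hypothetical. Only the overall shape (Box-Muller draw, clamp to `[min, max]`, uniform fallback when stddev is absent or zero) comes from the description.

```rust
/// Hypothetical stand-in for the PR's `LengthStats`; field names are assumed.
#[derive(Clone, Copy, Debug)]
struct LengthStats {
    min: usize,
    max: usize,
    avg: f64,
    stddev: Option<f64>,
}

/// Tiny SplitMix64 generator so the sketch has no dependencies (an assumption;
/// the PR does not say which RNG it uses).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in (0, 1]; never zero, so `ln` below is always finite.
    fn next_f64(&mut self) -> f64 {
        ((self.next_u64() >> 11) + 1) as f64 / (1u64 << 53) as f64
    }
}

/// Draw a target length: Normal(avg, stddev) clamped to [min, max],
/// or Uniform(min, max) when stddev is absent or not positive.
fn sample_target_length(stats: &LengthStats, rng: &mut SplitMix64) -> usize {
    let (lo, hi) = (stats.min, stats.max);
    if lo >= hi {
        // Degenerate window: nothing to sample.
        return lo;
    }
    match stats.stddev {
        Some(sd) if sd > 0.0 => {
            // Box-Muller transform: two uniforms -> one standard normal draw.
            let u1 = rng.next_f64();
            let u2 = rng.next_f64();
            let z = (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos();
            let raw = (stats.avg + sd * z).round();
            // Clamp rather than reject, so this always finishes in one step.
            (raw.max(0.0) as usize).clamp(lo, hi)
        }
        _ => {
            // Uniform fallback; the modulo bias is negligible for small spans.
            let span = (hi - lo + 1) as u64;
            lo + (rng.next_u64() % span) as usize
        }
    }
}

fn main() {
    // Source stats quoted in the PR's end-to-end check.
    let stats = LengthStats { min: 11, max: 30, avg: 20.75, stddev: Some(7.05) };
    let mut rng = SplitMix64(42);
    let samples: Vec<usize> = (0..200).map(|_| sample_target_length(&stats, &mut rng)).collect();
    let mean = samples.iter().sum::<usize>() as f64 / samples.len() as f64;
    println!(
        "min={} max={} mean={:.2}",
        samples.iter().min().unwrap(),
        samples.iter().max().unwrap(),
        mean
    );
}
```

Clamping piles a little extra probability mass onto the two bounds, which is consistent with the synthesized stddev in the end-to-end check coming out somewhat lower than the source stddev.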

End-to-end check on a source column with min=11, max=30, avg=20.75, stddev=7.05 produced synthesized output with min=11, max=30, avg=20.26, stddev=6.01 at n=200.

Test plan

  • cargo +nightly fmt
  • cargo build --locked --bin qsv -F all_features
  • cargo test -F all_features synthesize — 35 unit + 13 integration tests pass (new: 9 unit + 1 integration)
  • cargo clippy -F all_features --bin qsv -- -D warnings clean
  • qsv --generate-help-md regenerated docs/help/synthesize.md
  • Manual smoke test: stats → synthesize → stats round-trip reproduces source length distribution

🤖 Generated with Claude Code

When `qsv stats` provides `min_length`/`max_length`/`avg_length`/`stddev_length`
for a String column AND the column is routed to an unstructured text generator
(`lorem_*`, `free_text`, or the no-faker fallback), synthesized values are now
truncated so their character lengths follow `Normal(avg_length, stddev_length)`
clamped to `[min_length, max_length]`. Falls back to uniform when stddev is
absent or zero.

Structured semantic fakers (email, name, uuid, phone, address parts, …) keep
their natural lengths — truncating them would corrupt their format. Frequency-
enumerated values are reproduced verbatim and are never touched.

Box-Muller sampling is used so no new dependency is required. End-to-end
verified: source `min=11, max=30, avg=20.75, stddev=7.05` reproduces to
`min=11, max=30, avg=20.26, stddev=6.01` at n=200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codacy-production

codacy-production Bot commented May 17, 2026

Up to standards ✅

🟢 Issues: 0 new issues

🟢 Metrics: complexity 43


Contributor

Copilot AI left a comment

Pull request overview

This PR teaches qsv synthesize to honor per-column character-length statistics emitted by qsv stats (min_length / max_length / avg_length / stddev_length). For unstructured text columns (lorem_*, free_text, no-faker fallback), generated values are now truncated to a target length sampled from Normal(avg, stddev) clamped to [min, max] (uniform fallback when stddev is missing/zero). Structured fakers (email/uuid/phone/etc.) are intentionally excluded so their semantic format is preserved.

Changes:

  • Add LengthStats parsed from StatsRecord::addl_cols and an is_unstructured/UNSTRUCTURED_CONTENT_TYPES gate to decide where length stats apply.
  • Wire length_stats into ColumnGenerator::{Faker, LoremFallback} and add sample_target_length (inline Box-Muller) + truncate_to_chars UTF-8-safe truncation invoked in next().
  • New unit tests for parsing, sampling, and generator integration (including unstructured pooled vs. structured pooled), plus a new end-to-end integration test for free-text length stats.
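
As a rough illustration of the gate and the truncation step listed above, the sketch below shows one way a UTF-8-safe, never-padding `truncate_to_chars` could work. The function bodies are assumptions rather than the merged code, and the content-type check is a simplified guess based only on the `lorem_*` / `free_text` names mentioned in this PR.

```rust
/// Simplified, hypothetical gate: only free-form text types get length shaping.
/// The real `UNSTRUCTURED_CONTENT_TYPES` list is not shown in this PR page.
fn is_unstructured(content_type: &str) -> bool {
    content_type.starts_with("lorem_") || content_type == "free_text"
}

/// Truncate `s` to at most `max_chars` Unicode scalar values.
/// Cuts only on a char boundary and never pads shorter strings.
fn truncate_to_chars(s: &mut String, max_chars: usize) {
    // Byte offset of the first char past the limit, if there is one.
    if let Some((byte_idx, _)) = s.char_indices().nth(max_chars) {
        s.truncate(byte_idx);
    }
}

fn main() {
    let mut blurb = String::from("héllo wörld");
    if is_unstructured("free_text") {
        truncate_to_chars(&mut blurb, 4);
    }
    println!("{blurb}"); // prints "héll"

    // A structured value is left alone so its format stays intact.
    let mut email = String::from("someone@example.com");
    if is_unstructured("email") {
        truncate_to_chars(&mut email, 4);
    }
    println!("{email}"); // prints "someone@example.com"
}
```

Counting chars instead of bytes matters because slicing a `String` at an arbitrary byte offset can land inside a multi-byte character, and `String::truncate` panics in that case.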

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/cmd/synthesize/generator.rs | Core implementation: LengthStats, sampler, truncation, and integration into the Faker and LoremFallback variants; extensive unit tests. |
| src/cmd/synthesize/mod.rs | USAGE doc paragraph describing the new length-stats behavior. |
| docs/help/synthesize.md | Generated help mirroring the USAGE doc paragraph. |
| tests/test_synthesize.rs | Adds an end-to-end test that stats → synthesize produces blurbs within the source length window. |

Comment thread src/cmd/synthesize/mod.rs Outdated
Comment thread docs/help/synthesize.md Outdated
Comment thread src/cmd/synthesize/generator.rs
Three Copilot comments on #3864:

1. USAGE in mod.rs claimed "bounded-cardinality pool values are reproduced
   verbatim and are never truncated" — true only for STRUCTURED faker pools.
   Unstructured pooled values (lorem_*, free_text, unknown) are truncated to
   the length distribution per the new design. Rewrote to split structured
   vs. unstructured pool behavior explicitly.

2. docs/help/synthesize.md carried the same wording; regenerated via
   `qsv --generate-help-md` after the source fix.

3. sample_target_length: documented that clamping (not rejection sampling) is
   intentional — rejection could loop unboundedly when stats avg/stddev are
   inconsistent with min/max, and the small bias toward the bounds is
   acceptable for synthetic data. Also documented the lo>=hi guard's max(1).
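
To make point 3 concrete, here is a small deterministic sketch of why rejection sampling can stall when the stats are inconsistent. The numbers in the first case are made up for illustration and the helper names are hypothetical; the second case reuses the stats quoted in this PR.

```rust
/// Value produced by a standard-normal draw `z` before any bounds are applied.
fn raw_draw(avg: f64, sd: f64, z: f64) -> f64 {
    avg + sd * z
}

/// Rejection sampling would accept the draw only if it lands inside [lo, hi].
fn accepted(avg: f64, sd: f64, z: f64, lo: f64, hi: f64) -> bool {
    let x = raw_draw(avg, sd, z);
    x >= lo && x <= hi
}

/// Clamping maps every draw into [lo, hi] in a single step.
fn clamp_draw(avg: f64, sd: f64, z: f64, lo: f64, hi: f64) -> f64 {
    raw_draw(avg, sd, z).clamp(lo, hi)
}

fn main() {
    // Sweep z over [-6, 6] in steps of 0.5 instead of using a random source.
    let zs: Vec<f64> = (-12..=12).map(|i| i as f64 * 0.5).collect();

    // Inconsistent stats (illustrative): the mean sits far outside [min, max].
    let (avg, sd, lo, hi) = (100.0, 1.0, 1.0, 10.0);
    let n_ok = zs.iter().filter(|&&z| accepted(avg, sd, z, lo, hi)).count();
    let all_at_hi = zs.iter().all(|&z| clamp_draw(avg, sd, z, lo, hi) == hi);
    println!("inconsistent: accepted {n_ok} of {}, all clamped to max: {all_at_hi}", zs.len());

    // Consistent stats from the PR: a draw at the mean is accepted as-is.
    let (avg, sd, lo, hi) = (20.75, 7.05, 11.0, 30.0);
    println!("consistent: z=0 accepted: {}", accepted(avg, sd, 0.0, lo, hi));
}
```

In the inconsistent case no draw within six standard deviations is ever accepted, so a rejection loop would effectively spin forever, while clamping returns the nearest bound immediately.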

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit 1fa820d into master May 17, 2026
18 checks passed
@jqnatividad jqnatividad deleted the synthesize-string-len-stats branch May 17, 2026 02:35