feat(describegpt): date/datetime content-type tokens with LLM-inferred chrono format by jqnatividad · Pull Request #3884 · dathere/qsv

jqnatividad · 2026-05-21T22:44:53Z

Summary

describegpt --infer-content-type previously classified every temporal column as content_type = "unknown" — CONTENT_TYPE_VOCAB had no date/datetime token, so the LLM (told to only use allowed tokens) correctly fell back to unknown. A data dictionary labeling a DateTime column "unknown" reads as a tooling failure.

This PR adds date and datetime content-type tokens that carry an LLM-inferred chrono strftime format as a suffix (e.g. datetime:%m/%d/%Y %I:%M:%S %p), reusing the existing duration:N suffix machinery, and wires the format into qsv synthesize.

How it works

Deterministic token, LLM-inferred format. The bare date/datetime token is stamped deterministically from the stats Type column (mirroring the unique_id stamp). The LLM only supplies the :<fmt> suffix, inferred from the raw frequency-distribution values.
Two-stage validation. normalize_datetime_token syntactically validates the strftime suffix (chrono StrftimeItems); validate_date_formats semantically validates it against a real frequency sample and strips a format that doesn't parse the data.
All-midnight reclassification. qsv stats types a column DateTime whenever any value carries a non-midnight time, which over-reports columns that are plainly dates stored with a zero time-of-day (e.g. NYC 311 Created Date, where every sampled value is MM/DD/YYYY 12:00:00 AM). downgrade_all_midnight_datetime_columns reclassifies such a column to date and strips the time specifiers from <fmt> (datetime:%m/%d/%Y %I:%M:%S %p → date:%m/%d/%Y). A column with a real time anywhere in its sample (e.g. Due Date) stays datetime.
synthesize consumption. parse_date_format extracts the format; build_date / DateQuantile / next() emit generated dates in the inferred on-disk format instead of hardcoded ISO. Bare tokens fall back to %Y-%m-%d / RFC 3339.
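
As a rough illustration of the token shape described above, here is a minimal std-only sketch of splitting a content-type token into its kind and optional format suffix. `TemporalKind` and `split_temporal_token` are hypothetical names for this example; this is not the PR's `parse_date_format`, and it skips the chrono-based validation the real code performs.

```rust
/// Kind of temporal content-type token (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
enum TemporalKind {
    Date,
    DateTime,
}

/// Hypothetical helper: splits a token such as
/// `datetime:%m/%d/%Y %I:%M:%S %p` into its kind and optional format suffix.
/// Returns None for tokens that are not `date` / `datetime`.
fn split_temporal_token(token: &str) -> Option<(TemporalKind, Option<&str>)> {
    // Split on the FIRST colon only: the format itself may contain colons.
    let (head, fmt) = match token.split_once(':') {
        Some((head, fmt)) => (head, Some(fmt)),
        None => (token, None),
    };
    let kind = match head {
        "date" => TemporalKind::Date,
        "datetime" => TemporalKind::DateTime,
        _ => return None,
    };
    // Assumption of this sketch: an empty suffix (`date:`) is treated
    // like a bare token, so the caller falls back to its default format.
    Some((kind, fmt.filter(|f| !f.is_empty())))
}
```

A caller would use the returned format when present and otherwise fall back to the defaults named above (`%Y-%m-%d` for a bare `date`, RFC 3339 for a bare `datetime`).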

Example

For NYC 311 data:

Column	stats Type	content_type
Created Date	DateTime	`date:%m/%d/%Y`
Closed Date	DateTime	`date:%m/%d/%Y`
Due Date	DateTime	`datetime:%m/%d/%Y %I:%M:%S %p`

Testing

cargo build --bin qsv -F all_features — clean; cargo clippy — no new warnings.
New unit tests in dictionary.rs (normalize/stamp/merge/validate/downgrade/strip_time_from_format) and faker_map.rs (parse_date_format, is_faker_token).
New integration test synthesize_with_dictionary_applies_inferred_date_format.
Regression: describegpt 70 + 65, synthesize 20 + 37, dictionary unit 43 — all pass.
The first-pass and refine prompts (plus the byte-identical DEFAULT_DICTIONARY_REFINE_PROMPT const) are updated; the default_dictionary_refine_prompt_matches_resource sync test passes.

🤖 Generated with Claude Code

…d chrono format

describegpt --infer-content-type previously classified date/datetime columns as "unknown" because CONTENT_TYPE_VOCAB had no date token. Add `date` and `datetime` content-type tokens that carry an LLM-inferred chrono strftime format as a suffix (e.g. datetime:%m/%d/%Y %I:%M:%S %p), reusing the existing duration:N suffix machinery:

- The bare date/datetime token is stamped deterministically from the stats Type column (mirroring the unique_id stamp); the LLM supplies only the <fmt> suffix.
- normalize_datetime_token syntactically validates the strftime suffix (chrono StrftimeItems); validate_date_formats semantically validates it against real frequency-distribution samples, stripping a format that does not parse the data.
- merge_content_type lets the LLM contribute only the format suffix to a code-stamped date column - it cannot reclassify a column or add a date token to a non-date column.
- synthesize consumes the inferred format: parse_date_format extracts it and build_date / DateQuantile emit generated dates in the original on-disk format instead of the hardcoded ISO output.
- First-pass and refine prompts (and the mirrored const) updated to instruct the LLM on the date:<fmt> / datetime:<fmt> forms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qsv stats types a column DateTime whenever any single value carries a non-midnight time, which over-reports columns whose values are plainly dates stored with a zero time-of-day (e.g. NYC 311 "Created Date", where every frequency-sampled value is "MM/DD/YYYY 12:00:00 AM"). Add downgrade_all_midnight_datetime_columns, a deterministic post-merge step run after validate_date_formats: when every parseable frequency sample of a datetime:<fmt> column falls exactly on midnight, the column is reclassified as date and strip_time_from_format drops the time specifiers from <fmt> (datetime:%m/%d/%Y %I:%M:%S %p becomes date:%m/%d/%Y). Unparseable samples (e.g. NULL sentinels) are ignored rather than blocking the downgrade; a column with a real time anywhere in its frequency sample (e.g. "Due Date") stays datetime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
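
To make the format-stripping step concrete, here is a minimal std-only sketch of truncating a strftime format at its first time specifier and trimming the separator left behind. `strip_time_sketch` is a hypothetical name, not the PR's `strip_time_from_format`; the set of time specifiers and the set of trimmed separators are assumptions of this example, and combined date-time specifiers such as `%c` or `%+` are not handled.

```rust
/// Hypothetical sketch: cut a strftime format at its first time specifier,
/// then trim trailing separators so no dangling `-`, `T` or space remains.
fn strip_time_sketch(fmt: &str) -> String {
    // Specifier letters treated as "time-related" in this sketch (assumption).
    const TIME_SPECS: &str = "HIMSpPTRXrfzZ";
    let mut cut = fmt.len();
    let mut chars = fmt.char_indices();
    while let Some((i, c)) = chars.next() {
        if c != '%' {
            continue;
        }
        // Skip padding modifiers such as the `-` in `%-I`.
        let mut spec = chars.next();
        while let Some((_, m)) = spec {
            if matches!(m, '-' | '_' | '0') {
                spec = chars.next();
            } else {
                break;
            }
        }
        // `%%` lands here with spec == '%', which is not a time specifier.
        if let Some((_, s)) = spec {
            if TIME_SPECS.contains(s) {
                cut = i; // byte offset of the `%` that starts the time part
                break;
            }
        }
    }
    fmt[..cut]
        .trim_end_matches(|c: char| c.is_whitespace() || "T,-/.:_".contains(c))
        .to_string()
}
```

Under these assumptions, `%m/%d/%Y %I:%M:%S %p` reduces to `%m/%d/%Y`, and a date-only format passes through unchanged.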

codacy-production · 2026-05-21T22:45:53Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

🟢 Metrics 23 complexity

Metric Results

Complexity 23

Address roborev review (job 2345): validate_date_formats checked the LLM-inferred date/datetime strftime suffix against only the first usable frequency value, so an ambiguous format could survive incorrectly - `%m/%d/%Y` parses a first sample `01/02/2020` even when a later sample `13/02/2020` proves the column is actually `%d/%m/%Y`, making synthesize emit dates in the wrong layout. Now every usable frequency sample must parse with the inferred format; if any fails, the suffix is stripped back to the bare token. Extract a shared `usable_samples_by_field` helper (also used by `downgrade_all_midnight_datetime_columns`) and add a regression test where the first sample is ambiguous but a later one disambiguates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
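
The ambiguity described in this commit can be reproduced with a toy std-only sketch. `parses_as` and `format_survives` are hypothetical names, and the parser only understands the two slash-separated layouts from the example; the real code validates with chrono against the inferred format.

```rust
/// Toy parser for two layouts only: month-first (`%m/%d/%Y`)
/// or day-first (`%d/%m/%Y`). Hypothetical stand-in for a chrono parse.
fn parses_as(sample: &str, month_first: bool) -> bool {
    let parts: Vec<&str> = sample.split('/').collect();
    if parts.len() != 3 {
        return false;
    }
    // Any non-numeric component makes the whole sample unparsable.
    let nums: Option<Vec<u32>> = parts.iter().map(|p| p.parse().ok()).collect();
    let Some(nums) = nums else { return false };
    let (month, day) = if month_first {
        (nums[0], nums[1])
    } else {
        (nums[1], nums[0])
    };
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// The suffix survives only if EVERY usable sample parses, not just the first.
/// Note: an empty sample list vacuously passes in this sketch.
fn format_survives(samples: &[&str], month_first: bool) -> bool {
    samples.iter().all(|s| parses_as(s, month_first))
}
```

With the samples `01/02/2020` and `13/02/2020`, the first value parses under both layouts, but only the day-first layout survives once every sample is checked.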

…ipping

Address roborev review (job 2346):

MEDIUM - downgrade_all_midnight_datetime_columns used filter_map, which silently dropped any sample that failed to parse with the inferred format. A real frequency value with a non-midnight time but a mismatched format could be dropped, wrongly downgrading the column to date. It now collects into Option<Vec<bool>>: any sample that fails to parse blocks the downgrade and the column stays datetime.

LOW - strip_time_from_format only trimmed whitespace, `T` and `,` before the time specifier, so a format like `%Y-%m-%d-%H:%M:%S` downgraded to `date:%Y-%m-%d-` (trailing separator). The trim set now also covers `-` `/` `.` `:` `_`.

Updated the downgrade test to assert an unparseable sample keeps the column datetime, and added a strip_time_from_format case for the `-` separator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
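
The difference between `filter_map` and collecting into `Option<Vec<bool>>` can be sketched as follows. `all_midnight` and `toy_parse_time` are hypothetical names; the toy parser stands in for a chrono parse with the inferred format, and returning `false` for an empty sample list is an assumption of this example.

```rust
/// Decide whether a datetime column may be downgraded to date.
/// `parse_time` returns (hour, minute, second), or None if the sample
/// does not parse with the inferred format.
fn all_midnight<F>(samples: &[&str], parse_time: F) -> bool
where
    F: Fn(&str) -> Option<(u32, u32, u32)>,
{
    // Collecting Option<bool> items into Option<Vec<bool>> yields None as
    // soon as one sample fails to parse, so a failure blocks the downgrade
    // instead of being silently dropped (as filter_map would do).
    let flags: Option<Vec<bool>> = samples
        .iter()
        .map(|&s| parse_time(s).map(|t| t == (0, 0, 0)))
        .collect();
    match flags {
        Some(f) if !f.is_empty() => f.iter().all(|&is_midnight| is_midnight),
        _ => false, // unparsable sample, or nothing to inspect
    }
}

/// Toy parser: expects `<date> HH:MM:SS` with a 24-hour clock.
fn toy_parse_time(sample: &str) -> Option<(u32, u32, u32)> {
    let (_, time) = sample.split_once(' ')?;
    let mut it = time.split(':');
    let h = it.next()?.parse().ok()?;
    let m = it.next()?.parse().ok()?;
    let s = it.next()?.parse().ok()?;
    Some((h, m, s))
}
```

A column whose samples are all `00:00:00` is eligible for downgrade; one real time or one unparsable value keeps it datetime.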

Address roborev review (job 2347): usable_samples_by_field only excluded rank-0 rows and the <ALL_UNIQUE> sentinel. `frequency --pct-nulls` gives the null row a real (non-zero) rank, so the `(NULL)` row was kept as a usable sample. Since validate_date_formats now requires every sample to parse, a single ranked null row stripped an otherwise-valid date:<fmt> / datetime:<fmt> suffix. usable_samples_by_field now also excludes rows whose value is `(NULL)` (frequency's default --null-text), so both validate_date_formats and downgrade_all_midnight_datetime_columns ignore them. Added regression test validate_date_formats_ignores_ranked_null_rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address roborev review (job 2348): the finding — a (NULL) row ranked by `frequency --pct-nulls` blocking the datetime->date downgrade — shares its root cause with job 2347 and was already fixed in f26d1b0: usable_samples_by_field now excludes (NULL) rows by value, and downgrade_all_midnight_datetime_columns consumes that helper, so a ranked (NULL) row never reaches the strict parse check. Add the downgrade-path regression test the review asked for: an all-midnight datetime column with a valid sample plus a ranked (NULL) row still downgrades to date. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address roborev review (job 2349): usable_samples_by_field excluded the null row by matching the hard-coded `(NULL)` label. That was both too narrow - `describegpt --freq-options` can pass a custom `frequency --null-text`, whose ranked null row (`--pct-nulls`) then slipped through and stripped valid date/datetime suffixes - and too broad - a real data value literally equal to `(NULL)` would be dropped. Identify the null row structurally instead: its `count` equals the column's null count, which `DictionaryEntry.null_count` already carries. usable_samples_by_field now takes the entries and excludes the row whose count matches the field's null count, independent of `--null-text` and `--pct-nulls`. Regression tests updated to set null_count and use a custom null label (`<MISSING>`) to prove label-independence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address roborev review (job 2351): usable_samples_by_field identified the frequency null row by `count == null_count` alone. That is too broad - a real, non-null value whose count merely equals the column's null count would be dropped, which can keep a bad date suffix or wrongly downgrade a datetime column to date. Require BOTH signals `frequency` guarantees for the emitted null row: its `value` equals the configured `--null-text` AND its `count` equals the column's null count. Value alone is too broad (a real datum reading as the null label) and count alone is too broad (a real value sharing the null count); together they confine the exclusion to the genuine null sentinel, even when `frequency --pct-nulls` gives that row a real rank. The configured null-text is threaded through validate_date_formats and downgrade_all_midnight_datetime_columns. A new configured_null_text helper parses `--null-text` out of `--freq-options` (default `(NULL)`). Regression tests added: a real sample sharing the null count is not dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
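
A minimal sketch of the two-signal null-row check described here, using only the standard library. `FreqRow` and `usable_samples` are hypothetical stand-ins for the frequency rows and the `usable_samples_by_field` helper; the field names and the integer rank are assumptions of this example.

```rust
/// Hypothetical shape of one frequency-table row for a single field.
#[derive(Debug, Clone, PartialEq)]
struct FreqRow {
    value: String,
    count: u64,
    rank: u32,
}

/// Keep only rows that are real data samples for date-format validation.
fn usable_samples<'a>(rows: &'a [FreqRow], null_text: &str, null_count: u64) -> Vec<&'a str> {
    rows.iter()
        // Rank-0 rows and the <ALL_UNIQUE> sentinel are never samples.
        .filter(|r| r.rank != 0 && r.value != "<ALL_UNIQUE>")
        // The null row must match BOTH the configured null text and the
        // column's null count; either signal alone is too broad.
        .filter(|r| !(r.value == null_text && r.count == null_count))
        .map(|r| r.value.as_str())
        .collect()
}
```

Requiring both signals means a real value that merely shares the null count, or that merely reads like the null label, is still kept as a sample.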

Address roborev review (job 2352). Two findings, both genuine but very narrow corners whose robust fixes are feature-level:

- Medium: a `file:`-backed frequency CSV generated with a custom `--null-text` - describegpt did not generate that CSV, so it cannot know the label and `configured_null_text` falls back to `(NULL)`.
- Low: a real data value identical to the null label AND sharing the column's null count is indistinguishable from the null sentinel in the CSV - an inherent ambiguity that only a controlled null-text could remove, which would pollute dictionary enumerations/examples.

The existing null-text + null-count check is correct for all normal usage. Rather than add a new CLI flag, document the `file:` limitation explicitly: the `--freq-options` USAGE text now states that a `file:`-backed CSV is assumed to use frequency's default `(NULL)` null text, and the `configured_null_text` doc comment spells out the residual `file:` + custom `--null-text` + `--pct-nulls` gap. Help doc regenerated from the updated USAGE text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
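
The fallback behaviour can be sketched with a small std-only parser. `null_text_from_freq_options` is a hypothetical name, not the PR's `configured_null_text`; the whitespace-based splitting (no shell-style quoting), the `--null-text=value` form, and detecting a `file:` source by prefix are all assumptions of this example.

```rust
/// frequency's default null label, as stated in the commit message above.
const DEFAULT_NULL_TEXT: &str = "(NULL)";

/// Hypothetical sketch: recover the null label from a `--freq-options` string.
fn null_text_from_freq_options(freq_options: &str) -> String {
    // A `file:`-backed CSV was not generated by describegpt, so its label
    // is unknowable here; assume the default.
    if freq_options.trim_start().starts_with("file:") {
        return DEFAULT_NULL_TEXT.to_string();
    }
    let mut tokens = freq_options.split_whitespace();
    while let Some(tok) = tokens.next() {
        // Form 1: `--null-text=<label>`
        if let Some(v) = tok.strip_prefix("--null-text=") {
            return v.to_string();
        }
        // Form 2: `--null-text <label>`
        if tok == "--null-text" {
            if let Some(v) = tokens.next() {
                return v.to_string();
            }
        }
    }
    DEFAULT_NULL_TEXT.to_string()
}
```

This makes the documented gap visible: for a `file:` source the function cannot do better than `(NULL)`, whatever label the CSV was actually generated with.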

Copilot

Pull request overview

Adds first-class temporal content-type support to describegpt --infer-content-type by introducing deterministic date/datetime tokens that can carry an LLM-inferred chrono strftime suffix (e.g. datetime:%m/%d/%Y %I:%M:%S %p), validates the suffix against real frequency samples, and threads the validated format into qsv synthesize so generated outputs match the source on-disk representation instead of defaulting to ISO/RFC3339.

Changes:

Extend describegpt dictionary content-type vocabulary with date/datetime plus :<fmt> suffix handling, including merge rules and sample-based validation/downgrade logic.
Teach synthesize to parse and apply inferred date/datetime output formats while keeping date/datetime generation type-based (not faker-based).
Add unit + integration tests and update default prompts/docs to document the new suffix forms and known file: limitations around null-text.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
tests/test_synthesize.rs	Adds an integration test asserting synthesize applies inferred date formatting from a dictionary.
src/cmd/synthesize/generator.rs	Threads an optional inferred strftime format into the date/datetime quantile generator and applies it at render time.
src/cmd/synthesize/faker_map.rs	Adds `date`/`datetime` as non-faker tokens and implements `parse_date_format` to extract/validate the `:<fmt>` suffix.
src/cmd/describegpt/dictionary.rs	Adds `date`/`datetime` tokens to the vocab, normalization/merge rules, format validation against frequency samples, and all-midnight datetime→date downgrade.
src/cmd/describegpt.rs	Wires date-format validation/downgrade into the dictionary build pipeline and adds `--freq-options` null-text parsing (with documented `file:` limitation).
resources/describegpt_defaults.toml	Updates default prompts to instruct the LLM on `date:<fmt>` / `datetime:<fmt>` suffix usage and constraints.
docs/help/describegpt.md	Updates generated help to document the `file:` + custom `--null-text` limitation for date/datetime format validation.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

jqnatividad and others added 2 commits May 21, 2026 18:27

jqnatividad and others added 8 commits May 21, 2026 18:52

typo: unparseable->unparsable

53d82be

jqnatividad requested a review from Copilot May 22, 2026 00:40

Copilot started reviewing on behalf of jqnatividad May 22, 2026 00:40

Copilot AI reviewed May 22, 2026


Comment thread tests/test_synthesize.rs Outdated

tests(synthesize): tighter stdout handling per Copilot review

0a89ae0

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

jqnatividad merged commit 688aba8 into master May 22, 2026
18 checks passed

jqnatividad deleted the feat/describegpt-date-content-type branch May 22, 2026 01:02

jqnatividad merged 11 commits into master from feat/describegpt-date-content-type
