feat(describegpt): date/datetime content-type tokens with LLM-inferred chrono format #3884
…d chrono format

describegpt --infer-content-type previously classified date/datetime columns as "unknown" because CONTENT_TYPE_VOCAB had no date token. Add `date` and `datetime` content-type tokens that carry an LLM-inferred chrono strftime format as a suffix (e.g. datetime:%m/%d/%Y %I:%M:%S %p), reusing the existing duration:N suffix machinery:
- The bare date/datetime token is stamped deterministically from the stats Type column (mirroring the unique_id stamp); the LLM supplies only the <fmt> suffix.
- normalize_datetime_token syntactically validates the strftime suffix (chrono StrftimeItems); validate_date_formats semantically validates it against real frequency-distribution samples, stripping a format that does not parse the data.
- merge_content_type lets the LLM contribute only the format suffix to a code-stamped date column; it cannot reclassify a column or add a date token to a non-date column.
- synthesize consumes the inferred format: parse_date_format extracts it and build_date / DateQuantile emit generated dates in the original on-disk format instead of the hardcoded ISO output.
- First-pass and refine prompts (and the mirrored const) updated to instruct the LLM on the date:<fmt> / datetime:<fmt> forms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
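The merge rule in this commit can be illustrated with a minimal, std-only sketch. It is not the code from `dictionary.rs`: the signature, the `temporal_stamp` parameter, and the `"unknown"` fallback for a rejected date token are assumptions made for illustration. It also assumes a suffix is accepted only when the LLM's bare token matches the stamp.

```rust
/// Hypothetical stand-in for the merge rule described above.
/// `temporal_stamp` is the code-stamped `date`/`datetime` token, if any.
fn merge_content_type(temporal_stamp: Option<&str>, llm: &str) -> String {
    let is_temporal = |t: &str| t == "date" || t == "datetime";
    // Split the LLM answer into a bare token and an optional non-empty suffix.
    let (llm_base, llm_fmt) = match llm.split_once(':') {
        Some((base, fmt)) if !fmt.is_empty() => (base, Some(fmt)),
        Some((base, _)) => (base, None),
        None => (llm, None),
    };
    match temporal_stamp {
        // Stamped column: the stamp always wins; the LLM may only add a suffix
        // (assumption: and only when it agrees on the bare token).
        Some(token) => match llm_fmt {
            Some(fmt) if llm_base == token => format!("{token}:{fmt}"),
            _ => token.to_string(),
        },
        // Unstamped column: the LLM cannot introduce a date token
        // (assumption: the rejected answer becomes "unknown").
        None if is_temporal(llm_base) => "unknown".to_string(),
        // Anything else, including other suffixed tokens, passes through.
        None => llm.to_string(),
    }
}
```

Splitting on the first `:` only matters here because strftime formats such as `%I:%M:%S` contain colons of their own.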
qsv stats types a column DateTime whenever any single value carries a non-midnight time, which over-reports columns whose values are plainly dates stored with a zero time-of-day (e.g. NYC 311 "Created Date", where every frequency-sampled value is "MM/DD/YYYY 12:00:00 AM"). Add downgrade_all_midnight_datetime_columns, a deterministic post-merge step run after validate_date_formats: when every parseable frequency sample of a datetime:<fmt> column falls exactly on midnight, the column is reclassified as date and strip_time_from_format drops the time specifiers from <fmt> (datetime:%m/%d/%Y %I:%M:%S %p becomes date:%m/%d/%Y). Unparseable samples (e.g. NULL sentinels) are ignored rather than blocking the downgrade; a column with a real time anywhere in its frequency sample (e.g. "Due Date") stays datetime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 23 |
Address roborev review (job 2345): validate_date_formats checked the LLM-inferred date/datetime strftime suffix against only the first usable frequency value, so an ambiguous format could survive incorrectly - `%m/%d/%Y` parses a first sample `01/02/2020` even when a later sample `13/02/2020` proves the column is actually `%d/%m/%Y`, making synthesize emit dates in the wrong layout. Now every usable frequency sample must parse with the inferred format; if any fails, the suffix is stripped back to the bare token. Extract a shared `usable_samples_by_field` helper (also used by `downgrade_all_midnight_datetime_columns`) and add a regression test where the first sample is ambiguous but a later one disambiguates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
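The ambiguity described in this commit can be reproduced with a small std-only sketch. The real check relies on chrono; `parses_with` below is a toy parser for a three-specifier subset (`%d`, `%m`, `%Y`, fixed width), and `validate_format` is a hypothetical simplification of `validate_date_formats`, so both names and signatures are assumptions.

```rust
/// Toy parser for a strftime subset (`%d`, `%m`, `%Y`, plus literals).
/// Returns true when `value` matches `fmt` with an in-range month and day.
fn parses_with(fmt: &str, value: &str) -> bool {
    let mut rest = value;
    let mut spec = fmt.chars();
    while let Some(c) = spec.next() {
        if c != '%' {
            // Literal character: must match the next input character exactly.
            match rest.strip_prefix(c) {
                Some(r) => rest = r,
                None => return false,
            }
            continue;
        }
        let (width, max) = match spec.next() {
            Some('d') => (2, 31),
            Some('m') => (2, 12),
            Some('Y') => (4, 9999),
            _ => return false, // specifier not supported by this sketch
        };
        if rest.len() < width || !rest.is_char_boundary(width) {
            return false;
        }
        let (digits, tail) = rest.split_at(width);
        if !digits.bytes().all(|b| b.is_ascii_digit()) {
            return false;
        }
        match digits.parse::<u32>() {
            Ok(n) if n >= 1 && n <= max => rest = tail,
            _ => return false,
        }
    }
    rest.is_empty()
}

/// Keep the suffix only if every sample parses; otherwise return the bare token.
fn validate_format<'a>(token: &'a str, samples: &[&str]) -> &'a str {
    match token.split_once(':') {
        Some((base, fmt)) if !samples.iter().all(|s| parses_with(fmt, s)) => base,
        _ => token,
    }
}
```

With the samples `01/02/2020` and `13/02/2020`, `%m/%d/%Y` accepts the first but rejects the second (month 13), so the suffix is stripped, whereas `%d/%m/%Y` accepts both and survives.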
…ipping Address roborev review (job 2346): MEDIUM - downgrade_all_midnight_datetime_columns used filter_map, which silently dropped any sample that failed to parse with the inferred format. A real frequency value with a non-midnight time but a mismatched format could be dropped, wrongly downgrading the column to date. It now collects into Option<Vec<bool>>: any sample that fails to parse blocks the downgrade and the column stays datetime. LOW - strip_time_from_format only trimmed whitespace, `T` and `,` before the time specifier, so a format like `%Y-%m-%d-%H:%M:%S` downgraded to `date:%Y-%m-%d-` (trailing separator). The trim set now also covers `-` `/` `.` `:` `_`. Updated the downgrade test to assert an unparseable sample keeps the column datetime, and added a strip_time_from_format case for the `-` separator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
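A rough, std-only sketch of the format-stripping behavior described here. The list of time specifiers is an assumed subset, padding modifiers such as `%-H` are not handled, and the sketch assumes the date part comes before the time part. The function body is illustrative, not the code in `dictionary.rs`.

```rust
/// Sketch: cut a strftime format at its first time-of-day specifier and
/// trim any separator left dangling at the end.
fn strip_time_from_format(fmt: &str) -> String {
    // Time-of-day specifiers recognised by this sketch (an assumed subset).
    const TIME_SPECS: &[char] = &['H', 'I', 'M', 'S', 'p', 'P', 'T', 'R', 'r', 'X'];
    let mut cut = fmt.len();
    let mut chars = fmt.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c == '%' {
            if let Some(&(_, next)) = chars.peek() {
                if TIME_SPECS.contains(&next) {
                    cut = i; // everything from this `%` onward is dropped
                    break;
                }
                chars.next(); // consume the specifier (also skips "%%")
            }
        }
    }
    // Trim whitespace plus the separator set named in the commit message.
    fmt[..cut]
        .trim_end_matches(|c: char| c.is_whitespace() || "T,-/.:_".contains(c))
        .to_string()
}
```

Trimming the wider separator set is what turns `%Y-%m-%d-%H:%M:%S` into `%Y-%m-%d` rather than `%Y-%m-%d-`.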
Address roborev review (job 2347): usable_samples_by_field only excluded rank-0 rows and the <ALL_UNIQUE> sentinel. `frequency --pct-nulls` gives the null row a real (non-zero) rank, so the `(NULL)` row was kept as a usable sample. Since validate_date_formats now requires every sample to parse, a single ranked null row stripped an otherwise-valid date:<fmt> / datetime:<fmt> suffix. usable_samples_by_field now also excludes rows whose value is `(NULL)` (frequency's default --null-text), so both validate_date_formats and downgrade_all_midnight_datetime_columns ignore them. Added regression test validate_date_formats_ignores_ranked_null_rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address roborev review (job 2348): the finding — a (NULL) row ranked by `frequency --pct-nulls` blocking the datetime->date downgrade — shares its root cause with job 2347 and was already fixed in f26d1b0: usable_samples_by_field now excludes (NULL) rows by value, and downgrade_all_midnight_datetime_columns consumes that helper, so a ranked (NULL) row never reaches the strict parse check. Add the downgrade-path regression test the review asked for: an all-midnight datetime column with a valid sample plus a ranked (NULL) row still downgrades to date. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address roborev review (job 2349): usable_samples_by_field excluded the null row by matching the hard-coded `(NULL)` label. That was both too narrow - `describegpt --freq-options` can pass a custom `frequency --null-text`, whose ranked null row (`--pct-nulls`) then slipped through and stripped valid date/datetime suffixes - and too broad - a real data value literally equal to `(NULL)` would be dropped. Identify the null row structurally instead: its `count` equals the column's null count, which `DictionaryEntry.null_count` already carries. usable_samples_by_field now takes the entries and excludes the row whose count matches the field's null count, independent of `--null-text` and `--pct-nulls`. Regression tests updated to set null_count and use a custom null label (`<MISSING>`) to prove label-independence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address roborev review (job 2351): usable_samples_by_field identified the frequency null row by `count == null_count` alone. That is too broad - a real, non-null value whose count merely equals the column's null count would be dropped, which can keep a bad date suffix or wrongly downgrade a datetime column to date. Require BOTH signals `frequency` guarantees for the emitted null row: its `value` equals the configured `--null-text` AND its `count` equals the column's null count. Value alone is too broad (a real datum reading as the null label) and count alone is too broad (a real value sharing the null count); together they confine the exclusion to the genuine null sentinel, even when `frequency --pct-nulls` gives that row a real rank. The configured null-text is threaded through validate_date_formats and downgrade_all_midnight_datetime_columns. A new configured_null_text helper parses `--null-text` out of `--freq-options` (default `(NULL)`). Regression tests added: a real sample sharing the null count is not dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
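The two-signal null-row check can be sketched as follows. `FreqRow` and `usable_samples` are hypothetical simplifications. The actual helper works on the dictionary entries and also excludes rank-0 rows and the `<ALL_UNIQUE>` sentinel, both omitted here.

```rust
/// Hypothetical, simplified shape of one row of frequency output.
struct FreqRow<'a> {
    value: &'a str,
    count: u64,
}

/// Sketch: keep every row except the genuine null sentinel, which must match
/// BOTH the configured null text AND the column's null count.
fn usable_samples<'a>(rows: &[FreqRow<'a>], null_text: &str, null_count: u64) -> Vec<&'a str> {
    rows.iter()
        .filter(|r| !(r.value == null_text && r.count == null_count))
        .map(|r| r.value)
        .collect()
}
```

With a custom label `<MISSING>` and a null count of 3, a real value whose count also happens to be 3 is kept, and so is a literal `(NULL)` datum with a different count. Only the row matching both signals is dropped.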
Address roborev review (job 2352). Two findings, both genuine but very narrow corners whose robust fixes are feature-level: - Medium: a `file:`-backed frequency CSV generated with a custom `--null-text` - describegpt did not generate that CSV, so it cannot know the label and `configured_null_text` falls back to `(NULL)`. - Low: a real data value identical to the null label AND sharing the column's null count is indistinguishable from the null sentinel in the CSV - an inherent ambiguity that only a controlled null-text could remove, which would pollute dictionary enumerations/examples. The existing null-text + null-count check is correct for all normal usage. Rather than add a new CLI flag, document the `file:` limitation explicitly: the `--freq-options` USAGE text now states that a `file:`-backed CSV is assumed to use frequency's default `(NULL)` null text, and the `configured_null_text` doc comment spells out the residual `file:` + custom `--null-text` + `--pct-nulls` gap. Help doc regenerated from the updated USAGE text. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
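A minimal sketch of the null-text lookup and its `file:` fallback. It assumes whitespace-separated options and accepts both `--null-text VALUE` and `--null-text=VALUE`. Quoted values containing spaces are not handled, and the real `configured_null_text` may parse differently.

```rust
/// Sketch: pull the `--null-text` value out of a `--freq-options` string,
/// defaulting to frequency's `(NULL)` when it is absent.
fn configured_null_text(freq_options: &str) -> String {
    let mut args = freq_options.split_whitespace();
    while let Some(arg) = args.next() {
        // `--null-text=VALUE` form.
        if let Some(v) = arg.strip_prefix("--null-text=") {
            return v.to_string();
        }
        // `--null-text VALUE` form.
        if arg == "--null-text" {
            if let Some(v) = args.next() {
                return v.to_string();
            }
        }
    }
    // No flag found, e.g. a `file:`-backed CSV whose label cannot be known.
    "(NULL)".to_string()
}
```

The last branch is the documented limitation: a `file:` value carries no options to inspect, so the default label is assumed.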
Pull request overview
Adds first-class temporal content-type support to describegpt --infer-content-type by introducing deterministic date/datetime tokens that can carry an LLM-inferred chrono strftime suffix (e.g. datetime:%m/%d/%Y %I:%M:%S %p), validates the suffix against real frequency samples, and threads the validated format into qsv synthesize so generated outputs match the source on-disk representation instead of defaulting to ISO/RFC3339.
Changes:
- Extend describegpt dictionary content-type vocabulary with `date`/`datetime` plus `:<fmt>` suffix handling, including merge rules and sample-based validation/downgrade logic.
- Teach synthesize to parse and apply inferred date/datetime output formats while keeping date/datetime generation type-based (not faker-based).
- Add unit + integration tests and update default prompts/docs to document the new suffix forms and known `file:` limitations around null-text.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_synthesize.rs | Adds an integration test asserting synthesize applies inferred date formatting from a dictionary. |
| src/cmd/synthesize/generator.rs | Threads an optional inferred strftime format into the date/datetime quantile generator and applies it at render time. |
| src/cmd/synthesize/faker_map.rs | Adds date/datetime as non-faker tokens and implements parse_date_format to extract/validate the :<fmt> suffix. |
| src/cmd/describegpt/dictionary.rs | Adds date/datetime tokens to the vocab, normalization/merge rules, format validation against frequency samples, and all-midnight datetime→date downgrade. |
| src/cmd/describegpt.rs | Wires date-format validation/downgrade into the dictionary build pipeline and adds --freq-options null-text parsing (with documented file: limitation). |
| resources/describegpt_defaults.toml | Updates default prompts to instruct the LLM on date:<fmt> / datetime:<fmt> suffix usage and constraints. |
| docs/help/describegpt.md | Updates generated help to document the file: + custom --null-text limitation for date/datetime format validation. |
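The suffix extraction listed for `faker_map.rs` could look roughly like the sketch below. The `Temporal` enum and the return type are assumptions for illustration, and the strftime validation that the real `parse_date_format` performs is omitted. A `None` format stands for "use the default output".

```rust
/// Hypothetical result type: which temporal kind, and the format if one was given.
#[derive(Debug, PartialEq)]
enum Temporal<'a> {
    Date(Option<&'a str>),
    DateTime(Option<&'a str>),
}

/// Sketch: split a content-type token into its kind and optional strftime suffix.
/// Returns `None` for tokens that are not `date` / `datetime`.
fn parse_date_format(token: &str) -> Option<Temporal<'_>> {
    // Split on the FIRST colon only: the format itself may contain colons.
    let (base, fmt) = match token.split_once(':') {
        Some((base, fmt)) if !fmt.is_empty() => (base, Some(fmt)),
        Some((base, _)) => (base, None),
        None => (token, None),
    };
    match base {
        "date" => Some(Temporal::Date(fmt)),
        "datetime" => Some(Temporal::DateTime(fmt)),
        _ => None,
    }
}
```

A caller would render with the carried format when present and fall back to its existing default otherwise.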
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Summary
`describegpt --infer-content-type` previously classified every temporal column as `content_type = "unknown"`: `CONTENT_TYPE_VOCAB` had no `date`/`datetime` token, so the LLM (told to use only allowed tokens) correctly fell back to `unknown`. A data dictionary labeling a `DateTime` column "unknown" reads as a tooling failure.

This PR adds `date` and `datetime` content-type tokens that carry an LLM-inferred chrono strftime format as a suffix (e.g. `datetime:%m/%d/%Y %I:%M:%S %p`), reusing the existing `duration:N` suffix machinery, and wires the format into `qsv synthesize`.

How it works
- The bare `date`/`datetime` token is stamped deterministically from the stats `Type` column (mirroring the `unique_id` stamp). The LLM only supplies the `:<fmt>` suffix, inferred from the raw frequency-distribution values.
- `normalize_datetime_token` syntactically validates the strftime suffix (chrono `StrftimeItems`); `validate_date_formats` semantically validates it against the real frequency samples and strips a format that doesn't parse the data.
- `qsv stats` types a column `DateTime` whenever any value carries a non-midnight time, which over-reports columns that are plainly dates stored with a zero time-of-day (e.g. NYC 311 `Created Date`, where every sampled value is `MM/DD/YYYY 12:00:00 AM`). `downgrade_all_midnight_datetime_columns` reclassifies such a column to `date` and strips the time specifiers from `<fmt>` (`datetime:%m/%d/%Y %I:%M:%S %p` → `date:%m/%d/%Y`). A column with a real time anywhere in its sample (e.g. `Due Date`) stays `datetime`.
- `parse_date_format` extracts the format; `build_date` / `DateQuantile` / `next()` emit generated dates in the inferred on-disk format instead of hardcoded ISO. Bare tokens fall back to `%Y-%m-%d` / RFC 3339.

Example
For NYC 311 data:
- `date:%m/%d/%Y`
- `date:%m/%d/%Y`
- `datetime:%m/%d/%Y %I:%M:%S %p`

Testing
- `cargo build --bin qsv -F all_features`: clean; `cargo clippy`: no new warnings.
- Unit tests in `dictionary.rs` (normalize/stamp/merge/validate/downgrade/`strip_time_from_format`) and `faker_map.rs` (`parse_date_format`, `is_faker_token`).
- Integration test `synthesize_with_dictionary_applies_inferred_date_format`.
- Prompts (and the mirrored `DEFAULT_DICTIONARY_REFINE_PROMPT` const) are updated; the `default_dictionary_refine_prompt_matches_resource` sync test passes.

🤖 Generated with Claude Code